Using the New York Times API to Scrape Metadata

Last week, I wrote an introduction to scraping web pages to collect metadata, mentioning that it’s not possible to scrape the New York Times site. The Times paywall blocks your attempts to gather basic metadata. But there is a way around this using the New York Times API.

Using the New York Times API to Scrape Metadata

Using the New York Times API to Scrape Metadata

Recently I began building a community site on top of the Yii platform, which I wrote about in Programming With Yii2: Building Community with Comments, Sharing and Voting (СodeHolder Tuts+). I wanted to make it easy to add links related to content on the site. While it’s easy for people to paste URLs into forms, it becomes time-consuming to also provide title and source information.

So in today’s tutorial, I’m going to expand the scraping code I wrote recently to leverage the New York Times API to gather headlines when Times links are added.

Remember, I participate in the comment threads below, so tell me what you think! You can also reach me on Twitter @lookahead_io.

Getting Started

Sign Up for an API Key

Using the New York Times API to Scrape Metadata

First, let’s sign up to request an API Key:

Using the New York Times API to Scrape Metadata

After you submit the form, you’ll receive your key in an email:

Using the New York Times API to Scrape Metadata

Exploring the New York Times API

Using the New York Times API to Scrape Metadata

The Times offers APIs in the following categories:

  • Archive
  • Article Search
  • Books
  • Community
  • Geographic
  • Most Popular
  • Movie Reviews
  • Semantic
  • Times Newswire
  • TimesTags
  • Top Stories

It’s a lot. And, from the Gallery page, you can click on any topic to see the individual API category documentation:

Using the New York Times API to Scrape Metadata

The Times uses LucyBot to power their API docs, and there is a helpful FAQ:

Using the New York Times API to Scrape Metadata

They even show you how to quickly get your API usage limits (you’ll need to plug in your key):

 curl --head 
   https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=<your-api-key>
    2>/dev/null | grep -i "X-RateLimit"
    X-RateLimit-Limit-day: 1000
    X-RateLimit-Limit-second: 5
    X-RateLimit-Remaining-day: 180
    X-RateLimit-Remaining-second: 5

I initially struggled to make sense of the documentation—it’s a parameter-based specification, not a programming guide. However, I posted some questions as issues to the New York Times API GitHub page, and they were quickly and helpfully answered.

Working With Article Search

For today’s episode, I’m going to focus on using the NY Times Article Search. Basically, we’ll extend the Create Link form from the last tutorial:

Using the New York Times API to Scrape Metadata

When the user clicks Lookup, we’ll make an ajax request through to Link::grab($url). Here’s the jQuery:

$(document).on("click", '[id=lookup]', function(event) {
  $.ajax({
     url: $('#url_prefix').val()+'/link/grab',
     data: {url:   $('#url').val()},
     success: function(data) {
       $('#title').val(data);
       return true;
     }
  });
});

Here’s the controller and model method:

// Controller call via AJAX Lookup request
public static function actionGrab($url) {
  Yii::$app->response->format = Response::FORMAT_JSON;
  return Link::grab($url);
}
...
// Link::grab() method
public static function grab($url) {
  //clean up url for hostname
  $source_url = parse_url($url);
  $source_url = $source_url['host'];  
  $source_url=str_ireplace('www.','',$source_url);
  $source_url = trim($source_url,' \');
  // use the NYT API when hostname == nytimes.com 
  if ($source_url=='nytimes.com') {
   ...

Next, let’s use our API key to make an article search request:

    $nytKey=Yii::$app->params['nytapi'];    
    $curl_dest = 'http://api.nytimes.com
        /svc/search/v2/articlesearch.json?fl=headline&fq=web_url:%22'.
        $url.'%22&api-key='.$nytKey;
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL,$curl_dest);
    $result = json_decode(curl_exec($curl));
    $title = $result->response->docs[0]->headline->main;
  } else {
    // not NYT, use the standard metatag scraper from last episode
         ...
    }
  }
  return $title;
}

And it works quite easily—here’s the resulting headline (by the way, climate change is killing Polar Bears and we should care):

Using the New York Times API to Scrape Metadata

If you want more details from your API request, just add additional arguments to the ?fl=headline request such as keywords and lead_paragraph:

Yii::$app->response->format = Response::FORMAT_JSON;
$nytKey=Yii::$app->params['nytapi'];
$curl_dest = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?'.
  'fl=headline,keywords,lead_paragraph&fq=web_url:%22'.$url.'%22&api-key='.$nytKey;
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL,$curl_dest);
$result = json_decode(curl_exec($curl));
var_dump($result);

Here’s the result:

Using the New York Times API to Scrape Metadata

Perhaps I’ll write a PHP library to better parse the NYT API in coming episodes, but this code breaks out the keywords and the lead paragraph:

Yii::$app->response->format = Response::FORMAT_JSON;
$nytKey=Yii::$app->params['nytapi'];
$curl_dest = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?'.
  'fl=headline,keywords,lead_paragraph&fq=web_url:%22'.$url.'%22&api-key='.$nytKey;
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL,$curl_dest);
$result = json_decode(curl_exec($curl));
echo $result->response->docs[0]->headline->main.'<br />'.'<br />';
echo $result->response->docs[0]->lead_paragraph.'<br />'.'<br />';
foreach ($result->response->docs[0]->keywords as $k) {
  echo $k->value.'<br/>';
}

Here’s what it shows for this article:

Polar Bears’ Path to Decline Runs Through Alaskan Village

The bears that come here are climate refugees, on land because
the sea ice they rely on for hunting seals is receding.

Polar Bears
Greenhouse Gas Emissions
Alaska
Global Warming
Endangered and Extinct Species
International Union for Conservation of Nature
National Snow and Ice Data Center
Polar Bears International
United States Geological Survey

Hopefully that starts to expand your imagination about how to use these APIs. It’s pretty exciting what may now be possible.

In Closing

The New York Times API is very useful, and I’m glad to see them offering it to the developer community. It was also refreshing to get such quick API support via GitHub—I just didn’t expect this. Keep in mind that it’s intended for non-commercial projects. If you have some money-making idea, send them a note to see if they’ll work with you. Publishers are eager for new sources of revenue.

I hope you found these web scraping episodes helpful and put them to use in your projects. If you’d like to see today’s episode in action, you can try out some of the web scraping on my site, Active Together.

Please do share any thoughts and feedback in the comments. You can also always reach me on Twitter @lookahead_io directly. And be sure to check out my instructor page and other series, Building Your Startup With PHP and Programming With Yii2.

Related Links