Indexing NY Times Articles in Elasticsearch

Usage

Start a 3-node ES 7.x cluster:

$ docker compose -f deployments/es7-3n-compose.yml up -d

Fetch NY Times articles from their public API, starting from Dec 2022:

$ go run cmd/fetch/main.go --year 2022 --month 12

Fetching URL: https://api.nytimes.com/svc/archive/v1/2022/12.json?api-key=
Wrote 19477556 / 19477556 compressed bytes to file data/articles-2022-12.json.gz
Sleeping for 6 sec to avoid rate limit ..
Fetching URL: https://api.nytimes.com/svc/archive/v1/2023/1.json?api-key=
Wrote 18589680 / 18589680 compressed bytes to file data/articles-2023-1.json.gz
Sleeping for 6 sec to avoid rate limit ..
Fetching URL: https://api.nytimes.com/svc/archive/v1/2023/2.json?api-key=
Wrote 15873102 / 15873102 compressed bytes to file data/articles-2023-2.json.gz
We're at 2023-2 - we're all caught up in time! Donezo!

# You now have a bunch of *.json.gz files full of articles in ./data

Index NY Times articles in ES:

$ go run cmd/load/main.go --create-index --start-from 2022-12

> HEAD http://localhost:9200/?error_trace=true
Pinged ES successfully
> DELETE http://localhost:9200/nytimes-articles?ignore_unavailable=true&pretty=true
Deleted existing index `nytimes-articles` (status: 200)
Created new index `nytimes-articles` (status: 200)
Reading dir: data/
Indexed 9880 articles total
Done. Read 3 files in 2.013946478s
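
One common way to index many documents at once is the Elasticsearch _bulk endpoint, which takes newline-delimited JSON: an action line followed by a source line for each document. The sketch below only builds such a request body. Whether cmd/load uses _bulk is an assumption, and the helper name bulkBody is hypothetical. The "_id" key is moved into the action line because Elasticsearch treats _id as a metadata field rather than part of the document source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// bulkBody renders documents as a _bulk request body: one action line and
// one source line per document, each terminated by a newline.
func bulkBody(index string, docs []map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after every value
	for _, doc := range docs {
		meta := map[string]any{"_index": index}
		src := make(map[string]any, len(doc))
		for k, v := range doc {
			if k == "_id" {
				meta["_id"] = v // the ID travels in the action line, not the source
				continue
			}
			src[k] = v
		}
		if err := enc.Encode(map[string]any{"index": meta}); err != nil {
			return nil, err
		}
		if err := enc.Encode(src); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	docs := []map[string]any{
		{"_id": "nyt://article/ab920e95-4eb9-561c-84c8-ecb5286d3766", "abstract": "On the pleasures and pains of joining up with other people."},
	}
	body, err := bulkBody("nytimes-articles", docs)
	if err != nil {
		panic(err)
	}
	// The body would be sent as a POST to /_bulk with
	// Content-Type: application/x-ndjson.
	fmt.Print(string(body))
}
```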

Run a benchmark against the new ES index:

# Query command usage:
$ go run cmd/query/main.go --help

Usage of /cmd/query/main:
      --cache          enable search engine caching (default true)
      --count int      number of calls to search engine (default 10)
      --dump           dump search engine result of first query
      --index string   search engine index name (default "nytimes-articles")
      --query string   query to run (path to JSON file) (default "./assets/mappings/nytimes/query-simple.json")
      --threads int    number of threads to run benchmark in concurrently (default 10)

# Run a simple benchmark: 10 iterations on each of 10 threads (100 requests in total):
$ go run cmd/query/main.go

> HEAD http://localhost:9200/?error_trace=true
Pinged ES successfully
Query payload:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "headline": "President"
          }
        }
      ]
    }
  }
}

> GET http://localhost:9201/_stats
Completed first request in 98.946383ms
[Thread 10] running benchmark, 10 iterations
[Thread 02] running benchmark, 10 iterations
[Thread 01] running benchmark, 10 iterations
[Thread 03] running benchmark, 10 iterations
[Thread 08] running benchmark, 10 iterations
[Thread 06] running benchmark, 10 iterations
[Thread 05] running benchmark, 10 iterations
[Thread 07] running benchmark, 10 iterations
[Thread 09] running benchmark, 10 iterations
[Thread 04] running benchmark, 10 iterations
[Thread 10] completed 10 requests in 188.858577ms (52.9 req/s) | min: 4.937454ms / max: 80.216712ms / mean: 8.889573ms
[Thread 06] completed 10 requests in 194.420731ms (51.4 req/s) | min: 5.151222ms / max: 72.469707ms / mean: 7.086113ms
[Thread 05] completed 10 requests in 213.133434ms (46.9 req/s) | min: 3.867028ms / max: 138.370538ms / mean: 7.039386ms
[Thread 08] completed 10 requests in 215.596919ms (46.4 req/s) | min: 4.45052ms / max: 77.41287ms / mean: 8.158992ms
[Thread 01] completed 10 requests in 216.713314ms (46.1 req/s) | min: 5.253569ms / max: 110.809016ms / mean: 8.763341ms
[Thread 04] completed 10 requests in 216.318873ms (46.2 req/s) | min: 3.782726ms / max: 141.199454ms / mean: 7.196361ms
[Thread 09] completed 10 requests in 219.484012ms (45.6 req/s) | min: 4.46494ms / max: 108.46021ms / mean: 11.587092ms
[Thread 02] completed 10 requests in 230.458599ms (43.4 req/s) | min: 3.377986ms / max: 155.998477ms / mean: 6.732632ms
[Thread 07] completed 10 requests in 236.990408ms (42.2 req/s) | min: 4.508526ms / max: 150.563158ms / mean: 8.163512ms
[Thread 03] completed 10 requests in 240.853723ms (41.5 req/s) | min: 3.70519ms / max: 156.083115ms / mean: 7.352016ms
Done. Completed 100 requests total in 241.141037ms (414.7 req/s)
Query cache   : +0   hits / +0   miss
Request cache : +98  hits / +3   miss

Data

The dataset consists of all NY Times articles since January 1852, fetched from https://developer.nytimes.com/apis. A typical article looks as follows:

{
  "_id": "nyt://article/ab920e95-4eb9-561c-84c8-ecb5286d3766",
  "abstract": "On the pleasures and pains of joining up with other people after a long, quiet time in the Covid doldrums.",
  "byline": {
    "organization": "",
    "original": "By Andy Miller",
    "person": [
      {
        "firstname": "Andy",
        "lastname": "Miller"
      }
    ]
  },
  "headline": {
    "main": "Hello, World. It’s Been a While.",
    "print_headline": "Trying to Chug Along but Going Off the Rails"
  },
  "keywords": [
    {
      "name": "subject",
      "rank": 1,
      "value": "Books and Literature"
    },
    {
      "name": "organizations",
      "rank": 2,
      "value": "Arsenal (Soccer Team)"
    },
    {
      "name": "persons",
      "rank": 3,
      "value": "Wilde, Oscar"
    },
    {
      "name": "persons",
      "rank": 4,
      "value": "Everett, Rupert"
    },
    {
      "name": "glocations",
      "rank": 5,
      "value": "Great Britain"
    },
    {
      "name": "subject",
      "rank": 6,
      "value": "Coronavirus (2019-nCoV)"
    }
  ],
  "lead_paragraph": "I am traveling on a train, reading a book, glad to be alive.",
  "multimedia": [
    {
      "url": "images/2022/08/07/fashion/31EPISODE-MILLER/31EPISODE-MILLER-articleLarge.jpg",
      "width": 600,
      "height": 600,
      "subType": "xlarge"
    },
    {
      "url": "images/2022/08/07/fashion/31EPISODE-MILLER/31EPISODE-MILLER-thumbStandard.jpg",
      "width": 75,
      "height": 75,
      "subType": "thumbnail"
    },
    {
      "url": "images/2022/08/07/fashion/31EPISODE-MILLER/31EPISODE-MILLER-thumbLarge.jpg",
      "width": 150,
      "height": 150,
      "subType": "thumbLarge"
    }
  ],
  "pub_date": "2022-08-01T00:00:09+0000"
}

Generate data based on NY Times articles

# Generate 10 sentences with max 12 words each based on 10,000 NY Times articles.
# The data follows the natural word distribution of the articles and uses a
# "word graph" to ensure that words follow each other naturally, i.e. that a
# word can naturally occur after another (as opposed to a completely random word).
#
$ go run cmd/datagen/main.go --dir ./data --start-from articles-2022 --max-docs 10_000 --num 10 --max-words 12

navajo nation had announced on the nonprofit organization pen lope cruz by a
criminally responsible by brands are so much of closure in case that lets
nocturna’ review do visitors had provided the shop is there are cops who
kurson kushner . but when amina begum stares at more daunting final challenge
wanes after the jan . when the annual state of forbes avenue raised
badder androids dream after the texas spike in this newsletter i speak and
parton eminem planned meeting with black skinhead at the facebook illumina now there
wants. . the winter . the nose the 1 million followers who attempted
symonds . bp as ambitious ambassadors of the breach in the day buffeted
cream-free creamy minty allure offering developed more leverage his front lines . said

Contributors

anrid
