News Collector

Fetch top news from AP and NewsAPI into Elasticsearch. Optionally, enrich the data with entity and location information extracted via the Rosette API, and archive the resulting Elasticsearch-indexed data to S3.

The three main classes are:

  • NewsCollector: does the news collection
  • EntityMomentum: does the entity momentum calculation described below
  • Archiver: does the archiving

Each of the main classes has a main(), a Lambda RequestHandler, and a Dockerfile that gets built by Gradle.

The entity "momentum" metric tries to assign a score to entities each day based on their change in mentions over the past few days:

log(avg(M_t)) * min( S / 5, 2) * ( (M_1 / M_0) + (M_2 / M_1) )

where:
- S = number of sources mentioning the entity during the period
- M = mentions
- M_t = total mentions during period
- M_x = mentions on day x of the period

The intuition is to look at the change in mentions from the start of the period to the end, boost the score in a diminishing way for entities with a lot of mentions during that period, and boost it again if those mentions occurred across many different sources (10 sources delivers the maximum 2x boost).
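As a rough illustration of the formula above, here is a minimal Python sketch (the project itself is Groovy). It rests on several assumptions that the text does not confirm: the period is three days (M_0, M_1, M_2), `avg` is read as the mean daily mentions over that period, `log` is the natural logarithm, and an entity with zero mentions on day 0 or day 1 scores 0 to avoid dividing by zero. The function name `entity_momentum` is hypothetical.

```python
import math


def entity_momentum(daily_mentions, source_count):
    """Score one entity from its mentions on three consecutive days.

    daily_mentions: [M_0, M_1, M_2], oldest day first (assumed period length).
    source_count:   S, the number of sources mentioning the entity.
    """
    m0, m1, m2 = daily_mentions

    # Assumption: the ratios are undefined with a zero denominator,
    # so such an entity simply gets no momentum.
    if m0 <= 0 or m1 <= 0:
        return 0.0

    # Diminishing boost for heavily mentioned entities
    # (assumption: natural log of the mean daily mentions).
    volume_boost = math.log((m0 + m1 + m2) / 3)

    # Source boost is capped at 2x, which is reached at 10 sources.
    source_boost = min(source_count / 5, 2)

    # Day-over-day change in mentions across the period.
    change = (m1 / m0) + (m2 / m1)

    return volume_boost * source_boost * change
```

With this reading, an entity whose mentions double each day gets a change term of 4, and adding sources beyond 10 no longer raises the score.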

Config

You have two main options: CLI args and/or a properties.groovy (create your own from the supplied example in src/main/resources). There are also defaults defined in NewsCollector.groovy; those should probably be removed in favor of using properties.groovy as the default property repo.

If you're using AWS Lambda, you can also use values in properties.groovy, but they'll end up packaged with the uberjar. Alternatively, you can pass in JSON (see details below).

Authentication

If you set an Elasticsearch user using one of the various config methods, then you'll also need to send a password. If you don't configure a user, it'll figure you don't have authentication set up! And shame on you if you don't. X-Pack is loaded with features (including security) that make Elasticsearch into a production-level analytic powerhouse.
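The rule above (a user requires a password; no user means no authentication) could be sketched roughly as follows. This is an illustration in Python, not the project's Groovy code; the function name `es_credentials` and the choice to raise an error when the password is missing are assumptions.

```python
def es_credentials(user=None, password=None):
    """Return (user, password) for Elasticsearch, or None when auth is off."""
    # No user configured: treat the cluster as unauthenticated.
    if not user:
        return None

    # A user without a password is a configuration mistake
    # (assumption: fail early rather than send a blank password).
    if not password:
        raise ValueError("an Elasticsearch user was set, so a password is required")

    return (user, password)
```

Failing early here is a design choice for the sketch: it surfaces a half-configured login at startup instead of as a rejected request later.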

News API

If you don't want to get a NewsAPI key, no worries! Just leave that config blank and it'll be skipped.

EntityMomentum

usage: EntityMomentum
 -help                    print this message
 -indexPrefix <prefix>    ES index prefix [default: news]
 -pass <pass>             password for ES authentication [default: changeme]
 -propertiesFile <file>   properties file
 -url <URL>               URL for ES [default: http://localhost:9200]
 -user <user>             username for ES authentication [default: elastic]

Archiver

The Archiver is designed to build a tar.gz of data from all the indices created by the news collector and save it to an S3 bucket. By default, it will archive all the data from the first day of "this" month until "yesterday". There is a main() method that accepts ES and S3 configuration as well as a start and end date.

usage: Archiver
 -bucket <bucket>             S3 bucket name for Archiver
 -endDate <date>              end date for Archiver 'yyyy-MM-dd'
 -help                        print this message
 -indexPrefix <prefix>        ES index prefix [default: news]
 -outputFileName <filename>   The filename to be used when writing to s3
                              [default: all_data-<startDate yyyyMMdd>.tar.gz]
 -pass <pass>                 password for ES authentication [default: changeme]
 -prefix <prefix>             S3 file prefix for Archiver
 -startDate <date>            start date for Archiver in 'yyyy-MM-dd'
 -url <URL>                   URL for ES [default: http://localhost:9200]
 -user <user>                 username for ES authentication [default: elastic]
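
The Archiver defaults described above (first day of the current month through yesterday, and an output file named all_data-<startDate yyyyMMdd>.tar.gz) can be sketched in Python as below. The function names are hypothetical, and the sketch makes no claim about how the real Archiver behaves on the first day of a month, when "yesterday" falls before the start date; the text does not say.

```python
from datetime import date, timedelta


def default_archive_range(today):
    """Return (start, end): first day of today's month and yesterday."""
    start = today.replace(day=1)
    end = today - timedelta(days=1)
    return start, end


def default_output_file_name(start):
    """Build the default archive name from the start date (yyyyMMdd)."""
    return "all_data-" + start.strftime("%Y%m%d") + ".tar.gz"


if __name__ == "__main__":
    start, end = default_archive_range(date.today())
    print(start.isoformat(), end.isoformat(), default_output_file_name(start))
```

Passing `today` in as an argument, rather than reading the clock inside the function, keeps the date logic easy to check for a fixed day.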

Contributors

mbarretta