baleen's Introduction

Baleen

An automated ingestion service for blogs to construct a corpus for NLP research.

Space Whale

Quick Start

This quick start is intended to get you set up with Baleen in development mode (since the project is still under development). If you'd like to run Baleen in production, please see the documentation.

  1. Clone the repository
$ git clone git@github.com:bbengfort/baleen.git
$ cd baleen
  2. Create a virtualenv and install the dependencies
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
  3. Add the baleen module to your $PYTHONPATH via the virtualenv.
$ echo $(pwd) > venv/lib/python2.7/site-packages/baleen.pth
  4. Create your local configuration file. Edit it with the connection details to your local MongoDB server. This is also a good time to check and make sure that you can create a database called Baleen on Mongo.
$ cp conf/baleen-example.yaml conf/baleen.yaml
debug: true
testing: false
database:
    host: localhost
    port: 27017
    name: baleen
server:
    host: 127.0.0.1
    port: 5000
  5. Run the tests to make sure everything is ok.
$ make test
  6. Make sure that the command line utility is ready to go:
$ bin/baleen --help
  7. Import the feeds from the feedly.opml file in the fixtures.
$ bin/baleen load tests/fixtures/feedly.opml
Ingested 36 feeds from 1 OPML files
  8. Perform an ingestion of the feeds that were imported from the feedly.opml file.
$ bin/baleen ingest

Your Mongo database collections should be created as you add new documents to them, and at this point you're ready to develop!

Docker Setup

This repository also includes files for setting up the development environment with Docker, if you prefer.

  1. Install Docker Machine and Docker Compose, e.g. with Docker Toolbox.

  2. Clone the repository

$ git clone git@github.com:bbengfort/baleen.git
$ cd baleen
  3. Create your local configuration file. Edit it with your configuration details; your MongoDB server will be at host mongo.
$ cp conf/baleen-example.yaml conf/baleen.yaml
debug: true
testing: false
database:
    host: mongo
    port: 27017
    name: baleen
server:
    host: 127.0.0.1
    port: 5000
  4. Exec interactively into the app container to work with Baleen as described in steps 5-8 of the Quick Start above.
    docker exec -it baleen_app_1 /bin/bash

Web Admin

There is a simple Flask application that ships with Baleen that provides information about the current status of the Baleen ingestion. This app can be run locally in development with the following command:

$ bin/baleen serve

You can then reach the website at http://127.0.0.1:5000/. Note that the host and port can be configured in the YAML configuration file or as command line arguments to the serve command.

Deployment

The web application is deployed in production as an Nginx + uWSGI + Flask application that is managed by upstart.

About

Baleen is a tool for ingesting formal natural language data from the discourse of professional and amateur writers: e.g. bloggers and news outlets. Rather than performing web scraping, Baleen focuses on data ingestion through the use of RSS feeds. It performs as much raw data collection as it can, saving data into a Mongo document store.

Throughput

Throughput Graph

Attribution

The image used in this README, "Space Whale" by hbitik, is licensed under CC BY-NC-ND 3.0.

baleen's People

Contributors

bahadasx, bbengfort, danieljohnbenton, lauralorenz, ojedatony1616, waffle-iron

baleen's Issues

Fetch/Ingest Timeout

Add a timeout so that if a post or feed is having trouble being downloaded, we skip it and carry on.
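
One way this could look, as a minimal sketch rather than Baleen's actual fetch code: the helper names `fetch_url` and `fetch_all` are hypothetical, and the standard library's `urllib` stands in for whatever HTTP client the project uses. Any download that times out or fails is recorded and skipped, and the loop carries on.

```python
import socket
import urllib.error
import urllib.request


def fetch_url(url, timeout=10.0):
    """Download a URL; blocking socket operations give up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, timeout=10.0):
    """Fetch every URL, skipping any that time out or fail to download.

    Returns (results, skipped): a dict of url -> body, and a list of
    (url, error message) pairs for the downloads that were skipped.
    """
    results, skipped = {}, []
    for url in urls:
        try:
            results[url] = fetch(url, timeout=timeout)
        except (socket.timeout, urllib.error.URLError, OSError) as exc:
            # Record the failure and carry on with the next feed or post.
            skipped.append((url, str(exc)))
    return results, skipped
```

Passing the fetch function as a parameter keeps the skip-and-continue logic testable without network access.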

Docker image is empty

After connecting to the Docker container, all I see is the requirements.txt file.

To reproduce:

  • Follow steps 1-3 of the Docker directions on the Quickstart instructions in docs/index.md
  • docker build -t "baleen_app_1" -f Dockerfile-app
  • follow step 4: docker exec -it baleen_app_1 /bin/bash

Results:

root@05c2ca45b232:/baleen# ls
requirements.txt

short urls

@bahadasx - I'm shortening the URLs to make it easier to type on my phone, e.g. job_status --> status. Hope that's ok!

export commandline options

Add support for the following export commandline options

  • “html” - like safe, except includes meta tags
  • “json” - TBD

Bootstrapify /status

I did a basic job of adding bootstrap styles to both pages in the app, but I didn't really touch the status page. You can use bootstrap components and things will be a lot cleaner, for example:

  • list groups with badges for the errors and counts, similar to what I did on the feed nav page.
  • put the log records into a table that is striped and bordered for easy review and access.
  • Use grid layouts instead of   to better layout the page.

I know you might not do bootstrap, but it's really easy and it goes a long way to making things look great without having to be a designer.

Edit the README

Get the readme going and add the Baleen architecture diagram.

Baleen Corpus Reader

During the NLP workshop, we had a good idea - why not create some NLP processing tools in Baleen? In particular, if we create a BaleenCorpusReader class that extends or provides a similar API to the nltk.corpus.CategorizedPlaintextCorpusReader - then we could use NLTK style analytics directly on top of the MongoDB.
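
A rough sketch of the shape such a reader might take. Everything here is an assumption for illustration: the document keys (`id`, `category`, `text`) are hypothetical, an in-memory list stands in for the MongoDB collection, and a naive regex replaces a real NLTK tokenizer. Only the method names (`categories`, `fileids`, `raw`, `words`) mirror the NLTK corpus reader API.

```python
import re
from collections import defaultdict


class BaleenCorpusReader:
    """NLTK-style read access over post documents (hypothetical sketch).

    `posts` is any iterable of dicts with 'id', 'category' and 'text' keys;
    in Baleen these would come from MongoDB rather than a list.
    """

    def __init__(self, posts):
        self._posts = {post["id"]: post for post in posts}
        self._by_category = defaultdict(list)
        for post in self._posts.values():
            self._by_category[post["category"]].append(post["id"])

    def categories(self):
        """Return the sorted list of categories in the corpus."""
        return sorted(self._by_category)

    def fileids(self, categories=None):
        """Return post ids, optionally restricted to one or more categories."""
        if categories is None:
            return sorted(self._posts)
        if isinstance(categories, str):
            categories = [categories]
        return sorted(
            fid for cat in categories for fid in self._by_category.get(cat, [])
        )

    def raw(self, fileids=None, categories=None):
        """Return the text of the selected posts joined by newlines."""
        ids = self._resolve(fileids, categories)
        return "\n".join(self._posts[fid]["text"] for fid in ids)

    def words(self, fileids=None, categories=None):
        """Return word tokens; a naive regex stands in for an NLTK tokenizer."""
        return re.findall(r"\w+", self.raw(fileids, categories))

    def _resolve(self, fileids, categories):
        if fileids is not None:
            return [fileids] if isinstance(fileids, str) else list(fileids)
        return self.fileids(categories)
```

For example, `BaleenCorpusReader(posts).words(categories="news")` would return the tokens of every post in the hypothetical "news" category.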

Create Ingestion Log

Create ingestion log with start/stop and feeds ingested/errored/posts ingested/errored information.
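
As a starting point, the record could be as small as the sketch below. The class and field names are illustrative assumptions, not an existing Baleen model, and persistence to Mongo is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def utcnow():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class IngestionLog:
    """One record per ingestion run (field names are illustrative)."""

    started: datetime = field(default_factory=utcnow)
    finished: Optional[datetime] = None
    feeds_ingested: int = 0
    feeds_errored: int = 0
    posts_ingested: int = 0
    posts_errored: int = 0

    def record_feed(self, ok=True):
        """Count one feed as ingested or errored."""
        if ok:
            self.feeds_ingested += 1
        else:
            self.feeds_errored += 1

    def record_post(self, ok=True):
        """Count one post as ingested or errored."""
        if ok:
            self.posts_ingested += 1
        else:
            self.posts_errored += 1

    def stop(self):
        """Mark the run as finished and return the record."""
        self.finished = utcnow()
        return self
```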

Make posts.htmlize() smarter

This method was originally written to wrap HTML snippets so that they look like a real web page. Now we have the ability to fetch complete web pages from RSS feeds. However, in some use cases, such as when the RSS feed fails to download a web page, the old wrapping behavior will still be necessary.

Requirements:
htmlize() should do one of the following, depending on the content:

  1. wrap snippets in a web page (what we already do)
  2. return complete web page
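
The two cases above can be sketched as a standalone function. This is an assumption about how the check might work, not the current method: the signature, the template, and the "contains an `<html>` tag" test are all hypothetical.

```python
import html
import re

# Matches an opening <html> tag, with or without attributes.
_HTML_TAG = re.compile(r"<html[\s>]", re.IGNORECASE)

_TEMPLATE = (
    "<!DOCTYPE html>\n"
    "<html>\n<head><meta charset=\"utf-8\"><title>{title}</title></head>\n"
    "<body>\n{body}\n</body>\n</html>"
)


def htmlize(content, title=""):
    """Return `content` as a complete web page.

    Content that already has an <html> tag is returned unchanged;
    anything else is treated as a snippet and wrapped in a page.
    """
    if _HTML_TAG.search(content):
        return content
    return _TEMPLATE.format(title=html.escape(title), body=content)
```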

Create baleend

Create the Baleen daemon service that uses scheduling to run in the background.

Trouble installing feedparser

@bbengfort I'm having trouble pip installing feedparser from the requirements.txt. Here is the error I'm getting:

Collecting feedparser==5.2.1 (from -r requirements.txt (line 2))
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79990>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79b10>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79c90>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79e10>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79f90>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Could not find a version that satisfies the requirement feedparser==5.2.1 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for feedparser==5.2.1 (from -r requirements.txt (line 2))
ERROR: Service 'app' failed to build: The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1

Use readability on HTML

Add the readability mechanism to get the good text from the HTML dump (or for insertion into mongo).

Refactor utilities

Move the utilities into their own package, including:

  • logger
  • mongolog
  • utils (timez)

Refactor Logging

The logging is awesome! But let's go ahead and refactor it and the mongolog into our mixin based logger that we now more routinely implement.

Also add logging configuration to confire.

Quick time display

It would be really helpful to get some "at a glance" times into the application, particularly for duration (finished - started).

  • Add a duration method to baleen.models.Job that computes the number of seconds between started and finished (unless not finished, then between now and started).
  • Add the duration to the page colored green if it's less than an hour, yellow if it's less than 5 hours, and red otherwise.

Iconography from font awesome would also help make things stand out!
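
The duration rule and color thresholds described in the bullets could be sketched like this. The functions are written standalone for illustration (on the real model this would be a method of `baleen.models.Job`), and the sketch assumes timezone-aware datetimes throughout.

```python
from datetime import datetime, timedelta, timezone


def duration(started, finished=None, now=None):
    """Seconds between started and finished, or started and now if unfinished.

    Assumes timezone-aware datetimes; mixing naive and aware values raises TypeError.
    """
    if finished is not None:
        end = finished
    else:
        end = now or datetime.now(timezone.utc)
    return (end - started).total_seconds()


def duration_color(seconds):
    """Green under one hour, yellow under five hours, red otherwise."""
    if seconds < 3600:
        return "green"
    if seconds < 5 * 3600:
        return "yellow"
    return "red"
```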

Counting posts per feed is slow/uses too much memory

The method:

baleen.models.Feed.count_posts

Is too slow on the deployment server. It seems that:

Post.objects(feed=self).count()

is going through the entire collection and filtering, which is bad.

Need to figure out a better way to do this.

Index?

Python 3.5 Support

  • Make sure that the tests work
  • Drop 2.7 dependency from travis
  • ensure 3 compatibility in all packages.

Currently running status screen a bit wonky

The status screen for a currently running job got a bit wonky by accident:

screenshot 2016-04-19 12 52 24

I think this was just caused by us writing updates at the same time; I made some changes and I'm sure you did too. So fixes:

  1. If running, we should hide the "Finished" row in the job info table.
  2. Use the "empty" modifier in the for loop for counts and errors to indicate that there are no counts or errors yet (which will generally fix the case where there are, in fact, no counts and errors).
  3. Colorize the job running row and the job info panel with the info class as an at-a-glance running indicator.

Deploy web application

Need a few things

  • deploy Flask app with Gunicorn on server
  • add deployment files to repo.

Latest post appears incorrect

It seems that the latest post is incorrect on the /status page?

Compare:

screen shot 2016-04-07 at 13 58 47

To:

screen shot 2016-04-07 at 13 59 27

Which were both run at the same time.

Create Database Test Harness

Create a test harness for testing the database (e.g. something that creates a testing version of the database and then destroys it when complete).

Add Mongo dependency to Travis for testing.

Memory Crash

In deployment, going to /status causes the app to crash because the server runs out of memory (even though it has 4.0 GB of memory).

Sanitize HTML

Use bleach to sanitize the post HTML to ensure there are no harmful scripts.

Either on Export or for Mongo storage.

Time zone and humanized numbers

On the dates there is some weirdness; they say:

"03:21 PM"

Which is 15:21 but sort of looks like 03:21 am, and it's confusing.

Also, the dates are in UTC time, so either:

  • put the timezone as part of the string
  • convert the timezone to local time.

If you could also add humanization, that would be really helpful for example:

  • add time since now "3 hours ago" to finished, in addition to the duration.
  • add intcomma to the big numbers: 133,791 instead of 133791.

That would really help, thanks!
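
A small sketch of what these helpers might look like using only the standard library. The function names are borrowed from common humanize-style libraries as an assumption, and the timestamp format is just one possible choice: a 24-hour clock with the zone spelled out.

```python
from datetime import datetime, timedelta, timezone


def intcomma(value):
    """Format an integer with thousands separators."""
    return format(value, ",")


def timesince(when, now=None):
    """Describe how long ago `when` was, in the largest whole unit."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - when).total_seconds())
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // size
        if count >= 1:
            return "{} {}{} ago".format(count, unit, "" if count == 1 else "s")
    return "just now"


def format_timestamp(when):
    """Render a timezone-aware datetime on a 24-hour clock with its zone name."""
    return when.strftime("%Y-%m-%d %H:%M %Z")
```

To show local time instead of UTC, `when.astimezone()` with no arguments converts an aware datetime to the system's local zone before formatting.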

Better Export

  • Sanitize MP3 and other documents.
  • Put appropriate number of zeros as padding in front of digits depending on the order of documents.
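
For the padding, one possible approach is to derive the width from the total number of documents so that the exported names sort correctly as plain strings. The function name, the 1-based numbering, and the default extension are assumptions made for this sketch.

```python
def export_names(count, ext="html"):
    """Yield file names for documents 1..count, zero-padded to a common width."""
    # The widest index is `count` itself, so its digit count sets the padding.
    width = len(str(count))
    for index in range(1, count + 1):
        yield "{:0{width}d}.{ext}".format(index, width=width, ext=ext)
```

With 120 documents the names run from `001.html` to `120.html`; with 3 documents no padding is added.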

Commis CLI

Add commis for the command line interface.

Segmentation Fault --> 404 Error on Status

Looks like we're getting an occasional segfault on /status/. Note that this appears to the user as a 404 error from Nginx.

The /var/log/uwsgi/app/baleen.log has the following to say about it:

Thu Apr  7 18:00:38 2016 - !!! uWSGI process 3488 got Segmentation Fault !!!
Thu Apr  7 18:00:38 2016 - *** backtrace of 3488 ***
/usr/bin/uwsgi(uwsgi_backtrace+0x2e) [0x45121e]
/usr/bin/uwsgi(uwsgi_segfault+0x21) [0x4515f1]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f0310177d40]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Malloc+0x248) [0x7f030eff9298]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyCapsule_New+0x28) [0x7f030ef835c8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1a0fe9) [0x7f030efc5fe9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1a2a0b) [0x7f030efc7a0b]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(_PyArg_ParseTuple_SizeT+0x89) [0x7f030ef839c9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x8d7cd) [0x7f030eeb27cd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4bd4) [0x7f030efb20d4]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8) [0x7f030efb1dd8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8) [0x7f030efb1dd8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c36d0) [0x7f030efe86d0]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xbb7bd) [0x7f030eee07bd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x7f030efcd577]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xe19a6) [0x7f030ef069a6]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x93912) [0x7f030eeb8912]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x60f12) [0x7f030ee85f12]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x13268f) [0x7f030ef5768f]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x2316) [0x7f030efaf816]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c37a5) [0x7f030efe87a5]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xeb1) [0x7f030efae3b1]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c36d0) [0x7f030efe86d0]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xbb7bd) [0x7f030eee07bd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1347e5) [0x7f030ef597e5]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x7f030efcd577]
/usr/lib/uwsgi/plugins/python_plugin.so(python_call+0x11) [0x7f030f3994f1]
/usr/lib/uwsgi/plugins/python_plugin.so(uwsgi_request_wsgi+0x127) [0x7f030f39b847]
/usr/bin/uwsgi(wsgi_req_recv+0xa1) [0x413f31]
/usr/bin/uwsgi(simple_loop_run+0xc4) [0x44d5d4]
/usr/bin/uwsgi(uwsgi_ignition+0x17b) [0x45180b]
/usr/bin/uwsgi(uwsgi_worker_run+0x26d) [0x4523ad]
/usr/bin/uwsgi(uwsgi_start+0x15e3) [0x453b23]
/usr/bin/uwsgi(main+0xfb5) [0x413595]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f0310162ec5]
/usr/bin/uwsgi() [0x413649]
*** end of backtr^@<80>>^B<AB><FD>^?^@Thu Apr  7 18:00:44 2016 - DAMN ! worker 2 (pid: 3488) died, killed by signal 11 :( trying respawn ...
Thu Apr  7 18:00:44 2016 - Respawned uWSGI worker 2 (new pid: 3708)

Add version number to footer

The version number was removed from the header of the status page to make the format consistent with the rest of the site. The footer seems like the best place to add it back in unless we add an "About" page at some point.

commit seed file to /fixtures

I found the seed file, feedly.opml, in /tests/fixtures/

According to the install instructions, we should move it to /fixtures/
