othernet-project / artexin
Article Extraction and Indexing for Outernet
License: GNU General Public License v3.0
The whole site will be accessed via an SSH tunnel, so there is no need for convoluted auth.
Clean up fixtures from the repository (and don't use them in the tests).
Simply use nginx's autoindex for now.
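A minimal sketch of what that could look like (paths and server block are hypothetical, not the project's actual deployment config):

```nginx
# Serve the output directory with nginx's built-in directory listing.
server {
    listen 80;

    location /zipballs/ {
        alias /srv/zipballs/;   # hypothetical output directory
        autoindex on;           # generate a plain directory index
    }
}
```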
At the very least, we should track URLs and their status ('added', 'collected', 'broadcast'). Including some metadata about the URLs is optional.
Do not convert to a soup object every time; it is not very efficient.
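One way to avoid re-parsing is to parse lazily and cache the result. A sketch (class and parameter names are illustrative; the real code presumably uses BeautifulSoup as the parser):

```python
class Page:
    """Parse the HTML once and reuse the tree on later access."""

    def __init__(self, html, parser):
        # ``parser`` would be e.g. lambda h: BeautifulSoup(h, 'html.parser')
        self.html = html
        self._parser = parser
        self._tree = None

    @property
    def tree(self):
        # Lazily parse on first access; later accesses reuse the result.
        if self._tree is None:
            self._tree = self._parser(self.html)
        return self._tree
```

Callers then access `page.tree` freely without paying the parse cost more than once.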
When an image URL is passed, generate an HTML page that contains a single image with whatever metadata is available to us.
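A sketch of such a wrapper page generator (the function name and metadata format are assumptions, not ArtExIn's actual API):

```python
import html

def image_page(image_url, title=None, metadata=None):
    """Wrap a bare image URL in a minimal HTML page.

    ``metadata`` is a dict of whatever key/value pairs are available.
    """
    title = title or image_url
    rows = ''.join(
        '<dt>{}</dt><dd>{}</dd>'.format(html.escape(str(k)), html.escape(str(v)))
        for k, v in (metadata or {}).items())
    return ('<!DOCTYPE html><html><head><title>{0}</title></head>'
            '<body><h1>{0}</h1><img src="{1}" alt="{0}">'
            '<dl>{2}</dl></body></html>').format(
                html.escape(title), html.escape(image_url), rows)
```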
Use either lower-case letters only (a-z) or a mix of lower- and upper-case letters (a-zA-Z).
This is potentially expensive, though.
Add a basic UI that takes a list of URLs and starts collecting them into zip files.
The software should check for available space before starting the operation, and should consider each URL as taking up 3MB of space in total during processing. The software should warn the user when sufficient space is not available.
The rationale for the 3MB-per-URL estimate: the average web page today is about 1.5MB, plus the additional storage needed to create the zip file.
The maximum size per page should be configurable, and there should be a safety margin as well (i.e., X megabytes less than total free space should be taken by the collecting).
To prevent race conditions, there should be a lock file which the web UI should create before attempting to reserve space for its operation. If the web UI finds an existing lock file, it should wait for its removal before attempting to calculate the available storage space.
There should be a file called 'reservations' that contains the total space used by all processes (rounded up to the nearest integer). Each process should:
If the thread cannot reserve enough space, it should release the lock immediately, and inform the user.
Ideally, we should be able to do something like bundle.py http://whatever.com/or/the/other.html and get an unencrypted zipball. This tool would be used by site owners to test ArtExIn output. It should also generate enough debug data so that meaningful bug reports can be made. Also provide an .exe version for Windows users.
TODO: Add some notes about desired behavior here
Page titles very commonly combine the site name with the article title using separators like dashes, dots, bars, etc. We need an intelligent way of stripping away the site name.
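One possible heuristic (an assumption on my part, not the project's chosen approach): split on common separators, drop the segment matching a known site name, and otherwise keep the longest segment on the theory that the article title carries more words than the site name.

```python
import re

SEPARATORS = r'\s+[-|·•–—:]\s+'  # dash, bar, dot, etc.

def strip_site_name(title, site_name=None):
    """Heuristically remove the site name from a page title."""
    parts = re.split(SEPARATORS, title)
    if len(parts) < 2:
        return title
    if site_name:
        kept = [p for p in parts if site_name.lower() not in p.lower()]
        return max(kept, key=len) if kept else title
    # No hint: assume the article title is the longest segment.
    return max(parts, key=len)
```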
======================================================================
FAIL: Doctest: artexin.urlutils.split
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.4/doctest.py", line 2193, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for artexin.urlutils.split
File "/vagrant/artexin/urlutils.py", line 61, in split
----------------------------------------------------------------------
File "/vagrant/artexin/urlutils.py", line 78, in artexin.urlutils.split
Failed example:
split('http://localhost?foo=bar')
Expected:
('http://localhost', '/?foo=bar')
Got:
'/?foo=bar'
----------------------------------------------------------------------
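The doctest expects a (base, path) tuple, but split() is apparently returning only the path. A possible reimplementation using the stdlib (my reconstruction of the intended behaviour, based only on the expected output above):

```python
from urllib.parse import urlsplit

def split(url):
    """Split a URL into (scheme://host, path-with-query)."""
    parts = urlsplit(url)
    base = '{}://{}'.format(parts.scheme, parts.netloc)
    path = parts.path or '/'   # an empty path normalizes to '/'
    if parts.query:
        path += '?' + parts.query
    return base, path
```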
Instead of processing URLs in child processes, we need a proper job queue (like Celery).
ArtExIn should provide a UI for monitoring the queue and reporting on success or failure.
The user who creates a new task should receive an email notification on success, and there should be an option to send notifications to admins as well.
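A real deployment would use something like Celery, but the shape of the queue-plus-notification idea can be sketched in-process with the stdlib (all names here are illustrative; ``process`` collects one URL and ``notify`` would send the email):

```python
import queue
import threading

def run_jobs(urls, process, notify):
    """Process URLs from a queue and report success/failure per URL."""
    q = queue.Queue()
    results = {}

    def worker():
        while True:
            url = q.get()
            if url is None:   # sentinel: shut the worker down
                break
            try:
                process(url)
                results[url] = 'success'
            except Exception:
                results[url] = 'failure'
            notify(url, results[url])

    t = threading.Thread(target=worker)
    t.start()
    for url in urls:
        q.put(url)
    q.put(None)
    t.join()
    return results
```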
We could use this info when deciding whether to broadcast it or not, and whether to put it in priority lane or not.
The tools may or may not include a web-based UI, but local use should be assumed.
The tools are plural, each handling a specific use case, rather than trying to be a complete tool shed.
Readability seems to think the article begins quite low inside the page. The actual 'article' is contained within a div#title-overview-widget.
Example page: http://www.imdb.com/title/tt0087803/
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/local/lib/python3.4/dist-packages/bottle.py", line 1729, in wrapper
rv = callback(*a, **ka)
File "/srv/code/artexin_webui/auth.py", line 272, in wrapped
return f(*args, **kwargs)
File "/srv/code/artexin_webui/app.py", line 80, in collections_process
max_procs=request.app.config['artex.processes'])
File "/srv/code/artexin_webui/schema.py", line 141, in process_urls
results = batch(urls, **kwargs) # WARNING: batch() has many children!
File "/srv/code/artexin/batch.py", line 48, in batch
pool = multiprocessing.Pool(max_procs)
File "/usr/lib/python3.4/multiprocessing/context.py", line 118, in Pool
context=self.get_context())
File "/usr/lib/python3.4/multiprocessing/pool.py", line 160, in __init__
if processes < 1:
TypeError: unorderable types: str() < int()
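The TypeError comes from the `processes < 1` check in Pool.__init__: the 'artex.processes' config value is arriving as a string. A likely fix (my diagnosis from the traceback, not a confirmed patch) is to coerce the value before constructing the pool; the default below is an assumption:

```python
def pool_size(config, key='artex.processes', default=2):
    """Coerce a config value to int before passing it to
    multiprocessing.Pool. Values read from config files are strings,
    and Pool's ``processes < 1`` check raises TypeError on str in
    Python 3."""
    try:
        return int(config[key])
    except (KeyError, TypeError, ValueError):
        return default
```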
The link should contain a one-time code that expires after 3 minutes.
Sessions should expire when the browser window is closed (ideally). An opt-in 'remember me' feature should extend the session for a maximum of 14 days.
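The 3-minute one-time code above could be built from stdlib primitives like this (storage and function names are assumptions; a real app would persist codes, not keep them in a module-level dict):

```python
import secrets
import time

CODE_TTL = 180   # seconds; "expires after 3 minutes"
_codes = {}      # code -> (user, issued_at); illustrative in-memory store

def issue_code(user):
    code = secrets.token_urlsafe(32)   # unguessable one-time token
    _codes[code] = (user, time.time())
    return code

def redeem_code(code, now=None):
    """Return the user for a valid code, consuming it; None otherwise."""
    now = time.time() if now is None else now
    entry = _codes.pop(code, None)     # one-time: remove on first use
    if entry is None:
        return None
    user, issued = entry
    if now - issued > CODE_TTL:
        return None
    return user
```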
The web UI should allow the admin to see a listing of zip files and to remove individual zip files, or multiple zip files at once.
Each channel should have a human-readable name and a folder name (a single word consisting of alphanumerics and underscores). Each channel folder should be created when a new channel is created in the UI, and should contain a .name file that contains the human-readable name as a single-line string.
Each piece of content should be assigned a channel when processed in a batch. There should be a UI for assigning a channel to each URL, as well as for choosing the default channel for the entire batch.
We probably want more flexibility in choosing what to do with each URL.
The UI should present a multi-select box for each URL where the user can choose which pre- and post-processors to run (including a clean pass-through).
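The selected processors could be dispatched through a simple registry (a hypothetical sketch; the registered processors here are toy examples, not ArtExIn's actual ones):

```python
# Registry of named per-URL processors; 'pass-through' leaves the
# page untouched, as described above.
PROCESSORS = {
    'pass-through': lambda html: html,
    'collapse-ws': lambda html: ' '.join(html.split()),
}

def run_processors(html, names):
    """Apply the processors the user selected for one URL, in order."""
    for name in names:
        html = PROCESSORS[name](html)
    return html
```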
The simple template on Blogger has the post title outside the DIV which contains the article text. This needs to be handled, at least for *.blogspot.com URLs.