othernet-project / artexin
Article Extraction and Indexing for Outernet
License: GNU General Public License v3.0
The whole site will be accessed via an SSH tunnel, so there is no need for convoluted auth.
Clean up fixtures from the repository (and don't use them in the tests).
Simply use nginx's autoindex for now.
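A minimal sketch of what that could look like (paths and server block are hypothetical, not the project's actual deployment config):

```nginx
# Serve the output directory with nginx's built-in directory listing.
server {
    listen 80;

    location /zipballs/ {
        alias /srv/zipballs/;   # hypothetical output directory
        autoindex on;           # generate a plain directory index
    }
}
```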
At the very least, we should track URLs and their status ('added', 'collected', 'broadcast'). Including some metadata about the URLs is optional.
Do not convert to a soup object every time; it is not very efficient.
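One way to avoid re-parsing is to parse lazily and cache the result. A sketch (class and parameter names are illustrative; the real code presumably uses BeautifulSoup as the parser):

```python
class Page:
    """Parse the HTML once and reuse the tree on later access."""

    def __init__(self, html, parser):
        # ``parser`` would be e.g. lambda h: BeautifulSoup(h, 'html.parser')
        self.html = html
        self._parser = parser
        self._tree = None

    @property
    def tree(self):
        # Lazily parse on first access; later accesses reuse the result.
        if self._tree is None:
            self._tree = self._parser(self.html)
        return self._tree
```

Callers then access `page.tree` freely without paying the parse cost more than once.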
When an image URL is passed, generate an HTML page that contains a single image with whatever metadata is available to us.
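A sketch of such a wrapper page generator (the function name and metadata format are assumptions, not ArtExIn's actual API):

```python
import html

def image_page(image_url, title=None, metadata=None):
    """Wrap a bare image URL in a minimal HTML page.

    ``metadata`` is a dict of whatever key/value pairs are available.
    """
    title = title or image_url
    rows = ''.join(
        '<dt>{}</dt><dd>{}</dd>'.format(html.escape(str(k)), html.escape(str(v)))
        for k, v in (metadata or {}).items())
    return ('<!DOCTYPE html><html><head><title>{0}</title></head>'
            '<body><h1>{0}</h1><img src="{1}" alt="{0}">'
            '<dl>{2}</dl></body></html>').format(
                html.escape(title), html.escape(image_url), rows)
```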
Use either lower-case letters only (a-z) or a mix of lower- and upper-case letters (a-zA-Z).
This is potentially expensive, though.
Add a basic UI that takes a list of URLs and starts collecting them into zip files.
The software should check for available space before starting the operation, and should consider each URL as taking up 3MB of space in total during processing. The software should warn the user when sufficient space is not available.
The rationale for the 3MB-per-URL estimate: the average web page today is about 1.5MB, plus the additional storage needed to create the zip file.
The maximum size per page should be configurable, and there should be a safety margin as well (i.e., X megabytes less than total free space should be taken by the collecting).
To prevent race conditions, there should be a lock file which the web UI should create before attempting to reserve space for its operation. If the web UI finds an existing lock file, it should wait for its removal before attempting to calculate the available storage space.
There should be a file called 'reservations' that contains the total space used by all processes (rounded up to the nearest integer). Each process should:
If the thread cannot reserve enough space, it should release the lock immediately, and inform the user.
Ideally, we should be able to do something like bundle.py http://whatever.com/or/the/other.html and get an unencrypted zipball. This tool would be used by site owners to test ArtExIn output. It should also generate enough debug data so that meaningful bug reports can be made. Also provide an .exe version for Windows users.
TODO: Add some notes about desired behavior here
Page titles very commonly combine the site name with the article title using separators like dashes, dots, bars, etc. We need an intelligent way of stripping away the site name.
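One possible heuristic (an assumption on my part, not the project's chosen approach): split on common separators, drop the segment matching a known site name, and otherwise keep the longest segment on the theory that the article title carries more words than the site name.

```python
import re

SEPARATORS = r'\s+[-|·•–—:]\s+'  # dash, bar, dot, etc.

def strip_site_name(title, site_name=None):
    """Heuristically remove the site name from a page title."""
    parts = re.split(SEPARATORS, title)
    if len(parts) < 2:
        return title
    if site_name:
        kept = [p for p in parts if site_name.lower() not in p.lower()]
        return max(kept, key=len) if kept else title
    # No hint: assume the article title is the longest segment.
    return max(parts, key=len)
```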
======================================================================
FAIL: Doctest: artexin.urlutils.split
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.4/doctest.py", line 2193, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for artexin.urlutils.split
File "/vagrant/artexin/urlutils.py", line 61, in split
----------------------------------------------------------------------
File "/vagrant/artexin/urlutils.py", line 78, in artexin.urlutils.split
Failed example:
split('http://localhost?foo=bar')
Expected:
('http://localhost', '/?foo=bar')
Got:
'/?foo=bar'
----------------------------------------------------------------------
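The doctest expects a (base, path) tuple, but split() is apparently returning only the path. A possible reimplementation using the stdlib (my reconstruction of the intended behaviour, based only on the expected output above):

```python
from urllib.parse import urlsplit

def split(url):
    """Split a URL into (scheme://host, path-with-query)."""
    parts = urlsplit(url)
    base = '{}://{}'.format(parts.scheme, parts.netloc)
    path = parts.path or '/'   # an empty path normalizes to '/'
    if parts.query:
        path += '?' + parts.query
    return base, path
```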
Instead of processing URLs in child processes, we need a proper job queue (like Celery).
ArtExIn should provide a UI for monitoring the queue and reporting on success or failure.
The user who creates a new task should receive an email notification on success, and there should be an option to send notifications to admins as well.
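A real deployment would use something like Celery, but the shape of the queue-plus-notification idea can be sketched in-process with the stdlib (all names here are illustrative; ``process`` collects one URL and ``notify`` would send the email):

```python
import queue
import threading

def run_jobs(urls, process, notify):
    """Process URLs from a queue and report success/failure per URL."""
    q = queue.Queue()
    results = {}

    def worker():
        while True:
            url = q.get()
            if url is None:   # sentinel: shut the worker down
                break
            try:
                process(url)
                results[url] = 'success'
            except Exception:
                results[url] = 'failure'
            notify(url, results[url])

    t = threading.Thread(target=worker)
    t.start()
    for url in urls:
        q.put(url)
    q.put(None)
    t.join()
    return results
```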
We could use this info when deciding whether to broadcast it or not, and whether to put it in priority lane or not.
The tools may or may not include a web-based UI, but local use should be assumed.
The tools are plural, each handling a specific use case, rather than trying to be a complete tool shed.
Readability seems to think the article begins quite low inside the page. The actual 'article' is contained within a div#title-overview-widget.
Example page: http://www.imdb.com/title/tt0087803/
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/local/lib/python3.4/dist-packages/bottle.py", line 1729, in wrapper
rv = callback(*a, **ka)
File "/srv/code/artexin_webui/auth.py", line 272, in wrapped
return f(*args, **kwargs)
File "/srv/code/artexin_webui/app.py", line 80, in collections_process
max_procs=request.app.config['artex.processes'])
File "/srv/code/artexin_webui/schema.py", line 141, in process_urls
results = batch(urls, **kwargs) # WARNING: batch() has many children!
File "/srv/code/artexin/batch.py", line 48, in batch
pool = multiprocessing.Pool(max_procs)
File "/usr/lib/python3.4/multiprocessing/context.py", line 118, in Pool
context=self.get_context())
File "/usr/lib/python3.4/multiprocessing/pool.py", line 160, in __init__
if processes < 1:
TypeError: unorderable types: str() < int()
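The TypeError comes from the `processes < 1` check in Pool.__init__: the 'artex.processes' config value is arriving as a string. A likely fix (my diagnosis from the traceback, not a confirmed patch) is to coerce the value before constructing the pool; the default below is an assumption:

```python
def pool_size(config, key='artex.processes', default=2):
    """Coerce a config value to int before passing it to
    multiprocessing.Pool. Values read from config files are strings,
    and Pool's ``processes < 1`` check raises TypeError on str in
    Python 3."""
    try:
        return int(config[key])
    except (KeyError, TypeError, ValueError):
        return default
```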
The link should contain a one-time code that expires after 3 minutes.
Sessions should expire when the browser window is closed (ideally). An opt-in 'remember me' feature should extend the session for a maximum of 14 days.
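The 3-minute one-time code above could be built from stdlib primitives like this (storage and function names are assumptions; a real app would persist codes, not keep them in a module-level dict):

```python
import secrets
import time

CODE_TTL = 180   # seconds; "expires after 3 minutes"
_codes = {}      # code -> (user, issued_at); illustrative in-memory store

def issue_code(user):
    code = secrets.token_urlsafe(32)   # unguessable one-time token
    _codes[code] = (user, time.time())
    return code

def redeem_code(code, now=None):
    """Return the user for a valid code, consuming it; None otherwise."""
    now = time.time() if now is None else now
    entry = _codes.pop(code, None)     # one-time: remove on first use
    if entry is None:
        return None
    user, issued = entry
    if now - issued > CODE_TTL:
        return None
    return user
```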
The web UI should allow the admin to see a listing of zip files and to remove individual zip files, or multiple zip files at once.
Each channel should have a human-readable name and a folder name (a single word consisting of alphanumerics and underscores). Each channel folder should be created when a new channel is created in the UI, and should contain a .name file that contains the human-readable name as a single-line string.
Each piece of content should be assigned a channel when processed in a batch. There should be a UI for assigning a channel to each URL, as well as for choosing the default channel for the entire batch.
We probably want more flexibility in choosing what to do with each URL.
The UI should present a multi-select box for each URL where the user can choose which pre- and post-processors to run (including a clean pass-through).
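The selected processors could be dispatched through a simple registry (a hypothetical sketch; the registered processors here are toy examples, not ArtExIn's actual ones):

```python
# Registry of named per-URL processors; 'pass-through' leaves the
# page untouched, as described above.
PROCESSORS = {
    'pass-through': lambda html: html,
    'collapse-ws': lambda html: ' '.join(html.split()),
}

def run_processors(html, names):
    """Apply the processors the user selected for one URL, in order."""
    for name in names:
        html = PROCESSORS[name](html)
    return html
```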
The simple template on Blogger has the post title outside the DIV which contains the article text. This needs to be handled, at least for *.blogspot.com URLs.