Negapedia refresh generator and development environment: this package and Docker image are responsible for generating Negapedia, a website on social data extracted from Wikipedia.
You will need a machine with an internet connection, 16GB of RAM, 300GB of storage, and the Docker storage base directory properly set.
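The requirements above can be checked before starting a long run. The following is a minimal sketch, assuming a Linux host; the helper names are hypothetical, and the RAM threshold is set to 15 GiB as an assumption, because the kernel reports slightly less than the installed memory.

```shell
#!/bin/sh
# Pre-flight sketch: compare local resources with the stated requirements.
# Helper names and the 15 GiB RAM slack are assumptions made for illustration.

REQUIRED_RAM_KB=$((15 * 1024 * 1024))    # slack: MemTotal reads a bit below 16GB
REQUIRED_DISK_KB=$((300 * 1024 * 1024))  # 300GB expressed in kB

# meets_minimum AVAILABLE REQUIRED: succeeds when AVAILABLE >= REQUIRED.
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Total RAM in kB from /proc/meminfo (Linux only); prints 0 when unavailable.
total_ram_kb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemTotal:/ { print $2 }' /proc/meminfo
    else
        echo 0
    fi
}

# Free space in kB on the filesystem that holds directory $1.
free_disk_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Docker storage base directory as reported by the daemon, if reachable.
docker_root() {
    if command -v docker >/dev/null 2>&1; then
        docker info --format '{{.DockerRootDir}}' 2>/dev/null
    fi
}

root=$(docker_root)
[ -n "$root" ] && [ -d "$root" ] || root=/   # fall back to the root filesystem

meets_minimum "$(total_ram_kb)" "$REQUIRED_RAM_KB" && echo "RAM: ok" || echo "RAM: too low"
meets_minimum "$(free_disk_kb "$root")" "$REQUIRED_DISK_KB" && echo "disk: ok" || echo "disk: too low"
```

If the disk check fails while another disk has enough room, the Docker storage base directory is probably the setting to change.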
This image takes the nationalization as input and stores the result of the operations in `/data` (in-container folder). All data fetching operations are fully automated, and the result is the Negapedia website in the form of a gzipped tarball of gzipped webpages. The operation flow is composed of the following phases:
- preprocessing of data: CPU intensive, requires a good internet connection and 16GB of RAM;
- exporting to CSV: CPU intensive;
- (optional) calculating TFIDF: CPU and IO intensive;
- construction of the in-container database: IO intensive, requires 300GB of storage, preferably on SSD;
- exporting and compressing the static website by querying the database and the TFIDF data.
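Because the result is a gzipped tarball of gzipped webpages, reading a single page takes two decompression steps. The sketch below illustrates this; the function names, the archive name, and the member paths are hypothetical, since the actual layout of the output is not documented here.

```shell
#!/bin/sh
# Sketch: read pages from a gzipped tarball whose members are gzipped too.
# Function names, archive names and member paths are hypothetical.

# list_pages ARCHIVE: list the members stored in the tarball.
list_pages() {
    tar -tzf "$1"
}

# read_page ARCHIVE MEMBER: print one page, fully decompressed, on stdout.
# tar removes the outer gzip layer, gzip -dc removes the inner one.
read_page() {
    tar -xzOf "$1" "$2" | gzip -dc
}

# Example (paths are placeholders):
#   list_pages /path/2/out/dir/site.tar.gz
#   read_page  /path/2/out/dir/site.tar.gz some/page.html
```

Streaming a member to stdout avoids unpacking the whole site just to inspect one page.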
The `refresh` command accepts the following options:
- `lang`: Wikipedia nationalization to parse, default `it`.
- `url`: output base URL, `%s` is the optional placeholder for subdomain, default `http://%s.negapedia.org`.
- `source`: source of data (`net` or `savepoint`), default `net`.
- `keep`: keep every savepoint after the execution (`true` or `false`), default `false`.
- `tfidf`: calculate TFIDF, if `false`, try available precalculated measures (`true` or `false`), default `false`.
- `test`: run as test on a fraction of the articles before `savepoint` (`true` or `false`), default `false`.
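To make the `%s` placeholder of the `url` option concrete, here is a small sketch of how such a template can be expanded. It assumes that the subdomain is the value of `lang`, which the text suggests but does not state; `expand_url` is a hypothetical helper, not part of the image.

```shell
#!/bin/sh
# Sketch of how a "%s" placeholder in the base URL can be expanded.
# Assumption: the subdomain is the value of the lang option.

# expand_url TEMPLATE SUBDOMAIN: substitute SUBDOMAIN for the first "%s";
# a template without a placeholder is printed unchanged.
expand_url() {
    case $1 in
        *%s*) printf '%s\n' "${1%%\%s*}$2${1#*\%s}" ;;
        *)    printf '%s\n' "$1" ;;
    esac
}

expand_url 'http://%s.negapedia.org' en   # prints http://en.negapedia.org
expand_url 'http://example.org' en        # prints http://example.org
```

The second call shows why the placeholder is described as optional: a base URL without `%s` is used as given.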
- `docker run negapedia/negapedia refresh -lang en`: basic usage, run the image on the English nationalization and store the result in the in-container `/data` folder.
- `docker run -v /path/2/out/dir:/data --rm negapedia/negapedia refresh -lang en`:
  1. run the image as before;
  2. mount the guest `/data` folder as a volume on the host folder `/path/2/out/dir`, the output folder, so that at the end of the operations `/path/2/out/dir` will contain the result. This folder can be changed to an arbitrary folder of your choice;
  3. remove the container right after the execution.
- `docker run -v /path/2/out/dir:/data --rm --init -d negapedia/negapedia refresh -lang en`: the command you may want to use:
  1. run the image as before;
  2. run an init process that takes care of reaping any zombie processes, just in case;
  3. run the image in detached mode.

For further explanations please refer to the docker run reference.
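The detached invocation above can be wrapped in a small script so that the output folder and the nationalization become parameters. This is only a sketch: `run_refresh` and `DRY_RUN` are hypothetical names, and by default the script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch of a wrapper around the detached invocation.
# run_refresh and DRY_RUN are hypothetical names introduced for illustration.

# run_refresh OUT_DIR LANG: print the docker command, or run it when DRY_RUN=0.
run_refresh() {
    out_dir=$1
    lang=$2
    # A bind mount needs an absolute host path; docker treats anything else
    # as the name of a volume.
    case $out_dir in
        /*) ;;
        *)  echo "output folder must be an absolute path: $out_dir" >&2
            return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = "0" ]; then
        docker run -v "$out_dir:/data" --rm --init -d \
            negapedia/negapedia refresh -lang "$lang"
    else
        echo "docker run -v $out_dir:/data --rm --init -d negapedia/negapedia refresh -lang $lang"
    fi
}

run_refresh /path/2/out/dir en
```

Running the script with `DRY_RUN=0` in the environment starts the container; without it, the printed line can be reviewed first.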
- `docker pull negapedia/negapedia`: update the image to the latest revision.
- `docker kill --signal=SIGQUIT $(docker ps -ql)`: quit the last container and log the trace dump.
- `docker kill --signal=SIGUSR1 $(docker ps -ql)`: log the trace dump of the last container without quitting it.
- `docker logs -f $(docker ps -lq)`: fetch the logs of the last container.
- `docker system prune -fa --volumes`: remove all unused images and volumes without asking for confirmation.