archivesunleashed / auk
Rails application for the Archives Unleashed Cloud.
Home Page: https://cloud.archivesunleashed.org/
License: Other
It appears Spark is running multiple times when the CollectionsSparkJob is fired off.
Right now we just have the plain old Rails no route error.
This should help us get something better.
Maybe wave to Olga? 😃
Users will start with the basic all-links dump. Then we will have a series of filters that can be combined and then exported to CSV.
We have previously used python-slugify to create our HDFS directory structure. Our HDFS root for the project is /shared/au/.
Institution directories follow this pattern: slugified-institution-name-institution-number
Example: simon-fraser-university-library-727
Collection directories follow this pattern: slugified-collection-name-collection-number
Example: canadian-political-parties-and-political-interest-groups-227
So, a full path would look like: /shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227
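A minimal sketch of that convention in Ruby, using ActiveSupport's String#parameterize as a stand-in for python-slugify (the helper name is hypothetical):

require 'active_support/core_ext/string'

HDFS_ROOT = '/shared/au'.freeze

# Build /shared/au/<slugified-institution-number>/<slugified-collection-number>
def hdfs_collection_path(institution_name, institution_number, collection_name, collection_number)
  institution_dir = "#{institution_name.parameterize}-#{institution_number}"
  collection_dir  = "#{collection_name.parameterize}-#{collection_number}"
  File.join(HDFS_ROOT, institution_dir, collection_dir)
end

hdfs_collection_path('University of Toronto Libraries', 75,
                     'Canadian Political Parties and Political Interest Groups', 227)
# => "/shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227"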
belongs_to user
We'll probably need to change the default here if we're downloading massive collections.
Right now AUK works really well with Archive-It collections. It would be great to investigate if we could get collections into AUK from other sources, ideally using WASAPI.
WebRecorder.io would be ideal as we could then support individual researchers by ingesting their personally-created WARCs and provide analysis.
Right now, however, to get collection data we rely on the Internet Archive's Collections API.
Right now, when you decide to download a collection, you get a nice pop-up window like so:
After you press OK, in the back end the files begin to download. But in the front end, there's no real way to tell if things are downloading or if they are not (and @SamFritz and I worried that users might keep hitting the download button over and over again).
Could we do a second prompt, linked to #31:
Your collection has begun downloading. An e-mail will be sent to [E-MAIL ADDRESS FROM USER] once it is complete.
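One way to wire that up (a sketch only; the controller, job, and current_user helper names are assumptions, not the app's actual ones):

class CollectionsController < ApplicationController
  def download
    # Queue the download in the background so repeated clicks don't re-trigger it.
    DownloadCollectionJob.perform_later(current_user.id, params[:id])
    flash[:notice] = 'Your collection has begun downloading. ' \
                     "An e-mail will be sent to #{current_user.email} once it is complete."
    redirect_to collection_path(params[:id])
  end
end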
Environment variables are not being picked up where they are set, and only work if they are explicitly passed when starting up delayed_job.
RAILS_ENV=production WASAPI_KEY=somekey DOWNLOAD_PATH=/some/path SPARK_SHELL=/some/path/bin/spark-shell SPARK_MEMORY_DRIVER=90G SPARK_NETWORK_TIMEOUT=10000000 AUT_VERSION=0.12.2 bin/delayed_job start
This should be resolved with #44. If not, this will be a documentation ticket.
The home page should be a bit more descriptive - and fun!
Ideas include:
Most of these spark-shell options/flags should probably be environment variables.
--driver-memory 5G
--conf spark.network.timeout=10000000
--packages "io.archivesunleashed:aut:0.12.1"
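A sketch of pulling those flags from the same environment variables already passed to delayed_job above, with today's hard-coded values as fallbacks:

# Build the spark-shell invocation from ENV.
def spark_shell_command(script_path)
  [
    ENV.fetch('SPARK_SHELL', 'spark-shell'),
    '--driver-memory', ENV.fetch('SPARK_MEMORY_DRIVER', '5G'),
    '--conf', "spark.network.timeout=#{ENV.fetch('SPARK_NETWORK_TIMEOUT', '10000000')}",
    '--packages', "io.archivesunleashed:aut:#{ENV.fetch('AUT_VERSION', '0.12.1')}",
    '-i', script_path
  ].join(' ')
end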
Make a word cloud image of the plain text for each collection's page.
We’ll need to make sure to have stopwords so it's not overwhelmed by the headers, etc. But it would then give us a visualization for each derivative dataset!
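A minimal sketch of the stopword filtering over a plain-text derivative (the stopword list here is a placeholder; a real one would be much longer):

STOPWORDS = %w[the and of to a in is it that for on with as by http https].freeze

# Count word frequencies in a full-text derivative, skipping stopwords,
# ready to hand off to whatever word cloud library we pick.
def word_frequencies(path, limit = 100)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    line.downcase.scan(/[a-z']+/).each do |word|
      counts[word] += 1 unless word.length < 3 || STOPWORDS.include?(word)
    end
  end
  counts.sort_by { |_, n| -n }.first(limit)
end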
Need to find a good way to connect AUK to Altiscale other than an SSH session behind screen.
Different users may want different derivatives from the same collection.
A permissions feature for users may be a good addition.
Move the WASAPI work to background jobs; it's not best to do most of it synchronously.
Need to set up a wasapi model/table for all the WASAPI fields.
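A first pass at that table (field names follow the Archive-It WASAPI files endpoint; the exact subset we store is still to be decided):

class CreateWasapiFiles < ActiveRecord::Migration[5.1]
  def change
    create_table :wasapi_files do |t|
      t.string   :filename
      t.string   :filetype
      t.string   :checksum_md5
      t.string   :checksum_sha1
      t.bigint   :size
      t.string   :locations
      t.integer  :collection_id
      t.integer  :crawl
      t.datetime :crawl_start
      t.references :user, foreign_key: true
      t.timestamps
    end
    add_index :wasapi_files, :filename
  end
end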
I, [2018-03-05T02:21:37.317173 #3672] INFO -- : [INFO] File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz
Don't need to specify [INFO] like I do here.
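Rails' logger already prefixes the severity, so the extra tag can simply go:

Rails.logger.info("File exists: #{file_path}")
# I, [2018-03-05T02:21:37.317173 #3672]  INFO -- : File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz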
Need a collection count size method to determine the total size of warcs/arcs in a collection.
This should be a simple Archive-It WASAPI endpoint query; .files.size.
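A sketch of that helper against the Archive-It WASAPI files endpoint (pagination via the response's next field is omitted; the same response also answers the WARC/ARC count question further down):

require 'net/http'
require 'json'

# Fetch the file list for a collection from the Archive-It WASAPI endpoint.
def collection_files(collection_id, username, password)
  uri = URI("https://warcs.archive-it.org/wasapi/v1/webdata?collection=#{collection_id}")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body)['files']
end

def collection_size(collection_id, username, password)
  collection_files(collection_id, username, password).sum { |f| f['size'] }
end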
Clean this up: 67a7f0f
We need to determine how much space we have available on the Altiscale workbench before we start downloading files to later move over into HDFS. This could be a helper method that is remotely executed.
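One possible shape for that remotely executed helper (host and path are placeholders): run df over SSH and parse the available bytes.

def remote_available_bytes(host, path)
  `ssh #{host} df -B1 --output=avail #{path} | tail -n 1`.strip.to_i
end

def enough_space?(host, path, needed_bytes)
  remote_available_bytes(host, path) > needed_bytes
end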
For the collection display page.
Ryan previously built an Archives Unleashed Toolkit (AUT) Wizard - https://github.com/archivesunleashed/aut-wizard. On a branch, experiment with putting this in and writing out custom Scala scripts to be run by Spark.
I think the way we add to wasapi_files allows for the possibility of duplicate entries. We might have to move from create! to find_or_create or something similar.
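Rails' actual idiom for this is find_or_create_by!; for example, keyed on the WASAPI filename:

# Assumes file is the WASAPI response hash for a single file.
WasapiFile.find_or_create_by!(filename: file['filename']) do |w|
  w.size         = file['size']
  w.checksum_md5 = file['checksums']['md5']
end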
We have a nice visualization tool at aut-viz/crawl-sites, which shows the distribution of domains over the life of a crawl.
We'd like it to be a widget on the default AUK dashboard. @lintool or one of his students will take this on. It needs to run process.py to create the d3 viz.
Each collection will have a set of standard scripts run when first ingested.
After the first collection info downloads, a user may want to refresh their available collections and files. Perhaps a “refresh” button?
If there is no account info yet, it will return the disk usage for the entire root of the download directory.
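A hypothetical helper matching that behaviour, shelling out to du:

# Per-account usage when we have account info, otherwise the whole download root.
def download_disk_usage(account_id = nil)
  root = ENV.fetch('DOWNLOAD_PATH')
  path = account_id ? File.join(root, account_id.to_s) : root
  `du -sb #{path}`.split.first.to_i
end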
Users will start with the basic all-domain frequency dump. Then we will have a series of filters that can be combined and then exported to CSV.
has_many collections (c79945f#diff-215dc9eac0021077dd1a293506c222c7) is a really ugly and inefficient solution. Come back around to this and make it better.
Once a collection is downloaded, run basic analysis on it.
...we'll need to have Spark with aut loaded up and running.
https://github.com/laserlemon/figaro
dotenv isn't meant for production. As we are starting to dip our toes in the water of using AUK in production, we should start setting things up better. Figaro will be of great use.
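With Figaro, the variables currently passed on the command line would live in a git-ignored config/application.yml, and Figaro loads them into ENV so existing ENV['...'] lookups keep working:

# config/application.yml
WASAPI_KEY: "somekey"
DOWNLOAD_PATH: "/some/path"
SPARK_SHELL: "/some/path/bin/spark-shell"
SPARK_MEMORY_DRIVER: "90G"
SPARK_NETWORK_TIMEOUT: "10000000"
AUT_VERSION: "0.12.2"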
The collection page should show basic information about the collection itself.
For collections that have not been downloaded, let's have placeholders that prompt the user to begin the download and initial analytics jobs.
For collections that have been downloaded, let's provide basic information:
A table of the top 10-15 domains, with opportunity to download more.
An embedded sigma.js diagram of the hyperlinks? This does not need to be part of the earliest release but would be nice.
Download the text, domains, and hyperlink information.
And then have options for further faceting: selecting only plain text relating to a given domain, for example (e.g. liberal.ca), or given date(s) (e.g. 200809 and 200810). This might be done using CSV parsers.
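A sketch of that faceting with Ruby's CSV library; the three-column row layout (crawl date, domain, text) is an assumption to adjust against the real derivative schema:

require 'csv'

# Write only the rows matching a domain and/or a list of YYYYMM date prefixes.
def facet(input, output, domain: nil, dates: [])
  CSV.open(output, 'w') do |out|
    CSV.foreach(input) do |crawl_date, row_domain, text|
      next if domain && row_domain != domain
      next if dates.any? && dates.none? { |d| crawl_date.start_with?(d) }
      out << [crawl_date, row_domain, text]
    end
  end
end

facet('7515-fulltext.csv', 'liberal-fall-2008.csv',
      domain: 'liberal.ca', dates: %w[200809 200810])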
#homesnothondas Tweets (Homes not Hondas) has a bunch of empty derivative files.
[nruest@gorila:7515]$ ls -lash -R .
.:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 11 nruest nruest 4.0K Feb 27 17:17 ..
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 1
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 warcs
./1:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 derivatives
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 spark_jobs
./1/derivatives:
total 20K
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 all-domains
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 all-text
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 gephi
./1/derivatives/all-domains:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
0 -rw-rw-r-- 1 nruest nruest 0 Mar 1 13:04 7515-fullurls.txt
./1/derivatives/all-text:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
0 -rw-rw-r-- 1 nruest nruest 0 Mar 1 13:04 7515-fulltext.txt
./1/derivatives/gephi:
total 12K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 462 Mar 1 13:04 7515-gephi.gexf
./1/spark_jobs:
total 16K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 1.2K Mar 1 13:04 7515.scala
4.0K -rw-rw-r-- 1 nruest nruest 537 Mar 1 13:04 7515.scala.log
./warcs:
total 164M
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
78M -rw-rw-r-- 1 nruest nruest 78M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB219383-20160615020258159-00000.warc.gz
31M -rw-rw-r-- 1 nruest nruest 31M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB220928-20160622020301503-00000.warc.gz
28M -rw-rw-r-- 1 nruest nruest 28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB222514-20160629020319891-00000.warc.gz
28M -rw-rw-r-- 1 nruest nruest 28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB223613-20160706020315320-00000.warc.gz
We should probably have some logic to check whether the file is empty. But this could also be an aut issue, or this is just a bad collection. @ianmilligan1 what do you think?
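One possible guard before a derivative is linked on the collection page:

derivative = '1/derivatives/all-text/7515-fulltext.txt'
if File.zero?(derivative)
  Rails.logger.warn("Empty derivative: #{derivative}")
else
  # expose the download link as usual
end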
Need a collection count helper method to determine the number of warcs/arcs in a collection.
This should be a simple Archive-It WASAPI endpoint query.
@SamFritz has a wonderful design sense – so will tackle a fun and original 404 page. 😄
This might need to be two separate methods. But for now, I'll create a single ticket.
We need to download a given file. Before we download, we check and see if we have enough space to do it. The Archive-It WASAPI endpoint should give us the size, checksums, and download locations to do this. Once it is downloaded, we can run a fixity check, and then move it over to the correct HDFS location. Once it is moved over to HDFS, we can delete the file from the scratch space.
We'll probably want to run this as a background job. We could store all this info in the database, and work our way through it.
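A rough shape for that background job (enough_space? and hdfs_path are hypothetical helpers, and locations is assumed here to hold a single URL):

require 'digest'
require 'open-uri'

class IngestFileJob < ApplicationJob
  # file is a wasapi_files record with filename, size, checksum_md5, locations.
  def perform(file)
    raise 'not enough scratch space' unless enough_space?(file.size)

    scratch = File.join(ENV['DOWNLOAD_PATH'], file.filename)
    IO.copy_stream(open(file.locations), scratch)

    unless Digest::MD5.file(scratch).hexdigest == file.checksum_md5
      raise 'fixity check failed'
    end

    system('hdfs', 'dfs', '-put', scratch, hdfs_path(file)) or raise 'hdfs put failed'
    File.delete(scratch)
  end
end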
Users will start with the basic full-text dump. Then we will have a series of filters that can be combined and then exported to CSV.
Salt and hash these as well.