archivesunleashed / auk
Rails application for the Archives Unleashed Cloud.
Home Page: https://cloud.archivesunleashed.org/
License: Other
It appears Spark is running multiple times when the CollectionsSparkJob is fired off.
Right now we just have the plain old Rails no route error.
This should help us get something better.
Maybe wave to Olga? 😃
Users will start with the basic all-links dump. Then we will have a series of filters that can be combined and then exported to CSV.
We have previously used python-slugify to create our HDFS directory structure. Our HDFS root for the project is /shared/au/.
Institution directories follow this pattern: slugified-institution-name-institution-number
Example: simon-fraser-university-library-727
Collection directories follow this pattern: slugified-collection-name-collection-number
Example: canadian-political-parties-and-political-interest-groups-227
So, a full path would look like: /shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227
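A minimal sketch of that convention in Ruby, using ActiveSupport's String#parameterize as a stand-in for python-slugify (the helper name is hypothetical):

require 'active_support/core_ext/string'

HDFS_ROOT = '/shared/au'.freeze

# Build /shared/au/<slugified-institution-number>/<slugified-collection-number>
def hdfs_collection_path(institution_name, institution_number, collection_name, collection_number)
  institution_dir = "#{institution_name.parameterize}-#{institution_number}"
  collection_dir  = "#{collection_name.parameterize}-#{collection_number}"
  File.join(HDFS_ROOT, institution_dir, collection_dir)
end

hdfs_collection_path('University of Toronto Libraries', 75,
                     'Canadian Political Parties and Political Interest Groups', 227)
# => "/shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227"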
belongs_to user
We'll probably need to change the default here if we're downloading massive collections.
Right now AUK works really well with Archive-It collections. It would be great to investigate if we could get collections into AUK from other sources, ideally using WASAPI.
WebRecorder.io would be ideal as we could then support individual researchers by ingesting their personally-created WARCs and provide analysis.
Right now, however, to get collection data we rely on the Internet Archive's Collections API.
Right now, when you decide to download a collection, you get a nice pop-up window like so:
After you press OK, in the back end the files begin to download. But in the front end, there's no real way to tell if things are downloading or if they are not (and @SamFritz and I worried that users might keep hitting the download button over and over again).
Could we do a second prompt, linked to #31:
Your collection has begun downloading. An e-mail will be sent to [E-MAIL ADDRESS FROM USER] once it is complete.
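One way to wire that up (a sketch only; the controller, job, and current_user helper names are assumptions, not the app's actual ones):

class CollectionsController < ApplicationController
  def download
    # Queue the download in the background so repeated clicks don't re-trigger it.
    DownloadCollectionJob.perform_later(current_user.id, params[:id])
    flash[:notice] = 'Your collection has begun downloading. ' \
                     "An e-mail will be sent to #{current_user.email} once it is complete."
    redirect_to collection_path(params[:id])
  end
end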
Environment variables are not being picked up where they are set, and only work if they are explicitly passed when starting up delayed_job.
RAILS_ENV=production WASAPI_KEY=somekey DOWNLOAD_PATH=/some/path SPARK_SHELL=/some/path/bin/spark-shell SPARK_MEMORY_DRIVER=90G SPARK_NETWORK_TIMEOUT=10000000 AUT_VERSION=0.12.2 bin/delayed_job start
This should be resolved with #44. If not, this will be a documentation ticket.
The home page should be a bit more descriptive - and fun!
Ideas include:
Most of these spark-shell options/flags should probably be environment variables.
--driver-memory 5G
--conf spark.network.timeout=10000000
--packages "io.archivesunleashed:aut:0.12.1"
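A sketch of pulling those flags from the same environment variables already passed to delayed_job above, with today's hard-coded values as fallbacks:

# Build the spark-shell invocation from ENV.
def spark_shell_command(script_path)
  [
    ENV.fetch('SPARK_SHELL', 'spark-shell'),
    '--driver-memory', ENV.fetch('SPARK_MEMORY_DRIVER', '5G'),
    '--conf', "spark.network.timeout=#{ENV.fetch('SPARK_NETWORK_TIMEOUT', '10000000')}",
    '--packages', "io.archivesunleashed:aut:#{ENV.fetch('AUT_VERSION', '0.12.1')}",
    '-i', script_path
  ].join(' ')
end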
Make a word cloud image of the plain text for each collection's page.
We’ll need to make sure to have stopwords so it's not overwhelmed by the headers, etc. But it would then give us a visualization for each derivative dataset!
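A minimal sketch of the stopword filtering over a plain-text derivative (the stopword list here is a placeholder; a real one would be much longer):

STOPWORDS = %w[the and of to a in is it that for on with as by http https].freeze

# Count word frequencies in a full-text derivative, skipping stopwords,
# ready to hand off to whatever word cloud library we pick.
def word_frequencies(path, limit = 100)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    line.downcase.scan(/[a-z']+/).each do |word|
      counts[word] += 1 unless word.length < 3 || STOPWORDS.include?(word)
    end
  end
  counts.sort_by { |_, n| -n }.first(limit)
end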
Need to find a good way to connect AUK to Altiscale other than an SSH session behind screen.
Different users may want different derivatives from the same collection.
A permissions feature for users may be a good addition.
Move the WASAPI work to background jobs; it's not best to do most of it synchronously.
Need to set up a wasapi model/table for all the WASAPI fields.
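A first pass at that table (field names follow the Archive-It WASAPI files endpoint; the exact subset we store is still to be decided):

class CreateWasapiFiles < ActiveRecord::Migration[5.1]
  def change
    create_table :wasapi_files do |t|
      t.string   :filename
      t.string   :filetype
      t.string   :checksum_md5
      t.string   :checksum_sha1
      t.bigint   :size
      t.string   :locations
      t.integer  :collection_id
      t.integer  :crawl
      t.datetime :crawl_start
      t.references :user, foreign_key: true
      t.timestamps
    end
    add_index :wasapi_files, :filename
  end
end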
I, [2018-03-05T02:21:37.317173 #3672] INFO -- : [INFO] File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz
Don't need to specify [INFO] like I do here.
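Rails' logger already prefixes the severity, so the extra tag can simply go:

Rails.logger.info("File exists: #{file_path}")
# I, [2018-03-05T02:21:37.317173 #3672]  INFO -- : File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz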
Need a collection count size method to determine the total size of warcs/arcs in a collection.
This should be a simple Archive-It WASAPI endpoint query; .files.size.
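A sketch of that helper against the Archive-It WASAPI files endpoint (pagination via the response's next field is omitted; the same response also answers the WARC/ARC count question further down):

require 'net/http'
require 'json'

# Fetch the file list for a collection from the Archive-It WASAPI endpoint.
def collection_files(collection_id, username, password)
  uri = URI("https://warcs.archive-it.org/wasapi/v1/webdata?collection=#{collection_id}")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body)['files']
end

def collection_size(collection_id, username, password)
  collection_files(collection_id, username, password).sum { |f| f['size'] }
end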
Clean this up: 67a7f0f
We need to determine how much space we have available on the Altiscale workbench before we start downloading files to later move over into HDFS. This could be a helper method that is remotely executed.
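One possible shape for that remotely executed helper (host and path are placeholders): run df over SSH and parse the available bytes.

def remote_available_bytes(host, path)
  `ssh #{host} df -B1 --output=avail #{path} | tail -n 1`.strip.to_i
end

def enough_space?(host, path, needed_bytes)
  remote_available_bytes(host, path) > needed_bytes
end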
For the collection display page.
Ryan previously built an Archives Unleashed Toolkit (AUT) Wizard - https://github.com/archivesunleashed/aut-wizard. On a branch, experiment with putting this in and writing out custom Scala scripts to be run by Spark.
I think the way we add to wasapi_files allows for the possibility of duplicate entries. We might have to move from create! to find_or_create or something similar.
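Rails' actual idiom for this is find_or_create_by!; for example, keyed on the WASAPI filename:

# Assumes file is the WASAPI response hash for a single file.
WasapiFile.find_or_create_by!(filename: file['filename']) do |w|
  w.size         = file['size']
  w.checksum_md5 = file['checksums']['md5']
end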
We have a nice visualization tool at aut-viz/crawl-sites, which shows the distribution of domains over the life of a crawl.
We'd like it to be a widget on the default AUK dashboard. @lintool or one of his students will take this on. It needs to run process.py to create the d3 viz.
Each collection will have a set of standard scripts run when first ingested.
After the first collection info downloads, a user may want to refresh their available collections and files. Perhaps a “refresh” button?
If there is no account info yet, it will return the disk usage for the entire root of the download directory.
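A hypothetical helper matching that behaviour, shelling out to du:

# Per-account usage when we have account info, otherwise the whole download root.
def download_disk_usage(account_id = nil)
  root = ENV.fetch('DOWNLOAD_PATH')
  path = account_id ? File.join(root, account_id.to_s) : root
  `du -sb #{path}`.split.first.to_i
end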
Users will start with the basic all-domain frequency dump. Then we will have a series of filters that can be combined and then exported to CSV.
has_many collections (c79945f#diff-215dc9eac0021077dd1a293506c222c7) is a really ugly and inefficient solution. Come back around to this and make it better.
Once a collection is downloaded, run basic analysis on it.
...we'll need to have Spark with aut loaded up and running.
https://github.com/laserlemon/figaro
dotenv isn't meant for production. As we are starting to dip our toes in the water of using AUK in production, we should start setting things up better. Figaro will be of great use.
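With Figaro, the variables currently passed on the command line would live in a git-ignored config/application.yml, and Figaro loads them into ENV so existing ENV['...'] lookups keep working:

# config/application.yml
WASAPI_KEY: "somekey"
DOWNLOAD_PATH: "/some/path"
SPARK_SHELL: "/some/path/bin/spark-shell"
SPARK_MEMORY_DRIVER: "90G"
SPARK_NETWORK_TIMEOUT: "10000000"
AUT_VERSION: "0.12.2"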
The collection page should show basic information about the collection itself.
For collections that have not been downloaded, let's have placeholders that prompt the user to begin the download and initial analytics jobs.
For collections that have been downloaded, let's provide basic information:
A table of the top 10-15 domains, with opportunity to download more.
An embedded sigma.js diagram of the hyperlinks? This does not need to be part of the earliest release but would be nice.
Download the text, domains, and hyperlink information.
And then have options for further faceting: selecting only plain text relating to a given domain, for example (e.g. liberal.ca), or given date(s) (e.g. 200809 and 200810). This might be done using CSV parsers.
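A sketch of that faceting with Ruby's CSV library; the three-column row layout (crawl date, domain, text) is an assumption to adjust against the real derivative schema:

require 'csv'

# Write only the rows matching a domain and/or a list of YYYYMM date prefixes.
def facet(input, output, domain: nil, dates: [])
  CSV.open(output, 'w') do |out|
    CSV.foreach(input) do |crawl_date, row_domain, text|
      next if domain && row_domain != domain
      next if dates.any? && dates.none? { |d| crawl_date.start_with?(d) }
      out << [crawl_date, row_domain, text]
    end
  end
end

facet('7515-fulltext.csv', 'liberal-fall-2008.csv',
      domain: 'liberal.ca', dates: %w[200809 200810])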
#homesnothondas Tweets (Homes not Hondas) has a bunch of empty derivative files.
[nruest@gorila:7515]$ ls -lash -R .
.:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 11 nruest nruest 4.0K Feb 27 17:17 ..
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 1
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 warcs
./1:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 derivatives
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 spark_jobs
./1/derivatives:
total 20K
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 all-domains
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 all-text
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 gephi
./1/derivatives/all-domains:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
0 -rw-rw-r-- 1 nruest nruest 0 Mar 1 13:04 7515-fullurls.txt
./1/derivatives/all-text:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
0 -rw-rw-r-- 1 nruest nruest 0 Mar 1 13:04 7515-fulltext.txt
./1/derivatives/gephi:
total 12K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar 1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 462 Mar 1 13:04 7515-gephi.gexf
./1/spark_jobs:
total 16K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar 1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 1.2K Mar 1 13:04 7515.scala
4.0K -rw-rw-r-- 1 nruest nruest 537 Mar 1 13:04 7515.scala.log
./warcs:
total 164M
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar 1 13:04 ..
78M -rw-rw-r-- 1 nruest nruest 78M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB219383-20160615020258159-00000.warc.gz
31M -rw-rw-r-- 1 nruest nruest 31M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB220928-20160622020301503-00000.warc.gz
28M -rw-rw-r-- 1 nruest nruest 28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB222514-20160629020319891-00000.warc.gz
28M -rw-rw-r-- 1 nruest nruest 28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB223613-20160706020315320-00000.warc.gz
We should probably have some logic to check whether the file is empty. But this could also be an aut issue, or this is just a bad collection. @ianmilligan1 what do you think?
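One possible guard before a derivative is linked on the collection page:

derivative = '1/derivatives/all-text/7515-fulltext.txt'
if File.zero?(derivative)
  Rails.logger.warn("Empty derivative: #{derivative}")
else
  # expose the download link as usual
end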
Need a collection count helper method to determine the number of warcs/arcs in a collection.
This should be a simple Archive-It WASAPI endpoint query.
@SamFritz has a wonderful design sense – so will tackle a fun and original 404 page. 😄
This might need to be two separate methods. But for now, I'll create a single ticket.
We need to download a given file. Before we download, we check and see if we have enough space to do it. The Archive-It WASAPI endpoint should give us the size, checksums, and download locations to do this. Once it is downloaded, we can run a fixity check, and then move it over to the correct HDFS location. Once it is moved over to HDFS, we can delete the file from the scratch space.
We'll probably want to run this as a background job. We could store all this info in the database, and work our way through it.
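A rough shape for that background job (enough_space? and hdfs_path are hypothetical helpers, and locations is assumed here to hold a single URL):

require 'digest'
require 'open-uri'

class IngestFileJob < ApplicationJob
  # file is a wasapi_files record with filename, size, checksum_md5, locations.
  def perform(file)
    raise 'not enough scratch space' unless enough_space?(file.size)

    scratch = File.join(ENV['DOWNLOAD_PATH'], file.filename)
    IO.copy_stream(open(file.locations), scratch)

    unless Digest::MD5.file(scratch).hexdigest == file.checksum_md5
      raise 'fixity check failed'
    end

    system('hdfs', 'dfs', '-put', scratch, hdfs_path(file)) or raise 'hdfs put failed'
    File.delete(scratch)
  end
end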
Users will start with the basic full-text dump. Then we will have a series of filters that can be combined and then exported to CSV.
Salt and hash these as well.