

AUK: Archives Unleashed Cloud


[logo: King Auk]

Rails application for the Archives Unleashed Cloud.

Requirements

Installation

Run the test suite

Ensure Rails is not running (port 3000), then:

$ bundle exec rake

If you would like to make sure JavaScript files are linted:

$ npm install
$ bundle exec rake

Run a development server

$ rails s

Run the background job

In another command line tab, run the background job with:

bundle exec rake jobs:work

Or, to simulate a production environment with Delayed::Job:

bin/delayed_job --pool=spark,tasks:1 --pool=graphpass,tasks:1 --pool=seed,tasks:10 --pool=download,tasks:4 --pool=cleanup,tasks:2 --pool=textfilter,tasks:2 start

Then visit http://localhost:3000.

Delayed Job Dashboard

To take advantage of the Delayed Job Dashboard, set the DJW_USERNAME and DJW_PASSWORD in config/application.yml. Then visit http://localhost:3000/jobs.

Retry jobs

If you need to retry a stuck or failed job, you can use the retry! method with a job ID (e.g., 1234):

$ RAILS_ENV=production rails console
Running via Spring preloader in process 19680
Loading production environment (Rails 5.1.4)
irb(main):001:0> Delayed::Job.find(1234).retry!

Configuration

This application makes use of figaro.

You will need a config/application.yml file in the root of the application.

Dashboard

Set the DASHBOARD_USER and DASHBOARD_PASS in config/application.yml. Then visit http://localhost:3000/dashboards.
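For example, a minimal config/application.yml might look like the following; the values are placeholders, and the other environment variables mentioned in this document (such as WASAPI_KEY and SPARK_SHELL) would typically live here too:

# config/application.yml -- values below are placeholders
DASHBOARD_USER: "admin"
DASHBOARD_PASS: "changeme"
DJW_USERNAME: "admin"
DJW_PASSWORD: "changeme"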

Sitemap

To generate a sitemap:

bundle exec rake sitemap:refresh:no_ping

To generate a new sitemap and submit it to Google and Bing, set up a cronjob that runs the following:

bundle exec rake sitemap:refresh
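For example, a weekly crontab entry (the schedule and application path below are illustrative) could be:

0 4 * * 0 cd /path/to/auk && RAILS_ENV=production bundle exec rake sitemap:refresh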

Run a console

You can also run bin/console for an interactive prompt that will allow you to experiment.

Contributing

Please see contributing guidelines for details.

License

This application is available as open source under the terms of the Apache License, Version 2.0.

Acknowledgments

This work is primarily supported by the Andrew W. Mellon Foundation. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.


auk's Issues

PRODUCTION: delayed_job - environment variables are not being picked up

Environment variables are not being picked up where they are set, and only work if they are explicitly passed when starting up delayed_job.

RAILS_ENV=production WASAPI_KEY=somekey DOWNLOAD_PATH=/some/path SPARK_SHELL=/some/path/bin/spark-shell SPARK_MEMORY_DRIVER=90G SPARK_NETWORK_TIMEOUT=10000000 AUT_VERSION=0.12.2 bin/delayed_job start

This should be resolved with #44. If not, this will be a documentation ticket.

Create a download background job

...and create a column on the user's profile in the collection display that kicks off the job.

We should have a confirmation for this as well, since folks might be downloading a TB+

New homepage

The home page should be a bit more descriptive - and fun!

Ideas include:

  • a carousel of three or four images (e.g., a word cloud, network diagram, some extracted images, etc.)
  • should we have a separate "about" static page?

Port Aut-Viz to Auk

We have a nice visualization tool at aut-viz/crawl-sites, which shows the distribution of domains over the life of a crawl.

We'd like it to be a widget on the default AUK dashboard. @lintool or one of his students will take this on. It needs to run process.py to create the d3 viz.

Cannot parse nil as CSV error from display_domains helper

#homesnothondas Tweets (Homes not Hondas) has a bunch of empty derivative files.

[nruest@gorila:7515]$ ls -lash -R .
.:
total 16K
4.0K drwxrwxr-x  4 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 11 nruest nruest 4.0K Feb 27 17:17 ..
4.0K drwxrwxr-x  4 nruest nruest 4.0K Mar  1 13:04 1
4.0K drwxrwxr-x  2 nruest nruest 4.0K Feb 27 16:18 warcs

./1:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 derivatives
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 spark_jobs

./1/derivatives:
total 20K
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 all-domains
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 all-text
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 gephi

./1/derivatives/all-domains:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
   0 -rw-rw-r-- 1 nruest nruest    0 Mar  1 13:04 7515-fullurls.txt

./1/derivatives/all-text:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
   0 -rw-rw-r-- 1 nruest nruest    0 Mar  1 13:04 7515-fulltext.txt

./1/derivatives/gephi:
total 12K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest  462 Mar  1 13:04 7515-gephi.gexf

./1/spark_jobs:
total 16K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 1.2K Mar  1 13:04 7515.scala
4.0K -rw-rw-r-- 1 nruest nruest  537 Mar  1 13:04 7515.scala.log

./warcs:
total 164M
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
 78M -rw-rw-r-- 1 nruest nruest  78M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB219383-20160615020258159-00000.warc.gz
 31M -rw-rw-r-- 1 nruest nruest  31M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB220928-20160622020301503-00000.warc.gz
 28M -rw-rw-r-- 1 nruest nruest  28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB222514-20160629020319891-00000.warc.gz
 28M -rw-rw-r-- 1 nruest nruest  28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB223613-20160706020315320-00000.warc.gz

We should probably have some logic to check whether the file is empty. But this could also be an aut issue, or this may just be a bad collection. @ianmilligan1, what do you think?
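One possible guard is sketched below; the method signature and use of CSV here are illustrative assumptions, not the actual AUK code:

require 'csv'

# Hypothetical guard: skip parsing when the derivative file is missing or empty.
def display_domains(domains_path)
  return [] unless File.exist?(domains_path) && !File.zero?(domains_path)
  CSV.parse(File.read(domains_path))
end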

Confirmation after Download Job Begins

Right now, when you decide to download a collection, you get a nice pop-up window like so:

[screenshot: download confirmation pop-up]

After you press OK, the files begin to download in the back end. But in the front end, there's no real way to tell whether things are downloading (and @SamFritz and I worried that users might keep hitting the download button over and over again).

Could we do a second prompt, linked to #31:

Your collection has begun downloading. An e-mail will be sent to [E-MAIL ADDRESS FROM USER] once it is complete.

Refresh Archive-It account

After the first collection info downloads, a user may want to refresh their available collections and files. Perhaps a "refresh" button?

How to determine Collection Name

  • Get `collection_name` added to the Archive-It WASAPI endpoint
  • Cache our own database table of collection names and collection IDs, and do a lookup
  • Do ugly scraping with the Perl Archive-It endpoint.

Collection size

Need a collection size method to determine the total size of the warcs/arcs in a collection.

This should be a simple Archive-It WASAPI endpoint query: .files.size.
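A rough sketch of such a query follows; the endpoint URL and JSON shape follow the public Archive-It WASAPI documentation, but the helper itself is illustrative and ignores pagination:

require 'net/http'
require 'json'

# Hypothetical helper: sum the "size" field across a collection's files.
def collection_size(collection_id, username, password)
  uri = URI("https://partner.archive-it.org/wasapi/v1/webdata?collection=#{collection_id}")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body)['files'].sum { |f| f['size'] }
end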

Plain Text Filters

User will start with the basic full text dump. Then we will have a series of filters that can be combined to then lead to a CSV export.

Domain Filters

User will start with the basic all domain frequency dump. Then we will have a series of filters that can be combined to then lead to a CSV export.

Altiscale remote command - HDFS directory structure

We have previously used python-slugify to create our HDFS directory structure.

Our HDFS root for the project is /shared/au/.

Institution directories follow this pattern: slugified-institution-name-institution-number

Example: simon-fraser-university-library-727

Collection directories follow this pattern: slugified-collection-name-collection-number

Example: canadian-political-parties-and-political-interest-groups-227

So, a full path would look like: /shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227
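The slugs above come from python-slugify; a rough Ruby equivalent using ActiveSupport's parameterize (the helper name is illustrative) would be:

require 'active_support/core_ext/string/inflections'

HDFS_ROOT = '/shared/au'.freeze

# Hypothetical helper: build the HDFS path for a collection.
def hdfs_collection_path(institution_name, institution_id, collection_name, collection_id)
  institution_dir = "#{institution_name.parameterize}-#{institution_id}"
  collection_dir  = "#{collection_name.parameterize}-#{collection_id}"
  File.join(HDFS_ROOT, institution_dir, collection_dir)
end

hdfs_collection_path('University of Toronto Libraries', 75,
                     'Canadian Political Parties and Political Interest Groups', 227)
# => "/shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227"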

wasapi/archive-it model

Move the wasapi work to background jobs; it is not ideal to do most of it synchronously.

We need to set up a wasapi model/table for all of the wasapi fields.

Handle 404s better

Right now we just have the plain old Rails no route error.

This should help us get something better.

Maybe wave to Olga? 😃


Spark-shell command options/flags

Most of these spark-shell options/flags should probably be environment variables.

  • --driver-memory 5G
  • --conf spark.network.timeout=10000000
  • --packages "io.archivesunleashed:aut:0.12.1"
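One way to do that is sketched below; the builder method is illustrative, but the environment variable names match the ones used elsewhere in this document:

# Hypothetical command builder; the variables come from config/application.yml via figaro.
def spark_shell_command(script_path)
  [
    ENV.fetch('SPARK_SHELL', 'spark-shell'),
    "--driver-memory #{ENV.fetch('SPARK_MEMORY_DRIVER', '5G')}",
    "--conf spark.network.timeout=#{ENV.fetch('SPARK_NETWORK_TIMEOUT', '10000000')}",
    "--packages \"io.archivesunleashed:aut:#{ENV.fetch('AUT_VERSION', '0.12.1')}\"",
    "-i #{script_path}"
  ].join(' ')
end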

Possible duplication in wasapi_files

I think the way we add to wasapi_files allows for the possibility of duplicate entries. We might have to move from create! to find_or_create_by or something similar.
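A minimal sketch of that change, assuming a WasapiFile model keyed on the filename (the attribute names are assumptions):

# Instead of unconditionally creating a row, look it up first.
WasapiFile.find_or_create_by(filename: file['filename']) do |wasapi_file|
  wasapi_file.collection_id = file['collection']
  wasapi_file.size          = file['size']
end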

Collection Pages Display

The collection page should show basic information about the collection itself.

For collections that have not been downloaded, let's have placeholders that prompt the user to begin the download and initial analytics jobs.

For collections that have been downloaded, let's provide basic information:

Domains

A table of the top 10-15 domains, with opportunity to download more.

Hyperlink Diagram

An embedded sigma.js diagram of the hyperlinks? This does not need to be part of the earliest release but would be nice.

Download Options

Download the text, domains, hyperlink information.

Further Analysis

And then have options for further faceting: selecting only the plain text relating to a given domain (e.g., liberal.ca) or to given dates (e.g., 200809 and 200810). This might be done using CSV parsers.
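For example, a simple domain/date filter over a derivative CSV might look roughly like this (the column names are assumptions):

require 'csv'

# Hypothetical filter: keep only rows matching a domain and a set of crawl dates.
def filter_rows(csv_path, domain:, dates:)
  CSV.foreach(csv_path, headers: true).select do |row|
    row['domain'] == domain && dates.include?(row['crawl_date'])
  end
end

filter_rows('derivative.csv', domain: 'liberal.ca', dates: %w[200809 200810])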

User model

Basic User model

  • username/name¹
  • email address
  • institution
  • archive-it username
  • archive-it password
  • has_many collections

  1. Currently we are using OmniAuth, authenticating with Twitter and GitHub, and have not rolled our own accounts.

Standard Scripts

Each collection will have a set of standard scripts run when first ingested.

Collection count helper method

Need a collection count helper method to determine the number of warcs/arcs in a collection.

This should be a simple Archive-It WASAPI endpoint query.

Hyperlink Filters

User will start with the basic all-links dump. Then we will have a series of filters that can be combined to then lead to a CSV export.

Duplicate INFO in production logs

I, [2018-03-05T02:21:37.317173 #3672] INFO -- : [INFO] File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz

We don't need to specify [INFO] like this, since the logger already includes the severity.
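Since the Rails logger already prefixes the severity, the call can simply be (the download path variable is illustrative):

Rails.logger.info("File exists: #{download_path}")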

Can we reliably create a word cloud derivative?

Make a word cloud image of the plain text for each collection's page.

We'll need to make sure to have stopwords so it's not overwhelmed by the headers, etc. But it would then give us a visualization for each derivative dataset!

Altiscale remote command - download file and move to HDFS

This might need to be two separate methods. But for now, I'll create a single ticket.

We need to download a given file. Before we download, we check and see if we have enough space to do it. The Archive-It WASAPI endpoint should give us the size, checksums, and download locations to do this. Once it is downloaded, we can run a fixity check, and then move it over to the correct HDFS location. Once it is moved over to HDFS, we can delete the file from the scratch space.

We'll probably want to run this as a background job. We could store all this info in the database, and work our way through it.
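A very rough sketch of that flow follows; the helpers, paths, and checksum handling below are illustrative assumptions, not the actual implementation (it also skips the free-space check):

require 'digest'
require 'open-uri'

# Hypothetical flow: download to scratch space, verify fixity, move to HDFS, clean up.
def fetch_to_hdfs(url, expected_md5, scratch_dir, hdfs_dir)
  local_path = File.join(scratch_dir, File.basename(url))
  IO.copy_stream(URI.open(url), local_path)
  raise 'fixity check failed' unless Digest::MD5.file(local_path).hexdigest == expected_md5
  system('hdfs', 'dfs', '-put', local_path, hdfs_dir) or raise 'hdfs put failed'
  File.delete(local_path)
end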

Non Archive-It WASAPI Import

Right now AUK works really well with Archive-It collections. It would be great to investigate if we could get collections into AUK from other sources, ideally using WASAPI.

WebRecorder.io would be ideal as we could then support individual researchers by ingesting their personally-created WARCs and provide analysis.

Right now, however, to get collection data we rely on the Internet Archive's Collections API.

Add "Space Used" to user sidebar

Right now the user sidebar looks like:

[screenshot: current user sidebar]

There's room to add more: could we add a "space used" figure that totals the amount of data a given user has downloaded? We could then express it as a ratio of a soft quota given to each user.
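A sketch of how that total might be computed, assuming the wasapi_files table stores a byte size and belongs to the user (both are assumptions):

include ActionView::Helpers::NumberHelper

# Hypothetical sidebar helper: human-readable total of a user's downloaded data.
def space_used(user)
  number_to_human_size(user.wasapi_files.sum(:size))
end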
