auk's People

Contributors

dependabot[bot], depfu[bot], greebie, ianmilligan1, imgbot[bot], ruebot, samfritz

auk's Issues

How to determine Collection Name

  • Get `collection_name` added to the Archive-It WASAPI endpoint.
  • Cache our own database table of collection names and collection IDs, and do a lookup.
  • Do ugly scraping with the Perl Archive-It endpoint.

Handle 404s better

Right now we just have the plain old Rails no route error.

This should help us get something better.

Maybe wave to Olga? 😃

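A minimal sketch of one way to do this in Rails, assuming a catch-all route and a custom errors controller (the route, controller, and view names here are hypothetical):

# config/routes.rb -- catch-all route, kept at the very bottom of the file
match '*unmatched', to: 'errors#not_found', via: :all

# app/controllers/errors_controller.rb
class ErrorsController < ApplicationController
  def not_found
    # Render a friendly 404 page instead of the default Rails routing error.
    render 'errors/not_found', status: :not_found
  end
end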

Add "Space Used" to user sidebar

Right now the user sidebar looks like:

[Screenshot of the user sidebar, 2018-02-07]

There's room to add more. Could we add a "Space Used" entry that totals the amount of data a given user has downloaded? We could then show it as a ratio of a soft quota assigned to each user.
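
A rough sketch of the kind of helper this could use, assuming we track bytes downloaded per collection (the warc_size column and the quota value are hypothetical):

# app/helpers/users_helper.rb (sketch)
module UsersHelper
  SOFT_QUOTA_BYTES = 1_000_000_000_000 # hypothetical 1 TB soft quota per user

  # Total bytes a given user has downloaded, summed across their collections.
  def space_used(user)
    user.collections.sum(:warc_size)
  end

  # "123 GB of 1 TB"-style string for the sidebar.
  def space_used_display(user)
    "#{number_to_human_size(space_used(user))} of #{number_to_human_size(SOFT_QUOTA_BYTES)}"
  end
end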

Hyperlink Filters

Users will start with the basic all-links dump. Then we will have a series of filters that can be combined, leading to a CSV export.

Altiscale remote command - HDFS directory structure

We have previously used python-slugify to create our HDFS directory structure.

Our HDFS root for the project is /shared/au/.

Institution directories follow this pattern: slugified-institution-name-institution-number

Example: simon-fraser-university-library-727

Collection directories follow this pattern: slugified-collection-name-collection-number

Example: canadian-political-parties-and-political-interest-groups-227

So, a full path would look like: /shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227
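
A sketch of building these paths inside the Rails app, using ActiveSupport's String#parameterize as a stand-in for python-slugify (the method name is hypothetical):

HDFS_ROOT = '/shared/au'.freeze

# hdfs_collection_path('University of Toronto Libraries', 75,
#                      'Canadian Political Parties and Political Interest Groups', 227)
# => "/shared/au/university-of-toronto-libraries-75/canadian-political-parties-and-political-interest-groups-227"
def hdfs_collection_path(institution_name, institution_id, collection_name, collection_id)
  institution_dir = "#{institution_name.parameterize}-#{institution_id}"
  collection_dir  = "#{collection_name.parameterize}-#{collection_id}"
  File.join(HDFS_ROOT, institution_dir, collection_dir)
end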

Non Archive-It WASAPI Import

Right now AUK works really well with Archive-It collections. It would be great to investigate whether we could get collections into AUK from other sources, ideally using WASAPI.

WebRecorder.io would be ideal as we could then support individual researchers by ingesting their personally-created WARCs and provide analysis.

Right now, however, to get collection data we rely on the Internet Archive's Collections API.

Confirmation after Download Job Begins

Right now, when you decide to download a collection, you get a nice pop-up window like so:

[Screenshot of the download confirmation pop-up, 2018-02-22]

After you press OK, in the back end the files begin to download. But in the front end, there's no real way to tell whether things are downloading or not (and @SamFritz and I worried that users might keep hitting the download button over and over again).

Could we do a second prompt, linked to #31:

Your collection has begun downloading. An e-mail will be sent to [E-MAIL ADDRESS FROM USER] once it is complete.

PRODUCTION: delayed_job - environment variables are not being picked up

Environment variables are not being picked up where they are set, and only work if they are explicitly passed when starting up delayed_job.

RAILS_ENV=production WASAPI_KEY=somekey DOWNLOAD_PATH=/some/path SPARK_SHELL=/some/path/bin/spark-shell SPARK_MEMORY_DRIVER=90G SPARK_NETWORK_TIMEOUT=10000000 AUT_VERSION=0.12.2 bin/delayed_job start

This should be resolved with #44. If not, this will be a documentation ticket.

New homepage

The home page should be a bit more descriptive - and fun!

Ideas include:

  • a carousel of three or four images (e.g. a word cloud, a network diagram, some extracted images, etc.)
  • should we have a separate "about" static page?

Spark-shell command options/flags

Most of these spark-shell options/flags should probably be environment variables (see the sketch after this list).

  • --driver-memory 5G
  • --conf spark.network.timeout=10000000
  • --packages "io.archivesunleashed:aut:0.12.1"
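
A sketch of composing the invocation from environment variables, following the variable names used in the delayed_job command above (the method name and the fallback values are hypothetical):

# Build the spark-shell command line from environment variables rather
# than hard-coding the flags; falls back to the current defaults.
def spark_shell_command(script_path)
  spark_shell = ENV.fetch('SPARK_SHELL', 'spark-shell')
  driver_mem  = ENV.fetch('SPARK_MEMORY_DRIVER', '5G')
  net_timeout = ENV.fetch('SPARK_NETWORK_TIMEOUT', '10000000')
  aut_version = ENV.fetch('AUT_VERSION', '0.12.1')

  "#{spark_shell} --driver-memory #{driver_mem} " \
    "--conf spark.network.timeout=#{net_timeout} " \
    "--packages \"io.archivesunleashed:aut:#{aut_version}\" " \
    "-i #{script_path}"
end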

Can we reliably create a wordcloud derivative?

Make a word cloud image of the plain text for each collection's page.

We'll need to make sure to have stopwords so it's not overwhelmed by the headers, etc. But it would then give us a visualization for each derivative dataset!
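
A minimal sketch of the word-frequency step that would feed a word cloud, reading the existing all-text derivative and applying a stopword list (the stopword list here is a tiny illustrative sample, not a real one):

STOPWORDS = %w[the and for that with this from have are was were not you all http].freeze

# Count word frequencies in a plain-text derivative, skipping stopwords and
# very short tokens so headers and boilerplate don't dominate the cloud.
def word_frequencies(fulltext_path, limit: 100)
  counts = Hash.new(0)
  File.foreach(fulltext_path) do |line|
    line.downcase.scan(/[a-z]{3,}/).each do |word|
      counts[word] += 1 unless STOPWORDS.include?(word)
    end
  end
  counts.sort_by { |_word, count| -count }.first(limit)
end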

wasapi/archive-it model

Move the WASAPI work to background jobs; it's not best to do most of it synchronously.

Need to set up a wasapi model/table for all the WASAPI fields.
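
A rough sketch of the migration, with column names guessed from the fields the Archive-It WASAPI files endpoint returns (confirm the exact fields against the API before using):

# db/migrate/xxxxxxxxxxxxxx_create_wasapi_files.rb (sketch)
class CreateWasapiFiles < ActiveRecord::Migration[5.1]
  def change
    create_table :wasapi_files do |t|
      t.string  :filename
      t.string  :filetype        # warc or arc
      t.bigint  :size            # bytes, as reported by WASAPI
      t.string  :checksum_md5
      t.string  :checksum_sha1
      t.integer :collection_id   # Archive-It collection number
      t.integer :crawl_id
      t.string  :crawl_start
      t.string  :locations       # download URL(s)
      t.references :user, foreign_key: true
      t.timestamps
    end
    add_index :wasapi_files, :filename, unique: true
  end
end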

Duplicate INFO in production logs

I, [2018-03-05T02:21:37.317173 #3672] INFO -- : [INFO] File exists: /data/75/231/warcs/231-20051024234801-00008-crawling018.arc.gz

Don't need to specify [INFO] like I do here.
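
The fix is probably just dropping the extra tag from the logger call, e.g. (file_path stands in for whatever variable holds the path):

# Before: the logger already prefixes the severity, so [INFO] is redundant.
Rails.logger.info "[INFO] File exists: #{file_path}"

# After:
Rails.logger.info "File exists: #{file_path}"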

Collection size

Need a collection size method to determine the total size of WARCs/ARCs in a collection.

This should be a simple Archive-It WASAPI endpoint query: `.files.size`.
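
A sketch of the method, assuming the WASAPI files response has already been parsed into an array of hashes with a 'size' key (the parsing step and field name are assumptions):

# Total size in bytes of all WARCs/ARCs in a collection.
def collection_size(wasapi_files)
  wasapi_files.sum { |f| f['size'].to_i }
end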

Create a download background job

...and create a column on the user's profile in the collection display that kicks off the job.
[Screen recording of the collection display, 2018-02-07]

We should have a confirmation for this as well, since folks might be downloading a TB+ of data.
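
A sketch of how the background job could look (ActiveJob on top of delayed_job; the class name and the download/downloaded? methods are hypothetical):

# app/jobs/collection_download_job.rb (sketch)
class CollectionDownloadJob < ApplicationJob
  queue_as :download

  # Download every WARC/ARC in a collection in the background, so the
  # request that kicked the job off returns immediately.
  def perform(user_id, collection_id)
    user = User.find(user_id)
    user.wasapi_files.where(collection_id: collection_id).find_each do |file|
      file.download unless file.downloaded? # hypothetical methods
    end
  end
end

# Kicked off from the controller action behind the confirmation prompt:
# CollectionDownloadJob.perform_later(current_user.id, params[:collection_id])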

Possible duplication in wasapi_files

I think the way we add to wasapi_files allows for the possibility to duplicate entries. We might have to move from create! to find_or_create or something similar.
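
A sketch of the change, keyed on the filename (assuming the filename uniquely identifies a WASAPI file; the hash keys follow the WASAPI response):

# Before: always inserts, so re-running an import duplicates rows.
# WasapiFile.create!(filename: file['filename'], size: file['size'], ...)

# After: only inserts when a row with that filename doesn't already exist.
WasapiFile.find_or_create_by(filename: file['filename']) do |wasapi_file|
  wasapi_file.size          = file['size']
  wasapi_file.checksum_md5  = file['checksums']['md5']
  wasapi_file.collection_id = file['collection']
end

A unique database index on filename would also guard against races between concurrent workers.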

Port Aut-Viz to Auk

We have a nice visualization tool at aut-viz/crawl-sites, which shows the distribution of domains over the life of a crawl.

We'd like it to be a widget on the default AUK dashboard. @lintool or one of his students will take this on. It needs to run process.py to create the d3 viz.

Standard Scripts

Each collection will have a set of standard scripts run when first ingested.

Refresh Archive-It account

After the first collection info downloads, a user may want to refresh their available collections and files. Perhaps a "refresh" button?

Domain Filters

Users will start with the basic all-domain frequency dump. Then we will have a series of filters that can be combined, leading to a CSV export.

User model

Basic User model (a sketch follows the list below)

  • username/name¹
  • email address
  • institution
  • archive-it username
  • archive-it password
  • has_many collections

  1. Currently we are using OmniAuth, authenticating with Twitter and GitHub, and have not rolled our own account system.
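
A sketch of the model and its fields (column names are guesses; if we keep OmniAuth for sign-in, the Archive-It credentials become per-user settings rather than login credentials):

# app/models/user.rb (sketch)
class User < ApplicationRecord
  has_many :collections

  validates :username, presence: true
  validates :email, presence: true, uniqueness: true

  # Columns (sketch): username, email, institution,
  #                   archiveit_username, archiveit_password (encrypted)
end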

Collection Pages Display

The collection page should show basic information about the collection itself.

For collections that have not been downloaded, let's have placeholders that prompt the user to begin the download and initial analytics jobs.

For collections that have been downloaded, let's provide basic information:

Domains

A table of the top 10-15 domains, with the opportunity to download more.

Hyperlink Diagram

An embedded sigma.js diagram of the hyperlinks? This does not need to be part of the earliest release but would be nice.

Download Options

Download the text, domains, and hyperlink information.

Further Analysis

And then have options for further faceting: selecting only the plain text relating to a given domain (e.g. liberal.ca) or to a given date or dates (e.g. 200809 and 200810). This might be done using CSV parsers.
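
A sketch of that faceting step with Ruby's CSV library, assuming the derivative is exported as a CSV of (crawl_date, domain, url, text) rows (the column layout is an assumption about the derivative format):

require 'csv'

# Filter a full-text derivative CSV by domain and/or crawl-date prefix
# (YYYYMM), writing the matching rows to a new CSV for download.
def filter_fulltext(input_path, output_path, domain: nil, dates: [])
  CSV.open(output_path, 'w') do |out|
    CSV.foreach(input_path) do |crawl_date, row_domain, url, text|
      next if domain && row_domain != domain
      next if dates.any? && dates.none? { |d| crawl_date.to_s.start_with?(d) }
      out << [crawl_date, row_domain, url, text]
    end
  end
end

# e.g. filter_fulltext('7515-fulltext.csv', 'liberal-2008.csv',
#                      domain: 'liberal.ca', dates: %w[200809 200810])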

Cannot parse nil as CSV error from display_domains helper

#homesnothondas Tweets (Homes not Hondas) has a bunch of empty derivative files.

[nruest@gorila:7515]$ ls -lash -R .
.:
total 16K
4.0K drwxrwxr-x  4 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 11 nruest nruest 4.0K Feb 27 17:17 ..
4.0K drwxrwxr-x  4 nruest nruest 4.0K Mar  1 13:04 1
4.0K drwxrwxr-x  2 nruest nruest 4.0K Feb 27 16:18 warcs

./1:
total 16K
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 derivatives
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 spark_jobs

./1/derivatives:
total 20K
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 all-domains
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 all-text
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 gephi

./1/derivatives/all-domains:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
   0 -rw-rw-r-- 1 nruest nruest    0 Mar  1 13:04 7515-fullurls.txt

./1/derivatives/all-text:
total 8.0K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
   0 -rw-rw-r-- 1 nruest nruest    0 Mar  1 13:04 7515-fulltext.txt

./1/derivatives/gephi:
total 12K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 5 nruest nruest 4.0K Mar  1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest  462 Mar  1 13:04 7515-gephi.gexf

./1/spark_jobs:
total 16K
4.0K drwxrwxr-x 2 nruest nruest 4.0K Mar  1 13:04 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
4.0K -rw-rw-r-- 1 nruest nruest 1.2K Mar  1 13:04 7515.scala
4.0K -rw-rw-r-- 1 nruest nruest  537 Mar  1 13:04 7515.scala.log

./warcs:
total 164M
4.0K drwxrwxr-x 2 nruest nruest 4.0K Feb 27 16:18 .
4.0K drwxrwxr-x 4 nruest nruest 4.0K Mar  1 13:04 ..
 78M -rw-rw-r-- 1 nruest nruest  78M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB219383-20160615020258159-00000.warc.gz
 31M -rw-rw-r-- 1 nruest nruest  31M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB220928-20160622020301503-00000.warc.gz
 28M -rw-rw-r-- 1 nruest nruest  28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB222514-20160629020319891-00000.warc.gz
 28M -rw-rw-r-- 1 nruest nruest  28M Feb 27 16:18 ARCHIVEIT-7515-WEEKLY-JOB223613-20160706020315320-00000.warc.gz

We should probably have some logic around whether the file is empty. But this could also be an aut issue, or this might just be a bad collection. @ianmilligan1 what do you think?
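
A sketch of the guard, assuming the helper currently reads the file and hands it straight to the CSV parser (everything beyond the helper name is illustrative):

require 'csv'

# app/helpers/collections_helper.rb (sketch)
def display_domains(domains_path)
  # Guard against missing or empty derivative files (as in collection 7515
  # above) before handing the contents to the CSV parser.
  return [] unless File.exist?(domains_path) && !File.zero?(domains_path)

  CSV.parse(File.read(domains_path))
end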

Collection count helper method

Need a collection count helper method to determine the number of WARCs/ARCs in a collection.

This should be a simple Archive-It WASAPI endpoint query.

Altiscale remote command - download file and move to HDFS

This might need to be two separate methods. But for now, I'll create a single ticket.

We need to download a given file. Before we download, we check and see if we have enough space to do it. The Archive-It WASAPI endpoint should give us the size, checksums, and download locations to do this. Once it is downloaded, we can run a fixity check, and then move it over to the correct HDFS location. Once it is moved over to HDFS, we can delete the file from the scratch space.

We'll probably want to run this as a background job. We could store all this info in the database, and work our way through it.
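
A sketch of that sequence (the free-space check is omitted; the scratch path, env var names, and WASAPI auth details are assumptions):

require 'digest'
require 'open-uri'

SCRATCH = '/scratch/auk'.freeze

# Download a WARC/ARC to scratch space, verify its checksum against the
# value reported by WASAPI, push it to HDFS, then clean up the scratch copy.
def fetch_and_move_to_hdfs(download_url, expected_md5, hdfs_dir)
  local_path = File.join(SCRATCH, File.basename(download_url))

  # 1. Download to scratch space (Ruby 2.5+; WASAPI downloads need basic auth).
  URI.open(download_url,
           http_basic_authentication: [ENV['WASAPI_USER'], ENV['WASAPI_PASS']]) do |remote|
    IO.copy_stream(remote, local_path)
  end

  # 2. Fixity check against the checksum reported by WASAPI.
  unless Digest::MD5.file(local_path).hexdigest == expected_md5
    raise "Checksum mismatch for #{local_path}"
  end

  # 3. Move into the correct HDFS location, then delete the scratch copy.
  unless system('hdfs', 'dfs', '-put', local_path, hdfs_dir)
    raise "hdfs dfs -put failed for #{local_path}"
  end
  File.delete(local_path)
end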

Plain Text Filters

Users will start with the basic full-text dump. Then we will have a series of filters that can be combined, leading to a CSV export.
