soch-download-cli's Introduction

SOCH Download CLI

SOCH Download CLI lets you do multithreaded batch downloads of Swedish Open Cultural Heritage (K-samsök) records for offline processing and analytics.

Prerequisites

  • Python >=3.4 and pip

Installing

pip install soch-download

Usage Examples

Heads up: This program might use all of the system's available CPUs.

Download records based on a SOCH search query (text, CQL, indexes, etc.):

soch-download --action=query --query=thumbnailExists=j --outdir=path/to/target/directory

Download records from a specific institution:

soch-download --action=institution --institution=raa --outdir=path/to/target/directory

Download records using a predefined action/query:

soch-download --action=all --outdir=path/to/target/directory
soch-download --action=geodata-exists --outdir=path/to/target/directory

Unpacking

By default, the download actions save large XML files containing up to 1000 RDF records each. After such a download, you can use the --unpack argument to split those files into individual RDF files:

soch-download --unpack=path/to/xml/files --outdir=path/to/target/directory
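To make the unpacking step concrete, here is a minimal Python sketch of the idea. It is not the tool's actual implementation: the function name unpack_file, the output file naming, and the assumption that each record in the downloaded XML is wrapped in an rdf:RDF element are all illustrative.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Clark-notation tag for rdf:RDF (the standard RDF syntax namespace).
RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def unpack_file(xml_path, outdir):
    """Write every rdf:RDF element found in xml_path to its own file.

    Hypothetical helper: assumes each record in the downloaded XML
    is wrapped in an rdf:RDF element.
    """
    xml_path = Path(xml_path)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    written = []
    tree = ET.parse(str(xml_path))
    for index, rdf in enumerate(tree.iter(RDF_TAG)):
        # Illustrative naming scheme: <source file stem>-<position>.rdf
        target = outdir / "{}-{}.rdf".format(xml_path.stem, index)
        ET.ElementTree(rdf).write(
            str(target), encoding="utf-8", xml_declaration=True
        )
        written.append(target)
    return written
```

Calling unpack_file once per downloaded XML file would produce one .rdf file per record in the target directory.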

Misc

List all available parameters and actions:

soch-download --help

Target a custom SOCH API endpoint:

soch-download --action=query --query=itemKeyWord=hus --outdir=path/to/target/directory --endpoint=http://lx-ra-ksam2.raa.se:8080/

soch-download-cli's People

Contributors

abbe98, carwash

soch-download-cli's Issues

Error: Invalid value for '--outdir': './my path' is not a valid boolean.

The error is probably related to my Python environment (I am not an experienced Python user/developer) but I am now out of ideas.

macOS 12.3, zsh, Python 3.10.2, pyenv 2.2.4, pip 21.2.4

Output:

anders@macmini / % pip3 install soch-download
[...]
anders@macmini ~ % soch-download --action==query --query='serviceName=bbrp' --outdir=./bbr
Usage: soch-download [OPTIONS]
Try 'soch-download --help' for help.

Error: Invalid value for '--outdir': './bbr' is not a valid boolean.
anders@macmini ~ % echo $PATH                                                                                     
/Users/anders/.pyenv/shims:/Library/Frameworks/Python.framework/Versions/3.10/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
anders@macmini ~ % python3 --version
Python 3.10.2

Input is not being URL-encoded and therefore queries fail

...--action=institution --institution... can't handle institutions with reserved characters, such as spaces, irrespective of whether they are escaped or not.

Running pipenv run python soch-download.py --action=institution --institution='The Unstraight Museum' --key=test results in an error. As does the escaped version, pipenv run python soch-download.py --action=institution --institution='The%20Unstraight%20Museum' --key=test.

As a workaround, one can compose the properly-escaped query manually: pipenv run python soch-download.py --action=query --query='serviceOrganization="The%20Unstraight%20Museum"' --key=test but that rather defeats the purpose of having a convenience flag such as --institution in the first place.
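As a rough illustration of what the --institution flag could do internally, the sketch below percent-encodes the institution name before building the query. The helper name build_institution_query is hypothetical. The serviceOrganization field is taken from the workaround above, and the sketch assumes the resulting string is sent as-is, without being encoded a second time.

```python
from urllib.parse import quote


def build_institution_query(institution):
    """Return a query string with the institution name percent-encoded.

    Hypothetical helper, not part of soch-download.
    """
    # safe="" makes quote() encode every reserved character, including "/".
    encoded = quote(institution, safe="")
    return 'serviceOrganization="{}"'.format(encoded)


print(build_institution_query("The Unstraight Museum"))
# prints: serviceOrganization="The%20Unstraight%20Museum"
```

This reproduces the manually composed query from the workaround, so the user would not need to escape anything by hand.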

Threading fails

Running a query on macOS, installed via pip, starts okay:

Validating arguments...
Fetching query data and calculating requirements...
Found 7398 results, they would be split over 15 requests/files
This program might use all the systems available CPUs(8)!
Would you like to proceed with the download? y/n

But then results in the following error when answering 'y':

Traceback (most recent call last):
  File "/Users/marcus/.pyenv/versions/3.6.7/bin/soch-download", line 186, in <module>
    start()
  File "/Users/marcus/.pyenv/versions/3.6.7/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/marcus/.pyenv/versions/3.6.7/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/marcus/.pyenv/versions/3.6.7/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/marcus/.pyenv/versions/3.6.7/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/marcus/.pyenv/versions/3.6.7/bin/soch-download", line 179, in start
    confirm(query, outdir)
  File "/Users/marcus/.pyenv/versions/3.6.7/bin/soch-download", line 91, in confirm
    return pre_fetch(query, required_n_requests, outdir)
  File "/Users/marcus/.pyenv/versions/3.6.7/bin/soch-download", line 62, in pre_fetch
    fetch(build_query(query, 500, start_record), start_record, outdir)
TypeError: do_task() takes 0 positional arguments but 3 were given

Tested with Python 3.6.7 and 3.7.1.

Install requires *exactly* Python 3.6

If you try to install with a Python version < 3.6, the installer quite rightly refuses.
But if you try to install with a Python version > 3.6, the installer also refuses. This doesn't seem right. Is the downloader really incompatible with Python 3.7?
Only with a Python version of 3.6.x does the install proceed.

Error handling in threads

Any error or exception raised outside of the main thread is ignored and is not visible to the end user.

If a request fails due to a connection exception, a timeout, or a bad response, it should be retried.
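One possible shape for this, sketched with the standard library only: each task is retried a few times, and any error that survives the retries is collected in the main thread instead of being lost. The names with_retries and run_all, the retry count, and the delay are assumptions for illustration. How soch-download actually schedules its workers is not shown in this issue.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(task, attempts=3, delay=0.5):
    """Call task() up to `attempts` times and re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            # A real implementation would catch narrower exception types.
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


def run_all(tasks, max_workers=4, attempts=3, delay=0.5):
    """Run tasks in worker threads; return (results, errors)."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(with_retries, task, attempts, delay) for task in tasks
        ]
        for future in as_completed(futures):
            try:
                # result() re-raises any exception from the worker thread.
                results.append(future.result())
            except Exception as exc:
                errors.append(exc)
    return results, errors
```

The caller can then print or log the contents of errors, so that failed requests are visible to the end user.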
