
data-capture-module

Data Capture Module to receive uploaded datasets and validate client-side checksums.

In more general terms, this is an external module designed to allow users to upload large datasets to a repository (designed for Dataverse) without going over HTTP.

The presentation slides from the 2017 Dataverse Community Meeting may provide some additional information. The design is intended to be agnostic to transfer protocol, and currently implements rsync over SSH.

DCM installation

See the installation instructions for installing the DCM, and the Dataverse Guides for configuring the two systems together.

general organization

  • api/ : external interface that repository software will call
  • gen/ : transfer script generation for rsync+ssh uploads
  • scn/ : scanning for completed uploads, and handling related tasks
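In practice, repository software drives the api/ layer over HTTP: it requests an upload script for a dataset, then polls for status once the transfer completes. A minimal client sketch — the endpoint script names (ur.py, sr.py), field names, and base URL are illustrative assumptions, not a documented interface:

```python
# Hypothetical DCM client sketch. The endpoint paths ("/ur.py", "/sr.py"),
# the "userId" field, and the base URL are assumptions for illustration.
import json

DCM_BASE = "http://localhost"  # assumed DCM address


def upload_request_payload(dataset_identifier, user_id):
    """Build the JSON body a repository might send to request an upload script."""
    return {"datasetIdentifier": dataset_identifier, "userId": user_id}


body = json.dumps(upload_request_payload("XGRMMT", "42"))
# A real client would now POST `body` to DCM_BASE + "/ur.py" and later poll
# DCM_BASE + "/sr.py" for completed-upload status.
```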

data-capture-module's People

Contributors

pameyer, pdurbin

data-capture-module's Issues

DCM installation on Ubuntu 18.04

Hi,

I tried to install DCM on Ubuntu 18.04 using the .rpm file but I was not successful.

sudo apt-get install alien

curl -L https://github.com/sbgrid/data-capture-module/releases/tag/0.6/dcm-0.6-0.noarch.rpm > dcm-0.6-0.noarch.rpm

sudo alien dcm-0.6-0.noarch.rpm
error: dcm-0.6-0.noarch.rpm: not an rpm package (or package manifest)
Error executing "LANG=C rpm -qp --queryformat %{NAME} 'dcm-0.6-0.noarch.rpm'": at /usr/share/perl5/Alien/Package.pm line 489.
I would like to know what the instructions are for installing DCM from the .tar.gz file.

Thank you
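For what it's worth, alien's "not an rpm package" error usually means the downloaded file isn't an RPM at all — a `releases/tag/...` URL serves an HTML page rather than the release asset (assets are typically under `releases/download/...`). A quick sketch to check whether a saved file actually starts with the RPM lead magic bytes:

```python
# Check whether a file begins with the RPM lead magic (0xED 0xAB 0xEE 0xDB).
# An HTML error page saved under a .rpm name will fail this check.

def is_rpm(path):
    with open(path, "rb") as f:
        return f.read(4) == b"\xed\xab\xee\xdb"

# Usage: is_rpm("dcm-0.6-0.noarch.rpm") -> False strongly suggests the
# download grabbed a web page instead of the package.
```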

Move repo from sbgrid to IQSS GitHub org

Over at http://guides.dataverse.org/en/4.6.1/installation/prep.html#architecture-and-components we document "Architecture and Components" including optional components such as Zelig, TwoRavens, and Geoconnect, the code for which is housed under the "IQSS" GitHub organization at https://github.com/IQSS

As of this writing, this code base is housed under the "sbgrid" organization at https://github.com/sbgrid

I think IQSS should "own" this data-capture-module code base and that it should live under the IQSS GitHub organization. Eventually it'll appear on that "Architecture and Components" list. Here's a screenshot of how it looks now:

[screenshot: "Architecture and Components" list, 2017-05-16]

API change

With some of the package-file / primary-data-directory redesign (IQSS/dataverse#3353), the DCM will need to include a subdirectory within the /hold filesystem accessible to Dataverse (/hold/$dset/primary_data), and send that subdirectory to Dataverse when notifying it that a dataset has passed checksum validation and is ready for import.
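The change above can be sketched as follows — the field names (`uploadFolder`, `totalSize`) follow what Dataverse reads from the success notification, while the `status` value and helper shape are assumptions for illustration:

```python
# Sketch: build the checksum-validation-success payload, pointing Dataverse
# at the primary_data subdirectory under /hold. The "status" value and the
# function itself are illustrative assumptions.
import os


def success_payload(hold_root, dset):
    """Summarize a validated upload under hold_root/dset/primary_data."""
    data_dir = os.path.join(hold_root, dset, "primary_data")
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return {"status": "validation passed",  # exact value is an assumption
            "uploadFolder": dset,
            "totalSize": total}
```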

Need help in DCM installation

Hi all,

We're trying to install DCM on our Test environment (Dataverse 4.10.1).

We've installed DCM as described here and configured Dataverse accordingly (with :DataCaptureModuleUrl and :UploadMethods).

However, when we call the Dataverse API to get the upload script (api/datasets/:persistentId/dataCaptureModule/rsync?persistentId=$PERSISTENTID), we see this error message in the log:

There was a problem getting the script for XGRMMT . DCM returned status code: 404]]

We have no idea what the problem really is, or whether the DCM is installed correctly.

Can someone please tell us how to check the DCM installation?

Thanks in advance,

Thanh Thanh

compare ints

In ur.py we need to change

if req['datasetIdentifier'] != uid:

to

if int(req['datasetIdentifier']) != int(uid):

to avoid a 500 error when calling sr.py.
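The underlying problem is that a value parsed from JSON or a query string arrives as a string while the other side is an int, so the comparison is always unequal. A minimal illustration (assuming both values are numeric; int() raises ValueError otherwise, so real code may want to validate first):

```python
req = {"datasetIdentifier": "42"}  # parsed request values arrive as strings
uid = 42                           # ...but may be compared against an int

assert req["datasetIdentifier"] != uid            # '42' != 42: always a mismatch
assert int(req["datasetIdentifier"]) == int(uid)  # coercing both sides fixes it
```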

Error in request Dataverse API to report successful upload

Hi,

I think we almost have DCM working. The following steps are being performed correctly:

  • Obtaining the upload script through the Dataverse API
  • Upload data set files to /deposit
  • Move files to /hold

But we are having a problem. After moving files to /hold,
the post_upload.bash script makes a request to the Dataverse API to report
that the files were received successfully. However, Dataverse is returning an error.

# source /opt/dcm/scn/post_upload.bash
post_upload starting at  Thu Sep 10 19:33:23 -03 2020
/deposit/WZGANH/WZGANH/files.sha  :  /deposit/WZGANH/WZGANH  :  WZGANH  :  WZGANH
checksums verified
data moved
ERROR: dataverse at https://xxxxxxxxxxx had a problem handling the DCM success API call
{"status":"ERROR","code":500,"message":"Internal server error. More details available at the server logs.","incidentId":"284d10d6-1101-47da-be9e-9bde74cf3828"}
will retry in 60 seconds
ERROR: retry failed, will need to handle manually
post_upload completed at  Thu Sep 10 19:34:59 -03 2020

In the Dataverse log

[2020-09-10T19:33:59.309-0300] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.api.errorhandlers.ServeletExceptionHandler] [tid: _ThreadID=52 _ThreadName=jk-connector(3)] [timeMillis: 1599777239309] [levelValue: 1000] [[
  API internal error 284d10d6-1101-47da-be9e-9bde74cf3828: Null Pointer
java.lang.NullPointerException
        at edu.harvard.iq.dataverse.api.Datasets.receiveChecksumValidationResults(Datasets.java:1351)

In the source code of Dataverse version 4.20 (edu.harvard.iq.dataverse.api.Datasets)


1346                 String storageDriver = dataset.getDataverseContext().getEffectiveStorageDriverId();
1347                 String uploadFolder = jsonFromDcm.getString("uploadFolder");
1348                 int totalSize = jsonFromDcm.getInt("totalSize");
1349                 String storageDriverType = System.getProperty("dataverse.file." + storageDriver + ".type");
1350 
1351                 if (storageDriverType.equals("file")) {
1352                     logger.log(Level.INFO, "File storage driver used for (dataset id={0})", dataset.getId());

Apparently the property "dataverse.file." + storageDriver + ".type" is not defined.

We would like to know how to configure the appropriate values for
this property so that the request to the Dataverse API can be handled correctly.

We noticed that the properties related to the upload folder start with "dataverse.files" and not "dataverse.file".
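A minimal sketch of that failure mode: looking up the property under the wrong prefix returns null, and calling .equals() on the null result throws the NullPointerException. The conventional fix is a null-safe comparison (or defining the property under the prefix the code actually reads). The config dict below is an illustrative stand-in for JVM system properties:

```python
# Stand-in for the JVM system properties the installation defines; the keys
# defined in practice follow "dataverse.files.<id>.type" (note the plural).
props = {"dataverse.files.file.type": "file"}

storage_driver = "file"
value = props.get("dataverse.file." + storage_driver + ".type")  # singular prefix
assert value is None  # in Java, value.equals("file") now throws an NPE

# A null-safe comparison (constant first, in Java terms) avoids the crash:
assert ("file" == value) is False
```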

Thanks for any help.

eliminate cron dependencies

Dataverse has an impedance mismatch when interacting with systems that depend on cron. The DCM needs refactoring to eliminate this dependency, at least at the script-generation step. The cron dependency for checksum validation is less of a UI issue (there's already an async, out-of-browser step), and removing it would also require more backend and transfer-script changes.

rq startup issues

fresh vagrant/ansible:

[root@localhost log]# tail rq_worker.log
Traceback (most recent call last):
  File "/usr/bin/rq", line 7, in <module>
    from rq.cli import main
  File "/usr/lib/python2.6/site-packages/rq/__init__.py", line 6, in <module>
    from .connections import (Connection, get_current_connection, pop_connection,
  File "/usr/lib/python2.6/site-packages/rq/connections.py", line 7, in <module>
    from redis import StrictRedis
ImportError: cannot import name StrictRedis

Initial searches on the error were not particularly helpful; my initial guess is that the issue is with not specifying versions in pip, or a problem with the init script.
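If unpinned pip dependencies are the cause, one mitigation is to pin a redis client new enough to export StrictRedis alongside a matching rq release. The exact versions below are illustrative assumptions, not tested pairings:

```text
# requirements.txt (versions illustrative)
redis==2.10.6
rq==0.12.0
```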

Vagrant port conflict with 8080 (Glassfish)

The typical Dataverse developer will have Glassfish running on port 8080 locally so vagrant up won't "just work":

murphy:data-capture-module pdurbin$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'geerlingguy/centos6'...
==> default: Matching MAC address for NAT networking...
==> default: Setting the name of the VM: data-capture-module_default_1478539051355_11284
Vagrant cannot forward the specified ports on this VM, since they
would collide with some other application that is already listening
on these ports. The forwarded port to 8080 is already in use
on the host machine.

To fix this, modify your current project's Vagrantfile to use another
port. Example, where '1234' would be replaced by a unique host port:

  config.vm.network :forwarded_port, guest: 80, host: 1234

Sometimes, Vagrant will attempt to auto-correct this for you. In this
case, Vagrant was unable to. This is usually because the guest machine
is in a state which doesn't allow modifying port forwarding. You could
try 'vagrant reload' (equivalent of running a halt followed by an up)
so vagrant can attempt to auto-correct this upon booting. Be warned
that any unsaved work might be lost.
murphy:data-capture-module pdurbin$ 

I'm not sure what a better port would be. Maybe 8888 or something.
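Following Vagrant's own suggestion, a one-line change in the Vagrantfile remaps the host side of the forward; 8888 here is just an arbitrary alternative:

```ruby
# Forward the guest's 8080 to host port 8888 to avoid the local Glassfish clash.
config.vm.network :forwarded_port, guest: 8080, host: 8888
```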

roles/dcm/vars/secrets.yml was not found

I just ran vagrant up on 4eea1af and saw this error:

ERROR! vars file roles/dcm/vars/secrets.yml was not found
Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.

I know to use ansible/roles/dcm/vars/secrets.yml.template as a starting point, but this should probably be mentioned in the README.

When rsync upload begins, send a message to Dataverse.

Last week when discussing https://trello.com/c/Nbte37k1/9-rsync-file-upload-%26-download-(4.8) with @mheppler @TaniaSchlatter @dlmurphy we agreed that we'd all like to see the Data Capture Module (DCM) POST some JSON to Dataverse when upload has begun. That is to say, the DCM will recognize that the user has started executing the rsync script and inform Dataverse of this fact. When Dataverse receives the "upload has begun" message (or "uploadHasBegun"?), Dataverse will take some actions, possibly sending a notification to the user, preventing the dataset from being deleted, etc. It would be awfully nice if the DCM could send the number of bytes Dataverse should expect, but this is not a hard requirement.
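A sketch of what that notification might look like from the DCM side — the "uploadHasBegun" status, field names, and example identifier are all assumptions from the discussion above, not an agreed API:

```python
import json


def upload_begun_payload(dataset_identifier, expected_bytes=None):
    """Build the hypothetical 'upload has begun' message for Dataverse."""
    payload = {"status": "uploadHasBegun",  # name is a guess from the issue
               "datasetIdentifier": dataset_identifier}
    if expected_bytes is not None:          # nice-to-have, not a hard requirement
        payload["totalSize"] = expected_bytes
    return payload


body = json.dumps(upload_begun_payload("doi:10.5072/FK2/EXAMPLE", 1048576))
# A real DCM would POST `body` to a Dataverse endpoint once it detects that
# the user has started executing the rsync script.
```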

I believe the issue on the Dataverse side is IQSS/dataverse#3348

general ansible cleanup

initially noticed while bringing in the new roles

  • DCM user/group/uid/gid needs better organization

  • check OS version for dependencies (vs. currently just checking the distribution and assuming that's enough)

  • rq config needs to go from hardcoded files to templates/variables in case of deployment path changes

  • judging from initial feedback from pdurbin, the system config section of the README is probably unclear; the README needs redoing anyhow.

403 - Forbidden: allow Dataverse to talk to DCM

vagrant up got my DCM up and running but post-installation I wanted to open it up so that I could communicate with it from Dataverse without "403 - Forbidden" errors.

I did something like this (Dataverse was on 10.0.2.2 for whatever reason):

[root@uiswhlpt2614019 ~]# cd /etc/lighttpd
[root@uiswhlpt2614019 lighttpd]# cp -a lighttpd.conf lighttpd.conf.orig
[root@uiswhlpt2614019 lighttpd]# vi lighttpd.conf
[root@uiswhlpt2614019 lighttpd]# diff lighttpd.conf.orig lighttpd.conf
10c10
< $HTTP["remoteip"] !~ "10.0.10.173|127.0.0.1" {

---
> $HTTP["remoteip"] !~ "10.0.2.2|10.0.10.173|127.0.0.1" {
[root@uiswhlpt2614019 lighttpd]# 
[root@uiswhlpt2614019 lighttpd]# /etc/init.d/lighttpd restart

See also FRONTEND_IP in this repo:

Moving docker-dcm to this repo

@poikilotherm and I were just discussing the future of docker-aio in the context of a new containerization effort we are kicking off next week. (All are welcome!)

Thanks to @poikilotherm we've made some excellent progress recently in creating a Dataverse container for dev purposes, to the point that we're considering removing docker-aio from the code base.

However, I'm aware that docker-dcm and its docker-compose file rely on docker-aio. Would it be possible to copy the docker-dcm files (and the docker-aio files, if they are desired) into this repo? Perhaps someone will find them useful some day. Alternatively, we could simply delete them. They'll be in the git history! 😅

On a related note, we are discussing the deprecation of the rsync feature in Dataverse here:

Please let us know what you think. Thanks! ❤️

better branching

After #28 and #29 are sorted, we should create a develop branch with feature branches off of it, merging back to master on release (aka the usual sane multi-dev approach).

storage limits

Was reminded that there isn't currently anything to restrict the size of datasets; this might be useful for other installations.

Document how to get geerlingguy/centos6 Vagrant box

I'm resuming work on IQSS/dataverse#3352 and thought I'd run vagrant up so I have a DCM instance to develop against. I got this error:

murphy:data-capture-module pdurbin$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'geerlingguy/centos6' could not be found. Attempting to find and install...
    default: Box Provider: virtualbox
    default: Box Version: >= 0
The box 'geerlingguy/centos6' could not be found or
could not be accessed in the remote catalog. If this is a private
box on HashiCorp's Atlas, please verify you're logged in via
`vagrant login`. Also, please double-check the name. The expanded
URL and error message are shown below:

URL: ["https://atlas.hashicorp.com/geerlingguy/centos6"]
Error: 
murphy:data-capture-module pdurbin$

isolation

For production use, we should adapt the full filestore isolation of DCM/RSAL for transfers between filestores.

upgrade testing

Upgrading systems with the pre-RPM installation to RPM-based (without sufficient cleanup) fails. By itself this is not a significant problem, but we will need to test (and may need to fix) the scenario of upgrading from one RPM version to another without requiring a fresh system. The question isn't whether RPM supports that, but whether the RPM spec we use does it correctly.

installation / documentation improvements

related: #12 #11 #5

Generic improvements for staging / production deployment, and for installation / configuration documentation. Current thinking: the goal for the installation procedure should be "install RPM" (possibly including dependency RPMs), edit configuration file(s), start services.

  • RPMs for dependencies (redis, etc), preferably in repos but build if necessary.
  • python (pip) dependencies (redis, rq, etc.); preferably from repos, but build if necessary (investigation of setuptools -> rpm suggests that approach is more trouble than it's worth).
  • Configuration files for dependencies
  • Configuration files for DCM (generator, scanner); follow convention for locations (aka - /etc/dcm rather than /root/.bashrc and DCM repository clone)
  • init and unit files for DCM (cent7, cent6)
  • RPM for DCM
  • pipeline for testing user-facing install procedure (use template/example config files); script generation, file transfer procedure, call scanner outside cron job.
  • Documentation for staging / production dependencies, installation and configuration.
