ocfl-index's People

Contributors: srerickson
ocfl-index's Issues

Don't use gRPC for file transfer

A dedicated, plain HTTP handler should be used for downloading files because gRPC is not well-suited for file transfer. There is useful discussion of this elsewhere. In fact, the download handler already does this, but it could be simplified. I also need to figure out how best to document this API endpoint, since it won't be part of the gRPC service definition. Maybe it's sufficient to describe it in the comments/doc for the service.

Database interface for removing indexed objects

  • Bulk remove based on timestamp: RemoveObjectsBefore(before time.Time): remove objects that were last seen before time. This can be used during object root sync, which should set/update the indexed_at timestamp for all object directories.

ox command improvements

Implement the following:

the ls command:

  • ox ls: list objects
  • ox ls [-r | --recursive] [--versions] [--version {"head"}] [object_id] [dir]: list contents of an object.

the export command:

  • ox export [-V|--version=] {object_id} {dst} [flags]: export object's files to the local file system.

Edit: implement export instead of cp

file sizes not indexed on reindex

If I run ocfl-index server --inventories, then quit the server and run ocfl-index server --filesizes, the existing indexed inventories are supposed to be re-indexed with file sizes. This doesn't seem to be happening.

Failure to load large S3 storage root

For an S3 storage root with the same structure as #7, but with 10,000 top-level object directories and 147,192 total OCFL objects, the following command produces the following results:

time ./main index --s3-bucket [my-bucket-2] --s3-path .
2022/07/24 16:50:12 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:50:12 using S3 bucket=[my-bucket-2] path=.
2022/07/24 16:50:13 reading storage root: not an ocfl storage root: NAMASTE declaration not found

real	0m1.844s
user	0m0.043s
sys	0m0.037s

It is possible that an OCFL object is corrupted in this storage root, but I have no particular reason to believe that is the case.

Handle invalid objects during indexing

Currently, indexing halts with an error if an invalid object is encountered; valid objects may be left un-indexed. Eventually, it would be good to add error status/messages to the database schema so that invalid objects can be indexed as such. For now, just log when an invalid object is encountered and continue with the indexing process.

release process

  • build everything when a version tag is pushed to github
  • binaries and container image
  • pin versions of dev tools

improvements for ListObjects

  • remove old, unused sort option code (only sorting by id)
  • add prefix option for list objects request
  • implement prefix in sqlite3 queries

Multi-stage indexing scheme

Fully indexing OCFL storage roots is slow, especially for very large S3-based repositories. The resulting index can also be very large. Even if the end-user only needs the list of object paths in the index, there is currently no option other than full indexing: all objects, all object states.

Instead of an all-or-nothing indexing approach, ocfl-index should allow a piecemeal indexing strategy. Users may prefer to sacrifice index completeness and the possibility of "random access" for faster indexing and a smaller index footprint.

In a multi-stage indexing scheme, objects can be indexed with different levels of completeness:

  1. Storage root definition
  2. Object path/declaration (requires readdir of object root, providing: OCFL spec, digest algorithm of sidecar).
  3. Object state (requires object's inventory file: object id, all version state information).
  4. Content file size (requires readdir of content directories and subdirectories).

Edit: I'm not adopting the notion of "lazy indexing" used in the original issue title and description. Instead, this issue focuses on multi-stage indexing, with levels: object roots, inventories, and file sizes. To explain: "lazy indexing" was the idea that certain details wouldn't be indexed until that information was requested. The problem with this is that it adds a lot of uncertainty to how a given request might behave. API requests should have clear, straightforward logic and predictable behavior. If a client requests size information that's not available, respond with an error stating that fact. Avoid magic.

API for updating the index

The API should expose the following functionalities:

  • Object Root Sync: scan the storage root for new object root directories and also remove any object roots in the index that no longer exist in the storage root.

  • (Re)Index Objects: Parse object inventories and update the index. This request can be scoped to particular object directories or object IDs but it defaults to all indexed object roots. If the scope is a list of object IDs, the objects must have either already been indexed or the storage root must have a layout that allows the object path to be determined from the ID.

  • ox command for reindex

Likely user error

After building ocfl-index, the following command results in the following output:

source config
./main index --s3-bucket [my-bucket]

...where "config" is a file containing `export AWS_...` credential settings.

Output:

indexing to index.sqlite, schema: v0.1
2022/07/24 09:20:25 using FS dir=.
2022/07/24 09:20:25 reading storage root: not an ocfl storage root: NAMASTE declaration not found

update goreleaser

  • Fix goreleaser config, which hasn't been updated since the gRPC rewrite.
  • build ocfl-index and ox
  • build container image

Better concurrency controls

Currently, there is just one concurrency setting used for multiple purposes:

  • Number of workers for object root scan
  • Number of workers for file size scan
  • Number of workers for downloading and parsing inventory files.

The first two are closely related and might use the same value, which can be quite high in the S3 case (~100). The inventory value should be smaller and, in any case, should be settable separately.

Unexpected results: S3

The following command produces the following results:

$ time ./main index --s3-bucket [my-bucket] --s3-path .
2022/07/24 16:08:26 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:08:26 using S3 bucket=[my-bucket] path=.
scanning for objects...found 1435
indexed 1435/1435 objects
done

real	14m40.187s
user	0m20.607s
sys	0m16.283s

However, the bucket actually contains 10,093 objects. The following command produced the linked output. Note that there are 7,141 top-level object directories in this S3 storage root.

aws s3 ls --profile [my-profile] s3://[my-bucket] --recursive | grep ocfl_object > objects.txt

Fix logging

  • no logging anywhere by default
  • package-level logger that can be set to help with debugging
  • ability to add a logger to methods where it's important: validation, commit, FS actions, etc.

server mode

Add a serve command that starts a lightweight http server for querying the index.

concurrent inventory download

The object scanning process uses concurrent connections to speed up the work; however, inventories are still downloaded one at a time. (Indexing itself needs to be serialized.)

`ox cat` returns zero bytes

index.GetContentPath() is returning a bad path:

17:43:53.632210 open file name=public-data/public-data/13e/e63/0d5/g_70-2_d45_1995/v1/content/ReadMe.txt

The storage root directory appears twice in the path.
