ocfl-index's People

Contributors: srerickson
ocfl-index's Issues

Don't use gRPC for file transfer

A dedicated, plain HTTP handler should be used for downloading files because gRPC is not well-suited for file transfer. There is useful discussion of this elsewhere. In fact, the download handler already does this, but it could be simplified. I also need to figure out how best to document this API endpoint, since it won't be part of the gRPC service definition. Maybe it's sufficient to describe it in the comments/doc for the service.

Database interface for removing indexed objects

  • Bulk remove based on timestamp: RemoveObjectsBefore(before time.Time): remove objects that were last seen before time. This can be used during object root sync, which should set/update the indexed_at timestamp for all object directories.

ox command improvements

Implement the following:

the ls command:

  • ox ls: list objects
  • ox ls [-r | --recursive] [--versions] [--version {"head"}] [object_id] [dir]: list contents of an object.

the export command:

  • ox export [-V|--version=] {object_id} {dst} [flags]: export object's files to the local file system.

Edit: implement export instead of cp

file sizes not indexed on reindex

If I run ocfl-index server --inventories, then quit the server and run ocfl-index server --filesizes, the existing indexed inventories are supposed to be re-indexed with file sizes. This doesn't seem to be happening.

Failure to load large S3 storage root

For an S3 storage root with the same structure as #7, but with 10,000 top-level object directories and 147,192 total OCFL objects, the following command produces the following results:

time ./main index --s3-bucket [my-bucket-2] --s3-path .
2022/07/24 16:50:12 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:50:12 using S3 bucket=[my-bucket-2] path=.
2022/07/24 16:50:13 reading storage root: not an ocfl storage root: NAMASTE declaration not found

real	0m1.844s
user	0m0.043s
sys	0m0.037s

It is possible that an OCFL object is corrupted in this storage root, but I have no particular reason to believe that is the case.

Handle invalid objects during indexing

Currently, indexing halts with an error if an invalid object is encountered; valid objects may be left un-indexed. Eventually, it would be good to add error status/messages to the database schema so that invalid objects can be indexed as such. For now, just log when an invalid object is encountered and continue with the indexing process.

release process

  • build everything when a version tag is pushed to github
  • binaries and container image
  • pin versions of dev tools

improvements for ListObjects

  • remove old, unused sort option code (only sorting by id)
  • add prefix option for list objects request
  • implement prefix in sqlite3 queries

Multi-stage indexing scheme

Fully indexing OCFL storage roots is slow, especially for very large S3-based repositories. The resulting index can also be very large. Even if the end-user only needs the list of object paths in the index, there is currently no option other than full indexing: all objects, all object states.

Instead of an all-or-nothing indexing approach, ocfl-index should allow a piecemeal indexing strategy. Users may prefer to sacrifice index completeness and the possibility of "random access" for faster indexing and a smaller index footprint.

In a multi-stage indexing scheme, objects can be indexed with different levels of completeness:

  1. Storage root definition
  2. Object path/declaration (requires readdir of object root, providing: OCFL spec, digest algorithm of sidecar).
  3. Object state (requires object's inventory file: object id, all version state information).
  4. Content file size (requires readdir of content directories and subdirectories).

Edit: I'm not adopting the notion of "lazy indexing" used in the original issue title and description. Instead, this issue focuses on multi-stage indexing, with levels: object roots, inventories, and file sizes. To explain: "lazy indexing" was the idea that certain details wouldn't be indexed until that information was requested. The problem with this is that it adds a lot of uncertainty to how a given request might behave. API requests should have clear, straightforward logic and predictable behavior. If a client requests size information that's not available, respond with an error stating that fact. Avoid magic.

API for updating the index

The API should expose the following functionalities:

  • Object Root Sync: scan the storage root for new object root directories and also remove any object roots in the index that no longer exist in the storage root.

  • (Re)Index Objects: Parse object inventories and update the index. This request can be scoped to particular object directories or object IDs but it defaults to all indexed object roots. If the scope is a list of object IDs, the objects must have either already been indexed or the storage root must have a layout that allows the object path to be determined from the ID.

  • ox command for reindex

Likely user error

After building ocfl-index, the following command results in the following output:

source config
./main index --s3-bucket [my-bucket]

...where "config" is a file containing `export AWS_...` credential settings.

Output:

indexing to index.sqlite, schema: v0.1
2022/07/24 09:20:25 using FS dir=.
2022/07/24 09:20:25 reading storage root: not an ocfl storage root: NAMASTE declaration not found

update goreleaser

  • Fix goreleaser config, which hasn't been updated since the gRPC rewrite.
  • build ocfl-index and ox
  • build container image

Better concurrency controls

Currently, there is just one concurrency setting used for multiple purposes:

  • Number of workers for object root scan
  • Number of workers for file size scan
  • Number of workers for downloading and parsing inventory files.

The first two are closely related and might use the same value, which can be quite high in the S3 case (~100). The inventory value should be smaller and, in any case, should be settable separately.

Unexpected results: S3

The following command produces the following results:

$ time ./main index --s3-bucket [my-bucket] --s3-path .
2022/07/24 16:08:26 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:08:26 using S3 bucket=[my-bucket] path=.
scanning for objects...found 1435
indexed 1435/1435 objects
done

real	14m40.187s
user	0m20.607s
sys	0m16.283s

However, the bucket actually contains 10,093 objects. The following command produced the linked output. Note that there are 7,141 top-level object directories in this S3 storage root.

aws s3 ls --profile [my-profile] s3://[my-bucket] --recursive | grep ocfl_object > objects.txt

Fix logging

  • no logging anywhere by default
  • package-level logger that can be set to help with debugging
  • ability to add a logger to methods where it's important: validation, commit, FS actions, etc.

server mode

Add a serve command that starts a lightweight http server for querying the index.

concurrent inventory download

The object scanning process uses concurrent connections to speed up the work; however, inventories are still downloaded one at a time. (Indexing itself needs to be serialized.)

`ox cat` returns zero bytes

index.GetContentPath() is returning a bad path:

17:43:53.632210 open file name=public-data/public-data/13e/e63/0d5/g_70-2_d45_1995/v1/content/ReadMe.txt

The storage root directory appears twice in the path.
