srerickson / ocfl-index

An API for OCFL repositories

License: MIT License
Indexing fails if the storage root uses a layout extension that the underlying OCFL package doesn't support. This behavior doesn't make sense because the layout doesn't affect the indexing process.
A dedicated, plain HTTP handler should be used for downloading files because gRPC is not well-suited for file transfer. Useful discussion here. In fact, the download handler already does this, but it could be simplified. I also need to figure out how best to document this API endpoint, since it won't be part of the gRPC service definition. Maybe it's sufficient to describe it in the comments/doc for the service.
RemoveObjectsBefore(before time.Time): remove objects that were last seen before time. This can be used during object root sync, which should set/update the indexed_at timestamp for all object directories.

Implement the following:
the `ls` command:

- `ox ls`: list objects
- `ox ls [-r | --recursive] [--versions] [--version {"head"}] [object_id] [dir]`: list the contents of an object.

the `export` command:

- `ox export [-V|--version=] {object_id} {dst} [flags]`: export an object's files to the local file system.

Edit: implement `export` instead of `cp`.
If I run `ocfl-index server --inventories`, then quit the server and run `ocfl-index server --filesizes`, the existing indexed inventories are supposed to be re-indexed with file sizes. This doesn't seem to be happening.
For an S3 storage root with the same structure as #7, but with 10,000 top-level object directories and 147,192 total OCFL objects, the following command produces the following results:
```
time ./main index --s3-bucket [my-bucket-2] --s3-path .
2022/07/24 16:50:12 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:50:12 using S3 bucket=[my-bucket-2] path=.
2022/07/24 16:50:13 reading storage root: not an ocfl storage root: NAMASTE declaration not found

real    0m1.844s
user    0m0.043s
sys     0m0.037s
```
It is possible that an OCFL object is corrupted in this storage root, but I have no particular reason to believe that is the case.
Add a sub-command to print the release version.
Currently, indexing halts with an error when an invalid object is encountered, and valid objects may be left unindexed. Eventually, it would be good to add error statuses/messages to the database schema so that invalid objects can be indexed as such. For now, just log when an invalid object is encountered and continue indexing.
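The log-and-continue behavior could look like the following sketch; `indexObject` and its failure condition are stand-ins for the real inventory parsing:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// indexObject is a stand-in for parsing and indexing one object root; it
// fails for one path here to demonstrate the behavior described above.
func indexObject(root string) error {
	if root == "bad/obj" {
		return errors.New("invalid object")
	}
	return nil
}

// indexAll logs invalid objects and keeps going instead of halting,
// returning the number of objects successfully indexed.
func indexAll(roots []string) int {
	n := 0
	for _, r := range roots {
		if err := indexObject(r); err != nil {
			// future: record an error status for this object in the DB
			log.Printf("skipping %s: %v", r, err)
			continue
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(indexAll([]string{"good/obj", "bad/obj", "also/good"})) // prints 2
}
```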
Currently, the API assumes a single storage root, but it might be wise to add the fields necessary for managing multiple storage roots.
Add a `--json` flag to the `query` command to output query results in JSON format.
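A sketch of the flag's effect on output, assuming a hypothetical result-row shape (not the actual ocfl-index query schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResult is a hypothetical row shape for the query sub-command; the
// field names are assumptions, not the real schema.
type queryResult struct {
	ObjectID string `json:"object_id"`
	Head     string `json:"head"`
}

// renderResults returns either a plain tab-separated listing or JSON,
// depending on the --json flag.
func renderResults(rows []queryResult, asJSON bool) (string, error) {
	if asJSON {
		b, err := json.Marshal(rows)
		return string(b), err
	}
	out := ""
	for _, r := range rows {
		out += fmt.Sprintf("%s\t%s\n", r.ObjectID, r.Head)
	}
	return out, nil
}

func main() {
	rows := []queryResult{{ObjectID: "ark:/example/1", Head: "v2"}}
	s, _ := renderResults(rows, true)
	fmt.Println(s) // prints [{"object_id":"ark:/example/1","head":"v2"}]
}
```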
Fully indexing OCFL storage roots is slow, especially for very large S3-based repositories. The resulting index can also be very large. Even if the end-user only needs the list of object paths in the index, there is currently no option other than full indexing: all objects, all object states.
Instead of an all-or-nothing indexing approach, ocfl-index
should allow a piecemeal indexing strategy. Users may prefer to sacrifice index completeness and the possibility of "random access" for faster indexing and a smaller index footprint.
In a multi-stage indexing scheme, objects can be indexed with different levels of completeness:
Edit: I'm not adopting the notion of "lazy indexing" used in the original issue title and description. Instead, this issue focuses on multi-stage indexing, with levels: object roots, inventories, and file sizes. To explain: "lazy indexing" was the idea that certain details wouldn't be indexed until that information was requested. The problem with this is that it adds a lot of uncertainty to how a given request might behave. API requests should have clear, straightforward logic and predictable behavior. If a client requests size information that's not available, respond with an error stating that fact. Avoid magic.
The API should expose the following functionalities:
Object Root Sync: scan the storage root for new object root directories and also remove any object roots in the index that no longer exist in the storage root.
(Re)Index Objects: Parse object inventories and update the index. This request can be scoped to particular object directories or object IDs but it defaults to all indexed object roots. If the scope is a list of object IDs, the objects must have either already been indexed or the storage root must have a layout that allows the object path to be determined from the ID.
ox command for reindex
After building ocfl-index, the following command results in the following output:

```
source config
./main index --s3-bucket [my-bucket]
```

...where "config" contains the export AWS_... credentials.

Output:

```
indexing to index.sqlite, schema: v0.1
2022/07/24 09:20:25 using FS dir=.
2022/07/24 09:20:25 reading storage root: not an ocfl storage root: NAMASTE declaration not found
```
ocfl-index and ox
Currently there is just one concurrency setting that is used for multiple purposes.
The first two are closely related and might use the same value, which can be quite high in the S3 case (~100). The latter value should be smaller. In any case, it should be possible to set the latter value separately.
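The single setting could be split into separate knobs; a sketch with illustrative names and plausible defaults for S3 (the actual purposes and values aren't enumerated here):

```go
package main

import "fmt"

// Concurrency splits the single setting into separate knobs; the names are
// assumptions. Object scanning and inventory downloads are I/O-bound and can
// run wide against S3, while index writes should stay small because indexing
// itself is serialized.
type Concurrency struct {
	Scan     int // object root scanning workers
	Download int // inventory download workers
	DBWrite  int // index write workers
}

// defaults returns plausible values for an S3 storage root.
func defaults() Concurrency {
	return Concurrency{Scan: 100, Download: 100, DBWrite: 1}
}

func main() {
	c := defaults()
	fmt.Printf("scan=%d download=%d dbwrite=%d\n", c.Scan, c.Download, c.DBWrite)
}
```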
The following command produces the following results:
```
$ time ./main index --s3-bucket [my-bucket] --s3-path .
2022/07/24 16:08:26 created new index tables
indexing to index.sqlite, schema: v0.1
2022/07/24 16:08:26 using S3 bucket=[my-bucket] path=.
scanning for objects...found 1435
indexed 1435/1435 objects
done

real    14m40.187s
user    0m20.607s
sys     0m16.283s
```
However, the bucket actually contains 10,093 objects. The following command produced the linked output. Note: there are 7,141 top-level object directories in this S3 storage root.

```
aws s3 ls --profile [my-profile] s3://[my-bucket] --recursive | grep ocfl_object > objects.txt
```
It would be very useful to be able to include file size information in the index.
Add a `serve` command that starts a lightweight HTTP server for querying the index.
The GetStatus API should include values for all fields.
The index.Backend interface should include a method for listing object root directories (apart from Objects, which are really inventories).
Also, improve help messages.
From #5: the object scanning process uses concurrent connections to speed up the process; however, inventory downloads are still one-at-a-time. (Indexing itself needs to be serialized.)
Two separate issues, but I'll track them together:

The API handlers currently return plain `err` values on error, but they should be wrapped with gRPC error codes. See here: https://connect.build/docs/go/errors

`index.GetContentPath()` is returning a bad path:

```
17:43:53.632210 open file name=public-data/public-data/13e/e63/0d5/g_70-2_d45_1995/v1/content/ReadMe.txt
```

The storage root directory appears twice.
ox reindex command