Comments (5)
@alexdunnjpl Group this thinking with the scalable harvest components.
from harvest.
Given that harvest-service and (standalone) harvest are parallel development projects, here are some tentative thoughts:
- Identify infra-agnostic stages of the process, such as:
  - Given a file-system root node (i.e. a path), enumerate all products (Bundles, Collections, and simple Products) under that node's subtree.
  - Given an enumeration of products, map them to registry JSON documents.
  - Given a batch of registry JSON documents, process their registration with an OpenSearch instance (maybe - is this infra-dependent?)
- Extract each of these stages as utility modules (using standalone harvest for development, since that will be simplest).
- Once these utility modules are written/extracted, replace harvest-service implementations with calls to the modules as dependencies.
Once complete, the implementation code of harvest-service will just be the management/delegation code and some simple calls to glue it to the utility modules, and the implementation code of (standalone) harvest will just be a CLI wrapper around calls to the utility modules.
As a result, each (standalone/scalable) version of harvest will be doing exactly the same thing, and the utility libraries will be easily unit-testable.
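A minimal sketch of how the stages above might be split into independently testable utilities. Everything here is an assumption for illustration, not harvest's actual code: the function names (enumerate_labels, label_to_document, batches, harvest), the simplified document fields, and the idea of passing the registration step in as a callable so that the OpenSearch-specific part stays outside the shared modules.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical, heavily simplified stand-in for a registry JSON document.
Document = Dict[str, str]


def enumerate_labels(root: Path) -> Iterator[Path]:
    """Stage 1: yield every XML label file in the subtree under root."""
    for path in sorted(root.rglob("*.xml")):
        if path.is_file():
            yield path


def label_to_document(label: Path) -> Document:
    """Stage 2: map one label to a simplified registry document."""
    root_element = ET.parse(label).getroot()
    # Drop any "{namespace}" prefix so only the local tag name remains.
    product_class = root_element.tag.rsplit("}", 1)[-1]
    return {"product_class": product_class, "label_path": str(label)}


def batches(docs: Iterable[Document], size: int) -> Iterator[List[Document]]:
    """Group documents into lists of at most `size` items."""
    batch: List[Document] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def harvest(
    root: Path,
    submit: Callable[[List[Document]], None],
    batch_size: int = 2,
) -> int:
    """Glue code: run stages 1-3 and return the number of documents submitted."""
    count = 0
    docs = (label_to_document(label) for label in enumerate_labels(root))
    for batch in batches(docs, batch_size):
        # Stage 3 is injected, so no OpenSearch client is needed here.
        submit(batch)
        count += len(batch)
    return count
```

Under this split, a CLI wrapper and a service worker would both call harvest() with their own submit callable, and each stage can be unit-tested against a temporary directory without a running registry.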
@alexdunnjpl will organize a meeting to discuss that with @jordanpadams @viviant100 and @tloubrieu-jpl next week.
A better question than "how should we support both?" is "why do we support both?".
Since targeting a bundle directory with a element will ingest all labels nested within, I don't see the benefit of keeping the option (which iterates on the bundle label, all first-descendant collection labels, and all <=20th-descendant product labels) as a separate thing.
@jordanpadams is it reasonable to argue for dropping support for the functionality entirely in favor of the approach?
Per @jordanpadams:
This is true. <bundles> was kept for backwards compatibility support, but we changed the way we treated this when we decided to just use this part of the config to know where to look, but no longer decipher between bundles/collections/products. Just read everything below where this points and load the data as fast as you can.
Currently unclear whether removal of support is now acceptable - will be determined in tomorrow's meeting.