
Harvest Tool

The Harvest Tool captures and indexes product metadata. Each discipline node of the Planetary Data System runs the tool to crawl the local data repositories, discovering products and indexing associated metadata into the Registry Service. As such, it's a sub-component of the PDS Registry Application (https://github.com/NASA-PDS/registry).

For more detailed documentation on this tool, see the PDS Registry Documentation: https://nasa-pds.github.io/registry/.

Documentation

The documentation for the latest release of the Harvest Tool, including release notes, installation, and operation of the software, is available to browse online.

If you would like to get the latest documentation, including any updates since the last release, you can execute the "mvn site:run" command and view the documentation locally at http://localhost:8080/.

πŸ‘₯ Contributing

Within the NASA Planetary Data System, we value the health of our community as much as the code. Towards that end, we ask that you read and practice what's described in these documents:

  • Our contributor's guide delineates the kinds of contributions we accept.
  • Our code of conduct outlines the standards of behavior we practice and expect by everyone who participates with our software.

πŸ”’ Versioning

We use the SemVer philosophy for versioning this software.

πŸͺ› Development

To develop this project, use your favorite text editor, or an integrated development environment with Java support, such as Eclipse. You'll also need Apache Maven version 3. With these tools, you can typically run

mvn package

to produce a complete package. This runs all the phases necessary, including compilation, testing, and package assembly. Other common Maven phases include:

  • compile - just compile the source code
  • test - just run unit tests
  • install - install into your local repository
  • deploy - deploy to a remote repository β€” note that the Roundup action does this automatically for releases

πŸ’‚β€β™‚οΈ Secrets Detection Setup and Update

The PDS uses Detect Secrets (https://nasa-ammos.github.io/slim/docs/guides/software-lifecycle/security/secrets-detection/) to help prevent committing information to a repository that should remain secret.

For Detect Secrets to work, there is a one-time setup required in your personal global Git configuration, as well as several steps to create or update the .secrets.baseline file needed to avoid false-positive failures of the software. See the wiki entry on Detect Secrets to learn how to do this.
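
A minimal sketch of creating and auditing the baseline, assuming the slim-detect-secrets fork follows the upstream detect-secrets command-line interface (the wiki entry remains the authoritative reference):

$ detect-secrets scan --all-files > .secrets.baseline   # scan the whole tree, write the baseline
$ detect-secrets audit .secrets.baseline                # review each flagged item as true/false positive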

πŸͺ Pre-Commit Hooks

This package comes with a configuration for Pre-Commit, a system for automating and standardizing git hooks for code linting, security scanning, etc. In this repository, we use Pre-Commit with Detect Secrets to prevent accidentally committing secrets like API keys and passwords, or commit messages containing them.

Pre-Commit and detect-secrets are language-neutral, but they themselves are written in Python. To take advantage of these features, you'll need a local Python installation. A recommended way to provide one is with a virtual Python environment. Using the command line interface, run:

$ python -m venv .venv
$ source .venv/bin/activate   # Use source .venv/bin/activate.csh if you're using a C-style shell
$ pip install pre-commit git+https://github.com/NASA-AMMOS/slim-detect-secrets.git@exp

See the Detect Secrets information above to set up your secrets baseline before proceeding.

Finally, install the pre-commit hooks:

pre-commit install
pre-commit install -t pre-push
pre-commit install -t prepare-commit-msg
pre-commit install -t commit-msg

You can then work normally. Pre-commit will run automatically during git commit and git push so long as the Python virtual environment is active.

πŸ‘‰ Note: For Detect Secrets to work, there is a one-time setup required in your personal global Git configuration. See the wiki entry on Detect Secrets to learn how to do this.

πŸš… Continuous Integration & Deployment

Thanks to GitHub Actions and the Roundup Action, this software undergoes continuous integration and deployment. Every time a change is merged into the main branch, an "unstable" release (known in Java development circles as a "SNAPSHOT") is created and delivered to the releases page and to OSSRH.

You can make an official delivery by pushing a release/X.Y.Z branch to GitHub, replacing X with the major version number, Y with the minor version number, and Z with the micro version number. This results in a stable (non-SNAPSHOT) release that is generated, cryptographically signed (by an automated process, so adjust trust expectations accordingly), and made available on the releases page and OSSRH; the website is published; changelogs and requirements are updated; and a new version number is prepared in the main branch for future development.

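For example, to deliver a hypothetical 1.15.0 release (the version number is illustrative, matching the manual example below):

$ git checkout -b release/1.15.0
$ git push origin release/1.15.0
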
The following sections detail how to do this manually should the automated steps fail.

πŸ”§ Manual Publication

πŸ‘‰ Note: Requires using the PDS Maven Parent POM to ensure the release profile is set.

Update Version Numbers

Update pom.xml for the release version or use the Maven Versions Plugin, e.g.:

$ # Skip this step if this is a RELEASE CANDIDATE, we will deploy as SNAPSHOT version for testing
$ VERSION=1.15.0
$ mvn -DnewVersion=$VERSION versions:set
$ git add pom.xml
$ git add */pom.xml

Update Changelog

Update the changelog using the GitHub Changelog Generator. Note: Make sure you set $CHANGELOG_GITHUB_TOKEN in your .bash_profile or use the --token flag.

$ # For RELEASE CANDIDATE, set VERSION to future release version.
$ GITHUB_ORG=NASA-PDS
$ GITHUB_REPO=harvest
$ github_changelog_generator --future-release v$VERSION --user $GITHUB_ORG --project $GITHUB_REPO --configure-sections '{"improvements":{"prefix":"**Improvements:**","labels":["Epic"]},"defects":{"prefix":"**Defects:**","labels":["bug"]},"deprecations":{"prefix":"**Deprecations:**","labels":["deprecation"]}}' --no-pull-requests --token $GITHUB_TOKEN
$ git add CHANGELOG.md

Commit Changes

Commit changes using following template commit message:

$ # For operational release
$ git commit -m "[RELEASE] Validate v$VERSION"
$ # Push changes to main
$ git push --set-upstream origin main

Build and Deploy Software to Maven Central Repo

$ # For operational release
$ mvn --activate-profiles release clean site site:stage package deploy
$ # For release candidate
$ mvn clean site site:stage package deploy

Push Tagged Release

$ # For Release Candidate, you may need to delete old SNAPSHOT tag
$ git push origin :v$VERSION
$ # Now tag and push
$ REPO=harvest
$ git tag v${VERSION} -m "[RELEASE] $REPO v$VERSION" -m "See [CHANGELOG](https://github.com/NASA-PDS/$REPO/blob/main/CHANGELOG.md) for more details."
$ git push --tags

Deploy Site to Github Pages

From cloned repo:

$ git checkout gh-pages
$ # Copy the site over to the version-specific and default sites
$ rsync --archive --verbose target/staging/ .
$ git add .
$ # For operational release
$ git commit -m "Deploy v$VERSION docs"
$ # For release candidate
$ git commit -m "Deploy v${VERSION}-SNAPSHOT docs"
$ git push origin gh-pages

Update Versions For Development

Update pom.xml with the next SNAPSHOT version, either manually or using the Maven Versions Plugin.

For RELEASE CANDIDATE, ignore this step.

$ git checkout main
$ # For release candidates, skip to push changes to main
$ VERSION=1.16.0-SNAPSHOT
$ mvn -DnewVersion=$VERSION versions:set
$ git add pom.xml
$ git commit -m "Update version for $VERSION development"
$ # Push changes to main
$ git push --set-upstream origin main

Complete Release in GitHub

Currently, the process to create more formal release notes and attach assets is done manually through the GitHub UI.

NOTE: Be sure to add the tar.gz and zip from the target/ directory to the release assets, and use the CHANGELOG generated above to create the RELEASE NOTES.

πŸ“ƒ License

The project is licensed under the Apache version 2 license.

Maven JAR Dependency Reference

If you want to access snapshots, add the following to your ~/.m2/settings.xml:

<profiles>
  <profile>
     <id>allow-snapshots</id>
     <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>
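
With snapshots enabled, you can then declare Harvest as a dependency. The coordinates below are assumptions; confirm the groupId, artifactId, and current version against the released POM:

<dependency>
  <groupId>gov.nasa.pds</groupId>
  <artifactId>harvest</artifactId>
  <version>X.Y.Z</version> <!-- replace with a real version -->
</dependency>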

Contributors

actions-user, al-niessner, alexdunnjpl, dependabot[bot], galenhollins, jordanpadams, lylebarner, mcayanan, nutjob4life, pdsen-ci, ramesh-maddegoda, seanhardman, tdddblog, testpersonal, tloubrieu-jpl

Issues

As a user, I want to know when the file/label URL is not a URL

πŸ’ͺ Motivation

...so that I can more easily adapt my config to work for file paths.

πŸ“– Additional Details

If your file path replacement does not work, the value goes straight into the registry as a file path. We should catch this, immediately throw an ERROR, and not ingest the products. URLs must be required for all products ingested. We do not necessarily need to check the validity of the URL (we may need a separate tool for that), but there should at least be an https: scheme in the output URLs.
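
A hypothetical illustration of the requested check (not Harvest's actual code): accept only values that parse as a URI with an http or https scheme.

import java.net.URI;
import java.net.URISyntaxException;

public class UrlCheck {
    // Accept only values that parse as a URI with an http(s) scheme.
    static boolean looksLikeHttpUrl(String value) {
        try {
            String scheme = new URI(value).getScheme();
            return "http".equals(scheme) || "https".equals(scheme);
        } catch (URISyntaxException e) {
            return false;  // not even a parseable URI
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHttpUrl("https://example.org/data/a.xml"));  // true
        System.out.println(looksLikeHttpUrl("/data/a.xml"));                     // false: bare file path
    }
}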

βš–οΈ Acceptance Criteria

Given
When I perform
Then I expect

βš™οΈ Engineering Details

Update documentation for harvesting PDS3 data

EDIT: I figured it out: I had to put the target directory on the command-line. I don't know whether the documentation or the code should be fixed.


ORIGINAL ISSUE:

I easily could be doing something wrong that I can't figure out.

% harvest -C /PDS4tools/harvest/conf/search/defaults -o /PDS4tools -c /Users/rchen/Desktop/killme.xml
java.lang.NullPointerException
at gov.nasa.pds.harvest.search.crawler.metadata.extractor.Pds3MetExtractorConfig.<init>(Pds3MetExtractorConfig.java:44)
at gov.nasa.pds.harvest.search.HarvesterSearch.harvest(HarvesterSearch.java:229)
at gov.nasa.pds.harvest.search.HarvestSearchLauncher.doHarvesting(HarvestSearchLauncher.java:497)
at gov.nasa.pds.harvest.search.HarvestSearchLauncher.processMain(HarvestSearchLauncher.java:601)
at gov.nasa.pds.harvest.search.HarvestSearchLauncher.main(HarvestSearchLauncher.java:620)
null
******* The optimized code generation is disabled ************
PDS Harvest Tool Log
Version Version 2.5.2
Time Tue, Oct 29 2019 at 04:02:18 PM
Target(s) [/Users/rchen/Desktop/test/testHarv/DAWNGRAND1B/DATA]
Target Type PDS3
File Inclusions [*.LBL]
Severity Level INFO
Config directory /PDS4tools/harvest/conf/search/defaults
Output directory /PDS4tools/solr-docs
Transaction ID d5bb8905-4cd3-473c-a8ca-a4fcde971495
INFO: XML extractor set to the following default namespace: http://pds.nasa.gov/pds4/pds/v1
Summary:
0 of 0 file(s) processed, 0 other file(s) skipped
0 error(s), 0 warning(s)
Product Labels:
0 Successfully registered
0 Failed to register
Search Service Solr Documents:
0 Successfully created
0 Failed to get created
XPath Solr Documents (Quick Look Only, Ignore Failed Ingestions):
0 Successfully registered
0 Failed to register
Product Types Handled:
Registry Package Id: N/A
%
%
% ls -Rl /PDS4tools/harvest/conf/search/defaults
total 0
drwxr-xr-x@ 4 rchen JPL\Domain Users 128 Oct 29 12:18 pds/
drwxr-xr-x@ 3 rchen JPL\Domain Users 96 Oct 29 12:18 psa/
/PDS4tools/harvest/conf/search/defaults/pds:
total 0
drwxr-xr-x@ 13 rchen JPL\Domain Users 416 Oct 29 12:18 pds3/
drwxr-xr-x@ 26 rchen JPL\Domain Users 832 Oct 29 12:18 pds4/
/PDS4tools/harvest/conf/search/defaults/pds/pds3:
total 128
-rw-r--r--@ 1 rchen JPL\Domain Users 7313 Oct 29 12:18 attribute.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 5245 Oct 29 12:18 class.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 4286 Oct 29 12:18 context.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 139 Oct 29 12:18 core.properties
-rw-r--r--@ 1 rchen JPL\Domain Users 5905 Oct 29 12:18 dataset.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3469 Oct 29 12:18 instrument.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3321 Oct 29 12:18 instrumenthost.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3532 Oct 29 12:18 investigation.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3069 Oct 29 12:18 proxy.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 6171 Oct 29 12:18 service.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3217 Oct 29 12:18 target.xml
/PDS4tools/harvest/conf/search/defaults/pds/pds4:
total 328
-rw-r--r--@ 1 rchen JPL\Domain Users 3509 Oct 29 12:18 ancillary.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 9118 Oct 29 12:18 attribute.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 7850 Oct 29 12:18 browse.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 9593 Oct 29 12:18 bundle.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 7046 Oct 29 12:18 class.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 9346 Oct 29 12:18 collection.xml
drwxr-xr-x@ 6 rchen JPL\Domain Users 192 Oct 29 12:18 context/
-rw-r--r--@ 1 rchen JPL\Domain Users 11887 Oct 29 12:18 context.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 130 Oct 29 12:18 core.properties
-rw-r--r--@ 1 rchen JPL\Domain Users 3938 Oct 29 12:18 dip.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3964 Oct 29 12:18 dip_deep_archive.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 8787 Oct 29 12:18 document.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 7842 Oct 29 12:18 file_text.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 6629 Oct 29 12:18 native.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 8807 Oct 29 12:18 observational.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 6858 Oct 29 12:18 service.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3938 Oct 29 12:18 sip.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 4041 Oct 29 12:18 sip_deep_archive.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3535 Oct 29 12:18 software.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 6641 Oct 29 12:18 spice_kernel.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3418 Oct 29 12:18 thumbnail.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3292 Oct 29 12:18 update.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 3420 Oct 29 12:18 xml_schema.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 2379 Oct 29 12:18 zipped.xml
/PDS4tools/harvest/conf/search/defaults/pds/pds4/context:
total 32
-rw-r--r--@ 1 rchen JPL\Domain Users 2197 Oct 29 12:18 instrument-host.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 1947 Oct 29 12:18 instrument.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 2441 Oct 29 12:18 investigation.xml
-rw-r--r--@ 1 rchen JPL\Domain Users 1392 Oct 29 12:18 target.xml
/PDS4tools/harvest/conf/search/defaults/psa:
total 0
drwxr-xr-x@ 4 rchen JPL\Domain Users 128 Oct 29 12:18 pds3/
/PDS4tools/harvest/conf/search/defaults/psa/pds3:
total 24
-rw-r--r--@ 1 rchen JPL\Domain Users 138 Oct 29 12:18 core.properties
-rw-r--r--@ 1 rchen JPL\Domain Users 4926 Oct 29 12:18 dataset.xml
%
%
% ls -lR /PDS4tools/solr-docs/
%
%
% ls -l /Users/rchen/Desktop/killme.xml
-rw-r--r-- 1 rchen JPL\Domain Users 3600 Oct 29 16:02 /Users/rchen/Desktop/killme.xml
%
%
% ls -l /Users/rchen/Desktop/test/testHarv/DAWNGRAND1B/DATA
total 10256
-rw-r--r-- 1 rchen JPL\Domain Users 5497 Oct 19 2016 GRD-L1B-151023-151216_160816-CTL-BGOC.LBL
-rw-r--r-- 1 rchen JPL\Domain Users 12376 Oct 19 2016 GRD-L1B-151023-151216_160816-CTL-BGOC.TAB
-rw-r--r--@ 1 rchen JPL\Domain Users 39840 Sep 13 2016 GRD-L1B-151023-151216_160816-EPG.LBL
-rw-r--r--@ 1 rchen JPL\Domain Users 5184439 Aug 16 2016 GRD-L1B-151023-151216_160816-EPG.TAB
%
%
% cat /Users/rchen/Desktop/killme.xml

/Users/rchen/Desktop/test/testHarv/DAWNGRAND1B/DATA *.LBL http://starbase.jpl.nasa.gov $HOME DATA_SET_ID
  <!-- Tells Harvest what element values to use to create the title. -->
  <titleContents harvest:appendFilename="true">
    <elementName>DATA_SET_ID</elementName>
  </titleContents>

  <!-- Register static metadata for each product. -->
  <staticMetadata>
    <slot harvest:name="information_model_version">
      <value>1.11.0.0</value>
    </slot>
    <slot harvest:name="product_class">
      <value>Product_Proxy_PDS3</value>
    </slot>
    <slot harvest:name="data_set_ref">
      <value>urn:nasa:pds:context_pds3:data_set:data_set.vg2-j-pls-5-summ-ele-mom-96.0sec-v1.0</value>
    </slot>
    <slot harvest:name="investigation_ref">
      <value>urn:nasa:pds:context_pds3:investigation:mission.voyager</value>
    </slot>
    <slot harvest:name="instrument_host_ref">
      <value>urn:nasa:pds:context_pds3:instrument_host:instrument_host.vg2</value>
    </slot>
    <slot harvest:name="instrument_ref">
      <value>urn:nasa:pds:context_pds3:instrument:instrument.pls_vg2</value>
    </slot>
    <slot harvest:name="target_ref">
      <value>urn:nasa:pds:context_pds3:target:target.planet.jupiter</value>
    </slot>
  </staticMetadata>

  <!-- Register dynamic metadata from elements in the product labels. -->
  <ancillaryMetadata>
    <elementName harvest:slotName="observation_start_date_time">
      START_TIME
    </elementName>
    <elementName harvest:slotName="observation_stop_date_time">
      STOP_TIME
    </elementName>
    <elementName harvest:slotName="product_type">
      PRODUCT_TYPE
    </elementName>
    <elementName harvest:slotName="creation_date_time">
      PRODUCT_CREATION_TIME
    </elementName>
    <elementName harvest:slotName="encoding_type">
      INTERCHANGE_FORMAT
    </elementName>
  </ancillaryMetadata>
</pds3ProductMetadata>

Fix bug where ingested product start_date_time is off by 12 hours

This may be harvest's problem, but it may be elsewhere. Maybe it goes away without the registry, if that's still a thing.

I just ingested/indexed some OREX bundles, and the start_date_times of the search result and the original input file differ, as circled in orex.jpg. Note that the output is 12 hours behind the input .xml file. In insight.jpg, the output is 12 hours ahead of the input file. Also odd is that stop_date_time is correct in insight.jpg.

See issue on internal Github for more details.

https://github.jpl.nasa.gov/PDSEN/harvest/issues/3#issue-205493

harvest ingest is not creating all product_lidvid as an array

πŸ› Describe the bug

Loaded Elasticsearch (ES) with the latest pds-registry-app using a fresh download of harvest-3.5.0-SNAPSHOT (the file date of May 12 is the download date, not the creation date). Modifying registry-service-app to use the registry-refs index, I found that for this collection id

urn:nasa:pds:izenberg_pdart14_meap:data_tnmap::1.0::P1

"product_lidvid" : "urn:nasa:pds:izenberg_pdart14_meap:data_tnmap:thermal_neutron_map::1.0",

πŸ“œ To Reproduce

Steps to reproduce the behavior:

  1. Create and fill the ES database using the README.md instructions for Docker
  2. In a browser, go to http://localhost:9200/registry-refs/_search?pretty
  3. Scroll down to "id": "urn:nasa:pds:izenberg_pdart14_meap:data_tnmap::1.0::P1"
  4. See the error a couple of lines below

πŸ•΅οΈ Expected behavior

Verified with @jordanpadams and @tloubrieu-jpl that the design intent is that it look like:

"product_lidvid" : ["urn:nasa:pds:izenberg_pdart14_meap:data_tnmap:thermal_neutron_map::1.0"],

πŸ“š Version of Software Used

🏞 Screenshots

not-an-array


πŸ¦„ Applicable requirements

As a user, I want to be able to see a summary of all logs messages after harvest execution completes

πŸ’ͺ Motivation

...so that I can easily tell if products failed, if there are warnings I should check out, etc.

πŸ“– Additional Details

βš–οΈ Acceptance Criteria

Given a bundle with a product label that contains something that will cause an error in harvest
When I perform a harvest execution on that bundle that includes that product
Then I expect to see a part in the summary that includes a number of errors (and warnings) that occurred during the execution.

βš™οΈ Engineering Details

Harvest failing on Juno collection with "Missing ids" error

πŸ› Describe the bug

Related to #96, not sure what the issue is or how to resolve it.

πŸ•΅οΈ Expected behavior

Executes successfully, or at least throws a more usable error.

πŸ“š Version of Software Used

3.6.0-SNAPSHOT

🩺 Test Data / Additional context

Data is here: https://atmos.nmsu.edu/PDS/data/jnogrv_1001/DOCUMENT/


πŸ¦„ Related requirements

βš™οΈ Engineering Details

  • It appears this traces back to the CollectionProcessor class.

This is a blocker for ATM Juno ingestion.

populate instrument / instrument host / mission / target names and types using context products

Is your feature request related to a problem? Please describe.
The current implementation parses this information from the labels. We should customize this to never use those values (they are not validated and may not be accurate) and instead populate them from a local config file, similar to what validate does:

https://github.com/NASA-PDS/validate/blob/master/src/main/resources/util/registered_context_products.json

The label fields this should apply to:

//Investigation_Area/name
//Investigation_Area/type
//Observing_System_Component/name
//Observing_System_Component/type
//Target_Identification/name
//Target_Identification/type

Error ingesting an XML boolean with values of 0/1

πŸ› Describe the bug

Valid XML boolean value not being converted to the ES boolean field type

πŸ“œ To Reproduce

bash-4.2$ registry-manager load-data -es https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443 -auth /home/itrejo/es-auth.cfg -dir out &
[1] 17567
bash-4.2$ [INFO] Registry URL: https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443
[INFO] Registry index: registry
[INFO] Loading PDS to ES data type mapping from /home/atmos4/PDS/pds-registry-app-1.0.3/registry-manager-4.3.1/elastic/data-dic-types.cfg
[INFO] Updating LDDs from /export/atmos4/PDS/harvest/out/missing_xsds.txt
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating 'mars2020' LDD. Schema location: https://pds.nasa.gov/pds4/mission/mars2020/v1/PDS4_MARS2020_1G00_1000.xsd
[INFO] This LDD already loaded.
[INFO] Updating schema with fields from /export/atmos4/PDS/harvest/out/missing_fields.txt
[INFO] Updated 226 fields
[INFO] Loading ES data file: /export/atmos4/PDS/harvest/out/registry-docs.json
[ERROR] failed to parse field [mars2020:Algorithm_Parameter_Table_Values/mars2020:persistence_en] of type [boolean] in document with id 'urn:nasa:pds:mars2020_moxie:data_derived:ox__0004_0667297777ddr___0010052moxi00101p01::1.0'. Preview of field's value: '0'
[ERROR] Could not load data.

πŸ•΅οΈ Expected behavior

Ingests successfully as a boolean field type.

πŸ“š Version of Software Used

SNAPSHOT as of 6e68b5c

🩺 Test Data / Additional context

https://atmos.nmsu.edu/PDS/data/PDS4/Mars2020/moxie_bundle/data_derived/0004/OX__0004_0667297777DDR___0010052moxi00101P01.xml

🏞 Screenshots

πŸ–₯ System Info


πŸ¦„ Related requirements

βš™οΈ Engineering Details

PDS4 ASCII_Boolean maps to xs:boolean, which, per the XML Schema docs, can be one of 0, 1, true, or false.

After poking around at how to handle this with ES, it sounds like if we translate the 0,1,true,false from the XML to JSON, it will automatically turn those into true,false (I think?). But since we don't really pay attention to which field is a boolean or not, we don't do this translation.

Possible solutions:

  • Translate these fields from the client
  • Add a script to the field type in ES to translate on the service ingestion side

As an additional note, looks like this was supposed to be handled by ElasticSearch a while ago, but there was a regression at some point? elastic/elasticsearch#26
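
A sketch of the first option above, client-side translation (a hypothetical helper, not existing Harvest code): map the four xs:boolean lexical forms onto JSON booleans before the document is written.

public class XsBoolean {
    // xs:boolean allows exactly four lexical forms: "true", "false", "1", "0".
    static boolean parse(String lexical) {
        switch (lexical.trim()) {
            case "true": case "1":  return true;
            case "false": case "0": return false;
            default:
                throw new IllegalArgumentException("Not an xs:boolean: " + lexical);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("0"));     // false -- the Mars2020 case above
        System.out.println(parse("true"));  // true
    }
}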

Could not parse date in yyyy-MM-ddZ format

Some OREX labels (collections) have <start_date_time> in "yyyy-MM-ddZ" format (e.g., 2016-09-08Z), which Harvest could not parse; for example, the OREX/OCAMS/data_calibrated/collection_ocams_data_calibrated.xml collection.

It doesn't look like a valid date: a date without a time but with a timezone is considered invalid, at least in Java. A timezone usually makes sense only when there is both a date and a time.

Need confirmation that "2016-09-08Z" is valid or not in PDS labels.
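
Whether Java accepts the form depends on the formatter used; a small probe (illustrative only, not Harvest's parser):

import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DateZoneProbe {
    public static void main(String[] args) {
        // ISO_DATE treats the offset designator as optional, so this parses:
        LocalDate d = LocalDate.parse("2016-09-08Z", DateTimeFormatter.ISO_DATE);
        System.out.println(d);  // 2016-09-08

        // A date-time parser, however, rejects a date-only value outright:
        try {
            OffsetDateTime.parse("2016-09-08Z");
        } catch (DateTimeParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}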

File system metadata not sufficiently captured per requirements

Describe the bug
Per requirement NASA-PDS/pds-registry-app#67, the following information was not captured as expected. More details given as to what is expected for these fields and the rationale for capturing it.

  • base file path for product(s) – this should be the actual path(s) to the product(s) on the file system and/or a separate field for the URL so someone could easily download them

  • checksum (_file_md5 also refers to the label) – even though this metadata is in the label, we shouldn't trust that. Also, that metadata is optional in the label, and we will want this for every product in the archive so we can eventually use the registry for integrity checking and our software to deliver products to the NSSDCA

  • file size (_file_size also refers to the label) – Richard, is this required in the products themselves? If it is, I would still like us to duplicate this information to some other field (e.g. _product_size) so we have all the "metrics" fields named similarly so they can be easily found in the output.

  • file timestamp (don't see this at all) – I was thinking this would be the timestamp of the file on the filesystem. Some folks use this for integrity checking to see if/when the file was touched.

  • MIME type (also don't see) – I would like us to also have a _product_mime_type (or some similar name) for what it is according to the filesystem, not necessarily the label.

issue identified by @rchenatjpl

Applicable requirements:
πŸ¦„ NASA-PDS/pds-registry-app#67

Incorrect "lidvid" and "_id" fields are ingested

πŸ› Describe the bug identified during I&T

I changed the version_id from my previous testing to 2.0, 1.10.0, and 1.20, but no matter what I set it to, when I query http://localhost:9200/registry/_search?q=* I always get back urn:nasa:pds:lab_shocked_feldspars:data::1.0. You can see my test steps and test results here: https://cae-testrail.jpl.nasa.gov/testrail/index.php?/tests/view/3869751&group_by=cases:section_id&group_order=asc&group_id=90426

πŸ–₯ System Info

Harvest version: 3.7.0-SNAPSHOT
Build time: 2022-05-27T15:39:34Z

Related issue

Bug raised while testing
#90

βš™οΈ Engineering Details

MD5 digest encoding is in Base64 instead of Hex

Describe the bug
bug identified by @rchenatjpl

My config file to harvest has

After harvesting and loading, localhost:9200/... shows
_file_md5 "z1axD9CwAw7i8jlMaldSrQ=="
in both the web page and the .json output. That's supposed to be an md5 value, right? If so, is that some alternate format of md5? I thought md5 values were all hex.

Error ingesting a datetime field pds:Time_Coordinates/pds:stop_date_time

πŸ› Describe the bug

πŸ“œ To Reproduce

Steps to reproduce the behavior:

bash-4.2$ registry-manager load-data -es https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443 -auth /home/itrejo/es-auth.cfg -dir /home/atmos4/PDS/harvest/out &
[1] 45180
bash-4.2$ [INFO] Registry URL: https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443
[INFO] Registry index: registry
[INFO] Loading PDS to ES data type mapping from /home/atmos4/PDS/pds-registry-app-1.0.3/registry-manager-4.3.1/elastic/data-dic-types.cfg
[INFO] Updating LDDs from /home/atmos4/PDS/harvest/out/missing_xsds.txt
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating 'pds' LDD. Schema location: http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd
[INFO] This LDD already loaded.
[INFO] Updating schema with fields from /home/atmos4/PDS/harvest/out/missing_fields.txt
[INFO] Updated 1 fields
[INFO] Loading ES data file: /home/atmos4/PDS/harvest/out/registry-docs.json
[INFO] Loaded 500 document(s)
[INFO] Loaded 1000 document(s)
[INFO] Loaded 1500 document(s)
[INFO] Loaded 2000 document(s)
[INFO] Loaded 2500 document(s)
[INFO] Loaded 3000 document(s)
[INFO] Loaded 3500 document(s)
[INFO] Loaded 4000 document(s)
[ERROR] failed to parse field [pds:Time_Coordinates/pds:stop_date_time] of type [date] in document with id 'urn:nasa:pds:eldorado_nv_dd_ptv:data_derived:f_p10_2_data-133::1.0'. Preview of field's value: '2012-06-31T02:22:47.956Z'
[ERROR] Could not load data.

πŸ•΅οΈ Expected behavior

Ingests the data successfully, or throws an error but still ingests the remaining data products.

πŸ“š Version of Software Used

Latest snapshot.

🩺 Test Data / Additional context

https://atmos.nmsu.edu/PDS/data/PDS4/eldorado_nv_dd_ptv/data_derived/Location_F/f_p10_2_data-133.xml
https://atmos.nmsu.edu/PDS/data/PDS4/eldorado_nv_dd_ptv/data_derived/Location_F/f_p10_2_data-133.csv


πŸ¦„ Related requirements

βš™οΈ Engineering Details

Creating an additional ticket to handle errors more gracefully overall. (Note that the offending value, 2012-06-31T02:22:47.956Z, is not a real timestamp: June has only 30 days.)

Investigate issues getting erroneous "Content is not allowed in prolog" errors

From @rchenatjpl :

What's up with the message "Content is not allowed in prolog."? vtool validates the .LBL files successfully. I forget where (vtool or maybe validate) I've seen that error message before.

DATA.zip
harvestDawnNondata.xml.txt

% harvest -o /PDS4tools -C /PDS4tools/harvest/conf/search/defaults -c testHarv/harvestDawnNondata.xml testHarv/DAWNGRAND1B/DATA
******* The optimized code generation is disabled ************
PDS Harvest Tool Log

Version Version 2.5.2
Time Tue, Oct 29 2019 at 05:34:05 PM
Target(s) [testHarv/DAWNGRAND1B/DATA]
Severity Level INFO
Config directory /PDS4tools/harvest/conf/search/defaults
Output directory /PDS4tools/solr-docs
Transaction ID 6177ddb2-46fa-4789-abdb-8eddc46d9b50

INFO: XML extractor set to the following default namespace: http://pds.nasa.gov/pds4/pds/v1
INFO: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-EPG.LBL] Begin processing.
ERROR: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-EPG.LBL] line 1: Content is not allowed in prolog.
INFO: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-EPG.TAB] Begin processing.
ERROR: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-EPG.TAB] line 1: Content is not allowed in prolog.
INFO: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-CTL-BGOC.TAB] Begin processing.
ERROR: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-CTL-BGOC.TAB] line 1: Content is not allowed in prolog.
INFO: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-CTL-BGOC.LBL] Begin processing.
ERROR: [testHarv/DAWNGRAND1B/DATA/GRD-L1B-151023-151216_160816-CTL-BGOC.LBL] line 1: Content is not allowed in prolog.

Summary:

0 of 4 file(s) processed, 0 other file(s) skipped
4 error(s), 0 warning(s)

Product Labels:
0 Successfully registered
0 Failed to register

Search Service Solr Documents:
0 Successfully created
0 Failed to get created

XPath Solr Documents (Quick Look Only, Ignore Failed Ingestions):
0 Successfully registered
0 Failed to register

Product Types Handled:

Registry Package Id: N/A

End of Log

Fix bug where unable to execute harvest-ctrl

This may well be my fault, but maybe not, or maybe we need to support such environments.

% harvest-ctrl
-Djava.ext.dirs=/PDS4tools/harvest/lib is not supported. Use -classpath instead.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Is that because of my java version?
% java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

Error "Missing ids" does not provide enough information for debugging

πŸ› Describe the bug

Identified by ATM:

[SUMMARY] Output format: json 
[SUMMARY] Reading configuration from /home/itrejo/jnogrv_1001.cfg
[SUMMARY] Elasticsearch URL: https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Reading registry schema from Elasticsearch
[INFO] Processing bundle directory /PDS/data/anonymous/PDS/data/jnogrv_1001
[INFO] Processing bundle /PDS/data/anonymous/PDS/data/jnogrv_1001/bundle_juno_grav.xml
[INFO] Processing collection /PDS/data/anonymous/PDS/data/jnogrv_1001/DOCUMENT/collection_document.xml
[ERROR] Missing ids

πŸ•΅οΈ Expected behavior

Error message includes much more useful information

πŸ“š Version of Software Used

v3.6.0

🩺 Test Data / Additional context

TBD download from ATM


πŸ¦„ Related requirements

βš™οΈ Engineering Details

Looks like issue is here: https://github.com/NASA-PDS/harvest/blob/main/src/main/java/gov/nasa/pds/harvest/dao/EsRequestBuilder.java#L58

For whatever reason, the software is unable to read the LIDs from the collection inventory.

Tightly coupled with #97

Integrate supplementer into harvest

Motivation

...so that as a user, I don't have to run a separate utility to ingest Product_Metadata_Supplemental

Additional Details

The initial implementation requires a user to pull the product metadata supplemental products and run supplementer separately. I am OK with having a separate tool specifically for that; however, I think harvest should auto-detect these files and update records accordingly.

Per the docs, it looks like this is pretty much already happening; I just think we should kick off supplementer automatically unless explicitly told not to.

Acceptance Criteria

Given a Product_Metadata_Supplemental product in a bundle
When I perform harvest of that bundle
Then I expect harvest to ingest the product metadata supplemental records into the registry product metadata.

Engineering Details

Implement date conversion from PDS4 date/time strings to Solr format

Add optional "dataType" attribute in <xpath> element of custom field mapping file to indicate that a field is a date or time and has to be parsed and converted to Solr date format.

<xpaths>
  <xpath fieldName="start_date_time" dataType="date">
    /Product_Observational/Observation_Area/Time_Coordinates/start_date_time</xpath>
</xpaths>

For autogenerated fields, convert dates if a field name contains "date".

Incorrect "lidvid" and "_id" fields are ingested (trailing zeros are truncated)

πŸ› Describe the bug

If a version_id field has trailing zeros, for example 1.10, the value is truncated to 1.1 when the lidvid and _id fields are generated.
For example, logical_identifier = urn:nasa:pds:maven.anc and version_id = 1.10 generate lidvid and _id = urn:nasa:pds:maven.anc::1.1.

πŸ“œ To Reproduce

Steps to reproduce the behavior:

  1. Create a label with the version_id field containing trailing zeros, for example, 1.10
  2. Run Harvest to Ingest the label.
  3. Check the lidvid and _id fields in OpenSearch / Elasticsearch.

πŸ•΅οΈ Expected behavior

version_id should not be truncated.
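
A hypothetical illustration of the suspected failure mode (not confirmed to be Harvest's actual code path): if the version passes through a numeric type anywhere, the trailing zero is lost, so it must be handled as a string end to end.

public class VersionTruncation {
    public static void main(String[] args) {
        // Through a numeric parse the trailing zero disappears...
        double asNumber = Double.parseDouble("1.10");
        System.out.println("urn:nasa:pds:maven.anc::" + asNumber);  // ...::1.1  (wrong)

        // ...whereas the raw string preserves it.
        String asString = "1.10";
        System.out.println("urn:nasa:pds:maven.anc::" + asString);  // ...::1.10 (right)
    }
}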

Combine ingestion components under new ingest repo

Engineering Details

My two cents: we rename the harvest repo to ingest and combine all these components under it.

Here are the components I think we should combine: harvest, registry-mgr-elastic, big-data-*, supplementer, pds-registry-common

Bash script does not launch on macos, likely not on linux

Describe the bug
Bash script does not launch on mac
I had to replace "/dev/nul" with "/dev/null"

To Reproduce
Steps to reproduce the behavior:

  1. harvest -c ..
  2. the error appears in stdout

Expected behavior
The script should work

Version of Software Used
3.2.0

Desktop (please complete the following information):

  • OS: macos, should be the same on linux

Rename scripts to use pds- prefix

Is your feature request related to a problem? Please describe.
In an effort to provide a consistent set of tools to the discipline nodes, let's rename our CLI tools to include pds- as a prefix.

e.g.

pds-harvest
pds-harvest.bat

Quick fix to support date/time conversion to "ISO instant" format

Motivation

Before automatic conversion of date/time fields based on LDD definition is implemented, we need a way to specify a list of date / time fields to be converted to "ISO instant" format supported by Elasticsearch.

Additional Details

  • Keep existing logic and convert all fields having "date" in their names.
  • Provide an option to list field names that need to be converted to ISO format in Harvest configuration file.

Acceptance Criteria

Given an attribute in a label that is an ASCII_Date type from PDS4, but does NOT contain date in the attribute name
When I perform the configuration noted below, and ingest the data
Then I expect harvest to convert these attributes to ISO dates

Engineering Details

An example of the new dateFields element in the Harvest configuration file:

<autogenFields>
  <dateFields>
    <field>cassini:VIMS_Specific_Attributes/cassini:earth_received_start_time</field>
    <field>cassini:VIMS_Specific_Attributes/cassini:earth_received_stop_time</field>
    <field>cassini:VIMS_Specific_Attributes/cassini:start_time_doy</field>
    <field>cassini:VIMS_Specific_Attributes/cassini:stop_time_doy</field>
  </dateFields>
</autogenFields>

#54 will be a better, longer-term solution.

Improve and simplify Harvest execution and configuration to only manage Registry collection

  • Remove search index creation (this will be created / managed by separate tool)
  • Remove SolrJ push
  • Create "solr docs" for batch processing to registry index
  • Update docs to show how to run registry-mgr and push solr docs to registry collection
  • Simplify config to consolidate all the different product types into 1 product type; the software will register all PDS4 XML products (not necessarily sure how to ensure a product is PDS4?)
  • Cleanup and pull out old dead code

As a user, I want Harvest to automatically convert date / time fields to the ISO format supported by Elasticsearch

Motivation

...so that I can load my data into Elasticsearch

Additional Details

The current version of Harvest only converts fields containing "date" in the field name, such as start_date_time. Fields like earth_received_start_time are not converted, and documents with those fields cannot be loaded into Elasticsearch. Harvest should convert all date / time fields based on the data dictionary field definition.

Follow-on to #55

Acceptance Criteria

Given a date/time field defined in an LDD
When I perform harvesting of products referring to that LDD
Then I expect all date/time fields converted to the "ISO instant" format supported by Elasticsearch.

Engineering Details

  • Harvest should be able to read registry data dictionary from Elasticsearch registry-dd index.
  • If a field data type is one of PDS4 date/time types, the field value should be converted to "ISO instant" format.
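
A minimal sketch of the conversion step itself, assuming the field has already been identified as a date/time type via the registry-dd lookup described above (illustrative, not Harvest's implementation):

import java.time.Instant;
import java.time.OffsetDateTime;

public class IsoInstantConversion {
    public static void main(String[] args) {
        // A PDS4 date/time carrying a local offset...
        Instant instant = OffsetDateTime.parse("2017-06-02T10:15:30-07:00").toInstant();
        // ...rendered in the "ISO instant" (UTC, trailing Z) form Elasticsearch accepts.
        System.out.println(instant);  // 2017-06-02T17:15:30Z
    }
}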

harvest stops rather than skips a file with bad permissions

πŸ› Describe the bug

harvest seems to halt if it finds a file it cannot read, but should warn and skip.

πŸ“œ To Reproduce

Steps to reproduce the behavior:

  1. Restrict permissions on a product
  2. Attempt to have harvest index it
  3. Note that the process stops when it cannot handle this file

πŸ•΅οΈ Expected behavior

harvest should warn the user and skip the file

πŸ“š Version of Software Used

[bcops@psapre01 ~]$ harvest --version
Harvest version: 3.5.1
Build time: 2021-12-10T05:25:44Z

🩺 Test Data / Additional context

Last lines of output:

2022-01-12 22:35:51,042 [INFO] Processing /repo/esa/psa/em16_tgo_cas/data_raw/2019-10-29/cas_raw_sc_20191028t185912-20191028t185916-8610-27-nir-554504520-29-1/2.0/cas_raw_sc_20191028T185912-20191028T185916-8610-27-NIR-554504520-29-1.xml
2022-01-12 22:35:51,078 [INFO] Processing /repo/esa/psa/em16_tgo_cas/data_raw/2019-10-29/cas_raw_sc_20191028t185912-20191028t185916-8610-27-pan-554504520-29-0/2.0/cas_raw_sc_20191028T185912-20191028T185916-8610-27-PAN-554504520-29-0.xml
2022-01-12 22:35:51,826 [ERROR] Could not parse file /repo/esa/psa/em16_tgo_cas/data_raw/2019-10-29/cas_raw_sc_20191028t204114-20191028t204118-8611-29-blu-554506372-0-2/2.0/cas_raw_sc_20191028T204114-20191028T204118-8611-29-BLU-554506372-0-2.xml (Permission denied)

πŸ–₯ System Info

  • OS: linux

πŸ¦„ Related requirements

βš™οΈ Engineering Details

Add release datetime to version output

Similar to validate, we need to include a release time in the version output for handling SNAPSHOT versions.

https://github.com/NASA-PDS/validate/blob/main/src/main/java/gov/nasa/pds/validate/ValidateLauncher.java#L1167

Validate uses a properties file that it populates with information at build time, which it then reads to output the version information. We don't necessarily need to follow that model, but need to include some sort of release / build time in the version output.
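
A hedged sketch of that approach (the resource name and property keys are hypothetical):

import java.io.InputStream;
import java.util.Properties;

public class VersionInfo {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        // A properties file filtered by Maven at build time, e.g. with
        // ${project.version} and ${maven.build.timestamp} placeholders.
        try (InputStream in = VersionInfo.class.getResourceAsStream("/harvest.properties")) {
            p.load(in);
        }
        System.out.println("Harvest version: " + p.getProperty("harvest.version"));
        System.out.println("Build time: " + p.getProperty("harvest.build.time"));
    }
}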

Update the following products:

  • Harvest
  • Registry manager

Check for special characters in input file strings to avoid vulnerability

  • ingest/harvest-search/src/main/java/gov/nasa/pds/harvest/search/HarvestSearchLauncher.java, line 141: Local-user-controlled data in path expression (CWE-022, severity 3). Accessing paths influenced by users can allow an attacker to access unexpected resources; should check these URLs are not a URL exploit. Recommendation: test the variable "new File(value)" to ensure no special characters before it is opened (potential redirect vulnerability).
  • ingest/harvest/src/main/java/gov/nasa/pds/harvest/HarvestLauncher.java, line 297: Local-user-controlled data in path expression (CWE-022, severity 3). Accessing paths influenced by users can allow an attacker to access unexpected resources; should check these URLs are not a URL exploit. Recommendation: test the variable "new File(keystore)" to ensure no special characters before it is opened (potential redirect vulnerability).
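
One common CWE-022 mitigation, sketched as a hypothetical helper (not the scanner's prescribed fix): canonicalize the user-supplied path and require it to stay under an allowed base directory.

import java.io.File;
import java.io.IOException;

public class PathGuard {
    // Reject values that escape the allowed base directory (e.g. via "..").
    static File resolveUnder(File allowedBase, String userValue) throws IOException {
        File candidate = new File(allowedBase, userValue).getCanonicalFile();
        if (!candidate.getPath().startsWith(allowedBase.getCanonicalPath() + File.separator)) {
            throw new IOException("Path escapes allowed directory: " + userValue);
        }
        return candidate;
    }
}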

harvest tool removed all json files on error

πŸ› Describe the bug identified during I&T

This is related to test for Close all open files on error #67

The normal test file generates json files in /tmp/harvest/output, but after adding the suggested bad xml file with content "kdfdfkdjf;'dkflkf", harvest outputs the expected error message; however, the files in /tmp/harvest/output are all empty.

πŸ₯Ό Related Test Case(s)

https://cae-testrail.jpl.nasa.gov/testrail/index.php?/cases/view/1013928

πŸ” Related issues

NASA-PDS/pds-registry-app#205


βž• Additional Details

πŸ“œ To Reproduce

Steps to reproduce the behavior:
see #67

πŸ“š Version of Software Used

Harvest version: 3.5.2-SNAPSHOT
Build time: 2021-12-10T07:51:23Z

🩺 Test Data / Additional context

🏞 Screenshots

before adding x.xml: (screenshot)
after adding x.xml: (screenshot)

πŸ–₯ System Info


πŸ¦„ Related requirements

βš™οΈ Engineering Details

As an ingest user, I want a schema to validate harvest config against

πŸ’ͺ Motivation

...so that I can more easily configure and validate harvest config using common XML tools (e.g. Oxygen)

πŸ“– Additional Details

βš–οΈ Acceptance Criteria

Given a valid harvest config file with a harvest namespace / schema specified
When I perform validation through an XML validator (e.g. Oxygen)
Then I expect the config to validate successfully and be utilized by harvest

Given an invalid harvest config file with a harvest namespace / schema specified
When I perform validation through an XML validator (e.g. Oxygen)
Then I expect the config to fail validation and fail when trying to be used by harvest

βš™οΈ Engineering Details

Check input URIs to avoid potential security vulnerability

  • ingest/harvest-search/src/main/java/gov/nasa/pds/harvest/search/HarvestSearchLauncher.java, line 141: Local-user-controlled data in path expression (CWE-022, severity 4). Accessing paths influenced by users can allow an attacker to access unexpected resources. Recommendation: test the variable "new File(value)" to ensure no special characters before it is opened (potential redirect vulnerability).
  • ingest/harvest/src/main/java/gov/nasa/pds/harvest/HarvestLauncher.java, line 297: Local-user-controlled data in path expression (CWE-022, severity 4). Accessing paths influenced by users can allow an attacker to access unexpected resources. Recommendation: test the variable "new File(keystore)" to ensure no special characters before it is opened (potential redirect vulnerability).
