Comments (21)

mbjones commented on July 18, 2024

Should be pretty straightforward using the dataone library -- the only issue will be logging in via CILogon to get the user certificate before attempting the MN.create() operation.
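Roughly, the flow would look something like this (just a sketch: the "PROD" environment string is an assumption, and the create step itself is elided since its exact form depends on the dataone package version):

library(dataone)

# Log in via CILogon first so the client certificate is available locally
cm <- CertificateManager()
downloadCert(cm)        # runs the CILogon flow and saves an X.509 certificate
getCertExpires(cm)      # certificates are short-lived, so check the expiry

# Then point a client at the KNB member node
cli <- D1Client("PROD", "urn:node:KNB")
# ... build D1Object(s) for the EML metadata and the data files, then call the
# MN create operation for each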

cboettig commented on July 18, 2024

Sounds great. We handle OAuth login already for rfigshare. Not familiar with CILogon though.


cboettig commented on July 18, 2024
  • Create publish method for KNB using dataone package.

See notes on CILogon: http://carlboettiger.info/2013/10/10/notes.html

cboettig commented on July 18, 2024

@mbjones Just implemented the knb method on the development branch, following your example from mbjones/opensci_r_esa_2013#4. Seems something on the server side is down, though: KNB isn't resolving metadata files at the moment. E.g. even in a browser, this DOI (http://dx.doi.org/10.5063/AA/wolkovich.30.1) resolves to an error at
https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fwolkovich.30.1

Also, quick question on the ids:

  • Is id.mta (the id of the metadata D1Object in mbjones/opensci_r_esa_2013#4) the same as id.pkg (and also the same as the packageId attribute on the head eml node)? If not, which one is the packageId and where does the other ID live inside the EML?
  • Also in mbjones/opensci_r_esa_2013#4, you show uploading the EML both by passing the filename, "text.xml", and later by first using readLines and passing the text. Does either approach work, or is the readLines step necessary rather than simply passing the name of an XML file?

cboettig commented on July 18, 2024

@mbjones KNB seems to be back up and the function is generally working:

I've just assumed that the EML packageId attribute should correspond to the identifier of the EML metadata D1Object itself, while the pkg.id passed to the `new("dataPackage", ...)` call should be something different (I just append `_package` to the eml packageId). Please correct me if this is wrong or ill-advised!
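In code, that convention is roughly (a sketch; the uuid call is just for illustration):

# sketch of the ID convention described above (uuid::UUIDgenerate is illustrative)
eml_id <- paste0("urn:uuid:", uuid::UUIDgenerate())  # packageId attribute on the eml node
pkg_id <- paste0(eml_id, "_package")                 # pkg.id passed to the new("dataPackage", ...) call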

The example shown in test_knb.R generates these IDs for the EML, the CSV, and the package, respectively, and it appears each of them resolves in the browser:

Yay!

> a = getD1Object(cli, pid[["csv"]])
checkServerTrusted - RSA
checkServerTrusted - RSA
   === Trying Location: urn:node:KNB (https://knb.ecoinformatics.org/knb/d1/mn/v1/object/urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8)
checkServerTrusted - RSA
@@ D1Object-class:R initialize as something else
@@ D1Object-class:R initialize with jobjRef
> asDataFrame(a)
theData is textConnectionconnection
  river                      spp   stg  ct        day
1   SAC Oncorhynchus tshawytscha smolt 293 2013-09-01
2   SAC Oncorhynchus tshawytscha  parr 410 2013-09-01
3    AM     Oncorhynchus kisutch smolt 210 2013-09-02
  • Um, I'm not sure what the equivalent of asDataFrame is for the XML file. For reml at least I think we might prefer a wrapper to getD1Object / getD1Package so that we can read in EML-annotated data in the same way from KNB as we do with local files in the eml_read function.

I think I still need some advice on how to write an appropriate test case. The server detects that my files have already been uploaded based on matching metadata:

 * SystemMetadata indicates that this object was already created (uploaded Tue Oct 29 13:12:14 PDT 2013). Skipping create.

even when the test has generated a new ID, which I guess is good, though it's not obvious to me what's happening. Am I creating multiple ids that point to the same file this way? What's the best way to go about writing tests so that (a) they don't trigger this error, and (b) I don't end up writing a lot of junk to KNB?

mbjones commented on July 18, 2024

That's great, @cboettig! Glad to see it's working, and sorry you hit the server while we were migrating it from one IP address to another. Should be good to go now.

About packageId, we have always used the ID of the EML document there, and use that as a surrogate for the package. There are arguments that this isn't right given our new DataPackage model, but history is trumping novelty for the time being for us. Your way of generating IDs for the packages sounds great and matches ours.

I need to turn down the debug output from the R Client -- kinda annoying that it spits out all of that garbage. Need to work on that, so I entered a ticket.

We don't have a good viewer for XML in the dataone package -- I agree it would be great to automatically create a reml instance from it so it is more manageable in R. I'm open to suggestions on how to do that. As I've mentioned, I don't think asDataFrame() belongs in the dataone package anyways.

About your error saying that the object was already created -- there is something wrong there. It's perfectly legitimate to upload two identical copies of an object that differ only by identifier. There may be a bug lurking here. Can you describe more precisely which part of the code triggers this message?

Regarding testing, it would be best not to upload test content to the KNB (and to archive anything that isn't real science data so it doesn't show up in searches). We run a variety of testing servers that you can use instead of the KNB. They, of course, are not as stable as the KNB, because we clear content and reinstall them fairly regularly, but they let you generate a lot of test data without swamping the view normal scientists see in the KNB. We provide an overview of the environments, which is good background, but the details of the nodes there are now out of date. You can find out what nodes are available in each environment using the nodes service:
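For example, for the DEV environment (the other environments have analogous coordinating-node URLs):

$ curl https://cn-dev.test.dataone.org/cn/v1/node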

cboettig commented on July 18, 2024

@mbjones Very good. Can I select these environments from the dataone R client? From the docs it's not clear whether sandbox or staging would be preferred for unit testing.

mbjones commented on July 18, 2024

@cboettig Yes, you can do so when setting up the client. Don't be surprised if the test environments are sometimes not working, as they go up and down a lot. Nevertheless, here's how you would use the 'DEV' environment.

# Initialize a client to interact with DataONE
mn_nodeid <- "urn:node:mnDemo5"           # MN for DEV env
cli <- D1Client("DEV", mn_nodeid)

And I updated a full example here:
https://github.com/mbjones/opensci_r_esa_2013/blob/master/dataone-r/dataone-write-pkg.R

cboettig commented on July 18, 2024

I get the error

Error: The provided mnNodeid value is invalid for the DataONE environment

with both

mn_nodeid <- "urn:node:mnDemo5" # MN for DEV env
cli <- D1Client("DEV", mn_nodeid)

and

mn_nodeid <- "urn:node:mnSandboxUCSB1"
cli <- D1Client("SANDBOX", mn_nodeid)

Is that just because the environments are down?


mbjones commented on July 18, 2024

I got that error yesterday too. I quit RStudio and relaunched, and then it worked fine. I just tried it now and it's working for me. The problem seems to be that R loads Java classes statically via a JNI interface, so I'm thinking that the library is caching the old environment and therefore not seeing the MN you request. There's no way in R to cause the Java library to reload. I can consistently change environments as long as I restart R in between trying to switch environments with the D1Client. Can you try restarting your R/Java environment and trying again? I opened an issue for this: https://redmine.dataone.org/issues/4151

cboettig commented on July 18, 2024

Cool, restarting R worked. I can run the upload function successfully, but get an error on getD1Object on the resulting id for, say, the uploaded csv file:

 csv <- getD1Object(cli, pid[["csv"]])

With message and trace:

checkServerTrusted - RSA
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
org.dataone.service.exceptions.NotFound: No system metadata could be
found for given PID: urn:uuid:1ae5ec40-0fcf-49ce-aae8-fe71669e7594

Enter a frame number, or 0 to exit

1: getD1Object(cli, pid[["csv"]])
2: getD1Object(cli, pid[["csv"]])
3: .local(x, identifier, ...)
4: J("org/dataone/client/D1Object")$download(jPid)
5: .jrcall(x@name, name, ...)
6: .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, .jcast(if
(i
7: .jcheck(silent = FALSE)


mbjones commented on July 18, 2024

Great! Did you wait the required time for synchronization to DataONE to occur before looking for the object (usually about 3-4 minutes)? Each MN decides how long the interval is between sync calls with DataONE; Metacat defaults to 3 minutes. So, no object will show up on DataONE until the sync has occurred after the create.

The R client depends on the CN to find an object (for various good reasons), but we could also make it possible to look directly on the MN for the created object, which is available immediately after create(). It's probably a good change.

cboettig commented on July 18, 2024

I didn't wait. Working now. I can add a sleep command into the unit test, I suppose. I'm just using getD1Object in the unit test as a way of confirming that the data was successfully uploaded in the first place, so perhaps there's a more sensible query I can use anyhow.
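For example, something along these lines in the test (wait_for_sync is just a hypothetical helper; the timings would need tuning):

# hypothetical helper: poll until the object has synced from the MN to the CN
wait_for_sync <- function(cli, pid, tries = 10, interval = 30) {
  for (i in seq_len(tries)) {
    obj <- tryCatch(getD1Object(cli, pid), error = function(e) NULL)
    if (!is.null(obj)) return(obj)
    Sys.sleep(interval)   # MN -> CN synchronization defaults to ~3 minutes on Metacat
  }
  stop("object not resolvable after waiting for sync: ", pid)
}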

So, it looks like we now have a working publish method for the KNB. A few issues still to attend to:

  • The function should probably update the EML file to include the download URLs for the CSV and for the EML file itself, right? I gather the full URL can be inferred from the identifier once the D1Client and mn_nodeid have been specified?
  • The function assumes the certificate is already available in /tmp/, and thus I haven't wrapped any of the authentication steps; e.g.
cm <- CertificateManager()   # manages the local CILogon certificate
downloadCert(cm)             # runs the CILogon login flow and saves the certificate
getCertExpires(cm)           # reports when the current certificate expires

even though we're dynamically loading the dataone library inside the function call. This is the same way we handle figshare -- neither package is strictly a dependency and is instead only on the "suggests" list. The documentation simply directs the user to install those packages and consult their docs on how to set up authentication. The only difference is that for figshare this is more of a 'set and forget' process, where you copy some API keys from the web client into your .Rprofile once and then you're set. It looks like the dataone certificates expire in 24 hrs(!?), so it's more of a burden there (see the sketch after this list).

  • Might consider adding some user prompts about details like waiting the 3 minutes before querying the objects (not that we provide any query methods from reml yet anyway -- but that's a separate issue -- we could just read in the EML file directly from its published URL).
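The kind of certificate guard I have in mind is roughly this (a rough sketch only; it assumes getCertExpires returns a timestamp that as.POSIXct can parse, which may not match the actual return type):

# rough sketch of a pre-publish certificate check (return type of getCertExpires is assumed)
library(dataone)
cm <- CertificateManager()
expires <- getCertExpires(cm)
if (is.null(expires) || as.POSIXct(expires, tz = "UTC") < Sys.time()) {
  downloadCert(cm)   # re-run the CILogon login to fetch a fresh certificate
}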

Other feedback on the function is welcome! See the function definition and the test example (version stable links, devel branch).

mbjones commented on July 18, 2024

The download URL may change over time, and we would like to protect against that so the metadata remains useful over time. This can be due to changes in the URL that is used for services on a node, or because a node is unavailable temporarily but replica copies may be available on other nodes, or because a node is defunct and a replica copy of an object on another node has become the authoritative copy. DataONE tracks multiple copies of objects, and knows which are available at any given moment. DataONE provides its resolve() service to report the current locations of objects. The dataone R package uses the resolve service to find out which nodes contain a copy of an object, and then downloads a copy from one of those, rather than a static URL. (This is why you currently have to wait for the object to be reported to the DataONE Coordinating Node, because that is what allows the resolve() service to know about the object.)

So, a couple things to consider:

$ curl -H "Accept: text/xml" https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8

<?xml version="1.0" encoding="UTF-8"?>
<d1:objectLocationList xmlns:d1="http://ns.dataone.org/service/types/v1">
  <identifier>doi:10.5063/AA/nceas.290.8</identifier>
  <objectLocation>
    <nodeIdentifier>urn:node:CN</nodeIdentifier>
    <baseURL>https://cn.dataone.org/cn</baseURL>
    <version>v1</version>
    <url>https://cn.dataone.org/cn/v1/object/doi:10.5063%2FAA%2Fnceas.290.8</url>
  </objectLocation>
  <objectLocation>
    <nodeIdentifier>urn:node:KNB</nodeIdentifier>
    <baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
    <version>v1</version>
    <url>https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.290.8</url>
  </objectLocation>
</d1:objectLocationList>

So, you might consider using the resolve URI for the online/url field in EML, as that should be the most stable URL for an object over time (notwithstanding issues such as DataONE failure, which is a possibility with any repository, but less likely for DataONE given our institutional diversity).

cboettig commented on July 18, 2024

So when publishing to the KNB via the dataone package, would it then be advisable to encode the resolve URL as the download URL for each object?

Or is it in fact preferable to omit the download URLs entirely for content on the KNB, since the file should be accessed by its identifier rather than a download URL?

e.g. would it be advisable to have the <physical><distribution> state something like:

<online>
  <url function="download">https://cn.dataone.org/cn/v1/resolve/urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8</url>
</online>

(and then presumably we would substitute in https://cn-dev.test.dataone.org/ etc for the test case using the dev server?), or should we avoid writing such a url?

On the reading end, sounds like we should teach eml_read to accept identifiers rather than URLs, e.g. you would do something like:

eml_read("doi:10.5063%2FAA%2Fnceas.290.8")

(Though we can also make it such that forwarding URL links also work, e.g.

eml_read("https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8")

would work as long as that URL also resolves.)

mbjones commented on July 18, 2024

Yes, I think it would be advisable to provide the CN resolve URL for an object in its online/url field as you propose. This will help clients that do not know that these objects are on the DataONE network and accessible by DataONE services. Clients who do know will be able to use the id attribute to find the identifier for use in DataONE service calls. Others will find the URL handy. And yeah, it's fine to use cn-dev for testing URLs.

So I think eml_read should be able to take an identifier, and it could easily delegate to the dataone library to handle all of the MN.get() calls. You could, for example, use something like this (with error checking added, of course):

library(dataone)
library(XML)

eml_read <- function(pid) {
    # Set up a D1Client ahead of time (defaults to the production environment)
    cli <- D1Client()
    # then get the metadata object from the dataone package, which handles the
    # resolve step and fetching the data bytes
    obj0 <- getD1Object(cli, pid)    # Be sure to check for errors before proceeding
    formatId <- getFormatId(obj0)
    metadata <- xmlParse(getData(obj0))
    # if the formatId is some form of EML (e.g. "eml://ecoinformatics.org/eml-2.1.1"),
    # send the parsed metadata off for further parsing in reml
}
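So a call like this would then be the user-facing entry point, using the identifier from the resolve example above:

eml_read("doi:10.5063/AA/nceas.290.8")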

cboettig commented on July 18, 2024

I've created a little R repository to play with the REST API further: ropensci-archive/dataone-restful#1

cboettig commented on July 18, 2024

Identifiers are generated for both external CSV data and the EML, and eml_read can access the data based on these identifiers. We may eventually revisit this issue if we move to the REST-based KNB implementation and end up needing to call the functions slightly differently, but the top-level API for publish_eml with KNB as the destination repository will probably remain unchanged, so I think we can now close this.

cboettig commented on July 18, 2024

@mbjones Looks like the eml_knb function isn't compatible with dataone 2.0.0 (e.g. D1Client not found). Clearly I need to get up to speed on the new package interface, etc.

I did try a quick work-around with direct REST calls to the API. It looks like the DEV node I was using for testing is down (502 error): https://mn-demo-5.test.dataone.org/knb/d1/mn, but checking https://cn-dev.test.dataone.org/cn/v1/node I realized I could just switch from 5 to 6. I could get non-authenticated calls to work this way, but when I tried uploading an example file using a CILogon certificate I got a 'permission denied' (401) error. Maybe not worth debugging this since I should just be able to use the dataone package, but I'm not sure why it's not working all the same. (Example of this failing test here: https://github.com/cboettig/dataone-lite/blob/master/tests/testthat/d1_upload.R )
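For reference, the kind of call I'm attempting looks roughly like this (a sketch only: the certificate path, the example pid, and sysmeta.xml are placeholders, and the mn-demo-6 base URL is just inferred by swapping 5 for 6):

$ curl -X POST \
    --cert /tmp/x509up_u1000 \
    -F "pid=urn:uuid:EXAMPLE-PID" \
    -F "object=@data.csv" \
    -F "sysmeta=@sysmeta.xml" \
    "https://mn-demo-6.test.dataone.org/knb/d1/mn/v1/object"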

mbjones commented on July 18, 2024

@cboettig Yeah, that's only because we are working from the low-level API back up towards the top in terms of refactoring. So, D1Client will be the last API to be refactored. I'll look into why it's not working.

cboettig commented on July 18, 2024

Uploading EML to KNB, etc. should now be handled in the dataone package; there's probably no need for an additional wrapper on the EML end (this also keeps dependencies lighter).
