Should be pretty straightforward using the dataone library -- the only issue will be logging in via CILogin to get the user certificate before attempting the MN.create() operation from the dataone library.
from eml.
Sounds great. We handle OAuth login already for rfigshare. Not familiar
with CILogin though.
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
from eml.
- Create publish method for KNB using the `dataone` package.
See notes http://carlboettiger.info/2013/10/10/notes.html on CILogin.
from eml.
@mbjones Just implemented the knb method on the development branch, following your example from mbjones/opensci_r_esa_2013#4. Seems something on the server side is down though, KNB isn't resolving metadata files at the moment: e.g. even in a browser this doi gives me an error: http://dx.doi.org/10.5063/AA/wolkovich.30.1 resolves to this page with an error:
https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fwolkovich.30.1
Also, quick question on the ids:
- Is `id.mta` (the id of the metadata D1Object in mbjones/opensci_r_esa_2013#4) the same as `id.pkg` (and also the same as the `packageId` attribute on the head `eml` node)? If not, which one is the `packageId`, and where does the other ID live inside the EML?
- Also in mbjones/opensci_r_esa_2013#4, you show uploading the EML both by passing the filename, "text.xml", and later by first using `readLines` and passing the text. Does either approach work, or is the `readLines` step necessary, given the name of an XML file?
from eml.
@mbjones KNB seems to be back up and the function is generally working:
I've just assumed that the EML `packageId` attribute should correspond to the metadata of the eml D1Object itself, while the `pkg.id` passed to the `new("dataPackage"` call should be something different (I just append `_package` to the eml packageId). Please correct me if this is wrong or ill-advised!!
The example shown in test_knb.R generates these IDs for the EML, the CSV, and the package, respectively, and it appears that each of them resolves in the browser:
- eml: urn:uuid:7d01d639-0e89-4d75-9d88-e334f11b8bad
- csv: urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8
- package: urn:uuid:7d01d639-0e89-4d75-9d88-e334f11b8bad_package
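The ID convention described above can be sketched in a few lines of base R. `package_id_for()` is a hypothetical helper name, not part of `reml` or `dataone`; the only thing taken from the thread is that the package ID is the EML ID with `_package` appended.

```r
# Hypothetical helper: derive the data package ID from the EML document's ID
# by appending "_package", as described in this thread.
package_id_for <- function(eml_id) {
  stopifnot(is.character(eml_id), length(eml_id) == 1L, nzchar(eml_id))
  paste0(eml_id, "_package")
}

# The EML and CSV identifiers listed above.
ids <- list(
  eml = "urn:uuid:7d01d639-0e89-4d75-9d88-e334f11b8bad",
  csv = "urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8"
)
ids$package <- package_id_for(ids$eml)
```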
Yay!
> a = getD1Object(cli, pid[["csv"]])
checkServerTrusted - RSA
checkServerTrusted - RSA
=== Trying Location: urn:node:KNB (https://knb.ecoinformatics.org/knb/d1/mn/v1/object/urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8)
checkServerTrusted - RSA
@@ D1Object-class:R initialize as something else
@@ D1Object-class:R initialize with jobjRef
> asDataFrame(a)
theData is textConnectionconnection
river spp stg ct day
1 SAC Oncorhynchus tshawytscha smolt 293 2013-09-01
2 SAC Oncorhynchus tshawytscha parr 410 2013-09-01
3 AM Oncorhynchus kisutch smolt 210 2013-09-02
- Um, I'm not sure what the equivalent of `asDataFrame` is for the XML file. For `reml` at least I think we might prefer a wrapper to `getD1Object`/`getD1Package` so that we can read in EML-annotated data in the same way from KNB as we do with local files in the `eml_read` function.
I think I still need some advice on how to write an appropriate test case. The server detects that my files have already been uploaded based on matching metadata:
* SystemMetadata indicates that this object was already created (uploaded Tue Oct 29 13:12:14 PDT 2013). Skipping create.
even when the test has generated a new ID, which I guess is good, though it's not obvious to me what's happening. Am I creating multiple ids that point to the same file this way? What's the best way to go about writing tests so that (a) they don't trigger this error, and (b) I don't end up writing a lot of junk to KNB?
from eml.
That's great, @cboettig! Glad to see it's working, and sorry you hit the server while we were migrating it from one IP address to another. Should be good to go now.
About `packageId`, we have always used the ID of the EML document there, and use that as a surrogate for the package. There are arguments that this isn't right given our new `DataPackage` model, but history is trumping novelty for the time being for us. Your way of generating IDs for the packages sounds great and matches ours.
I need to turn down the debug output from the R Client -- kinda annoying that it spits out all of that garbage. Need to work on that, so I entered a ticket.
We don't have a good viewer for XML in the `dataone` package -- I agree it would be great to automatically create a `reml` instance from it so it is more manageable in R. I'm open to suggestions on how to do that. As I've mentioned, I don't think asDataFrame() belongs in the `dataone` package anyways.
About your error saying that the object was already created -- there is something wrong there. It's perfectly legitimate to upload two identical copies of an object that differ only by identifier. There may be a bug lurking here. Can you describe more precisely which part of the code triggers this message?
Regarding testing, it would be best to not upload test content to the KNB (and to archive anything that isn't real science data so it doesn't show up in searches). We run a variety of testing servers that you can use instead of the KNB. They, of course, are not stable like the KNB because we clear content and reinstall them fairly regularly, but they serve the purpose of allowing you to generate a lot of test data without swamping the view normal scientists see in the KNB. We provide an overview of the environments which is good background but the details of the nodes are now wrong. You can find out what nodes are available in each environment using the nodes service:
- Production: https://cn.dataone.org/cn/v1/node
- Staging: https://cn-stage.test.dataone.org/cn/v1/node
- Sandbox: https://cn-sandbox.test.dataone.org/cn/v1/node
- Dev: https://cn-dev.test.dataone.org/cn/v1/node
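For use in test setup, the list above could be kept as a small lookup table. This is only a sketch in base R: `node_list_url()` is a hypothetical helper, and the short names `PROD` and `STAGING` are assumptions chosen for illustration (the thread itself only shows `"DEV"` and `"SANDBOX"` being passed to `D1Client`).

```r
# Node-list endpoints copied from the list above. The names on the left are
# labels chosen for this sketch.
d1_node_list_urls <- c(
  PROD    = "https://cn.dataone.org/cn/v1/node",
  STAGING = "https://cn-stage.test.dataone.org/cn/v1/node",
  SANDBOX = "https://cn-sandbox.test.dataone.org/cn/v1/node",
  DEV     = "https://cn-dev.test.dataone.org/cn/v1/node"
)

# Hypothetical helper: return the node-list URL for an environment name,
# or stop with a message listing the known names.
node_list_url <- function(env) {
  env <- toupper(env)
  if (!env %in% names(d1_node_list_urls)) {
    stop("Unknown environment: ", env, " (expected one of ",
         paste(names(d1_node_list_urls), collapse = ", "), ")")
  }
  unname(d1_node_list_urls[env])
}

node_list_url("dev")
```

The returned URL can be opened in a browser to see which member nodes are currently registered in that environment.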
from eml.
@mbjones very good. Can I select these environments from the dataone R client? From the docs it's not clear whether sandbox or staging would be preferred for unit testing.
from eml.
@cboettig Yes, you can do so when setting up the client. Don't be surprised if the test environments are sometimes not working, as they go up and down a lot. Nevertheless, here's how you would use the 'DEV' environment.
# Initialize a client to interact with DataONE
mn_nodeid <- "urn:node:mnDemo5" # MN for DEV env
cli <- D1Client("DEV", mn_nodeid)
And I updated a full example here:
https://github.com/mbjones/opensci_r_esa_2013/blob/master/dataone-r/dataone-write-pkg.R
from eml.
I get the error
Error: The provided mnNodeid value is invalid for the DataONE environment
with both
mn_nodeid <- "urn:node:mnDemo5" # MN for DEV env
cli <- D1Client("DEV", mn_nodeid)
and
mn_nodeid <- "urn:node:mnSandboxUCSB1"
cli <- D1Client("SANDBOX", mn_nodeid)
Is that just because the environments are down?
from eml.
I got that error yesterday too. I quit R Studio and relaunched, and then it worked fine. I just tried it now and it's working for me. The problem seems to be that R loads Java classes statically via a JNI interface, so I'm thinking that the library is caching the old environment and therefore not seeing the MN you request. There's no way in R to cause the Java library to reload. I can consistently change environments as long as I restart R in between trying to switch environments with the D1Client. Can you try restarting your R/Java environment and trying again? I opened an issue for this: https://redmine.dataone.org/issues/4151
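Until that is fixed, a wrapper could at least fail with a clearer message. The following is a sketch in base R only; `check_d1_env()` and `.d1_session` are hypothetical names, and nothing here touches the Java side. It simply remembers the first environment requested in the session.

```r
# Remember which DataONE environment this R session first asked for.
.d1_session <- new.env()

# Hypothetical guard: call before creating a client. It stops with a clear
# message if the session already used a different environment, since (per
# the comment above) switching only works after restarting R.
check_d1_env <- function(env) {
  previous <- .d1_session$env
  if (!is.null(previous) && !identical(previous, env)) {
    stop("This R session already used the '", previous,
         "' environment; restart R before switching to '", env, "'.")
  }
  .d1_session$env <- env
  invisible(env)
}

check_d1_env("DEV")  # first use: recorded
check_d1_env("DEV")  # same environment again: fine
```

A call such as `check_d1_env("SANDBOX")` later in the same session would then stop with the restart hint instead of the less obvious "mnNodeid value is invalid" error.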
from eml.
Cool, restarting R worked. I can run the upload function successfully, but get an error on `getD1Object` on the resulting id for, say, the uploaded csv file:
csv <- getD1Object(cli, pid[["csv"]])
With message and trace:
checkServerTrusted - RSA
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
org.dataone.service.exceptions.NotFound: No system metadata could be
found for given PID: urn:uuid:1ae5ec40-0fcf-49ce-aae8-fe71669e7594
Enter a frame number, or 0 to exit
1: getD1Object(cli, pid[["csv"]])
2: getD1Object(cli, pid[["csv"]])
3: .local(x, identifier, ...)
4: J("org/dataone/client/D1Object")$download(jPid)
5: .jrcall(x@name, name, ...)
6: .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, .jcast(if
(i
7: .jcheck(silent = FALSE)
from eml.
Great! Did you wait the required time for synchronization to occur to DataONE before looking for the object? Usually about 3-4 minutes? Each MN decides how long the interval is between sync calls with DataONE; Metacat defaults to 3 minutes. So, no object will show up on DataONE until the sync has occurred after the create.
The R client depends on the CN to find an object (for various good reasons), but we could also make it possible to look directly on the MN for the created object, which is available immediately after create(). It's probably a good change.
from eml.
I didn't wait. Working now. I can add a `sleep` command into the unit test I suppose. I'm just using `getD1Object` in the unit test as a way of confirming that the data was successfully uploaded in the first place, so perhaps there's a more sensible query I can use anyhow.
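Rather than a fixed sleep, a test could poll until the object becomes visible. This is a minimal sketch in base R; `wait_for()` is a hypothetical helper, and the default 300-second budget and 15-second interval are assumptions loosely based on the 3-4 minute sync time mentioned above.

```r
# Hypothetical helper: call `fetch()` repeatedly until it succeeds or the
# time budget runs out. `fetch` is any zero-argument function that signals
# an error while the object is not yet available.
wait_for <- function(fetch, timeout = 300, interval = 15) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # Give up if another wait would take us past the deadline.
    if (Sys.time() + interval > deadline) {
      stop("Gave up after ", timeout, " seconds: ", conditionMessage(result))
    }
    Sys.sleep(interval)
  }
}

# In a test this might look like (assumes `cli` and `pid` exist as above):
# csv <- wait_for(function() getD1Object(cli, pid[["csv"]]))
```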
So, looks like we now have a working publish method for the KNB. A few issues still to attend to:
- The function should probably be updating the EML file to include the download URL for the CSV and EML file itself, right? I gather the full URL can be inferred from the identifier once the D1Client and mn_node have been specified?
- The function assumes the certificate is already available in /tmp/, and thus I haven't wrapped any of the authentication steps; e.g.
cm <- CertificateManager()
downloadCert(cm)
getCertExpires(cm)
even though we're dynamically loading the `dataone` library inside the function call. This is the same way we handle `figshare` -- neither package is strictly a dependency and is instead only on the "suggests" list. The documentation simply directs the user to install those packages and consult their docs on how to set up authentication. The only difference is that for figshare this is more of a 'set and forget' process, where you copy some API keys from the web client into your .Rprofile once and then you're set. It looks like the dataone certificates expire in 24 hrs(!?), so it's more of a burden there.
- Might consider adding some user prompts about details like waiting the 3 minutes before querying the objects (not that we provide any query methods from reml yet anyway -- but that's a separate issue -- we could just read in the EML file directly from its published URL).
Other feedback on the function is welcome! See the function definition and the test example (version stable links, devel branch).
from eml.
The download URL may change over time, and we would like to protect against that so the metadata remains useful. This can be due to changes in the URL that is used for services on a node, or because a node is unavailable temporarily but replica copies may be available on other nodes, or because a node is defunct and a replica copy of an object on another node has become the authoritative copy. DataONE tracks multiple copies of objects, and knows which are available at any given moment. DataONE provides its `resolve()` service to report the current locations of objects. The `dataone` R package uses the resolve service to find out which nodes contain a copy of an object, and then downloads a copy from one of those, rather than a static URL. (This is why you currently have to wait for the object to be reported to the DataONE Coordinating Node, because that is what allows the resolve() service to know about the object.)
So, a couple things to consider:
- Although you can predict the download URL for a node using its reported baseUrl (e.g., https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.290.8), the baseUrl is almost guaranteed to change over the course of a few years.
- The DataONE `resolve()` service should represent a stable URL for an object. For example, for the DOI above, resolve run from a browser will redirect to the same knb URL above, and for a user agent requesting XML, will provide a list of locations for that object. So, for https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8, a browser will return the EML document via a redirect, while through curl you'll see the resolve response.
$ curl -H "Accept: text/xml" https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8
<?xml version="1.0" encoding="UTF-8"?>
<d1:objectLocationList xmlns:d1="http://ns.dataone.org/service/types/v1">
<identifier>doi:10.5063/AA/nceas.290.8</identifier>
<objectLocation>
<nodeIdentifier>urn:node:CN</nodeIdentifier>
<baseURL>https://cn.dataone.org/cn</baseURL>
<version>v1</version>
<url>https://cn.dataone.org/cn/v1/object/doi:10.5063%2FAA%2Fnceas.290.8</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:KNB</nodeIdentifier>
<baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
<version>v1</version>
<url>https://knb.ecoinformatics.org/knb/d1/mn/v1/object/doi:10.5063%2FAA%2Fnceas.290.8</url>
</objectLocation>
</d1:objectLocationList>
So, you might consider adding the Resolve URI for the online/url field in EML as that should over time be the most stable URL for an object (notwithstanding issues such as DataONE failure, which is a possibility with any repository, but less likely for DataONE given our institutional diversity).
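As a rough illustration, the resolve URI can be built from an identifier in base R. `resolve_url()` is a hypothetical helper, not a `dataone` function; the default base URL is the production CN endpoint from the curl example, and leaving `:` unencoded is a choice made here only to match the URLs shown in this thread.

```r
# Hypothetical helper: build the CN resolve URL for an identifier.
# Pass the raw identifier (not an already percent-encoded one), because
# URLencode() leaves strings that already contain %XX escapes untouched.
resolve_url <- function(pid, cn_base = "https://cn.dataone.org/cn/v1") {
  # Percent-encode reserved characters such as "/" so the identifier fits
  # into a single path segment.
  encoded <- URLencode(pid, reserved = TRUE)
  # Keep ":" literal to match the URLs shown in this thread.
  encoded <- gsub("%3A", ":", encoded, ignore.case = TRUE)
  paste0(cn_base, "/resolve/", encoded)
}

resolve_url("doi:10.5063/AA/nceas.290.8")
```

For the test environments, `cn_base` would be swapped for the corresponding CN, e.g. `"https://cn-dev.test.dataone.org/cn/v1"`.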
from eml.
So when publishing to the KNB via the `dataone` package, would it then be advisable to encode the resolve URL as the download URL for each object?
Or is it in fact preferable to omit the download URLs entirely for content on the KNB, since the file should be accessed by its identifier rather than a download URL?
e.g. would it be advisable to have the `<physical><distribution>` state something like:
<online>
<url function="download">https://cn.dataone.org/cn/v1/resolve/urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8</url>
</online>
(and then presumably we would substitute in https://cn-dev.test.dataone.org/ etc for the test case using the dev server?), or should we avoid writing such a url?
On the reading end, sounds like we should teach `eml_read` to accept identifiers rather than URLs, e.g. you would do something like:
eml_read("doi:10.5063%2FAA%2Fnceas.290.8")
Though we can also make it so that forwarding URLs work as well, e.g.
eml_read("https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8")
would work as long as that URL also resolves...
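One way to support both call styles is to classify the argument before deciding how to fetch it. A sketch in base R, where `classify_source()` is a hypothetical helper and the three categories are assumptions about what a reader like `eml_read` would need to distinguish:

```r
# Hypothetical helper: decide how a reader such as eml_read() might treat
# its argument. Returns "url", "file", or "identifier".
classify_source <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  if (grepl("^https?://", x)) {
    "url"          # fetch directly, following any redirects
  } else if (file.exists(x)) {
    "file"         # parse the local EML file
  } else {
    "identifier"   # hand off to the dataone client for resolution
  }
}

classify_source("https://cn.dataone.org/cn/v1/resolve/doi:10.5063%2FAA%2Fnceas.290.8")
classify_source("doi:10.5063/AA/nceas.290.8")
```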
from eml.
Yes, I think it would be advisable to provide the CN resolve URL for an object in its `online/url` field as you propose. This will help clients that do not know that these objects are on the DataONE network and accessible by DataONE services. Clients who do know will be able to use the id attribute to find the identifier for use in DataONE service calls. Others will find the URL handy. And yeah, it's fine to use cn-dev for testing URLs.
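A sketch of generating that `online` fragment in base R follows. `online_element()` is a hypothetical helper and `cn_host` is an assumed parameter for switching between the production CN and a test CN such as cn-dev; identifiers containing `/` would need percent-encoding before being passed in.

```r
# Hypothetical helper: build the <online> fragment for an object, pointing
# at the CN resolve service. `cn_host` selects the environment.
online_element <- function(pid, cn_host = "cn.dataone.org") {
  url <- paste0("https://", cn_host, "/cn/v1/resolve/", pid)
  # Escape characters that are special in XML text ("&" must go first).
  url <- gsub("&", "&amp;", url, fixed = TRUE)
  url <- gsub("<", "&lt;", url, fixed = TRUE)
  url <- gsub(">", "&gt;", url, fixed = TRUE)
  paste0("<online>\n",
         "  <url function=\"download\">", url, "</url>\n",
         "</online>")
}

# Test case against the dev CN, using the CSV identifier from this thread.
cat(online_element("urn:uuid:931dea76-2f84-4736-9875-4beade1d62a8",
                   cn_host = "cn-dev.test.dataone.org"), "\n")
```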
So I think `eml_read` should be able to take an identifier, and it could easily delegate to the `dataone` library to handle all of the MN.get() calls. You could, for example, use something like this, with of course error checking:
library(dataone) # D1Client(), getD1Object(), getFormatId(), getData()
library(XML)     # xmlParse()

eml_read <- function(pid) {
  # Set up a D1Client class and stuff ahead of time
  cli <- D1Client()
  # then get the metadata object from dataone package, which handles the resolve and getting the data bytes
  obj0 <- getD1Object(cli, pid) # Be sure to check for errors before proceeding
  formatId <- getFormatId(obj0)
  metadata <- xmlParse(getData(obj0))
  # if the formatId is some form of EML, send it off for further parsing in reml
}
from eml.
I've created a little R repository to play with the REST API further: ropensci-archive/dataone-restful#1
from eml.
Identifiers are generated for both external csv data and the eml, and `eml_read` can access the data based on these identifiers. We may eventually revisit this issue if we move to the REST-based KNB implementation and end up needing to call the functions slightly differently, but the top-level api for `publish_eml` with KNB as the destination repository will probably remain unchanged, so I think we can now close this.
from eml.
@mbjones looks like the `eml_knb` function isn't compatible with `dataone` 2.0.0 (e.g. `D1Client` not found). Clearly I need to get up to speed on the new package interface etc.
I did try a quick work-around with direct REST calls to the API. It looks like the DEV node I was using for testing is down (502 error): https://mn-demo-5.test.dataone.org/knb/d1/mn, but checking https://cn-dev.test.dataone.org/cn/v1/node I realized I could just switch from `5` to `6`. I could get non-authenticated calls to work this way, but when I tried uploading an example file using a CILogon certificate I got 'permission denied' (401 error). Maybe not worth debugging this since I should just be able to use the dataone package, but I'm not sure why it's not working all the same. (Example of this failing test here: https://github.com/cboettig/dataone-lite/blob/master/tests/testthat/d1_upload.R )
from eml.
@cboettig yeah, that's only because we are working from the low-level API back up towards the top in terms of refactoring. So, D1Client will be the last API to be refactored. I'll look into why it's not working.
from eml.
Uploading EML to KNB, etc. should now be handled in the `dataone` package; there is probably no need for an additional wrapper on the EML end (which also keeps dependencies lighter).
from eml.