Comments (23)
Note that the schema does not completely specify all of the restrictions of the EML Specification. The conditional restrictions on the uniqueness of IDs, which depend on the value of the scope attribute, were simply not expressible in XML Schema. So we wrote the EML Validator to accompany the spec; it fully validates the document. It is available as a web service (http://knb.ecoinformatics.org/emlparser/) and can be run from a Java API call after compiling the EMLParser validation class. Source code is in the EML SVN repo: https://code.ecoinformatics.org/code/eml/trunk/src/org/ecoinformatics/eml/EMLParser.java
from eml.
Fantastic, was wondering about this.
I will have to see if I can figure out the Java API call; we should have the wrappers available in R, but I have virtually no Java experience. I suppose we could alternatively bundle the Java in the R package and validate locally.
Looks like this gives us three levels of validation: whether the EML parses, whether it validates according to the schema, and whether it passes the EML Validator checks for ids. Any advice on the right workflow / user interface for this would be good -- e.g. I'm not sure that we would want to run the EML Validator every time a user tries to read in or write out some EML -- it might be sufficient to know that it parses. Still, we want to provide support for these tools...
Also, I'm not actually clear on how / when we should be going about generating element ids in the first place. The only place I currently have element ids is on <attribute> nodes (and one on the <additionalMetadata id = 'figshare'> for specifying what metadata is exposed to figshare's database). Any advice on how to come up with <attribute> ids? (Currently I create a hash from the <attributeDescription> text, just as a placeholder -- obviously this is not what we actually want.)
On Sat, Jun 29, 2013 at 1:12 PM, Matt Jones [email protected]:
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
from eml.
Regarding the validator -- it doesn't check that much, so I think it would be much better to just reimplement those checks in R. The Java code just iterates across a bunch of XML elements, checking that the id attribute on those elements is unique within the document, and then checks that any <references> elements in the document point at an @id in the document -- i.e., there are no dangling pointers. The list of elements that are checked is in a config file located here: https://code.ecoinformatics.org/code/eml/trunk/lib/config.xml
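For illustration, a minimal sketch of those two checks in R might look like the following. It assumes the XML package, looks at every element that carries an id rather than only the elements listed in config.xml, and ignores the scope attribute entirely; `check_eml_ids` is a made-up name, not something that exists in reml or in EMLParser.

```r
library(XML)

# Hypothetical helper (not part of reml or EMLParser): report duplicated ids
# and <references> values that do not match any id in the document.
check_eml_ids <- function(doc) {
  # Every id attribute in the document. The Java validator restricts this to
  # the elements named in config.xml; this sketch does not.
  ids  <- as.character(unlist(xpathSApply(doc, "//*[@id]", xmlGetAttr, "id")))
  # The text content of every <references> element.
  refs <- trimws(as.character(unlist(xpathSApply(doc, "//references", xmlValue))))
  list(
    duplicated_ids = unique(ids[duplicated(ids)]),  # ids used more than once
    dangling_refs  = setdiff(refs, ids)             # references with no target
  )
}

# A small made-up fragment: "a1" is used twice and "p9" points nowhere.
example <- xmlParse('<dataset>
  <creator id="p1"><individualName><surName>Jones</surName></individualName></creator>
  <contact><references>p1</references></contact>
  <attribute id="a1"/>
  <attribute id="a1"/>
  <metadataProvider><references>p9</references></metadataProvider>
</dataset>', asText = TRUE)

str(check_eml_ids(example))
```

A document would pass these checks when both returned vectors are empty. Handling the scope-dependent uniqueness rules would need more logic than is shown here.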
About when to check validity -- in Morpho we check validity when we try to upload to an external repository, as we want to ensure that everything is in good working order before distributing the document.
The only time you really need to generate ids is when you want to have a handle to reference. The main place for that is for <attribute> elements. The other places where ids are commonly used are unit definitions in STMML in additionalMetadata, and globally-scoped identifiers for individuals (e.g., providing an ORCID ID for someone when listing them as a Creator).
from eml.
@duncantl added code to validate EML from R using the online Java tools from @mbjones in commit 1722a4b
from eml.
Philosophical point: Validation is really a concern for us / developers, not for the end users. Our programmatically generated EML should always be valid. Meanwhile, if we read in EML that is not valid, what are we going to do, just throw an error? Better to make the most of it.
Note: we can validate in R using the XML package (via libxml2), using the xmlSchemaValidate() function, though if I understand correctly, the code added above should also perform the additional validation that @mbjones describes.
- Add a unit test that includes validation
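As a rough idea of what a schema-validation test could look like, here is a self-contained sketch. It uses a tiny stand-in schema rather than eml.xsd, and it assumes that a `$status` of 0 from xmlSchemaValidate() means the document is valid; `is_schema_valid` is a hypothetical helper name.

```r
library(XML)

# A tiny stand-in schema and two documents, written to temporary files.
# These are illustrative only; they are not EML.
xsd <- tempfile(fileext = ".xsd")
writeLines('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:string"/>
</xs:schema>', xsd)

good <- tempfile(fileext = ".xml")
writeLines("<title>Example</title>", good)

bad <- tempfile(fileext = ".xml")
writeLines("<name>Example</name>", bad)  # root element not declared in the schema

# Hypothetical helper: TRUE when the validation status is 0.
is_schema_valid <- function(schema, doc) {
  xmlSchemaValidate(schema, doc)$status == 0
}

is_schema_valid(xsd, good)
is_schema_valid(xsd, bad)
```

In a real test the schema argument would be a local copy of eml.xsd and the document would be EML generated by the package.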
from eml.
There are two examples in an if(FALSE) {} block at the top of the file containing the functions. These can be used for a unit test.
Also, we should either raise an error or put a class on the result of processValidateResponse if either of the tests did not pass.
from eml.
(Not closed at all!)
from eml.
The recursion problem in the XMLSchema package where a schema imports another schema which imports the first one is working now. Use inline = FALSE in the call to readSchema.
library(XMLSchema)
x = readSchema("~/Downloads/eml-2.1.1/eml.xsd", inline = FALSE)
This cures the parsing and processing of the types. It remains to be seen if it breaks anything else. And of course there will still be issues with the actual type descriptions it has created.
from eml.
@duncantl Great, readSchema works for me. Hitting an error when I then try defineClasses:
Note: method with signature 'RestrictedStringDefinition#list' chosen for function 'resolve',
target signature 'RestrictedStringDefinition#SchemaCollection'.
"SchemaType#SchemaCollection" would also be valid
Error in .getClassFromCache(Class, where) :
attempt to use zero-length variable name
Also, xmlSchemaValidate() seems unhappy about my eml files, even though they validate fine with your new eml_validate function...
> xmlSchemaValidate("inst/xsd/eml.xsd", "inst/doc/my_eml_data.xml")
$status
[1] 1845
$errors
[[1]]
$msg
[1] "Element '{eml://ecoinformatics.org/eml-2.1.0}eml': No matching global declaration available for the validation root.\n"
$code
XML_SCHEMAV_CVC_ELT_1
1845
$domain
XML_FROM_SCHEMASV
17
$line
[1] 2
$col
[1] 0
$level
XML_ERR_ERROR
2
$filename
[1] "inst/doc/my_eml_data.xml"
attr(,"class")
[1] "XMLError"
attr(,"class")
[1] "XMLStructuredErrorList"
attr(,"class")
[1] "XMLSchemaValidationResults"
from eml.
I am looking into the defineClasses() and defClass() issues.
As for the validation error, I suspect the error message is correct, although I can't tell which schema you are using. I imagine you are validating against the eml-2.1.1 schema while my_eml_data.xml uses the eml-2.1.0 namespace.
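One quick way to confirm a mismatch like this is to compare the namespace of the document's root element with the schema's targetNamespace. The sketch below assumes the XML package; `same_namespace` is a hypothetical helper, and the two inline documents are stripped-down stand-ins for the real files.

```r
library(XML)

# Hypothetical helper: does the root element's namespace URI match the
# targetNamespace declared on the schema's root element?
same_namespace <- function(doc, schema) {
  doc_ns    <- as.character(xmlNamespace(xmlRoot(doc)))
  schema_ns <- as.character(xmlGetAttr(xmlRoot(schema), "targetNamespace"))
  length(doc_ns) == 1 && length(schema_ns) == 1 && doc_ns == schema_ns
}

# Stand-ins: a document in the 2.1.0 namespace and a schema targeting 2.1.1.
doc <- xmlParse('<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.0"/>',
                asText = TRUE)
schema <- xmlParse('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="eml://ecoinformatics.org/eml-2.1.1"/>', asText = TRUE)

same_namespace(doc, schema)
```

When the two differ, the schema has no global declaration for the document's root element, which is what the "No matching global declaration" message reports.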
from eml.
Indeed! We've moved reml to writing 2.1.1 by default (#36), and now our example generated EML passes xmlSchemaValidate() against the 2.1.1 eml.xsd, as expected.
Sounds good -- keep us posted on defineClasses()
from eml.
@mbjones is there an external URL I can use to validate against? (Just for the schema files; I know we can use http://knb.ecoinformatics.org/emlparser/ but I have to figure out what went wrong in the RHTMLForm function first.)
Currently I have a local copy of the schema downloaded that I use, but a more portable solution would be better.
from eml.
@cboettig Not quite sure what you are asking. The parsing service can be called at:
http://knb.ecoinformatics.org/emlparser/parse
as long as you do an HTTP POST and provide the proper parameters. For example, with curl you could do:
curl -F action=textparse -F doctext=@/Users/jones/Desktop/eml-sample.xml http://knb.ecoinformatics.org/emlparser/parse
from eml.
Is there a canonical URL for eml.xsd?
I only see the option to download the xsd files as a tarball, not browse them as web files....
Carl Boettiger
http://carlboettiger.info
On Sep 3, 2013 3:51 PM, "Matt Jones" [email protected] wrote:
from eml.
Ah. Now I see. No, we do not provide one, because it is a security hole for applications to use an external schema for validation (similar to a SQL injection attack, this is an XML injection, also called XML External Entity (XXE) processing). Although our copy may be secure now, if many applications point at it, then it becomes an attractive and central point of attack, and if our host is compromised, then all apps that point at our schema URL would potentially be compromised as well. Compromises can lead to reading sensitive data on your computer (such as files like /etc/passwd), injection of malicious content into your application, and other maladies. So, we try to make it hard for people to be insecure. In general, trusting xsi:schemaLocation is allowing a third party to inject data into your process -- you are better served by downloading the schema, inspecting it, and if it is trustworthy, pointing at your local copy for validation.
from eml.
@mbjones I'm not sure I follow the logic. Wouldn't that mean even more so by extension that webpages shouldn't include .js from anywhere, shouldn't include CSS from anywhere, etc, not even from the originating website because that would make it a target for being compromised? W3C schemas do give the schema location, and xs:imports even require it. For example, here's the PROV schema: http://www.w3.org/ns/prov-core.xsd Are you suggesting they are giving a bad example?
from eml.
Yes, indeed -- you should only include JavaScript from a highly trusted source. That is even more of an issue than XML. Even when you don't intentionally include untrusted JS code, people develop clever XSRF and related attacks just to inject JS into your pages. It's a bad thing. I inspect any JS I include, lock it down as a local copy, and don't rely on external copies, as I would be trusting the security and goodwill of that host.
The W3C does know about these XML injection issues, and they influenced its web architecture documents. The W3C TAG discussed these issues, and the outcome was expressed in the web architecture principle that reference does not imply dereference; although the security implications are mentioned, they are glossed over in that document. What they are saying is that just because someone provides an xsi:schemaLocation in their document is not an indication that you should dereference it in your parsing and validation. If that were so, then every document author would have the ability to inject potentially harmful content into your process. Rather, xsi:schemaLocation is defined as a 'hint' to help someone locate a schema for a namespace that is unknown, but blindly importing it is certainly an exploitable security hole. I first learned of these issues from a talk in 2000 by David Megginson from the W3C XML Working Group, but they persist today (and are actually easier to exploit, as there are now more injection vectors). The recommended practice to avoid XXE attacks is to download and inspect DTDs and schemas yourself, and then set up a catalog that maps namespaces to the vetted local schema copy for validation, thereby avoiding potential injection attacks. XML (and SGML before it) catalogs are a common technology, and every XML and XSLT engine I have seen supports them in its API. We use them in Metacat, and simply register each schema we wish to support with its associated xsd or dtd that we have inspected and stored locally. This is pretty simple and avoids user-driven content injection.
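To make the idea concrete for the R side, here is a very small sketch of a catalog-like lookup. It is not a real XML catalog and not how Metacat does it; it is just a named vector that maps namespaces to local schema copies. The file paths are placeholders for copies you have downloaded and inspected yourself, and `local_schemas` and `local_schema_for` are hypothetical names.

```r
library(XML)

# Hypothetical registry of vetted local schema copies, keyed by namespace.
# The paths are placeholders, not files shipped with any package.
local_schemas <- c(
  "eml://ecoinformatics.org/eml-2.1.0" = "schemas/eml-2.1.0/eml.xsd",
  "eml://ecoinformatics.org/eml-2.1.1" = "schemas/eml-2.1.1/eml.xsd"
)

# Return the registered local schema path for a document. Any
# xsi:schemaLocation hint in the document is deliberately ignored.
local_schema_for <- function(doc, registry = local_schemas) {
  ns <- as.character(xmlNamespace(xmlRoot(doc)))
  if (length(ns) != 1 || !(ns %in% names(registry))) {
    stop("No vetted local schema registered for this document's namespace")
  }
  registry[[ns]]
}

# A stand-in document whose schemaLocation hint points at a remote URL.
hinted <- xmlParse('<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://example.org/eml.xsd"/>',
  asText = TRUE)

local_schema_for(hinted)
```

The returned path could then be passed to xmlSchemaValidate() as the schema argument, so that validation only ever reads the local copy.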
Our experience was that when we provided a resolvable copy of the schema, we started seeing many EML documents pointing at it (ok), and people automatically dereferencing it (bad). So we took it down. I think it's reasonable to argue that in principle we should have a resolvable copy of the schema, but in practice it led to bad dereferencing practices that we wanted to curtail in our community. For example, I think @cboettig was not aware that it's a potential security hole to directly dereference the EML xsd, and would not have found that out if we had placed eml.xsd at a resolvable location. So, with this in mind, should we put up a resolvable copy?
Related pieces:
https://www.owasp.org/images/5/5d/XML_Exteral_Entity_Attack.pdf
http://www.soatothecloud.com/2008/08/dont-follow-that-schemalocation.html
http://www.slideshare.net/qqlan/bh-ready-v4
http://www.securityfocus.com/archive/1/297846/30/0/threaded
from eml.
@mbjones @hlapp Thanks both for the input and discussion here. A bit over my head but I'm trying to follow along. @mbjones Stupid question: the attack requires that the attacker alter the schema file that lives at the URL given?
from eml.
Yes, the attacker must manipulate one of the information sets that will be injected into the parsing process. Within the document itself, these include any external entities that are defined, and any of the multiple external files that can be included by reference to a third-party URI. In the specific case of the namespace, the xsi:schemaLocation will point at a schema document, and the attacker would need to modify that document, which would potentially compromise multitudes of computers if they all point at that single schema file.
from eml.
@mbjones I know about the semantics of xsi:schemaLocation, and why it is only a hint for how to obtain the schema definition, but not required to be dereferenceable. What I was asking is whether W3C is giving a bad example by providing in its own XSD documents xsi:schemaLocation and xs:import URIs that actually do dereference, to the correct schema document, even though as you say it's not required. I'm also sure that lots of people do dereference their schemaLocation URIs, and yet they haven't found this undesirable.
I also didn't suggest that one include JS, or XSD for that matter, from arbitrary and untrusted sources. What I did say is that by your logic JS included on an NCEAS-served website that's loaded from the NCEAS server that serves the website is bad and not to be trusted, because NCEAS servers could get compromised and hacked. While that is of course a possibility, and that machines get hacked every day is a fact, surely I shouldn't conclude from that not to visit NCEAS websites in my browser, but download them first via cURL and then inspect them by hand for malicious content?
I don't find anything wrong or hazardous with trusting sites and URI locations provided by well-known institutions (such as, for example, the W3C), and I (and I don't think I'm alone with this) expect that these institutions will strive to apply best sysadmin practices to prevent such compromise. We certainly do so at NESCent, and we take security and compromise detection very seriously, and I would expect NCEAS to do no less. So I'm sorry, but I can't see the case for erecting barriers to developers who want to develop applications with your schemas. The W3C certainly doesn't.
from eml.
The eml_validate() function is now working again. The problem seems to have been that the format of the HTML changed from using h2 to h4 for the relevant headers.
So nothing to do with RHTMLForms, just how we process the HTML response.
Ideally, the validator would allow us to request the response as XML or JSON and give it to us without the HTML formatting.
from eml.
That would be a good change. We didn't originally design the validator that way, but it would be fairly easy to change it to output more structured data on request.
from eml.
Since we have a working eml_validate function at this point, I think we can close this issue. See #46 on workflow for validation.
from eml.