Comments (7)

mbjones commented on July 18, 2024

No worries, @mmfink. The difference is in how the two fields are defined in the schema. The EML schema says storageType can be any string value, so its behavior is determined by the code in the EML package. In contrast, the NumberType definition in the EML schema restricts the field to one of four enumerated values through an xs:restriction. You can see the definition here: https://eml.ecoinformatics.org/schema/eml-attribute_xsd.html#NumberType

The R package might let you put anything in that field, but any value other than natural, whole, integer, or real should produce a schema validation error when you try to validate with the R package.

cboettig commented on July 18, 2024

set_attributes() uses a narrow list of possible R types to select the measurementScale. Numeric measurements in EML can be "interval" or "ratio", though set_attributes() only maps numeric data to "ratio", so it doesn't map integer differently.

Of course "integer" data could be used to encode interval data, or nominal or ordinal types in EML, or even dates! (A date in R is stored as a plain number underneath: typeof(Sys.Date()) is "double", while a factor is stored as an integer.) So as you see, the map between what EML uses for measurementScale and data encodings in R is not at all straightforward. Really, we have no business guessing just about any measurementScale from the R class to begin with.

But set_attributes(), like all set_ methods, is really just a convenience function that covers common cases, where it's reasonable for a user to assume that EML will have some notion to distinguish numeric values from dates, etc., so it makes a best attempt to map the R class to an EML class. measurementScale is a required element, so set_attributes() does its best to help users out by mapping R atomic types to measurementScales.

So that's all about measurementScale, but you asked about storageType. storageType is an optional EML element that accompanies measurementScale, but is actually a little closer to the notion of datatype (i.e. it's supposed to be an XML Schema storage type, xs:string, etc.), so while we were at it we went and mapped those too. Agreed that it would obviously be better to map an integer to xs:integer, but set_attributes() is taking a shortcut here: it tries to do the mapping based on the same choice it made for the measurementScale, which is obviously not great. But set_attributes() isn't inspecting the actual values of a data.frame; it is just getting the class passed in an argument. So there may be room to improve this a little (recognize integer as a separate class with a separate storage type, but still map it to a measurementScale of... interval? ratio?)

Note by default, a numeric is a double-precision floating point in R, witness:

 is.double(numeric(1L))

though perhaps confusingly, is.numeric(integer(1L)) is also true. Overall we are possibly better just omitting this, as the comment suggests. I'm not sure how to improve the guessing based on the simplified interface that set_attributes() presents to the user though.
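The type behaviour mentioned above can be checked directly in base R. This is only an illustrative sketch (no EML package involved); the variable names are made up for the example.

```r
# Base R only: how R reports the underlying storage of common column types.
x_num <- numeric(1L)   # default numeric vector
x_int <- integer(1L)   # explicit integer vector

is.double(x_num)        # TRUE: numeric defaults to double precision
is.numeric(x_int)       # TRUE: integers also count as "numeric"
typeof(x_int)           # "integer"

# Classes built on top of atomic types hide the storage type:
typeof(as.factor("A"))  # "integer" (factor codes)
typeof(Sys.Date())      # "double"  (days since 1970-01-01)
```

So the class a user passes in says little about how the values are actually stored, which is part of why guessing storageType from the class alone is fragile.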

In practical terms, storageType seems to have relational databases in mind, though it's certainly useful sometimes to know if read_csv() should default to integer or a float (though describing a csv file containing a number as having "(double) floating point precision" is perhaps misleading?) At the end of the day, where datatypes are important it's probably necessary to use a data format that encodes the type rather than rely on the metadata?

A user can always have more fine-grained control by constructing the EML more explicitly, e.g. with lists (or with https://github.com/cboettig/build.eml ).

mmfink commented on July 18, 2024

@cboettig Thank you for the well-considered reply; those are all good points. I had started out supplying storageType and col_classes to set_attributes() so that the code would fill in the measurementScale, but the validation code for storageType will not accept "integer".

So instead I have switched to supplying numberType (as "real" or "integer") and col_classes, as shown in the package vignette. That works, but I found it inconsistent that set_attributes() still filled in all numeric storageTypes as "float", regardless of numberType. Your points about how R regards numeric values perhaps make this all moot (as does your point about not being required to use set_attributes()). On the other hand, this could just be a matter of allowing the storageType validation code to accept "integer". 😃

cboettig commented on July 18, 2024

Yeah, I'd definitely love a PR that lets the col_classes argument of set_attributes() take 'integer' as an optional value and maps storageType to integer in that case. I'd also be happy with a PR that just drops any attempt to assign storage type based on the column class (because sometimes less is more when things are confusing, e.g. typeof(Sys.Date()) and typeof(as.factor("A"))).
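As a rough sketch of the first option, a lookup from column class to storage type might look like the following. The function name storage_type_for() and the lookup table are hypothetical, written for illustration only; this is not the EML package's internal code, and every value other than "float" for numeric is an assumption.

```r
# Hypothetical helper: map col_classes values to a storageType string.
# Not the EML package's implementation; the table below is illustrative.
storage_type_for <- function(col_classes) {
  lookup <- c(
    numeric   = "float",    # what the thread reports for numeric columns
    integer   = "integer",  # the proposed new case
    character = "string",   # assumed
    factor    = "string",   # assumed
    Date      = "date"      # assumed
  )
  unknown <- setdiff(col_classes, names(lookup))
  if (length(unknown) > 0) {
    stop("Unsupported col_classes: ", paste(unknown, collapse = ", "))
  }
  unname(lookup[col_classes])
}

storage_type_for(c("numeric", "integer", "character"))
```

Failing loudly on an unknown class, rather than silently falling back to "float", keeps the guess from being mistaken for a checked value.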

mbjones commented on July 18, 2024

The domain of a numeric value is really established in measurementScale/ratio/numericDomain/numberType, which should be set to integer if the values actually represent integers and not, for example, real values. The possible values for numberType are natural, whole, integer, and real, as defined in common number theory (or see the schema for the definitions).
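To make the four numberType values concrete, here is a minimal base R sketch that picks the narrowest one that fits a vector of observed values. The helper infer_number_type() is hypothetical and not part of the EML package. It assumes natural means 1, 2, 3, ... and whole means 0, 1, 2, ..., which should be checked against the schema definitions.

```r
# Hypothetical helper: choose the narrowest EML numberType for the values.
# Assumes natural = {1, 2, 3, ...} and whole = {0, 1, 2, ...}.
infer_number_type <- function(x) {
  x <- x[!is.na(x)]
  # No data, non-finite values, or fractional parts: fall back to "real".
  if (length(x) == 0 || any(!is.finite(x)) || any(x != floor(x))) {
    return("real")
  }
  if (all(x >= 1)) return("natural")
  if (all(x >= 0)) return("whole")
  "integer"
}

infer_number_type(c(3, 7, 12))    # "natural"
infer_number_type(c(0, 4, 9))     # "whole"
infer_number_type(c(-2, 0, 5))    # "integer"
infer_number_type(c(0.5, 2, 10))  # "real"
```

The observed values only give a lower bound on the domain: a column that happens to contain no negatives may still be an integer quantity, so the result is a suggestion for the metadata author, not a fact.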

eml:storageType is intended to support data types from multiple languages, and is an optional hint as to how to convert a value from a particular domain as defined in numericDomain (e.g., real numbers > 10 and <= 100) into a particular storage system or language. It is a string value and is meant to provide the type name (e.g., short) from a particular data typing system (e.g., the XSD Datatypes schema, the C language, the Java language, or the Postgres database). You could use R's type system (e.g., numeric would be perfectly fine). It can also be repeated, so I can say that values would be an int in Java or a short in C, but integer in R.

So, I think that getting the numberType right is more important than getting storageType right. Is the package setting integer for numberType correctly?

mmfink commented on July 18, 2024

@mbjones
Apologies if I am misreading the code, but numberType appears to be accepted without any validation or assignment, unlike storageType.
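For illustration, an early check against the four values the schema enumerates could look like this. check_number_type() is a hypothetical helper, not an existing EML package function; it only mirrors the enumeration named earlier in the thread.

```r
# Hypothetical pre-validation of numberType values before building EML.
valid_number_types <- c("natural", "whole", "integer", "real")

check_number_type <- function(number_type) {
  # NA entries are allowed here, e.g. for non-numeric attributes.
  present <- number_type[!is.na(number_type)]
  bad <- setdiff(present, valid_number_types)
  if (length(bad) > 0) {
    stop("Invalid numberType value(s): ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

check_number_type(c("real", "integer", NA))  # passes silently
```

A check like this would surface a typo such as "float" when the attribute table is built, instead of at schema validation time.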

mmfink commented on July 18, 2024

Thanks for accepting the PR. Closing.