ropensci / eml
Ecological Metadata Language interface for R: synthesis and integration of heterogeneous data
Home Page: https://docs.ropensci.org/EML
License: Other
Taxonomic, geographic, and temporal coverage are all common and rather essential metadata whose use we should illustrate.
This should include tools to generate coverage nodes from columns of the data frame: species names, lat/longs to bounding boxes, time frame from series of times.
Also include tools to summarize coverage metadata, including extraction from columns and extraction into a separate data.frame (or appropriate R spatial object).
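As a sketch of what such tools might look like, here is a hypothetical helper (not part of the package; the function name and the lat/lon/date/species column conventions are assumptions) that derives all three coverage types from data.frame columns:

```r
# Hypothetical helper (not part of the package): derive coverage metadata
# from data.frame columns. Column-name conventions are assumptions.
coverage_from_data <- function(df, lat = "lat", lon = "lon",
                               date = "date", taxon = "species") {
  list(
    geographic = c(west  = min(df[[lon]]), east  = max(df[[lon]]),
                   south = min(df[[lat]]), north = max(df[[lat]])),
    temporal   = range(as.Date(df[[date]])),
    taxonomic  = sort(unique(df[[taxon]]))
  )
}

obs <- data.frame(lat = c(38.5, 38.9), lon = c(-121.5, -121.0),
                  date = c("2013-05-01", "2013-05-31"),
                  species = c("Oncorhynchus tshawytscha",
                              "Oncorhynchus kisutch"),
                  stringsAsFactors = FALSE)
cov <- coverage_from_data(obs)
```

The list returned here would still have to be mapped onto the corresponding EML coverage nodes; the sketch only covers the extraction step.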
Currently units must come from the EML Standard Units list, already written according to the specification there (camelCase and all).
customUnits must be completely defined in STMML syntax, making them a bit more onerous to use than custom string types.
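For illustration, here is roughly what a custom unit definition looks like in STMML inside EML's additionalMetadata section. The unit name, description, and attribute values below are invented, and the stmml namespace URI differs between EML versions, so treat this as a hedged sketch rather than a validated fragment:

```xml
<additionalMetadata>
  <metadata>
    <unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml">
      <stmml:unit id="fishPerTrap" name="fishPerTrap"
                  unitType="otherUnitType" multiplierToSI="1">
        <stmml:description>Number of fish caught per trap</stmml:description>
      </stmml:unit>
    </unitList>
  </metadata>
</additionalMetadata>
```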
I've posted this question on Stack Overflow, but I wonder if I'm better coming straight here.
Here's the discussion so far on SO.
Ok, I'm trying to convert the following JSON data into an R data frame.
For some reason fromJSON in the RJSONIO package only reads up to about character 380 and then it stops converting the JSON properly.
Here is the JSON:
"{\"metricDate\":\"2013-05-01\",\"pageCountTotal\":\"33682\",\"landCountTotal\":\"11838\",\"newLandCountTotal\":\"8023\",\"returnLandCountTotal\":\"3815\",\"spiderCountTotal\":\"84\",\"goalCountTotal\":\"177.000000\",\"callGoalCountTotal\":\"177.000000\",\"callCountTotal\":\"237.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"74.68\"}\n{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}\n{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}\n{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}\n{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}\n{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landC
ountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}\n{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountTotal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}\n{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}\n{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}\n{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCountTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}\n{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"ret
urnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}\n{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\":\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}\n{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}\n{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}\n{\"metricDate\":\"2013-05-16\",\"pageCountTotal\":\"33136\",\"landCountTotal\":\"12821\",\"newLandCountTotal\":\"8755\",\"returnLandCountTotal\":\"4066\",\"spiderCountTotal\":\"65\",\"goalCount
Total\":\"192.000000\",\"callGoalCountTotal\":\"192.000000\",\"callCountTotal\":\"260.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"73.85\"}\n{\"metricDate\":\"2013-05-17\",\"pageCountTotal\":\"29564\",\"landCountTotal\":\"11721\",\"newLandCountTotal\":\"8191\",\"returnLandCountTotal\":\"3530\",\"spiderCountTotal\":\"213\",\"goalCountTotal\":\"166.000000\",\"callGoalCountTotal\":\"166.000000\",\"callCountTotal\":\"222.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.42\",\"callConversionPerc\":\"74.77\"}\n{\"metricDate\":\"2013-05-18\",\"pageCountTotal\":\"23686\",\"landCountTotal\":\"9916\",\"newLandCountTotal\":\"7335\",\"returnLandCountTotal\":\"2581\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"5.000000\",\"callGoalCountTotal\":\"5.000000\",\"callCountTotal\":\"34.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.05\",\"callConversionPerc\":\"14.71\"}\n{\"metricDate\":\"2013-05-19\",\"pageCountTotal\":\"23528\",\"landCountTotal\":\"9952\",\"newLandCountTotal\":\"7184\",\"returnLandCountTotal\":\"2768\",\"spiderCountTotal\":\"57\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"14.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"7.14\"}\n{\"metricDate\":\"2013-05-20\",\"pageCountTotal\":\"37391\",\"landCountTotal\":\"13488\",\"newLandCountTotal\":\"9024\",\"returnLandCountTotal\":\"4464\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"227.000000\",\"callGoalCountTotal\":\"227.000000\",\"callCountTotal\":\"291.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"78.01\"}\n{\"metricDate\":\"2013-05-21\",\"pageCountTotal\":\"36299\",\"landCountTotal\":\"13174\",\"newLandCountTotal\":\"8817\",\"returnLandCountTotal\":\"4357\",\"spiderCountTotal\":\"77\",\"goalCountTotal\":\"164.000000\",\"callGoalCountTotal\":\"164.000000\",\"call
CountTotal\":\"221.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.24\",\"callConversionPerc\":\"74.21\"}\n{\"metricDate\":\"2013-05-22\",\"pageCountTotal\":\"34201\",\"landCountTotal\":\"12433\",\"newLandCountTotal\":\"8388\",\"returnLandCountTotal\":\"4045\",\"spiderCountTotal\":\"76\",\"goalCountTotal\":\"195.000000\",\"callGoalCountTotal\":\"195.000000\",\"callCountTotal\":\"262.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.57\",\"callConversionPerc\":\"74.43\"}\n{\"metricDate\":\"2013-05-23\",\"pageCountTotal\":\"32951\",\"landCountTotal\":\"11611\",\"newLandCountTotal\":\"7757\",\"returnLandCountTotal\":\"3854\",\"spiderCountTotal\":\"68\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"231.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.44\",\"callConversionPerc\":\"72.29\"}\n{\"metricDate\":\"2013-05-24\",\"pageCountTotal\":\"28967\",\"landCountTotal\":\"10821\",\"newLandCountTotal\":\"7396\",\"returnLandCountTotal\":\"3425\",\"spiderCountTotal\":\"106\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"203.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"82.27\"}\n{\"metricDate\":\"2013-05-25\",\"pageCountTotal\":\"19741\",\"landCountTotal\":\"8393\",\"newLandCountTotal\":\"6168\",\"returnLandCountTotal\":\"2225\",\"spiderCountTotal\":\"78\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"28.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-26\",\"pageCountTotal\":\"19770\",\"landCountTotal\":\"8237\",\"newLandCountTotal\":\"6009\",\"returnLandCountTotal\":\"2228\",\"spiderCountTotal\":\"79\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"
conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-27\",\"pageCountTotal\":\"26208\",\"landCountTotal\":\"9755\",\"newLandCountTotal\":\"6779\",\"returnLandCountTotal\":\"2976\",\"spiderCountTotal\":\"82\",\"goalCountTotal\":\"26.000000\",\"callGoalCountTotal\":\"26.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.27\",\"callConversionPerc\":\"65.00\"}\n{\"metricDate\":\"2013-05-28\",\"pageCountTotal\":\"36980\",\"landCountTotal\":\"12463\",\"newLandCountTotal\":\"8226\",\"returnLandCountTotal\":\"4237\",\"spiderCountTotal\":\"132\",\"goalCountTotal\":\"208.000000\",\"callGoalCountTotal\":\"208.000000\",\"callCountTotal\":\"276.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.67\",\"callConversionPerc\":\"75.36\"}\n{\"metricDate\":\"2013-05-29\",\"pageCountTotal\":\"34190\",\"landCountTotal\":\"12014\",\"newLandCountTotal\":\"8279\",\"returnLandCountTotal\":\"3735\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"179.000000\",\"callGoalCountTotal\":\"179.000000\",\"callCountTotal\":\"235.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.49\",\"callConversionPerc\":\"76.17\"}\n{\"metricDate\":\"2013-05-30\",\"pageCountTotal\":\"33867\",\"landCountTotal\":\"11965\",\"newLandCountTotal\":\"8231\",\"returnLandCountTotal\":\"3734\",\"spiderCountTotal\":\"63\",\"goalCountTotal\":\"160.000000\",\"callGoalCountTotal\":\"160.000000\",\"callCountTotal\":\"219.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.34\",\"callConversionPerc\":\"73.06\"}\n{\"metricDate\":\"2013-05-31\",\"pageCountTotal\":\"27536\",\"landCountTotal\":\"10302\",\"newLandCountTotal\":\"7333\",\"returnLandCountTotal\":\"2969\",\"spiderCountTotal\":\"108\",\"goalCountTotal\":\"173.000000\",\"callGoalCountTotal\":\"173.000000\",\"callCountTotal\":\"226.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"76.55\"
}\n\r\n"
and here is my R output
metricDate
"2013-05-01"
pageCountTotal
"33682"
landCountTotal
"11838"
newLandCountTotal
"8023"
returnLandCountTotal
"3815"
spiderCountTotal
"84"
goalCountTotal
"177.000000"
callGoalCountTotal
"177.000000"
callCountTotal
"237.000000"
onlineGoalCountTotal
"0.000000"
conversionPerc
"1.50"
callConversionPerc
"74.68\"}{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landCountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountT
otal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCountTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"returnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\"
:\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}{\"metricDate\":\"2013-05-
(I've truncated the output a little).
The R output is parsed correctly up until "callConversionPerc", after which the JSON parsing breaks. Is there some default parameter that I've missed that could cause this behaviour? I checked for unescaped quote marks and anything obvious like that, but I didn't see any.
Surely it wouldn't be the new line operator that occurs shortly after, would it?
EDIT: So this does appear to be a new line issue.
Here's another 'JSON' string I've pulled into R; again the double quote marks are all escaped:
"{\"modelId\":\"7\",\"igrp\":\"1\",\"modelName\":\"Equally Weighted\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"416\",\"igrp\":\"1\",\"modelName\":\"First and Last Click Weighted \",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3,\"lastWeight\":3}\n{\"modelId\":\"5\",\"igrp\":\"1\",\"modelName\":\"First Click\",\"modelType\":\"first\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"8\",\"igrp\":\"1\",\"modelName\":\"First Click Weighted\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3}\n{\"modelId\":\"128\",\"igrp\":\"1\",\"modelName\":\"First Click Weighted across PPC\",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"firstWeight\":3,\"channelsMode\":\"include\",\"channels\":[5]}\n{\"modelId\":\"6\",\"igrp\":\"1\",\"modelName\":\"Last Click\",\"modelType\":\"last\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90}\n{\"modelId\":\"417\",\"igrp\":\"1\",\"modelName\":\"Last Click Weighted \",\"modelType\":\"spread\",\"status\":200,\"matchCriteria\":\"\",\"lookbackDays\":90,\"lastWeight\":3}\n\r\n"
When I try to parse this using fromJSON, I get the same problem: it parses up to the last term on the first line and then stops parsing properly. Note that in this new case the output is slightly different from before, returning NULL for the last item (instead of the messy string from the previous example).
$modelId
[1] "7"
$igrp
[1] "1"
$modelName
[1] "Equally Weighted"
$modelType
[1] "spread"
$status
[1] 200
$matchCriteria
[1] ""
$lookbackDays
NULL
As you can see, the components now use the "$" convention as if they are naming components and the last item is null.
I am wondering if this is to do with the way that fromJSON is parsing the strings, and whether, when it is asked to create a variable with the same name as one that already exists, it fails and just returns a string or a NULL.
I would have thought that dealing with that sort of case would be coded into RJSONIO as it's pretty standard for JSON data to have repeating names.
I'm stumped as to how to fix this.
I'll be very grateful if you can advise as to what I'm doing wrong! Is there some parameter I need to be specifying to get it to recognise variable names properly?
Cheers,
Simon
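The symptom is consistent with newline-delimited JSON (one object per line) being handed to a parser that expects a single document. A minimal workaround sketch, not from the original thread: split on newlines and parse each non-empty line separately. Here parse_one stands in for a JSON parser such as RJSONIO::fromJSON (assumed available); it is passed in so the helper stays generic.

```r
# The payload is newline-delimited JSON: one object per line, which
# fromJSON() treats as a single (malformed) document and truncates.
# Workaround: split on newlines, parse each non-empty line, bind rows.
parse_ndjson <- function(txt, parse_one) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(trimws(lines))]   # drop blank trailing lines
  rows  <- lapply(lines, parse_one)
  do.call(rbind, lapply(rows, function(r)
    as.data.frame(as.list(r), stringsAsFactors = FALSE)))
}

# e.g. df <- parse_ndjson(json_string, RJSONIO::fromJSON)
```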
coverage: eml_write, eml_dataset, eml_dataTable, eml_attributeList
Check that the <attribute><definition> extracted via XPath matches the definition passed to eml_write. Continue developing unit tests as (or before) development proceeds.
That way if someone doesn't like the EML, they know who to blame ;-)
So far eml_read only extracts the three objects in the proof-of-principle test. Of course we will want generic access to all metadata objects, probably with a variety of tools for their extraction.
Currently, to convert a data.frame into EML, we use a workflow that passes a data.frame, a list of column_metadata, and a list of unit_metadata to a function:
dat = data.frame(river=factor(c("SAC", "SAC", "AM")),
spp = factor(c("king", "king", "ccho")),
stg = factor(c("smolt", "parr", "smolt")),
ct = c(293L, 410L, 210L))
col_metadata = c(river = "http://dbpedia.org/ontology/River",
spp = "http://dbpedia.org/ontology/Species",
stg = "Life history stage",
ct = "count of number of fish")
unit_metadata =
list(river = c(SAC = "The Sacramento River", AM = "The American River"),
spp = c(king = "King Salmon", ccho = "Coho Salmon"),
stg = c(parr = "third life stage", smolt = "fourth life stage"),
ct = "number")
Then the EML is created by passing these objects to the high-level function eml_write:
doc <- eml_write(dat, col_metadata, unit_metadata)
I'm not sure if this is a good way to ask the users for metadata. One of the design goals is to reuse the natural R structures as much as possible and avoid asking for redundant information.
One problem with this is that it structures the metadata by column headings, rather than column by column, which might suggest something like this:
metadata <-
list("river" = list("River site used for collection",
c(SAC = "The Sacramento River", AM = "The American River")),
"spp" = list("Species common name",
c(king = "King Salmon", ccho = "Coho Salmon")),
"stg" = list("Life Stage",
c(parr = "third life stage", smolt = "fourth life stage")),
"ct" = list("count",
"number"))
which provides a more column-by-column approach. Still, this seems unsatisfactory, as we don't reuse the levels of a factor in a column (e.g. SAC and AM), instead requiring they be rewritten; likewise we still have to repeat the column headings in our named list.
Rather than using a named list, we might also do better to capture the attribute metadata in the object, e.g.
river_metadata <- list("river",
"River site used for collection",
c(SAC = "The Sacramento River", AM = "The American River"))
which maps better to the schema attribute. Still, none of these make maximum re-use of the data.frame objects, and all are a bit cumbersome.
A more natural solution would be to write directly into the S4 slots, but I'm not clear on how this would work. Using the above structures we could do
as(metadata, "eml:attributeList")
and a more low-level option:
as(river_metadata, "eml:attribute")
but I am not sure whether that would feel more natural to users than the function calls (particularly since most R-using ecologists are not familiar with S4 methods).
@schamberlain @karthikram @mbjones @duncantl
Would love any feedback on this, or generally on how the API should look to specify these values. Can we attach them to the data.frame/columns more directly, and is that better? (e.g. I considered the labels option for factors, but that just overwrites the levels)...
Deploying EML file and associated data objects on the gh-pages branch of a github repository would provide a more natural URL endpoint, and facilitate forking, pull requests, and rapid versioning.
Would require many of the same steps as #3, but because we don't have a native R interface to Git the actual commit and push could be left to an external script or the user. Most natural simply to specify the repository end-point to form the appropriate URLs.
Parse a string such as Carl Boettiger <[email protected]> and coerce it to an R person object, then to an eml_person object.

An underlying philosophy of reml has been to map native R objects to EML structure (as opposed to either the raw XML or the S4 representations, which won't be familiar to most users). While data.frame is the natural candidate, it doesn't include essential metadata such as units.
I've taken a stab at extending the data.frame class as data.set, providing the additional attributes unit.defs and col.defs. See data.set.R. This object should be usable wherever a data.frame can be applied, but we can also define additional methods that operate on this metadata.
This also suggests an alternative way to define the metadata of a data.frame, in place of the current approach illustrated in the README. I've added a function so that this data.set object can be created analogously to a data.frame:
dat = data.set(river = c("SAC", "SAC", "AM"),
spp = c("king", "king", "ccho"),
stg = c("smolt", "parr", "smolt"),
ct = c(293L, 410L, 210L),
col.defs = c("River site used for collection",
"Species common name",
"Life Stage",
"count of live fish in traps"),
unit.defs = list(c(SAC = "The Sacramento River",
AM = "The American River"),
c(king = "King Salmon",
ccho = "Coho Salmon"),
c(parr = "third life stage",
smolt = "fourth life stage"),
"number"))
An existing data.frame can also be passed in along with col.defs and unit.defs.
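A minimal sketch of how such an extension could attach the metadata, purely illustrative and not the package's actual data.set.R; a full implementation would also need methods (e.g. for subsetting) that preserve the attributes, which plain data.frame methods drop:

```r
# Illustrative sketch of the data.set idea: a data.frame carrying
# col.defs and unit.defs as attributes, plus simple accessors.
data.set <- function(..., col.defs = NULL, unit.defs = NULL) {
  df <- data.frame(..., stringsAsFactors = TRUE)
  attr(df, "col.defs")  <- col.defs
  attr(df, "unit.defs") <- unit.defs
  class(df) <- c("data.set", "data.frame")
  df
}

# accessors for the attached metadata
col.defs  <- function(x) attr(x, "col.defs")
unit.defs <- function(x) attr(x, "unit.defs")
```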
@duncantl @mbjones Is the implementation of the extension sensible? data.frame is actually a rather confusing class -- I cannot tell if it is S4 or S3 (e.g. new('data.frame') creates an S4 object, but data.frame() does not...), and it has attributes like names and row.names which may or may not be S4 slots (e.g. they can be accessed by slot() but not @...). See data.set.R.
What other metadata do we want? e.g. we could make full EML act like this, but that makes rather big data.frames...
Some metadata a user would probably rather set once in some global configuration than have to specify each time, such as their personal contact information. The package API currently uses eml$set and eml$get to handle this.
At the same time, if a user needs to adjust the contact information for a particular file, they should be able override these values without altering their global configuration.
Need to be careful to avoid collisions in the eml$set approach. As implemented, it won't support structured metadata (e.g. contact_givenName, contact_surName, ...). Ultimately we might want to be more clever about this, or just go to the yaml approach entirely.
Need also to be careful to avoid lengthy and fragile function APIs. eml$set helps with this, as the function can get the data it needs without it being passed down through many levels, but it also makes the override issue harder.
Once we have more implementation examples, we can give this some hard thought. Meanwhile:
eml$set(contact_email = "[email protected]")
contact:
email: [email protected]
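The set/get idea above can be sketched with a simple closure over a shared store; this is illustrative only, not the package's implementation, and the override parameter and email address are invented for the example:

```r
# Closure-based sketch of eml$set / eml$get: a shared store, plus a
# per-call override so file-specific values can shadow the global config.
eml <- local({
  store <- list()
  list(
    set = function(...) store <<- utils::modifyList(store, list(...)),
    get = function(key, override = NULL)
      if (!is.null(override)) override else store[[key]]
  )
})

eml$set(contact_email = "user@example.org")  # hypothetical address
```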
_Here's a running list of questions I have for @duncantl or whomever, largely arising as I try to understand the S4-based approach to representing the schema and various puzzles that arise in the process._
- e.g. as(list(a=1, b=2), "character")
- addChildren(parent, class@empty_option) would not do anything when the slot was truly empty.
- ...ListOf class. Goodness, but this is annoying. Answer: yup, tedious but mindless.
- Sometimes the schema has an <attributeList>, followed by <attribute>, <attribute>, ...; sometimes it doesn't. Does it make sense to write an extra object class associated with the first case (e.g. a class for attributeList)? Answer: yup, classes for all elements, and more classes for ListOf.
- Set the collate order for the files, describing which order they should be loaded in (e.g. basic class definitions before richer ones, classes before methods). In Roxygen, the order is set by using @include fileA.R in the documentation of fileB.R to indicate that fileA has definitions needed by fileB. Answer: Yup. Tedious again but not problematic. Writing to/from methods takes care of this explicitly.
- Should conversions be written as setAs("class", "XMLInternalNode", function(from){...}) or with some other kind of function? Answer: Sure, though it is sometimes preferable to define this as a method, allowing us to make use of callNextMethod to convert against the inherited slots.
- When do we need setOldClass and S3Part? Answer: Just use contains in the setClass definition (inheritance).
Answer: Using "new", we must know the slot name corresponding to the type. Coercion allows us to specify the type, e.g.
setAs("eml:nominal", "eml:measurementScale", function(from) new("eml:measurementScale", nominal = from))
setAs("eml:ordinal", "eml:measurementScale", function(from) new("eml:measurementScale", ordinal = from))
can be used with as(from[[3]], from[[4]]), reading the class name from a variable instead of hardwiring the slot name. The coercion methods take care of mapping the class names to the appropriate slot names.
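The pattern can be demonstrated with toy classes standing in for the eml:* classes (the class and slot names below are invented for the sketch):

```r
library(methods)
# Toy classes showing how a setAs() coercion lets the target class name
# live in a variable while the coercion maps it to the correct slot.
setClass("nominalToy", representation(labels = "character"))
setClass("scaleToy",   representation(nominal = "ANY"))
setAs("nominalToy", "scaleToy",
      function(from) new("scaleToy", nominal = from))

x <- new("nominalToy", labels = c("a", "b"))
target <- "scaleToy"   # class name read from a variable, not hardwired
y <- as(x, target)
```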
- Should we use setClass('somenode', representation(title = "eml:title"), ...) and then setClass("eml:title", representation(title = "character", id = "character"))? Answer: How about just having all inherit from a common base?
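The common-base suggestion can be sketched as follows (class names are illustrative, not the package's):

```r
library(methods)
# A base class supplies the id slot; element classes inherit it
# via contains= rather than redeclaring it in every definition.
setClass("emlBase",  representation(id = "character"))
setClass("emlTitle", contains = "emlBase",
         representation(title = "character"))

ttl <- new("emlTitle", id = "t1", title = "An example title")
```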
eml_read and eml_write both assume online endpoints for all files, i.e. a <distribution><online> node. That should be added by eml_publish.
@mbjones Can you clarify whether any of the elements inherit some of their definition from existing definitions? e.g. I think that's what is going on with entityGroup and referenceGroup. I'm wondering if there is also a base class, inherited by most everything, that defines the id attribute, or is that just manually added to each definition where appropriate?
install_github("reml", "ropensci")
Quitting from lines 62-63 (vingette.Rmd)
Error: processing vignette 'vingette.Rmd' failed with diagnostics:
argument ".contact" is missing, with no default
Execution halted
Error: Command failed (1)
@mbjones eml_write is assigning the namespace 2.1.0. Any reason we shouldn't be using 2.1.1?
Add a motivating example using attribute-level metadata.
We can get citation information for R packages with citation("reml"), so it would be natural to get the citation information for an EML object with:
eml <- read_eml("my_eml_file.xml")
citation(eml)
read_eml should make use of the S3 class eml and return a pointer to the XML root node as doc. An eml_citation function, with alias citation.eml, should extract the appropriate data citation as a bibentry R class, so that the citation can be returned in various formats (e.g. print(citation(doc), style='bibtex')).
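A hedged sketch of such a citation.eml method. The field names on the eml object are assumptions, and utils::citation() is not itself generic, so dispatch would need to be arranged separately; this only shows the bibentry construction:

```r
library(utils)
# Hypothetical S3 method: build a bibentry from fields assumed to be
# extractable from the EML document (title, creator, year, url)
citation.eml <- function(x, ...) {
  bibentry(bibtype = "Misc",
           title  = x$title,
           author = person(x$creator),
           year   = x$year,
           url    = x$url)
}
doc <- structure(list(title = "Example dataset", creator = "Jane Doe",
                      year = "2013", url = "https://example.org/eml.xml"),
                 class = "eml")
print(citation.eml(doc), style = "bibtex")
```

Because the result is a bibentry, all of the standard print styles (text, bibtex, citation) come for free.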
@duncantl For some reason I cannot get your ext_validate to run successfully any more, e.g. running this test gives the error:
"): non-character argument
1: eml_validate(txt) at test_ext_validate.R:17
2: .reader(ans) at /home/cboettig/Documents/code/reml/R/ext_validate.R:59
3: strsplit(ans, ": ") at /home/cboettig/Documents/code/reml/R/ext_validate.R:76
not sure what's up on this one.
The R dataone package also has some preliminary EML parsing routines, which extract relevant metadata from EML and make it available for use in the dataone client. This is partially used for the asDataFrame() method that converts a dataone binary file to a data frame. These classes may be able to be replaced with more capable reml package methods. See:
https://repository.dataone.org/software/cicore/trunk/itk/d1_client_r/dataone/R/EMLParser-class.R
https://repository.dataone.org/software/cicore/trunk/itk/d1_client_r/dataone/R/EMLParser-methods.R
The currently generated EML is not valid and needs to be fixed. I have identified the following issues to be fixed:
- @packageId attribute on root <eml> element
- @system attribute on root <eml> element
- <title> field
- <creator> field
- <contact> field
- <entityDescription> field is empty
- <recorDelimiter> should be <recordDelimiter>
- <numericDomain> is out of order and should follow <unit>
This is the holy grail of metadata infrastructure and ostensibly the primary purpose of EML, see Jones et al 2006. Despite that, integration is not actually possible without semantic definitions as well, see Michener & Jones 2012, from which we adapt this minimal example below.
This example provides minimal (and sometimes missing) semantics, which may make it unresolvable. A complete semantic solution is diagrammed in the figure from Michener & Jones 2012.
dat = data.frame(river=c("SAC", "SAC", "AM"),
spp = c("king", "king", "ccho"),
stg = c("smolt", "parr", "smolt"),
ct = c(293L, 410L, 210L))
col_metadata = c(river = "http://dbpedia.org/ontology/River",
spp = "http://dbpedia.org/ontology/Species",
stg = "Life history stage",
ct = "count")
unit_metadata =
list(river = c(SAC = "The Sacramento River", AM = "The American River"),
spp = c(king = "King Salmon", ccho = "Coho Salmon"),
stg = c(parr = "third life stage", smolt = "fourth life stage"),
ct = "number")
dat = data.frame(site = c("SAC", "AM", "AM"),
species = c("Chinook", "Chinook", "Silver"),
smct = c(245L, 511L, 199L),
pcnt = c(290L, 408L, 212L))
col_metadata = c(site = "http://dbpedia.org/ontology/River",
species = "http://dbpedia.org/ontology/Species",
smct = "Smolt count",
pcnt = "Parr count")
unit_metadata =
list(site = c(SAC = "The Sacramento River", AM = "The American River"),
species = c(Chinook = "King Salmon", Silver = "Coho Salmon"),
smct = "number",
pcnt = "number")
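Once a semantic layer tells us which columns and terms correspond, the two tables above can be aligned mechanically. A hedged base-R sketch, with the term mappings hand-coded here as stand-ins for what the ontology would supply:

```r
dat1 <- data.frame(river = c("SAC", "SAC", "AM"),
                   spp   = c("king", "king", "ccho"),
                   stg   = c("smolt", "parr", "smolt"),
                   ct    = c(293L, 410L, 210L),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(site    = c("SAC", "AM", "AM"),
                   species = c("Chinook", "Chinook", "Silver"),
                   smct    = c(245L, 511L, 199L),
                   pcnt    = c(290L, 408L, 212L),
                   stringsAsFactors = FALSE)
# Mappings a semantic layer would provide: species synonyms, plus the fact
# that smct/pcnt are counts for the smolt/parr life stages respectively
spp_map <- c(Chinook = "king", Silver = "ccho")
long2 <- data.frame(river = rep(dat2$site, 2),
                    spp   = unname(rep(spp_map[dat2$species], 2)),
                    stg   = rep(c("smolt", "parr"), each = nrow(dat2)),
                    ct    = c(dat2$smct, dat2$pcnt),
                    stringsAsFactors = FALSE)
combined <- rbind(dat1, long2)   # one table in the long format of dat1
```

The hard part, of course, is the two mapping steps; without formal semantics those remain manual.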
We should be able to do
nex <- readSchema("eml.xsd")
defineClasses(nex)
doc <- xmlParse("my_eml_data.xml")
fromXML(doc)
@duncantl is adjusting XMLSchema
to handle EML's schema for this (some challenges in the recursive referencing of schema files).
@mbjones Do we have an online URI for eml.xsd
? So far I've had to download a tarball from http://knb.ecoinformatics.org/software/download.jsp#eml
Truly automatic data integration needs some level of formal semantics. Should start thinking about how semantics would fit into the reml
workflow, even though most are still in their infancy.
An ideal system would allow authors to contribute to existing ontologies, or at least push to a 'working' or 'draft' ontology that could later be formalized / mapped to a more central effort like OBOE
Not sure if we yet have any R-based tools for semantic reasoning, etc. (Though we do have SPARQL). Ultimately this might require a separate repository to tackle implementation and reasoning of semantic terms. (Hopefully developed by some actual domain experts in the R community).
R's read.table() function (which read.csv wraps with different defaults) provides lots of options that we should support:
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text)
See ?read.table
for details. Particularly important for the read interface.
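For instance, a read interface might translate an EML <physical> description into read.table() arguments. A sketch, with the metadata list hand-coded here rather than parsed from XML (fieldDelimiter, numHeaderLines, and missingValueCode are EML physical-module element names; the mapping itself is an assumption):

```r
# Stand-in for metadata parsed from an EML <physical> node
phys <- list(fieldDelimiter = ",", numHeaderLines = 1,
             missingValueCode = "NA", quoteCharacter = "\"")
args <- list(sep        = phys$fieldDelimiter,
             header     = phys$numHeaderLines > 0,
             na.strings = phys$missingValueCode,
             quote      = phys$quoteCharacter,
             stringsAsFactors = FALSE)
dat <- do.call(read.table,
               c(list(file = textConnection("a,b\n1,x\n2,NA")), args))
```

This keeps the user-facing call simple while still exercising read.table's full option set.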
Include a <publisher> node when using eml_figshare.
EML coverage nodes specify taxonomic, geographic, and temporal coverage.
They can refer to a dataset node but can also be used to define coverage of individual columns (e.g. a species column) or individual cells in a column (e.g. the species name). The latter is much richer but less commonly implemented.
@schamberlain I think ideally taxonomic coverage would make use of taxize
to help identify and correct species names. While higher taxonomic information can be specified, this would probably best be reserved for cases not referring to a particular species, since (a) we can already programmatically recover the rest of the classification given the genus and species, and (b) higher taxonomy may be inconsistent anyway.
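A minimal base-R sketch of generating per-species entries from a species column, splitting binomials only (name resolution against ITIS etc. via taxize is left out, and the list structure is a simplification of the real taxonomicClassification nesting):

```r
species_col <- c("Oncorhynchus tshawytscha", "Oncorhynchus kisutch",
                 "Oncorhynchus tshawytscha")
# One entry per unique binomial; genus/species split on whitespace
taxa <- lapply(strsplit(unique(species_col), " "), function(parts)
  list(taxonRankName = "Species",
       genus   = parts[1],
       species = parts[2]))
```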
See eml temporal coverage documentation
We'll want to automatically decide if the coverage is a specific range of calendar dates, an estimated timescale (geological timescale), approximate uncertainty, and whether to include any citations to literature describing dating method (e.g. carbon dating). Could be a whole wizard / module....
Meanwhile, just supporting manual definition of this structure would be a good start.
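For the common calendar-date case, the range can be derived directly from a date column; a sketch (beginDate/endDate mirror the EML rangeOfDates children):

```r
# Derive a calendar-date temporal coverage range from a column of dates
dates <- as.Date(c("2013-05-03", "2013-05-01", "2013-05-02"))
coverage <- list(beginDate = min(dates), endDate = max(dates))
```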
Can be bounding box, polygon, or geographicDescription (e.g. "Oregon"). Tempting to process natural language descriptions into coordinates, but that throws out true data in place of estimated data (e.g. best left to read-eml world, not the write-eml).
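For the bounding-box case, the values can be computed straight from lat/long columns; a sketch using the EML boundingCoordinates element names (the coordinates here are made-up illustration data):

```r
lat  <- c(38.58, 38.90, 38.65)
long <- c(-121.49, -121.04, -121.22)
bbox <- list(westBoundingCoordinate  = min(long),
             eastBoundingCoordinate  = max(long),
             southBoundingCoordinate = min(lat),
             northBoundingCoordinate = max(lat))
```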
install_github("reml", "ropensci")
Installing github repo(s) reml/master from ropensci
Downloading reml.zip from https://github.com/ropensci/reml/archive/master.zip
Installing package from C:\Users\thart\AppData\Local\Temp\RtmpuCh7ZJ/reml.zip
Installing reml
"C:/PROGRA~1/R/R-30~1.1/bin/i386/R" --vanilla CMD INSTALL
"C:\Users\thart\AppData\Local\Temp\RtmpuCh7ZJ\reml-master"
--library="C:/Users/thart/Documents/R/win-library/3.0" --with-keep.source --install-tests
Involves several steps: fs_create to establish the figshare link metadata first. Files will need to be uploaded to get their URLs, then the EML file will need to be modified with those URLs.

@mbjones would it make sense to timestamp the EML file with the date it is generated? The files have most of the information you might want to cite the data: creator, title, url or identifier, potentially the organization or repository responsible, but I don't see a date associated with this. If so, where would the logical place for such a date be? (Presumably this would not be ambiguous with the date the data was actually collected, e.g. temporalCoverage.)
A somewhat more elegant approach to reading in XML is to define an S4 class for a given node and then just cast the XML into the S4 slots using xmlToS4
. Undefined slots are ignored. I provide an illustration of how to do this in my advice on the RNeXML package, which compares it to alternative methods of reading XML.
This approach has the particular advantage that, at least in principle, we shouldn't need to define these classes by hand the way I show there, since their definitions can be extracted programmatically from the schema. The XMLSchema
package should soon be able to do this.
Not only does this streamline our approach to reading EML into R objects, it provides several other benefits. We can define coercion methods that take each of these S4 objects and coerce them into the appropriate R objects: for instance, a <dataTable> node into an R data.frame, along with the appropriate metadata (kept in S4, since there is no natural way to attach metadata to R objects...), or a <person> node into an R person, etc.
Somewhat more powerful and potentially more tricky is using the S4 approach as a write method. Rather than constructing the XML node by node as we do currently with newXMLNode
and addChildren
, etc, we would simply coerce our R objects (data.frames
, person
or strings, R DESCRIPTION
files of (R) software, etc etc) into these S4 class definitions we extracted from the schema (which can be done automatically by matching slot names if the matches are good enough? e.g. person$givenName
to <person><givenName>
?). With luck(?), XMLSchema will be able to use the schema to figure out how to write this S4 object into XML (e.g. which slots are encoded as attributes, which as child nodes, the ordering of slots, etc).
As XMLSchema is probably not up to this task yet (particularly on the write
end?), we may do well to continue as we are "by hand"; though perhaps we should still be leveraging the S4 class definition in the process (and then manually turning it to XML with the calls to newXMLNode
, etc...?)
@duncantl will hopefully clarify some of these questions and anything I've misstated about this strategy.
An EML software node needs:
Optionally, an <implementation>, which minimally needs a url via a <distribution> node, which we also use elsewhere (e.g. when publishing data to a url).

For R packages, we can extract all the necessary information by providing the R package name. A wrapper function can use eml_software and the package name to create this (using the packageDescription function).
Perhaps there is no need to provide the optional R software entries, (e.g. all the optional fields under implementation) since such data is already programmatically available knowing the package distribution URL...
For a first-time user of the reml package, it might be easier to simply call eml_write
on an R data.frame
object and have the function coach them through what minimal metadata they must add, prompting them for inputs along the way (e.g. "Define column X1:").
No doubt any regular user would find this frustrating to repeat each time, and would rather provide a list of metadata ahead-of-time, possibly generated programmatically (e.g. pulled from an existing file in the same format) or specified in a configuration file (e.g. no need to ask me my name every time).
Presumably we should be able to push directly to KNB as well (many benefits, including being a DataONE node...).
May need Matt's help on getting API tools to do this...
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-dataTable.html#numberOfRecords
Optional value giving the number of non-header rows in the CSV. Can simply be pulled from dim(dataframe)[1].
A couple of options on how to do this:
as.yaml(xmlToList(eml.xml))
Ideal rendering would drop some of the less essential markup (e.g. stuff intended more for machines than people -- unit definitions, numeric types, etc).
Given the EML file defining a CSV and metadata types, extract the R object information. This should allow the user to reconstruct the following R objects from EML generated by #2
dat = data.frame(river=c("SAC", "SAC", "AM"),
spp = c("king", "king", "ccho"),
stg = c("smolt", "parr", "smolt"),
ct = c(293L, 410L, 210L))
with the following accompanying metadata:
col_metadata = c(river = "http://dbpedia.org/ontology/River",
spp = "http://dbpedia.org/ontology/Species",
stg = "Life history stage",
ct = "count")
unit_metadata =
list(river = c(SAC = "The Sacramento River", AM = "The American River"),
spp = c(king = "King Salmon", ccho = "Coho Salmon"),
stg = c(parr = "third life stage", smolt = "fourth life stage"),
ct = "number")
Ensure that all objects have the correct object type: e.g. (ordered) factors should be (ordered) factors, etc.
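For example, an ordinal measurementScale with an enumerated domain maps naturally onto an ordered factor; a sketch, assuming the level order can be read from the EML attribute definition:

```r
# Level order as (assumed to be) defined in the metadata:
# parr precedes smolt in the salmon life cycle
stg_levels <- c("parr", "smolt")
stg <- factor(c("smolt", "parr", "smolt"),
              levels = stg_levels, ordered = TRUE)
stg[1] > stg[2]   # the ordering information survives the round trip
```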
Currently, in the dataone
package, we can download EML and parse it with an XML parser to extract some metadata for use in our script. For example:
library(dataone)
cli <- D1Client()
# Download and parse some EML metadata
obj4 <- getMember(pkg, "doi:10.5063/AA/nceas.982.3")
getFormatId(obj4)
metadata <- xmlParse(getData(obj4))
# Extract and print a list of all attribute names in the metadata
attList <- sapply(getNodeSet(metadata, "//attributeName"), xmlValue)
attList
This seems like it would be better handled by handing the eml document off to the reml
package for parsing, which might provide some nicer accessor methods, plus the ability to insert new metadata or change existing fields. I've been thinking about how it's best to do this. @cboettig Should the dataone
package load reml
to do its parsing, or should the reml
package load dataone
to handle its downloading. I'm thinking of this in terms of other metadata standards as well, such as FGDC or ISO 19115, and wanting to support those through the dataone
library as well. Thoughts?
In our early discussions about validation, we agreed it was really just part of the developer testing suite. For a user consuming EML, having the software complain the file isn't valid isn't really helpful, it's best just to give it our best shot anyway. For writing EML, since this is programmatically generated we can assure it is valid ... or can we?
The S4 R objects we use mimic the schema, but they don't enforce required vs optional slots (in fact, all slots are always 'present' in the S4 objects, so an operational definition of "empty" is that the slot has an empty S4 object (recursive) or a length 0 character/numeric/logical string.) A user can create an S4 object and pass it into their EML file (seems like a useful/powerful option to have, particularly for reusing elements). If the object is missing some required elements, this will create invalid EML.
We can avoid this in several ways, e.g. by validating required elements in the new constructor. This is the strategy we employ so far, but we still permit pre-built S4 nodes to be passed to some constructors to facilitate reuse (bypassing the protection regarding required elements).

With a robust XMLSchema package, a few other things become easy. Integration should happen at the more universal level of the schemas themselves rather than at the R level; hopefully that is something we can take advantage of from there.
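One way to enforce required elements even for hand-built S4 nodes is an S4 validity method, which new() runs automatically; a sketch with illustrative class and slot names (not the package's actual definitions):

```r
library(methods)
# Enforce a required element with a validity check run by new()
setClass("creatorSketch",
         representation(surName = "character", givenName = "character"),
         validity = function(object) {
           if (length(object@surName) == 0)
             "required element <surName> is empty" else TRUE
         })
ok  <- new("creatorSketch", surName = "Jones", givenName = "Matt")
bad <- try(new("creatorSketch"), silent = TRUE)  # fails the validity check
```

validObject() can also be called on pre-built nodes at write time, closing the loophole for objects passed into constructors.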
EML already maps to the Biological Data Profile (BDP) from the Federal Geographic Data Committee (used by the now-defunct National Biological Information Infrastructure, NBII). But the reverse mapping is not available (Jones et al. 2006).
A running list of questions that I might direct to Matt if I cannot figure them out:
- id attributes for <attribute> elements?
- replacing a definition with a URI to an existing ontology? Units already have clear semantic definitions, but assigning good definitions to columns or values of character strings (such as species names, geographic sites, etc.) is considerably less developed. We do have a somewhat round-about way to attach things like "Coverage" definitions to columns (attributes). Replacing definitions with URIs would seem simpler...
- Are <attributeDefinition> and <textDomain><definition> redundant in the case of character string columns? (e.g. see the example below)

<attribute id="1354213311470">
<attributeName>run.num</attributeName>
<attributeDefinition>which run number. (integer)</attributeDefinition>
<measurementScale>
<nominal>
<nonNumericDomain>
<textDomain>
<definition>which run number</definition>
</textDomain>
</nonNumericDomain>
</nominal>
</measurementScale>
</attribute>
This issue is mostly a note to myself in thinking out potential illustrative use cases; no input really needed at this time, kinda trivial example here.
What would eml look like to encode replicate models? Can we easily convert metadata to an additional column when combining data with matching column descriptions but differing metadata, e.g.
model: Allen
parameters: r=1, K = 10, C = 5
nuisance parameters: sigma_g = 0.1
seed: 1234
value | density |
---|---|
0.0 | 0.12 |
0.1 | 0.14 |
0.2 | 0.22 |
0.3 | 0.4 |
And:
model: Myers
parameters: r=1, K = 10, theta = 1
nuisance parameters: sigma_g = 0.1
seed: 1234
value | density |
---|---|
0.0 | 0.14 |
0.1 | 0.11 |
0.2 | 0.33 |
0.3 | 0.33 |
I think that the dateTime
of a single observation should always be given in a single column. For reasons unfathomable to me, some datasets represent year as one column, month as another column, day as another, etc.
This is really only a problem when we do not have good metadata to recognize that these all refer to the same observation. For instance, in the dataset linked above, we can tell that all the columns are "dateTime" objects, but we have generally no way to be sure that the "year" in column 2 is the year that corresponds to the "day" in column 3. These could be independent dateTime observations, such as the start time and end time of a study, etc.
While it seems obvious that a single observation should get a single cell, apparently it isn't. I'm open to ideas on how to approach these issues.
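A minimal sketch of recombining such split columns, assuming the metadata has told us the three columns describe the same observation:

```r
# Recombine year/month/day columns into a single Date column
dat <- data.frame(year = c(2013L, 2013L), month = c(5L, 5L), day = c(1L, 2L))
dat$date <- as.Date(sprintf("%04d-%02d-%02d", dat$year, dat$month, dat$day))
```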
This is a problem for read_eml: to convert these into dateTime (POSIXt class) objects, we need to be able to associate the columns, and a crude as.POSIXt applied to a lone month or day column would instead render the date in the current year, rather than the year given in its own column.

The package makes frequent use of camelCase. This is to provide consistent mapping to EML nodes, which are all defined in camelCase, e.g. <dataset>
but <dataTable>
. Deal with it.
Package functions use =
instead of <-
for assignment. Should be fixed. XML package has lots of nice syntactic sugar, and this makes it a bit more fluid to move the definitions around.
Need to stick with consistent and transparent use of addChildren vs. the parent = argument, etc.
Thus far issues have been divided between reading, writing, publishing, and integrating EML. Development has mostly focused on writing EML. Publishing is a relatively straightforward extension of writing: it just adds a few extra fields to the EML file and pushes the data to the appropriate repository with the appropriate metadata.
Reading EML is potentially more of a challenge since (a) we assume the user doesn't know xpath and (b) we want to provide conversion into native R objects wherever possible. We have only a basic proof-of-principle, based on the trivial write-EML example, which imports the csv file into an appropriately labeled data.frame.
Not sure if searching across EML is a read-eml issue or a separate task, since in general such a query might be posed across a database of EML files rather than a single XML file.
To have a focal example, I'll just borrow one posed by one of my PIs:
"Find all data that involves a morphometric measurement of an invertebrate species at 3 or more geographically distinct locations.
(e.g. 3+ different populations of the same species) This kind of data would be useful for all sorts of within-species variation comparisons (when put against environmental variables, etc), but is remarkably difficult to find, as vertically integrated databases tend to omit morphological data (like most GBIF entries), or else aggregate at the species level, discarding the geographic data. Many papers have less than three populations, and it is all but impossible to find another paper that makes the same morphometric measurements on the same species at a unique location.
It seems like this is the kind of query we could construct in EML; and in particular perform the aggregation step. But that assumes a model in which we query directly against all available EML files. I'm not sure if that is sensible or if there's a more clever way to do these queries. (particularly as we would have to do some computation in the process - e.g. to isolate data with invertebrate coverage we would have to query the taxonomic coverage and then query against ITIS or something to determine if the species etc listed was an invertebrate) @mbjones is there a better way to think about complex queries (metacat?)?
To begin with, consider rendering a data.frame such as this
dat = data.frame(river=c("SAC", "SAC", "AM"),
spp = c("king", "king", "ccho"),
stg = c("smolt", "parr", "smolt"),
ct = c(293L, 410L, 210L))
with the following accompanying metadata:
col_metadata = c(river = "http://dbpedia.org/ontology/River",
spp = "http://dbpedia.org/ontology/Species",
stg = "Life history stage",
ct = "count")
unit_metadata =
list(river = c(SAC = "The Sacramento River", AM = "The American River"),
spp = c(king = "King Salmon", ccho = "Coho Salmon"),
stg = c(parr = "third life stage", smolt = "fourth life stage"),
ct = "number")
into EML.
Immediately after cloning the reml repo on a Mac, git status shows the man/eml_datatable.Rd file as modified, without any editing.
This seems to be due to the existence of files that differ only by case -- particularly man/eml_datatable.Rd and man/eml_dataTable.Rd. Removing the duplicate file should fix the problem, but at the moment I am unclear as to which is the right one to remove, or if in fact both are needed.