
Comments (25)

MattBlissett commented on August 29, 2024 7

Recommended best practice is to use a persistent, globally unique identifier.

Perhaps expanding the notes would help, to explain why "persistent" and "globally unique" are useful.

Recommended best practice is to use a persistent, globally unique identifier.

A persistent identifier is one that will not change, allowing others to link to this record in perpetuity. Avoid including words or numbers that might change, such as the name or abbreviation of a museum department, or a current storage location.

A globally unique identifier works without any other information to identify the occurrence.

A UUID can meet both criteria, so long as the UUID is reliably stored, for example in a collections database.

(I'm not aware of any specific examples Andrea may have in mind.)

from dwc.

ahahn-gbif commented on August 29, 2024 7

Thanks for this great discussion! I am rather overwhelmed.

As more background was asked, just a few notes: for the last year, GBIF has been monitoring more closely which datasets change a significant portion of their occurrenceIDs in successive ingestion runs, holding ingestion on those, and communicating with publishers on options. It turns out that for some datasets, ids will always change and there is (supposedly) nothing to be done about it - this is more often the case for data aggregates and observation data, and maybe a different conversation to be had; others change through some kind of error, and publishers notified are happy to revert; and yet another group change because some systematic change in one of the text values causes all occurrenceIDs to change.
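As a rough illustration of the kind of check described above, the sketch below compares the occurrenceID sets of two successive versions of a dataset and reports what share of the earlier IDs has disappeared. This is not GBIF's actual monitoring code; the function name, the sample IDs and the 0.5 threshold are all assumptions made for illustration.

```python
def fraction_ids_lost(previous_ids, current_ids):
    """Share of occurrenceIDs from the previous version that are absent now."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    return len(previous - set(current_ids)) / len(previous)

# Hypothetical threshold: flag the dataset for review if more than half changed.
HOLD_THRESHOLD = 0.5

v1 = ["a1", "a2", "a3", "a4"]
v2_stable = ["a1", "a2", "a3", "a4", "a5"]   # one record added, none changed
v2_rekeyed = ["b1", "b2", "b3", "b4"]        # every ID regenerated

for label, ids in (("stable", v2_stable), ("rekeyed", v2_rekeyed)):
    lost = fraction_ids_lost(v1, ids)
    print(label, lost, "hold" if lost > HOLD_THRESHOLD else "ingest")
```

Adding new records does not count against a dataset in this sketch; only IDs that vanish between versions do.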

My suggestion was not meant to promote UUIDs as the only option, sorry if it came across that way. It was rather to not promote triplet ids as much as we have done so far by removing the explicit triplet example (no. 3 more than no. 1) - and being lazy in not providing examples for alternative options.

Just to be clear, there is no intention on the side of GBIF to enforce a particular use of occurrenceIDs, or to phase out others. We recognize that people will do what is most practical in their daily work. Where the option exists, however, we would want to encourage going for both persistent and globally unique, so that we can move one more step away from identifiers that keep going out of scope. What exact shape that takes is not as important, provided it aims for real persistence alongside uniqueness.

albenson-usgs commented on August 29, 2024 5

I'd really love to discuss this topic in more detail than I think is possible in a GitHub thread (and apologies if I have taken this change request off topic a bit). I'd like to pose it as an unconference topic at the upcoming TDWG. Would folks be interested? Give me a thumbs up emoji if so.

deepreef commented on August 29, 2024 5

Wow... great discussion! So, I was probably too anemic in my initial response to this proposal. Although I do happen to support UUIDs as the ideal identifier for our context, that was not the reason I strongly supported this proposal. My support was for the notion of encouraging opaque identifiers -- regardless of whether they are UUIDs or integers or randomly generated character strings or whatever. When I first started developing biodiversity data management systems (1980s), I was strongly opposed to what we used to call "surrogate primary keys". Instead, I preferred what we called "natural primary keys" -- that is, identifying some field or combination of fields in the database that (collectively) represented a unique combination of values for each record. My reasoning was that there was no need to maintain yet another field to uniquely identify each record, when unique identity was self-evident from the actual data-bearing fields themselves.

It didn't take very long for me to realize what a bad mistake that was. In the decades since then, I've not only embraced "surrogate primary keys" for databases, I've also recognized the importance of NOT using these surrogate primary keys as publicly accessible identifiers. Instead, they should be optimized for database purposes, such as integrity enforcement and performance for complex joins and stuff like that. That led me to a database architecture whereby a single data table in my database (called "PK") is responsible for locking in two permanent values: an integer and a UUID. The integer serves as the source for internal primary keys for every table in the database, and the UUID is married 1:1 with these integers and is used for representing an identifier for each record whenever the data are exposed externally.
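A minimal sketch of what such a "PK" table could look like, using Python's built-in sqlite3 module. The table and column names (pk, pk_id, uuid, occurrence, catalog_number) and the helper mint_pk are assumptions for illustration, not the schema actually described in this comment.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pk (
        pk_id INTEGER PRIMARY KEY,      -- internal key, used for joins
        uuid  TEXT NOT NULL UNIQUE      -- external identifier, 1:1 with pk_id
    );
    CREATE TABLE occurrence (
        pk_id INTEGER PRIMARY KEY REFERENCES pk(pk_id),
        catalog_number TEXT
    );
""")

def mint_pk(connection):
    """Lock in a new integer/UUID pair and return both."""
    new_uuid = str(uuid.uuid4())
    cursor = connection.execute("INSERT INTO pk (uuid) VALUES (?)", (new_uuid,))
    return cursor.lastrowid, new_uuid

pk_id, external_id = mint_pk(conn)
conn.execute(
    "INSERT INTO occurrence (pk_id, catalog_number) VALUES (?, ?)",
    (pk_id, "233627"),
)

# Correcting a data-bearing field leaves both identifiers untouched.
conn.execute(
    "UPDATE occurrence SET catalog_number = ? WHERE pk_id = ?",
    ("233672", pk_id),
)

row = conn.execute(
    "SELECT pk.uuid, occurrence.catalog_number "
    "FROM occurrence JOIN pk USING (pk_id)"
).fetchone()
print(row)
```

The integer is what other tables would join on; the UUID is what would go into an export as the record's identifier.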

Alas, despite two full-on workshops, multiple mini-conferences and symposia, several robust whitepapers, a couple of publications, and dozens of presentations at TDWG, our community still seems to be stuck when it comes to representing unique identifiers for data records in our exchanged data. Part of the problem, I think, is that our community got too hung up on the "resolvable" thing. Basically, to make writing code a little easier, a lot of people wanted to follow the LOD path and conflate the roles of "resolution" (=dereferencing) and "identification" (i.e., unique, stable, persistent). Some even advocated that every identifier should begin with the characters "http://" (which, of course, results in breaking every single identifier once an SSL certificate is implemented, but I'll leave that one alone for now...)

Coming back to the issue at hand: there is a strong desire by many to make our identifiers friendly to human eyeballs. This is one of the reasons everyone likes DOIs more than UUIDs. But the REAL advantage of DOIs is not that they're easy on the human eyeballs, but because of the robust network of identifier dereferencing that exists (e.g., Crossref, etc.). It's also one of the reasons why (unfortunately) the outcome of the aforementioned TDWG/GBIF workshops resulted in a recommendation of LSIDs (sigh).

Sorry for the rambling context above, but this all leads me to my key point: We should keep things like triplets and other non-opaque values (i.e. "natural primary keys") in our datasets to make it easier for humans to track things down. But if we're ever going to cross the threshold of data integration and reusability in our community, we really need to get serious about moving towards real identifiers for data records. And I think this proposal (encouraging opacity for occurrenceID values) is an important step in that direction (though it's certainly not the only step).
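One way to read "keep the triplet, but identify by something opaque" is sketched below: each record carries both an opaque occurrenceID and its human-readable fields, so a person holding only the ID can still be pointed to a catalog entry. The occurrenceID value is a made-up placeholder, and its pairing with the MSB/Mamm/233627 values (borrowed from elsewhere in this thread) is purely illustrative.

```python
records = [
    {
        "occurrenceID": "00000000-0000-4000-8000-000000000001",  # placeholder
        "institutionCode": "MSB",
        "collectionCode": "Mamm",
        "catalogNumber": "233627",
    },
]

# Index once by the opaque identifier.
by_id = {r["occurrenceID"]: r for r in records}

def human_label(occurrence_id):
    """Build a readable label from the record's current data-bearing fields."""
    r = by_id[occurrence_id]
    return ":".join([r["institutionCode"], r["collectionCode"], r["catalogNumber"]])

print(human_label("00000000-0000-4000-8000-000000000001"))
```

The label is derived on demand and may change when the underlying fields are corrected; the occurrenceID is the part that is meant to stay fixed.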

Even for people who maintain their data in Excel -- if they can manage a column in their spreadsheets for things like "catalogNumber", why can't they simply add one more column for "occurrenceID", and populate it with some arbitrary and unique and non-information-bearing value (UUID, integer, whatever), then never change that value? They can use a formula like the one @albenson-usgs gave to initially populate the value, but there's no need to re-run the formula every time the dataset is exported.
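A minimal sketch of the "populate once, never change" idea, assuming the spreadsheet is exported as CSV with an occurrenceID column. The function name and the sample rows are hypothetical; the first row reuses a UUID quoted elsewhere in this thread as an arbitrary example value.

```python
import csv
import io
import uuid

def fill_missing_occurrence_ids(rows):
    """Assign an occurrenceID only where the cell is empty; never overwrite."""
    for row in rows:
        if not (row.get("occurrenceID") or "").strip():
            row["occurrenceID"] = str(uuid.uuid4())
    return rows

# Stand-in for a spreadsheet exported as CSV.
sheet = io.StringIO(
    "catalogNumber,occurrenceID\n"
    "1001,4823f29b-2c6e-4a43-86b1-430e16a9c34e\n"
    "1002,\n"
)
rows = fill_missing_occurrence_ids(list(csv.DictReader(sheet)))
first_pass = [r["occurrenceID"] for r in rows]

# Running it again (e.g. at the next export) changes nothing.
second_pass = [r["occurrenceID"] for r in fill_missing_occurrence_ids(rows)]
print(first_pass == second_pass)
```

This only works if the filled-in column is saved back into the working copy of the data; if the IDs live only in the exported file, the next export starts from blanks again.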

Sorry for the rant -- it is NOT my intention to hijack this discussion to become a general debate about identifiers (which we've had many, many times before). But I felt the need to respond (indirectly) to some of the comments posted, and expand the explanation for my support for this proposal.

albenson-usgs commented on August 29, 2024 3

While I understand the rationale for this, in my opinion having worked with many data providers across a spectrum of data collection methods, it's practically speaking impossible to implement unless someone (hand waving) is going to mint and keep track of these opaque identifiers for projects. Most of the people I work with are still operating in Excel spreadsheets - they absolutely do not have a good way to mint and keep track of opaque identifiers. If you ask them to mint opaque identifiers for their data, what will happen is they will run this line of code (occurrence$occurrenceID <- sapply(occurrence$occurrenceID, function(x) UUIDgenerate(use.time = TRUE))) and you will get a brand new set of occurrenceIDs every time they republish the data. From my experience at least, if they are creating their occurrenceIDs using information in their data, a person can walk that back to the actual record. Having opaque identifiers makes that impossible. If there is some solution to this that I am completely missing I would really like to learn about it.

albenson-usgs commented on August 29, 2024 2

Ok but the justification for this change states

In an effort to move towards stable identifiers for occurrence records, we observe that a non-negligible portion of inadvertent ID changes in GBIF-ingested datasets happens because e.g. collection identifier schemes within an institution get consolidated to common pattern across collections. This update then translates into the concatenated occurrenceID values, resulting in an id change.

And what I'm saying is that I don't believe making this change will lead to occurrenceIDs that do not change. If that is the goal then this change, I do not believe, will achieve that result. People will just create a series of letters and numbers instead that will change when they update the dataset.

If what we want to see is stable identifiers for occurrence records then I think we should work on that problem which I don't think will be resolved by only showing UUIDs as the example.

ymgan commented on August 29, 2024 2

Hi everyone!

@albenson-usgs @sformel-usgs @jdpye @emiliom and I would like to bring in our perspectives as data managers for this topic by taking a different tangent - the PROCESS of creating an Occurrence record and its occurrenceID. We work with field scientists; we clean and transform their raw data into Darwin Core tables. We felt that this part of the conversation was not being understood.

Our data providers do not usually manage an Occurrence table

Very often, our data providers DO NOT have an Occurrence table in the original data. Examples:

Screenshot 2024-08-26 at 18 01 37
Screenshot 2024-08-26 at 18 01 46

Very often an Occurrence is represented as a cell linked to multiple tables. Why a wide table? We asked our data providers. This is how field biologists think of data!

occurrenceID is needed to trace back to the original data!

This is exactly why we use a transparent identifier for occurrenceID!

Screenshot 2024-08-26 at 18 02 53
Screenshot 2024-08-26 at 18 03 03

Our data provider does not manage an Occurrence table, so they will be confused if a user finds their contact info in the dataset EML and emails them: "Hey, I think you might be missing a negative in your decimalLatitude in your record (occurrenceID: 4823f29b-2c6e-4a43-86b1-430e16a9c34e)". How can our data provider know which row, which cell to look at, unless the user sends the link to the record? Btw, not all aggregators display the verbatim data when they change the data (e.g. replacing values).

Data comes in different shapes and forms

Most of our data providers manage data in Excel spreadsheets, and the tables often lack a primary key (they do not need it). We also receive Word and PDF documents where the data consists of multiple tables, unstructured text and sometimes multimedia. We all have different levels of resources. Some data providers have databases; some data providers do not have a CMS.

We may not have the resources to maintain opaque occurrenceID, which leads us to the next point.

The cost of changing our data provider's approach

I tried developing data templates for our data providers before they went on expedition, which forced them to record data in a tidier manner. It created friction in how they managed the data. When I received the data upon their return, they had added columns they used to have and only used some of the columns that I created. The data became even more difficult to clean and transform. Subsequently, we were excluded from the next expedition meetings and I did not receive datasets from them anymore.

FAIR data is not a requirement (at least for most of the data providers of the Antarctic node) and also not a priority for many of our data providers (they are evaluated by the ranking of the journals, not the number of datasets). Many of our data providers are only required to make their data freely available. When there is friction, they will take the easier route, such as publishing datasets to Zenodo or PANGAEA.

What we are really saying is that promoting meaningless identifiers and REQUIRING them to be STABLE, when our data providers DO NOT even manage an Occurrence table, creates friction with data providers who lack the resources. This may come with a cost: the risk of datasets not being updated, or of providers not wanting to publish their datasets as Darwin Core Archives at all.

Transparent identifiers are still useful

We certainly do not want our data providers to stop updating their datasets or stop publishing their datasets to GBIF.

We also think that the stability of gbifID and the stability of occurrenceID are two separate issues. We think that acknowledging transparent identifiers is important because at this stage, transparent identifiers may be the best thing that we could come up with! We hope that we could acknowledge this with curiosity and empathy.

Please keep the transparent identifier example

Please keep the triplet examples. The current triplet of institution:collection:catalogNumber is still useful and valuable for data providers without a CMS.

We think that it will be helpful to have a set of principles to guide the user in creating or choosing the identifiers they need and maintain. Exactly where the principles should be, we are not sure, perhaps in the comments?

SYM25 in TDWG 2024

Finally, we submitted an abstract titled "What matters for an occurrenceID and what is an occurrenceID that matters?" in SYM25 Occurrences are Neither Specimens or Samples: Data modelling challenges and opportunities for information storage and exchange

Some of the screenshots above are taken from our slides. If you are interested, we will be delighted to share our experience with you in the talk!

Thank you so much!

deepreef commented on August 29, 2024 1

I STRONGLY support this proposal (as those who know my opinions about this general topic could probably guess).

tucotuco commented on August 29, 2024 1

Thank you all for this thorough and fundamental perspective. The "Darwin Core Triplet" example was included exactly to reduce obstacles to data sharing. It is good to have this additional experience from another community of practice.

matdillen commented on August 29, 2024

If we remove these two recommendations, do we state to which terms such identifiers should be mapped instead? I presume this would be materialEntityID or parentMaterialSampleID, but those terms are not accepted yet.

Or is the implication that persistent identifiers with any kind of meaning (including most URIs that are also URLs) should not be used for specimens at all?

jdpye commented on August 29, 2024

Following up on what Abby said, because I was struggling with how to say it and I think she represented it very well, a UUID is not a useful tool for people updating and extending datasets that are actively adding new records and need to distinguish old records from new ones.

An example I have is that a listening station detects a coded pinger that is associated to an animal attachment event. Well, the attachment event tells me species, the station deployment tells me place, and the pinger recording event on the instrument tells me time.

If one of these components changes due to a mistake in recording any piece of that process, when I republish the dataset, I need to locate the entry for the occurrence record I need to correct, using the IDs of the components that I know very well contributed to the creation of the occurrence record. If it's a UUID, I have potentially lost my ability to do that transparently and authoritatively.

From my experience, which I grant is specific compared to the whole of the database, guidance for this field that would be valuable to a user learning how to create these archives would be to use something that is guaranteed unique, related to the occurrence record's creation, and authoritative to an individual record, and if you don't have something obvious to do that with, use a UUID as a last resort.

peterdesmet commented on August 29, 2024

As I understand, the intent is to remove examples that carry meaning (which I understand). The only example that then remains is a UUID, which seems to give the impression (from the comments above) that that is the only valuable alternative.

In my opinion, integer identifiers that are maintained by the source (e.g. a database) and are immutable, are also valuable alternatives. I would therefore include the following as an additional example:

20622886648

Note: this example is derived from https://www.gbif.org/occurrence/3797662301, where the occurrenceID is the record identifier assigned by the source database (in this case Movebank).

dbloom commented on August 29, 2024

I was just responding to this when @peterdesmet's message posted which mirrors my own understanding. This is a request only to have certain examples removed (which I understand, too), not a request to stop the use of certain types of identifiers, such as Triplets.

I will, however, admit to a little confusion about the implications of removing certain types of identifiers from the list of examples. Is there an implication here that GBIF is planning to move away from Triplets and URLs in the occurrenceID field?

Certain collections systems, Arctos being a solid example, are not likely to stop using a URL any time in the foreseeable future. Similarly, as @albenson-usgs described, the Triplet might be the most stable option for many collections with limited staff and technical expertise, especially as we work to include more data from ecologists and biologists not affiliated with traditional museum collections directly. Like Abby, I work with datasets like these, and their publishers, regularly.

I am also curious about the implications here as a GBIF Trainer/Mentor. Historically, Triplets are a recommended option for publishers moving through the BID Programme as well as other GBIF-inspired trainings around the world. Data publishers tend to take the definitions and examples provided in the DwC Quick Reference Guide quite literally, so I could see some confusion arising among publishers with this change.

dbloom commented on August 29, 2024

@albenson-usgs That could be a useful conversation. I am unable to attend TDWG this year, so I don't know how participation will work in that unconference setting - we'll have to see how that room is set up.

For purposes here, I would also be curious for a little more clarification from @ahahn-gbif on this matter. No doubt she and the GBIF Data Products folks see many more datasets than we do, but I'm curious to know a little more about the value of removing examples like these from the GBIF perspective. I'm not opposed to the change, I would just like to know more and, like @albenson-usgs, I'm curious what else might be out there that isn't a randomly generated UUID.

qgroom commented on August 29, 2024

Please don't let this thread get sidetracked on the various merits of different identifier systems.
@ahahn-gbif is not recommending a UUID; she is removing the implicit recommendation of using a triple.
Triples have long been discredited: they are sometimes not unique, but worse still, they totally fail when confronted by the myriad of ways people construct them.

Therefore I suggest adding a URI as an example of a stable occurrenceID

e.g. http://www.botanicalcollections.be/specimen/BR5020224598676V

References about rubbish triples...
https://doi.org/10.1371/journal.pone.0114069
https://doi.org/10.37044/osf.io/93qf4

dbloom commented on August 29, 2024

Yes. That is also where my confusion/concern lies. I don't see how the means leads to the desired end.

qgroom commented on August 29, 2024

Except that all the evidence has shown that triples do not work. So far I've not seen any evidence that URIs or DOIs are not stable.

albenson-usgs commented on August 29, 2024

Ok here is an example where the occurrenceIDs completely changed between V1.0 and V1.1. Admittedly it is me as a naïve data manager just trying to do my best and being told that it must be a UUID. But I am not alone in this: this is what people will do, and it will not lead to the result you are wanting. I think where I struggle with this requirement is how you anticipate it being carried out for small datasets that are managed in Excel. Can you please show me an example of such a dataset managed completely in Excel that has successfully implemented a DOI or URI scheme for their occurrenceIDs?

qgroom commented on August 29, 2024

Human errors will always happen with whatever system is used. Excel is particularly clever at amplifying the chance of human errors.
The problem with triples is that in addition to human errors they are not inherently unique, stable or standardized.
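To make the "not standardized" point concrete, the sketch below builds a triple for one and the same record in three different ways. The three conventions are illustrative guesses at what different publishers might do, not a survey of real practice, and the record values are borrowed from an example elsewhere in this thread.

```python
record = {
    "institutionCode": "MSB",
    "collectionCode": "Mamm",
    "catalogNumber": "233627",
}

FIELDS = ("institutionCode", "collectionCode", "catalogNumber")

def triplet_colon(r):
    """Plain colon-separated form."""
    return ":".join(r[k] for k in FIELDS)

def triplet_urn(r):
    """Same parts with a urn:catalog: prefix."""
    return "urn:catalog:" + ":".join(r[k] for k in FIELDS)

def triplet_hyphen_lower(r):
    """Lower-cased parts joined with hyphens."""
    return "-".join(r[k].lower() for k in FIELDS)

variants = {f(record) for f in (triplet_colon, triplet_urn, triplet_hyphen_lower)}
for v in sorted(variants):
    print(v)
```

All three strings describe the same record, yet none of them matches another under a plain string comparison, which is how identifiers are usually compared when datasets are merged.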

I think we can suggest any of the common systems, including UUIDs, but not triples, because they are not fit for purpose.

Jegelewicz commented on August 29, 2024

Triples may not be unique, but I don't think you can get more unique than http://arctos.database.museum/guid/MSB:Mamm:233627? A web address must be globally unique or the internet wouldn't function. Just because we include what someone calls a "triple" in it should not make it unworthy of being a GUID. @dustymc

deepreef commented on August 29, 2024

Triples may not be unique, but I don't think you can get more unique than http://arctos.database.museum/guid/MSB:Mamm:233627? A web address must be globally unique or the internet wouldn't function. Just because we include what someone calls a "triple" in it should not make it unworthy of being a GUID. @dustymc

Sure, you can certainly achieve uniqueness; but you break persistence the moment the collection decides to change the collectionCode to "Mam", or someone discovers a typo in the catalogNumber, or if you add an SSL certificate to the website and the identifier changes to "https://arctos.database.museum/guid/MSB:Mamm:233627".

Sure, I know that last one is easily brushed aside because web resolvers usually handle redirection appropriately. But the point of opacity in identifiers is more about persistence (stability) than it is about uniqueness.

albenson-usgs commented on August 29, 2024

Ok here is an example of a dataset just freshly published. Code is here. For this one the project was observing sound in the ocean which also includes species observations from those sounds. The project leads have archived all the data and shared it publicly. They are interested to share the species observation portion but the data are not managed with that goal. A data manager separate from the project then took those observations from those sound recordings, aggregated them together (they were spread across multiple datasets), aggregated them so that only one occurrence per day is recorded and aligned them to Darwin Core. An occurrenceID is necessary. The person performing the work is not the holder of the data. I'm truly and honestly curious what those advocating for opaque identifiers would have the data manager do in this instance? Anything that's done will not be stored by the original project.

dustymc commented on August 29, 2024

Yup. MSB:Mamm:233627 isn't a GUID, and if I had my way we'd stop using it even internally. (It works fine internally, but tends to leak and that leads to things like low-quality citations.)

http://arctos.database.museum/guid/MSB:Mamm:233627 is a GUID, and is as stable as the Curators care to make it, just like the material it represents. Containing a "triple" doesn't contaminate the identifier.

We run SSL and it still works.

8088573f-8aba-4f9c-90ab-fefa895b9532 is more likely to be unique than 1 - unless there's a unique index involved - but it can't really DO anything and won't generally lead humans to material. (Anyone with the keys to MSB can probably find https://arctos.database.museum/guid/MSB:Mamm:233627 without involving electrons.)

break persistence the moment the collection decides to change the collectionCode to "Mam",

That's not a collectionCode, so nope, it'll survive that change no problem. It might devolve to a UUID-like "probably unique but doesn't DO STUFF" identifier if the HTTP protocol goes away or something, but that wouldn't remove its identifier-ness.

DOIs might keep DOING STUFF through events that could remove functionality from URLs, but they're also more work to maintain. There are a few (~20K - example: https://doi.org/10.7299/X7C829FR ) assigned to records in Arctos. There's no barrier to using them; I think the cost/benefit analysis just tends to come up lacking except in some specific situations.

someone discovers a typo in the catalogNumber,

That can be dealt with in the same sort of way that discovering a problem with the labeling of physical items can be dealt with, and doesn't do anything to the identifier. http://arctos.database.museum/guid/DGR:Mamm:10002772 is an example of a record that's been recataloged (not quite a typo, still a change).

more about persistence (stability) than it is about uniqueness.

Both are necessary if one wants to DO STUFF at scale.

Jegelewicz commented on August 29, 2024

In the grand scheme of things who cares what the recommendations are? People will do whatever until there is an affordable and manageable path that allows them to do what is necessary.

deepreef commented on August 29, 2024

Sure, as long as "http://arctos.database.museum/guid/MSB:Mamm:233627" is minted once, and remains as such always even if someone realizes the catalog number is actually 233672 for this record -- that's great. This, of course, requires that the identifier be stored as a literal value (not constructed automatically from the constituent parts at export time) -- and I wonder how often that's done. Even if it is done, in my experience there is an overwhelming temptation to want to "correct" the identifier to "http://arctos.database.museum/guid/MSB:Mamm:233672" (when the catalog number error is discovered). But as long as the data manager can resist that urge, then everything should be OK.
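The difference between a stored literal and an identifier rebuilt at export time can be sketched as follows. This is only an illustration of the general point; it says nothing about how Arctos itself stores or serves its identifiers, and the record layout and build_guid helper are hypothetical.

```python
BASE = "http://arctos.database.museum/guid/"

def build_guid(record):
    """Construct the identifier from its parts (what an export script might do)."""
    return f'{BASE}{record["collection"]}:{record["catalogNumber"]}'

record = {"collection": "MSB:Mamm", "catalogNumber": "233627"}

# Minted once and stored as a literal value alongside the record.
record["occurrenceID"] = build_guid(record)

# The catalog number is later found to be wrong and is corrected.
record["catalogNumber"] = "233672"

stored = record["occurrenceID"]   # still ends in 233627
rebuilt = build_guid(record)      # now ends in 233672

print(stored)
print(rebuilt)
```

An export that publishes `stored` keeps the identifier stable across the correction; one that publishes `rebuilt` silently changes it.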

As for conflating the functions of identity with "DOING STUFF", I'll refer to here. I've written elsewhere (extensively) about why the roles of identity and doing stuff should not be conflated; but that doesn't mean I'm right.
