
Comments (123)

tmcelrath avatar tmcelrath commented on June 21, 2024 6

@edwbaker The issue is actually slightly different. "Parsing" text into many verbatim fields automatically introduces interpretation by its very nature. For example: What is a "verbatimLocality"? Should all locality info go in it? Or just the most specific locality? We've had differences of opinion just within our own group on just this one field.

To answer your question, DWC absolutely does not have enough verbatim fields. There are no verbatim identification fields, or verbatim curation labels fields (e.g. accession numbers, comments about preparation, etc ...). We use the ones that DWC has in addition to the verbatim one we are providing. Users do not have to use these fields, and yes, it introduces duplication of text, but that actually adds more power in terms of text-breakdown. We will never stop misreading labels and having poor quality control, but having this field allows for comparisons to the original verbatim label and will allow for corrections to be made.

The idea of this field is in part, quality control. I have found having this field INVALUABLE more times than I can count when looking back at the original text, comparing incorrect GPS coords, poorly interpreted localities, or people misreading labels.
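As a rough illustration of that quality-control use, the sketch below flags parsed fields whose values cannot be found literally in the verbatim label. The helper name `fields_missing_from_label` and the sample record are hypothetical, and a literal substring test is only one assumed way to compare; real workflows would likely need something more tolerant.

```python
def fields_missing_from_label(record, field_names):
    """Return the parsed fields whose values do not occur literally in verbatimLabel."""
    # Collapse whitespace so line breaks in the label do not hide a match.
    label = " ".join(record.get("verbatimLabel", "").split()).casefold()
    missing = []
    for name in field_names:
        value = " ".join(record.get(name, "").split()).casefold()
        if value and value not in label:
            missing.append(name)
    return missing


# Hypothetical record: the event date was mistyped during parsing.
record = {
    "verbatimLabel": "ILL: Union Co.\nWolf Lake by Powder Plant\n"
                     "Bridge. 1 March 1975\nColl. S. Ketzler, S. Herbert",
    "verbatimLocality": "Wolf Lake by Powder Plant Bridge",
    "verbatimEventDate": "1 March 1957",
}
suspect = fields_missing_from_label(record, ["verbatimLocality", "verbatimEventDate"])
print(suspect)
```

Here only the mistyped date is flagged, which is the kind of discrepancy a curator could then check against the label or its image.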

from dwc.

seltmann avatar seltmann commented on June 21, 2024 5

I also support the creation of a verbatimLabel field in Darwin Core that is not an extension. This very issue came up at the last Entomological Collections Network meeting with a lot of support (not just from TaxonWorks folks). Also, it would benefit users of the NSF ADBC Parasite Tracker and Big-Bee projects. The reason is that student and volunteer digitizers are often tasked to capture (type) data "verbatim" and the data is curated later by a more experienced person. Also, some of the data does not fit neatly into a Darwin Core field immediately, so it ends up incorrectly in occurrenceRemarks, which has been a dumping ground for verbatim label data for a while now. Thanks, everyone!

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024 4

Proposed revised definition (in bold) alongside old definitions. Unchanged terms are not formatted or restated (@tucotuco):

verbatimLabel: New term
Original Submitter: Hannu Saarenmaa
New Submitters: Tommy McElrath @tmcelrath, Debbie Paul @debpaul, Tim Robertson @timrobertson100,
Christian Bölling @cboelling

Original Efficacy Justification: In the first phase of the digitization process we try to capture everything "as is". Interpretation should follow from that

Revised Efficacy Justification (why is this term necessary?): To provide a digital representation derived from and as close as possible in content to what is on the original label(s), in order to provide quality control and comparison to any and all parsed data from a label. Other use cases are outlined here: https://doi.org/10.1093/database/baz129

Demand Justification (name at least two organizations that independently need this term): Survey of digitizing collections conducted by @tmcelrath, DataShot (MCZ), TaxonWorks, GBIF

Stability Justification (what concerns are there that this might affect existing implementations?): New term, does not adversely affect any existing terms or implementations.

Implications for dwciri: namespace (does this change affect a dwciri term version)?: As a "verbatim" term, dwc:verbatimLabel is not expected to have a dwciri: analog, so there are no implications in that namespace.

Proposed attributes of the new term:
Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel

Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample

Original definition: The full, verbatim text from the specimen label.

Revised Definition of the term (normative): A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label affixed on, near, or explicitly associated with a preserved specimen, free from any and all interpretation, translation, or transliteration.

Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Lines or breakpoints between blocks of text to establish context that could be verified by seeing the original labels or images of them can be used, but are not required or recommended. Breakpoints should be able to be ignored by text-filtering and recognition algorithms such as md5. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks

While textual content from automated processes such as optical character recognition is not comparable to human transcription, this term is not meant to restrict anyone’s definition of what should or shouldn’t go in this field. E.g. If you feel that your OCR output supports discoverability in aggregation, feel free to use this DWC-field. Best practice is to add “verbatimLabel derived from unadulterated OCR output” in occurrenceRemarks

Transcribed text from accession books or field notebooks can go in this field (e.g. not physically attached to the specimen) as long as they are explicitly connected in some fashion to a voucher, like through an accession number or field code. Best practice is to add comment “verbatimLabel derived from human transcription of label and/or accession/field notebooks”

Examples (not normative):
(labels affixed to a specimen in a vial)
1)
ILL: Union Co.
Wolf Lake by Powder Plant
Bridge. 1 March 1975
Coll. S. Ketzler, S. Herbert

Monotoma
longicollis 4 ♂
Det TC McElrath 2018

INHS
Insect Collection
456782
With comment “verbatimLabel derived from human transcription” added in occurrenceRemarks

(OCR content of an herbarium sheet)
0 1 2 3 4 5 6 7 8 9 10
cm copyright reserved
The New York
Botanical Garden

NEW YORK
BOTANICAL
GARDEN

NEW YORK BOTANICAL GARDEN
ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
EXPLORATION OF BERMUDA
NO. 355
Cymbalaria Cymbalaria (L.) Wettst
Roadside wall, The Crawl.
STEWARDSON BROWN
}COLLECTORS AUG. 31-SEPT. 20, 1905
N.L. BRITTON

NEW YORK BOTANICAL GARDEN
00499439
With comment “verbatimLabel derived from unadulterated OCR output” added in occurrenceRemarks

Refines (identifier of the broader term this term refines; normative): None

Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None. Does not replace any current DWC “verbatim” terms. Other “verbatim” terms have already been “parsed” to a certain data class and have their own uses

ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText

from dwc.

matdillen avatar matdillen commented on June 21, 2024 3

Would this term support standards for rendering OCR output, in JSON or XML formats? This would allow capturing text location within the image and text entity annotation, which is much more useful to support the training of text capture and text recognition algorithms.

JSON structures are supported by dwc:dynamicProperties, but I suspect they may also break some csv readers.
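A small sketch of the CSV concern, using only Python's standard library: a JSON value survives a round trip when both writer and reader honor CSV quoting, but a naive split on commas breaks it apart. The payload keys (`text`, `bbox`) and the `occurrenceID` value are made up for illustration and are not part of any standard.

```python
import csv
import io
import json

# Hypothetical OCR payload; the "text" and "bbox" keys are illustrative only.
payload = json.dumps({"text": "NO. 355", "bbox": [102, 391, 2913, 432]})
row = {"occurrenceID": "example-1", "dynamicProperties": payload}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
writer.writerow(row)

# A quote-aware reader recovers the JSON intact.
buffer.seek(0)
parsed = next(csv.DictReader(buffer))
round_tripped = json.loads(parsed["dynamicProperties"])

# A naive comma split of the same data line breaks the field into pieces.
data_line = buffer.getvalue().splitlines()[1]
naive_columns = data_line.split(",")
print(len(naive_columns), round_tripped["bbox"])
```

So the risk is less about JSON itself than about consumers that do not parse quoted fields.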

from dwc.

benwbrum avatar benwbrum commented on June 21, 2024 3

I'm not really a stakeholder here, but have been watching this conversation since @debpaul pointed me at it.

I suspect that it's really important to be explicit about whether characters not part of the label itself are allowed -- in other words, are the contents of this field allowed to contain XML, HTML, ALTO, JSON or are they required to be plaintext only, with any formatting limited to whitespace? Plaintext is lossy, since human transcribers can't mark text with tags like <unclear> and automated systems can't add positional information like bbox="102,391,2913,432". However, all the mark-up tags in XML or JSON encoding become noise for full-text search and textual analysis.

Noise from tags (or other mark-up) can be removed, of course, if the consumer knows that the tags exist. If the format of the verbatimLabel is not restricted, you are likely to have one institution assuming that all contents will be plaintext, then running into serious problems when they encounter data containing raw ALTO-XML in verbatimLabel.
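To make the "noise can be removed" point concrete, here is a minimal sketch of a consumer that strips mark-up before indexing, assuming the value is either plaintext or a single well-formed XML fragment. The function name `plain_text` and the `<label>` wrapper element are hypothetical; real ALTO-XML keeps its text in attributes and would need format-specific handling.

```python
import xml.etree.ElementTree as ET


def plain_text(value):
    """Return the text content of value if it parses as XML, else value unchanged."""
    try:
        root = ET.fromstring(value)
    except ET.ParseError:
        return value  # not well-formed XML; treat it as plaintext
    # itertext() yields the text of every element in document order.
    return "".join(root.itertext())


tagged = "<label>Wolf Lake by <unclear>Powder</unclear> Plant</label>"
print(plain_text(tagged))
print(plain_text("Wolf Lake by Powder Plant"))
```

Both calls yield the same searchable string, but only because the consumer anticipated mark-up in the first place.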

from dwc.

timrobertson100 avatar timrobertson100 commented on June 21, 2024 3

verbatimLabel = def. A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label.

Given that any particular binary object asserted to be a verbatimLabel can actually contain errors, corresponding usage notes would certainly be warranted.

This seems very clear, along with usage notes capturing expectations around errors, why they may appear and guidance for OCR use.

It also covers the cases @debpaul raises as you can fixup OCR within the scope of this definition.

For the most part, DwC doesn't provide guidelines on how to serialize values with a few exceptions. While DwC remains an untyped standard, I tend to think it best we consider serialization restrictions and guidance the responsibility of the implementer.

from dwc.

cboelling avatar cboelling commented on June 21, 2024 3

I am in support of the revised definition. Seeing that I proposed the first part of it, you can also put me down as a submitter, if that helps the process. While I support the definition, I think it could still be simplified without loss of expressivity:

A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label, free from interpretation, translation, or transliteration.

which omits the "affixed on, near, or explicitly associated with a preserved specimen," and the "any and all" part: Reference to the object, that the label whose verbatim representation is the subject of inquiry is associated with, and listing ways this label is associated with the object are not necessary for defining verbatimLabel (only the reference to the label is). This is also consistent with the last paragraph of the proposed usage comments. In addition, as @deepreef mentions, restricting the definition to object types which are themselves under discussion might be problematic.

With regard to usage comments:
I suggest explicitly stating that abbreviations and (supposed) misspellings should not be expanded or corrected, respectively. I consider line breaks (non-printable) characters that are just as much part of a label as the letters of an alphabet and which are important for interpreting the content subsequently. Therefore, I would not discourage encoding them in a verbatimLabel. The sentence "Breakpoints should be able to be ignored by text-filtering and recognition algorithms such as md5." is confusing to me. Depending on the serialization at hand, parsing out breakpoints will usually be possible in a downstream application if one desires to do so, but I see this as an independent step in the use of the data. In summary, I would change the first paragraph of the usage comments to:

The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them can be used. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.
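As a sketch of that independent downstream step, the snippet below collapses line breaks and other whitespace before computing an MD5 digest, so two transcriptions that differ only in breakpoints compare as equal. The name `label_fingerprint` and the choice to collapse every whitespace run to a single space are assumptions for illustration, not part of the proposal.

```python
import hashlib


def label_fingerprint(text):
    """MD5 hex digest of the label text with all whitespace runs collapsed."""
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


with_breaks = "ILL: Union Co.\nWolf Lake by Powder Plant\nBridge. 1 March 1975"
without_breaks = "ILL: Union Co. Wolf Lake by Powder Plant Bridge. 1 March 1975"

same = label_fingerprint(with_breaks) == label_fingerprint(without_breaks)
print(same)
```

The stored verbatimLabel keeps its line breaks; only the comparison ignores them.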

from dwc.

timrobertson100 avatar timrobertson100 commented on June 21, 2024 3

We're heading toward a new round of public review in Darwin Core within the coming weeks, so have the opportunity to get this one over the line.

My reading of this is that we have a new proposal in this comment which has addressed the concerns as best as possible and subsequently received supportive comments.

@tucotuco - is it OK that a proposal lives buried down in commentary, or should we a) update the opening comment which may be confusing, or b) open a new issue for public commentary, please?

from dwc.

timrobertson100 avatar timrobertson100 commented on June 21, 2024 3

Thanks @tucotuco

I have updated the issue as per @tmcelrath proposal and applied the following changes:

  1. "...content of a label affixed on, near, or explicitly associated with a preserved specimen, fossil specimen, or material sample..." to address the discussion between @tmcelrath @deepreef and @cboelling
  2. The usage comment suggested here was used.
  3. I changed "... Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used..." to be clear line breaks were optional
  4. I applied some markdown formatting on examples only to aid readability in the github issue view

Please can you check I haven't overlooked anything @tmcelrath @cboelling so we're ready for public review shortly?

from dwc.

matdillen avatar matdillen commented on June 21, 2024 2

There are various use cases for verbatim data. We described quite a few of them in a paper we wrote a while ago, more specifically in this table.

Darwin Core terms currently hardly support these use cases, with many verbatim concepts unaccounted for and no unambiguous term for the uninterpreted text dump as Tommy described.

While the content of this term will be messy and not very practical for machine training purposes, which seems like it could be a nice use case, it would support improved findability, validation efforts and linguistic aspects.

from dwc.

edwbaker avatar edwbaker commented on June 21, 2024 2

The issue I see with adding verbatimLabel or an equivalent (in name it doesn't cover other data sources, such as occurrences from a notebook) is that if we have that, why do we need all the verbatim fields in dwc? The current process seems to be we put the label data in verbatimX and cleaner data in X. If we follow this precedent, then we should look at what verbatim label data is missed at present, and how we address that (two possible solutions in my above comment). If we don't follow this precedent then (in my mind) we have a much larger discussion.

I think the point raised above by @albenson-usgs between data management (which I take in this instance to broadly be within an institution) and data standards (broadly between institutions) is highly relevant. From what I can see (glancing over dwc) this would be the first break from relatively atomic data to a definition that might include multiple data types. This alone I think is worthy of some serious discussion.

I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024 2

@museumjames The entire point of the field is to "dump data". So is "verbatimLocality" and "preparations". The difference is, the idea behind this field is specifically to do just that: dump data for later use in data validation, artificial intelligence, etc...

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024 2

Here's one, from OCR output on an NYBG herbarium sheet NYBG_00499439

That's a nice, clean example @debpaul. Much of the OCR I've seen however is chock-full of artifacts when the OCR engine "sees" all kinds of text when none exists. Do we still call these outputs verbatimLabel? When/if OCR artifacts are auto-removed to help clean it up prior to publishing, is it still considered "verbatim"?

from dwc.

cboelling avatar cboelling commented on June 21, 2024 2

Given the clear use case and demand, and in the interest of trying to achieve consensus in this review round, would a change in the description to clarify expectations be more readily accepted? For example:

verbatimLabel:
The best effort to capture the full text from all labels affixed on or near a MaterialSample, free from any and all interpretation, translation, or transliteration. Consumers of this field should consider that errors may be present, originating from e.g. damage, illegible handwriting, or automated approaches.

@timrobertson100 Is this proposed as a definition or as a usage comment?

If I had to come up with a definition of verbatimLabel based on the role they play in the digitization processes I deal with it would be something like

verbatimLabel = def. A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label.

Given that any particular binary object asserted to be a verbatimLabel can actually contain errors, corresponding usage notes would certainly be warranted.

from dwc.

cboelling avatar cboelling commented on June 21, 2024 2

are the contents of this field allowed to contain XML, HTML, ALTO, JSON

These can all serve as - structured- serializations of document content, content encompassing both literals and potentially document structure. I personally would be in favor of using those for precisely the reasons @benwbrum describes.

If the format of the verbatimLabel is not restricted, you are likely to have one institution assuming that all contents will be plaintext, then running into serious problems when they encounter data containing raw ALTO-XML in verbatimLabel.

I would still prefer an exchange standard that allows representations using different serializations given the varied nature of the material and use cases. It is not unreasonable for applications to have to check the data that is consumed.
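One way a consuming application might check what it receives is to sniff the serialization before processing it. The sketch below is only a heuristic under stated assumptions: `guess_serialization` is a hypothetical helper, it recognizes just JSON objects/arrays and well-formed XML, and anything else is treated as plaintext.

```python
import json
import xml.etree.ElementTree as ET


def guess_serialization(value):
    """Heuristically classify a field value as 'json', 'xml', or 'plaintext'."""
    stripped = value.strip()
    # Only objects and arrays count as JSON; a bare number such as a
    # catalog number would otherwise parse as valid JSON.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass
    return "plaintext"


kinds = [
    guess_serialization('{"text": "NO. 355"}'),
    guess_serialization('<String CONTENT="NO."/>'),
    guess_serialization("NO. 355"),
    guess_serialization("[sic] Wolf Lake"),
]
print(kinds)
```

A label that happens to begin with "[" or "<" still falls through to plaintext when it does not parse, which is the ambiguity an explicit format declaration would avoid.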

from dwc.

deepreef avatar deepreef commented on June 21, 2024 2

@debpaul : I read through the earlier posts briefly, but haven't had time to digest fully.

So it is "verbatim" but not as captured strictly from the OCR. Rather, we started with OCR, then improved to capture the handwritten bits.

So, yeah -- OCR is just the tool to convert patterns of ink on paper into UTF-8-encoded character strings -- no different from the package/workflow consisting of human eyeballs --> human brain --> human fingers --> computer keyboard --> database record. Both of these mechanisms to transcribe those ink symbols on paper into computer-intelligible text strings are prone to errors.

I think it goes without saying that the fidelity between the translation/transcription from patterns of ink on paper to UTF-8-encoded character strings is potentially imperfect, and that improvements/corrections over time are inevitable.

Note also that the "original" data need not be patterns of ink on paper, as some "verbatim" data may actually be born digital.

In addition to the question of embedded markup protocols pointed out by @benwbrum (which itself would require some mechanism of escaping characters like "<" and ">" to indicate literal values of such), it would also need to be clear that "verbatim" excludes "interpreted" values (corrections of spellings, etc.)
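For the escaping point, a minimal sketch using Python's standard library shows literal "<", ">" and "&" characters surviving a trip through XML mark-up. The sample label text and the `<label>` wrapper element are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Hypothetical label text that contains literal angle brackets and an ampersand.
literal = "Det. <illegible> 1905 & later"

# escape() replaces &, < and > with entity references.
escaped = escape(literal)
print(escaped)

# Embedded in mark-up, the escaped text parses back to the original literal.
wrapped = f"<label>{escaped}</label>"
recovered = ET.fromstring(wrapped).text
print(recovered == literal, unescape(escaped) == literal)
```

Without the escaping step, the same text would either fail to parse or be misread as a tag.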

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024 2

Here is an assessment of discussions to this point, with current obstacles to consensus:

  1. There is contention about whether the term should be a "dumping ground" without expectations of how to populate or interpret it, versus using it to provide data where there is some expectation about how to populate or interpret it using one or more community practices. My feeling is that this contention can only be solved with a permissive definition that allows both ends of the spectrum to use the term as they wish. Usage comments can make non-normative recommendations beyond the definition.
  2. Even with a consensus for a permissive definition as described in 1) there is contention that a dumping ground term is not appropriate for a data exchange standard. Proponents have demonstrated demand and valid use cases, so my assessment is that this is not an obstacle to adoption of the term.
  3. There is a desire to expand the scope of the term to be applicable to media other than just written text, and not just specimen labels, but also sound recordings in the field, for example. To cover this expanded perspective the term name isn't satisfactory, nor is its proposed organization in MaterialSample, nor are the discussions about micro-encoding inclusive enough. My feeling is that this contention can be solved by a) consensus from specimen-view adherents that the expanded view is a valid use case, followed by a different proposal that is inclusive of it, or b) consensus from the expanded-view adherents that it is acceptable not to mix the expanded use cases with the specimen-specific one for written labels.
  4. Even with consensus one way or the other on 3), there is the issue of a consensus definition. For the specimen view, some see MaterialSample in the definition as problematic and propose 'specimen' instead, which can be interpreted in any way the English language allows. MaterialSample in the definition signifies the meaning sensu Darwin Core's MaterialSample class - A physical result of a sampling (or subsampling) event. In biological collections, the material sample is typically collected, and either preserved or destructively processed. For the expanded view, no alternative definition has thus far been provided.

Right now, no consensus is in sight. In terms of process, if an apparent consensus can be reached in the next four days (when the 30-day minimum public review period is up), then the clock can be reset with the consensus definition open for further review and potential revision for at least another 30 days. If consensus is still not in sight in the next four days, then the proposal will remain open, but will be excluded from advancement toward ratification in this milestone (https://github.com/tdwg/dwc/milestone/14). In any case, as soon as possible after 2021-05-31, the Darwin Core Maintenance Group will determine how to proceed with respect to a new release of the standard...

No given proposal has to reach a consensus for a release to be made. To avoid a potentially never-ending review the Maintenance Group will assess the state of proposals in the milestone after the first thirty days and decide whether to make a release of those proposals that had no controversy, or to wait long enough to include all of those that appear to have viable solutions to any problems that are identified in public review. (from Darwin Core Maintenance Frequently Asked Questions.)

from dwc.

deepreef avatar deepreef commented on June 21, 2024 2

@debpaul :

does this matter?

Maybe not. I guess it depends on what content providers and consumers hope to actually get out of values provided for this term. Rightly or wrongly, I often tell people that in the context of databases, "consistency trumps accuracy". That may be a flawed premise, but I suspect the concerns expressed by @museumjames (which I tend to share), relate to the tension between imprecise boundaries on what kinds of content are to be provided (which affects consistency), and what the shared understanding of utility to be gained by consumers of that content. This is all still very fuzzy in my mind (admittedly in part because I haven't spent as many showers/traffic jams/late-night ceiling-staring sessions contemplating it as I have for other sub-consensus issues in this round of DwC topics).

Edit: I just read the post above by @tucotuco and see that he covered what I was trying to get at much more effectively than I did.

from dwc.

timrobertson100 avatar timrobertson100 commented on June 21, 2024 2

I think I am becoming more and more confused about what the point of this term is. I am imagining some bad OCR, with or without markup, being used to train a machine learning model. This model could then run on labels from other specimens, and possibly populate the same field.

This isn't what I've understood from the originators of this request. There are people who have put in the effort to digitize the content of the label and currently DwC doesn't provide a concept to accommodate this. Full-text search is one use case that has been identified.

Clear explanation in comments and examples that make it easy to do the right thing and difficult to do the wrong thing seem appropriate and would address the worry of ML developer misuse. If not that, we should find an alternative way to accommodate the original request.

I worry if novice modelers become a reason to block progress. Concepts like identification and presence/absence at a location are far more likely to cause problems than text from a label in a concept that's adequately explained as potentially having errors.

Edited to add: More complex cases capturing the output and workflows around OCR, fields with markup and the type of markup a consumer should expect would be well suited to an extension in a DwC-A (e.g. capturing the equipment used, software, version, raw output, cleaned output etc). DwC-A is of course only one use of DwC, but is a common one.

from dwc.

teleaslamellatus avatar teleaslamellatus commented on June 21, 2024 2

Katja is 100% right! Besides that, all taxonomic publications in the last >200 years have used verbatim label infos. Having a field like that in DWC might actually make DWC more popular amongst taxonomists that would eventually contribute to the first-hand generation of DWC archives in taxonomic publications. Don't get me wrong, transcribing label information is crucial, but knowing about the high number of errors due to wrong transcriptions makes a verbatim label column perhaps one of the five most important ones in a DWC table.

Also, why is anyone afraid to create a new term if it is really needed by so many users???? I have spent some time in developing community governed controlled vocabularies (biomedical ontologies, RO, PATO, BSPO etc) and have never seen such an unreasonable resistance against creating a term that is needed and such a long-lasting discussion about a new term. Actually, it is almost as entertaining to read people's comments here as reading the Taxacom mailing list.

from dwc.

deepreef avatar deepreef commented on June 21, 2024 2

I'm definitely not afraid to create the term, and would generally support it. I think the concern/pushback was in terms of how, exactly, it would be defined. A lot (most?) of the effort put into this last round of DwC proposed changes involved definitions of terms that seemed logical and intuitive at the time they were created, but ended up being interpreted by different people in different ways, potentially causing more confusion than value. At least that was the basis for my concern expressed earlier in this issue.

I think the new definition (and associated usage comments) is very explicit and precise, and obviates many of the previous concerns, so I am very much in support of it. The only caveat is that the use of the term "preserved specimen" may ultimately be problematic, in that the DwC class of the same name (PreservedSpecimen) is also being discussed/debated right now. Similarly, I assume we'd want verbatimLabel to be applied to specimens characterized as FossilSpecimen, which by current DwC definitions are different from PreservedSpecimens.

Maybe change to "...content of a label affixed on, near, or explicitly associated with a material sample..." (or something generic like that).

Just trying to head off potential future confusion.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024 2

@deepreef the only problem I see with "...content of a label affixed on, near, or explicitly associated with a material sample..." is that it runs into the exact same issue that you just described. What if we just make it broader and say "...content of a label affixed on, near, or explicitly associated with a preserved specimen, fossil specimen, or material sample..."?

@cboelling the only problem with removing the statement is that others have previously requested that it be added, to make sure that this term is not used for other pseudo-transcriptions of things that are not explicitly associated with some collection object preserved in a museum. That's the reason it's in there and I think it needs to stay in for that reason.

@cboelling I agree with your revised first paragraph of usage comments. Seems fine to me. As long as we do not formally require breakpoints or disallow them, I'm fine with them. There were previous concerns that some algorithms wouldn't be able to filter them out, but I agree that most are easily filterable/ignorable or otherwise processable, as evidenced by comments by @timrobertson100 previously.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024 2

@timrobertson100 The current proposal to be assessed should be opening comment along with a comment down here that it was updated.

from dwc.

chicoreus avatar chicoreus commented on June 21, 2024 1

We use a field for verbatim transcription of a label in the DataShot object to image to data workflow software. This captures the verbatim transcription of text from a region of interest representing a single label identified in an image of a set of labels. Subsequent workflow steps add interpretation of this verbatim text into structured data. In a less formal manner, there is a Twitter feed https://twitter.com/EntoTranslator and a Facebook group https://www.facebook.com/groups/232785306782255/ where images of difficult to interpret labels are posted for members of the community to either provide transcriptions from difficult to read handwriting or interpretations of words, phrases, abbreviations, and such on the labels. There are clear upstream needs in digitization workflows for representing verbatim label text in structured form.

from dwc.

chicoreus avatar chicoreus commented on June 21, 2024 1

As noted above, we've got a field for this in the DataShot system at the MCZ associated with a region of interest in an image that contains multiple labels, but haven't been able to go very far with this in the absence of a means of sharing with the community.

from dwc.

edwbaker avatar edwbaker commented on June 21, 2024 1

@albenson-usgs I think the only way of going back to the original data here is to include a label image. Having a label field is one potential source of error, then any further processing from that is another potential source of error.

There are a number of potential solutions to "the verbatim problem" in this thread (using either SKOS or a separate dwc namespace).

from dwc.

matdillen avatar matdillen commented on June 21, 2024 1

The GNA verbatimLabel term is a part of the Types and Specimen extension, which extends core Taxon data to support multiple type names or type specimens. This extension does not (currently) do the same for Occurrence data.

Raw images, even of segmented labels, are not a perfect substitute for verbatim (annotated) strings. Images are not machine readable and may not be as human readable as textual strings or strings annotated into different verbatim terms with a more specific meaning. Handwritten text may be poorly legible and label text may be ambiguous in its meaning. Partially transcribed text may allow different people looking at the image to build on each other's work.

The problem here is, as I suggested earlier, that verbatim data have many different use cases. For some use cases, looking at the image is sufficient or even optimal. For others, it is not feasible at all. For some, different verbatimX terms are desirable. For others, they are not useful at all. We're not going to address everyone's concerns with a single new term or namespace.

More specific standards exist for the exchange of verbatim text captured from images, such as Alto. But these have their own drawbacks and there are some complications when mixing handwritten text, typed text and marked up text such as logos, stamps, tables...

from dwc.

chicoreus avatar chicoreus commented on June 21, 2024 1

In ABCD, there's the concept of a Mark (http://rs.tdwg.org/abcd/terms/Mark), where the present term proposal looks like it corresponds closely to the verbatimText property of a Mark (https://abcd.tdwg.org/terms/#verbatimText). In ABCD 2.1 that looks like /Marks/Mark/MarkText: https://github.com/tdwg/abcd/blob/9fb1355511334aad423be5c03ac43369580749b8/xml/ABCD_2.1.xsd#L1128

from dwc.

debpaul avatar debpaul commented on June 21, 2024 1

Much of the OCR I've seen however is chock-full of artifacts when the OCR engine "sees" all kinds of text when none exists. Do we still call these outputs verbatimLabel? When/if OCR artifacts are auto-removed to help clean it up prior to publishing, is it still considered "verbatim"?

@dshorthouse thanks. I wonder if we can let providers do what suits them for this? (They already do, to some extent, sometimes "structure" or "correct typos" in verbatim fields, for better or worse.)

Who are these data for? If they are for facilitating searches (discovery before atomization complete), if they are for mining, does it matter?

Maybe, we can call it something else like @edwbaker suggested? rawLabel, rawPrimarySource ...

And you can see (as others raised), we could think of how to link "has image" concept to the text output.

from dwc.

cboelling avatar cboelling commented on June 21, 2024 1

Much of the OCR I've seen however is chock-full of artifacts when the OCR engine "sees" all kinds of text when none exists. Do we still call these outputs verbatimLabel? When/if OCR artifacts are auto-removed to help clean it up prior to publishing, is it still considered "verbatim"?

@dshorthouse thanks. I wonder if we can let providers do what suits them for this? (They already do, to some extent, sometimes "structure" or "correct typos" in verbatim fields, for better or worse.)

In our digitization efforts, what we strive for and refer to as "verbatim" is a character-by-character-representation of the original source in some other format, including characters for line breaks and other structural features of the original text, preserving abbreviations, misspellings and any other idiosyncratic elements of the original. How that verbatim representation is generated is a corollary regarding its provenance (and that is desirable metadata in a number of use cases) but we do not make any assumptions on that process, i.e. if it is carried out by some automated procedure, by a human, or by some combination of that and possibly other elements.
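As a rough illustration of the "characters for line breaks" point, here is a minimal Python sketch of carrying a multi-line label through a delimited text file without losing its line structure. The label text, the `occurrenceID` value, and the helper names are made up for the example; nothing in the proposal prescribes this particular serialization.

```python
import csv
import io

def write_rows(rows, fieldnames):
    """Serialize records to CSV text; fields containing line breaks get quoted."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text):
    """Parse CSV text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text, newline="")))

# Hypothetical label: three lines, abbreviations kept exactly as written.
label = "ILL: Champaign Co.\nUrbana, 12-VI-1950\ncoll. J. Doe"
rows = [{"occurrenceID": "example:1", "verbatimLabel": label}]

csv_text = write_rows(rows, ["occurrenceID", "verbatimLabel"])
round_tripped = read_rows(csv_text)
```

The embedded line breaks survive because the writer quotes the field, so a consumer reading the file with a CSV-aware parser gets the same three lines back.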

from dwc.

deepreef avatar deepreef commented on June 21, 2024 1

@tmcelrath -- ok, understood. I think if it's organized in the MaterialSample class, then ultimately it will automatically imply preserved specimen and fossil specimen, but I see no harm in keeping the definition explicit.

Just to be clear, though -- there is no debate about MaterialSample vs. Occurrence. There is a little bit of ambiguity with respect to MaterialSample vs. Organism, but that's not really part of the discussion (yet). The real debate that is currently happening is whether the terms PreservedSpecimen and FossilSpecimen should be maintained as they are in DwC (i.e., as classes), or if these terms should be deprecated in favor of MaterialSample (with the distinction of "Preserved" vs "Fossil" vs. "Living" being represented in other properties/terms within the MaterialSample class).

In any case, what matters for this issue, I think, is clearly defining the scope of what this new term is "intended" to be used for. I don't have any opinions on that scope -- only that it be indicated as explicitly as possible (whatever it is).

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

This proposal still needs evidence of demand.

My question is, "Is it not sufficient/preferable to capture the label images? That is one level less of interpretation already."

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

We use this field in TaxonWorks. We split it into three fields: "Buffered Determination Label", "Buffered Collecting Event Label" and "Buffered Other Labels". Just having an image is not enough, and sometimes we do not have an image.

Basically, I, and many other collections using TaxonWorks, want this DWC field.

from dwc.

matdillen avatar matdillen commented on June 21, 2024

Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)?

How to differentiate between labels and their relative location? I don't think $ and are reliable enough, in particular if OCR outputs are in scope.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

Closing for lack of demand.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

"Lack of demand?" Four different people have requested this be a DWC field and expected something to happen. I don't see lack of demand here. What do we need to provide to evidence "demand?"

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@tucotuco What, specifically, do you want us to provide then? Would a survey of different natural history collections members with documented support of their need of this field suffice?

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

@tmcelrath TaxonWorks suffices to represent that class of proponent. That is the equivalent of one proponent. What other organization or project needs it? If you can come up with that, the next step is to submit a templated New term request. I can do that, adding it to the beginning of the first comment to keep all the discussion in one place, but I need that evidence of demand.

from dwc.

edwbaker avatar edwbaker commented on June 21, 2024

This initially seems like a straightforward enough proposal, but how does it interplay with the existing (and numerous) verbatim fields within DarwinCore? It seems to risk becoming a dumping ground for data that could/should go into existing fields, and perhaps discouraging their use because it's easier to just put it all, unstructured, into verbatimLabel.

I think my main reservation is the following: are there many examples where the existing verbatim fields are inadequate, and could these be better covered by additional verbatim field(s) rather than such a loosely defined single field?

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

To anyone following this thread, I have a poll out right now: https://forms.gle/fgxbQUmQLQC4a1NY6 collecting people's thoughts about this proposed DWC field. Please help me gather responses there. I am looking to get as many diverse stakeholders as possible.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

Reopened to accommodate renewed vigor in the proposal.

from dwc.

albenson-usgs avatar albenson-usgs commented on June 21, 2024

What I'm wondering about this proposal is if we are conflating data management with implementing a standard. In my work for OBIS-USA I rarely receive data already in Darwin Core and I have to do a crosswalk. When I do that work there is always a chance that I performed that work incorrectly in some way and so I do my best to preserve the original data in a data repository and a link to that in the IPT so that future users of the data can get back to the original data to check the translation if they need to. For me it would not make sense to have all of that information stored in verbatim fields. When and how is the best place to separate out the standardization of the data from management of the data? Apologies if my comment doesn't make sense in this context since this is primarily considering museum collection data and I'm thinking of sampling event data.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

So far in the poll, all respondents want to see this term implemented in some form:
image

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

Respondents are from a variety of different Collection Management Systems/databases:
image

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

About half of respondents already use this field in their CMS:
image

from dwc.

edwbaker avatar edwbaker commented on June 21, 2024

Having had a more thorough search it looks like GBIF have already minted a verbatimLabel term, and that it is used in the DwC-A format already by Plazi - http://plazi.org/api-tools/api/.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024

I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.

...and, having supplied a near-perfect clone of the item as an image, could we then eliminate all the existing verbatimX terms? If you want verbatimX, look at the image!

A significant part of this discussion stems from a lack of precision on all verbatimX terms & @albenson-usgs has identified this very well. DwC is an exchange standard & we're trying to shoehorn our very real need to track the provenance of data and the decisions/interpretations made. If anything, it would help the users of our data to better appreciate what goes into crafting assertions. If we take verbatimLocality as an example, it is defined as:

The original textual description of the place.

It is not qualified with, "...as written on physical media in close proximity to the physical object in question, free from any and all interpretation, translation, or transliteration." Evidently, it is not meant to represent a near-perfect snapshot. It offers no particular guidance when the original textual description of the place is physically, conceptually, or temporally removed from the physical specimen itself. What if that description of place is in Inuktitut in a field notebook held in an entirely different institution, written 2 years before the specimen was collected?

If we proceed with this, I would really like to see far more precision on the definition of verbatimLabel so it is abundantly clear what is its expected content & there are no downstream misunderstandings of how it could/should be used.

Definition: The full text from all labels affixed on or near a specimen, free from any and all interpretation, translation, or transliteration. There are no embellishments, prefixes, headers or other additions made to the text. However, lines or breakpoints between blocks of text are faithfully represented to establish context.

That said, is OCR-generated text from a label considered verbatim? What do you do about all the machine-generated artifacts, which are evidently "embellishments" of a sort? Is curation by a human an implicit requirement for verbatimLabel?

from dwc.

deepreef avatar deepreef commented on June 21, 2024

...and, having supplied a near-perfect clone of the item as an image, could we then eliminate all the existing verbatimX terms? If you want verbatimX, look at the image!

Ha! This takes me back to a conversation many years ago, when @stanblum was working at Bishop Museum, and he and I had this exact conversation about verbatimX fields. We both agreed that the true "verbatim" data would be an image of the label and/or catalogue ledger. This wasn't snark -- several of our collection managers have used handwriting to help sleuth out various data mysteries (the handwriting points to who wrote it, which points to other sources associated with that person, etc.)

At the time, digital imaging technology was such that it was a pipe-dream on a Museum budget. But now it's becoming the norm.

So... yes... +1 for shifting to images of labels & ledgers in place of ASCII/UTF-8 encoded interpretations of "original data".

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

Results from the survey are in and viewable here:
https://docs.google.com/spreadsheets/d/1eIiAgM_nJ_XpGbUCQ8f-ftO4ocIHFfklwHpa79OVdWk/edit#gid=192695490

In short, 97.7% of respondents want this field in some form (3 respondents wanted images included too, but supported the field in principle). This represents 18 different Collection Management systems, 33 different institutions, in multiple countries.

Respondents were evenly split on doing this as 1 vs. 3 fields, but considering some comments above, I think 1 field which minimizes interpretation is best; a CMS can easily merge to one field.

This field is useful for quality control, transcription workflows, artificial intelligence learning of labels (see paper linked by @matdillen above) and more. Nearly the entire community wants it, most are already using it (60% already supporting it in the survey results), and there is active discussion and consensus that it is useful.

All comments so far against can basically be summarized as:

  1. Use images instead (great, but not every label will be imaged, and OCR still needs an output field; additionally, even if you move from text to images, you may still need an original text field to store "all data").
  2. Why do this when we could just make "verbatimX" fields. Separate issue in most respects. Do we probably need that? Yes. Is that the issue in question? No.
  3. The term needs clarification. Absolutely. Let's do that in discussion. Personally, I really like @dshorthouse definition.

See additional comments in the spreadsheet, or I'm happy to add them here.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@tucotuco I'd be happy to lead/co-lead the next step in getting this DWC term adopted. Point me to what I need to do, and I'll do it.

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024

I worry that there are domain-specific issues/practices that come into play here in how verbatimLabel might be populated. Take for example this image of a botanical specimen. Do you proceed top-bottom left-right bottom-top (making a 'U' in your traversal) to fill the single field? Could that seemingly innocuous decision to collapse the semantic dimensionality/positioning of labels result in downstream misinterpretation when the image is no longer present for examination? Note also that there are explicit "fields" expressed on some of these labels like "Locality:" or "Date:" and that these may be presented in single or multiple columns or may not have any content at all. For example, there is one label here with two "Date" prefixes but only one of them has human-supplied content - do you still supply both as part of the verbatimLabel? Absence of data in a particular "field" on a label might itself be meaningful & so inclusion of "Date" without a value could be important. One candidate definition above in a bid to seek precision states that we ought to remove embellishments or prefixes, are these considered prefixes? That blank "Date" is clearly associated with the Det. whereas the other one is the collection event. But, these are only knowable by their positioning in columns. Columns of data on a single label would also need to be collapsed in verbatimLabel too, right? How? This gets messy in a real hurry...

And so, whatever the definition of verbatimLabel, it has to speak to how (or how not) to type stuff in it when faced with considerable variability in source.

http___api aucklandmuseum com_id_media_v_37518_rendering=original

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@dshorthouse I think there will be considerable variability in how data is put into this field, and that's okay. Obviously there are always going to be cultural differences in how labels are read/transcribed, but how is that different from any other DWC supported field? And depending on the discipline, they may or may not choose to use this field. Botanists, from what I understand, seem to like images of specimens much more than transcribed text, precisely for the reasons you give above. They are allowed much more space, have more information, and therefore it's easier to take a photo of their specimen labels (for example, I don't even know how you'd transcribe the formatting of the labels on the botany specimen above, which are rarely done on entomology labels, in order to conserve space). However, entomologists, who put less information on a label in general, find it much easier to just quickly transcribe a label exactly as it appears on the specimen, in the order it appears on the specimen, as close to as it appears on the specimen as possible, without any interpretation, because that is the easiest, quickest way to make sure all the information gets into a field that can later be parsed out.

So, all I'm suggesting is that we have the option to export that field for a variety of reasons. Will there be differences in how it's formatted from museum to museum? Probably. And I think we can mitigate exactly what you are describing by being explicit about best practices. For example, introducing no new formatting into a label except when needed to maintain meaning; or using [marks] to denote any interpretation if, for example, handwriting is uncertain (that's an example).

In all honesty, this field will NEVER be as "standardizable" as something like coordinates, agents, or dates can be (obviously some of those examples are under expansion/discussion). But how great would it be if you always had something to compare all those other parsed fields to? For example, the "agent strings" that get put into "determiner" or "collector" are sometimes only partially complete. Having the verbatimLabel to refer back to can help with that. Excel issue with auto-formatting dates when exported? Check the verbatimLabel. Bad GPS formatting? Check the verbatimLabel.
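To make the "check the verbatimLabel" idea concrete, here is a minimal Python sketch of one possible QC pass: flag parsed fields whose text cannot be found anywhere in the verbatim label. The record, the function names, and the choice of fields to check are all hypothetical, and a plain substring match is only a crude first filter (it cannot confirm interpreted values such as ISO dates or decimal coordinates).

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def unsupported_fields(record, check_fields, verbatim_field="verbatimLabel"):
    """Return the names of checked fields whose value is absent from the label text."""
    label = normalize(record.get(verbatim_field, ""))
    missing = []
    for name in check_fields:
        value = record.get(name, "")
        if value and normalize(value) not in label:
            missing.append(name)
    return missing

# Hypothetical record with a transcription slip in recordedBy.
record = {
    "verbatimLabel": "Peru: Cusco\n12.iii.1998\nleg. A. Smith",
    "verbatimEventDate": "12.iii.1998",
    "recordedBy": "A. Smyth",
}

flagged = unsupported_fields(record, ["verbatimEventDate", "recordedBy"])
```

Here `flagged` contains only `recordedBy`, which is the kind of record a curator would then compare against the label by eye.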

So, I think all of this discussion is great. So many people have brought up good points. Let's move to the next step because the demand is here, there is community interest, and we should push to the next step.

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024

So, I think all of this discussion is great. So many people have brought up good points. Let's move to the next step because the demand is here, there is community interest, and we should push to the next step.

I guess this boils down to a single question: Does this proposed term help bring clarity (=evidence) to how we communicate the assertions made once information is parsed into structured fields - the non-verbatimX terms? We're assuming (correctly) that there is some slippage in that normalization/transcription process & that this verbatimLabel is a natural way home for end-users to re-examine for themselves what are the implicit assertions or mishaps.

Under some (most?) circumstances, your argument is, "Yes, it does bring clarity". However, under other circumstances, we could also argue that the very act of transcribing content into verbatimLabel, however generated, introduces yet more noise because it degrades the evidence (ie columns collapsed, semantic positioning of labels collapsed). And so "home" - the absolute version of truth - may not be efficiently articulated through this vehicle, it merely points an end-user to an approximation of "home". After all, that "home" state may not in fact be correctly expressed in verbatimLabel despite near-perfect representations of the physical evidence because there are other discoveries, other streams of evidence that correctly dispute or uncover error that the label creator unknowingly introduced in the first place!

Another way of thinking about this is what might be the workflow in the other direction, not as a QA mechanism but as a route to generating content for the normalized, non-verbatimX DwC terms. Would/could we recommend that verbatimLabel be populated before all other terms as a first step in the transcription process, leaving it to others to later parse into structured fields absent an image? Are we ok with that telephone game?

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

@tucotuco I'd be happy to lead/co-lead the next step in getting this DWC term adopted. Point me to what I need to do, and I'll do it.

I have removed the label for evidence of demand. The next step is a proper change request using the template available at https://github.com/tdwg/dwc/issues/new?assignees=&labels=Term+-+add&template=new-term-template.md&title=New+Term+-+ It is best to copy that template and paste it in a comment here without adding a new issue, since all of the discussion is here. Fill in all of the requested fields as well as possible. I'll take it to the next step from there, which is to amend anything in the proposal that is needed and prepend it to the first comment so it is right at the beginning of the issue, which will facilitate public review when we go there. We'll go there when I have had a chance to fully prepare the numerous issues for consideration and make a public announcement.

from dwc.

deepreef avatar deepreef commented on June 21, 2024

I have mixed feelings about this term, but I wanted to clarify whether it is proposed for the MaterialSample class, or the Occurrence class (or both? Or Record-level terms?) @tucotuco has tagged it as class MaterialSample, but some of the comments seem to imply it would apply to Occurrence.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@deepreef @tucotuco should we provisionally include it in both in discussion and wait until the dust settles RE Occurrence vs. MaterialSample to assign it to one? I know there is a lot of discussion right now about that. I could go either way.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

from dwc.

deepreef avatar deepreef commented on June 21, 2024

It seems to me that there are legitimate examples where verbatimLabel data could apply to Occurrence instances (as in a field notebook), or MaterialSample instances (specimen labels), or Event instances (either notebooks or labels). While it may not be a normative statement which class it's assigned to, eventually it will need to be assigned to at least one of them. Perhaps this would best be organized with Record-level terms?

Also, I would be inclined to capture verbatim data (translated from the original source into UTF-8) as instances of MeasurementOrFact. That would not only allow for applying it to records in multiple different classes, but would also allow capturing more than one for any given resource, in cases where there are multiple labels with data worthy of capturing in electronic text form.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

This proposal is in danger of not being mature enough to be considered in the public review for the next Darwin Core release, which will begin 2021-05-01. To be ready for public review, please provide the full term change template (see #32 (comment)).

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

Since there was so much recent spirited effort to get this considered for review, I have done my best to provide a templated proposal at the beginning of the first comment. It must be checked. There remain three things to fill in there, and as soon as possible. The first is an additional usage comment to suggest how to represent the lines or breakpoints. The second is to provide an example. The third is to confirm if there is a mapping to ABCD.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

Thanks @chicoreus. Incorporated.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

This proposal has been labelled as controversial. If no evidence of consensus can be reached by the 30-day minimum review period, the proposal will be deferred for later consideration. If there is evidence that a consensus can be reached, the review period will be extended for an additional 30 days from the time apparent consensus is established (everyone participating in the discussion expresses their satisfaction with the proposed solution).

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

I am satisfied with the term as proposed above.

from dwc.

museumjames avatar museumjames commented on June 21, 2024

I have big concerns that this will just become a dumping ground for data. It seems to me this is a "quick win" that will come back to haunt the community.

from dwc.

afuchs1 avatar afuchs1 commented on June 21, 2024

The Australasian Herbarium Information Systems Committee (HISCOM) endorses the proposal to have a verbatimLabel term, but thinks it should be organised in Occurrence, not MaterialSample (see comments about MaterialSample on #332).
We also propose to replace the ‘MaterialSample’ in the definition, with ‘specimen’ (all lowercase).

from dwc.

debpaul avatar debpaul commented on June 21, 2024

The first is an additional usage comment to suggest how to represent the lines or breakpoints.

@tucotuco do we have to provide these? If there's OCR output from a label, it will have lines for sure, that correspond to the actual way the text appears on the label. But if it's typed in, it may just be one string. (Both are okay with me). But not sure of your requirements here. If encoding = UTF-8 does that help?

from dwc.

debpaul avatar debpaul commented on June 21, 2024

The second is to provide an example.

I'm sure we can provide many examples. @tmcelrath could you post an ENT verbatim label example so we can discuss "lines" and "breakpoints"? Thanks!

from dwc.

debpaul avatar debpaul commented on June 21, 2024

The second is to provide an example.

Here's one, from OCR output on an NYBG herbarium sheet NYBG_00499439

0 1 2 3 4 5 6 7 8 9 10		
cm	copyright reserved	
The New York
Botanical Garden


NEW YORK
BOTANICAL
GARDEN


NEW YORK BOTANICAL GARDEN
ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
EXPLORATION OF BERMUDA
NO. 355
Cymbalaria Cymbalaria (L.) Wettst
Roadside wall, The Crawl.
STEWARDSON BROWN
}COLLECTORS AUG. 31-SEPT. 20, 1905
N.L. BRITTON


NEW YORK BOTANICAL GARDEN
00499439

In the above case, you can see the image of the herbarium sheet and its label here at GBIF. And note the label text from OCR is

NEW YORK BOTANICAL GARDEN
ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
EXPLORATION OF BERMUDA
NO. 355
Cymbalaria Cymbalaria (L.) Wettst
Roadside wall, The Crawl.
STEWARDSON BROWN
}COLLECTORS AUG. 31-SEPT. 20, 1905
N.L. BRITTON

And in future scenarios you could imagine we'd know (have coordinates for) where in the image, we would find any labels.

@edwbaker I find your suggestion about Audubon Core and "transcription of data" interesting. Note the example I've given here is not strictly a transcription though. It's OCR output + human to transcribe the handwritten parts.

from dwc.

debpaul avatar debpaul commented on June 21, 2024

I have big concerns that this will just become a dumping ground for data. It seems to me this is a "quick win" that will come back to haunt the community.

@museumjames an example of what you mean by "haunt" would be helpful? Thanks! There are some very powerful uses for the verbatim text -- and sadly at the moment, it's often not published, but stuck inside the CMS.

from dwc.

debpaul avatar debpaul commented on June 21, 2024

To all (is there a way to "@channel" for a given ticket)?

To all, please see the thread on Twitter from Ben Brumfield (@benwbrum on GitHub and Twitter). His expertise spans the #digitalhumanities and he has experience with biocollections data too (field notebooks, diaries, labels, ...)

  • quote " As a result, we have elaborate standards for what should go into our "verbatimTranscript" equivalent, ranging from Record Type in the 19th century to @TEIconsortium in the 21st."

and

  • quote "access to images has mostly moved ... some practice away from this attempt at lossless transcription, with projects moving from diplomatic transcription to semi-diplomatic, and scholarly editions (and their funders) pushing for more early-access publication of transcripts that represent a first or second draft of a text, rather than waiting on the final edition before researchers get access"

from dwc.

debpaul avatar debpaul commented on June 21, 2024

Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)?

@matdillen I'm guessing it would include both. What do you mean by "encode ..."?

How to differentiate between labels and their relative location? I don't think $ and are reliable enough, in particular if OCR outputs are in scope.

@matdillen do you mean the location of the label (on the specimen, in the jar)? I don't know what you mean by "$ ..."

I think the main uses supported by these data are a) searchability and discoverability, b) and opportunities for many to contribute to further atomization, c) visualization of clusters (tokens) found in the raw textual data across huge datasets.

from dwc.

debpaul avatar debpaul commented on June 21, 2024

In our digitization efforts, what we strive for and refer to as "verbatim" is a character-by-character-representation of the original source in some other format, including characters for line breaks and other structural features of the original text, preserving abbreviations, misspellings and any other idiosyncratic elements of the original.

@cboelling could you provide an example here? As we did (i.e. @dshorthouse and myself)? It would be great to see your "label/s" and the "verbatim" output as you capture it.

from dwc.

timrobertson100 avatar timrobertson100 commented on June 21, 2024

It seems that there is a considerable wish for this, and certainly a full-text index in e.g. GBIF.org would aid discovery.

The concerns seem to be around the expectation of structure (e.g. micro-encoding), accuracy (e.g. OCR), and the notion of what "verbatim" may mean.

Given the clear use case and demand, and in the interest of trying to achieve consensus in this review round, would a change in the description to clarify expectations be more readily accepted? For example:

verbatimLabel:
The best effort to capture the full text from all labels affixed on or near a MaterialSample, free from any and all interpretation, translation, or transliteration. Consumers of this field should consider that errors may be present, originating from e.g. damage, illegible handwriting, or automated approaches.

Alternatives to this could be to expand the scope to Occurrence data with additional wordsmithing or even to rename the term itself (e.g. originalWrittenText or so). Renaming may result in it not completing in this review period unless it got immediate strong support. Alternatives would still be with the intention to indicate it really is intended to be a dumping ground for those who need it.

from dwc.

deepreef avatar deepreef commented on June 21, 2024

A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label.

This is comparable to the rough definition I follow in similar contexts, which boils down to (in plain-speak), "the closest representation of the patterns of ink printed on paper that can be rendered using UTF-8 encoded characters" (or something like that).

from dwc.

debpaul avatar debpaul commented on June 21, 2024

"the closest representation of the patterns of ink printed on paper that can be rendered using UTF-8 encoded characters" (or something like that).

@deepreef @timrobertson100 do your suggestions for definition / usage guidance allow for the examples here from @dshorthouse and me?

Note well that in my example, it's a "gold standard" transcription of OCR output, AND the handwritten terms on the label have been typed. So it is "verbatim", but not as captured strictly from the OCR. Rather, we started with OCR, then improved it to capture the handwritten bits.

from dwc.

debpaul avatar debpaul commented on June 21, 2024

I'm not really a stakeholder here, but have been watching this conversation since @debpaul pointed me at it.

@benwbrum yes you most certainly are a stakeholder. This is our open, public review of Darwin Core terms (new and ones where folks are suggesting changes). You do help communities in the GLAM world to mobilize their data, and you've also worked directly with our stakeholders in doing this work. Thank you very much for engaging in this conversation!

from dwc.

debpaul avatar debpaul commented on June 21, 2024

In addition to the question of embedded markup protocols pointed out by @benwbrum (which itself would require some mechanism of escaping characters like "<" and ">" to indicate literal values of such), it would also need to be clear that "verbatim" excludes "interpreted" values (corrections of spellings, etc.)

@deepreef does this matter? In the example I provided, we didn't "correct" per se, but we did interpret the handwriting (which could of course be wrong).
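To make the escaping point concrete, here is a minimal sketch, assuming the field were allowed to carry HTML or XML markup, so that literal "<" and ">" from a label would have to be escaped. The label text is hypothetical; `html.escape` and `html.unescape` are from the Python standard library.

```python
from html import escape, unescape

# Hypothetical label text containing characters that clash with markup.
label_text = 'Det. <illegible> & "J. Smith" 1905'

# Escape the literal characters so they cannot be mistaken for tags.
escaped = escape(label_text)
print(escaped)

# A consumer who knows the field is escaped can recover the literal text.
restored = unescape(escaped)
```

The round trip only works if the consumer knows whether escaping was applied, which is why the expectation would need to be stated explicitly.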

from dwc.

debpaul avatar debpaul commented on June 21, 2024

I suspect that it's really important to be explicit about whether characters not part of the label itself are allowed -- in other words, are the contents of this field allowed to contain XML, HTML, ALTO, JSON or are they required to be plaintext only, with any formatting limited to whitespace?

For those newer to this, a small side note. I don't know about you, but I understand (am familiar with) XML, HTML, and JSON. But ALTO was a new acronym for me. So I did a quick Google search and am now quoting from Wikipedia:

ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project called METAe. The standard was initially developed for the description of text OCR and layout information of pages for digitized material.

(Hope that helps).

from dwc.

debpaul avatar debpaul commented on June 21, 2024

@matdillen wrote:

For some use cases, looking at the image is sufficient or even optimal. For others, it is not feasible at all.

Yes! Not all folks take images of labels, preferring to capture the text and skip the more involved and expensive imaging and image mgmt route (see data capture for entomology collections done from looking at the label directly). So @deepreef @dshorthouse some folks will not have images. I see great value in having the images (for many reasons), but this will not always be the case.

As to purpose, at least in my mind, and as @timrobertson100 wrote:

... certainly a full-text index in e.g. GBIF.org would aid discovery.

💯both for collection managers (to aid discovery) but also anyone else rooting around in the pile. And not to mention text-mining and data clustering and community-based public participation in our efforts ...

from dwc.

deepreef avatar deepreef commented on June 21, 2024

So @deepreef @dshorthouse some folks will not have images. I see great value in having the images (for many reasons), but this will not always be the case.

My point was not so much that images are necessarily better (or even feasible), but rather represent a different level of "verbatim". I think the key issue, that @tucotuco captured very eloquently, is that there is a spectrum of possible meanings of "verbatim", and there are consequences (both good and bad) of defining it more narrowly or more broadly. I agree with @tucotuco that in this context it would need to be defined as broadly as possible -- which is good from the perspective of making it easy for content providers to provide it; but maybe less good for content consumers to consume it.

from dwc.

edwbaker avatar edwbaker commented on June 21, 2024

I think I am becoming more and more confused about what the point of this term is. I am imagining some bad OCR, with or without markup, being used to train a machine learning model. This model could then run on labels from other specimens, and possibly populate the same field.

This is possibly over-pessimistic.

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024

which is good from the perspective of making it easy for content providers to provide it; but maybe less good for content consumers to consume it.

There are certainly nuisances to contend with if you're a consumer and the food you receive is of varying types - text strings in some instances, structured XML in others. But, as @cboelling says, you best look at the food before you put it in your mouth.

What I worry more about is if providers elect to collapse the spatial position of text on labels ("just share the raw OCR text"), resulting in significant downstream misinterpretation absent the image(s). When does seemingly "good enough" actually do damage? That potential for misinterpretation is unevenly distributed because it is likely to be dependent on taxon-specific curatorial practices. Should a consumer have to first ask if verbatimLabel originates from an herbarium vs. an entomological collection? A consumer would benefit a great deal if the former were provided as ALTO-XML whereas raw text from an insect label is usually adequate; there can be many labels on the former, few on the latter. The opposite scenario presents different sets of decisions for the consumer. My point here is that there are likely to be varying best practices implicit in the use of this term, dependent on domain or type of object. This has a number of spill-over effects including the challenge it presents to users new to TDWG standards, the collection management system if it is meant to support all collection objects, as well as for the aggregator that may try to make sense of how to consistently present it either in a UI or in an API.

from dwc.

dshorthouse avatar dshorthouse commented on June 21, 2024
  • Definition of the term (normative): The full text from all labels affixed on or near a MaterialSample, free from any and all interpretation, translation, or transliteration.

As @tucotuco has mentioned above, it appears the definition needs a rethink. Much of this came from me in response to the original definition when I was overly concerned by the prospect of excessively messy OCR vs transcription by humans vs what is a "label" (i.e. would/could OCR of a field notebook be considered a "label" when there isn't anything particularly useful affixed to the specimen proper) & so am sorry about being the cause of delay. Consensus here seems to be, so what? All the above are acceptable.

verbatimLabel = def. A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label.

@cboelling's definition above is much cleaner, but I wonder if it can be tidied-up even more so we can safely reset the clock and keep this alive. Perhaps "serialized", "encoding" and "literal" can be replaced as well as the "i.e.".

I'm in favour of keeping this in MaterialSample if only because it does actually make it very explicit that this is precisely about etchings on paper made by humans (or their surrogates) meant to aid communication with other humans about physical objects, OCR/ML notwithstanding. The bespoke GBIF term in the Types and Specimens extension is likewise in the same vein.

@tucotuco - is it up to Hannu Saarenmaa & @tmcelrath as champion to propose a new definition, which must now be resubmitted within the next 3 days? Does that take the form of a comment in this thread or a new issue?

from dwc.

cboelling avatar cboelling commented on June 21, 2024

verbatimLabel = def. A serialized encoding intended to represent the literal, i.e. character by character, textual content of a label.

@cboelling's definition above is much cleaner, but I wonder if it can be tidied-up even more so we can safely reset the clock and keep this alive. Perhaps "serialized", "encoding" and "literal" can be replaced as well as the "i.e.".

@dshorthouse, this ad hoc definition surely could be tidied up. Do you have specific suggestions?

I just think the definition of the term verbatimLabel (in contrast to usage notes or examples) should be strictly about what makes its instances verbatim representations of a given source document (i.e. label). I would favor the definition being agnostic about the process through which one has arrived at that representation and also agnostic about the specific format in which that representation is provided (apart from it being "digital-textual", hence the, perhaps a bit clumsy, "serialized encoding" part of the definition offered earlier).
If I understand @timrobertson100 correctly, this would be consistent with how other terms in DwC are defined. I do not share the sentiment that such a course of action makes the term a "dumping ground". In many cases it is favorable to intentionally separate concerns that regard specific formats from what the (abstract) information value of an item is.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

@tucotuco - is it up to Hannu Saarenmaa & @tmcelrath as champion to propose a new definition, which must now be resubmitted within the next 3 days? Does that take the form of a comment in this thread or a new issue?

Anyone can propose an updated definition (I do it all the time in trying to capture consensus). The important thing is consensus, and to that end, clarity. Therefore I suggest that a revised proposal be in the form of a comment in this thread that includes a copy of the full proposal template from the first comment with the full new revised proposal interjected so that everyone can compare the two easily in one place - something like this (completely fabricated):

Proposed attributes of the new term:

  • Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel
  • Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample
  • Originally proposed definition of the term (normative): The full text from all labels affixed on or near a MaterialSample, free from any and all interpretation, translation, or transliteration.
  • Revised proposed definition of the term (normative): The full text of labels associated with a specimen.
  • Originally proposed usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text except to designate lines or breakpoints between blocks of text to establish context that could be verified by seeing the original labels or images of them.
  • Revised proposed usage comments (recommendations regarding content, etc., not normative): The content of this term should capture the original text present on all labels as closely as possible. Any uncertainties, interpretations, encodings or other methods used in the representation of the original text should be explained in occurrenceRemarks.
  • Originally proposed examples (not normative):
  • Revised proposed examples (not normative):
0 1 2 3 4 5 6 7 8 9 10		
cm	copyright reserved	
The New York
Botanical Garden


NEW YORK
BOTANICAL
GARDEN


NEW YORK BOTANICAL GARDEN
ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
EXPLORATION OF BERMUDA
NO. 355
Cymbalaria Cymbalaria (L.) Wettst
Roadside wall, The Crawl.
STEWARDSON BROWN
}COLLECTORS AUG. 31-SEPT. 20, 1905
N.L. BRITTON


NEW YORK BOTANICAL GARDEN
00499439

with verbatimLabel derived from unadulterated OCR output in occurrenceRemarks

  • Refines (identifier of the broader term this term refines; normative): None
  • Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None
  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText

from dwc.

EstebanMH-SiB avatar EstebanMH-SiB commented on June 21, 2024

We agree with the new term, but are a little bit worried about encoding problems that can arise from using this element with literal values.

Publishers can use several special characters, such as ", ', line breaks, tabs, etc., that could result in indexing problems. We would like to know whether these possible issues have been considered.

We made this comment on behalf of @SiBColombia
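For the special-character concern, here is a minimal sketch of how quotes, tabs and line breaks can survive a round trip through delimited text, using Python's standard `csv` module. The record identifier and label text are hypothetical, and this says nothing about how any particular aggregator actually indexes the field.

```python
import csv
import io

# Hypothetical verbatim text with a line break, double quotes, a tab and an apostrophe.
verbatim_label = 'NO. 355\n"The Crawl"\tRoadside wall, it\'s damp'

# Write one record; the writer quotes the field and doubles embedded quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["occurrenceID", "verbatimLabel"])
writer.writerow(["urn:example:1", verbatim_label])

# Read it back with a CSV-aware parser rather than splitting on commas or newlines.
buffer.seek(0)
rows = list(csv.reader(buffer))
round_tripped = rows[1][1]
```

The text comes back unchanged only when both producer and consumer use a CSV-aware reader and writer; naive line-by-line splitting would break the record at the embedded line break.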

from dwc.

debpaul avatar debpaul commented on June 21, 2024

@timrobertson100 wrote:

I worry if novice modelers become a reason to block progress. Concepts like identification and presence/absence at a location are far more likely to cause problems than text from a label in a concept that's adequately explained as potentially having errors.

Edited to add: More complex cases capturing the output and workflows around OCR, fields with markup and the type of markup a consumer should expect would be well suited to an extension in a DwC-A (e.g. capturing the equipment used, software, version, raw output, cleaned output etc). DwC-A is of course only one use of DwC, but is a common one.

Thanks Tim, this is lovely and concise. We can have a) a verbatimLabel, and b) a way to share richer information along with the metadata necessary to facilitate its use.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

This proposal has been labeled as 'Controversial'. It will remain open for public review in pursuit of a consensus solution for another 30 days, but will not be included in the release to be prepared from the public review of 2021-05-01/2021-05-31.

from dwc.

tucotuco avatar tucotuco commented on June 21, 2024

Public review of this issue has now concluded with objections to the proposed change. The issue will remain open for discussion and potential resolution.

from dwc.

deepreef avatar deepreef commented on June 21, 2024

@deepreef the only problem I see with "...content of a label affixed on, near, or explicitly associated with a material sample..." is that it runs into the exact same issue that you just described.

Agreed! I much prefer the suggested revised wording from @cboelling (both for the definition and for the usage comments).

@cboelling the only problem with removing the statement is that others have previously requested that it be added to make sure that this term is not used for other pseudo-transcriptions of things that are not explicitly associated with some collection object preserved in a museum. That's the reason it's in there and I think it needs to stay in for that reason.

But what, then, would be the scope of "things" from which labels are transcribed and shared using this DwC term? Is there harm in maintaining it as broad in scope? If a broader/more liberal scope is preferred, then perhaps this term would best be organized among the Record-level terms? Or, conversely, if the idea is to restrict its usage to instances of MaterialSample, would that be sufficiently evident from its organization within that class?

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@deepreef There have been multiple users pointing out that they "don't want this term being used as not intended". As intended, it is supposed to be used for collection objects aka dead things in a museum with labels. That is an easily understood and defined scope that makes sense in 99% of use cases. Keeping the definition as ".... content of a label affixed on, near, or explicitly associated with a preserved specimen, fossil specimen, or material sample..." keeps it "narrow" and gets the term out there. Making it broader opens the term up to more interpretation, more debate, and makes the term take longer to get approved. I don't think we need to change it. Let's not open up any more cans of worms RE where it belongs.

The debate of MaterialSample vs. Occurrence does not need to happen here. Whatever is decided there will obviously affect this term, but the term makes sense as defined.

from dwc.

debpaul avatar debpaul commented on June 21, 2024

@tmcelrath @cboelling does the above look good to you now?

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

Yes.

from dwc.

cboelling avatar cboelling commented on June 21, 2024

@tmcelrath @cboelling does the above look good to you now?

As I proposed earlier, I recommend striking the "any and all" from the definition. It doesn't add anything.
I still believe (see the same comment) that the "affixed on, near, or explicitly associated with..." part followed by certain types of entities should go to the usage comments rather than the definition - the current version builds unnecessary dependencies and may unduly restrict the domain of dwc:verbatimLabel. The current discussion around a new top level term for physical entities seems to vindicate this view.

If these improvements cannot be incorporated, the term, as proposed, will probably do the job.

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

I support striking "any and all" from the definition. @cboelling is right, it doesn't add anything.

I do not agree with the second part, i.e. that the "affixed on, near, or explicitly associated with..." part followed by certain types should be moved to usage comments, because that text was explicitly added to address certain other comments in the above discussion, namely to clarify and specify usage of this term. Since @cboelling isn't too strongly vocal about it being removed, I suggest that we leave it as is.

from dwc.

cboelling avatar cboelling commented on June 21, 2024

@tmcelrath:

I do not agree with the second part, i.e. that the "affixed on, near, or explicitly associated with..." part followed by certain types should be moved to usage comments, because that text was explicitly added to address certain other comments in the above discussion, namely to clarify and specify usage of this term.

If the purpose of this part is to clarify and specify usage of this term wouldn't it be exactly right to place it in the usage comments?

from dwc.

tmcelrath avatar tmcelrath commented on June 21, 2024

@cboelling we've been over this:
image

But if you can propose a different verbiage which does not run into the issues above, I'm happy to look at it.

from dwc.

cboelling avatar cboelling commented on June 21, 2024

@tmcelrath
With my last two posts, I have presented all arguments from my side. Please decide and go ahead.

from dwc.
