umr4nlp / umr-guidelines

umr-guidelines's Introduction

UMR Guidelines

See guidelines.md

umr-guidelines's People

banana1007 jbonn kansusu alicia-tu somelinguist ufal

umr-guidelines's Issues

Events as named entities

The taxonomy of named entities includes a class called event. I agree that some events have names but this is an interesting (and potentially confusing) twist because "event" is used as a term in the guidelines, and it is said that all processes are events, but states and entities are events only if they are packaged in predication.

I suppose that "event" in the taxonomy of named entities is a different "event", i.e., we can use it as an abstract concept even if it is packaged as reference or modification. If this is a correct assumption, it would be helpful to say it explicitly in the guidelines.

Now my main question is how do we use this concept in annotation. Even if it has a name, it is probably still an event. At least if it is a process, which named events typically are. So, let's say, the sentence is:

The Gallic Wars culminated in the decisive Battle of Alesia in 52 BC.

Table 5 provides war as a type of the class event, so we could do this:

(w/ war
    :name (n/ name :op1 "Gallic Wars")
    :wiki "Q202161")

But we probably also have an eventive concept "war-01", right? This is a non-abstract concept, anchored in a valency lexicon, so if we are annotating Czech, it will be "válka-01", in Spanish "guerra-01", in Latin "bellum-01" etc. If we use one of these concepts, we will lose the link to the taxonomy of named entities. If we don't use them, we will lose the eventive nature and parallelism to expressions like the war of the Romans against the Gauls:

(w/ war-01
    :ARG0 (p/ person :mod (c/ country :name (n/ name :op1 "Roman Empire")) :ref-number Plural)
    :ARG1 (p2/ person :mod (r/ region :name (n2/ name :op1 "Gaul")) :ref-number Plural))

So what is the expected annotation in these cases?

Cultural artifacts, especially publications

In Table 5, the class cultural-artifact contains (among others) the types literature and publication, the latter with subtypes book, newspaper, magazine, journal. What is the difference between literature and publication? Can't a book be literature?

I also don't see any type for movie titles. On the other hand, theater plays probably can be annotated as show, so is a movie a show, too?

Biomedical named entities

I suppose that the classes animal and plant are not included in biomedical-entity because they are only used in cases when we either need an abstract concept (for a pronoun), or the animal/plant has a name as an individual, such as Archie. Right?

Now, as for the types under biomedical-entity: At least some of them seem problematic to me, namely taxon, species, and disease.

In Czech, species usually have two-word names, a common noun followed by an adjective. (NB: This word order is specific to names of species, otherwise the adjective would precede the modified noun.) The first word (the common noun) denotes the genus, i.e., a taxon. Do we really want to treat it as a named entity? It can be an unusual (from the Czech perspective) word, such as ptakopysk "platypus" or araukárie "araucaria", but it will also include all common animal and plant names, such as kočka "cat" or dub "oak". What is it that makes these words named entities?

In fact, kočka “cat” is an animal with a particular set of characteristics, just like dub “oak” is a particular type (hyponym) of tree, and hrad “castle” is a particular type of building. But the first two words are biological genera, hence taxa, while hrad has no special status in the UMR taxonomy. (In Czech grammar, all three are common nouns.) There is no reason why kočka and dub should be named entities. And by extension, there is little reason why species should be named entities, for example kočka domácí “cat (Felis catus)”, or dub letní “pedunculate oak (Quercus robur)”, or why other taxa should, for example šelmy “beasts of prey, Carnivora”, savci “mammals”, or živočichové “animals, Animalia”. It is true that some species have names that are less common than others and were invented by scholars who discovered and described the species, rather than being part of the language since ancient times. But it would be neither tractable nor helpful to attempt to distinguish them. Perhaps the only exception is the scientific names in Latin, provided that the language of the annotated text is not Latin.

Similarly, diseases may have scientific names but many common diseases are just common nouns or expressions (angína “tonsillitis”, chřipka “flu”, mor “plague”, neštovice “chickenpox”) and it is not clear why they should be handled differently from other common nouns. Moreover, diseases are states rather than entities, aren't they?

How to recognize event nominals

Part 3-1-1-2 of the guidelines gives driver as an example of a noun that is derived from a verb but does not refer to a process. It does not show the UMR annotation of the example sentence with the driver (2b).

Part 3-2-1-1-1 has an annotated example with teacher and while it does not say that the teacher directly refers to the process of teaching, the annotation still uses the process concept teach-01 to define the teacher, instead of using an entity concept teacher:

(p / person
    :ref-number Singular
    :ARG0-of (t / teach-01))

Would the driver be annotated analogously as "a person who drives"? What are the criteria to decide when to use two concepts (person :ARGX-of (process)) and when to use a single entity concept? For example, businessman in Part 1 is annotated just with a lexical entity concept: (b/ businessman). Why is he not annotated as (p/ person :ARG0-of (d/ do-business-01))? Is it just because you don't have do-business-01 in your valency lexicon? Would the process appear in the annotation if the noun were derived from a verb that exists in the valency lexicon? (NB: The Czech equivalents of businessman are obchodník and podnikatel; both are derived from verbs, obchodovat and podnikat, respectively.) What about nouns denoting participants of processes other than actors? For example, Czech jídlo "food" is derived from jíst "to eat"; should it be annotated as "the thing that is eaten", i.e., (t/ thing :ARG1-of (j/ jíst-01))? And if so, should the English word food be annotated as (t/ thing :ARG1-of (e/ eat-01)) in order to preserve cross-linguistic parallelism?

Questions about Epistemic versus Deontic Modality and Modality Negation

Hello. I had some confusion about the modality annotation instructions, and I just want to write down my thoughts while I have them fresh on my mind. Please don't treat this as an urgent issue.

My first question is about epistemic versus deontic modality. I was surprised that UMR does not make a distinction between epistemic and deontic modality in the annotation. Based on 4-3-2 in the guidelines, it looks like Full means "certain", Partial means "probable", and Neutral means "possible" in the case of epistemic modals, but they mean something else when dealing with deontic modals. Have I understood this correctly? Why not use different terms for epistemic versus deontic modality? I think this distinction is really important for downstream applications, which will want to know the likelihood of something happening in reality. Perhaps the labels could be Certain, Probable, Possible for epistemic modals (same as FactBank) and Required, Recommended, Suggested or something similar for deontic modals.

I have another question about negation. You represent negation in your annotations of modality as FullNeg, PrtNeg, and NeutNeg. Would it make sense to separate the negation as a :polarity - attribute? That would be more consistent with AMR notation elsewhere and I think it would make negation more human and computer-readable as well.
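To make the negation proposal concrete, here is a minimal sketch of how the fused values could be split into a strength plus a separate polarity. The mapping table and the function names are hypothetical; NeutAff is assumed by analogy with the values named in this issue, and :modal-strength is only a placeholder attribute name, not something the guidelines define.

```python
# Hypothetical mapping from fused modal values to (strength, polarity).
# "NeutAff" is assumed by analogy with the other values.
FUSED = {
    "FullAff": ("full", "+"),
    "PrtAff": ("partial", "+"),
    "NeutAff": ("neutral", "+"),
    "FullNeg": ("full", "-"),
    "PrtNeg": ("partial", "-"),
    "NeutNeg": ("neutral", "-"),
}


def split_modal_value(value: str) -> tuple[str, str]:
    """Return (strength, polarity) for a fused modal value."""
    return FUSED[value]


def render(value: str) -> str:
    """Render the split form; ':polarity -' is added only for negatives."""
    strength, polarity = split_modal_value(value)
    attrs = f":modal-strength {strength}"  # placeholder attribute name
    if polarity == "-":
        attrs += " :polarity -"
    return attrs
```

With this split, a consumer that only cares about negation can look for :polarity - without knowing the full set of modal labels.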

Thank you for your time and feedback.

What is `variable` in the taxonomy of named entities?

It would be helpful to know what the class variable is supposed to mean in Table 5. Are there any examples of how it would be used in annotation?

Attributes vs. relations

The guidelines should be more consistent in using the terms "attribute" vs. "relation". Take :quant, for example. In Part 2-1 (which introduces the term "UMR attribute"), :quant is clearly introduced as an attribute. But later, in Part 2-2-5, there is the phrase "the :quant relation", despite the fact that it has a numeric value (and not a child node in the UMR graph structure), and despite the fact that Part 2-2-5 is titled Attributes.

:quot vs. :quote

The guidelines say that :quot should be used in modality annotation to refer from the reported event back to the reporting event. In a few examples elsewhere in the guidelines it is uppercase :QUOT, which is probably a mistake. In the Google spreadsheet with the quick reference for annotators, the Roles tab has :quot among Modal roles.

However, UMR 1.0 data always uses :quote (in all six languages) and never :quot.

What should be fixed: the documentation, or the data?

Alignment using string position instead of word number

If some annotations use a different word segmentation (especially in Chinese, and sometimes for English multi-word expressions) and word numbers serve as the alignment markers, then a single token that is split differently shifts the numbering, and all subsequent words become misaligned. Character positions in the sentence string would not have this problem.
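The following sketch illustrates the problem on a made-up English sentence. The helper char_spans and both tokenizations are assumptions for illustration only; they are not part of any UMR tooling.

```python
def char_spans(text: str, tokens: list[str]) -> list[tuple[int, int]]:
    """Locate each token in text, left to right, as (start, end) offsets."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


text = "He lives in New York now"
tok_a = ["He", "lives", "in", "New", "York", "now"]  # one token per word
tok_b = ["He", "lives", "in", "New York", "now"]     # multi-word expression kept whole

# 1-based word numbers for "now" differ between the two tokenizations.
idx_a = tok_a.index("now") + 1
idx_b = tok_b.index("now") + 1

# Character spans for "now" are identical under both tokenizations.
span_a = char_spans(text, tok_a)[-1]
span_b = char_spans(text, tok_b)[-1]
```

An alignment stored as a character span survives re-tokenization, whereas one stored as a word number has to be recomputed for every token after the point of disagreement.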

File format specification?

I am wondering if there is any formal specification that each UMR file is supposed to follow. The guidelines in this repository give some idea (even if incomplete) about the sentence level graphs and document level graphs. But they do not say that there are four annotation blocks for each sentence (tokens, sentence graph, alignment, document graph), each block followed by an empty line, the last one by two empty lines etc.

I am writing a validation script for UMR and it would probably be easier to follow a specification (if it exists) than to guess from the data files what is allowed and what is not.

BTW, the data in UMR release 1.0 seem to follow different conventions in different languages, which also differ from what the guidelines say, and occasionally they have issues that are clear bugs regardless of the specification (such as non-matching brackets).
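As an illustration of the kind of check that does not depend on a specification, here is a minimal sketch of a bracket-matching test for a single sentence graph. The function name is hypothetical, and the only assumption about the format is that parentheses inside double-quoted strings should not be counted.

```python
def bracket_errors(block: str) -> list[str]:
    """Report unbalanced parentheses, ignoring text inside double quotes."""
    errors, depth, in_string = [], 0, False
    for lineno, line in enumerate(block.splitlines(), start=1):
        for ch in line:
            if ch == '"':
                in_string = not in_string
            elif in_string:
                continue
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    errors.append(f"line {lineno}: unmatched ')'")
                    depth = 0  # resynchronize so later errors are still reported
    if depth > 0:
        errors.append(f"{depth} unclosed '('")
    return errors
```

Checks on block order or empty-line conventions would need the formal specification asked for above, because the released data do not agree with each other.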

Sections 3-1-4 through 4-1-2 of the guidelines seem to be hidden because of improperly closed comment

There seems to be an improperly closed comment at line 1990 in guidelines.md that causes sections 3-1-4 through 4-1-2 to be hidden when the document is rendered on GitHub:

umr-guidelines/guidelines.md

Lines 1990 to 1993 in 42fbba0

 <!-- we need a list of such predicates> 

 [Back to Table of Contents](#toc) 

 #### Part 3-1-4. Word senses

The comment isn't actually closed until line 6559, and that closing marker seems to be meant for the comment at lines 6551-6559:

umr-guidelines/guidelines.md

Lines 6551 to 6559 in 42fbba0

 <!--- 

 - Script / subevent? 

 For now UMR does not annotate cases where one event is a subevent of another event, or when an event is part of a "script". These will be annotated under temporal relations. 

 ``` 

  Reports suggest that the group of nine were having a picnic on Friday when they were abducted in the Saada province of Yemen. A spokesman for the Yemeni Embassy said "The foreigners ventured outside the city of Saada without the required police escorts due to the heightened security situation in the area. 

 ``` 

  --->

Are these sections meant to be hidden, or is this unintentional, as I suspect?
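A check like the following could catch this class of problem automatically. It is only a sketch under a simple assumption: a comment that contains another comment opener before its own closing marker is suspicious. The function name is hypothetical.

```python
def suspicious_comments(markdown: str) -> list[tuple[int, int]]:
    """Return (open_line, reopen_line) pairs where '<!--' appears inside
    a comment that has not yet been closed by '-->'."""
    found, open_line = [], None
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        pos = 0
        while pos < len(line):
            if open_line is None:
                start = line.find("<!--", pos)
                if start == -1:
                    break
                open_line, pos = lineno, start + 4
            else:
                close = line.find("-->", pos)
                reopen = line.find("<!--", pos)
                if reopen != -1 and (close == -1 or reopen < close):
                    # A second opener before any closer: report it.
                    found.append((open_line, lineno))
                    pos = reopen + 4
                elif close != -1:
                    open_line, pos = None, close + 3
                else:
                    break
    return found
```

Run on guidelines.md, a check of this kind would be expected to flag the comment opened at line 1990, since the next opener at line 6551 appears before any closing marker.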

Representation of pronoun "he" should include male gender

The representation for the pronoun "he" is missing the male gender feature:

(p/person
:ref-person 3rd
:ref-number Singular)

Note that the representation of gender is crucial for co-reference determination.

Nationality vs. ethnic-group vs. regional-group

Why is nationality not a type of social-group, just like ethnic-group and regional-group?

How is it supposed to be annotated? The example in part 3-1-2 shows nationality for American in Edmond Pope is an American businessman. But there are questions that neither the example nor the surrounding guidelines answer:

- The name concept shows "America" as the name, not "American". What are the rules for obtaining the canonical form of the name (as opposed to the actual string in the sentence)? (I'm asking because "America" is not the lemma of "American", so some morphological derivation is obviously undone as well.)
- In the example, the nationality node is a modifier of the businessman node. What happens if there is only the nationality, without occupation, as in Edmond Pope is an American? Are we supposed to use the abstract concept person, and the nationality is still just a modifier? How do we annotate Americans celebrate their Independence Day?
- Do we want to distinguish nationality from any other affiliation with a country? How do we annotate Edmond Pope is an American resident (meaning he lives there but he does not necessarily have a US passport and his nationality is not necessarily American)? And what about Chevrolet is an American car, or Philadelphia is an American city?

Wikification

Part 3-2-2-4 of the guidelines says that "Named entities can also take a :wiki relation, whose daughter is the title of the Wikipedia page corresponding to the entity in question." First, I think that it is wrong to refer to :wiki as a "relation". I think it is an "attribute" because it has a string value and not a concept node as daughter (the value is not in brackets). (See also #15.) Is that correct?

Now, my main concern is about the value being "the title of the Wikipedia page". This has a number of drawbacks:

- Non-unique source and form. For example, the title of the article is Ramstein Air Base with spaces, as seen in the large font on the top of the article. But the annotation examples in the guidelines seem to prefer titles with underscores, probably taken from the URL after the last slash: :wiki "Ramstein_Air_Base" (see here).
  - This is even more confusing if the title contains non-English characters, as in Český Krumlov. There are at least three possible strings; the last one is taken from the actual URL but I would argue that it is the worst option of them all: 1. "Český Krumlov"; 2. "Český_Krumlov"; 3. "%C4%8Cesk%C3%BD_Krumlov".
- Unstable. A Wikipedia article may be renamed. The old name then becomes a redirect (unless the editors establish that it was really an error and the old name should be used for something else), so in many cases the outdated :wiki value can still be used to reach the correct article. But it is quite unfortunate that it cannot be treated as a unique identifier within the UMR annotated data without resolving the updates of Wikipedia. For example, Astana was named "Astana" before 2019, then it was renamed to "Nur-Sultan", but in September 2022 it was renamed back to "Astana". (And it had other names in history, but those were used before the Wikipedia article was written.)
- Language-specific. The examples in the guidelines seem to automatically assume that we are talking about the English Wikipedia. But there are 200+ language editions of Wikipedia. The English and Czech articles "Český Krumlov" describe the same entity and are thus equivalent. Fortunately the title is the same and we do not use the full URL, so seemingly it does not matter which Wikipedia we are talking about. But it does! The Czech equivalent of "Ramstein Air Base" is "Letecká základna Ramstein" (this is the title of the article in the Czech Wikipedia). One might suggest that when annotating English text, we will link to the English Wikipedia, when annotating Czech text, to the Czech Wikipedia, etc. But what about languages which do not have their own Wikipedia (Arapaho?)? And, more importantly, the article may not exist in the preferred language now, but it may exist in one of the other languages. For example, Jan Hajič currently has a Wikipedia article in Czech and Basque, but not in English :-)

For all these reasons, I think we should use the Wikidata identifier instead. For example, the Ramstein Air Base would have :wiki "Q161348" (see here). It can be easily resolved to the corresponding article in each Wikipedia that has it. And it may have additional machine-readable information, such as the relation instance of, pointing to the Wikidata node for "air base".
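To show how the three spellings of a title relate, here is a small sketch using Python's standard urllib.parse module. The helper normalize_title is hypothetical; it simply decodes percent-escapes and treats underscores as spaces.

```python
from urllib.parse import quote, unquote


def normalize_title(value: str) -> str:
    """Map any of the three spellings to the spaced page title."""
    return unquote(value).replace("_", " ")


forms = ["Český Krumlov", "Český_Krumlov", "%C4%8Cesk%C3%BD_Krumlov"]

# All three spellings collapse to a single title after normalization.
titles = {normalize_title(form) for form in forms}

# Going the other way reproduces the percent-encoded URL form.
url_form = quote("Český Krumlov".replace(" ", "_"))
```

Any tool that compares :wiki values would need a normalization step like this, whereas a Wikidata identifier such as "Q161348" has only one spelling.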

Nominal Predication

Pivoting off Bill's chapter and the UNM proposal, I wanted to quickly add some notes on nominal predications, with a focus on how the different strategies outlined by UNM actually get expressed in the semantic graphs.

I want to make explicit the three "tools in our semantic graph toolbox" in terms of mapping language-specific strategies into the UMRs themselves. For "John is a doctor", we have three choices:

1. Relationship as the top (~semantic head): (b / be-instance-of-91 :arg1 (p / person ... "John") :arg2 (d / doctor))
2. Object/adjective as the top, treated as a predicate: (d / doctor-02 :theme (p / person ... "John"))
3. Object/adjective as the top, treated as an entity: (d / doctor :domain (p / person ... "John"))

I personally would like to rule out that third option, because any kind of temporal or modal modification needs to treat the entity as if it were a predicate, which leaves it ambiguous whether the node "doctor" means "the set of doctors" or "the state of being a doctor".

My AMR-centric proposal: why not model most copular and "zero-copula" nominal predications with the "Relationship as the top" strategy (I'll give examples of this below), and use the "treating objects/adjectives as semantic predicates" strategy only in two cases: when languages employ what Jens' document refers to as "Strategy C", with inflectional morphology on the grammatical predicate, and when the predication is doing property attribution, where the predicate will likely need lexicalized semantic roles eventually anyway?

"Relationship as the top node" examples

I'd strongly advocate for this to be the "default" when possible. To propose a slightly cleaner form than what we used for English (adding identity-91 and instance-of-91 to replace most usages of the vague ":domain" label):

Identity-91 (:arg1 and :arg2 are literally the same entity -- possibly also use deictic-presentational)

This is Emily Burns
(i / identical-91
   :arg1 (t / this)
   :arg2 (p / person... Emily Burns))

Emily is the Science Director of the Save-the-Redwoods League
(i / identical-91
   :arg1 (p / person ... Emily Burns)
   :arg2 (p2 / person ... :arg0-of (d / direct-01
             :arg1 (o / organization ... Save the Redwoods League)
             :mod (s / science))))

Instance-of-91 (:arg1 is an instance of set arg2)

The Fender Stratocaster is a type of electric guitar.
(i / instance-of-91
   :arg1 (t / thing ... Fender Stratocaster)
   :arg2 (g / guitar
        :mod (e / electricity)))

Sally is a brilliant scientist
(i / instance-of-91
   :arg1 (p / person ... Sally)
   :arg2 (s / scientist
        :mod (b / brilliant)))

Possession:

The book is mine
(p / possess-91
    :arg0 (i / i)
    :arg1 (b / book))

Location:

The mail is in the kitchen
(b / be-located-at-91
    :arg1 (m / mail)
    :arg2 (i / in
        :op1 (k / kitchen)))

Relational

He is my brother
(h / have-rel-role-91
    :arg1 (h2 / he)
    :arg2 (i / i )
    :arg3 (b / brother))

And existential/ presentational could be argued for too, e.g.:

There are three giraffes
(e / exist-91
   :arg1 (g / giraffe :quant 3))

The final and thorniest case is adjectival predications and miscellaneous :domain predications. The relation-as-top analysis would be as follows:
Sally is brilliant:

(h / have-mod-91
   :arg1 (p / person ... Sally)
   :arg2 (b / brilliant))

This is quite feasible with most examples discussed in the UNM proposal:

purukuparli martina "Purukuparli is the boss."; Tiwi

(i / instance-of-91
   :arg1 (p / person ..."purukuparli")
   :arg2 (m / martina))

ni-tīcitl "I am a doctor"; Nahuatl

(i / instance-of-91
    :arg1 (s / sg1)
    :arg2 (t / tīcitl))

ija sigin ca (Amele, "I have a knife")

(p / possess-91
     :arg1 (s / sing1)
     :arg2 (s2 / sigin))

"Treating objects/adjectives as semantic predicates" option

In English, we actually express adjectives as being the "top" and treat them like semantic predicates:

(b / brilliant-01
   :ARG1 (p / person ... Sally))

UNM provided plenty of examples where there are morphosyntactic reasons why we might want to do this when things have inflectional morphology, e.g. Arapaho:

hii-wo3onohoe-ni-noo (I have a book / I am book-having)

(w / hii-wo3onohoe-ni
    :agent (s / sing1))

I'd argue that language-specific annotations should be able to make their own judgement call about where to draw the line, but that we should strongly prefer reifying relations when possible.

Typo: underscores instead of hyphens in named entity classes/types/subtypes

Some labels in the named entity taxonomy contain underscores. Instead, there should be hyphens as in the other labels.

Examples: social_group, international_organization, government_organization...

Encouraging Annotators to Choose the Right Level of Detail

One of the features I really like about UMR is that it allows annotators to choose the level of detail to annotate, using UMR’s lattices and Stage 0/1 distinction. You added this feature to accommodate annotation of low-resource languages, and I wanted to emphasize that it is also useful for downstream applications. For example, in our applications, tense and aspect are really important, but person and number are not. If we annotated our data with UMRs we would want a way to annotate tense and aspect while keeping person and number information underspecified.

Here are a few questions/requests with that in mind:

- Is there or should there be a canonical way to indicate that a part of the annotation is underspecified? Perhaps an abstract concept umr-unspecified could take the place of the feature for aspect, person, number, etc.?
- Is it possible to make thetic/non-thetic into a lattice so that this level of detail is optional to annotators? Hypothetically, that lattice could include other information-structure features as well.
- How should Smatch interact with UMR lattices, ideally? For example, should a concept 1st match Inclusive in Smatch? Is there any situation where you might need to compare these features for the same language?
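One possible answer to the Smatch question is to count two values as matching whenever one subsumes the other in the lattice. The sketch below assumes a simplified, hypothetical fragment of a person lattice; the real UMR lattice has more nodes, and the names PARENT, ancestors, and compatible are made up for this illustration. It is not how Smatch currently works.

```python
# Hypothetical, simplified fragment of a person lattice: child -> parent.
PARENT = {
    "Inclusive": "1st",
    "Exclusive": "1st",
    "1st": "person",
    "2nd": "person",
    "3rd": "person",
}


def ancestors(value: str) -> list[str]:
    """Return the value followed by all of its ancestors, up to the root."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def compatible(a: str, b: str) -> bool:
    """True if the two values are equal or one subsumes the other."""
    return a in ancestors(b) or b in ancestors(a)
```

Under this scheme 1st matches Inclusive, but Inclusive does not match Exclusive, so annotations at different levels of detail are not penalized while genuine disagreements still are.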

Confusion, Questions, and Requests related to UMR Notations

Good Morning. I realize the UMR notation encodes a lot of information incorporating many areas of research. I found some parts of the notation confusing, and I thought it might be useful to identify potential sources of confusion and ask questions and offer some feedback. Thank you in advance for your replies.

Capitalization and Naming Conventions

I noticed some inconsistencies in the guidelines related to capitalization and naming of concepts, relations, and attribute values:

- Sometimes concepts and relations are capitalized: Singular, Paucal
- Sometimes all caps is used: AUTH, PRESENT_REF
- Sometimes lowercase is used: :ref-number, imperative (Imperative is also used)
- Sometimes camel-case is used to distinguish separate words: :FullAff
- Sometimes hyphens are used to distinguish separate words: :ref-person
- Sometimes underscores are used to distinguish separate words: PRESENT_REF
AMR concepts and relations use lowercase words separated by hyphens. The only exceptions are core-role relations (:ARGX) to make them visually stand out and names in quotes. Even attribute values such as imperative and expressive are lowercase in standard AMR.

With that in mind I would request:

I would strongly encourage you to keep the AMR naming convention in UMR (lowercase names with hyphens between words). That will reduce errors in downstream applications in the long run, reduce annotator typos from inconsistent capitalization, and improve the backward-compatibility of UMR and AMR. A slight change from AMR’s conventions might be fine as long as you are consistent and the convention is clear and simple.
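As a rough illustration of how mechanical such a convention would be, here is a sketch of a normalizer that maps the label styles listed above to lowercase words separated by hyphens. The function name is hypothetical, and leaving :ARGX core roles untouched is an assumption based on the AMR exception mentioned above.

```python
import re


def to_amr_style(label: str) -> str:
    """Convert a label to lowercase-with-hyphens, keeping :ARGX roles as-is."""
    if re.fullmatch(r":ARG\d+(-of)?", label):
        return label  # core roles stay uppercase, as in AMR
    prefix = ":" if label.startswith(":") else ""
    body = label.lstrip(":")
    # Insert a hyphen at lower-to-upper boundaries: FullAff -> Full-Aff.
    body = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", body)
    return prefix + body.replace("_", "-").lower()
```

A script like this could be run once over the guidelines and the released data, so that annotators only ever see one spelling of each label.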

Abbreviations and Acronyms

Some of UMR’s notations rely on acronyms and difficult-to-read abbreviations such as DCT, AUTH, PrtAff, and modstr. AMR is designed to be human-readable, which is important for its use as an explanatory tool, and I believe it also reduces the learning curve for reading and annotating AMR. I would also stress that these guidelines won’t just be used by linguists, but also computer scientists who want to be able to read or parse UMR.

With that in mind I would request:

- Please try to almost never use acronyms in the UMR representation and use abbreviations only in moderation. Instead of DCT, AUTH, PrtAff, and modstr, you could write doc-creation-time, author, :partial-affirm, and :modal-strength for example.
- Rather than using linguistic acronyms in the guidelines, I would spell them out, e.g., Tense, Aspect, Mood instead of TAM.

Transliteration

You have added a nice notation for transliterating words, for example when annotating low-resource languages:

(e / enhleama-00  'travel' 
     …

I like this notation for transliteration, and I think it will be very useful for annotating and using multilingual UMR data. However, please be aware that this notation changes the AMR data structure and many code libraries for reading, writing, and representing the AMR data will need to be updated to even be able to run on UMR inputs with this notation. For example, Smatch and penman will fail to run if you try to run their current code on a UMR with transliteration. A possible workaround could be to represent transliteration as attributes, e.g. e / enhleama-00 :transl "travel", at least until this notation is supported in libraries that currently work on AMR (I think you could do this in a post-processing script rather than changing the notation).
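The post-processing idea could look roughly like this. The regular expression is an assumption about how the single-quoted form appears in the data (directly after the concept, as in the example above), and gloss_to_attribute is a hypothetical name.

```python
import re

# Matches "(var / concept" followed by a single-quoted string.
GLOSS = re.compile(r"(\(\s*\w+\s*/\s*[^\s()']+)\s+'([^']*)'")


def gloss_to_attribute(umr: str) -> str:
    """Rewrite "(e / concept 'text'" as '(e / concept :transl "text"'."""
    return GLOSS.sub(r'\1 :transl "\2"', umr)
```

After this rewrite the quoted string is an ordinary attribute with a string value, which is the shape that existing AMR readers are designed to accept.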

Questions/Requests:

Do you want the transliteration to be stored in the UMR data structure or is this only for readability for annotators?

Document-Level Representation

As with transliteration, the notation for document-level representations that is shown in the guidelines will not be supported by current code for AMRs, because a notation like :temporal((DCT :depends-on s1t2) (s1t2 :contained s1t)) isn't supported. If these representations are always connected graphs, it might be good to make them conform to AMR notation.

Questions/Requests:

- Is there one document-level graph per document or per sentence? If it's one per document, should the root node be called document rather than sentence?
- The guidelines use node IDs like s1a to refer to node a in the first sentence. Could you add a dot between s1 and a to make it s1.a? I think that would make it visually easier to read and it would clarify that s1 is the namespace of a.
- Could you include concepts in the document-level graph as well, just for readability? So, instead of s1t2, you can write (s1.t2 / today).
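The dotted node IDs could be produced by a one-line rewrite. This sketch assumes that IDs always have the shape s, sentence number, then a variable starting with a lowercase letter; the function name is hypothetical.

```python
import re

# "s" + sentence number + variable name, e.g. s1t2 -> sentence 1, variable t2.
NODE_ID = re.compile(r"\bs(\d+)([a-z]\w*)\b")


def add_namespace_dot(doc_graph: str) -> str:
    """Insert a dot between the sentence prefix and the variable name."""
    return NODE_ID.sub(r"s\1.\2", doc_graph)
```

Symbols without a sentence prefix, such as DCT, are left unchanged.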

Temporal Relations

I found the notation of :after and :before confusing at first. I read the relation A :after B as “A happens after B happens”, but according to the guidelines, it is the other way around. I think it’s easier in English to read it as “A happens after B happens” and other people might be confused by this as well.

Questions/Requests:

- Would it make sense to switch the direction of :after and :before relations so that A :after B can be read as “A happens after B” and A :before B can be read as “A happens before B”?
- In one or two places in the guidelines you use a relation :op in (b / before :op (n / now)). I would change that to :op1 to stay consistent with AMR.
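If the direction were switched, existing annotations could be converted mechanically. The sketch below assumes that the current reading is the reversed one described above, and that relations are available as (A, relation, B) triples; both the triple format and the function name are hypothetical.

```python
SWAP = {":after": ":before", ":before": ":after"}


def to_proposed_reading(triples):
    """Relabel (A, rel, B) triples so that rel reads as 'A happens rel B'.

    Argument order is kept; only :after and :before are exchanged.
    """
    return [(a, SWAP.get(rel, rel), b) for a, rel, b in triples]
```

Swapping the labels and swapping the arguments are equivalent here; swapping labels keeps the source node of each relation where it was.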

Complement clauses: predication, modification, or reference?

Part 3-1-1-2 is titled Processes in modification and reference and it includes example (1b) of a complement clause:

She wanted to go to school.

Why is the going process not considered predication (which would make it inappropriate for this part, and it should be in the previous part, which states that "regardless whether they are in an independent or a dependent clause, predicated processes are always identified as events")? And if you really believe it is not predication, what do you think it is? Modification, or reference?

