
penman's Introduction


Penman – a library for PENMAN graph notation


This package models graphs encoded in PENMAN notation (e.g., AMR), such as the following for the boy wants to go:

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go
            :ARG0 b))

The Penman package may be used as a Python library or as a script.

For JavaScript, see chanind/penman-js.

Features

  • Read and write PENMAN-serialized graphs or triple conjunctions
  • Read metadata in comments (e.g., # ::id 1234)
  • Read surface alignments (e.g., foo~e.1,2)
  • Inspect and manipulate the graph or tree structures
  • Customize graphs for writing:
    • Adjust indentation and compactness
    • Select a new top node
    • Rearrange edges
    • Restructure the tree shape
    • Relabel node variables
  • Transform the graph
    • Canonicalize roles
    • Reify and dereify edges
    • Reify attributes
    • Embed the tree structure with additional TOP triples
  • AMR model: role inventory and transformations
  • Check graphs for model compliance
  • Tested (but not yet 100% coverage)
  • Documented (see the documentation)

Library Usage

>>> import penman
>>> g = penman.decode('(b / bark-01 :ARG0 (d / dog))')
>>> g.triples
[('b', ':instance', 'bark-01'), ('b', ':ARG0', 'd'), ('d', ':instance', 'dog')]
>>> g.edges()
[Edge(source='b', role=':ARG0', target='d')]
>>> print(penman.encode(g, indent=3))
(b / bark-01
   :ARG0 (d / dog))
>>> print(penman.encode(g, indent=None))
(b / bark-01 :ARG0 (d / dog))

(more information)

Script Usage

$ echo "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go :ARG0 b))" | penman
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go
            :ARG0 b))
$ echo "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go :ARG0 b))" | penman --make-variables="a{i}"
(a0 / want-01
    :ARG0 (a1 / boy)
    :ARG1 (a2 / go
              :ARG0 a1))

(more information)

Demo

For a demonstration of the API usage, see the included Jupyter notebook:

PENMAN Notation

A description of the PENMAN notation can be found in the documentation. This module expands the original notation slightly to allow for untyped nodes (e.g., (x)) and anonymous relations (e.g., (x : (y))). It also accommodates slightly malformed graphs as well as surface alignments.

Citation

If you make use of Penman in your work, please cite Goodman, 2020. The BibTeX is below:

@inproceedings{goodman-2020-penman,
    title = "{P}enman: An Open-Source Library and Tool for {AMR} Graphs",
    author = "Goodman, Michael Wayne",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.35",
    pages = "312--319",
    abstract = "Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a framework for semantic dependencies that encodes its rooted and directed acyclic graphs in a format called PENMAN notation. The format is simple enough that users of AMR data often write small scripts or libraries for parsing it into an internal graph representation, but there is enough complexity that these users could benefit from a more sophisticated and well-tested solution. The open-source Python library Penman provides a robust parser, functions for graph inspection and manipulation, and functions for formatting graphs into PENMAN notation. Many functions are also available in a command-line tool, thus extending its utility to non-Python setups.",
}

For the graph transformation/normalization work, please use the following:

@inproceedings{Goodman:2019,
  title     = "{AMR} Normalization for Fairer Evaluation",
  author    = "Goodman, Michael Wayne",
  booktitle = "Proceedings of the 33rd Pacific Asia Conference on Language, Information, and Computation",
  year      = "2019",
  pages     = "47--56",
  address   = "Hakodate"
}

Disclaimer

This project is not affiliated with ISI, the PENMAN project, or the AMR project.

penman's People

Contributors

danielhers, dependabot[bot], goodmami, shenganzhang


penman's Issues

Support iterative printing with dump()

dump() currently uses dumps(), which builds the entire output string before writing to a stream. Since dump() doesn't return a string, it can be done iteratively, which would help especially when the output stream is stdout (so you see output right away). Also see #21.
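A minimal sketch of the iterative idea, assuming a hypothetical helper named dump_iter and an encode callable passed in by the caller (neither is part of the library as described here). Plain strings stand in for graphs so the example is self-contained:

```python
import io


def dump_iter(graphs, stream, encode=str):
    """Write each graph as soon as it is encoded, rather than joining first.

    `encode` is a stand-in for whatever serializes one graph to a string.
    """
    for i, graph in enumerate(graphs):
        if i > 0:
            stream.write('\n')  # blank line between graphs
        stream.write(encode(graph) + '\n')


buffer = io.StringIO()
dump_iter(['(a / alpha)', '(b / beta)'], buffer)
```

Because nothing is accumulated, each graph reaches the stream (e.g., stdout) before the next one is encoded.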

Iterdecode should work on streams instead of strings

Some of the benefit of iterative decoding is lost if the whole string must be read into memory before processing. It still has the benefit that each item is yielded as it is parsed, but it would be better to pull from a stream (e.g., an open file), especially for large texts.
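One way to pull from a stream is to split it into balanced parenthesized chunks line by line and hand each chunk to the decoder. The sketch below is an assumption about how that could look, not the library's implementation; iter_graph_strings is a hypothetical name, and it only tracks parentheses and double-quoted strings:

```python
import io


def iter_graph_strings(stream):
    """Yield one balanced '(...)' chunk at a time from a line-based stream."""
    chunk, depth, in_string, escaped = [], 0, False, False
    for line in stream:
        if depth == 0 and line.lstrip().startswith('#'):
            continue  # skip comment lines between graphs
        for ch in line:
            if depth == 0 and ch != '(':
                continue  # ignore anything outside a graph
            chunk.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth == 0:
                    yield ''.join(chunk)
                    chunk = []


text = '# ::id 1\n(a / alpha\n   :ARG0 (b / beta))\n\n(c / gamma :op1 ")")\n'
chunks = list(iter_graph_strings(io.StringIO(text)))
```

Only one graph's worth of text is held in memory at a time, and metadata comments would need separate handling.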

Identifying re-entrancies

Re-entrancies are where two or more paths converge on a single node. Here are a couple of examples:

(a / A                       (a / A
   :rel (b / B                  :rel (b / B
           :rel (c / C))                :rel-of (c / C)))
   :rel c)

It should be sufficient to search for two or more triples with the same target node, as long as that target is a variable (node identifier) and not a constant.

Note that re-entrancies are not the same as cycles. Cycles are a type of re-entrancy where a directed traversal returns to an already-visited node.
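The search described above can be sketched over plain (source, role, target) tuples. This assumes the triples are already normalized (so :rel-of has been deinverted) and that variables are exactly the sources of :instance triples; the function name is hypothetical:

```python
from collections import Counter


def reentrancies(triples):
    """Return {variable: incoming-edge count} for variables targeted 2+ times."""
    variables = {src for src, role, tgt in triples if role == ':instance'}
    counts = Counter(
        tgt for src, role, tgt in triples
        if role != ':instance' and tgt in variables
    )
    return {var: n for var, n in counts.items() if n > 1}


# The two example graphs above, as normalized triples
left = [('a', ':instance', 'A'), ('a', ':rel', 'b'),
        ('b', ':instance', 'B'), ('b', ':rel', 'c'),
        ('c', ':instance', 'C'), ('a', ':rel', 'c')]
right = [('a', ':instance', 'A'), ('a', ':rel', 'b'),
         ('b', ':instance', 'B'), ('c', ':rel', 'b'),
         ('c', ':instance', 'C')]
```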

Add an option to set top of Graph to root

Perhaps the Graph constructor could take an additional keyword argument to initialize the graph with the top of the graph being the root (that is, the node of the DAG with in-degree 0). A hackier way to support this could be to have a special value for the top argument to the Graph constructor (e.g. -1), or perhaps this could be the default (when top argument is None).

This could be useful for applications that require traversing the graph in topological order (for example, translating AMR to FOL: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00257).
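As a rough illustration of what "the node with in-degree 0" means in terms of triples, the following sketch works on plain tuples rather than Graph objects. The name find_roots is hypothetical, and it returns a list because a graph may have several such nodes, or none if every node sits on a cycle:

```python
def find_roots(triples):
    """Return the variables that are never the target of an edge."""
    variables = {src for src, role, tgt in triples if role == ':instance'}
    has_incoming = {
        tgt for src, role, tgt in triples
        if role != ':instance' and tgt in variables
    }
    return sorted(variables - has_incoming)


wants = [('w', ':instance', 'want-01'), ('w', ':ARG0', 'b'),
         ('b', ':instance', 'boy'), ('w', ':ARG1', 'g'),
         ('g', ':instance', 'go'), ('g', ':ARG0', 'b')]
two_roots = [('a', ':instance', 'A'), ('a', ':rel', 'b'),
             ('b', ':instance', 'B'), ('c', ':rel', 'b'),
             ('c', ':instance', 'C')]
```

The second graph shows why a single "root" is not always well defined: both a and c have in-degree 0.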

Remove 'triples' parameter

I don't think anyone uses the triples parameter in the following places:

  • PENMANCodec
    • decode()
    • iterdecode()
    • encode()
  • penman.interface
    • load()
    • loads()
    • decode()
    • dump()
    • dumps()
    • encode()

It's mostly a historical artifact. If someone wants to read or write triples there are the PENMANCodec.parse_triples() and PENMANCodec.format_triples() functions, and it will still be available via the penman command.

Removing this parameter will simplify things a bit, which is good for maintainability.

See if linearization order (encoding) can be done with a priority queue

The order of linearized nodes currently happens in the _encode_penman() function, although it might make sense to move it to a separate function (see #16). The code maintains lists of edges in both directions, separated by whether they are preferred or not. Maybe this can be done with a priority queue instead, which could simplify the code. Luckily, there's a standard library module for this: https://docs.python.org/3/library/heapq.html
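To make the heapq idea concrete, here is a small sketch that is independent of _encode_penman(). It assumes a caller-supplied is_preferred predicate (hypothetical) and uses the insertion index as a tie-breaker so that edges with equal priority keep their original order and are never compared with each other:

```python
import heapq


def ordered_edges(edges, is_preferred):
    """Yield preferred edges first, otherwise keeping the input order."""
    heap = []
    for index, edge in enumerate(edges):
        priority = 0 if is_preferred(edge) else 1
        heapq.heappush(heap, (priority, index, edge))
    while heap:
        _, _, edge = heapq.heappop(heap)
        yield edge


edges = [('g', ':ARG1-of', 'k'), ('a', ':ARG0', 'b'),
         ('b', ':mod-of', 'c'), ('a', ':ARG1', 'g')]
result = list(ordered_edges(edges, lambda e: not e[1].endswith('-of')))
```

A single heap with composite keys would replace the separate preferred and non-preferred lists, at the cost of making the ordering rule less visible in the code.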

Read metadata lines

The # ::key value.. style of metadata (in the lines directly before a graph) is now a fairly standard format. It should be parsed and stored with Graph objects. Some metadata types, like ::tok and ::alignment, could tie in with the handling of surface alignments (see #19).
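A minimal sketch of reading such lines into a dictionary, assuming keys never contain whitespace and values never contain "::". The function name parse_metadata is hypothetical. It relies on re.split returning the captured keys interleaved with the text between them:

```python
import re


def parse_metadata(lines):
    """Collect '::key value' pairs from comment lines into a dict."""
    meta = {}
    for line in lines:
        line = line.strip()
        if not line.startswith('#'):
            continue
        # With one capture group, re.split gives
        # [prefix, key1, value1, key2, value2, ...]
        parts = re.split(r'::(\S+)', line[1:])
        for key, value in zip(parts[1::2], parts[2::2]):
            meta[key] = value.strip()
    return meta


meta = parse_metadata([
    '# ::id 1234 ::preferred',
    '# ::snt The boy wants to go.',
])
```

Several keys may share one line, and a key with no value (like ::preferred here) maps to an empty string.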

Function to check graphs for model compliance

The model describes valid roles and transformations but there is not yet any automatic way to check for model compliance. At a minimum, this should check if all roles are described by the model and if the graph is connected. In the future it would be nice to also check the concepts and the roles used on each node, but currently the model does not have this information.
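The two minimum checks (known roles, connectedness) could look roughly like this. It is a sketch over plain triples with a hypothetical check_graph function and a caller-supplied set of valid roles, not the model API; connectivity is tested by ignoring edge direction:

```python
def check_graph(triples, top, valid_roles):
    """Return a list of error strings; an empty list means no problems found."""
    errors = []
    variables = {src for src, _, _ in triples}
    neighbors = {var: set() for var in variables}
    for src, role, tgt in triples:
        if role == ':instance':
            continue
        if role not in valid_roles:
            errors.append(f'invalid role: {role}')
        if tgt in variables:  # an edge between two nodes
            neighbors[src].add(tgt)
            neighbors[tgt].add(src)
    # Walk outward from the top, treating edges as undirected
    seen, agenda = set(), [top]
    while agenda:
        node = agenda.pop()
        if node not in seen:
            seen.add(node)
            agenda.extend(neighbors.get(node, ()))
    unreachable = sorted(variables - seen)
    if unreachable:
        errors.append('unreachable nodes: ' + ', '.join(unreachable))
    return errors


good = [('b', ':instance', 'bark-01'), ('b', ':ARG0', 'd'),
        ('d', ':instance', 'dog')]
bad = good + [('d', ':colour', 'brown'), ('x', ':instance', 'xenon')]
```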

variable/value types are perhaps too permissive

Allowing arbitrary types for node variables was fun, but could be problematic, because if something occurs with an attribute value with the same form as a variable, it will look like a re-entrancy. With AMR or DMRS, this is unlikely to happen, but it is possible. E.g., consider the hypothetical graph:

# I bought 3 fish
(1 / buy
   :ARG1 (2 / i)
   :ARG2 (3 / fish
            :count 3))

One possible solution is to always wrap variables in some custom object; maybe node types could be bundled with them? It doesn't solve the problem of parsing the graph (e.g. in the example would we wrap the value of :count or not?). Alternatively, or in addition, the parser could require some symbol-like identifier for variables (e.g., x, x2, etc., as AMR does), and then restrict the kinds of symbols the parser accepts. By default, symbols could be disallowed, but certain codecs could allow them, being careful not to confuse the allowed symbols with variables. E.g., AMRCodec would not allow expressive as a variable, nor a (more generally /[a-zA-Z]\d*/) as a symbol value.

Support for alignment to text tokens

AMR supports annotations on nodes and edges, specifying alignment to one or more token indices. See an example in Nathan Schneider's AMR IO module: https://github.com/nschneid/amr-hackathon/blob/master/src/amr.py#L172
The module I mentioned supports reading AMRs, but building them from triples for the purpose of writing to file is more convenient with the penman package - except that it doesn't support alignments.
A simple solution would be to allow a suffix to a node/edge label containing "~e." and a number.
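The suggested suffix could be split off with a regular expression along these lines. This is only a sketch of the proposal (a trailing "~e." followed by one or more comma-separated indices), with a hypothetical function name:

```python
import re

# A trailing "~e." followed by comma-separated token indices
_ALIGNMENT = re.compile(r'~e\.(\d+(?:,\d+)*)$')


def split_alignment(label):
    """Return (label without suffix, tuple of token indices)."""
    match = _ALIGNMENT.search(label)
    if match is None:
        return label, ()
    indices = tuple(int(i) for i in match.group(1).split(','))
    return label[:match.start()], indices
```

For example, split_alignment('foo~e.1,2') gives ('foo', (1, 2)), while a label with no suffix comes back unchanged with an empty tuple.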

Edge dereification

Penman now has edge and attribute reification, ported from Norman, but not the reverse process. Edge contraction (also called "collapsing" or "dereification" in Norman) would be more useful for DMRS than AMR since DMRS tends to have "reified" nodes by default (in a sense), whereas AMR almost never reifies relations unless necessary.

Separate instance triples from other attributes

It is easy to get instance triples:

>>> g = decode('(a / alpha :ARG0 (b / beta :attr val))')
>>> g.attributes(role=':instance')
[Attribute(source='a', role=':instance', target='alpha'), Attribute(source='b', role=':instance', target='beta')]

It is more difficult to get non-instance attributes:

>>> [t for t in g.attributes() if t[1] != ':instance']
[Attribute(source='b', role=':attr', target='val')]

Since the instance attributes are a distinguished class and it is common to request them, it makes sense to give them their own accessor. Since that would make g.attributes(role=':instance') redundant, it makes sense to have g.attributes() ignore instance triples. This way, the following would be true:

>>> sorted(g.instances() + g.attributes() + g.edges()) == sorted(g.triples)
True
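The proposed three-way split can be sketched on plain triples. The partition function below is hypothetical and uses the "target is a variable" heuristic to tell edges from attributes, which is itself questioned in other issues here:

```python
def partition(triples):
    """Split triples into (instances, edges, attributes)."""
    variables = {src for src, _, _ in triples}
    instances, edges, attributes = [], [], []
    for triple in triples:
        _, role, tgt = triple
        if role == ':instance':
            instances.append(triple)
        elif tgt in variables:
            edges.append(triple)
        else:
            attributes.append(triple)
    return instances, edges, attributes


triples = [('a', ':instance', 'alpha'), ('a', ':ARG0', 'b'),
           ('b', ':instance', 'beta'), ('b', ':attr', 'val')]
instances, edges, attributes = partition(triples)
```

Every triple lands in exactly one list, which is what makes the sorted-equality check above hold.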

Inverted edges without a source variable

In #19, @danielhers comments:

[...] I had an issue when trying to use penman to read AMRs in my branch. When reading a_pmid_2094_2929.48 from the BioAMR training set (which contains the triple f / figure~e.17 :mod "1A"~e.19), I was asking if "1A" in amr.variables() and the answer was True. I think this is because of the :mod edge, which is an inverted :domain. In a way, it is true that the triple ("1A", "domain", "f") exists, but that doesn't mean "1A" is a variable, while f is.

This is due to the interaction of 2 assumptions I make in Penman:

  • that :mod is the inverse of :domain
  • that the source of a normalized (i.e., deinverted) triple will always be a node identifier

Specifically, the AMRCodec treats mod as the inversion of domain; that is, mod is considered an inverted edge while domain is the canonical form. The guidelines for AMR suggest as much:

We can write :domain-of as an inverse of :domain, but we often shorten this to :mod.

and

... we instead open up the inverse of :domain , i.e. the role :mod

On the other hand, :mod is in the table of relations with reifications, while :domain is not, nor is :domain in the list of relations without reifications (either it was forgotten, or perhaps the author considered :mod to be the canonical form and :domain the inverse).

The second assumption with PENMAN is that the source of every non-inverted edge is a node identifier. If mod is not inverted, there is not really a problem: it's just an attribute triple like :polarity (assuming it's ok for :mod to behave that way). But does that mean that :domain cannot behave like an attribute? Or can both of them (meaning they are not simple inverses)?

Regex pattern bug in _default_cast

@goodmami The code for the function is shown below.

def _default_cast(x):
    if isinstance(x, basestring):
        if x.startswith('"'):
            x = x  # strip quotes?
        elif re.match(
                r'-?(0|[1-9]\d*)(\.\d+[eE][-+]?|\.|[eE][-+]?)\d+', x):
            x = float(x)
        elif re.match(r'-?\d+', x):
            x = int(x)
    return x

Here are some examples that cause unexpected behavior.

_default_cast("123a") # expected: "123a", actual: ValueError from int()
_default_cast("-3.14z") # expected: "-3.14z", actual: ValueError from float()

The problem arises from using regex patterns without ^ and $. Adding these characters as shown below fixes the problems.

def _default_cast(x):
    if isinstance(x, basestring):
        if x.startswith('"'):
            x = x  # strip quotes?
        elif re.match(
                r'^-?(0|[1-9]\d*)(\.\d+[eE][-+]?|\.|[eE][-+]?)\d+$', x):
            x = float(x)
        elif re.match(r'^-?\d+$', x):
            x = int(x)
    return x

Another alternative would be not to use regex at all here, and use try-except blocks instead, as shown below.

def _default_cast(x):
    if isinstance(x, basestring):
        try:
            x = int(x)
        except ValueError:
            try:
                x = float(x)
            except ValueError:
                pass
    return x
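For comparison, a Python 3 sketch of the regex approach that uses re.fullmatch() instead of adding ^ and $ by hand. The patterns are the ones from the issue; the function name and the use of str in place of basestring are adaptations for illustration:

```python
import re

_FLOAT = re.compile(r'-?(0|[1-9]\d*)(\.\d+[eE][-+]?|\.|[eE][-+]?)\d+')
_INT = re.compile(r'-?\d+')


def default_cast(x):
    """Cast numeric-looking strings; leave everything else untouched."""
    if isinstance(x, str) and not x.startswith('"'):
        # fullmatch() only succeeds if the whole string fits the pattern
        if _FLOAT.fullmatch(x):
            return float(x)
        if _INT.fullmatch(x):
            return int(x)
    return x
```

Note that the try-except alternative is more permissive than either regex version: in Python 3, int() and float() also accept forms such as "1_000", " 12 ", "nan", and "inf", which may or may not be desirable for graph values.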

Bugs when getting attribute triples

Here is an example:

(c / contrast-01
      :ARG2 (a / and
            :op1 (l / lack-01
                  :ARG0 (i / it)
                  :ARG1 (b2 / benefit-01)
                  :degree (t / total
                        :polarity -))
            :op2 (b / believe-01
                  :ARG0 (i2 / i)
                  :ARG1 (b4 / benefit-01
                        :ARG0 i
                        :mod (g / great
                              :degree (m / most))
                        :manner (a2 / and
                              :op1 (s / serve-01
                                    :ARG0 i
                                    :ARG1 (s2 / strip-01
                                          :ARG0 i
                                          :ARG1 (b3 / benevolence
                                                :mod (f / false)
                                                :ARG2-of (c2 / compose-01
                                                      :ARG1 (s3 / skin
                                                            :part-of (a3 / and
                                                                  :op1 (e2 / elitist)
                                                                  :op2 (p / politician)
                                                                  :op3 (c3 / capitalist))
                                                            :mod (o / outer))))))
                              :op2 (e / expose-01
                                    :ARG0 i
                                    :ARG1 (c7 / core
                                          :ARG1-of (r / rot-01)
                                          :mod (c6 / cannibal)
                                          :location (i3 / inside
                                                :op1 a3))
                                    :manner (o2 / once-and-for-all)))))))

Currently, penman determines whether a triple is an attribute purely based on whether the target of the triple is a variable. However, the above example gives us the triple (i2, instance, i), which is an attribute, but penman fails to detect it as one. The reason is that i is also a variable name, introduced on line 4 by :ARG0 (i / it).

Seems a Variable class should be added to fix this issue cleanly?
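One possible shape for such a class is a thin str subclass, so that variables still compare and hash like strings but can be told apart by type. This is a sketch of the suggestion, not an existing penman class:

```python
class Variable(str):
    """A node identifier, distinguishable from a plain-string constant."""
    __slots__ = ()


def is_attribute(triple):
    """A triple is an attribute unless its target is a Variable."""
    return not isinstance(triple[2], Variable)


i, i2, b4 = Variable('i'), Variable('i2'), Variable('b4')
concept = (i2, ':instance', 'i')   # the concept "i", a constant
edge = (b4, ':ARG0', i)            # a reentrant edge to node i
```

The parser would still have to decide, when reading, which symbols to wrap, so this addresses detection after parsing rather than the parsing ambiguity itself.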

Allow simple codec customization without needing subclasses

Codecs that only need to customize some basic bits like the TYPE_REL, TOP_REL, or TOP_VAR could do this through the PENMANCodec constructor instead of needing to create a subclass. Customizing the parsing regexes this way may be too much, though, but more encoding options could be added, such as for including attributes on the same line as the node type, etc.

Add support for concept inventories to Model

This includes:

  • list/map of available concepts
  • Model.has_concept() (like Model.has_role())
  • Add checks to Model.errors()
  • (Maybe) consistency checks of roles, reifications, normalizations, and concepts

Delay parsing surface alignments to simplify trees

It's nice to describe surface alignments in the grammar, but there are a couple of consequences:

  • The production is rather complicated in regular expressions, which slows down parsing
  • It makes the Tree data structures ugly with the lists of epidata (which at the Tree stage are currently only surface alignments)

On the other hand, delaying parsing has the downside that some parsing tasks occur outside of the lexer + codec modules.

The proposal here is to detect and deal with surface alignments in layout.interpret() and layout.configure() rather than in the codec.

Reconfigure graphs with triple sorting

penman.layout.rearrange() takes a sorting key function to rearrange the branches of a tree, but penman.layout.reconfigure() does not allow any such sorting function for the triples. Give it a key argument to allow graphs to be more drastically reconfigured.

Is there a chance of merging multiple attribute values into one node?

Hi.
I find that PENMAN sometimes merges multiple attribute values (which of course have the same value) into one node.

An example output by penman.amr is like this:

(g1 / good-02
     :polarity (- :domain (m10 / mention-01
                                               :ARG0 (w17 / we)
                                               :mod (a16 / again)))
     :ARG1 (t4 / that)) 

Note the attribute value "-"! Is it intended to do so?

Add function to recreate variables

A function to recreate variables could help with things like normalization. The default behavior will follow convention, which takes the first letter of the concept. For other behaviors, users could supply a function that takes the concept and returns the identifier prefix. For non-unique identifiers, an integer will be appended (regardless of the method used).

As this doesn't change the graph's meaning, it doesn't really belong in penman.transform, nor is it a layout change like layout.rearrange(), so the choice is down to penman.graph or penman.tree. I think it's better on tree for two reasons:

  1. we don't have to rebuild the epidata dictionary with the new variables
  2. integers on non-unique variables generally increase based on order of occurrence in the tree

So we then have two new functions:

  • penman.tree.default_variable_prefix(*concept*): default function for getting a variable prefix for a concept
  • penman.tree.Tree.reset_variables(*function*): reset variables in the tree using function to get variable prefixes, with function defaulting to default_variable_prefix()
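The naming scheme can be sketched independently of the Tree class. Two details below are assumptions not settled in the text above: concepts with no alphabetic character get the prefix "_", and the second occurrence of a prefix is numbered 2 (b, b2, b3, ...), as in AMR practice. assign_variables is a hypothetical helper that takes concepts in order of occurrence:

```python
def default_variable_prefix(concept):
    """Return the first alphabetic character of the concept, lowercased."""
    for ch in str(concept):
        if ch.isalpha():
            return ch.lower()
    return '_'  # assumption: fallback when nothing alphabetic is found


def assign_variables(concepts, prefix=default_variable_prefix):
    """Assign a variable to each concept, numbering repeated prefixes."""
    counts = {}
    variables = []
    for concept in concepts:
        p = prefix(concept)
        counts[p] = counts.get(p, 0) + 1
        n = counts[p]
        variables.append(p if n == 1 else f'{p}{n}')
    return variables
```

With the default prefix function, the concepts want-01, boy, go, benefit-01 would get w, b, g, b2.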

Make sorting functions work on roles instead of branches

The sorting functions in penman.model.Model accept a Branch and return a sort key. This is because they are used to rearrange trees. But so far they only use the role of a branch and not the target. While having the target available is more powerful, it may be unnecessary, and if we want to be able to sort triples during reconfiguration (see #52), it would be more useful to work only on the role as both branches and triples have roles.

There are some differences between branch roles and triple roles:

  • branch roles can be inverted
  • concept roles are :instance in triples and / in branches

Freeze and Treeify operations

Some functions one may want to apply to the graphs work better when the graph can be treated like a tree (e.g. for branch pruning or reliable traversals), so some functions to "freeze" the graph or to reduce it to a tree could be useful. There may be several levels to this:

  1. fix edge orientations (e.g. (a :ARG1 b) vs (b :ARG1-of a))
  2. fix node relation ordering (e.g. (a :ARG1 b :ARG2 c) vs (a :ARG2 c :ARG1 b))
  3. sever reentrancies (e.g. (a :ARG1 (b / abc) :ARG2 b) vs (a :ARG1 (b / abc) :ARG2 (b2 / abc)))

Parsing with ~

I had an issue while parsing the following graph

(h / hyperlink-91 :ARG3 (u / url-entity :value "http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB162/29.pdf\"))")

This error is thrown:
SurfaceError: invalid alignments: 'nsarchiv/NSAEBB/NSAEBB162/29.pdf"'

It seems to be caused by ~ in the URL. Any way to fix this (identify the string first)?

Chunchuan
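"Identify the string first" could mean scanning past a quoted string before looking for "~". A sketch of that idea, assuming values are single tokens and strings use backslash escapes; split_value is a hypothetical name and not the lexer's actual logic:

```python
def split_value(token):
    """Return (value, alignment text after '~'), skipping '~' inside strings."""
    if token.startswith('"'):
        # Find the closing quote, stepping over backslash escapes
        i = 1
        while i < len(token) and token[i] != '"':
            i += 2 if token[i] == '\\' else 1
        value, rest = token[:i + 1], token[i + 1:]
    else:
        tilde = token.find('~')
        if tilde < 0:
            value, rest = token, ''
        else:
            value, rest = token[:tilde], token[tilde:]
    return value, rest[1:] if rest.startswith('~') else ''
```

With this ordering, a URL such as "http://www.gwu.edu/~nsarchiv/29.pdf" stays whole, while "1A"~e.19 still yields the alignment e.19.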

Duplicate relations trip up the disconnected-graph check in layout configuration

The following AMR from the BioAMR corpus could not round-trip as layout.configure() thought it was a disconnected graph. Note that node d5 has :location c twice, with the second time at the end of the node context.

# ::id pmid_2112_2157.180 ::date 2015-07-21T15:25:46 ::annotator SDL-AMR-09 ::preferred
# ::snt The four aforementioned inhibitors significantly decreased IL-6 secretion in the clinically isolated cancer cells differently (U0126, p < 0.01; AG490, LY294002 and BAY11-7082, all p < 0.001) (Figure <xref ref-type="fig" rid="F8">8A</xref>).
# ::save-date Fri Dec 18, 2015 ::file pmid_2112_2157_180.txt
(d / decrease-01
      :ARG0 (s / small-molecule :quant 4
            :ARG1-of (i / inhibit-01)
            :ARG1-of (m / mention-01
                  :time (b / before)))
      :ARG1 (s3 / secrete-01
            :ARG1 (p / protein :name (n / name :op1 "IL-6")))
      :ARG2 (s2 / significant-02)
      :location (c / cell
            :manner (c3 / clinical)
            :mod (d4 / disease :wiki "Cancer" :name (n6 / name :op1 "cancer")
                  :ARG1-of (i2 / isolate-01)))
      :ARG1-of (d2 / differ-02)
      :ARG1-of (m2 / mean-01
            :ARG2 (a / and
                  :op1 (d5 / decrease-01
                        :ARG0 (s7 / small-molecule :name (n2 / name :op1 "U0126"))
                        :ARG1 s3
                        :location c
                        :ARG1-of (s8 / statistical-test-91
                              :ARG2 (l / less-than :op1 0.01))
                        :location c)
                  :op2 (d6 / decrease-01
                        :ARG0 (a2 / and
                              :op1 (s4 / small-molecule :name (n3 / name :op1 "AG490"))
                              :op2 (s5 / small-molecule :name (n4 / name :op1 "LY294002"))
                              :op2 (s6 / small-molecule :name (n5 / name :op1 "BAY11-7082"))
                              :mod (a3 / all))
                        :ARG1 s3
                        :ARG1-of (s9 / statistical-test-91
                              :ARG2 (l2 / less-than :op1 0.001)))))
      :ARG1-of (d3 / describe-01
            :ARG0 (f / figure :mod "8A")))

Since the triples for these are identical, the first one will call up the epigraphical POP datum and leave the node context. On its own this is not a problem; the algorithm will work around it. But when this happens a few levels down it leaves some POP epidata on the layout agenda and the function thinks there is more work to do.

For a test, here is a smaller (contrived) example that causes the error:

# ::snt I think you failed to not not act.
(t / think
   :ARG0 (i / i)
   :ARG1 (f / fail
      :ARG0 (y / you)
      :ARG1 (a / act
         :polarity -
         :polarity -)))

Some notes:

  • the original graph is wrong, but I don't want to rule out the possibility of having identical triples
  • the algorithm should not consider lone POP items to be meaningful agenda items; they should just be discarded if there's no place to use them
  • even better, though, would be to somehow identify which triple they actually belong to, so that even weird graphs like this can be round-tripped without structural changes

Simple membership test: X in graph

It would be useful to have a simple way to test if a graph has certain elements. This would be done by implementing the __contains__() method on the Graph class.

I can imagine several behaviors based on the type of thing being checked:

  • string ('xyz' in g) - return True if the string is an attribute (i.e. terminal) value
  • Graph (Graph(...) in g) - return True if the query graph is isomorphic to a subgraph of g
  • tuple (('x', 'polarity', '-') in g) - return True if the query tuple is a triple in g

To start, maybe we should just consider string values.
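A toy class showing how the string and tuple behaviors might be dispatched in __contains__(). SketchGraph is a stand-in, not the real Graph class; it treats any non-variable target (including concepts) as a terminal value, and it leaves out the subgraph case entirely:

```python
class SketchGraph:
    """A stand-in holding only a list of (source, role, target) triples."""

    def __init__(self, triples):
        self.triples = list(triples)

    def __contains__(self, item):
        if isinstance(item, tuple):
            return item in self.triples
        if isinstance(item, str):
            variables = {src for src, _, _ in self.triples}
            return any(
                tgt == item and tgt not in variables
                for _, _, tgt in self.triples
            )
        raise TypeError(f'unsupported query type: {type(item).__name__}')


g = SketchGraph([('x', ':instance', 'xyz'), ('x', ':polarity', '-')])
```

Here '-' in g and ('x', ':polarity', '-') in g are both true, while 'x' in g is false because x is a variable rather than a terminal value.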

Add function to retrieve particular epigraphical markers

This issue is spun off of #37. To find a particular type of marker for some triple, one must do something like:

for epi in g.epidata[triple]:
    if isinstance(epi, marker_type):
        # do something with epi

It would be convenient to be able to request those directly, e.g.:

g.get_marker(marker_type)

Questions:

  • where does this function go? on the Graph class? in penman.epigraph?
  • should there be separate functions to return (1) all markers of a type and (2) just the first one, or just one function? Aside from POP, there are generally never multiple markers of the same (leaf) type for a triple, so it seems more convenient to just get the first one.
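Both variants from the second question are short to write as free functions over an epidata mapping. In this sketch the marker classes are stand-ins, and the functions take the triple explicitly, which is an assumption since the signature is left open above:

```python
class Alignment:
    """Stand-in marker type for illustration."""


class Push:
    """Stand-in marker type for illustration."""


def get_markers(epidata, triple, marker_type):
    """Return all markers of the given type on a triple."""
    return [epi for epi in epidata.get(triple, [])
            if isinstance(epi, marker_type)]


def get_marker(epidata, triple, marker_type, default=None):
    """Return the first marker of the given type on a triple, or default."""
    return next(
        (epi for epi in epidata.get(triple, [])
         if isinstance(epi, marker_type)),
        default,
    )


triple = ('a', ':ARG0', 'b')
alignment, push = Alignment(), Push()
epidata = {triple: [alignment, push]}
```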

Duplicate relations trip up reification

This is similar to #34 but it affects edge reification so it is a different issue. It appears to be the same assumption (that triples are unique) causing the bug, though. Here's an example from BioAMR that causes an error with edge reification:

# ::id pmid_1177_7939.216 ::date 2015-03-06T14:47:40 ::annotator SDL-AMR-09 ::preferred
# ::snt P Rich, proline rich region; SH3, SH3 domain.
# ::save-date Sun Mar 22, 2015 ::file pmid_1177_7939_216.txt
(a2 / and
      :op1 (n4 / name :op1 "P" :op2 "Rich"
            :ARG2-of (d / describe-01
                  :ARG1 (r2 / rich
                        :mod a
                        :domain (r3 / region)
                        :mod (a / amino-acid :name (n / name :op1 "proline")))))
      :op2 (n5 / name :op1 "SH3"
            :ARG2-of (d2 / describe-01
                  :ARG1 (p2 / protein-segment :name (n3 / name :op1 "SH3" :op2 "domain")))))

And here's a smaller example:

(a / alpha
   :mod b
   :mod (b / beta))

The error appears when reifying using the AMR model, e.g., from the commandline:

$ penman --amr --reify-edges <<< '(a / alpha :mod b :mod (b / beta))'

It appears to be trying to do a dict.pop() multiple times on the same key (the duplicate triple), resulting in a KeyError.
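The suspected failure mode is easy to reproduce outside the library: a dictionary keyed by triple has only one entry for a duplicated triple, so the second pop() fails. The sketch below illustrates that mechanism only; it makes no claim about the actual reification code, and collect_epidata is a hypothetical name:

```python
triples = [('a', ':instance', 'alpha'), ('a', ':mod', 'b'),
           ('a', ':mod', 'b'), ('b', ':instance', 'beta')]


def collect_epidata(triples, epidata, strict=True):
    """Pop the epidata for each triple in order."""
    collected = []
    for triple in triples:
        if strict:
            collected.append(epidata.pop(triple))  # KeyError on a repeat
        else:
            collected.append(epidata.pop(triple, []))  # tolerate repeats
    return collected


# Four triples but only three distinct dictionary keys
epidata = {triple: [] for triple in triples}
```

Supplying a default to pop() avoids the exception, though it does not answer which of the duplicate triples the epidata belongs to.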

Mutable graphs

There is currently no good way to modify graphs. In order to take an existing graph and change its triples, currently a new graph needs to be created. This issue is for coming up with an API. Here are some proposals:

  • Graph.extend(data) -- just like instantiation but data is appended to the existing data
    • Graph.remove(triple) for removal but only works on one triple
  • Graph + Triple (and +=)
    • -/-= would work similarly for removing triples
    • Only one triple at a time
  • Graph + Graph (and +=)
    • As above but allows changes to multiple triples
    • But requires creating a new Graph object each time

Optimizing the lexer and parser

This issue is to document potential optimizations.

Some low-hanging fruit of lexing and parsing performance has already been tackled:

  • optimize logging calls inside loops
  • avoid parsing and casting numeric types specially (#44)

On parsing the training portion of the BioAMR corpus before and after these changes (average of 3 times at each revision), the performance improves almost 20% (3.6s vs 2.9s).

Another possible optimization is to replace the Token class (a typing.NamedTuple) with regular tuples, then use index access to get at members (e.g., token[1] instead of token.text). Doing so shaves off another 0.3s, meaning the overall speedup is about 28%. While the improvement is significant, I hesitate because, in the end, it saved 0.3s on a fairly large corpus at the expense of code legibility, so it may not be worth it.

Another optimization is to use a compiled parser (e.g., pe), but then the library becomes harder to distribute.

layout.appears_inverted() incorrect in some common cases

layout.appears_inverted() is meant to be an approximation, but nevertheless there are some common cases for which it is inaccurate. Since it relies on the presence of a Push epidatum whose variable is the triple's source, it fails on reentrancies:

(a / alpha
   :ARG0 (b / beta)
   :ARG1 (g / gamma
            :ARG0 (e / epsilon)
            :ARG1-of (k / kappa)
            :ARG1-of b))

Here the ('b', ':ARG1', 'g') triple would not appear inverted because it does not Push('b'). We cannot just look to the source of the previous triple to estimate context, because it would be k, nor the previous Push variable, which would be k, then e. It seems we'd need a fast way to determine the current node context by tracking all Pushes and POPs from the top (maybe only if a Push is not found on the triple in question).
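Tracking the context from the top could be sketched with a stack. Everything here is a stand-in: the string 'POP' and ('PUSH', variable) tuples replace the real epidata objects, the epidata is written by hand to mirror the nesting of the graph above, and a triple is judged inverted when its target (and not its source) is the current node context:

```python
POP = 'POP'  # stand-in for the layout POP marker


def inverted_by_context(triples, epidata):
    """Return [(triple, appears_inverted)] for each non-instance triple."""
    stack, result = [], []
    for triple in triples:
        src, role, tgt = triple
        if not stack:
            stack.append(src)  # the first triple establishes the top
        context = stack[-1]
        if role != ':instance':
            result.append((triple, tgt == context and src != context))
        for epi in epidata.get(triple, []):
            if epi == POP:
                stack.pop()
            else:
                stack.append(epi[1])  # ('PUSH', variable)
    return result


triples = [
    ('a', ':instance', 'alpha'), ('a', ':ARG0', 'b'),
    ('b', ':instance', 'beta'), ('a', ':ARG1', 'g'),
    ('g', ':instance', 'gamma'), ('g', ':ARG0', 'e'),
    ('e', ':instance', 'epsilon'), ('k', ':ARG1', 'g'),
    ('k', ':instance', 'kappa'), ('b', ':ARG1', 'g'),
]
epidata = {
    ('a', ':ARG0', 'b'): [('PUSH', 'b')], ('b', ':instance', 'beta'): [POP],
    ('a', ':ARG1', 'g'): [('PUSH', 'g')], ('g', ':ARG0', 'e'): [('PUSH', 'e')],
    ('e', ':instance', 'epsilon'): [POP], ('k', ':ARG1', 'g'): [('PUSH', 'k')],
    ('k', ':instance', 'kappa'): [POP], ('b', ':ARG1', 'g'): [POP],
}
flags = [inverted for _, inverted in inverted_by_context(triples, epidata)]
```

This walks every triple once, so it is linear in the size of the graph, but a self-loop (source and target both equal to the context) remains ambiguous under this rule.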

Add ability to reorder relations on nodes

There are times when it is useful to reorder relations on nodes without changing the basic tree structure. For example, randomized relations can avoid annotator bias and a canonical ordering could be useful for annotators or otherwise for normalization.

This would entail the definition of functions to determine the order of tree branches (similar functions, penman.original_order, penman.alphanum_order, etc., were removed in v0.7.0). In the current version these would probably go on Model so they have access to information about inverted relations. The function that does the reordering of relations would be in penman.layout and operate on trees. It should just sort the branches on each node using the specified sort key, and leave the concept branch in-place.
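The reordering step can be sketched on a simple nested structure in which a node is (variable, branches) and a branch is (role, target), with "/" as the concept role. The reorder function is hypothetical; it sorts the non-concept branches of every node with the given key and leaves concept branches at their original positions:

```python
def reorder(node, key):
    """Return a copy of the tree with branches sorted on every node."""
    var, branches = node
    # Recurse into nested nodes first
    branches = [
        (role, reorder(target, key) if isinstance(target, tuple) else target)
        for role, target in branches
    ]
    movable = iter(sorted((b for b in branches if b[0] != '/'), key=key))
    return (var, [b if b[0] == '/' else next(movable) for b in branches])


tree = ('a', [('/', 'alpha'),
              (':ARG1', ('g', [('/', 'gamma')])),
              (':ARG0', ('b', [('/', 'beta')]))])
by_role = reorder(tree, key=lambda branch: branch[0])
```

A randomized order would use the same function with a key such as lambda branch: rng.random() for some random.Random instance rng.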

original text:

For experiments it would be convenient to have a relation-sorting function that randomizes the order.

Unnormalized triples

Currently all graphs are stored with normalized triples, and the layout engine determines the serialization direction. This is nice, but some groups (namely AMR, who hand-write their graphs) would probably want the directions as-encoded to be retained as much as possible in a round-trip, and also to be able to do operations (e.g. search, manipulation) based on the encoded edge directions. The latter case is also appealing to me, since I could do operations like "sever reentrancies" or "prune subtree" and it would make sense.

A basic solution is to provide a codec that doesn't normalize edge directions (and test that the resulting graph objects are still valid).

Encoding with node type as top causes problems

A graph with a single node and only a node type specified can be (erroneously) encoded with the node type as the top, but not if the node has any other relations:

>>> g = penman.decode('(a / abc)')
>>> print(penman.encode(g, top='abc'))
(abc / abc)
>>> g = penman.decode('(a / abc :ARG1 (b / bcd))')
>>> print(penman.encode(g, top='abc'))
[...]
penman.EncodeError: Invalid graph; possibly disconnected.
>>> print(penman.encode(g, top='bcd'))
[...]
penman.EncodeError: Invalid graph; possibly disconnected.
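
One possible guard, sketched here on plain triples rather than on the library's own classes, is to reject any top that is not a node variable before encoding. It assumes normalized triples, so that every source is a variable; `check_top` is a hypothetical name.

```python
def check_top(triples, top):
    """Return top if it is a node variable, otherwise raise ValueError."""
    # Assumption: in normalized triples, every source is a node variable.
    variables = {source for source, _, _ in triples}
    if top not in variables:
        raise ValueError(f'top {top!r} is not a node variable')
    return top


triples = [('a', ':instance', 'abc')]
check_top(triples, 'a')  # fine: 'a' is a variable

try:
    check_top(triples, 'abc')  # 'abc' is only a node type
    rejected = False
except ValueError:
    rejected = True
```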

General-purpose graph rewriting

A useful feature would be to modify a corpus of graphs according to some patterns. (see also #2)

  • prune leaf node or edge (e.g., prune udef_q, prune :polarity)
  • prune branch (if any reentrancies are self-contained)
  • treeify (sever reentrancies on serialization)
  • prune branch and sever reentrancies
  • replace node type or relation (sub def_q existential_q, sub :-EQ :MOD-EQ)
  • insert relation and node (insert (a / *_n_* !:RSTR-H-of *) (a :RSTR-H-of (b / udef_q)))
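
Two of the simpler operations above can be sketched on plain triples. This is only an illustration under the assumption that graphs are lists of `(source, role, target)` tuples; `prune_role` and `substitute_type` are hypothetical names, and the pattern syntax from the list is not implemented.

```python
def substitute_type(triples, old, new):
    """Replace the node type old with new on every node that has it."""
    return [
        (s, r, new) if r == ':instance' and t == old else (s, r, t)
        for s, r, t in triples
    ]


def prune_role(triples, role):
    """Drop attribute triples with the given role.

    Edges that point to another node are kept, because removing them
    could disconnect the graph.
    """
    variables = {s for s, r, _ in triples if r == ':instance'}
    return [
        (s, r, t) for s, r, t in triples
        if not (r == role and t not in variables)
    ]


# (w / want-01 :polarity - :ARG0 (b / boy))
triples = [
    ('w', ':instance', 'want-01'),
    ('w', ':polarity', '-'),
    ('w', ':ARG0', 'b'),
    ('b', ':instance', 'boy'),
]
pruned = prune_role(triples, ':polarity')
renamed = substitute_type(triples, 'boy', 'girl')
```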

Feature request: structured representation with layout information

Great to see the new work. I use the Graph.triples() output from the previous version of penman as an intermediate format when converting AMR strings to, e.g., networkx or igraph. This relies on the inverted attribute of Triples. With that attribute now gone, and no public interface to the epigraph logic, it's not clear how to make use of the new version.

MRS-based graphs can't fully round-trip

Calling the module as a script, I see that an AMR will roundtrip nicely, but an MRS-based one is losing a lot of information. This might be because the graph traversal is failing to make the right links.

$ echo "(b / bark :ARG1 (d / dog))" | python penman.py
(b / bark
   :ARG1 (d / dog))
$ echo "(10000 / bark :ARG1 (10001 / dog))" | python penman.py
(10000 / bark
       :ARG1 10001)

Notice how in the second call, the dog nodetype is lost. Perhaps the node ids are problematic (int vs str comparison or something)?
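
The suspected pitfall can be shown in isolation. This is not a diagnosis of the library's code, just a sketch of how a lookup keyed by strings would miss an id that has been cast to an integer; the `nodes` dictionary is made up for illustration.

```python
# Node types keyed by the id strings as they appear in the input.
nodes = {'10000': 'bark', '10001': 'dog'}

# The same id after it has been cast to an integer somewhere along the way.
target = 10001

found_as_int = target in nodes       # the int 10001 is not the str '10001'
found_as_str = str(target) in nodes  # comparing like with like succeeds
```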

Allow comma and caret in symbols

The grammar (now documented at https://penman.readthedocs.io/en/latest/notation.html) disallows commas in symbols (specifically the NameChar production), which is to avoid problems when parsing triples (e.g., instance(a, alphabet)), where the comma is used to delimit the source and target of the triple.

The problem is that the comma does turn up in symbols (at least from some parsers, such as CAMR), such as numbers:

:quant (x26 / 900,000)))

Furthermore, the issue with parsing triples is not an issue in practice because the source of the triple should always be a variable which is even less likely to contain commas. Since there is no other reason to forbid commas, it should be safe to allow them if triple parsing always splits on the first comma and ignores the rest.
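
A sketch of that idea for a single triple, using only the standard library. The function `parse_triple` is hypothetical and does not reflect how the library's parser is written; it just shows that splitting on the first comma leaves later commas in the target.

```python
import re

# role(source, target), where the target may itself contain commas.
TRIPLE = re.compile(r'^\s*(?P<role>[^\s(]+)\((?P<args>.*)\)\s*$')


def parse_triple(s):
    """Parse one triple, splitting the arguments on the first comma only."""
    m = TRIPLE.match(s)
    if m is None:
        raise ValueError(f'not a triple: {s!r}')
    source, sep, target = m.group('args').partition(',')
    if not sep:
        raise ValueError(f'missing target in: {s!r}')
    return (source.strip(), m.group('role'), target.strip())


plain = parse_triple('instance(a, alphabet)')
with_comma = parse_triple('quant(x26, 900,000)')
```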

Identifying cycles

Issue #11 is concerned with finding re-entrancies, but cycles may be harder to detect. If we can efficiently detect cycles, we could (a) flag graphs as being cyclical; (b) identify the relations involved in creating a cycle; or even (c) reject cyclical graphs from being created. A simple cycle looks like this:

(a / A :rel a)

Issue #11 has the following example (left); the same graph with a different top (right) looks like a cycle:

(a / A                    (b / B
   :rel (b / B               :rel (c / C
           :rel (c / C))             :rel-of (a / A
   :rel c)                                      :rel b)))

However this is not a cycle, since the relation from c to a (:rel-of) is inverted.
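
Because the triples are normalized, cycle detection can ignore how the graph was serialized. The following is a minimal sketch using a depth-first search over plain `(source, role, target)` tuples; `find_cycle` is a hypothetical name, and the recursion assumes graphs small enough not to hit Python's recursion limit.

```python
def find_cycle(triples):
    """Return a list of variables forming a directed cycle, or None."""
    variables = {s for s, _, _ in triples}
    graph = {v: [] for v in variables}
    for s, r, t in triples:
        if r != ':instance' and t in variables:
            graph[s].append(t)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = dict.fromkeys(graph, WHITE)
    path = []

    def visit(v):
        color[v] = GRAY
        path.append(v)
        for nxt in graph[v]:
            if color[nxt] == GRAY:  # back edge: nxt is on the current path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[v] = BLACK
        return None

    for v in sorted(graph):
        if color[v] == WHITE:
            found = visit(v)
            if found:
                return found
    return None


# (a / A :rel a)
simple = [('a', ':instance', 'A'), ('a', ':rel', 'a')]

# Both serializations of the example above normalize to these triples.
reentrant = [
    ('a', ':instance', 'A'), ('b', ':instance', 'B'), ('c', ':instance', 'C'),
    ('a', ':rel', 'b'), ('b', ':rel', 'c'), ('a', ':rel', 'c'),
]
```

The returned path also identifies the relations involved, which covers option (b) above.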

Don't cast or specially handle numeric datatypes

Specially describing and casting integer and float values is an extension of this library that isn't described by AMR or DMRS. It may be useful to some, but since it is not standard I think it should be removed. Casting is then the responsibility of applications using the library. If this becomes such a problem that the feature is requested again, it can be added optionally (maybe as a parameter to PENMANCodec).

As a bonus, not specially parsing or casting these gives a speed boost during decoding.
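
If casting moves to the application, it could look roughly like the sketch below. The patterns are an assumption about what an application might count as an integer or a float; they are deliberately stricter than Python's `int()` and `float()`, which would also accept strings such as "inf" or "nan".

```python
import re

INT = re.compile(r'[-+]?\d+')
FLOAT = re.compile(r'[-+]?(\d+\.\d*|\.\d+)([eE][-+]?\d+)?')


def cast(value):
    """Cast a decoded string to int or float when it looks like one."""
    if not isinstance(value, str):
        return value
    if INT.fullmatch(value):
        return int(value)
    if FLOAT.fullmatch(value):
        return float(value)
    return value  # symbols such as '-' or 'dog' are left alone


triples = [('x', ':quant', '3'), ('x', ':value', '1.5'), ('x', ':polarity', '-')]
cast_triples = [(s, r, cast(t)) for s, r, t in triples]
```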

Graph search

Allow for searching the corpus for matching graphs.

  • by node type search "book"
  • by relation search ":polarity"
  • with wildcards search "have-*", search "ARG*"
  • by subgraph search "(s / see-* :ARG1 (p / penguin))" (node ids don't need to be as in the file, but can be repeated for reentrancies)
  • negated matches: search "! book", search "! :polarity", search "(s / see-* :ARG1 (_ / !penguin))"
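
The non-subgraph cases can be sketched with the standard library's `fnmatch` module. This assumes graphs are lists of `(source, role, target)` tuples and that relation patterns are written with a leading colon; `matches` is a hypothetical function, and subgraph patterns are not handled.

```python
from fnmatch import fnmatchcase


def matches(triples, pattern):
    """Test a graph against a node-type or relation pattern.

    A leading '!' negates the match; '*' is a wildcard.
    """
    negate = pattern.startswith('!')
    if negate:
        pattern = pattern[1:].strip()
    if pattern.startswith(':'):
        found = any(fnmatchcase(r, pattern)
                    for _, r, _ in triples if r != ':instance')
    else:
        found = any(fnmatchcase(t, pattern)
                    for _, r, t in triples if r == ':instance')
    return found != negate


# (h / have-03 :polarity - :ARG1 (b / book))
graph = [
    ('h', ':instance', 'have-03'),
    ('h', ':polarity', '-'),
    ('h', ':ARG1', 'b'),
    ('b', ':instance', 'book'),
]
corpus = [graph]
hits = [g for g in corpus if matches(g, 'have-*')]
```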

Add AMR concept inventory

This depends on #57, although #57 will be informed by this issue.

The AMR concept inventory is, as I understand, a fork of Propbank frames. It is packaged with the LDC release which I don't have access to and also raises questions of licensing if I were to include (a derived form of) those in this repository.

There's a plain-text version, that I think is equivalent, at https://amr.isi.edu/doc/propbank-amr-frames-arg-descr.txt

One thing I could do here is make a reader or converter for the frame files so someone with access to the LDC release could create the appropriate files for Penman to use. I just need to confirm the format of those files, then.

Failed to parse ISI Alignment

I'm using the branch v0.7.0 to load AMRs with ISI alignments, and I get a fatal error on this AMR:

(a / and~e.29 :op1 (b / believe-01~e.17 :ARG0 (p2 / person :ARG0-of (h2 / have-org-role-91~e.12 :ARG1 (c2 / company :wiki - :name (n / name :op1 "IM"~e.15) :mod (c3 / country :wiki "United_States" :name (n2 / name :op1 "United"~e.14 :op2 "States"~e.14))) :ARG2 (o / officer~e.11 :mod~e.11 (e3 / executive~e.11) :mod~e.11 (c7 / chief~e.11)))) :ARG1~e.18 (c8 / capable-01~e.25 :ARG1 (p / person~e.22 :ARG1-of~e.22 (e / employ-01~e.22 :ARG0 c2) :mod (e2 / each~e.19)) :ARG2~e.26 (i / innovate-01~e.27 :ARG0 p))) :op2 (f / formulate-01~e.30 :ARG0 (o2 / officer~e.11 :mod~e.11 (e4 / executive~e.11) :mod~e.11 (c / chief~e.11)) :ARG1 (c4 / countermeasure~e.32 :mod (s / strategy~e.31) :purpose~e.33 (i2 / innovate-01~e.34 :prep-in~e.35 (i3 / industry~e.37)))) :time (a3 / after~e.0 :op1 (i4 / invent-01~e.3 :ARG0 (c5 / company~e.16 :ARG0-of (c6 / compete-02~e.2 :ARG1~e.1 c2~e.1)) :ARG1 (m / machine~e.8 :ARG0-of (w / wash-01~e.7) :ARG1-of (l / load-01~e.6 :mod (f2 / front~e.5))))))
File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 809, in decode
    return codec.decode(s)
  File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 172, in decode
    span, data = self._decode_penman_node(s)
  File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 405, in _decode_penman_node
    span, data = self._decode_penman_node(s, pos=pos)
  File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 405, in _decode_penman_node
    span, data = self._decode_penman_node(s, pos=pos)
  File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 405, in _decode_penman_node
    span, data = self._decode_penman_node(s, pos=pos)
  [Previous line repeated 2 more times]
  File "C:\Users\austi\Desktop\amr-ccg-parsing\penman\penman.py", line 427, in _decode_penman_node
    raise DecodeError('Expected ":" or "/"', string=s, pos=pos)
penman.penman.DecodeError: Expected ":" or "/" at position 149

It seems to fail after :op1 "IM"~

Do you have any idea what's wrong?

Allow reification transformations

Reification is a constrained graph transformation that replaces an edge with a binary node. For example, A --> B becomes A <-- C --> B. This allows cycles to be broken and relations to become focused (e.g., as root nodes). Not every edge can be reified, and the new node and new edges differ from one reified relation to another.

Hopefully reification can be done on a Graph without knowledge of the codec.
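
A codec-independent version could be sketched on plain triples as follows. The `REIFICATIONS` table is an assumption standing in for whatever a model would define; its single entry follows the AMR convention of reifying :location as be-located-at-91, and `reify` is a hypothetical function.

```python
# role: (concept of the new node, role to the old source, role to the old target)
REIFICATIONS = {
    ':location': ('be-located-at-91', ':ARG1', ':ARG2'),
}


def reify(triples, edge, new_var):
    """Replace edge with a new node that points at both of its ends."""
    source, role, target = edge
    if role not in REIFICATIONS:
        raise ValueError(f'{role} cannot be reified')
    if any(new_var == s for s, _, _ in triples):
        raise ValueError(f'variable {new_var!r} is already in use')
    concept, source_role, target_role = REIFICATIONS[role]
    result = []
    for triple in triples:
        if triple == edge:
            result.extend([
                (new_var, ':instance', concept),
                (new_var, source_role, source),
                (new_var, target_role, target),
            ])
        else:
            result.append(triple)
    return result


# (k / know-01 :location (c / city))
triples = [
    ('k', ':instance', 'know-01'),
    ('k', ':location', 'c'),
    ('c', ':instance', 'city'),
]
reified = reify(triples, ('k', ':location', 'c'), 'b')
```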

De-inverting :instance-of relations makes weird graphs

The algorithm for serializing should maybe just forbid the de-inverting of :instance-of, because it makes strange graphs, such as this:

 # ::id 10010540
 # ::snt Some writers restrict the definition of ''algorithm'' to procedures that eventually finish.
 (10002 / _restrict_v_to
   :ARG1-NEQ (10001 / _writer_n_of
     :RSTR-H-of (10000 / _some_q))
   :ARG2-NEQ (10004 / _definition_n_of
     :ARG1-NEQ (10006 / _algorithm_n_1)
     :RSTR-H-of (10003 / _the_q))
   :ARG3-NEQ (10008 / _procedure_n_1
     :ARG1-EQ-of (10010 / _finish_v_1
       :ARG1-EQ-of (10009 / _eventual_a_1))
     :RSTR-H-of (10007 / (udef_q :instance (10005 :RSTR-H 10006)))))

Note the last line, how the udef_q quantifiers for two things appear in one spot. This can get worse:

# ::id 10010400
 # ::snt Typically, when an algorithm is associated with processing information, data are read from an input source or device, written to an output sink or device, and/or stored for     further processing.
 (10000 / _typical_a_1
   :ARG1-H (10001 / _when_x_subord
     :ARG1-H (10025 / implicit_conj
       :R-INDEX-NEQ (10036 / _and_c
         :R-INDEX-NEQ (10037 / _store_v_cause
           :ARG2-NEQ (10013 / _data_n_1
             :RSTR-H-of (10012 / (udef_q :instance (10033 :RSTR-H (10035 / _device_n_1))
                 :instance (10031 :RSTR-H (10032 / _sink_n_1))
                 :instance (10022 :RSTR-H (10024 / _device_n_1))
                 :instance (10020 :RSTR-H (10021 / _source_n_of))
                 :instance (10008 :RSTR-H (10010 / nominalization
[snip]

Layout engine may introduce some diffs

One goal of the project is to model the PENMAN structure as graphs but to retain enough information from their serialization so the tree structure doesn't change on reserialization. Here is an example from the Bio-AMR corpus where a diff is introduced:

(e / enhance-01~e.11 :li 2~e.0 
      :ARG1 (a3 / and~e.6 
            :op1 (n6 / nucleic-acid 
                  :name (n / name :op1 "mRNA"~e.5) 
                  :ARG0-of (e2 / encode-01 
                        :ARG1 p)) 
            :op2 (p / protein~e.7 
                  :name (n2 / name :op1 "serpinE2"~e.4))) 
      :manner~e.10 (m / marked~e.10) 
      :mod (a2 / also~e.9) 
      :location~e.12 (c / cell~e.15 
            :ARG0-of (e3 / exhibit-01~e.16 
                  :ARG1 (m2 / mutate-01~e.17 
                        :ARG1 (a4 / and~e.22 
                              :op1 (g / gene 
                                    :name (n4 / name :op1 "KRAS"~e.20)) 
                              :op2 (g2 / gene 
                                    :name (n5 / name :op1 "BRAF"~e.24))))) 
            :mod (h / human~e.13) 
            :mod (d / disease 
                  :name (n3 / name :op1 "CRC"~e.14))) 
      :manner~e.2 (i / interesting~e.2))

Here is what is produced (with whitespace differences normalized):

(e / enhance-01~e.11 :li 2~e.0
      :ARG1 (a3 / and~e.6
            :op1 (n6 / nucleic-acid
                  :name (n / name :op1 "mRNA"~e.5)
                  :ARG0-of (e2 / encode-01
                        :ARG1 (p / protein~e.7
                              :name (n2 / name :op1 "serpinE2"~e.4))))
            :op2 p)
      :manner~e.10 (m / marked~e.10)
      :mod (a2 / also~e.9)
      :location~e.12 (c / cell~e.15
            :ARG0-of (e3 / exhibit-01~e.16
                  :ARG1 (m2 / mutate-01~e.17
                        :ARG1 (a4 / and~e.22
                              :op1 (g / gene
                                    :name (n4 / name :op1 "KRAS"~e.20))
                              :op2 (g2 / gene
                                    :name (n5 / name :op1 "BRAF"~e.24)))))
            :mod (h / human~e.13)
            :mod (d / disease
                  :name (n3 / name :op1 "CRC"~e.14)))
      :manner~e.2 (i / interesting~e.2))

Note how the reentrancy of the p node is reversed. The layout engine prefers edges to appear in their original orientation, but in this case both edges into p already do, so that preference does not decide where p is expanded. I could possibly prefer reentrancies to start from deeper nestings, or maybe I could embed some info about reentrancy in the triple (as I do with inversion).

Decoding error

I am running penman on output from an AMR parser. It outputs ":null_edge (x20 / 876-9))))))"
and penman cannot decode it because of the relation "null_edge".
If I replace the underscore with a hyphen, the problem goes away.

Is "null_edge" a violation of the PENMAN notation, or is the decoder failing to recognize this pattern?

Many thanks
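
The workaround described above can be applied as a preprocessing step. The sketch below is an assumption about how one might do it with the standard library: it rewrites underscores only in tokens that look like roles (a colon at the start of a whitespace-delimited token) and leaves quoted strings untouched. `fix_roles` is a hypothetical helper, and note that it changes the role names in the data.

```python
import re

STRING = re.compile(r'("(?:[^"\\]|\\.)*")')  # a double-quoted string
ROLE = re.compile(r'(?<!\S):[^\s()~]+')      # a colon that starts a token


def fix_roles(s):
    """Replace underscores with hyphens in roles, outside quoted strings."""
    parts = STRING.split(s)
    # Splitting on a capturing group puts the quoted strings at odd indices.
    return ''.join(
        part if i % 2
        else ROLE.sub(lambda m: m.group().replace('_', '-'), part)
        for i, part in enumerate(parts)
    )


fixed = fix_roles('(a / b :null_edge (x20 / 876-9))')
```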
