korap / krill

A Corpus Data Retrieval Index using Lucene for Look-Ups

License: BSD 2-Clause "Simplified" License

Java 100.00%

krill's Issues

Only add a single warning for timeout

When the server times out, a cascade of warnings with code 682 is currently issued. Only a single warning should be issued.

"warnings": [
    [
      682,
      "Response time exceeded"
    ],
    [
      682,
      "Response time exceeded"
    ],
    ...
]
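One way to issue the warning only once is to deduplicate by warning code when collecting notifications. The following is a minimal sketch, not Krill's actual notification code; the class `WarningCollector` and its method `addOnce` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical collector, not part of Krill's API.
class WarningCollector {
    // Codes that have already been recorded.
    private final Set<Integer> seenCodes = new HashSet<>();
    // Recorded warnings as {code, message} pairs.
    private final List<String[]> warnings = new ArrayList<>();

    // Adds the warning unless one with the same code was recorded before.
    boolean addOnce(int code, String message) {
        if (!seenCodes.add(code)) {
            return false;
        }
        warnings.add(new String[] { Integer.toString(code), message });
        return true;
    }

    int size() {
        return warnings.size();
    }
}
```

With this approach, every index segment that hits the timeout can still report it, but the response only carries one entry for code 682.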

Position queries are broken

@jbingel found a bug (both Poliqarp):

overlaps(<s>, [orth=Mann])

and

overlaps([orth=Mann], <s>)

return different results.

A similar issue arises with the following queries:

[cnx/syn=@NH & corenlp/ne_dewac_175m_600=I-ORG]

contains([cnx/syn=@NH], [corenlp/ne_dewac_175m_600=I-ORG])

I guess the problem is overly optimistic forwarding. It should be fixable by switching to bitvector comparisons.

Sometimes matches are also not shown, e.g. for the following query against the current version of Wikipedia, one of three matches is skipped:

http://10.0.10.14:6666/?q=contains%28%3Cs%3E%2C+ich%29&ql=poliqarp&cutoff=1
(Internal instance)

This bug was transferred from Trac Issue #127

Empty token with optional neighboring tokens

A query containing empty tokens with optional neighbors, for example

der? [] Mann?

should be deserialized as a disjunction with all possible subqueries as its operands, i.e.

der []
[] Mann
der [] Mann
[]

and a warning should be added for the last subquery, since an empty query is not allowed and cannot be searched.
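The expansion into disjunction operands can be sketched as an enumeration over all combinations of kept and dropped optional tokens. This is only an illustration on plain strings, not the KoralQuery deserializer; `OptionalExpansion`, `expand` and `isEmptyOnly` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper working on token strings instead of real span queries.
class OptionalExpansion {
    // Returns every sequence obtained by keeping or dropping each optional token.
    static List<String> expand(String[] tokens, boolean[] optional) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (int i = 0; i < tokens.length; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                // Variant that keeps the token.
                next.add(prefix.isEmpty() ? tokens[i] : prefix + " " + tokens[i]);
                // Variant that drops the token, only allowed if it is optional.
                if (optional[i]) {
                    next.add(prefix);
                }
            }
            results = next;
        }
        return results;
    }

    // True if the subquery consists of empty tokens only and therefore needs a warning.
    static boolean isEmptyOnly(String subquery) {
        for (String part : subquery.trim().split("\\s+")) {
            if (!part.equals("[]")) {
                return false;
            }
        }
        return true;
    }
}
```

For `der? [] Mann?` this yields the four operands listed above; the operand for which `isEmptyOnly` returns true is the one that should trigger the warning instead of being searched.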

Missing matches

Using https://github.com/KorAP/Kustvakt/tree/master/sample-index, Krill does not return any matches for:

{
    "query": {
        "@type": "koral:token",
        "wrap": {
            "@type": "koral:term",
            "match": "match:eq",
            "key": "der",
            "layer": "orth",
            "foundry": "opennlp"
        }
    },
    "collection": {
        "@type": "koral:docGroup",
        "operation": "operation:and",
        "operands": [
            {
                "@type": "koral:doc",
                "match": "match:eq",
                "type": "type:regex",
                "value": "CC-BY.*",
                "key": "availability"
            },
            {
                "@type": "koral:doc",
                "match": "match:ne",
                "value": "GOE/AGI/00000",
                "key": "textSigle"
            }
        ]
    }
}

This happens because GOE/AGI/00000 is not part of the index. Replacing it with GOE/AGI/04846 will return some matches.

Metadata fields for text do not support space delimiters

When searching a metadata field like "author" as part of a virtual corpus, it is currently not possible to query it as a string containing a space delimiter, e.g. "author eq 'Theodor Fontane'" does not work. Maybe text fields with "eq" should be treated like sequences of tokens delimited by spaces.

Fallback for element contexts

Currently, Krill refers to and relies on some annotations from the base, namely s to set boundaries for annotation retrieval and p for snippet retrieval (in case this is wanted). But this fails in case a match is not in a sentence or a boundary, which can happen with the new data from Wikipedia.

For match retrieval there should exist a fallback mechanism to use token contexts whenever sentence- or paragraph-contexts fail.
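A fallback along these lines could look as follows. This is a minimal sketch under the assumption that contexts are expressed as token positions; `ContextFallback`, `contextFor` and the `tokenWindow` parameter are hypothetical and do not exist in Krill.

```java
// Hypothetical helper; positions are token offsets within one text.
class ContextFallback {
    // Returns {start, end} of the context for a match.
    // elementBounds is null when no enclosing sentence or paragraph was found.
    static int[] contextFor(int matchStart, int matchEnd, int[] elementBounds,
                            int tokenWindow, int textLength) {
        boolean enclosing = elementBounds != null
                && elementBounds[0] <= matchStart
                && elementBounds[1] >= matchEnd;
        if (enclosing) {
            // Element context is usable.
            return new int[] { elementBounds[0], elementBounds[1] };
        }
        // Fallback: fixed token window around the match, clipped to the text.
        int start = Math.max(0, matchStart - tokenWindow);
        int end = Math.min(textLength, matchEnd + tokenWindow);
        return new int[] { start, end };
    }
}
```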

Reload opener after foreign commit

When the index is started by a second process, the server is not aware of the new data and won't reopen the reader in KrillIndex. At the moment that means Kustvakt needs to be restarted to reopen the reader. A better approach would be to have a command that forces Krill to reopen the reader. This could be issued on commit by the second process or manually.

SubSpans need to be sorted, when starting with a negative offset

SubSpans need to be sorted, in case the offset is negative, as the ending of embedded spans is not in a defined order.
There is a failing test at TestSubSpanIndex#testCaseNegativeSubSpan in the sort-subspans-bug branch.
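The ordering problem can be illustrated on an in-memory list. The sketch below assumes that a negative offset counts back from the end of the embedded span; real spans are streamed, so an actual fix would need a bounded candidate buffer rather than a full sort. `NegativeSubSpans` and `subSpans` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; spans are {start, end} pairs.
class NegativeSubSpans {
    // Assumption: negativeOffset (< 0) is relative to the END of each embedded span.
    static List<int[]> subSpans(List<int[]> embedded, int negativeOffset, int length) {
        List<int[]> result = new ArrayList<>();
        for (int[] span : embedded) {
            int start = Math.max(span[0], span[1] + negativeOffset);
            int end = Math.min(span[1], start + length);
            result.add(new int[] { start, end });
        }
        // Embedded spans arrive ordered by start, but their ends are unordered,
        // so the derived sub spans have to be sorted again.
        result.sort(Comparator.<int[]>comparingInt(s -> s[0])
                .thenComparingInt(s -> s[1]));
        return result;
    }
}
```

For embedded spans (0,10), (1,5), (2,8) with offset -2 and length 2, the unsorted sub spans would be (8,10), (3,5), (6,8), which shows why sorting is required.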

Improve JavaDoc

The JavaDoc cannot currently be generated. This needs to be fixed. All public methods should be documented in some form.

SpanRelationQuery with Regex

In Annis, it should be possible to search for relations using regex, see KorAP/Koral#45

KwiC results for empty matches

Krill supports indexing empty elements. This can be useful for certain queries, but it also comes with some problems. For example, in case the user searches for empty elements, the match for such an element will be empty. There needs to be a way to mark the empty element in result sets and preferably visualize the position in clients.

Currently trying to generate a snippet with an empty match throws an exception.

Support Position Frames in SpanWithinQueries

The new Frame-Proposal (internally discussed in the GDoc "Position Frames") introduces various modifications on how SpanWithinQueries are expected to work.
It not only introduces frame vectors for different combinations of overlap-configurations, it also has a proposal for support of partial matches based on classes.

This task is pretty complex and not expected to be realized soon. This would introduce full support for Cosmas' #OV.

As KoralQuery already supports the new frame model, not supporting position frames is considered a bug.

This requires support for #8.

Metadata fields require type information when returned

Field information that is part of, e.g., a match requires attached type information like "date" or "readonly" to be reused in Kalamar correctly. For example, readonly fields can't be part of a virtual corpus constraint and therefore shouldn't be dynamically clickable for that purpose.

Port frame-based position query from Krawfish to Krill

Currently position queries in Krill are very limited, while the position query mechanism in Krawfish is quite elaborate (and is completely based on frames). The original code is here and here.

Wrong matches for negative SpanExpansions

In some cases wrong matches occur in queries with negative expansions, like [orth=a][orth!=b][orth=c]. This is likely a skipTo bug in the extension query.
There is a failing test available in Gerrit.
The bug was reported by Verginica Mititelu, member of the DRuKoLA project.

Distance Dominance in Constituency Queries

Constituency queries are currently realized using SpanWithin queries. But there is further information indexed in the payloads of spans, declaring the position of a span term in a tree (depth).

<>:xip/c:NPA$<i>15<i>28<i>6<b>1

The byte information represents the position of NPA in the tree. 1 means, it’s below the root (which has no depth information at that position or is 0).
To make queries possible that take distances into account, these payloads have to be read and compared.
This has to be done within a SpanWithinQuery (and not outside) as otherwise the payloads would be mangled.

This is necessary for AnnisQL. This issue was copied from Trac issue #146 .
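Reading and comparing the depth information could be sketched as below. The layout is an assumption derived only from the example above (three integers followed by one byte); the meaning of the three integers is not stated here, so they are skipped without interpretation. Big-endian encoding, `SpanPayload`, `depthOf` and `withinDistance` are likewise assumptions for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical payload helper, assuming <i><i><i><b> with big-endian integers.
class SpanPayload {
    // Builds a payload of three ints and a trailing depth byte (for demonstration).
    static byte[] encode(int first, int second, int third, byte depth) {
        return ByteBuffer.allocate(13)
                .putInt(first).putInt(second).putInt(third)
                .put(depth)
                .array();
    }

    // Skips the three integers and returns the trailing depth byte.
    static byte depthOf(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        buffer.getInt();
        buffer.getInt();
        buffer.getInt();
        return buffer.get();
    }

    // True if the descendant is between min and max levels below the ancestor.
    static boolean withinDistance(int ancestorDepth, int descendantDepth,
                                  int min, int max) {
        int distance = descendantDepth - ancestorDepth;
        return distance >= min && distance <= max;
    }
}
```

A dominance check with a distance constraint would then compare the depth bytes of the two operands while both payloads are still available inside the within query.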

SpanFocusQuery() needs to sort spans

Currently spans returned by focus() are not guaranteed to be sorted, but a lot of queries rely on that property. There are two cases that are problematic and force the use of forward-looking span caches:

a) The span <a>...{1:...}...</a> is modified using focus(1:...). A following span <a>{1:...}...</a> has the class 1 in a preceding position.

b) The span <a>...{1:...}...{2:...}...</a> is modified using focus(2:...), but still contains a class 1. Now, if the span is again modified using focus(1:...) a preceding match span may be <a>...{2:...}...{1:...}...</a>, so the second class 1 may precede the first class 1.

There is now a failing test TestFocusIndex#testFocusSorting in the focus-sort branch.

P.S. This issue refers to #169 in our old Trac ticket system.

Multiple distances in Cosmas Query

Multiple distance query needs a synchronization among the distances so that they refer to the same elements/terms.

Search fails without Glimpse

As reported by Verginica, sometimes queries return no results (without any reported error) when Glimpse is deactivated.

Provided example queries for the DRuKoLa instance:

[drukola/m!="msd:ts.*"][drukola/m="msd:nc..o.*"][drukola/m="msd:nc..rn.*"]
[drukola/m="msd:nc..o.*"][drukola/m!="msd:s.*"][drukola/m="msd:nc..rn.*"]

I think this involves two errors:

Problem with error reporting in Kalamar
Problem with snippet generation in Krill

Support arbitrary metadata

Although this is already prepared in Krill, currently the metadata items are still focussed on I5 metadata. To make it more flexible, input documents should introduce metadata not as key value pairs, but as fields, like

      {
        "@type":"koral:field",
        "key":"license",
        "value":"closed",
        "type":"type:string"
      },
      {
        "@type":"koral:field",
        "key":"textLength",
        "value": 8,
        "type":"type:integer"
      }

(This format is based on Krawfish and the new return value of the textInfo endpoint.) That way, arbitrary fields can be ingested. Supported types should be type:string, type:text, type:date and type:integer. Keywords should be represented as a list of strings in the value field. In the future, text should also be pretokenized. Dates may also contain (multiple) (open) date ranges in the future. Integers may also contain multiple items in the future.

Relations without key

Currently, relations without an annotation value, such as
node ->malt/d node
are not allowed in Krill.

They can nevertheless be interpreted as relations with "any annotation value", and are thus identical to using a regex:
node ->malt/d[func=/.*/] node

SpanOr-Bug in spanWithin

There is a bug in spanWithin that seems to be close to TestWithinIndex.queryJSONpoly2.

Breaking test in TestWithinIndex.queryJSONcomplexSpanOrTerm (span-or-bug branch).

Stacktrace:

          at org.apache.lucene.search.spans.SpanOrQuery$1.doc(SpanOrQuery.java:234)
          at de.ids_mannheim.korap.query.spans.WithinSpans.toSameDoc(WithinSpans.java:423)
          at de.ids_mannheim.korap.query.spans.WithinSpans.next(WithinSpans.java:375)
          at de.ids_mannheim.korap.KrillIndex.search(KrillIndex.java:1293)
          at de.ids_mannheim.korap.Krill.apply(Krill.java:304)

Date ranges and date additions in I5:<creatDate>

There is a need to specify and process date ranges or date additions in the I5:<creatDate> field, e.g. according to the specification of the BOT-ent field (predecessor of <creatDate>) in the BOT Manual by Doris al-Wadi (p. 22f.), the use of a date range in at least one corpus in DeReKo, and the use in the Sprache 1933-1945 project. This concerns <creatDate> of collections, but also of single texts.

Examples: "1893.06./07.", "1809-", "1960-1974"
In COSMAS II the strategy is (roughly) to process only the first element of a range for search and also for display and throw the rest away.
As of November 2016, the feature request should be more thoroughly specified by MK, ND, PH and HL, i.e. allowed formats and semantics.
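Until the formats are specified, the COSMAS II strategy of keeping only the first element could be approximated as follows. This is a rough sketch based solely on the three examples above; the separators `-` and `/` and the stripping of trailing dots are assumptions, and `CreatDateRange` is a hypothetical name.

```java
// Hypothetical helper; the accepted formats are not yet specified.
class CreatDateRange {
    // Returns the part of the value before the first range or addition separator.
    static String firstElement(String value) {
        String trimmed = value.trim();
        int cut = trimmed.length();
        for (int i = 0; i < trimmed.length(); i++) {
            char ch = trimmed.charAt(i);
            // Assumption: '-' separates ranges, '/' separates additions.
            if (ch == '-' || ch == '/') {
                cut = i;
                break;
            }
        }
        String first = trimmed.substring(0, cut);
        // Drop trailing dots left over from formats like "1893.06./07.".
        while (first.endsWith(".")) {
            first = first.substring(0, first.length() - 1);
        }
        return first;
    }
}
```

Under these assumptions the examples reduce to "1893.06", "1809" and "1960"; proper range support would need both boundaries to be indexed.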

Multiple distances in a disjunction query

There is an index-out-of-bounds bug in case of multiple distances (the example query by @bansp was "({1:Sonne []* Erde} | {2: Erde []* Sonne})" in Poliqarp+). The stack trace is:

java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:400)
    at java.util.ArrayList.get(ArrayList.java:413)
    at de.ids_mannheim.korap.query.spans.TokenDistanceSpans.cost(TokenDistanceSpans.java:128)
    at de.ids_mannheim.korap.query.spans.ClassSpans.cost(ClassSpans.java:186)
    at org.apache.lucene.search.spans.SpanOrQuery$1.initSpanQueue(SpanOrQuery.java:181)
    at org.apache.lucene.search.spans.SpanOrQuery$1.next(SpanOrQuery.java:193)
    at de.ids_mannheim.korap.KrillIndex.search(KrillIndex.java:1325)
    at de.ids_mannheim.korap.Krill.apply(Krill.java:310)
    at de.ids_mannheim.korap.Krill.apply(Krill.java:279)
    at de.ids_mannheim.korap.search.TestKrill.searchJSONtokenDistanceSpanBug(TestKrill.java:824)

There is a failing test case in the tokendistancespan-bug branch
committed as acf46c9 .

Distance with text span seems to be broken

For Cosmas queries like "Katze und Hund" internally we serialize to a distance query with a text distance of 0 (i.e. both words occur in the same text). There is now a failing test for this scenario in the distance-with-t-bug remote branch at TestElementDistanceIndex#testCase6 .

MultipleSpanDistanceQuery with Wildcards Bug

Franck Bodmer reported that the following queries do not yield any results:

meine* /+w1:2,s0 &Erfahrung
meine? /+w1:2,s0 &Erfahrung
meine+ /+w1:2,s0 &Erfahrung

Wrong behaviour for startswith profile in WithinSpans

StartsWith() seems to fail in certain configurations. For example, the query startsWith(<base/s=s>, der/i [corenlp/p=ADJA] Mann) in the Goethe-Korpus with the VC title contains Wanderjahre yields no results (/?q=startsWith%28%3Cbase%2Fs%3Ds%3E%2C+der%2Fi+[corenlp%2Fp%3DADJA]+Mann%29&collection-name=&collection=title+~+%22Wanderjahre%22&ql=poliqarp).

There is now a failing test TestWithinIndex#indexExample8 in the startswith-bug branch.

SpanWithinQuery: sentence within a token

The query

contains([sgbr/p=ADJA ],<base/s=s>)

returns incorrect results. It should probably throw an error.

Unordered element distance spans

The results of unordered element distance spans may become unsorted. The current strategy is to create a list of matches for the smallest subspan. When both subspans are at the same occurrence, the second subspan is chosen and the first subspan is processed later. However, there is a possibility that a match for the first subspan has a smaller position than those of the second subspan.

Use ICU for casefolding

Currently our approach to case insensitivity is rather naive and should be improved by storing and retrieving casefolded variants of terms (using ICU4j). The storing part needs to be implemented in Korap::XML::Krill.

Support Poliqarp's Alignment Operator

Poliqarp supports free context alignment of matches by defining anchors.
So all matches will be aligned in the KWIC view at that point.

[pos=adj & case=nom]+^[pos=subst & case=nom]+

This involves modification of the frontend, the API, the match serialization and the deserializer.
The serializer has this information now in the meta object of KoralQuery.
Once Krill supports this feature, the issue will be reopened in Kalamar.

This is a copy of Trac issue #148 .

Adding indexing date in Krill index

Indexing date must be added as a field in Krill index during indexing. It is needed to define a persistent VC that must not change over time. Persistent VC is constrained with an indexing date, e.g.

indexingDate leq 2018-08-22

so that only documents indexed up to the specified indexing date are considered.
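The intended semantics of the constraint can be sketched with plain date comparison. This is only an illustration assuming ISO dates as in the example; `IndexingDateFilter` and `leq` are hypothetical names and do not reflect how Krill evaluates collection constraints.

```java
import java.time.LocalDate;

// Hypothetical check for "indexingDate leq <limit>".
class IndexingDateFilter {
    // True if the document was indexed on or before the limit date.
    static boolean leq(String indexingDate, String limit) {
        LocalDate indexed = LocalDate.parse(indexingDate);
        LocalDate bound = LocalDate.parse(limit);
        return !indexed.isAfter(bound);
    }
}
```

Documents added after 2018-08-22 would fail this check, so the persistent VC keeps returning the same set of documents.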

Support fields in text info endpoint

The fields parameter is currently not supported by Krill or Kustvakt for text information (regarding metadata).
It would be helpful to support fields, to specify

a) Which fields should be retrieved,
b) Which fields should be listed (even with empty values, as requested by DRuKoLA),
c) In which order.

The parameter should be a comma separated list. The default parameter is @all, requesting all fields that are stored for the specific text.
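The three requirements could be covered by a selection step like the one below. This is a minimal sketch on a plain map; `FieldSelection` and `select` are hypothetical names, and representing missing fields by an empty string is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; "stored" holds the fields stored for one text.
class FieldSelection {
    // Returns the requested fields in the requested order.
    // A missing or "@all" parameter returns all stored fields.
    static Map<String, String> select(Map<String, String> stored, String param) {
        if (param == null || param.trim().isEmpty() || param.trim().equals("@all")) {
            return new LinkedHashMap<>(stored);
        }
        Map<String, String> selected = new LinkedHashMap<>();
        for (String name : param.split(",")) {
            String key = name.trim();
            if (key.isEmpty()) {
                continue;
            }
            // Requested but unset fields are listed with an empty value.
            selected.put(key, stored.getOrDefault(key, ""));
        }
        return selected;
    }
}
```

A `LinkedHashMap` is used so that the order of the comma-separated list carries over to the response.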

Optional negated classes do not work properly

Currently a query like der {[pos!=ADJA]*} Mann throws an error.

Reported by @margaretha

Sequence deserialization does not work with multiple different extensions

In case, negation and empty tokens are prefixed to an anchor like with this:

[][orth!=der][]Baum

one extension seems to be lost in deserialization.

Sentence length restriction

Due to the licenses of some resources, only a small amount of data/text can be shown as matches/ query results. It might happen that some sentences contain numerous words and thus a restriction on the sentence length is needed. The restriction should be sent by Kustvakt and handled by Krill while doing search.

Edit: the restriction should only be sent for nested spanqueries, since it won't reduce Krill workloads otherwise.

Support assets / attachments to meta data fields

To link texts to separate resources (external or internal), it is required to store arbitrary data in meta data fields including arbitrary descriptions in a conventional way, to give clients retrieving the data some hints about how to handle the data.
To make it easy and to introduce some advanced formalisms of KoralQuery, I would argue the field should be indexed and returned as a koral:field with type:attachement, as exemplified in Krawfish.
The key can be an arbitrary name field, like "Wikipedia-Link", the value needs to follow the data uri scheme.

{
  "@type":"koral:field",
  "key":"Wikipedia",
  "value":"data:application/x.korap-link,https://de.wikipedia.org/wiki/Beispiel",
  "type":"type:attachement"
},
{
  "@type":"koral:field",
  "key":"Reference",
  "value":"data:text/plain,This is a reference string",
  "type":"type:attachement"
}

This makes it possible to store arbitrary data like text or images (base64 encoded) as well as references in the value. It may seem unintuitive to use the data URI scheme for hyperlinks, as they are already scheme-prefixed. The reason we use data for all attachments and the new application/x.korap-link media type is that data URIs support parameters that can be used to describe the resource and give the KoralQuery consumer a hint how to handle the resource. E.g. for links that will be embedded in Kalamar, the parameter can give a link title to view instead of the URI. Or if Kalamar has a plugin to embed images, a parameter can give hints about title tags of the image.
data:image/png;title=Palimpsest;base64,...
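A consumer of such fields would need to split the value into media type, parameters and data. The sketch below handles the simple `data:<mediatype>[;key=value]*[;base64],<data>` shape only (no quoted parameter values, no percent-decoding); `DataUri` is a hypothetical class, not part of Krill or Kalamar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the attachment values described above.
class DataUri {
    final String mediaType;
    final Map<String, String> params;
    final boolean base64;
    final String data;

    private DataUri(String mediaType, Map<String, String> params,
                    boolean base64, String data) {
        this.mediaType = mediaType;
        this.params = params;
        this.base64 = base64;
        this.data = data;
    }

    static DataUri parse(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data URI");
        }
        // The first comma separates the header from the data.
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing data part");
        }
        String[] header = uri.substring(5, comma).split(";");
        Map<String, String> params = new LinkedHashMap<>();
        boolean base64 = false;
        for (int i = 1; i < header.length; i++) {
            int eq = header[i].indexOf('=');
            if (header[i].equals("base64")) {
                base64 = true;
            } else if (eq > 0) {
                params.put(header[i].substring(0, eq), header[i].substring(eq + 1));
            }
        }
        return new DataUri(header[0], params, base64, uri.substring(comma + 1));
    }
}
```

For a link with a title parameter, `params.get("title")` would give the text to display while `data` holds the target URI.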

Support named cached virtual corpora

To adapt virtual corpora from COSMAS II it's necessary to have a "close to the index" mechanism to store virtual corpora, that can be retrieved by a single ID. In COSMAS II these virtual corpora are stored as vectors of text siglen. To support this in KorAP, we may need:

Support for vectors in KoralQuery
Support for named virtual corpora in KoralQuery (probably by storing the KQ at every Krill node)
Fast retrieval of document IDs based on text-siglen to quickly create a document vector
A simple mechanism to cache and retrieve this document vector by its ID
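The last two points can be sketched with a sigle lookup table and a bit vector per named corpus. This is a simplified in-memory illustration that ignores index segments and persistence; `NamedVcCache` and its methods are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache for named virtual corpora as document vectors.
class NamedVcCache {
    // Maps a text sigle to its document ID in the index.
    private final Map<String, Integer> docIdBySigle = new HashMap<>();
    // Maps a virtual corpus ID to its document vector.
    private final Map<String, BitSet> vectors = new HashMap<>();

    void register(String textSigle, int docId) {
        docIdBySigle.put(textSigle, docId);
    }

    // Builds and caches the document vector; sigles missing from the index are skipped.
    BitSet store(String vcId, List<String> textSigles) {
        BitSet bits = new BitSet();
        for (String sigle : textSigles) {
            Integer docId = docIdBySigle.get(sigle);
            if (docId != null) {
                bits.set(docId);
            }
        }
        vectors.put(vcId, bits);
        return bits;
    }

    // Returns the cached vector or null if the ID is unknown.
    BitSet get(String vcId) {
        return vectors.get(vcId);
    }
}
```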

Indirect relations

Krill does not support indirect relation queries, such as
node ->malt/d * node

This requires a SpanRepetitionQuery wrapping each possible SpanRelationQuery.

Useless warnings on datelike strings

In collections, sometimes warnings are raised by the assumption that a value is a date. This is sometimes completely confusing (s. below) and sometimes wrong, as document identifiers may look like dates.
Failing example test:

    @Test
    public void testNotDate() throws JsonProcessingException, IOException {
        collection = "author=\"firefighter1974\"";
        qs.setQuery(query, ql);
        qs.setCollection(collection);
        res = mapper.readTree(qs.toJSON());
        assertEquals("koral:doc", res.at("/collection/@type").asText());
        assertEquals("author", res.at("/collection/key").asText());
        assertEquals("firefighter1974", res.at("/collection/value").asText());
        assertEquals("match:eq", res.at("/collection/match").asText());
        assertEquals(res.at("/errors/0/0").asText(), "");
        assertEquals(res.at("/warnings/0/0").asText(), "");
    }

(I corrupted my local repo, that's why I'm reporting it this way.)

SpanFocusQuery() needs to respect classes outside the sub span

Follow up to #7 :

Currently focus queries require that classes are part of sub spans. Unfortunately there are cases where this is not true.

Example: The span <a>...{1:...}...{2:...}...</a> is modified using focus(2:...), but still contains a class 1. Now, if the span is again modified using focus(1:...) a preceding match span may be <a>...{2:...}...{1:...}...</a>, so the second class 1 may precede the first class 1.

Error retrieving matches at the end of texts with spanExpansions

There is now a failing test for span expansions exceeding text boundaries documented as TestSpanExpansionIndex.testExpansionQueryBug3 in branch span-expansion-match-creation-bug .

There is also a related serialization bug in TestKrillQuery.KorapSequenceWithEmptyRepetitionQuery.

Empty token in SpanWithinQuery

Krill should handle an empty query as an operand of SpanWithinQuery and throw an appropriate error. Currently both

contains([],<base/s=s>)
contains(<base/s=s>,[])

throw the error: You can't queryize an empty query

Implement SpanClassRefOpQuery

SpanClassRefOp is a SpanQuery that modifies classes gathered by the query. A classRefOp does not skip matches and it does not alter spans of the match. There are several problems this approach should solve:

Support Class Inversion (for Cosmas' #NHIT)

(Trac-Ticket #151)
Class inversion is necessary for matches between defined classes. The serializer will match the surrounding classes and calculates the inversion. An example for Poliqarp (although realizable differently) is:

focus(inv(1): {1:[orth=das]}[]{1:[orth=Haus]})

This matches the token between "das" and "Haus" indirectly.

The above description is just a showcase. In fact this is necessary for the #NHIT operator in Cosmas-II.

SpanClassResetQuery - Cleaning up the payload

(Trac-Ticket #149)
Some queries use internal classes to deal with certain parts of matches. These classes shouldn't be inherited by embedding queries, so this class information has to be removed from the payloads.
It's part of the partial matching stuff we have to support for #OV in Cosmas-II.

Support Splitting and class operators for Submatching

(Trac-Ticket #147)
Although it's not necessary for any language in particular, it's a good addition to submatching to support the class reference operators already serialized from Poliqarp+.
This allows fine-grained focusing on class intersections.

focus(1|2: ...{1:...}...{2:...}...)

In addition, the split operator has been part of the Poliqarp+ proposal nearly from the beginning but never made it into practice. It supports the splitting of matches based on their classes.

As this is needed for support of position frames ( #9 ), not supporting SpanClassRefOpQuery is considered a bug.

Implement nextStart() method in spans

Some spans would benefit from a nextStart() method, advising the span to forward to the next start position. This is especially useful for positional queries, like

contains(<s>, der []*)

where the expansion spans may go up to an end position of start + 100, although the positional span knows at a certain point advancing the end position can't be satisfying anymore. Another example are element spans in the same configuration.
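The idea can be illustrated with a toy span iterator over a sorted array. This is not the Lucene `Spans` API; `ListSpans` is a hypothetical class whose `nextStart()` skips all remaining candidates that share the current start position.

```java
// Hypothetical iterator over {start, end} pairs sorted by start, then end.
class ListSpans {
    private final int[][] spans;
    private int pos = -1;

    ListSpans(int[][] spans) {
        this.spans = spans;
    }

    // Advances to the next span; false if exhausted.
    boolean next() {
        pos++;
        return pos < spans.length;
    }

    // Advances to the first span with a different start position.
    boolean nextStart() {
        if (pos < 0) {
            return next();
        }
        if (pos >= spans.length) {
            return false;
        }
        int currentStart = spans[pos][0];
        while (next()) {
            if (spans[pos][0] != currentStart) {
                return true;
            }
        }
        return false;
    }

    int start() {
        return spans[pos][0];
    }

    int end() {
        return spans[pos][1];
    }
}
```

A positional span that knows longer expansions can no longer satisfy the frame would call `nextStart()` instead of stepping through every remaining end position.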

Fix autocaching of named VCs

Currently referenced named VCs that are not yet cached will be autocached during VC retrieval. This, however, can fail, when the reference is nested and an external constraint forces documents or index segments to be skipped during retrieval.
This means: autocached VCs may return fewer resulting documents than possible.

A failing test is TestKrillCollectionIndex#testNamedVCsAfterCorpusWithMissingDocs .

Fix ordering of tree items in constituency trees in match API

Sometimes, when a match is sentence expanded and the second sentence starts at the beginning of a paragraph, the paragraph is ranked below the sentence, although its tree depth is closer to the root.

Example text:

<p>
  <s>a</s>
  <s>b</s>
</p>
<p>
  <s>x</s>
  <s>y</s>
</p>

In case the match is bx and the match is sentence expanded, the tree in Kalamar will look like this:

 /^\
|   s
|   |
s   p
|   |
b   x

A better solution would be to ignore the paragraph, in case it ends after the match, resulting in

 /^\
s   s
|   |
b   x

Update to Lucene 5

Lucene 5 introduces Roaring Bitmaps for bit sets. That sounds beneficial for a lot of things happening in Krill, like the creation of complex virtual collections, calculations of cardinality, regex rewrites (probably) etc. Changes will require a switch to RoaringDocIdSet and SparseFixedBitSet (instead of FixedBitSet), I guess.

Serialization differences of koral:distance and cosmas:distance

There are now differences between koral:distance and cosmas:distance that need to be deserialized correctly. This is already fixed (somehow) but two failing tests still need investigation.

This is reported in the korap-distance-deserialization-bug branch.

Missing matches with spanExpansion

Given the following text

Eine besonders künstlerische Aufnahme entstand mit dem Blick durch den Berliner Rathausturm, auf den man den Zeppelin zufliegen sieht. Bei der vom Boden aufgenommenen Landung lässt sich die Mannschaft an Bord, die das Landungsseil in Richtung Boden abwirft, gut erkennen.

The following query

Rathausturm auf {1:[base=die]} []{0,10} {2:[base=die]}

should return the matches

1. Rathausturm, auf den man den

and

2. Rathausturm, auf den man den Zeppelin zufliegen sieht. Bei der

But currently only the first match is returned.
This is probably an issue with the SpanExpansion function.
(Reported by Bryan Jurish)