korap / krill
A Corpus Data Retrieval Index using Lucene for Look-Ups
License: BSD 2-Clause "Simplified" License
When the server times out, a cascade of warnings with code 682 is currently issued. There should be only one.
"warnings": [
[
682,
"Response time exceeded"
],
[
682,
"Response time exceeded"
],
...
]
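One way to collapse the repeated warning is to deduplicate by code and message before the response is serialized. The following is a minimal sketch, assuming warnings are held as (code, message) lists mirroring the JSON above; `WarningDedup` is a hypothetical helper and not part of Krill.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Krill: collapses repeated warnings.
class WarningDedup {
    // Each warning is a list like [682, "Response time exceeded"].
    // Keeps the first occurrence of each distinct warning, in original order.
    static List<List<Object>> dedup(List<List<Object>> warnings) {
        return new ArrayList<>(new LinkedHashSet<>(warnings));
    }
}
```

A cheaper alternative would be to raise the timeout warning only once at the place where the timeout is detected, so no cleanup is needed afterwards.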
@jbingel found a bug (both Poliqarp):
overlaps(<s>, [orth=Mann])
and
overlaps([orth=Mann], <s>)
return different results.
A similar issue arises with the following queries:
[cnx/syn=@NH & corenlp/ne_dewac_175m_600=I-ORG]
contains([cnx/syn=@NH], [corenlp/ne_dewac_175m_600=I-ORG])
I guess the problem is too optimistic forwarding. Should be fixable with the switch to bitvector comparisons.
And sometimes matches are not shown - e.g. in the following query in the current version of wikipedia one of three matches is skipped:
http://10.0.10.14:6666/?q=contains%28%3Cs%3E%2C+ich%29&ql=poliqarp&cutoff=1
(Internal instance)
This bug was transferred from Trac Issue #127
A query containing empty tokens with optional neighbors, for example
der? [] Mann?
should be deserialized as a disjunction with all possible subqueries as its operands, i.e.
der []
[] Mann
der [] Mann
[]
and a warning should be added for the last subquery since empty query is not allowed / cannot be searched.
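The expansion described above can be sketched as a small recursive enumeration. This is only an illustration on token strings, assuming a hypothetical `OptionalExpansion` helper; the real deserializer works on KoralQuery objects, not strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Krill's deserializer: expands optional tokens
// in a sequence into the list of all concrete subqueries.
class OptionalExpansion {
    // tokens[i] is the token text; optional[i] marks it as "token?".
    static List<String> expand(String[] tokens, boolean[] optional) {
        List<String> results = new ArrayList<>();
        expand(tokens, optional, 0, new ArrayList<>(), results);
        return results;
    }

    private static void expand(String[] tokens, boolean[] optional, int i,
                               List<String> current, List<String> results) {
        if (i == tokens.length) {
            results.add(String.join(" ", current));
            return;
        }
        // Variant that keeps token i.
        current.add(tokens[i]);
        expand(tokens, optional, i + 1, current, results);
        current.remove(current.size() - 1);
        // Variant that drops token i, only allowed if it is optional.
        if (optional[i]) {
            expand(tokens, optional, i + 1, current, results);
        }
    }

    // A subquery consisting only of empty tokens cannot be searched
    // and should trigger a warning instead of becoming an operand.
    static boolean isEmptyOnly(String subquery) {
        for (String token : subquery.split(" ")) {
            if (!token.equals("[]")) {
                return false;
            }
        }
        return true;
    }
}
```

For `der? [] Mann?` this yields the four subqueries listed above, and `isEmptyOnly` flags the last one (`[]`) as the operand that needs a warning.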
Using https://github.com/KorAP/Kustvakt/tree/master/sample-index, Krill does not return any matches for:
{ "query": { "@type": "koral:token", "wrap": { "@type": "koral:term", "match": "match:eq", "key": "der", "layer": "orth", "foundry": "opennlp" } }, "collection": { "@type": "koral:docGroup", "operation": "operation:and", "operands": [ { "@type": "koral:doc", "match": "match:eq", "type": "type:regex", "value": "CC-BY.*", "key": "availability" }, { "@type": "koral:doc", "match": "match:ne", "value": "GOE/AGI/00000", "key": "textSigle" } ] } }
This happens because GOE/AGI/00000 is not part of the index. Replacing it with GOE/AGI/04846 will return some matches.
When searching a meta data field like "author" as part of a virtual corpus, currently it's not possible to query this as a string with a space delimiter, e.g. "author eq 'Theodor Fontane'" does not work. Maybe text fields with "eq" should be treated like sequences of tokens delimited by spaces.
Currently, Krill refers to and relies on some annotations from the base, namely s to set boundaries for annotation retrieval and p for snippet retrieval (in case this is wanted). But this fails in case a match is not in a sentence or a boundary, which can happen with the new data from Wikipedia.
For match retrieval there should exist a fallback mechanism to use token contexts whenever sentence- or paragraph-contexts fail.
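A fallback of this kind could look roughly like the following. It is a sketch under assumptions: sentence spans are given as `[start, end)` token ranges, and the class name, the `window` parameter, and the method signature are hypothetical, not Krill's API.

```java
// Hypothetical sketch: pick context boundaries for a match, falling back
// to a token window when no enclosing sentence span exists.
class ContextFallback {
    // sentences: [start, end) token ranges of sentence spans.
    // Returns {contextStart, contextEnd} as a [start, end) token range.
    static int[] context(int[][] sentences, int matchStart, int matchEnd,
                         int window, int textLength) {
        for (int[] s : sentences) {
            // Use the sentence only if it fully encloses the match.
            if (s[0] <= matchStart && matchEnd <= s[1]) {
                return new int[]{s[0], s[1]};
            }
        }
        // Fallback: a fixed number of tokens left and right of the match,
        // clipped to the text boundaries.
        int start = Math.max(0, matchStart - window);
        int end = Math.min(textLength, matchEnd + window);
        return new int[]{start, end};
    }
}
```

The same lookup could be tried with paragraph spans first and token contexts last.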
When the index is started by a second process, the server is not aware of the new data and won't reopen the reader in KrillIndex. At the moment that means Kustvakt needs to be restarted to reopen the reader. A better approach would be to have a command that forces Krill to reopen the reader. This could be issued on commit by the second process or manually.
SubSpans need to be sorted, in case the offset is negative, as the ending of embedded spans is not in a defined order.
There is a failing test at TestSubSpanIndex#testCaseNegativeSubSpan in the sort-subspans-bug branch.
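To illustrate why sorting is needed, here is a minimal sketch. It assumes that a negative offset is counted from the end of the embedded span; that semantics, the class name, and the method names are assumptions for illustration, not taken from Krill's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of sub-spans with a negative offset.
class NegativeSubSpans {
    // Assumed semantics: a negative offset is counted from the span end.
    static int[] subSpan(int start, int end, int offset, int length) {
        int subStart = Math.max(start, end + offset);
        int subEnd = Math.min(end, subStart + length);
        return new int[]{subStart, subEnd};
    }

    // The input spans are sorted by start, but their ends are not ordered,
    // so the derived sub-spans have to be sorted again.
    static List<int[]> sortedSubSpans(List<int[]> spans, int offset, int length) {
        List<int[]> result = new ArrayList<>();
        for (int[] s : spans) {
            result.add(subSpan(s[0], s[1], offset, length));
        }
        result.sort(Comparator.<int[]>comparingInt(s -> s[0])
                              .thenComparingInt(s -> s[1]));
        return result;
    }
}
```

With the spans (0,10), (1,4), (2,6) and offset -2, the raw sub-spans start at 8, 2 and 4, which is why the final sort is required. In a streaming span implementation the sort would have to be done per document on a buffered candidate list.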
The JavaDoc can currently not be generated. This needs to be fixed. All public methods should somehow be documented.
In Annis, it should be possible to search for relations using regex, see KorAP/Koral#45
Krill supports indexing empty elements. This can be useful for certain queries, but it also comes with some problems. For example, in case the user searches for empty elements, the match for such an element will be empty. There needs to be a way to mark the empty element in result sets and preferably visualize the position in clients.
Currently trying to generate a snippet with an empty match throws an exception.
The new Frame-Proposal (internally discussed in the GDoc "Position Frames") introduces various modifications on how SpanWithinQueries are expected to work.
It not only introduces frame vectors for different combinations of overlap-configurations, it also has a proposal for support of partial matches based on classes.
This task is pretty complex and not expected to be realized soon. This would introduce full support for Cosmas' #OV.
As KoralQuery already supports the new frame model, not supporting position frames is considered a bug.
This requires support of #8.
Field information that is part, e.g., of a match, requires attached type information like "date" or "readonly" to be reused in Kalamar correctly. For example, readonly fields can't be part of a virtual corpus constraint and therefore shouldn't be dynamically clickable for that purpose.
In some cases wrong matches occur in queries with negative expansions, like [orth=a][orth!=b][orth=c]. This is likely a skipTo bug in the extension query.
There is a failing test available in Gerrit.
The bug was reported by Verginica Mititelu, member of the DRuKoLA project.
Constituency queries are currently realized using SpanWithin queries. But there is further information indexed in the payloads of spans, declaring the position of a span term in a tree (depth).
<>:xip/c:NPA$<i>15<i>28<i>6<b>1
The byte information represents the position of NPA in the tree. 1 means it's below the root (which has no depth information at that position or is 0).
To make queries possible that take distances into account, these payloads have to be read and compared.
This has to be done within a SpanWithinQuery (and not outside) as otherwise the payloads would be mangled.
This is necessary for AnnisQL. This issue was copied from Trac issue #146 .
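As a rough illustration of reading and comparing the depth, the sketch below parses the textual notation shown above (the `<b>` field after the `$`). It only handles that string notation; the binary payload layout inside the Lucene index is not shown here, and `SpanPayloadDepth` is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts the tree depth from the textual term
// notation used in this issue, e.g. "<>:xip/c:NPA$<i>15<i>28<i>6<b>1".
class SpanPayloadDepth {
    private static final Pattern BYTE_FIELD = Pattern.compile("<b>(\\d+)");

    // Returns the depth from the <b> field, or 0 if no depth is given
    // (as for the root).
    static int depth(String term) {
        int dollar = term.indexOf('$');
        if (dollar < 0) {
            return 0;
        }
        Matcher m = BYTE_FIELD.matcher(term.substring(dollar + 1));
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    // Number of tree levels between two span terms. Whether one span
    // actually contains the other has to be checked separately.
    static int depthDistance(String ancestor, String descendant) {
        return depth(descendant) - depth(ancestor);
    }
}
```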
Currently spans returned by focus() are not guaranteed to be sorted, but a lot of queries rely on that property. There are two cases that are problematic and force the use of forward-looking span caches:
a) The span <a>...{1:...}...</a> is modified using focus(1:...). A following span <a>{1:...}...</a> has the class 1 in a preceding position.
b) The span <a>...{1:...}...{2:...}...</a> is modified using focus(2:...), but still contains a class 1. Now, if the span is again modified using focus(1:...), a preceding match span may be <a>...{2:...}...{1:...}...</a>, so the second class 1 may precede the first class 1.
There is now a failing test TestFocusIndex#testFocusSorting in the focus-sort branch.
P.S. This issue refers to #169 in our old Trac ticket system.
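Case a) can be reproduced with plain position arithmetic. The sketch below assumes each match is given as `{spanStart, spanEnd, classStart, classEnd}`; `FocusOrder` is a hypothetical helper and does not model Krill's span classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shows that focusing on a class can break the
// start-position order of otherwise sorted matches.
class FocusOrder {
    // Each match: {spanStart, spanEnd, classStart, classEnd}.
    // Focusing replaces the match span by the class span.
    static List<int[]> focus(List<int[]> matches) {
        List<int[]> focused = new ArrayList<>();
        for (int[] m : matches) {
            focused.add(new int[]{m[2], m[3]});
        }
        return focused;
    }

    // True if the spans are ordered by ascending start position.
    static boolean isSortedByStart(List<int[]> spans) {
        for (int i = 1; i < spans.size(); i++) {
            if (spans.get(i)[0] < spans.get(i - 1)[0]) {
                return false;
            }
        }
        return true;
    }
}
```

With the sorted matches (0,10) carrying class 1 at (7,8) and (1,5) carrying class 1 at (1,2), the focused spans come out as (7,8), (1,2), which is no longer sorted.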
Multiple distance query needs a synchronization among the distances so that they refer to the same elements/terms.
As reported by Verginica, sometimes queries return no results (without any reported error) when Glimpse is deactivated.
Provided example queries for the DRuKoLa instance:
[drukola/m!="msd:ts.*"][drukola/m="msd:nc..o.*"][drukola/m="msd:nc..rn.*"]
[drukola/m="msd:nc..o.*"][drukola/m!="msd:s.*"][drukola/m="msd:nc..rn.*"]
I think this involves two errors:
Although this is already prepared in Krill, currently the metadata items are still focussed on I5 metadata. To make it more flexible, input documents should introduce metadata not as key value pairs, but as fields, like
{
"@type":"koral:field",
"key":"license",
"value":"closed",
"type":"type:string"
},
{
"@type":"koral:field",
"key":"textLength",
"value": 8,
"type":"type:integer"
}
(This format is based on Krawfish and the new return value of the textInfo endpoint.) That way, arbitrary fields can be ingested. Supported types should be type:string, type:text, type:date and type:integer. Keywords should be represented as a list of strings in the value field. In the future, text should also be pretokenized. Dates may also contain (multiple) (open) date ranges in the future. Integers may also contain multiple items in the future.
Currently, relations without an annotation value such as
node ->malt/d node
are not allowed in Krill.
They can nevertheless be interpreted as relations with "any annotation value", thus identical to using a regex:
node ->malt/d[func=/.*/] node
There is a bug in spanWithin that seems to be close to TestWithinIndex.queryJSONpoly2.
Breaking test in TestWithinIndex.queryJSONcomplexSpanOrTerm (span-or-bug branch).
Stacktrace:
at org.apache.lucene.search.spans.SpanOrQuery$1.doc(SpanOrQuery.java:234)
at de.ids_mannheim.korap.query.spans.WithinSpans.toSameDoc(WithinSpans.java:423)
at de.ids_mannheim.korap.query.spans.WithinSpans.next(WithinSpans.java:375)
at de.ids_mannheim.korap.KrillIndex.search(KrillIndex.java:1293)
at de.ids_mannheim.korap.Krill.apply(Krill.java:304)
There is a need to specify and process date ranges or date additions in the I5 <creatDate> field, e.g. according to the specification of the BOT-ent field (predecessor of <creatDate>) in the BOT Manual by Doris al-Wadi (p. 22f.), the use of a date range in at least one corpus in DeReKo, and the use in the Sprache 1933-1945 project. This concerns <creatDate> of collections, but also of single texts.
There is an Index out of bounds bug in case of multiple distances (The example query by @bansp was "({1:Sonne []* Erde} | {2: Erde []* Sonne})" in Poliqarp+). The stack trace is:
java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:400)
at java.util.ArrayList.get(ArrayList.java:413)
at de.ids_mannheim.korap.query.spans.TokenDistanceSpans.cost(TokenDistanceSpans.java:128)
at de.ids_mannheim.korap.query.spans.ClassSpans.cost(ClassSpans.java:186)
at org.apache.lucene.search.spans.SpanOrQuery$1.initSpanQueue(SpanOrQuery.java:181)
at org.apache.lucene.search.spans.SpanOrQuery$1.next(SpanOrQuery.java:193)
at de.ids_mannheim.korap.KrillIndex.search(KrillIndex.java:1325)
at de.ids_mannheim.korap.Krill.apply(Krill.java:310)
at de.ids_mannheim.korap.Krill.apply(Krill.java:279)
at de.ids_mannheim.korap.search.TestKrill.searchJSONtokenDistanceSpanBug(TestKrill.java:824)
There is a failing test case in the tokendistancespan-bug branch, committed as acf46c9.
For Cosmas queries like "Katze und Hund" internally we serialize to a distance query with a text distance of 0 (i.e. both words occur in the same text). There is now a failing test for this scenario in the distance-with-t-bug remote branch at TestElementDistanceIndex#testCase6 .
Franck Bodmer reported that the following queries do not yield any results:
startsWith() seems to fail in certain configurations. For example, the query startsWith(<base/s=s>, der/i [corenlp/p=ADJA] Mann) in the Goethe-Korpus with the VC title contains Wanderjahre yields no results (/?q=startsWith%28%3Cbase%2Fs%3Ds%3E%2C+der%2Fi+[corenlp%2Fp%3DADJA]+Mann%29&collection-name=&collection=title+~+%22Wanderjahre%22&ql=poliqarp).
There is now a failing test TestWithinIndex#indexExample8 in the startswith-bug branch.
The query
contains([sgbr/p=ADJA ],<base/s=s>)
returns incorrect results. It should probably throw an error.
The results of unordered element distance spans may become unsorted. The current strategy is to create a list of matches for the smallest subspan. When both subspans are of the same occurrence, the second subspan is chosen and the first subspan is processed later. However, there is a possibility that a match for the first subspan has a smaller position than those of the second subspan.
Currently our approach to case insensitivity is rather naive and should be improved by storing and retrieving casefolded variants of terms (using ICU4j). The storing part needs to be implemented in Korap::XML::Krill.
Poliqarp supports free context alignment of matches by defining anchors.
So all matches will be aligned in the KWIC view at that point.
[pos=adj & case=nom]+^[pos=subst & case=nom]+
This involves modification of the frontend, the API, the match serialization and the deserializer.
The serializer has this information now in the meta object of KoralQuery.
Once Krill supports this feature, the issue will be reopened in Kalamar.
This is a copy of Trac issue #148 .
An indexing date must be added as a field in the Krill index during indexing. It is needed to define a persistent VC that must not change over time. A persistent VC is constrained with an indexing date, e.g.
indexingDate leq 2018-08-22
so that only documents indexed up to the specified indexing date are considered.
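The constraint itself is a plain date comparison. The following sketch assumes ISO dates (`YYYY-MM-DD`) as in the example; `IndexingDateFilter` is a hypothetical helper and not the way Krill evaluates collections.

```java
import java.time.LocalDate;

// Hypothetical sketch of the "indexingDate leq <date>" constraint.
class IndexingDateFilter {
    // True if the document was indexed on or before the given limit.
    static boolean leq(String indexingDate, String limit) {
        return !LocalDate.parse(indexingDate).isAfter(LocalDate.parse(limit));
    }
}
```

Documents added after the limit fail the check, so the persistent VC keeps resolving to the same set of documents even when the index grows.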
The fields parameter is currently not supported by Krill or Kustvakt for text information (regarding metadata).
It would be helpful to support fields, to specify
a) Which fields should be retrieved,
b) Which fields should be listed (even with empty values, as requested by DRuKoLA),
c) In which order.
The parameter should be a comma separated list. The default parameter is @all, requesting all fields that are stored for the specific text.
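A possible handling of the parameter is sketched below, covering selection, listing of empty values, and ordering. The class and method names are hypothetical, and metadata is simplified to a string map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a "fields" parameter for text information.
class FieldSelection {
    // fieldsParam: comma separated field names, or "@all" (the default).
    // stored: the metadata fields stored for the text.
    static Map<String, String> select(String fieldsParam, Map<String, String> stored) {
        if (fieldsParam == null || fieldsParam.trim().isEmpty()
                || fieldsParam.trim().equals("@all")) {
            return new LinkedHashMap<>(stored);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : fieldsParam.split(",")) {
            String key = name.trim();
            if (key.isEmpty()) {
                continue;
            }
            // Requested fields are listed in request order; fields without
            // a stored value are listed with an empty value.
            result.put(key, stored.getOrDefault(key, ""));
        }
        return result;
    }
}
```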
Currently a query like der {[pos!=ADJA]*} Mann throws an error.
Reported by @margaretha
In case, negation and empty tokens are prefixed to an anchor like with this:
[][orth!=der][]Baum
one extension seems to be lost in deserialization.
Due to the licenses of some resources, only a small amount of data/text can be shown as matches/ query results. It might happen that some sentences contain numerous words and thus a restriction on the sentence length is needed. The restriction should be sent by Kustvakt and handled by Krill while doing search.
Edit: the restriction should only be sent for nested spanqueries, since it won't reduce Krill workloads otherwise.
To link texts to separate resources (external or internal), it is required to store arbitrary data in meta data fields including arbitrary descriptions in a conventional way, to give clients retrieving the data some hints about how to handle the data.
To make it easy and to introduce some advanced formalisms of KoralQuery, I would argue the field should be indexed and returned as a koral:field with type:attachement, as exemplified in Krawfish.
The key can be an arbitrary name field, like "Wikipedia-Link", the value needs to follow the data uri scheme.
{
"@type":"koral:field",
"key":"Wikipedia",
"value":"data:application/x.korap-link,https://de.wikipedia.org/wiki/Beispiel",
"type":"type:attachement"
},
{
"@type":"koral:field",
"key":"Reference",
"value":"data:text/plain,This is a reference string",
"type":"type:attachement"
}
This makes it possible to store arbitrary data like text or images (base64 encoded) as well as references in the value. It may seem unintuitive to use the data uri scheme for hyperlinks, as they already are scheme prefixed. The reason we use data for all attachments and the new application/x.korap-link media type is that data uris support parameters that can be used to describe the resource and give the KoralQuery consumer a hint how to handle the resource. E.g. for links that will be embedded in Kalamar, the parameter can give a link title to view instead of the URI. Or if Kalamar has a plugin to embed images, a parameter can give hints about title tags of the image.
data:image/png;title=Palimpsest;base64,...
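A consumer could split such a value into media type, parameters, and data as sketched below. This follows the general `data:[<mediatype>][;param=value][;base64],<data>` shape; the `DataUri` class is a hypothetical helper, and percent-decoding of the data part is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parses the data URIs used for attachment values.
class DataUri {
    final String mediaType;
    final Map<String, String> parameters;
    final boolean base64;
    final String data;

    private DataUri(String mediaType, Map<String, String> parameters,
                    boolean base64, String data) {
        this.mediaType = mediaType;
        this.parameters = parameters;
        this.base64 = base64;
        this.data = data;
    }

    static DataUri parse(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("Not a data URI: " + uri);
        }
        // The first comma separates the header from the data.
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Missing ',' in data URI: " + uri);
        }
        String[] header = uri.substring("data:".length(), comma).split(";");
        // An omitted media type defaults to text/plain.
        String mediaType = header[0].isEmpty() ? "text/plain" : header[0];
        Map<String, String> parameters = new LinkedHashMap<>();
        boolean base64 = false;
        for (int i = 1; i < header.length; i++) {
            int eq = header[i].indexOf('=');
            if (eq > 0) {
                parameters.put(header[i].substring(0, eq), header[i].substring(eq + 1));
            } else if (header[i].equals("base64")) {
                base64 = true;
            }
        }
        return new DataUri(mediaType, parameters, base64, uri.substring(comma + 1));
    }
}
```

A client like Kalamar could then switch on `mediaType` (for example rendering `application/x.korap-link` as a hyperlink) and use a `title` parameter as the link text.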
To adapt virtual corpora from COSMAS II it's necessary to have a "close to the index" mechanism to store virtual corpora, that can be retrieved by a single ID. In COSMAS II these virtual corpora are stored as vectors of text siglen. To support this in KorAP, we may need:
Krill does not support indirect relation queries, such as
node ->malt/d * node
This requires a SpanRepetitionQuery wrapping each possible SpanRelationQuery.
In collections, sometimes warnings are raised by the assumption that a value is a date. This is sometimes completely confusing (s. below) and sometimes wrong, as document identifiers may look like dates.
Failing example test:
@Test
public void testNotDate() throws JsonProcessingException, IOException {
    collection = "author=\"firefighter1974\"";
    qs.setQuery(query, ql);
    qs.setCollection(collection);
    res = mapper.readTree(qs.toJSON());
    assertEquals("koral:doc", res.at("/collection/@type").asText());
    assertEquals("author", res.at("/collection/key").asText());
    assertEquals("firefighter1974", res.at("/collection/value").asText());
    assertEquals("match:eq", res.at("/collection/match").asText());
    assertEquals(res.at("/errors/0/0").asText(), "");
    assertEquals(res.at("/warnings/0/0").asText(), "");
}
(I corrupted my local repo, that's why I'm reporting it this way.)
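One way to avoid such false positives is to treat a value as a date only if the whole string has a date shape. The sketch below assumes the shapes `YYYY`, `YYYY-MM` and `YYYY-MM-DD`; the pattern and the `DateValueCheck` name are assumptions, not the serializer's actual rule.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a strict check on whether a collection value
// should be interpreted as a date.
class DateValueCheck {
    // Assumed date shapes: YYYY, YYYY-MM, YYYY-MM-DD.
    private static final Pattern DATE =
        Pattern.compile("\\d{4}(-\\d{2}(-\\d{2})?)?");

    // matches() requires the whole value to fit the pattern, so values
    // that merely contain digits are not taken for dates.
    static boolean looksLikeDate(String value) {
        return DATE.matcher(value).matches();
    }
}
```

A bare year such as `1974` would still look like a date with this check, so the field key probably has to be taken into account as well.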
Follow up to #7 :
Currently focus queries require that classes are part of sub spans. Unfortunately there are cases where this is not true.
Example: The span <a>...{1:...}...{2:...}...</a> is modified using focus(2:...), but still contains a class 1. Now, if the span is again modified using focus(1:...), a preceding match span may be <a>...{2:...}...{1:...}...</a>, so the second class 1 may precede the first class 1.
There is now a failing test for span expansions exceeding text boundaries documented as TestSpanExpansionIndex.testExpansionQueryBug3 in branch span-expansion-match-creation-bug .
There is also a related serialization bug in TestKrillQuery.KorapSequenceWithEmptyRepetitionQuery.
Krill should handle an empty query as an operand of SpanWithinQuery and throw an appropriate error. Currently both
throws an error : You can't queryize an empty query
SpanClassRefOp is a SpanQuery that modifies classes gathered by the query. A classRefOp does not skip matches and it does not alter spans of the match. There are several problems this approach should solve:
(Trac-Ticket #151)
Class inversion is necessary for matches between defined classes. The serializer will match the surrounding classes and calculate the inversion. An example for Poliqarp (although realizable differently) is:
focus(inv(1): {1:[orth=das]}[]{1:[orth=Haus]})
This matches the token between "das" and "Haus" indirectly.
The above description is just a showcase. In fact this is necessary for the #NHIT operator in Cosmas-II.
(Trac-Ticket #149)
Some queries use internal classes to deal with certain parts of matches. These classes shouldn't be inherited by embedding queries, so this class information has to be removed from the payloads.
It's part of the partial matching stuff we have to support for #OV in Cosmas-II.
(Trac-Ticket #147)
Although it's not necessary for any language in particular, it's a good addition to submatching to support the class reference operators already serialized from Poliqarp+.
This allows finegrained focusing on class intersections.
focus(1|2: ...{1:...}...{2:...}...)
In addition, the split operator has been part of the Poliqarp+ proposal nearly from the beginning but never made it to real life. It supports the splitting of matches based on their classes.
As this is needed for support of position frames ( #9 ), not supporting SpanClasRefOpQuery is considered a bug.
Some spans would benefit from a nextStart() method, advising the span to forward to the next start position. This is especially useful for positional queries, like
contains(<s>, der []*)
where the expansion spans may go up to an end position of start + 100, although the positional span knows at a certain point that advancing the end position can no longer yield a match. Another example is element spans in the same configuration.
Currently referenced named VCs that are not yet cached will be autocached during VC retrieval. This, however, can fail, when the reference is nested and an external constraint forces documents or index segments to be skipped during retrieval.
This means: autocached VCs may return fewer resulting documents than possible.
A failing test is TestKrillCollectionIndex#testNamedVCsAfterCorpusWithMissingDocs .
Sometimes, when a match is sentence expanded and the second sentence starts at the beginning of a paragraph, the paragraph is ranked below the sentence, although the tree-depth is closer to the root.
Example text:
<p>
<s>a</s>
<s>b</s>
</p>
<p>
<s>x</s>
<s>y</s>
</p>
In case the match is bx and the match is sentence expanded, the tree in Kalamar will look like this:
/^\
| s
| |
s p
| |
b x
A better solution would be to ignore the paragraph, in case it ends after the match, resulting in
/^\
s s
| |
b x
Lucene 5 introduces Roaring Bitmaps for bit sets. That sounds beneficial for a lot of things happening in Krill, like the creation of complex virtual collections, calculations of cardinality, regex rewrites (probably) etc. Changes will require a switch to RoaringDocIdSet and SparseFixedBitSet (instead of FixedBitSet), I guess.
There are now differences between koral:distance and cosmas:distance that need to be deserialized correctly. This is already fixed (somehow) but two failing tests still need investigation.
This is reported in the korap-distance-deserialization-bug branch.
Given the following text
Eine besonders künstlerische Aufnahme entstand mit dem Blick durch den Berliner Rathausturm, auf den man den Zeppelin zufliegen sieht. Bei der vom Boden aufgenommenen Landung lässt sich die Mannschaft an Bord, die das Landungsseil in Richtung Boden abwirft, gut erkennen.
The following query
Rathausturm auf {1:[base=die]} []{0,10} {2:[base=die]}
should return the matches
1. Rathausturm, auf den man den
and
2. Rathausturm, auf den man den Zeppelin zufliegen sieht. Bei der
But currently only the first match is returned.
This is probably an issue with the SpanExpansion function.
(Reported by Bryan Jurish)
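The expected behaviour can be checked with a brute-force enumeration over the token positions. This sketch is independent of Krill's SpanExpansion code; the class name, the method signature, and the hand-assigned lemmas (with punctuation dropped) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force reference for queries of the shape
//   first second {1:[base=lemma]} []{min,max} {2:[base=lemma]}
class ExpansionMatches {
    // Returns all matches as [start, end) token ranges.
    static List<int[]> find(String[] tokens, String[] lemmas, String first,
                            String second, String lemma, int min, int max) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i + 2 < tokens.length; i++) {
            if (!tokens[i].equals(first) || !tokens[i + 1].equals(second)
                    || !lemmas[i + 2].equals(lemma)) {
                continue;
            }
            // Try every allowed gap size instead of stopping at the first hit.
            for (int gap = min; gap <= max; gap++) {
                int j = i + 3 + gap;
                if (j >= tokens.length) {
                    break;
                }
                if (lemmas[j].equals(lemma)) {
                    matches.add(new int[]{i, j + 1});
                }
            }
        }
        return matches;
    }
}
```

On the first ten tokens of the example (Rathausturm auf den man den Zeppelin zufliegen sieht Bei der), with `den` and `der` lemmatized as `die`, this returns the token ranges [0, 5) and [0, 10), corresponding to the two matches listed above.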