
blacklab's Introduction

What is BlackLab?

BlackLab is a corpus retrieval engine built on top of Apache Lucene. It allows fast, complex searches with accurate hit highlighting on large, tagged and annotated, bodies of text. It was developed at the Dutch Language Institute (INT) to provide a fast and feature-rich search interface on our contemporary and historical text corpora.

In addition to the Java library (BlackLab Core), there is also a web service (BlackLab Server), so you can access it from any programming language.
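For example, here is a minimal sketch of querying BlackLab Server over HTTP from Java 11+. It assumes a local server and a corpus named my-index (the patt and outputformat parameters are the same ones that appear in the examples further down this page); adjust the URL to your own setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BlackLabServerExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical local server and corpus name; %22test%22 is the CQL pattern "test", URL-encoded.
        String url = "http://localhost:8080/blacklab-server/my-index/hits"
                + "?patt=%22test%22&outputformat=json";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with summary, hits and docInfos
    }
}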

BlackLab is licensed under the Apache License 2.0.

To learn how to index and search your data, see the official project site.

To learn about BlackLab development, see the dev docs.

Branches

The default branch, dev, corresponds to the "bleeding edge" in-development version. You can certainly run it (we do), but if you need maximum stability, it might be better to stay on a stable release instead.

The branch that corresponds to BlackLab's latest release is called main.

There are additional branches related to in-development features. These are intended to be short-lived and will be merged into dev.

Compatibility: Java, Lucene

This version of BlackLab requires Java 11 or higher. It has been tested up to Java 17.

This version uses Lucene 8. This unfortunately means that corpora created with older BlackLab versions (up to 2.3) cannot be read and will need to be re-indexed. If this is a problem for you, you can stick with the 2.3 version for now. We would like to eventually provide a conversion tool, but there is no date planned for this.

Roadmap

There is a high-level roadmap page on the documentation site. There are also BlackLab Archives of Relevant Knowledge (BARKs) that go into more detail.

For the next major version (4.0), we are focused on integrating BlackLab with Solr, with the goal of enabling distributed search. We will use this to scale our largest corpus to many billions of tokens. Status and plans for this can be found in the above-mentioned BARKs and in more technical detail here.

Development workflow

We strive towards practicing Continuous Delivery.

Our intention is to:

  • continuously improve both unit and integration tests (during development and whenever bugs are discovered)
  • avoid long-lived feature branches but frequently merge to the dev branch
  • create meaningful commits that fix a bug or add (part of) a feature
  • use temporary feature flags to prevent issues with unfinished code
  • deploy to our servers frequently

Code style

Configurations for IDE code formatters can be found in the build-tools/ directory.

Building the site

The BlackLab end-user documentation site can be built locally if you want:

# Contains the configurations for various checking plugins shared by multiple modules
cd build-tools
mvn install

# Build the actual site (result will be in core/target/site)
cd ..
mvn site

Using BlackLab with Docker

An alpha version of the Docker setup is provided on Docker Hub. For each upcoming release, we will publish a corresponding Docker image.

A Docker version supporting BuildKit is required (18.09 or higher), as well as Docker Compose version 1.27.1 or higher.

See the Docker README for more details.

Indexing with Docker

We assume here that you are familiar with the BlackLab indexing process; see indexing with BlackLab to learn more.

The easiest way is to use the index-corpus.sh Bash script in the root of the repository. It will download the Docker image and run IndexTool in a container, using bind mounts for the input data and the indexed corpus output. Run the script without arguments for documentation.

Alternatively, you can use Docker Compose to run the indexer. This will create your index on a named volume defined by the Compose file.

Create a file named test.env with your indexing configuration:

BLACKLAB_FORMATS_DIR=/path/to/my/formats
INDEX_NAME=my-index
INDEX_FORMAT=my-file-format
INDEX_INPUT_DIR=/path/to/my/input-files
JAVA_OPTS=-Xmx10G

To index your data:

docker compose --env-file test.env run --rm indexer

Now start the server:

docker compose up -d

Your index should now be accessible at http://localhost:8080/blacklab-server/my-index.

If you want to be able to use the corpus frontend as well, create a file named .env in the root of the repository with the following contents:

DOCKER_IMAGE=blacklab-frontend

Then run:

docker compose up -d --no-build

Special thanks

  • ej-technologies for the JProfiler Java profiler
  • Matthijs Brouwer, developer of Mtas, which we used for reference while developing the custom Lucene Codec and integrating with Solr.
  • Everyone who contributed to the project. BlackLab wouldn't be where it is today without all of you.

blacklab's People

Contributors

aymandf, chrisc36, cvedetect, dependabot[bot], dirkgr, eduarddrenth, eginez, jan-niestadt, jessededoes, kcmertens, olemussmann, severian, tcbrouwer


blacklab's Issues

docId file metadata not complete

I have an index which has a docId metadata value for each document. Some docs have docIds such as 10152315, others have a letter prepended (A10152315), and others have a suffix such as 10152315.0, etc.

The issue is that filters for docId fail to retrieve text at all if the docId is in the format with a prefix.
The type of the metadata field is "text" and the analyzer is "nontokenizing".

I also noticed that in the generated indexmetadata.json file only up to 50 values are listed (none of which in the prefixed format), which I guess doesn't really bear on what is actually indexed.

Having hierarchical views inside blacklab?

Is it possible to have a hierarchical view for some of the views?
Why is this important? Suppose you have the following views:

  1. A raw text view
  2. A POS tag view
  3. A taxonomy view: a tree structured view of words and how words are hierarchically connected to each other. For example, one example hierarchy is:
=> respiratory disease => disease => illness => ill health => pathological state => physical condition => condition => state => attribute => abstraction => entity

The question is: can I query for any NN (noun) that is in the subcategory of disease?

I suspect that Lucene's faceted search might be helpful for this, if it is available inside BlackLab.
https://chimpler.wordpress.com/2013/01/30/faceted-search-with-lucene/, http://stackoverflow.com/questions/14852995/tree-search-with-lucene

@dirkgr : any idea? (might as well be of your interest)
FYI @cttsai

Linked metadata documents should be cacheable

When metadata is retrieved from a document that contains the metadata of many documents, e.g. a CSV file where every row contains metadata for a document, right now there is no caching of such documents: each time they are referred to, they are re-parsed.

It would be much better in this case if they could be cached. The way to approach this is to make DocIndexer hold on to its parsed document and not re-parse it, then make sure DocIndexers created in DocIndexerBase.indexLinkedDocument() are cached.

Only a limited number of such DocIndexers should be cached (say, no more than 20 by default?), and there should probably be a way to disable caching altogether, to prevent out-of-memory issues.
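For illustration, a bounded cache along these lines could be a simple LRU map. This is only a sketch of the idea, not the actual BlackLab implementation; the limit of 20 is the default suggested above.

import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch: keeps at most maxSize parsed documents (e.g. DocIndexers), keyed by file path. */
public class ParsedDocumentCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public ParsedDocumentCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behaviour
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least-recently-used entry once over the limit
    }
}

// Usage sketch: cache at most 20 parsed linked metadata documents by file path.
// ParsedDocumentCache<String, DocIndexer> cache = new ParsedDocumentCache<>(20);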

Two Performance Issues During the Retrieval Time

Hello,

I use the following code for fetching concordances from a relatively large index (120G):

Hits hits = searcher.find(some_query);

if (hits.doneFetchingHits()) {
    System.out.println("#hits: " + hits.countSoFarHitsCounted());
    System.out.println("#doc: " + hits.countSoFarDocsCounted());
}

For example, for a query I see this output:

#hits: 49050
#doc: 39587

I get this result fast (in a few seconds).

Then, I use the following code to fetch the results and do some processing (e.g., a simple file print):


for (Hit hit : hits) {
    Kwic kwic = hits.getKwic(hit, CONTEXT_WINDOW_SIZE);
    doSomeProcessOnKwic(kwic.getFullXml());
}

For a small CONTEXT_WINDOW_SIZE, i.e. CONTEXT_WINDOW_SIZE <= 3, the I/O performance is OK, i.e. the time spent retrieving the first kwic is not noticeable. In fact, the first few thousand hits can be retrieved fast. However (ISSUE 1), this time grows when retrieving more hits (e.g., those ranked at 6k and above).

In addition (ISSUE 2), for CONTEXT_WINDOW_SIZE > 3, the performance drops drastically and the retrieval time grows almost exponentially, which makes the index nearly useless: that is to say, in the same amount of time (or less), the original vertical text could be parsed to find the required data.

My first intuition was that this is due to some memory problem, but these issues persist even if I run the code with a heap size such as 50 GB.

Could you please tell me what I am doing wrong?

Thanks!

Unknown key in indexmetadata file should issue a warning

The other BlackLab config files are fairly strict about what keys can be used and will complain if there's one they don't recognize, but anything in the indexmetadata.yaml (or .json) is accepted; unknown keys are simply ignored. It would be better if at least a warning was logged for unrecognized keys. This makes it easier to deal with typos.
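As a rough illustration of the idea (not the actual config-reading code), warning about unrecognized keys can be as simple as checking each key against a known set while reading the metadata map; the key names below are a hypothetical subset.

import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

public class IndexMetadataChecker {
    private static final Logger LOG = Logger.getLogger(IndexMetadataChecker.class.getName());

    // Hypothetical subset of keys recognized in indexmetadata.yaml/.json
    private static final Set<String> KNOWN_KEYS = Set.of("displayName", "description", "contentViewable");

    /** Log a warning for every key we don't recognize instead of silently ignoring it. */
    public static void warnAboutUnknownKeys(Map<String, Object> metadata) {
        for (String key : metadata.keySet()) {
            if (!KNOWN_KEYS.contains(key))
                LOG.warning("Unknown key '" + key + "' in indexmetadata file; possible typo?");
        }
    }
}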

Functionality for killing running queries?

I've been looking through the codebase for a solution, but so far I haven't been able to spot it.
Say you want to time out a query that is taking too long (and probably consuming way too much memory). Is there a way I can kill a query after some time, before it, for instance, runs out of memory?
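One generic workaround (a sketch of the idea, not a BlackLab API) is to run the search on a separate thread and give up after a deadline; note that cancellation only actually stops the work if the search code checks for thread interruption.

import java.util.concurrent.*;

public class QueryTimeoutExample {
    /** Run a search task, giving up (and attempting to cancel it) after the given number of seconds. */
    public static <T> T runWithTimeout(Callable<T> search, long seconds) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(search);
        try {
            return future.get(seconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only helps if the search code checks Thread.interrupted()
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}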

bug in autocomplete

LuceneUtils.find...does:

!optDesensitized.substring(0, prefix.length()).equalsIgnoreCase(prefix)

changing to

!optDesensitized.startsWith(prefix)

makes autocomplete fully work

PR follows

update lucene?

We're on 5.5.2, 7.2.1 is available. Look into what this brings us?

performance blacklabservlet

In the processing of every request, BlackLabServer#handleRequest goes through an expensive synchronized block. Expensive in two ways:

  • synchronized is expensive
  • a servlet container may instantiate many BlackLabServer objects, each initializing config in its instance fields.

I think the configuration can be made static and initialized in a static code block.

Greedy matching in TextPatternRepetition

Is there a way to make TextPatternRepetition perform greedy matching?

For example, if I have the document a a a b and I search for the pattern a+ b, BlackLab will return multiple hits for a b, a a b, and a a a b. I'm wondering if there is a way to change the semantics of TextPatternRepetition to perform a greedy, left-to-right matching and return only a a a b.

(Right now, my solution is to post-process the hits outside of BlackLab.)
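For reference, a post-processing step like the one mentioned above could keep only the hits that are not contained within a longer hit. A minimal sketch on plain (start, end) pairs, outside of any BlackLab API:

import java.util.ArrayList;
import java.util.List;

public class GreedyFilter {
    /** Each hit is an int[]{start, end}; drop every hit that lies strictly inside a longer hit. */
    public static List<int[]> keepMaximalHits(List<int[]> hits) {
        List<int[]> result = new ArrayList<>();
        for (int[] h : hits) {
            boolean contained = false;
            for (int[] other : hits) {
                boolean covers = other[0] <= h[0] && h[1] <= other[1];
                boolean longer = (other[1] - other[0]) > (h[1] - h[0]);
                if (covers && longer) {
                    contained = true;
                    break;
                }
            }
            if (!contained)
                result.add(h); // e.g. only "a a a b" survives from {a b, a a b, a a a b}
        }
        return result;
    }
}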

Error due to two xml declarations when viewing document contents

I created an index using an xml-based custom blacklab-format. contentViewable was set to true. Searching this index (using the corpus-frontend) works without problem, but when I click a document name, to view the document metadata and content, it doesn't work. (To make this work I added an article_empty.xsl to src/main/webapp/WEB-INF/interface-default/ of the corpus-frontend that is loaded if "article_" + corpusDataFormat + ".xsl" does not exist.)

So if I try to view the metadata and contents of a document, I get the error message: The processing instruction target matching "[xX][mM][lL]" is not allowed. This is caused by the fact that the BlackLab response has two xml declarations:

<?xml version="1.0" encoding="utf-8" ?>
<blacklabResponse>
  <?xml version="1.0" encoding="utf-8"?>
  <document>
...

My guess is that this problem can be fixed by not storing the document's xml declaration in the index.
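As an illustration of that idea (not the actual indexing code), the declaration could be stripped from the document text before it is stored:

public class XmlDeclStripper {
    /** Remove a leading XML declaration such as <?xml version="1.0" encoding="utf-8"?> if present. */
    public static String stripXmlDeclaration(String xml) {
        return xml.replaceFirst("^\\s*<\\?xml[^>]*\\?>\\s*", "");
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<document>...</document>";
        System.out.println(stripXmlDeclaration(doc)); // prints "<document>...</document>"
    }
}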

update to xpath 2 / saxonica

Support for XPath 2/3 extends the possibilities in configs (which may have a bigger influence on performance than the use of the poorly maintained Ximpleware).
Saxonica is renowned and in active development, and I wonder if it is significantly slower than Ximpleware these days.

Option to return whole sentence as context

Right now, each match is returned with a fixed number of words as context (wordsaroundhit parameter). We would like to have an option to return matches with the whole sentence as context, or possibly even the previous and next sentence too. This would require significant changes in BlackLab, though. We should think about how to go about this.

ContentStoreDirFixedBlock prevents threaded indexing

This class is responsible for storing documents within an index on disk, for later retrieval of the original contents. An index can potentially contain many separate documents.
Every document is stored as a series of compressed blocks on disk, where every block is 4k in size.
A document is stored as a list of the blocks that make it up, and which part of the document each of those blocks contain.

The issue with the current implementation is twofold:

Firstly, because compression ratio is variable, a different amount of uncompressed data is required to fill every block with 4k of compressed data. Because of this, the next block cannot begin to be compressed before the previous block is finished, as it is unknown how much of the uncompressed data the previous block will use up.
So blocks are essentially compressed serially.

Secondly, a document can be stored in parts (see ContentStoreDirFixedBlock#storePart) and written to disk as it's being processed.
This requires some state about the current document to be kept within the ContentStore class. Namely how much of the document has already been stored, and which block indices/ids were used to do so.
The consequence of this is that it is currently impossible to process multiple documents using the same instance of the ContentStore class, because there is state linked to the document in between calls to store()/storePart().

The current system has a couple of features that have to be considered in any changes to the system:

Every block has some metadata (stored within the table of contents file) that stores the location/offset of its uncompressed data within the source document (see TocEntry#deserialize).
This effectively allows random access to data within the file, because the block containing that bit of data can be found without having to actually read/process the block.
Also, blocks can easily be reused, as a document is essentially just a list of pointers to some blocks, and every block is just a 4k piece of data within the disk file.

A couple ways to solve this:

Allow blocks to have a variable size.

To do this, a block would need to contain the following information:

  • offset on disk
  • length on disk
  • offset within the uncompressed document (new)
  • length within the uncompressed document (new)

Reading the old content store would still be possible; the offset and length within the document are already stored in the current system.
The on-disk offset is the index of the block (index * block_size [4096]), and the length is constant at 4096 bytes.
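A TOC entry for such variable-size blocks might look like this (a sketch only; field names are illustrative, not the actual TocEntry):

/** Sketch of a table-of-contents entry for a variable-size content store block. */
public class VariableBlockEntry {
    long offsetOnDisk;        // where the compressed block starts in the store file
    int lengthOnDisk;         // compressed size on disk (no longer fixed at 4096 bytes)
    long offsetInDocument;    // character offset of this block's data in the uncompressed document
    int lengthInDocument;     // number of characters of the document covered by this block

    // Random access still works: binary-search entries by offsetInDocument to find the block
    // containing a given character position, then read lengthOnDisk bytes at offsetOnDisk and decompress.
}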

Keep the current block system, but allow separate documents to be processed in parallel.

This can be done by moving the document-specific state (charsFromEntryWritten, bytesWritten, blockIndicesWhileStoring, blockCharOffsetsWhileStoring, unwrittenContents)
out of the ContentStoreDirFixedBlock and into a sort of context class, together with that document's id. The context would be created when the document is first created, and passed into the store*() functions within the contentStore.
The global data about the store file itself, such as freeBlocks, next(Block)Id, etc will need to be synchronized and kept within the store.
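A sketch of the proposed context object (field names are taken from the description above; the types are guesses):

import java.util.ArrayList;
import java.util.List;

/** Per-document state moved out of ContentStoreDirFixedBlock so documents can be stored in parallel (sketch). */
public class DocumentStoreContext {
    final int documentId;
    int charsFromEntryWritten;                                            // characters already flushed for this document
    int bytesWritten;                                                     // compressed bytes written so far
    final List<Integer> blockIndicesWhileStoring = new ArrayList<>();     // blocks used by this document
    final List<Integer> blockCharOffsetsWhileStoring = new ArrayList<>(); // char offset at the start of each block
    final StringBuilder unwrittenContents = new StringBuilder();          // content not yet compressed into a block

    public DocumentStoreContext(int documentId) {
        this.documentId = documentId;
    }
}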

Get rid of the contentStore entirely, and research how to store the document data in Solr/Lucene.

This is probably the best long-term solution.

Documentation request: required Java version(s) are unclear

Trying to run Blacklab server on Debian Linux 8 ('jessie'/oldstable, AMD64) fails ("Unsupported major.minor version 52.0"). Compilation of 'dev' branch with maven fails as well, editing source/target versions in core/pom.xml and server/pom.xml doesn't help. The compilation log shows errors on the '->' operator, which is apparently not present in openjdk-7-jdk.

Installing openjdk-8-jdk from the Debian Backports repository (also present in Debian 9 ('stretch'/stable)) seems to fix all problems.

Having Java requirements specified in the installation instructions would have saved time. If I overlooked something, please let me know.

On behalf of our researchers, thanks for your work on Blacklab server!

IndexCollections directory not accessible from server

I'm trying to install blacklab-server. I have the web application running, but there is a problem accessing the indexCollections directory.

The error message is:

INTERNAL_ERROR Configuration error: no index locations found. Create /etc/blacklab/blacklab-server.json containing at least the following:
{
  "indexCollections": [
    "/dir/containing/indices"
  ]
} 

If I look at the logs, I see the /etc/blacklab/blacklab-server.json is being read:

12:05:37.054 [http-nio-8080-exec-1] server.BlackLabServer               DEBUG Running from dir: /opt/tomcat/webapps/blacklab-server-1.6.0
12:05:37.054 [http-nio-8080-exec-1] server.BlackLabServer               DEBUG Reading configuration file /etc/blacklab/blacklab-server.json
12:05:37.061 [http-nio-8080-exec-1] search.SearchManager                DEBUG SearchManager created
12:05:37.069 [http-nio-8080-exec-1] search.IndexManager                 WARN  Configured collection not found or not readable: /home/jvdzwaan/data/blacklab
12:05:37.071 [http-nio-8080-exec-1] requesthandlers.Response            DEBUG INTERNAL ERROR 29

My /etc/blacklab/blacklab-server.json looks like this:

{ 
"indexCollections": [ "/home/jvdzwaan/data/blacklab"],
}

I have no problem reading the index that is in there with blacklab-core.

If I put a different directory in the array, e.g., "/opt/tomcat/webapps/blacklab-server-1.6.0" (this is a directory tomcat has direct access to), I don't get the error message (of course it doesn't work, because there are no directories with indices in this dir).

I'm sure it has something to do with giving tomcat access to the local file system, but I haven't been able to figure out how. Can you help me?

Suggestion, monitoring of loading of indices

This really is a suggestion, but perhaps it's implemented already and I've overlooked it.

Basically, from a client app, I'd like to know when a corpus query has triggered loading the corpus index, so that I can notify the user that nothing is wrong with the query taking so long. This is particularly important when using jsonp, since jsonp error handling already relies on timeouts: if loading an index takes longer than the jsonp timeout, the user will get a network error message while in fact there was none. (This can be tweaked by increasing the jsonp error timeout, but then the waiting time in case of a real network error goes up.)

I imagine that this could probably be implemented via specific server responses in case of corpus index loading. What do you think?

wildcard regexes

I am noticing some strange behaviour of wildcards within regexes.
For instance, querying for "had" "laughed" "at" on the default Brown corpus (indexed with "tei"), gives you 1 hit.
However, querying for "had" "l.*" "at" gives you 0 hits. I've tried this with similar examples finding similar issues (missing hits when using the wildcard).
I am using blacklab-1.3.4.
Interestingly, I haven't been able to reproduce this on OpenSonar.

best,

Cannot build on Debian Jessie 64bit

Building fails using ant.
I have Eclipse installed, but haven't really gotten used to working with it. All needed libraries should be installed (JDK etc).

~/eclipse/git/BlackLab$ ant
Buildfile: ~/eclipse/git/BlackLab/build.xml

determine-app-or-lib:

build-dependencies:

manifest-classpath:

init:

init-internal:
     [echo] ----- Building BlackLab -----

compile:

test.compile:

test:
    [junit] Running nl.inl.blacklab.analysis.TestBLDutchAnalyzer
    [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec

BUILD FAILED
~/eclipse/git/BlackLab/ant-jar.xml:169: Test nl.inl.blacklab.analysis.TestBLDutchAnalyzer failed

Total time: 1 second

uri encoding

including this:

            try {
                request.setCharacterEncoding("utf-8");
            } catch (UnsupportedEncodingException ex) {
                java.util.logging.Logger.getLogger(BlackLabServer.class.getName()).log(Level.SEVERE, null, ex);
            }

in BlackLabServer#handleRequest makes utf-8 encoded uris work on every servlet container

add absolute or relative xpath condition to addHandler

given a file like

<doc>
  <header1>
    <date>1940</date>
  </header1>
  <header2>
    <date>1945</date>
  </header2>
</doc>

I am trying to index only /doc/header2/date; however, neither //header2/date nor /doc/header2/date seems to be working (currently I am just printing out some random string in the startElement/endElement overridden methods of the corresponding Handler to test whether the condition is being matched), despite

https://github.com/INL/BlackLab/blob/master/src/main/java/nl/inl/blacklab/index/HookableSaxHandler.java#L141

suggesting so...
I know I could use some flags to check whether I am inside "header1" or "header2" and condition on a bare "date", but it would be nice to be able to just use xpath expressions.

Any thoughts?

Diacritics are incorrectly decoded by Blacklab Server

For instance, if I search for words including "á" with 'patt=[word="..%C3%A1."]' or 'patt=[word="..á."]' I get no results. The search pattern given back by BLS looks like this: [word="..Ã¡."].

xpath not supported anymore

Using a construct that previously worked:

valuePath: "@lemma | .[@next or @previous]/following-sibling::tei:join[position()=1 and '$1'!='' and contains(concat(@target,' '),'#$1 ')]/@lemma"

I now get a:

Syntax error after or around the end of ==>@lemma |

Also tried (without the dot):

valuePath: "@lemma | [@next or @previous]/following-sibling::tei:join[position()=1 and '$1'!='' and contains(concat(@target,' '),'#$1 ')]/@lemma"

regex character negation

I just noticed that the caret as character class negator doesn't seem to be supported by Blacklab's CQP query language implementation:

For instance, "[^,.!]" still returns hits for ",", ".", etc.

I was just wondering about it, because I seemed to remember that CQP's CQL does support it.

Fix DocIndexerXpath for non-utf8 charsets

DocIndexerXpath::inputDocument is byte[], but the charset for this array is not stored anywhere, and the assumption of utf-8 is made everywhere the buffer is turned into a string again (storeDocument(), getCharacterPosition(), probably some other places in derived DocIndexers).

Since we're always dealing with an xml document, we can rely on VTD-XML to parse the correct charset from the encoding declaration in the file. We can get it by using VTDNav::getEncoding.
The result will have to be mapped back to a java Charset, and be used everywhere we use the byte[] as a string.

There's also a little bug in the setDocument where the defaultCharset is always used to resolve the references, but we should use the actual charset of the buffer.
This presents a bit of a catch-22: we need VTD to parse the document to get the charset, but we need the charset to prepare the document for parsing by VTD. Can we make XmlUtil::readXmlAndResolveReferences also auto-discover the correct charset from the encoding declaration in the file itself?

Bad formatting of json/xml response.

I am getting a format error while trying to parse the output of a blacklab-server query.
I've attached an example for both JSON and XML but it's basically an issue with the pidField field in indexmetadata.json

Currently, my documents look like:

<doc><header><docId>a123.a2.xml</docId>...</header><body></body></doc>

And I've tried setting pidField to both docId and /doc/header/docId, but in both cases it seems as if blacklab isn't able to pick up the corresponding data from the xml (see the xml output below: <docPid>(null)</docPid>). But perhaps I am misunderstanding something :-) ...

I forgot to add that I haven't been able to find any relevant info in catalina's log

Attachment (sorry for the weird indentation, but it's hard to do better given that it isn't really valid JSON/xml):

JSON:

{"summary":
{"searchParam":{"first":"0","indexname":"mbg-index","maxcount":"100000","number":"5","patt":"\"a\"","waitfortotal":"no","wordsaroundhit":"5"},"searchTime":104,"countTime":45,"stillCounting":false,"numberOfHits":100000,"numberOfHitsRetrieved":100000,"stoppedCountingHits":true,"stoppedRetrievingHits":false,"numberOfDocs":225,"numberOfDocsRetrieved":225,"windowFirstResult":0,"requestedWindowSize":5,"actualWindowSize":5,"windowHasPrevious":false,"windowHasNext":true,
"docFields":{"pidField":"docId","titleField":"title","authorField":"author","dateField":"date"}},"hits":
[{"docPid":null,"start":103,"end":104,"left":{"punct":[" "," "," "," "," "],"word":["for","we","have","before","us"]},"match":{"punct":[" "],"word":["a"]},"right":{"punct":[" "," "," "," "," "],"word":["Work",",","that","seems","to"]}},{"docPid":null,"start":111,"end":112,"left":{"punct":[" "," "," "," "," "],"word":["that","seems","to","our","selves"]},"match":{"punct":[" "],"word":["a"]},"right":{"punct":[" "," "," "," "," "],"word":["Dream",",","and","that","will"]}},{"docPid":null,"start":120,"end":121,"left":{"punct":[" "," "," "," "," "],"word":["that","will","appear","to","Posterity"]},"match":{"punct":[" "],"word":["a"]},"right":{"punct":[" "," "," "," "," "],"word":["Fiction",":","a","Work","about"]}},{"docPid":null,"start":123,"end":124,"left":{"punct":[" "," "," "," "," "],"word":["to","Posterity","a","Fiction",":"]},"match":{"punct":[" "],"word":["a"]},"right":{"punct":[" "," "," "," "," "],"word":["Work","about","which","Providence","has"]}},{"docPid":null,"start":133,"end":134,"left":{"punct":[" "," "," "," "," "],"word":["has","watched","in","so","peculiar"]},"match":{"punct":[" "],"word":["a"]},"right":{"punct":[" "," "," "," "," "],"word":["manner",",","that","a","Mind"]}}],
"docInfos":{"{"error":{"code":"INTERNAL_ERROR","message":"An internal error occurred. Please contact the administrator. Error code: 32."}}

XML

<?xml version="1.0" encoding="utf-8" ?><blacklabResponse><summary><searchParam><first>0</first>
<indexname>mbg-index</indexname><maxcount>100000</maxcount><number>5</number>
<patt>&quot;a&quot;</patt><waitfortotal>no</waitfortotal><wordsaroundhit>5</wordsaroundhit>
</searchParam><searchTime>104</searchTime><countTime>45</countTime>
<stillCounting>false</stillCounting><numberOfHits>100000</numberOfHits>
<numberOfHitsRetrieved>100000</numberOfHitsRetrieved>
<stoppedCountingHits>true</stoppedCountingHits>
<stoppedRetrievingHits>false</stoppedRetrievingHits><numberOfDocs>225</numberOfDocs>
<numberOfDocsRetrieved>225</numberOfDocsRetrieved><windowFirstResult>0</windowFirstResult>
<requestedWindowSize>5</requestedWindowSize><actualWindowSize>5</actualWindowSize>
<windowHasPrevious>false</windowHasPrevious><windowHasNext>true</windowHasNext>
<docFields><pidField>docId</pidField><titleField>title</titleField><authorField>author</authorField>
<dateField>date</dateField></docFields></summary><hits><hit><docPid>(null)</docPid>
<start>103</start><end>104</end><left> <w>for</w> <w>we</w> <w>have</w> <w>before</w> 
<w>us</w></left><match> <w>a</w></match><right> <w>Work</w> <w>,</w> <w>that</w> 
<w>seems</w> <w>to</w></right></hit><hit><docPid>(null)</docPid><start>111</start>
<end>112</end><left> <w>that</w> <w>seems</w> <w>to</w> <w>our</w> <w>selves</w></left><match> <w>a</w></match><right> <w>Dream</w> <w>,</w> <w>and</w> <w>that</w> <w>will</w></right></hit><hit><docPid>(null)</docPid><start>120</start><end>121</end><left> <w>that</w> <w>will</w> <w>appear</w> <w>to</w> <w>Posterity</w></left><match> <w>a</w></match><right> <w>Fiction</w> <w>:</w> <w>a</w> <w>Work</w> <w>about</w></right></hit><hit><docPid>(null)</docPid><start>123</start><end>124</end><left> <w>to</w> <w>Posterity</w> <w>a</w> <w>Fiction</w> <w>:</w></left><match> <w>a</w></match><right> <w>Work</w> <w>about</w> <w>which</w> <w>Providence</w> <w>has</w></right></hit><hit><docPid>(null)</docPid><start>133</start><end>134</end><left> <w>has</w> <w>watched</w> <w>in</w> <w>so</w> 
<w>peculiar</w></left><match> <w>a</w></match><right> <w>manner</w> <w>,</w> <w>that</w> 
<w>a</w> <w>Mind</w></right></hit></hits><docInfos><docInfo pid="<error>
<code>INTERNAL_ERROR</code><message>An internal error occurred. Please contact the administrator. Error code: 32.</message></error></docInfo>

Capture groups in sequences can fail to return everything expected

Queries where:
1: The capture group(s) can match a variable number of tokens
2: The non-capture-group part of the query also contains terms that can also match a variable number of tokens

can sometimes fail to return all the capture groups expected. This is caused by the fact that sometimes capture groups can capture different Spans on an identical Hit, so the uniqueness filter in SpanQuerySequence.java will remove those results. Therefore, if a query matches the same span of tokens as a previous Hit but could capture different tokens within that span, that capture is dropped.

For example a query like "1:[pos="NOUN"]+ [pos="NOUN"]* d" on the sentence "a b c d" where "a-d" are nouns will capture "a b" and "b", but not "a".

Indexing of very big corpus runs out of memory

We reach memory limits when indexing a huge TEI file (~21Gb).

Is there any solution for importing huge files other than adding more RAM?

0 docs done (2789 MB, 42625k tokens). Average speed 111.8k tokens/s (7.3 MB/s), currently 36.4k tokens/s (2.4 MB/s)
Done. Elapsed time: 6 minutes, 21 seconds
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at nl.inl.blacklab.externalstorage.ContentStoreDirUtf8.addToBlock(ContentStoreDirUtf8.java:474)
    at nl.inl.blacklab.externalstorage.ContentStoreDirUtf8.storePart(ContentStoreDirUtf8.java:560)
    at nl.inl.blacklab.index.DocIndexerAbstract.storePartCapturedContent(DocIndexerAbstract.java:138)
    at nl.inl.blacklab.index.DocIndexerAbstract.appendContent(DocIndexerAbstract.java:156)
    at nl.inl.blacklab.index.DocIndexerAbstract.processContent(DocIndexerAbstract.java:180)
    at nl.inl.blacklab.index.DocIndexerXmlHandlers.endElement(DocIndexerXmlHandlers.java:691)
    at nl.inl.blacklab.index.DocIndexerXmlHandlers$SaxParseHandler.endElement(DocIndexerXmlHandlers.java:760)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1789)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2965)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
    at nl.inl.blacklab.index.DocIndexerXmlHandlers.index(DocIndexerXmlHandlers.java:725)
    at nl.inl.blacklab.index.Indexer.index(Indexer.java:348)
    at nl.inl.blacklab.index.Indexer.indexInputStream(Indexer.java:510)
    at nl.inl.blacklab.index.Indexer.indexInternal(Indexer.java:475)
    at nl.inl.blacklab.index.Indexer.index(Indexer.java:388)
    at nl.inl.blacklab.tools.IndexTool.main(IndexTool.java:266)

captureValuePaths not processed

Caused by: nl.inl.blacklab.index.config.InputFormatConfigException: Unknown key captureValuePaths in annotation lemmasplit
at nl.inl.blacklab.index.config.InputFormatReader.readAnnotation(InputFormatReader.java:257)

Support for CoNLL-U format

(Requested by @JessedeDoes)
Expand TSV input type to be able to deal with the CoNLL-U format.

The format is basically a TSV with some special features (point 2 and 3):

  1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
  2. Blank lines marking sentence boundaries.
  3. Comment lines starting with hash (#).

So we should probably add two options, e.g. blankLinesMarkSentenceBoundaries (default false) and commentLineCharacter (if this is the first character on the line, skip that line; default: none)
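A sketch of how a reader could use those two options (illustrative only; the option names are the ones proposed above):

import java.io.BufferedReader;
import java.io.IOException;

public class TsvSentenceReader {
    /** Read a CoNLL-U-style TSV, honouring the two proposed options. */
    public static void read(BufferedReader in, boolean blankLinesMarkSentenceBoundaries,
                            Character commentLineCharacter) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                if (blankLinesMarkSentenceBoundaries)
                    endSentence();           // e.g. close an <s> inline tag
                continue;
            }
            if (commentLineCharacter != null && line.charAt(0) == commentLineCharacter)
                continue;                    // skip comment lines such as "# sent_id = 1"
            String[] fields = line.split("\t");
            indexToken(fields);              // 10 tab-separated annotation fields per word line
        }
    }

    private static void endSentence() { /* ... */ }
    private static void indexToken(String[] fields) { /* ... */ }
}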

Memory leak? "Not enough free mem, will remove some searches" constantly repeats

In our Blacklab installation, called from Whitelab (@matjemeisje), we run into a memory problem, where the last DEBUG message keeps repeating constantly and the system is reported to stall (@martinreynaert).

Log excerpt:

32273788 [http-bio-8080-exec-17] requesthandlers.RequestHandler       INFO  ::1 S:765D GET /cgnsonar/hits?outputformat=json&patt=%5B%5D&group=hit%3Alemma&first=0&number=50
32513872 [Thread-122] search.SearchCache                   DEBUG Search is taking too long, cancelling: 47: JobHitsGrouped(input=JobHits(index=cgnsonar, patt=ANYTOKEN(1, 1), filter=null, ma
xRetrieve=-1, maxCount=-1, ctxsize=5, conctype=FORWARD_INDEX), hitgroup=hit:lemma, hitgroupsort=identity, sortreverse=false)
34214714 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
34215215 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
35442546 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
36423357 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
36639424 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
37588184 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
37766089 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
38734262 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
38908324 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
39819659 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
39990168 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
40933962 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
41108701 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
42044920 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
42225158 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.
java.lang.OutOfMemoryError: Java heap space45262379 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.

        at nl.inl.blacklab.search.HitsImpl.getContextWords(HitsImpl.java:1547)
        at nl.inl.blacklab.search.HitsImpl.findPartOfContext(HitsImpl.java:1229)
        at nl.inl.blacklab.search.HitsImpl.findContext(HitsImpl.java:1200)
        at nl.inl.blacklab.search.grouping.ResultsGrouper.init(ResultsGrouper.java:114)
        at nl.inl.blacklab.search.grouping.ResultsGrouper.<init>(ResultsGrouper.java:107)
        at nl.inl.blacklab.search.Hits.groupedBy(Hits.java:467)
        at nl.inl.blacklab.server.jobs.JobHitsGrouped.performSearch(JobHitsGrouped.java:70)
        at nl.inl.blacklab.server.jobs.Job.performSearchInternal(Job.java:301)
        at nl.inl.blacklab.server.jobs.SearchThread.run(SearchThread.java:31)
45262881 [Thread-122] search.SearchCache                   DEBUG Not enough free mem, will remove some searches.

Memory usage is indeed very high (440GB virtual, 154GB resident).

blacklab-server.json: http://lst.science.ru.nl/~proycon/blacklab-server.json

Blacklab version is 1.5.0

Out of memory error indexing largish corpus (10k documents)

I am trying to index a collection of some 10k documents (ranging from a few kB to at most 8 MB) and I am getting an out of memory error after indexing some 5k documents:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at nl.inl.blacklab.forwardindex.TermsImplV3.write(TermsImplV3.java:292)
        at nl.inl.blacklab.forwardindex.ForwardIndexImplV3.close(ForwardIndexImplV3.java:444)
        at nl.inl.blacklab.search.Searcher.close(Searcher.java:539)
        at nl.inl.blacklab.search.SearcherImpl.close(SearcherImpl.java:297)
        at nl.inl.blacklab.index.Indexer.close(Indexer.java:367)
        at nl.inl.blacklab.tools.IndexTool.main(IndexTool.java:269)

I've already set -Xmx to 5 GB, but it didn't help solve the issue.
The last log line before crashing says:

4958 docs done (1723 MB, 41160k tokens). Average speed 9.2k tokens/s (0.4 MB/s), currently 1.3k tokens/s (0.1 MB/s)

Any ideas?

best!

Error for multiple values field when the xpath query returns nothing

I'm creating an indexer format for a new/custom xml format. Due to imperfections of the tool the results were generated with, sometimes a query for a field that has multipleValues set to true returns nothing. In this case the field value of the next word is added to this word, which is incorrect.

Indexer format: https://github.com/arabic-digital-humanities/index-safar/blob/master/safar-analyzer.blf.yaml
Example input xml (simplified):

<?xml version="1.0" encoding="utf-8"?>
<morphology_analysis total_words="242">
 <word total_analysis="0" value="b_ay^g_abh" w_id="1"/>
 <word total_analysis="36" value="wqd" w_id="2">
  <analysis a_id="1" root="qdd" stem="qd"/>
  <analysis a_id="36" stem="qd"/>
 </word>
</morphology_analysis>

For this example the stem qd is added to the word b_ay^g_abh, which is incorrect.

The solution is to add an (empty) annotation if the xPath query for a multipleValues field is empty.
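A sketch of that fix in generic terms (not BlackLab's actual DocIndexer code): when the xpath query for a multipleValues annotation yields nothing, add an empty value so the annotation stays aligned with the word positions.

import java.util.List;

public class MultipleValuesPadding {
    /**
     * valuesForWord: values the xpath query found for the current word (possibly empty).
     * annotationValues: flat list of values for this annotation, one or more entries per word.
     */
    public static void addValues(List<String> valuesForWord, List<String> annotationValues) {
        if (valuesForWord.isEmpty()) {
            annotationValues.add("");          // keep positions aligned instead of "borrowing" the next word's value
        } else {
            annotationValues.addAll(valuesForWord);
        }
    }
}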

Problem with '&' operator in the CQL queries

Hi,

I get erroneous results when using the '&' operator in CQL queries.

For a query such as [lemma='zebra'], the system returns 261 hits:

Hits hits = searcher.find(parseCorpusQL(" [lemma='zebra'] "));

However, when I alter this query to something like [lemma='zebra' & tag='n.*']:

Hits hits = searcher.find(parseCorpusQL(" [lemma='zebra' & tag='n.*'] "));

the system returns 1290126 hits.

Do you have any clue why I have this problem? Here is the handler I used for indexing:

public WLTGDDocIndexer(Indexer indexer, String fileName, Reader reader) {
        super(indexer, fileName, reader);
        // Get handles to the default properties (the main one word & punct)
        final ComplexFieldProperty propMain = getMainProperty();
        final ComplexFieldProperty propPunct = getPropPunct();
        final ComplexFieldProperty propLemma = addProperty("lemma", SensitivitySetting.SENSITIVE_AND_INSENSITIVE);
        final ComplexFieldProperty propPartOfSpeech = addProperty("tag", SensitivitySetting.ONLY_INSENSITIVE);
        final ComplexFieldProperty propDepr = addProperty("depr", SensitivitySetting.ONLY_INSENSITIVE);
        final ComplexFieldProperty propDist = addProperty("dist", SensitivitySetting.ONLY_INSENSITIVE);
        final ComplexFieldProperty propGWord = addProperty("gword", SensitivitySetting.SENSITIVE_AND_INSENSITIVE);
        final ComplexFieldProperty propGLemma = addProperty("glemma", SensitivitySetting.SENSITIVE_AND_INSENSITIVE);
        final ComplexFieldProperty propGTag = addProperty("gtag", SensitivitySetting.ONLY_INSENSITIVE);
        // Doc element: the individual documents to index
        addHandler("/text", new DocumentElementHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                super.startElement(uri, localName, qName, attributes);
            }
        }
        );

        // Sentence and para as inline tags
        addHandler("p", new InlineTagHandler());
        addHandler("s", new InlineTagHandler());

        addHandler("w", new WordHandlerBase() {

            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                super.startElement(uri, localName, qName, attributes);
                propLemma.addValue(attributes.getValue("l"));
                propPartOfSpeech.addValue(attributes.getValue("p"));
                propDepr.addValue(attributes.getValue("dp"));
                propDist.addValue(attributes.getValue("di"));
                propGWord.addValue(attributes.getValue("gw"));
                propGLemma.addValue(attributes.getValue("gl"));
                propGTag.addValue(attributes.getValue("gp"));
                propPunct.addValue(consumeCharacterContent());
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                super.endElement(uri, localName, qName);
                propMain.addValue(consumeCharacterContent());
            }
        });
}


option to stream complete document

No buffering, just stream a whole document when allowed

Performance / memory wise it may be a good idea in general to reconsider the buffering mechanism in BlacklabServer.

Filtering error response codes 32/15

I have a corpus with some 50 authors, each of which has an id. I am getting this error whenever I try to filter by that field:

{
  "error": {
    "message": "An internal error occurred. Please contact the administrator. Error code: 32.",
    "code": "INTERNAL_ERROR"
  }
}

I remember that it used to work without any problems before I updated to 1.6.0-dev.
I've noticed it happens with other fields as well.
I've tried reindexing without custom indextemplate.json and it doesn't seem to change anything.

Do you know how I can debug this further (e.g., what does error code 32 stand for), or do you have any other hints?

Request to publish to nl.inl.blacklab namespace

Hello,

We usually publish our libraries to the org.allenai namespace, but we have a fork of your BlackLab project (https://github.com/allenai/BlackLab) that we would now like to publish. It's in Bintray, but we cannot push it to Maven Central under the nl.inl.blacklab namespace as we don't own it. We wanted to keep that namespace to ensure sbt versioning happens correctly. Would you please grant us permission to do so? Our Sonatype user token is allenai-role.

Thanks,
Sumithra

Handle standoff annotations referring to multiple tokens / stretches of tokens

Some formats include standoff annotations that don't just refer to a single token but to several tokens (i.e. with an attribute with space-separated ids), or refer to a start/end token (i.e. sentence start/end). There needs to be a way to deal with these formats.

Examples: EAF, TCF, Fryske Akademy TEI format.

query ranges

I was wondering to what extent it is possible to filter documents by range.
For instance, given a metadata field year for each doc with some integer value such as 1945, how could I index it so that I can filter for documents in the range 1940-1960?
This seems to be possible with Lucene, and I've been (very briefly) looking into the BlackLab source for a hint on how to hook that in, but without success.
I could also spend some time implementing this in case it isn't there yet but is desirable, but I would need some guidance on the BlackLab source code first.

best
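For reference, this is how a year range can be expressed in plain Lucene, assuming the field were indexed as an IntPoint (a sketch only; BlackLab's metadata indexing may use a different field type):

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

public class YearRangeFilterExample {
    /** Build a Lucene filter query for documents whose "year" field lies in [from, to]. */
    public static Query yearRange(int from, int to) {
        return IntPoint.newRangeQuery("year", from, to); // e.g. yearRange(1940, 1960)
    }
}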

Retrieve stored metadata values

I am looking for a way to retrieve possible values indexed for each metadata field. Right now I have users inputting the values they want to filter for, but this is quite error-prone. I'd prefer to have them selecting from a list for instance.

http://.../blacklab-server.../corpus gives you info about which fields are indexed but not about the values in there.

Any hints?

best

omit empty properties from results?

Searching for hits shows, for example:

<?xml version="1.0" encoding="utf-8" ?><blacklabResponse><summary><searchParam><indexname>frysk</indexname><number>1</number><patt>&quot;fabryk&quot;</patt></searchParam><searchTime>114</searchTime><countTime>114</countTime><stillCounting>false</stillCounting><numberOfHits>1052</numberOfHits><numberOfHitsRetrieved>1052</numberOfHitsRetrieved><stoppedCountingHits>false</stoppedCountingHits><stoppedRetrievingHits>false</stoppedRetrievingHits><numberOfDocs>205</numberOfDocs><numberOfDocsRetrieved>205</numberOfDocsRetrieved><windowFirstResult>0</windowFirstResult><requestedWindowSize>1</requestedWindowSize><actualWindowSize>1</actualWindowSize><windowHasPrevious>false</windowHasPrevious><windowHasNext>true</windowHasNext><docFields><titleField>title</titleField><dateField>year</dateField></docFields></summary><hits><hit><docPid>1</docPid><start>20075</start><end>20076</end><left> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">it</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">wie</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">hjir</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">in</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">folslein</w></left><match> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">fabryk</w></match><right> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">mei</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">in</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">bidriuwslieder</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">in</w> <w diminutive="" voice="" prontype="" gender="" mood="" aux="" 
valency="" convertedfrom="" degree="" lemma="" poss="" numtype="" number="" predicate="" pos="" pronoun="" person="" construction="" abbr="" tense="" verbform="" case="" inflection="">algemien</w></right></hit></hits><docInfos><docInfo pid="1"><author>R. van Tuinen, 1916, ()</author><contentViewable>false</contentViewable><fromInputFile>0203933a.xml</fromInputFile><language_variant>fry</language_variant><title>Efter it tried</title><year>1946</year><lengthInTokens>30769</lengthInTokens><mayView>false</mayView></docInfo></docInfos></blacklabResponse>
