ambiverse-nlu's Introduction

Try the demo at http://ambiversenlu.mpi-inf.mpg.de

Ambiverse Natural Language Understanding - AmbiverseNLU

The multilingual Ambiverse Natural Language Understanding suite (AmbiverseNLU) combines a number of state-of-the-art components for language understanding tasks: named entity recognition and disambiguation (or entity linking), open information extraction, entity salience estimation, and concept linking, providing a basis for text-to-knowledge applications.

Take the example sentence below:

Jack founded Alibaba with investments from SoftBank and Goldman.

For this input, AmbiverseNLU links each mention to the correct entity (e.g., "Jack" to the person Jack Ma and "Goldman" to the company Goldman Sachs) and extracts the facts that connect them; see the demo for the full output.

Quickly play with AmbiverseNLU without installing anything: demo at http://ambiversenlu.mpi-inf.mpg.de

Quick Start

Call the Web Service using Docker

Starting AmbiverseNLU as a web service (with a PostgreSQL backend) is simple using docker-compose (note that this can take a couple of hours to come up while the database is being populated):

docker-compose -f docker-compose/service-postgres.yml up

If your machine has less than 32 GB of main memory, run this configuration instead. It knows far fewer entities (some big companies and related entities) but is good enough to play around with:

docker-compose -f docker-compose/service-postgres-small.yml up

Wait for some time (depending on your internet connection and CPU speed, it can easily take more than an hour), then call the service:

curl --request POST \
  --url http://localhost:8080/factextraction/analyze \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --data '{"docId": "doc1", "text": "Jack founded Alibaba with investments from SoftBank and Goldman.", "extractConcepts": "true" }'

You can run AmbiverseNLU with different databases as backend, or start the database backend alone. Check out the different Docker configurations at https://github.com/ambiverse-nlu/dockerfiles for details.

Alternative Ways to Run

Start the Database Backend

Start the PostgreSQL backend with the fully multilingual knowledge graph (note that this can take a couple of hours to come up while the database is being populated):

docker run -d --name nlu-db-postgres \
  -p 5432:5432 \
  -e POSTGRES_DB=aida_20180120_cs_de_en_es_ru_zh_v18_db \
  -e POSTGRES_USER=ambiversenlu \
  -e POSTGRES_PASSWORD=ambiversenlu \
  ambiverse/nlu-db-postgres

If you have less than 32 GB of main memory, you can also start a PostgreSQL backend with a smaller knowledge graph, containing only a few companies and related entities, supporting only English and German:

docker run -d --name nlu-db-postgres \
  -p 5432:5432 \
  -e POSTGRES_DB=aida_20180120_b3_de_en_v18_db \
  -e POSTGRES_USER=ambiversenlu \
  -e POSTGRES_PASSWORD=ambiversenlu \
  ambiverse/nlu-db-postgres

Make sure to use aida_20180120_b3_de_en_v18_db as the value for the AIDA_CONF exports below.
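
In that case, the export in the sections below becomes:

export AIDA_CONF=aida_20180120_b3_de_en_v18_db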

Start the Web Service using Maven and Jetty from Source Code

  1. Adapt the database configuration. You need to adapt the database_aida.properties of the AIDA_CONF you are using. For example, if you are using aida_20180120_cs_de_en_es_ru_zh_v18_db as configuration, adapt src/main/config/aida_20180120_cs_de_en_es_ru_zh_v18_db/database_aida.properties and make sure that the property dataSource.serverName points to the host of the machine (or linked Docker container) that runs the database (see the example properties file below).
  2. Start the web service by executing the following script:
export AIDA_CONF=aida_20180120_cs_de_en_es_ru_zh_v18_db
./scripts/start_webservice.sh

You can adjust MAVEN_OPTS in the script if you want to change the port or the available memory. If you adapt AIDA_CONF, make sure that the PostgreSQL backend started above uses the same configuration value. The database_aida.properties configuration must point to an existing database.
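
For reference, a complete database_aida.properties for the small configuration looks like this (reproduced from a user report in the issues section below; adjust dataSource.serverName and dataSource.databaseName to match your setup):

dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
dataSource.serverName = localhost
dataSource.databaseName = aida_20180120_b3_de_en_v18
dataSource.portNumber = 5432
dataSource.user = ambiversenlu
dataSource.password = ambiversenlu
maximumPoolSize = 5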

Run a Pipeline from the Command Line

Adapt the database configuration as explained in the section above (Starting the Web Service).

The main command line interface is de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.run.UimaCommandLineProcessor. Example call using a script:

export AIDA_CONF=aida_20180120_cs_de_en_es_ru_zh_v18_db
mkdir nlu-input
echo "Jack founded Alibaba with investments from SoftBank and Goldman." > nlu-input/doc.txt
./scripts/driver/run_pipeline.sh -d nlu-input -i TEXT -l en -pip ENTITY_SALIENCE

A list of existing pipelines can be found in de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.pipelines.PipelineType, where you can also define new pipelines.

Database dumps

The database dumps can be downloaded from http://ambiversenlu-download.mpi-inf.mpg.de/. The database docker images will download them automatically.
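
For example, fetching the full dump manually might look like this (the exact filename and URL layout are assumptions based on the database names used in this README; check the download page for the actual listing):

wget http://ambiversenlu-download.mpi-inf.mpg.de/aida_20180120_cs_de_en_es_ru_zh_v18.sql.gz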

Natural Language Understanding Components

KnowNER: Named Entity Recognition

Named Entity Recognition (NER) identifies mentions of named entities (persons, organizations, locations, songs, products, ...) in text.

KnowNER works on English, Czech, German, Spanish, and Russian texts.

AmbiverseNLU provides KnowNER [1] for NER.

AIDA: Named Entity Disambiguation

Named Entity Disambiguation (NED) links mentions recognized by NER (see above) to a unique identifier. Most names are ambiguous, especially family names, and entity disambiguation resolves this ambiguity. Together with NER, NED is often referred to as entity linking.

AIDA works on English, Chinese, Czech, German, Spanish, and Russian texts.

AmbiverseNLU provides an enhanced version of AIDA [2] for NED, mapping mentions to entities registered in the Wikipedia-derived YAGO [4,5] knowledge base.

ClausIE: Open Information Extraction

Open Information Extraction (OpenIE) is the task of generating structured output from natural language text in the form of n-ary propositions, consisting of a subject, a relation, and one or more arguments. For example, for the sentence "Albert Einstein was born in Ulm", an open information extraction system will generate the extraction ("Albert Einstein", "was born in", "Ulm"), where the first element is usually referred to as the subject, the second as the relation, and the last one as the object or argument.

ClausIE works on English texts.

AmbiverseNLU provides an enhanced version of ClausIE [3] for OpenIE.

Concept Linking

Concept linking is similar to entity linking but with a focus on non-named entities (e.g., car, chair, etc.). It identifies relevant concepts in text and links them to concepts registered in the Wikipedia-derived YAGO [4,5] knowledge base.

Concept Linking works on English, Chinese, Czech, German, Spanish, and Russian texts.

AmbiverseNLU provides a new concept linking component based on the original AIDA entity disambiguation with knowledge-informed spotting.

Entity Salience

Entity Salience gives each entity in a document a score in [0,1], denoting its importance with respect to the document.

Our Entity Salience is fully multilingual.

Resource Considerations

Main Memory

The Entity/Concept Linking component has the largest main memory requirements. This is due to the large contextual and coherence models it needs to load in order to disambiguate with high accuracy.

Initially, Entity Linking loads static data into main memory, which requires a couple of GB, depending on the languages you configure it for. We estimate 8 GB for all languages to be the upper bound.

The actual requirements per document vary depending on the density of mentions and the number of candidate entities per mention, so they cannot be estimated from the length of the document alone. To be on the safe side, plan 8 GB of main memory per document.

This means that if you want to disambiguate one document at a time, you need at least 16 GB of main memory (8 GB static data + 1 × 8 GB). If you want to disambiguate 4 documents in parallel, you should be using 40 GB (8 GB + 4 × 8 GB).

Disk Space

The full AmbiverseNLU database, aida_20180120_cs_de_en_es_ru_zh_v18_db, requires 387 GB disk space.

Throughput Analysis

Benchmarking setup: (multi-threaded) Entity Linking service in a single Docker container using 4 cores and 32 GB of main memory, with the Cassandra node running on the same physical machine.

For 1,000 news articles (2,531 chars on average, 26 named entities on average), with highest-quality setting (coherence):

  • Average time per article: 2.36 seconds
  • Throughput: 1.7 documents per second (≈ 4 cores / 2.36 s per article)

Evaluation

The Entity Disambiguation accuracy on the widely used CoNLL-YAGO dataset [2] is as follows:

  • Micro-Accuracy: 84.61%
  • Macro-Accuracy: 82.67%

Advanced configuration

Configuring the environment

Most settings are bundled by folder in 'src/main/config'. Set the configuration you need using the AIDA_CONF environment variable, e.g.:

export AIDA_CONF=aida_20180120_cs_de_en_es_ru_zh_v18_db

AmbiverseNLU Pipeline

AmbiverseNLU has a flexible pipeline architecture, based on UIMA and DKPro, which allows you to specify the components you want to run. A number of useful pipelines are preconfigured, and new ones can be added easily.

Using pipelines programmatically

In the web service:

Have a look at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.service.web.resource.impl.AnalyzeResourceImpl.java which configures the web service.

As a standalone application

Have a look at de.mpg.mpi_inf.ambiversenlu.nlu.drivers.test.Disambiguation and de.mpg.mpi_inf.ambiversenlu.nlu.drivers.test.OpenIE for examples.

Creating new pipelines

Pipelines are enums in de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.pipelines.PipelineType. Each pipeline contains the order in which the components should be executed. The components are located in de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.components.Component.
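
As an illustration, adding a new pipeline could look roughly like the sketch below. This is hypothetical: the component names (LANGUAGE_DETECTION, TOKENIZATION, NER, DISAMBIGUATION) and the enum constructor are assumptions, so check the existing entries in PipelineType and Component for the real structure.

// Hypothetical sketch only -- the real PipelineType enum in
// de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.pipelines may use a
// different constructor and different component names.
import java.util.Arrays;
import java.util.List;

public enum PipelineType {

  // Components are executed in the order in which they are listed
  // (these component names are assumed, not taken from the codebase).
  MY_NEW_PIPELINE(
      Component.LANGUAGE_DETECTION,
      Component.TOKENIZATION,
      Component.NER,
      Component.DISAMBIGUATION);

  private final List<Component> order;

  PipelineType(Component... components) {
    this.order = Arrays.asList(components);
  }

  public List<Component> getOrder() {
    return order;
  }
}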

Building your own YAGO Knowledge Graph

AmbiverseNLU uses the YAGO knowledge base by default.

Building steps:

  1. Create the YAGO KG and AIDA repositories using scripts/repository_creation/createAidaRepository.py
  2. Build updated KnowNER models (optional), again using scripts/repository_creation/createAidaRepository.py passing --reuse-yago --stages KNOWNER_PREPARE_RESOURCES,KNOWNER_TRAIN_MODEL as additional parameters
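
For example, step 2 would be invoked like this (a sketch using the parameters named above; consult the script itself for the full option list):

./scripts/repository_creation/createAidaRepository.py --reuse-yago --stages KNOWNER_PREPARE_RESOURCES,KNOWNER_TRAIN_MODEL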

Building a custom Knowledge Graph

The AmbiverseNLU architecture is knowledge base agnostic, allowing you to import your own concepts and entities, or combine them with YAGO. Have a look at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.datapreparation.PrepareData and de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.datapreparation.conf.GenericPrepConf to get started.

Extending KnowNER

KnowNER provides means to add new languages. Have a look at docs/know-ner/new_corpus.md and docs/know-ner/new_language.md.

Further Information

Stay in Touch

Sign up for the AmbiverseNLU mailing list: Visit https://lists.mpi-inf.mpg.de/listinfo/ambiversenlu or send a mail to [email protected]

AmbiverseNLU License

Apache License, Version 2.0

Maintainers and Contributors

Current Maintainers (in alphabetical order):

  • Ghazale Haratinezhad Torbati
  • Johannes Hoffart
  • Luciano Del Corro

Contributors (in alphabetical order):

  • Artem Boldyrev
  • Daniel Bär
  • Dat Ba Nguyen
  • Diego Ceccarelli
  • Dominic Seyler
  • Dragan Milchevski (former maintainer)
  • Felix Keller
  • Ghazale Haratinezhad Torbati
  • Ilaria Bordino
  • Johannes Hoffart
  • Luciano Del Corro
  • Mohamed Amir Yosef
  • Tatiana Dembelova
  • Vasanth Venkatraman

References

  • [1] D. Seyler, T. Dembelova, L. Del Corro, J. Hoffart, and G. Weikum, “A Study of the Importance of External Knowledge in the Named Entity Recognition Task,” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, 2018
  • [2] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum, “Robust Disambiguation of Named Entities in Text,” Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, 2011
  • [3] L. Del Corro and R. Gemulla, “ClausIE: Clause-Based Open Information Extraction,” Proceedings of the 22nd International Conference on World Wide Web, WWW 2013, Rio de Janeiro, Brazil, 2013
  • [4] T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, “YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames,” Proceedings of the 15th International Semantic Web Conference, ISWC 2016, Kobe, Japan, 2016
  • [5] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia,” Artificial Intelligence, vol. 194, pp. 28–61, 2013

ambiverse-nlu's Issues

Postgres connection error

I'm trying to set up an Ambiverse REST endpoint on one of our servers so we can query it locally. So far the setup is rather easy, the Postgres database is running with the imported AIDA dump and the maven build succeeded.

However, when starting the web_service.sh, Ambiverse fails to establish a connection (it logs timeout warnings).

2019-03-18 16:17:01,577 [main] INFO nlu.entitylinking.EntityLinkingManager:206  - Postgres DB seems unavailable. This is expected during first startup using Docker. Waiting for 60s in addition (already waited 0s, will wait up to 10800s in total).

To figure out what's going on, I checked whether all config parameters are used by the code correctly and also added prop.put("dataSource.logWriter", new PrintWriter(System.out)); to get a more verbose log from the database connection library. Everything seems fine; however, loginTimeout=0 and socketTimeout=0 seem odd. Changing them to non-zero values doesn't change the behaviour: the library quickly logs a lot of connection attempts.
To check if anything gets to the postgres server, I configured it to log everything:

2019-03-18 16:16:52.753 CET [5573] [unknown]@[unknown] LOG:  00000: connection received: host=127.0.0.1 port=40314
2019-03-18 16:16:52.753 CET [5573] [unknown]@[unknown] LOCATION:  BackendInitialize, postmaster.c:4205
2019-03-18 16:16:52.754 CET [5573] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: connection authorized: user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18
2019-03-18 16:16:52.754 CET [5573] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  PerformAuthentication, postinit.c:279
2019-03-18 16:16:52.757 CET [5573] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: disconnection: session time: 0:00:00.004 user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18 host=127.0.0.1 port=40314
2019-03-18 16:16:52.757 CET [5573] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  log_disconnections, postgres.c:4614
2019-03-18 16:16:53.760 CET [5593] [unknown]@[unknown] LOG:  00000: connection received: host=127.0.0.1 port=40316
2019-03-18 16:16:53.760 CET [5593] [unknown]@[unknown] LOCATION:  BackendInitialize, postmaster.c:4205
2019-03-18 16:16:53.762 CET [5593] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: connection authorized: user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18
2019-03-18 16:16:53.762 CET [5593] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  PerformAuthentication, postinit.c:279
2019-03-18 16:16:53.767 CET [5593] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: disconnection: session time: 0:00:00.006 user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18 host=127.0.0.1 port=40316
2019-03-18 16:16:53.767 CET [5593] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  log_disconnections, postgres.c:4614
2019-03-18 16:16:54.770 CET [5635] [unknown]@[unknown] LOG:  00000: connection received: host=127.0.0.1 port=40318
2019-03-18 16:16:54.770 CET [5635] [unknown]@[unknown] LOCATION:  BackendInitialize, postmaster.c:4205
2019-03-18 16:16:54.772 CET [5635] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: connection authorized: user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18
2019-03-18 16:16:54.772 CET [5635] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  PerformAuthentication, postinit.c:279
2019-03-18 16:16:54.777 CET [5635] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOG:  00000: disconnection: session time: 0:00:00.006 user=ambiverse database=aida_20180120_cd_de_en_es_ru_zh_v18 host=127.0.0.1 port=40318
2019-03-18 16:16:54.777 CET [5635] ambiverse@aida_20180120_cd_de_en_es_ru_zh_v18 LOCATION:  log_disconnections, postgres.c:4614

(yes, my user isn't ambiversenlu, it's just ambiverse).
Connecting to the server from pgadmin looks exactly the same in the logs, except that it doesn't immediately disconnect again.

I have tried a lot, but I guess it boils down to the scenario described above, where the database connection isn't staying alive. I have tried the git versions tagged 1.0 and 1.1 as well as the current master.

% cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic

% psql --version                                                                                                                                                                          
psql (PostgreSQL) 11.2 (Ubuntu 11.2-1.pgdg18.04+1)

% mvn -version
Apache Maven 3.5.2
Maven home: /usr/share/maven
Java version: 1.8.0_201, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-32-generic", arch: "amd64", family: "unix"

I'm not using Docker but used the Dockerfiles as setup instructions. I'm not sure what the error is: the project compiles without errors and runs without errors until it gets caught in the loop waiting for a DB connection. The connection is established (as the logs suggest) but immediately closed. Using Docker wouldn't fix that, I guess.

Yago 4

Would there be an easy way to update this to use Yago 4?

ERROR: relation "word_ids" does not exist

Thanks for building and open sourcing this really impressive looking tool.
I am trying to set it up locally using the large database and have followed the instructions in the README, however I am getting the above error on startup. Any assistance in solving it would be great.

More comprehensive error message:

ERROR: relation "word_ids" does not exist at character 22 STATEMENT: SELECT word, id FROM word_ids WHERE word IN (E'Messi',E'juega',E'al',E'futbol',E'en',E'Barcelona',E'.',E'Maradona',E'tambien',E'jugo',E'en',E'Barcelona',E'y',E'luego',E'fue',E'al',E'Napoli',E'.') de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.access.EntityLinkingDataAccessException: org.postgresql.util.PSQLException: ERROR: relation "word_ids" does not exist

Spacy vs CoreNLP for preprocessing, Scalability vs Flexibility of AIDA

@hoffart / @dmilcevski

Good Morning/Afternoon/Evening,

As you guys are experts in this area, we would like to know the following about Ambiverse:

  1. Would you guys suggest moving from CoreNLP to spaCy for POS tagging, dependency parsing, etc.? Our server doesn't have GPUs, and we observed an order-of-magnitude slowdown on CPU vs. GPU for CoreNLP processes.
  2. Would it make sense to adopt Ambiverse for smaller domain knowledge, to process a huge document corpus? What would be the suggested scaling methods?
  3. Would it be possible to achieve a sub-second disambiguation turnaround time on the full Yago? As per our observation, Ambiverse takes on the order of 4-6 seconds for disambiguation.
  4. Is ELQ as flexible as AIDA for adapting it to domain knowledge from Wikidata?

Looking forward to your views.

Regards,
Naren M

Adding new entities

I successfully deployed and ran the NLU engine; however, I want to add some new entities to the database, and when I query for the new entities through the API, I want to be able to get them. How could I achieve that?

Start ambiverse using Maven and Jetty from Source Code

Hi,

I am new to using the web service, and I am running it with Maven and Jetty from source code.

1. I ran the web service, and it is running.
2. I tried to run the pipeline command:
run_pipeline.sh -d nlu-input -i TEXT -l en -pip ENTITY_SALIENCE

and this error is shown:
java.lang.ClassNotFoundException:de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.run.UimaCommandLineProcessor

Could anyone help, please?

Thank you

java.lang.RuntimeException: AIDA configuration should be specified as enviromental variable (AIDA_CONF) or as a system property aida.conf

I am trying to run it with Maven and Jetty but failing. These are the steps I follow:

docker run -d --name nlu-db-postgres \
  -p 5432:5432 \
  -e POSTGRES_DB=aida_20180120_b3_de_en_v18_db \
  -e POSTGRES_USER=ambiversenlu \
  -e POSTGRES_PASSWORD=ambiversenlu \
  ambiverse/nlu-db-postgres

I run the export:

export AIDA_CONF=aida_20180120_b3_de_en_v18_db

and check that the export succeeded:

user@ubuntu:~$ printenv | grep  AIDA_CONF
AIDA_CONF=aida_20180120_b3_de_en_v18_db

This is the database_aida.properties; I changed only the server name and the port number:

dataSourceClassName = org.postgresql.ds.PGSimpleDataSource
dataSource.serverName = localhost
dataSource.databaseName = aida_20180120_b3_de_en_v18
dataSource.portNumber = 5432
dataSource.user = ambiversenlu
dataSource.password = ambiversenlu
maximumPoolSize = 5

Finally, I run this command:
sudo ./scripts/start_webservice.sh

and I get this error message:

java.lang.RuntimeException: AIDA configuration should be specified as enviromental variable (AIDA_CONF) or as a system property aida.conf
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.config.ConfigUtils.loadProperties (ConfigUtils.java:83)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.config.EntityLinkingConfig.<init> (EntityLinkingConfig.java:72)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.config.EntityLinkingConfig.getInstance (EntityLinkingConfig.java:80)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.config.EntityLinkingConfig.get (EntityLinkingConfig.java:94)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.config.EntityLinkingConfig.getBoolean (EntityLinkingConfig.java:125)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.EntityLinkingManager.init (EntityLinkingManager.java:82)
    at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.service.web.ServiceContext.contextInitialized (ServiceContext.java:34)
    at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized (ContextHandler.java:890)
    at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized (ServletContextHandler.java:532)
    at org.eclipse.jetty.server.handler.ContextHandler.startContext (ContextHandler.java:853)
    at org.eclipse.jetty.servlet.ServletContextHandler.startContext (ServletContextHandler.java:344)
    at org.eclipse.jetty.webapp.WebAppContext.startWebapp (WebAppContext.java:1501)
    at org.eclipse.jetty.maven.plugin.JettyWebAppContext.startWebapp (JettyWebAppContext.java:357)
    at org.eclipse.jetty.webapp.WebAppContext.startContext (WebAppContext.java:1463)
    at org.eclipse.jetty.server.handler.ContextHandler.doStart (ContextHandler.java:785)
    at org.eclipse.jetty.servlet.ServletContextHandler.doStart (ServletContextHandler.java:261)
    at org.eclipse.jetty.webapp.WebAppContext.doStart (WebAppContext.java:545)
    at org.eclipse.jetty.maven.plugin.JettyWebAppContext.doStart (JettyWebAppContext.java:432)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start (AbstractLifeCycle.java:68)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.start (ContainerLifeCycle.java:131)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart (ContainerLifeCycle.java:113)
    at org.eclipse.jetty.server.handler.AbstractHandler.doStart (AbstractHandler.java:113)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.doStart (ContextHandlerCollection.java:167)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start (AbstractLifeCycle.java:68)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.start (ContainerLifeCycle.java:131)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart (ContainerLifeCycle.java:113)
    at org.eclipse.jetty.server.handler.AbstractHandler.doStart (AbstractHandler.java:113)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start (AbstractLifeCycle.java:68)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.start (ContainerLifeCycle.java:131)
    at org.eclipse.jetty.server.Server.start (Server.java:452)
    at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart (ContainerLifeCycle.java:105)
    at org.eclipse.jetty.server.handler.AbstractHandler.doStart (AbstractHandler.java:113)
    at org.eclipse.jetty.server.Server.doStart (Server.java:419)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start (AbstractLifeCycle.java:68)
    at org.eclipse.jetty.maven.plugin.AbstractJettyMojo.startJetty (AbstractJettyMojo.java:460)
    at org.eclipse.jetty.maven.plugin.AbstractJettyMojo.execute (AbstractJettyMojo.java:328)
    at org.eclipse.jetty.maven.plugin.JettyRunMojo.execute (JettyRunMojo.java:170)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)

What am I doing wrong here?

Example of a response?

Hi,

I'm looking at the README and see examples of HTTP requests. What about responses? I'd like to see what kind of data is returned.

Coverage/quality of aida_20180120_cs_de_en_es_ru_zh_v18.sql.gz

We installed and (superficially) tested Ambiverse using the "small" dataset (aida_20180120_b3_de_en_v18.sql.gz) but found it lagging far behind the (impressive) quality of the online demo.
We hope that we will soon have the necessary resources (RAM) in place to use the "big" dataset (aida_20180120_cs_de_en_es_ru_zh_v18.sql.gz).

But before we get too enthusiastic beforehand (and thus too disappointed afterwards): can anyone comment on what to expect from the big dataset in comparison to the online demo? Is the coverage/quality the same?
I am talking about performing named entity recognition on German and English texts, by the way.

Strategies for improving performance

We are preparing to process a large corpus (100M documents, 200 GB of data) with AmbiverseNLU, focusing initially on NER/NED using the entitylinking/analyze endpoint. Before starting the project, we need to optimize performance in order to (1) process the corpus in a reasonable amount of time and (2) minimize cost. Based on our testing to date, we have some questions about performance:

  1. Benchmarking data. Is there a standard set of test data you use for benchmarking? If not, we can build something from our data, but are there any specific properties you would recommend such a data set have?

  2. Configuration. Are there any configuration parameters that we should set to optimize performance? One specific area we thought about is languages--is there any benefit to restricting the application to English if that is the only language we need? If we were to do so, would it be difficult to add additional languages in the future?

  3. Monitoring. We are monitoring our instances using htop and console logs. Is there any additional performance monitoring capability built into AmbiverseNLU?

  4. Internal parallelization. The README.md file discusses parallel disambiguation in the context of memory requirements, implying that memory is the limiting factor in running parallel queries. Do CPU and memory usage scale roughly linearly with parallel queries? Is there any advantage to running parallel queries on one large instance, vs. running individual queries on smaller instances?

  5. External parallelization. We have been testing AmbiverseNLU on AWS EC2 m5n.4xlarge instances, which have 16 cores and 64 GB of RAM, and support 4,750 IOPS. On such a machine, where would you expect AmbiverseNLU to be resource bound? We also did some testing on an m5n.2xlarge instance, which has 16 cores and 32 GB of RAM. It ran at about 75% of the transaction rate of the larger instances, suggesting that we were not totally CPU- and/or RAM-bound. However, with the lower RAM, the instance crashed 1-2 times per hour as described in #19. If we could solve these crashes, this would be a great solution for us, as we could get 75% of the performance for 50% of the price.

  6. Deployment time. The biggest issue we have with new instances is how long it takes them to spin up. Based on what we have seen so far, we have the following questions:

    • Can we download the database image and specify the local location to PostgreSQL?
    • Can we pre-build the database and store the backup image locally, then restore it on instance start?
    • Are there other things we can do to cache resources locally or pre-build them for either the PostgreSQL or AmbiverseNLU images?

Some of these questions will likely require us to do our own benchmarking--that's fine. I just wanted to make sure we were looking at the right things before doing so. Any guidance is appreciated. In addition, we'd be happy to share our results with you, in order to come up with some typical measurements on AWS instance types.

I know this is a long list of questions. Please let me know if it is helpful to split these up into their own individual issues.

Thanks in advance for the help!

How to have an aidaFacts.tsv file created?

I have been trying to run the createAidaRepository.py script, which runs Yago3. It has been taking a lot more time than I would like on my local machine. I believe that when Yago3 runs, it is supposed to create an aidaFacts.tsv file somewhere in between, which is then used when running PrepareData.java. I was wondering what a sample aidaFacts.tsv file might look like, and whether there is any way I can decrease the runtime. I don't need a full knowledge base at the moment, as I am just trying to see whether we can build more on top of Yago3, so a small file showing the pattern of the facts would be good enough.

Thanks!

Cannot run docker container

Hi. I'm trying to run the full-sized Docker container, as in the quick start guide, but I get a bunch of errors in the console. It looks like the containers started and are even serving port 8080, but there are errors in the console and the curl request returns nothing. Also, it seems the data for the database hasn't been downloaded, as the 387 GB of disk space is still free.

Here is console output:
https://pastebin.com/XgT56FvB

What can I do?

No space left on device

Thanks for making the resource available.
I tried docker-compose but am still getting the following. Any idea?
I have over 1 TB of disk and over 100 GB of memory.

docker-compose_db_1 exited with code 1
nlu_1  | 2020-05-08 05:35:36,388 [main] INFO  nlu.entitylinking.EntityLinkingManager:206  - Postgres DB seems unavailable. This is expected during first startup using Docker. Waiting for 60s in addition (already waited 1560s, will wait up to 10800s in total).
db_1   | FATAL:  could not write lock file "postmaster.pid": No space left on device

docker-compose_nlu_1 crashes (sometimes with 137 code)

Hi.

I can't even try the quick start solution: after Jetty starts, it crashes in less than 5 minutes. I never got to a point where I can send a request.

Initially I thought that 137 is the exit code of the out-of-memory killer, but service-postgres-small behaves in exactly the same way. I even tried to spin up a 64 GB RAM VM in the cloud, but the behavior is the same.

Here is the last thing I see in logs before it restarts:

nlu_1  | [INFO] jetty-9.4.4.v20170414
nlu_1  | [INFO] Scanning elapsed time=38595ms
nlu_1  | [INFO] DefaultSessionIdManager workerName=node0
nlu_1  | [INFO] No SessionScavenger set, using defaults
nlu_1  | [INFO] Scavenging every 600000ms
nlu_1  | 2019-04-03 11:28:22,221 [main] INFO  service.web.ServiceContext:33  - Initializing the Entity Linking Manager
nlu_1  | 2019-04-03 11:28:22,246 [main] INFO  entitylinking.config.ConfigUtils:65  - Configuration 'aida_20180120_b3_de_en_v18_db' [set by environment AIDA_CONF].
docker-compose_nlu_1 exited with code 137
nlu_1  | [INFO] Scanning for projects...

The last line is the new container starting.

As I said, the docker-compose_nlu_1 exited with code 137 message is intermittent: I sometimes get it, sometimes I don't.

I'm running Windows 10, if that's somehow important (it shouldn't be).

4-5 hours to process a single request

We have set up the full Postgres database and Jetty on a server with 32 GB of RAM and 512 GB of hard disk space.
When we send a request to the URL localhost:8080/factextraction/analyze, it usually takes around 4-5 hours to process a single request. I am not sure why it is taking so long.
I also tried the pipeline, which is better than the URL, but it still takes 2 hours for a document.
With a smaller database, it takes seconds to generate a response.
The documentation says Ambiverse-nlu only needs 16 GB of main memory to process a single document.
Can you tell me the memory configuration of the demo site at https://ambiversenlu.mpi-inf.mpg.de/, or what ideal configuration we should keep on the server?
Is there any way to improve performance, and how many cores/threads can I use at a time?

Results for "Kellogg's" point to wrong Wikidata entity

When we run the following request through AmbiverseNLU:

curl --request POST --url http://localhost:8080/entitylinking/analyze --header 'accept: application/json' --header 'content-type: application/json' --data $'{"docId": "doc1", "text": "Kellogg\'s is an American multinational food manufacturing company headquartered in Battle Creek, Michigan.", "extractConcepts": "true", "language": "en"}'

we obtain the following results:

{
  "docId": "doc1",
  "language": "en",
  "matches": [
    {
      "charLength": 7,
      "charOffset": 0,
      "text": "Kellogg",
      "entity": {
        "id": "http://www.wikidata.org/entity/Q856886",
        "confidence": 0.21240166844575648
      }
    },
    {
      "charLength": 12,
      "charOffset": 83,
      "text": "Battle Creek",
      "entity": {
        "id": "http://www.wikidata.org/entity/Q810998",
        "confidence": 0.41914338870940004
      }
    },
    {
      "charLength": 8,
      "charOffset": 97,
      "text": "Michigan",
      "entity": {
        "id": "http://www.wikidata.org/entity/Q1166",
        "confidence": 0.31163620702219674
      }
    },
    {
      "charLength": 4,
      "charOffset": 39,
      "text": "food",
      "entity": {
        "id": "http://www.wikidata.org/entity/Q2095",
        "confidence": 0.5883575301498092
      }
    },
    {
      "charLength": 21,
      "charOffset": 44,
      "text": "manufacturing company",
      "entity": {
        "id": "http://www.wikidata.org/entity/Q187939",
        "confidence": 1.0
      }
    }
  ],
  "entities": [
    {
      "id": "http://www.wikidata.org/entity/Q810998",
      "name": "Battle Creek, Michigan",
      "url": "http://en.wikipedia.org/wiki/Battle%20Creek%2C%20Michigan",
      "type": "LOCATION",
      "salience": 0.2317468977580788
    },
    {
      "id": "http://www.wikidata.org/entity/Q856886",
      "name": "Kellogg's",
      "url": "http://en.wikipedia.org/wiki/Kellogg%27s",
      "type": "ORGANIZATION",
      "salience": 0.7863166812635995
    },
    {
      "id": "http://www.wikidata.org/entity/Q2095",
      "name": "Food",
      "url": "http://en.wikipedia.org/wiki/Food",
      "type": "CONCEPT",
      "salience": 0.0
    },
    {
      "id": "http://www.wikidata.org/entity/Q187939",
      "name": "Manufacturing",
      "url": "http://en.wikipedia.org/wiki/Manufacturing",
      "type": "CONCEPT",
      "salience": 0.0
    },
    {
      "id": "http://www.wikidata.org/entity/Q1166",
      "name": "Michigan",
      "url": "http://en.wikipedia.org/wiki/Michigan",
      "type": "LOCATION",
      "salience": 0.22656730944062733
    }
  ]
}

In the above result, the entity ID for "Kellogg's" is incorrect. It refers to the Wikidata entity Q856886, which is the entity representing a Wikipedia disambiguation page. It should actually refer to Q856897. We see the same issue on our local installation and on the web demo at https://ambiversenlu.mpi-inf.mpg.de.

One additional data point--we did try running this using both "Kellogg" and "Kelloggs" in addition to "Kellogg's". The results are the same in all three cases.

Updates to Knowledge graph and Entities

Thanks for providing this rich and seamless framework.
I have the following questions and requests for help. Any information is greatly appreciated.

  1. I wanted to know how frequently (at what cadence) the database/datasets from Wikidata are pulled into Ambiverse.
  2. Is there a methodical, automated way of upgrading from v18 to v19 if I am currently on the v18 dataset of the knowledge graph and entities?
  3. If I push changesets of TTL/quads from Wikidata, would the entity recognition be impacted or not?

entitylinking for chinese

Hi,
I use the API http://127.0.0.1:8080/entitylinking/analyze with

{
  "docId": "doc1",
  "text": "谷歌在硅谷(山景市)开发自动驾驶汽车。",
  "language": "zh"
}

and got

{
  "docId": "doc1",
  "language": "zh",
  "matches": []
}
How is the Chinese support? This is a simple sentence, but I get nothing. When I test this sentence on your official website demo, it only recognizes "汽车" (car).

Issue while building custom knowledge graph

Hi,

We are trying to add our business-specific knowledge to the Yago3 KG available in AmbiverseNLU by following "Building a custom Knowledge Graph" from the documentation. As the process involves creating a new schema from scratch, we downloaded Yago3 knowledge base dumps to bring in the community knowledge.

However, AmbiverseNLU expects "hasAnchorText", "hasInternalWikipediaLinkTo", and other relations. The data obtained from YAGO3 does not provide these, and the only option left for us is to extract them from a Wikipedia dump to generate the mentions dictionary.

We would appreciate your comments/suggestions on how to get the required data and point us in the right direction.

Thank you.

Sentence Annotation (EN_POS) step is slow in pipeline

Hi,

We built a custom knowledge graph and are operating the Ambiverse application on top of it, but the fact extraction (/analyze) web service is lagging.

Upon investigation, we found that the sentence tagging (EN_POS) step in the pipeline is taking a lot of time: an average of 12 seconds to tag one paragraph of 3-4 sentences, with each sentence comprising 8-10 words.

Could you please help us understand the reasons, if any, for the slowness and suggest a way to improve it?

More info:
Used pipeline: ENTITY_SALIENCE_STANFORD

Thank you.

Run with less memory

Hi,

thank you for sharing your research results openly. For testing purposes I'd like to run the web service as a Docker container on my local machine, but the memory consumption seems too high to run it on a consumer laptop. Even the small version produces a stack overflow error:

...
nlu_1  | [INFO] Started ServerConnector@2489ee11{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
nlu_1  | [INFO] Started @279688ms
nlu_1  | [INFO] Started Jetty Server
nlu_1  | 2019-04-25 08:26:45,349 [qtp165255249-17] INFO  entitylinking.processor.DocumentProcessor:112  - Initializing DocumentProcessor for type 'FACTS_WITH_SALIENCE_EN_STANFORD'.
nlu_1  | 2019-04-25 08:26:48,578 [sparkDriverActorSystem-akka.actor.default-dispatcher-2] INFO  event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3:74  - Starting remoting
nlu_1  | 2019-04-25 08:26:48,599 [sparkDriverActorSystem-akka.actor.default-dispatcher-2] INFO  event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3:74  - Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@localhost:45885]
nlu_1  | 2019-04-25 08:26:52,922 [sparkDriverActorSystem-akka.actor.default-dispatcher-5] INFO  event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3:74  - Shutting down remote daemon.
nlu_1  | 2019-04-25 08:26:52,925 [sparkDriverActorSystem-akka.actor.default-dispatcher-5] INFO  event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3:74  - Remote daemon shut down; proceeding with flushing remote transports.
nlu_1  | 2019-04-25 08:26:52,945 [sparkDriverActorSystem-akka.actor.default-dispatcher-5] INFO  event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$3:74  - Remoting shut down.
nlu_1  | 2019-04-25 08:27:22,090 [qtp165255249-17] INFO  entitylinking.preparation.Preparator:53  - Document 'doc1' prepared in 51.0ms.
nlu_1  | 2019-04-25 08:27:22,867 [qtp165255249-17] INFO  nlu.entitylinking.Disambiguator:101  - Document 'doc1' disambiguated in 771.0ms (1 chunks, 4 mentions).
nlu_1  | 2019-04-25 08:27:25,876 [qtp165255249-17] INFO  util.impl.JSR47Logger_impl:255  - Loading parser from serialized file jar:file:/root/.ivy2/cache/de.tudarmstadt.ukp.dkpro.core/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-upstream-parser-en-rnn/jars/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-upstream-parser-en-rnn-20140104.jar!/de/tudarmstadt/ukp/dkpro/core/stanfordnlp/lib/parser-en-rnn.ser.gz ...
nlu_1  | [WARNING] /factextraction/analyze
nlu_1  | javax.servlet.ServletException: javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.StackOverflowError
nlu_1  | 	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
nlu_1  | 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
nlu_1  | 	at org.eclipse.jetty.server.Server.handle(Server.java:564)
nlu_1  | 	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
nlu_1  | 	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
...

Request:

curl --request POST   --url http://localhost:8080/factextraction/analyze   --header 'accept: application/json'   --header 'content-type: application/json'   --data '{"docId": "doc1", "text": "Jack founded Alibaba with investments from SoftBank and Goldman.", "language":"en"}'

Is there a way to decrease the memory consumption, e.g. by disabling components or using a smaller knowledge base?

Best,
Malte

Web service requests sometimes throw errors or crash

First, thanks for open sourcing such an impressive suite of tools!

We're running into some errors and crashes as we put AmbiverseNLU through its paces, and I wanted to see if you could provide any insight on where we're going wrong. We're running on an EC2 m5n.4xlarge instance (16 core, 64G mem, 1TB HDD, 18k IOPS).

The installation went without incident. We installed the Docker containers individually per the instructions on Docker Hub. The docker-compose script timed out on AmbiverseNLU waiting for the PostgreSQL install to complete, but we opted to reinstall manually instead of changing the timeout.

The web service starts up fine and processes most requests without incident. However, we are seeing some issues:

  1. One or more of the following errors occur when processing some requests:
    2020-01-10 07:17:37,550 [qtp1359212194-54088] INFO custom.aes.ClausIEAnalysisEngine:119 - Exception at ClausIEAnalysisEngine: null
    These seem innocuous--AmbiverseNLU still returns a valid set of entities. Does this indicate a bigger issue? We did wonder if it indicates some kind of memory leak or other problem that could contribute to the below issues.

  2. ~3% of the documents we submit fail. This occurs after 16 consecutive errors as shown in #1 above, at which point the stack dump shown below prints and AmbiverseNLU responds with an empty JSON object ("{}"). This problem is 100% reproducible with specific text, and relatively minor changes to the text can make the problem disappear.

2020-01-10 07:25:45,412 [qtp1359212194-49842] ERROR util.impl.JSR47Logger_impl:324  - Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
        at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.custom.aes.ClausIEAnalysisEngine.process(ClausIEAnalysisEngine.java:121)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
...deleted...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
2020-01-10 07:25:45,413 [qtp1359212194-49842] ERROR util.impl.JSR47Logger_impl:324  - Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
        at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.uima.custom.aes.ClausIEAnalysisEngine.process(ClausIEAnalysisEngine.java:121)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
...deleted...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.processor.UnprocessableDocumentException
        at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.processor.DocumentProcessor.process(DocumentProcessor.java:72)
        at de.mpg.mpi_inf.ambiversenlu.nlu.entitylinking.service.web.resource.impl.AnalyzeResourceWithFactsImpl.postAnalyze(AnalyzeResourceWithFactsImpl.java:95)
...deleted...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
        at java.lang.Thread.run(Thread.java:748)
2020-01-10 07:25:45,415 [qtp1359212194-49842] ERROR resource.impl.AnalyzeResourceWithFactsImpl:112  - (221) AnalyzeInput{docId='doc1', language='en', text='...document text...', confidenceThreshold=null, coherentDocument=null, annotatedMentions=null}
2020-01-10 07:25:45,416 [qtp1359212194-49842] ERROR resource.impl.AnalyzeResourceWithFactsImpl:113  - ERROR MESSAGE: null
  3. We are running documents through AmbiverseNLU as fast as it can process them--roughly every 3-4 seconds. As it is running, we use docker logs and htop to monitor the system in real time. After many hours of processing, the process runs out of memory and core dumps. The beginning of the log is shown below. We have not noticed a memory leak, but the restarts have occurred overnight so we can't be sure. We are now logging free memory, so we should be able to provide more definitive data if a memory leak is occurring.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 3775922176 bytes for committing reserved memory.
# Can not save log file, dump to screen..
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 3775922176 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2643), pid=1, tid=0x00007f625577e700
#
# JRE version: OpenJDK Runtime Environment (8.0_141-b15) (build 1.8.0_141-8u141-b15-1~deb9u1-b15)
# Java VM: OpenJDK 64-Bit Server VM (25.141-b15 mixed mode linux-amd64 )
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f6b56600000, 3775922176, 0) failed; error='Cannot allocate memory' (errno=12)
# Core dump written. Default location: /ambiverse-nlu/core or core.1
#
  4. After the above restart occurs, the ConceptSpotters are initialized and dummy docs are run for each language, and the Jetty service starts up. After this, six errors occur similar to:
    [WARNING] Illegal character 0x16 in state=START for buffer HeapByteBuffer@7a822f53[p=1,l=215,c=8192,r=214]={\x16<<<\x03\x01\x00\xD2\x01\x00\x00\xCe\x03\x03\xF1\xA1kE\xB4\xB8\xE8...\x03\x02\x03\x03\x02\x01\x02\x02\x02\x03\x00\x0f\x00\x01\x01>>>v20170414)\r\n\r\n:80...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
    At this point, when we send a request to the web service, the ConceptSpotters are reinitialized. This process takes about 10 minutes, after which the response is returned. After that, requests are processed normally. This does not seem to occur when the server is initially installed and brought up. We were wondering if there is some cleanup we need to do on startup to prevent this, and if it's a problem.

Apologies for the involved issue. Please let us know if there is any additional information or logs you need to look into these. Also, if it is easier for us to split this into four separate issues, we'll be happy to do so. Thanks in advance!

Error in run docker-compose

After running docker-compose -f docker-compose/service-postgres.yml up
I got an error like the following:
[ERROR] Failed to execute goal on project ambiversenlu: Could not resolve dependencies for project de.mpg.mpi-inf.ambiversenlu:ambiversenlu:jar:1.0.1-SNAPSHOT: Failed to collect dependencies at org.dkpro.tc:dkpro-tc-api:jar:0.9.0: Failed to read artifact descriptor for org.dkpro.tc:dkpro-tc-api:jar:0.9.0: Could not transfer artifact org.dkpro:dkpro-parent-pom:pom:14 from/to ukp-oss-snapshots (http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-ukp-snapshots-local/): Failed to transfer file: http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-ukp-snapshots-local/org/dkpro/dkpro-parent-pom/14/dkpro-parent-pom-14.pom. Return code is: 409 , ReasonPhrase:Conflict. -> [Help 1]

Can anyone help me?

curl: (56) Recv failure: Connection reset by peer

My machine has less than 32 GB of main memory, so I run the configuration with fewer entities:

docker-compose -f docker-compose/service-postgres-small.yml up

My problem is that the command seems to loop infinitely; it has been running for about 2 days and never finishes.

In the meantime, I am trying to call the service, with no success:

curl --request POST \
  --url http://localhost:8080/factextraction/analyze \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --data '{"docId": "doc1", "text": "Jack founded Alibaba with investments from SoftBank and Goldman.", "extractConcepts": "true" }

This is the message I got after the curl operation:
curl: (56) Recv failure: Connection reset by peer

This is the command that I execute, and its output:

user@ubuntu:~$ sudo docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED      STATUS              PORTS                    NAMES
b4c43de28601   ambiverse/ambiverse-nlu     "mvn jetty:run"          3 days ago   Up About a minute   0.0.0.0:8080->8080/tcp   docker-compose_nlu_1
568a2921f50f   ambiverse/nlu-db-postgres   "docker-entrypoint.s…"   3 days ago   Up 43 minutes       5432/tcp                 docker-compose_db_1

The command sudo docker logs -f 568a2921f50f gives this result, to better help you figure out my problem:

STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
FATAL:  connection to client lost
STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
LOG:  could not send data to client: Broken pipe
STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
FATAL:  connection to client lost
STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  could not send data to client: Connection reset by peer
STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
FATAL:  connection to client lost
STATEMENT:  SELECT DISTINCT tmp.mention FROM entity_languages, (SELECT DISTINCT entity, mention FROM dictionary WHERE entitytype=1) as tmp WHERE entity_languages.entity=tmp.entity AND language=0
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction
LOG:  unexpected EOF on client connection with an open transaction

Why can't I call the service at http://localhost:8080/factextraction/analyze, and why is the endpoint not up?

And another simple question: is there any size restriction for the aida_20180120_b3_de_en_v18_db database?
