
babelente's Introduction

BabelEnte: Entity extractioN, Translation and implicit Evaluation using BabelFy

This is an entity extractor, translator and evaluator that uses BabelFy. Initially developed for the TraMOOC project. It is written in Python 3.


Installation

pip3 install babelente

or clone this GitHub repository and run python3 setup.py install; optionally prepend the commands with sudo for a global installation.

Usage

You will need a BabelFy API key; get it from BabelNet.org.

See babelente -h for extensive usage instructions, explaining all the options.

For simple entity recognition/linking on plain text documents, invoke BabelEnte as follows. This will produce JSON output with all entities found:

$ babelente -k "YOUR-API-KEY" -s en -S sentences.en.txt > output.json

BabelEnte comes with FoLiA support, allowing you to read FoLiA documents and produce enriched FoLiA documents that include the detected/linked entities. To this end, simply specify the language of your FoLiA document(s) and pass them to babelente as follows; multiple documents are allowed:

$ babelente -k "YOUR-API-KEY" -s en yourdocument.folia.xml

Each FoLiA document will be output to a new file that includes all the entities. Entities will be explicitly linked to BabelNet and DBpedia where possible. At the same time, the stdout output again consists of a JSON object containing all found entities.

Note that this method does not currently do any translation of entities (I'm open to feature requests if you want this).

If you start from plain text but want to produce FoLiA output, then first use a tokeniser such as ucto to tokenise your document and convert it to FoLiA, prior to passing it to BabelEnte, as sketched below.
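For example, a possible invocation (a sketch only; the exact ucto flags and language code may differ per version, see ucto -h):

$ ucto -L eng -X yourdocument.txt yourdocument.folia.xml
$ babelente -k "YOUR-API-KEY" -s en yourdocument.folia.xml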

Usage for TraMOOC

This software can be used for implicit evaluation of translations, as it was designed within the scope of the TraMOOC project.

To evaluate a translation (English to Portuguese in this example), with JSON output to stdout:

$ babelente -k "YOUR-API-KEY" -s en -t pt -S sentences.en.txt -T sentences.pt.txt > output.json

To re-evaluate using an existing output file:

$ babelente --evalfile output.json -S sentences.en.txt -T sentences.pt.txt > newoutput.json

Evaluation

The evaluation produces several metrics.

  • source coverage: the number of characters covered by found source entities divided by the total number of characters in the source text
  • target coverage: the number of characters covered by found target entities divided by the total number of characters in the target text
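As an illustration, here is a minimal sketch of how such a coverage figure can be computed for a single line. This is not BabelEnte's actual implementation (although its compute_coverage_line uses a similar character mask); the start/end character offsets per entity are an assumption about the data:

    def coverage_of_line(line, entities):
        """Fraction of a line's characters covered by found entities (sketch)."""
        charmask = [0] * len(line)
        for entity in entities:
            # 'start' and 'end' are assumed character offsets within the line
            for i in range(entity['start'], min(entity['end'], len(line))):
                charmask[i] = 1
        return sum(charmask) / len(line) if line else 0.0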

Precision and Recall

In the standard scoring method we count each occurrence of an entity and compute the scores listed below. We also implemented the option to compute the scores over entity sets instead (see further down).

  • micro precision: the sum of found equivalent entities in the target and source texts divided by the total number of found entities in the target language
  • macro precision: the sum of found equivalent entities in the target and source texts divided by the number of target sentences

  • micro recall: the sum of found equivalent entities in target and source divided by the total number of found entities in the source language for which an equivalent link existed in the target language. In other words: how many of the hypothetically possible matches were found? Note that this is a computationally intensive operation and needs to be enabled explicitly with the command-line parameter --recall.
  • macro recall: the sum of found equivalent entities in target and source texts divided by the number of source sentences.
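A minimal sketch of the two precision variants as defined above (not BabelEnte's actual code; it only assumes a list of matched source/target entity pairs and a list of entities found in the target):

    def micro_precision(matches, targetentities):
        """Matched entity pairs over all entities found in the target text."""
        return len(matches) / len(targetentities) if targetentities else 0.0

    def macro_precision(matches, num_target_sentences):
        """Matched entity pairs divided by the number of target sentences."""
        return len(matches) / num_target_sentences if num_target_sentences else 0.0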

Computing recall and precision over entity sets

Instead of counting every occurring entity (“tokens”), we can also count each entity once (“types” or “sets”). This can be a more useful performance indicator when the input text contains many repetitions or slight variations of the same sentences. This option is activated with the parameter --nodup (no duplicates); a deduplication sketch follows below.
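For illustration, deduplication for set-based scoring could look like the following sketch. The babelSynsetID field does occur in BabelEnte's entities (it appears in the tracebacks further down); everything else is an assumption, not the actual implementation:

    def dedup(entities):
        """Keep each entity type only once, keyed on its BabelNet synset ID."""
        seen = {}
        for entity in entities:
            seen.setdefault(entity['babelSynsetID'], entity)
        return list(seen.values())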

License

GNU - GPL 3.0

babelente's People

Contributors

irishx, proycon

babelente's Issues

babelpy parameters

Add the BabelFy parameters as variables to the scripts so that we can tune the program for optimal results.

precision micro/macro

The current computation of precision uses macro precision (babelente.py, line 284):

    if overallprecision:
        evaluation['precision'] = sum(overallprecision) / len(overallprecision)

It would be more logical to also compute micro precision.
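A sketch of what the micro variant could look like alongside it (the variable names here are hypothetical, not BabelEnte's actual ones):

    # micro precision: all matched entities over all entities found in the target
    if numtargetentities:
        evaluation['microprecision'] = nummatches / numtargetentities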

Windows install fails

Hi Maarten,

Babelente won't install on Windows as is:
Collecting babelente
  Downloading BabelEnte-0.4.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 22, in <module>
        long_description=read('README.rst'),
      File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 10, in read
        return open(os.path.join(os.path.dirname(__file__), fname)).read()
      File "c:\programs\python36\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3922: character maps to <undefined>

This is a common issue on Windows, caused by UTF-8 characters in README.rst that setup.py tries to decode as CP1252. Replacing the curly quotes in README.rst with straight quotes solved the issue for me.

Regards,

Marc

printing wishlist

Besides the scores and the full JSON output, it would be very helpful to get a focused list of the matching pairs, like this:

printing a list of all matching items (tab separated):

    sentence-nr  babelsynsetid  source-text-id  target-text-id
    684          bn:00019586n   classroom       Klassenraum

I added a code example to babelente.py, but I do not know exactly how the entities are structured.
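For illustration, such a printer over the JSON output could look like the sketch below. Apart from babelSynsetID (which does occur in the output), the key names here are guesses, since the exact structure of the entities is the open question of this issue:

    import json
    import sys

    with open(sys.argv[1]) as f:
        data = json.load(f)

    # 'matches' and its inner keys are hypothetical placeholders
    for match in data.get('matches', []):
        print('\t'.join([str(match['linenr']), match['babelSynsetID'],
                         match['sourcetext'], match['targettext']]))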

subtle scoring implementation issue / comparison with Wikifier

The Entity translation recall of a text expresses the number of correctly translated Wikipedia entities in comparison to the overall number of present Wikipedia entities.

Example (EN→PT):
Again Running Back Wilson did not make it happen.
Novamente corrida para trás Wilson não fez acontecer. (constructed erroneous translation)

Link found for EN:
https://en.wikipedia.org/wiki/Running_Back

Link missing for PT:
https://pt.wikipedia.org/wiki/Running_Back

Let's assume the erroneous translation could have led to a Wikipedia page that is not equivalent, e.g. https://pt.wikipedia.org/wiki/Corrida. Detecting this is not possible with the Wikifier implementation, as we focus only on the English entities as our ground truth: how many of those can we retrieve?

# compute entity translation recall

# computing Wikipedia target-language coverage:
# how many of the found English topics have an equivalent page in the target language?

    my $wiki_target_coverage = $nbrSynsets / ($nbrSynsets + $missed);

# scoring of entity translation recall

    $entity_translation_recall = $foundpairs / $nbrSynsets;

How many of the LANG entities for which we actually know that a corresponding Wikipedia page exists do we retrieve in the text?

The practical implementation is this: how many of the entities that have a corresponding LANGLINK Wikipedia page can we actually find in the LANG text/lemmas (= matching found pairs)?

A subtly different question is: how many of the LANG entities with a Wikipedia page that we found in the LANG text actually match a corresponding wiki page in English?

a) the total number of found pairs (en-LANG) in the text divided by the total number of LANG wiki links retrieved in the text (including erroneous ones)

vs.

b) the total number of found pairs (en-LANG) in the text divided by the total number of possible pairs (en-LANG) on Wikipedia

Because of the way this is implemented, I currently do option b): we take English as the ground truth.

With BabelFy we have a different way of finding entities and can compute both a) and b).

So let's implement both, so that we can also compare them.

error URI request too big

For Bulgarian input text, the requests appear to become too large:

urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large

full trace:

Extracting target entities...
chunk #0 -- querying BabelFy
Traceback (most recent call last):
  File "/vol/customopt/lamachine2/bin/babelente", line 11, in <module>
    sys.exit(main())
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in main
    targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in <listcomp>
    targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 113, in findentities
    babelclient.babelfy(text)
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelpy/babelfy.py", line 99, in babelfy
    response = urlopen(request)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large

Multiple newlines cause failures

Hi,

I also get an error if a file contains multiple consecutive newlines:
Traceback (most recent call last):
  File "C:\Programs\Python36\Scripts\babelente-script.py", line 11, in <module>
    load_entry_point('BabelEnte==0.4.0', 'console_scripts', 'babelente')()
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in main
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in <listcomp>
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 126, in findentities
    raise e
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 119, in findentities
    entity['linenr'], entity['offset'] = resolveoffset(offsetmap, entity['start'], lines, entity)
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 70, in resolveoffset
    raise ValueError("Resolved offset does not match text " + str(offset) + "; minoffset=" + str(minoffset) + ", maxoffset=" + str(maxoffset) + ", lines=" + str(len(offsetmap)) )
ValueError: Resolved offset does not match text 15; minoffset=0, maxoffset=51, lines=5

Attached is a file which causes this error. Replacing double newlines with single ones resolves the issue.

testtxt.txt

Regards,

Marc

offset problem

Why does the resolveoffset function not work?

  File "Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 52, in resolveoffset
    raise ValueError("Unable to resolve offset " + str(offset))
ValueError: Unable to resolve offset 1783

I'm testing with the PT files now:

/vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/Tune/PT/*sentences

Failure on coverage computation

Traceback (most recent call last):
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/bin/babelente", line 11, in <module>
    load_entry_point('BabelEnte==0.1.2', 'console_scripts', 'babelente')()
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 255, in main
    evaluation = evaluate(sourceentities, targetentities, sourcelines, targetlines)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 167, in evaluate
    coverage = compute_coverage_line(sourcelines[linenr], linenr, sourceentities)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 94, in compute_coverage_line
    charmask[i] = 1
IndexError: index 103 is out of bounds for axis 0 with size 103
