
babelente's Introduction

BabelEnte: Entity extractioN, Translation and implicit Evaluation using BabelFy

This is an entity extractor, translator and evaluator that uses BabelFy. Initially developed for the TraMOOC project. It is written in Python 3.


Installation

pip3 install babelente

or clone this GitHub repository and run python3 setup.py install; optionally prepend the commands with sudo for a global installation.

Usage

You will need a BabelFy API key; get it from BabelNet.org.

See babelente -h for extensive usage instructions, explaining all the options.

For simple entity recognition/linking on plain text documents, invoke BabelEnte as follows. This will produce JSON output with all entities found:

$ babelente -k "YOUR-API-KEY" -s en -S sentences.en.txt > output.json

BabelEnte comes with FoLiA support, allowing you to read FoLiA documents and produce enriched FoLiA documents that include the detected/linked entities. To this end, simply specify the language of your FoLiA document(s) and pass them to babelente as follows; multiple documents are allowed:

$ babelente -k "YOUR-API-KEY" -s en yourdocument.folia.xml

Each FoLiA document will be output to a new file that includes all the entities. Entities will be explicitly linked to BabelNet and DBpedia where possible. At the same time, the stdout output again consists of a JSON object containing all found entities.

Note that this method does not currently do any translation of entities (I'm open to feature requests if you want this).

If you start from plain text but want to produce FoLiA output, then first use a tokeniser such as ucto to tokenise your document and convert it to FoLiA, prior to passing it to BabelEnte, as sketched below.
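For example, a possible invocation (a sketch only; the exact ucto flags and language code may differ per version, see ucto -h):

$ ucto -L eng -X yourdocument.txt yourdocument.folia.xml
$ babelente -k "YOUR-API-KEY" -s en yourdocument.folia.xml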

Usage for TraMOOC

This software can be used for implicit evaluation of translations, as it was designed within the scope of the TraMOOC project.

To evaluate a translation (English to Portuguese in this example), with JSON output to stdout:

$ babelente -k "YOUR-API-KEY" -s en -t pt -S sentences.en.txt -T sentences.pt.txt > output.json

To re-evaluate using an existing output file:

$ babelente --evalfile output.json -S sentences.en.txt -T sentences.pt.txt > newoutput.json

Evaluation

The evaluation produces several metrics.

  • source coverage: the number of characters covered by found source entities divided by the total number of characters in the source text
  • target coverage: the number of characters covered by found target entities divided by the total number of characters in the target text
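As an illustration, here is a minimal sketch of how such a coverage figure can be computed for a single line. This is not BabelEnte's actual implementation (although its compute_coverage_line uses a similar character mask); the start/end character offsets per entity are an assumption about the data:

    def coverage_of_line(line, entities):
        """Fraction of a line's characters covered by found entities (sketch)."""
        charmask = [0] * len(line)
        for entity in entities:
            # 'start' and 'end' are assumed character offsets within the line
            for i in range(entity['start'], min(entity['end'], len(line))):
                charmask[i] = 1
        return sum(charmask) / len(line) if line else 0.0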

Precision and Recall

In the standard scoring method we count each occurrence of an entity and compute the scores listed below. We also implemented the option to compute the scores over entity sets instead (see further down).

  • micro precision: the sum of found equivalent entities in the target and source texts divided by the total number of found entities in the target language
  • macro precision: the sum of found equivalent entities in the target and source texts divided by the number of target sentences

  • micro recall: the sum of found equivalent entities in target and source divided by the total number of found entities in the source language for which an equivalent link existed in the target language. In other words: how many of the hypothetically possible matches were found? Note that this is a computationally intensive operation and needs to be enabled explicitly with the command-line parameter --recall.
  • macro recall: the sum of found equivalent entities in target and source texts divided by the number of source sentences.
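A minimal sketch of the two precision variants as defined above (not BabelEnte's actual code; it only assumes a list of matched source/target entity pairs and a list of entities found in the target):

    def micro_precision(matches, targetentities):
        """Matched entity pairs over all entities found in the target text."""
        return len(matches) / len(targetentities) if targetentities else 0.0

    def macro_precision(matches, num_target_sentences):
        """Matched entity pairs divided by the number of target sentences."""
        return len(matches) / num_target_sentences if num_target_sentences else 0.0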

Computing recall and precision over entity sets

Instead of counting every occurring entity (“tokens”), we can also count each entity once (“types” or “sets”). This can be a more useful performance indicator when the input text contains many repetitions or slight variations of the same sentences. This option is activated with the parameter --nodup (no duplicates); a deduplication sketch follows below.
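For illustration, deduplication for set-based scoring could look like the following sketch. The babelSynsetID field does occur in BabelEnte's entities (it appears in the tracebacks further down); everything else is an assumption, not the actual implementation:

    def dedup(entities):
        """Keep each entity type only once, keyed on its BabelNet synset ID."""
        seen = {}
        for entity in entities:
            seen.setdefault(entity['babelSynsetID'], entity)
        return list(seen.values())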

License

GNU - GPL 3.0

babelente's People

Contributors

irishx, proycon

babelente's Issues

babelpy parameters

Add the BabelFy parameters as variables to the scripts so that we can tune the program for optimal results.

precision micro/macro

The current computation of precision uses macro precision (babelente.py, line 284):

    if overallprecision:
        evaluation['precision'] = sum(overallprecision) / len(overallprecision)

It would be more logical to also compute micro precision.
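A sketch of what the micro variant could look like alongside it (the variable names here are hypothetical, not BabelEnte's actual ones):

    # micro precision: all matched entities over all entities found in the target
    if numtargetentities:
        evaluation['microprecision'] = nummatches / numtargetentities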

Windows install fails

Hi Maarten,

Babelente won't install on Windows as is:
Collecting babelente
  Downloading BabelEnte-0.4.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 22, in <module>
        long_description=read('README.rst'),
      File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 10, in read
        return open(os.path.join(os.path.dirname(__file__), fname)).read()
      File "c:\programs\python36\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3922: character maps to <undefined>

This is a common issue on Windows, caused by UTF-8 characters in README.rst that setup.py tries to decode as CP1252. Replacing the curly quotes in README.rst with straight quotes solved the issue for me.

Regards,

Marc

printing wishlist

Besides the scores and the full JSON output, it would be very helpful to get a focused list of the matching pairs, like this:

printing a list of all matching items (tab separated):

    sentence-nr  babelsynsetid  source-text-id  target-text-id
    684          bn:00019586n   classroom       Klassenraum

I added a code example to babelente.py, but I do not know exactly how the entities are structured.
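For illustration, such a printer over the JSON output could look like the sketch below. Apart from babelSynsetID (which does occur in the output), the key names here are guesses, since the exact structure of the entities is the open question of this issue:

    import json
    import sys

    with open(sys.argv[1]) as f:
        data = json.load(f)

    # 'matches' and its inner keys are hypothetical placeholders
    for match in data.get('matches', []):
        print('\t'.join([str(match['linenr']), match['babelSynsetID'],
                         match['sourcetext'], match['targettext']]))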

subtle scoring implementation issue / comparison with Wikifier

The Entity translation recall of a text expresses the number of correctly translated Wikipedia entities in comparison to the overall number of present Wikipedia entities.

Example (EN→PT):
Again Running Back Wilson did not make it happen.
Novamente corrida para trás Wilson não fez acontecer. (constructed erroneous translation)

Link found for EN:
https://en.wikipedia.org/wiki/Running_Back

Link missing for PT:
https://pt.wikipedia.org/wiki/Running_Back

Let's assume the erroneous translation could have led to a Wikipedia page that is not equivalent, e.g. https://pt.wikipedia.org/wiki/Corrida. Detecting this is not possible with the Wikifier implementation, as we focus only on the English entities as our ground truth: how many of those can we retrieve?

# compute entity translation recall

# computing Wikipedia target-language coverage:
# how many of the found English topics have an equivalent page in the target language?

    my $wiki_target_coverage = $nbrSynsets / ($nbrSynsets + $missed);

# scoring of entity translation recall

    $entity_translation_recall = $foundpairs / $nbrSynsets;

How many of the LANG entities for which we actually know that a corresponding Wikipedia page exists do we retrieve in the text?

The practical implementation is this: how many of the entities that have a corresponding LANGLINK Wikipedia page can we actually find in the LANG text/lemmas (= matching found pairs)?

A subtly different question is: how many of the LANG entities with a Wikipedia page that we found in the LANG text actually match a corresponding wiki page in English?

a) the total number of found pairs (en-LANG) in the text divided by the total number of LANG wiki links retrieved in the text (including erroneous ones)

vs.

b) the total number of found pairs (en-LANG) in the text divided by the total number of possible pairs (en-LANG) on Wikipedia

Because of the way this is implemented, I currently do option b): we take English as the ground truth.

With BabelFy we have a different way of finding entities and can compute both a) and b).

So let's implement both, so that we can also compare them.

error URI request too big

For Bulgarian input text, the requests appear to become too large:

urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large

full trace:

Extracting target entities...
chunk #0 -- querying BabelFy
Traceback (most recent call last):
  File "/vol/customopt/lamachine2/bin/babelente", line 11, in <module>
    sys.exit(main())
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in main
    targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in <listcomp>
    targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 113, in findentities
    babelclient.babelfy(text)
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelpy/babelfy.py", line 99, in babelfy
    response = urlopen(request)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large

Multiple newlines cause failures

Hi,

I also get an error if a file contains multiple consecutive newlines:
Traceback (most recent call last):
  File "C:\Programs\Python36\Scripts\babelente-script.py", line 11, in <module>
    load_entry_point('BabelEnte==0.4.0', 'console_scripts', 'babelente')()
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in main
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in <listcomp>
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 126, in findentities
    raise e
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 119, in findentities
    entity['linenr'], entity['offset'] = resolveoffset(offsetmap, entity['start'], lines, entity)
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 70, in resolveoffset
    raise ValueError("Resolved offset does not match text " + str(offset) + "; minoffset=" + str(minoffset) + ", maxoffset=" + str(maxoffset) + ", lines=" + str(len(offsetmap)) )
ValueError: Resolved offset does not match text 15; minoffset=0, maxoffset=51, lines=5

Attached is a file which causes this error. Replacing double newlines with single ones resolves the issue.

testtxt.txt

Regards,

Marc

offset problem

Why does the resolveoffset function not work?

  File "Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 52, in resolveoffset
    raise ValueError("Unable to resolve offset " + str(offset))
ValueError: Unable to resolve offset 1783

I'm testing with the PT files now:

/vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/Tune/PT/*sentences

Failure on coverage computation

Traceback (most recent call last):
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/bin/babelente", line 11, in <module>
    load_entry_point('BabelEnte==0.1.2', 'console_scripts', 'babelente')()
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 255, in main
    evaluation = evaluate(sourceentities, targetentities, sourcelines, targetlines)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 167, in evaluate
    coverage = compute_coverage_line(sourcelines[linenr], linenr, sourceentities)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 94, in compute_coverage_line
    charmask[i] = 1
IndexError: index 103 is out of bounds for axis 0 with size 103
