proycon / babelente
BabelEnte: Entity Extractor and Translator using BabelFy and Babelnet.org
Hi Maarten,
Babelente won't install on Windows as is:
Collecting babelente
Downloading BabelEnte-0.4.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 22, in <module>
long_description=read('README.rst'),
File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 10, in read
return open(os.path.join(os.path.dirname(__file__), fname)).read()
File "c:\programs\python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3922: character maps to <undefined>
This is a common issue on Windows, caused by UTF-8 characters in README.rst which setup.py tries to interpret as CP1252. Replacing the curly quotes in README.rst with straight quotes solved the issue for me.
Regards,
Marc
The current computation of precision uses macro-precision (babelente.py, l. 284):
if overallprecision:
evaluation['precision'] = sum(overallprecision) / len(overallprecision)
It would be more logical to compute micro-precision too.
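A minimal sketch of what micro-averaged precision could look like, assuming each evaluated line yields a (matches, predicted) count pair rather than a precomputed per-line ratio (the actual data structure in babelente.py may differ):

```python
def micro_precision(counts):
    """counts: list of (matches, predicted) tuples, one per line.

    Micro-precision pools the counts over all lines before dividing,
    so lines with many entities weigh more than lines with few.
    """
    total_matches = sum(m for m, _ in counts)
    total_predicted = sum(p for _, p in counts)
    return total_matches / total_predicted if total_predicted else 0.0
```

Unlike the macro variant above, a line with a single (correct) entity no longer counts as heavily as a line with ten.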
Besides the scores and the full JSON output, it would be very helpful to get a focused list of the matching pairs, like this:
printing a list of all matching items (tab separated):
sentence-nr babelsynsetid source-text-id target-text-id
684 bn:00019586n classroom Klassenraum
I added a code example to babelente.py, but I do not know exactly how the entities are structured.
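Something along these lines would print the requested tab-separated list; note that the field names below (`linenr`, `babelSynsetID`, `sourcetext`, `targettext`) are guesses, not the actual structure used by babelente:

```python
def print_matches(matches):
    """Print matching pairs as: sentence-nr  babelsynsetid  source-text  target-text.

    Assumes each match is a dict with the (hypothetical) keys below.
    """
    for m in matches:
        print("\t".join([str(m["linenr"]), m["babelSynsetID"],
                         m["sourcetext"], m["targettext"]]))
```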
Traceback (most recent call last):
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/bin/babelente", line 11, in
load_entry_point('BabelEnte==0.1.2', 'console_scripts', 'babelente')()
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 255, in main
evaluation = evaluate(sourceentities, targetentities, sourcelines, targetlines)
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 167, in evaluate
coverage = compute_coverage_line(sourcelines[linenr], linenr, sourceentities)
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 94, in compute_coverage_line
charmask[i] = 1
IndexError: index 103 is out of bounds for axis 0 with size 103
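A speculative guess at the cause: an entity's end offset may point one past the last character of the line (exclusive end), so writing into the character mask overruns it. A defensive sketch, clamping the range before marking (field semantics assumed, not verified against babelente's actual code):

```python
def mark_entity(charmask, start, end):
    """Mark the characters covered by an entity in a 0/1 mask.

    Clamps start/end so an offset at or past len(charmask)
    cannot raise an IndexError.
    """
    start = max(start, 0)
    end = min(end, len(charmask))  # guard against out-of-range offsets
    for i in range(start, end):
        charmask[i] = 1
    return charmask
```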
The entity translation recall of a text expresses the ratio of correctly translated Wikipedia entities to the overall number of Wikipedia entities present.
example PT
Again Running Back Wilson did not make it happen.
Novamente corrida para trás Wilson não fez acontecer. (constructed erroneous translation)
en-found link:
https://en.wikipedia.org/wiki/Running_Back
pt missing link:
https://pt.wikipedia.org/wiki/Running_Back
Let's assume the erroneous translation could have led to a Wikipedia page that is not equivalent, e.g. https://pt.wikipedia.org/wiki/Corrida
This is not possible with the Wikifier implementation, as we focus on the English entities only as our ground truth: how many of those can we retrieve?
# compute entity translation recall
# wikipedia target-language coverage:
# how many of the found English topics have an equivalent page in the target language?
my $wiki_target_coverage = $nbrSynsets / ($nbrSynsets + $missed);
# scoring of entity translation recall
my $entity_translation_recall = $foundpairs / $nbrSynsets;
- How many of the LANG entities for which we actually know that a corresponding Wikipedia page exists do we retrieve in the text?
The practical implementation is: how many of the entities that have a corresponding LANGLINK Wikipedia page can we actually find in the LANG text/lemmas (= matching found pairs)?
- A subtly different question is: how many of the LANG entities with a Wikipedia page that we found in the LANG text actually match a corresponding wiki page in English?
a) total of found pairs (en-lang) in text / total nbr of LANG retrieved wiki-links in text (incl erroneous one)
vs
b) total of found pairs (en-lang) in text / total of possible pairs (en-lang) of wikipedia
Because of the way it is implemented, I currently do option b): we take English as the ground truth.
With BabelFy we have a different way of finding entities and can compute both a) and b),
so let's implement both so that we can also compare them.
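The difference between the two variants can be illustrated with toy numbers (invented for illustration, not measured):

```python
# a) found pairs / all target-language wiki links retrieved in the text
# b) found pairs / all possible en-target pairs known from Wikipedia
found_pairs = 40
retrieved_target_links = 50   # includes erroneous links found in the text
possible_pairs = 80           # all en-target pairs Wikipedia offers

recall_a = found_pairs / retrieved_target_links
recall_b = found_pairs / possible_pairs
```

With these numbers, a) rewards the system for what it retrieved (0.8), while b) measures it against everything Wikipedia could have offered (0.5).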
Add the parameters as variables to the scripts so that we can tune the program for optimal results.
For Bulgarian input text, the URL-encoded Cyrillic characters apparently make the request too large:
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large
Full trace:
Extracting target entities...
chunk #0 -- querying BabelFy
Traceback (most recent call last):
File "/vol/customopt/lamachine2/bin/babelente", line 11, in <module>
sys.exit(main())
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in main
targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in <listcomp>
targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 113, in findentities
babelclient.babelfy(text)
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelpy/babelfy.py", line 99, in babelfy
response = urlopen(request)
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large
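A possible workaround sketch: split the input into smaller batches before each BabelFy query so the request URI stays short. The 4000-character budget below is a guess, not a documented BabelFy limit, and would need tuning (Cyrillic characters expand roughly threefold under URL encoding):

```python
def chunk_lines(lines, max_chars=4000):
    """Yield batches of lines whose combined length stays under max_chars.

    A single over-long line is still yielded on its own; splitting
    within a line would break the offset bookkeeping.
    """
    batch, size = [], 0
    for line in lines:
        if batch and size + len(line) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line) + 1  # +1 for the newline separator
    if batch:
        yield batch
```

Each yielded batch could then be joined and sent as one BabelFy request instead of the current single large chunk.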
Why does the resolveoffset function not work?
Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 52, in resolveoffset
raise ValueError("Unable to resolve offset " + str(offset))
ValueError: Unable to resolve offset 1783
I'm testing with a PT file now:
/vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/Tune/PT/*sentences
Can we adjust babelpy to only return the longest or best-matching entity? Is this already solvable with the parameter 'cands=TOP'? See http://babelfy.org/guide
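If the API parameter does not cover it, a post-filter on our side could keep only the longest entity among overlapping candidates. A sketch, assuming each entity dict carries character offsets under the (guessed) keys `start` and `end`:

```python
def longest_matches(entities):
    """Keep only the longest entity among overlapping candidates.

    Sorts by start offset, longest span first, then greedily keeps
    each entity that begins after the previously kept one ends.
    """
    entities = sorted(entities,
                      key=lambda e: (e["start"], -(e["end"] - e["start"])))
    kept, last_end = [], -1
    for e in entities:
        if e["start"] > last_end:
            kept.append(e)
            last_end = e["end"]
    return kept
```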
A CLAM webservice of BabelEnte has been promised for the TraMOOC project
Hi,
I also get an error if a file contains multiple consecutive newlines:
Traceback (most recent call last):
File "C:\Programs\Python36\Scripts\babelente-script.py", line 11, in <module>
load_entry_point('BabelEnte==0.4.0', 'console_scripts', 'babelente')()
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in main
sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in <listcomp>
sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 126, in findentities
raise e
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 119, in findentities
entity['linenr'], entity['offset'] = resolveoffset(offsetmap, entity['start'], lines, entity)
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 70, in resolveoffset
raise ValueError("Resolved offset does not match text " + str(offset) + "; minoffset=" + str(minoffset) + ", maxoffset=" + str(maxoffset) + ", lines=" + str(len(offsetmap)) )
ValueError: Resolved offset does not match text 15; minoffset=0, maxoffset=51, lines=5
Attached is a file which causes this error. Replacing double newlines with single ones resolves the issue.
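Until the offset handling is fixed, the workaround can be scripted: collapse runs of blank lines before feeding the file to babelente (a preprocessing sketch, not part of babelente itself):

```python
import re

def collapse_blank_lines(text):
    """Replace runs of two or more newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", text)
```

Note this changes the line numbering of the input, so reported `linenr` values refer to the collapsed file.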
Regards,
Marc