proycon / babelente
BabelEnte: Entity Extractor and Translator using BabelFy and Babelnet.org
Hi Maarten,
Babelente won't install on Windows as is:
Collecting babelente
Downloading BabelEnte-0.4.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 22, in <module>
long_description=read('README.rst'),
File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 10, in read
return open(os.path.join(os.path.dirname(__file__), fname)).read()
File "c:\programs\python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3922: character maps to <undefined>
This is a common issue on Windows, caused by UTF-8 characters in README.rst which setup.py tries to interpret as CP1252. Replacing the curly quotes in README.rst with straight quotes solved the issue for me.
Regards,
Marc
The current computation of precision uses macro-precision (babelente.py, l. 284):
if overallprecision:
evaluation['precision'] = sum(overallprecision) / len(overallprecision)
It would be more logical to compute micro-precision too.
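A minimal sketch of what micro-averaged precision could look like, assuming each evaluated line yields a (matches, predicted) count pair rather than a precomputed per-line ratio (the actual data structure in babelente.py may differ):

```python
def micro_precision(counts):
    """counts: list of (matches, predicted) tuples, one per line.

    Micro-precision pools the counts over all lines before dividing,
    so lines with many entities weigh more than lines with few.
    """
    total_matches = sum(m for m, _ in counts)
    total_predicted = sum(p for _, p in counts)
    return total_matches / total_predicted if total_predicted else 0.0
```

Unlike the macro variant above, a line with a single (correct) entity no longer counts as heavily as a line with ten.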
Besides the scores and the full JSON output, it would be very helpful to get a focused list of the matching pairs, like this:
printing a list of all matching items (tab separated):
sentence-nr babelsynsetid source-text-id target-text-id
684 bn:00019586n classroom Klassenraum
I added a code example to babelente.py, but I do not know exactly how the entities are structured.
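Something along these lines would print the requested tab-separated list; note that the field names below (`linenr`, `babelSynsetID`, `sourcetext`, `targettext`) are guesses, not the actual structure used by babelente:

```python
def print_matches(matches):
    """Print matching pairs as: sentence-nr  babelsynsetid  source-text  target-text.

    Assumes each match is a dict with the (hypothetical) keys below.
    """
    for m in matches:
        print("\t".join([str(m["linenr"]), m["babelSynsetID"],
                         m["sourcetext"], m["targettext"]]))
```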
Traceback (most recent call last):
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/bin/babelente", line 11, in
load_entry_point('BabelEnte==0.1.2', 'console_scripts', 'babelente')()
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 255, in main
evaluation = evaluate(sourceentities, targetentities, sourcelines, targetlines)
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 167, in evaluate
coverage = compute_coverage_line(sourcelines[linenr], linenr, sourceentities)
File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 94, in compute_coverage_line
charmask[i] = 1
IndexError: index 103 is out of bounds for axis 0 with size 103
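A speculative guess at the cause: an entity's end offset may point one past the last character of the line (exclusive end), so writing into the character mask overruns it. A defensive sketch, clamping the range before marking (field semantics assumed, not verified against babelente's actual code):

```python
def mark_entity(charmask, start, end):
    """Mark the characters covered by an entity in a 0/1 mask.

    Clamps start/end so an offset at or past len(charmask)
    cannot raise an IndexError.
    """
    start = max(start, 0)
    end = min(end, len(charmask))  # guard against out-of-range offsets
    for i in range(start, end):
        charmask[i] = 1
    return charmask
```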
The entity translation recall of a text expresses the ratio of correctly translated Wikipedia entities to the overall number of Wikipedia entities present.
example PT
Again Running Back Wilson did not make it happen.
Novamente corrida para trás Wilson não fez acontecer. (constructed erroneous translation)
en-found link:
https://en.wikipedia.org/wiki/Running_Back
pt missing link:
https://pt.wikipedia.org/wiki/Running_Back
Let's assume the erroneous translation could have led to a Wikipedia page that is not equivalent, e.g. https://pt.wikipedia.org/wiki/Corrida
This is not possible with the Wikifier implementation, as we focus on the English entities only as our ground truth: how many of those can we retrieve?
# compute entity translation recall
# wikipedia target-language coverage:
# how many of the found English topics have an equivalent page in the target language?
my $wiki_target_coverage = $nbrSynsets / ($nbrSynsets + $missed);
# scoring of entity translation recall
my $entity_translation_recall = $foundpairs / $nbrSynsets;
- How many of the LANG entities for which we actually know that a corresponding Wikipedia page exists do we retrieve in the text?
The practical implementation is: how many of the entities that have a corresponding LANGLINK Wikipedia page can we actually find in the LANG text/lemmas (= matching found pairs)?
- A subtly different question is: how many of the LANG entities with a Wikipedia page that we found in the LANG text actually match a corresponding wiki page in English?
a) total of found pairs (en-lang) in text / total nbr of LANG retrieved wiki-links in text (incl erroneous one)
vs
b) total of found pairs (en-lang) in text / total of possible pairs (en-lang) of wikipedia
Because of the way it is implemented, I currently do option b): we take English as the ground truth.
With BabelFy we have a different way of finding entities and can compute both a) and b),
so let's implement both so that we can also compare them.
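The difference between the two variants can be illustrated with toy numbers (invented for illustration, not measured):

```python
# a) found pairs / all target-language wiki links retrieved in the text
# b) found pairs / all possible en-target pairs known from Wikipedia
found_pairs = 40
retrieved_target_links = 50   # includes erroneous links found in the text
possible_pairs = 80           # all en-target pairs Wikipedia offers

recall_a = found_pairs / retrieved_target_links
recall_b = found_pairs / possible_pairs
```

With these numbers, a) rewards the system for what it retrieved (0.8), while b) measures it against everything Wikipedia could have offered (0.5).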
Add the parameters as variables to the scripts so that we can tune the program for optimal results.
For Bulgarian input text, the URL-encoded Cyrillic characters apparently make the request too large:
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large
Full trace:
Extracting target entities...
chunk #0 -- querying BabelFy
Traceback (most recent call last):
File "/vol/customopt/lamachine2/bin/babelente", line 11, in <module>
sys.exit(main())
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in main
targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in <listcomp>
targetentities = [ entity for entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 113, in findentities
babelclient.babelfy(text)
File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelpy/babelfy.py", line 99, in babelfy
response = urlopen(request)
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large
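A possible workaround sketch: split the input into smaller batches before each BabelFy query so the request URI stays short. The 4000-character budget below is a guess, not a documented BabelFy limit, and would need tuning (Cyrillic characters expand roughly threefold under URL encoding):

```python
def chunk_lines(lines, max_chars=4000):
    """Yield batches of lines whose combined length stays under max_chars.

    A single over-long line is still yielded on its own; splitting
    within a line would break the offset bookkeeping.
    """
    batch, size = [], 0
    for line in lines:
        if batch and size + len(line) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line) + 1  # +1 for the newline separator
    if batch:
        yield batch
```

Each yielded batch could then be joined and sent as one BabelFy request instead of the current single large chunk.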
Why does the resolveoffset function not work?
Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 52, in resolveoffset
raise ValueError("Unable to resolve offset " + str(offset))
ValueError: Unable to resolve offset 1783
I'm testing with a PT file now:
/vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/Tune/PT/*sentences
Can we adjust babelpy to only return the longest or best-matching entity? Is this already solvable with the parameter 'cands=TOP'? See http://babelfy.org/guide
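If the API parameter does not cover it, a post-filter on our side could keep only the longest entity among overlapping candidates. A sketch, assuming each entity dict carries character offsets under the (guessed) keys `start` and `end`:

```python
def longest_matches(entities):
    """Keep only the longest entity among overlapping candidates.

    Sorts by start offset, longest span first, then greedily keeps
    each entity that begins after the previously kept one ends.
    """
    entities = sorted(entities,
                      key=lambda e: (e["start"], -(e["end"] - e["start"])))
    kept, last_end = [], -1
    for e in entities:
        if e["start"] > last_end:
            kept.append(e)
            last_end = e["end"]
    return kept
```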
A CLAM webservice of BabelEnte has been promised for the TraMOOC project
Hi,
I also get an error if a file contains multiple consecutive newlines:
Traceback (most recent call last):
File "C:\Programs\Python36\Scripts\babelente-script.py", line 11, in <module>
load_entry_point('BabelEnte==0.4.0', 'console_scripts', 'babelente')()
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in main
sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in <listcomp>
sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 126, in findentities
raise e
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 119, in findentities
entity['linenr'], entity['offset'] = resolveoffset(offsetmap, entity['start'], lines, entity)
File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 70, in resolveoffset
raise ValueError("Resolved offset does not match text " + str(offset) + "; minoffset=" + str(minoffset) + ", maxoffset=" + str(maxoffset) + ", lines=" + str(len(offsetmap)) )
ValueError: Resolved offset does not match text 15; minoffset=0, maxoffset=51, lines=5
Attached is a file which causes this error. Replacing double newlines with single ones resolves the issue.
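Until the offset handling is fixed, the workaround can be scripted: collapse runs of blank lines before feeding the file to babelente (a preprocessing sketch, not part of babelente itself):

```python
import re

def collapse_blank_lines(text):
    """Replace runs of two or more newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", text)
```

Note this changes the line numbering of the input, so reported `linenr` values refer to the collapsed file.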
Regards,
Marc