
babelente's People

Contributors

irishx, proycon


babelente's Issues

Multiple newlines cause failures

Hi,

I also get an error if a file contains multiple consecutive newlines:
Traceback (most recent call last):
  File "C:\Programs\Python36\Scripts\babelente-script.py", line 11, in <module>
    load_entry_point('BabelEnte==0.4.0', 'console_scripts', 'babelente')()
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in main
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 514, in <listcomp>
    sourceentities = [ entity for entity in findentities(sourcelines, args.sourcelang, args, None if cache is None else cache['source']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 126, in findentities
    raise e
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 119, in findentities
    entity['linenr'], entity['offset'] = resolveoffset(offsetmap, entity['start'], lines, entity)
  File "C:\Programs\Python36\lib\site-packages\babelente-0.4.0-py3.6.egg\babelente\babelente.py", line 70, in resolveoffset
    raise ValueError("Resolved offset does not match text " + str(offset) + "; minoffset=" + str(minoffset) + ", maxoffset=" + str(maxoffset) + ", lines=" + str(len(offsetmap)) )
ValueError: Resolved offset does not match text 15; minoffset=0, maxoffset=51, lines=5

Attached is a file which causes this error. Replacing double newlines with single ones resolves the issue.

testtxt.txt
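
A minimal workaround sketch (my own, not part of babelente) that collapses the consecutive blank lines before running babelente on a file:

    import re
    import sys

    # Workaround sketch: collapse runs of blank lines into a single newline
    # before handing the file to babelente, as described above.
    def collapse_blank_lines(path_in, path_out):
        with open(path_in, encoding="utf-8") as infile:
            text = infile.read()
        # Two or more consecutive newlines (possibly with whitespace in
        # between) become a single newline.
        text = re.sub(r"\n\s*\n+", "\n", text)
        with open(path_out, "w", encoding="utf-8") as outfile:
            outfile.write(text)

    if __name__ == "__main__":
        collapse_blank_lines(sys.argv[1], sys.argv[2])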

Regards,

Marc

Windows install fails

Hi Maarten,

Babelente won't install on Windows as is:
Collecting babelente
  Downloading BabelEnte-0.4.0.tar.gz
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 22, in <module>
      long_description=read('README.rst'),
    File "C:\Users\Marc\AppData\Local\Temp\pip-build-cagenn93\babelente\setup.py", line 10, in read
      return open(os.path.join(os.path.dirname(__file__), fname)).read()
    File "c:\programs\python36\lib\encodings\cp1252.py", line 23, in decode
      return codecs.charmap_decode(input,self.errors,decoding_table)[0]
  UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3922: character maps to <undefined>

This is a common issue on Windows, caused by UTF-8 characters in README.rst that setup.py tries to decode as CP1252. Replacing the curly quotes in README.rst with straight quotes solved the issue for me.
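
A sketch of the more general fix, assuming the read() helper in setup.py looks like the traceback suggests: open README.rst with an explicit UTF-8 encoding instead of relying on the platform default.

    import io
    import os

    # Sketch of a more robust read() helper for setup.py: decode the file
    # as UTF-8 explicitly instead of using the Windows default code page
    # (cp1252), which cannot decode curly quotes.
    def read(fname):
        return io.open(os.path.join(os.path.dirname(__file__), fname),
                       encoding="utf-8").read()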

Regards,

Marc

subtle scoring implementation issue / comparison with Wikifier

The entity translation recall of a text expresses the number of correctly translated Wikipedia entities relative to the overall number of Wikipedia entities present.

Example (PT):
Again Running Back Wilson did not make it happen.
Novamente corrida para trás Wilson não fez acontecer. (constructed erroneous translation)

en-found link:
https://en.wikipedia.org/wiki/Running_Back

pt missing link:
https://pt.wikipedia.org/wiki/Running_Back

Let's assume the erroneous translation could have led to a wiki page that is not equivalent, e.g. https://pt.wikipedia.org/wiki/Corrida.
This is not possible with the Wikifier implementation, as we focus on the English entities only as our ground truth: how many of those can we retrieve?

#compute entity translation recall

#computing wikipedia target language coverage
How many of the found English topics have an equivalent page in the target language?

my $wiki_target_coverage = $nbrSynsets / ($nbrSynsets+$missed);

#scoring of entity translation recall

$entity_translation_recall= $foundpairs / ($nbrSynsets);

- how many of the LANG entities for which we actually know that a corresponding Wikipedia page exists do we retrieve in the text?

#practical implementation is this:
How many of the entities that have a corresponding LANGLINK Wikipedia page can we actually find in the LANG text/lemmas (= matching found pairs)?

- a subtly different question is: how many of the LANG entities with a Wikipedia page that we found in the LANG text actually match a corresponding wiki page in English?

a) total of found pairs (en-lang) in the text / total number of LANG wiki links retrieved in the text (including erroneous ones)

vs

b) total of found pairs (en-lang) in the text / total number of possible pairs (en-lang) in Wikipedia

Because of the way this is implemented, I currently do option b): we take English as the ground truth.

With Babely we have a different way of finding entities and can compute both a) and b).

So let's implement both, so that we can also compare them.
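
A minimal Python sketch of both variants, to make the comparison concrete; the variable names are illustrative, not babelente's internals:

    # Illustrative sketch of the two recall variants discussed above.
    # All counts are assumed to come from the entity matching step.

    def entity_translation_recall(found_pairs, retrieved_target_links, possible_pairs):
        """Return both recall variants as an (a, b) tuple.

        found_pairs            -- en-lang pairs actually found in the text
        retrieved_target_links -- all LANG wiki links retrieved in the text,
                                  including erroneous ones (variant a denominator)
        possible_pairs         -- all possible en-lang pairs according to
                                  Wikipedia langlinks (variant b denominator)
        """
        recall_a = found_pairs / retrieved_target_links if retrieved_target_links else 0.0
        recall_b = found_pairs / possible_pairs if possible_pairs else 0.0
        return recall_a, recall_b

    # Target-language coverage as in the Wikifier script above:
    def wiki_target_coverage(nbr_synsets, missed):
        total = nbr_synsets + missed
        return nbr_synsets / total if total else 0.0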

Failure on coverage computation

Traceback (most recent call last):
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/bin/babelente", line 11, in <module>
    load_entry_point('BabelEnte==0.1.2', 'console_scripts', 'babelente')()
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 255, in main
    evaluation = evaluate(sourceentities, targetentities, sourcelines, targetlines)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 167, in evaluate
    coverage = compute_coverage_line(sourcelines[linenr], linenr, sourceentities)
  File "/Users/irishendrickx/Work/TraMOOC/Virtualenvs/Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 94, in compute_coverage_line
    charmask[i] = 1
IndexError: index 103 is out of bounds for axis 0 with size 103

babelpy parameters

Add the parameters as variables to the scripts so that we can tune the program for optimal results.
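
A sketch of what that could look like with argparse; the option names below are hypothetical placeholders, not babelpy's actual parameters:

    import argparse

    # Hypothetical sketch: expose tunable parameters as command-line options
    # instead of hard-coding them. The option names are placeholders and
    # would have to be mapped onto whatever parameters babelpy/BabelFy
    # actually accepts.
    def add_tuning_options(parser):
        parser.add_argument("--match-threshold", type=float, default=0.0,
                            help="hypothetical disambiguation score threshold")
        parser.add_argument("--annotation-type", type=str, default="ALL",
                            help="hypothetical annotation type selector")

    parser = argparse.ArgumentParser(description="babelente tuning sketch")
    add_tuning_options(parser)
    args = parser.parse_args()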

precision micro/macro

The current computation of precision uses macro-precision (babelente.py L.284):

    if overallprecision:
        evaluation['precision'] = sum(overallprecision) / len(overallprecision)

It would be more logical to also compute micro-precision.
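
A sketch of the difference, assuming that besides the per-line precisions we also keep per-line counts of correct and found entities (the names are illustrative):

    # Illustrative sketch of macro vs. micro precision.
    # macro: average the per-line precisions (what L.284 does now)
    # micro: pool the counts over all lines first, then divide once

    def macro_precision(per_line_precisions):
        if not per_line_precisions:
            return 0.0
        return sum(per_line_precisions) / len(per_line_precisions)

    def micro_precision(per_line_correct, per_line_found):
        total_correct = sum(per_line_correct)
        total_found = sum(per_line_found)
        return total_correct / total_found if total_found else 0.0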

error URI request too big

For Bulgarian input text, the characters apparently make the request URI too large.

urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large

Full trace:

Extracting target entities...
chunk #0 -- querying BabelFy
Traceback (most recent call last):
  File "/vol/customopt/lamachine2/bin/babelente", line 11, in <module>
    sys.exit(main())
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in main
    targetentities = [ entity for  entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 534, in <listcomp>
    targetentities = [ entity for  entity in findentities(targetlines, args.targetlang, args, None if cache is None else cache['target']) if entity['isEntity'] and 'babelSynsetID' in entity ] #with sanity check
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelente/babelente.py", line 113, in findentities
    babelclient.babelfy(text)
  File "/vol/customopt/lamachine2/lib/python3.5/site-packages/babelpy/babelfy.py", line 99, in babelfy
    response = urlopen(request)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Large
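
A workaround sketch (an assumption, not babelente's actual logic): send the text to BabelFy in smaller batches so the request URI stays under the server limit; the offsets returned by BabelFy would still have to be remapped per batch.

    # Workaround sketch: query BabelFy with smaller batches of lines so the
    # request URI stays under the server's length limit. Note that the
    # character offsets returned by BabelFy are relative to each batch and
    # would need to be remapped onto the original lines.
    MAX_CHARS = 3000  # conservative guess, not an official BabelFy limit

    def batches(lines, max_chars=MAX_CHARS):
        batch, size = [], 0
        for line in lines:
            if batch and size + len(line) > max_chars:
                yield batch
                batch, size = [], 0
            batch.append(line)
            size += len(line) + 1  # +1 for the newline joining the lines
        if batch:
            yield batch

    # for batch in batches(targetlines):
    #     babelclient.babelfy("\n".join(batch))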

offset problem

Why does the resolveoffset function not work?

  Babelente/lib/python3.6/site-packages/babelente/babelente.py", line 52, in resolveoffset
    raise ValueError("Unable to resolve offset " + str(offset))
ValueError: Unable to resolve offset 1783

I'm testing with a PT file now:

/vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/Tune/PT/*sentences

printing wishlist

Besides the scores and the full JSON output, it would be very helpful to get a focused list of the matching pairs, like this:

Printing a list of all matching items (tab-separated):

    sentence-nr    babelsynsetid    source-text-id    target-text-id
    684            bn:00019586n     classroom         Klassenraum

I added a code example to babelente.py, but I do not know exactly how the entities are structured.
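
A sketch of such a printer; the keys 'linenr' and 'babelSynsetID' appear in the tracebacks above, while 'text' and the (source, target) pairing structure are assumptions:

    # Sketch: print matched source/target entity pairs as tab-separated lines:
    #   sentence-nr <TAB> babelsynsetid <TAB> source-text <TAB> target-text
    # The keys 'linenr' and 'babelSynsetID' are visible in the tracebacks;
    # 'text' and the (source, target) pairing are assumptions.
    def print_matches(matched_pairs):
        for source_entity, target_entity in matched_pairs:
            print("\t".join([
                str(source_entity["linenr"]),
                source_entity["babelSynsetID"],
                source_entity["text"],
                target_entity["text"],
            ]))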
