
polyglot's People

Contributors

abosamoor, adamhadani, alantian, ambivalentno, dchaplinsky, dobatymo, emilstenstrom, hgluka, htaghizadeh, janissl, ljleppan, mattiasostmar, merlijnwajer, mhnnunes, michael-k, peblair, phanein, proteusiq, pudo, sjlongland, svisser, syscmp, vtsuei2, ziliwang


polyglot's Issues

Unable to download models, 401 unauthorized error

When I try to download any model I get a 401 unauthorized error.

(default)vagrant@vagrant-ubuntu-trusty-64:~$ !polyglot download morph2.en
polyglot download embeddings2.en download morph2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/vagrant/polyglot_data...
[polyglot_data] Error downloading u'embeddings2.en' from
[polyglot_data] <https://www.googleapis.com/download/storage/v1/b
[polyglot_data] /polyglot-models/o/embeddings2%2Fen%2Fembeddings_p
[polyglot_data] kl.tar.bz2?generation=1406139163483000&alt=media>:
[polyglot_data] HTTP Error 401: Unauthorized

Data and NLP Models

Thanks for the awesome lib @aboSamoor. I'm taking a look at your load module and it seems that it provides the ability to unpickle data into different objects depending on the NLP task in question. I was wondering if there is any documentation on:

  • how these different models were created (data sources, etc.)
  • whether the original data sources are available (in some format)
  • benchmarks of the performance of the different models
  • whether we can re-train some models in case we need to.

Thanks again, and sorry if GitHub is not the right place to put these questions.

Available packages to download

I'm trying to list all the available packages using downloader.list(show_packages=False); however, only two are retrieved (LANG:zhc, TASK:tsne2/zhc).
I've also tried to manually set the URL that points to the package index file, but the default link provided in the documentation (http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml) does not exist anymore.

Force Language

Is there any way to force polyglot to use a language? Sometimes when you run the NER and the phrase has foreign-language entities (e.g. "Albert Einstein"), the language is classified wrong; for that example it says the language is "de".

Is there any way to force a specific language?

The only way I've found is to create a custom method and override def ne_chunker(self): to accept a language_code parameter, and then change def entities(self) to accept the language code too.

from polyglot.text import Text, Chunk
from polyglot.tag.base import get_ner_tagger

def extract(self, phrase, language='pt'):
    # Transform phrase into `polyglot.text.Text`
    text = Text(phrase)

    # Create a named entity chunker for the forced language
    ne_chunker = get_ner_tagger(lang=language)

    # Extract Entities
    start = 0
    end = 0
    prev_tag = u'O'
    chunks = []
    for i, (w, tag) in enumerate(ne_chunker.annotate(text.words)):
      if tag != prev_tag:
        if prev_tag == u'O':
          start = i
        else:
          chunks.append(Chunk(text.words[start: i], start, i, tag=prev_tag,
                              parent=text))
        prev_tag = tag
    if tag != u'O':
      chunks.append(Chunk(text.words[start: i+1], start, i+1, tag=tag,
                          parent=text))
    return chunks

Installation failed

I want to use your polyglot NER tool, but I get this error and, although I have tried repeatedly to fix it, I could not. I need this tool a lot.
Can you help me, please?
[error]
Thank you very much

Model download fails (HTTP 403)

This may be temporary or not, but the model downloads are failing currently:

[polyglot_data] Downloading collection 'TASK:embeddings2'
[polyglot_data]    |
[polyglot_data]    | Downloading package embeddings2.fy to
[polyglot_data]    |     /home/ubuntu/polyglot_data...
[polyglot_data]    | Error downloading 'embeddings2.fy' from
[polyglot_data]    |     <https://www.googleapis.com/download/storage/
[polyglot_data]    |     v1/b/polyglot-models/o/embeddings2%2Ffy%2Femb
[polyglot_data]    |     eddings_pkl.tar.bz2?generation=14061390326200
[polyglot_data]    |     00&alt=media>:   HTTP Error 401: Unauthorized

No requirements on python3

When running pip3 install polyglot==16.7.4 no requirements get installed.

See (this uses a fresh python 3.5 docker image with nothing else installed):

$ docker run --rm -it python:3.5 pip3 install polyglot==16.7.4
Collecting polyglot==16.7.4
  Downloading polyglot-16.7.4.tar.gz (126kB)
    100% |████████████████████████████████| 133kB 1.9MB/s
Installing collected packages: polyglot
  Running setup.py install for polyglot ... done
Successfully installed polyglot-16.7.4

For comparison with polyglot==15.10.3:

$ docker run --rm -it python:3.5 pip3 install polyglot==15.10.3
Collecting polyglot==15.10.3
  Downloading polyglot-15.10.03-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 1.3MB/s
Collecting PyICU>=1.8 (from polyglot==15.10.3)
  Downloading PyICU-1.9.3.tar.gz (179kB)
    100% |████████████████████████████████| 184kB 3.0MB/s
Collecting morfessor>=2.0.2a1 (from polyglot==15.10.3)
  Downloading Morfessor-2.0.2alpha3.tar.gz
Collecting futures>=2.1.6 (from polyglot==15.10.3)
  Downloading futures-3.0.5.tar.gz
Collecting pycld2>=0.3 (from polyglot==15.10.3)
  Downloading pycld2-0.31.tar.gz (14.3MB)
    100% |████████████████████████████████| 14.3MB 101kB/s
Collecting six>=1.7.3 (from polyglot==15.10.3)
  Downloading six-1.10.0-py2.py3-none-any.whl
Collecting wheel>=0.23.0 (from polyglot==15.10.3)
  Downloading wheel-0.29.0-py2.py3-none-any.whl (66kB)
    100% |████████████████████████████████| 71kB 11.5MB/s
Installing collected packages: PyICU, morfessor, futures, pycld2, six, wheel, polyglot
  Running setup.py install for PyICU ... error
    Complete output from command /usr/local/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-68sk36vg/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-ptkc3_tq-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    copying icu.py -> build/lib.linux-x86_64-3.5
    copying PyICU.py -> build/lib.linux-x86_64-3.5
    copying docs.py -> build/lib.linux-x86_64-3.5
    running build_ext
    building '_icu' extension
    creating build/temp.linux-x86_64-3.5
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python3.5m -c _icu.cpp -o build/temp.linux-x86_64-3.5/_icu.o -DPYICU_VER="1.9.3"
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from _icu.cpp:27:0:
    common.h:90:28: fatal error: unicode/utypes.h: No such file or directory
     #include <unicode/utypes.h>
                                ^
    compilation terminated.
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
Command "/usr/local/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-68sk36vg/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-ptkc3_tq-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-68sk36vg/PyICU/

Bug with supported_languages_table in python3.4

In [1]: from polyglot.downloader import downloader

In [2]: downloader.supported_languages_table(u"ner2")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-e3f6f145dbd1> in <module>()
----> 1 downloader.supported_languages_table(u"ner2")

/Users/dchaplinsky/Projects/et-lemma/venv/lib/python3.4/site-packages/polyglot/downloader.py in supported_languages_table(self, task, cols)
    977   def supported_languages_table(self, task, cols=3):
    978     languages = self.supported_languages(task)
--> 979     return pretty_list(languages)
    980 
    981 

/Users/dchaplinsky/Projects/et-lemma/venv/lib/python3.4/site-packages/polyglot/utils.py in pretty_list(items, cols)
     70   col_width = u"{" + u":<" + str(width) + u"} "
     71   for i, lang in enumerate(items):
---> 72     lang = lang.decode(u"utf-8")
     73     if len(lang) > width:
     74       lang = lang[:width-3] + "..."

AttributeError: 'str' object has no attribute 'decode'
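A Python-2/3-safe guard would be enough here; below is a hypothetical sketch of such a fix for the pretty_list helper shown in the traceback (the _as_text name is mine, not the project's), decoding only when the item is still a byte string:

def _as_text(lang):
    # Only byte strings need decoding; Python 3 already hands us str.
    if isinstance(lang, bytes):
        return lang.decode("utf-8")
    return lang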

Download seems to depend on numpy

When installing polyglot from scratch everything works fine (icu, cld2 and morfessor are correctly installed), but when I use it to download models I get the following error:

Traceback (most recent call last):
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/bin/polyglot", line 9, in <module>
    load_entry_point('polyglot==15.10.3', 'console_scripts', 'polyglot')()
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 558, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2682, in load_entry_point
    return ep.load()
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2355, in load
    return self.resolve()
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2361, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/polyglot/__main__.py", line 21, in <module>
    from polyglot.load import load_morfessor_model
  File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/polyglot/load.py", line 8, in <module>
    import numpy as np
ImportError: No module named numpy

Maybe numpy should be included as a dep too.
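For reference, a hypothetical setup.py fragment with numpy declared alongside the other runtime dependencies; the version pins below are copied from the pip output of the 15.10.3 wheel elsewhere on this page and are only an illustration, not the project's actual metadata:

# Hypothetical install_requires list; versions mirror what the 15.10.3 wheel
# pulls in, with numpy added so the downloader entry point can import.
install_requires = [
    "PyICU>=1.8",
    "morfessor>=2.0.2a1",
    "futures>=2.1.6",
    "pycld2>=0.3",
    "six>=1.7.3",
    "wheel>=0.23.0",
    "numpy",
]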

Word.morphemes / morfessor UnicodeError

From the tutorial:

#!/usr/bin/env python3
import polyglot
from polyglot.text import Text, Word
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

When I try to run it (after calling polyglot.downloader.downloader.download('morph2.en')):

Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print(word.morphemes)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 286, in morphemes
    words, score = self.morpheme_analyzer.viterbi_segment(self.string)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 282, in morpheme_analyzer
    return load_morfessor_model(lang=self.language)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/load.py", line 142, in load_morfessor_model
    model = io.read_any_model(tmp_file_.name)
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 203, in read_any_model
    model.load_segmentations(self.read_segmentation_file(file_name))
  File "/usr/local/lib/python3.4/dist-packages/morfessor/baseline.py", line 487, in load_segmentations
    for count, segmentation in segmentations:
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 53, in read_segmentation_file
    for line in self._read_text_file(file_name):
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 240, in _read_text_file
    encoding = self._find_encoding(file_name)
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 320, in _find_encoding
    raise UnicodeError("Can not determine encoding of input files")
UnicodeError: Can not determine encoding of input files

Versions:

$ python3 --version
Python 3.4.2

$ pip3 show polyglot | grep Version
Version: 16.07.04

$ pip3 show morfessor | grep Version
Version: 2.0.1

loading glove produced files into polyglot?

GloVe produces data of the form shown below. How do I load these GloVe-produced files into polyglot so I can take advantage of the polyglot infrastructure?

in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.036121 0.13085 0.0012462 0.14769 0.26926 0.37144 1.3501 -0.11326 -0.23036 -0.26575 -0.18077 0.092455 -0.16215 0.15003 -0.34547 0.072295 0.40659 0.010021 -0.0079257 -0.11435 0.017008 -0.29789 0.19079 0.37112 -0.26588 0.16212 0.065469 -0.31781 -0.03226 0.081969 0.3445 -0.17362 -0.35745 0.054487 0.39941 0.13699 -0.022066 0.11025 -0.41898 0.1276 -0.095869 -0.17944 -0.17443 0.27302 -0.19464 0.26747 -0.28241 0.1638 -0.11518 0.013196 -0.10616 -0.36093 0.023634 0.13464 0.021652 -0.27094 -0.018737 0.10017 0.36071 -0.093951 0.47634 0.12874 0.0011868 0.1377 -0.14034 -0.1887 -0.16405 -0.15349 0.32347 -0.17616 0.3523 -0.023531 -0.19121 -0.054809 -0.099521 -0.30056 0.36632 -0.21509 0.074123 -0.20267 0.1286 -0.38111 -0.025482 0.45103 0.088633 0.36288 -0.23406 -0.086024 -0.50604 0.034242 0.43998 -0.083023 -0.11969 0.68686 -0.34115 0.21228 0.40039 0.26367 -0.37144 0.16206 -0.42854 0.078658 -0.2905 0.21727 -0.27484 0.35887 0.27055 -0.11326 -0.14848 -0.0050659 -0.076862 0.078621 -0.24922 0.42026 -0.069698 0.071595 0.0071665 0.27473 -0.15664 0.25713 -0.058461 -0.29733 -0.090996 0.5246 0.14889 -0.20883 -0.13004 -0.20022 0.4503 -0.34654 -0.26007 0.35247 -0.34757 0.033738 0.19907 -0.32912 -0.084689 0.65319 0.20954 0.079274 0.1086 0.0026466 -0.12843 -0.22811 0.051501 -0.27429 0.14505 -0.1843 -0.34825 -0.11701 0.34034 0.075848 0.08239 -0.39188 -0.022312 -0.080373 0.14477 0.29701 -0.10523 0.092893 0.029813 -0.11761 0.16308 0.098382 0.46152 -0.162 -0.2456 0.20293 -0.11344 0.057902 -0.19528 -0.20141 -0.22874 -0.014101 0.2637 -0.10028 -0.051896 0.18859 -0.17767 -0.11556 0.121 0.17303 0.11773 0.034837 0.28485 -0.30447 0.061024 -0.26442 -0.081135 -0.044524 -0.036931 -0.15217 0.29175 0.44926 -0.28875 0.33193 -0.01242 -0.18805 -0.19832 -0.19736 0.26893 0.11106 -0.67383 -0.1518 -0.16615 -0.16563 0.0093671 -0.15945 -0.33468 0.22038 -0.16724 -0.1535 -0.61782 -0.17258 0.088928 0.019411 0.18296 0.32967 -0.0024906 -0.09208 0.514 0.0042484 -0.084377 -0.71448 -0.22148 -0.04835 0.043761 -0.29376 -0.22287 0.18001 0.072197 0.46499 0.056466 0.40844 -0.23641 -0.038946 0.087363 -0.21901 -0.3231 -0.19989 -0.3128 -0.067656 -0.22596 0.090926 0.28365 0.31462 0.46082 -0.024871 -0.14605 0.30454 0.17704 -0.011311 0.26807 -0.032461 -0.16644 -0.15313 -0.20426 -0.3082 -0.2459 0.085848 -0.11767 -0.063056 -0.18133 -0.18629 -0.17694 0.29618 0.35987 0.0020102 0.38616 0.36712 -0.055112 -0.34733 -0.072678 -0.051119 -0.29069 0.053598 0.019587 0.16808 -0.27456 -0.097179 -0.054541 0.19229 -0.48128 -0.20304 0.19368 -0.32546 0.14421 -0.169 0.26501
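One possible route, sketched below under explicit assumptions: parse the plain-text GloVe file (one "word v1 ... vd" line per word) into a word list plus a numpy matrix and wrap them in polyglot's mapping classes. The Embedding(vocabulary, vectors) constructor signature and the OrderedVocabulary(words) call are assumptions about polyglot.mapping, so adjust them to your installed version; the file name is a placeholder.

# Minimal sketch, not polyglot's own loader: build an Embedding from a
# plain-text GloVe file so the rest of the polyglot tooling can use it.
import io
import numpy as np
from polyglot.mapping import Embedding, OrderedVocabulary

def load_glove_txt(path):
    words, rows = [], []
    with io.open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append(np.asarray(parts[1:], dtype=np.float64))
    return Embedding(OrderedVocabulary(words), np.vstack(rows))

emb = load_glove_txt("glove_vectors.txt")  # placeholder file name
print(emb["in"][:5])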

Tokenization and NER

Hi there, I'm using polyglot to do some tokenization and NER extraction, and I use the output as features in a machine learning model. Since I know in advance which language I am processing, I instantiate a Text object using language hinting.

text = Text(text_string, hint_language_code="pt")

Now, for some reason, it seems that NER doesn't rely on the language hint passed to the constructor. It tries to infer the language again and sometimes gets it wrong (e.g. it detects Portuguese as gl). Since there are no ner2 models for Galician to be downloaded, I get a ValueError: Package u'ner2.gl' not found in index.

Here is the full stack trace:

--> 238             ne_tuples = [((ent.start, ent.end), ent.tag, 1.) for ent in sent.entities]
    239             ne_entity_range = get_target_entity_range(ne_tuples, entity_name, sent.tokens)
    240             if ne_entity_range:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in entities(self)
    130     prev_tag = u'O'
    131     chunks = []
--> 132     for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
    133       if tag != prev_tag:
    134         if prev_tag == u'O':

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in ne_chunker(self)
     98   @cached_property
     99   def ne_chunker(self):
--> 100     return get_ner_tagger(lang=self.language.code)
    101 
    102   @cached_property

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
     28     key = tuple(list(args) + sorted(kwargs.items()))
     29     if key not in cache:
---> 30       cache[key] = obj(*args, **kwargs)
     31     return cache[key]
     32   return memoizer

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in get_ner_tagger(lang)
    190 def get_ner_tagger(lang='en'):
    191   """Return a NER tagger from the models cache."""
--> 192   return NEChunker(lang=lang)

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
    102       lang: language code to decide which chunker to use.
    103     """
--> 104     super(NEChunker, self).__init__(lang=lang)
    105     self.ID_TAG = NER_ID_TAG
    106 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
     38     """
     39     self.lang = lang
---> 40     self.predictor = self._load_network()
     41     self.ID_TAG = {}
     42     self.add_bias = True

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in _load_network(self)
    109     self.embeddings = load_embeddings(self.lang, type='cw')
    110     self.embeddings.normalize_words(inplace=True)
--> 111     self.model = load_ner_model(lang=self.lang, version=2)
    112     first_layer, second_layer = self.model
    113     def predict_proba(input_):

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
     28     key = tuple(list(args) + sorted(kwargs.items()))
     29     if key not in cache:
---> 30       cache[key] = obj(*args, **kwargs)
     31     return cache[key]
     32   return memoizer

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in load_ner_model(lang, version)
     92   """
     93   src_dir = "ner{}".format(version)
---> 94   p = locate_resource(src_dir, lang)
     95   fh = _open(p)
     96   try:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in locate_resource(name, lang, filter)
     41   p = path.join(polyglot_path, task_dir, lang)
     42   if not path.isdir(p):
---> 43     if downloader.status(package_id) != downloader.INSTALLED:
     44       raise ValueError("This resource is available in the index "
     45                        "but not downloaded, yet. Try to run\n\n"

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in status(self, info_or_id, download_dir)
    735     """
    736     if download_dir is None: download_dir = self._download_dir
--> 737     info = self._info_or_id(info_or_id)
    738 
    739     # Handle collections:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _info_or_id(self, info_or_id)
    505   def _info_or_id(self, info_or_id):
    506     if isinstance(info_or_id, unicode):
--> 507       return self.info(info_or_id)
    508     else:
    509       return info_or_id

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in info(self, id)
    931     if id in self._packages: return self._packages[id]
    932     if id in self._collections: return self._collections[id]
--> 933     raise ValueError('Package %r not found in index' % id)
    934 
    935   def get_collection(self, lang=None, task=None):

ValueError: Package u'ner2.gl' not found in index

List index out of range

Windows 8.1, Python 3.5.
Trying to run the NER example given in the documentation results in a list index out of range error.

Named Entity Extraction does not seem to work

I would like to use the Named Entity Extraction of Polyglot, so I'm following the documentation at http://polyglot.readthedocs.org/en/latest/NamedEntityRecognition.html, however when I execute

print(downloader.supported_languages_table("ner2", 3)) 

I get the following error:

Traceback (most recent call last):
  File "C:/Users/text_analyzer_polyglot.py", line 22, in <module>
    main()
  File "C:/Users/text_analyzer_polyglot.py", line 18, in main
    print(downloader.supported_languages_table("ner2", 3))
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 963, in supported_languages_table
    languages = self.supported_languages(task)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 955, in supported_languages
    collection = self.get_collection(task=task)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in get_collection
    if task: raise TaskNotSupported("Task {} is not supported".format(id))
polyglot.downloader.TaskNotSupported: Task TASK:ner2 is not supported

In addition, if I try to execute:

blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
print(text.entities)

I get the following error:

Traceback (most recent call last):
  File "C:/Users/text_analyzer_polyglot.py", line 23, in <module>
    main()
  File "C:/Users/text_analyzer_polyglot.py", line 20, in main
    print (text.entities)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 124, in entities
    for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 96, in ne_chunker
    return get_ner_tagger(lang=self.language.code)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 152, in get_ner_tagger
    return NEChunker(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 99, in __init__
    super(NEChunker, self).__init__(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
    self.predictor = self._load_network()
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in _load_network
    self.embeddings = load_embeddings(self.lang, type='cw')
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 64, in load_embeddings
    p = locate_resource(src_dir, lang)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 47, in locate_resource
    if downloader.status(package_id) != downloader.INSTALLED:
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 730, in status
    info = self._info_or_id(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 500, in _info_or_id
    return self.info(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 918, in info
    raise ValueError('Package %r not found in index' % id)
ValueError: Package u'embeddings2.en' not found in index

Am I missing something in the documentation?
Could you tell me how to successfully run the Named Entity Extraction?
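For reference, a minimal sketch of the usual sequence, stitched together from commands that appear in other issues on this page; it assumes the download index itself is reachable, which is exactly what several of the reports above say is failing:

# Download the English embeddings and NER model once, then run the example.
from polyglot.downloader import downloader
from polyglot.text import Text

downloader.download("embeddings2.en")
downloader.download("ner2.en")

blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
print(text.entities)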

Punctuation and bad lowercase words in NER result

I just found polyglot, which seems to be FANTASTIC for dealing with all sorts of NLP problems in multiple languages. I want to use it for Swedish texts, so I got to work and tested it on some real-world texts.

Here's a random Swedish article: http://www.dn.se/nyheter/sverige/oligarken-som-ager-en-o-i-stockholm/ I manually copy-pasted the text into a txt file, downloaded the required Swedish models and tried to get the NER tags from it.

from polyglot.text import Text
text = Text(open("test.txt").read())
for entity in text.entities:
    print(entity.tag, entity)

Here's a part of the output:
...
I-LOC ['Stockholm']
I-LOC ['slovakisk']
I-PER ['Frantisek']
I-PER ['Jules', 'Verne']
I-PER ['Zvrskovec']
I-LOC ['Lidingö']
I-PER ['Bilspedition']
I-LOC ['Tjeckien']
I-PER ['oligarken']
I-LOC ['Indiana']
I-PER ['Indiana', 'Jones']
I-ORG ['Arlanda']
I-LOC ['Tjeckien']
I-LOC ['Stockholm']
I-PER ['Frantisek', 'Zvrskovec']
I-PER ['.']
I-PER ['Frantisek', 'Zvrskovec']
I-PER ['bottenskrevan']
I-LOC ['Stockholms']
I-PER ['landstigningsförbud']
I-PER ['helstängt']
I-PER ['Magnus', 'Hallgren']
I-PER ['Dividend']
I-ORG ['Central', 'Europe']
I-PER ['Zvrskovec']
I-PER ['.']
I-LOC ['Tjeckoslovakien']
I-LOC ['Dolny']
I-PER ['Dolny', 'Kubin']
I-PER ['.']
...

...which is ok, but two things stand out:

  1. Lots of the People tags are just punctuation. Is this a known bug in polyglot?
  2. All the lowercase words in the example above are actually just nouns, not people (let me know if you need a translation of the words to make sense of them). Is that also a bug in polyglot?

I guess I could write my own filter to remove punctuation and lowercase words (a rough sketch follows below), but it seems this should be easier to solve at an earlier step, when training the models, don't you think?
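For what it's worth, a rough client-side filter along those lines could look like the sketch below (not part of polyglot; the tag names and the list-of-words shape of each entity are taken from the output above):

# Drop entities that are pure punctuation, and person entities that are
# entirely lowercase (in the article above those are common nouns, not names).
import string

def filtered_entities(text):
    keep = []
    for entity in text.entities:
        joined = u"".join(entity)
        if all(ch in string.punctuation for ch in joined):
            continue  # e.g. I-PER ['.']
        if entity.tag == u"I-PER" and joined.islower():
            continue  # e.g. I-PER ['oligarken']
        keep.append(entity)
    return keep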

Docs: unescaped backslashes not showing in "parsed-literal"

For example, in README.rst, under "Named Entity Recognition", the result should include a visible backslash in "\xfd", but it is swallowed by the parsed-literal body element.

While it would be possible to escape the backslash so that it renders correctly, that would be confusing to someone reading the document in raw format.

Consider using the code body element instead, or showing results as comments in the code preceding them.

outdated model for 15.04.19?

Hi,

I just upgraded to polyglot 15.04.19 and it seems the model needs to be updated too.

In [1]: from polyglot.downloader import downloader

In [2]: downloader.download("embeddings2.en")
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/ubuntu/polyglot_data...
Out[2]: True

In [3]: downloader.download("pos2.en")
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/ubuntu/polyglot_data...
Out[3]: True

In [4]: blob = """We will meet at eight o'clock on Thursday morning."""

In [5]: from polyglot.text import Text

In [6]: text = Text(blob)

In [7]: text.pos_tags
Out[7]:
[(u'We', u'INTJ'),
 (u'will', u'NOUN'),
 (u'meet', u'NOUN'),
 (u'at', u'ADP'),
 (u'eight', u'DET'),
 (u"o'clock", u'PART'),
 (u'on', u'ADP'),
 (u'Thursday', u'PART'),
 (u'morning', u'PART'),
 (u'.', u'ADV')]

Also you might want to update this too.

transliteration

Hi, actually I tried again with the GitHub version and transliteration did work correctly. Thanks for making this available!

Could you please give me some information on the method it relies on (character-based? dictionary-based? Buckwalter? machine learning?) and on the output format (Arabica?), and whether it is conceivable to adapt the model?

Hebrew language is being detected with its old code "iw" instead of "he"

In [200]: polyglot.__version__
Out[200]: '16.07.04'

hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
Text(hebrew_text).language.code
Out[184]: 'iw'

This problem affects work greatly, since "he" language is significantly better supported in polyglot (and elsewhere) than "iw".

Thanks for all the hard work!
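Until the detector is changed, a small client-side normalization works as a stopgap; the sketch below remaps the deprecated ISO-639 aliases (iw→he, in→id, ji→yi) before the code is used to pick models:

# Remap deprecated ISO-639 codes returned by the detector before using them.
from polyglot.text import Text

LEGACY_CODES = {"iw": "he", "in": "id", "ji": "yi"}

hebrew_text = u'זהו משפט בשפה העברית'
code = Text(hebrew_text).language.code
print(LEGACY_CODES.get(code, code))  # 'he'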

Wrong sentence split for abbreviations

In [1]: from polyglot.text import Text
In [2]: Text("Мне уже не раз приходилось писать о том, что мировой капитализм вошел в новую и последнюю фазу своего развития. Почти 100 лет назад (в 1916 году) В. Ленин (Ульянов) написал книгу «Империализм, как высшая стадия капитализма». В ней он констатировал, что в конце XIX — начале XX века капитализм стал монополистическим, и что такой капитализм является последней стадией развития этой общественно-экономической формации. Классик несколько поспешил с вынесением смертного приговора капитализму.").sentences
Out[2]: 
[Sentence("Мне уже не раз приходилось писать о том, что мировой капитализм вошел в новую и последнюю фазу своего развития."),
 Sentence("Почти 100 лет назад (в 1916 году) В."),
 Sentence("Ленин (Ульянов) написал книгу «Империализм, как высшая стадия капитализма»."),
 Sentence("В ней он констатировал, что в конце XIX — начале XX века капитализм стал монополистическим, и что такой капитализм является последней стадией развития этой общественно-экономической формации."),
 Sentence("Классик несколько поспешил с вынесением смертного приговора капитализму.")]
#next one is fine
In [3]: Text("In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kubler et al., 2009).")
Out[3]: Text("In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kubler et al., 2009).")
#but this one is not
In [4]: Text("Marshall R. Mayberry III and Risto Miikkulainen. 2005. Broad-coverage parsing with neural networks. Neural Processing Letters").sentences
Out[4]: 
[Sentence("Marshall R."),
 Sentence("Mayberry III and Risto Miikkulainen."),
 Sentence("2005."),
 Sentence("Broad-coverage parsing with neural networks."),
 Sentence("Neural Processing Letters")]

(How are such issues solved or avoided?)

polyglot_data on windows.

Hi,
I have installed polyglot on Windows with Python 3.4; after solving some library problems, I started getting this error:

downloader.download()
Polyglot Downloader

---------------------------------------------------------------------------

d) Download l) List u) Update c) Config h) Help q) Quit

---------------------------------------------------------------------------

Downloader> l

Collections:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 649, in download
    self._interactive_download()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1068, in _interactive_download
    DownloaderShell(self).run()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1096, in run
    more_prompt=True)
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 459, in list
    for info in sorted(getattr(self, category)(), key=str):
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 495, in collections
    self._update_index()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 832, in _update_index
    P = Package.fromcsobj(p)
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 232, in fromcsobj
    language = subdir.split(path.sep)[1]
IndexError: list index out of range

After some analysis (and a few neurons less...) I found the problem: on Windows, "path.sep" is, as expected, "\" instead of "/". Since the packages are "named" or "ID(ed)" with "/", using path.sep makes no sense for Windows users, does it? Or am I missing something I should have installed?

Replacing path.sep with "/" solves the problem and lets me list and download any data I want for my polyglot installation.
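In code, the replacement described above would be roughly the one-liner below, a hypothetical patch sketch against Package.fromcsobj as it appears in the traceback (package IDs in the index always use "/", so the split should not depend on the OS separator):

# Hypothetical patch for Package.fromcsobj: split on the index's own "/"
# separator instead of the OS-specific path.sep.
language = subdir.split("/")[1]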

NER Index downloader Issue

Hi there, I'm getting an error when doing NER on a sentence.
The code used to get the named entities is:

ne_tuples = [((ent.start, ent.end), ent.tag, 1.) for ent in sent.entities]

It seems that the Downloader tries to read some information from a web index, which results in a socket error. Any ideas on what might be going on? Here is the stacktrace:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in entities(self)
    130     prev_tag = u'O'
    131     chunks = []
--> 132     for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
    133       if tag != prev_tag:
    134         if prev_tag == u'O':

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in ne_chunker(self)
     98   @cached_property
     99   def ne_chunker(self):
--> 100     return get_ner_tagger(lang=self.language.code)
    101 
    102   @cached_property

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
     28     key = tuple(list(args) + sorted(kwargs.items()))
     29     if key not in cache:
---> 30       cache[key] = obj(*args, **kwargs)
     31     return cache[key]
     32   return memoizer

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in get_ner_tagger(lang)
    190 def get_ner_tagger(lang='en'):
    191   """Return a NER tagger from the models cache."""
--> 192   return NEChunker(lang=lang)

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
    102       lang: language code to decide which chunker to use.
    103     """
--> 104     super(NEChunker, self).__init__(lang=lang)
    105     self.ID_TAG = NER_ID_TAG
    106 

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
     38     """
     39     self.lang = lang
---> 40     self.predictor = self._load_network()
     41     self.ID_TAG = {}
     42     self.add_bias = True

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in _load_network(self)
    109     self.embeddings = load_embeddings(self.lang, type='cw')
    110     self.embeddings.normalize_words(inplace=True)
--> 111     self.model = load_ner_model(lang=self.lang, version=2)
    112     first_layer, second_layer = self.model
    113     def predict_proba(input_):

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
     28     key = tuple(list(args) + sorted(kwargs.items()))
     29     if key not in cache:
---> 30       cache[key] = obj(*args, **kwargs)
     31     return cache[key]
     32   return memoizer

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in load_ner_model(lang, version)
     92   """
     93   src_dir = "ner{}".format(version)
---> 94   p = locate_resource(src_dir, lang)
     95   fh = _open(p)
     96   try:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in locate_resource(name, lang, filter)
     41   p = path.join(polyglot_path, task_dir, lang)
     42   if not path.isdir(p):
---> 43     if downloader.status(package_id) != downloader.INSTALLED:
     44       raise ValueError("This resource is available in the index "
     45                        "but not downloaded, yet. Try to run\n\n"

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in status(self, info_or_id, download_dir)
    735     """
    736     if download_dir is None: download_dir = self._download_dir
--> 737     info = self._info_or_id(info_or_id)
    738 
    739     # Handle collections:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _info_or_id(self, info_or_id)
    505   def _info_or_id(self, info_or_id):
    506     if isinstance(info_or_id, unicode):
--> 507       return self.info(info_or_id)
    508     else:
    509       return info_or_id

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in info(self, id)
    927     if id in self._packages: return self._packages[id]
    928     if id in self._collections: return self._collections[id]
--> 929     self._update_index() # If package is not found, most probably we did not
    930                          # warm up the cache
    931     if id in self._packages: return self._packages[id]

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _update_index(self, url)
    829     elif source == 'mirror':
    830         index_url = path.join(self._url, 'index.json')
--> 831         data = urlopen(index_url).read()
    832 
    833     if six.PY3:

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    429             req = meth(req)
    430 
--> 431         response = self._open(req, data)
    432 
    433         # post-process response

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in _open(self, req, data)
    447         protocol = req.get_type()
    448         result = self._call_chain(self.handle_open, protocol, protocol +
--> 449                                   '_open', req)
    450         if result:
    451             return result

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    407             func = getattr(handler, meth_name)
    408 
--> 409             result = func(*args)
    410             if result is not None:
    411                 return result

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in http_open(self, req)
   1225 
   1226     def http_open(self, req):
-> 1227         return self.do_open(httplib.HTTPConnection, req)
   1228 
   1229     http_request = AbstractHTTPHandler.do_request_

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1198         else:
   1199             try:
-> 1200                 r = h.getresponse(buffering=True)
   1201             except TypeError: # buffering kw not supported
   1202                 r = h.getresponse()

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in getresponse(self, buffering)
   1134 
   1135         try:
-> 1136             response.begin()
   1137             assert response.will_close != _UNKNOWN
   1138             self.__state = _CS_IDLE

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in begin(self)
    451         # read until we get a non-100 response
    452         while True:
--> 453             version, status, reason = self._read_status()
    454             if status != CONTINUE:
    455                 break

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in _read_status(self)
    407     def _read_status(self):
    408         # Initialize with Simple-Response defaults
--> 409         line = self.fp.readline(_MAXLINE + 1)
    410         if len(line) > _MAXLINE:
    411             raise LineTooLong("header line")

/home/preceptor/miniconda2/envs/wiki/lib/python2.7/socket.pyc in readline(self, size)
    478             while True:
    479                 try:
--> 480                     data = self._sock.recv(self._rbufsize)
    481                 except error, e:
    482                     if e.args[0] == EINTR:

polyglot not running in windows.

pip install -U git+https://github.com/aboSamoor/polyglot.git@master 

In cmd:

c:\tmp>polyglot download
Traceback (most recent call last):
 File "C:\Python27\Scripts\polyglot-script.py", line 9, in <module>
   load_entry_point('polyglot==15.10.3', 'console_scripts', 'polyglot')()
 File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 552, in load_entry_point
   return get_distribution(dist).load_entry_point(group, name)
 File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2672, in load_entry_point
   return ep.load()
 File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2345, in load
   return self.resolve()
 File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2351, in resolve
   module = __import__(self.module_name, fromlist=['__name__'], level=0)
 File "C:\Python27\lib\site-packages\polyglot\__main__.py", line 9, in <module>
   from signal import signal, SIGPIPE, SIG_DFL
ImportError: cannot import name SIGPIPE

In idle:

from polyglot.downloader import downloader
downloader.download("embeddings2.en")


Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    downloader.download("embeddings2.en")
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 663, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 533, in incr_download
    try: info = self._info_or_id(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 507, in _info_or_id
    return self.info(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 929, in info
    self._update_index() # If package is not found, most probably we did not
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 843, in _update_index
    P = Package.fromcsobj(p)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 216, in fromcsobj
    language = subdir.split(path.sep)[1]
IndexError: list index out of range

output format text

Dear Mr. Al-Rfou,

Actually I tried again with the GitHub version and transliteration did work correctly. Thanks for making this available!

Could you please give me some information on the method it relies on (character-based? dictionary-based? Buckwalter? machine learning?) and on the output format (Arabica?), and whether it is conceivable to adapt the model?

Thank you in advance.
Best,

tsne2, sgns2

Hello, thanks for your work.

What kind of data are tsne2 and sgns2?

OSError: [Errno 2] No such file or directory: '/root/polyglot_data/morph2/cs'

I am not able to install new models, for example the Czech ones to test the morphology analysis:

root@mario-VirtualBox:/home/mario/python-scripts# python example.py 
Traceback (most recent call last):
  File "example.py", line 28, in <module>
    print(word2.morphemes)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/text.py", line 269, in morphemes
    words, score = self.morpheme_analyzer.viterbi_segment(self.string)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/text.py", line 265, in morpheme_analyzer
    return load_morfessor_model(lang=self.language)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/load.py", line 128, in load_morfessor_model
    p = locate_resource(src_dir, lang)
  File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/load.py", line 51, in locate_resource
    return path.join(p, os.listdir(p)[0])
OSError: [Errno 2] No such file or directory: '/root/polyglot_data/morph2/cs'
root@mario-VirtualBox:/home/mario/python-scripts# 

my code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import polyglot
from polyglot.text import Text, Word
from polyglot.detect import Detector
#from polyglot.downloader import downloader
#downloader.download("morph2.cs")


#mixed_text = u"""
#Officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
#"""

#zen = Text("Beautiful is better than ugly. "
#           "Explicit is better than implicit. "
#           "Simple is better than complex.")
#print(zen.words)
#print(zen.sentences)

#detector = Detector(mixed_text)
#print(detector.language)

#word = Text("Preprocessing is an essential step.").words[0]
#print(word.morphemes)

word2 = Text("Na cestu do Německa se vydal už před dvěma roky.").words[0]
print(word2.morphemes)


from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="cs")
print(transliterator.transliterate(u"preprocessing"))

The commented-out lines work fine.

Installation error: LONG_BIT definition

I'm using Ubuntu 14.04 and PIP:

gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Icld2/internal -Icld2/public -I/home/raul/miniconda/include/python2.7 -c bindings/pycldmodule.cc -o build/temp.linux-i686-2.7/bindings/pycldmodule.o -w -O2 -m64 -fPIC
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
    In file included from /home/raul/miniconda/include/python2.7/Python.h:58:0,
                     from bindings/pycldmodule.cc:15:
    /home/raul/miniconda/include/python2.7/pyport.h:886:2: error: #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
     #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
      ^
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
    Command "/home/raul/miniconda/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-GclX4J/pycld2/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-kaY6oa-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-GclX4J/pycld2

Trouble Installing

This is using
"pip install polyglot".

I've located some useful arguments that can help here, but I'm not sure how to add them to the cc command.

Complete output from command /usr/bin/python -c "import setuptools, tokenize;file='/private/var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-build-uOkJfF/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-EdbjO8-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.10-intel-2.7
copying icu.py -> build/lib.macosx-10.10-intel-2.7
copying PyICU.py -> build/lib.macosx-10.10-intel-2.7
copying docs.py -> build/lib.macosx-10.10-intel-2.7
running build_ext
building '_icu' extension
creating build/temp.macosx-10.10-intel-2.7
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/local/include -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _icu.cpp -o build/temp.macosx-10.10-intel-2.7/_icu.o -DPYICU_VER="1.9.2"
In file included from _icu.cpp:27:
./common.h:86:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>
^
1 error generated.
error: command 'cc' failed with exit status 1

The downloads server seems currently down

Hello!
When I issued this command:

polyglot download embeddings2.en ner2.en 

I received the following answer:

[polyglot_data] Error loading embeddings2.en: HTTP Error 503: Service
[polyglot_data]     Unavailable
Error installing package. Retry? [n/y/e]

This has been happening for about 3 days (as far as I know) and in all sorts of circumstances.
I think your downloads server is down. Any thoughts?

How to train a new SRL model?

Hi aboSamoor, I want to create a Semantic Role Labeling model (non-English).
Maybe this is off-topic for polyglot, but you advised using deepnl in another issue's answer, so I would like to ask: how can I train a new SRL model from raw text data?

Currently I am trying to use https://github.com/attardi/deepnl and https://github.com/erickrf/nlpnet (both are hard to use due to unclear steps and data formats; I asked on their GitHub pages but no clear solution was provided). I want to use your pretrained word embeddings from https://sites.google.com/site/rmyeid/projects/polyglot to create the SRL model.

What would you advise to solve this task?

Unexpected POS result

Hi Rami @aboSamoor,

In the following example, I expected the pos_tag 'VERB' for the word 'restate'; however, I got 'NUM':

blob ="""restate (words) from one language into another language."""
text = Text(blob)
text.pos_tags
text.words[0].pos_tag

u'NUM'

Entity recognition should process sentence words

Hi,

Just found out that the entities function processes the BaseBlob words list, which in some cases produces false positives by merging an entity at the end of one sentence with another entity at the start of the next sentence.

Here's the example,

In [1]: from polyglot.text import Text, Word

In [2]: blob = u"""Momentum perbaikan "Los Blaugranas" itu awalnya justru terjadi setelah Martin Caceres diusir wasit karena melakukan pelanggaran keras di kotak penalti. Meski bermain dengan sepuluh pemain, rasa percaya diri mereka bangkit setelah kiper Jose Manuel Pinto berhasil menggagalkan tendangan penalti Jose Luis Marti.

Barca semakin kuat setelah Messi masuk pada menit ke-58. Messi pula yang mencetak gol penyama skor pada menit ke-80."""

In [3]: text = Text(blob)

In [4]: text.entities
Out[4]:
[I-PER([u'Martin', u'Caceres']),
 I-PER([u'Jose', u'Manuel', u'Pinto']),
 I-PER([u'Jose', u'Luis', u'Marti', u'.', u'Barca']),
 I-PER([u'Messi']),
 I-PER([u'Messi'])]

Since there's a SentenceTokenizer function, I think it would be best to process the Sentence words rather than the BaseBlob words. I know it will require an extra step to create sentence objects, but the result will be better.

--- text.py.orig        2015-05-11 16:25:54.249957999 +0000
+++ .../lib/python2.7/site-packages/polyglot/text.py    2015-05-11 16:26:53.001957999 +0000
@@ -117,21 +117,22 @@
   @cached_property
   def entities(self):
     """Returns a list of entities for this blob."""
-    start = 0
-    end = 0
-    prev_tag = u'O'
     chunks = []
-    for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
-      if tag != prev_tag:
-        if prev_tag == u'O':
-          start = i
-        else:
-          chunks.append(Chunk(self.words[start: i], start, i, tag=prev_tag,
-                              parent=self))
-        prev_tag = tag
-    if tag != u'O':
-      chunks.append(Chunk(self.words[start: i+1], start, i+1, tag=tag,
-                          parent=self))
+    for sentence in self.sentences:
+      start = 0
+      end = 0
+      prev_tag = u'O'
+      for i, (w, tag) in enumerate(self.ne_chunker.annotate(sentence.words)):
+        if tag != prev_tag:
+          if prev_tag == u'O':
+            start = i
+          else:
+            chunks.append(Chunk(sentence.words[start: i], start, i, tag=prev_tag,
+                                parent=self))
+          prev_tag = tag
+      if tag != u'O':
+        chunks.append(Chunk(sentence.words[start: i+1], start, i+1, tag=tag,
+                            parent=self))
     return chunks

   @cached_property

...and here's the result,

In [8]: text.entities
Out[8]:
[I-PER([u'Martin', u'Caceres']),
 I-PER([u'Jose', u'Manuel', u'Pinto']),
 I-PER([u'Jose', u'Luis', u'Marti']),
 I-PER([u'Barca']),
 I-PER([u'Messi']),
 I-PER([u'Messi'])]

Analogy task with polyglot

I can't seem to find a way, using polyglot, to test word analogies like having v["queen"] closest to v["king"] - v["man"] + v["woman"]. Basically I'd like to find neighbors of a linear combination of vectors of words instead of a single word vector.
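A manual query on top of the public embeddings is possible; the sketch below assumes (as in polyglot.mapping) that emb.vectors is the word-by-dimension matrix, that emb.vocabulary.id_word maps a row index back to its word, and that normalize_words() returns a unit-normalized copy. The model path is a placeholder for wherever embeddings_pkl.tar.bz2 was downloaded.

# Sketch of an analogy query: nearest neighbours of king - man + woman.
import numpy as np
from polyglot.mapping import Embedding

emb = Embedding.load("/path/to/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
emb = emb.normalize_words()           # unit rows -> dot product == cosine

query = emb["king"] - emb["man"] + emb["woman"]
query /= np.linalg.norm(query)

scores = emb.vectors.dot(query)
top = np.argsort(-scores)[:10]        # indices of the ten closest words
print([emb.vocabulary.id_word[i] for i in top])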

Package data downloaded to / when superuser

When deploying polyglot to a Docker container running Ubuntu (thus running as root inside the container), the downloader simply drops the resources into / -- it looks like there might be a bug in how the package_data path is determined for a superuser.

I was also wondering if it might make sense to consider a $POLYGLOT_PACKAGE_DATA environment variable when determining the path?

p.s. thanks for this amazing library! It's giving me great results and was much needed.
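As a stopgap, the downloader already accepts an explicit target directory (the download_dir parameter visible in the tracebacks elsewhere on this page), so inside the container something like the sketch below avoids the computed default; whether the loaders then pick that directory up still depends on how the package_data path is resolved. The path is a placeholder.

# Download into an explicit directory instead of the computed default.
from polyglot.downloader import downloader

downloader.download("embeddings2.en", download_dir="/data/polyglot_data")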

Broken POS tagging after NER

Testing the following sequence of commands:

from polyglot.text import Text
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').entities
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags

Gives the output:

[(u'At', u'ADV'), (u'least', u'ADV'), (u'two', u'NUM'), (u'dead', u'ADJ'), (u'in', u'ADP'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'VERB'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]
[(u'At', u'ADV'), (u'least', u'ADV'), (u'two', u'NUM'), (u'dead', u'ADJ'), (u'in', u'ADP'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'VERB'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]
[]
[(u'At', u'PROPN'), (u'least', u'PROPN'), (u'two', u'PROPN'), (u'dead', u'NOUN'), (u'in', u'PROPN'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'PROPN'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]

Notice how the third set of tags is completely different from the first two, after computing named entities. 'PROPN' tags especially become common afterwards. It appears that the POS tags of every sentence processed after the entities call are broken as well.

What is going wrong here?

Package u'transliteration2.ks' not found in index

I am trying to translate the string "عمتي سوسن" to English. Although Google's language detector detects this as Arabic text, polyglot throws an error stating "Package u'transliteration2.ks' not found in index". Moreover, the language package ks is not present in polyglot's list. I get the same error for 'sd', 'ku', 'ps' and 'uz' too.

Unicode Decode Error when getting key from embeddings

I'm doing some word lookups for Portuguese and I got the following:

File "/home/intruder/source/tgalery/analytyca/analytyca/utils/context.py", line 9, in get_vector
    vector = embeddings[word_key]
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/embeddings.py", line 40, in __getitem__
    return self.vectors[self.vocabulary[k]]
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 29, in __getitem__
    return self.approximate_ids(key)
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 52, in approximate_ids
    raise KeyError("{} not found".format(key))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)

Because the format string is not compatible with the incoming unicode key, raising the KeyError itself throws another exception.

I'm happy to fix this, but I wonder whether the keys are meant to be in binary format for lookups.
Let me know how best to proceed.
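If the keys are meant to stay unicode, the smallest fix is probably to make the error message itself a unicode string; a hypothetical one-line sketch against the approximate_ids line from the traceback:

# A unicode format string avoids the implicit ASCII encode of a non-ASCII
# key on Python 2 when the KeyError is constructed.
raise KeyError(u"{} not found".format(key))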

ImportError: No module named concurrent.futures

When running Python 2.7, I can't import polyglot. This can be fixed by installing the backported
futures package.

~/bin/polyglot $ ipython
Python 2.7.5+ (default, Feb 27 2014, 19:37:08) 
In [1]: import polyglot
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-00a07ab614e8> in <module>()
----> 1 import polyglot

/home/arne/bin/polyglot/polyglot/__init__.py in <module>()
      8 
      9 from six.moves import copyreg
---> 10 from .base import Sequence, TokenSequence
     11 from .utils import _pickle_method, _unpickle_method
     12 

/home/arne/bin/polyglot/polyglot/base.py in <module>()
      7 from collections import Counter
      8 import os
----> 9 from concurrent.futures import ProcessPoolExecutor
     10 from itertools import islice
     11 

ImportError: No module named concurrent.futures

ImportError: No module named 'icu'

python3.4:
pip install polyglot
from polyglot.text import Text, Word
---> 11 from icu import Locale
12 import pycld2 as cld2
13

ImportError: No module named 'icu'

It's not listed as a module dependency, nor is it mentioned in the README.

Transliterate into English

Hi,

Firstly, thanks for this great software. I noticed in the code that when the target language is English, it does not transliterate and just returns the original text. Does polyglot not support transliterating into English, or am I missing something?

regards,
Amir

New release on pypi

Please consider doing a new release on pypi. Transliteration is broken in the current version, and it was fixed by my commit: 600514a

I'm using polyglot in a project and it's not nice to have to get or build development versions manually.

It could also be a simple minor version 'bugfix' release.
