aboSamoor / polyglot
Multilingual text (NLP) processing toolkit
Home Page: http://polyglot-nlp.com
License: Other
When I try to download any model I get a 401 unauthorized error.
(default)vagrant@vagrant-ubuntu-trusty-64:~$ !polyglot download morph2.en
polyglot download embeddings2.en download morph2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/vagrant/polyglot_data...
[polyglot_data] Error downloading u'embeddings2.en' from
[polyglot_data] <https://www.googleapis.com/download/storage/v1/b
[polyglot_data] /polyglot-models/o/embeddings2%2Fen%2Fembeddings_p
[polyglot_data] kl.tar.bz2?generation=1406139163483000&alt=media>:
[polyglot_data] HTTP Error 401: Unauthorized
Thanks for the awesome lib @aboSamoor. I'm taking a look at your load module and it seems that it provides the ability to unpickle data into different objects depending on the NLP task in question. I was wondering if there is any documentation on:
Thanks again, and sorry if GitHub is not the right place to put these questions.
I'm trying to list all the available packages using downloader.list(show_packages=False), however only two are retrieved (LANG:zhc, TASK:tsne2/zhc).
I've also tried to manually set the URL that points to the package index file, but the default link provided in the documentation (http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml) no longer exists.
I'm trying to download models from http://whoisbigger.com/polyglot, but unfortunately it shows 0 bps after some time. Could you give me an alternative download link?
Is there any way to force polyglot to use a specific language? Sometimes when you run the NER and the phrase contains foreign-language entities (e.g. "Albert Einstein"), the language is classified wrong; for that example it says "de" (language code).
Is there any way to force a specific language?
The only way I've found is to create a custom method: overwrite def ne_chunker(self): to accept a parameter (language_code), and then change def entities(self) to accept the language code too:
from polyglot.text import Text, Chunk  # Chunk lives in polyglot.text in these versions
from polyglot.tag.base import get_ner_tagger

# Written as a standalone function here so it runs outside the author's class.
def extract(phrase, language='pt'):
    # Transform the phrase into a `polyglot.text.Text`
    text = Text(phrase)
    # Create a named entity chunker for the forced language
    ne_chunker = get_ner_tagger(lang=language)
    # Extract entities with the same chunking loop as BaseBlob.entities
    start = 0
    prev_tag = u'O'
    chunks = []
    for i, (w, tag) in enumerate(ne_chunker.annotate(text.words)):
        if tag != prev_tag:
            if prev_tag == u'O':
                start = i
            else:
                chunks.append(Chunk(text.words[start: i], start, i,
                                    tag=prev_tag, parent=text))
            prev_tag = tag
    if tag != u'O':
        chunks.append(Chunk(text.words[start: i+1], start, i+1,
                            tag=tag, parent=text))
    return chunks
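For example, a quick sketch (the Portuguese sentence is made up):
for chunk in extract(u"Albert Einstein visitou Lisboa.", language='pt'):
    print(chunk.tag, chunk)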
polyglot download TASK:transliteration2
On executing the above command on the terminal I am getting an error stating "Error downloading u'transliteration2.pl' from http://polyglot.cs.stonybrook.edu/~polyglot/transliteration2/pl/transliteration.pl.tar.bz2: HTTP Error 403: Forbidden".
On browsing the URL http://polyglot.cs.stonybrook.edu/~polyglot/transliteration2/pl/transliteration.pl.tar.bz2, it is forbidding access from my IP; I have tried accessing via VPN too.
This may be temporary or not, but the model downloads are failing currently:
[polyglot_data] Downloading collection 'TASK:embeddings2'
[polyglot_data] |
[polyglot_data] | Downloading package embeddings2.fy to
[polyglot_data] | /home/ubuntu/polyglot_data...
[polyglot_data] | Error downloading 'embeddings2.fy' from
[polyglot_data] | <https://www.googleapis.com/download/storage/
[polyglot_data] | v1/b/polyglot-models/o/embeddings2%2Ffy%2Femb
[polyglot_data] | eddings_pkl.tar.bz2?generation=14061390326200
[polyglot_data] | 00&alt=media>: HTTP Error 401: Unauthorized
When running pip3 install polyglot==16.7.4, none of its requirements get installed.
See (this uses a fresh python 3.5 docker image with nothing else installed):
$ docker run --rm -it python:3.5 pip3 install polyglot==16.7.4
Collecting polyglot==16.7.4
Downloading polyglot-16.7.4.tar.gz (126kB)
100% |████████████████████████████████| 133kB 1.9MB/s
Installing collected packages: polyglot
Running setup.py install for polyglot ... done
Successfully installed polyglot-16.7.4
For comparison, with polyglot==15.10.3:
$ docker run --rm -it python:3.5 pip3 install polyglot==15.10.3
Collecting polyglot==15.10.3
Downloading polyglot-15.10.03-py2.py3-none-any.whl (54kB)
100% |████████████████████████████████| 61kB 1.3MB/s
Collecting PyICU>=1.8 (from polyglot==15.10.3)
Downloading PyICU-1.9.3.tar.gz (179kB)
100% |████████████████████████████████| 184kB 3.0MB/s
Collecting morfessor>=2.0.2a1 (from polyglot==15.10.3)
Downloading Morfessor-2.0.2alpha3.tar.gz
Collecting futures>=2.1.6 (from polyglot==15.10.3)
Downloading futures-3.0.5.tar.gz
Collecting pycld2>=0.3 (from polyglot==15.10.3)
Downloading pycld2-0.31.tar.gz (14.3MB)
100% |████████████████████████████████| 14.3MB 101kB/s
Collecting six>=1.7.3 (from polyglot==15.10.3)
Downloading six-1.10.0-py2.py3-none-any.whl
Collecting wheel>=0.23.0 (from polyglot==15.10.3)
Downloading wheel-0.29.0-py2.py3-none-any.whl (66kB)
100% |████████████████████████████████| 71kB 11.5MB/s
Installing collected packages: PyICU, morfessor, futures, pycld2, six, wheel, polyglot
Running setup.py install for PyICU ... error
Complete output from command /usr/local/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-68sk36vg/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-ptkc3_tq-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
copying icu.py -> build/lib.linux-x86_64-3.5
copying PyICU.py -> build/lib.linux-x86_64-3.5
copying docs.py -> build/lib.linux-x86_64-3.5
running build_ext
building '_icu' extension
creating build/temp.linux-x86_64-3.5
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python3.5m -c _icu.cpp -o build/temp.linux-x86_64-3.5/_icu.o -DPYICU_VER="1.9.3"
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from _icu.cpp:27:0:
common.h:90:28: fatal error: unicode/utypes.h: No such file or directory
#include <unicode/utypes.h>
^
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/usr/local/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-68sk36vg/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-ptkc3_tq-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-68sk36vg/PyICU/
In [1]: from polyglot.downloader import downloader
In [2]: downloader.supported_languages_table(u"ner2")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-e3f6f145dbd1> in <module>()
----> 1 downloader.supported_languages_table(u"ner2")
/Users/dchaplinsky/Projects/et-lemma/venv/lib/python3.4/site-packages/polyglot/downloader.py in supported_languages_table(self, task, cols)
977 def supported_languages_table(self, task, cols=3):
978 languages = self.supported_languages(task)
--> 979 return pretty_list(languages)
980
981
/Users/dchaplinsky/Projects/et-lemma/venv/lib/python3.4/site-packages/polyglot/utils.py in pretty_list(items, cols)
70 col_width = u"{" + u":<" + str(width) + u"} "
71 for i, lang in enumerate(items):
---> 72 lang = lang.decode(u"utf-8")
73 if len(lang) > width:
74 lang = lang[:width-3] + "..."
AttributeError: 'str' object has no attribute 'decode'
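A minimal sketch of a Python 3-safe guard for that decode in polyglot/utils.py, assuming lang can arrive as either bytes or str:
# Only decode when we actually hold bytes; Python 3 str has no .decode().
if isinstance(lang, bytes):
    lang = lang.decode("utf-8")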
When installing polyglot from scratch everything works fine (icu, cld2 and morfessor are correctly installed), but when using it to download models I get the following error:
Traceback (most recent call last):
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/bin/polyglot", line 9, in <module>
load_entry_point('polyglot==15.10.3', 'console_scripts', 'polyglot')()
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 558, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2682, in load_entry_point
return ep.load()
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2355, in load
return self.resolve()
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2361, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/polyglot/__main__.py", line 21, in <module>
from polyglot.load import load_morfessor_model
File "/home/preceptor/source/ca/wiki_extractor/.wikienv/lib/python2.7/site-packages/polyglot/load.py", line 8, in <module>
import numpy as np
ImportError: No module named numpy
Maybe numpy should be included as a dep too.
From the tutorial:
#!/usr/bin/env python3
import polyglot
from polyglot.text import Text, Word
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
When I try to run it (after calling polyglot.downloader.downloader.download('morph2.en')):
Traceback (most recent call last):
File "./test.py", line 5, in <module>
print(word.morphemes)
File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 286, in morphemes
words, score = self.morpheme_analyzer.viterbi_segment(self.string)
File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 282, in morpheme_analyzer
return load_morfessor_model(lang=self.language)
File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/polyglot/load.py", line 142, in load_morfessor_model
model = io.read_any_model(tmp_file_.name)
File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 203, in read_any_model
model.load_segmentations(self.read_segmentation_file(file_name))
File "/usr/local/lib/python3.4/dist-packages/morfessor/baseline.py", line 487, in load_segmentations
for count, segmentation in segmentations:
File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 53, in read_segmentation_file
for line in self._read_text_file(file_name):
File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 240, in _read_text_file
encoding = self._find_encoding(file_name)
File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 320, in _find_encoding
raise UnicodeError("Can not determine encoding of input files")
UnicodeError: Can not determine encoding of input files
Versions:
$ python3 --version
Python 3.4.2
$ pip3 show polyglot | grep Version
Version: 16.07.04
$ pip3 show morfessor | grep Version
Version: 2.0.1
glove produces data of the form shown below. how do I load these glove produced files into polyglot so I can take advantage of the polyglot infrastructure?
in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.036121 0.13085 0.0012462 0.14769 0.26926 0.37144 1.3501 -0.11326 -0.23036 -0.26575 -0.18077 0.092455 -0.16215 0.15003 -0.34547 0.072295 0.40659 0.010021 -0.0079257 -0.11435 0.017008 -0.29789 0.19079 0.37112 -0.26588 0.16212 0.065469 -0.31781 -0.03226 0.081969 0.3445 -0.17362 -0.35745 0.054487 0.39941 0.13699 -0.022066 0.11025 -0.41898 0.1276 -0.095869 -0.17944 -0.17443 0.27302 -0.19464 0.26747 -0.28241 0.1638 -0.11518 0.013196 -0.10616 -0.36093 0.023634 0.13464 0.021652 -0.27094 -0.018737 0.10017 0.36071 -0.093951 0.47634 0.12874 0.0011868 0.1377 -0.14034 -0.1887 -0.16405 -0.15349 0.32347 -0.17616 0.3523 -0.023531 -0.19121 -0.054809 -0.099521 -0.30056 0.36632 -0.21509 0.074123 -0.20267 0.1286 -0.38111 -0.025482 0.45103 0.088633 0.36288 -0.23406 -0.086024 -0.50604 0.034242 0.43998 -0.083023 -0.11969 0.68686 -0.34115 0.21228 0.40039 0.26367 -0.37144 0.16206 -0.42854 0.078658 -0.2905 0.21727 -0.27484 0.35887 0.27055 -0.11326 -0.14848 -0.0050659 -0.076862 0.078621 -0.24922 0.42026 -0.069698 0.071595 0.0071665 0.27473 -0.15664 0.25713 -0.058461 -0.29733 -0.090996 0.5246 0.14889 -0.20883 -0.13004 -0.20022 0.4503 -0.34654 -0.26007 0.35247 -0.34757 0.033738 0.19907 -0.32912 -0.084689 0.65319 0.20954 0.079274 0.1086 0.0026466 -0.12843 -0.22811 0.051501 -0.27429 0.14505 -0.1843 -0.34825 -0.11701 0.34034 0.075848 0.08239 -0.39188 -0.022312 -0.080373 0.14477 0.29701 -0.10523 0.092893 0.029813 -0.11761 0.16308 0.098382 0.46152 -0.162 -0.2456 0.20293 -0.11344 0.057902 -0.19528 -0.20141 -0.22874 -0.014101 0.2637 -0.10028 -0.051896 0.18859 -0.17767 -0.11556 0.121 0.17303 0.11773 0.034837 0.28485 -0.30447 0.061024 -0.26442 -0.081135 -0.044524 -0.036931 -0.15217 0.29175 0.44926 -0.28875 0.33193 -0.01242 -0.18805 -0.19832 -0.19736 0.26893 0.11106 -0.67383 -0.1518 -0.16615 -0.16563 0.0093671 -0.15945 -0.33468 0.22038 -0.16724 -0.1535 -0.61782 -0.17258 0.088928 0.019411 0.18296 0.32967 -0.0024906 -0.09208 0.514 0.0042484 -0.084377 -0.71448 -0.22148 -0.04835 0.043761 -0.29376 -0.22287 0.18001 0.072197 0.46499 0.056466 0.40844 -0.23641 -0.038946 0.087363 -0.21901 -0.3231 -0.19989 -0.3128 -0.067656 -0.22596 0.090926 0.28365 0.31462 0.46082 -0.024871 -0.14605 0.30454 0.17704 -0.011311 0.26807 -0.032461 -0.16644 -0.15313 -0.20426 -0.3082 -0.2459 0.085848 -0.11767 -0.063056 -0.18133 -0.18629 -0.17694 0.29618 0.35987 0.0020102 0.38616 0.36712 -0.055112 -0.34733 -0.072678 -0.051119 -0.29069 0.053598 0.019587 0.16808 -0.27456 -0.097179 -0.054541 0.19229 -0.48128 -0.20304 0.19368 -0.32546 0.14421 -0.169 0.26501
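A sketch of loading that format, assuming your polyglot version ships an Embedding.from_glove reader (if it does not, the one-word-plus-vector-per-line format is simple enough to parse by hand) and that the rows above live in a hypothetical vectors.txt:
from polyglot.mapping import Embedding

# Parse the plain-text GloVe format (word followed by its vector on each line).
embeddings = Embedding.from_glove("vectors.txt")
print(embeddings.nearest_neighbors("in"))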
Hi there, I'm using polyglot to do some tokenization and NER extraction, and am using the output as features in a machine learning model. Since I know in advance which language I am processing, I instantiate a Text object using language hinting.
text = Text(text_string, hint_language_code="pt")
Now, for some reason, it seems that NER doesn't rely on the language hint passed in the constructor. It tries to infer a language again and sometimes gets it wrong (e.g. detects Portuguese as gl). Since there are no ner2 models for Galician to be downloaded, I get a ValueError: Package u'ner2.gl' not found in index.
Here is the full stack trace:
--> 238 ne_tuples = [((ent.start, ent.end), ent.tag, 1.) for ent in sent.entities]
239 ne_entity_range = get_target_entity_range(ne_tuples, entity_name, sent.tokens)
240 if ne_entity_range:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
18 if obj is None:
19 return self
---> 20 value = obj.__dict__[self.func.__name__] = self.func(obj)
21 return value
22
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in entities(self)
130 prev_tag = u'O'
131 chunks = []
--> 132 for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
133 if tag != prev_tag:
134 if prev_tag == u'O':
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
18 if obj is None:
19 return self
---> 20 value = obj.__dict__[self.func.__name__] = self.func(obj)
21 return value
22
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in ne_chunker(self)
98 @cached_property
99 def ne_chunker(self):
--> 100 return get_ner_tagger(lang=self.language.code)
101
102 @cached_property
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
28 key = tuple(list(args) + sorted(kwargs.items()))
29 if key not in cache:
---> 30 cache[key] = obj(*args, **kwargs)
31 return cache[key]
32 return memoizer
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in get_ner_tagger(lang)
190 def get_ner_tagger(lang='en'):
191 """Return a NER tagger from the models cache."""
--> 192 return NEChunker(lang=lang)
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
102 lang: language code to decide which chunker to use.
103 """
--> 104 super(NEChunker, self).__init__(lang=lang)
105 self.ID_TAG = NER_ID_TAG
106
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
38 """
39 self.lang = lang
---> 40 self.predictor = self._load_network()
41 self.ID_TAG = {}
42 self.add_bias = True
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in _load_network(self)
109 self.embeddings = load_embeddings(self.lang, type='cw')
110 self.embeddings.normalize_words(inplace=True)
--> 111 self.model = load_ner_model(lang=self.lang, version=2)
112 first_layer, second_layer = self.model
113 def predict_proba(input_):
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
28 key = tuple(list(args) + sorted(kwargs.items()))
29 if key not in cache:
---> 30 cache[key] = obj(*args, **kwargs)
31 return cache[key]
32 return memoizer
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in load_ner_model(lang, version)
92 """
93 src_dir = "ner{}".format(version)
---> 94 p = locate_resource(src_dir, lang)
95 fh = _open(p)
96 try:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in locate_resource(name, lang, filter)
41 p = path.join(polyglot_path, task_dir, lang)
42 if not path.isdir(p):
---> 43 if downloader.status(package_id) != downloader.INSTALLED:
44 raise ValueError("This resource is available in the index "
45 "but not downloaded, yet. Try to run\n\n"
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in status(self, info_or_id, download_dir)
735 """
736 if download_dir is None: download_dir = self._download_dir
--> 737 info = self._info_or_id(info_or_id)
738
739 # Handle collections:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _info_or_id(self, info_or_id)
505 def _info_or_id(self, info_or_id):
506 if isinstance(info_or_id, unicode):
--> 507 return self.info(info_or_id)
508 else:
509 return info_or_id
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in info(self, id)
931 if id in self._packages: return self._packages[id]
932 if id in self._collections: return self._collections[id]
--> 933 raise ValueError('Package %r not found in index' % id)
934
935 def get_collection(self, lang=None, task=None):
ValueError: Package u'ner2.gl' not found in index
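A workaround sketch, not an official API: since the hint is ignored here, build the NER tagger for the language you already know and annotate the words yourself, mirroring the forced-language extract function earlier on this page (text_string is the Portuguese input from above):
from polyglot.text import Text
from polyglot.tag.base import get_ner_tagger

text = Text(text_string, hint_language_code="pt")
tagger = get_ner_tagger(lang="pt")  # force the Portuguese models
for word, tag in tagger.annotate(text.words):
    print(word, tag)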
Windows 8.1, Python 3.5.
Trying to run the NER example given in the documentation results in a list index out of range error.
I would like to use the Named Entity Extraction of polyglot, so I'm following the documentation at http://polyglot.readthedocs.org/en/latest/NamedEntityRecognition.html, however when I execute
print(downloader.supported_languages_table("ner2", 3))
I get the following error:
Traceback (most recent call last):
File "C:/Users/text_analyzer_polyglot.py", line 22, in <module>
main()
File "C:/Users/text_analyzer_polyglot.py", line 18, in main
print(downloader.supported_languages_table("ner2", 3))
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 963, in supported_languages_table
languages = self.supported_languages(task)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 955, in supported_languages
collection = self.get_collection(task=task)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in get_collection
if task: raise TaskNotSupported("Task {} is not supported".format(id))
polyglot.downloader.TaskNotSupported: Task TASK:ner2 is not supported
In addition, if I try to execute:
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
print (text.entities)
I get the following error:
Traceback (most recent call last):
File "C:/Users/text_analyzer_polyglot.py", line 23, in <module>
main()
File "C:/Users/text_analyzer_polyglot.py", line 20, in main
print (text.entities)
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "C:\Python27\lib\site-packages\polyglot\text.py", line 124, in entities
for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "C:\Python27\lib\site-packages\polyglot\text.py", line 96, in ne_chunker
return get_ner_tagger(lang=self.language.code)
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 152, in get_ner_tagger
return NEChunker(lang=lang)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 99, in __init__
super(NEChunker, self).__init__(lang=lang)
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
self.predictor = self._load_network()
File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in _load_network
self.embeddings = load_embeddings(self.lang, type='cw')
File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "C:\Python27\lib\site-packages\polyglot\load.py", line 64, in load_embeddings
p = locate_resource(src_dir, lang)
File "C:\Python27\lib\site-packages\polyglot\load.py", line 47, in locate_resource
if downloader.status(package_id) != downloader.INSTALLED:
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 730, in status
info = self._info_or_id(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 500, in _info_or_id
return self.info(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 918, in info
raise ValueError('Package %r not found in index' % id)
ValueError: Package u'embeddings2.en' not found in index
Am I missing something in the documentation?
Could you tell me how to successfully run the Named Entity Extraction?
I just found polyglot, which seems to be FANTASTIC for dealing with all sorts of NLP problems in multiple languages. I want to use it for Swedish texts, so I got to work and tested it on some real-world texts.
Here's a random Swedish article: http://www.dn.se/nyheter/sverige/oligarken-som-ager-en-o-i-stockholm/ I manually copy-pasted the text into a txt file, downloaded the required Swedish models and tried to get the NER tags from it.
from polyglot.text import Text
text = Text(open("test.txt").read())
for entity in text.entities:
    print(entity.tag, entity)
Here's a part of the output:
...
I-LOC ['Stockholm']
I-LOC ['slovakisk']
I-PER ['Frantisek']
I-PER ['Jules', 'Verne']
I-PER ['Zvrskovec']
I-LOC ['Lidingö']
I-PER ['Bilspedition']
I-LOC ['Tjeckien']
I-PER ['oligarken']
I-LOC ['Indiana']
I-PER ['Indiana', 'Jones']
I-ORG ['Arlanda']
I-LOC ['Tjeckien']
I-LOC ['Stockholm']
I-PER ['Frantisek', 'Zvrskovec']
I-PER ['.']
I-PER ['Frantisek', 'Zvrskovec']
I-PER ['bottenskrevan']
I-LOC ['Stockholms']
I-PER ['landstigningsförbud']
I-PER ['helstängt']
I-PER ['Magnus', 'Hallgren']
I-PER ['Dividend']
I-ORG ['Central', 'Europe']
I-PER ['Zvrskovec']
I-PER ['.']
I-LOC ['Tjeckoslovakien']
I-LOC ['Dolny']
I-PER ['Dolny', 'Kubin']
I-PER ['.']
...
...which is OK, but two things stand out:
I guess I could write my own filter to remove punctuation and lowercase words, but this seems like it should be easier to solve in an earlier step, when training the models, don't you think?
For example, in README.rst, under "Named Entity Recognition", the result should include a visible backslash in "\xfd", but this is swallowed by the parsed-literal body element.
While it would be possible to escape the backslash and have it shown correctly when rendered, this would be confusing to someone reading the document in raw format.
Consider using the code body element instead, or showing results as comments in the code preceding them.
Hi,
I just upgraded to polyglot 15.04.19 and it seems the model needs to be updated too.
In [1]: from polyglot.downloader import downloader
In [2]: downloader.download("embeddings2.en")
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data] /home/ubuntu/polyglot_data...
Out[2]: True
In [3]: downloader.download("pos2.en")
[polyglot_data] Downloading package pos2.en to
[polyglot_data] /home/ubuntu/polyglot_data...
Out[3]: True
In [4]: blob = """We will meet at eight o'clock on Thursday morning."""
In [5]: from polyglot.text import Text
In [6]: text = Text(blob)
In [7]: text.pos_tags
Out[7]:
[(u'We', u'INTJ'),
(u'will', u'NOUN'),
(u'meet', u'NOUN'),
(u'at', u'ADP'),
(u'eight', u'DET'),
(u"o'clock", u'PART'),
(u'on', u'ADP'),
(u'Thursday', u'PART'),
(u'morning', u'PART'),
(u'.', u'ADV')]
Also you might want to update this too.
In [200]: polyglot.__version__
Out[200]: '16.07.04'
hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
Text(hebrew_text).language.code
Out[184]: 'iw'
This problem affects work greatly, since the "he" language code is significantly better supported in polyglot (and elsewhere) than "iw".
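"iw" is the obsolete ISO 639-1 code for Hebrew, which cld2 still reports, so one workaround is to normalize legacy codes before any model lookup. A sketch (the other two legacy mappings, "in" to "id" and "ji" to "yi", are included for completeness):
from polyglot.text import Text

LEGACY_CODES = {u"iw": u"he", u"in": u"id", u"ji": u"yi"}

def normalize_lang(code):
    # Map obsolete ISO 639-1 codes to their modern equivalents.
    return LEGACY_CODES.get(code, code)

print(normalize_lang(Text(hebrew_text).language.code))  # 'he'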
Thanks for all the hard work!
Hi.
Is there any way to build a new NER model based on my own dataset?
In [1]: from polyglot.text import Text
In [2]: Text("Мне уже не раз приходилось писать о том, что мировой капитализм вошел в новую и последнюю фазу своего развития. Почти 100 лет назад (в 1916 году) В. Ленин (Ульянов) написал книгу «Империализм, как высшая стадия капитализма». В ней он констатировал, что в конце XIX — начале XX века капитализм стал монополистическим, и что такой капитализм является последней стадией развития этой общественно-экономической формации. Классик несколько поспешил с вынесением смертного приговора капитализму.").sentences
Out[2]:
[Sentence("Мне уже не раз приходилось писать о том, что мировой капитализм вошел в новую и последнюю фазу своего развития."),
Sentence("Почти 100 лет назад (в 1916 году) В."),
Sentence("Ленин (Ульянов) написал книгу «Империализм, как высшая стадия капитализма»."),
Sentence("В ней он констатировал, что в конце XIX — начале XX века капитализм стал монополистическим, и что такой капитализм является последней стадией развития этой общественно-экономической формации."),
Sentence("Классик несколько поспешил с вынесением смертного приговора капитализму.")]
#next one is fine
In [3]: Text("In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kubler et al., 2009).")
Out[3]: Text("In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kubler et al., 2009).")
#but this one is not
In [4]: Text("Marshall R. Mayberry III and Risto Miikkulainen. 2005. Broad-coverage parsing with neural networks. Neural Processing Letters").sentences
Out[4]:
[Sentence("Marshall R."),
Sentence("Mayberry III and Risto Miikkulainen."),
Sentence("2005."),
Sentence("Broad-coverage parsing with neural networks."),
Sentence("Neural Processing Letters")]
(How are such issues solved or avoided?)
Hi,
I have installed polyglot on Windows with Python 3.4; after solving some library problems, I started getting this error:
downloader.download()
Polyglot Downloader
d) Download l) List u) Update c) Config h) Help q) Quit
Downloader> l
Collections:
Traceback (most recent call last):
File "", line 1, in
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 649, in download
self._interactive_download()
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1068, in _interactive_download
DownloaderShell(self).run()
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1096, in run
more_prompt=True)
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 459, in list
for info in sorted(getattr(self, category)(), key=str):
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 495, in collections
self._update_index()
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 832, in _update_index
P = Package.fromcsobj(p)
File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 232, in fromcsobj
language = subdir.split(path.sep)[1]
IndexError: list index out of range
After some analysis (and a few neurons less...) I found the problem: on Windows, path.sep is, as expected, "\" instead of "/". Since the packages are named (ID'ed) with "/", path.sep makes no sense for Windows users. Or am I missing something I should have installed?
Replacing path.sep with "/" solves the problem and allows me to list and download any data I want into my polyglot installation.
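A sketch of that one-line change in Package.fromcsobj (downloader.py): the subdir field in the index always uses forward slashes, so split on a literal "/" rather than path.sep:
language = subdir.split("/")[1]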
Hi there, I'm getting an error when doing NER on a sentence.
The code used to get the named entities is:
ne_tuples = [((ent.start, ent.end), ent.tag, 1.) for ent in sent.entities]
It seems that the Downloader tries to read some information from a web index, which results in a socket error. Any ideas on what might be going on? Here is the stacktrace:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
18 if obj is None:
19 return self
---> 20 value = obj.__dict__[self.func.__name__] = self.func(obj)
21 return value
22
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in entities(self)
130 prev_tag = u'O'
131 chunks = []
--> 132 for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
133 if tag != prev_tag:
134 if prev_tag == u'O':
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in __get__(self, obj, cls)
18 if obj is None:
19 return self
---> 20 value = obj.__dict__[self.func.__name__] = self.func(obj)
21 return value
22
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/text.pyc in ne_chunker(self)
98 @cached_property
99 def ne_chunker(self):
--> 100 return get_ner_tagger(lang=self.language.code)
101
102 @cached_property
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
28 key = tuple(list(args) + sorted(kwargs.items()))
29 if key not in cache:
---> 30 cache[key] = obj(*args, **kwargs)
31 return cache[key]
32 return memoizer
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in get_ner_tagger(lang)
190 def get_ner_tagger(lang='en'):
191 """Return a NER tagger from the models cache."""
--> 192 return NEChunker(lang=lang)
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
102 lang: language code to decide which chunker to use.
103 """
--> 104 super(NEChunker, self).__init__(lang=lang)
105 self.ID_TAG = NER_ID_TAG
106
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in __init__(self, lang)
38 """
39 self.lang = lang
---> 40 self.predictor = self._load_network()
41 self.ID_TAG = {}
42 self.add_bias = True
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/tag/base.pyc in _load_network(self)
109 self.embeddings = load_embeddings(self.lang, type='cw')
110 self.embeddings.normalize_words(inplace=True)
--> 111 self.model = load_ner_model(lang=self.lang, version=2)
112 first_layer, second_layer = self.model
113 def predict_proba(input_):
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/decorators.pyc in memoizer(*args, **kwargs)
28 key = tuple(list(args) + sorted(kwargs.items()))
29 if key not in cache:
---> 30 cache[key] = obj(*args, **kwargs)
31 return cache[key]
32 return memoizer
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in load_ner_model(lang, version)
92 """
93 src_dir = "ner{}".format(version)
---> 94 p = locate_resource(src_dir, lang)
95 fh = _open(p)
96 try:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/load.pyc in locate_resource(name, lang, filter)
41 p = path.join(polyglot_path, task_dir, lang)
42 if not path.isdir(p):
---> 43 if downloader.status(package_id) != downloader.INSTALLED:
44 raise ValueError("This resource is available in the index "
45 "but not downloaded, yet. Try to run\n\n"
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in status(self, info_or_id, download_dir)
735 """
736 if download_dir is None: download_dir = self._download_dir
--> 737 info = self._info_or_id(info_or_id)
738
739 # Handle collections:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _info_or_id(self, info_or_id)
505 def _info_or_id(self, info_or_id):
506 if isinstance(info_or_id, unicode):
--> 507 return self.info(info_or_id)
508 else:
509 return info_or_id
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in info(self, id)
927 if id in self._packages: return self._packages[id]
928 if id in self._collections: return self._collections[id]
--> 929 self._update_index() # If package is not found, most probably we did not
930 # warm up the cache
931 if id in self._packages: return self._packages[id]
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/site-packages/polyglot/downloader.pyc in _update_index(self, url)
829 elif source == 'mirror':
830 index_url = path.join(self._url, 'index.json')
--> 831 data = urlopen(index_url).read()
832
833 if six.PY3:
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
429 req = meth(req)
430
--> 431 response = self._open(req, data)
432
433 # post-process response
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in _open(self, req, data)
447 protocol = req.get_type()
448 result = self._call_chain(self.handle_open, protocol, protocol +
--> 449 '_open', req)
450 if result:
451 return result
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
407 func = getattr(handler, meth_name)
408
--> 409 result = func(*args)
410 if result is not None:
411 return result
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in http_open(self, req)
1225
1226 def http_open(self, req):
-> 1227 return self.do_open(httplib.HTTPConnection, req)
1228
1229 http_request = AbstractHTTPHandler.do_request_
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1198 else:
1199 try:
-> 1200 r = h.getresponse(buffering=True)
1201 except TypeError: # buffering kw not supported
1202 r = h.getresponse()
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in getresponse(self, buffering)
1134
1135 try:
-> 1136 response.begin()
1137 assert response.will_close != _UNKNOWN
1138 self.__state = _CS_IDLE
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in begin(self)
451 # read until we get a non-100 response
452 while True:
--> 453 version, status, reason = self._read_status()
454 if status != CONTINUE:
455 break
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/httplib.pyc in _read_status(self)
407 def _read_status(self):
408 # Initialize with Simple-Response defaults
--> 409 line = self.fp.readline(_MAXLINE + 1)
410 if len(line) > _MAXLINE:
411 raise LineTooLong("header line")
/home/preceptor/miniconda2/envs/wiki/lib/python2.7/socket.pyc in readline(self, size)
478 while True:
479 try:
--> 480 data = self._sock.recv(self._rbufsize)
481 except error, e:
482 if e.args[0] == EINTR:
pip install -U git+https://github.com/aboSamoor/polyglot.git@master
In cmd:
c:\tmp>polyglot download
Traceback (most recent call last):
File "C:\Python27\Scripts\polyglot-script.py", line 9, in <module>
load_entry_point('polyglot==15.10.3', 'console_scripts', 'polyglot')()
File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 552, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2672, in load_entry_point
return ep.load()
File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2345, in load
return self.resolve()
File "C:\Python27\lib\site-packages\pkg_resources\__init__.py", line 2351, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "C:\Python27\lib\site-packages\polyglot\__main__.py", line 9, in <module>
from signal import signal, SIGPIPE, SIG_DFL
ImportError: cannot import name SIGPIPE
In IDLE:
from polyglot.downloader import downloader
downloader.download("embeddings2.en")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
downloader.download("embeddings2.en")
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 663, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 533, in incr_download
try: info = self._info_or_id(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 507, in _info_or_id
return self.info(info_or_id)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 929, in info
self._update_index() # If package is not found, most probably we did not
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 843, in _update_index
P = Package.fromcsobj(p)
File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 216, in fromcsobj
language = subdir.split(path.sep)[1]
IndexError: list index out of range
Hello, I downloaded the embeddings2.zh model and tried to get the entities of Chinese text, but failed with this error message:
ValueError: Package 'embeddings2.zh_Hant' not found in index
How could I solve it? Thanks.
Dear Mr. Al-Rfou,
Actually I tried again with the github version and transliteration did operate correctly. Thanks for making this available!
Could you please give me information on the method it relies on (character-based? dictionary-based? Buckwalter? machine learning?) and on the output format (Arabica?), and whether it is conceivable to adapt the model?
Thank you in advance.
Best,
Hello, thanks for your work.
What kind of data are tsne2 and sgns2?
I am not able to install new models, for example the Czech ones to test the morphology analysis:
root@mario-VirtualBox:/home/mario/python-scripts# python example.py
Traceback (most recent call last):
File "example.py", line 28, in <module>
print(word2.morphemes)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/text.py", line 269, in morphemes
words, score = self.morpheme_analyzer.viterbi_segment(self.string)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 20, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/text.py", line 265, in morpheme_analyzer
return load_morfessor_model(lang=self.language)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/decorators.py", line 30, in memoizer
cache[key] = obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/load.py", line 128, in load_morfessor_model
p = locate_resource(src_dir, lang)
File "/usr/local/lib/python2.7/dist-packages/polyglot-15.10.03-py2.7.egg/polyglot/load.py", line 51, in locate_resource
return path.join(p, os.listdir(p)[0])
OSError: [Errno 2] No such file or directory: '/root/polyglot_data/morph2/cs'
root@mario-VirtualBox:/home/mario/python-scripts#
my code is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import polyglot
from polyglot.text import Text, Word
from polyglot.detect import Detector
#from polyglot.downloader import downloader
#downloader.download("morph2.cs")
#mixed_text = u"""
#Officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
#"""
#zen = Text("Beautiful is better than ugly. "
# "Explicit is better than implicit. "
# "Simple is better than complex.")
#print(zen.words)
#print(zen.sentences)
#detector = Detector(mixed_text)
#print(detector.language)
#word = Text("Preprocessing is an essential step.").words[0]
#print(word.morphemes)
word2 = Text("Na cestu do Německa se vydal už před dvěma roky.").words[0]
print(word2.morphemes)
from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="cs")
print(transliterator.transliterate(u"preprocessing"))
The commented-out lines work fine.
I'm using Ubuntu 14.04 and PIP:
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Icld2/internal -Icld2/public -I/home/raul/miniconda/include/python2.7 -c bindings/pycldmodule.cc -o build/temp.linux-i686-2.7/bindings/pycldmodule.o -w -O2 -m64 -fPIC
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from /home/raul/miniconda/include/python2.7/Python.h:58:0,
from bindings/pycldmodule.cc:15:
/home/raul/miniconda/include/python2.7/pyport.h:886:2: error: #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
#error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
^
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/home/raul/miniconda/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-GclX4J/pycld2/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-kaY6oa-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-GclX4J/pycld2
This is using "pip install polyglot".
I've located some useful arguments that can help here, but I'm not sure how to add them to the cc command.
Complete output from command /usr/bin/python -c "import setuptools, tokenize;file='/private/var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-build-uOkJfF/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-EdbjO8-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.10-intel-2.7
copying icu.py -> build/lib.macosx-10.10-intel-2.7
copying PyICU.py -> build/lib.macosx-10.10-intel-2.7
copying docs.py -> build/lib.macosx-10.10-intel-2.7
running build_ext
building '_icu' extension
creating build/temp.macosx-10.10-intel-2.7
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/local/include -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _icu.cpp -o build/temp.macosx-10.10-intel-2.7/_icu.o -DPYICU_VER="1.9.2"
In file included from _icu.cpp:27:
./common.h:86:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>
^
1 error generated.
error: command 'cc' failed with exit status 1
Hello!
When I issued this command:
polyglot download embeddings2.en ner2.en
I received the following answer:
[polyglot_data] Error loading embeddings2.en: HTTP Error 503: Service
[polyglot_data] Unavailable
Error installing package. Retry? [n/y/e]
This has been happening for about 3 days (as far as I know) and in all sorts of circumstances.
I think your downloads server is down. Any thoughts?
Hi aboSamoor, I want to create a Semantic Role Labeling model (non-English).
Maybe this is off-topic for polyglot, but you advised using deepnl in another issue answer, so I would like to ask: how do I train a new SRL model from raw text data?
Currently I am trying to use https://github.com/attardi/deepnl and https://github.com/erickrf/nlpnet (both hard to implement due to unclear steps and data formats; I asked on their GitHub pages but no clear solution was provided). I want to use your pretrained word embeddings from https://sites.google.com/site/rmyeid/projects/polyglot to create the SRL model.
What would you advise to solve this task?
Hi Rami @aboSamoor,
In the following example, I expected to have 'VERB' pos_tag for the word 'restate', however, I got 'NUM':
blob ="""restate (words) from one language into another language."""
text = Text(blob)
text.pos_tags
text.words[0].pos_tag
u'NUM'
http://polyglot.readthedocs.org/en/latest/Transliteration.html
Everything below "Downloading Necessary Models" belongs to the POS example.
Hi,
Just found out that the entities function processes the BaseBlob words list, which in some cases produces false positives by merging an entity at the end of one sentence with another entity at the start of the next sentence.
Here's the example:
In [1]: from polyglot.text import Text, Word
In [2]: blob = u"""Momentum perbaikan "Los Blaugranas" itu awalnya justru terjadi setelah Martin Caceres diusir wasit karena melakukan pelanggaran keras di kotak penalti. Meski bermain dengan sepuluh pemain, rasa percaya diri mereka bangkit setelah kiper Jose Manuel Pinto berhasil menggagalkan tendangan penalti Jose Luis Marti.
Barca semakin kuat setelah Messi masuk pada menit ke-58. Messi pula yang mencetak gol penyama skor pada menit ke-80."""
In [3]: text = Text(blob)
In [4]: text.entities
Out[4]:
[I-PER([u'Martin', u'Caceres']),
I-PER([u'Jose', u'Manuel', u'Pinto']),
I-PER([u'Jose', u'Luis', u'Marti', u'.', u'Barca']),
I-PER([u'Messi']),
I-PER([u'Messi'])]
Since there's a SentenceTokenizer function, I think it would be best to process the Sentence words rather than the BaseBlob words. I know it'll require an extra step to create sentence objects, but the result will be better.
--- text.py.orig 2015-05-11 16:25:54.249957999 +0000
+++ .../lib/python2.7/site-packages/polyglot/text.py 2015-05-11 16:26:53.001957999 +0000
@@ -117,21 +117,22 @@
@cached_property
def entities(self):
"""Returns a list of entities for this blob."""
- start = 0
- end = 0
- prev_tag = u'O'
chunks = []
- for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
- if tag != prev_tag:
- if prev_tag == u'O':
- start = i
- else:
- chunks.append(Chunk(self.words[start: i], start, i, tag=prev_tag,
- parent=self))
- prev_tag = tag
- if tag != u'O':
- chunks.append(Chunk(self.words[start: i+1], start, i+1, tag=tag,
- parent=self))
+ for sentence in self.sentences:
+ start = 0
+ end = 0
+ prev_tag = u'O'
+ for i, (w, tag) in enumerate(self.ne_chunker.annotate(sentence.words)):
+ if tag != prev_tag:
+ if prev_tag == u'O':
+ start = i
+ else:
+ chunks.append(Chunk(sentence.words[start: i], start, i, tag=prev_tag,
+ parent=self))
+ prev_tag = tag
+ if tag != u'O':
+ chunks.append(Chunk(sentence.words[start: i+1], start, i+1, tag=tag,
+ parent=self))
return chunks
@cached_property
...and here's the result,
In [8]: text.entities
Out[8]:
[I-PER([u'Martin', u'Caceres']),
I-PER([u'Jose', u'Manuel', u'Pinto']),
I-PER([u'Jose', u'Luis', u'Marti']),
I-PER([u'Barca']),
I-PER([u'Messi']),
I-PER([u'Messi'])]
I can't seem to find a way, using polyglot, to test word analogies like having v["queen"] closest to v["king"] - v["man"] + v["woman"]. Basically I'd like to find neighbors of a linear combination of vectors of words instead of a single word vector.
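A sketch of doing it by hand with numpy, since nearest_neighbors only takes a word: the embeddings path below is a placeholder for your polyglot_data location, and vocabulary.id_word is assumed to be the id-to-word mapping in your version:
import numpy as np
from polyglot.mapping import Embedding

emb = Embedding.load("/home/user/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
target = emb["king"] - emb["man"] + emb["woman"]
# Cosine similarity of every embedding row against the combined vector.
scores = emb.vectors.dot(target) / (
    np.linalg.norm(emb.vectors, axis=1) * np.linalg.norm(target))
best = np.argsort(-scores)[:10]  # indices of the ten closest words
print([emb.vocabulary.id_word[i] for i in best])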
When deploying polyglot to a Docker container Ubuntu environment (thus running as root inside the container), the downloader simply drops the resources into / -- looks like there might be a bug in how the package_data path is determined for a superuser.
I was also wondering if it might make sense to consider a $POLYGLOT_PACKAGE_DATA environment variable when determining the path?
p.s. thanks for this amazing library! It's giving me great results and was much needed.
These two files can cause problems on OS X or Windows. Will rename one of them.
Testing the following sequence of commands:
from polyglot.text import Text
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').entities
print Text('At least two dead in operation targeting suspected Paris attacks mastermind').pos_tags
Gives the output:
[(u'At', u'ADV'), (u'least', u'ADV'), (u'two', u'NUM'), (u'dead', u'ADJ'), (u'in', u'ADP'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'VERB'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]
[(u'At', u'ADV'), (u'least', u'ADV'), (u'two', u'NUM'), (u'dead', u'ADJ'), (u'in', u'ADP'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'VERB'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]
[]
[(u'At', u'PROPN'), (u'least', u'PROPN'), (u'two', u'PROPN'), (u'dead', u'NOUN'), (u'in', u'PROPN'), (u'operation', u'NOUN'), (u'targeting', u'VERB'), (u'suspected', u'PROPN'), (u'Paris', u'PROPN'), (u'attacks', u'NOUN'), (u'mastermind', u'PROPN')]
Notice how the third set of tags is completely different from the first two, after computing named entities; 'PROPN' tags especially become common afterwards. It appears that the POS tags of every subsequent sentence after the entities call are broken as well.
What is going wrong here?
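One plausible culprit, judging from the NER tracebacks elsewhere on this page: NEChunker._load_network calls self.embeddings.normalize_words(inplace=True), and load_embeddings is memoized, so computing entities mutates the very embeddings object the cached POS tagger shares.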
I am trying to transliterate the string "عمتي سوسن" to English. Although Google's language detector detects this as Arabic text, polyglot throws an error stating "Package u'transliteration2.ks' not found in index". Moreover, the language package ks is not present in the polyglot list. I am getting the same error with 'sd', 'ku', 'ps' and 'uz' too.
Text("Edge cases " + chr(917631) + " can be annoying.").words
WordList(['Edge', 'cases', '\U000e007f', 'c', 'an', 'b', 'e', 'a', 'nnoying.'])
I'm doing some word lookups for Portuguese and I got the following:
File "/home/intruder/source/tgalery/analytyca/analytyca/utils/context.py", line 9, in get_vector
vector = embeddings[word_key]
File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/embeddings.py", line 40, in __getitem__
return self.vectors[self.vocabulary[k]]
File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 29, in __getitem__
return self.approximate_ids(key)
File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 52, in approximate_ids
raise KeyError("{} not found".format(key))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)
Because the format string is not compatible with the incoming unicode key, raising the KeyError throws another exception.
I'm happy to fix this, but I wonder whether the keys are meant to be in binary format for lookups.
Let me know how best to proceed.
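A minimal sketch of one possible fix in polyglot/mapping/expansion.py, assuming a unicode key under Python 2: make the format string unicode so that building the KeyError message cannot itself raise UnicodeEncodeError.
raise KeyError(u"{} not found".format(key))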
When running Python 2.7, I can't import polyglot. This can be fixed by installing the backported futures package.
~/bin/polyglot $ ipython
Python 2.7.5+ (default, Feb 27 2014, 19:37:08)
In [1]: import polyglot
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-00a07ab614e8> in <module>()
----> 1 import polyglot
/home/arne/bin/polyglot/polyglot/__init__.py in <module>()
8
9 from six.moves import copyreg
---> 10 from .base import Sequence, TokenSequence
11 from .utils import _pickle_method, _unpickle_method
12
/home/arne/bin/polyglot/polyglot/base.py in <module>()
7 from collections import Counter
8 import os
----> 9 from concurrent.futures import ProcessPoolExecutor
10 from itertools import islice
11
ImportError: No module named concurrent.futures
python3.4:
pip install polyglot
from polyglot.text import Text, Word
---> 11 from icu import Locale
12 import pycld2 as cld2
13
ImportError: No module named 'icu'
It's not listed as a module dependency, nor is it mentioned in the README.
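(The icu module is provided by the PyICU package, installable with pip install pyicu, which in turn needs the ICU C headers mentioned in the PyICU build failure earlier on this page.)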
Hi,
firstly, thanks for this great software. I noticed in the code that when the target language is English, it does not transliterate and just returns the original text. Does polyglot not support transliterating into English, or am I missing something?
regards,
Amir
Please consider doing a new release on pypi. Transliteration is broken in the current version, and it was fixed by my commit: 600514a
I'm using polyglot in a project and it's not nice to have to get or build development versions manually.
It could also be a simple minor version 'bugfix' release.