Comments (5)
Hi @pavancs , could you provide the full traceback so I can see which line in textacy is causing the UnicodeDecodeError? Also: do you happen to know which article is causing the error? I reworked the WikiReader.texts() code a few weeks ago, and I thought that I had caught these sorts of errors. Apparently I missed something. :/
from textacy.
Hi @bdewilde , here is the full traceback. Not sure which article is causing this error.
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-28-bae0283836f6> in <module>()
----> 1 for text in wr.texts(limit=2):
2 print(text)
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in texts(self, min_len, limit)
255 """
256 n_pages = 0
--> 257 for _, title, content in self:
258 text = strip_markup(content)
259 if len(text) < min_len:
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in __iter__(self)
146 text_path = './{%s}revision/{%s}text' % (namespace, namespace)
147
--> 148 for elem in elems:
149 if elem.tag == page_tag:
150 page_id = elem.find(page_id_path).text
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in <genexpr>(.0)
131 with f:
132
--> 133 elems = (elem for _, elem in iterparse(f, events=events))
134
135 elem = next(elems)
C:\Anaconda3\lib\xml\etree\ElementTree.py in __next__(self)
1295 raise StopIteration
1296 # load event buffer
-> 1297 data = self._file.read(16 * 1024)
1298 if data:
1299 self._parser.feed(data)
C:\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3216: character maps to <undefined>
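The failure above can be reproduced in isolation: byte 0x81 is a perfectly ordinary continuation byte in UTF-8, but it maps to no character at all in cp1252 (Windows' locale-default codec), so decoding the UTF-8 dump through cp1252 blows up. A minimal illustration (not textacy code):

```python
# U+0081 encodes to two bytes in UTF-8, the second of which is 0x81.
raw = "\u0081".encode("utf-8")          # b'\xc2\x81'

# Decoding as UTF-8 round-trips cleanly.
assert raw.decode("utf-8") == "\u0081"

# Decoding the same bytes as cp1252 fails: 0x81 is undefined there,
# producing the "'charmap' codec can't decode byte 0x81" error above.
try:
    raw.decode("cp1252")
except UnicodeDecodeError as e:
    print(e)
```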
Hi @bdewilde , did some digging.
This may be a naive question: why does fileio's open_sesame() say "encoding: optional name of the encoding used to decode or encode `filepath`; only applicable in text mode"?
Setting encoding to 'utf-8', I was able to read the file successfully.
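For context, a minimal sketch of the workaround being described: open the compressed dump in text mode with an explicit encoding, rather than relying on the platform default (cp1252 on many Windows setups). The file here is a hypothetical stand-in for the real .xml.bz2 dump:

```python
import bz2
import os
import tempfile

# Stand-in for the Wikipedia dump: a tiny bz2-compressed UTF-8 XML file.
path = os.path.join(tempfile.mkdtemp(), "sample.xml.bz2")
with bz2.open(path, mode="wt", encoding="utf-8") as f:
    f.write("<page><title>Åland</title></page>")

# Text mode + explicit encoding decodes correctly regardless of the
# locale's default codec.
with bz2.open(path, mode="rt", encoding="utf-8") as f:
    print(f.read())
```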
Hey @pavancs , thanks for digging. As far as I understand, that bit of code is correct: According to Python's docs, encoding should only be used in text mode. (See here and here.) Have I misunderstood how this stuff works?
Honestly, I found the fileio code very tricky to write, on account of differences between Python 2 and 3, OS X / macOS and Windows Vista / 7 / 8 / 10, and compression formats. I tried to write a function that would automagically handle all these differences for the user, but apparently did not entirely succeed. I myself don't get any unicode errors when iterating over enwiki-latest-pages-articles.xml.bz2 (which is probably not the same version as yours), so I can't replicate the error. :/
How can I help? I'm not sure how to proceed.
@bdewilde , for now I've applied a dirty fix by setting the encoding explicitly.
Yeah, the docs say encoding should only be used in text mode. I'm confused too. Will see what I can find and get back.
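Note that the docstring is consistent with how the built-in open() behaves: the encoding argument is only meaningful in text mode, and binary mode rejects it outright. A quick demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "t.txt")
open(path, "w", encoding="utf-8").close()

# In binary mode, open() refuses the encoding argument entirely.
try:
    open(path, "rb", encoding="utf-8")
except ValueError as e:
    print(e)  # binary mode doesn't take an encoding argument
```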