Comments (5)
Hi @pavancs , could you provide the full traceback so I can see which line in textacy is causing the UnicodeDecodeError? Also: do you happen to know which article is causing the error? I reworked the WikiReader.texts() code a few weeks ago, and I thought that I had caught these sorts of errors. Apparently I missed something. :/
from textacy.
Hi @bdewilde , here is the full traceback. Not sure which article is causing this error.
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-28-bae0283836f6> in <module>()
----> 1 for text in wr.texts(limit=2):
2 print(text)
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in texts(self, min_len, limit)
255 """
256 n_pages = 0
--> 257 for _, title, content in self:
258 text = strip_markup(content)
259 if len(text) < min_len:
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in __iter__(self)
146 text_path = './{%s}revision/{%s}text' % (namespace, namespace)
147
--> 148 for elem in elems:
149 if elem.tag == page_tag:
150 page_id = elem.find(page_id_path).text
C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in <genexpr>(.0)
131 with f:
132
--> 133 elems = (elem for _, elem in iterparse(f, events=events))
134
135 elem = next(elems)
C:\Anaconda3\lib\xml\etree\ElementTree.py in __next__(self)
1295 raise StopIteration
1296 # load event buffer
-> 1297 data = self._file.read(16 * 1024)
1298 if data:
1299 self._parser.feed(data)
C:\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3216: character maps to <undefined>
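The failure above can be reproduced in isolation: byte 0x81 is a perfectly ordinary continuation byte in UTF-8, but it maps to no character at all in cp1252 (Windows' locale-default codec), so decoding the UTF-8 dump through cp1252 blows up. A minimal illustration (not textacy code):

```python
# U+0081 encodes to two bytes in UTF-8, the second of which is 0x81.
raw = "\u0081".encode("utf-8")          # b'\xc2\x81'

# Decoding as UTF-8 round-trips cleanly.
assert raw.decode("utf-8") == "\u0081"

# Decoding the same bytes as cp1252 fails: 0x81 is undefined there,
# producing the "'charmap' codec can't decode byte 0x81" error above.
try:
    raw.decode("cp1252")
except UnicodeDecodeError as e:
    print(e)
```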
Hi @bdewilde , did some digging.
This may be a naive question: why does fileio's open_sesame() say "encoding: optional name of the encoding used to decode or encode `filepath`; only applicable in text mode"?
Setting encoding to 'utf-8', I was able to read the file successfully.
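For context, a minimal sketch of the workaround being described: open the compressed dump in text mode with an explicit encoding, rather than relying on the platform default (cp1252 on many Windows setups). The file here is a hypothetical stand-in for the real .xml.bz2 dump:

```python
import bz2
import os
import tempfile

# Stand-in for the Wikipedia dump: a tiny bz2-compressed UTF-8 XML file.
path = os.path.join(tempfile.mkdtemp(), "sample.xml.bz2")
with bz2.open(path, mode="wt", encoding="utf-8") as f:
    f.write("<page><title>Åland</title></page>")

# Text mode + explicit encoding decodes correctly regardless of the
# locale's default codec.
with bz2.open(path, mode="rt", encoding="utf-8") as f:
    print(f.read())
```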
Hey @pavancs , thanks for digging. As far as I understand, that bit of code is correct: According to Python's docs, encoding should only be used in text mode. (See here and here.) Have I misunderstood how this stuff works?
Honestly, I found the fileio code very tricky to write, on account of differences between Python 2 and 3, OS X / macOS and Windows Vista / 7 / 8 / 10, and compression formats. I tried to write a function that would automagically handle all these differences for the user, but apparently did not entirely succeed. I myself don't get any unicode errors when iterating over enwiki-latest-pages-articles.xml.bz2 (which is probably not the same version as yours), so I can't replicate the error. :/
How can I help? I'm not sure how to proceed.
@bdewilde , for now I've applied a dirty fix by setting the encoding explicitly.
Yeah, the docs say encoding should only be used in text mode. I'm confused too. Will see what I can find and get back.
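Note that the docstring is consistent with how the built-in open() behaves: the encoding argument is only meaningful in text mode, and binary mode rejects it outright. A quick demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "t.txt")
open(path, "w", encoding="utf-8").close()

# In binary mode, open() refuses the encoding argument entirely.
try:
    open(path, "rb", encoding="utf-8")
except ValueError as e:
    print(e)  # binary mode doesn't take an encoding argument
```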