Comments (5)

bdewilde commented on July 29, 2024

Hi @pavancs, could you provide the full traceback so I can see which line in textacy is raising the UnicodeDecodeError? Also, do you happen to know which article is causing the error? I reworked the WikiReader.texts() code a few weeks ago, and I thought I had caught these sorts of errors. Apparently I missed something. :/

from textacy.

pavancs commented on July 29, 2024

Hi @bdewilde, here is the full traceback. I'm not sure which article is causing this error.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-28-bae0283836f6> in <module>()
----> 1 for text in wr.texts(limit=2):
      2     print(text)

C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in texts(self, min_len, limit)
    255         """
    256         n_pages = 0
--> 257         for _, title, content in self:
    258             text = strip_markup(content)
    259             if len(text) < min_len:

C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in __iter__(self)
    146             text_path = './{%s}revision/{%s}text' % (namespace, namespace)
    147 
--> 148             for elem in elems:
    149                 if elem.tag == page_tag:
    150                     page_id = elem.find(page_id_path).text

C:\Anaconda3\lib\site-packages\textacy-0.3.2-py3.5.egg\textacy\corpora\wiki_reader.py in <genexpr>(.0)
    131         with f:
    132 
--> 133             elems = (elem for _, elem in iterparse(f, events=events))
    134 
    135             elem = next(elems)

C:\Anaconda3\lib\xml\etree\ElementTree.py in __next__(self)
   1295                 raise StopIteration
   1296             # load event buffer
-> 1297             data = self._file.read(16 * 1024)
   1298             if data:
   1299                 self._parser.feed(data)

C:\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3216: character maps to <undefined>

pavancs commented on July 29, 2024

Hi @bdewilde, I did some digging.

This may be a naive question: why does the docstring for open_sesame() in fileio say "encoding: optional name of the encoding used to decode or encode `filepath`; only applicable in text mode"?

When I set encoding to 'utf-8', I am able to read the file successfully.
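
For anyone trying to reproduce this without the full dump, here is a minimal sketch of what seems to be going on. It assumes the file is opened in text mode with no explicit encoding, so Python falls back to the platform default (cp1252 on this Windows machine, per the traceback). The sample file and the `read_text` helper are made up for illustration and are not textacy code.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file in text mode with an explicit encoding (hypothetical helper)."""
    with open(path, mode="rt", encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")

    # "Á" (U+00C1) is encoded in UTF-8 as the bytes 0xC3 0x81;
    # byte 0x81 has no mapping in cp1252.
    with open(path, "wb") as f:
        f.write("<title>Á</title>".encode("utf-8"))

    # Decoding as UTF-8 works.
    utf8_text = read_text(path, "utf-8")

    # Decoding as cp1252 fails on byte 0x81, as in the traceback above.
    try:
        read_text(path, "cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
```

On macOS or Linux the default text encoding is usually UTF-8 already, which might explain why the error does not show up there.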

bdewilde commented on July 29, 2024

Hey @pavancs , thanks for digging. As far as I understand, that bit of code is correct: According to Python's docs, encoding should only be used in text mode. (See here and here.) Have I misunderstood how this stuff works?

Honestly, I found the fileio code very tricky to write on account of differences between Python 2 and 3, OS X / macOS and Windows Vista / 7 / 8 / 10, and compression formats. I tried to write a function that would automagically handle all these differences for the user, but apparently did not entirely succeed. I myself don't have any unicode errors when iterating over enwiki-latest-pages-articles.xml.bz2 (which is probably not the same version as yours) so I can't replicate the error. :/

How can I help? I'm not sure how to proceed.

pavancs commented on July 29, 2024

@bdewilde, for now I have applied a quick-and-dirty fix by setting the encoding explicitly.

Yes, the docs say encoding should only be used in text mode. I'm confused too now. I'll see what I can find and get back to you.
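
One possible way to sidestep the text-mode question entirely, offered as a sketch and not as what textacy does: open the compressed dump in binary mode and let the XML parser decode the bytes itself, so the platform default encoding never comes into play. The `iter_titles` function, the tiny sample document, and the un-namespaced `title` tag are all assumptions for illustration; a real Wikipedia dump uses namespaced tags.

```python
import bz2
import os
import tempfile
from xml.etree.ElementTree import iterparse


def iter_titles(path):
    """Yield the text of each <title> element in a bz2-compressed XML file (hypothetical helper)."""
    # Binary mode: no locale-dependent decoding happens here. The XML parser
    # reads the encoding from the XML declaration (UTF-8 if none is given).
    with bz2.open(path, mode="rb") as f:
        for _, elem in iterparse(f, events=("end",)):
            if elem.tag == "title":
                yield elem.text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml.bz2")
    xml = "<?xml version='1.0' encoding='utf-8'?><page><title>Á</title></page>"
    with bz2.open(path, "wb") as f:
        f.write(xml.encode("utf-8"))

    titles = list(iter_titles(path))
```

The other option is the one I am using now: keep text mode and always pass encoding='utf-8' explicitly.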
