Basically I am trying to recognize the text of the attached image. When I use tessera

Not saving Unicode (UTF8) characters (accents in other languages) about docsplit HOT 4 CLOSED

documentcloud commented on June 30, 2024

Not saving Unicode (UTF8) characters (accents in other languages)

from docsplit.

Comments (4)

robertour commented on June 30, 2024

It might be related to the way the file is opened in this line, where the file is opened again in the clean_function:

File.open(file, 'r+') do |f|

According to the Stack Overflow question, the file should be opened with

File.open(name, "r+:UTF-8")

I am not a Ruby programmer, so I am not sure. Also, I am using the gem installation and I have no idea how to install a GitHub fork into the system. It will take me a while to learn the basics of Ruby and try it out.
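For reference, here is a minimal sketch of what the encoding suffix in the mode string changes. It is not docsplit's code: `read_with_mode` and `compare_encodings` are hypothetical helper names, and the sample text is made up. A plain "r+" reads with Ruby's default external encoding, which depends on the environment, so the sketch simulates a non-UTF-8 environment by explicitly asking for US-ASCII.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file using the given mode string,
# e.g. "r+" or "r+:UTF-8".
def read_with_mode(path, mode)
  File.open(path, mode) { |f| f.read }
end

# Hypothetical helper: write the sample bytes to a temp file, then read them
# back under two different external encodings.
def compare_encodings(sample)
  Tempfile.create("ocr_sample") do |tmp|
    tmp.binmode          # write the bytes exactly as given
    tmp.write(sample)
    tmp.close
    {
      utf8:  read_with_mode(tmp.path, "r+:UTF-8"),
      # Stand-in for a plain "r+" in an environment whose default is not UTF-8.
      ascii: read_with_mode(tmp.path, "r+:US-ASCII")
    }
  end
end

result = compare_encodings("canci\u00F3n a\u00F1o")
puts result[:utf8].valid_encoding?   # true: accented characters are intact
puts result[:ascii].valid_encoding?  # false: same bytes, wrong encoding tag
```

The bytes on disk are identical in both cases; only the encoding attached to the string differs, which is why later string operations on the mis-tagged text can fail or mangle accented characters.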

Thanks!

robertour commented on June 30, 2024

Sorry, I didn't include the link to the PDF of the image (https://dl.dropbox.com/u/462142/p1.pdf), so you can try it.

knowtheory commented on June 30, 2024

Hey @robertour, if you set the --no-clean flag, then everything should work alright. The OCR cleaning functions are really only intended to be used with English (until we can get a pluggable system for other languages too).

The --language flag should probably automatically set --no-clean as well.
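As a rough illustration of that idea, the sketch below derives the cleaning default from the language. It is only an assumption about how such a default could look, not docsplit's actual option handling: `ocr_options` is a hypothetical helper, and treating "eng" as the default language code is also assumed.

```ruby
# Hypothetical helper: choosing a non-English language turns cleaning off
# unless the caller explicitly asks for it.
def ocr_options(language: "eng", clean: nil)
  # Only fill in a default when the caller did not pass clean: at all.
  clean = (language == "eng") if clean.nil?
  { language: language, clean: clean }
end

p ocr_options()                               # English: cleaning stays on
p ocr_options(language: "spa")                # Spanish: cleaning off by default
p ocr_options(language: "spa", clean: true)   # an explicit choice still wins
```

Using `nil` as the "not specified" value keeps an explicit `clean: true` or `clean: false` from being overridden by the language-based default.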

Thanks for the note, and sorry for the inconvenience!

robertour commented on June 30, 2024

Thanks! That was quite obvious :S, but it didn't occur to me.
