Comments (4)
It might be related to the way the file is opened in this line, where it is opened again inside the clean function:
File.open(file, 'r+') do |f|
According to the Stack Overflow question, the file should be opened with:
File.open(name, "r+:UTF-8")
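For illustration, here is a minimal sketch of what reading and rewriting a file in place with an explicit UTF-8 mode could look like. The helper name squeeze_spaces_in_place and the sample text are hypothetical; this is not docsplit's actual cleaning code, only a demonstration of the "r+:UTF-8" mode string under the assumption that the OCR output is UTF-8.

```ruby
require "tmpdir"

# Hypothetical helper (not part of docsplit): collapses runs of spaces
# in a text file, reading and writing it explicitly as UTF-8.
def squeeze_spaces_in_place(path)
  # "r+" opens for reading and writing; ":UTF-8" sets the external
  # encoding, so the string returned by read is tagged as UTF-8.
  File.open(path, "r+:UTF-8") do |f|
    text = f.read
    cleaned = text.gsub(/ {2,}/, " ")
    f.rewind
    f.write(cleaned)
    f.flush
    # The cleaned text may be shorter, so cut off any leftover bytes.
    f.truncate(f.pos)
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "page.txt")
  # Sample non-English text with accented characters (made up for the demo).
  File.write(path, "El  año   pasado", encoding: "UTF-8")
  squeeze_spaces_in_place(path)
  puts File.read(path, encoding: "UTF-8")
end
```

Without the encoding suffix, Ruby tags the data with the default external encoding, which depends on the environment and may not match the file.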
I am not a Ruby programmer, so I am not sure. Also, I am using the gem installation and I have no idea how to install a GitHub fork into the system. It will take me a while to learn the basics of Ruby and try it out.
Thanks!
from docsplit.
Sorry, I didn't include the link to the PDF of the image so that you can try it.
Hey @robertour, if you set the --no-clean flag, then everything should work alright. The OCR cleaning functions are really only intended to be used with English (until we can get a pluggable system for other languages too).
The --language flag should probably automatically set --no-clean as well.
Thanks for the note, and sorry for the inconvenience!
Thanks! That was quite obvious :S, but it didn't occur to me.