documentcloud / docsplit
Break Apart Documents into Images, Text, Pages and PDFs
Home Page: http://documentcloud.github.com/docsplit/
License: Other
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
Installation:
gem install docsplit
For documentation, usage, and examples, see: https://documentcloud.github.io/docsplit/
To suggest a feature or report a bug: http://github.com/documentcloud/docsplit/issues/
I'd like to be able to set a higher density when extracting images from a PDF.
Cheers,
Zach
I've followed the directions to install Docsplit from here (accounting for differences for CentOS of course): http://documentcloud.github.com/docsplit
I think I've got all necessary packages installed, but when I run "docsplit pdf <some.doc.file>" to convert a Word Doc to PDF, I get this error:
Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:61)
at org.artofsolving.jodconverter.office.PooledOfficeManager.start(PooledOfficeManager.java:102)
at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.start(ProcessPoolOfficeManager.java:59)
at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:98)
Caused by: java.util.concurrent.ExecutionException: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:59)
... 3 more
Caused by: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:123)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.access$000(ManagedOfficeProcess.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$1.run(ManagedOfficeProcess.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: org.artofsolving.jodconverter.office.RetryTimeoutException: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:48)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:113)
... 8 more
Caused by: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.OfficeConnection.connect(OfficeConnection.java:101)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$6.attempt(ManagedOfficeProcess.java:116)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:41)
... 10 more
For what it's worth, I had all of this working perfectly both in OSX and in an Ubuntu VM I set up. But CentOS is giving me a lot of grief with this stuff.
Any ideas?
When image splitting a PDF with one or several rotated pages, the results are cropped and bottom-aligned to fit in the standard portrait orientation.
The command:
docsplit images example.pdf
using this pdf produces
I can't convert anything to PDF. I tried HTML, DOC and DOCX files.
Example: docsplit pdf index.html
It doesn't give me any error or output.
Docsplit version 0.7.2
LibreOffice 3.4 340m1(Build:502)
Is there any way to track errors while converting a file?
I am having an issue with docsplit adding strange text when converting some PPTX files to PNGs.
It appears that some text is being superimposed on top of some images inside the slides (always at the top left corner).
Using LibreOffice 3.5.
Really appreciate any help on this!
Anything I can do to blow through this?
$ docsplit pages fileName.pdf
Error: Failed to open PDF file:
fileName.pdf
OWNER PASSWORD REQUIRED, but not given (or incorrect)
Errors encountered. No output created.
Done. Input errors, so no output created.
Example:
Docsplit.extract_text('/tmp/file with spaces (and probably other chars that need to be escaped).pdf', :ocr => false, :output => dir)
=> fails
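One common cause of this kind of failure is a path being interpolated into a shell command without escaping. The following is a minimal sketch, not Docsplit's actual code, of how Ruby's standard Shellwords module escapes such a path. The helper name build_pdftotext_command is hypothetical, and the pdftotext invocation is only illustrative.

```ruby
require 'shellwords'

# Hypothetical helper (not part of Docsplit): builds a shell-safe command line
# by escaping every path before it is interpolated into the string.
def build_pdftotext_command(pdf_path, txt_path)
  "pdftotext -enc UTF-8 #{Shellwords.escape(pdf_path)} #{Shellwords.escape(txt_path)}"
end

cmd = build_pdftotext_command('/tmp/file with spaces (draft).pdf', '/tmp/out.txt')
puts cmd
```

Splitting the resulting string with Shellwords.split gives back the original, unescaped paths, which is a quick way to check that the escaping round-trips.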
Development was done on a Mac, where upgrading to docsplit 0.7.2 and installing LibreOffice was a breeze, and everything worked perfectly the first time around.
However, moving it to our staging environment (AWS EC2 running Ubuntu) has proven to be a challenge.
The Gemfile was updated and LibreOffice was installed (sudo apt-get install libreoffice), but docsplit seems to have trouble running LibreOffice:
$ bundle exec docsplit pdf test.txt
Error: Could not find or load main class .usr.lib.libreoffice
However libreoffice seemed to be installed properly anyway:
$ soffice --headless --convert-to txt:text test_fake.doc
convert /home/app/helpers-staging/releases/20130328152857/test.doc -> /home/app/helpers-staging/releases/20130328152857/test.txt using text
[mike 1:12:20 ~/docsplit-test]% docsplit text HB00300S.PDF
Exception in thread "main" java.lang.IllegalArgumentException: unsupported input format: Portable Document Format
at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:99)
at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:74)
at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:70)
at com.artofsolving.jodconverter.cli.ConvertDocument.convertOne(ConvertDocument.java:154)
at com.artofsolving.jodconverter.cli.ConvertDocument.main(ConvertDocument.java:133)
[mike 1:12:30 ~/docsplit-test]% mv HB00300S.PDF HB00300S.pdf
[mike 1:12:34 ~/docsplit-test]% docsplit text HB00300S.pdf
[mike 1:12:43 ~/docsplit-test]%
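The session above suggests that the file extension is compared case-sensitively: the .PDF file was handed to the office converter, while the renamed .pdf file was not. A minimal sketch of a case-insensitive check follows. The helper pdf_file? is hypothetical and is not taken from the Docsplit source.

```ruby
# Hypothetical helper: true when the path has a PDF extension in any letter case.
def pdf_file?(path)
  File.extname(path).downcase == '.pdf'
end

puts pdf_file?('HB00300S.PDF')  # uppercase extension
puts pdf_file?('HB00300S.pdf')  # lowercase extension
puts pdf_file?('report.doc')    # not a PDF
```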
Is there a way to convert PDFs and Office documents into SVG?
Hello,
On this line, the directory name should be computed from Time.now or something similar. Hardcoded as it is, it produces write-permission errors when the gem is used by more than one Unix user: the directory belongs to the first user who runs docsplit, and all others get write errors.
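As an alternative to deriving the name from Time.now, Ruby's standard library can create a uniquely named, per-call scratch directory. The sketch below assumes that approach would fit. The wrapper with_scratch_dir is a hypothetical name, and this is not how Docsplit itself is written.

```ruby
require 'tmpdir'

# Hypothetical wrapper: runs a block with a private scratch directory.
# Dir.mktmpdir picks a unique name, so two Unix users never share the
# directory, and the directory is removed when the block finishes.
def with_scratch_dir
  Dir.mktmpdir('docsplit') do |dir|
    yield dir
  end
end

with_scratch_dir do |dir|
  File.write(File.join(dir, 'example.txt'), 'intermediate output')
  puts Dir.children(dir).inspect
end
```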
The clean_ocr method replaces accented characters with ?.
In French this is a problem.
If I use the --no-clean parameter, accents are kept.
Hello,
I have a few questions about the Docsplit project. We are building a product that uses docsplit, and in this product we need OCR support for Slovak. As you may know, a new version of Tesseract has been released that makes this possible. When do you plan to support Tesseract 3.0 in docsplit? Have a nice day. Matej Marus
Hi,
First, Docsplit is awesome. Thank you for sharing that.
I would like my users to upload a file (PDF) and have it available as images.
I am trying to combine Paperclip with docsplit for that, with no real success (Rails newbie).
Any tips or directions on how I can accomplish that? How did you do it on DocumentCloud?
Thank you so much!
S.
When using docsplit to extract text from some PDFs, I found that some of the text comes out mixed up. I understand that docsplit is a thin layer over other tools (in fact, pdftotext is to blame for mixing up the text), but I was wondering if you have any examples of how to use docsplit to minimize this effect (maybe using OCR instead of pdftotext?).
Also, I couldn't find any suggestions for stripping headers and page numbers from the output text. I wrote some code, but I guess you had the same problem and maybe came up with something better? :)
Thanks!
The "Installation & Dependencies" section of the docs shows the package name to be installed for Tesseract as "tesseract". This is incorrect for Debian/Ubuntu: in those distros the correct package name is "tesseract-ocr" (Debian package, Ubuntu package).
Not a huge deal for anyone who knows how to do an aptitude search, but might confuse some newbies.
I installed docsplit on fedora with the instructions in the documentation page http://documentcloud.github.com/docsplit .
I created a symbolic link:
ln -s /usr/lib64/openoffice.org3 /usr/lib/openoffice
After that, docsplit works well except when I use the OpenOffice functionality, for example docsplit text my_file.odt. Here is the error message:
Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:61)
at org.artofsolving.jodconverter.office.PooledOfficeManager.start(PooledOfficeManager.java:102)
at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.start(ProcessPoolOfficeManager.java:59)
at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:98)
Caused by: java.util.concurrent.ExecutionException: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
at java.util.concurrent.FutureTask.get(Unknown Source)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:59)
... 3 more
Caused by: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:123)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.access$000(ManagedOfficeProcess.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$1.run(ManagedOfficeProcess.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.artofsolving.jodconverter.office.RetryTimeoutException: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:48)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:113)
... 8 more
Caused by: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.OfficeConnection.connect(OfficeConnection.java:101)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$6.attempt(ManagedOfficeProcess.java:116)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:41)
... 10 more
I have tested the installation on Ubuntu 10.10 and everything works.
On Fedora there is no package/RPM for openoffice.org-java-common, so I unpacked an openoffice.org-java-common.deb and added the missing files to the Fedora system, but I still have the same issue.
Furthermore, the version of OpenOffice is 3.2 on Ubuntu and 3.3 on Fedora. Is the problem related to the OpenOffice version?
Please don't tell me to only use ubuntu.
Thanks (Sorry for my english)
If I run tesseract over a scanned PDF, the text is correctly recognized with all its German umlauts and special characters. When text cleaning is enabled after recognition, the umlauts get garbled.
Gemüse => Gem"use
This is due to the use of Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first. As Iconv is also deprecated, it would make sense to remove the Iconv part. The output produced by TextCleaner (with Iconv disabled) is valid UTF-8 and my umlauts are preserved.
I suggest removing these two lines:
require 'iconv' unless defined?(Iconv)
text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
Thanks and cheers,
Marc
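If some sanitising is still wanted once the Iconv call is removed, one option is to keep the text as UTF-8 and drop only byte sequences that are not valid UTF-8. The sketch below uses Ruby's built-in String#scrub for that. The helper keep_utf8 is a hypothetical name and not part of TextCleaner.

```ruby
# Hypothetical replacement for the transliteration step: the text stays UTF-8,
# so umlauts survive, and only invalid byte sequences are removed.
def keep_utf8(text)
  text.dup.force_encoding(Encoding::UTF_8).scrub('')
end

puts keep_utf8('Gemüse')        # valid UTF-8 passes through unchanged
puts keep_utf8("Gem\xFFse")     # the stray 0xFF byte is dropped
```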
When specifying :pages => Range.new(1,5) I get an "undefined method normalize_range for Docsplit:Module" error. Looking at the source, the method doesn't seem to be defined.
https://github.com/documentcloud/docsplit/blob/master/lib/docsplit.rb#L112-113
options = { :format => :jpg, :output => path_to_output_directory, :pages => Range.new(1, 5) }
Docsplit.extract_images(path_to_file, options)
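Until the missing method is restored, one possible workaround is to expand the range into plain page numbers before passing it in. The helper expand_pages below is hypothetical. Whether the installed Docsplit version accepts an array of integers for :pages is an assumption that should be checked against its source.

```ruby
# Hypothetical helper: flattens integers, ranges, and arrays of either into a
# sorted list of unique page numbers.
def expand_pages(pages)
  Array(pages).flat_map { |p| p.is_a?(Range) ? p.to_a : [Integer(p)] }.uniq.sort
end

puts expand_pages(Range.new(1, 5)).inspect
puts expand_pages([1..3, 7, 2]).inspect
```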
Hi,
I'm wondering if there is an example of converting a scanned document (TIFF) to a PDF while running OCR, so that the PDF becomes searchable. Some of the document's formatting needs to be kept. Is this doable using docsplit? docsplit text removes all formatting from the original document. Any hints on how to achieve this are appreciated.
I'd like to be able to extract PDFs concurrently, but this does not seem to be possible with the docsplit gem.
I tried to convert 2 PPT files to PDF at the same time, and the gem fails to process them.
The code is below; please replace path_to_docsplit.rb, path_to_test_file1.ppt and path_to_test_file2.ppt.
I'm looking forward to your answer.
Thank you,
Quyen
require 'path_to_docsplit.rb'

def extraction(path_to_file)
  Docsplit.extract_pdf(path_to_file)
end

puts('start extraction')
t1 = Thread.new { extraction('path_to_test_file1.ppt') }
t2 = Thread.new { extraction('path_to_test_file2.ppt') }
t1.join
t2.join
puts('end extraction')
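If the office conversion can only run one job at a time, as the "already running" errors reported elsewhere in this list suggest, a workaround inside a single Ruby process is to serialise the calls with a Mutex. This is a minimal sketch under that assumption. CONVERSION_LOCK and serialized_extraction are hypothetical names, and the block stands in for the real Docsplit.extract_pdf call.

```ruby
# Hypothetical guard: only one conversion may run at a time in this process.
CONVERSION_LOCK = Mutex.new

def serialized_extraction(path)
  CONVERSION_LOCK.synchronize do
    yield path
  end
end

threads = %w[path_to_test_file1.ppt path_to_test_file2.ppt].map do |path|
  Thread.new do
    serialized_extraction(path) do |p|
      # In real use this block would call Docsplit.extract_pdf(p).
      puts "converting #{p}"
    end
  end
end
threads.each(&:join)
```

This trades throughput for reliability: the threads still overlap on everything outside the guarded block, but the conversions themselves run one after another.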
When converting PDFs to images with docsplit I found it added a lot of whitespace to some pages. Cutting a long story short, it's fixed by adding '-define pdf:use-cropbox=true' to the graphicsmagick call. I've forked docsplit to patch this in but would like to see it as an option I could pass in to the standard gem. I'm happy to do the work to implement any custom option passing, but before I started I wanted to know:
a) is this something you'd be willing to integrate?
b) if so, how would you want the options passed?
I see it either being a specific param to trigger the pdf cropping behaviour or the ability to pass in an arbitrary string to interpolate into the executed command.
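The second variant, passing arbitrary extra arguments, could be sketched with an argument array instead of string interpolation, which avoids shell-quoting problems. The helper gm_convert_args and its extra_args keyword are hypothetical and do not exist in Docsplit. The command is reduced to its bare minimum for illustration.

```ruby
# Hypothetical command builder: returns an argument array for `gm convert`,
# with caller-supplied extra arguments placed before the input file.
def gm_convert_args(input, output, extra_args: [])
  ['gm', 'convert', *extra_args, input, output]
end

args = gm_convert_args('example.pdf', 'example.png',
                       extra_args: ['-define', 'pdf:use-cropbox=true'])
puts args.join(' ')
```

An array like this can be handed to system(*args) directly, so nothing the caller passes is ever interpreted by a shell.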
I can't use the language option. I get this message:
invalid option: --language
I need an API to convert from PDF to HTML, i.e. an extract_html method.
Is there any way to track errors as notifications in extract_pdf?
It would be nice if there were an option one could pass to determine whether the images created are returned, or the path to the PDF that was created as an intermediary. I'm thinking of something like :and_return => :images or :and_return => :intermediate, with the default case (:and_return == nil) being the current behavior. I believe this wouldn't break backwards compatibility, and would increase the utility of this application from within Ruby.
Can the library convert a full-fledged HTML/CSS web page to a PDF document?
When I try to convert certain PDFs to images with the Docsplit.extract_images method I get the following error:
Error: /undefined in x
Operand stack:
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1862 1 3 %oparray_pop 1861 1 3 %oparray_pop 1845 1 3 %oparray_pop 1739 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:1158/1684(ro)(G)-- --dict:0/20(G)-- --dict:71/200(L)--
Current allocation mode is local
Current file position is 1
GPL Ghostscript 8.70: Unrecoverable error, exit code 1
$ ruby -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]
$ git clone git@github.com:documentcloud/docsplit.git
Cloning into 'docsplit'...
remote: Counting objects: 667, done.
remote: Compressing objects: 100% (314/314), done.
remote: Total 667 (delta 384), reused 617 (delta 341)
Receiving objects: 100% (667/667), 8.67 MiB | 535 KiB/s, done.
Resolving deltas: 100% (384/384), done.
$ cd docsplit
$ rake -T
rake gem:install # Build and install the docsplit gem
rake gem:uninstall # Uninstall the docsplit gem
rake test # Run all tests
$ rake test
NOTE: Gem.available? is deprecated, use Specification::find_by_name. It will be removed on or after 2011-11-01.
Gem.available? called from /Users/dentarg/src/docsplit/Rakefile:7.
rake aborted!
cannot load such file -- test/unit/test_convert_to_pdf.rb
Tasks: TOP => test
(See full trace by running task with --trace)
When running two commands "docsplit images foo.doc" at the same time one will crash when trying to connect to office.
Docsplit java calls look like:
java -Djava.awt.headless=true -Djava.util.logging.config.file=/home/[..bla...]/docsplit-0.5.2/vendor/logging.properties -Doffice.home=/usr/lib/openoffice -cp /home/..bla...]/docsplit-0.5.2/vendor/'*' -jar /home/..bla...]/jodconverter-core-3.0.jar -r /home/[..bla...]/vendor/conf/document-formats.js "/home/greg/Downloads/foobar.docx" "/tmp/docsplit/foobar.pdf"
Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
...
Caused by: java.lang.IllegalStateException: a process with acceptString 'socket,host=127.0.0.1,port=2002' is already running; pid 19490
If we try to launch soffice in headless mode like stated in the docs:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
How can we then tell docsplit to make jodconverter use port 8100 instead of 2002, as it currently does?
Having one headless soffice daemon running will probably solve our OfficeException when trying to run concurrent docsplit workers.
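If pointing jodconverter at another port turns out not to be possible, another option is to let separate worker processes take turns using an exclusive file lock. The sketch below assumes all workers run on the same machine and can open the same lock file. with_office_lock and the lock file name are hypothetical.

```ruby
require 'tmpdir'

# Hypothetical cross-process guard: holds an exclusive lock on a lock file
# while the block runs, so only one worker converts at a time.
def with_office_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    lock.flock(File::LOCK_EX) # blocks until any other holder releases the lock
    yield
  end # closing the file releases the lock
end

lock_path = File.join(Dir.tmpdir, 'docsplit-office.lock')
with_office_lock(lock_path) do
  # In real use: system('docsplit', 'images', 'foo.doc')
  puts "holding lock in pid #{Process.pid}"
end
```

Workers running as different Unix users would need a lock file that all of them are allowed to open for writing.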
Hi!
I'm using Docsplit to extract images from a PDF.
My problem is that when I do this:
Docsplit.extract_images("./public/" + doc, :size => '100x', :format => [:png], :pages=>1)
=> ["./public//uploads/test/document/document/4fb4530d1a88789d180c6eaa/doc.pdf"]
the file is saved OK, but in the root of my project, not in ./public//uploads/test/document/docu ... doc.pdf
Is this normal behavior? Can I set the output path somewhere?
By the way, I'm using JRuby 1.6.7 --1.9
Thanks for this project.
When docsplit is installed in a directory whose path contains spaces, java chokes and dies.
I tried to run docsplit on a workstation running Ubuntu 10.10 (Maverick Meerkat), and when I invoked it via the command-line interface it threw the following error:
Exception in thread "main" java.lang.IllegalStateException: invalid officeHome: it doesn't contain soffice.bin: /usr/lib/openoffice
at org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration.buildOfficeManager(DefaultOfficeManagerConfiguration.java:119)
at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:97)
I eventually tracked the issue down to the fact that this particular workstation had replaced Ubuntu's standard OpenOffice.org with the new LibreOffice fork from the Document Foundation, as distributed in this PPA. Removing LibreOffice and re-installing the stock OpenOffice.org from the Ubuntu repositories fixed the problem.
I'm having an issue using extract_text on a .docx or .pdf file. It looks like the parser removes the newlines when reading in the document. Is there any setting to ensure they are written to the new txt file? I've tried :clean => false with no luck.
Example:
ESTRAGON:
(giving up again). Nothing to be done.
VLADIMIR:
(advancing with short, stiff strides, legs wide apart).
Converts to:
ESTRAGON: (giving up again). Nothing to be done. VLADIMIR: (advancing with short, stiff strides, legs wide apart).
Expected Result:
There should be \n where the line breaks are.
Hello everyone, I was wondering whether anyone is thinking about adding support for passing a File object (or maybe a relative thereof) to the docsplit ruby API instead of just pathnames?
The particular use case I have in mind is extracting text directly from a pdf in Mongodb's GridFS (yields a file-like IO object), but I think it should apply to anyone wanting to read a pdf from a binary stream. Writing the stream contents into a temp file in a real FS so a pathname can be supplied to docsplit in a runtime context feels like an artificial step, and on Heroku it becomes a bit of a headache :)
I don't know whether this is obviously impractical due to the underlying library APIs, but I thought it was worth the suggestion. Thanks for a great library!
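Since the command-line tools behind Docsplit are invoked with file paths, the temp-file step probably cannot be avoided entirely, but it can be hidden behind a small adapter. The sketch below is one way to do that with Ruby's standard library. with_tempfile_path is a hypothetical helper, and a StringIO stands in for the GridFS stream.

```ruby
require 'tempfile'
require 'stringio'

# Hypothetical adapter: copies an IO-like object into a temporary file and
# yields the file's path; the file is deleted when the block returns.
def with_tempfile_path(io, extension = '.pdf')
  Tempfile.create(['docsplit-input', extension]) do |file|
    file.binmode
    IO.copy_stream(io, file)
    file.flush
    yield file.path
  end
end

stream = StringIO.new('%PDF-1.4 placeholder bytes')
with_tempfile_path(stream) do |path|
  # In real use: Docsplit.extract_text(path, :output => some_directory)
  puts File.size(path)
end
```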
Hey,
I have opened this issue: carrierwaveuploader/carrierwave#502
on CarrierWave, but maybe you could help me?
Thanks
Docsplit invokes pdftotext to extract text, escaping spaces in the filename with \ to construct a command line. On Windows, \ does not escape spaces.
In my instrumented test, docsplit attempts to execute the following command:
pdftotext -enc UTF-8 test-docs/Ideology\ and\ Climate\ Change.pdf extracted-text/Ideology\ and\ Climate\ Change.txt 2>&1
This fails with the error message below. The following command works:
pdftotext -enc UTF-8 "test-docs/Ideology and Climate Change.pdf" "extracted-text/Ideology and Climate Change.txt" 2>&1
You will need poppler on Windows to reproduce, which is available here: http://www.compgeom.com/~piyush/scripts/scripts.html
Full error message follows:
C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:99:in `run': pdftotext version 0.16.6 (Docsplit::ExtractionFailed)
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-raw : keep strings in content stream order
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:106:in `extract_full'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:38:in `block in extract'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `each'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `extract'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:51:in `extract_text'
from overview-prototype/docloader/docloader.rb:75:in `processFile'
from overview-prototype/docloader/docloader.rb:150:in `block in <main>'
from overview-prototype/docloader/docloader.rb:50:in `call'
from overview-prototype/docloader/docloader.rb:50:in `block in scanDir'
from overview-prototype/docloader/docloader.rb:42:in `foreach'
from overview-prototype/docloader/docloader.rb:42:in `scanDir'
from overview-prototype/docloader/docloader.rb:150:in
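The escaping problem described in this report can be sidestepped on any platform by not building a shell command line at all. When Ruby's Open3.capture2e receives the program and its arguments as separate strings, no shell parses them, so spaces need neither backslashes nor quotes. The sketch below is a hedged illustration, not Docsplit's code. run_without_shell is a hypothetical helper, and the Ruby interpreter stands in for pdftotext so the example runs without poppler installed.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical runner: executes a program with an argument array, so no shell
# parses the command line and spaces in arguments need no escaping.
def run_without_shell(*argv)
  output, status = Open3.capture2e(*argv)
  raise "command failed: #{argv.inspect}\n#{output}" unless status.success?
  output
end

# Demonstration with the Ruby interpreter standing in for pdftotext:
# the child process simply prints the arguments it received.
puts run_without_shell(RbConfig.ruby, '-e', 'puts ARGV',
                       'test-docs/Ideology and Climate Change.pdf')
```

With poppler installed, the equivalent call would be run_without_shell('pdftotext', '-enc', 'UTF-8', input_path, output_path).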
I updated to the new docsplit (0.7.2) and I now get a "no such file or directory" error.
I checked inside the /tmp folder for filename.pdf and it is there.
Can someone help me understand what is going on?
When I run docsplit on an EC2 instance, it writes to the root "/tmp" folder, and for some reason it does not have permission to read it (I guess), because on localhost it works.
Thanks!
extract_options doesn't extract the parameter, the script never uses it, and in fact pdftk, the underlying utility for page extraction, does not support partial page ranges.
There can be issues on some systems with the permissions of /tmp at file cleanup time -- it would be nice to have an option to create tmp files in a sub-directory on /tmp.
I'm using DocSplit in an app that doesn't require the dependencies - so the dependency warnings are always showing up in my dev environment, cluttering up my console (crying wolf, if you will). It would be nice if there were some config setting to ignore the dependencies. Is there any way to do that currently?
Hi, I installed the Docsplit gem yesterday (together with all the dependencies) and wanted to test it quickly, so I tried one of the examples from your documentation (on the command line):
docsplit images example.pdf
and this was the outputted error:
execvp failed, errno = 2 (No such file or directory)
gm convert: "gs" "-q" "-dBATCH" "-dMaxBitmap=50000000" "-dNOPAUSE" "-sDEVICE=ppmraw" "-dTextAlphaBits=4" "-dGraphicsAlphaBits=4" "-r150x150" "-dFirstPage=1" "-dLastPage=1" "-sOutputFile=/var/folders/um/umOJP4yeEoG4UihNlcD7ME+++TM/-Tmp-/d20110325-6084-j35i1w/gmrpht13" "--" "/var/folders/um/umOJP4yeEoG4UihNlcD7ME+++TM/-Tmp-/d20110325-6084-j35i1w/gm04N0rO" "-c" "quit".
gm convert: Postscript delegate failed (example.pdf).
I'm not sure why it says No such file or directory because I'm absolutely sure the file exists.
Also I'm trying out the method in a ruby script (usually I only use gems in a Ruby on Rails project, so this might be a stupid error)
require 'rubygems'
require 'docsplit'
CUR_DIR = Dir.getwd
DOCS_DIR = "#{CUR_DIR}/docs"
THUMB_DIR = "#{CUR_DIR}/thumbnails"
Dir.mkdir DOCS_DIR unless File.directory? DOCS_DIR
Dir.mkdir THUMB_DIR unless File.directory? THUMB_DIR
Dir.chdir(DOCS_DIR)
Dir["*"].each do |filename|
  # skip directories
  next if File.directory? filename
  puts "processing #{filename}"
  Docsplit.extract_images(filename, :size => '920x', :format => [:png, :jpg])
end
NameError: uninitialized constant Docsplit
Note I'm using docsplit (0.5.0) and ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-darwin10]
Would you happen to know what's causing this problem and what might fix it?
Is there a way in docsplit to do that, or do I have to add this functionality in my own code?
Basically, I am trying to recognize the text in the attached image. When I use tesseract directly on the image, it works:
tesseract p1.jpg p1 -l spa
For example, "Pero acá tenés una y está en tus manos."
However when I try to use docsplit directly, the accents are not saved correctly.
docsplit text p1.pdf --pages all -l spa
And the same line becomes, "Pero ac? tenes una est? en tus manos."
I am having trouble rendering large PDFs into PNGs. The PNG output is all distorted and glitched out.
If I run:
$ docsplit images large_pdf_test.pdf -d 300 -f png
I get this image:
Any help would be appreciated!
I have a feeling it might be running out of memory, since it works on smaller PDFs that are produced the same way.
thanks,
Nick
Can I just say first.. Docsplit is amazing!
I have this in my code...
Docsplit.extract_text(source_path, :output => destination_path)
Is there a way, however, to "get" the text in Ruby directly?
With the above, I end up with a lovely file that contains the text, but if I am to use it, I need to reopen the file to get its contents.
As far as I know, setting:
something = Docsplit.extract_text(...)
..would just give me the source filename in "something".
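One way to get the text back as a Ruby string is to let the extraction write into a scratch directory and read the result immediately. The sketch below assumes the extractor writes a file named after the source with a .txt extension into the output directory. That naming is an assumption to verify. extracted_text is a hypothetical helper, and a stand-in block replaces the real Docsplit call so the example runs without Docsplit installed.

```ruby
require 'tmpdir'

# Hypothetical helper: runs an extractor that is assumed to write
# "<basename>.txt" into a scratch directory, then returns that file's contents.
def extracted_text(source_path)
  Dir.mktmpdir('docsplit-text') do |dir|
    yield source_path, dir
    txt = File.join(dir, File.basename(source_path, '.*') + '.txt')
    File.read(txt, encoding: 'UTF-8')
  end
end

# In real use the block would be:
#   { |path, dir| Docsplit.extract_text(path, :output => dir) }
# The stand-in below only mimics the assumed output file.
text = extracted_text('report.pdf') do |path, dir|
  File.write(File.join(dir, 'report.txt'), "text of #{path}")
end
puts text
```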
So this is a pretty minor issue, but when I run something like this on the command line:
docsplit text ZUJI\ Hong\ Kong\:\ Your\ Online\ Travel\ Guru.pdf
It's going to spit out:
pdftotext version 0.16.7
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-raw : keep strings in content stream order
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
As I said a minor issue, but it appears that any PDF filenames that have spaces (ebooks) will need to be changed before running this command in the terminal.