Comments (5)
This makes sense. How about making "omitting the images or objects that cause errors" the default behavior, and logging a message whenever it happens? Thanks for your suggestion.
from pdf2docx.
How about making "omitting the images or objects that cause errors" the default behavior, and logging a message whenever it happens?
I think that's great too; thanks for taking it into consideration. I'll be using this library a lot, so you'll see me around often. It's the best and easiest to use, and I feel it has a lot of potential for more features.
I think the log could show the following information about each omitted item: the page and the type (table, image...).
It would also help if the corresponding blank space were left where the element was, so that even when elements are omitted, the order and number of pages do not change.
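To make the suggestion concrete, here is a minimal sketch of the skip-and-log behavior. It is not pdf2docx code: `Element`, `Placeholder`, `convert_all`, and the `convert_one` callback are hypothetical names used only for illustration. Each element that fails to convert is logged with its page, type, and position, and a blank placeholder of the same size takes its place so the layout does not shift.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple

logger = logging.getLogger("skip_demo")

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class Element:
    """Hypothetical extracted item; not a pdf2docx class."""
    page: int   # 1-based page number
    kind: str   # e.g. "text", "image" or "shape"
    bbox: BBox


@dataclass
class Placeholder:
    """Blank space reserved where an omitted element was."""
    page: int
    bbox: BBox


def convert_all(elements: List[Element],
                convert_one: Callable[[Element], object]) -> list:
    """Convert each element; on failure, log it and keep a blank placeholder."""
    results = []
    for el in elements:
        try:
            results.append(convert_one(el))
        except Exception as exc:
            # Report page, type and position of the omitted item.
            logger.warning("Omitted %s on page %d at %s: %s",
                           el.kind, el.page, el.bbox, exc)
            results.append(Placeholder(el.page, el.bbox))
    return results
```

Because the placeholder keeps the original bounding box, the number of results always equals the number of input elements, which is what keeps the order and page count unchanged.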
I haven't had time for this project for a long while. A new version has finally been released today, on the first day of the New Year. :) It improves image extraction (e.g. floating images) and paragraph formatting. I hope to make progress on this issue.
pip install --upgrade pdf2docx
I'll be using this library a lot, so you'll see me around often. It's the best and easiest to use, and I feel it has a lot of potential for more features.
This library is rule-based: it maps PDF objects to docx, e.g. text surrounded by horizontal/vertical lines becomes a table in docx. A limited set of rules can never accommodate all cases, so there is definitely a lot of room for new features and enhancements. You're welcome, and thanks for helping it grow so that it can benefit more people.
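As a rough illustration of what such a rule might look like, the sketch below checks whether a text box is bordered by thin line shapes on all four sides. This is an assumption-laden toy, not the actual pdf2docx algorithm: the function name `is_enclosed`, the tolerance `tol`, and the convention that y grows downward are all choices made for this example.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def is_enclosed(text: BBox, lines: List[BBox], tol: float = 2.0) -> bool:
    """Return True if thin line shapes border the text box on all four sides."""
    tx0, ty0, tx1, ty1 = text
    top = bottom = left = right = False
    for x0, y0, x1, y1 in lines:
        horizontal = abs(y1 - y0) <= tol
        vertical = abs(x1 - x0) <= tol
        # A horizontal line counts only if it spans the text's full width.
        if horizontal and x0 <= tx0 + tol and x1 >= tx1 - tol:
            if y1 <= ty0 + tol:
                top = True
            if y0 >= ty1 - tol:
                bottom = True
        # A vertical line counts only if it spans the text's full height.
        if vertical and y0 <= ty0 + tol and y1 >= ty1 - tol:
            if x1 <= tx0 + tol:
                left = True
            if x0 >= tx1 - tol:
                right = True
    return top and bottom and left and right


# Usage: a text box framed by four lines is a table-cell candidate.
cell_text = (10, 10, 50, 20)
frame = [(0, 5, 60, 5), (0, 25, 60, 25), (0, 5, 0, 25), (60, 5, 60, 25)]
print(is_enclosed(cell_text, frame))
```

A single rule like this misses borderless tables, partial borders, and nested cells, which is the kind of case a limited rule set cannot accommodate.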
Page, type (table, image...), and that somehow the respective blank space is left where the element was
Good point. Just one comment: since PDF is a layout format for printing, what we extract from a PDF is text, images, or shapes (like a line or a rectangle), together with their coordinates on the page. So the blank space is of course preserved, but regarding the type, I'm afraid the log can only report images, since no 'table' object exists in PDF.
You're welcome, and thanks for helping it grow so that it can benefit more people.
Thanks. I will be testing different files with different contents to see how the library handles each one, and if anything fails I will report it here (in the issues) with detailed information.
I'm afraid the log can only report images, since no 'table' object exists in PDF.
When I said "table" I meant things like this:
Although I just realized that those count as simple lines; sorry, that was a bad way to refer to them. Either way, the idea is to report the type of element that was omitted. Honestly, I don't know what types there are to mention apart from images, but the idea is clear already, hehe.