Comments (9)
There's no exact counterpart so far. Can you say a bit more about your use case? It could be that a simple snippet would do the trick.
from python-docx.
The main idea is just to extract all the text from a docx, or to "transform" the docx document into a TXT (or XML) document. I know I can do this using antiword or other command-line tools, but I'm looking for something native.
Regards
from python-docx.
Why do you want to do that? For example, does it matter if the sequence of text is the same as is in the document or do you just care you get all the words, such as if you were indexing it for search?
from python-docx.
I want to use the text for search. I need to index the text using
Lucene/Solr, so I need to get the text from the docx (and only the text; the
images don't matter) and "export" it to a plain text file, then use that for
text search and indexing.
from python-docx.
Ok, so something like this would be a start:
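from docx import Document

# Collect the text of every top-level paragraph and every table cell.
document = Document(filename)
text_chunks = []
for paragraph in document.paragraphs:
    text_chunks.append(paragraph.text)
for table in document.tables:
    # Cells are reached through the table's rows.
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                text_chunks.append(paragraph.text)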
It is possible that a table contains a cell that itself contains a table and so on. Not terrifically common as far as I know, but not unheard of. The library doesn't have API calls to locate tables contained in a cell yet, so you'll have to judge whether that's a problem.
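If nested tables do matter for indexing, one possible workaround is to skip the library and read the document XML directly with the standard library. The following is only a minimal sketch, not python-docx API: the function name `extract_paragraph_texts` is made up for illustration, and it assumes the main document part is stored at its usual location, `word/document.xml`. It ignores headers, footers, footnotes, tabs and line breaks, and text inside text boxes may appear twice, which is usually harmless for a search index.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraph_texts(docx_file):
    """Return the text of every non-empty paragraph in the main document part.

    `docx_file` is a path or a binary file-like object (hypothetical helper,
    assumes the main part lives at word/document.xml).
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    chunks = []
    # iter() walks all descendants in document order, so paragraphs inside
    # table cells, including tables nested in cells, are visited too.
    for paragraph in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text:
            chunks.append(text)
    return chunks
```

Joining the returned list with newlines gives a plain text string that can be written to a file and handed to the indexer.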
Let us know how you go :)
from python-docx.
Hello, would it be possible to put the previous version back on PyPI? I was pointing people to this project on my blog. Then I could just suggest installing the old version.
from python-docx.
Hi David, legacy versions of python-docx are actually named 'docx' on PyPI. All those versions are still available and will remain so indefinitely, to support users who have built applications using it. So you can install it with:
$ pip install docx
Folks will need to uninstall python-docx beforehand if they have it installed. Program behavior is unpredictable when both are installed as they both have the root package name 'docx'.
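To check whether both distributions ended up in the same environment, something like the sketch below may help. It relies on the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata


def installed_versions(names):
    """Map each installed distribution name in `names` to its version."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; leave it out of the result.
            pass
    return found


# Both distributions provide the root package 'docx', so having both is a problem.
conflict = installed_versions(["docx", "python-docx"])
if len(conflict) > 1:
    print("Both distributions are installed:", conflict)
```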
from python-docx.
Thanks for getting back so quick - you are right.
from python-docx.
See issue #72 for the feature request for a replacement property, Document.text.
from python-docx.