
Comments (9)

scanny avatar scanny commented on June 26, 2024

There's no exact counterpart so far. Can you say a bit more about your use case? It could be that a simple snippet would do the trick.

from python-docx.

matiasfha avatar matiasfha commented on June 26, 2024

The main idea is just to extract all the text from a docx, or to "transform" the docx document into a TXT (or XML) document. I know I can do this using antiword or other command-line tools, but I'm looking for something native.

Regards


scanny avatar scanny commented on June 26, 2024

Why do you want to do that? For example, does it matter whether the text comes out in the same sequence as in the document, or do you just care that you get all the words, as you would if you were indexing it for search?


matiasfha avatar matiasfha commented on June 26, 2024

I want to use the text for search. I need to index the text using
Lucene/Solr, so I need to get the text from the docx (only the text; the
images don't matter) and export it to a plain-text file to use for
text search and indexing.


scanny avatar scanny commented on June 26, 2024

Ok, so something like this would be a start:

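from docx import Document

# filename is the path to the .docx file
document = Document(filename)
text_chunks = []
# Body paragraphs first
for paragraph in document.paragraphs:
    text_chunks.append(paragraph.text)
# Then text in top-level tables; cells are reached through rows
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                text_chunks.append(paragraph.text)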

It is possible that a table contains a cell that itself contains a table and so on. Not terrifically common as far as I know, but not unheard of. The library doesn't have API calls to locate tables contained in a cell yet, so you'll have to judge whether that's a problem.
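For readers on a later python-docx release, where table cells expose a `tables` property, nested tables can be reached by recursion. The following is only a sketch under that assumption: `iter_text` and `docx_to_text` are hypothetical helper names, not part of the library, and the helper relies only on objects that have `.paragraphs` and `.tables` attributes.

```python
def iter_text(container):
    """Yield paragraph text from a document or cell, descending into tables.

    `container` is anything exposing `.paragraphs` and `.tables`. This
    assumes a python-docx release in which table cells have `.tables`.
    Body paragraphs are yielded before table text, so document order is
    not preserved; that is usually acceptable for search indexing.
    """
    for paragraph in container.paragraphs:
        yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell is itself a container, so nested tables are
                # reached by recursing into it.
                yield from iter_text(cell)


def docx_to_text(path):
    """Return the text of the .docx at `path` as one newline-joined string."""
    # Imported lazily so iter_text can be used without python-docx installed.
    from docx import Document
    return "\n".join(iter_text(Document(path)))
```

The result of `docx_to_text(filename)` can then be written to a plain-text file for indexing. Note that a merged cell may be reported once per grid position it spans, so its text can appear more than once.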

Let us know how you go :)


bufke avatar bufke commented on June 26, 2024

Hello, would it be possible to put the previous version back on PyPI? I was pointing people to this project on my blog. Then I could just suggest installing the old version.

http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/


scanny avatar scanny commented on June 26, 2024

Hi David, legacy versions of python-docx are actually named 'docx' on PyPI. All those versions are still available and will remain so indefinitely, to support users who have built applications using it. So you can install it with:

$ pip install docx

Folks will need to uninstall python-docx beforehand if they have it installed. Program behavior is unpredictable when both are installed as they both have the root package name 'docx'.


bufke avatar bufke commented on June 26, 2024

Thanks for getting back so quickly. You are right.


scanny avatar scanny commented on June 26, 2024

See issue #72 for the feature request for a replacement property, Document.text.

