
nifi-extracttext-processor's Introduction

nifi-extracttext-processor's People

Contributors

dependabot[bot], timspannairisdata, tspannhw, wdemis

nifi-extracttext-processor's Issues

ZeroByteFileException from Tika?

@tspannhw When using this code, the unit tests work fine and return data after the enqueue/run methods are called. But once I deploy to NiFi, I keep getting this Tika ZeroByteFileException message: "InputStream must have > 0 bytes." This happens after sending in the same PDF file used for the unit tests. I can't seem to find any information about this...

I have confirmed from a post by Brian Bende that the NAR packages up all required libraries, and I have even unzipped the NAR to verify that the Tika libraries are included. NiFi starts up fine, so I really don't think it's a missing-library issue. The processor is accessible in the NiFi UI and can be configured. It just doesn't seem to receive the input properly.

Were there any additional installation tasks for your processor other than dropping the NAR in the /nifi/lib/ dir? I think Tika does allow custom configurations through XML files. Did you have to specify a custom config at all? I can't make any sense of this exception and figure it must be an install issue. Any thoughts?

I'm using NiFi 1.5.0, Tika 1.17, and JDK 8, along with PDFBox 2.0.8.

*Note: I also have a simple PDFBox-based custom processor hooked up in parallel in the NiFi flow. This processor gets the PDF input file, reads it just fine, and parses the output. So I suppose that eliminates any potential issue with NiFi not "delivering" the input file as a Java IO InputStream properly.
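
One way to narrow this down, offered as a diagnostic sketch rather than a confirmed cause: check whether the stream handed to the parser is already empty (or already consumed) before Tika sees it. The helper below is hypothetical; `StreamProbe` and `requireNonEmpty` are names made up for illustration and are not part of the processor, NiFi, or Tika. It uses only `java.io`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical diagnostic helper; not part of the processor, NiFi, or Tika.
final class StreamProbe {
    private StreamProbe() {}

    /**
     * Peeks at the first byte of the stream. Throws if the stream has no
     * bytes left; otherwise returns a stream that still yields every byte.
     */
    static InputStream requireNonEmpty(InputStream in) throws IOException {
        PushbackInputStream probe = new PushbackInputStream(in, 1);
        int first = probe.read();
        if (first == -1) {
            // Nothing left to read: the content was empty or already consumed.
            throw new IOException("stream delivered 0 bytes");
        }
        probe.unread(first); // put the byte back so the parser sees the full content
        return probe;
    }
}
```

If this check fails inside the deployed processor but passes in the unit tests, the stream was probably read or closed somewhere before parsing, which would point away from a packaging problem.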

Payload lost on failure

I have a PDF document that has the "prevent extraction" flag set.

The processor (very reasonably) fails, but a zero-byte flow file is routed to failure.

I would expect the original flow file to be routed to failure.
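
The expected behavior can be sketched in plain Java. This is a simplified model under stated assumptions, not the processor's actual code: `FailureRouting`, `Routed`, and `route` are hypothetical names, relationships are modeled as strings, and the extractor is any function that may throw. In a real NiFi processor the same idea would mean transferring the incoming flow file itself to the failure relationship instead of a newly written, empty one.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical model of the routing decision; none of these names come from
// the processor or from the NiFi API.
final class FailureRouting {
    private FailureRouting() {}

    /** Where the content goes ("success" or "failure") and what is sent there. */
    static final class Routed {
        final String relationship;
        final byte[] content;

        Routed(String relationship, byte[] content) {
            this.relationship = relationship;
            this.content = content;
        }
    }

    /** Runs the extractor; on failure, routes the original bytes unchanged. */
    static Routed route(byte[] original, Function<byte[], String> extractor) {
        try {
            String text = extractor.apply(original);
            return new Routed("success", text.getBytes(StandardCharsets.UTF_8));
        } catch (RuntimeException e) {
            // Extraction failed: hand back the untouched input, not an empty payload.
            return new Routed("failure", original);
        }
    }
}
```

With this shape, a downstream flow attached to failure still has the original document available for retries or manual inspection.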

Option to extract HTML

Hi,
Great job on this processor!
Since Tika supports HTML extraction from multiple document formats, do you think it would be feasible to add an option to output HTML in this processor?
