go-tika's People

Contributors

andymanning, bitcoin-coder-bob, dmnyu, dvrkps, kujenga, nathj07, ruesier, tbpg, tmaxmax, tomyl, u5surf

go-tika's Issues

If the server is already running

Hi!
If the server is already running, can I use it without spawning a new one?
Sorry, I just figured out that I only need to use the client in that case.

Missing jar file fails silently

When NewServer is called with a jar file that cannot be found (a bad path, or the file simply does not exist), it does not return an error. When Start is then called, it does not return an error either.

The underlying line of code called in Start (line 77) is:
cmd := command("java", "-jar", s.jar, "-p", s.port)
which is then started with:
if err := cmd.Start(); err != nil { return err }
This is fine in itself, but the failure goes unnoticed: cmd.Start() only reports whether the java process could be launched, and it can. The problem is logged to standard error, but the current code does not capture it. The result is a silent failure: the code reports no error, yet the call to start the server never actually succeeds.

Consider the following example code:

server, err = tika.NewServer("tika-server-1.19.jar", "")
if err != nil {
	l.Error().Msgf("Error creating Tika server: %s", err.Error())
	return
} else {
	l.Info().Msgf("SERVER: %v", server)
}
err = server.Start(context.Background())
if err != nil {
	l.Error().Msgf("Error starting Tika server: %s", err.Error())
	return
} else {
	l.Info().Msg("No issue starting server")
}
l.Info().Msg("HERE")

This prints SERVER: &{tika-server-1.19.jar http://localhost:9998 9998 <nil>}, but HERE is never printed because that line is never reached: the Start function never completes.

I propose the following fix to be placed in the first line of the Start function:

if _, err := os.Stat(s.jar); os.IsNotExist(err) {
	return err
}

This will allow for immediate failure if the tika jar file was not found or does not exist.

I would like to make a PR for this.
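
A minimal sketch of the pre-flight check proposed above, using only the standard library. The helper name checkJar is hypothetical and is not part of go-tika. Unlike a bare os.IsNotExist test, it reports every os.Stat error (for example a permission problem) and also rejects a path that points at a directory.

```go
package main

import (
	"fmt"
	"os"
)

// checkJar is a hypothetical helper (not part of go-tika) that verifies the
// Tika server jar can be found before any java process is launched.
func checkJar(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		// Covers "does not exist" as well as permission and other I/O errors.
		return fmt.Errorf("tika jar %q: %w", path, err)
	}
	if info.IsDir() {
		return fmt.Errorf("tika jar %q: is a directory", path)
	}
	return nil
}

func main() {
	if err := checkJar("tika-server-1.19.jar"); err != nil {
		fmt.Println("cannot start server:", err)
		return
	}
	fmt.Println("jar found")
}
```

Because the error is wrapped with %w, a caller can still test it with errors.Is(err, os.ErrNotExist).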

Add function to allow setting java props

Create a function that allows Java system properties to be set on the tika command. For example, AddJavaProps("java.io.tmpdir", "/content/dev/tika") should add "-Djava.io.tmpdir=/content/dev/tika" to the java command when starting a tika server, i.e. java -Djava.io.tmpdir=/content/dev/tika -jar /location/of/tikaserver.jar
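
As a rough illustration of the request, the argument list could be assembled as follows. The helper javaArgs is hypothetical and not part of go-tika; the -jar and -p flags are taken from the Start command quoted in an earlier issue. The -D options are placed before -jar so that the JVM, rather than the Tika server, receives them.

```go
package main

import (
	"fmt"
	"sort"
)

// javaArgs is a hypothetical helper (not part of go-tika). It builds the
// argument list for the java command, emitting one -Dkey=value option per
// system property ahead of the -jar flag.
func javaArgs(props map[string]string, jar, port string) []string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, since map iteration order is random

	args := make([]string, 0, len(keys)+4)
	for _, k := range keys {
		args = append(args, "-D"+k+"="+props[k])
	}
	return append(args, "-jar", jar, "-p", port)
}

func main() {
	props := map[string]string{"java.io.tmpdir": "/content/dev/tika"}
	fmt.Println(javaArgs(props, "/location/of/tikaserver.jar", "9998"))
}
```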

How to convert to HTML using go-tika?

In the Java API, we can convert a file to HTML like this:

public static String extractHtml(File file) throws IOException {
    byte[] bytes = Files.toByteArray(file);
    AutoDetectParser tikaParser = new AutoDetectParser();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler;
    try {
        handler = factory.newTransformerHandler();
    } catch (TransformerConfigurationException ex) {
        throw new IOException(ex);
    }
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    handler.setResult(new StreamResult(out));
    ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
    try {
        tikaParser.parse(new ByteArrayInputStream(bytes), handler1, new Metadata());
    } catch (SAXException | TikaException ex) {
        throw new IOException(ex);
    }
    return new String(out.toByteArray(), "UTF-8");
}

Can go-tika do this? When we use

client := tika.NewClient(nil, s.URL())
body, err := client.Parse(context.Background(), f)

what is the content of body? How should this return value be interpreted?

Client reads every response in memory

Is there a reason why the Tika client always reads the whole response body into memory using ioutil.ReadAll and then copies it again in callString? It seems unnecessary and very inefficient, especially when sending large documents to Tika for parsing.

I've forked the repository and made some changes, and the tests all pass. I'm holding off on opening a PR until I understand why this wasn't done before: it's not obvious to me why things work this way right now, and I want to avoid breaking anything.
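
One way to avoid buffering the whole response is to copy it straight to a destination writer. The sketch below is not the change made in the fork, which the issue does not show. The helper streamParse and the local echo server standing in for Tika are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// streamParse is a hypothetical helper (not part of go-tika). It PUTs input to
// url and copies the response to dst in chunks instead of calling ReadAll.
func streamParse(ctx context.Context, hc *http.Client, url string, input io.Reader, dst io.Writer) (int64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, input)
	if err != nil {
		return 0, err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("response code %d", resp.StatusCode)
	}
	return io.Copy(dst, resp.Body)
}

// demo runs streamParse against a local stand-in server that echoes the
// request body, and returns what was written to the destination.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body) // the stand-in may buffer; the client does not
		w.Write(b)
	}))
	defer srv.Close()

	var out strings.Builder
	_, err := streamParse(context.Background(), srv.Client(), srv.URL+"/tika", strings.NewReader("large document"), &out)
	return out.String(), err
}

func main() {
	fmt.Println(demo())
}
```

In real use dst could be a file or another network connection, so memory use stays bounded by the copy buffer rather than by the document size.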

Return an error if the JAR file doesn't exist

As a developer creating a tika service using a pre-downloaded JAR file
I want to know if the jar file cannot be found
So that I can report an error

e.g.
server, err := tika.NewServer("./lib/tika-server.jar", "1080")

If tika-server.jar is not found in ./lib, an error should be returned.

The error handling in the Start function doesn't return the error when running with a background context

func (s *Server) Start(ctx context.Context) error {
	if _, err := os.Stat(s.jar); os.IsNotExist(err) {
		return err
	}

Please assign this issue to me, I'll fix.

Pass a request or request header to Parse

I'm experimenting with this library and it is very good, so thank you. However, the default behaviour from the Tika server seems to be to return extracted text as HTML. According to the Tika docs making the request with the header "Accept: text/plain" will return plain text. This would be much better in my use case.

I see in the code that the client.call method allows the caller to pass in an http.Header, but the public methods do not. It would be great if Parse and ParseReader could be updated to accept an http.Header and pass it on to callString and call.

I'm happy to open a PR for this if that would help.
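
To make the request concrete, here is a sketch of a parse call that forwards caller-supplied headers, written against net/http only. The helper parseWithHeader is hypothetical and is not the go-tika API. The local test server merely echoes the Accept header it receives, so the example shows that the header reaches the server, not how Tika responds to it.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// parseWithHeader is a hypothetical helper (not part of go-tika). It PUTs
// input to baseURL+"/tika" and copies every caller-supplied header onto the
// request before sending it.
func parseWithHeader(ctx context.Context, hc *http.Client, baseURL string, input io.Reader, header http.Header) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, baseURL+"/tika", input)
	if err != nil {
		return "", err
	}
	for name, values := range header {
		for _, v := range values {
			req.Header.Add(name, v)
		}
	}
	resp, err := hc.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("response code %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo sends "Accept: text/plain" to a stand-in server that replies with the
// Accept header it saw.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("Accept"))
	}))
	defer srv.Close()

	header := http.Header{}
	header.Set("Accept", "text/plain")
	return parseWithHeader(context.Background(), srv.Client(), srv.URL, strings.NewReader("doc"), header)
}

func main() {
	fmt.Println(demo())
}
```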

Not able to call different methods on the client for the same *os.File

Hi,
Thanks for the awesome package.
I am trying to run the following code, but only the first function call works, whether it is .Parse, .Detect or .Meta. I am wondering whether this is a limitation of the package or of the golang GC itself, and how I could solve it.

func ReadData(s *indexService) {
	//Open the file
	f, err := os.Open(fileDir)
	if err != nil {
		fmt.Println("[ERROR] Opening file")
	}
	defer f.Close()
	c := context.Background()
	
	//First function call works fine
	docBody, err := s.TClient.Parse(c, f)
	if err != nil {
		fmt.Println("[ERROR] Reading body")
	}
	//Returns file already closed error
	docContent, err := s.TClient.Detect(c, f)
	if err != nil {
		fmt.Println(err)
		fmt.Println("[ERROR] Reading MIMETYPE")
	}
	//Returns file already closed error
	docMeta, err := s.TClient.Meta(c, f)
	if err != nil {
		fmt.Println("[ERROR] Reading Meta-data")
	}
	
	defer func() {
		fmt.Printf("Tika Processed: %s \n", fileName)
	}()
	
}

indexService looks like this -

type indexService struct {
	TClient *tika.Client
}
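
One plausible cause of the "file already closed" error: net/http closes a request body that implements io.Closer once the request has been sent, and *os.File does. This assumes the client hands the reader to the HTTP request unchanged, which the issue does not confirm. Under that assumption, a workaround is to read the file once and give each call its own bytes.Reader, as sketched below with plain net/http. The helper put and the echo server are stand-ins for illustration, not go-tika code.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
)

// put sends body to url with a PUT request and returns the response text.
func put(hc *http.Client, url string, body io.Reader) (string, error) {
	req, err := http.NewRequest(http.MethodPut, url, body)
	if err != nil {
		return "", err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// demo writes a temporary file, reads it into memory once, and sends it
// twice, giving each request a fresh reader positioned at the start.
func demo() ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		w.Write(b) // echo the upload back
	}))
	defer srv.Close()

	tmp, err := os.CreateTemp("", "doc-*.txt")
	if err != nil {
		return nil, err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString("hello"); err != nil {
		return nil, err
	}
	if err := tmp.Close(); err != nil {
		return nil, err
	}

	data, err := os.ReadFile(tmp.Name())
	if err != nil {
		return nil, err
	}
	var replies []string
	for _, path := range []string{"/tika", "/meta"} {
		reply, err := put(srv.Client(), srv.URL+path, bytes.NewReader(data))
		if err != nil {
			return nil, err
		}
		replies = append(replies, reply)
	}
	return replies, nil
}

func main() {
	fmt.Println(demo())
}
```

For files too large to hold in memory, reopening the file before each call is the simpler alternative.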

Add support for multipart uploads

Tika appears to support multipart file uploads via the endpoint POST /tika/form, according to its documentation.

It would be great to add this support so that we could use the client for uploading large files without having to hold the buffer with all bytes in memory.
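
A sketch of how such an upload could be streamed with the standard library: io.Pipe connects a multipart.Writer to the request body, so the file is encoded while it is being sent. The helper multipartStream is hypothetical, the form field name "upload" is an assumption, and the local test server only stands in for POST /tika/form.

```go
package main

import (
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
	"strings"
)

// multipartStream is a hypothetical helper (not part of go-tika). It returns a
// reader that produces a multipart/form-data body on the fly, plus the
// Content-Type value that carries the boundary.
func multipartStream(field, filename string, src io.Reader) (io.Reader, string) {
	pr, pw := io.Pipe()
	mw := multipart.NewWriter(pw)
	contentType := mw.FormDataContentType()
	go func() {
		part, err := mw.CreateFormFile(field, filename)
		if err == nil {
			_, err = io.Copy(part, src)
		}
		if err == nil {
			err = mw.Close() // writes the closing boundary
		}
		pw.CloseWithError(err) // a nil error ends the stream with io.EOF
	}()
	return pr, contentType
}

// demo uploads a small payload to a stand-in server that returns the
// contents of the uploaded file.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		f, _, err := r.FormFile("upload")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		defer f.Close()
		io.Copy(w, f)
	}))
	defer srv.Close()

	body, contentType := multipartStream("upload", "doc.txt", strings.NewReader("file contents"))
	resp, err := srv.Client().Post(srv.URL+"/tika/form", contentType, body)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	fmt.Println(demo())
}
```

Because the body length is unknown in advance, net/http sends it with chunked transfer encoding.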

Updating go docs

The docs should be updated to reflect the most recent compatible version, as 1.16 (the version stated in the go docs) appears to be no longer supported. The versions referenced at the very bottom of the docs are inconsistent with the version (1.16) listed in the example at the top of the page. I could make the edits in doc.go, which the godocs seem to mirror: https://github.com/google/go-tika/blob/fe9f7a490ac6260631874005f88d091330a109f1/tika/doc.go

I can make a pull request for this change. Worth cross referencing this in this issue: #7

Making location of Tika tmp configurable.

This is a minor change to the codebase that allows a path to be passed as the java.io.tmpdir system property when starting a Tika instance from the go-tika library.

Expose Tika http status code in errors returned by client methods

For users of tika.Client it can be useful to differentiate between intermittent errors (http status code 500) and content-related errors (e.g. 415 and 422). Currently, however, the client methods just return an opaque error string.

I'm experimenting in my fork https://github.com/tomyl/go-tika with exposing the http status code in the error. Basically:

diff --git a/tika/tika.go b/tika/tika.go
index a6ffdab..8a0cd39 100644
--- a/tika/tika.go
+++ b/tika/tika.go
@@ -29,6 +29,16 @@ import (
        "golang.org/x/net/context/ctxhttp"
 )
 
+// ClientError represents an error response from the Tika server.
+type ClientError struct {
+       // StatusCode is the http status code returned by the Tika server.
+       StatusCode int
+}
+
+func (e ClientError) Error() string {
+       return fmt.Sprintf("response code %d", e.StatusCode)
+}
+
 // Client represents a connection to a Tika Server.
 type Client struct {
        // url is the URL of the Tika Server, including the port (if necessary), but
@@ -107,7 +117,7 @@ func (c *Client) call(ctx context.Context, input io.Reader, method, path string,
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
-               return nil, fmt.Errorf("response code %v", resp.StatusCode)
+               return nil, ClientError{resp.StatusCode}
        }
        return ioutil.ReadAll(resp.Body)
 }

The calling code can do something like

func doStuff(input io.Reader, tikaURL string) error {
    client := tika.NewClient(nil, tikaURL)
    s, err := client.Parse(context.Background(), input)
    if isUnsupportedFileFormat(err) {
        return nil
    }
    if err != nil {
        return err
    }
   ...
}

func isUnsupportedFileFormat(err error) bool {
    var tikaErr tika.ClientError

    if errors.As(err, &tikaErr) {
        switch tikaErr.StatusCode {
        // Password protected documents yield StatusUnprocessableEntity
        case http.StatusUnsupportedMediaType, http.StatusUnprocessableEntity:
            return true
        default:
            return false
        }
    }

    return false
}

Thoughts? I'm happy to submit a PR if a change like this would be accepted.
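
For illustration, the same pattern can separate intermittent failures from content-related ones. In this sketch ClientError is redeclared locally so that it compiles on its own. The helper isRetryable and the rule that any 5xx status counts as intermittent are assumptions, not part of the proposal.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ClientError mirrors the type proposed in the diff above; it is declared
// locally so the example is self-contained.
type ClientError struct {
	StatusCode int
}

func (e ClientError) Error() string {
	return fmt.Sprintf("response code %d", e.StatusCode)
}

// isRetryable is a hypothetical helper. It treats 5xx responses as
// intermittent and every other error as permanent.
func isRetryable(err error) bool {
	var ce ClientError
	if errors.As(err, &ce) {
		return ce.StatusCode >= http.StatusInternalServerError
	}
	return false
}

func main() {
	// Wrapping with %w keeps the ClientError reachable through errors.As.
	wrapped := fmt.Errorf("parsing report.pdf: %w", ClientError{StatusCode: http.StatusServiceUnavailable})
	fmt.Println(isRetryable(wrapped))
	fmt.Println(isRetryable(ClientError{StatusCode: http.StatusUnprocessableEntity}))
}
```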
