google / go-tika
Go package for using Apache Tika
License: Apache License 2.0
Hi!
If the server is already running, can I use it without spawning a new one?
Sorry, I just figured out that I only need to use the client then.
When calling NewServer, if the jar file is not found (bad path, or it just does not exist), NewServer does not return an error. When the Start function is then called, it does not return an error either.
The underlying line of code being called in Start (line 77) is:
cmd := command("java", "-jar", s.jar, "-p", s.port)
which is then started with:
if err := cmd.Start(); err != nil { return err }
This is fine in itself, but the failure is silent. cmd.Start() only reports whether the java process could be launched, so it returns nil even though the jar is missing. If we read from standard error we can see the problem being logged, but the current code does not capture it. The code does not fail, but the call to start the server never actually succeeds.
Consider the following example code:
server, err = tika.NewServer("tika-server-1.19.jar", "")
if err != nil {
l.Error().Msgf("Error creating Tika server: %s", err.Error())
return
} else {
l.Info().Msgf("SERVER: %v", server)
}
err = server.Start(context.Background())
if err != nil {
l.Error().Msgf("Error starting Tika server: %s", err.Error())
return
} else {
l.Info().Msg("No issue starting server")
}
l.Info().Msg("HERE")
This prints:
SERVER: &{tika-server-1.19.jar http://localhost:9998 9998 <nil>}
HERE is never printed because that line is never reached: the Start function never completes.
I propose the following fix, placed at the top of the Start function:
if _, err := os.Stat(s.jar); os.IsNotExist(err) {
return err
}
This will allow for immediate failure if the tika jar file was not found or does not exist.
I would like to make a PR for this.
Create a function that allows setting Java system properties on the tika command. For example, AddJavaProps("java.io.tmpdir", "/content/dev/tika") should add "-Djava.io.tmpdir=/content/dev/tika" to the java command when starting a tika server, i.e. java -Djava.io.tmpdir=/content/dev/tika -jar /location/of/tikaserver.jar
In the Java API, we can convert a file to HTML like this:
public static String extractHtml(File file) throws IOException {
byte[] bytes = Files.toByteArray(file);
AutoDetectParser tikaParser = new AutoDetectParser();
ByteArrayOutputStream out = new ByteArrayOutputStream();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler;
try {
handler = factory.newTransformerHandler();
} catch (TransformerConfigurationException ex) {
throw new IOException(ex);
}
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(out));
ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
try {
tikaParser.parse(new ByteArrayInputStream(bytes), handler1, new Metadata());
} catch (SAXException | TikaException ex) {
throw new IOException(ex);
}
return new String(out.toByteArray(), "UTF-8");
}
Can go-tika do this? When we use
client := tika.NewClient(nil, s.URL())
body, err := client.Parse(context.Background(), f)
what is the content of body? How should this return value be interpreted?
Is there a reason why the Tika client always reads the whole response body into memory using ioutil.ReadAll and then copies it again in callString? It seems unnecessary and is very inefficient, especially when sending large documents to Tika for parsing.
I've forked the repository and made some changes, the tests all pass. I'm not opening a PR yet to see why this wasn't done before, as it's not obvious to me why things work this way right now and I want to avoid breaking anything.
As a developer creating a tika service using a pre-downloaded JAR file
I want to know if the jar file cannot be found
So that I can report an error
For example:
server, err := tika.NewServer("./lib/tika-server.jar", "1080")
If tika-server.jar is not found in ./lib, then an error should be returned.
The error handling in the Start function doesn't return the error when running in a background context:
func (s *Server) Start(ctx context.Context) error {
if _, err := os.Stat(s.jar); os.IsNotExist(err) {
return err
}
Please assign this issue to me, I'll fix.
I'm experimenting with this library and it is very good, so thank you. However, the default behaviour of the Tika server seems to be to return extracted text as HTML. According to the Tika docs, making the request with the header "Accept: text/plain" will return plain text. This would be much better for my use case.
I see in the code that the client.call method allows the caller to pass in an http.Header. The issue is that the public methods do not. It would be great if Parse and ParseReader could be updated to accept an http.Header that is then passed on to callString and call.
I'm happy to open a PR for this if that would help.
Hi,
Thanks for the awesome package.
I am trying to run the following code, but only the first function call works, whether it is .Parse, .Detect or .Meta. I am wondering whether this is a limitation of the package or of Go's GC itself, and how I could solve it.
func ReadData(s *indexService) {
//Open the file
f, err := os.Open(fileDir)
if err != nil {
fmt.Println("[ERROR] Opening file")
}
defer f.Close()
c := context.Background()
//First function call works fine
docBody, err := s.TClient.Parse(c, f)
if err != nil {
fmt.Println("[ERROR] Reading body")
}
//Returns file already closed error
docContent, err := s.TClient.Detect(c, f)
if err != nil {
fmt.Println(err)
fmt.Println("[ERROR] Reading MIMETYPE")
}
//Returns file already closed error
docMeta, err := s.TClient.Meta(c, f)
if err != nil {
fmt.Println("[ERROR] Reading Meta-data")
}
defer func() {
fmt.Printf("Tika Processed: %s \n", fileName)
}()
}
indexService looks like this -
type indexService struct {
TClient *tika.Client
}
Tika appears to support multipart file uploads via the endpoint POST /tika/form, as documented here.
It would be great to add this support so that we could use the client to upload large files without having to hold a buffer with all the bytes in memory.
We should have a set of integration tests to make sure tika.Client works with new server versions. This came up in #8.
We could use an environment variable to indicate whether the tests should run, and t.Skip the test if not.
We know the list of supported versions (https://github.com/google/go-tika/blob/master/tika/server.go#L150), so the test can download the server, start it, then run whatever tests it needs.
The docs should be updated to reflect the most recent compatible version as it appears 1.16 (as stated in the go docs) is no longer supported. The very bottom of the docs references versions but it is inconsistent with the version (1.16) listed in the example at the top of the page. I could make the edits here where it seems to mimic the godocs: https://github.com/google/go-tika/blob/fe9f7a490ac6260631874005f88d091330a109f1/tika/doc.go
I can make a pull request for this change. Worth cross referencing this in this issue: #7
Making a minor change to the codebase to allow the passing of a path as part of the java.io.tmpdir system property when starting a Tika instance from the go-tika library.
For users of tika.Client it can be useful to differentiate between intermittent errors (HTTP status code 500) and content-related errors (e.g. 415 and 422). Currently, however, the client methods just return an opaque error string.
I'm experimenting in my fork https://github.com/tomyl/go-tika with exposing the http status code in the error. Basically:
diff --git a/tika/tika.go b/tika/tika.go
index a6ffdab..8a0cd39 100644
--- a/tika/tika.go
+++ b/tika/tika.go
@@ -29,6 +29,16 @@ import (
"golang.org/x/net/context/ctxhttp"
)
+// ClientError represents an error response from the Tika server.
+type ClientError struct {
+ // StatusCode is the http status code returned by the Tika server.
+ StatusCode int
+}
+
+func (e ClientError) Error() string {
+ return fmt.Sprintf("response code %d", e.StatusCode)
+}
+
// Client represents a connection to a Tika Server.
type Client struct {
// url is the URL of the Tika Server, including the port (if necessary), but
@@ -107,7 +117,7 @@ func (c *Client) call(ctx context.Context, input io.Reader, method, path string,
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
- return nil, fmt.Errorf("response code %v", resp.StatusCode)
+ return nil, ClientError{resp.StatusCode}
}
return ioutil.ReadAll(resp.Body)
}
The calling code can do something like
func doStuff(input io.Reader, tikaURL string) error {
client := tika.NewClient(nil, tikaURL)
s, err := client.Parse(context.Background(), input)
if isUnsupportedFileFormat(err) {
return nil
}
if err != nil {
return err
}
...
}
func isUnsupportedFileFormat(err error) bool {
var tikaErr tika.ClientError
if errors.As(err, &tikaErr) {
switch tikaErr.StatusCode {
// Password protected documents yield StatusUnprocessableEntity
case http.StatusUnsupportedMediaType, http.StatusUnprocessableEntity:
return true
default:
return false
}
}
return false
}
Thoughts? I'm happy to submit a PR if a change like this would be accepted.
https://github.com/google/go-tika/blob/master/tika/server.go#L143
Need to be careful with server API changes. It would be OK to pass incompatible changes through this API.