Welcome to Apache Tika https://tika.apache.org/

Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika is a project of the Apache Software Foundation.

Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.

Getting Started

Pre-built binaries of Apache Tika standalone applications are available from https://tika.apache.org/download.html . Pre-built binaries of all the Tika jars can be fetched from Maven Central or your favourite Maven mirror.

Tika 1.X reached End of Life (EOL) on September 30, 2022.

Tika is based on Java 8 and uses the Maven 3 build system. N.B. Docker is used for tests in tika-integration-tests. As of Tika 2.5.1, if Docker is not installed, those tests are skipped. Docker is required for a successful build on earlier 2.x versions.

To build Tika from source, use the following command in the main directory:

mvn clean install

The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:

java -jar tika-app/target/tika-app-*.jar --help
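In the command above, the shell expands the `*` wildcard. If you launch the runnable jar from Java code instead (for example, from a small test harness), the wildcard has to be resolved by hand. The following is only a minimal sketch under stated assumptions: it assumes the build output lives in `tika-app/target`, and `TikaAppLauncher`, `findAppJar`, and `buildCommand` are hypothetical helper names for illustration, not part of Tika.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): locates the built tika-app jar
// and assembles the same command line the shell example runs.
public class TikaAppLauncher {

    // Returns a file in targetDir whose name matches tika-app-*.jar.
    // If several files match, whichever the directory stream yields first is used.
    static Path findAppJar(Path targetDir) throws IOException {
        try (DirectoryStream<Path> jars =
                 Files.newDirectoryStream(targetDir, "tika-app-*.jar")) {
            for (Path jar : jars) {
                return jar;
            }
        }
        throw new IOException("No tika-app-*.jar found in " + targetDir);
    }

    // Builds: java -jar <jar> <args...>
    static List<String> buildCommand(Path jar, String... args) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jar.toString());
        command.addAll(Arrays.asList(args));
        return command;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes the current directory is the Tika source root after a build.
        Path jar = findAppJar(Paths.get("tika-app", "target"));
        Process process = new ProcessBuilder(buildCommand(jar, "--help"))
            .inheritIO()   // show the jar's output on this console
            .start();
        System.exit(process.waitFor());
    }
}
```

Run it from the main directory after `mvn clean install`; it should behave like the shell command, printing the application's help text.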

To build a specific project (for example, tika-server-standard):

mvn clean install -am -pl :tika-server-standard

If the ossindex-maven-plugin fails the build because a vulnerability has since been discovered in a dependency, skip the check:

mvn clean install -Dossindex.skip

Maven Dependencies

Apache Tika provides a Bill of Materials (BOM) artifact to align Tika module versions and simplify version management. To avoid convergence errors in your own project, import this BOM or Tika's parent pom.xml in your dependency management section.

If you use Apache Maven:

<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>2.x.y</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <!-- version not required since BOM included -->
    </dependency>
  </dependencies>
</project>

For Gradle:

dependencies {
  implementation(platform("org.apache.tika:tika-bom:2.x.y"))

  // version not required since the BOM (a platform, in Gradle terms) is imported
  implementation("org.apache.tika:tika-parsers-standard-package")
}

Migrating to 2.x

The initial 2.x release notes are available in the archives.

See our wiki for the latest.

Contributing via GitHub

See the pull request template.

NOTE: Please open pull requests against the main branch. We locked master in September 2020 and no longer use it.

Thanks to all the people who have contributed

Building from a Specific Tag

Let's assume that you want to build the 2.5.0 tag:

0. Download and install hub.github.com
1. git clone https://github.com/apache/tika.git 
2. cd tika
3. git checkout 2.5.0
4. mvn clean install

If a new vulnerability has been discovered between the date of the tag and the date you are building the tag, you may need to build with:

4. mvn clean install -Dossindex.skip

If a local test is not working in your environment, please notify the project at [email protected]. As an immediate workaround, you can turn off individual tests with e.g.:

4. mvn clean install -Dossindex.skip -Dtest=\!UnpackerResourceTest#testPDFImages

License (see also LICENSE.txt)

Collective work: Copyright 2011 The Apache Software Foundation.

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file.

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

Mailing Lists

Discussion about Tika takes place on the following mailing lists:

Notifications of all code changes are sent to the following mailing list:

The mailing lists are open to anyone and publicly archived.

You can subscribe to the mailing lists by sending a message to [LIST][email protected] (for example user-subscribe@...).
To unsubscribe, send a message to [LIST][email protected].
For more instructions, send a message to [LIST][email protected].

Issue Tracker

If you encounter errors in Tika or want to suggest an improvement or a new feature, please visit the Tika issue tracker. There you can also find the latest information on known issues and recent bug fixes and enhancements.

Build Issues

TODO

  • Need to install jce

  • If you find any other issues while building, please email the [email protected] list.
