Comments (7)
I was thinking that we should make the language identification pluggable in Tika. Its current resources are both slow and inaccurate, and there are better alternatives. We could indeed also add language detection directly as a Behemoth module.
On Mar 20, 2012, at 5:26 AM, Julien Nioche wrote:
> I was thinking that we should make the language identification pluggable in Tika. Its current resources are both slow and inaccurate, and there are better alternatives. We could indeed also add language detection directly as a Behemoth module.
We made Solr's pluggable as well, so I think this makes sense.
Reply to this email directly or view it on GitHub:
https://github.com/jnioche/behemoth/issues/34#issuecomment-4591403
Grant Ingersoll
http://www.lucidimagination.com
http://code.google.com/p/language-detection/ is a pretty good LangId tool that we also use in Solr.
I have started a discussion in Tika land about making the language detection pluggable but it might take a bit of time. Having a wrapper for this library should not be too difficult and would provide the functionality until we get it from Tika for free.
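To make the wrapper idea concrete, here is a minimal sketch of what a pluggable language identifier could look like. Every name in it (`LanguageIdentifier`, `StopwordIdentifier`, `LanguageTagger`, the `lang` metadata key) is hypothetical and is not taken from Tika, Behemoth, or the language-detection library. The stopword counter is only a toy stand-in so the example runs on its own; a real wrapper would delegate to an actual detector behind the same interface.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plug-in point: any language identifier can sit behind this. */
interface LanguageIdentifier {
    /** Returns a language code such as "en", or "unknown" if nothing matches. */
    String identify(String text);
}

/** Toy stand-in for a real detector; a real wrapper would call a library here. */
class StopwordIdentifier implements LanguageIdentifier {
    private final Map<String, String[]> stopwords = new LinkedHashMap<>();

    StopwordIdentifier() {
        stopwords.put("en", new String[] {"the", "and", "is"});
        stopwords.put("fr", new String[] {"le", "et", "est"});
    }

    @Override
    public String identify(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        // Count stopword matches per language and keep the best-scoring one.
        for (Map.Entry<String, String[]> entry : stopwords.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                for (String stopword : entry.getValue()) {
                    if (token.equals(stopword)) {
                        hits++;
                    }
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}

/** The caller depends only on the interface, so implementations can be swapped. */
class LanguageTagger {
    private final LanguageIdentifier identifier;

    LanguageTagger(LanguageIdentifier identifier) {
        this.identifier = identifier;
    }

    /** Returns document metadata with the detected language under the "lang" key. */
    Map<String, String> tag(String text) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("lang", identifier.identify(text));
        return metadata;
    }
}
```

With this shape, swapping in a better detector later only means passing a different `LanguageIdentifier` to `LanguageTagger`; the calling code stays the same.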
Cool. It's been a while since I've been in Tika land, but it would be great to have it. Naturally, we could very easily add it here, too.
Implemented in module language-id
Implemented in module language-id