
Solr Text Tagger

This project implements a "naive" text tagger based on Apache Lucene / Solr, using Lucene FST (Finite State Transducer) technology under the hood for remarkably low memory usage. It is "naive" because it does simple word-based substring tagging without considering any natural-language context. It operates on the results of whatever text analysis you configure in Lucene, so it is quite flexible; you could, for example, match phonetically for sounds-like tagging. For more information, see the presentation video/slides referenced below.

The tagger can be used for finding entities/concepts in large text, or for doing likewise in queries to enhance query-understanding.

For a list of changes with each version of this tagger, including Solr & Java version compatibility, see CHANGES.md

Note: this tagger is included in Apache Solr 7.4.0!

Solr 7.4.0 now includes the Solr Text Tagger, documented in the Solr Reference Guide. As such, you should likely use the one in Solr rather than the one here. That said, htmlOffsetAdjust is not implemented there. Issues #82 and #81 document some of the differences and contain further links.

Resources / References

Pertaining to Lucene's Finite State Transducers:

Contributors:

  • David Smiley
  • Rupert Westenthaler (notably the PhraseBuilder in the 1.1 branch)

Quick Start

See the QUICK_START.md file for a set of instructions to get you going ASAP.

Build Instructions

The build requires Java (v8 or v9) and Maven.

To compile and run tests, use:

%> mvn test

To compile, test, and build the jar (placed in target/), use:

%> mvn package

Configuration

A Solr schema.xml needs two things:

  • A unique key field (see <uniqueKey>). Setting docValues=true on this field is recommended.
  • A name/lookup field indexed with shingling or, more likely, ConcatenateFilter.

If you want to support typical keyword search on the names, not just tagging, then index the names in an additional field with a typical analysis configuration of your preference.
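A minimal schema sketch tying these pieces together (field and type names are illustrative; the "tag" field type is defined in the sample further below):

<uniqueKey>id</uniqueKey>
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
<!-- a conventionally analyzed field for ordinary keyword search -->
<field name="name" type="text_general" indexed="true" stored="true"/>
<!-- the tagging field, using the "tag" field type -->
<field name="name_tag" type="tag" indexed="true" stored="true"/>
<copyField source="name" dest="name_tag"/>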

For tagging, the name field's index analyzer needs to end in either shingling, for "partial" (i.e. sub-phrase) matching of a name, or, more likely, ConcatenateFilter for complete-name matching. ConcatenateFilter acts similarly to shingling, but it concatenates all tokens into one final token with a space separator; for example, "New York City" would be indexed as the single token "new york city". The query-time analysis should not include shingling or ConcatenateFilter.

Prior to shingling or the ConcatenateFilter, the preceding text analysis should result in consecutive positions (i.e. the position increment of each term must always be 1). As such, synonyms and some configurations of WordDelimiterFilter are not supported. On the other hand, if the input text has a position increment greater than one (e.g. due to a stop word), it is handled properly, as if an unknown word were there. Support for synonyms and other filters producing posInc=0 was largely achieved in the 1.1 version but has yet to be ported to 2.x; see Issue #20 regarding the PhraseBuilder.

To make the tagger work as fast as possible, configure the name field with postingsFormat="FST50". In doing so, all the terms/postings are placed into an efficient FST data structure.

Here is a sample field type config that should work quite well:

<fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="FST50"
    omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />

    <filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

A Solr solrconfig.xml needs a special request handler, configured like this:

<requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler">
  <lst name="defaults">
    <str name="field">name_tag</str>
    <str name="fq">PUT SOME SOLR QUERY HERE; OPTIONAL</str><!-- filter out -->
  </lst>
</requestHandler>
  • field: The field that represents the corpus to match on, as described above.
  • fq: (optional) A query that matches a subset of documents for name matching.

Also, to enable custom postings formats (like FST50 above), ensure that your solrconfig.xml has a codecFactory defined like this:

<codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" />

Usage

For tagging, you HTTP POST data to Solr, similar to how the ExtractingRequestHandler (Tika) is invoked. A request invoked via the curl program could look like this:

curl -XPOST \
  'http://localhost:8983/solr/collection1/tag?overlaps=NO_SUB&tagsLimit=5000&fl=*' \
  -H 'Content-Type:text/plain' -d @/mypath/myfile.txt
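Note that the dictionary documents must have been indexed before tagging will return anything. A minimal sketch, assuming the name_tag field from the configuration section above and Solr's standard JSON update handler:

curl -XPOST 'http://localhost:8983/solr/collection1/update?commitWithin=1000' \
  -H 'Content-Type:application/json' \
  -d '[{"id":"1", "name_tag":"New York City"}]'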

The tagger request-time parameters are:

  • overlaps: Chooses the algorithm that determines which overlapping tags should be retained versus pruned away (see the worked example after this list). Options are:
      • ALL: Emit all tags.
      • NO_SUB: Don't emit a tag that is completely within another tag (i.e. no sub-tags).
      • LONGEST_DOMINANT_RIGHT: Given a cluster of overlapping tags, emit the longest one (by character length). If there is a tie, pick the right-most. Remove any tags overlapping with this tag, then repeat the algorithm to potentially find other tags that can be emitted in the cluster.
  • matchText: A boolean indicating whether to return the matched text in the tag response. This will trigger the tagger to fully buffer the input before tagging.
  • tagsLimit: The maximum number of tags to return in the response. Tagging effectively stops after this point. By default this is 1000.
  • rows: Solr's standard param to say the maximum number of documents to return, but defaulting to 10000 for a tag request.
  • skipAltTokens: A boolean flag used to suppress errors that can occur if, for example, you enable synonym expansion at query time in the analyzer, which you normally shouldn't do. Let this default to false unless you know that such tokens can't be avoided.
  • ignoreStopwords: A boolean flag that causes stopwords (or any condition causing positions to skip, such as words longer than 255 chars) to be ignored as if they weren't there. Otherwise, the behavior is to treat them as breaks in tagging, on the presumption that your indexed text-analysis configuration doesn't have a StopWordFilter. By default the indexed analysis chain is checked for the presence of a StopWordFilter; if one is found, ignoreStopwords defaults to true. You probably shouldn't have a StopWordFilter configured, and probably won't need to set this param either.
  • xmlOffsetAdjust: A boolean indicating that the input is XML and, furthermore, that the offsets of returned tags should be adjusted as necessary to allow the client to insert an opening and closing element at those positions. If that isn't possible, the tag is omitted. You are expected to configure HTMLStripCharFilter in the schema when using this option. This will trigger the tagger to fully buffer the input before tagging.
  • htmlOffsetAdjust: Similar to xmlOffsetAdjust except for HTML content that may have various issues that would never work with an XML parser. There needn't be a top level element, and some tags are known to self-close (e.g. BR). The tagger uses the Jericho HTML Parser for this feature (ASL & LGPL & EPL licensed).
  • nonTaggableTags: (only with htmlOffsetAdjust) Omits tags that would enclose one of these HTML elements. Comma delimited, lower-case. For example 'a' (anchor) would be a likely choice so that links the application inserts don't overlap other links.
  • fl: Solr's standard param for listing the fields to return.
  • Most other standard parameters for working with Solr response formatting: echoParams, wt, indent, etc.
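As a worked example for overlaps, suppose a hypothetical dictionary contains "New York", "York City", and "New York City", and the input text is "New York City". ALL emits all three tags. NO_SUB emits only the "New York City" tag, since the other two lie completely within it. LONGEST_DOMINANT_RIGHT also emits only "New York City", since it is the longest tag in the cluster, and removing everything overlapping it leaves nothing else to emit.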

Output

The output is broken down into two parts, first an array of tags, and then Solr documents referenced by those tags. Each tag has the starting character offset, an ending character (+1) offset, and the Solr unique key field value. The Solr documents part of the response is Solr's standard search results format.
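An abridged, illustrative response (wt=json; the offsets, id, and field values here are hypothetical, and exact keys can vary by version):

{
  "responseHeader": {"status": 0, "QTime": 3},
  "tagsCount": 1,
  "tags": [["startOffset", 6, "endOffset", 19, "ids", ["5128581"]]],
  "response": {"numFound": 1, "start": 0,
    "docs": [{"id": "5128581", "name": "New York City"}]}
}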

Advanced Tips

  • For reducing tagging latency even further, consider embedding Solr with EmbeddedSolrServer. See EmbeddedSolrNoSerializeTest.
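A minimal sketch of the embedded approach (assuming Solr 5+ SolrJ APIs, a Solr home directory at ./solr, and a core named "collection1" configured as above; see the referenced test for how to actually stream tag requests):

import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;

public class EmbeddedTaggerDemo {
  public static void main(String[] args) throws Exception {
    // Runs Solr in-process; no HTTP round-trips for tag requests.
    EmbeddedSolrServer solr = new EmbeddedSolrServer(Paths.get("solr"), "collection1");
    // ... issue requests against the /tag handler here ...
    solr.close();
  }
}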


solrtexttagger's Issues

Jericho 3.4 requires log4j-2.4.1 while Solr still uses 1.2.17

It appears that Solr is still locked at log4j 1.2.17 (log4j version: https://github.com/apache/lucene-solr/blob/master/lucene/ivy-versions.properties#L83 and slf4j-log4j12 version: https://github.com/apache/lucene-solr/blob/master/lucene/ivy-versions.properties#L296) while Jericho 3.4 uses the latest log4j library. When SolrTextTagger hits the Jericho lib, it throws the error listed below.

Jericho's release notes state:

       - Upgraded to the following logger APIs:
         slf4j-api-1.7.12, log4j-2.4.1

Error:

o.a.s.s.SolrDispatchFilter null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
    at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:618)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:477)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
    at net.htmlparser.jericho.LoggerProviderLog4J.getLogger(LoggerProviderLog4J.java:35)
    at net.htmlparser.jericho.LoggerProviderLog4J.getSourceLogger(LoggerProviderLog4J.java:41)
    at net.htmlparser.jericho.Source.newLogger(Source.java:1685)
    at net.htmlparser.jericho.Source.<init>(Source.java:151)
    at net.htmlparser.jericho.StreamedSource.<init>(StreamedSource.java:235)
    at org.opensextant.solrtexttagger.HtmlOffsetCorrector.<init>(HtmlOffsetCorrector.java:46)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.initOffsetCorrector(TaggerRequestHandler.java:251)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:154)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
    ... 22 more
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 34 more

Supplementing filter query with request filter query.

Apologies if this is the wrong forum for such a question, but I didn't find a forum or mailing list for the project. We're using the text tagger along with the gazetteer, with a configuration that sets some defaults for the filter query. The configuration per se was taken from https://github.com/OpenSextant/Xponents/blob/master/solr/gazetteer/conf/solrconfig.xml#L810.

However, we need to be able to specify an additional constraint "on the fly", for example to constrain tagging to a specific country code. The behaviour, as I understand it, is that a filter query configured in solrconfig.xml trumps any filter query supplied by the request.

I couldn't find a way (that wasn't a total hack) to join the two filter queries. What I've done for now is patch TaggerRequestHandler and made setTopInitArgsAsInvariants() protected so I can subclass it and do the "ANDing" there.

I guess my questions are (a) is there is a way to do this already that I am missing, and (b) if not, is something like what I've done acceptable?

Adjust offsets for balancing Html/Xml elements in source text

If the input is XML/HTML, we can strip it via the HTMLStripCharFilter provided by Lucene text analysis. This ensures that the tagging won't try to tag the XML markup itself. Lucene takes care of mapping offsets such that the offsets the tagger returns are into the original text (XML in this case). However, if you were to try to use this information to insert new markup reflecting a tagger match, there is the distinct possibility that doing so would produce an imbalanced DOM structure and thus an error. For example, if the source text was:

Hello David <b>Wayne Smiley</b>.

And if "David Wayne" was in the gazetteer / corpus, then the tagger would give offsets to the obvious offsets above that, if you were to insert, say an anchor element, then it would result in incorrect XML:

Hello <a>David <b>Wayne</a> Smiley</b>.

I'd like to add a feature such that, at least in this case, the tag would be omitted. In other cases, the offsets need to be adjusted around opening/closing elements as appropriate. For example, if this was the input:

<p>David <b>Wayne</b></p>

Then the offsets should be adjusted such that inserting an anchor tag would yield:

<p><a>David <b>Wayne</b></a></p>

Tagger should support non-Integer unique key

The tagger currently requires an integer unique key. Ideally this could be any valid type (especially string!)

The readme refers to OPENSEXTANT-73; I assume that is now moving here?

Port PhraseBuilder from v1.2 branch to master

I'd love to see the improvements made on the 1.2 branch ported to the MemPF branch (v2.0).

MemPF seems to work well in OpenSextant, but it hasn't been as thoroughly evaluated. I suspect that if Stanbol ports to MemPF, and if Rupert does his measurements as he's done before, it will become clearer through its tests, etc. how well MemPF does.

Issue with solrTextTagger2.3 and solr 6.3

Hi David,
I have configured the solr 6.3 to work with sorlTextTagger 2.3. I hope I did everything described in configuration file. I have already indexed cities.csv file.
But when I tried to tag the city name with given example, i got the following error:

curl -X POST 'http://localhost:8983/solr/geonames/tag?fl=id,name,countrycode&wt=json&indent=on' -H 'Content-Type:text/plain' -d 'Hello New York City'

<title>Error 500 Server Error</title>

HTTP ERROR 500

Problem accessing /solr/geonames/tag. Reason:

    Server Error

Caused by:

java.lang.NoSuchMethodError: org.apache.solr.search.SolrIndexSearcher.getLeafReader()Lorg/apache/lucene/index/LeafReader;
	at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:167)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:518)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
	at java.lang.Thread.run(Thread.java:745)

I have created the .jar with the build, and also tried downloading the .jar file from the given link, but no success.

Thanks a lot in advance,

Shrestha

Configurable stopword handling

Discussion: #11 (comment)

In summary, if posInc > 1 then there was an omitted stopword. What should we do?

What we do now is cause an error at index time, and at query time finish any tags in progress (i.e. a tag can't span the gap).

We might want a gap to be effectively ignored -- pretending posInc is the typical 1.

We might want an interesting wildcard-like match in which the tagger can know to accept all possible upcoming terms. At index time, a special wildcard token might be emitted that the tagger knows how to handle.

Summary of options:

  • error
  • tag break (query time only)
  • ignore
  • wildcard

And you might want different behavior at index & query time.

p.s. I have no need for this right now, but I want to record that this should ideally be configurable.

Change History, etc

David,

greetings. Hope all is well with you.

Could you please summarize any functional changes in going to 2.1?
I see Solr 5.2+ is about where the Solr rev sits, along with Java 7+.

A really basic change history plus integration requirements (versions) would help the README, or could go in a second file on changes/integration.

I'd like to start looking at Solr 5.3 in Xponents taggers (and so SolrTT 2.1)

thanks,
Marc

Problem building SolrTextTagger with Lucene/Solr 4.7.0

I've just downloaded SolrTextTagger and added the required solr and lucene jars from Solr 4.7.0. I'm running into a compilation problem:

Buildfile: /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build.xml
compile:
[javac] /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build.xml:75: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 13 source files to /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build
[javac] /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/src/main/java/org/opensextant/solrtexttagger/TaggerRequestHandler.java:299: error: cannot find symbol
[javac] docBits = searcher.getDocSet(filterQuery).getBits();
[javac] ^
[javac] symbol: method getBits()
[javac] location: interface DocSet
[javac] 1 error
BUILD FAILED

I see the getBits() method in org.apache.lucene.util.OpenBitSet OK, so I'm not sure what's going on.
As a possible clue (and I'm definitely a Lucene newbie), I'm wondering whether https://issues.apache.org/jira/browse/LUCENE-5440 (Add LongFixedBitSet and replace usage of OpenBitSet), which went into Lucene/Solr 4.7, is relevant here?

Hope to hear back soon

-Simon

Multilingual support

This issue aims to discuss things related to using the SolrTextTagger to process texts in different languages and tag them against a vocabulary with labels in multiple languages (e.g. freebase.org).

Multilingual Vocabularies

Expected properties of the vocabulary (numbered to allow referring to them later in the text):

  • (1) defines labels in different languages
  • (2) labels without language tag should be used for all languages
  • (3) not all entities define labels in all languages
  • (4) for non common languages only a few entities do define labels

Within the Solr index, labels of different languages will be stored in different fields (as users will want to configure different Analyzers). For some languages, a dynamic field with a generic text analyzer could be used - e.g.

<field name="label-en" type="text-en" ... />
<field name="label-de" type="text-de" ... />
<!-- other label fields for specific languages -->
<!-- finally the field for labels without language and
       a dynamic field for other languages -->
<field name="label" type="text-gen" ... />
<dynamicField name="label-*" type="text-gen" ... />

Multilingual Tagging Process

Assuming that we do know the language of the processed text (parsed or detected) we would like to tag the content by using labels of the detected language as well as default labels (2).

For achieving this I see several solutions:

  1. Building language-specific FST corpora and calling the SolrTextTagger twice: To allow this, the TaggerFstCorpus needs to be adapted to NOT throw a RuntimeException on documents where the storedField is not present, as this will happen because of (3). Also, building the FST is inefficient for (4), as it iterates over all documents in the index and most of them will be skipped because they do not define a label in that language. Another potential drawback is that the TagClusterReducer will only work within a single language; results of the two calls will still need to be merged / reduced.
  2. Building language-specific FST corpora that include default labels (2): While this would allow a single FST corpus to be used for tagging a text against a multilingual vocabulary, it would cause a lot of duplication, especially for vocabularies containing a lot of default labels. TaggerFstCorpus would need to learn some new tricks, as it would need to be built from two fields with potentially different analyzers. The problem of different analyzers would also affect the Tagger, as it uses the same analyzer to process the parsed text. If the Tagger only used the analyzer defined by the field for the language of the parsed text, one would risk mismatches for default labels.
  3. Building a multilingual FST corpus: This would require merging labels in different languages (stored in different fields using different analyzers) into a single FST corpus. This corpus would need to be aware of the languages in which phrases are present, so that it only suggests matches with labels of the language of the text as well as default labels. As with the 2nd option, one would also need to solve the problem of supporting two analyzers in the Tagger.

For now I am aiming for the first option, as it requires the fewest changes to the SolrTextTagger, but I would be eager to get opinions/feedback on the other two options.

best
Rupert

SolrTextTagger directly with Lucene?

Hi David!

In this blog post you mentioned that:

If Solr adds more weight than you want, then you can just depend on Lucene, since most of the functionality doesn't depend on Solr.

Would you have general guidelines on how it could be used directly within Lucene (how it should be called, etc.)?

Write a "getting started" how-to.

Assume the user knows nothing about Solr but can nonetheless be directed to install Solr following Solr's installation instructions. The user might not know anything about text analysis either, but we can provide a sample.

Release version 1.2

I think the current master branch, version 1.2-SNAPSHOT is ready to be released as 1.2. Rupert, let me know when you concur and I'll push a release to Maven central.

Publish SolrTextTagger releases to Maven Central

Publishing SolrTextTagger releases on Maven Central would ease its usage by other components.

Background:

I am in the process of implementing an Apache Stanbol Enhancement Engine that will use the TaggerFstCorpus for in-memory EntityLinking - suggesting entities for mentions in a processed text (STANBOL-1128).

To use a library in an Apache project it is preferred (quasi-required) that it is available on Maven Central. So having SolrTextTagger available on Maven Central would be really appreciated.

BTW: I would be also interested to know if SolrTextTagger is available on some other maven server ATM.

best
Rupert

Enhance README

  • Convert to Markdown
  • Add note on Solr version support
  • Add note on how to use Embedded, and need to create a special query class. Probably just point to the test.
  • When applicable, merge relevant feature docs in #20 (text analysis)

Use Lucene MemoryIndex postings format instead of explicit FSTs

In my presentation on the text tagger at Lucene Revolution, I indicated that an experimental test of a single FST surprisingly had better compression than the pair of FSTs the tagger uses now. Using Lucene's "Memory" postings format puts all the terms into an FST, and it also uses a compact encoding for the docId postings to save memory there. This could be used in place of the TaggerCorpus. There are other advantages too, such as not having a single expensive build moment -- it's effectively amortized during indexing. I'm not sure how it would affect tagging performance; we'll see.

I'll post more on this when I get started; probably tomorrow. This is a large internal change, so it'll go to a new branch, and a 2.0-SNAPSHOT version.

TermPrefixCursor incompatible with 5.3.0

In Solr (Lucene) 5.3.0 the way that deleted docs are detected changed. The issue is https://issues.apache.org/jira/browse/LUCENE-6553 and their comment was "The postings, spans and scorer APIs no longer take an acceptDocs parameter. Live docs are now always checked on top of these APIs.".

So in TermPrefixCursor.java the following call no longer uses the liveDocs. This causes a few tests to fail.

postingsEnum = termsEnum.postings(liveDocs, postingsEnum, PostingsEnum.NONE);

Any advice on how to implement LUCENE-6553 within SolrTextTagger?
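For what it's worth, the usual migration pattern looks something like the following sketch (not the project's actual fix; leafReader, termsEnum, and postingsEnum are assumed to be in scope as in TermPrefixCursor):

// LUCENE-6553: acceptDocs is gone from TermsEnum.postings(), so deleted
// documents must be filtered explicitly against the segment's live-docs bits.
Bits liveDocs = leafReader.getLiveDocs(); // null means the segment has no deletions
postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
int docId;
while ((docId = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  if (liveDocs != null && !liveDocs.get(docId)) {
    continue; // skip deleted documents, as the old acceptDocs parameter did
  }
  // ... use docId as before ...
}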

NullPointerException in TaggerRequestHandler.java:199

When I do:

curl -XPOST \
  'http://localhost:8983/solr/test/tag?overlaps=NO_SUB&tagsLimit=5000&fl=*' \
  -H 'Content-Type:text/plain' -d @example.txt

The core name is test. An unrelated question: the URL in the README.md is <host>:<port>/solr/tag, which, however, returns 404. In my case, <host>:<port>/solr/<core_name>/tag works.

The server returns (I extracted the trace from the XML result):

java.lang.NullPointerException
    at org.opensextant.solrtexttagger.TaggerRequestHandler$1.<init>(TaggerRequestHandler.java:199)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:168)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:640)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:436)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)

I am using:

  • Solr: 5.2.1
  • SolrTextTagger 2.2
  • JRE: 1.8

schema.xml:

<schema name="test" version="1.5">
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>

        <fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="Memory"
                           omitTermFreqAndPositions="true" omitNorms="true">
          <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />

                <filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" />
          </analyzer>
          <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
        </fieldType>

        <field name="_version_" type="long" indexed="true" stored="true"/>
        <field name="surface_name" type="tag" indexed="true" stored="true"/>
        <field name="occurrences" type="int" indexed="false" stored="true"/>
        <field name="log_occurrences" type="double" indexed="false" stored="true"/>
</schema>

Part of solrconfig.xml:

  <requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler">
        <lst name="defaults">
      <str name="field">surface_name</str>
      <str name="fq">*:*</str>
        </lst>
  </requestHandler>

nonTaggableTags option

Sometimes when submitting HTML markup to tag, you don't want tagger tags to enclose certain elements (confusingly, also called "tags"). The "script" and "style" elements are already stripped out by Lucene's HTMLStripCharFilter. But you might not want to tag text in "a" (anchor) link elements, because your application is going to insert links and doesn't want those links to interfere with existing ones (no overlaps).

I'll add a nonTaggableTags option that is a comma-delimited list of HTML element (tag) names that, if found to overlap with a candidate tagger tag, will cause that tagger tag to be omitted. For now, this option will only work when htmlOffsetAdjust is true, but could be easily modified later for xmlOffsetAdjust likewise.

Sentence segmentation

It would be neat to add some sort of sentence segmentation to the query-time text analysis to trigger a break in tagging. For example (a very silly one!), the input document text is:
"I want to buy something new. England is a nice place to visit." Then, assuming "New England" is in the dictionary (and possibly "England", but that doesn't matter), the tagger will currently find "New England", which is undesirable. Of course this is a "naive tagger", as it was put to me when I joined the project; but nonetheless this sort of rule seems to me a good one to have at this layer in an overall system.

This could be implemented with a tokenizer that tokenizes sentences using Java's BreakIterator (a minimal sketch follows). It would set a new attribute indicating the starting and ending offsets of the sentence. Then the token would get split by other standard Lucene components, like WordDelimiterFilter, which breaks on whitespace. Ultimately, the Tagger could look for the custom attribute and check whether the last word's offsets fall outside the current sentence as indicated by the attribute.
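A minimal sketch of the sentence-boundary detection such a tokenizer would need, using the plain JDK BreakIterator (this is not the project's implementation):

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceBoundaryDemo {
  public static void main(String[] args) {
    String text = "I want to buy something new. England is a nice place to visit.";
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
    it.setText(text);
    // Walk the sentence boundaries; a tagger could refuse to extend a tag
    // across any [start,end) sentence window.
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      System.out.println("[" + start + "," + end + ") " + text.substring(start, end).trim());
    }
  }
}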

But maybe sentence segmentation isn't aggressive enough. After all, shouldn't there be a tag break at nearly any punctuation?

Release version 2.0

Version 2.0-SNAPSHOT has been used quite a bit at MITRE and has had some enhanced tests -- it's solid. The memory usage is better, and it's more flexible to build/maintain indices since it's based directly on Lucene instead of building a custom persisted memory structure; it's simpler code too. Tagging performance wasn't specifically measured, but it was measured in aggregate with the rest of OpenSextant and there was no marked difference -- that's success as far as I'm concerned.

What 2.0 lacks as of this writing is v1.2's ability to use richer text analysis, thanks to @westei; see #20. I'm a little conflicted on whether to just release it. I was about to announce that it would happen next week anyway, but given this big feature mismatch and no truly pressing reason to release 2.0 quite yet, I should hold up and document some remaining issues.

When 2.0 does get released, the master branch will be renamed to 1_x, and MemPF will become master.

Enhancement suggestion: tagging multiple text fields concurrently in a single request

Would it be feasible to extend the API so that one could submit several separate text fields to be tagged in a single HTTP request, and do the tagging concurrently (in multiple threads)? I suggest this because a) we have a use case for this, and b) it would be nice to take advantage of multicore/multiprocessor environments where possible.

At the moment I'm achieving concurrency in my application (written in Python), but that involves creating threads, each of which then has to issue its own HTTP request.

Thoughts ?

Possible bug in merging default initArgs for Tagger request handler with request params

I noticed that our consultant who was working on our deployment of the tagger made the following change to TaggerRequestHandler.java:


*** 354,360 ****
return;//short circuit; nothing to do
SolrParams topInvariants = new MapSolrParams(map);
// By putting putting the top level into the 1st arg, it overrides request params in 2nd arg.
! req.setParams(SolrParams.wrapDefaults(topInvariants, req.getParams()));
}

--- 354,362 ----
return;//short circuit; nothing to do
SolrParams topInvariants = new MapSolrParams(map);
// By putting putting the top level into the 1st arg, it overrides request params in 2nd arg.
! // Fixed, this was merging in the wrong direction, Francois Schiettecatte
! // req.setParams(SolrParams.wrapDefaults(topInvariants, req.getParams()));
! req.setParams(SolrParams.wrapDefaults(req.getParams(), topInvariants));

}
As far as I can see, it's a legitimate bug and fix, not corrected in the current master (our source snapshot was from a year ago).

cheers

-Simon

posinc != 1 error

I saw some text files with really long strings, like URLs or MD5 hashes and base64-encoded metadata. They appear to be giving the FST tagger heartburn, and the tagger throws this error:

REF: src/main/java/org/mitre/solr/tagger/TaggerFstCorpus.java

if (posIncAtt.getPositionIncrement() != 1) {
  throw new IllegalArgumentException("term: " + text + " analyzed to a token with posinc != 1");
}

My data is my data. I cannot really scrub the data before tagging. The FST tagger issue here may be a valid one, but we should figure out how to handle it more gracefully. Right now the whole document fails in OpSx PlaceNameMatcher.

Example data:
Run XText "convert.sh" on a simple PDF or other doc. find the cached text file for that run. The bottom of the text file will have a XT:xxxxxxxxxxxxx ... long base64 encoded label.

I hope to reproduce the situation shortly.

Distributed Requests

Hi,

I have a question concerning SolrCloud.
Is the TaggerRequestHandler capable of performing distributed requests over multiple shards?
I know that the standard Solr select handler does, and that it can be adjusted using the shards query parameter.

Thanks, Martin

Maven build fails due to UnsupportedTokenException

FYI -- I just pulled the MemPF branch to build for myself. I was looking for the latest bug fixes in the 2.0 snapshot. No rush. But I did see this build test failure:

Results :

Failed tests: testUnsupportedMultiTokenSynonyms(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): expects an UnsupportedTokenException!

Tests in error:
testWhitespaceTokenizerWithWordDelimiterFilter(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testRemovalOfAlternateTokens(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testWordDelimiter(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testAlternates(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testSynonymsAndDelimiterCombined(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testStopWords(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'

API change in Solr 6.3 [SOLR-9592]

The current version of SolrTextTagger does not work with Solr 6.3 because SolrIndexSearcher#getLeafReader was renamed to SolrIndexSearcher#getSlowAtomicReader (SOLR-9592).

Changing the code would mean that the most current version of SolrTextTagger would no longer work with Solr/Lucene versions < 6.3. So most likely this would require a new release to be used with Solr 6.3+

In addition, the javadoc indicates that one should use IndexSearcher.leafContexts instead. However, this field is protected, so I am not sure how to use it.
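The rename itself is mechanical; a sketch based on the method names in this issue:

// Solr < 6.3:
//   LeafReader reader = searcher.getLeafReader();
// Solr 6.3+ (SOLR-9592): same reader, new name
LeafReader reader = searcher.getSlowAtomicReader();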

Solr 6.0 support

A first cut at compiling against Lucene 6.x showed that the lowest-level APIs have changed, specifically in the tagging attribute implementation: org.apache.lucene.util.AttributeImpl has changed substantially, impacting org.opensextant.solrtexttagger.TaggingAttributeImpl.

This was just a quick look at Solr 6. Not urgent at all.

Add support for PositionIncrementAttribute and PositionLengthAttribute

The SolrTextTagger should consider the PositionIncrementAttribute and PositionLengthAttribute when building the FST model.

The blog post "Lucene's TokenStreams are actually graphs!" provides a good overview of how this is intended to work.

The main goal is to add support for analyzer chains that create tokens with PositionIncrementAttribute == 0, but with the PositionLengthAttribute one can also correctly create FST arcs for more complex situations with alternate tokens (as shown by the "wi fi network" example in the linked blog post).

How would it work:

Let's assume a term with the label "thomas wi fi network" that got analyzed similarly to the "wi fi network" text in the linked blog post.

The goal is to have the following three arcs in the FST

  1. thomas wi fi network
  2. thomas wifi network
  3. thomas hotspot

So what one needs to do is create an arc for every possible path through the directed acyclic graph represented by the TokenStream.

This also works for the other example given in the blog post: ショッピングセンター (shopping center) would result in the following two arcs:

  1. ショッピングセンター
  2. ショッピング センター

Implementation

By using both PositionIncrementAttribute AND PositionLengthAttribute it is possible to generate those arcs based on the tokens in the TokenStream.

In the 1.2 branch this needs to be done in the TaggerFstCorpus#analyze(..) method. This method would need to return an IntsRef[] array with the paths described above.

For the 2.0 branch, the ConcatenateFilter needs to build the strings as described above and emit them with PositionIncrementAttribute == 0 to callers of its incrementToken() method. This should - AFAIK - cause Solr to index them correctly.

My plan is to try implementing this based on the 1.2 branch in the westei/SolrTextTagger fork.

Tagging UpdateRequestProcessor

It would be cool to have a Solr URP (UpdateRequestProcessor) that does text tagging and applies the results as fields. The referenced documents from tagging might include metadata that can be copied to the current document going through the URP. It might very well be demonstrative in nature, as applications are likely to have specific needs here. Nonetheless, it's great to start from something instead of from scratch.

Disclaimer: This is just a wish-list feature at this time. No plans yet.

Decide fate of ant build.xml

The official build is the Maven pom. The ant build.xml is legacy; I forgot to remove it as part of an OpenSextant reshuffle, but at least one user has voiced a strong preference for keeping it.

Possible actions:

  • remove it
  • update it (not likely; I don't want to maintain it), or update it with a disclaimer that it may be out of date
  • generate it automatically with maven (possible?).

Solr 5.3 Refactor Breaks testTagStreaming

In the test testTagStreaming the document response has changed due to SOLR-7662. Per the comments in SOLR-7662, "javabin returns the primitive types of the fields while the text based writers return a IndexableField/StorableFIeld depends on whether you are in branch 5x or trunk".

The change results in the field values from the following call being returned as IndexableField/StorableField instead of the expected primitive value:

 assertEquals("Boston", refDoc.get().getFieldValue("name"));

How do you suggest adjusting for this?

I think this is the last bug for getting 5.3.1 working.

htmlOffsetAdjust option

I should have an option similar to xmlOffsetAdjust, but for parsing HTML. It'll use the Jericho HTML parser (EPL & LGPL dual-licensed) for the tagging. When this option is enabled, there need not be a top-level element to contain the text, and some tags (e.g. BR) are assumed to self-close even when not written that way.

just starting out

Hello,

I have downloaded the SolrTextTagger, and built my jar. I also have a current solr instance with the settings you have suggested.

I wanted to try out a sample dictionary (gazetteer), but I don't see one. Is the format:
"foo","bar"?

This is amazingly cool code, I hope to get something running soon.

Thanks,
Evan

Field BoostFilter SearchComponent

I know that the SolrTextTagger is used by CareerBuilder to find interesting things in a user's query to then do other things (like boost or apply a filter). There is a cool Solr plugin by Ted Dunning at LucidWorks here: https://github.com/lucidworks/query-autofiltering-component that does this... although I have a bunch of concerns with it. Relevant blog: https://lucidworks.com/blog/2015/05/13/query-autofiltering-revisited-can-precise/

I think it would be cool to develop a SearchComponent similar to Ted's but based on the SolrTextTagger. It would build a "side-car index" (possibly held in memory -- configurable) and then use its results to either apply "fq" filter queries or dismax "bq" boost queries (or both). In the end, it should be much less code than Ted's, and it should have its analysis configurable via the Solr schema instead of being hard-coded.

Disclaimer: this is just an idea place-holder; I don't yet have plans to do this

Multi-word synonyms

I'm experimenting with different analysis that exercises your PhraseBuilder, specifically multi-word synonyms; e.g. the input dictionary name "DNS" mapping to the alternate "domain name service". Based on the latest code, simply replace the DNS entry with this:

# Note: when expand=true both are synonyms of each other, but when
#  expand=false then the first term (DNS) is the target replacement for
#  the remainder (Domain Name Service).
DNS, Domain Name Service

So it doesn't work, which I suspect you already knew:

10:30:20.558 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] INFO o.o.solrtexttagger.TaggerFstCorpus - Building TaggerFstCorpus
10:30:20.558 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] DEBUG o.o.solrtexttagger.TaggerFstCorpus - Building word dict FST...
10:30:20.559 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] DEBUG o.o.solrtexttagger.TaggerFstCorpus - Building temporary phrase working set...
10:30:20.560 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: dns, posInc: 1, posLen: 1, offset: [0,3], termId 0
10:30:20.560 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: domain, posInc: 0, posLen: 1, offset: [0,3], termId 1
10:30:20.561 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: name, posInc: 1, posLen: 1, offset: [0,3], termId 2
10:30:20.561 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] ERROR o.o.solrtexttagger.PhraseBuilder - Unable to append term[offset: [0,3], posInc: 1, posLen 1] to any phrase.

I can see how there could be ambiguity about what to do with 'name', looking at the raw token metadata. Might it be possible to simply append the token "name" to the newly created partial phrase "domain", on the grounds that "domain" was the last token emitted? That seems like a practical solution. I don't know if it would break something else, but it appears worth trying.

Next best postingsFormat for fieldType

This is more of a question than an issue, and not terribly urgent, but it would help if you could answer...

I noticed that a default Solr 5 instance (-Xmx 512m) ran out of memory after ingesting about 10 million terms into the FST. I have since increased -Xmx to 6GB so I have some breathing room (~120 million terms by extrapolation), but I was wondering if you could recommend a postingsFormat for the tag fieldType that can spill over to disk (or work entirely from disk in the worst case). The boxes have SSDs, so the disk penalty is not as great as with spinning disks.

I see that the possible values for postingsFormat (according to Dmitry Kan's comment on a page in the Solr ref guide) are Lucene40, Lucene41, Pulsing41, SimpleText, Memory, BloomFilter, Direct, FSTPulsing41, FSTOrdPulsing41, FST41, and FSTOrd41. Going by the name, I thought BloomFilter might be a good choice, but Solr gives a runtime error. I tried removing the postingsFormat attribute and it works, but I was wondering if there is some setting that is preferable after "Memory".

Also, my understanding is that I would have to reindex all the content if I changed the postingsFormat; is that correct?

Thanks in advance for your answers.

Update to Solr 4.4

The update to Solr 4.4 needs some minor code changes because of changed APIs.

In addition, Solr 4.4 forces the StopwordFilter to use posInc > 1 values (see LUCENE-4963).

This might cause existing configurations to no longer work with SolrTextTagger, as:

  • at FST generation time, SolrTextTagger will throw exceptions when encountering such posInc values
  • at tagging time, any in-progress tags are completed on posInc values > 1, meaning that entities with stopwords will no longer be tagged

Can you recognize sentence or paragraph boundaries when tagging a large text field ?

One large text field which we tag is yielding a lot of erroneous multi-word tags, due mostly to a large number of embedded newline characters. A simple (contrived) example of what we see:

I like my vitamin \n
A good time was had by all.

Since 'vitamin A' is in our tag dictionary, it will be tagged in this text if we use the standard tokenizer or the whitespace tokenizer. I've been playing around with adding a MappingCharFilter to the query analyzer, which substitutes an arbitrary non-space character for a newline (I'm using Hebrew aleph) that can't occur in the English text or in our tag dictionary, followed by the standard tokenizer. This inserts a junk character between 'vitamin' and 'A' so no tag will be found. However, this seems to be exquisitely sensitive to the presence or absence of spaces around the '\n', so I don't think it's robust enough.
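For reference, a sketch of the query-analyzer configuration described above (the mapping file name is hypothetical):

<analyzer type="query">
  <!-- Replace each newline with a character (Hebrew aleph here) that cannot
       occur in the English text or in the tag dictionary. The mapping file
       would contain a line such as:  "\n" => "\u05D0"  -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-newlines.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>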

In an ideal world, I'd like the tagger to be able to recognize a new (tagger-specific) Lucene token attribute, ENDHERE, which would signal to the FST that this token is a boundary/terminal and not to look beyond it when a partial tag has been discovered. Obviously one would need some way of attaching this attribute to a token (presumably by extending existing tokenizers and filters). I'm not a Lucene expert, so I have no idea if this is even feasible, which is why I'm reaching out here.

If all else fails, I'll have to segment the text somehow upstream - there will probably be a performance hit (our workflow is all in Python), but there will be fewer constraints compared to working within the Lucene analysis framework.

Comments welcome - maybe someone has solved this problem already

solr 5.2

Hi David

Thanks for the good work with SolrTextTagger.
Just wanted to say there seems to be a problem when updating to Solr 5.2.
Below is a snippet of the problem.

Cheers
A.

java.lang.NoSuchMethodError: org.apache.lucene.index.Terms.iterator(Lorg/apache/lucene/index/TermsEnum;)Lorg/apache/lucene/index/TermsEnum;
    at org.opensextant.solrtexttagger.Tagger.process(Tagger.java:160)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:223)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)

prefix word phrase matching

Hi David,

First of all I want to thank you for the contribution. Solr text tagger has really helped in building the solution we wanted.

However, as an extension, I am looking to use ShingleFilterFactory instead of ConcatenateFilter, the reason being that I also want to enable partial matches as suggestions.

But I want to enable suggestions that match only from the left edge, not from the middle.

For example, if the text is "Quick brown fox jumped", then the expected tokens should be:
"Quick"
"Quick brown"
"Quick brown fox"
"Quick brown fox jumped"

But using ShingleFilter also produces extra tokens such as:
"brown fox"
"fox jumped"
etc
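For reference, a shingle configuration like the one described might look like this sketch (parameter values are illustrative); note that ShingleFilterFactory has no built-in "left edge only" mode, which is exactly the gap being asked about:

<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"
        outputUnigrams="true"/>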

I would be really grateful if you can guide me on how to achieve it.

Best,
Amit

HDFS support for SolrTextTagger when using EmbeddedSolrServer

Hi David,
v2.0 is still pumping away here at MITRE.

This is a request for an example of how to use STT in a read-only mode in a Hadoop Mapper or Spark situation. The use of EmbeddedSolrServer is crucial there, as one would want to minimize the network I/O that an HTTP server incurs. However, EmbeddedSolrServer is impossible to get working in a simple Hadoop Mapper.

I wonder if you have encountered any requests for SolrTextTagger support in Big Data environments using this approach? I did see Sujit Pal's post on his SODA work -- however, that appears to use a bank of RESTful instances of SolrTextTagger.

... The power we would have if we could deploy SolrTextTagger + EmbeddedSolrServer -- I fired off 1000 mappers yesterday, each handling about 10 docs/sec (well, tweets). 10,000 tweets/sec would be good. But... the Solr mechanics in this situation are impenetrable.

This forum vs. Apache Solr: from the gist of "EmbeddedSolrServer" discussions in the Solr camp, I sense it's not well supported or cared for, so I don't feel posting this as an issue there is worthwhile. The driving force would be SolrTextTagger + EmbeddedSolrServer + Big Data scaling. Hence I'm here.

Marc
