Comments (7)
Could be language specific, genre specific, etc.
I don't think the tagger can deal with this unless you are willing to take the set of sentence spans as an input. Otherwise the tagger, which is only one component of a larger pipeline, would be dictating the segmentation. If you provide the option, you should expose it fully:
- allow caller to send in segment info
- allow caller to turn on SolrTextTagger segmenter, with option to report detected segments
Otherwise it would seem half-baked to make such decisions internally in the tagger.
-m
from solrtexttagger.
I think that if some other component of a larger pipeline wanted to do segmentation, perhaps because it's more sophisticated, then the feature I propose here simply wouldn't be used. The BreakIterator based segmenter is better than nothing for those that don't have the time/expertise to develop the language specific, genre(?) specific segmenter.
What are your thoughts on breaking on all punctuation that's next to whitespace? It would have to be configurable of course, for example to avoid breaking on a hyphen.
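To make the proposal concrete, here is a minimal Python sketch of "break on punctuation that touches whitespace", with a configurable set of characters that never break. It is only an illustration of the rule, not SolrTextTagger code; the function name `split_on_punctuation` and the `no_break` parameter are made up for this example.

```python
import re

def split_on_punctuation(text, no_break=frozenset("-")):
    """Return (start, end) spans, breaking at punctuation adjacent to whitespace.

    Hypothetical helper: `no_break` lists punctuation characters that must
    never cause a break (the hyphen by default).
    """
    spans = []
    start = 0
    # [^\w\s] matches any character that is neither a word char nor whitespace.
    for m in re.finditer(r"[^\w\s]", text):
        if m.group() in no_break:
            continue
        i = m.start()
        before_ws = i == 0 or text[i - 1].isspace()
        after_ws = i + 1 == len(text) or text[i + 1].isspace()
        if before_ws or after_ws:
            if i > start:
                spans.append((start, i))
            start = i + 1  # the punctuation itself belongs to no segment
    if start < len(text):
        spans.append((start, len(text)))
    return spans

spans = split_on_punctuation("All new. Norfolk gets a facelift")
```

With this rule "new" and "Norfolk" land in different spans, so a tagger restricted to one span at a time could not match across the full stop. Punctuation inside a token, such as the dot in "3.14", does not break.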
from solrtexttagger.
Did you check that this actually happens? I just tried this sentence "All new. Norfolk gets a facelift" (because there is a New Norfolk (id:2155415) in the cities list) and it returns the following result (reformatted to return ("id", "name", "startOffset", "endOffset", "matchedSubstring")).
4776222 Norfolk 9 16 Norfolk
4945453 Norfolk 9 16 Norfolk
5073965 Norfolk 9 16 Norfolk
5128862 Norfolk 9 16 Norfolk
Assuming it does, we do something in user-land that might be of interest...
Sometimes we need to annotate at the sentence level, and one sentence per call works out to be too expensive (network overhead). So we create synthetic docs: assume a maximum sentence size of, say, 500 chars, then join a batch of sentences (say 50 or so), each space-padded to 500 chars. Since the starting offset of each sentence is known, we can assign each annotation to its sentence from the reported start and end offsets, and also recover the matching text from that sentence. We do this for performance, but one could use the same technique to do sentence segmentation outside SolrTextTagger as well.
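A rough Python sketch of the padding trick, under stated assumptions: the slot width of 500 comes from the description above, the helper names `build_synthetic_doc` and `demultiplex` are invented for illustration, and tags are assumed to arrive as plain `(start, end)` character offsets. The call to the tagger itself is left out.

```python
SLOT = 500  # assumed maximum sentence length, per the description above

def build_synthetic_doc(sentences, slot=SLOT):
    """Concatenate sentences, each right-padded with spaces to `slot` chars."""
    for s in sentences:
        # Require at least one padding space so adjacent sentences never touch.
        if len(s) >= slot:
            raise ValueError("sentence does not fit in one slot")
    return "".join(s.ljust(slot) for s in sentences)

def demultiplex(tags, sentences, slot=SLOT):
    """Map (start, end) offsets in the synthetic doc back to sentences.

    Returns (sentence_index, local_start, local_end, matched_text) tuples.
    Tags that run into the padding or into the next slot are dropped.
    """
    results = []
    for start, end in tags:
        idx = start // slot
        local_start = start - idx * slot
        local_end = end - idx * slot
        if local_end > len(sentences[idx]):
            continue  # the match crossed a sentence boundary
        results.append(
            (idx, local_start, local_end, sentences[idx][local_start:local_end])
        )
    return results

sentences = ["All new.", "Norfolk gets a facelift"]
doc = build_synthetic_doc(sentences)
# Suppose the tagger reported a match at offsets 500..507 in `doc`.
hits = demultiplex([(500, 507)], sentences)
```

Dropping tags whose end offset falls outside the sentence is what turns the padding into a segmentation step: padding alone does not necessarily stop the tagger from matching across two slots, so the filtering happens on the caller's side.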
from solrtexttagger.
I'm sure "this can happen"; I could add a trivial test.
Nice trick on the sentence alignment. What could be useful is an additional TokenFilter that recognizes a large jump in startOffset, and if so increments the positionIncrement attribute, thus thwarting a match across the large span. If the Tagger sees a position increment > 1 (and if ignoreStopwords is false), it breaks any tags in progress (line ~128 of Tagger.java). ignoreStopwords has a dynamic default dependent on the presence of a StopWordFilter on the index analysis chain on the field.
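The logic of such a filter can be sketched in plain Python. This is a simulation of the idea only, not Lucene's TokenFilter API: the `Token` class, the `max_gap` threshold, and the function name are all hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Token:
    # Simplified stand-in for a Lucene token's term, offset and
    # position-increment attributes.
    term: str
    start_offset: int
    end_offset: int
    position_increment: int = 1

def bump_on_offset_jump(tokens, max_gap=50, extra=1):
    """Yield tokens, raising position_increment after a large offset jump.

    `max_gap` is an assumed threshold: a gap between the previous token's
    end offset and this token's start offset larger than this counts as a
    segment boundary.
    """
    prev_end = None
    for tok in tokens:
        if prev_end is not None and tok.start_offset - prev_end > max_gap:
            tok = replace(tok, position_increment=tok.position_increment + extra)
        prev_end = tok.end_offset
        yield tok

tokens = [
    Token("all", 0, 3),
    Token("new", 4, 7),
    Token("norfolk", 500, 507),  # first token of the next padded sentence
]
filtered = list(bump_on_offset_jump(tokens))
```

Here "norfolk" comes out with a position increment of 2, which is the condition described above for the Tagger to break tags in progress when ignoreStopwords is false.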
from solrtexttagger.
In trying to reduce network overhead for tagging multiple separate entities (separate text fields in our case) in a single call, we did something similar (see #35): we insert a junk token that would never be matched between the concatenated entities, then use the returned offsets to demultiplex the tagger matches.
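For illustration, a small Python sketch of the separator approach. The junk token value and the helper names `join_fields` and `locate` are made up here; the real separator would be whatever string is guaranteed never to appear in the tag dictionary.

```python
from bisect import bisect_right

# Hypothetical junk token; pick something that can never match a tag.
SEPARATOR = " zzqxjunkzz "

def join_fields(fields, sep=SEPARATOR):
    """Join fields with the junk token; also return each field's start offset."""
    starts = []
    pos = 0
    for field in fields:
        starts.append(pos)
        pos += len(field) + len(sep)
    return sep.join(fields), starts

def locate(start, end, fields, starts):
    """Map offsets in the joined text to (field_index, local_start, local_end).

    Returns None when the span touches the separator or crosses into
    another field.
    """
    idx = bisect_right(starts, start) - 1
    local_start = start - starts[idx]
    local_end = end - starts[idx]
    if local_end > len(fields[idx]):
        return None
    return idx, local_start, local_end

fields = ["New York office", "Norfolk branch"]
joined, starts = join_fields(fields)
# Suppose the tagger matched "Norfolk" at the start of the second field.
hit = locate(starts[1], starts[1] + 7, fields, starts)
```

Compared with fixed-width padding, this keeps the request small when fields vary a lot in length, at the cost of a binary search over the start offsets.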
from solrtexttagger.
Great suggestion @simonatdrg; LOL it's much simpler than my idea of the TokenFilter :-)
from solrtexttagger.
Yes, my field type is a little different from the one suggested in QUICK_START.md. That is why I wasn't sure if you saw the same thing I was seeing. As you pointed out, I guess the positionIncrementGap setting is working for me here.
In addition to the positionIncrementGap I also have omitTermFreqAndPositions set. I copied these off my previous configuration (versions Solr 5.0.0 and SolrTextTagger 2.3-SNAPSHOT). Not sure if these used to be the recommended settings earlier, although I do remember doing some tuning based on my use case at the time.
Also my analysis chain is a little different since I explicitly handle mixed case and lower case (in my application). Also my name_tag analog is multiValued since I explicitly populate it from code (i.e. not a copy-field directive) using all my names (primary + all alternates).
"add-field-type":{
  "name":"tag",
  "class":"solr.TextField",
  "positionIncrementGap":"100",
  "postingsFormat":"Memory",
  "omitTermFreqAndPositions":true,
  "omitNorms":true,
  "indexAnalyzer":{
    "tokenizer":{
      "class":"solr.StandardTokenizerFactory" },
    "filters":[
      {"class":"solr.ASCIIFoldingFilterFactory"},
      {"class":"solr.EnglishPossessiveFilterFactory"},
      {"class":"org.opensextant.solrtexttagger.ConcatenateFilterFactory"}
    ]},
  "queryAnalyzer":{
    "tokenizer":{
      "class":"solr.StandardTokenizerFactory" },
    "filters":[
      {"class":"solr.ASCIIFoldingFilterFactory"},
      {"class":"solr.EnglishPossessiveFilterFactory"}
    ]}
},
from solrtexttagger.