Comments (5)
Hi
The error seems to happen because Solr does not close the index properly, but it should not affect the results.
TF-IDF and TF can produce the same score if the IDF of a term is 1.0. Typically, that is the case if you have only one document in the corpus.
Can you provide your test files for us to investigate?
from jate.
Hi,
Thanks for your response. I'm using 26 files ...
Here is the zip file: corpus.zip
from jate.
Thank you.
I cannot reproduce your error, unfortunately. See the log file below.
It may be that we are using different schema files. I am using the ACLRDTEC config in the distribution to index your corpus, which gives some 2400 candidate terms, but your log indicates you had more than 4000. If you share your configuration, I can have another look.
The tfidf and ttf results are also different. Although both rank 'plant' first, it clearly gets different scores:
ttf: [{"string":"plant","score":360.0},
tfidf: [{"string":"plant","score":0.02591184465570414}
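For anyone who wants to check their own exports the same way, here is a minimal sketch. It assumes only what the snippets above show, namely that each exported file is a JSON array of objects with "string" and "score" fields; the helper names top_terms and same_scores are made up for illustration and are not part of JATE.

```python
import json


def top_terms(json_text, n=3):
    """Return the n highest-scoring (string, score) pairs from an export."""
    terms = json.loads(json_text)
    ranked = sorted(terms, key=lambda t: t["score"], reverse=True)
    return [(t["string"], t["score"]) for t in ranked[:n]]


def same_scores(ttf_json, tfidf_json, tol=1e-9):
    """True only if the exports share terms and every shared term has equal scores."""
    ttf = {t["string"]: t["score"] for t in json.loads(ttf_json)}
    tfidf = {t["string"]: t["score"] for t in json.loads(tfidf_json)}
    shared = ttf.keys() & tfidf.keys()
    return bool(shared) and all(abs(ttf[s] - tfidf[s]) <= tol for s in shared)


# Samples built only from the two entries quoted in this comment.
ttf_sample = '[{"string":"plant","score":360.0}]'
tfidf_sample = '[{"string":"plant","score":0.02591184465570414}]'

print(top_terms(ttf_sample, 1))
print(same_scores(ttf_sample, tfidf_sample))
```

In practice you would read the contents of ttf.json and tfidf.json from disk and pass them in instead of the inline samples.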
Log output:
indexing
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
2018-06-16 10:05:00 INFO Indexing:26 - DELETING PREVIOUS INDEX
2018-06-16 10:05:01 INFO Indexing:30 - INDEXING BEGINS
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/home/zz/.m2/repository/org/apache/tika/tika-core/1.15/tika-core-1.15.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-16 10:05:01 INFO IndexingHandler:28 - Beginning indexing dataset, total docs=26
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (3).txt
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (10).txt
2018-06-16 10:05:02 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (5).txt
2018-06-16 10:05:03 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (12).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (4).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (3).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (9).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (8).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (7).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (3).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (7).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (4).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (2).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (1).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (4).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (8).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (6).txt
2018-06-16 10:05:09 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (11).txt
2018-06-16 10:05:09 INFO IndexingHandler:87 - Complete indexing dataset. Total processed items = 26
2018-06-16 10:05:09 INFO Indexing:37 - INDEXING COMPLETE
tfidf
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:06:55 INFO TFIDF:27 - Beginning computing TFIDF values,, total terms=2488
2018-06-16 10:06:55 INFO TFIDF:38 - Complete
2018-06-16 10:06:55 INFO AppTFIDF:492 - Exporting terms to [/home/zz/Work/data/tfidf.json]
2018-06-16 10:06:56 INFO AppTFIDF:496 - complete.
ttf
Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:33 BST 2018 loading done
Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:08:38 INFO TTF:25 - Beginning computing TTF values,, total terms=2488
2018-06-16 10:08:38 INFO TTF:32 - Complete
2018-06-16 10:08:38 INFO AppTTF:492 - Exporting terms to [/home/zz/Work/data/ttf.json]
2018-06-16 10:08:38 INFO AppTTF:496 - complete.
from jate.
Hi,
I installed the plugin-mode version, jate-2.0-beta.11, on Solr-7.2.1 and I get the same results as you, so that's okay!
But I still don't understand why the same term "plant" is ranked first by both algorithms, TF and TF-IDF. Normally, if "plant" is ranked first by TF with the maximum score, ttf: [{"string":"plant","score":360.0}] (meaning "plant" appears a lot in the corpus), it should not also be ranked first with the maximum score by TF-IDF (which gives greater weight to the least frequent terms)!
Here is the example .json
TF-->
JSON
0 string : "plant" score : 360
termInfo
offsets
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
variants
otherInfo
1 string : "view" score : 340 termInfo
2 string : "plante" score : 279 termInfo
3 string : "date" score : 235 termInfo
TF-IDF -->
JSON
0 string : "plant" score : 0.026804505797656034
termInfo
offsets
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
  id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
variants
otherInfo
1 string : "plante" score : 0.016739207424584502 termInfo
2 string : "view" score : 0.012504136237184318 termInfo
3 string : "dialog" score : 0.011585130414139116 termInfo
"plant" appeared in 11 files of 26 ...
Thank you for your help
from jate.
OK, I'm glad that it works for you in the end; I am closing this issue now.
To answer the question about TF and TFIDF: it is possible for both to rank the same term as #1. They are not entirely opposite. Note that this TFIDF is not the original TFIDF, which works on individual documents (i.e., you get a different TFIDF for the word 'plant' in doc1, doc2, doc3, etc.). Here it is computed once per term over the whole corpus:
TFIDF = TTF x IDF
So if a term has a high TTF, it can also have a high TFIDF. In the case of 'plant' in your corpus, it may be that this word also has a high IDF, i.e., it is found only in a subset of the corpus. But within this subset it may have a very high frequency, thus giving a high TTF in the corpus too.
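A rough worked example of that product, as a sketch only: it assumes IDF = ln(N / df) and uses the raw TTF, which is probably not the exact normalization JATE applies (the exported scores above are clearly scaled differently). The numbers for 'plant' (TTF 360, found in 11 of 26 files) come from this thread; the other two terms and the function names are hypothetical, made up purely for contrast.

```python
import math


def idf(total_docs, doc_freq):
    """Inverse document frequency, assumed here to be ln(N / df)."""
    return math.log(total_docs / doc_freq)


def corpus_tfidf(ttf, total_docs, doc_freq):
    """Corpus-level score: total term frequency times IDF."""
    return ttf * idf(total_docs, doc_freq)


TOTAL_DOCS = 26

# 'plant': values reported in this thread (TTF 360, present in 11 of 26 files).
plant = corpus_tfidf(ttf=360, total_docs=TOTAL_DOCS, doc_freq=11)

# Hypothetical term found in every file: IDF is ln(1) = 0, so the score is 0
# no matter how frequent the term is.
everywhere = corpus_tfidf(ttf=500, total_docs=TOTAL_DOCS, doc_freq=26)

# Hypothetical rare term: the highest IDF, but a TTF too small to win.
rare = corpus_tfidf(ttf=5, total_docs=TOTAL_DOCS, doc_freq=1)

ranking = sorted(
    [("plant", plant), ("everywhere", everywhere), ("rare", rare)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Under these assumptions 'plant' still comes out on top: IDF only pushes down terms that occur in nearly every file, and a term found in fewer than half of the files keeps most of its frequency advantage.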
Hope that makes sense.
from jate.