Comments (5)

ziqizhang avatar ziqizhang commented on July 25, 2024

Hi

The error seems to occur because Solr does not close the index properly, but it should not affect the results.

TF-IDF and TF produce the same score when the IDF of a term is 1.0. That is typically the case if you have only one document in the corpus.
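
As a minimal sketch of this point, assume a smoothed IDF of the form 1 + ln(N / df). That formula is an assumption made for illustration and may not be the one JATE uses; the function names are hypothetical too. With a single document, N = df = 1, so the IDF is 1.0 and TF-IDF equals TF:

```python
import math


def smoothed_idf(num_docs: int, doc_freq: int) -> float:
    """Assumed IDF variant: 1 + ln(N / df). Not necessarily JATE's formula."""
    return 1.0 + math.log(num_docs / doc_freq)


def tfidf(ttf: float, num_docs: int, doc_freq: int) -> float:
    """Corpus-level score: total term frequency multiplied by the assumed IDF."""
    return ttf * smoothed_idf(num_docs, doc_freq)


# One document in the corpus: IDF is exactly 1.0, so the score equals the TTF.
single = tfidf(ttf=360, num_docs=1, doc_freq=1)

# 26 documents, term found in 11 of them: IDF is above 1.0, so the scores diverge.
multi = tfidf(ttf=360, num_docs=26, doc_freq=11)

print(single, multi)
```

Under this assumption, identical TF and TF-IDF scores across a multi-document corpus would point to a problem with the index rather than with the algorithm.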

Can you provide your test files for us to investigate?

from jate.

sicheva avatar sicheva commented on July 25, 2024

Hi,
Thanks for your response. I am using 26 files.
Here is the zip file: corpus.zip

ziqizhang avatar ziqizhang commented on July 25, 2024

Thank you.

I cannot reproduce your error, unfortunately. See the log file below.

It may be that we are using different schema files. I am using the ACLRDTEC config in the distribution to index your corpus, which gives some 2400 candidate terms, but your log indicates you had more than 4000. If you share your configuration, I can have another look.

The TF-IDF and TTF results are also different. Although both rank 'plant' first, it clearly gets different scores:

ttf: [{"string":"plant","score":360.0},
tfidf: [{"string":"plant","score":0.02591184465570414}

Log output:
indexing

Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
2018-06-16 10:05:00 INFO Indexing:26 - DELETING PREVIOUS INDEX
2018-06-16 10:05:01 INFO Indexing:30 - INDEXING BEGINS
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/home/zz/.m2/repository/org/apache/tika/tika-core/1.15/tika-core-1.15.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-16 10:05:01 INFO IndexingHandler:28 - Beginning indexing dataset, total docs=26
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (3).txt
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (10).txt
2018-06-16 10:05:02 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (5).txt
2018-06-16 10:05:03 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (12).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (4).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (3).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (9).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (8).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (7).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (3).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (7).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (4).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (2).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (1).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (4).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (8).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (6).txt
2018-06-16 10:05:09 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (11).txt
2018-06-16 10:05:09 INFO IndexingHandler:87 - Complete indexing dataset. Total processed items = 26
2018-06-16 10:05:09 INFO Indexing:37 - INDEXING COMPLETE

tfidf

Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:06:55 INFO TFIDF:27 - Beginning computing TFIDF values,, total terms=2488
2018-06-16 10:06:55 INFO TFIDF:38 - Complete
2018-06-16 10:06:55 INFO AppTFIDF:492 - Exporting terms to [/home/zz/Work/data/tfidf.json]
2018-06-16 10:06:56 INFO AppTFIDF:496 - complete.

ttf

Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:33 BST 2018 loading done
Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:08:38 INFO TTF:25 - Beginning computing TTF values,, total terms=2488
2018-06-16 10:08:38 INFO TTF:32 - Complete
2018-06-16 10:08:38 INFO AppTTF:492 - Exporting terms to [/home/zz/Work/data/ttf.json]
2018-06-16 10:08:38 INFO AppTTF:496 - complete.

sicheva avatar sicheva commented on July 25, 2024

Hi,
I installed the plugin-mode version, jate-2.0-beta.11, on Solr 7.2.1 and I get the same results as you, so that's okay!
But I still don't understand why the same term "plant" is ranked first by both algorithms, TF and TF-IDF. Normally, if the term "plant" is ranked first by TF with the maximal score, ttf: [{"string":"plant","score":360.0}] (meaning "plant" appears a lot in the corpus), it should not also be ranked first with the maximal score by TF-IDF, which gives greater weight to less frequent terms!

Here is the example .json:
TF -->

0  string: "plant"  score: 360
   termInfo
     offsets
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
     variants
     otherInfo
1  string: "view"  score: 340
   termInfo
2  string: "plante"  score: 279
   termInfo
3  string: "date"  score: 235
   termInfo

TF-IDF -->

0  string: "plant"  score: 0.026804505797656034
   termInfo
     offsets
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
       id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
     variants
     otherInfo
1  string: "plante"  score: 0.016739207424584502
   termInfo
2  string: "view"  score: 0.012504136237184318
   termInfo
3  string: "dialog"  score: 0.011585130414139116
   termInfo

"plant" appears in 11 of the 26 files.

Thank you for your help

ziqizhang avatar ziqizhang commented on July 25, 2024

OK, I'm glad that it works for you in the end. I am closing this issue now.

To answer the question about TF and TF-IDF: it is possible for both to rank the same term first. They are not exact opposites, bearing in mind that this TF-IDF is not the original TF-IDF that is computed per document, where you would get a different TF-IDF for the word 'plant' in doc1, doc2, doc3, etc. Here a single score is computed per term over the whole corpus:

TFIDF = TTF x IDF

So if a term has a high TTF, it can also have a high TF-IDF. In the case of 'plant' in your corpus, it may be that the word also has a fairly high IDF, i.e., it is found in only a subset of the documents, but occurs very frequently within that subset, which gives it a high TTF over the corpus too.
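
To make this concrete, here is a small sketch under stated assumptions. The TTF values and the figure of 11 out of 26 documents for 'plant' come from this thread; the document frequencies for 'view', 'plante' and 'date' are invented for illustration, and the IDF variant 1 + ln(N / df) is assumed. JATE's actual IDF and normalisation evidently differ, since its scores are on a different scale, so only the ranking logic is illustrated:

```python
import math

NUM_DOCS = 26  # corpus size reported in this thread

# term -> (TTF, document frequency).
# TTF values and plant's df=11 are from the thread; the other df values
# are hypothetical and chosen only to illustrate the effect.
terms = {
    "plant": (360, 11),
    "view": (340, 20),
    "plante": (279, 9),
    "date": (235, 24),
}


def idf(doc_freq: int, num_docs: int = NUM_DOCS) -> float:
    """Assumed IDF variant: 1 + ln(N / df). JATE may use a different one."""
    return 1.0 + math.log(num_docs / doc_freq)


# Rank by raw corpus frequency, then by TTF x IDF.
by_ttf = sorted(terms, key=lambda t: terms[t][0], reverse=True)
by_tfidf = sorted(terms, key=lambda t: terms[t][0] * idf(terms[t][1]), reverse=True)

print(by_ttf)
print(by_tfidf)
```

With these numbers 'plant' tops both lists, while 'view' and 'plante' swap places because 'plante' is assumed to occur in fewer documents.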

Hope that makes sense.
