I'm trying to execute TF-IDF on the Embedded Mode jate-2.0-beta.7 but

Timeout waiting for all directory ref counts to be released (jate issue, CLOSED)

ziqizhang commented on September 23, 2024

Timeout waiting for all directory ref counts to be released

Comments (5)

ziqizhang commented on September 23, 2024

The error seems to happen because Solr does not close the index properly, but it should not affect the results.

TF-IDF and TF can produce the same score if IDF of a term is 1.0. Typically if you have only 1 document in the corpus, that would be the case.

Can you provide your test files for us to investigate?

sicheva commented on September 23, 2024

Hi,
Thanks for your response. I'm using 26 files.
Here is the zip file: corpus.zip

ziqizhang commented on September 23, 2024

Thank you.

I cannot reproduce your error, unfortunately. See the log file below.

It may be that we are using different schema files. I am using the ACLRDTEC config in the distribution to index your corpus, which gives some 2400 candidate terms, but your log indicates you had more than 4000. If you share your configuration, I can have another look.

The tfidf and ttf results are also different. Although both rank 'plant' first, it clearly gets different scores:

ttf: [{"string":"plant","score":360.0},
tfidf: [{"string":"plant","score":0.02591184465570414}
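A quick way to check whether two exported term lists really carry different scores is to load both and compare them term by term. The following is a minimal sketch, assuming the exports are JSON arrays of objects with "string" and "score" keys, as the two fragments above suggest. The helper names `load_scores` and `differing_terms` are hypothetical, and the sample strings simply close the truncated lists after their first entry.

```python
import json


def load_scores(json_text):
    """Parse an exported term list into a {term: score} dictionary."""
    return {entry["string"]: entry["score"] for entry in json.loads(json_text)}


def differing_terms(ttf_scores, tfidf_scores):
    """Return the terms present in both exports whose scores differ."""
    shared = ttf_scores.keys() & tfidf_scores.keys()
    return sorted(t for t in shared if ttf_scores[t] != tfidf_scores[t])


# First entries quoted in this comment, with the lists closed by hand.
ttf_sample = '[{"string":"plant","score":360.0}]'
tfidf_sample = '[{"string":"plant","score":0.02591184465570414}]'

ttf_scores = load_scores(ttf_sample)
tfidf_scores = load_scores(tfidf_sample)
print(differing_terms(ttf_scores, tfidf_scores))
```

With real exports, the sample strings would be replaced by the contents of the ttf.json and tfidf.json files. An empty result would mean the two algorithms produced identical scores for every shared term.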

Log output:
indexing

Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
2018-06-16 10:05:00 INFO Indexing:26 - DELETING PREVIOUS INDEX
2018-06-16 10:05:01 INFO Indexing:30 - INDEXING BEGINS
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/home/zz/.m2/repository/org/apache/tika/tika-core/1.15/tika-core-1.15.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-16 10:05:01 INFO IndexingHandler:28 - Beginning indexing dataset, total docs=26
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (3).txt
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (10).txt
2018-06-16 10:05:02 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (5).txt
2018-06-16 10:05:03 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (12).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (4).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (3).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (9).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (8).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (7).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (3).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (7).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (4).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (2).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (1).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (4).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (8).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (6).txt
2018-06-16 10:05:09 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (11).txt
2018-06-16 10:05:09 INFO IndexingHandler:87 - Complete indexing dataset. Total processed items = 26
2018-06-16 10:05:09 INFO Indexing:37 - INDEXING COMPLETE

tfidf

Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:06:55 INFO TFIDF:27 - Beginning computing TFIDF values,, total terms=2488
2018-06-16 10:06:55 INFO TFIDF:38 - Complete
2018-06-16 10:06:55 INFO AppTFIDF:492 - Exporting terms to [/home/zz/Work/data/tfidf.json]
2018-06-16 10:06:56 INFO AppTFIDF:496 - complete.

ttf

Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:33 BST 2018 loading done
Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:08:38 INFO TTF:25 - Beginning computing TTF values,, total terms=2488
2018-06-16 10:08:38 INFO TTF:32 - Complete
2018-06-16 10:08:38 INFO AppTTF:492 - Exporting terms to [/home/zz/Work/data/ttf.json]
2018-06-16 10:08:38 INFO AppTTF:496 - complete.

sicheva commented on September 23, 2024

Hi,
I installed the Plugin Mode version (jate-2.0-beta.11) on Solr-7.2.1 and I get the same results as you, so it's okay!
But I still don't understand why the same term "plant" is ranked first by both algorithms, TF and TF-IDF. Normally, if the term "plant" is ranked first in TF with the maximal score, ttf: [{"string":"plant","score":360.0}] (meaning "plant" appears a lot in the corpus), it should not also be ranked first with the maximal score by the TF-IDF algorithm (which gives greater weight to the least frequent terms)!

Here is the example .json:
TF-->

JSON
0 string : "plant" score : 360
  termInfo
    offsets
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
    variants
    otherInfo
1 string : "view" score : 340 termInfo
2 string : "plante" score : 279 termInfo
3 string : "date" score : 235 termInfo

TF-IDF -->
JSON
0 string : "plant" score : 0.026804505797656034
  termInfo
    offsets
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
    variants
    otherInfo
1 string : "plante" score : 0.016739207424584502 termInfo
2 string : "view" score : 0.012504136237184318 termInfo
3 string : "dialog" score : 0.011585130414139116 termInfo

"plant" appears in 11 of the 26 files.

Thank you for your help

ziqizhang commented on September 23, 2024

OK, I'm glad that it works for you in the end. I am closing this issue now.

To answer the question about TF and TFIDF: it is possible for both to rank the same term as #1. The two measures are not opposites, especially because this TFIDF is not the original TFIDF that works on individual documents, where you get a different TFIDF for the word 'plant' in doc1, doc2, doc3, etc. Here a single score is computed per term over the whole corpus:

TFIDF = TTF x IDF

So if a term has a high TTF, it can also have a high TFIDF. In the case of 'plant' in your corpus, it may be that this word also has a high IDF, i.e., it is found only in a subset of the corpus. But within this subset it may have a very high frequency, which gives it a high TTF in the corpus too.
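As a rough illustration of the TTF x IDF relation, here is a minimal sketch using the figures from this thread for 'plant' (TTF 360, found in 11 of 26 documents). Two things are assumptions: the IDF formula ln(N / df), which is a common textbook choice and not necessarily what JATE computes (so the absolute values will not match JATE's output), and the two comparison terms "everywhere" and "rare", which are made up for the example.

```python
import math


def corpus_tfidf(ttf, df, n_docs):
    """Corpus-level TFIDF = TTF x IDF, with IDF assumed to be ln(n_docs / df)."""
    return ttf * math.log(n_docs / df)


N_DOCS = 26

# term -> (total term frequency, number of documents containing the term)
terms = {
    "plant": (360, 11),       # figures reported in this thread
    "everywhere": (300, 26),  # hypothetical: frequent and present in every document
    "rare": (5, 1),           # hypothetical: infrequent, present in one document
}

by_ttf = sorted(terms, key=lambda t: terms[t][0], reverse=True)
by_tfidf = sorted(terms, key=lambda t: corpus_tfidf(*terms[t], N_DOCS), reverse=True)

print("ranked by TTF:  ", by_ttf)
print("ranked by TFIDF:", by_tfidf)
```

Under these assumptions 'plant' comes first in both rankings: its IDF is moderate because it is missing from more than half of the documents, and its TTF is large. The term that occurs in every document gets an IDF of zero and drops to the bottom of the TFIDF ranking despite its high TTF.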

Hope that makes sense.
