kmpoon / hlta
Provides functions for hierarchical latent tree analysis on text data for hierarchical topic detection
License: GNU General Public License v3.0
[info] Loading project definition from D:\Projects\hlta_demo\project
[info] Loading settings for project hlta_demo from build.sbt ...
[info] Set current project to HLTA (in build file:/D:/Projects/hlta_demo/)
[info] sbt server started at local:sbt-server-20838606bc2a549be1db
sbt:HLTA>
error: error while loading String, class file '/modules/java.base/java/lang/String.class' is broken
(class java.lang.NullPointerException/null)
[info] Defining Global / sbtStructureOptions, Global / sbtStructureOutputFile and 1 others.
Since most datasets are .txt, could you please explain why you require the files to be .pdf in the extraction part?
I am going to work on the newsgroup dataset, but I cannot find it in .pdf format. I know that I can convert it, but that would be a waste of time, so I was wondering whether you have a reason for doing this.
Hello, I have some topics where the percentage is 0.0, but in the Doc2Vec assignment these topics have many documents (each with probability 1), too many for the topic's percentage to be 0. I don't know whether this percentage means something other than the share of documents belonging to the topic.
For example this topic with percentage 0.0:
{ "id": "Z145", "text": "samper ernesto-samper", "data": { "name": "Z145", "level": 1, "percentage": 0.0 }, "children": [] }
The Doc2Vec assignment is the following:
{"topic":"Z145", "doc":[["0", 1.00],["1", 1.00],["16", 1.00],["18", 1.00],["19", 1.00],["23", 1.00],["27", 1.00],["28", 1.00],["31", 1.00],["33", 1.00],["37", 1.00],["40", 1.00],["43", 1.00],["66", 1.00],["69", 1.00],["80", 1.00],["93", 1.00],["98", 1.00],["107", 1.00],["115", 1.00],["118", 1.00],["130", 1.00],["136", 1.00],["148", 1.00],["155", 1.00],["164", 1.00],["167", 1.00],["189", 1.00],["198", 1.00],["204", 1.00],["206", 1.00],["208", 1.00],["233", 1.00],["235", 1.00],["241", 1.00],["252", 1.00],["256", 1.00],["266", 1.00],["270", 1.00],["282", 1.00],["291", 1.00],["297", 1.00],["313", 1.00],["314", 1.00],["325", 1.00],["327", 1.00],["335", 1.00],["339", 1.00],["342", 1.00],["345", 1.00],["352", 1.00],["353", 1.00],["355", 1.00],["356", 1.00],["362", 1.00],["379", 1.00],["381", 1.00],["385", 1.00],["389", 1.00],["395", 1.00],["399", 1.00],["401", 1.00],["410", 1.00],["427", 1.00],["431", 1.00],["445", 1.00],["447", 1.00],["448", 1.00],["449", 1.00],["455", 1.00],["458", 1.00],["461", 1.00],["462", 1.00],["480", 1.00],["493", 1.00],["499", 1.00],["503", 1.00],["506", 1.00],["513", 1.00],["515", 1.00],["517", 1.00],["518", 1.00],["522", 1.00],["524", 1.00],["526", 1.00],["541", 1.00],["550", 1.00],["551", 1.00],["557", 1.00],["558", 1.00],["564", 1.00],["575", 1.00],["580", 1.00],["583", 1.00],["598", 1.00],["608", 1.00],["612", 1.00],["630", 1.00],["632", 1.00],["634", 1.00],["637", 1.00],["680", 1.00],["685", 1.00],["688", 1.00],["693", 1.00],["702", 1.00],["704", 1.00],["708", 1.00],["714", 1.00],["722", 1.00],["725", 1.00],["728", 1.00]]}
The total number of documents is 728, so the percentage clearly should be greater, or I don't know what I am interpreting wrong.
Thank you so much.
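One way to sanity-check the figure is to count the documents listed for a topic and divide by the corpus size. The sketch below is not part of HLTA. It assumes the assignment file holds one JSON object per topic in the shape shown above; `assigned_fraction` and its `min_prob` threshold are hypothetical names. Whether HLTA computes the `percentage` field this way is not confirmed here.

```python
import json

def assigned_fraction(assignment_line, total_docs, min_prob=0.5):
    """Return (topic id, fraction of documents assigned with probability >= min_prob)."""
    record = json.loads(assignment_line)
    # Each entry is [document id, probability], e.g. ["16", 1.00].
    hits = [doc_id for doc_id, prob in record["doc"] if prob >= min_prob]
    return record["topic"], len(hits) / total_docs

# Shortened, made-up sample in the same shape as the Z145 line above.
sample = '{"topic":"Z145", "doc":[["0", 1.00],["1", 1.00],["16", 1.00],["18", 0.20]]}'
topic, fraction = assigned_fraction(sample, total_docs=728)
print(topic, round(fraction, 4))
```

Running this on the full Z145 line with `total_docs=728` gives a number to compare against the reported 0.0.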
I encountered this error when I ran the command on a directory containing only a few .txt files with little content.
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./mydir testoutput
The error:
[main] INFO tm.hlta.HTD$ - Convert raw text/pdf to .sparse.txt format
[main] INFO tm.text.Convert$ - Reading documents
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 2 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 2 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.hlta.HTD$ - Output file reading order
[main] INFO tm.hlta.HTD$ - Building model
Exception in thread "main" java.lang.NullPointerException
at clustering.StepwiseEMHLTA.BridgingIslands(StepwiseEMHLTA.java:1214)
at clustering.StepwiseEMHLTA.FastHLTA_learn(StepwiseEMHLTA.java:520)
at clustering.StepwiseEMHLTA.IntegratedLearn(StepwiseEMHLTA.java:423)
at tm.hlta.HLTA$.apply(HLTA.scala:93)
at tm.hlta.HTD$.main(HTD.scala:203)
at tm.hlta.HTD.main(HTD.scala)
The stack trace suggests the error occurs during tree construction, but the real cause is that the dictionary file is not generated correctly: the generated files testoutput.dict.csv and testoutput.sparse.txt are empty. Is there any argument that ensures at least a certain number of words are added to the dictionary?
PS: I checked the source code of hlta/src/main/scala/tm/text/Convert.scala, and the variable minDocFraction seems to handle this ratio. Is it the --min-doc-fraction in the argument list?
PS2: I have tried this argument with 0.1 and 0.2, but xxx.sparse.txt and xxx.dict.csv are still empty. Any idea why this happens?
(base) D:\my_research\document_topic_modelling\hltm_python_util\hltm_python_util\JARS>java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert -h
Usage: tm.text.Convert [OPTION]... name source max-words concat
  -d, --debug                      Show debug message
      --input-encoding  <arg>      Input text file encoding, default UTF-8, see
                                   java.nio.charset.Charset for available
                                   encodings
  -i, --input-ext  <arg>...        Look for these extensions if a directory is
                                   given, default "txt pdf"
  -l, --language  <arg>            Language, default as English, can be {english,
                                   chinese, nonascii}
      --max-doc-fraction  <arg>    Maximum fraction of documents that a token can
                                   appear to be selected. Default: 0.25
  -m, --min-char  <arg>            Minimum number of characters of a word to be
                                   selected. English default as 3,
                                   Chinese/Nonascii default as 1
      --min-doc-fraction  <arg>    Minimum fraction of documents that a token can
                                   appear to be selected. Default: 0.0
      --output-arff                Additionally output arff format
  -o, --output-hlcm                Additionally output hlcm format
      --output-lda                 Additionally output lda format
  -s, --seed-words  <arg>          File containing tokens to be included,
                                   regardless of other selection criteria.
      --show-log-time              Show time in log
      --stop-words  <arg>          File of stop words, default using built-in
                                   stopwords list
  -t, --testset-ratio  <arg>       Split into training and testing set by a user
                                   given ratio. Default is 0.0
  -h, --help                       Show help message
Many thanks!
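The log reports `minDfFraction: 0.0, maxDfFraction: 0.25` and zero selected tokens. One possible explanation is that the selector keeps a token only when the fraction of documents containing it lies between those two bounds. This is an assumption based on the help text, not on the HLTA source. If so, with fewer than four documents every token appears in at least 1/N > 0.25 of them, so nothing can pass. The sketch below reproduces that arithmetic; `select_tokens` is a hypothetical helper, not an HLTA function.

```python
from collections import Counter

def select_tokens(docs, min_df_fraction=0.0, max_df_fraction=0.25):
    """Keep tokens whose document fraction lies within [min, max].

    Assumed behaviour, modelled on the --min-doc-fraction / --max-doc-fraction
    help text; the real HLTA selector may differ.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each token at least once.
    df = Counter(token for doc in docs for token in set(doc.split()))
    return sorted(
        token for token, count in df.items()
        if min_df_fraction <= count / n_docs <= max_df_fraction
    )

three_docs = ["topic model tree", "latent tree analysis", "text data"]
print(select_tokens(three_docs))                       # every token has df >= 1/3
print(select_tokens(three_docs, max_df_fraction=1.0))  # upper bound relaxed
```

If the assumption holds, raising `--min-doc-fraction` only narrows the window further, which would match the empty output seen with 0.1 and 0.2. Relaxing `--max-doc-fraction` or adding more documents would be the thing to try.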
Is it possible to add a license to this repository? (https://choosealicense.com/no-permission/)
After reading your articles I would very much like to try HLTA and see how it works.
Thank you for your time!
Hello again :)
I have applied your approach to a very large medical dataset. It outputs 5 topics, and each topic has only one subcategory.
Which parameters should I change, or to put it better, how can I increase the number of topics to at least 10?
Is the only option to increase the number of words extracted in the Convert step?
And why are there so many repeated words among the clusters of a topic? I mean that it is not good to see the same topic or word at both a lower level and an upper level.
Thanks:)
With LDA methods, after creating a topic model we can fit an unseen document and get its topic distribution. I wonder whether this is possible with an HLTA model. How can I annotate an unseen document with topics already created by an HLTA model?
Thank you
I want to provide the algorithm with an input file of discrete variables in csv or txt format (not documents).
I get the following error:
"Exception in thread "main" java.lang.Exception: Unsupported file format"
Thank you very much, you helped a lot.
Fortunately, I ran the code on the NIPS dataset and it is working.
One question: why does it show just one level of the output?
For example, part of my output looks like this:
`0.278 cortex stimulus
0.228 firing spike
0.205 mixture expert
0.421 pixel theorem character cluster energy
0.205 speech classifier
0.202 circuit voltage`
And lastly, how can I see the evaluation part? You used coherence for evaluation; could you give me step-by-step directions for obtaining that result as well?
Thanks for taking the time :)
Hi again,
Sorry for opening so many issues. This is actually not an issue, but I do not know where else to ask, and it may also help other people reading these issues to understand the model better.
My question is this: according to the output you provided in the paper, you used 1-grams.
I mean that all the topic words are single words.
So why did you use the n-gram model rather than plain BOW?
Did you add it only in case someone wants to use n-grams, without using bigrams or trigrams yourselves?
Thanks for adding my information
I am trying to extract topics from the CFPB dataset, but the run breaks partway through with an OOM error. I have 16 GB of RAM, which I hope is enough for this dataset.
Hi,
May I ask for some guidelines on how to run it without using the command line?
I want to go through the code and run it step by step, but each file takes several arguments, and it is not obvious which files I should prepare and in which order.
I really appreciate your time.
Hi,
Thanks for sharing the code for HLTA.
I'm trying to run PEM on the NIPS dataset, but I keep getting this error in the clique tree propagation. The attribute _cells of the Function returned by functions.get(var).project(var, value) seems to be empty for some runs.
at java.lang.System.arraycopy(Native Method)
at org.latlab.util.Function2D.project(Function2D.java:236)
at org.latlab.reasoner.CliqueTreePropagation.absorbEvidence(CliqueTreePropagation.java:168)
at org.latlab.reasoner.CliqueTreePropagation.propagate(CliqueTreePropagation.java:653)
at org.latlab.learner.ParallelEmLearner$ForkComputation.computeDirectly(ParallelEmLearner.java:312)
Thanks
Following the README file, I ran "java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert --testsetRatio 0.2 datasetName ./source 1000 1" with version 2.1. The error was "Unknown option 'testsetRatio'". Furthermore, I went through the src folder but did not find anything related to argv or the testsetRatio flag. Could someone elaborate a little more?