kmpoon / hlta Goto Github PK

Provides functions for hierarchical latent tree analysis on text data for hierarchical topic detection

License: GNU General Public License v3.0

R 0.34% Scala 12.04% Java 79.00% CSS 3.15% JavaScript 5.05% HTML 0.21% Shell 0.11% Python 0.09%

hlta's Issues

How to resolve an error loading the String class while building with build.sbt?

[info] Loading project definition from D:\Projects\hlta_demo\project
[info] Loading settings for project hlta_demo from build.sbt ...
[info] Set current project to HLTA (in build file:/D:/Projects/hlta_demo/)
[info] sbt server started at local:sbt-server-20838606bc2a549be1db
sbt:HLTA>
error: error while loading String, class file '/modules/java.base/java/lang/String.class' is broken
(class java.lang.NullPointerException/null)
[info] Defining Global / sbtStructureOptions, Global / sbtStructureOutputFile and 1 others.

Why PDF format?

Since most datasets are .txt files, could you please explain why the extraction step requires .pdf files?

I am going to work on the newsgroup dataset, but I cannot find it in .pdf format. I know that I can convert it, but that would be a waste of time, so I was wondering whether you have a reason for this requirement.

Percentage and assignments congruence

Hello, I have some topics whose percentage is 0.0, but according to the Doc2Vec assignment these topics have many documents (each with probability 1), too many for the topic's percentage to be 0. I don't know whether this percentage means something other than the share of documents belonging to the topic.

For example this topic with percentage 0.0:

{ "id": "Z145", "text": "samper ernesto-samper", "data": { "name": "Z145", "level": 1, "percentage": 0.0 }, "children": [] }

The Doc2Vec assignment is the following:

{"topic":"Z145", "doc":[["0", 1.00],["1", 1.00],["16", 1.00],["18", 1.00],["19", 1.00],["23", 1.00],["27", 1.00],["28", 1.00],["31", 1.00],["33", 1.00],["37", 1.00],["40", 1.00],["43", 1.00],["66", 1.00],["69", 1.00],["80", 1.00],["93", 1.00],["98", 1.00],["107", 1.00],["115", 1.00],["118", 1.00],["130", 1.00],["136", 1.00],["148", 1.00],["155", 1.00],["164", 1.00],["167", 1.00],["189", 1.00],["198", 1.00],["204", 1.00],["206", 1.00],["208", 1.00],["233", 1.00],["235", 1.00],["241", 1.00],["252", 1.00],["256", 1.00],["266", 1.00],["270", 1.00],["282", 1.00],["291", 1.00],["297", 1.00],["313", 1.00],["314", 1.00],["325", 1.00],["327", 1.00],["335", 1.00],["339", 1.00],["342", 1.00],["345", 1.00],["352", 1.00],["353", 1.00],["355", 1.00],["356", 1.00],["362", 1.00],["379", 1.00],["381", 1.00],["385", 1.00],["389", 1.00],["395", 1.00],["399", 1.00],["401", 1.00],["410", 1.00],["427", 1.00],["431", 1.00],["445", 1.00],["447", 1.00],["448", 1.00],["449", 1.00],["455", 1.00],["458", 1.00],["461", 1.00],["462", 1.00],["480", 1.00],["493", 1.00],["499", 1.00],["503", 1.00],["506", 1.00],["513", 1.00],["515", 1.00],["517", 1.00],["518", 1.00],["522", 1.00],["524", 1.00],["526", 1.00],["541", 1.00],["550", 1.00],["551", 1.00],["557", 1.00],["558", 1.00],["564", 1.00],["575", 1.00],["580", 1.00],["583", 1.00],["598", 1.00],["608", 1.00],["612", 1.00],["630", 1.00],["632", 1.00],["634", 1.00],["637", 1.00],["680", 1.00],["685", 1.00],["688", 1.00],["693", 1.00],["702", 1.00],["704", 1.00],["708", 1.00],["714", 1.00],["722", 1.00],["725", 1.00],["728", 1.00]]}

The total number of documents is 728, so the percentage clearly should be greater, or else I don't know what I am misinterpreting.

Thank you so much.
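To make the comparison in this issue concrete, here is a minimal Python sketch that counts the documents listed for a topic in one assignment line and expresses them as a share of the corpus. The function name `assigned_fraction` and the 0.5 probability threshold are assumptions made for illustration; the JSON layout is taken from the line quoted above. Whether HLTA's `percentage` field is supposed to equal this share is not confirmed anywhere in this thread.

```python
import json


def assigned_fraction(assignment_line, total_docs, threshold=0.5):
    """Return (topic, number of assigned docs, share of corpus).

    `threshold` is an assumed cut-off for counting a document as
    belonging to the topic; it is not taken from HLTA.
    """
    record = json.loads(assignment_line)
    # Each entry in "doc" is a [document id, probability] pair.
    assigned = [doc_id for doc_id, prob in record["doc"] if prob >= threshold]
    return record["topic"], len(assigned), len(assigned) / total_docs


# Shortened sample in the same layout as the assignment quoted above.
sample = '{"topic":"Z145", "doc":[["0", 1.00],["1", 1.00],["16", 1.00],["18", 1.00]]}'
topic, count, fraction = assigned_fraction(sample, total_docs=728)
print(topic, count, fraction)
```

Running the same function over the full assignment line gives the share to compare against the `percentage` value stored in the topic tree.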

(subroute1) Text Convert fails due to small-scale input text

I encountered this error when running the command on a directory containing only a few .txt files with little content.

java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./mydir testoutput

The error:

[main] INFO tm.hlta.HTD$ - Convert raw text/pdf to .sparse.txt format
[main] INFO tm.text.Convert$ - Reading documents
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 2 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 2 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.hlta.HTD$ - Output file reading order
[main] INFO tm.hlta.HTD$ - Building model
Exception in thread "main" java.lang.NullPointerException
        at clustering.StepwiseEMHLTA.BridgingIslands(StepwiseEMHLTA.java:1214)
        at clustering.StepwiseEMHLTA.FastHLTA_learn(StepwiseEMHLTA.java:520)
        at clustering.StepwiseEMHLTA.IntegratedLearn(StepwiseEMHLTA.java:423)
        at tm.hlta.HLTA$.apply(HLTA.scala:93)
        at tm.hlta.HTD$.main(HTD.scala:203)
        at tm.hlta.HTD.main(HTD.scala)

The stack trace suggests that the error occurs during tree construction, but the real cause is that the dictionary is not generated correctly: the generated files testoutput.dict.csv and testoutput.sparse.txt are empty. Is there an argument that ensures at least a certain number of words are added to the dictionary?

PS: I checked the source code of hlta/src/main/scala/tm/text/Convert.scala, and the variable minDocFraction seems to control this ratio. Does it correspond to --min-doc-fraction in the argument list?

PS2: I tried this argument with 0.1 and 0.2, but xxx.sparse.txt and xxx.dict.csv are still empty. Any idea why this happens?

(base) D:\my_research\document_topic_modelling\hltm_python_util\hltm_python_util\JARS>java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert -h
Usage: tm.text.Convert [OPTION]... name source max-words concat
  -d, --debug                     Show debug message
      --input-encoding  <arg>     Input text file encoding, default UTF-8, see
                                  java.nio.charset.Charset for available
                                  encodings
  -i, --input-ext  <arg>...       Look for these extensions if a directory is
                                  given, default "txt pdf"
  -l, --language  <arg>           Language, default as English, can be {english,
                                  chinese, nonascii}
      --max-doc-fraction  <arg>   Maximum fraction of documents that a token can
                                  appear to be selected. Default: 0.25
  -m, --min-char  <arg>           Minimum number of characters of a word to be
                                  selected. English default as 3,
                                  Chinese/Nonascii default as 1
      --min-doc-fraction  <arg>   Minimum fraction of documents that a token can
                                  appear to be selected. Default: 0.0
      --output-arff               Additionally output arff format
  -o, --output-hlcm               Additionally output hlcm format
      --output-lda                Additionally output lda format
  -s, --seed-words  <arg>         File containing tokens to be included,
                                  regardless of other selection criteria.
      --show-log-time             Show time in log
      --stop-words  <arg>         File of stop words, default using built-in
                                  stopwords list
  -t, --testset-ratio  <arg>      Split into training and testing set by a user
                                  given ratio. Default is 0.0
  -h, --help                      Show help message

Many thanks!
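One possible explanation, offered as a guess rather than a confirmed diagnosis: the log reports maxDfFraction: 0.25, and in a corpus of fewer than four documents every token appears in at least 1/N of the documents, which is already above 0.25. The Python sketch below illustrates that arithmetic with a hypothetical `select_tokens` function. It is not HLTA's implementation: it assumes inclusive bounds and ignores the TF-IDF ranking mentioned in the log.

```python
from collections import Counter


def select_tokens(docs, min_doc_fraction=0.0, max_doc_fraction=0.25):
    """Keep tokens whose document fraction lies within the given bounds.

    Hypothetical stand-in for a document-frequency filter; the real
    selection logic in tm.text.DataConverter may differ.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each token at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        token
        for token, df in doc_freq.items()
        if min_doc_fraction <= df / n_docs <= max_doc_fraction
    )


# Three tiny documents: the smallest possible document fraction is 1/3.
tiny = [["topic", "model"], ["latent", "tree"], ["topic", "tree"]]
print(select_tokens(tiny))                        # default upper bound of 0.25
print(select_tokens(tiny, max_doc_fraction=1.0))  # relaxed upper bound
```

If the real filter behaves similarly, raising --min-doc-fraction can only remove more tokens, while raising --max-doc-fraction (listed in the help output above) would be the option to experiment with on very small inputs.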

add license to repo

Is it possible to add a license to this repository? (https://choosealicense.com/no-permission/)

After reading your articles I would very much like to try HLTA and see how it works.

Thank you for your time!

extract more topics

Hello again :)

I have applied your approach to a very large medical dataset. It outputs 5 topics, and each topic has only one subcategory.

Which parameters should I change, or to put it better, how can I increase the number of topics to at least 10?
Is the only option to increase the number of words extracted in the Convert step?
Also, why are there so many repeated words among the clusters of a topic? I mean that it is not good to see the same topic or word at both a lower level and an upper level.

Thanks:)

How to fit model to unseen document?

In LDA methods, after creating the topic model, we can fit an unseen document and get its topic distribution. I wonder whether this is possible with an HLTA model. How can I annotate an unseen document with topics already created by an HLTA model?

Thank you

Bugs when parsing PDF

When I feed in some PDFs, I get the error logs shown below:

How can I fix it?
Just let me know if any further information is needed

Can I run the HLC code on discrete data?

I want to provide the algorithm an input file with discrete variables in csv or txt format (not documents).
I get the following error:

"Exception in thread "main" java.lang.Exception: Unsupported file format"

evaluation part

Thank you very much, you helped a lot.
Happily, I ran the code on the NIPS dataset and it works.
One question: why does the output show just one level?
for example part of my output is like this:
`0.278 cortex stimulus

0.228 firing spike

0.205 mixture expert

0.421 pixel theorem character cluster energy

0.205 speech classifier

0.202 circuit voltage`

Lastly, how can I run the evaluation? You used coherence for evaluation; could you give me step-by-step directions for obtaining that result as well?

Thanks for taking the time :)
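For readers unfamiliar with coherence scores, the Python sketch below computes one widely used variant based on document co-occurrence counts: for each ordered pair of topic words, it adds log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words. This is an illustration under stated assumptions, not the evaluation procedure from the HLTA paper, which this thread does not spell out. The function name `cooccurrence_coherence` and the toy corpus are made up for the example.

```python
import math


def cooccurrence_coherence(topic_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs with l < m.

    `topic_words` should be ordered from most to least characteristic.
    Pairs whose earlier word occurs in no document are skipped.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            d_l = doc_count(topic_words[l])
            if d_l == 0:
                continue
            d_ml = doc_count(topic_words[m], topic_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Toy corpus using words from the output quoted above.
docs = [
    ["cortex", "stimulus"],
    ["cortex", "stimulus", "spike"],
    ["firing", "spike"],
]
print(cooccurrence_coherence(["cortex", "stimulus"], docs))
print(cooccurrence_coherence(["cortex", "firing"], docs))
```

Higher (less negative) scores indicate that the words of a topic tend to appear in the same documents; the second call scores lower because its two words never co-occur in the toy corpus.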

Using both N-gram and BOW

Hi again,

Sorry for opening so many issues. This is not really an issue, but I do not know where else to ask, and it may also help other readers of these issues to understand the model better.

My question: according to the output shown in the paper, you used 1-grams.
I mean that all the terms are single words.
So why did you use the n-gram model and then BOW?
Did you add n-gram support only in case someone wants to use it, without using bigrams or trigrams yourselves?

Thanks for adding my information

How to handle out of memory error?

I am trying to extract topics from the CFPB dataset, but the run breaks partway through with an OOM error. I have 16 GB of RAM, which I would expect to be enough for this dataset.

argument

Hi,

Could you give me some guidelines on how to run it without using the command line?
I want to go through the code and run it step by step. Each file takes a couple of arguments, and it is not obvious which files I should prepare and in which order.

I really appreciate your time

Issue running PEM

Hi,

Thanks for sharing the code for HLTA.

I'm trying to run PEM on the NIPS dataset, but I'm constantly getting this error in the clique tree propagation. The attribute _cells of the Function returned by functions.get(var).project(var, value) seems to be empty for some runs.

	at java.lang.System.arraycopy(Native Method)
	at org.latlab.util.Function2D.project(Function2D.java:236)
	at org.latlab.reasoner.CliqueTreePropagation.absorbEvidence(CliqueTreePropagation.java:168)
	at org.latlab.reasoner.CliqueTreePropagation.propagate(CliqueTreePropagation.java:653)
	at org.latlab.learner.ParallelEmLearner$ForkComputation.computeDirectly(ParallelEmLearner.java:312)

Thanks

V2.1 cannot split dataset into training and testing parts

Following the README file, I ran "java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert --testsetRatio 0.2 datasetName ./source 1000 1" with version 2.1. The error was "Unknown option 'testsetRatio'". Furthermore, I went through the src folder but did not find anything related to argv or the testsetRatio flag. Could someone elaborate a bit more?
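A note for comparison: the `tm.text.Convert -h` output quoted in an earlier issue on this page lists the option as `-t, --testset-ratio` rather than `--testsetRatio`. That help text may come from a different build than version 2.1, so the spelling is an assumption to verify with `-h` on your own jar. The Python sketch below only assembles the command line with the hyphenated spelling; the helper name `build_convert_command` is hypothetical.

```python
import os


def build_convert_command(name, source, max_words, concat, testset_ratio):
    """Assemble the tm.text.Convert command as an argument list.

    Uses the hyphenated option spelling from the quoted help output,
    which may not match every HLTA version.
    """
    # os.pathsep is ":" on Unix-like systems and ";" on Windows,
    # matching the two classpath styles seen in the commands on this page.
    classpath = os.pathsep.join(["HLTA.jar", "HLTA-deps.jar"])
    return [
        "java", "-cp", classpath, "tm.text.Convert",
        "--testset-ratio", str(testset_ratio),
        name, source, str(max_words), str(concat),
    ]


cmd = build_convert_command("datasetName", "./source", 1000, 1, 0.2)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the two jar files.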
