kcentric / deep_nlp_on_sf_literature

Multi-pronged, multi-stage analysis of a 3.5M-sentence science fiction corpus using optimized NLP, with NER techniques, LDA modeling, and LLM integration. After the final commit, running a main file will generate a visualization of results on demand. The code is modularized and documented so it can easily be reused or refitted for other kinds of corpora.

License: MIT License

Python 98.99% Jinja 1.01%

algorithm-design data-engineering data-visualization latent-dirichlet-allocation named-entity-recognition nlp-machine-learning

deep_nlp_on_sf_literature's People

Forkers

ifrasa

deep_nlp_on_sf_literature's Issues

Add some missing files to LLM directory

A large output text file interrupted the original upload of the LLM directory mid-process. Some files may have been left out; check and add them.

Make it clear that RoBERTa training-data generation is not perfect yet

There is one (relatively minor, but significant) issue in the list_of_dicts_to_list_of_lists functionality (GPT_NER_Round_1.py). Basically, the RoBERTa tokenization doesn't line up exactly with the original sentences that I tokenized.

For example, a term like "desert cat" has one label ("CONCEPT") in the GPT-annotated data. When I tokenize for RoBERTa, it should become "desert" - CONCEPT, "Ġcat" - CONCEPT, where "Ġ" is how the RoBERTa tokenizer represents a preceding space.

Instead (not for "desert cat" itself, I think, but for a term right before it), the tokenization process generates a CONCEPT label for "Ġcat" but not for "desert", or something similar. If you run GPT_NER_Round_1.py as a script (decline to create new data when it prompts you), you should be able to see this issue in the console output.
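
One way to avoid this kind of drift is to attach labels by character position rather than by token index. The sketch below is not the code in GPT_NER_Round_1.py: `toy_tokenize` and `align_labels` are hypothetical names, and the tokenizer is a simplified whitespace/punctuation stand-in that only imitates the "Ġ" convention, not real RoBERTa BPE. With a Hugging Face fast tokenizer, the same character offsets can be obtained by passing `return_offsets_mapping=True`.

```python
import re


def toy_tokenize(sentence):
    """Simplified stand-in for a RoBERTa-style tokenizer (assumption, not real BPE).

    Returns (token, start, end) triples. A token that follows a space
    gets the "Ġ" prefix, mirroring how RoBERTa marks a preceding space.
    """
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        start, end = match.span()
        prefix = "Ġ" if start > 0 and sentence[start - 1] == " " else ""
        tokens.append((prefix + match.group(), start, end))
    return tokens


def align_labels(sentence, spans, default="O"):
    """Give each token the label of the annotated term that covers it.

    `spans` is a list of (term, label) pairs, e.g. [("desert cat", "CONCEPT")].
    Labels are painted onto characters first, then read back per token,
    so every token inside a multi-word term receives the term's label.
    """
    char_labels = [default] * len(sentence)
    for term, label in spans:
        # Plain substring search: a short term can also match inside a longer word.
        for match in re.finditer(re.escape(term), sentence):
            for i in range(match.start(), match.end()):
                char_labels[i] = label
    # A token takes the label of its first character.
    return [(tok, char_labels[start]) for tok, start, _end in toy_tokenize(sentence)]


if __name__ == "__main__":
    for token, label in align_labels("The desert cat slept.", [("desert cat", "CONCEPT")]):
        print(token, label)
```

Because the label lookup depends only on where a token starts in the original sentence, a mislabeled neighbor cannot shift the labels of the tokens after it.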

Change "as the article 'and'" to "as the article 'an'" in the README. :)

Organization changes required

Change "main files" to Data Processing
Add "Analysis/Visualization" with LDA and Visualization as subfolders
Ultimately, main.py should be the ONLY file directly in the main directory, so the layout should be a set of directories with a single main.py standing out clearly.
