kcentric / deep_nlp_on_sf_literature

Multi-pronged, multi-stage analysis of a 3.5M-sentence science fiction corpus using optimized NLP, with NER techniques, LDA modeling and LLM integration. After the final commit, it will be possible to run a main file to generate a visualization of results on demand. The code is modularized and documented so that it can easily be reused or refitted for other kinds of corpora.

License: MIT License

Python 98.99% Jinja 1.01%
algorithm-design data-engineering data-visualization latent-dirichlet-allocation named-entity-recognition nlp-machine-learning

deep_nlp_on_sf_literature's People

Contributors

kcentric

Forkers

ifrasa

deep_nlp_on_sf_literature's Issues

Add some missing files to LLM directory

A large output text file in the original upload of the LLM directory caused the upload to stop mid-process. Some files might have been left out; check and add them.

Make it clear that RoBERTa training-data generation is not perfect yet

There is one relatively minor but still significant issue in the list_of_dicts_to_list_of_lists functionality (GPT_NER_Round_1.py): the RoBERTa tokenization doesn't exactly line up with the original sentences that I tokenized.

For example, a term like "desert cat" has one label ("CONCEPT") in the GPT-annotated data. When I tokenize for RoBERTa, the result should be "desert" - CONCEPT, "Ġcat" - CONCEPT, where "Ġ" is how the RoBERTa tokenizer represents a preceding space.

Instead, the tokenization process generates a CONCEPT label for "Ġcat" but not for "desert", or something like that (I think this happens not for "desert cat" itself but for a term right before it). If you run GPT_NER_Round_1.py as a script, and deny creation of new data when it prompts you, you should be able to see this issue in the console output.
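One way to make this kind of misalignment easier to spot is to propagate labels by character offsets rather than by token position. The following is only a minimal sketch, not the code in GPT_NER_Round_1.py: `toy_tokenize` and `align_labels` are hypothetical helpers, and the whitespace tokenizer merely stands in for RoBERTa's real subword tokenizer while imitating its "Ġ" prefix. Because "desert" follows a space in the sample sentence, it appears here as "Ġdesert".

```python
def toy_tokenize(sentence):
    """Split on spaces and return (token, start, end) triples.

    Hypothetical stand-in for the RoBERTa tokenizer: any token preceded
    by a space gets a "Ġ" prefix, but no subword splitting is done.
    """
    tokens = []
    pos = 0
    for word in sentence.split(" "):
        if word:
            start = pos
            end = start + len(word)
            prefix = "Ġ" if start > 0 else ""
            tokens.append((prefix + word, start, end))
        pos += len(word) + 1  # skip the word and the space after it
    return tokens


def align_labels(sentence, spans, default="O"):
    """Label every token whose character range overlaps an annotated term.

    spans is a list of (term, label) pairs, e.g. [("desert cat", "CONCEPT")].
    """
    # Locate every occurrence of each annotated term as a character span.
    char_spans = []
    for term, label in spans:
        start = sentence.find(term)
        while start != -1:
            char_spans.append((start, start + len(term), label))
            start = sentence.find(term, start + len(term))

    labeled = []
    for token, tok_start, tok_end in toy_tokenize(sentence):
        label = default
        for span_start, span_end, span_label in char_spans:
            # Two ranges overlap when each starts before the other ends.
            if tok_start < span_end and tok_end > span_start:
                label = span_label
                break
        labeled.append((token, label))
    return labeled


sentence = "The desert cat slept"
result = align_labels(sentence, [("desert cat", "CONCEPT")])
for token, label in result:
    print(token, label)
```

With a real tokenizer the same overlap test could be applied to the character offsets it reports per token; the plain substring search used here would also match a term inside a longer word, so it is meant for illustration only.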

Organization changes required

  • Change "main files" to Data Processing
  • Add "Analysis/Visualization" with LDA and Visualization as subfolders
  • Ultimately main.py should be the ONLY file directly in the main directory, so the layout should be a set of directories with a single main.py standing out clearly
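The last point in the list above could be checked mechanically. The sketch below is only an assumption about how such a check might look; `stray_top_level_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def stray_top_level_files(root, allowed=("main.py",)):
    """Return the names of files directly under root that are not allowed.

    Directories are ignored, so only files sitting next to main.py count.
    """
    root = Path(root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.name not in allowed
    )
```

Calling `stray_top_level_files(".")` from the repository root would list the files that still need moving into a subdirectory. In practice, files such as a README or license would probably have to be added to `allowed`.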
