
nicolay-r / deep-book-processing

Toolkit aimed at Character Personality Extraction from literary novels, with experiments organized in separate folders

License: MIT License

Python 99.65% Shell 0.35%
books dialogs parlai project-gutenberg sampling speaker-recognition dialogue-systems personality-traits pipeline ranking-system

deep-book-processing's Introduction

Hi, I'm Nicolay! 👋

  • See my personal website on GitHub for more information about me
  • I combine it with track-and-field 🏃‍♂️, ⛷️ and 🌊🏄‍♂️


deep-book-processing's People

Stargazers


Watchers


Forkers

ellyok

deep-book-processing's Issues

List of Personalization Schemes :exclamation:

A list of approaches to modeling personalities:

  • SPECTRUM
    • Occurrence of whole adjective words (no need to subtract the rich/poor poles).
      -- SPECTRUM bag-of-words (a sketch is given after this list)
    • What has been done so far:
      • rich -- embedding.
      • poor -- embedding.
        -- how many words occur in the context of the character.
        -- SPECTRUM embedding
  • HLA Dictionary
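A minimal sketch of the SPECTRUM bag-of-words variant described above, assuming hypothetical adjective lists per pole and a tokenized character context (the pole word lists and example sentence are illustrative):

```python
from collections import Counter

# Hypothetical spectrum poles (adjectives per pole); the real lists come
# from the spectrum dictionary used in the project.
SPECTRUM_POLES = {
    "rich": {"wealthy", "rich", "affluent", "luxurious"},
    "poor": {"poor", "ragged", "destitute", "humble"},
}

def spectrum_bag_of_words(context_tokens):
    """Count, for every pole, how many of its adjectives occur in the
    character's context (no subtraction between opposite poles)."""
    counts = Counter(t.lower() for t in context_tokens)
    return {pole: sum(counts[w] for w in words)
            for pole, words in SPECTRUM_POLES.items()}

print(spectrum_bag_of_words("the rich and affluent merchant met a ragged boy".split()))
# {'rich': 2, 'poor': 1}
```

The SPECTRUM embedding variant would presumably replace exact word matches with embedding similarity between the pole adjectives and the words in the character's context.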

Use masked `_` character names instead of their IDs.

Explanation: this is a bug, since IDs were not expected to appear in utterances, and leaving them there allows the model to overfit to them.
It would also be better not to expose the names of other characters.
So the solution is to completely mask the names (see the sketch below).
#12 related
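A minimal sketch of the proposed masking, assuming utterances are plain strings and the set of character names/IDs per book is known (the names and the helper below are illustrative):

```python
import re

MASK = "_"

def mask_character_names(utterance, character_names):
    """Replace every known character name (or ID) in the utterance
    with the mask token, so models cannot overfit to the names."""
    # Longest names first, so "Elizabeth Bennet" is masked before "Elizabeth".
    for name in sorted(character_names, key=len, reverse=True):
        utterance = re.sub(r"\b" + re.escape(name) + r"\b", MASK, utterance,
                           flags=re.IGNORECASE)
    return utterance

print(mask_character_names("Elizabeth, said Darcy, you must listen.",
                           ["Elizabeth", "Darcy"]))
# "_, said _, you must listen."
```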

Dataset V4.0

  • #28
  • #18
  • #27
  • #26
  • Switch to 5 folding parts instead of 10
  • #25
  • #29
  • #30
  • #35
  • Fixed to Persona: none in the case when persona is absent (df6bfd0)
  • #36
  • oversampled candidates are the same [bug] (04b18fa)
  • fixed bug with absence of the newline after a set of oversampled examples (c8f400b)
  • spectrums are not randomly selected (41addd9)

Aloha Users Factorization

We mainly need to follow the code of the ALOHA paper related to negative candidate response selection:

  1. This is the code for selecting candidates:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/core.py#L139-L186
  2. It requires the pre-trained matrix factorization:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/matrix_factorization.py#L32
    MatrixWrapper(hla_d, col1='char_id', col2='feature')
  3. All the related code is wrapped into the write-dataset method:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/core.py#L205C9-L205C14

They use the implicit Python library for model training.

For the factorization we need the dataset of features.

Here is the link where the related dataset for the ALOHA paper can be found:
https://cs.uwaterloo.ca/~jhoey/research/aloha/

This is how their features-related dataset looks (screenshot attached in the original issue).

How to retrieve the factors:

model.user_factors
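A minimal sketch, assuming the implicit library, of how a character-by-feature matrix could be factorized and the factors read back; the hla_d table layout mirrors the MatrixWrapper call above, while the example rows and hyper-parameters are placeholders:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Hypothetical HLA feature table: one row per (char_id, feature) pair.
hla_d = pd.DataFrame({
    "char_id": ["c1", "c1", "c2", "c2", "c3"],
    "feature": ["brave", "kind", "brave", "greedy", "kind"],
})

# Build a characters x features interaction matrix
# (with recent implicit versions, rows are "users", i.e. characters).
chars = hla_d["char_id"].astype("category")
feats = hla_d["feature"].astype("category")
matrix = csr_matrix(
    ([1.0] * len(hla_d),
     (chars.cat.codes.to_numpy(), feats.cat.codes.to_numpy())),
    shape=(len(chars.cat.categories), len(feats.cat.categories)),
)

# Train the ALS factorization (hyper-parameters are placeholders).
model = AlternatingLeastSquares(factors=16, iterations=15)
model.fit(matrix)

# One dense factor vector per character.
char_vectors = model.user_factors
print(char_vectors.shape)  # (num_characters, factors)
```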

:memo: Documentation -- Lexicon construction for utterances annotation (Prefix Analysis)

This is the main algorithm:

Prefix annotation

We analyse text comments for the presence of speakers and their relation to the preceding utterance. In particular, in this work we focus on the prefixes between (1) the utterance and (2) the speaker mention in the comment. These prefixes are the subject of analysis that results in lexicon construction, and the resulting lexicon is necessary for utterance annotation.

To reveal the connection between a character and an utterance in a comment, we follow the assumption of a relatively short distance between them (the prefix).
In fact, only comments with prefixes of length 1, 2, or 3 terms were considered.

Given the whole set of books $B$, the first step is to collect all dialogs $d_i$ for every book $i \in B$. The result is a set $D$, where $d_i \in D$ for each $i \in B$.

In the next step we analyse the filtered comments. We follow the assumption that authors of different books use similar terms when commenting on the related character.
Therefore, the goal of this step is to reveal the most representative terms in comments.

To provide the related assessments, the tf-idf measure was chosen.

In particular, we calculate tf-idf for every term (tfa is considered).
Finally, we use:

  1. p_threshold to filter the most representative terms for the lexicon (a sketch is given after the corner-case rules below).

https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L8-L49

  1. Trigrams ending with to and at (utterances directed to another character).
  2. Rule 2 actually becomes an exception in the case when such prefixes are the only terms between the utterance and the character.
    Corner cases filtering:
    https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L52-L65
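A minimal sketch of the lexicon construction step under the assumptions above; the per-book prefix lists, the p_threshold value, and the use of scikit-learn's TfidfVectorizer are illustrative, while the real logic lives in ceb_books_2_utterance_prefixes.py:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical per-book collections of comment prefixes (1-3 terms each),
# gathered between an utterance and the speaker mention.
book_prefixes = {
    "book_1": ["said", "replied to", "whispered at", "said"],
    "book_2": ["said", "shouted to", "murmured", "replied to"],
}

p_threshold = 0.3  # illustrative cut-off on the tf-idf score

# One "document" per book: all of its prefixes joined together.
corpus = [" ".join(prefixes) for prefixes in book_prefixes.values()]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Keep terms whose best tf-idf score across books exceeds the threshold.
scores = tfidf.max(axis=0).toarray().ravel()
lexicon = {term for term, score in zip(vectorizer.get_feature_names_out(), scores)
           if score >= p_threshold}
print(sorted(lexicon))
```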

Matching Algorithm (Establishing Connection)

We consider:

  • characters mentioned straight after the utterance in the comment (absence of a prefix)
  • pattern-matching of the related prefix, using the constructed lexicon (see the sketch below)

https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/speaker_annotation.py#L7-L30
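A minimal sketch of the two matching cases above; the prefix lexicon and the tokenized comment are hypothetical, and the actual implementation is in core/speaker_annotation.py:

```python
# Hypothetical lexicon built at the previous step.
PREFIX_LEXICON = {"said", "replied", "whispered", "answered"}

def match_speaker(comment_tokens, known_characters, max_prefix_len=3):
    """Return the character a comment attributes the preceding utterance to.

    Case 1: a character is mentioned straight after the utterance (no prefix).
    Case 2: the character is preceded by a short prefix found in the lexicon.
    """
    if comment_tokens and comment_tokens[0] in known_characters:
        return comment_tokens[0]
    for i, token in enumerate(comment_tokens[:max_prefix_len + 1]):
        if token in known_characters:
            prefix = comment_tokens[:i]
            if prefix and all(t in PREFIX_LEXICON for t in prefix):
                return token
    return None

print(match_speaker(["said", "Darcy", "with", "a", "frown"],
                    {"Darcy", "Elizabeth"}))
# Darcy
```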

Conclusion

This part might become a source of further optimization and improvements, e.g. the application of an LLM.
It could actually be fact-checked with an LLM.

:memo: Documentation -- source of spectrums for annotation

We consider a third-party resource that contains crowd-sourced personality traits of 800 fictional characters from numerous fictional works. Each character is scored on 264 adjective-pair spectrums, such as "trusting / suspicious" or "luddite / technophile".

https://github.com/tacookson/data/blob/master/fictional-character-personalities/personalities.txt

We consider the spectrum_low and spectrum_high parameters from these values (a loading sketch is given after the list below).

There are two context extraction options:

  1. Paragraph based (N = 1)
    https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/book/utils.py#L10-L47

  2. Comments-based (text parts in between segments of the dialog utterances)
    https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/utils_comments.py#L5-L19
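A minimal loading sketch for this resource, assuming the file is tab-separated and that spectrum_low / spectrum_high are columns of the table (adjust if the actual layout differs):

```python
import pandas as pd

URL = ("https://raw.githubusercontent.com/tacookson/data/master/"
       "fictional-character-personalities/personalities.txt")

# The file appears to be tab-separated; change `sep` if it differs.
df = pd.read_csv(URL, sep="\t")

# Each row scores one character on one adjective-pair spectrum;
# the poles are given by spectrum_low / spectrum_high.
spectrum_pairs = df[["spectrum_low", "spectrum_high"]].drop_duplicates()
print(len(spectrum_pairs))   # expected: 264 adjective-pair spectrums
print(spectrum_pairs.head())
```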

Dataset V2

  • speaker utterances to 100
  • folding parts to 10
  • #8 (not in this version)
  • utterances in replies of a single speaker (should not create an imbalance); consider a limit of 100

Dataset V3

  • #8
  • Set the number of traits to 8
  • Use none for the default version and filled traits for the spectrum version.
  • Use persona: instead of partner's persona
  • Use 20 candidates instead of 6
  • Filter queries/responses by a minimum number of words (in order to avoid I don't know, etc.)
  • Added candidate shuffling
  • Trait shuffling (to prevent models from overfitting to a specific order)

Enhance spectrum assignment and selection for book characters

This could be treated as a task. For feature generation, there is a need to provide the number of the related spectrums. By relying on this count, it is possible to determine appearance frequencies in addition to the normalized values we currently have for features. A sketch is given after the list below.

  • saving the related data
  • using norm and non-norm data for sorting
  • using min-border filtering (to remove rarely appearing traits) [AUTO border mode]
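A minimal sketch of the proposed counting plus min-border filtering; the function name, the AUTO border heuristic, and the example values are illustrative:

```python
def filter_spectrums(spectrum_counts, min_border=None):
    """Keep spectrums whose appearance count reaches the border.

    If min_border is None ("AUTO" border mode), derive the border from the
    data, e.g. as a fraction of the most frequent spectrum.
    """
    if min_border is None:
        min_border = max(spectrum_counts.values()) * 0.1
    total = sum(spectrum_counts.values())
    return {s: (c, c / total)                # (raw count, normalized value)
            for s, c in spectrum_counts.items() if c >= min_border}

print(filter_spectrums({"trusting/suspicious": 42, "luddite/technophile": 1}))
# {'trusting/suspicious': (42, 0.976...)}
```

Keeping both the raw count and the normalized value allows sorting either by norm or non-norm data, as requested above.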

Why do we have fewer dialogs per speaker than the expected 100? (v. 3.1)

#19 related.
Here is the log information (screenshot attached in the original issue).

Source of the problem:
It happens because we first select the speakers based on the number of utterances related to them:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L10-L12

And then we filter some of these utterances at the dataset writing stage (a minimal illustration follows):
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L28-L30
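A minimal illustration of why the final count drops below 100: speakers are selected against the raw utterance counts, while short utterances are removed only later, at writing time. All numbers, names, and thresholds here are hypothetical:

```python
def select_speakers(utterances_per_speaker, limit=100):
    # Step 1: speakers are selected by their *raw* utterance counts.
    return {s for s, utts in utterances_per_speaker.items() if len(utts) >= limit}

def write_dataset(utterances_per_speaker, selected, min_words=5):
    # Step 2: short utterances are filtered only at writing time,
    # so a selected speaker may end up with fewer than `limit` utterances.
    return {s: [u for u in utterances_per_speaker[s] if len(u.split()) >= min_words]
            for s in selected}

data = {"alice": ["I don't know"] * 30 + ["a longer reply with enough words"] * 80}
kept = write_dataset(data, select_speakers(data))
print(len(kept["alice"]))  # 80, not the expected 100
```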

API polishing

  • iter_paragraphs_with_n_speakers -- recognition of terms as speakers should be a function that is provided to the core as a parameter (see the sketch below)
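A minimal sketch of the suggested refactoring, where the speaker-recognition logic is passed into the iterator as a callable; everything except the function name iter_paragraphs_with_n_speakers is illustrative:

```python
from typing import Callable, Iterable, List

def iter_paragraphs_with_n_speakers(
        paragraphs: Iterable[List[str]],
        n: int,
        is_speaker: Callable[[str], bool]):
    """Yield paragraphs mentioning at least `n` distinct speakers.

    `is_speaker` is supplied by the caller, so the core stays agnostic
    about how speaker terms are recognized."""
    for tokens in paragraphs:
        speakers = {t for t in tokens if is_speaker(t)}
        if len(speakers) >= n:
            yield tokens

# Example: recognize speakers by a known-name set provided by the caller.
names = {"Darcy", "Elizabeth"}
paras = [["Darcy", "spoke", "to", "Elizabeth"], ["nobody", "spoke"]]
print(list(iter_paragraphs_with_n_speakers(paras, 2, lambda t: t in names)))
# [['Darcy', 'spoke', 'to', 'Elizabeth']]
```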
