nicolay-r / deep-book-processing
Workflow for literature character personality profiling, which relies solely on book content.
License: MIT License
#13 related
Reason: we would like to exclude those who initiate dialog.
Problem: at present, we use exactly the same utterance-selection strategy as in the version without HLA.
Reason: some utterances are about interactions with others; we need to experiment with excluding them.
This is related to: #34
#19 related.
Here is the log information:
Source of the problem:
It happens because we first select the speakers by relying on the number of utterances related to them:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L10-L12
And then filter some of these utterances at the dataset-writing stage:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L28-L30
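The mismatch above can be sketched as follows: speaker selection relies on raw utterance counts, while a later filter at writing time shrinks the utterance lists, so the effective counts no longer match the ones used for selection (speaker names and the filtering rule here are hypothetical, not the repository's exact code):

```python
from collections import Counter

# Hypothetical utterance records: (speaker, text).
utterances = [
    ("alice", "Hello there."),
    ("alice", "Fine, thanks."),
    ("alice", "OK"),            # too short: dropped later at writing stage
    ("bob", "Good morning."),
    ("bob", "Hmm"),             # too short: dropped later at writing stage
]

# Stage 1: pick speakers by the RAW number of utterances.
counts = Counter(s for s, _ in utterances)
selected = {s for s, n in counts.items() if n >= 2}

# Stage 2 (dataset writing): an extra filter is applied AFTER selection,
# so the counts that reach the dataset differ from the selection counts.
written = [(s, t) for s, t in utterances
           if s in selected and len(t.split()) >= 2]
effective = Counter(s for s, _ in written)

print(counts)     # counts used for selection
print(effective)  # counts that actually end up in the dataset
```

Moving the filter before the selection step would make the two counts agree.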
Main branch: https://github.com/nicolay-r/chatbot_experiments/tree/dataset-v3.1
neg for spectrum-based version #15
neg for the spectrum version (requires sentence-embedding application) (done in c51d66f)
Explanation: it is a bug, because I did not expect to see IDs in utterances; removing them prevents the related overfitting.
However, it would be great not to consider the names of others either!
So, the solution is to completely mask names.
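A minimal sketch of such masking, assuming we already hold the list of character names for the book (the names and the placeholder token are illustrative):

```python
import re

def mask_names(utterance: str, names: list[str], token: str = "[NAME]") -> str:
    """Replace every known character name with a placeholder token."""
    for name in names:
        # \b keeps us from masking substrings inside longer words.
        utterance = re.sub(rf"\b{re.escape(name)}\b", token, utterance,
                           flags=re.IGNORECASE)
    return utterance

print(mask_names("Elizabeth told Darcy the truth.", ["Elizabeth", "Darcy"]))
# -> [NAME] told [NAME] the truth.
```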
#12 related
Try with book 1399.txt
This could be treated as a task. For feature generation, there is a need to provide the number of related spectrums. By relying on that amount, it is possible to determine appearance frequencies in addition to the normalized values we currently have for features.
This and the related parameters should guarantee that speakers are separated into words / terms:
https://github.com/nicolay-r/chatbot_experiments/blob/55bdb9df98018c06b7c4bbf614e07a97c14a31dd/test/my_s4_1_char_spectrum_cat_paragraphs.py#L25-L27
Fix proposal: 9f10e29
Move text_source.py from the core to the e_pairs experiment.
Mainly need to follow the code of the ALOHA paper related to negative candidate-response selection:
MatrixWrapper(hla_d, col1='char_id', col2='feature')
write dataset method: they use the implicit Python library for model training.
For the factorization we need the dataset of features.
Here is the link to the related dataset for the ALOHA paper:
https://cs.uwaterloo.ca/~jhoey/research/aloha/
This is how their features-related dataset looks:
How to retrieve the factors:
model.user_factors
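The notes above mention the implicit library, whose ALS model exposes the per-character factors as model.user_factors. As a dependency-free sketch of the same idea, the character-by-feature matrix can be factorized with a truncated SVD and the left factors read off analogously (the matrix values here are made up for illustration):

```python
import numpy as np

# Hypothetical char_id x feature count matrix (3 characters, 4 HLA features).
hla = np.array([
    [5.0, 0.0, 1.0, 0.0],
    [4.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
])

# Rank-2 factorization via truncated SVD; implicit's AlternatingLeastSquares
# plays the same role and exposes the left factors as model.user_factors.
U, s, Vt = np.linalg.svd(hla, full_matrices=False)
k = 2
char_factors = U[:, :k] * s[:k]   # analogue of model.user_factors
feat_factors = Vt[:k].T           # analogue of model.item_factors

# The product of the two factor matrices approximates the original matrix.
approx = char_factors @ feat_factors.T
print(char_factors.shape)  # -> (3, 2)
```

Each row of char_factors is then a dense personality-feature embedding for one character.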
Persona: none in the case when the persona is absent (df6bfd0)
Reason: already out-of-date and not supported.
Problem: we do not clean the list of speakers.
#14 related
We need to provide oversampling with features and take parts of the utterances in dialogs.
#34 related
Explanation: not only from the same book (version 3.0).
We consider a third-party resource that contains crowd-sourced personality traits of 800 fictional characters from numerous fictional works. Each character is scored on 264 adjective-pair spectrums, such as "trusting / suspicious" or "luddite / technophile".
https://github.com/tacookson/data/blob/master/fictional-character-personalities/personalities.txt
We consider the spectrum_low and spectrum_high parameters from these values.
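A minimal sketch of deriving the two pole names from an adjective-pair spectrum string; the tab-separated sample below is illustrative and is not the exact column layout of personalities.txt:

```python
import csv
import io

def split_spectrum(spectrum: str) -> tuple[str, str]:
    """Split an adjective-pair spectrum into its (low, high) poles."""
    low, high = (p.strip() for p in spectrum.split("/"))
    return low, high

# Illustrative tab-separated sample; the real file has more columns
# (this layout is an assumption, not the dataset's actual schema).
raw = "character\tspectrum\nHermione\ttrusting / suspicious\n"
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    print(row["character"], split_spectrum(row["spectrum"]))
```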
There are two of them:
Paragraph based (N = 1)
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/book/utils.py#L10-L47
Comments-based (text part in between segments of the dialog utterances)
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/utils_comments.py#L5-L19
List of approaches to the modeling of personalities:
iter_paragraphs_with_n_speakers
-- recognition of terms as speakers should be a function that is provided into the core as a parameter.
This is the main algorithm:
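A sketch of that design: the iterator takes the speaker-recognition predicate as a parameter instead of hard-coding it, so the core stays agnostic about how a term is recognized as a speaker (the signature and the toy data are illustrative, not the repository's exact API):

```python
from typing import Callable, Iterator

def iter_paragraphs_with_n_speakers(
        paragraphs: list[str],
        n: int,
        is_speaker: Callable[[str], bool]) -> Iterator[str]:
    """Yield paragraphs mentioning exactly n recognized speaker terms.

    `is_speaker` is injected by the caller, keeping the core independent
    of any particular speaker-recognition strategy.
    """
    for p in paragraphs:
        speakers = {t for t in p.split() if is_speaker(t)}
        if len(speakers) == n:
            yield p

known = {"Alice", "Bob"}
paras = ["Alice spoke first.", "Alice answered Bob.", "Nobody replied."]
result = list(iter_paragraphs_with_n_speakers(
    paras, 2, lambda t: t.strip(".,") in known))
print(result)  # -> ['Alice answered Bob.']
```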
We consider analysis of text comments for the presence of speakers and their relation to the preceding utterance. In particular, in this work we aimed at the prefixes in between: 1. the utterance, 2. the speaker in the comment. Prefixes represent the subject of analysis that results in lexicon construction. The resulting lexicon is necessary for utterance annotation.
To reveal the connection between a character and an utterance in a comment, we follow the assumption of a relatively short distance in between (the prefix).
In fact, comments with prefixes of length 1, 2 and 3 terms were considered.
Given the whole set of books, at the next step we analyse the filtered comments. We follow the assumption that authors of different books use similar terms in commentary on the related character.
Therefore, the goal of this step is to reveal the most representative terms in comments.
To provide the related assessments, the tf-idf measure was chosen.
In particular, we calculate tf-idf for every term (tfa is considered).
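Under the assumption that each book's comments form one document, the tf-idf scoring of prefix terms can be sketched as follows (the term lists are made up for illustration):

```python
import math
from collections import Counter

# Hypothetical prefix terms extracted from comments, one list per book.
books = [
    ["said", "said", "replied", "cried"],
    ["said", "whispered", "replied"],
    ["said", "shouted"],
]

n_docs = len(books)
df = Counter()                 # document frequency of each term
for terms in books:
    df.update(set(terms))

def tf_idf(term: str, terms: list[str]) -> float:
    """Standard tf-idf: term frequency times inverse document frequency."""
    tf = terms.count(term) / len(terms)
    idf = math.log(n_docs / df[term])
    return tf * idf

scores = {t: tf_idf(t, books[0]) for t in set(books[0])}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # -> ['cried', 'replied', 'said']
```

Terms appearing in every book (like "said" here) score zero and drop out; a probability or score threshold would then keep only the top-ranked terms for the lexicon.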
Finally, we use the p_threshold to filter the most representative terms for the lexicon.
to and at (utterances directed to another character). We consider:
This part might become a source of further optimization and improvements, e.g. application of an LLM.
It could actually be fact-checked with an LLM.
There is a way to experiment straightaway in ParlAI:
https://parl.ai/docs/agent_refs/hugging_face.html#gpt2
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
    --add-start-token True --gpt2-size medium -t gutenbergbookchars -bs 6 \
    -mf parlai/gpt-2 -veps 0.5 --tensorboard_log True
none for the default version and filled traits for the spectrum version.
persona: instead of the partner's persona (I don't know, etc.)