
nicolay-r / deep-book-processing

Toolkit aimed at Character Personality Extraction from literary novels, with experiments organized in separate folders

License: MIT License

Python 99.65% Shell 0.35%
books dialogs parlai project-gutenberg sampling speaker-recognition dialogue-systems personality-traits pipeline ranking-system

deep-book-processing's Introduction

Hi, I'm Nicolay! 👋

  • See my personal website on GitHub for more information about me
  • I combine it with track-and-field 🏃‍♂️, ⛷️ and 🌊🏄‍♂️


deep-book-processing's People

Stargazers


Watchers


Forkers

ellyok

deep-book-processing's Issues

List of Personalization Schemes :exclamation:

A list of approaches to modeling personalities:

  • SPECTRUM
    • Occurrence of whole adjective words (no need to subtract the rich/poor poles).
      -- SPECTRUM bag-of-words (a sketch is given after this list)
    • What has been done so far:
      • rich -- embedding.
      • poor -- embedding.
        -- how many words occur in the context of the character.
        -- SPECTRUM embedding
  • HLA Dictionary
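A minimal sketch of the SPECTRUM bag-of-words variant described above, assuming hypothetical adjective lists per pole and a tokenized character context (the pole word lists and example sentence are illustrative):

```python
from collections import Counter

# Hypothetical spectrum poles (adjectives per pole); the real lists come
# from the spectrum dictionary used in the project.
SPECTRUM_POLES = {
    "rich": {"wealthy", "rich", "affluent", "luxurious"},
    "poor": {"poor", "ragged", "destitute", "humble"},
}

def spectrum_bag_of_words(context_tokens):
    """Count, for every pole, how many of its adjectives occur in the
    character's context (no subtraction between opposite poles)."""
    counts = Counter(t.lower() for t in context_tokens)
    return {pole: sum(counts[w] for w in words)
            for pole, words in SPECTRUM_POLES.items()}

print(spectrum_bag_of_words("the rich and affluent merchant met a ragged boy".split()))
# {'rich': 2, 'poor': 1}
```

The SPECTRUM embedding variant would presumably replace exact word matches with embedding similarity between the pole adjectives and the words in the character's context.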

Use masked `_` character names instead of their IDs.

Explanation: this is a bug, since IDs were not expected to appear in utterances, and leaving them there allows the model to overfit to them.
It would also be better not to expose the names of other characters.
So the solution is to completely mask the names (see the sketch below).
#12 related
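A minimal sketch of the proposed masking, assuming utterances are plain strings and the set of character names/IDs per book is known (the names and the helper below are illustrative):

```python
import re

MASK = "_"

def mask_character_names(utterance, character_names):
    """Replace every known character name (or ID) in the utterance
    with the mask token, so models cannot overfit to the names."""
    # Longest names first, so "Elizabeth Bennet" is masked before "Elizabeth".
    for name in sorted(character_names, key=len, reverse=True):
        utterance = re.sub(r"\b" + re.escape(name) + r"\b", MASK, utterance,
                           flags=re.IGNORECASE)
    return utterance

print(mask_character_names("Elizabeth, said Darcy, you must listen.",
                           ["Elizabeth", "Darcy"]))
# "_, said _, you must listen."
```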

Dataset V4.0

  • #28
  • #18
  • #27
  • #26
  • Switch to 5 folding parts instead of 10
  • #25
  • #29
  • #30
  • #35
  • Fixed to Persona: none in the case when persona is absent (df6bfd0)
  • #36
  • oversampled candidates are the same [bug] (04b18fa)
  • fixed bug with absence of the newline after a set of oversampled examples (c8f400b)
  • spectrums are not randomly selected (41addd9)

Aloha Users Factorization

We mainly need to follow the code of the ALOHA paper related to negative candidate response selection:

  1. This is the code for selecting candidates:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/core.py#L139-L186
  2. It requires the pre-trained matrix factorization:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/matrix_factorization.py#L32
    MatrixWrapper(hla_d, col1='char_id', col2='feature')
  3. All the related code is wrapped into the write-dataset method:
    https://github.com/newpro/aloha-chatbot/blob/c516e717c2af6fceff86465a44705ed8ff044387/src/core.py#L205C9-L205C14

They use the implicit Python library for model training.

For the factorization we need the dataset of features.

Here is the link where the related dataset for the ALOHA paper can be found:
https://cs.uwaterloo.ca/~jhoey/research/aloha/

This is how their features-related dataset looks (screenshot attached in the original issue).

How to retrieve the factors:

model.user_factors
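A minimal sketch, assuming the implicit library, of how a character-by-feature matrix could be factorized and the factors read back; the hla_d table layout mirrors the MatrixWrapper call above, while the example rows and hyper-parameters are placeholders:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Hypothetical HLA feature table: one row per (char_id, feature) pair.
hla_d = pd.DataFrame({
    "char_id": ["c1", "c1", "c2", "c2", "c3"],
    "feature": ["brave", "kind", "brave", "greedy", "kind"],
})

# Build a characters x features interaction matrix
# (with recent implicit versions, rows are "users", i.e. characters).
chars = hla_d["char_id"].astype("category")
feats = hla_d["feature"].astype("category")
matrix = csr_matrix(
    ([1.0] * len(hla_d),
     (chars.cat.codes.to_numpy(), feats.cat.codes.to_numpy())),
    shape=(len(chars.cat.categories), len(feats.cat.categories)),
)

# Train the ALS factorization (hyper-parameters are placeholders).
model = AlternatingLeastSquares(factors=16, iterations=15)
model.fit(matrix)

# One dense factor vector per character.
char_vectors = model.user_factors
print(char_vectors.shape)  # (num_characters, factors)
```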

:memo: Documentation -- Lexicon construction for utterances annotation (Prefix Analysis)

This is the main algorithm:

Prefix annotation

We analyse text comments for the presence of speakers and their relation to the preceding utterance. In particular, in this work we focus on the prefixes between (1) the utterance and (2) the speaker mention in the comment. These prefixes are the subject of analysis that results in lexicon construction, and the resulting lexicon is necessary for utterance annotation.

To reveal the connection between a character and an utterance in a comment, we follow the assumption of a relatively short distance between them (the prefix).
In fact, only comments with prefixes of length 1, 2, or 3 terms were considered.

Given the whole set of books $B$, the first step is to collect all dialogs $d_i$ for every book $i \in B$. The result is a set $D$, where $d_i \in D$ for each $i \in B$.

In the next step we analyse the filtered comments. We follow the assumption that authors of different books use similar terms when commenting on the related character.
Therefore, the goal of this step is to reveal the most representative terms in comments.

To provide the related assessments, the tf-idf measure was chosen.

In particular, we calculate tf-idf for every term (tfa is considered).
Finally, we use:

  1. p_threshold to filter the most representative terms for the lexicon (a sketch is given after the corner-case rules below).

https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L8-L49

  1. Trigrams ending with to and at (utterances directed to another character).
  2. Rule 2 actually becomes an exception in the case when such prefixes are the only terms between the utterance and the character.
    Corner cases filtering:
    https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L52-L65
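A minimal sketch of the lexicon construction step under the assumptions above; the per-book prefix lists, the p_threshold value, and the use of scikit-learn's TfidfVectorizer are illustrative, while the real logic lives in ceb_books_2_utterance_prefixes.py:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical per-book collections of comment prefixes (1-3 terms each),
# gathered between an utterance and the speaker mention.
book_prefixes = {
    "book_1": ["said", "replied to", "whispered at", "said"],
    "book_2": ["said", "shouted to", "murmured", "replied to"],
}

p_threshold = 0.3  # illustrative cut-off on the tf-idf score

# One "document" per book: all of its prefixes joined together.
corpus = [" ".join(prefixes) for prefixes in book_prefixes.values()]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Keep terms whose best tf-idf score across books exceeds the threshold.
scores = tfidf.max(axis=0).toarray().ravel()
lexicon = {term for term, score in zip(vectorizer.get_feature_names_out(), scores)
           if score >= p_threshold}
print(sorted(lexicon))
```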

Matching Algorithm (Establishing Connection)

We consider:

  • characters mentioned straight after the utterance in the comment (absence of a prefix)
  • pattern-matching of the related prefix, using the constructed lexicon (see the sketch below)

https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/speaker_annotation.py#L7-L30
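A minimal sketch of the two matching cases above; the prefix lexicon and the tokenized comment are hypothetical, and the actual implementation is in core/speaker_annotation.py:

```python
# Hypothetical lexicon built at the previous step.
PREFIX_LEXICON = {"said", "replied", "whispered", "answered"}

def match_speaker(comment_tokens, known_characters, max_prefix_len=3):
    """Return the character a comment attributes the preceding utterance to.

    Case 1: a character is mentioned straight after the utterance (no prefix).
    Case 2: the character is preceded by a short prefix found in the lexicon.
    """
    if comment_tokens and comment_tokens[0] in known_characters:
        return comment_tokens[0]
    for i, token in enumerate(comment_tokens[:max_prefix_len + 1]):
        if token in known_characters:
            prefix = comment_tokens[:i]
            if prefix and all(t in PREFIX_LEXICON for t in prefix):
                return token
    return None

print(match_speaker(["said", "Darcy", "with", "a", "frown"],
                    {"Darcy", "Elizabeth"}))
# Darcy
```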

Conclusion

This part might become a source of further optimization and improvements, e.g. the application of an LLM.
It could actually be fact-checked with an LLM.

:memo: Documentation -- source of spectrums for annotation

We consider a third-party resource that contains crowd-sourced personality traits of 800 fictional characters from numerous fictional works. Each character is scored on 264 adjective-pair spectrums, such as "trusting / suspicious" or "luddite / technophile".

https://github.com/tacookson/data/blob/master/fictional-character-personalities/personalities.txt

We consider the spectrum_low and spectrum_high parameters from these values (a loading sketch is given after the list below).

There are two context extraction options:

  1. Paragraph based (N = 1)
    https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/book/utils.py#L10-L47

  2. Comments-based (text parts in between segments of the dialog utterances)
    https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/utils_comments.py#L5-L19
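A minimal loading sketch for this resource, assuming the file is tab-separated and that spectrum_low / spectrum_high are columns of the table (adjust if the actual layout differs):

```python
import pandas as pd

URL = ("https://raw.githubusercontent.com/tacookson/data/master/"
       "fictional-character-personalities/personalities.txt")

# The file appears to be tab-separated; change `sep` if it differs.
df = pd.read_csv(URL, sep="\t")

# Each row scores one character on one adjective-pair spectrum;
# the poles are given by spectrum_low / spectrum_high.
spectrum_pairs = df[["spectrum_low", "spectrum_high"]].drop_duplicates()
print(len(spectrum_pairs))   # expected: 264 adjective-pair spectrums
print(spectrum_pairs.head())
```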

Dataset V2

  • speaker utterances to 100
  • folding parts to 10
  • #8 (not in this version)
  • utterances in replies of a single speaker (should not create an imbalance); consider a limit of 100

Dataset V3

  • #8
  • Set the number of traits to 8
  • Use none for the default version and filled traits for the spectrum version.
  • Use persona: instead of partner's persona
  • Use 20 candidates instead of 6
  • Filter queries/responses by a minimum number of words (in order to avoid I don't know, etc.)
  • Added candidate shuffling
  • Trait shuffling (to prevent models from overfitting to a specific order)

Enhance spectrum assignment and selection for book characters

This could be treated as a task. For feature generation, there is a need to provide the number of the related spectrums. By relying on this count, it is possible to determine appearance frequencies in addition to the normalized values we currently have for features. A sketch is given after the list below.

  • saving the related data
  • using norm and non-norm data for sorting
  • using min-border filtering (to remove rarely appearing traits) [AUTO border mode]
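A minimal sketch of the proposed counting plus min-border filtering; the function name, the AUTO border heuristic, and the example values are illustrative:

```python
def filter_spectrums(spectrum_counts, min_border=None):
    """Keep spectrums whose appearance count reaches the border.

    If min_border is None ("AUTO" border mode), derive the border from the
    data, e.g. as a fraction of the most frequent spectrum.
    """
    if min_border is None:
        min_border = max(spectrum_counts.values()) * 0.1
    total = sum(spectrum_counts.values())
    return {s: (c, c / total)                # (raw count, normalized value)
            for s, c in spectrum_counts.items() if c >= min_border}

print(filter_spectrums({"trusting/suspicious": 42, "luddite/technophile": 1}))
# {'trusting/suspicious': (42, 0.976...)}
```

Keeping both the raw count and the normalized value allows sorting either by norm or non-norm data, as requested above.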

Why do we have fewer dialogs per speaker than the expected 100? (v. 3.1)

#19 related.
Here is the log information (screenshot attached in the original issue).

Source of the problem:
It happens because we first select the speakers based on the number of utterances related to them:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L10-L12

And then we filter some of these utterances at the dataset writing stage (a minimal illustration follows):
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L28-L30
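A minimal illustration of why the final count drops below 100: speakers are selected against the raw utterance counts, while short utterances are removed only later, at writing time. All numbers, names, and thresholds here are hypothetical:

```python
def select_speakers(utterances_per_speaker, limit=100):
    # Step 1: speakers are selected by their *raw* utterance counts.
    return {s for s, utts in utterances_per_speaker.items() if len(utts) >= limit}

def write_dataset(utterances_per_speaker, selected, min_words=5):
    # Step 2: short utterances are filtered only at writing time,
    # so a selected speaker may end up with fewer than `limit` utterances.
    return {s: [u for u in utterances_per_speaker[s] if len(u.split()) >= min_words]
            for s in selected}

data = {"alice": ["I don't know"] * 30 + ["a longer reply with enough words"] * 80}
kept = write_dataset(data, select_speakers(data))
print(len(kept["alice"]))  # 80, not the expected 100
```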

API polishing

  • iter_paragraphs_with_n_speakers -- recognition of terms as speakers should be a function that is provided to the core as a parameter (see the sketch below)
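A minimal sketch of the suggested refactoring, where the speaker-recognition logic is passed into the iterator as a callable; everything except the function name iter_paragraphs_with_n_speakers is illustrative:

```python
from typing import Callable, Iterable, List

def iter_paragraphs_with_n_speakers(
        paragraphs: Iterable[List[str]],
        n: int,
        is_speaker: Callable[[str], bool]):
    """Yield paragraphs mentioning at least `n` distinct speakers.

    `is_speaker` is supplied by the caller, so the core stays agnostic
    about how speaker terms are recognized."""
    for tokens in paragraphs:
        speakers = {t for t in tokens if is_speaker(t)}
        if len(speakers) >= n:
            yield tokens

# Example: recognize speakers by a known-name set provided by the caller.
names = {"Darcy", "Elizabeth"}
paras = [["Darcy", "spoke", "to", "Elizabeth"], ["nobody", "spoke"]]
print(list(iter_paragraphs_with_n_speakers(paras, 2, lambda t: t in names)))
# [['Darcy', 'spoke', 'to', 'Elizabeth']]
```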
