nicolay-r / deep-book-processing
Workflow for literature character personality profiling, which relies solely on book content.
License: MIT License
#13 related
Reason: we would like to exclude those who initiate dialog.
Problem: at present, we use exactly the same utterance-selection strategy as in the version without HLA.
Reason: some utterances are about interactions with others; we need to experiment with excluding them.
This is related to: #34
#19 related.
Here is the log information:
Source of the problem:
It happens because we first select the speakers by relying on the number of utterances related to them:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L10-L12
And then filter some of these utterances at the dataset-writing stage:
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/my_s3_dataset_0_create.py#L28-L30
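The mismatch above can be sketched as follows: speaker selection relies on raw utterance counts, while a later filter at writing time shrinks the utterance lists, so the effective counts no longer match the ones used for selection (speaker names and the filtering rule here are hypothetical, not the repository's exact code):

```python
from collections import Counter

# Hypothetical utterance records: (speaker, text).
utterances = [
    ("alice", "Hello there."),
    ("alice", "Fine, thanks."),
    ("alice", "OK"),            # too short: dropped later at writing stage
    ("bob", "Good morning."),
    ("bob", "Hmm"),             # too short: dropped later at writing stage
]

# Stage 1: pick speakers by the RAW number of utterances.
counts = Counter(s for s, _ in utterances)
selected = {s for s, n in counts.items() if n >= 2}

# Stage 2 (dataset writing): an extra filter is applied AFTER selection,
# so the counts that reach the dataset differ from the selection counts.
written = [(s, t) for s, t in utterances
           if s in selected and len(t.split()) >= 2]
effective = Counter(s for s, _ in written)

print(counts)     # counts used for selection
print(effective)  # counts that actually end up in the dataset
```

Moving the filter before the selection step would make the two counts agree.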
Main branch: https://github.com/nicolay-r/chatbot_experiments/tree/dataset-v3.1
neg for spectrum-based version #15
neg for the spectrum version (requires sentence-embedding application) (done in c51d66f)
Explanation: it is a bug, because I did not expect to see IDs in utterances; removing them prevents the related overfitting.
However, it would be great not to consider the names of others either!
So, the solution is to completely mask names.
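A minimal sketch of such masking, assuming we already hold the list of character names for the book (the names and the placeholder token are illustrative):

```python
import re

def mask_names(utterance: str, names: list[str], token: str = "[NAME]") -> str:
    """Replace every known character name with a placeholder token."""
    for name in names:
        # \b keeps us from masking substrings inside longer words.
        utterance = re.sub(rf"\b{re.escape(name)}\b", token, utterance,
                           flags=re.IGNORECASE)
    return utterance

print(mask_names("Elizabeth told Darcy the truth.", ["Elizabeth", "Darcy"]))
# -> [NAME] told [NAME] the truth.
```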
#12 related
Try with book 1399.txt
This could be treated as a task. For feature generation, there is a need to provide the number of related spectrums. By relying on that amount, it is possible to determine appearance frequencies in addition to the normalized values we currently have for features.
This and the related parameters should guarantee that speakers are separated into words / terms:
https://github.com/nicolay-r/chatbot_experiments/blob/55bdb9df98018c06b7c4bbf614e07a97c14a31dd/test/my_s4_1_char_spectrum_cat_paragraphs.py#L25-L27
Fix proposal: 9f10e29
Move text_source.py from the core to the e_pairs experiment.
Mainly need to follow the code of the ALOHA paper related to negative candidate-response selection:
MatrixWrapper(hla_d, col1='char_id', col2='feature')
write dataset method: they use the implicit Python library for model training.
For the factorization we need the dataset of features.
Here is the link to the related dataset for the ALOHA paper:
https://cs.uwaterloo.ca/~jhoey/research/aloha/
This is how their features-related dataset looks:
How to retrieve the factors:
model.user_factors
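The notes above mention the implicit library, whose ALS model exposes the per-character factors as model.user_factors. As a dependency-free sketch of the same idea, the character-by-feature matrix can be factorized with a truncated SVD and the left factors read off analogously (the matrix values here are made up for illustration):

```python
import numpy as np

# Hypothetical char_id x feature count matrix (3 characters, 4 HLA features).
hla = np.array([
    [5.0, 0.0, 1.0, 0.0],
    [4.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
])

# Rank-2 factorization via truncated SVD; implicit's AlternatingLeastSquares
# plays the same role and exposes the left factors as model.user_factors.
U, s, Vt = np.linalg.svd(hla, full_matrices=False)
k = 2
char_factors = U[:, :k] * s[:k]   # analogue of model.user_factors
feat_factors = Vt[:k].T           # analogue of model.item_factors

# The product of the two factor matrices approximates the original matrix.
approx = char_factors @ feat_factors.T
print(char_factors.shape)  # -> (3, 2)
```

Each row of char_factors is then a dense personality-feature embedding for one character.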
Persona: none in the case when the persona is absent (df6bfd0)
Reason: already out-of-date and not supported.
Problem: we do not clean the list of speakers.
#14 related
We need to provide oversampling with features and take parts of the utterances in dialogs.
#34 related
Explanation: not only from the same book (version 3.0).
We consider a third-party resource that contains crowd-sourced personality traits of 800 fictional characters from numerous fictional works. Each character is scored on 264 adjective-pair spectrums, such as "trusting / suspicious" or "luddite / technophile".
https://github.com/tacookson/data/blob/master/fictional-character-personalities/personalities.txt
We consider the spectrum_low and spectrum_high parameters from these values.
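A minimal sketch of deriving the two pole names from an adjective-pair spectrum string; the tab-separated sample below is illustrative and is not the exact column layout of personalities.txt:

```python
import csv
import io

def split_spectrum(spectrum: str) -> tuple[str, str]:
    """Split an adjective-pair spectrum into its (low, high) poles."""
    low, high = (p.strip() for p in spectrum.split("/"))
    return low, high

# Illustrative tab-separated sample; the real file has more columns
# (this layout is an assumption, not the dataset's actual schema).
raw = "character\tspectrum\nHermione\ttrusting / suspicious\n"
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    print(row["character"], split_spectrum(row["spectrum"]))
```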
There are two of them:
Paragraph based (N = 1)
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/book/utils.py#L10-L47
Comments-based (text part in between segments of the dialog utterances)
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/utils_comments.py#L5-L19
List of approaches to the modeling of personalities:
iter_paragraphs_with_n_speakers
-- recognition of terms as speakers should be a function that is provided into the core as a parameter.
This is the main algorithm:
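A sketch of that design: the iterator takes the speaker-recognition predicate as a parameter instead of hard-coding it, so the core stays agnostic about how a term is recognized as a speaker (the signature and the toy data are illustrative, not the repository's exact API):

```python
from typing import Callable, Iterator

def iter_paragraphs_with_n_speakers(
        paragraphs: list[str],
        n: int,
        is_speaker: Callable[[str], bool]) -> Iterator[str]:
    """Yield paragraphs mentioning exactly n recognized speaker terms.

    `is_speaker` is injected by the caller, keeping the core independent
    of any particular speaker-recognition strategy.
    """
    for p in paragraphs:
        speakers = {t for t in p.split() if is_speaker(t)}
        if len(speakers) == n:
            yield p

known = {"Alice", "Bob"}
paras = ["Alice spoke first.", "Alice answered Bob.", "Nobody replied."]
result = list(iter_paragraphs_with_n_speakers(
    paras, 2, lambda t: t.strip(".,") in known))
print(result)  # -> ['Alice answered Bob.']
```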
We consider analysis of text comments for the presence of speakers and their relation to the preceding utterance. In particular, in this work we aimed at the prefixes in between: 1. the utterance, 2. the speaker in the comment. Prefixes represent the subject of analysis that results in lexicon construction. The resulting lexicon is necessary for utterance annotation.
To reveal the connection between a character and an utterance in a comment, we follow the assumption of a relatively short distance in between (the prefix).
In fact, comments with prefixes of length 1, 2 and 3 terms were considered.
Given the whole set of books, at the next step we analyse the filtered comments. We follow the assumption that authors of different books use similar terms in commentary on the related character.
Therefore, the goal of this step is to reveal the most representative terms in comments.
To provide the related assessments, the tf-idf measure was chosen.
In particular, we calculate tf-idf for every term (tfa is considered).
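Under the assumption that each book's comments form one document, the tf-idf scoring of prefix terms can be sketched as follows (the term lists are made up for illustration):

```python
import math
from collections import Counter

# Hypothetical prefix terms extracted from comments, one list per book.
books = [
    ["said", "said", "replied", "cried"],
    ["said", "whispered", "replied"],
    ["said", "shouted"],
]

n_docs = len(books)
df = Counter()                 # document frequency of each term
for terms in books:
    df.update(set(terms))

def tf_idf(term: str, terms: list[str]) -> float:
    """Standard tf-idf: term frequency times inverse document frequency."""
    tf = terms.count(term) / len(terms)
    idf = math.log(n_docs / df[term])
    return tf * idf

scores = {t: tf_idf(t, books[0]) for t in set(books[0])}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # -> ['cried', 'replied', 'said']
```

Terms appearing in every book (like "said" here) score zero and drop out; a probability or score threshold would then keep only the top-ranked terms for the lexicon.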
Finally, we use the p_threshold to filter the most representative terms for the lexicon.
to and at (utterances directed to another character). We consider:
This part might become a source of further optimization and improvements, e.g. application of an LLM.
It could actually be fact-checked with an LLM.
There is a way to experiment straightaway in ParlAI:
https://parl.ai/docs/agent_refs/hugging_face.html#gpt2
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
    --add-start-token True --gpt2-size medium -t gutenbergbookchars -bs 6 \
    -mf parlai/gpt-2 -veps 0.5 --tensorboard_log True
none for the default version and filled traits for the spectrum version.
persona: instead of the partner's persona (I don't know, etc.)