This is the main algorithm:
Prefix annotation
We analyze text comments for the presence of speakers and their relation to the preceding utterance. In particular, this work focuses on the prefixes between: 1. the utterance, 2. the speaker mentioned in the comment. These prefixes are the subject of analysis, which results in lexicon construction. The resulting lexicon is necessary for utterance annotation.
To reveal the connection between a character and an utterance in a comment, we rely on the assumption that the distance between them (the prefix) is relatively short. In fact, only comments with prefixes of length 1, 2, and 3 terms were considered.
Given the whole set of books $B$, the first step is to collect all dialogs $d_i$ for every book $i \in B$. The result is a set $D$ such that $d_i \in D$ for each $i \in B$.
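The collection step can be sketched as follows; the paragraph-based notion of a dialog and the double-quote convention are assumptions for illustration, not the repository's exact logic:

```python
import re

def collect_dialogs(book_text):
    """Collect dialog lines: a dialog is assumed here to be a paragraph
    containing a double-quoted utterance (hypothetical convention)."""
    dialogs = []
    for paragraph in book_text.split("\n"):
        paragraph = paragraph.strip()
        # Keep paragraphs that contain a quoted utterance.
        if re.search(r'"[^"]+"', paragraph):
            dialogs.append(paragraph)
    return dialogs

def build_dialog_set(books):
    """Build D = {d_i : i in B}, mapping each book id to its dialogs."""
    return {book_id: collect_dialogs(text) for book_id, text in books.items()}

# Toy input: two short "books".
books = {
    "book-1": '"Hello," said Alice.\nIt was a cold morning.',
    "book-2": 'Nothing happened.\n"Run!" shouted Bob to Carol.',
}
D = build_dialog_set(books)
```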
In the next step we analyze the filtered comments. We follow the assumption that authors of different books use similar terms when commenting on the related character.
Therefore, the goal of this step is to reveal the most representative terms in comments.
To provide the related assessment, the tf-idf measure was chosen.
In particular, we calculate tf-idf for every term (the tfa weighting is considered). Finally, we use the `p_threshold` value to filter the most representative terms for the lexicon.
https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L8-L49
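Under these assumptions, the lexicon construction might look like the sketch below; the normalized tf weighting and the `p_threshold` value are illustrative, not the repository's exact implementation:

```python
import math
from collections import Counter

def build_lexicon(prefix_docs, p_threshold):
    """Score prefix terms with tf-idf across per-book documents and keep
    every term whose score in some document reaches p_threshold."""
    n_docs = len(prefix_docs)
    # df: number of documents each term occurs in.
    df = Counter()
    for doc in prefix_docs:
        df.update(set(doc))
    lexicon = set()
    for doc in prefix_docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # Length-normalized tf to avoid bias toward long documents.
            tf_norm = count / len(doc)
            idf = math.log(n_docs / df[term]) + 1.0
            if tf_norm * idf >= p_threshold:
                lexicon.add(term)
    return lexicon

# Toy example: prefix terms grouped per book.
docs = [["said", "said", "replied"], ["said", "shouted"]]
lexicon = build_lexicon(docs, p_threshold=0.6)
```

With this toy threshold, frequent or distinctive terms ("said", "shouted") survive while weaker ones ("replied") are dropped.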
- Trigrams ending with `to` or `at` (utterances directed to another character)
- Rule 2 actually becomes an exception when such prefixes are the only terms between the utterance and the character.
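The first rule could be checked with a small helper like this; the function name and the exact token set are assumptions:

```python
def is_directed_trigram(prefix_terms):
    """A trigram prefix ending with 'to' or 'at' marks an utterance
    directed to another character (e.g. 'she turned quickly to <Name>'),
    so the character that follows is the addressee, not the speaker."""
    return len(prefix_terms) == 3 and prefix_terms[-1] in ("to", "at")
```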
Corner cases filtering:
https://github.com/nicolay-r/chatbot_experiments/blob/731d1a1892d36e155829e1fd9f095250636ec7a2/ceb_books_2_utterance_prefixes.py#L52-L65
Matching Algorithm (Establishing Connection)
We consider:
- characters mentioned straight after the utterance in the comment (absence of a prefix)
- pattern matching of the related prefix, using the constructed lexicon
https://github.com/nicolay-r/chatbot_experiments/blob/3b83e2e8730a6d3d5b59e82671001ba135598dc9/core/speaker_annotation.py#L7-L30
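A minimal sketch of the matching step, assuming tokenized comments, a known set of character names, and a maximum prefix length of 3 (all assumptions for illustration):

```python
def match_speaker(comment_tokens, character_names, lexicon):
    """Establish the utterance-speaker connection:
    1) a character right after the utterance (empty prefix) is the speaker;
    2) otherwise, a prefix of 1..3 terms fully covered by the lexicon,
       followed by a character name, yields a match."""
    # Case 1: no prefix, character mentioned straight away.
    if comment_tokens and comment_tokens[0] in character_names:
        return comment_tokens[0]
    # Case 2: prefix of 1..3 lexicon terms, then the character.
    for k in range(1, 4):
        if len(comment_tokens) <= k:
            break
        prefix, candidate = comment_tokens[:k], comment_tokens[k]
        if candidate in character_names and all(t in lexicon for t in prefix):
            return candidate
    return None

# Toy lexicon and character set.
lexicon = {"said", "replied"}
names = {"Alice", "Bob"}
```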
Conclusion
This part might become a source of further optimization and improvement, e.g. via the application of LLMs.
It could actually be fact-checked with an LLM.