
p-o-entrepreneurship-team-a-code's Introduction

P&O Entrepreneurship - Team A - Virtual Company Assistant (code) a.k.a Cluster

About this project

This is the code base repository of our bachelor's thesis project.

Modules

The chatbot and its helper services depend on (and therefore all communicate with) the Cluster Connector Server, to which a connection is established using the cluster-connector (Python connector) or ClusterClient (C#/.NET Core connector) libraries. All connector-related code can be found in the Cluster Connector repository. Another part of Cluster is the Cluster Moderator, a tool used by someone who moderates the questions and answers that users provide to the chatbot.

Documentation

Documentation can be found at Clusterdocs.

p-o-entrepreneurship-team-a-code's People

Contributors: yvesdhondt, willemcossey, vrolixthomas, louiscallens, heckej, robinparmentier, martijnvandoorennajz, dependabot[bot]

Forkers: yvesdhondt

p-o-entrepreneurship-team-a-code's Issues

Implement server-chatbot communication protocol

Because the communication protocol for server-chatbot communication had not been fully decided yet, it has not been implemented correctly in the ClusterClient message models.
The protocol is now more or less settled and will be implemented as described on the wiki.

Question about `ProcessNLPMatchQuestionsResponse`

        /// <summary>
        /// Logic: receives a MatchedQuestionsModel
        /// and decides whether there is a good match + the answer to this question.
        /// </summary>
        /// <param name="matchQuestionModels"></param>
        /// <returns>
        /// The result on Bernd's side is then:
        /// 1) There is a good match
        /// 2) We have to keep searching
        /// 3) There is no match and we are out of questions
        ///
        /// I leave it to you to decide what this one model / these multiple models should look like;
        /// a `return model;` at the end of the function is all I need :)
        /// </returns>
        public static Object ProcessNLPMatchQuestionsResponse(List<MatchQuestionModelResponse> matchQuestionModels)
        {
            return null;
        }

As you can see above, ProcessNLPMatchQuestionsResponse gets a list of MatchQuestionModelResponse objects as input. Given the definition of MatchQuestionModelResponse:

    [Serializable]
    public class MatchQuestionModelResponse : BaseModel
    {
        private int _question_id = -1;
        private MatchQuestionModelInfo[] _possible_matches = null;
        private int _msg_id = -1;

        public int question_id { get => _question_id; set => _question_id = value; }
        public MatchQuestionModelInfo[] possible_matches { get => _possible_matches; set => _possible_matches = value; }
        public int msg_id { get => _msg_id; set => _msg_id = value; }

        public bool IsComplete()
        {
            return possible_matches != null && _question_id != -1 && _msg_id != -1;
        }
    }

shouldn't the task of ProcessNLPMatchQuestionsResponse just be to check ONE MatchQuestionModelResponse for a match in its MatchQuestionModelInfos instead of checking a list of MatchQuestionModelResponses? So basically:

        public static Object ProcessNLPMatchQuestionsResponse(MatchQuestionModelResponse matchQuestionModel)
        {

            return null;
        }
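
If so, a minimal sketch of that single-response variant could look as follows. Note that the MatchResult model and the 0.75 probability threshold are assumptions introduced purely for illustration; they are not part of the current code.

        // Sketch only: MatchResult and the 0.75 threshold are hypothetical.
        public class MatchResult
        {
            public int question_id { get; set; }          // the question that was asked
            public int matched_question_id { get; set; }  // -1 when no match was found
            public bool has_match { get; set; }
        }

        public static MatchResult ProcessNLPMatchQuestionsResponse(MatchQuestionModelResponse matchQuestionModel)
        {
            // Pick the candidate with the highest probability, if any.
            MatchQuestionModelInfo best = null;
            foreach (var candidate in matchQuestionModel.possible_matches ?? new MatchQuestionModelInfo[0])
            {
                if (best == null || candidate.prob > best.prob)
                    best = candidate;
            }

            // Only accept the match when it is confident enough.
            bool hasMatch = best != null && best.prob >= 0.75f;
            return new MatchResult
            {
                question_id = matchQuestionModel.question_id,
                matched_question_id = hasMatch ? best.question_id : -1,
                has_match = hasMatch
            };
        }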

NLP offensiveness Logic processing

Similar to #44.

Handle what to do with received offensiveness models.

  1. Create logic models (plain C# objects; they won't be sent as JSON.. yet?) for (a sketch is given below):
  • representing a sentence that is offensive
  • representing a sentence that is non-offensive
  2. Discuss with chatbot and moderator what to do with both offensive and non-offensive sentences.
    Moderator: manual review along with the user id, for example?
    Chatbot: a notification to the user with a warning?
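
A minimal sketch of what those two logic models could look like, following the style of the existing response models. The class names and fields below are assumptions, not existing code.

    // Sketch only: class names and fields are hypothetical.
    [Serializable]
    public class OffensiveSentenceModel : BaseModel
    {
        public int msg_id { get; set; } = -1;
        public int user_id { get; set; } = -1;   // so the moderator can review the sender
        public string sentence { get; set; } = null;

        public bool IsComplete()
        {
            return sentence != null && msg_id != -1 && user_id != -1;
        }
    }

    [Serializable]
    public class NonOffensiveSentenceModel : BaseModel
    {
        public int msg_id { get; set; } = -1;
        public string sentence { get; set; } = null;

        public bool IsComplete()
        {
            return sentence != null && msg_id != -1;
        }
    }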

First aid to questions on architecture

Whenever you are wondering to which component some (new) feature/functionality should belong, you can follow these rules of thumb and start a discussion below in case of uncertainties:

The feature

  • implements a decision making process -> logic
  • implements (part of) the communication protocol between the server and modules/services (chatbot, NLP ...):
    • Server side -> cluster connector, cluster api
    • Client side in Python -> cluster-connector (a.k.a Python connector, Python api)
    • Client side in C# -> ClusterClient (a.k.a C# connector)
  • implements part of the dialog with a user -> chatbot
  • provides some 'intelligent' functionality to analyse user input -> NLP tools
  • ...

Update `SendQuestion()` documentation

SendQuestion() currently throws a TimeoutException when no response is received from the server before a timeout occurs. However, as this is an async method, the exception cannot be caught by its caller. Therefore the easiest option at the moment is to simply return null.
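
A minimal sketch of the proposed behaviour. The body shown here is illustrative only: SendAndAwaitResponse is a hypothetical helper standing in for the existing send-and-wait logic, and the actual signature of SendQuestion() may differ.

        // Sketch only: SendAndAwaitResponse is a hypothetical placeholder.
        public static async Task<string> SendQuestion(string question)
        {
            try
            {
                // Send the question and wait for the server's answer (existing logic).
                return await SendAndAwaitResponse(question);
            }
            catch (TimeoutException)
            {
                // Instead of letting the exception escape the async method,
                // signal "no answer received in time" with null.
                return null;
            }
        }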

Support request-response flow to retrieve unanswered questions for user

The ClusterClient module currently expects the server to push questions that should be answered, i.e. to send them whenever they become available. However, in the last meeting it appeared to be clearer for the logic to handle this situation using a plain old request-response flow.
The way ClusterClient works now is described in this figure:
(figure: current push-based flow)

The way the proposed request-response flow works is shown here (a message-level sketch is also given after the notes below):
(figure: proposed request-response flow)

Some extra notes:

  • In the first scenario the server can decide to wait for a bunch of questions, bundle them, and send them all at once, so only one message has to be sent per group of, say, X questions. If the buffer kept in ClusterClient happens to be empty when a user wants to answer questions, it can still send a request to the server. ClusterClient could decide itself whether an extra request is needed: it knows when it last sent a request, so it can set a reasonable timeout between two 'hard' requests to give the server some time. It could also keep track of the unanswered questions asked by users (which it already does, although those questions may still be answered by the NLP at that point), so it basically already knows whether there is any chance the server will ever send unanswered questions. However, when the server replies with a 'sorry, have to ask the forum' message, ClusterClient will consider this to be the answer, unless this type of answer can be distinguished from real answers.
  • In the second scenario two messages are sent over the websocket connection for every user who wants to answer a question. This seems less efficient to me than the first scenario, though perhaps more comprehensible.
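
For concreteness, a rough sketch of the two messages the proposed request-response flow could exchange, written as ClusterClient-style models. All names below are assumptions; the actual protocol is the one described on the wiki.

    // Sketch only: hypothetical request/response models for the proposed flow.
    [Serializable]
    public class GetUnansweredQuestionsRequest : BaseModel
    {
        public int user_id { get; set; } = -1;
    }

    [Serializable]
    public class GetUnansweredQuestionsResponse : BaseModel
    {
        public int user_id { get; set; } = -1;
        public string[] questions { get; set; } = null;  // empty when nothing is available
    }

ClusterClient would send a GetUnansweredQuestionsRequest whenever a user wants to answer questions (or when its buffer is empty), and the server would reply with a single GetUnansweredQuestionsResponse, instead of pushing questions whenever they become available.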

ClusterLogic NLP Responses

being worked on
Fill out ClusterLogic > NLPHandler > ProcessNLPResponse

'Response' here means that the server receives the response of the NLP tool.

Example:

        public static Object ProcessNLPMatchQuestionsResponse(List<MatchQuestionModelResponse> matchQuestionModels)
        {
            return null;
        }

With MatchQuestionModelResponse being:

    [Serializable]
    public class MatchQuestionModelResponse : BaseModel
    {
        private int _question_id = -1;
        private MatchQuestionModelInfo[] _possible_matches = null;
        private int _msg_id = -1;

        public int question_id { get => _question_id; set => _question_id = value; }
        public MatchQuestionModelInfo[] possible_matches { get => _possible_matches; set => _possible_matches = value; }
        public int msg_id { get => _msg_id; set => _msg_id = value; }

        public bool IsComplete()
        {
            return possible_matches != null && _question_id != -1 && _msg_id != -1;
        }
    }

    [Serializable]
    public class MatchQuestionModelInfo : BaseModel
    {
        
        private int _question_id = -1;
        private float _prob = -1;

        public int question_id { get => _question_id; set => _question_id = value; }
        public float prob { get => _prob; set => _prob = value; }

        public bool IsComplete()
        {
            return _question_id != -1 && prob != -1;
        }
    }

The above is a C# representation of the NLP response, returning possible related and matching questions.

'matchQuestionModels' is a list containing all models the server has received. Please process these models based on their data. Example:
The above represents a model in which the server receives, for a given question, a list of question ids with probabilities representing their similarity to that question. The logic SHOULD decide which question can be used as a similar question to the given one.

TODO:

  1. Prepare and return a model that either (a sketch of such result models is given below):
  • represents a similar question to the given one, or
  • represents the case in which no matching question could be found.
  2. Decide what to do with this model (decide on paper first).
    --> Discuss with chatbot and Forum: questions without similarities will probably be sent to the forum. Questions with similarities will probably be processed by the server, and answers to the highest matching question will be sent to the chatbot, to be sent on to the user.
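
As a starting point, a minimal sketch of the two result models described in point 1. The class names and fields are assumptions for the sake of discussion.

    // Sketch only: hypothetical result models for the two outcomes.
    [Serializable]
    public class SimilarQuestionFoundModel : BaseModel
    {
        public int question_id { get; set; } = -1;          // the newly asked question
        public int matched_question_id { get; set; } = -1;  // the existing, similar question
        public float prob { get; set; } = -1;                // confidence of the match
    }

    [Serializable]
    public class NoMatchFoundModel : BaseModel
    {
        public int question_id { get; set; } = -1;  // question to be forwarded to the forum
    }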

Improve security communication NLP/chatbot - Cluster Connector

Currently connections to the Cluster API server can (and should) be made over HTTPS (SSL), which encrypts the information sent over the connection. However, there is no need to authenticate to use the API, so no matter the strength of the SSL encryption, we currently don't control who is using the API. Proposed solutions include:

  • extending the message protocol with a hash of the message and some secret, which is known only by the client (NLP/chatbot) and the server (sketched below)
  • using some kind of RSA signing (in which case the server would keep a list of trusted public keys)
  • only allowing msg_ids that are expected (i.e. which have been generated by the server and not yet sent by the client)
  • HTTP header authentication
  • ...
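
For the first proposal, a minimal sketch of how a message could be tagged with an HMAC over its contents using the shared secret. How the secret is distributed and where the tag is placed in the message are left open here; the class and parameter names are assumptions.

    // Sketch only: illustrates the 'hash of message + shared secret' proposal.
    using System;
    using System.Security.Cryptography;
    using System.Text;

    public static class MessageSigner
    {
        // The server recomputes this tag and compares it to authenticate the sender.
        public static string ComputeTag(string messageJson, byte[] sharedSecret)
        {
            using (var hmac = new HMACSHA256(sharedSecret))
            {
                byte[] tag = hmac.ComputeHash(Encoding.UTF8.GetBytes(messageJson));
                return Convert.ToBase64String(tag);
            }
        }
    }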

More efficient implementation get_next_request()

get_next_request() currently doesn't ask the server for new tasks as long as there are still tasks in the (hidden) task list. One way to improve the efficiency of get_next_request() is to return a pending task immediately when one is available in the task list and to ask the server for new tasks in a separate thread.
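
The pattern is sketched below in C# for consistency with the other snippets (get_next_request() itself lives in the Python connector); the buffer and helper names are illustrative only.

    // Sketch only: return a cached task immediately and refill in the background.
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public class TaskBuffer
    {
        private readonly ConcurrentQueue<string> _tasks = new ConcurrentQueue<string>();

        public string GetNextRequest()
        {
            if (_tasks.TryDequeue(out string task))
            {
                // A pending task is available: hand it out right away and
                // ask the server for new tasks in a separate thread.
                Task.Run(() => FetchTasksFromServer());
                return task;
            }

            // Buffer empty: fall back to a blocking fetch.
            FetchTasksFromServer();
            _tasks.TryDequeue(out task);
            return task;
        }

        // Hypothetical helper that asks the server for new tasks and enqueues them.
        private void FetchTasksFromServer() { /* ... */ }
    }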

Blacklist support

The NLP tools need to retrieve a blacklist containing offensive words from the server. This list changes over time, so the NLP tools cannot store the list permanently.
A few possibilities are the following:

  • the server sends the list along with every match questions request
  • the NLP connector sends a request to the server to check whether it has the most up-to-date blacklist and, if it doesn't, the server sends the blacklist
  • the server keeps a hash of the date when the blacklist was last updated and sends this hash along with every match questions request. The NLP connector saves this hash and compares it on every request. When the stored hash differs from the one sent by the server, the NLP connector requests the updated blacklist and caches the response along with the newest hash. Instead of a hash, the date itself could be used, or just an incremented number of course. (This option is sketched below.)
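
A minimal sketch of the third option, shown in C# like the other snippets here (the real connector code lives in cluster-connector); all names are illustrative.

    // Sketch only: cache the blacklist and refresh it when the server's version tag changes.
    public class BlacklistCache
    {
        private string _version = null;           // hash/date/counter last seen from the server
        private string[] _blacklist = new string[0];

        // Called on every match questions request, which carries the server's current version tag.
        public string[] GetBlacklist(string serverVersion)
        {
            if (_version == null || _version != serverVersion)
            {
                // Stored tag differs: request the updated blacklist once and
                // cache it together with the newest tag.
                _blacklist = RequestBlacklistFromServer();
                _version = serverVersion;
            }
            return _blacklist;
        }

        // Hypothetical helper representing the extra request to the server.
        private string[] RequestBlacklistFromServer()
        {
            return new string[0];
        }
    }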
