rodrigopivi / chatito

🎯🗯 Dataset generation for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!

Home Page: https://rodrigopivi.github.io/Chatito/

License: MIT License

JavaScript 4.85% TypeScript 91.95% PEG.js 3.20%

chatito nlu dataset nlp text-classification named-entity-recognition nlg dataset-generation chatbot chatbots

chatito's Issues

The generation principle: I receive many duplicates

When I generate training data in Rasa format, I get many duplicate examples, even though I do not set a training or testing limit. I would like to understand how generation works. Why does this happen? Can the generator not stop the current intent and move on to the next one?

[Chatito 1.2.1] - Optional alias operator makes a non-optional alias optional in a different intent

Hi, I am now seeing the same problem, but this time across different intent definitions.

%[greet]
    hola ~[bot?]
    buenas ~[bot?]

%[goodbye]
    chao ~[bot]
    adios ~[bot]

~[bot]
    botname1
    botname2

%[greet]
    hola ~[bot]
    buenas ~[bot?]

%[goodbye]
    chao ~[bot]
    adios ~[bot?]

~[bot]
    botname1
    botname2

or any other combination where an optional operator is defined.
I am reading the code, but I don't know how to help. Maybe the cache needs to map to "Alias", then "Actions", and then "Alias"?

Is there an API to generate the Chatito DSL file programmatically?

Is there an API to generate the Chatito DSL file from code? We have a use case where certain data elements come from a database, and it would be great if we could add them to the DSL programmatically rather than manually.
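Independently of whatever API Chatito itself exposes, a .chatito file is plain text, so one way to approach this is to render it from data with ordinary string formatting. The sketch below is only an illustration in Python: the helper names (`render_definition`, `render_file`), the `askWeather` intent, and the sample city rows are hypothetical, and the 4-space indentation follows the examples in this tracker.

```python
from typing import Iterable, List, Tuple

# Prefix characters used by the Chatito DSL for each definition type.
PREFIXES = {"intent": "%", "slot": "@", "alias": "~"}


def render_definition(kind: str, name: str, sentences: Iterable[str]) -> str:
    """Render one definition: a header line plus indented sentences."""
    header = f"{PREFIXES[kind]}[{name}]"
    body = [f"    {s.strip()}" for s in sentences if s.strip()]
    return "\n".join([header, *body])


def render_file(definitions: List[Tuple[str, str, Iterable[str]]]) -> str:
    """Join definitions with blank lines, as in hand-written files."""
    return "\n\n".join(render_definition(*d) for d in definitions) + "\n"


# Hypothetical rows, standing in for values fetched from a database.
cities = ["tokyo", "london", "paris"]

dsl = render_file([
    ("intent", "askWeather", ["What is the weather in @[city] ?"]),
    ("slot", "city", cities),
])
print(dsl)
```

The resulting string can be written to a file and passed to the command-line tool like any hand-written grammar.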

Multiple intents

How could I use declared intents to generate a combined one? For example:

%[greet]
    Hola
    Ey
    Buenas
    Buenos días
    Buenas tardes
    Buenas noches

%[inform]
    Mi nombre es @[nombre]
    Me llamo @[nombre]
    Puedes llamarme @[nombre]

%[greetandinform]
    %[greet] %[inform]

The goal is to automatically generate Rasa multiple intents like greet+ask_question, happy+thankyou...
At the moment this is not working, or I may be using the wrong syntax. Any help is welcome, thank you.
(If relevant: I am using the online IDE.)

Possibility to provide a name for the output file

For now the output file name is always "rasa_dataset_training.json". It would be good to have the option to provide your own name, or at least to have the output file named something like 'input_file_name' + '.json'.

Training vs Testing

It seems there was a recent update to the IDE, and I now see a new field ('training': '2', 'testing': '1'). In the previous version, we only had to specify a numeric value between the brackets, which I suppose was the number of training samples to generate. What is this newly added 'testing' field?

According to the docs, the total data generated will be the training plus the testing data count. What is this testing data? I'm a bit confused. We are creating a training file here, so how does testing data come into it? The docs have no proper definition explaining the difference between training data and testing data.

Note: I'm generating RASA NLU training data using the chatito IDE.

Expose Chatito API to JS more fluently

Hi there. Great job on this library (it is very helpful).

Is there any way you could make it easier to call Chatito from JS more fluently? What do I mean?

var chatito = require('chatito')

output = chatito('/path/to/my/lol.chatito')

Just an idea.

Taking a very long time

Hi, after installing 2.0, generation for a decently large dataset takes much longer than in the earlier version (for the same file).
The same problem occurs in the online editor: the page times out.
Has anyone had similar experiences?

Regex extractor

Hey,
I wanted to know how to use the regex extractor with Chatito. Do you have an example?

Improvement of displaying error

Sometimes the error shows "Expected "%", "@", "\n", "~", correct indentation, or end of input but " " found." But it never says on which line the error was found. Could you please add the line number to the error message?

Using a folder of chatito files fails because of the MergeDeep function

The MergeDeep function (Chatito/src/utils.ts) does not work for this scenario since the "common_examples" field is an array. But the MergeDeep function does not merge arrays. So only the last file in a directory is kept.
I found the example in https://github.com/saikojosh/Object-Assign-Deep which seems to explain the behavior.
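To make the failure mode concrete, here is a minimal Python sketch of a deep merge that concatenates arrays instead of letting the later value replace the earlier one. It is not the TypeScript code from Chatito/src/utils.ts; the function name `merge_deep` and the two tiny datasets are assumptions chosen to mirror the Rasa output shape shown elsewhere in these issues.

```python
def merge_deep(target: dict, source: dict) -> dict:
    """Recursively merge two dicts, concatenating lists instead of replacing them."""
    result = dict(target)  # shallow copy so the inputs are left untouched
    for key, value in source.items():
        current = result.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            result[key] = merge_deep(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            result[key] = current + value  # keep examples from both files
        else:
            result[key] = value
    return result


# Hypothetical per-file outputs for two .chatito files in one folder.
bye = {"rasa_nlu_data": {"common_examples": [{"text": "bye", "intent": "bye"}]}}
greet = {"rasa_nlu_data": {"common_examples": [{"text": "hi", "intent": "greet"}]}}

merged = merge_deep(bye, greet)
print(len(merged["rasa_nlu_data"]["common_examples"]))
```

With a merge that only handles nested objects, the second "common_examples" array would overwrite the first, which matches the reported symptom that only the last file survives.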

Commenting

Hi @rodrigopivi,
Is there any way to comment in the *.chatito markup files?

Feature Request: ability to turn off rasa synonym behavior

It would be great to be able to turn off the behavior where single-token slot becomes a synonym. While that's a cool function to have, there are definitely instances where it's undesired behavior.

For reference:

When an alias is referenced inside a slot definition, and it is the only token of the slot sentence, by default the generator will tag the generated alias value as a synonym of the alias key name

Controlling this via a flag would be awesome!

Error during large dataset creation

Hi,

I included a range of numbers in one of the synonym blocks, and the page stops responding. When I replace them with a single constant string, it works.

This does not work:

%[lookForSomething]
    ~[find?] ~[all?] ~[free?] ~[restaurants?] from @[bookTime] to @[bookTime]
    ~[find?] ~[all?] ~[free?] ~[restaurants?]
    ~[find?] ~[all?] ~[free?] ~[restaurants?] from @[bookTime] to @[bookTime] on @[bookDate]

@[bookTime#snips/datetime]
    ~[someDigit] ~[am?]
    ~[someDigit] ~[pm?]
    ~[someDigit] ~[:] ~[someDigit] ~[am?]
    ~[someDigit] ~[:] ~[someDigit] ~[pm?]

@[bookDate#snips/datetime]
    today
    tomorrow
    day after tomorrow
    10th March
    1st August
    12th July

~[find]
    show
    look for
    search
    show me
    find
    identify

~[all]
    any

~[free]
    unoccupied

~[rooms]
    room

~[:]
    :

~[am]
    am

~[pm]
    pm

~[someDigit]
    1
    2
    3
    4
    5
    6
    7
    8
    9

However, when I remove the ~[someDigit] synonyms (which largely reduces the number of combinations which Chatito has to generate) from the entity definitions, it works,

@[bookTime#snips/datetime]
    someDigit ~[am?]
    someDigit ~[pm?]
    someDigit ~[:] someDigit ~[am?]
    someDigit ~[:] someDigit ~[pm?]

@[bookDate#snips/datetime]
    today
    tomorrow
    day after tomorrow
    10th March
    1st August
    12th July

~[find]
    show
    look for
    search
    show me
    find
    identify

~[all]
    any

~[free]
    unoccupied

~[rooms]
    room

~[:]
    :

~[am]
    am

~[pm]
    pm

Any thoughts on this, or any place I am going wrong?
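A rough way to see why the first grammar is so much heavier is to count how many derivations the @[bookTime] slot alone can produce. The Python sketch below is a back-of-the-envelope calculation, not a description of how Chatito generates examples: it counts derivations before any de-duplication, and the function name `book_time_derivations` is hypothetical.

```python
from math import prod


def book_time_derivations(digit_choices: int) -> int:
    """Count derivations of @[bookTime] for a given number of ~[someDigit] values."""
    optional_suffix = 2  # ~[am?] or ~[pm?]: either the word or nothing
    colon = 1            # ~[:] has a single value
    sentences = [
        [digit_choices, optional_suffix],
        [digit_choices, optional_suffix],
        [digit_choices, colon, digit_choices, optional_suffix],
        [digit_choices, colon, digit_choices, optional_suffix],
    ]
    # Choices multiply within a sentence and add up across sentences.
    return sum(prod(tokens) for tokens in sentences)


with_alias = book_time_derivations(9)     # ~[someDigit] lists the values 1..9
with_constant = book_time_derivations(1)  # literal text "someDigit"

# "from @[bookTime] to @[bookTime]" uses the slot twice, so the counts multiply.
print(with_alias, with_alias ** 2)
print(with_constant, with_constant ** 2)
```

Under these assumptions the slot goes from 8 derivations to 360, and a sentence that uses it twice goes from 64 to 129,600, before the optional aliases in the intent multiply the total further.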

Markdown output format

Do you have any plans to allow for markdown output format (instead of JSON)? Currently I use Rasa's function to convert it, but it could save us the trouble if this was an option here.

E.g.:

from rasa_nlu.training_data import load_data
load_data(json_training_file).as_markdown()

How to control the data generation

When I use the same pattern and generate 100 samples, can I make sure that the generated data is always the same every time I run the generator and download the rasa_nlu data?

Provide option of unshuffled dataset

A shuffled dataset is difficult to check manually by intent. I would suggest providing an option (e.g. a checkbox) to disable shuffling.
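Until such an option exists, the grouping can be recovered after generation. The following Python sketch is a post-processing idea rather than a Chatito feature: it assumes the Rasa JSON layout shown in other issues here ("rasa_nlu_data" containing "common_examples"), and the helper name `group_by_intent` and the sample data are made up.

```python
import json


def group_by_intent(dataset: dict) -> dict:
    """Return a copy of a Rasa NLU dataset with examples ordered by intent."""
    data = dict(dataset["rasa_nlu_data"])
    # sorted() is stable, so examples keep their relative order within an intent.
    data["common_examples"] = sorted(
        data["common_examples"], key=lambda example: example["intent"]
    )
    return {**dataset, "rasa_nlu_data": data}


# Hypothetical shuffled output.
shuffled = {
    "rasa_nlu_data": {
        "common_examples": [
            {"text": "hola", "intent": "greet", "entities": []},
            {"text": "chao", "intent": "goodbye", "entities": []},
            {"text": "buenas", "intent": "greet", "entities": []},
        ]
    }
}

grouped = group_by_intent(shuffled)
print(json.dumps(grouped, indent=2))
```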

Feature Request: specify probabilities

It would be great if you could specify ratios between entries or probabilities or relative counts.

For instance if I have

~[topping]
    anchovies
    pepperoni

I would love to be able to specify that anchovies are less common than pepperoni.

~[topping]
    anchovies &[1]
    pepperoni &[99]

Then for every 100 examples 99 would say pepperoni and only 1 would say anchovies
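The requested behaviour amounts to weighted random choice. As an illustration only (the `&[n]` syntax above is a proposal, and this is not Chatito code), here is how relative counts could drive sampling in Python. The helper name `pick_weighted` and the seed are arbitrary.

```python
import random


def pick_weighted(options: dict, rng: random.Random) -> str:
    """Pick one key, with probability proportional to its relative count."""
    values = list(options)
    weights = list(options.values())
    return rng.choices(values, weights=weights, k=1)[0]


# Relative counts taken from the proposal above.
toppings = {"anchovies": 1, "pepperoni": 99}

rng = random.Random(42)  # fixed seed so repeated runs give the same sample
sample = [pick_weighted(toppings, rng) for _ in range(1000)]
print(sample.count("pepperoni"), sample.count("anchovies"))
```

Note that sampling like this gives the 99:1 ratio only on average; producing exactly 99 and 1 out of every 100 examples would need a quota-based approach instead.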

Implementation of lookup tables for rasa_nlu

Hey,
with rasa_nlu 0.13.3, Rasa introduced lookup tables, which have a notation similar to synonyms.
Correct me if I'm wrong, but I don't think Chatito offers a way to specify lookup tables yet.
I think it would be a nice feature to add.
Cheers

Defining an entity in a non-top-level element does not work

This is probably a feature request, unless this feature should already work. I would like to define an entity in a non-top-level element:

%[search]('training': '4')
    ~[greet?] ~[request] ~[thanks?]

~[greet]
    hi
    hello

~[thanks]
    thanks
    thx

~[request]
    I want a @[product]
    Give me a @[product]

@[product]
    ~[shoe]
    ~[shirt]

~[shoe]
    shoe
    cool shoe

~[shirt]
    shirt
    tshirt

It gives this sample

{
    "rasa_nlu_data": {
        "regex_features":[],
        "entity_synonyms":[],
        "common_examples":[
            {"text":"I want a @[product]","intent":"search","entities":[]}
        ]
    }
}

If I put @[product] in the top level element, it works fine. Changing the first part to

%[search]('training': '1')
    ~[greet?] I want a @[product] ~[thanks?]

gives
"I want a shoe thanks".

Is this possible in some way?

Unable to handle repeated entities in a sentence

Hi,

The following code

%[lookForRoom]
    find all restaurants in @[areaName] from @[bookTime] to @[bookTime]

@[bookTime#snips/datetime]
    10
    11

@[areaName]
    North Avenue
    commerical street

generates,

{ "rasa_nlu_data": { "regex_features": [], "entity_synonyms": [], "common_examples": [ { "text": "find all restaurants in North Avenue from 10 to 11", "intent": "lookForRoom", "entities": [ { "start": 24, "end": 36, "value": "North Avenue", "entity": "areaName" }, {"start": 48, "end": 50, "value": "10", "entity": "bookTime"} ] },...

Notice that only one occurrence of bookTime is reflected in the generated dataset.
A workaround is to define two entities, bookTime1 and bookTime2, but copying and pasting the same definition twice is an unnecessary burden.
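For comparison, the expected result would record one entity per slot occurrence, each with its own offsets. The Python sketch below shows one way to compute those offsets while assembling the sentence. It is not Chatito's adapter code, and the `Token` type and `build_example` helper are hypothetical.

```python
from typing import List, Tuple, Union

# A token is either literal text or an (entity name, value) pair.
Token = Union[str, Tuple[str, str]]


def build_example(intent: str, tokens: List[Token]) -> dict:
    """Assemble a sentence and record offsets for every slot occurrence."""
    text = ""
    entities = []
    for token in tokens:
        if isinstance(token, str):
            text += token
            continue
        entity, value = token
        start = len(text)  # offset where this occurrence begins
        text += value
        entities.append(
            {"start": start, "end": len(text), "value": value, "entity": entity}
        )
    return {"text": text, "intent": intent, "entities": entities}


example = build_example("lookForRoom", [
    "find all restaurants in ",
    ("areaName", "North Avenue"),
    " from ",
    ("bookTime", "10"),
    " to ",
    ("bookTime", "11"),
])
print(example["entities"])
```

Because each occurrence is appended as it is written, the repeated bookTime slot yields two separate entries rather than one.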

Synonyms not generated in new version of the online editor

I've been trying to generate a text corpus in JSON format for the Rasa adapter from markdown data through Chatito's online editor. The synonyms are no longer generated as they were in the previous version.

Markdown:

%[search]
what are my @[type] today
what are my @[type]
Fetch my @[type]
My @[type]
@[type]
~[ABC]

@[company]
TCS
Infosys Ltd
ACC

~[ABC]
ABC
abc
A B C

Output :
"entity_synonyms": [],

Do I need to make any changes in my markdown for generating synonyms in the new version?

Generate and download data set gives empty JSON file

The preview shows a proper tree structure with valid cases.
However, the downloaded JSON file is empty.
The grammar file is attached for investigation.

rules.txt

Synonyms are generated in the online editor, but not in the command line.

test.chatito (the default example from online editor):
test.chatito.txt

The result from online editor (for rasa output):
result_online_editor.json.txt

The result from command line (npx chatito test.chatito --format=rasa):
result_command_line.json.txt

The first generates synonyms:
"entity_synonyms": [{"value": "los angeles", "synonyms": ["la"]}],

But the second doesn't:
"entity_synonyms": [],

Any thoughts?

Rasa to Snips converter

"format"
and
"formatOptions"

seem to indicate that one can convert Rasa to Snips format with the command-line tool.
But I couldn't get it to work yet. Is there something wrong, or is it not intended as a converter?

Greetings

Online IDE - is it mandatory to specify the number of testing / training examples?

It looks like the IDE forces you to specify the number of examples for each and every example. Is there a way to specify this at a global level, or to leave it optional?

Incorrect synonyms value

As per RASA, synonyms are generated when you return a single value for different texts:

I want to establish the same using Chatito:

Output is not as expected:

The "value" should have been same in all cases.

Feature Request: spelling and casing

This may be better as two separate features, but they fall under the same category and seem to fit with the spirit of this package.

It would be great to allow tags on certain examples/aliases/entities that perform further augmentation of the tokens by:

introducing spelling errors
performing case operations on items

The first one might be complex to do right, as ideally the spelling perturbations would follow the same distribution as actually observed typos, and that means potentially considering language. But in a simple implementation, you could just specify probabilities like "reversal": "0.05", "insertion": "0.01", "deletion": "0.05"

The second one is easier and would be helpful for scaling. It would be great to specify that it's okay to lowercase, uppercase, propcase, etc. any token in a given set.
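The "simple implementation" described above could look roughly like the Python sketch below. It is only an illustration of the idea, not an existing Chatito feature: the function names `perturb` and `case_variants` are invented, the default probabilities are the ones quoted in this request, inserted characters are drawn uniformly from lowercase ASCII, and "propcase" is interpreted as title case.

```python
import random
import string
from typing import List


def perturb(text: str, rng: random.Random, reversal: float = 0.05,
            insertion: float = 0.01, deletion: float = 0.05) -> str:
    """Apply per-character deletions, insertions and adjacent swaps."""
    out = []
    i = 0
    while i < len(text):
        roll = rng.random()
        if roll < deletion:
            i += 1  # drop this character
        elif roll < deletion + insertion:
            out.append(rng.choice(string.ascii_lowercase))  # stray keystroke
            out.append(text[i])
            i += 1
        elif roll < deletion + insertion + reversal and i + 1 < len(text):
            out.extend([text[i + 1], text[i]])  # swap the adjacent pair
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def case_variants(token: str) -> List[str]:
    """Lowercase, uppercase and title-case forms, without duplicates."""
    return sorted({token.lower(), token.upper(), token.title()})


rng = random.Random(7)
print(perturb("Hey Bot turn off the lights", rng))
print(case_variants("mago"))
```

As noted above, a uniform model like this ignores the real distribution of typos and is language-agnostic, so it is a starting point at best.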

Output generation warning

During execution of this DSL, the output start/end positions are not correct:

%[findBoletoDASByMes]
    ~[greet]? ~[botName]? ~[please]? ~[find]? @[boleto]? @[mes]

~[greet]
    oi
    ola
    opa
    bom dia

~[botName]
    mago
    Mago

~[please]
    por favor
    poderia
    pode
    gostaria
    quero

~[find]
    gerar
    emitir
    imprimir
    pagar

@[boleto]
    boleto
    DAS
    contribuicao

~[janeiro]
    janeiro
    Janeiro
    janero
    jan
    01

~[fevereiro]
    fevereiro
    feverero
    fev
    02

@[mes]
    ~[janeiro]
    ~[fevereiro]

Making combinations optional

We can insert optional by appending a "?" after the slot name, like "@[slot_name?]". Is there a way to make combinations optional as well, e.g. "for @[dollar_amount] dollar"? Right now, I have to duplicate these lines to have versions with and without these combinations, which results in an exponential increase in lines when I need multiple of those.

Use a Chatito file only to declare slots?

In my project I want to declare my slots (cities, dates, etc.) in a file separate from my intents.

I read that this is not standard, but is it possible?

How to use with regular expression features in Rasa?

Hi, given an intent to pass my age to my bot, how would I do this with chatito?

In other words:
Sentence -> I am 59 years old
Intent -> "inform_age"
Slots: { age: 59 }

I thought along the lines of:
%[inform_age]
    I am @[age] years old

@[age]
    ??

Any help or guidance greatly appreciated.

Mark entities in sentences

I'm currently facing this issue: I made a chatito file that contains all I need to generate my dataset, and I'm very happy with it. However, there are a few sentences I want to add in which the entity mentions only make sense in that context, and I don't want them to be reused. I would still like a way to mark such slots in the DSL so that the entity is included in my RASA NLU training file.

Example:
my parcel should be delivered in @[delivery_time] days
versus
my parcel should be delivered as fast as possible where "as fast as possible" is also a delivery time but I don't want to use that expression in other sentences.

Is there any way to achieve this?

Feature Request : Force some specified samples to be included in output

Hi,

It would be a nice feature to be able to specify into a training set some mandatory sentences to be included into the output.

For example :

%[greet]('training': '2', 'testing': '1')
    hello
    hi
    hola &[training]
    salute

This could be useful for large sets with many possible combinations, when you need to limit the number of output samples but really want certain specified samples to be included.

Logging output shows wrong filename for testing data

Hey,
I'm using the npm version of chatito version 2.1.4 and noticed a minor bug.
The logging output shows the wrong filename for the testing dataset:

Saved training dataset: ./default_dataset_training.json
Saved testing dataset: ./default_dataset_training.json

Globally installed version does not generate entity_synonyms

I installed Chatito and ran it using the attached file.
Command: npx chatito "C:\Temp\chatito.chatito" --format=rasa
chatito.txt

chatito_rasa_training_1111.txt

Why don't aliases produce the same value?

First off, thanks for this project! This has the potential of saving people lots of time!
I wonder, though, if you have

%[lightChange]
    Hey Bot @[switch] the lights

@[switch]
    ~[off]

~[off]
    turn off
    deactivate

, wouldn't it make more sense if these aliases produced the same value "off", as in

{
    "text": "Hey Bot turn off the lights",
    "intent": "lightChange",
    "entities": [
        {
            "start": 8,
            "end": 16,
            "value": "off",
            "entity": "switch"
        }
    ]
},
{
    "text": "Hey Bot deactivate the lights",
    "intent": "lightChange",
    "entities": [
        {
            "start": 8,
            "end": 18,
            "value": "off",
            "entity": "switch"
        }
    ]
}

instead of different values, as it is right now, like

{
    "text": "Hey Bot turn off the lights",
    "intent": "lightChange",
    "entities": [
        {
            "start": 8,
            "end": 16,
            "value": "turn off",
            "entity": "switch"
        }
    ]
},
{
    "text": "Hey Bot deactivate the lights",
    "intent": "lightChange",
    "entities": [
        {
            "start": 8,
            "end": 18,
            "value": "deactivate",
            "entity": "switch"
        }
    ]
}

? Or what's the recommended way to handle these cases? Can you do anything else than defining the aliases again in the code that later on handles the parsed entities?
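One way to handle this outside the generator is to normalize entity values after the dataset is produced. The Python sketch below assumes the Rasa example layout shown above; the `canonical` mapping and the `normalize_entities` helper are hypothetical and would have to be maintained alongside the aliases, which is exactly the duplication this issue asks about avoiding.

```python
def normalize_entities(example: dict, canonical: dict) -> dict:
    """Replace each entity value with its canonical form, when one is known."""
    entities = [
        {**entity, "value": canonical.get(entity["value"], entity["value"])}
        for entity in example["entities"]
    ]
    return {**example, "entities": entities}


# Hand-written mapping from surface forms to the value the bot should see.
canonical = {"turn off": "off", "deactivate": "off"}

generated = {
    "text": "Hey Bot deactivate the lights",
    "intent": "lightChange",
    "entities": [
        {"start": 8, "end": 18, "value": "deactivate", "entity": "switch"}
    ],
}

normalized = normalize_entities(generated, canonical)
print(normalized["entities"][0]["value"])
```

The start and end offsets are left untouched, so they still point at the surface text "deactivate" in the sentence.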

Too many duplicates

Hi,

when trying to generate 'rasa_dataset_training.json' I get this message:

Too many duplicates while generating dataset! Looks like we have probably reached the maximun ammount of possible unique generated examples. The generator has stopped at 4385 examples for intent timeOff.

Any ideas how to avoid this?

Chatito online

Hi, maybe this is not the best place, but I'm getting an error in Chatito online on this request:

https://rawgit.com/rodrigopivi/Chatito/master/core/datasetGenerator.js

Best regards

Custom entities with v2

When using chatito version 1.*, we can easily define entity per slot with the # notation, something like @[date#snips/datetime].

I do not know how to do the same thing with the new version of chatito. Maybe I should use the trainingOptions file but I can't find documentation on how to do it.

Training data produced is not in rasa format

Let me begin by saying "Thank you" for making this. It has the potential to save a lot of time for bot makers like me. But I am facing a problem: the JSON data produced after converting the DSL file is in another format. Here is an example:

{
    "action": "booking",
    "id": "i want to book a flight at monday",
    "arg": {
        "mode": "flight ",
        "time": "monday "
    }
},
{
    "action": "booking",
    "id": "i want to book a flight at 27th jan",
    "arg": {
        "mode": "flight ",
        "time": "27th jan"
    }
}

Am I doing something wrong, or do I need to upgrade something?

[Chatito 1.2.0] - Optional Alias operator "expand the optional" to a non Optional Alias operator of the same name.

Hi, thanks for the generator.

I found a problem with version 1.2.0: the optional alias gets carried over to a non-optional alias of the same name.
Ex 1:

%[greet]
    hola ~[bot?]
    buenas ~[bot]

~[bot]
    botname1
    botname2

Returns the following (note the last "buenas"):

[
      {"text": "hola botname1", "intent": "greet", "entities": []},
      {"text": "hola botname2", "intent": "greet", "entities": []},
      {"text": "hola", "intent": "greet", "entities": []},
      {"text": "buenas botname1", "intent": "greet", "entities": []},
      {"text": "buenas botname2", "intent": "greet", "entities": []},
      {"text": "buenas", "intent": "greet", "entities": []}
]

Ex2:

%[greet]
    hola ~[bot]
    buenas ~[bot?]

~[bot]
    botname1
    botname2

Returns the following (this is correct):

[
      {"text": "hola botname1", "intent": "greet", "entities": []},
      {"text": "hola botname2", "intent": "greet", "entities": []},
      {"text": "buenas botname1", "intent": "greet", "entities": []},
      {"text": "buenas botname2", "intent": "greet", "entities": []},
      {"text": "buenas", "intent": "greet", "entities": []}
]

Using rasa adapter.

Node module gives error: `Unexpected token {`

I installed chatito locally and after that globally. If I run it on a file that works with the online IDE, I get this error:

>> npx chatito data/chatito/search.chatito 
Unexpected token {
>>

Is the node module broken?

Custom entities not generated with the snips adapter

Hello,

In the online chatito IDE, It seems like the generation of snips entities for the snips dataset format does not work.
With this chatito file :
example.txt
I get a snips dataset where all the entity definitions are empty.
It seems to me that it should not be the case, and I don't see anything in the documentation that could explain this behavior.
Thank you for your help

wrong output format?

Hi,
I installed Chatito via npm (version 0.6). Transforming the lightsManager example with npx chatito lightsManager.chatito gave me a JSON file, but not in the correct format for RASA_NLU. My generated JSON looks as follows:

[
    {
        "action": "lightChange",
        "id": "Hey Bot turn off the lights",
        "arg": {
            "switch": "turn off ",
            "lights": "lights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot turn off the lowLights",
        "arg": {
            "switch": "turn off ",
            "lights": "lowLights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot deactivate the lights",
        "arg": {
            "switch": "deactivate ",
            "lights": "lights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot deactivate the lowLights",
        "arg": {
            "switch": "deactivate ",
            "lights": "lowLights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot turn on the lights",
        "arg": {
            "switch": "turn on ",
            "lights": "lights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot turn on the lowLights",
        "arg": {
            "switch": "turn on ",
            "lights": "lowLights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot activate the lights",
        "arg": {
            "switch": "activate ",
            "lights": "lights "
        }
    },
    {
        "action": "lightChange",
        "id": "Hey Bot activate the lowLights",
        "arg": {
            "switch": "activate ",
            "lights": "lowLights "
        }
    }
]

Do you have any idea? Maybe I overlooked something?
Thanks.

Can't generate training samples using the Chatito online tool

I'm getting a syntax error, even though the bottom of the page shows the syntax as correct.
https://rodrigopivi.github.io/Chatito/
I just opened the tool page, and it was fine until I clicked on the left text field. I then clicked generate dataset and got an error. I did not change anything in the example, though. I even tried other examples, and they do not work either.

I am using Google Chrome and getting the error, but IE works fine.

ERROR:

{
    "message": "Expected \"\\n\", \"\\r\\n\", or correct indentation but \" \" found.",
    "expected": [
        {"type": "literal", "text": "\n", "ignoreCase": false},
        {"type": "literal", "text": "\r\n", "ignoreCase": false},
        {"type": "other", "description": "correct indentation"}
    ],
    "found": " ",
    "location": {
        "start": {"offset": 21, "line": 3, "column": 1},
        "end": {"offset": 22, "line": 3, "column": 2}
    },
    "name": "SyntaxError"
}

%[sampleGetWeather]
will it be sunny in @[city] @[weatherDate] ?
what kind of weather should I expect @[weatherDate] in @[city] please
tell me if it is going to rain @[weatherDate] in @[city]
What is the weather in @[city] ?

@[weatherDate#snips/datetime]
at the end of the day
tomorrow morning
this afternoon
today

@[city#location]
~[los angeles]
rio de janeiro
tokyo
london
tel aviv
paris

~[los angeles]
los angeles
la

How to create test set with online tool?

I remember the online tool had the option to create a train set and a test set, but now I can only see the train set. Is there some way to create test sets?

Testing dataset always gets created using the default adapter

Hey,
I noticed that the testing dataset uses the default format instead of the specified one. I tried the online version and a local install through npm, with the same results. I also tried snips and rasa, with the same outcome.
Is this just me, or is this intended?

Edit: Okay I just looked into the code and in the adapter it is not specified here. I think it would make sense to export the test data in the rasa-format by default. This makes the use of rasa_nlu.evaluate easier. Or am I missing something and there is a simple way to tell chatito the output format for the testing data?

Chatito npm script generates training/testing data only from last processed .chatito file

I have a list of .chatito files in a folder, and I am using the following command:

npx chatito folder/path --format rasa --outputPath data/nlu

it prints

Processing file: /folder/path/bye.chatito
Processing file: /folder/path/greet.chatito
Processing file: /folder/path/negative.chatito
Saved training dataset: ./rasa_dataset_training.json
Saved testing dataset: ./rasa_dataset_training.json

But the training set and test set only contain samples from negative.chatito, the last processed file.

Chatito Online Editor

When I try to generate my data in the Chatito online editor using the Rasa NLU output format, I get the training data in Rasa format but the testing data in Snips format.
I don't understand this problem. Could you check it?
Thanks