Comments (10)
I just read the updated spec.md. It looks really good!
So, here is what I think we will need:
1. Add parsing of "%" inside the probability operator.
2. Allow alias definitions to have entity arguments.
3. Implement the defaultDistribution CLI argument.
4. Update the calculation of the weights, considering the distribution entity argument (if set) and the defaultDistribution configuration.
5. Expose defaultDistribution to the web editor.
I feel like I can do 3 and 4. But I'm open to any suggestions.
from chatito.
Hi @nimf,
Thanks for the feedback. This tool is meant to help the people who use it, and all improvement ideas should be considered.
Being able to switch the sentence generation from the default "regular frequency" distribution to an "even" distribution is a great idea. This setting could be declared in the CLI params or the IDE config before generation (e.g.: --defaultDistribution=even or --defaultDistribution=regular), or at the DSL entity arguments level (e.g.: %[intent]("distribution": "even")), or at both levels, CLI and DSL, to have full control over each entity.
Regarding the probability operator: if the limit of 100 on the sum of all probabilities is removed and float values are accepted, then the weighted chances would just behave as documented in the ChanceJs lib (https://chancejs.com/miscellaneous/weighted.html). I think that would behave as you described.
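A minimal sketch of that weighted pick, written in Python rather than taken from Chatito's code: weights are arbitrary non-negative floats with no cap of 100, and the chance of each item is its weight divided by the total. The name `weighted_pick` is hypothetical, and the standard library's `random.choices` stands in for the ChanceJs `weighted` helper.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def weighted_pick(items: Sequence[T], weights: Sequence[float],
                  rng: Optional[random.Random] = None) -> T:
    """Pick one item with probability weight / sum(weights).

    Hypothetical helper: weights may be floats and need not sum to 100.
    """
    if len(items) != len(weights):
        raise ValueError("items and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    source = rng if rng is not None else random
    # random.choices draws with replacement using relative weights.
    return source.choices(items, weights=weights, k=1)[0]


# Usage: "a" is picked about 2.5 / 3.5 of the time.
sentence = weighted_pick(["a", "b"], [2.5, 1.0])
```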
Yes, these changes would be valuable. You are most welcome to open a PR with these ideas implemented.
from chatito.
--defaultDistribution looks really good!
Regarding the probability operator, yeah, that would be exactly as described. My only concern is: should we keep the percentage probability for regular distribution? Or should we also provide some argument to control that?
// As weights with even distribution
%[intent]("distribution": "even")       // Weight | Resulting percentage
    *[2] ~[alias1]                      // 2      | 66.66%
    ~[alias2] ~[alias3]                 // 1      | 33.33%

// As percents with regular distribution
%[intent2]("distribution": "regular")   // Resulting percentage
    *[66] ~[alias1]                     // 66%
    ~[alias2] ~[alias3]                 // 34%

// As weights with regular distribution
%[intent2]("distribution": "regular")   // Max Count | Resulting Weight | Resulting percentage
    *[2] ~[alias1]                      // 100       | 200              | 28.57%
    ~[alias2] ~[alias3]                 // 500       | 500              | 71.43%
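The "resulting percentage" columns in the weight-based examples above are just normalized weights. A small sketch of that arithmetic in Python, with `to_percentages` as a hypothetical name (not a Chatito function):

```python
from typing import List, Sequence


def to_percentages(weights: Sequence[float]) -> List[float]:
    """Normalize relative weights so that they sum to 100."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("the sum of the weights must be positive")
    return [100.0 * w / total for w in weights]


# Even distribution: the operator value is the weight itself.
even = to_percentages([2, 1])

# Regular distribution: the operator multiplies the max count (2 * 100 = 200).
regular = to_percentages([2 * 100, 500])
```

The same `*[2]` therefore yields roughly two thirds of the examples under even distribution but only about 28.57% under regular distribution, which is the ambiguity raised in this comment.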
from chatito.
Good catch, relative weights and percentage probabilities are different things. So maybe changing the name to 'chance operator' might be better than 'probability operator' since the idea is to control the relative weights or the percentage probability.
What do you think of considering the value as a relative weight if there is no '%' symbol, and as a percentage probability if it comes with '%'?
Following that idea, regular distribution would behave like:
%[intent]("distribution": "regular")    // Max Count | Weight | Prob
    ~[alias1]                           // 100       | 100    | 10%
    ~[alias2] ~[alias3]                 // 500       | 500    | 50%
    ~[alias4]                           // 400       | 400    | 40%

// NOTE: an operator with '%' defines the actual probability
%[intent]("distribution": "regular")    // Max Count | Weight/Prob
    *[20%] ~[alias1]                    // 100       | 20%
    ~[alias2] ~[alias3]                 // 500       | 44.4444% (500*80/900)
    ~[alias4]                           // 400       | 35.5556% (400*80/900)

// NOTE: an operator without '%' just multiplies the max count to get the weight
%[intent]("distribution": "regular")    // Max Count | Weight | Prob
    *[2] ~[alias1]                      // 100       | 200    | 18.1818%
    ~[alias2] ~[alias3]                 // 500       | 500    | 45.4545%
    ~[alias4]                           // 400       | 400    | 36.3636%
And for even:
%[intent]("distribution": "even")       // Max Count | Weight | Prob
    ~[alias1]                           // 100       | 1      | 33.3333%
    ~[alias2] ~[alias3]                 // 500       | 1      | 33.3333%
    ~[alias4]                           // 400       | 1      | 33.3333%

%[intent]("distribution": "even")       // Max Count | Weight | Prob
    *[2] ~[alias1]                      // 100       | 2      | 50%
    ~[alias2] ~[alias3]                 // 500       | 1      | 25%
    ~[alias4]                           // 400       | 1      | 25%

%[intent2]("distribution": "even")      // Max Count | Weight/Prob
    *[20%] ~[alias1]                    // 100       | 20%
    ~[alias2] ~[alias3]                 // 500       | 40%
    ~[alias4]                           // 400       | 40%
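The tables above can be reproduced with one small function. This is a sketch of the rules as proposed in this comment, not Chatito's implementation: the base weight is the max count for "regular" and 1 for "even", a plain operator multiplies the base weight, and a '%' operator fixes that sentence's share while the rest of the 100% is split by weight. The function name and the `(max_count, chance)` tuple shape are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical shape: ("weight", n), ("percent", p), or None for no operator.
Chance = Optional[Tuple[str, float]]


def sentence_probabilities(sentences: List[Tuple[int, Chance]],
                           distribution: str) -> List[float]:
    """Return each sentence's probability in percent."""
    if distribution not in ("regular", "even"):
        raise ValueError("distribution must be 'regular' or 'even'")
    fixed_total = sum(c[1] for _, c in sentences if c and c[0] == "percent")
    if fixed_total > 100:
        raise ValueError("percentages add up to more than 100")

    weights = []
    for max_count, chance in sentences:
        base = float(max_count) if distribution == "regular" else 1.0
        if chance and chance[0] == "percent":
            weights.append(0.0)  # share is fixed, not weighted
        elif chance and chance[0] == "weight":
            weights.append(base * chance[1])
        else:
            weights.append(base)

    free_weight = sum(weights)
    remaining = 100.0 - fixed_total
    result = []
    for weight, (_, chance) in zip(weights, sentences):
        if chance and chance[0] == "percent":
            result.append(chance[1])
        elif free_weight > 0:
            result.append(remaining * weight / free_weight)
        else:
            result.append(0.0)
    return result


# Regular distribution with *[20%] on the first sentence.
probs = sentence_probabilities(
    [(100, ("percent", 20.0)), (500, None), (400, None)], "regular")
```

For the example at the end, the last two sentences share the remaining 80% in a 500:400 ratio, matching the 44.4444% and 35.5556% values in the thread.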
Let me know your thoughts on this. Also, for consistency, maybe we should consider it an input error if an entity defines one sentence with a '%' operator and another sentence with an operator without '%'.
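One way to picture that consistency check, as a hedged Python sketch: the regular expression below is only an approximation of the operator syntax shown in this thread (`*[2]`, `*[20%]`), not Chatito's actual grammar, and both function names are hypothetical. It reads the rule as "operators within one entity must all be weights or all be percentages", while sentences with no operator are always allowed.

```python
import re
from typing import List, Optional, Tuple

# Approximate pattern for a leading chance operator such as *[2] or *[20%].
_CHANCE_RE = re.compile(r"^\*\[\s*(\d+(?:\.\d+)?)\s*(%)?\s*\]")


def parse_chance(sentence: str) -> Optional[Tuple[str, float]]:
    """Return ("percent", value), ("weight", value), or None if no operator."""
    match = _CHANCE_RE.match(sentence.strip())
    if match is None:
        return None
    kind = "percent" if match.group(2) else "weight"
    return kind, float(match.group(1))


def check_consistent(sentences: List[str]) -> None:
    """Raise ValueError if one entity mixes '%' and plain weight operators."""
    parsed = [parse_chance(s) for s in sentences]
    kinds = {p[0] for p in parsed if p is not None}
    if len(kinds) > 1:
        raise ValueError("an entity cannot mix '%' and weight operators")


# A '%' operator next to sentences without any operator is accepted.
check_consistent(["*[20%] ~[alias1]", "~[alias2] ~[alias3]", "~[alias4]"])
```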
from chatito.
Also, consider that this may add complexity to the DSL that is not that useful. Providing only the even distribution and a weighted operator, instead of percentages, might give better datasets overall and cover the same needs. Maybe the only benefit of the current regular frequency distribution implementation is that it may be faster, because it won't produce as many duplicates.
from chatito.
> What do you think of considering the value as a relative weight if there is no '%' symbol, and percentage probability if it comes with %.
This is awesome! When I was reading the documentation for the probability operator I thought "oh, maybe a percent sign at the end would make it clearer".
> Let me know your thoughts on this.
I really like this.
I think regular distribution is helpful in many cases, so we can set it via the distribution argument even when --defaultDistribution=even.
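A short Python sketch of the precedence described here, where the per-entity "distribution" argument wins over the global default. `resolve_distribution` and the dict of entity arguments are hypothetical, and the fallback default of "regular" is taken from the thread's description of the current behavior.

```python
from typing import Mapping, Optional

VALID_DISTRIBUTIONS = ("regular", "even")


def resolve_distribution(entity_args: Optional[Mapping[str, str]],
                         default_distribution: str = "regular") -> str:
    """Pick the distribution for one entity.

    The entity argument, when present, overrides the global default
    (e.g. the value passed as --defaultDistribution).
    """
    chosen = (entity_args or {}).get("distribution", default_distribution)
    if chosen not in VALID_DISTRIBUTIONS:
        raise ValueError("unknown distribution: " + repr(chosen))
    return chosen


# Global default is "even", but this entity asks for "regular".
picked = resolve_distribution({"distribution": "regular"}, "even")
```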
Regarding dropping support for the percentage probability operator:
Personally I like weighted probability more, but I can clearly imagine someone wanting "this sentence to fill 30% of all examples, and I don't care about the remaining 10 sentences".
from chatito.
Agreed, keeping both strategies then. I just created a dev branch, hoping to continue this implementation there. I've updated the spec on that branch to reflect these new features. Please let me know your thoughts on this, so we can coordinate the implementation, as I'm hoping to help with it too. Thanks @nimf.
from chatito.
Hi @nimf,
1 and 2 are done on the dev branch. Hope you can rebase your PR to fit the new changes and continue with 3 and 4. Thanks for your help and collaboration.
from chatito.
Awesome!
I'll do a rebase and continue to work on 3 and 4 in that branch.
from chatito.
Published 2.3.0. It was great sharing the work on this, Yuri. Thanks.
from chatito.