Comments (9)
Ah ok, I did not know this. Sorry.
I will investigate further then.
from tongrams.
It fails on the third bigram because that bigram is آباء متعددي, but there is no متعددي among the unigrams (vocabulary).
The trie topology should be complete, that is: if a bigram X Y appears, then the unigrams must contain both X and Y.
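To find the offending entries before building, the completeness condition can be checked with a few lines of Python. This is only a sketch: `missing_unigrams` is a hypothetical helper, not part of tongrams, and it assumes the n-grams are already loaded as strings with words separated by a single space.

```python
def missing_unigrams(unigrams, bigrams):
    """Return words that occur in some bigram but not in the unigram vocabulary."""
    vocab = set(unigrams)
    missing = []
    for bigram in bigrams:
        for word in bigram.split(" "):
            # Report each missing word once, in order of first appearance.
            if word not in vocab and word not in missing:
                missing.append(word)
    return missing


# Toy data modelled on the case above: the last bigram uses a word
# that is not in the vocabulary.
unigrams = ["آباء", "الأطفال", "الكنيسة"]
bigrams = ["آباء الأطفال", "آباء الكنيسة", "آباء متعددي"]
print(missing_unigrams(unigrams, bigrams))
```

An empty result means every word used by the bigrams is present among the unigrams.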
from tongrams.
In fact, if I remove that bigram and retain these bigrams:
6
آباء الأطفال 1
آباء الكنيسة 4
آباء المجلس 3
آباء المجمع 1
آباء بالتبني 1
آباء سلالات 1
i.e., all prefixed by آباء, then it correctly builds a 2-gram model.
from tongrams.
Full Log
2021-12-07 11:03:45: Reading 1-grams counts
2021-12-07 11:03:45: Reading 2-grams counts
2021-12-07 11:03:45: Reading 3-grams counts
2021-12-07 11:03:46: Building vocabulary
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 17 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 76.6394% nodes remaining
2021-12-07 11:03:46: Round 1, 66.0535% nodes remaining
2021-12-07 11:03:46: Round 2, 60.309% nodes remaining
2021-12-07 11:03:46: Round 3, 56.5286% nodes remaining
2021-12-07 11:03:46: Round 4, 53.8259% nodes remaining
2021-12-07 11:03:46: Round 5, 51.7457% nodes remaining
2021-12-07 11:03:46: Round 6, 50.1449% nodes remaining
2021-12-07 11:03:46: Round 7, 48.7746% nodes remaining
2021-12-07 11:03:46: Round 8, 47.6151% nodes remaining
2021-12-07 11:03:46: Round 9, 46.6423% nodes remaining
2021-12-07 11:03:46: Round 10, 45.8192% nodes remaining
2021-12-07 11:03:46: Round 11, 45.1056% nodes remaining
2021-12-07 11:03:46: Round 12, 44.4748% nodes remaining
2021-12-07 11:03:46: Round 13, 43.9136% nodes remaining
2021-12-07 11:03:46: Round 14, 43.4082% nodes remaining
2021-12-07 11:03:46: Round 15, 42.9121% nodes remaining
2021-12-07 11:03:46: Round 16, 42.4299% nodes remaining
2021-12-07 11:03:46: Round 17, 41.9951% nodes remaining
2021-12-07 11:03:46: Round 18, 41.6049% nodes remaining
2021-12-07 11:03:46: Round 19, 41.2314% nodes remaining
2021-12-07 11:03:46: Round 20, 40.8858% nodes remaining
2021-12-07 11:03:46: Round 21, 40.5578% nodes remaining
2021-12-07 11:03:46: Round 22, 40.2466% nodes remaining
2021-12-07 11:03:46: Round 23, 39.9325% nodes remaining
2021-12-07 11:03:46: Round 24, 39.6297% nodes remaining
2021-12-07 11:03:46: Round 25, 39.3389% nodes remaining
2021-12-07 11:03:46: Round 26, 39.0379% nodes remaining
2021-12-07 11:03:46: Round 27, 38.7396% nodes remaining
2021-12-07 11:03:46: Round 28, 38.447% nodes remaining
2021-12-07 11:03:46: Round 29, 38.1552% nodes remaining
2021-12-07 11:03:46: Round 30, 37.8579% nodes remaining
2021-12-07 11:03:46: Round 31, 37.5699% nodes remaining
2021-12-07 11:03:46: Round 32, 37.2652% nodes remaining
2021-12-07 11:03:46: Round 33, 36.9465% nodes remaining
2021-12-07 11:03:46: Round 34, 36.6269% nodes remaining
2021-12-07 11:03:46: Round 35, 36.2692% nodes remaining
2021-12-07 11:03:46: Round 36, 35.8901% nodes remaining
2021-12-07 11:03:46: Round 37, 35.4916% nodes remaining
2021-12-07 11:03:46: Round 38, 35.1014% nodes remaining
2021-12-07 11:03:46: Round 39, 34.6758% nodes remaining
2021-12-07 11:03:46: Round 40, 34.2169% nodes remaining
2021-12-07 11:03:46: Round 41, 33.7143% nodes remaining
2021-12-07 11:03:46: Round 42, 33.1847% nodes remaining
2021-12-07 11:03:46: Round 43, 32.5929% nodes remaining
2021-12-07 11:03:46: Round 44, 31.9314% nodes remaining
2021-12-07 11:03:46: Round 45, 31.2039% nodes remaining
2021-12-07 11:03:46: Round 46, 30.3807% nodes remaining
2021-12-07 11:03:46: Round 47, 29.4368% nodes remaining
2021-12-07 11:03:46: Round 48, 28.4278% nodes remaining
2021-12-07 11:03:46: Round 49, 27.2813% nodes remaining
2021-12-07 11:03:46: Round 50, 25.924% nodes remaining
2021-12-07 11:03:46: Round 51, 24.3325% nodes remaining
2021-12-07 11:03:46: Round 52, 22.4594% nodes remaining
2021-12-07 11:03:46: Round 53, 20.2715% nodes remaining
2021-12-07 11:03:46: Round 54, 17.6924% nodes remaining
2021-12-07 11:03:46: Round 55, 14.6664% nodes remaining
2021-12-07 11:03:46: Round 56, 11.1303% nodes remaining
2021-12-07 11:03:46: Round 57, 7.34526% nodes remaining
2021-12-07 11:03:46: Round 58, 3.72094% nodes remaining
2021-12-07 11:03:46: Round 59, 1.13161% nodes remaining
2021-12-07 11:03:46: Round 60, 0.142148% nodes remaining
2021-12-07 11:03:46: Round 61, 0.000929074% nodes remaining
2021-12-07 11:03:46: Assigning values
2021-12-07 11:03:46: Building 2-grams
2021-12-07 11:03:46: Writing 2-grams
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 79.7639% nodes remaining
2021-12-07 11:03:46: Round 1, 70.1518% nodes remaining
2021-12-07 11:03:46: Round 2, 64.9241% nodes remaining
2021-12-07 11:03:46: Round 3, 62.7319% nodes remaining
2021-12-07 11:03:46: Round 4, 61.2142% nodes remaining
2021-12-07 11:03:46: Round 5, 60.371% nodes remaining
2021-12-07 11:03:46: Round 6, 59.5278% nodes remaining
2021-12-07 11:03:46: Round 7, 59.0219% nodes remaining
2021-12-07 11:03:46: Round 8, 58.516% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 1
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1667% nodes remaining
2021-12-07 11:03:46: Round 1, 65% nodes remaining
2021-12-07 11:03:46: Round 2, 59.1667% nodes remaining
2021-12-07 11:03:46: Round 3, 55.5% nodes remaining
2021-12-07 11:03:46: Round 4, 52.3333% nodes remaining
2021-12-07 11:03:46: Round 5, 49.3333% nodes remaining
2021-12-07 11:03:46: Round 6, 47.1667% nodes remaining
2021-12-07 11:03:46: Round 7, 45.1667% nodes remaining
2021-12-07 11:03:46: Round 8, 44.3333% nodes remaining
2021-12-07 11:03:46: Round 9, 43.3333% nodes remaining
2021-12-07 11:03:46: Round 10, 42.3333% nodes remaining
2021-12-07 11:03:46: Round 11, 41.3333% nodes remaining
2021-12-07 11:03:46: Round 12, 40.3333% nodes remaining
2021-12-07 11:03:46: Round 13, 39.5% nodes remaining
2021-12-07 11:03:46: Round 14, 38.5% nodes remaining
2021-12-07 11:03:46: Round 15, 38% nodes remaining
2021-12-07 11:03:46: Round 16, 37.6667% nodes remaining
2021-12-07 11:03:46: Round 17, 37.3333% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 2
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1252% nodes remaining
2021-12-07 11:03:46: Round 1, 63.773% nodes remaining
2021-12-07 11:03:46: Round 2, 57.2621% nodes remaining
2021-12-07 11:03:46: Round 3, 53.5893% nodes remaining
2021-12-07 11:03:46: Round 4, 51.586% nodes remaining
2021-12-07 11:03:46: Round 5, 48.581% nodes remaining
2021-12-07 11:03:46: Round 6, 47.0785% nodes remaining
2021-12-07 11:03:46: Round 7, 45.9098% nodes remaining
2021-12-07 11:03:46: Round 8, 44.2404% nodes remaining
2021-12-07 11:03:46: Round 9, 42.7379% nodes remaining
2021-12-07 11:03:46: Round 10, 41.7362% nodes remaining
2021-12-07 11:03:46: Round 11, 40.4007% nodes remaining
2021-12-07 11:03:46: Round 12, 38.5643% nodes remaining
2021-12-07 11:03:46: Round 13, 36.7279% nodes remaining
2021-12-07 11:03:46: Round 14, 35.7262% nodes remaining
2021-12-07 11:03:46: Round 15, 34.8915% nodes remaining
2021-12-07 11:03:46: Round 16, 34.0568% nodes remaining
2021-12-07 11:03:46: Round 17, 33.389% nodes remaining
2021-12-07 11:03:46: Round 18, 32.8881% nodes remaining
2021-12-07 11:03:46: Round 19, 32.7212% nodes remaining
2021-12-07 11:03:46: Round 20, 32.5543% nodes remaining
2021-12-07 11:03:46: Round 21, 32.2204% nodes remaining
2021-12-07 11:03:46: Round 22, 32.0534% nodes remaining
2021-12-07 11:03:46: Round 23, 31.8865% nodes remaining
2021-12-07 11:03:46: Round 24, 31.7195% nodes remaining
2021-12-07 11:03:46: Round 25, 31.3856% nodes remaining
2021-12-07 11:03:46: Round 26, 31.2187% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 3
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.2941% nodes remaining
2021-12-07 11:03:46: Round 1, 63.1933% nodes remaining
2021-12-07 11:03:46: Round 2, 58.1513% nodes remaining
2021-12-07 11:03:46: Round 3, 54.6218% nodes remaining
2021-12-07 11:03:46: Round 4, 51.5966% nodes remaining
2021-12-07 11:03:46: Round 5, 49.4118% nodes remaining
2021-12-07 11:03:46: Round 6, 48.2353% nodes remaining
2021-12-07 11:03:46: Round 7, 47.7311% nodes remaining
2021-12-07 11:03:46: Round 8, 47.2269% nodes remaining
2021-12-07 11:03:46: Round 9, 46.7227% nodes remaining
2021-12-07 11:03:46: Round 10, 46.2185% nodes remaining
2021-12-07 11:03:46: Round 11, 45.8824% nodes remaining
2021-12-07 11:03:46: Round 12, 45.5462% nodes remaining
2021-12-07 11:03:46: Round 13, 45.042% nodes remaining
2021-12-07 11:03:46: Round 14, 44.8739% nodes remaining
2021-12-07 11:03:46: Round 15, 44.7059% nodes remaining
2021-12-07 11:03:46: Round 16, 44.5378% nodes remaining
2021-12-07 11:03:46: Round 17, 44.3697% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 4
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 77.1382% nodes remaining
2021-12-07 11:03:46: Round 1, 65.9539% nodes remaining
2021-12-07 11:03:46: Round 2, 59.2105% nodes remaining
2021-12-07 11:03:46: Round 3, 55.2632% nodes remaining
2021-12-07 11:03:46: Round 4, 52.4671% nodes remaining
2021-12-07 11:03:46: Round 5, 50.1645% nodes remaining
2021-12-07 11:03:46: Round 6, 48.5197% nodes remaining
2021-12-07 11:03:46: Round 7, 47.0395% nodes remaining
2021-12-07 11:03:46: Round 8, 45.7237% nodes remaining
2021-12-07 11:03:46: Round 9, 44.2434% nodes remaining
2021-12-07 11:03:46: Round 10, 43.4211% nodes remaining
2021-12-07 11:03:46: Round 11, 42.4342% nodes remaining
2021-12-07 11:03:46: Round 12, 41.1184% nodes remaining
2021-12-07 11:03:46: Round 13, 39.9671% nodes remaining
2021-12-07 11:03:46: Round 14, 38.6513% nodes remaining
2021-12-07 11:03:46: Round 15, 37.0066% nodes remaining
2021-12-07 11:03:46: Round 16, 35.3618% nodes remaining
2021-12-07 11:03:46: Round 17, 32.8947% nodes remaining
2021-12-07 11:03:46: Round 18, 30.0987% nodes remaining
2021-12-07 11:03:46: Round 19, 25.8224% nodes remaining
2021-12-07 11:03:46: Round 20, 21.2171% nodes remaining
2021-12-07 11:03:46: Round 21, 16.1184% nodes remaining
2021-12-07 11:03:46: Round 22, 10.5263% nodes remaining
2021-12-07 11:03:46: Round 23, 4.76974% nodes remaining
2021-12-07 11:03:46: Round 24, 0.493421% nodes remaining
2021-12-07 11:03:46: Round 25, 0% nodes remaining
2021-12-07 11:03:46: Assigning values
error at position 3/641633
27133 < 131071
terminate called after throwing an instance of 'std::runtime_error'
what(): sequence is not sorted
from tongrams.
Hi,
This is strange. It looks like the files are malformed.
Could you share your input files, so that I can take a closer look at the problem?
Thanks!
from tongrams.
Sure.
2-grams.sorted.gz
3-grams.sorted.gz
1-grams.sorted.gz
from tongrams.
I had a quick look at your files and they are, as I suspected, malformed:
on each line, the count comes first, followed by the ngram string, but it should be the other way around,
as explained here: https://github.com/jermp/tongrams#input-data-format.
You should format them as follows:
# of rows
<ngram> TAB <count>
...
And be sure they are sorted with the sort_grams utility, as you did before.
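As an illustration of the layout described above (a first line with the number of rows, then one `<ngram> TAB <count>` row per n-gram), here is a minimal Python sketch. `format_counts` is a hypothetical helper, not part of tongrams, and it does not sort anything: the output would still need to go through sort_grams.

```python
def format_counts(counts):
    """Render (ngram, count) pairs as a row-count header plus tab-separated rows."""
    lines = [str(len(counts))]  # first line: number of rows
    for ngram, count in counts:
        lines.append(f"{ngram}\t{count}")  # ngram first, then TAB, then count
    return "\n".join(lines) + "\n"


rows = [("آباء الأطفال", 1), ("آباء الكنيسة", 4)]
print(format_counts(rows), end="")
```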
Let me know if it works after reformatting.
from tongrams.
Actually, it's just the rendering. This is a common problem with RTL languages.
If you parse the file, you can confirm that the first field is indeed the ngram:
tail 2-grams.sorted | cut -f 1
ييفارت توفماسيان
ييفانغ ونادي
ييكسيان من
ييلان في
ييمسومارواي مواليد
يينا ونادي
يينال في
يينال من
يينان دياوو
يييانغ في
tail 2-grams.sorted | cut -f 2
1
1
1
1
1
1
1
1
3
1
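The same check can be done without looking at how the text is rendered, by splitting each line on the tab and testing which field is numeric. This is a rough sketch: `check_line` is a hypothetical helper, and it assumes counts are plain ASCII digits.

```python
def check_line(line):
    """Return True if the line is '<ngram> TAB <count>' with the ngram first."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return False
    ngram, count = fields
    # The second field must be a non-negative integer written in ASCII digits.
    return bool(ngram) and count.isascii() and count.isdigit()


print(check_line("يينان دياوو\t3"))
print(check_line("3\tيينان دياوو"))
```

Because the test works on the stored field order, right-to-left display in a terminal or browser does not affect the result.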
from tongrams.
Thanks, I just noticed the issue.
With similar problems I used to get an error that looks like this:
2-grams file is incomplete:
'شتلاند' should have been found among 1-grams
That's why I didn't notice this time.
from tongrams.