Comments (9)

jermp commented on August 23, 2024

Ah ok, I did not know this. Sorry.
I will investigate further then.

from tongrams.

jermp commented on August 23, 2024

It fails on the third bigram because that bigram is آباء متعددي, but there is no متعددي among the unigrams (the vocabulary).
The trie topology should be complete, that is: if a bigram X Y appears, then the unigrams must contain both X and Y.
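
The completeness rule can be checked before building a model. Below is a minimal Python sketch of such a check; it is not tongrams code. The helper name `missing_unigrams` is hypothetical, and the sample lists are a small illustrative subset of the n-grams quoted in this thread.

```python
from typing import Iterable, List, Set, Tuple


def missing_unigrams(
    unigrams: Iterable[str], bigrams: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (bigram, word) pairs where the word is absent from the unigrams.

    Hypothetical helper for illustration; not part of tongrams.
    """
    vocab: Set[str] = set(unigrams)
    missing = []
    for bigram in bigrams:
        # Tokens of an n-gram are assumed to be separated by whitespace.
        for word in bigram.split():
            if word not in vocab:
                missing.append((bigram, word))
    return missing


# Illustrative data: the vocabulary lacks the second word of the last bigram.
vocab = ["آباء", "الأطفال", "الكنيسة"]
bigrams = ["آباء الأطفال", "آباء الكنيسة", "آباء متعددي"]
problems = missing_unigrams(vocab, bigrams)
for bigram, word in problems:
    print(f"bigram {bigram!r}: {word!r} is not among the unigrams")
```

An empty result means every word used in a bigram also appears as a unigram, which is the condition described above.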

jermp commented on August 23, 2024

In fact, if I remove that bigram and retain these bigrams:

6
آباء الأطفال	1
آباء الكنيسة	4
آباء المجلس	3
آباء المجمع	1
آباء بالتبني	1
آباء سلالات	1

i.e., all prefixed by آباء, then it correctly builds a 2-gram model.

abdullah-saal commented on August 23, 2024

Full Log

2021-12-07 11:03:45: Reading 1-grams counts
2021-12-07 11:03:45: Reading 2-grams counts
2021-12-07 11:03:45: Reading 3-grams counts
2021-12-07 11:03:46: Building vocabulary
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 17 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 76.6394% nodes remaining
2021-12-07 11:03:46: Round 1, 66.0535% nodes remaining
2021-12-07 11:03:46: Round 2, 60.309% nodes remaining
2021-12-07 11:03:46: Round 3, 56.5286% nodes remaining
2021-12-07 11:03:46: Round 4, 53.8259% nodes remaining
2021-12-07 11:03:46: Round 5, 51.7457% nodes remaining
2021-12-07 11:03:46: Round 6, 50.1449% nodes remaining
2021-12-07 11:03:46: Round 7, 48.7746% nodes remaining
2021-12-07 11:03:46: Round 8, 47.6151% nodes remaining
2021-12-07 11:03:46: Round 9, 46.6423% nodes remaining
2021-12-07 11:03:46: Round 10, 45.8192% nodes remaining
2021-12-07 11:03:46: Round 11, 45.1056% nodes remaining
2021-12-07 11:03:46: Round 12, 44.4748% nodes remaining
2021-12-07 11:03:46: Round 13, 43.9136% nodes remaining
2021-12-07 11:03:46: Round 14, 43.4082% nodes remaining
2021-12-07 11:03:46: Round 15, 42.9121% nodes remaining
2021-12-07 11:03:46: Round 16, 42.4299% nodes remaining
2021-12-07 11:03:46: Round 17, 41.9951% nodes remaining
2021-12-07 11:03:46: Round 18, 41.6049% nodes remaining
2021-12-07 11:03:46: Round 19, 41.2314% nodes remaining
2021-12-07 11:03:46: Round 20, 40.8858% nodes remaining
2021-12-07 11:03:46: Round 21, 40.5578% nodes remaining
2021-12-07 11:03:46: Round 22, 40.2466% nodes remaining
2021-12-07 11:03:46: Round 23, 39.9325% nodes remaining
2021-12-07 11:03:46: Round 24, 39.6297% nodes remaining
2021-12-07 11:03:46: Round 25, 39.3389% nodes remaining
2021-12-07 11:03:46: Round 26, 39.0379% nodes remaining
2021-12-07 11:03:46: Round 27, 38.7396% nodes remaining
2021-12-07 11:03:46: Round 28, 38.447% nodes remaining
2021-12-07 11:03:46: Round 29, 38.1552% nodes remaining
2021-12-07 11:03:46: Round 30, 37.8579% nodes remaining
2021-12-07 11:03:46: Round 31, 37.5699% nodes remaining
2021-12-07 11:03:46: Round 32, 37.2652% nodes remaining
2021-12-07 11:03:46: Round 33, 36.9465% nodes remaining
2021-12-07 11:03:46: Round 34, 36.6269% nodes remaining
2021-12-07 11:03:46: Round 35, 36.2692% nodes remaining
2021-12-07 11:03:46: Round 36, 35.8901% nodes remaining
2021-12-07 11:03:46: Round 37, 35.4916% nodes remaining
2021-12-07 11:03:46: Round 38, 35.1014% nodes remaining
2021-12-07 11:03:46: Round 39, 34.6758% nodes remaining
2021-12-07 11:03:46: Round 40, 34.2169% nodes remaining
2021-12-07 11:03:46: Round 41, 33.7143% nodes remaining
2021-12-07 11:03:46: Round 42, 33.1847% nodes remaining
2021-12-07 11:03:46: Round 43, 32.5929% nodes remaining
2021-12-07 11:03:46: Round 44, 31.9314% nodes remaining
2021-12-07 11:03:46: Round 45, 31.2039% nodes remaining
2021-12-07 11:03:46: Round 46, 30.3807% nodes remaining
2021-12-07 11:03:46: Round 47, 29.4368% nodes remaining
2021-12-07 11:03:46: Round 48, 28.4278% nodes remaining
2021-12-07 11:03:46: Round 49, 27.2813% nodes remaining
2021-12-07 11:03:46: Round 50, 25.924% nodes remaining
2021-12-07 11:03:46: Round 51, 24.3325% nodes remaining
2021-12-07 11:03:46: Round 52, 22.4594% nodes remaining
2021-12-07 11:03:46: Round 53, 20.2715% nodes remaining
2021-12-07 11:03:46: Round 54, 17.6924% nodes remaining
2021-12-07 11:03:46: Round 55, 14.6664% nodes remaining
2021-12-07 11:03:46: Round 56, 11.1303% nodes remaining
2021-12-07 11:03:46: Round 57, 7.34526% nodes remaining
2021-12-07 11:03:46: Round 58, 3.72094% nodes remaining
2021-12-07 11:03:46: Round 59, 1.13161% nodes remaining
2021-12-07 11:03:46: Round 60, 0.142148% nodes remaining
2021-12-07 11:03:46: Round 61, 0.000929074% nodes remaining
2021-12-07 11:03:46: Assigning values
2021-12-07 11:03:46: Building 2-grams
2021-12-07 11:03:46: Writing 2-grams
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 79.7639% nodes remaining
2021-12-07 11:03:46: Round 1, 70.1518% nodes remaining
2021-12-07 11:03:46: Round 2, 64.9241% nodes remaining
2021-12-07 11:03:46: Round 3, 62.7319% nodes remaining
2021-12-07 11:03:46: Round 4, 61.2142% nodes remaining
2021-12-07 11:03:46: Round 5, 60.371% nodes remaining
2021-12-07 11:03:46: Round 6, 59.5278% nodes remaining
2021-12-07 11:03:46: Round 7, 59.0219% nodes remaining
2021-12-07 11:03:46: Round 8, 58.516% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 1
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1667% nodes remaining
2021-12-07 11:03:46: Round 1, 65% nodes remaining
2021-12-07 11:03:46: Round 2, 59.1667% nodes remaining
2021-12-07 11:03:46: Round 3, 55.5% nodes remaining
2021-12-07 11:03:46: Round 4, 52.3333% nodes remaining
2021-12-07 11:03:46: Round 5, 49.3333% nodes remaining
2021-12-07 11:03:46: Round 6, 47.1667% nodes remaining
2021-12-07 11:03:46: Round 7, 45.1667% nodes remaining
2021-12-07 11:03:46: Round 8, 44.3333% nodes remaining
2021-12-07 11:03:46: Round 9, 43.3333% nodes remaining
2021-12-07 11:03:46: Round 10, 42.3333% nodes remaining
2021-12-07 11:03:46: Round 11, 41.3333% nodes remaining
2021-12-07 11:03:46: Round 12, 40.3333% nodes remaining
2021-12-07 11:03:46: Round 13, 39.5% nodes remaining
2021-12-07 11:03:46: Round 14, 38.5% nodes remaining
2021-12-07 11:03:46: Round 15, 38% nodes remaining
2021-12-07 11:03:46: Round 16, 37.6667% nodes remaining
2021-12-07 11:03:46: Round 17, 37.3333% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 2
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1252% nodes remaining
2021-12-07 11:03:46: Round 1, 63.773% nodes remaining
2021-12-07 11:03:46: Round 2, 57.2621% nodes remaining
2021-12-07 11:03:46: Round 3, 53.5893% nodes remaining
2021-12-07 11:03:46: Round 4, 51.586% nodes remaining
2021-12-07 11:03:46: Round 5, 48.581% nodes remaining
2021-12-07 11:03:46: Round 6, 47.0785% nodes remaining
2021-12-07 11:03:46: Round 7, 45.9098% nodes remaining
2021-12-07 11:03:46: Round 8, 44.2404% nodes remaining
2021-12-07 11:03:46: Round 9, 42.7379% nodes remaining
2021-12-07 11:03:46: Round 10, 41.7362% nodes remaining
2021-12-07 11:03:46: Round 11, 40.4007% nodes remaining
2021-12-07 11:03:46: Round 12, 38.5643% nodes remaining
2021-12-07 11:03:46: Round 13, 36.7279% nodes remaining
2021-12-07 11:03:46: Round 14, 35.7262% nodes remaining
2021-12-07 11:03:46: Round 15, 34.8915% nodes remaining
2021-12-07 11:03:46: Round 16, 34.0568% nodes remaining
2021-12-07 11:03:46: Round 17, 33.389% nodes remaining
2021-12-07 11:03:46: Round 18, 32.8881% nodes remaining
2021-12-07 11:03:46: Round 19, 32.7212% nodes remaining
2021-12-07 11:03:46: Round 20, 32.5543% nodes remaining
2021-12-07 11:03:46: Round 21, 32.2204% nodes remaining
2021-12-07 11:03:46: Round 22, 32.0534% nodes remaining
2021-12-07 11:03:46: Round 23, 31.8865% nodes remaining
2021-12-07 11:03:46: Round 24, 31.7195% nodes remaining
2021-12-07 11:03:46: Round 25, 31.3856% nodes remaining
2021-12-07 11:03:46: Round 26, 31.2187% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 3
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.2941% nodes remaining
2021-12-07 11:03:46: Round 1, 63.1933% nodes remaining
2021-12-07 11:03:46: Round 2, 58.1513% nodes remaining
2021-12-07 11:03:46: Round 3, 54.6218% nodes remaining
2021-12-07 11:03:46: Round 4, 51.5966% nodes remaining
2021-12-07 11:03:46: Round 5, 49.4118% nodes remaining
2021-12-07 11:03:46: Round 6, 48.2353% nodes remaining
2021-12-07 11:03:46: Round 7, 47.7311% nodes remaining
2021-12-07 11:03:46: Round 8, 47.2269% nodes remaining
2021-12-07 11:03:46: Round 9, 46.7227% nodes remaining
2021-12-07 11:03:46: Round 10, 46.2185% nodes remaining
2021-12-07 11:03:46: Round 11, 45.8824% nodes remaining
2021-12-07 11:03:46: Round 12, 45.5462% nodes remaining
2021-12-07 11:03:46: Round 13, 45.042% nodes remaining
2021-12-07 11:03:46: Round 14, 44.8739% nodes remaining
2021-12-07 11:03:46: Round 15, 44.7059% nodes remaining
2021-12-07 11:03:46: Round 16, 44.5378% nodes remaining
2021-12-07 11:03:46: Round 17, 44.3697% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 4
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 77.1382% nodes remaining
2021-12-07 11:03:46: Round 1, 65.9539% nodes remaining
2021-12-07 11:03:46: Round 2, 59.2105% nodes remaining
2021-12-07 11:03:46: Round 3, 55.2632% nodes remaining
2021-12-07 11:03:46: Round 4, 52.4671% nodes remaining
2021-12-07 11:03:46: Round 5, 50.1645% nodes remaining
2021-12-07 11:03:46: Round 6, 48.5197% nodes remaining
2021-12-07 11:03:46: Round 7, 47.0395% nodes remaining
2021-12-07 11:03:46: Round 8, 45.7237% nodes remaining
2021-12-07 11:03:46: Round 9, 44.2434% nodes remaining
2021-12-07 11:03:46: Round 10, 43.4211% nodes remaining
2021-12-07 11:03:46: Round 11, 42.4342% nodes remaining
2021-12-07 11:03:46: Round 12, 41.1184% nodes remaining
2021-12-07 11:03:46: Round 13, 39.9671% nodes remaining
2021-12-07 11:03:46: Round 14, 38.6513% nodes remaining
2021-12-07 11:03:46: Round 15, 37.0066% nodes remaining
2021-12-07 11:03:46: Round 16, 35.3618% nodes remaining
2021-12-07 11:03:46: Round 17, 32.8947% nodes remaining
2021-12-07 11:03:46: Round 18, 30.0987% nodes remaining
2021-12-07 11:03:46: Round 19, 25.8224% nodes remaining
2021-12-07 11:03:46: Round 20, 21.2171% nodes remaining
2021-12-07 11:03:46: Round 21, 16.1184% nodes remaining
2021-12-07 11:03:46: Round 22, 10.5263% nodes remaining
2021-12-07 11:03:46: Round 23, 4.76974% nodes remaining
2021-12-07 11:03:46: Round 24, 0.493421% nodes remaining
2021-12-07 11:03:46: Round 25, 0% nodes remaining
2021-12-07 11:03:46: Assigning values
error at position 3/641633
27133 < 131071
terminate called after throwing an instance of 'std::runtime_error'
  what():  sequence is not sorted

jermp commented on August 23, 2024

Hi,
this is strange. It looks like the files are malformed.
Could you share your input files so that I can take a closer look at the problem?
Thanks!

abdullah-saal commented on August 23, 2024

Sure.

2-grams.sorted.gz
3-grams.sorted.gz
1-grams.sorted.gz

jermp commented on August 23, 2024

I took a quick look at your files and they are, as I suspected, malformed:
on each line, the count comes first, followed by the ngram string, but it should be the opposite,
as explained here: https://github.com/jermp/tongrams#input-data-format.
You should format them as follows:

# of rows
<ngram> TAB <count>
...

Also make sure they are sorted with the sort_grams utility, as you did before.
Let me know if it works after reformatting.

abdullah-saal commented on August 23, 2024

Actually, it's just the rendering; this is a common problem with RTL languages.
If you parse the file, you can confirm that the first field is indeed the ngram:

tail 2-grams.sorted | cut -f 1
ييفارت توفماسيان
ييفانغ ونادي
ييكسيان من
ييلان في
ييمسومارواي مواليد
يينا ونادي
يينال في
يينال من
يينان دياوو
يييانغ في
tail 2-grams.sorted | cut -f 2
1
1
1
1
1
1
1
1
3
1
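
The field order can also be verified programmatically, which avoids depending on how a terminal or browser renders right-to-left text. The following is a rough Python sketch under the format described in this thread (a first line with the number of rows, then `<ngram> TAB <count>` per line). The function `parse_ngram_lines` is a hypothetical helper, not part of tongrams, and the sample reuses the six bigrams quoted earlier.

```python
from typing import List, Tuple


def parse_ngram_lines(lines: List[str]) -> List[Tuple[str, int]]:
    """Parse a header line with the row count, then '<ngram> TAB <count>' rows.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    declared = int(lines[0].strip())
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        # Split on the first tab: the ngram comes first, the count second.
        ngram, sep, count = line.rstrip("\n").partition("\t")
        if not sep or not (count.isascii() and count.isdigit()):
            raise ValueError(f"line {number}: expected '<ngram> TAB <count>'")
        rows.append((ngram, int(count)))
    if len(rows) != declared:
        raise ValueError(f"header declares {declared} rows, found {len(rows)}")
    return rows


sample = [
    "6",
    "آباء الأطفال\t1",
    "آباء الكنيسة\t4",
    "آباء المجلس\t3",
    "آباء المجمع\t1",
    "آباء بالتبني\t1",
    "آباء سلالات\t1",
]
rows = parse_ngram_lines(sample)
print(len(rows), "rows parsed; first ngram:", rows[0][0])
```

A file with the fields genuinely reversed (count first) would fail this check, because the second field would not be a number.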

abdullah-saal commented on August 23, 2024

Thanks, I just noticed the issue.
With similar problems I used to get an error that looks like this:

2-grams file is incomplete:
        'شتلاند' should have been found among 1-grams

That's why I didn't notice this time.
