harunzafer / nuve
Natural Language Processing Library for Turkish in C#
License: MIT License
This sentence has a citation. [1] This has not!
This sentence has multiple citations. [2][3][4] This has too. [5][6]. But this has none.
All extension methods should work for empty strings. Should not check if the string is null!
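A minimal sketch of what this rule could look like in practice. The method names `LastCharOrDefault` and `RemoveLast` are hypothetical and are not taken from nuve's StringExtensions class; the point is only that the empty string is handled explicitly while null is deliberately left unchecked.

```csharp
using System;

public static class StringExtensionsSketch
{
    // Hypothetical extension method, not part of nuve.
    // Returns the last character, or '\0' when the string is empty.
    // There is deliberately no null check: a null receiver fails fast
    // with a NullReferenceException when s.Length is read.
    public static char LastCharOrDefault(this string s)
    {
        return s.Length == 0 ? '\0' : s[s.Length - 1];
    }

    // Hypothetical extension method, not part of nuve.
    // Returns the string without its last character; "" stays "".
    public static string RemoveLast(this string s)
    {
        return s.Length == 0 ? s : s.Substring(0, s.Length - 1);
    }
}
```

Leaving out the null check keeps every extension method small and makes a null argument surface at the call site instead of being silently absorbed.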
Language reader methods should never return null strings; they should return empty strings where necessary.
I think level is the more appropriate and beautiful name for this attribute.
We should probably remove it
The internal order of allomorph rules at the same level must be determined by their order in the dictionary.
For example OZEL_UYOR, DONUSUM_U are both level=2 rules of FC_ZAMAN_SIMDIKI_(U)yor. OZEL_UYOR must be processed before DONUSUM_U as written in the dictionary.
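One way to get this behaviour is a stable sort on the level, sketched below. The `RuleSketch` type and the assumption that lower levels are processed first are illustrative only and are not taken from nuve. LINQ's `Enumerable.OrderBy` is a stable sort, so rules with equal levels keep the order in which they were read from the dictionary.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an allomorph rule; not nuve's actual type.
public sealed class RuleSketch
{
    public string Name { get; }
    public int Level { get; }

    public RuleSketch(string name, int level)
    {
        Name = name;
        Level = level;
    }
}

public static class RuleOrdering
{
    // Sorts by level (assumption: lower levels are processed first).
    // Enumerable.OrderBy is stable, so rules sharing a level stay in
    // dictionary order.
    public static List<RuleSketch> InProcessingOrder(
        IEnumerable<RuleSketch> rulesInDictionaryOrder)
    {
        return rulesInDictionaryOrder.OrderBy(r => r.Level).ToList();
    }
}
```

`List<T>.Sort` would not be a safe replacement here, because it is documented as an unstable sort.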
"yenmek" => yemekler yenir can already be solved like:
ye/FIIL Ul/FY_EDILGEN_Ul_(U)n Ur/FC_ZAMAN_GENIS_(U)r
so there is no need for the "yenmek" root.
But this is a critical decision to make!
It has no use
IC_HAL_ILGI_(n)Un => NNI_Gen_(n)Un/yin/yun/im
http://zembereknlp.blogspot.com.tr/2010/05/kelime-hatalar.html
We must read this post carefully and add the necessary roots to nuve Turkish dictionary.
This issue will always remain open for words that nuve should analyze but cannot.
yaptırıveremedim [Fixed]
arttıracağını [Fixed]
Define the genres/types of the content and their ratios in the corpus. Brown Corpus, British National Corpus and Turkish National Corpus may give an idea.
...çalışmaya devam edeceklerdir. (O tarihte tatil günü cuma değildi.) Sadece...
Tam da o an içinizi derin bir huzur kaplayacak (biliyorum olacak : ))
...asla tercih etmeyeceğim bir ada. (Çünkü artık her yeri bina, her yer insan yığını) Ama bisiklet turu yapacaksam da...
In Turkish, the noun "satın" is always used with the verb "almak". Can we handle exceptions like this?
Why are there two entries for this verb? One has the flag ettirtgen_t?
This is another critical decision point along with the "dönüşlü" verbs. Should we remove all verbs in causative form?
else
{
    Console.WriteLine(@"Duplicate key: " + key);
}
In Allomorph.cs
or maybe
Some parts of the code don't seem right. Carefully review each one. The StringExtensions class in particular may have room for performance improvements.
Should our sentence tokenizer only take paragraphs as input or any text? What, precisely, is our definition of a sentence?
Remove all duplicates!
Carefully written tests are required
Suffix names in the file must be standard, such as Poss3PS (IC_SAHIPLIK_O_(s)U). In the output we can add Poss3PS_sU or otherwise show the lexical form. We can also add a separate column that shows the lexical form in a regex-like form such as (s)U.
As an alternative, we can just use the standard form plus the regex-like lexical form, joined with an underscore. We can add an extra column for the suffix description, such as "3. tekil şahıs sahiplik" (third person singular possessive).
Instead of distinct, this method can be more efficient and elegant.
Static readers, or static variables in readers, may be partially overwritten; this is very bug-prone!
Keeping derived roots such as gözlük, gözlükçü, and gözlükçülük and then removing duplicate solutions is much more complicated than keeping just the pure root words and generating solutions for derived (longer) roots when necessary.
Example: solutions for "gözlükçü" when we have all of "göz", "gözlük", "gözlükçü" in the root dictionary.
root: göz | suffixes: lük + çü
root: gözlük | suffix: çü
root: gözlükçü
Instead we keep a record of derivative suffixes and generate if needed:
göz + lük (d) + çü (d)
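The difference between the two approaches can be sketched with a toy segmenter. Everything here is an assumption made for illustration: `DerivationSketch` is not nuve code, and suffixes are matched as fixed surface strings ("lük", "çü") instead of lexical forms with allomorph rules. The sketch only shows that a dictionary of pure roots produces one solution where a dictionary with derived roots produces three.

```csharp
using System.Collections.Generic;

public static class DerivationSketch
{
    // Returns every segmentation of `word` into a known root followed by
    // zero or more derivational suffixes, each suffix marked with "(d)".
    public static List<string> Analyze(
        string word, ISet<string> roots, ISet<string> derivationalSuffixes)
    {
        var solutions = new List<string>();
        for (int i = 1; i <= word.Length; i++)
        {
            string root = word.Substring(0, i);
            if (!roots.Contains(root)) continue;
            foreach (var tail in SplitSuffixes(word.Substring(i), derivationalSuffixes))
            {
                solutions.Add(root + tail);
            }
        }
        return solutions;
    }

    // Yields each way of splitting `rest` entirely into known suffixes.
    // An empty remainder yields a single empty tail.
    private static IEnumerable<string> SplitSuffixes(string rest, ISet<string> suffixes)
    {
        if (rest.Length == 0)
        {
            yield return "";
            yield break;
        }
        for (int i = 1; i <= rest.Length; i++)
        {
            string head = rest.Substring(0, i);
            if (!suffixes.Contains(head)) continue;
            foreach (var tail in SplitSuffixes(rest.Substring(i), suffixes))
            {
                yield return " + " + head + " (d)" + tail;
            }
        }
    }
}
```

With only "göz" in the root set, `Analyze("gözlükçü", ...)` returns a single solution, so there are no duplicates to remove afterwards.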
Lists that contain unique objects, like flags, rules, etc., should be Sets.
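A minimal sketch of the idea using `HashSet<string>`. The comma-separated input format and the `ParseFlags` name are assumptions for illustration, not nuve's actual dictionary format or API.

```csharp
using System.Collections.Generic;

public static class FlagSetSketch
{
    // Hypothetical parser: collects flags from a comma-separated field.
    // HashSet<string>.Add ignores values that are already present, so
    // uniqueness is guaranteed by the collection type itself.
    public static HashSet<string> ParseFlags(string commaSeparated)
    {
        var flags = new HashSet<string>();
        foreach (var part in commaSeparated.Split(','))
        {
            var flag = part.Trim();
            if (flag.Length > 0)
            {
                flags.Add(flag);
            }
        }
        return flags;
    }
}
```

Membership tests with `Contains` are also constant time on average, instead of a linear scan over a list.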
Move the method to a more suitable place or remove method if it is not necessary at all.
To prevent strong coupling to the QuickGraph Library create an IGraph Interface and a Concrete wrapper for QuickGraph.
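A rough sketch of such an abstraction. The interface members below are assumptions about what the library might need, and the concrete class is a plain dictionary-backed stand-in so that the example stays self-contained; a QuickGraph-backed wrapper would implement the same `IGraph<TVertex>` interface and delegate to the library internally.

```csharp
using System.Collections.Generic;

// Hypothetical interface; the member set is an assumption.
public interface IGraph<TVertex>
{
    bool AddVertex(TVertex vertex);
    bool AddEdge(TVertex source, TVertex target);
    IEnumerable<TVertex> Successors(TVertex vertex);
}

// Stand-in implementation backed by an adjacency dictionary.
// A QuickGraph wrapper would replace this class, not the interface.
public sealed class DictionaryGraph<TVertex> : IGraph<TVertex>
{
    private readonly Dictionary<TVertex, HashSet<TVertex>> _adjacency =
        new Dictionary<TVertex, HashSet<TVertex>>();

    // Returns false if the vertex was already present.
    public bool AddVertex(TVertex vertex)
    {
        if (_adjacency.ContainsKey(vertex)) return false;
        _adjacency[vertex] = new HashSet<TVertex>();
        return true;
    }

    // Adds missing endpoints, then the edge; false if the edge existed.
    public bool AddEdge(TVertex source, TVertex target)
    {
        AddVertex(source);
        AddVertex(target);
        return _adjacency[source].Add(target);
    }

    // Unknown vertices simply have no successors.
    public IEnumerable<TVertex> Successors(TVertex vertex)
    {
        HashSet<TVertex> targets;
        return _adjacency.TryGetValue(vertex, out targets)
            ? targets
            : new HashSet<TVertex>();
    }
}
```

Code that depends only on `IGraph<TVertex>` can then be tested without the graph library and would not change if the library were swapped.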
Another "ettirtgen_t" issue. What's going on?
Collect 20 texts including approximately 500 words each.
If an allomorph does not exist at the specified position, the condition that depends on it should be false!
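A minimal sketch of this behaviour, assuming a condition that compares the surface form at a given index. `ConditionSketch` and `SurfaceAtEquals` are hypothetical names, and allomorphs are simplified to plain strings.

```csharp
using System.Collections.Generic;

public static class ConditionSketch
{
    // True only if an allomorph exists at `position` and its surface
    // form equals `expected`. An out-of-range position is treated as
    // "condition not satisfied" instead of throwing.
    public static bool SurfaceAtEquals(
        IReadOnlyList<string> allomorphs, int position, string expected)
    {
        if (position < 0 || position >= allomorphs.Count)
        {
            return false;
        }
        return allomorphs[position] == expected;
    }
}
```

Returning false for a missing position keeps the bounds check in one place, so individual rules do not each need their own guard.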
Prepare a readme file in English (for github) and write a blog post in Turkish (hrzafer.com)
Example for numbered items:
Kısaltmalarla ilgili kurallar şunlardır:
Example for hyphened items:
Aşağıya en ilginçlerini alıyorum:
Starting with orthography! This probably will improve loading time.
26.96 seconds for a million tokens on a 64-bit Intel i7 3.40 GHz machine