gmamaladze / trienet Goto Github PK

.NET Implementations of Trie Data Structures for Substring Search, Auto-completion and Intelli-sense. Includes: patricia trie, suffix trie and a trie implementation using Ukkonen's algorithm.

License: MIT License

C# 98.19% Smalltalk 1.81%

data-structures dotnet algorithms search trie suffix-tries ukkonen-algorithm

trienet's Introduction

usage

  nuget install TrieNet

using Gma.DataStructures.StringSearch;
	
...

var trie = new UkkonenTrie<int>(3);
//var trie = new SuffixTrie<int>(3);

trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);

var result = trie.Retrieve("hel");

updates

Added UkkonenTrie<T> which is a trie implementation using Ukkonen's algorithm. Finally I managed to port (largely rewritten) a java implementation of Generalized Suffix Tree using Ukkonen's algorithm by Alessandro Bahgat (THANKS!).

I have not made all measurements yet, but it occurs to have significatly imroved build-up and look-up times.

trienet

you liked it, you find it useful

so I migrated it from dying https://trienet.codeplex.com/

  nuget install TrieNet

and created a NuGet package.

motivation

If you are implementing a modern user friendly peace of software you will very probably need something like this:

Or this:

I have seen manyquestions about an efficient way of implementing a (prefix or infix) search over a key value pairs where keys are strings (for instance see:http://stackoverflow.com/questions/10472881/search-liststring-for-string-startswith).

So it depends:

If your data source is aSQL or some other indexed database holdig your data it makes sense to utilize it’s search capabilities and issue a query to find maching records.
If you have a small ammount of data, a linear scan will be probably the most efficient.

IEnumerable> keyValuePairs;
...
var result = keyValuePairs.Select(pair => pair.Key.Contains(searchString));

If you are seraching in a large set of key value records you may need a special data structure to perform your seach efficiently.

trie

There is a family of data structures reffered as Trie. In this post I want to focus on a c# implementations and usage of Trie data structures. If you want to find out more about the theory behind the data structure itself Google will be probably your best friend. In fact most of popular books on data structures and algorithms describe tries (see.: Advanced Data Structures by Peter Brass)

implementation

The only working .NET implementation I found so far was this one:http://geekyisawesome.blogspot.de/2010/07/c-trie.html

Having some concerns about interface usability, implementation details and performance I have decided to implement it from scratch.

My small library contains a bunch of trie data structures all having the same interface:

public interface ITrie
{
  IEnumerable Retrieve(string query);
  void Add(string key, TValue value);
}

Class	Description
`Trie`	the simple trie, allows only prefix search, like `.Where(s => s.StartsWith(searchString))`
`SuffixTrie`	allows also infix search, like `.Where(s => s.Contains(searchString))`
`PatriciaTrie`	compressed trie, more compact, a bit more efficient during look-up, but a quite slower durig build-up.
`SuffixPatriciaTrie`	the same as PatriciaTrie, also enabling infix search.
`ParallelTrie`	very primitively implemented parallel data structure which allows adding data and retriving results from different threads simultaneusly.

performance

Important: all diagrams are given in logarithmic scale on x-axis.

To answer the question about when to use trie vs. linear search beter I’v experimeted with real data. As you can see below using a trie data structure may already be reasonable after 10.000 records if you are expecting many queries on the same data set.

Look-up times on patricia are slightly better, advantages of patricia bacame more noticable if you work with strings having many repeating parts, like quelified names of classes in sourcecode files, namespaces, variable names etc. So if you are indexing source code or something similar it makes sense to use patricia …

… even if the build-up time of patricia is higher compared to the normal trie.

demo app

The app demonstrates indexing of large text files and look-up inside them. I have experimented with huge texts containing millions of words. Indexing took usually only several seconds and the look-up delay was still unnoticable for the user.

trienet's People

Contributors

Stargazers

Watchers

trienet's Issues

Only one search result instead of two

var trie = new SuffixTrie<int>(3);

trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);

var result = trie.Retrieve("ll");

The result has only 1 as an element, however, there should be 1 and 3 as results.
And if you replace trie.Add("hell", 3) with trie.Add("hell3", 3), it works as expected (1, 3 are results).

Incompatible with UWP because of String.Intern()

I am using trienet within my UWP App, however I had to remove all the calls to string.Intern() (here and here), and replace them by simple string assignements such as m_Origin = origin, because the .NET team decided to remove methods which they "regretted adding in the first place, or restricted our plans around innovating for the future." (according to David Kean).

However, I'm guessing this has a negative impact on performance. Is there anyway to limit this?

Net standard 2.0 nuget package is not uploaded

It seems that the latest version of the TrieNet 1.0.4 is not published to NuGet. I can only download the nupkg from artifact of last success CI build in appveyor. Please help uploading to NuGet. Thanks alot.

Storing UkkonenTrie in Redis

Is it possible to store a UkkonenTrie in Redis? Tried serializing the whole object to Json but this failed.

Only for English?

It seems that hieroglyphs do not support search?

Not an issue - "fork" or contribution?

Hi,
I don't know if you are interested in this but I added .net core 2.1 targeted projects to your solution. I'm new to contributing on git so please pull or let me if helpful / "what's the right way to contribute".

https://github.com/dgerding/trienet.git

I can say my limited testing does seem to support MS' claims that .net core 2.1 is faster!

Thanks again.

Dave Gerding

Binary targeting .NET Framework 4.5?

Would it be possible to add support for .NET Framework 4.5? I need it for a project that's still targeting net45 and since the package currently supports .NET Standard 2.0, I had to create a new package out of it, after adding necessary targeting. It'd be great if that can be added here (maybe a different branch that can create a new package?) I am not sure if any of the previous versions have a support for net45.

Case sensitivity

How do I turn off case sensitivity? So I can retrieve strings even when they are not the same case as the filter. As a work around I have to retrieve for each permutation of capitalization and concatenate the results together.

Some issue with Unicode characters, maybe

Got an exception

System.ArgumentOutOfRangeException: startIndex cannot be larger than length of string. Parameter name: startIndex at System.String.Substring(Int32 startIndex, Int32 length) at Gma.DataStructures.StringSearch.UkkonenTrie1.TestAndSplit(Node1 inputs, String stringPart, Char t, String remainder, T value) at Gma.DataStructures.StringSearch.UkkonenTrie1.Update(Node1 inputNode, String stringPart, String rest, T value) at Gma.DataStructures.StringSearch.UkkonenTrie`1.Add(String key, T value) at TPB.Business.PirateBayDumpProcessor.Process(FileInfo file) in D:_Projects\TPB\TPB.Business\PirateBayDumpProcessor.cs:line 57 at TPB.ConsoleTester.Program.Main(String[] args) in D:_Projects\TPB\TPB.ConsoleTester\Program.cs:line 12} | System.ArgumentOutOfRangeException

when trying (pun not intended): trie.Add(entry.Name, entry);
where entry.Name was Tjockare än vatten (Thicker Than Water) - S02 E08 - 720p x265 H

Pattern / wildcard search

Is it possible to search for patterns also maybe with any Wildcard character like *

B*as would find all bananas and all baristas for example.

Another Unicode issue

Ran into an issue with unicode 0x300. This can be reproduced with the below code:

var a= "rosalía castro";
var b= "rosalía";
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
Console.WriteLine(t.Retrieve(a).Count());

This will print 0. Note that the second item added is not a byte-equal prefix of s, their unicode sequences are different. Though a.StartsWith(b) returns true, presumably because of culture settings. The second one uses two characters: a normal 'i' followed by unicode 0x300 to add the accent, while the first one uses a single accented i character.

Implement Update and Delete

Hi,

is it possible to update the value T on a word when it is already stored ?
As it seems now I have to destroy the the whole Trie structure and rebuild it ?