Git Product home page Git Product logo

trienet's Introduction

Build status NuGet version

TrieNet - The library provides .NET Data Structures for Prefix String Search and Substring (Infix) Search to Implement Auto-completion and Intelli-sense.

usage

  nuget install TrieNet
using Gma.DataStructures.StringSearch;
	
...

var trie = new UkkonenTrie<int>(3);
//var trie = new SuffixTrie<int>(3);

trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);

var result = trie.Retrieve("hel");

updates

Added UkkonenTrie<T> which is a trie implementation using Ukkonen's algorithm. Finally I managed to port (largely rewritten) a java implementation of Generalized Suffix Tree using Ukkonen's algorithm by Alessandro Bahgat (THANKS!).

I have not made all measurements yet, but it occurs to have significatly imroved build-up and look-up times.

trienet

you liked it, you find it useful

so I migrated it from dying https://trienet.codeplex.com/

  nuget install TrieNet

and created a NuGet package.

motivation

If you are implementing a modern user friendly peace of software you will very probably need something like this:

Or this:

I have seen manyquestions about an efficient way of implementing a (prefix or infix) search over a key value pairs where keys are strings (for instance see:http://stackoverflow.com/questions/10472881/search-liststring-for-string-startswith).

So it depends:

  • If your data source is aSQL or some other indexed database holdig your data it makes sense to utilize it’s search capabilities and issue a query to find maching records.

  • If you have a small ammount of data, a linear scan will be probably the most efficient.

IEnumerable> keyValuePairs;
...
var result = keyValuePairs.Select(pair => pair.Key.Contains(searchString));
  • If you are seraching in a large set of key value records you may need a special data structure to perform your seach efficiently.

trie

There is a family of data structures reffered as Trie. In this post I want to focus on a c# implementations and usage of Trie data structures. If you want to find out more about the theory behind the data structure itself Google will be probably your best friend. In fact most of popular books on data structures and algorithms describe tries (see.: Advanced Data Structures by Peter Brass)

implementation

The only working .NET implementation I found so far was this one:http://geekyisawesome.blogspot.de/2010/07/c-trie.html

Having some concerns about interface usability, implementation details and performance I have decided to implement it from scratch.

My small library contains a bunch of trie data structures all having the same interface:

public interface ITrie
{
  IEnumerable Retrieve(string query);
  void Add(string key, TValue value);
}
Class Description
Trie the simple trie, allows only prefix search, like .Where(s => s.StartsWith(searchString))
SuffixTrie allows also infix search, like .Where(s => s.Contains(searchString))
PatriciaTrie compressed trie, more compact, a bit more efficient during look-up, but a quite slower durig build-up.
SuffixPatriciaTrie the same as PatriciaTrie, also enabling infix search.
ParallelTrie very primitively implemented parallel data structure which allows adding data and retriving results from different threads simultaneusly.

performance

Important: all diagrams are given in logarithmic scale on x-axis.

To answer the question about when to use trie vs. linear search beter I’v experimeted with real data. As you can see below using a trie data structure may already be reasonable after 10.000 records if you are expecting many queries on the same data set.

Look-up times on patricia are slightly better, advantages of patricia bacame more noticable if you work with strings having many repeating parts, like quelified names of classes in sourcecode files, namespaces, variable names etc. So if you are indexing source code or something similar it makes sense to use patricia …

… even if the build-up time of patricia is higher compared to the normal trie.

demo app

The app demonstrates indexing of large text files and look-up inside them. I have experimented with huge texts containing millions of words. Indexing took usually only several seconds and the look-up delay was still unnoticable for the user.

trienet's People

Contributors

aj-mcgowan avatar gmamaladze avatar jussik avatar kokachernov avatar olibomby avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

trienet's Issues

Case sensitivity

How do I turn off case sensitivity? So I can retrieve strings even when they are not the same case as the filter. As a work around I have to retrieve for each permutation of capitalization and concatenate the results together.

Only one search result instead of two

var trie = new SuffixTrie<int>(3);

trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);

var result = trie.Retrieve("ll");

The result has only 1 as an element, however, there should be 1 and 3 as results.
And if you replace trie.Add("hell", 3) with trie.Add("hell3", 3), it works as expected (1, 3 are results).

Some issue with Unicode characters, maybe

Got an exception

System.ArgumentOutOfRangeException: startIndex cannot be larger than length of string. Parameter name: startIndex at System.String.Substring(Int32 startIndex, Int32 length) at Gma.DataStructures.StringSearch.UkkonenTrie1.TestAndSplit(Node1 inputs, String stringPart, Char t, String remainder, T value) at Gma.DataStructures.StringSearch.UkkonenTrie1.Update(Node1 inputNode, String stringPart, String rest, T value) at Gma.DataStructures.StringSearch.UkkonenTrie`1.Add(String key, T value) at TPB.Business.PirateBayDumpProcessor.Process(FileInfo file) in D:_Projects\TPB\TPB.Business\PirateBayDumpProcessor.cs:line 57 at TPB.ConsoleTester.Program.Main(String[] args) in D:_Projects\TPB\TPB.ConsoleTester\Program.cs:line 12} | System.ArgumentOutOfRangeException

when trying (pun not intended): trie.Add(entry.Name, entry);
where entry.Name was Tjockare än vatten (Thicker Than Water) - S02 E08 - 720p x265 H

Implement Update and Delete

Hi,

is it possible to update the value T on a word when it is already stored ?
As it seems now I have to destroy the the whole Trie structure and rebuild it ?

Incompatible with UWP because of String.Intern()

I am using trienet within my UWP App, however I had to remove all the calls to string.Intern() (here and here), and replace them by simple string assignements such as m_Origin = origin, because the .NET team decided to remove methods which they "regretted adding in the first place, or restricted our plans around innovating for the future." (according to David Kean).

However, I'm guessing this has a negative impact on performance. Is there anyway to limit this?

Not an issue - "fork" or contribution?

Hi,
I don't know if you are interested in this but I added .net core 2.1 targeted projects to your solution. I'm new to contributing on git so please pull or let me if helpful / "what's the right way to contribute".

https://github.com/dgerding/trienet.git

I can say my limited testing does seem to support MS' claims that .net core 2.1 is faster!

Thanks again.

Dave Gerding

Binary targeting .NET Framework 4.5?

Would it be possible to add support for .NET Framework 4.5? I need it for a project that's still targeting net45 and since the package currently supports .NET Standard 2.0, I had to create a new package out of it, after adding necessary targeting. It'd be great if that can be added here (maybe a different branch that can create a new package?) I am not sure if any of the previous versions have a support for net45.

Another Unicode issue

Ran into an issue with unicode 0x300. This can be reproduced with the below code:

var a= "rosalía castro";
var b= "rosalía";
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
Console.WriteLine(t.Retrieve(a).Count());

This will print 0. Note that the second item added is not a byte-equal prefix of s, their unicode sequences are different. Though a.StartsWith(b) returns true, presumably because of culture settings. The second one uses two characters: a normal 'i' followed by unicode 0x300 to add the accent, while the first one uses a single accented i character.

Pattern / wildcard search

Is it possible to search for patterns also maybe with any Wildcard character like *

B*as would find all bananas and all baristas for example.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.