ssoelvsten / anatree Goto Github PK
View Code? Open in Web Editor NEWAnatree: A Fast Data Structure for Anagrams
License: GNU General Public License v3.0
Anatree: A Fast Data Structure for Anagrams
License: GNU General Public License v3.0
I would want it to install the files in the folder:
usr/local/include/anatree/
But, instead it builds in
usr/local/include/src/
Otherwise, I'd suppose it would be in conflict with other applications installed.
Currently the header file includes some documentation to improve IntelliSense, but we ought to also use this to create a small Doxygen documentation. One can look at what was done for Adiar.
The current copy-constructor shares the non-immutable data structure. Yet, this is not expected behaviour.
Right now, we allow the user to find all words w' that are superseeded by w. But, likewise one might want to have all words w' that superseeds w. Essentially, this is just returning all words in the entire Anatree for the node matching w (including the node itself). One can draw inspiration from the anagrams_of
function to find the node (or even better is to do #19 first), and the subanagrams_of
to merge recursion results.
The complexity should be something like *O(w lg w + |Σ| + n T) where n is the size of the tree and T is the size of the output.
We notice the following interesting observation:
Observation: Each path to a non-empty node describes the Parikh's vector of that word (with the requested ordering).
Tasks
parikh()
subanagrams_of
and the keys()
function.parikh_equivalent(const anatree<> &o)
(or commutatively_equivalent(const anatree<> &o)
)Since the Anatree anyway is using C++20 to use the contains of unordered_set then we may as well also use the new concepts!
Specifically, the anatree class should require (see here for examples) the following
We should add the <<
operator as an alternative to insert(w)
or insert(begin, end)
.
We should provide an iterator over the entire Anatree (in a depth-first traversal on the tree).
begin()
is an iterator with its initial stack at the bottom-left node.
++
Either moves forward on said node or it traverses to the next node.--
Either moves backward on said node or it traverses to the previous node.end()
is an iterator at the end()
of the bottom-right node.Symmetrically, we want to provide the rbegin()
and rend()
iterators.
To implement #6 we first need to rename the current erase()
function to clear()
(name based on std::unordered_map
.
Add the contains()
function which returns a boolean value whether a word is included in the Anatree or not. This will be especially useful to do unit tests on insert()
and erase()
(...).
Right now, the Anatree only works with the <
operator, i.e. with an alphabetical order. Yet, since the size of the Anatree is dependant on this ordering, we should allow the user to specify it.
char_comp_t
to the anatree
template.
std::less<>
char_comp
of type char_comp_t
<
with use of char_comp
.char_comp
in sorted_word
.std::greater
such that there is a structural difference that is reflected in the trees size.When running make install
I get the following output
-- Installing: /usr/local/include/anatree
-- Installing: /usr/local/include/anatree/anatree.h
-- Installing: /usr/local/include/anatree/.anatree.h.~undo-tree~
-- Installing: /usr/local/include/anatree/.CMakeLists.txt.~undo-tree~
-- Installing: /usr/local/include/anatree/CMakeLists.txt
-- Installing: /usr/local/lib/cmake/anatree/anatree_config.cmake
This seems quite incorrect, as it was only intended for it to move the anatree.h file over.
Right now, the contains(w)
function is a light wrapper on the anagrams_of
function. Yet, what it essentially is doing is just check the size of the node's set of words. Hence, it is wasteful to obtain an entire copy of all words.
Instead, we should add the following private helper function that traverses the subtree as in anagrams_of but merely returns the node.
const node_ptr& find_node(const word_t w) const
Then the anagrams_of function can return a copy of the words while contains can merely check directly in the node's word_set.
In the keys()
function, we merge the words at the leaves. Yet, some of these keys can in fact be superseeded by the ones in the other.
Minimal Example
For { 'a', 'b', 'ab' } both 'ab' and 'b' is returned.
For { 'a', 'b', 'aab' } both 'aab' and 'b' is returned.
Tasks
keys(word_length)
keys(size_t word_length)
function.keys()
Currently, all nodes of the Anatree uses the std::shared_ptr
. Yet, since there only ever is one parent the std::unique_ptr
ought to suffice.
std::shared_ptr
with std::unique_ptr
.Additional Context
The reason it is as is now, is because the nodes share ownership with the traversing function. At that point in time, I was not sure exactly how to solve this - being at a gamejam, the minor overhead of reference counting was not really something I was worrying about.
For all but insert(w)
, the traversing function does not need to take ownership of a subtree. Hence, here we can merely borrow the pointer (using a reference &
). For insert(w)
we need to std::move(...)
the ownership from the node to the function on the recursive call and then std::move(...)
the result back into the node.
We should improve the interface for an end-user.
size()
function to tree_size()
(and its private variable accordingly).size()
function that reflects the number of words in the tree.empty()
as a wrapper for size() == 0u
.And of course
We currently provide the insert(w)
function to add one word at a time. We should provide the templated member function insert(IT begin, const IT end)
to add multiple elements from an iterator.
Of course, this should also be tested.
Assuming the same ordering is used (which is part of the type anyways), I'm quite confident that our Anatree is a canonical structure. Hence, we can just (recursively) compare the two DAGs for isomorphism.
==
operator calling a private and recursive subroutine that checks for isomorphism.
!=
operator being a simple wrapper on the ==
operator.Of course, this should be properly unit tested.
Add the erase(const word_t &w)
function that removes a word (if it exists) and also removes irrelevant nodes (if needed) to keep the data structure canonical ( see #12 ).
erase(const word_t &w)
to remove a single word.erase(IT begin, const IT &end)
to remove a set of words.Currently we have a chain of nodes with the same m_char
to depict more than one occurence of the same character. Yet, we can improve the asymptotic running time by switching to a random-access list (e.g. a std::vector
).
The question is, how can we achieve this while also ensuring as small a memory footprint as possible? If we change m_children
to be a std::vector<std::shared_ptr<node>>
then the vector would allocate space that is never used. Of course, we could let it have a smaller initial capacity that is more reasonable for language use-cases (i.e. a capacity of 3 or 4).
Addtional Context
This would decrease the running time of the following recursive subprocedures to be O(|Σ|) rather than O(|Σ| + w) where w is the length of the input word:
insert(w)
erase(w)
( #6 )anagrams_of(w)
contains(w)
Currently, for each node we have a std::unordered_set<...>
initialized. Yet, many of these nodes may be empty. Hence, we should move it behind a std::unique_ptr<...>
to decrease the initialisation and memory cost of each node.
Yet, first we should look into:
std::unordered_set<...>
already be set up with no initial capacity?Currently, only the .insert(w)
and .erase(w)
(see #6) functions actually change the graph (and hence require full ownership of the entire anagram tree). The remaining functions are read-only and can essentially be done in parallel.
We can add a read-write lock as described here
#include <shared_mutex>
using lock_t = std::shared_mutex;
using write_lock_t = std::unique_lock<lock_t>;
using read_lock_t = std::shared_lock<lock_t>;
To ensure there is no unneeded slowdown for (1) single-threaded implementations or (2) the anagram tree is created by a single thread to then be read by multiple threads, we need to allow the user to disable this. Hence, we should add the above locks inside of a concurrent_anatree
template class which itself is decoratoring the anatree.
We should provide the following three operations that are faster versions of deriving the entire set to then just .
has_anagram_of(w)
:has_superanagram_of(w)
:superanagrams_of(w)
in #18 . This should traverse the same path but merely return a bool
and shortcut the depth-first search as early as possible.has_subanagram_of(w)
:subanagrams_of(w)
. This should traverse the same path but merely return a bool
and shortcut the depth-first search as early as possible.A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.