ssoelvsten / anatree Goto Github PK

View Code? Open in Web Editor NEW

1.0 2.0 0.0 70 KB

Anatree: A Fast Data Structure for Anagrams

License: GNU General Public License v3.0

CMake 2.21% Makefile 4.90% C++ 92.90%

data-structure header-only anagrams single-file

anatree's Introduction

anatree's People

Contributors

Stargazers

Watchers

anatree's Issues

Wrong installation folder

I would want it to install the files in the folder:

usr/local/include/anatree/

But, instead it builds in

usr/local/include/src/

Otherwise, I'd suppose it would be in conflict with other applications installed.

Add Doxygen Documentation

Currently the header file includes some documentation to improve IntelliSense, but we ought to also use this to create a small Doxygen documentation. One can look at what was done for Adiar.

Deep-Copy-Constructor

The current copy-constructor shares the non-immutable data structure. Yet, this is not expected behaviour.

Add a move constructor that merely moves the pointer (like the copy constructor right now).
Replace copy constructor with a non-default deep-copy constructor.

Add `superanagrams_of` member function

Right now, we allow the user to find all words w' that are superseeded by w. But, likewise one might want to have all words w' that superseeds w. Essentially, this is just returning all words in the entire Anatree for the node matching w (including the node itself). One can draw inspiration from the anagrams_of function to find the node (or even better is to do #19 first), and the subanagrams_of to merge recursion results.

The complexity should be something like *O(w lg w + |Σ| + n T) where n is the size of the tree and T is the size of the output.

Add Parikh Vectors

We notice the following interesting observation:

Observation: Each path to a non-empty node describes the Parikh's vector of that word (with the requested ordering).

Tasks

Add parikh()
Construct the set of all Parikh vectors. This probably is some mixture of the subanagrams_of and the keys() function.
Add parikh_equivalent(const anatree<> &o) (or commutatively_equivalent(const anatree<> &o))
An equivalence operation that specifically only cares about whether each Parikh vector of one also is in the other (special case of #12 , I believe).

Add C++20 Concepts

Since the Anatree anyway is using C++20 to use the contains of unordered_set then we may as well also use the new concepts!

Specifically, the anatree class should require (see here for examples) the following

Add `<<` operator

We should add the << operator as an alternative to insert(w) or insert(begin, end).

Iterator

We should provide an iterator over the entire Anatree (in a depth-first traversal on the tree).

begin() is an iterator with its initial stack at the bottom-left node.
- ++ Either moves forward on said node or it traverses to the next node.
- -- Either moves backward on said node or it traverses to the previous node.
end() is an iterator at the end() of the bottom-right node.

Symmetrically, we want to provide the rbegin() and rend() iterators.

Rename `erase()` to `clear()`

To implement #6 we first need to rename the current erase() function to clear() (name based on std::unordered_map.

Add `anagrams_of(w)` and `contains()` function

Add the contains() function which returns a boolean value whether a word is included in the Anatree or not. This will be especially useful to do unit tests on insert() and erase() (...).

Template Character Ordering

Right now, the Anatree only works with the < operator, i.e. with an alphabetical order. Yet, since the size of the Anatree is dependant on this ordering, we should allow the user to specify it.

Add char_comp_t to the anatree template.
- Default value is std::less<>
- Add private member char_comp of type char_comp_t
- Replace all uses of < with use of char_comp.
- Use char_comp in sorted_word.
Add unit tests with alternative ordering, e.g. with std::greater such that there is a structural difference that is reflected in the trees size.

Installation includes all files in 'anatree/'

When running make install I get the following output

-- Installing: /usr/local/include/anatree
-- Installing: /usr/local/include/anatree/anatree.h
-- Installing: /usr/local/include/anatree/.anatree.h.~undo-tree~
-- Installing: /usr/local/include/anatree/.CMakeLists.txt.~undo-tree~
-- Installing: /usr/local/include/anatree/CMakeLists.txt
-- Installing: /usr/local/lib/cmake/anatree/anatree_config.cmake

This seems quite incorrect, as it was only intended for it to move the anatree.h file over.

Add private `find_node` helper member function

Right now, the contains(w) function is a light wrapper on the anagrams_of function. Yet, what it essentially is doing is just check the size of the node's set of words. Hence, it is wasteful to obtain an entire copy of all words.

Instead, we should add the following private helper function that traverses the subtree as in anagrams_of but merely returns the node.

const node_ptr& find_node(const word_t w) const

Then the anagrams_of function can return a copy of the words while contains can merely check directly in the node's word_set.

Superseeded words are returned in `keys()`

In the keys() function, we merge the words at the leaves. Yet, some of these keys can in fact be superseeded by the ones in the other.

Minimal Example

For { 'a', 'b', 'ab' } both 'ab' and 'b' is returned.
For { 'a', 'b', 'aab' } both 'aab' and 'b' is returned.

Tasks

Restrict the current algorithm to only be for keys(word_length)
- If one only wants keys of a specific length then the current algorithm already works with minor tweaks. Hence, we might think about also implementing a keys(size_t word_length) function.
Fix keys()
- The superseeding word is always obtained in the true branch (it is the same word (or a larger one) plus the additional char branched upon).
- The problem is itself very local/recursive: we do not need to know the prefix of how we got there. For example, in the true branch we also find the word 'b' (the word 'ab' excluding the current node) which blocks the word 'b' in the false branch.
  - That is, the 'a' can merely be added as part of the final return logic.
  - Yet, it's not necessarily only one letter: 'aab' also blocks 'b'; in that case we cannot do a string equality, but have to use a custom '<' operator.
  - All of this is achievable with a hashmap (suffix) -> word. The results of both recursions are added to a new hash map with the 'a' added in front. The content of the false branch is filtered based on the true branch. At the very end, the hash map is collapsed into a hash set.

Replace `std::shared_ptr` with `std::unique_ptr`

Currently, all nodes of the Anatree uses the std::shared_ptr. Yet, since there only ever is one parent the std::unique_ptr ought to suffice.

Replace std::shared_ptr with std::unique_ptr.

Additional Context

The reason it is as is now, is because the nodes share ownership with the traversing function. At that point in time, I was not sure exactly how to solve this - being at a gamejam, the minor overhead of reference counting was not really something I was worrying about.

For all but insert(w), the traversing function does not need to take ownership of a subtree. Hence, here we can merely borrow the pointer (using a reference &). For insert(w) we need to std::move(...) the ownership from the node to the function on the recursive call and then std::move(...) the result back into the node.

Add `size()` and `empty()`

We should improve the interface for an end-user.

Rename the current size() function to tree_size() (and its private variable accordingly).
Add a size() function that reflects the number of words in the tree.
Add empty() as a wrapper for size() == 0u.

And of course

Add unit tests

Add Insertion in Bulk

We currently provide the insert(w) function to add one word at a time. We should provide the templated member function insert(IT begin, const IT end) to add multiple elements from an iterator.

Of course, this should also be tested.

Add `==` (and `!=`) operator

Assuming the same ordering is used (which is part of the type anyways), I'm quite confident that our Anatree is a canonical structure. Hence, we can just (recursively) compare the two DAGs for isomorphism.

Add == operator calling a private and recursive subroutine that checks for isomorphism.
- Before running the recursive subroutine, check for trivial cases (inspired by Adiar's equality checking).
Add != operator being a simple wrapper on the == operator.

Of course, this should be properly unit tested.

Add `erase(w)` function

Add the erase(const word_t &w) function that removes a word (if it exists) and also removes irrelevant nodes (if needed) to keep the data structure canonical ( see #12 ).

erase(const word_t &w) to remove a single word.
erase(IT begin, const IT &end) to remove a set of words.

Variadic Outdegree of Nodes

Currently we have a chain of nodes with the same m_char to depict more than one occurence of the same character. Yet, we can improve the asymptotic running time by switching to a random-access list (e.g. a std::vector).

The question is, how can we achieve this while also ensuring as small a memory footprint as possible? If we change m_children to be a std::vector<std::shared_ptr<node>> then the vector would allocate space that is never used. Of course, we could let it have a smaller initial capacity that is more reasonable for language use-cases (i.e. a capacity of 3 or 4).

Addtional Context

This would decrease the running time of the following recursive subprocedures to be O(|Σ|) rather than O(|Σ| + w) where w is the length of the input word:

insert(w)
erase(w) ( #6 )
anagrams_of(w)
contains(w)

Improve Memory Usage for Empty Nodes

Currently, for each node we have a std::unordered_set<...> initialized. Yet, many of these nodes may be empty. Hence, we should move it behind a std::unique_ptr<...> to decrease the initialisation and memory cost of each node.

Yet, first we should look into:

How many empty nodes actually exist when creating an Anatree for common dictionaries?
Can the std::unordered_set<...> already be set up with no initial capacity?

Thread-Safety

Currently, only the .insert(w) and .erase(w) (see #6) functions actually change the graph (and hence require full ownership of the entire anagram tree). The remaining functions are read-only and can essentially be done in parallel.

We can add a read-write lock as described here

#include <shared_mutex>

using lock_t       = std::shared_mutex;
using write_lock_t = std::unique_lock<lock_t>;
using read_lock_t  = std::shared_lock<lock_t>;

To ensure there is no unneeded slowdown for (1) single-threaded implementations or (2) the anagram tree is created by a single thread to then be read by multiple threads, we need to allow the user to disable this. Hence, we should add the above locks inside of a concurrent_anatree template class which itself is decoratoring the anatree.

Add `has_anagram`, `has_superanagram`, and `has_subanagram`

We should provide the following three operations that are faster versions of deriving the entire set to then just .

has_anagram_of(w):
Trivial to implement after #19 is done.
has_superanagram_of(w):
A faster version of superanagrams_of(w) in #18 . This should traverse the same path but merely return a bool and shortcut the depth-first search as early as possible.
has_subanagram_of(w):
A faster version of subanagrams_of(w) . This should traverse the same path but merely return a bool and shortcut the depth-first search as early as possible.