Right now, I'm storing nodes in in a pre-order traversal that mirrors parsing the Newick string from left to right. As a result, I'm able to load the tree without loading the entire Newick string into memory, and I'm able to perform pre-order and post-order traversals extremely quickly (pre-order: iterate from 0 to n-1; post-order: iterate from n-1 to 0)
However, in this representation, I need to store all of every node's children, since there's no generic pattern for finding the children of a given node. As a result, every node has an entire vector
to store its children, where the size of that vector
is equal to the total number of children it has (and the vector
itself has additional memory overhead)
Instead, I should store the nodes in a level-order traversal so that 0 = root, then the next nodes are the children of the root, then the children of the first child of the root, then the children of the second child of the root, etc. In this representation, I can represent all of the children of a given node using just 2 numbers: the first and the last child (and everything in between can be assumed to also be a child)
To do this, I'll need to read the entire Newick string into memory up-front, and then I'll need to parse it from outside-in (rather than left-to-right). I was initially avoiding this to try to save memory, but for the Greengenes2 tree, the Newick string is ~500 MB, but peak memory usage to load it as a compact_tree
object is almost 2 GB. Thus, given that this new representation will dramatically reduce memory usage by avoiding having all of those vector
objects to store each node's children, I think it'll still be a net reduction in peak memory usage