hatchet / hatchet

Analyze graph/hierarchical performance data using pandas dataframes

Home Page: https://hatchet.readthedocs.io

License: MIT License

Python 91.96% C 0.99% Roff 4.67% Shell 0.02% Cython 0.42% Elixir 1.76% C++ 0.18%
performance-analysis hierarchical-data comparative-analysis graphs trees hpc python performance data-analytics pandas

hatchet's Introduction

hatchet

[Badges: Build Status · Read the Docs · codecov · Code Style: Black · Join Slack]

Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing performance data that has a hierarchy (for example, serial or parallel profiles that represent calling context trees, call graphs, nested regions’ timers, etc.). Hatchet implements various operations to analyze a single hierarchical data set or compare multiple data sets, and its API facilitates analyzing such data programmatically.

To use hatchet, install it with pip:

$ pip install hatchet

Or, if you want to develop with this repo directly, run the install script from the root directory, which will build the Cython modules and add the cloned directory to your PYTHONPATH:

$ source install.sh
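Once installed, a minimal session looks like the sketch below. The database path is a placeholder, and the from_hpctoolkit and tree calls follow the examples further down this page:

import hatchet as ht

# Read a profile into a GraphFrame; the database path is hypothetical.
gf = ht.GraphFrame.from_hpctoolkit("path/to/hpctoolkit-database")

# The pandas dataframe is indexed by graph node.
print(gf.dataframe.head())

# Render the call tree with per-node metrics.
print(gf.tree())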

Documentation

See the Getting Started page for basic examples and usage. Full documentation is available in the User Guide.

Examples of performance analysis using hatchet are available here.

Contributing

Hatchet is an open source project. We welcome contributions via pull requests, and questions, feature requests, or bug reports via issues.

You can connect with the hatchet community on Slack. You can also reach the hatchet developers by email at: [email protected].

Authors

Many thanks go to Hatchet's contributors.

Hatchet was created by Abhinav Bhatele, [email protected].

Citing Hatchet

If you are referencing Hatchet in a publication, please cite the following paper:

  • Abhinav Bhatele, Stephanie Brink, and Todd Gamblin. Hatchet: Pruning the Overgrowth in Parallel Profiles. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). ACM, New York, NY, USA. DOI

License

Hatchet is distributed under the terms of the MIT license.

All contributions must be made under the MIT license. Copyrights in the Hatchet project are retained by contributors. No copyright assignment is required to contribute to Hatchet.

See LICENSE and NOTICE for details.

SPDX-License-Identifier: MIT

LLNL-CODE-741008

hatchet's People

Contributors

bhatele, cscully-allison, daboehme, dando18, dhruvnm, ilumsden, jarusified, jblaschke, jrmadsen, kawilliams, khuck, kisaacs, lithomas1, matthewkotila, ocnkr, omer-sharif, roastsea8, slabasan, tgamblin

hatchet's Issues

Add parameter to invert color scheme in tree printout

Add a user parameter to reverse the default color scheme when printing the graph representation. The current scheme colors nodes red if they are >90% and green if they are <10% (at least for the extreme ranges). This feature would reverse these, so that nodes are green if they are >90% and red if they are <10%. This is desirable for graphs that result from, for example, dividing one graph by another. A sketch of such an option follows below.
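A minimal sketch of what the option could look like; the invert parameter and the exact thresholds are illustrative, not Hatchet's actual printer code:

def pick_color(proportion, invert=False):
    # Map a node's fraction of the max metric to a color. With
    # invert=True the hot/cold ends swap, which is useful when a small
    # ratio (e.g., after dividing one graph by another) is the
    # interesting case.
    if invert:
        proportion = 1.0 - proportion
    if proportion > 0.90:
        return "red"
    if proportion < 0.10:
        return "green"
    return "yellow"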

Read hpctoolkit database without metric-db

Hi all,

When we were at the scalable tool workshop, we found that hatchet was not able to read an HPCToolkit database without metric-db files.

I would like to ask whether you have fixed that problem. If not, is there anything I can help with?

Non-deterministic behavior with the HPCToolkit reader

If HPCToolkit data is collected with helper threads, there will be extra .db files, and Hatchet appears to be non-deterministic about which ones it reads data from.

*Make a note of whether or not HPCToolkit data was collected with helper threads.

Console warning when max metric value is 0

/usr/gapps/spot/dev/hatchet/hatchet/external/console.py:218: RuntimeWarning: invalid value encountered in double_scalars
  proportion_of_total = metric / self.max_metric
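One way to avoid the warning, sketched below, is to guard the division when the max metric is zero; this is a suggestion, not the fix that was applied:

def proportion_of_total(metric, max_metric):
    # Avoid 0/0, which numpy reports as "invalid value encountered in
    # double_scalars" and which evaluates to NaN.
    if max_metric == 0:
        return 0.0
    return metric / max_metric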

Modify graphframe tree default behavior

The tree printout should show all nodes by default (currently, a node's metric must be larger than 0 for the node to be printed). Additionally, the tree output should be flexible enough to let users choose how many decimal places to print.

KeyError: 'time' Exception during gf.tree() (with gf built from HPCToolkit database)

KeyError: 'time' thrown during execution of gf.tree().
gf was built from an HPCToolkit database.
Does not depend on whether or not tracing is enabled for hpcrun.

HPCToolkit version 2020.03.01-release
Installed from spack (spack install hpctoolkit +mpi) (spack spec: [email protected]%[email protected]~all-static~bgq~cray~cuda+mpi~papi arch=linux-ubuntu18.04-skylake/zq7vk3u)

From this script: https://github.com/ian-bertolacci/HPCToolkit-Testing-Miniapp/blob/master/hatchet_hello_world.py

To reproduce, clone the [email protected]/ian-bertolacci/HPCToolkit-Testing-Miniapp.git repo, and run the below script in it:

make clean;
for HPC_TRACE in yes no;
do
  export HPC_TRACE;
  echo "============== Building with HPC_TRACE=${HPC_TRACE} ==============";
  make analyse;
done;
for d in *.hpcdatabase;
do
  echo "============== Reading database ${d} ==============";
  ./hatchet_hello_world.py ${d};
done;

Output of the above on my system:

rm -r *.hpcmeasurements
rm: cannot remove '*.hpcmeasurements': No such file or directory
Makefile:86: recipe for target 'clean-hpcmeasurement' failed
make: [clean-hpcmeasurement] Error 1 (ignored)
rm -r *.hpcdatabase *.hpcdatabase-*
rm: cannot remove '*.hpcdatabase': No such file or directory
rm: cannot remove '*.hpcdatabase-*': No such file or directory
Makefile:89: recipe for target 'clean-hpcdatabase' failed
make: [clean-hpcdatabase] Error 1 (ignored)
rm *.hpcstruct
rm: cannot remove '*.hpcstruct': No such file or directory
Makefile:83: recipe for target 'clean-hpcstruct' failed
make: [clean-hpcstruct] Error 1 (ignored)
rm miniapp
rm: cannot remove 'miniapp': No such file or directory
Makefile:75: recipe for target 'clean-exe' failed
make: [clean-exe] Error 1 (ignored)
rm *.o
rm: cannot remove '*.o': No such file or directory
Makefile:78: recipe for target 'clean-objs' failed
make: [clean-objs] Error 1 (ignored)
============== Building with HPC_TRACE=yes ==============
mpicc miniapp.c -o miniapp -O0 -gdwarf-2 -g3 -fopenmp -lm
hpcstruct -j 2 miniapp -o miniapp_threads-2.hpcstruct
mpirun --oversubscribe -np 4 hpcrun -t -o miniapp_procs-4_threads-2_n-elts-1000_trace-yes.hpcmeasurements ./miniapp 1000
Rank 0/4 with 2 OpenMP threads
Rank 1/4 with 2 OpenMP threads
Rank 2/4 with 2 OpenMP threads
Rank 3/4 with 2 OpenMP threads
N: 1000
Sum: 670.986631
hpcprof-mpi -S miniapp_threads-2.hpcstruct -I ./+ miniapp_procs-4_threads-2_n-elts-1000_trace-yes.hpcmeasurements --metric-db yes -o miniapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase
msg: STRUCTURE: [...redacted...]/HPCToolkit-Testing-Miniappminiapp
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/libhpcrun.so.0.0.0
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libmpi.so.40.10.4
msg: Line map : /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
msg: Line map : /lib/x86_64-linux-gnu/libc-2.27.so
msg: Line map : /lib/x86_64-linux-gnu/ld-2.27.so
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-rte.so.40.10.5
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-pal.so.40.10.6
msg: Populating Experiment database: [...redacted...]/HPCToolkit-Testing-Miniappminiapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase
============== Building with HPC_TRACE=no ==============
mpirun --oversubscribe -np 4 hpcrun  -o miniapp_procs-4_threads-2_n-elts-1000_trace-no.hpcmeasurements ./miniapp 1000
Rank 0/4 with 2 OpenMP threads
Rank 1/4 with 2 OpenMP threads
Rank 2/4 with 2 OpenMP threads
Rank 3/4 with 2 OpenMP threads
N: 1000
Sum: 670.986631
hpcprof-mpi -S miniapp_threads-2.hpcstruct -I ./+ miniapp_procs-4_threads-2_n-elts-1000_trace-no.hpcmeasurements --metric-db yes -o miniapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase
msg: STRUCTURE: [...redacted...]/HPCToolkit-Testing-Miniappminiapp
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/libhpcrun.so.0.0.0
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libmpi.so.40.10.4
msg: Line map : /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
msg: Line map : /lib/x86_64-linux-gnu/libc-2.27.so
msg: Line map : /lib/x86_64-linux-gnu/ld-2.27.so
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-rte.so.40.10.5
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-pal.so.40.10.6
msg: Populating Experiment database: [...redacted...]/HPCToolkit-Testing-Miniappminiapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase
============== reading miniapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase ==============
                                                                CPUTIME (sec) (I)  ...                                             module
node                                               rank thread                     ...
{'name': '<no activity>', 'type': 'function'}      0    0                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        1                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        2                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        3                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                   1    0                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
...                                                                           ...  ...                                                ...
{'file': '/build/glibc-OTsEL5/glibc-2.27/elf/dl... 2    3                0.000000  ...                                               None
                                                   3    0                0.063879  ...                                               None
                                                        1                0.000000  ...                                               None
                                                        2                0.000000  ...                                               None
                                                        3                0.000000  ...                                               None

[1248 rows x 8 columns]
Traceback (most recent call last):
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./hatchet_hello_world.py", line 17, in <module>
    main()
  File "./hatchet_hello_world.py", line 13, in main
    print(gf.tree())
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/graphframe.py", line 548, in tree
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 66, in trees_as_text
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 105, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1372, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1452, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 620, in _get_label
    return self.obj._xs(label, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'
============== reading miniapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase ==============
                                                                CPUTIME (sec) (I)  ...                                             module
node                                               rank thread                     ...
{'name': '<no activity>', 'type': 'function'}      0    0                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        1                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        2                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        3                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                   1    0                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
...                                                                           ...  ...                                                ...
{'file': '<unknown file> [libopen-rte.so.40.10.... 2    3                     0.0  ...                                               None
                                                   3    0                     0.0  ...                                               None
                                                        1                     0.0  ...                                               None
                                                        2                     0.0  ...                                               None
                                                        3                     0.0  ...                                               None

[1360 rows x 8 columns]
Traceback (most recent call last):
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./hatchet_hello_world.py", line 17, in <module>
    main()
  File "./hatchet_hello_world.py", line 13, in main
    print(gf.tree())
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/graphframe.py", line 548, in tree
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 66, in trees_as_text
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 105, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1372, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1452, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 620, in _get_label
    return self.obj._xs(label, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

Squash raises a warning with pandas v0.22 and error with pandas v0.24

https://github.com/LLNL/hatchet/blob/master/hatchet/graphframe.py#L266

index_names = [self.dataframe.index.names]
agg_df = new_dataframe.groupby(index_names).agg(agg_dict)

Code to reproduce the error:

from hatchet import *

data_path = "./CallFlow/data/hpctoolkit-cpi-database"
gf = GraphFrame()
gf.from_hpctoolkit(data_path)

def lookup(df, node):
    return df.loc[df['node'] == node]

def getMaxIncTime(gf):
    ret = 0.0
    for root in gf.graph.roots:
        ret = max(ret, lookup(gf.dataframe, root)['time (inc)'].max())
    return ret

max_inclusive_time = getMaxIncTime(gf)
filter_gf = gf.filter(lambda x: x['time (inc)'] > 0.1 * max_inclusive_time)
filter_gf.squash()

Error:

pandas v0.22

FutureWarning: 'rank' is both a column name and an index level.

pandas v0.24

ValueError: 'node' is both an index level and a column label, which is ambiguous.

Reference to the actual issue: pandas-dev/pandas#21080

One solution is to rename the index.names.
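A self-contained pandas sketch of the rename workaround, using toy data rather than Hatchet's actual squash code:

import pandas as pd

# 'node' is both an index level and a column, which groupby("node")
# rejects on pandas >= 0.24 with the ValueError quoted above.
df = pd.DataFrame({"node": ["a", "a", "b"], "time": [1.0, 2.0, 3.0]})
df = df.set_index("node", drop=False)

# Renaming the index level removes the ambiguity.
df.index = df.index.rename("node_idx")
print(df.groupby("node_idx").agg({"time": "sum"}))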

from_literal does not account for multi-rooted trees

https://github.com/LLNL/hatchet/blob/master/hatchet/graphframe.py

        root_callpath.append(graph_dict['name'])
        lit_idx = 0
        graph_root = Node(lit_idx, tuple(root_callpath), None)

        node_dicts = []
        node_dicts.append(dict({'nid': lit_idx, 'node': graph_root, 'name': graph_dict['name']}, **graph_dict['metrics']))
        lit_idx += 1

        # call recursively on all children of root
        if 'children' in graph_dict:
            for child in graph_dict['children']:
                parse_node_literal(child, graph_root, list(root_callpath))

        self.exc_metrics = []
        self.inc_metrics = []
        for key in graph_dict['metrics'].keys():
            if '(inc)' in key:
                self.inc_metrics.append(key)
            else:
                self.exc_metrics.append(key)

        self.graph = Graph([graph_root])
        self.dataframe = pd.DataFrame(data=node_dicts)
        self.dataframe.set_index(['node'], drop=False, inplace=True)

The function from_literal does not account for multiple root nodes (sometimes the call graph has multiple roots). A sketch of a multi-root variant follows below.
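A sketch of how the excerpt above could loop over a list of root dictionaries instead of assuming a single root. It reuses Node, Graph, parse_node_literal, and pd from that excerpt, and it is not the actual fix:

# graph_dicts: a list with one dictionary per root.
graph_roots = []
node_dicts = []
lit_idx = 0

for graph_dict in graph_dicts:
    graph_root = Node(lit_idx, (graph_dict['name'],), None)
    node_dicts.append(dict({'nid': lit_idx, 'node': graph_root,
                            'name': graph_dict['name']},
                           **graph_dict['metrics']))
    lit_idx += 1

    # call recursively on all children of this root; child numbering is
    # assumed to be handled inside parse_node_literal
    for child in graph_dict.get('children', []):
        parse_node_literal(child, graph_root, [graph_dict['name']])

    graph_roots.append(graph_root)

self.graph = Graph(graph_roots)  # Graph() already takes a list of roots
self.dataframe = pd.DataFrame(data=node_dicts)
self.dataframe.set_index(['node'], drop=False, inplace=True)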

Add a Way to Read and Write a Full GraphFrame to Disk

Now that we're adding more graph-based analysis functionality, it might be useful to consider adding the ability to read and write the entirety of a GraphFrame to disk for later use.

Currently, the possible file/storage formats I've found that we could use for this are:

  • GraphML (supported by many graph tools, including Graph Database services like Neo4J)
  • GraphSON (mainly used by Apache TinkerPop)
  • CSV (involves making multiple CSVs to store node data and relationships. See this Neo4J page for an example)
  • GEXF (developed for Gephi, very similar to GraphML)
  • GML (looks almost like a poor-man's GraphSON, JSON-like syntax but less extensible, ASCII-only data)

Alternatively, we could implement a hybrid solution that uses one of the graph-specific file formats above to store the relational/hierarchical graph data and something like CSV to store the DataFrame data.

Additionally, very long term, we could consider adding read/write support for GQL storage solutions. GQL is an in-development ISO standard based on SQL for graph data. It is currently planned to be officially published in August 2021.

If we decide to implement this, I'll update this first comment with a more extensive list of options for storage.

groupby behavior with string columns

Let's say a dataframe contains both numerical (e.g., time) and string (e.g., name) columns. After a squash, which does a groupby on the dataframe, the string columns are removed. Should we be preserving the string columns? These are useful (in particular, name) for printing the resulting tree. One possible fix is sketched after the example below.

>>> df
     angles  degrees      shape
foo                            
A         7       90     circle
C         4       90   triangle
A         9      145  rectangle
>>> df.groupby("foo").agg("sum")
     angles  degrees
foo                 
A        16      235
C         4       90
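One option is to pass a per-column aggregation dict so string columns survive the groupby; using "first" for the string column is an assumption about the desired semantics:

import pandas as pd

df = pd.DataFrame(
    {"foo": ["A", "C", "A"],
     "angles": [7, 4, 9],
     "degrees": [90, 90, 145],
     "shape": ["circle", "triangle", "rectangle"]}
).set_index("foo")

# Sum the numeric columns but keep a representative value for 'shape'.
print(df.groupby("foo").agg({"angles": "sum", "degrees": "sum", "shape": "first"}))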

Squash doesn't mark nodes for merging correctly in certain situations

I found this while working on testing the query language.

This is the description of the problem that I posted to Slack:

I was working on some tests for the new Hatchet query language when I ran into the following issue/edge case. The test involved the following query being applied to mock_graph_literal from tests/conftest.py:

query = [
    {"name": "bar"},
    ("*", {"time (inc)": "> 50"}),
    {"name": "gr[a-z]+", "time (inc)": "<= 10"}
]

(For reference, this query matches paths starting with a node named "bar" followed by 0 or more nodes with an inclusive time > 50 followed by a node with a name starting with "gr" and with an inclusive time of <= 10.)

What I expected to get from this query was a forest of three trees, each consisting of a root node with name "bar" and a single child node with name "grault" and with an inclusive time of <= 10. Instead, what I got was the following:

bar
|-> grault
|-> grault
-> grault

I tracked this issue down to the find_merges function in graph.py. This function loops over the nodes in the trimmed DAG and determines whether or not to merge the current node's children using the children's frames. However, this causes incorrect behavior for the children of any nodes that will be merged. In my test above, this algorithm causes the three "bar" nodes in the trimmed DAG to be marked for merging, but it does not mark the three "grault" nodes for merging.

This can be fixed by changing the merge determination algorithm to loop over the new DAG (the DAG with nodes merged) rather than the old DAG.
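A generic sketch of that fix: group children by frame while walking the already-merged structure, so that children of merged nodes are considered together. It assumes nodes expose a hashable .frame and a .children list, and it is not Hatchet's actual find_merges:

from collections import defaultdict

def find_merges(roots):
    merges = {}  # node -> the canonical node it merges into

    def visit(children):
        by_frame = defaultdict(list)
        for child in children:
            by_frame[child.frame].append(child)
        for group in by_frame.values():
            canonical = group[0]
            for extra in group[1:]:
                merges[extra] = canonical
            # Recurse over the union of the group's children, so e.g.
            # the "grault" nodes under merged "bar" nodes are seen
            # together and themselves marked for merging.
            visit([c for node in group for c in node.children])

    visit(roots)
    return merges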

pip install fails when hatchet is a dependency.

CallFlow uses hatchet as a dependency (see requirements.txt).

However, the install fails.

pip install -r requirements.txt inside CallFlow repository fails.

See the error log here: link.

It complains that hatchet is not able to install pandas, even though it is already installed (see #1).

To replicate this issue, either run:

pip uninstall hatchet

or create a virtualenv with hatchet not installed.

PS: This does not occur if hatchet is installed independently.

Dump snapshot of graphframe for checkpoint

After some operations, we may want to dump out the graph and dataframe so that we can reload them into hatchet at a later time, particularly when the operations or data ingestion take a long time. One simple sketch follows below.
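Pending a proper on-disk format, one option is to pickle the whole GraphFrame, assuming its graph and dataframe pickle cleanly; the helper names here are made up:

import pickle

def save_checkpoint(gf, path):
    # Serialize the entire GraphFrame (graph + dataframe) in one file.
    with open(path, "wb") as f:
        pickle.dump(gf, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)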

Node equivalence

Should Node __eq__ also be checking that nid and callpath are equal?
return (id(self) == id(other) or (self.nid == other.nid and self.callpath == other.callpath))

The current implementation only checks that the id()s of two Node objects are the same. If we are comparing two GraphFrame objects (from two different input files, for example), the id()s will not be the same.
return (id(self) == id(other))

I think we want to check that the Node member variables are also identical.
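As a sketch, the proposed equality could look like the following stripped-down Node, keeping only the members the issue mentions:

class Node:
    def __init__(self, nid, callpath, parent=None):
        self.nid = nid
        self.callpath = callpath
        self.parent = parent

    def __eq__(self, other):
        # Identity short-circuits; otherwise compare member variables,
        # so nodes read from two different input files can compare equal.
        return id(self) == id(other) or (
            self.nid == other.nid and self.callpath == other.callpath
        )

    def __hash__(self):
        # Keep hashing consistent with the member-based equality.
        return hash((self.nid, self.callpath))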

Single function for updating graph/dataframe _hatchet_nid

The current implementation has a graph-oriented function called enumerate_traverse(), which traverses each node and assigns an incremental _hatchet_nid. The second step is to use this information from the graph to update the corresponding rows in the dataframe. Sometimes multiple rows need to be updated, depending on what the index levels are: if the index levels are node and rank, then all rows with the same node (one per rank) will need to be updated with the _hatchet_nid.

Apply mean/min/max on dataframe to graph structure

We can use built-in pandas functions to compute the mean, min, max, etc. of the dataframe. Can we update the graph based on this aggregate? We could possibly add mean/min/max operands to the graphframe, but a challenge is deciding on a generic interface. A toy sketch of the dataframe side follows below.

Use case: I have several trials of the same execution, want to compute the average performance over all trials, and want to use the result as the "main" graphframe for doing comparisons across nightly runs.
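A toy pandas sketch of aggregating per-node times across hypothetical trials; how to push the aggregate back into the graph is the open question:

import pandas as pd

# Three trials of the same execution, one row per node (toy data).
trials = [
    pd.DataFrame({"node": ["main", "solve"], "time": t}).set_index("node")
    for t in ([10.0, 8.0], [12.0, 7.5], [11.0, 8.5])
]

# Stack the trials, then reduce with built-in pandas aggregates.
combined = pd.concat(trials, keys=range(len(trials)), names=["trial"])
print(combined.groupby(level="node").agg(["mean", "min", "max"]))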

subtract returns NaNs in columns

In the simplest case, if we read the same data set into two different GraphFrames, the columns and column names will be identical. However, subtract looks at the id of each node when subtracting across rows with the same column name. We need to unify the dataframes before taking a diff, as the toy example below illustrates.
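A toy illustration of the NaN behavior and the unify step, in plain pandas rather than Hatchet's subtract:

import pandas as pd

a = pd.DataFrame({"time": [5.0, 3.0]}, index=["main", "solve"])
b = pd.DataFrame({"time": [4.0, 1.0]}, index=["main", "setup"])

print(a - b)  # 'solve' and 'setup' come out as NaN: the rows don't align

# Unify the indices first, then subtract.
common = a.index.union(b.index)
print(a.reindex(common, fill_value=0.0) - b.reindex(common, fill_value=0.0))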

KeyError: 'the label [time] is not in the [index]'

Hello,

Printing the tree of caliper-ex.cali can be done by calling cali-query:

cali-query -q 'select function,sum(sum#time.duration),inclusive_sum(sum#time.duration) group by function format json-split' caliper-ex.cali > 0.json

and into hatchet:

>>> import hatchet as ht
>>> gf = ht.GraphFrame.from_caliper_json('./0.json')
>>> print(gf.tree())

but it will fail with:

Traceback (most recent call last):
  File "pandas/core/indexing.py", line 1790, in _validate_key
    error()
  File "pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [time] is not in the [index]'

The traceback is:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hatchet/graphframe.py", line 539, in tree
    color=color,
  File "hatchet/external/printtree.py", line 64, in trees_as_text
    color=color,
  File "hatchet/external/printtree.py", line 102, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "pandas/core/indexing.py", line 870, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "pandas/core/indexing.py", line 1027, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "pandas/core/indexing.py", line 1478, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "pandas/core/indexing.py", line 1911, in _getitem_axis
    self._validate_key(key, axis)
  File "pandas/core/indexing.py", line 1798, in _validate_key
    error()
  File "pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [time] is not in the [index]'

The solution is to rename the second column in the json file:

grep columns 0.json
"columns": [ "inclusive#sum#time.duration", "sum#time.duration", "path" ],

instead of:

grep columns 0.json
"columns": [ "inclusive#sum#time.duration", "sum#sum#time.duration", "path" ],

My version is:

cali-query --version

2.3.0

caliper-ex.json is actually fine. Am I using cali-query in the wrong way?

Thanks,

jg.

Reconstruct graphframe from outputted hatchet dataframe only

Instead of hatchet save() outputting a JSON file, see if we can construct the graphframe just from an outputted dataframe with additional columns:

  • parent
  • node type (e.g., statement, function, loop)
  • hierarchy cols (i.e., if the node type is statement, then the hierarchy cols are file and line; if the node type is function, then the hierarchy col is name)

Alternative solution for #114
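A minimal sketch of recovering the hierarchy from such a dataframe; the column names follow the list above, and the rows are toy data:

import pandas as pd

df = pd.DataFrame({
    "name":      ["main", "solve", "line 42"],
    "parent":    [None, "main", "solve"],
    "node type": ["function", "function", "statement"],
})

# Rebuild parent -> children edges from the 'parent' column.
children = df.dropna(subset=["parent"]).groupby("parent")["name"].apply(list)
print(children.to_dict())  # {'main': ['solve'], 'solve': ['line 42']}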
