hatchet / hatchet

Analyze graph/hierarchical performance data using pandas dataframes

Home Page: https://hatchet.readthedocs.io

License: MIT License

Python 91.96% C 0.99% Roff 4.67% Shell 0.02% Cython 0.42% Elixir 1.76% C++ 0.18%
performance-analysis hierarchical-data comparative-analysis graphs trees hpc python performance data-analytics pandas

hatchet's Introduction

hatchet

[Badges: Build Status · Read the Docs · codecov · Code Style: Black · Join Slack]

Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing performance data that has a hierarchy (for example, serial or parallel profiles that represent calling context trees, call graphs, nested regions’ timers, etc.). Hatchet implements various operations to analyze a single hierarchical data set or compare multiple data sets, and its API facilitates analyzing such data programmatically.

To use hatchet, install it with pip:

$ pip install hatchet

Or, if you want to develop with this repo directly, run the install script from the root directory, which will build the Cython modules and add the cloned directory to your PYTHONPATH:

$ source install.sh
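Once installed, a minimal session looks like the sketch below. The database path is a placeholder, and the from_hpctoolkit and tree calls follow the examples further down this page:

import hatchet as ht

# Read a profile into a GraphFrame; the database path is hypothetical.
gf = ht.GraphFrame.from_hpctoolkit("path/to/hpctoolkit-database")

# The pandas dataframe is indexed by graph node.
print(gf.dataframe.head())

# Render the call tree with per-node metrics.
print(gf.tree())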

Documentation

See the Getting Started page for basic examples and usage. Full documentation is available in the User Guide.

Examples of performance analysis using hatchet are available here.

Contributing

Hatchet is an open source project. We welcome contributions via pull requests, and questions, feature requests, or bug reports via issues.

You can connect with the hatchet community on Slack. You can also reach the hatchet developers by email at: [email protected].

Authors

Many thanks go to Hatchet's contributors.

Hatchet was created by Abhinav Bhatele, [email protected].

Citing Hatchet

If you are referencing Hatchet in a publication, please cite the following paper:

  • Abhinav Bhatele, Stephanie Brink, and Todd Gamblin. Hatchet: Pruning the Overgrowth in Parallel Profiles. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). ACM, New York, NY, USA. DOI

License

Hatchet is distributed under the terms of the MIT license.

All contributions must be made under the MIT license. Copyrights in the Hatchet project are retained by contributors. No copyright assignment is required to contribute to Hatchet.

See LICENSE and NOTICE for details.

SPDX-License-Identifier: MIT

LLNL-CODE-741008

hatchet's People

Contributors

bhatele, cscully-allison, daboehme, dando18, dhruvnm, ilumsden, jarusified, jblaschke, jrmadsen, kawilliams, khuck, kisaacs, lithomas1, matthewkotila, ocnkr, omer-sharif, roastsea8, slabasan, tgamblin

hatchet's Issues

Add parameter to invert color scheme in tree printout

Add a user parameter to reverse the default color scheme when printing the graph representation. The current scheme colors nodes red if they are >90% and green if they are <10% (at least for the extreme ranges). This feature would reverse these, so that nodes are green if they are >90% and red if they are <10%. This is desirable for graphs that result from, for example, dividing one graph by another. A sketch of such an option follows below.
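A minimal sketch of what the option could look like; the invert parameter and the exact thresholds are illustrative, not Hatchet's actual printer code:

def pick_color(proportion, invert=False):
    # Map a node's fraction of the max metric to a color. With
    # invert=True the hot/cold ends swap, which is useful when a small
    # ratio (e.g., after dividing one graph by another) is the
    # interesting case.
    if invert:
        proportion = 1.0 - proportion
    if proportion > 0.90:
        return "red"
    if proportion < 0.10:
        return "green"
    return "yellow"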

Read hpctoolkit database without metric-db

Hi all,

When we were at the scalable tool workshop, we found that hatchet was not able to read an HPCToolkit database without metric-db files.

I would like to ask whether you have fixed that problem. If not, is there anything I can help with?

Non-deterministic behavior with the HPCToolkit reader

If HPCToolkit data is collected with helper threads, there will be extra .db files, and Hatchet appears to be non-deterministic about which ones it reads data from.

*Make a note of whether or not HPCToolkit data was collected with helper threads.

Console warning when max metric value is 0

/usr/gapps/spot/dev/hatchet/hatchet/external/console.py:218: RuntimeWarning: invalid value encountered in double_scalars
  proportion_of_total = metric / self.max_metric
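One way to avoid the warning, sketched below, is to guard the division when the max metric is zero; this is a suggestion, not the fix that was applied:

def proportion_of_total(metric, max_metric):
    # Avoid 0/0, which numpy reports as "invalid value encountered in
    # double_scalars" and which evaluates to NaN.
    if max_metric == 0:
        return 0.0
    return metric / max_metric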

Modify graphframe tree default behavior

The tree printout should show all nodes by default (currently, a node's metric must be larger than 0 for the node to be printed). Additionally, the tree output should be flexible enough to let users choose how many decimal places to print.

KeyError: 'time' Exception during gf.tree() (with gf built from HPCToolkit database)

KeyError: 'time' thrown during execution of gf.tree().
gf was built from an HPCToolkit database.
Does not depend on whether or not tracing is enabled for hpcrun.

HPCToolkit version 2020.03.01-release
Installed from spack (spack install hpctoolkit +mpi) (spack spec: [email protected]%[email protected]~all-static~bgq~cray~cuda+mpi~papi arch=linux-ubuntu18.04-skylake/zq7vk3u)

From this script: https://github.com/ian-bertolacci/HPCToolkit-Testing-Miniapp/blob/master/hatchet_hello_world.py

To reproduce, clone the [email protected]/ian-bertolacci/HPCToolkit-Testing-Miniapp.git repo, and run the below script in it:

make clean;
for HPC_TRACE in yes no;
do
  export HPC_TRACE;
  echo "============== Building with HPC_TRACE=${HPC_TRACE} ==============";
  make analyse;
done;
for d in *.hpcdatabase;
do
  echo "============== Reading database ${d} ==============";
  ./hatchet_hello_world.py ${d};
done;

Output of the above on my system:

rm -r *.hpcmeasurements
rm: cannot remove '*.hpcmeasurements': No such file or directory
Makefile:86: recipe for target 'clean-hpcmeasurement' failed
make: [clean-hpcmeasurement] Error 1 (ignored)
rm -r *.hpcdatabase *.hpcdatabase-*
rm: cannot remove '*.hpcdatabase': No such file or directory
rm: cannot remove '*.hpcdatabase-*': No such file or directory
Makefile:89: recipe for target 'clean-hpcdatabase' failed
make: [clean-hpcdatabase] Error 1 (ignored)
rm *.hpcstruct
rm: cannot remove '*.hpcstruct': No such file or directory
Makefile:83: recipe for target 'clean-hpcstruct' failed
make: [clean-hpcstruct] Error 1 (ignored)
rm miniapp
rm: cannot remove 'miniapp': No such file or directory
Makefile:75: recipe for target 'clean-exe' failed
make: [clean-exe] Error 1 (ignored)
rm *.o
rm: cannot remove '*.o': No such file or directory
Makefile:78: recipe for target 'clean-objs' failed
make: [clean-objs] Error 1 (ignored)
============== Building with HPC_TRACE=yes ==============
mpicc miniapp.c -o miniapp -O0 -gdwarf-2 -g3 -fopenmp -lm
hpcstruct -j 2 miniapp -o miniapp_threads-2.hpcstruct
mpirun --oversubscribe -np 4 hpcrun -t -o miniapp_procs-4_threads-2_n-elts-1000_trace-yes.hpcmeasurements ./miniapp 1000
Rank 0/4 with 2 OpenMP threads
Rank 1/4 with 2 OpenMP threads
Rank 2/4 with 2 OpenMP threads
Rank 3/4 with 2 OpenMP threads
N: 1000
Sum: 670.986631
hpcprof-mpi -S miniapp_threads-2.hpcstruct -I ./+ miniapp_procs-4_threads-2_n-elts-1000_trace-yes.hpcmeasurements --metric-db yes -o miniapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase
msg: STRUCTURE: [...redacted...]/HPCToolkit-Testing-Miniappminiapp
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/libhpcrun.so.0.0.0
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libmpi.so.40.10.4
msg: Line map : /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
msg: Line map : /lib/x86_64-linux-gnu/libc-2.27.so
msg: Line map : /lib/x86_64-linux-gnu/ld-2.27.so
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-rte.so.40.10.5
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-pal.so.40.10.6
msg: Populating Experiment database: [...redacted...]/HPCToolkit-Testing-Miniappminiapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase
============== Building with HPC_TRACE=no ==============
mpirun --oversubscribe -np 4 hpcrun  -o miniapp_procs-4_threads-2_n-elts-1000_trace-no.hpcmeasurements ./miniapp 1000
Rank 0/4 with 2 OpenMP threads
Rank 1/4 with 2 OpenMP threads
Rank 2/4 with 2 OpenMP threads
Rank 3/4 with 2 OpenMP threads
N: 1000
Sum: 670.986631
hpcprof-mpi -S miniapp_threads-2.hpcstruct -I ./+ miniapp_procs-4_threads-2_n-elts-1000_trace-no.hpcmeasurements --metric-db yes -o miniapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase
msg: STRUCTURE: [...redacted...]/HPCToolkit-Testing-Miniappminiapp
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/libhpcrun.so.0.0.0
msg: Line map : [...redacted...]/hpctoolkit-2020.03.01-zq7vk3umgvbevrocbh6tq5qisdbmoeoi/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libmpi.so.40.10.4
msg: Line map : /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
msg: Line map : /lib/x86_64-linux-gnu/libc-2.27.so
msg: Line map : /lib/x86_64-linux-gnu/ld-2.27.so
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-rte.so.40.10.5
msg: Line map : [...redacted...]/openmpi-3.1.6-l6nkfeoxvyywrkycb3r6ljau2bp7oitm/lib/libopen-pal.so.40.10.6
msg: Populating Experiment database: [...redacted...]/HPCToolkit-Testing-Miniappminiapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase
============== reading miniapp_procs-4_threads-2_n-elts-1000_trace-no_metric-db-yes.hpcdatabase ==============
                                                                CPUTIME (sec) (I)  ...                                             module
node                                               rank thread                     ...
{'name': '<no activity>', 'type': 'function'}      0    0                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        1                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        2                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        3                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                   1    0                0.000000  ...  [...redacted...]/spack/opt/spack/linux-ubu...
...                                                                           ...  ...                                                ...
{'file': '/build/glibc-OTsEL5/glibc-2.27/elf/dl... 2    3                0.000000  ...                                               None
                                                   3    0                0.063879  ...                                               None
                                                        1                0.000000  ...                                               None
                                                        2                0.000000  ...                                               None
                                                        3                0.000000  ...                                               None

[1248 rows x 8 columns]
Traceback (most recent call last):
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./hatchet_hello_world.py", line 17, in <module>
    main()
  File "./hatchet_hello_world.py", line 13, in main
    print(gf.tree())
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/graphframe.py", line 548, in tree
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 66, in trees_as_text
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 105, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1372, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1452, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 620, in _get_label
    return self.obj._xs(label, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'
============== reading miniapp_procs-4_threads-2_n-elts-1000_trace-yes_metric-db-yes.hpcdatabase ==============
                                                                CPUTIME (sec) (I)  ...                                             module
node                                               rank thread                     ...
{'name': '<no activity>', 'type': 'function'}      0    0                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        1                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        2                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                        3                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
                                                   1    0                     0.0  ...  [...redacted...]/spack/opt/spack/linux-ubu...
...                                                                           ...  ...                                                ...
{'file': '<unknown file> [libopen-rte.so.40.10.... 2    3                     0.0  ...                                               None
                                                   3    0                     0.0  ...                                               None
                                                        1                     0.0  ...                                               None
                                                        2                     0.0  ...                                               None
                                                        3                     0.0  ...                                               None

[1360 rows x 8 columns]
Traceback (most recent call last):
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./hatchet_hello_world.py", line 17, in <module>
    main()
  File "./hatchet_hello_world.py", line 13, in main
    print(gf.tree())
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/graphframe.py", line 548, in tree
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 66, in trees_as_text
    color=color,
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/hatchet/external/printtree.py", line 105, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1761, in __getitem__
    return self._getitem_tuple(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1271, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1372, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1452, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1964, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 620, in _get_label
    return self.obj._xs(label, axis=axis)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3537, in xs
    loc = self.index.get_loc(key)
  File "[...redacted...]/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'time'

Squash raises a warning with pandas v0.22 and error with pandas v0.24

https://github.com/LLNL/hatchet/blob/master/hatchet/graphframe.py#L266

index_names = [self.dataframe.index.names]
agg_df = new_dataframe.groupby(index_names).agg(agg_dict)

Code to reproduce the error:

from hatchet import *

data_path = "./CallFlow/data/hpctoolkit-cpi-database"
gf = GraphFrame()
gf.from_hpctoolkit(data_path)

def lookup(df, node):
    return df.loc[df['node'] == node]

def getMaxIncTime(gf):
    ret = 0.0
    for root in gf.graph.roots:
        ret = max(ret, lookup(gf.dataframe, root)['time (inc)'].max())
    return ret

max_inclusive_time = getMaxIncTime(gf)
filter_gf = gf.filter(lambda x: x['time (inc)'] > 0.1 * max_inclusive_time)
filter_gf.squash()

Error:

pandas v0.22

FutureWarning: 'rank' is both a column name and an index level.

pandas v0.24

ValueError: 'node' is both an index level and a column label, which is ambiguous.

Reference to the actual issue: pandas-dev/pandas#21080

One solution is to rename the index.names.
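A self-contained pandas sketch of the rename workaround, using toy data rather than Hatchet's actual squash code:

import pandas as pd

# 'node' is both an index level and a column, which groupby("node")
# rejects on pandas >= 0.24 with the ValueError quoted above.
df = pd.DataFrame({"node": ["a", "a", "b"], "time": [1.0, 2.0, 3.0]})
df = df.set_index("node", drop=False)

# Renaming the index level removes the ambiguity.
df.index = df.index.rename("node_idx")
print(df.groupby("node_idx").agg({"time": "sum"}))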

from_literal does not account for multi-rooted trees

https://github.com/LLNL/hatchet/blob/master/hatchet/graphframe.py

        root_callpath.append(graph_dict['name'])
        lit_idx = 0
        graph_root = Node(lit_idx, tuple(root_callpath), None)

        node_dicts = []
        node_dicts.append(dict({'nid': lit_idx, 'node': graph_root, 'name': graph_dict['name']}, **graph_dict['metrics']))
        lit_idx += 1

        # call recursively on all children of root
        if 'children' in graph_dict:
            for child in graph_dict['children']:
                parse_node_literal(child, graph_root, list(root_callpath))

        self.exc_metrics = []
        self.inc_metrics = []
        for key in graph_dict['metrics'].keys():
            if '(inc)' in key:
                self.inc_metrics.append(key)
            else:
                self.exc_metrics.append(key)

        self.graph = Graph([graph_root])
        self.dataframe = pd.DataFrame(data=node_dicts)
        self.dataframe.set_index(['node'], drop=False, inplace=True)

The function from_literal does not account for multiple root nodes (sometimes the call graph has multiple roots). A sketch of a multi-root variant follows below.
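A sketch of how the excerpt above could loop over a list of root dictionaries instead of assuming a single root. It reuses Node, Graph, parse_node_literal, and pd from that excerpt, and it is not the actual fix:

# graph_dicts: a list with one dictionary per root.
graph_roots = []
node_dicts = []
lit_idx = 0

for graph_dict in graph_dicts:
    graph_root = Node(lit_idx, (graph_dict['name'],), None)
    node_dicts.append(dict({'nid': lit_idx, 'node': graph_root,
                            'name': graph_dict['name']},
                           **graph_dict['metrics']))
    lit_idx += 1

    # call recursively on all children of this root; child numbering is
    # assumed to be handled inside parse_node_literal
    for child in graph_dict.get('children', []):
        parse_node_literal(child, graph_root, [graph_dict['name']])

    graph_roots.append(graph_root)

self.graph = Graph(graph_roots)  # Graph() already takes a list of roots
self.dataframe = pd.DataFrame(data=node_dicts)
self.dataframe.set_index(['node'], drop=False, inplace=True)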

Add a Way to Read and Write a Full GraphFrame to Disk

Now that we're adding more graph-based analysis functionality, it might be useful to consider adding the ability to read and write the entirety of a GraphFrame to disk for later use.

Currently, the possible file/storage formats I've found that we could use for this are:

  • GraphML (supported by many graph tools, including Graph Database services like Neo4J)
  • GraphSON (mainly used by Apache TinkerPop)
  • CSV (involves making multiple CSVs to store node data and relationships. See this Neo4J page for an example)
  • GEXF (developed for Gephi, very similar to GraphML)
  • GML (looks almost like a poor-man's GraphSON, JSON-like syntax but less extensible, ASCII-only data)

Alternatively, we could implement a hybrid solution that uses one of the graph-specific file formats above to store the relational/hierarchical graph data and something like CSV to store the DataFrame data.

Additionally, very long term, we could consider adding read/write support for GQL storage solutions. GQL is an in-development ISO standard based on SQL for graph data. It is currently planned to be officially published in August 2021.

If we decide to implement this, I'll update this first comment with a more extensive list of options for storage.

groupby behavior with string columns

Let's say a dataframe contains both numerical (e.g., time) and string (e.g., name) columns. After a squash, which does a groupby on the dataframe, the string columns are removed. Should we be preserving the string columns? These are useful (in particular, name) for printing the resulting tree. One possible fix is sketched after the example below.

>>> df
     angles  degrees      shape
foo                            
A         7       90     circle
C         4       90   triangle
A         9      145  rectangle
>>> df.groupby("foo").agg("sum")
     angles  degrees
foo                 
A        16      235
C         4       90
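One option is to pass a per-column aggregation dict so string columns survive the groupby; using "first" for the string column is an assumption about the desired semantics:

import pandas as pd

df = pd.DataFrame(
    {"foo": ["A", "C", "A"],
     "angles": [7, 4, 9],
     "degrees": [90, 90, 145],
     "shape": ["circle", "triangle", "rectangle"]}
).set_index("foo")

# Sum the numeric columns but keep a representative value for 'shape'.
print(df.groupby("foo").agg({"angles": "sum", "degrees": "sum", "shape": "first"}))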

Squash doesn't mark nodes for merging correctly in certain situations

I found this while working on testing the query language.

This is the description of the problem that I posted to Slack:

I was working on some tests for the new Hatchet query language when I ran into the following issue/edge case. The test involved the following query being applied to mock_graph_literal from tests/conftest.py:

query = [
    {"name": "bar"},
    ("*", {"time (inc)": "> 50"}),
    {"name": "gr[a-z]+", "time (inc)": "<= 10"}
]

(For reference, this query matches paths starting with a node named "bar" followed by 0 or more nodes with an inclusive time > 50 followed by a node with a name starting with "gr" and with an inclusive time of <= 10.)

What I expected to get from this query was a forest of three trees, each consisting of a root node with name "bar" and a single child node with name "grault" and with an inclusive time of <= 10. Instead, what I got was the following:

bar
|-> grault
|-> grault
-> grault

I tracked this issue down to the find_merges function in graph.py. This function loops over the nodes in the trimmed DAG and determines whether or not to merge the current node's children using the children's frames. However, this causes incorrect behavior for the children of any nodes that will be merged. In my test above, this algorithm causes the three "bar" nodes in the trimmed DAG to be marked for merging, but it does not mark the three "grault" nodes for merging.

This can be fixed by changing the merge determination algorithm to loop over the new DAG (the DAG with nodes merged) rather than the old DAG.
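A generic sketch of that fix: group children by frame while walking the already-merged structure, so that children of merged nodes are considered together. It assumes nodes expose a hashable .frame and a .children list, and it is not Hatchet's actual find_merges:

from collections import defaultdict

def find_merges(roots):
    merges = {}  # node -> the canonical node it merges into

    def visit(children):
        by_frame = defaultdict(list)
        for child in children:
            by_frame[child.frame].append(child)
        for group in by_frame.values():
            canonical = group[0]
            for extra in group[1:]:
                merges[extra] = canonical
            # Recurse over the union of the group's children, so e.g.
            # the "grault" nodes under merged "bar" nodes are seen
            # together and themselves marked for merging.
            visit([c for node in group for c in node.children])

    visit(roots)
    return merges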

pip install fails when hatchet is a dependency.

CallFlow uses hatchet as a dependency (see requirements.txt).

However, the install fails.

pip install -r requirements.txt inside CallFlow repository fails.

See the error log here: link.

It complains that hatchet is not able to install pandas, even though it is already installed (see #1).

To replicate this issue, either run:

pip uninstall hatchet

or create a virtualenv with hatchet not installed.

PS: This does not occur if hatchet is installed independently.

Dump snapshot of graphframe for checkpoint

After some operations, we may want to dump out the graph and dataframe so that we can reload them into hatchet at a later time, particularly when the operations or data ingestion take a long time. One simple sketch follows below.
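Pending a proper on-disk format, one option is to pickle the whole GraphFrame, assuming its graph and dataframe pickle cleanly; the helper names here are made up:

import pickle

def save_checkpoint(gf, path):
    # Serialize the entire GraphFrame (graph + dataframe) in one file.
    with open(path, "wb") as f:
        pickle.dump(gf, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)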

Node equivalence

Should Node __eq__ also be checking that nid and callpath are equal?
return (id(self) == id(other) or (self.nid == other.nid and self.callpath == other.callpath))

The current implementation only checks that the id()s of two Node objects are the same. If we are comparing two GraphFrame objects (from two different input files, for example), the id()s will not be the same.
return (id(self) == id(other))

I think we want to check that the Node member variables are also identical.
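As a sketch, the proposed equality could look like the following stripped-down Node, keeping only the members the issue mentions:

class Node:
    def __init__(self, nid, callpath, parent=None):
        self.nid = nid
        self.callpath = callpath
        self.parent = parent

    def __eq__(self, other):
        # Identity short-circuits; otherwise compare member variables,
        # so nodes read from two different input files can compare equal.
        return id(self) == id(other) or (
            self.nid == other.nid and self.callpath == other.callpath
        )

    def __hash__(self):
        # Keep hashing consistent with the member-based equality.
        return hash((self.nid, self.callpath))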

Single function for updating graph/dataframe _hatchet_nid

The current implementation has a graph-oriented function called enumerate_traverse(), which traverses each node and assigns an incremental _hatchet_nid. The second step is to use this information from the graph to update the corresponding rows in the dataframe. Sometimes multiple rows need to be updated, depending on what the index levels are: if the index levels are node and rank, then all rows with the same node (one per rank) will need to be updated with the _hatchet_nid.

Apply mean/min/max on dataframe to graph structure

We can use built-in pandas functions to compute the mean, min, max, etc. of the dataframe. Can we update the graph based on this aggregate? We could possibly add mean/min/max operands to the graphframe, but a challenge is deciding on a generic interface. A toy sketch of the dataframe side follows below.

Use case: I have several trials of the same execution, want to compute the average performance over all trials, and want to use the result as the "main" graphframe for doing comparisons across nightly runs.
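A toy pandas sketch of aggregating per-node times across hypothetical trials; how to push the aggregate back into the graph is the open question:

import pandas as pd

# Three trials of the same execution, one row per node (toy data).
trials = [
    pd.DataFrame({"node": ["main", "solve"], "time": t}).set_index("node")
    for t in ([10.0, 8.0], [12.0, 7.5], [11.0, 8.5])
]

# Stack the trials, then reduce with built-in pandas aggregates.
combined = pd.concat(trials, keys=range(len(trials)), names=["trial"])
print(combined.groupby(level="node").agg(["mean", "min", "max"]))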

subtract returns NaNs in columns

In the simplest case, if we read the same data set into two different GraphFrames, the columns and column names will be identical. However, subtract looks at the id of each node when subtracting across rows with the same column name. We need to unify the dataframes before taking a diff, as the toy example below illustrates.
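A toy illustration of the NaN behavior and the unify step, in plain pandas rather than Hatchet's subtract:

import pandas as pd

a = pd.DataFrame({"time": [5.0, 3.0]}, index=["main", "solve"])
b = pd.DataFrame({"time": [4.0, 1.0]}, index=["main", "setup"])

print(a - b)  # 'solve' and 'setup' come out as NaN: the rows don't align

# Unify the indices first, then subtract.
common = a.index.union(b.index)
print(a.reindex(common, fill_value=0.0) - b.reindex(common, fill_value=0.0))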

KeyError: 'the label [time] is not in the [index]'

Hello,

Printing the tree of caliper-ex.cali can be done by calling cali-query:

cali-query -q 'select function,sum(sum#time.duration),inclusive_sum(sum#time.duration) group by function format json-split' caliper-ex.cali > 0.json

and into hatchet:

>>> import hatchet as ht
>>> gf = ht.GraphFrame.from_caliper_json('./0.json')
>>> print(gf.tree())

but it will fail with:

Traceback (most recent call last):
  File "pandas/core/indexing.py", line 1790, in _validate_key
    error()
  File "pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [time] is not in the [index]'

The traceback is:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hatchet/graphframe.py", line 539, in tree
    color=color,
  File "hatchet/external/printtree.py", line 64, in trees_as_text
    color=color,
  File "hatchet/external/printtree.py", line 102, in as_text
    node_time = dataframe.loc[df_index, metric]
  File "pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "pandas/core/indexing.py", line 870, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "pandas/core/indexing.py", line 1027, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "pandas/core/indexing.py", line 1478, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "pandas/core/indexing.py", line 1911, in _getitem_axis
    self._validate_key(key, axis)
  File "pandas/core/indexing.py", line 1798, in _validate_key
    error()
  File "pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [time] is not in the [index]'

The solution is to rename the second column in the json file:

grep columns 0.json
"columns": [ "inclusive#sum#time.duration", "sum#time.duration", "path" ],

instead of:

grep columns 0.json
"columns": [ "inclusive#sum#time.duration", "sum#sum#time.duration", "path" ],

My version is:

cali-query --version

2.3.0

caliper-ex.json is actually fine. Am I using cali-query in the wrong way?

Thanks,

jg.

Reconstruct graphframe from outputted hatchet dataframe only

Instead of hatchet save() outputting a JSON file, see if we can construct the graphframe just from an outputted dataframe with additional columns:

  • parent
  • node type (e.g., statement, function, loop)
  • hierarchy cols (i.e., if the node type is statement, then the hierarchy cols are file and line; if the node type is function, then the hierarchy col is name)

Alternative solution for #114
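A minimal sketch of recovering the hierarchy from such a dataframe; the column names follow the list above, and the rows are toy data:

import pandas as pd

df = pd.DataFrame({
    "name":      ["main", "solve", "line 42"],
    "parent":    [None, "main", "solve"],
    "node type": ["function", "function", "statement"],
})

# Rebuild parent -> children edges from the 'parent' column.
children = df.dropna(subset=["parent"]).groupby("parent")["name"].apply(list)
print(children.to_dict())  # {'main': ['solve'], 'solve': ['line 42']}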
