chopstix's Introduction

ChopStiX

Extract representative microbenchmarks.

Quick start

Execute the following command to install ChopStiX:

./install.sh <INSTALLATION_DIRECTORY>

The command will perform all the necessary steps (including downloading specific requirements) and install ChopStiX in `<INSTALLATION_DIRECTORY>`.

Installation

At a minimum, you will need git, CMake, and a C++ toolchain to build ChopStiX.

To download and set up ChopStiX for installation, follow these steps:

git clone https://github.com/IBM/chopstix.git chopstix
cd chopstix
git submodule sync
git submodule update --init --recursive

If you download the repository contents directly from github.com as a compressed zip file, you also have to download the external dependencies and decompress them into the ./external directory.

Compilation

ChopStiX uses CMake as its build system. We provide a simple wrapper in the form of a configure script to offer a more accessible interface.

The basic build workflow is as follows:

mkdir build
cd build
../configure
make && make install

For more detailed information regarding configuration options, see the installation documentation.

Basic usage

ChopStiX saves all collected information in a local SQL database. By default it will save data to chop.db. Most commands have a -data option to change this path.

In general, you can invoke any command using chop <command>. For more information about a specific command, try chop help <command>. There are also some utility scripts (e.g. chop-marks), which are generally prefixed by chop-.

The basic workflow for ChopStiX is as follows:

chop sample ./my_app     # Sample invocation of ./my_app
chop disasm              # Detect and disassemble used object files
chop count               # Group and count samples per instruction
chop annotate            # Annotate control flow graph
chop search -target-coverage 90%   # Generate hottest paths
chop list paths          # List generated paths
chop text path -id <id>  # Show instructions for path with <id>

For a more detailed workflow example, see the documentation.

chopstix's Issues

System call detection is disabled during temporal/hybrid tracing

When temporal/hybrid tracing modes are used, the parent process detaches from the child and, as a result, it cannot track system calls (and split the trace accordingly). There is no need to detach; we just need to disable monitoring, register a SIGALRM handler, and re-enable monitoring once the signal is received.
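A minimal sketch of the proposed fix, assuming hypothetical disable_monitoring()/enable_monitoring() helpers standing in for the tracer's real toggles:

#include <csignal>
#include <unistd.h>

// Hypothetical stand-ins for the tracer's real monitoring toggles.
static void disable_monitoring() { /* stop tracking syscalls, stay attached */ }
static void enable_monitoring()  { /* resume tracking and trace splitting */ }

static void alarm_handler(int) {
    enable_monitoring();   // re-enable once the temporal window elapses
}

// Instead of detaching from the child, pause monitoring and arm a timer.
void start_temporal_window(unsigned seconds) {
    disable_monitoring();
    std::signal(SIGALRM, alarm_handler);
    alarm(seconds);
}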

Properly organize state extraction scripts

Create a cohesive interface to ChopStiX

We can implement two behaviors:

  • a) direct dump: everything is handled during the 'trace extraction' execution
    • dump registers to a file with the right format when starting an execution of a region of interest
    • append each page dump with the right format to the previous file on each page fault
    • dump the corresponding MPT file when finishing the execution of the region of interest
  • b) dump to the database + offline dumping
    • dump register values and page contents to the database
    • provide a post-mortem / offline command line interface to dump the MPT/MPS for a specific execution of a region of interest.

I'm inclined towards b), as this will allow us to consolidate all the data in a database. It will then be easier to start researching algorithms/approaches to prune/determine the right states to reproduce.
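A rough sketch of option b) using the SQLite C API (the table layout and function name are illustrative assumptions, not ChopStiX's actual schema):

#include <sqlite3.h>
#include <cstdint>

// Illustrative page dump into the trace database (schema is an assumption).
void dump_page(sqlite3 *db, int invocation, uint64_t addr,
               const void *data, int size) {
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS page_dump ("
        " invocation INTEGER, addr INTEGER, contents BLOB);",
        nullptr, nullptr, nullptr);
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "INSERT INTO page_dump VALUES (?, ?, ?);",
                       -1, &stmt, nullptr);
    sqlite3_bind_int(stmt, 1, invocation);
    sqlite3_bind_int64(stmt, 2, (sqlite3_int64)addr);
    sqlite3_bind_blob(stmt, 3, data, size, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

An offline command could then query these rows post-mortem to emit the MPT/MPS files for a specific invocation.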

Add support for nested instrumentation

Add support for nested/recursive instrumentation.

  • Temporal mode: nesting does not matter.
  • Region-based mode: start at the top-most call, and exit at the bottom-most return.
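
A minimal sketch of the region-based behavior, assuming a simple depth counter (function names are illustrative):

// Only instrument the outermost invocation of a nested region of interest.
static int region_depth = 0;

void on_region_entry() {
    if (region_depth++ == 0) {
        // top-most call: start tracing here
    }
}

void on_region_exit() {
    if (--region_depth == 0) {
        // bottom-most return: stop tracing here
    }
}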

Check condition code for branches

Branches are classified as conditional or unconditional when generating the CFG.
Improve that classification to check the condition code, so that conditional branches whose condition is always taken/never taken are recognized as unconditional branches in the CFG (see the sketch after the list below).

Support to be added to:

  • ppc64
  • ppc64le
  • s390x
  • riscv
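
As an illustration for ppc64/ppc64le (one possible check, not the project's actual decoder), a bc-form branch whose BO field matches the "branch always" pattern 1z1zz can be reclassified as unconditional:

#include <cstdint>

// Sketch for ppc64/ppc64le: a conditional branch (bc, primary opcode 16)
// whose BO field matches 1z1zz branches unconditionally (both the CR test
// and the CTR test are disabled).
bool is_effectively_unconditional(uint32_t insn) {
    uint32_t opcode = insn >> 26;
    if (opcode != 16) return false;     // not a bc-form branch
    uint32_t bo = (insn >> 21) & 0x1F;  // BO field, instruction bits 6..10
    return (bo & 0b10100) == 0b10100;   // "branch always" encoding
}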

Flexible control of the invocations of regions of interest to be sampled

Selective sampling of invocations can be implemented on both "sides":

  • Tracer: tracer decides if a particular execution of the region of interest needs to be sampled.
    • Pros: Less overhead if we decide to not sample.
    • Cons: Less information. The tracer does not have information about previously sampled regions. As a result, only random or simple (e.g. first 10) sampling methods can be implemented (no intelligent sampling).
  • Tracee: the tracee decides if a particular execution of the region of interest needs to be sampled.
    • Cons: Some overhead even if we decide not to sample.
    • Pros: The tracee can use existing samples (e.g. historical data) to decide whether or not to sample a particular region.

I do think that having both would be beneficial. On one side, we can use tracer-side sampling to do a first-level pruning of the executions (e.g. from 1 million to 1 thousand), and then use the extra information available to the tracee to do a second, more intelligent filtering, as sketched below.
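
A sketch of that two-level scheme (all names are illustrative):

#include <cstdlib>

// First level, in the tracer: cheap, stateless random pruning.
bool tracer_should_sample() {
    return (std::rand() % 1000) == 0;   // e.g. keep roughly 1 in 1000
}

// Second level, in the tracee: may consult previously gathered samples.
bool tracee_should_sample(int samples_so_far, int target) {
    return samples_so_far < target;     // e.g. stop once we have enough
}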

Check whether it is possible to implement everything within a library, or if we need ptrace at all

One day we were discussing whether it'd be possible to implement everything in a library (no need for ptrace):

  • library init: register SIGILL and SIGSEGV handlers. Write illegal opcodes into the entry points of the region of interest.
  • SIGILL handler:
    • entry point: gets the context (it receives it), restores the illegal instruction, writes illegal code to the exit point, protects memory
    • exit point: restores the illegal instruction, writes illegal code to the entry point, unprotects memory.
  • SIGSEGV handler:
    • in charge of getting memory addresses, contents, and origin code, and unprotecting memory (if needed)

Is this implementable? If so, we'll avoid the ptrace 'hacks' ...

Pros: simple (all in one library), maybe less overhead?
Cons: less control of the target process (e.g. unable to track system calls)
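
A minimal sketch of the library-only approach, assuming SA_SIGINFO handlers (illustrative, not the project's code):

#include <csignal>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Page-fault handler: record the faulting page, then unprotect it so the
// interrupted access can be restarted.
static void segv_handler(int, siginfo_t *info, void *) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(pagesize - 1));
    // record the accessed address and page contents here
    mprotect(page, pagesize, PROT_READ | PROT_WRITE);
}

static void install_handlers() {
    struct sigaction sa = {};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    // the SIGILL handler for region entry/exit traps is installed the same way
}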

Error: run_example.sh: line 33: cx-trace: command not found

After running the provided tracing example shell script, the error "cx-trace: command not found" is reported, and I cannot find the cx-trace binary. I wonder whether it has been replaced by "chop trace".
die() {
    echo "$@" >&2
    exit 1
}

rm -rf data
mkdir data

export TEST_ITER=10000
export TEST_SIZE=10000

cx-trace "$@" ./daxpy

Implement different policies / handlers based on environment variables

Enable a configuration API through environment variables.

We would like to easily run the following types of experiments (copied from a note I sent some time ago):

  • the application executed by the profiler (baseline + tracing overhead)
  • the application executed by the profiler + only inserting traps (baseline + tracing overhead + trapping overhead)
  • the application executed by the profiler + inserting traps + protect all + unprotect all on 1st failure (baseline + tracing overhead + trapping overhead + minimum tracing overhead)
  • the application executed by the profiler + inserting traps + protect all + unprotect the page on failures (baseline + tracing overhead + trapping overhead + tracing overhead)
  • the application executed by the profiler + inserting traps + protect all + do not unprotect the page on failures (baseline + tracing overhead + trapping overhead + maximum tracing overhead)
  • the application executed by the profiler + inserting traps + protect all + unprotect the page on failures and write to disk (all IO)

To do so, we should be able to control which of these 'modes' to use via either a command line argument or an environment variable. The main profiler/driver will select the handler based on that.
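
A sketch of such a selection (the CHOPSTIX_TRACE_MODE variable and the mode names are assumptions for illustration):

#include <cstdlib>
#include <cstring>

enum class Mode {
    TrapsOnly, UnprotectAllOnFirstFault, UnprotectPage,
    NeverUnprotect, DumpToDisk
};

// Pick the memory-protection handler from an environment variable.
static Mode select_mode() {
    const char *env = std::getenv("CHOPSTIX_TRACE_MODE");
    if (!env) return Mode::UnprotectPage;  // default behavior
    if (!std::strcmp(env, "traps-only"))    return Mode::TrapsOnly;
    if (!std::strcmp(env, "unprotect-all")) return Mode::UnprotectAllOnFirstFault;
    if (!std::strcmp(env, "no-unprotect"))  return Mode::NeverUnprotect;
    if (!std::strcmp(env, "dump"))          return Mode::DumpToDisk;
    return Mode::UnprotectPage;
}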

./chopstix/src/trace/memory.cpp:94:21: error: ‘explicit_bzero’ was not declared in this scope

When I try to build this tool on an x86 machine, this problem happens:

@:~/source-code/chopstix/build$ make -j
[  1%] [  9%] [ 10%] Built target cx-support
Built target vector_add
[ 47%] [ 48%] Built target example-daxpy
Built target compile_queries
[ 51%] Built target chop-detrace
Built target fmt
[ 52%] [ 52%] Built target test-daxpy
Scanning dependencies of target cxtrace
Built target clustering
[ 62%] [ 65%] Built target compile_client
[ 66%] Built target chop-perf-invok
Built target chop-trace2mpt
[ 67%] [ 70%] Built target cx-database
Building CXX object src/trace/CMakeFiles/cxtrace.dir/memory.cpp.o
[ 88%] Built target cx-core
[ 89%] Built target cx-arch
[ 98%] Built target chop
/home/source-code/chopstix/src/trace/memory.cpp:94:21: error: ‘explicit_bzero’ was not declared in this scope
     (unsigned long)&explicit_bzero,
                     ^
src/trace/CMakeFiles/cxtrace.dir/build.make:77: recipe for target 'src/trace/CMakeFiles/cxtrace.dir/memory.cpp.o' failed
make[2]: *** [src/trace/CMakeFiles/cxtrace.dir/memory.cpp.o] Error 1
CMakeFiles/Makefile2:456: recipe for target 'src/trace/CMakeFiles/cxtrace.dir/all' failed
make[1]: *** [src/trace/CMakeFiles/cxtrace.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
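
For reference, explicit_bzero was only added in glibc 2.25; on older toolchains a local fallback along these lines (an illustrative sketch, not an official patch) can unblock the build:

#include <string.h>

#if defined(__GLIBC__) && (__GLIBC__ == 2 && __GLIBC_MINOR__ < 25)
// Fallback for glibc < 2.25: zero memory through a volatile pointer so the
// compiler cannot optimize the stores away.
static void explicit_bzero(void *p, size_t n) {
    volatile unsigned char *v = (volatile unsigned char *)p;
    while (n--) *v++ = 0;
}
#endif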

Provide more verbose information in chop sample command

bertran@perfectp8:~/microprobe_chop/tests$ chop sample -pid 128992 -timeout 20s
{19-06-2017 16:32:14} chop: Started monitor
{19-06-2017 16:32:33} chop: Stopped monitor

It would be good to print out the number of samples gathered, the sampling period, and the performance counters, as well as the output database file path, so that users can quickly identify whether anything is incorrect.

Remove install.sh script

Remove the install.sh script and use the new configure script.

Actions to do:

  • Remove script
  • Update CI framework

Implement support scripts to plot memory access pattern charts

Implement a support script to plot memory access pattern charts (useful for detecting access patterns for clustering).

This should work as in the case of CFG plotting: just basic support to give the user quick feedback on what types of access patterns are going on. No need to get too fancy; there are already plenty of plotting tools for that.

Detailed time-breakdown compilation mode

Enable a timing mode when compiling that collects the time spent in each phase.

Regarding the breakdown, I was thinking of the following categories:

  • Application time
  • Profiler time
    • Init time (fork + first instrumentation)
    • start region of interest time (fix traps and insert calls)
    • end region of interest time (fix traps and insert calls)
  • Support library time:
    • protect time
      • protect mem, system call time
      • IO time (write regs to disk)
      • other protect time (e.g. read all register state, etc.)
    • unprotect time
      • unprotect system call time
      • other unprotect time
    • sighandler time
      • IO time (write)
      • unprotect system call time
      • other sighandler time

Regarding how to implement it:

This should be a configuration option at compile time (e.g. -DTIMING in CMake), so that we can remove the timing overhead when needed. We should implement this using architecture-specific low-level mechanisms where possible (e.g. the STCKE instruction on Z, or MFTB on POWER) to read the time; much faster: 1 instruction vs. 1 system call. If TIMING is enabled, reports are generated on stdout by default, or in a file if the TIMING_REPORT_FILE environment variable is defined with a filename.
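
A minimal sketch of such a single-instruction timestamp read (the TIMING macro follows the issue; the function and fallbacks are illustrative):

#include <cstdint>

#ifdef TIMING
static inline uint64_t read_timestamp() {
#if defined(__powerpc64__)
    uint64_t tb;
    asm volatile("mftb %0" : "=r"(tb));            // POWER time base
    return tb;
#elif defined(__s390x__)
    uint64_t tod;
    asm volatile("stck %0" : "=Q"(tod) : : "cc");  // z TOD clock (STCKE for the extended format)
    return tod;
#elif defined(__x86_64__)
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));    // x86 time-stamp counter
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;  // fall back to a portable clock here if needed
#endif
}
#endif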

Collect the source code address of memory accesses

Currently we collect the access patterns (i.e. the target load/store address that generated the exception). We could also collect the origin of the access (i.e. the instruction address that generated the exception). This information can be valuable for the analysis of patterns and for clustering.

In src/trace/system.cpp, in the signal handler, there are a couple of macros to obtain the actual accessed address that generated the fault.
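
For illustration (a sketch, not the project's actual handler; register names vary per architecture):

#define _GNU_SOURCE   // for REG_RIP on glibc
#include <signal.h>
#include <ucontext.h>

// info->si_addr is the data address that faulted; the originating
// instruction address lives in the saved context (e.g. gregs[REG_RIP]
// on x86_64, gp_regs[PT_NIP] on ppc64).
static void handler(int, siginfo_t *info, void *ctx) {
    ucontext_t *uc = (ucontext_t *)ctx;
    void *data_addr = info->si_addr;   // target of the load/store
#if defined(__x86_64__)
    void *inst_addr = (void *)uc->uc_mcontext.gregs[REG_RIP];
#else
    void *inst_addr = nullptr;         // architecture-specific lookup here
#endif
    (void)data_addr; (void)inst_addr;  // record the pair for clustering
}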

Add SQL support to state extraction/tracer

The state should be saved in a database, instead of plain binary files.
Reasoning: This is both cleaner and a more portable approach.
Additionally, we could implement versioning? Maybe that should be a separate issue.

Questions:

  • What should be the DB layout?
  • Should the information be saved in the same database as the rest of dynamic information?
