intelpython / bearysta
Pandas-based statistics aggregation tool
License: Apache License 2.0
Here's some deeper discussion on a Python API for the aggregator.
Relates to #94, #104; also relates to #90, as releasing only the Python API might make it easier for us to support.
Current aggregator recipes are not super flexible. They basically force a certain workflow which requires many layers of indirection and boilerplate in configs. This generally makes me think too much about how to fix particular configs, e.g. by adding another layer of indirection to pivot rows onto columns or something like that (for examples, see PRs #96, #98, #99). While it's not too hard to do that for now, it's already very messy - note how many separate configs we have for random forests, logistic regression, and SVM!
A few problems with the current aggregator config structure include
Anton suggested that we could
For example, we could have these Python configs look like
from pyrforator import aggregate as agg
import pandas as pd

def recipe(path='data/sklearn*.csv', **other_options):
    # Read data; any remaining pandas read_csv options go through **other_options.
    # `preprocess` can be in the same format of regex -> (repl|drop|None),
    # or it could just be a function which returns the filtered line (or None to drop it)!
    df = agg.read_csv(path, preprocess={'@ Package': 'drop'}, **other_options)
    # Compare to native C.
    df['Ratio'] = agg.ratio_of(df, columns=['Prefix'], values=['Time'],
                               against=('Native-C',))
    return df

agg.run(recipe)
Config authors would then get access to both our functions and pandas functions, and anything else they need! We just have to package automated_benchmarks into a conda package.
I also want to make the Python API so simple to use that it basically supersedes the YAML configs: a user should really just be able to run `conda install pyrforator` and then write their config.
Currently, for reporting, a user must manually run `conda list` and other commands. Instead, it would be useful for the benchmark.py script itself to handle both collection and reporting of environment information.
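For instance, a rough sketch of what benchmark.py could collect automatically (the helper name, output file, and environment path below are made up; `conda list --json --prefix` is an existing conda command):

```python
import json
import platform
import subprocess

def collect_environment_info(env_prefix):
    # Capture what we currently gather by hand: the package list of the
    # benchmark environment plus basic interpreter/host information.
    packages = json.loads(subprocess.run(
        ['conda', 'list', '--json', '--prefix', env_prefix],
        check=True, capture_output=True, text=True).stdout)
    return {
        'python': platform.python_version(),
        'platform': platform.platform(),
        'packages': packages,
    }

# Dump alongside the benchmark results so the aggregator can pick it up.
with open('environment.json', 'w') as f:
    json.dump(collect_environment_info('./envs/bench'), f, indent=2)
```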
It might be possible to use something like uncertain-panda to allow propagating uncertainties. Otherwise, if we simply take the standard deviation, we would do that at every aggregation step, and since we aggregate the data multiple times before the end result, later steps see only one point in each category, so the sample standard deviation is meaningless (pandas reports NaN for a single sample, and ddof=0 gives zero).
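A quick illustration of the problem with plain pandas and made-up timings:

```python
import pandas as pd

# Raw timings: several repetitions per (benchmark, problem size).
raw = pd.DataFrame({
    'bench': ['Dot'] * 3 + ['Inv'] * 3,
    'N': [2] * 6,
    'time': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# First aggregation step: collapse repetitions to one mean per group.
means = raw.groupby(['bench', 'N'], as_index=False)['time'].mean()

# A later aggregation step now sees a single point per group, so the
# sample standard deviation is NaN (or 0 with ddof=0) -- the spread
# information is already gone.
print(means.groupby(['bench', 'N'])['time'].std())
```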
Now that we have a setup.py for bearysta, a conda recipe should not be too difficult.
We have a viable workaround, so this is not a major problem at the moment.
When we create environments, we manually specify Python libraries to be installed from pip for stock environments. If we also ask for benchmarks at the same time, inside the dependencies: section of environment.yml, conda will attempt to resolve the benchmarks' dependencies, which results in conda installing those from its channels. If we ask for benchmarks later with `conda install`, conda will still install the dependencies from conda channels.
(1). The current workaround has been to install benchmarks without their dependencies, i.e. with `conda install --no-deps [...]`. Since conda 4.6, conda complains about inconsistent environments when benchmarks are installed in this way.
(2). Another potential workaround is to change our benchmark recipes to not require any dependencies. While this might work, I don't think it's good style.
(3). Can we possibly have conda's dependency solver take into account installed packages from pip?
If this were the case, we could simply use `--override-channels` on the benchmark install command to force using the least-priority pip dependencies.
There is another caveat with (1). When we install benchmarks, conda still tries to run the dependency solver even though we asked for `--no-deps`. This means that it fails to install benchmarks if we don't specify the channels from which we pulled the dependencies. There is yet another workaround for this problem: add all the dependency channels in addition to the benchmark channels. We could also specify benchmark URLs directly, but that sounds like it could get messy quickly.
We could avoid conda-activating for every single command by simply running conda activate once and then passing the environment to all real benchmarking commands. Though this won't save us very much time, it will definitely look less messy.
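A rough sketch of that idea (assumes a bash login shell with conda initialized and GNU `env -0`; the environment name and benchmark command are made up):

```python
import subprocess

def activated_env(env_name):
    # Activate once, dump the resulting environment variables, and reuse
    # them for every benchmark command instead of re-activating each time.
    out = subprocess.run(
        ['bash', '-lc', f'conda activate {env_name} && env -0'],
        check=True, capture_output=True, text=True).stdout
    return dict(item.split('=', 1) for item in out.split('\0') if '=' in item)

env = activated_env('bench-env')
subprocess.run(['python', 'run_benchmark.py'], env=env, check=True)
```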
It would be useful to support integration with a number of other tools.
It would be useful to have output which allows TeamCity progress reporting.
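TeamCity picks up "service messages" printed to stdout, so a small helper in the runner should be enough; a sketch (the message text is made up):

```python
def teamcity_progress(message):
    # TeamCity service-message escaping: | ' [ ] and newlines are special.
    escaped = (message.replace('|', '||').replace("'", "|'")
                      .replace('[', '|[').replace(']', '|]')
                      .replace('\n', '|n'))
    print(f"##teamcity[progressMessage '{escaped}']", flush=True)

teamcity_progress("benchmark 3 of 12: sklearn kmeans")
```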
https://github.com/sharkdp/hyperfine
Probably trivial. We can just write an aggregator recipe and benchmark config.
It would be useful to be able to use airspeed velocity JSON as an input format to the aggregator. From older notes:
I can think of two approaches to generic JSON parsing.
Adding a user-specified JSON parser is a bit more difficult, because the users will have to somehow specify the schema used for the parsing. I have some ideas, but JSON outputs generally have complex structure (e.g. ibench outputs a map containing the prefix and a list of maps, each containing the problem, the problem size, and a list of the actual timings). Here's a simplified example of what that looks like...
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": [
0.04996657371520996,
1.2636184692382812e-05,
3.0994415283203125e-06
]
}
]
}
we would need...
If we want to convert an entire JSON file to a table, though, we could possibly follow this approach: expand everything except the outermost layer, starting with the innermost layer, and name each resulting column `key.columnname` for each columnname (or just `key` if columnname is the empty string). (Just like how we talk about matrices, an n x m table means a table with n rows and m columns.) For example, these would be the steps for transforming a fictitious JSON example into a table:
original JSON:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": [
1, 2, 3
]
},
{
"name": "Inv",
"N": 2,
"times": [
4, 5, 6
]
}
]
}
tabulating the innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": <3x1 table of times with empty string as column name>
},
{
"name": "Inv",
"N": 2,
"times": <3x1 table of times with empty string as column name>
}
]
}
tabulating the second innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
<3x3 table. name=Dot, N=2 for all, times=[1, 2, 3]>,
<3x3 table. name=Inv, N=2 for all, times=[4, 5, 6]>
]
}
tabulating the third innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": <6x3 table. name=[Dot, Dot, Dot, Inv, Inv, Inv], N=2, times=[1, 2, 3, 4, 5, 6]>
}
tabulating the fourth innermost element:
<6x5 table. name=intelpython3, date=2019-06-20, runs.name=[Dot, Dot, Dot, Inv, Inv, Inv], runs.N=2 for all, runs.times=[1, 2, 3, 4, 5, 6]>
Implementing this in Python is actually pretty easy! All we do is parse the entire JSON input, then recursively follow this procedure, replacing objects in place with pandas DataFrames.
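A minimal sketch of that procedure (the function name is made up, and it only handles the "one nested collection per level" shape shown above, not arbitrary JSON):

```python
import json
import pandas as pd

def tabulate(obj):
    # Bottom-up tabulation of parsed JSON: nested columns are named
    # 'key.columnname', or just 'key' when the inner column name is empty.
    if isinstance(obj, list):
        if obj and all(isinstance(x, (dict, list)) for x in obj):
            # List of objects: tabulate each element and stack the rows.
            return pd.concat([tabulate(x) for x in obj], ignore_index=True)
        # List of scalars: a single column whose name is the empty string.
        return pd.DataFrame({'': obj})
    if isinstance(obj, dict):
        scalars, nested = {}, None
        for key, value in obj.items():
            if isinstance(value, (dict, list)):
                inner = tabulate(value)
                inner.columns = [f'{key}.{c}' if c else key for c in inner.columns]
                nested = inner
            else:
                scalars[key] = value
        if nested is None:
            return pd.DataFrame([scalars])
        # Broadcast the scalar fields over every row of the nested table.
        for key, value in scalars.items():
            nested[key] = value
        return nested
    return pd.DataFrame({'': [obj]})  # bare scalar

example = json.loads('''{"name": "intelpython3", "date": "2019-06-20",
    "runs": [{"name": "Dot", "N": 2, "times": [1, 2, 3]},
             {"name": "Inv", "N": 2, "times": [4, 5, 6]}]}''')
print(tabulate(example))  # 6x5 table: runs.name, runs.N, runs.times plus name and date
```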
Need to investigate. Not sure if this is happening, but it seems like overrides work properly for the first environment in which we run benchmarks, but not in any later environments.
Currently, there is no way to use the "current" Python environment in the runner. It would be a good idea to use it if `--env-path` is not specified.
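Something like this in the runner's argument handling would probably do (the flag name is taken from the note above; the rest is illustrative):

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# Fall back to the environment the runner itself is executing in
# when no --env-path is given.
parser.add_argument('--env-path', default=sys.prefix)
args = parser.parse_args()
print(f'benchmarking against the environment at {args.env_path}')
```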
Currently, the runner creates a complicated directory hierarchy to organize benchmarks. This is both good and bad...
We should rethink this and other parts of the runner-aggregator interface to simplify what's going on and make it easier to do things like manually run the aggregator on a new collection of files that aren't organized in the complicated way.
Of course, adding a command-line option to the aggregator telling it to load files from a particular directory would work to solve this issue... but how would it work when config files reference each other? Would paths mentioned in aggregator recipes be relative to the specified path instead?
There's probably more considerations for this issue...
Sometimes generating datasets takes a long time. It would be nice to make dataset generation part of the runner's functionality so individual benchmarks don't have to spend time generating random numbers.
For verification of results, it would be useful to include raw numbers and benchmark recipes in the Excel/HTML outputs.
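For the Excel side, pandas already makes this straightforward; a sketch with made-up numbers (writing .xlsx requires an engine such as openpyxl):

```python
import pandas as pd

summary = pd.DataFrame({'bench': ['Dot', 'Inv'], 'median_time': [2.0, 5.0]})
raw = pd.DataFrame({'bench': ['Dot'] * 3 + ['Inv'] * 3,
                    'time': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
recipe_text = 'aggregate: [median]\naxis: [bench]'  # stand-in for the real recipe

with pd.ExcelWriter('report.xlsx') as writer:
    summary.to_excel(writer, sheet_name='summary', index=False)
    raw.to_excel(writer, sheet_name='raw', index=False)
    # Keep the recipe next to the numbers it produced, one line per row.
    pd.DataFrame({'recipe': recipe_text.splitlines()}).to_excel(
        writer, sheet_name='recipe', index=False)
```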