intelpython / bearysta
Pandas-based statistics aggregation tool
License: Apache License 2.0
Here's some deeper discussion on a Python API for the aggregator.
Relates to #94, #104; also relates to #90, as releasing only the Python API might make it easier for us to support.
Current aggregator recipes are not super flexible. They basically force a certain workflow which requires many layers of indirection and boilerplate in configs. This generally makes me think too much about how to fix particular configs, e.g. by adding another layer of indirection to pivot rows onto columns or something like that (for examples, see PRs #96, #98, #99). While it's not too hard to do that for now, it's already very messy - note how many separate configs we have for random forests, logistic regression, and SVM!
A few problems with the current aggregator config structure include
Anton suggested that we could
For example, we could have these Python configs look like
from pyrforator import aggregate as agg
import pandas as pd

def recipe(path='data/sklearn*.csv', **other_options):
    # Read data; any remaining pandas read_csv options go through **other_options.
    # `preprocess` can be in the same format of regex -> (repl|drop|None),
    # or it could just be a function which returns the filtered line (or None to drop it)!
    df = agg.read_csv(path, preprocess={'@ Package': 'drop'}, **other_options)
    # Compare to native C.
    df['Ratio'] = agg.ratio_of(df, columns=['Prefix'], values=['Time'],
                               against=('Native-C',))
    return df

agg.run(recipe)
Config authors would then get access to both our functions and pandas functions, and anything else they need! We just have to package automated_benchmarks into a conda package.
I also want to make the Python API so simple to use that it basically supersedes the YAML configs: a user should really just be able to run `conda install pyrforator` and then write their config.
Currently, for reporting, a user must manually run `conda list` and other commands. Instead, it would be useful for the benchmark.py script itself to handle both collection and reporting of environment information.
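For instance, a rough sketch of what benchmark.py could collect automatically (the helper name, output file, and environment path below are made up; `conda list --json --prefix` is an existing conda command):

```python
import json
import platform
import subprocess

def collect_environment_info(env_prefix):
    # Capture what we currently gather by hand: the package list of the
    # benchmark environment plus basic interpreter/host information.
    packages = json.loads(subprocess.run(
        ['conda', 'list', '--json', '--prefix', env_prefix],
        check=True, capture_output=True, text=True).stdout)
    return {
        'python': platform.python_version(),
        'platform': platform.platform(),
        'packages': packages,
    }

# Dump alongside the benchmark results so the aggregator can pick it up.
with open('environment.json', 'w') as f:
    json.dump(collect_environment_info('./envs/bench'), f, indent=2)
```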
It might be possible to use something like uncertain-panda to allow propagating uncertainties. Otherwise, if we simply take the standard deviation, we would do that at every aggregation step, and since we aggregate the data multiple times before the end result, later steps see only one point in each category, so the sample standard deviation is meaningless (pandas reports NaN for a single sample, and ddof=0 gives zero).
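A quick illustration of the problem with plain pandas and made-up timings:

```python
import pandas as pd

# Raw timings: several repetitions per (benchmark, problem size).
raw = pd.DataFrame({
    'bench': ['Dot'] * 3 + ['Inv'] * 3,
    'N': [2] * 6,
    'time': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# First aggregation step: collapse repetitions to one mean per group.
means = raw.groupby(['bench', 'N'], as_index=False)['time'].mean()

# A later aggregation step now sees a single point per group, so the
# sample standard deviation is NaN (or 0 with ddof=0) -- the spread
# information is already gone.
print(means.groupby(['bench', 'N'])['time'].std())
```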
Now that we have a setup.py for bearysta, a conda recipe should not be too difficult.
We have a viable workaround, so this is not a major problem at the moment.
When we create environments, we manually specify Python libraries to be installed from pip for stock environments. If we also ask for benchmarks at the same time, inside the dependencies: section of environment.yml, conda will attempt to resolve the benchmarks' dependencies, which results in conda installing those from its channels. If we ask for benchmarks later with `conda install`, conda will still install the dependencies from conda channels.
(1). The current workaround has been to install benchmarks without their dependencies, i.e. with `conda install --no-deps [...]`. Since conda 4.6, conda complains about inconsistent environments when benchmarks are installed in this way.
(2). Another potential workaround is to change our benchmark recipes to not require any dependencies. While this might work, I don't think it's good style.
(3). Can we possibly have conda's dependency solver take into account installed packages from pip?
If this were the case, we could simply use `--override-channels` on the benchmark install command to force using the least-priority pip dependencies.
There is another caveat with (1). When we install benchmarks, conda still tries to run the dependency solver even though we asked for `--no-deps`. This means that it fails to install benchmarks if we don't specify the channels from which we pulled the dependencies. There is yet another workaround for this problem: add all the dependency channels in addition to the benchmark channels. We could also specify benchmark URLs directly, but that sounds like it could get messy quickly.
We could avoid conda-activating for every single command by simply running conda activate once and then passing the environment to all real benchmarking commands. Though this won't save us very much time, it will definitely look less messy.
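A rough sketch of that idea (assumes a bash login shell with conda initialized and GNU `env -0`; the environment name and benchmark command are made up):

```python
import subprocess

def activated_env(env_name):
    # Activate once, dump the resulting environment variables, and reuse
    # them for every benchmark command instead of re-activating each time.
    out = subprocess.run(
        ['bash', '-lc', f'conda activate {env_name} && env -0'],
        check=True, capture_output=True, text=True).stdout
    return dict(item.split('=', 1) for item in out.split('\0') if '=' in item)

env = activated_env('bench-env')
subprocess.run(['python', 'run_benchmark.py'], env=env, check=True)
```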
It would be useful to support integration with a number of other tools.
It would be useful to have output which allows TeamCity progress reporting.
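TeamCity picks up "service messages" printed to stdout, so a small helper in the runner should be enough; a sketch (the message text is made up):

```python
def teamcity_progress(message):
    # TeamCity service-message escaping: | ' [ ] and newlines are special.
    escaped = (message.replace('|', '||').replace("'", "|'")
                      .replace('[', '|[').replace(']', '|]')
                      .replace('\n', '|n'))
    print(f"##teamcity[progressMessage '{escaped}']", flush=True)

teamcity_progress("benchmark 3 of 12: sklearn kmeans")
```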
https://github.com/sharkdp/hyperfine
Probably trivial. We can just write an aggregator recipe and benchmark config.
It would be useful to be able to use airspeed velocity JSON as an input format to the aggregator. From older notes:
I can think of two approaches to generic JSON parsing.
Adding a user-specified JSON parser is a bit more difficult, because the users will have to somehow specify the schema used for the parsing. I have some ideas, but JSON outputs generally have complex structure (e.g. ibench outputs a map containing the prefix and a list of maps, each containing the problem, the problem size, and a list of the actual timings). Here's a simplified example of what that looks like...
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": [
0.04996657371520996,
1.2636184692382812e-05,
3.0994415283203125e-06
]
}
]
}
we would need...
If we want to convert an entire JSON file to a table, though, we could possibly follow this approach: expand everything except the outermost layer, starting with the innermost layer, and name each resulting column `key.columnname` for each columnname (or just `key` if columnname is the empty string). (Just like how we talk about matrices, an n x m table means a table with n rows and m columns.) For example, these would be the steps for transforming a fictitious JSON example into a table:
original JSON:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": [
1, 2, 3
]
},
{
"name": "Inv",
"N": 2,
"times": [
4, 5, 6
]
}
]
}
tabulating the innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
{
"name": "Dot",
"N": 2,
"times": <3x1 table of times with empty string as column name>
},
{
"name": "Inv",
"N": 2,
"times": <3x1 table of times with empty string as column name>
}
]
}
tabulating the second innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": [
<3x3 table. name=Dot, N=2 for all, times=[1, 2, 3]>,
<3x3 table. name=Inv, N=2 for all, times=[4, 5, 6]>
]
}
tabulating the third innermost element:
{
"name": "intelpython3",
"date": "2019-06-20",
"runs": <6x3 table. name=[Dot, Dot, Dot, Inv, Inv, Inv], N=2, times=[1, 2, 3, 4, 5, 6]>
}
tabulating the fourth innermost element:
<6x5 table. name=intelpython3, date=2019-06-20, runs.name=[Dot, Dot, Dot, Inv, Inv, Inv], runs.N=2 for all, runs.times=[1, 2, 3, 4, 5, 6]>
Implementing this in Python is actually pretty easy! All we do is parse the entire JSON input, then recursively follow this procedure, replacing objects in place with pandas DataFrames.
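A minimal sketch of that procedure (the function name is made up, and it only handles the "one nested collection per level" shape shown above, not arbitrary JSON):

```python
import json
import pandas as pd

def tabulate(obj):
    # Bottom-up tabulation of parsed JSON: nested columns are named
    # 'key.columnname', or just 'key' when the inner column name is empty.
    if isinstance(obj, list):
        if obj and all(isinstance(x, (dict, list)) for x in obj):
            # List of objects: tabulate each element and stack the rows.
            return pd.concat([tabulate(x) for x in obj], ignore_index=True)
        # List of scalars: a single column whose name is the empty string.
        return pd.DataFrame({'': obj})
    if isinstance(obj, dict):
        scalars, nested = {}, None
        for key, value in obj.items():
            if isinstance(value, (dict, list)):
                inner = tabulate(value)
                inner.columns = [f'{key}.{c}' if c else key for c in inner.columns]
                nested = inner
            else:
                scalars[key] = value
        if nested is None:
            return pd.DataFrame([scalars])
        # Broadcast the scalar fields over every row of the nested table.
        for key, value in scalars.items():
            nested[key] = value
        return nested
    return pd.DataFrame({'': [obj]})  # bare scalar

example = json.loads('''{"name": "intelpython3", "date": "2019-06-20",
    "runs": [{"name": "Dot", "N": 2, "times": [1, 2, 3]},
             {"name": "Inv", "N": 2, "times": [4, 5, 6]}]}''')
print(tabulate(example))  # 6x5 table: runs.name, runs.N, runs.times plus name and date
```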
Need to investigate. Not sure if this is happening, but it seems like overrides work properly for the first environment in which we run benchmarks, but not in any later environments.
Currently, there is no way to use the "current" Python environment in the runner. It would be a good idea to use it if `--env-path` is not specified.
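Something like this in the runner's argument handling would probably do (the flag name is taken from the note above; the rest is illustrative):

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# Fall back to the environment the runner itself is executing in
# when no --env-path is given.
parser.add_argument('--env-path', default=sys.prefix)
args = parser.parse_args()
print(f'benchmarking against the environment at {args.env_path}')
```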
Currently, the runner creates a complicated directory hierarchy to organize benchmarks. This is both good and bad...
We should rethink this and other parts of the runner-aggregator interface to simplify what's going on and make it easier to do things like manually run the aggregator on a new collection of files that aren't organized in the complicated way.
Of course, adding a command-line option to the aggregator telling it to load files from a particular directory would work to solve this issue... but how would it work when config files reference each other? Would paths mentioned in aggregator recipes be relative to the specified path instead?
There's probably more considerations for this issue...
Sometimes generating datasets takes a long time. It would be nice to make dataset generation part of the runner's functionality so individual benchmarks don't have to spend time generating random numbers.
For verification of results, it would be useful to include raw numbers and benchmark recipes in the Excel/HTML outputs.
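For the Excel side, pandas already makes this straightforward; a sketch with made-up numbers (writing .xlsx requires an engine such as openpyxl):

```python
import pandas as pd

summary = pd.DataFrame({'bench': ['Dot', 'Inv'], 'median_time': [2.0, 5.0]})
raw = pd.DataFrame({'bench': ['Dot'] * 3 + ['Inv'] * 3,
                    'time': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
recipe_text = 'aggregate: [median]\naxis: [bench]'  # stand-in for the real recipe

with pd.ExcelWriter('report.xlsx') as writer:
    summary.to_excel(writer, sheet_name='summary', index=False)
    raw.to_excel(writer, sheet_name='raw', index=False)
    # Keep the recipe next to the numbers it produced, one line per row.
    pd.DataFrame({'recipe': recipe_text.splitlines()}).to_excel(
        writer, sheet_name='recipe', index=False)
```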