Comments (15)

GoogleCodeExporter commented on May 5, 2024
+1

Also, I don't know the science behind the chosen cutoffs, but I just changed them from 0.9-1.1 to 0.5-2.0 -- maybe I went too lax on it, I dunno.

Original comment by [email protected] on 12 Jan 2010 at 7:24

GoogleCodeExporter commented on May 5, 2024
It's a sanity check. I think the best fix is to write a new measurer. Perhaps we abandon the "always run for 5 seconds" default, and instead watch the numbers and wait for them to converge. If they don't converge for a long time, then fail.

Original comment by limpbizkit on 13 Jan 2010 at 3:58
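
For illustration only, here is a rough sketch of what such a converge-then-stop measurer loop could look like, assuming a hypothetical timeReps callback that runs the benchmark body for a given rep count and returns elapsed nanoseconds; this is not Caliper's actual measurer:

```java
import java.util.function.IntToLongFunction;

final class ConvergingMeasurer {
  // Illustrative only: measure until successive per-rep estimates agree within a
  // tolerance, or fail after a maximum wall-clock budget.
  static double measureUntilConverged(IntToLongFunction timeReps) {
    final double tolerance = 0.01;                 // assumption: 1% relative change counts as converged
    final long maxBudgetNanos = 30_000_000_000L;   // assumption: hard cap of 30 seconds
    long start = System.nanoTime();
    int reps = 100;
    double previous = Double.NaN;
    while (System.nanoTime() - start < maxBudgetNanos) {
      double nanosPerRep = timeReps.applyAsLong(reps) / (double) reps;
      if (!Double.isNaN(previous)
          && Math.abs(nanosPerRep - previous) / previous < tolerance) {
        return nanosPerRep;                        // converged: stop early
      }
      previous = nanosPerRep;
      reps *= 2;                                   // grow the rep count between trials
    }
    throw new IllegalStateException("Measurement did not converge within the time budget");
  }
}
```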

GoogleCodeExporter commented on May 5, 2024
Here are my latest thoughts:

We don't need to think of this as a sanity check that fails your whole run; instead, we could calculate and report a "quality score" for each measurement, and let the user decide.

As we run post-warmup trials, instead of keeping the rep count fixed, we can let it vary somewhat within a range. The "score" I'm referring to evaluates how linear these (time, reps) pairs appear to be. (Optionally, if they're linear but with a positive y-intercept, we could consider that value to be the per-trial overhead and correct for it in the values reported?)

Now, an intelligent measurer can be configured with not only a maximum run length in seconds, but also an "early termination" score that, if achieved, will end the run early to save time.

And, in a multiple-measurements-per-scenario situation, we no longer have to just blindly average (or take the median of) the individual measurements; we can look at their quality scores and discard some of them, or weight them lower. Of course, I think that most often if one measurement has low quality, the rest probably do too, because it's likely an inherent problem in the way the benchmark is constructed.


Original comment by [email protected] on 13 Jan 2010 at 5:29
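
As an illustration of the idea above (this is not Caliper code; the class and method names are invented), a linearity-based quality score and the y-intercept interpretation could be computed from (reps, elapsed nanoseconds) pairs with an ordinary least-squares fit:

```java
/** Hypothetical sketch: score how linear a set of (reps, elapsed time) pairs is. */
final class QualityScore {

  /** Returns the squared correlation coefficient (R^2) of time vs. reps, in [0, 1]. */
  static double linearity(long[] reps, double[] nanos) {
    int n = reps.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
    for (int i = 0; i < n; i++) {
      double x = reps[i], y = nanos[i];
      sumX += x; sumY += y;
      sumXY += x * y; sumXX += x * x; sumYY += y * y;
    }
    double cov = n * sumXY - sumX * sumY;
    double varX = n * sumXX - sumX * sumX;
    double varY = n * sumYY - sumY * sumY;
    if (varX == 0 || varY == 0) return 0;   // degenerate: no spread in reps or times
    double r = cov / Math.sqrt(varX * varY);
    return r * r;                           // 1.0 means perfectly linear
  }

  /** Least-squares y-intercept: a rough estimate of fixed per-trial overhead in nanos. */
  static double perTrialOverheadNanos(long[] reps, double[] nanos) {
    int n = reps.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int i = 0; i < n; i++) {
      double x = reps[i], y = nanos[i];
      sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
    }
    double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    return (sumY - slope * sumX) / n;       // intercept of the fitted line
  }
}
```

Under this framing, a score of 1.0 means the trials scale perfectly with the rep count, and the intercept estimate could be subtracted from each trial before reporting per-rep times.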

GoogleCodeExporter commented on May 5, 2024
Incidentally, if it's a benchmark that allocates any memory, then it's unlikely that every timing interval will get interrupted by GC the same number of times for the same workload, so variation is inevitable.

Original comment by [email protected] on 22 Jan 2010 at 12:48

GoogleCodeExporter commented on May 5, 2024
I still think that varying the rep count and evaluating the linearity of the results would be a great thing to do.

Original comment by [email protected] on 7 Jun 2010 at 5:36

  • Added labels: Milestone-0.5

GoogleCodeExporter commented on May 5, 2024
With r132 I've added a basic sanity check that gives a friendly exception when the benchmark body is optimized away.

The only thing we're missing is a check for enh's reported error case: when the benchmark is close to, but not quite, linear; for example, when the number of reps affects the per-rep runtime. I'm unsure what the fix here will be.

Original comment by [email protected] on 6 Jul 2010 at 12:49

  • Changed state: Started

GoogleCodeExporter commented on May 5, 2024
It would be nice to simply report a 'quality score' for each measurement somewhere. Then just let users decide what to do.

So, if the results are absolutely linear, the quality score is 1. It's possible all we need here is the statistical correlation coefficient between the rep counts and elapsed time measurements.

Original comment by [email protected] on 8 Jul 2010 at 4:05

GoogleCodeExporter commented on May 5, 2024

Original comment by [email protected] on 8 Jul 2010 at 4:05

  • Changed title: Report a quality score for each measurement

GoogleCodeExporter commented on May 5, 2024

Original comment by [email protected] on 14 Jul 2010 at 11:33

  • Added labels: Milestone-1.0
  • Removed labels: Milestone-0.5

GoogleCodeExporter commented on May 5, 2024
Earlier I said "It's possible all we need here is the statistical correlation coefficient between the rep counts and elapsed time measurements." This is probably not true. Probably, we can vary the rep count just to check for concavity for sanity's sake, but the statistic the user will want to see really is the standard deviation.

Original comment by [email protected] on 17 Jul 2010 at 9:24
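
For reference, the statistic being suggested here is just the sample standard deviation of the per-rep times across measurements; a minimal sketch (not Caliper's reporting code):

```java
final class Stats {
  /** Sample standard deviation of per-rep times (nanoseconds) across measurements. */
  static double standardDeviation(double[] nanosPerRep) {
    int n = nanosPerRep.length;
    if (n < 2) {
      throw new IllegalArgumentException("need at least two measurements");
    }
    double mean = 0;
    for (double value : nanosPerRep) {
      mean += value / n;
    }
    double sumSquaredDeviations = 0;
    for (double value : nanosPerRep) {
      double d = value - mean;
      sumSquaredDeviations += d * d;
    }
    return Math.sqrt(sumSquaredDeviations / (n - 1));  // Bessel-corrected sample stddev
  }
}
```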

GoogleCodeExporter commented on May 5, 2024

Original comment by [email protected] on 19 Mar 2011 at 2:57

  • Changed state: Accepted

GoogleCodeExporter commented on May 5, 2024

Original comment by [email protected] on 19 Mar 2011 at 3:06

  • Added labels: Type-Enhancement

GoogleCodeExporter commented on May 5, 2024
In caliper 1.0 we gather and report all the relevant data (subject to the user's preferences in caliperrc). So even if we don't show a stddev on the screen, anyone who cares can always compute it later. Tagging post-1.0.

Original comment by [email protected] on 14 Nov 2011 at 8:44

  • Added labels: Milestone-Post-1.0
  • Removed labels: Milestone-1.0

GoogleCodeExporter commented on May 5, 2024

Original comment by [email protected] on 8 Feb 2012 at 9:49

  • Added labels: Component-Runner

GoogleCodeExporter commented on May 5, 2024
I no longer think we want/need to do this.

Original comment by [email protected] on 16 May 2012 at 11:27

  • Changed state: WontFix
