Comments (4)
If we don’t filter the thresholds, then probably the first bin’s bin.x0 and the last bin’s bin.x1 should be undefined. This would be consistent with threshold.invertExtent from d3-scale. But this would need a major version bump, and while perhaps more flexible, it also makes it more cumbersome in the common case because you’d need to handle the unbounded extreme bins specially when rendering the histogram; see the example.
So, I think it probably makes more sense to keep the current behavior where the histogram always has a finite domain (either set explicitly or determined from the data), and then the thresholds (even if set explicitly) must be constrained to that domain.
Since the lower bound of the bin is inclusive, we should tweak the threshold filtering of the upper bound (as described in the earlier comment) so that the threshold 99 isn’t dropped if the domain is [0, 99]:
// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;
We should also document this behavior.
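To make the difference concrete, here is a minimal sketch comparing the current `>=` test with the proposed `>` test on the upper bound. The helper name filterThresholds and its keepUpper flag are hypothetical and not part of d3-array; only the two loop conditions come from the snippets in this thread.

```javascript
// Hypothetical helper (not a d3-array API): drops thresholds outside [x0, x1].
// keepUpper = false mimics the current `>= x1` test,
// keepUpper = true mimics the proposed `> x1` test.
function filterThresholds(thresholds, x0, x1, keepUpper) {
  const tz = thresholds.slice(); // copy so the caller's array is not mutated
  while (tz.length && tz[0] <= x0) tz.shift();
  while (tz.length && (keepUpper ? tz[tz.length - 1] > x1 : tz[tz.length - 1] >= x1)) tz.pop();
  return tz;
}

const all = Array.from({length: 100}, (_, i) => i); // [0, 1, …, 99]
const current = filterThresholds(all, 0, 99, false);
const proposed = filterThresholds(all, 0, 99, true);

console.log(current.length, current[current.length - 1]);    // 98 98
console.log(proposed.length, proposed[proposed.length - 1]); // 99 99
```

With the domain [0, 99], the proposed test keeps the threshold 99, so one extra bin is produced for values equal to the maximum.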
from d3-array.
I believe including the upper threshold value in the max bin is standard histogram behavior.
I wonder, would histogram libraries in the R or Python worlds give the same result?
Per the API reference:
Thresholds are defined as an array of values [x0, x1, …]. Any value less than x0 will be placed in the first bin; any value greater than or equal to x0 but less than x1 will be placed in the second bin; and so on. Thus, the generated histogram will have thresholds.length + 1 bins.
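As a rough illustration of the rule quoted above, the following sketch assigns values to bins from an explicit thresholds array. The function binByThresholds is a hypothetical stand-in written for this example, not the library's implementation, and it deliberately ignores histogram.domain.

```javascript
// Hypothetical helper: places each value in the bin whose index equals the
// number of thresholds that are less than or equal to the value.
function binByThresholds(values, thresholds) {
  const bins = Array.from({length: thresholds.length + 1}, () => []);
  for (const v of values) {
    let i = 0;
    while (i < thresholds.length && thresholds[i] <= v) ++i;
    bins[i].push(v);
  }
  return bins;
}

const exampleBins = binByThresholds([5, 10, 15, 20, 25], [10, 20]);
console.log(JSON.stringify(exampleBins)); // [[5],[10,15],[20,25]]
```

Two thresholds give three bins: values below 10, values in [10, 20), and values of 20 or more.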
But, what does not appear to be documented in the API reference is how histogram.domain affects the thresholds. From histogram.js:
// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] >= x1) tz.pop(), --m;
So, any threshold value that is less than or equal to domain[0] is dropped, as is any threshold value that is greater than or equal to domain[1].
Thus, if you have 100 thresholds [0, 1, 2, … 97, 98, 99] (d3.range(100)), and the default domain [0, 99] (d3.extent(d3.range(100))), you in fact only have 98 thresholds [1, 2, … 97, 98] because the 0 and 99 thresholds are removed.
This means the histogram will generate 99 (98 + 1) bins, with any value less than 1 ([0]) in the first bin (bins[0]), any value greater than or equal to 1 and less than 2 ([1]) in the second bin (bins[1]), and so on, with the 99th bin (bins[98]) containing any value greater than or equal to 98 ([98, 99]).
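The walk-through above can be reproduced without d3 in a few lines. This is only a sketch: the filtering loops are copied from the histogram.js excerpt, while the function simulateHistogram and the linear bin search are assumptions made for illustration.

```javascript
// Hypothetical simulation of the behavior described above (not d3's code).
function simulateHistogram(values, thresholds, x0, x1) {
  const tz = thresholds.slice();
  let m = tz.length;
  // Remove any thresholds outside the domain (as in the quoted excerpt).
  while (tz[0] <= x0) tz.shift(), --m;
  while (tz[m - 1] >= x1) tz.pop(), --m;

  // Assign each in-domain value to the bin after the last threshold <= value.
  const bins = Array.from({length: m + 1}, () => []);
  for (const v of values) {
    if (v < x0 || v > x1) continue;
    let i = 0;
    while (i < m && tz[i] <= v) ++i;
    bins[i].push(v);
  }
  return {thresholds: tz, bins};
}

const data = Array.from({length: 100}, (_, i) => i); // [0, 1, …, 99]
const result = simulateHistogram(data, data, 0, 99);

console.log(result.thresholds.length); // 98
console.log(result.bins.length);       // 99
console.log(result.bins[0]);           // [ 0 ]
console.log(result.bins[98]);          // [ 98, 99 ]
```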
I think it’s possible that the logic should be tweaked slightly since the upper bound of the bin is exclusive:
// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;
That would produce 99 thresholds [1, 2, … 98, 99] and 100 bins, [[0], [1], … [98], [99]]. But that’s a little weird because it means that the last bin can only contain values that are exactly equal to the maximum value. This change feels too tailored to the contrived example in question.
So, perhaps it makes more sense to only remove thresholds when those thresholds aren’t specified explicitly; that is, to only remove thresholds when they are specified as a count. This would also apply to the built-in threshold count estimators d3.thresholdFreedmanDiaconis, d3.thresholdScott and d3.thresholdSturges. Then if 100 threshold values are specified explicitly, you’d always get 101 bins regardless of the value of histogram.domain. However, depending on the domain and the input values, some of those bins would be empty, as expected. And the default would be [[], [0], [1], … [98], [99]]. I think this change would make a lot of sense, since when you don’t specify the threshold values explicitly, the specified count is only a hint anyway.
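A small sketch of what this proposal would yield for the running example, assuming explicit thresholds are used exactly as given. The function binUnfiltered is hypothetical and exists only to show the resulting bin layout.

```javascript
// Hypothetical helper: bins values against explicit thresholds with no
// domain-based filtering, as the proposal above suggests.
function binUnfiltered(values, thresholds) {
  const bins = Array.from({length: thresholds.length + 1}, () => []);
  for (const v of values) {
    let i = 0;
    while (i < thresholds.length && thresholds[i] <= v) ++i;
    bins[i].push(v);
  }
  return bins;
}

const range100 = Array.from({length: 100}, (_, i) => i); // [0, 1, …, 99]
const unfiltered = binUnfiltered(range100, range100);

console.log(unfiltered.length); // 101
console.log(unfiltered[0]);     // []
console.log(unfiltered[1]);     // [ 0 ]
console.log(unfiltered[100]);   // [ 99 ]
```

The first bin is empty because no value is less than the first threshold, 0.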
I’m still not 100% sure this is a good idea since the first and last bin’s bin.x0 and bin.x1 will be equal. Actually it’s worse than that because the first bin’s bin.x0 is set to domain[0] and the last bin.x1 is set to domain[1], which is wrong if the thresholds are outside or coincident with the domain endpoints.
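To see why the bounds become misleading, consider this sketch, which labels bins the way the comment describes: the first bin's x0 taken from domain[0] and the last bin's x1 taken from domain[1], with interior bounds taken from unfiltered thresholds. The function binBounds is an assumption for illustration, not the library's code.

```javascript
// Hypothetical helper: computes {x0, x1} for each bin when thresholds are
// not filtered against the domain.
function binBounds(thresholds, domain) {
  const [d0, d1] = domain;
  const m = thresholds.length;
  const bounds = [];
  for (let i = 0; i <= m; ++i) {
    bounds.push({
      x0: i > 0 ? thresholds[i - 1] : d0, // first bin starts at domain[0]
      x1: i < m ? thresholds[i] : d1      // last bin ends at domain[1]
    });
  }
  return bounds;
}

// Thresholds coincident with the domain endpoints: zero-width extreme bins.
const coincident = binBounds([0, 50, 99], [0, 99]);
console.log(coincident[0], coincident[3]); // { x0: 0, x1: 0 } { x0: 99, x1: 99 }

// Thresholds outside the domain: the extreme bins have x0 > x1.
const outside = binBounds([-5, 50, 120], [0, 99]);
console.log(outside[0], outside[3]); // { x0: 0, x1: -5 } { x0: 120, x1: 99 }
```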