Comments (6)

jpivarski commented on June 11, 2024

Good catch! Actually, this is intentional, and I was a little worried about how this subtlety would affect users, but couldn't accept the NumPy behavior without a caveat I'll describe below.

First of all, NumPy broadcasts single-element lists into plural-element lists:

>>> np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + np.array([[10], [20]])
array([[11.1, 12.2, 13.3],
       [24.4, 25.5, 26.6]])

and as a side note, to explain why I say, "plural,"

>>> np.array([[], []]) + np.array([[10], [20]])
array([], shape=(2, 0), dtype=float64)
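
The shape rule behind both results can be checked without doing any arithmetic. This is a minimal sketch, assuming NumPy 1.20 or later (for `np.broadcast_shapes`); the variable names are only illustrative:

```python
import numpy as np

# A length-1 dimension stretches to match the other operand's length,
# whatever that length happens to be (3 or 0 below).
shape_plural = np.broadcast_shapes((2, 3), (2, 1))
shape_empty = np.broadcast_shapes((2, 0), (2, 1))

print(shape_plural)  # (2, 3)
print(shape_empty)   # (2, 0)
```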

If you make Awkward arrays the normal way (with the ak.Array constructor, which calls ak.from_iter in the case below), you see that you get an error in this case:

>>> ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + ak.Array([[10], [20]])
...
ValueError: in ListOffsetArray64, cannot broadcast nested list

So how can Awkward arrays be a generalization of NumPy, if they don't do what NumPy does? Well, consider constructing them using NumPy arrays (the constructor, when passed NumPy arrays, calls ak.from_numpy):

>>> ak.Array(np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])) + ak.Array(np.array([[10], [20]]))
<Array [[11.1, 12.2, 13.3, ... 25.5, 26.6]] type='2 * 3 * float64'>

Now it works! What's the difference?

In the first case, the arrays are jagged: the lists they contain can have any length, even though in this case they happen to have the same length. This shows up as var in the data types:

>>> ak.type(ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]))
2 * var * float64
>>> ak.type(ak.Array([[10], [20]]))
2 * var * int64

In the second case, the arrays are regular: their lengths must all be the same and this is expressed in the data type as a number:

>>> ak.type(ak.Array(np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])))
2 * 3 * float64
>>> ak.type(ak.Array(np.array([[10], [20]])))
2 * 1 * int64

NumPy arrays always contain regular length lists:

>>> ak.type(np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]))
2 * 3 * float64
>>> ak.type(np.array([[10], [20]]))
2 * 1 * int64

The distinction between var length (jagged) lists that happen to have the same lengths in some instance and fixed length lists is a user-visible type distinction, not an implementation detail. It's similar to the distinction between floating point numbers that happen to be whole and integers. Arrays with regular type exactly generalize NumPy's broadcasting rules; in-principle variable length lists (which are already beyond NumPy) follow a different broadcasting rule. I'll explain the reason for that in the next comment.

from awkward-0.x.

jpivarski commented on June 11, 2024

Broadcasting can apply to two arguments with the same number of dimensions, as above, or a different number of dimensions. If you're broadcasting an N×M (two-dimensional) array with a one-dimensional array, you have to decide whether you're going to require the one-dimensional array to have length N (i.e. align the arrays such that the first dimension of the first array corresponds to the only dimension of the second array) or require it to have length M (i.e. align the arrays such that the second dimension of the first array corresponds to the only dimension of the second array).

If you'll broadcast N×M with N, I'd call that "left broadcasting," and it's the right choice. If you'll broadcast N×M with M, I'd call that "right broadcasting" and it's the wrong choice. NumPy chose right broadcasting.

If all the arrays are rectangular, it might be a matter of taste whether you left broadcast or right broadcast, or maybe right broadcasting makes more sense if you're thinking of these arrays as matrices and vectors. But when we think of the arrays as containing objects, particularly physics objects, left broadcasting is the only natural way to think about it.
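
When N = M, both alignments are valid as far as shapes go, yet they give different answers, so the choice is a real one. A small NumPy sketch of the two alignments (the array values are made up for illustration):

```python
import numpy as np

square = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # N = M = 3
flat = np.array([100, 200, 300])

# NumPy's implicit rule: align `flat` with the last (inner) dimension.
right = square + flat

# Aligning `flat` with the first (outer) dimension has to be spelled out.
left = square + flat[:, np.newaxis]

print(right.tolist())  # [[101, 202, 303], [104, 205, 306], [107, 208, 309]]
print(left.tolist())   # [[101, 102, 103], [204, 205, 206], [307, 308, 309]]
```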

For example, in order to get a one-dimensional, plural-element NumPy array to broadcast with [[1, 2, 3], [4, 5, 6]] (2×3), it has to have 3 elements, and those 3 elements are aligned with the inner dimension:

>>> np.array([[1, 2, 3], [4, 5, 6]]) + np.array([100, 200, 300])
array([[101, 202, 303],
       [104, 205, 306]])

To get a one-dimensional Awkward array to broadcast with the jagged array of lists [[1, 2, 3], [4, 5, 6]] (var × var, but 2×3 in this case), it has to have 2 elements, and those 2 elements are aligned with the outer dimension:

>>> ak.Array([[1, 2, 3], [4, 5, 6]]) + ak.Array([100, 200])
<Array [[101, 102, 103], [204, 205, 206]] type='2 * var * int64'>

I said "jagged" above, because if the Awkward array has regular dimensions (really 2×3), then it does what NumPy does:

>>> ak.Array(np.array([[1, 2, 3], [4, 5, 6]])) + ak.Array(np.array([100, 200, 300]))
<Array [[101, 202, 303], [104, 205, 306]] type='2 * 3 * int64'>

Why is left broadcasting "correct"? Imagine that the first array is a set of jets in events (outer dimension for events; inner dimension for jets, which can be different in each event), and that the second array is the missing energy of each event (by definition, exactly one value per event). Associating all jets in each event with its event's missing energy is left broadcasting:

>>> jet_phi - met_phi
# jagged with the same lengths as jet_phi

It's what you need for broadcasting to emulate this imperative code:

>>> def delta_phi(events):
...     for event in events:
...         met_phi = event.met_phi
...         for jet in event.jets:
...             yield jet.phi - met_phi
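
To make the loop concrete, here is a self-contained pure-Python sketch. The `Jet` and `Event` classes, the function name, and the numbers are all hypothetical stand-ins, and the differences are not wrapped into the usual angular range, since only the alignment matters here:

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Jet:
    phi: float


@dataclass
class Event:
    met_phi: float
    jets: List[Jet]


def delta_phi_per_event(events: List[Event]) -> Iterator[List[float]]:
    """Pair every jet with the single met_phi of its own event."""
    for event in events:
        yield [jet.phi - event.met_phi for jet in event.jets]


events = [
    Event(met_phi=1.0, jets=[Jet(1.5), Jet(3.0)]),
    Event(met_phi=0.5, jets=[]),
    Event(met_phi=2.0, jets=[Jet(2.25)]),
]
result = list(delta_phi_per_event(events))
print(result)  # [[0.5, 2.0], [], [0.25]]
```

The result has one list per event, with the same lengths as the jet lists, which is the shape left broadcasting produces.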

Right broadcasting is the wrong way: given a jagged array and a one-dimensional array [a, b, c], it would match a with all the "firsts" in the jagged array, b with all the "seconds" in the jagged array, and c with all the "thirds" in the jagged array. In fact, the jagged array couldn't even be jagged: a list with more than 3 elements would have nothing to match the 4th to, and a list with fewer than 3 elements would have to decide which ones to discard (start discarding on the left, a, or start discarding on the right, c?).
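
The difference can be sketched on plain nested lists. The two helper functions below are hypothetical illustrations of the two alignments, not Awkward's implementation:

```python
from typing import List


def left_broadcast_add(jagged: List[List[float]], flat: List[float]) -> List[List[float]]:
    """One value of `flat` per inner list (outer dimensions aligned)."""
    if len(jagged) != len(flat):
        raise ValueError("left broadcasting needs one flat value per list")
    return [[x + value for x in inner] for inner, value in zip(jagged, flat)]


def right_broadcast_add(jagged: List[List[float]], flat: List[float]) -> List[List[float]]:
    """One value of `flat` per position inside each list (inner dimensions aligned)."""
    for inner in jagged:
        if len(inner) != len(flat):
            raise ValueError(
                f"list of length {len(inner)} cannot align with {len(flat)} values"
            )
    return [[x + value for x, value in zip(inner, flat)] for inner in jagged]


jagged = [[1, 2, 3], [], [4, 5]]
left = left_broadcast_add(jagged, [100, 200, 300])
print(left)  # [[101, 102, 103], [], [304, 305]]

try:
    right_broadcast_add(jagged, [100, 200, 300])
    right_failed = False
except ValueError:
    right_failed = True
print(right_failed)  # True
```

Left alignment only constrains the number of lists, so the lists themselves stay free to vary in length; right alignment forces every list to have the same length as the flat array.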

So jagged arrays either had to be left broadcasted or broadcasting between different numbers of dimensions couldn't be allowed unless they happened to be regular. For a while, I was leaning toward forbidding broadcasting between different numbers of dimensions, which would require users to explicitly make the number of dimensions match, using np.newaxis. (Note that it's still possible to be explicit, even if one side has var-length list type:

>>> # explicit left broadcasting
>>> ak.Array([[1, 2, 3], [4, 5, 6]]) + ak.Array([100, 200])[:, np.newaxis]
<Array [[101, 102, 103], [204, 205, 206]] type='2 * var * int64'>
>>> # explicit right broadcasting
>>> ak.Array([[1, 2, 3], [4, 5, 6]]) + ak.Array([100, 200, 300])[np.newaxis, :]
<Array [[101, 202, 303], [104, 205, 306]] type='2 * var * int64'>

My original thought was that this should be required any time you want to broadcast a jagged array of jets with a flat array of missing energy, though that might alienate users coming from Awkward 0, which implicitly left-broadcasts.)

This is why the broadcasting rule depends on the distinction between type var * int64 and type 3 * int64: the left broadcasting is really the only way to broadcast jagged arrays and the right broadcasting ensures that when an Awkward array is identical to a NumPy array in value and in type, then it will behave like the NumPy array.
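
As a toy model of that rule (a hypothetical helper, not how Awkward decides internally), the choice can be written as a function of the dimension description in the type string:

```python
from typing import Sequence, Union

# An int stands for a regular length, the string "var" for a variable length.
Dim = Union[int, str]


def alignment_for(dims: Sequence[Dim]) -> str:
    """Toy rule: NumPy-style right alignment only when every dimension is regular."""
    return "left" if "var" in dims else "right"


print(alignment_for((2, "var")))  # left
print(alignment_for((2, 3)))      # right
```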

I just wish NumPy picked the "correct" choice.

jpivarski commented on June 11, 2024

"So in the documentation this numpy broadcasting is given as an example"

I just took another look at that documentation, which was written just after Awkward 0 became viable and before we were even considering the Awkward 1 project. At that time, I wasn't aware of the distinction between left broadcasting and right broadcasting, or that NumPy had chosen the opposite of left broadcasting, which is what seemed natural to me for jagged arrays.

The examples are all correct: they return the results as written in the documentation. Somehow, I wrote that without realizing there's a left broadcasting/right broadcasting problem; it's probably because both the NumPy example and the Awkward example have the same number of elements in the arrays:

numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + numpy.array([[100], [200]])
# array([[101.1, 102.2, 103.3],
#        [204.4, 205.5, 206.6]])

is broadcasting 2×3 and 2×1 (length 2 and length 2) and

awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]) + awkward.fromiter([100, 200, 300])
# <JaggedArray [[101.1 102.2 103.3] [] [304.4 305.5]] at 0x781188390940>

is broadcasting 3×var and 3 (length 3 and length 3). Actually, the NumPy example isn't broadcasting between different numbers of dimensions at all: both operands are two-dimensional. If I had tried something more similar to the jagged example, like

>>> np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + np.array([100, 200])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

I might have noticed that something was wrong and would have had to address it back then. Writing Awkward 1 forced me to look more carefully at what exactly NumPy does, and that's when I noticed there was this problem.
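
For comparison, a short NumPy sketch of the failing case and the explicit fix with np.newaxis (the variable names are illustrative only):

```python
import numpy as np

rect = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])  # shape (2, 3)
per_row = np.array([100, 200])                        # shape (2,)

try:
    rect + per_row  # implicit (right) broadcasting: inner length 3 vs 2
    implicit_ok = True
except ValueError:
    implicit_ok = False

# An explicit length-1 inner dimension gives shape (2, 1), which broadcasts.
explicit = rect + per_row[:, np.newaxis]

print(implicit_ok)     # False
print(explicit.shape)  # (2, 3)
```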

HenryDayHall commented on June 11, 2024

Wow, thank you. This has really helped me understand broadcasting, both in NumPy and in Awkward. It's one of those things that would make lots of things more efficient, but I was a bit nervous about using it because I wasn't very clear on how it behaves. Really I just got lazy when thinking about dimensions and fell back on loops.

Your explanation for choosing left broadcasting for physics objects makes total sense.

Many thanks for the extended discussion; your insights have really helped me with this.

jpivarski commented on June 11, 2024

You're welcome! I'm glad that helped—now I just have to figure out how to get that into the documentation in a place where people will find it.

nsmith- commented on June 11, 2024

Awkward probably needs an equivalent of https://numpy.org/doc/stable/user/basics.broadcasting.html somewhere at the beginning to discuss these things.
