
uarray's Issues

When to compute?

I am working on replicating the paper's example in uarray. I made some good progress here (https://github.com/Quansight-Labs/uarray/blob/6e03be28383fdc08bef02cd0f6662614e69f3d05/uarray/Intro.ipynb), but realize that I need to change how I am processing concrete arrays. Right now, for things like equivalence, I compute all possible indexes of the array and see if every element is equal. That relies on an ExplodeVector operation which takes a 1D array and returns a new 1D array with every element indexed so each can be replaced. I am realizing now that this approach breaks down and becomes confusing, because you don't know when exactly to call ExplodeVector.

The underlying problem is when to actually compute intermediate results. When we know an array's contents or shape at compile time, we should use that eagerly and compute operations on it as we replace. Why? Because we need this for shape calculations. What if you don't want to use some data at compile time? Like if you have big numpy arrays, you don't want it computing things willy-nilly on them while it is pattern matching. You want it to wait and return some sort of expression that you can use to compute the result. I think this is the point @hameerabbasi was making last week when we talked about this (correct me if I am remembering incorrectly).

The form in the paper goes from this originally:

((<1 0> ψ A^3) +·× (<2> ψ ((<1 0> ψ A^3) ·× (<0 1> ψ B^3))))

To this replaced:

((<1 0> ψp A^3) +·× (((<1 0> ‡v (<1> ↑ <2>)) ψf A^3) ·× ((<0 1> ‡v (<1> ↓ <2>)) ψp B^3)))

(ψp is just the index operator when we know it's a partial index vs a full index. ‡v is vector concat)

Ok, so we see this expression (<1> ↑ <2>) in there, which should be simplified. It should just return <2>, because this is saying "take the first 1 item in the vector". So we could go and add a manual rule for Take when the arguments are two concrete vectors, which would work. But this is superfluous. We have already defined Take in terms of shape and index, and the resulting concrete transformation can be inferred from this. Right now, I would call ExplodeVector(vector(1), expr) to transform this, since this would be replaced by indexing on that original expression, which would then be replaced by the Index rule for Take.

But when do I call ExplodeVector? It seems arbitrary. I don't want to have to sprinkle this all over. I don't even have a coherent reason for where I should put it. So we should do things another way.

OK so let's see if we can come up with a plan to remove uses of ExplodeVector.

One thing that could help is to use abstract arrays. I used these when defining this expression, as the A and B terms, with A = AbstractWithDimension(Scalar(3), variable_name="A"). One place where ExplodeVector is used is in Equiv. Instead of seeing if every element of the two arrays is the same, we should see if their shapes have the same form and if indexing them has the same form. We can use the __eq__ method on the actual operations to compare equivalence of form. It recursively verifies that the trees are equal. But... you could have two different expressions with the same normal form that are not equal. A simple example is that Take(vector(1), vector(10)) is equivalent to vector(10). So I think in this case, for Equiv we can fully index each expression by a vector of abstract scalars. Then this will cause the pattern matching for Index to take place and we will get the normal form.

This is how Lenore simplifies the expressions in the paper. She indexes them with a vector of <i>, so that the rules are applied.

OK, but what about something like Take(vector(0), Scalar(10))? Here the dimensionality is 0, so we don't have to index it. Yet indexing it would cause Take's index replacement to be used, so it would be necessary to cause it to be simplified. So we would have to do Index(Vector(), Take(vector(0), Scalar(10))), which is a bit weird because you also want to remove indexing by an empty vector, i.e. replace it with just its result. In that case the indexing of Take wouldn't be applied. Now yes, this could be fixed by adding multiple matching levels, so the Take indexing is guaranteed to be replaced before the replacement rule of indexing by an empty vector, but this seems very ad hoc.

So this makes me think that any operation that results in a scalar should just be replaced by that definition. For example, the cross product of two scalars should be replaced by the resulting scalar (even if that scalar is not a concrete value but some operation). It seems silly to wait for an empty index to trigger this replacement.

So this brings us to needing shape information when defining the indexing. For example, we wanna say something like "when the result of the cross product is a scalar, replace it with that result. Otherwise keep it"

It makes me wonder about changing our representation entirely into three different abstract array operations:

AbstractWithDimensions(n_dim)
AbstractWithShape(n_dim, shape)
AbstractWithIndex(n_dim, shape, index_fn)

We would also keep some sort of concrete vector (Vector) and concrete scalar (Scalar).

This would make the replacements a bit clearer, so that you are replacing OuterProduct with its normal form, instead of defining replacements on taking the shape or index of it.

So the million dollar question is how to represent the index_fn symbolically. It could be represented as another array? With abstract array being the indexes? OK so how would we define the replacement for Index(x, AbstractWithIndex(n_dim, shape, index_fn))?

Not sure, but this form would allow us to do something like AbstractWithIndex(Scalar(0), x, index_fn) -> index_fn.

We could define it as the resulting array with the unknown indices being AbstractWithDimensions(0, variable_name=f"index_{i}") (scalars). That's all well and good but how do you replace those abstract dimensions with real ones once you index? You could add some bound scoping. Like a WithVariables(Scalar({"index_1": xxx}), expr) operation that is replaced by forwarding on the WithVariables to each of the expressions. Something like this:

register(WithVariables(scalar, x), lambda scalar, x: matchpy.functions.substitute(x, scalar.value))

I found that matchpy actually has a substitute function already defined that seems like it does exactly what we want! OK this makes me feel like maybe this is an OK track to go down then. (maybe this should be called Substitute then instead of WithVariables).

So how about we assign the index vector to a variable, like this:

register(Index(x, AbstractWithIndex(x1, x2, x3)), lambda x, x1, x2, x3: substitute(x3, {"index": x}))

However, then all indexes will be labeled the same. OK so let's add a parameter to AbstractWithIndex which is the name of the index variable, so it works like AbstractWithIndex(dim, shape, index_variable_name, indexed_value) where index_variable_name is a scalar of a string.
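To make the substitution idea concrete, here is a minimal self-contained sketch of variable substitution over a toy tuple-shaped expression tree. This is illustrative only; matchpy's real substitute operates on its own expression classes, and the node names here are made up:

```python
def substitute(substitution, expr):
    """Replace named variables in a tuple-shaped expression with their values."""
    if isinstance(expr, str):
        # leaves are variable/operation names; swap in a binding if one exists
        return substitution.get(expr, expr)
    if isinstance(expr, tuple):
        return tuple(substitute(substitution, a) for a in expr)
    return expr

# index_fn for a rank-1 array, written against a named index variable:
index_fn = ("Add", ("Index", "index", "A"), 1)

# Indexing the HasIndex node would bind "index" to the concrete index vector:
bound = substitute({"index": ("Vector", 2)}, index_fn)
assert bound == ("Add", ("Index", ("Vector", 2), "A"), 1)
```

The per-node index_variable_name parameter proposed above is exactly what keeps nested HasIndex expressions from capturing each other's bindings in a scheme like this.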

This sounds OK to try to implement. I wanna rename them though to this:

AbstractWithDimensions -> HasDim(dim)
AbstractWithShape -> HasShape(shape)
AbstractWithIndex -> HasIndex(shape, index_variable_name, contents)

Then maybe:

Scalar -> HasValue
Vector -> HasItems

Transpose

  • Add transpose operation
  • Try transpose example with NumPy
  • Show example of gamma function with reverse

Prepare for talk

  • Mix code and narrative more.
  • Add resources for links.
  • Make joke about intelligibility.
  • Add A and B capitalized as defined above.
  • Add headers first to describe the outline.
  • Fix the code or make a joke about it; make it hard.
  • Give shout-outs to other projects.
  • Community: a different experiment.
  • Reference other people's work.

List of directions

We should compose a list of directions we can move in. Aka a roadmap.

So that we are transparent about where we want to go and so that we can get support for that vision.

Articulate UX for creating backends

We should have a defined Python protocol, or something like that, for backend developers to fill in.

This makes the API boundary explicit.

Draft initial specification of uarray protocol

Draft the initial specification of the class and methods required to implement a uarray backend. Current thinking is that we require just these things at a minimum:

  • Construction of an array
  • A method to provide the shape of the array
  • A method to provide a specific item from an array
  • A method to retrieve the datatype (int, double, etc.) for a dimension?

This will also require adding a mechanism for "generalized ufuncs" or maps.

The specification could consider additional, optional, methods like:

  • Retrieving ranges
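As a strawman, the minimum surface above could be written down as a Python protocol. All names here (ArrayBackend, create, shape, getitem, dtype, ListBackend) are hypothetical sketches, not an existing uarray API:

```python
from typing import Any, Protocol, Tuple


class ArrayBackend(Protocol):
    """Hypothetical minimum surface for a uarray backend (sketch only)."""

    def create(self, shape: Tuple[int, ...], fill: Any) -> Any:
        """Construct an array."""
        ...

    def shape(self, array: Any) -> Tuple[int, ...]:
        """Provide the shape of the array."""
        ...

    def getitem(self, array: Any, index: Tuple[int, ...]) -> Any:
        """Provide a specific item from the array."""
        ...

    def dtype(self, array: Any) -> type:
        """Retrieve the datatype (int, float, ...)."""
        ...


# A trivial nested-list implementation, just to show the shape of the contract:
class ListBackend:
    def create(self, shape, fill):
        if not shape:
            return fill
        return [self.create(shape[1:], fill) for _ in range(shape[0])]

    def shape(self, array):
        s = ()
        while isinstance(array, list):
            s += (len(array),)
            array = array[0]
        return s

    def getitem(self, array, index):
        for i in index:
            array = array[i]
        return array

    def dtype(self, array):
        while isinstance(array, list):
            array = array[0]
        return type(array)
```

A "generalized ufunc" mechanism would then be a function written only against this protocol, so it runs unchanged on any conforming backend.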

Separate into multiple packages

Here is a sketch of how we could separate the python packages:

  • uarray: Core types and dispatching infrastructure
  • umoa: MoA operations and reductions on them.
  • unumpy:
    • ndarray.py: Lazy numpy ndarray that creates graph. Use __array_ufunc__ and __array_function__
    • ast.py: Compile to Python AST of numpy operations. Have default implementations use existing numpy functions
  • unumpy_moa: Adds moa translations for numpy operations, so we can reduce them (remove intermediates, etc.)
  • unumpy_llvm: Adds llvmlite compilations for numpy expressions
  • uscipy_linalg/ unumpy_linalg: Adds scipy or numpy compatible linalg module. Depends on unumpy and adds default Python AST versions of linalg functions
  • uscipy_linalg_moa/ unumpy_linalg_moa: Adds moa reductions for linalg

The MVP could be AST creation of a bunch of numpy code, where just one bit is fed through MoA reductions and the rest is compiled to existing numpy operations.

The next one would be the same, but in LLVM. This would (I think) be more work.

MoA - MatchPy: Need a way of only substituting full indexing

We should be able to register replacements for indexing that is limited to full indexing. Right now we can only define partial indexing. Although one could figure out how to talk about indexing everything in a partial way, it's a bit more cumbersome and might result in extra work. So it should be possible to do that, but not required. So how does someone define indexing on an array that only matches when the index vector is of length >= N?

One option is to have an extended version of commands that contains the shape of the index as well as possibly the dimensionality of the operation (if needed). Then those can be compared to be equal and only match then.

For example, this is how it could work with concrete arrays:

register(
    Index(array, array1),
    lambda array, array1: IndexAugmented(
        Equal(Shape(array), Shape(Shape(array1))), array, array1
    ),
)

register(
    IndexAugmented(array, array1, array2),
    array_is_true,
    lambda array, array1, array2: do_indexing...(array1, array2)
)

That seems to work OK. However, if the array is not true, and can't be resolved to true for whatever reason, then you end up with this big mess in an IndexAugmented. Then let's say you index this again...

Well, then we could add a rule for Index(x, IndexAugmented(x1, array, array1)) that would give a new IndexAugmented.

Multiple internal representations: N-dimensional, vectors, nested sequences

Notes from DB talk:

How do schemas fit in our world?

This is what dimensionality and shape information are. They are different levels of schemas.

Need way to talk about all these at the same time and convert between them. Simply!

Nested sequence vs ND array. They are just different schemas.

Need to validate our specs. Generate random arrays. Need specs for types! Esp integer. Integer ranges...

Sequences are the most expressive, but we need to be able to think about arrays in other forms. Here are some examples of different interfaces in Python land:

  • N-dimensional arrays: nested-sequence is superset of this. have dim, shape, and indexing: Numpy arrays
  • Nested Sequences: Recursive indexing. nd array can translate to this. Lists of lists
  • Vectors: Any indexing to value, contiguous memory. N-Dimensional and Nested Sequences can translate to this. lists, JS arraybuffer

We want to be able to write algorithms that target any of these, and we can have translations between them. I.e. any algorithm written for nested sequences can be used on n-dimensional arrays, b/c nd arrays can always be translated to nested sequences. The reverse is not true: not all nested sequences can be translated to arrays (if ragged).

We can always map from ND arrays to vectors by using gamma functions, and reverse the mapping using the inverse gamma function and the shape.

To translate from nested sequences to flat vectors requires also making some offset vector.
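To make the gamma mapping above concrete, here is a small sketch assuming row-major layout (the function names are illustrative, not uarray code):

```python
def gamma(index, shape):
    """Row-major gamma function: ND index -> flat vector offset."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def gamma_inverse(offset, shape):
    """Inverse gamma: flat offset + shape -> ND index."""
    index = []
    for n in reversed(shape):
        index.append(offset % n)
        offset //= n
    return tuple(reversed(index))


assert gamma((1, 2), (3, 4)) == 6
assert gamma_inverse(6, (3, 4)) == (1, 2)
```

This is why the shape must travel with the flat vector: gamma alone is not invertible without it.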

Decide on API pattern for svd and lu

Many implementations return two matrices. However, lapack and blas return a single matrix. We need to choose a primary function definition that returns either one or two matrices, and an alternative function definition that returns the other form and is implemented by default using the primary implementation.
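The primary/alternative pattern could look like the following sketch. The "factorization" here is a toy split into lower and strictly-upper parts, purely to show the API shape; the names and the decomposition are illustrative, not a real SVD or LU:

```python
def split_triangular(m):
    """Primary definition: return the factors as two matrices.

    Toy stand-in for an SVD/LU-style routine (NOT a real factorization);
    it just splits a square matrix into lower and strictly-upper parts.
    """
    n = len(m)
    lower = [[m[i][j] if j <= i else 0 for j in range(n)] for i in range(n)]
    upper = [[m[i][j] if j > i else 0 for j in range(n)] for i in range(n)]
    return lower, upper


def split_triangular_packed(m):
    """Alternative definition: a single combined matrix (lapack/blas style),
    implemented by default in terms of the primary definition."""
    lower, upper = split_triangular(m)
    n = len(m)
    return [[lower[i][j] + upper[i][j] for j in range(n)] for i in range(n)]


assert split_triangular_packed([[1, 2], [3, 4]]) == [[1, 2], [3, 4]]
```

A backend that natively produces the packed form would override the alternative directly instead of going through the primary.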

MoA - MatchPy: Figure out primitive types

If you concat two arrays then index them, how do we know how to compile that? If they are abstract random arrays, we can use the MoA definitions. If they are NumPy arrays, it's possible we might want to turn the result into a NumPy array. But let's make that explicit! We should have a ToNumpyArray operator that has overrides for things like concatenating two numpy arrays, so we can optimize those use cases. But if you are generally just indexing concatted numpy arrays, then we use the MoA formalisms. If you want to compute that intermediate result, use the To...

We should do the same for Literal arrays (which is what I should call them, or Concrete). So any operation on them should not give one back; we just define ToLiteralArray to provide shortcuts.

And of course both of those will fall back on MoA indexing definitions if there is no override

Feedback from tech share

I walked through adding a reverse operator, like this https://github.com/Quansight-Labs/uarray/blob/master/uarray/2018.11.02%20Adding%20Reverse.ipynb

at the quansight tech share. Here was some feedback:

It would be possible to create a higher-level interface, a la sympy, for creating array expressions. But we also don't want to abstract too much before things settle down.

A good use case would be to implement FFT and derivatives.

From Lenore: "Combining the ability to run programs, see the math, and publish a document is very useful for education at all levels."

Next Steps

As I have been mulling over this discussion in #12, I would like to re-articulate my immediate goals to support reducing the expression in the paper.

These are the current next steps I will work on. After these are fixed, I will take another look at expressions like <1> concat <2> to see when they should be replaced.

Remove thunks

Current Status

Right now, for operations like And and If, some of the operands are stored as scalars with array operations as their value, instead of as direct array operations. This is a way of "hiding" the inner operations from MatchPy so that they are not replaced. For example, this is the current replacement for If:

uarray/uarray/uarray.py

Lines 529 to 534 in 6e03be2

register(
    If(scalar, scalar1, scalar2),
    lambda scalar, scalar1, scalar2: scalar1.value()
    if scalar.value
    else scalar2.value(),
)

The idea is that if the conditional in the If expression can be reduced to a boolean scalar, then we can replace the operation with one of its branches. Until it is replaced, however, matchpy won't try to do any matching on its branches, because they are hidden inside scalars, so it can't see them. (I also put them in parameterless functions, but looking back, I don't think this is necessary. I think they could just be put in scalar values and not wrapped in a function.)

This was done because when I was implementing the Equiv operator I wanted to use the If operator but I didn't want it to expand the branches because then we would get some infinite expansion. It also helps to limit matchpy from replacing expressions that won't be used anyway.

Why it needs to change

What I forgot is that sometimes an If statement won't reduce at compile time, and that's OK. For example, let's say we wanted to use the MoA definition of Concat and then compile that to a tensorflow graph. We would actually want to translate the If operator to an If operation in TensorFlow. So we would want both of its paths to be replaced by matchpy into reduced forms, so that they in turn could be turned into TensorFlow graphs.

Sometimes If and And will be resolved at compile time and sometimes they won't. When they are not, we still want to reduce their operands.

Possible Solution

Keep the operands to If and And as regular expressions, don't wrap them in scalars or functions.
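A toy sketch of the difference: with thunked branches a rewriter cannot see inside them, while plain operands stay visible to rewriting. This is pure Python over tuple-shaped expressions, with all names illustrative (not uarray's actual machinery):

```python
def rewrite_all(expr, rule):
    """Apply `rule` bottom-up to every sub-node of a tuple-shaped expression."""
    if not isinstance(expr, tuple):
        return rule(expr)
    head, *args = expr
    return rule((head, *(rewrite_all(a, rule) for a in args)))


def simplify(node):
    # toy rule: Add of literal ints constant-folds to a literal
    if isinstance(node, tuple) and node[0] == "Add" and all(
        isinstance(a, int) for a in node[1:]
    ):
        return sum(node[1:])
    return node


# Thunked branch: the rewriter sees an opaque callable and cannot reduce it.
thunked = ("If", ("Less", "x", 3), lambda: ("Add", 1, 2), lambda: 0)

# Plain branch: the Add inside the If is visible and folds to 3, even though
# the If itself cannot be resolved (the condition depends on x).
plain = ("If", ("Less", "x", 3), ("Add", 1, 2), 0)

assert rewrite_all(plain, simplify) == ("If", ("Less", "x", 3), 3, 0)
assert rewrite_all(thunked, simplify)[2]() == ("Add", 1, 2)  # still unreduced
```

This is exactly the behavior needed for the TensorFlow-style target above: the If survives, but its branches arrive in reduced form.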

Repercussions

The equivalence operation needs to change so we don't get infinite recursions, which brings us to the next change...

Remove ExplodeVector

Current Status

ExplodeVector is a way of materializing some vector expression into its values:

uarray/uarray/uarray.py

Lines 584 to 587 in 6e03be2

register(
    ExplodeVector(scalar, x),
    lambda scalar, x: Vector(*(FullIndex(vector(i), x) for i in range(scalar.value))),
)

Why would you want to do this? I am using it in two places right now.

When checking if two arrays are Equiv we check if their shapes are equivalent and if every element of them is equal. We do this by indexing every element of both arrays and seeing if all are equal.

uarray/uarray/uarray.py

Lines 594 to 605 in 6e03be2

def _equiv(x, x1):
    # operations and operands are the same, and variable names are the same (if not none)
    if x == x1:
        return Scalar(True)
    return if_(
        and_(is_scalar(x), is_scalar(x1)),
        EquivScalar(x, x1),
        and_(
            EquivVector(explode_vector(Shape(x)), explode_vector(Shape(x1))),
            EquivVector(to_vector(x), to_vector(x1)),
        ),
    )

Also, I am using it as a debug tool, so that if I have an array, I can turn it into a materialized vector to see if that is right.

Why it needs to change

It's confusing, slow, and messy!

But it really has to go, mainly because it becomes confusing when it should be used on arrays. If you are checking whether arrays are equivalent, it makes some sense to use this on them to reduce them to some materialized form. However, there are many other places where we want to turn expressions into their concrete forms if we can (really all places), so it seems very weird to sprinkle ExplodeVector calls all over willy-nilly. We need another way of achieving the same goal, which is to replace expressions with their reduced form if we can at compile time (if we are operating on concrete arrays instead of abstract ones).

Also, checking equivalence this way is pretty time consuming, since it requires checking whether every element of the two arrays is equal, which means the time is proportional to the length of the arrays. Which brings us to...

Possible Solution

In equivalences, instead of indexing every element and seeing if all are equal, we can index an array with an abstract vector of its shape and see if the normal form is the same.

# dim && shape && contents are the same
register(
    Equiv(x, x1),
    lambda x, x1: And(
        EquivScalar(Dim(x), Dim(x1)),
        And(
            EquivArray(Scalar(1), Scalar(1), Shape(x), Shape(x1)),
            EquivArray(Dim(x), Dim(x1), x, x1),
        ),
    ),
)


def _abstract_vector(prefix_name, length):
    return Vector(
        *(
            AbstractWithDimension(Scalar(0), variable_name=f"{prefix_name}_{i}")
            for i in range(length)
        )
    )


# Contents are the same if fully indexing both result in equiv scalars
register(
    EquivArray(scalar, scalar1, x, x1),
    lambda scalar, scalar1, x, x1: EquivScalar(
        FullIndex(_abstract_vector("_equiv", scalar.value), x),
        FullIndex(_abstract_vector("_equiv", scalar1.value), x1),
    ),
)

# two scalars are equivalent if their forms are the same
register(
    EquivScalar(x, x1),
    matchpy.EqualVariablesConstraint("x", "x1"),
    lambda x, x1: Scalar(True),
)

# we know if they are equivalent if both are scalars
register(
    EquivScalar(scalar, scalar1), lambda scalar, scalar1: Scalar(scalar == scalar1)
)

# Also if we are comparing two concrete vectors indexed abstractly
# we know if they are not equal if their forms are not equal
register(
    EquivScalar(
        Index(Vector(AbstractWithDimension(x)), Vector(xs)),
        Index(Vector(AbstractWithDimension(x1)), Vector(xs1)),
    ),
    xs_are_scalars,
    xs1_are_scalars,
    matchpy.EqualVariablesConstraint("x", "x1"),
    lambda x, x1, xs, xs1: Scalar(xs == xs1),
)

For the EqualVariablesConstraint replacement, this ends up calling __eq__ on the two operations to see if they are the same, which defers to checking if they are the same operation type and if all operands are equal:

https://github.com/HPAC/matchpy/blob/ff8b28a0e178f3cc5d6546c25ee1d891cff64425/matchpy/expressions/expressions.py#L500-L506

So the goal here is to replace things at compile time that we know are equal or not, but to leave equality expressions in the tree that we might not have the knowledge yet to know if they are equal.

Repercussions

Forms like A + B and B + A will not be found to be equivalent when A and B are abstract arrays. To address this we would either have to extend equality to deal with commutativity and associativity, or define a normal form for expressions like this so they will be sorted. I think this is fine at the moment. We can revisit this when we get problems like this in practice.
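The normal-form option could look like this sketch: sort the operands of commutative operations before comparing, so A + B and B + A share one form. Toy tuple expressions, illustrative only:

```python
# Toy expression trees as tuples: ("Add", operand, operand), etc.
COMMUTATIVE = {"Add", "Multiply"}


def canonicalize(expr):
    """Recursively sort the operands of commutative operations so that
    A + B and B + A share one normal form (sketch, not uarray code)."""
    if not isinstance(expr, tuple):
        return expr
    head, *args = expr
    args = [canonicalize(a) for a in args]
    if head in COMMUTATIVE:
        # any stable, total ordering works; repr is the simplest choice
        args = sorted(args, key=repr)
    return (head, *args)


assert canonicalize(("Add", "A", "B")) == canonicalize(("Add", "B", "A"))
```

Associativity would additionally need flattening of nested Add/Multiply nodes before sorting, which is where matchpy's own associative-operation support would come in.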

Make bin ops associative

Add associative and commutative arguments to sub-expressions so that operands can move around without affecting the operation.

Overlap with egison

As I was looking into translating einsum notation into MoA (https://github.com/Quansight-Labs/uarray/issues/17), I came across a very recent paper by @egisatoshi called "Scalar and Tensor Parameters for Importing the Notation in Differential Geometry into Programming".

Interestingly, he has also been working on similar types of pattern matching methods that we are using (see "Non-linear Pattern Matching with Backtracking for Non-free Data Types").

He is also developing the Egison programming language:

Egison is a functional programming language featuring its expressive pattern-matching facility. Egison allows users to define efficient and expressive pattern-matching methods for arbitrary user-defined data types including non-free data types such as lists, multisets, sets, trees, graphs, and mathematical expressions.

Since we seem to be approaching similar areas (array programming) with similar methods (pattern matching), I suspect we could learn from each other.

Also I wonder how Egison's pattern matching differs from Matchpy's (written by @wheerd).

There is a separate Ruby library that implements Egison pattern matching that looks similar in scope to MatchPy in Python.

Support numpy.select use case

We should add support for numpy.select

This is the example from the docs:

x = np.arange(10)
condlist = [x<3, x>5]
choicelist = [x, x**2]
np.select(condlist, choicelist)

I believe if we have a function like this:

def t(x):
    condlist = [x<3, x>5]
    choicelist = [x, x**2]
    return np.select(condlist, choicelist)

We should jit it to something like this (assuming inputs are 1 dimensional):

def t(x):
    new = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        if x[i] < 3:
            new[i] = x[i]
        elif x[i] > 5:
            new[i] = x[i] ** 2
        else:
            new[i] = 0
    return new
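As a sanity check, the two versions can be compared directly. A quick sketch, assuming 1-dimensional input (the function names t_select/t_loop are mine):

```python
import numpy as np


def t_select(x):
    # the np.select version from the docs example
    condlist = [x < 3, x > 5]
    choicelist = [x, x ** 2]
    return np.select(condlist, choicelist)


def t_loop(x):
    # the hand-written loop the jit should produce
    new = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        if x[i] < 3:
            new[i] = x[i]
        elif x[i] > 5:
            new[i] = x[i] ** 2
        else:
            new[i] = 0  # np.select's default fill value
    return new


x = np.arange(10)
assert (t_select(x) == t_loop(x)).all()
```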

Theano

Since Theano is not being actively developed, some people who are using it are looking for replacements. If we can have a compatible API, then those people can drop in uarray and then target a variety of backends.

NumPy Interface

I assume you're already well aware, but the goals of uarray seem similar to NEP-22.

Just opening an issue so that hopefully the communities can work together and align the implementations.

Obligatory xkcd reference 😜

Python 3.7 tests fail due to README generating StopIteration

Running tests on python 3.7 fails with the following. I will fix in a PR.

python extract_readme_tests.py
pytest
==================================== ERRORS ====================================
_______________________ ERROR collecting test_readme.py ________________________
uarray/machinery.py:45: in replace_scan
    expr = _replace_once(expr)
uarray/machinery.py:24: in _replace_once
    pos, replacement, subst = next(iter(_matches(expr)))
E   StopIteration

The above exception was the direct cause of the following exception:
test_readme.py:42: in <module>
    assert list(uarray.replace_scan(List())) == [List(), Nil()]
E   RuntimeError: generator raised StopIteration
----------- generated xml file: /build/uarray/junit/test-results.xml -----------
