
pyflowgraph's Introduction

Flow graphs for Python

Build Status Python 2.7 Python 3.6 Python 3.7 DOI

Record dataflow graphs of Python programs using dynamic program analysis.

The package can be used standalone but is designed primarily to be used in conjunction with our semantic flow graphs. The main use case is analyzing short scripts in data science and scientific computing. This package is not appropriate for analyzing large-scale industrial software.

This is alpha software. Contributions are welcome!

Command-line interface

The package ships with a minimal CLI, invokable as python -m flowgraph. You can use the CLI to run and record a Python script as a raw flow graph.

python -m flowgraph input.py --out output.graphml

For a more comprehensive CLI, with support for recording, semantic enrichment, and visualization of flow graphs, see the Julia package for semantic flow graphs.

pyflowgraph's People

Contributors: epatters

pyflowgraph's Issues

Can't track non-weak-referenceable objects

The object tracker, currently used in a crucial way to determine the flow graph topology, only works for objects that are weakly referenceable. Types whose instances are not weakly referenceable include:

  • all builtin scalar types (int, float, bool, str, bytes, NoneType)
  • some builtin container types (list, tuple, dict)
  • a few miscellaneous builtin types (slice)

This is a fairly serious defect.
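The limitation is easy to demonstrate with the standard weakref module. A minimal sketch (the Tracked subclass is illustrative only):

```python
import weakref

# None of the builtin scalar or container types below support weak references.
for obj in (42, 3.14, "text", [1, 2], (1, 2), {"a": 1}, slice(3)):
    try:
        weakref.ref(obj)
        status = "weakly referenceable"
    except TypeError:
        status = "NOT weakly referenceable"
    print(type(obj).__name__, "->", status)

# User-defined classes, including subclasses of builtins, gain a
# __weakref__ slot, so their instances *can* be tracked.
class Tracked(list):
    pass

ref = weakref.ref(Tracked([1, 2, 3]))
```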

Remove object tracker from tracer

The object tracker is inside the tracer, which no longer makes sense. To avoid duplicating the object tracking logic, move the object tracker to the flow graph builder. The tracer will then know nothing about the object tracker.

Example suite from scikit-learn

In order to improve the robustness of the Python program analysis and to expand the Data Science Ontology, I would like to test this package against a suite of examples larger than the current set of integration tests.

Conveniently, scikit-learn includes a large collection of self-contained example scripts. It would be best to start with one subdirectory, perhaps for linear models or SVMs.

Pip installation error due to dependency issue

I'm trying to install PyFlowGraph in a conda environment using pip, but the installation fails with the following error.

(base) C:\Windows\system32>pip install pyflowgraph==0.0.1
Collecting pyflowgraph==0.0.1
  Downloading pyflowgraph-0.0.1-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: six in c:\programdata\miniconda3\lib\site-packages (from pyflowgraph==0.0.1) (1.15.0)
Collecting PySide<1.2.4,>=1.2.2
  Using cached PySide-1.2.2.tar.gz (9.3 MB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\ProgramData\Miniconda3\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\doshi\\AppData\\Local\\Temp\\pip-install-b0fj7thi\\pyside_942316f248e34c29a40589d59519b7b6\\setup.py'"'"'; __file__='"'"'C:\\Users\\doshi\\AppData\\Local\\Temp\\pip-install-b0fj7thi\\pyside_942316f248e34c29a40589d59519b7b6\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\doshi\AppData\Local\Temp\pip-pip-egg-info-iissbheg'
         cwd: C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\setup.py", line 89, in <module>
        from utils import rmtree
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\utils.py", line 10, in <module>
        import popenasync
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\popenasync.py", line 26, in <module>
        if subprocess.mswindows:
    AttributeError: module 'subprocess' has no attribute 'mswindows'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b4/7b/2fc9d9e5c651c1550362d87bc4ab4cfe5368b312c1eaf477b5a4be708abd/PySide-1.2.2.tar.gz#sha256=53129fd85e133ef630144c0598d25c451eab72019cdcb1012f2aec773a3f25be (from https://pypi.org/simple/pyside/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement PySide<1.2.4,>=1.2.2 (from pyflowgraph)
ERROR: No matching distribution found for PySide<1.2.4,>=1.2.2

Failing tests in Python 2

Several unit tests and integration tests are failing under Python 2.7. We have always aimed to support both Python 2 and 3, but the Python 2 support has bit rotted.

This issue is preparation for creating a Travis CI build that checks Python 2 and Python 3.

Expand varargs when binding function arguments

When recording a flow graph, *args and **kwargs arguments should be "expanded" as part of binding a function call to a function signature.

For example, one function that is currently broken is numpy's meshgrid, which has signature:

np.meshgrid(*xi, **kwargs)

Currently, the call np.meshgrid(x, y, copy=True) is bound as

xi=(x, y), kwargs={'copy': True}

Under our programming model, the call should be bound as something like

xi__0=x, xi__1=y, copy=True
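The expanded binding could be computed with inspect.signature plus a small post-processing pass. A sketch, using a stand-in for meshgrid and a hypothetical expand_varargs helper (not the package's actual API):

```python
import inspect

def meshgrid(*xi, **kwargs):
    """Stand-in with the same signature as np.meshgrid."""

def expand_varargs(bound):
    """Hypothetical helper: flatten *args/**kwargs into individual bindings."""
    expanded = {}
    for name, value in bound.arguments.items():
        param = bound.signature.parameters[name]
        if param.kind is inspect.Parameter.VAR_POSITIONAL:
            # Each positional vararg gets a synthetic name: xi__0, xi__1, ...
            for i, item in enumerate(value):
                expanded["%s__%d" % (name, i)] = item
        elif param.kind is inspect.Parameter.VAR_KEYWORD:
            # Keyword varargs are promoted to ordinary keyword bindings.
            expanded.update(value)
        else:
            expanded[name] = value
    return expanded

bound = inspect.signature(meshgrid).bind("x", "y", copy=True)
print(dict(bound.arguments))   # {'xi': ('x', 'y'), 'kwargs': {'copy': True}}
print(expand_varargs(bound))   # {'xi__0': 'x', 'xi__1': 'y', 'copy': True}
```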

Deterministic node labels

The flow graph builder should use deterministic labels for nodes, not (random) UUIDs. This will make flow graphs easier to version control.

Note that R flow graphs already have deterministic node labels.

Kernel broken on IPython v7.0

Our custom IPython kernel is broken on IPython v7.0 and later. Because the IPython team dropped Python 2.7 support in that release, they were able to adopt asyncio in the kernel. The kernel API now uses coroutines.

By far the simplest fix is to just drop our own Python 2.7 support for the kernel. (We can retain Python 2.7 compatibility in other parts of the codebase.)

AST transformers for sequence literals

Sequence literals, for lists, tuples, sets, and dictionaries, are currently not traced. Define an AST transformer to make them traceable, as in #16.

Tuple literals are perhaps the most important, not only because they appear often in their own right but because the AST transformer for extended indexing generates them, e.g.

x[:m,:n]

becomes

operator.getitem(x, (slice(m), slice(n)))
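The desugaring can be checked with a toy class that records the key passed to its __getitem__ (a sketch, not part of the package):

```python
import operator

class KeyRecorder:
    """Toy object whose __getitem__ simply returns the key it received."""
    def __getitem__(self, key):
        return key

x, m, n = KeyRecorder(), 3, 4

# Extended indexing builds a tuple of slices, exactly matching the
# transformer's desugared form:
assert x[:m, :n] == (slice(None, 3, None), slice(None, 4, None))
assert x[:m, :n] == operator.getitem(x, (slice(m), slice(n)))
```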

Static analysis of variable access and assignment

Use static analysis to capture variable access and assignment, as a supplement to tracking objects by memory address. Specifically:

  1. Rewrite the AST to add hooks for variable access and assignment
  2. Emit trace events from the tracer on variable access and assignment
  3. Use these events in the flow graph builder to maintain a mapping from variable names to nodes in the flow graph

This feature constitutes the last major step in implementing #18 and fixing #17.
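Step 1 can be sketched with ast.NodeTransformer; the _load hook name and the event format here are invented for illustration:

```python
import ast

class LoadHooks(ast.NodeTransformer):
    """Wrap every variable read `x` as `_load('x', x)`."""
    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Load):
            return ast.Call(
                func=ast.Name(id="_load", ctx=ast.Load()),
                args=[ast.Constant(node.id),
                      ast.Name(id=node.id, ctx=ast.Load())],
                keywords=[])
        return node

events = []
def _load(name, value):
    # The hook records the access, then passes the value through unchanged.
    events.append(("load", name))
    return value

tree = ast.fix_missing_locations(LoadHooks().visit(ast.parse("z = x + y")))
ns = {"_load": _load, "x": 1, "y": 2}
exec(compile(tree, "<ast>", "exec"), ns)
print(events)    # [('load', 'x'), ('load', 'y')]
print(ns["z"])   # 3
```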

Cannot trace mathematical operators

The tracer does not pick up mathematical operators, like + and *, because they are not reported by sys.settrace or sys.setprofile. Closely related to #10.
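The gap is easy to observe: under sys.settrace, only Python-level function calls generate 'call' events, while builtin operators run silently.

```python
import sys

calls = []

def tracer(frame, event, arg):
    if event == "call":
        calls.append(frame.f_code.co_name)
    return tracer

def add(x, y):
    return x + y

sys.settrace(tracer)
result = add(1, 2) + 3   # the `+` between ints emits no trace event
sys.settrace(None)

print(calls)   # ['add'] -- only the Python-level call was seen
```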

Replace sys.settrace with AST rewriting

Using sys.settrace (or sys.setprofile) for tracing function calls was a quick way to get started but has fundamental limitations, some of which are documented in #10, #11, and #12.

A more flexible and scalable approach is to rewrite the AST with explicit tracing machinery. In a nutshell, replace function calls

f(x,y,z=1)

with

CALL(f, x, y, z=1)

where CALL is a special shim that calls and records a function.

It will take a little effort to get this approach off the ground, but it should ultimately work better. We took this approach from the beginning in rflowgraph, because R has no equivalent to sys.settrace AFAIK.

AST rewriting breaks functions that inspect the call stack

In rare cases, the current AST rewriting strategy can break functions which violate referential transparency. For example, patsy's formula evaluation magic breaks because it inspects the local namespace of the previous frame in the call stack.

The reason is that rewriting the function call f(x,y) to

trace(f, x, y)

(see #13) changes the observed call stack. Instead, we should rewrite f(x,y) to

trace_return((trace_fun(f))(trace_arg(x), trace_arg(y)))

which, in view of Python's left-to-right evaluation order, will evaluate as

trace_fun(f)
trace_arg(x)
trace_arg(y)
trace_return(...)

This will slightly complicate the tracing machinery but will better preserve the call stack. In particular, it should fix the problem with patsy.
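A sketch of the three hooks confirms the evaluation order (hook names follow the issue; the implementations are illustrative pass-throughs):

```python
order = []

def trace_fun(func):
    order.append("fun")
    return func

def trace_arg(arg):
    order.append("arg")
    return arg

def trace_return(value):
    order.append("ret")
    return value

def f(x, y):
    return x + y

# Rewritten form of f(1, 2), evaluated left to right:
result = trace_return((trace_fun(f))(trace_arg(1), trace_arg(2)))
print(result)   # 3
print(order)    # ['fun', 'arg', 'arg', 'ret']
```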

AST transformers for in-place mathematical operators

Implement AST transformers for in-place mathematical operators, using a combination of in-place operator functions (operator.iadd, operator.imul, etc) and setter functions (setattr and operator.setitem).

Support syntax like:

  • y += 1
  • x[::2] += 1
  • df.x *= 0.5

Follow up to #12.
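The intended desugarings can be sketched with the standard operator module. (A simple index stands in for the extended slice, whose in-place arithmetic needs NumPy semantics, and the Frame class is a stand-in for a DataFrame.)

```python
import operator

# `y += 1` becomes an in-place operator function plus a rebind:
y = 5
y = operator.iadd(y, 1)
assert y == 6

# `df.x *= 0.5` combines getattr / operator.imul / setattr:
class Frame:
    pass

df = Frame()
df.x = 2.0
setattr(df, "x", operator.imul(getattr(df, "x"), 0.5))
assert df.x == 1.0

# `x[i] += 1` combines operator.getitem / iadd / setitem:
x = [10, 20]
operator.setitem(x, 0, operator.iadd(operator.getitem(x, 0), 1))
assert x == [11, 20]
```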

Migrate to NetworkX 2

When this project was started, NetworkX was at version 1.11. Now that NetworkX 2 is finally released, we should migrate. I think the main breaking change is moving from lists to iterators in node/edge accessors. See the migration guide.

Supplement object tracking with static analysis

Currently, dataflow dependencies are determined solely by tracking objects, essentially by memory address (technically, using Python's weak references). Naturally, this does not work for objects that do not have unique/stable memory addresses, such as primitive scalar values (see #17).

Now that we're using AST rewriting, as of #15, we should supplement the memory-based object tracker with static analysis to capture dataflow. A satisfactory resolution of this issue would fix #14 and #17. A similar approach is already being used in rflowgraph.

Undefined name 'target' in ast_tracer.py

flake8 testing of https://github.com/IBM/pyflowgraph on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./flowgraph/trace/ast_tracer.py:260:25: F821 undefined name 'target'
        elif isinstance(target, ast.List):
                        ^
1     F821 undefined name 'target'
1

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

Use syntax to identify multiple return values

Like many programming languages, Python treats function inputs and outputs asymmetrically. A function can have many arguments, but only one return value. Multiple return values are represented implicitly by returning a tuple. The system currently detects tuple return values and interprets them appropriately.

Unfortunately, not all Python functions respect the convention of returning tuples. For example, numpy.meshgrid returns a list of arrays, but in an expression like

xv, yv = np.meshgrid(x, y)

it is abundantly clear that we have a function call with two arguments and two (logical) return values. The system should use the syntax of multiple return values to identify them.
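Detecting this syntactically is straightforward with the ast module; count_targets below is a hypothetical helper, not the package's API:

```python
import ast

def count_targets(source):
    """Count the names on the left-hand side of an assignment statement."""
    stmt = ast.parse(source).body[0]
    if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Tuple):
        return len(stmt.targets[0].elts)
    return 1

print(count_targets("xv, yv = np.meshgrid(x, y)"))   # 2
print(count_targets("z = f(x)"))                     # 1
```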
