
pyflowgraph's Introduction

Flow graphs for Python

Build Status Python 2.7 Python 3.6 Python 3.7 DOI

Record dataflow graphs of Python programs using dynamic program analysis.

The package can be used standalone but is designed primarily to be used in conjunction with our semantic flow graphs. The main use case is analyzing short scripts in data science and scientific computing. This package is not appropriate for analyzing large-scale industrial software.

This is alpha software. Contributions are welcome!

Command-line interface

The package ships with a minimal CLI, invokable as python -m flowgraph. You can use the CLI to run and record a Python script as a raw flow graph.

python -m flowgraph input.py --out output.graphml

For a more comprehensive CLI, with support for recording, semantic enrichment, and visualization of flow graphs, see the Julia package for semantic flow graphs.

pyflowgraph's People

Contributors: epatters

pyflowgraph's Issues

Can't track non-weak-referenceable objects

The object tracker, currently used in a crucial way to determine the flow graph topology, only works for objects that are weakly referenceable. Types whose instances are not weakly referenceable include:

  • all builtin scalar types (int, float, bool, str, bytes, NoneType)
  • some builtin container types (list, tuple, dict)
  • a few miscellaneous builtin types (slice)

This is a fairly serious defect.
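The limitation is easy to demonstrate with the standard weakref module. A minimal sketch (the Tracked subclass is illustrative only):

```python
import weakref

# None of the builtin scalar or container types below support weak references.
for obj in (42, 3.14, "text", [1, 2], (1, 2), {"a": 1}, slice(3)):
    try:
        weakref.ref(obj)
        status = "weakly referenceable"
    except TypeError:
        status = "NOT weakly referenceable"
    print(type(obj).__name__, "->", status)

# User-defined classes, including subclasses of builtins, gain a
# __weakref__ slot, so their instances *can* be tracked.
class Tracked(list):
    pass

ref = weakref.ref(Tracked([1, 2, 3]))
```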

Remove object tracker from tracer

The object tracker is inside the tracer, which no longer makes sense. To avoid duplicating the object tracking logic, move the object tracker to the flow graph builder. The tracer will then know nothing about the object tracker.

Example suite from scikit-learn

In order to improve the robustness of the Python program analysis and to expand the Data Science Ontology, I would like to test this package against a suite of examples larger than the current set of integration tests.

Conveniently, scikit-learn includes a large collection of self-contained example scripts. It would be best to start with one subdirectory, perhaps for linear models or SVMs.

Pip installation error due to dependency issue

I'm trying to install PyFlowGraph in a conda environment using pip, but the installation fails with the following error.

(base) C:\Windows\system32>pip install pyflowgraph==0.0.1
Collecting pyflowgraph==0.0.1
  Downloading pyflowgraph-0.0.1-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: six in c:\programdata\miniconda3\lib\site-packages (from pyflowgraph==0.0.1) (1.15.0)
Collecting PySide<1.2.4,>=1.2.2
  Using cached PySide-1.2.2.tar.gz (9.3 MB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\ProgramData\Miniconda3\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\doshi\\AppData\\Local\\Temp\\pip-install-b0fj7thi\\pyside_942316f248e34c29a40589d59519b7b6\\setup.py'"'"'; __file__='"'"'C:\\Users\\doshi\\AppData\\Local\\Temp\\pip-install-b0fj7thi\\pyside_942316f248e34c29a40589d59519b7b6\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\doshi\AppData\Local\Temp\pip-pip-egg-info-iissbheg'
         cwd: C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\setup.py", line 89, in <module>
        from utils import rmtree
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\utils.py", line 10, in <module>
        import popenasync
      File "C:\Users\doshi\AppData\Local\Temp\pip-install-b0fj7thi\pyside_942316f248e34c29a40589d59519b7b6\popenasync.py", line 26, in <module>
        if subprocess.mswindows:
    AttributeError: module 'subprocess' has no attribute 'mswindows'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b4/7b/2fc9d9e5c651c1550362d87bc4ab4cfe5368b312c1eaf477b5a4be708abd/PySide-1.2.2.tar.gz#sha256=53129fd85e133ef630144c0598d25c451eab72019cdcb1012f2aec773a3f25be (from https://pypi.org/simple/pyside/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement PySide<1.2.4,>=1.2.2 (from pyflowgraph)
ERROR: No matching distribution found for PySide<1.2.4,>=1.2.2

Failing tests in Python 2

Several unit tests and integration tests are failing under Python 2.7. We have always aimed to support both Python 2 and 3, but the Python 2 support has bit rotted.

This issue is preparation for creating a Travis CI build that checks Python 2 and Python 3.

Expand varargs when binding function arguments

When recording a flow graph, *args and **kwargs arguments should be "expanded" as part of binding a function call to a function signature.

For example, one function that is currently broken is numpy's meshgrid, which has signature:

np.meshgrid(*xi, **kwargs)

Currently, the call np.meshgrid(x, y, copy=True) is bound as

xi=(x, y), kwargs={'copy': True}

Under our programming model, the call should be bound as something like

xi__0=x, xi__1=y, copy=True
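The expanded binding could be computed with inspect.signature plus a small post-processing pass. A sketch, using a stand-in for meshgrid and a hypothetical expand_varargs helper (not the package's actual API):

```python
import inspect

def meshgrid(*xi, **kwargs):
    """Stand-in with the same signature as np.meshgrid."""

def expand_varargs(bound):
    """Hypothetical helper: flatten *args/**kwargs into individual bindings."""
    expanded = {}
    for name, value in bound.arguments.items():
        param = bound.signature.parameters[name]
        if param.kind is inspect.Parameter.VAR_POSITIONAL:
            # Each positional vararg gets a synthetic name: xi__0, xi__1, ...
            for i, item in enumerate(value):
                expanded["%s__%d" % (name, i)] = item
        elif param.kind is inspect.Parameter.VAR_KEYWORD:
            # Keyword varargs are promoted to ordinary keyword bindings.
            expanded.update(value)
        else:
            expanded[name] = value
    return expanded

bound = inspect.signature(meshgrid).bind("x", "y", copy=True)
print(dict(bound.arguments))   # {'xi': ('x', 'y'), 'kwargs': {'copy': True}}
print(expand_varargs(bound))   # {'xi__0': 'x', 'xi__1': 'y', 'copy': True}
```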

Deterministic node labels

The flow graph builder should use deterministic labels for nodes, not (random) UUIDs. This will make flow graphs easier to version control.

Note that R flow graphs already have deterministic node labels.

Kernel broken on IPython v7.0

Our custom IPython kernel is broken on IPython v7.0 and later. Because the IPython team dropped Python 2.7 support in that release, they were able to adopt asyncio in the kernel. The kernel API now uses coroutines.

By far the simplest fix is to just drop our own Python 2.7 support for the kernel. (We can retain Python 2.7 compatibility in other parts of the codebase.)

AST transformers for sequence literals

Sequence literals, for lists, tuples, sets, and dictionaries, are currently not traced. Define an AST transformer to make them traceable, as in #16.

Tuple literals are perhaps the most important, not only because they appear often in their own right but because the AST transformer for extended indexing generates them, e.g.

x[:m,:n]

becomes

operator.getitem(x, (slice(m), slice(n)))
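The desugaring can be checked with a toy class that records the key passed to its __getitem__ (a sketch, not part of the package):

```python
import operator

class KeyRecorder:
    """Toy object whose __getitem__ simply returns the key it received."""
    def __getitem__(self, key):
        return key

x, m, n = KeyRecorder(), 3, 4

# Extended indexing builds a tuple of slices, exactly matching the
# transformer's desugared form:
assert x[:m, :n] == (slice(None, 3, None), slice(None, 4, None))
assert x[:m, :n] == operator.getitem(x, (slice(m), slice(n)))
```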

Static analysis of variable access and assignment

Use static analysis to capture variable access and assignment, as a supplement to tracking objects by memory address. Specifically:

  1. Rewrite the AST to add hooks for variable access and assignment
  2. Emit trace events from the tracer on variable access and assignment
  3. Use these events in the flow graph builder to maintain a mapping from variable names to nodes in the flow graph

This feature constitutes the last major step in implementing #18 and fixing #17.
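Step 1 can be sketched with ast.NodeTransformer; the _load hook name and the event format here are invented for illustration:

```python
import ast

class LoadHooks(ast.NodeTransformer):
    """Wrap every variable read `x` as `_load('x', x)`."""
    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Load):
            return ast.Call(
                func=ast.Name(id="_load", ctx=ast.Load()),
                args=[ast.Constant(node.id),
                      ast.Name(id=node.id, ctx=ast.Load())],
                keywords=[])
        return node

events = []
def _load(name, value):
    # The hook records the access, then passes the value through unchanged.
    events.append(("load", name))
    return value

tree = ast.fix_missing_locations(LoadHooks().visit(ast.parse("z = x + y")))
ns = {"_load": _load, "x": 1, "y": 2}
exec(compile(tree, "<ast>", "exec"), ns)
print(events)    # [('load', 'x'), ('load', 'y')]
print(ns["z"])   # 3
```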

Cannot trace mathematical operators

The tracer does not pick up mathematical operators, like + and *, because they are not reported by sys.settrace or sys.setprofile. Closely related to #10.
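The gap is easy to observe: under sys.settrace, only Python-level function calls generate 'call' events, while builtin operators run silently.

```python
import sys

calls = []

def tracer(frame, event, arg):
    if event == "call":
        calls.append(frame.f_code.co_name)
    return tracer

def add(x, y):
    return x + y

sys.settrace(tracer)
result = add(1, 2) + 3   # the `+` between ints emits no trace event
sys.settrace(None)

print(calls)   # ['add'] -- only the Python-level call was seen
```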

Replace sys.settrace with AST rewriting

Using sys.settrace (or sys.setprofile) for tracing function calls was a quick way to get started but has fundamental limitations, some of which are documented in #10, #11, and #12.

A more flexible and scalable approach is to rewrite the AST with explicit tracing machinery. In a nutshell, replace function calls

f(x,y,z=1)

with

CALL(f, x, y, z=1)

where CALL is a special shim that calls and records a function.

It will take a little effort to get this approach off the ground, but it should ultimately work better. We took this approach from the beginning in rflowgraph, because R has no equivalent to sys.settrace AFAIK.

AST rewriting breaks functions that inspect the call stack

In rare cases, the current AST rewriting strategy can break functions which violate referential transparency. For example, patsy's formula evaluation magic breaks because it inspects the local namespace of the previous frame in the call stack.

The reason is that rewriting the function call f(x,y) to

trace(f, x, y)

(see #13) changes the observed call stack. Instead, we should rewrite f(x,y) to

trace_return((trace_fun(f))(trace_arg(x), trace_arg(y)))

which, in view of Python's left-to-right evaluation order, will evaluate as

trace_fun(f)
trace_arg(x)
trace_arg(y)
trace_return(...)

This will slightly complicate the tracing machinery but will better preserve the call stack. In particular, it should fix the problem with patsy.
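A sketch of the three hooks confirms the evaluation order (hook names follow the issue; the implementations are illustrative pass-throughs):

```python
order = []

def trace_fun(func):
    order.append("fun")
    return func

def trace_arg(arg):
    order.append("arg")
    return arg

def trace_return(value):
    order.append("ret")
    return value

def f(x, y):
    return x + y

# Rewritten form of f(1, 2), evaluated left to right:
result = trace_return((trace_fun(f))(trace_arg(1), trace_arg(2)))
print(result)   # 3
print(order)    # ['fun', 'arg', 'arg', 'ret']
```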

AST transformers for in-place mathematical operators

Implement AST transformers for in-place mathematical operators, using a combination of in-place operator functions (operator.iadd, operator.imul, etc) and setter functions (setattr and operator.setitem).

Support syntax like:

  • y += 1
  • x[::2] += 1
  • df.x *= 0.5

Follow up to #12.
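The intended desugarings can be sketched with the standard operator module. (A simple index stands in for the extended slice, whose in-place arithmetic needs NumPy semantics, and the Frame class is a stand-in for a DataFrame.)

```python
import operator

# `y += 1` becomes an in-place operator function plus a rebind:
y = 5
y = operator.iadd(y, 1)
assert y == 6

# `df.x *= 0.5` combines getattr / operator.imul / setattr:
class Frame:
    pass

df = Frame()
df.x = 2.0
setattr(df, "x", operator.imul(getattr(df, "x"), 0.5))
assert df.x == 1.0

# `x[i] += 1` combines operator.getitem / iadd / setitem:
x = [10, 20]
operator.setitem(x, 0, operator.iadd(operator.getitem(x, 0), 1))
assert x == [11, 20]
```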

Migrate to NetworkX 2

When this project was started, NetworkX was at version 1.11. Now that NetworkX 2 is finally released, we should migrate. I think the main breaking change is moving from lists to iterators in node/edge accessors. See the migration guide.

Supplement object tracking with static analysis

Currently, dataflow dependencies are determined solely by tracking objects, essentially by memory address (technically, using Python's weak references). Naturally, this does not work for objects that do not have unique/stable memory addresses, such as primitive scalar values (see #17).

Now that we're using AST rewriting, as of #15, we should supplement the memory-based object tracker with static analysis to capture dataflow. A satisfactory resolution of this issue would fix #14 and #17. A similar approach is already being used in rflowgraph.

Undefined name 'target' in ast_tracer.py

flake8 testing of https://github.com/IBM/pyflowgraph on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./flowgraph/trace/ast_tracer.py:260:25: F821 undefined name 'target'
        elif isinstance(target, ast.List):
                        ^
1     F821 undefined name 'target'
1

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

Use syntax to identify multiple return values

Like many programming languages, Python treats function inputs and outputs asymmetrically. A function can have many arguments, but only one return value. Multiple return values are represented implicitly by returning a tuple. The system currently detects tuple return values and interprets them appropriately.

Unfortunately, not all Python functions respect the convention of returning tuples. For example, numpy.meshgrid returns a list of arrays, but in an expression like

xv, yv = np.meshgrid(x, y)

it is abundantly clear that we have a function call with two arguments and two (logical) return values. The system should use the syntax of multiple return values to identify them.
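Detecting this syntactically is straightforward with the ast module; count_targets below is a hypothetical helper, not the package's API:

```python
import ast

def count_targets(source):
    """Count the names on the left-hand side of an assignment statement."""
    stmt = ast.parse(source).body[0]
    if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Tuple):
        return len(stmt.targets[0].elts)
    return 1

print(count_targets("xv, yv = np.meshgrid(x, y)"))   # 2
print(count_targets("z = f(x)"))                     # 1
```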
