usyd-blockchain / vandal Goto Github PK
View Code? Open in Web Editor NEWStatic program analysis framework for Ethereum smart contract bytecode.
License: BSD 3-Clause "New" or "Revised" License
Static program analysis framework for Ethereum smart contract bytecode.
License: BSD 3-Clause "New" or "Revised" License
The decompiler should display two CFGs when given creation code as input:
CODECOPY
operations?We might consider targeting Julia or solidity-assembly.
In this fashion we might be able to emit compilable decompiled code.
The ERC20 standard requires contracts to implement an interface, described here.
It would be cool to be able to automatically identify contracts that conform, and is trivial to implement as a datalog spec.
Desiderata:
Something like Plotly or Bokeh may be handy here if the thing is to be interactive, but perhaps it's just better to have a JS viewer which just reads a json dump of the CFG.
One example contract which produces this error (several different contracts do):
0x1a9559716cafded0b9573768ea52ee29e922b687_2271995_runtime.txt
Traceback (most recent call last):
File "../../bin/decompile", line 206, in <module>
cfg = tac_cfg.TACGraph.from_bytecode(args.infile)
File "/home/lexi/repos/vandal/bin/../src/tac_cfg.py", line 123, in from_bytecode
return cls(blockparse.EVMBytecodeParser(bytecode).parse())
File "/home/lexi/repos/vandal/bin/../src/tac_cfg.py", line 99, in __init__
self.apply_operations()
File "/home/lexi/repos/vandal/bin/../src/tac_cfg.py", line 260, in apply_operations
block.apply_operations(use_sets)
File "/home/lexi/repos/vandal/bin/../src/tac_cfg.py", line 1189, in apply_operations
op.lhs.values = mem.Variable.arith_op(op.opcode.name, rhs).values
File "/home/lexi/repos/vandal/bin/../src/memtypes.py", line 270, in arith_op
result = ssle.cartesian_map(getattr(cls, opname), args)
File "/home/lexi/repos/vandal/bin/../src/lattice.py", line 316, in cartesian_map
return cls([f(*args) for args in prod])
File "/home/lexi/repos/vandal/bin/../src/lattice.py", line 316, in <listcomp>
return cls([f(*args) for args in prod])
File "/home/lexi/repos/vandal/bin/../src/memtypes.py", line 392, in BYTE
return (v >> ((cls.SIZE - b) * 8)) & 0xFF
OverflowError: Python int too large to convert to C ssize_t
➜ python --version
Python 3.6.1 :: Continuum Analytics, Inc.
Running time is dominated by joins on the state structure.
We could do better, faster asymptotics, and more expressive.
Interval trees for memory -- which would allow a sparser representation and greater precision for memory.
Better representation for the variable and lattice types for faster arithmetic and less bloat.
Perhaps write these as C extensions for better speed.
Instead of naively calculating exit stacks from all values in the entry stack, the delta between the latest entry stack and the one from the previous dataflow step can be used to determine the new output stack.
Should bring binary arithmetic operation running times down from the product to the sum of the cardinalities of the input sets.
Currently we have some dodgy integration test stuff going on in .travis.yml
, in whcih we just run our stuff over a bunch of examples and check that it exits successfully.
Problems with this:
The solution is to implement a better testing harness using a TAP-compliant framework such as BATS.
Is it possible to check vulnerabilities in contract build with Solidity 0.5x?
In src/dataflow.py
, replace all occurrences of time.clock()
by time.process_time()
.
time.clock()
has been deprecated since Python 3.3 and has been removed in Python 3.8.
If you are a prefectionist, you might also want to remove a typo:
Replace
"blocks were cloned on average of %.2f times each."
either by
"blocks were cloned an average of %.2f times each."
(on
replaced by an
), or by
"blocks were cloned, on average, %.2f times each."
(of
removed).
It would be nice if the decompiler had an option to output just the IL code without other metadata.
i.e. you put bytecode into the decompiler and it guesses whether what you gave it is Solidity, Viper, etc.
It could be by some manually-designed means, but I kinda like the idea of using ML with learned features, so that we don't have to hand-specify anything. Just jam a bunch of examples into something like https://arxiv.org/abs/1603.05629 .
Our intermediate language is not really three address code.
We need to implement the new REVERT and RETURNDATA* opcodes included in the metropolis release.
The tests really need to be better. Some categories of tests to include (in no particular order):
Fix and merge the memory abstraction work.
In future it might be necessary to improve the memory system with interval trees for handling the joins of non-disjoint memory states. Add another card for that when this one is retired.
A bunch of analytics are collected during dataflow analysis.
Add a flag to toggle whether this occurs or not, and make it clear how to output this stuff.
Maybe make an exporter which produces the analytics only, and nothing else.
The analytics information should be a graph object member; then exporters can be modified to easily output this data along with the rest of it.
In file src/opcodes.py
, replace the line
CREATE2 = OpCode("CREATE2", 0xfb, 4, 1)
by
CREATE2 = OpCode("CREATE2", 0xf5, 4, 1)
The hex code of CREATE2 is 0xf5
, not 0xfb
.
Furthermore I suggest to integrate the lines below the headline
# New Byzantinium OpCodes for block.number >= BYZANTIUM_FORK_BLKNUM
into the general list of instructions, since the Constantinople opcodes, even newer than Byzantinium, have been integrated there as well.
Let the value in a given connected component stabilise first before tackling other regions of the graph. Calculate in topological order from the entry block.
This should allow loops to complete their calculations before propagating values out to the rest of the graph, thus saving computation. See, for example, the mem_leak example, with set-valued calculations turned on. It takes longer than it otherwise should.
It's likely that recomputing is just the most efficient thing for us, but, given that the graph is only partially constructed at any given time, it may be necessary to carry around a structure to dynamically track and merge the connected components of the graph, as it is being formed. This might be a double union find if edges may only be added, something more complex if we want fast component tracking with edge removal, for example: http://dl.acm.org/citation.cfm?id=320215, http://dl.acm.org/citation.cfm?id=502095, http://dl.acm.org/citation.cfm?id=335345.
This should be otherwise equivalent to the current mechanism in acyclic graphs, by the definition of the fixed point (i.e. the dataflow iteration may proceed in arbitrary block order). But will increase speed, I believe.
Should be tackled at the same time as #23 , or maybe this one immediately before that one: but both should probably share a notion of graph components.
Currently the IL output does the following:
0x57: V23 = 0x20
0x59: V24 = ADD 0x20 V22
corresponding to:
0x57 PUSH1 0x20
0x59 ADD
In this case, the 0x20
in the IL language is actually an instance of V23
, but its value is shown instead in the output, which is confusing. We should perhaps disable this by default and add a config option or command-line argument to re-enable it.
When running the bulk_analyser just now I've seen quite a few contracts error out with Error: 'NoneType' object has no attribute 'is_private'
.
Stack trace:
➜ bin/decompile ~/eth-ramdisk/ethscrape-2018-03-27-b5321831-uniq/0xff95fbb77852c846724ce4fa140ca25f31a90bb0_2627330_runtime.hex
Traceback (most recent call last):
File "bin/decompile", line 224, in <module>
dataflow.analyse_graph(cfg)
File "/home/lexi/repos/vandal/bin/../src/dataflow.py", line 139, in analyse_graph
cfg.extract_functions()
File "/home/lexi/repos/vandal/bin/../src/tac_cfg.py", line 802, in extract_functions
fe.extract()
File "/home/lexi/repos/vandal/bin/../src/function.py", line 107, in extract
self.private_functions.extend(self.extract_private_functions())
File "/home/lexi/repos/vandal/bin/../src/function.py", line 300, in extract_private_functions
f.is_private = True
AttributeError: 'NoneType' object has no attribute 'is_private'
Here's a contract that triggers the exception:
➜ cat ~/eth-ramdisk/ethscrape-2018-03-27-b5321831-uniq/0xff95fbb77852c846724ce4fa140ca25f31a90bb0_2627330_runtime.hex
606060405236156100b95760e060020a600035046306fdde0381146100be578063095ea7b3146100fd57806318160ddd1461011057806323b872dd14610128578063313ce5671461014757806354fd4d501461015457806370a08231146101935780638da5cb5b146101bd57806395d89b41146100be578063a39a45b7146101d4578063a4e2d63414610259578063a9059cbb14610266578063cae9ca511461027f578063d8162db714610345578063dd62ed3e14610352575b610002565b346100025761038b60408051808201909152600581527f524f554e44000000000000000000000000000000000000000000000000000000602082015281565b34610002576103f96004356024356102d9565b346100025761040d6b033b2e3c9fd0803ce800000081565b34610002576103f96004356024356044356000836104525b4360001190565b346100025761041f601281565b346100025761038b60408051808201909152600381527f302e310000000000000000000000000000000000000000000000000000000000602082015281565b346100025761040d600435600160a060020a0381166000908152600160205260409020545b919050565b3461000257610435600054600160a060020a031681565b34610002576103f960043560008054600160a060020a039081163390911614156101b857805473ffffffffffffffffffffffffffffffffffffffff19168217815560408051600160a060020a038416815290517f3edd90e7770f06fafde38004653b33870066c33bfc923ff6102acd601f85dfbc9181900360200190a15060016101b8565b34610002576103f9610140565b34610002576103f9600435602435600033610580610140565b3461000257604080516020604435600481810135601f81018490048402850184019095528484526103f994813594602480359593946064949293910191819084018382808284375094965050505050505060008361066881855b33600160a060020a03908116600081815260026020908152604080832094871680845294825280832086905580518681529051929493927f8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925929181900390910190a35060015b92915050565b346100025761040d600081565b346100025761040d600435602435600160a060020a0382811660009081526002602090815260408083209385168352929052205461033f565b60405180806020018281038252838181518152602001915080519060200190808383829060006004602084601f0104600302600f01f150905090810190601f1680156103eb5780820380516001836020036101000a031916815260200191505b509250505060405180910390f35b604080519115158252519081900360200190f35b60408051918252519081900360200190f35b6040805160ff9092168252519081900360200190f35b60408051600160a060020a03929092168252519081900360200190f35b158061046c5750600054600160a060020a03908116908216145b15610578578330600160a060020a031681600160a060020a031614151561057657600160a060020a0386166000908152600160205260409020548490108015906104d65750600260209081526040600081812033600160a060020a03168252909252902054849010155b80156104e25750600084115b1561057157600160a060020a03858116600081815260016020908152604080832080548a0190558a851680845281842080548b90039055600283528184203390961684529482529182902080548990039055815188815291519293927fddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef9281900390910190a360019250610576565b600092505b505b509392505050565b158061059a5750600054600160a060020a03908116908216145b15610661578330600160a060020a031681600160a060020a031614151561065f5733600160a060020a03166000908152600160205260409020548490108015906105e45750600084115b1561065a5733600160a060020a03908116600081815260016020908152604080832080548a90039055938916808352918490208054890190558351888152935191937fddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef929081900390910190a36001925061065f565b600092505b505b5092915050565b156105785780600160a060020a0316638f4ffcb1338630876040518560e060020a0281526004018085600160a060020a0316815260200184815260200183600160a060020a03168152602001806020018281038252838181518152602001915080519060200190808383829060006004602084601f0104600302600f01f150905090810190601f1680156107105780820380516001836020036101000a031916815260200191505b5095505050505050600060405180830381600087803b156100025760325a03f115610002575050506001915061057856
Currently the output of bin/decompile can differ each time it is run on the same input. This is likely due to our use of unordered collections such as dict
. Unfortunately, this makes output testing of the kind being implemented in lexibrent:bats-tests impossible.
The solution I propose is updating the functions that produce textual output so that all non-deterministically ordered data structures are sorted.
For example, recently I saw these two different outputs for basic_optimized.dasm
:
5c5
< Successors: [0x1a, 0x18]
---
> Successors: [0x18, 0x1a]
108a109
> --
My proposed solution is to sort the preds and succs lists, and any other relevant structures in our output rendering functions.
Similar to stack capacity freezing. Better than widening, in some cases, e.g. variables with sequential values.
At present, procedures are cloned at the block level. But the current algorithm induces an infinite loop when the cfg is cyclic.
A solution: keep track of SCCs, and clone them in their entirety. In this way procedures still should get cloned without bloating up their interiors if they contain cycles, and cycles will not produce infinite clones of themselves.
Should be tackled at the same time as #22 .
We should change Vandal to follow a setuptools-compliant package structure, as discussed in #4. This is currently being worked on by @SamuelMarks.
Relevant examples / information on how everything should be structured:
We use python 3.8.13 versions and the environment is Linux Ubuntu20.04
Timeout Error
We have found that some of the timeout issues are caused by internal processing logic, such as the following, where Python gets stuck in a long computation when the e value is large and then throws an exception when the time limit is exceeded.
@classmethod
def EXP(cls, b: int, e: int) -> int:
"""Exponentiation: return b to the power of e."""
return b ** e
In the test code, several of these contract codes have similar problems
b9625381f086e7b8512e4825f6af1117e9c84d43_TokenFRT.sol/Math.bin-runtime
8Ce9E9442F2791FC63CD6394cC12F2dE4fbc1D71_UNIV2LPOracle.sol/UNIV2LPOracleFactory.bin-runtime
5C1cBdc3b8dD2a3456643A62547ef9AA5e1571f3_DMMFactory.sol/DMMFactory.bin-runtime
3.Fix suggestions
Refer to the implementation of the EXP operation in py-evm.
@curry
def exp(computation: BaseComputation, gas_per_byte: int) -> None:
"""
Exponentiation
"""
base, exponent = computation.stack_pop_ints(2)
bit_size = exponent.bit_length()
byte_size = ceil8(bit_size) // 8
if exponent == 0:
result = 1
elif base == 0:
result = 0
else:
result = pow(base, exponent, constants.UINT_256_CEILING)
computation.consume_gas(
gas_per_byte * byte_size,
reason="EXP: exponent bytes",
)
computation.stack_push_int(result)
Throwing an exception error. ValueError:negative shift count
The EVM specification defines the semantics of the BYTE operation as (x >> (248 - i * 8)) && 0xFF
, but in practice, there are cases where the value of i is large and causes (248 - i * 8)
to be negative, which requires special handling in the specific implementation, and the relevant implementation code in Vandal does not handle such cases, resulting in the following exception.
@classmethod
def BYTE(cls, b: int, v: int) -> int:
"""Return the b'th byte of v."""
return (v >> ((cls.SIZE - b) * 8)) & 0xFF
The exception will occur in the following code
348505ac7160bf45a150599abe10dc4e1ff2feed_ERC20Token.sol/ERC20Token.bin-runtime
[Error]Called Error : Traceback (most recent call last):
File ""/home/liux/vandal/bin/decompile"", line 206, in <module>
cfg = tac_cfg.TACGraph.from_bytecode(args.infile)
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 123, in from_bytecode
return cls(blockparse.EVMBytecodeParser(bytecode).parse())
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 99, in **init**
self.apply_operations()
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 260, in apply_operations
block.apply_operations(use_sets)
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 1189, in apply_operations
op.lhs.values = mem.Variable.arith_op([op.opcode.name](http://op.opcode.name/), rhs).values
File ""/home/liux/vandal/bin/../src/memtypes.py"", line 270, in arith_op
result = ssle.cartesian_map(getattr(cls, opname), args)
File ""/home/liux/vandal/bin/../src/lattice.py"", line 316, in cartesian_map
return cls([f(*args) for args in prod])
File ""/home/liux/vandal/bin/../src/lattice.py"", line 316, in <listcomp>
return cls([f(*args) for args in prod])
File ""/home/liux/vandal/bin/../src/memtypes.py"", line 392, in BYTE
return (v >> ((cls.SIZE - b) * 8)) & 0xFF
ValueError: negative shift count
abe56c6c9636cf55107d98c3083be8b6ff6970c3_MasterChef.sol/SafeMath.bin-runtime
[Error]Called Error : Traceback (most recent call last):
File ""/home/liux/vandal/bin/decompile"", line 206, in <module>
cfg = tac_cfg.TACGraph.from_bytecode(args.infile)
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 123, in from_bytecode
return cls(blockparse.EVMBytecodeParser(bytecode).parse())
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 99, in **init**
self.apply_operations()
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 260, in apply_operations
block.apply_operations(use_sets)
File ""/home/liux/vandal/bin/../src/tac_cfg.py"", line 1189, in apply_operations
op.lhs.values = mem.Variable.arith_op([op.opcode.name](http://op.opcode.name/), rhs).values
File ""/home/liux/vandal/bin/../src/memtypes.py"", line 270, in arith_op
result = ssle.cartesian_map(getattr(cls, opname), args)
File ""/home/liux/vandal/bin/../src/lattice.py"", line 316, in cartesian_map
return cls([f(*args) for args in prod])
File ""/home/liux/vandal/bin/../src/lattice.py"", line 316, in <listcomp>
return cls([f(*args) for args in prod])
File ""/home/liux/vandal/bin/../src/memtypes.py"", line 392, in BYTE
return (v >> ((cls.SIZE - b) * 8)) & 0xFF
ValueError: negative shift count
3.Fix suggestions
The implementation specifically handles the case where (248 - i * 8)
is negative, see the implementation of BYTE operations in py-evm or panoramix.
def byte_op(computation: BaseComputation) -> None:
"""
Bitwise And
"""
position, value = computation.stack_pop_ints(2)
if position >= 32:
result = 0
else:
result = (value // pow(256, 31 - position)) % 256
computation.stack_push_int(result)
def byte_op(position, value):
if position >= 32:
return 0
return (value // pow(256, 31 - position)) % 256
Currently all stacks and variables are purged of their values at the start of each new dataflow round. It would be nice to be able to purge the unwidened values, but not purge the widened ones.
The CFG should store inside itself which variables have previously been widened and optionally force their values to Top in future dataflow analyses.
Exception handling in the bulk_analyser doesn't clean up or wait for Souffle subprocesses.
We should have a CLI in bin/
which runs the entire framework pipeline (decompilation + analysis with Souffle). Currently the analysis step is implemented separately in tools/bulk_analyser
.
We should be able to inject values for e.g. TIMESTAMP
and ADDRESS
, etc. if contract metadata is provided.
SHA3 should be functional once we have the memory stuff integrated properly. c.f. #14
Currently state is joined before body execution. Instead, we might execute the body upon each incoming state structure and join afterwards in order to properly yield the actual meet-over-paths solution.
For example, the parity multisig wallet problem, https://kovan.etherscan.io/solcbuginfo
Hi, base on the CFGs generated for the DAO_hack example, there seems to be a dangling node.
What is the reason that we have these kinds of dangling nodes?
They don't seem to be reachable, what is the purpose of having them in the first place?
I understand this is probably related to the design of EVM. But I've looked everywhere and could not find a proper explanation.
We need to review all our documentation (including auto-generated), and possibly centralise it on the GitHub wiki rather than the various .md files we currently have scattered around. Some of these may also be out of date, so we need to check this too.
Hi,
I am trying to build my analysis on top of vandal but I am very confused about the naming convention for the flow in function calls.
Here is my toy example:
contract SymVars {
function addTo(uint number) returns (uint) {
var a = 1;
var c = inc(a, number);
return c;
}
function inc(uint k, uint v) returns (uint) {
return (k + v + 100);
}
}
Technically, there should be flows a->k and number->v. However, the flows seem to be broken in the IR unless you are using some implicit naming convention:
I am only listing the relevant IRs:
0xdd: JUMPDEST
0xde: V62 = 0x0
0xe1: V63 = 0x0
0xe3: V64 = 0x1
0xe7: V65 = 0xf3
0xeb: V66 = 0xff
0xed: V67 = AND 0xff 0x1
0xef: V68 = 0xff
0xf2: JUMP 0xff
0xff: JUMPDEST
0x100: V69 = 0x0
0x102: V70 = 0x64
0x106: V71 = ADD S1 S0
0x107: V72 = ADD V71 0x64
0x10e: JUMP {0xc7, 0xf3}
Here, it seems that you implicitly use S0 and S1 to represent the formal parameters of the caller (i.e., inc), but where do they come from? If I directly perform data flow analysis on the IRs generated by vandal, the results will be unsound. Could you please clarify?
Thanks,
Right now, a variable defined in terms of the join of other variables just gets the name Var
, but it would be nice if instead their were given names based upon the variables they were composed of.
This should also handle variables defined in terms of themselves.
Setting a timeout in Vandal doesn't seem to do anything for long-running contracts.
This is the bailout_seconds
option, in particular. We noticed that running bulk_analyser
sometimes produces Vandal analysis times of several hours for some contracts, even when this option is set to 20 seconds.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.