softdevteam / diffract

A Rust implementation of the GumTree differ

License: Other
It would be useful (for demonstrations and debugging) to have a version of the calc
grammar with no whitespace, because the full trees are difficult to read.
Actions are currently implemented as structs, and an edit script is a struct which contains a vector of boxed objects implementing the `ApplyAction` trait. Possibly we could replace the action structs with enums, or it may be possible to replace the `Box` with a reference.
NB: https://stackoverflow.com/questions/48952627/implementing-partialeq-on-a-boxed-trait
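The enum alternative might look something like the sketch below. The variant payloads here are illustrative placeholders, not Diffract's actual action fields; the point is that a plain `enum` derives `PartialEq` for free, sidestepping the boxed-trait equality problem in the Stack Overflow link above.

```rust
// Sketch: replacing boxed action trait objects with a single enum.
// Variant payloads are illustrative, not Diffract's real fields.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Insert { node: usize, parent: usize, position: usize },
    Delete { node: usize },
    Update { node: usize, new_label: String },
    Move { node: usize, parent: usize, position: usize },
}

// An edit script becomes a plain Vec<Action>.
fn apply_all(script: &[Action]) {
    for action in script {
        match action {
            Action::Insert { node, parent, position } =>
                println!("INS node {} under {} at {}", node, parent, position),
            Action::Delete { node } => println!("DEL node {}", node),
            Action::Update { node, new_label } =>
                println!("UPD node {} -> {}", node, new_label),
            Action::Move { node, parent, position } =>
                println!("MOV node {} under {} at {}", node, parent, position),
        }
    }
}

fn main() {
    let script = vec![
        Action::Insert { node: 5, parent: 0, position: 2 },
        Action::Delete { node: 3 },
    ];
    // Equality on whole edit scripts now works without custom impls.
    assert_eq!(script, script.clone());
    apply_all(&script);
}
```

A trade-off of the enum approach is that adding a new action kind means touching every `match`, whereas the trait-object approach is open to extension.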
Currently, all nodes in ASTs have a unique identifier of type `NodeId`. In both the "from" and "to" ASTs, node ids are numbered from 0. Ideally, it would be better to use separate types for the different tree identifiers (e.g. `FromNodeId` and `ToNodeId`). This would help to avoid bugs where we have used a node from the "wrong" tree (several such bugs have already been fixed in this repo!).
/cc @k1502606
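A minimal sketch of the newtype pattern this suggests. The wrapped representation (`usize`) is an assumption; Diffract's `NodeId` may differ, but the principle is the same: mixing up the two tree types becomes a compile-time error.

```rust
// Sketch of the suggested newtype pattern. The usize representation
// is an assumption, not Diffract's actual NodeId.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FromNodeId(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ToNodeId(usize);

// This function can only ever be handed a node from the "from" tree;
// passing a ToNodeId here no longer compiles.
fn delete_from_tree(id: FromNodeId) -> usize {
    id.0
}

fn main() {
    let from = FromNodeId(0);
    let _to = ToNodeId(0);
    assert_eq!(delete_from_tree(from), 0);
    // delete_from_tree(_to); // <- compile error: mismatched types
}
```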
Currently, in integration tests such as `tests/test_edit_script.rs`, we have a boilerplate function which applies the tests, then a number of functions which apply the boilerplate to the input files. It would be nicer to glob the test files in the `tests/` directory (currently `*.calc`, `*.txt` and `*.java`) and apply the tests with a macro, as in https://stackoverflow.com/questions/34662713/how-can-i-create-parameterized-tests-in-rust/34666891#34666891
This needs a little thought: we would need to automatically select the correct lexer and parser based on the filename, as we do in `main` and `ast`. This code might be better factored out, since we already use it in two places in the code base.
Globbing requires an external crate: https://crates.io/crates/glob
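The "select the correct lexer and parser based on the filename" part could be as simple as a match on the file extension. A sketch (using `std::path` rather than the glob crate; the grammar file names returned here are placeholders, not Diffract's actual grammar files):

```rust
use std::path::Path;

// Sketch: pick a lexer/parser pair from the file extension.
// The grammar names are placeholders, not Diffract's actual files.
fn select_grammar(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "calc" => Some("calc.l / calc.y"),
        "txt" => Some("txt.l / txt.y"),
        "java" => Some("java.l / java.y"),
        _ => None, // unknown file type: no grammar available
    }
}

fn main() {
    assert_eq!(select_grammar("tests/add.calc"), Some("calc.l / calc.y"));
    assert_eq!(select_grammar("tests/Hello.txt"), Some("txt.l / txt.y"));
    assert_eq!(select_grammar("tests/unknown.rs"), None);
}
```

Factored out like this, the same function could serve `main`, `ast` and the globbing test macro.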
The terms from Chawathe et al., `label` and `value`, are confusing, and should be replaced by the more commonly used `type` and `value`. This is likely to be a surprisingly large refactoring.
For example, using: https://crates.io/crates/proptest to test the output of the edit script generator.
Windows tests failed because the newline endings (`\r\n`) differ from those on Linux (`\n`).
---- test_nested_comment_java stdout ----
thread 'test_nested_comment_java' panicked at 'assertion failed: `(left == right)`
left: `"^~\n ~\n COMMENT /*\n * // Single line comment nested in multi-line comment.\n */\n ~\n WHITESPACE \n\n ~\n goal\n compilation_unit\n package_declaration_opt\n import_declarations_opt\n type_declarations_opt\n type_declarations\n type_declaration\n class_declaration\n modifiers_opt\n modifiers\n modifier\n PUBLIC public\n ~\n WHITESPACE \n ~\n CLASS class\n ~\n WHITESPACE \n ~\n IDENTIFIER NestedComment\n ~\n WHITESPACE \n ~\n type_parameters_opt\n super_opt\n interfaces_opt\n class_body\n LBRACE {\n ~\n class_body_declarations_opt\n RBRACE }\n ~\n WHITESPACE \n\n ~\n"`,
right: `"^~\n ~\n COMMENT /*\r\n * // Single line comment nested in multi-line comment.\r\n */\n ~\n WHITESPACE \r\n\n ~\n goal\n compilation_unit\n package_declaration_opt\n import_declarations_opt\n type_declarations_opt\n type_declarations\n type_declaration\n class_declaration\n modifiers_opt\n modifiers\n modifier\n PUBLIC public\n ~\n WHITESPACE \n ~\n CLASS class\n ~\n WHITESPACE \n ~\n IDENTIFIER NestedComment\n ~\n WHITESPACE \n ~\n type_parameters_opt\n super_opt\n interfaces_opt\n class_body\n LBRACE {\n ~\n class_body_declarations_opt\n RBRACE }\n ~\n WHITESPACE \r\n\n ~\n"`', tests\test_ast.rs:60:4
The reason for having a `TemporaryMappingStore` is that the edit script generation algorithm needs to differentiate between the "original" mappings provided by the mapping algorithm and the new mappings added by the edit script generator itself. However, in our implementation of the `MappingStore` type, we keep an extra piece of metadata with each mapping. This metadata is used by the GraphViz output to show (as debug output) which part of the Diffract system contributed each mapping. Potentially, the edit script generator could use this metadata to decide whether a mapping existed in the original store or only in the overall store. Ideally, this would be encapsulated within the `MappingStore` API, and then the `TemporaryMappingStore` type could be deleted, in which case we would also avoid copying mappings to and from the temporary store.
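One way to encapsulate this could look like the following sketch. All names and the representation are assumptions for illustration, not Diffract's actual `MappingStore` API:

```rust
use std::collections::HashMap;

// Sketch: tag each mapping with its provenance, so the edit script
// generator can distinguish "original" matcher mappings from the ones
// it adds itself. Names/representation are assumptions, not Diffract's.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MappingSource {
    Matcher,    // added by the matching algorithm
    EditScript, // added by the edit script generator
}

#[derive(Default)]
struct MappingStore {
    // from-node id -> (to-node id, provenance)
    mappings: HashMap<usize, (usize, MappingSource)>,
}

impl MappingStore {
    fn push(&mut self, from: usize, to: usize, source: MappingSource) {
        self.mappings.insert(from, (to, source));
    }

    // Replaces the need for a separate TemporaryMappingStore: callers
    // ask whether a mapping pre-dates edit script generation.
    fn is_original(&self, from: usize) -> bool {
        matches!(self.mappings.get(&from),
                 Some((_, MappingSource::Matcher)))
    }
}

fn main() {
    let mut store = MappingStore::default();
    store.push(0, 0, MappingSource::Matcher);
    store.push(5, 7, MappingSource::EditScript);
    assert!(store.is_original(0));
    assert!(!store.is_original(5));
}
```

The same `MappingSource` tag could then drive the GraphViz debug colouring, so the two uses of the metadata share one mechanism.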
The file `Big.java` (24,597 lines), which is used in testing the Eco project, causes a stack overflow when diffed with diffract.
The isomorphism tests based on hashes are a transliteration of the GT code (based on Chilowicz et al. 2011). Some hashes (based on tree structure) are stored within AST nodes, but these need to be recomputed whenever the AST changes. This could be made more efficient, for example by storing a version number in each node and, whenever a node changes, incrementing its version and that of its ancestors. This would indicate which hashes need to be recomputed and which do not.
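The version-number idea above can be sketched as follows. The flat parent-vector representation is an assumption made for brevity, not Diffract's actual AST layout:

```rust
// Sketch of the version-number idea: when a node changes, bump its
// version and every ancestor's, marking their cached hashes stale.
// The flat parent-vector representation is an assumption.
struct Tree {
    parent: Vec<Option<usize>>, // parent[i] = parent of node i
    version: Vec<u64>,          // bumped when node i's subtree changes
    hashed_at: Vec<u64>,        // version at which the hash was cached
}

impl Tree {
    // Record that `node` changed: walk up to the root, bumping versions.
    fn touch(&mut self, mut node: usize) {
        loop {
            self.version[node] += 1;
            match self.parent[node] {
                Some(p) => node = p,
                None => break,
            }
        }
    }

    // A cached hash is stale iff the version has moved on since it
    // was computed; only stale hashes need recomputing.
    fn hash_is_stale(&self, node: usize) -> bool {
        self.version[node] != self.hashed_at[node]
    }
}

fn main() {
    // Tree: node 0 is the root; nodes 1 and 2 are its children.
    let mut t = Tree {
        parent: vec![None, Some(0), Some(0)],
        version: vec![0, 0, 0],
        hashed_at: vec![0, 0, 0],
    };
    t.touch(1); // node 1 changed: 1 and its ancestor 0 become stale
    assert!(t.hash_is_stale(1));
    assert!(t.hash_is_stale(0));
    assert!(!t.hash_is_stale(2)); // the untouched sibling stays valid
}
```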
This was discovered in PR #36, but reproducing it does not require the `-g` switch, so the bug must have been in the code base for some time.
Example input files:
$ cat tests/add.calc
1+2
$ cat tests/Hello.txt
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}
Example run:
$ diffract -a -m Myers tests/add.calc tests/Hello.txt
INFO:diffract: Selecting the Myers matching algorithm.
^~
~
Expr
Term
Factor
INT 1
~
PLUS +
~
Expr
Term
Factor
INT 2
~
WHITESPACE
~
document
paragraph
WORD public
SPACE
paragraph
WORD class
SPACE
paragraph
WORD Hello
SPACE
paragraph
WORD {
END
document
paragraph
SPACE
paragraph
WORD public
SPACE
paragraph
WORD static
SPACE
paragraph
WORD void
SPACE
paragraph
WORD main(String[]
SPACE
paragraph
WORD args)
SPACE
paragraph
WORD {
END
document
paragraph
SPACE
paragraph
WORD System.out.println("Hello,
SPACE
paragraph
WORD world!");
END
document
paragraph
SPACE
paragraph
WORD }
END
document
paragraph
WORD }
END
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /checkout/src/libcore/option.rs:335:20
stack backtrace:
0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::_print
at /checkout/src/libstd/sys_common/backtrace.rs:69
2: std::panicking::default_hook::{{closure}}
at /checkout/src/libstd/sys_common/backtrace.rs:58
at /checkout/src/libstd/panicking.rs:381
3: std::panicking::default_hook
at /checkout/src/libstd/panicking.rs:397
4: std::panicking::rust_panic_with_hook
at /checkout/src/libstd/panicking.rs:577
5: std::panicking::begin_panic
at /checkout/src/libstd/panicking.rs:538
6: std::panicking::begin_panic_fmt
at /checkout/src/libstd/panicking.rs:522
7: rust_begin_unwind
at /checkout/src/libstd/panicking.rs:498
8: core::panicking::panic_fmt
at /checkout/src/libcore/panicking.rs:71
9: core::panicking::panic
at /checkout/src/libcore/panicking.rs:51
10: <core::option::Option<T>>::unwrap
at /checkout/src/libcore/macros.rs:20
11: <diffract::matchers::MappingStore<T>>::generate_edit_script
at ./src/lib/matchers.rs:345
12: diffract::main
at src/main.rs:374
13: __rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:99
14: std::rt::lang_start
at /checkout/src/libstd/panicking.rs:459
at /checkout/src/libstd/panic.rs:361
at /checkout/src/libstd/rt.rs:59
15: main
16: __libc_start_main
In the Chawathe (1996) edit script generator we need to find the position of a child node with respect to its parent (i.e. "this is the third child of its parent"). In the `action` and `edit_script` modules these numbers are stored as `u16`, but it is worth thinking about whether this should be refactored as a `u32`.
In the Zhang-Shasha implementation here and in GT, the cost of inserting and deleting a subtree is fixed at 1.0 (the cost of updating is a qgram distance between the two strings). Is this really sensible, or would it be better to take into account the size and shape of the subtree being inserted/deleted? Or the type of the root node of the subtree?
These thoughts were partly inspired by this slide deck on Zhang-Shasha:
https://grfia.dlsi.ua.es/repositori/grfia/otros/JoseRect09032007.pdf
and these papers:
https://grfia.dlsi.ua.es/repositori/grfia/pubs/172/paper.pdf
https://grfia.dlsi.ua.es/repositori/grfia/pubs/342/distance-string-pfa.pdf
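One size-aware alternative to the fixed cost of 1.0 could be sketched like this. The formula is purely illustrative (nothing here is Diffract's or Zhang-Shasha's actual cost model); the point is that the cost hook receives the subtree, so size or root type can inform it:

```rust
// Sketch: a subtree-size-aware insertion/deletion cost, as an
// alternative to the fixed 1.0. The sqrt formula is illustrative only.
struct Node {
    children: Vec<Node>,
}

fn subtree_size(n: &Node) -> usize {
    1 + n.children.iter().map(subtree_size).sum::<usize>()
}

// Larger subtrees cost more to insert/delete, but sub-linearly, so a
// single big deletion can still beat many scattered small edits.
fn del_cost(n: &Node) -> f64 {
    (subtree_size(n) as f64).sqrt()
}

fn main() {
    let leaf = Node { children: vec![] };
    let small = Node { children: vec![Node { children: vec![] }] };
    assert_eq!(subtree_size(&small), 2);
    assert_eq!(del_cost(&leaf), 1.0); // a leaf still costs 1.0
    assert!(del_cost(&small) > del_cost(&leaf));
}
```

A cost keyed on the root node's type (e.g. making whitespace cheap to delete) would slot into the same hook.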
...so that Diffract can implement a number of different generation algorithms and the end-user can choose between them.
AST based differs have an interesting pathological case: if the two input files are of different file types and are parsed by different parsers, they will appear different EVEN IF the files are identical (up to their filenames). This is because AST differs match on the "type" of each AST node, and different parsers are likely to give nodes with the same content different types. An obvious case would be where a Java file has been copied and renamed as a text file.
At the moment, several places in the code base assume that node 0 is the root node of the AST. In fact, `INS`ert operations can insert a new root node during the edit script generation phase of the differ. The `INS` operation needs a specific test case for replacing the root, and components such as the GraphViz generator need careful thought.
At the moment, the Diffract repo contains a number of grammars that work with the lrpar library. Some of these are specific to Diffract (for example, grammars used in unit tests), but others are for "big" languages such as Java. For these, it makes little sense to keep the lex and yacc files here; it would be more sensible to add the grammars repo as a submodule. The code which determines which lexer and parser to use would have to be a little more complex, as it would have to search the grammars submodule for a suitable lex/yacc file, then fall back to searching through the files in this repo.
At the moment, node types (`Expr`, `Factor`, etc.) are stored as a generic type `T` with some trait bounds. In practice, the parser in the AST (which calls `lrpar`) stores those node types as `String`s. To save space, it would be preferable to store these as a `u16` or similar, and carry around a `HashMap` or a copy of the grammar, so that we can pretty-print node types where necessary.
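This is essentially string interning. A minimal sketch of the `u16`-plus-side-table idea (names and structure are assumptions for illustration):

```rust
use std::collections::HashMap;

// Sketch of interning node types as u16 ids, with a side table for
// pretty-printing, instead of storing a String in every node.
#[derive(Default)]
struct TypeInterner {
    ids: HashMap<String, u16>,
    names: Vec<String>, // names[id as usize] recovers the string
}

impl TypeInterner {
    fn intern(&mut self, name: &str) -> u16 {
        if let Some(&id) = self.ids.get(name) {
            return id; // already interned: reuse the id
        }
        let id = self.names.len() as u16;
        self.ids.insert(name.to_owned(), id);
        self.names.push(name.to_owned());
        id
    }

    // Used when pretty-printing node types.
    fn name(&self, id: u16) -> &str {
        &self.names[id as usize]
    }
}

fn main() {
    let mut interner = TypeInterner::default();
    let expr = interner.intern("Expr");
    let factor = interner.intern("Factor");
    assert_eq!(interner.intern("Expr"), expr); // same id on re-intern
    assert_ne!(expr, factor);
    assert_eq!(interner.name(factor), "Factor");
}
```

Each AST node then carries a 2-byte `u16` instead of a heap-allocated `String`, and comparisons become integer comparisons.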
The GumTree code uses a (cheap) test for isomorphism which compares hashes. If a hash collision occurs, the test will not be accurate. How often does this happen in practice?
In the Diffract code we also have an exact isomorphism test which could be used in place of the hash comparison, but presumably this is considerably more expensive (by how much?).
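A toy illustration of the two tests on simplified trees (this is not Diffract's or GumTree's actual hashing scheme): isomorphic trees always hash equal, but the converse can fail on a collision, which is exactly the accuracy question above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy trees: just a type string and children. Not Diffract's AST.
struct Node {
    ty: &'static str,
    children: Vec<Node>,
}

// Cheap test: a structural hash. Collisions are possible.
fn structural_hash(n: &Node) -> u64 {
    let mut h = DefaultHasher::new();
    n.ty.hash(&mut h);
    for c in &n.children {
        structural_hash(c).hash(&mut h);
    }
    h.finish()
}

// Expensive test: exact recursive isomorphism, never wrong.
fn isomorphic(a: &Node, b: &Node) -> bool {
    a.ty == b.ty
        && a.children.len() == b.children.len()
        && a.children.iter().zip(&b.children).all(|(x, y)| isomorphic(x, y))
}

fn main() {
    let t1 = Node { ty: "Expr", children: vec![Node { ty: "INT", children: vec![] }] };
    let t2 = Node { ty: "Expr", children: vec![Node { ty: "INT", children: vec![] }] };
    let t3 = Node { ty: "Expr", children: vec![] };
    // Isomorphic trees hash equal; equal hashes only *probably* mean
    // isomorphic, hence the question about collision frequency.
    assert_eq!(structural_hash(&t1), structural_hash(&t2));
    assert!(isomorphic(&t1, &t2));
    assert!(!isomorphic(&t1, &t3));
}
```

Instrumenting the differ to run both tests and count disagreements would answer both questions (collision rate and relative cost) empirically.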
Diffract uses a lot of memory: diffing the `Big.java` test case with itself results in an OOM error.
Each node in an AST holds a large amount of information. Some of this relates to where the original lexeme (if there is one) was located in the input file:
pub char_no: Option<usize>,
pub col_no: Option<usize>,
pub line_no: Option<usize>,
pub token_len: Option<usize>
These could be reduced to an offset and a token length, if the original input file was held in memory.
`Option<NodeId<U>>` types could be replaced with a bespoke type that holds a `None`/`Some` switch in the 64th bit of the representation.
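The bit-packing idea could be sketched as follows (illustrative only; `PackedNodeId` and its layout are assumptions, not Diffract code):

```rust
// Sketch: pack the None/Some switch into the top bit of a u64, so an
// optional node id costs 8 bytes instead of 16. Illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedNodeId(u64);

const NONE_BIT: u64 = 1 << 63;

impl PackedNodeId {
    fn none() -> Self {
        PackedNodeId(NONE_BIT)
    }

    fn some(id: u64) -> Self {
        debug_assert!(id & NONE_BIT == 0, "id must fit in 63 bits");
        PackedNodeId(id)
    }

    fn get(self) -> Option<u64> {
        if self.0 & NONE_BIT != 0 { None } else { Some(self.0) }
    }
}

fn main() {
    // The whole optional id fits in one machine word.
    assert_eq!(std::mem::size_of::<PackedNodeId>(), 8);
    assert_eq!(PackedNodeId::some(42).get(), Some(42));
    assert_eq!(PackedNodeId::none().get(), None);
}
```

Note that `Option<std::num::NonZeroU64>` already achieves the same 8-byte layout via Rust's niche optimisation, so if node ids could be made non-zero (e.g. numbered from 1), no bespoke type would be needed at all.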
To start with, the contents of each lexeme (`pub label: String`) could be stored as an offset into an array or vector.