softdevteam / diffract

A Rust implementation of the GumTree differ

License: Other
It would be useful (for demonstrations and debugging) to have a version of the calc
grammar with no whitespace, because the full trees are difficult to read.
Actions are currently implemented as structs, and an edit script is a struct which contains a vector of boxed objects implementing the `ApplyAction` trait. Possibly we could replace the action structs with enums, or it may be possible to replace the `Box` with a reference.
NB: https://stackoverflow.com/questions/48952627/implementing-partialeq-on-a-boxed-trait
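The enum alternative might look something like the sketch below. The variant payloads here are illustrative placeholders, not Diffract's actual action fields; the point is that a plain `enum` derives `PartialEq` for free, sidestepping the boxed-trait equality problem in the Stack Overflow link above.

```rust
// Sketch: replacing boxed action trait objects with a single enum.
// Variant payloads are illustrative, not Diffract's real fields.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Insert { node: usize, parent: usize, position: usize },
    Delete { node: usize },
    Update { node: usize, new_label: String },
    Move { node: usize, parent: usize, position: usize },
}

// An edit script becomes a plain Vec<Action>.
fn apply_all(script: &[Action]) {
    for action in script {
        match action {
            Action::Insert { node, parent, position } =>
                println!("INS node {} under {} at {}", node, parent, position),
            Action::Delete { node } => println!("DEL node {}", node),
            Action::Update { node, new_label } =>
                println!("UPD node {} -> {}", node, new_label),
            Action::Move { node, parent, position } =>
                println!("MOV node {} under {} at {}", node, parent, position),
        }
    }
}

fn main() {
    let script = vec![
        Action::Insert { node: 5, parent: 0, position: 2 },
        Action::Delete { node: 3 },
    ];
    // Equality on whole edit scripts now works without custom impls.
    assert_eq!(script, script.clone());
    apply_all(&script);
}
```

A trade-off of the enum approach is that adding a new action kind means touching every `match`, whereas the trait-object approach is open to extension.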
Currently, all nodes in ASTs have a unique identifier of type `NodeId`. In both the "from" and "to" ASTs, node ids are numbered from 0. Ideally, it would be better to use separate types for the different tree identifiers (e.g. `FromNodeId` and `ToNodeId`). This would help to avoid bugs where we have used a node from the "wrong" tree (several such bugs have already been fixed in this repo!).
/cc @k1502606
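A minimal sketch of the newtype pattern this suggests. The wrapped representation (`usize`) is an assumption; Diffract's `NodeId` may differ, but the principle is the same: mixing up the two tree types becomes a compile-time error.

```rust
// Sketch of the suggested newtype pattern. The usize representation
// is an assumption, not Diffract's actual NodeId.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FromNodeId(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ToNodeId(usize);

// This function can only ever be handed a node from the "from" tree;
// passing a ToNodeId here no longer compiles.
fn delete_from_tree(id: FromNodeId) -> usize {
    id.0
}

fn main() {
    let from = FromNodeId(0);
    let _to = ToNodeId(0);
    assert_eq!(delete_from_tree(from), 0);
    // delete_from_tree(_to); // <- compile error: mismatched types
}
```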
Currently, in integration tests such as `tests/test_edit_script.rs`, we have a boilerplate function which applies the tests, then a number of functions which apply the boilerplate to the input files. It would be nicer to glob the test files in the `tests/` directory (currently `*.calc`, `*.txt` and `*.java`) and apply the tests with a macro, as in https://stackoverflow.com/questions/34662713/how-can-i-create-parameterized-tests-in-rust/34666891#34666891
This needs a little thought: we would need to automatically select the correct lexer and parser based on the filename, as we do in `main` and `ast`. This code might be better factored out, since we already use it in two places in the code base.
Globbing requires an external crate: https://crates.io/crates/glob
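The "select the correct lexer and parser based on the filename" part could be as simple as a match on the file extension. A sketch (using `std::path` rather than the glob crate; the grammar file names returned here are placeholders, not Diffract's actual grammar files):

```rust
use std::path::Path;

// Sketch: pick a lexer/parser pair from the file extension.
// The grammar names are placeholders, not Diffract's actual files.
fn select_grammar(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "calc" => Some("calc.l / calc.y"),
        "txt" => Some("txt.l / txt.y"),
        "java" => Some("java.l / java.y"),
        _ => None, // unknown file type: no grammar available
    }
}

fn main() {
    assert_eq!(select_grammar("tests/add.calc"), Some("calc.l / calc.y"));
    assert_eq!(select_grammar("tests/Hello.txt"), Some("txt.l / txt.y"));
    assert_eq!(select_grammar("tests/unknown.rs"), None);
}
```

Factored out like this, the same function could serve `main`, `ast` and the globbing test macro.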
The terms from Chawathe et al., `label` and `value`, are confusing, and should be replaced by the more commonly used `type` and `value`. This is likely to be a surprisingly large refactoring.
For example, using: https://crates.io/crates/proptest to test the output of the edit script generator.
Windows tests failed because the newline endings (`\r\n`) differ from those on Linux (`\n`).
---- test_nested_comment_java stdout ----
thread 'test_nested_comment_java' panicked at 'assertion failed: `(left == right)`
left: `"^~\n ~\n COMMENT /*\n * // Single line comment nested in multi-line comment.\n */\n ~\n WHITESPACE \n\n ~\n goal\n compilation_unit\n package_declaration_opt\n import_declarations_opt\n type_declarations_opt\n type_declarations\n type_declaration\n class_declaration\n modifiers_opt\n modifiers\n modifier\n PUBLIC public\n ~\n WHITESPACE \n ~\n CLASS class\n ~\n WHITESPACE \n ~\n IDENTIFIER NestedComment\n ~\n WHITESPACE \n ~\n type_parameters_opt\n super_opt\n interfaces_opt\n class_body\n LBRACE {\n ~\n class_body_declarations_opt\n RBRACE }\n ~\n WHITESPACE \n\n ~\n"`,
right: `"^~\n ~\n COMMENT /*\r\n * // Single line comment nested in multi-line comment.\r\n */\n ~\n WHITESPACE \r\n\n ~\n goal\n compilation_unit\n package_declaration_opt\n import_declarations_opt\n type_declarations_opt\n type_declarations\n type_declaration\n class_declaration\n modifiers_opt\n modifiers\n modifier\n PUBLIC public\n ~\n WHITESPACE \n ~\n CLASS class\n ~\n WHITESPACE \n ~\n IDENTIFIER NestedComment\n ~\n WHITESPACE \n ~\n type_parameters_opt\n super_opt\n interfaces_opt\n class_body\n LBRACE {\n ~\n class_body_declarations_opt\n RBRACE }\n ~\n WHITESPACE \r\n\n ~\n"`', tests\test_ast.rs:60:4
The reason for having a `TemporaryMappingStore` is that the edit script generation algorithm needs to differentiate between the "original" mappings provided by the mapping algorithm and the new mappings added by the edit script generator itself. However, in our implementation of the `MappingStore` type, we keep an extra piece of metadata with each mapping. This metadata is used by the GraphViz output to show (as debug output) which part of the Diffract system contributed each mapping. Potentially, the edit script generator could use this metadata to decide whether a mapping existed in the original store or only in the overall store. Ideally, this would be encapsulated within the `MappingStore` API, and then the `TemporaryMappingStore` type could be deleted, in which case we would also avoid copying mappings to and from the temporary store.
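One way to encapsulate this could look like the following sketch. All names and the representation are assumptions for illustration, not Diffract's actual `MappingStore` API:

```rust
use std::collections::HashMap;

// Sketch: tag each mapping with its provenance, so the edit script
// generator can distinguish "original" matcher mappings from the ones
// it adds itself. Names/representation are assumptions, not Diffract's.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MappingSource {
    Matcher,    // added by the matching algorithm
    EditScript, // added by the edit script generator
}

#[derive(Default)]
struct MappingStore {
    // from-node id -> (to-node id, provenance)
    mappings: HashMap<usize, (usize, MappingSource)>,
}

impl MappingStore {
    fn push(&mut self, from: usize, to: usize, source: MappingSource) {
        self.mappings.insert(from, (to, source));
    }

    // Replaces the need for a separate TemporaryMappingStore: callers
    // ask whether a mapping pre-dates edit script generation.
    fn is_original(&self, from: usize) -> bool {
        matches!(self.mappings.get(&from),
                 Some((_, MappingSource::Matcher)))
    }
}

fn main() {
    let mut store = MappingStore::default();
    store.push(0, 0, MappingSource::Matcher);
    store.push(5, 7, MappingSource::EditScript);
    assert!(store.is_original(0));
    assert!(!store.is_original(5));
}
```

The same `MappingSource` tag could then drive the GraphViz debug colouring, so the two uses of the metadata share one mechanism.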
The file `Big.java` (24,597 lines), which is used in testing the Eco project, causes a stack overflow when diffed with diffract.
The isomorphism tests based on hashes are a transliteration of the GT code (based on Chilowicz et al. 2011). Some hashes (based on tree structure) are stored within AST nodes, but these need to be recomputed whenever the AST changes. This could be made more efficient, for example by storing a version number in each node and, whenever a node changes, incrementing its version and that of its ancestors. This would indicate which hashes need to be recomputed and which do not.
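The version-number idea above can be sketched as follows. The flat parent-vector representation is an assumption made for brevity, not Diffract's actual AST layout:

```rust
// Sketch of the version-number idea: when a node changes, bump its
// version and every ancestor's, marking their cached hashes stale.
// The flat parent-vector representation is an assumption.
struct Tree {
    parent: Vec<Option<usize>>, // parent[i] = parent of node i
    version: Vec<u64>,          // bumped when node i's subtree changes
    hashed_at: Vec<u64>,        // version at which the hash was cached
}

impl Tree {
    // Record that `node` changed: walk up to the root, bumping versions.
    fn touch(&mut self, mut node: usize) {
        loop {
            self.version[node] += 1;
            match self.parent[node] {
                Some(p) => node = p,
                None => break,
            }
        }
    }

    // A cached hash is stale iff the version has moved on since it
    // was computed; only stale hashes need recomputing.
    fn hash_is_stale(&self, node: usize) -> bool {
        self.version[node] != self.hashed_at[node]
    }
}

fn main() {
    // Tree: node 0 is the root; nodes 1 and 2 are its children.
    let mut t = Tree {
        parent: vec![None, Some(0), Some(0)],
        version: vec![0, 0, 0],
        hashed_at: vec![0, 0, 0],
    };
    t.touch(1); // node 1 changed: 1 and its ancestor 0 become stale
    assert!(t.hash_is_stale(1));
    assert!(t.hash_is_stale(0));
    assert!(!t.hash_is_stale(2)); // the untouched sibling stays valid
}
```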
This was discovered in PR #36, but reproducing it does not require the `-g` switch, so the bug must have been in the code base for some time.
Example input files:
$ cat tests/add.calc
1+2
$ cat tests/Hello.txt
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}
Example run:
$ diffract -a -m Myers tests/add.calc tests/Hello.txt
INFO:diffract: Selecting the Myers matching algorithm.
^~
~
Expr
Term
Factor
INT 1
~
PLUS +
~
Expr
Term
Factor
INT 2
~
WHITESPACE
~
document
paragraph
WORD public
SPACE
paragraph
WORD class
SPACE
paragraph
WORD Hello
SPACE
paragraph
WORD {
END
document
paragraph
SPACE
paragraph
WORD public
SPACE
paragraph
WORD static
SPACE
paragraph
WORD void
SPACE
paragraph
WORD main(String[]
SPACE
paragraph
WORD args)
SPACE
paragraph
WORD {
END
document
paragraph
SPACE
paragraph
WORD System.out.println("Hello,
SPACE
paragraph
WORD world!");
END
document
paragraph
SPACE
paragraph
WORD }
END
document
paragraph
WORD }
END
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /checkout/src/libcore/option.rs:335:20
stack backtrace:
0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::_print
at /checkout/src/libstd/sys_common/backtrace.rs:69
2: std::panicking::default_hook::{{closure}}
at /checkout/src/libstd/sys_common/backtrace.rs:58
at /checkout/src/libstd/panicking.rs:381
3: std::panicking::default_hook
at /checkout/src/libstd/panicking.rs:397
4: std::panicking::rust_panic_with_hook
at /checkout/src/libstd/panicking.rs:577
5: std::panicking::begin_panic
at /checkout/src/libstd/panicking.rs:538
6: std::panicking::begin_panic_fmt
at /checkout/src/libstd/panicking.rs:522
7: rust_begin_unwind
at /checkout/src/libstd/panicking.rs:498
8: core::panicking::panic_fmt
at /checkout/src/libcore/panicking.rs:71
9: core::panicking::panic
at /checkout/src/libcore/panicking.rs:51
10: <core::option::Option<T>>::unwrap
at /checkout/src/libcore/macros.rs:20
11: <diffract::matchers::MappingStore<T>>::generate_edit_script
at ./src/lib/matchers.rs:345
12: diffract::main
at src/main.rs:374
13: __rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:99
14: std::rt::lang_start
at /checkout/src/libstd/panicking.rs:459
at /checkout/src/libstd/panic.rs:361
at /checkout/src/libstd/rt.rs:59
15: main
16: __libc_start_main
In the Chawathe (1996) edit script generator we need to find the position of a child node with respect to its parent (i.e. "this is the third child of its parent"). In the `action` and `edit_script` modules these numbers are stored as `u16`, but it is worth thinking about whether this should be refactored as a `u32`.
In the Zhang-Shasha implementation here and in GT, the cost of inserting and deleting a subtree is fixed at 1.0 (the cost of updating is a qgram distance between the two strings). Is this really sensible, or would it be better to take into account the size and shape of the subtree being inserted/deleted? Or the type of the root node of the subtree?
These thoughts were partly inspired by this slide deck on Zhang-Shasha:
https://grfia.dlsi.ua.es/repositori/grfia/otros/JoseRect09032007.pdf
and these papers:
https://grfia.dlsi.ua.es/repositori/grfia/pubs/172/paper.pdf
https://grfia.dlsi.ua.es/repositori/grfia/pubs/342/distance-string-pfa.pdf
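One size-aware alternative to the fixed cost of 1.0 could be sketched like this. The formula is purely illustrative (nothing here is Diffract's or Zhang-Shasha's actual cost model); the point is that the cost hook receives the subtree, so size or root type can inform it:

```rust
// Sketch: a subtree-size-aware insertion/deletion cost, as an
// alternative to the fixed 1.0. The sqrt formula is illustrative only.
struct Node {
    children: Vec<Node>,
}

fn subtree_size(n: &Node) -> usize {
    1 + n.children.iter().map(subtree_size).sum::<usize>()
}

// Larger subtrees cost more to insert/delete, but sub-linearly, so a
// single big deletion can still beat many scattered small edits.
fn del_cost(n: &Node) -> f64 {
    (subtree_size(n) as f64).sqrt()
}

fn main() {
    let leaf = Node { children: vec![] };
    let small = Node { children: vec![Node { children: vec![] }] };
    assert_eq!(subtree_size(&small), 2);
    assert_eq!(del_cost(&leaf), 1.0); // a leaf still costs 1.0
    assert!(del_cost(&small) > del_cost(&leaf));
}
```

A cost keyed on the root node's type (e.g. making whitespace cheap to delete) would slot into the same hook.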
...so that Diffract can implement a number of different generation algorithms and the end-user can choose between them.
AST based differs have an interesting pathological case: if the two input files are of different file types and are parsed by different parsers, they will appear different EVEN IF the files are identical (up to their filenames). This is because AST differs match on the "type" of each AST node, and different parsers are likely to give nodes with the same content different types. An obvious case would be where a Java file has been copied and renamed as a text file.
At the moment, several places in the code base assume that node 0 is the root node of the AST. In fact, `INS`ert operations can insert a new root node during the edit script generation phase of the differ. The `INS` operation needs a specific test case for replacing the root, and components such as the GraphViz generator need careful thought.
At the moment, the Diffract repo contains a number of grammars that work with the lrpar library. Some of these are specific to Diffract (for example, grammars used in unit tests), but others are for "big" languages such as Java. For these, it makes little sense to keep the lex and yacc files here; it would be more sensible to add the grammars repo as a submodule. The code which determines which lexer and parser to use would have to be a little more complex, as it would have to search the grammars submodule for a suitable lex/yacc file, then fall back to searching through the files in this repo.
At the moment, node types (`Expr`, `Factor`, etc.) are stored as a generic type `T` with some trait bounds. In practice, the parser in the AST (which calls `lrpar`) stores those node types as `String`s. To save space, it would be preferable to store these as a `u16` or similar, and carry around a `HashMap` or a copy of the grammar, so that we can pretty-print node types where necessary.
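This is essentially string interning. A minimal sketch of the `u16`-plus-side-table idea (names and structure are assumptions for illustration):

```rust
use std::collections::HashMap;

// Sketch of interning node types as u16 ids, with a side table for
// pretty-printing, instead of storing a String in every node.
#[derive(Default)]
struct TypeInterner {
    ids: HashMap<String, u16>,
    names: Vec<String>, // names[id as usize] recovers the string
}

impl TypeInterner {
    fn intern(&mut self, name: &str) -> u16 {
        if let Some(&id) = self.ids.get(name) {
            return id; // already interned: reuse the id
        }
        let id = self.names.len() as u16;
        self.ids.insert(name.to_owned(), id);
        self.names.push(name.to_owned());
        id
    }

    // Used when pretty-printing node types.
    fn name(&self, id: u16) -> &str {
        &self.names[id as usize]
    }
}

fn main() {
    let mut interner = TypeInterner::default();
    let expr = interner.intern("Expr");
    let factor = interner.intern("Factor");
    assert_eq!(interner.intern("Expr"), expr); // same id on re-intern
    assert_ne!(expr, factor);
    assert_eq!(interner.name(factor), "Factor");
}
```

Each AST node then carries a 2-byte `u16` instead of a heap-allocated `String`, and comparisons become integer comparisons.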
The GumTree code uses a (cheap) test for isomorphism which compares hashes. If a hash collision occurs, the test will not be accurate. How often does this happen in practice?
In the Diffract code we also have an exact isomorphism test which could be used in place of the hash comparison, but presumably this is considerably more expensive (by how much?).
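A toy illustration of the two tests on simplified trees (this is not Diffract's or GumTree's actual hashing scheme): isomorphic trees always hash equal, but the converse can fail on a collision, which is exactly the accuracy question above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy trees: just a type string and children. Not Diffract's AST.
struct Node {
    ty: &'static str,
    children: Vec<Node>,
}

// Cheap test: a structural hash. Collisions are possible.
fn structural_hash(n: &Node) -> u64 {
    let mut h = DefaultHasher::new();
    n.ty.hash(&mut h);
    for c in &n.children {
        structural_hash(c).hash(&mut h);
    }
    h.finish()
}

// Expensive test: exact recursive isomorphism, never wrong.
fn isomorphic(a: &Node, b: &Node) -> bool {
    a.ty == b.ty
        && a.children.len() == b.children.len()
        && a.children.iter().zip(&b.children).all(|(x, y)| isomorphic(x, y))
}

fn main() {
    let t1 = Node { ty: "Expr", children: vec![Node { ty: "INT", children: vec![] }] };
    let t2 = Node { ty: "Expr", children: vec![Node { ty: "INT", children: vec![] }] };
    let t3 = Node { ty: "Expr", children: vec![] };
    // Isomorphic trees hash equal; equal hashes only *probably* mean
    // isomorphic, hence the question about collision frequency.
    assert_eq!(structural_hash(&t1), structural_hash(&t2));
    assert!(isomorphic(&t1, &t2));
    assert!(!isomorphic(&t1, &t3));
}
```

Instrumenting the differ to run both tests and count disagreements would answer both questions (collision rate and relative cost) empirically.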
Diffract uses a lot of memory: diffing the `Big.java` test case with itself results in an OOM error.
Each node in an AST holds a large amount of information. Some of this relates to where the original lexeme (if there is one) was located in the input file:
pub char_no: Option<usize>,
pub col_no: Option<usize>,
pub line_no: Option<usize>,
pub token_len: Option<usize>
These could be reduced to an offset and a token length, if the original input file was held in memory.
`Option<NodeId<U>>` types could be replaced with a bespoke type that holds a `None`/`Some` switch in the 64th bit of the representation.
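The bit-packing idea could be sketched as follows (illustrative only; `PackedNodeId` and its layout are assumptions, not Diffract code):

```rust
// Sketch: pack the None/Some switch into the top bit of a u64, so an
// optional node id costs 8 bytes instead of 16. Illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedNodeId(u64);

const NONE_BIT: u64 = 1 << 63;

impl PackedNodeId {
    fn none() -> Self {
        PackedNodeId(NONE_BIT)
    }

    fn some(id: u64) -> Self {
        debug_assert!(id & NONE_BIT == 0, "id must fit in 63 bits");
        PackedNodeId(id)
    }

    fn get(self) -> Option<u64> {
        if self.0 & NONE_BIT != 0 { None } else { Some(self.0) }
    }
}

fn main() {
    // The whole optional id fits in one machine word.
    assert_eq!(std::mem::size_of::<PackedNodeId>(), 8);
    assert_eq!(PackedNodeId::some(42).get(), Some(42));
    assert_eq!(PackedNodeId::none().get(), None);
}
```

Note that `Option<std::num::NonZeroU64>` already achieves the same 8-byte layout via Rust's niche optimisation, so if node ids could be made non-zero (e.g. numbered from 1), no bespoke type would be needed at all.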
To start with, the contents of each lexeme (`pub label: String`) could be stored as an offset into an array or vector.