daff's People

Contributors

edwindj, gwarnes-mdsol, jeroen, salim-b, warnes

daff's Issues

Implement C++ version of daff

The original daff is written in the programming language Haxe, which can target many programming languages.

Currently the R package daff uses the JavaScript version of daff. Haxe, however, can also generate C++, so the library could be built with the C++ target instead. I don't know whether or how the C++ version of daff would compile as an R package; that remains to be investigated.

e is locked

When I run, I get

Error in e$ctx <<- V8::new_context("window") : 
  cannot change value of locked binding for 'e'

Should you use assign instead?

e <- new.env()

get_context <- function(){
  if (is.null(e$ctx)){
    assign("ctx",V8::new_context("window"),envir=e )
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}

Or should e be a list rather than an environment, like this?

e <- list()

get_context <- function(){
  if (is.null(e$ctx)){
    e$ctx <- V8::new_context("window")
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}

HTML rendering does not escape passed data

Hello and thanks for this awesome project.

I have been using daff like so:

var data1 = ...;
var data2 = ...;

// Wrap into tables
var data1_table = new daff.TableView(data1);
var data2_table = new daff.TableView(data2);

// Calculate alignment
var alignment = daff.compareTables(data1_table, data2_table).align();

// Produce diff
var data_diff = [];
var table_diff = new daff.TableView(data_diff);

// Set diff options
var flags = new daff.CompareFlags();
    flags.always_show_header = false;
    flags.ordered = false;
    flags.show_unchanged_columns = true;
    flags.unchanged_column_context = 0;
    flags.unchanged_context = 0;
var highlighter = new daff.TableDiff(alignment,flags);
highlighter.hilite(table_diff);

if (table_diff.data.length === 0) {
    return;
}

// Get HTML
var diff2html = new daff.DiffRender();
diff2html.render(table_diff);
var table_diff_html = diff2html.html();

table_diff_html contains unescaped data. For example, if data2 has a field that contains <foo>, that part of the field is never displayed to the user, because the browser treats it as a tag.

Could it be that I'm not calling something properly? Perhaps it is here that escaping ought to be done?
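
One possible stopgap, as long as the renderer does not escape cell contents itself, is to escape string cells before wrapping the data in a TableView. The sketch below is plain JavaScript with no daff dependency. escapeHtml and escapeTable are hypothetical helper names, not part of daff's API, and the sketch assumes the data is an array of row arrays like the data1 and data2 passed to daff.TableView above. If daff's renderer ever starts escaping on its own, this step would need to be dropped to avoid double escaping.

```javascript
// Hypothetical helpers (not part of daff): escape HTML-special characters
// in string cells before the table is handed to daff.
function escapeHtml(value) {
  // Only strings can carry markup; numbers, null, etc. pass through unchanged.
  if (typeof value !== "string") {
    return value;
  }
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

function escapeTable(rows) {
  // Builds a new array of rows; the input table is left untouched.
  return rows.map(function (row) {
    return row.map(escapeHtml);
  });
}

// Illustrative data only.
var rawData2 = [["id", "note"], [1, "contains <foo> & more"]];
var safeData2 = escapeTable(rawData2);
console.log(safeData2[1][1]);
```

The escaped copy (safeData2 here) would then be the one wrapped in daff.TableView, so the generated HTML shows the literal text instead of dropping it.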

inconsistent behavior when replacing values with NA

Hey there,

when I replace three values in the iris dataset with NA and daff it, this is translated into a patch with two value replacements and one row replacement.

> iris2 <- iris
> iris2[1:3,1] <- NA
> diff_data(iris, iris2)
Daff Comparison: 'iris' vs. 'iris2'
First 6 and last 6 patch lines:
    @@ Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   ->      5.1->NA         3.5          1.4         0.2  setosa
2   ->      4.9->NA           3          1.4         0.2  setosa
3  +++           NA         3.2          1.3         0.2  setosa
4  ---          4.7         3.2          1.3         0.2  setosa
5               4.6         3.1          1.5         0.2  setosa
6  ...          ...         ...          ...         ...     ...
7  ...          ...         ...          ...         ...     ...
8   ->      5.1->NA         3.5          1.4         0.2  setosa
9   ->      4.9->NA           3          1.4         0.2  setosa
10 +++           NA         3.2          1.3         0.2  setosa
11 ---          4.7         3.2          1.3         0.2  setosa
12              4.6         3.1          1.5         0.2  setosa
13 ...          ...         ...          ...         ...     ...

I'd expect the same patch notation for the three cases, but I could be missing something.

data_diff objects fail on round trip to Rdata files

I found the following behavior surprising:

library(daff)
example(diff_data)  # creates the "dd" patch

save(dd, file = "patch_loading.Rdata")
load("patch_loading.Rdata")

dd
## Error in context_eval(join(src), private$context) : 
##  Context has been disposed.

Almost all R objects can be safely written to Rdata or RDS files. Is this failure expected behavior?

If this is intended, I can contribute a documentation patch to diff_data and write_diff so people at least know to expect this -- it caught me off guard since R objects are usually self-contained and Rdata/RDS files are a big part of my workflow.

Add htmlwidgets

An htmlwidgets version of daff would allow data_diff output to be used in Shiny and R Markdown.

Why are matching (white) rows displayed in render_diff?

From what I understand in the color coding, the color white represents rows which remain unchanged from the source to the target. Why are these displayed by render_diff? And is it possible to not display these rows?

If I select some of these unchanged rows and use them in diff_data, I get an empty result, which is correct, so I don't understand why they're being displayed when they're part of a larger data frame.
Thanks

Handle NAs

Hi,

Thank you for this very useful package!

I tried the following code:

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width[2] <- NA
> y
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> x
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9          NA          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> patch <- diff_data(y, x)
> render_diff(patch)
@@  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    5.1           3.5          1.4           0.2          setosa
->  4.9           3->1.4       1.4->0.2      0.2->setosa  setosa->NULL
    4.7           3.2          1.3           0.2          setosa

Is it expected?

Best,
David

Encoding Issue with Windows and Daff

I am experiencing encoding problems which I believe are caused by the combination of Windows and the way render_diff generates the HTML file.

render_diff appears to declare UTF-8 as the encoding of the generated HTML file by default. On Windows machines this does not work, because the data written to disk is encoded as ANSI.

My workaround has been to change the encoding details in the html-file:

render_diff(changes, file = write_file, view = FALSE)

# reopen file and replace encoding details
x <- readLines(write_file)
y <- gsub( "<meta charset='utf-8'>", "<meta charset='ANSI'>", x )
cat(y, file=write_file, sep="\n")

While it works, it might be nice to have this fixed, since daff is very useful. I have not been able to identify the exact location of the bug, hence this description.

Here is an example on Windows (Daff v0.3.0):

df1 <- data.frame(x = "ä", y = "è")
df2 <- data.frame(x = "ö", y = "è")
diff <- diff_data(df1, df2)
render_diff(diff)


Daff doesn't correctly detect added columns that share the same name

Example:

> iris2 <- data.frame(iris, iris, iris, check.names = FALSE)
> d <- diff_data(iris, iris2)
> d
Daff Comparison: ‘iris’ vs. ‘iris2’ 
  First 6 and last 6 patch lines:
      !       ---
1    @@   Species
2   +++      <NA>
3   +++      <NA>
4   +++      <NA>
5   +++      <NA>
6   +++      <NA>
... ...       ...
296 --- virginica
297 --- virginica
298 --- virginica
299 --- virginica
300 --- virginica
301 --- virginica

> summary(d)

Data diff:
 Comparison: ‘iris’ vs. ‘iris2’ 
        #        Modified Reordered Deleted Added
Rows    150      0        0         150     150  
Columns 5 --> 15 0        0         5       0    

Proposed solution:
diff_data should check for duplicated column names. If any are found, it should call make.unique and generate a warning.
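
The proposed check can be sketched as follows. This is JavaScript, chosen for consistency with the library daff runs on, and it only loosely imitates the suffixing behaviour of R's make.unique (.1, .2, ...); it is not how the package does or would implement it. findDuplicates and makeUniqueNames are hypothetical names.

```javascript
// Hypothetical helpers: detect duplicated column names and de-duplicate them.
function findDuplicates(names) {
  var seen = new Set();
  var dups = new Set();
  names.forEach(function (name) {
    if (seen.has(name)) {
      dups.add(name);
    }
    seen.add(name);
  });
  return Array.from(dups);
}

function makeUniqueNames(names) {
  var used = new Set(names);  // every original name is reserved up front
  var firstSeen = new Set();  // names whose first occurrence was already kept
  var counters = new Map();   // last suffix tried per base name
  return names.map(function (name) {
    if (!firstSeen.has(name)) {
      firstSeen.add(name);
      return name;            // the first occurrence keeps its name
    }
    var k = counters.get(name) || 0;
    var candidate;
    do {
      k += 1;
      candidate = name + "." + k;
    } while (used.has(candidate)); // skip suffixes that would collide
    counters.set(name, k);
    used.add(candidate);
    return candidate;
  });
}

// Illustrative column names only.
var columns = ["Species", "Sepal.Length", "Species", "Species"];
var duplicated = findDuplicates(columns);
if (duplicated.length > 0) {
  console.warn("Duplicated column names: " + duplicated.join(", "));
}
var uniqueColumns = makeUniqueNames(columns);
console.log(uniqueColumns);
```

Reserving all original names first means a generated suffix never shadows a column that already exists under that name.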

Error: "Ordering took too long, something went wrong" when calling diff_data with large dataset inputs

First of all, thank you for this fantastic package. I've used it successfully for several tasks, and have only recently come upon this issue.

Issue:

I get a cryptic message that pops up when I call diff_data() on two large datasets (one has 41,222 observations and 23 variables, and the other has 32,077 observations and 21 variables). Note that if I use dplyr::slice to chop each dataset to 10,000 rows each, the same error occurs.

Attempted traceback:

I was able to use debugonce(diff_data) to step through the function and believe I have identified the source. The error appears to occur during the call ctx$assign(tv_diff$var_name, JS(diff)) (which I'm aware comes from the V8 package, so please let me know if I ought to cross-post this issue in that repo's tracker as well or instead).

I step into ctx$assign(tv_diff$var_name, JS(diff)), which looks like this...

function (name, value, auto_unbox = TRUE, ...) 
{
  stopifnot(is.character(name))
  obj <- if (inherits(value, "JS_EVAL")) {
    invisible(this$eval(paste("var", name, "=", value)))
  }
  else {
    invisible(this$eval(paste("var", name, "=", toJSON(value, 
      auto_unbox = auto_unbox, ...))))
  }
}

... and then have no problems until I step into invisible(this$eval(paste("var", name, "=", value))). It looks like the error ultimately occurs in paste("var", name, "=", value), which calls .Internal(paste(list(...), sep, collapse)).

With all of the functions that I stepped through, I'm not sure if an issue with one or more functions in the daff package and/or the V8 package caused the base system to hit a fault or if it's a problem with my own data (which is sensitive and unable to be shared). Apologies for the exceedingly circuitous traceback and lack of a reproducible example.

feature request: summary

It would be amazing if there were a summary method for diff_data objects providing a tally of differences in terms of cells, columns, and rows (possibly % difference), etc.

# from README
library(daff)
y <- iris[1:3,]
x <- y

x <- head(x,2) # remove a row
x[1,1] <- 10 # change a value
x$hello <- "world"  # add a column
x$Species <- NULL # remove a column

# modified
x <- rbind(x, c(3, 3, 3, 3, "test"))
x <- x[-2,]

patch <- diff_data(y, x)

changes <- length(grep("->", unlist(patch$get_data())))
col_added <- length(which(names(patch$get_data()) == "+++"))
col_rmd <- length(which(names(patch$get_data()) == "---"))
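
The counting idea above could be generalized by walking the diff table row by row and reading the action tag in the first cell. The following is a rough JavaScript sketch, not daff's API: summarizeDiff is a hypothetical name, and it assumes the diff is an array of row arrays in the layout seen in the outputs quoted in this tracker ('@@' header row, '!' row marking added or removed columns, '+++' added row, '---' removed row, an arrow such as '->' for a modified row). Any other tags are ignored, and the example table is hand-written for illustration, not real daff output.

```javascript
// Hypothetical helper: tally changes in a highlighter-style diff table.
function summarizeDiff(rows) {
  var summary = {
    rowsAdded: 0, rowsRemoved: 0, rowsModified: 0,
    colsAdded: 0, colsRemoved: 0, cellsModified: 0
  };
  rows.forEach(function (row) {
    var tag = String(row[0]);
    var cells = row.slice(1);
    if (tag === "!") {
      // Schema row: one marker per column.
      cells.forEach(function (cell) {
        if (cell === "+++") summary.colsAdded += 1;
        if (cell === "---") summary.colsRemoved += 1;
      });
    } else if (tag === "+++") {
      summary.rowsAdded += 1;
    } else if (tag === "---") {
      summary.rowsRemoved += 1;
    } else if (/^-+>$/.test(tag)) {
      // Modified row: changed cells contain the same arrow as the tag.
      summary.rowsModified += 1;
      cells.forEach(function (cell) {
        if (String(cell).indexOf(tag) !== -1) summary.cellsModified += 1;
      });
    }
  });
  return summary;
}

// Hand-written example table (assumed layout, not actual daff output).
var exampleDiff = [
  ["!",   "",      "---", "+++"],
  ["@@",  "a",     "b",   "c"],
  ["->",  "1->10", "x",   "new"],
  ["---", "2",     "y",   ""],
  ["+++", "3",     "",    "new"]
];
var diffSummary = summarizeDiff(exampleDiff);
console.log(diffSummary);
```

Percentages could be derived from these counts once the total number of rows and columns in the source table is known.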

Handle class

Hi,

This is not really a bug but a feature request: is daff able to detect class changes?

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))
@@Sepal.LengthSepal.Width...

Thanks!
David

Issue with modified or inserted/deleted row

Hi, thank you for the very useful package.
Within the same comparison, changed rows are sometimes reported as modified rows and sometimes as a pair of inserted/deleted rows.
This is strange, because the format of the key columns is the same and I can't find any difference in the key columns.
Thanks

CRAN

Will daff be available on CRAN again at some point?

diff_data detects changes in primary keys, even if ids is specified

Maybe I misunderstand the role of the ids parameter, but records with different primary keys should be treated separately.
Instead, it seems that diff_data detects changes, even if the key is forced.

df_ref <- tibble::tribble(
   ~a, ~b, ~key,
   1, "test1", "key_001",    # only in ref
   2, "test2", "key_002",
   3, "test3", "key_003",
)

df_current <- tibble::tribble(
   ~a, ~b, ~key,
   2, "TEST2", "key_002",    # non-key change
   3, "test3", "KEY_003",    # key change
   4, "test4", "key_004",    # only in current
)


diff_structure <- daff::diff_data(
   data_ref = df_ref,
   data = df_current,
   ids = "key",
   ordered = FALSE
)

diff_structure
#> Daff Comparison: 'df_ref' vs. 'df_current' 
#>     a b            key             
#> --- 1 test1        key_001         
#> ->  2 test2->TEST2 key_002         
#> ->  3 test3        key_003->KEY_003
#> +++ 4 test4        key_004

I would have expected something like:

#> ---  3 test3        key_003
#> +++  3 test3        KEY_003

Even if I compare without the ID, I get the same result:

diff_structure_no_id <- daff::diff_data(
   data_ref = df_ref,
   data = df_current,
   ordered = FALSE
)

diff_structure_no_id
#> Daff Comparison: 'df_ref' vs. 'df_current' 
#>     a b            key             
#> --- 1 test1        key_001         
#> ->  2 test2->TEST2 key_002         
#> ->  3 test3        key_003->KEY_003
#> +++ 4 test4        key_004

Created on 2021-04-14 by the reprex package (v2.0.0)

(I am using the latest daff on CRAN, 0.3.5)

Thanks!

`render_diff(use.DataTables = TRUE)` broken

The default setting automatically sorts by the first column. This completely distorts the view, because placeholder rows come first, then all changes or deleted rows, then all new rows.

I can work around this with use.DataTables = FALSE for my use case.

Do we need to add an artificial rowid column?
