edwindj / daff

Diff, patch and merge for data.frames, see http://paulfitz.github.io/daff/

Home Page: https://edwindj.github.io/daff/

License: Other

R 84.50% JavaScript 4.31% HTML 11.19%

daff data diff r

daff's Issues

Implement C++ version of daff

The original daff is written in the programming language Haxe, which can target many programming languages.

Currently the R package daff uses the JavaScript version of daff. Haxe, however, can also generate C++, so the library could also be built from the C++ target. I don't know if or how the C++ version of daff would compile as an R package; that remains to be found out.

Add tolerance to comparisons of numeric values

Hi,

Sometimes a difference is just a rounding issue and should be ignored.

Something like the following:
https://github.com/alexsanjoseph/compareDF
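
Where no tolerance option is available, one possible workaround is to round the numeric columns of both data frames before diffing, so that differences below the chosen precision disappear. A minimal sketch in base R; `round_numeric` and its `digits` default are hypothetical and not part of daff:

```r
# Hypothetical helper (not part of daff): round every numeric column
# so that differences below the chosen precision are not reported.
round_numeric <- function(df, digits = 6) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], round, digits = digits)
  df
}

a <- data.frame(id = c("x", "y"), value = c(1.0000001, 2.5))
b <- data.frame(id = c("x", "y"), value = c(1.0000002, 2.5))

identical(a, b)                                # FALSE
identical(round_numeric(a), round_numeric(b))  # TRUE

# The rounded frames could then be compared, for example with
# daff::diff_data(round_numeric(a), round_numeric(b))
```

Rounding is not a true tolerance: two values that straddle a rounding boundary still show up as different even when they are very close.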

e is locked

When I run, I get

Error in e$ctx <<- V8::new_context("window") : 
  cannot change value of locked binding for 'e'

Should you use assign instead?

e <- new.env()

get_context <- function(){
  if (is.null(e$ctx)){
    assign("ctx",V8::new_context("window"),envir=e )
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}

Or should e be a list rather than an environment, like this?

e <- list()

get_context <- function(){
  if (is.null(e$ctx)){
    e$ctx <- V8::new_context("window")
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}
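
The error can be reproduced without V8. The sketch below, in base R, mimics a package namespace with a locked binding; the names `ns`, `bad`, and `good` are illustrative only. It suggests that the culprit is `<<-`, which tries to rebind `e` itself, whereas `assign(..., envir = e)` only changes what the environment contains. A list would probably not help: inside a function, `e$ctx <-` on a list modifies a local copy, so the context would be recreated on every call.

```r
ns <- new.env()
ns$e <- new.env()
lockBinding("e", ns)   # package namespaces lock their bindings once loaded

bad  <- function() e$ctx <<- "from bad"                      # rebinds 'e'
good <- function() assign("ctx", "from good", envir = e)     # fills 'e' only
environment(bad)  <- ns
environment(good) <- ns

res <- try(bad(), silent = TRUE)
inherits(res, "try-error")   # TRUE: the locked binding for 'e' cannot change

good()
ns$e$ctx                     # "from good"
```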

"Primary Key Id" issue

Thanks for the reminder: will look into it next week!

Originally posted by @edwindj in #31 (comment)

Lines printed twice in case of a small number of patch lines

A detail, really, but when printing diff_data output, I get the first six and the last six lines. This prints some lines twice when there are fewer than 12 patch lines.
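
One way to avoid the overlap is to print everything when the patch has no more lines than the head and tail together would show. A base R sketch; `head_tail` is a hypothetical helper and not daff's actual print method:

```r
# Hypothetical helper: return the rows to display, each at most once.
head_tail <- function(df, n = 6) {
  if (nrow(df) <= 2 * n) {
    return(df)                     # short input: show every row once
  }
  rbind(head(df, n), tail(df, n))  # long input: first n and last n rows
}

nrow(head_tail(data.frame(x = 1:8)))    # 8
nrow(head_tail(data.frame(x = 1:100)))  # 12
```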

HTML rendering does not escape passed data

Hello and thanks for this awesome project.

I have been using daff like so:

var data1 = ...;
var data2 = ...;

// Wrap into tables
var data1_table = new daff.TableView(data1);
var data2_table = new daff.TableView(data2);

// Calculate alignment
var alignment = daff.compareTables(data1_table, data2_table).align();

// Produce diff
var data_diff = [];
var table_diff = new daff.TableView(data_diff);

// Set diff options
var flags = new daff.CompareFlags();
    flags.always_show_header = false;
    flags.ordered = false;
    flags.show_unchanged_columns = true;
    flags.unchanged_column_context = 0;
    flags.unchanged_context = 0;
var highlighter = new daff.TableDiff(alignment,flags);
highlighter.hilite(table_diff);

if (table_diff.data.length === 0) {
    return;
}

// Get HTML
var diff2html = new daff.DiffRender();
diff2html.render(table_diff);
var table_diff_html = diff2html.html();

table_diff_html contains unescaped data. For example if data2 has a field that contains <foo>, that part of the field is never displayed to the user.

Could it be that I'm not calling something properly? Perhaps it is here that escaping ought to be done?

Implement `ids` and `ignore` columns

Allow for explicitly defining which columns are identifying and which should be ignored in the comparison.
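
Where an `ignore` argument is not available, ignored columns can be dropped from both data frames before comparing. A minimal base R sketch; `drop_ignored` and the column names are hypothetical:

```r
# Hypothetical helper: remove the columns that should not take part in the diff.
drop_ignored <- function(df, ignore = character()) {
  df[, setdiff(names(df), ignore), drop = FALSE]
}

old_df <- data.frame(id = 1:2, value = c("a", "b"), updated = c("mon", "tue"))
new_df <- data.frame(id = 1:2, value = c("a", "b"), updated = c("wed", "thu"))

identical(drop_ignored(old_df, "updated"),
          drop_ignored(new_df, "updated"))   # TRUE
```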

inconsistent behavior when replacing values with NA

Hey there,

when I replace three values in the iris dataset with NA and daff it, this is translated into a patch with two value replacements and one row replacement.

> iris2 <- iris
> iris2[1:3,1] <- NA
> diff_data(iris, iris2)
Daff Comparison: ‘iris’ vs. ‘iris2’ 
  First 6 and last 6 patch lines:
    @@ Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   ->      5.1->NA         3.5          1.4         0.2  setosa
2   ->      4.9->NA           3          1.4         0.2  setosa
3  +++           NA         3.2          1.3         0.2  setosa
4  ---          4.7         3.2          1.3         0.2  setosa
5               4.6         3.1          1.5         0.2  setosa
6  ...          ...         ...          ...         ...     ...
7  ...          ...         ...          ...         ...     ...
8   ->      5.1->NA         3.5          1.4         0.2  setosa
9   ->      4.9->NA           3          1.4         0.2  setosa
10 +++           NA         3.2          1.3         0.2  setosa
11 ---          4.7         3.2          1.3         0.2  setosa
12              4.6         3.1          1.5         0.2  setosa
13 ...          ...         ...          ...         ...     ...

I'd expect the same patch notation for the three cases, but I could be missing something.

data_diff objects fail on round trip to Rdata files

I found the following behavior surprising:

library(daff)
example(diff_data)  # creates the "dd" patch

save(dd, file = "patch_loading.Rdata")
load("patch_loading.Rdata")

dd
## Error in context_eval(join(src), private$context) : 
##  Context has been disposed.

Almost all R objects can be safely written to Rdata or RDS files. Is this failure expected behavior?

If this is intended, I can contribute a documentation patch to diff_data and write_diff so people at least know to expect this -- it caught me off guard since R objects are usually self-contained and Rdata/RDS files are a big part of my workflow.

Add htmlwidgets

An htmlwidgets version of daff will allow for data_diff in shiny and rmarkdown.

Why are matching (white) rows displayed in render_diff?

From what I understand in the color coding, the color white represents rows which remain unchanged from the source to the target. Why are these displayed by render_diff? And is it possible to not display these rows?

If I select some of these unchanged rows and use them in diff_data, I get an empty result, which is correct, so I don't understand why they are displayed when they're part of a larger data frame.
Thanks

Activate id parameter in diff_data function

Could you please help me or suggest a workaround for using ID as the primary key when matching rows for differences?

Handle NAs

Hi,

Thank you for this very useful package!

I tried the following code:

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width[2] <- NA
> y
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> x
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9          NA          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> patch <- diff_data(y, x)
> render_diff(patch)

@@	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
	5.1	3.5	1.4	0.2	setosa
->	4.9	3->1.4	1.4->0.2	0.2->setosa	setosa->NULL
	4.7	3.2	1.3	0.2	setosa

Is it expected?

Best,
David

Encoding Issue with Windows and Daff

I am experiencing encoding problems which I believe are a combined issue of Windows and render_diff's way of generating the html file.

render_diff seems to be generating the html file by setting the encoding to utf-8 as standard. On windows machines, this does not seem to work, as the data written to the disk is encoded as ANSI.

My workaround has been to change the encoding details in the html-file:

render_diff(changes, file = write_file, view = FALSE)

# reopen file and replace encoding details
x <- readLines(write_file)
y <- gsub( "<meta charset='utf-8'>", "<meta charset='ANSI'>", x )
cat(y, file=write_file, sep="\n")

While it works, it might be nice to have this fixed, since daff is very useful. I have not been able to identify the exact location of the bug, hence this description.

Here is an example on Windows (Daff v0.3.0):

df1 <- data.frame(x = "ä", y = "è")
df2 <- data.frame(x = "ö", y = "è")
diff <- diff_data(df1, df2)
render_diff(diff)

daff removed from CRAN

I don't normally raise these kinds of point-out-the-obvious issues, but this one is a bit strange because the last fail log of daff appeared to be passing with the normal suite of tests: https://cran-archive.r-project.org/web/checks/2019-05-13_check_results_daff.html

Daff doesn't correctly detect added columns that share the same name

Example:

> iris2 <- data.frame(iris, iris, iris, check.names = FALSE)
> d <- diff_data(iris, iris2)
> d
Daff Comparison: ‘iris’ vs. ‘iris2’ 
  First 6 and last 6 patch lines:
      !       ---
1    @@   Species
2   +++      <NA>
3   +++      <NA>
4   +++      <NA>
5   +++      <NA>
6   +++      <NA>
... ...       ...
296 --- virginica
297 --- virginica
298 --- virginica
299 --- virginica
300 --- virginica
301 --- virginica

> summary(d)

Data diff:
 Comparison: ‘iris’ vs. ‘iris2’ 
        #        Modified Reordered Deleted Added
Rows    150      0        0         150     150  
Columns 5 --> 15 0        0         5       0

Proposed solution:
diff_data should check for duplicated column names. If any are found, it should call make.unique and generate a warning.
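
The proposed check could look roughly like the following base R sketch; `unique_names` is a hypothetical pre-processing helper, not an existing daff function:

```r
# Hypothetical pre-check: warn about duplicated column names and rename them.
unique_names <- function(df) {
  nms <- names(df)
  if (anyDuplicated(nms) > 0) {
    warning("duplicated column names found; renaming with make.unique()")
    names(df) <- make.unique(nms)
  }
  df
}

iris2 <- data.frame(iris, iris, check.names = FALSE)
fixed <- suppressWarnings(unique_names(iris2))

names(fixed)[6]              # "Sepal.Length.1"
anyDuplicated(names(fixed))  # 0
```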

Error: "Ordering took too long, something went wrong" when calling diff_data with large dataset inputs

First of all, thank you for this fantastic package. I've used it successfully for several tasks, and have only recently come upon this issue.

Issue:

I get a cryptic message that pops up when I call diff_data() on two large datasets (one has 41,222 observations and 23 variables, and the other has 32,077 observations and 21 variables). Note that if I use dplyr::slice to chop each dataset to 10,000 rows each, the same error occurs.

Attempted traceback:

I was able to use debugonce(diff_data) to step through the function and believe I have identified the source. It appears to come during the call ctx$assign(tv_diff$var_name, JS(diff)) (which of course I'm aware comes from the V8 package, so please let me know if I ought to cross-post this issue in that repo's tracker as well or instead).

I step into ctx$assign(tv_diff$var_name, JS(diff)), which looks like this...

function (name, value, auto_unbox = TRUE, ...) 
{
  stopifnot(is.character(name))
  obj <- if (inherits(value, "JS_EVAL")) {
    invisible(this$eval(paste("var", name, "=", value)))
  }
  else {
    invisible(this$eval(paste("var", name, "=", toJSON(value, 
      auto_unbox = auto_unbox, ...))))
  }
}

... and then have no problems until I step into invisible(this$eval(paste("var", name, "=", value))). It looks like the error ultimately occurs in paste("var", name, "=", value), which calls .Internal(paste(list(...), sep, collapse)).

With all of the functions that I stepped through, I'm not sure if an issue with one or more functions in the daff package and/or the V8 package caused the base system to hit a fault or if it's a problem with my own data (which is sensitive and unable to be shared). Apologies for the exceedingly circuitous traceback and lack of a reproducible example.

feature request: summary

It would be amazing if there were a summary method for diff_data objects providing a tally of differences in terms of cells, columns, and rows (possibly as % difference), etc.

# from README
library(daff)
y <- iris[1:3,]
x <- y

x <- head(x,2) # remove a row
x[1,1] <- 10 # change a value
x$hello <- "world"  # add a column
x$Species <- NULL # remove a column

# modified
x <- rbind(x, c(3, 3, 3, 3, "test"))
x <- x[-2,]

patch <- diff_data(y, x)

changes <- length(grep("->", unlist(patch$get_data())))
col_added <- length(which(names(patch$get_data()) == "+++"))
col_rmd <- length(which(names(patch$get_data()) == "---"))

Consider integrating with objectdiff

objectdiff

Wonder if there's anything we can collaborate on?

Handle class

Hi,

This is not really a bug, but a feature request: is daff able to detect class changes?

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))

@@	Sepal.Length	Sepal.Width	...

Thanks!
David
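
Independently of daff, class changes can be listed by comparing the column classes of the two data frames directly. A base R sketch; `class_changes` is a hypothetical helper that looks only at the first class of each shared column:

```r
# Hypothetical helper: list columns whose class differs between two data frames.
class_changes <- function(old, new) {
  common  <- intersect(names(old), names(new))
  old_cls <- vapply(old[common], function(col) class(col)[1], character(1))
  new_cls <- vapply(new[common], function(col) class(col)[1], character(1))
  changed <- old_cls != new_cls
  data.frame(column = common[changed],
             from   = unname(old_cls[changed]),
             to     = unname(new_cls[changed]))
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

changes <- class_changes(y, x)
changes   # Sepal.Length: numeric to factor; Sepal.Width: numeric to character
```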

Issue with modified or inserted/deleted row

Hi, thank you for the very useful package.
Within the same comparison, I sometimes find modified rows and sometimes a pair of inserted/deleted rows.
This is very strange, because the format of the key columns is the same and I can't find any difference in the key columns.
Thanks

CRAN

Will daff be available on CRAN again at some point?

diff_data detects changes in primary keys, even if ids is specified

Maybe I misunderstand the role of the ids parameter, but records with different primary keys should be treated separately.
Instead, it seems that diff_data detects changes, even if the key is forced.

df_ref <- tibble::tribble(
   ~a, ~b, ~key,
   1, "test1", "key_001",    # only in ref
   2, "test2", "key_002",
   3, "test3", "key_003",
)

df_current <- tibble::tribble(
   ~a, ~b, ~key,
   2, "TEST2", "key_002",    # non-key change
   3, "test3", "KEY_003",    # key change
   4, "test4", "key_004",    # only in current
)


diff_structure <- daff::diff_data(
   data_ref = df_ref,
   data = df_current,
   ids = "key",
   ordered = FALSE
)

diff_structure
#> Daff Comparison: 'df_ref' vs. 'df_current' 
#>     a b            key             
#> --- 1 test1        key_001         
#> ->  2 test2->TEST2 key_002         
#> ->  3 test3        key_003->KEY_003
#> +++ 4 test4        key_004

I would have expected something like:

#> ---  3 test3        key_003
#> +++  3 test3        KEY_003

Even if I compare without the ID, I get the same result:

diff_structure_no_id <- daff::diff_data(
   data_ref = df_ref,
   data = df_current,
   ordered = FALSE
)

diff_structure_no_id
#> Daff Comparison: 'df_ref' vs. 'df_current' 
#>     a b            key             
#> --- 1 test1        key_001         
#> ->  2 test2->TEST2 key_002         
#> ->  3 test3        key_003->KEY_003
#> +++ 4 test4        key_004

Created on 2021-04-14 by the reprex package (v2.0.0)

(I am using the latest daff on CRAN, 0.3.5)

Thanks!

`render_diff(use.DataTables = TRUE)` broken

The default setting automatically sorts by the first column. This completely distorts the view, because placeholder rows come first, then all changes or deleted rows, then all new rows.

I can work around with use.DataTables = FALSE for my use case.

Do we need to add an artificial rowid column?

Unknown error "Cannot set property '14' of undefined"

patch_data(x, patch)
Error in context_eval(join(src), private$context) :
TypeError: Cannot set property '14' of undefined