edwindj / daff
Diff, patch and merge for data.frames, see http://paulfitz.github.io/daff/
Home Page: https://edwindj.github.io/daff/
License: Other
The original daff is written in the programming language Haxe, which can target many programming languages.
Currently the R package daff uses the JavaScript version of daff. However, Haxe can also generate C++, so the library could also be implemented using the C++ target. I don't know whether or how the C++ version of daff would compile as an R package; that remains to be found out.
Hi,
Sometimes a difference is just a rounding issue and should be ignored, something like the approach here:
https://github.com/alexsanjoseph/compareDF
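One possible way to ignore rounding noise, sketched here in base R, is to round numeric columns before the two data frames are compared. The helper name round_numeric and the default of 6 digits are assumptions made for illustration; they are not part of daff or compareDF.

```r
# Hypothetical helper: round every numeric column so that tiny
# floating-point differences do not show up as changes.
round_numeric <- function(df, digits = 6) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], round, digits = digits)
  df
}

a <- data.frame(id = 1:2, value = c(0.1 + 0.2, 1))
b <- data.frame(id = 1:2, value = c(0.3, 1))

raw_equal     <- all(a$value == b$value)   # the unrounded values differ
rounded_a     <- round_numeric(a)
rounded_b     <- round_numeric(b)
rounded_equal <- all(rounded_a$value == rounded_b$value)
```

The rounded data frames could then be passed to diff_data() in place of the originals, at the cost of also hiding genuine changes smaller than the chosen precision.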
When I run, I get
Error in e$ctx <<- V8::new_context("window") :
cannot change value of locked binding for 'e'
Should you use assign instead?
e <- new.env()
get_context <- function(){
  if (is.null(e$ctx)){
    assign("ctx", V8::new_context("window"), envir = e)
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}
Or should e be a list rather than an environment, like this?
e <- list()
get_context <- function(){
  if (is.null(e$ctx)){
    e$ctx <- V8::new_context("window")
    e$ctx$source(system.file("js/underscore.js", package="V8"))
    e$ctx$source(system.file("js/daff.js", package="daff"))
    e$ctx$source(system.file("js/util.js", package="daff"))
  }
  e$ctx
}
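The difference between the two proposals can be checked without V8. The sketch below replaces V8::new_context() with a stand-in, make_context(), that only counts how often it is called; all names in it are hypothetical. It suggests that the environment variant caches the context, while the list variant rebuilds it on every call, because assigning into a list inside a function only modifies a local copy.

```r
# Stand-in for V8::new_context(); counts how many "contexts" are created.
n_created <- 0
make_context <- function() {
  n_created <<- n_created + 1
  list(id = n_created)
}

# Variant 1: cache in an environment (reference semantics).
e_env <- new.env()
get_context_env <- function() {
  if (is.null(e_env$ctx)) {
    assign("ctx", make_context(), envir = e_env)
  }
  e_env$ctx
}

# Variant 2: cache in a list (copy semantics).
e_list <- list()
get_context_list <- function() {
  if (is.null(e_list$ctx)) {
    e_list$ctx <- make_context()  # changes a function-local copy of e_list
  }
  e_list$ctx
}

get_context_env()
get_context_env()
created_by_env <- n_created    # one context, reused on the second call

n_created <- 0
get_context_list()
get_context_list()
created_by_list <- n_created   # a new context on every call
```

As far as I understand it, the reported error arises because e$ctx <<- value rebinds e itself, and bindings in a package namespace are locked after loading; assign(..., envir = e) changes the contents of the environment without rebinding e.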
Thanks for the reminder: will look into it next week!
Originally posted by @edwindj in #31 (comment)
A detail, really, but when printing diff_data, I get the first six and last six lines. This prints lines twice when there are fewer than 12 patch lines.
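A minimal sketch of how the overlap could be avoided: only split into head and tail when the patch has more than twice the number of lines shown at each end. The helper rows_to_print is hypothetical and is not taken from the daff print method.

```r
# Hypothetical helper: which patch lines to print for a patch of n_rows lines.
rows_to_print <- function(n_rows, n = 6) {
  if (n_rows <= 2 * n) {
    seq_len(n_rows)                               # short patch: every line once
  } else {
    c(seq_len(n), seq(n_rows - n + 1, n_rows))    # first n and last n lines
  }
}

short_patch <- rows_to_print(5)
long_patch  <- rows_to_print(151)
```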
Hello and thanks for this awesome project.
I have been using daff like so:
var data1 = ...;
var data2 = ...;
// Wrap into tables
var data1_table = new daff.TableView(data1);
var data2_table = new daff.TableView(data2);
// Calculate alignment
var alignment = daff.compareTables(data1_table, data2_table).align();
// Produce diff
var data_diff = [];
var table_diff = new daff.TableView(data_diff);
// Set diff options
var flags = new daff.CompareFlags();
flags.always_show_header = false;
flags.ordered = false;
flags.show_unchanged_columns = true;
flags.unchanged_column_context = 0;
flags.unchanged_context = 0;
var highlighter = new daff.TableDiff(alignment,flags);
highlighter.hilite(table_diff);
if (table_diff.data.length === 0) {
return;
}
// Get HTML
var diff2html = new daff.DiffRender();
diff2html.render(table_diff);
var table_diff_html = diff2html.html();
table_diff_html contains unescaped data. For example, if data2 has a field that contains <foo>, that part of the field is never displayed to the user.
Could it be that I'm not calling something properly? Perhaps it is here that escaping ought to be done?
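Until escaping is handled inside the renderer, one workaround is to escape cell values before they are handed to daff. The snippet above is JavaScript; the sketch below shows the same idea in R, the language of this package, and the same four replacements could be applied to the JavaScript arrays. The helpers escape_html and escape_df are hypothetical names, not daff functions.

```r
# Hypothetical helper: escape the characters that HTML treats specially.
# "&" must be replaced first so that later replacements are not re-escaped.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub('"', "&quot;", x, fixed = TRUE)
  x
}

# Apply the escaping to every character column of a data frame.
escape_df <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], escape_html)
  df
}

raw     <- data.frame(id = 1:2, note = c("<foo>", "a & b"),
                      stringsAsFactors = FALSE)
escaped <- escape_df(raw)
```

Note that pre-escaping changes the data being compared, so it only suits cases where the diff is produced purely for display.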
Allow explicitly defining which columns are identifying or should be ignored in the comparison.
Hey there,
when I replace three values in the iris dataset with NA and daff it, this is translated into a patch with two value replacements and one row replacement.
> iris2 <- iris
> iris2[1:3,1] <- NA
> diff_data(iris, iris2)
Daff Comparison: ‘iris’ vs. ‘iris2’
First 6 and last 6 patch lines:
@@ Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 -> 5.1->NA 3.5 1.4 0.2 setosa
2 -> 4.9->NA 3 1.4 0.2 setosa
3 +++ NA 3.2 1.3 0.2 setosa
4 --- 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 ... ... ... ... ... ...
7 ... ... ... ... ... ...
8 -> 5.1->NA 3.5 1.4 0.2 setosa
9 -> 4.9->NA 3 1.4 0.2 setosa
10 +++ NA 3.2 1.3 0.2 setosa
11 --- 4.7 3.2 1.3 0.2 setosa
12 4.6 3.1 1.5 0.2 setosa
13 ... ... ... ... ... ...
I'd expect the same patch notation for the three cases, but I could be missing something.
I found the following behavior surprising:
library(daff)
example(diff_data) # creates the "dd" patch
save(dd, file = "patch_loading.Rdata")
load("patch_loading.Rdata")
dd
## Error in context_eval(join(src), private$context) :
## Context has been disposed.
Almost all R objects can be safely written to Rdata or RDS files. Is this failure expected behavior?
If this is intended, I can contribute a documentation patch to diff_data and write_diff so people at least know to expect this -- it caught me off guard since R objects are usually self-contained and Rdata/RDS files are a big part of my workflow.
An htmlwidgets version of daff would allow for data_diff in shiny and rmarkdown.
From what I understand of the color coding, white represents rows that remain unchanged from the source to the target. Why are these displayed by render_diff? And is it possible not to display these rows?
If I select some of these unchanged rows and use them in diff_data, I get an empty result, which is correct, so I don't understand why they're being displayed when they're part of a larger data frame.
Thanks
Could you please help me or suggest a workaround to use ID as the primary key when matching rows for differences?
Hi,
Thank you for this very useful package!
I tried the following code:
> y <- iris[1:3,]
> x <- y
> x$Sepal.Width[2] <- NA
> y
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> x
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 NA 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> patch <- diff_data(y, x)
> render_diff(patch)
| @@ | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
|  | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| -> | 4.9 | 3->1.4 | 1.4->0.2 | 0.2->setosa | setosa->NULL |
|  | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Is it expected?
Best,
David
I am experiencing encoding problems which I believe are a combined issue of Windows and render_diff's way of generating the html file.
render_diff seems to generate the html file with the encoding declared as utf-8 by default. On Windows machines this does not seem to work, as the data written to disk is encoded as ANSI.
My workaround has been to change the encoding details in the html-file:
render_diff(changes, file = write_file, view = FALSE)
# reopen file and replace encoding details
x <- readLines(write_file)
y <- gsub( "<meta charset='utf-8'>", "<meta charset='ANSI'>", x )
cat(y, file=write_file, sep="\n")
While it works, it might be nice to have this fixed, since daff is very useful. I have not been able to identify the exact location of the bug, hence this description.
Here is an example on Windows (Daff v0.3.0):
df1 <- data.frame(x = "ä", y = "è")
df2 <- data.frame(x = "ö", y = "è")
diff <- diff_data(df1, df2)
render_diff(diff)
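An alternative to relabelling the file as ANSI would be to rewrite its contents as UTF-8 so that they match the declared charset. The base R sketch below shows one way to write UTF-8 bytes regardless of the platform's native encoding; write_utf8 is a hypothetical helper, and whether render_diff could adopt something like it internally is an assumption.

```r
# Hypothetical helper: write lines to disk as UTF-8 bytes, bypassing the
# re-encoding to the native encoding that a text-mode connection may apply.
write_utf8 <- function(lines, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(lines), con, useBytes = TRUE)
}

html_file <- tempfile(fileext = ".html")
write_utf8(c("<meta charset='utf-8'>", "\u00e4 \u00f6 \u00e8"), html_file)

# Read the file back, declaring the bytes as UTF-8.
read_back <- readLines(html_file, encoding = "UTF-8")
```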
I don't normally raise these kinds of point-out-the-obvious issues, but this one is a bit strange because the last fail log of daff appeared to be passing with the normal suite of tests: https://cran-archive.r-project.org/web/checks/2019-05-13_check_results_daff.html
Example:
> iris2 <- data.frame(iris, iris, iris, check.names = FALSE)
> d <- diff_data(iris, iris2)
> d
Daff Comparison: ‘iris’ vs. ‘iris2’
First 6 and last 6 patch lines:
! ---
1 @@ Species
2 +++ <NA>
3 +++ <NA>
4 +++ <NA>
5 +++ <NA>
6 +++ <NA>
... ... ...
296 --- virginica
297 --- virginica
298 --- virginica
299 --- virginica
300 --- virginica
301 --- virginica
> summary(d)
Data diff:
Comparison: ‘iris’ vs. ‘iris2’
# Modified Reordered Deleted Added
Rows 150 0 0 150 150
Columns 5 --> 15 0 0 5 0
Proposed solution:
data_diff should check for duplicated column names. If found, it should call make.unique and generate a warning.
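The proposed check could look roughly like the following base R sketch. The function name dedupe_names and the wording of the warning are assumptions; only anyDuplicated() and make.unique() are existing base R functions.

```r
# Hypothetical pre-check: rename duplicated columns and warn about it.
dedupe_names <- function(df) {
  nms <- names(df)
  if (anyDuplicated(nms) > 0) {
    warning("Duplicated column names found; renaming with make.unique().")
    names(df) <- make.unique(nms)
  }
  df
}

iris2 <- data.frame(iris, iris, check.names = FALSE)  # duplicated names
fixed <- suppressWarnings(dedupe_names(iris2))
names(fixed)
```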
First of all, thank you for this fantastic package. I've used it successfully for several tasks, and have only recently come upon this issue.
Issue:
I get a cryptic message that pops up when I call diff_data() on two large datasets (one has 41,222 observations and 23 variables, and the other has 32,077 observations and 21 variables). Note that if I use dplyr::slice to chop each dataset to 10,000 rows each, the same error occurs.
Attempted traceback:
I was able to use debugonce(diff_data) to step through the function and believe I have identified the source. It appears to come during the call ctx$assign(tv_diff$var_name, JS(diff)) (which of course I'm aware comes from the V8 package, so please let me know if I ought to cross-post this issue in that repo's tracker as well or instead).
I step into ctx$assign(tv_diff$var_name, JS(diff)), which looks like this...
function (name, value, auto_unbox = TRUE, ...)
{
    stopifnot(is.character(name))
    obj <- if (inherits(value, "JS_EVAL")) {
        invisible(this$eval(paste("var", name, "=", value)))
    }
    else {
        invisible(this$eval(paste("var", name, "=", toJSON(value,
            auto_unbox = auto_unbox, ...))))
    }
}
... and then have no problems until I step into invisible(this$eval(paste("var", name, "=", value))). It looks like the error ultimately occurs in paste("var", name, "=", value), which calls .Internal(paste(list(...), sep, collapse)).
With all of the functions that I stepped through, I'm not sure whether an issue in one or more functions of the daff package and/or the V8 package caused the base system to hit a fault, or whether it's a problem with my own data (which is sensitive and cannot be shared). Apologies for the exceedingly circuitous traceback and lack of a reproducible example.
It would be amazing if there were a summary method for diff_data objects providing a tally of differences in terms of cells, columns, and rows (possibly % difference) etc.
# from README
library(daff)
y <- iris[1:3,]
x <- y
x <- head(x,2) # remove a row
x[1,1] <- 10 # change a value
x$hello <- "world" # add a column
x$Species <- NULL # remove a column
# modified
x <- rbind(x, c(3, 3, 3, 3, "test"))
x <- x[-2,]
patch <- diff_data(y, x)
changes <- length(grep("->", unlist(patch$get_data())))
col_added <- length(which(names(patch$get_data()) == "+++"))
col_rmd <- length(which(names(patch$get_data()) == "---"))
Wonder if there's anything we can collaborate on?
Hi,
This is not really a bug, but a feature request: is daff able to detect class changes?
> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))
@@ | Sepal.Length | Sepal.Width | ... |
---|
Thanks!
David
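Until class changes are reported in the diff itself, they can be checked separately in base R. The sketch below compares the first class of each shared column; class_changes is a hypothetical helper and not part of daff.

```r
# Hypothetical helper: list columns whose (first) class differs.
class_changes <- function(ref, cur) {
  common  <- intersect(names(ref), names(cur))
  ref_cls <- vapply(ref[common], function(col) class(col)[1], character(1))
  cur_cls <- vapply(cur[common], function(col) class(col)[1], character(1))
  changed <- ref_cls != cur_cls
  data.frame(column = common[changed],
             from   = unname(ref_cls[changed]),
             to     = unname(cur_cls[changed]),
             stringsAsFactors = FALSE)
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

changes <- class_changes(y, x)
```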
Hi, thank you for the very useful package.
In the same comparison, I sometimes find modified rows and sometimes pairs of inserted/deleted rows.
This is very strange, because the format of the key columns is the same and I can't find any difference in the key columns.
Thanks
Will daff be available on CRAN again at some point?
Maybe I misunderstand the role of the ids parameter, but records with different primary keys should be treated separately.
Instead, it seems that diff_data detects changes, even if the key is forced.
df_ref <- tibble::tribble(
~a, ~b, ~key,
1, "test1", "key_001", # only in ref
2, "test2", "key_002",
3, "test3", "key_003",
)
df_current <- tibble::tribble(
~a, ~b, ~key,
2, "TEST2", "key_002", # non-key change
3, "test3", "KEY_003", # key change
4, "test4", "key_004", # only in current
)
diff_structure <- daff::diff_data(
data_ref = df_ref,
data = df_current,
ids = "key",
ordered = FALSE
)
diff_structure
#> Daff Comparison: 'df_ref' vs. 'df_current'
#> a b key
#> --- 1 test1 key_001
#> -> 2 test2->TEST2 key_002
#> -> 3 test3 key_003->KEY_003
#> +++ 4 test4 key_004
I would have expected something like:
#> --- 3 test3 key_003
#> +++ 3 test3 KEY_003
Even if I compare without the ID, I get the same result:
diff_structure_no_id <- daff::diff_data(
data_ref = df_ref,
data = df_current,
ordered = FALSE
)
diff_structure_no_id
#> Daff Comparison: 'df_ref' vs. 'df_current'
#> a b key
#> --- 1 test1 key_001
#> -> 2 test2->TEST2 key_002
#> -> 3 test3 key_003->KEY_003
#> +++ 4 test4 key_004
Created on 2021-04-14 by the reprex package (v2.0.0)
(I am using the latest daff on CRAN, 0.3.5)
Thanks!
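For comparison, the strict key-based behaviour expected above can be sketched in base R: rows whose key appears on only one side are treated as removed or added, and only keys present on both sides are candidates for a cell-level comparison. The helper split_by_key is hypothetical, and the data are the same as in the reprex, rebuilt as plain data frames.

```r
df_ref <- data.frame(
  a   = c(1, 2, 3),
  b   = c("test1", "test2", "test3"),
  key = c("key_001", "key_002", "key_003"),
  stringsAsFactors = FALSE
)
df_current <- data.frame(
  a   = c(2, 3, 4),
  b   = c("TEST2", "test3", "test4"),
  key = c("key_002", "KEY_003", "key_004"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: partition rows by exact (case-sensitive) key match.
split_by_key <- function(ref, cur, key) {
  list(
    only_in_ref = ref[!(ref[[key]] %in% cur[[key]]), , drop = FALSE],
    only_in_cur = cur[!(cur[[key]] %in% ref[[key]]), , drop = FALSE],
    shared_keys = intersect(ref[[key]], cur[[key]])
  )
}

parts <- split_by_key(df_ref, df_current, "key")
```

Here key_003 lands in only_in_ref and KEY_003 in only_in_cur, which matches the removed/added pair the report expected instead of a modified row.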
The default setting automatically sorts by the first column. This completely distorts the view, because placeholder rows come first, then all changes or deleted rows, then all new rows.
I can work around this with use.DataTables = FALSE for my use case.
Do we need to add an artificial rowid column?
patch_data(x, patch)
Error in context_eval(join(src), private$context) :
TypeError: Cannot set property '14' of undefined