sidora-tools / sidora.core
Backend for all sidora applications
License: Other
I just wanted to call sidora.core::get_df
in the main sidora app and found an interesting problem here. First of all this integration should be as simple as replacing this line
with
this_table <- sidora.core::get_df(tab, this_con, cache_dir = "data_cache")
This gives me Error in as.Date.numeric: 'origin' must be supplied
for some tables when I run the app. I don't really understand where this is coming from, but a fix could be to replace
Line 39 in b860fab
and
Line 49 in b860fab
with
dplyr::mutate_if(colnames(.) %in% coltypes_date, as.Date, origin = "1970-01-01") %>%
Unfortunately this throws Error in : cannot join a Date object with an object that is not a Date object
for some tables. Do you have an idea where this is coming from?
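Both errors point to date columns arriving with different types in different tables. As a minimal sketch (the helper name `to_date` is hypothetical and not part of sidora.core, and the assumption that dates are stored either as days since 1970-01-01 or as `YYYY-MM-DD` strings is mine), a type-aware converter could look like this:

```r
# Hypothetical helper: convert a column to Date whatever type it arrives as.
to_date <- function(x) {
  if (inherits(x, "Date")) return(x)                  # already a Date
  if (inherits(x, "POSIXt")) return(as.Date(x))       # date-time -> date
  if (is.numeric(x)) {
    # Assumption: numbers are days since the Unix epoch
    return(as.Date(x, origin = "1970-01-01"))
  }
  x <- as.character(x)
  x[!is.na(x) & x == ""] <- NA_character_             # blank cells become NA
  as.Date(x, format = "%Y-%m-%d")                     # unparsable strings become NA
}

to_date(18000)
to_date(c("2020-01-02", ""))
```

Because every branch returns a `Date`, the converted columns should have one consistent type before any rows are combined.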
As @ivelsko pointed out on mattermost, the README should state that the tunnel is only necessary when running locally and not via the servers.
Lines 51 to 58 in 2d6c8dd
Additionally, the error message printed out by get_pandora_connection()
is outdated, as it describes the old credentials file format.
sidora.core/R/utils_get_pandora_connection.R
Lines 15 to 17 in 2d6c8dd
I'm finally getting a little bit back into using pandora/sidora again, and I have just written a little function that allows you to add new tags to entries (which you could then use with format_as_update_existing()
, to save a file for the whole Pandora uploading thing).
# Requires stringr (str_split) and magrittr or dplyr (%>%)
add_new_tags <- function(old_tags, new_tags) {
  old_tags <- str_split(old_tags, ",")
  new_tags <- str_split(new_tags, ",")
  append(old_tags, new_tags) %>% unlist %>% sort %>% unique %>% paste0(collapse = ",") %>% gsub("^,", "", .)
}
This brought up a question in my mind as whether we should provide 'modification' functions or not, and if yes, to what extent?
Thoughts @sidora-tools/core?
Also @ivelsko would this be a sort of thing you would be interested in?
A more detailed example:
library(dplyr)
library(stringr) # str_split
library(purrr)   # map2_chr
library(sidora.core)
# con is assumed to come from get_pandora_connection()
add_new_tags <- function(old_tags, new_tags) {
  old_tags <- str_split(old_tags, ",")
  new_tags <- str_split(new_tags, ",")
  append(old_tags, new_tags) %>% unlist %>% sort %>% unique %>% paste0(collapse = ",") %>% gsub("^,", "", .)
}
samples <- get_df("TAB_Sample", con)
# filter
samples_raw <- samples %>%
  filter(grepl("^ALA", sample.Full_Sample_Id)) %>%
  convert_all_ids_to_values(con) %>%
  filter(sample.Type_Group == "Calculus")
# update tags and convert to the 'update existing entries' format
# map2_chr keeps sample.Tags a character column instead of a list column
samples_updated <- samples_raw %>%
  mutate(sample.Tags = map2_chr(sample.Tags, "FoodTransforms,James Fellows Yates", add_new_tags)) %>%
  sidora.core::format_as_update_existing()
#write_csv(samples_updated, "Sample.csv")
There are different versions of boolean types across pandora
yes|no
true|false
We currently standardise this in pandora to an R logical, but we should also make a reverse function so we can convert back for upload.
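A reverse conversion could be sketched as below. The function name `logical_to_pandora` and the lower-case spellings are assumptions taken from the two variants listed above; the exact strings Pandora expects on upload would need to be checked.

```r
# Hypothetical helper: turn an R logical back into one of the two
# boolean spellings used in Pandora.
logical_to_pandora <- function(x, style = c("yes|no", "true|false")) {
  style <- match.arg(style)
  labels <- if (style == "yes|no") c("yes", "no") else c("true", "false")
  # NA stays NA so that missing values are not silently turned into "no"
  as.character(ifelse(x, labels[1], labels[2]))
}

logical_to_pandora(c(TRUE, FALSE, NA))
logical_to_pandora(c(TRUE, FALSE), style = "true|false")
```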
@TCLamnidis and I realized today that we implemented a very strict check for the set of tables that can be downloaded with get_df
:
sidora.core/R/dataprep_get_data.R
Lines 81 to 84 in b84ad9d
I think this could be replaced by a warning, to allow downloading tables not (yet) in pandora_tables_all
.
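A rough sketch of the softer check, assuming the known table names are available as a character vector (the helper name `check_table_known` is made up for illustration and is not the code at the lines referenced above):

```r
# Hypothetical helper: warn instead of stopping when a table is unknown.
check_table_known <- function(tab, known_tables) {
  unknown <- setdiff(tab, known_tables)
  if (length(unknown) > 0) {
    warning(
      "Not in the list of known Pandora tables, trying anyway: ",
      paste(unknown, collapse = ", "),
      call. = FALSE
    )
  }
  # TRUE if everything was known, FALSE otherwise
  invisible(length(unknown) == 0)
}

check_table_known("TAB_Site", c("TAB_Site", "TAB_Sample"))
```

The download would then proceed either way, and the user still learns that the table is not (yet) covered by the package.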
I think it's a nuisance that things like capture.Probe_Set
aren't automatically converted to "SG", "TF" and so on.
Since I know we already have efficient hash-based code to do fast lookups, and that seems to be already implemented automatically on the CLI (in some modules?), I think we should just always do this. What do you think?
Same holds for Workers, sequencing machines, protocols, and other simple string lookups.
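To illustrate the kind of lookup meant here, a named-vector version could look like the sketch below. This is not the existing sidora.core implementation; the IDs in `probe_sets` are made up, and only the values "SG" and "TF" come from the issue text.

```r
# Made-up lookup table; real IDs would come from the Pandora database.
probe_sets <- data.frame(
  Id = c(1L, 2L),
  Name = c("SG", "TF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate IDs into readable values.
id_to_value <- function(ids, lookup) {
  key <- stats::setNames(lookup$Name, lookup$Id)  # names are the IDs as strings
  unname(key[as.character(ids)])                  # unknown or NA IDs give NA
}

id_to_value(c(2L, 1L, NA, 9L), probe_sets)
```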
In the sidora web-app I allow the user to filter given a 'creation' date range (e.g. if someone wanted to check how many libraries they made in a given year).
This would be good to add to the functions here https://github.com/sidora-tools/sidora.core/blob/master/R/dataprep_filter_data.R
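A minimal sketch of such a filter in base R. The function name and the column name `library.Creation_Date` are assumptions for illustration, not the actual sidora.core or Pandora names.

```r
# Hypothetical helper: keep rows whose date column falls inside [from, to].
filter_by_date_range <- function(df, col, from, to) {
  dates <- as.Date(df[[col]])
  keep <- !is.na(dates) & dates >= as.Date(from) & dates <= as.Date(to)
  df[keep, , drop = FALSE]
}

# Toy data with an assumed column name
libs <- data.frame(
  library.Full_Library_Id = c("A", "B", "C"),
  library.Creation_Date = c("2018-12-31", "2019-06-15", NA),
  stringsAsFactors = FALSE
)
filter_by_date_range(libs, "library.Creation_Date", "2019-01-01", "2019-12-31")
```

Rows without a date are dropped here; whether they should be kept instead is a design decision for the real implementation.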
All of the Ethically_Culturally_sensitive columns for my CMC data are wrong, where sample.Ethically_culturally_sensitive
is all NA, extract.Ethically_culturally_sensitive
is all FALSE, and library.Ethically_culturally_sensitive
is a combination of NA and FALSE, and all should be TRUE.
I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?
When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them, but now they’re all “Yes” so everything should be TRUE. Since sample.Ethically_culturally_sensitive
is NA this was never correct, while extract.Ethically_culturally_sensitive
and library.Ethically_culturally_sensitive
as FALSE are outdated. I think the NAs in library.Ethically_culturally_sensitive
come from extracts that were never built into libraries so those are probably ok
#' would return a table either from cache as a tibble or as a tbl() dbplyr object
get_table(db_con, entityType, fromCache=T, toCache=T)
...
filter_table(input, filter_spec)
join_tables(table_source, ...) -> would give error if you lack an intermediate table
examples:
join_tables(sites, individuals, samples)
error: join_tables(individuals, samples, libraries)
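The "missing intermediate table" check could be sketched as follows. The hierarchy order Site > Individual > Sample > Extract > Library is taken from the joins discussed elsewhere in this tracker; the function name and the lower-case labels are assumptions.

```r
# Assumed order of the Pandora hierarchy
pandora_hierarchy <- c("sites", "individuals", "samples", "extracts", "libraries")

# Hypothetical helper: stop if the requested tables skip a level.
check_join_chain <- function(tables) {
  pos <- match(tables, pandora_hierarchy)
  if (anyNA(pos)) {
    stop("Unknown table(s): ", paste(tables[is.na(pos)], collapse = ", "))
  }
  pos <- sort(pos)
  gaps <- setdiff(seq(min(pos), max(pos)), pos)
  if (length(gaps) > 0) {
    stop("Missing intermediate table(s): ",
         paste(pandora_hierarchy[gaps], collapse = ", "))
  }
  invisible(pandora_hierarchy[pos])  # tables in join order
}

check_join_chain(c("sites", "individuals", "samples"))  # passes silently
# check_join_chain(c("individuals", "samples", "libraries")) would stop
```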
Apparently a new field containing the Eva labs coreDB ID was recently added to many of the tabs.
We should include this when people are pulling tables.
Certain TABs contain columns that are restricted for read-only users. In the future we may need to build custom SQL functions, e.g. for TAB_workers
to select non-protected information:
e.g.
pull_worker <- function(con) {
  ## Assumes con has already been generated
  dplyr::tbl(con, dplyr::sql("SELECT Id, Name, Username FROM TAB_User"))
}
I was preparing a pandora-explorer thing the last couple of nights and realised we still have not implemented the loading of the analysis tables. If we want more people to use sidora.core etc. for stuff like that, we should offer this as basic functionality.
The main issue is that the IDs and the results are decoupled across two tables (as in, the results table doesn't actually display the Full_Analysis_ID).
My question would be whether we should make a custom loading procedure for analysis (similar to the restricted access), or keep them decoupled and ask the user to merge them. It only takes a few minutes to load the actual analysis results data (even if it's quite big), so it isn't unreasonable to allow loading atm.
Thoughts?
I.e. sequencing batches have an 'id' but the actual batch name is stored in TAB_Batch rather than in the TAB_Extract table itself
Should we do this by default? As we are read-only I don't see why we would ever need deleted entries to be considered?
Also improve error message handling for better dev traceback.
This is more of a formatting change request than an issue. I wanted to use some information in the Analysis tab, but this information is presented differently from the other tabs. That made selecting/filtering for what I wanted more involved, because I kept losing samples and had to figure out why.
Instead of having each entry in Analysis as a column, it's all as rows under 2 columns (analysis.Title
, analysis.Result
). Is it necessary that the information is presented this way instead of making each entry an individual column?
The way it is currently means that if a sequenced library isn't run through this analysis, it won't have an entry, not even a placeholder NA, so when I filtered for "Initial reads", a bunch of samples I wanted to include were lost from my table (yes, these were blanks. Absolutely I still need them).
I did realize the entries under analysis.Title
are not consistent, which is coming from Pandora itself, which is a problem. For example, GRG003.B0101.SG1.1.Human_Shotgun
has:
Initial reads (forward+reverse):
Failed reads (fwd+rev):
Failed reads (fwd+rev) in %:
Merged reads:
Merged reads in %:
Mapped reads (fwd+rev+merged):
Mapped reads (fwd+rev+merged) in %:
Mapped fragments:
Mapped fragments in %:
Mapped fragments (L>=30):
Mapped fragments (L>=30) in %:
while GRG004.A0101.SG1.1.Human_Shotgun
has:
Initial reads:
Failed reads:
Failed reads in %:
Mapped reads/fragments:
Mapped reads/fragments in %:
Mapped reads/fragments (L>=30):
Mapped reads/fragments (L>=30) in %:
Is that difference b/c the human shotgun screening pipeline changed? Can it be normalized across Pandora, so that you can make the entries columns like for the other tabs?
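Until the titles are normalized, the long table can be spread into columns on the user side. The sketch below uses base R `reshape()` on made-up data; the ID column name `analysis.Full_Analysis_Id` and all values are assumptions for illustration. Entries that do not exist for a given analysis come out as NA instead of disappearing.

```r
# Toy long-format analysis table (IDs and values are invented)
analysis_long <- data.frame(
  analysis.Full_Analysis_Id = c("LIB1.SG1.1", "LIB1.SG1.1", "LIB2.SG1.1"),
  analysis.Title = c("Initial reads", "Failed reads", "Failed reads"),
  analysis.Result = c("1000", "50", "7"),
  stringsAsFactors = FALSE
)

# One row per analysis ID, one column per title
analysis_wide <- reshape(
  analysis_long,
  idvar = "analysis.Full_Analysis_Id",
  timevar = "analysis.Title",
  v.names = "analysis.Result",
  direction = "wide"
)
analysis_wide
```

The new columns are named `analysis.Result.<Title>`, so inconsistent titles such as "Initial reads" and "Initial reads (forward+reverse)" would still end up in separate columns.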
The value in the Library tab 'Quantification post-Indexing' box isn't being read, and shows NA even though this information is present in Pandora online.
For example, library ARS001.A0101 has 'NA' for column library.Quantification_post-Indexing_total
, however online the box has the value 4.06E+10.
library(sidora.core)
library(dplyr) # %>%, filter, select, contains
con <- get_pandora_connection(cred_file = "~/.credentials")
sites_df <- get_df("TAB_Library", con) %>% convert_all_ids_to_values()
sites_df %>% filter(library.Full_Library_Id == "ARS001.A0101") %>% select(contains("Quantification"))
# A tibble: 1 x 2
`library.Quantification_pre-Indexing_total` `library.Quantification_post-Indexing_total`
<int> <int>
1 377000000 NA
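One possible explanation, not confirmed in this thread: the tibble header shows both columns typed as `<int>`, and 4.06E+10 is larger than the biggest value an R integer can hold, while the pre-indexing value still fits. The sketch below only demonstrates that R behaviour; `fits_integer` is a made-up helper.

```r
# Hypothetical helper: does a number fit into R's 32-bit integer type?
fits_integer <- function(x) abs(x) <= .Machine$integer.max

.Machine$integer.max      # largest representable integer
fits_integer(377000000)   # pre-indexing value from the example
fits_integer(4.06e10)     # post-indexing value from the Pandora website

# Coercing the large value to integer gives NA (with a warning),
# while keeping it as a double works
suppressWarnings(as.integer(4.06e10))
as.numeric("4.06E+10")
```

If this is the cause, reading these columns as double rather than integer should avoid the NA.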
Just playing with this, I realised that date columns are parsed as character vectors in the tibbles. It would be nice to convert them automatically to proper date objects (can't remember what the canonical library for that would be).
As discussed in #16 and on mattermost, there are a number of columns in various tables of Pandora, that only seem to serve as placeholders to be filled from other tables up- or downstream in the Pandora hierarchy. Imho they are pointless and confusing for data analysis.
I see three options how to address that:
get_df()
I don't like 1. for obvious reasons (and generally consider the ethics issue solved with the startup message). 2. is tricky to pull off and would force a lot of data download even if one only wants a single table. So 3. it is, in my opinion.
What do you folks say? (including @jfy133)
Geographic names often contain non-ascii characters and sidora seems to download them from Pandora in the wrong way. That causes problems at many points downstream -- I have no idea how we managed to miss that until now.
Related: https://stackoverflow.com/questions/30932708/how-to-change-dplyrtbl-connection-encoding-to-utf8
This contains all the help info... might be a good reference
> sidora_entry_help("TAB_Capture", "Files")
Files such as quantification results (if performed).
Use case (from Ben): user wants a flat table of just columns from Site and Library tables.
Proposed implementation:
sidora.core/R/dataprep_join_data.R
Lines 127 to 148 in 1a6088b
2022-04-19
2022-03-23
A shower thought led me to an idea for a very simple filter function implementation in sidora.cli, which is almost as powerful as the one in the web interface. Hacking together the prototype made me realize that the current join function causes variables to have different names depending on which tables are merged. That was clear from the beginning, but now I realized that this is highly confusing.
I therefore suggest to rename all variables at download to include the table name: e.g. Id
in table TAB_Site
should become Id.site
, Protocol
in table TAB_Sample
should become Protocol.sample
and so forth.
This change will break current code (!), but it will tremendously simplify things down the line, because variable names become reliable. Especially useful for high-overlap variables like tag
.
Discussion about this is scheduled for next Friday: @jfy133 @stschiff @nevrome
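As a sketch of the proposed renaming (the helper name is hypothetical, and deriving the suffix by stripping `TAB_` and lower-casing is my assumption based on the `Id.site` and `Protocol.sample` examples):

```r
# Hypothetical helper: append the table name to every column name.
add_table_suffix <- function(df, table_name) {
  suffix <- tolower(sub("^TAB_", "", table_name))  # "TAB_Site" -> "site"
  names(df) <- paste(names(df), suffix, sep = ".")
  df
}

site <- data.frame(Id = 1L, Name = "Example", stringsAsFactors = FALSE)
names(add_table_suffix(site, "TAB_Site"))
```

Applied at download time, every variable keeps the same name no matter which tables are joined later.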
Function convert_all_ids_to_values()
requires con
, but the README does not specify this.
Hi guys, I encountered some inconsistent sidora table formatting that caused me some confusion while trying to filter and join tables. In some columns where there's no entry a cell is left blank, while in others it's filled with NA. For example
df_list <- get_df_list(c(
  "TAB_Library", "TAB_Capture", "TAB_Sequencing"
), con = con)
lib_info <- join_pandora_tables(df_list)
lib_info <- convert_all_ids_to_values(lib_info, con)
lib_info %>%
  filter(str_detect(library.Full_Library_Id, "CMC040")) %>%
  select(library.Full_Library_Id, library.Protocol, library.Batch, sequencing.Full_Sequencing_Id)
# A tibble: 3 × 4
library.Full_Library_Id library.Protocol library.Batch sequencing.Full_Sequencing_Id
<chr> <chr> <chr> <chr>
1 CMC040.A0201 dsLibrary non UDG 2015 "" NA
2 CMC040.B0101 dsLibrary non UDG 2015 "Li04_VE_2019-04-17" CMC040.B0101.SG1.1
3 CMC040.B0101 dsLibrary non UDG 2015 "Li04_VE_2019-04-17" CMC040.B0101.SG1.2
In Pandora the cell for the library batch is empty. Is this just an R/Pandora thing that has to be tolerated, or would it be possible to have this made consistent (everything is NA)?
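Until this is made consistent upstream, blanks can be turned into NA on the user side. A minimal base R sketch (the helper name `blank_to_na` is made up):

```r
# Hypothetical helper: replace empty strings with NA in all character columns.
blank_to_na <- function(df) {
  df[] <- lapply(df, function(x) {
    if (is.character(x)) x[!is.na(x) & x == ""] <- NA_character_
    x
  })
  df
}

# Toy data mirroring the example above
toy <- data.frame(
  library.Full_Library_Id = c("CMC040.A0201", "CMC040.B0101"),
  library.Batch = c("", "Li04_VE_2019-04-17"),
  stringsAsFactors = FALSE
)
blank_to_na(toy)
```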
We need a software license for this and the other sidora repos
@nevrome, thanks for getting this started! I'm learning a lot from going through your code. As a test whether I'm getting things right, I'd like to run by you a suggestion for improving join_df_list
, which I find currently
check_completeness
, which I think we don't need. How about this:
join_df_list <- function(parent, child) {
  # parent and child are one-element named lists, e.g. list(TAB_Site = site_tab)
  parent_name <- names(parent)
  child_name <- names(child)
  ret <- NULL
  if (identical(parent_name, "TAB_Site") && identical(child_name, "TAB_Individual")) {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Site" = "Id"),
      suffix = c(".Individual", ".Site")
    )
  }
  if (identical(parent_name, "TAB_Individual") && identical(child_name, "TAB_Sample")) {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Individual" = "Id"),
      suffix = c(".Sample", ".Individual")
    )
  }
  if (identical(parent_name, "TAB_Sample") && identical(child_name, "TAB_Extract")) {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Sample" = "Id"),
      suffix = c(".Extract", ".Sample")
    )
  }
  if (identical(parent_name, "TAB_Extract") && identical(child_name, "TAB_Library")) {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Extract" = "Id"),
      suffix = c(".Library", ".Extract")
    )
  }
  if (is.null(ret)) {
    stop("Inputs need to be named by Pandora table names and/or an intermediate table is missing.")
  }
  ret
}
You would call this for example as join_df_list(list(TAB_Site = site_tab), list(TAB_Individual = ind_tab))
.
I'm happy to make a PR, but as I'm still learning R I'd like to make sure I'm not missing something trivial...
As in the title.
*colname*
Unfortunately, it seems we have multiple columns that share a name but have different column types.
For example: individual.Archaeological_ID
is of character, but sample.Archaeological_ID
is of integer, with the integer corresponding to the individual.Individual.Id
numeric ID.
Also, sample_type_group.Type_Group
is character, while sample_type.Type_Group
is numeric.
This causes problems in the sidora.cli view
module when trying to make all columns 'human readable' as they are in the Pandora web page itself.
The solution from @nevrome is to replace the vectors with a table of: table | column | type
, and modify columns based on this information.
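A sketch of how such a table could drive the conversion. The layout follows the `table | column | type` suggestion and the two example columns come from this issue; the helper name and the set of supported types are assumptions.

```r
# Per-column type specification (entries taken from the examples above)
col_types <- data.frame(
  table = c("individual", "sample"),
  column = c("individual.Archaeological_ID", "sample.Archaeological_ID"),
  type = c("character", "integer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: coerce every listed column that is present in df.
apply_col_types <- function(df, col_types) {
  converters <- list(
    character = as.character,
    integer = as.integer,
    numeric = as.numeric,
    logical = as.logical
  )
  for (i in seq_len(nrow(col_types))) {
    col <- col_types$column[i]
    if (col %in% names(df)) {
      df[[col]] <- converters[[col_types$type[i]]](df[[col]])
    }
  }
  df
}

toy <- data.frame(sample.Archaeological_ID = c("12", "34"), stringsAsFactors = FALSE)
str(apply_col_types(toy, col_types))
```

Because the full column name including the table prefix is the key, two columns with the same base name can carry different types.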
I create a conda environment with the provided environment.yml
and then install sidora.core
with the provided commands.
Trying to create a connection object then throws the error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': namespace ‘rlang’ 0.4.11 is already loaded, but >= 1.0.5 is required
Using install.packages('rlang')
to get the latest version of rlang, restarting the session and retrying the command fixed the issue.
I noticed cases of odd-looking entries for library.P7_Index_Id
and library.P5_Index_Id
, when pulling these and corresponding sequences for the indices for a list of library IDs using the following:
con <- sidora.core::get_pandora_connection("/Users/taylor_hermes/.credentials")
res <- get_df("TAB_Library", con) %>% convert_all_ids_to_values(con)
samples <- c("ATG010.A0101", "CP2010.A0101", "IST003.A0101")
list_seq <- filter(res, res$library.Full_Library_Id %in% samples) %>%
select(library.External_Library_Id,
library.Full_Library_Id,
library.Index_Set,
library.P7_Index_Id,
library.P7_Index_Sequence,
library.P5_Index_Id,
library.P5_Index_Sequence) %>%
arrange(library.Full_Library_Id)
For library ATG010.A0101
, I should get 1197
and 1133
for library.P7_Index_Id
and library.P5_Index_Id
, respectively. However, I get d703
and 1113
. It seems that the index sequences are correct according to my spot checking with a larger list of library IDs. When I showed this behavior to @jfy133 a short time ago, he thought there may be a table lookup issue.
Attached is the output from the code above.
test1.csv