
sidora.core's Issues

date objects

@jfy133

I just wanted to call sidora.core::get_df in the main sidora app and found an interesting problem here. First of all, this integration should be as simple as replacing this line

https://github.com/jfy133/sidora/blob/e1a724ed8d0493bcc16b1b471e805534ac047f93/app/helpers_loading_data.R#L12

with

this_table <- sidora.core::get_df(tab, this_con, cache_dir = "data_cache")

This gives me Error in as.Date.numeric: 'origin' must be supplied for some tables when I run the app. I don't really understand where this is coming from, but a fix could be to replace

dplyr::mutate_if(colnames(.) %in% coltypes_date, as.Date) %>%

and

dplyr::mutate_if(colnames(.) %in% coltypes_date, as.Date) %>%

with

dplyr::mutate_if(colnames(.) %in% coltypes_date, as.Date, origin = "1970-01-01") %>%

Unfortunately this throws Error in : cannot join a Date object with an object that is not a Date object for some tables. Do you have an idea where this is coming from?
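For reference, a minimal base R illustration (not from the app) of why as.Date() on a numeric needs an origin: dates cached as plain numbers are day counts since 1970-01-01.

x <- 18262                            # 2020-01-01 stored as days since 1970-01-01
# as.Date(x)                          # older R versions: Error: 'origin' must be supplied
as.Date(x, origin = "1970-01-01")     # [1] "2020-01-01"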

Update documentation clarification and error message for new release.

As @ivelsko pointed out on mattermost, the README should state that the tunnel is only necessary when running locally and not via the servers.

sidora.core/README.md

Lines 51 to 58 in 2d6c8dd

4. Establish an ssh tunnel to the pandora database server with
```bash
ssh -L 10001:pandora.eva.mpg.de:3306 <your username>@daghead1
```
> You must make a new tunnel each time you want to connect (e.g. after you log out or reboot your machine)

Additionally, the error message printed out by get_pandora_connection() is outdated, as it describes the old credentials file format.

"[sidora.core] error: can't find .credentials file. Please create one ",
"containing three lines:",
"the database host, the username, the password."

Allow data-entry updating utility functions?

I'm finally getting a little bit back into using pandora/sidora again, and I have just written a little function that allows you to add new tags to entries (which you could then use with format_as_update_existing(), to save a file for the whole Pandora uploading thing).

# needs stringr for str_split() and magrittr/dplyr for %>%
add_new_tags <- function(old_tags, new_tags){
  old_tags <- stringr::str_split(old_tags, ",")
  new_tags <- stringr::str_split(new_tags, ",")
  
  # combine, deduplicate, sort and collapse back into one comma-separated string
  append(old_tags, new_tags) %>% unlist %>% sort %>% unique %>% paste0(collapse = ",") %>% gsub("^,", "", .)
}

This brought up a question in my mind as to whether we should provide 'modification' functions or not, and if yes, to what extent?

Thoughts @sidora-tools/core?

Also @ivelsko would this be a sort of thing you would be interested in?

A more detailed example:

library(dplyr)
library(purrr)        # map2()
library(stringr)      # str_split()
library(sidora.core)  # get_df(), convert_all_ids_to_values()

add_new_tags <- function(old_tags, new_tags){
  old_tags <- str_split(old_tags, ",")
  new_tags <- str_split(new_tags, ",")
  
  append(old_tags, new_tags) %>% unlist %>% sort %>% unique %>% paste0(collapse = ",") %>% gsub("^,", "", .)
}


samples <- get_df("TAB_Sample", con)

# filter
samples_raw <- samples %>%
  filter(grepl("^ALA", sample.Full_Sample_Id)) %>%
  convert_all_ids_to_values() %>%
  filter(sample.Type_Group == "Calculus")

# update tags and convert to the 'update existing entries' format
samples_updated <- samples_raw %>%
  mutate(sample.Tags = map2(sample.Tags, "FoodTransforms,James Fellows Yates", add_new_tags)) %>%
  sidora.core::format_as_update_existing()

#write_csv(samples_updated, "Sample.csv")

Standardise logical columns

There are different encodings of boolean values across Pandora:

yes|no
true|false

We currently standardise these to an R logical when reading from Pandora, but we should also make a reverse function so we can convert back for upload.
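A minimal sketch of what such a reverse function could look like, assuming the upload format expects "yes"/"no" (the exact target strings would need to be checked per column):

logical_to_pandora <- function(x, true = "yes", false = "no") {
  # TRUE/FALSE back to the Pandora string encoding; NA stays NA
  dplyr::if_else(x, true, false, missing = NA_character_)
}

logical_to_pandora(c(TRUE, FALSE, NA))
# [1] "yes" "no"  NA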

Soften unnecessarily strict check for table download

@TCLamnidis and I realized today that we implemented a very strict check for the set of tables that can be downloaded with get_df:

if (any(!tab %in% sidora.core::pandora_tables_all))
  stop(paste0(
    "[sidora.core] error: tab not found in available tables. Options: ",
    paste(sidora.core::pandora_tables_all, collapse = ", "),
    ". Your selection: ", tab
  ))

I think this could be replaced by a warning, to allow downloading tables not (yet) in pandora_tables_all.
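A sketch of what the softened check could look like, keeping the message but only warning about unknown tables and letting the download proceed:

unknown_tabs <- setdiff(tab, sidora.core::pandora_tables_all)
if (length(unknown_tabs) > 0) {
  warning(paste0(
    "[sidora.core] warning: tab not found in available tables: ",
    paste(unknown_tabs, collapse = ", "),
    ". Attempting to download anyway."
  ))
}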

Automatic lookup of capture probe set and other easy-to-lookup columns

I think it's a nuisance that things like capture.Probe_Set aren't automatically converted to "SG", "TF" and so on.

Since I know we already have efficient hash-based code to do fast lookups, and that seems to be already implemented automatically on the CLI (in some modules?), I think we should just always do this. What do you think?

The same holds for Workers, sequencing machines, protocols, and other simple string lookups.
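For illustration only, the kind of lookup meant here could be as simple as a named vector built once from the relevant reference table; the mapping below and the capture_df name are made up, and the package's actual hash-based code may look different:

probe_set_lookup <- c(`1` = "SG", `2` = "TF")   # hypothetical Id -> value mapping

capture_df$capture.Probe_Set <-
  unname(probe_set_lookup[as.character(capture_df$capture.Probe_Set)])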

Ethically_culturally_sensitive columns have incorrect/outdated entries

All of the Ethically_Culturally_sensitive columns for my CMC data are wrong: sample.Ethically_culturally_sensitive is all NA, extract.Ethically_culturally_sensitive is all FALSE, and library.Ethically_culturally_sensitive is a mix of NA and FALSE, when they should all be TRUE.

I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?

When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them. Now they're all “Yes”, so everything should be TRUE. Since sample.Ethically_culturally_sensitive is NA, this was never correct, while extract.Ethically_culturally_sensitive and library.Ethically_culturally_sensitive as FALSE are outdated. I think the NAs in library.Ethically_culturally_sensitive come from extracts that were never built into libraries, so those are probably ok.

Defining a Core API

#' returns a table either from the cache as a tibble or as a tbl() dbplyr object

get_table(db_con, entityType, fromCache = TRUE, toCache = TRUE)
...

filter_table(input, filter_spec)

join_tables(table_source, ...) -> gives an error if you lack an intermediate table

examples:

join_tables(sites, individuals, samples)
error: join_tables(individuals, samples, libraries)

Custom function for protected tables

Certain TABs contain columns that are restricted for read-only users. We may in the future need to build custom SQL functions, e.g. for TAB_workers, to select only the non-protected information:

e.g.

pull_worker <- function(con){
  ## Assumes con has already been generated
  dplyr::tbl(con, dbplyr::build_sql("SELECT Id, Name, Username FROM TAB_User", con = con))
}

Add analysis_result_string table loading

I was preparing a pandora-explorer thing the last couple of nights and realised we still have not implemented the loading of the analysis tables. If we want more people to use sidora.core etc. for stuff like that, we should add this as basic functionality.

The main issue is that the IDs and the results are decoupled across two tables (as in, the results table doesn't actually display the Full_Analysis_ID).

My question would be whether we should make a custom loading procedure for the analysis tables (similar to the restricted access one), or keep them decoupled and ask the user to merge them. It only takes a few minutes to load the actual analysis results data (even if it's quite big), so it isn't unreasonable to allow loading it at the moment.

Thoughts?

List of Release names

  • Phoebe
  • Mnemosyne
  • Rhea
  • Theia
  • Themis
  • Tethys
  • Maia
  • Electra
  • Taygete
  • Alcyone
  • Celaeno
  • Sterope
  • Merope

Analysis tab format is inconsistent with the other tabs

This is more of a formatting change request than an issue. I wanted to use some information in the Analysis tab, but this information is presented differently from the other tabs. That made selecting/filtering for what I wanted more involved, because I kept losing samples and had to figure out why.

Instead of having each entry in Analysis as a column, it's all as rows under 2 columns (analysis.Title, analysis.Result). Is it necessary that the information is presented this way instead of making each entry an individual column?

The way it is currently means that if a sequenced library isn't run through this analysis, it won't have an entry, not even a placeholder NA, so when I filtered for "Initial reads", a bunch of samples I wanted to include were lost from my table (yes, these were blanks. Absolutely I still need them).
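As a user-side workaround, reshaping the long format into one column per analysis title also gives explicit NAs for libraries without that analysis. A sketch, assuming analysis_long is the joined table with one row per analysis.Title/analysis.Result pair:

library(dplyr)
library(tidyr)

analysis_wide <- analysis_long %>%
  pivot_wider(
    names_from  = analysis.Title,
    values_from = analysis.Result
  )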

I did realize that the entries under analysis.Title are not consistent, which comes from Pandora itself and is a problem in its own right. For example, GRG003.B0101.SG1.1.Human_Shotgun has:

Initial reads (forward+reverse): 
Failed reads (fwd+rev): 
Failed reads (fwd+rev) in %: 
Merged reads: 
Merged reads in %: 
Mapped reads (fwd+rev+merged): 
Mapped reads (fwd+rev+merged) in %: 
Mapped fragments: 
Mapped fragments in %: 
Mapped fragments (L>=30): 
Mapped fragments (L>=30) in %:

while GRG004.A0101.SG1.1.Human_Shotgun has:

Initial reads: 
Failed reads: 
Failed reads in %: 
Mapped reads/fragments: 
Mapped reads/fragments in %: 
Mapped reads/fragments (L>=30): 
Mapped reads/fragments (L>=30) in %:

Is that difference because the human shotgun screening pipeline changed? Can it be normalized across Pandora, so that the entries can be made into columns like for the other tabs?

missing data in library.Quantification_post-Indexing_total

The value in the Library tab 'Quantification post-Indexing' box isn't being read, and shows NA even though this information is present in Pandora online.

For example, library ARS001.A0101 has 'NA' for column library.Quantification_post-Indexing_total, however online the box has the value 4.06E+10.

library(sidora.core)
library(dplyr)
con <- get_pandora_connection(cred_file = "~/.credentials")
sites_df  <- get_df("TAB_Library", con) %>% convert_all_ids_to_values()
sites_df %>% filter(library.Full_Library_Id == "ARS001.A0101") %>% select(contains("Quantification"))

# A tibble: 1 x 2
  `library.Quantification_pre-Indexing_total` `library.Quantification_post-Indexing_total`
                                        <int>                                        <int>
1                                   377000000                                           NA
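One possible cause (an assumption on my part, not confirmed in the issue): both quantification columns come back as 32-bit integers, and 4.06E+10 is larger than R's integer maximum, so the value cannot be represented and ends up as NA.

.Machine$integer.max              # [1] 2147483647
4.06e10 > .Machine$integer.max    # [1] TRUE
as.integer(4.06e10)               # [1] NA, with a coercion warning

If that is indeed the cause, reading the column as numeric (double) rather than integer should preserve the value.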

convert datetime-columns automatically

Just playing with this, I realised that date columns are parsed as character vectors in the tibbles. It would be nice to convert them automatically to proper date objects (I can't remember what the canonical library for that would be).
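A sketch of the conversion, with base R or with lubridate; the column names are placeholders and it assumes the strings are ISO-formatted:

# base R
df$Creation_Date <- as.Date(df$Creation_Date)             # "2020-01-31" style strings
df$Timestamp     <- as.POSIXct(df$Timestamp, tz = "UTC")   # "2020-01-31 12:00:00" style strings

# or with lubridate
df$Creation_Date <- lubridate::ymd(df$Creation_Date)
df$Timestamp     <- lubridate::ymd_hms(df$Timestamp)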

Fake columns

As discussed in #16 and on mattermost, there are a number of columns in various tables of Pandora that only seem to serve as placeholders to be filled from other tables up- or downstream in the Pandora hierarchy. Imho they are pointless and confusing for data analysis.

I see three options for how to address that:

  1. Do nothing and keep the fake columns around (as suggested by @ivelsko for the ethics columns)
  2. Populate them from the actual data columns in the other tables (as suggested by @trhermes)
  3. Delete them in the process of get_df() (i.e. hidden behind it)

I don't like 1. for obvious reasons (and generally consider the ethics issue solved with the startup message). 2. is tricky to pull off and would force a lot of data download even if one only wants a single table. So 3. it is, in my opinion.
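A minimal sketch of option 3, assuming we maintain a (hypothetical) character vector pandora_fake_columns listing the placeholder columns to drop:

drop_fake_columns <- function(df, fake_columns = pandora_fake_columns) {
  # silently drop whichever placeholder columns are present in this table
  dplyr::select(df, -dplyr::any_of(fake_columns))
}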

What do you folks say? (including @jfy133)

Request convenience function to provide flat table of non-sequential tables.

Use case (from Ben): user wants a flat table of just columns from Site and Library tables.

Proposed implementation:

  • get_df to get the tables requested
  • make_complete_table_list (shown below) to fill in the gaps between the requested tables
  • New functionality: if non-sequential tables are requested, remove the columns from the 'intermediate' tables filled in by make_complete_table_list before returning (in the case above, e.g. pseudocode: select(-starts_with(c("sample", "individual", "extract")))); see the sketch below the function

#' Make sequence-complete Pandora table list
#'
#' Pandora's layout is a hierarchical sequence of tables: all tables have a clear
#' predecessor and successor. \code{join_pandora_tables()} uses this fact to
#' merge tables accordingly. \code{make_complete_table_list} is
#' a helper function to fill the gaps in a sequence of Pandora tables.
#'
#' @param tabs character vector. List of Pandora table names
#' @param join_order_vector character vector. Reference vector with the Pandora
#' structure
#'
#' @export
make_complete_table_list <- function(
  tabs,
  join_order_vector = sidora.core::pandora_tables
) {
  positions <- sapply(tabs, function(x) { which(x == join_order_vector) })
  res <- join_order_vector[seq(min(positions), max(positions), 1)]
  return(res)
}
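Putting the pieces together, a rough sketch of the proposed convenience function; the name get_flat_table is a placeholder and the table-name-to-column-prefix conversion (e.g. TAB_Sample -> "sample.") is an assumption:

get_flat_table <- function(tabs, con) {
  complete_tabs <- make_complete_table_list(tabs)      # fill the gaps in the hierarchy
  df_list <- get_df_list(complete_tabs, con = con)     # download everything needed for the join
  joined <- join_pandora_tables(df_list)
  # drop the columns contributed by the filled-in intermediate tables
  intermediate <- setdiff(complete_tabs, tabs)
  if (length(intermediate) > 0) {
    prefixes <- paste0(tolower(sub("^TAB_", "", intermediate)), ".")
    joined <- dplyr::select(joined, -dplyr::starts_with(prefixes))
  }
  joined
}

Calling e.g. get_flat_table(c("TAB_Site", "TAB_Library"), con) would then return only site and library columns.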

Update to support new columns

2022-04-19

  • Added Osteological Age/Sex, Archeological Culture/Date and extra C14 fields to TAB_Individual
  • Added fields DOI, Library_Strandedness, Library_UDG to TAB_Protocols

2022-03-23

  • Add Protocol class and menu

Standardized variable names

A shower thought led me to an idea for a very simple filter function implementation in sidora.cli, which is almost as powerful as the one in the web interface. Hacking together the prototype made me realize that the current join function causes variables to have different names depending on which tables are merged. That was clear from the beginning, but now I realize that this is highly confusing.

I therefore suggest to rename all variables at download to include the table name: e.g. Id in table TAB_Site should become Id.site, Protocol in table TAB_Sample should become Protocol.sample and so forth.

This change will break current code (!), but it will tremendously simplify things down the line, because variable names become reliable. Especially useful for high-overlap variables like tag.
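For illustration, the renaming itself is straightforward once a table and its name are known (a sketch; whether the table name becomes a suffix, as proposed here, or a prefix is exactly what needs deciding):

suffix_columns <- function(df, tab_name) {
  # "TAB_Site" -> "site", then e.g. "Id" -> "Id.site"
  suffix <- tolower(sub("^TAB_", "", tab_name))
  stats::setNames(df, paste(names(df), suffix, sep = "."))
}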

Discussion about this is scheduled for next Friday: @jfy133 @stschiff @nevrome

Convert blank cells to NA for consistency?

Hi guys, I encountered some inconsistent sidora table formatting that caused me some confusion while trying to filter and join tables. In some columns a cell with no entry is left blank, while in others it's filled with NA. For example:

library(sidora.core)
library(dplyr)
library(stringr)

df_list <- get_df_list(c(
  "TAB_Library", "TAB_Capture", "TAB_Sequencing"
), con = con)

lib_info <- join_pandora_tables(df_list)

lib_info <- convert_all_ids_to_values(lib_info, con)

lib_info %>% 
  filter(str_detect(library.Full_Library_Id, "CMC040")) %>%
  select(library.Full_Library_Id, library.Protocol, library.Batch, sequencing.Full_Sequencing_Id) 

# A tibble: 3 × 4
  library.Full_Library_Id library.Protocol       library.Batch        sequencing.Full_Sequencing_Id
  <chr>                   <chr>                  <chr>                <chr>                        
1 CMC040.A0201            dsLibrary non UDG 2015 ""                   NA                           
2 CMC040.B0101            dsLibrary non UDG 2015 "Li04_VE_2019-04-17" CMC040.B0101.SG1.1           
3 CMC040.B0101            dsLibrary non UDG 2015 "Li04_VE_2019-04-17" CMC040.B0101.SG1.2           

In Pandora the cell for the library batch is empty. Is this just an R/Pandora thing that has to be tolerated, or would it be possible to have this made consistent (everything is NA)?
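Until this is harmonised upstream, a workaround sketch on the R side is to convert empty strings in all character columns to NA:

library(dplyr)

lib_info <- lib_info %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))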

License

We need a software license for this and the other sidora repos

coding join_df_list

@nevrome, thanks for getting this started! I'm learning a lot from going through your code. As a test of whether I'm getting things right, I'd like to run a suggestion for improving join_df_list by you, which I find currently

  1. doesn't expose its true input requirements (two tables).
  2. modifies the input list in place (scary)
  3. requires an additional function check_completeness, which I think we don't need.

How about this:

join_df_list <- function(parent, child) {
  # parent and child are expected to be one-element lists, each named by its
  # Pandora table name, e.g. list(TAB_Site = site_tab)
  ret <- NULL
  if (names(parent) == "TAB_Site" && names(child) == "TAB_Individual") {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Site" = "Id"),
      suffix = c(".Individual", ".Site")
    )
  }
  if (names(parent) == "TAB_Individual" && names(child) == "TAB_Sample") {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Individual" = "Id"),
      suffix = c(".Sample", ".Individual")
    )
  }
  if (names(parent) == "TAB_Sample" && names(child) == "TAB_Extract") {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Sample" = "Id"),
      suffix = c(".Extract", ".Sample")
    )
  }
  if (names(parent) == "TAB_Extract" && names(child) == "TAB_Library") {
    ret <- dplyr::left_join(
      child[[1]],
      parent[[1]],
      by = c("Extract" = "Id"),
      suffix = c(".Library", ".Extract")
    )
  }
  if (is.null(ret))
    stop("Error: Inputs need to be named by Pandora table names and/or missing intermediate table.")
  else
    return(ret)
}

You would call this for example as join_df_list(list(TAB_Site = site_tab), list(TAB_Individual = ind_tab)).

I'm happy to make a PR, but as I'm still learning R I'd like to make sure I'm not missing something trivial...

Make function to format table in form acceptable to Pandora upload functions

Brief:

  • This will require a new hash of Column name to whether it is mandatory (or not).
  • All mandatory columns will need to be converted to pandora column names, then wrapped in *colname*.
  • Non-mandatory columns will just need to be converted.
  • We will need to make sure that extra columns are not accidentally included (like numeric IDs etc.).
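A minimal sketch of the mandatory-column lookup and the wrapping step, assuming *colname* in the brief means literally wrapping the name in asterisks; the example column names and their mandatory status are made up, and the conversion to Pandora column names is omitted:

# Hypothetical lookup: column name -> is it mandatory for upload?
pandora_mandatory <- c(
  "Full_Sample_Id" = TRUE,
  "Tags"           = FALSE
)

wrap_mandatory <- function(cols, mandatory = pandora_mandatory) {
  unname(ifelse(mandatory[cols], paste0("*", cols, "*"), cols))
}

wrap_mandatory(c("Full_Sample_Id", "Tags"))
# [1] "*Full_Sample_Id*" "Tags"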

enforce_types needs to be more sophisticated

Unfortunately it seems we have multiple columns that share the same name but have different column types.

For example: individual.Archaeological_ID is of type character, but sample.Archaeological_ID is of type integer, with the integer corresponding to the numeric individual.Id.

Also, sample_type_group.Type_Group is character, while sample_type.Type_Group is numeric.

This causes problems in the sidora.cli view module when trying to make all columns 'human readable' as they are in the Pandora web page itself.

The solution from @nevrome is to have, instead of vectors, tables of the form table | column | type, and to modify columns based on this information.
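A sketch of that suggestion, not the package's current implementation: a table | column | type lookup plus a small applier (the types and table names below are illustrative):

column_types <- tibble::tribble(
  ~table,            ~column,             ~type,
  "TAB_Individual",  "Archaeological_ID", "character",
  "TAB_Sample",      "Archaeological_ID", "integer"
)

enforce_types_per_table <- function(df, tab, type_table = column_types) {
  spec <- type_table[type_table$table == tab, ]
  for (i in seq_len(nrow(spec))) {
    col <- spec$column[i]
    if (col %in% names(df)) {
      df[[col]] <- switch(
        spec$type[i],
        character = as.character(df[[col]]),
        integer   = as.integer(df[[col]]),
        numeric   = as.numeric(df[[col]]),
        logical   = as.logical(df[[col]]),
        df[[col]]
      )
    }
  }
  df
}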

Provided environment not working

I created a conda environment with the provided environment.yml and then installed sidora.core with the provided commands.

Trying to create a connection object then throws the error:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': namespace ‘rlang’ 0.4.11 is already loaded, but >= 1.0.5 is required

Using install.packages('rlang') to get the latest version of rlang, restarting the session and retrying the command fixed the issue.

Erroneous entries for library index IDs

I noticed cases of odd-looking entries for library.P7_Index_Id and library.P5_Index_Id when pulling these, together with the corresponding index sequences, for a list of library IDs using the following:

library(sidora.core)
library(dplyr)

con <- sidora.core::get_pandora_connection("/Users/taylor_hermes/.credentials")
res <- get_df("TAB_Library", con) %>% convert_all_ids_to_values(con)

samples <- c("ATG010.A0101", "CP2010.A0101", "IST003.A0101")

list_seq <- filter(res, res$library.Full_Library_Id %in% samples) %>% 
  select(library.External_Library_Id,
         library.Full_Library_Id,
         library.Index_Set,
         library.P7_Index_Id,
         library.P7_Index_Sequence,
         library.P5_Index_Id,
         library.P5_Index_Sequence) %>% 
  arrange(library.Full_Library_Id)

For library ATG010.A0101, I should get 1197 and 1133 for library.P7_Index_Id and library.P5_Index_Id, respectively. However, I get d703 and 1113. It seems that the index sequences are correct according to my spot checking with a larger list of library IDs. When I showed this behavior to @jfy133 a short time ago, he thinks there may be a table lookup issue.

Attached is the output from the code above.
test1.csv
