Pipeline failing with input and matrix sample names including "-" character on CUSTOM_MATRIXFILTER step, error gives all names of samples with "-" ID. When I check the work folder for the process, the names in each of the files have had "-" changed to ".", but I think this might be being compared to the original names and failing?
If I change names with sed in input and matrix file to remove "-", the pipeline continues without error.
nextflow run nf-core/differentialabundance \
-profile singularity \
-r dev \
--input diffabundance/inputs/samplesheet.csv \
--contrasts diffabundance/inputs/serCan_contrasts.csv \
--matrix diffabundance/inputs/data/rsem.merged.transcript_counts.tsv \
--gtf reference/serCan2020/serCan2020_no_genes.gtf \
-w diffabundance/work \
--features_id_col 'transcript_id' \
--outdir diffabundance/results \
--max_memory '160.GB' \
--deseq2_cores 8 \
-resume
# output:
executor > local (2)
[32/9bd716] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, cached: 1 ✔
[52/70c121] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1 ✔
[ed/3225f0] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, failed: 1 ✘
[- ] process > NFCORE_DIFFERENTIALABUNDANC... -
[- ] process > NFCORE_DIFFERENTIALABUNDANC... -
[- ] process > NFCORE_DIFFERENTIALABUNDANC... -
[- ] process > NFCORE_DIFFERENTIALABUNDANC... -
[- ] process > NFCORE_DIFFERENTIALABUNDANC... -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/differentialabundance] Pipeline completed with errors-
Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])'
Caused by:
Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])` terminated with an error exit status (1)
Command executed [/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/custom/matrixfilter/templates/matrixfilter.R]:
#!/usr/bin/env Rscript
# Filter rows based on the number of columns passing the abundance threshold. By
# default this will be any row with a value of 1 or more, which would be a
# permissive threshold for RNA-seq data.
#
# In RNA-seq studies it's often not enough to just remove genes not expressed in
# any sample. We also want to remove anything likely to be part of noise, or
# which has sufficiently low expression that differential analysis would not be
# useful. For that reason we might require a higher threshold than 1, and
# require that more than one sample passes.
#
# Often we want to know that a gene is expressed in a substantial enough number
# of sample that differential analysis worthwhile, so we may pick a threshold
# sample number related to group size. Note that we do not filter with an
# awareness of the groups themselves, since this adds bias towards discovery
# between those groups.
################################################
################################################
## Functions ##
################################################
################################################
#' Parse out options from a string without recourse to optparse
#'
#' @param x Long-form argument list like --opt1 val1 --opt2 val2
#' @param opt_defaults A lis with default argument values
#'
#' @return named list of options and values similar to optparse
parse_args <- function(x, opt_defaults){
args_list <- unlist(strsplit(x, ' ?--')[[1]])[-1]
args_vals <- unlist(lapply(args_list, function(y) strsplit(y, ' +')), recursive = FALSE)
# Ensure the option vectors are length 2 (key/ value) to catch empty ones
args_vals <- lapply(args_vals, function(z){ length(z) <- 2; z})
parsed_args <- structure(lapply(args_vals, function(x) x[2]), names = lapply(args_vals, function(x) x[1]))
parsed_args[! is.na(parsed_args)]
# Now apply CLI options to override defaults
opt_types <- lapply(opt_defaults, class)
for ( ao in names(parsed_args)){
if (! ao %in% names(opt_defaults)){
stop(paste("Invalid option:", ao))
}else{
# Preserve classes from defaults where possible
if (! is.null(opt_defaults[[ao]])){
parsed_args[[ao]] <- as(parsed_args[[ao]], opt_types[[ao]])
}
opt_defaults[[ao]] <- parsed_args[[ao]]
}
}
opt_defaults
}
#' Flexibly read CSV or TSV files
#'
#' @param file Input file
#' @param header Passed to read.delim()
#' @param row.names Passed to read.delim()
#' @param nrows Passed to read.delim()
#'
#' @return output Data frame
read_delim_flexible <- function(file, header = TRUE, row.names = NULL, nrows = -1 ){
ext <- tolower(tail(strsplit(basename(file), split = "\\.")[[1]], 1))
if (ext == "tsv" || ext == "txt") {
separator <- "\t"
} else if (ext == "csv") {
separator <- ","
} else {
stop(paste("Unknown separator for", ext))
}
read.delim(
file,
sep = separator,
header = header,
row.names = row.names
)
}
# Set up default options
opt <- list(
abundance_matrix_file = 'rsem.merged.transcript_counts.assay.tsv',
sample_file = 'samplesheet.sample_metadata.tsv',
minimum_abundance = 1,
minimum_samples = 1,
minimum_proportion = 0,
grouping_variable = NULL
)
opt <- parse_args('--minimum_samples 1 --minimum_abundance 1', opt)
abundance_matrix <- read_delim_flexible(opt$abundance_matrix_file, row.names = 1)
feature_id_name <- colnames( read_delim_flexible(opt$abundance_matrix_file, nrows = 1)[1])
# If a sample sheet was specified, validate the matrix against it
if (opt$sample_file != ''){
# Read the sample sheet and check against matrix
samplesheet <- read_delim_flexible(opt$sample_file, row.names = 1)
missing_samples <- setdiff(rownames(samplesheet), colnames(abundance_matrix))
if (length(missing_samples) > 0){
stop(
paste(
paste(missing_samples, collapse = ', '),
'not represented in supplied abundance matrix'
)
)
}else{
abundance_matrix <- abundance_matrix[,rownames(samplesheet)]
}
}else{
# If we're not using a sample sheet to select columns, then at least make
# sure the ones we have are numeric (some upstream things like the RNA-seq
# workflow have annotation colummns as well)
numeric_columns <- unlist(lapply(1:ncol(abundance_matrix), function(x) is.numeric(abundance_matrix[,x])))
abundance_matrix <- abundance_matrix[,numeric_columns]
}
# If we want to define n based on the levels of a grouping variable...
if ((opt$sample_file != '') && ( ! is.null(opt$grouping_variable))){
# Pick a minimum number of samples to pass threshold based on group size
if (! opt$grouping_variable %in% colnames(samplesheet)){
stop(paste(opt$grouping_variable, "not in supplied sample sheet"))
}else{
opt$minimum_samples <- min(table(samplesheet[[opt$grouping_variable]]))
if ( opt$minimum_proportion > 0){
opt$minimum_samples <- opt$minimum_samples * opt$minimum_proportion
}
}
}else if (opt$minimum_proportion > 0){
# Or if we want to define it based on a static proportion of the sample number
opt$minimum_samples <- ncol(abundance_matrix) * opt$minimum_proportion
}
# Generate a boolean vector specifying the features to retain
keep <- apply(abundance_matrix, 1, function(x){
sum(x > opt$minimum_abundance) >= opt$minimum_samples
})
# Write out the matrix retaining the specified rows and re-prepending the
# column with the feature identifiers
prefix = ifelse('study' == 'null', '', 'study')
write.table(
data.frame(rownames(abundance_matrix)[keep], abundance_matrix[keep,,drop = FALSE]),
file = paste0(
prefix,
'.filtered.tsv'
),
col.names = c(feature_id_name, colnames(abundance_matrix)),
row.names = FALSE,
sep = ' ',
quote = FALSE
)
################################################
################################################
## R SESSION INFO ##
################################################
################################################
sink("R_sessionInfo.log")
print(sessionInfo())
sink()
################################################
################################################
## VERSIONS FILE ##
################################################
################################################
r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
writeLines(
c(
'"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER":',
paste(' r-base:', r.version)
),
'versions.yml')
################################################
################################################
################################################
################################################
Command exit status:
1
Command output:
(empty)
Command error:
Error: ES1_48_Control-lipid_0, ES2_48_Control-lipid_14, ES3_48_Control-lipid_21, ES4_164_Control-protein_0, ES5_164_Control-protein_14, ES6_164_Control-protein_21, ES7_SEM19-1354_Control-protein_14, ES8_SEM19-1354_Control-protein_21, ES9_172_Control-protein_0, ES10_172_Control-protein_14, ES11_172_Control-protein_21, ES12_209_MG-lipid_0, ES13_209_MG-lipid_14, ES14_209_MG-lipid_21, ES15_601_Control-protein_14, ES16_601_Control-protein_21, ES17_615_Control-protein_0, ES18_615_Control-protein_14, ES19_615_Control-protein_21, ES20_618_Control-lipid_0, ES21_618_Control-lipid_14, ES22_618_Control-lipid_21, ES23_Blue-010_Control-lipid_0, ES24_Blue-010_Control-lipid_21, ES25_619_MG-lipid_0, ES26_619_MG-lipid_14, ES27_619_MG-lipid_21, ES103_625_Control-lipid_0, ES104_625_Control-lipid_14, ES105_625_Control-lipid_21, ES31_Blue-055_MG-protein_0, ES32_Blue-055_MG-protein_14, ES33_101-LS_MG-lipid_0, ES34_101-LS_MG-lipid_14, ES35_101-LS_MG-lipid_21, ES36_AR17-21_MG-lipid_0, ES37_AR17-21_MG-lipid_14,
Execution halted
Work dir:
diffabundance/work/ed/3225f0361d750970c0c0439ac3407c
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Nextflow version 22.10.1
Hardware HPC
Executor slurm
Container engine: Singularity
OS Linux
Version of nf-core/differentialabundance: dev