tleonardi / nanocompore
RNA modifications detection from Nanopore dRNA-Seq data
Home Page: https://nanocompore.rna.rocks
License: GNU General Public License v3.0
Call the results() function from __init__()
nanocompore/nanocompore/nanocompore_main.py
Line 132 in 183296c
This regex no longer matches anything.
When using a very short reference transcript (e.g. some present in the latest GENCODE GTF annotation) nanocompore fails with:
nanocompore.common.NanocomporeError: Not enough p-values for a context of order 2
nanocompore/nanocompore/TxComp.py
Line 307 in 9e5e75d
In my case some transcripts had only 8 p-values, fewer than the 9 required (context=2).
Is it possible to discard the reference with a warning, rather than terminating the program?
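A possible workaround, sketched with hypothetical function names (the 2*context+5 threshold is an assumption mirroring the "8 < 9" error above): catch the error per reference and skip that reference with a warning instead of aborting the whole run.

```python
import warnings

def combine_context_pvalues(pvalues, context=2):
    # Hypothetical stand-in for the combination step that raises
    # "Not enough p-values for a context of order N".
    if len(pvalues) < 2 * context + 5:  # assumed threshold (9 for context=2)
        raise ValueError(f"Not enough p-values for a context of order {context}")
    return pvalues

def process_reference(ref_id, pvalues, context=2):
    # Skip too-short references with a warning instead of terminating.
    try:
        return combine_context_pvalues(pvalues, context=context)
    except ValueError as e:
        warnings.warn(f"Skipping reference {ref_id}: {e}")
        return None
```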
I've just realised that this breaks SampCompDB.save_report(). I'm fixing it.
Originally posted by @tleonardi in #54 (comment)
Do you know the meaning of these qc-fail and could not calibrate fields?
Is it possible we are losing info in the qc-fail group for Red (treated) samples?
## Ctrl [post-run summary] total reads: 1075, unparseable: 0, qc fail: 91 , could not calibrate: 652 , no alignment: 9, bad fast5: 0
## Red [post-run summary] total reads: 3118, unparseable: 0, qc fail: 455 , could not calibrate: 1078 , no alignment: 24, bad fast5: 0
It is however possible that this is a problem specific to extreme datasets such as this one.
Need to load reference BED file and recompute coordinates. Could use bedparse to convert from transcriptome-space to genome-space
For p-value adjustment we should ignore positions with low coverage in order to increase power. Conceptually, we should implement independent filtering as in this paper
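A sketch of what independent filtering could look like here, using a minimal Benjamini-Hochberg implementation in place of statsmodels' multipletests (the min_coverage threshold and function names are assumptions):

```python
import numpy as np

def bh_adjust(pvals):
    # Minimal Benjamini-Hochberg adjustment (stand-in for statsmodels).
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value down.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def adjust_with_independent_filtering(pvals, coverage, min_coverage=30):
    # Adjust only positions passing the coverage filter; filtered positions
    # get NaN. The filter criterion is independent of the test statistic,
    # which is what lets it increase power without inflating the FDR.
    pvals = np.asarray(pvals, dtype=float)
    keep = np.asarray(coverage) >= min_coverage
    out = np.full(len(pvals), np.nan)
    out[keep] = bh_adjust(pvals[keep])
    return out
```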
It looks like nanocompore sometimes spawns more threads than it should. Starting it with nthreads=4 with the 7SK IVT data starts 16 threads.
We need to test nanocompore with artificial data.
Is the current approach of logistic regression the best?
What about a negative binomial on the cluster counts? For example, a Wald test on a logistic model: ~Cluster+Condition vs ~Cluster
nanocompore_results.txt has results unsorted by position. Please check whether it is behaving as expected.
nanocompore/nanocompore/TxComp.py
Line 253 in 183296c
In addition to BED files we should also produce tabular output that reports median/dwell at each position
Add option to plot bars to report the pvalue.
Rename the option 'gmm_method_anova=True' to 'force_logit=False'
Here is the traceback
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-12-d4623718066d> in <module>
9
10 # Run the analysis
---> 11 db = s ()
~/Programming/nanocompore/nanocompore/SampComp.py in __call__(self)
226
227 # Return database wrapper object
--> 228 return SampCompDB(db_fn=self.__output_db_fn, fasta_fn=self.__fasta_fn, bed_fn=self.__bed_fn)
229
230 #~~~~~~~~~~~~~~PRIVATE MULTIPROCESSING METHOD~~~~~~~~~~~~~~#
~/Programming/nanocompore/nanocompore/SampCompDB.py in __init__(self, db_fn, fasta_fn, bed_fn, run_type)
86 #Create results df with adjusted p-values
87 if self._pvalue_tests:
---> 88 self.results = self.__calculate_results(adjust=True)
89
90 def __repr__(self):
~/Programming/nanocompore/nanocompore/SampCompDB.py in __calculate_results(self, adjust)
153 if adjust:
154 for col in self._pvalue_tests:
--> 155 df[col] = self.__multipletests_filter_nan(df[col], method="fdr_bh")
156 return df
157
~/Programming/nanocompore/nanocompore/SampCompDB.py in __multipletests_filter_nan(pvalues, method)
202 """
203 pvalues_no_nan = [p for p in pvalues if not np.isnan(p)]
--> 204 corrected_p_values = multipletests(pvalues_no_nan, method=method)[1]
205 for i, p in enumerate(pvalues):
206 if np.isnan(p):
~/.virtualenvs/Python3.6/lib/python3.6/site-packages/statsmodels/stats/multitest.py in multipletests(pvals, alpha, method, is_sorted, returnsorted)
142
143 ntests = len(pvals)
--> 144 alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
145 alphacBonf = alphaf / float(ntests)
146 if method.lower() in ['b', 'bonf', 'bonferroni']:
ZeroDivisionError: float division by zero
The pvalues for neighbouring positions are highly correlated, thus violating the assumptions of Fisher's method for combining p-values.
Based on the benchmarking done in this paper, Hou's method seems the best solution.
Hey @a-slide ,
I was thinking of getting rid of this function because it looks like its functionality is already covered by __getitem__
Is there a reason why I should keep it instead?
Add GMM model plotting to plot_position()
Traceback (most recent call last):
File "/home/lp471/bin/nanocompore1.0.0b1/bin/nanocompore", line 11, in <module>
sys.exit(main())
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 71, in main
args.func(args)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 124, in sample_compare_main
sc_out.save_report(output_fn=f"{outpath}/nanocompore_results.txt")
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 268, in save_report
for record in self.results[['chr', 'pos', 'ref','strand', 'ref_kmer', 'lowCov']+methods].values.tolist():
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['chr' 'strand'] not in index"
This is a bug: tests is no longer defined.
nanocompore/nanocompore/SampCompDB.py
Line 147 in ae9c5dd
Commit 4c8a7f0 removed __multipletests_filter_nan() from SampCompDB.py. However, the function is still needed, because certain positions (e.g. at the 5' end) can have coverage lower than the threshold, resulting in NaN p-values.
If statsmodels.stats.multitest.multipletests is given an array containing at least one NaN, it returns an array of all NaN.
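A sketch of the NaN-stripping wrapper this describes (names are hypothetical; the correct argument stands in for whichever correction is chosen, e.g. fdr_bh). Guarding the empty case also avoids the ZeroDivisionError seen above when every p-value is NaN:

```python
import numpy as np

def multipletests_filter_nan(pvalues, correct):
    # Strip NaNs before correction and reinsert them afterwards, since
    # passing any NaN to multipletests yields an all-NaN result.
    pvalues = np.asarray(pvalues, dtype=float)
    mask = ~np.isnan(pvalues)
    out = np.full(len(pvalues), np.nan)
    if mask.any():  # skip the call entirely when no finite p-values remain
        out[mask] = correct(pvalues[mask])
    return out
```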
The positions and bases reported on the x axis by plot_position() are off by 2.
What do you think of having an input YAML file instead of the 4 options required at the moment?
sample1:
  rep1: path/to/sample1/rep1/data
  rep2: path/to/sample1/rep2/data
sample2:
  rep1: path/to/sample2/rep1/data
  rep2: path/to/sample2/rep2/data
It would be parsed directly as required by nanocompore
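In practice a YAML library (e.g. PyYAML's safe_load) would do the parsing; as a sketch, the two-level "condition -> replicate -> path" layout can even be handled with a few lines of stdlib-only code (function name is hypothetical):

```python
def parse_sample_sheet(text):
    # Parse the two-level layout into the nested dict nanocompore expects.
    # Indented lines are replicate entries under the current condition.
    samples = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(":")
        if line.startswith((" ", "\t")):
            samples[current][key.strip()] = value.strip()
        else:
            current = key.strip()
            samples[current] = {}
    return samples
```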
Permutation of all samples.
Introduce the ability to compare multiple samples at the same time.
When the coverage of two samples is very different, one of the two lines becomes almost flat (e.g. in 7SK). We should add an option to scale the coverage to the mean.
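A sketch of the scaling (function name is an assumption): rescale each sample's per-position coverage track so all samples share the same mean before plotting, which keeps the shapes comparable when depths differ.

```python
import numpy as np

def scale_coverage_to_mean(cov_by_sample):
    # Scale each sample's coverage so every sample has the same mean
    # (the grand mean across samples), for plotting purposes only.
    arrays = {s: np.asarray(c, dtype=float) for s, c in cov_by_sample.items()}
    target = np.mean([a.mean() for a in arrays.values()])
    return {s: a * (target / a.mean()) for s, a in arrays.items()}
```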
We could add the GMM order as an option.
But what about the logistic regression?
These headers are only defined when SampCompDB is run with a BED file
nanocompore/nanocompore/SampCompDB.py
Line 283 in ae9c5dd
Originally posted by @tleonardi in https://github.com/tleonardi/nanocompore/diffs/0
Originally posted by @a-slide in #9 (comment)
Another option, at least for the k-means strategy, is a multidimensional k-means combining all adjacent positions, instead of going position by position and then combining p-values.
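A sketch of the idea with a toy k-means, assuming per-position intensity and dwell-time vectors (all names hypothetical; a library clustering implementation would be used in practice). Each position contributes one feature vector built from its 2*context+1 surrounding positions, so the clustering sees the neighbourhood jointly:

```python
import numpy as np

def stack_context(intensity, dwell, context=2):
    # One feature vector per position, concatenating intensity and dwell
    # of the 2*context+1 surrounding positions.
    intensity = np.asarray(intensity, dtype=float)
    dwell = np.asarray(dwell, dtype=float)
    feats = []
    for i in range(context, len(intensity) - context):
        w = slice(i - context, i + context + 1)
        feats.append(np.concatenate([intensity[w], dwell[w]]))
    return np.array(feats)

def kmeans(X, k=2, iters=20):
    # Tiny k-means with farthest-point initialisation, standing in for a
    # library implementation.
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```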
Add CLI
I tried the package on a large dataset and my impression is that the k-means method requires a coverage of at least 75 in both samples to call anything significant. This will be an issue for the large majority of transcripts.
This should be further tested to determine the threshold.
Upon release it would be great to provide a Docker container with the current version of Nanopolish + NanopolishComp + nanocompore + all dependencies.
nanocompore sampcomp --file_list1 nanocompore/BC2_nanopolish_collapsed_reads.tsv --file_list2 nanocompore/BC3_nanopolish_collapsed_reads.tsv --label1 Red --label2 BH --fasta Reference/Mus_musculus.fa --bed Reference/Mus_musculus.bed --outpath /nanocompore/comp --min_coverage 25 --nthreads 10 --max_invalid_kmers_freq 1.0 --comparison_methods GMM,KS,MW --overwrite
Initialise SampComp and checks options
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition Red. This is not recomended. The statistics will be calculated with the logit method
warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition BH. This is not recomended. The statistics will be calculated with the logit method
warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
Initialise Whitelist and checks options
Read eventalign index files
References found in index: 2
Filter out references with low coverage
References remaining after reference coverage filtering: 2
Start data processing
100%|████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 7.64s/ Processed References]
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/statsmodels/stats/multitest.py:325: RuntimeWarning: overflow encountered in true_divide
pvals_corrected_raw = pvals_sorted / ecdffactor
Using n_events for dwell time is not ideal. It would be better to use end minus start of the raw signal. To obtain those from NanopolishComp, we need to update nanopolish to the latest version and use the flag --signal-index
The log_dwell option of gmm_test() is not used anywhere and should be removed.
Try to get full coverage for all the core functions.
In particular, pay attention to potential off-by-one errors
Needs to be implemented, most likely in SampComp.__write_output()
The documentation is outdated
I am now testing the pipeline on polyA+ RNA runs, with read numbers closer to a real biological situation.
The problem is that for a 650k-read sample, the program now produces a ~50GB flat text file.
I was wondering how difficult it would be to make nanocompore read a gzipped (if not some other form of binary) file. Otherwise, my impression is that I/O will bottleneck every subsequent handling of the results files and overwhelm the storage very quickly.
Currently, a 3M-read run (which is biologically acceptable, but for sure far from saturation!) produces a total of over 600GB of files after analysis!
Thank you,
Luca
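As a sketch of what reading and writing compressed results could look like, Python's gzip module in text mode shrinks a flat TSV considerably while keeping it streamable (function names and the column layout are hypothetical):

```python
import csv
import gzip

def write_results_gz(path, rows, header):
    # Write the per-position results as a gzip-compressed TSV; text mode
    # ("wt") lets csv work on it like a regular file.
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)

def read_results_gz(path):
    # Stream the compressed TSV back without decompressing it on disk.
    with gzip.open(path, "rt", newline="") as fh:
        return list(csv.reader(fh, delimiter="\t"))
```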
Instead of using a p-value aggregation method, it would probably be better to use a multidimensional clustering method to take adjacent positions into account.
In addition, a Gaussian mixture model will certainly be better than the k-means method, as it does not assume that clusters are spherical.
This might be an option to refine the prediction, based on the ONT model mean values from https://github.com/nanoporetech/kmer_models
paired_test() in TxComp takes batches of data points and does the S1 vs S2 test for each batch, then takes the median of the p-values obtained.
We have to carefully think about this... I am not sure this is an effective way of controlling false positives...
What about getting rid of this and doing p-values correction instead?
You cannot have replicates with the same label in the 2 conditions. I assume that otherwise it uses only one condition, since it raises zero-variance errors.
Doesn't work
eventalign_fn_dict = {
'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
'Unmodified': {'rep1':'./sample_files/unmodified_rep_1.tsv', 'rep2':'./sample_files/unmodified_rep_2.tsv'}}
Works fine
eventalign_fn_dict = {
'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
'Unmodified': {'rep3':'./sample_files/unmodified_rep_1.tsv', 'rep4':'./sample_files/unmodified_rep_2.tsv'}}
The problem started after the TxComp refactor.
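Until the underlying bug is fixed, a sketch of an early validation step that rejects duplicated replicate labels across conditions, so the failure is explicit rather than a silent zero-variance error (function name hypothetical):

```python
def check_replicate_labels(eventalign_fn_dict):
    # Raise early when a replicate label is reused across conditions,
    # instead of letting data be silently dropped downstream.
    seen = {}
    for cond, reps in eventalign_fn_dict.items():
        for label in reps:
            if label in seen:
                raise ValueError(
                    f"Replicate label '{label}' used in both "
                    f"'{seen[label]}' and '{cond}'; labels must be unique")
            seen[label] = cond
```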