tleonardi / nanocompore
RNA modifications detection from Nanopore dRNA-Seq data
Home Page: https://nanocompore.rna.rocks
License: GNU General Public License v3.0
Call the results() function from __init__()
nanocompore/nanocompore/nanocompore_main.py
Line 132 in 183296c
This regex no longer matches anything.
When using a very short reference transcript (e.g. some present in the latest GENCODE GTF annotation) nanocompore fails with:
nanocompore.common.NanocomporeError: Not enough p-values for a context of order 2
nanocompore/nanocompore/TxComp.py
Line 307 in 9e5e75d
In my case some transcripts had only 8 p-values, fewer than the 9 required (context=2).
Is it possible to discard the reference with a warning, rather than terminating the program?
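A possible workaround, sketched with hypothetical function names (the 2*context+5 threshold is an assumption mirroring the "8 < 9" error above): catch the error per reference and skip that reference with a warning instead of aborting the whole run.

```python
import warnings

def combine_context_pvalues(pvalues, context=2):
    # Hypothetical stand-in for the combination step that raises
    # "Not enough p-values for a context of order N".
    if len(pvalues) < 2 * context + 5:  # assumed threshold (9 for context=2)
        raise ValueError(f"Not enough p-values for a context of order {context}")
    return pvalues

def process_reference(ref_id, pvalues, context=2):
    # Skip too-short references with a warning instead of terminating.
    try:
        return combine_context_pvalues(pvalues, context=context)
    except ValueError as e:
        warnings.warn(f"Skipping reference {ref_id}: {e}")
        return None
```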
I've just realised that this breaks SampCompDB.save_report(). I'm fixing it.
Originally posted by @tleonardi in #54 (comment)
Do you know the meaning of these qc-fail and could not calibrate fields?
Is it possible we are losing info in the qc-fail group for Red (treated) samples?
## Ctrl [post-run summary] total reads: 1075, unparseable: 0, qc fail: 91 , could not calibrate: 652 , no alignment: 9, bad fast5: 0
## Red [post-run summary] total reads: 3118, unparseable: 0, qc fail: 455 , could not calibrate: 1078 , no alignment: 24, bad fast5: 0
It is however possible that this is a problem specific to extreme datasets such as this one.
Need to load reference BED file and recompute coordinates. Could use bedparse to convert from transcriptome-space to genome-space
For p-value adjustment we should ignore positions with low coverage in order to increase power. Conceptually, we should implement independent filtering as in this paper
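A sketch of what independent filtering could look like here, using a minimal Benjamini-Hochberg implementation in place of statsmodels' multipletests (the min_coverage threshold and function names are assumptions):

```python
import numpy as np

def bh_adjust(pvals):
    # Minimal Benjamini-Hochberg adjustment (stand-in for statsmodels).
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value down.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def adjust_with_independent_filtering(pvals, coverage, min_coverage=30):
    # Adjust only positions passing the coverage filter; filtered positions
    # get NaN. The filter criterion is independent of the test statistic,
    # which is what lets it increase power without inflating the FDR.
    pvals = np.asarray(pvals, dtype=float)
    keep = np.asarray(coverage) >= min_coverage
    out = np.full(len(pvals), np.nan)
    out[keep] = bh_adjust(pvals[keep])
    return out
```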
It looks like nanocompore sometimes spawns more threads than it should. Starting it with nthreads=4 with the 7SK IVT data starts 16 threads.
We need to test nanocompore with artificial data.
Is the current approach of logistic regression the best?
What about a negative binomial on the cluster counts? For example, a Wald test on a logistic model: ~Cluster+Condition vs ~Cluster
nanocompore_results.txt has results unsorted by position. Please check whether it is behaving as expected.
nanocompore/nanocompore/TxComp.py
Line 253 in 183296c
In addition to BED files we should also produce tabular output that reports median/dwell at each position
Add option to plot bars to report the pvalue.
Rename the option 'gmm_method_anova=True' to 'force_logit=False'
Here is the traceback
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-12-d4623718066d> in <module>
9
10 # Run the analysis
---> 11 db = s ()
~/Programming/nanocompore/nanocompore/SampComp.py in __call__(self)
226
227 # Return database wrapper object
--> 228 return SampCompDB(db_fn=self.__output_db_fn, fasta_fn=self.__fasta_fn, bed_fn=self.__bed_fn)
229
230 #~~~~~~~~~~~~~~PRIVATE MULTIPROCESSING METHOD~~~~~~~~~~~~~~#
~/Programming/nanocompore/nanocompore/SampCompDB.py in __init__(self, db_fn, fasta_fn, bed_fn, run_type)
86 #Create results df with adjusted p-values
87 if self._pvalue_tests:
---> 88 self.results = self.__calculate_results(adjust=True)
89
90 def __repr__(self):
~/Programming/nanocompore/nanocompore/SampCompDB.py in __calculate_results(self, adjust)
153 if adjust:
154 for col in self._pvalue_tests:
--> 155 df[col] = self.__multipletests_filter_nan(df[col], method="fdr_bh")
156 return df
157
~/Programming/nanocompore/nanocompore/SampCompDB.py in __multipletests_filter_nan(pvalues, method)
202 """
203 pvalues_no_nan = [p for p in pvalues if not np.isnan(p)]
--> 204 corrected_p_values = multipletests(pvalues_no_nan, method=method)[1]
205 for i, p in enumerate(pvalues):
206 if np.isnan(p):
~/.virtualenvs/Python3.6/lib/python3.6/site-packages/statsmodels/stats/multitest.py in multipletests(pvals, alpha, method, is_sorted, returnsorted)
142
143 ntests = len(pvals)
--> 144 alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
145 alphacBonf = alphaf / float(ntests)
146 if method.lower() in ['b', 'bonf', 'bonferroni']:
ZeroDivisionError: float division by zero
The pvalues for neighbouring positions are highly correlated, thus violating the assumptions of Fisher's method for combining p-values.
Based on the benchmarking done in this paper, Hou's method seems the best solution.
Hey @a-slide ,
I was thinking of getting rid of this function because it looks like its functionality is already covered by __getitem__
Is there a reason why I should keep it instead?
Add GMM model plotting to plot_position()
Traceback (most recent call last):
File "/home/lp471/bin/nanocompore1.0.0b1/bin/nanocompore", line 11, in <module>
sys.exit(main())
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 71, in main
args.func(args)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/nanocompore_main.py", line 124, in sample_compare_main
sc_out.save_report(output_fn=f"{outpath}/nanocompore_results.txt")
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 268, in save_report
for record in self.results[['chr', 'pos', 'ref','strand', 'ref_kmer', 'lowCov']+methods].values.tolist():
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['chr' 'strand'] not in index"
This is a bug: tests is no longer defined.
nanocompore/nanocompore/SampCompDB.py
Line 147 in ae9c5dd
Commit 4c8a7f0 removed __multipletests_filter_nan() from SampCompDB.py. However, the function is still needed, because certain positions (e.g. at the 5' end) can have coverage lower than the threshold, resulting in NaN p-values.
If statsmodels.stats.multitest.multipletests is given an array containing at least one NaN, it returns an array of all NaN.
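A sketch of the NaN-stripping wrapper this describes (names are hypothetical; the correct argument stands in for whichever correction is chosen, e.g. fdr_bh). Guarding the empty case also avoids the ZeroDivisionError seen above when every p-value is NaN:

```python
import numpy as np

def multipletests_filter_nan(pvalues, correct):
    # Strip NaNs before correction and reinsert them afterwards, since
    # passing any NaN to multipletests yields an all-NaN result.
    pvalues = np.asarray(pvalues, dtype=float)
    mask = ~np.isnan(pvalues)
    out = np.full(len(pvalues), np.nan)
    if mask.any():  # skip the call entirely when no finite p-values remain
        out[mask] = correct(pvalues[mask])
    return out
```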
The positions and bases reported on the x axis by plot_position() are off by 2.
What do you think of having an input YAML file instead of the 4 options required at the moment?
sample1:
  rep1: path/to/sample1/rep1/data
  rep2: path/to/sample1/rep2/data
sample2:
  rep1: path/to/sample2/rep1/data
  rep2: path/to/sample2/rep2/data
It would be parsed directly as required by nanocompore
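In practice a YAML library (e.g. PyYAML's safe_load) would do the parsing; as a sketch, the two-level "condition -> replicate -> path" layout can even be handled with a few lines of stdlib-only code (function name is hypothetical):

```python
def parse_sample_sheet(text):
    # Parse the two-level layout into the nested dict nanocompore expects.
    # Indented lines are replicate entries under the current condition.
    samples = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(":")
        if line.startswith((" ", "\t")):
            samples[current][key.strip()] = value.strip()
        else:
            current = key.strip()
            samples[current] = {}
    return samples
```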
Permutation of all samples.
Introduce the ability to compare multiple samples at the same time.
When the coverage of two samples is very different, one of the two lines becomes almost flat (e.g. in 7SK). We should add an option to scale the coverage to the mean.
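A sketch of the scaling (function name is an assumption): rescale each sample's per-position coverage track so all samples share the same mean before plotting, which keeps the shapes comparable when depths differ.

```python
import numpy as np

def scale_coverage_to_mean(cov_by_sample):
    # Scale each sample's coverage so every sample has the same mean
    # (the grand mean across samples), for plotting purposes only.
    arrays = {s: np.asarray(c, dtype=float) for s, c in cov_by_sample.items()}
    target = np.mean([a.mean() for a in arrays.values()])
    return {s: a * (target / a.mean()) for s, a in arrays.items()}
```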
We could add the GMM order as an option.
But what about the logistic regression?
These headers are only defined when SampCompDB is run with a BED file
nanocompore/nanocompore/SampCompDB.py
Line 283 in ae9c5dd
Originally posted by @tleonardi in https://github.com/tleonardi/nanocompore/diffs/0
Originally posted by @a-slide in #9 (comment)
Another option, at least for the k-means strategy, is a multidimensional k-means combining all adjacent positions, instead of going position by position and then combining p-values.
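A sketch of the idea with a toy k-means, assuming per-position intensity and dwell-time vectors (all names hypothetical; a library clustering implementation would be used in practice). Each position contributes one feature vector built from its 2*context+1 surrounding positions, so the clustering sees the neighbourhood jointly:

```python
import numpy as np

def stack_context(intensity, dwell, context=2):
    # One feature vector per position, concatenating intensity and dwell
    # of the 2*context+1 surrounding positions.
    intensity = np.asarray(intensity, dtype=float)
    dwell = np.asarray(dwell, dtype=float)
    feats = []
    for i in range(context, len(intensity) - context):
        w = slice(i - context, i + context + 1)
        feats.append(np.concatenate([intensity[w], dwell[w]]))
    return np.array(feats)

def kmeans(X, k=2, iters=20):
    # Tiny k-means with farthest-point initialisation, standing in for a
    # library implementation.
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```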
Add CLI
I tried the package on a large dataset and my impression is that the k-means method requires a coverage of at least 75 in both samples to call anything significant. This will be an issue for the large majority of transcripts.
This should be further tested to determine the threshold.
Upon release it would be great to provide a Docker container with the current version of Nanopolish + NanopolishComp + nanocompore + all dependencies.
nanocompore sampcomp --file_list1 nanocompore/BC2_nanopolish_collapsed_reads.tsv --file_list2 nanocompore/BC3_nanopolish_collapsed_reads.tsv --label1 Red --label2 BH --fasta Reference/Mus_musculus.fa --bed Reference/Mus_musculus.bed --outpath /nanocompore/comp --min_coverage 25 --nthreads 10 --max_invalid_kmers_freq 1.0 --comparison_methods GMM,KS,MW --overwrite
Initialise SampComp and checks options
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition Red. This is not recomended. The statistics will be calculated with the logit method
warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/nanocompore/SampComp.py:89: NanocomporeWarning: Only 1 replicate found for condition BH. This is not recomended. The statistics will be calculated with the logit method
warn(NanocomporeWarning ("Only 1 replicate found for condition {}. This is not recomended. The statistics will be calculated with the logit method".format(cond_lab)))
Initialise Whitelist and checks options
Read eventalign index files
References found in index: 2
Filter out references with low coverage
References remaining after reference coverage filtering: 2
Start data processing
100%|████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 7.64s/ Processed References]
/home/lp471/bin/nanocompore1.0.0b1/lib/python3.6/site-packages/statsmodels/stats/multitest.py:325: RuntimeWarning: overflow encountered in true_divide
pvals_corrected_raw = pvals_sorted / ecdffactor
Using n_events for dwell time is not ideal. It would be better to use end minus start of the raw signal. To obtain those from NanopolishComp, we need to update nanopolish to the latest version and use the flag --signal-index
The log_dwell option of gmm_test() is not used anywhere and should be removed.
Try to get full coverage for all the core functions.
In particular, pay attention to potential off-by-one errors
Needs to be implemented, most likely in SampComp.__write_output()
The documentation is outdated
I am now testing the pipeline on polyA+ RNA runs, with read numbers closer to a real biological situation.
The problem is that for a 650k-read sample, the program now produces a ~50GB flat text file.
I was wondering how difficult it would be to make nanocompore read a gzipped (if not some other form of binary) file. Otherwise, my impression is that I/O will bottleneck every subsequent handling of the results files and overwhelm the storage very quickly.
Currently, a 3M-read run (which is biologically acceptable, but for sure far from saturation!) produces a total of over 600GB of files after analysis!
Thank you,
Luca
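As a sketch of what reading and writing compressed results could look like, Python's gzip module in text mode shrinks a flat TSV considerably while keeping it streamable (function names and the column layout are hypothetical):

```python
import csv
import gzip

def write_results_gz(path, rows, header):
    # Write the per-position results as a gzip-compressed TSV; text mode
    # ("wt") lets csv work on it like a regular file.
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)

def read_results_gz(path):
    # Stream the compressed TSV back without decompressing it on disk.
    with gzip.open(path, "rt", newline="") as fh:
        return list(csv.reader(fh, delimiter="\t"))
```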
Instead of using a p-value aggregation method, it would probably be better to use a multidimensional clustering method to take adjacent positions into account.
In addition, a Gaussian mixture model will certainly be better than the k-means method, as it does not assume that clusters are spherical.
This might be an option to refine the prediction, based on the ONT model mean values from https://github.com/nanoporetech/kmer_models
paired_test() in TxComp takes batches of data points and does the S1 vs S2 test for each batch, then takes the median of the p-values obtained.
We have to carefully think about this... I am not sure this is an effective way of controlling false positives...
What about getting rid of this and doing p-values correction instead?
You cannot have replicates with the same label in the 2 conditions. I assume that otherwise it uses only one condition, since it raises zero-variance errors.
Doesn't work
eventalign_fn_dict = {
'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
'Unmodified': {'rep1':'./sample_files/unmodified_rep_1.tsv', 'rep2':'./sample_files/unmodified_rep_2.tsv'}}
Works fine
eventalign_fn_dict = {
'Modified': {'rep1':'./sample_files/modified_rep_1.tsv', 'rep2':'./sample_files/modified_rep_2.tsv'},
'Unmodified': {'rep3':'./sample_files/unmodified_rep_1.tsv', 'rep4':'./sample_files/unmodified_rep_2.tsv'}}
The problem started after the TxComp refactor.
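Until the underlying bug is fixed, a sketch of an early validation step that rejects duplicated replicate labels across conditions, so the failure is explicit rather than a silent zero-variance error (function name hypothetical):

```python
def check_replicate_labels(eventalign_fn_dict):
    # Raise early when a replicate label is reused across conditions,
    # instead of letting data be silently dropped downstream.
    seen = {}
    for cond, reps in eventalign_fn_dict.items():
        for label in reps:
            if label in seen:
                raise ValueError(
                    f"Replicate label '{label}' used in both "
                    f"'{seen[label]}' and '{cond}'; labels must be unique")
            seen[label] = cond
```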