I have written a new fast medium/diffuse area chip-seq caller in Python. It uses a bin

MetaSeq on non-traditional files that have already been binned? about metaseq HOT 8 CLOSED

daler commented on September 16, 2024

MetaSeq on non-traditional files that have already been binned?

from metaseq.

Comments (8)

commented on September 16, 2024

It seems like there are two readers that it might be possible to use:

the deseq reader or
the dataframe reader.

Can any of these be used as is? If not, what do I need to change about my data to use them?

It seems that I was mistaken about the possibility of creating bed files; my binned files do not contain strand info. It seems like using one of the two readers above or creating my own reader is the most straightforward option.

Better example of my data than the one shown above:

Chromosome  Bin ../data/bed/Exp2_0_Polymerase_II.bed.gz ../data/bed/Exp1_0_Polymerase_II.bed.gz ../data/bed/Exp2_0h_Input.bed.gz    ../data/bed/Exp1_0h_Input.bed.gz
chr1    10000   1   3   4   3
chr1    10050   0   2   1   2
chr1    10100   1   4   4   2
chr1    10150   3   2   4   1
chr1    10200   5   2   2   3
chr1    10250   3   0   1   2
chr1    10300   3   1   2   1
chr1    10350   1   4   1   4
chr1    10450   2   4   2   2

It would not surprise me if this is very close to one of the formats metaseq already knows how to parse.

from metaseq.

commented on September 16, 2024

Come to think of it, metaseq would at least need some gene info for producing the heatmaps.

I'm guessing metaseq needs one array per input file per tss. So for example for Sample1 you'd have number_of_tsses arrays where each array is tss_region_length/bin_length long.

If I create a dict like:

{"tss1": [0, 2, 1, 8...], "tss2": [42, 50, 68, 74...], "tss3": ...}

per data-column (../data/bed/Exp2_0_Polymerase_II.bed.gz, ../data/bed/Exp1_0_Polymerase_II.bed.gz etc.)
what would I then need to make those nice tss-plots/heatmaps from example 1? I think I'd like to incorporate the creation of those graphs in my chip-seq caller with a --graph option (reminding people to cite your paper of course).

But if meta-seq has code that can work with the type of data shown in the two previous posts that would be even better. It would have to pick out the bins from tss-regions by itself, though.

from metaseq.

daler commented on September 16, 2024

Hi Endre -

Thanks for taking an interest.

First some background to orient you . . . the main goal of metaseq is to get genomics data into Python data structures (mostly numpy.array and pandas.DataFrame). In the Example 1 case, the input is a BED file of TSSes containing windows to look within and a BAM file. The genomic_signal object is initialized simply by pointing to some interval-based genomic data (BAM, BED, bigBed, bigWig; BAM in this example). Its local_coverage function is responsible for accepting an interval and returning a numpy.array representing the values for that interval (linearly interpolated into bins, if they were also specified). An array is built by querying the genomic data for each interval (here, a TSS provided in the BED file) in parallel. The specific way in which a genomic_signal object extracts data from the underlying file depends on the filetype_adapter.

So really the question is how to get your data format into a numpy.array. Or more generally, how to make this format useful for various downstream tools (not just limited to metaseq)?

If you have chrom, start, end fields as the first 3 columns, then you can use bedtools/pybedtools for interval manipulation and pysam with tabix for indexing. You can also create separate bigWig files for each sample (have to go bedGraph -> bigWig with cut -f 1,2,3,4 and cut -f 1,2,3,5 etc).

chr1    10000   10050   1   3   4   3
chr1    10050   10100   0   2   1   2
chr1    10100   10150   1   4   4   2

Option 1: create one bigWig file for each column, and use metaseq and other downstream analysis tools (like deepTools) as-is.

Option 2: A genomic_signal object could be initialized with this data file and bgzipped and tabix-indexed if needed. The object should also store the column you care about. Then, when its local_coverage is provided with an interval, it does the tabix query to extract all lines intersecting that interval. For each line, it extracts the column you care about.
It would be important to call local_coverage with the use_score=True argument to return the score rather than the presence/absence (as is the case when using BAM).

In benchmarks when creating arrays, bigWig is substantially faster when creating arrays from tabixed files. My hunch is that the initial step of bigWig creation will likely be offset by performance gains in downstream processing.

Specifically for a TSS plot, you would have to provide the TSSes to represent each row of the array. It's these intervals that need strand if you want to talk about upstream and downstream. You don't need strand in the underlying data file.

Side note 1: If you end up digging through the code, the perhaps-odd arrangement of the code is for parallel performance -- it was my first heavy use of multiprocessing in a module, and had to organize things this way in order to pickle functions across processes. I'd like to find time to simplify it if possible, but it's low priority at the moment.

Side note 2: I'd love to try out your peak caller (existing ones leave a lot to be desired, esp for broad domains). I can't find it in your repos though!

from metaseq.

commented on September 16, 2024

Thanks for the explanation.

Without reading your paper, just the docs, we got the impression that it was for creating those two very pretty graphs (there are two of us w/different projects looking into using metaseq for those TSS-graphs). I guess there is a lot more to metaseq then and that its main purpose is different/broader. Do I understand it correctly if it is for reading and using genomic data in Python, speedily and easily (with many additional conveniences)?

It seems like creating bigwig-files is the way to go then, especially since that format can then be used with other tools. Hadn't heard about deepseq, but it looks great.

The code looks good to me, but if you are redoing the parallelism, you might want to look into joblib. It has a parallel function with many conveniences, such as sane error messages and cleanup.

I'll send you an e-mail about the ChIP-seq peak caller.

from metaseq.

daler commented on September 16, 2024

Right, it's for getting genomic data into Python. It's a library of tools to build your own visualizations, rather than a CLI to produce canned plots. Enough people ask about a CLI that I may add one, but deepTools is probably better for that kind of task.

You can just scroll through the figures in the paper for more examples -- they are all unmodified, straight from metaseq. There are more in the supplemental PDF, too.

I agree, joblib is the way to go. Memory.cache is really helpful for working with lots of large numpy arrays on disk, too.

from metaseq.

commented on September 16, 2024

Thanks. Meta is a good name, this is a tad abstract.

from metaseq.

commented on September 16, 2024

Btw, there are a few quirks with 'Memory.cache' that in my view makes it less than ideal for inclusion in software published for others to use:

uses a lot of storage space
cannot delete just parts of the cache
impossible to inspect the cache deeply

You should consider telling users how much space the cache is currently taking and how to delete it. I regret not doing so in my software.

from metaseq.

daler commented on September 16, 2024

Hmm, I haven't hit the space problem yet, and I've only been using it for simple cases. Good to know, thanks.

from metaseq.

MetaSeq on non-traditional files that have already been binned? about metaseq HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent