<a class="user-mention notranslate" data-hovercard-type="organization" data-hovercard-

I am talking to <a class="user-mention notranslate" data-hovercard-type="user" data-ho

Represent a CpG with single row during import about bsseq HOT 11 CLOSED

demis001 commented on July 18, 2024

Represent a CpG with single row during import

from bsseq.

Comments (11)

kasperdanielhansen commented on July 18, 2024

This sounds like a strand issue. Many tools report 2 (meth, cov) counts for each CpG coming from the forward and reverse strand, and we usually want to pool those values. This usually works, so I am going to need some more detail on what exactly you're doing, to help you. Best, Kasper

…

On Fri, Oct 27, 2023 at 10:55 AM Dereje Jima ***@***.***> wrote: @hansenlab <https://github.com/hansenlab> I have been using bsmap -> MOABS for a long time and am recently interested in doing multivariate analysis in R. I did paired-end reads for 100 cases and control. Whenever I import the coverage file from Bismark (ADB_102_S45.cov.gz) to R using bsseq, I get twice the CpG in the human genome. i.e., a single CpG represented by two rows in the coverage file. I expected 29 million rows in the GRange object at most, but I got 58896438. 10469 and 10470 are the same CpG. Any idea how to fix this? granges(bs.black) GRanges object with 58896438 ranges and 0 metadata columns: usernames ranges strand [1] 1 10469 * [2] 1 10470 * [3] 1 10471 * [4] 1 10472 * [5] 1 10484 * ... ... ... ... [58896434] MT 16496 * [58896435] MT 16542 * [58896436] MT 16543 * [58896437] MT 16565 * [58896438] MT 16566 * @demis001 <https://github.com/demis001> — Reply to this email directly, view it on GitHub <#128>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABF2DH64VAIE4PKTLXV4FC3YBPDM3AVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE3DKNRZHA4DOOA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- Best, Kasper

from bsseq.

demis001 commented on July 18, 2024

Thanks, Kasper, for the prompt reply. I tried to represent consecutive rows with a single row in each replicate file using a simple Python script. Even though my 100 files have less than 29 million rows, the merge GRange object differed from what I anticipated (still higher than 29 million). Is there a better approach to go from Bismark to bsseq to resolve this issue?

"""
import sys

def merge_consecutive_lines(lines):
merged_lines = []

def process_and_append(start_line, end_line):
    max_col4 = max(float(line.split("\t")[3]) for line in lines[start_line:end_line + 1])
    sum_col5 = sum(int(line.split("\t")[4]) for line in lines[start_line:end_line + 1])
    sum_col6 = sum(int(line.split("\t")[5]) for line in lines[start_line:end_line + 1])
    merged_lines.append("\t".join([lines[start_line].split("\t")[0], lines[start_line].split("\t")[1], lines[end_line].split("\t")[2], str(max_col4), str(sum_col5), str(sum_col6)]))

start_line = 0

for i in range(1, len(lines)):
    current_line = lines[i].split("\t")
    prev_line = lines[i - 1].split("\t")
    
    if int(current_line[1]) != int(prev_line[2]) + 1:
        process_and_append(start_line, i - 1)
        start_line = i

process_and_append(start_line, len(lines) - 1)

return merged_lines

if name == "main":
#if len(sys.argv) != 2:
#print("Usage: python merge_lines.py input.txt")
#sys.exit(1)

#input_file = sys.argv[1]

#with open(input_file, 'r') as file:
    #lines = file.read().splitlines()
lines = sys.stdin.read().splitlines()

merged_lines = merge_consecutive_lines(lines)

for line in merged_lines:
    # Print columns 1, 2, 4, 5, and 6
    columns = line.split("\t")
    print("\t".join([columns[0], columns[1], columns[3], columns[4], columns[5]]))

"""

Best,
Dereje

from bsseq.

demis001 commented on July 18, 2024

I am sorry! Are you saying the way data is loaded into bsseq is correct? I used the DSS package that was built based on bsseq and got erroneous results. and went back to using bsseq and saw how the merging is handled in bsseq.

from bsseq.

kasperdanielhansen commented on July 18, 2024

So you're processing the bismark files in python? I would guess that's the issue. We have a function (and a manpage) which reads in the raw bismark output. I suggest you do that. Best, Kasper

…

On Fri, Oct 27, 2023 at 12:27 PM Dereje Jima ***@***.***> wrote: I am sorry! Are you saying the way data is loaded into bsseq is correct? I used the DSS package that was built based on bsseq and got erroneous results. and went back to using bsseq and saw how the merging is handled in bsseq. — Reply to this email directly, view it on GitHub <#128 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABF2DH4JMDCACFD4MWQKATDYBPOGPAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGE4TANZXGA> . You are receiving this because you commented.Message ID: ***@***.***>

-- Best, Kasper

from bsseq.

demis001 commented on July 18, 2024

No, I didn't! The manual of the DSS package suggested extracting four columns of the input file from the *.cov.gz" Bistmark output file. I then processed that using their guideline. In the end, I found the number of DMC reporter was off than what exist in the human genome. I then went back again to fix the input and redid the procedure after merging the reads from both strand using the Python script, again got the same issue after the complete DMC and DMR analysis.

I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now.

Dereje

from bsseq.

kasperdanielhansen commented on July 18, 2024

Yeah, so see if ?read.bismark helps you. It should contain clear statements on how to get from Bismark to bsseq, and if it doesn't, I am happy to amend the instructions. But from my POV you should never have to edit the raw output files. However, bismark can output stuff in a few different formats. Best, Kasper

…

On Fri, Oct 27, 2023 at 1:11 PM Dereje Jima ***@***.***> wrote: No, I didn't! The manual of the DSS package suggested extracting four columns of the input file from the *.cov.gz" Bistmark output file. I then processed that using their guideline. In the end, I found the number of DMC reporter was off than what exist in the human genome. I then went back again to fix the input and redid the procedure after merging the reads from both strand using the Python script, again got the same issue after the complete DMC and DMR analysis. I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now. Dereje — Reply to this email directly, view it on GitHub <#128 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABF2DH4V3Z5AMN7THCTQAUDYBPTLJAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGI2DMMJSG4> . You are receiving this because you commented.Message ID: ***@***.***>

-- Best, Kasper

from bsseq.

demis001 commented on July 18, 2024

I am talking to @FelixKrueger, a developer of Bismark. He suggested running the "coverage2cytosine --merge_CpG" option. I don't know if this is the intended input for bsseq downstream analysis. I am running it right now to see how the coverage looks like. I will look into that.

Dereje

from bsseq.

demis001 commented on July 18, 2024

It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads.

from bsseq.

kasperdanielhansen commented on July 18, 2024

Sure, I mean, you could start by following my advice and read the man page and see if that clarifies things for you. I do see the User's Guide section on reading data is in dire need of updating. Best, Kasper

…

On Fri, Oct 27, 2023 at 2:07 PM Dereje Jima ***@***.***> wrote: It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads. — Reply to this email directly, view it on GitHub <#128 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABF2DHYIAHGHR4SQCF5O27TYBPZ6FAVCNFSM6AAAAAA6S6PMVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBTGMYTCNBSGA> . You are receiving this because you commented.Message ID: ***@***.***>

-- Best, Kasper

from bsseq.

demis001 commented on July 18, 2024

Yes, I looked into the read.bismark(). files: It says, The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in .gz or .bz2 will be automatically decompressed to [tempfile](http://127.0.0.1:14081/help/library/bsseq/help/tempfile)(). We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details. That is what I read in bsseq. I provided the *.cov.gz file from Bismark methylation extractor. This represents each "C" in the genome not the aggregate CpG. from both strand.

from bsseq.

demis001 commented on July 18, 2024

In the MOABS pipeline, I always get a single row for C and G in the single CpG. In the differentially methylated "C" (DMC) file, I always get one row in the DMC file. In your case, I am getting a row for "C" and a row for "G" after processing. Here is a sample output from MOABS mcall:

#chrom	start	end	totalC_0	nominalRatio_0	ratioCI_0	totalC_1	nominalRatio_1	ratioCI_1	nominalDif_1-0	credibleDif_1-0	difCI_1-0	p_sim_1_v_0	p_fet_1_v_0	class
1	10468	10470	24	0.583	0.398,0.769	20	0.75	0.562,0.938	0.167	0	-0.0722,0.365	0.0309	0.342	hypo
1	10470	10472	25	0.68	0.504,0.856	23	0.739	0.562,0.916	0.0591	0	-0.154,0.258	0.0576	0.756	hypo
1	10483	10485	27	0.815	0.66,0.969	32	0.906	0.781,1	0.0914	0	-0.0583,0.244	0.0556	0.45	hypo

from bsseq.

Represent a CpG with single row during import about bsseq HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent