Git Product home page Git Product logo

Comments (11)

kasperdanielhansen avatar kasperdanielhansen commented on July 18, 2024

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

Thanks, Kasper, for the prompt reply. I tried to represent consecutive rows with a single row in each replicate file using a simple Python script. Even though my 100 files have less than 29 million rows, the merge GRange object differed from what I anticipated (still higher than 29 million). Is there a better approach to go from Bismark to bsseq to resolve this issue?

"""
import sys

def merge_consecutive_lines(lines):
merged_lines = []

def process_and_append(start_line, end_line):
    max_col4 = max(float(line.split("\t")[3]) for line in lines[start_line:end_line + 1])
    sum_col5 = sum(int(line.split("\t")[4]) for line in lines[start_line:end_line + 1])
    sum_col6 = sum(int(line.split("\t")[5]) for line in lines[start_line:end_line + 1])
    merged_lines.append("\t".join([lines[start_line].split("\t")[0], lines[start_line].split("\t")[1], lines[end_line].split("\t")[2], str(max_col4), str(sum_col5), str(sum_col6)]))

start_line = 0

for i in range(1, len(lines)):
    current_line = lines[i].split("\t")
    prev_line = lines[i - 1].split("\t")
    
    if int(current_line[1]) != int(prev_line[2]) + 1:
        process_and_append(start_line, i - 1)
        start_line = i

process_and_append(start_line, len(lines) - 1)

return merged_lines

if name == "main":
#if len(sys.argv) != 2:
#print("Usage: python merge_lines.py input.txt")
#sys.exit(1)

#input_file = sys.argv[1]

#with open(input_file, 'r') as file:
    #lines = file.read().splitlines()
lines = sys.stdin.read().splitlines()

merged_lines = merge_consecutive_lines(lines)

for line in merged_lines:
    # Print columns 1, 2, 4, 5, and 6
    columns = line.split("\t")
    print("\t".join([columns[0], columns[1], columns[3], columns[4], columns[5]]))

"""

Best,
Dereje

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

I am sorry! Are you saying the way data is loaded into bsseq is correct? I used the DSS package that was built based on bsseq and got erroneous results. and went back to using bsseq and saw how the merging is handled in bsseq.

from bsseq.

kasperdanielhansen avatar kasperdanielhansen commented on July 18, 2024

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

No, I didn't! The manual of the DSS package suggested extracting four columns of the input file from the *.cov.gz" Bistmark output file. I then processed that using their guideline. In the end, I found the number of DMC reporter was off than what exist in the human genome. I then went back again to fix the input and redid the procedure after merging the reads from both strand using the Python script, again got the same issue after the complete DMC and DMR analysis.

I then decided to use bsseq without DSS to see how the bsseq handles the import and Bismark mapping. That is where I am right now.

Dereje

from bsseq.

kasperdanielhansen avatar kasperdanielhansen commented on July 18, 2024

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

I am talking to @FelixKrueger, a developer of Bismark. He suggested running the "coverage2cytosine --merge_CpG" option. I don't know if this is the intended input for bsseq downstream analysis. I am running it right now to see how the coverage looks like. I will look into that.

Dereje

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

It would be nice if you wrote a complete vignette including Bismark for both single-end and paired-end reads.

from bsseq.

kasperdanielhansen avatar kasperdanielhansen commented on July 18, 2024

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

Yes, I looked into the read.bismark(). files: It says, The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in .gz or .bz2 will be automatically decompressed to [tempfile](http://127.0.0.1:14081/help/library/bsseq/help/tempfile)(). We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details. That is what I read in bsseq. I provided the *.cov.gz file from Bismark methylation extractor. This represents each "C" in the genome not the aggregate CpG. from both strand.

from bsseq.

demis001 avatar demis001 commented on July 18, 2024

In the MOABS pipeline, I always get a single row for C and G in the single CpG. In the differentially methylated "C" (DMC) file, I always get one row in the DMC file. In your case, I am getting a row for "C" and a row for "G" after processing. Here is a sample output from MOABS mcall:

#chrom	start	end	totalC_0	nominalRatio_0	ratioCI_0	totalC_1	nominalRatio_1	ratioCI_1	nominalDif_1-0	credibleDif_1-0	difCI_1-0	p_sim_1_v_0	p_fet_1_v_0	class
1	10468	10470	24	0.583	0.398,0.769	20	0.75	0.562,0.938	0.167	0	-0.0722,0.365	0.0309	0.342	hypo
1	10470	10472	25	0.68	0.504,0.856	23	0.739	0.562,0.916	0.0591	0	-0.154,0.258	0.0576	0.756	hypo
1	10483	10485	27	0.815	0.66,0.969	32	0.906	0.781,1	0.0914	0	-0.0583,0.244	0.0556	0.45	hypo

from bsseq.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.