
methylprep's Introduction

methylprep is a python package for processing Illumina methylation array data. View on ReadTheDocs.


Methylprep is part of the methylsuite

methylprep is part of the methylsuite of python packages that provide functions to process and analyze DNA methylation data from Illumina's Infinium arrays (27k, 450k, and EPIC, as well as mouse arrays). The methylprep package contains functions for processing raw data files from arrays and downloading/processing public data sets from GEO (the NIH Gene Expression Omnibus database repository), or from ArrayExpress. It contains both a command line interface (CLI) for processing data from local files, and a set of functions for building a custom pipeline in a jupyter notebook or python scripting environment. The aim is to offer a standard process, with flexibility for those who want it.

methylprep data processing has also been tested and benchmarked to match the outputs of two popular R packages: sesame (v1.10.4) and minfi (v1.38).

Methylsuite package components

You should install all three components, as they work together. The parts include:

  • methylprep: (this package) for processing idat files or downloading GEO datasets from NIH. Processing steps include

    • infer type-I channel switch
    • NOOB (normal-exponential convolution on out-of-band probe data)
    • poobah (p-value with out-of-band array hybridization, for filtering low signal-to-noise probes)
    • qualityMask (to exclude historically less reliable probes)
    • nonlinear dye bias correction (AKA signal quantile normalization between red/green channels across a sample)
    • calculate beta-value, m-value, or copy-number matrix
    • large batch memory management, by splitting it up into smaller batches during processing
  • methylcheck: for quality control (QC) and analysis, including

    • functions for filtering out unreliable probes, based on the published literature
      • Note that methylprep process will exclude a set of unreliable probes by default. You can disable that using the --no_quality_mask option from CLI.
    • sample outlier detection
    • array level QC plots, based on Genome Studio functions
    • a python clone of Illumina's Bead Array Controls Reporter software (QC)
    • data visualization functions based on seaborn and matplotlib graphic libraries.
    • predict sex of human samples from probes
    • interactive method for assigning samples to groups, based on array data, in a Jupyter notebook
  • methylize provides more analysis and interpretation functions

    • differentially methylated probe statistics (between treatment and control samples)
    • volcano plots (which probes are the most different?)
    • manhattan plots (where in genome are the differences?)
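The beta-value and m-value outputs listed above follow the standard definitions. As a rough sketch (the 100-count offset in the beta denominator is the common convention and an assumption here; methylprep's exact implementation may differ):

```python
import math

def beta_value(meth, unmeth, offset=100):
    """Fraction methylated: meth / (meth + unmeth + offset).
    The offset stabilizes beta values for low-intensity probes."""
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth):
    """log2 ratio of methylated to unmethylated signal intensity."""
    return math.log2(meth / unmeth)

print(round(beta_value(3000, 1000), 3))  # 0.732
print(round(m_value(3000, 1000), 3))     # 1.585
```

Beta values are bounded in [0, 1) and easier to interpret biologically; m-values are unbounded and better behaved statistically, which is why both are offered.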

Installation

methylprep maintains configuration files for your Python package manager of choice: pipenv or pip. Conda install is coming soon.

>>> pip install methylprep

or if you want to install all three packages at once:

>>> pip install methylsuite

Tutorials and Guides

If you're new to DNA methylation analysis, we recommend reading through this introduction to get the background knowledge needed to use methylprep effectively. Otherwise, you're ready to start using methylprep.

methylprep's People

Contributors

adamritter, dependabot[bot], marcmaxson, markt, notmaurox


methylprep's Issues

How to control filtering in command line

I am trying to run methylprep process from the command line without poobah filtering. Documentation says:

--poobah        By default, any beta-values or m-values output will
                        contain all probes. If True, those probes that fail
                        the p-value signal:noise detection are replaced with
                        NaNs in dataframes in beta_values and m_value output.

It also states that --poobah is a Boolean variable with default True. This suggests that if I would like a beta_values output without poobah filtering I should add the option --poobah False to the methylprep process command. However, when I do this I get the error:

[Error]:
unrecognized arguments: False

How can I switch poobah filtering off in the command line?

Can I also confirm that the CSV file named something like "200772300016_R01C01_processed.csv" in a folder called "200772300016" contains all beta-values etc without any filtering?
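The "unrecognized arguments: False" error is typical of a boolean flag defined with argparse's `action="store_true"`: the flag takes no value, so a trailing `False` token is left unparsed. A minimal reproduction (the flag definition here is an assumption about methylprep's parser, not its actual code; check `methylprep process --help` for whether a dedicated disable flag exists, as it does for `--no_quality_mask`):

```python
import argparse

# Assumed parser shape: a store_true flag takes no value at all.
parser = argparse.ArgumentParser()
parser.add_argument("--poobah", action="store_true", default=True)

# "--poobah False" leaves "False" as an unparsed token, which
# parse_args() would reject with "unrecognized arguments: False":
args, extra = parser.parse_known_args(["--poobah", "False"])
print(args.poobah, extra)  # True ['False']
```

In other words, with a store_true flag there is no way to pass `False` on the command line; disabling the behavior requires a separate `--no_...`-style flag in the parser.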

Create samplesheet from MINiML file, without processing or downloading

Version 1.1 will do this as part of the download or samplesheet CLI functions, but this version should be more interactive. You provide path to a miniml file like GSE122038_family.xml and it confirms how many samples to import, which columns of meta data to include, and tries to assign common meta data fields, including Replicate, Control, Sample_Type, and Visit (for a longitudinal study).

Add to version 1.2

Accessing probe intensities for QC metrics

Greetings,

I am playing around with this repo in an attempt to implement GCT score calculation (Explained here: https://academic.oup.com/nar/article/45/4/e22/2290930).

I've added some code to manifest.py that allows me to add a column to the manifest dataframe that contains the extension base for a probe.

Next is to get the mean green intensity of probes expected to extend into a C and divide that by the mean green intensity of probes expected to extend into a T. I've included my function for doing so below but I am not getting close to producing a ratio near 1 (the expected result).

I was hoping to get some advice on where I might be going wrong here. I've had some trouble tracking probe intensity information throughout the analysis, so it may be possible that I am using the wrong dataframe. I am afraid my current approach uses raw (un-normalized, not dye-bias-corrected) intensities, which would prevent me from calculating a meaningful ratio.

I appreciate any advice you could share with me to help me implement GCT score calculation. From my testing, it's a useful way to quantify levels of bisulfite conversion and I'd be happy to add the feature and open a pull request.

Thanks,
Mauro Chavez


def _calculate_gct_score(data_container):
    """
    SeSAMe implementation in R:
        # Get the type I probes expected to extend into C and T
        extC <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extC
        extT <- sesameDataGet(paste0(sdfPlatform(sdf), '.probeInfo'))$typeI.extT
        ## prbs <- rownames(oobG(sset))
        df <- InfIR(sdf)
        extC <- intersect(df$Probe_ID, extC)
        extT <- intersect(df$Probe_ID, extT)
        dC <- df[match(extC, df$Probe_ID),]
        dT <- df[match(extT, df$Probe_ID),]
        mean(c(dC$MG, dC$UG), na.rm=TRUE) / mean(c(dT$MG, dT$UG), na.rm=TRUE)
    """
    if "Next_Base" not in data_container.man.columns:
        LOGGER.warning("No Next_Base column found in manifest: no GCT score calculated")
        return None
    print(data_container.man.columns)
    # Successful conversion means that all these probes should light up RED
    ext_c_probes = data_container.man[(data_container.man["Next_Base"] == "C") & (data_container.man["Infinium_Design_Type"] == "I")]
    ext_c_probes = ext_c_probes.merge(
        data_container.green_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.green_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_grn', '_AddressB_grn')
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_red', '_AddressB_red')
    )
    print("Exp C probes")
    print(ext_c_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())

    # Use these probes to normalize against how often a green signal comes from an extension into a T
    ext_t_probes = data_container.man[(data_container.man["Next_Base"] == "T") & (data_container.man["Infinium_Design_Type"] == "I")]
    ext_t_probes = ext_t_probes.merge(
        data_container.green_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.green_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_grn', '_AddressB_grn')
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressA_ID", right_index=True,
    ).merge(
        data_container.red_idat.probe_means,
        left_on="AddressB_ID", right_index=True,
        suffixes=('_AddressA_red', '_AddressB_red')
    )
    print("Exp T probes")
    print(ext_t_probes[["mean_value_AddressA_grn", "mean_value_AddressB_grn", "mean_value_AddressA_red", "mean_value_AddressB_red"]].mean())

    return (ext_c_probes["mean_value_AddressA_grn"].mean() + ext_c_probes["mean_value_AddressB_grn"].mean()) / (ext_t_probes["mean_value_AddressA_grn"].mean() + ext_t_probes["mean_value_AddressB_grn"].mean())

download function confused by other non-supported GPL-platform types

Sample: BS_NS_BAL_641 has unrecognized platform

python -m methylprep -v download -i GSE133062 -d ../GSE133062

INFO:methylprep.download.geo:Downloaded and unpacked GSE133062
Traceback (most recent call last):
  File "/Users/mmaxmeister/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/mmaxmeister/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/mmaxmeister/methylprep/methylprep/__main__.py", line 6, in <module>
    cli_app()
  File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 240, in cli_app
    build_parser()
  File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 58, in build_parser
    parsed_args.func(func_args)
  File "/Users/mmaxmeister/methylprep/methylprep/cli.py", line 232, in cli_download
    run_series(args.id, args.data_dir, dict_only=args.dict_only)
  File "/Users/mmaxmeister/methylprep/methylprep/download/process_data.py", line 64, in run_series
    seen_platforms = geo_metadata(id, series_path, GEO_PLATFORMS, str(path))
  File "/Users/mmaxmeister/methylprep/methylprep/download/geo.py", line 151, in geo_metadata
    raise ValueError(f'Sample: {title} has unrecognized platform: {platform}')
ValueError: Sample: BS_NS_BAL_641 has unrecognized platform: GPL23976

methylprep generates only csv files, so methylcheck doesn't work

Hi,
I'm new to python and methylation.
I ran methylprep and it generated all the required output, but in .csv and not in .pkl format.
So when I do methylcheck in jupyter like this:

import methylcheck
from pathlib import Path
filepath = Path('D:\bla\blablu\blablubli\blablublibsi')

betas = methylcheck.load(filepath)
betas.head()

I get this error with betas.head():

WARNING:methylcheck.load_processed:No files of type (beta_value) found in D:\D:\bla\blablu\blablubli\blablublibsi' (or sub-folders).

AttributeError Traceback (most recent call last)
Cell In[2], line 2
1 betas = methylcheck.load(filepath)
----> 2 betas.head()

AttributeError: 'NoneType' object has no attribute 'head'

Why were no pkl files created and is this really the reason why it is not recognizing the beta_values (they are there in the .csv) ?

Would be great if you could help me out. Cheers!

save_control only works with single batches

  • need to update pipeline to handle multi-batch processing and saving controls for each batch
  • loads manifest for each batch, which is slower, but allows run_pipeline to process separate batches as separate array types, a situation more common with older GEO datasets. Upstream, though, we need to create a step that looks at samples and puts them into separate batches if multiple array types are present. At the moment, you have to manually split these datasets into different folders and run methylprep on each folder separately.

Any groups for exchange?

Hello,
Thanks for making and maintaining the project.
As far as I can see, bioinformatics is commonly done in R, and Python devs must either learn R or skip bioinformatics projects.
I have a question: are there any groups for exchanging ideas with other people who use methylprep and methylsuite?
I have some questions, and it would be good to be part of an active community.

methylprep --all / --batch-size: memory does not clear between batches

via @moranr7

Running :

python -m methylprep -v process -d ./ --all --no_sample_sheet

and

python -m methylprep -v process -d ./ --all --no_sample_sheet --batch-size X

behave no differently in terms of memory requirements. For 107 samples, both use >60GB of memory.

dev note: the --all overrides any other kwargs, so --batch-size is ignored

Between batches, the memory does not clear. It just continues to grow. Should the memory not be cleared after the .pkl objects are written to free up memory?

Also, the memory requirement seems very high, is there a memory leak?

Thanks,
Ray
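A hedged sketch of the fix being asked for here: drop all references to a batch's data once its .pkl files are written, then force a garbage collection before starting the next batch. (The names below are illustrative, not methylprep internals.)

```python
import gc

def process_in_batches(batches, process_batch, save_batch):
    """Process each batch, persist it, then release its memory."""
    for batch in batches:
        containers = process_batch(batch)   # e.g. a list of sample data containers
        save_batch(containers)              # e.g. write the batch's .pkl files
        del containers                      # drop the only in-memory reference
        gc.collect()                        # reclaim reference cycles now

saved = []
process_in_batches([[1, 2], [3]], lambda b: [x * 2 for x in b], saved.extend)
print(saved)  # [2, 4, 6]
```

Without the `del` + `gc.collect()` step (or with a stray reference kept elsewhere), each batch's memory accumulates, which matches the ever-growing usage described above.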

offer Apache Arrow output option in methylprep.run_pipeline -- ideal for GPU machine learning use case

Adam: It would make sense to use Apache Arrow instead of pickle as a file format, as NVIDIA has libraries that can load it directly to the GPU, also it's a language independent format. It would be important to make sure that the meta data file always has Sample_ID column.

Helpful to know. This format would be useless to me, and probably to a lot of vanilla python users who've never seen them, but we can add the option to support deep learning. I've never seen an Apache Arrow file.

Going from CSVs to pickles was a design choice that broke compatibility with R users, but made everything 100X faster when dealing with large data sets that can't fit into memory. Loading a stack of CSVs takes minutes compared to a few seconds with a py3 pickle.
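For illustration, a pickle round-trip preserves the DataFrame (dtypes and index included) and skips CSV parsing and type inference entirely, which is where the speed difference comes from:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"beta": [0.1, 0.9]}, index=["cg0001", "cg0002"])
path = os.path.join(tempfile.mkdtemp(), "beta_values.pkl")

df.to_pickle(path)               # binary dump: no parsing needed on reload
restored = pd.read_pickle(path)
print(restored.equals(df))       # True
```

The trade-off noted above stands: pickles are Python-only, so R users (and an Arrow-based option) need a language-neutral format.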

methylprep.read_geo exceptions

  • For files of the <GEO>_signals.txt.gz variety, it won't apply p-value detection to probes, even when a p-value column is present for each sample.
  • If meth and unmeth column names differ in case (e.g. SampleA Unmethylated Signal vs samplea Methylated Signal), it won't work. Instead you have to use pandas like this:
>>> import pandas as pd
>>> x = pd.read_csv('GSE67530_signals.txt', sep='\t')
>>> cols = {i:i.title() for i in list(x.columns)}
>>> x = x.rename(columns=cols)
>>> x.to_csv('GSE67530_signals_revised.txt', sep='\t')
>>> z = methylprep.read_geo('GSE67530_signals_revised.txt')
>>> z.to_csv('GSE67530_betas.csv.gz') 

(Make sample-name matching case-insensitive; the function could try this trick before failing.)

sample_sheet parser forces Sample_Name(s) to exist and be unique; this can cause name mismatches in methylcheck.

Processing note:
The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique.
This may cause sample_sheets and processed data in dataframes to not match up.

Will fix in future version:

  • return a list of sample name --> rename lookup dict.
  • add the meta_data builder to methylprep, with correct names, so that users can just avoid referring to their raw data in subsequent data analysis steps. If Sample_Names are provided but not unique, they're close enough for a person to tell them apart. And if they're omitted, it shouldn't affect anyone to add dummy Sample_Names to the data.
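The renaming behavior described in the processing note can be sketched roughly like this (illustrative only, not methylprep's actual code):

```python
def make_unique_names(names):
    """Assign a placeholder name when missing; suffix duplicates with _N."""
    counts = {}
    unique = []
    for i, name in enumerate(names, start=1):
        if not name:
            name = f"Sample{i}"      # missing name -> positional placeholder
        if name in counts:
            counts[name] += 1
            name = f"{name}_{counts[name]}"   # duplicate -> append counter
        else:
            counts[name] = 0
        unique.append(name)
    return unique

print(make_unique_names(["A", "A", None]))  # ['A', 'A_1', 'Sample3']
```

The planned rename-lookup dict would simply map each original entry to the name emitted here, so downstream dataframes can be matched back to the sample sheet.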

RUVm (postprocessing method to reduce batch effects)

As with other types of microarray platforms, technical artifacts are a concern, including background fluorescence, dye-bias from the use of two color channels, bias caused by type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing.

https://bioconductor.org/packages/release/bioc/vignettes/missMethyl/inst/doc/missMethyl.html#introduction

I found a new-ish method, RUVm, for removing unwanted variance between batches using negative controls. This works well after NOOB.
Should I plan to port this from R into methylprep?
https://cran.r-project.org/web/packages/ruv/ruv.pdf

Let's hold off on RUVm for now. We may need to optimize our pre-processing pipeline first, and we need a strategy for empirically evaluating the utility of each step.

poobah pkl file is not generated

Using methylprep version '1.2.11' and command :
python -m methylprep -v process --all

I cannot pass the parameter --export_poobah to the function.

Any idea what is happening here ?

Thanks,
Ray

methylprep not working with pandas version >2 due to API changes

Exception raised when running a pipeline:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/pipeline.py", line 331, in run_pipeline
    data_container = SampleDataContainer(
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/pipeline.py", line 590, in __init__
    infer_type_I_probes(self, debug=self.debug)
  File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/infer_channel_switch.py", line 19, in infer_type_I_probes
    channels = get_infer_channel_probes(container.manifest, container.green_idat, container.red_idat, debug=debug)
  File "/home/paul/miniconda3/lib/python3.11/site-packages/methylprep/processing/infer_channel_switch.py", line 171, in get_infer_channel_probes
    oobG_IG = oobG.append(green_in_band).sort_index()
              ^^^^^^^^^^^
  File "/home/paul/miniconda3/lib/python3.11/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?

It seems this is the reason:
https://stackoverflow.com/questions/75956209/dataframe-object-has-no-attribute-append
Problem's solved if I downgrade to pandas 1.5.3

Suggested change:

  1. Bump the requirement to pandas >=2 and adapt the code accordingly
  2. Limit the requirement to <2
  3. Check pandas version and use different functions accordingly
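There is a fourth option that avoids a version check entirely: `pd.concat` works on both pandas 1.x and 2.x, so the failing line in infer_channel_switch.py could be rewritten as below (a sketch, using dummy frames in place of the real out-of-band data):

```python
import pandas as pd

# stand-ins for the real oobG / green_in_band frames
oobG = pd.DataFrame({"mean_value": [10, 20]}, index=[3, 1])
green_in_band = pd.DataFrame({"mean_value": [30]}, index=[2])

# old (removed in pandas 2.0):
#   oobG_IG = oobG.append(green_in_band).sort_index()
# replacement, valid on pandas 1.x and 2.x:
oobG_IG = pd.concat([oobG, green_in_band]).sort_index()
print(list(oobG_IG.index))  # [1, 2, 3]
```

The same one-line substitution applies to every `DataFrame.append` call in the codebase.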

CLI does not import the right module

Upon first installation of 1.6.0 (which, coincidentally, does not show up in the Releases here in GH), I got:

methylprep-cli -h
Traceback (most recent call last):
  File "/home/incalci/shared/conda-envs/methylsuite/bin/methylprep-cli", line 5, in <module>
    from methylprep.cli import app
ImportError: cannot import name 'app' from 'methylprep.cli' (/home/incalci/shared/conda-envs/methylsuite/lib/python3.10/site-packages/methylprep/cli.py)

It looks like app was renamed cli_app.

confirm order of probes does not change during processing raw data

Mark mentioned that methylprep was reordering the probes during processing. It might be this pandas dataframe bug:

Warning (from warnings module):
File “/Users/mmaxmeister/legx/test_combine_mds.py”, line 302
slice_betas = pd.concat([slice_betas, betas[sample]], axis=1)
FutureWarning: Sorting because non-concatenation axis is not >aligned. A future version of pandas will change to not sort by default.
To accept the future behavior, pass ‘sort=False’.
To retain the current behavior and silence the warning, pass ‘sort=True’.

I’m sure I accepted the (sort=True) part to suppress the warning, but as a result, it is reordering the probes. Will put this as an issue.
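The sorting behavior is easy to demonstrate. Passing `sort=False` explicitly keeps the original probe order regardless of pandas version, which is the safer choice for this pipeline:

```python
import pandas as pd

a = pd.DataFrame({"s1": [1, 2]}, index=["cg02", "cg01"])   # deliberately unsorted
b = pd.DataFrame({"s2": [3]}, index=["cg03"])

# sort=True sorts the non-concatenation axis; sort=False keeps appearance order
print(list(pd.concat([a, b], axis=1, sort=True).index))    # ['cg01', 'cg02', 'cg03']
print(list(pd.concat([a, b], axis=1, sort=False).index))   # ['cg02', 'cg01', 'cg03']
```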

missing data in manifest files for non-CpG probes; the fix requires users to delete their local manifest files so methylprep will re-download them.

Nichole Rigby - 9:17 AM DEC 12
Not sure if this is a bug or not, but a single control probe shows up in the methylprep version of the manifest. It's one of the -99 ones. I'm guessing it should be in the control dataframes, and not the manifest?

Another bug (and this affects our new QC functions) is that some of the controls are missing from the control dataframes. For example, the raw manifest (outside of methylprep) contains these STAINING controls.

New bug related to the control probes: I also found some inconsistency in the names of the BISULFITE CONVERSION probes. Some platforms have names like BS Conversion I-C1 (EPIC+) and others have similar names with the - missing (450K) for only SOME of the BISULFITE CONVERSION probes. In some cases, the original manifest has names with the - and the manifest in methylprep is missing the - from the name.

9:09 AM
I think the solution might be to go through all the manifest files and methylprep manifests and make sure the - is in the name for each.

Nichole: I checked them all and only changed and re-uploaded the ones that were affected; I also had to delete the .methylprep_manifest_files directory. My code now works.

For users to get the update, they'll need to delete the manifest files on their machine, and it will re-download them.

Manifest

Marc,

I'm trying, in vain, to import my own manifest (version b5) with methylprep. Can you give me more information on how to do this?

BR

Stef L

run_pipeline(); change default behavior to return None instead of list of data_containers/beta_values to save memory.

v1.3.0 cleaned up a lot of unnecessary memory usage. The final run_pipeline() step, "consolidate", uses 2X the previous maximum memory. If I dropped this default behavior (returning a list of data_container objects, or a df of beta_values), it would reduce the memory spike, which could otherwise cause the process to get killed on computers with less memory when processing a large batch. Either way, all files are processed before it would crash; it would just be tidier to avoid using excessive memory.

Processing 280 EPIC samples on macOS used ~16GB of memory at its peak, which seems like a lot. It would otherwise use <10GB at its max-usage point (based on mprof / python memory-profiler).

Control of output file location/names

I would like to be able to specify where, and with what prefixes, output files should go. Right now my workflow involves copying an entire BeadChip sequencing directory from separate sequencing data storage where we do not have write permissions (taking up most of my local storage), running the methylprep/-check/-ize CLI, then running a bash script to delete all of the copied files, pre-pend the directory structure onto the output files, and move them into the appropriate analysis directory. I have looked through the documentation but have not found anywhere to specify an output path for results. Is there one? If not, could there be in the future?

Adapting sample_sheets.py to Windows users

I ran into this error in trying to create sample_sheets:

File "sample_sheets.py", line 198, in create_sample_sheet
    raise ValueError(file_name_error_msg.format(idat))
ValueError: This .idat file does not have the right pattern to auto-generate a sample sheet:

To fix it for Windows users, one simply needs to replace the "/" with "\\" like so:

filename = str(idat).split("\\")[-1]
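A more portable fix is pathlib, which avoids hard-coding either separator. This sketch uses PureWindowsPath only so the example parses a Windows-style path on any OS; inside sample_sheets.py itself, plain `Path(idat).name` would work on the user's own filesystem:

```python
from pathlib import PureWindowsPath

def idat_filename(idat):
    """Filename portion of an idat path, regardless of separator style.
    PureWindowsPath understands both "\\" and "/" separators; on a user's
    own machine, pathlib.Path(idat).name is sufficient."""
    return PureWindowsPath(idat).name

print(idat_filename(r"D:\data\200772300016_R01C01_Grn.idat"))
print(idat_filename("data/200772300016_R01C01_Grn.idat"))
```

Both calls print `200772300016_R01C01_Grn.idat`, so one code path serves Windows and POSIX users.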

Methylprep doesn't process mixed data (EPIC + 450k) sets yet

Some GEO data sets contain both kinds of sample array data. The current fix requires additional user steps, which could be dealt with programmatically:

How to process a batch of GEO idats using methylprep and pipeline (fastest way)

  1. Use web browser to find your data set. Note the GSExxxxx ID, such as GSE142512.
  2. python -m methylprep download -d GSE142512 -i GSE142512
    where -d is the data folder to place files, and -i is the GSE ID.
  3. This example data set contains both 450k and EPIC samples, so I used the macOS Finder (or Windows File Explorer)
    to move the 450k idats into another folder first, because they need to be processed separately.
    (methylprep and the pipeline don't process mixed data sets yet.)
  4. Create a sample sheet for EACH folder of idats:
    python -m methylprep sample_sheet -d GSE142512 -t Blood --create
    note the '--create' part is vital, otherwise it just reads an existing samplesheet.
    In this example, I am adding in the sample_type for all of these samples ('Blood') as it was not in the samplesheet meta data already. This command uses sample idat filenames, and doesn't necessarily parse all of the _family_meta_data stuff.
    The basic stuff is Sample_Name (added) | GSM_ID | Sentrix_ID | Sentrix_Position (from filename)
    For the full meta data, use python -m methylprep meta_data -d GSE142512 -i GSE142512 instead.
  5. From here, you can process each folder separately with methylprep process

copy number

Hi,

I'm trying to use the package to obtain copy numbers from the idat files; however, it gives me a message that this is still unsupported.
Is there maybe an easy workaround for this, or when will this be available?

Thank you!
