metient's Introduction

Metient

Metient (METastasis + gradiENT) is a tool for migration history inference. You can find our preprint on bioRxiv.

Metient is available as a Python library, installable via pip. It has been tested on Linux. Installing it and running one of the tutorials should take about 5 minutes.

Installation

# mamba used for speed, can use conda instead if mamba is not installed
mamba create -n "met" python=3.8.8 ipython
mamba activate met
pip install metient

Tutorial

To run the tutorial notebooks, clone this repo:

git clone git@github.com:morrislab/metient.git
cd metient/tutorial/

There are different Jupyter Notebook tutorials based on your use case:

  1. I have a cohort of patients (~5 or more patients) with the same cancer type. (Metient-calibrate)
    • I want Metient to estimate which mutations/mutation clusters are present in which anatomical sites. Tutorial 1
    • I know which mutations/mutation clusters are present in which anatomical sites. Tutorial 2
  2. I have a small number of patients, or I want to enforce my own parsimony metric weights. (Metient-evaluate)
    • I want Metient to estimate which mutations/mutation clusters are present in which anatomical sites. Tutorial 3
    • I know which mutations/mutation clusters are present in which anatomical sites. Tutorial 4

If Jupyter Notebook does not automatically recognize your conda environment, run the following from inside the activated environment:

pip install ipykernel
python -m ipykernel install --user --name met --display-name "met"

Then in Jupyter Notebook, select Kernel > Change kernel > met.

Inputs

There are two required inputs: a tsv file with information for each sample and mutation/mutation cluster, and a txt file specifying the edges of the clone tree.

1. Tsv file

Two types of tsv are accepted, depending on whether you'd like Metient to estimate the presence of cancer clones in each tumor site (1a), or you'd like to input this yourself (1b).

1a. If you would like Metient to estimate the prevalence of each cancer clone in each tumor site, use the following input tsv format.

1a example tsv

Each row in this tsv should correspond to the reference and variant read counts at a single locus in a single tumor sample:

| Column name | Description |
|---|---|
| anatomical_site_index | Zero-based index for anatomical_site_label column. Rows with the same anatomical site index and cluster_index will get pooled together. |
| anatomical_site_label | Name of the anatomical site. |
| character_index | Zero-based index for character_label column. |
| character_label | Name of the mutation. This is used in visualizations, so it should be short. NOTE: due to graphing dependencies, this string cannot contain colons. |
| cluster_index | If using a clustering method, the cluster index that this mutation belongs to. NOTE: this must correspond to the indices used in the tree txt file. Rows with the same anatomical site index and cluster_index will get pooled together. |
| ref | The number of reads that map to the reference allele for this mutation or mutation cluster in this anatomical site. |
| var | The number of reads that map to the variant allele for this mutation or mutation cluster in this anatomical site. |
| site_category | Must be one of primary or metastasis. If multiple primaries are specified, such that the primary label is used for multiple different anatomical site indices (i.e., the true primary is not known), we will run Metient multiple times with each primary used as the true primary. Output files are saved with the suffix _{anatomical_site_label} to indicate which primary was used in that run. |
| var_read_prob | This gives Metient the ability to correct for the effect copy number alterations (CNAs) have on the relationship between variant allele frequency (VAF, i.e., the proportion of alleles bearing the mutation) and subclonal frequency (i.e., the proportion of cells bearing the mutation). Let j = character_index. var_read_prob is the probability of observing a read from the variant allele for mutation j in a cell bearing the mutation. Thus, if mutation j occurred at a diploid locus with no CNAs, this should be 0.5. In a haploid cell (e.g., male sex chromosome) with no CNAs, this should be 1.0. If a CNA duplicated the reference allele in the lineage bearing mutation j prior to j occurring, there will be two reference alleles and a single variant allele in all cells bearing j, such that var_read_prob = 0.3333. If using a CN caller that reports major and minor CN: `var_read_prob = (p*maj)/(p*(maj+min)+(1-p)*2)`, where p is tumor purity, maj is major CN, min is minor CN, and we're assuming the variant allele has major CN. For more details, see S2.2 of PairTree's supplementary information. |
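The major/minor copy-number formula for var_read_prob can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated above, not part of Metient's API: the function name and argument names are made up here, and it keeps the same assumption as the formula (the variant allele sits on the major copy-number allele, and normal cells contribute two copies).

```python
def var_read_prob(purity, major_cn, minor_cn):
    """Evaluate (p*maj) / (p*(maj+min) + (1-p)*2) from the table above.

    Hypothetical helper: assumes the variant allele has the major copy
    number and that non-tumor cells are diploid at this locus.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if minor_cn < 0 or major_cn < max(minor_cn, 1):
        raise ValueError("expected major_cn >= minor_cn >= 0 and major_cn >= 1")
    variant_copies = purity * major_cn
    total_copies = purity * (major_cn + minor_cn) + (1.0 - purity) * 2
    return variant_copies / total_copies


# Pure tumor, diploid locus without CNAs: one variant copy out of two.
print(var_read_prob(1.0, 1, 1))
# Pure tumor, single-copy (haploid) locus: every read carries the variant.
print(var_read_prob(1.0, 1, 0))
# 50% purity with a gained variant allele (maj=2, min=1).
print(var_read_prob(0.5, 2, 1))
```

The first two calls reproduce the 0.5 and 1.0 values quoted in the column description. Cases where the variant is on the minor allele (such as the 0.3333 example) fall outside what this formula covers.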

1b. If you would like to input the prevalence of each cancer clone in each tumor site, use the following input tsv format.

1b example tsv

Each row in this tsv should correspond to a single mutation/mutation cluster in a single tumor sample:

| Column name | Description |
|---|---|
| anatomical_site_index | Zero-based index for anatomical_site_label column. Rows with the same anatomical site index and cluster_index will get pooled together. |
| anatomical_site_label | Name of the anatomical site. |
| cluster_index | If using a clustering method, the cluster index that this mutation belongs to. NOTE: this must correspond to the indices used in the tree txt file. Rows with the same anatomical site index and cluster_index will get pooled together. |
| cluster_label | Name of the mutation or cluster of mutations. This is used in visualizations, so it should be short. NOTE: due to graphing dependencies, this string cannot contain colons. |
| present | Must be one of 0 or 1. 1 indicates that this mutation/mutation cluster is present in this anatomical site, and 0 indicates that it is not. |
| site_category | Must be one of primary or metastasis. If multiple primaries are specified, such that the primary label is used for multiple different anatomical site indices (i.e., the true primary is not known), we will run Metient multiple times with each primary used as the true primary. Output files are saved with the suffix _{anatomical_site_label} to indicate which primary was used in that run. |
| num_mutations | The number of mutations in this cluster. |
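Before running Metient, it may help to check a 1b-style tsv against the constraints listed in the table. The following is a minimal sketch using only the standard library. The function name is hypothetical, treating every listed column as required is an assumption, and it checks only the rules stated above (allowed values for present and site_category, no colons in cluster_label).

```python
import csv
import io

# Column names taken from the 1b table above.
REQUIRED_COLUMNS = [
    "anatomical_site_index", "anatomical_site_label", "cluster_index",
    "cluster_label", "present", "site_category", "num_mutations",
]
ALLOWED_SITE_CATEGORIES = {"primary", "metastasis"}


def validate_1b_tsv(handle):
    """Return a list of problems found in a 1b-style tsv (empty if none)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if (row["present"] or "") not in {"0", "1"}:
            problems.append(f"line {line_no}: present must be 0 or 1")
        if (row["site_category"] or "") not in ALLOWED_SITE_CATEGORIES:
            problems.append(f"line {line_no}: site_category must be primary or metastasis")
        if ":" in (row["cluster_label"] or ""):
            problems.append(f"line {line_no}: cluster_label cannot contain colons")
    return problems


# Made-up single-row example, for illustration only.
example = "\t".join(REQUIRED_COLUMNS) + "\n" + "\t".join(
    ["0", "breast", "0", "A", "1", "primary", "12"]) + "\n"
print(validate_1b_tsv(io.StringIO(example)))
```

For a file on disk, pass an open text handle, e.g. `with open(path, newline="") as handle: validate_1b_tsv(handle)`.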

2. Tree txt file

A .txt file where each line is a directed edge from the first index (parent) to the second index (child). The indices must correspond to the cluster_index column in the input tsv.

Example tree .txt file
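As a rough sketch of how such an edge list can be read and sanity-checked, the snippet below parses the edges, finds the root, and reports tree indices that do not appear among the tsv's cluster_index values. It assumes each line holds two whitespace-separated integers, which the description above does not state explicitly, and all function names are hypothetical rather than part of Metient.

```python
def parse_tree_edges(text):
    """Parse 'parent child' lines into a list of (parent, child) tuples."""
    edges = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected two indices, got {len(fields)}")
        edges.append((int(fields[0]), int(fields[1])))
    return edges


def find_root(edges):
    """Return the single node that is a parent but never a child."""
    parents = {parent for parent, _ in edges}
    children = {child for _, child in edges}
    roots = parents - children
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {sorted(roots)}")
    return roots.pop()


def tree_indices_missing_from_tsv(edges, cluster_indices):
    """Return tree node indices that are absent from the tsv's cluster_index values."""
    tree_nodes = {node for edge in edges for node in edge}
    return tree_nodes - set(cluster_indices)


# Made-up four-node tree, for illustration only.
edges = parse_tree_edges("0 1\n0 2\n1 3\n")
print(edges)
print(find_root(edges))
print(tree_indices_missing_from_tsv(edges, {0, 1, 2}))
```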

Outputs

Metient will output a pickle file in the specified output directory for each input patient.

In the pickle file you'll find the following keys:

  • ordered_anatomical_sites: a list of anatomical sites in the order used for the matrices detailed below.
  • full_tree_node_idx_to_labels: list of dictionaries, in order from best to worst solution. This is solution specific because resolving polytomies can change the tree. Each dictionary maps node index (as used for the matrices detailed below) to the label used on the tree. This differs from the input because Metient adds leaf nodes which correspond to the inferred presence of each node in anatomical sites. Each leaf node is labeled as <parent_node_name>_<anatomical_site>.
  • clone_tree_labelings: list of numpy ndarrays, in order from best to worst solution. Each numpy array is a matrix (shape: len(ordered_anatomical_sites), len(full_node_idx_to_label.values())). Row i corresponds to the site at index i in ordered_anatomical_sites, and column j corresponds to the node with label full_node_idx_to_label[j]. Each column is a one-hot vector representing the location inferred by Metient for that node.
  • full_adjacency_matrices: list of numpy ndarrays, in order from best to worst tree. Each array is a matrix (shape: len(full_node_idx_to_label.values()), len(full_node_idx_to_label.values())). A 1 at index i,j indicates an edge from i to j.
  • observed_clone_matrix: numpy ndarray (shape: len(ordered_anatomical_sites), len(full_node_idx_to_label.values())). Row i corresponds to the site at index i in ordered_anatomical_sites, and column j corresponds to the node with label full_node_idx_to_label[j]. A value at i,j greater than 0.05 indicates that the node is present in that anatomical site. These are the nodes that get added as leaf nodes.
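The keys above can be combined to read off the best solution. The sketch below is illustrative only: the helper names are hypothetical, `load_results` assumes an uncompressed pickle (adjust the opener if your output file is compressed), and node indices are assumed to run from 0 to n-1. It indexes the matrices with `matrix[i][j]`, which works for numpy arrays and for the nested lists used in the made-up demo. Edges whose endpoints are assigned to different sites are treated here as migrations.

```python
import pickle


def load_results(path):
    """Load one patient's output pickle (assumed uncompressed)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def best_solution_summary(results):
    """Return (node locations, cross-site edges) for the top-ranked solution."""
    sites = results["ordered_anatomical_sites"]
    labels = results["full_tree_node_idx_to_labels"][0]  # index 0 = best solution
    labeling = results["clone_tree_labelings"][0]
    adjacency = results["full_adjacency_matrices"][0]
    n_nodes = len(labels)

    # Each column of the labeling is one-hot, so the row holding the
    # largest value gives the site assigned to that node.
    node_site = []
    for j in range(n_nodes):
        site_idx = max(range(len(sites)), key=lambda i: labeling[i][j])
        node_site.append(sites[site_idx])

    locations = [(labels[j], node_site[j]) for j in range(n_nodes)]
    migrations = [
        (labels[i], labels[j], node_site[i], node_site[j])
        for i in range(n_nodes)
        for j in range(n_nodes)
        if adjacency[i][j] == 1 and node_site[i] != node_site[j]
    ]
    return locations, migrations


# Made-up results dictionary with the documented keys, for illustration only.
demo = {
    "ordered_anatomical_sites": ["breast", "lung"],
    "full_tree_node_idx_to_labels": [{0: "A", 1: "B", 2: "A_breast", 3: "B_lung"}],
    "clone_tree_labelings": [[[1, 0, 1, 0], [0, 1, 0, 1]]],
    "full_adjacency_matrices": [[[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]],
}
locations, migrations = best_solution_summary(demo)
print(locations)
print(migrations)
```

With a real output file, the call would be `best_solution_summary(load_results(path))`, where `path` points at the pickle for one patient.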

metient's Issues

Running metient without a known primary

Hi,

Thanks for making this tool available; the pre-print was fantastic.

I wanted to ask if you had any experience running Metient in a situation where the primary site is unknown. Say I have ten samples where I am unsure which is the primary site: should I input "primary" for all samples in the site_category column? My understanding is that it would run Metient ten times, each time using a different sample as the "true primary." When Metient runs (I am planning on using the evaluate mode since I have a low case number), will it compare the graph scores only within a primary designation (and output the most likely graph for each primary designation, but no comparison between primary designations), or will it also compare the scores of the graphs between different primary designations?

I see in supplementary figure S3 that for patient 1 it output three solutions, with the graph on the far right having the primary designation "left ovary" instead of right ovary as the others do. How was Metient run in this case, i.e., did you input primary for both left and right ovary in the site_category column? This case is particularly interesting because in the MACHINA paper they argue that, given migration/comigration numbers, the tumor likely originated from the left ovary (although there is room for interpretation).

Thank you again; this looks like a great tool!
