
Comments (48)

jharenza avatar jharenza commented on July 28, 2024 2

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done? And what other filters may have been applied to this data? Can we see the command that was given to run Strelka2? I did notice the methods so far say that Strelka2 was run using best practices so I will investigate that a bit more to see if anything there makes more sense of the data.

The only filtering done was PASS for variants. The somatic workflows are here.

from openpbta-analysis.

kgaonkar6 avatar kgaonkar6 commented on July 28, 2024 2

Hi @cansav09! I've previously done TMB analysis for a subset of our samples for a paper (https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656587.full.pdf). I calculated TMB = (total variants in coding region / length of exome BED in bp) * 1000000, where variants in the coding region are variants overlapping an inclusive coding region file, https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg38/bed/Exome-AZ_V2.bed, and the denominator is the calculated length of that coding region (which for this BED file was 159697302 bp).
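
The calculation described above can be sketched in shell as follows. This is only a minimal sketch, not the code used for the paper: the toy BED rows and the variant count of 12 are made up for illustration, and it assumes the BED intervals are half-open (length = end - start) and do not overlap one another.

```shell
# Toy BED file (hypothetical intervals, NOT Exome-AZ_V2.bed).
bed=$(mktemp)
printf 'chr1\t100\t600\nchr1\t1000\t1500\nchr2\t0\t1000\n' > "$bed"

# Region length in bp: BED is 0-based and half-open, so each interval
# contributes (end - start). Assumes intervals do not overlap.
bed_bp=$(awk -F'\t' '{ total += $3 - $2 } END { printf "%.0f", total }' "$bed")

# Hypothetical number of variants overlapping the BED regions.
n_variants=12

# TMB = variants in region / region length (bp) * 1,000,000
tmb=$(awk -v n="$n_variants" -v bp="$bed_bp" \
  'BEGIN { printf "%.2f", n / bp * 1000000 }')

echo "region length: ${bed_bp} bp; TMB: ${tmb} mutations/Mb"
rm -f "$bed"
```

If the BED file contains overlapping intervals, they would need to be merged first, otherwise the denominator is overestimated.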


cansavvy avatar cansavvy commented on July 28, 2024 2

Interestingly, the github repo for the code seems to previously have classified IGR (which I think may be intergenic)-line 90- as silent, so potentially based on the version you are using, the reading in of the maf may be missing those variants. Can you read the file outside of maftools and just do a table on those variant classifications?

strelka2_df <- data.table::fread(file.path(data_dir, "pbta-snv-strelka2.vep.maf.gz"),
                              skip = 1,  # skip version string
                              data.table = FALSE)

summary(as.factor(strelka2_df$Variant_Classification))

Reading in without MAF tools made it give me a lot more data. That is a very unfortunate default setting maftools has. Good catch!

          3'Flank                  3'UTR                5'Flank 
                614717                  93603                 664499 
                 5'UTR        Frame_Shift_Del        Frame_Shift_Ins 
                 33920                   1799                   1291 
                   IGR           In_Frame_Del           In_Frame_Ins 
               3745915                   1628                    231 
                Intron      Missense_Mutation      Nonsense_Mutation 
               4930882                  79471                   3466 
      Nonstop_Mutation                    RNA                 Silent 
                    78                  69709                  34473 
         Splice_Region            Splice_Site        Targeted_Region 
                 10014                   2728                      1 
Translation_Start_Site 
                  1667 

I will re-run my original analyses and also try to add in the stuff we've mentioned here and then I will file a draft PR to continue this discussion. At least we have one problem solved though!


yuankunzhu avatar yuankunzhu commented on July 28, 2024 2

cool. glad we figured this out. yea, we kept all somatics including intergenic and silent ones. just got in office, did a quick look too:

$ zcat pbta-snv-strelka2.vep.maf.gz | sed 1d | cut -f9 | sort | uniq -c
 614717 3'Flank
  93603 3'UTR
 664499 5'Flank
  33920 5'UTR
   1799 Frame_Shift_Del
   1291 Frame_Shift_Ins
3745915 IGR
   1628 In_Frame_Del
    231 In_Frame_Ins
4930882 Intron
  79471 Missense_Mutation
   3466 Nonsense_Mutation
     78 Nonstop_Mutation
  69709 RNA
  34473 Silent
  10014 Splice_Region
   2728 Splice_Site
      1 Targeted_Region
   1667 Translation_Start_Site
      1 Variant_Classification


jharenza avatar jharenza commented on July 28, 2024 1

@cansav09 I think we are looking for something like this - Figure 1 from the Grobner, 2018 landscape, but for brain tumors:
grobner-figure1

For reference, here is the second pan-pediatric cancer landscape from 2018:
Ma, 2018


jharenza avatar jharenza commented on July 28, 2024 1

Hi @cansav09! I think @afarrel may have some input here, as he has worked with a lot of TMB data. People do this many different ways. I had previously used only non-synonymous mutations to be reflective of somatic burden. Looks like there is a harmonization effort here, but not sure what they have decided. One other note is that if you use coding only, then you should calculate the coding region in Mb from the WGS/WES used for this. Currently, I think much of the data you have was derived from WGS, but some incoming data will be WES and we can provide the WES capture region for this.

I think #s1-4 look good!


jharenza avatar jharenza commented on July 28, 2024 1

TMB typically does not include CNV data, so I would recommend against using it. Otherwise, I would add a comparison to adult tumors, as mentioned above in the one reference.


afarrel avatar afarrel commented on July 28, 2024 1

Just to reiterate what @jharenza said, typically we include SNVs and small InDels in TMB calls. In regards to TMBs you need to define if you are calling all SNVs or only non-synonymous mutations. In some of our applications, we perform TMB calls using Tumor-Only methods because we don't have matched normal data and exclude synonymous mutation calls to reduce false positives and better reflect somatic mutation burden.
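
To make the "all coding vs. non-synonymous only" choice concrete, here is a rough sketch that counts both per sample. The two-column input is a hypothetical extract of a MAF (sample ID and Variant_Classification). The non-synonymous set is the nine classes that maftools retained in the output posted elsewhere in this thread. Treating "coding" as those nine plus Silent is an assumption for illustration.

```shell
maf=$(mktemp)
# Hypothetical two-column extract of a MAF: sample ID, Variant_Classification.
printf 'S1\tMissense_Mutation\nS1\tSilent\nS1\tIGR\nS1\tNonsense_Mutation\nS2\tSilent\nS2\tIntron\nS2\tFrame_Shift_Del\n' > "$maf"

counts=$(awk -F'\t' '
  BEGIN {
    # Classes treated as non-synonymous here (an assumption; adjust as needed).
    split("Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins " \
          "Missense_Mutation Nonsense_Mutation Nonstop_Mutation " \
          "Splice_Site Translation_Start_Site", v, " ")
    for (i in v) nonsyn[v[i]] = 1
  }
  {
    seen[$1] = 1
    if ($2 in nonsyn) { ns[$1]++; coding[$1]++ }
    else if ($2 == "Silent") coding[$1]++   # synonymous coding variants
  }
  END { for (k in seen) printf "%s\t%d\t%d\n", k, coding[k], ns[k] }' "$maf" | sort)

# Columns: sample, coding (non-synonymous + Silent), non-synonymous only
printf '%s\n' "$counts"
rm -f "$maf"
```

Dividing either column by the same region size in Mb gives the two TMB variants being discussed.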


jharenza avatar jharenza commented on July 28, 2024 1

Also, @cansav09, looking at the figure example, it does look like they restrict to coding SNVs. Your plot looks like you are on the right track, so maybe these new calculations will help - we expect to see a lower TMB in pediatric samples compared to adult. If you also wanted to order the x-axis by median for the groups and add the y-intercepts for 2, 10, and 100 muts/Mb, then we can mimic that figure and assess hypermutated samples in our cohort!
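
As a rough sketch of the ordering and reference lines suggested above (not the plotting code itself), the following computes the median TMB per group, sorts groups by that median, and counts samples above 2, 10, and 100 mutations/Mb. The group names and TMB values are hypothetical.

```shell
tmb=$(mktemp)
tab=$(printf '\t')
# Hypothetical input: disease group, per-sample TMB (mutations/Mb).
printf 'GroupA\t0.5\nGroupA\t1.5\nGroupA\t120\nGroupB\t3\nGroupB\t12\nGroupC\t0.2\n' > "$tmb"

# Median TMB per group, then groups sorted by ascending median
# (the x-axis order suggested above).
medians=$(sort -t "$tab" -k1,1 -k2,2n "$tmb" | awk -F'\t' '
  function emit() {
    if (n) printf "%s\t%g\n", g, (n % 2 ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2)
  }
  $1 != g { emit(); g = $1; n = 0 }   # new group: report the previous one
  { v[++n] = $2 }                     # values arrive sorted within a group
  END { emit() }' | sort -t "$tab" -k2,2n)
printf '%s\n' "$medians"

# Count samples above each reference line (2, 10, and 100 mutations/Mb).
above=$(awk -F'\t' '{ if ($2 > 2) a++; if ($2 > 10) b++; if ($2 > 100) c++ }
  END { printf "%d %d %d", a, b, c }' "$tmb")
echo "samples above 2/10/100 muts/Mb: ${above}"
rm -f "$tmb"
```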


jharenza avatar jharenza commented on July 28, 2024 1

I think you can safely use the one listed above, provided by @kgaonkar6, select variants in the region within that bed, and use that bed's size. Sorry the fact we did this previously escaped me!
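
A minimal sketch of "select variants within the BED, then use the BED's size" might look like the following. The regions, the four-column variant table, and the sample names are all hypothetical. It assumes variant coordinates are 1-based and inclusive (as in MAF Start_Position/End_Position) and BED coordinates are 0-based and half-open. A dedicated interval tool would be the usual choice for real data.

```shell
bed=$(mktemp); variants=$(mktemp)
# Hypothetical target regions (BED: 0-based start, half-open end).
printf 'chr1\t100\t200\nchr2\t0\t50\n' > "$bed"
# Hypothetical variants: sample, chromosome, start, end (1-based, inclusive).
printf 'S1\tchr1\t150\t150\nS1\tchr1\t201\t201\nS1\tchr2\t50\t50\nS2\tchr1\t100\t100\nS2\tchr3\t10\t10\n' > "$variants"

# Load the BED file first (NR == FNR), then keep variants that overlap
# at least one region on the same chromosome.
in_region=$(awk -F'\t' '
  NR == FNR { n[$1]++; s[$1, n[$1]] = $2; e[$1, n[$1]] = $3; next }
  {
    for (i = 1; i <= n[$2]; i++)
      if ($3 - 1 < e[$2, i] && $4 > s[$2, i]) { print; break }
  }' "$bed" "$variants")

# Region size in bp, then per-sample TMB in mutations/Mb.
# Samples with no variants in the regions do not appear in the table.
bed_bp=$(awk -F'\t' '{ t += $3 - $2 } END { printf "%.0f", t }' "$bed")
tmb_table=$(printf '%s\n' "$in_region" |
  awk -F'\t' -v bp="$bed_bp" '{ c[$1]++ }
    END { for (k in c) printf "%s\t%.1f\n", k, c[k] / bp * 1000000 }' | sort)
printf '%s\n' "$tmb_table"
rm -f "$bed" "$variants"
```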


migbro avatar migbro commented on July 28, 2024 1

@cansav09 @jharenza , this depends on what you are using as your input data. If you got it from Ped cBioportal, the import process basically removes intergenic variants. I'd suggest using the unaltered mafs. Strelka2 used these intervals for WGS calls:

chr1	0	248956422
chr2	0	242193529
chr3	0	198295559
chr4	0	190214555
chr5	0	181538259
chr6	0	170805979
chr7	0	159345973
chr8	0	145138636
chr9	0	138394717
chr10	0	133797422
chr11	0	135086622
chr12	0	133275309
chr13	0	114364328
chr14	0	107043718
chr15	0	101991189
chr16	0	90338345
chr17	0	83257441
chr18	0	80373285
chr19	0	58617616
chr20	0	64444167
chr21	0	46709983
chr22	0	50818468
chrX	0	156040895
chrY	0	57227415
chrM	0	16569

Mutect2, based on their recommended calling regions, with the addition of chromosome M:

wgs_canonical_calling_regions.hg38.txt
It's actually a bed file, but I gave it the txt extension so that this platform would allow the upload.
Hope this helps!
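
For anyone who wants a denominator from an interval list like the one above, a rough sketch follows. It uses only three of the rows listed above (chr21, chr22, chrM) to keep the example short, and the temporary file is a stand-in for the real BED file.

```shell
# Three rows copied from the interval list above, kept short for illustration;
# in practice the full BED file would be used.
intervals=$(mktemp)
printf 'chr21\t0\t46709983\nchr22\t0\t50818468\nchrM\t0\t16569\n' > "$intervals"

# Sum (end - start) over all intervals. %.0f avoids scientific notation
# for large totals in some awk implementations.
size_bp=$(awk -F'\t' '{ total += $3 - $2 } END { printf "%.0f", total }' "$intervals")
size_mb=$(awk -v bp="$size_bp" 'BEGIN { printf "%.3f", bp / 1000000 }')

echo "${size_bp} bp = ${size_mb} Mb"
rm -f "$intervals"
```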


cansavvy avatar cansavvy commented on July 28, 2024 1

I quickly looked at the GDC website - do you know if your processed data is from GDC?

Yes. I believe so. I'm using the TCGA-LGG, TCGA-GBM, and TCGA-PCPG mutation data.


allisonheath avatar allisonheath commented on July 28, 2024 1

Re: TCGA WXS BED - So as background, you have to remember that TCGA exomes were sequenced during a span of over ten years. So the kits/centers/etc changed quite a bit. All of the ones ever used are archived here: https://bitbucket.org/cghub/cghub-capture-kit-info/src/master/

However, as part of the MC3 efforts (data and paper links, synapse site) to unify the exome calling they chose:

7. Capture Kit - ( MAF tag: 'bitgt') The filter represented a simple process of intersecting all mutations calls with the subset of the genome that intersected with all of the capture kits used by the different sequencing centers.

My understanding is that this should be the same filter that GDC used to produce the project-level public MAFs (if you have access and look in the protected MAFs you'll see them tagged as bitgt as well). And I believe that is the 3_boosters from https://bitbucket.org/cghub/cghub-capture-kit-info/src/master/BI/vendor/Agilent/ but I haven't found a reference to fully support that yet.


jharenza avatar jharenza commented on July 28, 2024 1

Hi @cansavvy!

  1. I think you can use the same window in both, especially since we do not have access to the TCGA interval files. The only exception I can think of is if TCGA is WES (is it WES or WGS? I probably should know this), then we would not use a whole genome size for those samples, we would have to use a window and subset for mutations in that window. We would underestimate TMB in this case since we would be missing all noncoding TCGA mutations and having a whole genome as the denominator.
  2. Probably good practice to leave in, but I defer to 1, regarding the window.
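
To illustrate the underestimate described in point 1, here is a small worked example. All three numbers are hypothetical and chosen only to show the effect of the denominator.

```shell
# Hypothetical numbers for illustration only.
n_coding=50        # coding mutations called in one WES sample
exome_mb=40        # size of the capture region in Mb (assumed)
genome_mb=3000     # rough whole-genome size in Mb (assumed)

tmb_exome=$(awk -v n="$n_coding" -v mb="$exome_mb" 'BEGIN { printf "%.3f", n / mb }')
tmb_genome=$(awk -v n="$n_coding" -v mb="$genome_mb" 'BEGIN { printf "%.3f", n / mb }')

echo "capture-region denominator: ${tmb_exome} muts/Mb"
echo "whole-genome denominator:   ${tmb_genome} muts/Mb"
```

With the same mutation count, the whole-genome denominator gives a value 75 times smaller here, which is why the denominator should match the region in which calls were actually made.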


cgreene avatar cgreene commented on July 28, 2024 1

@jharenza : TCGA is complicated as noted by @allisonheath in #3 (comment).

My impression is that the MC3 calls, which I understand based on the conversation above to be what we're using, follow the BED file here: https://gdc.cancer.gov/about-data/publications/mc3-2017

If we're using MC3, then we should use that interval. It doesn't matter that we're only looking at brain tumors if MC3 only called in those regions anyway. It would be good to check that all of the calls that we're getting from TCGA are in fact in those regions @cansavvy.


jharenza avatar jharenza commented on July 28, 2024 1

ok sure - saw it was the one referenced in the PR, so I plopped that info here. I can create a new one :)


cansavvy avatar cansavvy commented on July 28, 2024

This may be a simplistic question, but in our case TMB = Median(# coding indels + # coding base substitutions) / Megabase? Second question: I am presuming we need to do some kind of analysis that looks at differences between synonymous and nonsynonymous mutations? This perhaps may need to be related to age at diagnosis?


cgreene avatar cgreene commented on July 28, 2024

@PichaiRaman : I think this came from you - how were you planning to do this analysis?


cansavvy avatar cansavvy commented on July 28, 2024

Now that I've gotten somewhat familiar with the SNV data, I can do this analysis as well. It looks like Grobner says they used Number of SNV of coding mutations per Mb? Just to try to clarify, does this mean we include synonymous and non-synonymous mutations alike in this number? They don't mention distinguishing between this in their Methods, but I've seen other papers that look into it.

Here's my tentative plan, let me know what you think:

Step 1) Start with Mutect2 results only. (Once I make the script for this, and then later after #30 is determined, I can easily go back and change it).
Step 2) Calculate total number of mutations per Mb
Step 3) Graph this in a jitterplot like above with the disease type as the x axis.
Step 4) Split this out into a jitterplot that examines nonsynonymous vs synonymous.


cansavvy avatar cansavvy commented on July 28, 2024

Looks like there is a harmonization effort here, but not sure what they have decided.

This is interesting. Glad someone is working on making things uniform, but yeah, hard to tell what the conclusion is from this.


cansavvy avatar cansavvy commented on July 28, 2024

I think the first part of this analysis will be determining if there are differences between different methods of calculations of TMB.

My thought is I should calculate TMBs and compare for the following 2 sets of variables:

  1. Coding mutations only vs all mutations.
  2. SNVs + CNVs vs only using SNVs - this however requires that we have an integration of CNV data - #27 (Let me know if there is something I can do to help with this).

So this would result in four different methods of calculating TMBs for each sample. All being Mutations per Mb.

              | All mutations | Coding variants only
SNVs + CNVs   |               |
SNVs only     |               |

After calculating these TMBs I will test the differences/similarities of the calculations by:

  1. Correlating sample's TMB calculations (probably both Spearman's and Pearson's).
  2. Comparing TMB distributions (as well as comparing mean and medians)

How does this sound?


cansavvy avatar cansavvy commented on July 28, 2024

Is there a more exact way to determine the number of megabases sequenced for each sample? I can’t find anyone’s exact code for determining TMB so I am unsure what number(s) to use for the “per coding Megabases” part.


cansavvy avatar cansavvy commented on July 28, 2024

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done? And what other filters may have been applied to this data? Can we see the command that was given to run Strelka2? I did notice the methods so far say that Strelka2 was run using best practices so I will investigate that a bit more to see if anything there makes more sense of the data.


jharenza avatar jharenza commented on July 28, 2024

Is there a more exact way to determine the number of megabases sequenced for each sample? I can’t find anyone’s exact code for determining TMB so I am unsure what number(s) to use for the “per coding Megabases” part.

For WES data, it is calculated using the bed regions file and for WGS, you can use the size of the whole genome - enlisting @migbro for this reference file and a size, as I just noticed that the links to the code are not in the manuscript...


cansavvy avatar cansavvy commented on July 28, 2024

For WGS, will we need to calculate the percent of the reference genome covered in order to get an exact genome size in Mb? Or do we just use a general approximation? Are the bed regions files already prepared previously?

Also, I've set up the TCGA data, and I will generally attempt to calculate TMB the same way. Do you have an idea of what is the easiest way to obtain genome sizes for the TCGA samples? Has this been calculated elsewhere?


cansavvy avatar cansavvy commented on July 28, 2024

Here's a preview of the data as it is now, before dividing by genome size. I'm not sure which disease labels I should be using for both the TCGA and PBTA data.

For the TCGA data, I have only used what were brain tumors so hopefully it's more comparable.

log2_coding_mutations.pdf


cansavvy avatar cansavvy commented on July 28, 2024

[Image: Screen Shot 2019-09-11 at 1 10 13 PM]


cansavvy avatar cansavvy commented on July 28, 2024

I've previously done TMB analysis for a subset of our samples for a paper (https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656587.full.pdf). I calculated TMB = (total variants in coding region / length of exome BED in bp) * 1000000, where variants in the coding region are variants overlapping an inclusive coding region file, https://github.com/AstraZeneca-NGS/reference_data/blob/master/hg38/bed/Exome-AZ_V2.bed, and the denominator is the calculated length of that coding region (which for this BED file was 159697302 bp).

@kgaonkar6 Thanks so much for the code and paper link! This is super helpful! So am I correct in saying that you only used the size of the reference genome for the denominator for all your samples? (As opposed to calculating the percent of the reference exome that each sample had covered and then obtaining a megabase count per sample.)


cansavvy avatar cansavvy commented on July 28, 2024

Also were these data filtered to take out intergenic mutations? I’m assuming yes based on what I’m seeing in the annotation but how was this done?

The only filtering done was PASS for variants. The somatic workflows are here.

I'm trying to look through the code and determine why the WGS samples don't have any intergenic mutations at all (okay there are 2 out of all the samples) but it isn't clear to me yet.

The reason I want to know this is for determining the genome size in Mb to divide by. It seems like we would only want to use the size of the whole genome as our denominator for TMB if we also had a chance of finding intergenic mutations in this workflow, i.e., if the workflow is intentionally selecting only coding mutations for WGS, it seems like the denominator for TMB should reflect that.

But if we are intentionally only trying to look at the coding mutations for WGS so that it is more comparable to WES, then that makes sense, but I'm wondering why we would still want to use the whole genome size as a denominator. Or do we?


jharenza avatar jharenza commented on July 28, 2024

Hmm, that's a good point - I would expect to see intergenic, noncoding mutations as well - @yuankunzhu and @migbro, do you know if the pipelines selectively retained only coding variants?

If we only look at coding mutations, then yes, we would only use those within the coding genome/WES regions (as @kgaonkar6 mentioned) as the denominator.


cansavvy avatar cansavvy commented on July 28, 2024

Sounds good, @jharenza. Should I just make a coding BED file from the hg38 reference genome or is this something you guys have prepared and can share with me?


cansavvy avatar cansavvy commented on July 28, 2024

Perfect! Thanks, @migbro, @jharenza, and @kgaonkar6, all this information helps a lot! Will hopefully have an updated plot tomorrow.


cansavvy avatar cansavvy commented on July 28, 2024

One more question, any idea where I could find the coding regions used for TCGA WES samples? I’ve been digging around a bit and haven’t figured out what the best approach is to get the information.


cansavvy avatar cansavvy commented on July 28, 2024

Also, @migbro or @jharenza is it possible to get the unaltered mafs in v5 of these data?


jharenza avatar jharenza commented on July 28, 2024

Also, @migbro is it possible to get the unaltered mafs in v5 of these data?

@cansav09 these mafs should be unaltered (only PASS variant VCFs), so I am confused that you are not seeing intergenic variants, and we may have to look into these files a bit more. I am wondering if the annotation pipeline somehow removed them, but this would be unexpected.


cansavvy avatar cansavvy commented on July 28, 2024

IDK if this information helps, but if I look at Variant_Classification column right after I read in the MAF file with maftools, there are no NAs and all the variant classifications reported are coding related changes. Is it possible mutations got dropped if they didn't have a classification?

# Read in original strelka file with maftools
strelka2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-strelka2.vep.maf.gz"))

# Get a summary of Variant_Classification 
summary(strelka2@data$Variant_Classification)
# Output from above
     Frame_Shift_Del        Frame_Shift_Ins           In_Frame_Del 
                  1799                   1291                   1628 
          In_Frame_Ins      Missense_Mutation      Nonsense_Mutation 
                   231                  79471                   3465 
      Nonstop_Mutation            Splice_Site Translation_Start_Site 
                    78                   2728                   1667 


jharenza avatar jharenza commented on July 28, 2024

One more question, any idea where I could find the coding regions used for TCGA WES samples? I’ve been digging around a bit and haven’t figured out what the best approach is to get the information.

I quickly looked at the GDC website - do you know if your processed data is from GDC? It looks like they have an intervals.bed file they use for regions in variant calling, but don't see how to download it. @allisonheath - do you know if we can obtain this?

If we can't get that, I might suggest using the same coding regions for their data that you use for the PBTA data and selecting the variants in those regions. That way we are comparing the same genomic regions with calls that have been made in those regions. I suppose this may not be apples:apples if for some reason there were no calls in some of their coding regions, but I would think it would be as close as we can get if we can't get their intervals file.

Looking


jharenza avatar jharenza commented on July 28, 2024

IDK if this information helps, but if I look at Variant_Classification column right after I read in the MAF file with maftools, there are no NAs and all the variant classifications reported are coding related changes. Is it possible mutations got dropped if they didn't have a classification?

# Read in original strelka file with maftools
strelka2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-strelka2.vep.maf.gz"))

# Get a summary of Variant_Classification 
summary(strelka2@data$Variant_Classification)
# Output from above
     Frame_Shift_Del        Frame_Shift_Ins           In_Frame_Del 
                  1799                   1291                   1628 
          In_Frame_Ins      Missense_Mutation      Nonsense_Mutation 
                   231                  79471                   3465 
      Nonstop_Mutation            Splice_Site Translation_Start_Site 
                    78                   2728                   1667 

Ahh, it looks like maftools read.maf changed a bit since I last used it. Its default before was to remove any silent variants, but now, seems to ask you to define vc_nonSyn as what you consider as non-synonymous and the rest will be considered silent and removed. Interestingly, the github repo for the code seems to previously have classified IGR (which I think may be intergenic)-line 90- as silent, so potentially based on the version you are using, the reading in of the maf may be missing those variants. Can you read the file outside of maftools and just do a table on those variant classifications? The intergenics could also potentially be blank (haven't looked into these files yet).

Hmm, but then this MAF documentation suggests that IGR would not be in a somatic MAF file, so it could be missed with annotation. We can check this.

Though, if you will subset for coding only for TMB (as the figure does above), I guess this doesn't matter as much, however, we should be sure they are there. It will be important, however, for mutational signatures. Since you mention TCGA is WES data, not WGS data, then I would suggest restricting to coding regions to compare coding SNVs.


cansavvy avatar cansavvy commented on July 28, 2024

Strelka2 used these intervals for WGS calls:

@migbro, thanks for this information, I've implemented it into my R notebook. Do you also have the BED regions for the WXS samples?


cansavvy avatar cansavvy commented on July 28, 2024

@allisonheath , Thanks so much for all these resources, I had to take a little time to go through them. We are only using the non-protected MAFs and for the first pass of this analysis, I've been using TCGA-LGG, TCGA-GBM, and TCGA-PCPG.

Do you think it's reasonable to use the Target Region BED File - gencode.v19.basic.exome.bed from MC3 to calculate TMB? Or will this not be very representative because it uses all the projects and here I am only using brain-related TCGA tumor data?


cansavvy avatar cansavvy commented on July 28, 2024

[Image: Screen Shot 2019-11-08 at 10 31 29 AM]

Here's the updated plot. There are some outliers of sorts. Is it acceptable to plot this on a log scale? Or what do you recommend?


jharenza avatar jharenza commented on July 28, 2024

@cansavvy you can definitely plot on log10 scale! This is nice - there are some known HGAT that are hypermutated.


cansavvy avatar cansavvy commented on July 28, 2024

Cool! A couple other questions, @jharenza, to calculate TMB for the TCGA samples, I used the same genome windows and size I did for the PBTA samples. I filtered out any mutations that were not in the WGS windows used for TCGA, however all of them turned out to be inside the windows so the filter was irrelevant.

Two questions:

  1. Is this method okay for calculating the TCGA TMB? Or do you suggest something else like using some other genome window for TCGA data?
  2. All the mutations end up being within the WGS_strelka2 window, so do you still want me to keep in the filter steps? Or can I just write up this explanation and remove the code since no mutations are actually being filtered out?


cansavvy avatar cansavvy commented on July 28, 2024

The core of this analysis has been done. The next question for this analysis is whether to use/apply the molecular subtyping labels to these data. This issue has been tracked here on #335 so we can revisit that when molecular subtyping tickets have all been addressed.


jaclyn-taroni avatar jaclyn-taroni commented on July 28, 2024

The analyses described in this issue can be found in: analyses/snv-callers for TMB calculation and analyses/tmb-compare-tcga for the comparison to TCGA.

See Tumor Mutation Burden Calculation section of the analyses/snv-callers README for the summary of how TMB is calculated for this project.

Any remaining points have now been subsumed by #257, #335.


jharenza avatar jharenza commented on July 28, 2024

@cansavvy @jaclyn-taroni @jashapiro @yuankunzhu - commenting on this issue with some additional background I found, from one of the earlier papers that identified an association between PMS2 mutations and hypermutation. Of note, they do additional filtering, and I am not sure yet (I have to read more) whether TCGA had done the same, which may be a reason we see higher TMBs in pediatric samples with some algorithms.

Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden
Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 2017 Apr 19;9(1):34.

Link
Abstract
Background
High tumor mutational burden (TMB) is an emerging biomarker of sensitivity to immune checkpoint inhibitors and has been shown to be more significantly associated with response to PD-1 and PD-L1 blockade immunotherapy than PD-1 or PD-L1 expression, as measured by immunohistochemistry (IHC). The distribution of TMB and the subset of patients with high TMB has not been well characterized in the majority of cancer types.

Methods
In this study, we compare TMB measured by a targeted comprehensive genomic profiling (CGP) assay to TMB measured by exome sequencing and simulate the expected variance in TMB when sequencing less than the whole exome. We then describe the distribution of TMB across a diverse cohort of 100,000 cancer cases and test for association between somatic alterations and TMB in over 100 tumor types.

Results
We demonstrate that measurements of TMB from comprehensive genomic profiling are strongly reflective of measurements from whole exome sequencing and model that below 0.5 Mb the variance in measurement increases significantly. We find that a subset of patients exhibits high TMB across almost all types of cancer, including many rare tumor types, and characterize the relationship between high TMB and microsatellite instability status. We find that TMB increases significantly with age, showing a 2.4-fold difference between age 10 and age 90 years. Finally, we investigate the molecular basis of TMB and identify genes and mutations associated with TMB level. We identify a cluster of somatic mutations in the promoter of the gene PMS2, which occur in 10% of skin cancers and are highly associated with increased TMB.

Conclusions
These results show that a CGP assay targeting ~1.1 Mb of coding genome can accurately assess TMB compared with sequencing the whole exome. Using this method, we find that many disease types have a substantial portion of patients with high TMB who might benefit from immunotherapy. Finally, we identify novel, recurrent promoter mutations in PMS2, which may be another example of regulatory mutations contributing to tumorigenesis.

TMB calculations
Panel
TMB was defined as the number of somatic, coding, base substitution, and indel mutations per megabase of genome examined. All base substitutions and indels in the coding region of targeted genes, including synonymous alterations, are initially counted before filtering as described below. Synonymous mutations are counted in order to reduce sampling noise. While synonymous mutations are not likely to be directly involved in creating immunogenicity, their presence is a signal of mutational processes that will also have resulted in nonsynonymous mutations and neoantigens elsewhere in the genome. Non-coding alterations were not counted. Alterations listed as known somatic alterations in COSMIC and truncations in tumor suppressor genes were not counted, since our assay genes are biased toward genes with functional mutations in cancer. Alterations predicted to be germline by the somatic-germline-zygosity algorithm were not counted. Alterations that were recurrently predicted to be germline in our cohort of clinical specimens were not counted. Known germline alterations in dbSNP were not counted. Germline alterations occurring with two or more counts in the ExAC database were not counted. To calculate the TMB per megabase, the total number of mutations counted is divided by the size of the coding region of the targeted territory. The nonparametric Mann–Whitney U-test was subsequently used to test for significance in difference of means between two populations.

Alterations predicted to be germline by the somatic-germline-zygosity algorithm were not counted. Alterations that were recurrently predicted to be germline in our cohort of clinical specimens were not counted. Known germline alterations in dbSNP were not counted. Germline alterations occurring with two or more counts in the ExAC database were not counted.
here, they remove germline or predicted germline alterations using dbSNP germline alterations and ExAC frequencies

WES of 29 samples which also had panel seq
WES was performed on 29 samples as previously described for which CGP had also been performed. Briefly, tumors were sequenced using Agilent’s exome enrichment kit (Sure Select V4; with >50% of baits above 25× coverage). The matched blood-derived DNA was also sequenced. Base calls and intensities from the Illumina HiSeq 2500 were processed into FASTQ files using CASAVA. The paired-end FASTQ files were aligned to the genome (to UCSC’s hg19 GRCh37) with BWA (v0.5.9). Duplicate paired-end sequences were removed using Picard MarkDuplicates (v1.35) to reduce potential PCR bias. Aligned reads were realigned for known insertion/deletion events using SRMA (v0.1.155). Base quality scores were recalibrated using the Genome Analysis Toolkit (v1.1-28). Somatic substitutions were identified using MuTect (v1.1.4). Mutations were then filtered against common single-nucleotide polymorphisms (SNPs) found in dbSNP (v132), the 1000 Genomes Project (Feb 2012), a 69-sample Complete Genomics data set, and the Exome Sequencing Project (v6500). <- extra filtering step

TCGA
TCGA data were obtained from public repositories. For this analysis, we used the somatic called variants as determined by TCGA as the raw mutation count. We used 38 Mb as the estimate of the exome size. For the downsampling analysis, we simulated the observed number of mutations/Mb 1000 times using the binomial distribution at whole exome TMB = 100 mutations/Mb, 20 mutations/Mb, and 10 mutations/Mb and did this for megabases of exome sequenced ranging from 0–10 Mb. Melanoma TCGA data were obtained from dbGap accession number phs000452.v1.p1.


jharenza avatar jharenza commented on July 28, 2024

For the PPTC PDX paper, we had removed germline variants by way of a panel of normals, since we had tumor-only samples. However, we do mention in the paper that our TMBs are a bit higher because we likely did not remove all of the germline variants in that regard. In OpenPBTA, we should be removing these with the paired normal, but I also wonder if more are coming through for any reason.


jaclyn-taroni avatar jaclyn-taroni commented on July 28, 2024

@jharenza file a new issue that uses either the proposed analysis or updated analysis template and link to this one please. The additional structure in those templates has proven to be helpful.

