miplicorn's People

Contributors

arisp99, georgeseif, jeffandbailey

miplicorn's Issues

Tests failing on all platforms

Bug Description

Tests fail on all platforms; see the test logs for details. There is one error on each platform. The Windows error looks ggplot2-related, while the Mac and Unix errors look related to a deprecated file-reading function.

Expected Behavior

Tests should pass

Required Action

Debug the failure on each platform.

MIPr Read In function for AA tables instead of just AN tables

AA tables need to be read in for various analyses before any chromosome-level analysis. The AA and AN tables have different initial columns produced by the variant analysis, so it would be great to have a read-in function specific to AA tables that produces a tibble structure similar to that of the AN MIPr::read function.

Thanks!

Coverage plot

In #43, we mentioned that it would be useful to have a coverage plot that shows the average coverage grouped by a specific variable. This coverage plot should be compatible with the reference table and with the combined reference, alternate, and coverage table.

AA abbreviations

Hi, it would be good to convert between three-letter and single-letter amino acid codes. Here's a very messy conversion hack. It could be a function.

# replace AA abbreviations with single letters
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "val", replacement = "V")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "ala", replacement = "A")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "his", replacement = "H")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "leu", replacement = "L")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "arg", replacement = "R")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "pro", replacement = "P")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "lys", replacement = "K")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "thr", replacement = "T")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "cys", replacement = "C")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "met", replacement = "M")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "tyr", replacement = "Y")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "ile", replacement = "I")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "asn", replacement = "N")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "glu", replacement = "E")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "ser", replacement = "S")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "gly", replacement = "G")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "asp", replacement = "D")
clinical_summary <- mutate_if(clinical_summary,
is.character,
str_replace_all,
pattern = "phe", replacement = "F")
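
One way the hack above could become a function is a lookup vector applied in a single pass. This is only a sketch in base R: `aa_codes` and `three_to_one()` are hypothetical names, not part of the package, and the matching is case-sensitive on title-case codes (the hack above matches lowercase ones).

```r
# Standard three-letter to one-letter amino acid codes (20 residues).
aa_codes <- c(
  Ala = "A", Arg = "R", Asn = "N", Asp = "D", Cys = "C",
  Gln = "Q", Glu = "E", Gly = "G", His = "H", Ile = "I",
  Leu = "L", Lys = "K", Met = "M", Phe = "F", Pro = "P",
  Ser = "S", Thr = "T", Trp = "W", Tyr = "Y", Val = "V"
)

# Hypothetical helper: swap every three-letter code in `x` in one pass.
three_to_one <- function(x) {
  pattern <- paste(names(aa_codes), collapse = "|")
  hits <- gregexpr(pattern, x)
  regmatches(x, hits) <- lapply(
    regmatches(x, hits),
    function(codes) unname(aa_codes[codes])
  )
  x
}

three_to_one(c("Ala623Glu", "Asn86Tyr"))
#> [1] "A623E" "N86Y"
```

Replacing all codes in one pass avoids one replacement accidentally creating a new match for a later one, which chained replacements can do.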

Labels for Insertions and Deletions

It would be helpful to know which mutations are insertions and deletions for downstream analysis. A single column flagging this, with a boolean or similar, would be a simple way to show it.
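
As a rough illustration of such a column, the sketch below labels variants by comparing allele lengths. It assumes mutation names of the form CHROM:POS:REF:ALT; `classify_variant()` is a hypothetical helper, and AA-level tables would need a different rule.

```r
# Hypothetical helper: label a variant by comparing REF and ALT lengths.
# Assumes names of the form CHROM:POS:REF:ALT.
classify_variant <- function(mutation_name) {
  parts <- strsplit(mutation_name, ":", fixed = TRUE)
  vapply(parts, function(p) {
    ref <- p[length(p) - 1]
    alt <- p[length(p)]
    if (nchar(alt) > nchar(ref)) {
      "insertion"
    } else if (nchar(alt) < nchar(ref)) {
      "deletion"
    } else {
      "substitution"
    }
  }, character(1))
}

# The second and third names are made up for illustration.
classify_variant(c("chr11:637436:T:A", "chr11:637436:T:TAA", "chr11:637436:TAA:T"))
#> [1] "substitution" "insertion"    "deletion"
```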

Cannot generate a chromosome map given arbitrary column names

Bug Description

If the 4th column of the probes input is not probe_set, chromosome_map() fails.

Expected Behavior

It would be ideal if you could feed in any set of column names to generate the figures.

Reprex

library(miplicorn)

genome <- tibble::tribble(
             ~V1, ~V2,      ~V3,
   "PvP01_01_v1",  1L, 1021664L,
   "PvP01_02_v1",  1L,  956327L,
   "PvP01_03_v1",  1L,  896704L,
   "PvP01_04_v1",  1L, 1012024L,
   "PvP01_05_v1",  1L, 1524814L,
   "PvP01_06_v1",  1L, 1042791L,
   "PvP01_07_v1",  1L, 1652210L,
   "PvP01_08_v1",  1L, 1761288L,
   "PvP01_09_v1",  1L, 2237066L,
   "PvP01_10_v1",  1L, 1548844L,
   "PvP01_11_v1",  1L, 2131221L,
   "PvP01_12_v1",  1L, 3182763L,
   "PvP01_13_v1",  1L, 2093556L,
   "PvP01_14_v1",  1L, 3153402L,
  "PvP01_API_v1",  1L,   29582L,
  "PvP01_MIT_v1",  1L,    5989L
  )

probes <- tibble::tribble(
         ~CHROM,   ~START,     ~END, ~PROBE_SET,
  "PvP01_04_v1", 1011487L, 1011488L,      "SNP",
  "PvP01_09_v1", 1221489L, 1221490L,      "SNP",
  "PvP01_04_v1", 1011292L, 1011293L,      "SNP",
  "PvP01_09_v1", 1233254L, 1233255L,      "SNP",
  "PvP01_04_v1", 1011456L, 1011457L,      "SNP",
  "PvP01_12_v1", 2384609L, 2384610L,      "SNP",
  "PvP01_09_v1", 1234166L, 1234167L,      "SNP",
  "PvP01_12_v1", 2384596L, 2384597L,      "SNP",
  "PvP01_09_v1", 1876033L, 1876034L,      "SNP",
  "PvP01_14_v1",  455911L,  455912L,      "SNP"
  )

chromosome_map(genome, probes, "karyoploteR")
#> Error: Problem with `filter()` input `..1`.
#> ℹ Input `..1` is `.data$probe_set == "SNP"`.
#> x Column `probe_set` not found in `.data`

chromosome_map(
  genome, 
  dplyr::rename(probes, probe_set = PROBE_SET), 
  "karyoploteR"
)

Created on 2021-10-21 by the reprex package (v2.0.1)
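
One possible direction, sketched in base R, is to rename the probe columns by position before plotting so that the input names no longer matter. Only `probe_set` is confirmed by the error message above; `standardise_probes()` and the other three column names are placeholders.

```r
# Hypothetical helper: rename the first four columns by position.
# Only `probe_set` is confirmed by the error message; the other
# three names are placeholders.
standardise_probes <- function(probes) {
  stopifnot(is.data.frame(probes), ncol(probes) >= 4)
  names(probes)[1:4] <- c("chrom", "start", "end", "probe_set")
  probes
}

probes <- data.frame(
  CHROM = "PvP01_04_v1", START = 1011487L, END = 1011488L, PROBE_SET = "SNP"
)
names(standardise_probes(probes))
#> [1] "chrom"     "start"     "end"       "probe_set"
```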

Release miplicorn 0.2.0

Prepare for release:

  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • Review pkgdown reference index for, e.g., missing topics

Create GitHub release:

  • usethis::use_version('minor')
  • usethis::use_github_release()
  • usethis::use_dev_version()

Mutation prevalence calculation using genotype table

Currently, the only way to compute the prevalence of mutations is by using the reference, alternate, and coverage tables:

ref_file <- miplicorn_example("reference_AA_table.csv")
alt_file <- miplicorn_example("alternate_AA_table.csv")
cov_file <- miplicorn_example("coverage_AA_table.csv")
data <- read_tbl_ref_alt_cov(
  ref_file,
  alt_file,
  cov_file,
  gene == "atp6" | gene == "crt"
)
mutation_prevalence(data, 5)
#> # A tibble: 16 × 4
#>    mutation_name  n_total n_mutant prevalence
#>    <chr>            <int>    <int>      <dbl>
#>  1 atp6-Ala623Glu      36       NA     NA    
#>  2 atp6-Glu431Lys      39       NA     NA    
#>  3 atp6-Gly639Asp      26       19      0.731
#>  4 atp6-Ser466Asn      15        9      0.6  
#>  5 atp6-Ser769Asn      17       NA     NA    
#>  6 crt-Ala220Ser       11        4      0.364
#>  7 crt-Asn326Asp       21        8      0.381
#>  8 crt-Asn326Ser       26       NA     NA    
#>  9 crt-Asn75Glu        29       24      0.828
#> 10 crt-Cys72Ser        31       23      0.742
#> 11 crt-His97Leu        47       NA     NA    
#> 12 crt-His97Tyr        47       NA     NA    
#> 13 crt-Ile356Leu       22       15      0.682
#> 14 crt-Ile356Thr       41       18      0.439
#> 15 crt-Lys76Thr        29       25      0.862
#> 16 crt-Met74Ile        29       24      0.828

An alternative solution would be to use the genotype table and count the number of zeros, ones, and twos. It would be useful to provide a method to compute the prevalence of mutations using the genotype table rather than the reference, alternate, and coverage tables.
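
A minimal base R sketch of the genotype-based calculation follows. It assumes a long table with `mutation_name` and `genotype` columns, that 0 means reference, 1 mixed, and 2 alternate, and that missing calls are `NA`. `prevalence_from_genotype()` is a hypothetical name and the data are made up.

```r
# Made-up genotype calls: 0 = reference, 1 = mixed, 2 = alternate (assumed).
genotypes <- data.frame(
  mutation_name = c(
    "crt-Lys76Thr", "crt-Lys76Thr", "crt-Lys76Thr", "crt-Lys76Thr",
    "atp6-Gly639Asp", "atp6-Gly639Asp"
  ),
  genotype = c(0, 1, 2, NA, 0, 0)
)

# Hypothetical helper: prevalence = samples with genotype 1 or 2 divided by
# samples with any call.
prevalence_from_genotype <- function(data) {
  called <- data[!is.na(data$genotype), ]
  n_total <- tapply(called$genotype, called$mutation_name, length)
  n_mutant <- tapply(called$genotype > 0, called$mutation_name, sum)
  data.frame(
    mutation_name = names(n_total),
    n_total = as.integer(n_total),
    n_mutant = as.integer(n_mutant),
    prevalence = as.numeric(n_mutant / n_total),
    row.names = NULL
  )
}

prevalence_from_genotype(genotypes)
```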

Improve mutation frequency error message when no reference UMI count exists

When the input data does not have the reference UMI count, the mutation frequency calculation fails and prints an ugly error message.

library(miplicorn)

data <- read_tbl_ref_alt_cov(
  miplicorn_example("reference_AA_table.csv"),
  miplicorn_example("alternate_AA_table.csv"),
  miplicorn_example("coverage_AA_table.csv"),
  gene == "atp6" | gene == "crt"
)

data <- dplyr::select(data, -ref_umi_count)

mutation_frequency(data, threshold = 5)
#> Error in `dplyr::filter()` at miplicorn/R/mutation-frequency.R:67:2:
#> ! Problem while computing `..2 = .data$alt_umi_count > threshold | ...`.
#> Caused by error in `.data$ref_umi_count`:
#> ! Column `ref_umi_count` not found in `.data`.

Created on 2022-07-13 by the reprex package (v2.0.1)

The lack of the ref_umi_count column should be caught earlier by the function and a nice error message should be printed.
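
An early check of this kind could look like the base R sketch below. `check_required_columns()` is a hypothetical helper and the wording of the message is only a suggestion.

```r
# Hypothetical helper: fail early, naming every missing column.
check_required_columns <- function(data, required) {
  missing_cols <- setdiff(required, colnames(data))
  if (length(missing_cols) > 0) {
    stop(
      "Data is missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# Example: a table without `ref_umi_count` triggers the friendly error.
incomplete <- data.frame(alt_umi_count = 3, coverage = 10)
try(check_required_columns(
  incomplete,
  c("ref_umi_count", "alt_umi_count", "coverage")
))
```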

Deprecate `chromosome_map()`

The function simply runs a check on the inputs and then redirects to one of the underlying plotting functions. In essence, it acts as a very simple function wrapper. Instead of having this extra function, we should deprecate chromosome_map() and keep the two underlying plotting functions.

Default plots for tables

Following the creation of classes for all read_tbl_*() functions (#39), it would be useful to create default plots to visualize the read data. To accomplish this, we can define autoplot() and plot() methods for each class. Some ideas for default plots follow:

  • For tables that contain UMI counts or coverage, a default plot could visualize counts/coverage via a bar plot.
  • For the combined reference, alternate, and coverage table, we can use a stacked bar plot to visualize counts.
  • For the genotype table, we could create a bar plot counting the number of calls of each genotype.

Note that it is not feasible to plot each sample as a unique point on the x-axis as data tends to contain too many samples to effectively visualize. Instead, we could ask users to select a grouping variable, e.g., mutation_name.
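
To make the idea concrete, here is a small base R sketch of a plot() method that draws mean coverage per level of a grouping variable. The class name `example_cov_tbl` and the constructor are hypothetical, and a real method would likely build on ggplot2 through autoplot().

```r
# Hypothetical class name; the package's real classes may differ.
new_cov_tbl <- function(x) {
  structure(x, class = c("example_cov_tbl", class(x)))
}

# plot() method: bar plot of mean coverage per level of `group`.
plot.example_cov_tbl <- function(x, group = "mutation_name", ...) {
  means <- tapply(x$coverage, x[[group]], mean)
  graphics::barplot(means, ylab = "Mean coverage", las = 2, ...)
  invisible(means)
}

tbl <- new_cov_tbl(data.frame(
  mutation_name = c("atp6-Ala623Glu", "atp6-Ala623Glu", "crt-Lys76Thr"),
  coverage = c(10, 20, 30)
))
plot(tbl)
```

Grouping before plotting keeps the x-axis to a handful of bars, regardless of how many samples the table holds.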

Formatting error for "mutation_name" in AN tables

When reading in AN tables, the "mutation_name" value incorrectly contains ":.:" where there should be a single colon, ":".

Example
chr11:637436:.:T:A (what occurs)
chr11:637436:T:A (what should occur)
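
Until the reader is fixed, a user-side workaround could be sketched as follows. `fix_mutation_name()` is a hypothetical helper, and it assumes the stray field is always the literal ":.:".

```r
# Hypothetical helper: collapse the literal ":.:" into a single colon.
fix_mutation_name <- function(x) {
  sub(":.:", ":", x, fixed = TRUE)
}

fix_mutation_name("chr11:637436:.:T:A")
#> [1] "chr11:637436:T:A"
```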

Thanks!

Sort data

Related Problem

When plotting data, it is often helpful to order the data to aid in visualization. In some cases we want to sort on a single column, while in others it is more useful to sort on multiple columns.

Solution Requested

Given the below data, it would be nice to be able to sort by multiple columns at once.

data <- tibble::tribble(
  ~sample, ~gene_id, ~gene, ~mutation_name, ~exonic_func, ~aa_change, ~targeted,
  "D10-23", "PF3D7_0106300", "atp6", "atp6-Ala623Glu", "missense_variant", "Ala623Glu", "Yes", 
  "D10-43", "PF3D7_0106300", "mdr1", "mdr1-Ala623Glu", "missense_variant", "Ala623Glu", "Yes", 
  "D10-55", "PF3D7_0106300", "atp6", "atp6-Ala623Glu", "missense_variant", "Ala623Glu", "Yes",
  "D10-5", "PF3D7_0106300", "mdr1", "mdr16-Ala623Glu", "missense_variant", "Ala623Glu", "Yes",
  "D10-47", "PF3D7_0106300", "dhps", "dhps-Ala623Glu", "missense_variant", "Ala623Glu", "Yes",
  "D10-15", "PF3D7_0106300", "atp6", "atp6-Ala623Glu", "missense_variant", "Ala623Glu", "Yes",
)

data
#> # A tibble: 6 × 7
#>   sample gene_id       gene  mutation_name   exonic_func      aa_change targeted
#>   <chr>  <chr>         <chr> <chr>           <chr>            <chr>     <chr>   
#> 1 D10-23 PF3D7_0106300 atp6  atp6-Ala623Glu  missense_variant Ala623Glu Yes     
#> 2 D10-43 PF3D7_0106300 mdr1  mdr1-Ala623Glu  missense_variant Ala623Glu Yes     
#> 3 D10-55 PF3D7_0106300 atp6  atp6-Ala623Glu  missense_variant Ala623Glu Yes     
#> 4 D10-5  PF3D7_0106300 mdr1  mdr16-Ala623Glu missense_variant Ala623Glu Yes     
#> 5 D10-47 PF3D7_0106300 dhps  dhps-Ala623Glu  missense_variant Ala623Glu Yes     
#> 6 D10-15 PF3D7_0106300 atp6  atp6-Ala623Glu  missense_variant Ala623Glu Yes

Solutions Considered

While a simple dplyr::arrange() can help for columns like gene, it becomes a bit more complicated if we want to sort by sample because the values contain both characters and numbers.
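
A base R sketch of one workaround is to split each sample ID at its last dash and order on the text prefix and the numeric suffix separately. `sample_key()` is a hypothetical helper and it assumes every ID ends in a dash followed by digits.

```r
data <- data.frame(
  sample = c("D10-23", "D10-43", "D10-55", "D10-5", "D10-47", "D10-15"),
  gene = c("atp6", "mdr1", "atp6", "mdr1", "dhps", "atp6")
)

# Hypothetical helper: split IDs at the last dash into prefix and number.
sample_key <- function(x) {
  list(
    prefix = sub("-[^-]*$", "", x),
    number = as.integer(sub(".*-", "", x))
  )
}

# Sort by gene first, then by sample in natural (numeric) order.
key <- sample_key(data$sample)
sorted <- data[order(data$gene, key$prefix, key$number), ]
sorted$sample
#> [1] "D10-15" "D10-23" "D10-55" "D10-47" "D10-5"  "D10-43"
```

A plain alphabetical sort would place "D10-5" after "D10-47"; sorting on the numeric suffix avoids that.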

Frequency of mutations

Related Problem

In #28, we computed the prevalence of mutations by counting the number of samples with a given mutation. Another related analysis would be to compute the frequency of mutations where we compute the weighted average of the alternate allele.

Solution Requested

The strategy employed would be very similar to #28. We would first create a table with the weighted average of the alternate allele and then create a function to plot the data.

Implement GDS

Implement GDS as the main data structure, with GDS querying replacing tidy parsing as the primary filtering mechanism, to allow scalability to large VCFs.

Involves:

  1. Add VCF to GDS T script to be run after MIPTools or as part of MIPTools
  2. Add relevant metadata to GDS object
  3. Make metadata adding and querying dynamic to allow new metadata field querying without new feature patch
  4. Replace tidy parsing functions with efficient GDS query functions
  5. Benchmark new dev version against original version
  6. Profile with the profvis code profiler to identify memory, time, and duplicate-process bottlenecks, and resolve them with alternative functions or logic

Copy Number Variation analysis

Related Problem

Solution Requested

We would like to detect increased copy number for plasmepsin 1/2 (PM1/PM2), mdr1, gch1.

As a starting point, it seems appropriate to count UMIs from genes that we know have a single copy and compare them with the UMIs from the gene we suspect has copy number variation. A two-fold increase in UMIs in that gene would then be called as two copies.
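
The ratio idea can be sketched in a few lines of base R. `estimate_copy_number()` is a hypothetical helper, the UMI counts are made up, and the sketch ignores probe efficiency and sample-to-sample normalisation.

```r
# Hypothetical helper: ratio of target UMIs to the mean UMI count of
# genes assumed to be single copy, rounded to the nearest whole number.
estimate_copy_number <- function(target_umis, single_copy_umis) {
  baseline <- mean(single_copy_umis)
  ratio <- target_umis / baseline
  c(ratio = ratio, copies = round(ratio))
}

# Made-up counts for one sample: baseline 100 UMIs, target 205 UMIs.
estimate_copy_number(target_umis = 205, single_copy_umis = c(100, 110, 90))
```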

A more complex solution that uses Hardy Weinberg would be useful for future builds.

Misformatted metadata causes incorrect output

Bug Description

If the metadata in a file is misformatted (i.e., does not contain six lines), read_file() and read() return incorrect output. Currently, we lose information and, instead of being treated as rows, samples are treated as columns.

# Metadata contains only four lines
misformatted <- tibble::tribble(
                         ~Gene,            ~atp6,           ~mdr1,
               "Mutation Name", "atp6-Ala623Glu", "mdr1-Asn86Tyr",
                   "AA Change",      "Ala623Glu",      "Asn86Tyr",
                    "Targeted",            "Yes",           "Yes",
                  "D10-JJJ-23",              "0",            "13",
                  "D10-JJJ-43",              "0",             "0",
                  "D10-JJJ-50",             "15",             "0"
               )

path <- tempfile()
readr::write_csv(misformatted, path)
MIPr::read_file(path)
#> # A tibble: 2 × 8
#>   sample     gene  mutation_name  aa_change targeted d10_jjj_23 d10_jjj_43 value
#>   <chr>      <chr> <chr>          <chr>     <chr>    <chr>      <chr>      <chr>
#> 1 D10-JJJ-50 atp6  atp6-Ala623Glu Ala623Glu Yes      0          0          15   
#> 2 D10-JJJ-50 mdr1  mdr1-Asn86Tyr  Asn86Tyr  Yes      13         0          0
unlink(path)

Created on 2021-09-14 by the reprex package (v2.0.1)

Expected Behavior

We would expect to see six rows in our final dataset with all the metadata represented as columns, as shown below:

MIPr::read_file(path)
#> # A tibble: 6 × 6
#>   sample     gene  mutation_name aa_change targeted value
#>   <chr>      <chr> <chr>         <chr>     <chr>    <chr>
#> 1 D10-JJJ-23 atp6  atp6-Ala623G… Ala623Glu Yes      0    
#> 2 D10-JJJ-43 atp6  atp6-Ala623G… Ala623Glu Yes      0    
#> 3 D10-JJJ-50 atp6  atp6-Ala623G… Ala623Glu Yes      15   
#> 4 D10-JJJ-23 mdr1  mdr1-Asn86Tyr Asn86Tyr  Yes      13   
#> 5 D10-JJJ-43 mdr1  mdr1-Asn86Tyr Asn86Tyr  Yes      0    
#> 6 D10-JJJ-50 mdr1  mdr1-Asn86Tyr Asn86Tyr  Yes      0

Sample ID format is changed when reading in data

Miplicorn read functions are great! I just need the functions to read in the table without making the sample names all lowercase and changing dashes to underscores. This would save me from having to change all the metadata files, etc.

Thanks!

Example of what happens:
ji_05_59_ug_sur_2020_1

What I need to keep them as:
JI-05-59-ug-sur-2020-01

Release miplicorn 0.2.1

Prepare for release:

  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Release miplicorn 0.1.0

First release:

Prepare for release:

  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE) Errors with Bioconductor packages and webshot2; not an issue as long as we don't submit to CRAN yet. Examples for chromosome_map.R take too long.
  • devtools::check_win_devel()
  • pkgdown::build_site()
  • Review pkgdown reference index for, e.g., missing topics

Ready to release:

  • Bump version (in DESCRIPTION and NEWS)
  • usethis::use_github_release()
  • usethis::use_dev_version()

Will not submit the initial release to CRAN.

read functions unable to process NAs

I am trying to read in MIPTools AA tables, both individually using miplicorn::read_tbl_reference() etc. and together using miplicorn::read_tbl_ref_alt_cov(), and I get the following error in both cases. I would like to be able to read all three tables into the function and have them reformatted in the output, as the earlier miplicorn::read() function did.

Error: Error: Names repair functions can't return NA values.

Original input:
data <- miplicorn::read_tbl_ref_alt_cov(
  "~/R/extended-haplotype-Uganda/data/hap3files/DR2_reference_AA_table_hap3.csv",
  "~/R/extended-haplotype-Uganda/data/hap3files/DR2_alternate_AA_table_hap3.csv",
  "~/R/extended-haplotype-Uganda/data/hap3files/DR2_coverage_AA_table_hap3.csv"
)

and individually

ref<-miplicorn::read_tbl_reference("~/R/extended-haplotype-Uganda/data/hap3files/DR2_reference_AA_table_hap3.csv")

Thank you!

Issues with table configuration after using 'read In' Functions for ref/alt/coverage tables

Hi!
The read-in function appears to be mixing up the alt, cov, and ref columns. Right now my alt and coverage columns have their values switched. I checked against the table outputs directly from MIPTools to be sure.

Here is my code block.
ref <- "~/R/extended-haplotype-Uganda/data/hap3files/reference_AN_table_IBC.csv"
cov <- "~/R/extended-haplotype-Uganda/data/hap3files/coverage_AN_table_IBC.csv"
alt <- "~/R/extended-haplotype-Uganda/data/hap3files/alternate_AN_table_IBC.csv"
data <- miplicorn::read_tbl_ref_alt_cov(ref, cov, alt, targeted == "Yes")

I am wondering whether the ref/alt/cov inputs need to be given in a specific order and, if so, whether this is documented.

Thanks so much!

Apply threshold to alternate UMI count

Bug Description

When we compute the frequency and prevalence of mutations, we enforce a minimum UMI count which reflects the confidence in the genotype call.

When we compute the prevalence of mutations, we also use a threshold to filter our alternate UMI counts and ensure we are confident that a mutation was actually a mutation.

total <- dplyr::filter(
  data,
  .data$coverage > threshold &
    (.data$alt_umi_count > threshold | .data$ref_umi_count > threshold)
)
mutant_data <- dplyr::filter(total, .data$alt_umi_count > threshold)

When we compute the frequency of mutations, we do not ensure that the alternate UMI count is greater than the threshold. As a result, if we have a single UMI count this will show up in our results, even though it is more likely to be due to error.

# Compute weighted average of the alt umi count
wt_average <- data %>%
  dplyr::filter(.data$coverage > threshold) %>%
  dplyr::mutate(alt_freq = .data$alt_umi_count / .data$coverage) %>%
  dplyr::group_by(.data$mutation_name) %>%
  dplyr::summarise(
    frequency = sum(.data$alt_freq * .data$coverage) / sum(.data$coverage)
  )

Expected Behavior

I think we should instead be checking the threshold for the alternate UMI counts as well. If the alternate UMI count is less than the threshold, then we should set the frequency of the alternate allele to zero.
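
A base R sketch of the proposed behaviour follows. It relies on the fact that sum(alt_freq * coverage) / sum(coverage) reduces to sum(alt_umi_count) / sum(coverage), and it treats alternate counts that do not exceed the threshold as zero, mirroring the `>` comparison used for prevalence. `mutation_frequency_sketch()` is a hypothetical name and the data are made up.

```r
# Hypothetical sketch: weighted alternate-allele frequency in which
# alternate UMI counts at or below the threshold contribute zero.
mutation_frequency_sketch <- function(data, threshold) {
  kept <- data[data$coverage > threshold, ]
  alt <- ifelse(kept$alt_umi_count > threshold, kept$alt_umi_count, 0)
  alt_sum <- tapply(alt, kept$mutation_name, sum)
  cov_sum <- tapply(kept$coverage, kept$mutation_name, sum)
  data.frame(
    mutation_name = names(alt_sum),
    frequency = as.numeric(alt_sum / cov_sum),
    row.names = NULL
  )
}

# Made-up counts: the single alternate UMI in row 1 is ignored and
# row 3 is dropped for low coverage, giving 8 / 20 = 0.4.
example <- data.frame(
  mutation_name = "crt-Lys76Thr",
  alt_umi_count = c(1, 8, 0),
  coverage = c(10, 10, 2)
)
mutation_frequency_sketch(example, threshold = 5)
```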

Mutation prevalence tables and plots

Solution Requested

I'd like to add a feature that produces a visually appealing table and bar graph summarizing the prevalence of key drug resistance mutations. To do this, we need to read in the mutant AA table, the reference AA table, and the coverage AA table. The user sets a threshold (e.g., a minimum UMI count of 3 and coverage of 3) that reflects the confidence we have in the genotype call. The user can also add metadata and specify groupings to summarize by (e.g., region or country). With ref, alt, and cov in a long-format (pivot_longer) tibble, we then remove rows below this threshold and summarize by mutation. We then produce a table with the overall prevalence of each key mutation, with sections below giving the prevalence by region. The plot will look much like the coverage plot in the vignette, but the y-axis will show the prevalence calculated as above.

Filtering functions

Related Problem

In many situations, it becomes helpful, if not necessary, to filter data before conducting analyses or creating figures. There are several filtering steps that we could take, but some are used almost every time. For instance, to filter based on coverage we use the following:

library(miplicorn)
library(dplyr)

ref_file <- miplicorn_example("reference_AA_table.csv")
alt_file <- miplicorn_example("alternate_AA_table.csv")
cov_file <- miplicorn_example("coverage_AA_table.csv")

table <- read_tbl_ref_alt_cov(ref_file, alt_file, cov_file)

dplyr::filter(table, coverage > 3)
#> # A tibble: 3,847 × 10
#>    sample     gene_id       gene  mutation_name  exonic_func  aa_change targeted
#>    <chr>      <chr>         <chr> <chr>          <chr>        <chr>     <chr>   
#>  1 D10-JJJ-23 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  2 D10-JJJ-43 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  3 D10-JJJ-55 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  4 D10-JJJ-15 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  5 D10-JJJ-28 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  6 D10-JJJ-52 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  7 D10-JJJ-13 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  8 D10-JJJ-1  PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  9 D10-JJJ-45 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#> 10 D10-JJJ-38 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#> # … with 3,837 more rows, and 3 more variables: ref_umi_count <dbl>,
#> #   alt_umi_count <dbl>, coverage <dbl>

Created on 2021-11-30 by the reprex package (v2.0.1)

Solution Requested

Currently, we must use dplyr::filter() to conduct these filtering steps. It may be beneficial to users to have a family of functions for filtering. These functions will likely just call dplyr::filter() under the hood, but if we specify individual functions, it may make it clearer which steps users should consider in the first place. The following, for instance, makes it explicit that the user is filtering based on the coverage.

filter_coverage(table, 3)
#> # A tibble: 3,847 × 10
#>    sample     gene_id       gene  mutation_name  exonic_func  aa_change targeted
#>    <chr>      <chr>         <chr> <chr>          <chr>        <chr>     <chr>   
#>  1 D10-JJJ-23 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  2 D10-JJJ-43 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  3 D10-JJJ-55 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  4 D10-JJJ-15 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  5 D10-JJJ-28 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  6 D10-JJJ-52 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  7 D10-JJJ-13 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  8 D10-JJJ-1  PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#>  9 D10-JJJ-45 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#> 10 D10-JJJ-38 PF3D7_0106300 atp6  atp6-Ala623Glu missense_va… Ala623Glu Yes     
#> # … with 3,837 more rows, and 3 more variables: ref_umi_count <dbl>,
#> #   alt_umi_count <dbl>, coverage <dbl>

Useful Filtering Functions

  • filter_coverage()
  • filter_umi_count()
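
A minimal sketch of what `filter_coverage()` might look like is given below. It uses base R subsetting so that it runs on its own; a version that calls dplyr::filter(), as suggested above, would behave the same way on this input. The implementation and the example data are assumptions.

```r
# Hypothetical implementation: keep rows whose coverage exceeds `threshold`.
# which() drops rows where coverage is NA instead of returning NA rows.
filter_coverage <- function(data, threshold) {
  stopifnot("coverage" %in% colnames(data))
  data[which(data$coverage > threshold), , drop = FALSE]
}

# Made-up example table.
example <- data.frame(
  sample = c("D10-JJJ-23", "D10-JJJ-43", "D10-JJJ-55", "D10-JJJ-15"),
  coverage = c(2, 10, NA, 4)
)
filter_coverage(example, 3)$sample
#> [1] "D10-JJJ-43" "D10-JJJ-15"
```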

Poor performance of amino acid conversion functions

In their current form, convert_single() and convert_three() take a long time when used in combination with dplyr::mutate() on a data frame. For example, the following data frame with 1,000 rows takes around a minute:

library(miplicorn)
cov_file <- miplicorn_example("coverage_AA_table.csv")
data <- read_tbl_coverage(cov_file) %>% dplyr::slice_sample(n = 1000)
bench::system_time(dplyr::mutate(data, aa_change = convert_three(aa_change)))
#> process    real 
#>   1.03m   1.03m

Created on 2022-07-06 by the reprex package (v2.0.1)

Given the large size of genomic data sets, it would be beneficial to improve the performance of these functions.
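
One generic way to cut the cost, sketched below in base R, is to convert each distinct value only once and map the results back to the full column. `convert_once()` is a hypothetical wrapper, it assumes the converter returns one output per input, and `toupper()` stands in for the slow conversion function.

```r
# Hypothetical wrapper: run `converter` on the unique values only,
# then map the results back onto the original positions.
convert_once <- function(x, converter) {
  unique_values <- unique(x)
  converter(unique_values)[match(x, unique_values)]
}

# `toupper()` is a stand-in for a slow conversion function.
convert_once(c("ala", "glu", "ala", "ala"), toupper)
#> [1] "ALA" "GLU" "ALA" "ALA"
```

Columns such as aa_change repeat the same values across many samples, so the number of conversions drops from one per row to one per distinct value.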

Enhanced chromosome map

Current Implementation

The current implementation of chromosome_map() provides a fantastic overview of probe coverage. However, it is limited in some regards:

  1. There is not enough resolution to determine individual probes
  2. Probes that overlap can hide information depending on the order the probes are plotted.

[Figure: chromosome map generated with chromoMap]

Suggested Implementation

While useful for visualizing and understanding probe coverage, the current figures are constrained by the limitations described above. As such, a plot showing overlapping regions and each individual probe would be beneficial. An example using the package {karyoploteR} is shown below:

[Figure: chromosome map generated with {karyoploteR}]

Suggested Implementation

As it is informative to have both plots, the user could select a plot using a parameter, say, plotting_software or plotting_package. As both of these figures rely on other packages, {miplicorn} won't be able to provide full customizability, but can provide a base set of images. For additional flexibility, users can refer to the underlying packages.
