Git Product home page Git Product logo

Comments (8)

mtmorgan avatar mtmorgan commented on August 27, 2024

Can you please provide a reproducible example of the problem?

from genomeinfodb.

hpages avatar hpages commented on August 27, 2024

I don't see that MutationalPatterns::read_vcfs_as_granges calls "our function" extractSeqLevelsByGroup. I don't know what extractSeqLevelsByGroup is or what it does. AFAICT there is no such function in the GenomeInfoDb package and I don't see one in the MutationalPatterns package either (looking at the master branch only).

So yes please follow Martin's advice as it's hard to help you without having a little bit more details about what's going on.

FWIW I'll take a guess that the problem you're running into is caused by the exotic chromosome names in the Macaca_fascicularis_5.0 assembly:

> library(BSgenome.Mfascicularis.NCBI.5.0)
> seqlevels(BSgenome.Mfascicularis.NCBI.5.0)[1:30]
 [1] "MFA1"         "MFA2"         "MFA3"         "MFA4"         "MFA5"        
 [6] "MFA6"         "MFA7"         "MFA8"         "MFA9"         "MFA10"       
[11] "MFA11"        "MFA12"        "MFA13"        "MFA14"        "MFA15"       
[16] "MFA16"        "MFA17"        "MFA18"        "MFA19"        "MFA20"       
[21] "MFAX"         "MT"           "Scaffold1372" "Scaffold1442" "Scaffold1519"
[26] "Scaffold3048" "Scaffold3501" "Scaffold27"   "Scaffold47"   "Scaffold56"  

These names would certainly confuse GenomeInfoDb::seqlevelsStyle() and maybe MutationalPatterns::read_vcfs_as_granges() makes overly optimistic assumptions about the capabilities of GenomeInfoDb::seqlevelsStyle(). I don't know, I'm not familiar with the MutationalPatterns package.

To rename the chromosomes and scaffolds to the more standard UCSC naming scheme, you can do:

chrominfo <- getChromInfoFromNCBI("Macaca_fascicularis_5.0")

head(chrominfo)
#   SequenceName       SequenceRole AssignedMolecule GenBankAccn Relationship
# 1         MFA1 assembled-molecule             <NA>  CM001919.1            =
# 2         MFA2 assembled-molecule             <NA>  CM001920.1            =
# 3         MFA3 assembled-molecule             <NA>  CM001921.1            =
# 4         MFA4 assembled-molecule             <NA>  CM001922.1            =
# 5         MFA5 assembled-molecule             <NA>  CM001923.1            =
# 6         MFA6 assembled-molecule             <NA>  CM001924.1            =
#    RefSeqAccn     AssemblyUnit SequenceLength UCSCStyleName circular
# 1 NC_022272.1 Primary Assembly      227556264          chr1       NA
# 2 NC_022273.1 Primary Assembly      192460366          chr2       NA
# 3 NC_022274.1 Primary Assembly      192294377          chr3       NA
# 4 NC_022275.1 Primary Assembly      170955103          chr4       NA
# 5 NC_022276.1 Primary Assembly      189454096          chr5       NA
# 6 NC_022277.1 Primary Assembly      181584905          chr6       NA

seqlevels(BSgenome.Mfascicularis.NCBI.5.0) <- setNames(chrominfo$UCSCStyleName,
                                                       chrominfo$SequenceName)

seqlevels(BSgenome.Mfascicularis.NCBI.5.0)[1:30]
#  [1] "chr1"                     "chr2"                    
#  [3] "chr3"                     "chr4"                    
#  [5] "chr5"                     "chr6"                    
#  [7] "chr7"                     "chr8"                    
#  [9] "chr9"                     "chr10"                   
# [11] "chr11"                    "chr12"                   
# [13] "chr13"                    "chr14"                   
# [15] "chr15"                    "chr16"                   
# [17] "chr17"                    "chr18"                   
# [19] "chr19"                    "chr20"                   
# [21] "chrX"                     "chrM"                    
# [23] "chr1_AQIA01037047_random" "chr1_AQIA01037052_random"
# [25] "chr1_AQIA01037053_random" "chr1_AQIA01037091_random"
# [27] "chr1_AQIA01037098_random" "chr1_KE145507_random"    
# [29] "chr1_KE145508_random"     "chr1_KE145509_random"    

Then try MutationalPatterns::read_vcfs_as_granges() again and see if that helps.

from genomeinfodb.

MuShuw avatar MuShuw commented on August 27, 2024

Hello,
I looked to verify if I didn't make an error, but extractSeqlevelsByGroup is a function of GenomeInfoDb.
It can be found here : https://github.com/Bioconductor/GenomeInfoDb/blob/master/R/seqlevelsStyle.R#L222

I was allowed to share a bit of a vcf file that can be found here : https://github.com/MuShuw/sharable/blob/master/test.vcf

We use R 3.6.5 on the project.

The next lines can reproduce the error :

ref_genome <- "BSgenome.Mfascicularis.NCBI.5.0"
library(MutationalPatterns)
library(GenomeInfoDb)
library(ref_genome, character.only = TRUE)
path_dat <- '.' # wherever are the vcfs
vcf_files <- list.files(path_dat, pattern = ".vcf", full.names = TRUE)
sample_names <- c("test")
vcfs <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)

At that point appears the next error :

Error in extractSeqlevelsByGroup(species = ref_organism, style = ref_style, :
The style specified by 'NCBIEnsembl' does not have a compatible entry for the species Macaca fascicularis

The traceback give this :

6: stop("The style specified by '", style, "' does not have a compatible entry for the species ", 
       species)
5: extractSeqlevelsByGroup(species = ref_organism, style = ref_style, 
       group = "auto")
4: FUN(X[[i]], ...)
3: lapply(X = X, FUN = FUN, ...)
2: mclapply(seq_along(vcf_files), function(index) {
       file <- vcf_files[index]
       vcf <- rowRanges(readVcf(file, genome_name))
       seqlevelsStyle(vcf) <- ref_style[1]
       groups <- c()
       if (group != "none") {
           if (group == "auto+sex") {
               groups <- c(extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = "auto"), extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = "sex"))
               groups_names <- names(groups)
               if (!is.null(groups_names)) {
                   unique_names <- unique(groups_names)
                   groups <- llply(unique_names, function(x) groups[groups_names == 
                     x])
                   groups <- llply(groups, unlist, recursive = FALSE)
                   groups <- unique(as.vector(groups[[1]]))
               }
           }
           else {
               groups <- extractSeqlevelsByGroup(species = ref_organism, 
                   style = ref_style, group = group)
               groups <- unique(as.vector(t(groups)))
           }
           groups <- intersect(groups, seqlevels(vcf))
           vcf <- keepSeqlevels(vcf, groups, pruning.mode = "tidy")
       }
       if (check_alleles) {
           rem <- which(all(!(!is.na(match(vcf$ALT, DNA_BASES)) & 
               !is.na(match(vcf$REF, DNA_BASES)) & (lengths(vcf$ALT) == 
               1))))
           if (length(rem) > 0) {
               vcf = vcf[-rem]
               warnings <- c(warnings, paste(length(rem), "position(s) with indels and/or multiple", 
                   "alternative alleles are excluded from", paste(sample_names[[index]], 
                     ".", sep = "")))
           }
       }
       return(list(vcf, warnings))
   }, mc.cores = num_cores)
1: read_vcfs_as_granges(vcf_files, "test", ref_genome)

Concerning the proposed solution using getChromInfoFromNCBI, I have not been able to use that function despite loading the library.

Thanks again for your help.

Elyas.

from genomeinfodb.

hpages avatar hpages commented on August 27, 2024

Yes I see now that GenomeInfoDb has an extractSeqlevelsByGroup function. I originally missed it because you reported it with the incorrect spelling (extractSeqLevelsByGroup) (case matters in R).

If you're using R 3.6.5 that means you're using an old and unsupported version of Bioconductor. The current version (BioC 3.11) is only supported on R 4.0. Bioconductor evolves quickly and getChromInfoFromNCBI is one of the many things that were introduced in BioC 3.11.

FWIW I'm currently working on making the switch between NCBI and UCSC style easier for BSgenome.Mfascicularis.NCBI.5.0 and other BSgenome packages. The goal is to be able to just do:

genome <- BSgenome.Mfascicularis.NCBI.5.0
seqlevelsStyle(genome) <- "UCSC"

This will only be available in BioC >= 3.12 though (BioC 3.12 is the current development version).

I strongly recommend that you update your installation to R 4.0 + BioC 3.11.

Best,
H.

from genomeinfodb.

hpages avatar hpages commented on August 27, 2024

@MuShuw Have you made progress on this?

Heads up: with BSgenome 1.57.3 you can now switch the style of all the sequence names in BSgenome.Mfascicularis.NCBI.5.0 with just:

library(BSgenome.Mfascicularis.NCBI.5.0)
genome <- BSgenome.Mfascicularis.NCBI.5.0

seqinfo(genome)
# Seqinfo object with 7601 sequences (1 circular) from Macaca_fascicularis_5.0 genome:
#   seqnames     seqlengths isCircular                  genome
#   MFA1          227556264      FALSE Macaca_fascicularis_5.0
#   MFA2          192460366      FALSE Macaca_fascicularis_5.0
#   MFA3          192294377      FALSE Macaca_fascicularis_5.0
#   MFA4          170955103      FALSE Macaca_fascicularis_5.0
#   MFA5          189454096      FALSE Macaca_fascicularis_5.0
#   ...                 ...        ...                     ...
#   Scaffold7531       5581      FALSE Macaca_fascicularis_5.0
#   Scaffold7558      21004      FALSE Macaca_fascicularis_5.0
#   Scaffold7563       5184      FALSE Macaca_fascicularis_5.0
#   Scaffold7587       8082      FALSE Macaca_fascicularis_5.0
#   Scaffold7621       7181      FALSE Macaca_fascicularis_5.0

seqlevelsStyle(genome)
# [1] "NCBI"

seqlevelsStyle(genome) <- "UCSC"  # switch style

Then:

seqinfo(genome)
# Seqinfo object with 7601 sequences (1 circular) from macFas5 genome:
#   seqnames       seqlengths isCircular  genome
#   chr1            227556264      FALSE macFas5
#   chr2            192460366      FALSE macFas5
#   chr3            192294377      FALSE macFas5
#   chr4            170955103      FALSE macFas5
#   chr5            189454096      FALSE macFas5
#   ...                   ...        ...     ...
#   chrUn_KE147191       5581      FALSE macFas5
#   chrUn_KE147192      21004      FALSE macFas5
#   chrUn_KE147193       5184      FALSE macFas5
#   chrUn_KE147194       8082      FALSE macFas5
#   chrUn_KE147195       7181      FALSE macFas5

Hope this helps,
H.

from genomeinfodb.

MuShuw avatar MuShuw commented on August 27, 2024

@hpages Hello,

We haven't had R 4.0 installed on the server yet. I will give you an update once the former solution will be tested.

BSgenome 1.57.3 is the lastest version and that last solution is the switch you were working on if understand porperly ? If so I guess I should go for this solution first once BioC 3.11 is available to me ?

Thank you for your help Hervé.

Regards,
Elyas.

from genomeinfodb.

hpages avatar hpages commented on August 27, 2024

Hi @MuShuw,

Just to clarify the recent improvements to the seqlevelsStyle() setter are in the master branch of BSgenome (in version >= 1.57.3). The master branch is what goes in BioC 3.12 which is the current development version of Bioconductor. Since the improvements also introduce some subtle changes of semantic I'm not planning to backport them to the RELEASE_3_11 branch of BSgenome. In other words, these improvements won't be available in BioC 3.11 (the latest and most current Bioconductor release).

Best,
H.

from genomeinfodb.

hpages avatar hpages commented on August 27, 2024

Hi @MuShuw , is this working for you now? Thanks! H.

from genomeinfodb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.