bioconductor / genomeinfodb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style

Home Page: https://bioconductor.org/packages/GenomeInfoDb

genomeinfodb's Introduction

GenomeInfoDb is an R/Bioconductor package that provides utilities for manipulating chromosome names, including modifying them to follow a particular naming style.

See https://bioconductor.org/packages/GenomeInfoDb for more information including how to install the release version of the package (please refrain from installing directly from GitHub).

genomeinfodb's Issues

seqlevelsStyle time out

Hello,

I am trying to analyze ATAC-seq data using the Seurat package, but the analysis always gets stuck at seqlevelsStyle(annotation) <- "UCSC".

I have updated R to 4.1 and GenomeInfoDb to 1.30.0, but the same problem still occurs. Could you please help me with it?

The detailed information is attached below.

Many thanks!

Zhaozhe

Below is detailed information:

code:

library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v86)

counts <- Read10X_h5("data/scAtac/10x_example/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
fragpath <- "data/scAtac/10x_example/atac_v1_pbmc_10k_fragments.tsv.gz"

annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotation) <- "UCSC"
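
For readers following the traceback further down: its "UCSC" branch fetches a chromosome table with getChromInfoFromUCSC(new_genome, map.NCBI = TRUE), which appears to be where the connection times out, and then translates names by matching first on NCBI.SequenceName and falling back to NCBI.RefSeqAccn. The sketch below reproduces only that matching step in base R, with no network access. The table is a small hand-written stand-in rather than data fetched from UCSC, and to_ucsc is a hypothetical helper name, not a GenomeInfoDb function.

```r
# Toy stand-in for the table returned by
# getChromInfoFromUCSC(new_genome, map.NCBI = TRUE).
# Assumption: four illustrative rows only; the real table is much larger.
chrominfo <- data.frame(
  chrom             = c("chr1", "chr2", "chrX", "chrM"),
  NCBI.SequenceName = c("1", "2", "X", "MT"),
  NCBI.RefSeqAccn   = c("NC_000001.11", "NC_000002.12",
                        "NC_000023.11", "NC_012920.1"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper mirroring the matching lines of the "UCSC" branch
# shown in the traceback: match on the NCBI sequence name first, then
# fall back to the RefSeq accession for names that did not match.
to_ucsc <- function(seqlevels, chrominfo) {
  m  <- match(seqlevels, chrominfo[, "NCBI.SequenceName"])
  m2 <- match(seqlevels, chrominfo[, "NCBI.RefSeqAccn"])
  m[is.na(m)] <- m2[is.na(m)]
  chrominfo[m, "chrom"]
}

# A mix of sequence names, a RefSeq accession, and a name absent from the table.
new_levels <- to_ucsc(c("1", "NC_000023.11", "MT", "KI270728.1"), chrominfo)
print(new_levels)
```

Names that match neither column come back as NA, which corresponds to the "cannot switch some of ... seqlevels" warning path visible in the traceback.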

Error message:
Error in function (type, msg, asError = TRUE) : Connection timed out after 300498 milliseconds
Traceback:

seqlevelsStyle<-(*tmp*, value = "UCSC")
seqlevelsStyle<-(*tmp*, value = "UCSC")
seqlevelsStyle<-(*tmp*, value = value)
seqlevelsStyle<-(*tmp*, value = value)
mapply(.set_seqlevelsStyle_from_seqlevels_and_genome, genome2seqlevels,
. names(genome2seqlevels), MoreArgs = list(value), SIMPLIFY = FALSE,
. USE.NAMES = FALSE)
standardGeneric("mapply")
eval(mc, env)
eval(mc, env)
eval(mc, env)
mapply(FUN = FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = SIMPLIFY,
. USE.NAMES = USE.NAMES)
(function (seqlevels, genome, new_style)
. {
. ans <- DataFrame(seqlevels = seqlevels, genome = genome)
. if (is.na(genome) || !(new_style %in% c("NCBI", "RefSeq",
. "UCSC"))) {
. seqlevelsStyle(ans[, "seqlevels"]) <- new_style
. return(ans)
. }
. ans <- DataFrame(seqlevels = seqlevels, genome = genome)
. old_style <- .is_NCBI_assembly_or_UCSC_genome(genome)
. if (is.na(old_style)) {
. seqlevelsStyle(ans[, "seqlevels"]) <- new_style
. return(ans)
. }
. if (old_style == "NCBI")
. old_style <- .get_seqlevelsStyle_for_NCBI_seqlevels(seqlevels)
. if (identical(new_style, old_style))
. return(ans)
. if (new_style == "UCSC") {
. new_genome <- .map_NCBI_assembly_to_UCSC_genome(genome)
. if (is.na(new_genome)) {
. warning(wmsg("cannot switch ", genome, "'s seqlevels ",
. "to ", new_style, " style"))
. return(ans)
. }
. chrominfo <- getChromInfoFromUCSC(new_genome, map.NCBI = TRUE)
. SequenceName <- chrominfo[, "NCBI.SequenceName"]
. RefSeqAccn <- chrominfo[, "NCBI.RefSeqAccn"]
. UCSC_seqlevels <- chrominfo[, "chrom"]
. m <- match(seqlevels, SequenceName)
. m2 <- match(seqlevels, RefSeqAccn)
. m[is.na(m)] <- m2[is.na(m)]
. new_seqlevels <- UCSC_seqlevels[m]
. }
. else if (identical(old_style, "UCSC")) {
. new_genome <- .map_UCSC_genome_to_NCBI_assembly(genome)
. if (is.na(new_genome)) {
. warning(wmsg("cannot switch ", genome, "'s seqlevels ",
. "from ", old_style, " to ", new_style, " style"))
. return(ans)
. }
. chrominfo <- getChromInfoFromUCSC(genome, map.NCBI = TRUE)
. UCSC_seqlevels <- chrominfo[, "chrom"]
. if (new_style == "NCBI") {
. NCBI_seqlevels <- chrominfo[, "NCBI.SequenceName"]
. }
. else {
. NCBI_seqlevels <- chrominfo[, "NCBI.RefSeqAccn"]
. }
. m <- match(seqlevels, UCSC_seqlevels)
. new_seqlevels <- NCBI_seqlevels[m]
. }
. else {
. chrominfo <- getChromInfoFromNCBI(genome)
. SequenceName <- chrominfo[, "SequenceName"]
. RefSeqAccn <- chrominfo[, "RefSeqAccn"]
. if (new_style == "RefSeq") {
. m <- match(seqlevels, SequenceName)
. new_seqlevels <- RefSeqAccn[m]
. }
. else {
. m <- match(seqlevels, RefSeqAccn)
. new_seqlevels <- SequenceName[m]
. }
. new_genome <- genome
. }
. replace_idx <- which(!is.na(new_seqlevels))
. if (length(replace_idx) == 0L) {
. warning(wmsg("cannot switch ", genome, "'s seqlevels ",
. "from ", old_style, " to ", new_style, " style"))
. return(ans)
. }
. if (length(replace_idx) < length(new_seqlevels))
. warning(wmsg("cannot switch some of ", genome, "'s seqlevels ",
. "from ", old_style, " to ", new_style, " style"))
. ans[replace_idx, "seqlevels"] <- new_seqlevels[replace_idx]
. ans[replace_idx, "genome"] <- new_genome
. ans
. })(dots[[1L]][[1L]], dots[[2L]][[1L]], "UCSC")
getChromInfoFromUCSC(new_genome, map.NCBI = TRUE)
.get_chrom_info_for_registered_UCSC_genome(script_path, assembled.molecules.only = assembled.molecules.only,
. map.NCBI = map.NCBI, add.ensembl.col = add.ensembl.col, goldenPath.url = goldenPath.url,
. recache = recache)
do.call(.add_NCBI_cols_to_UCSC_chrom_info, c(list(ans), NCBI_linker))
do.call(.add_NCBI_cols_to_UCSC_chrom_info, c(list(ans), NCBI_linker))
(function (UCSC_chrom_info, assembly_accession, AssemblyUnits = NULL,
. special_mappings = NULL, unmapped_seqs = NULL, drop_unmapped = FALSE)
. {
. UCSC_seqlevels <- UCSC_chrom_info[, "chrom"]
. if (length(unmapped_seqs) != 0L) {
. unmapped_seqs_role <- rep.int(names(unmapped_seqs), lengths(unmapped_seqs))
. unmapped_seqs <- unlist(unmapped_seqs, use.names = FALSE)
. unmapped_idx <- match(unmapped_seqs, UCSC_seqlevels)
. stopifnot(!anyNA(unmapped_idx))
. }
. NCBI_chrom_info <- getChromInfoFromNCBI(assembly_accession,
. assembly.units = AssemblyUnits)
. NCBI_seqlevels <- NCBI_chrom_info[, "SequenceName"]
. NCBI_UCSCStyleName <- NCBI_chrom_info[, "UCSCStyleName"]
. NCBI_GenBankAccn <- NCBI_chrom_info[, "GenBankAccn"]
. NCBI_RefSeqAccn <- NCBI_chrom_info[, "RefSeqAccn"]
. L2R <- .map_UCSC_seqlevels_to_NCBI_seqlevels(UCSC_seqlevels,
. NCBI_seqlevels, NCBI_UCSCStyleName, NCBI_GenBankAccn,
. NCBI_RefSeqAccn, special_mappings = special_mappings)
. L2R_is_NA <- is.na(L2R)
. mapped_idx <- which(!L2R_is_NA)
. if (length(unmapped_seqs) != 0L)
. stopifnot(all(L2R_is_NA[unmapped_idx]))
. if (isTRUE(drop_unmapped)) {
. UCSC_chrom_info <- S4Vectors:::extract_data_frame_rows(UCSC_chrom_info,
. mapped_idx)
. L2R <- L2R[mapped_idx]
. mapped_idx <- seq_along(L2R)
. idx <- setdiff(seq_along(NCBI_seqlevels), L2R)
. if (length(idx) != 0L) {
. in1string <- paste0(NCBI_seqlevels[idx], collapse = ", ")
. stop(wmsg("no UCSC seqlevel could be mapped to the following ",
. "NCBI seqlevel(s): ", in1string))
. }
. }
. else {
. unexpectedly_unmapped_idx <- which(L2R_is_NA & !(UCSC_seqlevels %in%
. unmapped_seqs))
. if (length(unexpectedly_unmapped_idx) != 0L) {
. in1string <- paste0(UCSC_seqlevels[unexpectedly_unmapped_idx],
. collapse = ", ")
. stop(wmsg("cannot map the following UCSC seqlevel(s) to an ",
. "NCBI seqlevel: ", in1string))
. }
. }
. NCBI_chrom_info <- S4Vectors:::extract_data_frame_rows(NCBI_chrom_info,
. L2R)
. stopifnot(identical(UCSC_chrom_info[mapped_idx, "circular"],
. NCBI_chrom_info[mapped_idx, "circular"]))
. compare_idx <- which(!is.na(NCBI_chrom_info[, "SequenceLength"]))
. stopifnot(identical(UCSC_chrom_info[compare_idx, "size"],
. NCBI_chrom_info[compare_idx, "SequenceLength"]))
. if (assembly_accession != "GCF_000001405.25") {
. compare_idx <- which(!is.na(NCBI_chrom_info[, "UCSCStyleName"]))
. x <- UCSC_chrom_info[compare_idx, "chrom"]
. y <- NCBI_chrom_info[compare_idx, "UCSCStyleName"]
. if (assembly_accession == "GCF_000001635.26") {
. i1 <- match("chr9_KB469738_fix", x)
. i2 <- match("chr9_KB469738v3_fix", y)
. stopifnot(isFALSE(is.na(i1)), isFALSE(is.na(i2)),
. identical(i1, i2))
. x <- x[-i1]
. y <- y[-i2]
. }
. stopifnot(identical(x, y))
. }
. drop_columns <- c("SequenceLength", "UCSCStyleName", "circular")
. NCBI_chrom_info <- drop_cols(NCBI_chrom_info, drop_columns)
. colnames(NCBI_chrom_info) <- paste0("NCBI.", colnames(NCBI_chrom_info))
. ans <- cbind(UCSC_chrom_info, NCBI_chrom_info)
. if (length(unmapped_seqs) != 0L)
. ans[unmapped_idx, "NCBI.SequenceRole"] <- unmapped_seqs_role
. stopifnot(!is.unsorted(ans[, "NCBI.SequenceRole"]))
. ans
. })(structure(list(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5",
. "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13",
. "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20",
. "chr21", "chr22", "chrX", "chrY", "chrM", "chr1_GL383518v1_alt",
. "chr1_GL383519v1_alt", "chr1_GL383520v2_alt", "chr1_KI270759v1_alt",
. "chr1_KI270760v1_alt", "chr1_KI270761v1_alt", "chr1_KI270762v1_alt",
. "chr1_KI270763v1_alt", "chr1_KI270764v1_alt", "chr1_KI270765v1_alt",
. "chr1_KI270766v1_alt", "chr1_KI270892v1_alt", "chr2_GL383521v1_alt",
. "chr2_GL383522v1_alt", "chr2_GL582966v2_alt", "chr2_KI270767v1_alt",
. "chr2_KI270768v1_alt", "chr2_KI270769v1_alt", "chr2_KI270770v1_alt",
. "chr2_KI270771v1_alt", "chr2_KI270772v1_alt", "chr2_KI270773v1_alt",
. "chr2_KI270774v1_alt", "chr2_KI270775v1_alt", "chr2_KI270776v1_alt",
. "chr2_KI270893v1_alt", "chr2_KI270894v1_alt", "chr3_GL383526v1_alt",
. "chr3_JH636055v2_alt", "chr3_KI270777v1_alt", "chr3_KI270778v1_alt",
. "chr3_KI270779v1_alt", "chr3_KI270780v1_alt", "chr3_KI270781v1_alt",
. "chr3_KI270782v1_alt", "chr3_KI270783v1_alt", "chr3_KI270784v1_alt",
. "chr3_KI270895v1_alt", "chr3_KI270924v1_alt", "chr3_KI270934v1_alt",
. "chr3_KI270935v1_alt", "chr3_KI270936v1_alt", "chr3_KI270937v1_alt",
. "chr4_GL000257v2_alt", "chr4_GL383527v1_alt", "chr4_GL383528v1_alt",
. "chr4_KI270785v1_alt", "chr4_KI270786v1_alt", "chr4_KI270787v1_alt",
. "chr4_KI270788v1_alt", "chr4_KI270789v1_alt", "chr4_KI270790v1_alt",
. "chr4_KI270896v1_alt", "chr4_KI270925v1_alt", "chr5_GL339449v2_alt",
. "chr5_GL383530v1_alt", "chr5_GL383531v1_alt", "chr5_GL383532v1_alt",
. "chr5_GL949742v1_alt", "chr5_KI270791v1_alt", "chr5_KI270792v1_alt",
. "chr5_KI270793v1_alt", "chr5_KI270794v1_alt", "chr5_KI270795v1_alt",
. "chr5_KI270796v1_alt", "chr5_KI270897v1_alt", "chr5_KI270898v1_alt",
. "chr6_GL000250v2_alt", "chr6_GL000251v2_alt", "chr6_GL000252v2_alt",
. "chr6_GL000253v2_alt", "chr6_GL000254v2_alt", "chr6_GL000255v2_alt",
. "chr6_GL000256v2_alt", "chr6_GL383533v1_alt", "chr6_KB021644v2_alt",
. "chr6_KI270758v1_alt", "chr6_KI270797v1_alt", "chr6_KI270798v1_alt",
. "chr6_KI270799v1_alt", "chr6_KI270800v1_alt", "chr6_KI270801v1_alt",
. "chr6_KI270802v1_alt", "chr7_GL383534v2_alt", "chr7_KI270803v1_alt",
. "chr7_KI270804v1_alt", "chr7_KI270805v1_alt", "chr7_KI270806v1_alt",
. "chr7_KI270807v1_alt", "chr7_KI270808v1_alt", "chr7_KI270809v1_alt",
. "chr7_KI270899v1_alt", "chr8_KI270810v1_alt", "chr8_KI270811v1_alt",
. "chr8_KI270812v1_alt", "chr8_KI270813v1_alt", "chr8_KI270814v1_alt",
. "chr8_KI270815v1_alt", "chr8_KI270816v1_alt", "chr8_KI270817v1_alt",
. "chr8_KI270818v1_alt", "chr8_KI270819v1_alt", "chr8_KI270820v1_alt",
. "chr8_KI270821v1_alt", "chr8_KI270822v1_alt", "chr8_KI270900v1_alt",
. "chr8_KI270901v1_alt", "chr8_KI270926v1_alt", "chr9_GL383539v1_alt",
. "chr9_GL383540v1_alt", "chr9_GL383541v1_alt", "chr9_GL383542v1_alt",
. "chr9_KI270823v1_alt", "chr10_GL383545v1_alt", "chr10_GL383546v1_alt",
. "chr10_KI270824v1_alt", "chr10_KI270825v1_alt", "chr11_GL383547v1_alt",
. "chr11_JH159136v1_alt", "chr11_JH159137v1_alt", "chr11_KI270826v1_alt",
. "chr11_KI270827v1_alt", "chr11_KI270829v1_alt", "chr11_KI270830v1_alt",
. "chr11_KI270831v1_alt", "chr11_KI270832v1_alt", "chr11_KI270902v1_alt",
. "chr11_KI270903v1_alt", "chr11_KI270927v1_alt", "chr12_GL383549v1_alt",
. "chr12_GL383550v2_alt", "chr12_GL383551v1_alt", "chr12_GL383552v1_alt",
. "chr12_GL383553v2_alt", "chr12_GL877875v1_alt", "chr12_GL877876v1_alt",
. "chr12_KI270833v1_alt", "chr12_KI270834v1_alt", "chr12_KI270835v1_alt",
. "chr12_KI270836v1_alt", "chr12_KI270837v1_alt", "chr12_KI270904v1_alt",
. "chr13_KI270838v1_alt", "chr13_KI270839v1_alt", "chr13_KI270840v1_alt",
. "chr13_KI270841v1_alt", "chr13_KI270842v1_alt", "chr13_KI270843v1_alt",
. "chr14_KI270844v1_alt", "chr14_KI270845v1_alt", "chr14_KI270846v1_alt",
. "chr14_KI270847v1_alt", "chr15_GL383554v1_alt", "chr15_GL383555v2_alt",
. "chr15_KI270848v1_alt", "chr15_KI270849v1_alt", "chr15_KI270850v1_alt",
. "chr15_KI270851v1_alt", "chr15_KI270852v1_alt", "chr15_KI270905v1_alt",
. "chr15_KI270906v1_alt", "chr16_GL383556v1_alt", "chr16_GL383557v1_alt",
. "chr16_KI270853v1_alt", "chr16_KI270854v1_alt", "chr16_KI270855v1_alt",
. "chr16_KI270856v1_alt", "chr17_GL000258v2_alt", "chr17_GL383563v3_alt",
. "chr17_GL383564v2_alt", "chr17_GL383565v1_alt", "chr17_GL383566v1_alt",
. "chr17_JH159146v1_alt", "chr17_JH159147v1_alt", "chr17_JH159148v1_alt",
. "chr17_KI270857v1_alt", "chr17_KI270858v1_alt", "chr17_KI270859v1_alt",
. "chr17_KI270860v1_alt", "chr17_KI270861v1_alt", "chr17_KI270862v1_alt",
. "chr17_KI270907v1_alt", "chr17_KI270908v1_alt", "chr17_KI270909v1_alt",
. "chr17_KI270910v1_alt", "chr18_GL383567v1_alt", "chr18_GL383568v1_alt",
. "chr18_GL383569v1_alt", "chr18_GL383570v1_alt", "chr18_GL383571v1_alt",
. "chr18_GL383572v1_alt", "chr18_KI270863v1_alt", "chr18_KI270864v1_alt",
. "chr18_KI270911v1_alt", "chr18_KI270912v1_alt", "chr19_GL000209v2_alt",
. "chr19_GL383573v1_alt", "chr19_GL383574v1_alt", "chr19_GL383575v2_alt",
. "chr19_GL383576v1_alt", "chr19_GL949746v1_alt", "chr19_GL949747v2_alt",
. "chr19_GL949748v2_alt", "chr19_GL949749v2_alt", "chr19_GL949750v2_alt",
. "chr19_GL949751v2_alt", "chr19_GL949752v1_alt", "chr19_GL949753v2_alt",
. "chr19_KI270865v1_alt", "chr19_KI270866v1_alt", "chr19_KI270867v1_alt",
. "chr19_KI270868v1_alt", "chr19_KI270882v1_alt", "chr19_KI270883v1_alt",
. "chr19_KI270884v1_alt", "chr19_KI270885v1_alt", "chr19_KI270886v1_alt",
. "chr19_KI270887v1_alt", "chr19_KI270888v1_alt", "chr19_KI270889v1_alt",
. "chr19_KI270890v1_alt", "chr19_KI270891v1_alt", "chr19_KI270914v1_alt",
. "chr19_KI270915v1_alt", "chr19_KI270916v1_alt", "chr19_KI270917v1_alt",
. "chr19_KI270918v1_alt", "chr19_KI270919v1_alt", "chr19_KI270920v1_alt",
. "chr19_KI270921v1_alt", "chr19_KI270922v1_alt", "chr19_KI270923v1_alt",
. "chr19_KI270929v1_alt", "chr19_KI270930v1_alt", "chr19_KI270931v1_alt",
. "chr19_KI270932v1_alt", "chr19_KI270933v1_alt", "chr19_KI270938v1_alt",
. "chr20_GL383577v2_alt", "chr20_KI270869v1_alt", "chr20_KI270870v1_alt",
. "chr20_KI270871v1_alt", "chr21_GL383578v2_alt", "chr21_GL383579v2_alt",
. "chr21_GL383580v2_alt", "chr21_GL383581v2_alt", "chr21_KI270872v1_alt",
. "chr21_KI270873v1_alt", "chr21_KI270874v1_alt", "chr22_GL383582v2_alt",
. "chr22_GL383583v2_alt", "chr22_KB663609v1_alt", "chr22_KI270875v1_alt",
. "chr22_KI270876v1_alt", "chr22_KI270877v1_alt", "chr22_KI270878v1_alt",
. "chr22_KI270879v1_alt", "chr22_KI270928v1_alt", "chrX_KI270880v1_alt",
. "chrX_KI270881v1_alt", "chrX_KI270913v1_alt", "chr1_KI270706v1_random",
. "chr1_KI270707v1_random", "chr1_KI270708v1_random", "chr1_KI270709v1_random",
. "chr1_KI270710v1_random", "chr1_KI270711v1_random", "chr1_KI270712v1_random",
. "chr1_KI270713v1_random", "chr1_KI270714v1_random", "chr2_KI270715v1_random",
. "chr2_KI270716v1_random", "chr3_GL000221v1_random", "chr4_GL000008v2_random",
. "chr5_GL000208v1_random", "chr9_KI270717v1_random", "chr9_KI270718v1_random",
. "chr9_KI270719v1_random", "chr9_KI270720v1_random", "chr11_KI270721v1_random",
. "chr14_GL000009v2_random", "chr14_GL000194v1_random", "chr14_GL000225v1_random",
. "chr14_KI270722v1_random", "chr14_KI270723v1_random", "chr14_KI270724v1_random",
. "chr14_KI270725v1_random", "chr14_KI270726v1_random", "chr15_KI270727v1_random",
. "chr16_KI270728v1_random", "chr17_GL000205v2_random", "chr17_KI270729v1_random",
. "chr17_KI270730v1_random", "chr22_KI270731v1_random", "chr22_KI270732v1_random",
. "chr22_KI270733v1_random", "chr22_KI270734v1_random", "chr22_KI270735v1_random",
. "chr22_KI270736v1_random", "chr22_KI270737v1_random", "chr22_KI270738v1_random",
. "chr22_KI270739v1_random", "chrY_KI270740v1_random", "chrUn_GL000195v1",
. "chrUn_GL000213v1", "chrUn_GL000214v1", "chrUn_GL000216v2", "chrUn_GL000218v1",
. "chrUn_GL000219v1", "chrUn_GL000220v1", "chrUn_GL000224v1", "chrUn_GL000226v1",
. "chrUn_KI270302v1", "chrUn_KI270303v1", "chrUn_KI270304v1", "chrUn_KI270305v1",
. "chrUn_KI270310v1", "chrUn_KI270311v1", "chrUn_KI270312v1", "chrUn_KI270315v1",
. "chrUn_KI270316v1", "chrUn_KI270317v1", "chrUn_KI270320v1", "chrUn_KI270322v1",
. "chrUn_KI270329v1", "chrUn_KI270330v1", "chrUn_KI270333v1", "chrUn_KI270334v1",
. "chrUn_KI270335v1", "chrUn_KI270336v1", "chrUn_KI270337v1", "chrUn_KI270338v1",
. "chrUn_KI270340v1", "chrUn_KI270362v1", "chrUn_KI270363v1", "chrUn_KI270364v1",
. "chrUn_KI270366v1", "chrUn_KI270371v1", "chrUn_KI270372v1", "chrUn_KI270373v1",
. "chrUn_KI270374v1", "chrUn_KI270375v1", "chrUn_KI270376v1", "chrUn_KI270378v1",
. "chrUn_KI270379v1", "chrUn_KI270381v1", "chrUn_KI270382v1", "chrUn_KI270383v1",
. "chrUn_KI270384v1", "chrUn_KI270385v1", "chrUn_KI270386v1", "chrUn_KI270387v1",
. "chrUn_KI270388v1", "chrUn_KI270389v1", "chrUn_KI270390v1", "chrUn_KI270391v1",
. "chrUn_KI270392v1", "chrUn_KI270393v1", "chrUn_KI270394v1", "chrUn_KI270395v1",
. "chrUn_KI270396v1", "chrUn_KI270411v1", "chrUn_KI270412v1", "chrUn_KI270414v1",
. "chrUn_KI270417v1", "chrUn_KI270418v1", "chrUn_KI270419v1", "chrUn_KI270420v1",
. "chrUn_KI270422v1", "chrUn_KI270423v1", "chrUn_KI270424v1", "chrUn_KI270425v1",
. "chrUn_KI270429v1", "chrUn_KI270435v1", "chrUn_KI270438v1", "chrUn_KI270442v1",
. "chrUn_KI270448v1", "chrUn_KI270465v1", "chrUn_KI270466v1", "chrUn_KI270467v1",
. "chrUn_KI270468v1", "chrUn_KI270507v1", "chrUn_KI270508v1", "chrUn_KI270509v1",
. "chrUn_KI270510v1", "chrUn_KI270511v1", "chrUn_KI270512v1", "chrUn_KI270515v1",
. "chrUn_KI270516v1", "chrUn_KI270517v1", "chrUn_KI270518v1", "chrUn_KI270519v1",
. "chrUn_KI270521v1", "chrUn_KI270522v1", "chrUn_KI270528v1", "chrUn_KI270529v1",
. "chrUn_KI270530v1", "chrUn_KI270538v1", "chrUn_KI270539v1", "chrUn_KI270544v1",
. "chrUn_KI270548v1", "chrUn_KI270579v1", "chrUn_KI270580v1", "chrUn_KI270581v1",
. "chrUn_KI270582v1", "chrUn_KI270583v1", "chrUn_KI270584v1", "chrUn_KI270587v1",
. "chrUn_KI270588v1", "chrUn_KI270589v1", "chrUn_KI270590v1", "chrUn_KI270591v1",
. "chrUn_KI270593v1", "chrUn_KI270741v1", "chrUn_KI270742v1", "chrUn_KI270743v1",
. "chrUn_KI270744v1", "chrUn_KI270745v1", "chrUn_KI270746v1", "chrUn_KI270747v1",
. "chrUn_KI270748v1", "chrUn_KI270749v1", "chrUn_KI270750v1", "chrUn_KI270751v1",
. "chrUn_KI270752v1", "chrUn_KI270753v1", "chrUn_KI270754v1", "chrUn_KI270755v1",
. "chrUn_KI270756v1", "chrUn_KI270757v1", "chr1_KN196472v1_fix",
. "chr1_KN196473v1_fix", "chr1_KN196474v1_fix", "chr1_KN538360v1_fix",
. "chr1_KN538361v1_fix", "chr1_KQ031383v1_fix", "chr1_KZ208906v1_fix",
. "chr1_KZ559100v1_fix", "chr2_KN538362v1_fix", "chr2_KN538363v1_fix",
. "chr2_KQ031384v1_fix", "chr2_ML143341v1_fix", "chr2_ML143342v1_fix",
. "chr3_KN196475v1_fix", "chr3_KN196476v1_fix", "chr3_KN538364v1_fix",
. "chr3_KQ031385v1_fix", "chr3_KQ031386v1_fix", "chr3_KV766192v1_fix",
. "chr3_KZ559104v1_fix", "chr4_KQ983257v1_fix", "chr4_ML143344v1_fix",
. "chr4_ML143345v1_fix", "chr4_ML143346v1_fix", "chr4_ML143347v1_fix",
. "chr4_ML143348v1_fix", "chr4_ML143349v1_fix", "chr5_KV575244v1_fix",
. "chr5_ML143350v1_fix", "chr6_KN196478v1_fix", "chr6_KQ031387v1_fix",
. "chr6_KQ090016v1_fix", "chr6_KV766194v1_fix", "chr6_KZ208911v1_fix",
. "chr6_ML143351v1_fix", "chr7_KQ031388v1_fix", "chr7_KV880764v1_fix",
. "chr7_KV880765v1_fix", "chr7_KZ208912v1_fix", "chr7_ML143352v1_fix",
. "chr8_KV880766v1_fix", "chr8_KV880767v1_fix", "chr8_KZ208914v1_fix",
. "chr8_KZ208915v1_fix", "chr9_KN196479v1_fix", "chr9_ML143353v1_fix",
. "chr10_KN196480v1_fix", "chr10_KN538365v1_fix", "chr10_KN538366v1_fix",
. "chr10_KN538367v1_fix", "chr10_KQ090021v1_fix", "chr10_ML143354v1_fix",
. "chr10_ML143355v1_fix", "chr11_KN196481v1_fix", "chr11_KQ090022v1_fix",
. "chr11_KQ759759v1_fix", "chr11_KV766195v1_fix", "chr11_KZ559108v1_fix",
. "chr11_KZ559109v1_fix", "chr11_ML143356v1_fix", "chr11_ML143357v1_fix",
. "chr11_ML143358v1_fix", "chr11_ML143359v1_fix", "chr11_ML143360v1_fix",
. "chr12_KN196482v1_fix", "chr12_KN538369v1_fix", "chr12_KN538370v1_fix",
. "chr12_KQ759760v1_fix", "chr12_KZ208916v1_fix", "chr12_KZ208917v1_fix",
. "chr12_ML143361v1_fix", "chr12_ML143362v1_fix", "chr13_KN196483v1_fix",
. "chr13_KN538371v1_fix", "chr13_KN538372v1_fix", "chr13_KN538373v1_fix",
. "chr13_ML143363v1_fix", "chr13_ML143364v1_fix", "chr13_ML143365v1_fix",
. "chr13_ML143366v1_fix", "chr14_KZ208920v1_fix", "chr14_ML143367v1_fix",
. "chr15_KN538374v1_fix", "chr15_ML143369v1_fix", "chr15_ML143370v1_fix",
. "chr15_ML143371v1_fix", "chr15_ML143372v1_fix", "chr16_KV880768v1_fix",
. "chr16_KZ559113v1_fix", "chr16_ML143373v1_fix", "chr17_KV575245v1_fix",
. "chr17_KV766196v1_fix", "chr17_ML143374v1_fix", "chr17_ML143375v1_fix",
. "chr18_KQ090028v1_fix", "chr18_KZ208922v1_fix", "chr18_KZ559115v1_fix",
. "chr19_KN196484v1_fix", "chr19_KQ458386v1_fix", "chr19_ML143376v1_fix",
. "chr21_ML143377v1_fix", "chr22_KQ759762v1_fix", "chr22_ML143378v1_fix",
. "chr22_ML143379v1_fix", "chr22_ML143380v1_fix", "chrX_ML143381v1_fix",
. "chrX_ML143382v1_fix", "chrX_ML143383v1_fix", "chrX_ML143384v1_fix",
. "chrX_ML143385v1_fix", "chrY_KN196487v1_fix", "chrY_KZ208923v1_fix",
. "chrY_KZ208924v1_fix", "chr1_KQ458382v1_alt", "chr1_KQ458383v1_alt",
. "chr1_KQ458384v1_alt", "chr1_KQ983255v1_alt", "chr1_KV880763v1_alt",
. "chr1_KZ208904v1_alt", "chr1_KZ208905v1_alt", "chr2_KQ983256v1_alt",
. "chr2_KZ208907v1_alt", "chr2_KZ208908v1_alt", "chr3_KZ208909v1_alt",
. "chr3_KZ559101v1_alt", "chr3_KZ559102v1_alt", "chr3_KZ559103v1_alt",
. "chr3_KZ559105v1_alt", "chr3_ML143343v1_alt", "chr4_KQ090013v1_alt",
. "chr4_KQ090014v1_alt", "chr4_KQ090015v1_alt", "chr4_KQ983258v1_alt",
. "chr4_KV766193v1_alt", "chr5_KN196477v1_alt", "chr5_KV575243v1_alt",
. "chr5_KZ208910v1_alt", "chr6_KQ090017v1_alt", "chr7_KZ208913v1_alt",
. "chr7_KZ559106v1_alt", "chr8_KZ559107v1_alt", "chr9_KQ090018v1_alt",
. "chr9_KQ090019v1_alt", "chr10_KQ090020v1_alt", "chr11_KN538368v1_alt",
. "chr11_KZ559110v1_alt", "chr11_KZ559111v1_alt", "chr12_KQ090023v1_alt",
. "chr12_KZ208918v1_alt", "chr12_KZ559112v1_alt", "chr13_KQ090024v1_alt",
. "chr13_KQ090025v1_alt", "chr14_KZ208919v1_alt", "chr14_ML143368v1_alt",
. "chr15_KQ031389v1_alt", "chr16_KQ031390v1_alt", "chr16_KQ090026v1_alt",
. "chr16_KQ090027v1_alt", "chr16_KZ208921v1_alt", "chr17_KV766197v1_alt",
. "chr17_KV766198v1_alt", "chr17_KZ559114v1_alt", "chr18_KQ458385v1_alt",
. "chr18_KZ559116v1_alt", "chr19_KV575246v1_alt", "chr19_KV575247v1_alt",
. "chr19_KV575248v1_alt", "chr19_KV575249v1_alt", "chr19_KV575250v1_alt",
. "chr19_KV575251v1_alt", "chr19_KV575252v1_alt", "chr19_KV575253v1_alt",
. "chr19_KV575254v1_alt", "chr19_KV575255v1_alt", "chr19_KV575256v1_alt",
. "chr19_KV575257v1_alt", "chr19_KV575258v1_alt", "chr19_KV575259v1_alt",
. "chr19_KV575260v1_alt", "chr22_KN196485v1_alt", "chr22_KN196486v1_alt",
. "chr22_KQ458387v1_alt", "chr22_KQ458388v1_alt", "chr22_KQ759761v1_alt",
. "chrX_KV766199v1_alt"), size = c(248956422L, 242193529L, 198295559L,
. 190214555L, 181538259L, 170805979L, 159345973L, 145138636L, 138394717L,
. 133797422L, 135086622L, 133275309L, 114364328L, 107043718L, 101991189L,
. 90338345L, 83257441L, 80373285L, 58617616L, 64444167L, 46709983L,
. 50818468L, 156040895L, 57227415L, 16569L, 182439L, 110268L, 366580L,
. 425601L, 109528L, 165834L, 354444L, 911658L, 50258L, 185285L,
. 256271L, 162212L, 143390L, 123821L, 96131L, 161578L, 110099L,
. 120616L, 136240L, 110395L, 133041L, 70887L, 223625L, 138019L,
. 174166L, 161218L, 214158L, 180671L, 173151L, 173649L, 248252L,
. 205312L, 224108L, 113034L, 162429L, 109187L, 184404L, 162896L,
. 166540L, 163458L, 197351L, 164170L, 165607L, 586476L, 164536L,
. 376187L, 119912L, 244096L, 111943L, 158965L, 205944L, 220246L,
. 378547L, 555799L, 1612928L, 101241L, 173459L, 82728L, 226852L,
. 195710L, 179043L, 126136L, 164558L, 131892L, 172708L, 1144418L,
. 130957L, 4672374L, 4795265L, 4604811L, 4677643L, 4827813L, 4606388L,
. 4929269L, 124736L, 185823L, 76752L, 197536L, 271782L, 152148L,
. 175808L, 870480L, 75005L, 119183L, 1111570L, 157952L, 209988L,
. 158166L, 126434L, 271455L, 209586L, 190869L, 374415L, 292436L,
. 282736L, 300230L, 141812L, 132244L, 305841L, 158983L, 145606L,
. 133535L, 36640L, 985506L, 624492L, 318687L, 136959L, 229282L,
. 162988L, 71551L, 171286L, 60032L, 439082L, 179254L, 309802L,
. 181496L, 188315L, 154407L, 200998L, 191409L, 186169L, 67707L,
. 204059L, 177092L, 296895L, 210133L, 106711L, 214625L, 218612L,
. 120804L, 169178L, 184319L, 138655L, 152874L, 167313L, 408271L,
. 76061L, 119498L, 238139L, 56134L, 40090L, 572349L, 306913L, 180306L,
. 191684L, 169134L, 37287L, 103832L, 322166L, 180703L, 1351393L,
. 1511111L, 296527L, 388773L, 327382L, 244917L, 430880L, 263054L,
. 478999L, 5161414L, 196384L, 192462L, 89672L, 2659700L, 134193L,
. 232857L, 63982L, 1821992L, 375691L, 133151L, 223995L, 90219L,
. 278131L, 70345L, 88070L, 2877074L, 235827L, 108763L, 178921L,
. 196688L, 391357L, 137721L, 1423190L, 325800L, 157099L, 289831L,
. 104552L, 167950L, 164789L, 198278L, 159547L, 167999L, 111737L,
. 157710L, 174061L, 177381L, 385657L, 155864L, 170222L, 188024L,
. 987716L, 729520L, 1064304L, 1091841L, 1066390L, 1002683L, 987100L,
. 796479L, 52969L, 43156L, 233762L, 61734L, 248807L, 170399L, 157053L,
. 171027L, 204239L, 209512L, 155532L, 170698L, 184499L, 170680L,
. 205194L, 170665L, 184516L, 190932L, 123111L, 170701L, 198005L,
. 282224L, 187935L, 189352L, 186203L, 200773L, 170148L, 215732L,
. 170537L, 1066800L, 128386L, 118774L, 183433L, 58661L, 63917L,
. 201197L, 74653L, 116689L, 82692L, 143900L, 166743L, 162811L,
. 96924L, 74013L, 259914L, 263666L, 101331L, 186262L, 304135L,
. 176103L, 284869L, 144206L, 274009L, 175055L, 32032L, 127682L,
. 66860L, 40176L, 42210L, 176043L, 40745L, 41717L, 161471L, 153799L,
. 155397L, 209709L, 92689L, 40062L, 38054L, 176845L, 39050L, 100316L,
. 201709L, 191469L, 211173L, 194050L, 38115L, 39555L, 172810L,
. 43739L, 448248L, 1872759L, 185591L, 280839L, 112551L, 150754L,
. 41543L, 179772L, 165050L, 42811L, 181920L, 103838L, 99375L, 73985L,
. 37240L, 182896L, 164239L, 137718L, 176608L, 161147L, 179198L,
. 161802L, 179693L, 15008L, 2274L, 1942L, 2165L, 1472L, 1201L,
. 12399L, 998L, 2276L, 1444L, 37690L, 4416L, 21476L, 1040L, 1652L,
. 2699L, 1368L, 1048L, 1026L, 1121L, 1428L, 1428L, 3530L, 1803L,
. 2855L, 8320L, 2805L, 1650L, 1451L, 2656L, 2378L, 1136L, 1048L,
. 1045L, 1930L, 4215L, 1750L, 1658L, 990L, 1788L, 1537L, 1216L,
. 1298L, 2387L, 1484L, 971L, 1308L, 970L, 1143L, 1880L, 2646L,
. 1179L, 2489L, 2043L, 2145L, 1029L, 2321L, 1445L, 981L, 2140L,
. 1884L, 1361L, 92983L, 112505L, 392061L, 7992L, 1774L, 1233L,
. 3920L, 4055L, 5353L, 1951L, 2318L, 2415L, 8127L, 22689L, 6361L,
. 1300L, 3253L, 2186L, 138126L, 7642L, 5674L, 2983L, 1899L, 2168L,
. 91309L, 993L, 1202L, 1599L, 31033L, 1553L, 7046L, 6504L, 1400L,
. 4513L, 2969L, 6158L, 44474L, 4685L, 5796L, 3041L, 157432L, 186739L,
. 210658L, 168472L, 41891L, 66486L, 198735L, 93321L, 158759L, 148850L,
. 150742L, 27745L, 62944L, 40191L, 36723L, 79590L, 71251L, 186494L,
. 166200L, 122022L, 460100L, 305542L, 467143L, 330031L, 44955L,
. 208149L, 365499L, 481245L, 145975L, 84043L, 451168L, 305979L,
. 415308L, 373699L, 165718L, 411654L, 105527L, 230434L, 235734L,
. 341066L, 53476L, 176674L, 125549L, 276109L, 673059L, 89956L,
. 268330L, 320750L, 245716L, 139427L, 242796L, 73265L, 179932L,
. 142129L, 468267L, 589656L, 254759L, 156998L, 265876L, 165120L,
. 6367528L, 330164L, 25408L, 277797L, 14347L, 85284L, 420164L,
. 264545L, 454963L, 292944L, 108875L, 181958L, 196940L, 140877L,
. 305244L, 279644L, 45257L, 165419L, 270122L, 217075L, 170928L,
. 211377L, 541038L, 86533L, 315610L, 1046838L, 64689L, 297568L,
. 192531L, 35455L, 206320L, 356766L, 148762L, 7309L, 158944L, 65394L,
. 409912L, 690932L, 399183L, 4998962L, 97763L, 369264L, 5500449L,
. 396515L, 1927115L, 480415L, 270967L, 154723L, 281919L, 137908L,
. 56695L, 407387L, 93070L, 230843L, 370917L, 405389L, 493165L,
. 519485L, 101037L, 461303L, 12295L, 412368L, 403128L, 28824L,
. 68192L, 14678L, 17435L, 101150L, 48370L, 209722L, 141019L, 349938L,
. 212205L, 278659L, 551020L, 166136L, 140355L, 535088L, 181658L,
. 140361L, 175849L, 164041L, 197752L, 302885L, 195063L, 215443L,
. 90922L, 163749L, 236512L, 205407L, 420675L, 139087L, 362221L,
. 135987L, 82315L, 680662L, 172555L, 103072L, 163882L, 134099L,
. 185507L, 203552L, 301637L, 181167L, 109323L, 174808L, 154139L,
. 168146L, 123480L, 171798L, 264228L, 2365364L, 169136L, 59016L,
. 267463L, 78609L, 246895L, 276292L, 116753L, 205101L, 163186L,
. 163926L, 170206L, 168131L, 293522L, 241058L, 159285L, 178197L,
. 166713L, 99845L, 161095L, 223118L, 100553L, 156965L, 171263L,
. 145691L, 156562L, 153027L, 155930L, 174749L, 145162L, 188004L
. ), assembled = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
. TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
. TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), circular = c(FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
. FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
. )), class = "data.frame", row.names = c(NA, 640L)), assembly_accession = "GCF_000001405.39")
getChromInfoFromNCBI(assembly_accession, assembly.units = AssemblyUnits)
.get_NCBI_chrom_info_from_accession(accession, circ_seqs = circ_seqs,
. assembled.molecules.only = assembled.molecules.only, assembly.units = assembly.units,
. recache = recache)
fetch_assembly_report(accession)
.form_assembly_report_url(assembly_accession)
list_ftp_dir(url)
getURL(url)
curlPerform(curl = curl, .opts = opts, .encoding = .encoding)
function (type, msg, asError = TRUE)
. {
. if (!is.character(type)) {
. i = match(type, CURLcodeValues)
. typeName = if (is.na(i))
. character()
. else names(CURLcodeValues)[i]
. }
. typeName = gsub("^CURLE_", "", typeName)
. fun = (if (asError)
. stop
. else warning)
. fun(structure(list(message = msg, call = sys.call()), class = c(typeName,
. "GenericCurlError", "error", "condition")))
. }(28L, "Connection timed out after 300498 milliseconds", TRUE)

annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

seqlevelsStyle(annotation) <- "UCSC"

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS/LAPACK: /home/zhe/anaconda3/envs/r4/lib/libopenblasp-r0.3.18.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.18.3
[3] AnnotationFilter_1.18.0 GenomicFeatures_1.46.3
[5] AnnotationDbi_1.56.2 Biobase_2.54.0
[7] GenomicRanges_1.46.1 GenomeInfoDb_1.30.0
[9] IRanges_2.28.0 S4Vectors_0.32.3
[11] BiocGenerics_0.40.0 SeuratObject_4.0.4
[13] Seurat_4.1.0 Signac_1.5.0

loaded via a namespace (and not attached):
[1] uuid_1.0-3 fastmatch_1.1-3
[3] BiocFileCache_2.2.0 plyr_1.8.6
[5] igraph_1.2.11 repr_1.1.4
[7] lazyeval_0.2.2 splines_4.1.1
[9] BiocParallel_1.28.3 listenv_0.8.0
[11] scattermore_0.7 SnowballC_0.7.0
[13] ggplot2_3.3.5 digest_0.6.29
[15] htmltools_0.5.2 fansi_1.0.2
[17] memoise_2.0.1 magrittr_2.0.1
[19] tensor_1.5 cluster_2.1.2
[21] ROCR_1.0-11 globals_0.14.0
[23] Biostrings_2.62.0 matrixStats_0.61.0
[25] docopt_0.7.1 spatstat.sparse_2.1-0
[27] prettyunits_1.1.1 colorspace_2.0-2
[29] rappdirs_0.3.3 blob_1.2.2
[31] ggrepel_0.9.1 dplyr_1.0.7
[33] sparsesvd_0.2 crayon_1.4.2
[35] RCurl_1.98-1.5 jsonlite_1.7.3
[37] spatstat.data_2.1-2 survival_3.2-13
[39] zoo_1.8-9 glue_1.6.0
[41] polyclip_1.10-0 gtable_0.3.0
[43] zlibbioc_1.40.0 XVector_0.34.0
[45] leiden_0.3.9 DelayedArray_0.20.0
[47] future.apply_1.8.1 abind_1.4-5
[49] scales_1.1.1 DBI_1.1.2
[51] miniUI_0.1.1.1 Rcpp_1.0.8
[53] progress_1.2.2 viridisLite_0.4.0
[55] xtable_1.8-4 reticulate_1.23
[57] spatstat.core_2.3-2 bit_4.0.4
[59] htmlwidgets_1.5.4 httr_1.4.2
[61] RColorBrewer_1.1-2 ellipsis_0.3.2
[63] ica_1.0-2 XML_3.99-0.8
[65] pkgconfig_2.0.3 farver_2.1.0
[67] dbplyr_2.1.1 ggseqlogo_0.1
[69] uwot_0.1.11 deldir_1.0-6
[71] utf8_1.2.2 tidyselect_1.1.1
[73] rlang_0.4.12 reshape2_1.4.4
[75] later_1.3.0 cachem_1.0.6
[77] munsell_0.5.0 tools_4.1.1
[79] generics_0.1.1 RSQLite_2.2.9
[81] ggridges_0.5.3 evaluate_0.14
[83] stringr_1.4.0 fastmap_1.1.0
[85] yaml_2.2.1 goftest_1.2-3
[87] bit64_4.0.5 fitdistrplus_1.1-6
[89] purrr_0.3.4 RANN_2.6.1
[91] KEGGREST_1.34.0 pbapply_1.5-0
[93] future_1.23.0 nlme_3.1-155
[95] mime_0.12 slam_0.1-50
[97] RcppRoll_0.3.0 xml2_1.3.3
[99] biomaRt_2.50.2 hdf5r_1.3.5
[101] compiler_4.1.1 filelock_1.0.2
[103] curl_4.3.2 plotly_4.10.0
[105] png_0.1-7 spatstat.utils_2.3-0
[107] tibble_3.1.6 tweenr_1.0.2
[109] stringi_1.7.6 lattice_0.20-45
[111] IRdisplay_1.1 ProtGenerics_1.26.0
[113] Matrix_1.4-0 vctrs_0.3.8
[115] pillar_1.6.4 lifecycle_1.0.1
[117] BiocManager_1.30.16 spatstat.geom_2.3-1
[119] lmtest_0.9-39 RcppAnnoy_0.0.19
[121] data.table_1.14.2 cowplot_1.1.1
[123] bitops_1.0-7 irlba_2.3.5
[125] rtracklayer_1.54.0 httpuv_1.6.5
[127] patchwork_1.1.1 BiocIO_1.4.0
[129] R6_2.5.1 promises_1.2.0.1
[131] KernSmooth_2.23-20 gridExtra_2.3
[133] lsa_0.73.2 parallelly_1.30.0
[135] codetools_0.2-18 MASS_7.3-55
[137] assertthat_0.2.1 SummarizedExperiment_1.24.0
[139] rjson_0.2.21 GenomicAlignments_1.30.0
[141] qlcMatrix_0.9.7 sctransform_0.3.3
[143] Rsamtools_2.10.0 GenomeInfoDbData_1.2.7
[145] hms_1.1.1 mgcv_1.8-38
[147] parallel_4.1.1 grid_4.1.1
[149] rpart_4.1-15 IRkernel_1.3
[151] tidyr_1.1.4 MatrixGenerics_1.6.0
[153] Rtsne_0.15 pbdZMQ_0.3-6
[155] ggforce_0.3.3 shiny_1.7.1
[157] base64enc_0.1-3 restfulr_0.0.13
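
One workaround sometimes used when the remote lookup times out is to rename the sequence levels by hand instead of calling seqlevelsStyle<-. The sketch below is not part of GenomeInfoDb: ensembl_to_ucsc() is a hypothetical helper, and it assumes that only the primary chromosomes (1-22, X, Y, MT) need converting. Every other name is left untouched, so scaffolds would still need a proper mapping.

```r
# Hypothetical helper (not a GenomeInfoDb function): convert Ensembl/NCBI-style
# primary chromosome names to UCSC style without any network access.
ensembl_to_ucsc <- function(x) {
  primary <- x %in% c(as.character(1:22), "X", "Y", "MT")
  out <- x
  out[primary] <- paste0("chr", x[primary])
  out[x == "MT"] <- "chrM"  # UCSC names the mitochondrial genome chrM
  out
}

ensembl_to_ucsc(c("1", "X", "MT", "GL000192.1"))
```

On a GRanges object this could be applied with seqlevels(annotation) <- ensembl_to_ucsc(seqlevels(annotation)), assuming the renamed levels stay unique.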

Proposed contribution task for Outreachy applicants: Register NCBI assembly Felis_catus_9.0

Felis_catus_9.0 is a Cat assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000181335.3/

Note that Felis_catus_9.0 is the assembly that felCat9, the latest UCSC genome for Cat, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Cat in the UCSC species tree on the left, click on it, then make sure to select the latest Cat Assembly (felCat9). This will display a bunch of additional information about the felCat9 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is:

The name of the assembly is not recognized; only lookup by GenBank or RefSeq accession works.
The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.
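
The NA-versus-FALSE convention described above gives a simple way to tell whether a result came from a registered assembly. A minimal sketch: has_unknown_circularity() is a hypothetical helper, and the two data frames are hand-made stand-ins for what getChromInfoFromNCBI() returns. Only the circular column is taken from the description above; the values are illustrative.

```r
# Hypothetical helper: TRUE when any circularity flag is NA, which is how
# "degraded" mode signals that the flags are not guaranteed to be accurate.
has_unknown_circularity <- function(chrominfo) {
  stopifnot(is.data.frame(chrominfo), "circular" %in% names(chrominfo))
  anyNA(chrominfo$circular)
}

# Mock results (illustrative values only, not real assembly data).
registered_like <- data.frame(SequenceName = c("1", "MT"),
                              circular = c(FALSE, TRUE))
unregistered_like <- data.frame(SequenceName = c("A1", "MT"),
                                circular = c(NA, NA))

has_unknown_circularity(registered_like)
has_unknown_circularity(unregistered_like)
```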

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/. If this is a new organism, then we need to start a new file. See the other files for the naming scheme: the name of the file must be the full scientific name of the organism, with the underscore used as separator, and with the first letter capitalized. File extension must be .R.
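
The naming scheme can be sketched in a few lines of base R. registration_file_name() is a hypothetical helper written for illustration only; the package does not provide it.

```r
# Hypothetical helper: build the registration file name from the full
# scientific name (underscore as separator, first letter capitalized, .R).
registration_file_name <- function(organism) {
  words <- strsplit(trimws(organism), "[[:space:]]+")[[1]]
  name <- paste(words, collapse = "_")
  substr(name, 1, 1) <- toupper(substr(name, 1, 1))
  paste0(name, ".R")
}

registration_file_name("felis catus")
```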

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

GenomeInfoDb cannot be installed

Hello,
I upgraded R to 4.0.3 and now I cannot install GenomeInfoDb. Here is what I did (I also tried installing it from the command line, but received the same error).

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomeInfoDb")

Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
Installing package(s) 'GenomeInfoDb'
also installing the dependency ‘GenomeInfoDbData’

trying URL 'https://bioconductor.org/packages/3.12/bioc/bin/macosx/contrib/4.0/GenomeInfoDb_1.26.1.tgz'
Content type 'application/x-gzip' length 4025771 bytes (3.8 MB)
==================================================
downloaded 3.8 MB


The downloaded binary packages are in
	/var/folders/1q/vdjw7j1s4jz2zljv1ykf_s646zdp7h/T//Rtmpr7uCtl/downloaded_packages
installing the source package ‘GenomeInfoDbData’

trying URL 'https://bioconductor.org/packages/3.12/data/annotation/src/contrib/GenomeInfoDbData_1.2.4.tar.gz'
Content type 'application/x-gzip' length 10673545 bytes (10.2 MB)
==================================================
downloaded 10.2 MB

Warning in file(con, "r") :
  cannot open file '/var/db/timezone/zoneinfo/+VERSION': No such file or directory
dyld: lazy symbol binding failed: Symbol not found: _utimensat
  Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
  Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _utimensat
  Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
  Expected in: /usr/lib/libSystem.B.dylib

/Library/Frameworks/R.framework/Resources/bin/INSTALL: line 34:  4695 Done                    echo 'tools:::.install_packages()'
      4696 Abort trap: 6           | R_DEFAULT_PACKAGES= LC_COLLATE=C "${R_HOME}/bin/R" $myArgs --no-echo --args ${args}

The downloaded source packages are in
	‘/private/var/folders/1q/vdjw7j1s4jz2zljv1ykf_s646zdp7h/T/Rtmpr7uCtl/downloaded_packages’
Old packages: 'processx', 'rmarkdown'
Update all/some/none? [a/s/n]: 
n
Warning message:
In install.packages(...) :
  installation of package ‘GenomeInfoDbData’ had non-zero exit status

I tried to follow up on the dyld errors, but I have not yet found a way to resolve them.
I would appreciate your help with this. My environment is:

version
_

platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)

Thank You,
NF

Error in .make_Seqinfo_from_genome(genome)

I was trying to import alevin counts into R using tximeta, which depends on the GenomeInfoDb package. However, because I used GRCm39 (Release M26) from Gencode for alignment and annotation, I encountered the following error from GenomeInfoDb:

Error in .make_Seqinfo_from_genome(genome) : 
  "GRCm39" is not a registered NCBI assembly or UCSC genome (use
  registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC
  assemblies/genomes currently registered in the GenomeInfoDb package)

Is there a fix for this error, since GRCm39 was released in 2020?

Thank you very much for your help in advance!

dynamic Seqinfo lookup for hg19 failing

The HelloRanges package is failing in devel and release because the dynamic resolution of sequence information from UCSC is broken for hg19:

Seqinfo(genome = "hg19")
## Error in FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs =
## supported_genome$circ_seqs,  : 
##  cannot map the following UCSC seqlevel(s) to an NCBI seqlevel:
##  chr1_jh636052_fix, chrX_jh806600_fix, chrX_jh806587_fix,
##  chr7_jh159134_fix, chrX_jh159150_fix, chrX_jh806590_fix,
##  chr10_jh591181_fix, chr1_jh636053_fix, chr5_gl339449_alt,
##  chr14_kb021645_fix, chrX_jh720453_fix, chrX_jh806601_fix,
##  chr7_gl582971_fix, chrX_jh806599_fix, chr19_gl949749_alt,
##  chr19_gl949750_alt, chr19_gl949748_alt, chr19_kb021647_fix,
##  chrX_jh806597_fix, chr10_ke332501_fix, chr19_gl949751_alt,
##  chr19_gl949746_alt, chr19_gl949752_alt, chrX_jh806598_fix,
##  chrX_jh720451_fix, chrX_jh806591_fix, chr11_jh806581_fix,
##  chrX_jh806588_fix, chrX_jh806592_fix, chr19_gl949753_alt,
##  chr1_jh636054_fix, chrX_jh720454_fix, chr19_gl949747_alt,
##  chr7_jh636058_fix, chrX_jh806602_fix, chr17_gl383561_fix,
##  chr8_gl949743_fix, chr2_kb663603_fix, chr19_gl582977_fix,
##  chr19_ke332505_fix, chr11_jh159140_fix, chr5_ke332497_fix,
##  chr17_gl383560_fix, chrX_jh720452_fix, chr4_ke332496_fix,
##  chr6_kb663604_fix, chr

I wonder if it would be better to include a static copy of this information, at least for the most commonly accessed genomes? That would be more stable, faster, and more reproducible.
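
To make the idea concrete, here is a rough sketch of a local on-disk copy that is consulted before any remote lookup. It does not show how GenomeInfoDb works internally: cached_lookup() and fake_fetch() are hypothetical names, and fake_fetch() stands in for whatever remote call would normally be made (its sizes are made up).

```r
# Hypothetical helper: return the stored copy if one exists, otherwise call
# fetch() once and store the result for later sessions.
cached_lookup <- function(key, fetch, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path))
    return(readRDS(path))
  value <- fetch(key)
  saveRDS(value, path)
  value
}

# Stand-in for a remote lookup; counts how often it is actually called.
n_fetches <- 0
fake_fetch <- function(genome) {
  n_fetches <<- n_fetches + 1
  data.frame(chrom = c("chr1", "chrM"), size = c(100L, 10L))  # made-up sizes
}

cache_dir <- tempfile("chrominfo_cache_")
first <- cached_lookup("hg19", fake_fetch, cache_dir)
second <- cached_lookup("hg19", fake_fetch, cache_dir)  # served from disk
```

The second call never reaches fake_fetch(), which is the property a static copy would give: results no longer depend on the remote server being reachable or unchanged.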


Error with Seqinfo

Hi, I'm trying to use the CERES package which depends on GenomeInfoDb. It fails due to the following error:

Seqinfo(genome="hg19")
Error in function (type, msg, asError = TRUE) :
Failed connect to ftp.ncbi.nlm.nih.gov:21; Connection refused

I've checked fetchExtendedChromInfoFromUCSC and it appears to be supported - is this a problem with my proxy settings? If so could I download and point the function to a file instead?

Comparing Seqinfo objects

It might be nice to support some comparison operations between two Seqinfo objects, particularly match(). Then, if Seqinfo were a Vector (shouldn't it be?), we could do, for example:

all(seqinfo(which) %in% seqinfo(x))

to check whether the universe of which is a subset of that of x.

I think match() should consider the tuple of sequence name and genome version, where the NA version is wildcard. I noticed that e.g. intersect,Seqinfo,Seqinfo() just considers seqnames.
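
The proposed semantics could look something like this in plain R. The function works on bare vectors rather than Seqinfo objects, and match_seqname_genome() is a hypothetical name, so this is only a sketch of the matching rule, not an existing or planned method.

```r
# Hypothetical sketch: match on the (seqname, genome) tuple, treating an NA
# genome on either side as a wildcard. Returns positions in the table, or NA.
match_seqname_genome <- function(x_names, x_genome, table_names, table_genome) {
  vapply(seq_along(x_names), function(i) {
    same_name <- table_names == x_names[i]
    compatible <- is.na(x_genome[i]) | is.na(table_genome) |
      table_genome == x_genome[i]
    hit <- which(same_name & compatible)
    if (length(hit) == 0L) NA_integer_ else hit[1L]
  }, integer(1))
}

res <- match_seqname_genome(
  x_names      = c("chr1", "chr2", "chrM"),
  x_genome     = c("hg19", NA, "hg38"),
  table_names  = c("chr1", "chr2", "chrM"),
  table_genome = c("hg19", "hg19", "hg19")
)
res
```

Here chr1 matches exactly, chr2 matches through the NA wildcard, and chrM does not match because the genome versions differ. A subset check in the spirit of the %in% example above would then be all(!is.na(res)).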

Windows build error

I am using R-hub builder and there is a problem with GenomeInfoDb on Windows:

Error: package 'GenomeInfoDb' required by 'GenomicRanges' could not be found

Bioconductor also reports an error: http://bioconductor.org/checkResults/release/bioc-LATEST/GenomeInfoDb/

Based on that output, it looks like the problem is related to NCBI:

> GenomeInfoDb:::.test()
Timing stopped at: 80.2 21.16 131.3
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 36388 did not have 10 elements
In addition: Warning messages:
1: In getChromInfoFromNCBI(genome, as.Seqinfo = TRUE) :
  Assembly Bos_taurus_UMD_3.1.1 is mapped to more than one assembly
  (GCF_000003055.5, GCA_000003055.5). The first one was selected.
2: In getChromInfoFromNCBI(genome, as.Seqinfo = TRUE) :
  Assembly MusPutFurMale1.0 is mapped to more than one assembly
  (GCA_000004665.1, GCA_000239315.1). The first one was selected.
3: In getChromInfoFromNCBI(genome, as.Seqinfo = TRUE) :
  Assembly GRCg6a is mapped to more than one assembly (GCA_000002315.5,
  GCF_000002315.6). The first one was selected.


RUNIT TEST PROTOCOL -- Mon Jan 25 03:24:07 2021 
*********************************************** 
Number of test functions: 21 
Number of errors: 1 
Number of failures: 0 

 
1 Test Suite : 
GenomeInfoDb RUnit Tests - 21 test functions, 1 error, 0 failures
ERROR in test_seqlevelsStyle_Seqinfo: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 36388 did not have 10 elements

Test files with failing tests

   test_seqlevelsStyle.R 
     test_seqlevelsStyle_Seqinfo 


Error in BiocGenerics:::testPackage("GenomeInfoDb") : 
  unit tests failed for package GenomeInfoDb
Calls: <Anonymous> -> <Anonymous>
Execution halted

'new2old' not supported in 'seqinfo<-,DNAStringSet-method'

Hi,

I'm trying to use the package in a workflow, with the following code:

len <- 91
dna <- readDNAStringSet("results/genomepy/GRCh38.p13/GRCh38.p13.fa")

names(dna) <- sapply(strsplit(names(dna), " "), .subset, 1)

get_velocity_files(
        X = "results/genomepy/GRCh38.p13/GRCh38.p13.annotation.gtf",
        L = len,
        Genome = dna,
        out_path = "results/busparse/get_velocity_files/GRCh38.p13",
        style = "Ensembl",
        transcript_version = NULL,
        gene_version = NULL
    )

And I get the following error:

Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Assuming that all chromosomes are linenar.
Error: 'new2old' not supported in 'seqinfo<-,DNAStringSet-method'
Execution halted

Any help would be appreciated.
Thank you,
Susheel

Proposed contribution task for Outreachy applicants: Register UCSC genome gorGor6

gorGor6 is the latest UCSC genome for Gorilla (Gorilla gorilla gorilla). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Gorilla in the UCSC species tree on the left, click on it, then make sure to select the latest Gorilla Assembly (gorGor6). This will display a bunch of additional information about the gorGor6 assembly.

Note that many UCSC genomes are already registered in the GenomeInfoDb package (83 as of October 2022). The registered_UCSC_genomes() function in GenomeInfoDb returns the list of all the UCSC genomes that are currently registered in the package. An important thing to be aware of is that getChromInfoFromUCSC() still works on an unregistered genome, but in "degraded" mode, that is:

the assembled.molecules argument is ignored,
the assembled and circular columns of the returned data.frame are filled with NAs,
and the chromosomes/sequences are not returned in any particular order.

Registering a genome fixes that. In other words, once a genome is registered in GenomeInfoDb, the information returned by getChromInfoFromUCSC() for that genome is guaranteed to be complete and accurate.

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.

Registering a new UCSC genome is only a matter of adding a new file, called "registration file", to GenomeInfoDb/inst/registered/UCSC_genomes/. Note that the folder contains a README.TXT file that provides some brief information about what a "registration file" should contain (unfortunately the registration process is not fully documented).

For gorGor6, since this is the first gorGor genome that we're going to register in GenomeInfoDb, we need to start the gorGor6.R file from scratch. However, looking at other registration files to get a feeling of how things are done is always a good idea. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (Kamilah_GGO_v0) is also registered (registering Kamilah_GGO_v0 is the topic of issue #61).

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Proposed contribution task for Outreachy applicants: Enable "offline mode" for xenTro10

This task depends on issue #46 being completed first (i.e. PR accepted and merged, and issue closed). Although it's not a requirement that the 2 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

See ?getChromInfoFromUCSC in the GenomeInfoDb package for information about the "offline mode". See README.TXT file in the GenomeInfoDb/inst/extdata/assembled_molecules_info/UCSC/ folder for how to enable "offline mode" for a registered UCSC genome.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Proposed contribution task for Outreachy applicants: Register UCSC genome canFam6

canFam6 is the latest UCSC genome for Dog (Canis lupus familiaris). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

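Note that getChromInfoFromUCSC() still works on an unregistered genome, but in "degraded" mode, that is:

the assembled.molecules argument is ignored,
the assembled and circular columns of the returned data.frame are filled with NAs,
and the chromosomes/sequences are not returned in any particular order.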

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.

For canFam6, the easiest way to go would be to copy canFam5.R -> canFam6.R, then to make some adjustments to canFam6.R. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (Dog10K_Boxer_Tasha) is also registered (registering Dog10K_Boxer_Tasha is the topic of issue #44).

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

GitHub installation fails due to missing GenomeInfoDbData

Hi, I'm working on a package named basejump that uses ensembldb as a dependency. I've recently run into GenomeInfoDb failing to install because GenomeInfoDbData doesn't get set up properly. I can't figure out why this is happening. The imports in both ensembldb and GenomeInfoDb look good.

The weird thing is that GenomeInfoDb installs fine for me using biocLite("GenomeInfoDb") but it fails when attempting to use biocLite("Bioconductor/GenomeInfoDb") or devtools::install_github("Bioconductor/GenomeInfoDb"). This is leading me to believe there's some possible Bioconductor-related bug in the devtools package, but I'm not sure. Similarly, ensembldb installs fine using biocLite("ensembldb") but fails when attempting to install the repo directly: biocLite("jotsetung/ensembldb"). Any help figuring this out would be greatly appreciated!

Best,
Mike

Seqinfo constructor should auto-fill seqlevels= when seqlengths= is named

Currently, construction from a named numeric vector looks like:

lens <- c(chrA=1, chrB=2)
Seqinfo(seqlevels=names(lens), seqlengths=lens)

However, if seqlengths= is named, I would have expected the following to suffice:

Seqinfo(seqlengths=lens)

and have seqlevels=names(seqlengths) by default. This would avoid the need to specify two arguments and allow convenient manual specification by passing in a single named vector directly to seqlengths=:

Seqinfo(seqlengths=c(chrA=1, chrB=2, chrC=3))
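
A minimal sketch of the requested behavior as a user-side wrapper, not a change to the package itself. The helper name seqinfo_from_lengths is hypothetical, and the sketch assumes that the constructor takes the sequence names through an argument called seqnames (written seqlevels= in the post above):

```r
# Hypothetical helper: build a Seqinfo from a single named vector of lengths.
# Assumes GenomeInfoDb is installed and that Seqinfo() accepts the sequence
# names through its `seqnames` argument.
seqinfo_from_lengths <- function(seqlengths) {
    # Refuse unnamed input, since the names are the only source of seqnames.
    stopifnot(is.numeric(seqlengths), !is.null(names(seqlengths)))
    GenomeInfoDb::Seqinfo(seqnames = names(seqlengths),
                          seqlengths = as.integer(seqlengths))
}
```

With this in place, seqinfo_from_lengths(c(chrA=1, chrB=2, chrC=3)) would stand in for the two-argument call.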

Issue with setting `seqlevelsStyle()`

I'm currently working on some code in an AnVIL workspace and I'm having an issue using seqlevelsStyle() when setting the style to 'RefSeq'. The following examples work as expected:

> library(BSgenome.Hsapiens.UCSC.hg19)
> seqlevelsStyle(Hsapiens)
[1] "NCBI"
> seqlevelsStyle(Hsapiens) <- "RefSeq"
>
> library(BSgenome.Hsapiens.NCBI.GRCh38)
> seqlevelsStyle(Hsapiens)
[1] "NCBI"
> seqlevelsStyle(Hsapiens) <- "RefSeq"

But doing something like the following results in an error:

> library(GenomicRanges)
> GR <- GRanges("chr1:1-10")
> seqlevelsStyle(GR)
[1] "UCSC"
> seqlevelsStyle(GR) <- "RefSeq"
Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) : 
  supplied seqname style "RefSeq" is not supported
>

Here's my session info:


> sessionInfo() 
R version 4.0.3 (2020-10-10) 
Platform: x86_64-pc-linux-gnu (64-bit) 
Running under: Ubuntu 20.04 LTS  

Matrix products: default 
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so  

locale:  
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8         
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C               
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C         

attached base packages: 
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base       

other attached packages:  
[1] Rsamtools_2.6.0                         BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000   
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.42.3                   
[5] AnnotationDbi_1.52.0                    ChIPseeker_1.26.2                        
[7] rGREAT_1.22.0                           DiffBind_3.0.15                          
[9] SummarizedExperiment_1.20.0             Biobase_2.50.0                          
[11] MatrixGenerics_1.2.1                    matrixStats_0.59.0                      
[13] BSgenome.Hsapiens.UCSC.hg19_1.4.3       BSgenome_1.58.0                         
[15] rtracklayer_1.50.0                      MotifDb_1.32.0                          
[17] Biostrings_2.58.0                       XVector_0.30.0                          
[19] GenomicRanges_1.42.0                    GenomeInfoDb_1.26.7                     
[21] IRanges_2.24.1                          ATACseqQC_1.14.4                        
[23] S4Vectors_0.28.1                        BiocGenerics_0.36.1                     
[25] AnVIL_1.2.1                             dplyr_1.0.6                             
[27] BiocManager_1.30.15                      

loaded via a namespace (and not attached):   
[1] rappdirs_0.3.3                AnnotationForge_1.32.0        coda_0.19-4                     
[4] tidyr_1.1.3                   ggplot2_3.3.3                 bit64_4.0.5                     
[7] irlba_2.3.3                   DelayedArray_0.16.3           data.table_1.14.0              
[10] hwriter_1.3.2                 KEGGREST_1.30.1               RCurl_1.98-1.3                 
[13] AnnotationFilter_1.14.0       generics_0.1.0                cowplot_1.1.1                  
[16] lambda.r_1.2.4                RSQLite_2.2.7                 shadowtext_0.0.8               
[19] bit_4.0.4                     enrichplot_1.10.2             base64url_1.4                  
[22] xml2_1.3.2                    httpuv_1.6.1                  assertthat_0.2.1               
[25] batchtools_0.9.15             viridis_0.6.1                 amap_0.8-18                    
[28] apeglm_1.12.0                 xfun_0.23                     hms_1.1.0                      
[31] promises_1.2.0.1              fansi_0.5.0                   progress_1.2.2                 
[34] caTools_1.18.2                dbplyr_2.1.1                  htmlwidgets_1.5.3              
[37] Rgraphviz_2.34.0              igraph_1.2.6                  DBI_1.1.1                      
[40] geneplotter_1.68.0            futile.logger_1.4.3           purrr_0.3.4                    
[43] ellipsis_0.3.2                backports_1.2.1               V8_3.4.2                       
[46] GenomicScores_2.2.0           annotate_1.68.0               biomaRt_2.46.3                 
[49] vctrs_0.3.8                   ensembldb_2.14.1              cachem_1.0.5                   
[52] withr_2.4.2                   ggforce_0.3.3                 DOT_0.1                        
[55] bdsmatrix_1.3-4               checkmate_2.0.0               GenomicAlignments_1.26.0       
[58] prettyunits_1.1.1             DOSE_3.16.0                   lazyeval_0.2.2                 
[61] crayon_1.4.1                  genefilter_1.72.1             edgeR_3.32.1                   
[64] pkgconfig_2.0.3               tweenr_1.0.2                  ProtGenerics_1.22.0            
[67] rlang_0.4.11                  lifecycle_1.0.0               BiocFileCache_1.14.0           
[70] GOstats_2.56.0                AnnotationHub_2.22.1          VennDiagram_1.6.20             
[73] invgamma_1.1                  randomForest_4.6-14           rsvg_2.1.2                     
[76] polyclip_1.10-0               graph_1.68.0                  Matrix_1.3-2                   
[79] ashr_2.2-47                   Rhdf5lib_1.12.1               boot_1.3-27                    
[82] GlobalOptions_0.1.2           pheatmap_1.0.12               png_0.1-7                      
[85] viridisLite_0.4.0             rjson_0.2.20                  splitstackshape_1.4.8          
[88] bitops_1.0-7                  KernSmooth_2.23-18            rhdf5filters_1.2.1             
[91] blob_1.2.1                    mixsqp_0.3-43                 stringr_1.4.0                  
[94] SQUAREM_2021.1                qvalue_2.22.0                 regioneR_1.22.0                
[97] ShortRead_1.48.0              brew_1.0-6                    jpeg_0.1-8.1                  
[100] scales_1.1.1                  memoise_2.0.0                 GSEABase_1.52.1               
[103] magrittr_2.0.1                plyr_1.8.6                    gplots_3.1.1                  
[106] zlibbioc_1.36.0               compiler_4.0.3                scatterpie_0.1.6              
[109] tinytex_0.32                  bbmle_1.0.23.1                RColorBrewer_1.1-2            
[112] plotrix_3.8-1                 DESeq2_1.30.1                 ade4_1.7-16                   
[115] cli_2.5.0                     systemPipeR_1.24.6            Category_2.56.0               
[118] formatR_1.11                  MASS_7.3-53.1                 tidyselect_1.1.1              
[121] stringi_1.6.2                 emdbook_1.3.12                yaml_2.2.1                    
[124] GOSemSim_2.16.1               askpass_1.1                   locfit_1.5-9.4                
[127] ChIPpeakAnno_3.24.2           latticeExtra_0.6-29           ggrepel_0.9.1                 
[130] grid_4.0.3                    VariantAnnotation_1.36.0      polynom_1.4-0                 
[133] fastmatch_1.1-0               tools_4.0.3                   rapiclient_0.1.3              
[136] rstudioapi_0.13               gridExtra_2.3                 farver_2.1.0                  
[139] ggraph_2.0.5                  digest_0.6.27                 rvcheck_0.1.8                 
[142] shiny_1.6.0                   Rcpp_1.0.6                    BiocVersion_3.12.0            
[145] later_1.2.0                   motifStack_1.34.0             httr_1.4.2                    
[148] colorspace_2.0-1              XML_3.99-0.6                  truncnorm_1.0-8               
[151] splines_4.0.3                 RBGL_1.66.0                   graphlayouts_0.7.1            
[154] multtest_2.46.0               preseqR_4.0.0                 xtable_1.8-4                  
[157] jsonlite_1.7.2                futile.options_1.0.1          tidygraph_1.2.0               
[160] R6_2.5.0                      mime_0.10                     pillar_1.6.1                  
[163] htmltools_0.5.1.1             glue_1.4.2                    fastmap_1.1.0                 
[166] BiocParallel_1.24.1           interactiveDisplayBase_1.28.0 fgsea_1.16.0                 
[169] GreyListChIP_1.22.0           mvtnorm_1.1-2                 utf8_1.2.1                    
[172] lattice_0.20-41               tibble_3.1.2                  numDeriv_2016.8-1.1           
[175] curl_4.3.1                    gtools_3.9.2                  GO.db_3.12.1                  
[178] openssl_1.4.4                 survival_3.2-7                limma_3.46.0                  
[181] munsell_0.5.0                 GetoptLong_1.0.5              DO.db_2.9                     
[184] rhdf5_2.34.0                  GenomeInfoDbData_1.2.4        HDF5Array_1.18.1              
[187] reshape2_1.4.4                gtable_0.3.0
>

Provide a complete mapping of UCSC chromosome names to Ensembl

@ivanek opened an issue over at ensembldb (jorainer/ensembldb#88).
ensembldb in general uses the genomeStyles function from GenomeInfoDb to map chromosome names. As of now we can only map the standard chromosome names from UCSC to Ensembl. It would also be nice to be able to map names for patched chromosomes etc. Perhaps Robert's solution, mentioned in the ensembldb issue, could be a starting point?

Registration request for Acyrthosiphon pisum NCBI assembly

Hi,
I'd like to have the genome for Acyrthosiphon pisum registered for the purpose of forging a BSgenome package. The assembly is pea_aphid_22Mar2018_4r6ur and below is the link to the NCBI page:

https://www.ncbi.nlm.nih.gov/assembly/GCF_005508785.1

Let me know if there's anything I can clear up or help with. Thank you very much.

Mitochondrial chromosome naming, seqlevelsStyle()

Thanks for a nice function; I use seqlevelsStyle a lot.

One thing that has always bothered me is the naming of the mitochondrial chromosome.

Two objects with the same naming convention, such as these:

> seqlevelsStyle(bamFile)
[1] "UCSC"
> seqlevelsStyle(cds)
[1] "UCSC"

can show seqlevels like this:

> seqlevels(bamFile)
[1] "MT"
> seqlevels(cds)
[1] "chrM"

So even though they use the same seqlevelsStyle, the objects do not name the mitochondrial chromosome the same way.
Is there a reason for this? :)
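
Since two objects can report the same style while disagreeing on individual names, comparing the seqlevels themselves is a more direct check. A minimal sketch in base R, using hypothetical character vectors in place of seqlevels(bamFile) and seqlevels(cds) (only the mitochondrial names come from the post; the other entries are made up for illustration):

```r
# Hypothetical stand-ins for seqlevels(bamFile) and seqlevels(cds); the post
# only shows the mitochondrial entries, the other names are invented.
bam_seqlevels <- c("chr1", "chr2", "MT")
cds_seqlevels <- c("chr1", "chr2", "chrM")

# Names present in one object but not in the other.
only_in_bam <- setdiff(bam_seqlevels, cds_seqlevels)
only_in_cds <- setdiff(cds_seqlevels, bam_seqlevels)

# TRUE only when both objects use exactly the same set of names.
same_names <- setequal(bam_seqlevels, cds_seqlevels)
```

Any names left in only_in_bam or only_in_cds are the ones that would need renaming before the two objects can be combined.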

Getting error in installing GenomeInfoDb

Hello

I am not able to install GenomeInfoDb

Can you help me please?

> remotes::install_github("Bioconductor/GenomeInfoDb")
Downloading GitHub repo Bioconductor/GenomeInfoDb@HEAD
✓  checking for file ‘/tmp/Rtmp4he3Nm/remotese273c245ce2/Bioconductor-GenomeInfoDb-d4604b1/DESCRIPTION’ ...
─  preparing ‘GenomeInfoDb’:
  ✓  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘GenomeInfoDb_1.31.1.tar.gz’

Installing package into ‘/opt/R/4.0.5/lib/R/site-library’
(as ‘lib’ is unspecified)
* installing *source* package ‘GenomeInfoDb’ ...
mv: cannot move '/opt/R/4.0.5/lib/R/site-library/GenomeInfoDb' to '/opt/R/4.0.5/lib/R/site-library/00LOCK-GenomeInfoDb/GenomeInfoDb': Permission denied
ERROR: cannot remove earlier installation, is it in use?
  * removing ‘/opt/R/4.0.5/lib/R/site-library/GenomeInfoDb’
Warning message:
  In i.p(...) :
  installation of package ‘/tmp/Rtmp4he3Nm/filee27222dfe53/GenomeInfoDb_1.31.1.tar.gz’ had non-zero exit status
>

> version
_                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          0.5                         
year           2021                        
month          03                          
day            31                          
svn rev        80133                       
language       R                           
version.string R version 4.0.5 (2021-03-31)
nickname       Shake and Throw             
>

Update to Callithrix_jacchus.R

Hi, this isn't an issue, but I was not sure how to appropriately propose an addition to one of the NCBI assembly lists.

I'd like to add a newer assembly for marmoset to the list and also rename the assembly that is currently listed (it is currently listed as ferret). I wasn't sure of the naming conventions for the assemblies, so I took the assembly names from NCBI.

Here are my proposed changes:

ORGANISM <- "Callithrix jacchus"

### List of assemblies by date.
ASSEMBLIES <- list(
    list(assembly="Callithrix jacchus-3.2",
         date="2010/01/22",
         assembly_accession="GCA_000004665.1",  # calJac3
         circ_seqs=character(0)),

    list(assembly="Callithrix_jacchus_cj1700_1.1",
         date="2020/05/22",
         assembly_accession="GCA_009663435.2",  # cj1700
         circ_seqs=character(0))
)

Here is the current version:

ORGANISM <- "Callithrix jacchus"

### List of assemblies by date.
ASSEMBLIES <- list(
    list(assembly="MusPutFurMale1.0",
         date="2010/01/22",
         assembly_accession="GCA_000004665.1",  # calJac3
         circ_seqs=character(0))
)
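
A proposed ASSEMBLIES list like the one above can be sanity-checked before submitting it. The following is a rough sketch in base R; the helper name check_assemblies and the set of required fields are assumptions taken from the entries shown in this post, not rules confirmed by the package:

```r
# The proposed list from the post (comments dropped for brevity).
ASSEMBLIES <- list(
    list(assembly="Callithrix jacchus-3.2",
         date="2010/01/22",
         assembly_accession="GCA_000004665.1",
         circ_seqs=character(0)),
    list(assembly="Callithrix_jacchus_cj1700_1.1",
         date="2020/05/22",
         assembly_accession="GCA_009663435.2",
         circ_seqs=character(0))
)

# Hypothetical checker: the required fields are assumed from the entries above.
check_assemblies <- function(assemblies) {
    required <- c("assembly", "date", "assembly_accession", "circ_seqs")
    for (entry in assemblies)
        stopifnot(all(required %in% names(entry)))
    # Dates must parse as YYYY/MM/DD and be in chronological order.
    dates <- as.Date(vapply(assemblies, function(e) e$date, character(1)),
                     format="%Y/%m/%d")
    stopifnot(!anyNA(dates), all(diff(as.numeric(dates)) >= 0))
    # Accessions should look like GenBank (GCA_) or RefSeq (GCF_) accessions.
    accessions <- vapply(assemblies, function(e) e$assembly_accession,
                         character(1))
    stopifnot(all(grepl("^GC[AF]_[0-9]+\\.[0-9]+$", accessions)))
    invisible(TRUE)
}

check_assemblies(ASSEMBLIES)
```

The check only covers the shape of the list; whether the assembly names and accessions match what NCBI reports still has to be verified by hand.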

Registration request for Patiria miniata NCBI assembly

I'd like to have the genome for Patiria miniata registered for the purpose of forging a BSgenome package. The assembly is ASM1570657v1 (Pmin_3.0) and below is the link to the NCBI page:

https://www.ncbi.nlm.nih.gov/assembly/GCF_015706575.1/

Let me know if there's anything I can clear up or help with. Thank you very much.

GRCh38.p14 covers T2T-CHM13 information, correct?

I am wondering whether GRCh38.p14 covers the T2T-CHM13 information. Is that correct?

Thanks.

Shicheng

Problem with Seqinfo(genome="hg38")

The following error occurs:

> Seqinfo(genome="hg38")
 Error in .order_seqlevels(chrom_sizes[, "chrom"]) : 
  !anyNA(m32) is not TRUE

No errors occur with Seqinfo(genome="hg19")
Here is my session info:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] GenomeInfoDb_1.28.4 IRanges_2.26.0      S4Vectors_0.30.2    BiocGenerics_0.38.0

loaded via a namespace (and not attached):
 [1] compiler_4.1.1         BiocManager_1.30.16    prettyunits_1.1.1      bitops_1.0-7           remotes_2.4.1          tools_4.1.1           
 [7] testthat_3.1.0         pkgbuild_1.2.0         pkgload_1.2.3          memoise_2.0.0          lifecycle_1.0.1        rlang_0.4.11          
[13] cli_3.0.1              rstudioapi_0.13        curl_4.3.2             fastmap_1.1.0          GenomeInfoDbData_1.2.6 withr_2.4.2           
[19] desc_1.4.0             fs_1.5.0               devtools_2.4.2         rprojroot_2.0.2        glue_1.4.2             R6_2.5.1              
[25] processx_3.5.2         sessioninfo_1.1.1      callr_3.7.0            purrr_0.3.4            magrittr_2.0.1         ps_1.6.0              
[31] ellipsis_0.3.2         usethis_2.1.0          renv_0.14.0            RCurl_1.98-1.5         cachem_1.0.6           crayon_1.4.1

Proposed contribution task for Outreachy applicants: Enable "offline mode" for canFam6

This task depends on issue #43 being completed first (i.e. PR accepted and merged, and issue closed). Although it's not a requirement that the 2 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Error: package or namespace load failed for ‘GenomeInfoDb’; there is no package called ‘GenomeInfoDbData’

I am having an issue where my Jupyter notebook cannot find the GenomeInfoDbData package even though it is already installed in my conda environment. Note: I am trying to use this package as part of the SpatialExperiment library.

Conda Version: 4.14.0
Running locally on macOS 11.6
Environment Dependencies:
requirements.txt

Error Readout when running library(SpatialExperiment) in a jupyter notebook:

`Loading required package: SingleCellExperiment

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats

Attaching package: ‘MatrixGenerics’

The following objects are masked from ‘package:matrixStats’:

colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics

Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges

Loading required package: GenomeInfoDb

Error: package or namespace load failed for ‘GenomeInfoDb’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
there is no package called ‘GenomeInfoDbData’

Error: package ‘GenomeInfoDb’ could not be loaded
Traceback:

library(SpatialExperiment)
.getRequiredPackages2(pkgInfo, quietly = quietly)
library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc,
. quietly = quietly)
.getRequiredPackages2(pkgInfo, quietly = quietly)
library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc,
. quietly = quietly)
.getRequiredPackages2(pkgInfo, quietly = quietly)
library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc,
. quietly = quietly)
.getRequiredPackages2(pkgInfo, quietly = quietly)
stop(gettextf("package %s could not be loaded", sQuote(pkg)),
. call. = FALSE, domain = NA)`

Proposed contribution task for Outreachy applicants: Register UCSC genome xenTro10

xenTro10 is the latest UCSC genome for the Western clawed frog (Xenopus tropicalis). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find X. tropicalis in the UCSC species tree on the left, click on it, then make sure to select the latest X. tropicalis Assembly (xenTro10). This will display a bunch of additional information about the xenTro10 assembly.

the assembled.molecules argument is ignored,
the assembled and circular columns of the returned data.frame are filled with NAs,
and the chromosomes/sequences are not returned in any particular order.

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.

For xenTro10, since this is the first xenTro genome that we're going to register in GenomeInfoDb, we need to start the xenTro10.R file from scratch. However, looking at other registration files to get a feel for how things are done is always a good idea. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (UCB_Xtro_10.0) is also registered (registering UCB_Xtro_10.0 is the topic of issue #47).

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

ref : Macaca Fascicularis 5.0, missing style

Hello.

We are working with VCFs of Macaca fascicularis.
We ran into an issue with your function extractSeqlevelsByGroup, which is called by MutationalPatterns::read_vcfs_as_granges.
The chromosome naming style for Macaca fascicularis is not available.
Could BSgenome.Mfascicularis.NCBI.5.0 be added to the supported organisms?
Macaca fascicularis is an alternative to Macaca mulatta, and research on it is increasing.

Regards,
Elyas.

seqlevelsStyle on GRanges throws warning

library(GenomeInfoDb)
library(GenomicRanges)
gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))
seqlevelsStyle(gr)
[1] "UCSC"
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] GenomicRanges_1.34.0 GenomeInfoDb_1.18.0 IRanges_2.16.0
[4] S4Vectors_0.20.0 BiocGenerics_0.28.0

Proposed contribution task for Outreachy applicants: Link felCat9 (UCSC genome) to Felis_catus_9.0 (NCBI assembly)

This task depends on issues #49 and #50 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the map.NCBI argument of the getChromInfoFromUCSC() function. Try getChromInfoFromUCSC("hg19", map.NCBI=TRUE). See what happens? Now try getChromInfoFromUCSC("felCat9", map.NCBI=TRUE). See what the problem is? Check the documentation of the map.NCBI argument in ?getChromInfoFromUCSC to learn more about what this argument does.

Linking a UCSC genome to its NCBI assembly is done by defining an NCBI_LINKER object in the registration file for the UCSC genome (felCat9.R in this case). There's some very succinct information about what NCBI_LINKER should look like in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/. Don't hesitate to look at other registration files to see examples of how NCBI_LINKER is defined.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Utilize UCSC REST API to retrieve Seqinfo

Hello,

As the UCSC REST API is now available, there might be an opportunity to update the method of retrieving Seqinfo from UCSC for better maintainability.

Currently, Seqinfo is retrieved by downloading a file from UCSC goldenPath, which is then processed with read.table().
With the UCSC REST API, Seqinfo could also be retrieved; however, this would require a JSON parsing library (something like jsonlite) as a dependency.

Example : http://api.genome.ucsc.edu/list/chromosomes?genome=mm9

What do you all think about this?
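
To make the proposal concrete, here is a rough sketch of the parsing step with jsonlite. It assumes the endpoint returns a JSON object whose "chromosomes" field maps sequence names to lengths; the function name parse_ucsc_chromosomes and the hand-written sample response are illustrative only, and no network call is made:

```r
library(jsonlite)

# Hypothetical helper: turn the JSON text of a /list/chromosomes response
# into a data.frame. Assumes a "chromosomes" object of name -> length pairs.
parse_ucsc_chromosomes <- function(json_text) {
    parsed <- fromJSON(json_text)
    sizes <- unlist(parsed$chromosomes)
    data.frame(chrom = names(sizes),
               size = as.numeric(sizes),
               stringsAsFactors = FALSE)
}

# Hand-written fragment in the assumed response shape (not a real download).
sample_json <- '{"genome": "mm9", "chromosomes": {"chr1": 197195432, "chrM": 16299}}'
chrom_info <- parse_ucsc_chromosomes(sample_json)
```

In real use the JSON text would come from the URL in the example above rather than from a literal string, and the result would still need to be ordered and turned into a Seqinfo object.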

Mmusculus.v79 UCSC/NCBI chr name mismatch.

It is probably a consequence of the patch for issue #27.

tmp <- genes(EnsDb.Mmusculus.v79)
seqlevelsStyle(tmp) <- "UCSC"

This triggers an error (poorly translated from French):

Error in (function (UCSC_chrom_info, assembly_accession, AssemblyUnits = NULL, identical(UCSC_chrom_info[compare_idx, chrom"], NCBI_chrom_info[compare_idx,  .... is not TRUE

I cannot pinpoint which chromosome name is wrong (sorry, I am still a novice), but the function involved seems to be .add_NCBI_cols_to_UCSC_chrom_info

Any suggestions about where I should look to find the culprit?

Proposed contribution task for Outreachy applicants: Register NCBI assembly Dog10K_Boxer_Tasha

Dog10K_Boxer_Tasha is a Dog assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000002285.5/

Note that Dog10K_Boxer_Tasha is the assembly that canFam6, the latest UCSC genome for Dog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Dog in the UCSC species tree on the left, click on it, then make sure to select the latest Dog Assembly (canFam6). This will display a bunch of additional information about the canFam6 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

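Note that when an NCBI assembly is not registered in GenomeInfoDb, getChromInfoFromNCBI() has the following limitations:
The name of the assembly is not recognized; only lookup by GenBank or RefSeq accession works.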
The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.
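
To make the circularity point concrete, here is a small sketch of how one might check whether the flags in a chrom info data.frame can be trusted. The helper has_reliable_circ_flags() is hypothetical, and the two data.frames are made-up stand-ins that only mimic the "circular" column described above. The real call is shown commented out because it needs GenomeInfoDb and a network connection.

```r
# Hypothetical helper: TRUE if the 'circular' column contains no NAs,
# i.e. the circularity flags are expected to be accurate.
has_reliable_circ_flags <- function(chrominfo) {
    stopifnot(is.data.frame(chrominfo),
              "circular" %in% colnames(chrominfo))
    !anyNA(chrominfo$circular)
}

# Made-up stand-ins for the two situations (not real NCBI data).
unregistered <- data.frame(SequenceName = c("chr1", "MT"),
                           circular = c(NA, NA))
registered <- data.frame(SequenceName = c("chr1", "MT"),
                         circular = c(FALSE, TRUE))
has_reliable_circ_flags(unregistered)
has_reliable_circ_flags(registered)

# With GenomeInfoDb loaded and a network connection, the real check would be:
# library(GenomeInfoDb)
# chrominfo <- getChromInfoFromNCBI("GCF_000002285.5")  # lookup by accession
# has_reliable_circ_flags(chrominfo)
```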

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Proposed contribution task for Outreachy applicants: Register NCBI assembly UCB_Xtro_10.0

UCB_Xtro_10.0 is a Western clawed frog (Xenopus tropicalis) assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000004195.4/

Note that UCB_Xtro_10.0 is the assembly that xenTro10, the latest UCSC genome for the Western clawed frog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find the Western clawed frog in the UCSC species tree on the left, click on it, then make sure to select the latest X. tropicalis Assembly (xenTro10). This will display a bunch of additional information about the xenTro10 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

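Note that when an NCBI assembly is not registered in GenomeInfoDb, getChromInfoFromNCBI() has the following limitations:
The name of the assembly is not recognized; only lookup by GenBank or RefSeq accession works.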
The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Proposed contribution task for Outreachy applicants: Enable "offline mode" for felCat9

This task depends on issue #49 being completed first (i.e. PR accepted and merged, and issue closed). Although it's not a requirement that the 2 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

unable to find an inherited method for function ‘seqinfo<-’ for signature ‘"TxDb"’

Hi,

I noticed that seqlevelsStyle<- does not work anymore for TxDb objects:

> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> seqlevelsStyle(txdb)
[1] "UCSC"
> seqlevelsStyle(txdb) <- "NCBI"
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘seqinfo<-’ for signature ‘"TxDb"’

Has this functionality been removed permanently, or is this a bug?

Thanks,
Markus

Proposed contribution task for Outreachy applicants: Link xenTro10 (UCSC genome) to UCB_Xtro_10.0 (NCBI assembly)

This task depends on issues #46 and #47 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the map.NCBI argument of the getChromInfoFromUCSC() function. Try getChromInfoFromUCSC("hg19", map.NCBI=TRUE). See what happens? Now try getChromInfoFromUCSC("xenTro10", map.NCBI=TRUE). See what the problem is? Check the documentation of the map.NCBI argument in ?getChromInfoFromUCSC to learn more about what this argument does.

Linking a UCSC genome to its NCBI assembly is done by defining an NCBI_LINKER object in the registration file for the UCSC genome (xenTro10.R in this case). There's some very succinct information about what NCBI_LINKER should look like in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/. Don't hesitate to look at other registration files to see examples of how NCBI_LINKER is defined.
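
As a rough illustration only, the linking part of a registration file might look like the sketch below. It assumes that NCBI_LINKER is a named list with an assembly_accession element holding the accession of the underlying NCBI assembly. That element name is an assumption and additional elements may be needed, so the README.TXT file and the existing registration files remain the authoritative reference.

```r
# Sketch of an NCBI_LINKER definition for xenTro10 (assumed structure).
# The accession is the one given for UCB_Xtro_10.0 in the NCBI assembly
# registration task; the element name 'assembly_accession' is an assumption.
NCBI_LINKER <- list(
    assembly_accession = "GCF_000004195.4"
)
```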

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Proposed contribution task for Outreachy applicants: Link canFam6 (UCSC genome) to Dog10K_Boxer_Tasha (NCBI assembly)

This task depends on issues #43 and #44 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the map.NCBI argument of the getChromInfoFromUCSC() function. Try getChromInfoFromUCSC("hg19", map.NCBI=TRUE). See what happens? Now try getChromInfoFromUCSC("canFam6", map.NCBI=TRUE). See what the problem is? Check the documentation of the map.NCBI argument in ?getChromInfoFromUCSC to learn more about what this argument does.

Linking a UCSC genome to its NCBI assembly is done by defining an NCBI_LINKER object in the registration file for the UCSC genome (canFam6.R in this case). There's some very succinct information about what NCBI_LINKER should look like in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/. Don't hesitate to look at other registration files to see examples of how NCBI_LINKER is defined.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Issue with seqlevelsStyle()

I was doing scATAC analysis using Signac, where I need GenomeInfoDb and EnsDb.Hsapiens.v86 to annotate my data. It worked nicely at the beginning of last week, but then I ran into an issue with the following code:

annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"

Here is the error message I received:

Error in .order_seqlevels(chrom_sizes[, "chrom"]) : 
  !anyNA(m32) is not TRUE

I would appreciate it if you could take a look and give me some feedback.

Here's my R sessionInfo():

R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /fast/work/users/yhsieh_m/miniconda3/envs/Signac/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] dplyr_1.0.7               EnsDb.Hsapiens.v86_2.99.0
 [3] ensembldb_2.17.4          AnnotationFilter_1.17.1  
 [5] GenomicFeatures_1.45.2    AnnotationDbi_1.55.1     
 [7] Biobase_2.53.0            GenomicRanges_1.45.0     
 [9] GenomeInfoDb_1.29.8       IRanges_2.27.2           
[11] S4Vectors_0.31.5          BiocGenerics_0.39.2      
[13] patchwork_1.1.1           ggplot2_3.3.5            
[15] SeuratObject_4.0.2        Seurat_4.0.5             
[17] Signac_1.4.0             

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                  reticulate_1.22            
  [3] tidyselect_1.1.1            RSQLite_2.2.8              
  [5] htmlwidgets_1.5.4           grid_4.1.1                 
  [7] docopt_0.7.1                BiocParallel_1.27.17       
  [9] Rtsne_0.15                  munsell_0.5.0              
 [11] codetools_0.2-18            ica_1.0-2                  
 [13] future_1.22.1               miniUI_0.1.1.1             
 [15] withr_2.4.2                 colorspace_2.0-2           
 [17] filelock_1.0.2              knitr_1.36                 
 [19] rstudioapi_0.13             ROCR_1.0-11                
 [21] tensor_1.5                  listenv_0.8.0              
 [23] MatrixGenerics_1.5.4        slam_0.1-48                
 [25] GenomeInfoDbData_1.2.7      polyclip_1.10-0            
 [27] bit64_4.0.5                 farver_2.1.0               
 [29] parallelly_1.28.1           vctrs_0.3.8                
 [31] generics_0.1.0              xfun_0.26                  
 [33] biovizBase_1.41.0           BiocFileCache_2.1.1        
 [35] lsa_0.73.2                  ggseqlogo_0.1              
 [37] R6_2.5.1                    hdf5r_1.3.4                
 [39] bitops_1.0-7                spatstat.utils_2.2-0       
 [41] cachem_1.0.6                DelayedArray_0.19.4        
 [43] assertthat_0.2.1            promises_1.2.0.1           
 [45] BiocIO_1.3.0                scales_1.1.1               
 [47] nnet_7.3-16                 gtable_0.3.0               
 [49] globals_0.14.0              goftest_1.2-3              
 [51] rlang_0.4.11                RcppRoll_0.3.0             
 [53] splines_4.1.1               rtracklayer_1.53.1         
 [55] lazyeval_0.2.2              dichromat_2.0-0            
 [57] checkmate_2.0.0             spatstat.geom_2.3-0        
 [59] yaml_2.2.1                  reshape2_1.4.4             
 [61] abind_1.4-5                 backports_1.2.1            
 [63] httpuv_1.6.3                Hmisc_4.6-0                
 [65] tools_4.1.1                 ellipsis_0.3.2             
 [67] spatstat.core_2.3-0         RColorBrewer_1.1-2         
 [69] ggridges_0.5.3              Rcpp_1.0.7                 
 [71] plyr_1.8.6                  base64enc_0.1-3            
 [73] progress_1.2.2              zlibbioc_1.39.0            
 [75] purrr_0.3.4                 RCurl_1.98-1.5             
 [77] prettyunits_1.1.1           rpart_4.1-15               
 [79] deldir_1.0-5                pbapply_1.5-0              
 [81] cowplot_1.1.1               zoo_1.8-9                  
 [83] SummarizedExperiment_1.23.5 ggrepel_0.9.1              
 [85] cluster_2.1.2               magrittr_2.0.1             
 [87] data.table_1.14.2           scattermore_0.7            
 [89] lmtest_0.9-38               RANN_2.6.1                 
 [91] SnowballC_0.7.0             ProtGenerics_1.25.1        
 [93] fitdistrplus_1.1-6          matrixStats_0.61.0         
 [95] hms_1.1.1                   mime_0.12                  
 [97] xtable_1.8-4                XML_3.99-0.8               
 [99] jpeg_0.1-9                  sparsesvd_0.2              
[101] gridExtra_2.3               compiler_4.1.1             
[103] biomaRt_2.49.6              tibble_3.1.5               
[105] KernSmooth_2.23-20          crayon_1.4.1               
[107] htmltools_0.5.2             mgcv_1.8-38                
[109] later_1.3.0                 Formula_1.2-4              
[111] tidyr_1.1.4                 DBI_1.1.1                  
[113] tweenr_1.0.2                dbplyr_2.1.1               
[115] MASS_7.3-54                 rappdirs_0.3.3             
[117] Matrix_1.3-4                parallel_4.1.1             
[119] igraph_1.2.7                pkgconfig_2.0.3            
[121] GenomicAlignments_1.29.0    foreign_0.8-81             
[123] plotly_4.10.0               spatstat.sparse_2.0-0      
[125] xml2_1.3.2                  XVector_0.33.0             
[127] VariantAnnotation_1.39.0    stringr_1.4.0              
[129] digest_0.6.28               sctransform_0.3.2          
[131] RcppAnnoy_0.0.19            spatstat.data_2.1-0        
[133] Biostrings_2.61.2           leiden_0.3.9               
[135] fastmatch_1.1-3             htmlTable_2.3.0            
[137] uwot_0.1.10                 restfulr_0.0.13            
[139] curl_4.3.2                  shiny_1.7.1                
[141] Rsamtools_2.9.1             rjson_0.2.20               
[143] lifecycle_1.0.1             nlme_3.1-153               
[145] jsonlite_1.7.2              BSgenome_1.61.0            
[147] viridisLite_0.4.0           fansi_0.5.0                
[149] pillar_1.6.3                lattice_0.20-45            
[151] KEGGREST_1.33.0             fastmap_1.1.0              
[153] httr_1.4.2                  survival_3.2-13            
[155] glue_1.4.2                  qlcMatrix_0.9.7            
[157] png_0.1-7                   bit_4.0.4                  
[159] ggforce_0.3.3               stringi_1.7.5              
[161] blob_1.2.2                  latticeExtra_0.6-29        
[163] memoise_2.0.0               irlba_2.3.3                
[165] future.apply_1.8.1

GenomeInfoDb for non- NCBI or -UCSC assemblies/genomes

Hello,

I have a custom genome compiled via BSgenome. When I try to use this genome with GenomeInfoDb, I get the following error:

> Seqinfo(genome="BSgenomeCper")

Error in .make_Seqinfo_from_genome(genome): "BSgenomeCper" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

Does this mean that I can use GenomeInfoDb only for the currently registered NCBI or UCSC assemblies/genomes, and that if the genome of interest is not publicly available (the genome I am using is an in-house genome), it's not possible to process it via GenomeInfoDb? Is there any workaround?

Best,
Aybuge
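
One possible workaround, sketched here with made-up sequence names and lengths: Seqinfo(genome=...) only looks up registered genomes, but the Seqinfo() constructor also accepts explicit seqnames, seqlengths, isCircular, and genome arguments, so a Seqinfo object for an in-house assembly can be built by hand. If the forged BSgenome package is installed, calling seqinfo() on its genome object is another route.

```r
library(GenomeInfoDb)

# Made-up sequence names and lengths standing in for an in-house assembly.
custom_seqinfo <- Seqinfo(
    seqnames = c("scaffold_1", "scaffold_2", "mito"),
    seqlengths = c(1500000L, 820000L, 16500L),
    isCircular = c(FALSE, FALSE, TRUE),
    genome = "Cper_inhouse"  # free-form label; it is not looked up anywhere
)
custom_seqinfo

# The object can then be attached to ranges, e.g.
# seqinfo(gr) <- custom_seqinfo
# for a GRanges object 'gr' whose seqlevels match.
```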

Installing DESeq2 Error: no package called ‘GenomeInfoDbData’

I just updated my previous R to 4.0.1 and now I can't load DESeq2.

When trying to load it again I get this:

library(DESeq2)
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Error: package or namespace load failed for ‘GenomeInfoDb’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
there is no package called ‘GenomeInfoDbData’
Error: package ‘GenomeInfoDb’ could not be loaded

So I tried to load that package:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomeInfoDbData")
Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.1 (2020-06-06)
Installing package(s) 'GenomeInfoDbData'
installing the source package ‘GenomeInfoDbData’

trying URL 'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/GenomeInfoDbData_1.2.3.tar.gz'
Content type 'application/x-gzip' length 10413139 bytes (9.9 MB)

downloaded 9.9 MB

dyld: lazy symbol binding failed: Symbol not found: _utimensat
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _utimensat
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
Expected in: /usr/lib/libSystem.B.dylib

/Library/Frameworks/R.framework/Resources/bin/INSTALL: line 34: 9542 Done echo 'tools:::.install_packages()'
9543 Abort trap: 6 | R_DEFAULT_PACKAGES= LC_COLLATE=C "${R_HOME}/bin/R" $myArgs --no-echo --args ${args}

The downloaded source packages are in
‘/private/var/folders/0t/8jm6lgqs0qj63rprpf9q_nfw0000gn/T/RtmpMNoZz3/downloaded_packages’
Old packages: 'RcppArmadillo', 'survival'
Update all/some/none? [a/s/n]:
a

There are binary versions available but the source versions are later:
              binary      source      needs_compilation
RcppArmadillo 0.9.880.1.0 0.9.900.1.0 TRUE
survival      3.1-12      3.2-3       TRUE

Do you want to install from sources the packages which need compilation? (Yes/no/cancel) no
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/RcppArmadillo_0.9.880.1.0.tgz'
Content type 'application/x-gzip' length 1863064 bytes (1.8 MB)

downloaded 1.8 MB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/survival_3.1-12.tgz'
Content type 'application/x-gzip' length 7849515 bytes (7.5 MB)

downloaded 7.5 MB

The downloaded binary packages are in
/var/folders/0t/8jm6lgqs0qj63rprpf9q_nfw0000gn/T//RtmpMNoZz3/downloaded_packages
Warning message:
In install.packages(...) :
installation of package ‘GenomeInfoDbData’ had non-zero exit status

Any advice, folks?

Error in .order_seqlevels(chrom_sizes[, "chrom"]) : !anyNA(m32) is not TRUE

> suppressPackageStartupMessages(library(GenomeInfoDb))
> Seqinfo(genome="hg38")
Error in .order_seqlevels(chrom_sizes[, "chrom"]) :
  !anyNA(m32) is not TRUE

SessionInfo

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.10.0
LAPACK: /usr/lib/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomeInfoDb_1.28.4 IRanges_2.26.0      S4Vectors_0.30.2
[4] BiocGenerics_0.38.0

loaded via a namespace (and not attached):
[1] compiler_4.1.2         GenomeInfoDbData_1.2.6 RCurl_1.98-1.5
[4] bitops_1.0-7

Changing seqlevelsStyle of BSgenome fails because of multiple genomes

Hi,

I think this is related to #12 and https://stat.ethz.ch/pipermail/bioc-devel/2020-July/016966.html (seqlevelsStyle now being able to rename contigs and scaffolds).
Trying to convert the seqlevelsStyle of the UCSC hg19 BSgenome (same for hg38) fails:

library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(Hsapiens) <- "NCBI"

gives

Error in .normarg_genome(value, seqnames(x)) : 
  when 'genome' vector is named and contains more than one distinct
  value, it cannot have duplicated names

For completeness, the reason I noticed this is that it causes the SGSeq package (on which one of my packages depends) to fail during the vignette building (http://bioconductor.org/checkResults/3.12/bioc-LATEST/SGSeq/merida1-buildsrc.html), and I guess I'm trying to figure out where it should be fixed 😃

Thanks!

Session info

> BiocManager::version()
[1] ‘3.12’
> BiocManager::valid()
[1] TRUE

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.57.4                  
 [3] rtracklayer_1.49.3                Biostrings_2.57.2                
 [5] XVector_0.29.3                    GenomicRanges_1.41.5             
 [7] GenomeInfoDb_1.25.8               IRanges_2.23.10                  
 [9] S4Vectors_0.27.12                 BiocGenerics_0.35.4              

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11             knitr_1.29                 
 [3] zlibbioc_1.35.0             GenomicAlignments_1.25.3   
 [5] BiocParallel_1.23.2         lattice_0.20-41            
 [7] tools_4.0.2                 grid_4.0.2                 
 [9] SummarizedExperiment_1.19.6 Biobase_2.49.0             
[11] xfun_0.15                   matrixStats_0.56.0         
[13] crayon_1.3.4                Matrix_1.2-18              
[15] GenomeInfoDbData_1.2.3      bitops_1.0-6               
[17] RCurl_1.98-1.2              DelayedArray_0.15.7        
[19] compiler_4.0.2              Rsamtools_2.5.3            
[21] XML_3.99-0.4

installation of GenomeInfoDbData failed

Hello,

GenomeInfoDbData (a dependency of GenomeInfoDb) somehow failed to install. Does anyone have a clue about this issue?
I retried with a freshly installed R but got the same error. Thanks in advance!

rtools40 was installed in C:\rtools40\usr\bin and was in Sys.getenv('PATH')

Sys.which('make')
make
"C:\rtools40\usr\bin\make.exe"

BiocManager::install("GenomeInfoDbData")

'getOption("repos")' replaces Bioconductor
standard repositories, see '?repositories' for
details

replacement repositories:
    CRAN: https://cran.rstudio.com/

Bioconductor version 3.13 (BiocManager 1.30.16),
  R 4.1.1 (2021-08-10)
Installing package(s) 'BiocVersion',
  'GenomeInfoDbData'
trying URL 'https://bioconductor.org/packages/3.13/bioc/bin/windows/contrib/4.1/BiocVersion_3.13.1.zip'
Content type 'application/zip' length 9399 bytes
downloaded 9399 bytes

package ‘BiocVersion’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\name\AppData\Local\Temp\RtmpyqbAD8\downloaded_packages
installing the source package ‘GenomeInfoDbData’

trying URL 'https://bioconductor.org/packages/3.13/data/annotation/src/contrib/GenomeInfoDbData_1.2.6.tar.gz'
Content type 'application/x-gzip' length 10973004 bytes (10.5 MB)
downloaded 10.5 MB


The downloaded source packages are in
	‘C:\Users\name\AppData\Local\Temp\RtmpyqbAD8\downloaded_packages’
Old packages: 'lattice', 'nlme', 'survival'
Update all/some/none? [a/s/n]: 
n
Warning message:
In .inet_warning(msg) :
  installation of package ‘GenomeInfoDbData’ had non-zero exit status

sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils    
[5] datasets  methods   base     

loaded via a namespace (and not attached):
[1] BiocManager_1.30.16 compiler_4.1.1     
[3] tools_4.1.1

Caching seqinfo

Hi, my package uses GenomeInfoDb, and we use the seqlevelsStyle function to clean up user-inputted data and ensure consistent chromosome names (in our case, we go with NCBI style, which means stripping chr prefixes). I can see that what seems like a simple task gets complicated under the hood with the need to download the latest info from NCBI, Ensembl, and UCSC.

I found that .UCSC_cached_chrom_info and .NCBI_cached_chrom_info store the necessary information for seqlevelsStyle throughout a session, but an internet connection is still needed at the start of every new session. This causes a problem for offline users and users on networks that for whatever reason are blocking any of NCBI/UCSC/Ensembl traffic (yes, this is really happening). Since seqinfo is such a small amount of data, is there a plan to take advantage of R's support for caching user data to save this information and allow seqlevelsStyle to run offline? Or is there a safe workaround to supply the necessary seqinfo?

I did it this way, but I'm concerned this could cause problems with new GenomeInfoDb releases or if anything changes on the NCBI/UCSC/Ensembl server side.

# Get information for local caching (run once, with internet access)
library(BSgenome)  # provides getBSgenome(); the hg19 BSgenome data package must be installed
bsg = getBSgenome("hg19")
seqlevelsStyle(bsg) = "NCBI"
ucsc_info = getFromNamespace(".UCSC_cached_chrom_info", "GenomeInfoDb")[["hg19"]]
ucsc_info = GenomeInfoDb:::.add_ensembl_column(ucsc_info, "hg19")
ncbi_info = getFromNamespace(".NCBI_cached_chrom_info", "GenomeInfoDb")[["GCF_000001405.25"]]
saveRDS(ncbi_info, "hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
saveRDS(ucsc_info, "hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")

# Later, in new (offline) R session
ucsc_info = readRDS("hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")
ncbi_info = readRDS("hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
assign('hg19', ucsc_info, envir = get(".UCSC_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))
assign('GCF_000001405.25', ncbi_info, envir = get(".NCBI_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))

# seqlevelsStyle now works offline
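
On the question of R's support for caching user data: a minimal sketch, assuming R >= 4.0, of how the saved .rds files could live in the per-user cache directory returned by tools::R_user_dir() instead of the working directory. The helper names (cache_file, save_chrom_info, load_chrom_info) and the package name "myPackage" are hypothetical and not part of GenomeInfoDb, and the toy data.frame only stands in for the real cached chrom info.

```r
# Hypothetical helpers (not part of GenomeInfoDb): keep a chrom-info
# data.frame in R's per-user cache directory (requires R >= 4.0).
cache_file <- function(key, pkg = "myPackage") {
    dir <- tools::R_user_dir(pkg, which = "cache")
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    file.path(dir, paste0(key, "_chrom_info.rds"))
}

save_chrom_info <- function(info, key, pkg = "myPackage") {
    saveRDS(info, cache_file(key, pkg))
    invisible(info)
}

# Returns NULL when nothing has been cached for this key yet.
load_chrom_info <- function(key, pkg = "myPackage") {
    path <- cache_file(key, pkg)
    if (file.exists(path)) readRDS(path) else NULL
}

# Demo only: redirect the cache to a temporary directory so that this
# example does not touch the real user cache.
Sys.setenv(R_USER_CACHE_DIR = tempdir())

# Toy stand-in for the data.frame held in .UCSC_cached_chrom_info
toy <- data.frame(chrom = c("chr1", "chrM"), size = c(249250621L, 16571L))
save_chrom_info(toy, "hg19")
restored <- load_chrom_info("hg19")
```

In a new session, load_chrom_info("hg19") would return the stored table (or NULL if nothing was cached), which could then be assigned into the internal cache environment as in the workaround above. That last step still relies on GenomeInfoDb internals, so the same caveat about future releases applies.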

seqlevelsStyle setter does not work with specific genome build

Hi Hervé, @hpages
I encountered this issue while working on a fix for TCGAutils in devel.
Am I taking the right approach here?
Best,
Marcel

suppressPackageStartupMessages({
    library(curatedTCGAData)
    library(TCGAutils)
    library(GenomeInfoDb)
})

suppressMessages({
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
    assays = c("CNASeq", "Mutation", "miRNA*",
        "RNASeq2*", "mRNAArray", "Methyl*"), dry.run = FALSE)
})

rag <- "COAD_Mutation-20160128"
grag <- rowRanges(coad[[rag]])

genome(grag)
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" 
#>   17   18   19   20   21   22    X    Y 
#> "36" "36" "36" "36" "36" "36" "36" "36"
genome(grag) <- translateBuild(genome(grag))
genome(grag)
#>      1      2      3      4      5      6      7      8      9     10     11 
#> "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
#>     12     13     14     15     16     17     18     19     20     21     22 
#> "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
#>      X      Y 
#> "hg18" "hg18"

seqlevelsStyle(grag) <- "UCSC"

seqlevels(grag)
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"

# OTOH

gr <- rowRanges(coad[[rag]])
genome(gr) <- "Homo_sapiens"
seqlevelsStyle(gr) <- "UCSC"

seqlevels(gr)
#>  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
#> [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
#> [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"

Created on 2020-07-06 by the reprex package (v0.3.0)

N-ary nature of merge,Seqinfo,Seqinfo-method is not documented.

As briefly mentioned in Bioconductor/GenomicRanges#54. Currently, I only see in ?Seqinfo:

‘merge(x, y)’: Merge ‘x’ and ‘y’ into a single Seqinfo object where the
          keys (aka the seqnames) are ‘union(seqnames(x),
          seqnames(y))’.  If a row in ‘y’ has the same key as a row in
          ‘x’, and if the 2 rows contain compatible information (NA
          values are compatible with anything), then they are merged
          into a single row in the result.  If they cannot be merged
          (because they contain different seqlengths, and/or
          circularity flags, and/or genome identifiers), then an error
          is raised.  In addition to check for incompatible sequence
          information, ‘merge(x, y)’ also compares ‘seqnames(x)’ with
          ‘seqnames(y)’ and issues a warning if each of them has names
          not in the other. The purpose of these checks is to try to
          detect situations where the user might be combining or
          comparing objects based on different reference genomes.

A user could possibly infer the presence of N-ary functionality for Seqinfo's merge from ?"merge,Vector,Vector-method", if one happened to stumble across that documentation... but I wouldn't bet on it.

Proposed contribution task for Outreachy applicants: Register UCSC genome felCat9

felCat9 is the latest UCSC genome for Cat (Felis catus). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

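getChromInfoFromUCSC() also works on UCSC genomes that are not registered in GenomeInfoDb, but for those:

the assembled.molecules argument is ignored,
the assembled and circular columns of the returned data.frame are filled with NAs,
and the chromosomes/sequences are not returned in any particular order.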

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.

For felCat9, since this is the first felCat genome that we're going to register in GenomeInfoDb, we need to start the felCat9.R file from scratch. However, looking at other registration files to get a feeling of how things are done is always a good idea. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (Felis_catus_9.0) is also registered (registering Felis_catus_9.0 is the topic of issue #50).
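
As a rough starting point, a registration file might look like the sketch below. This is an assumption-laden outline, not the final file: the variable names GENOME, ORGANISM and CIRC_SEQS are taken from my reading of existing registration files (only ASSEMBLED_MOLECULES and NCBI_LINKER are named in this tracker), and the chromosome list is inferred from the cat karyotype, so check it against the output of getChromInfoFromUCSC("felCat9") before relying on it.

```r
# Sketch of a felCat9.R registration file. Variable names follow my reading
# of existing registration files; verify them against a real one.
GENOME <- "felCat9"
ORGANISM <- "Felis catus"

# Assumed chromosome set: 18 autosomes named by karyotype group (A1..F2),
# plus chrX and chrM. Confirm with getChromInfoFromUCSC("felCat9").
.autosomes <- c(paste0("A", 1:3), paste0("B", 1:4), paste0("C", 1:2),
                paste0("D", 1:4), paste0("E", 1:3), paste0("F", 1:2))
ASSEMBLED_MOLECULES <- paste0("chr", c(.autosomes, "X", "M"))

# Assumed to be the only circular sequence.
CIRC_SEQS <- "chrM"
```

The NCBI_LINKER component is left out on purpose, as described above.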

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists of installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

`getChromInfoFromEnsembl()` fails with Ensembl 103 for GenomeInfoDb 1.26.2 (Bioc 3.12)

Hi, I'm seeing an error pop up with getChromInfoFromEnsembl() specifically with the new Ensembl 103 release. Here's a minimal reprex:

## Ensembl 102 works as expected.
seqinfo_hs_102 <-
    GenomeInfoDb::getChromInfoFromEnsembl(
        species = "Homo sapiens",
        release = 102L,
        as.Seqinfo = TRUE
    )

## Seqinfo object with 944 sequences (1 circular) from GRCh38.p13 genome:
##   seqnames     seqlengths isCircular     genome
##   KI270510.1         2415       <NA> GRCh38.p13
##   KI270539.1          993       <NA> GRCh38.p13
##   KI270395.1         1143       <NA> GRCh38.p13
##   KI270752.1        27745       <NA> GRCh38.p13
##   KI270388.1         1216       <NA> GRCh38.p13
##   ...                 ...        ...        ...
##   HG1466_PATCH      17435       <NA> GRCh38.p13
##   HG1506_PATCH      28824       <NA> GRCh38.p13
##   HG1507_PATCH      68192       <NA> GRCh38.p13
##   HG439_PATCH      403128       <NA> GRCh38.p13
##   HG1509_PATCH      14678       <NA> GRCh38.p13

## Ensembl 103 is failing due to column parsing.
seqinfo_hs_103 <-
    GenomeInfoDb::getChromInfoFromEnsembl(
        species = "Homo sapiens",
        release = 103L,
        as.Seqinfo = TRUE
    )

Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",  : 
  more columns than column names
Calls: <Anonymous> ... .simple_read_table -> do.call -> do.call -> <Anonymous>
Backtrace:
    █
 1. └─GenomeInfoDb::getChromInfoFromEnsembl(...)
 2.   └─GenomeInfoDb:::get_Ensembl_FTP_core_db_url(...)
 3.     └─GenomeInfoDb:::.find_core_db_in_Ensembl_FTP_species_index(...)
 4.       └─GenomeInfoDb:::.fetch_species_index_from_url(url)
 5.         └─GenomeInfoDb:::fetch_table_from_url(url, header = TRUE)
 6.           └─GenomeInfoDb:::.simple_read_table(destfile, ...)
 7.             ├─BiocGenerics::do.call(read.table, args)
 8.             ├─base::do.call(read.table, args)
 9.             └─(function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", ...
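
For anyone unfamiliar with this read.table() error, it can be reproduced in isolation with base R. The sketch below uses made-up toy input, not the actual Ensembl species index, and only illustrates the class of failure: a header line that names fewer columns than the data rows contain. Whether that is exactly what changed in the Ensembl 103 file is an assumption on my part.

```r
# Toy input (not the real Ensembl file): the header names 2 columns,
# but the data row carries 4 tab-separated fields.
txt <- "name\tspecies\nhomo_sapiens\tHomo sapiens\t9606\textra"

# Capture the error message instead of stopping the script.
res <- tryCatch(
    read.table(text = txt, header = TRUE, sep = "\t", quote = ""),
    error = function(e) conditionMessage(e)
)
res
```

Had the header named exactly one column fewer than the data, read.table() would instead have used the first column as row names, which is why this kind of upstream format change can either fail loudly or pass silently.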

Is this supported in Bioc Devel? I'm looking at the v1.27.7 source code, but wanted to post in case others hit this error with Bioconductor 3.12.

Best,
Mike

Problem on scaffold name (mm10)

There is a seemingly new scaffold in mm10 (Mus musculus) with a "bad" name: chrna_GL456050_alt.
For now, you can reproduce this with:

GenomeInfoDb:::fetch_chrom_sizes_from_UCSC("mm10", "http://hgdownload.cse.ucsc.edu/goldenPath")[179, ]

The following command line gives the same result:

wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/chromInfo.txt.gz

The problem shows up, for instance, in the function .order_seqlevels, namely on line 24.
The reason is that scaffold names are split into 3 chunks, separated by _.
The first chunk should match a chromosome (more precisely, one of the ASSEMBLED_MOLECULES).
Obviously, it fails here: chrna is not a chromosome name.
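
To make the failure concrete, here is a simplified base-R sketch of that check. It is not the actual .order_seqlevels code, only an illustration of the chunking logic described above; the assembled vector and the first scaffold name are my own examples for comparison.

```r
# Simplified illustration (not the real .order_seqlevels implementation):
# split scaffold names on "_" and test whether the first chunk is one of
# the assembled molecules.
assembled <- paste0("chr", c(1:19, "X", "Y", "M"))  # mm10 chromosomes
scaffolds <- c("chr1_GL456210_random", "chrna_GL456050_alt")

chunks <- strsplit(scaffolds, "_", fixed = TRUE)
first_chunk <- vapply(chunks, `[`, character(1), 1L)

# TRUE where the scaffold can be attached to a known chromosome
ok <- first_chunk %in% assembled
setNames(ok, scaffolds)
```

The first name maps to chr1, while chrna matches nothing, so any code that assumes every first chunk is a known chromosome breaks on it.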

What should I do here?

Wait for UCSC to update the scaffold name (but what if it takes months to place the scaffold?)
Try to avoid .order_seqlevels (but it is a low-level function which, in my case, is called by the seqlevelsStyle<- setter).
Anything else?

Thanks a lot!

trying URL 'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/GenomeInfoDbData_1.2.3.tar.gz'
Content type 'application/x-gzip' length 10413139 bytes (9.9 MB)

Do you want to install from sources the packages which need compilation? (Yes/no/cancel) no
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/RcppArmadillo_0.9.880.1.0.tgz'
Content type 'application/x-gzip' length 1863064 bytes (1.8 MB)

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/survival_3.1-12.tgz'
Content type 'application/x-gzip' length 7849515 bytes (7.5 MB)