dbvar's Introduction

dbVar is NCBI's database of human genomic structural variation: insertions, deletions, duplications, inversions, mobile elements, and translocations.

============================

directory layout

.
+-- Structural_Variant_Sets              # dbVar Reference SV Project & Data
+-- specs                                # dbVar Design and Schema Specifications     
+-- tutorials                            # Initial dir and README setup
+-- README.md

dbvar's People

Contributors

hefferon, johng-001, lonphan, lopezjohn, thefferon

dbvar's Issues

Create BED files for upload and viewing in UCSC browser

It would be really helpful if you would provide the non-redundant variants in BED format, so users could upload them to the UCSC genome browser, or similar browsers.

VCF could also be really useful, especially in the context of annotations... Then these data could easily be incorporated into bioinformatics workflows...
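
Until such files are provided, the non-redundant rows can be turned into minimal BED lines locally. The following is only a rough sketch in Python. It assumes the column order listed in the README (chr, outermost_start, outermost_stop, SV_count, variant_type, ...), assumes the coordinates are 1-based and inclusive (not confirmed here), and assumes UCSC-style "chr" prefixes are wanted. The function names `tsv_row_to_bed` and `tsv_to_bed` are made up for illustration.

```python
def tsv_row_to_bed(line, one_based=True):
    """Convert one tab-separated row into a 4-column BED line.

    Assumed input columns: chr, outermost_start, outermost_stop,
    SV_count, variant_type, ... (order taken from the README).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
    # Use the variant type as the BED name when the column is present.
    name = fields[4] if len(fields) > 4 else "."
    # BED is 0-based and half-open, so a 1-based inclusive start shifts by one.
    bed_start = start - 1 if one_based else start
    # Assumption: the browser expects "chr"-prefixed names.
    # Mitochondrial naming ("mt" vs "chrM") is not handled here.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return f"{chrom}\t{bed_start}\t{stop}\t{name}"


def tsv_to_bed(lines):
    """Yield BED lines, skipping blank lines and '#' comment lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield tsv_row_to_bed(line)
```

If the coordinates turn out to be 0-based already, passing `one_based=False` leaves the start unchanged.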

Please review pull requests

Hi @lonphan,

Please review my pull requests adding and updating the dbVar submission template v3.4.

Please enable write permissions to this repository for me, John L, and John G.

Thanks,
Tim

SVs in some chromosomes are missing in nr_deletion files

Hello,

I downloaded the latest version of all files in the directory /pub/dbVar/sandbox/sv_datasets/nonredundant/deletions.

However, it seems the SVs on chromosomes 1, 2, 9, 10, 11, 12, and X are missing from some files (e.g. GRCh37.nr_deletions.tsv, GRCh37.nr_deletions.bed, GRCh38.nr_deletions.tsv, and GRCh38.nr_deletions.bed).

For example, when I checked the file using sed '1,2d' GRCh38.nr_deletions.tsv | cut -f1 | sort -k1,1V | uniq -c, it gives:
183462 3
215779 4
179941 5
186375 6
177445 7
156126 8
100796 13
102341 14
88040 15
103951 16
97261 17
84456 18
89493 19
73053 20
46757 21
51888 22
7323 Y
56 mt

But GRCh38.nr_deletions.pathogenic.tsv (which I believe is a subset of GRCh38.nr_deletions.tsv) contains SVs from all chromosomes:

sed '1,2d' GRCh38.nr_deletions.pathogenic.tsv | cut -f1 | sort -k1,1V | uniq -c
1126 1
1657 2
788 3
499 4
677 5
729 6
863 7
561 8
725 9
452 10
691 11
370 12
409 13
321 14
765 15
1415 16
1175 17
342 18
489 19
272 20
189 21
661 22
1847 X
76 Y
16 mt

Would be great if you could help update the files. Thank you!

Best,
Anson
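
The same check can be scripted so that it names the missing chromosomes directly. This is a minimal Python sketch, assuming the chromosome is the first tab-separated column and that the first two lines are headers (mirroring the `sed '1,2d'` step above). The names `count_by_chromosome`, `missing_chromosomes`, and `EXPECTED` are illustrative only.

```python
from collections import Counter

# Assumption: chromosomes are named 1..22, X, Y without a "chr" prefix.
EXPECTED = [str(n) for n in range(1, 23)] + ["X", "Y"]


def count_by_chromosome(lines, skip=2):
    """Count rows per chromosome, ignoring the first `skip` lines."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i < skip or not line.strip():
            continue
        # The chromosome is taken from the first tab-separated field.
        counts[line.split("\t", 1)[0].strip()] += 1
    return counts


def missing_chromosomes(counts, expected=EXPECTED):
    """Return the expected chromosomes that have no rows at all."""
    return [c for c in expected if counts.get(c, 0) == 0]
```

Typical use would be `missing_chromosomes(count_by_chromosome(open(path)))` for each downloaded file.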

How can I find which variants are equivalent on GRCh37 and GRCh38?

The variants in the non-redundant files don't have their own IDs. If I am looking at a variant in the GRCh37 file and I want to know where it is in the GRCh38 file, how do I do that?

I guess I could use one or more of the accessions in the last column, "SV", but it seems like there should be an easier, clearer way.

Have you thought about giving the non-redundant variants their own unique IDs, so they could be compared more easily between GRCh37 and GRCh38?
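
For reference, the accession-based workaround might look roughly like the Python sketch below. It assumes the last tab-separated column holds one or more accessions and that they are separated by commas, semicolons, pipes, or whitespace; the actual delimiter is a guess. All function names are hypothetical.

```python
import re
from collections import defaultdict


def accessions(row):
    """Split the last tab-separated field into a set of accession strings."""
    last = row.rstrip("\n").split("\t")[-1]
    # Assumption: accessions are separated by , ; | or whitespace.
    return {a for a in re.split(r"[,;|\s]+", last) if a}


def index_by_accession(rows):
    """Map each accession to the rows (from one assembly) that list it."""
    index = defaultdict(list)
    for row in rows:
        for acc in accessions(row):
            index[acc].append(row.rstrip("\n"))
    return index


def find_equivalents(row, other_index):
    """Return rows from the other assembly sharing at least one accession."""
    seen, matches = set(), []
    for acc in sorted(accessions(row)):
        for candidate in other_index.get(acc, []):
            if candidate not in seen:
                seen.add(candidate)
                matches.append(candidate)
    return matches
```

A shared accession only suggests that two rows describe the same submitted calls; it does not guarantee that the merged regions are equivalent, which is why dedicated IDs would still be clearer.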

What's the difference between 'common' and 'somatic' SVs?

Hi, I can see that the pathogenic structural variants come from ClinVar, but what are the 'common' and 'somatic' types of SVs?

How should I understand the statement 'variant calls are from germline samples only (no somatic)'?

And how can I get the actual breakpoints of each structural variation? Is it possible to know whether identical variant calls were derived from different individuals or from the same sample but as a result of different analyses?

Thanks!

Add headers to data files

I suggest including a header line at the top of each file - so we don't have to look them up in the README, and copy them into the file if we need them there.

I would also suggest changing some of the header names. In the README they are:

chr
outermost_start
outermost_stop
SV_count
variant_type
method
analysis
platform
study
SV

It's not completely clear what some of these are:

  1. Why 'outermost_start' and 'outermost_stop'? Are there other coordinates we are not seeing in the data file? How do we know what they represent?
  2. What is 'SV_count'? How does it relate to "nsv"s and "nssv"s?
  3. How do we look up a publication mentioned in the "study" column? Do you have citations somewhere?
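
In the meantime, a header can be prepended locally. A minimal Python sketch, assuming the ten column names above are in file order and that a leading "#" on the header line is acceptable to downstream tools; `with_header` and `README_COLUMNS` are names invented for this example.

```python
README_COLUMNS = [
    "chr", "outermost_start", "outermost_stop", "SV_count", "variant_type",
    "method", "analysis", "platform", "study", "SV",
]


def with_header(lines, columns=README_COLUMNS):
    """Return the lines with a '#'-prefixed header prepended if it is absent."""
    lines = list(lines)
    header = "#" + "\t".join(columns) + "\n"
    # Treat the file as already headed when its first field is the first column name.
    if lines and lines[0].lstrip("#").split("\t")[0].strip() == columns[0]:
        return lines
    return [header] + lines
```

Calling it twice is harmless, because the second call detects the existing header and returns the lines unchanged.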

Aggregation method (variant call -> variant region) removes variant type information

Dear dbVar team, I have a suggestion about the method used to aggregate variant calls into variant regions. Currently, calls are aggregated regardless of variant type (deletion or duplication). Aggregating multiple CNVs is a very convenient step, especially for clinical interpretation, but this approach removes the variant type information from the region. I think a good alternative would be to keep the aggregation step but merge only CNVs of the same variant class. That way, we would get a list of variant regions together with their associated variant types.
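
To illustrate the suggestion, here is a small Python sketch that merges overlapping calls only within the same chromosome and variant type. It uses plain interval overlap as the merge rule, which is an assumption and not necessarily how dbVar builds its regions; `merge_by_type` is a hypothetical name.

```python
from collections import defaultdict


def merge_by_type(calls):
    """Merge overlapping calls separately for each (chromosome, variant type).

    calls: iterable of (chrom, start, stop, variant_type) tuples.
    Returns a list of merged (chrom, start, stop, variant_type) regions.
    """
    groups = defaultdict(list)
    for chrom, start, stop, vtype in calls:
        groups[(chrom, vtype)].append((start, stop))

    regions = []
    for (chrom, vtype), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_stop = spans[0]
        for start, stop in spans[1:]:
            if start <= cur_stop:
                # Overlaps the current region: extend it.
                cur_stop = max(cur_stop, stop)
            else:
                # Gap found: close the current region and start a new one.
                regions.append((chrom, cur_start, cur_stop, vtype))
                cur_start, cur_stop = start, stop
        regions.append((chrom, cur_start, cur_stop, vtype))
    return regions
```

With this grouping, a duplication that overlaps two deletions stays in its own region instead of being folded into a single untyped one.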

Requested annotations for non-redundant variant data

The following annotations would be really useful for the non-redundant variants:

  • overlap with genes
  • repeat regions and segmental duplication regions
  • overlap with known dosage sensitive genes
  • overlap with known clinical variants, e.g., from ClinVar

I'm sure there are a lot more.
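
As an illustration of what the overlap annotations would involve, here is a naive Python sketch that lists the features (genes, repeats, and so on) overlapping one variant. The feature tuples and names are placeholders, coordinates are treated as closed intervals, and a real pipeline would use an indexed interval structure instead of a linear scan.

```python
def overlaps(a_start, a_stop, b_start, b_stop):
    """True if two closed intervals share at least one position."""
    return a_start <= b_stop and b_start <= a_stop


def annotate(variant, features):
    """Return the sorted names of features overlapping the variant.

    variant: (chrom, start, stop)
    features: iterable of (chrom, start, stop, name)
    """
    chrom, start, stop = variant
    return sorted({
        name
        for f_chrom, f_start, f_stop, name in features
        if f_chrom == chrom and overlaps(start, stop, f_start, f_stop)
    })
```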

What are 'outermost_start' and 'outermost_stop'?

The column headers in the README list columns 2 and 3 as "outermost_start" and "outermost_stop". What does that mean - are there other, less "outer", starts and/or stops you are not showing?

The only explanation in the README is (in Example record 1):

The non-redundant coordinates for this record in dbVar are chr1, with an outermost start of 10001 and outermost stop of 1535693.

Does this indicate there is breakpoint ambiguity? How can we tell in any given case, since all variants are described using these "outermost"-type terms?

Add population and frequency data?

If you have information on the population of origin for any of these variants, it could be really useful to include it in the data files.

Also, allele frequency data must be available for at least some of these variants. Do you plan to include allele frequencies in these files, to help with variant interpretation?
