ncbi / dbvar

dbVar

Home Page: https://www.ncbi.nlm.nih.gov/dbvar

dbvar structural variation

dbvar's Introduction

dbVar (https://www.ncbi.nlm.nih.gov/dbvar)

dbVar is NCBI's database of human genomic structural variation: insertions, deletions, duplications, inversions, mobile elements, and translocations.

============================

directory layout

.
+-- Structural_Variant_Sets              # dbVar Reference SV Project & Data
+-- specs                                # dbVar Design and Schema Specifications     
+-- tutorials                            # Initial dir and README setup
+-- README.md

dbvar's People

johng-001 thefferon lonphan lopezjohn dasmoocher hotliu global-localhost global19 global19-atlassian-net egekcmz yaningyang qpc-github

dbvar's Issues

Is there a non-redundant inversion set?

It seems there are only insertion, deletion, and duplication sets.

Create BED files for upload and viewing in UCSC browser

It would be really helpful if you would provide the non-redundant variants in BED format, so users could upload them to the UCSC genome browser, or similar browsers.

VCF could also be really useful, especially in the context of annotations... Then these data could easily be incorporated into bioinformatics workflows...
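Until BED files are provided, a conversion along these lines may be enough for browser upload. This is only a sketch: it assumes the tab-separated files follow the column order listed in the README (chr, outermost_start, outermost_stop, ...), that header lines start with `#`, and that coordinates are 1-based inclusive. None of these assumptions is confirmed here, and the example record is made up.

```python
import csv
import io

# Column order as listed in the README; BED only needs the first three
# plus something to use as a name.
COLUMNS = ["chr", "outermost_start", "outermost_stop", "SV_count",
           "variant_type", "method", "analysis", "platform", "study", "SV"]


def tsv_to_bed(tsv_lines, one_based=True):
    """Yield BED lines (chrom, start, end, name) from tab-separated records.

    one_based=True is an ASSUMPTION: the input is treated as 1-based
    inclusive, so the start is shifted down by one to match BED's
    0-based, half-open convention.
    """
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and (assumed) '#' header lines
        record = dict(zip(COLUMNS, row))
        start = int(record["outermost_start"])
        stop = int(record["outermost_stop"])
        if one_based:
            start -= 1
        chrom = record["chr"]
        # UCSC-style names: "chr1", "chrX", and "chrM" for mitochondrial.
        chrom = "chrM" if chrom.lower() == "mt" else "chr" + chrom
        yield f"{chrom}\t{start}\t{stop}\t{record['variant_type']}"


# Made-up record for illustration only; not real dbVar data.
example = "1\t10001\t20000\t2\tdeletion\tmethod_x\tanalysis_x\tplatform_x\tstudy_x\tnssv1;nssv2\n"
for bed_line in tsv_to_bed(io.StringIO(example)):
    print(bed_line)
```

If the source coordinates turn out to be 0-based already, passing `one_based=False` leaves the start untouched.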

Please review pull requests

Hi @lonphan,

Please review my pull requests adding and updating the dbVar submission template v3.4.

Please enable write permissions to this repository for me, John L, and John G.

Thanks,
Tim

SVs in some chromosomes are missing in nr_deletion files

Hello,

I downloaded the latest version of all files in the directory /pub/dbVar/sandbox/sv_datasets/nonredundant/deletions.

However, it seems the SVs on chromosomes 1, 2, 9, 10, 11, 12, and X are missing in some files (e.g. GRCh37.nr_deletions.tsv, GRCh37.nr_deletions.bed, GRCh38.nr_deletions.tsv, and GRCh38.nr_deletions.bed).

For example, checking the file gives:

sed '1,2d' GRCh38.nr_deletions.tsv | cut -f1 | sort -k1,1V | uniq -c
183462 3
215779 4
179941 5
186375 6
177445 7
156126 8
100796 13
102341 14
88040 15
103951 16
97261 17
84456 18
89493 19
73053 20
46757 21
51888 22
7323 Y
56 mt

But GRCh38.nr_deletions.pathogenic.tsv (which I believe is a subset of GRCh38.nr_deletions.tsv) contains SVs from all chromosomes:

sed '1,2d' GRCh38.nr_deletions.pathogenic.tsv | cut -f1 | sort -k1,1V | uniq -c
1126 1
1657 2
788 3
499 4
677 5
729 6
863 7
561 8
725 9
452 10
691 11
370 12
409 13
321 14
765 15
1415 16
1175 17
342 18
489 19
272 20
189 21
661 22
1847 X
76 Y
16 mt

Would be great if you could help update the files. Thank you!

Best,
Anson
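The same check can be scripted so that missing chromosomes are reported directly rather than spotted by eye. The sketch below mirrors the pipeline in this issue: it skips two header lines, as `sed '1,2d'` does, and counts the first tab-separated field. The expected chromosome labels are taken from the pathogenic listing above; the sample lines are made up.

```python
from collections import Counter

# Chromosome labels as they appear in the listings above (no "chr" prefix).
EXPECTED = [str(n) for n in range(1, 23)] + ["X", "Y", "mt"]


def chromosome_counts(lines, header_lines=2):
    """Count records per chromosome, skipping the first `header_lines` lines."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i < header_lines or not line.strip():
            continue
        chrom = line.rstrip("\n").split("\t", 1)[0]
        counts[chrom] += 1
    return counts


def missing_chromosomes(counts, expected=EXPECTED):
    """Return the expected chromosomes that have no records at all."""
    return [c for c in expected if counts.get(c, 0) == 0]


# Made-up sample standing in for a downloaded file.
sample = ["#header 1\n", "#header 2\n",
          "3\t100\t200\n", "3\t300\t400\n", "Y\t5\t9\n"]
counts = chromosome_counts(sample)
print(missing_chromosomes(counts))
```

For a real file, `chromosome_counts(open(path))` would take the place of the in-memory sample.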

How can I find which variants are equivalent on GRCh37 and GRCh38?

The variants in these files don't have their own IDs. If I am looking at a variant on GRCh37 and I want to know where it is in the GRCh38 file, how do I do that?

I guess I could use one or more of the accessions in the last column, "SV", but it seems like there should be an easier, clearer way.

Have you thought about giving the non-redundant variants their own unique IDs, so they could be compared more easily between GRCh37 and GRCh38?
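The accession-based workaround mentioned above could look roughly like this. It is a sketch under assumptions, not a documented dbVar procedure: it assumes the "SV" column holds several accessions joined by a separator (`;` is a guess), and that the same accessions appear in both assembly files. The function names and all records are hypothetical.

```python
from collections import defaultdict


def index_by_accession(records, sep=";"):
    """Map each accession in the SV field to the records that list it.

    `records` is an iterable of (chr, start, stop, sv_field) tuples.
    The separator is an ASSUMPTION about the file format.
    """
    index = defaultdict(list)
    for rec in records:
        for acc in rec[3].split(sep):
            acc = acc.strip()
            if acc:
                index[acc].append(rec)
    return index


def find_equivalents(record, other_index, sep=";"):
    """Return records from the other assembly that share at least one accession."""
    seen, matches = set(), []
    for acc in record[3].split(sep):
        for candidate in other_index.get(acc.strip(), []):
            if candidate not in seen:
                seen.add(candidate)
                matches.append(candidate)
    return matches


# Made-up records for illustration only.
grch37 = [("1", 10001, 20000, "nssv100;nssv200"), ("2", 500, 900, "nssv300")]
grch38 = [("1", 10501, 20500, "nssv200;nssv100"), ("2", 700, 1100, "nssv999")]

index38 = index_by_accession(grch38)
print(find_equivalents(grch37[0], index38))
print(find_equivalents(grch37[1], index38))
```

A shared accession only shows that the two records draw on the same submitted call; it does not by itself guarantee that the non-redundant regions are equivalent, which is why a stable ID per non-redundant variant would be clearer.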

What's the difference between 'common' and 'somatic' SVs?

Hi, I can see that pathogenic structural variations come from ClinVar, but what are common and somatic types of SVs?

How should I understand 'variant calls are from germline samples only (no somatic)'?

And how can I get the actual breakpoints of each structural variation? Is it possible to know whether identical variant calls were derived from different individuals or from the same sample but as a result of different analyses?

Thanks!

Add headers to data files

I suggest including a header line at the top of each file - so we don't have to look them up in the README, and copy them into the file if we need them there.

I would also suggest changing some of the header names. In the README they are:

chr
outermost_start
outermost_stop
SV_count
variant_type
method
analysis
platform
study
SV

It's not completely clear what some of these are:

Why 'outermost_start' and 'outermost_stop'? Are there other coordinates we are not seeing in the data file? How do we know what they represent?
What is 'SV_count'? How does it relate to "nsv"s and "nssv"s?
How do we look up a publication mentioned in the "study" column? Do you have citations somewhere?
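As a stopgap, the README column names can be written into a local copy of a file. The sketch below assumes the ten names listed above apply, in that order, to every tab-separated data file and that existing header lines start with `#`; both are assumptions, and the sample record is made up.

```python
import csv
import io

# Column names copied from the README list above.
README_COLUMNS = ["chr", "outermost_start", "outermost_stop", "SV_count",
                  "variant_type", "method", "analysis", "platform",
                  "study", "SV"]


def add_header(src, dst, columns=README_COLUMNS):
    """Copy src to dst with a single tab-separated header line on top.

    Lines starting with '#' are dropped (ASSUMED to be comment/header lines).
    """
    dst.write("\t".join(columns) + "\n")
    for line in src:
        if line.startswith("#") or not line.strip():
            continue
        dst.write(line if line.endswith("\n") else line + "\n")


# Made-up input for illustration only.
src = io.StringIO(
    "#comment line\n"
    "1\t10001\t20000\t2\tdeletion\tmethod_x\tanalysis_x\tplatform_x\tstudy_x\tnssv1;nssv2\n"
)
dst = io.StringIO()
add_header(src, dst)

# With a header in place, the rows can be read by column name.
rows = list(csv.DictReader(io.StringIO(dst.getvalue()), delimiter="\t"))
print(rows[0]["variant_type"], rows[0]["SV_count"])
```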

Aggregation method (variant call -> variant region) removes variant type information

Dear dbVar team, I have a suggestion about the method used to aggregate variant calls into a variant region. Currently, calls are aggregated regardless of variant type (deletion or duplication). Aggregating multiple CNVs is a very convenient step, especially for clinical interpretation, but this approach discards the variant type of the region. A good alternative would be to keep the aggregation step but consider only CNVs of the same variant class. That way, we would get a list of variant regions together with their associated variant types.
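The suggestion can be illustrated with a simple overlap merge that groups calls by chromosome and variant type before merging. This is not dbVar's actual aggregation method, which is not described here; it is a minimal sketch that assumes inclusive coordinates and treats any overlap as grounds for merging. The calls are made up.

```python
from collections import defaultdict


def merge_by_type(calls):
    """Merge overlapping calls that share a chromosome and variant type.

    `calls` is an iterable of (chrom, start, stop, variant_type) with
    inclusive coordinates. Returns a list of
    (chrom, start, stop, variant_type, n_calls) regions.
    """
    groups = defaultdict(list)
    for chrom, start, stop, vtype in calls:
        groups[(chrom, vtype)].append((start, stop))

    regions = []
    for (chrom, vtype), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_stop = spans[0]
        n_calls = 1
        for start, stop in spans[1:]:
            if start <= cur_stop:
                # Overlaps the current region: extend it.
                cur_stop = max(cur_stop, stop)
                n_calls += 1
            else:
                # Gap: close the current region and start a new one.
                regions.append((chrom, cur_start, cur_stop, vtype, n_calls))
                cur_start, cur_stop, n_calls = start, stop, 1
        regions.append((chrom, cur_start, cur_stop, vtype, n_calls))
    return regions


# Made-up calls: the duplication overlaps the deletions but stays separate.
calls = [
    ("1", 100, 500, "deletion"),
    ("1", 400, 900, "deletion"),
    ("1", 450, 800, "duplication"),
    ("1", 2000, 2500, "deletion"),
]
for region in merge_by_type(calls):
    print(region)
```

Grouping on `(chrom, variant_type)` rather than on `chrom` alone is the only change needed to keep the type attached to each region.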

Requested annotations for non-redundant variant data

The following annotations would be really useful for the non-redundant variants:

overlap with genes
repeat regions and segmental duplication regions
overlap with known dosage sensitive genes
overlap with known clinical variants, e.g., from ClinVar

I'm sure there are a lot more.
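Each annotation in this list comes down to an interval-overlap test against another set of regions. A minimal sketch of the gene-overlap case follows, assuming inclusive coordinates on a shared assembly. The gene names and positions are placeholders, not real annotations, and the linear scan is only suitable for small inputs.

```python
def overlaps(a_start, a_stop, b_start, b_stop):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_stop and b_start <= a_stop


def annotate_with_genes(variants, genes):
    """Pair each variant with the names of the genes it overlaps.

    `variants` holds (chrom, start, stop) tuples;
    `genes` holds (chrom, start, stop, name) tuples.
    """
    annotated = []
    for chrom, start, stop in variants:
        hits = [name for g_chrom, g_start, g_stop, name in genes
                if g_chrom == chrom and overlaps(start, stop, g_start, g_stop)]
        annotated.append(((chrom, start, stop), hits))
    return annotated


# Placeholder data for illustration only.
genes = [("1", 1000, 5000, "GENE_A"),
         ("1", 8000, 9000, "GENE_B"),
         ("2", 1000, 5000, "GENE_C")]
variants = [("1", 4000, 8500), ("2", 6000, 7000)]

for variant, hits in annotate_with_genes(variants, genes):
    print(variant, hits)
```

At genome scale, a sorted sweep or an interval tree would replace the nested scan; the overlap test itself stays the same for repeats, segmental duplications, and clinical variants.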

Create wiki for this GitHub repo and add FAQ page to wiki

Create a wiki for this GitHub repo and add an FAQ page to it. Incorporate answers to closed issues in the FAQ wiki page.

What are 'outermost_start' and 'outermost_stop'?

The column headers in the README list columns 2 and 3 as "outermost_start" and "outermost_stop". What does that mean - are there other, less "outer", starts and/or stops you are not showing?

The only explanation in the README is (in Example record 1):

The non-redundant coordinates for this record in dbVar are chr1, with an outermost start of 10001 and outermost stop of 1535693.

Does this indicate there is breakpoint ambiguity? How can we tell in any given case, since all variants are described using these "outermost"-type terms?

Add population and frequency data?

If you have information on the population of origin for any of these variants, it could be really useful to include it in the data files.

Also, allele frequency data must be available for at least some of these variants. Do you plan to include allele frequencies in these files, to help with variant interpretation?