Git Product home page Git Product logo

clinvar's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

clinvar's Issues

List Order Mismatch between Clinical_Sig & All_Summiters

Example:
Variant=chr1:976059_C>T
ID=RCV000195231

The result in your clinvar_alleles.tsv:
clinical_significance="Likely benign;Uncertain significance"
all_submitters="Genetic Services Laboratory, University of Chicago;PreventionGenetics"

If you look at the order of the list....which would be useful....Likely benign was submitted by U of Chicago.
But, that is not the case:
https://www.ncbi.nlm.nih.gov/clinvar/RCV000195231/

That's just one example, there are many, many, many more.

I can see where this comes from. Regex and the XML structure. In script: parse_clinvar_xml.py:104

current_row['all_submitters'] = ';'.join([
            submitter_node.attrib['submitter'].replace(';', ',')
            for submitter_node in elem.findall('.//ClinVarSubmissionID')
            if submitter_node.attrib is not None and submitter_node.attrib.has_key('submitter')
        ])

The "submitters" is obtained from a separate node, without any attempt to match against the nested clin_sig description.

clinical_significance=elem.find('.//ReferenceClinVarAssertion/ClinicalSignificance')
    if clinical_significance.find('.//ReviewStatus') is not None:
        current_row['review_status']=clinical_significance.find('.//ReviewStatus').text;
    if clinical_significance.find('.//Description') is not None:
        current_row['clinical_significance']=clinical_significance.find('.//Description').text

If I had a solution worked out, I would make a pull request. But it appears to tricky, so far.

missing hgvs_c when indel is realigned and shifted.

I noticed a lot of indels/fs have missing hgvs_c NM transcript information. I do see the hgvs_p (with the protein),

Example:
This variant
https://www.ncbi.nlm.nih.gov/clinvar/variation/52203/

is listed as Chr13: 32341188 – 32341192 in clinvar

In the files from the clinvar/macarthur pipeline,
It gets left shifted here: 13 32341183

There is no “hgvs_c” field in the output from the pipeline
It does have have (which is the same as clinvar .. not sure if that was recomputed).
"hgvs_p":"NP_000050.2:p.Ile2278Serfs",

However, we can find the hgvs_c information under the molecular consequence field.

"molecular_consequence":["NM_000059.3:c.6833_6837del:frameshift variant"]

I’m not sure if the pipeline computed that or it’s from clinvar (same as in clinvar)..
But it could be propagated to the hgvs_c field.

Convert the pipeline to hail for .tsv processing steps?

Next time the pipeline needs updates, we should probably convert all steps that follow .xml parsing to a Spark-based hail pipeline.

Currently, the steps that generate clinvar x gnomAD tables take hours to run, so I skipped them for the latest release - hail would be able to perform these joins much more efficiently.

Need to Update pathogenic & conflict boolean regex.

Looks like Clinvar has new string added to clin_sig: "Conflicting interpretations of pathogenicity"

I just handle it by changing these lines in join_data.R

combined$pathogenic = as.integer(grepl('athogenic\\b',combined$clinical_significance))


# conflicted = 1 if at least one submission each of [likely] benign and [likely] pathogenic
combined$conflicted = as.integer((grepl('athogenic',combined$clinical_significance) & grepl('enign',combined$clinical_significance)) | grepl('Conflicting interpretations',combined$clinical_significance, ignore.case=TRUE))

That way you get better boolean values.

Improper labelling of conflicting variants as pathogenic

It seems reformatted data tagged as 'May release' was produced from Mar (or prior to that?) NCBI clinvar data. Since April, NCBI clinvar seems to have changed how they report aggregated clinical significance, and this completely affects how conflicted variants are identified if current code is used with recent NCBI clinvar data. Current NCBI's practice is to use the term Conflicting interpretations of pathogenicity for conflicts, and these variants are mistakenly tagged as TRUE for pathogenic column (instead of TRUE for conflicted) as they have the string 'pathogenic'.

@XiaoleiZ - Since I am using data from your fork on Aug data, I thought I should tag you here.

@bw2 @ericminikel @XiaoleiZ - Any suggestion on which repo or fork to use for actively maintained data, if any? I would like to contribute if you need more manpower on maintaining a centralized repo.

Job parse failure on system execution. ENV issues?

master.py LINE 149

The custom sorting via eGrep & Sort commands.
# sort job.add(("cat " + "<(gunzip -c IN:%(tmp_dir)s/clinvar_table_normalized.%(fsuffix)s.tsv.gz

When executing on the system fails on a syntax error, which is fairly uninformative.

[Sep 12 13:29:27]: --> Exec 1.4: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 ) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 ) | bgzip -c > output_tmp/tmp.2017-09-12_13.29.26.clinvar_allele_trait_pairs.multi.b37.tsv.gz

[Sep 12 13:29:27]: Output (last mod N/A): output_tmp/clinvar_allele_trait_pairs.multi.b37.tsv.gz [doesn't exist yet]

[Sep 12 13:29:28]: /bin/sh: -c: line 0: syntax error near unexpected token ('`

[Sep 12 13:29:28]: /bin/sh: -c: line 0: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 |

I'm trying to work through this right now, but if you have any suggestions....

Need to check whether REF base is correct

Ellen Tsai from Heidi Rehm's lab has emailed today to say that we are currently letting through tens of variants that have the wrong REF base in ClinVar. Per Adrian Tan's July 2 email, vt normalize checks whether the REF base is correct. If we make our python implementation of normalize do the same, and throw out variants with the wrong REF, then this should fix the variants Ellen points out. Here is the list of offending variants from her email:

If you're interested, the 36 variants I've flagged are provided below.

5 variants listed twice in clinvar.tsv where ref/alt are switched between the rows (e.g. chr pos ref alt):

11
64572018 C
T
11
64572018 T
C
7
117199533 A
G
7
117199533 G
A
9
139393681 C
T
9
139393681 G
A
9
139412360 A
G
9
139412360 T
C
9
139414018 C
A
9
139414018 G
T

5 variants where ref(clinvar.tsv) doesn't match hg19? (e.g. myhg19.fasta vs clinvar.tsv)

9:134385436 .. 'A' vs 'G'
MT:14498 .. 'T' vs 'C'

A in UCSC browser, but apparently 'T' in my fasta file

MT:14729 .. 'T' vs 'G'
17:181683 .. 'A' vs 'C'
17:256448 .. 'T' vs 'C'

16 variants where ref(myhg19.fasta)=N but ref(clinvar.tsv)=[ACGT]

17:355278 .. 'N' vs 'C'
17:355570 .. 'N' vs 'C'
17:355653 .. 'N' vs 'G'
17:355749 .. 'N' vs 'G'
17:355765 .. 'N' vs 'G'
17:356212 .. 'N' vs 'C'
17:356248 .. 'N' vs 'A'
17:356346 .. 'N' vs 'C'
17:356370 .. 'N' vs 'C'
Y:545354 .. 'N' vs 'G'
Y:545422 .. 'N' vs 'A'
Y:545469 .. 'N' vs 'C'
Y:545475 .. 'N' vs 'C'
Y:551577 .. 'N' vs 'G'
Y:551578 .. 'N' vs 'C'
Y:1359342 .. 'N' vs 'G'

Missing variants

Hello,

I have come across a handful of variants that were present in previous versions of the curated Clinvar TSV that are not present in the most recent version, but are present in the Clinvar database. I am using both the clinvar_alleles.single.b37.tsv and clinvar_alleles.single.b37.vcf committed on 2017-03-13:

10 55582650 . A ATGTT
2 179325735 rs17304212 C G
2 179325735 rs17304212 C T
2 179325816 rs79399438 G A
3 150690352 rs111033258 A C
5 149740732 rs56180593 C T

Is there a reason these are no longer present? Were they inadvertently dropped? Or am I missing them in either the TSV or VCF?

Update?

Hi,

I really like this resource, and the job you have put together to make clinvar data available in a clean way. I stumbled into some problems trying to make the workflow work on my own computer, but then you tend to do regular updates of the output.tsv file, which is excellent. Are you planning on any update of this soon?

On beforehand, thanks.

Sigve

Origin column

hi,

Thanks for a very useful resource, it simplifies a lot. Would you consider adding the variant origin in the output (for annoation of 'germline,'somatic','de novo' etc.)? It is present in the variant_summary.txt.gz. I guess adding one more variable to join_data.R would work, e.g.:

colnames(txt_extract) = c('measureset_id','symbol','clinical_significance','review_status','hgvs_c','hgvs_p','origin')

best,
Sigve

Something wrong?

Hi,

Your jobs are awesome! I think it would be very useful for us. But as I look at the example clinvar.tsv.gz file and the clinvar_example_750_rows.tsv file, I find a mistake.

chrom pos ref alt mut measureset_id symbol clinical_significance pathogenic benign conflicted review_status gold_stars hgvs_c hgvs_p all_submitters all_traits all_pmids inheritance_modes age_of_onset prevalence disease_mechanism origin xrefs
1 949523 C T ALT 183381 ISG15 Pathogenic 1 0 0 no assertion criteria provided 0 NM_005101.3:c.163C>T NP_005092.1:p.Gln55Ter OMIM Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION 25307056 Childhood <1 / 1 000 000 germline MedGen:CN221808;OMIM:616126;Orphanet:319563
1 949696 C CG ALT 161455 ISG15 Pathogenic 1 0 0 no assertion criteria provided 0 NM_005101.3:c.339dupG NP_005092.1:p.Leu114Alafs OMIM Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION 1,22859821,25307056 Childhood <1 / 1 000 000 germline MedGen:CN221808;OMIM:616126;Orphanet:319563

At this sheet, the second row of the result, there is a "1" in the all_pmids column, are there a mistake of the program?

mut column

Hi,

Thanks for this - very useful. I intend to convert this to a VCF file to use downstream. I can't find any info on what the mut columns is. It's value is commonly ALT, but sometimes REF. What does is mean?

thanks!

Improper handling of pseudoautosomal region

File clinvar_alleles.single.b38.tsv.gz has 43 duplicate rows. Here are the allele IDs for those rows:
[231307, 150500, 24917, 260558, 150301, 38949, 45434, 45436, 167501, 24912, 190850, 24911, 24915, 257893, 231306, 260559, 25394, 178250, 24914, 178251, 178253, 260556, 189165, 189166, 260560, 45437, 178249, 137687, 231308, 137685, 150305, 99001, 137686, 137689, 150303, 361265, 99002, 150304, 98999, 38950, 23109, 195246, 137688]

On closer look, a subset of them are in pseudoautosomal region (ie. one row should have had chrX and the other chrY). Allele IDs 231307 and 24915 are part of that subset.

ps - At this point, I'm just nitpicking when reporting issues :)

Clinical Significance Order Lost in Allele Grouping

The clinical sig is lost per submitter, in the group_by_allele.py.

For example: https://www.ncbi.nlm.nih.gov/clinvar/variation/92428/
Has 4 submitters, 1 calls Likely Benign 3 calls Benign.

clinvar_allele_trait_pairs.single.tsv.gz | grep 39073 | grep 53676401 | less
Has 3 lines, with the correct Clin Sig Order.

But clinvar_alleles_grouped.single.tsv.gz | grep 39073 | grep 53676401 | less
Reduces it to "benign;likely benign" indicating 2 entries, for 4 submitters. The order is lost.

BUT.... submitters_ordered is still correct:
EGL Genetic Diagnostics,Eurofins Clinical Diagnostics;GeneReviews;Illumina Clinical Services Laboratory,Illumina;Center for Pediatric Genomic Medicine,Children's Mercy Hospital and Clinics

Issue with R script

Hi,

I am getting the following error, Could you please walk me through the way to resolve it. I am using python 2.7 on Ubuntu 14.10 machine. All required libraries are up to date.

180593 records processed
180594 records processed
[Nov 18 10:53:45]: Final counts of variants discarded:
[Nov 18 10:53:45]: REF == ALT: 226
[Nov 18 10:53:45]: Wrong REF: 4082
[Nov 18 10:53:45]: Invalid nucleotide: 9
[Nov 18 10:53:45]: Finished 1.3. Running time: 0:00:21.418153 sec.
[Nov 18 10:53:45]: Renamed tmp.2016-11-18_10.46.10.clinvar_table_normalized.tsv to clinvar_table_normalized.tsv
[Nov 18 10:53:45]: --> Exec 1.4: Rscript join_data.R variant_summary.txt.gz
[Nov 18 10:53:45]: Output (last mod N/A): clinvar_combined.tsv [doesn't exist yet]
[Nov 18 10:53:50]: [1] 180594 14
[Nov 18 10:53:59]: Warning message:
[Nov 18 10:53:59]: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
[Nov 18 10:53:59]: embedded nul(s) found in input
[Nov 18 10:53:59]: [1] 354049 30
[Nov 18 10:53:59]: Error in [.data.frame(x, r, vars, drop = drop) :
[Nov 18 10:53:59]: undefined columns selected
[Nov 18 10:53:59]: Calls: subset -> subset.data.frame -> [ -> [.data.frame
[Nov 18 10:54:16]: $subset(txt_download, assembly == "GRCh37", select = desired_columns)
[Nov 18 10:54:16]: <environment: 0x2c70db0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $subset.data.frame(txt_download, assembly == "GRCh37", select = desired_colu
[Nov 18 10:54:16]: <environment: 0x2dd9a78>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $x[r, vars, drop = drop]
[Nov 18 10:54:16]: <environment: 0x2d9c030>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $``[.data.frame(x, r, vars, drop = drop)
[Nov 18 10:54:16]: <environment: 0x2d9c068>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $`stop("undefined columns selected")`
[Nov 18 10:54:16]: <environment: 0x2d9a5c0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: attr(,"error.message")
[Nov 18 10:54:16]: [1] "Error in `[.data.frame`(x, r, vars, drop = drop) : \n undefined columns selected\nCalls: subset -> subset.data.frame -> [ -> [.data.frame\n"
[Nov 18 10:54:16]: attr(,"class")
[Nov 18 10:54:16]: [1] "dump.frames"
[Nov 18 10:54:16]: Finished 1.4. Running time: 0:00:31.037980 sec.
[Nov 18 10:54:16]: WARNING: unable to rename tmp.2016-11-18_10.46.10.clinvar_combined.tsv to clinvar_combined.tsv. tmp.2016-11-18_10.46.10.clinvar_combined.tsv is not readable
[Nov 18 10:54:16]: --> Skipping 1.5: (cat clinvar_combined.tsv | head -1 > tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv )
[Nov 18 10:54:16]: Input (last mod N/A): clinvar_combined.tsv [Error - input file not found]
[Nov 18 10:54:16]: Traceback (most recent call last):
[Nov 18 10:54:16]: File "/home/ecgi6/.local/lib/python2.7/site-packages/pypez.py", line 769, in does_command_need_to_run
[Nov 18 10:54:16]: try: raise Exception("File not found: " + str(input_filename))
[Nov 18 10:54:16]: Exception: File not found: clinvar_combined.tsv
[Nov 18 10:54:16]:

Another translation for gold stars

gold_stars_table = list(
   '-' = 0,
   ...

combined[which(combined$review_status == "-"),]
allele_id chrom pos ref alt measureset_type measureset_id
255248 414032 17 48271723 A T Variant
255249 414033 7 94049849 A C Variant

Looks like sometimes review status can be "-"

RScript crash

I tried to run the parser with the newest ClinVar may release

I was stuck at the 1.32. RScript crashed after a few second of run.

Any idea how it happened?

Thanks for developing the tool.

[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combine[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combined.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.d.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.

Need to update output

ClinVar released a new XML dump on July 9:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/

And our output is currently from July 8.

Mismatch between clinical_significance_ordered and submitters_ordered

Hi,

Thanks for releasing this great resources. I noticed some discrepancies between the semicolon-separated lists in clinical_significance_ordered and submitters_ordered:

In [1]: df = pd.read_csv('clinvar_alleles_example_750_rows.single.b37.tsv', sep='\t')

In [2]: df.shape
Out[2]: (749, 39)

In [3]: for col in 'rcv scv clinical_significance_ordered submitters_ordered'.split():
    ...:     df['len_' + col] = df[col].apply(lambda x: len(x.split(';')))

In [4]: diffs = df[df.len_clinical_significance_ordered != df.len_submitters_ordered].shape

In [5]: diffs.shape
Out[5]: (120, 43)

Ordered clinical significance doesn't seem to match the RCV or SCV lists either. Is this intended?

Thanks

Semi-colon in pathogenic, benign and conflicted columns. Bug?

Certain rows have values 0 and 1 separated by a semi-colon in columns pathogenic, benign and conflicted of file clinvar_alleles.tsv.gz. For example, pathogenic column in line#6704 is 1;0. If I am not wrong, these columns must have either 0 or 1 by definition. Is this correct, or am I missing something?

Edit: I see these rows always seem to have more than one measureset_id. Is this the reason why above-mentioned columns may have multiple values?

Use a more general python interpreter location

Instead of having each python script start with:

#!/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/bin/python

You should use:

#!/usr/bin/env python

The use of env ensures that the "current" python interpreter, wherever it is installed, is used.

Multiple isoforms not in index

Some of the reference transcripts used by clinvar are not always the ones chosen by Clinical labs. The current version of this pipeline generates a single "hgvs_c" and "hgvs_p" for the reference transcript only.
e.g. in https://www.ncbi.nlm.nih.gov/clinvar/variation/12408/ look under "hgvs" and you will find two NM and two AA changes and an "Other names" section all indicating a two transcripts annotations.

NM_001001430.2(TNNT2):c.236T>A (p.Ile79Asn)
Other names:
p.I79N:ATC>AAC
Protein change:I79N, I89N
HGVS:
NG_007556.1:g.17040T>A
NM_000364.3:c.266T>A
NM_001001430.2:c.236T>A

Please let me know if you plan on working on this, I might. [email protected]

using hg38

This is a very useful script collection. Is it possible to use another reference, e.g. hg38.fa ? I tried editing the parsing script replacing 37 with 38, but then the normalisation breaks, even though I feed it the corresponding reference, e.g. hg38. What am I overlooking here?

The error thrown is:
'Incorrect REF value: 1 1014143 C T (actual REF should be )' ...
(...)

Thank you very much.

Memory Footprint too High

Hello,

I came across your clinvar pipeline via a google search and decided to experiment with it. It turns out that for the VM I am using, there is a failure because too much memory is required in the parse_clinvar_xml.py stage.

I see that you are parsing the entire ClinVar XML tree at once via an ET.parse(handle) command.

Have you considered using ET.iterparse() to parse the XML on the fly? I think you could detect the end of ClinVarSet elements, extract info from the element, and then use an elem.clear() to free the memory. An example is discussed at the following link:

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree

Thanks for posting your code.

-- Gregory S. Turenchalk

Sometimes "symbol" disagrees with primary hgvs gene annotation

Hi there,
Really glad you guys shared your code and parsed data on github!

Mostly this issue is just to put it on your radar for something to improve for the next version, but notice that for the following entry we get conflicting information for the gene symbol:

On ClinVar it's EGFR:
https://www.ncbi.nlm.nih.gov/clinvar/variation/45271/
For the hgvs_c we get EGFR:
https://www.ncbi.nlm.nih.gov/nuccore/NM_005228

But in clinvar_alleles.single.b37.tsv we get EGFR-AS1:

> paste <(gzcat clinvar_alleles.single.b37.tsv.gz | head -n1 | tr '\t' '\n' | cat -n) <(gzcat clinvar_alleles.single.b37.tsv.gz | grep RCV000038427 | tr '\t' '\n') | column -t -s$'\t'
     1  chrom                  7
     2  pos                    55249063
     3  ref                    G
     4  alt                    A
     5  measureset_type        Variant
     6  measureset_id          45271
     7  rcv                    RCV000038427;RCV000321080
     8  allele_id              54438
     9  symbol                 EGFR-AS1
    10  hgvs_c                 NM_005228.4:c.2361G>A
    11  hgvs_p                 NP_005219.2:p.Gln787=
    12  molecular_consequence  NM_005228.4:c.2361G>A:synonymous variant
    13  clinical_significance  Benign;Likely benign
    14  pathogenic             0
    15  benign                 1
    16  conflicted             0
    17  review_status          criteria provided, multiple submitters, no conflicts
    18  gold_stars             2
    19  all_submitters         Laboratory for Molecular Medicine,Partners HealthCare Personalized Medicine;PreventionGenetics,PreventionGenetics;Illumina Clinical Services Laboratory,Illumina
    20  all_traits             not specified;Not Specified;NOT SPECIFIED;Lung cancer;Lung Cancer
    21  all_pmids              25741868;17409930,23562183,23667368,24627688,24846033,25311215
    22  inheritance_modes
    23  age_of_onset
    24  prevalence
    25  disease_mechanism
    26  origin                 germline;somatic
    27  xrefs                  MedGen:CN169374;Genetic Alliance:Lung+Cancer/4334;Genetics Home Reference:lung-cancer;MedGen:C0684249;OMIM:211980;SNOMED CT:187875007

This can be traced to the following code from parse_clinical_xml.py:

            #find the gene symbol
            current_row['symbol']=''
            genesymbol = measure[i].findall('.//Symbol')
            if genesymbol is not None:
                for symbol in genesymbol:
                    if(symbol.find('ElementValue').attrib.get('Type')=='Preferred'):
                        current_row['symbol']=symbol.find('ElementValue').text;
                        break

Notice how we break after the first success, but for this example we have multiple "Preferred" symbols in the xml in a confusing order.

<ClinVarSet ID="17452916">
...
    <ClinVarAccession Acc="RCV000038427" Version="3" Type="RCV" DateUpdated="2017-01-25"/>
...
    <MeasureSet Type="Variant" ID="45271">
      <Measure Type="single nucleotide variant" ID="54438">
        <Name>
          <ElementValue Type="Preferred">NM_005228.4(EGFR):c.2361G&gt;A (p.Gln787=)</ElementValue>
        </Name>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR-AS1</ElementValue>
          </Symbol>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR</ElementValue>
          </Symbol>
...

Perhaps one might want to simply use the gene symbol given in parenthesis in the text for the first .//Name/ElementValue in the .//Measure

I'll admit I haven't done an extensive analysis on the best choice for extracting the gene symbol from the xml, and the solution I just tossed your way ^ might be incorrect, but right now the symbol column in your output is not always reliable.

Also, check out (the first one I noticed via grep) https://www.ncbi.nlm.nih.gov/clinvar/variation/41400 for an example where there's no "Preferred" gene symbol, but there is a symbol in parenthesis given that seems to check out. The XML does not have a "Preferred" symbol in this case.

ClinSig changes for some halpotypes in join_data.R - become wrong!

This is the result of join_data.R for multi

chrom pos ref alt measureset_type measureset_id rcv allele_id symbol hgvs_c clinical_significance clinical_significance_ordered pathogenic benign conflicted
2 166277030 T G Haplotype 30359 RCV000023304 39315 SCN9A NM_002977.3:c.2794A>C Benign/Likely benign Pathogenic 0 1 0

It's wrong: NM_002977.3(SCN9A):c.[2794A>C;2971G>T] – Haplotype

Looking at each step:

clinvar_table_raw.multi.tsv - Correct
clinvar_table_normalized.multi.tsv.gz - Correct
clinvar_allele_trait_pairs.multi.tsv.gz - Correct
clinvar_alleles_grouped.multi.tsv.gz - Correct
clinvar_alleles_combined.multi.tsv.gz - WRONG

There are 20 halpotype variants with this same issue:

measureset_type measureset_id rcv allele_id
Haplotype 1631 RCV000001698 16670
Haplotype 5706 RCV000006060 20745
Haplotype 5813 RCV000006169 20852
Haplotype 7239 RCV000007661 22244
Haplotype 13065 RCV000013940 28104
Haplotype 13399 RCV000014336;RCV000014337 28436
Haplotype 16318 RCV000017711;RCV000201276 31357
Haplotype 16876;16877 RCV000018372;RCV000018373 31916
Haplotype 4297 RCV000004533;RCV000004534;RCV000004535;RCV000004536 38384
Haplotype 4297 RCV000004533;RCV000004534;RCV000004535;RCV000004536 38385
Haplotype 4816 RCV000005085 38434
Haplotype 9398 RCV000010000 38447
Haplotype 9407 RCV000010010 38448
Haplotype 16318;217371 RCV000017711;RCV000201276;RCV000201278 38476
Haplotype 30359 RCV000023304 39315
Haplotype 38571 RCV000021985 46849
Haplotype 402236 RCV000454199 98655
Haplotype 218894 RCV000203245 137950
Haplotype 225143 RCV000210779 227037
Haplotype 188053 RCV000167863 255673

add .gitignore File to Repo?

Would you be willing to add a .gitignore file to this repo, so when we fork or clone the repository, we don't have to pull down all the data files?

*.tsv
*.tsv.gz
*.tbi

It my be better to host the data files, that result from this code somewhere else, for those users who just want the pre-canned results.

Augmented tables in output

Currently, only the b37/multi folder has augmented tables (with ExAC and GnomAD data). Shouldn't these augmented tables be in each output folder?

Missing data issue

Greetings,

I was checking over resulted tsv file and found out that at least some variants included in clinvar VCF file ( ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/ ) are not being shown there.

Can you please explain why ? For example, the following variant appears in the ClinVar VCF file but doesn't appear in the TSV file included in the codebase for this repo AND also doesn't appear in regenerated TSV file using the latest ClinVar dataset:

7 92130876 rs61750420 C T

Thanks in advance,
Peter

b38 availability?

Hi, I am wondering if the clinvar files are available for b38 version too
Any technichal issues holding it so?

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.