macarthur-lab / clinvar Goto Github PK
View Code? Open in Web Editor NEWThis repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.
License: Other
This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.
License: Other
Example:
Variant=chr1:976059_C>T
ID=RCV000195231
The result in your clinvar_alleles.tsv:
clinical_significance="Likely benign;Uncertain significance"
all_submitters="Genetic Services Laboratory, University of Chicago;PreventionGenetics"
If you look at the order of the list....which would be useful....Likely benign was submitted by U of Chicago.
But, that is not the case:
https://www.ncbi.nlm.nih.gov/clinvar/RCV000195231/
That's just one example, there are many, many, many more.
I can see where this comes from. Regex and the XML structure. In script: parse_clinvar_xml.py:104
current_row['all_submitters'] = ';'.join([
submitter_node.attrib['submitter'].replace(';', ',')
for submitter_node in elem.findall('.//ClinVarSubmissionID')
if submitter_node.attrib is not None and submitter_node.attrib.has_key('submitter')
])
The "submitters" is obtained from a separate node, without any attempt to match against the nested clin_sig description.
clinical_significance=elem.find('.//ReferenceClinVarAssertion/ClinicalSignificance')
if clinical_significance.find('.//ReviewStatus') is not None:
current_row['review_status']=clinical_significance.find('.//ReviewStatus').text;
if clinical_significance.find('.//Description') is not None:
current_row['clinical_significance']=clinical_significance.find('.//Description').text
If I had a solution worked out, I would make a pull request. But it appears to tricky, so far.
I noticed a lot of indels/fs have missing hgvs_c NM transcript information. I do see the hgvs_p (with the protein),
Example:
This variant
https://www.ncbi.nlm.nih.gov/clinvar/variation/52203/
is listed as Chr13: 32341188 – 32341192 in clinvar
In the files from the clinvar/macarthur pipeline,
It gets left shifted here: 13 32341183
There is no “hgvs_c” field in the output from the pipeline
It does have have (which is the same as clinvar .. not sure if that was recomputed).
"hgvs_p":"NP_000050.2:p.Ile2278Serfs",
However, we can find the hgvs_c information under the molecular consequence field.
"molecular_consequence":["NM_000059.3:c.6833_6837del:frameshift variant"]
I’m not sure if the pipeline computed that or it’s from clinvar (same as in clinvar)..
But it could be propagated to the hgvs_c field.
In #27 , @raymond301 suggested putting the clinvar data in a separate github repo from the code.
It seems simpler to me to keep everything in one place. Does anybody else have a preference on this?
Next time the pipeline needs updates, we should probably convert all steps that follow .xml parsing to a Spark-based hail pipeline.
Currently, the steps that generate clinvar x gnomAD tables take hours to run, so I skipped them for the latest release - hail would be able to perform these joins much more efficiently.
Looks like Clinvar has new string added to clin_sig: "Conflicting interpretations of pathogenicity"
I just handle it by changing these lines in join_data.R
combined$pathogenic = as.integer(grepl('athogenic\\b',combined$clinical_significance))
# conflicted = 1 if at least one submission each of [likely] benign and [likely] pathogenic
combined$conflicted = as.integer((grepl('athogenic',combined$clinical_significance) & grepl('enign',combined$clinical_significance)) | grepl('Conflicting interpretations',combined$clinical_significance, ignore.case=TRUE))
That way you get better boolean values.
It seems reformatted data tagged as 'May release' was produced from Mar (or prior to that?) NCBI clinvar data. Since April, NCBI clinvar seems to have changed how they report aggregated clinical significance, and this completely affects how conflicted
variants are identified if current code is used with recent NCBI clinvar data. Current NCBI's practice is to use the term Conflicting interpretations of pathogenicity
for conflicts, and these variants are mistakenly tagged as TRUE for pathogenic
column (instead of TRUE for conflicted
) as they have the string 'pathogenic'.
@XiaoleiZ - Since I am using data from your fork on Aug data, I thought I should tag you here.
@bw2 @ericminikel @XiaoleiZ - Any suggestion on which repo or fork to use for actively maintained data, if any? I would like to contribute if you need more manpower on maintaining a centralized repo.
master.py LINE 149
The custom sorting via eGrep & Sort commands.
# sort job.add(("cat " + "<(gunzip -c IN:%(tmp_dir)s/clinvar_table_normalized.%(fsuffix)s.tsv.gz
When executing on the system fails on a syntax error, which is fairly uninformative.
[Sep 12 13:29:27]: --> Exec 1.4: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 ) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 ) | bgzip -c > output_tmp/tmp.2017-09-12_13.29.26.clinvar_allele_trait_pairs.multi.b37.tsv.gz
[Sep 12 13:29:27]: Output (last mod N/A): output_tmp/clinvar_allele_trait_pairs.multi.b37.tsv.gz [doesn't exist yet]
[Sep 12 13:29:28]: /bin/sh: -c: line 0: syntax error near unexpected token
('`
[Sep 12 13:29:28]: /bin/sh: -c: line 0: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 |
I'm trying to work through this right now, but if you have any suggestions....
Ellen Tsai from Heidi Rehm's lab has emailed today to say that we are currently letting through tens of variants that have the wrong REF base in ClinVar. Per Adrian Tan's July 2 email, vt normalize checks whether the REF base is correct. If we make our python implementation of normalize do the same, and throw out variants with the wrong REF, then this should fix the variants Ellen points out. Here is the list of offending variants from her email:
If you're interested, the 36 variants I've flagged are provided below.
5 variants listed twice in clinvar.tsv where ref/alt are switched between the rows (e.g. chr pos ref alt):
11
64572018 C
T
11
64572018 T
C
7
117199533 A
G
7
117199533 G
A
9
139393681 C
T
9
139393681 G
A
9
139412360 A
G
9
139412360 T
C
9
139414018 C
A
9
139414018 G
T5 variants where ref(clinvar.tsv) doesn't match hg19? (e.g. myhg19.fasta vs clinvar.tsv)
9:134385436 .. 'A' vs 'G'
MT:14498 .. 'T' vs 'C'A in UCSC browser, but apparently 'T' in my fasta file
MT:14729 .. 'T' vs 'G'
17:181683 .. 'A' vs 'C'
17:256448 .. 'T' vs 'C'16 variants where ref(myhg19.fasta)=N but ref(clinvar.tsv)=[ACGT]
17:355278 .. 'N' vs 'C'
17:355570 .. 'N' vs 'C'
17:355653 .. 'N' vs 'G'
17:355749 .. 'N' vs 'G'
17:355765 .. 'N' vs 'G'
17:356212 .. 'N' vs 'C'
17:356248 .. 'N' vs 'A'
17:356346 .. 'N' vs 'C'
17:356370 .. 'N' vs 'C'
Y:545354 .. 'N' vs 'G'
Y:545422 .. 'N' vs 'A'
Y:545469 .. 'N' vs 'C'
Y:545475 .. 'N' vs 'C'
Y:551577 .. 'N' vs 'G'
Y:551578 .. 'N' vs 'C'
Y:1359342 .. 'N' vs 'G'
Hello,
I have come across a handful of variants that were present in previous versions of the curated Clinvar TSV that are not present in the most recent version, but are present in the Clinvar database. I am using both the clinvar_alleles.single.b37.tsv and clinvar_alleles.single.b37.vcf committed on 2017-03-13:
10 55582650 . A ATGTT
2 179325735 rs17304212 C G
2 179325735 rs17304212 C T
2 179325816 rs79399438 G A
3 150690352 rs111033258 A C
5 149740732 rs56180593 C T
Is there a reason these are no longer present? Were they inadvertently dropped? Or am I missing them in either the TSV or VCF?
Hi,
Can a benign column and a conflicts column (just like Pathogenic and Conflicting_Pathogenicity) be added to the results?
Hi,
I really like this resource, and the job you have put together to make clinvar data available in a clean way. I stumbled into some problems trying to make the workflow work on my own computer, but then you tend to do regular updates of the output.tsv file, which is excellent. Are you planning on any update of this soon?
On beforehand, thanks.
Sigve
hi,
Thanks for a very useful resource, it simplifies a lot. Would you consider adding the variant origin in the output (for annoation of 'germline,'somatic','de novo' etc.)? It is present in the variant_summary.txt.gz. I guess adding one more variable to join_data.R would work, e.g.:
colnames(txt_extract) = c('measureset_id','symbol','clinical_significance','review_status','hgvs_c','hgvs_p','origin')
best,
Sigve
The step invoking join_data.R
currently fails on the master
branch, as the script expects clinvar_table_dedup_normalized.tsv
but the correct input file seems to be clinvar_table_normalized.tsv
.
Hi,
Your jobs are awesome! I think it would be very useful for us. But as I look at the example clinvar.tsv.gz
file and the clinvar_example_750_rows.tsv
file, I find a mistake.
chrom | pos | ref | alt | mut | measureset_id | symbol | clinical_significance | pathogenic | benign | conflicted | review_status | gold_stars | hgvs_c | hgvs_p | all_submitters | all_traits | all_pmids | inheritance_modes | age_of_onset | prevalence | disease_mechanism | origin | xrefs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 949523 | C | T | ALT | 183381 | ISG15 | Pathogenic | 1 | 0 | 0 | no assertion criteria provided | 0 | NM_005101.3:c.163C>T | NP_005092.1:p.Gln55Ter | OMIM | Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION | 25307056 | Childhood | <1 / 1 000 000 | germline | MedGen:CN221808;OMIM:616126;Orphanet:319563 | |||
1 | 949696 | C | CG | ALT | 161455 | ISG15 | Pathogenic | 1 | 0 | 0 | no assertion criteria provided | 0 | NM_005101.3:c.339dupG | NP_005092.1:p.Leu114Alafs | OMIM | Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION | 1,22859821,25307056 | Childhood | <1 / 1 000 000 | germline | MedGen:CN221808;OMIM:616126;Orphanet:319563 |
At this sheet, the second row of the result, there is a "1" in the all_pmids
column, are there a mistake of the program?
Hi,
Thanks for this - very useful. I intend to convert this to a VCF file to use downstream. I can't find any info on what the mut
columns is. It's value is commonly ALT
, but sometimes REF
. What does is mean?
thanks!
File clinvar_alleles.single.b38.tsv.gz has 43 duplicate rows. Here are the allele IDs for those rows:
[231307, 150500, 24917, 260558, 150301, 38949, 45434, 45436, 167501, 24912, 190850, 24911, 24915, 257893, 231306, 260559, 25394, 178250, 24914, 178251, 178253, 260556, 189165, 189166, 260560, 45437, 178249, 137687, 231308, 137685, 150305, 99001, 137686, 137689, 150303, 361265, 99002, 150304, 98999, 38950, 23109, 195246, 137688]
On closer look, a subset of them are in pseudoautosomal region (ie. one row should have had chrX and the other chrY). Allele IDs 231307 and 24915 are part of that subset.
ps - At this point, I'm just nitpicking when reporting issues :)
This repo currently contains only a clinvar_alleles_with_exac_v1.single.b37.tsv.gz
table, but
we could generate similar tables for the new gnomAD exomes and genomes datasets (http://gnomad.broadinstitute.org/downloads).
If anyone's interested in having such tables, or has preferences on table columns/format, please let me know.
Wouldn't it be useful to include strand and genomic start and end coordinates of variants (in clinvar_alleles.single file)? Having such info would makes things easier to compare to databases in bed-file format (which of course is my use case).
The clinical sig is lost per submitter, in the group_by_allele.py.
For example: https://www.ncbi.nlm.nih.gov/clinvar/variation/92428/
Has 4 submitters, 1 calls Likely Benign 3 calls Benign.
clinvar_allele_trait_pairs.single.tsv.gz | grep 39073 | grep 53676401 | less
Has 3 lines, with the correct Clin Sig Order.
But clinvar_alleles_grouped.single.tsv.gz | grep 39073 | grep 53676401 | less
Reduces it to "benign;likely benign" indicating 2 entries, for 4 submitters. The order is lost.
BUT.... submitters_ordered is still correct:
EGL Genetic Diagnostics,Eurofins Clinical Diagnostics;GeneReviews;Illumina Clinical Services Laboratory,Illumina;Center for Pediatric Genomic Medicine,Children's Mercy Hospital and Clinics
I don't see mitochondria in the output. Maybe I'm missing something? Anyway, thanks for putting this together!
MIT, or what are we doing these days?
Hi,
I am getting the following error, Could you please walk me through the way to resolve it. I am using python 2.7 on Ubuntu 14.10 machine. All required libraries are up to date.
180593 records processed
180594 records processed
[Nov 18 10:53:45]: Final counts of variants discarded:
[Nov 18 10:53:45]: REF == ALT: 226
[Nov 18 10:53:45]: Wrong REF: 4082
[Nov 18 10:53:45]: Invalid nucleotide: 9
[Nov 18 10:53:45]: Finished 1.3. Running time: 0:00:21.418153 sec.
[Nov 18 10:53:45]: Renamed tmp.2016-11-18_10.46.10.clinvar_table_normalized.tsv to clinvar_table_normalized.tsv
[Nov 18 10:53:45]: --> Exec 1.4: Rscript join_data.R variant_summary.txt.gz
[Nov 18 10:53:45]: Output (last mod N/A): clinvar_combined.tsv [doesn't exist yet]
[Nov 18 10:53:50]: [1] 180594 14
[Nov 18 10:53:59]: Warning message:
[Nov 18 10:53:59]: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
[Nov 18 10:53:59]: embedded nul(s) found in input
[Nov 18 10:53:59]: [1] 354049 30
[Nov 18 10:53:59]: Error in [.data.frame
(x, r, vars, drop = drop) :
[Nov 18 10:53:59]: undefined columns selected
[Nov 18 10:53:59]: Calls: subset -> subset.data.frame -> [ -> [.data.frame
[Nov 18 10:54:16]: $subset(txt_download, assembly == "GRCh37", select = desired_columns)
[Nov 18 10:54:16]: <environment: 0x2c70db0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $subset.data.frame(txt_download, assembly == "GRCh37", select = desired_colu
[Nov 18 10:54:16]: <environment: 0x2dd9a78>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $x[r, vars, drop = drop]
[Nov 18 10:54:16]: <environment: 0x2d9c030>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $``[.data.frame(x, r, vars, drop = drop)
[Nov 18 10:54:16]: <environment: 0x2d9c068>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $`stop("undefined columns selected")`
[Nov 18 10:54:16]: <environment: 0x2d9a5c0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: attr(,"error.message")
[Nov 18 10:54:16]: [1] "Error in `[.data.frame`(x, r, vars, drop = drop) : \n undefined columns selected\nCalls: subset -> subset.data.frame -> [ -> [.data.frame\n"
[Nov 18 10:54:16]: attr(,"class")
[Nov 18 10:54:16]: [1] "dump.frames"
[Nov 18 10:54:16]: Finished 1.4. Running time: 0:00:31.037980 sec.
[Nov 18 10:54:16]: WARNING: unable to rename tmp.2016-11-18_10.46.10.clinvar_combined.tsv to clinvar_combined.tsv. tmp.2016-11-18_10.46.10.clinvar_combined.tsv is not readable
[Nov 18 10:54:16]: --> Skipping 1.5: (cat clinvar_combined.tsv | head -1 > tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv )
[Nov 18 10:54:16]: Input (last mod N/A): clinvar_combined.tsv [Error - input file not found]
[Nov 18 10:54:16]: Traceback (most recent call last):
[Nov 18 10:54:16]: File "/home/ecgi6/.local/lib/python2.7/site-packages/pypez.py", line 769, in does_command_need_to_run
[Nov 18 10:54:16]: try: raise Exception("File not found: " + str(input_filename))
[Nov 18 10:54:16]: Exception: File not found: clinvar_combined.tsv
[Nov 18 10:54:16]:
gold_stars_table = list(
'-' = 0,
...
combined[which(combined$review_status == "-"),]
allele_id | chrom | pos | ref | alt | measureset_type | measureset_id |
---|---|---|---|---|---|---|
255248 | 414032 | 17 | 48271723 | A | T | Variant |
255249 | 414033 | 7 | 94049849 | A | C | Variant |
Looks like sometimes review status can be "-"
ClinVar record 292157 is about gene MTHFR, but it is wrongly mapped to gene C1orf167 in file clinvar_alleles.single.b37.tsv.gz. Coordinate Chr1:11786191 maps to C1orf167 in + strand and MTHFR in - strand. Looks like somewhere in the code, strand variable is overlooked during gene mapping.
Currently it's only in the .tsv
I tried to run the parser with the newest ClinVar may release
I was stuck at the 1.32. RScript crashed after a few second of run.
Any idea how it happened?
Thanks for developing the tool.
[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combine[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combined.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.d.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.
ClinVar released a new XML dump on July 9:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/
And our output is currently from July 8.
Hi,
Thanks for releasing this great resources. I noticed some discrepancies between the semicolon-separated lists in clinical_significance_ordered
and submitters_ordered
:
In [1]: df = pd.read_csv('clinvar_alleles_example_750_rows.single.b37.tsv', sep='\t')
In [2]: df.shape
Out[2]: (749, 39)
In [3]: for col in 'rcv scv clinical_significance_ordered submitters_ordered'.split():
...: df['len_' + col] = df[col].apply(lambda x: len(x.split(';')))
In [4]: diffs = df[df.len_clinical_significance_ordered != df.len_submitters_ordered].shape
In [5]: diffs.shape
Out[5]: (120, 43)
Ordered clinical significance doesn't seem to match the RCV or SCV lists either. Is this intended?
Thanks
Looks like clinvar_alleles_with_exac_v1.*.tsv.gz files missing in b3*/single folders ?
Certain rows have values 0 and 1 separated by a semi-colon in columns pathogenic, benign and conflicted of file clinvar_alleles.tsv.gz. For example, pathogenic column in line#6704 is 1;0. If I am not wrong, these columns must have either 0 or 1 by definition. Is this correct, or am I missing something?
Edit: I see these rows always seem to have more than one measureset_id. Is this the reason why above-mentioned columns may have multiple values?
Instead of having each python script start with:
#!/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/bin/python
You should use:
#!/usr/bin/env python
The use of env
ensures that the "current" python interpreter, wherever it is installed, is used.
Some of the reference transcripts used by clinvar are not always the ones chosen by Clinical labs. The current version of this pipeline generates a single "hgvs_c" and "hgvs_p" for the reference transcript only.
e.g. in https://www.ncbi.nlm.nih.gov/clinvar/variation/12408/ look under "hgvs" and you will find two NM and two AA changes and an "Other names" section all indicating a two transcripts annotations.
NM_001001430.2(TNNT2):c.236T>A (p.Ile79Asn)
Other names:
p.I79N:ATC>AAC
Protein change:I79N, I89N
HGVS:
NG_007556.1:g.17040T>A
NM_000364.3:c.266T>A
NM_001001430.2:c.236T>A
Please let me know if you plan on working on this, I might. [email protected]
This is a very useful script collection. Is it possible to use another reference, e.g. hg38.fa ? I tried editing the parsing script replacing 37 with 38, but then the normalisation breaks, even though I feed it the corresponding reference, e.g. hg38. What am I overlooking here?
The error thrown is:
'Incorrect REF value: 1 1014143 C T (actual REF should be )' ...
(...)
Thank you very much.
Hello,
I came across your clinvar pipeline via a google search and decided to experiment with it. It turns out that for the VM I am using, there is a failure because too much memory is required in the parse_clinvar_xml.py stage.
I see that you are parsing the entire ClinVar XML tree at once via an ET.parse(handle) command.
Have you considered using ET.iterparse() to parse the XML on the fly? I think you could detect the end of ClinVarSet elements, extract info from the element, and then use an elem.clear() to free the memory. An example is discussed at the following link:
http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree
Thanks for posting your code.
-- Gregory S. Turenchalk
Hi there,
Really glad you guys shared your code and parsed data on github!
Mostly this issue is just to put it on your radar for something to improve for the next version, but notice that for the following entry we get conflicting information for the gene symbol:
On ClinVar it's EGFR:
https://www.ncbi.nlm.nih.gov/clinvar/variation/45271/
For the hgvs_c we get EGFR:
https://www.ncbi.nlm.nih.gov/nuccore/NM_005228
But in clinvar_alleles.single.b37.tsv
we get EGFR-AS1:
> paste <(gzcat clinvar_alleles.single.b37.tsv.gz | head -n1 | tr '\t' '\n' | cat -n) <(gzcat clinvar_alleles.single.b37.tsv.gz | grep RCV000038427 | tr '\t' '\n') | column -t -s$'\t'
1 chrom 7
2 pos 55249063
3 ref G
4 alt A
5 measureset_type Variant
6 measureset_id 45271
7 rcv RCV000038427;RCV000321080
8 allele_id 54438
9 symbol EGFR-AS1
10 hgvs_c NM_005228.4:c.2361G>A
11 hgvs_p NP_005219.2:p.Gln787=
12 molecular_consequence NM_005228.4:c.2361G>A:synonymous variant
13 clinical_significance Benign;Likely benign
14 pathogenic 0
15 benign 1
16 conflicted 0
17 review_status criteria provided, multiple submitters, no conflicts
18 gold_stars 2
19 all_submitters Laboratory for Molecular Medicine,Partners HealthCare Personalized Medicine;PreventionGenetics,PreventionGenetics;Illumina Clinical Services Laboratory,Illumina
20 all_traits not specified;Not Specified;NOT SPECIFIED;Lung cancer;Lung Cancer
21 all_pmids 25741868;17409930,23562183,23667368,24627688,24846033,25311215
22 inheritance_modes
23 age_of_onset
24 prevalence
25 disease_mechanism
26 origin germline;somatic
27 xrefs MedGen:CN169374;Genetic Alliance:Lung+Cancer/4334;Genetics Home Reference:lung-cancer;MedGen:C0684249;OMIM:211980;SNOMED CT:187875007
This can be traced to the following code from parse_clinical_xml.py:
#find the gene symbol
current_row['symbol']=''
genesymbol = measure[i].findall('.//Symbol')
if genesymbol is not None:
for symbol in genesymbol:
if(symbol.find('ElementValue').attrib.get('Type')=='Preferred'):
current_row['symbol']=symbol.find('ElementValue').text;
break
Notice how we break
after the first success, but for this example we have multiple "Preferred" symbols in the xml in a confusing order.
<ClinVarSet ID="17452916">
...
<ClinVarAccession Acc="RCV000038427" Version="3" Type="RCV" DateUpdated="2017-01-25"/>
...
<MeasureSet Type="Variant" ID="45271">
<Measure Type="single nucleotide variant" ID="54438">
<Name>
<ElementValue Type="Preferred">NM_005228.4(EGFR):c.2361G>A (p.Gln787=)</ElementValue>
</Name>
...
<Symbol>
<ElementValue Type="Preferred">EGFR-AS1</ElementValue>
</Symbol>
...
<Symbol>
<ElementValue Type="Preferred">EGFR</ElementValue>
</Symbol>
...
Perhaps one might want to simply use the gene symbol given in parenthesis in the text for the first .//Name/ElementValue in the .//Measure
I'll admit I haven't done an extensive analysis on the best choice for extracting the gene symbol from the xml, and the solution I just tossed your way ^ might be incorrect, but right now the symbol column in your output is not always reliable.
Also, check out (the first one I noticed via grep) https://www.ncbi.nlm.nih.gov/clinvar/variation/41400 for an example where there's no "Preferred" gene symbol, but there is a symbol in parenthesis given that seems to check out. The XML does not have a "Preferred" symbol in this case.
This is the result of join_data.R for multi
chrom | pos | ref | alt | measureset_type | measureset_id | rcv | allele_id | symbol | hgvs_c | clinical_significance | clinical_significance_ordered | pathogenic | benign | conflicted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 166277030 | T | G | Haplotype | 30359 | RCV000023304 | 39315 | SCN9A | NM_002977.3:c.2794A>C | Benign/Likely benign | Pathogenic | 0 | 1 | 0 |
It's wrong: NM_002977.3(SCN9A):c.[2794A>C;2971G>T] – Haplotype
Looking at each step:
clinvar_table_raw.multi.tsv - Correct
clinvar_table_normalized.multi.tsv.gz - Correct
clinvar_allele_trait_pairs.multi.tsv.gz - Correct
clinvar_alleles_grouped.multi.tsv.gz - Correct
clinvar_alleles_combined.multi.tsv.gz - WRONG
There are 20 halpotype variants with this same issue:
measureset_type | measureset_id | rcv | allele_id |
---|---|---|---|
Haplotype | 1631 | RCV000001698 | 16670 |
Haplotype | 5706 | RCV000006060 | 20745 |
Haplotype | 5813 | RCV000006169 | 20852 |
Haplotype | 7239 | RCV000007661 | 22244 |
Haplotype | 13065 | RCV000013940 | 28104 |
Haplotype | 13399 | RCV000014336;RCV000014337 | 28436 |
Haplotype | 16318 | RCV000017711;RCV000201276 | 31357 |
Haplotype | 16876;16877 | RCV000018372;RCV000018373 | 31916 |
Haplotype | 4297 | RCV000004533;RCV000004534;RCV000004535;RCV000004536 | 38384 |
Haplotype | 4297 | RCV000004533;RCV000004534;RCV000004535;RCV000004536 | 38385 |
Haplotype | 4816 | RCV000005085 | 38434 |
Haplotype | 9398 | RCV000010000 | 38447 |
Haplotype | 9407 | RCV000010010 | 38448 |
Haplotype | 16318;217371 | RCV000017711;RCV000201276;RCV000201278 | 38476 |
Haplotype | 30359 | RCV000023304 | 39315 |
Haplotype | 38571 | RCV000021985 | 46849 |
Haplotype | 402236 | RCV000454199 | 98655 |
Haplotype | 218894 | RCV000203245 | 137950 |
Haplotype | 225143 | RCV000210779 | 227037 |
Haplotype | 188053 | RCV000167863 | 255673 |
Would you be willing to add a .gitignore file to this repo, so when we fork or clone the repository, we don't have to pull down all the data files?
*.tsv
*.tsv.gz
*.tbi
It my be better to host the data files, that result from this code somewhere else, for those users who just want the pre-canned results.
Currently, only the b37/multi folder has augmented tables (with ExAC and GnomAD data). Shouldn't these augmented tables be in each output folder?
Greetings,
I was checking over resulted tsv file and found out that at least some variants included in clinvar VCF file ( ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/ ) are not being shown there.
Can you please explain why ? For example, the following variant appears in the ClinVar VCF file but doesn't appear in the TSV file included in the codebase for this repo AND also doesn't appear in regenerated TSV file using the latest ClinVar dataset:
7 92130876 rs61750420 C T
Thanks in advance,
Peter
This tool has been very helpful! I would like to have the ModeOfInheritence field pulled into the flatfile. Is there any addendum that you can provide to the script? IS there a way I could customize the columns I want pulled into the flatfile?
Hi, I am wondering if the clinvar files are available for b38 version too
Any technichal issues holding it so?
Thanks
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.