macarthur-lab / clinvar Goto Github PK

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

License: Other

Python 100.00%

clinvar's People

Stargazers

Watchers

clinvar's Issues

List Order Mismatch between Clinical_Sig & All_Summiters

Example:
Variant=chr1:976059_C>T
ID=RCV000195231

The result in your clinvar_alleles.tsv:
clinical_significance="Likely benign;Uncertain significance"
all_submitters="Genetic Services Laboratory, University of Chicago;PreventionGenetics"

If you look at the order of the list....which would be useful....Likely benign was submitted by U of Chicago.
But, that is not the case:
https://www.ncbi.nlm.nih.gov/clinvar/RCV000195231/

That's just one example, there are many, many, many more.

I can see where this comes from. Regex and the XML structure. In script: parse_clinvar_xml.py:104

current_row['all_submitters'] = ';'.join([
            submitter_node.attrib['submitter'].replace(';', ',')
            for submitter_node in elem.findall('.//ClinVarSubmissionID')
            if submitter_node.attrib is not None and submitter_node.attrib.has_key('submitter')
        ])

The "submitters" is obtained from a separate node, without any attempt to match against the nested clin_sig description.

clinical_significance=elem.find('.//ReferenceClinVarAssertion/ClinicalSignificance')
    if clinical_significance.find('.//ReviewStatus') is not None:
        current_row['review_status']=clinical_significance.find('.//ReviewStatus').text;
    if clinical_significance.find('.//Description') is not None:
        current_row['clinical_significance']=clinical_significance.find('.//Description').text

If I had a solution worked out, I would make a pull request. But it appears to tricky, so far.

missing hgvs_c when indel is realigned and shifted.

I noticed a lot of indels/fs have missing hgvs_c NM transcript information. I do see the hgvs_p (with the protein),

Example:
This variant
https://www.ncbi.nlm.nih.gov/clinvar/variation/52203/

is listed as Chr13: 32341188 – 32341192 in clinvar

In the files from the clinvar/macarthur pipeline,
It gets left shifted here: 13 32341183

There is no “hgvs_c” field in the output from the pipeline
It does have have (which is the same as clinvar .. not sure if that was recomputed).
"hgvs_p":"NP_000050.2:p.Ile2278Serfs",

However, we can find the hgvs_c information under the molecular consequence field.

"molecular_consequence":["NM_000059.3:c.6833_6837del:frameshift variant"]

I’m not sure if the pipeline computed that or it’s from clinvar (same as in clinvar)..
But it could be propagated to the hgvs_c field.

Problems with hgvs representation for Clinvar haplotype entries

It looks like the clinvar.tsv file doesn't always represent the correct c. hgvs for Clinvar entries that have multiple c. hgvs as part of the haplotype. And not all haplotypes seem to be represented in the clinvar.tsv file. Could you have a look?

should code be moved to a separate github repo from the data?

In #27 , @raymond301 suggested putting the clinvar data in a separate github repo from the code.

It seems simpler to me to keep everything in one place. Does anybody else have a preference on this?

Convert the pipeline to hail for .tsv processing steps?

Next time the pipeline needs updates, we should probably convert all steps that follow .xml parsing to a Spark-based hail pipeline.

Currently, the steps that generate clinvar x gnomAD tables take hours to run, so I skipped them for the latest release - hail would be able to perform these joins much more efficiently.

Need to Update pathogenic & conflict boolean regex.

Looks like Clinvar has new string added to clin_sig: "Conflicting interpretations of pathogenicity"

I just handle it by changing these lines in join_data.R

combined$pathogenic = as.integer(grepl('athogenic\\b',combined$clinical_significance))


# conflicted = 1 if at least one submission each of [likely] benign and [likely] pathogenic
combined$conflicted = as.integer((grepl('athogenic',combined$clinical_significance) & grepl('enign',combined$clinical_significance)) | grepl('Conflicting interpretations',combined$clinical_significance, ignore.case=TRUE))

That way you get better boolean values.

Improper labelling of conflicting variants as pathogenic

It seems reformatted data tagged as 'May release' was produced from Mar (or prior to that?) NCBI clinvar data. Since April, NCBI clinvar seems to have changed how they report aggregated clinical significance, and this completely affects how conflicted variants are identified if current code is used with recent NCBI clinvar data. Current NCBI's practice is to use the term Conflicting interpretations of pathogenicity for conflicts, and these variants are mistakenly tagged as TRUE for pathogenic column (instead of TRUE for conflicted) as they have the string 'pathogenic'.

@XiaoleiZ - Since I am using data from your fork on Aug data, I thought I should tag you here.

@bw2 @ericminikel @XiaoleiZ - Any suggestion on which repo or fork to use for actively maintained data, if any? I would like to contribute if you need more manpower on maintaining a centralized repo.

Job parse failure on system execution. ENV issues?

master.py LINE 149

The custom sorting via eGrep & Sort commands.
# sort job.add(("cat " + "<(gunzip -c IN:%(tmp_dir)s/clinvar_table_normalized.%(fsuffix)s.tsv.gz

When executing on the system fails on a syntax error, which is fairly uninformative.

[Sep 12 13:29:27]: --> Exec 1.4: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 ) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 ) | bgzip -c > output_tmp/tmp.2017-09-12_13.29.26.clinvar_allele_trait_pairs.multi.b37.tsv.gz

[Sep 12 13:29:27]: Output (last mod N/A): output_tmp/clinvar_allele_trait_pairs.multi.b37.tsv.gz [doesn't exist yet]

[Sep 12 13:29:28]: /bin/sh: -c: line 0: syntax error near unexpected token ('`

[Sep 12 13:29:28]: /bin/sh: -c: line 0: cat <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | head -1) <(gunzip -c output_tmp/clinvar_table_normalized.multi.b37.tsv.gz | tail -n +2 |

I'm trying to work through this right now, but if you have any suggestions....

Need to check whether REF base is correct

Ellen Tsai from Heidi Rehm's lab has emailed today to say that we are currently letting through tens of variants that have the wrong REF base in ClinVar. Per Adrian Tan's July 2 email, vt normalize checks whether the REF base is correct. If we make our python implementation of normalize do the same, and throw out variants with the wrong REF, then this should fix the variants Ellen points out. Here is the list of offending variants from her email:

If you're interested, the 36 variants I've flagged are provided below.

5 variants listed twice in clinvar.tsv where ref/alt are switched between the rows (e.g. chr pos ref alt):

11
64572018 C
T
11
64572018 T
C
7
117199533 A
G
7
117199533 G
A
9
139393681 C
T
9
139393681 G
A
9
139412360 A
G
9
139412360 T
C
9
139414018 C
A
9
139414018 G
T

5 variants where ref(clinvar.tsv) doesn't match hg19? (e.g. myhg19.fasta vs clinvar.tsv)

9:134385436 .. 'A' vs 'G'
MT:14498 .. 'T' vs 'C'

A in UCSC browser, but apparently 'T' in my fasta file

MT:14729 .. 'T' vs 'G'
17:181683 .. 'A' vs 'C'
17:256448 .. 'T' vs 'C'

16 variants where ref(myhg19.fasta)=N but ref(clinvar.tsv)=[ACGT]

17:355278 .. 'N' vs 'C'
17:355570 .. 'N' vs 'C'
17:355653 .. 'N' vs 'G'
17:355749 .. 'N' vs 'G'
17:355765 .. 'N' vs 'G'
17:356212 .. 'N' vs 'C'
17:356248 .. 'N' vs 'A'
17:356346 .. 'N' vs 'C'
17:356370 .. 'N' vs 'C'
Y:545354 .. 'N' vs 'G'
Y:545422 .. 'N' vs 'A'
Y:545469 .. 'N' vs 'C'
Y:545475 .. 'N' vs 'C'
Y:551577 .. 'N' vs 'G'
Y:551578 .. 'N' vs 'C'
Y:1359342 .. 'N' vs 'G'

Missing variants

Hello,

I have come across a handful of variants that were present in previous versions of the curated Clinvar TSV that are not present in the most recent version, but are present in the Clinvar database. I am using both the clinvar_alleles.single.b37.tsv and clinvar_alleles.single.b37.vcf committed on 2017-03-13:

10 55582650 . A ATGTT
2 179325735 rs17304212 C G
2 179325735 rs17304212 C T
2 179325816 rs79399438 G A
3 150690352 rs111033258 A C
5 149740732 rs56180593 C T

Is there a reason these are no longer present? Were they inadvertently dropped? Or am I missing them in either the TSV or VCF?

Request: Adding Benign column to the output

Hi,

Can a benign column and a conflicts column (just like Pathogenic and Conflicting_Pathogenicity) be added to the results?

Update?

Hi,

I really like this resource, and the job you have put together to make clinvar data available in a clean way. I stumbled into some problems trying to make the workflow work on my own computer, but then you tend to do regular updates of the output.tsv file, which is excellent. Are you planning on any update of this soon?

On beforehand, thanks.

Sigve

Origin column

hi,

Thanks for a very useful resource, it simplifies a lot. Would you consider adding the variant origin in the output (for annoation of 'germline,'somatic','de novo' etc.)? It is present in the variant_summary.txt.gz. I guess adding one more variable to join_data.R would work, e.g.:

colnames(txt_extract) = c('measureset_id','symbol','clinical_significance','review_status','hgvs_c','hgvs_p','origin')

best,
Sigve

join_data.R fails because of input file not being found

The step invoking join_data.R currently fails on the master branch, as the script expects clinvar_table_dedup_normalized.tsv but the correct input file seems to be clinvar_table_normalized.tsv.

Something wrong?

Hi,

Your jobs are awesome! I think it would be very useful for us. But as I look at the example clinvar.tsv.gz file and the clinvar_example_750_rows.tsv file, I find a mistake.

chrom	pos	ref	alt	mut	measureset_id	symbol	clinical_significance	pathogenic	benign	conflicted	review_status	gold_stars	hgvs_c	hgvs_p	all_submitters	all_traits	all_pmids	inheritance_modes	age_of_onset	prevalence	disease_mechanism	origin	xrefs
1	949523	C	T	ALT	183381	ISG15	Pathogenic	1	0	0	no assertion criteria provided	0	NM_005101.3:c.163C>T	NP_005092.1:p.Gln55Ter	OMIM	Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION	25307056		Childhood	<1 / 1 000 000		germline	MedGen:CN221808;OMIM:616126;Orphanet:319563
1	949696	C	CG	ALT	161455	ISG15	Pathogenic	1	0	0	no assertion criteria provided	0	NM_005101.3:c.339dupG	NP_005092.1:p.Leu114Alafs	OMIM	Immunodeficiency 38 with basal ganglia calcification;IMMUNODEFICIENCY 38 WITH BASAL GANGLIA CALCIFICATION	1,22859821,25307056		Childhood	<1 / 1 000 000		germline	MedGen:CN221808;OMIM:616126;Orphanet:319563

At this sheet, the second row of the result, there is a "1" in the all_pmids column, are there a mistake of the program?

mut column

Hi,

Thanks for this - very useful. I intend to convert this to a VCF file to use downstream. I can't find any info on what the mut columns is. It's value is commonly ALT, but sometimes REF. What does is mean?

thanks!

Improper handling of pseudoautosomal region

File clinvar_alleles.single.b38.tsv.gz has 43 duplicate rows. Here are the allele IDs for those rows:
[231307, 150500, 24917, 260558, 150301, 38949, 45434, 45436, 167501, 24912, 190850, 24911, 24915, 257893, 231306, 260559, 25394, 178250, 24914, 178251, 178253, 260556, 189165, 189166, 260560, 45437, 178249, 137687, 231308, 137685, 150305, 99001, 137686, 137689, 150303, 361265, 99002, 150304, 98999, 38950, 23109, 195246, 137688]

On closer look, a subset of them are in pseudoautosomal region (ie. one row should have had chrX and the other chrY). Allele IDs 231307 and 24915 are part of that subset.

ps - At this point, I'm just nitpicking when reporting issues :)

generate clinvar_alleles_with_gnomad table(s)

This repo currently contains only a clinvar_alleles_with_exac_v1.single.b37.tsv.gz table, but
we could generate similar tables for the new gnomAD exomes and genomes datasets (http://gnomad.broadinstitute.org/downloads).
If anyone's interested in having such tables, or has preferences on table columns/format, please let me know.

Feature request: Including strand info and genomic start and end coordinates

Wouldn't it be useful to include strand and genomic start and end coordinates of variants (in clinvar_alleles.single file)? Having such info would makes things easier to compare to databases in bed-file format (which of course is my use case).

Clinical Significance Order Lost in Allele Grouping

The clinical sig is lost per submitter, in the group_by_allele.py.

For example: https://www.ncbi.nlm.nih.gov/clinvar/variation/92428/
Has 4 submitters, 1 calls Likely Benign 3 calls Benign.

clinvar_allele_trait_pairs.single.tsv.gz | grep 39073 | grep 53676401 | less
Has 3 lines, with the correct Clin Sig Order.

But clinvar_alleles_grouped.single.tsv.gz | grep 39073 | grep 53676401 | less
Reduces it to "benign;likely benign" indicating 2 entries, for 4 submitters. The order is lost.

BUT.... submitters_ordered is still correct:
EGL Genetic Diagnostics,Eurofins Clinical Diagnostics;GeneReviews;Illumina Clinical Services Laboratory,Illumina;Center for Pediatric Genomic Medicine,Children's Mercy Hospital and Clinics

Output doesn't include mitochondria?

I don't see mitochondria in the output. Maybe I'm missing something? Anyway, thanks for putting this together!

Add LICENSE

MIT, or what are we doing these days?

Issue with R script

Hi,

I am getting the following error, Could you please walk me through the way to resolve it. I am using python 2.7 on Ubuntu 14.10 machine. All required libraries are up to date.

180593 records processed
180594 records processed
[Nov 18 10:53:45]: Final counts of variants discarded:
[Nov 18 10:53:45]: REF == ALT: 226
[Nov 18 10:53:45]: Wrong REF: 4082
[Nov 18 10:53:45]: Invalid nucleotide: 9
[Nov 18 10:53:45]: Finished 1.3. Running time: 0:00:21.418153 sec.
[Nov 18 10:53:45]: Renamed tmp.2016-11-18_10.46.10.clinvar_table_normalized.tsv to clinvar_table_normalized.tsv
[Nov 18 10:53:45]: --> Exec 1.4: Rscript join_data.R variant_summary.txt.gz
[Nov 18 10:53:45]: Output (last mod N/A): clinvar_combined.tsv [doesn't exist yet]
[Nov 18 10:53:50]: [1] 180594 14
[Nov 18 10:53:59]: Warning message:
[Nov 18 10:53:59]: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
[Nov 18 10:53:59]: embedded nul(s) found in input
[Nov 18 10:53:59]: [1] 354049 30
[Nov 18 10:53:59]: Error in [.data.frame(x, r, vars, drop = drop) :
[Nov 18 10:53:59]: undefined columns selected
[Nov 18 10:53:59]: Calls: subset -> subset.data.frame -> [ -> [.data.frame
[Nov 18 10:54:16]: $subset(txt_download, assembly == "GRCh37", select = desired_columns)
[Nov 18 10:54:16]: <environment: 0x2c70db0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $subset.data.frame(txt_download, assembly == "GRCh37", select = desired_colu
[Nov 18 10:54:16]: <environment: 0x2dd9a78>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $x[r, vars, drop = drop]
[Nov 18 10:54:16]: <environment: 0x2d9c030>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $``[.data.frame(x, r, vars, drop = drop)
[Nov 18 10:54:16]: <environment: 0x2d9c068>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $`stop("undefined columns selected")`
[Nov 18 10:54:16]: <environment: 0x2d9a5c0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: attr(,"error.message")
[Nov 18 10:54:16]: [1] "Error in `[.data.frame`(x, r, vars, drop = drop) : \n undefined columns selected\nCalls: subset -> subset.data.frame -> [ -> [.data.frame\n"
[Nov 18 10:54:16]: attr(,"class")
[Nov 18 10:54:16]: [1] "dump.frames"
[Nov 18 10:54:16]: Finished 1.4. Running time: 0:00:31.037980 sec.
[Nov 18 10:54:16]: WARNING: unable to rename tmp.2016-11-18_10.46.10.clinvar_combined.tsv to clinvar_combined.tsv. tmp.2016-11-18_10.46.10.clinvar_combined.tsv is not readable
[Nov 18 10:54:16]: --> Skipping 1.5: (cat clinvar_combined.tsv | head -1 > tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv )
[Nov 18 10:54:16]: Input (last mod N/A): clinvar_combined.tsv [Error - input file not found]
[Nov 18 10:54:16]: Traceback (most recent call last):
[Nov 18 10:54:16]: File "/home/ecgi6/.local/lib/python2.7/site-packages/pypez.py", line 769, in does_command_need_to_run
[Nov 18 10:54:16]: try: raise Exception("File not found: " + str(input_filename))
[Nov 18 10:54:16]: Exception: File not found: clinvar_combined.tsv
[Nov 18 10:54:16]:

Another translation for gold stars

gold_stars_table = list(
   '-' = 0,
   ...

combined[which(combined$review_status == "-"),]

allele_id	chrom	pos	ref	alt	measureset_type	measureset_id
255248	414032	17	48271723	A	T	Variant
255249	414033	7	94049849	A	C	Variant

Looks like sometimes review status can be "-"

Wrong gene mapping

ClinVar record 292157 is about gene MTHFR, but it is wrongly mapped to gene C1orf167 in file clinvar_alleles.single.b37.tsv.gz. Coordinate Chr1:11786191 maps to C1orf167 in + strand and MTHFR in - strand. Looks like somewhere in the code, strand variable is overlooked during gene mapping.

Add "last_evaluated" field to VCF

Currently it's only in the .tsv

RScript crash

I tried to run the parser with the newest ClinVar may release

I was stuck at the 1.32. RScript crashed after a few second of run.

Any idea how it happened?

Thanks for developing the tool.

[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combine[May 17 12:05:49]: --> Exec 1.32: Rscript join_data.R variant_summary.txt.gz output_tmp/clinvar_alleles_grouped.single.b37.tsv.gz output_tmp/tmp.2017-05-17_11.05.29.clinvar_alleles_combined.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.d.single.b37.tsv.gz
[May 17 12:05:49]: Output (last mod N/A): output_tmp/clinvar_alleles_combined.single.b37.tsv.gz [doesn't exist yet]
[May 17 12:06:13]: [1] 282106 25
[May 17 12:06:37]: [1] 634027 30
[May 17 12:06:44]: Error in write.table(combined, output_table, sep = "\t", row.names = F, :
[May 17 12:06:44]: unimplemented type 'list' in 'EncodeElement'
[May 17 12:06:54]: $write.table(combined, output_table, sep = "\\t", row.names = F, col.names =
[May 17 12:06:54]: <environment: 0x7fc00382c120>
[May 17 12:06:54]: attr(,"error.message")
[May 17 12:06:54]: [1] "Error in write.table(combined, output_table, sep = "\t", row.names = F, : \n unimplemented type 'list' in 'EncodeElement'\n"
[May 17 12:06:54]: attr(,"class")
[May 17 12:06:54]: [1] "dump.frames"
[May 17 12:06:54]: Finished 1.32. Running time: 0:01:04.242256 sec.

Need to update output

ClinVar released a new XML dump on July 9:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/

And our output is currently from July 8.

Mismatch between clinical_significance_ordered and submitters_ordered

Hi,

Thanks for releasing this great resources. I noticed some discrepancies between the semicolon-separated lists in clinical_significance_ordered and submitters_ordered:

In [1]: df = pd.read_csv('clinvar_alleles_example_750_rows.single.b37.tsv', sep='\t')

In [2]: df.shape
Out[2]: (749, 39)

In [3]: for col in 'rcv scv clinical_significance_ordered submitters_ordered'.split():
    ...:     df['len_' + col] = df[col].apply(lambda x: len(x.split(';')))

In [4]: diffs = df[df.len_clinical_significance_ordered != df.len_submitters_ordered].shape

In [5]: diffs.shape
Out[5]: (120, 43)

Ordered clinical significance doesn't seem to match the RCV or SCV lists either. Is this intended?

Thanks

missing clinvar_alleles_with_exac_v1.*.tsv.gz

Looks like clinvar_alleles_with_exac_v1.*.tsv.gz files missing in b3*/single folders ?

Semi-colon in pathogenic, benign and conflicted columns. Bug?

Certain rows have values 0 and 1 separated by a semi-colon in columns pathogenic, benign and conflicted of file clinvar_alleles.tsv.gz. For example, pathogenic column in line#6704 is 1;0. If I am not wrong, these columns must have either 0 or 1 by definition. Is this correct, or am I missing something?

Edit: I see these rows always seem to have more than one measureset_id. Is this the reason why above-mentioned columns may have multiple values?

Use a more general python interpreter location

Instead of having each python script start with:

#!/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/bin/python

You should use:

#!/usr/bin/env python

The use of env ensures that the "current" python interpreter, wherever it is installed, is used.

Multiple isoforms not in index

Some of the reference transcripts used by clinvar are not always the ones chosen by Clinical labs. The current version of this pipeline generates a single "hgvs_c" and "hgvs_p" for the reference transcript only.
e.g. in https://www.ncbi.nlm.nih.gov/clinvar/variation/12408/ look under "hgvs" and you will find two NM and two AA changes and an "Other names" section all indicating a two transcripts annotations.

NM_001001430.2(TNNT2):c.236T>A (p.Ile79Asn)
Other names:
p.I79N:ATC>AAC
Protein change:I79N, I89N
HGVS:
NG_007556.1:g.17040T>A
NM_000364.3:c.266T>A
NM_001001430.2:c.236T>A

Please let me know if you plan on working on this, I might. [email protected]

using hg38

This is a very useful script collection. Is it possible to use another reference, e.g. hg38.fa ? I tried editing the parsing script replacing 37 with 38, but then the normalisation breaks, even though I feed it the corresponding reference, e.g. hg38. What am I overlooking here?

The error thrown is:
'Incorrect REF value: 1 1014143 C T (actual REF should be )' ...
(...)

Thank you very much.

Memory Footprint too High

Hello,

I came across your clinvar pipeline via a google search and decided to experiment with it. It turns out that for the VM I am using, there is a failure because too much memory is required in the parse_clinvar_xml.py stage.

I see that you are parsing the entire ClinVar XML tree at once via an ET.parse(handle) command.

Have you considered using ET.iterparse() to parse the XML on the fly? I think you could detect the end of ClinVarSet elements, extract info from the element, and then use an elem.clear() to free the memory. An example is discussed at the following link:

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree

Thanks for posting your code.

-- Gregory S. Turenchalk

Sometimes "symbol" disagrees with primary hgvs gene annotation

Hi there,
Really glad you guys shared your code and parsed data on github!

Mostly this issue is just to put it on your radar for something to improve for the next version, but notice that for the following entry we get conflicting information for the gene symbol:

On ClinVar it's EGFR:
https://www.ncbi.nlm.nih.gov/clinvar/variation/45271/
For the hgvs_c we get EGFR:
https://www.ncbi.nlm.nih.gov/nuccore/NM_005228

But in clinvar_alleles.single.b37.tsv we get EGFR-AS1:

> paste <(gzcat clinvar_alleles.single.b37.tsv.gz | head -n1 | tr '\t' '\n' | cat -n) <(gzcat clinvar_alleles.single.b37.tsv.gz | grep RCV000038427 | tr '\t' '\n') | column -t -s$'\t'
     1  chrom                  7
     2  pos                    55249063
     3  ref                    G
     4  alt                    A
     5  measureset_type        Variant
     6  measureset_id          45271
     7  rcv                    RCV000038427;RCV000321080
     8  allele_id              54438
     9  symbol                 EGFR-AS1
    10  hgvs_c                 NM_005228.4:c.2361G>A
    11  hgvs_p                 NP_005219.2:p.Gln787=
    12  molecular_consequence  NM_005228.4:c.2361G>A:synonymous variant
    13  clinical_significance  Benign;Likely benign
    14  pathogenic             0
    15  benign                 1
    16  conflicted             0
    17  review_status          criteria provided, multiple submitters, no conflicts
    18  gold_stars             2
    19  all_submitters         Laboratory for Molecular Medicine,Partners HealthCare Personalized Medicine;PreventionGenetics,PreventionGenetics;Illumina Clinical Services Laboratory,Illumina
    20  all_traits             not specified;Not Specified;NOT SPECIFIED;Lung cancer;Lung Cancer
    21  all_pmids              25741868;17409930,23562183,23667368,24627688,24846033,25311215
    22  inheritance_modes
    23  age_of_onset
    24  prevalence
    25  disease_mechanism
    26  origin                 germline;somatic
    27  xrefs                  MedGen:CN169374;Genetic Alliance:Lung+Cancer/4334;Genetics Home Reference:lung-cancer;MedGen:C0684249;OMIM:211980;SNOMED CT:187875007

This can be traced to the following code from parse_clinical_xml.py:

            #find the gene symbol
            current_row['symbol']=''
            genesymbol = measure[i].findall('.//Symbol')
            if genesymbol is not None:
                for symbol in genesymbol:
                    if(symbol.find('ElementValue').attrib.get('Type')=='Preferred'):
                        current_row['symbol']=symbol.find('ElementValue').text;
                        break

Notice how we break after the first success, but for this example we have multiple "Preferred" symbols in the xml in a confusing order.

<ClinVarSet ID="17452916">
...
    <ClinVarAccession Acc="RCV000038427" Version="3" Type="RCV" DateUpdated="2017-01-25"/>
...
    <MeasureSet Type="Variant" ID="45271">
      <Measure Type="single nucleotide variant" ID="54438">
        <Name>
          <ElementValue Type="Preferred">NM_005228.4(EGFR):c.2361G&gt;A (p.Gln787=)</ElementValue>
        </Name>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR-AS1</ElementValue>
          </Symbol>
...
          <Symbol>
            <ElementValue Type="Preferred">EGFR</ElementValue>
          </Symbol>
...

Perhaps one might want to simply use the gene symbol given in parenthesis in the text for the first .//Name/ElementValue in the .//Measure

I'll admit I haven't done an extensive analysis on the best choice for extracting the gene symbol from the xml, and the solution I just tossed your way ^ might be incorrect, but right now the symbol column in your output is not always reliable.

Also, check out (the first one I noticed via grep) https://www.ncbi.nlm.nih.gov/clinvar/variation/41400 for an example where there's no "Preferred" gene symbol, but there is a symbol in parenthesis given that seems to check out. The XML does not have a "Preferred" symbol in this case.

ClinSig changes for some halpotypes in join_data.R - become wrong!

This is the result of join_data.R for multi

chrom	pos	ref	alt	measureset_type	measureset_id	rcv	allele_id	symbol	hgvs_c	clinical_significance	clinical_significance_ordered	pathogenic	benign	conflicted
2	166277030	T	G	Haplotype	30359	RCV000023304	39315	SCN9A	NM_002977.3:c.2794A>C	Benign/Likely benign	Pathogenic	0	1	0

It's wrong: NM_002977.3(SCN9A):c.[2794A>C;2971G>T] – Haplotype

Looking at each step:

clinvar_table_raw.multi.tsv - Correct
clinvar_table_normalized.multi.tsv.gz - Correct
clinvar_allele_trait_pairs.multi.tsv.gz - Correct
clinvar_alleles_grouped.multi.tsv.gz - Correct
clinvar_alleles_combined.multi.tsv.gz - WRONG

There are 20 halpotype variants with this same issue:

measureset_type	measureset_id	rcv	allele_id
Haplotype	1631	RCV000001698	16670
Haplotype	5706	RCV000006060	20745
Haplotype	5813	RCV000006169	20852
Haplotype	7239	RCV000007661	22244
Haplotype	13065	RCV000013940	28104
Haplotype	13399	RCV000014336;RCV000014337	28436
Haplotype	16318	RCV000017711;RCV000201276	31357
Haplotype	16876;16877	RCV000018372;RCV000018373	31916
Haplotype	4297	RCV000004533;RCV000004534;RCV000004535;RCV000004536	38384
Haplotype	4297	RCV000004533;RCV000004534;RCV000004535;RCV000004536	38385
Haplotype	4816	RCV000005085	38434
Haplotype	9398	RCV000010000	38447
Haplotype	9407	RCV000010010	38448
Haplotype	16318;217371	RCV000017711;RCV000201276;RCV000201278	38476
Haplotype	30359	RCV000023304	39315
Haplotype	38571	RCV000021985	46849
Haplotype	402236	RCV000454199	98655
Haplotype	218894	RCV000203245	137950
Haplotype	225143	RCV000210779	227037
Haplotype	188053	RCV000167863	255673

add .gitignore File to Repo?

Would you be willing to add a .gitignore file to this repo, so when we fork or clone the repository, we don't have to pull down all the data files?

*.tsv
*.tsv.gz
*.tbi

It my be better to host the data files, that result from this code somewhere else, for those users who just want the pre-canned results.

Augmented tables in output

Currently, only the b37/multi folder has augmented tables (with ExAC and GnomAD data). Shouldn't these augmented tables be in each output folder?

Missing data issue

Greetings,

I was checking over resulted tsv file and found out that at least some variants included in clinvar VCF file ( ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/ ) are not being shown there.

Can you please explain why ? For example, the following variant appears in the ClinVar VCF file but doesn't appear in the TSV file included in the codebase for this repo AND also doesn't appear in regenerated TSV file using the latest ClinVar dataset:

7 92130876 rs61750420 C T

Thanks in advance,
Peter

I would like to have the ModeOfInheritence field in the flatfile.

This tool has been very helpful! I would like to have the ModeOfInheritence field pulled into the flatfile. Is there any addendum that you can provide to the script? IS there a way I could customize the columns I want pulled into the flatfile?

b38 availability?

Hi, I am wondering if the clinvar files are available for b38 version too
Any technichal issues holding it so?

Thanks

macarthur-lab / clinvar Goto Github PK

clinvar's People

Stargazers

Watchers

Forkers

clinvar's Issues

If you're interested, the 36 variants I've flagged are provided below.

5 variants listed twice in clinvar.tsv where ref/alt are switched between the rows (e.g. chr pos ref alt):

5 variants where ref(clinvar.tsv) doesn't match hg19? (e.g. myhg19.fasta vs clinvar.tsv)

A in UCSC browser, but apparently 'T' in my fasta file

16 variants where ref(myhg19.fasta)=N but ref(clinvar.tsv)=[ACGT]

Recommend Projects

Recommend Topics

Recommend Org