
stranger's Introduction

Stranger


Annotates output files from ExpansionHunter and TRGT with the pathologic implications of the repeat sizes.

Installation

git clone https://github.com/clinical-genomics/stranger
cd stranger
pip install --editable .

Usage

Usage: stranger [OPTIONS] VCF

  Annotate str variants with str status

Options:
  -f, --repeats-file PATH         Path to a file with repeat definitions. See
                                  README for explanation  [default: $HOME/
                                  stranger/stranger/resources/vari
                                  ant_catalog_grch37.json]
  -i, --family_id TEXT
  -t, --trgt                      File was produced with TRGT
  --version
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Set the level of log output.  [default:
                                  INFO]
  --help                          Show this message and exit.

Repeat definitions

Repeats are called with ExpansionHunter, as mentioned above. ExpansionHunter annotates how many times a repeat has been seen in the BAM files of each individual and which repeat ID the variant has. Stranger then annotates the level of pathogenicity for that repeat number. The intervals that come with the package are manually collected from the literature, since there is no single source where this information can be found.

A demo repeat definitions JSON file ships with Stranger (see the default --repeats-file path above). It is based on the ExpansionHunter variant catalog, extended with a few keys relevant to disease loci, listed in the table below. It is advisable to use an up-to-date file, perhaps based on a curated public repository such as STRchive or STRipy. The ones we use in our routine pipelines can be found in our Reference-files repository and include our literature curation.

| Column/Key | Content/Value |
|---|---|
| HGNC_ID | HGNC identifier for the repeat or most associated gene. |
| HGNC_SYMBOL | HGNC symbol for the repeat or most associated gene. |
| REPID | ExpansionHunter repeat ID. |
| RU | Basic repeat unit, as seen in ExpansionHunter. Unused. |
| DisplayRU | Repeat unit, as clinicians are used to seeing it. |
| Normal_Max | (#copies) Longest repeat expected for a normal individual; higher counts are marked pre- or full-mutation. |
| Pathologic_Min | (#copies) Shortest repeat expected for pathology. This and higher is annotated as full-mutation. |
| Disease | Associated disease. |
| InheritanceMode | Mode of inheritance: "AR", "AD", "XR" etc. |
| Source | Reference literature resource type, e.g. GeneReviews or PubMed. |
| SourceId | PMID or GeneReviews book ID for references. |
| TRID | TRGT repeat ID, if not the same as REPID. |
| PathologicStruc | Array of indices for the pathogenic motif(s). |

Other fields accepted by ExpansionHunter are also encouraged.
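The way Normal_Max and Pathologic_Min translate into a status can be sketched as below. This is a minimal illustration of the rule as described in the table, not Stranger's actual code; the function name `classify_repeat` is hypothetical, and the HTT thresholds are taken from the contents table further down.

```python
def classify_repeat(copies: int, normal_max: int, pathologic_min: int) -> str:
    """Return an STR status for a repeat count (illustrative helper, not Stranger's API)."""
    # At or above the shortest pathologic length: full mutation.
    if copies >= pathologic_min:
        return "full_mutation"
    # Longer than the longest expected normal repeat, but below pathologic.
    if copies > normal_max:
        return "pre_mutation"
    return "normal"


# HTT thresholds as listed in this README: normal_max 36, pathologic_min 40.
for copies in (17, 38, 42):
    print(copies, classify_repeat(copies, normal_max=36, pathologic_min=40))
```

Note that when Pathologic_Min is exactly Normal_Max + 1 (e.g. ARX_EIEE, 16 and 17) the pre-mutation interval is empty under this reading.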

For convenience, here is a formatted table with some of the current contents.
| HGNCId | LocusId | DisplayRU | InheritanceMode | normal_max | pathologic_min | Disease | SourceDisplay | SourceId |
|---|---|---|---|---|---|---|---|---|
| 3776 | AFF2 | CCG | XR | 39 | 200 | Fraxe | GeneReviews Internet 2019-11-07 | NBK535148 |
| 644 | AR | CAG | XR | 34 | 38 | SBMA | GeneReviews Internet 2019-11-07 | NBK535148 |
| 18060 | ARX_EIEE | GCN | XR | 16 | 17 | EIEE | GeneReviews Internet 2019-11-07 | NBK535148 |
| 18060 | ARX_PRTS | GCN | XR | 12 | 20 | PRTS | GeneReviews Internet 2019-11-07 | NBK535148 |
| 3033 | ATN1 | CAG | AD | 35 | 48 | DRPLA | GeneReviews Internet 2019-11-07 | NBK535148 |
| 10549 | ATXN10 | ATTCT | AD | 32 | 800 | SCA10 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 10548 | ATXN1 | CAG | AD | 35 | 45 | SCA1 | GeneReviews Internet SCA1 2017-06-22 | NBK1184 |
| 10555 | ATXN2 | CAG | AD | 31 | 37 | SCA2 | GeneReviews Internet SCA2 2019-02-14 | NBK1275 |
| 7106 | ATXN3 | CAG | AD | 44 | 60 | MJD | GeneReviews Internet 2019-11-07 | NBK535148 |
| 10560 | ATXN7 | CAG | AD | 19 | 36 | SCA7 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 10561 | ATXN8OS | CTG | AD | 50 | 80 | SCA8 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 28337 | C9ORF72 | GGCCCC | AD | 25 | 40 | FTDALS1 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 1388 | CACNA1A | CAG | AD | 18 | 20 | SCA6 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 1541 | CBL | CCG | AD | 79 | 100 | FRAX11B | Jones et al Nature 1995 | 7603564 |
| 1541 | BEAN1 | TGGAA | AD | 10 | 40 | SCA31 | Sato et al AJHG 2009 | 7603564 |
| 13164 | CNBP | CCTG | AD | 30 | 75 | DM2 | GeneReviews Internet 2020-03-19 | NBK1466 |
| 2482 | CSTB | CCCCGCCCCGCG | AR | 3 | 30 | EPM1 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 2482 | DAB1 | ATTTC | AD | 16 | 31 | SCA37 | GeneReviews Internet 2019-05-30 | NBK541729 |
| 29284 | DIP2B | CGG | AD | 24 | 270 | FRA12A | GeneReviews Internet 2019-11-07 | NBK535148 |
| 2933 | DMPK | CTG | AD | 34 | 50 | DM1 | GeneReviews Internet 2019-10-03 | NBK1165 |
| 18683 | EIF4A3 | TCGGCAGCGGCGCAGCGAGG | AR | 9 | 10 | RCPS | GeneReviews Internet 2019-11-07 | NBK535148 |
| 3775 | FMR1 | CGG | XR | 55 | 200 | FragileX | GeneReviews Internet 2019-11-07 | NBK535148 |
| 1092 | FOXL2 | GCN | AD | 14 | 15 | BPES | GeneReviews Internet 2019-11-07 | NBK535148 |
| 3951 | FXN | GAA | AR | 35 | 51 | FRDA | GeneReviews Internet 2019-11-07 | NBK535148 |
| 4331 | GLS | GCA | AR | 20 | 90 | GDPAG | van Kuilenburg et al (2019) NEJM 380:1433-1441 | 30970188 |
| 5102 | HOXA13_I | GCN | AD | 14 | 22 | HFGS | GeneReviews Internet 2019-08-08 | NBK1423 |
| 5102 | HOXA13_II | GCN | AD | 12 | 18 | HFGS | GeneReviews Internet 2019-08-08 | NBK1423 |
| 5102 | HOXA13_III | GCN | AD | 18 | 24 | HFGS | GeneReviews Internet 2019-08-08 | NBK1423 |
| 5136 | HOXD13 | GCN | AD | 15 | 22 | SDTY5 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 4851 | HTT | CAG | AD | 36 | 40 | Huntington | GeneReviews Internet 2020-06-11 | NBK1305 |
| 14203 | JPH3 | CTG | AD | 28 | 40 | HDL2 | GeneReviews Internet 2019-06-27 | NBK1529 |
| 31708 | LRP12 | CGN | AD | 45 | 90 | OPDM1 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 1226 | GIPC1 | GGC | AD | 32 | 73 | OPDM2 | Deng et al (2020) AJHG 106(6):793-804 | 32413282 |
| 17043 | NIPA1 | GCN | AD | 8 | 10000 | ALS - susceptibility to | Tazelaar et al (2019) Neurobiol Aging 74:234.e9-234.e15 | 30342764 |
| 15911 | NOP56 | GGCCTG | AD | 14 | 650 | SCA36 | GeneReviews Internet 2014-08-07 | NBK231880 |
| 53924 | NOTCH2NLC | CGG | AD | 38 | 66 | NIID | GeneReviews Internet 2019-11-07 | NBK535148 |
| 8565 | PABPN1 | GCN | AD | 10 | 12 | OPMD | GeneReviews Internet 2014-02-20 | NBK1126 |
| 9143 | PHOX2B | GCN | AD | 20 | 25 | CCHS | GeneReviews Internet 2014-01-30 | NBK1427 |
| 9305 | PPP2R2B | CAG | AD | 32 | 51 | SCA12 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 16854 | RAPGEF2 | TTTCA | AD | 1 | 10 | FAME7 | Ishiura et al (2018) Nature Genetics 50;581-90 | 29507423 |
| 9969 | RFC1 | AARRG | AR | 11 | 12 | CANVAS | Cortese et al 2019 Nat Gen PMID: 30926972 | 30926972 |
| 31750 | SAMD12 | TTTCA | AD | 1 | 10 | FAME1 | Ishiura et al (2018) Nature Genetics 50;581-90 | 29507423 |
| 10472 | RUNX2 | GCN | AD | 17 | 20 | CCD | GeneReviews Internet 2019-11-07 | NBK535148 |
| 11199 | SOX3 | GCN | XR | 15 | 22 | MRGH | GeneReviews Internet 2019-11-07 | NBK535148 |
| 11588 | TBP | CAN | AD | 40 | 49 | SCA17 | GeneReviews Internet 2019-09-12 | NBK1438 |
| 11592 | TBX1 | GCN | AD | 15 | 25 | TOF | GeneReviews Internet 2019-11-07 | NBK535148 |
| 11634 | TCF4 | CTG | AD | 39 | 100 | FECD3 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 11969 | TNRC6A | TTTCA | AD | 1 | 10 | FAME6 | Ishiura et al (2018) Nature Genetics 50;581-90 | 29507423 |
| 15516 | XYLT1 | GGC | AR | 20 | 70 | DBQD2 | LaCroix et al (2018) AJHG 104(1):35-44 | 30554721 |
| 12873 | ZIC2 | GCN | AD | 15 | 25 | HPE5 | GeneReviews Internet 2019-11-07 | NBK535148 |
| 12874 | ZIC3 | GCN | XR | 10 | 12 | VACTERLX | GeneReviews Internet 2019-11-07 | NBK535148 |
| 9179 | POLG | CTG | - | 15 | 10000 | - | Research only. Contact CMMS, KUH, regarding findings. | CMMS |

Stranger can also read a legacy .tsv format file, structured like a Scout gene panel, with STR specific columns. The column names and keys correspond, but if in any kind of doubt, please read the code or use the json version.

By default, the file shipped with the distribution is used, but users can supply their own. Header line(s) should be preceded by a #.

It is also possible to use an ExpansionHunter variant catalog json file with corresponding keys added. E.g.

[
    {
        "VariantType": "Repeat",
        "LocusId": "ATXN2",
        "LocusStructure": "(GCT)*",
        "ReferenceRegion": "chr12:112036753-112036822",
        "Disease": "SCA2",
        "NormalMax": 31,
        "PathologicMin": 39
    },
    {
        "VariantType": "Repeat",
        "LocusId": "PABPN1",
        "LocusStructure": "(GCG)*",
        "ReferenceRegion": "chr14:23790681-23790699",
        "Disease": "OPMD",
        "NormalMax": 6,
        "PathologicMin": 9
    }
]

Such files are also provided with the distribution. PRs with updates are much appreciated.
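Before passing a custom catalog to Stranger, it can help to check that every entry carries the extra keys. The sketch below uses only the standard `json` module and the two entries from the example above; the variable names and the choice of required keys (`NormalMax`, `PathologicMin`) are assumptions for illustration, not a validation routine that Stranger itself provides.

```python
import json

# The example catalog from above, inlined so the sketch is self-contained.
catalog_text = """
[
    {"VariantType": "Repeat", "LocusId": "ATXN2", "LocusStructure": "(GCT)*",
     "ReferenceRegion": "chr12:112036753-112036822",
     "Disease": "SCA2", "NormalMax": 31, "PathologicMin": 39},
    {"VariantType": "Repeat", "LocusId": "PABPN1", "LocusStructure": "(GCG)*",
     "ReferenceRegion": "chr14:23790681-23790699",
     "Disease": "OPMD", "NormalMax": 6, "PathologicMin": 9}
]
"""

entries = json.loads(catalog_text)

# Index entries by locus for quick lookup.
by_locus = {entry["LocusId"]: entry for entry in entries}

# Assumed minimum set of added keys needed to annotate a status.
required = {"NormalMax", "PathologicMin"}
missing = [entry["LocusId"] for entry in entries if not required <= entry.keys()]

print(by_locus["ATXN2"]["NormalMax"], by_locus["ATXN2"]["PathologicMin"])
print("entries missing thresholds:", missing)
```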

Output

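Output is an annotated VCF. Stranger adds the INFO key STR_STATUS together with the normal-max and pathologic-min thresholds (these appear as STR_NORMAL_MAX and STR_PATHOLOGIC_MIN in the TRGT example further down).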

##INFO=<ID=STR_STATUS,Number=A,Type=String,Description="Repeat expansion status. Alternatives in [normal, pre_mutation, full_mutation]">
4       3076603 .       C       <STR17>,<STR18> .       PASS    END=3076660;REF=19;RL=57;RU=CAG;VARID=HTT;REPID=HTT;STR_STATUS=normal,normal
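Since STR_STATUS is declared with Number=A, it holds one value per ALT allele. A minimal sketch of pairing them up, using plain string handling on the example record above (tabs restored between columns); `parse_info` is a hypothetical helper, and a real pipeline would more likely use a VCF library.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a dict; flag entries without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# The HTT record from the example above, with tab-separated columns.
line = (
    "4\t3076603\t.\tC\t<STR17>,<STR18>\t.\tPASS\t"
    "END=3076660;REF=19;RL=57;RU=CAG;VARID=HTT;REPID=HTT;STR_STATUS=normal,normal"
)

columns = line.split("\t")
alts = columns[4].split(",")          # ALT alleles
info = parse_info(columns[7])         # INFO column
statuses = info["STR_STATUS"].split(",")

# Number=A: the i-th status belongs to the i-th ALT allele.
status_per_alt = dict(zip(alts, statuses))
print(status_per_alt)
```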

TRGT mode

The --trgt flag instructs Stranger to accept TRGT-style VCFs. In particular, motif copy numbers are parsed from the MC FORMAT field (GT:...:MC), using the motifs listed in the PathologicStruc entry of the reference catalog:

    {
        "LocusId": "RAPGEF2",
        "HGNCId": 16854,
        "InheritanceMode": "AD",
        "DisplayRU": "TTTCA",
        "SourceDisplay": "Ishiura et al (2018) Nature Genetics 50 581-90",
        "Source": "PubMed",
        "SourceId": "29507423",
        "LocusStructure": "(TTTTA)*(TTTCA)*(TTTTA)*",
        "ReferenceRegion": [
            "4:160263679-160263703",
            "4:160263704-160263708",
            "4:160263709-160263768"
        ],
        "VariantType": [
            "Repeat",
            "Repeat",
            "Repeat"
        ],
        "VariantId": [
            "RAPGEF2_TTTTA_5P",
            "RAPGEF2",
            "RAPGEF2_TTTTA_3P"
        ],
        "PathologicRegion": "4:160263704-160263708",
        "PathologicStruc": [1],
        "Disease": "FAME7",
        "NormalMax": 1,
        "PathologicMin": 10
    },

and

...
##FORMAT=<ID=MC,Number=.,Type=String,Description="Motif counts per allele">
...
4	160263680	.	TTTATTTTATTTTATTTTATTTTATATTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATT	.	0	.	TRID=FAME7_RAPGEF2;END=160263770;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n(TTTTA)n;STR_STATUS=full_mutation;STR_NORMAL_MAX=1;STR_PATHOLOGIC_MIN=10;RankScore=internal_id_3:30;HGNCId=16854;InheritanceMode=AD;DisplayRU=TTTCA;SourceDisplay=Ishiura et al (2018) Nature Genetics 50 581-90;Source=PubMed;SourceId=29507423;Disease=FAME7	GT:AL:ALLR:SD:MC:MS:AP:AM	0/0:91,91:85-98,91-91:21,20:18_0,18_0:0(0-89)_1(89-89)_0(89-89),0(0-89)_1(89-89)_0(89-89):0.956044,0.956044:.,.	0/0:91,91:85-98,91-91:21,20:18_0,18_0:0(0-89)_1(89-89)_0(89-89),0(0-89)_1(89-89)_0(89-89):0.956044,0.956044:.,.	0/0:91,91:85-98,91-91:21,20:18_0,18_0:0(0-89)_1(89-89)_0(89-89),0(0-89)_1(89-89)_0(89-89):0.956044,0.956044:.,.
...
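One reading of how MC and PathologicStruc fit together is sketched below: each allele's MC value is a list of per-motif counts joined by `_`, and PathologicStruc gives the indices of the counts that matter. This is an assumption based on the example above, not a copy of Stranger's implementation; both function names are hypothetical, and missing alleles (`.`) are returned as `None` rather than converted to a number.

```python
from typing import List, Optional


def pathologic_copies(mc_allele: str, pathologic_struc: List[int]) -> Optional[int]:
    """Sum the motif counts at the pathologic indices for one allele."""
    if mc_allele == ".":
        # Missing call for this allele; nothing to count.
        return None
    counts = [int(count) for count in mc_allele.split("_")]
    return sum(counts[index] for index in pathologic_struc)


def sample_pathologic_counts(mc_field: str, pathologic_struc: List[int]) -> List[Optional[int]]:
    """Apply pathologic_copies to every comma-separated allele in an MC value."""
    return [pathologic_copies(allele, pathologic_struc) for allele in mc_field.split(",")]


# From the RAPGEF2 record above: MOTIFS=TTTTA,TTTCA, MC=18_0,18_0, PathologicStruc=[1].
print(sample_pathologic_counts("18_0,18_0", [1]))
```

With index 1 pointing at TTTCA, both alleles in the example carry zero copies of the pathogenic motif under this reading.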

stranger's People

Contributors

dnil, fellen31, geocarvalho, lint-action


stranger's Issues

Add refs field to json file

We should add a list that points to references that support the repeat and/or the levels of pathogenicity. References could be URLs to, for example, OMIM or PubMed. What do you think of that @dnil? This would make the whole thing more transparent and trackable. I'll be happy to help out with this of course.

Warnings thrown

hi,

This is an issue I just realized we've had for a while. When running Stranger on ExpansionHunter output (3.0 or 4.0 makes no difference) we get the warnings posted below. We use Stranger 0.8 and the latest hg38 catalogue from the repo. What I don't really understand is that, for instance, ExpansionHunter makes two separate calls for the Huntington's locus (LocusId HTT), with two different repeat units, but Stranger only annotates one of them; the other one throws the warning: WARNING No info for repeat id HTT_CCG. Is there something I am missing?

2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id DAB1_ATTTT1
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id DAB1_ATTTC
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id DAB1_ATTTT2
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id ATXN7_GCC
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id CNBP_CAGA
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id CNBP_CA
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id HTT_CCG
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id RAPGEF2_TTTTA_5P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id RAPGEF2_TTTTA_3P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id SAMD12_TTTTA_3P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id SAMD12_TTTTA_5P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id FXN_A
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id ATXN8OS_CTA
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id POLG_I
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id POLG_II
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id TNRC6A_TTTTA_5P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id TNRC6A_TTTTA_3P
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id EIF4A3_CACA20
2021-06-09 10:30:46 rs-n1 stranger.utils[65859] WARNING No info for repeat id NOP56_CGCCTG

Review HGNCId - LocusId associations

Somehow mistakes have been made for the associations between gene names (LocusId) and HGNCid. We discovered this in Lund when a gene panel contained only the gene CBL but the STR module displayed results for both CBL and BEAN1. At a quick review I notice that BEAN1 and CBL both have assigned HGNCId 1541 (correct for CBL but not for BEAN1). Further down in the stranger list CSTB and DAB1 both have 2482 (correct for CSTB but not for DAB1). It's possible that there are more discrepancies, I haven't done a systematic check, so please review the entire catalog.
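A systematic check along these lines could be sketched as follows. It assumes catalog entries with `HGNCId` and `LocusId` keys and treats the part of LocusId before the first underscore as the gene symbol, so that ARX_EIEE and ARX_PRTS legitimately share an ID; both the function name and that prefix rule are assumptions. The rows are taken from the contents table in the README.

```python
from collections import defaultdict


def shared_hgnc_ids(entries):
    """Return {HGNCId: sorted gene symbols} for IDs used by more than one gene."""
    genes_by_id = defaultdict(set)
    for entry in entries:
        # Assumption: "ARX_EIEE" and "ARX_PRTS" both belong to gene "ARX".
        gene = entry["LocusId"].split("_")[0]
        genes_by_id[entry["HGNCId"]].add(gene)
    return {
        hgnc_id: sorted(genes)
        for hgnc_id, genes in genes_by_id.items()
        if len(genes) > 1
    }


entries = [
    {"HGNCId": 1541, "LocusId": "CBL"},
    {"HGNCId": 1541, "LocusId": "BEAN1"},
    {"HGNCId": 2482, "LocusId": "CSTB"},
    {"HGNCId": 2482, "LocusId": "DAB1"},
    {"HGNCId": 18060, "LocusId": "ARX_EIEE"},
    {"HGNCId": 18060, "LocusId": "ARX_PRTS"},
]

print(shared_hgnc_ids(entries))
```

This only finds IDs that collide; an ID that is wrong but unique would still need a lookup against HGNC.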

Rank model for stranger

As a downstream user, loading samples into Scout, I would like to see Stranger produce a simple ranking, where pathogenic expansions in known disease loci appear on top, pre-mutations next perhaps with a slight bonus to variants near the pathogenic limit, followed by normal variants and last additional loci of potential interest.
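One possible shape for such a ranking is a sort key, sketched below. Everything here is hypothetical: the dict fields (`locus`, `status`, `copies`, `pathologic_min`), the tier order, and the use of distance to the pathologic limit as a tie-breaker for pre-mutations are illustrative choices, not an existing Stranger feature.

```python
# Lower tier sorts first; loci without a known status go last.
STATUS_TIER = {"full_mutation": 0, "pre_mutation": 1, "normal": 2}


def rank_key(variant: dict) -> tuple:
    """Sort key: status tier first, then closeness to the pathologic limit."""
    tier = STATUS_TIER.get(variant.get("status"), 3)
    if tier == 1:
        # Pre-mutations nearer the pathologic limit rank higher.
        distance = variant["pathologic_min"] - variant["copies"]
    else:
        distance = 0
    return (tier, distance)


variants = [
    {"locus": "LOCUS_A", "status": "normal", "copies": 17, "pathologic_min": 40},
    {"locus": "LOCUS_B", "status": "pre_mutation", "copies": 37, "pathologic_min": 40},
    {"locus": "LOCUS_C", "status": "full_mutation", "copies": 45, "pathologic_min": 40},
    {"locus": "LOCUS_D", "status": "pre_mutation", "copies": 39, "pathologic_min": 40},
    {"locus": "LOCUS_E", "status": None, "copies": 12, "pathologic_min": None},
]

ranked = [variant["locus"] for variant in sorted(variants, key=rank_key)]
print(ranked)
```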

Matching inheritance pattern annotation

Would be nifty to tag inheritance models followed, so that they could be matched to disorder. Compare genmod models, but aware of expansion status (full_mutation / intermediate) rather than just existence of two variants.

Rename MASTER to MAIN

Is your feature request related to a problem? Please describe.
I'm always frustrated when woke test fails on the branch name.

Describe the solution you'd like
Rename master branch to main.

decompose TRGT VCF

TRGT VCFs have biallelic sites, and important per-allele info in the FORMAT tags, e.g. AL and MC. We need to decompose them.

Tested bcftools norm and vt decompose -s but they can't be bothered with non-standard FORMAT fields, only decomposing FORMAT.DP and the like.

Variant catalog update hg38

Hi!

I've noticed that after updating the demo variant catalog for hg38, many reference region coordinates and/or locus structures have changed. Interestingly, some of these coordinates also differ from those in the newest version of the original ExpansionHunter variant catalog. Some of the affected loci IDs include ATN1, ATXN1, ATXN3, PHOX2B, TBP, FOXL2, HOXD13, RUNX2, ZIC2, ZIC3. Is this intentional?

Invalid literal for int() with base 10: '.'

I thought I had solved this in #59, but I think when I changed line 260 from continue to repeat_res.extend([0]) the error persisted...

stranger/stranger/utils.py

Lines 259 to 260 in 06ab59e

if allele == ".":
    repeat_res.extend([0])

Could you have a look? HG002_Revio.vcf.gz

2024-05-31 14:40:51 stuart stranger.cli[41] INFO Running stranger version 0.9.0
  2024-05-31 14:40:51 stuart stranger.cli[41] INFO Parsing repeats file variant_catalog_grch38.json
  2024-05-31 14:40:51 stuart stranger.cli[41] INFO Vcf is zipped
  Traceback (most recent call last):
    File "/usr/local/bin/stranger", line 10, in <module>
      sys.exit(base_command())
               ^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
      return self.main(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1078, in main
      rv = self.invoke(ctx)
           ^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
      return ctx.invoke(self.callback, **ctx.params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/click/core.py", line 783, in invoke
      return __callback(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
      return f(get_current_context(), *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/stranger/cli.py", line 155, in cli
      repeat_data = get_repeat_info(variant_info, repeat_information)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/stranger/utils.py", line 209, in get_repeat_info
      repeat_res = get_trgt_repeat_res(variant_info, repeat_info)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/stranger/utils.py", line 269, in get_trgt_repeat_res
      pathologic_counts = int(allele)
                          ^^^^^^^^^^^
  ValueError: invalid literal for int() with base 10: '.'

Update docs

E.g. the README is lagging a release or three behind.

different attributes in GRCh37 vs. GRCh38 catalogs

A few more details that might need some clarification:

These 4 loci have different NormalMax values in 37 vs. 38 (and ATN1 also has different PathologicMin):

ATN1 35 34
DMPK 34 37
FMR1 55 65
TBP 40 44

These loci have differing LocusStructure (even though DisplayRU is the same):

TBP (CAN)* (GCA)*
NIPA1 (NGC)* (GCG)*

TBP locus GRCh38 coordinates match the EHv4 catalog coordinates (https://github.com/Illumina/ExpansionHunter/blob/master/variant_catalog/hg38/variant_catalog.json#L499)
but the GRCh37 coordinates don't match the EHv4 catalog (https://github.com/Illumina/ExpansionHunter/blob/master/variant_catalog/hg19/variant_catalog.json#L260)

The situation is the same for NIPA1.
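Differences like these can be listed mechanically. A minimal sketch, assuming both catalogs are lists of dicts keyed by `LocusId`; the function name and the default set of compared keys are assumptions, and the sample values are only those quoted above (first value GRCh37, second GRCh38).

```python
def diff_catalogs(cat_a, cat_b, keys=("NormalMax", "PathologicMin", "LocusStructure")):
    """Return {LocusId: {key: (value_a, value_b)}} for loci present in both catalogs."""
    b_by_locus = {entry["LocusId"]: entry for entry in cat_b}
    diffs = {}
    for entry in cat_a:
        other = b_by_locus.get(entry["LocusId"])
        if other is None:
            continue  # locus only present in the first catalog
        changed = {
            key: (entry.get(key), other.get(key))
            for key in keys
            if entry.get(key) != other.get(key)
        }
        if changed:
            diffs[entry["LocusId"]] = changed
    return diffs


# Only the attributes quoted in this issue are filled in.
grch37 = [
    {"LocusId": "DMPK", "NormalMax": 34},
    {"LocusId": "TBP", "NormalMax": 40, "LocusStructure": "(CAN)*"},
]
grch38 = [
    {"LocusId": "DMPK", "NormalMax": 37},
    {"LocusId": "TBP", "NormalMax": 44, "LocusStructure": "(GCA)*"},
]

for locus, changed in diff_catalogs(grch37, grch38).items():
    print(locus, changed)
```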

Ensure applicable HGNC-ids / symbols

Make sure Stranger and reference files use HGNC-ids that accurately reflect or at least can be tied to corresponding diseases (especially for filtering in Scout).

Add rarity statistic from normal population

It is straightforward to produce normal population calls for each locus. These should be used to show the rarity of the present expansion size, e.g. as a quantile value for the less selected population.
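As a rough sketch of the quantile idea: given repeat sizes called at one locus in a reference population, report the fraction of those calls that are no larger than the observed size. The function name is hypothetical and the population values below are made up purely for illustration.

```python
from bisect import bisect_right


def population_quantile(observed: int, population_sizes) -> float:
    """Fraction of population calls with a repeat size <= the observed size."""
    ordered = sorted(population_sizes)
    if not ordered:
        raise ValueError("population_sizes must not be empty")
    # bisect_right counts how many sorted values are <= observed.
    return bisect_right(ordered, observed) / len(ordered)


# Made-up repeat sizes for one locus in a reference population.
population = [15, 17, 17, 18, 19, 20, 21, 22, 24, 27]

print(population_quantile(24, population))
print(population_quantile(42, population))
```

A value close to 1.0 means the observed size is at or beyond the upper tail of the population.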

Annotate multi-sample VCF

Would it be possible to annotate a multi-sample VCF?

Or if not, give an error if a multi-sample VCF is supplied 🙂
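Detecting a multi-sample VCF up front is simple, since sample names are the columns after FORMAT on the #CHROM header line. A minimal sketch with a hypothetical helper and made-up sample names; how Stranger should react is left open.

```python
def sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # Nine fixed columns (CHROM..FORMAT) precede the samples.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tfather\tmother\n",
]

samples = sample_names(header)
if len(samples) > 1:
    print(f"multi-sample VCF with {len(samples)} samples: {', '.join(samples)}")
```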

Variant catalog and DAB1

Hi,
in the latest version of the variant catalog there are three different variantIds for DAB1; DAB1_ATTTT1, DAB1_ATTTC and DAB1_ATTTT2. DAB1_ATTTC is the clinically significant one but this is not recognised by scout as the status field is left blank. For other variants with more than one variantId there seems to be the rule to give the clinically significant variant the same name as given for the LocusId. For instance, ATXN7 has two variantId; ATXN7 and ATXN7_GCC. When uploaded to scout they all seem to get a status annotation for the most significant variant in multiple variantId loci.

There might be a reason for the naming of the DAB1 variantIds in the catalog, but if not, it's probably an honest mistake?

Kind Regards,
Paul
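Loci affected by this naming pattern can be found by looking for catalog entries where no VariantId equals the LocusId. The sketch below assumes that entries without a VariantId key use the LocusId as their only variant ID; the function name is hypothetical, and the two entries are reduced to the keys quoted in the README and in this issue.

```python
def loci_without_primary_variant(entries):
    """List LocusIds for which no VariantId is identical to the LocusId."""
    flagged = []
    for entry in entries:
        # Assumption: a missing VariantId means the LocusId doubles as variant ID.
        variant_ids = entry.get("VariantId", entry["LocusId"])
        if isinstance(variant_ids, str):
            variant_ids = [variant_ids]
        if entry["LocusId"] not in variant_ids:
            flagged.append(entry["LocusId"])
    return flagged


entries = [
    {"LocusId": "RAPGEF2",
     "VariantId": ["RAPGEF2_TTTTA_5P", "RAPGEF2", "RAPGEF2_TTTTA_3P"]},
    {"LocusId": "DAB1",
     "VariantId": ["DAB1_ATTTT1", "DAB1_ATTTC", "DAB1_ATTTT2"]},
    {"LocusId": "HTT"},
]

print(loci_without_primary_variant(entries))
```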
