Git Product home page Git Product logo

neoepiscope's Introduction

neoepiscope Build Status lifecycle DOI

neoepiscope is peer-reviewed open-source software for predicting neoepitopes from DNA sequencing (DNA-seq) data. Where most neoepitope prediction software confines attention to neoepitopes arising from at most one somatic mutation, often just an SNV, neoepiscope uses assembled haplotype output of HapCUT2 to also enumerate neoepitopes arising from more than one somatic mutation. neoepiscope also takes into account frameshifting from indels and permits personalizing the reference transcriptome using germline variants.

Read our paper in Bioinformatics

Note

neoepiscope v0.2.x has a critical bug where homozygous variants are not phased with heterozygous variants. Please update to the most recent version.

License

neoepiscope is licensed under the MIT license. See LICENSE for more details.

Portions of neoepiscope---specifically, segments of code in transcript.py, bowtie_index.py, and download.py---are taken from Rail-RNA, which is copyright (c) 2015 Abhinav Nellore, Leonardo Collado-Torres, Andrew Jaffe, James Morton, Jacob Pritt, José Alquicira-Hernández, Christopher Wilks, Jeffrey T. Leek, and Ben Langmead and licensed under the MIT License.

Support

Gitter chat or email [email protected].

Installing neoepiscope

neoepiscope is compatible with Python 3.6 and higher. To install, run

pip install neoepiscope

Note: if this fails on macOS 10 (Catalina) or newer, the required pysam installation may be unable to find the C compiler. To solve this, you can try either 1) running xcode-select --install or 2) installing pysam via conda (e.g. conda install -c bioconda pysam) before trying pip install neoepiscope again.

To download compatible reference annotation files (hg19, GRCh38, and/or mouse mm9) and link installations of relevant optional softwares to neoepiscope (e.g. netMHCpan), you will need to use our download functionality. Run the command:

neoepiscope download

and respond to the prompts as relevant for your needs.

To make sure that the software is running properly, clone this repository, and from within it run:

python setup.py test

Using neoepiscope

Preparing reference files (for those using references other than human hg19 or GRCh38 or mouse mm9)

If you aren't using human hg19 or GRCh38 or mouse mm9 reference builds from our download functionality, you will need to download and prepare your own annotation files. Before calling any neoepitopes, run neoepiscope in index mode to prepare dictionaries of transcript data used in neoepitope prediction:

neoepiscope index -g <GTF> -d <DIRECTORY TO HOLD PICKLED DICTIONARIES>

Options:

-g, --gtf path to GTF file

-d, --dicts path to write pickled dictionaries

Ensure proper ordering of VCF

To call neoepitopes from somatic mutations, ensure that the column with data for the tumor sample in your VCF file precedes the column with data from a matched normal sample. If it does not, run neoepiscope in swap mode to produce a new VCF:

neoepiscope swap -i <INPUT VCF> -o <SWAPPED VCF>

Options:

-i, --input path to input VCF

-o, --output path to swapped VCF

Add germline variation (optional)

If you would like to include germline variation in your neoepitope prediction, merge your somatic and germline VCFs for a sample prior to phasing variants:

neoepiscope merge -g <GERMLINE VCF> -s <SOMATIC VCF> -o <MERGED VCF>

Options:

-g, --germline path to germline VCF

-s, --somatic path to somatic VCF

-o, --output path to write merged VCF

-t, --tumor-id tumor ID (matching sample in tumor BAM file's read group field)

If you plan to use GATK's ReadBackedPhasing for haplotype phasing (see below), make sure to specify a tumor ID using the -t flag. It should match the sample name in the header of your tumor BAM file (the SM value in the read group field).

Predict haplotype phasing

Next, run HapCUT2 with your merged or somatic VCF and your tumor BAM file (make sure to use --indels 1 when running extractHAIRS if you wish to predict neoepitopes resulting from insertions and deletions). Before calling neoepitopes, prep your HapCUT2 output to included unphased mutations as their own haplotypes and flag germline variants if relevant:

neoepiscope prep -v <VCF> -c <HAPCUT2 OUTPUT> -o <ADJUSTED HAPCUT OUTPUT>

Options:

-v, --vcf path to VCF file used to generate HapCUT2 output

-c, --hapcut2-output path to original HapCUT2 output

-o, --output path to output file

-p, --phased flag input VCF as phased with GATK ReadBackedPhasing

Alternatively, you may perform phasing using GATK's ReadBackedPhasing on your merged or somatic VCF. If you phased variants with GATK instead of HapCUT2, make sure to use the -p flag when running neoepiscope prep to format your output:

neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT> -p

You may also predict neoepitopes without phasing by preparing your merged or somatic VCF:

neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT>

Neoepitope prediction

Finally, call neoepitopes:

neoepiscope call -b <GENOME BUILD> -c <PREPPED HAPCUT2 OUTPUT> [options]

Options:

-x, --bowtie-index path to bowtie index of reference genome

-d, --dicts path to directory containing pickled dictionaries generated in index mode

-b, --build which genome build to use (human hg19 or GRCh38 or mouse mm9; overrides -x and -d options)

-c, --merged-hapcut2-output path to HapCUT2 output adjusted by neoepiscope prep

-v, --vcf path to VCF file used to generate HapCUT2 output

-o, --output path to output file

-f, --fasta output additional fasta file output

-k, --kmer-size kmer size for neoepitope prediction (default 8-11 amino acids)

-p, --affinity-predictor software to use for MHC binding predictions (default MHCflurry v1 with rank and affinity scores)

-a, --alleles alleles to use for MHC binding predictions

-n, --no-affinity do not run binding affinity predictions, overrides the -p and -a options

-g, --germline how to handle germline mutations (by default includes as background variation)

-s, --somatic how to handle somatic mutations (by default includes for neoepitope enumeration)

-e, --rna-edits path to directory containing REDIportal-formatted RNA edits file

-u, --upstream-atgs handling of translation from upstream start codons - ("novel" (default) only, "all", "none", "reference" only)

-i, --isolate isolate mutations - disables phasing of mutations which share a haplotype

--nmd enumerate neoepitopes from nonsense mediated decay transcripts

--pp enumerate neoepitopes from polymorphic pseudogene transcripts

--igv enumerate neoepitopes from IG V transcripts

--trv enumerate neoepitopes from TR V transcripts

--allow-nonstart enumerate neoepitopes from transcripts without annotated start codons

--allow-nonstop enumerate neoepitopes from transcripts without annotated stop codons

--rna-bam path to paired end RNA-seq alignment file

--transcript-counts path to file containing per-transcript read counts

--tpm-threshold minimum transcript TPM required to retain neoepitope

Using the --build option requires use of our download functionality to procure and index the required reference files for human hg19, human GRCh38, and/or mouse mm9. If using an alternate genome build, you will need to download your own bowtie index and GTF files for that build and use the neoepiscope index mode to prepare them for use with the --dicts and --bowtie-index options.

Haplotype information should be included using -c /path/to/haplotype/file. This in the form of HapCUT2 output, generated either from your somatic VCF or a merged germline/somatic VCF made with our neoepiscope merge functionality. The HapCUT2 output should be adjusted using our neoepiscope prep functionality to ensure that mutations that lack phasing data are still included in analysis.

If you wish to extract variant allele frequency information from your somatic VCF to be output with relevant epitopes, include the path to the somatic VCF you used to create your merged VCF using -v /path/to/VCF.

To specify the output file, use -o /path/to/output_file. If no output file is specified, the output will be written to standard out. By default, only data on neoepitopes is output in the file. By using the --fasta option, an additional file, /path/to/output_file.fasta, will be made. This is a FASTA file specifying the full-protein sequences from each mutation-affected transcript. The header in the FASTA will give the name of the transcript from which the protein originated, followed by "v[#]" for every version of the transcript. This option is only available when writing output to a file, not standard out.

The default kmer size for neoepitope enumeration is 8-11 amino acids, but a custom range can be specified using the --kmer-size argument with the minimum and maximum epitope size separated by commas (e.g. --kmer-size 8,20 to get epitopes ranging from 8 to 20 amino acids in length).

For affinity prediction, neoepiscope currently supports predictions from MHCflurry v1, MHCnuggets v2, netMHC v4, netMHCpan v3 or v4, netMHCIIpan v3, netMHCII v2, PickPocket v1, netMHCstabpan v1, and PSSMHCpan v1. When installing our software with pip, MHCflurry and MHCnuggets are automatically installed or updated. Optional integration of netMHC, netMHCpan, netMHCIIpan, netMHCII, PickPocket, netMHCstabpan, or PSSMHCpan must be done from your own installation of these softwares using our download functionality (see "Installing neoepiscope" above). Note that gawk may be required for the use of these additional tools. Please note that MHCflurry and MHCnuggets require the use of TensorFlow, which was limited compatibility with python v3.7. If you would like to use these tools, please use python v3.6 or lower to run neoepiscope.

The default affinity prediction software for neoepiscope is MHCflurry v1. To specify a custom suite of binding prediction softwares, use the -p argument for each software followed by its name, version, and desired scoring output(s) (e.g. -p mhcflurry 1 affinity,rank -p mhcnuggets 2 affinity). To forgo binding affinity predictions, use the --no-affinity command line option.

Germline and somatic mutations can be handled in a variety of ways. They can be excluded entirely (e.g. --germline exclude), included as background variation to personalize the reference transcriptome (e.g. --germline background), or included as variants from which to enumerate neoepitopes (e.g. --somatic include). The default value for --germline is background, and the default value for --somatic is include.

The choice of start codon for a transcript can also be handled with flexibility. By default, the value for the --upstream-atgs argument is none, which specifies preferential use of the reference start codon for a transcript, or alternatively the nearest ATG downstream of it in the case of a disrupted reference start codon. Alternatively, the use of --upstream-atgs novel allows for the use of a novel ATG upstream of the reference start codon in the case of a disrupted start codon. A less conservative --upstream-atgs all uses the most upstream ATG, regardless of its novelty. For a conservative option, --upstream-atgs reference requires use of only the reference start codon, preventing enumeration of neoepitopes from a transcript if the reference start codon is disrupted.

By default, neoepiscope only enumerates neoepitopes from protein coding transcripts with annotated start and stop codons. However, by specifying the --nmd, --pp, --igv, and/or --trv flags, you can additionally enumerate neoepitopes from nonsense mediated decay transcripts, polymorphic pseudogene transcripts, immunoglobulin variable transcripts, and/or T cell receptor variable transcripts, respectively. For further flexibility, you can add the --allow-nonstart and/or --allow-nonstop to enumerate neoepitopes from transcripts without annotated start and/or stop codons, respectively.

Two options exist for quantifying expression of neoepitopes: 1) providing transcript read counts to calculate transcript-level expression in TPM or 2) providing an RNA alignment to calculate direct read-level support of the source mutation. Both options may be used simultaneously. To calculate transcript-level expression, use the --transcript-counts option and provide the path to a tab-seperated file with transcript identifiers in the first column and read counts in the second column (e.g. the output from HTseq's htseq-count program). This will provide the TPM value(s) for the transcript(s) a neoepitope is associated with. To additionally filter out neoepitopes from poorly expressed transcripts, you can use the --tpm-thresholdoption to set a minimum TPM requirement. To calculate mutation-level expression, you can provide a paired-end RNA-seq BAM alignment file. This will provide the number of reads supporting the mutation, the number of reads covering the position of the mutation, and the percent of reads covering the mutation which support that mutation. NOTE: mutation-level expression requires samtools to be installed and in your PATH.

Neoepitope calling output

neoepiscope output is a TSV file, either written to standard out by default, or the file named with the --output option. The 1st column lists the neoepitope sequence. The 2nd column lists the chromosome on which the source mutation occurs, and the 3rd column lists the position of the mutation on that chromosome. The 4th column lists the reference nucleotide sequence at that position (* for insertions), and the 5th column lists the alternative nucleotide sequence at that position (* for deletions). The 6th column lists the type of variant - V for SNVs/MNVs, I for insertions, and D for deletions. The 7th column lists the VAF for that mutation (if available), and the 8th column lists the paired normal epitope for neoepitopes resulting from SNVs/MNVs. The 9th column lists any warnings associated with the neoepitope or its transcript(s) of origin (e.g. if the reference start codon was disrupted and an alternative start codon was used), the 10th column lists the Ensembl identifier(s) of the transcript(s) of origin for the neoepitope, and the 11th column lists the transcript type(s) of the transcript(s) of origin. The 12th column lists the Ensembl identifier(s) of any genes associated with the transcript(s) of origin, and the 13th column lists the gene name(s). The 14th column lists the TPM(s) expression levels for the transcript(s) associated with that epitope. The 15th column lists the number of RNA-seq reads supporting the source mutation The 16th column lists the number of RNA-seq reads covering the position of the source mutation. The 17th column lists the percentage of reads covering the position of the source mutation which support that mutation. The 18th column lists the IEDB identifier(s) associated with the epitope if it is a known sequence, with any relevant peptide modifications listed. If any MHC binding predictions were run for neoepitopes, the following columns list the binding affinities of the neoepitope for that HLA allele/binding prediction tool combination as labeled (e.g. mhcnuggets_HLA-A*02:01_affinity represents the binding affinity in nM of that neoepitope for the allele HLA-A*02:01 as predicted by MHCnuggets).

If the --fasta option was specified, a fasta file will also be written to the file specified with the --output option, with the additional extension .fasta. Sequence names will be transcript identifiers followed by _vX, where X is a version number. Sequences are the amino acid sequences derived from translation of that transcript.

neoepiscope's People

Contributors

1pmayur avatar boeinco avatar ericjiyun03 avatar juliannedavid avatar maryawood avatar mihirparalkar avatar nellore avatar reidt03 avatar thijsmaas avatar wm-schreyer avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

neoepiscope's Issues

fail to include germline variant information

Hi, I try to generate noepeptide using this pipeline recently but encounter some difficulties.

firstly, I successfully run this pipeline with both somatic and germline variant which called by varscan2.

The command is showed below:

# merge varscan snp and indel variant, and I am not sure whether it is the correct to merge indel variant
cat ../../varscan_result/varscan_file_somatic_fetch_exon/001.vcf.*.Somatic.hc  |  grep -v ^\#  > 001.vcf.snp_indel.Somatic.hc
cat ../../varscan_result/varscan_file_Germline/001T.*.vcf > 001.vcf.snp_indel.Germline.hc
neoepiscope swap -i 001.vcf.snp_indel.Somatic.hc -o 001.vcf.snp_indel.Somatic.hc.sw
neoepiscope merge -g 001.vcf.snp_indel.Germline.hc  -s 001.vcf.snp_indel.Somatic.hc.sw  -o 001.merge.vcf


#hapcut2 need to sort variant
cat 001.merge.vcf   | sort -k1,1V -k2,2n  > 001.merge.sorted.vcf
mv 001.merge.sorted.vcf 001.merge.vcf

# phasing variant, and I am not sure that the difference between illumina read and 10X genomic, but I can only run correctly with the latter.
/home/alex2/git_file/HapCUT2/build/extractHAIRS  --bam ../../GATK_Recalibrator/001T.recal.bam  --VCF 001.merge.vcf  --out  001.merge.vcf.unlinked --10X 1 --indels 1
python3.6 /home/alex2/git_file/HapCUT2/utilities/LinkFragments.py  --bam ../../GATK_Recalibrator/001T.recal.bam   --VCF 001.merge.vcf --fragments 001.merge.vcf.unlinked  --out 001.merge.vcf.linked
/home/alex2/git_file/HapCUT2/build/HAPCUT2 --nf 1 --fragments 001.merge.vcf.linked  --VCF 001.merge.vcf  --output 001.merge.vcf.hp
# can not include germline info
neoepiscope prep -v 001.merge.vcf -c 001.merge.vcf.hp -o 001.merge.vcf.adhp
neoepiscope call -b hg19 -c 001.merge.vcf.adhp  -o 001.merge.vcf.out  -p netMHCpan 4 affinity -a HLA-A*24:02,HLA-A*24:02,HLA-B*54:01,HLA-B*40:02,HLA-C*03:04,HLA-C*01:02

However, I failed to run with somatic and germline variant which called by GATK pipeline. It shows some error in "neoepiscope call " step.

Traceback (most recent call last):
  File "/home/alex2/anaconda3/envs/exome-seq/bin/neoepiscope", line 10, in <module>
    sys.exit(main())
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/__init__.py", line 765, in main
    protein_fasta=args.fasta,
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 3149, in get_peptides_from_transcripts
    return_protein=True,
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 2360, in neopeptides
    protein = seq_to_peptide(sequence[start_codon[0] :], reverse_strand=False)
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 265, in seq_to_peptide
    codon = _codon_table[seq[i : i + 3]]
KeyError: 'G*A'

It seems that some wear character "*" appear in sequence, but I have no idea to solve this.
I am willing to share my .bam file and vcf to you.

data link:
https://drive.google.com/drive/folders/1O6PdYwImV0fEXDHPOemixV6cbVPhxatL?usp=sharing

Prioritizing neoepitopes

Hi,

After getting the neoepitopes, I thought of a way to prioritize by using the availability of the allele and the binding score of the peptide with that allele as a factor. As you mentioned, you have ranked the alleles based on their normalized presence in the population, what tools did you use to calculate the allele frequence population wise, and also,can that data be made available?

incomplete warning message

After installing with pip3 and testing with the current repo, I tried swapping the columns (ours is tumor then normal) of a Varscan output file, and received the following:

$ neoepiscope swap -i varscan_filtered.vcf -o varscan_filtered.swapped.vcf
/home/ubuntu/.local/lib/python3.5/site-packages/neoepiscope/file_processing.py:94: Warning: Reading NORMAL as normal tissue and TUMOR as tumor tissue
  Warning,

... and that's it. Now, it actually swapped the sample columns ... but then why the warning?

Integrating HapCUT2

While trying to run extractHAIRS and HAPCUT2 for getting haplotype phasing outputs, I cloned the HAPCUT2 and got the following error while making it from source.

git submodule init
Submodule 'submodules/htslib' (https://github.com/samtools/htslib) registered for path 'submodules/htslib'
Submodule 'submodules/samtools' (https://github.com/samtools/samtools) registered for path 'submodules/samtools'
git submodule update
Cloning into 'submodules/htslib'...
remote: Enumerating objects: 51, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 10677 (delta 25), reused 32 (delta 21), pack-reused 10626
Receiving objects: 100% (10677/10677), 8.10 MiB | 0 bytes/s, done.
Resolving deltas: 100% (7452/7452), done.
Checking connectivity... done.
Submodule path 'submodules/htslib': checked out '26229a368e3045a827f04fc03f3879e25f110ba3'
Cloning into 'submodules/samtools'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 9954 (delta 10), reused 10 (delta 5), pack-reused 9928
Receiving objects: 100% (9954/9954), 11.44 MiB | 0 bytes/s, done.
Resolving deltas: 100% (6468/6468), done.
Checking connectivity... done.
Submodule path 'submodules/samtools': checked out '255f97daffd3a27d7b063e991a993f47ad6186cf'
echo "Building Samtools bam library..."
Building Samtools bam library...
make -C submodules/samtools lib
make[1]: Entering directory '/home/jupyter/notebooks/HapCUT2/submodules/samtools'
gcc -g -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_CURSES_LIB=1 -I. -I../htslib -c -o bam_aux.o bam_aux.c
In file included from bam.h:48:0,
                 from bam_aux.c:27:
../htslib/htslib/bgzf.h:34:18: fatal error: zlib.h: No such file or directory
compilation terminated.
Makefile:116: recipe for target 'bam_aux.o' failed
make[1]: *** [bam_aux.o] Error 1
make[1]: Leaving directory '/home/jupyter/notebooks/HapCUT2/submodules/samtools'
Makefile:32: recipe for target 'submodules/samtools/libbam.a' failed
make: *** [submodules/samtools/libbam.a] Error 2
root@92c193bba70f:~/notebooks/HapCUT2# make install-hairs
cp build/extractHAIRS /usr/local/bin
cp: cannot stat 'build/extractHAIRS': No such file or directory
Makefile:95: recipe for target 'install-hairs' failed
make: *** [install-hairs] Error 1

How can I integrate HAPCUT with neoepiscope without running into these errors?

running for a long time; output stopped in the middle of a warning

I've run all the preliminary steps without problem (HapCut2, prep) but the final 'call' step has been running for over a week now, using 1 cpu, and stable at a couple GB of RAM. It also output (not sure if stdout or err) a number of stop codon warnings, then it looks to have paused in the middle of outputting a final warning:

[...]
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000339679.7; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000421682.1; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000597972.1; this transcript may undergo degradation
  Warning,
/home/ubuntu/neoepiscope/neoepiscope-venv/lib/python3.5/site-packages/neoepiscope/transcript.py:2408: Warning: Stop codon not detected prior to end of transcript ENST00000401510.1; this transcript may undergo degradation
  Warning,

... and that's it. There aren't any other processes running, and it is running, but it seems like something may be off. Any ideas how to troubleshoot this? I don't think the stop codon issues are the issue since ... I've seen discussion of that warning somewhere - maybe another issue? not sure ... but I could remove offending transcripts or cut down to a single chromosome ... just wondering if it looks like something you've seen before.

The calls:

./HapCUT2-1.2/build/extractHAIRS --indels 1 --bam tumor.bam --VCF tumor.vcf --out fragments_file
./HapCUT2-1.2/build/HAPCUT2 --fragments fragments_file --VCF tumor.vcf --output haplotype_output_file
neoepiscope prep -v tumor.vcf -c haplotype_output_file -o adjusted_haplotype_output
neoepiscope call -b /home/ubuntu/neoepiscope.data/hg19 -c adjusted_haplotype_output --no-affinity

Support substitutions longer than 1

Hi!
I'm working with a heavily modified version of your tool for a client and found an issue that originated from your code.
The client wanted to support dinucleotide substitutions, which would be found on the VCF file as "AA" -> "TT".
Currently, neoepiscope reads this as a SNP which have a hardcode length of 1 in the annotated_seq() function.
The resulting annotated sequence will be wrong, the reference sequence after the mutated one will start 1 nucleotide to early, resulting in a frame-shifted neopeptide.
I was able to fix this by changing line 1783 and 1784 of transcript.py:
last_index += fill + 1 -> last_index += fill + len(snv[0])
last_pos += fill + 1 -> last_pos += fill + len(snv[0])

And this can be tested like this:

self.fwd_transcript.edit("CA", 450533)
peptides = self.fwd_transcript.neopeptides().keys()
self.assertEqual(len(peptides), 42)
self.assertEqual(sorted(peptides)[0], "AGRASLDK")
self.assertEqual(sorted(peptides)[-1], "VPAGRASLDKP")

Error when running Neoepiscope

Error occurs after multiple warnings with message "Stop codon not detected prior to end of transcript {transcript_name}; this transcript may undergo degradation".

Neoepiscope version: 0.4.1

Command used:
neoepiscope call -b GRCh38 -c Mel5_tumor_v_Mel5_normal.haplotypes.prepped -v Mel5_tumor_v_Mel5_normal.merged.vcf -o Mel5_tumor_v_Mel5_normal.neoepiscope.out -k 9 -p netMHCpan 4 affinity,rank -a HLA-A01:01,HLA-B08:01,HLA-C*07:01 --nmd --pp --trv --igv --allow-nonstart --allow-nonstop

Error message:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/neoepiscope", line 8, in
sys.exit(main())
File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/neoepiscope/init.py", line 696, in main
neoepitopes, fasta = get_peptides_from_transcripts(
File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/neoepiscope/transcript.py", line 3212, in get_peptides_from_transcripts
peptides, protein = transcript_a.neopeptides(
File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/neoepiscope/transcript.py", line 2277, in neopeptides
read_frame1 = self.reading_frame(seq[3])
File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/neoepiscope/transcript.py", line 876, in reading_frame
if pos > self._start_codon or pos < self._stop_codon:
TypeError: '<' not supported between instances of 'int' and 'NoneType'

No neoepitopes found.

root@a231da4504f2:~/notebooks/neoepiscope# neoepiscope call -x ~/neoepiscope.data/hg19 -d ~/neoepiscope.data/gencode_v29/ -c adjusted_hap_som_germ.out -a HLA-A -o tests/output_prediction_somatic_germline.out
No neoepitopes found

root@92c193bba70f:~/notebooks/neoepiscope# neoepiscope call -x ~/neoepiscope.data/hg19 -d ~/neoepiscope.data/gencode_v29/ -c tests/adjusted_ychrom_hap.out -a HLA-A -o tests/output_ychrome.out
No neoepitopes found

root@92c193bba70f:~/notebooks/neoepiscope# neoepiscope call -x tests/Ychrom.ref -d ~/neoepiscope.data/gencode_v29/ -c tests/adjusted_ychrom_hap.out -a HLA-A -o tests/output_ychrome.out
No neoepitopes found

root@92c193bba70f:~/notebooks/neoepiscope# neoepiscope call -x tests/Chr11.ref -d ~/neoepiscope.data/gencode_v29/ -c adjusted_chr11_hap.out -a HLA-A -o tests/output_ychrome.out
No neoepitopes found

Above are various trials of different haplotypes assembly outputs, that I've ran with the same output of no neoepitopes. It's possible I might be missing a step, so I'll run through the steps:

  1. First I took a test vcf (test.vcf/Ychrome_varscan.vcf), either separately or after merging it with germline.vcf. (Combining germline and somatic variants).
  2. Prepped the already present test hapcut outputs to include unphased variants. (test_hapcut.out/complete_hapcut.out)
  3. Ran neoeopitope calling with the reference genome bowtie indices and dicts.

Now, the assumption is this case is not manually running the Haplotype phasing and taking the already given sample outputs. Where am I be going wrong?

"-g" in prep? exists in README.md, but not usage message

Hi - I installed a couple of weeks ago using pip, so I think I'm up to date version-wise. But I'm hoping to add germline variation to the predicted neoepitopes, so I used a command from the GitHub README.md, to this effect:

#########cut#########
Predict haplotype phasing

Next, run HapCUT2 with your merged or somatic VCF (make sure to use --indels 1 when running extractHAIRS if you wish to predict neoepitopes resulting from insertions and deletions). Before calling neoepitopes, prep your HapCUT2 output to included unphased mutations as their own haplotypes and flag germline variants if relevant:

neoepiscope prep -v -c [-g ] -o

Options:

-v, --vcf path to VCF file used to generate HapCUT2 output

-g, --germline-vcf path to germline VCF used in neoepiscope merge

-c, --hapcut2-output path to original HapCUT2 output

-o, --output path to output file

-p, --phased flag input VCF as phased with GATK ReadBackedPhasing
##########cut#########

... so I'm specifying a germline-only VCF file, using "-g". However, I get this:

neoepiscope: error: unrecognized arguments: -g SS.normal.vcf

... and 'neoepiscope prep -h' doesn't show a "-g" option like the readme does. What should I do here?

Thanks in advance!

transcript_to_gene_info.pickle is not loaded

When you use your own index with -x and -d, the "transcript_to_gene_info.pickle" file is never loaded and the code crashed after trying to write the "info_dict" variable that is never created.

Easy fix, load the file after "transcript_to_CDS.pickle" is loaded at line 465 in init.py.
Add something like:

info_path = os.path.join(args.dicts, "transcript_to_gene_info.pickle")   
    if os.path.isfile(info_path):  
       with open(info_path, "rb") as info_stream:  
          info_dict = pickle.load(info_stream)  
             else:  
                raise RuntimeError(  
                   "".join(  
                      [  
                          "Cannot find ",  
                            info_path,  
                             "; have you indexed your GTF with neoepiscope index?",  
                       ]  
                    )  
                )  

Transcript.py expressed edit function has list index problems during edits out of interval ranges

Using test_transcript.py in tests (although this is repeatable with own GTF)
Mutation is edited to be not in interval sequence.
Call to annotated_seq or expressed edit in the ranges around the "mutation" create error.

`

test = TestTranscript()
test.setUp()
test.transcript
<neoepiscope.transcript.Transcript object at 0x1123dae10>
trans = test.transcript
trans
<neoepiscope.transcript.Transcript object at 0x1123dae10>
trans.intervals
[5246692, 5246955, 5247805, 5248028, 5248158, 5248300]
mutloc = 5246980
start = mutloc-33
end = mutloc+32
trans.annotated_seq(start, end)
[('CTCCTGGGCA', 'R', [()], 5246956)]
start
5246947
trans.edit("ATGCGGGG", mutloc, mutation_type="I")
trans.annotated_seq(start, end)
Traceback (most recent call last):
File "<pyshell#12>", line 1, in
trans.annotated_seq(start, end)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/neoepiscope/transcript.py", line 1116, in annotated_seq
include_germline=include_germline,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/neoepiscope/transcript.py", line 783, in expressed_edits
if start_index % 2 or pos == intervals[start_index][0]:
IndexError: list index out of range
`

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.