mckennalab / flashfry

FlashFry: The rapid CRISPR target site characterization tool

License: Other

Scala 89.25% Java 1.28% Jupyter Notebook 2.39% R 0.82% Python 1.23% Common Workflow Language 3.33% Dockerfile 0.91% Shell 0.78%

bioinformatics crispr crispr-cas9 genome-editing

flashfry's Introduction

Of note: If you've been using FlashFry before version 1.9, the command-line system has changed slightly.

FlashFry is a fast and flexible command-line tool for characterizing large numbers of potential CRISPR target sequences. FlashFry can be used with any genome, and can run against non-traditional model organisms or transcriptomes. If you're looking to characterize a smaller region or would like a nice web interface, we recommend the GT-scan or crispor.org websites.

The easiest way to get started is to try out the quick-start procedure to make sure everything works on your system. If everything looks good, there are a few more in-depth tutorials that explore FlashFry's other capabilities. Thanks to @drivenbyentropy for the Java implementation of the ViennaRNA energy calculations.

Quickstart

First, make sure you're running Java version 8 (type java -version on the command line to see the version). From the UNIX or Mac command line, download the latest release version of the FlashFry jar file:

wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

Download and then un-gzip the sample data for human chromosome 22:

wget https://raw.githubusercontent.com/aaronmck/FlashFry/master/test_data/quickstart_data.tar.gz
tar xf quickstart_data.tar.gz

Then run the database creation step (this should take a few minutes; it takes ~75 seconds on my laptop):

mkdir tmp
java -Xmx4g -jar FlashFry-assembly-1.15.jar \
 index \
 --tmpLocation ./tmp \
 --database chr22_cas9ngg_database \
 --reference chr22.fa.gz \
 --enzyme spcas9ngg

Now we discover candidate targets and their potential off-targets in the test data (takes a few seconds). Here we're using the EMX1 target with some sequence flanking the target site. This flanking sequence is needed by the on-target scoring metrics to fully evaluate the target's efficiency:

java -Xmx4g -jar FlashFry-assembly-1.15.jar \
 discover \
 --database chr22_cas9ngg_database \
 --fasta EMX1_GAGTCCGAGCAGAAGAAGAAGGG.fasta \
 --output EMX1.output

Finally we score the discovered sites (a few seconds):

java -Xmx4g -jar FlashFry-assembly-1.15.jar \
 score \
 --input EMX1.output \
 --output EMX1.output.scored \
 --scoringMetrics doench2014ontarget,doench2016cfd,dangerous,hsu2013,minot \
 --database chr22_cas9ngg_database

There should now be a set of scored sites in EMX1.output.scored. Success! Now check out the documentation and tutorials for more specific details.
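As a quick sanity check, the scored file can be inspected with a few lines of Python. This is a minimal sketch that assumes the output is tab-delimited with a single header row; the sample rows are made up for illustration and are not real FlashFry output, and `read_sites` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in the assumed layout: one header row, tab-separated fields.
SAMPLE = (
    "contig\tstart\tstop\ttarget\totCount\n"
    "EMX1\t100\t123\tGAGTCCGAGCAGAAGAAGAAGGG\t12\n"
    "EMX1\t140\t163\tACGTACGTACGTACGTACGTAGG\t3\n"
)


def read_sites(handle):
    """Parse a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


sites = read_sites(io.StringIO(SAMPLE))

# otCount is text in the file, so cast it before sorting (fewest first).
ranked = sorted(sites, key=lambda row: int(row["otCount"]))
for row in ranked:
    print(row["target"], row["otCount"])
```

To read a real file, pass `open("EMX1.output.scored", newline="")` to `read_sites` instead of the in-memory sample.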

Cite

FlashFry is published in BMC Biology; if you find it useful please cite:

TY  - JOUR
AU  - McKenna, Aaron
AU  - Shendure, Jay
PY  - 2018
DA  - 2018/07/05
TI  - FlashFry: a fast and flexible tool for large-scale CRISPR target design
JO  - BMC Biology
SP  - 74
VL  - 16
IS  - 1
AB  - Genome-wide knockout studies, noncoding deletion scans, and other large-scale studies require a simple and lightweight framework that can quickly discover and score thousands of candidate CRISPR guides targeting an arbitrary DNA sequence. While several CRISPR web applications exist, there is a need for a high-throughput tool to rapidly discover and process hundreds of thousands of CRISPR targets.
SN  - 1741-7007
UR  - https://doi.org/10.1186/s12915-018-0545-0
DO  - 10.1186/s12915-018-0545-0
ID  - McKenna2018
ER  -

flashfry's Issues

guideRNA ranking

Hi @aaronmck ,

I have a question regarding the ranking of guide RNA sequences.
In the results I saw:
AggregateRankedScore_medianRank AggregateRankedScore_tranche AggregateRankedScore_topX
I roughly understand from the wiki introduction that these are aggregated ranking metrics.
But how are they aggregated?
What is AggregateRankedScore_tranche?
Sometimes AggregateRankedScore_medianRank and AggregateRankedScore_topX are the same, sometimes similar, and sometimes very different. What is the underlying cause?

Also, I found the guide-target free energy (--folding) computation does not change the results. Is this expected?

Thanks a lot. I like flashfry since it incorporates lots of scoring metrics and is very fast!

Export index database?

Hi, I was wondering if it would be possible to export the binary database made by the index step into a format that can be used by R, Excel, or similar?

doench2016cfd scoring with spcas9ngg19

Dear Aaron,

Would it be possible to enable doench2016cfd scoring with spcas9ngg19 by removing position 1 (PAM distal) from the CFD sgRNA-DNA mismatch matrix?

Many thanks,

Tyler

"discover" output file field <otCount> format issue

The field is written as a character instead of an int.
Example: Analyzing the BRD4 gene, I get 90 off-targets for the first target. In the otCount field, however, a 'Z' is written (the character with ASCII code 90). For this reason the scoring step fails (as it can't interpret the character).

The issue is here:
https://github.com/aaronmck/FlashFry/blob/db4d2f441c52cb561f87855c54ee15aee45da5db/src/main/scala/targetio/TabDelimitedHandler.scala#L110

https://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html
says
void | write(int c): Writes a single character.

So an int-to-string conversion is necessary there.
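The effect is easy to reproduce outside Java. The sketch below is a Python analog of the behaviour described in this report, not the FlashFry code itself: `chr()` plays the role of `PrintWriter.write(int c)`, and `str()` plays the role of the missing int-to-string conversion.

```python
# Python analog of the reported behaviour (not the FlashFry code itself).
# Java's PrintWriter.write(int c) emits the character whose code is c;
# chr() does the same for an integer in Python.
ot_count = 90

as_character = chr(ot_count)  # what write(int) effectively produces
as_text = str(ot_count)       # what a tab-delimited count column needs

print(repr(as_character))  # 'Z'
print(repr(as_text))       # '90'

# Only the text form can be parsed back into the original count.
assert int(as_text) == ot_count
```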

How come you didn't stumble upon this issue?

Whole genome off-target scores

Myc fasta issue in test data

It looks like the myc fasta doesn't have +/- 1kb of padding on either end of the gene, as the workflow suggests and as the mapping.bed would require.

404 to link in paper

Paper

Link 404

http://mckennalab.org/FlashFry/

bedannotator not functional in v1.9.2

Hello,

I'm using FlashFry-assembly-1.9.2, and the bedannotator module seems to be nonfunctional. The only result of using it is the displacement of the last column of data. No annotations are created, and no position transformations occur. I first encountered this using my own dataset, but have since confirmed this to be true following the wiki instructions to the letter:

Here are my commands:

java -Xmx4g -jar FlashFry-assembly-1.9.2.jar discover --database hg19_cas9ngg_database --fasta myc_example.fasta --output myc_example.sites
java -Xmx4g -jar FlashFry-assembly-1.9.2.jar score --input myc_example.sites --output myc_example_sites.scored --scoringMetrics doench2014ontarget,doench2016cfd,dangerous,hsu2013,minot,bedannotator --database hg19_cas9ngg_database --inputAnnotationBed ~/turbo/data-public/databases/ENCODE/wgEncodeBroadHmmGm12878HMM.bed --transformPositions myc_example.bed

#Viewing myc_example_sites.scored shows that the results from the last column, otCount, have been shifted to the right by one column (except for the header). No translations or annotations are apparent.

#Comparing without using bedannotator, can get equivalent files by just reordering the columns
java -Xmx4g -jar FlashFry-assembly-1.9.2.jar score --input myc_example.sites --output myc_example_sites_noannotate.scored --scoringMetrics doench2014ontarget,doench2016cfd,dangerous,hsu2013,minot --database hg19_cas9ngg_database
cut -f1-15,17 myc_example_sites.scored | tail -n +2 > myc_example_sites_2compare.scored
tail -n +2 myc_example_sites_noannotate.scored > myc_example_sites_noannotate_2compare.scored
#After removing the header and shifting the columns, the outputs are now identical
diff myc_example_sites_noannotate_2compare.scored myc_example_sites_2compare.scored
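A column shift like the one described here can be flagged automatically by comparing each row's field count against the header's. This is a minimal sketch on made-up rows; `find_shifted_rows` is a hypothetical helper, and the column names are placeholders rather than the full scored-output header.

```python
def find_shifted_rows(lines):
    """Return (line_number, field_count) for rows whose tab-separated
    field count differs from the header's."""
    expected = len(lines[0].rstrip("\n").split("\t"))
    mismatches = []
    for number, line in enumerate(lines[1:], start=2):
        count = len(line.rstrip("\n").split("\t"))
        if count != expected:
            mismatches.append((number, count))
    return mismatches


# Made-up example: the second data row carries an extra empty field.
example = [
    "contig\tstart\ttarget\totCount\n",
    "myc\t10\tACGT\t5\n",
    "myc\t40\tTTGA\t\t7\n",
]
print(find_shifted_rows(example))  # [(3, 5)]
```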

Input string error

Hi Aaron,

I am running into an error while testing FlashFry to look for gRNAs in my dataset. I have 12 sequences (100 bp each) and use the hg38 assembly. I was able to get the "discover" function to work (the output file looked fine); however, I received this error for the "score" function:

I was able to run "score" on individual sequences, but not on the whole dataset. Do you know what went wrong?

Thank you,
Linh

Re-opening: Bedfile command line argument wgEncodeBroadHmmGm12878HMM.bed doesn't contain both a name and a file (#15, Jun 25, 2020)

Hello,

Per request, I am re-opening this issue, which occurs when using the bedannotator metric during scoring. I get the error:

Exception in thread "main" java.lang.AssertionError: assertion failed: Bedfile command line argument /Users/nkuperwasser/sandbox/wgEncodeBroadHmmGm12878HMM.bed doesn't contain both a name and a file

This file came straight from the test data folder, and I have had the same issue with other bed files.

Here is the rest of the exception:

at scala.Predef$.assert(Predef.scala:219)
at scoring.BedAnnotation.$anonfun$setup$1(BedAnnotation.scala:131)
at scoring.BedAnnotation.$anonfun$setup$1$adapted(BedAnnotation.scala:130)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
at scoring.BedAnnotation.setup(BedAnnotation.scala:130)
at modules.ScoreResults.$anonfun$run$2(ScoreResults.scala:115)
at modules.ScoreResults.$anonfun$run$2$adapted(ScoreResults.scala:101)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
at modules.ScoreResults.run(ScoreResults.scala:101)
at picocli.CommandLine.execute(CommandLine.java:1048)
at picocli.CommandLine.access$900(CommandLine.java:142)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1255)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1223)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1131)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:1414)
at picocli.CommandLine.parseWithHandler(CommandLine.java:1353)
at main.scala.Main$.main(Main.scala:57)
at main.scala.Main.main(Main.scala)

Here is the full command:

java -Xmx16g -jar ~/FlashFry/FlashFry-assembly-1.15.jar score --input D34.guides --output D34.scored --scoringMetrics doench2016cfd,dangerous,hsu2013,minot,bedannotator --database ~/FlashFry/db/mm39/mm39ngg --inputAnnotationBed ~/genomes/mm39/mm39_bed.fry --transformPositions unc93b1_D34.bed

I have been using my own mm39_bed.fry, and I used wgEncodeBroadHmmGm12878HMM.bed to test whether there was an issue with my bed file, but that doesn't seem to be the case.

Thank you in advance!

Bedfile command line argument wgEncodeBroadHmmGm12878HMM.bed doesn't contain both a name and a file

When using BedAnnotator, I got an error:

15:25:40.024 [main] INFO  modules.ScoreResults - adding score: BedAnnotator
Exception in thread "main" java.lang.AssertionError: assertion failed: Bedfile command line argument wgEncodeBroadHmmGm12878HMM.bed doesn't contain both a name and a file
        at scala.Predef$.assert(Predef.scala:219)
        at scoring.BedAnnotation.$anonfun$setup$1(BedAnnotation.scala:142)
        at scoring.BedAnnotation.$anonfun$setup$1$adapted(BedAnnotation.scala:141)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
        at scoring.BedAnnotation.setup(BedAnnotation.scala:141)
        at modules.ScoreResults.$anonfun$run$2(ScoreResults.scala:96)
        at modules.ScoreResults.$anonfun$run$2$adapted(ScoreResults.scala:91)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
        at modules.ScoreResults.run(ScoreResults.scala:91)
        at picocli.CommandLine.execute(CommandLine.java:1048)
        at picocli.CommandLine.access$900(CommandLine.java:142)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:1255)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:1223)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1131)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:1414)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:1353)
        at main.scala.Main$.main(Main.scala:57)
        at main.scala.Main.main(Main.scala)

I used the example bed file, so why does this problem occur?

Designs for SpG Cas enzyme

Hi - we have been working on designing gRNAs with alternate PAM sites. One Cas enzyme we have been testing is the SpG Cas enzyme. This is an engineered version of the Cas9 enzyme that recognizes an NG PAM site. I was wondering if you can provide an additional option to index and discover guides using this enzyme.

Filtering gRNAs for GC% and homopolymers

Hi Aaron:

I've been using FlashFry for quite a few years now. In my current project, I am designing millions of gRNAs, and before we synthesize them we filter them based on GC% and homopolymeric sequences in the guide target. Is there any way you could add these parameters as an optional filter?
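Until such an option exists, this kind of filter can be applied as a post-processing step. The sketch below is one possible approach, not a FlashFry feature: the function names are hypothetical and the thresholds are placeholders rather than recommendations.

```python
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def passes_filter(seq, min_gc=0.25, max_gc=0.75, max_run=4):
    """Keep guides within a GC window and without long homopolymer runs.
    The default thresholds are placeholders, not recommendations."""
    within_gc = min_gc <= gc_fraction(seq) <= max_gc
    return within_gc and longest_homopolymer(seq) <= max_run


# EMX1 protospacer from the quickstart, with the PAM removed.
guide = "GAGTCCGAGCAGAAGAAGAA"
print(gc_fraction(guide), longest_homopolymer(guide), passes_filter(guide))
```

Applied to the `target` column of a discover or score output, `passes_filter` can be used to drop rows before synthesis.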

support for other PAM sequences

Is it possible to add functionality to build the off target database using custom PAM sequences?

Support Azimuth scoring method for guides

TODO:
https://github.com/MicrosoftResearch/Azimuth

Java version

What version of Java should be used with FlashFry? The version on my system is
java version "1.7.0_181"
OpenJDK Runtime Environment (IcedTea 2.6.14) (7u181-2.6.14-0ubuntu0.3)
OpenJDK 64-Bit Server VM (build 24.181-b01, mixed mode)

When I follow the quickstart instructions I get an error that seems to be caused by java version incompatibility:

$ mkdir flashFry
$ cd flashFry/
$ wget https://github.com/aaronmck/FlashFry/releases/download/1.9.0/FlashFry-assembly-1.9.0.jar
$ wget https://raw.githubusercontent.com/aaronmck/FlashFry/master/test_data/quickstart_data.tar.gz
$ tar xf quickstart_data.tar.gz
$ gunzip *gz
$ java -Xmx4g -jar FlashFry-assembly-1.9.0.jar index --tmpLocation ./tmp --database chr22_cas9ngg_database --reference chr22.fa.gz --enzyme spcas9ngg

Exception in thread "main" java.lang.UnsupportedClassVersionError: main/scala/Main : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:808)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:443)
at java.net.URLClassLoader.access$100(URLClassLoader.java:65)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.net.URLClassLoader$1.run(URLClassLoader.java:349)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:348)
at java.lang.ClassLoader.loadClass(ClassLoader.java:430)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:363)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
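The number in this error is the class-file major version the jar was compiled for. Major version 52 corresponds to Java 8, while the Java 7 runtime shown above only accepts versions up to 51. The sketch below decodes the message; `required_java` is a hypothetical helper, and the `major - 44` rule holds for Java 5 and later.

```python
import re


def required_java(error_message):
    """Extract the class-file major version from an
    UnsupportedClassVersionError message and map it to a Java release.
    For Java 5 and later, release = major version - 44 (52 -> Java 8)."""
    match = re.search(r"major\.minor version (\d+)\.(\d+)", error_message)
    if match is None:
        return None
    return int(match.group(1)) - 44


message = (
    "java.lang.UnsupportedClassVersionError: main/scala/Main : "
    "Unsupported major.minor version 52.0"
)
print(required_java(message))  # 8
```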

Add option/PAM for Staphylococcus aureus Cas9?

Hi there,

I'm wondering if it might be possible to add Staphylococcus aureus Cas9 (SaCas9) to FlashFry? It has a canonical NNGRRT PAM motif, with the NNGRRN motif showing quite significant nuclease targeting as well.
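To make the motif concrete, a degenerate PAM such as NNGRRT can be expanded into a regular expression using the standard IUPAC nucleotide codes. This is an illustrative sketch only: it scans the forward strand of a plain string, is unrelated to FlashFry's bit-encoded index, and the 21-base guide length is an assumption chosen for the example.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_regex(pam):
    """Translate a degenerate PAM such as NNGRRT into a regex fragment."""
    return "".join(IUPAC[code] for code in pam.upper())


def find_targets(seq, pam="NNGRRT", guide_len=21):
    """Yield (start, protospacer, pam_seq) for forward-strand sites with a
    3' PAM. A lookahead is used so that overlapping sites are all reported.
    guide_len=21 is an assumption for this example."""
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})({pam_regex(pam)}))")
    for match in pattern.finditer(seq.upper()):
        yield match.start(), match.group(1), match.group(2)


# Made-up sequence: a 21-base protospacer, a TTGAAT PAM, and two extra bases.
seq = "ACGTACGTACGTACGTACGTA" + "TTGAAT" + "CC"
hits = list(find_targets(seq))
print(hits)
```

Passing `pam="NNGRRN"` relaxes the last position; reverse-strand sites would need the reverse complement of the sequence to be scanned as well.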

SaCas9 is a small Cas9 ortholog, so it is appearing in more and more publications for space-limited applications (e.g. AAV vectors and gene therapy). Here is an example:

Y. Tan et al., “Rationally engineered Staphylococcus aureus Cas9 nucleases with high genome-wide specificity,” Proceedings of the National Academy of Sciences, vol. 116, no. 42, pp. 20969–20976, Oct. 2019, doi: 10.1073/pnas.1906843116.

and discovery paper:
F. A. Ran et al., “In vivo genome editing using Staphylococcus aureus Cas9,” Nature, vol. 520, no. 7546, Art. no. 7546, Apr. 2015, doi: 10.1038/nature14299.

I did try to look into the code to submit a PR. I assume this is the place, but I don't know Scala, and bit-manipulation comments like this one gave me some pause :x

FlashFry/src/main/scala/standards/StandardScanParameters.scala

Line 98 in 127a455

 // be super careful with this value!! the cas9 mask only considers the lower 46 bites (23 of 24 bases are used) 

output all off-target score

How can I output the off-target score for every individual off-target, rather than only the DoenchCFD_maxOT and DoenchCFD_specificityscore summaries across all off-targets?

problem with genome fasta files with chromosome description

Thanks for this excellent tool!
I found that if the reference genome file includes chromosome descriptions in the FASTA headers and off-target position output is requested in the discovery stage, then the program fails in the scoring stage. I manually removed the chromosome descriptions and re-indexed, but wondered if there is an easy fix for that in the code. Thx
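The manual workaround described here can be scripted. The sketch below trims each FASTA header to its first whitespace-delimited token before indexing; it is a preprocessing step under that assumption, not a change to FlashFry, and `strip_descriptions` is a hypothetical helper name.

```python
def strip_descriptions(lines):
    """Yield FASTA lines with headers cut at the first whitespace,
    so '>chr22 some description' becomes '>chr22'."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            yield ">" + name + "\n"
        else:
            yield line


example = [
    ">chr22 Homo sapiens chromosome 22\n",
    "ACGTNNNN\n",
    ">chrM\n",
    "GATTACA\n",
]
print("".join(strip_descriptions(example)), end="")
```

For a real genome, iterate over an open file handle and write the yielded lines to a new file, then index that file.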

Best Scoring Matrix for CPF1

First, I would like to thank you for this amazing effort. This is by far the most convenient and complete command-line tool for Guide RNA design.
I am not sure if the scoring matrices are independent of the choice of enzyme. In other words, if I used cpf1 as the enzyme when indexing, can I use any of the scoring metrics (like doench2014ontarget or doench2016cfd)?
My understanding is that these scoring algorithms were developed for cas9 applications, so I am hesitant to use them with cpf1.
What would you advise?

--includeOTs score option

The includeOTs command-line option for the score command throws the following exception:

$ java -Xmx4g -jar FlashFry-assembly-1.9.0.jar score --includeOTs --input test_cas9ngg_targets.tsv --output test_cas9ngg_targets_scored_icludeots.tsv --scoringMetrics doench2016cfd --database test_cas9ngg_db
11:26:58.709 [main] INFO  reference.binary.BinaryHeader$ - Loading header: enzyme type is SpCAS9
11:26:58.712 [main] DEBUG reference.binary.BinaryHeader$ - Number of characters used to generate this lookup file: 7
11:26:58.966 [main] INFO  modules.ScoreResults - Loading CRISPR objects (filtering out overflow guides)..
11:26:59.061 [main] INFO  modules.ScoreResults - adding score: Doench2014OnTarget
11:26:59.074 [main] INFO  modules.ScoreResults - Scoring all guides...
11:26:59.075 [main] INFO  modules.ScoreResults - Scoring with model Doench2014OnTarget
11:26:59.139 [main] INFO  modules.ScoreResults - Aggregating results...
11:26:59.161 [main] INFO  modules.ScoreResults - Writing annotated guides to the output file...
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (modules.ScoreResults@341b80b2): java.util.NoSuchElementException: key not found: 0
        at picocli.CommandLine.execute(CommandLine.java:1056)
        at picocli.CommandLine.access$900(CommandLine.java:142)
        at picocli.CommandLine$RunAll.handle(CommandLine.java:1304)
        at picocli.CommandLine$RunAll.handle(CommandLine.java:1264)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1131)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:1414)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:1353)
        at main.scala.Main$.main(Main.scala:57)
        at main.scala.Main.main(Main.scala)
Caused by: java.util.NoSuchElementException: key not found: 0
        at scala.collection.MapLike.default(MapLike.scala:232)
        at scala.collection.MapLike.default$(MapLike.scala:231)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
        at bitcoding.BitPosition.decode(BitPosition.scala:67)
        at crispr.CRISPRHit.$anonfun$toOutput$1(CRISPRHit.scala:43)
        at crispr.CRISPRHit.$anonfun$toOutput$1$adapted(CRISPRHit.scala:42)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:253)
        at scala.collection.TraversableLike.map(TraversableLike.scala:234)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
        at scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:253)
        at crispr.CRISPRHit.toOutput(CRISPRHit.scala:42)
        at targetio.TabDelimitedOutput.$anonfun$write$7(TabDelimitedHandler.scala:139)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike.map(TraversableLike.scala:234)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at targetio.TabDelimitedOutput.write(TabDelimitedHandler.scala:139)
        at modules.ScoreResults.$anonfun$run$8(ScoreResults.scala:131)
        at modules.ScoreResults.$anonfun$run$8$adapted(ScoreResults.scala:130)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
        at modules.ScoreResults.run(ScoreResults.scala:130)
        at picocli.CommandLine.execute(CommandLine.java:1048)
        ... 8 more

I have tested other scoring methods and the exception is always the same.
Details about my system:

OS: Debian GNU/Linux 9.8 (stretch)
Java version (brew install openjdk):

openjdk version "1.8.0_181-ojdkbuild"
OpenJDK Runtime Environment (build 1.8.0_181-ojdkbuild-13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

rnafold4j.src.main.java.rnafold4j.RNAFoldAPI does not exist

I am trying to update the library to Scala 2.13 and add sbt-native docker support and a proper .gitignore. However, the code has a strange rnafold4j.src.main.java.rnafold4j.RNAFoldAPI dependency, and I have no idea where it comes from. I could not find an rnafold4j file in the repo, nor could I find anything in Maven Central.

CFD score threshold

Hi @aaronmck ,

What DoenchCFD_specificityscore threshold do you recommend using to select for good guides?
Thanks a lot.

Best,
Huanle

Inconsistent Offtarget Reporting

Hi,

I'm running FlashFry to locate off-target sites of a given guide RNA. I find that the report up to a certain number of mismatches is inconsistent. I'll put the guide RNA and commands below so the issue can be reproduced. (Or if I'm making some mistake in how I'm running the software!)

Cpf1 Guide RNA with 5' PAM
cat cpfguide.fa

>cpfguide
TTTCCCACGGCATCAAGTGCCCCG

Database Indexing Command
java -Xmx4g -jar FlashFry-assembly-1.9.9.1.jar index --tmpLocation /tmp --database hg38_cpf1_database --reference ~/hg38.fa --enzyme cpf1

Command for searching up to 2 mismatches
java -Xmx4g -jar FlashFry-assembly-1.9.9.1.jar discover --database hg38_cpf1_database --fasta cpfguide.fa --output cpfguide2.out --positionOutput --maxMismatch=2 --maximumOffTargets=40000; cat cpfguide2.out
Output

contig  start   stop    target  context overflow        orientation     otCount offTargets
cpfguide        0       24      TTTCCCACGGCATCAAGTGCCCCG        NONE    OK      FWD     0

Command for searching up to 3 mismatches
java -Xmx4g -jar FlashFry-assembly-1.9.9.1.jar discover --database hg38_cpf1_database --fasta cpfguide.fa --output cpfguide3.out --positionOutput --maxMismatch=3 --maximumOffTargets=40000; cat cpfguide3.out
Output

contig  start   stop    target  context overflow        orientation     otCount offTargets
cpfguide        0       24      TTTCCCACGGCATCAAGTGCCCCG        NONE    OK      FWD     2       TTTTGCACGGCATCAAGTAACCCG_2_3<chr10:47553816^R|chr10:46786533^F>

Command for searching up to 4 mismatches
java -Xmx4g -jar FlashFry-assembly-1.9.9.1.jar discover --database hg38_cpf1_database --fasta cpfguide.fa --output cpfguide4.out --positionOutput --maxMismatch=4 --maximumOffTargets=40000; cat cpfguide4.out
Output

contig  start   stop    target  context overflow        orientation     otCount offTargets
cpfguide        0       24      TTTCCCACGGCATCAAGTGCCCCG        NONE    OK      FWD     7       TTTCCCACGGCATCAAGTGCCCCG_1_0<chr17:80790186^R>,TTTGCCACGGCATCAACTGCCCAG_1_2<chr2:136115625^F>,TTTGCCACGGCATCAAGGCCCCGC_1_4<chr2:115641298^F>,TTTGCCACGGCTTCATCTGCCCCC_1_4<chr17:17768968^R>,TTTCCCACTGCTTCAACTGCCCCT_1_4<chr10:102146516^R>,TTTTGCACGGCATCAAGTAACCCG_2_3<chr10:47553816^R|chr10:46786533^F>

The output when allowing 4 mismatches shows a locus with a perfect match, as well as one with two mismatches. Neither of these is reported when using --maxMismatch=3 or --maxMismatch=2, though, and I would have expected them to be.
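The mismatch counts in the output above can be cross-checked independently. The sketch below assumes each offTargets entry has the form `SEQ_count_mismatches<positions>` (a reading inferred from the output, not taken from FlashFry documentation) and that the first four bases are the Cpf1 PAM, which is excluded from the comparison. The helper names are hypothetical.

```python
import re

PAM_LEN = 4  # assumption: the 5' Cpf1 PAM (TTTN) occupies the first four bases


def mismatches(guide, candidate, skip=PAM_LEN):
    """Hamming distance over the protospacer, ignoring the 5' PAM."""
    if len(guide) != len(candidate):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(guide[skip:], candidate[skip:]))


def parse_off_target(entry):
    """Split 'SEQ_count_mismatches<positions>' into its parts.
    This reading of the field is inferred from the output above."""
    match = re.fullmatch(r"([ACGT]+)_(\d+)_(\d+)(?:<(.*)>)?", entry)
    if match is None:
        raise ValueError(f"unrecognised entry: {entry!r}")
    seq, count, reported, positions = match.groups()
    position_list = positions.split("|") if positions else []
    return seq, int(count), int(reported), position_list


guide = "TTTCCCACGGCATCAAGTGCCCCG"
entries = [
    "TTTCCCACGGCATCAAGTGCCCCG_1_0<chr17:80790186^R>",
    "TTTGCCACGGCATCAACTGCCCAG_1_2<chr2:136115625^F>",
    "TTTTGCACGGCATCAAGTAACCCG_2_3<chr10:47553816^R|chr10:46786533^F>",
]
for entry in entries:
    seq, count, reported, positions = parse_off_target(entry)
    print(seq, "reported:", reported, "recomputed:", mismatches(guide, seq))
```

Under these assumptions the recomputed distances agree with the reported ones (0, 2, and 3), which supports the expectation that the first two loci should also appear at lower --maxMismatch settings.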

Thank you for the tool! I've been using it quite a bit for SpCas9 and never noticed anything amiss. This is my first time trying with a Cpf1 guide though.

Best,
Tim

score doench2016ontarget

Is there support for scoring with Rule Set 2 from Doench et al. Nature Biotechnology, 2016 ?

Too many open files

When running the example in your README.md I get the following exception:


Exception in thread "main" java.io.IOException: Too many open files
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createTempFile(File.java:2024)
        at crispr.BinWriter.$anonfun$new$1(BinWriter.scala:47)
        at crispr.BinWriter.$anonfun$new$1$adapted(BinWriter.scala:46)
        at scala.collection.Iterator.foreach(Iterator.scala:929)
        at scala.collection.Iterator.foreach$(Iterator.scala:929)
        at utils.BaseCombinationIterator.foreach(BaseCombinationGenerator.scala:58)
        at crispr.BinWriter.<init>(BinWriter.scala:46)
        at modules.BuildOffTargetDatabase.$anonfun$runWithOptions$1(BuildOffTargetDatabase.scala:67)
        at modules.BuildOffTargetDatabase.$anonfun$runWithOptions$1$adapted(BuildOffTargetDatabase.scala:61)
        at scala.Option.map(Option.scala:146)
        at modules.BuildOffTargetDatabase.runWithOptions(BuildOffTargetDatabase.scala:61)
        at main.scala.Main$.$anonfun$new$2(Main.scala:81)
        at main.scala.Main$.$anonfun$new$2$adapted(Main.scala:74)
        at scala.Option.map(Option.scala:146)
        at main.scala.Main$.delayedEndpoint$main$scala$Main$1(Main.scala:74)
        at main.scala.Main$delayedInit$body.apply(Main.scala:57)
        at scala.Function0.apply$mcV$sp(Function0.scala:34)
        at scala.Function0.apply$mcV$sp$(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App.$anonfun$main$1$adapted(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:378)
        at scala.App.main(App.scala:76)
        at scala.App.main$(App.scala:74)
        at main.scala.Main$.main(Main.scala:57)
        at main.scala.Main.main(Main.scala)

Increasing my open-files limit from 4096 to 80000 solved the issue, though not everyone may have the option to increase the limit.
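A rough way to check whether a system is likely to hit this limit is to compare the soft open-file limit with the number of bins. Log output elsewhere on this page reports a 7-character lookup and 16384 bins in total, which matches 4^7; the stack trace suggests temporary files are created while iterating over base combinations, but whether one file per bin stays open is an assumption here. The sketch uses Python's Unix-only `resource` module.

```python
import resource

PREFIX_LENGTH = 7  # assumption: bins are keyed by a 7-base prefix
bins = 4 ** PREFIX_LENGTH  # 16384 possible prefixes over A, C, G, T

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"bins: {bins}, soft limit: {soft}, hard limit: {hard}")

# RLIM_INFINITY means no limit is enforced, so only warn on a finite limit.
if soft != resource.RLIM_INFINITY and soft < bins:
    print(
        "The soft open-file limit is below the number of bins; "
        "raising it (for example with `ulimit -n`) may be needed."
    )
```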

Discover module: Key not found: 0 error

Hello @aaronmck I have been using FlashFry a lot and it's a very useful tool. I am looking to count sgRNAs on sequencing reads (10+ million) that I use as reference sequences for indexing. When running it in discover mode to count for guides (pre-defined FASTA with 20 bp target + NGG PAMs), I keep getting the error below for some files.

If I do not use the --positionOutput argument, the error does not happen. However, I need to be able to use the position information for further analysis.

Any chance you can help?

Thanks,
Sridhar Ranganathan

12:16:10.094 [main] INFO r.t.OrderedBinTraversalFactory - With 465 guides, and allowing 0 mismatch(es), we're going to scan 443 target bins out of a total of 16384
12:16:10.094 [main] INFO modules.OffTargetDiscovery - scanning against the known targets from the genome with 465 guides
12:16:10.094 [main] INFO modules.OffTargetDiscovery - Starting seek traversal
12:16:10.259 [main] INFO reference.traverser.SeekTraverser$ - Comparing the 0th bin (AAAACTA) with 286 guides, of a total bin count 443. 0.082861372 seconds/10K bins, executed 7,618,846 comparisons
12:16:12.012 [main] INFO modules.OffTargetDiscovery - Performed a total of 92,748 guide to target comparisons
12:16:12.014 [main] INFO modules.OffTargetDiscovery - Writing final output for 465 guides
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (modules.OffTargetDiscovery@3b938003): java.util.NoSuchElementException: key not found: 0
        at picocli.CommandLine.execute(CommandLine.java:1056)
        at picocli.CommandLine.access$900(CommandLine.java:142)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:1255)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:1223)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1131)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:1414)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:1353)
        at main.scala.Main$.main(Main.scala:57)
        at main.scala.Main.main(Main.scala)
Caused by: java.util.NoSuchElementException: key not found: 0
        at scala.collection.MapLike.default(MapLike.scala:232)
        at scala.collection.MapLike.default$(MapLike.scala:231)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
        at bitcoding.BitPosition.decode(BitPosition.scala:67)
        at crispr.CRISPRHit.$anonfun$toOutput$1(CRISPRHit.scala:59)
        at crispr.CRISPRHit.$anonfun$toOutput$1$adapted(CRISPRHit.scala:58)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:253)
        at scala.collection.TraversableLike.map(TraversableLike.scala:234)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
        at scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:253)
        at crispr.CRISPRHit.toOutput(CRISPRHit.scala:58)
        at targetio.TabDelimitedOutput.$anonfun$write$7(TabDelimitedHandler.scala:151)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike.map(TraversableLike.scala:234)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at targetio.TabDelimitedOutput.write(TabDelimitedHandler.scala:151)
        at modules.OffTargetDiscovery.$anonfun$run$4(OffTargetDiscovery.scala:135)
        at modules.OffTargetDiscovery.$anonfun$run$4$adapted(OffTargetDiscovery.scala:134)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:193)
        at modules.OffTargetDiscovery.run(OffTargetDiscovery.scala:134)
        at picocli.CommandLine.execute(CommandLine.java:1048)
        ... 8 more

unexpected result

Hi @aaronmck ,

I ran into an issue that I did not expect and cannot explain. I attached an image containing the relevant result.
As you can see there is a 0-mismatch off-target in the result. Interestingly, it happens to be within my target gene.

So I quickly checked the sequence using bedtools getfasta. (It seems coordinates in flashfry results are 1-based.) I got these:

>chr3:52211336-52211359
ACACCGTACGGTCCTATGCTGCT
>chr3:52211337-52211360
CACCGTACGGTCCTATGCTGCTG
>chr3:52211338-52211361
ACCGTACGGTCCTATGCTGCTGG
>chr3:52211339-52211362
CCGTACGGTCCTATGCTGCTGGC

This indicates that the 0-mismatch off-target is actually the target itself. Am I right? If so, then this result in the FlashFry output is incorrect.

Can you please help me to sort this out? Thanks a lot!
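
The coordinate check in this post depends on the convention in use: bedtools works on BED intervals, which are 0-based and half-open, and by default getfasta headers echo those BED coordinates. A minimal Python sketch of the conversion follows, assuming (as the poster guesses, not confirmed here) that the reported start is 1-based and that the target plus PAM is 23 bases long. The function names and the example start position are hypothetical.

```python
def one_based_to_bed(chrom, start_1based, length=23):
    """Convert a 1-based start and a length to a 0-based, half-open BED interval."""
    bed_start = start_1based - 1      # BED starts count from 0
    bed_end = bed_start + length      # BED end is exclusive
    return chrom, bed_start, bed_end


def getfasta_header(chrom, bed_start, bed_end):
    """Format the interval the way the getfasta headers in this post appear."""
    return f"{chrom}:{bed_start}-{bed_end}"


# Illustrative only: a hypothetical 1-based start of 52211339 on chr3.
interval = one_based_to_bed("chr3", 52211339)
header = getfasta_header(*interval)
print(header)
```

Shifting the start by one base in either direction selects a different 23-mer, which is why the four neighbouring records in the post differ from one another.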

Is random flanking sequence required?

Hi There,

What is the point of adding random flanking sequences to the sgRNA when attempting to discover target and off-target sites?

java -Xmx4g -jar FlashFry-assembly-1.12.jar \
 discover \
 --database chr22_cas9ngg_database \
 --fasta EMX1_GAGTCCGAGCAGAAGAAGAAGGG.fasta \
 --output EMX1.output

Thanks a lot in advance.

Run with multiple threads/cores?

Thank you very much for creating this program, it works great! I am starting to assemble some fairly large regions to run through the discover module, and I was wondering whether, and how, it would be possible to run it with multiple threads/cores?

Thanks!!

Sincerely,

Score: TSV columns off-by-one?

Hi there,

I'm using FlashFry score, and the output TSV seems to have an issue where the header columns past "dangerous_GC" don't match up with the data values. I see this when opening the TSV with both Excel and Pandas. I checked the TSV briefly with a hex editor and saw a double tab character in the data values near "dangerous_GC", but not in the header, so the header columns and the data columns are out of step.

Here was my FlashFry execution.

jar=FlashFry-assembly-1.13.jar
java -Xmx4g -jar "${jar}" score \
              --database "${db}" \
              --scoringMetrics hsu2013,doench2014ontarget,doench2016cfd,moreno2015,bedannotator,dangerous,minot,reciprocalofftargets,rank \
              --input "${discover_output}" \
              --output "tmp/scored_${discover_output}"

The head of the relevant TSV is attached, along with the result of opening it in Excel (showing the column offset):
head_scored_discover_Scaf17_hotspots_1000.fasta.output.tsv.zip

head_scored_discover_Scaf17_hotspots_1000.fasta.output.xlsx
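
The mismatch described in this report (a double tab in the data rows but not in the header) can be checked without a hex editor. The following is a minimal Python sketch, not part of FlashFry: it compares the field count of each data row with the header and reports which fields are empty. The function name and the sample rows are hypothetical.

```python
import csv
import io


def column_count_mismatches(lines, delimiter="\t"):
    """Return (row_number, field_count, empty_field_indexes) for rows whose
    field count differs from the header's. Row numbers are 1-based, header = 1."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    mismatches = []
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            empty = [i for i, value in enumerate(row) if value == ""]
            mismatches.append((row_number, len(row), empty))
    return mismatches


# Made-up example: the second line contains a doubled tab after the first field.
sample = io.StringIO("a\tb\tc\n1\t\t2\t3\n1\t2\t3\n")
report = column_count_mismatches(sample)
print(report)
```

An empty-field index in the report points at where the extra delimiter sits, which is usually enough to tell whether a column is missing from the header or a value is missing from the data.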

Discover command help message

Hi,
while for every other command you can get the help message by calling it without arguments, the discover one exits with an error because it tries to open the DB header .header (I guess the default value of the database parameter is the empty string). Should be a very quick fix.
Concerning all the help messages, I think it would be useful to include the list of scoring methods and enzymes, even without description, so that one can easily recall the exact parameters without checking the wiki.

Thanks for this awesome tool, it's one of a kind!
Simone

Support for Cpf1 and CasΦ

Amazing tool. I really like it because it is fast, especially when the genome is very large. I am very interested in using it to find off-targets for Cpf1 and the newly reported CasΦ (PAM = 'TBN'), because these enzymes are much smaller, which gives them great potential for future gene therapy. Is it possible to add the off-target scan function for Cpf1 and CasΦ? Waiting for your reply, and thank you in advance!
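
For readers unfamiliar with the PAM notation in this request: 'TBN' uses IUPAC nucleotide codes, where B means C, G, or T (anything but A) and N means any base. The Python sketch below only illustrates how such a PAM can be matched against a sequence; it says nothing about how FlashFry defines enzymes internally, and the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def pam_to_regex(pam):
    """Compile an IUPAC PAM string such as 'TBN' or 'NGG' into a regex."""
    return re.compile("".join(IUPAC[base] for base in pam.upper()))


def matches_pam(pam, sequence):
    """True if the whole sequence is one valid instance of the PAM."""
    return pam_to_regex(pam).fullmatch(sequence.upper()) is not None


print(matches_pam("TBN", "TCA"))  # T, then C (allowed by B), then any base
print(matches_pam("TBN", "TAA"))  # A in the second position is excluded by B
```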

bedannotator

Hey, great software. I'm currently trying to annotate a genome with gRNA sequences, and I want to turn the result into a UCSC browser track. Can you give me some instructions on how to use bedannotator to perform this task?

thanks,

Eli

issue parsing T2T-CHM13v2.0 during indexing

Dear Aaron,

I'm trying to index T2T-CHM13v2.0 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz) with FlashFry:

java -Xmx10g -jar FlashFry-assembly-1.15.jar index \
 --tmpLocation ./tmp \
 --enzyme spcas9ngg19 \
 --reference /2TBevo_new/T2T-CHM13v2.0.fa.gz \
 --database T2TCHM13v2.0_spcas9ngg19

Which is throwing the following error:

14:22:19.135 [main] INFO modules.BuildOffTargetDatabase - Discovering target sites in the input genome file...
14:22:19.147 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr1 CP068277.2 Homo sapiens isolate CHM13 chromosome 1
14:22:21.145 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr2 CP068276.2 Homo sapiens isolate CHM13 chromosome 2
14:23:35.648 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr3 CP068275.2 Homo sapiens isolate CHM13 chromosome 3
14:24:39.961 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr4 CP068274.2 Homo sapiens isolate CHM13 chromosome 4
14:25:32.330 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr5 CP068273.2 Homo sapiens isolate CHM13 chromosome 5
14:26:20.209 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr6 CP068272.2 Homo sapiens isolate CHM13 chromosome 6
14:27:07.387 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr7 CP068271.2 Homo sapiens isolate CHM13 chromosome 7
14:27:52.435 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr8 CP068270.2 Homo sapiens isolate CHM13 chromosome 8
14:28:35.155 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr9 CP068269.2 Homo sapiens isolate CHM13 chromosome 9
14:29:13.785 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr10 CP068268.2 Homo sapiens isolate CHM13 chromosome 10
14:29:56.145 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr11 CP068267.2 Homo sapiens isolate CHM13 chromosome 11
14:30:32.931 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr12 CP068266.2 Homo sapiens isolate CHM13 chromosome 12
14:31:10.173 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr13 CP068265.2 Homo sapiens isolate CHM13 chromosome 13
14:31:45.976 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr14 CP068264.2 Homo sapiens isolate CHM13 chromosome 14
14:32:14.238 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr15 CP068263.2 Homo sapiens isolate CHM13 chromosome 15
14:32:41.599 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr16 CP068262.2 Homo sapiens isolate CHM13 chromosome 16
14:33:09.938 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr17 CP068261.2 Homo sapiens isolate CHM13 chromosome 17
14:33:37.751 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr18 CP068260.2 Homo sapiens isolate CHM13 chromosome 18
14:34:03.194 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr19 CP068259.2 Homo sapiens isolate CHM13 chromosome 19
14:34:23.557 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr20 CP068258.2 Homo sapiens isolate CHM13 chromosome 20
14:34:43.140 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr21 CP068257.2 Homo sapiens isolate CHM13 chromosome 21
14:35:01.995 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chr22 CP068256.2 Homo sapiens isolate CHM13 chromosome 22
14:35:14.299 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chrX CP068255.2 Homo sapiens isolate CHM13 chromosome X
14:35:30.716 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chrY CP086569.2 Homo sapiens isolate NA24385 chromosome Y
14:36:09.753 [main] INFO reference.ReferenceEncoder$ - Switching to chromosome >chrM CP068254.1 Homo sapiens isolate CHM13 mitochondrion, complete genome
14:36:23.651 [main] INFO reference.ReferenceEncoder$ - Done looking for targets...
14:36:23.651 [main] INFO modules.BuildOffTargetDatabase - Closing the temporary binary output files...
14:36:24.501 [main] INFO modules.BuildOffTargetDatabase - Creating the final binary database file...
14:38:40.409 [main] INFO reference.binary.DatabaseWriter$ - Writing bin AATTGCT our 999 bin
14:39:53.967 [main] INFO reference.binary.DatabaseWriter$ - Writing bin ACTTATT our 1999 bin
14:41:59.520 [main] INFO reference.binary.DatabaseWriter$ - Writing bin AGTGTCT our 2999 bin
14:43:40.693 [main] INFO reference.binary.DatabaseWriter$ - Writing bin ATTGCTT our 3999 bin
14:45:55.422 [main] INFO reference.binary.DatabaseWriter$ - Writing bin CATGACT our 4999 bin
14:48:31.185 [main] INFO reference.binary.DatabaseWriter$ - Writing bin CCTCGTT our 5999 bin
Failed on processing file: /home/tyler/Documents/FlashFry/./tmp/binCCTTGG9469158892438854627.txt
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (modules.BuildOffTargetDatabase@451001e5): java.lang.IllegalStateException: Unable to parse line: chr6_CP068272.2_Homo_sapiens_isolate_CHM13_chromosome_6 6<877372 68877394 CCTTGGCCTCCCAAAGTGCTGG R
at picocli.CommandLine.execute(CommandLine.java:1056)
at picocli.CommandLine.access$900(CommandLine.java:142)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1255)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1223)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1131)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:1414)
at picocli.CommandLine.parseWithHandler(CommandLine.java:1353)
at main.scala.Main$.main(Main.scala:57)
at main.scala.Main.main(Main.scala)
Caused by: java.lang.IllegalStateException: Unable to parse line: chr6_CP068272.2_Homo_sapiens_isolate_CHM13_chromosome_6 6<877372 68877394 CCTTGGCCTCCCAAAGTGCTGG R
at crispr.CRISPRSite$.fromLine(CRISPRSite.scala:76)
at reference.binary.BlockReader.$anonfun$loadBlock$1(BlockReader.scala:94)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1406)
at reference.binary.BlockReader.loadBlock(BlockReader.scala:93)
at reference.binary.BlockReader.fetchBin(BlockReader.scala:62)
at reference.binary.DatabaseWriter$.$anonfun$writeToBinnedFileSet$1(DatabaseWriter.scala:82)
at reference.binary.DatabaseWriter$.$anonfun$writeToBinnedFileSet$1$adapted(DatabaseWriter.scala:78)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1406)
at reference.binary.DatabaseWriter$.writeToBinnedFileSet(DatabaseWriter.scala:78)
at modules.BuildOffTargetDatabase.run(BuildOffTargetDatabase.scala:82)
at picocli.CommandLine.execute(CommandLine.java:1048)
... 8 more

I'm not quite sure what's causing the parsing error. The fai index looks like this:

chr1 248387328 57 80 81
chr2 242696752 251492284 80 81
chr3 201105948 497222803 80 81
chr4 193574945 700842633 80 81
chr5 182045439 896837322 80 81
chr6 172126628 1081158386 80 81
chr7 160567428 1255436654 80 81
chr8 146259331 1418011232 80 81
chr9 150617247 1566098862 80 81
chr10 134758134 1718598884 80 81
chr11 135127769 1855041554 80 81
chr12 133324548 1991858480 80 81
chr13 113566686 2126849644 80 81
chr14 101161492 2241835973 80 81
chr15 99753195 2344262043 80 81
chr16 96330374 2445262212 80 81
chr17 84276897 2542796775 80 81
chr18 80542538 2628127193 80 81
chr19 61707364 2709676572 80 81
chr20 66210255 2772155338 80 81
chr21 45090682 2839193281 80 81
chr22 51324926 2884847656 80 81
chrX 154259566 2936814201 80 81
chrY 62460029 3093002071 80 81
chrM 16569 3156242926 80 81
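
One observation on the unparseable line, offered as a hedged sketch rather than a diagnosis: the start field reads `6<877372`, while the stop field `68877394` and the 22-base target suggest `68877372` was intended. In ASCII, `<` and `8` differ by a single bit. The Python below checks that, and scans lines for non-numeric coordinate fields. The whitespace-separated five-field layout (contig, start, stop, sequence, strand) is inferred from the error message only, and the function names are hypothetical.

```python
def differing_bits(a, b):
    """Count the bits that differ between the code points of two characters."""
    return bin(ord(a) ^ ord(b)).count("1")


def is_plain_integer(field):
    """True only for non-empty strings made of ASCII digits."""
    return field.isascii() and field.isdigit()


def malformed_coordinate_lines(lines):
    """Return (line_number, text) for lines whose start or stop is not numeric.

    Assumes five whitespace-separated fields: contig, start, stop, sequence, strand.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 5 or not (
            is_plain_integer(fields[1]) and is_plain_integer(fields[2])
        ):
            bad.append((number, line.rstrip("\n")))
    return bad


print(differing_bits("<", "8"))  # ord('<') = 60, ord('8') = 56
```

Running a scan like this over the files under the temporary directory would show whether the damage is limited to one line or more widespread, which helps separate a parsing bug from corrupted intermediate data.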

Many thanks for your help,

Tyler

off-targets ranking

Hi @aaronmck ,

Quick question about the off-targets:

In the results, the offTargets column contains all the potential off-targets. Are these off-targets sorted by any metric(s)?

Can you clarify?

Best,
Hanle

Parallelism on HPC clusters

Hi there,

We're trying to advise a user of our HPC cluster at the University of Birmingham who is attempting to use FlashFry for a large dataset. We just wanted to know whether the code is parallelised at all so we could advise appropriately on what resources they should request from the cluster scheduler, and if so, whether you'd done any scaling studies vs the number of cores?

Best wishes,
Ryan
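
Whether FlashFry itself is parallelised is not answered in this thread. If it turns out to be single-threaded, one generic workaround is to split the input FASTA into chunks and submit one process per chunk to the scheduler. The Python sketch below only does the splitting. It assumes that results for one input record do not depend on other records in the same file, which is an assumption and not something confirmed here, and the function names are hypothetical.

```python
def read_fasta_records(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line, []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists in turn; drop empty chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


example = [">a", "AC", "GT", ">b", "TT", ">c", "GG"]
chunks = split_round_robin(read_fasta_records(example), 2)
print(chunks)
```

Each chunk would then be written to its own file and passed to a separate run, with the per-chunk outputs concatenated afterwards. Every process loads the database independently, so memory per job should be requested accordingly.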
