global19 / icity
This project forked from ncbi/icity
License: Other
This file describes scripts designed to search for genes linked to a user-defined set of baits. The scripts are listed in the order in which they are used in the general workflow of the pipeline. The pipeline takes as input a BLAST database, the coordinates of the coding sequences of the proteins present in the database, and a set of baits (coordinates in contigs).

The scripts are designed to be used in a Unix environment. They require the following programs to be installed, with the paths to their executables exported with "export PATH=":

Python 3.xx: https://www.python.org/downloads/ (the scripts provided for the protocol were implemented with Python 3.4)
NCBI BLAST suite: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (version 2.7.1 was used for the example dataset and the scripts provided)
Clustering tools: mmseqs2 (preferred) can be downloaded at https://github.com/soedinglab/MMseqs2 (the mmseqs version used in the protocol was downloaded in July 2018)
Sequence alignment: MUSCLE http://www.drive5.com/muscle/ (MUSCLE v3.7 was used in the protocol)

All the scripts can be found at: ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/

icity.py
Python batch file that runs all computational steps of the pipeline. Requires all Python files described below.
The script reads the config.py file for input parameters and executes the following steps of the pipeline:
  Step 5, Identify protein-coding genes around the baits
  Step 6, Collect protein IDs from bait neighborhoods
  Step 7, Get protein sequences from the database
  Step 8, Run clustering with permissive parameters
  Step 9, Create protein profiles from representatives of a permissive cluster
  Step 10, Run BLAST search for generated protein profiles
  Step 11, Sort BLAST hits between clusters
  Step 12, Calculate relevance metrics for all protein clusters
Example:
  python icity.py
Results will be stored in the files specified in config.py (the default files are: Relevance.tsv, which contains the ICITY metric data for all clusters, and VicinityPermissiveClustsLinear.tsv, which contains information about cluster members).

config.py
Configuration file for icity.py. It has the following fields:

ICITY_CONFIG_INPUT - Parameters of the pipeline
  #Parameter Name                 Value for the example dataset   Description
  PTYFile                         Database/CDS.pty                Path to the file generated in Step 1
  SeedsFile                       Seeds.tsv                       Path to the file generated in Step 3
  NeighborhoodVicinitySize        10000                           Offset around seed (base pairs)
  PathToDatabase                  Database/ProteinsDB             Path to the database generated in Step 1
  PermissiveClusteringThreshold   0.3                             Sequence similarity clustering threshold
  SortingOverlapThreshold         0.4                             Overlap threshold to sort BLAST hits
  SortingCoverageThresold         0.25                            Coverage threshold to sort BLAST hits

ICITY_CONFIG_OUTPUT - Output files of the pipeline
  #Parameter Name            Example Value                        Description
  ICITYFileName              Relevance_09.tsv                     File with ICITY values for protein clusters
  VicinityClustersFileName   VicinityPermissiveClustsLinear.tsv   File with protein clusters information

ICITY_CONFIG_TEMPORARYFILES - List of temporary files generated by the pipeline

SelectNeighborhood.py
  -h, --help  show this help message and exit
  -p P        PTYDataFileName, complete pty for contigs.
              The PTY file should contain the following values, separated by tab symbols:
              LocusID, ORFStart..ORFStop, Strand, OrganismID, ContigID, Accession Number, GeneratedGI
  -s S        SeedsFileName, seeds tsv file. The file should contain the following values, separated by tab symbols:
              LociID, ContigID, Start, Stop
  -o O        ResultFileName, output pty file that contains the following values, separated by tab symbols:
              GI, ORF Coordinates, Strand, Genome, Contig
  -d D        Offset around seed (base pairs)
The script takes the subset of the pty file that lies in the vicinity of the baits and saves it to the file specified.
Example: to select proteins within +-10 kb of the seeds in the example dataset and save them into the Vicinity.tsv file, run:
  python SelectNeighborhood.py -p Database/CDS.pty -s Seeds.tsv -o Vicinity.tsv -d 10000

sh RunClust.sh
  - Argument 1: FASTA file name
  - Argument 2: Sequence similarity clustering threshold
  - Argument 3: Result clusters file name
This script clusters the protein sequences in the FASTA file using the given sequence similarity cutoff and saves the results into the file specified.
Example: to cluster the protein sequences in the file Vicinity.faa using a sequence similarity cutoff of 0.3 and save the results into VicinityPermissiveClustsLinear.tsv, run the following command:
  sh RunClust.sh Vicinity.faa 0.3 VicinityPermissiveClustsLinear.tsv

RemoveFASTAIDRedundency.py
  -h, --help  show this help message and exit
  -f F        FASTA file name
The script removes everything except the first ID from each FASTA ID line in the file.
Example:
  python RemoveFASTAIDRedundency.py -f Vicinity.faa > VicinityShortID.faa
This call will reduce the following FASTA ID
  >gi|1000270260|gb|AAD36845.1| AAD36845.1 N-acetyl-gamma-glutamyl-phosphate reductase [Thermotoga maritima MSB8]
to
  >1000270260

ConvertOutput.py
  -h, --help  show this help message and exit
  -f F        Cluster file name
The script converts the tab-separated mmseqs output file to a file in which each line represents one cluster. Each line of the output file contains the cluster ID, followed by a tab character, followed by the cluster members separated by space characters.
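The conversion just described can be sketched as follows. This is only an illustration, not the code of ConvertOutput.py: it assumes that the mmseqs tab-separated output holds one "representative<TAB>member" pair per line and that the representative serves as the cluster ID. The function name linearize_clusters is hypothetical.

```python
from collections import OrderedDict


def linearize_clusters(pair_lines):
    """Group 'representative<TAB>member' pairs into one line per cluster.

    Assumption (not confirmed by the README): the first column is the
    cluster representative and is used as the cluster ID.
    """
    clusters = OrderedDict()  # keeps clusters in order of first appearance
    for line in pair_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")[:2]
        clusters.setdefault(representative, []).append(member)
    # Cluster ID, a tab, then the members separated by spaces.
    return [rep + "\t" + " ".join(members) for rep, members in clusters.items()]


if __name__ == "__main__":
    # Made-up protein IDs, used only to show the shape of the data.
    example = ["P1\tP1", "P1\tP2", "P3\tP3", "P1\tP4"]
    for row in linearize_clusters(example):
        print(row)
```

Members are grouped by their cluster ID even when the pairs of one cluster are not adjacent in the input.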
The linear format is needed for the scripts used below.
Example:
  python ConvertOutput.py -f VicinityPermissiveClusts.tsv > VicinityPermissiveClustsLinear.tsv
This will convert VicinityPermissiveClusts.tsv to the new format in VicinityPermissiveClustsLinear.tsv.

MakeProfiles.py
  -h, --help  show this help message and exit
  -f F        Clusters file name
  -c C        Folder name where profiles will be saved
  -d D        Path to protein database
The script creates a protein profile for each permissive cluster listed in the clusters file, using MUSCLE to align the cluster's proteins taken from the genomic database. Each profile is saved to the CLUSTERS folder under a name made of the CLUSTER_ prefix, the line number of the cluster (used as the cluster ID) and the ".ali" extension. If the directory does not exist, the script creates it.
Example:
  python MakeProfiles.py -f VicinityPermissiveClustsLinear.tsv -c CLUSTERS/ -d Database/ProteinsDB
The script will make a profile for each line in VicinityPermissiveClustsLinear.tsv and save it to the CLUSTERS folder.

RunPSIBLAST.py
  -h, --help  show this help message and exit
  -c C        Folder name where profiles are stored
  -d D        Path to protein database
The script runs PSIBLAST for each cluster present in the specified folder and saves the BLAST hits to the same folder with the .hits extension, in the following format:
  ProteinID, BLAST Score, Alignment Start, Alignment Stop, Alignment Sequence, CLUSTERID, Contig, Is in Vicinity Islands, ORF Start, ORF Stop, Distance to the bait
Example:
  python RunPSIBLAST.py -c CLUSTERS/ -d Database/ProteinsDB
This will take the sequence alignments in the CLUSTERS folder, run PSIBLAST and save the results with the .hits extension to the same folder.

GetIcityForBLASTHits.py
  -h, --help  show this help message and exit
  -f F        Sorted PSIBLAST hits file name
  -o O        Result tsv file
  -d D        Genomic database
  -c C        Permissive clusters file name
Using the sorted PSIBLAST search results, the script calculates the number of different proteins at the baits and in the entire genomic database, as well as the median distance to the baits.
The results are saved to a file with the following format:
  Cluster ID, Effective size in vicinity of baits, Effective size in entire database, Median distance to bait (in ORFs), Icity
Example:
  python GetIcityForBLASTHits.py -f ClusterHitsFileName -o ResultFileName -d PathToDatabase -c PermissiveClustersFileName

CalculateICITY.sh
  - Argument 1: Clusters folder path
  - Argument 2: Path to protein database
  - Argument 3: Path to the file with clusters information
  - Argument 4: Result file name with clusters relevance information
The script runs GetIcityForBLASTHits.py for each cluster BLAST hits file in the specified folder. It saves the results into a separate file for each cluster in the same folder, named with the CLUSTER_ prefix followed by the number of the cluster in the cluster file and the .tsv extension. It then merges all these files into one and saves it into the file provided as the 4th argument.
Example:
  sh RunEffectiveSizeEstimation.sh CLUSTERS/Sorted/
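The final merge step described for CalculateICITY.sh can be illustrated with a short Python sketch. It is not taken from the pipeline: the function name merge_cluster_tables is hypothetical, and the sketch assumes that the per-cluster files are named CLUSTER_<number>.tsv, carry no header line, and should be merged in order of cluster number.

```python
import glob
import os
import re


def merge_cluster_tables(folder, result_path):
    """Concatenate CLUSTER_<number>.tsv files from `folder` into `result_path`.

    Assumptions (not confirmed by the README): the files have no header
    line and are merged in ascending order of cluster number.
    Returns the number of files merged.
    """
    name_pattern = re.compile(r"^CLUSTER_(\d+)\.tsv$")
    numbered = []
    for path in glob.glob(os.path.join(folder, "CLUSTER_*.tsv")):
        match = name_pattern.match(os.path.basename(path))
        if match:
            numbered.append((int(match.group(1)), path))
    with open(result_path, "w") as result:
        for _, path in sorted(numbered):  # numeric order: 1, 2, 10
            with open(path) as table:
                for line in table:
                    if line.strip():  # drop empty lines
                        result.write(line.rstrip("\n") + "\n")
    return len(numbered)
```

For example, merge_cluster_tables("CLUSTERS/Sorted/", "Relevance.tsv") would collect the per-cluster rows into one table. Sorting by the parsed number rather than by file name keeps CLUSTER_10.tsv after CLUSTER_2.tsv.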