
icity's Introduction

This file describes scripts designed to search for genes linked to a set of 
baits defined by the user. The scripts are listed in the order in which they 
are used in the general workflow of the pipeline. The pipeline takes as input 
a BLAST database, the coordinates of the coding sequences of the proteins 
present in the database, and a set of baits (coordinates in contigs).

The scripts are designed for a Unix environment and require the following 
programs to be installed, with the paths to their executables exported with "export PATH=":
Python 3.x: https://www.python.org/downloads/ (the scripts provided for the protocol were implemented with Python 3.4)
NCBI BLAST suite: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (version 2.7.1 was used for the example dataset and the scripts provided)
Clustering tools: MMseqs2 (preferred), available at https://github.com/soedinglab/MMseqs2 (the MMseqs2 version used in the protocol was downloaded in July 2018)
Sequence alignment: MUSCLE http://www.drive5.com/muscle/ (MUSCLE v3.7 was used in the protocol)

All the scripts can be found at:
ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/


icity.py
Python batch script that runs all computational steps of the pipeline. It requires all the Python files described below.
The script reads its input parameters from config.py and executes the following steps of the pipeline:
	Step 5, Identify protein-coding genes around the baits
	Step 6, Collect protein IDs from bait neighborhoods
	Step 7, Get protein sequences from the database
	Step 8, Run clustering with permissive parameters
	Step 9, Create protein profiles from representatives of a permissive cluster
	Step 10, Run BLAST search for generated protein profiles
	Step 11, Sort blast hits between clusters
	Step 12, Calculate relevance metrics for all protein clusters

Example:
python icity.py
Results are stored in the files specified in config.py (the default files are: Relevance.tsv - contains the ICITY 
metric for all clusters; VicinityPermissiveClustsLinear.tsv - contains information about cluster members).


config.py
Configuration file for icity.py. It has the following fields:
ICITY_CONFIG_INPUT - Parameters of the pipeline
#Parameter Name					Value for the example dataset	Description
PTYFile							Database/CDS.pty				Path to the file generated in Step 1
SeedsFile						Seeds.tsv						Path to the file generated in Step 3
NeighborhoodVicinitySize		10000							Offset around seed (base pairs)
PathToDatabase					Database/ProteinsDB				Path to the database generated in Step 1
PermissiveClusteringThreshold	0.3								Sequence similarity clustering threshold
SortingOverlapThreshold			0.4								Overlap threshold to sort BLAST hits
SortingCoverageThresold			0.25							Coverage threshold to sort BLAST hits

ICITY_CONFIG_OUTPUT - Output files of the pipeline
#Parameter Name				Example Value						Description
ICITYFileName				Relevance_09.tsv					File with ICITY values for protein clusters 
VicinityClustersFileName	VicinityPermissiveClustsLinear.tsv	File with protein clusters information 

ICITY_CONFIG_TEMPORARYFILES - List of temporary files generated by the pipeline
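
As a rough illustration of how these fields could look inside config.py, the sketch below writes the input and output groups as Python dictionaries filled with the example values from the tables above. The dictionary layout is an assumption made for illustration; the actual config.py may organize the parameters differently.

```python
# Hypothetical layout of config.py: the dictionary structure is an assumption,
# only the parameter names and example values come from the tables above.
ICITY_CONFIG_INPUT = {
    "PTYFile": "Database/CDS.pty",             # file generated in Step 1
    "SeedsFile": "Seeds.tsv",                  # file generated in Step 3
    "NeighborhoodVicinitySize": 10000,         # offset around seed (base pairs)
    "PathToDatabase": "Database/ProteinsDB",   # database generated in Step 1
    "PermissiveClusteringThreshold": 0.3,      # sequence similarity threshold
    "SortingOverlapThreshold": 0.4,            # overlap threshold for BLAST hits
    "SortingCoverageThresold": 0.25,           # name spelled as in the table above
}

ICITY_CONFIG_OUTPUT = {
    "ICITYFileName": "Relevance_09.tsv",
    "VicinityClustersFileName": "VicinityPermissiveClustsLinear.tsv",
}
```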


SelectNeighborhood.py
  -h, --help  show this help message and exit
  -p P        PTYDataFileName, complete pty for contigs. PTY file should
              contain the following values:
              LocusID	ORFStart..ORFStop	Strand	OrganismID	ContigID	Accession Number	GeneratedGI
              separated by tab symbol.
  -s S        SeedsFileName, seeds tsv file. File should contain the following
              values: LociID	ContigID	Start	Stop
              separated by tab symbol.
  -o O        ResultFileName, output pty file that contains the following values:
              GI	ORF Coordinates	Strand	Genome	Contig
              separated by tab symbol.
  -d D        Offset around seed (base pairs)
The script selects the subset of the PTY file located in the vicinity of the 
baits and saves it to the specified file.

Example: to select proteins within +-10 kb of the seeds in the example dataset 
and save them to the Vicinity.tsv file, run:
python SelectNeighborhood.py -p Database/CDS.pty -s Seeds.tsv -o Vicinity.tsv -d 10000
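
The selection step can be pictured with the minimal sketch below. It is not the code of SelectNeighborhood.py: the function names are hypothetical, and it assumes that an ORF is selected when it overlaps the window from (seed start - offset) to (seed stop + offset) on the same contig. The real script may use a different criterion, for example full containment.

```python
def parse_pty_line(line):
    """Split one PTY row; column order follows the -p description above."""
    fields = line.rstrip("\n").split("\t")
    start, stop = (int(x) for x in fields[1].split(".."))
    return {
        "coords": fields[1],
        "start": min(start, stop),
        "stop": max(start, stop),
        "strand": fields[2],
        "organism": fields[3],
        "contig": fields[4],
        "gi": fields[6],
    }


def select_neighborhood(pty_lines, seed_lines, offset):
    """Return output rows (GI, coordinates, strand, genome, contig) for ORFs
    overlapping a seed window. Overlap is an assumed criterion."""
    windows = {}  # contig ID -> list of (window start, window stop)
    for line in seed_lines:
        if not line.strip():
            continue
        _locus, contig, start, stop = line.rstrip("\n").split("\t")[:4]
        lo, hi = sorted((int(start), int(stop)))
        windows.setdefault(contig, []).append((lo - offset, hi + offset))

    selected = []
    for line in pty_lines:
        if not line.strip():
            continue
        orf = parse_pty_line(line)
        for lo, hi in windows.get(orf["contig"], []):
            if orf["start"] <= hi and orf["stop"] >= lo:
                selected.append("\t".join([orf["gi"], orf["coords"],
                                           orf["strand"], orf["organism"],
                                           orf["contig"]]))
                break  # report each ORF once even if several seeds are near
    return selected
```

Grouping the seed windows by contig keeps the lookup for each PTY row limited to the seeds on that contig.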


sh RunClust.sh
-	Argument 1: FASTA file name
-	Argument 2: Sequence similarity clustering threshold
-	Argument 3: Result clusters FileName

This script clusters the protein sequences in the given FASTA file using the 
given sequence similarity cutoff and saves the results to the specified file.

Example: to cluster the protein sequences in Vicinity.faa using a sequence 
similarity cutoff of 0.3 and save the results to 
VicinityPermissiveClustsLinear.tsv, run the following command:
sh RunClust.sh Vicinity.faa 0.3 VicinityPermissiveClustsLinear.tsv


RemoveFASTAIDRedundency.py
  -h, --help  show this help message and exit
  -f F        FASTA file name
The script removes everything except the first ID from each FASTA ID line in the file.

Example:
python RemoveFASTAIDRedundency.py -f Vicinity.faa > VicinityShortID.faa
This call reduces the following FASTA ID line
>gi|1000270260|gb|AAD36845.1| AAD36845.1 N-acetyl-gamma-glutamyl-phosphate reductase [Thermotoga maritima MSB8]
to
>1000270260
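
A minimal sketch of this header reduction is shown below. The function names are hypothetical, and the handling of headers that do not start with "gi|" (keeping the first whitespace-delimited token) is an assumption rather than documented behavior of RemoveFASTAIDRedundency.py.

```python
def shorten_fasta_id(header):
    """Reduce a '>gi|<id>|...' header line to '><id>'."""
    tokens = header[1:].split()
    if not tokens:
        return header  # empty header: leave unchanged
    parts = tokens[0].split("|")
    if parts[0] == "gi" and len(parts) > 1:
        return ">" + parts[1]
    # Assumption: for other header styles keep the first token as the ID.
    return ">" + tokens[0]


def shorten_fasta_ids(lines):
    """Apply shorten_fasta_id to header lines; sequence lines pass through."""
    return [shorten_fasta_id(line) if line.startswith(">") else line
            for line in lines]
```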


ConvertOutput.py
  -h, --help  show this help message and exit
  -f F        Cluster file name
The script converts the tab-separated MMseqs2 output file to a file in which 
each line represents one cluster. Each output line contains the cluster ID, a 
tab character, and the cluster members separated by space characters. This 
linear format is required by the scripts described below.

Example: 
python ConvertOutput.py -f VicinityPermissiveClusts.tsv > VicinityPermissiveClustsLinear.tsv
This call converts VicinityPermissiveClusts.tsv to the linear format and 
writes it to VicinityPermissiveClustsLinear.tsv.
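
The conversion can be sketched as follows. The input is assumed to be the two-column MMseqs2 cluster TSV (representative ID, tab, member ID, one pair per line), and the representative ID is assumed to serve as the cluster ID in the output; the function name is hypothetical and ConvertOutput.py itself may differ.

```python
from collections import OrderedDict


def linearize_clusters(tsv_lines):
    """Group 'representative<TAB>member' rows into one line per cluster."""
    clusters = OrderedDict()  # keeps clusters in order of first appearance
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters.setdefault(representative, []).append(member)
    # Cluster ID, tab, then space-separated members.
    return [rep + "\t" + " ".join(members)
            for rep, members in clusters.items()]
```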


MakeProfiles.py
  -h, --help  show this help message and exit
  -f F        Clusters file name
  -c C        Folder name where profiles will be saved
  -d D        Path to protein database
For each permissive cluster, the script uses MUSCLE to create a protein 
profile from the cluster's proteins in the genomic database and saves it to 
the CLUSTERS folder. Each file name has the CLUSTER_ prefix, followed by the 
line number of the cluster (used as the cluster ID) and the ".ali" extension. 
If the directory does not exist, the script creates it.

Example:
python MakeProfiles.py -f VicinityPermissiveClustsLinear.tsv -c CLUSTERS/ -d Database/ProteinsDB
The script makes a profile for each line in 
VicinityPermissiveClustsLinear.tsv and saves it to the CLUSTERS folder.


RunPSIBLAST.py
  -h, --help  show this help message and exit
  -c C        Folder name where profiles stored
  -d D        Path to protein database
The script runs PSIBLAST for each cluster profile in the specified folder and 
saves the BLAST hits to the same folder with the .hits extension in the following format:
ProteinID	BLAST Score	Alignment Start	Alignment Stop	Alignment Sequence	CLUSTERID	Contig	Is in Vicinity Islands	ORF Start	ORF Stop	Distance to the bait

Example:
python RunPSIBLAST.py -c CLUSTERS/ -d Database/ProteinsDB
This call takes the sequence alignments in the CLUSTERS folder, runs PSIBLAST, 
and saves the results with the .hits extension to the same folder.


GetIcityForBLASTHits.py
  -h, --help  show this help message and exit
  -f F        Sorted PSIBLAST hits file name
  -o O        Result tsv file
  -d D        Genomic database
  -c C        Permissive clusters file name
The script calculates the number of different proteins in the vicinity of the 
baits and in the entire genomic database, as well as the median distance to 
the baits, using the sorted PSIBLAST search results. It saves the results to a 
file with the following format:
Cluster ID	Effective size in vicinity of baits	Effective size in entire database	Median distance to bait (in ORFs)	Icity

Example:
python GetIcityForBLASTHits.py -f ClusterHitsFileName -o ResultFileName -d PathToDatabase -c PermissiveClustersFileName
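
To make the output columns more concrete, the sketch below computes them from a simplified list of hits. It rests on several assumptions that this file does not state: plain counts of distinct proteins stand in for the "effective size" values, Icity is taken to be the ratio of the size in the vicinity of the baits to the size in the entire database, and the median distance is taken over hits in the vicinity only. The function name and the simplified hit tuples are hypothetical; GetIcityForBLASTHits.py may compute these values differently.

```python
from statistics import median


def summarize_hits(hits):
    """hits: iterable of (protein_id, in_vicinity, distance_to_bait).

    Returns (size in vicinity, size in database, median distance, icity).
    All definitions here are simplifying assumptions, see the text above.
    """
    seen = {}
    for protein_id, in_vicinity, distance in hits:
        # Count each protein once, using its first hit.
        seen.setdefault(protein_id, (in_vicinity, distance))

    total = len(seen)
    near = [dist for in_vic, dist in seen.values() if in_vic]
    size_near = len(near)
    icity = size_near / total if total else 0.0
    median_distance = median(near) if near else None
    return size_near, total, median_distance, icity
```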


CalculateICITY.sh
-	Argument 1: Clusters folder path
-	Argument 2: Path to protein database
-	Argument 3: Path to the file with clusters information
-	Argument 4: Result file name with clusters relevance information

The script runs GetIcityForBLASTHits.py for each cluster BLAST hits file in 
the specified folder. It saves the results to a separate file for each cluster 
in the same folder, named with the CLUSTER_ prefix followed by the number of 
the cluster in the cluster file and the .tsv extension. It then merges all the 
files into one and saves it to the file given as the 4th argument.

Example:
sh RunEffectiveSizeEstimation.sh CLUSTERS/Sorted/



