sagnikbanerjee15 / finder

A fully automated gene annotator from RNA-Seq expression data

License: MIT License

Dockerfile 16.60% Python 49.95% Common Workflow Language 33.45%

gene-annotations rna-seq gtf finder protein-sequences changepoint-detection gene-models transcripts predict-genes bioinformatics

finder's Introduction

Welcome to `finder2`

MORE DETAILS COMING

finder is a gene annotation pipeline that automates downloading short reads, aligning them, and using the assembled transcripts to generate gene annotations. It additionally uses protein sequences and reports gene predictions from BRAKER2. It is fast, scalable, platform-independent software that generates gene annotations in GTF format. finder accepts inputs through the command line interface. It finds several novel genes/transcripts and also reports the tissues/conditions in which they were found. finder is released as a docker image. Users need python3 installed on their system to run finder. The header script creates either a docker container or a singularity container, depending on what is installed on the system, with preference given to docker.

If you use finder for your research please cite

Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z Sen, Roger P Wise, and Carson M Andorf. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics

Installation

finder requires a number of software packages to be installed. This might cause version conflicts with software that is already installed on your system. Hence, the developers have decided to enforce the use of finder within a conda environment.

Installing `finder` from `GitHub`

git clone https://github.com/sagnikbanerjee15/Finder.git

Downloading `finder` from release (Latest stable version)

wget https://github.com/sagnikbanerjee15/Finder/archive/refs/tags/finder_v1.1.0.tar.gz
tar -xvzf finder_v1.1.0.tar.gz
cd finder_v1.1.0
echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc
source ~/.bashrc

You can run finder using the command outlined in [this section](#Running Finder). When the run_finder command is executed, it pulls the latest docker image from Docker Hub. Depending on what is installed, the program creates either a docker or a singularity container and executes the main program inside it. If you wish to build the docker image locally, execute the following command:

docker build -t sagnikbanerjee15/finder:1.1.0 .

Please remember to add proxies if you are on a VPN.
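The container choice described above (docker preferred, singularity as the fallback) can be sketched in a few lines of Python. This is only an illustration of the stated preference order, not the actual header script; the function name `choose_framework` is hypothetical.

```python
import shutil


def choose_framework(which=shutil.which):
    """Return 'docker' if it is on PATH, else 'singularity', else None.

    `which` is injectable so the preference order can be checked
    without either tool being installed.
    """
    for name in ("docker", "singularity"):  # docker is tried first
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    framework = choose_framework()
    print(framework or "neither docker nor singularity was found on PATH")
```

The `--framework {docker,singularity}` option listed in the help menu can be used to make the choice explicit instead.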

finder runs BRAKER2, which depends on GeneMark-ET. GeneMark-ET is hosted on the University of Georgia website. Its license prohibits redistribution, which is why it could not be included in this package. Hence, users have to download the software manually and provide its path as input to finder. Please follow the instructions below to download the software and the key:

1. Open a browser of your choice
2. Go to this website
3. Select the option GeneMark-ES/ET/EP ver 4.62_lic (2nd from top) and LINUX 64
4. Enter your name, institution, country and email-id and click on the button that says I agree to the terms of this license agreement
5. Right click on the link that says Please download program here and select Copy Link Address
6. Then type in wget and paste the path you just copied
7. This command will download the file gmes_linux_64.tar.gz in the current directory
8. Now, right click on the link that says 64_bit and select Copy Link Address
9. Then type in wget and paste the path you just copied
10. This command will download the file gm_key_64.tar.gz in the current directory. Please note that this key will expire one year from the date of download.
11. Execute the following commands:

tar -xvzf gm_key_64.tar.gz
tar -xvzf gmes_linux_64.tar.gz
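Because the key expires one year after download, it can be useful to check its age before a long run. The sketch below is an assumption-laden helper, not part of finder: it uses the file modification time as a stand-in for the download date, which may be inaccurate because tar extraction can restore the timestamp stored in the archive. The function names are hypothetical.

```python
import os
import time

MAX_KEY_AGE_DAYS = 365  # the key is stated to expire one year after download


def key_age_days(path, now=None):
    """Age of the key file in days, based on its modification time (an approximation)."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 86400.0


def key_probably_valid(path, now=None):
    """True if the key file looks younger than one year."""
    return key_age_days(path, now) < MAX_KEY_AGE_DAYS
```

Call `key_probably_valid()` with the path of the extracted key file that you later pass to `--genemark_license`.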

Executing FINDER with Sample data

Please follow the instructions below to generate gene annotations using Arabidopsis thaliana. A CSV template is provided with the release in example/Arabidopsis_thaliana_metadata.csv. Keep all the headers intact and replace the data with your samples of choice. Note that FINDER works both with data downloaded from NCBI and with data in local directories. Below is a detailed description of each column of the metadata file. All the fields must be present in the metadata file. Mandatory fields must contain valid data, whereas fields like Description, Date and Read Length can be left empty.

Column Name	Column Description	Mandatory
BioProject	Name of the bioproject that the data belongs to. If you are using locally saved data then please enter a dummy project name. Please note that FINDER will NOT be able to process empty fields of Bioproject.	YES
SRA Accession	Enter the SRA Accession number of the sample that you expect `finder` to use for generating the gene annotations. FINDER will use this ID to download the read samples from NCBI-SRA. To use data that is not uploaded to NCBI, enter the name of the local file without its extension. For example, if your filename is `sample1.fastq`, enter `sample1` in this field. `finder` assumes all files have the extension `.fastq`. If files on your system end with `.fq`, please rename them to `*.fastq`. For paired-end samples, do not include the pair information in this field. For example, if you have 2 files `sample2_1.fastq` and `sample2_2.fastq`, enter `sample2` in this field.	YES
Tissues	Mention the tissue type or condition from which the sample has been collected. `finder` will report the tissues that are associated with a particular transcript. This can be used to find gene models that are expressed in a specific tissue and/or condition	YES
Description	A brief description of the data. This field is not mandatory and is not used by `finder`. It is up to the user to enter whatever metadata is deemed important.	NO
Date	Enter the date of producing the RNA-Seq sample. This field is not mandatory and is not used by `finder`.	NO
Read Length (bp)	Enter the length of the reads. This field is not mandatory and is not used by `finder`.	NO
Ended	Enter either PE or SE for paired-end or single-end reads. No other value should be entered.	YES
RNA-Seq	Enter 1 for all the rows. This field is included for future extensions.	YES
process	Enter 1 if you wish to process the sample. If a value of 0 is present, then `finder` will ignore the sample	YES
Location	Enter the location of the directory. For samples to be downloaded from NCBI, this field should be left empty. If the location of a directory is provided here then `finder` will assume that the sample is present in it. `finder` will generate an error if the sample is not found in this directory. It is not necessary to have all the samples in the same directory.	YES
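Before launching a run, the rules in the table above can be checked with a short script. The sketch below assumes the CSV header strings match the column names in the table exactly (check the shipped template, since that is not confirmed here). It treats Location as allowed to be empty, because the description says to leave it blank for samples downloaded from NCBI. `validate_metadata` is a hypothetical helper, not part of finder.

```python
import csv
import io

# Column names copied from the table above (assumed to match the CSV header).
EXPECTED_COLUMNS = ["BioProject", "SRA Accession", "Tissues", "Description", "Date",
                    "Read Length (bp)", "Ended", "RNA-Seq", "process", "Location"]
REQUIRED_NON_EMPTY = ["BioProject", "SRA Accession", "Tissues", "Ended", "RNA-Seq", "process"]


def validate_metadata(handle):
    """Return a list of human-readable problems found in a metadata CSV."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED_NON_EMPTY:
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: {column} is empty")
        ended = (row["Ended"] or "").strip()
        if ended and ended not in ("PE", "SE"):
            problems.append(f"line {line_number}: Ended must be PE or SE, got {ended!r}")
        process = (row["process"] or "").strip()
        if process and process not in ("0", "1"):
            problems.append(f"line {line_number}: process must be 0 or 1, got {process!r}")
    return problems


if __name__ == "__main__":
    sample = ",".join(EXPECTED_COLUMNS) + "\nDUMMY,sample1,leaf,,,,PE,1,1,/data/reads\n"
    print(validate_metadata(io.StringIO(sample)) or "no problems found")
```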

To optimize disk space usage, finder processes read samples one BioProject at a time. Once the data is downloaded and the reads are mapped, FINDER removes the downloaded data (unless --no_cleanup is specified) to save disk space. Samples that were already present locally are not removed.
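To preview how the samples in a metadata file fall into per-BioProject batches, and which of them come from local directories and so are never deleted, something like the following could be used. This is a sketch for inspection only and does not reproduce finder's internal scheduling; `plan_batches` is a hypothetical name, and the column names are assumed to match the table above.

```python
import csv


def plan_batches(handle):
    """Group the samples marked process=1 by BioProject, in file order.

    Returns {bioproject: [(sample, origin), ...]} where origin is 'local'
    when the Location field is filled in and 'NCBI' otherwise.
    """
    batches = {}
    for row in csv.DictReader(handle):
        if row["process"].strip() != "1":
            continue  # rows with process=0 are ignored
        origin = "local" if row["Location"].strip() else "NCBI"
        sample = row["SRA Accession"].strip()
        batches.setdefault(row["BioProject"].strip(), []).append((sample, origin))
    return batches
```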

Running FINDER

Help menu for FINDER can be launched by the following command:

run_finder -h

usage: run_finder [-h] [--version] --metadatafile METADATAFILE --output_directory OUTPUT_DIRECTORY --genome GENOME --organism_model {VERT,INV,PLANTS,FUNGI} --genemark_path GENEMARK_PATH --genemark_license GENEMARK_LICENSE [--cpu CPU] [--genome_dir_star GENOME_DIR_STAR]
              [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE] [--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data] [--checkpoint CHECKPOINT] [--perform_post_completion_data_cleanup] [--run_tests] [--addUTR] [--skip_cpd] [--exonerate_gff3 EXONERATE_GFF3]
              [--star_shared_mem] [--framework {docker,singularity}]

Generates gene annotation from RNA-Seq data

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Required arguments:
  --metadatafile METADATAFILE, -mf METADATAFILE
                        Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject, SRA Accession, Tissues, Description, Date, Read Length, Ended (PE or SE), RNA-Seq, process, Location. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file. 
  --output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
                        Enter the name of the directory where all other operations will be performed
  --genome GENOME, -g GENOME
                        Enter the SOFT-MASKED genome file of the organism
  --organism_model {VERT,INV,PLANTS,FUNGI}, -om {VERT,INV,PLANTS,FUNGI}
                        Enter the type of organism
  --genemark_path GENEMARK_PATH, -gm GENEMARK_PATH
                        Enter the path to genemark
  --genemark_license GENEMARK_LICENSE, -gml GENEMARK_LICENSE
                        Enter the licence file. Please make sure your license file is less than 365 days old

Optional arguments:
  --cpu CPU, -n CPU     Enter the number of CPUs to be used.
  --genome_dir_star GENOME_DIR_STAR, -gdir_star GENOME_DIR_STAR
                        Please enter the location of the genome index directory of STAR
  --genome_dir_olego GENOME_DIR_OLEGO, -gdir_olego GENOME_DIR_OLEGO
                        Please enter the location of the genome index directory of OLego
  --verbose VERBOSE, -verb VERBOSE
                        Enter a verbosity level
  --protein PROTEIN, -p PROTEIN
                        Enter the protein fasta
  --no_cleanup, -no_cleanup
                        Provide this option if you do not wish to remove any intermediate files. Please note that this will NOT remove any files and might take up a large amount of space
  --preserve_raw_input_data, -preserve
                        Set this argument if you want to preserve the raw fastq files. All other temporary files will be removed. These fastq files can be later used. 
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Enter a value if you wish to restart operations from a certain check point. Please note if you have new RNA-Seq samples, then FINDER will override this argument and computation will take place from read alignment. If there are missing data in any step then also FINDER will enforce restart of operations from a previous
                        . For example, if you wish to run assembly on samples for which alignments are not available then FINDER will readjust this value and set it to 1.
                            1. Align reads to reference genome (Will trigger removal of all alignments and start from beginning)
                            2. Assemble with PsiCLASS (Will remove all assemblies)
                            3. Find genes with FINDER (entails changepoint detection)
                            4. Predict genes using BRAKER2 (Will remove previous results of gene predictions with BRAKER2)
                            5. Annotate coding regions
                            6. Merge FINDER annotations with BRAKER2 predictions and protein sequences
                            
  --perform_post_completion_data_cleanup, -pc_clean
                        Set this field if you wish to clean up all the intermediate files after the completion of the execution. If this operation is requested prior to generation of all the important files then it will be ignored and finder will proceed to annotate the genome. 
  --run_tests, -rt      Modify behaviour of finder to accelerate tests. This will reduce the downloaded fastq files to a bare minimum and also check the other installations
  --addUTR, --addUTR    Turn on this option if you wish BRAKER to add UTR sequences
  --skip_cpd, --skip_cpd
                        Turn on this option to skip changepoint detection. Could be effective for grasses
  --exonerate_gff3 EXONERATE_GFF3, -egff3 EXONERATE_GFF3
                        Enter the exonerate output in gff3 format
  --star_shared_mem, --star_shared_mem
                        Turn on this option if you want STAR to load the genome index into shared memory. This saves memory if multiple finder runs are executing on the same host, but might not work in your cluster environment.
  --framework {docker,singularity}, -fm {docker,singularity}
                        Enter your choice of framework

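finder can be launched using the following command. Note that the usage text above lists --organism_model, --genemark_path and --genemark_license as required, so add them (for example -om PLANTS for Arabidopsis) together with the paths to your GeneMark download: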

run_finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error

This program will download and run the entire process of annotation. The duration of execution will depend on your internet speed and the number of cores you assigned to FINDER. Also, FINDER is designed in a way to handle a large number of RNA-Seq samples. So the speedup might not be noticeable with just a few samples.
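For scripted runs, the same invocation can be assembled from Python. The sketch below uses only flags that appear in the help menu above. The helper names and all paths are placeholders, and it assumes `run_finder` is on your PATH.

```python
import subprocess

ORGANISM_MODELS = {"VERT", "INV", "PLANTS", "FUNGI"}  # choices from the help menu


def build_finder_command(metadata, output_dir, genome, organism_model,
                         genemark_path, genemark_license,
                         cpu=1, protein=None, no_cleanup=False):
    """Build the run_finder argument list; nothing is executed here."""
    if organism_model not in ORGANISM_MODELS:
        raise ValueError(f"organism_model must be one of {sorted(ORGANISM_MODELS)}")
    command = ["run_finder",
               "--metadatafile", metadata,
               "--output_directory", output_dir,
               "--genome", genome,
               "--organism_model", organism_model,
               "--genemark_path", genemark_path,
               "--genemark_license", genemark_license,
               "--cpu", str(cpu)]
    if protein:
        command += ["--protein", protein]
    if no_cleanup:
        command.append("--no_cleanup")
    return command


def launch(command, stdout_path, stderr_path):
    """Run the command, redirecting stdout and stderr to files; return the exit code."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        return subprocess.run(command, stdout=out, stderr=err, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.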

We recommend that you preserve all intermediate files while finder runs, and remove them only after the run has completed by executing the following command:

run_finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -preserve -pc_clean 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error

Enforcing running of `finder` from preset checkpoints

finder allows users to enforce execution from a specific checkpoint. Requesting a particular checkpoint does not mean that finder will skip all previous steps. It means that finder will remove all files generated by the processes after that checkpoint, so that the modules recalculate them. Below is a description of all the checkpoints that finder accepts:

1. Align reads to reference genome - Requesting finder to start from this checkpoint will trigger removal of all previous alignments.
2. Assemble with PsiCLASS - Requesting finder to start from this checkpoint will trigger removal of the assemblies that were previously generated. Aligned files will not be removed. If some RNA-Seq samples are not yet aligned, FINDER will align those first before attempting to assemble them.
3. Find genes with finder - finder will regenerate all files produced after assembly by PsiCLASS.
4. Predict genes using BRAKER2 - finder will rerun the BRAKER2 step.
5. Annotate coding regions - finder will restart from annotating the coding sequences.
6. Merge finder annotations with BRAKER2 predictions and protein sequences - finder will generate merged annotations from RNA-Seq samples, predictions and protein sequences.

If you wish to start finder from downloading the SRA samples, please delete the output directory and start over.
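The checkpoint numbers accepted by `--checkpoint` follow the order in which the steps are listed in the help menu. A small lookup such as the one below can help when building command lines. It is a sketch: the step names are copied from the help text, while the helper names are hypothetical and the list of rerun steps only reflects the description above, not finder's internal logic (finder may move the checkpoint earlier when data is missing).

```python
CHECKPOINTS = {
    1: "Align reads to reference genome",
    2: "Assemble with PsiCLASS",
    3: "Find genes with FINDER",
    4: "Predict genes using BRAKER2",
    5: "Annotate coding regions",
    6: "Merge FINDER annotations with BRAKER2 predictions and protein sequences",
}


def checkpoint_arguments(number):
    """Return the command line arguments that request a given checkpoint."""
    if number not in CHECKPOINTS:
        raise ValueError(f"checkpoint must be one of {sorted(CHECKPOINTS)}")
    return ["--checkpoint", str(number)]


def steps_recomputed_from(number):
    """List the steps that would be recalculated when restarting at `number`."""
    checkpoint_arguments(number)  # reuse the validation
    return [CHECKPOINTS[n] for n in sorted(CHECKPOINTS) if n >= number]
```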

Output Files

All relevant output files generated by finder can be found in the final_GTF_files directory under the output directory. Below is the list of files and what data they contain

braker.gtf - gene models generated by BRAKER2
braker_utr.gtf - gene models, with UTR models, generated by BRAKER2
combined_redundant_transcripts_removed.gtf - GTF file from PsiCLASS output
combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf - GTF file after splitting transcripts. This file is generated only from RNA-Seq expression evidence
combined_with_CDS.gtf - finder output with CDS predicted by GeneMark-S/T
combined_with_CDS_high_conf.gtf - finder gene models with high confidence
combined_with_CDS_low_conf.gtf - finder gene models with low confidence
combined_with_CDS_BRAKER_appended_high_conf.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
combined_with_CDS_high_and_low_confidence_merged.gtf - High and Low confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
FINDER_BRAKER_PROT.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models and gene models from protein evidence
tissue/condition to transcript - A file with two columns. The first column lists the transcripts and the second column lists the tissues/conditions they were found in. In future versions, we will include the functionality of extracting transcripts specific to a tissue/condition.
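A quick way to compare the GTF files listed above is to count the distinct transcripts in each. The sketch below relies only on the standard GTF layout (nine tab-separated columns, with `transcript_id "..."` in the attribute column). The function name is hypothetical, and the exact attributes finder writes beyond `transcript_id` are not assumed.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_ids(gtf_lines):
    """Collect the distinct transcript_id values from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids
```

For example, `len(transcript_ids(open("combined_with_CDS_high_conf.gtf")))` gives the number of high confidence transcripts.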

Intermediate files and folders [To be updated]

finder generates several intermediate files and folders. This section contains a detailed outline of the contents of each folder and what each file represents.

Checking Progress

finder writes progress information to a log file named progress.log, located in the output directory. When reporting issues, please make sure you attach this log file.
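To see where a run spent its time, the timestamps in progress.log can be turned into per-step durations. The sketch below assumes the line format `YYYY-MM-DD HH:MM:SS,mmm - finder - LEVEL - message`, as seen in the progress.log excerpt quoted in one of the issues further down. That format is not guaranteed to be stable across versions, and the helper names are hypothetical.

```python
import re
from datetime import datetime

LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (\S+) - (\w+) - (.*)$")


def parse_progress_log(lines):
    """Return (timestamp, level, message) tuples for lines that match the format."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append((stamp, match.group(3), match.group(4)))
    return entries


def step_durations(entries):
    """Pair each message with the seconds elapsed until the next log entry."""
    return [(message, (later - stamp).total_seconds())
            for (stamp, _, message), (later, _, _) in zip(entries, entries[1:])]
```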

Restarting previous runs with more RNA-Seq samples

finder lets users add data to annotation runs that have already completed. Users need to update the metadata.csv file with the new RNA-Seq data and rerun finder. The program will determine an optimal starting point. finder will skip downloading already processed RNA-Seq samples and will proceed with the new data. Users also have the option of removing some previously supplied RNA-Seq samples.

Utilities included with FINDER

finder provides 2 utilities which can be used independently.

downloadAndDumpFastqFromSRA.py - A python program that optimizes the download of data from SRA. The IDs of the RNA-Seq runs (or of any other sequencing runs) need to be provided as a newline-separated file. The program will download the RNA-Seq files using the requested number of cores, convert them to fastq and remove the .sra files. downloadAndDumpFastqFromSRA.py will keep querying the SRA database in the event of a failure.

python downloadAndDumpFastqFromSRA.py -h
usage: download_and_dump_fastq_from_SRA.py [-h] --sra SRA --output OUTPUT
                                           [--cpu CPU]

Parallel download of fastq data from NCBI. Program will create the output
directory if it is not present. If fastq file is present, then downloading is
skipped. Program optimizes downloading of sra files and converting to fastq by
utilizing multiple CPU cores.

optional arguments:
  -h, --help            show this help message and exit
  --sra SRA, -s SRA     Please enter the name of the file which has all the
                        SRA ids listed one per line. Please note the
                        bioproject IDS cannot be processed
  --output OUTPUT, -o OUTPUT
                        Please enter the name of the output directory.
                        Download will be skipped if file is present
  --cpu CPU, -n CPU     Enter the number of CPUs to be used.
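The input for this utility is a plain text file with one SRA run ID per line. The sketch below writes such a file and assembles the matching command from the flags in the help text above. The helper names are hypothetical, and the script path has to be adjusted to wherever the utility lives in your installation.

```python
def write_sra_id_file(sra_ids, path):
    """Write one SRA run ID per line, skipping blanks and duplicates."""
    seen = []
    for sra_id in sra_ids:
        sra_id = sra_id.strip()
        if sra_id and sra_id not in seen:
            seen.append(sra_id)
    with open(path, "w") as handle:
        handle.write("".join(sra_id + "\n" for sra_id in seen))
    return seen


def build_download_command(id_file, output_dir, cpu=1,
                           script="downloadAndDumpFastqFromSRA.py"):
    """Build the argument list for the download utility; nothing is executed."""
    return ["python", script, "--sra", id_file, "--output", output_dir, "--cpu", str(cpu)]
```

As the help text notes, BioProject IDs cannot be processed, so the file should contain run accessions only.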

verifyInputsToFINDER.py - This program verifies whether all the requested samples are in fact from a transcriptomic source of the organism whose genome is being annotated.

python verifyInputsToFINDER.py -h
usage: verify_inputs_to_finder.py [-h] --metadatafile METADATAFILE --srametadb
                                  SRAMETADB --taxon_id TAXON_ID

Verifies whether all the data are transcriptomic and from the organism under
consideration

optional arguments:
  -h, --help            show this help message and exit
  --metadatafile METADATAFILE, -mf METADATAFILE
                        Please enter the name of the metadata file. Enter 0 in
                        the last column of those samples which you wish to
                        skip processing. The columns should represent the
                        following in order --> BioProject,Run,tissue_group,tis
                        sue,description,Date,read_length,ended (PE or
                        SE),directorypath,download,skip. If the sample is
                        skipped it will not be downloaded. Leave the directory
                        path blank if you are downloading the samples. In the
                        end of the run the program will output a csv file with
                        the directory path filled out. Please check the
                        provided csv file for more information on how to
                        configure the metadata file.
  --srametadb SRAMETADB, -m SRAMETADB
                        Enter the location of the SRAmetadb file.
  --taxon_id TAXON_ID, -t TAXON_ID
                        Enter the taxonomic id of the organism. Enter -1 if
                        you are working on a non-model organism or a sub-
                        species for which no taxonomic id exists.

Terms of use

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Support

Please report all issues here.


finder's Issues

Genome examples empty?

Hi I am trying to predict gene structure using Finder.
And it seems this tool is better than PASA, MAKER, and others.
So I am planning to get used to this tool.

However, the example data you shared, I could get metadata, protein data, and rawdata but could not find genome data. Where can I find the genome sequence?

Thanks a lot for letting us use this beautiful tool.
Sincerely, Paul.

Will this error affect final results?

Hi,
I got the following error when running finder, will it affect the final results?
fixOverlappingAndMergedTranscripts.py:347: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
coverage_info[transcript_id]["bed_cov"]=np.array(temp)

Best,
Kun

Error during STAR alignment

Hi there,

I am getting an error while using Finder with RNA-Seq data available in my local directory. Here's the error:

cat: /elegans/Finder/FINDER_elegans/alignments/SRR9265068_round3_SJ.out.tab: No such file or directory
cat: /elegans/Finder/FINDER_elegans/alignments/SRR1741331_round3_SJ.out.tab: No such file or directory

Here's the output (partial) generated in the alignments directory:

-rw-r--r--  1 bshrestha ebpproject 186K Oct 12 11:22 SRR9265068_final.sortedByCoord.out.bam.csi
-rw-r--r--  1 bshrestha ebpproject 356K Oct 12 11:21 SRR9265068_final.sortedByCoord.out.bam.bai
-rw-r--r--  1 bshrestha ebpproject 3.1M Oct 12 11:20 not_available_round1_and_round2_and_round3_SJ.out.tab
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 11:20 not_available_round3_SJ.out.tab
drwx------  3 bshrestha ebpproject  512 Oct 12 11:20 SRR9265068_final__STARtmp
-rw-r--r--  1 bshrestha ebpproject  239 Oct 12 11:20 SRR9265068_relaxed.output
-rw-r--r--  1 bshrestha ebpproject 2.0K Oct 12 11:20 SRR9265068_round3_Log.final.out
-rw-r--r--  1 bshrestha ebpproject 4.8G Oct 12 11:20 SRR9265068_final.sortedByCoord.out.bam
-rw-r--r--  1 bshrestha ebpproject 968M Oct 12 11:17 SRR9265068_round3_Unmapped.out.mate2
-rw-r--r--  1 bshrestha ebpproject 968M Oct 12 11:17 SRR9265068_round3_Unmapped.out.mate1
-rw-r--r--  1 bshrestha ebpproject 4.2M Oct 12 11:17 SRR9265068_final_SJ.out.tab
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 11:05 SRR9265068_relaxed.error
-rw-r--r--  1 bshrestha ebpproject 3.1M Oct 12 11:05 not_available_round1_and_round2_SJ.out.tab
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 11:05 not_available_round2_SJ.out.tab
drwx------  3 bshrestha ebpproject  512 Oct 12 11:05 SRR9265068_round2__STARtmp
-rw-r--r--  1 bshrestha ebpproject 2.2G Oct 12 11:05 SRR9265068_round2_Unmapped.out.mate2
-rw-r--r--  1 bshrestha ebpproject  305 Oct 12 11:04 SRR9265068_round2.output
-rw-r--r--  1 bshrestha ebpproject 2.2G Oct 12 11:04 SRR9265068_round2_Unmapped.out.mate1
-rw-r--r--  1 bshrestha ebpproject 2.0K Oct 12 11:04 SRR9265068_round2_Log.final.out
-rw-r--r--  1 bshrestha ebpproject 206M Oct 12 11:04 SRR9265068_round2_Aligned.sortedByCoord.out.bam
-rw-r--r--  1 bshrestha ebpproject 2.1M Oct 12 11:04 SRR9265068_round2_SJ.out.tab
drwx------  2 bshrestha ebpproject 1.0K Oct 12 10:57 SRR9265068_round2__STARgenome
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 10:57 SRR9265068_round2.error
-rw-r--r--  1 bshrestha ebpproject 3.1M Oct 12 10:57 not_available_round1_SJ.out.tab
drwx------  3 bshrestha ebpproject  512 Oct 12 10:57 SRR9265068_round1__STARtmp
-rw-r--r--  1 bshrestha ebpproject  239 Oct 12 10:57 SRR9265068_round1.output
-rw-r--r--  1 bshrestha ebpproject 2.0K Oct 12 10:57 SRR9265068_round1_Log.final.out
-rw-r--r--  1 bshrestha ebpproject 3.8G Oct 12 10:57 SRR9265068_round1_Aligned.sortedByCoord.out.bam
-rw-r--r--  1 bshrestha ebpproject 2.6G Oct 12 10:55 SRR9265068_round1_Unmapped.out.mate2
-rw-r--r--  1 bshrestha ebpproject 2.6G Oct 12 10:55 SRR9265068_round1_Unmapped.out.mate1
-rw-r--r--  1 bshrestha ebpproject 3.9M Oct 12 10:54 SRR9265068_round1_SJ.out.tab
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 10:42 SRR9265068_round1.error

As you can see, no SRR9265068_round3_SJ.out.tab file was created during the third round, but a SRR9265068_final_SJ.out.tab file was created. However, for some libraries it created "round3_SJ.out.tab" files after completing the third run, as shown below:

-rw-r--r--  1 bshrestha ebpproject 2.8M Oct 12 12:09 whole_worm_stress_round1_and_round2_and_round3_SJ.out.tab
drwxr-xr-x 14 bshrestha ebpproject  47K Oct 12 12:09 .
-rw-r--r--  1 bshrestha ebpproject 3.5K Oct 12 12:09 whole_worm_stress_round3_SJ.out.tab
drwx------  3 bshrestha ebpproject  512 Oct 12 12:09 SRR14458419_round3__STARtmp
-rw-r--r--  1 bshrestha ebpproject  239 Oct 12 12:09 SRR14458419_round3.output
-rw-r--r--  1 bshrestha ebpproject 431M Oct 12 12:09 SRR14458419_round3_Unmapped.out.mate2
-rw-r--r--  1 bshrestha ebpproject 431M Oct 12 12:09 SRR14458419_round3_Unmapped.out.mate1
-rw-r--r--  1 bshrestha ebpproject 2.0K Oct 12 12:09 SRR14458419_round3_Log.final.out
-rw-r--r--  1 bshrestha ebpproject 1.9M Oct 12 12:09 SRR14458419_round3_Aligned.sortedByCoord.out.bam
-rw-r--r--  1 bshrestha ebpproject 3.6K Oct 12 12:09 SRR14458419_round3_SJ.out.tab
-rw-r--r--  1 bshrestha ebpproject    0 Oct 12 11:57 SRR14458419_round3.error

I also tried using Finder to download SRA files but it didn't do a good job in downloading files properly. Please see the attachment. So, I downloaded the RNA reads in my local computer and used it as an input.

Any suggestions on how to fix this?

Thank you

RNAseq file name

STAR CPU limit

I tried to run finder with 24 CPUs but it failed during alignment step.
The same command line worked with 12 CPUs.

There seems to be an issue with STAR or its compilation
alexdobin/STAR#512

IndexError

Hi there. I've been having issues running both the example and on my own data, see below:

~/finder/finder -no_cleanup -mf ~/finder/example/Arabidopsis_thaliana_metadata.csv -n 6 -gdir_star /Volumes/TOSH_EXTERNAL/star_index_without_transcriptome_Arabid -out_dir /Volumes/TOSH_EXTERNAL/Finder_Output1 -g ~/finder/example/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p ~/finder/example/uniprot_ARATH.fasta -gdir_olego /Volumes/TOSH_EXTERNAL/olego_index_Arabid -preserve
Traceback (most recent call last):
  File "/Users/nicklister/finder/finder", line 668, in <module>
    main()
  File "/Users/nicklister/finder/finder", line 606, in main
    validateCommandLineArguments(options,logger_proxy,logging_mutex)
  File "/Users/nicklister/finder/finder", line 181, in validateCommandLineArguments
    finder_directory=open(options.temp_dir+"/finding_finder","r").read().split(":")[-1].split()[-1].strip()
IndexError: list index out of range

Any ideas what might be wrong with what I am doing?

Thanks,
Nick

Installation: error, required file not found: hmm_to_gtf.pl

When running ./install.py, I receive this error after following the installation instructions. The solution seems to just be copying and pasting "hmm_to_gtf.pl" from the "other" directory within "gmes_linux_64" to the main "gmes_linux_64" directory. One thing I noticed was that the version of GeneMark-ES/ET/EP in the installation instructions is 4.62 while on the GeneMark website the version is 4.68. I'd suggest either updating your scripts for this new version of GeneMark or clarifying whether 4.62 must be used for Finder to work. Thank you!

recovery points

Due to memory issues I had to relaunch finder while it was aligning proteins with exonerate.
Strangely, it restarted at the braker step. Is there a reason for this, or is it possible to add a recovery point after braker?

Mapping rate in round 2 0.0

I'm having an issue with the pipeline that I can't quite narrow down. Here's what's printed to the screen:
wirenia@wirenia:~/Desktop/2021-04-21_Hanleya_finder$ ./run_finder.sh
cat: output/alignments/Hanleya_hanleyi_mantle_round1_SJ.out.tab: No such file or directory
cat: output/alignments/Hanleya_hanleyi_mantle_round2_SJ.out.tab: No such file or directory
mv: cannot stat 'output/alignments/Hanleya_hanleyi_mantle_final_Log.final.out': No such file or directory
cat: output/alignments/Hanleya_hanleyi_mantle_round3_SJ.out.tab: No such file or directory
cat: output/alignments/Hanleya_hanleyi_mantle_round4_SJ.out.tab: No such file or directory
samtools index: "output/alignments/Hanleya_hanleyi_mantle_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
samtools index: "output/alignments/Hanleya_hanleyi_mantle_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Can not open output/alignments/Hanleya_hanleyi_mantle_final.sortedByCoord.out.bam.
mv: cannot stat 'output/assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf': No such file or directory
mv: cannot stat 'output/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf': No such file or directory
Traceback (most recent call last):
  File "/home/wirenia/finder/finder", line 641, in <module>
    main()
  File "/home/wirenia/finder/finder", line 602, in main
    orchestrateGeneModelPrediction(options,logger_proxy,logging_mutex)
  File "/home/wirenia/finder/finder", line 415, in orchestrateGeneModelPrediction
    findTranscriptsInEachSampleNotReportedInCombinedAnnotations(options,logger_proxy,logging_mutex)
  File "/home/wirenia/finder/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
    combined_transcript_info=readAllTranscriptsFromGTFFileInParallel([combined_gtf_filename,"combined","combined"])[0]
  File "/home/wirenia/finder/scripts/fileReadWriteOperations.py", line 232, in readAllTranscriptsFromGTFFileInParallel
    fhr=open(gtf_filename,"r")
FileNotFoundError: [Errno 2] No such file or directory: 'output/assemblies_psiclass_modified/combined/combined.gtf'

progress.log says:
2021-04-23 00:46:46,664 - finder - INFO - Software paths have been set
2021-04-23 00:47:07,129 - finder - INFO - Generating STAR index
2021-04-23 01:09:17,755 - finder - INFO - STAR index generation complete
2021-04-23 01:09:17,786 - finder - INFO - Generating OLego index
2021-04-23 02:16:37,614 - finder - INFO - OLego index built
2021-04-23 02:16:37,648 - finder - INFO - validateCommandLineArguments execution successful
2021-04-23 02:16:37,676 - finder - INFO - Metadata information created
2021-04-23 02:16:37,676 - finder - INFO - readMetaDataFile execution successful
2021-04-23 02:16:37,689 - finder - INFO - expandGzippedFiles execution successful
2021-04-23 02:16:37,717 - finder - INFO - Starting FINDER from None checkpoint
2021-04-23 02:16:37,718 - finder - INFO - Program params - Namespace(addUTR=True, checkpoint=None, compressed_data_files=None, cpu='35', error_corrected_raw_data='output/raw_data_error_corrected', exonerate_gff3='protein_evidence.gff3', files_for_ncrna={'mature_ATGC': '/home/wirenia/finder/dep/mature_ATGC.fa'}, final_GTF_files='output/final_GTF_files', genome='final.purged.fa.PolcaCorrected.fa.masked', genome_dir_olego='output/indices/olego_index', genome_dir_star='output/indices/star_index_without_transcriptome', indices='output/indices', md=None, metadatafile='metadata.csv', mrna_md={'mantle': {'Hanleya_hanleyi_mantle': {'bioproject': 'DUMMY', 'condition': 'mantle', 'Date': '1/12/17', 'Ended': 'PE', 'desc': 'cDNA;Illumina HiSeq 2500', 'read_length': '101', 'error_corrected': 0, 'location_directory': '/home/wirenia/Desktop/2021-04-21_Hanleya_finder', 'downloaded_from_NCBI': 0}}}, no_cleanup=False, output_assemblies_psiclass_terminal_exon_length_modified='output/assemblies_psiclass_modified', output_braker='output/braker', output_directory='output', output_fasta_N_removed='output/raw_fasta_N_removed', output_rcorrector=None, output_sample_fastq=None, output_star='output/alignments', paired_end_adapterfile=None, perform_post_completion_data_cleanup=False, preserve_raw_input_data=False, protein='protein_evidence.fas', raw_data_downloaded_from_NCBI='output/raw_data_downloaded_from_NCBI', record_time={}, run_tests=False, single_end_adapterfile=None, skip_cpd=False, smrna_md={}, softwares={'psiclass': '/home/wirenia/finder/dep/psiclass_terminal_exon_length_modified/psiclass', 'junc': '/home/wirenia/finder/dep/psiclass_terminal_exon_length_modified//junc', 'subexon-info': '/home/wirenia/finder/dep/psiclass_terminal_exon_length_modified//subexon-info', 'addXS': '/home/wirenia/finder/dep/psiclass_terminal_exon_length_modified//addXS', 'fastq-sample': '/home/wirenia/finder/dep/fastq-tools-0.8/scripts/fastq-sample', 'download_and_dump_fastq_from_SRA': '/home/wirenia/finder/dep/../utils/downloadAndDumpFastqFromSRA.py', 'transferGenomicNucleotideCountsToTranscriptome': '/home/wirenia/finder/dep/../scripts/transferGenomicNucleotideCountsToTranscriptome.py', 'find_exonic_troughs': '/home/wirenia/finder/dep/../scripts/find_exonic_troughs.R', 'olego': '/home/wirenia/finder/dep/olego/olego', 'olegoindex': '/home/wirenia/finder/dep/olego/olegoindex', 'mergePEsam.pl': '/home/wirenia/finder/dep/olego/mergePEsam.pl', 'xa2multi': '/home/wirenia/finder/dep/olego/xa2multi.pl', 'gmst': '/home/wirenia/finder/dep/gmst.pl', 'prodigal': '/home/wirenia/finder/dep/Prodigal/prodigal', 'canon-gff3': '/home/wirenia/finder/dep/canon-gff3', 'convert_exonerate_gff_to_gtf': '/home/wirenia/finder/dep/../utils/convert_exonerate_gff_to_gtf.py', 'augustus_main_dir': '/home/wirenia/finder/dep/Augustus', 'braker': '/home/wirenia/finder/dep/BRAKER/scripts/braker.pl', 'GENEMARK_PATH': '/home/wirenia/finder/dep/gmes_linux_64', 'AUGUSTUS_CONFIG_PATH': 'output/braker/Augustus/config', 'AUGUSTUS_BIN_PATH': 'output/braker/Augustus/bin', 'AUGUSTUS_SCRIPTS_PATH': 'output/braker/Augustus/scripts', 'GUSHR_PATH': '/home/wirenia/finder/dep/GUSHR'}, space_saved=None, temp_dir='output/temp', total_space=None, verbose=3)
2021-04-23 02:16:37,719 - finder - INFO - Started processing data for mantle
2021-04-23 02:16:37,731 - finder - INFO - Downloading missing data from NCBI started
2021-04-23 02:16:37,731 - finder - INFO - Downloading missing data from NCBI finished for mantle
2021-04-23 02:16:38,384 - finder - INFO - STAR Round1 run for Hanleya_hanleyi_mantle completed
2021-04-23 02:16:38,385 - finder - INFO - Mapping of reads for round1 completed for mantle
2021-04-23 02:16:38,629 - finder - INFO - Selecting high confidence junctions after round1 mapping completed for mantle
2021-04-23 02:16:38,665 - finder - INFO - Raw read download from NCBI cleanup completed for mantle
2021-04-23 02:16:38,686 - finder - INFO - STAR Round2 run for Hanleya_hanleyi_mantle completed
2021-04-23 02:16:38,687 - finder - INFO - Mapping of reads for round2 completed for mantle
2021-04-23 02:16:38,888 - finder - INFO - Selecting high confidence junctions after round2 mapping completed for mantle
2021-04-23 02:16:38,889 - finder - INFO - Mapping rate in round2 mantle Hanleya_hanleyi_mantle 0.0
2021-04-23 02:16:38,890 - finder - INFO - Resorting to alignment with relaxed parameters for these runs due to poor mapping Hanleya_hanleyi_mantle
2021-04-23 02:16:38,975 - finder - INFO - STAR relaxed alignment run for Hanleya_hanleyi_mantle completed
2021-04-23 02:16:38,976 - finder - INFO - Mapping of reads for round3 completed for mantle
2021-04-23 02:16:39,005 - finder - INFO - Selecting high confidence junctions after round3 mapping completed for mantle
2021-04-23 02:16:39,006 - finder - INFO - Mapping of reads for round4 completed for mantle
2021-04-23 02:16:39,033 - finder - INFO - Selecting high confidence junctions after round4 mapping completed for mantle
2021-04-23 02:16:39,034 - finder - INFO - Mapping with OLego for micro-exon detection completed for mantle
2021-04-23 02:16:39,453 - finder - INFO - Merging of alignments from all rounds of mapping completed for mantle
2021-04-23 02:16:39,518 - finder - INFO - Removing intermediate alignment files completed for mantle
2021-04-23 02:16:39,528 - finder - INFO - Mapping of all runs completed for mantle
2021-04-23 02:16:40,360 - finder - INFO - Information collection about alignments completed
2021-04-23 02:16:43,663 - finder - INFO - Generation of assemblies with PsiCLASS completed

The only files in /output/temp/ are download_these_runs and finding_finder

The *relaxed.error file says:
`EXITING because of FATAL ERROR: could not open genome file output/indices/star_index_without_transcriptome//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions

Apr 23 02:16:38 ...... FATAL ERROR, exiting`

My genome file name is correctly specified in the input script.

Any ideas?

Thanks!
Kevin
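
The relaxed-alignment error above says STAR could not open genomeParameters.txt inside the index directory, which suggests that index generation did not finish rather than that the genome path is wrong. A minimal sketch of a pre-flight check, assuming the index directory is the one shown in the log; the helper name `missing_star_index_files` and the list of files checked are illustrative choices, not part of finder:

```python
from pathlib import Path

# Files that STAR's genomeGenerate step normally writes into an index directory.
# This is a partial list chosen for illustration.
EXPECTED_INDEX_FILES = ("genomeParameters.txt", "Genome", "SA", "SAindex", "chrName.txt")


def missing_star_index_files(index_dir):
    """Return the expected index files that are absent or empty in index_dir."""
    root = Path(index_dir)
    missing = []
    for name in EXPECTED_INDEX_FILES:
        candidate = root / name
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path taken from the log above; adjust to your own output directory.
    problems = missing_star_index_files("output/indices/star_index_without_transcriptome")
    if problems:
        print("STAR index looks incomplete, missing:", ", ".join(problems))
    else:
        print("STAR index files are present")
```

If files are reported missing, the star_index_without_transcriptome.error file in the indices directory is the next place to look.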

Not stranded annotation on the final GTF file

Dear Sagnik,
I am trying to use the annotation file within CellRanger (10X genomics) but it doesn't accept non-stranded exons.
In which step are they being considered, and how can I obtain this information to prepare a useful .gtf ?
Thanks for your assistance!
Best wishes,

Vitor.
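
In GTF, the strand is the seventh tab-separated column and is `.` when it is unknown. A minimal sketch for finding such records, assuming a standard nine-column GTF; the function name `split_by_strand` is hypothetical, the example records are made up, and dropping unstranded records is only one possible workaround, not something finder itself does:

```python
def split_by_strand(gtf_lines):
    """Separate GTF records into stranded ('+'/'-') and unstranded ones.

    Comment lines and blank lines are kept with the stranded records so that
    any header survives.
    """
    stranded, unstranded = [], []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            stranded.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the strand in GTF.
        if len(fields) >= 7 and fields[6] in ("+", "-"):
            stranded.append(line)
        else:
            unstranded.append(line)
    return stranded, unstranded


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        'chr1\tFINDER\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n',
        'chr1\tFINDER\texon\t300\t400\t.\t.\t.\tgene_id "g2"; transcript_id "g2.1";\n',
    ]
    kept, dropped = split_by_strand(example)
    print(len(kept), "stranded,", len(dropped), "unstranded")
```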

Issue running Finder

Hello, below I post the output of a run with the finder pipeline. It looks like some alignment files that should be produced by Finder are not produced. The STAR aligner is installed and working; I also installed PsiCLASS, although I do not know whether that was necessary; GeneMark is in place, etc. But the run doesn't get that far anyway; these are the initial steps.

So why does this happen?

cp: '/home/orestis/../x_soft_all.fasta' and '/home/orestis/../iquitos_soft_all.fasta' are the same file
cat: /home/orestis/../alignments/RNA_all_round1_SJ.out.tab: No such file or directory
cat: /home/orestis/../alignments/cacao_RNA_all_round2_SJ.out.tab: No such file or directory
mv: cannot stat '/home/orestis/../alignments/RNA_all_final_Log.final.out': No such file or directory
cat: /home/orestis/../alignments/RNA_all_round3_SJ.out.tab: No such file or directory
samtools index: "/home/orestis/../alignments/RNA_all_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
samtools index: "/home/orestis/../alignments/RNA_all_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Can not open /home/orestis/../alignments/RNA_all_final.sortedByCoord.out.bam.
[main_samview] fail to read the header from "/home/orestis/../alignments/RNA_all_final.sortedByCoord.out.bam".
[main_samview] fail to read the header from "/home/orestis/../alignments/RNA_all_for_psiclass.sam".
mv: cannot stat '/home/orestis/../assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf': No such file or directory
mv: cannot stat '/home/orestis/../assemblies_psiclass_modified/combined/psiclass_output_vote.gtf': No such file or directory
Traceback (most recent call last):
File "/home/orestis/FINDER/finder_v1.1.0/finder", line 688, in <module>
main()
File "/home/orestis/FINDER/finder_v1.1.0/finder", line 649, in main
orchestrateGeneModelPrediction( options, logger_proxy, logging_mutex )
File "/home/orestis/FINDER/finder_v1.1.0/finder", line 461, in orchestrateGeneModelPrediction
findTranscriptsInEachSampleNotReportedInCombinedAnnotations( options, logger_proxy, logging_mutex )
File "/home/orestis/FINDER/finder_v1.1.0/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
combined_transcript_info = readAllTranscriptsFromGTFFileInParallel( [combined_gtf_filename, "combined", "combined"] )[0]
File "/home/orestis/FINDER/finder_v1.1.0/scripts/fileReadWriteOperations.py", line 290, in readAllTranscriptsFromGTFFileInParallel
fhr = open( gtf_filename, "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/home/orestis/../assemblies_psiclass_modified/combined/combined.gtf'

error

Hi,

Thanks for the nice tool. I am trying to run finder on my own plant genome, but I am getting this error

EXITING: Did not find the genome in memory, did not remove any genomes from shared memory

May 25 22:17:31 ...... FATAL ERROR, exiting

It must be something simple, but I think I am missing something here...

Best,
André

Error: "finder: Argument list too long"

Dear Sagnik,
I am trying out the example data and wondering which argument could be causing this error and how I can fix it.

finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error

finder: Argument list too long

Thanks in advance.
Greetings,

Vitor.

Braker2 and new release

Hi,
I would like to use finder to annotate a genome but I read in a previous post that a new version should come out without braker2, which can cause several problems. Do you think it's better to wait for the new release or use this one?
Thanks

Issues with transcript_to_condition

Hello,
Thank you for addressing the issues on finder. I am running finder on Arabidopsis and the execution completes without an error. But when I checked the final GTF files, I found that the file transcript_to_condition does not have quite what I expected. Could you please look into it? Here are the contents of the file.

exons	flower,leaf,cold
introns	flower,leaf,cold
cds	flower,leaf,cold
cds_frame	flower,leaf,cold
direction	flower,leaf,cold
TPM	flower,leaf,cold
cov	flower,leaf,cold
gene_id	flower,leaf,cold
FPKM	flower,leaf,cold
transcript_start	flower,leaf,cold
transcript_end	flower,leaf,cold
chromosome	flower,leaf,cold
annotator	flower,leaf,cold

Thank you.

Trouble with running the example, maybe from alignReads.py Round4 missing

When running the example data, I followed the whole pipeline in README.md, but I got an error:

Traceback (most recent call last):
  File "/home/software/finder/finder", line 675, in <module>
    main()
  File "/home/software/finder/finder", line 636, in main
    orchestrateGeneModelPrediction(options,logger_proxy,logging_mutex)
  File "/home/software/finder/finder", line 420, in orchestrateGeneModelPrediction
    alignReadsAndMergeOutput(options,logger_proxy,logging_mutex)
  File "/home/software/finder/scripts/findGenesFromExpression.py", line 419, in alignReadsAndMergeOutput
    fhr=open(options.output_star+"/"+Run+"_round4_Log.final.out","r")
FileNotFoundError: [Errno 2] No such file or directory: '/home/software/finder/example/FINDER_test_ARATH/alignments/SRR9844295_round4_Log.final.out'

Then I checked the scripts and found that in findGenesFromExpression.py, lines 392~412, the code has been commented out; this part is annotated "Align reads with STAR round4". But after that, at line 418, "Align reads with OLego round5" still uses the round4 result.

Then I looked for the function "alignReadsWithSTARRound4" in alignReads.py, but I can't find any function with that name there.

So is this a bug in Finder? Or can I use the round3 result in OLego round5?

error when downloading docker image

Hi, I am trying to download a docker image of finder but encountered an error. Here is the error information:

 => ERROR [44/53] RUN mkdir -p /softwares/NCBIBLAST &&  cd /softwares/NCBIBLAST &&   wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz &&  tar -xvzf /  1.0s
------
 > [44/53] RUN mkdir -p /softwares/NCBIBLAST &&         cd /softwares/NCBIBLAST &&      wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz &&        tar -xvzf /softwares/NCBIBLAST/ncbi-blast-2.12.0+-x64-linux.tar.gz:
#48 0.324 --2022-03-15 13:00:29--  https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz
#48 0.325 Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.12, 2607:f220:41e:250::13, ...
#48 0.445 Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
#48 0.864 HTTP request sent, awaiting response... 404 Not Found
#48 0.989 2022-03-15 13:00:29 ERROR 404: Not Found.
#48 0.989
------
executor failed running [/bin/sh -c mkdir -p /softwares/NCBIBLAST &&    cd /softwares/NCBIBLAST &&      wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-${NCBI_VERSION}+-x64-linux.tar.gz &&        tar -xvzf /softwares/NCBIBLAST/ncbi-blast-${NCBI_VERSION}+-x64-linux.tar.gz]: exit code: 8

It seems the program failed to download a certain version of Blast.
Could you fix this? Thank you for any help provided.

Best,
Yangzi
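
The 404 is consistent with the `LATEST` directory holding only the newest BLAST+ release, so a URL that pins 2.12.0 there stops resolving once a newer version is published. A minimal sketch of building a version-pinned URL instead, assuming NCBI keeps each release under a directory named after its version number (worth verifying in a browser before relying on it); `blast_tarball_url` is a hypothetical helper, not part of the finder Dockerfile:

```python
BLAST_BASE = "https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+"


def blast_tarball_url(version):
    """Build a download URL that points at a fixed BLAST+ release directory."""
    filename = f"ncbi-blast-{version}+-x64-linux.tar.gz"
    # Assumption: releases live under <base>/<version>/ rather than <base>/LATEST/.
    return f"{BLAST_BASE}/{version}/{filename}"


if __name__ == "__main__":
    print(blast_tarball_url("2.12.0"))
```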

dummy_data file issue - Singularity

Hi! I'm trying to run the example data (testrun) in Finder on an HPC using Singularity (v 3.7.1). In the main testrun.error file the error that starts the error series was
cat: /scratch1/njmello/Finder-finder_v1.1.0/example/testrun/alignments/dummy_data1_round2_SJ.out.tab: No such file or directory

I then went to the progress.log file and saw that the mapping rates for the dummy variables were 0. I went to the alignments folder, opened dummy_data1_round1.error, and saw

EXITING because of fatal input ERROR: could not open readFilesIn=/home/ubuntu/Finder/example/raw_data//dummy_data1.fastq

Jan 11 13:05:49 ...... FATAL ERROR, exiting

I went to the raw data file and saw this:

[njmello@login001 raw_data]$ pwd
/scratch1/njmello/Finder-finder_v1.1.0/example/raw_data
[njmello@login001 raw_data]$ ls
dummy_data1.fastq  dummy_data2.fastq.gz

I think the double '//' that I saw in the dummy_data1_round1.error is potentially the key to the problem, as I know that the dummy1 fastq file is present. I'll upload the testrun.error file and the progress.log file too if that helps. If there's anything I should try, please let me know. Thank you!!
testrun.error.log
progress.log
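
On POSIX systems, repeated slashes inside a path are treated as a single separator, so the `//` is unlikely to be the cause on its own. The directory prefix in the error (`/home/ubuntu/Finder/...`) differs from the directory listed above (`/scratch1/njmello/...`), which may be more relevant. A minimal sketch for checking both points; the helper name `describe_path` is hypothetical:

```python
import os
import posixpath


def describe_path(path):
    """Return the normalized path and whether it exists on this machine."""
    # normpath collapses 'raw_data//x' to 'raw_data/x'.
    normalized = posixpath.normpath(path)
    return normalized, os.path.exists(path)


if __name__ == "__main__":
    # Path copied from the STAR error message above.
    reported = "/home/ubuntu/Finder/example/raw_data//dummy_data1.fastq"
    normalized, exists = describe_path(reported)
    print(normalized, "exists" if exists else "does not exist")
```

Running this inside the container, rather than on the login node, shows what STAR itself can see.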

Question about final gtf files

Hi, so I've successfully completed a few finder runs and I'm currently assessing the results. I'm finding it difficult to decide which of the provided files is the most "final". For example, I'm getting fewer than 500 exons in braker_utr.gtf, while in "combined_with_CDS_BRAKER_appended_high_conf.gtf" (is this one redundant with "combined_with_CDS_high_conf.gtf"?), which I understand could be the final one, I'm getting ~15000. So I'm a bit confused. Do you have any comments?

Thanks for the good support

Help checking the legitimacy of Finder for a university report

Hi. I am writing a university report on gene prediction tools for bioinformatics, and I would like to cover your tool. Searching Google, I could find only one article: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04120-9, which is not enough to establish its legitimacy. I see that the project is still under development. Could you point me to more articles or other material about your tool? Thanks!
By the way, I will ask all of my friends to give your GitHub a heart 💯

Are all those .sam files mandatory?

Hi,
I was doing some tests using finder for a large plant genome.
At the end of the alignment steps, a .sortedByCoord.out.sam and a _for_psiclass.sam are created, and they are removed only at the end of the whole pipeline. I'm using 24 RNA-seq libraries in this test, and the alignments directory alone is around 2.5 TB. Most of that is due to those .sam files. Would it be possible to delete them (or, better, not generate them) at the end of the alignment? Are they really used in other steps?
Thanks,
Jonathan

Final output annotation

Hi @sagnikbanerjee15

Finder is a pretty useful program.
Can you please add --keep-gene-names in when running gffread?
Otherwise, annotations of genes are missing.

Thanks

Several bugs I found

Hi @sagnikbanerjee15

I found three bugs when using Finder.

In fixOverlappingAndMergedTranscripts.py, line 347, please add "dtype=object" in np.array(temp). Otherwise, there is an annoying warning message.
In alignReads.py, line 432, there should be an extra space before "2>". Otherwise, a bam2 file is created, which causes an error in the following step.
Please update PsiCLASS to the latest version. There might be a problem with genes that have long introns.

Good luck.

Zuyao
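
The second point can be reproduced outside finder. In a POSIX shell, `2>` only redirects standard error when the `2` is a separate token; if it is glued to the preceding word, the `2` becomes part of that word and `>` redirects standard output instead. A minimal sketch using `echo` as a stand-in for the real command (the file names are made up, and the actual command at line 432 is not reproduced here):

```python
import os
import subprocess
import tempfile


def run_in_shell(command, workdir):
    """Run a command string through the shell and return its standard output."""
    result = subprocess.run(
        command, shell=True, cwd=workdir, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Missing space: the argument becomes 'reads.bam2' and stdout goes to the log.
        buggy = "echo reads.bam" + "2> buggy.log"
        # With the space: the argument stays 'reads.bam' and only stderr is redirected.
        fixed = "echo reads.bam" + " 2> fixed.log"
        print(repr(run_in_shell(buggy, workdir)))
        print(repr(run_in_shell(fixed, workdir)))
        with open(os.path.join(workdir, "buggy.log")) as handle:
            print(repr(handle.read().strip()))
```

Building commands as argument lists and passing `stderr=` to `subprocess.run` avoids this class of bug, at the cost of restructuring how the command is assembled.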

Missing CDS annotation for some genes and transcripts

I was working with Finder and I am glad to report that the run completed without any errors for the dataset I was working with. I noticed that for some genes the CDS was detected properly, but for others there was an Open Reading Frame (ORF) in the transcript that Finder was unable to annotate. Could you please look into it?
Thank you!

error while producing the *_SJ.out.tab file

Hi,

I encountered the following error during the alignment step while finder was outputting the *_SJ.out.tab file

cat Finder.output
Traceback (most recent call last):
File "/usr/local/bioinfo/src/Finder/finder-113203c/finder", line 640, in <module>
main()
File "/usr/local/bioinfo/src/Finder/finder-113203c/finder", line 601, in main
orchestrateGeneModelPrediction(options,logger_proxy,logging_mutex)
File "/usr/local/bioinfo/src/Finder/finder-113203c/finder", line 385, in orchestrateGeneModelPrediction
alignReadsAndMergeOutput(options,logger_proxy,logging_mutex)
File "/usr/local/bioinfo/src/Finder/finder-113203c/scripts/findGenesFromExpression.py", line 313, in alignReadsAndMergeOutput
selectHighConfidenceSpliceJunctions(options,1,condition)
File "/usr/local/bioinfo/src/Finder/finder-113203c/scripts/findGenesFromExpression.py", line 105, in selectHighConfidenceSpliceJunctions
condition,junctions_to_be_discarded,junctions_to_be_retained=selectHighConfidenceSpliceJunctionsPerCondition([condition,options,round])
File "/usr/local/bioinfo/src/Finder/finder-113203c/scripts/findGenesFromExpression.py", line 46, in selectHighConfidenceSpliceJunctionsPerCondition
chromosome,j_start,j_end,strand,intron_motif,annotated,uniq_reads,mm_reads,max_overhang=line.strip().split()
ValueError: not enough values to unpack (expected 9, got 8)

What should I do?
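
The traceback shows a line of `*_SJ.out.tab` with 8 whitespace-separated fields where the code unpacks 9, which could happen, for example, if the file was truncated while STAR was still writing it. A minimal sketch of a more defensive parse that reports malformed lines instead of crashing; the function name `parse_sj_lines` is hypothetical, and the column names are taken from the unpacking statement in the traceback:

```python
SJ_COLUMNS = (
    "chromosome", "j_start", "j_end", "strand", "intron_motif",
    "annotated", "uniq_reads", "mm_reads", "max_overhang",
)


def parse_sj_lines(lines):
    """Parse splice-junction lines, collecting malformed ones separately."""
    records, malformed = [], []
    for number, line in enumerate(lines, start=1):
        fields = line.strip().split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(SJ_COLUMNS):
            malformed.append((number, line.rstrip("\n")))
            continue
        records.append(dict(zip(SJ_COLUMNS, fields)))
    return records, malformed


if __name__ == "__main__":
    # Made-up example: the second line is cut short, as in the reported error.
    example = ["chr1\t100\t200\t1\t1\t0\t5\t0\t30\n", "chr1\t300\t400\t2\t2\t0\t7\t0\n"]
    good, bad = parse_sj_lines(example)
    print(len(good), "valid,", len(bad), "malformed")
```

The line numbers of the malformed entries make it easy to inspect the offending part of the file by hand.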

'Another job is still loading the genome, sleeping for 1 min'

Hi @sagnikbanerjee15

I am trying to run Finder on a fairly fragmented genome of around 500Mbp. Running the example dataset was successful. When I try running my own dataset of a single RNAseq dataset the program seems to get stuck very early on while running STAR. Specifically, the *_round1_Log.out file has at the bottom repeated instances of Another job is still loading the genome, sleeping for 1 min. I can run STAR on the dataset independent of finder so I'm wondering if you have an idea about what might be going wrong here? Thank you for your time and advice.

Error in installation?

Hello. I'm getting this error when trying to run finder--the error occurs with example data as well. Any help would be greatly appreciated.

Traceback (most recent call last):
  File "/data/selaginella/finder/finder", line 648, in <module>
    main()
  File "/data/selaginella/finder/finder", line 586, in main
    validateCommandLineArguments(options,logger_proxy,logging_mutex)
  File "/data/selaginella/finder/finder", line 235, in validateCommandLineArguments
    cmd+=" --runThreadN "+options.cpu
TypeError: can only concatenate str (not "int") to str
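
The last frame shows a string being concatenated with `options.cpu`, which is an integer in this run. A minimal sketch of the failure and one way around it, assuming only the `--runThreadN` fragment visible in the traceback; the function name and the `STAR` prefix are illustrative, not finder's actual code:

```python
def append_thread_flag(cmd, cpu):
    """Append STAR's --runThreadN option, accepting cpu as int or str."""
    # str() makes the concatenation safe whichever type the option parser produced.
    return cmd + " --runThreadN " + str(cpu)


if __name__ == "__main__":
    try:
        "STAR" + " --runThreadN " + 4  # reproduces the reported TypeError
    except TypeError as error:
        print("TypeError:", error)
    print(append_thread_flag("STAR", 4))
```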

Output files not always containing CDS field

Hello,

I was checking the output files because I want to generate a protein fasta in order to run BUSCO.
I realized that some files like "combined_with_CDS_high_and_low_confidence_merged.gtf" contain a CDS field for some genes but not for others. Why is that?
Exon and CDS coordinates are not always identical.
I would appreciate it if you could tell me how to obtain the amino acid sequences corresponding only to the CDS coordinates of all genes from the above file. So far, I have used the Augustus Python script "getAnnoFastaFromJoingenes.py", but it can only extract sequences from .gtf files with annotated CDS coordinates.
Any help would be appreciated.
Best wishes,
Vitor.
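
To see which transcripts in such a file carry CDS records, the GTF can be scanned directly. A minimal sketch, assuming a standard nine-column GTF whose attribute column contains `transcript_id "..."`; the function name `transcripts_without_cds` is hypothetical and the example records are made up:

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_without_cds(gtf_lines):
    """Return the sorted transcript IDs that have exon records but no CDS records."""
    with_exon, with_cds = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        if fields[2] == "exon":
            with_exon.add(match.group(1))
        elif fields[2] == "CDS":
            with_cds.add(match.group(1))
    return sorted(with_exon - with_cds)


if __name__ == "__main__":
    # Made-up records: t1 has a CDS, t2 has exons only.
    example = [
        'chr1\tFINDER\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
        'chr1\tFINDER\tCDS\t120\t180\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n',
        'chr1\tFINDER\texon\t500\t700\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n',
    ]
    print(transcripts_without_cds(example))
```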

cds_predict error

Hello there,

I'm trying to use FINDER for the first time, on a Drosophila de novo assembly with locally stored RNA-seq data. Many parts of the pipeline seem to have worked using the installation and run guidance as documented - for instance BRAKER has finished - but I find an error in step 5 of the pipeline that I can't work out how to overcome:

INFO: Creating SIF file...
Traceback (most recent call last):
File "/softwares/FINDER/Finder/finder", line 688, in <module>
main()
File "/softwares/FINDER/Finder/finder", line 665, in main
findCDS( options, logger_proxy, logging_mutex )
File "/softwares/FINDER/Finder/scripts/predictCDS.py", line 66, in findCDS
fhr = open( options.output_assemblies_psiclass_terminal_exon_length_modified + "/combined/cds_predict/annotation.gtf", "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/lustre/scratch116/tol/teams/team301/users/cl16/Drosophila/idDroSubo1_FINDER/assemblies_psiclass_modified/combined/cds_predict/annotation.gtf'

Indeed, if I go to /idDroSub1_FINDER/assemblies_psiclass_modified/combined/cds_predict/ I see only:

-rw-r--r-- 1 cl16 team301 38219274 Jan 12 11:53 minus.fa
-rw-r--r-- 1 cl16 team301 0 Jan 12 11:53 ORFs_minus.gtf
-rw-r--r-- 1 cl16 team301 0 Jan 12 11:53 ORFs_plus.gtf

I also notice an error from running Codan in the combined/ directory:

Traceback (most recent call last):
File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 524, in <module>
main()
File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 506, in main
codan_BOTH(options.transcripts, options.output_folder, options.model, options.cpu)
File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 355, in codan_BOTH
retrieveORF_BOTH(transcripts, outF+"minus.fa", outF)
File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 147, in retrieveORF_BOTH
record_dictP = SeqIO.index(transcripts, "fasta")
File "/usr/lib/python3/dist-packages/Bio/SeqIO/__init__.py", line 979, in index
return _IndexedSeqFileDict(
File "/usr/lib/python3/dist-packages/Bio/File.py", line 350, in __init__
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'u000001431.46339_3_covsplit.0'

Can you please advise on what might have gone wrong here, and what should be done to fix it? Happy to provide any intermediate files needed to diagnose. I'm running on an HPC cluster using singularity 3.9.0.

Regards,
Chris L
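
The CodAn traceback ends in Biopython's `SeqIO.index`, which refuses FASTA files in which two records share the same identifier. A minimal sketch for listing such duplicates with the standard library only, assuming the identifier is the first whitespace-delimited word after `>`; the function name `duplicate_fasta_ids` is hypothetical and the sequences in the example are made up:

```python
from collections import Counter


def duplicate_fasta_ids(fasta_lines):
    """Return identifiers that appear in more than one FASTA header, with counts."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                counts[header.split()[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    # The identifier is the one reported in the error; the sequences are invented.
    example = [
        ">u000001431.46339_3_covsplit.0\n", "ATGGCC\n",
        ">u000001431.46339_4_covsplit.0\n", "ATGAAA\n",
        ">u000001431.46339_3_covsplit.0\n", "ATGTTT\n",
    ]
    print(duplicate_fasta_ids(example))
```

Running this on the transcript FASTA that was passed to CodAn would show how many identifiers collide.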

Protein FASTA from closely related species

Thank you for the wonderful package.

I have plenty of transcriptome data for my species, but I do not have a protein fasta. Is it recommended to use a protein fasta from a closely related species?

I read your paper's section on "De novo gene prediction from expression data and proteins from closely related species" but am unsure whether this meant using both short read data and protein level data, or both.

using local RNA-Seq files

Even when setting a dummy BIOPROJECT and giving a local file prefix Finder tries to download the RNA-Seq files from the NCBI

2021-03-04 11:49:45,178 - finder - INFO - Started processing data for test
2021-03-04 11:49:45,178 - finder - INFO - Downloading missing data from NCBI started
2021-03-04 11:49:45,179 - finder - INFO - Running command - /usr/local/bioinfo/src/Finder/finder-a3746f0/dep/../utils/downloadAndDumpFastqFromSRA.py -s /home/agena/klopp/work/Project_test.1468/Finder/FINDER_test2_test/temp/download_these_runs -o /home/agena/klopp/work/Project_test.1468/Finder/FINDER_test2_test/raw_data_downloaded_from_NCBI -n 6 > /home/agena/klopp/work/Project_test.1468/Finder/FINDER_test2_test/temp/download_these_runs.output 2> /home/agena/klopp/work/Project_test.1468/Finder/FINDER_test2_test/temp/download_these_runs.error
2021-03-04 11:49:45,479 - finder - INFO - Downloading missing data from NCBI finished for test
2021-03-04 11:49:45,489 - finder - INFO - STAR Round1 run for /work/klopp/Project_test.1468/Finder/local completed
2021-03-04 11:49:45,489 - finder - INFO - Mapping of reads for round1 completed for test
2021-03-04 11:49:45,499 - finder - INFO - Selecting high confidence junctions after round1 mapping completed for test

Could you provide an example with local files?

I'm not able to do the initial example.

Good afternoon.

My name is Fernando, I'm from Brazil and a student of computational biology at Fiocruz.
First of all, I would like to thank you for the initiative to create a software that could predict eukaryotic genes, which was praised by several researchers.

I have a problem that may seem trivial: I cannot get a simple run of the suggested example to work.

I'm pasting here the command line I used:

finder -no_cleanup
-mf example/Arabidopsis_thaliana_metadata.csv
-n 4
-gdir_olego example/A_tha
-out_dir example/FINDER_test_ARATH
-g example/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
-p example/uniprot_ARATH.fasta
-gdir_olego olego_index -preserve -pc_clean 1> example/FINDER_test_ARATH.output 2> example/FINDER_test_ARATH.error

Could you please tell me how to proceed? After this example, I will move on to other complete genomes. I've been able to complete all the steps up to this point, and I'm even embarrassed to ask the software's creator for guidance. If possible, could you tell me what I should be aware of when writing the command line?

Sorry for bothering you, and thank you very much for the information and for providing such useful software.

STAR memory issue

Hello! I'm very excited about this new tool but running into an issue early on in the pipeline. The file star_index_without_transcriptome.error is reporting this:

`EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome
SOLUTION: please specify --limitGenomeGenerateRAM not less than 43903877386 and make that much RAM available

Apr 22 14:59:38 ...... FATAL ERROR, exiting`

I have enough RAM but don't know how to indicate that to STAR.

Thanks!
Kevin
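
STAR takes this limit through its `--limitGenomeGenerateRAM` option, as the error message itself indicates. A minimal sketch of building the index by hand with a larger limit, after which the directory could be passed to finder through `-gdir_star`; whether finder then skips its own index generation is an assumption, and the helper name, paths, and thread count are placeholders:

```python
def star_index_command(genome_fasta, index_dir, threads, ram_bytes):
    """Build the argument list for STAR's genomeGenerate run mode."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", genome_fasta,
        "--runThreadN", str(threads),
        "--limitGenomeGenerateRAM", str(ram_bytes),
    ]


if __name__ == "__main__":
    # 43903877386 is the minimum STAR asked for in the error above.
    command = star_index_command("genome.fa", "star_index", 8, 43903877386)
    print(" ".join(command))
    # To run it for real: subprocess.run(command, check=True)
```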

Overloading and High memory consumption while downloading data

Hi there, thanks for developing finder

I am currently running finder on a fungus genome. I put several dozens of SRA libraries in the metadata file. The process is currently at the first step, downloading from SRA and aligning the data using STAR.

Currently, this is the landscape on the node I am running finder:

The load average is through the roof. Do you know why this is happening? Is this normal? Can you explain what downloadAndDumpFastqFromSRA.py does besides prefetching and running fasterq-dump? It seems to me it is creating too many processes.

Memory is the other issue that worries me. This is a fairly small genome (~43 Mb), but the run is using 380+ GB of memory! It has even started swapping (which slows the process tremendously). I am not sure which processes are taking that much memory. It is not clear to me that it is STAR, because if it were, the memory should have been released after the alignment finished. So I wonder why memory is almost full while data is being downloaded.

For reference, this is the command line I used:

finder -mf $folder/new_metadata.csv -out_dir $folder/output -g $folder/input/strain_masked.fasta -gdir_star $star_folder -gdir_olego $olego_folder --cpu $threads --addUTR

I would appreciate any thoughts on the matter. My IT admins are asking and I would like to be able to answer them.
Regards

Trouble running the example

Hi, and congratulations on the software. I want to give it a try, and I managed to install it successfully (the conda environment and the step-by-step process are much appreciated, but there are a few inaccuracies in the readme, such as FIND instead of find when running, where install.sh is, or which folder the compressed files for the external software have to be downloaded to).
When running the example as written in the readme, I get the following errors, as if some files were missing:

mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_1.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_2.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_3.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_4.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_5.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_6.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_7.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_8.gtf': No such file or directory
mv: cannot stat '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf': No such file or directory
Traceback (most recent call last):
File "/bin/finder/finder", line 626, in <module>
main()
File "/bin/finder/finder", line 587, in main
orchestrateGeneModelPrediction(options,logger_proxy,logging_mutex)
File "/bin/finder/finder", line 411, in orchestrateGeneModelPrediction
findTranscriptsInEachSampleNotReportedInCombinedAnnotations(options,logger_proxy,logging_mutex)
File "/bin/finder/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
combined_transcript_info=readAllTranscriptsFromGTFFileInParallel([combined_gtf_filename,"combined","combined"])[0]
File "/bin/finder/scripts/fileReadWriteOperations.py", line 202, in readAllTranscriptsFromGTFFileInParallel
fhr=open(gtf_filename,"r")
FileNotFoundError: [Errno 2] No such file or directory: '/bin/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/combined.gtf'

Can you please provide some support?
Thanks

Unable to load modules

Hi!

I am having issues with loading modules. Am I missing something in installation?

conda activate finder_conda_env
export PATH=$PATH:/global/scratch/users/skyungyong/Software/finder

whereis finder
finder: /global/scratch/users/skyungyong/Software/finder/finder
which finder
/global/scratch/users/skyungyong/Software/finder/finder
tail -n 1 ~/.bashrc
export PATH=$PATH:/global/scratch/users/skyungyong/Software/finder

finder -h
Traceback (most recent call last):
File "/global/scratch/users/skyungyong/Software/finder/finder", line 54, in <module>
from scripts.alignReads import *
ModuleNotFoundError: No module named 'scripts.alignReads'

Psiclass fault

Hi,
Thanks for the nice pipeline. I would very much like to try and use it in my next project.
However, I am experiencing a Psiclass fault while trying to run Finder with example data (I'm just using 3 RNA-Seq datasets SRR5197915, SRR8422201, SRR8422202).

This is the line that causes the fault:
/softwares/Psiclass/PsiCLASS/classes -p 20 --primaryParalog --lb /home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/fofn -s /home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/combined/subexon/psiclass_output_subexon_combined.out -o /home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output > /home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_classes.log

These are the last 20 rows of "psiclass_output_classes.log":
Thread 5: 3 20666991 20668480 finished.
30914: 2 3 20673832 20675907. Free threads: 14/20
20671127: atCnt=1 seCnt=2 0 3 2
Thread 11: 3 20671127 20672435 finished.
30915: 2 3 20681913 20684035. Free threads: 14/20
20673832: atCnt=1 seCnt=2 0 3 6
Thread 11: 3 20673832 20675907 finished.
30916: 7 3 20695497 20698221. Free threads: 14/20
20681913: atCnt=1 seCnt=2 0 3 5
Thread 11: 3 20681913 20684035 finished.
30917: 5 3 20698535 20700159. Free threads: 14/20
20695497: atCnt=1 seCnt=7 0 11 28
Thread 11: 3 20695497 20698221 finished.
Thread 12: 3 20636589 20639491 finished.
20698535: atCnt=2 seCnt=5 0 9 36
30918: 9 3 20700379 20703042. Free threads: 15/20
Thread 12: 3 20698535 20700159 finished.
20700379: atCnt=6 seCnt=9 0 20 57
30919: 13 3 20703262 20705327. Free threads: 15/20
Thread 12: 3 20700379 20703042 finished.

Here's the last 10 rows of the Progress file:
2022-03-03 07:33:21,976 - finder - INFO - STAR Round3 run for SRR8422202 completed
2022-03-03 07:33:21,977 - finder - INFO - Mapping of reads for round3 completed for cold
2022-03-03 07:33:21,989 - finder - INFO - Selecting high confidence junctions after round3 mapping completed for cold
2022-03-03 07:35:18,956 - finder - INFO - OLego run for SRR8422201 completed
2022-03-03 07:36:47,446 - finder - INFO - OLego run for SRR8422202 completed
2022-03-03 07:36:47,447 - finder - INFO - Mapping with OLego for micro-exon detection completed for cold
2022-03-03 07:38:54,777 - finder - INFO - Merging of alignments from all rounds of mapping completed for cold
2022-03-03 07:38:54,778 - finder - INFO - Mapping of all runs completed for cold
2022-03-03 08:06:50,383 - finder - INFO - Information collection about alignments completed
2022-03-03 08:19:27,161 - finder - INFO - Generation of assemblies with PsiCLASS completed

The program seems to crash because of a memory issue: in the resource monitor I saw RAM usage climb to the limit during this step and then stop.
I'm using a computer with 20 CPUs and 64 GB of RAM.
I've seen that you posted a similar problem on the PsiCLASS GitHub issues page (splicebox/PsiCLASS#21).

The pipeline (1.1.0) was installed following the instructions and runs using docker (version 20.10.12, build e91ed57).
This is the full command used:
/usr/bin/time --verbose ./run_finder -no_cleanup -mf $PWD/example/Arabidopsis_thaliana_metadata_local.csv -n 20 -out_dir $PWD/FINDER_test_ARATH -g $PWD/example/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/example/uniprot_ARATH.fasta -preserve --organism_model PLANTS --genemark_path ~/Programs/gmes_linux_64/ -gml ~/Programs/Finder-finder_v1.1.0/gm_key_64

This is the stderr:
1.1.0: Pulling from sagnikbanerjee15/finder
Digest: sha256:9816d258d2421d4625983c929f508b1f577cfe7ab3bc2042e841647a186c7931
Status: Image is up to date for sagnikbanerjee15/finder:1.1.0
docker.io/sagnikbanerjee15/finder:1.1.0
done
mv: cannot stat '/home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf': No such file or directory
Traceback (most recent call last):
File "/softwares/FINDER/Finder/finder", line 688, in <module>
main()
File "/softwares/FINDER/Finder/finder", line 649, in main
orchestrateGeneModelPrediction( options, logger_proxy, logging_mutex )
File "/softwares/FINDER/Finder/finder", line 461, in orchestrateGeneModelPrediction
findTranscriptsInEachSampleNotReportedInCombinedAnnotations( options, logger_proxy, logging_mutex )
File "/softwares/FINDER/Finder/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
combined_transcript_info = readAllTranscriptsFromGTFFileInParallel( [combined_gtf_filename, "combined", "combined"] )[0]
File "/softwares/FINDER/Finder/scripts/fileReadWriteOperations.py", line 290, in readAllTranscriptsFromGTFFileInParallel
fhr = open( gtf_filename, "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/home/server2/Programs/Finder-finder_v1.1.0/FINDER_test_ARATH/assemblies_psiclass_modified/combined/combined.gtf'

Do you have any suggestions to solve the issue?
Thank you in advance

exonerate result file input as novel finder command parameter

Exonerate takes a long time to run. Would it be possible to add an option to give the exonerate result file as an input?
We use a cluster with hundreds of nodes, which could be used to run exonerate before running finder.

SRA download as fastq.gz

Hello
Would it be possible for the download process to generate fastq.gz files instead of fastq?
I know there is an option that deletes those files later, but in the meantime they can take up a lot of space...
Also, as this is my first run of Finder for my project, I prefer to keep the intermediate files until it has completed.
But for my project I have 165 SRA libraries... so I have ended up with 990 GB of fastq files.

using RNA-Seq gzipped fastq files

Usually we receive or download gzipped fastq files.
It would be convenient not to have to gunzip them for Finder.
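As a rough illustration of the request above, here is a minimal Python sketch of reading a FASTQ file whether or not it is gzipped. It is not taken from the Finder code base; the helper names `open_fastq` and `count_reads` are hypothetical, and the record count assumes the usual four-line FASTQ layout.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed.

    Hypothetical helper for illustration; not part of Finder.
    """
    path = Path(path)
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == b"\x1f\x8b"
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records, assuming four lines per FASTQ record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Checking the magic bytes rather than the file extension means a compressed file with an unexpected name is still handled.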

Finder failed to analy

Hi Sagnik

I'm trying to run finder on the provided example data. However, it's just failing and throwing the following error.

Any help is appreciated.

finder -no_cleanup -mf test_metadata.csv -n 30 -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_olego_round5.sorted.bam
samtools merge: fail to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_olego_round5.sorted.bam": No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_olego_round5.sorted.bam
samtools merge: fail to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_olego_round5.sorted.bam": No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam
samtools index: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam": No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam
samtools index: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam": No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam
samtools index: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam": No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam
samtools index: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam": No such file or directory
open: No such file or directory
open: No such file or directory
open: No such file or directory
open: No such file or directory
sh: line 1: 148093 Segmentation fault (core dumped) /data/apps/finder/dep/psiclass_terminal_exon_length_modified//subexon-info /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_introns --noStats > /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_exons
sh: line 1: 148092 Segmentation fault (core dumped) /data/apps/finder/dep/psiclass_terminal_exon_length_modified//subexon-info /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_introns --noStats > /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_exons
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam
samtools view: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data2_final.sortedByCoord.out.bam" for reading: No such file or directory
[E::hts_open_format] Failed to open file /data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam
samtools view: failed to open "/data/apps/finder/example/FINDER_test_ARATH/alignments/dummy_data1_final.sortedByCoord.out.bam" for reading: No such file or directory
mv: cannot stat ‘/data/apps/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf’: No such file or directory
mv: cannot stat ‘/data/apps/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_sample_1.gtf’: No such file or directory
mv: cannot stat ‘/data/apps/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf’: No such file or directory
Traceback (most recent call last):
File "/data/apps/finder/finder", line 675, in <module>
main()
File "/data/apps/finder/finder", line 636, in main
orchestrateGeneModelPrediction(options,logger_proxy,logging_mutex)
File "/data/apps/finder/finder", line 449, in orchestrateGeneModelPrediction
findTranscriptsInEachSampleNotReportedInCombinedAnnotations(options,logger_proxy,logging_mutex)
File "/data/apps/finder/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
combined_transcript_info=readAllTranscriptsFromGTFFileInParallel([combined_gtf_filename,"combined","combined"])[0]
File "/data/apps/finder/scripts/fileReadWriteOperations.py", line 279, in readAllTranscriptsFromGTFFileInParallel
fhr=open(gtf_filename,"r")
FileNotFoundError: [Errno 2] No such file or directory: '/data/apps/finder/example/FINDER_test_ARATH/assemblies_psiclass_modified/combined/combined.gtf'

Kind regards

Error: EXITING: Did not find the genome in memory, did not remove any genomes from shared memory

Hi, I just came across this software as I have been struggling with MAKER. Running both the example data and my own data, I have encountered the same issue. It says it can't find the genome, but it's right there.

EXITING: Did not find the genome in memory, did not remove any genomes from shared memory
This is what is in the Log.out file:

cat Log.out
STAR version=2.7.7a
STAR compilation time,server,dir=Mon Dec 28 13:38:40 EST 2020 vega:/home/dobin/data/STAR/STARcode/STAR.master/source

Command Line:
STAR --runThreadN 30 --genomeLoad Remove --genomeDir /mnt/nfs/home/b9017460/finder/example/star_index_without_transcriptome

Initial USER parameters from Command Line:
All USER parameters from Command Line:
runThreadN 30 ~RE-DEFINED
genomeLoad Remove ~RE-DEFINED
genomeDir /mnt/nfs/home/b9017460/finder/example/star_index_without_transcriptome ~RE-DEFINED

Finished reading parameters from all sources
Final user re-defined parameters-----------------:
runThreadN 30
genomeDir /mnt/nfs/home/b9017460/finder/example/star_index_without_transcriptome
genomeLoad Remove

Final effective command line:
STAR --runThreadN 30 --genomeDir /mnt/nfs/home/b9017460/finder/example/star_index_without_transcriptome --genomeLoad Remove
Number of fastq files for each mate = 1
Finished loading and checking parameters
Reading genome generation parameters:

STAR --runMode genomeGenerate --runThreadN 30 --genomeDir star_index_without_transcriptome --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa --genomeSAindexNbases 12
GstrandBit=32
versionGenome 2.7.4a ~RE-DEFINED
genomeType Full ~RE-DEFINED
genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa ~RE-DEFINED
genomeSAindexNbases 12 ~RE-DEFINED
genomeChrBinNbits 18 ~RE-DEFINED
genomeSAsparseD 1 ~RE-DEFINED
genomeTransformType None ~RE-DEFINED
genomeTransformVCF - ~RE-DEFINED
sjdbOverhang 0 ~RE-DEFINED
sjdbFileChrStartEnd - ~RE-DEFINED
sjdbGTFfile - ~RE-DEFINED
sjdbGTFchrPrefix - ~RE-DEFINED
sjdbGTFfeatureExon exon ~RE-DEFINED
sjdbGTFtagExonParentTranscripttranscript_id ~RE-DEFINED
sjdbGTFtagExonParentGene gene_id ~RE-DEFINED
sjdbInsertSave Basic ~RE-DEFINED
genomeFileSizes 120586240 985722733 ~RE-DEFINED
Genome version is compatible with current STAR
Number of real (reference) chromosomes= 7
1 1 30427671 0
2 2 19698289 30670848
3 3 23459830 50593792
4 4 18585056 74186752
5 5 26975502 92798976
6 Mt 366924 119799808
7 Pt 154478 120324096
Started loading the genome: Fri Jun 18 10:22:22 2021

Genome: size given as a parameter = 120586240
SA: size given as a parameter = 985722733
SAindex: size given as a parameter = 1
Read from SAindex: pGe.gSAindexNbases=12 nSAi=22369620
nGenome=120586240; nSAbyte=985722733
GstrandBit=32 SA number of indices=238963086

EXITING: Did not find the genome in memory, did not remove any genomes from shared memory

Jun 18 10:22:22 ...... FATAL ERROR, exiting

Please help. I have tried adding the location, but it is still not working :(
I have been trying to annotate a genome for a very long time now but keep encountering errors.

Best,
Lenshina

Why are the BUSCO scores for the annotated example Arabidopsis thaliana genome proteins very low?

Dear authors,

Thanks a lot for your valuable and user-friendly software. I have tried the finder program to annotate the example Arabidopsis thaliana genome, but found that the proteins extracted from the FINDER_BRAKER_PROT.gtf file achieved a BUSCO score of only 54% against embryophyta_odb10. That is very low compared with the BUSCO score of 99.3% for the genome itself with the same database, and it means nearly half of the coding genes were not annotated. I do not know why.
The command I used to run finder is as follows:
finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n 30 -gdir_star $PWD/star_index_without_transcriptome -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -gdir_olego olego_index -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error

Could you give me some advice on improving the performance of the annotation?

Thanks again for reading.
Looking forward to your timely response.
Best,
Bob

alignment files empty in example run

Hi. Thanks for providing the Finder tool, I'm keen to use it.

I installed it fine (I think) and ran the Arabidopsis example. The following messages appeared (saved in 'Log.out'), but the run completed.

EXITING: Did not find the genome in memory, did not remove any genomes from shared memory
Jun 21 20:04:47 ...... FATAL ERROR, exiting

However, the alignments appear to have not worked. After the run has completed, the 'Log.final.out' file contains:

                             Started job on |       Jun 21 20:04:24
                         Started mapping on |       Jun 21 20:04:24
                                Finished on |       Jun 21 20:04:26
   Mapping speed, Million of reads per hour |       0.00

                      Number of input reads |       0
                  Average input read length |       0
                                UNIQUE READS:
               Uniquely mapped reads number |       0
                    Uniquely mapped reads % |       0.00%
                      Average mapped length |       0.00
                   Number of splices: Total |       0
        Number of splices: Annotated (sjdb) |       0
                   Number of splices: GT/AG |       0
                   Number of splices: GC/AG |       0
                   Number of splices: AT/AC |       0
           Number of splices: Non-canonical |       0
                  Mismatch rate per base, % |       -nan%
                     Deletion rate per base |       0.00%
                    Deletion average length |       0.00
                    Insertion rate per base |       0.00%
                   Insertion average length |       0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       0
         % of reads mapped to multiple loci |       0.00%
    Number of reads mapped to too many loci |       0
         % of reads mapped to too many loci |       0.00%
                              UNMAPPED READS:

and the 'progress.log' contains the following (just showing the final section):

2021-06-21 19:57:37,814 - finder - INFO - Started processing data for cold
2021-06-21 19:57:37,816 - finder - INFO - Downloading missing data from NCBI started
2021-06-21 19:57:37,816 - finder - INFO - Running command - /pathto/FINDER/finder/dep/../utils/downloadAndDumpFastqFromSRA.py -s /pathto/FINDER/finder/example/FINDER_test_ARATH/temp/download_these_runs -o /pathto/FINDER/finder/example/FINDER_test_ARATH/raw_data_downloaded_from_NCBI -n 20 > /pathto/FINDER/finder/example/FINDER_test_ARATH/temp/download_these_runs.output 2> /pathto/FINDER/finder/example/FINDER_test_ARATH/temp/download_these_runs.error
2021-06-21 20:04:09,161 - finder - INFO - Downloading missing data from NCBI finished for cold
2021-06-21 20:04:14,766 - finder - INFO - STAR Round1 run for SRR8422200 completed
2021-06-21 20:04:20,366 - finder - INFO - STAR Round1 run for SRR8422201 completed
2021-06-21 20:04:26,413 - finder - INFO - STAR Round1 run for SRR8422202 completed
2021-06-21 20:04:26,413 - finder - INFO - Mapping of reads for round1 completed for cold
2021-06-21 20:04:26,429 - finder - INFO - Selecting high confidence junctions after round1 mapping completed for cold
2021-06-21 20:04:29,802 - finder - INFO - STAR Round2 run for SRR8422200 completed
2021-06-21 20:04:33,525 - finder - INFO - STAR Round2 run for SRR8422201 completed
2021-06-21 20:04:36,956 - finder - INFO - STAR Round2 run for SRR8422202 completed
2021-06-21 20:04:36,957 - finder - INFO - Mapping of reads for round2 completed for cold
2021-06-21 20:04:36,978 - finder - INFO - Selecting high confidence junctions after round2 mapping completed for cold
2021-06-21 20:04:36,978 - finder - INFO - Mapping rate in round2 cold SRR8422200 0.0
2021-06-21 20:04:36,978 - finder - INFO - Mapping rate in round2 cold SRR8422201 0.0
2021-06-21 20:04:36,978 - finder - INFO - Mapping rate in round2 cold SRR8422202 0.0
2021-06-21 20:04:36,978 - finder - INFO - Resorting to alignment with relaxed parameters for these runs due to poor mapping SRR8422200,SRR8422201,SRR8422202
2021-06-21 20:04:40,347 - finder - INFO - STAR relaxed alignment run for SRR8422200 completed
2021-06-21 20:04:44,046 - finder - INFO - STAR relaxed alignment run for SRR8422201 completed
2021-06-21 20:04:47,614 - finder - INFO - STAR relaxed alignment run for SRR8422202 completed
2021-06-21 20:04:47,614 - finder - INFO - Mapping of reads for round3 completed for cold
2021-06-21 20:04:47,642 - finder - INFO - Selecting high confidence junctions after round3 mapping completed for cold
2021-06-21 20:04:47,642 - finder - INFO - Mapping of reads for round4 completed for cold
2021-06-21 20:04:47,669 - finder - INFO - Selecting high confidence junctions after round4 mapping completed for cold
2021-06-21 20:04:47,670 - finder - INFO - Mapping with OLego for micro-exon detection completed for cold
2021-06-21 20:04:47,769 - finder - INFO - Merging of alignments from all rounds of mapping completed for cold
2021-06-21 20:04:47,770 - finder - INFO - Mapping of all runs completed for cold
2021-06-21 20:04:48,332 - finder - INFO - Information collection about alignments completed
2021-06-21 20:04:49,126 - finder - INFO - Generation of assemblies with PsiCLASS completed

and the majority of the files in '/example/FINDER_test_ARATH/alignments' are empty.

Any help would be most appreciated.

thanks,
Gareth
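For anyone checking for the same symptom, the sketch below reads the `label | value` lines of a STAR `Log.final.out` file like the one quoted above and reports whether STAR saw zero input reads. This is an illustrative helper, not part of Finder; the function names are hypothetical, and the only assumption about the format is the `|` separator visible in the quoted log.

```python
def parse_star_final_log(text):
    """Return a dict mapping each label in Log.final.out to its value string."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings such as 'UNIQUE READS:' carry no value
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats


def has_no_input_reads(text):
    """True when the log reports zero input reads (or omits the field)."""
    stats = parse_star_final_log(text)
    return int(stats.get("Number of input reads", "0")) == 0
```

A call such as `has_no_input_reads(open("Log.final.out").read())` would return `True` for the log shown in this issue, which points at the input reads (for example the download step) rather than at the aligner settings.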

Final output has many gene annotations in repetitive regions

@sagnikbanerjee15

I found that in the final output, many genes are located in repeat regions.
It seems the soft masking of the genome doesn't work well.

Got any idea?

Thanks

Use of Long Read (PacBio IsoSeq)

May I ask if FINDER is compatible with long read sequencing? If not, how do you recommend enhancing the annotation results with the FINDER output using such data?

Thanks a lot!

repeat masked genome fasta file

Do you recommend using a repeat-masked genome fasta file as finder input?
Did you check the differences found in the results with and without repeat masking?

Low BUSCO stats

Hi!

My previous annotation with MAKER and BRAKER had ~97% complete BUSCO genes for the genome I have. This was consistent with outputs from BUSCO directly run on the genome. But this annotation had many fragmented gene models, so I tried FINDER. The software finished running, but many gene models seem to be missing.

gffread -x FINDER_BRAKER_PROT.cds.fasta -y FINDER_BRAKER_PROT.aa.fasta -g ../../../../1.Final_Assembly/5.Final_data/SH1353.primary.scaffolds.noPlasmid.fa FINDER_BRAKER_PROT.gtf
less FINDER_BRAKER_PROT.aa.fasta | grep -c ">"
48708

I ran BUSCO for a quick check on this FINDER output without filtering out multiple transcripts per gene. The statistics show 89.3% completeness.

INFO C:89.3%[S:61.9%,D:27.4%],F:2.5%,M:8.2%,n:5950
INFO 5310 Complete BUSCOs (C)
INFO 3682 Complete and single-copy BUSCOs (S)
INFO 1628 Complete and duplicated BUSCOs (D)
INFO 146 Fragmented BUSCOs (F)
INFO 494 Missing BUSCOs (M)
INFO 5950 Total BUSCO groups searched

It looks like BRAKER captured many of these missing genes, but they were rejected from the final annotation set. BRAKER annotation shows 95.1% BUSCO completeness.

less braker.gtf | awk '$3 == "transcript" {print}' | grep -c "t1"
40802

    C:95.1%[S:84.9%,D:10.2%],F:1.3%,M:3.6%,n:5950
    5661 Complete BUSCOs (C)
    5054 Complete and single-copy BUSCOs (S)
    607 Complete and duplicated BUSCOs (D)
    75 Fragmented BUSCOs (F)
    214 Missing BUSCOs (M)
    5950 Total BUSCO groups searched

I don't remember reading about stringent filtering criteria for BRAKER annotation in the paper. Could you point out why this would happen and how I can improve it?

Thank you!
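To compare the two BUSCO summaries quoted above field by field, a small parser for the one-line summary can help. The sketch below is illustrative only: `parse_busco_summary` is a hypothetical helper, and the regular expression assumes the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout exactly as it appears in this issue.

```python
import re

# Matches the one-line BUSCO summary as quoted in the issue above.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages C, S, D, F, M as floats and n as an int."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("no BUSCO summary found in: %r" % line)
    fields = match.groupdict()
    result = {key: float(fields[key]) for key in "CSDFM"}
    result["n"] = int(fields["n"])
    return result


finder = parse_busco_summary("INFO C:89.3%[S:61.9%,D:27.4%],F:2.5%,M:8.2%,n:5950")
braker = parse_busco_summary("C:95.1%[S:84.9%,D:10.2%],F:1.3%,M:3.6%,n:5950")
```

With both summaries parsed, `finder["D"]` and `braker["D"]` make it easy to see that the FINDER protein set has a much higher duplicated fraction, which is expected when multiple transcripts per gene are left unfiltered.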

Issue during Braker

It seems finder is failing during BRAKER and I can't quite determine why. Attached are the progress log, the error file, and the job script I used. I emailed the total output file to you as well; I just can't seem to get it attached here.
finder -no_cleanup -preserve --addUTR -mf ${DIR}/1_evidence/metadata.csv \
-n 32 -out_dir ${DIR}/2_outputs \
-g ${DIR}/1_evidence/Ia453-masked-genome.fa \
-p ${DIR}/1_evidence/protein.fa -preserve 1> \
${DIR}/2_outputs/FINDER_test_Ia453.output 2> \
${DIR}/2_outputs/errors/FINDER_test_Ia453.error
progress (1).log
FINDER_test_Ia453.error.zip

Finder on local data

Hello there,

Thank you for the wonderful software. I am trying to use finder with RNA-Seq data that resides on my local machine. From the logs it seems that it's not actually running the STAR alignment, yet it still reports that the runs completed. For example:

2021-08-16 14:45:11,921 - finder -   INFO - STAR Round1 run for RSB01_1_cutadapt.fastq.gz completed
2021-08-16 14:45:12,347 - finder -   INFO - STAR Round1 run for RSB01_2_cutadapt.fastq.gz completed
2021-08-16 14:45:12,881 - finder -   INFO - STAR Round1 run for RSB02_1_cutadapt.fastq.gz completed
2021-08-16 14:45:13,418 - finder -   INFO - STAR Round1 run for RSB02_2_cutadapt.fastq.gz completed
2021-08-16 14:45:13,965 - finder -   INFO - STAR Round1 run for RSB03_1_cutadapt.fastq.gz completed
2021-08-16 14:45:14,629 - finder -   INFO - STAR Round1 run for RSB03_2_cutadapt.fastq.gz completed

Because it's not producing any files, it ultimately finishes with errors.

The metadata file looks something like this:

BioProject,SRA Accession,Tissues,Description,Date,Read Length (bp),Ended,RNA-Seq,process,Location
BAT,RSB01_1_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq
BAT,RSB01_2_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq
BAT,RSB02_1_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq
BAT,RSB02_2_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq
BAT,RSB03_1_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq
BAT,RSB03_2_cutadapt.fastq.gz,Brain,Uninfected,,,PE,1,1,/lustre/analysis/annotation/RNAseq

Maybe the files are not being picked up from the location? Any thoughts on this would be very valuable.

Thank you.
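One way to test the "files not picked up" hypothesis is to check, for every row of the metadata file, that something under `Location` matches the value in `SRA Accession`. The sketch below does that with the column names from the metadata shown above. It is a diagnostic aid only: `find_missing_inputs` is a hypothetical helper, and matching by file-name prefix is an assumption, not a statement of how Finder resolves local files.

```python
import csv
from pathlib import Path


def find_missing_inputs(metadata_csv):
    """List (accession, location) pairs for which no matching file exists.

    Assumption: a local input counts as present when some file in
    'Location' has a name starting with the 'SRA Accession' value.
    """
    missing = []
    with open(metadata_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            accession = (row.get("SRA Accession") or "").strip()
            location = (row.get("Location") or "").strip()
            if not location:
                continue  # no local path given for this row
            directory = Path(location)
            found = directory.is_dir() and any(
                entry.name.startswith(accession) for entry in directory.iterdir()
            )
            if not found:
                missing.append((accession, location))
    return missing
```

An empty list from `find_missing_inputs("metadata.csv")` means every listed name was found on disk, in which case the problem more likely lies in how the sample names are interpreted than in the paths themselves.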
