
emo-bon / metagoflow


This project forked from ebi-metagenomics/pipeline-v5


A MGnify-oriented implementation of the Marine Genomic Observatories pipeline, developed in the framework of an EOSC-Life funded project

Home Page: https://metagoflow.readthedocs.io

License: Apache License 2.0

Common Workflow Language 38.76% Shell 6.08% Dockerfile 5.04% Python 43.61% Perl 6.50%

metagoflow's People

Contributors

cedricdcc, cpavloud, gdemoro, hariszaf, jprmachado, katesakharova, mb1069, mberacochea, mr-c, mscheremetjew, steninidak, vkale1


metagoflow's Issues

remove raw data from ro-crate in case ENA fetch tool is used

When metaGOflow runs using the ENA fetch tool, a raw_data folder is created under the output folder.
However, we do not need these raw data in the ro-crate.
Thus, move the raw_data folder out of the output folder, or remove it altogether, before the ro-crate is built.
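A minimal sketch of that step, assuming the folder is literally named raw_data and that parking it next to the output folder is acceptable. The function name and the destination naming are hypothetical, not part of metaGOflow:

```python
import shutil
from pathlib import Path


def exclude_raw_data(output_dir, raw_name="raw_data", keep=True):
    """Move (or delete) the raw data folder so it is not packed into the ro-crate.

    `raw_name` and the sibling destination are assumptions for illustration.
    Returns the new location, or None if nothing was moved.
    """
    output_dir = Path(output_dir)
    raw_dir = output_dir / raw_name
    if not raw_dir.is_dir():
        return None  # ENA fetch was not used, nothing to do
    if not keep:
        shutil.rmtree(raw_dir)
        return None
    # Park the folder next to the output folder, outside the crate root.
    destination = output_dir.parent / f"{output_dir.name}_{raw_name}"
    shutil.move(str(raw_dir), str(destination))
    return destination
```

Calling this right before the ro-crate is built would keep the fetched reads available without shipping them in the crate.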

eggNOG output is not returned

The output of the eggNOG annotation is not returned.

It's not only about adding a file to the outputs; the output file also needs to be parsed in order to build a summary like those for KOs, IPS, etc.

For example:

"GO:0006355","regulation of transcription, DNA-templated","biological_process","36"
"GO:0055085","transmembrane transport","biological_process","33"
"GO:0006412","translation","biological_process","28"

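A rough sketch of how such a summary could be built. It assumes the eggNOG annotation table is tab-separated, has a header row, and carries comma-separated GO ids in a column called GOs (with "-" meaning none). The column name and the go_info lookup table for descriptions and namespaces are assumptions for illustration:

```python
import csv
from collections import Counter


def summarise_go(annotation_lines, go_info, go_column="GOs"):
    """Count in how many proteins each GO term occurs.

    annotation_lines: iterable of tab-separated lines, header row included.
    go_info: mapping GO id -> (description, namespace); hypothetical lookup table.
    Returns rows (id, description, namespace, count) sorted by count, then id.
    """
    # Drop "##" comment lines but keep the header line.
    rows = (line for line in annotation_lines if not line.startswith("##"))
    counts = Counter()
    for record in csv.DictReader(rows, delimiter="\t"):
        terms = record.get(go_column) or "-"
        if terms != "-":
            counts.update(set(terms.split(",")))  # count each term once per protein
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(go, *go_info.get(go, ("", "")), str(n)) for go, n in ordered]


def write_summary(rows, handle):
    """Write the rows as fully quoted CSV, matching the example layout above."""
    csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
```

The quoting mirrors the example rows; the real summary files for KOs and IPS may be produced differently.
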
Multiple tmp/data folders

When running in parallel with cwltool, after multiple instances of IPS have run, the tmp folder is left with all the leftover data required for the analyses.

Adjusting the parameters to:

+protein_chunk_size_IPS: 20000
+protein_chunk_size_eggnog: 1000
+protein_chunk_size_hmm: 500

This forces the analysis to run 19 instances of the IPS docker, rather than the 1 instance run with the default values, when testing with the input:

test_?_???_HWLTKDRXY_600000.fastq.gz

The leftovers from the analyses jumped from ~0.3T to ~1.5T. As far as I can see, the contents of all the tmp/*/data folders are the same.

Changing this so that all instances use the same folder probably needs to be done in the cwltool source code. The other possible solution is to clean up after each IPS instance is destroyed.
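The clean-up option could look roughly like the sketch below. It assumes the leftovers sit in folders named data exactly one level below the tmp root, as the tmp/*/data pattern above suggests. The function name is hypothetical, and the default is a dry run because deleting is irreversible:

```python
import shutil
from pathlib import Path


def clean_leftover_data(tmp_root, folder_name="data", dry_run=True):
    """List (and optionally delete) leftover <tmp_root>/*/<folder_name> folders.

    Returns the sorted list of folders found. Nothing is removed unless
    dry_run is set to False.
    """
    tmp_root = Path(tmp_root)
    leftovers = sorted(p for p in tmp_root.glob(f"*/{folder_name}") if p.is_dir())
    if not dry_run:
        for path in leftovers:
            shutil.rmtree(path)
    return leftovers
```

This should only be run once no IPS instance is still using the folders.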

Database integrity: checking md5sums

Do the databases downloaded with the download_dbs.sh script have md5sums we could check against? With the dbs being so big, it was taking > 10 hours to download, and I would be very concerned about downloads being interrupted or corrupted.
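If checksums were published, the check itself would be small. The sketch below assumes a checksum file in the usual md5sum layout (digest, whitespace, file name) sitting next to the downloaded files. Whether such a file exists for these databases is exactly the open question, so the file and function names are hypothetical:

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the md5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(checksum_file):
    """Return the names listed in an md5sum-style file that are missing or corrupted."""
    checksum_file = Path(checksum_file)
    failed = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading "*"
        target = checksum_file.parent / name
        if not target.is_file() or md5_of(target) != expected.lower():
            failed.append(name)
    return failed
```

An empty list from verify_downloads would mean every listed database arrived intact.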

diamond CPU overload

Diamond asks for 5 * <num_threads> causing overload.

A thorough description of its behavior could benefit the wf.

replace trimmomatic step(s) with fastp

Trimmomatic can be tricky and take some time.

Often the threads are not fully used.

We will replace trimmomatic with fastp in all cases used in our wf.

You can find a cwl script of fastp here

Hmmsearch has hard-coded 4 threads per chunk analysis

Is there a reason why hmmsearch has 4 threads hard-coded per chunk analysis?
tools/hmmer/hmmsearch/hmmsearch.cwl line:

82 - prefix: --cpu
83 valueFrom: '4'

Why not allocate total_number_of_threads/number_chunks per chunk analysis?
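The suggested allocation amounts to an integer division with a floor of one thread. A sketch with hypothetical names, assuming all chunks run at the same time:

```python
def threads_per_chunk(total_threads, number_of_chunks):
    """Split the available threads evenly across chunks, at least 1 each."""
    if total_threads < 1 or number_of_chunks < 1:
        raise ValueError("both arguments must be positive integers")
    return max(1, total_threads // number_of_chunks)
```

With more chunks than threads this still hands out one thread per chunk, so the number of chunks running concurrently would have to be capped elsewhere to avoid oversubscription.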

Building the rocrate should not be reliant on the root filesystem

The workflow completed successfully, but the rocrate failed to build because it was trying to use the /tmp filesystem and ran out of space. It threw an error and then proceeded to delete all the results, thereby losing 9 days of compute work.

INFO [workflow ] completed success                                                                                                                                                                                 
ERROR Unhandled error, try again with --debug for more information:                                                                                                                                                
  can only concatenate str (not "bool") to str                                                                                                                                                                     
ls: cannot access '*.merged.CDS.I5_*.tsv.gz': No such file or directory                                                                                                                                            
ls: cannot access '*.merged.CDS.I5_*.tsv.gz': No such file or directory                                                                                                                                            
export...                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                 
  File "utils/edit-ro-crate.py", line 470, in <module>                                                                                                                                                             
    main(args.target_directory, args.extended_config_yaml, args.ena_run_accession_id, args.metagoflow_version)                                                                                                     
  File "utils/edit-ro-crate.py", line 441, in main                                                                                                                                                                 
    crate.write_zip("".join([target_directory,".zip"]))                                                                                                                                                            
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/site-packages/rocrate/rocrate.py", line 440, in write_zip                                                                                          
    self.write(tmp_dir)                                                                                                                                                                                            
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/site-packages/rocrate/rocrate.py", line 430, in write                                                                                              
    writable_entity.write(base_path)                                                                                                                                                                               
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/site-packages/rocrate/model/file.py", line 68, in write                                                                                            
    shutil.copy(self.source, out_file_path)                                                                                                                                                                        
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/shutil.py", line 418, in copy                                                                                                                      
    copyfile(src, dst, follow_symlinks=follow_symlinks)                                                                                                                                                            
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/shutil.py", line 275, in copyfile                                                                                                                  
    _fastcopy_sendfile(fsrc, fdst)                                                                                                                                                                                 
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/shutil.py", line 166, in _fastcopy_sendfile                                                                                                        
    raise err from None                                                                                                                                                                                            
  File "/usr/local/src/miniconda3/envs/metaGOflow/lib/python3.8/shutil.py", line 152, in _fastcopy_sendfile                                                                                                        
    sent = os.sendfile(outfd, infd, offset, blocksize)                                                                                                                                                             
OSError: [Errno 28] No space left on device: 'HWLTKDRXY.UDI210/results/DBH_AAAAOSDA_1_2_HWLTKDRXY.UDI235_clean.fastq.trimmed.fasta' -> '/tmp/rocrate_ilrd4kfh/results/DBH_AAAAOSDA_1_2_HWLTKDRXY.UDI235_clean.fastq.trimmed.fasta'
(metaGOflow) 

The rocrate build should not use the root filesystem (very small, in my case 50G) rather than the partition on which the wf is launched (20TB).
Building the ro-crate should perhaps be separated from the workflow, or at the very least the wf needs to check that the rocrate build was successful before deleting all the results.
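One possible workaround, sketched under the assumption that the staging folder is created through Python's tempfile module (the /tmp/rocrate_ilrd4kfh path in the traceback suggests so, but this is not confirmed here). The helper names are hypothetical:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_dir_on(partition_dir):
    """Temporarily make `tempfile` create its files under `partition_dir`."""
    partition_dir = Path(partition_dir)
    partition_dir.mkdir(parents=True, exist_ok=True)
    previous = tempfile.tempdir
    tempfile.tempdir = str(partition_dir)
    try:
        yield partition_dir
    finally:
        tempfile.tempdir = previous  # restore the default lookup


def folder_size(path):
    """Total size in bytes of all regular files below `path`."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def has_room_for(results_dir, scratch_dir):
    """True if the scratch partition has more free space than the results occupy."""
    return shutil.disk_usage(scratch_dir).free > folder_size(results_dir)
```

The crate.write_zip(...) call from the traceback would then run inside `with temp_dir_on(...)`, and the results would only be deleted after the zip file is confirmed to exist. Exporting TMPDIR before launching the workflow may achieve the same redirection without code changes.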

Chunked results should be concatenated at end of execution

Some steps, such as InterProScan, chunk the work into bits. These should be joined after all the chunks are computed. At present the number of output files depends on the input:

If there is a lot of data it gets split up into more than one file:
DBB.merged_CDS.I5_001.tsv.gz
DBB.merged_CDS.I5_002.tsv.gz

When there is less output there is only one file:
DBB.merged_CDS.I5.tsv.gz
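A sketch of the joining step, assuming the chunk files follow the `<stem>_NNN.tsv.gz` naming shown above and contain no header line. It relies on the fact that gzip members appended back to back form a valid gzip file, so nothing needs to be decompressed. The function name is hypothetical:

```python
import shutil
from pathlib import Path


def concatenate_chunks(folder, stem, suffix=".tsv.gz"):
    """Join <stem>_NNN<suffix> chunk files into a single <stem><suffix> file.

    If no numbered chunks are found, the single-file case is assumed and the
    path is returned untouched.
    """
    folder = Path(folder)
    merged = folder / f"{stem}{suffix}"
    chunks = sorted(folder.glob(f"{stem}_[0-9]*{suffix}"))
    if not chunks:
        return merged
    with open(merged, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                shutil.copyfileobj(part, out)  # append the raw gzip member
    return merged
```

For the example above, `concatenate_chunks(results_dir, "DBB.merged_CDS.I5")` would produce DBB.merged_CDS.I5.tsv.gz regardless of how many chunks were written. The chunk files are left in place for the caller to remove.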

MetaGOflow v1.0 fails to run because of incompatible Prosite binary in IPS 5.57-90.0

MetaGOflow v1.0 aborts with an error on cluster using AMD EPYC 7643 cpus due to the following Prosite binary incompatibility in InterProScan:

28/07/2023 13:53:31:803 Running InterProScan v5 in STANDALONE mode... on Linux
log4j:WARN No appenders could be found for logger (org.apache.activemq.broker.BrokerService).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
28/07/2023 13:53:37:135 RunID: hpc066.a.incd.pt_20230728_135336815_i87a
28/07/2023 13:53:44:194 Loading file /var/lib/cwl/stg0e3d7f9f-ad7f-4ce2-9653-2b028d4e70b2/2000001_3000000.faa
28/07/2023 13:53:44:195 Running the following analyses:
[Pfam-35.0,ProSitePatterns-2022_01,ProSiteProfiles-2022_01,TIGRFAM-15.0]
Pre-calculated match lookup service DISABLED.  Please wait for match calculations to complete...
28/07/2023 15:14:58:144 Uploaded 368019 unique sequences for analysis
2023-07-28 15:17:13,145 [amqEmbeddedWorkerJmsContainer-3] [uk.ac.ebi.interpro.scan.management.model.implementations.RunBinaryStep:199] ERROR - Command line failed with exit code: 1
Command: python3 bin/prosite/runprosite.py data/prosite/2022_01/profile_models /tmp/hpc066.a.incd.pt_20230728_135141098_4rxh//jobPrositeProfiles/000000212001_000000213000.fasta /tmp/hpc066.a.incd.pt_20230728_135141098_4rxh//jobPrositeProfiles/000000212001_000000213000.raw.out bin/prosite/pfsearchV3 -f -o 7 -t 4
Error output from binary:
Error setting affinity!
Error running prosite binary bin/prosite/pfsearchV3

2023-07-28 15:17:13,150 [amqEmbeddedWorkerJmsContainer-3] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:216] ERROR - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
java.lang.IllegalStateException: Command line failed with exit code: 1
Command: python3 bin/prosite/runprosite.py data/prosite/2022_01/profile_models /tmp/hpc066.a.incd.pt_20230728_135141098_4rxh//jobPrositeProfiles/000000212001_000000213000.fasta /tmp/hpc066.a.incd.pt_20230728_135141098_4rxh//jobPrositeProfiles/000000212001_000000213000.raw.out bin/prosite/pfsearchV3 -f -o 7 -t 4
Error output from binary:
Error setting affinity!
Error running prosite binary bin/prosite/pfsearchV3

This seems to be due to a Prosite binary incompatibility with this version of IPS, as reported elsewhere.

The version of IPS in v1.0 is 5.57-90.0 from Aug 2022. Unfortunately, even if this problem has been resolved in newer versions of IPS (latest: June 2023), changes to Docker mean that the current Dockerfile for IPS in MetaGOflow will not compile the newer versions of IPS.

The only way to resolve this for MetaGOflow v.1.0 would be to recompile the Prosite binaries with the newer compatible libraries and rebuild the IPS 5.57-90.0 container. This may not be worth the effort at this stage with more than half of the first EMO BON batch of data already analysed.

However, this should be resolved for MetaGOflow v1.1 (or whatever version we use to analyse the second EMO BON data tranche).
