Comments (11)
Not sure if this will help now but just in case anyone else wants to convert silva to sintax fasta format, I wrote this.. it won't give the proper taxonomic levels to eukaryotes in silva132 but it will work with -makeudb_sintax.
Theo
#!/usr/bin/env python
#reformats SILVA formatted taxonomy in the header of fasta files to be compatible with usearch and sintax dbs.
#useage:
#reformat_silva.py infile.fasta outfile.fasta
import sys
f= open(sys.argv[1],'r')
h= open(sys.argv[2],'w')
tax=['d','p','c','o','f','g','s']
for i in f:
out=""
if i[0]==">":
k=i.split(";")
if len(k)>7:
k=k[-7:]
acc=k[0].split(" ")[0]
k[-1]=k[-1].rstrip("\n")
if " " in k[0]:
k[0]=k[0].split(" ")[1]
c=-1
for x in k:
c=c+1
out=out+tax[c]+":"+x+","
out=out.rstrip(",")
h.write(acc+";tax="+out+"\n")
else:
i=i.replace('U','T')
h.write(i)
from amptk.
update: I've tried this, but the size of the database is large and searches are incredibly slow, need to figure out why before doing this. There seems to be a lot of redundancy in the database, one solution would be to dereplicate and then LCA, but not sure that will then be any better.
from amptk.
Failed to create database UTAX and SINTAX but usearch created.
my command:
#!/bin/bash
SILVA="/media/bioadmin/Think/MiseqDATA/SILVA_Database"
download the SILVA_132_SSURef_NR99 database and extract the V4 region
mkdir $SILVA && cd $SILVA
RELEASE=132
URL="https://www.arb-silva.de/fileadmin/silva_databases/release_${RELEASE}/Exports"
INPUT="SILVA_${RELEASE}_SSURef_Nr99_tax_silva.fasta.gz"
Download and check
wget -c ${URL}/${INPUT}{,.md5} && md5sum -c ${INPUT}.md5
Define variables and output files
Primers V3-V4 region (~465 bp) 341F & 805R
OUTPUT="${INPUT/.fasta.gz/_341F_805R.fasta}"
LOG="${INPUT/.fasta.gz/_341F_805.log}"
PRIMER_F="CCTACGGGNGGCWGCAG"
PRIMER_R="GACTACHVGGGTATCTAATCC"
ANTI_PRIMER_R="GGATTAGATACCCBDGTAGTC"
MIN_LENGTH=32
MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))
MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))
Extract target region using forward & reverse primers
zcat "${INPUT}" | sed '/^>/ ! s/U/T/g' |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -g "${PRIMER_F}" -O "${MIN_F}" - 2> "${LOG}" |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -a "${ANTI_PRIMER_R}" -O "${MIN_R}" - 2>> "${LOG}" |
sed '/^>/ s/;/|/g ; /^>/ s/ //g ; /^>/ s// /1' > "${OUTPUT}"
Format extracted SILVA database
sed 's/://g' $SILVA/${OUTPUT} |
sed 's/-//g' | sed 's/(//g' | sed 's/)//g' |
sed 's/ Bacteria/;tax=d:Bacteria,/g' |
sed 's/|/p:"/1' | sed 's/|/",c:/1' | sed 's/|/,o:/1' |
sed 's/|/,f:/1' | sed 's/|/,g:/1' | sed 's/|/,s:/1' >
amptk database = This is my output:
amptk database -i
--skip_trimming --install --primer_required none --derep_fulllength
[12:50:36 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:50:36 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:50:36 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S
[12:50:36 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:50:49 PM]: 594,027 records loaded
[12:50:49 PM]: Using 8 cpus to process data
[12:51:29 PM]: 593,752 records passed (99.95%)
[12:51:29 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:51:29 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:51:40 PM]: 359,733 records passed (60.56%)
[12:51:41 PM]: Creating USEARCH Database (VSEARCH)
[12:52:08 PM]: Database /usr/local/lib/python3.4/dist-packages/amptk/DB/16S.udb created successfully
amptk database -i
--skip_trimming --install --primer_required none --derep_fulllength
[12:52:46 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:52:46 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:52:46 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_SINTAX
[12:52:46 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:52:59 PM]: 594,027 records loaded
[12:52:59 PM]: Using 8 cpus to process data
[12:53:36 PM]: 593,752 records passed (99.95%)
[12:53:36 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:53:36 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:53:47 PM]: 359,733 records passed (60.56%)
[12:53:47 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
File "/usr/local/bin/amptk", line 735, in
main()
File "/usr/local/bin/amptk", line 726, in main
mod.main(arguments)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main
makeDB(OutName, args=args)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB
amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment
amptk database -i
--skip_trimming --install --primer_required none --derep_fulllength
[12:55:22 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:55:22 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:55:22 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_UTAX
[12:55:22 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:55:35 PM]: 594,027 records loaded
[12:55:35 PM]: Using 8 cpus to process data
[12:56:12 PM]: 593,752 records passed (99.95%)
[12:56:12 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:56:12 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:56:22 PM]: 359,733 records passed (60.56%)
[12:56:22 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
File "/usr/local/bin/amptk", line 735, in
main()
File "/usr/local/bin/amptk", line 726, in main
mod.main(arguments)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main
makeDB(OutName, args=args)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB
amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment
from amptk.
Likely running out of memory or the taxonomy labels not formatted correctly. Check the logfile to see what usearch died with.
from amptk.
usearch v9.2.64_i86linux32, 4.0Gb RAM (1040Gb total), 160 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
00:00 37Mb 0.1% Reading 16S_SINTAX.extracted.fa
00:01 84Mb 21.5% Reading 16S_SINTAX.extracted.fa
00:02 174Mb 65.5% Reading 16S_SINTAX.extracted.fa
00:02 247Mb 100.0% Reading 16S_SINTAX.extracted.fa
00:02 214Mb 0.1% Converting to upper case
00:03 214Mb 65.2% Converting to upper case
00:03 214Mb 100.0% Converting to upper case
00:03 215Mb 0.1% Word stats
00:04 215Mb 8.9% Word stats
00:05 215Mb 22.3% Word stats
00:06 215Mb 35.9% Word stats
00:07 215Mb 47.4% Word stats
00:08 215Mb 59.2% Word stats
00:09 215Mb 70.6% Word stats
00:10 215Mb 81.7% Word stats
00:11 215Mb 93.5% Word stats
00:11 215Mb 100.0% Word stats
00:11 215Mb 100.0% Alloc rows
00:11 798Mb 0.1% Build index
00:12 798Mb 2.0% Build index
00:13 798Mb 10.2% Build index
00:14 798Mb 17.5% Build index
00:15 798Mb 24.9% Build index
00:16 798Mb 32.7% Build index
00:17 798Mb 41.0% Build index
00:18 798Mb 49.3% Build index
00:19 798Mb 57.7% Build index
00:20 798Mb 65.7% Build index
00:21 798Mb 73.7% Build index
00:22 798Mb 81.9% Build index
00:23 798Mb 88.4% Build index
00:24 798Mb 96.4% Build index
00:24 798Mb 100.0% Build index
00:24 801Mb 0.0% Initialize taxonomy data
00:25 805Mb 13.9% Initialize taxonomy data
00:26 810Mb 33.0% Initialize taxonomy data
00:27 814Mb 54.0% Initialize taxonomy data
00:28 820Mb 75.6% Initialize taxonomy data
00:29 823Mb 94.2% Initialize taxonomy data
00:29 824Mb 100.0% Initialize taxonomy data
00:29 824Mb 0.0% Building name table
00:29 826Mb 100.0% Building name table
00:29 826Mb 58553 names, tax levels min 0, avg 6.5, max 7
WARNING: 1535 taxonomy nodes have >1 parent
usearch9 -makeudb_sintax 16S_SINTAX.extracted.fa -output 16S_SINTAX.udb -notrunclabels
---Fatal error---
../utaxdata.cpp(1176) assert failed: Size > 0 && Ids != 0
from amptk.
Hopefully that error makes sense - you have sequences (at least one) that doesn’t have any taxonomy information.
from amptk.
I am commenting here because it is related, I am trying to build a tailored V4 database out of SILVA NR 132. I guess I've managed to format the headers properly, the fasta has now around 200 Mb. However, after running:
amptk database -i '/media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132' -o V4_NR132 --format off --create_db utax --skip_trimming --install --primer_required none --derep_fulllength
I got:
[06:30:23 PM]: OS: debian buster/sid, 16 cores, ~ 66 GB RAM. Python: 3.7.3
[06:30:23 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.13.6
[06:30:23 PM]: Base name set to: /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132
[06:30:23 PM]: Working on file: /media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132
[06:30:33 PM]: 556,586 records loaded
[06:30:33 PM]: Using 16 cpus to process data
[06:30:46 PM]: 556,389 records passed (99.96%)
[06:30:46 PM]: Errors: 0 no taxonomy info, 0 no ID, 1 length out of range, 196 too many ambiguous bases, 0 no primers found
[06:30:46 PM]: Now dereplicating sequences (collapsing identical sequences)
[06:30:50 PM]: 255,268 records passed (45.86%)
[06:30:50 PM]: Creating UTAX Database, this may take awhile
[07:03:52 PM]: There was a problem creating the DB, check the UTAX log file /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.utax.log
The utax logfile says:
00:00 37Mb 0.1% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 112Mb 7
0.6% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 142Mb 100.0% Reading /ho
me/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa
00:01 108Mb 0.1% Converting to upper case ^M00:01 108Mb 10
0.0% Converting to upper case
00:01 109Mb 0.1% Word stats ^M00:02 109Mb 55.9% Word stats^M00:02 109Mb 100.0% Word stats
00:02 109Mb 100.0% Alloc rows
00:02 358Mb 0.1% Build index^M00:03 358Mb 19.3% Build index^M00:04 358Mb 66.9% Build index^M00:04 358Mb 100.0% Build index
00:04 360Mb 0.0% Initialize taxonomy data^M00:05 364Mb 17.6% Initialize taxonomy data^M00:06 373Mb 75.9% Initialize taxonomy data
^M00:06 377Mb 100.0% Initialize taxonomy data
00:06 377Mb 0.0% Building name table ^M00:06 378Mb 100.0% Building name table
00:06 378Mb 41947 names, tax levels min 2, avg 6.6, max 7
WARNING: 1247 taxonomy nodes have >1 parent
00:06 491Mb 0.1% Word stats^M00:07 491Mb 33.8% Word stats^M00:07 491Mb 100.0% Word stats
00:07 491Mb 100.0% Alloc rows
00:07 740Mb 0.1% Build index^M00:08 740Mb 7.0% Build index^M00:09 740Mb
......
27.6% Distance matrix/usort^M33:02 4.3Gb 27.6% Distance matrix/usort
---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required
I guess this is related to the 32bit version of usearch being limited to 4 Gb, unfortunately, 64bit version is expensive. My system has 64 Gb.
Is there any way to create this database using vsearch?
Is it possible to split the fasta, loop in the database creation then join files at the end?
Any suggestions would be good.
Best,
from amptk.
UTAX is specific to usearch. The memory limit doesn’t seem to always be consistent, technically it should be 4 GB but that doesn’t mean it can work on files equal to 4 GB. Anyway the current hybrid method will use the entire database with vsearch for global alignment. So you can randomly sub sample the input data to generate the UTAX and sintax databases whole keeping the whole thing for the usearch database.
from amptk.
I am trying to make the SILVA LSU fasta file into a new LSU database in amptk. I think I have the taxonomy fixed (partially using the script from theo-allnutt-bioinformatics above), but now when I try to create the database, amptk gives me errors. In the command line it says
[Jan 31 03:57 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 736, in <module>
main()
File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 727, in main
mod.main(arguments)
File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 623, in main
makeDB(OutName, args=args)
File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 418, in makeDB
amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment
in the log file it says
usearch9(16829,0xa98521c0) malloc: *** mach_vm_map(size=2963820544) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
usearch9 -makeudb_sintax /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.extracted.fa -output /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.udb -notrunclabels
---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required
I tried Googling malloc_error_break and the results I found suggested that the script needs to be debugged to see where the error is occurring and then insert the breakpoint. I do not know how to do this, so do you have any suggestions?
from amptk.
Malloc errors are memory related, so the free version of usearch is 32 but and thus the file you are trying to make a database out of is too large and it errors. So need to reduce the size to build the sintax and utax databases. The global alignment or usearch uses vsearch which doesn’t have the memory limit. Look at the readthedocs manual on how I subsampled the data for other databases to build sintax and utax.
from amptk.
Going back to the Green Genes question, I have made an AMPtk formatted Green Genes database (too big to upload here, so I've put it here (gg_13_5_replace), which is also where the 28S database I made (RDP_SILVA_LSU_update) is stored). I added a few endosymbiont sequences (for the specific project I'm working on) and some fungal 18S sequences because the primers we've been using sometimes amplify fungal sequences.
I was able to successfully install it into AMPtk.
from amptk.
Related Issues (20)
- Issue installing AMPtk (Mac OS - M1 chip) HOT 2
- getting NoneType vs int error in clustering step
- Error when run quick start HOT 7
- usearch9 not found when generate UTAX database
- VSEARCH error on amptk -filter step
- Support Python 3.8 onwards HOT 3
- SyntaxError in "duplicate ID in mapping file: XXX, exiting"
- Default for -p, --index_bleed documented as 0.005 HOT 1
- Typo "Bjerkandara adusta" --> "Bjerkandera adusta" HOT 1
- Missing species names in amptk_mock1.fa HOT 3
- Missing final new line in amptk_mock1.fa and amptk_synmock.fa HOT 2
- Inconsistent primer trimming sequence in amptk_mock*.fa HOT 5
- Matching MockA, MockB1 and MockB2 to FASTQ filenames HOT 2
- platform.linux_distribution is removed since Python 3.8 HOT 1
- Species names in amptk_mock2.fa and amptk_mock3.fa vs Figure 4
- new users cannot install amptk properly, please help HOT 3
- unoise3 clustering HOT 5
- Problem with TypeError during AMPtk cluster HOT 11
- Saw you started some prelim ONT methods HOT 2
- Problematic unoise3 implementation with VSEARCH HOT 13
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from amptk.