
amptk's Introduction


AMPtk

NGS AMPlicon Tool Kit:

Documentation: http://amptk.readthedocs.io/

Citation

Palmer JM, Jusino MA, Banik MT, Lindner DL. 2018. Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data. PeerJ 6:e4925; DOI 10.7717/peerj.4925.

https://peerj.com/articles/4925/

amptk's People

Contributors

druvus, hyphaltip, jayvolr, nextgenusfs, strund3r, vmikk


amptk's Issues

amptk-extract_region.py - 'newtaxLen' is not defined

Hello Jon! Sorry to bother you again.
I tried to manually update the database with amptk database (with v.0.10.2) and got the following error during the dereplication step:

Now dereplicating sequences (remove if sequence and header identical)
Traceback (most recent call last):
  File "/home/mik/amptk/bin/amptk-extract_region.py", line 471, in <module>
    dereplicate(derep_tmp, OutName)
  File "/home/mik/amptk/bin/amptk-extract_region.py", line 97, in dereplicate
    if newtaxLen > oldTaxLen:
NameError: global name 'newtaxLen' is not defined
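
For reference, the comparison that fails is only an initialization/casing slip: the dereplication step keeps whichever record carries the longer taxonomy string for an identical sequence. A minimal sketch of that logic (illustrative only, with hypothetical names, not the shipped code):

    def dereplicate(records):
        """Keep one record per identical sequence, preferring the longer taxonomy string."""
        best = {}  # sequence -> (header, taxonomy)
        for header, taxonomy, sequence in records:
            if sequence not in best:
                best[sequence] = (header, taxonomy)
            else:
                oldTaxLen = len(best[sequence][1])
                newTaxLen = len(taxonomy)  # consistent casing avoids the NameError above
                if newTaxLen > oldTaxLen:
                    best[sequence] = (header, taxonomy)
        return best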

usearch9 multithread fastq_filter is crashing

Per this problem with usearch 9 http://www.drive5.com/usearch/manual/malloc_bug_v9.html

I edited the ufits code to add the '-threads' option and requested 1 CPU, and the command then succeeded. This step should probably just be forced to run single-threaded all the time, rather than forcing the whole pipeline to cpu=1.

51c51,54
<     cmd = [usearch, '-fastq_filter', R1, '-fastq_trunclen', str(read_length), '-fastqout', pretrim_R1]
---
>     threads = 1
>     cmd = [usearch, '-fastq_filter', R1, '-fastq_trunclen', str(read_length), '-fastqout', pretrim_R1, '-threads', str(threads)]
53c56
<     cmd = [usearch, '-fastq_filter', R2, '-fastq_trunclen', str(read_length), '-fastqout', pretrim_R2]
---
>     cmd = [usearch, '-fastq_filter', R2, '-fastq_trunclen', str(read_length), '-fastqout', pretrim_R2, '-threads', str(threads)]

AMPtk filter not finding usearch

Hi there,
I have just gotten AMPtk and its dependencies installed, but it seems not to be finding usearch9 when calling the filter function.

Trying to run the filter command returns the error:

usearch9 not found in your PATH, exiting.

I have installed USEARCH manually and created a soft link as specified.

I verified that "usearch9" exists in my "/usr/local/bin" directory, and that it links correctly to the actual "usearch9.2.64_i86osx32" file.

Any suggestions on where things might be misspecified and how I can sort this out?
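
One quick way to check what Python itself sees (a debugging sketch, not part of AMPtk) is to query the PATH and the executable bit directly; if this finds the symlink, the usual remaining suspects are a different PATH inside the environment AMPtk runs in, or a binary that is not executable:

    import os
    import shutil

    exe = shutil.which('usearch9')          # Python 3; returns None if not on the PATH
    print('found:', exe)
    if exe:
        print('executable:', os.access(exe, os.X_OK))
    print('PATH:', os.environ.get('PATH'))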

Thanks so much in advance!
Max

Hybrid taxonomy abbreviations

Hello Jon!
This is not really an issue, but a question.

After taxonomy assignment with hybrid approach (USEARCH, SINTAX, UTAX) using amptk < v.0.9.0 resulting taxonomy is self-explanatory and looks like:

ID;k:Fungi,…         # where ID is an accession number of the best hit.
UTAX;k:Fungi,…
SINTAX;k:Fungi,…

In amptk >= v.0.9.0 the resulting taxonomy looks like "XX|PercentIdentity|ID;k:Fungi…", where XX can be GD, GS, SS, US. Unfortunately, I cannot find a description of these abbreviations in the documentation. If I understand correctly:

  • GS means taxonomy based on global alignment hits (USEARCH)
  • SS means SINTAX-based taxonomy
  • US means UTAX-based taxonomy

But what does GD mean? Is it used when the methods give conflicting results and the resulting taxonomy is LCA-based?

ufits taxonomy: make OTU table optional

Hello Jon!
Thanks a lot for UFITS, I find it very useful and user-friendly!
I would like to propose making the input OTU table (-i or --otu_table flags) optional in ufits taxonomy. The hybrid approach for taxonomic assignment is very handy by itself, so it would be nice if users were able to perform taxonomic annotation of any FASTA file with a single command, e.g.:

ufits taxonomy -f tst.fasta -d ITS2 --method hybrid

Right now it almost works, but it will not produce the “hybrid” of usearch (global alignment) and UTAX results without an OTU table.

With best regards,
Vladimir

Mock Communities

Running the following command:

ufits filter -i out.otu_table.txt -f out.cluster.otus.fa -b MADMOCK --mc MADMOCK.fq

It does not seem to be finding any of the mock community members.

[01:56:22 PM]: OS: linux2, 2 cores, ~ 4 GB RAM. Python: 2.7.12
[01:56:23 PM]: ufits v.0.7.4, USEARCH v9.2.64, VSEARCH v2.3.4
[01:56:23 PM]: Loading OTU table: out.otu_table.txt
[01:56:23 PM]: OTU table contains 6749 OTUs
[01:56:23 PM]: Mapping OTUs to Mock Community (USEARCH)
[01:57:30 PM]: Mock members not found: M00223:13:000000000-AM23G:1:2101:12547:14605, M00223:13:000000000-AM23G:1:1113:5080:11972, M00223:13:000000000-AM23G:1:2110:11421:22117, M00223:13:000000000-AM23G:1:1102:15523:10975, M00223:13:000000000-AM23G:1:1105:17595:13435,


etc, etc. etc.


M00223:13:000000000-AM23G:1:2113:9217:4330
[01:57:30 PM]: Sorting OTU table naturally
[01:57:31 PM]: Removing OTUs according to --min_reads_otu: (OTUs with less than 2 reads from all samples)
[01:57:31 PM]: Normalizing OTU table to number of reads per sample
[01:57:31 PM]: Index bleed, samples into mock: 2.820267%.
[01:57:31 PM]: Will use value of 2.900000% for index-bleed OTU filtering.
[01:57:32 PM]: Filtering OTU table down to 6730 OTUs
[01:57:32 PM]: Filtering valid OTUs

OTU Table filtering finished

OTU Table Stats: out.stats.txt
Sorted OTU table: out.sorted.txt
Final filtered: out.final.txt
Final binary: out.final.binary.txt
Filtered OTUs: out.filtered.otus.fa

Building new Arthropod database: novice troubles

Hi Jon: We are still really enjoying AMPtk for all of our bacterial, fungal and insect needs. I am trying to build a new COI database because we submitted 5000 local samples to BOLD and are keen to capture them. I am following your instructions from here (https://amptk.readthedocs.io/en/latest/taxonomy.html#taxonomy-databases) and got through #reformat taxonomy, #combine datasets and #generate global alignment database using bold2utax.py and amptk database - no problems!

Thanks to my ignorance, I am failing on the second script, bold2amptk.py. I downloaded the script and put it into:

/usr/local/Cellar/amptk/1.2.4/libexec/bin/

Called it from the directory where I have arthropoda.bold.bins.fa (I'm not interested in Chordates at the moment):

$ /usr/local/Cellar/amptk/1.2.4/libexec/bin/bold2amptk.py -i arthropoda.bold.bins.fa -o arthropods

And got hit with this:

Loading 410,649 sequence records
Searching for forward primer: GGTCAACAAATCATAAAGATATTGG, and reverse primer: GGSACSGGSTGAACSGTSTAYCCYCC
Requiring reverse primer match with at least 4 mismatches
Traceback (most recent call last):
File "/usr/local/Cellar/amptk/1.2.4/libexec/bin/bold2amptk.py", line 101, in
ForCutPos = amptklib.findFwdPrimer(ForPrimer, Seq, args.primer_mismatch, amptklib.degenNucSimple)
File "/usr/local/Cellar/amptk/1.2.4/libexec/lib/amptklib.py", line 1097, in findFwdPrimer
align = edlib.align(primer, sequence, mode="HW", k=mismatch, additionalEqualities=equalities)
File "edlib.pyx", line 40, in edlib.align (edlib.bycython.cpp:1214)
TypeError: Expected bytes, got newbytes
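
For what it's worth, the message suggests a python-future "newbytes" object is reaching edlib, which only accepts native str/bytes. One hedged workaround (an assumption, not a confirmed AMPtk fix) is to coerce both arguments to plain Python strings before the call, reusing the names from the traceback above:

    import edlib

    # coerce primer and sequence to native str so edlib's type check passes
    primer = str(ForPrimer)
    sequence = str(Seq)
    align = edlib.align(primer, sequence, mode="HW", k=args.primer_mismatch,
                        additionalEqualities=amptklib.degenNucSimple)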

I'm running on OS X EL Capitan 10.11.6, with 32 GB 1867 MHz DDR3, 4 GHz Intel Core i7.
AMPtk version: 1.2.4-af8f8f1
Python 2.7.14

Thanks in advance,
Jon

UnboundLocalError: local variable 'hit' referenced before assignment

OS: debian buster/sid, 8 cores, ~ 16 GB RAM. Python: 3.7.3
AMPtk v1.4.0, USEARCH v10.0.240, VSEARCH v2.13.4
Loading FASTA Records
8,203 OTUs
Global alignment OTUs with usearch_global (VSEARCH) against ITS.udb
Classifying OTUs with UTAX (USEARCH)
Classifying OTUs with SINTAX (USEARCH)
UTAX results empty
SINTAX results empty
Traceback (most recent call last):
File "/amptk/bin/amptk", line 735, in
main()
File "/amptk/bin/amptk", line 726, in main
mod.main(arguments)
File "/amptk/lib/python3.7/site-packages/amptk/assign_taxonomy.py", line 320, in main
bestClassify = amptklib.bestclassifier(utaxDict, sintaxDict, otuList)
File "/amptk/lib/python3.7/site-packages/amptk/amptklib.py", line 1494, in bestclassifier
BestClassify[otu] = hit
UnboundLocalError: local variable 'hit' referenced before assignment

running:
amptk taxonomy -f .cluster.filtered.otus.fa -i .cluster.final.txt -m mapping_file.txt -d ITS2 -u usearch10
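
The pattern behind the crash: when both the UTAX and SINTAX results are empty, the loop that picks the better classifier never assigns hit before it is used. A defensive sketch of that pattern (illustrative only, not the actual amptklib.bestclassifier), assuming each classifier dict maps OTU -> (score, taxonomy) as in classifier2dict:

    def bestclassifier(utaxDict, sintaxDict, otuList):
        best = {}
        for otu in otuList:
            hit = None                                  # default so 'hit' always exists
            utax = utaxDict.get(otu)
            sintax = sintaxDict.get(otu)
            if utax and sintax:
                hit = max(utax, sintax, key=lambda x: x[0])  # keep the higher-scoring call
            else:
                hit = utax or sintax
            if hit is not None:                         # skip OTUs with no classification at all
                best[otu] = hit
        return best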

UFITS Summarize

A couple of issues I ran into with ufits summarize:

srussell@ubuntu:~/Data/MillerWoods/UFITS$ ufits summarize -i out.otu_table.taxonomy.txt -o data-summary --graphs --percent --format png
Traceback (most recent call last):
  File "/home/srussell/ufits/bin/ufits-summarize_taxonomy.py", line 250, in <module>
    processTax(uniqK, Lk, 'kingdom')
  File "/home/srussell/ufits/bin/ufits-summarize_taxonomy.py", line 163, in processTax
    fig = plt.figure()
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 535, in figure
    **kwargs)
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 84, in new_figure_manager
    return new_figure_manager_given_figure(num, figure)
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 92, in new_figure_manager_given_figure
    window = Tk.Tk()
  File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1818, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: no display name and no $DISPLAY environment variable
soussell@ubuntu:~/Data/MillerWoods/UFITS$ ufits summarize -i out.otu_table.taxonomy.txt -o data-summary --graphs
Traceback (most recent call last):
  File "/home/srussell/ufits/bin/ufits-summarize_taxonomy.py", line 250, in <module>
    processTax(uniqK, Lk, 'kingdom')
  File "/home/srussell/ufits/bin/ufits-summarize_taxonomy.py", line 163, in processTax
    fig = plt.figure()
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 535, in figure
    **kwargs)
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 84, in new_figure_manager
    return new_figure_manager_given_figure(num, figure)
  File "/home/srussell/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 92, in new_figure_manager_given_figure
    window = Tk.Tk()
  File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1818, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: no display name and no $DISPLAY environment variable
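
This is the standard headless-matplotlib failure: the default TkAgg backend needs an X display, which a remote/SSH session without X forwarding does not have. A minimal sketch of the usual fix is to select a non-interactive backend before pyplot is imported:

    import matplotlib
    matplotlib.use('Agg')                 # file-only backend, no $DISPLAY required
    import matplotlib.pyplot as plt

    fig = plt.figure()
    # ... draw the taxonomy bar chart here ...
    fig.savefig('kingdom.png')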

Results naming for DADA2 and UNOISE2

Hello Jon!
First of all, thanks for the great software!

I have tested DADA2 pipeline with amptk and tried to compare it with UNOISE2 pipeline.
After taxonomic annotation I found "No Hit" in the resulting OTU table from the DADA2 pipeline.
Then I realized that this is because of a different naming scheme of the results:

DADA2 (amptk dada2 ...)
test.otu_table.txt - table with iSeqs
test.cluster.otu_table.txt - table with OTUs (clustered iSeqs)

UNOISE2 (amptk unoise2 ...)
test.iSeq.otu_table.txt - table with iSeqs
test.otu_table.txt - table with OTUs (clustered iSeqs)

Maybe it would be better to make the filenames for the results of the DADA2 pipeline consistent with the UNOISE2 pipeline?
I mean that test.otu_table.txt (which is expected as the -i argument in amptk taxonomy) should correspond to OTUs for both methods.

With best regards,
Vladimir

ufits cluster

Just wanted to leave some notes on something I see for potential commentary.

Processed the same Illumina data, with the only difference being the --rescue_forward command.

Trial 1:
illumina -i rawdata -o processed --require_primer off --read_length 250
Trial 2
ufits illumina -i rawdata -o process --read_length 250 --rescue_forward --require_primer off

Relevant sections from ufits cluster:
Trial1:

[09:16:51 AM]: Loading FASTQ Records
[09:16:53 AM]: 1,745,038 reads (884.1 MB)
[09:16:53 AM]: Quality Filtering, expected errors < 1.0
[09:20:08 AM]: 1,723,084 reads passed
[09:20:08 AM]: De-replication (remove duplicate reads)
[09:20:30 AM]: 131,868 reads passed

Trial 2:
Fewer reads pass QC and none pass dereplication.

[02:14:02 PM]: Loading FASTQ Records
[02:14:09 PM]: 3,265,311 reads (1.6 GB)
[02:14:09 PM]: Quality Filtering, expected errors < 1.0
[02:19:39 PM]: 1,119,483 reads passed
[02:19:39 PM]: De-replication (remove duplicate reads)
[02:19:50 PM]: 0 reads passed
[02:19:50 PM]: Sorting reads by size: removing reads seen less than 2 times
[02:19:50 PM]: Clustering OTUs (UPARSE)
Traceback (most recent call last):
  File "/home/srussell/ufits/bin/ufits-OTU_cluster.py", line 147, in <module>
    ufitslib.log.info('{0:,}'.format(numOTUs) + ' OTUs, '+ '{0:,}'.format(numchimeras) + ' de-novo chimeras')
NameError: name 'numOTUs' is not defined

It may be some time before I can work to diagnose, so I just wanted to drop a line to see if there was an easy explanation.

amptk filter - structure of otutable.demux.fq for new OTU table

Hi Jon,

I would like to use amptk filter to generate a 'cleaned' OTU table on an existing OTU table that I made using a different pipeline. I can use the amptk filter command fine, but then need to make a new OTU table with the normalized and cleaned output using amptk remove and subsequently vsearch --usearch_global.

This requires that I have a ".demux.fq" file that would normally be generated in earlier steps of the amptk pipeline. The sequences that I got back from the sequencing center are not yet demultiplexed at the sample level due to the way that we prepared our constructs, and cannot be plugged directly into the amptk pipeline. I am hoping that I can generate a ".demux.fq" file independently from intermediate files in my pipeline, but need to know what this file consists of. Can I just concatenate sample-level fastq files that have been quality filtered and trimmed to generate this?
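
As a rough illustration of what that could look like, here is a sketch that concatenates per-sample FASTQ files and tags each read header with its sample name; the exact label convention AMPtk expects (e.g. 'barcodelabel=Sample;') is an assumption that should be checked against a real .demux.fq produced by AMPtk:

    from Bio import SeqIO   # biopython, already an AMPtk dependency

    # hypothetical per-sample, quality-filtered/trimmed FASTQ files
    samples = {'SampleA': 'SampleA.trimmed.fq', 'SampleB': 'SampleB.trimmed.fq'}

    with open('combined.demux.fq', 'w') as out:
        for sample, fq in samples.items():
            for rec in SeqIO.parse(fq, 'fastq'):
                rec.id = '{};barcodelabel={};'.format(rec.id, sample)  # assumed header format
                rec.description = ''
                SeqIO.write(rec, out, 'fastq')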

Thanks,
Lauren

CSV vs TSV OTU tables in ufits taxonomy command

If a user inputs an OTU table whose filename ends in '.csv' but the table is not comma-separated, the taxonomy will not be correctly appended to the final output. Vice versa is also true: if the file ends with '.txt' but is not tab-delimited, you get the same error.

The fix will be to determine the delimiter automatically instead of relying on the filename.
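
A sketch of automatic delimiter detection with the standard library (one possible approach, not necessarily what was implemented):

    import csv

    def guess_delimiter(otu_table):
        with open(otu_table) as f:
            first_line = f.readline()
        try:
            return csv.Sniffer().sniff(first_line, delimiters=',\t').delimiter
        except csv.Error:
            return '\t'   # fall back to tab if the sniffer cannot decide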

amptk filter --subtract functionality

Hey Jon,

I've been able to run amptk filter with a number of different flags and I'm to the point now where I have a sense of what index bleed calculation I'd like to apply. In addition, I'd like to apply a specific --subtract value.

In essence, this:

amptk filter \
-i ../dropd.cluster.otu_table.txt \
-f ../dropd.cluster.otus.fa \
--delimiter csv \
--index_bleed 0.01 \
--threshold max \
-o allfilt \
--normalize n \
-s 30

Please don't yell at me for not using the normalization feature; I have a pretty good reason in this case (won't get into that though)...

That code works just fine and generates all the expected output files, except it doesn't remove the mock community (because I haven't passed the -b flag). That's a minor detail, given that I can just drop that column after the fact. However, you've written functionality to do that... so using the same script above, but adding in the -b option (along with the --mc option)...

amptk filter \
-i ../dropd.cluster.otu_table.txt \
-f ../dropd.cluster.otus.fa \
--delimiter csv \
--index_bleed 0.01 \
--threshold max \
-o allfilt \
--normalize n \
-s 30 \
-b mockIM4 \
--mc /path/to/mock.fa 

Generates this error message:

Traceback (most recent call last):
  File "/mnt/lustre/macmaneslab/devon/pkgs/amptk-1.0.3/bin/amptk-filter.py", line 484, in <module>
    mocks = final[args.mock_barcode]
  File "/mnt/lustre/software/linuxbrew/colsa/Cellar/python/2.7.14/lib/python2.7/site-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/mnt/lustre/software/linuxbrew/colsa/Cellar/python/2.7.14/lib/python2.7/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/mnt/lustre/software/linuxbrew/colsa/Cellar/python/2.7.14/lib/python2.7/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/mnt/lustre/software/linuxbrew/colsa/Cellar/python/2.7.14/lib/python2.7/site-packages/pandas/core/internals.py", line 3838, in get
    loc = self.items.get_loc(item)
  File "/mnt/lustre/software/linuxbrew/colsa/Cellar/python/2.7.14/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2524, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'mockIM4p82redo'

What's curious is that it still generates the expected output files, but the mock community (mockIM4p82redo) appears in the dataset.
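
The KeyError suggests that the value passed with -b has to match a sample column of the OTU table exactly (here the column is mockIM4p82redo, not mockIM4). A small defensive sketch of that lookup (hypothetical filename, not the amptk-filter code):

    import pandas as pd

    final = pd.read_csv('dropd.cluster.otu_table.txt', sep=',', index_col=0)
    mock_barcode = 'mockIM4'
    if mock_barcode not in final.columns:
        raise SystemExit('Mock sample "{}" not found; available samples: {}'.format(
            mock_barcode, ', '.join(final.columns)))
    mocks = final[mock_barcode]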

Any thoughts?

Thanks!

edlib - invalid ELF header

Hello Jon!
I tried to update AMPtk to the most recent version (v0.10.2) and received the following error:

ImportError: /home/mik/amptk/lib/edlib.so: invalid ELF header

OS - Linux Mint (Ubuntu-based)
edlib v.1.2.0 is installed via pip (pip install -U edlib)

Could you please tell me how to resolve this problem?

RDP run ends with crash

On Version 0.6.1

ufits taxonomy -f dada2.cluster.filtered.otus.fa -o RDP_output -i dada2.cluster.final.txt --method rdp --rdp_db fungalits_unite --rdp [PATH]/pkgs/RDPTools/2.0.2/classifier.jar

Produces this output

OTUs with taxonomy: RDP_output.otus.taxonomy.fa
OTU phylogeny: RDP_output.tree.phy
Traceback (most recent call last):
  File "[PATH]/pkgs/ufits/0.6.1/bin/ufits-assign_taxonomy.py", line 456, in <module>
    for i in [utax_out, usearch_out, sintax_out, qiimeTax, tmpTable]:
NameError: name 'utax_out' is not defined

feature request for 'amptk filter'

Hi Jon,
I ultimately end up doing this in R with the OTU table output from amptk, but it's probably something that could be of use to others as well.
While you have the option with amptk filter to (by default) filter out any OTUs with just a single read (via --min_reads_otu), I'd like to have the same principle applied to OTUs which are present in just a single sample. Would it be possible to throw in a similar flag, something like --min_samples_otu? (A rough sketch of what I mean follows below.)
I'm wondering how many other users also discard OTUs present in just a single sample?
Thanks for your consideration,
Devon
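
For illustration, a minimal pandas sketch of the proposed filter (--min_samples_otu is a hypothetical flag and the filenames are placeholders):

    import pandas as pd

    otu = pd.read_csv('otu_table.txt', sep='\t', index_col=0)   # rows = OTUs, columns = samples
    min_samples_otu = 2                                         # proposed threshold
    present_in = (otu > 0).sum(axis=1)                          # number of samples with reads per OTU
    otu[present_in >= min_samples_otu].to_csv('otu_table.min_samples.txt', sep='\t')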

amptk taxonomy help menu for `-i` argument

Pretty sure the mixup is on line 31 of the amptk python script...

When you print the general help menu with amptk taxonomy, the input flags are described as either -i or --otu_table.

Usage:       amptk taxonomy <arguments>
version:     1.0.3

Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional).  
             By default the script uses a hybrid approach, e.g. gets taxonomy information from 
             SINTAX, UTAX, and global alignment hits from the larger UNITE-INSD database, and 
             then parses results to extract the most taxonomy information that it can at 'trustable' 
             levels. SINTAX/UTAX results are used if BLAST-like search pct identity is less than 97%.  
             If % identity is greater than 97%, the result with most taxonomy levels is retained.
    
Arguments:   -f, --fasta         Input FASTA file (i.e. OTUs from amptk cluster) (Required)
             -i, --otu_table     Input OTU table file (i.e. otu_table from amptk cluster)
             -o, --out           Base name for output file. Default: amptk-taxonomy.<method>.txt

... (more fun things) ...

However, it looks like the two potential arguments you could pass would be either -i or --input.

parser.add_argument('-i', '--input', dest="otu_table", help='Append Taxonomy to OTU table')

I kept getting an error by passing --otu_table, but the script ran fine with -i.
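
If the intent is for the help text and the accepted flag to agree, argparse allows several option strings on one argument; a one-line sketch of a possible fix:

    parser.add_argument('-i', '--input', '--otu_table', dest='otu_table',
                        help='Append Taxonomy to OTU table')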

16s Illumina

Hello Mr. Palmer,
I have used amptk to process ITS fungal sequences with great success; however, now I am trying to use it on bacterial V3-V4 16S data. I could not find in the documentation how to properly run the illumina command; the doc contains:
#simple folder of PE MiSeq data
"amptk illumina -i miseq_folder/ -o mydata -f ITS1-F -r ITS2"
I cannot figure out how to properly tell amptk which primers I have used. Does this command check against an indexed primer sequence database?

Thanks in advance.
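
For the record, the -f/-r flags take the primers used to generate the amplicons. If your V3-V4 primer names are not among the primer names AMPtk knows (its bundled 16S database references 515FB/806RB, which target the V4 region), passing the literal primer sequences should also work (an assumption worth verifying against the AMPtk documentation), e.g.:

    amptk illumina -i miseq_folder/ -o mydata -f YOUR_FWD_PRIMER_SEQ -r YOUR_REV_PRIMER_SEQ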

OTU table normalization

Should the counts in the OTU table be normalized to the number of reads in each sample? This would improve the functionality of the --index_bleed filter.
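
For illustration, a pandas sketch of one way such a per-sample normalization could be expressed (relative abundance; hypothetical filenames):

    import pandas as pd

    otu = pd.read_csv('otu_table.txt', sep='\t', index_col=0)   # rows = OTUs, columns = samples
    normalized = otu.div(otu.sum(axis=0), axis=1)               # each sample column now sums to 1.0
    normalized.to_csv('otu_table.normalized.txt', sep='\t')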

a little quirk (bug?) due to formatting in `biom convert`

Sometimes I have use for .csv format, other times .tsv. Often when I'm working in AMPtk I use .csv because I find it easier to filter some things manually.
What I noticed is that if you set the output to .csv format in the amptk filter step, then run amptk taxonomy with that .csv file, you don't generate a .biom file.
It fails, apparently, because the biom convert step doesn't recognize a comma-separated file! I converted it to a tab-delimited file and it worked just fine.
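
Until the delimiter handling is automatic, a quick conversion sketch (assuming a plain comma-separated OTU table; filenames are placeholders):

    import pandas as pd

    table = pd.read_csv('final.otu_table.csv', sep=',', index_col=0)
    table.to_csv('final.otu_table.txt', sep='\t')   # tab-delimited copy for biom convert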

Add way to filter non-fungal OTUs

Need some way to remove non-target OTUs; some primer sets pick up plants, protists, protozoa, etc. One option would be to include more eukaryotes in the UTAX training set, which would hopefully classify them as non-fungal and make them easy and fast to remove.....otherwise BLAST against NCBI? (very slow and kind of dumb).

Illumina: Set Merge Options

Hello
I'm testing amptk on some ITS data.
The "amptk illumina" log shows some values like:
#########################################
Merging reads 100%
107030 Pairs
8000 Merged (7.5%)
99030 Not merged (92.5%)

Pairs that failed merging due to various reasons:
1008 too few kmers found on same diagonal
389 multiple potential alignments
27692 too many differences
69941 alignment score too low, or score drop too high
############################################

Across all the reads, I got around 60-95% not merged.
1st question:
Is this right? Because after concatenating, the number of valid output reads looks as if most of the pairs were merged (i.e. there are ~25 million merged reads from 50 million raw paired-end reads); is this the "--rescue_forward" step?

2nd question:
Is there any option I can set to control the merging (overlap/differences)? The vsearch manual states that the default parameters are a minimum overlap of 10 bases and a maximum of 10 base differences (if I understood it correctly).
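
If it helps, vsearch's merging step does expose overlap/mismatch controls, e.g. --fastq_minovlen and --fastq_maxdiffs for --fastq_mergepairs; whether amptk illumina passes these through is something to confirm against its options. A standalone example would be:

    vsearch --fastq_mergepairs R1.fq --reverse R2.fq --fastq_minovlen 20 --fastq_maxdiffs 10 --fastqout merged.fq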

Thanks!

Does the 'amptk filter' remove non-representative OTUs ?

Hi Jon,

Does the --min_reads_otu parameter of 'amptk filter' exclude OTUs within samples whose frequency is below the specified threshold? I've run 'amptk filter' with the default parameter, but OTUs with a frequency of 1 are still present in my OTU table.
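
For context, the filter log shown elsewhere in this tracker describes --min_reads_otu as removing "OTUs with less than 2 reads from all samples", i.e. it filters on the total across all samples, not on per-sample counts, so single-read entries within individual samples can remain. A pandas sketch of the difference (hypothetical filename):

    import pandas as pd

    otu = pd.read_csv('otu_table.txt', sep='\t', index_col=0)   # rows = OTUs, columns = samples

    # what the --min_reads_otu wording describes: drop OTUs whose total across samples is < 2
    kept = otu[otu.sum(axis=1) >= 2]

    # what the question describes: zero out per-sample counts below a threshold
    per_sample = otu.where(otu >= 2, 0)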

centroid= not found in taxonomy dictionary

I am having trouble following the documentation to create a BOLD database. I am using AMPtk-1.3.0 installed with pip from PyPI (I am having conda problems, so can't install with conda right now) and several VSEARCH versions (tried 2.2.0, 2.4.4, 2.8.0 and 2.10.4). The first step of the tutorial fails:

bold2utax.py -i Chordata_bold_data.txt -o chordata.bold.bins.fa

It prints:

410469 total records processed
220491 non COI records dropped
8736 records without a BIN dropped
174073 records written to BINs
Now looping through BINs and clustering with VSEARCH @ 99%
Updating taxonomy

But then it outputs thousands of lines with centroid=BOLD:XXXYYYYY not found in taxonomy dictionary, and finishes with:

centroid=BOLD:ADJ0552 not found in taxonomy dictionary
Wrote 27791 consensus seqs for each BIN to chordata.bold.bins.fa

And an empty chordata.bold.bins.fa file. Looking at the bold2utax.py code, I saw at line 157:

record.id = record.id.replace('consensus=', '')

This does not match the error message, nor the comment at line 153:

#finally loop through centroids and get taxonomy from dictionary

The script runs successfully if I replace line 157 above with:

record.id = record.id.replace('centroid=', '')
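
A version-tolerant sketch (not the shipped code) would strip whichever label prefix the clustering output used:

    # handle both consensus labels seen across usearch/vsearch versions
    for prefix in ('consensus=', 'centroid='):
        if record.id.startswith(prefix):
            record.id = record.id[len(prefix):]
            break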

Is this an issue with vsearch versions?

ufits ion reverse primer not being trimmed

Reverse primer not properly being reverse-complemented during de-multiplexing. This bug seems to have been introduced when converting the code to utilize multi-processing.

Appending Taxonomy Error

Greetings,

I am getting an error in the final taxonomy step. I have tried it a couple different ways with the same result. Let me know what you would need to diagnose.

[Sep 24 09:11 PM]: Loading FASTA Records
[Sep 24 09:11 PM]: 562 OTUs
[Sep 24 09:11 PM]: Global alignment OTUs with usearch_global (USEARC$
[Sep 24 09:11 PM]: Classifying OTUs with UTAX (USEARCH)
[Sep 24 09:11 PM]: Classifying OTUs with SINTAX (USEARCH)
[Sep 24 09:11 PM]: Appending taxonomy to OTU table and OTUs
Traceback (most recent call last):
File "/depot/bioinfo/apps/apps/amptk-1.1.2/bin/amptk-assign_taxonomy.py", lin$
tax = otuDict.get(line[0]) or "No Hit"
NameError: name 'otuDict' is not defined

combining samples based on metadata??

I've been asked by a user to look into an option to combine samples based on metadata (i.e. treatment effects). Is this something others are interested in? Not sure if this is a general enough request to build into UFITS.
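
If it helps gauge the scope, a pandas sketch of what such an option might do, assuming a mapping file with a Treatment column (all names hypothetical):

    import pandas as pd

    otu = pd.read_csv('otu_table.txt', sep='\t', index_col=0)        # rows = OTUs, columns = samples
    meta = pd.read_csv('mapping_file.txt', sep='\t', index_col=0)    # rows = samples

    groups = meta.loc[otu.columns, 'Treatment']       # treatment for each sample column
    combined = otu.T.groupby(groups).sum().T          # sum read counts within each treatment
    combined.to_csv('otu_table.by_treatment.txt', sep='\t')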

tiny request on database/taxonomy install

It would be great to know when each database was last updated. You mention when the ITS database was last updated in the docs, but no other databases have any specific info.

Would it be possible to run something like amptk taxonomy, and then at the bottom where we see a list of Databases Configured:, include another column with the last update?

Gotta keep you busy Jon...

something funny...

(Currently running amptk-1.1.0)...

Not clear why, but when I use amptk drop with a .gz file, I get an error suggesting it doesn't like files compressed with gzip. I can avoid the error by decompressing the file prior to executing the amptk drop command, and then things proceed without any issue.

Given that I've had weird issues with this .gzip thing lately, it might just be me, but I wondered if anyone else was getting this error?

Example command

amptk drop \
--input my.cluster.otus.fa \
--reads my.demux.fq.gz \
--list OTU526 OTU1181 OTU770 OTU184 OTU1155 \
--out trim_dropdOTUs

Example error

[03/08/18 14:22:57]: OS: linux2, 24 cores, ~ 131 GB RAM. Python: 2.7.14
[03/08/18 14:22:57]: Python Modules: numpy v1.13.3, pandas v0.21.0, matplotlib v2.0.2, psutil v5.4.1, natsort v5.1.1, biopython v1.70, edlib v1.2.1, biom-format NOT installed!
[03/08/18 14:22:57]: Loading 1912 OTUs
[03/08/18 14:22:57]: Dropping 5 OTUs
[03/08/18 14:22:57]: Mapping Reads to OTUs and Building OTU table
[03/08/18 14:22:57]: vsearch --fastq_filter my.demux.fq.gz --fastaout trim_dropdOTUs.reads.tmp --fastq_qmax 55
[03/08/18 14:22:57]: vsearch v2.6.2_linux_x86_64, 125.7GB RAM, 24 cores
https://github.com/torognes/vsearch



Fatal error: Files compressed with gzip are not supported

[03/08/18 14:22:57]: vsearch --usearch_global trim_dropdOTUs.reads.tmp --strand plus --id 0.97 --db trim_dropdOTUs.cleaned.otus.fa --uc trim_dropdOTUs.mapping.uc --otutabout trim_dropdOTUs.cleaned.otu_table.txt
[03/08/18 14:22:57]: vsearch v2.6.2_linux_x86_64, 125.7GB RAM, 24 cores
https://github.com/torognes/vsearch

Reading file trim_dropdOTUs.cleaned.otus.fa 100%
343260 nt in 1907 seqs, min 180, max 180, avg 180
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%


Unable to open file for reading (trim_dropdOTUs.reads.tmp)

[03/08/18 14:22:57]: 1,907 OTUs remaining

error in cluster v0.3.7

Since I've upgraded to version 0.3.7 (from 0.2.8) I get this error using the 'cluster' script; do you think I need to upgrade any dependencies to fix this?

$ ufits cluster -i ITS.rarify.2.fq  --uchime_ref ITS2 --cleanup
-------------------------------------------------------
[03:03:35 PM]: Operating system: linux2, ufits v.0.3.7
[03:03:35 PM]: USEARCH version: usearch v8.1.1861_i86linux32
[03:03:35 PM]: vsearch detected, will use for filtering
[03:03:35 PM]: Loading FASTQ Records
[03:03:36 PM]: 1,390,913 reads (843.8 MB)
[03:03:36 PM]: Quality Filtering, expected errors < 1.0
Traceback (most recent call last):
  File "/home/tvv/Bureau/ufits/bin/ufits-OTU_cluster.py", line 95, in <module>
    subprocess.call(['vsearch', '--fastq_filter', args.FASTQ, '--fastq_maxee', str(args.maxee), '--fastqout', filter_out, '--fastaout', filter_fasta, '--fastq_qmax', '45'], stdout = FNULL, stderr = FNULL)
  File "/usr/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 20] Not a directory

totally not a big issue, but perhaps something that could help?

Hi Jon,

A recent batch of samples consisted of several .fq files which contained zero bytes of data along with lots of good reads for other samples. If I run all of these files (including those with zero bytes of information) through the 'amptk illumina' command the run crashes on the first file with zero bytes with the following message:

Traceback (most recent call last):
File "/home/fosterLab/devonr/bin/ufits-v0.7.2/bin/ufits-process_illumina_folder.py", line 293, in
MergeReads(for_reads, rev_reads, name, read_length)
File "/home/fosterLab/devonr/bin/ufits-v0.7.2/bin/ufits-process_illumina_folder.py", line 74, in MergeReads
pct_out = finalcount / float(origcount)
ZeroDivisionError: float division by zero

If I manually remove these files ahead of time the error is avoided. I was just curious if there would be a way for amptk to simply ignore files that don't contain any information. No big deal, but I thought I'd let you know about this peculiar case (for folks like me who are parsing hundreds of samples at once, there's a chance you simply miss the observation that a sample or two didn't generate any sequence data).
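
A small guard of the kind being suggested (a sketch only, with hypothetical names, not the ufits code) could skip empty inputs before the merge percentage is computed:

    import os

    def count_fastq_records(path):
        with open(path) as f:
            return sum(1 for _ in f) // 4             # 4 lines per FASTQ record

    def safe_merge(for_reads, rev_reads, name):
        # skip samples whose FASTQ files contain no data at all
        if os.path.getsize(for_reads) == 0 or os.path.getsize(rev_reads) == 0:
            print('Skipping {}: empty input file'.format(name))
            return None
        origcount = count_fastq_records(for_reads)
        if origcount == 0:
            print('Skipping {}: zero reads counted'.format(name))
            return None
        return origcount   # downstream code can now divide by origcount safely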

Cheers,

Devon

amptk summarize

amptk summarize appears to have a bug: the output files show that each sample is distributed evenly across all OTUs at all taxonomic levels, which is not reflected in the input OTU table.
I was made aware of this by a different user, then tested on multiple OTU tables including theirs, and my results are consistent. I tried using --percent and had the same issue.

UnboundLocalError: local variable 'hit' referenced before assignment when using '--fasta_db'

Hi,
I'm trying to use my personal db:
amptk taxonomy -f miseq.lulu.otus.fa -i miseq.lulu.otu_table.txt -m miseq.mapping_file.txt --fasta_db silva18S.fa -o miseq

I also tried (like: #55 ):
amptk taxonomy -f miseq.lulu.otus.fa -i miseq.lulu.otu_table.txt -m miseq.mapping_file.txt --fasta_db db_18S/silva18S.fa -o miseq -u usearch9

but I have the same error:
Traceback (most recent call last):
File "/Users/Alessandro/anaconda3/bin/amptk", line 735, in
main()
File "/Users/Alessandro/anaconda3/bin/amptk", line 726, in main
mod.main(arguments)
File "/Users/Alessandro/anaconda3/lib/python3.7/site-packages/amptk/assign_taxonomy.py", line 320, in main
bestClassify = amptklib.bestclassifier(utaxDict, sintaxDict, otuList)
File "/Users/Alessandro/anaconda3/lib/python3.7/site-packages/amptk/amptklib.py", line 1494, in bestclassifier
BestClassify[otu] = hit
UnboundLocalError: local variable 'hit' referenced before assignment

Combining multiple Ion chip runs

Hi Jon,

Enjoying amptk for our fungal, bacterial and insect work on the Ion S5 XL these days. I hope I didn't miss it in the docs, but can you point me in the direction of how to combine multiple Ion chip runs where the same barcodes were used for different samples? Do I simply concatenate the ion.demux.fq.gz files after the 'amptk ion' command? Or can I provide multiple mapping files at this stage and toggle:

--mult_samples Combine multiple chip runs, name prefix for chip

Thanks!
Jon

add SILVA/greengenes DB to 16S amptk taxonomy

The problem is reformatting the taxonomy information into the format needed by UTAX/SINTAX. I've previously looked at SILVA taxonomy -- it appeared to be a hot mess (I'm not a bacteriologist).... So the challenge will be converting the taxonomy strings to the proper format.

example taxonomy strings:

>BOLD:ACI6695;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Coleoptera,f:Elateridae,g:Nipponoelater,s:Nipponoelater babai
>S004604051;tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Hymenochaetaceae,g:Inonotus,s:Sanghuangporus zonatus
>S004127186;tax=k:Fungi,p:Ascomycota
>S004061552;tax=k:Fungi,p:Ascomycota,c:Eurotiomycetes,s:Pyrenula sanguinea
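
A sketch of the kind of conversion needed, assuming a generic 'Kingdom;Phylum;Class;Order;Family;Genus species' input string (real SILVA/greengenes strings will need extra cleanup for sub-ranks and unannotated levels):

    def to_sintax_header(record_id, lineage):
        """Convert 'Fungi;Basidiomycota;...;Genus species' to '>ID;tax=k:...,p:...,s:...'."""
        prefixes = ['k', 'p', 'c', 'o', 'f', 'g', 's']
        parts = [p.strip() for p in lineage.split(';') if p.strip()]
        tax = ','.join('{}:{}'.format(pre, name) for pre, name in zip(prefixes, parts))
        return '>{};tax={}'.format(record_id, tax)

    print(to_sintax_header('S004604051',
                           'Fungi;Basidiomycota;Agaricomycetes;Hymenochaetales;'
                           'Hymenochaetaceae;Inonotus;Sanghuangporus zonatus'))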

Mock Community Question

ufits filter command

one of the lines contained:
[09:47:33 AM]: Mock members not found: mock_10, mock_7, mock_8, mock_9, mock_1, mock_11, mock_12, mock_2, mock_3, mock_4, mock_5, mock_6

Wondering if you could comment on that. Also continuing a few lines down, there was an error:

[09:47:33 AM]: Sorting OTU table naturally
[09:47:33 AM]: Removing OTUs according to --min_reads_otu: (OTUs with less than 2 reads from all samples)
[09:47:33 AM]: Normalizing OTU table to number of reads per sample
/home/srussell/ufits/bin/ufits-filter.py:227: RuntimeWarning: invalid value encountered in double_scalars
bleed1max = bleed1 / float(total)

amptk v1.2.5 taxonomy error

I built a custom 16s database:
bioadmin@bioadmin-B75M-D3V-JP[DB] amptk database -i Extra_rdp_v16.fa -o 16S --format off --create_db utax --skip_trimming --keep_all [11:39 AM]

[11:39:44 AM]: OS: linux2, 8 cores, ~ 8 GB RAM. Python: 2.7.15
[11:39:44 AM]: AMPtk v1.2.5, USEARCH v9.2.64, VSEARCH v2.9.0
[11:39:44 AM]: Working on file: Extra_rdp_v16.fa
[11:39:45 AM]: 13,225 records loaded
[11:39:45 AM]: Using 8 cpus to process data
[11:39:46 AM]: 13,164 records passed (99.54%)
[11:39:46 AM]: Errors: 0 no taxonomy info, 0 length out of range, 61 too many ambiguous bases, 0 no primers found
[11:39:46 AM]: Creating UTAX Database, this may take awhile
[11:49:33 AM]: Database /home/bioadmin/packages/amptk-1.2.5/DB/16S.udb created successfully

But assigning taxonomy doesn't work:
bioadmin@bioadmin-B75M-D3V-JP[Kafue_MG_amptk] amptk taxonomy -f ./$NAME.lulu.otus.fa -i ./$NAME.lulu.otu_table.txt -m ./$NAME.mapping_file.txt -d 16S -o $NAME

[12:07:37 PM]: OS: linux2, 8 cores, ~ 8 GB RAM. Python: 2.7.15
[12:07:37 PM]: AMPtk v1.2.5, USEARCH v9.2.64, VSEARCH v2.9.0
[12:07:37 PM]: Loading FASTA Records
[12:07:37 PM]: 265 OTUs
[12:07:37 PM]: Global alignment OTUs with usearch_global (USEARCH)
[12:07:44 PM]: Classifying OTUs with UTAX (USEARCH)
[12:07:44 PM]: Classifying OTUs with SINTAX (USEARCH)
Traceback (most recent call last):
File "/home/bioadmin/packages/amptk-1.2.5/bin/amptk-assign_taxonomy.py", line 286, in
utaxDict = amptklib.classifier2dict(utax_out, args.utax_cutoff)
File "/home/bioadmin/packages/amptk-1.2.5/lib/amptklib.py", line 1357, in classifier2dict
ClassyDict[ID] = (score, passtax)
UnboundLocalError: local variable 'score' referenced before assignment

Nanopore

Is there anything in the structure of amptk that would prohibit using Nanopore amplicon data as input?

Fragment lengths are ~1500 bp. Input fastq are already adapter and quality trimmed and demultiplexed. Curious about using amptk for some comparisons with the clustering and classification steps.

Sound crazy?

Illumina Processing results in zero sequences

Attempting to run through the pipeline, I get thwarted at the initial step. The output follows:

ufits illumina -i rawdata -o processed

The first two output lines below repeat for all samples:
…..
[09:33:53 PM]: working on sample VI9
[09:33:53 PM]: Measured read length (250 bp) does not equal 300 bp, proceeding with larger value
[09:34:02 PM]: Stripping primers and trim/pad to 250 bp
[09:34:02 PM]: splitting the job over 2 cpus, but this may still take awhile

[09:36:43 PM]: Concatenating Demuxed Files
[09:36:43 PM]: Counting FASTQ Records
[09:36:43 PM]: 0 reads processed
[09:36:43 PM]: Found 0 barcoded samples
Sample: Count
[09:36:43 PM]: Output file: processed.demux.fq (0.0 B)

processed.ufits-process.zip

Amptk summarize ValueError: not enough values to unpack (expected 2, got 1)

Hi,
I'm trying:
amptk summarize -i miseq.otu_table.taxonomy.txt --graphs -o barplot --font_size 6 --format pdf
but something doesn't work.

that's the pipeline:
amptk illumina -i raw/ --require_primer off -o miseq
amptk cluster -i miseq.demux.fq.gz -e 0.25 -m 2 --unoise -o miseq
amptk filter -i miseq.otu_table.txt -f miseq.cluster.otus.fa -o miseq --show_stats
amptk lulu -i miseq.final.txt -f miseq.filtered.otus.fa -o miseq
amptk taxonomy -f miseq.lulu.otus.fa -i miseq.lulu.otu_table.txt -m miseq.mapping_file.txt -d ITS2 -o miseq
amptk summarize -i miseq.otu_table.taxonomy.txt --graphs -o barplot --font_size 6 --format pdf

The same pipeline works with 16S samples (with the 16S db), but with my new 18S samples the last step doesn't work. I checked the file miseq.otu_table.taxonomy.txt and it seems OK!

The error is as follows:
Traceback (most recent call last):
File "/Users/Alessandro/anaconda3/bin/amptk", line 735, in
main()
File "/Users/Alessandro/anaconda3/bin/amptk", line 726, in main
mod.main(arguments)
File "/Users/Alessandro/anaconda3/lib/python3.7/site-packages/amptk/summarize_taxonomy.py", line 153, in main
level, value = w.split(':')
ValueError: not enough values to unpack (expected 2, got 1)
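
The crash points at a piece of the taxonomy string that has no 'level:value' pair (for example a plain "No Hit" entry or a bare name). A tolerant parsing sketch (not the shipped summarize code) would simply skip such pieces:

    tax_string = 'No Hit'                      # hypothetical value that breaks the strict split
    levels = {}
    for w in tax_string.split(','):
        if ':' not in w:                       # skip malformed pieces instead of crashing
            continue
        level, value = w.split(':', 1)
        levels[level] = value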

I also checked the installed databases with amptk info:
Running AMPtk v 1.4.0
Taxonomy Databases Installed: /Users/Alessandro/anaconda3/lib/python3.7/site-packages/amptk/DB
DB_name DB_type FASTA Fwd Primer Rev Primer Records Source Version Date
16S.udb vsearch rdp_v16.fa None None 13118 2019-06-21 None None
16S_SINTAX.udb sintax rdp_v16.fa 515FB 806RB 9679 2019-06-21 None None
16S_UTAX.udb sintax rdp_v16.fa 515FB 806RB 9679 2019-06-21 None None
COI.udb vsearch arth-chord.bold.reformated.fasta LCO1490 mlCOIintR 1617885 BOLD 20190219 2019-02-19
COI_SINTAX.udb sintax arth-chord.bold.BIN-consensus.fasta LCO1490 mlCOIintR 381032 BOLD 20190219 2019-02-19
COI_UTAX.udb utax arth-chord.bold.BIN-consensus.fasta LCO1490 mlCOIintR 20000 BOLD 20190219 2019-02-19
ITS.udb vsearch UNITE_public_all_02.02.2019.fasta ITS1-F ITS4 1130202 UNITE 8.0 2019-02-13
ITS1_UTAX.udb utax sh_general_release_dynamic_all_02.02.2019_dev.... ITS1-F ITS2 59710 UNITE 8.0 2019-02-12
ITS2_UTAX.udb utax sh_general_release_dynamic_all_02.02.2019_dev.... fITS7 ITS4 48956 UNITE 8.0 2019-02-12
ITS_SINTAX.udb sintax sh_general_release_dynamic_s_all_02.02.2019_de... ITS1-F ITS4 120881 UNITE 8.0 2019-02-13
ITS_UTAX.udb utax sh_general_release_dynamic_all_02.02.2019_dev.... ITS1-F ITS4 50000 UNITE 8.0 2019-02-12
LSU.udb vsearch RDP_v8.0_fungi.fa None None 91823 RDP 8 2019-02-12
LSU_SINTAX.udb sintax RDP_v8.0_fungi.fa None None 91823 RDP 8 2019-02-12
LSU_UTAX.udb utax RDP_v8.0_fungi.fa None None 45000 RDP 8 2019-02-13

Composition of each OTU?

Question was:
One of the optional outputs with other software is a file that lists the individual reads that make up the OTUs. So if I looked at a specific OTU, I would be able to see the range of sequence variation within that OTU. Might it be possible to have this as an optional file output?
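
Much of this information already exists in the read-mapping .uc file that the pipeline writes when it builds the OTU table (a *.mapping.uc file shows up in other logs in this tracker). A sketch of grouping reads by OTU from that file (hypothetical filename):

    from collections import defaultdict

    reads_per_otu = defaultdict(list)
    with open('mydata.mapping.uc') as uc:
        for line in uc:
            cols = line.rstrip('\n').split('\t')
            if cols[0] == 'H':                   # 'H' records are read-to-OTU hits
                read_label, otu_label = cols[8], cols[9]
                reads_per_otu[otu_label].append(read_label)

    for otu, reads in sorted(reads_per_otu.items()):
        print(otu, len(reads))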

brew install ufits zlib issue

brew install ufits
returns this error message
==> Installing ufits from nextgenusfs/tap
Error: Operation already in progress for zlib
Another active Homebrew process is already using zlib.
Please wait for it to finish or terminate it to continue.

please advise

Reference database updates & Wiki

Hello Jon!
UNITE team announced the new release of their database (v7.2) and I decided to update the AMPTk-compatible database with it.
To make the database preparation more transparent for the users I started a new wiki page and posted the code used for ITS data. The code is based on your log files from the AMPTk pre-formatted database (ITS.log, ITS_UTAX.log, ITS1_UTAX.log, ITS2_UTAX.log).

Note, however, that there is possibly a mistake in the current UNITE release. I think they accidentally interchanged the file names between the trimmed and untrimmed (developer) files: the length of sequences is smaller in the dev file (which should be used for DB preparation), while the 'trimmed' file is larger. I've reported this issue to UNITE, but haven't received a response yet.

With best regards,
Vladimir

UnboundLocalError: local variable 'hit' referenced before assignment when using usearch10

OS: debian buster/sid, 8 cores, ~ 16 GB RAM. Python: 3.7.3
AMPtk v1.4.0, USEARCH v10.0.240, VSEARCH v2.13.4
Loading FASTA Records
8,203 OTUs
Global alignment OTUs with usearch_global (VSEARCH) against ITS.udb
Classifying OTUs with UTAX (USEARCH)
Classifying OTUs with SINTAX (USEARCH)
UTAX results empty
SINTAX results empty
Traceback (most recent call last):
File "/amptk/bin/amptk", line 735, in
main()
File "/amptk/bin/amptk", line 726, in main
mod.main(arguments)
File "/amptk/lib/python3.7/site-packages/amptk/assign_taxonomy.py", line 320, in main
bestClassify = amptklib.bestclassifier(utaxDict, sintaxDict, otuList)
File "/amptk/lib/python3.7/site-packages/amptk/amptklib.py", line 1494, in bestclassifier
BestClassify[otu] = hit
UnboundLocalError: local variable 'hit' referenced before assignment

running:
amptk taxonomy -f .cluster.filtered.otus.fa -i .cluster.final.txt -m mapping_file.txt -d ITS2 -u usearch10

I tried with usearch9 and it works fine.

It appears that there are no hits with usearch10.
Greetings
