Hello Jon! UNITE team announced the new release of their database (v7.2) and I dec

Reference database updates & Wiki (amptk issue, closed)

nextgenusfs commented on June 22, 2024

Reference database updates & Wiki

from amptk.

Comments (11)

nextgenusfs commented on June 22, 2024

Thanks @vmikk. I will update the wiki page slightly. I will also build the new UNITE v7.2 databases so that amptk install -i ITS downloads the newest version.

nextgenusfs commented on June 22, 2024

I also noticed a few weeks ago that Robert dropped support for UTAX in USEARCH10, so I'm not sure how long I will continue to support UTAX. I find that it sometimes does a better job than SINTAX, but it is hard to force people to stick with USEARCH9. If there aren't major upgrades in v10, then maybe I will stay with v9 for a little while yet.

nextgenusfs commented on June 22, 2024

And let me know what they say about the developer vs. regular version of the general FASTA release. Here are some quick stats on each of them. So yes, it looks like I will probably not use the developer version this time: the stats below show the regular release has the longer sequences, and I want to start from the file that is most likely to still contain the primer sites.

fasta_stats.py sh_general_release_s_28/sh_general_release_dynamic_s_28.06.2017.fasta 
Reads:    58,639
AvgLen:   731 bp
Shortest: 216 bp
Longest:  7,491 bp
Total:    42,886,966 bp

fasta_stats.py sh_general_release_s_28/developer/sh_general_release_dynamic_s_28.06.2017_dev.fasta 
Reads:    58,639
AvgLen:   547 bp
Shortest: 140 bp
Longest:  2,764 bp
Total:    32,081,159 bp
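
The fasta_stats.py script itself is not shown in this thread. As a rough sketch of how the same five numbers could be computed, here is a minimal pure-Python version. The function names read_fasta and fasta_stats are made up for this illustration, and the rounding of the average length is an assumption rather than the script's documented behaviour:

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_stats(handle):
    """Return read count, average, shortest, longest and total length in bp."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    if not lengths:
        return {"reads": 0, "avg_len": 0, "shortest": 0, "longest": 0, "total": 0}
    total = sum(lengths)
    return {
        "reads": len(lengths),
        # Assumption: the average is rounded to the nearest whole base pair.
        "avg_len": round(total / len(lengths)),
        "shortest": min(lengths),
        "longest": max(lengths),
        "total": total,
    }
```

To run it on one of the release files, open the file and pass the handle, for example with open(path) as fh: print(fasta_stats(fh)).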

vmikk commented on June 22, 2024

There is also an issue with duplicated sequence IDs, which cause USEARCH to fail with an error:

# Print the accession (2nd "|"-separated header field) for every ID that occurs more than once
awk 'BEGIN {FS = "|"} /^>/ {print $2}' \
  sh_general_release_dynamic_s_28.06.2017_dev.fasta \
  | sort | uniq -c | sort -rn \
  | awk '$1 > 1'
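
The same check can be sketched in Python, which may be easier to extend (for example, to print the conflicting headers). This is only an illustration: the function name duplicated_accessions is hypothetical, and it assumes the accession is the second "|"-separated field of each header, as in the awk command above.

```python
from collections import Counter


def duplicated_accessions(handle):
    """Return {accession: count} for accessions seen in more than one header.

    Assumes UNITE-style headers where the accession is the second
    '|'-separated field, e.g. >name|ACCESSION|SH...|type|taxonomy
    """
    counts = Counter()
    for line in handle:
        if not line.startswith(">"):
            continue
        fields = line[1:].rstrip("\r\n").split("|")
        if len(fields) > 1:
            counts[fields[1]] += 1
    return {acc: n for acc, n in counts.items() if n > 1}
```

Usage would look like with open("sh_general_release_dynamic_s_28.06.2017_dev.fasta") as fh: print(duplicated_accessions(fh)).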

nextgenusfs commented on June 22, 2024

FYI: https://twitter.com/unite_sh/status/882846103846236160. I'm still working on getting these pre-installed versions released. I'm running into a memory error with UTAX on full-length ITS (I don't have 64-bit USEARCH...). We'll see if the "updates" to v7.2 are any better. I'm worried that SINTAX will have the same memory problem...

nextgenusfs commented on June 22, 2024

Pre-built ITS databases have been updated to UNITE v7.2.

vmikk commented on June 22, 2024

Hello Jon!
Thanks for the update!

By the way, how did you handle duplicated sequences (e.g., KX909166)? In some cases they have different annotations, for example:

>unidentified|KX909166|SH640154.07FU|reps|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__unidentified;f__unidentified;g__unidentified;s__unidentified
>Nectria_dacryocarpa|KX909166|SH490415.07FU|reps_singleton|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Nectria;s__Nectria_dacryocarpa

PS. The UNITE team didn't reply to my message.

nextgenusfs commented on June 22, 2024

I'm not sure what they are doing with duplicate sequences in their DB, but the sequences of these two records are identical, so if you use the --derep_fulllength option of amptk database it will remove the duplicates. But perhaps I should update that so it keeps the longer taxonomy string when sequences are identical.

vmikk commented on June 22, 2024

Yes, it would be great to preserve the longest taxonomy string. As I understand it, the first one is taken now, and in the case of KX909166 the less informative string was selected.
However, it is not really urgent because there are not many duplicates in the database.
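
For illustration only, here is a minimal Python sketch of that idea. It is not amptk's implementation; the function names are hypothetical. It also swaps in a slightly different criterion than raw string length: it counts the ranks that are not "unidentified", since character length can be misleading when unresolved ranks are spelled out in full, as in the KX909166 headers above.

```python
def informative_ranks(taxonomy):
    """Count ranks (k__, p__, ...) whose name is not 'unidentified'."""
    count = 0
    for rank in taxonomy.split(";"):
        _, _, name = rank.partition("__")
        if name and name != "unidentified":
            count += 1
    return count


def dereplicate(records):
    """Keep one record per identical sequence.

    records: iterable of (header, sequence) pairs, where the taxonomy is
    assumed to be the last '|'-separated field of the header.
    Among identical sequences the header with the most informative ranks
    wins; on a tie the first one seen is kept.
    """
    best = {}
    for header, seq in records:
        key = seq.upper()
        score = informative_ranks(header.split("|")[-1])
        if key not in best or score > best[key][0]:
            best[key] = (score, header, seq)
    return [(header, seq) for _, header, seq in best.values()]
```

With the two KX909166 headers and an identical sequence, this keeps the Nectria_dacryocarpa record (seven resolved ranks) rather than the "unidentified" one (three resolved ranks), regardless of which appears first in the file.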

bsmoda commented on June 22, 2024

Hello @nextgenusfs
I'm trying to create a new ITS2 database from the UNITE reference using the primers from my work, but I get this error message:

$ amptk database -i /mnt/data2/lbcb/projects/Bruno.Moda/databases/unite/v7.2/unite_insd/UNITE_public_01.12.2017.fasta -f GTGAATCATCGARTCTTTGAAC -r TATGCTTAAGTTCAGCGGGTA --primer_required none -o ITS2_unite_v7.2 --create_db usearch --install --source UNITE:7.2
usage: amptk-extract_region.py [options] -f
amptk-extract_region.py: error: unrecognized arguments: --primer_required none --install --source UNITE:7.2

I'm using amptk within a conda env, but with the repo from git (up to date).
I also tried without the unrecognized arguments, but got this other error:

$ amptk database -i /mnt/data2/lbcb/projects/Bruno.Moda/databases/unite/v7.2/unite_insd/UNITE_public_01.12.2017.fasta -f GTGAATCATCGARTCTTTGAAC -r TATGCTTAAGTTCAGCGGGTA -o ITS2_unite_v7.2 --create_db usearch
Traceback (most recent call last):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/bin/amptk-extract_region.py", line 411, in <module>
    amptklib.setupLogging(log_name)
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 1637, in setupLogging
    fhnd = logging.FileHandler(LOGNAME)
  File "/mnt/data2/lbcb/conda/envs/amptk/lib/python2.7/logging/__init__.py", line 920, in __init__
    StreamHandler.__init__(self, self._open())
  File "/mnt/data2/lbcb/conda/envs/amptk/lib/python2.7/logging/__init__.py", line 950, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: u'/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/DB/ITS2_unite_v7.2.log'

What should I do?
Thanks!

nextgenusfs commented on June 22, 2024

You can open a new issue if you have a problem instead of appending to this closed one. You can try to uninstall v1.2.4 from conda (conda remove amptk) and then install the new version via pip (pip install amptk). I've been having many conda issues over the last few weeks and thus haven't pushed an update to conda, as it isn't working for me at the moment. Conda has been giving us some permissions errors as well, so it is hard to know whether this is related or not.
