Git Product home page Git Product logo

fasql's Introduction

fasql (v0.5.0)

Read and Write FASTX Files w/ DuckDB

Warning This repo will be archived. Please see exon-duckdb, which has similar but expanded functionality.

Reading Overview

fasql is an open source DuckDB Extension from WHERE TRUE Technologies that adds FASTA and FASTQ file parsing as table functions.

For example, given a FASTA file called ./swissprot.fasta.gz in your local directory, you can query it like so.

SELECT *
FROM read_fasta('./swissprot.fasta.gz')
LIMIT 5
-- ┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────┐
-- │          id          │     description      │                                                           sequence                                                           │      file_name       │
-- │       varchar        │       varchar        │                                                           varchar                                                            │       varchar        │
-- ├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
-- │ sp|A0A023I7E1|ENG1…  │ Glucan endo-1,3-be…  │ MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIFPEIKHPFEPMYANTENGKIVPTNSWISNLFYPSADNLAPTTPDPYTLRLLDGYGGNPGLTIRQPSAKVLGSYPPTN…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024B7W1|POLG…  │ Genome polyprotein…  │ MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKEKKRRGADTSVGIVGLLLTTAMA…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024SC78|CUTI…  │ Cutinase OS=Hypocr…  │ MRSLAILTTLLAGHAFAYPKPAPQSVNRRDWPSINEFLSELAKVMPIGDTITAACDLISDGEDAAASLFGISETENDPCGDVTVLFARGTCDPGNVGVLVGPWFFDSLQTALGSRTLGVKGV…  │ ./swissprot.fasta.gz │
-- │ sp|A0A024SH76|GUX2…  │ Exoglucanase 2 OS=…  │ MIVGILTTLATLATLAASVPLEERQACSSVWGQCGGQNWSGPTCCASGSTCVYSNDYYSQCLPGAASSSSSTRAASTTSRVSPTTSRSSSATPPPGSTTTRVPPVGSGTATYSGNPFVGVTP…  │ ./swissprot.fasta.gz │
-- │ sp|A0A026W182|ORCO…  │ Odorant receptor c…  │ MMKMKQQGLVADLLPNIRVMKTFGHFVFNYYNDNSSKYLHKVYCCVNLFMLLLQFGLCAVNLIVESADVDDLTANTITLLFFTHSIVKICYFAIRSKYFYRTWAIWNNPNSHPLFAESNARY…  │ ./swissprot.fasta.gz │
-- └──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────┘

Schemas

The schemas for the FASTA and FASTQ table functions are as follows.

FASTA

The read_fasta function returns a table with the following schema.

column_name column_type nullable
id VARCHAR NO
description VARCHAR YES
sequence VARCHAR NO
file_name VARCHAR NO

FASTQ

The read_fastq function returns a table with the following schema.

column_name column_type nullable
id VARCHAR NO
description VARCHAR YES
sequence VARCHAR NO
quality_scores VARCHAR NO
file_name VARCHAR NO

Replacement Scans

A number of "replacement scans" also work, whereby you just need to have a file reasonably named, and the extension will pick up on it as the appropriate file. E.g. SELECT * FROM 'test.fasta' or SELECT * FROM 'test.fastq.gz'.

The following file endings will trigger the scan for FASTA:

  • .fasta.gz
  • .fa.gz
  • .fasta
  • .fa

And for FASTQ:

  • .fastq.gz
  • .fq.gz
  • .fastq
  • .fq

Globs

Globs are supported both within the table function and the replacement scan provided the glob matches the replacement scan in the first place.

For example, SELECT * FROM './path/to/*.fasta' will select all FASTA files in the ./path/to/ directory. This is the same as SELECT * FROM read_fasta('./path/to/*.fasta').

Writing Overview

On MacOS and Linux, you can write FASTA and FASTQ files using COPY TO.

FASTA

For example, given a table called my_fasta with the schema id VARCHAR, description VARCHAR, sequence VARCHAR, you can write it to a file called my_fasta.fasta like so.

For example, combining this with the DuckDB postgres scanner to export a query to a file:

LOAD 'postgres_scanner';
LOAD 'fasql';

-- Your connection string here, can use PGPASSWORD for the password
CALL POSTGRES_ATTACH('postgres://localhost:5432/my_db');

COPY (
  SELECT id_col AS id, desc_col AS description, seq_col AS sequence
  FROM my_table
  WHERE length(seq_col) < 50
) TO 'long_sequences.fasta' (FORMAT 'fasta');

A description column is optional, and if not present will be ignored.

FASTQ

Similar to above, the COPY TO syntax is the same, but the format is fastq and schema is id VARCHAR, description VARCHAR, sequence VARCHAR, quality_scores VARCHAR.

Installation and Usage

You can use this extension as you would other DuckDB extensions. Here's one example of how to do that in a raw DuckDB console and one in Python.

DuckDB Console

To install and use fasql, start a duckdb session:

# unsigned required for non DuckDB Labs distributed extensions (or a large monthly payment)
$ duckdb -unsigned

Now from the session:

D SET custom_extension_repository='dbe.wheretrue.com/fasql/latest';
D INSTALL fasql;
D LOAD fasql;

If you want some easy data to play with you can wget one of the test files.

$ wget https://raw.githubusercontent.com/wheretrue/fasql/main/test/sql/test.fasta

Then back into the console, load the extension and enjoy.

D LOAD fasql;
D SELECT COUNT(*) FROM read_fasta('./test.fasta');
-- ┌──────────────┐
-- │ count_star() │
-- │    int64     │
-- ├──────────────┤
-- │            2 │
-- └──────────────┘

Python

For example, this script installs the extension then counts the number of records at path.

import pathlib

import duckdb

con = duckdb.connect(config={'allow_unsigned_extensions': True})
con.execute("SET custom_extension_repository='dbe.wheretrue.com/fasql/latest';")
con.execute("INSTALL fasql;")
con.execute("LOAD fasql;")

# Assumes this is in your home directory.
path = pathlib.Path("swissprot.fasta.gz")

result = con.execute(f"SELECT * FROM read_fasta('{path}');").fetchall()
print(len(result))
assert len(result) == 569213

# Or create a dataframe.
df = con.execute(f"SELECT id, description, sequence FROM read_fasta('{path}');").df()
# print(df.head())
#                           id                                        description                                           sequence
# 0   sp|A0A023I7E1|ENG1_RHIMI  Glucan endo-1,3-beta-D-glucosidase 1 OS=Rhizom...  MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIF...
# 1   sp|A0A024B7W1|POLG_ZIKVF  Genome polyprotein OS=Zika virus (isolate ZIKV...  MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRM...
# 2  sp|A0A024SC78|CUTI1_HYPJR  Cutinase OS=Hypocrea jecorina (strain ATCC 567...  MRSLAILTTLLAGHAFAYPKPAPQSVNRRDWPSINEFLSELAKVMP...
# 3   sp|A0A024SH76|GUX2_HYPJR  Exoglucanase 2 OS=Hypocrea jecorina (strain AT...  MIVGILTTLATLATLAASVPLEERQACSSVWGQCGGQNWSGPTCCA...
# 4   sp|A0A026W182|ORCO_OOCBI  Odorant receptor coreceptor OS=Ooceraea biroi ...  MMKMKQQGLVADLLPNIRVMKTFGHFVFNYYNDNSSKYLHKVYCCV...

Supported Platforms

This extension is built for Linux and MacOS.

fasql's People

Contributors

tshauck avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

fasql's Issues

Add glob support

  • Should be able to pass globs as well as single files.
  • Should support compression.
  • Should support replacement scans.

E.g. SELECT * FROM 'path/to/gzipped/fastqs/*.fastq.gz';

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.