mamarjan / gff3-pltools Goto Github PK

View Code? Open in Web Editor NEW

15.0 15.0 5.0 1.21 MB

A fast parallel GFF3 parser

License: MIT License

Ruby 2.99% D 97.01%

gff3-pltools's People

Contributors

Stargazers

Watchers

Forkers

macarthy csw lomereiter pjotrp domdellasera

gff3-pltools's Issues

Too many files starting with gff3

Create directory bio/gff3 and move all files from dlib/bio/ to that directory, without gff3 in their name.

Allow FASTA to be extracted without parsing the rest of the GFF3 file

Split feature iterate-over-records, and make it record based rather than file based

Each scenario in iterate-over-records.feature is actually a feature! You can take a hint from the fact that each scenario actually warrants a fuller explanation, and could be split into further scenarios.

Also I would use one example record to test a feature (not a file!). You can embed the record in the feature, example:

https://github.com/pjotrp/bioruby-alignment/blob/master/features/edit/del_bridges.feature

A tool for sorting GFF3 files

A tool for reordering the GFF3 file so that features are close to their parents/children and separated by ###

Complete support for Latin-1 and Unicode files

Here is an excerpt from the next GFF3 spec which is in development:

"The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of Latin-1 or Unicode are recommended. "

Source: http://www.sequenceontology.org/resources/gff3_1.21.html

Therefore, this parser should be able to support the Latin-1, UTF-8, UTF-16 and UTF-32 encodings.

parseAttributes requires a range of tests

Probably best to use rspec, in addition to cucumber.

Ruby interface for filtering GFF3 files

A Ruby module and function which would use the gff3-ffetch utility to quickly filter the records in a GFF3 file, and either return those as a string or send them into a file.

Parsing pragmas

Support for pragma parsing and retrieving is missing.

parseAttributes variable naming

The function is unreadable - I see attributes everywhere ;)

New labels for github issues

Add labels for

in progress
in testing
question asked
today

So we have some more state on the issues :)

GFF3 to JSON tool

New name

The plan for a parallel GFF3 parser accessible from Ruby and other languages has been postponed, so the new goal for this project is to create a range of tools for working with GFF3 files, and a gem for using them from within Ruby.

It not anymore just a high performance library, so a new name would be appropriate.

This issue is for tracking proposals. GFF3 Tools seem to be taken, so the first one is:

gff3-par-tools

GFF3 parser should parse the lines into Records

Currently all the parser does is to split the file into lines. However, that is not very useful (not at all actually).

Instead, the parser should take one line at a time and parse its contents into a GFF record, and make that available to Ruby. The record in Ruby should also resemble the interface of GFF record objects from other GFF parsers, as closely as possible.

A utility for benchmarking the parser

For now it will simply parse the file provided on the command line with the NO_VALIDATION option to gff3_file.open(). You can use the command "time" to get the running time.

Docs

A tool with good docs is a much better tool. What we need is:

API docs for the Ruby modules,
API docs for the D library,
a quality RADME file,
man pages for the tools,
examples of usage.

& and , have special meaning in attributes field

I forgot about this. Need to check out how they affect the code. It's quite a basic requirement, so flagging it to be solved for the first release.

Correct behavior for %00 in fields

Currently the parser will ignore the existence of this character sequence. According to the GFF3 spec, it's not illegal to have a zero character, like the one used to end a C string. As the interface is defined in C constructs, the interface needs to be designed in a way that will allow this.

Validator utility for records

A validator utility written in D, which parses all records for a given GFF3 file and print out issues with records found by the parser. Can be also used to measure the speed of parsing.

replace_url_escaped_chars() too slow

Tests with the 1GB file from Wormbase show that this function is probably a problem, when parsing a file with escaped characters. The file in question has multiple newline chars escaped in attributes fields for pretty much every record.

Instead of ~30 seconds as expected, the benchmark application went for 4-5 minutes until interrupted by me with Ctrl-C.

The function should be replaced with a similar with much better performance.

Move validation out into its own module

Create a module gff3-validation.d

Split util.d

The introduction of a util.d file is always worrisome :). It usually contains stuff which should be in properly named modules. Make a point of keeping this file really, really small.

GDC support

Expand and maintain README file

The README should contain up-to-date instructions on state of the project, and explicit commands for running tools and/or tests.

Validation of record formatting should be lazy

At the moment all validation happens inline. We need to be able to turn that off. Ideally this is handled through a delegate, so as to prevent recurring if statements.

Support for FASTA data at the end of the file

Fasta support and retrieval should be added.

gff3-ffetch to pass FASTA data to output

A new option, like --fasta-pass-through, and then after the last record a line should be printed with ##FASTA, and then the FASTA data would follow.

Try re-implementing the FeatureCache using built-in associative arrays

The current implementation is already fast, but the built-in associative arrays might be faster, but that should be measured.

Fasta reader comparison

How does the Fasta reader compare to the one in the dscience project by bioinfornatics? Maybe share?

Records to represent pragmas and comments

Some types of applications, like validation tools, require access to data in pragmas, and maybe comments too. Currently the library skips these lines and there is no way for the library client to receive that data.

A good approach would be probably to add a few new record types, some is_x() functions for testing the type, and a flag to the parser to let it know if the user is interested in pragmas and comments.

Refactor RecordRange

RecordRange and LazySplitLines (soon named SplitIntoLines) show remarkable overlap in functionality. Is there a way to introduce a D template?

Windows binaries

count-features utility too slow

It takes too long while parsing the c_angaria.current.annotations.gff3 file. An algorithm with better performance could be used instead of the most basic.

Maybe first generate a list of all records, then employ some sorting algorithm for their hashes or something similar.

Record should move into gff3-record.d module

The record handler and testing should move into its own file. The file handler can be renamed to gff3-file.d.

Ignore comments

While parsing records, the parser should ignore comments, which are lines that start with a single # character.

The spec says that # character doesn't have to be escaped in any other context except when part of the first field. From that we can conclude that comments at the end of the line are not allowed.

Parallel parsing of GFF3 records

First parallel parsing of GFF3 data in strings should be supported, and then the same for data in files.

Decouple API from internal GFF3 record data structures

The current API is strongly tied to the GFF3 record structure. This is too low-level, and may perhaps introduce deployment (architecture) issues and/or inefficiencies (think lazy parsing).

My proposal is to start from the use cases. What is it you want to harvest from a GFF3 file? Genes, CDS, ORFs, mRNA - so that is high level(!). The efficiency is not in raw parsing, it is actually in parsing + combining results. I predict the API itself can be really String based - i.e. it is not necessary to pass around struct types when it is not required.

It may be interesting to have a low-level API, but I don't expect it will help Ruby performance a fat lot. So, might as well focus on the final API. My bio-gff3 command line parser is used just for a few types of results. Getting that functionality first will give a lot of experience in what is required.

Maybe initially just create a command line tool that copies the functionality of bio-gff3? The API will follow.