gff3-pltools's People

Contributors

mamarjan, pjotrp

gff3-pltools's Issues

Ruby interface for filtering GFF3 files

A Ruby module and function which would use the gff3-ffetch utility to quickly filter the records in a GFF3 file, and either return them as a string or write them to a file.

New name

The plan for a parallel GFF3 parser accessible from Ruby and other languages has been postponed, so the new goal for this project is to create a range of tools for working with GFF3 files, and a gem for using them from within Ruby.

It is no longer just a high-performance library, so a new name would be appropriate.

This issue is for tracking proposals. GFF3 Tools seems to be taken, so the first one is:

gff3-par-tools

GFF3 parser should parse the lines into Records

Currently all the parser does is split the file into lines. However, that is not very useful (not at all, actually).

Instead, the parser should take one line at a time and parse its contents into a GFF record, and make that available to Ruby. The record in Ruby should also resemble the interface of GFF record objects from other GFF parsers, as closely as possible.
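
As a rough illustration of what "parse its contents into a GFF record" could mean, here is a minimal Python sketch. The `Record` class and `parse_line` function are hypothetical names, not part of this project. It assumes the nine tab-separated columns of GFF3, with "." meaning an undefined value.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote


@dataclass
class Record:
    # Hypothetical record type mirroring the nine GFF3 columns.
    seqid: str
    source: Optional[str]
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    phase: Optional[int]
    attributes: dict


def _optional(value, convert=str):
    # "." marks an undefined value in GFF3.
    return None if value == "." else convert(value)


def parse_line(line: str) -> Record:
    """Parse one tab-separated GFF3 data line into a Record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields

    attributes = {}
    if attrs != ".":
        # Split on the reserved separators first, then decode %XX escapes.
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = [unquote(v) for v in value.split(",")]

    return Record(
        seqid=unquote(seqid),
        source=_optional(source, unquote),
        type=unquote(ftype),
        start=int(start),
        end=int(end),
        score=_optional(score, float),
        strand=_optional(strand),
        phase=_optional(phase, int),
        attributes=attributes,
    )
```

Attribute values are kept as lists because GFF3 allows comma-separated multiple values for a single tag.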

A utility for benchmarking the parser

For now it will simply parse the file provided on the command line with the NO_VALIDATION option to gff3_file.open(). You can use the command "time" to get the running time.

Docs

A tool with good docs is a much better tool. What we need is:

  • API docs for the Ruby modules,
  • API docs for the D library,
  • a quality README file,
  • man pages for the tools,
  • examples of usage.

Correct behavior for %00 in fields

Currently the parser will ignore the existence of this character sequence. According to the GFF3 spec, it's not illegal to have a zero character, like the one used to end a C string. As the interface is defined in C constructs, the interface needs to be designed in a way that will allow this.
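
To show why the interface design matters here, the following Python sketch uses `ctypes` to contrast a NUL-terminated view of a decoded field with a view that carries an explicit length. It only illustrates the general problem and is not this project's interface. The field content and the `decode_field` helper are made up for the example.

```python
import ctypes
from urllib.parse import unquote_to_bytes


def decode_field(raw: str) -> bytes:
    # Hypothetical helper: decode %XX escapes, including %00, into raw bytes.
    return unquote_to_bytes(raw)


decoded = decode_field("Note=abc%00def")

# Copy the bytes into a C char buffer, as a C-based interface might.
buf = ctypes.create_string_buffer(decoded)

# Read as a C string: stops at the first zero byte, so data is lost.
c_string_view = buf.value

# Read with an explicit length: the zero byte and what follows survive.
length_view = buf.raw[:len(decoded)]
```

Passing a pointer together with a length, instead of relying on a terminating zero byte, is one common way to let such values cross a C boundary intact.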

Validator utility for records

A validator utility written in D, which parses all records in a given GFF3 file and prints out the issues the parser finds with them. It can also be used to measure parsing speed.

replace_url_escaped_chars() too slow

Tests with the 1GB file from Wormbase show that this function is probably a problem when parsing a file with escaped characters. The file in question has multiple newline chars escaped in the attributes field of pretty much every record.

Instead of ~30 seconds as expected, the benchmark application went for 4-5 minutes until interrupted by me with Ctrl-C.

The function should be replaced with a similar one with much better performance.
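
The original implementation is not shown here, so the following is only a sketch of one common approach, in Python and under the hypothetical name `unescape_single_pass`. It returns the input untouched when there is no "%" at all, and otherwise makes a single pass, copying whole chunks between escapes instead of rebuilding the string once per replacement. It assumes each escape stands for a single character with a code below 256.

```python
import string

_HEX = frozenset(string.hexdigits)


def unescape_single_pass(field: str) -> str:
    """Replace %XX escapes in one pass over the field."""
    # Fast path: most fields contain no escapes at all.
    if "%" not in field:
        return field

    out = []
    pos = 0
    while True:
        i = field.find("%", pos)
        if i == -1 or i + 3 > len(field):
            # No further complete escape: copy the remainder verbatim.
            out.append(field[pos:])
            break
        out.append(field[pos:i])
        digits = field[i + 1:i + 3]
        if digits[0] in _HEX and digits[1] in _HEX:
            out.append(chr(int(digits, 16)))
            pos = i + 3
        else:
            # Not a valid escape: keep the "%" as a literal character.
            out.append("%")
            pos = i + 1
    return "".join(out)
```

The pieces are collected in a list and joined once at the end, which keeps the total work linear in the length of the field.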

Split util.d

The introduction of a util.d file is always worrisome :). It usually contains stuff which should be in properly named modules. Make a point of keeping this file really, really small.

Expand and maintain README file

The README should contain up-to-date information on the state of the project, and explicit commands for running the tools and/or tests.

Fasta reader comparison

How does the Fasta reader compare to the one in the dscience project by bioinfornatics? Maybe share?

Records to represent pragmas and comments

Some types of applications, like validation tools, require access to data in pragmas, and maybe comments too. Currently the library skips these lines and there is no way for the library client to receive that data.

A good approach would probably be to add a few new record types, some is_x() functions for testing the type, and a flag to the parser to let it know if the user is interested in pragmas and comments.
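
The proposal above can be sketched roughly as follows. This is Python rather than D, and all names (`LineKind`, `classify`, `read_lines`, `keep_pragmas_and_comments`) are hypothetical. It assumes that pragmas start with "##" and comments with a single "#".

```python
from enum import Enum


class LineKind(Enum):
    PRAGMA = "pragma"
    COMMENT = "comment"
    RECORD = "record"


def classify(line: str) -> LineKind:
    # Check "##" first, since every pragma also starts with "#".
    if line.startswith("##"):
        return LineKind.PRAGMA
    if line.startswith("#"):
        return LineKind.COMMENT
    return LineKind.RECORD


def read_lines(lines, keep_pragmas_and_comments=False):
    """Yield (kind, line) pairs; pragmas and comments only on request."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind = classify(line)
        if kind is LineKind.RECORD or keep_pragmas_and_comments:
            yield kind, line
```

With the flag left at its default, existing clients would see the same stream of records as before, while a validation tool could opt in to the extra line types.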

Refactor RecordRange

RecordRange and LazySplitLines (soon named SplitIntoLines) show remarkable overlap in functionality. Is there a way to introduce a D template?

count-features utility too slow

It takes too long when parsing the c_angaria.current.annotations.gff3 file. An algorithm with better performance could be used instead of the most basic one.

Maybe first generate a list of all records, then employ some sorting algorithm for their hashes or something similar.
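
As an alternative to sorting, a hash set gives roughly constant-time lookups per record. The Python sketch below is a different technique from the one suggested above. It also rests on assumptions about what the utility counts: that records sharing an ID attribute belong to one feature, that a record without an ID is a feature of its own, and that lines are well formed.

```python
def count_features(lines) -> int:
    """Count features, treating records with the same ID as one feature."""
    seen_ids = set()
    without_id = 0
    for line in lines:
        # Skip blank lines, comments and pragmas.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumes nine tab-separated columns; attributes are the last one.
        attrs = line.rstrip("\n").split("\t")[8]
        feature_id = None
        for pair in attrs.split(";"):
            if pair.startswith("ID="):
                feature_id = pair[3:]
                break
        if feature_id is None:
            without_id += 1
        else:
            seen_ids.add(feature_id)
    return len(seen_ids) + without_id
```

Each record is visited once and each ID is hashed once, so the running time grows linearly with the number of records.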

Ignore comments

While parsing records, the parser should ignore comments, which are lines that start with a single # character.

The spec says that the # character doesn't have to be escaped in any context except when it is part of the first field. From that we can conclude that comments at the end of a line are not allowed.

Decouple API from internal GFF3 record data structures

The current API is strongly tied to the GFF3 record structure. This is too low-level, and may perhaps introduce deployment (architecture) issues and/or inefficiencies (think lazy parsing).

My proposal is to start from the use cases. What is it you want to harvest from a GFF3 file? Genes, CDS, ORFs, mRNA - so that is high level(!). The efficiency is not in raw parsing, it is actually in parsing + combining results. I predict the API itself can be really String based - i.e. it is not necessary to pass around struct types when it is not required.

It may be interesting to have a low-level API, but I don't expect it will help Ruby performance a fat lot. So, might as well focus on the final API. My bio-gff3 command line parser is used just for a few types of results. Getting that functionality first will give a lot of experience in what is required.

Maybe initially just create a command line tool that copies the functionality of bio-gff3? The API will follow.

A new repository for the Ruby gem

Extract the Ruby library into a separate repository. Man pages are one reason why, but other scripting languages should be able to use the tools without Ruby too.

Add tests for invalid input and data validation

The parser should be able to detect badly formatted files, such as lines with more or fewer than 9 fields, characters which should be escaped but are not, etc., and output warnings or throw errors.
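
A few of these checks could look like the Python sketch below. The function name `validate_line` and the choice to return a list of messages are assumptions made for illustration. The seqid character set follows my reading of the GFF3 spec, so treat it as something to verify rather than as authoritative.

```python
import re

# Characters allowed unescaped in seqid (per my reading of the spec),
# plus "%" so that escaped sequences are accepted.
SEQID_OK = re.compile(r"[a-zA-Z0-9.:^*$@!+_?|%-]+")
INTEGER = re.compile(r"[0-9]+")


def validate_line(line: str) -> list:
    """Return a list of problems found in one GFF3 data line."""
    issues = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Without nine fields the remaining checks make no sense.
        issues.append(f"expected 9 fields, found {len(fields)}")
        return issues

    seqid, _source, _type, start, end, _score, strand, phase, _attrs = fields
    if not SEQID_OK.fullmatch(seqid):
        issues.append("seqid contains characters that should be escaped")
    if not (INTEGER.fullmatch(start) and INTEGER.fullmatch(end)):
        issues.append("start and end must be integers")
    elif int(start) > int(end):
        issues.append("start is greater than end")
    if strand not in {"+", "-", ".", "?"}:
        issues.append(f"invalid strand {strand!r}")
    if phase not in {"0", "1", "2", "."}:
        issues.append(f"invalid phase {phase!r}")
    return issues
```

Returning messages instead of raising leaves it to the caller to decide whether a problem is a warning or an error.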

Basic GFF3 output in D

The most basic implementation would probably be to add write and writeln methods to the Record class, and similar methods to the Feature class.
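
What such a write method has to do can be sketched in Python as a free function that turns a record back into a GFF3 line. The names `escape` and `format_record` are hypothetical. The sketch assumes that tabs, newlines, carriage returns, "%" and other control characters are always escaped, and that ";", "=", "&" and "," are additionally escaped inside attributes.

```python
ATTRIBUTE_RESERVED = ";=&,"


def escape(text: str, extra: str = "") -> str:
    """Percent-encode characters that may not appear raw in a field."""
    reserved = "%" + extra
    out = []
    for c in text:
        # Control characters (including tab and newline) are always escaped.
        if c in reserved or ord(c) < 0x20 or ord(c) == 0x7F:
            out.append(f"%{ord(c):02X}")
        else:
            out.append(c)
    return "".join(out)


def format_record(columns, attributes) -> str:
    """Build one GFF3 line from eight column values and an attribute dict."""
    cols = ["." if c is None else escape(str(c)) for c in columns]
    pairs = []
    for key, values in attributes.items():
        joined = ",".join(escape(v, ATTRIBUTE_RESERVED) for v in values)
        pairs.append(escape(key, ATTRIBUTE_RESERVED) + "=" + joined)
    cols.append(";".join(pairs) if pairs else ".")
    return "\t".join(cols)
```

A writeln-style variant would simply append a newline to the returned string before writing it out.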

Packaging for Ubuntu 32 and 64 bit

This should be enough for the first release, with instructions and descriptions of requirements, which can be used for other platforms.

A gem for rubygems.org

This gem should build the binaries while being installed, like current gems with external C libraries do. The dependency would be an installed D2 compiler, either dmd or gdc.

Benchmark tool seg-fault

When parsing the 233MB m_hapla testfile from Wormbase, the tool crashes with a segmentation fault after a number of parsed lines.
