mamarjan / gff3-pltools
A fast parallel GFF3 parser
License: MIT License
Create the directory bio/gff3 and move all files from dlib/bio/ to that directory, dropping "gff3" from their names.
Each scenario in iterate-over-records.feature is actually a feature! You can take a hint from the fact that each scenario actually warrants a fuller explanation, and could be split into further scenarios.
Also, I would use one example record to test a feature (not a file!). You can embed the record in the feature, for example:
https://github.com/pjotrp/bioruby-alignment/blob/master/features/edit/del_bridges.feature
A tool for reordering the GFF3 file so that features are close to their parents/children and separated by ###
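Not the actual tool, just a rough sketch in D of one way this could work, assuming parent records appear before their children: group each record under the ID of its topmost ancestor and print every group followed by a ### line.

```d
import std.stdio;
import std.string : strip;
import std.array : split;
import std.algorithm.searching : startsWith;

void main(string[] args) {
    string[][string] groups;   // root feature ID -> record lines
    string[string] rootOf;     // feature ID -> root feature ID
    string[] order;            // root IDs in order of first appearance

    foreach (line; File(args[1]).byLineCopy()) {
        if (line.strip().length == 0 || line.startsWith("#"))
            continue;
        auto fields = line.split("\t");
        if (fields.length != 9)
            continue;
        string id, parent;
        foreach (attr; fields[8].split(";")) {
            auto kv = attr.split("=");
            if (kv.length != 2) continue;
            if (kv[0] == "ID") id = kv[1];
            if (kv[0] == "Parent") parent = kv[1];
        }
        // The root is the topmost ancestor seen so far; records without
        // ID or Parent form a group of their own.
        string root = (parent in rootOf) ? rootOf[parent]
                    : (parent.length ? parent : (id.length ? id : line));
        if (root !in groups) order ~= root;
        groups[root] ~= line;
        if (id.length) rootOf[id] = root;
    }
    foreach (root; order) {
        foreach (line; groups[root]) writeln(line);
        writeln("###");
    }
}
```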
Here is an excerpt from the next GFF3 spec which is in development:
"The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of Latin-1 or Unicode are recommended. "
Source: http://www.sequenceontology.org/resources/gff3_1.21.html
Therefore, this parser should be able to support the Latin-1, UTF-8, UTF-16 and UTF-32 encodings.
Probably best to use RSpec, in addition to Cucumber.
A Ruby module and function which would use the gff3-ffetch utility to quickly filter the records in a GFF3 file, and either return them as a string or write them to a file.
Support for pragma parsing and retrieving is missing.
The function is unreadable - I see attributes everywhere ;)
Add labels for:
- in progress
- in testing
- question asked
- today
So we have some more state on the issues :)
The plan for a parallel GFF3 parser accessible from Ruby and other languages has been postponed, so the new goal for this project is to create a range of tools for working with GFF3 files, and a gem for using them from within Ruby.
It is no longer just a high-performance library, so a new name would be appropriate.
This issue is for tracking proposals. "GFF3 Tools" seems to be taken, so the first one is:
gff3-par-tools
Currently all the parser does is split the file into lines. However, that is not very useful (not at all, actually).
Instead, the parser should take one line at a time and parse its contents into a GFF record, and make that available to Ruby. The record in Ruby should also resemble the interface of GFF record objects from other GFF parsers, as closely as possible.
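A minimal sketch of what one record line could turn into on the D side (the struct and function names here are placeholders, not the library's actual interface):

```d
import std.array : split;

struct Record {
    string seqname, source, feature, start, end, score, strand, phase;
    string[string] attributes;
}

Record parseLine(string line) {
    auto fields = line.split("\t");
    assert(fields.length == 9, "a GFF3 record line must have nine fields");
    Record r;
    r.seqname = fields[0];
    r.source  = fields[1];
    r.feature = fields[2];
    r.start   = fields[3];
    r.end     = fields[4];
    r.score   = fields[5];
    r.strand  = fields[6];
    r.phase   = fields[7];
    // Attributes come as key=value pairs separated by semicolons.
    foreach (attr; fields[8].split(";")) {
        auto kv = attr.split("=");
        if (kv.length == 2)
            r.attributes[kv[0]] = kv[1];
    }
    return r;
}
```

On the Ruby side the same fields would presumably be exposed as accessors, to stay close to what other GFF parsers offer.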
For now it will simply parse the file provided on the command line with the NO_VALIDATION option to gff3_file.open(). You can use the command "time" to get the running time.
A tool with good docs is a much better tool. What we need is:
I forgot about this. Need to check out how they affect the code. It's quite a basic requirement, so flagging it to be solved for the first release.
Currently the parser ignores the existence of this character sequence. According to the GFF3 spec, it's not illegal to have a zero character, like the one used to end a C string. As the interface is defined in terms of C constructs, it needs to be designed in a way that allows this.
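A sketch of the idea, with hypothetical names: pass an explicit length alongside the pointer across the C boundary, so an embedded zero byte cannot truncate the data.

```d
// Hypothetical slice type for the C interface; a '\0' inside the data is
// harmless because the length is carried explicitly.
extern (C) struct StringSlice {
    const(char)* ptr;
    size_t length;
}

StringSlice toSlice(string s) {
    return StringSlice(s.ptr, s.length);
}
```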
A validator utility written in D, which parses all records of a given GFF3 file and prints out the issues found by the parser. It can also be used to measure the speed of parsing.
Tests with the 1GB file from Wormbase show that this function is probably a problem when parsing a file with escaped characters. The file in question has multiple newline characters escaped in the attributes field of pretty much every record.
Instead of the expected ~30 seconds, the benchmark application ran for 4-5 minutes until I interrupted it with Ctrl-C.
The function should be replaced with a similar one with much better performance.
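One possible direction (a hedged sketch, not the project's current function): do a single pass with a fast path for values that contain no escapes at all, so the common case allocates nothing.

```d
import std.conv : to;

string unescape(string s) {
    // Fast path: most fields contain no % escapes at all.
    bool hasPercent = false;
    foreach (c; s)
        if (c == '%') { hasPercent = true; break; }
    if (!hasPercent)
        return s;

    char[] result;
    result.reserve(s.length);
    for (size_t i = 0; i < s.length; i++) {
        if (s[i] == '%' && i + 2 < s.length) {
            // Decode the two hex digits following '%'.
            result ~= cast(char) to!ubyte(s[i + 1 .. i + 3], 16);
            i += 2;
        } else {
            result ~= s[i];
        }
    }
    return cast(string) result;
}
```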
Create a module gff3-validation.d
The introduction of a util.d file is always worrisome :). It usually contains stuff which should be in properly named modules. Make a point of keeping this file really, really small.
The README should contain up-to-date instructions on the state of the project, and explicit commands for running the tools and/or tests.
At the moment all validation happens inline. We need to be able to turn that off. Ideally this is handled through a delegate, so as to prevent recurring if statements.
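A minimal sketch of that idea, with hypothetical names: the parser always calls a validation delegate, and turning validation off just means installing a no-op, so there is a single call site instead of scattered ifs.

```d
alias Validator = void delegate(string line);

struct Parser {
    Validator validate;

    this(bool validation) {
        validate = validation
            ? delegate(string line) { /* run the real per-line checks here */ }
            : delegate(string line) { /* no-op: validation is turned off */ };
    }

    void parseLine(string line) {
        validate(line);   // single call site, no recurring if statements
        // ... actual parsing of the line ...
    }
}
```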
FASTA support and retrieval should be added.
A new option, like --fasta-pass-through: after the last record a line with ##FASTA should be printed, followed by the FASTA data.
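A hedged sketch of that behaviour (the function name and exact flag handling are just illustrative): once the ##FASTA directive or a '>' header is reached, stop treating lines as records and copy the rest of the input to the output verbatim.

```d
import std.stdio;
import std.algorithm.searching : startsWith;

void processFile(File input, File output, bool fastaPassThrough) {
    bool inFasta = false;
    foreach (line; input.byLine()) {
        if (!inFasta && (line.startsWith("##FASTA") || line.startsWith(">"))) {
            inFasta = true;
            if (fastaPassThrough)
                output.writeln("##FASTA");
            if (line.startsWith("##FASTA"))
                continue;   // the directive itself was already printed
        }
        if (inFasta) {
            if (fastaPassThrough)
                output.writeln(line);
            continue;
        }
        // ... parse the line as a GFF3 record and write it out ...
        output.writeln(line);
    }
}
```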
The current implementation is already fast, but the built-in associative arrays might be faster; that should be measured.
How does the Fasta reader compare to the one in the dscience project by bioinfornatics? Maybe share?
Some types of applications, like validation tools, require access to data in pragmas, and maybe comments too. Currently the library skips these lines and there is no way for the library client to receive that data.
A good approach would probably be to add a few new record types, some is_x() functions for testing the type, and a flag to the parser to let it know whether the user is interested in pragmas and comments.
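One possible shape for this, with hypothetical names: a tagged line type plus is_*() helpers, and a parser flag deciding whether pragma and comment lines are delivered to the caller at all.

```d
enum LineType { record, pragma_, comment }   // pragma is a D keyword, hence pragma_

struct Line {
    LineType type;
    string text;

    bool is_record()  { return type == LineType.record; }
    bool is_pragma()  { return type == LineType.pragma_; }
    bool is_comment() { return type == LineType.comment; }
}
```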
RecordRange and LazySplitLines (soon to be named SplitIntoLines) show remarkable overlap in functionality. Is there a way to introduce a D template?
It takes too long to parse the c_angaria.current.annotations.gff3 file. An algorithm with better performance could be used instead of the most basic one.
Maybe first generate a list of all records, then employ some sorting algorithm for their hashes or something similar.
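To illustrate the suggestion (assuming the slow step is something like an all-pairs duplicate check, which is a guess): collecting the values first and sorting them brings the comparisons down from O(n^2) to O(n log n), because only adjacent elements need to be compared.

```d
import std.algorithm.sorting : sort;

string[] findDuplicates(string[] ids) {
    string[] dups;
    auto sorted = ids.dup;
    sorted.sort();
    foreach (i; 1 .. sorted.length) {
        // After sorting, equal values sit next to each other.
        if (sorted[i] == sorted[i - 1] &&
            (dups.length == 0 || dups[$ - 1] != sorted[i]))
            dups ~= sorted[i];
    }
    return dups;
}
```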
The record handler and testing should move into its own file. The file handler can be renamed to gff3-file.d.
While parsing records, the parser should ignore comments, which are lines that start with a single # character.
The spec says that the # character doesn't have to be escaped in any context other than the first field. From that we can conclude that comments at the end of a line are not allowed.
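In code form, the rule above amounts to something like this (a sketch, not the library's actual functions):

```d
import std.algorithm.searching : startsWith;

// "##" marks a pragma line, a single leading '#' marks a comment line;
// a '#' later in the line is ordinary data, never an end-of-line comment.
bool isPragma(string line)  { return line.startsWith("##"); }
bool isComment(string line) { return line.startsWith("#") && !isPragma(line); }
```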
First parallel parsing of GFF3 data in strings should be supported, and then the same for data in files.
The current API is strongly tied to the GFF3 record structure. This is too low-level, and may perhaps introduce deployment (architecture) issues and/or inefficiencies (think lazy parsing).
My proposal is to start from the use cases. What is it you want to harvest from a GFF3 file? Genes, CDS, ORFs, mRNA - so that is high level(!). The efficiency is not in raw parsing, it is actually in parsing + combining results. I predict the API itself can be really String based - i.e. it is not necessary to pass around struct types when it is not required.
It may be interesting to have a low-level API, but I don't expect it will help Ruby performance a fat lot. So, might as well focus on the final API. My bio-gff3 command line parser is used just for a few types of results. Getting that functionality first will give a lot of experience in what is required.
Maybe initially just create a command line tool that copies the functionality of bio-gff3? The API will follow.
Extract the Ruby library into a separate repository. Man pages are one reason why, but other scripting languages should be able to use the tools without Ruby too.
The parser should be able to detect badly formatted files (for example, a line with more or fewer than 9 fields, or characters which should be escaped but are not) and output warnings or throw errors.
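A minimal sketch, with hypothetical names, of the kind of per-line checks meant here, producing warning strings a caller can print or escalate to errors:

```d
import std.array : split;
import std.format : format;
import std.string : indexOf;

string[] validateLine(string line, size_t lineNumber) {
    string[] warnings;
    auto fields = line.split("\t");
    if (fields.length != 9)
        warnings ~= format("line %d: expected 9 fields, found %d",
                           lineNumber, fields.length);
    else if (fields[0].indexOf('>') != -1)
        warnings ~= format("line %d: '>' in the seqid column has to be escaped",
                           lineNumber);
    return warnings;
}
```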
A module/function which would let the user run the validation utility and get back messages describing the issues found.
Test data should be in ./test/data, so it can be shared between testing techniques.
It splits one line, right?
The most basic implementation would probably be to add write and writeln methods to Record class, and similar methods to the Feature class.
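A rough sketch of what that could look like on a hypothetical Record type (the real class and its fields may differ, and Feature would get equivalent methods):

```d
import std.array : join;
import std.stdio : File;

struct Record {
    string[9] fields;   // simplified shape: seqname .. attributes, already escaped

    string toLine() { return fields[].join("\t"); }

    void write(File f)   { f.write(toLine()); }
    void writeln(File f) { f.writeln(toLine()); }
}
```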
It seems open files should not be tracked in a global-style variable. Files should be tracked through the API.
This should be enough for the first release, with instructions and descriptions of requirements, which can be used for other platforms.
This gem should build the binaries while being installed, like current gems with external C libraries. The dependencies are dmd or gdc for D2, which must already be installed.
The FASTA handlers should be in fasta.d. Keep modules as pristine as possible!
When parsing the 233MB m_hapla testfile from Wormbase, the tool crashes with a segmentation fault after a number of parsed lines.