biovalidator's Introduction

InterMine

A powerful open source data warehouse system. InterMine allows users to integrate diverse data sources with a minimum of effort, providing powerful web-services and an elegant web-application with minimal configuration. InterMine powers some of the largest data-warehouses in the life sciences.

For the full list of InterMines, please see the registry

For details, please visit: InterMine Documentation

If you run an InterMine, or use one in your research, please let us know. Hearing from groups that use InterMine, or parts of InterMine, improves the chance of continued funding for the project.

Getting Started With InterMine

For a guide on getting started with InterMine, please visit: tutorial

3min bootstrap

As long as you have the prerequisites installed (Java, PostgreSQL), you can get a working data-warehouse and associated web-application by running an automated bootstrap script:

  # For the testmodel
  ./testmine/setup.sh

For a genomic application, with test data from Malaria, see BioTestMine

Docker

You can build InterMine using Docker. See https://github.com/intermine/docker-intermine-gradle

Copyright and Licence

Copyright (C) 2002-2022 FlyMine

See LICENSE file for licensing information.

This product includes software developed by the Apache Software Foundation

InterMine Development Roadmap

For more information about the upcoming releases, please visit the InterMine Development Roadmap.

Please cite

InterMine: extensive web services for modern biology.
Kalderimis A, Lyne R, Butano D, Contrino S, Lyne M, Heimbach J, Hu F, Smith R, Stěpán R, Sullivan J, Micklem G.
Nucleic Acids Res. 2014 Jul; 42 (Web Server issue): W468-72
doi pubmed

InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data.
Smith RN, Aleksic J, Butano D, Carr A, Contrino S, Hu F, Lyne M, Lyne R, Kalderimis A, Rutherford K, Stepan R, Sullivan J, Wakeling M, Watkins X, Micklem G.
Bioinformatics (2012) 28 (23): 3163-3165.
doi pubmed

See zotero for the full list of InterMine publications.

biovalidator's People

Contributors

deepakkumar96

Forkers

deepakkumar96

biovalidator's Issues

Performance benchmarking

How much time does this library take to parse a file?

What if the file is very large? What if there are a lot of errors?

Are there tricks you can do to speed up the analysis the library does? Do other libraries have tricks we can learn from?
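As a starting point for these questions, a rough timing harness could look like the sketch below. `ValidationTimer` is a hypothetical helper, not part of biovalidator, and the task passed in stands for whatever call validates a file. A single wall-clock measurement is noisy, so it only gives a first impression, not a real benchmark.

```java
// Hypothetical helper for rough timings; not part of biovalidator.
final class ValidationTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Running the same task on a small file, a very large file, and a file with many errors would give three numbers to compare.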

Filetype is not required -- better error message, or guess the filetype.

I get this error message when I don't provide a filetype:

java.lang.IllegalArgumentException: Invalid validator type
	at org.intermine.biovalidator.validator.ValidatorType.of(ValidatorType.java:43)
	at org.intermine.biovalidator.api.ValidatorHelper.validate(ValidatorHelper.java:70)

Instead can we do either (if filetype is NULL):

a. guess
b. give an error message "filetype required: FASTA or GFF3"
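Options (a) and (b) can be combined: guess first, and fall back to the message only when the guess fails. The sketch below assumes the guess is made from the file extension. `FileTypeResolver`, its methods, and the list of extensions are all hypothetical and not taken from biovalidator.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; none of these names exist in biovalidator.
final class FileTypeResolver {

    static final String MISSING_TYPE_MESSAGE = "filetype required: FASTA or GFF3";

    /** Guesses the type from the file extension. The extension list is an assumption. */
    static Optional<String> guessFromName(String fileName) {
        if (fileName == null) {
            return Optional.empty();
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fa") || lower.endsWith(".fasta")) {
            return Optional.of("FASTA");
        }
        if (lower.endsWith(".gff") || lower.endsWith(".gff3")) {
            return Optional.of("GFF3");
        }
        return Optional.empty();
    }

    /** Uses the explicit type when one is given, otherwise falls back to the guess. */
    static Optional<String> resolve(String fileType, String fileName) {
        if (fileType != null && !fileType.trim().isEmpty()) {
            return Optional.of(fileType.trim());
        }
        return guessFromName(fileName);
    }
}
```

When `resolve` returns an empty result, the caller would report `MISSING_TYPE_MESSAGE` instead of throwing.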

biovalidator version?

We use semantic versioning at InterMine:

https://semver.org/

Thoughts?

I am guessing you plan to update to 1.0 (or 1.0.0!), i.e. no SNAPSHOT, once we are happy with GFF3 and want to publish to Maven.

Suggestion - a single error message

This is me being lazy and not wanting to iterate! For discussion!!

This is my code snippet:

            List<Message> messages = validationResult.getErrorMessages();
            StringBuilder errorMessage = new StringBuilder();
            for (Message msg : messages) {
                errorMessage.append(msg.getMessage());
            }
            return errorMessage.toString();

I would like to do this instead:

            return validationResult.getErrorMessage();

For the strict setting we'll only have one error message anyway.
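To make the request concrete, here is a minimal sketch of what such a convenience method might do. `SimpleValidationResult` is a hypothetical stand-in for the real result class and stores plain strings rather than `Message` objects. Unlike my loop above, it puts a newline between messages, which is my own choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's result class; it stores plain strings.
final class SimpleValidationResult {

    private final List<String> errorMessages = new ArrayList<>();

    void addError(String message) {
        errorMessages.add(message);
    }

    /** Returns a copy so callers cannot modify the internal list. */
    List<String> getErrorMessages() {
        return new ArrayList<>(errorMessages);
    }

    /** Joins all messages with newlines; returns an empty string when there are none. */
    String getErrorMessage() {
        return String.join("\n", errorMessages);
    }
}
```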

Make error messages very verbose.

We want to give the user enough information to fix the problem.

This goes for all error messages but specifically file not found. Update this error message to include the path that was tried.
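For the file-not-found case, something along these lines would do. It is a sketch using only `java.nio.file`; `FileChecks` and the message wording are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; the message wording is only a suggestion.
final class FileChecks {

    /** Returns an error message naming the path that was tried, or empty if the file exists. */
    static Optional<String> missingFileMessage(String fileName) {
        Path resolved = Paths.get(fileName).toAbsolutePath().normalize();
        if (Files.isRegularFile(resolved)) {
            return Optional.empty();
        }
        return Optional.of("File not found: '" + fileName + "' (looked for '" + resolved + "')");
    }
}
```

Including the resolved absolute path shows the user which working directory a relative path was interpreted against.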

Only want error messages, don't want exceptions

Again, when I didn't provide the filetype, I got an illegal argument exception. Instead, I would like an informative error message "filetype required: FASTA or GFF3".

What are fatal exceptions for? They are to stop the process because there is simply no point in going on. Something very bad has happened. In our case, that doesn't really apply. Does that make sense? Something bad has happened (e.g. no filetype), but we don't need to kill the process, we only want to stop what we are doing and return the relevant information to the user.

Exceptions in Java also let you do custom actions for certain exceptions, e.g. close database connections, write to the log etc. We don't have those in our case, we always only ever want to return the error message to the user.

I think your error handling is correct; however, it makes using the library more difficult. Look at my code: I have to catch a bunch of exceptions, then check for NULL, then check the error messages. Very messy! It's not maintainable -- later on you may change the exceptions, people will forget what exceptions are thrown, etc. And I am finding it really hard to write my unit tests.

This is a specific library to do a specific job. These fatal errors don't help do that job. Because this library is part of a big application, your library should catch the errors, and return an informative error message to the user so they can act on them.
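Here is a rough sketch of the behaviour I have in mind. `SafeValidation` is hypothetical, and the `Callable` stands for whatever the library does internally to produce its list of error messages. Any exception thrown along the way becomes one more error message, so the caller only ever has to look at the list.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical wrapper; the Callable stands for the library's internal validation step.
final class SafeValidation {

    /** Runs the validation and converts any exception into a single error message. */
    static List<String> run(Callable<List<String>> validation) {
        try {
            return validation.call();
        } catch (Exception e) {
            // Fall back to the exception class name when there is no message.
            String detail = e.getMessage() != null ? e.getMessage() : e.getClass().getName();
            return Collections.singletonList(detail);
        }
    }
}
```

With this shape, an empty list means the file is valid and a non-empty list is what gets shown to the user.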

Thoughts?

GFF3: Should we be case-sensitive

So I checked both so.obo and so-simple.obo, and it turns out neither file has an exact match for 'DNaseI_hypersensitive_site' or 'recombination_region', but:
1. for type 'DNaseI_hypersensitive_site', 'so.obo' and 'so-simple.obo' do have 'DNAseI_hypersensitive_site'. Only the case of the third letter differs: it is a capital 'A'.
2. for type 'recombination_region', 'so.obo' and 'so-simple.obo' have 'mitotic_recombination_region', 'non_allelic_homologous_recombination_region', etc.

See intermine/intermine#1828 for details.

This problem has been fixed but there are older files that have the incorrect spelling.

@rachellyne since this is the NCBI only I think we should be case sensitive and this file should fail. What do you think?
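If we do stay case-sensitive, the file can still fail with a helpful hint. A minimal sketch, assuming the SO term names have already been loaded into a set: `SoTermChecker` is hypothetical, and the two terms in the usage note are just the ones quoted above.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical checker; the caller supplies the term names, e.g. loaded from so.obo.
final class SoTermChecker {

    private final Set<String> terms;

    SoTermChecker(Set<String> terms) {
        this.terms = new HashSet<>(terms);
    }

    /** Strict check: the type must match a term exactly, including case. */
    boolean isValid(String type) {
        return terms.contains(type);
    }

    /** Returns a term that differs from the type only by case, if there is one. */
    Optional<String> suggest(String type) {
        for (String term : terms) {
            if (term.equalsIgnoreCase(type)) {
                return Optional.of(term);
            }
        }
        return Optional.empty();
    }
}
```

With 'DNAseI_hypersensitive_site' and 'mitotic_recombination_region' loaded, 'DNaseI_hypersensitive_site' fails the strict check but gets a "did you mean" suggestion, while 'recombination_region' fails with no suggestion.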

File format - need a standard

In the app (and in biovalidator) we use enums to represent the file formats.

Here is the spec: openapi.json

This is the Java representation of that: DataFile.java

This is fragile. I guess I see this as a place for future bugs! Right now we are okay with just the handful of file types we have. But in the future?

Ideally we would have a standard that we would use. I don't think there is any "good" answer. But we can mitigate the danger by:

  • putting a comment in configurator as a reminder to check the valid types in biovalidator
  • having biovalidator be very permissive with the file type names, e.g. case insensitive, accept "GFF" and "GFF3"
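The permissive lookup could be as simple as an alias table. This is only a sketch: `FileFormat` and `FileFormatNames` are hypothetical names, not the enums in the app or in biovalidator, and the alias list would need to be agreed on.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical enum; the real enums live in the app and in biovalidator.
enum FileFormat { FASTA, GFF3 }

final class FileFormatNames {

    // Lower-case alias -> format. The alias list is an assumption.
    private static final Map<String, FileFormat> ALIASES = new HashMap<>();

    static {
        ALIASES.put("fasta", FileFormat.FASTA);
        ALIASES.put("gff", FileFormat.GFF3);
        ALIASES.put("gff3", FileFormat.GFF3);
    }

    /** Case-insensitive lookup that ignores surrounding whitespace. */
    static Optional<FileFormat> parse(String name) {
        if (name == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(ALIASES.get(name.trim().toLowerCase(Locale.ROOT)));
    }
}
```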

Error message suppression strategy?

(Just because I am thinking about error messages, and I have NOT tested this.)

What happens when the GFF file has the wrong number of columns? Are we going to get an error message for each bad row? It could be that we have some validation rules that are only enforced once, e.g. check the number of columns in each row but, after the first failure, don't check again.

Ignore if you've handled this!! :)
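For illustration only, a "report once" rule might look like the sketch below. `ColumnCountRule` is hypothetical. It relies on GFF3 rows having nine tab-separated columns, treats lines starting with '#' as comments or directives, and ignores the optional embedded FASTA section. Rather than stop checking after the first failure, it keeps counting so the user still learns how many rows are affected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rule: reports the first bad row in full, then only a count of the rest.
final class ColumnCountRule {

    private static final int GFF3_COLUMNS = 9;

    static List<String> check(List<String> lines) {
        List<String> errors = new ArrayList<>();
        int badRows = 0;
        int lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines, comments and directives
            }
            int columns = line.split("\t", -1).length;
            if (columns != GFF3_COLUMNS) {
                badRows++;
                if (badRows == 1) {
                    errors.add("Line " + lineNumber + ": expected " + GFF3_COLUMNS
                            + " tab-separated columns but found " + columns);
                }
            }
        }
        if (badRows > 1) {
            errors.add((badRows - 1) + " more row(s) have the wrong number of columns");
        }
        return errors;
    }
}
```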
