narwhal-processor's Introduction

narwhal-processor

narwhal-processor is a processing library that normalizes data of a known type. Current and proposed data types that can be normalized include date, country name, continent, state and province, coordinates and numeric range (altitude, depth).

Comments, contributions, reviews and help are welcomed.

Note

This library is still under active development. Some parts may change based on the reviews, comments and usage. Do not hesitate to enter an Issue if you have any problems or questions.

Goal

The goal of this library is to provide a set of processing functions through a common Java interface that supports JavaBeans. This will ease the integration of the library in various biodiversity projects by providing a uniform way to access processing functions.

Scope

The narwhal-processor is meant to be used as a low-level processing library with few secondary or contextual validations. For example, a date such as 1999-01-16 is parsed (if successful) into day (16), month (01) and year (1999). However, if this date represents the date of collection, determining whether Jan 16, 1999 is biologically valid is out of scope. The narwhal-processor only produces results from data that can be interpreted without uncertainty.
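
To illustrate this scope, here is a minimal sketch that splits an ISO 8601 date into its parts using the standard java.time API. It is not the narwhal-processor API: the class IsoDateParts and its parse method are hypothetical names used for illustration only, and the actual processors may accept other formats.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of narwhal-processor. */
final class IsoDateParts {
    final int year;
    final int month;
    final int day;

    private IsoDateParts(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /**
     * Parses a strict ISO 8601 calendar date (yyyy-MM-dd).
     * Returns an empty Optional when the input is null or cannot be parsed,
     * mirroring the "output only if successful" behaviour described above.
     */
    static Optional<IsoDateParts> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            // LocalDate.parse uses ISO_LOCAL_DATE, which rejects impossible dates.
            LocalDate date = LocalDate.parse(raw.trim());
            return Optional.of(new IsoDateParts(
                    date.getYear(), date.getMonthValue(), date.getDayOfMonth()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

With this sketch, IsoDateParts.parse("1999-01-16") yields year 1999, month 1 and day 16, while an impossible date such as 1999-02-30 yields an empty result. Whether the date is plausible as a collection date is deliberately not checked.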

Documentation and Usage

See our wiki for all the information.

Dependencies

Optional

Tested with Maven 3

Build

mvn clean package

Tests

Unit tests

mvn clean test

Setup in Eclipse

After a git clone

mvn eclipse:eclipse

In Eclipse: File > Import > Existing Projects into Workspace

You may need to add the Maven repository to Eclipse's Build Path via Preferences > Java > Build Path > Classpath Variables: click the New button, enter the name M2_REPO and select the repository directory. On a Mac, this is usually /Users/<User>/.m2/repository.

Contributors

  • Daniel Amariles
  • Peter Desmet
  • Oscar Fonts (NTv2 transformations with GeoTools)

Narwhal Mythology

From Wikipedia: Some medieval Europeans believed narwhal tusks to be the horns of the legendary unicorn. These horns were considered to have magic powers, such as the ability to cure poison and melancholia.

narwhal-processor's People

Contributors

cgendreau, dshorthouse, jegoi, peterdesmet, tigreped

narwhal-processor's Issues

Processor should be case, accent and hyphen agnostic

Looking at this dictionary file, it seems the processor is case-insensitive, but not accent (¨ ^ ´ etc.) or hyphen (- –) agnostic.

I think it should be, otherwise a lot of time will be invested in creating dictionaries that can handle all cases, while the processor can do this easily.

This is sensible:

REGION DE BRUXELLES CAPITALE    BRU
REGION DE BRUXELLES CAPITAL     BRU
REGION BRUXELLES CAPITALE       BRU
REGION BRUXELLES CAPITAL        BRU
BRUXELLES                       BRU
BRUXELLE                        BRU

This is not:

RÉGION DE BRUXELLES-CAPITALE    BRU
REGION DE BRUXELLES-CAPITALE    BRU
RÉGION DE BRUXELLES CAPITALE    BRU
REGION DE BRUXELLES CAPITALE    BRU
RÉGION BRUXELLES-CAPITALE       BRU
REGION BRUXELLES-CAPITALE       BRU
RÉGION BRUXELLES CAPITALE       BRU
REGION BRUXELLES CAPITALE       BRU
etc.

If the narwhal can already deal with this: great! But then we should update the documentation and only use uppercase, no accents, and no hyphens in our dictionaries.
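
A minimal sketch of the kind of normalization proposed here, applied to a value before the dictionary lookup. It assumes dictionary keys are stored in uppercase, without accents and with spaces instead of hyphens. The class DictionaryKey and its normalize method are hypothetical names and do not describe how the narwhal-processor currently works; only standard JDK classes (java.text.Normalizer, regular expressions) are used.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of narwhal-processor. */
final class DictionaryKey {

    private DictionaryKey() {
    }

    /** Returns an uppercase, accent-free, hyphen-free version of the input. */
    static String normalize(String raw) {
        // Decompose accented characters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category M), leaving the base letters.
        String withoutAccents = decomposed.replaceAll("\\p{M}+", "");
        // Replace any dash punctuation (hyphen, en dash, ...) with a space.
        String withoutHyphens = withoutAccents.replaceAll("\\p{Pd}+", " ");
        // Collapse runs of whitespace and switch to locale-independent uppercase.
        return withoutHyphens.trim()
                .replaceAll("\\s+", " ")
                .toUpperCase(Locale.ROOT);
    }
}
```

With such a step in place, all eight variants in the second list above collapse to either REGION DE BRUXELLES CAPITALE or REGION BRUXELLES CAPITALE, so the dictionary only needs the entries from the first list.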

Which Maven repository?

In which Maven repository are you publishing the narwhal artifact, please?
Or perhaps people just check it out and install it locally?

If you are in need of a repository, Canadensys is welcome to use http://repository.gbif.org/ for its artifacts.

Coordinate conversion Geodetic Datum

Hi, I'm using your API for coordinate conversion and couldn't find more information about the geodetic datum used by this tool. It would be nice if you could add a field with this information. I guess it is WGS84 or EPSG:4326.

Thank you for your answer

Populate internal country database using semantic web

It would be interesting to expand the narwhal so that it can build an up-to-date and well-maintained knowledge base of country names, their alternative representations (possibly multilingual) and mappings to known misspellings using linked open data (semantic Web).

This could be done using a semantic Web URI.
Something like : http://dbpedia.org/page/Category:Member_states_of_the_United_Nations

A country could then be identified with a URI such as http://dbpedia.org/resource/Canada
The name of a country in different languages could be populated using "owl:sameAs".
The known misspellings could be handled using SKOS.

For performance reasons, we'd like this thesaurus to be embedded in the library, but with the capacity to be periodically refreshed with data pulled from external resources (as is currently the case through the gbif-parser).

Benefits:

  • Different labels are available for the same concept in different languages
    (see rdfs:label at http://dbpedia.org/page/Canada).
  • Recognize a country name in a different language versus a typo, so that
    country names in other languages are not reported as errors.
  • Information about where the country is, without any geospatial query
    (e.g. continent, hemisphere).
  • Opens the door for validation using the date (think Russia, USSR).
  • Uses semantic web standards, allowing biodiversity applications to benefit from it in the near future.
  • The same concept can be expanded to states, provinces and municipalities.

processBean() function from AbstractDataProcessor should be deprecated

After testing some usage scenarios for this library, I came to the conclusion that processBean() functions should be delegated to the user of the library and should not be included in the core library. They make intermediate steps less efficient (e.g. cleaning a string before processing).

Add support for dubious month/day dates

11-12-2012 could be a little-endian date (11 December) or a middle-endian date (November 12) and is thus dubious.

Currently, the parser returns nothing for these dates, but the year can be unambiguously determined: 2012.
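
A minimal sketch of the requested behaviour, assuming input of the form nn-nn-yyyy. The class PartialDate is a hypothetical name and not part of the current parser. Day and month are only returned when exactly one reading is possible (or when both readings agree), and the year is returned in every case that matches the pattern. The sketch does not check that the day exists in the given month.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for a possibly partial date; not part of narwhal-processor. */
final class PartialDate {
    // Two fields of one or two digits followed by a four-digit year.
    private static final Pattern NUMERIC_DATE =
            Pattern.compile("(\\d{1,2})-(\\d{1,2})-(\\d{4})");

    final int year;
    final Integer month; // null when day and month cannot be told apart
    final Integer day;   // null when day and month cannot be told apart

    private PartialDate(int year, Integer month, Integer day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    static Optional<PartialDate> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher m = NUMERIC_DATE.matcher(raw.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        int first = Integer.parseInt(m.group(1));
        int second = Integer.parseInt(m.group(2));
        int year = Integer.parseInt(m.group(3));

        boolean littleEndian = isDay(first) && isMonth(second); // dd-mm-yyyy
        boolean middleEndian = isMonth(first) && isDay(second); // mm-dd-yyyy

        if (littleEndian && middleEndian) {
            // Both readings are possible; they only agree when the fields are equal.
            return Optional.of(first == second
                    ? new PartialDate(year, first, first)
                    : new PartialDate(year, null, null));
        }
        if (littleEndian) {
            return Optional.of(new PartialDate(year, second, first));
        }
        if (middleEndian) {
            return Optional.of(new PartialDate(year, first, second));
        }
        return Optional.empty(); // neither reading yields a valid day and month
    }

    private static boolean isMonth(int value) {
        return value >= 1 && value <= 12;
    }

    private static boolean isDay(int value) {
        return value >= 1 && value <= 31;
    }
}
```

For 11-12-2012 this returns the year 2012 with month and day left null, whereas 25-12-2012 resolves to 25 December 2012 because 25 cannot be a month.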

Add bounds to ddmmss values

Currently, the narwhal-processor returns results for coordinates like
45° 332' 255" N, 129° 410' 311" W.

This is incorrect: mm and ss should be >= 0 and < 60.
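
A minimal sketch of the proposed check, assuming degrees, minutes and seconds have already been extracted from the string. The class DmsValidator and its methods are hypothetical names, not the existing narwhal-processor code. The hemisphere sign and the bounds on degrees are left out on purpose, since this issue is only about minutes and seconds.

```java
/** Hypothetical helper for illustration; not part of narwhal-processor. */
final class DmsValidator {

    private DmsValidator() {
    }

    /** True when a minutes or seconds value lies in the range [0, 60). */
    static boolean inSexagesimalRange(double value) {
        return value >= 0 && value < 60;
    }

    /**
     * Converts degrees, minutes and seconds to unsigned decimal degrees.
     * Throws IllegalArgumentException when minutes or seconds are out of range.
     */
    static double toDecimalDegrees(int degrees, int minutes, double seconds) {
        if (!inSexagesimalRange(minutes) || !inSexagesimalRange(seconds)) {
            throw new IllegalArgumentException(
                    "minutes and seconds must be >= 0 and < 60, got "
                            + minutes + "' " + seconds + "\"");
        }
        return degrees + minutes / 60.0 + seconds / 3600.0;
    }
}
```

With this check, the values 45° 332' 255" from the example above are rejected, while 45° 30' 0" converts to 45.5.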
