
Pipelines



About the project

REMEMBER, YOU HAVE TO USE JAVA VERSION 8

Pipelines for data processing and indexing of biodiversity data

Status: IN PRODUCTION

Vision: Consistent data processing pipelines (parsing, interpretation and quality flagging) for use in GBIF, the Living Atlases project and beyond. Built to scale from laptop to GBIF volumes. Deployable on the JVM, Spark and Google Cloud.

Architecture

The project provides vanilla JVM-based parsing and interpretation libraries, and pipelines for indexing into SOLR and ElasticSearch, built using Apache Beam.

Apache Beam provides a high-level abstraction framework that is ideal for this purpose, with the ability to deploy across target environments (e.g. Spark, local JVM) and with many built-in connectors (e.g. HBase, SOLR, ElasticSearch).
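
To make this concrete, the sketch below shows a minimal Beam pipeline in the style this project uses. It is not code from the project: the file paths are invented, and the package of the ExtendedRecord Avro model (mentioned under export-gbif-hbase below) is assumed. The same pipeline definition runs unchanged on a laptop or a cluster, with the runner chosen at launch.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.gbif.pipelines.io.avro.ExtendedRecord; // package assumed for illustration

public class MinimalPipelineSketch {
  public static void main(String[] args) {
    // The runner is selected from the command line, e.g. --runner=DirectRunner for a
    // laptop-sized dataset or --runner=SparkRunner to submit the same code to a cluster.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadVerbatim", AvroIO.read(ExtendedRecord.class).from("/data/verbatim/*.avro"))
        // interpretation and quality-flagging transforms would be chained here
        .apply("WriteRecords", AvroIO.write(ExtendedRecord.class).to("/data/out/verbatim"));

    p.run().waitUntilFinish();
  }
}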

Ingress

Ingress is from Darwin Core Archives (zip files of one or more delimited text files) or ABCD Archives (compressed XML) only [1]. During ingress, data is converted from its native format and stored as Avro files containing Darwin Core compliant data.

This is depicted below:

[Figure: Ingress]

Avro is chosen as the sole storage and interchange format in this project because a) it is splittable, with each split compressed independently, b) it holds the data schema with the data, c) it is well supported in the Hadoop ecosystem (e.g. Hive, Spark) and many other tools (e.g. Google Cloud), d) it is very robust in serialization and e) it reduces boilerplate code thanks to schema-to-code generators. Darwin Core Archives and JSON, for example, do not exhibit all of these traits.
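
As a small illustration of point b), the sketch below (not project code) opens an Avro data file with a generic reader; no external schema definition is needed because the schema is embedded in the file itself. The file name is an assumption.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ReadAvroSketch {
  public static void main(String[] args) throws Exception {
    // The schema travels inside the .avro file, so a GenericDatumReader needs no
    // separate schema definition to decode the records.
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
    try (DataFileReader<GenericRecord> reader =
        new DataFileReader<>(new File("verbatim.avro"), datumReader)) {
      System.out.println("Embedded schema: " + reader.getSchema());
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}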

[1] Other protocols (e.g. DiGIR, TAPIR) are supported by GBIF but are converted by crawling code upstream of this project.

Interpretation

During interpretation the verbatim data is parsed, normalised and tested against quality control procedures.

To enable extensibility, data is interpreted into separate Avro files, with one file per category of information. Many interpretations, such as date/time formatting, are common to all deployments, but not all are. For example, in the GBIF.org deployment the scientific identification is normalised to the GBIF backbone taxonomy and stored in /interpreted/taxonomy/interpreted*.avro, which might not be applicable to everyone. Separating categories keeps custom deployments reasonably modular.
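
A hedged sketch of this pattern follows. It is not the project's actual transform code: the interpretation logic is a trivial stand-in, and the Avro model classes (ExtendedRecord, TemporalRecord), their package and their id accessors are assumed from the models module described later. Other categories, such as the taxonomy example above, would be written the same way, each to its own directory.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.gbif.pipelines.io.avro.ExtendedRecord; // packages assumed for illustration
import org.gbif.pipelines.io.avro.TemporalRecord;

public class CategorySplitSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<ExtendedRecord> verbatim =
        p.apply("ReadVerbatim", AvroIO.read(ExtendedRecord.class).from("/data/verbatim/*.avro"));

    // One interpretation per category; the real logic lives in ingest-transforms.
    PCollection<TemporalRecord> temporal =
        verbatim
            .apply("InterpretTemporal", ParDo.of(new DoFn<ExtendedRecord, TemporalRecord>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // A real transform would parse eventDate etc.; only the id is carried over here.
                c.output(TemporalRecord.newBuilder().setId(c.element().getId()).build());
              }
            }))
            .setCoder(AvroCoder.of(TemporalRecord.class));

    // Each category is written to its own Avro file set.
    temporal.apply("WriteTemporal",
        AvroIO.write(TemporalRecord.class).to("/interpreted/temporal/interpreted"));

    p.run().waitUntilFinish();
  }
}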

Interpretation is depicted below:

[Figure: Interpretation]

Note that all pipelines are designed and tested to run with at least the DirectRunner and the SparkRunner. This allows the decision to be taken at runtime, e.g. a small dataset can be interpreted in the local JVM without using cluster resources for small tasks.

It is a design decision to keep the underlying parsers as reusable as possible for other projects, taking care not to bring in dependencies such as Beam or Hadoop.

Indexing

Initial implementations will be available for both SOLR and ElasticSearch to allow evaluation of both at GBIF. During indexing, the relevant categories of interpreted information are merged and loaded into the search indexes:
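
The sketch below is a hypothetical illustration of that merge step, not the project's indexing code: two interpreted categories, represented here as JSON fragments keyed by record id, are joined with CoGroupByKey, combined into one document per record and written with Beam's built-in Elasticsearch connector. The sample data, cluster address and index name are assumptions.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class IndexingSketch {

  private static final TupleTag<String> TEMPORAL = new TupleTag<>();
  private static final TupleTag<String> TAXONOMY = new TupleTag<>();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // In the real pipelines these would be read from the per-category Avro files;
    // JSON fragments keyed by record id stand in for the interpreted records.
    PCollection<KV<String, String>> temporal =
        p.apply("Temporal", Create.of(KV.of("rec-1", "\"eventDate\":\"1998-04-02\"")));
    PCollection<KV<String, String>> taxonomy =
        p.apply("Taxonomy", Create.of(KV.of("rec-1", "\"scientificName\":\"Puma concolor\"")));

    // Merge the categories per record id.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(TEMPORAL, temporal)
            .and(TAXONOMY, taxonomy)
            .apply("JoinCategories", CoGroupByKey.create());

    joined
        .apply("ToJsonDocument", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            StringBuilder doc = new StringBuilder("{\"id\":\"" + c.element().getKey() + "\"");
            for (String fragment : c.element().getValue().getAll(TEMPORAL)) {
              doc.append(",").append(fragment);
            }
            for (String fragment : c.element().getValue().getAll(TAXONOMY)) {
              doc.append(",").append(fragment);
            }
            c.output(doc.append("}").toString());
          }
        }))
        .apply("IndexIntoEs", ElasticsearchIO.write()
            .withConnectionConfiguration(ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://localhost:9200"}, "occurrence", "record")));

    p.run().waitUntilFinish();
  }
}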

[Figure: Indexing]

Note that GBIF targets an indexing speed of 10,000 records/sec per node (i.e. 100,000 records/sec on the current production cluster). This will allow simplified disaster recovery and rapid deployment of new features.

Structure

The project is structured as:

  • .buildSrc - Tools for building the project
  • docs - Documents related to the project
  • examples - Examples of using project API and base classes
    • transform - Demonstrates how to create an Apache Beam pipeline, create a new transformation and use it together with GBIF transforms and core classes
    • metrics - Demonstrates how to create and send Apache Beam SparkRunner metrics to ELK and use the result for Kibana dashboards
  • pipelines - Main pipelines module
    • beam-common - Classes and API for using with Apache Beam
    • common - Only static string variables
    • export-gbif-hbase - The pipeline to export the verbatim data from the GBIF HBase tables and save as ExtendedRecord avro files
    • ingest-gbif - Main GBIF pipelines for ingestion of biodiversity data
    • ingest-gbif-standalone - Independent GBIF pipelines for ingestion of biodiversity data
    • ingest-hdfs-table - Pipeline classes for conversion from interpreted formats into one common format for HDFS view creation
    • ingest-transforms - Transformations for ingestion of biodiversity data
  • sdks - Main module containing common classes, such as data models, data format interpretations, parsers, web service clients, etc.
    • core - Main API classes, such as data interpretations, converters, DwCA reader etc.
    • models - Data models represented in Avro binary format, generated from Avro schemas
    • parsers - Data parsers and converters, mainly for internal usage inside of interpretations
    • keygen - The library to generate GBIF identifiers; to support backward compatibility, the codebase was copied (with minimal changes) from the occurrence/occurrence-persistence project
  • tools - Module for different independent tools

How to build the project

The project is built with Apache Maven. It contains a Maven wrapper and a build script for Linux and macOS systems; you can simply run the build.sh script:

./build.sh

or

source build.sh

Please read the Apache Maven how-to.

Codestyle and tools recommendations

  • Use IntelliJ IDEA Community (or better)
  • Use Google Java Format (please do not reformat the old codebase, only new code)
  • The project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
  • Because the project uses Error-prone, you may have issues during the build process from IDEA. To avoid these issues, please install the Error-prone compiler integration plugin and build the project using the error-prone Java compiler to catch common Java mistakes at compile time. To use the compiler, go to File → Settings → Compiler → Java Compiler and select Javac with error-prone in the Use compiler box.
  • Add a custom parameter to avoid a debugging problem. To do so, go to File → Settings → Compiler → Java Compiler → Additional command line parameters and add -Xep:ParameterName:OFF
  • Tests: please follow the conventions of the Maven surefire plugin for unit tests and those of the Maven failsafe plugin for integration tests (a minimal naming sketch follows this list). To run the integration tests, just run the verify phase, e.g.: mvn clean verify
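
A minimal naming sketch for the test conventions above, assuming the plugins' default include patterns (surefire runs *Test classes during mvn test, failsafe runs *IT classes during mvn verify) and JUnit 4. The class and method names are examples only, and each class would live in its own file.

// src/test/java/.../TemporalParserTest.java — unit test, picked up by the Maven surefire plugin (mvn test)
import org.junit.Test;

public class TemporalParserTest {
  @Test
  public void parsesIsoDate() { /* ... */ }
}

// src/test/java/.../TemporalParserIT.java — integration test, picked up by the Maven failsafe plugin (mvn verify)
import org.junit.Test;

public class TemporalParserIT {
  @Test
  public void interpretsSampleArchive() { /* ... */ }
}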
