Multispecies Whale Detection

This repository contains tools Google implemented for the development of a neural network to detect whale vocalizations and classify them across several species and geographies. We release the code in the hope that it will allow other groups to more easily develop species detection models on their own data.

This is not an officially supported Google product. Support and/or new releases may be limited.

Example Generator (examplegen)

A Beam pipeline for creating trainer input from a collection of labeled audio files.

Motivation

Acoustic monitoring datasets are normally stored as a collection of audio files on a filesystem. The duration of the files varies across and within deployments but is often much longer than the ideal length of individual training examples. Sample rate, audio encoding, and number of channels also vary, while the trainer will require a single input format for audio and annotations. Annotations, which are the source of labeled data for training, are usually stored as CSV, but the naming and meaning of columns also varies.

Features

The examplegen Beam pipeline reads audio files and CSV label and metadata files from a filesystem and writes TFRecord files to be consumed by the training job. In the process, it:

  • handles different audio formats
  • splits multi-channel files and resamples to a common sample rate
  • chunks large files into clips short enough to not slow down training
  • joins labels and metadata to the audio, and represents them as features in the output TensorFlow Examples
  • adjusts label start times to be relative to the start of the audio clip in each example
  • serializes the joined records in tensorflow.Example format
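
For orientation, each output record is a serialized tensorflow.Example that bundles one audio clip with its clip-relative annotations. The sketch below builds such a record by hand; the feature names and encodings are illustrative assumptions, not the exact schema examplegen writes.

    import tensorflow as tf

    # Hypothetical feature names; the real examplegen schema may differ.
    example = tf.train.Example(features=tf.train.Features(feature={
        'audio_raw': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b'...PCM bytes for one clip...'])),
        'sample_rate': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[16000])),
        'annotation_label': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b'humpback'])),
        # Annotation times already adjusted to be relative to the clip start.
        'annotation_begin_seconds': tf.train.Feature(
            float_list=tf.train.FloatList(value=[1.25])),
        'annotation_end_seconds': tf.train.Feature(
            float_list=tf.train.FloatList(value=[3.75])),
    }))
    serialized = example.SerializeToString()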

Usage

Local:

  1. Run python3 run_examplegen_local.py

Google Cloud Dataflow:

  1. Edit run_examplegen.py, setting the project and bucket paths for your own Cloud project or switching to a different Beam runner (a configuration sketch follows this list).
  2. In your Cloud project, create a service account and grant it IAM roles for "Dataflow Admin" and "Storage Object Admin" on the bucket paths you configured in run_examplegen.py.
  3. Generate a service account key, download the JSON file, and rename it to service-account.json.
  4. Run GOOGLE_APPLICATION_CREDENTIALS=service-account.json python3 run_examplegen.py.
  5. Monitor the job and, when it completes, check the output in the output_directory you configured.
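
Step 1 refers to pipeline configuration that lives in run_examplegen.py. As a hedged sketch only (the actual option names and structure in that script may differ), the Dataflow settings typically look something like:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed placeholder values; substitute your own project, region, and bucket.
    options = PipelineOptions(
        runner='DataflowRunner',   # or 'DirectRunner' to run locally
        project='YOUR_PROJECT_ID',
        region='us-central1',
        temp_location='gs://YOUR_BUCKET/temp',
    )
    # Directories the pipeline reads from and writes to, as referenced in this README.
    input_directory = 'gs://YOUR_BUCKET/examplegen_run/input'
    output_directory = 'gs://YOUR_BUCKET/examplegen_run/output'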

Audio TFRecord inspector (dataset.print)

A debugging tool for inspecting the output of the examplegen pipeline.

Usage:

python3 -m multispecies_whale_detection.scripts.print_dataset --tfrecord_filepattern={output_directory}/tfrecords-*

This prints to the console a human-readable representation of the first few records of an audio TFRecord dataset specified by the given file pattern. It is mostly intended to be used by developers of this project, but it can be handy to verify that a run of examplegen produced the expected form of output.
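
If you prefer to inspect records from your own code, the sketch below does roughly what the inspector does: it reads the first few serialized records matching the pattern and prints their parsed contents. It assumes uncompressed TFRecord files (pass compression_type to TFRecordDataset otherwise) and makes no assumptions about feature names, since the whole proto is printed.

    import tensorflow as tf

    # Substitute the output_directory you configured for examplegen.
    filepattern = 'OUTPUT_DIRECTORY/tfrecords-*'
    dataset = tf.data.TFRecordDataset(tf.io.gfile.glob(filepattern))
    for serialized in dataset.take(3):
        example = tf.train.Example()
        example.ParseFromString(serialized.numpy())
        print(example)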

Dataflow Python versions and virtualenv

(documentation sketch) Python versions used on local machines tend to get ahead of the versions supported in Dataflow. To avoid unforeseen issues, it is best to launch the Dataflow job from a local machine with a Dataflow-supported version. virtualenv is a good way to do this alongside any other Python version that may be installed system-wide.

  1. Determine the latest Python version supported by Dataflow.
  2. Download and install that Python version.
  3. Create and activate the virtual environment.
    python3 -m venv ~/tmp/whale_env
    . ~/tmp/whale_env/bin/activate
    pip install --upgrade pip
    
  4. Install this package in the virtual environment.
    cd multispecies_whale_detection  # (directory containing setup.py)
    pip install .
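
As an optional sanity check (not part of the documented steps), you can confirm from inside the activated environment which Python and Apache Beam versions a Dataflow launch would use:

    import sys
    import apache_beam

    # The interpreter version printed here is the one Dataflow workers must
    # support; apache_beam should import from the virtualenv rather than from
    # any system-wide installation.
    print('Python', sys.version.split()[0])
    print('Apache Beam', apache_beam.__version__)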
    

multispecies-whale-detection's People

Contributors

dstevens75, ilmikko, matt-har-vey

multispecies-whale-detection's Issues

Flesh out examplegen usage documentation

User testing ran into a pitfall:

It is not obvious from the docs that examplegen reads all audio files and CSV files within a whole directory tree (recursively) and converts them all to examples.

Running without realizing this can cause problems downstream, such as overlap between two runs that were intended to be separate train and validation sets, or duplication of labeled examples when, for example, multiple versions of a label CSV are present in the same directory tree.

The documentation should elaborate on a suggested directory structure:

examplegen_train_run/
  input/
  output/

examplegen_validation_run/
  input/
  output/

and make clear that unlabeled sections of audio are treated as implicit negatives.

(Depends on issue #14, since the documentation should also explain the treatment of absolute and relative paths in the labels CSV, as detailed in #14.)
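
A hypothetical sketch of how two such runs might be configured so that each examplegen invocation only sees its own directory tree (names and structure are illustrative, not the script's actual configuration):

    # Keep train and validation inputs in disjoint trees so that neither run
    # picks up the other's audio or label CSV files.
    train_run = dict(
        input_directory='gs://YOUR_BUCKET/examplegen_train_run/input',
        output_directory='gs://YOUR_BUCKET/examplegen_train_run/output',
    )
    validation_run = dict(
        input_directory='gs://YOUR_BUCKET/examplegen_validation_run/input',
        output_directory='gs://YOUR_BUCKET/examplegen_validation_run/output',
    )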

examplegen support for relative paths

Moving the input_directory for examplegen is often convenient, but the current examplegen requires that all paths in the labels CSV files be absolute, so those paths must be rewritten whenever the input directory is moved.

The absolute paths were intended to allow referencing files from multiple cloud storage buckets (still a valid use case but a less common one).

For this issue:

  • Change the default interpretation of the filename field in labels CSV to be relative to input_directory
  • Continue to support absolute paths (almost os.path.isabs, but don't forget gs:// and maybe other schemes); see the sketch after this list
  • Ensure the filename "for reference" field in the tfrecords is an exact copy of the filename field from the label CSV
  • Test on DirectRunner (Linux and Windows) as well as on Dataflow, with a mix of absolute and relative paths on each platform
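
A minimal sketch of the intended resolution logic, assuming a hypothetical helper name (the actual implementation may differ):

    import os
    import urllib.parse


    def resolve_audio_path(input_directory: str, filename: str) -> str:
        """Resolves a labels-CSV filename field against input_directory."""
        # Absolute local paths and URLs with a scheme (e.g. gs://bucket/file.wav)
        # pass through unchanged, preserving the multi-bucket use case.
        if os.path.isabs(filename) or urllib.parse.urlparse(filename).scheme:
            return filename
        # Everything else is interpreted as relative to input_directory.
        return os.path.join(input_directory, filename)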
