crunch-lib
This repository contains reusable high-level components for common use cases in processing data with Apache Crunch.
If you want to try it, it's in the central Maven repo, so you can use this snippet (or the equivalent for Gradle/sbt/...):
<dependency>
  <groupId>com.spotify.crunch</groupId>
  <artifactId>crunch-lib</artifactId>
  <version>0.0.5</version>
</dependency>
AvroCollections
extract
pulls out individual fields from a PCollection of Avro records by their field names, without the need for trivial MapFns
keyByAvroField
keys a PCollection of Avro records by a specific field using its name, without the need for trivial MapFns
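To make the two operations concrete, here is a minimal plain-Java sketch of their semantics. It does not use crunch-lib or Avro: a `Map<String, Object>` stands in for an Avro record, a `List` stands in for a PCollection, and the class and method names (`FieldByNameSketch`, `record`, `sample`, `keyByField`) are hypothetical, chosen only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: this is NOT the crunch-lib API.
final class FieldByNameSketch {

    // Stand-in for an Avro record: field name -> field value.
    static Map<String, Object> record(String user, String country) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("user", user);
        r.put("country", country);
        return r;
    }

    // Two hypothetical sample records.
    static List<Map<String, Object>> sample() {
        List<Map<String, Object>> records = new ArrayList<>();
        records.add(record("alice", "SE"));
        records.add(record("bob", "US"));
        return records;
    }

    // "extract": pull one field out of every record by its name.
    static List<Object> extract(List<Map<String, Object>> records, String fieldName) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> r : records) {
            out.add(r.get(fieldName));
        }
        return out;
    }

    // "keyByAvroField": pair every record with the value of the named field.
    static List<Map.Entry<Object, Map<String, Object>>> keyByField(
            List<Map<String, Object>> records, String fieldName) {
        List<Map.Entry<Object, Map<String, Object>>> out = new ArrayList<>();
        for (Map<String, Object> r : records) {
            out.add(new SimpleEntry<>(r.get(fieldName), r));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), "country"));
        System.out.println(keyByField(sample(), "country"));
    }
}
```

In a real pipeline the equivalent hand-written code would be a MapFn per field; the point of the helpers is to avoid writing those.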
SPTables
swapKeyValue
swaps the key and the value parts of a PTable
negateCounts
negates the value part of a long-valued table to facilitate easy sort-descending
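The reason negation helps with descending sorts can be sketched in plain Java: if only an ascending sort is available, negating the counts and sorting ascending yields the original counts in descending order. The sketch below is not the crunch-lib API; a `List` of entries stands in for a PTable, and all names in it are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Illustration only: this is NOT the crunch-lib API.
final class NegateCountsSketch {

    // Swap the key and value of every pair.
    static <K, V> List<Entry<V, K>> swapKeyValue(List<Entry<K, V>> table) {
        List<Entry<V, K>> out = new ArrayList<>();
        for (Entry<K, V> e : table) {
            out.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        return out;
    }

    // Negate the long value of every pair.
    static <K> List<Entry<K, Long>> negateCounts(List<Entry<K, Long>> table) {
        List<Entry<K, Long>> out = new ArrayList<>();
        for (Entry<K, Long> e : table) {
            out.add(new SimpleEntry<>(e.getKey(), Long.valueOf(-e.getValue())));
        }
        return out;
    }

    // Order keys by descending count using only an ascending sort:
    // negate the counts, make the count the key, then sort ascending by key.
    static List<String> keysByDescendingCount(String[] keys, long[] counts) {
        List<Entry<String, Long>> table = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            table.add(new SimpleEntry<>(keys[i], Long.valueOf(counts[i])));
        }
        List<Entry<Long, String>> byNegatedCount = swapKeyValue(negateCounts(table));
        byNegatedCount.sort(Entry.comparingByKey());
        List<String> ordered = new ArrayList<>();
        for (Entry<Long, String> e : byNegatedCount) {
            ordered.add(e.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(keysByDescendingCount(
                new String[] {"a", "b", "c"}, new long[] {2, 5, 1}));
    }
}
```

With counts a=2, b=5, c=1 the keys come out as b, a, c.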
TopLists
topNYbyX
Creates a top-list of elements in the provided PTable, categorised by the key of the input table and using the count of the value part of the input table.
globalTopList
Creates a list of unique items in the input collection with their count, sorted descending by their frequency.
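As a rough illustration of what a per-key top-list computes, the plain-Java sketch below counts how often each value Y occurs under each key X and keeps the N most frequent values per key. It is not the crunch-lib API; the return shape, the tie-breaking rule (alphabetical), and all names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: this is NOT the crunch-lib API.
final class TopListSketch {

    // Each input row is {x, y}. Returns, per x, the n most frequent y values
    // formatted as "y=count", most frequent first.
    static Map<String, List<String>> topNYbyX(String[][] pairs, int n) {
        // counts.get(x).get(y) = number of times y was seen under x
        Map<String, Map<String, Long>> counts = new TreeMap<>();
        for (String[] p : pairs) {
            counts.computeIfAbsent(p[0], k -> new TreeMap<>()).merge(p[1], 1L, Long::sum);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> e : counts.entrySet()) {
            List<Map.Entry<String, Long>> ys = new ArrayList<>(e.getValue().entrySet());
            // Descending by count; the sort is stable, so ties stay alphabetical.
            ys.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
            List<String> top = new ArrayList<>();
            for (Map.Entry<String, Long> y : ys.subList(0, Math.min(n, ys.size()))) {
                top.add(y.getKey() + "=" + y.getValue());
            }
            result.put(e.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] plays = {
            {"se", "abba"}, {"se", "abba"}, {"se", "roxette"},
            {"se", "ace"}, {"se", "ace"}, {"se", "ace"},
            {"us", "prince"}
        };
        System.out.println(topNYbyX(plays, 2));
    }
}
```

A global top-list is the same idea with a single implicit key: count every unique item, then sort by count descending.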
Averages
meanValue
Calculates the mean value for each key in the provided numerically-valued PTable.
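A per-key mean can be computed by accumulating a (sum, count) pair per key and dividing at the end; such pairs can be merged in any order, which is what makes the computation easy to distribute. The plain-Java sketch below shows that idea only. It is not the crunch-lib implementation, and the class and method signature are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: this is NOT the crunch-lib API.
final class MeanValueSketch {

    // keys[i] is the key of values[i]; returns the mean of the values per key.
    static Map<String, Double> meanValue(String[] keys, double[] values) {
        // Per key: acc[0] = running sum, acc[1] = running count.
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            double[] acc = sumAndCount.computeIfAbsent(keys[i], k -> new double[2]);
            acc[0] += values[i];
            acc[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        System.out.println(meanValue(
                new String[] {"a", "a", "b"}, new double[] {1.0, 3.0, 10.0}));
    }
}
```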
Percentiles
distributed / inMemory
Calculates a set of percentiles for each key in the provided numerically-valued PTable.
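The sketch below illustrates per-key percentiles in plain Java by grouping the values per key, sorting each group in memory, and reading off the requested ranks. It assumes the nearest-rank percentile definition, which may differ from the definition crunch-lib uses, and it says nothing about how the distributed variant works. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: this is NOT the crunch-lib API.
final class PercentilesSketch {

    // Nearest-rank percentile (an assumed definition) of a non-empty,
    // ascending-sorted list; p is a fraction in (0, 1].
    static double nearestRank(List<Double> sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    // keys[i] is the key of values[i]; returns the requested percentiles per key,
    // in the same order as ps.
    static Map<String, List<Double>> percentiles(String[] keys, double[] values, double... ps) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        Map<String, List<Double>> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            List<Double> sorted = new ArrayList<>(e.getValue());
            Collections.sort(sorted);
            List<Double> out = new ArrayList<>();
            for (double p : ps) {
                out.add(nearestRank(sorted, p));
            }
            result.put(e.getKey(), out);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(percentiles(
                new String[] {"a", "a", "a", "a", "b"},
                new double[] {4, 1, 3, 2, 10},
                0.5, 0.9));
    }
}
```

Sorting every group in memory only works when each key's values fit in memory, which is presumably the trade-off the two variants expose.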
DoFns
detach
wraps a DoFn operating as a reducer so that each value given by the Iterable is already detached (preventing object reuse problems)
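The object reuse problem can be reproduced in plain Java without Crunch or Hadoop. In the sketch below, an Iterable hands out the same mutable instance for every element, which is roughly what a reducer's value Iterable may do; holding on to the references then yields the last value repeated. Wrapping the Iterable so that each element is copied first avoids this. The `Box` type, `reusing`, `detached`, and `collect` are hypothetical names for this illustration and are not the crunch-lib API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustration only: this is NOT the crunch-lib API.
final class DetachSketch {

    // A mutable value holder, standing in for a reused record instance.
    static final class Box {
        int value;
        Box(int value) { this.value = value; }
    }

    // An Iterable that reuses ONE Box instance for every element it yields.
    static Iterable<Box> reusing(int... values) {
        Box shared = new Box(0);
        return () -> new Iterator<Box>() {
            int i = 0;
            @Override public boolean hasNext() { return i < values.length; }
            @Override public Box next() { shared.value = values[i++]; return shared; }
        };
    }

    // Wrap an Iterable so that every element is copied before it is handed out.
    static <T> Iterable<T> detached(Iterable<T> source, UnaryOperator<T> copy) {
        return () -> {
            Iterator<T> it = source.iterator();
            return new Iterator<T>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public T next() { return copy.apply(it.next()); }
            };
        };
    }

    // Hold references to all elements first, then read their values afterwards.
    static List<Integer> collect(Iterable<Box> boxes) {
        List<Box> held = new ArrayList<>();
        for (Box b : boxes) {
            held.add(b);
        }
        List<Integer> out = new ArrayList<>();
        for (Box b : held) {
            out.add(b.value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Every held reference points at the same, overwritten instance.
        System.out.println(collect(reusing(1, 2, 3)));
        // Each element was copied, so the values survive.
        System.out.println(collect(detached(reusing(1, 2, 3), b -> new Box(b.value))));
    }
}
```

Without detaching, the collected values are 3, 3, 3; with detaching they are 1, 2, 3.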