Git Product home page Git Product logo

random-sampling's Introduction

Random Sampling

A collection of algorithms in Java 8 for the problem of random sampling with a reservoir.

Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory. [1] In this context, the sample of k items will be referred to as sample and the list S as stream.

This package distinguishes these algorithms into two main categories: the ones that assign a weight in each item of the source stream and the ones that don't. These will be referred to as weighted and unweighted random sampling algorithms respectively. In unweighted algorithms, each item in the stream has probability k/n in appearing in the sample. In weighted algorithms this probability depends on the extra parameter weight. Each algorithm may interpret this parameter in a different way, for example in [2] two possible interpretations are mentioned.

Using

You can add a dependency from your project as follows:

Using Maven

<dependency>
    <groupId>gr.james</groupId>
    <artifactId>random-sampling</artifactId>
    <version>0.28</version>
</dependency>

Using Gradle

implementation 'gr.james:random-sampling:0.28' // Runtime
api            'gr.james:random-sampling:0.28' // Public API

Examples

Select 10 numbers at random in the range [1,100]. Each number has a 10% probability of appearing in the sample.

RandomSampling<Integer> rs = new WatermanSampling<>(10, new Random());
rs.feed(IntStream.rangeClosed(1, 100).boxed().iterator());
Collection<Integer> sample = rs.sample();
System.out.println(sample);

Select 5 random tokens from an input stream.

RandomSampling<String> rs = new VitterXSampling<>(5, new Random());
rs.feed(new Scanner(System.in));
System.out.println(rs.sample());

Same example using Algorithm Z.

RandomSampling<String> rs = new VitterZSampling<>(5, new Random());
rs.feed(new Scanner(System.in));
System.out.println(rs.sample());

Select 2 terms from a vocabulary, based on their weight.

WeightedRandomSampling<String> rs = new EfraimidisSampling<>(2, new Random());
rs.feed("collection", 1);
rs.feed("algorithms", 2);
rs.feed("java", 2);
rs.feed("random", 3);
rs.feed("sampling", 4);
rs.feed("reservoir", 5);
System.out.println(rs.sample());

Unweighted random sampling using the Java 8 stream API.

RandomSamplingCollector<Integer> collector = WatermanSampling.collector(5, new Random());
Collection<Integer> sample = IntStream.range(0, 20).boxed().collect(collector);
System.out.println(sample);

Weighted random sampling using the Java 8 stream API.

WeightedRandomSamplingCollector<String> collector = ChaoSampling.weightedCollector(2, new Random());
Map<String, Double> map = new HashMap<>();
map.put("collection", 1.0);
map.put("algorithms", 2.0);
map.put("java", 2.0);
map.put("random", 3.0);
map.put("sampling", 4.0);
map.put("reservoir", 5.0);
Collection<String> sample = map.entrySet().stream().collect(collector);
System.out.println(sample);

Algorithms

Class Algorithm Space Weighted
WatermanSampling Algorithm R by Waterman O(k)
VitterXSampling Algorithm X by Vitter O(k)
VitterZSampling Algorithm Z by Vitter O(k)
LiLSampling Algorithm L by Li O(k)
EfraimidisSampling Algorithm A-Res by Efraimidis O(k)
ChaoSampling Algorithm by Chao O(k)
SequentialPoissonSampling Algorithm by Ohlsson O(k)
ParetoSampling Algorithm by Rosén O(k)

1 Algorithm R by Waterman

Signature: WatermanSampling implements RandomSampling

References

  • The Art of Computer Programming, Vol II, Random Sampling and Shuffling.

2 Algorithm X by Vitter

Signature: VitterXSampling implements RandomSampling

References

3 Algorithm Z by Vitter

Signature: VitterZSampling implements RandomSampling

References

4 Algorithm L by Li

Signature: LiLSampling implements RandomSampling

References

5 Algorithm A-Res by Efraimidis

Signature: EfraimidisSampling implements WeightedRandomSampling

References

6 Algorithm by Chao

Signature: ChaoSampling implements WeightedRandomSampling

References

7 Algorithm by Ohlsson

Signature: SequentialPoissonSampling implements WeightedRandomSampling

References

7 Algorithm by Rosén

Signature: ParetoSampling implements WeightedRandomSampling

References

References

[1] Wikipedia contributors. "Reservoir sampling." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 17 Oct. 2017. Web. 21 Nov. 2017.

[2] Efraimidis, Pavlos S. "Weighted random sampling over data streams." Algorithms, Probability, Networks, and Games. Springer International Publishing, 2015. 183-195.

random-sampling's People

Contributors

gstamatelat avatar mrbuddycasino avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

random-sampling's Issues

Create a StrictRandomSampling marker interface

The interface should be applied to all reservoir sampling algorithms that satisfy strict proportionality between the weights and the inclusion probabilities. Currently, only ChaoSampling has this property. The addition will help determine strictness during runtime and perhaps simplify tests at a later time.

Consider static sample method on each algorithm

For example

public static <E> Collection<E> execute(Collection<E> population, int sampleSize, Random random) {
    final LiLSampling<E> sampling = new LiLSampling<>(sampleSize, random);
    sampling.feed(population);
    return sampling.sample;
}

This way, the caller would be able to get the mutable sample collection and not reuse the instance.

Long.MAX_VALUE is possible to overflow

WatermanSampling, VitterXSampling and VitterZSampling can overflow long. Inside AbstractRandomSampling streamSize can become Long.MAX_VALUE and it is being increased inside the implementations.

Create AbstractOrderSampling class

The EfraimidisSampling, ParetoSampling and SequentialPoissonSampling can all be considered order sampling designs. That is sampling designs that setup a floating point key for each element and select the elements with the maximum keys as the sample.

Implement the AbstractOrderSampling class and convert these 3 algorithms as subclasses.

Consider returning boolean from feed method

boolean feed(T item)

Returns true if the implementation changed as a result of this call.

boolean feed(Iterable<T> items)

Returns true if the implementation changed as a result of this call.


Need to find an alternative chaining method feed(0).feed(1).feed(2).

EfraimidisSampling fails when weight is too small

On EfraimidisSampling and more specifically the line

assert newItem.weight > 0.0 && newItem.weight < 1.0;

sometimes fails because if the weight is too small then 1/weight is too big and the newItem.weight is 0 due to double rounding.

Consider the feed method returning T or Optional

Idea 1

T feed(T item) returns the element T of the reservoir that was replaced by item or null if the sample was not modified as a result of this operation.

Idea 2

A new feed method has to return values from 3 different cases:

  1. item was not selected and there was no change in the sample (return null)
  2. item was inserted in the sample but nothing was removed (return null?)
  3. item replaced an existing item T of the sample (return T)

An obvious solution is to return Optional<T>:

Optional<T> feed(T item) returns null in case (1), an Optional<T> that doesn't hold a value in case (2) and an Optional<T> that holds the reference to the item that was removed in case (3).

RandomSampling.streamSize() should return long

Allow more than Integer.MAX_VALUE items to be feeded to a sampler.

  • Change RandomSampling.streamSize() to return long instead of int.
  • Change the limit that StreamOverflowException is thrown to Long.MAX_VALUE.

Weighted.compareTo is not consistent with equals

The implementation of compareTo in Weighted is not consistent with equals:

@Override
public int compareTo(Weighted<T> o) {
final int c = Double.compare(weight, o.weight);
if (c == 0) {
assert (Integer.compare(System.identityHashCode(this), System.identityHashCode(o)) == 0) == (this.equals(o));
return Integer.compare(System.identityHashCode(this), System.identityHashCode(o));
} else {
assert !this.equals(o);
return c;
}
}

In particular, there is no guarantee that System.identityHashCode is unique for every object. Thus, compareTo may return 0 and at the same time equals can return false.

The compareTo method needs to change such that it always returns 0 for same-reference objects and always a non-zero value for different objects, even if they have the same weight.

Default weight for each weighted algorithm

Each weighted algorithm should have its own default weight because the contract of WeightedRandomSampling doesn't allow for a weight value that is guaranteed to be legal.

Command line client

Reads from stdin and outputs the sample to stdout when stdin is EOF.

Possible interface:

reservoir-sample --tokenization="token_separation_reg_exp" --algorithm="efraimidis" --sample=10 --random-source=file

See the core utility shuf.

Thread Safety

It would be nice to have a thread-safe version of at least the best-performing algorithm (Lil). Other than that, very useful, thanks for making this available!

ChaoSampling test

Create test for ChaoSampling verifying that the probability of inclusion is proportional to the weight.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.