Algorithms to Sample From Streams

04:30 PM - 04:55 PM on July 16, 2016, Room CR6

Jonathan Arfa

Audience level:
novice

Description

Sampling is often necessary when you want to understand some aspects of your data but you don’t have the memory or time to analyze the full data set. This is a simple task when your data is on disk - but is much more complicated when your data is an incoming stream of unknown (but very long) length. In this talk I’ll present several methods to do this, some of which get you samples that are unbiased over the entire stream and some of which produce samples that are biased towards the present, and the tradeoffs these algorithms entail.

Abstract

Those of us who use Python to build systems that ingest constant streams of incoming data (which includes anybody whose code touches the internet) often need algorithms that keep a fixed-size sample from the stream for on-the-fly analytics. At Magnetic we are engaged in real-time bidding for online advertising, which also requires extremely low-latency Python code.

Reservoir sampling is a commonly used algorithm for keeping a sample that's unbiased over all events in the stream. But what if you want your algorithm to keep a representative sample from the past 10 minutes instead of the entire stream?

In this talk I'll start with a gentle introduction to reservoir sampling for those who are not familiar with it, and then discuss several variants that keep samples which are biased towards more recent events. One of these is the VIRB (Variable Incoming Rate Biased) sampler, which we have developed in-house at Magnetic.

The only pre-requisite is that an audience member can read and follow basic Python code. No background is required in Statistics or Data Science, although this talk is highly relevant to Data Scientists.