Distributed Python with Dask

04:30 PM - 05:25 PM on July 17, 2016, Room CR4

Matthew Rocklin

Audience level:
advanced

Description

Dask is a Python library for parallel and distributed computing. It was built with the needs of the numeric Python ecosystem in mind, emphasizing common interfaces (NumPy, Pandas, Toolz) and full flexibly in supporting custom analytic workloads and complex algorithms. This talk describes the motivations behind creating Dask, gives interactive examples of its use, talks about how it is used in practice, and compares its capabilities against other parallel computing frameworks.

Abstract

The numeric Python ecosystem contains fast and intuitive libraries that find wide use in a broad variety of applications. However, these libraries were mostly designed for data that fits in memory and for computations that run on a single core. This original design makes them sub-optimal for use on multi-core machines or on distributed clusters. While straightforward subsets of this software ecosystem can be parallelized using standard data engineering frameworks (MapReduce, Spark, Storm) the more advanced analytic algorithms remain awkward to parallelize with existing tools.

Dask is a Python library designed to complement the existing Python data science ecosystem with parallel computing. In order to support sophisticated algorithms, Dask employs a very generic task scheduling system, similar to what is found in Luigi or Airflow, but optimized for fast numeric computations operating at the sub-millisecond level. This, combined with friendly interfaces and asynchronous programming creates a fresh and flexible parallel programming experience capable of handling both the more numeric side of Python, and also custom workflows.

In this talk we describe the state of parallelism within the numeric Python stack, the motivations behind Dask, and show interactive examples using the library in a variety of real world contexts.