Hassle Free ETL with PySpark
03:15 PM - 03:40 PM on July 16, 2016, Room CR4
While the models of data science get all the press, the real work is in the maze of data preprocessing and pipelines. The goal of this talk is to get a glimpse into how you can use Python and the distributed power of Spark to simplify your (data) life, ditch the ETL boilerplate and get to the insights. We'll introduce PySpark and cover the considerations that shape ETL jobs, from code structure to performance.
What do we want?
In short … easy-to-develop ETL jobs that scale out of the box! Data growth is inevitable, both in size and variety, but that doesn't mean we need to spend ever more time and effort on the subsequent ETL. We'll introduce the tools that can get us the easy ETL we want. Python and Spark are friends; the world should know.
Learn by doing
Jump into the deep end and put together some simple but interconnected jobs. These jobs will leverage a barebones job-runner framework that I have open sourced. With a little planning, code reuse within the presented framework should be a cinch and your ETL reduced to T.
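To make the "interconnected jobs" idea concrete, here is a minimal sketch of the job-runner pattern such a framework typically uses: each job declares its upstream dependencies and the runner wires their outputs together. The names (`Job`, `run_pipeline`) are illustrative assumptions, not the actual API of the framework presented in the talk.

```python
class Job:
    """Base class: each ETL job declares its upstream dependencies
    and implements run(), which receives the outputs of those deps."""
    dependencies = []

    def run(self, inputs):
        raise NotImplementedError


def run_pipeline(jobs):
    """Run jobs in dependency order, passing each job a dict of its
    dependencies' results, keyed by the dependency's job class."""
    results = {}
    remaining = list(jobs)
    while remaining:
        progressed = False
        for job_cls in list(remaining):
            if all(dep in results for dep in job_cls.dependencies):
                inputs = {dep: results[dep] for dep in job_cls.dependencies}
                results[job_cls] = job_cls().run(inputs)
                remaining.remove(job_cls)
                progressed = True
        if not progressed:
            raise ValueError("Cyclic or unsatisfiable dependencies")
    return results


# Example: an extract job feeding a transform job.
class Extract(Job):
    def run(self, inputs):
        return [1, 2, 3]  # stand-in for reading a source


class Transform(Job):
    dependencies = [Extract]

    def run(self, inputs):
        return [x * 10 for x in inputs[Extract]]


results = run_pipeline([Transform, Extract])
# results[Transform] == [10, 20, 30]
```

With the extract step owned by a reusable job class, new pipelines only add the T: a new `Transform` subclass pointed at an existing `Extract`.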
ETL Principles
Data errors, job failures and general meltdowns happen. How do we minimize the damage without losing days of our lives? First we'll get back to the basics. In the PySpark world, there are a few more guidelines that can not only speed up the development of your data pipelines, but also improve the quality of your output. Partitioning schemes, data types, memory usage and file types matter!
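As a taste of why partitioning schemes matter: when you write a partitioned dataset in PySpark (e.g. `df.write.partitionBy("year", "month").parquet(path)`), Spark lays files out in Hive-style `key=value` directories, and later queries that filter on those keys can skip whole directories. The layout itself can be sketched in plain Python, no Spark required:

```python
def partition_path(record, keys):
    """Map a record (dict) to its partition directory, e.g.
    {'year': 2016, 'month': 7} -> 'year=2016/month=7'."""
    return "/".join(f"{k}={record[k]}" for k in keys)


def bucket_records(records, keys):
    """Group records under their partition directories, mirroring how
    a partitioned write lays files out on disk."""
    buckets = {}
    for rec in records:
        buckets.setdefault(partition_path(rec, keys), []).append(rec)
    return buckets


events = [
    {"year": 2016, "month": 7, "value": 1},
    {"year": 2016, "month": 7, "value": 2},
    {"year": 2016, "month": 8, "value": 3},
]
layout = bucket_records(events, ["year", "month"])
# layout has two directories: 'year=2016/month=7' and 'year=2016/month=8'.
# A query filtered to month 7 never has to touch the month=8 files —
# that's partition pruning, and it's why the partition keys you choose
# (and the data types and file formats beneath them) matter so much.
```

Pick partition keys that match how the data is queried; too many distinct values means thousands of tiny files, too few means no pruning at all.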