Spark Dataframes for the Pandas Pro

10:30 AM - 10:55 AM on July 17, 2016, Room CR7

Alfred Lee

Audience level:


Pandas has been a mainstay of data analysis in Python for years. As Spark gains wider adoption, more analysts are learning this distributed processing engine. I will explain the essential differences between the two; understanding them will ease the transition to this promising new tool.


SparkSQL's DataFrame construct is inspired by R and Pandas DataFrames, and it enables the same sorts of data manipulation workflows. But the syntax and underlying data structures are rather different. In this talk, I will translate the major Pandas concepts into their Spark counterparts and explain the shifts in assumptions that are necessary. Once the analyst understands these assumptions, the differences make sense, and writing better Spark code becomes easier.
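
As a rough illustration of the kind of translation described above (not taken from the talk itself), the sketch below runs the same filter-and-aggregate workflow in Pandas and in PySpark. The sample rows, the column names `key`, `value`, and `doubled`, and the function names are all hypothetical, and the Spark half assumes a local PySpark 2.x or later installation.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
rows = [("a", 1), ("a", 2), ("b", 3)]


def pandas_totals(rows):
    """Pandas: eager evaluation, and columns can be assigned in place."""
    pdf = pd.DataFrame(rows, columns=["key", "value"])
    pdf["doubled"] = pdf["value"] * 2          # mutates pdf immediately
    filtered = pdf[pdf["value"] > 1]           # boolean-mask indexing
    return filtered.groupby("key")["doubled"].sum().to_dict()


def spark_totals(rows):
    """PySpark: immutable DataFrames, column expressions, lazy evaluation."""
    # Imported here so the Pandas half runs without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("pandas-vs-spark-sketch")
             .getOrCreate())
    sdf = spark.createDataFrame(rows, ["key", "value"])
    result = (sdf
              .withColumn("doubled", F.col("value") * 2)  # returns a new DataFrame
              .filter(F.col("value") > 1)
              .groupBy("key")
              .agg(F.sum("doubled").alias("doubled")))
    # Nothing is computed until an action such as collect() is called.
    out = {row["key"]: row["doubled"] for row in result.collect()}
    spark.stop()
    return out


if __name__ == "__main__":
    print(pandas_totals(rows))
```

The two functions are meant to produce the same per-key totals. The visible differences are that the Pandas version modifies its DataFrame in place and computes each step immediately, while the Spark version chains transformations that each return a new DataFrame and only executes when `collect()` is called.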