Fighting the Flu with Machine Learning

05:00 PM - 05:25 PM on July 16, 2016, Room CR5

Rohan Koodli

Audience level:


Everyone despises falling sick. In this talk, I will present how I used Python and machine learning to predict future sequences of the flu. My Random Forests model learns patterns in how the flu mutates from previous flu seasons and builds decision trees from those patterns. I will also cover how using a library (scikit-learn) was much easier and more accurate than writing my own Random Forests classifier.


I created a machine learning model that predicts the genetic sequences of flu strains in future flu seasons. In my talk, I will show how I implemented it in Python and how libraries made it simpler and more efficient. First, I had to obtain the genetic sequences of the flu. I used the Biopython library to read them; each sequence is over 1,700 base pairs long. Next, I wrote my own Random Forests implementation and trained it on the grandparent-parent-child relationships between influenza strains. It often overfit my gene sequence data and did not predict very accurately, so I switched to the scikit-learn Random Forests implementation. This was much easier to write and predicted much more accurately than my own implementation. The switch had a cost, however: I could only include parent-child relationships between flu strains, which lost some accuracy. Currently, my model built on the scikit-learn Random Forests algorithm has an accuracy of 93% when predicting future strains of the flu (specifically, the H1N1 and H3N2 subtypes).
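To give a rough idea of what training on parent-child relationships could look like, here is a minimal sketch using scikit-learn's RandomForestClassifier. It is not the model from the talk: the toy 8-base sequences, the per-position feature encoding (position plus parent base), and the helper names build_training_set and predict_child are all assumptions made for illustration. In practice the sequences would be read from sequence files, for example with Biopython's SeqIO.parse, rather than hard-coded.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: (parent, child) pairs of aligned sequences.
# Real flu gene sequences are over 1700 bases long; these are made up.
PAIRS = [
    ("ACGTACGT", "ACGTACGA"),
    ("ACGTACGA", "ACGAACGA"),
    ("TCGTACGT", "TCGTACGA"),
    ("ACGTTCGT", "ACGTTCGA"),
]

BASES = "ACGT"
BASE_TO_INT = {base: i for i, base in enumerate(BASES)}


def build_training_set(pairs):
    """Build one training row per (pair, position).

    Features: [position, parent base]; label: child base at that position.
    This encoding is an assumption, not the one used in the talk.
    """
    features, labels = [], []
    for parent, child in pairs:
        for pos, (p_base, c_base) in enumerate(zip(parent, child)):
            features.append([pos, BASE_TO_INT[p_base]])
            labels.append(BASE_TO_INT[c_base])
    return np.array(features), np.array(labels)


def predict_child(model, parent):
    """Predict a child sequence, one position at a time."""
    rows = np.array([[pos, BASE_TO_INT[b]] for pos, b in enumerate(parent)])
    return "".join(BASES[int(i)] for i in model.predict(rows))


X, y = build_training_set(PAIRS)

# A fixed random_state keeps the toy run reproducible.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

predicted = predict_child(model, "ACGTACGT")
print(predicted)
```

With real data, accuracy would be measured on strains held out from training, not on the sequences the model was fitted to.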