Scalable Machine Learning for Big Data Biology


Course Overview

Machine learning (ML) has become an integral part of computational thinking in the era of big data biology. This course will focus on understanding the statistical structure of large-scale biological datasets using ML algorithms. We will cover the basics of these algorithms and study their scalable versions in distributed and non-distributed frameworks. We will pursue scalable ML algorithms for matrix factorization, optimization, dimensionality reduction, clustering, classification, graph analytics and deep learning, among others.

The course will be project-driven (5 mini projects), with source material from genomic sciences, structural biology, drug discovery, systems modeling and biological imaging. There will be one final course project, along with a presentation. For some of the assignments, students will be expected to design, implement and test their ML solutions in Apache Spark.


No biological background is expected. The assignments will cover the necessary biology. Experience in programming and some software engineering is preferred. Knowledge of probability, statistics, linear algebra and algorithms will be useful. Prior introduction to machine learning is a bonus. The class is open to senior-year undergraduates and graduate students.

Class Logistics

  • Class Timings: 1:30 PM to 3:00 PM, Wed/Fri
  • Location: BST3 3073
  • Office Hours: email us at chakracs AT and dkoes AT
  • Website:


  • Assignments: 5 in total, 50–60% of the grade
    • 3 late days in total (for the whole semester)
  • Final Course Project: 30–40% of the grade
    • Significant implementation effort
    • In-class project presentation (Dec 4 and 6)
    • Final report
  • Class Participation: 10% of the grade


  • 29-Aug-2018 For access to BST3, get in touch with Kelly Gentille [kmg120 AT pitt DOT edu]
  • 29-Aug-2018 Check the Lectures tab for class schedule