Scalable Machine Learning for Big Data Biology


Course Overview

Machine learning (ML) has become an integral part of computational thinking in the era of big data biology. This course will focus on understanding the statistical structure of large-scale biological datasets using ML algorithms. We will cover the basics of these algorithms and study scalable versions of them that can be implemented on a distributed computing framework. Topics include distributed ML algorithms for matrix factorization, convex optimization, dimensionality reduction, clustering, classification, graph analytics and deep learning, among others.

The course will be project driven (3 to 4 mini projects) with source material from genomic sciences, structural biology, drug discovery, systems modeling and biological imaging. There will be one final course project, along with a presentation. Students will be expected to design, implement and test their ML solutions in Apache Spark.


No biological background is expected. The assignments will cover the necessary biology. Experience in programming and some software engineering is preferred. Knowledge of probability, statistics, linear algebra and algorithms is a bonus. The class is open to senior-year undergraduates and graduate students.

Class Logistics

  • Class Timings: 1:30 PM to 3:00 PM, Wed/Fri
  • Location: BST3 3073
  • Office Hours: email us at and
  • Website:


  • Assignments: 3 in total, 40% grade
    • 3 late days in total (for the whole semester)
  • Final Course Project: 50% grade
    • Significant Implementation Effort
    • In-class Project Presentation (Dec 2 and 9)
    • Final Report
  • Class Participation: 10% grade


  • 1-Sep-2017: Please check the Lectures tab for Spark installation instructions.
  • 30-Aug-2017: For access to BST3, get in touch with Kelly Gentille [kmg120 AT pitt DOT edu].
  • 30-Aug-2017: Check the Lectures tab for the class schedule and material.