Introduction to Data Science with Python
Dates:
Prerequisites:
Basic algebra background; preference will be given to attendees who attended the Fall 2025 SDSU HealthLINK Center Data Science Training on Introduction to Machine Learning by Dr. Kee Moon
Instructor: Christopher Paolini
Preferred Title: “Chris”, “Christopher”, or “Dr. Paolini”
Dr. Christopher Paolini is an Associate Professor in the Department of Electrical and Computer Engineering at SDSU and a co-investigator of the SDSU HealthLINK Center’s Research Infrastructure Core. He is the recipient of grants from the Department of Energy and NASA, and of five NSF Office of CyberInfrastructure awards, most recently the current NSF CC* Storage Grant 1659169, Implementation of a Distributed, Shareable, and Parallel Storage Resource at SDSU to Facilitate High-Performance Computing for Climate Science. His doctoral and post-doctoral research was in combustion engineering, computational thermodynamics, and chemical kinetics. His current research interests include Internet of Things device development, machine learning, embedded systems, cloud computing, big data analytics, deep learning, software engineering, numerical chemical thermodynamics, numerical chemical kinetics, numerical geochemistry, high-performance computing, scientific computing and numerical modeling, high-speed networking, cyberinfrastructure development, and cybersecurity.
This two-day (6-hour) workshop covers a subset of topics and techniques typically covered in a course on data science, including linear regression, logistic regression, interpolation (linear, cubic spline, Lagrange), support vector machines, dimensionality reduction using principal component analysis, Gaussian process regression (GPR), K-means clustering, decision trees and random forests, time-series modeling and forecasting using recurrent neural networks (RNNs), and image classification and object detection using convolutional neural networks (CNNs). By the end of this course, participants will be able to apply each of these techniques in Python.
Lessons covered:
Pandas for data manipulation, analysis, and cleaning
Lesson 1 will cover the open-source Python library Pandas used for data manipulation and analysis. Pandas provides the high-level data structures DataFrames and Series that allow one to organize, filter, transform, and aggregate structured data. Pandas includes features for handling missing values, cleaning inconsistent data, merging datasets, and reshaping tables. We will use Pandas with the NumPy library to build data science workflows.
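As a small illustration of these operations, the sketch below builds a tiny hypothetical dataset (the column names and values are invented for this example), cleans an inconsistent label, imputes a missing value, and aggregates with a groupby:

```python
import numpy as np
import pandas as pd

# Hypothetical clinic-visit data with one missing age and inconsistent labels
df = pd.DataFrame({
    "clinic": ["North", "north", "South", "South"],
    "age": [34.0, 51.0, np.nan, 47.0],
    "visits": [2, 5, 3, 1],
})

df["clinic"] = df["clinic"].str.title()          # clean inconsistent labels
df["age"] = df["age"].fillna(df["age"].mean())   # impute the missing age
summary = df.groupby("clinic")["visits"].sum()   # aggregate visits per clinic
```

The same pattern (clean, impute, group, aggregate) scales to much larger tables.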
Linear Regression
Lesson 2 will cover linear regression using scikit-learn to introduce public health students to the most fundamental model one can use to analyze relationships between variables. A regression model is used to predict a numeric output response value, called a regressand, from a set of input values, called predictors. Linear regression is a supervised learning algorithm: a set of predictors and regressands is provided, and an algorithm “learns” a (in this case linear) relationship between them. The learned relationship can then be used to predict a new output regressand from new input predictors.
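A minimal sketch of this workflow, using invented data that follows an exactly linear relationship so the learned coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors and regressands, generated from y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)   # "learn" the linear relationship
pred = model.predict([[5.0]])          # predict the regressand for a new input
```

On real data the fit will not be exact; `model.coef_` and `model.intercept_` then give the best least-squares line through the points.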
Logistic Regression
Lesson 3 will cover the use of scikit-learn (sklearn) to perform logistic regression for classification tasks. Logistic regression is a supervised machine learning algorithm used for classification with binary outcomes (e.g., true or false). A logistic regression model predicts probabilities of class membership, not continuous values, by applying a linear model to the input features and then passing the result through the logistic (sigmoid) function, which maps any real number to a probability between 0 and 1. We will use sklearn.linear_model.LogisticRegression to fit a model to labeled data with its .fit() method and generate predictions with .predict() or probability estimates with .predict_proba(). Scikit-learn includes built-in support for regularization (L1 or L2), which helps prevent overfitting and improves generalization. Combined with preprocessing tools and model evaluation utilities such as cross-validation and metrics, sklearn provides the methods needed to construct and evaluate logistic regression models.
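The calls named above fit together as in this sketch, using a made-up one-feature dataset where larger values belong to class 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary dataset: larger feature values -> class 1
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()           # L2 regularization is the default
clf.fit(X, y)                        # fit to labeled data
labels = clf.predict([[0.0], [5.0]])        # hard class labels
probs = clf.predict_proba([[0.0], [5.0]])   # class-membership probabilities
```

Each row of `probs` holds the probabilities of class 0 and class 1 and sums to 1.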
Interpolation (Linear, Cubic Spline, Lagrange)
Lesson 4 will cover methods of interpolation used to estimate unknown values that reside between known data points. Interpolation methods use existing data to construct a function that approximates the values within the range. Interpolation is commonly used in data analysis to fill in missing data or create smooth transitions between points. Linear interpolation estimates values between two data points using a straight line. Lagrange interpolation fits a single polynomial that passes exactly through all given data points, providing an exact fit but potentially causing large oscillations (known as Runge’s phenomenon) between data points. Cubic spline interpolation fits piecewise cubic polynomials between data points while ensuring smoothness with continuous first and second derivatives.
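One common toolset for these three methods is NumPy plus SciPy (an assumption here; the lesson may use other implementations). The sketch below interpolates invented samples of y = x², so all three methods can be checked against the known function:

```python
import numpy as np
from scipy.interpolate import CubicSpline, lagrange

# Hypothetical known data points sampled from y = x**2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 4.0, 9.0])

# Linear: straight line between the two nearest points
lin = np.interp(1.5, x, y)

# Cubic spline: piecewise cubics, continuous 1st and 2nd derivatives
cs = CubicSpline(x, y)

# Lagrange: one polynomial through all points (exact here, y = x**2)
poly = lagrange(x, y)
```

At x = 1.5 the true value is 2.25; linear interpolation gives 2.5 (the chord between (1, 1) and (2, 4)), while the spline and Lagrange polynomial recover 2.25 on this data.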
Support Vector Machines
Lesson 5 will cover Support Vector Machines (SVMs), which are supervised learning algorithms used for classification and regression that work by finding the optimal hyperplane separating data points of different classes. An SVM model maximizes the margin between classes, using a subset of training points called support vectors to define a boundary between classes.
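A minimal sketch, assuming scikit-learn's `SVC` with a linear kernel and two invented, linearly separable point clouds:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable hypothetical classes in 2-D
X = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")     # find the maximum-margin hyperplane
svm.fit(X, y)
# Only a subset of training points (the support vectors) defines the boundary
n_support = svm.support_vectors_.shape[0]
pred = svm.predict([[0.5, 0.5], [4.5, 4.5]])
```

Nonlinear boundaries are handled by swapping the kernel, e.g. `kernel="rbf"`.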
Dimensionality Reduction using Principal Component Analysis
Lesson 6 will cover the use of Principal Component Analysis (PCA) as a dimensionality reduction method that transforms a (potentially large) dataset into a smaller set of uncorrelated variables or features known as principal components. These components are linear combinations of the original features and are ordered by the amount of variance they capture, with the first few retaining most of the information in the data. We can reduce the dimension of a large dataset by retaining only the components that preserve the greatest variability.
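As an illustration (assuming scikit-learn's `PCA`), the sketch below builds synthetic data whose third feature is nearly a copy of the first, so two principal components retain almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 3-feature data; feature 3 is feature 1 plus tiny noise
base = rng.normal(size=(100, 2))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=100)])

pca = PCA(n_components=2)      # keep the two highest-variance components
X2 = pca.fit_transform(X)      # project data onto the principal components
retained = pca.explained_variance_ratio_.sum()
```

`explained_variance_ratio_` reports, per component, the fraction of total variance captured, which guides how many components to keep.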
Gaussian Process Regression
Lesson 7 will cover Gaussian Process Regression (GPR), which is a probabilistic, non-parametric approach to regression that uses Gaussian processes to represent a distribution of potential functions that could fit a given dataset. Rather than simply providing a point estimate, GPR generates both a mean prediction and a quantified level of uncertainty for every data point. The model’s flexibility is driven by a kernel function, which defines the relationships between points in the input space.
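A minimal sketch of these ideas, assuming scikit-learn's `GaussianProcessRegressor` with an RBF kernel and a few invented, noise-free samples of sin(x):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical noise-free samples of a smooth function
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gpr.fit(X, y)
# GPR returns a mean prediction AND an uncertainty for each query point
mean, std = gpr.predict(np.array([[1.5], [10.0]]), return_std=True)
```

The uncertainty grows with distance from the training data: the query at x = 10, far outside the sampled range, comes back with a much larger standard deviation than the query at x = 1.5 between two training points.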
K-Means Clustering
Lesson 8 will cover K-Means Clustering, an unsupervised learning method that partitions data into K distinct groups according to data similarity. The algorithm first selects K initial centroids and then assigns every data point to its closest centroid. The algorithm repeatedly adjusts these centroids to reduce the variance within each cluster. This cycle repeats until the assignments stabilize, leading to the formation of clearly defined clusters.
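A minimal sketch (assuming scikit-learn's `KMeans`) on two invented, well-separated groups of points; with no labels provided, the algorithm recovers the grouping on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of 2-D points
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # assign points, move centroids, repeat
```

The numeric labels themselves are arbitrary; what matters is that points in the same group share a label, which is why K chosen well matters more than which cluster is called 0 or 1.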
Decision Trees and Random Forests
Lesson 9 will cover decision trees and random forests. A decision tree is a supervised learning method that predicts outcomes by repeatedly partitioning data into a hierarchy based on feature values. A single decision tree tends to overfit the training data and therefore generalizes poorly to test data; a random forest mitigates this weakness by constructing an ensemble of decision trees, each trained on a random sample of the data and features, which leads to better overall generalization.
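Side by side, the two models look like this (assuming scikit-learn; the one-feature dataset is invented, with class 1 above a threshold near 2):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 1-feature dataset: class 1 when the feature exceeds ~2
X = np.array([[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A single tree: one hierarchy of threshold splits
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A forest: many trees, each on a random sample of rows and features,
# whose votes are combined for the final prediction
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[0.1], [4.0]])
```

On messier, higher-dimensional data the averaging across trees is what curbs the single tree's tendency to memorize the training set.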
Modeling Time-Series using Recurrent Neural Networks
Lesson 10 will cover the use of recurrent neural networks (RNNs) to make inferences from time series data. RNNs are a type of neural network designed to process sequential data by maintaining a hidden state that carries forward information from previous time steps.
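The lesson will use a deep-learning framework for real models; purely to illustrate how a hidden state carries information forward, here is a framework-free NumPy sketch of one recurrent layer's forward pass. The weights are made-up illustrative values, not trained parameters:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a sequence through one simple recurrent layer."""
    h = np.zeros(W_hh.shape[0])                    # initial hidden state
    for x_t in x_seq:                              # one step per time point
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state mixes input
    return h                                       # with the previous state

# Hypothetical 3-step, 1-feature input sequence and fixed weights
x_seq = [np.array([1.0]), np.array([0.5]), np.array([-0.5])]
W_xh = np.array([[0.8], [0.3]])
W_hh = np.array([[0.1, 0.0], [0.0, 0.1]])
b_h = np.zeros(2)
h_final = rnn_forward(x_seq, W_xh, W_hh, b_h)
```

Because each state depends on the previous one, feeding the same values in a different order produces a different final state, which is exactly what lets an RNN model temporal structure.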
Image Classification and Object Detection using Convolutional Neural Networks
Lesson 11 will cover the task of image classification and object detection using convolutional neural networks (CNNs), a type of neural network that learns spatial features through convolutional layers. CNNs can identify and localize multiple objects within a single image by predicting bounding boxes that enclose each object and assigning a class label to each box that identifies what type of object appears within the box.
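Real CNNs will be built with a deep-learning framework in the lesson; purely to illustrate the convolution operation at their core, here is a NumPy sketch in which a hand-crafted (not learned) kernel slides over a tiny synthetic image and responds to a vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D cross-correlation, the core CNN operation."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic 5x5 image: dark left half, bright right half
image = np.zeros((5, 5))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])     # responds to left-to-right increases
feature_map = conv2d(image, kernel)
```

The feature map is nonzero only along the column where intensity jumps; a trained CNN learns many such kernels, layer by layer, instead of using hand-crafted ones.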
| Hardware | Who provides | Notes |
| --- | --- | --- |
| Laptop, 1 per student | Attendee | |
| Nvidia DGX A100 | Computational Science Research Center (CSRC) | Student accounts will be provided to participants. |
| Software | Who provides | Notes |
| --- | --- | --- |
| Visual Studio Code | Open source | Installation instructions will be provided by Dr. Paolini. Sung will help prepare the installation manual two weeks prior to the workshop. |
| Visual Studio Code Remote – SSH extension | Open source | |
| Python | Open source | |
| keras, scikit-learn, numpy, matplotlib, tensorflow | Open source | |
To be provided by Dr. Paolini
Will be available for download on the day of class