Introduction to Data Science with Python
Dates:
Prerequisites:
Basic algebra background; preference will be given to attendees who attended the Fall 2025 SDSU HealthLINK Center Data Science Training on Introduction to Machine Learning by Dr. Kee Moon
Instructor: Christopher Paolini
Preferred Title: “Chris”, “Christopher”, or “Dr. Paolini”
Dr. Christopher Paolini is an Associate Professor in the Department of Electrical and Computer Engineering at SDSU and a co-investigator of the SDSU HealthLINK Center’s Research Infrastructure Core. He is the recipient of grants from the Department of Energy and NASA, and of five NSF Office of CyberInfrastructure awards, most recently the current NSF CC* Storage Grant 1659169, Implementation of a Distributed, Shareable, and Parallel Storage Resource at SDSU to Facilitate High-Performance Computing for Climate Science. His doctoral and post-doctoral research was in combustion engineering, computational thermodynamics, and chemical kinetics. His current research interests include Internet of Things device development, machine learning, embedded systems, cloud computing, big data analytics, deep learning, software engineering, numerical chemical thermodynamics, numerical chemical kinetics, numerical geochemistry, high-performance computing, scientific computing and numerical modeling, high-speed networking, cyberinfrastructure development, and cybersecurity.
This two-day (6-hour) workshop covers a subset of topics and techniques typically covered in a course on data science, including linear regression, logistic regression, interpolation (linear, cubic spline, Lagrange), support vector machines, dimensionality reduction using principal component analysis, Gaussian process regression (GPR), K-means clustering, decision trees and random forests, time-series modeling and forecasting using recurrent neural networks (RNNs), and image classification and object detection using convolutional neural networks (CNNs). By the end of this course, participants will be able to apply each of these techniques in Python.
Lessons covered:
Pandas for data manipulation, analysis, and cleaning
Lesson 1 will cover the open-source Python library Pandas used for data manipulation and analysis. Pandas provides the high-level data structures DataFrames and Series that allow one to organize, filter, transform, and aggregate structured data. Pandas includes features for handling missing values, cleaning inconsistent data, merging datasets, and reshaping tables. We will use Pandas with the NumPy library to build data science workflows.
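As a small illustration of these operations, the sketch below builds a tiny hypothetical dataset (the column names and values are invented for this example), cleans an inconsistent label, imputes a missing value, and aggregates with a groupby:

```python
import numpy as np
import pandas as pd

# Hypothetical clinic-visit data with one missing age and inconsistent labels
df = pd.DataFrame({
    "clinic": ["North", "north", "South", "South"],
    "age": [34.0, 51.0, np.nan, 47.0],
    "visits": [2, 5, 3, 1],
})

df["clinic"] = df["clinic"].str.title()          # clean inconsistent labels
df["age"] = df["age"].fillna(df["age"].mean())   # impute the missing age
summary = df.groupby("clinic")["visits"].sum()   # aggregate visits per clinic
```

The same pattern (clean, impute, group, aggregate) scales to much larger tables.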
Linear Regression
Lesson 2 will cover linear regression using scikit-learn to introduce public health students to the most fundamental model one can use to analyze relationships between variables. A regression model is used to predict a numeric output response value, called a regressand, from a set of input values, called predictors. Linear regression is a supervised learning algorithm: a set of predictors and regressands is provided, and an algorithm “learns” a (in this case linear) relationship between them. The learned relationship can then be used to predict a new output regressand from new input predictors.
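A minimal sketch of this workflow, using invented data that follows an exactly linear relationship so the learned coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors and regressands, generated from y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)   # "learn" the linear relationship
pred = model.predict([[5.0]])          # predict the regressand for a new input
```

On real data the fit will not be exact; `model.coef_` and `model.intercept_` then give the best least-squares line through the points.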
Logistic Regression
Lesson 3 will cover the use of scikit-learn (sklearn) to perform logistic regression for classification tasks. Logistic regression is a supervised machine learning algorithm used for classification with binary outcomes (e.g., true or false). A logistic regression model predicts probabilities of class membership, not continuous values, by applying a linear model to the input features and then passing the result through the logistic (sigmoid) function, which maps any real number to a probability between 0 and 1. We will use sklearn.linear_model.LogisticRegression to fit a model to labeled data with its .fit() method and generate predictions with .predict() or probability estimates with .predict_proba(). Scikit-learn includes built-in support for regularization (L1 or L2), which helps prevent overfitting and improves generalization. Combined with preprocessing tools and model evaluation utilities such as cross-validation and metrics, sklearn provides the methods needed to construct and evaluate logistic regression models.
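The calls named above fit together as in this sketch, using a made-up one-feature dataset where larger values belong to class 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary dataset: larger feature values -> class 1
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()           # L2 regularization is the default
clf.fit(X, y)                        # fit to labeled data
labels = clf.predict([[0.0], [5.0]])        # hard class labels
probs = clf.predict_proba([[0.0], [5.0]])   # class-membership probabilities
```

Each row of `probs` holds the probabilities of class 0 and class 1 and sums to 1.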
Interpolation (Linear, Cubic Spline, Lagrange)
Lesson 4 will cover methods of interpolation used to estimate unknown values that reside between known data points. Interpolation methods use existing data to construct a function that approximates the values within the range. Interpolation is commonly used in data analysis to fill in missing data or create smooth transitions between points. Linear interpolation estimates values between two data points using a straight line. Lagrange interpolation fits a single polynomial that passes exactly through all given data points, providing an exact fit but potentially causing large oscillations (known as Runge’s phenomenon) between data points. Cubic spline interpolation fits piecewise cubic polynomials between data points while ensuring smoothness with continuous first and second derivatives.
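One common toolset for these three methods is NumPy plus SciPy (an assumption here; the lesson may use other implementations). The sketch below interpolates invented samples of y = x², so all three methods can be checked against the known function:

```python
import numpy as np
from scipy.interpolate import CubicSpline, lagrange

# Hypothetical known data points sampled from y = x**2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 4.0, 9.0])

# Linear: straight line between the two nearest points
lin = np.interp(1.5, x, y)

# Cubic spline: piecewise cubics, continuous 1st and 2nd derivatives
cs = CubicSpline(x, y)

# Lagrange: one polynomial through all points (exact here, y = x**2)
poly = lagrange(x, y)
```

At x = 1.5 the true value is 2.25; linear interpolation gives 2.5 (the chord between (1, 1) and (2, 4)), while the spline and Lagrange polynomial recover 2.25 on this data.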
Support Vector Machines
Lesson 5 will cover Support Vector Machines (SVMs), which are supervised learning algorithms used for classification and regression that work by finding the optimal hyperplane separating data points of different classes. An SVM model maximizes the margin between classes, using a subset of training points called support vectors to define a boundary between classes.
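A minimal sketch, assuming scikit-learn's `SVC` with a linear kernel and two invented, linearly separable point clouds:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable hypothetical classes in 2-D
X = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")     # find the maximum-margin hyperplane
svm.fit(X, y)
# Only a subset of training points (the support vectors) defines the boundary
n_support = svm.support_vectors_.shape[0]
pred = svm.predict([[0.5, 0.5], [4.5, 4.5]])
```

Nonlinear boundaries are handled by swapping the kernel, e.g. `kernel="rbf"`.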
Dimensionality Reduction using Principal Component Analysis
Lesson 6 will cover the use of Principal Component Analysis (PCA) as a dimensionality reduction method that transforms a (potentially large) dataset into a smaller set of uncorrelated variables or features known as principal components. These components are linear combinations of the original features and are ordered by the amount of variance they capture, with the first few retaining most of the information in the data. We can reduce the dimension of a large dataset by retaining only the components that preserve the greatest variability.
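As an illustration (assuming scikit-learn's `PCA`), the sketch below builds synthetic data whose third feature is nearly a copy of the first, so two principal components retain almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 3-feature data; feature 3 is feature 1 plus tiny noise
base = rng.normal(size=(100, 2))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=100)])

pca = PCA(n_components=2)      # keep the two highest-variance components
X2 = pca.fit_transform(X)      # project data onto the principal components
retained = pca.explained_variance_ratio_.sum()
```

`explained_variance_ratio_` reports, per component, the fraction of total variance captured, which guides how many components to keep.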
Gaussian Process Regression
Lesson 7 will cover Gaussian Process Regression (GPR), which is a probabilistic, non-parametric approach to regression that uses Gaussian processes to represent a distribution of potential functions that could fit a given dataset. Rather than simply providing a point estimate, GPR generates both a mean prediction and a quantified level of uncertainty for every data point. The model’s flexibility is driven by a kernel function, which defines the relationships between points in the input space.
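A minimal sketch of these ideas, assuming scikit-learn's `GaussianProcessRegressor` with an RBF kernel and a few invented, noise-free samples of sin(x):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical noise-free samples of a smooth function
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gpr.fit(X, y)
# GPR returns a mean prediction AND an uncertainty for each query point
mean, std = gpr.predict(np.array([[1.5], [10.0]]), return_std=True)
```

The uncertainty grows with distance from the training data: the query at x = 10, far outside the sampled range, comes back with a much larger standard deviation than the query at x = 1.5 between two training points.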
K-Means Clustering
Lesson 8 will cover K-Means Clustering, an unsupervised learning method that partitions data into K distinct groups according to data similarity. The algorithm first selects K initial centroids and then assigns every data point to its closest centroid. The algorithm repeatedly adjusts these centroids to reduce the variance within each cluster. This cycle repeats until the assignments stabilize, leading to the formation of clearly defined clusters.
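A minimal sketch (assuming scikit-learn's `KMeans`) on two invented, well-separated groups of points; with no labels provided, the algorithm recovers the grouping on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of 2-D points
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # assign points, move centroids, repeat
```

The numeric labels themselves are arbitrary; what matters is that points in the same group share a label, which is why K chosen well matters more than which cluster is called 0 or 1.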
Decision Trees and Random Forests
Lesson 9 will cover decision trees and random forests. A decision tree is a supervised learning method that predicts outcomes by repeatedly partitioning data into a hierarchy based on feature values. A single decision tree tends to overfit the training data and therefore generalizes poorly to test data; a random forest mitigates this weakness by constructing an ensemble of decision trees, each trained on a random sample of the data and features, which leads to better overall generalization.
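Side by side, the two models look like this (assuming scikit-learn; the one-feature dataset is invented, with class 1 above a threshold near 2):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 1-feature dataset: class 1 when the feature exceeds ~2
X = np.array([[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A single tree: one hierarchy of threshold splits
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A forest: many trees, each on a random sample of rows and features,
# whose votes are combined for the final prediction
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[0.1], [4.0]])
```

On messier, higher-dimensional data the averaging across trees is what curbs the single tree's tendency to memorize the training set.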
Modeling Time-Series using Recurrent Neural Networks
Lesson 10 will cover the use of recurrent neural networks (RNNs) to make inferences from time series data. RNNs are a type of neural network designed to process sequential data by maintaining a hidden state that carries forward information from previous time steps.
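The lesson will use a deep-learning framework for real models; purely to illustrate how a hidden state carries information forward, here is a framework-free NumPy sketch of one recurrent layer's forward pass. The weights are made-up illustrative values, not trained parameters:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a sequence through one simple recurrent layer."""
    h = np.zeros(W_hh.shape[0])                    # initial hidden state
    for x_t in x_seq:                              # one step per time point
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state mixes input
    return h                                       # with the previous state

# Hypothetical 3-step, 1-feature input sequence and fixed weights
x_seq = [np.array([1.0]), np.array([0.5]), np.array([-0.5])]
W_xh = np.array([[0.8], [0.3]])
W_hh = np.array([[0.1, 0.0], [0.0, 0.1]])
b_h = np.zeros(2)
h_final = rnn_forward(x_seq, W_xh, W_hh, b_h)
```

Because each state depends on the previous one, feeding the same values in a different order produces a different final state, which is exactly what lets an RNN model temporal structure.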
Image Classification and Object Detection using Convolutional Neural Networks
Lesson 11 will cover the task of image classification and object detection using convolutional neural networks (CNNs), a type of neural network that learns spatial features through convolutional layers. CNNs can identify and localize multiple objects within a single image by predicting bounding boxes that enclose each object and assigning a class label to each box that identifies what type of object appears within the box.
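Real CNNs will be built with a deep-learning framework in the lesson; purely to illustrate the convolution operation at their core, here is a NumPy sketch in which a hand-crafted (not learned) kernel slides over a tiny synthetic image and responds to a vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D cross-correlation, the core CNN operation."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic 5x5 image: dark left half, bright right half
image = np.zeros((5, 5))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])     # responds to left-to-right increases
feature_map = conv2d(image, kernel)
```

The feature map is nonzero only along the column where intensity jumps; a trained CNN learns many such kernels, layer by layer, instead of using hand-crafted ones.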
| Hardware | Who provides | Notes |
| --- | --- | --- |
| Laptop, 1 per student | Attendee | |
| Nvidia DGX A100 | Computational Science Research Center (CSRC) | Student accounts will be provided to participants. |
| Software | Who provides | Notes |
| --- | --- | --- |
| Visual Studio Code | Open source | Installation instructions will be provided by Dr. Paolini. Sung will help prepare the installation manual two weeks prior to the workshop. |
| Visual Studio Code Remote – SSH extension | Open source | |
| Python | Open source | |
| keras, scikit-learn, numpy, matplotlib, tensorflow | Open source | |
To be provided by Dr. Paolini
Will be available for download on the day of class