STA 437/2005 Winter 2025:

Methods for Multivariate Data

This course introduces modern methods for multivariate data while also building some theoretical foundations. The lectures are divided into five blocks: I. Foundations of Multivariate Analysis, II. Multivariate Statistical Inference, III. Dimensionality Reduction Techniques, IV. Covariance Matrix Modelling and Estimation, V. Methods for Tensors.

More details can be found in the syllabus and on Piazza.



Announcements:


Instructors:

Prof. Piotr Zwiernik
Email: piotr.zwiernik@utoronto.ca
Office hours: Tuesday, 15:30-17:00 (UY 9033)

Teaching Assistants:

Shupeng Chen (shupeng.chen@mail.utoronto.ca), Dayi Li (dayi.li@mail.utoronto.ca), Miaoshiqi Liu (miaoshiqi.liu@mail.utoronto.ca), Luis Sierra Muntané (luis.sierra@mail.utoronto.ca), Rongqian Zhang (rongqian.zhang@mail.utoronto.ca)

Lecture Time & Location:

| Section | Room | Lecture time |
| --- | --- | --- |
| STA 437 LEC0101 & STA 2105 LEC0101 | BR 200 | W 9-11 (lecture), F 9-10 (tutorials) |
| STA 437 LEC5101 & STA 2105 LEC5101 | SF 1105 | W 13-15 (lecture), F 13-14 (tutorials) |

Suggested Reading

Lecture notes (the file will be expanded and updated as the course progresses so don’t print the whole document)

The lecture notes cover all the material presented in class. Some of the textbooks I used:

Grading scheme (in chronological order)

(20%) midterm 1, (20%) midterm 2, (20%) final project, (40%) final exam

The midterms are short (1 hour) and they focus on simple conceptual/theory questions.

Lectures and timeline (tentative)

| Week | Lectures | Notes | Tutorials | Lecture date | Timeline |
| --- | --- | --- | --- | --- | --- |
| 1 | Introduction, some linear algebra, matrix decompositions. Random vectors, covariance matrices. | slides1, notes1 | RZ: tut1 | 8 Jan | syllabus |
| 2 | Sample statistics. Multivariate normal distribution: definition, basic properties. | notes2 | ML: tut2 | 15 Jan | |
| 3 | MVN: Conditional distribution, conditional independence. | notes3 | LSM: tut3, code | 22 Jan | |
| 4 | Estimation for MVN models. Gaussian Processes: basic definitions and examples. | notes4 | DL: tut4 | 29 Jan | |
| 5 | Non-Gaussian distributions: elliptical distributions, copulas. | slides5, notes5 | midterm1 | 5 Feb | |
| 6 | Non-Gaussian distributions: Copulas (cont’d), Gaussian mixtures. | slides6 | ML: tut6, code | 12 Feb | |
| | Reading week (no class/tutorial) | - | - | - | Final project out |
| 7 | Principal Component Analysis: definition, basic examples, Scree plot. | slides7, notes7 | SCh: tut7 | 26 Feb | |
| 8 | Principal Component Analysis: Affine Subspace Approximation. Computations, Covariance matrix estimation. | slides8, notes8 | RZ: tut8, code, code pdf | 5 Mar | |
| 9 | Multidimensional Scaling. Laplacian eigenmap and UMAP. | slides9, notes9 | midterm2 | 12 Mar | |
| 10 | Canonical Correlation Analysis (CCA). Factor Analysis (FA). | slidesFA, notes10 | DL: tut10 | 19 Mar | |
| 11 | Conditional independence. Graphical models. | slidesCIndep | SCh: tut11 | 26 Mar | |
| 12 | Gaussian Graphical models. Ising model. | rec1, rec2 | LSM | 2 Apr | |

Final project

Submissions: Groups of size 1-2. You have two datasets to choose from. Submit a PDF file with a carefully described data analysis and the code used. Deadline: April 1st.

Expectations and grading: This is an open-ended project whose aim is to make you apply some of the multivariate methods to a real dataset. Although there is no single right question here, we look for a quality analysis that uses the range of methods discussed in class. To help you focus, we give a list of possible questions that could be addressed. There is no need to answer them, though: get creative and follow your curiosity. If the provided dataset is too big, feel free to work with a smaller portion. The only real goal here is to learn the methods.

Note: Be to the point. Avoid long, meaningless AI-generated descriptions. You should be ready to answer questions about your work (methods used and conclusions, not implementation details). We prepared the data in R, but feel free to carry out your analysis in Python or Julia.

Option 1: Autism Brain Imaging Data Exchange (ABIDE) dataset (access the data)

This dataset contains brain activity recordings from 47 individuals who participated in a study at Yale University. The data come from functional MRI (fMRI) scans, which measure brain activity over time. Each subject has a matrix (196 × 110) representing their brain activity.

Guiding Questions for Analysis

  1. Understanding Brain Connectivity:
    • How can we detect patterns of brain connectivity from these fMRI recordings?
    • Can we identify distinct functional networks in the brain using this data?
  2. Comparing Groups (Autism vs. Control):
    • Do brain connectivity patterns differ between individuals with Autism and the Control group?
    • Are there specific brain regions that show different activity between groups?
  3. Exploring Demographic Factors:
    • Does age influence brain connectivity patterns?
    • Are there any differences in connectivity between males and females?
  4. Finding Unique Relationships:
    • Are there connectivity patterns that are specific to individuals with Autism but not present in the Control group?
    • Can we use this data to predict diagnosis based on brain activity alone?
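One common starting point for the connectivity and group-comparison questions above is a per-subject correlation matrix between brain regions, compared across groups after a Fisher z-transform. The sketch below is only illustrative and runs on synthetic data. It assumes that each 196 × 110 matrix has time points in rows and brain regions in columns (the orientation is not stated above), and the group labels, the group sizes, and the function name `fisher_z_connectivity` are hypothetical.

```python
import numpy as np

def fisher_z_connectivity(scan):
    """Region-by-region Pearson correlations, Fisher z-transformed.

    Assumes rows are time points and columns are brain regions.
    """
    corr = np.corrcoef(scan, rowvar=False)
    np.fill_diagonal(corr, 0.0)  # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(corr)

rng = np.random.default_rng(0)
n_time, n_regions = 196, 110

# Synthetic stand-in for the real scans: 5 "autism" and 5 "control" subjects.
scans = [rng.standard_normal((n_time, n_regions)) for _ in range(10)]
is_autism = np.array([True] * 5 + [False] * 5)

# One z-transformed connectivity matrix per subject: shape (10, 110, 110).
z = np.stack([fisher_z_connectivity(s) for s in scans])

# Difference of group-mean connectivity for every pair of regions.
diff = z[is_autism].mean(axis=0) - z[~is_autism].mean(axis=0)

# Region pair with the largest absolute difference in mean connectivity.
i, j = np.unravel_index(np.abs(diff).argmax(), diff.shape)
print(diff.shape, (i, j))
```

This is purely exploratory: with 110 regions there are thousands of region pairs, so any formal comparison would need a test that accounts for multiple comparisons.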

Option 2: Credit Default Swap (CDS) dataset (access the data)

Imagine a company takes out a big loan. The lender worries: What if the company can’t pay it back? To manage this risk, financial markets offer Credit Default Swaps (CDS)—a type of insurance for loans.

This dataset includes CDS spreads for over 600 companies across 10 different contract maturities (tenors). Since spreads vary over time and across companies, analyzing them can reveal how financial markets assess risk under different conditions.

Guiding Questions for Analysis

  1. Risk Patterns & Market Conditions
    • How do CDS spreads behave in calm vs. crisis periods (e.g., during COVID-19)?
    • Are some industries or companies consistently seen as high-risk?
  2. Multivariate Relationships & Dependencies
    • How do CDS spreads for different maturities (short-term vs. long-term) move together for the same company?
    • Can we use Principal Component Analysis (PCA) or Canonical Correlation Analysis (CCA) to find key risk factors driving CDS spreads?
  3. Clustering & Risk Groups
    • Can we identify clusters of companies that have similar risk trends? (e.g., using hierarchical clustering or k-means)
    • Do companies within the same industry tend to have similar CDS spread behavior?
  4. Advanced Dependency Analysis
    • How do time-varying dependencies between companies evolve? (e.g., using copulas to measure joint risk behavior)
    • Can we model the spillover effect—how risk changes in one company affect others?
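As a rough illustration of the PCA question above, the sketch below computes principal components from the SVD of a centred data matrix. It assumes the spreads are arranged with one row per observation (for example a company, or a company on a given date) and one column per tenor. The data are synthetic, generated from a hypothetical "level" factor and a "slope" factor, and the function name `pca` is not taken from the course material.

```python
import numpy as np

def pca(X):
    """PCA via SVD of the column-centred data matrix.

    Returns (loadings, scores, explained_variance_ratio).
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt.T, U * s, explained

rng = np.random.default_rng(1)
n_obs, n_tenors = 600, 10

# Synthetic spreads: a level factor moves all tenors together,
# a slope factor tilts short tenors against long tenors.
level = rng.standard_normal((n_obs, 1))
slope = rng.standard_normal((n_obs, 1))
tilt = np.linspace(-1.0, 1.0, n_tenors)
noise = 0.1 * rng.standard_normal((n_obs, n_tenors))
X = 3.0 * level + slope * tilt + noise

loadings, scores, explained = pca(X)
print(np.round(explained[:3], 3))  # share of variance of the first three components
```

On real spreads, the columns may need to be log-transformed or standardized first, since spread levels can differ widely across companies; whether to do so is a modelling choice to justify in the report.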

Appendix: Report Structure and Guidelines for the Final Project

This appendix provides guidelines on how to structure your final project report. While the project remains open-ended, following this structure will help ensure a clear and well-organized submission.

1. Introduction

2. Data and Preprocessing

3. Methodology

Describe the multivariate methods used and justify their relevance to your research question.

4. Results


Practice

This is a list of exercises that should be useful when preparing for the final exam. I cleaned up the exercises, so the numbers below refer to the newest version of the notes. The list is incomplete and covers only Chapters 1-6 for now. The rest is coming soon.

Chapter 1: 1-5,7-10,12,16,19,20,27,28

Chapter 2: 2-5,7,9,10,16,18-20

Chapter 3: 2,4,15,17,19,21,23,25-29,38

Chapter 4: 1-11

Chapter 5: 2,12-14

Chapter 6: 1,3-9