Biostat 212a Homework 1

Due Jan 25, 2026 @ 11:59PM

Author

YOUR NAME and UID

Published

January 11, 2026

1 Filling gaps in lecture notes (10% pts)

Consider the regression model \[ Y = f(X) + \epsilon, \] where \(\operatorname{E}(\epsilon) = 0\).

1.1 Optimal regression function

Show that the choice \[ f_{\text{opt}}(X) = \operatorname{E}(Y | X) \] minimizes the mean squared prediction error \[ \operatorname{E}\{[Y - f(X)]^2\}, \] where the expectation averages over variations in both \(X\) and \(Y\). (Hint: condition on \(X\).)
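
Before writing the proof, it can help to check the claim numerically. The following is a minimal simulation sketch, not a proof and not part of the required answer. The true regression function `sin(2 * x)`, the noise level, the sample size, and the two competing predictors are all assumptions chosen for illustration.

```r
# Numerical sanity check (illustrative assumptions throughout):
# compare the MSE of the conditional mean with two other predictors.
set.seed(212)

n <- 1e5
x <- runif(n, -2, 2)

# Assumed true regression function, so that E(Y | X = x) = sin(2 * x)
f_true <- function(x) sin(2 * x)
y <- f_true(x) + rnorm(n, sd = 0.5)

# Empirical mean squared prediction error of a vector of predictions
mse <- function(pred) mean((y - pred)^2)

mse_opt    <- mse(f_true(x))          # conditional mean E(Y | X)
mse_shift  <- mse(f_true(x) + 0.3)    # conditional mean plus a constant shift
mse_linear <- mse(fitted(lm(y ~ x)))  # least squares line in x

print(c(optimal = mse_opt, shifted = mse_shift, linear = mse_linear))
```

Under these assumptions the conditional mean should attain an MSE close to the noise variance \(0.5^2 = 0.25\), and both alternatives should do worse.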

1.2 Bias-variance trade-off

Given an estimate \(\hat f\) of \(f\), show that the test error at a point \(x_0\) can be decomposed as \[ \operatorname{E}\{[y_0 - \hat f(x_0)]^2\} = \underbrace{\operatorname{Var}(\hat f(x_0)) + [\operatorname{Bias}(\hat f(x_0))]^2}_{\text{MSE of } \hat f(x_0) \text{ for estimating } f(x_0)} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}, \] where the expectation averages over the variability in \(y_0\) and \(\hat f\).
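
The decomposition can also be checked by Monte Carlo before proving it. The sketch below is only an illustration under assumed settings: the true function `sin(2 * x)`, the noise standard deviation `sigma`, the test point `x0`, the training size `n`, and the choice of a simple linear fit as \(\hat f\) are all hypothetical and not taken from the exercise.

```r
# Monte Carlo check of the bias-variance decomposition at one test point.
# All settings below are illustrative assumptions.
set.seed(212)

f_true <- function(x) sin(2 * x)  # assumed true regression function
sigma  <- 0.5                     # assumed noise standard deviation
x0     <- 1                       # test point
n      <- 50                      # training set size
n_rep  <- 2000                    # number of simulated training sets

one_rep <- function() {
  # Draw a fresh training set and fit a (deliberately misspecified) line
  x <- runif(n, -2, 2)
  y <- f_true(x) + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  fhat0 <- predict(fit, newdata = data.frame(x = x0))
  # Draw an independent test response at x0
  y0 <- f_true(x0) + rnorm(1, sd = sigma)
  c(fhat0 = unname(fhat0), y0 = y0)
}

sims <- t(replicate(n_rep, one_rep()))

test_err <- mean((sims[, "y0"] - sims[, "fhat0"])^2)  # left-hand side
variance <- var(sims[, "fhat0"])                      # Var of fhat(x0)
bias_sq  <- (mean(sims[, "fhat0"]) - f_true(x0))^2    # squared bias

print(c(test_error    = test_err,
        decomposition = variance + bias_sq + sigma^2))
```

The two printed numbers should agree up to Monte Carlo error. Increasing `n_rep` tightens the agreement.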

2 ISL Exercise 2.4.3 (10% pts)

library(tidyverse)
fit <- lm(sales ~ TV, data = )

3 ISL Exercise 2.4.4 (10% pts)

4 ISL Exercise 2.4.10 (30% pts)

You can read the Boston data set directly from the URL https://raw.githubusercontent.com/ucla-biostat-212a/2026winter/master/slides/data/Boston.csv. Documentation for the Boston data set is here.

library(tidyverse)

Boston <- read_csv("https://raw.githubusercontent.com/ucla-biostat-212a/2026winter/master/slides/data/Boston.csv", col_select = -1) %>% 
  print(width = Inf)

# A tibble: 506 × 13
      crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio lstat  medv
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  4.98  24
 2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8  9.14  21.6
 3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  4.03  34.7
 4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7  2.94  33.4
 5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7  5.33  36.2
 6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7  5.21  28.7
 7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2 12.4   22.9
 8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2 19.2   27.1
 9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2 29.9   16.5
10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2 17.1   18.9
# ℹ 496 more rows

5 ISL Exercise 3.7.3 (20% pts)

6 ISL Exercise 3.7.15 (20% pts)

7 Bonus question (extra credit)

For multiple linear regression, show that \(R^2\) is equal to the squared correlation between the response vector \(\mathbf{y} = (y_1, \ldots, y_n)^T\) and the fitted values \(\hat{\mathbf{y}} = (\hat y_1, \ldots, \hat y_n)^T\). That is, \[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = [\operatorname{Cor}(\mathbf{y}, \hat{\mathbf{y}})]^2. \]
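
A quick numerical check of the identity may be useful before attempting the proof. The sketch below uses R's built-in `mtcars` data and the model `mpg ~ wt + hp`, both chosen here only for illustration. It assumes the model includes an intercept, which `lm()` adds by default.

```r
# Numerical check that R^2 equals the squared correlation between y and
# the fitted values. Data set and predictors are illustrative choices.
fit  <- lm(mpg ~ wt + hp, data = mtcars)  # intercept included by default
y    <- mtcars$mpg
yhat <- fitted(fit)

rss <- sum((y - yhat)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)  # total sum of squares

r2_def <- 1 - rss / tss      # R^2 from its definition
r2_cor <- cor(y, yhat)^2     # squared correlation of y and yhat

print(c(definition = r2_def,
        squared_cor = r2_cor,
        from_summary = summary(fit)$r.squared))
```

All three numbers should coincide up to floating-point error. Agreement on one data set does not replace the proof.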