This book Chapter and the code on this page consider functions as predictors in models with scalar outcomes. We focus on linear scalar-on-function regression (SoFR), starting with the general motivation and building intuition through exploratory analyses and careful interpretation of coefficients. Methods using unpenalized basis expansions are implemented in traditional software, while estimation and inference for SoFR using penalized splines are conducted using the refund and mgcv packages.
Much of this page will use data from the NHANES to illustrate techniques for modeling scalar outcomes using functional predictors. In particular, we will use MIMS profiles as functional predictors of body mass index (BMI) as a continuous outcome. We will begin with a simple approach and show how this is related to more general linear scalar-on-function regression models. For each participant, we will obtain the average MIMS value in each of 12 two-hour bins and then use these bin averages as predictors in a standard linear regression model. This has the advantage of allowing some degree of flexibility in the association between MIMS values and BMI over the course of the day, while not requiring non-standard modeling techniques or interpretations. Indeed, in our experience this kind of approach can be a good starting point in collaborative settings or serve as a useful check on results from more complex approaches.
The code chunk below imports and organizes the processed NHANES data that will be used on this page. After importing the data, we create a new variable that indicates whether the participant died within two years of their inclusion in the study; retain only those variables in the processed NHANES data that will be relevant for this example; rename MIMS and MIMS_sd to include the suffix _mat to indicate these variables are stored as matrices; restrict the dataset to participants 25 years of age or older; and ensure that the resulting dataframe has the class tibble.
nhanes_df =
readRDS(
here::here("data", "nhanes_fda_with_r.rds")) %>%
mutate(
death_2yr = ifelse(event == 1 & time <= 24, 1, 0)) %>%
select(
SEQN, BMI, age, gender, death_2yr,
MIMS_mat = MIMS, MIMS_sd_mat = MIMS_sd) %>%
filter(age >= 25) %>%
drop_na(BMI) %>%
tibble()
In the next code chunk, we convert MIMS_mat and MIMS_sd_mat to tidyfun objects using tfd(), with the arg argument specifying the grid over which the functions are observed.
nhanes_df =
nhanes_df %>%
mutate(
MIMS_tf = matrix(MIMS_mat, ncol = 1440),
MIMS_tf = tfd(MIMS_tf, arg = seq(1/60, 24, length = 1440)),
MIMS_sd_tf = matrix(MIMS_sd_mat, ncol = 1440),
MIMS_sd_tf = tfd(MIMS_sd_tf, arg = seq(1/60, 24, length = 1440)))
The next code chunk contains two components. The first component creates a new data frame containing average MIMS values in two-hour bins by computing the rolling mean of each MIMS_tf observation with a bin width of 120 minutes, and then evaluating that rolling mean at hours \(1, 3, ..., 23\). The result is saved as MIMS_binned, and for the next step only BMI and MIMS_binned are retained.
The second component of this code chunk fits the regression of BMI on these bin averages. The tf_spread() function produces a wide-format dataframe with columns corresponding to each bin average in the MIMS_binned variable, and the call to lm() regresses BMI on all of these averages.
nhanes_bin_df =
nhanes_df %>%
mutate(
MIMS_binned =
tf_smooth(MIMS_tf, method = "rollmean", k = 120, align = "center"),
MIMS_binned = tfd(MIMS_binned, arg = seq(1, 23, by = 2))) %>%
select(BMI, MIMS_binned)
## setting fill = 'extend' for start/end values.
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `MIMS_binned = tf_smooth(MIMS_tf, method = "rollmean", k = 120,
## align = "center")`.
## Caused by warning:
## ! non-equidistant arg-values in 'MIMS_tf' ignored by rollmean.
fit_binned =
lm(BMI ~ .,
data = nhanes_bin_df %>% tf_spread(MIMS_binned))
We now show the binned predictors and the resulting coefficients using plotting tools in tidyfun. The first plot generated in the code chunk below shows the MIMS_binned variable for the first 500 rows (other data points are omitted to prevent overplotting). The second plot shows the coefficient for each bin-averaged MIMS value. We create this plot by tidying the model fit stored in fit_binned and omitting the intercept term. An hour variable is then created by manipulating the coefficient names, and upper and lower 95% confidence bounds for each hour are obtained by adding and subtracting 1.96 times the standard error from the estimates. We plot the estimates as lines and points, and add error bars for the confidence intervals. The two panels are combined using the patchwork package.
The binned MIMS profiles show expected diurnal patterns of activity, where there is generally lower activity in the night and higher activity in the day. The results of the regression using bin-averaged MIMS values as predictors for BMI (right panel) are consistent with these observed trends. Coefficients for bin averages during the day are generally below zero and some (2-hour intervals between 8-10AM and 6-8PM) are statistically significant.
ggp_binned =
nhanes_bin_df %>%
slice(1:500) %>%
ggplot(aes(y = MIMS_binned, color = BMI)) +
geom_spaghetti() +
geom_meatballs() +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Binned MIMS")
ggp_coefs =
fit_binned %>%
broom::tidy() %>%
filter(term != "(Intercept)") %>%
mutate(
hour = str_replace(term, "MIMS_binned_", ""),
hour = as.numeric(hour),
ub = estimate + 1.96 * std.error,
lb = estimate - 1.96 * std.error) %>%
ggplot(aes(x = hour, y = estimate)) +
geom_point() + geom_path() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = .5) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
ggp_binned + ggp_coefs
We will motivate a shift to scalar-on-function regression by noting that the model using bin averages can be expressed in terms of a functional coefficient expressed as a step function and functional predictors by integrating over their product. This is motivated by the following expression, with technical details provided in the book Chapter text: \[\begin{equation} \sum_{b=1}^{12} \beta_{b} \overline{X}_{ib} \approx \int_{0}^{24} \beta^{step}(s) X_{i}(s) \, ds. \label{eq:ch4_step_int} \end{equation}\]
The code chunk below creates a dataframe that contains the step coefficient function defined above.
stepfun_coef_df =
fit_binned %>%
broom::tidy() %>%
filter(term != "(Intercept)") %>%
select(estimate, std.error) %>%
slice(rep(1:12, each = 120)) %>%
mutate(
method = "Step",
estimate = .5 * estimate,
arg = seq(1/60, 24, length = 1440)) %>%
tf_nest(.id = method)
A plot showing the complete (not binned) MIMS_tf trajectories alongside the step coefficient function is shown below. Comparing this plot with the one above, in which bin averages are shown alongside regression coefficients, is intended to emphasize that these approaches are identical up to a re-scaling of the regression parameters. Connecting the bin average approach to a truly functional coefficient is an intuitive starting point for the more flexible linear SoFR models considered next.
ggp_stepfun_mims =
nhanes_df %>%
slice(1:500) %>%
ggplot(aes(y = MIMS_tf, color = BMI)) +
geom_spaghetti() +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "MIMS")
ggp_stepfun_coef =
stepfun_coef_df %>%
ggplot(aes(y = estimate)) +
geom_spaghetti(linewidth = 1.1, alpha = .75) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
ggp_stepfun_mims + ggp_stepfun_coef
We now introduce the linear scalar-on-function regression model in which there is only one functional predictor and no scalar covariates. This is analogous to “simple” linear regression, and will be useful for introducing key concepts in interpretation, estimation, and inference for models with functional predictors. Later sections will consider extensions of this approach.
For participants \(i = 1, \dots, n\), let \(Y_i\) be a scalar response of interest and \(X_{i}: S\rightarrow \mathbb{R}\) be a functional predictor observed over the domain \(S\). The simple linear scalar-on-function regression model is \[\begin{equation} Y_i=\beta_0 + \int_{S} \beta_1 (s)X_{i}(s)\, ds + \epsilon_i \end{equation}\] where \(\beta_0\) is a population-level scalar intercept, \(\beta_1(s): S\rightarrow \mathbb{R}\) is the functional coefficient of interest, and \(\epsilon_i\) is a residual with mean zero and variance \(\sigma^2_{\epsilon}\). This model specification generalizes the specific case where a coefficient function constrained to be a step function approximated a regression model based on bin averages. The coefficient function can now be more flexible, but the interpretation is analogous: \(\beta_1(s)\) defines the weight given to predictor functions at each \(s \in S\), and the product of \(X_{i}(s)\) and \(\beta_1(s)\) is integrated to obtain the overall contribution of the functional term in the model. At each time point, \(\beta_1(s)\) is the effect on the outcome of a one unit increase in \(X_{i}(s)\) when all other values \(X_{i}(s')\), \(s'\in S\), \(s'\neq s\), are kept unchanged. This interpretation is admittedly somewhat awkward, but unavoidable when regressing a scalar outcome on a functional predictor. Note that the coefficients adjust for effects of all other \(s'\in S\).
The innovation in scalar-on-function regression, compared to nonfunctional models, is a coefficient function that integrates with covariate functions to produce scalar terms in the linear predictor. The corresponding challenge is developing an estimation strategy that minimizes \[\begin{equation}\min_{\beta_0, \beta_{1}(s)} \sum_{i = 1}^{n} \left\{ Y_i - \beta_0 - \int_{S} \beta_1 (s)X_{i}(s)\, ds\right\}^2 \label{eq:ch4_sofr_ss} \end{equation}\] in a way that is flexible and computationally feasible.
Our first approach expands the coefficient function \(\beta_1(s)\) using a relatively low-dimensional basis expansion; doing so leads to a more familiar setting in which scalar basis coefficients are the target of estimation and inference. Specifically, let \(\beta_1(s) = \sum_{k = 1}^{K}\beta_{1k}B_{k}(s)\) where \(B_1(s), \dots, B_{K}(s)\) is a collection of basis functions. Substituting this expansion into the integral term gives \[\begin{equation} \begin{array}{lll} E[Y_i] &=& \beta_0 + \int_S \beta_1 (s)X_{i}(s)\, ds \\ &=& \beta_0 + \sum_{k=1}^{K} \left[ \int_{S} B_{k}(s)X_{i}(s)\, ds\right] \beta_{1k} \\ &=& \beta_0 + \mathbf{C}^t_{i} \boldsymbol{\beta}_{1} \end{array} \end{equation}\] where \(C_{ik} = \int_{S} B_{k}(s)X_{i}(s)\, ds\), \(\mathbf{C}_i = [C_{i1}, \dots, C_{iK}]^{t}\), and \(\boldsymbol{\beta}_1=(\beta_{11},\ldots,\beta_{1K})^t\) is the vector of basis coefficients. The result of the basis expansion for the coefficient function, therefore, is a recognizable multiple linear regression with carefully-defined scalar covariates and corresponding coefficients. Specifically, let \(\mathbf{y}=(y_1,\ldots,y_n)^t\), the matrix \(\mathbf{X}\) be constructed by row-stacking vectors \([1, C_{i1}, \dots, C_{iK}]\), and \(\boldsymbol{\beta} = [\beta_{0}, \beta_{11}, \dots, \beta_{1K}]^{t}\) be the vector of regression coefficients including the population intercept and spline coefficients. This suggests a standard OLS approach to estimating \(\boldsymbol{\beta}\). Note that functional predictors are actually observed over a finite grid, and the definite integrals that define the \(C_{ik}\) are in practice estimated using numeric quadrature.
Many options for the basis have been considered in the expansive literature for SoFR. To illustrate the ideas in this Section, we start with a quadratic basis and obtain the corresponding estimate of the coefficient function \(\beta_{1}(s)\). We define the basis \[\begin{equation} \left\{\begin{array}{lll} B_{1}(s) = 1\;,\\ B_{2}(s) = s\;,\\ B_{3}(s) = s^2\;, \\ \end{array}\right. \label{eq:ch4_quad_basis} \end{equation}\] and, given this, obtain scalar predictors \(C_{ik}\) that can be used in standard linear model software. The basis expansion includes an intercept term, which should not be confused with the model’s intercept, \(\beta_0\). The intercept in the basis expansion allows the coefficient function \(\beta_1(s)\) to shift as needed, while the population intercept is the expected value of the response when the predictor function is zero over the entire domain.
Continuing to focus on BMI as an outcome and MIMS as a functional predictor, the code chunk below defines the quadratic basis and obtains the numeric integrals in the \(C_{ik}\). The basis matrix is defined in terms of epoch_arg and given appropriate column names. We construct the data frame num_int_df which contains the necessary numeric integrals. We retain the row names of the matrix product in the resulting dataframe, and convert the resulting SEQN variable to a numeric variable for consistency with nhanes_df.
epoch_arg = seq(1/60, 24, length = 1440)
B = cbind(1, epoch_arg, epoch_arg^2)
colnames(B) = c("int", "lin", "quad")
num_int_df =
as_tibble(
(nhanes_df$MIMS_mat %*% B) * (1/ 60),
rownames = "SEQN") %>%
mutate(SEQN = as.numeric(SEQN))
The next code chunk implements the regression and processes the results. We first define a new data frame, nhanes_quad_df, that contains the variables relevant for the scalar-on-function regression of BMI on MIMS trajectories, expanding the coefficient function \(\beta_1(s)\) in terms of the quadratic basis defined in the previous code chunk. This is created by joining two previously defined dataframes, and keeping only BMI and the columns corresponding to the numeric integrals \(C_{ik}\). Using nhanes_quad_df, we fit a linear regression of BMI on the \(C_{ik}\); the formula specification includes a population intercept to reiterate that the model's intercept \(\beta_0\) is distinct from the basis expansion's intercept, which appears in \(C_{i1}\). Finally, we combine the coefficient estimates in fit_quad with the basis matrix to obtain the estimate of the coefficient function. We compute the matrix product of B and the coefficients of fit_quad (omitting the population intercept), and convert the result to a tidyfun object. The coefficient function is stored in a data frame called quad_coef_df, along with a variable method with the value Quadratic.
nhanes_quad_df =
left_join(nhanes_df, num_int_df, by = "SEQN") %>%
select(BMI, int, lin, quad)
fit_quad =
nhanes_quad_df %>%
lm(BMI ~ 1 + int + lin + quad, data = .)
quad_coef_df =
tibble(
method = "Quadratic",
estimate = tfd(t(B %*% coef(fit_quad)[-1]), arg = epoch_arg))
This general strategy for estimating coefficient functions can be readily adapted to other basis choices. The next code defines a cubic B-spline basis with eight degrees of freedom; this is more flexible than the quadratic basis, while also ensuring a degree of smoothness that is absent from the stepwise estimate of the coefficient function. Once the basis is defined, the remaining steps in the code chunk below mirror those used to estimate the coefficient function using a quadratic basis, with a small number of minor changes. The basis is generated using the bs() function in the splines package, and there are now eight basis functions instead of three. There is a corresponding increase in the number of columns in num_int_df, and for convenience we write the formula in the lm() call as BMI ~ 1 + . instead of listing columns individually. The final step in this code chunk constructs the estimated coefficient function by multiplying the matrix of basis functions evaluated over \(\mathbf{s}\) by the vector of B-spline coefficients; the result is stored in a data frame called bspline_coef_df, now with a variable method taking the value B-Spline. The similarity between this model fitting and the one using a quadratic basis is intended to emphasize that the basis expansion approach to fitting a linear SoFR can be easy to implement for a broad range of basis choices.
B_bspline = splines::bs(epoch_arg, df = 8, intercept = TRUE)
colnames(B_bspline) = str_c("BS_", 1:8)
num_int_df =
as_tibble(
(nhanes_df$MIMS_mat %*% B_bspline) * (1/ 60),
rownames = "SEQN") %>%
mutate(SEQN = as.numeric(SEQN))
nhanes_bspline_df =
left_join(nhanes_df, num_int_df, by = "SEQN") %>%
select(BMI, BS_1:BS_8)
fit_bspline =
lm(BMI ~ 1 + ., data = nhanes_bspline_df)
bspline_coef_df =
tibble(
method = "B-Spline",
estimate =
tfd(t(B_bspline %*% coef(fit_bspline)[-1]), arg = epoch_arg))
We show how to display all coefficient function estimates in the next code chunk. The first step uses bind_rows() to combine data frames containing the stepwise, quadratic, and B-spline estimated coefficient functions. The result is a data frame with three rows, one for each estimate, containing the method and estimate variables. We plot the estimates using ggplot() and geom_spaghetti() by setting the aesthetics for y and color to estimate and method, respectively. In the resulting plot, the coefficient functions have some broad similarities across basis specifications. That said, the quadratic basis has a much higher estimate in the nighttime than other methods because of the constraints on the shape of the coefficient function. The stepwise coefficient has the bin average interpretation, but the lack of smoothness across bins is scientifically implausible. Of the coefficients presented so far, then, the B-spline basis with eight degrees of freedom is our preference as a way to include both flexibility and smoothness in the estimate of \(\beta_1(\cdot)\).
bind_rows(stepfun_coef_df, quad_coef_df, bspline_coef_df) %>%
ggplot(aes(y = estimate, color = method)) +
geom_spaghetti(alpha = 1, linewidth = 1.2) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
Our approach to scalar-on-function regression using smoothness penalties relies on key insights that connect functional regression to scatterplot smoothing and mixed models. By expressing the coefficient function using a spline expansion it is possible to cast scalar-on-function regression in terms of a linear regression model. While the design matrix for that model has a specific construction, once it is available, usual model fitting approaches can be used directly. Analogously, the use of penalized splines requires the careful construction of design and penalty matrices, but once these are in place, the techniques developed for scatterplot smoothing apply directly. Critically, this includes casting scalar-on-function regression as a mixed model; this connection makes it possible to use the rich collection of techniques for mixed model estimation and inference for functional regression.
Broadly speaking, in functional regression models we prefer to use rich spline basis expansions because they are flexible and numerically stable. We include penalties that encourage smoothness in the target of estimation, which generally take the form of penalties on the overall magnitude of the squared second derivative of the estimand. This combination results in a quadratic penalty with a tuning parameter that controls the degree of smoothness; higher penalization leads to “flatter” coefficient functions, while lower penalization allows more flexible but “wigglier” coefficients.
As before, let \(\beta_1(s) = \sum_{k = 1}^{K}\beta_{1k}B_{k}(s)\) where \(B_1(s), \dots, B_{K}(s)\) is a collection of basis functions. We add a smoothness-inducing penalty of the form \(\int_S \{\beta_{1}''(s)\}^2 ds\) to the minimization criterion. Intuitively, estimates of \(\beta_{1}(s)\) that include many sharp turns will have squared second derivatives \(\{\beta_{1}''(s)\}^2\) with many large values, while smooth estimates of \(\beta_{1}(s)\) will have second derivatives that are close to 0 over the domain \(S\). To implement the squared second derivative penalty, let \(\mathbf{P}\) be the \(K \times K\) matrix with the \((i,j)^{\text{th}}\) entry equal to \(\int_S B_i''(s)B_j''(s)ds\).
Let \(\mathbf{X}\) be a \(n \times (K+1)\) matrix in which the \(i^{\text{th}}\) row is \([1, C_{i1}, \dots,C_{iK}]\), and let \(\boldsymbol{\beta}\) be the \((K+1)\) dimensional column vector that concatenates the population intercept \(\beta_0\) and the spline coefficients \(\boldsymbol{\beta}_1\). Adding the second derivative penalty yields a penalized sum of squares \[\begin{equation}\min_{\boldsymbol{\beta}}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\lambda \boldsymbol{\beta}^t\mathbf{D}\boldsymbol{\beta}\;. \end{equation}\] Here \(\lambda\geq 0\) is a scalar tuning parameter and \(\mathbf{D}\) is the matrix given by \[\mathbf{D}=\begin{bmatrix} \mathbf{0}_{1 \times 1} & \mathbf{0}_{K \times 1} \\ \;\mathbf{0}_{1 \times K} & \mathbf{P}\\ \end{bmatrix}\;,\] where \(\mathbf{0}_{a\times b}\) is a matrix of zero entries with \(a\) rows and \(b\) columns. For fixed values of \(\lambda\), a closed form solution for \(\widehat{\boldsymbol{\beta}}\) is \(\widehat{\boldsymbol{\beta}}=(\mathbf{X}^t\mathbf{X}+\lambda \mathbf{D})^{-1}\mathbf{X}^t\mathbf{y}\). Varying \(\lambda\) from 0 to \(\infty\) will induce no penalization and full penalization, respectively, and choosing an appropriate tuning parameter is an important practical challenge. As elsewhere in this Chapter, though, we emphasize that the familiar form should not mask the novelty and innovation of this model, which implements penalized spline smoothing to estimate the coefficient function in a scalar-on-function regression.
We illustrate these ideas in the next code chunk, which continues to use BMI as an outcome and MIMS as a functional predictor. The code draws on elements that have been seen previously. We first define a B-spline basis with 30 degrees of freedom evaluated over the finite grid epoch_arg. Using functionality in the splines2 package, we obtain the second derivative of each spline basis function evaluated over the same finite grid. The elements of the penalty matrix \(\mathbf{P}\) are obtained through numeric approximations to the integrals. The design matrix \(\mathbf{X}\) is obtained by adding a column taking the value \(1\) everywhere to the terms \(C_{ik}\). Next, we construct the matrix \(\mathbf{D}\). The response vector \(\mathbf{y}\) is extracted from nhanes_df, and we choose high and low values for the tuning parameter \(\lambda\). Given all of these elements, we estimate the coefficient vector \(\boldsymbol{\beta}\) to obtain coef_high and coef_low.
B_bspline = splines::bs(epoch_arg, df = 30, intercept = TRUE)
sec_deriv = splines2::bSpline(epoch_arg, df = 30, intercept = TRUE, derivs = 2)
P = t(sec_deriv) %*% sec_deriv * (1 / 60)
X = cbind(1, (nhanes_df$MIMS_mat %*% B_bspline) * (1 / 60))
D = rbind(0, cbind(0, P))
y = nhanes_df$BMI
lambda_high = 10e6
lambda_low = 100
coef_high = solve(t(X) %*% X + lambda_high * D) %*% t(X) %*% y
coef_low = solve(t(X) %*% X + lambda_low * D) %*% t(X) %*% y
The estimated coefficient functions that correspond to the estimates in coef_high and coef_low can be produced through simple modifications to the previous code. A figure below shows the resulting coefficient functions. Note that we specify the color for these curves for consistency with later plots.
Recasting the penalized sum of squares as a mixed model allows the data-driven estimation of tuning parameters; more broadly, this opens the door to using a wide range of methods for mixed model estimation and inference in functional regression settings. Below, we construct the design matrix \(\mathbf{C}\) using numeric integration. The penalty matrix \(\mathbf{P}\), which contains the numeric integral of the squared second derivative of the basis functions, is reused from a prior code chunk. We pass these as arguments into mgcv::gam() by specifying C in the formula that defines the regression structure, and then use the paraPen argument to supply our penalty matrix P for the design matrix C. Lastly, we specify the estimation method to be REML.
C = (nhanes_df$MIMS_mat %*% B_bspline) * (1 / 60)
fit_REML_penalty =
gam(y ~ 1 + C, paraPen = list(C = list(P)),
method = "REML")
Code that multiplies the basis functions by the resulting spline coefficients to obtain the estimated coefficient function is shown below. We plot the result and include previous penalized estimates, based on high and low values of the tuning parameter \(\lambda\), which over- and under-smoothed the coefficient function. The data-driven approach to tuning parameter selection yields an estimate that is smooth but time-varying.
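A minimal sketch of that construction is given below; the data frame name REML_penalty_df is what the plotting code further down expects, and the method label is our own.
# Sketch: multiply the B-spline basis by the penalized spline coefficients
# (dropping the intercept) to form the estimated coefficient function.
REML_penalty_df =
  tibble(
    method = "mgcv::gam()",
    estimate = tfd(t(B_bspline %*% coef(fit_REML_penalty)[-1]), arg = epoch_arg))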
Many of the strengths of the mgcv package can be leveraged through the refund package, which adds functionality, quality of life features, and user interfaces relevant to FDA. For SoFR, this means that instead of building models using knowledge of the linear algebra underlying penalized spline estimation, we only require correctly specified data structures and syntax for refund::pfr(). In the code chunk below, we regress BMI on MIMS using the matrix variable MIMS_mat and the linear specification in lf(). We additionally indicate the grid over which predictors are observed, and specify the use of REML to choose the tuning parameter. The next component of this code chunk extracts the resulting coefficient and structures it for plotting.
pfr_fit =
pfr(
BMI ~ lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
method = "REML", data = nhanes_df)
pfr_coef_df =
coef(pfr_fit) %>%
mutate(method = "refund::pfr()") %>%
tf_nest(.id = method, .arg = MIMS_mat.argvals) %>%
rename(estimate = value)
The next figure displays results obtained using mgcv::gam() and refund::pfr(); there are some minor differences in the default model implementations and these results do not align perfectly, although they are very similar and can be made exactly the same. For reference, we also show the coefficient function based on an unpenalized B-spline basis with eight degrees of freedom.
bind_rows(bspline_coef_df, REML_penalty_df, pfr_coef_df) %>%
ggplot(aes(y = estimate, color = method)) +
geom_spaghetti(alpha = 1, linewidth = 1.2) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
## Warning: concatenating functions on different grids.
## Warning: inputs have different resolutions, result has resolution =0.001
To this point, we have developed unpenalized and penalized estimation strategies using basis expansions constructed independently of observed data – step functions, polynomial bases, B-splines. The use of a data-driven basis obtained through FPCA is a popular alternative, and indeed was among the first approaches to SoFR. This approach, sometimes referred to as “Functional Principal Components Regression” (FPCR), has useful numerical features that can make it an appealing option in some practical settings.
The code chunk below implements the scalar-on-function regression of BMI on MIMS using a data-driven basis. In the first lines of code, we use the function refunder::rfr_fpca() to conduct FPCA. This function takes a tf vector as an input; for data observed over a regular grid, this serves as a wrapper for fpca.face(). We specify npc = 4 to return \(K = 4\) principal components. The remainder of this code chunk is essentially copied from above, with naming conventions similar to previous code. We regress BMI on covariates \(C_{ik}\) obtained through numeric integration, and save the result as fit_fpcr_int.
nhanes_fpca =
rfr_fpca("MIMS_tf", data = nhanes_df, npc = 4)
## Warning: domain for tfb_fpc can't be larger than observed arg-range -- extrapolating FPCs is a bad idea.
## domain reset to [0.017,24]
B_fpca = nhanes_fpca$efunctions * sqrt(60)
colnames(B_fpca) = str_c("efunc_", 1:4)
num_int_df =
as_tibble(
(nhanes_df$MIMS_mat %*% B_fpca) * (1/60),
rownames = "SEQN") %>%
mutate(SEQN = as.numeric(SEQN))
nhanes_fpcr_df =
left_join(nhanes_df, num_int_df, by = "SEQN") %>%
select(BMI, efunc_1:efunc_4)
fit_fpcr_int =
lm(BMI ~ 1 + ., data = nhanes_fpcr_df)
We used numerical integration to obtain the \(C_{ik}\) in the previous code, but an important advantage of FPCR is that principal component scores are the projection of centered functional observations onto eigenfunctions. That is, we can use FPC scores as predictors rather than obtaining values through numeric integration.
The fact that FPCR can be carried out using a regression on FPC scores directly is a key strength. There are many practical settings where the numeric integration used to construct the design matrices throughout this chapter – for pre-specified and data-driven basis expansions – is not possible. For functional data that are sparsely observed or that are measured with substantial noise, numeric integration can be difficult or impossible. In both those settings, FPCA methods can produce estimates of eigenfunctions and the associated scores and thereby enable scalar-on-function regression in a wide range of real-world settings. At the other extreme, for very high-dimensional functional observations, it may be necessary to conduct dimension reduction as a pre-processing step to reduce memory and computational burdens. The FPCR gives an interpretable scalar-on-function regression in this setting as well. That said, because FPCR is a regression on FPC scores, only effects that are captured by the directions of variation contained in the FPCs can be accounted for using this approach. Moreover, the smoothing of the estimated coefficient function depends on the intrinsic choice of the number of eigenfunctions, \(K\). This tends to be less problematic when one is interested in prediction performance, but may have large effects on the estimation of the \(\beta_1(s)\) coefficient.
In the code below, we extract scores from the FPCA object nhanes_fpca obtained in a previous code chunk. Mirroring code elsewhere, we create a dataframe containing the scores; merge this with nhanes_df and retain BMI and the predictors of interest; and fit a linear model, storing the results as fit_fpcr_score to reflect that we have performed FPCR using score estimates.
C = nhanes_fpca$scores * (sqrt(60) / 60)
colnames(C) = str_c("score_", 1:4)
rownames(C) = nhanes_df$SEQN
nhanes_score_df =
as_tibble(
C, rownames = "SEQN") %>%
mutate(SEQN = as.numeric(SEQN))
nhanes_fpcr_df =
left_join(nhanes_df, nhanes_score_df, by = "SEQN") %>%
select(BMI, score_1:score_4)
fit_fpcr_score =
lm(BMI ~ 1 + ., data = nhanes_fpcr_df)
A table showing coefficient estimates from fit_fpcr_int and fit_fpcr_score is shown next. As expected, the intercepts from the two models differ – one is based on centered \(X_{i}(s)\) covariate functions and the other is not – but basis coefficients are nearly identical.
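The chunk that builds the table is not displayed; a possible version, assuming only the two model objects created above and illustrative column names, is sketched below.
# Sketch: place tidied coefficients from the two FPCR fits side by side.
bind_cols(
  broom::tidy(fit_fpcr_int) %>% select(term_int = term, est_int = estimate),
  broom::tidy(fit_fpcr_score) %>% select(term_score = term, est_score = estimate))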
Throughout this Section, we have deferred the important methodological consideration of how to choose the truncation level \(K\). Because our estimation approach does not include any explicit smoothness constraints, the flexibility of the underlying basis controls the smoothness of the resulting coefficient function. For that reason, \(K\) is effectively a tuning parameter. The figure below illustrates this point by showing the coefficient function with \(K=4\) as well as the coefficient function with \(K=12\). The previous code could be modified for this setting, but we implement the helper function nhanes_fpcr to do the model fitting. We also show the coefficient function expressed as a step function for reference. The coefficient function with \(K=4\) is, unsurprisingly, the smoothest, while the coefficient function with \(K = 12\) more closely tracks the step coefficient function.
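The helper itself is not shown here; the sketch below is one possible implementation, assuming the signature nhanes_fpcr(npc, df) implied by the warnings that follow and returning the estimated coefficient function as a tf object.
# Sketch of a possible nhanes_fpcr() helper; details are assumptions.
nhanes_fpcr = function(npc, df) {
  fpca_res = rfr_fpca("MIMS_tf", data = df, npc = npc)
  B_fpc = fpca_res$efunctions * sqrt(60)
  C_fpc = (df$MIMS_mat %*% B_fpc) * (1 / 60)
  fit = lm(df$BMI ~ 1 + C_fpc)
  tfd(t(B_fpc %*% coef(fit)[-1]), arg = seq(1 / 60, 24, length = 1440))
}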
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `fit = map(npc, nhanes_fpcr, df = nhanes_df)`.
## Caused by warning:
## ! domain for tfb_fpc can't be larger than observed arg-range -- extrapolating FPCs is a bad idea.
## domain reset to [0.017,24]
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
We now turn our attention to inference. First, we will develop pointwise confidence intervals that are not adjusted for correlation and multiplicity (not CMA), as defined elsewhere. We refer to these as “unadjusted” confidence intervals, even though they are adjusted for other scalar and/or functional covariates. Later we will discuss correlation (to account for the correlation of \(\widehat{\beta}_1(s_j)\)) and multiplicity (to account for the multiple locations \(s_j\), \(j=1,\ldots,p\)) adjusted confidence intervals. We refer to this double adjustment as correlation and multiplicity adjusted (CMA) inference.
Our approach to unadjusted inference assumes Normality and constructs \((1-\alpha)\) confidence intervals of the form \[\widehat{\beta}_1(s) \pm z_{1-\alpha / 2} \sqrt{\mbox{Var}\{\widehat{\beta}_1(s)\}}\] where \(z_{\alpha}\) is the \(\alpha\) quantile of a standard Normal distribution. Recall that the coefficient \(\widehat{\beta}_1(s)\) was expanded in a basis, \(\widehat{\beta}_1(s)=\mathbf{B}(s)\widehat{\boldsymbol{\beta}}_1\); the variance \(\mbox{Var}\{\widehat{\beta}_1(s)\}\) at any point \(s\in S\) can therefore be obtained by combining the basis with the variance of the estimated coefficients, \(\mbox{Var}\{\widehat{\beta}_1(s)\} = \mathbf{B}(s)\mbox{Var}(\widehat{\boldsymbol{\beta}}_1)\mathbf{B}(s)^{t}\). This is very useful because the basis coefficient vector \(\widehat{\boldsymbol{\beta}}_1\) is finite and quite low dimensional.
When using a fixed basis and no penalization, the resulting inference is familiar from usual linear models. After constructing the design matrix \(\mathbf{X}\), estimating all model coefficients, \(\boldsymbol{\beta}\), using ordinary least squares, and estimating the error variance, \(\sigma_{\epsilon}^2\), from model residuals, \(\mbox{Var}(\widehat{\boldsymbol{\beta}})\) can be obtained as \(\mbox{Var}(\widehat{\boldsymbol{\beta}}) = \widehat{\sigma}_{\epsilon}^2 \left(\mathbf{X}^{t}\mathbf{X} \right)^{-1}\). Indeed, this can be quickly illustrated using previously fit models; in the code chunk below, we use the vcov() function to obtain \(\mbox{Var}(\widehat{\boldsymbol{\beta}})\) from fit_fpcr_int, the linear model object for FPCR with \(K = 4\). We remove the row and column corresponding to the population intercept, and then pre- and post-multiply by the FPCA basis matrix \(\mathbf{B}(\mathbf{s})\) stored in B_fpca. The resulting covariance matrix is \(1440 \times 1440\), and has the variances \(\mbox{Var}\{\widehat{\beta}_1(s)\}\) for each value in the observation grid \(s\in\mathbf{s}\) on the main diagonal. The final part of the code chunk creates the estimate \(\widehat{\beta}_{1}(s)\) as a tf object; uses similar code to obtain the pointwise standard error (as the square root of entries on the diagonal of the covariance matrix); and constructs upper and lower bounds for the 95% confidence interval.
var_basis_coef = vcov(fit_fpcr_int)[-1,-1]
var_coef_func = B_fpca %*% var_basis_coef %*% t(B_fpca)
fpcr_inf_df =
tibble(
method = c("FPCR: 4"),
estimate = tfd(t(B_fpca %*% coef(fit_fpcr_int)[-1]), arg = epoch_arg),
se = tfd(sqrt(diag(var_coef_func)), arg = epoch_arg)
) %>%
mutate(
ub = estimate + 1.96 * se,
lb = estimate - 1.96 * se)
The process for obtaining confidence intervals for penalized spline estimation is conceptually similar. Again, inference is built around the covariance of the basis coefficients and the primary difference is in the calculation of that variance. The necessary step is to perform inference on fixed and random effects in a mixed model. Later we will explore this in detail; for now, we will use the helpful wrapper pfr() for inference. Indeed, the object pfr_coef_df, obtained in a previous code chunk using coef(pfr_fit), already includes a column se containing the pointwise standard error. In the code chunk below, we use this column to construct upper and lower bounds of a 95% confidence interval.
pfr_inf_df =
pfr_coef_df %>%
mutate(
ub = estimate + 1.96 * se,
lb = estimate - 1.96 * se)
Our next code chunk plots the estimates and confidence intervals created in the previous code. We combine the dataframes containing estimates and confidence bounds for the penalized spline and FPCR methods using \(K = 4, 8, 12\). The result is plotted using familiar tools from ggplot and tidyfun, and we use geom_errorband() to plot the confidence bands.
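The chunk itself is not reproduced here; the sketch below uses only the FPCR fit with \(K = 4\) constructed above alongside the penalized spline estimate (the figure in the text also includes FPCR fits with \(K = 8\) and \(K = 12\), produced with the helper described earlier).
# Sketch: combine interval estimates and plot pointwise confidence bands.
bind_rows(fpcr_inf_df, pfr_inf_df) %>%
  ggplot(aes(y = estimate, color = method)) +
  geom_spaghetti(alpha = 1, linewidth = 1.2) +
  geom_errorband(aes(ymax = ub, ymin = lb), linetype = 0) +
  scale_x_continuous(
    breaks = seq(0, 24, length = 5),
    labels = str_c(seq(0, 24, length = 5), ":00")) +
  labs(x = "Time of day (hours)", y = "Coefficient")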
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `fit = map(npc, nhanes_fpcr, df = nhanes_df)`.
## Caused by warning:
## ! domain for tfb_fpc can't be larger than observed arg-range -- extrapolating FPCs is a bad idea.
## domain reset to [0.017,24]
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
## Warning: concatenating functions on different grids.
## Warning: inputs have different resolutions, result has resolution =0.01
## Warning: concatenating functions on different grids.
## Warning: inputs have different resolutions, result has resolution =0.01
## Warning: concatenating functions on different grids.
## Warning: inputs have different resolutions, result has resolution =0.01
## Warning: concatenating functions on different grids.
## Warning: inputs have different resolutions, result has resolution =0.01
In this Section, we will consider several obvious and necessary extensions of the “simple” linear model. We will focus on implementation of models using pfr(). Note that this page only deals with cross-sectional settings; multilevel and longitudinal functional data are the subject of other pages. The emphasis here is that these extensions can be implemented through small changes to familiar functional regression code, with the goal of making SoFR models as easy to fit as linear and semiparametric models.
Most real-data analyses that involve a functional predictor will also include one or more scalar covariates. All methods in the previous section can easily be adapted to this setting. To be explicit, we are now interested in fitting models of the form \[\begin{equation} Y_i=\beta_0 + \int_{S} \beta_1 (s)X_{i}(s)\, ds + \mathbf{Z}_i^t\mathbf{\gamma} + \epsilon_i\;, \end{equation}\] where \(\mathbf{Z}_i\) is a \(Q\times 1\) dimensional vector of scalar covariates for subject \(i\) and \(\mathbf{\gamma}\) is the vector of associated coefficients.
The code chunk below regresses BMI on MIMS_mat as a functional predictor and age and gender as scalar covariates using pfr(); recall that pfr() expects functional predictors to be structured as matrices. Except for the addition of age and gender in the formula, all other elements of the model fitting code have been seen before.
pfr_adj_fit =
pfr(
BMI ~ age + gender + lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
data = nhanes_df)
Moreover, all subsequent steps build directly on previous code chunks. Functional coefficients and standard errors can be extracted using coef() and combined to obtain confidence intervals, and then plotted using tools from tidyfun or other graphics packages. Estimates and inference for non-functional coefficients can be obtained using summary() on the fitted model object.
It is frequently the case that more than one functional predictor is available; often the functions are observed over the same domain, but this is not necessary. The goal is to fit a model of the form \[\begin{equation} Y_i=\beta_0 + \sum_{r=1}^{R} \int_{S_r} \beta_r(s) X_{ir}(s)\, ds + \mathbf{Z}_i^t\mathbf{\gamma} + \epsilon_i \end{equation}\] where \(X_{ir}(s)\), \(1 \leq r \leq R\) are functional predictors observed over domains \(S_r\) with associated coefficient functions \(\beta_r(s)\). The interpretation of the coefficients here is similar to that elsewhere; it is possible to interpret the effect of \(X_{ir}(s)\) on the expected value of \(Y_i\) through the coefficient \(\beta_r(s)\), keeping all other functional and non-functional predictors fixed. As when adding scalar predictors to the “simple” scalar-on-function regression model, techniques described previously can be readily adapted to this setting.
We will again use pfr() to implement an example of this model. The code chunk below regresses BMI on scalar covariates age and gender. The functional predictor MIMS_mat is familiar from many analyses in this Section and is used here. We add the functional predictor MIMS_sd_mat, which is the standard deviation of MIMS values taken across several days of observation for each participant, also stored as a matrix. MIMS_sd_mat is included in the formula specification exactly as MIMS_mat has been in previous code chunks.
pfr_mult_fit =
pfr(
BMI ~ age + gender +
lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)) +
lf(MIMS_sd_mat, argvals = seq(1/60, 24, length = 1440)),
data = nhanes_df)
The subsequent extraction of estimates and inference is almost identical – with the key addition of specifying select = 1 or select = 2 in the call to coef() to extract values for MIMS_mat and MIMS_sd_mat, respectively.
pfr_mult_inf_df =
bind_rows(
coef(pfr_mult_fit, select = 1) %>% mutate(coef = "MIMS") %>% rename(argvals = MIMS_mat.argvals),
coef(pfr_mult_fit, select = 2) %>% mutate(coef = "MIMS SD") %>% rename(argvals = MIMS_sd_mat.argvals)) %>%
rename(estimate = value) %>%
mutate(
ub = estimate + 1.96 * se,
lb = estimate - 1.96 * se) %>%
tf_nest(.id = coef, .arg = argvals)
pfr_mult_inf_df %>%
ggplot(aes(y = estimate)) +
geom_spaghetti(alpha = 1, linewidth = 1.2) +
geom_errorband(aes(ymax = ub, ymin = lb), linetype = 0) +
facet_grid(.~coef) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
Outcomes that follow binomial, Poisson, and other exponential family distributions are common in practice, and require the use of a generalized linear model framework for estimation and inference. This is, of course, a natural extension to the estimation methods we favor. For example, given a binary outcome \(Y_i\), scalar predictors \(\mathbf{Z}_i\), and a functional predictor \(X_i(s)\), one can pose the logistic scalar-on-function regression \(E[Y_i|\{X_{i}(s):s\in S\},\mathbf{Z}_i] = \mu_i\), where \[\begin{equation} g(\mu_i)= \beta_0 + \int_{S} \beta_1 (s)X_{i}(s)ds + \mathbf{Z}_i^t\mathbf{\gamma}\;, \label{eq:SoFR_mgcv} \end{equation}\] and \(g(\cdot)\) is the logit link function. Model parameters can be interpreted as log odds ratios or exponentiated to obtain odds ratios. Expanding the coefficient function in terms of a basis expansion again provides a mechanism for estimation and inference. Rather than minimizing a sum of squares or the penalized equivalent, one now maximizes a (penalized) log likelihood. Tuning parameters for penalized approaches can be selected in a variety of ways, but we continue to take advantage of the connection between roughness penalties and mixed model representations. Non-penalized approaches can be fit directly using glm(), while penalized models are available through mgcv::gam() or pfr(); a brief glm() sketch is shown below.
In the code chunk below, we fit a logistic scalar-on-function regression in the NHANES dataset. Our binary outcome is two-year mortality, with the value 0 indicating that the participant survived two years after enrollment. We adjust for age, gender, and BMI, and focus on MIMS as a functional predictor. The model specification using pfr() sets the argument family to binomial(), but other aspects of this code are drawn directly from prior examples.
pfr_mort_fit =
pfr(
death_2yr ~ age + gender + BMI +
lf(MIMS_mat, argvals = seq(1/60, 24, length = 1440)),
family = binomial(), method = "REML", data = nhanes_df)
Extracting estimated coefficient functions and conducting inference can be accomplished using the coef() function to obtain estimates and pointwise standard errors. Post-processing to construct confidence intervals is direct, and the estimates and interval bounds can be exponentiated to obtain inference on the odds ratio scale (a brief sketch follows the figure below).
pfr_mort_inf_df =
coef(pfr_mort_fit) %>%
rename(estimate = value) %>%
mutate(
method = "pfr_adj",
ub = estimate + 1.96 * se,
lb = estimate - 1.96 * se) %>%
tf_nest(.id = method, .arg = MIMS_mat.argvals)
pfr_mort_inf_df %>%
ggplot(aes(y = estimate)) +
geom_spaghetti(alpha = 1, linewidth = 1.2) +
geom_errorband(aes(ymax = ub, ymin = lb)) +
scale_x_continuous(
breaks = seq(0, 24, length = 5),
labels = str_c(seq(0, 24, length = 5), ":00")) +
labs(x = "Time of day (hours)", y = "Coefficient")
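As a brief sketch of the odds ratio transformation mentioned above (column names here are illustrative), the estimate and interval bounds can be exponentiated before nesting.
# Sketch: express the estimated coefficient function and its 95% bounds as odds ratios.
pfr_mort_or_df =
  coef(pfr_mort_fit) %>%
  rename(estimate = value) %>%
  mutate(
    or = exp(estimate),
    or_lb = exp(estimate - 1.96 * se),
    or_ub = exp(estimate + 1.96 * se))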