adelie
The adelie package provides extremely efficient procedures for fitting the entire group lasso and group elastic net regularization path for GLMs, multinomial, the Cox model, and multi-task Gaussian models. adelie is similar to the R package glmnet in the scope of models covered and in computational speed. The R package adelie is built from the same C++ code as the corresponding adelie Python package.

In this notebook, we give a brief overview of the group elastic net problem that adelie solves.
The single-response group elastic net problem is given by $$
\begin{align*}
\mathrm{minimize}_{\beta, \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\\text{subject to}\quad&
\eta = X \beta + \beta_0 \mathbf{1} + \eta^0
\end{align*}
$$ where $\beta_0$ is the intercept, $\beta$ is the coefficient vector, $X$ is the feature matrix, $\eta^0$ is a fixed offset vector, $\lambda \geq 0$ is the regularization parameter, $G$ is the number of groups, $\omega_g \geq 0$ is the penalty factor for group $g$, $\alpha \in [0, 1]$ is the elastic net parameter, and $\beta_g$ are the coefficients for the $g$th group. $\ell(\cdot)$ is the loss function defined by the GLM. As an example, the Gaussian GLM (glm.gaussian()) defines the loss function as $$
\begin{align*}
\ell(\eta)
&=
\sum\limits_{i=1}^n w_i \left(
-y_i \eta_i + \frac{\eta_i^2}{2}
\right)
\end{align*}
$$ where $w \geq 0$ is the observation weight vector, $y$ is the response vector, and $\eta$ is the linear prediction vector as in the optimization problem above. This is equivalent to the weighted sum-of-squared errors $$
\begin{align*}
\mbox{RSS}
&=
\frac12\sum\limits_{i=1}^n w_i \left(
y_i -\eta_i \right)^2,
\end{align*}
$$ since the term in ∑wiyi2
is a constant for the optimization.
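To see this, expand the square: $$
\begin{align*}
\frac12\sum\limits_{i=1}^n w_i \left( y_i - \eta_i \right)^2
&=
\sum\limits_{i=1}^n w_i \left( -y_i \eta_i + \frac{\eta_i^2}{2} \right)
+ \frac12\sum\limits_{i=1}^n w_i y_i^2
= \ell(\eta) + \mathrm{constant}.
\end{align*}
$$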
Specifically for the Gaussian GLM, we employ a specialized optimizer based on coordinate descent to solve the group elastic net problem.
The Gaussian GLM is written in “exponential family” form, with $\eta_i$ the natural parameter. Other GLMs such as the binomial (glm.binomial()) and Poisson (glm.poisson()) have similar expressions for the negative log-likelihood. For other general GLMs, as well as the Cox model (glm.cox()), we use a proximal Newton method, which leads to an iteratively reweighted least squares (IRLS) algorithm. That is, we iteratively form a quadratic approximation to $\ell(\cdot)$, which yields a sequence of Gaussian group elastic net problems that we solve using our specialized coordinate-descent solver.
The Gaussian GLM also admits a different algorithm, which we call the covariance method, that uses summary statistics rather than individual-level data. The covariance method solves the following problem: $$
\begin{align*}
\mathrm{minimize}_{\beta} \quad&
\frac{1}{2} \beta^\top A \beta - v^\top \beta + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\end{align*}
$$ This method is equivalent to the usual single-response Gaussian group elastic net problem if $A \equiv X_c^\top W X_c$ and $v \equiv X_c^\top W y_c$, where $X_c$ is the column-centered version of $X$ and $y_c$ is the centered version of $y - \eta^0$, with the means computed using the weights $W$ (if an intercept is to be fit).
This method only works in the Gaussian case, since the proximal Newton method changes the weights $W$ at every IRLS iteration, so that without access to $X$ it is not possible to compute the new $A$ and $v$.
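As a purely schematic illustration (this does not call any adelie API), assuming a feature matrix X, a response y, and weights summing to one are available, A and v could be formed as follows; the uniform weights below are just for illustration, and the offset is taken to be zero.
w  <- rep(1 / n, n)                                    # illustrative uniform weights
Xc <- scale(X, center = colSums(w * X), scale = FALSE) # weighted column-centering of X
yc <- y - sum(w * y)                                   # weighted centering of y (zero offset assumed)
A  <- t(Xc) %*% (w * Xc)                               # Xc' W Xc
v  <- drop(t(Xc) %*% (w * yc))                         # Xc' W yc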
The multi-response group elastic net problem is given by $$
\begin{align*}
\mathrm{minimize}_{B,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|B_g\|_F + \frac{1-\alpha}{2} \|B_g\|_F^2
\right)
\\\text{subject to}\quad&
\eta = X B + \mathbf{1} \beta_0^\top + \eta^0
\end{align*}
$$ The model arises in “multitask” learning (glm.multigaussian()), as well as in multinomial (multiclass) logistic regression (glm.multinomial()). In this setting we allow possibly different linear predictions for each of the $K$ responses, and hence $\eta$ is an $n \times K$ matrix. The coefficient “matrices” $B_g$ for each group are penalized using a Frobenius norm. Note that if an intercept is included in the model, an intercept is added for each response.
We use an alternative but equivalent representation in the adelie software, by “flattening” the coefficient matrices. Specifically, we solve $$
\begin{align*}
\mathrm{minimize}_{\beta,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\\text{subject to}\quad&
\mathrm{vec}(\eta^\top) = (X \otimes I_K) \beta + (\mathbf{1}
\otimes I_K) \beta_0 + \mathrm{vec}(\eta^{0\top})
\end{align*}
$$ where $\mathrm{vec}(\cdot)$ is the operator that flattens a column-major matrix into a vector, and $A \otimes B$ is the Kronecker product. Hence $\beta \equiv \mathrm{vec}(B^\top)$.
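As a quick numerical check of this identity, the flattened representation can be verified with base R on toy dimensions (this does not involve adelie itself):
ntoy <- 4; ptoy <- 3; Ktoy <- 2
Xtoy <- matrix(rnorm(ntoy * ptoy), ntoy, ptoy)
Btoy <- matrix(rnorm(ptoy * Ktoy), ptoy, Ktoy)
etat <- Xtoy %*% Btoy
lhs  <- as.vector(t(etat))                               # vec(eta^T)
rhs  <- kronecker(Xtoy, diag(Ktoy)) %*% as.vector(t(Btoy)) # (X ⊗ I_K) vec(B^T)
max(abs(lhs - rhs))                                      # essentially zero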
As indicated above, the multi-response group elastic net problem is technically of the same form as the single-response problem. In fact, adelie reuses the single-response solver for multi-response problems by modifying the inputs appropriately (e.g. using matrix.kronecker_eye() to represent $X \otimes I_K$). For the MultiGaussian family we wrap the specialized single-response Gaussian solver, and for general multi-response GLMs we wrap the single-response GLM solver.
In this section, we cover the basic usage of adelie.

Before using adelie, we assume that the user has a feature matrix X and a response vector y. For simplicity, we assume for now that X is a dense matrix, although we will later see that X can be substituted with a custom matrix as well. For demonstration, we will randomly generate our data.
n = 100 # number of observations
p = 1000 # number of features
set.seed(5) # seed
X = matrix(rnorm(n*p),n,p)
y = X[,1:10]%*%rnorm(10) + rnorm(n)*sqrt(10) # makes SNR = 1
The most basic call to grpnet simply supplies an X matrix and a glm object. For example, glm.gaussian() returns a Gaussian glm object, which corresponds to the usual least squares loss function. By default, grpnet will fit the lasso by setting each feature as its own group. adelie is written to treat groups of size 1 in a more optimized manner, so it is a competitive lasso solver with computation times similar to those of the glmnet package. Like glmnet, grpnet will by default generate a sequence of 100 values of the regularization parameter λ, evenly spaced on the log scale, if the user does not provide the path. Since grpnet is a path solver, it warm-starts at the next λ using the current solution on the path. For this reason, we recommend that users supply a sufficiently fine grid of λ values, or use the default path!
fit = grpnet(X=X, glm=glm.gaussian(y=y))
print(fit)
##
## Call: grpnet(X = X, glm = glm.gaussian(y = y))
##
## Df %Dev Lambda
## 1 0 0.00 3.3330
## 2 1 4.41 3.1810
## 3 1 8.43 3.0370
## 4 1 12.09 2.8990
## 5 1 15.42 2.7670
## 6 1 18.46 2.6410
## 7 1 21.23 2.5210
## 8 1 23.75 2.4070
## 9 1 26.05 2.2970
## 10 1 28.15 2.1930
## 11 1 30.06 2.0930
## 12 1 31.79 1.9980
## 13 1 33.38 1.9070
## 14 1 34.82 1.8210
## 15 1 36.14 1.7380
## 16 1 37.34 1.6590
## 17 2 38.73 1.5830
## 18 2 40.35 1.5110
## 19 2 41.83 1.4430
## 20 2 43.17 1.3770
## 21 2 44.40 1.3150
## 22 2 45.52 1.2550
## 23 3 46.55 1.1980
## 24 3 47.79 1.1430
## 25 3 48.91 1.0910
## 26 4 50.07 1.0420
## 27 5 51.52 0.9944
## 28 5 52.94 0.9492
## 29 6 54.43 0.9061
## 30 7 55.94 0.8649
## 31 9 57.44 0.8256
## 32 9 58.97 0.7881
## 33 10 60.41 0.7523
## 34 11 61.91 0.7181
## 35 16 63.65 0.6854
## 36 18 65.60 0.6543
## 37 19 67.43 0.6245
## 38 22 69.12 0.5962
## 39 23 70.82 0.5691
## 40 28 72.47 0.5432
## 41 30 74.05 0.5185
## 42 30 75.54 0.4949
## 43 32 76.98 0.4724
## 44 35 78.35 0.4510
## 45 38 79.66 0.4305
## 46 42 80.95 0.4109
## 47 43 82.18 0.3922
## 48 46 83.37 0.3744
## 49 50 84.55 0.3574
## 50 55 85.70 0.3411
## 51 58 86.79 0.3256
## 52 58 87.80 0.3108
## 53 60 88.72 0.2967
## 54 62 89.60 0.2832
## 55 65 90.43 0.2703
The print() method gives a nice summary of the fit. Note that the solver finished early in this example: by default, the solver terminates once the percent deviance explained (%Dev) reaches 90%, as a simple heuristic to avoid overfitting. This threshold can be controlled via the argument adev_tol.
The output of grpnet is an object of class “grpnet”, which contains some simple descriptors of the model, as well as a state object that represents the state of the optimizer. For most use cases, users do not need to inspect the internals of a state object, but can instead extract the useful information via the methods print(), plot(), predict() and coef().
Here we plot the coefficient profiles as a function of −log(λ). By default grpnet standardizes the features internally, and these coefficients are on the scale of the standardized features. One can avoid standardization via standardize = FALSE in the call to grpnet().
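The profile plot referred to here is produced with the plot() method, just as plot(fitg) is used for the group lasso fit later in this document:
plot(fit)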
We can make predictions at new values for the features — here we use a subset of the original:
pred = predict(fit, newx = X[1:5, ], lambda = c(1.5, 1))
pred
## [,1] [,2]
## [1,] 1.67467724 2.35995807
## [2,] 1.23844475 1.82265012
## [3,] 2.00210773 1.70850627
## [4,] -3.47439324 -4.30161563
## [5,] -0.06120398 -0.06572013
Finally, we would like to choose a good value of λ, and for that we use cross-validation. We include a progress_bar argument, which can be helpful for big problems. The plot method for a cv.grpnet object displays the average cross-validated deviance of the model, with approximate standard error bars for the average. Two vertical dashed lines are included: one at the value of λ corresponding to the minimum value, and another at a larger (more conservative) value of λ such that the mean error is within one standard error of the minimum.
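The cross-validation fit itself is not shown in this excerpt; the call below matches the Call field printed next, followed by the plot() method described above:
fitcv = cv.grpnet(X = X, glm = glm.gaussian(y), progress_bar = TRUE)
plot(fitcv)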
One can print the fitted cv.grpnet object:
fitcv
##
## Call: cv.grpnet(X = X, glm = glm.gaussian(y), progress_bar = TRUE)
##
## Measure: Mean Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.7881 32 11.66 2.049 9
## 1se 1.3146 21 13.47 2.396 2
One can also predict directly from it:
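For instance (the first call uses the default λ; we assume that supplying numeric λ values works as for a "grpnet" fit, as stated below):
predict(fitcv, newx = X[1:5, ])
predict(fitcv, newx = X[1:5, ], lambda = c(1.5, 1))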
In the first case the prediction is at the default value of λ, which is lambda = "lambda.1se". Alternatively, any numeric values of λ can be supplied.
To fit the group lasso, the user simply needs to supply a groups argument that defines the starting column index of X for each group. For example, if there are 4 features with two (contiguous) groups of sizes 3 and 1, respectively, then groups = c(1, 4). For demonstration, we take the same data as before and group every 10 features. We then fit the group lasso using the same function.
fitg = grpnet(
    X = X,
    glm = glm.gaussian(y = y),
    groups = seq(from = 1, to = p, by = 10)
)
print(fitg)
##
## Call: grpnet(X = X, glm = glm.gaussian(y = y), groups = seq(from = 1, to = p, by = 10))
##
## Df %Dev Lambda
## 1 0 0.00 1.3680
## 2 10 5.72 1.3060
## 3 10 10.94 1.2470
## 4 10 15.71 1.1900
## 5 10 20.06 1.1360
## 6 10 24.03 1.0840
## 7 10 27.66 1.0350
## 8 10 30.97 0.9881
## 9 10 33.99 0.9432
## 10 10 36.75 0.9004
## 11 10 39.27 0.8594
## 12 10 41.57 0.8204
## 13 10 43.67 0.7831
## 14 10 45.58 0.7475
## 15 10 47.33 0.7135
## 16 10 48.93 0.6811
## 17 10 50.39 0.6501
## 18 10 51.72 0.6206
## 19 10 52.93 0.5924
## 20 10 54.04 0.5654
## 21 10 55.05 0.5397
## 22 10 55.98 0.5152
## 23 10 56.82 0.4918
## 24 20 58.01 0.4694
## 25 30 59.67 0.4481
## 26 30 61.38 0.4277
## 27 30 62.94 0.4083
## 28 30 64.38 0.3897
## 29 50 65.82 0.3720
## 30 50 67.37 0.3551
## 31 110 69.27 0.3390
## 32 130 71.36 0.3236
## 33 130 73.36 0.3089
## 34 140 75.24 0.2948
## 35 140 77.00 0.2814
## 36 140 78.62 0.2686
## 37 150 80.13 0.2564
## 38 170 81.56 0.2448
## 39 170 82.90 0.2336
## 40 180 84.16 0.2230
## 41 190 85.33 0.2129
## 42 200 86.40 0.2032
## 43 210 87.41 0.1940
## 44 220 88.37 0.1852
## 45 220 89.26 0.1767
## 46 230 90.10 0.1687
plot(fitg)
Notice in the printout that the active-set size (Df) increases in steps of 10, as expected. Also, the coefficient plot colors all coefficients in a group with the same color.
As before, we can use cross-validation to select the tuning parameter.
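A sketch of such a call, assuming cv.grpnet() accepts the same groups argument as grpnet():
fitgcv = cv.grpnet(
    X = X,
    glm = glm.gaussian(y = y),
    groups = seq(from = 1, to = p, by = 10)
)
plot(fitgcv)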
In the previous section, we covered how to solve penalized least squares regression. Nearly all of the content remains the same for fitting penalized GLM regression; the only difference is in the choice of the glm object. For brevity, we only discuss logistic regression as our non-trivial GLM example; however, the following discussion applies to any GLM.
Let’s modify our example and generate a binomial response.
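One way to do this (a sketch; the name yb and this particular construction are our own choices, not taken from the original text):
etab = X[, 1:10] %*% rnorm(10)                       # linear predictor from the first 10 features
yb = rbinom(n, size = 1, prob = drop(1 / (1 + exp(-etab))))  # binary response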
To solve the group elastic net problem using the logistic loss, we simply provide the binomial GLM object. For simplicity, we fit the lasso problem below.
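A minimal sketch, assuming glm.binomial() takes the response via y= by analogy with glm.gaussian():
fitb = grpnet(X = X, glm = glm.binomial(y = yb))
print(fitb)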
Cross-validation works as before:
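For example (same caveat about the glm.binomial() interface):
fitbcv = cv.grpnet(X = X, glm = glm.binomial(y = yb), progress_bar = TRUE)
plot(fitbcv)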
We can make predictions from the fitted objects as before. For GLMs, the predictions are on the scale of the link (η).
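For instance, using the lasso fit above (we assume that omitting lambda returns predictions over the whole path):
predict(fitb, newx = X[1:5, ])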
We can specify coefficient groups just as before.
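For example, grouping every 10 features as before (same assumptions as in the earlier binomial sketches):
fitbg = grpnet(
    X = X,
    glm = glm.binomial(y = yb),
    groups = seq(from = 1, to = p, by = 10)
)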
grpnet is also able to fit multi-response GLM elastic nets. Currently, we only support the MultiGaussian (least squares) and Multinomial GLMs.

The following code shows an example of fitting a multinomial regression problem. We will use the glm.multinomial() family, which expects a matrix response Y with K columns (the number of classes). Each row of Y is a proportion vector that sums to 1. In the example below, Y is an indicator matrix, with a single 1 in each row, the rest being zeros.
n = 300 # number of observations
p = 100 # number of features
K = 4 # number of classes
set.seed(7)
X = matrix(rnorm(n*p),n,p)
eta = X[, 1:5] %*% matrix(rnorm(K*5),5,K)/sqrt(5)
probs = exp(eta)
probs = probs/rowSums(probs)
Y = t(apply(probs,1,function(prob)rmultinom(1, 1, prob)))
Y[1:5,]
## [,1] [,2] [,3] [,4]
## [1,] 0 1 0 0
## [2,] 0 0 0 1
## [3,] 0 1 0 0
## [4,] 0 0 1 0
## [5,] 1 0 0 0
The fitting proceeds as before. We will directly fit a group lasso, with groups of 5 chosen similarly to before.
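The fitting code is not shown in this excerpt; here is a sketch consistent with the Call printed below (the definition of grps is our assumption):
grps = seq(from = 1, to = p, by = 5)   # start index of each group of 5 contiguous features
fitm = grpnet(X = X, glm = glm.multinomial(Y), groups = grps)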
print(fitm)
##
## Call: grpnet(X = X, glm = glm.multinomial(Y), groups = grps)
##
## Df %Dev Lambda
## 1 0 0.00 0.0233300
## 2 5 2.02 0.0222700
## 3 5 3.94 0.0212600
## 4 5 5.70 0.0202900
## 5 5 7.33 0.0193700
## 6 5 8.83 0.0184900
## 7 5 10.22 0.0176500
## 8 5 11.50 0.0168400
## 9 5 12.70 0.0160800
## 10 5 13.81 0.0153500
## 11 5 14.84 0.0146500
## 12 5 15.80 0.0139800
## 13 5 16.69 0.0133500
## 14 5 17.53 0.0127400
## 15 5 18.30 0.0121600
## 16 5 19.03 0.0116100
## 17 5 19.71 0.0110800
## 18 5 20.35 0.0105800
## 19 5 20.94 0.0101000
## 20 5 21.49 0.0096390
## 21 5 22.01 0.0092010
## 22 5 22.50 0.0087830
## 23 5 22.95 0.0083840
## 24 5 23.37 0.0080020
## 25 10 24.00 0.0076390
## 26 15 24.66 0.0072920
## 27 15 25.50 0.0069600
## 28 15 26.23 0.0066440
## 29 20 27.00 0.0063420
## 30 30 27.94 0.0060540
## 31 35 29.01 0.0057780
## 32 35 30.12 0.0055160
## 33 50 31.29 0.0052650
## 34 55 32.61 0.0050260
## 35 65 33.97 0.0047970
## 36 65 35.40 0.0045790
## 37 75 36.81 0.0043710
## 38 75 38.26 0.0041730
## 39 85 39.73 0.0039830
## 40 90 41.20 0.0038020
## 41 95 42.67 0.0036290
## 42 95 44.09 0.0034640
## 43 95 45.47 0.0033070
## 44 100 46.81 0.0031560
## 45 100 48.12 0.0030130
## 46 100 49.40 0.0028760
## 47 100 50.64 0.0027450
## 48 100 51.84 0.0026200
## 49 100 53.01 0.0025010
## 50 100 54.15 0.0023880
## 51 100 55.26 0.0022790
## 52 100 56.35 0.0021760
## 53 100 57.41 0.0020770
## 54 100 58.44 0.0019820
## 55 100 59.45 0.0018920
## 56 100 60.44 0.0018060
## 57 100 61.40 0.0017240
## 58 100 62.34 0.0016460
## 59 100 63.27 0.0015710
## 60 100 64.17 0.0015000
## 61 100 65.06 0.0014310
## 62 100 65.93 0.0013660
## 63 100 66.78 0.0013040
## 64 100 67.62 0.0012450
## 65 100 68.44 0.0011880
## 66 100 69.25 0.0011340
## 67 100 70.04 0.0010830
## 68 100 70.82 0.0010340
## 69 100 71.59 0.0009866
## 70 100 72.35 0.0009417
## 71 100 73.09 0.0008989
## 72 100 73.83 0.0008581
## 73 100 74.56 0.0008191
## 74 100 75.28 0.0007819
## 75 100 75.99 0.0007463
## 76 100 76.69 0.0007124
## 77 100 77.39 0.0006800
## 78 100 78.07 0.0006491
## 79 100 78.76 0.0006196
## 80 100 79.43 0.0005914
## 81 100 80.10 0.0005646
## 82 100 80.76 0.0005389
## 83 100 81.41 0.0005144
## 84 100 82.05 0.0004910
## 85 100 82.69 0.0004687
## 86 100 83.31 0.0004474
## 87 100 83.72 0.0004271
## 88 100 84.44 0.0004077
## 89 100 85.08 0.0003891
## 90 100 85.47 0.0003714
## 91 100 86.16 0.0003546
## 92 100 86.76 0.0003384
## 93 100 87.13 0.0003231
## 94 100 87.78 0.0003084
## 95 100 88.14 0.0002944
## 96 100 88.76 0.0002810
## 97 100 89.11 0.0002682
## 98 100 89.70 0.0002560
## 99 100 90.02 0.0002444
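The plot described next is produced with the plot() method, as in the earlier examples:
plot(fitm)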
The coefficient for each feature is a vector (one per class), so rather than showing multiple coefficients, the plot shows the progress of the 2-norm of each coefficient vector as a function of λ. Once again, because of the grouping (into groups of 5 here), the coefficients in a group share the same color. In this case, the first group has all the action, so it appears much earlier than the rest.
Another important difference from the single-response case is that the user must be aware of the shape of the coefficients returned by coef(fitm) or coef(fitmcv).
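Neither fitmcv (used below) nor the coef() inspection appears in this excerpt; here is a sketch, assuming cv.grpnet() accepts groups as grpnet() does:
fitmcv = cv.grpnet(X = X, glm = glm.multinomial(Y), groups = grps)
cfs = coef(fitm)
names(cfs)           # list with "intercepts" and "betas" components, per the text below
dim(cfs$intercepts)  # L x K
dim(cfs$betas)       # L x (p*K), sparse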
We see that coef() returns a list. For multi-response GLMs, the intercepts component is included by default for each response, and hence is an L × K matrix, where L is the number of λ values in the path. The betas component is an L × pK sparse matrix, where every successive block of K columns corresponds to the coefficients for a single feature across all responses.

Fortunately the predict() method understands this structure, and behaves as intended.
predict(fitmcv, newx = X[1:3, ])
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] -0.8315155 0.7267649 0.6423057 -1.0228404
## [2,] 0.4868225 -0.5441420 -0.6714108 0.2398791
## [3,] -0.1516871 0.3078583 -1.0663172 0.4250282
Here, by default, the prediction is made at lambda = "lambda.1se".
Our final example is for censored survival data, and we will fit a Cox proportional hazards model.
We create an example with a 500 × 100 sparse X matrix. In this example we have right censoring: the “subject” exits at the times in y, and the binary status is 1 if the exit is a “death”, and 0 if censored.
set.seed(9)
n <- 500
p <- 100
X <- matrix(rnorm(n*p), n, p)
X[sample.int(n * p, size = 0.5 * n * p)] <- 0
X_sparse <- Matrix::Matrix(X, sparse = TRUE)
nzc <- p / 4
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
y <- rexp(n,hx)
status <- rbinom(n = n, prob = 0.3, size = 1)
We now fit an elastic-net model using the glm.cox() family, with every set of 5 coefficients in a group.
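A sketch of such a call follows; the argument names for glm.cox() (here stop= and status=) are our assumption, so consult ?glm.cox for the exact interface:
fitc = grpnet(
    X = X_sparse,
    glm = glm.cox(stop = y, status = status),  # argument names assumed
    groups = seq(from = 1, to = p, by = 5)
)
plot(fitc)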