adelie
The adelie package provides extremely efficient procedures for fitting the entire group lasso and group elastic net regularization path for GLMs, multinomial, the Cox model, and multi-task Gaussian models. adelie is similar to the R package glmnet in the scope of models covered and in computational speed. The R package adelie is built from the same C++ code as the corresponding adelie Python package.

In this notebook, we give a brief overview of the group elastic net problem that adelie solves.
The single-response group elastic net problem is given by $$
\begin{align*}
\mathrm{minimize}_{\beta, \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\\text{subject to}\quad&
\eta = X \beta + \beta_0 \mathbf{1} + \eta^0
\end{align*}
$$ where $\beta_0$ is the intercept, $\beta$ is the coefficient vector, $X$ is the feature matrix, $\eta^0$ is a fixed offset vector, $\lambda \geq 0$ is the regularization parameter, $G$ is the number of groups, $\omega_g \geq 0$ is the penalty factor for group $g$, $\alpha \in [0, 1]$ is the elastic net parameter, and $\beta_g$ are the coefficients for the $g$th group. $\ell(\cdot)$ is the loss function defined by the GLM. As an example, the Gaussian GLM (glm.gaussian()) defines the loss function as $$
\begin{align*}
\ell(\eta)
&=
\sum\limits_{i=1}^n w_i \left(
-y_i \eta_i + \frac{\eta_i^2}{2}
\right)
\end{align*}
$$ where $w \geq 0$ is the observation weight vector, $y$ is the response vector, and $\eta$ is the linear prediction vector as in the optimization problem above. This is equivalent to the weighted sum-of-squared errors $$
\begin{align*}
\mbox{RSS}
&=
\frac12\sum\limits_{i=1}^n w_i \left(
y_i -\eta_i \right)^2,
\end{align*}
$$ since the term in ∑wiyi2
is a constant for the optimization.
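To see this, expand the square: $$
\begin{align*}
\frac12\sum\limits_{i=1}^n w_i \left( y_i - \eta_i \right)^2
&=
\sum\limits_{i=1}^n w_i \left( -y_i \eta_i + \frac{\eta_i^2}{2} \right)
+ \frac12\sum\limits_{i=1}^n w_i y_i^2
= \ell(\eta) + \mathrm{constant}.
\end{align*}
$$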
Specifically for the Gaussian GLM, we employ a specialized optimizer based on coordinate descent to solve the group elastic net problem.
The Gaussian GLM is written in “exponential family” form, with $\eta_i$ the natural parameter. Other GLMs such as the binomial (glm.binomial()) and Poisson (glm.poisson()) have similar expressions for the negative log-likelihood. For other general GLMs, as well as the Cox model (glm.cox()), we use a proximal Newton method, which leads to an iteratively reweighted least squares (IRLS) algorithm. That is, we iteratively form a quadratic approximation to $\ell(\cdot)$, which yields a sequence of Gaussian group elastic net problems that we solve using our specialized coordinate-descent solver.
The Gaussian GLM also admits a different algorithm, which we call the covariance method, that uses summary statistics rather than individual-level data. The covariance method solves the following problem: $$
\begin{align*}
\mathrm{minimize}_{\beta} \quad&
\frac{1}{2} \beta^\top A \beta - v^\top \beta + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\end{align*}
$$ This method is equivalent to the usual single-response Gaussian group elastic net problem if $A \equiv X_c^\top W X_c$ and $v \equiv X_c^\top W y_c$, where $X_c$ is the column-centered version of $X$ and $y_c$ is the centered version of $y - \eta^0$, with the means computed using the weights $W$ (if an intercept is to be fit).
This method only works in the Gaussian case, since the proximal Newton method changes the weights $W$ at every IRLS iteration, so that without access to $X$ it is not possible to compute the new $A$ and $v$.
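As a purely schematic illustration (this does not call any adelie API), assuming a feature matrix X, a response y, and weights summing to one are available, A and v could be formed as follows; the uniform weights below are just for illustration, and the offset is taken to be zero.
w  <- rep(1 / n, n)                                    # illustrative uniform weights
Xc <- scale(X, center = colSums(w * X), scale = FALSE) # weighted column-centering of X
yc <- y - sum(w * y)                                   # weighted centering of y (zero offset assumed)
A  <- t(Xc) %*% (w * Xc)                               # Xc' W Xc
v  <- drop(t(Xc) %*% (w * yc))                         # Xc' W yc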
The multi-response group elastic net problem is given by $$
\begin{align*}
\mathrm{minimize}_{B,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|B_g\|_F + \frac{1-\alpha}{2} \|B_g\|_F^2
\right)
\\\text{subject to}\quad&
\eta = X B + \mathbf{1} \beta_0^\top + \eta^0
\end{align*}
$$ The model arises in “multitask” learning (glm.multigaussian()), as well as in multinomial (multiclass) logistic regression (glm.multinomial()). In this setting we allow possibly different linear predictions for each of the $K$ responses, and hence $\eta$ is an $n \times K$ matrix. The coefficient “matrices” $B_g$ for each group are penalized using a Frobenius norm. Note that if an intercept is included in the model, an intercept is added for each response.
We use an alternative but equivalent representation in the adelie software, by “flattening” the coefficient matrices. Specifically, we solve $$
\begin{align*}
\mathrm{minimize}_{\beta,\; \beta_0} \quad&
\ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left(
\alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2
\right)
\\\text{subject to}\quad&
\mathrm{vec}(\eta^\top) = (X \otimes I_K) \beta + (\mathbf{1}
\otimes I_K) \beta_0 + \mathrm{vec}(\eta^{0\top})
\end{align*}
$$ where $\mathrm{vec}(\cdot)$ is the operator that flattens a column-major matrix into a vector, and $A \otimes B$ is the Kronecker product. Hence $\beta \equiv \mathrm{vec}(B^\top)$.
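As a quick numerical check of this identity, the flattened representation can be verified with base R on toy dimensions (this does not involve adelie itself):
ntoy <- 4; ptoy <- 3; Ktoy <- 2
Xtoy <- matrix(rnorm(ntoy * ptoy), ntoy, ptoy)
Btoy <- matrix(rnorm(ptoy * Ktoy), ptoy, Ktoy)
etat <- Xtoy %*% Btoy
lhs  <- as.vector(t(etat))                               # vec(eta^T)
rhs  <- kronecker(Xtoy, diag(Ktoy)) %*% as.vector(t(Btoy)) # (X ⊗ I_K) vec(B^T)
max(abs(lhs - rhs))                                      # essentially zero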
As indicated above, the multi-response group elastic net problem is technically of the same form as the single-response problem. In fact, adelie reuses the single-response solver for multi-response problems by modifying the inputs appropriately (e.g. using matrix.kronecker_eye() to represent $X \otimes I_K$). For the MultiGaussian family we wrap the specialized single-response Gaussian solver, and for general multi-response GLMs we wrap the single-response GLM solver.
In this section, we cover the basic usage of adelie.

Before using adelie, we assume that the user has a feature matrix X and a response vector y. For simplicity, we assume for now that X is a dense matrix, although we will later see that X can be substituted with a custom matrix as well. For demonstration, we will randomly generate our data.
n = 100 # number of observations
p = 1000 # number of features
set.seed(5) # seed
X = matrix(rnorm(n*p),n,p)
y = X[,1:10]%*%rnorm(10) + rnorm(n)*sqrt(10) # makes SNR = 1
The most basic call to grpnet simply supplies an X matrix and a glm object. For example, glm.gaussian() returns a Gaussian glm object, which corresponds to the usual least squares loss function. By default, grpnet will fit the lasso by setting each feature as its own group. adelie is written to treat groups of size 1 in a more optimized manner, so it is a competitive lasso solver with computation times similar to those of the glmnet package. Like glmnet, grpnet will by default generate a sequence of 100 values of the regularization parameter λ, evenly spaced on the log scale, if the user does not provide the path. Since grpnet is a path solver, it warm-starts at the next λ using the current solution on the path. For this reason, we recommend that users supply a sufficiently fine grid of λ values, or use the default path!
fit = grpnet(X=X, glm=glm.gaussian(y=y))
print(fit)
##
## Call: grpnet(X = X, glm = glm.gaussian(y = y))
##
## Df %Dev Lambda
## 1 0 0.00 3.3330
## 2 1 4.41 3.1810
## 3 1 8.43 3.0370
## 4 1 12.09 2.8990
## 5 1 15.42 2.7670
## 6 1 18.46 2.6410
## 7 1 21.23 2.5210
## 8 1 23.75 2.4070
## 9 1 26.05 2.2970
## 10 1 28.15 2.1930
## 11 1 30.06 2.0930
## 12 1 31.79 1.9980
## 13 1 33.38 1.9070
## 14 1 34.82 1.8210
## 15 1 36.14 1.7380
## 16 1 37.34 1.6590
## 17 2 38.73 1.5830
## 18 2 40.35 1.5110
## 19 2 41.83 1.4430
## 20 2 43.17 1.3770
## 21 2 44.40 1.3150
## 22 2 45.52 1.2550
## 23 3 46.55 1.1980
## 24 3 47.79 1.1430
## 25 3 48.91 1.0910
## 26 4 50.07 1.0420
## 27 5 51.52 0.9944
## 28 5 52.94 0.9492
## 29 6 54.43 0.9061
## 30 7 55.94 0.8649
## 31 9 57.44 0.8256
## 32 9 58.97 0.7881
## 33 10 60.41 0.7523
## 34 11 61.91 0.7181
## 35 16 63.65 0.6854
## 36 18 65.60 0.6543
## 37 19 67.43 0.6245
## 38 22 69.12 0.5962
## 39 23 70.82 0.5691
## 40 28 72.47 0.5432
## 41 30 74.05 0.5185
## 42 30 75.54 0.4949
## 43 32 76.98 0.4724
## 44 35 78.35 0.4510
## 45 38 79.66 0.4305
## 46 42 80.95 0.4109
## 47 43 82.18 0.3922
## 48 46 83.37 0.3744
## 49 50 84.55 0.3574
## 50 55 85.70 0.3411
## 51 58 86.79 0.3256
## 52 58 87.80 0.3108
## 53 60 88.72 0.2967
## 54 62 89.60 0.2832
## 55 65 90.43 0.2703
The print() method gives a nice summary of the fit. Note that the solver finished early in this example: by default, the solver terminates once the percent deviance explained (%Dev) reaches 90%, as a simple heuristic to avoid overfitting. This threshold can be controlled via the argument adev_tol.
The output of grpnet is an object of class “grpnet”, which contains some simple descriptors of the model, as well as a state object that represents the state of the optimizer. For most use cases, users do not need to inspect the internals of a state object, but can instead extract the useful information via the methods print(), plot(), predict() and coef().
Here we plot the coefficient profiles as a function of −log(λ). By default grpnet standardizes the features internally, and these coefficients are on the scale of the standardized features. One can avoid standardization via standardize = FALSE in the call to grpnet().
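The profile plot referred to here is produced with the plot() method, just as plot(fitg) is used for the group lasso fit later in this document:
plot(fit)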
We can make predictions at new values for the features — here we use a subset of the original:
pred = predict(fit, newx = X[1:5, ], lambda = c(1.5, 1))
pred
## [,1] [,2]
## [1,] 1.67467724 2.35995807
## [2,] 1.23844475 1.82265012
## [3,] 2.00210773 1.70850627
## [4,] -3.47439324 -4.30161563
## [5,] -0.06120398 -0.06572013
Finally, we would like to choose a good value of λ, and for that we use cross-validation. We include a progress_bar argument, which can be helpful for big problems. The plot method for a cv.grpnet object displays the average cross-validated deviance of the model, with approximate standard error bars for the average. Two vertical dashed lines are included: one at the value of λ corresponding to the minimum value, and another at a larger (more conservative) value of λ such that the mean error is within one standard error of the minimum.
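The cross-validation fit itself is not shown in this excerpt; the call below matches the Call field printed next, followed by the plot() method described above:
fitcv = cv.grpnet(X = X, glm = glm.gaussian(y), progress_bar = TRUE)
plot(fitcv)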
One can print the fitted cv.grpnet object:
fitcv
##
## Call: cv.grpnet(X = X, glm = glm.gaussian(y), progress_bar = TRUE)
##
## Measure: Mean Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.7881 32 11.66 2.049 9
## 1se 1.3146 21 13.47 2.396 2
One can also predict directly from it:
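For instance (the first call uses the default λ; we assume that supplying numeric λ values works as for a "grpnet" fit, as stated below):
predict(fitcv, newx = X[1:5, ])
predict(fitcv, newx = X[1:5, ], lambda = c(1.5, 1))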
In the first case the prediction is at the default value of λ, which is lambda = "lambda.1se". Alternatively, any numeric values of λ can be supplied.
To fit the group lasso, the user simply needs to supply a groups argument that defines the starting column index of X for each group. For example, if there are 4 features with two (contiguous) groups of sizes 3 and 1, respectively, then groups = c(1, 4). For demonstration, we take the same data as before and group every 10 features. We then fit the group lasso using the same function.
fitg = grpnet(
    X = X,
    glm = glm.gaussian(y = y),
    groups = seq(from = 1, to = p, by = 10)
)
print(fitg)
##
## Call: grpnet(X = X, glm = glm.gaussian(y = y), groups = seq(from = 1, to = p, by = 10))
##
## Df %Dev Lambda
## 1 0 0.00 1.3680
## 2 10 5.72 1.3060
## 3 10 10.94 1.2470
## 4 10 15.71 1.1900
## 5 10 20.06 1.1360
## 6 10 24.03 1.0840
## 7 10 27.66 1.0350
## 8 10 30.97 0.9881
## 9 10 33.99 0.9432
## 10 10 36.75 0.9004
## 11 10 39.27 0.8594
## 12 10 41.57 0.8204
## 13 10 43.67 0.7831
## 14 10 45.58 0.7475
## 15 10 47.33 0.7135
## 16 10 48.93 0.6811
## 17 10 50.39 0.6501
## 18 10 51.72 0.6206
## 19 10 52.93 0.5924
## 20 10 54.04 0.5654
## 21 10 55.05 0.5397
## 22 10 55.98 0.5152
## 23 10 56.82 0.4918
## 24 20 58.01 0.4694
## 25 30 59.67 0.4481
## 26 30 61.38 0.4277
## 27 30 62.94 0.4083
## 28 30 64.38 0.3897
## 29 50 65.82 0.3720
## 30 50 67.37 0.3551
## 31 110 69.27 0.3390
## 32 130 71.36 0.3236
## 33 130 73.36 0.3089
## 34 140 75.24 0.2948
## 35 140 77.00 0.2814
## 36 140 78.62 0.2686
## 37 150 80.13 0.2564
## 38 170 81.56 0.2448
## 39 170 82.90 0.2336
## 40 180 84.16 0.2230
## 41 190 85.33 0.2129
## 42 200 86.40 0.2032
## 43 210 87.41 0.1940
## 44 220 88.37 0.1852
## 45 220 89.26 0.1767
## 46 230 90.10 0.1687
plot(fitg)
Notice in the printout that the active-set size (Df) increases in steps of 10, as expected. Also, the coefficient plot colors all coefficients in a group with the same color.
As before, we can use cross-validation to select the tuning parameter.
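A sketch of such a call, assuming cv.grpnet() accepts the same groups argument as grpnet():
fitgcv = cv.grpnet(
    X = X,
    glm = glm.gaussian(y = y),
    groups = seq(from = 1, to = p, by = 10)
)
plot(fitgcv)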
In the previous section, we covered how to solve penalized least squares regression. Nearly all of the content remains the same for fitting penalized GLM regression; the only difference is in the choice of the glm object. For brevity, we only discuss logistic regression as our non-trivial GLM example; however, the following discussion applies to any GLM.
Let’s modify our example and generate a binomial response.
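One way to do this (a sketch; the name yb and this particular construction are our own choices, not taken from the original text):
etab = X[, 1:10] %*% rnorm(10)                       # linear predictor from the first 10 features
yb = rbinom(n, size = 1, prob = drop(1 / (1 + exp(-etab))))  # binary response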
To solve the group elastic net problem using the logistic loss, we simply provide the binomial GLM object. For simplicity, we fit the lasso problem below.
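A minimal sketch, assuming glm.binomial() takes the response via y= by analogy with glm.gaussian():
fitb = grpnet(X = X, glm = glm.binomial(y = yb))
print(fitb)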
Cross-validation works as before:
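For example (same caveat about the glm.binomial() interface):
fitbcv = cv.grpnet(X = X, glm = glm.binomial(y = yb), progress_bar = TRUE)
plot(fitbcv)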
We can make predictions from the fitted objects as before. For GLMs, the predictions are on the scale of the link (η).
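For instance, using the lasso fit above (we assume that omitting lambda returns predictions over the whole path):
predict(fitb, newx = X[1:5, ])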
We can specify coefficient groups just as before.
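For example, grouping every 10 features as before (same assumptions as in the earlier binomial sketches):
fitbg = grpnet(
    X = X,
    glm = glm.binomial(y = yb),
    groups = seq(from = 1, to = p, by = 10)
)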
grpnet is also able to fit multi-response GLM elastic nets. Currently, we only support the MultiGaussian (least squares) and Multinomial GLMs.

The following code shows an example of fitting a multinomial regression problem. We will use the glm.multinomial() family, which expects a matrix response Y with K columns (the number of classes). Each row of Y is a proportion vector that sums to 1. In the example below, Y is an indicator matrix, with a single 1 in each row, the rest being zeros.
n = 300 # number of observations
p = 100 # number of features
K = 4 # number of classes
set.seed(7)
X = matrix(rnorm(n*p),n,p)
eta = X[, 1:5] %*% matrix(rnorm(K*5),5,K)/sqrt(5)
probs = exp(eta)
probs = probs/rowSums(probs)
Y = t(apply(probs,1,function(prob)rmultinom(1, 1, prob)))
Y[1:5,]
## [,1] [,2] [,3] [,4]
## [1,] 0 1 0 0
## [2,] 0 0 0 1
## [3,] 0 1 0 0
## [4,] 0 0 1 0
## [5,] 1 0 0 0
The fitting proceeds as before. We will directly fit a group lasso, with groups of 5 chosen similarly to before.
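The fitting code is not shown in this excerpt; here is a sketch consistent with the Call printed below (the definition of grps is our assumption):
grps = seq(from = 1, to = p, by = 5)   # start index of each group of 5 contiguous features
fitm = grpnet(X = X, glm = glm.multinomial(Y), groups = grps)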
print(fitm)
##
## Call: grpnet(X = X, glm = glm.multinomial(Y), groups = grps)
##
## Df %Dev Lambda
## 1 0 0.00 0.0233300
## 2 5 2.02 0.0222700
## 3 5 3.94 0.0212600
## 4 5 5.70 0.0202900
## 5 5 7.33 0.0193700
## 6 5 8.83 0.0184900
## 7 5 10.22 0.0176500
## 8 5 11.50 0.0168400
## 9 5 12.70 0.0160800
## 10 5 13.81 0.0153500
## 11 5 14.84 0.0146500
## 12 5 15.80 0.0139800
## 13 5 16.69 0.0133500
## 14 5 17.53 0.0127400
## 15 5 18.30 0.0121600
## 16 5 19.03 0.0116100
## 17 5 19.71 0.0110800
## 18 5 20.35 0.0105800
## 19 5 20.94 0.0101000
## 20 5 21.49 0.0096390
## 21 5 22.01 0.0092010
## 22 5 22.50 0.0087830
## 23 5 22.95 0.0083840
## 24 5 23.37 0.0080020
## 25 10 24.00 0.0076390
## 26 15 24.66 0.0072920
## 27 15 25.50 0.0069600
## 28 15 26.23 0.0066440
## 29 20 27.00 0.0063420
## 30 30 27.94 0.0060540
## 31 35 29.01 0.0057780
## 32 35 30.12 0.0055160
## 33 50 31.29 0.0052650
## 34 55 32.61 0.0050260
## 35 65 33.97 0.0047970
## 36 65 35.40 0.0045790
## 37 75 36.81 0.0043710
## 38 75 38.26 0.0041730
## 39 85 39.73 0.0039830
## 40 90 41.20 0.0038020
## 41 95 42.67 0.0036290
## 42 95 44.09 0.0034640
## 43 95 45.47 0.0033070
## 44 100 46.81 0.0031560
## 45 100 48.12 0.0030130
## 46 100 49.40 0.0028760
## 47 100 50.64 0.0027450
## 48 100 51.84 0.0026200
## 49 100 53.01 0.0025010
## 50 100 54.15 0.0023880
## 51 100 55.26 0.0022790
## 52 100 56.35 0.0021760
## 53 100 57.41 0.0020770
## 54 100 58.44 0.0019820
## 55 100 59.45 0.0018920
## 56 100 60.44 0.0018060
## 57 100 61.40 0.0017240
## 58 100 62.34 0.0016460
## 59 100 63.27 0.0015710
## 60 100 64.17 0.0015000
## 61 100 65.06 0.0014310
## 62 100 65.93 0.0013660
## 63 100 66.78 0.0013040
## 64 100 67.62 0.0012450
## 65 100 68.44 0.0011880
## 66 100 69.25 0.0011340
## 67 100 70.04 0.0010830
## 68 100 70.82 0.0010340
## 69 100 71.59 0.0009866
## 70 100 72.35 0.0009417
## 71 100 73.09 0.0008989
## 72 100 73.83 0.0008581
## 73 100 74.56 0.0008191
## 74 100 75.28 0.0007819
## 75 100 75.99 0.0007463
## 76 100 76.69 0.0007124
## 77 100 77.39 0.0006800
## 78 100 78.07 0.0006491
## 79 100 78.76 0.0006196
## 80 100 79.43 0.0005914
## 81 100 80.10 0.0005646
## 82 100 80.76 0.0005389
## 83 100 81.41 0.0005144
## 84 100 82.05 0.0004910
## 85 100 82.69 0.0004687
## 86 100 83.31 0.0004474
## 87 100 83.72 0.0004271
## 88 100 84.44 0.0004077
## 89 100 85.08 0.0003891
## 90 100 85.47 0.0003714
## 91 100 86.16 0.0003546
## 92 100 86.76 0.0003384
## 93 100 87.13 0.0003231
## 94 100 87.78 0.0003084
## 95 100 88.14 0.0002944
## 96 100 88.76 0.0002810
## 97 100 89.11 0.0002682
## 98 100 89.70 0.0002560
## 99 100 90.02 0.0002444
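The plot described next is produced with the plot() method, as in the earlier examples:
plot(fitm)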
The coefficient for each feature is a vector (one per class), so rather than showing multiple coefficients, the plot shows the progress of the 2-norm of each coefficient vector as a function of λ. Once again, because of the grouping (into groups of 5 here), the coefficients in a group share the same color. In this case, the first group has all the action, so it appears much earlier than the rest.
Another important difference from the single-response case is that the user must be aware of the shape of the coefficients returned by coef(fitm) or coef(fitmcv).
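Neither fitmcv (used below) nor the coef() inspection appears in this excerpt; here is a sketch, assuming cv.grpnet() accepts groups as grpnet() does:
fitmcv = cv.grpnet(X = X, glm = glm.multinomial(Y), groups = grps)
cfs = coef(fitm)
names(cfs)           # list with "intercepts" and "betas" components, per the text below
dim(cfs$intercepts)  # L x K
dim(cfs$betas)       # L x (p*K), sparse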
We see that coef() returns a list. For multi-response GLMs, the intercepts component is included by default for each response, and hence is an L × K matrix, where L is the number of λ values in the path. The betas component is an L × pK sparse matrix, where every successive block of K columns corresponds to the coefficients for a single feature across all responses.

Fortunately the predict() method understands this structure, and behaves as intended.
predict(fitmcv, newx = X[1:3, ])
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] -0.8315155 0.7267649 0.6423057 -1.0228404
## [2,] 0.4868225 -0.5441420 -0.6714108 0.2398791
## [3,] -0.1516871 0.3078583 -1.0663172 0.4250282
Here, by default, the prediction is made at lambda = "lambda.1se".
Our final example is for censored survival data, and we will fit a Cox proportional hazards model.
We create an example with a 500 × 100 sparse X matrix. In this example we have right censoring: the “subject” exits at the times in y, and the binary status is 1 if the exit is a “death”, and 0 if censored.
set.seed(9)
n <- 500
p <- 100
X <- matrix(rnorm(n*p), n, p)
X[sample.int(n * p, size = 0.5 * n * p)] <- 0
X_sparse <- Matrix::Matrix(X, sparse = TRUE)
nzc <- p / 4
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
y <- rexp(n,hx)
status <- rbinom(n = n, prob = 0.3, size = 1)
We now fit an elastic-net model using the glm.cox() family, with every set of 5 coefficients in a group.
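A sketch of such a call follows; the argument names for glm.cox() (here stop= and status=) are our assumption, so consult ?glm.cox for the exact interface:
fitc = grpnet(
    X = X_sparse,
    glm = glm.cox(stop = y, status = status),  # argument names assumed
    groups = seq(from = 1, to = p, by = 5)
)
plot(fitc)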