
WIP: AdaptiveLasso and AdaptiveLassoCV #169

Closed
wants to merge 29 commits

Conversation

@mathurinm (Owner) commented Nov 11, 2020

closes #165

@codecov-io commented Nov 11, 2020

Codecov Report

Merging #169 (93963af) into master (a8a110f) will decrease coverage by 1.61%.
The diff coverage is 74.03%.


@@            Coverage Diff             @@
##           master     #169      +/-   ##
==========================================
- Coverage   86.08%   84.47%   -1.62%     
==========================================
  Files          13       14       +1     
  Lines         913      979      +66     
  Branches      120      122       +2     
==========================================
+ Hits          786      827      +41     
- Misses         97      122      +25     
  Partials       30       30              
Flag        Coverage Δ
unittests   100.00% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                 Coverage Δ
celer/dropin_sklearn.py        91.92% <65.21%> (-4.48%) ⬇️
celer/tests/test_adaptive.py   69.69% <69.69%> (ø)
celer/homotopy.py              84.46% <80.85%> (-2.72%) ⬇️
celer/__init__.py              100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mathurinm (Owner, Author)

@josephsalmon @agramfort @QB3
Here's the example: https://183-122246365-gh.circle-artifacts.com/0/dev/auto_examples/plot_adaptivelasso.html#sphx-glr-auto-examples-plot-adaptivelasso-py

Do you agree with the name for the new parameter, n_reweightings?
The docstrings may need polishing; feedback is welcome. As you can see, there is a lot of code/docstring duplicated from the Lasso class...

@QB3 commented Nov 11, 2020

> @josephsalmon @agramfort @QB3
> Here's the example: https://183-122246365-gh.circle-artifacts.com/0/dev/auto_examples/plot_adaptivelasso.html#sphx-glr-auto-examples-plot-adaptivelasso-py
>
> Do you agree with the name for the new parameter, n_reweightings?
> The docstrings may need polishing; feedback is welcome. As you can see, there is a lot of code/docstring duplicated from the Lasso class...

Maybe this is a little bit off topic, but I put it here just in case:

  • I am impressed that the adaptive Lasso does better in prediction for the CV.
  • I am also impressed by the piecewise differentiability of the path for the adaptive Lasso.
    From what I remember, when I played with the adaptive Lasso a while ago, the mse_path was highly discontinuous, so we could not use autodiff to set the regularization parameter. Your plot seems to suggest the opposite; this may be worth investigating.

@mathurinm (Owner, Author)

I don't think it does better in prediction, does it?

If you set n_samples, n_features = 100, 160, the MSE curves go haywire for the adaptive version:
[image: cross-validation MSE curves]
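For reference, here is a minimal sketch of that comparison, assuming AdaptiveLassoCV (added in this PR) exposes the same cv and mse_path_ interface as LassoCV; the data generation below is made up for illustration:

import numpy as np
from celer import LassoCV, AdaptiveLassoCV

# toy problem in the regime discussed above (fewer samples than features)
rng = np.random.RandomState(0)
n_samples, n_features = 100, 160
X = rng.randn(n_samples, n_features)
w_true = np.zeros(n_features)
w_true[:10] = 1.0
y = X @ w_true + 0.5 * rng.randn(n_samples)

for model in (LassoCV(cv=5), AdaptiveLassoCV(cv=5)):
    model.fit(X, y)
    # mse_path_ has shape (n_alphas, n_folds); average over folds to compare curves
    print(type(model).__name__, model.mse_path_.mean(axis=1).min())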

@josephsalmon (Contributor)

This is interesting, and weird!
Can you confirm this is bug-free on the AdaptiveLassoCV side?
Could it be due to a lack of convergence?
The blue dotted line on the right is wild... Is the black line the average of the 5 folds (it seems the median would be less damaged here)?

@@ -100,7 +100,10 @@ def celer(
     cdef int[:] all_features = np.arange(n_features, dtype=np.int32)

     for t in range(max_iter):
-        if t != 0:
+        # if t != 0:
@mathurinm (Owner, Author) commented on the diff:

This introduces a severe regression at the beginning of paths for large X (e.g. finance), where X.T @ theta is recomputed while there are only 3 features in the WS.

Maintaining Xtheta from one alpha to the other may be a fix.
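A very rough illustration of that idea, with a hypothetical path loop and solver signature (not celer's actual internals): keep the dense X.T @ theta vector alive from one alpha to the next instead of recomputing it from scratch.

import numpy as np

def toy_path(X, y, alphas, inner_solver):
    # Hypothetical sketch: Xtheta is computed once and then maintained by the
    # inner solver, rather than rebuilt over all features at every new alpha.
    n_features = X.shape[1]
    w = np.zeros(n_features)
    theta = y / np.max(np.abs(X.T @ y))  # a feasible starting dual point
    Xtheta = X.T @ theta
    coefs = []
    for alpha in alphas:
        # inner_solver (made up here) is assumed to update w, theta and Xtheta in sync
        w, theta, Xtheta = inner_solver(X, y, alpha, w, theta, Xtheta)
        coefs.append(w.copy())
    return np.array(coefs)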

@mathurinm (Owner, Author)

I ran another experiment on the Finance dataset:
[image: CV error curves on the Finance dataset]

The part where it gets funny: AdaptiveLasso uses only one coefficient:

In [26]: (lasso.coef_ != 0).sum()
Out[26]: 267

In [27]: (adaptive_lasso.coef_ != 0).sum()
Out[27]: 1

In [28]: np.where(adaptive_lasso.coef_)
Out[28]: (array([0]),)

In [29]: X0 = X[:, 0].toarray().squeeze()

In [30]: np.dot(X0, y) / np.linalg.norm(y) / np.linalg.norm(X0)
Out[30]: 0.9941543404858482

In [31]: adaptive_lasso.intercept_, lasso.intercept_
Out[31]: (-0.6253583759665271, -1.125413246436354)

In [32]: X0_cent = X0 - X0.mean()

In [33]: y_cent = y - y.mean()

In [34]: np.dot(X0_cent, y_cent) / np.linalg.norm(y_cent) / np.linalg.norm(X0_cent)
Out[34]: 0.8085836249206475

In [35]: lasso.normalize
Out[35]: False

In [36]: sparse.linalg.norm(X, axis=0)
Out[36]: 
array([436.72549038, 214.92614813, 776.73555157, ...,   1.20056613,
         1.47236374,   1.89979315])

On a different note, I think the way the adaptive path is computed is suboptimal: instead of doing

for alpha in alphas:
    for iter in range(n_reweightings):

we should do

for iter in range(n_reweightings):
    for alpha in alphas:

in order to benefit from warm starts. Otherwise, for a fresh new alpha (without weights), initializing with the last reweighting solution from the previous alpha is not very useful (see the sketch below). Does that make sense @josephsalmon?
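For concreteness, a sketch of the inverted loop, assuming a hypothetical weighted_lasso(X, y, alpha, weights, w_init) solver (this is not celer's actual path function):

import numpy as np

def adaptive_lasso_path(X, y, alphas, n_reweightings, weighted_lasso):
    # outer loop over reweightings, inner loop over alphas, so that each pass
    # over alphas can warm start from the previous alpha at the same stage
    n_features = X.shape[1]
    coefs = np.zeros((len(alphas), n_features))
    weights = np.ones((len(alphas), n_features))
    for it in range(n_reweightings):
        for i, alpha in enumerate(alphas):
            # warm start: previous alpha at this stage, or previous stage for the first alpha
            w_init = coefs[i - 1] if i > 0 else coefs[i]
            coefs[i] = weighted_lasso(X, y, alpha, weights[i], w_init)
            # update the per-alpha weights for the next reweighting
            weights[i] = 1.0 / (np.abs(coefs[i]) + 1e-12)
    return coefs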

@josephsalmon (Contributor)

Sparsity = 1 for AdaptiveLassoCV? Really unexpected (normalization issue?).

I am also surprised that for alpha -> 0 the two methods reach different error levels (it should be the OLS performance, but I agree that in high dimension there might be different least-squares solutions).

And I agree with the loop inversion: very good idea!

@mathurinm (Owner, Author)

I am not sure why the two paths should tend to the same solutions as alpha goes to 0: it should be the basis pursuit performance for the Lasso, but for the adaptive Lasso I don't know what the limit is expected to be.

@josephsalmon (Contributor)

You are right, I just thought they would be closer...
My guess would be that the adaptive Lasso with alpha -> 0 should converge to the interpolation solution, i.e. the coef satisfying X coef = y with \norm{coef}_{0.5} minimized.

@@ -276,13 +279,269 @@ def _more_tags(self):
         return {'multioutput': False}


+class AdaptiveLasso(Lasso):
@mathurinm (Owner, Author) commented on the diff:

I am now thinking this is a bad name, as the adaptive Lasso takes weights equal to 1 / w^ols_j and performs a single Lasso.
What we implement is rather iterative l1 reweighting (the Candès-Wakin-Boyd paper).

IterativeReweightedLasso seems more correct to me, but it does not pop up on Google, and I don't know if it is good for visibility. When people talk about it in scikit-learn, they say adaptive Lasso: scikit-learn/scikit-learn#4912
@agramfort @josephsalmon do you have an opinion?

Contributor reply on the diff:

The main naming issue to me is that an estimator should be well-defined mathematically; then the implementation is something under the hood for the user.

Here, the algorithm you are proposing is a DC-programming approach (à la Gasso et al. 2009) for solving sparse regression with \ell_0.5 regularization (also referred to as reweighted l1 by Candès et al.).
Hence, I would be in favor of separating the "theoretical" estimator from the algorithms used (for instance a coordinate descent alternative could be considered as another solver for sparse regression with \ell_0.5 regularization).

I agree that AdaptiveLasso is originally 2-step:

  1. OLS
  2. Lasso reweighted with weights controlled by step 1.

But I think this is vague enough in the original article (any consistent estimator can be used in the first step, not only OLS), so we can reuse this naming.

Potentially, the exponent \gamma (corresponding to the \ell_q norm used in the Adaptive Lasso paper) could be an optional parameter, something like:

lq_norm = 0.5

(with the possibility to add more variants later on).

So in the end, I wouldn't bother too much about the naming and would stick to AdaptiveLasso as a good shortcut.
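To make that suggestion concrete, here is a hedged sketch of how an lq_norm exponent could drive the reweighting; the weights argument of Lasso and the eps smoothing are assumptions for illustration, not something this PR defines:

import numpy as np
from celer import Lasso  # assumed here to accept per-feature weights (hypothetical)

def reweighted_lasso(X, y, alpha, lq_norm=0.5, n_reweightings=5, eps=1e-12):
    # iterative l1 reweighting targeting an ell_q penalty with q = lq_norm
    weights = np.ones(X.shape[1])
    coef = np.zeros(X.shape[1])
    for _ in range(n_reweightings):
        clf = Lasso(alpha=alpha, weights=weights, fit_intercept=False)  # hypothetical weights argument
        clf.fit(X, y)
        coef = clf.coef_
        # d|w|^q/dw is proportional to |w|^(q - 1), hence the classical update
        # weights_j = 1 / (|w_j|^(1 - q) + eps); q = 0.5 gives 1 / (sqrt(|w_j|) + eps)
        weights = 1.0 / (np.abs(coef) ** (1.0 - lq_norm) + eps)
    return coef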

@mathurinm (Owner, Author)

It's way faster this way, but some gap scaling is wrong somewhere: we get negative gaps in the doc examples and on news20:

import time
from celer import AdaptiveLassoCV
from libsvmdata import fetch_libsvm
X, y = fetch_libsvm("news20")
t0 = time.time()
clf = AdaptiveLassoCV(verbose=1, cv=2, n_jobs=2, eps=1e-2).fit(X, y)
dur = time.time() - t0  # fit duration in seconds


Successfully merging this pull request may close these issues.

Example adaptive Lasso