
weightwatcher's Introduction



WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs) without needing access to training or even test data. It grew out of theoretical research into Why Deep Learning Works, grounded in our Theory of Heavy-Tailed Self-Regularization (HT-SR), and uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.

It can be used to:

  • analyze pre-trained or trained PyTorch and Keras DNN models (Conv2D and Dense layers)
  • monitor models, and the model layers, to see if they are over-trained or over-parameterized
  • predict test accuracies across different models, with or without training data
  • detect potential problems when compressing or fine-tuning pretrained models
  • flag layers with warning labels: over-trained; under-trained

Quick Links

See also the notebooks provided in the examples directory

Installation: Version 0.7.5.2

pip install weightwatcher

If this fails, try installing the current TestPyPI version (0.7.5.2):

 python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher

Usage

import weightwatcher as ww
import torchvision.models as models

model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()
summary = watcher.get_summary(details)

It really is that easy to run, and it generates a pandas dataframe with details (and plots) for each layer

Sample Details Dataframe

and a summary dictionary of generalization metrics

    {'log_norm': 2.11,
     'alpha': 3.06,
     'alpha_weighted': 2.78,
     'log_alpha_norm': 3.21,
     'log_spectral_norm': 0.89,
     'stable_rank': 20.90,
     'mp_softrank': 0.52}

Advanced Usage

The watcher object has several functions and analysis features, described below.

Notice the min_evals setting: the power-law fits need at least 50 eigenvalues to make sense, but describe() and the other methods do not have this requirement.

watcher.analyze(model=None, layers=[], min_evals=50, max_evals=None,
	 plot=True, randomize=True, mp_fit=True, pool=True, savefig=True)
...
watcher.describe(model=None, layers=[], min_evals=0, max_evals=None,
         plot=True, randomize=True, mp_fit=True, pool=True)
...
watcher.get_details()
watcher.get_summary(details) or get_summary()
watcher.get_ESD()
...
watcher.distances(model_1, model_2)

PEFT / LORA models (experimental)

To analyze a PEFT / LoRA fine-tuned model, specify the peft option.

  • peft = True : forms the low-rank BA matrix and analyzes the delta layers, tagged with 'lora_BA' in the name

    details = watcher.analyze(peft=True)

  • peft = 'with_base' : analyzes the base model, the delta, and the combined layer weight matrices.

    details = watcher.analyze(peft='with_base')

The base_model and the fine-tuned model must have the same layer names; WeightWatcher will ignore layers that do not share the same name. Also, at this point, biases are not considered. Finally, both models should be stored in the same format (e.g. safetensors).

Note: If you want to select by layer_ids, you must first run describe(peft=False), and then select both the lora_A and lora_B layers
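
For example, here is a hedged sketch of that two-step workflow. The 'name' column, the 'lora_' naming convention, and combining layers= with peft= are assumptions; inspect your own details dataframe and the docstrings to confirm them:

import weightwatcher as ww

watcher = ww.WeightWatcher(model=peft_model)   # peft_model: your PEFT / LoRA model (assumed given)

# Step 1: describe with peft=False to list every layer, including the lora_A and lora_B layers
details = watcher.describe(peft=False)

# Step 2: collect the layer ids of the LoRA layers by name
# (assumes a 'name' column and a 'lora_' naming convention; adjust to your model)
lora_ids = details[details['name'].str.contains('lora_', na=False)]['layer_id'].tolist()

# Step 3: analyze just those layers (passing layers= together with peft= is assumed to be supported)
lora_details = watcher.analyze(layers=lora_ids, peft=True)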

Usage: Base Model


Plotting and Fitting the Empirical Spectral Density (ESD)

WW creates plots for each layer weight matrix so you can see how well the power-law fits work

details = watcher.analyze(plot=True)

For each layer, WeightWatcher plots the ESD, a histogram of the eigenvalues of the layer correlation matrix X = WᵀW. It then fits the tail of the ESD to a (Truncated) Power Law and plots these fits on different axes. The summary metrics (above) characterize the Shape and Scale of each ESD. Here's an example:

Generally speaking, the ESDs of the best layers in the best DNNs can be fit to a Power Law (PL), with PL exponents alpha close to 2.0. Visually, the ESD looks like a straight line on a log-log plot (above left).
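
For intuition only, here is a minimal sketch of this kind of tail fit done by hand with numpy and the powerlaw package (the same package WeightWatcher builds on), rather than through WeightWatcher itself:

import numpy as np
import powerlaw

# W: a single layer weight matrix; a random stand-in here, just to show the mechanics.
# A purely random W is *not* heavy tailed, so the fitted alpha will be large (often > 8);
# well-trained layers typically give alpha much closer to 2.
W = np.random.randn(4096, 1000)

# eigenvalues of the correlation matrix X = W^T W (squared singular values of W)
evals = np.linalg.svd(W, compute_uv=False) ** 2

# fit the tail of the ESD to a power law; powerlaw searches for the optimal xmin
fit = powerlaw.Fit(evals)
print("alpha =", fit.power_law.alpha)  # PL exponent
print("xmin  =", fit.power_law.xmin)   # start of the fitted tail
print("D     =", fit.power_law.D)      # Kolmogorov-Smirnov distance of the fit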

Generalization Metrics

The goal of the WeightWatcher project is to find generalization metrics that most accurately reflect observed test accuracies across many different models and architectures, for pre-trained models and for models undergoing training.

Our HT-SR theory says that well-trained, well-correlated layers should be significantly different from the MP (Marchenko-Pastur) random bulk and, specifically, should be heavy tailed. WeightWatcher provides several layer metrics for this, including:

  • rand_distance : the distance in distribution from the randomized layer
  • alpha : the slope of the tail of the ESD, on a log-log scale
  • alpha-hat or alpha_weighted : a scale-adjusted form of alpha (similar to the alpha-Schatten norm)
  • stable_rank : a norm-adjusted measure of the scale of the ESD
  • num_spikes : the number of spikes outside the MP bulk region
  • max_rand_eval : the scale of the random noise in the layer

All of these attempt to measure how non-random and/or non-heavy-tailed the layer ESDs are.

Scale Metrics

  • log Frobenius norm : the log of the squared Frobenius norm of W, i.e. the log of the sum of the eigenvalues of X

  • log_spectral_norm : the log of the largest eigenvalue of X (the squared spectral norm of W)

  • stable_rank : the squared Frobenius norm divided by the largest eigenvalue; a norm-adjusted measure of the scale of the ESD

  • mp_softrank : the MP bulk edge of the randomized ESD divided by the largest eigenvalue
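
For intuition, here is a minimal numpy sketch of these scale metrics for a single weight matrix, using the definitions as I understand them from the HT-SR papers; treat the exact conventions (log base, Conv2D slice handling, the bulk edge used for mp_softrank) as assumptions, since WeightWatcher's internal implementation may differ in detail:

import numpy as np

def scale_metrics(W, rand_eval_max=None):
    """Rough per-layer scale metrics for the correlation matrix X = W^T W.
    rand_eval_max: largest eigenvalue of a randomized version of W (assumed given),
    needed only for the mp_softrank estimate."""
    evals = np.linalg.svd(W, compute_uv=False) ** 2   # eigenvalues of X
    frob2 = evals.sum()        # squared Frobenius norm = sum of eigenvalues
    lambda_max = evals.max()   # squared spectral norm = largest eigenvalue

    metrics = {
        "log_norm": np.log10(frob2),                 # log Frobenius norm (squared)
        "log_spectral_norm": np.log10(lambda_max),   # log of the largest eigenvalue
        "stable_rank": frob2 / lambda_max,           # norm-adjusted scale of the ESD
    }
    if rand_eval_max is not None:
        # mp_softrank (approximate): randomized bulk edge relative to the true lambda_max
        metrics["mp_softrank"] = rand_eval_max / lambda_max
    return metrics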

Shape Metrics

  • alpha : Power Law (PL) exponent
  • (Truncated) PL quality of fit D : the Kolmogorov-Smirnov distance between the empirical tail and the fit

(advanced usage)

  • TPL : (alpha and Lambda) Truncated Power Law Fit
  • E_TPL : (alpha and Lambda) Extended Truncated Power Law Fit

Scale-adjusted Shape Metrics

  • alpha_weighted : alpha times the log of the largest eigenvalue (also written alpha-hat); a scale-adjusted alpha
  • log_alpha_norm : the log of the alpha-Schatten norm of the correlation matrix X

Direct Correlation Metrics

The rand_distance metric is a new, non-parametric approach that appears to work well in early testing. See this recent blog post.

  • rand_distance : Distance of layer ESD from the ideal RMT MP ESD

There are also related metrics, including the new

  • 'ww_maxdist'
  • 'ww_softrank'

Misc Details

  • N, M : Matrix or Tensor Slice Dimensions
  • num_spikes : number of spikes outside the bulk region of the ESD, when fit to an MP distribution
  • num_rand_spikes : number of Correlation Traps
  • max_rand_eval : scale of the random noise in the layer

Summary Statistics:

The layer metrics are averaged in the summary statistics:

Get the average metrics, as a summary (dict), from the given (or current) details dataframe

details = watcher.analyze(model=model)
summary = watcher.get_summary(details)

or just

summary = watcher.get_summary()

The summary statistics can be used to gauge the test error of a series of pre/trained models, without needing access to training or test data.

  • average alpha can be used to compare one or more DNN models with different hyperparameter settings θ, when depth is not a driving factor (e.g. Transformer models)
  • average log_spectral_norm is useful for comparing models of different depths L at a coarse-grained level
  • average alpha_weighted and log_alpha_norm are suitable for DNNs of differing hyperparameters θ and depths L simultaneously (e.g. CV models like VGG and ResNet)
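
For example, here is a sketch of comparing several pretrained VGG models by their average alpha and alpha_weighted; the torchvision model choices are just illustrative:

import weightwatcher as ww
import torchvision.models as models

# HT-SR predicts that a smaller average alpha correlates with better generalization
for name, loader in [("vgg11_bn", models.vgg11_bn),
                     ("vgg16_bn", models.vgg16_bn),
                     ("vgg19_bn", models.vgg19_bn)]:
    model = loader(pretrained=True)
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()
    summary = watcher.get_summary(details)
    print(f"{name}: alpha={summary['alpha']:.2f}, "
          f"alpha_weighted={summary['alpha_weighted']:.2f}")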

Predicting the Generalization Error

WeightWatcher (WW) can be used to compare the test error for a series of models trained on similar datasets, but with different hyperparameters θ, or even with different but related architectures.

Our Theory of HT-SR predicts that models with smaller PL exponents alpha, on average, correspond to models that generalize better.

Here is an example of the alpha_weighted capacity metric for all the current pretrained VGG models.

Notice: we did not peek at the ImageNet test data to build this plot.

This can be reproduced with the Examples Notebooks for VGG and also for ResNet

Detecting signs of Over-Fitting and Under-Fitting

WeightWatcher can help you detect the signatures of over-fitting and under-fitting in specific layers of a pre-trained or trained Deep Neural Network.

WeightWatcher analyzes your model layer by layer and shows you where these kinds of problems may be lurking.

Correlation Traps

The randomize option lets you compare the ESD of the layer weight matrix (W) to the ESD of its randomized form. This is a good way to visualize the correlations in the true ESD and to detect signatures of over- and under-fitting.
details = watcher.analyze(randomize=True, plot=True)
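
Conceptually, the randomized ESD shows what the layer would look like if its weights carried no correlations. Here is a minimal numpy sketch of that comparison (not WeightWatcher's own implementation; `layer_weights` is an assumed, user-supplied weight matrix):

import numpy as np

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W."""
    return np.linalg.svd(W, compute_uv=False) ** 2

W = np.asarray(layer_weights)                               # a trained layer weight matrix (assumed given)
W_rand = np.random.permutation(W.ravel()).reshape(W.shape)  # element-wise shuffle removes correlations

evals, rand_evals = esd(W), esd(W_rand)

# A well-trained layer has a heavy tail extending far beyond the randomized bulk;
# a large *randomized* eigenvalue separated from its own bulk suggests a Correlation Trap.
print("max eval (true)      :", evals.max())
print("max eval (randomized):", rand_evals.max())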

Fig (a) is well trained; Fig (b) may be over-fit.

That orange spike on the far right is the tell-tale clue; it's called a Correlation Trap.

A Correlation Trap is characterized by Fig (b): here the actual (green) and random (red) ESDs look almost identical, except for a small shelf of correlation (just right of 0). In the random (red) ESD, the largest eigenvalue (orange) is far to the right of, and separated from, the bulk of the ESD.

Correlation Traps

When layers look like Figure (b) above, they have not been trained properly, because they look almost random, with only a little bit of information present. The information the layer has learned may even be spurious.

Moreover, the metric num_rand_spikes (in the details dataframe) contains the number of spikes (or traps) that appear in the layer.

The SVDSharpness transform can be used to remove Correlation Traps, either during training (after each epoch) or after training:

sharpened_model = watcher.SVDSharpness(model=...)

Sharpening a model is similar to clipping the layer weight matrices, but it uses Random Matrix Theory to do this in a more principled way than simple clipping.

Early Stopping

Note: This is experimental, but we have seen some success here.

The WeightWatcher alpha metric may be used to detect when to apply early stopping. When the average alpha (summary statistic) drops below 2.0, this indicates that the model may be over-trained and early stopping may be necessary.

Below is an example of this, showing the training and test loss curves for a small Transformer model trained from scratch, along with the average alpha summary statistic.

Early Stopping

We can see that as the training and test losses decrease, so does alpha. But when the test loss saturates and then starts to increase, alpha drops below 2.0.

Note: this only works for very well-trained models, where the optimal alpha = 2.0 is obtained.
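
Here is a hedged sketch of what such a check might look like inside a training loop; `train_one_epoch`, `train_loader`, `optimizer`, `model`, and `max_epochs` are placeholders for your own training code, not part of WeightWatcher:

import weightwatcher as ww

ALPHA_STOP = 2.0   # HT-SR: average alpha dropping below ~2.0 suggests over-training

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)   # your own training step (assumed given)

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()
    avg_alpha = watcher.get_summary(details)['alpha']
    print(f"epoch {epoch}: average alpha = {avg_alpha:.2f}")

    if avg_alpha < ALPHA_STOP:
        print("average alpha fell below 2.0 -- consider stopping early")
        break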


Additional Features

There are many advanced features, described below

Filtering


filter by layer types

ww.LAYER_TYPE.CONV2D | ww.LAYER_TYPE.DENSE

as

details=watcher.analyze(layers=[ww.LAYER_TYPE.CONV2D])

filter by layer ID or name

details=watcher.analyze(layers=[20])

Calculations


minimum, maximum number of eigenvalues of the layer weight matrix

Sets the minimum and maximum size of the weight matrices analyzed. Setting the maximum is useful for quick debugging.

details = watcher.analyze(min_evals=50, max_evals=500)

specify the Power Law fitting procedure

To replicate results using TPL or E_TPL fits, use:

details = watcher.analyze(fit='PL'|'TPL'|'E_TPL')

The details dataframe will now contain two quality metrics for each layer:

  • alpha : basically (but not exactly) the same PL exponent as before, useful for alpha > 2.0
  • Lambda : a new metric, now useful when the (TPL) alpha < 2.0

(The TPL fits correct a problem we have had where the PL fits over-estimate alpha for TPL layers.)

As with the alpha metric, smaller Lambda implies better generalization.

Visualization


Save all model figures

Saves the layer ESD plots for each layer

watcher.analyze(savefig=True)                    # save to the default directory
watcher.analyze(savefig='/plot_save_directory')  # or specify a directory

generating 4 files per layer

ww.layer#.esd1.png
ww.layer#.esd2.png
ww.layer#.esd3.png
ww.layer#.esd4.png

Note: additional plots will be saved when the randomize option is used

fit ESDs to a Marchenko-Pastur (MP) distribution

The mp_fit option tells WW to fit each layer ESD, treated as a Random Matrix, to a Marchenko-Pastur (MP) distribution, as described in our papers on HT-SR.

details = watcher.analyze(mp_fit=True, plot=True)

and reports

num_spikes, mp_sigma, and mp_softrank

This also works for the randomized ESD and reports

rand_num_spikes, rand_mp_sigma, and rand_mp_softrank

fetch the ESD for a specific layer, for visualization or additional analysis

watcher.analyze()
esd = watcher.get_ESD()
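
For example, here is a minimal sketch of visualizing the fetched ESD with matplotlib, continuing from the watcher created above; the plotting code is illustrative and not part of the WeightWatcher API:

import numpy as np
import matplotlib.pyplot as plt

watcher.analyze()
esd = watcher.get_ESD()   # eigenvalues of the layer correlation matrix

# histogram of the log eigenvalues: heavy-tailed (well-trained) layers show a long right tail
plt.hist(np.log10(esd), bins=100)
plt.xlabel("log10(eigenvalue)")
plt.ylabel("count")
plt.title("Layer ESD (log scale)")
plt.show()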

Model Analysis


describe a model

Describe a model and report the details dataframe, without analyzing it

details = watcher.describe(model=model)

comparing two models

The new distances method reports the distances between two models, such as the norm between the initial weight matrices and the final, trained weight matrices

details = watcher.distances(initial_model, trained_model)

Compatibility


compatibility with version 0.2.x

The new 0.4.x versions of WeightWatcher treat each layer as a single, unified set of eigenvalues. In contrast, the 0.2.x versions split the Conv2D layers into n slices, one for each receptive field. The pool=False option provides results that are backward-compatible with the 0.2.x versions of WeightWatcher (this option used to be called ww2x=True), with details provided for each slice of each layer. Otherwise, the eigenvalues from each slice of the Conv2D layer are pooled into one ESD.

details = watcher.analyze(pool=False)

Requirements

  • Python 3.7+

Frameworks supported

  • Tensorflow 2.x / Keras
  • PyTorch 1.x
  • HuggingFace

Note: the current version requires both TensorFlow and PyTorch; if there is demand, this will be updated to make installation easier.

Layers supported

  • Dense / Linear / Fully Connected (and Conv1D)
  • Conv2D

Tips for First Time Users

When using WeightWatcher for the first time, I recommend selecting at least one trained model and running `weightwatcher` with all analyze options enabled, including the plots. From this, look at:
  • whether the layer ESDs are well formed and heavy tailed
  • whether any layers are nearly random, indicating they are not well trained
  • whether all the power law (alpha) fits appear reasonable, and xmin is small enough that the fit captures a reasonable section of the ESD tail

Moreover, the Power Law and alpha fits only work well when the ESDs are both heavy tailed and can be easily fit to a single power law. Occasionally the power law and/or alpha fits don't work. This happens when:

  • the ESD is random (not heavy tailed), alpha > 8.0
  • the ESD is multimodal (rare, but it does occur)
  • the ESD is heavy tailed, but not well described by a single power law. In these cases, sometimes alpha only fits the very last part of the tail and is too large. This is easily seen on the Lin-Lin plots.

In any of these cases, I usually throw away results where alpha > 8.0 because they are spurious. If you suspect your layers are under-trained, you have to look both at alpha and at a plot of the ESD itself (to see if it is heavy tailed or just random-like).
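
For instance, here is a small pandas sketch of flagging and dropping those spurious fits from the details dataframe; the alpha and layer_id column names follow the metrics listed above:

# drop layers whose power-law fits are spurious (alpha > 8.0)
good = details[details['alpha'] < 8.0]

# layers with very large alpha are worth inspecting by eye (random-like or under-trained?)
suspect = details[details['alpha'] >= 8.0]
print(suspect[['layer_id', 'alpha']])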


How to Release

Publishing to the PyPI repository:
# 1. Check in the latest code with the correct revision number (__version__ in __init__.py)
vi weightwatcher/__init__.py # Increase the release number, remove -dev from the revision number
git commit
# 2. Check out latest version from the repo in a fresh directory
cd ~/temp/
git clone https://github.com/CalculatedContent/WeightWatcher
cd WeightWatcher/
# 3. Use the latest version of the tools
python -m pip install --upgrade setuptools wheel twine
# 4. Create the package
python setup.py sdist bdist_wheel
# 5. Test the package
twine check dist/*
# 6. Upload the package to TestPyPI first
twine upload --repository testpypi dist/*
# 7. Test the TestPyPI install
python3 -m pip install --index-url https://test.pypi.org/simple/ weightwatcher
...
# 8. Upload to actual PyPI
twine upload dist/*
# 9. Tag/Release in GitHub by creating a new release (https://github.com/CalculatedContent/WeightWatcher/releases/new)

License

Apache License 2.0


Academic Presentations and Media Appearances

This tool is based on state-of-the-art research done in collaboration with UC Berkeley:

WeightWatcher has been featured in top journals like JMLR and Nature, and has been presented at Stanford, UC Berkeley, KDD, and elsewhere:

Latest papers and talks

KDD2019 Workshop

WeightWatcher has also been featured at local meetups and on many popular podcasts

Popular Podcasts and Blogs

2021 Short Presentations

Recent talk(s) by Mike Mahoney, UC Berkeley


Experimental / Most Recent version (not ready yet)

You may install the latest / trunk version from TestPyPI:

python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher

The TestPyPI version usually has the most recent updates, including experimental methods and bug fixes. But PyPI has changed the way it handles TestPyPI packages that require non-TestPyPI dependencies; e.g., torch and tensorflow fail to install from TestPyPI.

If you already have them installed in your environment, you're fine. Otherwise, you need to install them first.


Contributors

Charles H Martin, PhD, Calculation Consulting

Serena Peng, Christopher Hinrichs


Consulting Practice

Calculation Consulting homepage

Calculated Content Blog


weightwatcher's Issues

Support for efficient CNNs

Hi,

Very interesting/cool work. I am wondering if you have tried it on more efficient CNN models, for instance EfficientNet, MobileNet, or FBNet?

Bug in the MP fits of the randomized ESDs

Please be aware that there remains some bug in the MP fits of the randomized ESDs

In some cases the MP fit looks quite off; in fact, the fit is just wrong.

Here is an example from VGG11, layer 8. This is the current code (0.4.7):
[screenshot]

This is the correct fit
[screenshot]

All layers skipped

I am trying to test on a simple model inspired by the example here:
https://keras.io/examples/cifar10_cnn/

The result shows that after applying ww, all layers are skipped. Here is the model summary:

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_24 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_36 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_37 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_18 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 15, 15, 64)        18496     
_________________________________________________________________
activation_38 (Activation)   (None, 15, 15, 64)        0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 13, 13, 64)        36928     
_________________________________________________________________
activation_39 (Activation)   (None, 13, 13, 64)        0         
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 6, 6, 64)          0         
_________________________________________________________________
dropout_19 (Dropout)         (None, 6, 6, 64)          0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 2304)              0         
_________________________________________________________________
dense_12 (Dense)             (None, 512)               1180160   
_________________________________________________________________
activation_40 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_20 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 10)                5130      
_________________________________________________________________
activation_41 (Activation)   (None, 10)                0         
=================================================================
Total params: 1,250,858
Trainable params: 1,250,858
Non-trainable params: 0

WW log:

'''
{0: {'id': 0,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f812430fd68>},
1: {'id': 1,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f812430b3c8>},
2: {'id': 2,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f812430b630>},
3: {'id': 3,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f81242ff710>},
4: {'id': 4,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.pooling.MaxPooling2D at 0x7f81242ff898>},
5: {'id': 5,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123eb2898>},
6: {'id': 6,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f8123eb2cc0>},
7: {'id': 7,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123e823c8>},
8: {'id': 8,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f8123e82240>},
9: {'id': 9,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123e8b7f0>},
10: {'id': 10,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.pooling.MaxPooling2D at 0x7f8123e8b898>},
11: {'id': 11,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123ea0dd8>},
12: {'id': 12,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Flatten at 0x7f8123ea0c50>},
13: {'id': 13,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dense at 0x7f8123ea0c18>},
14: {'id': 14,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123def9b0>},
15: {'id': 15,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123def9e8>},
16: {'id': 16,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dense at 0x7f8123defcc0>},
17: {'id': 17,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123da4198>}}
'''

Let the user define the logging level

Hi, thanks for creating this nice Python package :) It would be nice to be able to define the desired logging level. When running inside a larger application with other logging mechanisms, the logs coming from weightwatcher are pretty verbose and make it hard to parse things. I see it's already marked as a TODO.

Perhaps you could have a set_logging_level function to set a global verbose level in the library?
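
In the meantime, a possible workaround is to adjust the logger with Python's standard logging module; the logger name 'weightwatcher' is an assumption, so check logging.Logger.manager.loggerDict if it does not take effect:

import logging

# silence everything below WARNING from the (assumed) 'weightwatcher' logger
logging.getLogger('weightwatcher').setLevel(logging.WARNING)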

v0.1.3 installation issues

Some requirements that were not installed:
pypandoc
msgpack
upgrading to setuptools>=41.0.0

This was an installation into a conda environment with the basic anaconda package installed (conda version 4.5.12, python 3.6.5), along with an old version of pytorch (0.4.1).

name not set in watcher.describe()

using the new Keras iterator

[screenshot]

The layer does have a default name
[screenshot]

These probably need to be added as 'original_name' or something like this

rand=True fails when M = 1

See line 1957:

1955
1956 num_spikes = len(to_plot[to_plot > bulk_max])
-> 1957 ratio_numofSpikes = num_spikes / (M - 1)
1958

Math Warnings On Analyze

I see a couple of warning messages when I run analyze():

/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: divide by zero encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
Less than 2 unique data values left after xmin and xmax options! Cannot fit. Returning nans.

These don't cause a crash or prevent getting results, but I wonder: should I expect to see these messages, and is it potentially a problem?

add channels flag to the watcher initializer

Some models, like the ONNX models, have channels last even though the ONNX default is channels first.
Currently, the channels flag is set in the analyze method.
It may make more sense to move/add this to the initializer.

[screenshot]

plot_loghist is not defined

weightwatcher 0.4.1. When I run analyze(plot=True), I get this error:

 File "/home/xander/dev/tsai/tsai/callback/MVP.py", line 172, in weight_watcher_analyze
    details = watcher.analyze(plot=True, savefig=True)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 989, in analyze
    self.apply_fit_powerlaw(ww_layer, params)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 885, in apply_fit_powerlaw
    sample=sample, sample_size=sample_size, savefig=savefig)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 1335, in fit_powerlaw
    plot_loghist(evals[evals>(xmin/100)], bins=100, xmin=xmin)
NameError: name 'plot_loghist' is not defined

This is the line, but I don't see it defined anywhere. Where is plot_loghist supposed to be coming from?

Matlab alpha implementation

Hi, I would like to implement alpha within the Deep Learning Matlab framework for early stopping without requiring a validation sample.

Could you please point me toward, or write down, the pseudo-code for calculating alpha in that context, so that I can implement it in Matlab?

Thank you!
Federico

MXNet Support

Hello! I love this work, thanks for open-sourcing it.

My department is a heavy user of the Sockeye framework for seq2seq learning, which internally relies on MXNet as opposed to PyTorch or TF.

I'd like to go back and run this tool on our models and am willing to put in some elbow grease to get MXNet support. I'm wondering if you have any idea of the steps involved in this, roughly, and the amount of work they would be. That would help me get started and decide if this project is worth the effort.

Thanks!

Unified-SVDsmoothing approach for "smalish" models

Hi Charles, as we discussed on the SVDsmoothing channel on Slack, here's my Matlab implementation of an approach to your SVDsmoothing algorithm. It roams through all kinds of layers, including LSTMs, reshapes each weight matrix it finds into a vector (remembering the original shape), concatenates all vectors into one large vector, and then reshapes that vector into a large square matrix (zero-padding the vector first, if required, to make it square). It then applies SVD20 (or SVD10, or something else) smoothing to the large matrix, and finally goes backward: it reshapes the large, smoothed matrix into a long vector, discards the padding, recovers the vectors corresponding to each layer from the large vector, and lastly reshapes each vector back to its original matrix shape. It has comments that should allow you to port it to Python for WW.

I tested this approach to estimate test error in two Matlab toy models as well as in my own model/data, and in all cases, it seems to work pretty well, following test error in all its ups and downs, for instance going up together with test error when the network training process starts overfitting the training set.

Here's a capture of training for 60 epochs with my model and my data. The training set is composed of about 4000 samples (that are actually augmentations of only 260 samples!) and the test set is composed of only 30 samples. It's remarkable how well the test estimation compares to the test error with such a small test set (I was using SVD50 in this case).
[screenshot]

Here is the architecture of my model:
[screenshots]

In the code I included the option to normalize each layer's weight vector (subtract the mean and divide by the standard deviation), saving the mean and standard deviation of each layer so that the vector can be de-normalized at the end, after SVD. The rationale was that the large matrix would end up containing weights on the same scale this way. But this has not given me good results in practice. Maybe you can exclude it from the port, or maybe you can include a better way to normalize the vectors.

This method should allow people to train smallish models without a validation set and to use the estimated error to know when to stop training.

Charles:
I will need to explain theoretically, in my paper/dissertation, why this approach works; in particular, why truncated SVD20 of a large matrix containing all the weights of a network indiscriminately can be a good predictor of test error. One would presume the placement of the weights in a matrix matters for SVD, and yet here the layers' weights end up scattered all over the large matrix (because the large matrix is just the reshaping of a concatenated vector of weights), and it still works. In particular, it works with the LSTM weights! Any pointers?

function [estimatedError] = estimateTestLoss(network, trainingData, trainingTarget, percentageKept, normalizeVectors)
%ESTIMATETESTLOSS Uses a modified version of WW's SVDsmoothing to estimate test error during / after NN training

%% ROAM THROUGH LAYERS, RESHAPE ALL WEIGHT MATRICES FOUND ONTO VECTORS

layers = network.Layers;

vectorNumber = 1;

for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % If layer is conv2D
        if ndims(layers(layerNumber).Weights) > 2
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    weightVectors{vectorNumber} = reshape(layers(layerNumber).Weights(:,:,input,output),1, numel(layers(layerNumber).Weights(:,:,input,output)));
                    if normalizeVectors
                        mu{vectorNumber} = mean(weightVectors{vectorNumber});
                        sd{vectorNumber} = std(weightVectors{vectorNumber});
                        weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
                    end
                    vectorNumber = vectorNumber + 1;
                end
            end
            % If layer is dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            weightVectors{vectorNumber} = reshape(layers(layerNumber).Weights, 1, numel(layers(layerNumber).Weights));
            if normalizeVectors
                mu{vectorNumber} = mean(weightVectors{vectorNumber});
                sd{vectorNumber} = std(weightVectors{vectorNumber});
                weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
            end
            vectorNumber = vectorNumber + 1;
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        weightVectors{vectorNumber} = reshape([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights], 1, numel([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights]));
        if normalizeVectors
            mu{vectorNumber} = mean(weightVectors{vectorNumber});
            sd{vectorNumber} = std(weightVectors{vectorNumber});
            weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
        end
        vectorNumber = vectorNumber + 1;
    end
end

%% RESHAPE VECTORS INTO A SINGLE MATRIX, SMOOTH WEIGHTS

weightVector = horzcat(weightVectors{:});
squareSize = ceil(sqrt(numel(weightVector)));
padding = squareSize^2 - numel(weightVector);
weightMatrix = reshape([weightVector zeros(1, padding)],squareSize,squareSize);

if normalizeVectors
    weightMatrix(isinf(weightMatrix) | isnan(weightMatrix)) = 0;
end

if ~isa(weightMatrix,'double')
    weightMatrix=double(weightMatrix);
end

eigenvalues = svd(weightMatrix).^2;

nComponents = round(percentageKept * length(eigenvalues) / 100);
if nComponents < 1
    nComponents = 1;
end

% do truncated SVD for smoothing weights in matrix
[~,~,V] = svds(weightMatrix,nComponents);
X = weightMatrix*V;
smoothedMatrix = (X*V');

% reshape smoothed matrix to vectors once again
smoothedVector = reshape(smoothedMatrix, 1, numel(weightMatrix));
smoothedVector = smoothedVector(1,1:end-padding); % remove padding

%% ROAM THROUGH LAYERS, RESHAPE ALL VECTORS BACK INTO WEIGHT MATRICES

vectorNumber = 1;
vectorIndex = 1;

for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % If layer is conv2D
        if ndims(layers(layerNumber).Weights) > 2
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    layers(layerNumber).Weights(:,:,input,output) = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size(layers(layerNumber).Weights(:,:,input,output)));
                    if normalizeVectors
                        layers(layerNumber).Weights(:,:,input,output) = (layers(layerNumber).Weights(:,:,input,output) .* sd{vectorNumber}) + mu{vectorNumber};
                    end
                    vectorIndex = vectorIndex + length(weightVectors{vectorNumber});
                    vectorNumber = vectorNumber + 1;
                end
            end
            % If layer is dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            layers(layerNumber).Weights = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size(layers(layerNumber).Weights));
            if normalizeVectors
                layers(layerNumber).Weights = (layers(layerNumber).Weights .* sd{vectorNumber}) + mu{vectorNumber};
            end
            vectorIndex = vectorIndex + length(weightVectors{vectorNumber});
            vectorNumber = vectorNumber + 1;
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        concatenatedMatrices = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights]));
        if normalizeVectors
            concatenatedMatrices = (concatenatedMatrices .* sd{vectorNumber}) + mu{vectorNumber};
        end
        layers(layerNumber).InputWeights = concatenatedMatrices(:,1:size(layers(layerNumber).InputWeights,2));
        layers(layerNumber).RecurrentWeights = concatenatedMatrices(:,size(layers(layerNumber).InputWeights,2)+1:end);
        vectorIndex = vectorIndex + length(weightVectors{vectorNumber});        
        vectorNumber = vectorNumber + 1;
    end
end

%% USE SMOOTHED MODEL TO CALCULATE ERROR

if isa(network,'DAGNetwork')
    dagNet = createLgraphUsingConnections(layers, network.Connections);
    smoothedNet = assembleNetwork(dagNet);
elseif isa(network,'dlnetwork')
    dagNet = createLgraphUsingConnections(layers, network.Connections);
    smoothedNet = dlnetwork(dagNet);
else
    smoothedNet = assembleNetwork(layers);
end

prediction = predict(smoothedNet,trainingData);

if iscell(prediction)
    for cellRow=1:size(prediction,1)
        RMSE(cellRow) = sqrt(mean((prediction{cellRow}'-trainingTarget{cellRow}').^2));
    end
    estimatedError = nanmean(RMSE);
elseif size(prediction,2)>1
    estimatedError = crossentropy(prediction,trainingTarget');
else
    estimatedError = sqrt(mean((prediction-trainingTarget).^2));
end

end

nan's when fitting some layers

Things we are seeing when trying to apply WeightWatcher to BERT:

'nan' in fit cumulative distribution values.
Likely underflow or overflow error: the optimal fit for this distribution gives values that are so extreme that we lack the numerical precision to calculate them.

From a product perspective, it might be useful to catch these exceptions and log them in the details dataframe so the user can see which layers specifically had this problem... I'll add this to the git issues list.

I suspect this is happening when the alphas are unusually large.
Alpha is fit using a brute-force method that searches for the optimal x_min.
Sometimes, when the alphas are very large (>6), the optimal x_min is so large that it screws up and shows this.
But I don't think this should happen for alpha < 5 or 6. (edited)
see: https://arxiv.org/abs/2106.00734
Figure 9(g), page 20
In some cases, the tail does not 'fill out' and the fit is very difficult... this is a sign of a bad layer.
There are a few corner cases like this; we only discuss a couple of them in the papers, as there was not enough room to cover them all.
[screenshot]

Error while plotting

After upgrading to 0.4.0 I see this error on the analyze method; this worked with version 0.2.6.

My code
mdl1 = binaryClassification(input_size, BATCH_SIZE)
#mdl.to(device)
mdl1.load_state_dict(torch.load('my_model.mdl'))

watcher = ww.WeightWatcher(model=mdl1)
results1 = watcher.analyze(plot=True)
plt.show()

Error
DEBUG:weightwatcher:fitting power law on 1 eigenvalues
Less than 2 unique data values left after xmin and xmax options! Cannot fit. Returning nans.
Traceback (most recent call last):
File "/Users/sanjaypillay/MS_DS/git/maps_capstone/ww_v1_model.py", line 192, in nn_kmer
results1 = watcher.analyze(plot=True)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 976, in analyze
self.apply_fit_powerlaw(ww_layer, params)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 875, in apply_fit_powerlaw
alpha, xmin, xmax, D, sigma, num_pl_spikes = self.fit_powerlaw(evals, xmin=xmin, xmax=xmax, plot=plot, title="", sample=sample, sample_size=sample_size)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 1305, in fit_powerlaw
fig2 = fit.plot_pdf(color='b', linewidth=2)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 525, in plot_pdf
return plot_pdf(data, ax=ax, linear_bins=linear_bins, **kwargs)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 2073, in plot_pdf
edges, hist = pdf(data, linear_bins=linear_bins, **kwargs)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 1950, in pdf
xmax = max(data)

ValueError: max() arg is an empty sequence

get_details() slice counts are not meaningful, confusing

Unfortunately, because we report results for all layers, sliced or not, as having slices, we can get meaningless results, such as for these DENSE layers (in VGG16):

layer_id  layer_type  N      M     slice  slice_count  level        comment      norm     lognorm
20        DENSE       25088  4096  0      NaN          LEVEL.SLICE  Slice level  23.429   1.36975
20        DENSE       25088  4096  NaN    1            LEVEL.LAYER  Layer level  23.429   1.36975
21        DENSE       4096   4096  0      NaN          LEVEL.SLICE  Slice level  18.0218  1.2558
21        DENSE       4096   4096  NaN    1            LEVEL.LAYER  Layer level  18.0218  1.2558
22        DENSE       4096   1000  0      NaN          LEVEL.SLICE  Slice level  16.7575  1.22421
22        DENSE       4096   1000  NaN    1            LEVEL.LAYER  Layer level  16.7575  1.22421

A better solution is needed. Options include:

  1. removing slices altogether, allowing Conv2D to be one giant matrix
    (problem: Attention models also have slices)

  2. setting all NaN Slice layers to 1 (or 0)

  3. hide slice columns ?

  4. aggregate slices into a more complex, hierarchical dataframe?

Tensorflow 2.0 keras support issue

Hello, when I try to use weightwatcher with my tf.keras layers, I get the "skipping layer" issue. Do you know how I could fix that? Thanks!

A generic measure for generalization

This is a great library.

Is it possible for you to provide a summary metric that can show that one network has a higher capacity than other networks on a given data?

Is it possible for you to provide a summary metric that can show that one network has a higher capacity than other networks based only on the weight distribution?

Can you try out generic information theoretic measures to provide a measure of the capacity of the neural network based on observing the weight distribution alone?

Unfortunately, I am in Python 2 land at the moment and plan to migrate to Python 3 in the near future. I cannot run the code in the default implementation without tweaks.

The get_summary method provides statistics like min, avg, max. This is different from a specific metric that can tell the capacity of a network: for example, being able to tell that GPT-3 is better than GPT-2 based on the weight distribution alone, and providing this information in a single value rather than in a chart whose interpretation is subjective and depends on the technical abilities of the analyst.

Kenneth Odoh

Add support for dense layers with 1 output

It has been suggested to treat vectors / dense layers with 1 output by

  • pad the layer so it has a square number of elements (1x1574) -> (1x1600)
  • reshape the layer to a square matrix (1x1600) -> (40x40) in some arbitrary way
  • compute the correlations and alpha, and/or apply SVD Smoothing
  • reshape back (40x40) -> (1x1600)
  • unpad

Note: It is necessary to keep track of how the indices were reshaped, in order to map back for the SVDSmoothing.

Below is sample Matlab code for this, which reshapes the 1x1574 vector into a 40x40 matrix:

function [correlation, RMSE] = predictedTestLoss(layers, trainingData, trainingTarget)
%PREDICTEDTESTLOSS Given training set and training target, predicts test
%loss based on SVDSmoothing technique by Charles (WW)
for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % for now, skip 1x1 convs and/or fully connected layers
        if ndims(layers(layerNumber).Weights) > 2  && size(layers(layerNumber).Weights,1) > 1
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    X = layers(layerNumber).Weights(:,:,input,output);
                    nComponents = round(20 * length(svd(X)) / 100);
                    [~,~,v] = svds(X',nComponents, 'smallestnz');
%                     [~,~,v] = rsvd(X',nComponents);
%                     tolerance = rescale(100-20,sqrt(eps(class(X))),1,"InputMin",0,"InputMax",100);
%                     [~,~,v] = svdsketch(X',tolerance);
                    X = X'*v;
                    VT=v';
                    layers(layerNumber).Weights(:,:,input,output) = (X*VT)';
                end
            end
            % for dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            squareSize = ceil(sqrt(numel(layers(layerNumber).Weights)));
            padding = squareSize^2 - numel(layers(layerNumber).Weights);
            X = reshape([layers(layerNumber).Weights zeros(1, padding)],squareSize,squareSize);
            nComponents = round(20 * length(svd(X)) / 100);
            [~,~,v] = svds(X',nComponents, 'smallestnz');
%             [~,~,v] = rsvd(X',nComponents);
%             tolerance = rescale(100-20,sqrt(eps(class(X))),1,"InputMin",0,"InputMax",100);
%             [~,~,v] = svdsketch(X',tolerance);
            X = X'*v;
            VT=v';
            smoothedVector = reshape((X*VT)',1,squareSize^2);
            layers(layerNumber).Weights=smoothedVector(1:end-padding);
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        X = [layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights];
        nComponents = round(20 * length(svd(X)) / 100);
        [~,~,v] = rsvd(X',nComponents);
        X = X'*v;
        VT=v';
        smoothedWeights=(X*VT)';
        layers(layerNumber).InputWeights=smoothedWeights(:,1:size(layers(layerNumber).InputWeights,2));
        layers(layerNumber).RecurrentWeights=smoothedWeights(:,size(layers(layerNumber).InputWeights,2)+1:end);
    end
end
smoothedNet = assembleNetwork(layers);
prediction = predict(smoothedNet,trainingData);
correlation = corr(prediction, trainingTarget);
RMSE = sqrt(mean((prediction-trainingTarget).^2));
end

input error

compute_alphas = False, plot = True

should give an error

Fix title on LogESD

This is not a log-log plot... it is meant to show deviations from log-normality.
It is NOT the log of the ESD density; it is the density of the log of the eigenvalues, not plotted on a log-log scale.


[screenshot]

This is the actual log log plot of the PDF
[screenshot]

Add np.SVD option

We would like to support calculating all eigenvalues.
This is slower and should be a user input option.

IMHO, the right way to do this is to make our own SVD function and wrap this
We could also support SVD on GPU if available

MP Fits may fail because the ESD is not scaled properly

We have some scaling issues getting the MP fits to work, in particular for the randomized ESDs.

I am trying to find a good general-purpose solution for this.

This is a good example: if the ESD is rescaled and the spikes are removed, then the MP fit is excellent
(layer 28, Randomized VGG11 from the torchmodels package)

[screenshot]

Here is the rescaled version
[screenshot]

Using average alpha as part of loss function?

I'm wondering, in the context of regression, what would happen if one created a loss function that includes alpha, for example by multiplying the current MSE by alpha, so that as alpha decreases, so does the loss?

BERT / PyTorch Embedding Layers being skipped

from transformers import BertModel, BertConfig

CHECKPOINT = "bert-base-uncased"

# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()

# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)

# Accessing the model configuration
configuration = model.config

INFO:weightwatcher:params {'glorot_fix': False, 'normalize': False, 'conv2d_norm': True, 'randomize': False, 'savefig': False, 'rescale': True, 'deltaEs': False, 'intra': False, 'channels': None, 'conv2d_fft': False, 'min_evals': 0, 'max_evals': None, 'plot': False, 'mp_fit': False, 'ww2x': False, 'layers': []}
WARNING:weightwatcher:pytorch layer: Embedding(30522, 768, padding_idx=0)  type LAYER_TYPE.EMBEDDING not found 
WARNING:weightwatcher:pytorch layer: Embedding(512, 768)  type LAYER_TYPE.EMBEDDING not found 
WARNING:weightwatcher:pytorch layer: Embedding(2, 768)  type LAYER_TYPE.EMBEDDING not found 
