
weightwatcher's Introduction



WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs) without needing access to training or even test data. It grew out of theoretical research into Why Deep Learning Works, grounded in our Theory of Heavy-Tailed Self-Regularization (HT-SR), and uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.

It can be used to:

  • analyze pre-trained or trained PyTorch and Keras DNN models (Conv2D and Dense layers)
  • monitor models, and the model layers, to see if they are over-trained or over-parameterized
  • predict test accuracies across different models, with or without training data
  • detect potential problems when compressing or fine-tuning pretrained models
  • flag layers with warning labels: over-trained; under-trained

Quick Links

See also the notebooks provided in the examples directory

Installation: Version 0.7.5.2

pip install weightwatcher

If this fails, try installing the current TestPyPI version (0.7.5.2):

 python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher

Usage

import weightwatcher as ww
import torchvision.models as models

model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()
summary = watcher.get_summary(details)

It really is that easy to run, and it generates a pandas dataframe with details (and plots) for each layer

Sample Details Dataframe

and a summary dictionary of generalization metrics

    {'log_norm': 2.11,
     'alpha': 3.06,
     'alpha_weighted': 2.78,
     'log_alpha_norm': 3.21,
     'log_spectral_norm': 0.89,
     'stable_rank': 20.90,
     'mp_softrank': 0.52}

Advanced Usage

The watcher object has several functions and analysis features, described below.

Notice the min_evals setting: the power-law fits need at least 50 eigenvalues to make sense, but describe() and the other methods do not have this requirement.

watcher.analyze(model=None, layers=[], min_evals=50, max_evals=None,
	 plot=True, randomize=True, mp_fit=True, pool=True, savefig=True)
...
watcher.describe(model=None, layers=[], min_evals=0, max_evals=None,
         plot=True, randomize=True, mp_fit=True, pool=True)
...
watcher.get_details()
watcher.get_summary(details) or get_summary()
watcher.get_ESD()
...
watcher.distances(model_1, model_2)

PEFT / LORA models (experimental)

To analyze a PEFT / LoRA fine-tuned model, specify the peft option.

  • peft = True : forms the low-rank BA matrix and analyzes the delta layers, tagged with 'lora_BA' in the name

    details = watcher.analyze(peft=True)

  • peft = 'with_base' : analyzes the base model, the delta, and the combined layer weight matrices.

    details = watcher.analyze(peft='with_base')

The base_model and the fine-tuned model must have the same layer names; WeightWatcher will ignore layers that do not share the same name. Also, at this point, biases are not considered. Finally, both models should be stored in the same format (e.g. safetensors).

Note: If you want to select by layer_ids, you must first run describe(peft=False), and then select both the lora_A and lora_B layers
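
For example, here is a hedged sketch of that two-step workflow. The 'name' column, the 'lora_' naming convention, and combining layers= with peft= are assumptions; inspect your own details dataframe and the docstrings to confirm them:

import weightwatcher as ww

watcher = ww.WeightWatcher(model=peft_model)   # peft_model: your PEFT / LoRA model (assumed given)

# Step 1: describe with peft=False to list every layer, including the lora_A and lora_B layers
details = watcher.describe(peft=False)

# Step 2: collect the layer ids of the LoRA layers by name
# (assumes a 'name' column and a 'lora_' naming convention; adjust to your model)
lora_ids = details[details['name'].str.contains('lora_', na=False)]['layer_id'].tolist()

# Step 3: analyze just those layers (passing layers= together with peft= is assumed to be supported)
lora_details = watcher.analyze(layers=lora_ids, peft=True)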

Usage: Base Model


Plotting and Fitting the Empirical Spectral Density (ESD)

WW creates plots for each layer weight matrix so you can see how well the power-law fits work

details = watcher.analyze(plot=True)

For each layer, WeightWatcher plots the ESD, a histogram of the eigenvalues of the layer correlation matrix X = WᵀW. It then fits the tail of the ESD to a (Truncated) Power Law and plots these fits on different axes. The summary metrics (above) characterize the Shape and Scale of each ESD. Here's an example:

Generally speaking, the ESDs of the best layers in the best DNNs can be fit to a Power Law (PL), with PL exponents alpha close to 2.0. Visually, the ESD looks like a straight line on a log-log plot (above left).
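
For intuition only, here is a minimal sketch of this kind of tail fit done by hand with numpy and the powerlaw package (the same package WeightWatcher builds on), rather than through WeightWatcher itself:

import numpy as np
import powerlaw

# W: a single layer weight matrix; a random stand-in here, just to show the mechanics.
# A purely random W is *not* heavy tailed, so the fitted alpha will be large (often > 8);
# well-trained layers typically give alpha much closer to 2.
W = np.random.randn(4096, 1000)

# eigenvalues of the correlation matrix X = W^T W (squared singular values of W)
evals = np.linalg.svd(W, compute_uv=False) ** 2

# fit the tail of the ESD to a power law; powerlaw searches for the optimal xmin
fit = powerlaw.Fit(evals)
print("alpha =", fit.power_law.alpha)  # PL exponent
print("xmin  =", fit.power_law.xmin)   # start of the fitted tail
print("D     =", fit.power_law.D)      # Kolmogorov-Smirnov distance of the fit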

Generalization Metrics

The goal of the WeightWatcher project is to find generalization metrics that most accurately reflect observed test accuracies across many different models and architectures, for pre-trained models and for models undergoing training.

Our HT-SR theory says that well-trained, well-correlated layers should be significantly different from the MP (Marchenko-Pastur) random bulk and, specifically, should be heavy tailed. WeightWatcher provides several layer metrics for this, including:

  • rand_distance : the distance in distribution from the randomized layer
  • alpha : the slope of the tail of the ESD, on a log-log scale
  • alpha-hat or alpha_weighted : a scale-adjusted form of alpha (similar to the alpha-Schatten norm)
  • stable_rank : a norm-adjusted measure of the scale of the ESD
  • num_spikes : the number of spikes outside the MP bulk region
  • max_rand_eval : the scale of the random noise in the layer

All of these attempt to measure how non-random and/or non-heavy-tailed the layer ESDs are.

Scale Metrics

  • log Frobenius norm : the log of the squared Frobenius norm of W, i.e. the log of the sum of the eigenvalues of X

  • log_spectral_norm : the log of the largest eigenvalue of X (the squared spectral norm of W)

  • stable_rank : the squared Frobenius norm divided by the largest eigenvalue; a norm-adjusted measure of the scale of the ESD

  • mp_softrank : the MP bulk edge of the randomized ESD divided by the largest eigenvalue
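
For intuition, here is a minimal numpy sketch of these scale metrics for a single weight matrix, using the definitions as I understand them from the HT-SR papers; treat the exact conventions (log base, Conv2D slice handling, the bulk edge used for mp_softrank) as assumptions, since WeightWatcher's internal implementation may differ in detail:

import numpy as np

def scale_metrics(W, rand_eval_max=None):
    """Rough per-layer scale metrics for the correlation matrix X = W^T W.
    rand_eval_max: largest eigenvalue of a randomized version of W (assumed given),
    needed only for the mp_softrank estimate."""
    evals = np.linalg.svd(W, compute_uv=False) ** 2   # eigenvalues of X
    frob2 = evals.sum()        # squared Frobenius norm = sum of eigenvalues
    lambda_max = evals.max()   # squared spectral norm = largest eigenvalue

    metrics = {
        "log_norm": np.log10(frob2),                 # log Frobenius norm (squared)
        "log_spectral_norm": np.log10(lambda_max),   # log of the largest eigenvalue
        "stable_rank": frob2 / lambda_max,           # norm-adjusted scale of the ESD
    }
    if rand_eval_max is not None:
        # mp_softrank (approximate): randomized bulk edge relative to the true lambda_max
        metrics["mp_softrank"] = rand_eval_max / lambda_max
    return metrics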

Shape Metrics

  • alpha : Power Law (PL) exponent
  • (Truncated) PL quality of fit D : the Kolmogorov-Smirnov distance between the empirical tail and the fit

(advanced usage)

  • TPL : (alpha and Lambda) Truncated Power Law Fit
  • E_TPL : (alpha and Lambda) Extended Truncated Power Law Fit

Scale-adjusted Shape Metrics

  • alpha_weighted : alpha times the log of the largest eigenvalue (also written alpha-hat); a scale-adjusted alpha
  • log_alpha_norm : the log of the alpha-Schatten norm of the correlation matrix X

Direct Correlation Metrics

The rand_distance metric is a new, non-parametric approach that appears to work well in early testing. See this recent blog post.

  • rand_distance : Distance of layer ESD from the ideal RMT MP ESD

There are also related metrics, including the new

  • 'ww_maxdist'
  • 'ww_softrank'

Misc Details

  • N, M : Matrix or Tensor Slice Dimensions
  • num_spikes : number of spikes outside the bulk region of the ESD, when fit to an MP distribution
  • num_rand_spikes : number of Correlation Traps
  • max_rand_eval : scale of the random noise in the layer

Summary Statistics:

The layer metrics are averaged in the summary statistics:

Get the average metrics, as a summary (dict), from the given (or current) details dataframe

details = watcher.analyze(model=model)
summary = watcher.get_summary(details)

or just

summary = watcher.get_summary()

The summary statistics can be used to gauge the test error of a series of pre/trained models, without needing access to training or test data.

  • average alpha can be used to compare one or more DNN models with different hyperparameter settings θ, when depth is not a driving factor (e.g. Transformer models)
  • average log_spectral_norm is useful for comparing models of different depths L at a coarse-grained level
  • average alpha_weighted and log_alpha_norm are suitable for DNNs of differing hyperparameters θ and depths L simultaneously (e.g. CV models like VGG and ResNet)
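
For example, here is a sketch of comparing several pretrained VGG models by their average alpha and alpha_weighted; the torchvision model choices are just illustrative:

import weightwatcher as ww
import torchvision.models as models

# HT-SR predicts that a smaller average alpha correlates with better generalization
for name, loader in [("vgg11_bn", models.vgg11_bn),
                     ("vgg16_bn", models.vgg16_bn),
                     ("vgg19_bn", models.vgg19_bn)]:
    model = loader(pretrained=True)
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()
    summary = watcher.get_summary(details)
    print(f"{name}: alpha={summary['alpha']:.2f}, "
          f"alpha_weighted={summary['alpha_weighted']:.2f}")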

Predicting the Generalization Error

WeightWatcher (WW) can be used to compare the test error for a series of models trained on similar datasets, but with different hyperparameters θ, or even with different but related architectures.

Our Theory of HT-SR predicts that models with smaller PL exponents alpha, on average, correspond to models that generalize better.

Here is an example of the alpha_weighted capacity metric for all the current pretrained VGG models.

Notice: we did not peek at the ImageNet test data to build this plot.

This can be reproduced with the Examples Notebooks for VGG and also for ResNet

Detecting signs of Over-Fitting and Under-Fitting

WeightWatcher can help you detect the signatures of over-fitting and under-fitting in specific layers of a pre-trained or trained Deep Neural Network.

WeightWatcher analyzes your model layer by layer and shows you where these kinds of problems may be lurking.

Correlation Traps

The randomize option lets you compare the ESD of the layer weight matrix (W) to the ESD of its randomized form. This is a good way to visualize the correlations in the true ESD and to detect signatures of over- and under-fitting.
details = watcher.analyze(randomize=True, plot=True)
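
Conceptually, the randomized ESD shows what the layer would look like if its weights carried no correlations. Here is a minimal numpy sketch of that comparison (not WeightWatcher's own implementation; `layer_weights` is an assumed, user-supplied weight matrix):

import numpy as np

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W."""
    return np.linalg.svd(W, compute_uv=False) ** 2

W = np.asarray(layer_weights)                               # a trained layer weight matrix (assumed given)
W_rand = np.random.permutation(W.ravel()).reshape(W.shape)  # element-wise shuffle removes correlations

evals, rand_evals = esd(W), esd(W_rand)

# A well-trained layer has a heavy tail extending far beyond the randomized bulk;
# a large *randomized* eigenvalue separated from its own bulk suggests a Correlation Trap.
print("max eval (true)      :", evals.max())
print("max eval (randomized):", rand_evals.max())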

Fig (a) is well trained; Fig (b) may be over-fit.

That orange spike on the far right is the tell-tale clue; it's called a Correlation Trap.

A Correlation Trap is characterized by Fig (b): here the actual (green) and random (red) ESDs look almost identical, except for a small shelf of correlation (just right of 0). In the random (red) ESD, the largest eigenvalue (orange) is far to the right of, and separated from, the bulk of the ESD.

Correlation Traps

When layers look like Figure (b) above, they have not been trained properly, because they look almost random, with only a little bit of information present. The information the layer has learned may even be spurious.

Moreover, the metric num_rand_spikes (in the details dataframe) contains the number of spikes (or traps) that appear in the layer.

The SVDSharpness transform can be used to remove Correlation Traps, either during training (after each epoch) or after training:

sharpened_model = watcher.SVDSharpness(model=...)

Sharpening a model is similar to clipping the layer weight matrices, but it uses Random Matrix Theory to do this in a more principled way than simple clipping.

Early Stopping

Note: This is experimental, but we have seen some success here.

The WeightWatcher alpha metric may be used to detect when to apply early stopping. When the average alpha (summary statistic) drops below 2.0, this indicates that the model may be over-trained and early stopping may be necessary.

Below is an example of this, showing the training and test loss curves for a small Transformer model trained from scratch, along with the average alpha summary statistic.

Early Stopping

We can see that as the training and test losses decrease, so does alpha. But when the test loss saturates and then starts to increase, alpha drops below 2.0.

Note: this only works for very well-trained models, where the optimal alpha = 2.0 is obtained.
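
Here is a hedged sketch of what such a check might look like inside a training loop; `train_one_epoch`, `train_loader`, `optimizer`, `model`, and `max_epochs` are placeholders for your own training code, not part of WeightWatcher:

import weightwatcher as ww

ALPHA_STOP = 2.0   # HT-SR: average alpha dropping below ~2.0 suggests over-training

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)   # your own training step (assumed given)

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()
    avg_alpha = watcher.get_summary(details)['alpha']
    print(f"epoch {epoch}: average alpha = {avg_alpha:.2f}")

    if avg_alpha < ALPHA_STOP:
        print("average alpha fell below 2.0 -- consider stopping early")
        break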


Additional Features

There are many advanced features, described below

Filtering


filter by layer types

ww.LAYER_TYPE.CONV2D | ww.LAYER_TYPE.DENSE

as

details=watcher.analyze(layers=[ww.LAYER_TYPE.CONV2D])

filter by layer ID or name

details=watcher.analyze(layers=[20])

Calculations


minimum, maximum number of eigenvalues of the layer weight matrix

Sets the minimum and maximum size of the weight matrices analyzed. Setting the maximum is useful for quick debugging.

details = watcher.analyze(min_evals=50, max_evals=500)

specify the Power Law fitting procedure

To replicate results using TPL or E_TPL fits, use:

details = watcher.analyze(fit='PL'|'TPL'|'E_TPL')

The details dataframe will now contain two quality metrics for each layer:

  • alpha : basically (but not exactly) the same PL exponent as before, useful for alpha > 2.0
  • Lambda : a new metric, now useful when the (TPL) alpha < 2.0

(The TPL fits correct a problem we have had where the PL fits over-estimate alpha for TPL layers.)

As with the alpha metric, smaller Lambda implies better generalization.

Visualization


Save all model figures

Saves the layer ESD plots for each layer

watcher.analyze(savefig=True)                    # save to the default directory
watcher.analyze(savefig='/plot_save_directory')  # or specify a directory

generating 4 files per layer

ww.layer#.esd1.png
ww.layer#.esd2.png
ww.layer#.esd3.png
ww.layer#.esd4.png

Note: additional plots will be saved when the randomize option is used

fit ESDs to a Marchenko-Pastur (MP) distribution

The mp_fit option tells WW to fit each layer ESD, treated as a Random Matrix, to a Marchenko-Pastur (MP) distribution, as described in our papers on HT-SR.

details = watcher.analyze(mp_fit=True, plot=True)

and reports

num_spikes, mp_sigma, and mp_softrank

This also works for the randomized ESD and reports

rand_num_spikes, rand_mp_sigma, and rand_mp_softrank

fetch the ESD for a specific layer, for visualization or additional analysis

watcher.analyze()
esd = watcher.get_ESD()
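
For example, here is a minimal sketch of visualizing the fetched ESD with matplotlib, continuing from the watcher created above; the plotting code is illustrative and not part of the WeightWatcher API:

import numpy as np
import matplotlib.pyplot as plt

watcher.analyze()
esd = watcher.get_ESD()   # eigenvalues of the layer correlation matrix

# histogram of the log eigenvalues: heavy-tailed (well-trained) layers show a long right tail
plt.hist(np.log10(esd), bins=100)
plt.xlabel("log10(eigenvalue)")
plt.ylabel("count")
plt.title("Layer ESD (log scale)")
plt.show()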

Model Analysis


describe a model

Describe a model and report the details dataframe, without analyzing it

details = watcher.describe(model=model)

comparing two models

The new distances method reports the distances between two models, such as the norm between the initial weight matrices and the final, trained weight matrices

details = watcher.distances(initial_model, trained_model)

Compatibility


compatibility with version 0.2.x

The new 0.4.x versions of WeightWatcher treat each layer as a single, unified set of eigenvalues. In contrast, the 0.2.x versions split the Conv2D layers into n slices, one for each receptive field. The pool=False option provides results that are backward-compatible with the 0.2.x versions of WeightWatcher (this option used to be called ww2x=True), with details provided for each slice of each layer. Otherwise, the eigenvalues from each slice of the Conv2D layer are pooled into one ESD.

details = watcher.analyze(pool=False)

Requirements

  • Python 3.7+

Frameworks supported

  • Tensorflow 2.x / Keras
  • PyTorch 1.x
  • HuggingFace

Note: the current version requires both TensorFlow and PyTorch; if there is demand, this will be updated to make installation easier.

Layers supported

  • Dense / Linear / Fully Connected (and Conv1D)
  • Conv2D

Tips for First Time Users

When using WeightWatcher for the first time, I recommend selecting at least one trained model and running `weightwatcher` with all analyze options enabled, including the plots. From this, look at:
  • whether the layer ESDs are well formed and heavy tailed
  • whether any layers are nearly random, indicating they are not well trained
  • whether all the power law (alpha) fits appear reasonable, and xmin is small enough that the fit captures a reasonable section of the ESD tail

Moreover, the Power Law and alpha fits only work well when the ESDs are both heavy tailed and can be easily fit to a single power law. Occasionally the power law and/or alpha fits don't work. This happens when:

  • the ESD is random (not heavy tailed), alpha > 8.0
  • the ESD is multimodal (rare, but it does occur)
  • the ESD is heavy tailed, but not well described by a single power law. In these cases, sometimes alpha only fits the very last part of the tail and is too large. This is easily seen on the Lin-Lin plots.

In any of these cases, I usually throw away results where alpha > 8.0 because they are spurious. If you suspect your layers are under-trained, you have to look both at alpha and at a plot of the ESD itself (to see if it is heavy tailed or just random-like).
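
For instance, here is a small pandas sketch of flagging and dropping those spurious fits from the details dataframe; the alpha and layer_id column names follow the metrics listed above:

# drop layers whose power-law fits are spurious (alpha > 8.0)
good = details[details['alpha'] < 8.0]

# layers with very large alpha are worth inspecting by eye (random-like or under-trained?)
suspect = details[details['alpha'] >= 8.0]
print(suspect[['layer_id', 'alpha']])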


How to Release

Publishing to the PyPI repository:
# 1. Check in the latest code with the correct revision number (__version__ in __init__.py)
vi weightwatcher/__init__.py # Increase the release number, remove -dev from the revision number
git commit
# 2. Check out latest version from the repo in a fresh directory
cd ~/temp/
git clone https://github.com/CalculatedContent/WeightWatcher
cd WeightWatcher/
# 3. Use the latest version of the tools
python -m pip install --upgrade setuptools wheel twine
# 4. Create the package
python setup.py sdist bdist_wheel
# 5. Test the package
twine check dist/*
# 6. Upload the package to TestPyPI first
twine upload --repository testpypi dist/*
# 7. Test the TestPyPI install
python3 -m pip install --index-url https://test.pypi.org/simple/ weightwatcher
...
# 8. Upload to actual PyPI
twine upload dist/*
# 9. Tag/Release in GitHub by creating a new release (https://github.com/CalculatedContent/WeightWatcher/releases/new)

License

Apache License 2.0


Academic Presentations and Media Appearances

This tool is based on state-of-the-art research done in collaboration with UC Berkeley:

WeightWatcher has been featured in top journals like JMLR and Nature, and has been presented at Stanford, UC Berkeley, KDD, and elsewhere:

Latest papers and talks

KDD2019 Workshop

WeightWatcher has also been featured at local meetups and on many popular podcasts

Popular Podcasts and Blogs

2021 Short Presentations

Recent talk(s) by Mike Mahoney, UC Berkeley


Experimental / Most Recent version (not ready yet)

You may install the latest / trunk version from TestPyPI:

python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher

The TestPyPI version usually has the most recent updates, including experimental methods and bug fixes. But PyPI has changed the way it handles TestPyPI packages that require non-TestPyPI dependencies; e.g., torch and tensorflow fail to install from TestPyPI.

If you already have them installed in your environment, you're fine. Otherwise, you need to install them first.


Contributors

Charles H Martin, PhD, Calculation Consulting

Serena Peng, Christopher Hinrichs


Consulting Practice

Calculation Consulting homepage

Calculated Content Blog


weightwatcher's Issues

Support for efficient CNNs

Hi,

Very interesting/cool work. I am wondering if you have tried it on more efficient CNN models, for instance EfficientNet, MobileNet, or FBNet?

Bug in the MP fits of the randomized ESDs

Please be aware that there remains some bug in the MP fits of the randomized ESDs

In some cases the MP fit looks quite off; in fact, the fit is just wrong.

Here is an example from VGG11, layer 8. This is the current code (0.4.7):
[screenshot]

This is the correct fit
[screenshot]

All layers skipped

I am trying to test on a simple model inspired by the example here:
https://keras.io/examples/cifar10_cnn/

The result shows that after applying ww, all layers are skipped. Here is the model summary:

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_24 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_36 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_37 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_18 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 15, 15, 64)        18496     
_________________________________________________________________
activation_38 (Activation)   (None, 15, 15, 64)        0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 13, 13, 64)        36928     
_________________________________________________________________
activation_39 (Activation)   (None, 13, 13, 64)        0         
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 6, 6, 64)          0         
_________________________________________________________________
dropout_19 (Dropout)         (None, 6, 6, 64)          0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 2304)              0         
_________________________________________________________________
dense_12 (Dense)             (None, 512)               1180160   
_________________________________________________________________
activation_40 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_20 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 10)                5130      
_________________________________________________________________
activation_41 (Activation)   (None, 10)                0         
=================================================================
Total params: 1,250,858
Trainable params: 1,250,858
Non-trainable params: 0

WW log:

'''
{0: {'id': 0,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f812430fd68>},
1: {'id': 1,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f812430b3c8>},
2: {'id': 2,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f812430b630>},
3: {'id': 3,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f81242ff710>},
4: {'id': 4,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.pooling.MaxPooling2D at 0x7f81242ff898>},
5: {'id': 5,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123eb2898>},
6: {'id': 6,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f8123eb2cc0>},
7: {'id': 7,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123e823c8>},
8: {'id': 8,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f8123e82240>},
9: {'id': 9,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123e8b7f0>},
10: {'id': 10,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.pooling.MaxPooling2D at 0x7f8123e8b898>},
11: {'id': 11,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123ea0dd8>},
12: {'id': 12,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Flatten at 0x7f8123ea0c50>},
13: {'id': 13,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dense at 0x7f8123ea0c18>},
14: {'id': 14,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123def9b0>},
15: {'id': 15,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dropout at 0x7f8123def9e8>},
16: {'id': 16,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Dense at 0x7f8123defcc0>},
17: {'id': 17,
'message': 'Skipping (Layer not supported)',
'type': <tensorflow.python.keras.layers.core.Activation at 0x7f8123da4198>}}
'''

Let the user define the logging level

Hi, thanks for creating this nice Python package :) It would be nice to be able to define the desired logging level. When running inside a larger application with other logging mechanisms, the logs coming from weightwatcher are pretty verbose and make it hard to parse things. I see it's already marked as a TODO.

Perhaps you could have a set_logging_level function to set a global verbose level in the library?
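
In the meantime, a possible workaround is to adjust the logger with Python's standard logging module; the logger name 'weightwatcher' is an assumption, so check logging.Logger.manager.loggerDict if it does not take effect:

import logging

# silence everything below WARNING from the (assumed) 'weightwatcher' logger
logging.getLogger('weightwatcher').setLevel(logging.WARNING)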

v0.1.3 installation issues

Some requirements that were not installed:
pypandoc
msgpack
upgrading to setuptools>=41.0.0

This was an installation into a conda environment with the basic anaconda package installed (conda version 4.5.12, python 3.6.5), along with an old version of pytorch (0.4.1).

name not set in watcher.describe()

using the new Keras iterator

[screenshot]

The layer does have a default name
[screenshot]

These probably need to be added as 'original_name' or something like this

rand=True fails when M = 1

See line 1957:

1955
1956 num_spikes = len(to_plot[to_plot > bulk_max])
-> 1957 ratio_numofSpikes = num_spikes / (M - 1)
1958

Math Warnings On Analyze

I see a couple of warning messages when I run analyze():

/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: divide by zero encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
Less than 2 unique data values left after xmin and xmax options! Cannot fit. Returning nans.

These don't cause a crash or prevent getting results, but I wonder: should I expect to see these messages, and is it potentially a problem?

add channels flag to the watcher initializer

Some models, like the ONNX models, have channels last even though the ONNX default is channels first.
Currently, the channels flag is set in the analyze method.
It may make more sense to move/add this to the initializer.

[screenshot]

plot_loghist is not defined

weightwatcher 0.4.1. When I run analyze(plot=True), I get this error:

 File "/home/xander/dev/tsai/tsai/callback/MVP.py", line 172, in weight_watcher_analyze
    details = watcher.analyze(plot=True, savefig=True)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 989, in analyze
    self.apply_fit_powerlaw(ww_layer, params)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 885, in apply_fit_powerlaw
    sample=sample, sample_size=sample_size, savefig=savefig)
  File "/home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 1335, in fit_powerlaw
    plot_loghist(evals[evals>(xmin/100)], bins=100, xmin=xmin)
NameError: name 'plot_loghist' is not defined

This is the line, but I don't see it defined anywhere. Where is plot_loghist supposed to be coming from?

Matlab alpha implementation

Hi, I would like to implement alpha within the Deep Learning Matlab framework for early stopping without requiring a validation sample.

Could you please point me toward, or write down, the pseudo-code for calculating alpha in that context, so that I can implement it in Matlab?

Thank you!
Federico

MXNet Support

Hello! I love this work, thanks for open-sourcing it.

My department is a heavy user of the Sockeye framework for seq2seq learning, which internally relies on MXNet as opposed to PyTorch or TF.

I'd like to go back and run this tool on our models and am willing to put in some elbow grease to get MXNet support. I'm wondering if you have any idea of the steps involved in this, roughly, and the amount of work they would be. That would help me get started and decide if this project is worth the effort.

Thanks!

Unified-SVDsmoothing approach for "smalish" models

Hi Charles, as we discussed on the SVDsmoothing channel on Slack, here's my Matlab implementation of an approach to your SVDsmoothing algorithm. It roams through all kinds of layers, including LSTMs, reshapes each weight matrix it finds into a vector (remembering the original shape), concatenates all vectors into one large vector, and then reshapes that vector into a large square matrix (zero-padding the vector first, if required, to make it square). It then applies SVD20 (or SVD10, or something else) smoothing to the large matrix, and finally goes backward: it reshapes the large, smoothed matrix into a long vector, discards the padding, recovers the vectors corresponding to each layer from the large vector, and lastly reshapes each vector back to its original matrix shape. It has comments that should allow you to port it to Python for WW.

I tested this approach to estimate test error in two Matlab toy models as well as in my own model/data, and in all cases, it seems to work pretty well, following test error in all its ups and downs, for instance going up together with test error when the network training process starts overfitting the training set.

Here's a capture of training for 60 epochs with my model and my data. The training set is composed of about 4000 samples (that are actually augmentations of only 260 samples!) and the test set is composed of only 30 samples. It's remarkable how well the test estimation compares to the test error with such a small test set (I was using SVD50 in this case).
[screenshot]

Here is the architecture of my model:
[screenshots]

In the code I included the option to normalize each layer's weight vector (subtract the mean and divide by the standard deviation), saving the mean and standard deviation of each layer so that the vector can be de-normalized at the end, after SVD. The rationale was that the large matrix would end up containing weights on the same scale this way. But this has not given me good results in practice. Maybe you can exclude it from the port, or maybe you can include a better way to normalize the vectors.

This method should allow people to train smallish models without a validation set and to use the estimated error to know when to stop training.

Charles:
I will need to explain theoretically, in my paper/dissertation, why this approach works; in particular, why truncated SVD20 of a large matrix containing all the weights of a network indiscriminately can be a good predictor of test error. One would presume the placement of the weights in a matrix matters for SVD, and yet here the layers' weights end up scattered all over the large matrix (because the large matrix is just the reshaping of a concatenated vector of weights), and it still works. In particular, it works with the LSTM weights! Any pointers?

function [estimatedError] = estimateTestLoss(network, trainingData, trainingTarget, percentageKept, normalizeVectors)
%ESTIMATETESTLOSS Uses a modified version of WW's SVDsmoothing to estimate test error during / after NN training

%% ROAM THROUGH LAYERS, RESHAPE ALL WEIGHT MATRICES FOUND ONTO VECTORS

layers = network.Layers;

vectorNumber = 1;

for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % If layer is conv2D
        if ndims(layers(layerNumber).Weights) > 2
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    weightVectors{vectorNumber} = reshape(layers(layerNumber).Weights(:,:,input,output),1, numel(layers(layerNumber).Weights(:,:,input,output)));
                    if normalizeVectors
                        mu{vectorNumber} = mean(weightVectors{vectorNumber});
                        sd{vectorNumber} = std(weightVectors{vectorNumber});
                        weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
                    end
                    vectorNumber = vectorNumber + 1;
                end
            end
            % If layer is dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            weightVectors{vectorNumber} = reshape(layers(layerNumber).Weights, 1, numel(layers(layerNumber).Weights));
            if normalizeVectors
                mu{vectorNumber} = mean(weightVectors{vectorNumber});
                sd{vectorNumber} = std(weightVectors{vectorNumber});
                weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
            end
            vectorNumber = vectorNumber + 1;
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        weightVectors{vectorNumber} = reshape([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights], 1, numel([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights]));
        if normalizeVectors
            mu{vectorNumber} = mean(weightVectors{vectorNumber});
            sd{vectorNumber} = std(weightVectors{vectorNumber});
            weightVectors{vectorNumber} = (weightVectors{vectorNumber} - mu{vectorNumber}) ./ sd{vectorNumber};
        end
        vectorNumber = vectorNumber + 1;
    end
end

%% RESHAPE VECTORS INTO A SINGLE MATRIX, SMOOTH WEIGHTS

weightVector = horzcat(weightVectors{:});
squareSize = ceil(sqrt(numel(weightVector)));
padding = squareSize^2 - numel(weightVector);
weightMatrix = reshape([weightVector zeros(1, padding)],squareSize,squareSize);

if normalizeVectors
    weightMatrix(isinf(weightMatrix) | isnan(weightMatrix)) = 0;
end

if ~isa(weightMatrix,'double')
    weightMatrix=double(weightMatrix);
end

eigenvalues = svd(weightMatrix).^2;

nComponents = round(percentageKept * length(eigenvalues) / 100);
if nComponents < 1
    nComponents = 1;
end

% do truncated SVD for smoothing weights in matrix
[~,~,V] = svds(weightMatrix,nComponents);
X = weightMatrix*V;
smoothedMatrix = (X*V');

% reshape smoothed matrix to vectors once again
smoothedVector = reshape(smoothedMatrix, 1, numel(weightMatrix));
smoothedVector = smoothedVector(1,1:end-padding); % remove padding

%% ROAM THROUGH LAYERS, RESHAPE ALL VECTORS BACK INTO WEIGHT MATRICES

vectorNumber = 1;
vectorIndex = 1;

for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % If layer is conv2D
        if ndims(layers(layerNumber).Weights) > 2
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    layers(layerNumber).Weights(:,:,input,output) = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size(layers(layerNumber).Weights(:,:,input,output)));
                    if normalizeVectors
                        layers(layerNumber).Weights(:,:,input,output) = (layers(layerNumber).Weights(:,:,input,output) .* sd{vectorNumber}) + mu{vectorNumber};
                    end
                    vectorIndex = vectorIndex + length(weightVectors{vectorNumber});
                    vectorNumber = vectorNumber + 1;
                end
            end
            % If layer is dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            layers(layerNumber).Weights = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size(layers(layerNumber).Weights));
            if normalizeVectors
                layers(layerNumber).Weights = (layers(layerNumber).Weights .* sd{vectorNumber}) + mu{vectorNumber};
            end
            vectorIndex = vectorIndex + length(weightVectors{vectorNumber});
            vectorNumber = vectorNumber + 1;
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        concatenatedMatrices = reshape(smoothedVector(1,vectorIndex:vectorIndex + length(weightVectors{vectorNumber})-1), size([layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights]));
        if normalizeVectors
            concatenatedMatrices = (concatenatedMatrices .* sd{vectorNumber}) + mu{vectorNumber};
        end
        layers(layerNumber).InputWeights = concatenatedMatrices(:,1:size(layers(layerNumber).InputWeights,2));
        layers(layerNumber).RecurrentWeights = concatenatedMatrices(:,size(layers(layerNumber).InputWeights,2)+1:end);
        vectorIndex = vectorIndex + length(weightVectors{vectorNumber});        
        vectorNumber = vectorNumber + 1;
    end
end

%% USE SMOOTHED MODEL TO CALCULATE ERROR

if isa(network,'DAGNetwork')
    dagNet = createLgraphUsingConnections(layers, network.Connections);
    smoothedNet = assembleNetwork(dagNet);
elseif isa(network,'dlnetwork')
    dagNet = createLgraphUsingConnections(layers, network.Connections);
    smoothedNet = dlnetwork(dagNet);
else
    smoothedNet = assembleNetwork(layers);
end

prediction = predict(smoothedNet,trainingData);

if iscell(prediction)
    for cellRow=1:size(prediction,1)
        RMSE(cellRow) = sqrt(mean((prediction{cellRow}'-trainingTarget{cellRow}').^2));
    end
    estimatedError = nanmean(RMSE);
elseif size(prediction,2)>1
    estimatedError = crossentropy(prediction,trainingTarget');
else
    estimatedError = sqrt(mean((prediction-trainingTarget).^2));
end

end

nan's when fitting some layers

Things we are seeing when trying to apply WeightWatcher to BERT:

'nan' in fit cumulative distribution values.
Likely underflow or overflow error: the optimal fit for this distribution gives values that are so extreme that we lack the numerical precision to calculate them.

From a product perspective, it might be useful to catch these exceptions and log them in the details dataframe so the user can see which layers specifically had this problem... I'll add this to the git issues list.

I suspect this is happening when the alphas are unusually large.
Alpha is fit using a brute-force method that searches for the optimal x_min.
Sometimes, when the alphas are very large (>6), the optimal x_min is so large that it screws up and shows this.
But I don't think this should happen for alpha < 5 or 6. (edited)
see: https://arxiv.org/abs/2106.00734
Figure 9(g), page 20
In some cases, the tail does not 'fill out' and the fit is very difficult... this is a sign of a bad layer.
There are a few corner cases like this; we only discuss a couple of them in the papers, as there was not enough room to cover them all.
[screenshot]

Error while plotting

After upgrading to 0.4.0 I see this error on the analyze method; this worked with version 0.2.6.

My code
mdl1 = binaryClassification(input_size, BATCH_SIZE)
#mdl.to(device)
mdl1.load_state_dict(torch.load('my_model.mdl'))

watcher = ww.WeightWatcher(model=mdl1)
results1 = watcher.analyze(plot=True)
plt.show()

Error
DEBUG:weightwatcher:fitting power law on 1 eigenvalues
Less than 2 unique data values left after xmin and xmax options! Cannot fit. Returning nans.
Traceback (most recent call last):
File "/Users/sanjaypillay/MS_DS/git/maps_capstone/ww_v1_model.py", line 192, in nn_kmer
results1 = watcher.analyze(plot=True)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 976, in analyze
self.apply_fit_powerlaw(ww_layer, params)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 875, in apply_fit_powerlaw
alpha, xmin, xmax, D, sigma, num_pl_spikes = self.fit_powerlaw(evals, xmin=xmin, xmax=xmax, plot=plot, title="", sample=sample, sample_size=sample_size)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/weightwatcher/weightwatcher.py", line 1305, in fit_powerlaw
fig2 = fit.plot_pdf(color='b', linewidth=2)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 525, in plot_pdf
return plot_pdf(data, ax=ax, linear_bins=linear_bins, **kwargs)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 2073, in plot_pdf
edges, hist = pdf(data, linear_bins=linear_bins, **kwargs)

File "/Users/sanjaypillay/anaconda3/envs/phy_r_37/lib/python3.7/site-packages/powerlaw.py", line 1950, in pdf
xmax = max(data)

ValueError: max() arg is an empty sequence

get_details() slice counts are not meaningful, confusing

Unfortunately, because we report results for all layers, sliced or not, as having slices, we can get meaningless results, such as for these DENSE layers (in VGG16):

layer_id  layer_type  N      M     slice  slice_count  level        comment      norm     lognorm
20        DENSE       25088  4096  0      NaN          LEVEL.SLICE  Slice level  23.429   1.36975
20        DENSE       25088  4096  NaN    1            LEVEL.LAYER  Layer level  23.429   1.36975
21        DENSE       4096   4096  0      NaN          LEVEL.SLICE  Slice level  18.0218  1.2558
21        DENSE       4096   4096  NaN    1            LEVEL.LAYER  Layer level  18.0218  1.2558
22        DENSE       4096   1000  0      NaN          LEVEL.SLICE  Slice level  16.7575  1.22421
22        DENSE       4096   1000  NaN    1            LEVEL.LAYER  Layer level  16.7575  1.22421

A better solution is needed. Options include:

  1. removing slices altogether, allowing Conv2D to be one giant matrix
    (problem: Attention models also have slices)

  2. setting all NaN Slice layers to 1 (or 0)

  3. hide slice columns ?

  4. aggregate slices into a more complex, hierarchical dataframe?

Tensorflow 2.0 keras support issue

Hello, when I try to use weightwatcher with my tf.keras layers, I get the "skipping layer" issue. Do you know how I could fix that? Thanks!

A generic measure for generalization

This is a great library.

Is it possible for you to provide a summary metric that can show that one network has a higher capacity than other networks on a given data?

Is it possible for you to provide a summary metric that can show that one network has a higher capacity than other networks based only on the weight distribution?

Can you try out generic information theoretic measures to provide a measure of the capacity of the neural network based on observing the weight distribution alone?

Unfortunately, I am in Python 2 land at the moment and plan to migrate to Python 3 in the near future. I cannot run the code in the default implementation without tweaks.

The get_summary method provides statistics like min, avg, max. This is different from a specific metric that can tell the capacity of a network: for example, being able to tell that GPT-3 is better than GPT-2 based on the weight distribution alone, and providing this information in a single value rather than in a chart whose interpretation is subjective and depends on the technical abilities of the analyst.

Kenneth Odoh

Add support for dense layers with 1 output

It has been suggested to treat vectors / dense layers with 1 output by

  • pad the layer so it has a square number of elements (1x1574) -> (1x1600)
  • reshape the layer to a square matrix (1x1600) -> (40x40) in some arbitrary way
  • compute the correlations and alpha, and/or apply SVD Smoothing
  • reshape back (40x40) -> (1x1600)
  • unpad

Note: It is necessary to keep track of how the indices were reshaped, in order to map back for the SVDSmoothing.

Below is sample Matlab code for this, which reshapes the 1x1574 vector into a 40x40 matrix:

function [correlation, RMSE] = predictedTestLoss(layers, trainingData, trainingTarget)
%PREDICTEDTESTLOSS Given training set and training target, predicts test
%loss based on SVDSmoothing technique by Charles (WW)
for layerNumber = 1:size(layers,1)
    if isprop(layers(layerNumber), 'Weights')
        % for now, skip 1x1 convs and/or fully connected layers
        if ndims(layers(layerNumber).Weights) > 2  && size(layers(layerNumber).Weights,1) > 1
            for input = 1:size(layers(layerNumber).Weights,3)
                for output = 1:size(layers(layerNumber).Weights,4)
                    X = layers(layerNumber).Weights(:,:,input,output);
                    nComponents = round(20 * length(svd(X)) / 100);
                    [~,~,v] = svds(X',nComponents, 'smallestnz');
%                     [~,~,v] = rsvd(X',nComponents);
%                     tolerance = rescale(100-20,sqrt(eps(class(X))),1,"InputMin",0,"InputMax",100);
%                     [~,~,v] = svdsketch(X',tolerance);
                    X = X'*v;
                    VT=v';
                    layers(layerNumber).Weights(:,:,input,output) = (X*VT)';
                end
            end
            % for dense layers
        elseif ndims(layers(layerNumber).Weights) == 2
            squareSize = ceil(sqrt(numel(layers(layerNumber).Weights)));
            padding = squareSize^2 - numel(layers(layerNumber).Weights);
            X = reshape([layers(layerNumber).Weights zeros(1, padding)],squareSize,squareSize);
            nComponents = round(20 * length(svd(X)) / 100);
            [~,~,v] = svds(X',nComponents, 'smallestnz');
%             [~,~,v] = rsvd(X',nComponents);
%             tolerance = rescale(100-20,sqrt(eps(class(X))),1,"InputMin",0,"InputMax",100);
%             [~,~,v] = svdsketch(X',tolerance);
            X = X'*v;
            VT=v';
            smoothedVector = reshape((X*VT)',1,squareSize^2);
            layers(layerNumber).Weights=smoothedVector(1:end-padding);
        end
    end
    % for lstm/bilstm layers
    if isprop(layers(layerNumber), 'RecurrentWeights')
        X = [layers(layerNumber).InputWeights layers(layerNumber).RecurrentWeights];
        nComponents = round(20 * length(svd(X)) / 100);
        [~,~,v] = rsvd(X',nComponents);
        X = X'*v;
        VT=v';
        smoothedWeights=(X*VT)';
        layers(layerNumber).InputWeights=smoothedWeights(:,1:size(layers(layerNumber).InputWeights,2));
        layers(layerNumber).RecurrentWeights=smoothedWeights(:,size(layers(layerNumber).InputWeights,2)+1:end);
    end
end
smoothedNet = assembleNetwork(layers);
prediction = predict(smoothedNet,trainingData);
correlation = corr(prediction, trainingTarget);
RMSE = sqrt(mean((prediction-trainingTarget).^2));
end

input error

compute_alphas = False, plot = True

should give an error

Fix title on LogESD

This is not a log-log plot... it is meant to show deviations from log-normality.
It is NOT the log of the ESD density; it is the density of the log of the eigenvalues, not plotted on a log-log scale.


[screenshot]

This is the actual log log plot of the PDF
[screenshot]

Add np.SVD option

We would like to support calculating all eigenvalues.
This is slower and should be a user input option.

IMHO, the right way to do this is to make our own SVD function and wrap this
We could also support SVD on GPU if available

MP Fits may fail because the ESD is not scaled properly

We have some scaling issues getting the MP fits to work, in particular for the randomized ESDs.

I am trying to find a good general-purpose solution for this.

This is a good example: if the ESD is rescaled and the spikes are removed, then the MP fit is excellent
(layer 28, Randomized VGG11 from the torchmodels package)

[screenshot]

Here is the rescaled version
[screenshot]

Using average alpha as part of loss function?

I'm wondering, in the context of regression, what would happen if one created a loss function that includes alpha, for example by multiplying the current MSE by alpha, so that as alpha decreases, so does the loss?

BERT / PyTorch Embedding Layers being skipped

from transformers import BertModel, BertConfig

CHECKPOINT = "bert-base-uncased"

# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()

# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)

# Accessing the model configuration
configuration = model.config

INFO:weightwatcher:params {'glorot_fix': False, 'normalize': False, 'conv2d_norm': True, 'randomize': False, 'savefig': False, 'rescale': True, 'deltaEs': False, 'intra': False, 'channels': None, 'conv2d_fft': False, 'min_evals': 0, 'max_evals': None, 'plot': False, 'mp_fit': False, 'ww2x': False, 'layers': []}
WARNING:weightwatcher:pytorch layer: Embedding(30522, 768, padding_idx=0)  type LAYER_TYPE.EMBEDDING not found 
WARNING:weightwatcher:pytorch layer: Embedding(512, 768)  type LAYER_TYPE.EMBEDDING not found 
WARNING:weightwatcher:pytorch layer: Embedding(2, 768)  type LAYER_TYPE.EMBEDDING not found 
