Comments (17)

jameschapman19 commented on May 30, 2024

Yes - my PMD is sparse CCA by penalized matrix decomposition from Witten 2009 (and the more-than-two-view extension described in the follow-up paper).

That being said, your comment is interesting in that I obviously haven't made that clear. I guess because I've called it PMD it seems to refer to the penalized PCA rather than the CCA?

I have seen the method referred to interchangeably as Sparse CCA (because it was the first of the sparse CCA methods), Penalized Matrix Decomposition, and Sparse Partial Least Squares (the latter actually being my preferred name, for mathematical reasons!).

jameschapman19 commented on May 30, 2024

Perhaps the actionable change I can make is to state explicitly in the docs that this refers to PMD for CCA?

JohannesWiesner commented on May 30, 2024

Hi James, yeah, nomenclature issues seem to plague sparse CCA in general... And I must admit that I am not quite sure I get your answer right, so, just to get on the same page here:

PMA::CCA (page 4 of the PMA docs) is not implemented in your package, only PMA::PMD (page 18 of the PMA docs)?

I can definitely tell that the Python version of PMA::CCA from @teekuningas is tested against the official R package. Over the last week, the repository owner and I made sure that PMA::CCA gives the same results as sparsecca._cca_pmd.cca. Testing is done by running PMA::CCA with the example data from Witten et al. against sparsecca._cca_pmd.cca and then comparing the canonical weights. The testing script can be found here.

Is there a class in your repository that would give the same outputs (possibly under a different name, but with the same underlying algorithm)?

jameschapman19 commented on May 30, 2024

Other way round :)

So my PMD class is [Sparse CCA by] Penalized Matrix Decomposition - and this thread is convincing me I should go with SCCA_PMD as the class name.

Looking through those docs, my PMD should be able to perform both CCA and MultiCCA from PMA in R.

I would expect results to match in all cases for the first component; for further components I offer the option of what I would call CCA (or projection) deflation as well as classical PLS deflation. PMA performs classical PLS deflation.
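
To make the distinction concrete, the two deflation schemes look roughly like this (just an illustrative sketch, not the actual cca-zoo code; w is the current weight vector for a view X, assumed unit-norm):

    import numpy as np

    def pls_deflation(X, w):
        # classical PLS deflation: subtract the rank-one approximation of X
        # built from the score t = X @ w and its loading p
        t = X @ w
        p = X.T @ t / (t @ t)
        return X - np.outer(t, p)

    def cca_deflation(X, w):
        # projection/CCA deflation: project the columns of X off the
        # weight direction w in feature space
        w = w / np.linalg.norm(w)
        return X - np.outer(X @ w, w)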

I could probably write a quick script based off of that one to check that.

JohannesWiesner commented on May 30, 2024

Ah okay so currently:

cca_zoo.models.PMD == PMA::CCA == sparsecca._cca_pmd.cca?

But are you really sure about that? Because I couldn't reproduce the results from PMA::CCA using cca_zoo.models.PMD. Besides, cca_zoo.models.PMD only allows an L1 regularisation parameter between 1 and sqrt(number of features) for each view, whereas the PMA::CCA regularization parameters "must be between 0 and 1 (larger L1 bound corresponds to less penalization)".

jameschapman19 commented on May 30, 2024

Yep - I use the scale from the original paper, 1 to sqrt(number of features); in PMA it looks like they convert the argument after it is passed.
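
So a PMA-style penalty of 0.3 corresponds to the following on my scale (a sketch of the conversion only; x is the data matrix for one view):

    import numpy as np

    penaltyx = 0.3                          # PMA-style penalty in (0, 1]
    c_x = penaltyx * np.sqrt(x.shape[1])    # my scale: between 1 and sqrt(#features)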

That being said, there seem to be some numerical differences in the 2nd and 3rd modes on your example that I'll try to take a look at (the objective values are pretty close, but they converge to different places).

jameschapman19 commented on May 30, 2024

So the following passes a similar test:

import rpy2.robjects.packages as rpackages
import rpy2.robjects as robjects

import numpy as np
from cca_zoo.models import PMD

def test_compare_pmd_to_r():
    """Compare the output of the original R implementation to this Python
    implementation using example data.
    Thanks to JohannesWiesner for the R and Python code.
    """

    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)

    if not rpackages.isinstalled('PMA'):
        utils.install_packages('PMA', verbose=True)

    robjects.r('''
            ## Run example from PMA package ################################################

            library(PMA)

            # first, do CCA with type="standard"
            # A simple simulated example
            set.seed(3189)
            u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
            v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
            v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
            x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
            z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)

            # Can run CCA with default settings, and can get e.g. 3 components
            out <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=0.3,penaltyz=0.3)
            out_u = out$u
            out_v = out$v

        ''')

    # Get the r data as numpy arrays
    out_u = np.array(robjects.globalenv['out_u'])
    out_v = np.array(robjects.globalenv['out_v'])
    x = np.array(robjects.globalenv['x'])
    z = np.array(robjects.globalenv['z'])

    # Compute CCA with the same data as in the R example above
    pmd = PMD(c=[0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])], latent_dims=1,
              max_iter=15, deflation='pls', tol=1e-15)
    pmd.fit((x, z))

    # weights should match up to an arbitrary sign flip
    assert np.allclose(np.abs(pmd.weights[0]), np.abs(out_u))
    assert np.allclose(np.abs(pmd.weights[1]), np.abs(out_v))

Note that I have changed the number of components to 1.

Differences in further components come down to the following differences in the initialization of the inner loop:

  • I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
  • PMA/the code you linked to initialize with the outer SVD whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

I've checked the latter point by passing the exact same initializations, and then they converge to the same place, so overall I'm satisfied that I'm implementing the algorithm in a valid way.
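
For reference, both initializations boil down to the same leading-singular-vector computation; the difference is which data it is applied to (a sketch, not the exact code from either package):

    import numpy as np

    def svd_init(X, Z):
        # leading pair of singular vectors of the cross-covariance X^T Z
        u, s, vt = np.linalg.svd(X.T @ Z, full_matrices=False)
        return u[:, 0], vt[0, :]

    # PMA-style: svd_init(x, z) computed once on the raw data and reused
    # for every component; my version: recomputed on the deflated data
    # before each inner loop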

With your permission I'll add your test for the first component as it's a very useful ground truth.

JohannesWiesner commented on May 30, 2024

Hi James, thanks for setting up the test. I asked @teekuningas for permission, since he finalized the test, and will let you know once he responds :) Unfortunately I couldn't run it on my (Windows) machine (even with your latest version from today):

TypeError: __init__() got an unexpected keyword argument 'deflation'

> Differences in further components come down to the following differences in the initialization of the inner loop:
>
> I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
>
> PMA/the code you linked to initialize with the outer SVD whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

Yeah, that's the problem, I am really lacking the math here, so I cannot really contribute anything meaningful to this discussion...

Running PMD() without the deflation keyword works (using my current project data). And you're right, the weights are pretty similar on the first variate. However, setting K=5 leads to quite dissimilar u weights when plotting the absolute deviations:

import matplotlib.pyplot as plt

# plot distribution of deviations (ignore zeros for plotting purposes)
u, v = pmd.weights
deviations = np.abs(u) - np.abs(out_u)
deviations = deviations.flatten()
deviations = deviations[deviations != 0]
plt.hist(deviations)

[Histogram of the absolute deviations between the u weights]

Since I also need to reproduce results from a former package (but would still like to use Python instead of R), I wonder if one could tweak the class in such a way that you would get more similar results on the other variates as well?

JohannesWiesner commented on May 30, 2024

> Yep - I use the scale from the original paper, 1 to sqrt(number of features); in PMA it looks like they convert the argument after it is passed.

Ah, good to know! Just as a side note (really only a UX thing, so not that important): wouldn't it then be a good idea to adopt the PMA::CCA style and do [0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])] behind the scenes? Then users would pass a value between 0 and 1, which would also be closer to what one is used to from sklearn estimators.
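
Something like this under the hood (just a sketch of the idea; rescale_penalties is a hypothetical helper, not part of cca-zoo):

    import numpy as np

    def rescale_penalties(penalties, views):
        # penalties: PMA-style values in (0, 1]; views: list of data matrices
        return [p * np.sqrt(view.shape[1]) for p, view in zip(penalties, views)]

    c = rescale_penalties([0.3, 0.3], [x, z])  # [0.3*sqrt(500), 0.3*sqrt(1000)] for the example data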

JohannesWiesner commented on May 30, 2024

> I asked @teekuningas for permission, since he finalized the test, and will let you know once he responds :)

@teekuningas agreed, so feel free to implement our test in your package :)

jameschapman19 commented on May 30, 2024

Thanks @JohannesWiesner

The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

I'll most likely take a look at the initialization problem some time today; I have an idea how I'd achieve it. That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Your UI point is interesting. Do you feel it is more intuitive to have parameters entered between 0 and 1 (with an error if the rescaled value falls outside 1 to sqrt(#features)), or parameters entered between 1 and sqrt(#features) directly?

JohannesWiesner commented on May 30, 2024

> The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

Yup, works now! However, with my current project data and your settings I get the following UserWarnings:

C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 0 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 1 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 2 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 3 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 4 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")

> That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Ha, interesting, glad I could generate interesting research questions!

> Your UI point is interesting. Do you feel it is more intuitive to have parameters entered between 0 and 1 (with an error if the rescaled value falls outside 1 to sqrt(#features)), or parameters entered between 1 and sqrt(#features) directly?

Yup, exactly! It would not require calculating the number of features beforehand, but would just use error handling in case users don't provide a number between 0 and 1.
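
I.e. something along these lines (a hypothetical check, just to illustrate):

    def check_penalty(penalty):
        # accept only PMA-style penalties between 0 and 1
        if not 0 < penalty <= 1:
            raise ValueError(f"penalty must be between 0 and 1, got {penalty}")
        return penalty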

JohannesWiesner commented on May 30, 2024

These are the deviations from PMA::CCA that sparsecca._cca_pmd.cca produces (using my current project data):

[Histogram of the deviations between the canonical weights]

So they are really super small, as you can see.

jameschapman19 commented on May 30, 2024

I've done some tinkering on the dev branch. The hidden dev release pip install cca-zoo==1.10.2.dev1 contains my adjusted version. Now the initialization at the start of each loop is the same. There is still a bit of a difference, but the weights in all 3 modes now have a correlation with the R package's above 0.9, so a script of the form:

    # Compute CCA with the same data as in the R example
    pmd = PMD(
        c=[0.3, 0.3],
        latent_dims=3,
        max_iter=15,
        deflation="pls",
        tol=1e-6,
    )
    pmd.fit((x, z))

    # check that, for each of the 3 components, the weights correlate with
    # the R weights above 0.9 (the diagonal offset of 3 picks out the
    # correlations between matching components)
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[0], out_u, rowvar=False)), 3))
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[1], out_v, rowvar=False)), 3))

will pass. This checks that, for each dimension, the weights have a correlation with the original weights greater than 0.9.

I'll probably keep the changes I've made in a future version, as they've actually sped things up a bit, but I've left them in dev for the moment until I get a chance to review.

jameschapman19 commented on May 30, 2024

This also contains the UI change - I'm torn on that change though!

jameschapman19 commented on May 30, 2024

Quick note to say I have now put all of these changes into the main version, i.e. from 1.10.3 onwards.

jameschapman19 commented on May 30, 2024

Closing for now
