published-202301-boulin-clayton

Contributors: aleboul, jchiquet

Report of Reviewer 2

Associate Editor: Julien Chiquet
Reviewer 2: Mathurin Massias (chose to lift his anonymity)

Reviewer 2: Reviewing history

  • Paper submitted March 30, 2022
  • Reviewer invited August 29, 2022
  • Review 1 received October 8, 2022
  • Paper revised December 15, 2022
  • Reviewer invited December 16, 2022
  • Review 2 received December 16, 2022
  • Paper conditionally accepted January 02, 2023
  • Paper published January 11, 2023

First Round

Recommendation

Revise and resubmit

Summary

The paper proposes coppy, a package for working with copulas in Python. Copulas are a powerful statistical tool for modeling dependency between random variables, and they have been used extensively in both theoretical and practical work. Current implementations are in R, so having the same functionality in Python is a great asset for the community.

The code is well written and comes with a variety of known copulae (Archimedean, elliptical, extreme value copulae) to sample from. The paper demonstrates use cases of sampling from copulae in a clear fashion.
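As an illustration of the general idea (this is not the package's own API, just a sketch with numpy and scipy), sampling from a Gaussian copula amounts to drawing correlated normals and pushing each margin through the normal CDF:

```python
import numpy as np
from scipy.stats import norm

# Draw from a bivariate normal with correlation rho, then map each
# coordinate through its CDF: the result has uniform margins and
# Gaussian dependence, i.e. a Gaussian copula sample.
rng = np.random.default_rng(0)
rho = 0.7
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=10_000)
u = norm.cdf(z)  # shape (10_000, 2), values in [0, 1]
```

The columns of `u` are each uniform on [0, 1], while their dependence is inherited from `rho`.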

Code remarks

My first set of remarks concerns the package itself. Implementing the following changes would really improve the practical impact of the released code, which is of wide interest to the statistical community. There are a few low-hanging fruits that require only minor effort to implement:

  • Transform the code into a proper Python package: add a setup.py file containing useful information and allowing the code to be installed with pip install . run at the root of the repo. (Currently, how is the code used? With a sys.path.append?) Using setuptools.setup() in setup.py will also make dependencies explicit (through the install_requires parameter of the setup function) and have them installed automatically
  • Release the package on PyPI

Author's answer: This is now done! The package clayton is now released on PyPI and can easily be installed on any device with pip install clayton. Here is the link to the project.
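A minimal setup.py along the lines suggested above could look as follows; the metadata values (version, dependency list) are illustrative placeholders, not the package's actual configuration:

```python
# setup.py -- minimal packaging sketch; metadata values are illustrative
from setuptools import setup, find_packages

setup(
    name="clayton",
    version="0.0.1",          # placeholder version
    packages=find_packages(),
    install_requires=[        # dependencies made explicit; pip installs them
        "numpy",
        "scipy",
    ],
)
```

With this file in place, `pip install .` at the root of the repo installs the package and its dependencies.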

  • Add unit tests, and report test coverage. Run tests automatically at each PR using Github Actions.

Author's answer: Thank you for suggesting the inclusion of unit tests. I have now created unit tests for the clayton package; they are located in the clayton folder, and more details are available at this link. I have designed classical tests such as instantiating the object, trying some invalid values to raise error messages, and sampling from the copula. These tests have been performed for all the copulas included in the package. Additionally, I have set up Github Actions to run the tests automatically for each pull request. This ensures that the package maintains a high level of quality and reliability.
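A sketch of the kind of unit test described (invalid parameter values, basic copula properties), runnable with pytest. The `clayton_cdf` helper below is a hypothetical stand-in written for this example, not the clayton package's actual API:

```python
# test_clayton_cdf.py -- illustrative unit-test sketch (pytest-style).
# clayton_cdf is a stand-in function, not the clayton package's API.
import numpy as np

def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula CDF, valid for theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def test_invalid_theta_raises():
    # invalid parameter values should raise a clear error
    try:
        clayton_cdf(0.5, 0.5, theta=-1.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for theta <= 0")

def test_uniform_margins():
    # C(u, 1) = u must hold for any copula
    assert np.isclose(clayton_cdf(0.3, 1.0, theta=2.0), 0.3)
```

Such `test_*` functions are collected automatically by pytest, and a GitHub Actions workflow can run them on every pull request.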

  • flake8 . | wc -l currently returns 945 PEP8 violations. Most of them can be fixed automatically using autoformat on save in vscode for example.

Author's answer: Thank you for pointing out the high number of PEP8 violations in the code. I apologize for not being aware of these basic rules. Most of the violations have been corrected using the autoformat feature in vscode and pylint. However, some violations still remain because I need to include $\LaTeX$ in the documentation for proper rendering with Sphinx. I will continue to work on reducing the number of violations and improving the overall quality of the code. Thank you for bringing this to my attention.

  • Add pep8 compliance testing with Github Actions.

Author's answer: Thank you for suggesting the inclusion of PEP8 compliance testing with Github Actions. I have now set up the necessary workflows to run these tests automatically. More details are available at this link. This will help to ensure that the code follows the PEP8 style guide and maintains a high level of quality. Thank you for bringing this to my attention.

  • Add and publish documentation. The code, in particular classes, is well documented, it would be nice to have this displayed online, eg with sphinx.

Author's answer: Thank you for suggesting the use of Sphinx to publish the documentation online. I have now followed your advice and published the documentation on my Github page. I appreciate your suggestion, and I am glad that it was not time-consuming to set up. Thank you for helping me improve the documentation for my package.

  • Consider making the package name all lowercase (as in pytorch, numpy, scikit-learn, matplotlib, …)
  • Consider using CamelCase for class names (MonteCarlo instead of Monte_Carlo)
  • YMMV, but coppy is extremely close to copy, a fundamental method in Python. Googling Python coppy is obfuscated by Python copy

Author's answer: Thank you for pointing out the potential confusion between the coppy package and the built-in Python method called "copy". I agree that the similarity in the names could be problematic, and I have now renamed the package to "clayton" to avoid this issue. The new name also hints at the famous Clayton copula, which is included in the package. Thank you for bringing this to my attention and helping me improve the package.

Remarks about the submission itself

There are a few typos in the paper, which could use a careful proofreading. I have highlighted a few here, but not all:

[...]

Code execution remarks

  • sns.displot(data = df_wmado, x = "scaled", color = seagreen(0.5)) > I am missing the seagreen import

Author's answer: We have to import it from seaborn. It was burdensome enough that I decided to replace it throughout the code with the standard colors #C5C5C5 and #6F6F6F.

  • The cell df_wmado = monte.finite_sample(inv_cdf = [norm.ppf, expon.ppf], corr = True) takes a lot of time to run on my machine (at least >5 minutes).

Author's answer: The sample size and the number of iterations were both reduced. It now takes about 20 seconds to execute.

  • Could you diminish the size of the sample, or warn in the cell that the code requires a long time to be executed?

Second Round

Recommendation

Accept with minor revisions

Comments to author

The revision has addressed all my concerns. Regarding the introduced changes, I have the following minor remarks that are all easy to take into account:

  • In Fig 4, can you consider adding color to points? The current black and white rendering does not convey the information in the easiest way possible.
  • The sentence " Consider the following equation:" is followed by the requirements of the package.
  • The noisy aspect of the curves in Fig 4, Archimedean panel, suggests that more vectors should be sampled to cancel out this variability (it can be done easily since the current running time is around 0.2 s per value of $d$; you can also take values of $d$ further apart).
  • The clayton package, whose Python code can be found in this GitHub repository > consider writing the URL explicitly
  • The package quality has been greatly improved. Nevertheless, some files should be removed from the repo and added to the .gitignore, i.e., the __pycache__ and .ipynb_checkpoints folders and the egg-info folder.
    In the paper, replace:
    np.sqrt( 1 / (2*np.pi * sigma**2 ) ) * np.exp(- ( x - x0) ** 2 / (2 * sigma **2) )
    by
    np.sqrt(1. / (2*np.pi * sigma**2)) * np.exp(-(x - x0) ** 2 / (2 * sigma**2))
  • In the package, the author uses n_sample; the extremely popular scikit-learn uses n_samples, so following their convention should be considered.

All remarks taken into account in final submission

Proofreading

@Aleboul Once you are happy with the final proofreading of your paper, please let me know by closing the issue, so that we can officially publish your work.

All the best!

Preparation of final publication

Phase 1: acceptance

When the Associate Editor (AE) is satisfied with the author's answers to the reviewers' comments, he/she exchanges with the authors (via the discussion tool on Scholastica) so that

  • authors are informed of the final acceptance
  • affiliation, author-url, affiliation-url and other metadata are correctly filled in on the author git repository
  • Optional! - at the authors' discretion - the manuscript is formatted using the latest Computo extension
  • CI/github-action validates the reproducibility of the manuscript

The authors inform the AE upon completion of the tasks required on their side.

Phase 2: production start-up

  • the AE creates a private repository in computorg's group named published-yearmonth-first_author_last_name_manuscript_key, with a CC BY 4.0 license
  • the AE asks the Editor for a DOI for this publication
  • the Editor adds an entry to in_production.bib so that the manuscript appears "in the pipeline" at https://computo.sfds.asso.fr/publications/
  • the AE makes a copy of the authors' repository (without the history: clone --depth 1) into the above repository
  • the AE, assisted by the technical team, ensures that CI works
  • the AE, assisted by the technical team, ensures that the latest version of the quarto template is used, setting metadata appropriately (draft: false, repos, doi, etc.)
  • the AE formats the exchanges between authors and reviewers as an Issue in the repository, following this model
  • the AE requests validation from the reviewers for the publication of the reports and asks whether reviewers want to remain anonymous or not
  • the AE completes the README.md according to this template, with the appropriate badges
  • the AE updates/double-checks all metadata related to the citation entry (keywords, doi, issn, etc.) in the article preamble

Phase 3: final publication

Once the above tasks are completed, then do

  • the AE turns the private repository into a public repository (this can be done as soon as a DOI is available)
  • the AE invites the corresponding author to be an outside collaborator on the repository
  • the AE opens an Issue asking for a final proofreading by the author (possibly with some minor additional questions), made as a pull request on Computorg's repository

Once the manuscript is proofread (Issue closed)

  • the AE informs the Editor, who tags the first published version
  • the Editor moves the BibTeX entry from the "in production" file to the "published" file on the Computo website, completing it if needed
  • the Editor archives the repository on Software Heritage
  • the Editor makes the usual communication about the new publication

Report of Reviewer 1

Associate Editor: Julien Chiquet
Reviewer 1: Nicolas Bousquet (chose to lift his anonymity)

Reviewer 1: Reviewing history

  • Paper submitted March 30, 2022
  • Reviewer invited April 25, 2022
  • Review 1 received June 29, 2022
  • Paper revised December 15, 2022
  • Reviewer invited December 16, 2022
  • Review 2 received January 01, 2023
  • Paper conditionally accepted January 02, 2023
  • Paper published January 11, 2023

First Round (received June 29, 2022)

Recommendation

Revise and resubmit

General comment

This paper offers a contribution belonging to the class of software papers that present implementations of stats/ML algorithms encapsulated in a new module. It aims at offering new Python implementations of copula families, extending the range of families provided in existing Python packages and filling a gap with respect to R, which today has many powerful packages on this theme (including, in particular, vine copulas).

In more detail, the COPPY module proposed by the author offers access to the extreme value family of copulas, both for inference and simulation, which is not included in the historical Python tools "Copulas" and "Copulae". A key point is that COPPY provides sampling techniques, of great interest in problems related to predictive analysis, bootstrapping, machine learning, etc.

The explanations of how the module is built are interesting and useful. The paper is easy to read, well illustrated, and the entanglement with the code allows for easy reproduction. It would be appreciated if the use of the selected copula families in several applications were referenced, along with some isodensity plots illustrating the main properties (e.g., isodensity plots for asymmetric copulas), in addition to the illustrations provided in the final sections of the paper. I think the paper globally fits the requirements of Computo and could be accepted after a revision taking these general and specific comments into account.

A main concern, in my opinion, is that a Python platform used by many practitioners (e.g., engineers) is OpenTURNS, which already incorporates a wide variety of multivariate parametric models, especially in the form of copulas. See:

links

Author's answer: Thank you for sharing information about the OpenTURNS package in Python. I wasn't aware of it before. In the main paper, I have added the following sentence: Other packages provide sampling methods for copulae, but they are typically restricted to the bivariate case and the conditional simulation method (see, for example, [Baudin et al., 2017]). Additionally, if the multivariate case is considered, only Archimedean and elliptical copulae are covered, and those packages (see [Nicolas, 2022]) do not include the extreme value class in arbitrary dimensions $d \geq 2$.

It seems clear to me that the paper, in addition to taking into account the specific comments listed below, should take a look at this platform and position its content with respect to it. But I am confident that the author will provide some details about the differences and complementarity between COPPY and OpenTURNS.

Specific comments

Introduction

“it is characterizing only for a few models, the multivariate normal distribution”. This sentence is strangely formulated. The Student copula is defined by a correlation matrix too, and Lévy-based copulas probably offer similar properties. I suggest saying that the use of usual (linear) correlation coefficients is most often misleading: it is known that rank correlations are real dependence indicators, but in practice (please search for appropriate references) the use of the multivariate normal is thought to be “easy” because of its canonical covariance/correlation parametrization. Many papers deal with this problem of reducing dependence structures to covariance matrices.

Author's answer: I understand the referee's concerns about the sentence "it is characterizing only for a few models, the multivariate normal distribution." Here is a new sentence that conveys the same idea: It is well known that only linear dependence can be captured by the covariance, and it is characteristic only of a few models, e.g., the multivariate normal distribution or binary random variables. To elaborate on this point, consider two random variables $X$ and $Y$ defined on the same probability space. Then it is not necessarily true that if $\mathrm{Cov}(X, Y) = 0$, the random variables $X$ and $Y$ are independent. This also holds for Student copulae, which are parametrized by the correlation matrix. It is true that rank correlations are better indicators of dependence than linear correlations, but they are not able to detect more complicated nonlinear, non-monotone dependencies (see, e.g., [Drton et al., 2020]). These concerns are already addressed by the sentence quoted earlier about the limitations of the covariance matrix.
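The zero-covariance-without-independence point is easy to check numerically; a small sketch with numpy, using $Y = X^2$ for a standard normal $X$ (so $\mathrm{Cov}(X, Y) = E[X^3] = 0$ while $Y$ is fully determined by $X$):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)
y = x ** 2  # fully determined by x, yet uncorrelated with it

# Empirical covariance is close to E[X^3] = 0 for a standard normal,
# even though x and y are as dependent as two variables can be.
cov_xy = np.cov(x, y)[0, 1]
```

Here `cov_xy` is near zero, while `y` is a deterministic (non-monotone) function of `x`, so no rank correlation would flag the dependence either.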

“of prime interest for…” I suggest placing some appropriate references here, relative to the fields listed by the author.

Author's answer: Thank you for pointing out the need for references in this sentence. I have now added the following sentence: The theory of copulae has been of prime interest for many applied fields of science, such as quantitative finance ([Patton, 2012]) or environmental sciences ([Mishra and Singh, 2011]).

Figure 1: The symbols used in this figure are explained later in the text. I suggest, however, providing a short explanation in the legend to ease the reading.

Section 2.1

It is peculiarly… than d". A reference would be suitable here.

Author's answer: Thank you for pointing out the need for an additional reference. The corresponding sentence is thus modified, leading to: Note that d-monotonic Archimedean inverse generators do not necessarily generate Archimedean copulae in dimensions higher than $d$ (see [McNeil and Neslehova, 2009]).

Section 2.2

asymmetric dependence: could you provide a graphical illustration to help the reader understand? A more general question is: can COPPY help visualize with isodensities, a usual technique in R?

Author's answer: The package COPPY (now known as clayton) does not provide tools for visualizing isodensity. As for asymmetric dependence, here is the definition for the bivariate case, added in the main text: Asymmetric dependence refers to the property that, for a bivariate copula $C$, there exists $(u_0, u_1) \in [0,1]^2$ such that

$$ C(u_0, u_1) \neq C(u_1, u_0) $$
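This definition can be checked numerically on a standard asymmetric example; the sketch below uses the Marshall-Olkin copula (chosen here for illustration, it is not discussed in the review), which is asymmetric whenever its two parameters differ:

```python
def marshall_olkin(u0, u1, alpha=0.2, beta=0.7):
    """Bivariate Marshall-Olkin copula C(u0, u1); asymmetric when alpha != beta."""
    return min(u1 * u0 ** (1 - alpha), u0 * u1 ** (1 - beta))

# Swapping the arguments changes the value: C(u0, u1) != C(u1, u0).
c01 = marshall_olkin(0.3, 0.8)
c10 = marshall_olkin(0.8, 0.3)
```

With the parameter values above, `c01` and `c10` differ, exhibiting a point $(u_0, u_1)$ at which the asymmetry condition holds.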

Section 3.1

The Pickands dependence function is not defined before this point. It would be appropriate to provide the reader with short details about it and its importance in multivariate analysis.

Author's answer: Thank you for pointing out the need for more information about the Pickands dependence function. I have now added the following text in Section 3.1: The Pickands dependence function characterizes the extremal dependence structure of an extreme value random vector and satisfies $\max\{w_0,\dots,w_{d-1}\} \leq A(w_0,\dots,w_{d-1}) \leq 1$, where the lower bound corresponds to comonotonicity and the upper bound corresponds to independence. Estimating this function is an active area of research, with many compelling studies having been conducted on the topic (see, for example, [Bücher et al., 2011], [Gudendorf and Segers, 2012]).
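These bounds are easy to verify numerically in the bivariate case (where $w_1 = 1 - w_0$); a sketch using the well-known logistic (Gumbel) model's Pickands function, written here for illustration rather than taken from the package:

```python
import numpy as np

def pickands_logistic(t, theta=2.0):
    """Pickands function A(t) of the bivariate logistic (Gumbel) model, theta >= 1."""
    return (t ** theta + (1 - t) ** theta) ** (1.0 / theta)

# Check max(t, 1-t) <= A(t) <= 1 over a grid: the lower bound is the
# comonotonic case and the constant 1 is the independence case.
t = np.linspace(0.0, 1.0, 501)
a = pickands_logistic(t)
lower = np.maximum(t, 1 - t)
```

For `theta=2`, the curve dips to $A(1/2) = 2^{-1/2} \approx 0.707$, strictly between the two bounds, reflecting intermediate extremal dependence.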

Section 3.2

“Sample from the Gaussian …”: should it be corrected to “samples”?

Section 4

Even if I recognize this could be a very long task, I would have appreciated a comparison table with other existing packages, comparing CPU time, the effect of dimension, etc., maybe in a dedicated section before the discussion. I leave this decision to the AE, but this would give a noticeable gain in information for readers who want to incorporate copula handling in machine-learning-type routines.

Author's answer: Thank you for suggesting a comparison of different packages for copulas. I have added a new subsection in the discussion section (Section 5) where we compare our package clayton with two other packages in R: the copula and mev packages. We provide a comparison table that includes metrics such as CPU time and the effect of dimension. This should provide valuable information for readers who want to incorporate copulas into machine-learning routines.

Section 5

I completely agree with the line of improvement about the vines. Automatic copula selection using statistical criteria should be a very useful new contribution for the community.

Author's additional References

  • [Baudin et al., 2017] Baudin, M., Dutfoy, A., Iooss, B., and Popelin, A.-L. (2017). Openturns: An industrial software for uncertainty quantification in simulation. In Handbook of uncertainty quantification, pages 2001–2038. Springer.
  • [Bücher et al., 2011] Bücher, A., Dette, H., and Volgushev, S. (2011). New estimators of the Pickands dependence function and a test for extreme-value dependence. The Annals of Statistics, 39(4):1963–2006.
  • [Drton et al., 2020] Drton, M., Han, F., and Shi, H. (2020). High-dimensional consistent independence testing with maxima of rank correlations. The Annals of Statistics, 48(6):3206–3227.
  • [Gudendorf and Segers, 2012] Gudendorf, G. and Segers, J. (2012). Nonparametric estimation of multivariate extreme-value copulas. Journal of Statistical Planning and Inference, 142(12):3073–3085.
  • [McNeil and Neslehova, 2009] McNeil, A. J. and Neslehova, J. (2009). Multivariate Archimedean copulas, d-monotone functions and l1-norm symmetric distributions. The Annals of Statistics, 37(5B):3059–3097.
  • [Mishra and Singh, 2011] Mishra, A. K. and Singh, V. P. (2011). Drought modeling – a review. Journal of Hydrology, 403(1):157–175.
  • [Nicolas, 2022] Nicolas, M. L. (2022). pycop: a python package for dependence modeling with copulas. Zenodo Software Package, 70:7030034.
  • [Patton, 2012] Patton, A. J. (2012). A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18.

Second Round (received January 1, 2023)

The revision is good, and the author has satisfactorily answered my remarks. I appreciate the care given to finalizing the package (a wish expressed by the second reviewer). A small defect is the impossibility of plotting some visual diagnostics, but this could probably be managed in another package. Therefore I think the manuscript is suitable for publication.

Recommendation

Accept
