Possible bugs in `CollinearityThreshold` about arfs HOT 9 CLOSED

harper357 commented on May 24, 2024

Possible bugs in `CollinearityThreshold`

from arfs.

Comments (9)

ThomasBury commented on May 24, 2024

Hi @harper357, thank you for the kind words and the careful review (a proper CI/CD test suite is still missing). I'll go through your comments and PR as soon as my schedule allows.


ThomasBury commented on May 24, 2024

For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent.

Agreed, using pandas is indeed more Pythonic than list comprehension in this case.
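For readers following along, here is a minimal sketch of the behaviour discussed above: adding two Series matches values by index label rather than by position, so a single sort at the end is enough to fix the final order. The Series names and values are made up for illustration and are not taken from the arfs code.

```python
import pandas as pd

# Two hypothetical per-feature scores whose labels appear in different orders.
first = pd.Series({"b": 0.2, "a": 0.9, "c": 0.5})
second = pd.Series({"c": 0.1, "b": 0.3, "a": 0.0})

# Addition matches values by index label, not by position.
total = first + second

# Sorting once, at the end, makes the final order explicit
# instead of relying on the order produced by the addition.
ranked = total.sort_values(ascending=False)
print(ranked)
```

Because the result order after the addition is not something the example relies on, only the label-to-value mapping and the final sorted order matter here.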

In the tutorial for version 2.2.3, everything seems fine. To help diagnose the issue with features not exceeding the threshold, a reproducible example would be very helpful (please update to version 2.2.3 first). Thanks.


harper357 commented on May 24, 2024

Looking back at my example, that doesn't show what I was trying to show. I'll try posting a better example and find an example data set for the threshold part later today.

…

On Fri, Feb 9, 2024, 1:44 AM Thomas Bury wrote:

> For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent.
>
> Agreed, using pandas is indeed more Pythonic than list comprehension in this case.
>
> I'll review the user messages to see if they might be the source of the inconsistencies.
>
> To help diagnose the issue with features not exceeding the threshold, providing a reproducible example would be very helpful.


harper357 commented on May 24, 2024

I see you merged my changes so the bug should be fixed. But just to be complete, I uploaded a notebook to my fork that illustrates everything. Notebook

It should be pretty easy to understand, I used the cancer data set to show that there was a problem and then some minimal examples of what was happening.


ThomasBury commented on May 24, 2024

Thanks for the detailed example! I appreciate it.

I missed that Pandas preserves the order of the first series during addition. Your PR solved it due to the subsequent sorting.
Based on your example, I've updated the tutorial.

Regarding overall performance, using n_jobs > 1 launches multiple processes (essentially, multiple Python workers). This initialization incurs a time overhead. However, subsequent runs (e.g., the second time you run the function) may be faster than n_jobs=1. Nevertheless, the current parallelization approach is not very sophisticated, and I recommend keeping n_jobs=1 for now. The upcoming Pandas 3 will natively improve parallelization.

Let me know if the fix works and you can close this issue, thank you for contributing.


harper357 commented on May 24, 2024

Sorry, I have been pretty busy with work so it took me a while to get back to this.

I haven't had the time to formally test it, but it seems like the scipy.stats.spearmanr calculation is much faster than arfs' implementation (it calculated rho for 14k features in < 1 min on a laptop). I am not sure if the other measures are found in other common packages, though.
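As a rough illustration of the SciPy route mentioned above (not the arfs implementation), `scipy.stats.spearmanr` accepts a 2-D array and returns the full column-by-column correlation matrix in one call. The data below is synthetic and the column names are placeholders; the result is cross-checked against pandas' own Spearman correlation.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 200 rows, 5 features (names are placeholders).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
X["e"] = X["a"] + 0.1 * X["e"]  # make "a" and "e" strongly associated

# With a 2-D input of more than two columns, spearmanr returns a matrix
# of pairwise rank correlations between the columns (axis=0 by default).
rho, _ = stats.spearmanr(X.to_numpy())
rho = pd.DataFrame(rho, index=X.columns, columns=X.columns)

# Cross-check against pandas' Spearman correlation.
reference = X.corr(method="spearman")
print(np.allclose(rho, reference))
```

Note that this covers only the unweighted Spearman coefficient; it says nothing about the other association measures or about sample weights.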

I think it is fine to close this.


ThomasBury commented on May 24, 2024

scipy.stats.spearmanr


Computation Time Comparison:

Running the code with a single job (n_jobs=1) results in similar overall computation time for both implementations:

ARFS: 231 ns ± 188 ns per loop
SciPy Spearmanr: 137 ns ± 89 ns per loop

While SciPy is slightly faster (around 1.7 times), keep in mind that:

ARFS Implementation:
* Uses NumPy and Pandas for potentially better scalability with larger datasets.
* Supports weighting, which is crucial for specific applications like Poisson regression.
* Parallelizing with n_jobs > 1 can be tempting for speed gains, but it spawns multiple Python processes (the usual way around Python's Global Interpreter Lock, GIL, which limits multi-core use within a single process). Starting those processes incurs overhead, so use it only for bigger datasets, or even not at all: I wrote it before pandas 2.0 was released, so let's see what pandas 3.0 brings to the table.

I agree that if one doesn't need weighting (the weight vector is None), then defaulting to the SciPy/NumPy implementation for speed would make perfect sense. Perhaps in the next release.
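A minimal sketch of what such a dispatch could look like, assuming a hypothetical helper named `spearman_matrix` (this is not the arfs API). When no weights are given it calls SciPy directly; otherwise it falls back to a simple stand-in, a weighted Pearson correlation of the ordinary column ranks, which is only one possible definition of a weighted Spearman coefficient and not necessarily the one arfs uses.

```python
import numpy as np
from scipy import stats


def weighted_corr_of_ranks(X, w):
    """Weighted Pearson correlation of the column ranks (illustrative stand-in)."""
    ranks = stats.rankdata(X, axis=0)  # average ranks, column by column
    cov = np.cov(ranks, rowvar=False, aweights=np.asarray(w, dtype=float))
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


def spearman_matrix(X, sample_weight=None):
    """Hypothetical dispatcher: SciPy fast path when no weights are given."""
    X = np.asarray(X, dtype=float)
    if sample_weight is None:
        # Assumes X has more than two columns, so SciPy returns a matrix.
        rho, _ = stats.spearmanr(X)
        return rho
    return weighted_corr_of_ranks(X, sample_weight)


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
print(np.allclose(spearman_matrix(X), spearman_matrix(X, np.ones(50))))
```

With uniform weights the two paths agree, which is a quick sanity check that the fast path does not change results for unweighted use.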

I hope this is clearer. Thanks for contributing.


harper357 commented on May 24, 2024

My larger dataset was showing a bigger speed up, but I totally get your points about weights.

Adding Numba support could also offer quite a speed up, but it might involve more code changes than what it is worth if most people aren't using large datasets.


ThomasBury commented on May 24, 2024

This implementation isn't lightning-fast, but some functions are inherently difficult to vectorize. Numba could offer potential speedups, but it has compatibility issues with recent Python and NumPy versions and often requires significant code rewrites with uncertain gains.

Therefore, for now, I'm sticking with this simpler approach and awaiting the expected performance improvements in pandas 3.0.

Another option could be migrating to Polars, but it's unclear if it outperforms the upcoming pandas version. Stay tuned for further updates!

