Comments (9)

ThomasBury avatar ThomasBury commented on May 24, 2024

Hi @harper357, thank you for the kind words and the careful review (a proper CI/CD test suite is still missing). I'll go through your comments and PR as soon as my schedule allows.

from arfs.

ThomasBury avatar ThomasBury commented on May 24, 2024

For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent.
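To illustrate the point about index-based placement, here is a minimal sketch. The feature names and values are hypothetical and this is not the arfs code; it only shows that adding two series matches values by label rather than by position, with an explicit sort as the last step.

```python
import pandas as pd

# Two series holding the same labels in a different order (hypothetical feature names).
first = pd.Series({"feat_a": 1.0, "feat_b": 2.0, "feat_c": 3.0})
second = pd.Series({"feat_c": 30.0, "feat_a": 10.0, "feat_b": 20.0})

# Addition aligns on index labels, not on position.
total = first + second

# Sort explicitly at the end so the final order does not depend on alignment details.
total = total.sort_values(ascending=False)
print(total)
```

Sorting last keeps the result independent of whatever order the alignment step happens to produce.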

Agreed, using pandas is indeed more Pythonic than list comprehension in this case.

In the tutorial for version 2.2.3, everything seems fine. To help diagnose the issue with features not exceeding the threshold, could you provide a reproducible example (after updating to version 2.2.3)? Thanks.

harper357 avatar harper357 commented on May 24, 2024

harper357 avatar harper357 commented on May 24, 2024

I see you merged my changes, so the bug should be fixed. But just to be complete, I uploaded a notebook to my fork that illustrates everything.

It should be pretty easy to understand: I used the cancer dataset to show that there was a problem, followed by some minimal examples of what was happening.

ThomasBury avatar ThomasBury commented on May 24, 2024

Thanks for the detailed example! I appreciate it.

I missed that Pandas preserves the order of the first series during addition. Your PR solved it due to the subsequent sorting.
Based on your example, I've updated the tutorial.

Regarding overall performance, using n_jobs > 1 launches multiple processes (essentially, multiple Python workers), and this initialization incurs a time overhead. Subsequent runs (e.g., the second time you run the function) may be faster than with n_jobs=1. Nevertheless, the current parallelization approach is not very sophisticated, and I recommend keeping n_jobs=1 for now. The upcoming pandas 3 will natively improve parallelization.

Let me know if the fix works; if so, you can close this issue. Thank you for contributing.

harper357 avatar harper357 commented on May 24, 2024

Sorry, I have been pretty busy with work so it took me a while to get back to this.

I haven't had the time to formally test it, but it seems like the scipy.stats.spearmanr calculation is much faster than arfs' implementation (it calculated rho for 14k features in < 1 min on a laptop). I am not sure if the other measures are found in other common packages though.
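For reference, a minimal sketch of the kind of call meant here, on hypothetical random data with far fewer features than the 14k mentioned above. It is not a benchmark; it only shows that a single scipy.stats.spearmanr call on a 2-D array returns the full feature-by-feature matrix.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of 50 features (columns are variables).
X = rng.normal(size=(200, 50))

# With a 2-D input of more than two columns, spearmanr returns the full
# feature-by-feature correlation matrix (and matching p-values) in one call.
rho, pvalue = spearmanr(X)
print(rho.shape)
```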

I think it is fine to close this.

ThomasBury avatar ThomasBury commented on May 24, 2024

scipy.stats.spearmanr

Computation Time Comparison:

Running the code with a single job (n_jobs=1) results in similar overall computation time for both implementations:

  • ARFS: 231 ns ± 188 ns per loop
  • SciPy Spearmanr: 137 ns ± 89 ns per loop

While SciPy is slightly faster (around 1.7 times), keep in mind that:

ARFS Implementation:
  • Uses NumPy and pandas for potentially better scalability with larger datasets.
  • Supports weighting, which is crucial for specific applications like Poisson regression.
  • Parallelizing with n_jobs > 1 can be tempting for speed gains, but it spawns multiple Python processes, which incurs startup and data-transfer overhead. (Separate processes are needed because Python's Global Interpreter Lock (GIL) prevents threads from running Python code on several cores at once.) Use it only for bigger datasets, or not at all: I wrote it before pandas 2.0 was released, so let's see what pandas 3.0 brings to the table.

I agree: if one doesn't need weighting (the weight vector is None), then defaulting to the SciPy/NumPy implementation for speed would make perfect sense. Perhaps in the next release.
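A rough sketch of what such a dispatch could look like. The helper names `spearman_dispatch` and `weighted_pearson` are hypothetical and this is not arfs' implementation: the weighted branch is a simplification that applies a weighted Pearson correlation to ordinary (unweighted) average ranks, whereas a full treatment might also weight the ranking step.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y under non-negative observation weights w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)


def spearman_dispatch(x, y, sample_weight=None):
    """Hypothetical helper: take the fast SciPy path when no weights are given."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if sample_weight is None:
        rho, _ = spearmanr(x, y)
        return float(rho)
    # Simplification (assumption): unweighted average ranks, weighted Pearson on them.
    return float(weighted_pearson(rankdata(x), rankdata(y), sample_weight))


rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + rng.normal(size=100)
print(spearman_dispatch(x, y))
print(spearman_dispatch(x, y, sample_weight=rng.uniform(0.1, 1.0, size=100)))
```

With uniform weights the weighted branch reduces to the ordinary Spearman coefficient, which gives a simple sanity check for the two paths.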

I hope this is clearer. Thanks for contributing.

harper357 avatar harper357 commented on May 24, 2024

My larger dataset was showing a bigger speed up, but I totally get your points about weights.

Adding Numba support could also offer quite a speedup, but it might involve more code changes than it is worth if most people aren't using large datasets.

ThomasBury avatar ThomasBury commented on May 24, 2024

This implementation isn't lightning-fast, and some functions are inherently difficult to vectorize. Numba could offer speedups, but it has compatibility issues with recent Python and NumPy versions and often requires significant code rewrites with uncertain gains.

Therefore, for now, I'm sticking with this simpler approach and awaiting the expected performance improvements in pandas 3.0.

Another option could be migrating to Polars, but it's unclear if it outperforms the upcoming pandas version. Stay tuned for further updates!
