joaopfonseca / ml-research

A Python library with utilities for Machine Learning research and algorithm implementations

License: MIT License

Makefile 1.27% Python 98.73%

active-learning data-science machine-learning python scikit-learn

ml-research's Issues

Update ``make`` commands

Some commands are outdated and no longer work.

Replace ``df.append`` functions to comply with pandas 2.0

pandas 2.0 removed df.append in favor of pd.concat, so some functions in the datasets submodule no longer work. In addition, some links to dataset descriptions on openml are broken.
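
A minimal sketch of the migration, assuming a simple row-wise append. The frames and column names below are made up for illustration and are not taken from the datasets submodule:

```python
import pandas as pd

# Hypothetical frames, used only for illustration.
df = pd.DataFrame({"feature": [1.0, 2.0], "target": [0, 1]})
new_rows = pd.DataFrame({"feature": [3.0], "target": [1]})

# Before (removed in pandas 2.0):
#   df = df.append(new_rows, ignore_index=True)

# After: works on both pandas 1.x and 2.x.
combined = pd.concat([df, new_rows], ignore_index=True)
```

When rows are appended inside a loop, collecting the pieces in a list and calling pd.concat once at the end avoids repeated copies.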

Add support for Python 3.11

Add Python 3.12 support

Host all raw data from datasets submodule elsewhere

With Python 3.11, downloading some datasets raises an SSL error (unsafe legacy renegotiation disabled). This happens when the server doesn't support RFC 5746 secure renegotiation and the client uses OpenSSL 3, which enforces that standard by default (source).

Hosting the raw data elsewhere should fix this issue.
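
Until the data is rehosted, a client-side stopgap might look like the sketch below. It is not part of ml-research: the helper name make_legacy_context is hypothetical, and re-enabling legacy server connections weakens TLS security, so it should only be considered for trusted, public dataset hosts.

```python
import ssl

# Value of OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag. Python exposes it as
# ssl.OP_LEGACY_SERVER_CONNECT only from version 3.12 onwards.
OP_LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def make_legacy_context():
    """Return an SSL context that tolerates servers lacking RFC 5746 support."""
    context = ssl.create_default_context()
    # Certificate verification stays on; only the renegotiation check is relaxed.
    context.options |= OP_LEGACY_SERVER_CONNECT
    return context
```

The context can then be passed to a download call, for example urllib.request.urlopen(url, context=make_legacy_context()).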

Add ``mlresearch.show_versions()`` functions for bug reports

See scikit-learn and imbalanced-learn for reference.

Note: Update CONTRIBUTING.md accordingly
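
A rough sketch of what such a function could look like, loosely modeled on the output format of scikit-learn's show_versions. The helper get_versions and the default package list are assumptions for illustration, not the library's actual implementation:

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def get_versions(packages=("numpy", "pandas", "scikit-learn")):
    """Collect system and dependency versions for a bug report."""
    info = {
        "python": sys.version.replace("\n", " "),
        "executable": sys.executable,
        "machine": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = version(name)
        except PackageNotFoundError:
            # Report missing optional packages instead of failing.
            info[name] = None
    return info


def show_versions(packages=("numpy", "pandas", "scikit-learn")):
    """Print the collected information, one aligned entry per line."""
    info = get_versions(packages)
    width = max(len(key) for key in info)
    for key, value in info.items():
        print(f"{key:>{width}}: {value}")
```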

Remove computer vision models, augmentations and datasets

They will be removed in the next release since:

* I'm not going to use these methods anytime soon and I don't have the time to test them properly.
* They are out of scope for the library, which is meant for machine learning techniques focused on tabular data. In the future, it may be worth developing a separate library for computer vision, for example.
* Setting PyTorch as a dependency for a small part of the library isn't particularly efficient.

Fix deprecation warnings

Fix bugs and deprecation warnings arising from dependency updates

Add ``check_random_states`` to ``__init__.py`` in utils submodule

This function can be useful at times, yet it is not easy to access. It is defined in mlresearch.utils._check_pipelines.py.

Set ``utils.parallel_loop`` to track job completions instead of job starts
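
To illustrate the difference, here is a sketch that counts jobs as they finish. It uses the standard library's concurrent.futures rather than whatever backend utils.parallel_loop actually relies on, and the signature (including the on_progress callback) is an assumption made for this example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_loop(function, iterable, n_jobs=2, on_progress=None):
    """Apply ``function`` to each item, reporting progress on completion."""
    items = list(iterable)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # Remember each job's position so results keep the input order.
        futures = {
            executor.submit(function, item): i for i, item in enumerate(items)
        }
        # as_completed yields a future only once its job has finished, so the
        # counter below tracks completions rather than job starts.
        for n_done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress is not None:
                on_progress(n_done, len(items))
    return results
```

A progress bar updated from on_progress then reflects finished work, whereas wrapping the input iterable would advance the bar as soon as each job is dispatched.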

Add issues template and fix broken link on PR template

See pytorch-lightning, scikit-learn and imbalanced-learn templates for examples.

Create function ``dataframe_to_image``

Should be the inverse of the image_to_dataframe function.

Add Changelog generator

Add Semi-supervised learning implementation

Move LaTeX-related functions to their own submodule

Modify parameter names in ``make_bold`` function to ensure parameter name consistency across utils functions

Move all CI/CD to GitHub Actions

The number of non-metric + metric features in ``mlresearch.datasets.Datasets.summarize_datasets()`` does not match the total number of features

Add missing docstring to GeometricSMOTE's ``_encode_categorical``

ml-research/mlresearch/data_augmentation/_gsmote.py

Line 503 in b87f40e

def _encode_categorical(self, X, y):

Fix applymap `FutureWarning`

This was found on the adult dataset from ContinuousCategoricalDatasets. Other datasets might have the same warning as well.

FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
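
One possible way to silence the warning while still supporting older pandas releases is sketched below. The helper name elementwise is hypothetical. DataFrame.map was added in pandas 2.1, so the sketch falls back to applymap when it is missing:

```python
import pandas as pd


def elementwise(df, func):
    """Apply ``func`` to every cell without triggering the FutureWarning."""
    if hasattr(pd.DataFrame, "map"):
        # pandas >= 2.1: DataFrame.map replaces DataFrame.applymap.
        return df.map(func)
    # Older pandas versions only provide DataFrame.applymap.
    return df.applymap(func)
```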

Add tests for functions in ``mlresearch.utils._visualization``

There are no tests for the functions defined in this script. They are fairly trivial, but the gap lowers coverage, and writing a few simple tests for them should be easy.

Add function to describe and imbalance datasets

Make pytorch models sklearn compatible

PyTorch models should implement the scikit-learn-style methods that apply to them: fit, predict, fit_predict, transform, fit_transform, resample or fit_resample.

``adult`` dataset contains trailing white spaces

Found this on the gender variable, but others might exist.
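
A small sketch of a cleanup step that strips leading and trailing whitespace from every string column. The sample frame is invented to mimic the problem and is not the actual adult data:

```python
import pandas as pd

# Hypothetical sample mimicking the problem; not the actual adult data.
df = pd.DataFrame(
    {"gender": [" Male", "Female ", " Female"], "age": [39, 50, 38]}
)

# Strip surrounding whitespace from every string column, leaving others as-is.
text_columns = [
    column for column in df.columns if pd.api.types.is_string_dtype(df[column])
]
df[text_columns] = df[text_columns].apply(lambda col: col.str.strip())
```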

Fix some of the errors/warnings in the documentation

Add ML-Research to conda

Format code with new version of ``black`` and fix errors coming from new ``sklearn`` version

Describe the issue.

sklearn==1.4.0 removed _PredictScorer from its API, so importing mlresearch currently fails and the package is unusable. In addition, recent versions of black changed the code formatting.

Error when importing ``set_matplotlib_style``

Describe the bug

Cannot import mlresearch.utils.set_matplotlib_style.

Steps/Code to Reproduce

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, os
from mlresearch.utils import set_matplotlib_style
import warnings
warnings.filterwarnings("ignore")

set_matplotlib_style()
%config InlineBackend.figure_format = 'retina'

Expected Results

No error is thrown (i.e., no output)

Actual Results

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [2], in <cell line: 8>()
      6 import numpy as np
      7 import matplotlib.pyplot as plt, os
----> 8 from mlresearch.utils import set_matplotlib_style
      9 import warnings
     10 warnings.filterwarnings("ignore")

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/__init__.py:45, in <module>
     41     sys.stderr.write("Partial import of imblearn during the build process.\n")
     42     # We are not importing the rest of scikit-learn during the build
     43     # process, as it may not be compiled yet
     44 else:
---> 45     from . import active_learning
     46     from . import data_augmentation
     47     from . import datasets

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/__init__.py:4, in <module>
      1 """
      2 Module which contains Active Learning implementations.
      3 """
----> 4 from ._active_learning import StandardAL, AugmentationAL
      5 from ._acquisition_functions import ACQUISITION_FUNCTIONS
      7 __all__ = ["StandardAL", "AugmentationAL", "ACQUISITION_FUNCTIONS"]

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/_active_learning.py:5, in <module>
      3 import numpy as np
      4 from sklearn.base import ClassifierMixin, BaseEstimator, clone
----> 5 from sklearn.model_selection import GridSearchCV
      6 from imblearn.pipeline import Pipeline
      7 from imblearn.over_sampling.base import BaseOverSampler

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/__init__.py:23, in <module>
     20 from ._split import train_test_split
     21 from ._split import check_cv
---> 23 from ._validation import cross_val_score
     24 from ._validation import cross_val_predict
     25 from ._validation import cross_validate

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:32, in <module>
     30 from ..utils.fixes import delayed
     31 from ..utils.metaestimators import _safe_split
---> 32 from ..metrics import check_scoring
     33 from ..metrics._scorer import _check_multimetric_scoring, _MultimetricScorer
     34 from ..exceptions import FitFailedWarning

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/__init__.py:41, in <module>
     37 from ._classification import multilabel_confusion_matrix
     39 from ._dist_metrics import DistanceMetric
---> 41 from . import cluster
     42 from .cluster import adjusted_mutual_info_score
     43 from .cluster import adjusted_rand_score

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/__init__.py:22, in <module>
     20 from ._supervised import fowlkes_mallows_score
     21 from ._supervised import entropy
---> 22 from ._unsupervised import silhouette_samples
     23 from ._unsupervised import silhouette_score
     24 from ._unsupervised import calinski_harabasz_score

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py:16, in <module>
     14 from ...utils import check_X_y
     15 from ...utils import _safe_indexing
---> 16 from ..pairwise import pairwise_distances_chunked
     17 from ..pairwise import pairwise_distances
     18 from ...preprocessing import LabelEncoder

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/pairwise.py:33, in <module>
     30 from ..utils.fixes import delayed
     31 from ..utils.fixes import sp_version, parse_version
---> 33 from ._pairwise_distances_reduction import PairwiseDistancesArgKmin
     34 from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
     35 from ..exceptions import DataConversionWarning

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/__init__.py:89, in <module>
      1 # Pairwise Distances Reductions
      2 # =============================
      3 #
   (...)
     85 #    using Generalized Matrix Multiplication over `float64` data (see the
     86 #    docstring of :class:`GEMMTermComputer64` for details).
---> 89 from ._dispatcher import (
     90     ArgKmin,
     91     ArgKminClassMode,
     92     BaseDistancesReductionDispatcher,
     93     RadiusNeighbors,
     94     sqeuclidean_row_norms,
     95 )
     97 __all__ = [
     98     "BaseDistancesReductionDispatcher",
     99     "ArgKmin",
   (...)
    102     "sqeuclidean_row_norms",
    103 ]

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:8, in <module>
      5 from scipy.sparse import issparse
      7 from ... import get_config
----> 8 from .._dist_metrics import BOOL_METRICS, METRIC_MAPPING64
      9 from ._argkmin import (
     10     ArgKmin32,
     11     ArgKmin64,
     12 )
     13 from ._argkmin_classmode import (
     14     ArgKminClassMode32,
     15     ArgKminClassMode64,
     16 )

ImportError: cannot import name 'METRIC_MAPPING64' from 'sklearn.metrics._dist_metrics' (/Users/josefonseca/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_dist_metrics.cpython-39-darwin.so)

Versions

System:
    python: 3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ]
executable: /Users/josefonseca/opt/anaconda3/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 21.2.4
   setuptools: 61.2.0
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.28
       pandas: 1.4.2
   matplotlib: 3.8.0
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /Users/josefonseca/opt/anaconda3/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /Users/josefonseca/opt/anaconda3/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 10

Add `mlresearch.show_versions()` to documentation page

Describe the issue linked to the documentation

mlresearch.show_versions() is not documented in the readthedocs page.

Suggest a potential alternative/fix

No response

`mlresearch.preprocessing.PipelineEncoder` is not preserving feature order

This can be problematic for some applications.

``census`` dataset URL is broken

Other UCI URLs might be broken too; must check.

Review and add examples to documentation

The readthedocs page is getting a bit outdated:

Move secondary dependencies to optional dependencies

Some dependencies are only used in the utils submodule. They should be moved to optional dependencies.
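
A common pattern for this is to import the package lazily, at the point of use, and raise an actionable error when it is missing. The sketch below is one way to do it; the helper name import_optional is hypothetical and not part of ml-research:

```python
import importlib


def import_optional(name):
    """Import ``name`` lazily, with an actionable error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as error:
        raise ImportError(
            f"'{name}' is required for this feature but is not installed. "
            f"Install it with `pip install {name}`."
        ) from error
```

A utils function would then call import_optional inside its body instead of importing the package at module level, so the core library imports cleanly without the extra dependency.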

Implement metrics for evaluating synthetic data

Describe the workflow you want to enable

Evaluate synthetic data quality

Describe your proposed solution

Implement the metrics proposed in "How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models" by Alaa et al.

These metrics should probably be added to the mlresearch.metrics submodule. To do this, a new submodule mlresearch.neural_network might have to be created in order to store the One Class network used to get data distributions' supports.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

Make package `rich` an optional dependency

Rename ``data_augmentation`` submodule to ``synthetic_data``

At the moment there is only an oversampling technique in this submodule, and a wrapper to facilitate data augmentation, but more synthetic data generation techniques will be added in the future.

Get to 80% code coverage

Add tests to get to 80% code coverage

Calling ``Authenticity`` metric raises an error on the ``repr`` method

Describe the bug

With the new sklearn version (1.4.0), the attributes self._sign and self._response_method are no longer defined, which causes an AttributeError.

Steps/Code to Reproduce

from mlresearch.metrics import Authenticity

Authenticity()

Expected Results

Out[1]: make_scorer(Authenticity, metric=euclidean, n_jobs=None)

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
   704 stream = StringIO()
   705 printer = pretty.RepresentationPrinter(stream, self.verbose,
   706     self.max_width, self.newline,
   707     max_seq_length=self.max_seq_length,
   708     singleton_pprinters=self.singleton_printers,
   709     type_pprinters=self.type_printers,
   710     deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
   712 printer.flush()
   713 return stream.getvalue()

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
   408                         return meth(obj, self, cycle)
   409                 if cls is not object \
   410                         and callable(cls.__dict__.get('__repr__')):
--> 411                     return _repr_pprint(obj, self, cycle)
   413     return _default_pprint(obj, self, cycle)
   414 finally:

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
   777 """A pprint that just redirects to the normal repr function."""
   778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
   780 lines = output.splitlines()
   781 with p.group():

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:206, in _BaseScorer.__repr__(self)
   205 def __repr__(self):
--> 206     sign_string = "" if self._sign > 0 else ", greater_is_better=False"
   207     response_method_string = f", response_method={self._response_method!r}"
   208     kwargs_string = "".join([f", {k}={v}" for k, v in self._kwargs.items()])

AttributeError: 'Authenticity' object has no attribute '_sign'

Versions

<details><summary>System, Dependency Information</summary>

**System Information**

* python          : `3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]`
* executable      : `/home/joaofonseca/miniconda3/envs/mlresearch/bin/python`
* machine         : `Linux-6.7.11-200.fc39.x86_64-x86_64-with-glibc2.38`

**Python Dependencies**

* ml-research     : `0.5.0`
* pip             : `23.3.1`
* setuptools      : `68.2.2`
* numpy           : `1.26.4`
* pandas          : `2.2.1`
* scikit-learn    : `1.4.1.post1`
* imbalanced-learn: `0.12.2`
* matplotlib      : `3.8.4`
* tqdm            : `4.66.2`
* Cython          : `None`
* scipy           : `1.13.0`
* keras           : `None`
* tensorflow      : `None`
* joblib          : `1.3.2`

</details>

Active Learning Implementations

Add Active Learning Implementations:

Implement ModelSearchCV with grid search and genetic search heuristics

The goal with this is to make ml-research fully independent from the research-learn library. Using genetic search heuristics for experiments might not be the best idea, but it could be a nice preliminary approach to audit methods and get some provisional results much more quickly than a grid search approach.

Note: 3 methods could be implemented: Grid, Random and Genetic search.

Describe the bug

Running mlresearch.latex.make_mean_sem_table with a mean and a sem dataframe results in an AttributeError.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from mlresearch.latex import make_mean_sem_table

means = pd.DataFrame(np.random.random((3,3)))
sem = pd.DataFrame(np.random.random((3,3)))
make_mean_sem_table(means, sem)

Expected Results

                 0                1                2
0  0.59 $\pm$ 0.10  0.87 $\pm$ 0.53  0.07 $\pm$ 0.11
1  0.85 $\pm$ 0.60  0.51 $\pm$ 0.42  0.22 $\pm$ 0.52
2  0.05 $\pm$ 0.71  0.25 $\pm$ 0.94  0.54 $\pm$ 0.15

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[67], line 7
      5 means = pd.DataFrame(np.random.random((3,3)))
      6 sem = pd.DataFrame(np.random.random((3,3)))
----> 7 make_mean_sem_table(means, sem)

File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/mlresearch/latex/_utils.py:293, in make_mean_sem_table(mean_vals, sem_vals, make_bold, maximum, threshold, decimals, axis)
    287     if type(sem_vals) is np.ndarray:
    288         sem_vals = pd.DataFrame(
    289             sem_vals, index=mean_vals.index, columns=mean_vals.columns
    290         )
    292     scores = (
--> 293         mean_vals.map(("{:,.%sf}" % decimals).format)
    294         + r" $\pm$ "
    295         + sem_vals.map(("{:,.%sf}" % decimals).format)
    296     )
    297 else:
    298     scores = mean_vals.map(("{:,.%sf}" % decimals).format)

File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'map'

Versions

System:
          python: 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
      executable: /home/joaofonseca/miniconda3/envs/recourse-game/bin/python
         machine: Linux-6.5.11-300.fc39.x86_64-x86_64-with-glibc2.38

Python dependencies:
     ml-research: 0.4.2
             pip: 23.3.1
      setuptools: 69.0.2
           numpy: 1.25.1
          pandas: 1.5.3
    scikit-learn: 1.2.1
imbalanced-learn: 0.10.1
      matplotlib: 3.7.0
            tqdm: 4.64.1
          Cython: None
           scipy: 1.10.1
           keras: None
      tensorflow: None
          joblib: 1.2.0
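
The traceback points at DataFrame.map, which only exists from pandas 2.1 onwards, while the environment above has pandas 1.5.3. A version-agnostic sketch of the formatting step is shown below. The function name format_mean_sem is hypothetical and this is not necessarily how the library resolves the issue; requiring a newer pandas would be an alternative.

```python
import pandas as pd


def format_mean_sem(mean_vals, sem_vals, decimals=2):
    """Combine two equally shaped frames into 'mean +/- sem' LaTeX strings."""
    fmt = ("{:,.%sf}" % decimals).format
    # Series.map exists in every pandas version, so formatting column by
    # column avoids both DataFrame.map (>= 2.1) and DataFrame.applymap.
    means = mean_vals.apply(lambda col: col.map(fmt))
    sems = sem_vals.apply(lambda col: col.map(fmt))
    return means + r" $\pm$ " + sems
```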

SOM implementation

Include a pipeline-compatible version for one-hot encoding

See implementation example here:

https://github.com/joaopfonseca/publications/blob/master/gsmotenc/scripts/results.py#L33
