ml-research's Issues

Host all raw data from datasets submodule elsewhere

With Python 3.11, downloading some datasets fails with an SSL error ("unsafe legacy renegotiation disabled"). This happens when the server does not support RFC 5746 secure renegotiation and the client uses OpenSSL 3, which enforces that standard by default (source).

Hosting the raw data elsewhere should fix this issue.
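
Until the data is re-hosted, a possible client-side workaround (a sketch only; the URL is a placeholder) is to build an SSL context that re-enables legacy renegotiation via OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT option (the 0x4 flag; Python only names it ssl.OP_LEGACY_SERVER_CONNECT from 3.12 onwards):

import ssl
import urllib.request

# Re-enable unsafe legacy renegotiation (OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT).
ctx = ssl.create_default_context()
ctx.options |= 0x4  # ssl.OP_LEGACY_SERVER_CONNECT exists only on Python >= 3.12

url = "https://example.com/raw_dataset.csv"  # placeholder for an affected dataset URL
with urllib.request.urlopen(url, context=ctx) as response:
    raw_bytes = response.read()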

Remove computer vision models, augmentations and datasets

They will be removed in the next release since:

  1. I'm not going to use these methods anytime soon and I don't have the time to test them properly.
  2. They are out of the scope of the library, which is meant for machine learning techniques focused on tabular data. In the future it may be worth developing a separate library for computer vision, for example.
  3. Keeping PyTorch as a dependency for such a small part of the library isn't particularly efficient.

Fix applymap `FutureWarning`

This was found on the adult dataset from ContinuousCategoricalDatasets. Other datasets might raise the same warning as well.

FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
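
A possible fix (a sketch only; the helper name elementwise_format is hypothetical) is to dispatch on the pandas version, keeping compatibility with pandas < 2.1 while silencing the warning on newer releases:

import pandas as pd

def elementwise_format(df: pd.DataFrame, fmt: str) -> pd.DataFrame:
    """Format every cell of df, using DataFrame.map on pandas >= 2.1 and
    falling back to the deprecated applymap on older versions."""
    func = fmt.format
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)

For example, elementwise_format(df, "{:,.2f}") would replace the df.applymap("{:,.2f}".format) calls that currently trigger the warning.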

Error when importing ``set_matplotlib_style``

Describe the bug

Cannot import mlresearch.utils.set_matplotlib_style.

Steps/Code to Reproduce

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, os
from mlresearch.utils import set_matplotlib_style
import warnings
warnings.filterwarnings("ignore")

set_matplotlib_style()
%config InlineBackend.figure_format = 'retina' 

Expected Results

No error is thrown (i.e., no output)

Actual Results

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [2], in <cell line: 8>()
      6 import numpy as np
      7 import matplotlib.pyplot as plt, os
----> 8 from mlresearch.utils import set_matplotlib_style
      9 import warnings
     10 warnings.filterwarnings("ignore")

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/__init__.py:45, in <module>
     41     sys.stderr.write("Partial import of imblearn during the build process.\n")
     42     # We are not importing the rest of scikit-learn during the build
     43     # process, as it may not be compiled yet
     44 else:
---> 45     from . import active_learning
     46     from . import data_augmentation
     47     from . import datasets

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/__init__.py:4, in <module>
      1 """
      2 Module which contains Active Learning implementations.
      3 """
----> 4 from ._active_learning import StandardAL, AugmentationAL
      5 from ._acquisition_functions import ACQUISITION_FUNCTIONS
      7 __all__ = ["StandardAL", "AugmentationAL", "ACQUISITION_FUNCTIONS"]

File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/_active_learning.py:5, in <module>
      3 import numpy as np
      4 from sklearn.base import ClassifierMixin, BaseEstimator, clone
----> 5 from sklearn.model_selection import GridSearchCV
      6 from imblearn.pipeline import Pipeline
      7 from imblearn.over_sampling.base import BaseOverSampler

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/__init__.py:23, in <module>
     20 from ._split import train_test_split
     21 from ._split import check_cv
---> 23 from ._validation import cross_val_score
     24 from ._validation import cross_val_predict
     25 from ._validation import cross_validate

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:32, in <module>
     30 from ..utils.fixes import delayed
     31 from ..utils.metaestimators import _safe_split
---> 32 from ..metrics import check_scoring
     33 from ..metrics._scorer import _check_multimetric_scoring, _MultimetricScorer
     34 from ..exceptions import FitFailedWarning

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/__init__.py:41, in <module>
     37 from ._classification import multilabel_confusion_matrix
     39 from ._dist_metrics import DistanceMetric
---> 41 from . import cluster
     42 from .cluster import adjusted_mutual_info_score
     43 from .cluster import adjusted_rand_score

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/__init__.py:22, in <module>
     20 from ._supervised import fowlkes_mallows_score
     21 from ._supervised import entropy
---> 22 from ._unsupervised import silhouette_samples
     23 from ._unsupervised import silhouette_score
     24 from ._unsupervised import calinski_harabasz_score

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py:16, in <module>
     14 from ...utils import check_X_y
     15 from ...utils import _safe_indexing
---> 16 from ..pairwise import pairwise_distances_chunked
     17 from ..pairwise import pairwise_distances
     18 from ...preprocessing import LabelEncoder

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/pairwise.py:33, in <module>
     30 from ..utils.fixes import delayed
     31 from ..utils.fixes import sp_version, parse_version
---> 33 from ._pairwise_distances_reduction import PairwiseDistancesArgKmin
     34 from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
     35 from ..exceptions import DataConversionWarning

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/__init__.py:89, in <module>
      1 # Pairwise Distances Reductions
      2 # =============================
      3 #
   (...)
     85 #    using Generalized Matrix Multiplication over `float64` data (see the
     86 #    docstring of :class:`GEMMTermComputer64` for details).
---> 89 from ._dispatcher import (
     90     ArgKmin,
     91     ArgKminClassMode,
     92     BaseDistancesReductionDispatcher,
     93     RadiusNeighbors,
     94     sqeuclidean_row_norms,
     95 )
     97 __all__ = [
     98     "BaseDistancesReductionDispatcher",
     99     "ArgKmin",
   (...)
    102     "sqeuclidean_row_norms",
    103 ]

File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:8, in <module>
      5 from scipy.sparse import issparse
      7 from ... import get_config
----> 8 from .._dist_metrics import BOOL_METRICS, METRIC_MAPPING64
      9 from ._argkmin import (
     10     ArgKmin32,
     11     ArgKmin64,
     12 )
     13 from ._argkmin_classmode import (
     14     ArgKminClassMode32,
     15     ArgKminClassMode64,
     16 )

ImportError: cannot import name 'METRIC_MAPPING64' from 'sklearn.metrics._dist_metrics' (/Users/josefonseca/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_dist_metrics.cpython-39-darwin.so)

Versions

System:
    python: 3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ]
executable: /Users/josefonseca/opt/anaconda3/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 21.2.4
   setuptools: 61.2.0
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.28
       pandas: 1.4.2
   matplotlib: 3.8.0
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /Users/josefonseca/opt/anaconda3/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /Users/josefonseca/opt/anaconda3/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 10

Review and add examples to documentation

The readthedocs page is getting a bit outdated:

  • Add support for Python 3.10
  • Add support for Python 3.11
  • Check for missing, deleted or renamed functions and objects
  • Review content as a whole
  • Add examples to documentation
  • Add dependency groups to documentation
  • README contains dependencies that will no longer be used
  • Set docstring variables
  • Set docstring tests
  • Abbreviate mlresearch in the API docs page
  • Add funding information

Implement metrics for evaluating synthetic data

Describe the workflow you want to enable

Evaluate synthetic data quality

Describe your proposed solution

Implement the metrics proposed in "How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models" by Alaa et al.

These metrics should probably be added to the mlresearch.metrics submodule. To do this, a new mlresearch.neural_network submodule might have to be created to store the one-class network used to estimate the supports of the data distributions.
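
To illustrate the flavour of these sample-level metrics, a rough sketch of an authenticity-style score is shown below (a simplified nearest-neighbour variant, not the exact estimator from the paper; the function name is hypothetical):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def authenticity_score(real, synthetic):
    """Rough sketch: a synthetic point is flagged as non-authentic when it is
    closer to its nearest real point than that real point is to its own
    nearest real neighbour (i.e. it looks like a near-copy)."""
    real, synthetic = np.asarray(real), np.asarray(synthetic)

    # Distance from each real point to its nearest *other* real point.
    nn_real = NearestNeighbors(n_neighbors=2).fit(real)
    real_to_real = nn_real.kneighbors(real)[0][:, 1]

    # Nearest real neighbour of each synthetic point.
    dist, idx = nn_real.kneighbors(synthetic, n_neighbors=1)
    authentic = dist[:, 0] > real_to_real[idx[:, 0]]
    return float(authentic.mean())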


Calling ``Authenticity`` metric raises an error on the ``__repr__`` method

Describe the bug

With the new scikit-learn version (1.4.0), _BaseScorer.__repr__ expects the attributes self._sign and self._response_method, which Authenticity does not define, causing an AttributeError.

Steps/Code to Reproduce

from mlresearch.metrics import Authenticity

Authenticity()

Expected Results

Out[1]: make_scorer(Authenticity, metric=euclidean, n_jobs=None)

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
   704 stream = StringIO()
   705 printer = pretty.RepresentationPrinter(stream, self.verbose,
   706     self.max_width, self.newline,
   707     max_seq_length=self.max_seq_length,
   708     singleton_pprinters=self.singleton_printers,
   709     type_pprinters=self.type_printers,
   710     deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
   712 printer.flush()
   713 return stream.getvalue()

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
   408                         return meth(obj, self, cycle)
   409                 if cls is not object \
   410                         and callable(cls.__dict__.get('__repr__')):
--> 411                     return _repr_pprint(obj, self, cycle)
   413     return _default_pprint(obj, self, cycle)
   414 finally:

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
   777 """A pprint that just redirects to the normal repr function."""
   778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
   780 lines = output.splitlines()
   781 with p.group():

File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:206, in _BaseScorer.__repr__(self)
   205 def __repr__(self):
--> 206     sign_string = "" if self._sign > 0 else ", greater_is_better=False"
   207     response_method_string = f", response_method={self._response_method!r}"
   208     kwargs_string = "".join([f", {k}={v}" for k, v in self._kwargs.items()])

AttributeError: 'Authenticity' object has no attribute '_sign'

Versions

<details><summary>System, Dependency Information</summary>

**System Information**

* python          : `3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]`
* executable      : `/home/joaofonseca/miniconda3/envs/mlresearch/bin/python`
* machine         : `Linux-6.7.11-200.fc39.x86_64-x86_64-with-glibc2.38`

**Python Dependencies**

* ml-research     : `0.5.0`
* pip             : `23.3.1`
* setuptools      : `68.2.2`
* numpy           : `1.26.4`
* pandas          : `2.2.1`
* scikit-learn    : `1.4.1.post1`
* imbalanced-learn: `0.12.2`
* matplotlib      : `3.8.4`
* tqdm            : `4.66.2`
* Cython          : `None`
* scipy           : `1.13.0`
* keras           : `None`
* tensorflow      : `None`
* joblib          : `1.3.2`

</details>
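
Until the scorer is updated for scikit-learn 1.4, one possible stopgap (a sketch only; it simply sidesteps the private attributes that _BaseScorer.__repr__ now expects) is to override __repr__ in a small subclass:

from mlresearch.metrics import Authenticity

class PatchedAuthenticity(Authenticity):
    """Hypothetical subclass whose repr does not touch sklearn's private
    scorer attributes (_sign, _response_method), which changed in 1.4."""

    def __repr__(self):
        return f"{type(self).__name__}()"

PatchedAuthenticity()  # in an interactive session this now shows PatchedAuthenticity() instead of raising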

Implement ModelSearchCV with grid search and genetic search heuristics

The goal is to make ml-research fully independent of the research-learn library. Using genetic search heuristics for experiments might not be the best idea, but it could be a useful preliminary approach to audit methods and get provisional results much more quickly than a grid search.

Note: three search strategies could be implemented: grid, random, and genetic search.
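
A minimal sketch of what the grid-search variant could look like (names and signature are hypothetical; it simply wraps scikit-learn's GridSearchCV over a list of candidate models):

from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

class ModelSearchCV:
    """Hypothetical sketch: run a grid search per candidate model and keep the
    overall best estimator. Random and genetic strategies could plug in here."""

    def __init__(self, models, cv=5, scoring=None):
        # models: list of (name, estimator, param_grid) tuples
        self.models = models
        self.cv = cv
        self.scoring = scoring

    def fit(self, X, y):
        self.results_ = {}
        self.best_score_ = -float("inf")
        for name, estimator, param_grid in self.models:
            search = GridSearchCV(
                clone(estimator), param_grid, cv=self.cv, scoring=self.scoring
            )
            search.fit(X, y)
            self.results_[name] = search
            if search.best_score_ > self.best_score_:
                self.best_score_ = search.best_score_
                self.best_estimator_ = search.best_estimator_
        return self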

Error when running ``make_mean_sem_table`` (AttributeError: 'DataFrame' object has no attribute 'map')

Describe the bug

Running mlresearch.latex.make_mean_sem_table with a mean and a sem dataframe results in an AttributeError.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from mlresearch.latex import make_mean_sem_table

means = pd.DataFrame(np.random.random((3,3)))
sem = pd.DataFrame(np.random.random((3,3)))
make_mean_sem_table(means, sem)

Expected Results

                 0                1                2
0  0.59 $\pm$ 0.10  0.87 $\pm$ 0.53  0.07 $\pm$ 0.11
1  0.85 $\pm$ 0.60  0.51 $\pm$ 0.42  0.22 $\pm$ 0.52
2  0.05 $\pm$ 0.71  0.25 $\pm$ 0.94  0.54 $\pm$ 0.15

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[67], line 7
      5 means = pd.DataFrame(np.random.random((3,3)))
      6 sem = pd.DataFrame(np.random.random((3,3)))
----> 7 make_mean_sem_table(means, sem)

File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/mlresearch/latex/_utils.py:293, in make_mean_sem_table(mean_vals, sem_vals, make_bold, maximum, threshold, decimals, axis)
    287     if type(sem_vals) is np.ndarray:
    288         sem_vals = pd.DataFrame(
    289             sem_vals, index=mean_vals.index, columns=mean_vals.columns
    290         )
    292     scores = (
--> 293         mean_vals.map(("{:,.%sf}" % decimals).format)
    294         + r" $\pm$ "
    295         + sem_vals.map(("{:,.%sf}" % decimals).format)
    296     )
    297 else:
    298     scores = mean_vals.map(("{:,.%sf}" % decimals).format)

File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'map'

Versions

System:
          python: 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
      executable: /home/joaofonseca/miniconda3/envs/recourse-game/bin/python
         machine: Linux-6.5.11-300.fc39.x86_64-x86_64-with-glibc2.38

Python dependencies:
     ml-research: 0.4.2
             pip: 23.3.1
      setuptools: 69.0.2
           numpy: 1.25.1
          pandas: 1.5.3
    scikit-learn: 1.2.1
imbalanced-learn: 0.10.1
      matplotlib: 3.7.0
            tqdm: 4.64.1
          Cython: None
           scipy: 1.10.1
           keras: None
      tensorflow: None
          joblib: 1.2.0
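
Until the library dispatches on the pandas version (see the applymap issue above), a user-side workaround on pandas < 2.1 is to build the table manually with applymap, mirroring the formatting in the traceback (the two-decimal format is just an example):

import numpy as np
import pandas as pd

means = pd.DataFrame(np.random.random((3, 3)))
sem = pd.DataFrame(np.random.random((3, 3)))

# applymap exists on pandas < 2.1, where DataFrame.map is not yet available.
fmt = "{:,.2f}".format
table = means.applymap(fmt) + r" $\pm$ " + sem.applymap(fmt)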
