joaopfonseca / ml-research
A Python library with utilities for Machine Learning research and algorithm implementations
License: MIT License
Some commands are outdated and no longer work.
pandas 2.0 removed df.append, which was replaced with pd.concat. Some dataset functions no longer work because of that. In addition, some links to dataset descriptions on OpenML no longer work.
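As a minimal sketch of the migration (the data and the names df and new_rows are made up for illustration and are not taken from the mlresearch datasets code), a call such as df.append(new_rows) can usually be rewritten with pd.concat:

```python
import pandas as pd

# Illustrative data only; not taken from the mlresearch datasets code.
df = pd.DataFrame({"feature": [1.0, 2.0], "target": [0, 1]})
new_rows = pd.DataFrame({"feature": [3.0], "target": [1]})

# pandas < 2.0: combined = df.append(new_rows, ignore_index=True)
# pandas >= 2.0 (also works on older versions):
combined = pd.concat([df, new_rows], ignore_index=True)

print(combined.shape)
```

pd.concat takes a list of frames, so several append calls inside a loop can often be collapsed into a single concat over a list that was built up first.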
With Python 3.11, downloading some datasets raises an SSL error (unsafe legacy renegotiation disabled). It happens when the server doesn't support "RFC 5746 secure renegotiation" and the client is using OpenSSL 3, which enforces that standard by default (source).
Hosting the raw data elsewhere should fix this issue.
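Until the data is hosted elsewhere, one possible client-side workaround is to re-enable legacy server connections for the affected downloads only. The sketch below is an assumption about how that could look with the standard library, not something mlresearch does; the function names legacy_ssl_context and download are hypothetical. The option weakens TLS security, so it should not be applied globally.

```python
import ssl
import urllib.request

# SSL_OP_LEGACY_SERVER_CONNECT is exposed as ssl.OP_LEGACY_SERVER_CONNECT
# from Python 3.12; on older versions fall back to its OpenSSL value (0x4).
OP_LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def legacy_ssl_context():
    """Return a default SSL context that tolerates servers without RFC 5746."""
    context = ssl.create_default_context()
    context.options |= OP_LEGACY_SERVER_CONNECT
    return context


def download(url, timeout=30):
    """Fetch `url` with the relaxed context and return the raw bytes."""
    context = legacy_ssl_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

Certificate verification and hostname checks stay enabled; only the renegotiation requirement is relaxed.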
See scikit-learn and imbalanced-learn for reference.
Note: Update CONTRIBUTING.md accordingly
They will be removed in the next release since:
This function can be useful at times, yet it is not easy to access. It is defined in mlresearch.utils._check_pipelines.py.
See pytorch-lightning, scikit-learn and imbalanced-learn templates for examples.
Should be the reverse of the image_to_dataframe function.
This was found on the adult dataset from ContinuousCategoricalDatasets. Other datasets might raise the same warning as well.
FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
There are no tests set up for the functions defined in this script. They are fairly trivial, but the missing tests lower the coverage, and writing a few simple tests for them should be straightforward.
PyTorch models should implement the fit, predict, fit_predict, transform, fit_transform, resample or fit_resample methods.
Found this on the gender variable, but others might exist.
sklearn==1.4.0 removed _PredictScorer from its API. As a result, mlresearch currently fails on import and is unusable. In addition, code formatting changed with recent versions of black.
Cannot import mlresearch.utils.set_matplotlib_style.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, os
from mlresearch.utils import set_matplotlib_style
import warnings
warnings.filterwarnings("ignore")
set_matplotlib_style()
%config InlineBackend.figure_format = 'retina'
No error is thrown (i.e., no output)
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Input In [2], in <cell line: 8>()
6 import numpy as np
7 import matplotlib.pyplot as plt, os
----> 8 from mlresearch.utils import set_matplotlib_style
9 import warnings
10 warnings.filterwarnings("ignore")
File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/__init__.py:45, in <module>
41 sys.stderr.write("Partial import of imblearn during the build process.\n")
42 # We are not importing the rest of scikit-learn during the build
43 # process, as it may not be compiled yet
44 else:
---> 45 from . import active_learning
46 from . import data_augmentation
47 from . import datasets
File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/__init__.py:4, in <module>
1 """
2 Module which contains Active Learning implementations.
3 """
----> 4 from ._active_learning import StandardAL, AugmentationAL
5 from ._acquisition_functions import ACQUISITION_FUNCTIONS
7 __all__ = ["StandardAL", "AugmentationAL", "ACQUISITION_FUNCTIONS"]
File ~/opt/anaconda3/lib/python3.9/site-packages/mlresearch/active_learning/_active_learning.py:5, in <module>
3 import numpy as np
4 from sklearn.base import ClassifierMixin, BaseEstimator, clone
----> 5 from sklearn.model_selection import GridSearchCV
6 from imblearn.pipeline import Pipeline
7 from imblearn.over_sampling.base import BaseOverSampler
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/__init__.py:23, in <module>
20 from ._split import train_test_split
21 from ._split import check_cv
---> 23 from ._validation import cross_val_score
24 from ._validation import cross_val_predict
25 from ._validation import cross_validate
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:32, in <module>
30 from ..utils.fixes import delayed
31 from ..utils.metaestimators import _safe_split
---> 32 from ..metrics import check_scoring
33 from ..metrics._scorer import _check_multimetric_scoring, _MultimetricScorer
34 from ..exceptions import FitFailedWarning
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/__init__.py:41, in <module>
37 from ._classification import multilabel_confusion_matrix
39 from ._dist_metrics import DistanceMetric
---> 41 from . import cluster
42 from .cluster import adjusted_mutual_info_score
43 from .cluster import adjusted_rand_score
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/__init__.py:22, in <module>
20 from ._supervised import fowlkes_mallows_score
21 from ._supervised import entropy
---> 22 from ._unsupervised import silhouette_samples
23 from ._unsupervised import silhouette_score
24 from ._unsupervised import calinski_harabasz_score
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py:16, in <module>
14 from ...utils import check_X_y
15 from ...utils import _safe_indexing
---> 16 from ..pairwise import pairwise_distances_chunked
17 from ..pairwise import pairwise_distances
18 from ...preprocessing import LabelEncoder
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/pairwise.py:33, in <module>
30 from ..utils.fixes import delayed
31 from ..utils.fixes import sp_version, parse_version
---> 33 from ._pairwise_distances_reduction import PairwiseDistancesArgKmin
34 from ._pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
35 from ..exceptions import DataConversionWarning
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/__init__.py:89, in <module>
1 # Pairwise Distances Reductions
2 # =============================
3 #
(...)
85 # using Generalized Matrix Multiplication over `float64` data (see the
86 # docstring of :class:`GEMMTermComputer64` for details).
---> 89 from ._dispatcher import (
90 ArgKmin,
91 ArgKminClassMode,
92 BaseDistancesReductionDispatcher,
93 RadiusNeighbors,
94 sqeuclidean_row_norms,
95 )
97 __all__ = [
98 "BaseDistancesReductionDispatcher",
99 "ArgKmin",
(...)
102 "sqeuclidean_row_norms",
103 ]
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:8, in <module>
5 from scipy.sparse import issparse
7 from ... import get_config
----> 8 from .._dist_metrics import BOOL_METRICS, METRIC_MAPPING64
9 from ._argkmin import (
10 ArgKmin32,
11 ArgKmin64,
12 )
13 from ._argkmin_classmode import (
14 ArgKminClassMode32,
15 ArgKminClassMode64,
16 )
ImportError: cannot import name 'METRIC_MAPPING64' from 'sklearn.metrics._dist_metrics' (/Users/josefonseca/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_dist_metrics.cpython-39-darwin.so)
System:
python: 3.9.12 (main, Apr 5 2022, 01:53:17) [Clang 12.0.0 ]
executable: /Users/josefonseca/opt/anaconda3/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
sklearn: 1.1.1
pip: 21.2.4
setuptools: 61.2.0
numpy: 1.21.5
scipy: 1.7.3
Cython: 0.29.28
pandas: 1.4.2
matplotlib: 3.8.0
joblib: 1.2.0
threadpoolctl: 2.2.0
Built with OpenMP: True
threadpoolctl info:
filepath: /Users/josefonseca/opt/anaconda3/lib/libmkl_rt.1.dylib
prefix: libmkl_rt
user_api: blas
internal_api: mkl
version: 2021.4-Product
num_threads: 10
threading_layer: intel
filepath: /Users/josefonseca/opt/anaconda3/lib/libomp.dylib
prefix: libomp
user_api: openmp
internal_api: openmp
version: None
num_threads: 10
mlresearch.show_versions() is not documented in the readthedocs page.
This can be problematic for some applications.
Other UCI URLs might be broken too; must check.
The readthedocs page is getting a bit outdated:
Some dependencies are only used in the utils submodule. They should be moved to optional dependencies.
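One common pattern for optional dependencies is to import them lazily and raise an informative error only when the feature that needs them is used. The sketch below assumes a hypothetical helper named optional_import and a hypothetical optional-dependency group named utils; neither is part of mlresearch.

```python
import importlib


def optional_import(module_name, extra):
    """Import `module_name`, or explain which optional extra provides it.

    `extra` is a hypothetical name for an optional-dependency group.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with the optional '{extra}' dependencies."
        ) from exc


# Usage: the import only happens when the function that needs it is called.
# The json module stands in for a real optional dependency here.
def load_json_module():
    return optional_import("json", extra="utils")
```

With this layout, importing the package itself never touches the optional dependency, so users who do not need the utils submodule are not forced to install it.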
Evaluate synthetic data quality
Implement the metrics proposed in "How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models" by Alaa et al.
These metrics should probably be added to the mlresearch.metrics submodule. To do this, a new submodule mlresearch.neural_network might have to be created to store the one-class network used to estimate the supports of the data distributions.
At the moment there is only an oversampling technique in this submodule, and a wrapper to facilitate data augmentation, but more synthetic data generation techniques will be added in the future.
Add tests to get to 80% code coverage
With the new sklearn version (1.4.0), the attributes self._sign and self._response_method are no longer defined, which causes an AttributeError.
from mlresearch.metrics import Authenticity
Authenticity()
Out[1]: make_scorer(Authenticity, metric=euclidean, n_jobs=None)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():
File ~/miniconda3/envs/mlresearch/lib/python3.12/site-packages/sklearn/metrics/_scorer.py:206, in _BaseScorer.__repr__(self)
205 def __repr__(self):
--> 206 sign_string = "" if self._sign > 0 else ", greater_is_better=False"
207 response_method_string = f", response_method={self._response_method!r}"
208 kwargs_string = "".join([f", {k}={v}" for k, v in self._kwargs.items()])
AttributeError: 'Authenticity' object has no attribute '_sign'
<details><summary>System, Dependency Information</summary>
**System Information**
* python : `3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]`
* executable : `/home/joaofonseca/miniconda3/envs/mlresearch/bin/python`
* machine : `Linux-6.7.11-200.fc39.x86_64-x86_64-with-glibc2.38`
**Python Dependencies**
* ml-research : `0.5.0`
* pip : `23.3.1`
* setuptools : `68.2.2`
* numpy : `1.26.4`
* pandas : `2.2.1`
* scikit-learn : `1.4.1.post1`
* imbalanced-learn: `0.12.2`
* matplotlib : `3.8.4`
* tqdm : `4.66.2`
* Cython : `None`
* scipy : `1.13.0`
* keras : `None`
* tensorflow : `None`
* joblib : `1.3.2`
</details>
Add Active Learning Implementations:
The goal is to make ml-research fully independent from the research-learn library. Using genetic search heuristics for experiments might not be the best idea, but it could be a nice preliminary approach to audit methods and get provisional results much more quickly than with a grid search.
Note: 3 methods could be implemented; Grid, Random and Genetic search.
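To make the difference between two of these strategies concrete, here is a rough standard-library sketch of how grid and random search generate candidate configurations. The function names and the parameter grid are hypothetical and only for illustration; genetic search is omitted because it needs a fitness loop that depends on the experiment.

```python
import itertools
import random


def grid_candidates(param_grid):
    """Yield every combination of the values in `param_grid`."""
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def random_candidates(param_grid, n_iter, seed=0):
    """Yield `n_iter` combinations sampled uniformly (with replacement)."""
    rng = random.Random(seed)
    keys = sorted(param_grid)
    for _ in range(n_iter):
        yield {k: rng.choice(param_grid[k]) for k in keys}


# Hypothetical search space, for illustration only.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
grid = list(grid_candidates(param_grid))
sample = list(random_candidates(param_grid, n_iter=4))
```

Grid search evaluates all 6 combinations here, whereas random search evaluates a fixed budget of 4 regardless of how large the grid grows.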
sklearn.metrics.SCORERS is deprecated. To maintain API compatibility with sklearn, mlresearch.metrics must be adapted accordingly.
Running mlresearch.latex.make_mean_sem_table with a mean and a sem dataframe results in an AttributeError.
import numpy as np
import pandas as pd
from mlresearch.latex import make_mean_sem_table
means = pd.DataFrame(np.random.random((3,3)))
sem = pd.DataFrame(np.random.random((3,3)))
make_mean_sem_table(means, sem)
0 1 2
0 0.59 $\pm$ 0.10 0.87 $\pm$ 0.53 0.07 $\pm$ 0.11
1 0.85 $\pm$ 0.60 0.51 $\pm$ 0.42 0.22 $\pm$ 0.52
2 0.05 $\pm$ 0.71 0.25 $\pm$ 0.94 0.54 $\pm$ 0.15
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[67], line 7
5 means = pd.DataFrame(np.random.random((3,3)))
6 sem = pd.DataFrame(np.random.random((3,3)))
----> 7 make_mean_sem_table(means, sem)
File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/mlresearch/latex/_utils.py:293, in make_mean_sem_table(mean_vals, sem_vals, make_bold, maximum, threshold, decimals, axis)
287 if type(sem_vals) is np.ndarray:
288 sem_vals = pd.DataFrame(
289 sem_vals, index=mean_vals.index, columns=mean_vals.columns
290 )
292 scores = (
--> 293 mean_vals.map(("{:,.%sf}" % decimals).format)
294 + r" $\pm$ "
295 + sem_vals.map(("{:,.%sf}" % decimals).format)
296 )
297 else:
298 scores = mean_vals.map(("{:,.%sf}" % decimals).format)
File ~/miniconda3/envs/recourse-game/lib/python3.10/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'map'
System:
python: 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
executable: /home/joaofonseca/miniconda3/envs/recourse-game/bin/python
machine: Linux-6.5.11-300.fc39.x86_64-x86_64-with-glibc2.38
Python dependencies:
ml-research: 0.4.2
pip: 23.3.1
setuptools: 69.0.2
numpy: 1.25.1
pandas: 1.5.3
scikit-learn: 1.2.1
imbalanced-learn: 0.10.1
matplotlib: 3.7.0
tqdm: 4.64.1
Cython: None
scipy: 1.10.1
keras: None
tensorflow: None
joblib: 1.2.0
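The traceback arises because DataFrame.map only exists in pandas 2.1 and later, while the environment above has pandas 1.5.3. A minimal compatibility sketch, assuming a hypothetical helper named elementwise (not part of mlresearch), is to fall back to DataFrame.applymap when DataFrame.map is unavailable:

```python
import pandas as pd


def elementwise(df, func):
    """Apply `func` to every cell, on both old and new pandas versions."""
    # Check the class rather than the instance, so that a column that
    # happens to be named "map" cannot be mistaken for the method.
    if hasattr(pd.DataFrame, "map"):  # pandas >= 2.1
        return df.map(func)
    return df.applymap(func)  # older pandas


decimals = 2
fmt = ("{:,.%sf}" % decimals).format

# Illustrative values only.
means = pd.DataFrame([[0.591, 0.874]])
sem = pd.DataFrame([[0.104, 0.533]])

# Same string concatenation as in the traceback, via the helper.
scores = elementwise(means, fmt) + r" $\pm$ " + elementwise(sem, fmt)
print(scores.iloc[0, 0])
```

The alternative is to require pandas 2.1 or later in the package metadata, which avoids the branch but drops support for older environments such as the one reported here.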
See implementation example here: