daffidwilde / edo

A library for generating artificial datasets through genetic evolution.
Home Page: https://doi.org/10.1007/s10489-019-01592-4
License: MIT License
Individuals are currently stored as a collections.namedtuple, but it could be beneficial to move to an Individual class so that the fitness of an individual can be stored as an attribute. This sort of behaviour can be done with a namedtuple, but it requires making a new instance of the individual once the fitness is evaluated. Not preferable.
The alternative would be to use functools.lru_cache or similar to store the fitness in memory elsewhere. This could also be useful when assessing an individual that has turned up before (though that is unlikely to happen).
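The class-based option above could be sketched roughly as follows. This is a minimal, hypothetical design (the attribute and method names are assumptions, not edo's actual API): the fitness defaults to None and is cached on first evaluation, so no new instance is needed.

```python
from dataclasses import dataclass
from typing import Callable, Optional

import pandas as pd


@dataclass
class Individual:
    """Hypothetical Individual class: the dataframe and its metadata
    travel together, and the fitness is cached as an attribute once
    evaluated (no new instance needed, unlike a namedtuple)."""

    dataframe: pd.DataFrame
    metadata: list
    fitness: Optional[float] = None

    def evaluate(self, fitness_function: Callable) -> float:
        # Evaluate at most once; reuse the cached value afterwards.
        if self.fitness is None:
            self.fitness = fitness_function(self.dataframe)
        return self.fitness


ind = Individual(pd.DataFrame({"a": [1.0, 2.0, 3.0]}), metadata=["Normal"])
first = ind.evaluate(lambda df: df["a"].mean())
second = ind.evaluate(lambda df: df["a"].sum())  # cached: still the mean
```

The cached attribute also gives most of what lru_cache would provide, without storing the fitness in memory elsewhere.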
As per the discussion on #24, functionality to allow the mutation probability to reduce over the course of a run would be good.
My first thought is to implement this in a similar fashion to how stopping conditions were (by passing a function to run_algorithm). To begin with, this could be done using just the current and maximum iteration. That way, users can define their own methods for decreasing the mutation probability.
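Two such user-defined schedules might look like the following. These are illustrative sketches only; the function names and the idea of calling them with the current and maximum iteration are assumptions about how the hook could work, not existing edo functionality.

```python
def linear_decay(initial_prob, iteration, max_iter):
    """Hypothetical schedule: reduce the mutation probability linearly
    from its initial value to zero over the course of the run, using
    only the current and maximum iteration."""
    return initial_prob * (1 - iteration / max_iter)


def exponential_decay(initial_prob, iteration, max_iter, rate=0.05):
    """Alternative schedule: shrink the probability by a constant
    factor each epoch."""
    return initial_prob * (1 - rate) ** iteration


# At iteration 25 of 50, the linear schedule gives half the initial value.
halfway = linear_decay(0.01, iteration=25, max_iter=50)
```

Either function could then be passed to the algorithm in the same way a stopping condition is.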
Just noticed: there are merge conflict markers in https://github.com/daffidwilde/edo/blob/master/docs/tutorial/tutorial_ii.rst
As raised in the further work section of my thesis, the use of copula functions would offer an elegant solution to handling column relationships. An excerpt from that section (daffidwilde/thesis#101):
Copulas are functions that join multivariate distribution functions to their one-dimensional margins [235]. For EDO, this would mean P would contain a single element: a copula function fitted to the existing dataset. In this case, the technical aspects of an individual’s representation would need adjusting to accommodate this change. Likewise, the crossover and mutation processes would require some changes to account for the lack of distinct distribution families.
A Python implementation of copulas for data synthesis exists [12] and incorporating this as a dependency of the edo library would reduce the work required to implement this feature. Studying the impact of copulas in EDO would provide a valuable opportunity to demonstrate the capabilities of EDO as a fully fledged data synthesis method.
Measuring the fitness of an individual is too stochastic as it stands, and the fitness of an individual is not strictly representative of that individual. This is because the sampling process is random and we have no way of controlling that randomness. Some workarounds to explore involve the sampling parameters: num_samples, etc.
Given a long enough time, the noise in the GA becomes prevalent and the population fitnesses seem to spread. Jumps in fitness like this are controlled by the mutation process. In a similar vein to the sampling issues we had before with families of datasets, individual datasets would seem to be delicate objects, and the mutation process has the ability to be destructive throughout the algorithm.
Take the sample mean example. Each individual is a single-column dataset, X, of length between 5 and 50 instances, generated by sampling a set of numbers from a member of the family of normal distributions described below. Take some random sample of five elements from X and call it Y. The fitness function is given by mean(Y) ** 2 and the objective is to minimise this function.
Allow there to be 100 individuals in the population, and run the GA using the default parameters for 50 iterations:
import genetic_data as gd
import matplotlib.pyplot as plt

def sample_mean_squared(df):
    return df.sample(5, replace=True).mean().mean() ** 2

pop, fit, all_pops, all_fits = gd.run_algorithm(
    fitness=sample_mean_squared,
    size=100,
    row_limits=[5, 50],
    col_limits=[1, 1],
    max_iter=50,
    maximise=False,
    seed=0,
)

fig, ax = plt.subplots(1, figsize=(40, 20), dpi=300)
ax.boxplot(all_fits, positions=range(len(all_fits)), sym='.')
ax.set_title('Fitness scores in each epoch', size=24, pad=25)
ax.set_yscale('log')
ax.set_xlabel('Epoch', size=24)
ax.set_ylabel(r'$\log(f(X))$', size=24)

for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_fontsize(20)

plt.show()
This wouldn't actually stop the behaviour; a well-defined stopping condition would just terminate the algorithm before it happens.
As the algorithm progresses, it shows signs of convergence, so decreasing the likelihood of a mutation happening would stop the GA from altering too many "good" solutions.
The way in which mutation of a value occurs in this GA is to sample from the normal distribution centred at the current value, with standard deviation given by sigma (which defaults to 1). In hindsight, this means that for small values, relatively large changes will likely happen during mutation, having a substantial effect on the overall fitness of that individual. Moving away from this fixed-sigma approach towards something relative to the individual at hand could be beneficial.
Not that it affects the example here too much (since the GA is restricted to generating datasets with exactly one column), but new columns are added to a dataset in mutation by effectively initialising a column of the correct length and appending it. Columns are generated by sampling from a list of pdf objects, which sample their parameters from preset ranges on instantiation. For example, a normally distributed column pdf will take a mean in the interval [-10, 10] and a standard deviation in the range [0, 10]. This is a deliberately wide range for the initial population, but later on it could cause problems when a dataset is near-uniform, for instance. Allowing these ranges to diminish over time would soften the impact of mutation. Another option is simply to take a new column to be a copy of a column currently in the dataset. However, that change could have too small an effect on the fitness of the individual.
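The fixed-versus-relative sigma idea can be illustrated with a small sketch. The function names and the relative-scale rule (a fraction of the value's magnitude, with a small floor to avoid a zero scale) are assumptions for illustration, not edo's actual mutation code.

```python
import numpy as np


def mutate_value_fixed(value, sigma=1.0, rng=None):
    """The approach described above: perturb with a fixed standard
    deviation, regardless of the value's scale."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(loc=value, scale=sigma)


def mutate_value_relative(value, rel_sigma=0.1, floor=1e-8, rng=None):
    """Hypothetical alternative: scale the perturbation to the value
    itself, so small values receive proportionally small changes."""
    rng = rng or np.random.default_rng(0)
    scale = max(abs(value) * rel_sigma, floor)
    return rng.normal(loc=value, scale=scale)


# For a small value, the relative mutation stays much closer to it.
small = 0.01
fixed = mutate_value_fixed(small)
relative = mutate_value_relative(small)
```

With the same random draw, the fixed-sigma mutation moves the value by orders of magnitude more than the relative one, which is exactly the destructiveness described above.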
In run_algorithm, fitnesses are being calculated properly, but in all cases bar the last iteration, only best_prop of the fitnesses are being appended to the history of all fitnesses. Likewise with the populations.
As I'm writing this, I think I know why. The best proportion of individuals are selected and taken from a population using population.pop(best), as required. I hadn't accounted for the fact that popping mutates the list in place, so the already-appended population/scores are edited as well: the history holds references to the same list objects rather than copies.
Anyway, I will think of a workaround for this one... but here is an example of the kind of behaviour I'm experiencing.
Using the fitness-recording-issue branch and its added print statements:
>>> import numpy as np
>>> import genetic_data as gd
>>> def x_squared(df):
... return df.iloc[0, 0] ** 2
>>> pop, fit, all_pops, all_fits = gd.run_algorithm(
... fitness=x_squared,
... size=100,
... row_limits=[1, 1],
... col_limits=[1, 1],
... max_iter=5,
... best_prop=0.5,
... maximise=False,
... seed=0
... )
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best scores:
[28.57371438419073, 10.13474506798552, 2.0926246987460893, 0.273387549544826, 0.14578571748701574, 0.02471670450862218]
This certainly does not make any sense. Then, to double-check everything:
>>> for scores in all_fits:
... print('the best', np.min(scores))
... print(len(scores), '\n')
the best 28.57371438419073
50
the best 10.13474506798552
50
the best 2.0926246987460893
50
the best 0.273387549544826
50
the best 0.14578571748701574
50
the best 0.02471670450862218
100
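The behaviour above is consistent with Python's mutable-list aliasing: appending a list to a history appends a reference, not a copy, so later pops shrink the already-recorded entry. A minimal sketch of both the bug and one workaround (the variable names here are illustrative, not edo's internals):

```python
# Appending a reference: popping from `population` afterwards also
# changes the entry already stored in the history.
population = [28.5, 10.1, 2.0, 0.27, 0.14, 0.02]
history = []
history.append(population)        # stores a reference, not a copy
population.pop(0)                 # take the "best" individual out
aliased = history[0]              # the recorded entry changed too

# The fix: append a (shallow) copy, freezing the snapshot.
population = [28.5, 10.1, 2.0, 0.27, 0.14, 0.02]
history = []
history.append(list(population))  # snapshot the current population
population.pop(0)
snapshot = history[0]             # unaffected by the later pop
```

For populations of richer objects, copy.deepcopy may be needed instead of a shallow copy, at some cost in memory.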
Following the comments on this commit, a warning or error should be implemented to catch inputs that don't make any sense. For instance, setting column limits that require exactly one column class in any individual while never allowing that class to be selected, thanks to the probability distribution weights.
This isn't a priority, but it would add to the robustness of the library.
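A validation hook along these lines might look as follows. This is a hypothetical sketch (the function name and checks are assumptions, not edo's API): hard errors for impossible inputs, warnings for merely suspicious ones such as a zero-weight column class.

```python
import warnings


def validate_parameters(col_limits, weights, tol=1e-12):
    """Hypothetical validation: reject impossible limits and warn when
    the distribution weights contradict them, e.g. a column class that
    can never be selected."""
    if len(col_limits) != 2 or col_limits[0] > col_limits[1]:
        raise ValueError("col_limits must be an ordered pair.")
    for i, weight in enumerate(weights):
        if weight < tol:
            warnings.warn(
                f"Column class {i} has (near-)zero weight and will "
                "effectively never be selected."
            )


# A zero weight alongside tight column limits triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_parameters(col_limits=[1, 1], weights=(0.0, 1.0))
```

Raising early like this is cheap and would save users from silently nonsensical runs.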
One of the reasons edo is built the way it is is that many of its parts are user-definable. However, this isn't really the case a lot of the time, since certain function parameters only take a subset of the others; for example, those that affect the mutation process can't take the fitness or history into account.
Maybe having these parts be user-defined methods would give people more control and freedom over what their EAs do.
Let's implement all of the following:
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html#distributions
Could be worth:
Column pdfs take a seed to give a specific sample. When calculating the fitness of an individual, that is done on a particular dataset. It would be more interesting, and in line with the purpose of this project, to calculate fitness on the family of datasets that the individual represents.
Instead, iterate over several seeds and take the fitness of the individual from those separate fitness scores. How to do that well is another issue.
Given a set of fitnesses for several instances of an individual, how should the fitness of that individual be taken? Min? Max? Mean? Median? Allow get_fitness to take a function specifying how to do this, like np.mean, etc.
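The seed-iteration idea could be sketched like this. The function name get_family_fitness, the fitness signature taking a seed, and the toy fitness are all illustrative assumptions; only the aggregation-function idea comes from the issue above.

```python
import numpy as np


def get_family_fitness(individual, fitness, seeds=range(5), agg=np.mean):
    """Hypothetical extension of get_fitness: evaluate the fitness on
    several seeded samples of the family the individual represents,
    then aggregate with a user-supplied function (mean, median, ...)."""
    scores = [fitness(individual, seed=seed) for seed in seeds]
    return agg(scores)


# Toy fitness: pretend each seed yields a slightly different dataset.
def toy_fitness(individual, seed):
    rng = np.random.default_rng(seed)
    return individual + rng.normal(scale=0.1)


mean_score = get_family_fitness(2.0, toy_fitness, seeds=range(10))
median_score = get_family_fitness(2.0, toy_fitness, seeds=range(10), agg=np.median)
```

Note that np.min would reward lucky samples while np.max would be conservative; the mean or median is probably the safer default.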
Workaround in #144, but this error is raised with the latest Dask version (2020.12.0):
(edo-dev) edo(fix-failing-action): python -m pytest -v tests/test_optimiser.py::test_get_pop_history
======================================================== test session starts ========================================================
platform darwin -- Python 3.9.1, pytest-6.2.1, py-1.10.0, pluggy-0.12.0 -- /Users/henrywilde/opt/anaconda3/envs/edo-dev/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/henrywilde/src/edo/.hypothesis/examples')
rootdir: /Users/henrywilde/src/edo
plugins: hypothesis-5.20.3, cov-2.10.1
collected 1 item
tests/test_optimiser.py::test_get_pop_history FAILED [100%]
============================================================= FAILURES ==============================================================
_______________________________________________________ test_get_pop_history ________________________________________________________
> ???
tests/test_optimiser.py:509:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_optimiser.py:560: in test_get_pop_history
assert np.allclose(
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:279: in compute
(result,) = compute(self, traverse=False, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:561: in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:332: in collections_to_dsk
_opt = opt(dsk, keys, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/array/optimization.py:48: in optimize
dsk = dsk.cull(set(keys))
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/highlevelgraph.py:627: in cull
all_ext_keys = self.get_all_external_keys()
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/highlevelgraph.py:518: in get_all_external_keys
self._all_external_keys.update(layer.get_output_keys())
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/blockwise.py:285: in get_output_keys
*[range(self.dims[i]) for i in self.output_indices]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.0 = <tuple_iterator object at 0x7ff03f7bb610>
> *[range(self.dims[i]) for i in self.output_indices]
)
}
E KeyError: 'j'
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/blockwise.py:285: KeyError
------------------------------------------------------------ Hypothesis -------------------------------------------------------------
Falsifying example: test_get_pop_history(
size=2,
row_limits=[1, 1],
col_limits=[1, 1],
distributions=[edo.distributions.discrete.Bernoulli,
edo.distributions.continuous.Uniform],
weights=(0.01, 0.99),
max_iter=1,
best_prop=0.5,
lucky_prop=0.0,
crossover_prob=0.0,
mutation_prob=0.0,
shrinkage=0.0,
maximise=False,
)
====================================================== short test summary info ======================================================
FAILED tests/test_optimiser.py::test_get_pop_history - KeyError: 'j'
========================================================= 1 failed in 1.23s =========================================================
Potential uses include:
Would this be done by specifying no. cores/processes or not? Have a look at Axelrod's implementation perhaps.
The slightly new implementation requires the images to be updated by running the docs/src/*.py files.
For the sake of readability, and to make adding to the current test suite easier, a refactor may be needed. It would be nice to make use of hypothesis.strategies.composite to create our own strategies in the tests, or perhaps pytest.fixture for session-long test data, etc.
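A composite strategy for the limit pairs used throughout the tests might look like this. The strategy name and bounds are assumptions for illustration; only the use of hypothesis.strategies.composite comes from the suggestion above.

```python
import warnings

from hypothesis import strategies as st


@st.composite
def limit_pairs(draw, low=1, high=10):
    """Hypothetical shared strategy: an ordered pair of limits, as used
    for row_limits and col_limits across the test suite."""
    lower = draw(st.integers(min_value=low, max_value=high))
    upper = draw(st.integers(min_value=lower, max_value=high))
    return [lower, upper]


# Such a strategy would then be reusable across tests, e.g.:
# @given(row_limits=limit_pairs(high=50), col_limits=limit_pairs(high=5))
# def test_create_individual(row_limits, col_limits): ...

# Draw one example interactively (suppressing the non-interactive-use
# warning that .example() emits outside of a shrinking run).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    example = limit_pairs().example()
```

Centralising strategies like this avoids repeating the same integers/tuples plumbing in every test.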
Not a huge issue, but it would be nice to preserve datatypes during crossover. That way, analysis of the data can be done more easily after running.
The issue arises when a NaN is introduced to a column with an integer datatype. For instance:
>>> import numpy as np
>>> from edo.individual import create_individual
>>> from edo.pdfs import Poisson
>>> np.random.seed(0)
>>> df, meta = create_individual(row_limits=[10, 10], col_limits=[1, 1], pdfs=[Poisson])
>>> print(df.dtypes)
0 int64
dtype: object
>>> print(df)
0
0 2
1 2
2 4
3 2
4 0
5 0
6 5
7 1
8 1
9 2
>>> df.iloc[3:5, 0] = np.nan
>>> print(df.dtypes)
0 float64
dtype: object
>>> print(df)
0
0 2.0
1 2.0
2 4.0
3 NaN
4 NaN
5 0.0
6 5.0
7 1.0
8 1.0
9 2.0
A variant on randomly sampling distribution parameters at the point of creating the initial population. With Latin hypercube sampling (LHS), the parameter space is covered fully, giving a wider initial range of individuals.
This initialisation method is the default in SciPy's differential evolution implementation, as well as in other optimisation methods.
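SciPy exposes LHS directly via scipy.stats.qmc. A sketch of drawing Normal-distribution parameters over the preset ranges mentioned earlier (mean in [-10, 10], standard deviation in [0, 10]); the choice of 100 draws and the parameter bounds mirror the examples above rather than any fixed edo configuration:

```python
from scipy.stats import qmc

# Latin hypercube sample in the unit square, one dimension per
# distribution parameter, then rescale to the parameter bounds.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_samples = sampler.random(n=100)
params = qmc.scale(unit_samples, l_bounds=[-10, 0], u_bounds=[10, 10])

means, stds = params[:, 0], params[:, 1]  # one (mean, std) pair per individual
```

Each row then parameterises one column pdf in the initial population, with the 100 draws stratified so no region of the parameter space is left empty.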
I am yet to properly understand how doctests work with pytest (they can be collected with the --doctest-modules flag). Will likely just call them separately.
Doing an example where datasets have exactly 1 continuous and 1 discrete column is proving a little difficult. Need to go over this with Vince.