daffidwilde / edo

A library for generating artificial datasets through genetic evolution.
Home Page: https://doi.org/10.1007/s10489-019-01592-4
License: MIT License
Individuals are currently stored as a collections.namedtuple, but it could be beneficial to move to an Individual class so that the fitness of an individual can be stored as an attribute. This sort of behaviour can be done with a namedtuple, but it requires making a new instance of the individual once the fitness is evaluated. Not preferable.
The alternative would be to use functools.lru_cache or similar to store the fitness in memory elsewhere. This could also be useful when assessing an individual that has turned up before (though that is unlikely to happen).
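The class-based option above could be sketched roughly as follows. This is a minimal, hypothetical design (the attribute and method names are assumptions, not edo's actual API): the fitness defaults to None and is cached on first evaluation, so no new instance is needed.

```python
from dataclasses import dataclass
from typing import Callable, Optional

import pandas as pd


@dataclass
class Individual:
    """Hypothetical Individual class: the dataframe and its metadata
    travel together, and the fitness is cached as an attribute once
    evaluated (no new instance needed, unlike a namedtuple)."""

    dataframe: pd.DataFrame
    metadata: list
    fitness: Optional[float] = None

    def evaluate(self, fitness_function: Callable) -> float:
        # Evaluate at most once; reuse the cached value afterwards.
        if self.fitness is None:
            self.fitness = fitness_function(self.dataframe)
        return self.fitness


ind = Individual(pd.DataFrame({"a": [1.0, 2.0, 3.0]}), metadata=["Normal"])
first = ind.evaluate(lambda df: df["a"].mean())
second = ind.evaluate(lambda df: df["a"].sum())  # cached: still the mean
```

The cached attribute also gives most of what lru_cache would provide, without storing the fitness in memory elsewhere.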
As per the discussion on #24, functionality to allow the mutation probability to reduce over the course of a run would be good.
My first thought is to implement this in a similar fashion to how stopping conditions were (by passing a function to run_algorithm). To begin with, this could be done using just the current and maximum iteration. That way, users can define their own methods for decreasing the mutation probability.
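Two such user-defined schedules might look like the following. These are illustrative sketches only; the function names and the idea of calling them with the current and maximum iteration are assumptions about how the hook could work, not existing edo functionality.

```python
def linear_decay(initial_prob, iteration, max_iter):
    """Hypothetical schedule: reduce the mutation probability linearly
    from its initial value to zero over the course of the run, using
    only the current and maximum iteration."""
    return initial_prob * (1 - iteration / max_iter)


def exponential_decay(initial_prob, iteration, max_iter, rate=0.05):
    """Alternative schedule: shrink the probability by a constant
    factor each epoch."""
    return initial_prob * (1 - rate) ** iteration


# At iteration 25 of 50, the linear schedule gives half the initial value.
halfway = linear_decay(0.01, iteration=25, max_iter=50)
```

Either function could then be passed to the algorithm in the same way a stopping condition is.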
Just noticed: there are merge conflict markers in https://github.com/daffidwilde/edo/blob/master/docs/tutorial/tutorial_ii.rst
As raised in the further work section of my thesis, the use of copula functions would offer an elegant solution to handling column relationships. An excerpt from that section (daffidwilde/thesis#101):
Copulas are functions that join multivariate distribution functions to their one-dimensional margins [235]. For EDO, this would mean P would contain a single element: a copula function fitted to the existing dataset. In this case, the technical aspects of an individual’s representation would need adjusting to accommodate this change. Likewise, the crossover and mutation processes would require some changes to account for the lack of distinct distribution families.
A Python implementation of copulas for data synthesis exists [12] and incorporating this as a dependency of the edo library would reduce the work required to implement this feature. Studying the impact of copulas in EDO would provide a valuable opportunity to demonstrate the capabilities of EDO as a fully fledged data synthesis method.
Measuring the fitness of an individual is too stochastic as it stands, and the fitness of an individual is not strictly representative of that individual. This is because the sampling process is random and we have no way of controlling that randomness. Some workarounds to explore involve the sampling parameters: num_samples, etc.
Given a long enough time, the noise in the GA becomes prevalent and the population fitnesses seem to spread. Jumps in fitness like this are controlled by the mutation process. In a similar vein to the sampling issues we had before with families of datasets, individual datasets would seem to be delicate objects, and the mutation process has the ability to be destructive throughout the algorithm.
Take the sample mean example. Each individual is a single-column dataset, X, of length between 5 and 50 instances, generated by sampling a set of numbers from a member of the family of normal distributions described below. Take some random sample of five elements from X and call it Y. The fitness function is given by mean(Y) ** 2 and the objective is to minimise this function.
Allow there to be 100 individuals in the population, and run the GA using the default parameters for 50 iterations:
import genetic_data as gd
import matplotlib.pyplot as plt

def sample_mean_squared(df):
    return df.sample(5, replace=True).mean().mean() ** 2

pop, fit, all_pops, all_fits = gd.run_algorithm(
    fitness=sample_mean_squared,
    size=100,
    row_limits=[5, 50],
    col_limits=[1, 1],
    max_iter=50,
    maximise=False,
    seed=0,
)

fig, ax = plt.subplots(1, figsize=(40, 20), dpi=300)
ax.boxplot(all_fits, positions=range(len(all_fits)), sym='.')
ax.set_title('Fitness scores in each epoch', size=24, pad=25)
ax.set_yscale('log')
ax.set_xlabel('Epoch', size=24)
ax.set_ylabel(r'$\log(f(X))$', size=24)

for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_fontsize(20)

plt.show()
This wouldn't actually stop the behaviour; a well-defined stopping condition would just terminate the algorithm before it happens.
As the algorithm progresses, it shows signs of convergence, so decreasing the likelihood of a mutation happening would stop the GA from altering too many "good" solutions.
The way in which mutation of a value occurs in this GA is to sample from the normal distribution centred at the current value, with standard deviation given by sigma (which defaults to 1). In hindsight, this means that for small values, relatively large changes will likely happen during mutation, having a substantial effect on the overall fitness of that individual. Moving away from this fixed-sigma approach towards something relative to the individual at hand could be beneficial.
Not that it affects the example here too much (since the GA is restricted to generating datasets with exactly one column), but new columns are added to a dataset in mutation by effectively initialising a column of the correct length and appending it. Columns are generated by sampling from a list of pdf objects, which sample their parameters from preset ranges on instantiation. For example, a normally distributed column pdf will take a mean in the interval [-10, 10] and a standard deviation in the range [0, 10]. This is a deliberately wide range for the initial population, but later on it could cause problems when a dataset is near-uniform, for instance. Allowing these ranges to diminish over time would soften the impact of mutation. Another option is simply to take a new column to be a copy of a column currently in the dataset. However, that change could have too small an effect on the fitness of the individual.
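The fixed-versus-relative sigma idea can be illustrated with a small sketch. The function names and the relative-scale rule (a fraction of the value's magnitude, with a small floor to avoid a zero scale) are assumptions for illustration, not edo's actual mutation code.

```python
import numpy as np


def mutate_value_fixed(value, sigma=1.0, rng=None):
    """The approach described above: perturb with a fixed standard
    deviation, regardless of the value's scale."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(loc=value, scale=sigma)


def mutate_value_relative(value, rel_sigma=0.1, floor=1e-8, rng=None):
    """Hypothetical alternative: scale the perturbation to the value
    itself, so small values receive proportionally small changes."""
    rng = rng or np.random.default_rng(0)
    scale = max(abs(value) * rel_sigma, floor)
    return rng.normal(loc=value, scale=scale)


# For a small value, the relative mutation stays much closer to it.
small = 0.01
fixed = mutate_value_fixed(small)
relative = mutate_value_relative(small)
```

With the same random draw, the fixed-sigma mutation moves the value by orders of magnitude more than the relative one, which is exactly the destructiveness described above.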
In run_algorithm, fitnesses are being calculated properly, but in all cases bar the last iteration, only best_prop of the fitnesses are being appended to the history of all fitnesses. Likewise with the populations.
As I'm writing this, I think I know why. The best proportion of individuals are selected and taken from a population using population.pop(best), as required. I hadn't accounted for the fact that popping mutates the list in place, so the already-appended population/scores are edited as well: the history holds references to the same list objects rather than copies.
Anyway, I will think of a workaround for this one... but here is an example of the kind of behaviour I'm experiencing.
Using the fitness-recording-issue branch and its added print statements:
>>> import numpy as np
>>> import genetic_data as gd
>>> def x_squared(df):
... return df.iloc[0, 0] ** 2
>>> pop, fit, all_pops, all_fits = gd.run_algorithm(
... fitness=x_squared,
... size=100,
... row_limits=[1, 1],
... col_limits=[1, 1],
... max_iter=5,
... best_prop=0.5,
... maximise=False,
... seed=0
... )
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best score in this iteration is 0.02471670450862218
the number of scores is 100
the best scores:
[28.57371438419073, 10.13474506798552, 2.0926246987460893, 0.273387549544826, 0.14578571748701574, 0.02471670450862218]
This certainly does not make any sense. Then, to double-check everything:
>>> for scores in all_fits:
... print('the best', np.min(scores))
... print(len(scores), '\n')
the best 28.57371438419073
50
the best 10.13474506798552
50
the best 2.0926246987460893
50
the best 0.273387549544826
50
the best 0.14578571748701574
50
the best 0.02471670450862218
100
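The behaviour above is consistent with Python's mutable-list aliasing: appending a list to a history appends a reference, not a copy, so later pops shrink the already-recorded entry. A minimal sketch of both the bug and one workaround (the variable names here are illustrative, not edo's internals):

```python
# Appending a reference: popping from `population` afterwards also
# changes the entry already stored in the history.
population = [28.5, 10.1, 2.0, 0.27, 0.14, 0.02]
history = []
history.append(population)        # stores a reference, not a copy
population.pop(0)                 # take the "best" individual out
aliased = history[0]              # the recorded entry changed too

# The fix: append a (shallow) copy, freezing the snapshot.
population = [28.5, 10.1, 2.0, 0.27, 0.14, 0.02]
history = []
history.append(list(population))  # snapshot the current population
population.pop(0)
snapshot = history[0]             # unaffected by the later pop
```

For populations of richer objects, copy.deepcopy may be needed instead of a shallow copy, at some cost in memory.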
Following the comments on this commit, a warning or error should be implemented to catch inputs that don't make any sense. For instance, setting column limits that require exactly one column class in any individual while never allowing that class to be selected, thanks to the probability distribution weights.
This isn't a priority, but it would add to the robustness of the library.
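A validation hook along these lines might look as follows. This is a hypothetical sketch (the function name and checks are assumptions, not edo's API): hard errors for impossible inputs, warnings for merely suspicious ones such as a zero-weight column class.

```python
import warnings


def validate_parameters(col_limits, weights, tol=1e-12):
    """Hypothetical validation: reject impossible limits and warn when
    the distribution weights contradict them, e.g. a column class that
    can never be selected."""
    if len(col_limits) != 2 or col_limits[0] > col_limits[1]:
        raise ValueError("col_limits must be an ordered pair.")
    for i, weight in enumerate(weights):
        if weight < tol:
            warnings.warn(
                f"Column class {i} has (near-)zero weight and will "
                "effectively never be selected."
            )


# A zero weight alongside tight column limits triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_parameters(col_limits=[1, 1], weights=(0.0, 1.0))
```

Raising early like this is cheap and would save users from silently nonsensical runs.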
One of the reasons edo is built the way it is is that many of its parts are user-definable. However, this isn't really the case a lot of the time, since certain function parameters only take a subset of the others; for example, those that affect the mutation process can't take the fitness or history into account.
Maybe having these parts be user-defined methods would give people more control and freedom over what their EAs do.
Let's implement all of the following:
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html#distributions
Could be worth:
Column pdfs take a seed to give a specific sample. When calculating the fitness of an individual, that is done on a particular dataset. It would be more interesting, and in line with the purpose of this project, to calculate fitness on the family of datasets that the individual represents.
Instead, iterate over several seeds and take the fitness of the individual from those separate fitness scores. How to do that well is another issue.
Given a set of fitnesses for several instances of an individual, how should the fitness of that individual be taken? Min? Max? Mean? Median? Allow get_fitness to take a function specifying how to do this, like np.mean, etc.
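The seed-iteration idea could be sketched like this. The function name get_family_fitness, the fitness signature taking a seed, and the toy fitness are all illustrative assumptions; only the aggregation-function idea comes from the issue above.

```python
import numpy as np


def get_family_fitness(individual, fitness, seeds=range(5), agg=np.mean):
    """Hypothetical extension of get_fitness: evaluate the fitness on
    several seeded samples of the family the individual represents,
    then aggregate with a user-supplied function (mean, median, ...)."""
    scores = [fitness(individual, seed=seed) for seed in seeds]
    return agg(scores)


# Toy fitness: pretend each seed yields a slightly different dataset.
def toy_fitness(individual, seed):
    rng = np.random.default_rng(seed)
    return individual + rng.normal(scale=0.1)


mean_score = get_family_fitness(2.0, toy_fitness, seeds=range(10))
median_score = get_family_fitness(2.0, toy_fitness, seeds=range(10), agg=np.median)
```

Note that np.min would reward lucky samples while np.max would be conservative; the mean or median is probably the safer default.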
Workaround in #144, but this error is raised with the latest Dask version (2020.12.0):
(edo-dev) edo(fix-failing-action): python -m pytest -v tests/test_optimiser.py::test_get_pop_history
======================================================== test session starts ========================================================
platform darwin -- Python 3.9.1, pytest-6.2.1, py-1.10.0, pluggy-0.12.0 -- /Users/henrywilde/opt/anaconda3/envs/edo-dev/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/henrywilde/src/edo/.hypothesis/examples')
rootdir: /Users/henrywilde/src/edo
plugins: hypothesis-5.20.3, cov-2.10.1
collected 1 item
tests/test_optimiser.py::test_get_pop_history FAILED [100%]
============================================================= FAILURES ==============================================================
_______________________________________________________ test_get_pop_history ________________________________________________________
> ???
tests/test_optimiser.py:509:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_optimiser.py:560: in test_get_pop_history
assert np.allclose(
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:279: in compute
(result,) = compute(self, traverse=False, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:561: in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/base.py:332: in collections_to_dsk
_opt = opt(dsk, keys, **kwargs)
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/array/optimization.py:48: in optimize
dsk = dsk.cull(set(keys))
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/highlevelgraph.py:627: in cull
all_ext_keys = self.get_all_external_keys()
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/highlevelgraph.py:518: in get_all_external_keys
self._all_external_keys.update(layer.get_output_keys())
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/blockwise.py:285: in get_output_keys
*[range(self.dims[i]) for i in self.output_indices]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.0 = <tuple_iterator object at 0x7ff03f7bb610>
> *[range(self.dims[i]) for i in self.output_indices]
)
}
E KeyError: 'j'
../../opt/anaconda3/envs/edo-dev/lib/python3.9/site-packages/dask/blockwise.py:285: KeyError
------------------------------------------------------------ Hypothesis -------------------------------------------------------------
Falsifying example: test_get_pop_history(
size=2,
row_limits=[1, 1],
col_limits=[1, 1],
distributions=[edo.distributions.discrete.Bernoulli,
edo.distributions.continuous.Uniform],
weights=(0.01, 0.99),
max_iter=1,
best_prop=0.5,
lucky_prop=0.0,
crossover_prob=0.0,
mutation_prob=0.0,
shrinkage=0.0,
maximise=False,
)
====================================================== short test summary info ======================================================
FAILED tests/test_optimiser.py::test_get_pop_history - KeyError: 'j'
========================================================= 1 failed in 1.23s =========================================================
Potential uses include:
Would this be done by specifying no. cores/processes or not? Have a look at Axelrod's implementation perhaps.
The slightly new implementation requires the images to be updated by running the docs/src/*.py files.
For the sake of readability, and to make adding to the current test suite easier, a refactor may be needed. It would be nice to make use of hypothesis.strategies.composite to create our own strategies in the tests, or perhaps pytest.fixture for session-long test data, etc.
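A composite strategy for the limit pairs used throughout the tests might look like this. The strategy name and bounds are assumptions for illustration; only the use of hypothesis.strategies.composite comes from the suggestion above.

```python
import warnings

from hypothesis import strategies as st


@st.composite
def limit_pairs(draw, low=1, high=10):
    """Hypothetical shared strategy: an ordered pair of limits, as used
    for row_limits and col_limits across the test suite."""
    lower = draw(st.integers(min_value=low, max_value=high))
    upper = draw(st.integers(min_value=lower, max_value=high))
    return [lower, upper]


# Such a strategy would then be reusable across tests, e.g.:
# @given(row_limits=limit_pairs(high=50), col_limits=limit_pairs(high=5))
# def test_create_individual(row_limits, col_limits): ...

# Draw one example interactively (suppressing the non-interactive-use
# warning that .example() emits outside of a shrinking run).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    example = limit_pairs().example()
```

Centralising strategies like this avoids repeating the same integers/tuples plumbing in every test.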
Not a huge issue, but it would be nice to preserve datatypes during crossover. That way, analysis of the data can be done more easily after running.
The issue arises when a NaN is introduced to a column with an integer datatype. For instance:
>>> import numpy as np
>>> from edo.individual import create_individual
>>> from edo.pdfs import Poisson
>>> np.random.seed(0)
>>> df, meta = create_individual(row_limits=[10, 10], col_limits=[1, 1], pdfs=[Poisson])
>>> print(df.dtypes)
0 int64
dtype: object
>>> print(df)
0
0 2
1 2
2 4
3 2
4 0
5 0
6 5
7 1
8 1
9 2
>>> df.iloc[3:5, 0] = np.nan
>>> print(df.dtypes)
0 float64
dtype: object
>>> print(df)
0
0 2.0
1 2.0
2 4.0
3 NaN
4 NaN
5 0.0
6 5.0
7 1.0
8 1.0
9 2.0
A variant on randomly sampling distribution parameters at the point of creating the initial population. With Latin hypercube sampling (LHS), the parameter space is covered fully, giving a wider initial range of individuals.
This initialisation method is the default in SciPy's differential evolution implementation, as well as in other optimisation methods.
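SciPy exposes LHS directly via scipy.stats.qmc. A sketch of drawing Normal-distribution parameters over the preset ranges mentioned earlier (mean in [-10, 10], standard deviation in [0, 10]); the choice of 100 draws and the parameter bounds mirror the examples above rather than any fixed edo configuration:

```python
from scipy.stats import qmc

# Latin hypercube sample in the unit square, one dimension per
# distribution parameter, then rescale to the parameter bounds.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_samples = sampler.random(n=100)
params = qmc.scale(unit_samples, l_bounds=[-10, 0], u_bounds=[10, 10])

means, stds = params[:, 0], params[:, 1]  # one (mean, std) pair per individual
```

Each row then parameterises one column pdf in the initial population, with the 100 draws stratified so no region of the parameter space is left empty.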
I am yet to properly understand how doctests work with pytest (they can be collected with the --doctest-modules flag). Will likely just call them separately.
Doing an example where datasets have exactly 1 continuous and 1 discrete column is proving a little difficult. Need to go over this with Vince.