morinim / vita

Vita - Genetic Programming Framework

License: Mozilla Public License 2.0

Emacs Lisp 0.14% C++ 93.54% Python 4.71% C 0.47% CMake 0.37% MQL5 0.77% Shell 0.01%

genetic-programming genetic-algorithms differential-evolution classification symbolic-regression machine-learning artificial-intelligence cpp software-agents evolutionary-algorithms

vita's Issues

Use std::filesystem::path and operator/

C++17 has the standard std::filesystem::path class to represent paths on a filesystem.

We should use this class to store paths.

A list of affected functions / structures:

environment.statistics.dir
merge_path in the utility file
dataframe::read_xrff(const std::string &), dataframe::read_csv(const std::string &), dataframe::read(const std::string &) (in their current form the signatures suggest functions that parse a document contained in a string)
src_problem::read, src_problem::setup_symbols, src_problem::src_problem
paths in examples/forex/trade_simulator.h
merge_path utility function is now obsolete
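
A minimal sketch of how operator/ could take over the job of merge_path, assuming C++17. The helper name `stats_file` and the file names in the usage note are hypothetical, not part of Vita's API:

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: builds the full path of a statistics file.
// operator/ inserts a directory separator only where one is needed, so no
// hand-written path merging is required.
inline fs::path stats_file(const fs::path &dir, const fs::path &name)
{
  assert(!name.empty());
  return dir / name;
}
```

With this sketch, `stats_file("results", "dynamic.xml")` and `stats_file("results/", "dynamic.xml")` both compare equal to `fs::path("results/dynamic.xml")`.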

Replace `dataframe::load_xrff(const std::string &)` with `dataframe::load_xrff(const std::filesystem::path &)`

The current signature is misleading: it seems to describe a function parsing an XML document contained in a string.

Get rid of the clumsy make_XXX functions

With C++17, we get class template argument deduction. It is based on template argument deduction for function templates and allows us to get rid of the need for clumsy make_XXX functions.
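
As an illustration only, here is a sketch comparing the two styles. The `interval` class and `make_interval` function are hypothetical and do not come from the Vita sources:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical class template, used only for illustration.
template<class T>
class interval
{
public:
  interval(T lo, T hi) : lo_(lo), hi_(hi) {}
  T width() const { return hi_ - lo_; }

private:
  T lo_, hi_;
};

// Pre-C++17 style: a helper function whose only purpose is type deduction.
template<class T>
interval<T> make_interval(T lo, T hi) { return interval<T>(lo, hi); }

inline double ctad_demo()
{
  auto old_style = make_interval(0.5, 2.0);  // interval<double>
  interval new_style(0.5, 2.0);              // C++17: T deduced as double

  // Both spellings produce exactly the same type.
  static_assert(std::is_same_v<decltype(old_style), decltype(new_style)>);
  assert(old_style.width() == new_style.width());
  return new_style.width();
}
```

Once every call site uses the second form, the make_XXX helper can simply be deleted.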

Replace vita::any with std::any

The typical implementation should be performance equivalent.

Check small objects optimization:

Implementations are encouraged to avoid dynamic allocations for small objects, but such an optimization may only be applied to types for which std::is_nothrow_move_constructible returns true.

(from http://en.cppreference.com/w/cpp/utility/any)
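
For reference, a small sketch of the std::any interface that a replacement would rely on. The `describe` function and the set of types it recognises are assumptions made for this example:

```cpp
#include <any>
#include <cassert>
#include <string>

// Returns a printable description of the value held by `a`.
// Only int and std::string are recognised here (an arbitrary choice).
inline std::string describe(const std::any &a)
{
  if (!a.has_value())
    return "empty";

  // The pointer overload of any_cast returns nullptr on a type mismatch
  // instead of throwing std::bad_any_cast.
  if (const int *i = std::any_cast<int>(&a))
    return "int " + std::to_string(*i);
  if (const std::string *s = std::any_cast<std::string>(&a))
    return "string " + *s;

  return "unknown";
}
```

Whether a given value is stored without a heap allocation is implementation specific, so the small objects optimization still has to be checked on each target standard library.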

Kahan summation algorithm for class distribution

Kahan summation algorithm significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).

This could be a good addition to the distribution class.
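
A minimal sketch of the algorithm, written as a standalone accumulator. The names `kahan_sum`, `sum_pair` and `sum_both` are hypothetical; this is not the interface of the existing distribution class:

```cpp
#include <cassert>

// Hypothetical compensated accumulator.
class kahan_sum
{
public:
  void add(double x)
  {
    const double y = x - c_;    // apply the correction carried so far
    const double t = sum_ + y;  // low-order bits of y may be lost here...
    c_ = (t - sum_) - y;        // ...and are recovered algebraically
    sum_ = t;
  }

  double value() const { return sum_; }

private:
  double sum_ = 0.0;
  double c_ = 0.0;  // running compensation
};

struct sum_pair { double naive; double compensated; };

// Adds `start` and then `n` copies of `x`, with and without compensation.
inline sum_pair sum_both(double start, double x, long n)
{
  assert(n >= 0);
  double naive = start;
  kahan_sum k;
  k.add(start);
  for (long i = 0; i < n; ++i)
  {
    naive += x;
    k.add(x);
  }
  return {naive, k.value()};
}
```

Adding 1e-16 a million times to 1.0 shows the difference: each increment is below half the spacing of doubles near 1.0, so the naive total never moves, while the compensated total ends up very close to 1.0000000001. Note that aggressive floating-point flags such as -ffast-math may optimize the compensation away.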

Usage of library file (libvita.a)

Hello,

I want to use this library in my own project, but I don't want to place all the source files in my project. Therefore, I compiled a static library (libvita.a) as described in the README. But when I tried to use it in my project, I got errors because the header files were missing (the namespaces are not declared).

So, is there a single header file that includes all the interfaces, or do I need to include all the .h files in my project together with the library file?
Thanks.

Valhalla

An individual that has been replaced may, depending on its fitness, be sent to Valhalla, from whence it may return to the population in some later generation.

A similar idea, when performing an N-run evolution, is to send to Valhalla the best individuals of the first N - m runs, reusing them in the last m runs (the Ragnarök? :-)

Use C++17 nested namespaces

C++17 simplifies nested namespace definitions:

```
namespace A::B::C {}
```

is equivalent to

```
namespace A { namespace B { namespace C {
} } }
```

Replace boost::program_options with docopt

docopt is clearer and simpler.

Moreover, it's a small library that can be embedded, removing another Boost dependency.

Windows Visual Studio + CMake support/guide

The wiki references a Visual Studio guide as TBA. However, it doesn't seem like Vita compiles at all when generating a solution from CMake (VS19 & C++14). If it does compile, could a guide be added? If not, could it be documented as non-functioning?

Remove UNUSED macro [C++17]

C++17 introduces the [[maybe_unused]] attribute, which serves the same purpose as our UNUSED macro (defined in compatibility_patch.h).
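
A short sketch of a typical case, a parameter that is only read inside an assertion. The function `non_negative` is hypothetical:

```cpp
#include <cassert>

// `upper` is used only by the assertion. In release builds (NDEBUG defined)
// the assertion disappears and the parameter would otherwise trigger an
// unused-parameter warning.
inline int non_negative(int x, [[maybe_unused]] int upper)
{
  assert(x <= upper);
  return x < 0 ? 0 : x;
}
```

The attribute can be applied in the same way to local variables, functions and types.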

Clean up code checking class invariants

Classes should specify their invariants: what is true before and after executing any public method.

The function used to check the consistency of an object (debug()) should be renamed to is_valid() (the most frequently used name).
Eliminate clutter and redundant checks.

A common pattern to implement invariants in classes is for the constructor of the class to throw an exception if the invariant is not satisfied. Since methods preserve the invariants, they can assume the validity of the invariant and need not explicitly check for it.
```
int A::method(B &b, int v)
{
  Expects(b.debug());  // REDUNDANT CHECK
  Expects(v > 0);      // REQUIRED

  // ...
  // embarrassing code
  // ...

  Ensures(b.debug());  // REDUNDANT CHECK
  Ensures(v > 0);      // REQUIRED

  return v;
}
```
The debug()/is_valid() method should exist only in the debug build, thus it does not add to code bloat.

Add [[nodiscard]] attribute where appropriate

One crucial use case is error codes. For other guidelines about where to apply it, see: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0600r0.pdf
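
A sketch of the error code case. The names `load_status`, `load_dataset` and `is_ok` are hypothetical and only illustrate where the attribute can go:

```cpp
#include <cassert>
#include <string>

// Marking the type means every function returning it is checked: silently
// discarding a load_status produces a compiler warning.
enum class [[nodiscard]] load_status { ok, empty_name };

// Hypothetical loader that only validates its argument.
inline load_status load_dataset(const std::string &name)
{
  return name.empty() ? load_status::empty_name : load_status::ok;
}

// The attribute can also be placed on a single function.
[[nodiscard]] inline bool is_ok(load_status s)
{
  return s == load_status::ok;
}
```

With these declarations a statement such as `load_dataset("iris.csv");` on its own line is diagnosed, while `if (is_ok(load_dataset("iris.csv")))` compiles cleanly.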

Clone scaling

Before evaluating an individual, we could check if identical individuals (clones) are already present in the population.

When the number of clones (n) is greater than zero, the actual fitness assigned to the individual is multiplied by S (the parameter is called the clone scaling factor).

While a continuous range of values is possible, in many programs S is set either to 1 (no clone scaling) or to 0 (clone extermination).

(from "Evolving Assembly Programs: How Games Help Microprocessor Validation". Corno, Sanchez, Squillero)

Because of the hash table based fitness scoring, in Vita we cannot assign different fitness values to syntactically equivalent individuals.

Anyway, the hash table can be augmented with information used to calculate an approximation of n, and the evaluator_proxy can be modified to use this information.
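
The scaling rule itself is tiny. Here is a sketch under two assumptions that are not stated in the quoted paper excerpt: S lies in [0, 1], and fitness is non-negative with larger values being better. The function name is hypothetical:

```cpp
#include <cassert>

// Hypothetical helper.
// `n` is the number of clones already present in the population,
// `s` is the clone scaling factor S (assumed to be in [0, 1]).
inline double clone_scaled_fitness(double fitness, unsigned n, double s)
{
  assert(0.0 <= s && s <= 1.0);
  return n > 0 ? fitness * s : fitness;
}
```

With a different fitness convention (for example negative values where values closer to zero are better) multiplying by S would reward clones instead of penalizing them, so the rule would need adapting.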

Use `std::byte` instead of `char`

std::byte is a distinct type implementing the concept of byte. Unlike char, it is neither an integer nor a character type, so it is less open to programmer errors such as accidental arithmetic.
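
A small sketch of what working with std::byte looks like. The functions `pack_nibbles` and `high_nibble` are hypothetical examples:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packs two 4-bit values into one byte. Only bitwise operators are defined
// for std::byte.
inline std::byte pack_nibbles(std::uint8_t hi, std::uint8_t lo)
{
  assert(hi < 16 && lo < 16);
  return (std::byte{hi} << 4) | std::byte{lo};
}

inline unsigned high_nibble(std::byte b)
{
  // Arithmetic needs an explicit conversion: `b + 1` would not compile.
  return std::to_integer<unsigned>(b >> 4);
}
```

The explicit std::to_integer call is the point: every place where raw memory is treated as a number becomes visible in the code.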

Relicensing the project with EUPL

The EUPL is the European Union Public Licence, published by the European Commission. It has been studied and drafted since 2005 and was launched in January 2007. The current version is the multilingual EUPL v1.1 (January 2009).

The EUPL is also a "share alike" (or copyleft) licence, resulting from the aim to avoid exclusive appropriation of the covered software. The EUPL is share alike on both source and object code.

Corporate Contributor License Agreement

Add something like the Apache Software Foundation Corporate Contributor License Agreement.

Making Predictions

I ran Vita on my dataset and the formula it output was:
```
[000] FADD 5.0 [033]
[033] FSUB [051] [042]
[042] FADD X78 X122
[051] FSUB [052] X78
[052] FMUL X122 X37
```

What is the best way for me to get predictions from that? Do you have a prediction function, or a function to convert that output to a standard formula format? I do understand that when it says
[000] FADD 5.0 [033]
it is saying to add 5 to formula #33, and formula #33 is shown on the next line of the results.
But I would need to write my own script to convert your formula format to a formula I can use to make actual predictions, wouldn't I?
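
Following the reading given in the question, the listing above can be translated by hand, one line per local variable. This is only a sketch: it assumes that `FSUB a b` means a - b and that X37, X78 and X122 are input features of the dataset. The function name `predict` is hypothetical and not a Vita API:

```cpp
#include <cassert>

// Hand translation of the five-line listing. Lines are evaluated bottom-up
// so that every referenced line is computed before it is used.
inline double predict(double x37, double x78, double x122)
{
  const double l052 = x122 * x37;   // [052] FMUL X122 X37
  const double l051 = l052 - x78;   // [051] FSUB [052] X78
  const double l042 = x78 + x122;   // [042] FADD X78 X122
  const double l033 = l051 - l042;  // [033] FSUB [051] [042]
  return 5.0 + l033;                // [000] FADD 5.0 [033]
}
```

Under those assumptions the whole program reduces to 5 + X122 * X37 - 2 * X78 - X122. For X37 = 3, X78 = 1, X122 = 2 it gives 7.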

Use std::sample in the holdout_validation::init procedure

std::sample is computationally simpler and more direct than the current approach (std::shuffle + std::move).
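
A sketch of the std::sample call, assuming C++17. The function `draw_subset` is hypothetical and is not the actual holdout_validation::init code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Picks `k` elements of `data` uniformly at random, without replacement.
// With forward iterators std::sample keeps the selected elements in their
// original relative order.
template<class T>
std::vector<T> draw_subset(const std::vector<T> &data, std::size_t k,
                           std::mt19937 &rng)
{
  assert(k <= data.size());
  std::vector<T> out;
  out.reserve(k);
  std::sample(data.begin(), data.end(), std::back_inserter(out),
              static_cast<std::ptrdiff_t>(k), rng);
  return out;
}
```

One difference from the shuffle based approach is worth keeping in mind: std::sample copies the selected elements and leaves the source range untouched, so removing them from the training set is a separate step.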

Use helper variable templates where possible

E.g. is_floating_point_v<T> instead of is_floating_point<T>::value

Simple method for serialization

Probably the best way is via boost::serialization (but we have to give more careful consideration to Google Protocol Buffers).

CMake support for link time optimization (aka IPO)

CMake v3.9 finally supports LTO.

Here's an example showing how it works:

```
cmake_minimum_required(VERSION 3.9.4)

include(CheckIPOSupported)
check_ipo_supported(RESULT supported OUTPUT error)

add_executable(example Example.cpp)

if (supported)
  message(STATUS "IPO / LTO enabled")
  set_property(TARGET example PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
else ()
  message(STATUS "IPO / LTO not supported: <${error}>")
endif()
```

Remove angle brackets for class with default template types

See https://stackoverflow.com/q/16014736/3235496

Stratified sampling

Once you split up the data into train, validation and test set, chances are close to 100% that your already skewed data becomes even more unbalanced for at least one of the three resulting sets.

Think about it: let’s say your data set contains 1000 records and of those 20 are labelled as “fraud”. As soon as you split up the data set into train, validation and test set, it is possible that the train set contains the majority of the “fraud”-records or maybe even all of them. Although not as likely, the same could happen for the validation or test set, which is even worse because then the machine learning algorithm has no chance at all to learn the hidden patterns of “fraud”-records.

This is why you shouldn’t use random sampling when constructing the three different data sets, but stratified sampling instead. It ensures that the train, validation and test sets keep the class proportions of the original data. That way the existing problem of skewed classes is not made worse, which is what you want when creating high quality models.
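
A minimal sketch of a stratified split, not taken from the Vita sources. The `split` struct, the function name and the percentage parameters are assumptions of this example. Each class is shuffled and divided separately, and whatever is left after the train and validation shares goes to the test set:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Indices of the examples assigned to each set.
struct split
{
  std::vector<std::size_t> train, validation, test;
};

// `labels[i]` is the class of example i. The shares are integer percentages
// so that the per-class counts are computed exactly.
inline split stratified_split(const std::vector<int> &labels,
                              unsigned train_pct, unsigned validation_pct,
                              std::mt19937 &rng)
{
  assert(train_pct + validation_pct <= 100);

  // Group example indices by class label.
  std::map<int, std::vector<std::size_t>> by_class;
  for (std::size_t i = 0; i < labels.size(); ++i)
    by_class[labels[i]].push_back(i);

  split s;
  for (auto &entry : by_class)
  {
    auto &idx = entry.second;
    std::shuffle(idx.begin(), idx.end(), rng);

    const std::size_t n_train = idx.size() * train_pct / 100;
    const std::size_t n_val = idx.size() * validation_pct / 100;

    // The first n_train shuffled indices go to train, the next n_val to
    // validation, the rest to test.
    for (std::size_t j = 0; j < idx.size(); ++j)
    {
      auto &dest = j < n_train ? s.train
                   : j < n_train + n_val ? s.validation : s.test;
      dest.push_back(idx[j]);
    }
  }
  return s;
}
```

On the example from the text (1000 records, 20 of them labelled "fraud") a 60 / 20 / 20 split puts exactly 12, 4 and 4 fraud records in the three sets, whatever the random seed.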

Undefined behavior invoking `sum_of_errors_impl` with step argument greater than `1`

The behavior of std::advance is undefined if the specified sequence of increments would require that a non-incrementable iterator (such as the past-the-end iterator) is incremented.

This may happen when n > 1.

Also, when n is greater than 1, the arithmetic mean is wrong (it is computed over all the elements of the dataset instead of the correct subset).
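
One possible way to address both points is sketched below. `strided_mean` is a hypothetical stand-in, not the real `sum_of_errors_impl`: the step is clamped so the iterator never moves past the end, and the mean is divided by the number of elements actually visited:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Mean of every `step`-th element of [first, last), starting from `first`.
template<class It>
double strided_mean(It first, It last,
                    typename std::iterator_traits<It>::difference_type step)
{
  assert(step >= 1);

  double sum = 0.0;
  std::size_t count = 0;
  while (first != last)
  {
    sum += *first;
    ++count;
    // Never advance past `last`: clamp the step to the distance left.
    std::advance(first, std::min(step, std::distance(first, last)));
  }

  // Divide by the number of visited elements, not by the dataset size.
  return count ? sum / static_cast<double>(count) : 0.0;
}
```

For random access iterators std::distance is constant time. For weaker iterator categories it is linear, so a manual loop of single increments that stops at `last` may be preferable there.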

Parallelization

What is the best way to run Vita multicore?

Experiment with xorshift64 / xorshift1024 custom PRNG

xorshift64* / xorshift1024* (http://vigna.di.unimi.it/ftp/papers/xorshift.pdf) are very fast PRNGs with interesting properties and could replace Mersenne Twister.
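
For experimenting, xorshift64* fits in a few lines. The sketch below uses the shift triple (12, 25, 27) and the multiplier from the paper. The class name and the choice to model the standard UniformRandomBitGenerator interface (so that it can be passed to std::shuffle or the standard distributions) are assumptions of this example:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

class xorshift64star
{
public:
  using result_type = std::uint64_t;

  explicit xorshift64star(result_type seed) : state_(seed)
  {
    assert(seed != 0);  // the all-zero state is a fixed point
  }

  // The state is never zero and the multiplier is odd, so the output is
  // never zero either.
  static constexpr result_type min() { return 1; }
  static constexpr result_type max()
  {
    return std::numeric_limits<result_type>::max();
  }

  result_type operator()()
  {
    state_ ^= state_ >> 12;
    state_ ^= state_ << 25;
    state_ ^= state_ >> 27;
    return state_ * UINT64_C(2685821657736338717);
  }

private:
  result_type state_;
};
```

The generator keeps only 64 bits of state, compared with the roughly 2.5 KB of std::mt19937, which is part of why it is attractive when many independent generators are needed.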

Matthews correlation coefficient

Accuracy is not useful when the two classes are of very different sizes (for example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class but a 0% recognition rate for the dog class).

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.

We can use the averaged table of confusion for multiclass problems (see http://en.wikipedia.org/wiki/Confusion_matrix).
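
For the binary case the coefficient can be computed directly from the four cells of the confusion table. A sketch follows (the function name is hypothetical). Returning 0 when the denominator vanishes is a common convention, adopted here as an assumption:

```cpp
#include <cassert>
#include <cmath>

// MCC = (TP*TN - FP*FN) / sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
inline double mcc(double tp, double tn, double fp, double fn)
{
  assert(tp >= 0 && tn >= 0 && fp >= 0 && fn >= 0);

  const double den =
    std::sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));

  // The denominator is zero when a whole row or column of the table is
  // empty, e.g. when every sample is assigned to the same class.
  return den == 0.0 ? 0.0 : (tp * tn - fp * fn) / den;
}
```

In the cats and dogs example above, with "cat" as the positive class and everything classified as a cat, TP = 95, FP = 5, TN = FN = 0. Accuracy is 95% but the coefficient is 0, that is no better than random prediction.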

Replacement for protected division

I was wondering if anyone has tried Ji Ni's replacement for x/y with x/sqrt(1+y*y)?

This seems like a neat idea and I am interested to learn if it works well for others too.

Bill Langdon

[From Yahoo Groups - Genetic Programming]
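
The proposed operator is a one-liner. A sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <cmath>

// Replacement for x / y that is defined and smooth for every y,
// including y == 0, because the denominator is always >= 1.
inline double smooth_div(double x, double y)
{
  return x / std::sqrt(1.0 + y * y);
}
```

Two properties follow directly from the formula: for y = 0 the result is x, and the sign of y is discarded, so for large |y| the result approaches x / |y| rather than x / y.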

Use the standard [[fallthrough]] attribute

gcc 7 has added a default fallthrough warning (-Wimplicit-fallthrough enabled by -Wextra).

To suppress this warning C++17 provides a standard way: the [[fallthrough]]; attribute.

In C++11 / C++14, with gcc, it's also possible to add a //-fallthrough comment to silence the warning. This is the current approach but the C++17 attribute specifier sequence is a better option.
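
A sketch of the attribute in use. The function `privileges` is a made-up example in which each level deliberately inherits everything from the levels below it:

```cpp
#include <cassert>

inline int privileges(int level)
{
  int n = 0;
  switch (level)
  {
  case 2:
    ++n;
    [[fallthrough]];  // intentional: level 2 includes level 1
  case 1:
    ++n;
    [[fallthrough]];  // intentional: level 1 includes level 0
  case 0:
    ++n;
    break;
  default:
    break;
  }
  return n;
}
```

Unlike a comment, the attribute is checked by the compiler: it must be the last statement before the next case label, otherwise the program is ill-formed.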

Better auto tuning of age_gap / number of layers

What is the optimal age_gap / layers / individuals / aging scheme combination for the ALPS algorithm?

We should probably make it a function of dataset size / generation...

CMake generate pkg-config .pc

CMake can generate pkg-config .pc files for packages. The .pc file serves as a Rosetta stone for many build systems.

A good basic reference for pkg-config .pc syntax is helpful.

Relevant code:

```
# this fragment generates build/my_package.pc
#
# cmake -B build -DCMAKE_INSTALL_PREFIX=~/mylib

# HOMEPAGE_URL in project() requires CMake 3.12 or later
cmake_minimum_required(VERSION 3.12)

project(mylib
  LANGUAGES C
  HOMEPAGE_URL https://github.invalid/username/mylib
  DESCRIPTION "example library"
  VERSION 1.1.2)

add_library(mine mine.c)
add_library(support support.c)

# names substituted into the Libs: lines of the template
set(target1 mine)
set(target2 support)

# left empty in this example
set(pc_libs_private)
set(pc_req_private)
set(pc_req_public)

configure_file(my_package.pc.in my_package.pc @ONLY)
```

and

```
# this template is filled in by CMake `configure_file(... @ONLY)`
# the `@....@` are filled in by CMake configure_file(),
# from variables set in your CMakeLists.txt or by CMake itself
#
# Good tutorial for understanding .pc files:
# https://people.freedesktop.org/~dbn/pkg-config-guide.html

prefix="@CMAKE_INSTALL_PREFIX@"
exec_prefix="${prefix}"
libdir="${prefix}/lib"
includedir="${prefix}/include"

Name: @PROJECT_NAME@
Description: @CMAKE_PROJECT_DESCRIPTION@
URL: @CMAKE_PROJECT_HOMEPAGE_URL@
Version: @PROJECT_VERSION@
Requires: @pc_req_public@
Requires.private: @pc_req_private@
Cflags: -I"${includedir}"
Libs: -L"${libdir}" -l@target1@ -l@target2@
Libs.private: -L"${libdir}" -l@target1@ -l@target2@ @pc_libs_private@
```

Possible problems: How to generate a .pc (pkg-config) file supporting --prefix of cmake --install?