
milescranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia

Home Page: https://astroautomata.com/SymbolicRegression.jl/dev/

License: Apache License 2.0

Topics: symbolic-regression, julia, symbolic-computation, machine-learning, automl, sciml, interpretable-ml, data-science, distributed-systems, evolutionary-algorithms

symbolicregression.jl's Introduction

SymbolicRegression.jl searches for symbolic expressions which optimize a particular objective.

(demo video: sr_animation.mp4)
Check out PySR for a Python frontend.

Contributors:

Mark Kittisopikul, T Coxon, Dhananjay Ashok, Johan Blåbäck, JuliusMartensen, ngam, Kaze Wong, Christopher Rackauckas, Patrick Kidger, Okon Samuel, William Booth-Clibborn, Pablo Lemos, Jerry Ling, Charles Fox, Johann Brehmer, Marius Millea, Coba, Pietro Monticone, Mateusz Kubica, Jay Wadekar, Anthony Blaom (PhD), Jgmedina95, Michael Abbott, Oscar Smith, Eric Hanson, Henrique Becker, qwertyjl, Rik Huijzer, Hongyu Wang, and Saurav Maheshkar.

Quickstart

Install in Julia with:

using Pkg
Pkg.add("SymbolicRegression")

MLJ Interface

The easiest way to use SymbolicRegression.jl is with MLJ. Let's see an example:

import SymbolicRegression: SRRegressor
import MLJ: machine, fit!, predict, report

# Dataset with two named features:
X = (a = rand(500), b = rand(500))

# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2

# with some noise:
y = y .+ randn(500) .* 1e-3

model = SRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
)

Now, let's create and train this model on our data:

mach = machine(model, X, y)

fit!(mach)

You will notice that expressions are printed using the column names of our table. If, instead of a table-like object, a simple array is passed (e.g., X=randn(100, 2)), x1, ..., xn will be used for variable names.
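For instance, a minimal sketch of the array case (the second machine here is purely illustrative):

# Fitting on a plain matrix; expressions will then print with x1 and x2:
X_matrix = randn(100, 2)   # MLJ convention: rows are samples
y_matrix = 2 .* X_matrix[:, 1] .- X_matrix[:, 2]
mach_matrix = machine(model, X_matrix, y_matrix)
fit!(mach_matrix)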

Let's look at the expressions discovered:

report(mach)

Finally, we can make predictions with the expressions on new data:

predict(mach, X)

This will make predictions using the expression selected by model.selection_method, which by default is a mix of accuracy and complexity.

You can override this selection and select an equation from the Pareto front manually with:

predict(mach, (data=X, idx=2))

where here we choose to evaluate the second equation.

For fitting multiple outputs, one can use MultitargetSRRegressor (and pass an array of indices to idx in predict for selecting specific equations). For a full list of options available to each regressor, see the API page.
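For example, a hedged sketch of the multi-target case (the two named targets and keyword values here are illustrative, mirroring the single-target example above):

import SymbolicRegression: MultitargetSRRegressor

# Two named targets built from the same features:
Y = (y1 = 2 .* X.a .+ X.b, y2 = X.a .- X.b .^ 2)

multi_model = MultitargetSRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
)
multi_mach = machine(multi_model, X, Y)
fit!(multi_mach)

# One index per output (an assumption based on the text above):
predict(multi_mach, (data=X, idx=[2, 3]))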

Low-Level Interface

The heart of SymbolicRegression.jl is the equation_search function. This takes a 2D array and attempts to model a 1D array using analytic functional forms. Note: unlike the MLJ interface, this assumes column-major input of shape [features, rows].

import SymbolicRegression: Options, equation_search

X = randn(2, 100)
y = 2 * cos.(X[2, :]) + X[1, :] .^ 2 .- 2

options = Options(
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=20
)

hall_of_fame = equation_search(
    X, y, niterations=40, options=options,
    parallelism=:multithreading
)
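If you prefer process-based parallelism, here is a sketch of the equivalent call (the :multiprocessing value and numprocs keyword are assumptions, based on usage elsewhere in this document):

hall_of_fame = equation_search(
    X, y, niterations=40, options=options,
    parallelism=:multiprocessing, numprocs=4
)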

You can view the resultant equations in the dominating Pareto front (best expression seen at each complexity) with:

import SymbolicRegression: calculate_pareto_frontier

dominating = calculate_pareto_frontier(hall_of_fame)

This is a vector of PopMember, each of which contains an expression along with its score. We can extract the expressions with:

trees = [member.tree for member in dominating]

Each of these equations is a Node{T} type for some constant type T (like Float32).

You can evaluate a given tree with:

import SymbolicRegression: eval_tree_array

tree = trees[end]
output, did_succeed = eval_tree_array(tree, X, options)

The output array will contain the result of the tree at each of the 100 rows. The did_succeed flag indicates whether the evaluation completed successfully, or whether it encountered any NaNs or Infs during calculation (from, e.g., sqrt(-1)).
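A minimal usage sketch of that flag, continuing from the call above:

if did_succeed
    println("First few predictions: ", output[1:3])
else
    println("Evaluation hit a NaN/Inf; this expression should be discarded")
end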

Constructing expressions

Expressions are represented as the Node type which is developed in the DynamicExpressions.jl package.

You can manipulate and construct expressions directly. For example:

import SymbolicRegression: Options, Node, eval_tree_array

options = Options(;
    binary_operators=[+, -, *, ^, /], unary_operators=[cos, exp, sin]
)
x1, x2, x3 = [Node(; feature=i) for i=1:3]
tree = cos(x1 - 3.2 * x2) - x1^3.2

This tree has Float64 constants, so the type of the entire tree will be promoted to Node{Float64}.

We can convert all constants (recursively) to Float32:

float32_tree = convert(Node{Float32}, tree)

We can then evaluate this tree on a dataset:

X = rand(Float32, 3, 100)
output, did_succeed = eval_tree_array(float32_tree, X, options)

Exporting to SymbolicUtils.jl

We can view the equations in the dominating Pareto frontier with:

dominating = calculate_pareto_frontier(hall_of_fame)

We can convert the best equation to SymbolicUtils.jl with the following function:

import SymbolicRegression: node_to_symbolic
import SymbolicUtils: simplify

eqn = node_to_symbolic(dominating[end].tree, options)
println(simplify(eqn*5 + 3))

We can also print out the full Pareto frontier like so:

import SymbolicRegression: compute_complexity, string_tree

println("Complexity\tMSE\tEquation")

for member in dominating
    complexity = compute_complexity(member, options)
    loss = member.loss
    string = string_tree(member.tree, options)

    println("$(complexity)\t$(loss)\t$(string)")
end

Code structure

SymbolicRegression.jl is organized roughly as follows. Rounded rectangles indicate objects, and rectangles indicate functions.

(if you can't see this diagram being rendered, try pasting it into mermaid-js.github.io/mermaid-live-editor)

flowchart TB
    op([Options])
    d([Dataset])
    op --> ES
    d --> ES
    subgraph ES[equation_search]
        direction TB
        IP[sr_spawner]
        IP --> p1
        IP --> p2
        subgraph p1[Thread 1]
            direction LR
            pop1([Population])
            pop1 --> src[s_r_cycle]
            src --> opt[optimize_and_simplify_population]
            opt --> pop1
        end
        subgraph p2[Thread 2]
            direction LR
            pop2([Population])
            pop2 --> src2[s_r_cycle]
            src2 --> opt2[optimize_and_simplify_population]
            opt2 --> pop2
        end
        pop1 --> hof
        pop2 --> hof
        hof([HallOfFame])
        hof --> migration
        pop1 <-.-> migration
        pop2 <-.-> migration
        migration[migrate!]
    end
    ES --> output([HallOfFame])

The HallOfFame objects store the expressions with the lowest loss seen at each complexity.

The dependency structure of the code itself is as follows:

stateDiagram-v2
    AdaptiveParsimony --> Mutate
    AdaptiveParsimony --> Population
    AdaptiveParsimony --> RegularizedEvolution
    AdaptiveParsimony --> SingleIteration
    AdaptiveParsimony --> SymbolicRegression
    CheckConstraints --> Mutate
    CheckConstraints --> SymbolicRegression
    Complexity --> CheckConstraints
    Complexity --> HallOfFame
    Complexity --> LossFunctions
    Complexity --> Mutate
    Complexity --> Population
    Complexity --> SearchUtils
    Complexity --> SingleIteration
    Complexity --> SymbolicRegression
    ConstantOptimization --> Mutate
    ConstantOptimization --> SingleIteration
    Core --> AdaptiveParsimony
    Core --> CheckConstraints
    Core --> Complexity
    Core --> ConstantOptimization
    Core --> HallOfFame
    Core --> InterfaceDynamicExpressions
    Core --> LossFunctions
    Core --> Migration
    Core --> Mutate
    Core --> MutationFunctions
    Core --> PopMember
    Core --> Population
    Core --> Recorder
    Core --> RegularizedEvolution
    Core --> SearchUtils
    Core --> SingleIteration
    Core --> SymbolicRegression
    Dataset --> Core
    HallOfFame --> SearchUtils
    HallOfFame --> SingleIteration
    HallOfFame --> SymbolicRegression
    InterfaceDynamicExpressions --> LossFunctions
    InterfaceDynamicExpressions --> SymbolicRegression
    LossFunctions --> ConstantOptimization
    LossFunctions --> HallOfFame
    LossFunctions --> Mutate
    LossFunctions --> PopMember
    LossFunctions --> Population
    LossFunctions --> SymbolicRegression
    Migration --> SymbolicRegression
    Mutate --> RegularizedEvolution
    MutationFunctions --> Mutate
    MutationFunctions --> Population
    MutationFunctions --> SymbolicRegression
    Operators --> Core
    Operators --> Options
    Options --> Core
    OptionsStruct --> Core
    OptionsStruct --> Options
    PopMember --> ConstantOptimization
    PopMember --> HallOfFame
    PopMember --> Migration
    PopMember --> Mutate
    PopMember --> Population
    PopMember --> RegularizedEvolution
    PopMember --> SingleIteration
    PopMember --> SymbolicRegression
    Population --> Migration
    Population --> RegularizedEvolution
    Population --> SearchUtils
    Population --> SingleIteration
    Population --> SymbolicRegression
    ProgramConstants --> Core
    ProgramConstants --> Dataset
    ProgressBars --> SearchUtils
    ProgressBars --> SymbolicRegression
    Recorder --> Mutate
    Recorder --> RegularizedEvolution
    Recorder --> SingleIteration
    Recorder --> SymbolicRegression
    RegularizedEvolution --> SingleIteration
    SearchUtils --> SymbolicRegression
    SingleIteration --> SymbolicRegression
    Utils --> CheckConstraints
    Utils --> ConstantOptimization
    Utils --> Options
    Utils --> PopMember
    Utils --> SingleIteration
    Utils --> SymbolicRegression

Bash command to generate dependency structure from src directory (requires vim-stream):

echo 'stateDiagram-v2'
IFS=$'\n'
for f in *.jl; do
    for line in $(cat $f | grep -e 'import \.\.' -e 'import \.'); do
        echo $(echo $line | vims -s 'dwf:d$' -t '%s/^\.*//g' '%s/Module//g') $(basename "$f" .jl);
    done;
done | vims -l 'f a--> ' | sort

Search options

See https://astroautomata.com/SymbolicRegression.jl/stable/api/#Options


symbolicregression.jl's Issues

SymbolicRegression compatibility issues

SymbolicRegression lists a compatibility requirement for SymbolicUtils of 0.6. This makes it incompatible with other packages that require newer versions of SymbolicUtils, e.g. Symbolics, which seems like a useful package to use in conjunction with SymbolicRegression but causes installation problems, as in the attempt below:

(@v1.6) pkg> add https://github.com/MilesCranmer/SymbolicRegression.jl
     Cloning git-repo `https://github.com/MilesCranmer/SymbolicRegression.jl`
    Updating git-repo `https://github.com/MilesCranmer/SymbolicRegression.jl`
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package SymbolicUtils [d1185830]:
 SymbolicUtils [d1185830] log:
 ├─possible versions are: 0.1.0-0.12.0 or uninstalled
 ├─restricted to versions * by ElectroBiofilm [ee21f0d5], leaving only versions 0.1.0-0.12.0
 │ └─ElectroBiofilm [ee21f0d5] log:
 │   ├─possible versions are: 0.1.0 or uninstalled
 │   └─ElectroBiofilm [ee21f0d5] is fixed to version 0.1.0
 ├─restricted to versions 0.6 by SymbolicRegression [8254be44], leaving only versions 0.6.0-0.6.3
 │ └─SymbolicRegression [8254be44] log:
 │   ├─possible versions are: 0.6.9 or uninstalled
 │   └─SymbolicRegression [8254be44] is fixed to version 0.6.9
 └─restricted by compatibility requirements with Symbolics [0c5d862f] to versions: 0.8.4-0.12.0 — no versions left
   └─Symbolics [0c5d862f] log:
     ├─possible versions are: 0.1.0-1.1.0 or uninstalled
     └─restricted to versions * by ElectroBiofilm [ee21f0d5], leaving only versions 0.1.0-1.1.0
       └─ElectroBiofilm [ee21f0d5] log: see above

Is there any way to relax the dependency on the old version of SymbolicUtils? Is it a matter of some superficial changes to accommodate changes in the SymbolicUtils API, or does it really truly depend on 0.6.x?

Unsatisfiable requirements detected for package SymbolicUtils

Sorry if this issue isn't related to this repository. When I update the ModelingToolkit and SymbolicRegression packages to their latest versions by pointing to their repositories, it causes the following error related to SymbolicUtils. Is there any way to solve it? I think it's related to the latest update of SymbolicUtils.

ERROR: Unsatisfiable requirements detected for package SymbolicUtils [d1185830]:
SymbolicUtils [d1185830] log:
├─possible versions are: 0.1.0-0.18.0 or uninstalled
├─restricted to versions 0.13 by SymbolicRegression [8254be44], leaving only versions 0.13.0-0.13.5
│ └─SymbolicRegression [8254be44] log:
│   ├─possible versions are: 0.6.14 or uninstalled
│   └─SymbolicRegression [8254be44] is fixed to version 0.6.14
└─restricted to versions 0.18 by ModelingToolkit [961ee093] — no versions left
  └─ModelingToolkit [961ee093] log:
    ├─possible versions are: 7.0.0 or uninstalled
    └─ModelingToolkit [961ee093] is fixed to version 7.0.0

Issues with simplification backend

cc @AlCap23

I've noticed many times during training that equations are overly complex, so I think simplification isn't working for some reason. I narrowed it down to this:

catch e
return init_node
end

I think this is too general and is hiding the current issues in the simplifier. When I remove it, I see that many times, simplification is skipped for equations that should normally be easy to simplify.

@AlCap23 do you think you could help me fix the convert functions to capture the current API of SymbolicUtils? I think it will greatly improve the results of EquationSearch. Thanks!

Gradient/derivative implementation in loss functions

Hi! I'm trying to implement a loss function that takes into account the derivative or gradient of the expressions as a kind of regularization term.
Right now it is not possible to implement this, as the custom loss only takes the label and the predicted value.
After understanding how the evaluation of the equation works, and seeing the evalDiffTreeArray function that explicitly calculates the derivative, I'm wondering if it can be done as follows.

I'm thinking of including a new Loss as follows, which includes the data:

function Loss(cX::AbstractMatrix{T}, x::AbstractArray{T}, y::AbstractArray{T}, options::Options{A,B,dA,dB,C}, index::Int)::T where {T<:Real,A,B,dA,dB,C<:Function}
    sum(options.loss.(cX, x, y)) / length(y)
end

and the new loss could then be expressed as, for example: loss(cX, x, y) = RMS + evalDiffTreeArray(y, cX, options, i).

I'm still struggling with how to include the options in the function, as they are needed to run the evaluation but can't be passed as-is into the custom loss function.

There could be a different approach, and I'm open to ideas too.

PS: this is different from the other issue about gradients in the loss function, as I'm not intending to compare against a known gradient, which would require changing the data structure.
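For concreteness, one purely illustrative shape for such a regularized loss (a hypothetical helper, not the package's API; the derivative array dpred is assumed to come from something like evalDiffTreeArray as discussed above):

function regularized_loss(pred::AbstractVector{T}, y::AbstractVector{T},
                          dpred::AbstractVector{T}; lambda=T(0.1)) where {T<:Real}
    mse = sum(abs2, pred .- y) / length(y)    # standard data-fit term
    reg = sum(abs2, dpred) / length(dpred)    # derivative-based regularization
    return mse + lambda * reg
end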

[CLEANUP] Default settings

The default settings as used in PySR should also be used in SymbolicRegression, so that pure-Julia users can access the same defaults as the Python users.

Also see MilesCranmer/PySR#115 for discussion about optimal settings.

[Feature] Increase Abstractions

I would like SymbolicRegression.jl to have a mode that is much more abstract in terms of what operators are allowed: e.g., vector/matrix operations, or even symbolic operations. This would enable SymbolicRegression.jl to work with a much broader set of problems.

Perhaps there is a way to do this that would not even incur a performance hit, and Julia would simply specialize the evaluation code to the particular type of operator.

Potentially of interest to you, @shashi @ChrisRackauckas.

Can't define options as listed in Tutorial, causes Method Error.

To be honest, I have no idea what's causing this. I loaded up Julia 1.6.3, added SymbolicRegression, and tried defining the options variable as listed on the GitHub page and in the docs, and got the following error.

ERROR: MethodError: no method matching var"#Options#5"(::Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, ::typeof(exp), ::Nothing, ::L2DistLoss, ::Int64, ::Int64, ::Float32, ::Float32, ::Int64, ::Nothing, ::Bool, ::Bool, ::Bool, ::Float32, ::Bool, ::Nothing, ::Int64, ::Float32, ::Bool, ::Bool, ::Int64, ::Vector{Float64}, ::Float32, ::Bool, ::Int64, ::Int64, ::Float32, ::Int64, ::Float32, ::Nothing, ::Nothing, ::Nothing, ::Bool, ::Nothing, ::Nothing, ::String, ::Int64, ::Float32, ::Int64, ::Nothing, ::Nothing, ::String, ::Float64, ::Type{Options})
Closest candidates are:
  var"#Options#5"(::Tuple{Vararg{Any, nbin}}, ::Tuple{Vararg{Any, nuna}}, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Type{Options}) where {nuna, nbin} at /home/johnbs/.julia/packages/SymbolicRegression/dhfuI/src/Options.jl:261
Stacktrace:
 [1] top-level scope
   @ REPL[2]:1

The code I'm using is as follows, and this happens even when the only package I have loaded is this one and its dependencies. Any help would be greatly appreciated.

options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -),
    unary_operators=(exp),
    npopulations=20
);

EquationSearch not defined

Hi, on Julia 1.5.3, I installed the package as specified in the README. It says that EquationSearch is not defined when I run the example:

julia> hallOfFame = EquationSearch(X, y, niterations=5, options=options, numprocs=4)
ERROR: UndefVarError: EquationSearch not defined
Stacktrace:
 [1] top-level scope at REPL[6]:1

Should I install a specific version or so? (master/dev)?

Large memory consumption of EquationSearch

I ran with the following options

using Distributed
nprocs() == 1 && (procs = addprocs(4))
@everywhere using SymbolicRegression
options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -),
    unary_operators=(cos, sin, sqrt),
    npopulations=40
)
niterations = 15

on a problem where X had size 3x18, and before it completed, the Julia session had allocated some 12+ GB of memory and crashed both itself and my browser. I am not familiar with the internals of this package, but is there perhaps an excessive amount of data stored in an internal data structure that is not really needed and could be discarded as the optimization progresses?

[Feature] Differential Operators

It would be great if SymbolicRegression.jl supported differential operators.

To implement this with minimal changes, you could have a single operator for each variable. For example, unary_operators=(dx[1], dx[2]) would create differential operators with respect to x[1] and x[2] respectively. These operators are ℝ→ℝ, so would fit perfectly within the normal evaluation scheme.

Since SymbolicRegression.jl now has fast autodiff in evalDiffTreeArray and evalGradTreeArray, you would basically need to call these from within evalTreeArray in case the unary operator of the current node is a differential operator.

Tagging in case of interest @asross @kazewong @ChrisRackauckas @patrick-kidger

I'm unsure of the cleanest way to have users specify this. Would having a new abstract struct exported by SymbolicRegression.jl be a good idea - sort of like an enum that indicates which variable is used? If using custom variable_names, you would want to have the constructor map from the variable name to the direction, which would also let PySR users specify differential operators in the same language.

Example:

using SymbolicRegression
options = Options(
    binary_operators=(+, -, *, /),
    unary_operators=(Dx(1), Dx(2))
)

SymbolicRegression would export the struct Dx which would be used throughout the library to specify a differential operator with respect to some variable. This would have another constructor that would take variable_names and map a string name of a variable to the correct index.

Thoughts?

Parsimony interference in pareto frontier

In the current model, parsimony enters the loss, and this loss enters the Pareto frontier calculation. So technically, even if a model has better predictive accuracy than a simpler model, it might be removed from the Pareto frontier due to parsimony.

So, we should remove parsimony altogether from hall of fame scores.
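For reference, a sketch of the proposed criterion, using compute_complexity and member.loss as they appear elsewhere in this README (a hypothetical helper, not the package's implementation):

# Build the front from raw loss only: keep a member iff it improves on the
# best loss among all equal-or-simpler members.
function raw_loss_front(members, options)
    sorted = sort(members; by = m -> compute_complexity(m, options))
    front = similar(members, 0)
    best_loss = Inf
    for m in sorted
        if m.loss < best_loss
            push!(front, m)
            best_loss = m.loss
        end
    end
    return front
end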

Segfault during execution

While calling from PySR, a segmentation fault is thrown.

(screenshot of the segfault stack trace omitted)

The stack trace points to this line:

@inbounds for i=1:options.npopulations

which is conspicuously marked @inbounds. So I'm suspecting, but cannot prove, that this is a bug in SymbolicRegression.jl. (Although in principle it could be a bug in Julia itself.)

I don't have a lot more to offer than that, I'm afraid. Obviously this occurs nondeterministically, and it's only happened once so far, so I don't know if I can reproduce it.

  • Julia 1.6.0-rc1
  • Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-65-generic x86_64)
  • PySR installed from GitHub master branch today.
  • pysr.pysr called with alpha=0.2, procs=16, populations=100, niterations=100, loss="L1DistLoss()", binary_operators=["plus", "mult", "sub"], unary_operators=["cos", "exp", "sin"]

EquationSearch freezes with zero CPU utilization

I'm trying a very simple example on v0.3.3 (Julia v1.6 beta1) and it just freezes without using any CPU. If I Ctrl-C, I get the error below.

using SymbolicRegression
options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -),
    unary_operators=(cos, sin, sqrt),
    npopulations=20
)
niterations = 5

X = randn(3, 10)
y = randn(10)

hallOfFame = EquationSearch(X, y, niterations=niterations, options=options)
ERROR: InterruptException:

...and 19 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:364
 [2] macro expansion
   @ ./task.jl:383 [inlined]
 [3] EquationSearch(X::Matrix{Float64}, y::Vector{Float64}; niterations::Int64, weights::Nothing, varMap::Nothing, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(cos), typeof(sin), typeof(sqrtm)}})
   @ SymbolicRegression ~/.julia/packages/SymbolicRegression/sSsHc/src/SymbolicRegression.jl:117
 [4] top-level scope
   @ REPL[19]:1
 [5] eval
   @ ./boot.jl:360 [inlined]

Update SymbolicUtils

Please consider updating the compatibility requirement for the SymbolicUtils package to v0.19, given that the current one is quite out of date.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Gradient-based loss functions

I am interested in finding a way to modify the loss function in PySR. As in: getting the expression and giving it to a different function which returns the loss, rather than having a prediction as such.

I was unable to understand EvaluateEquation.jl, which I believe is the file I'd need to modify to implement this. If anyone can spare the time, please let me know whether modifying the loss/score is possible. Thanks in advance.

Base.print

Just like how we redefine base operations after declaring options, we should do the same for Base.print*, rather than having to call printTree(tree, options). Likewise for PopMember: it should unpack and print the tree.
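A sketch of what that overload could look like (hypothetical: stringTree is used as elsewhere in this document, and some global reference to the options would be needed, since printing a tree requires them):

# Hypothetical global holder for the options declared by the user:
const PRINT_OPTIONS = Ref{Any}(nothing)

# Proposed overload so that `print(tree)` just works:
Base.show(io::IO, tree::Node) = print(io, stringTree(tree, PRINT_OPTIONS[]))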

Simplification bug on 0.6

Not sure what this is, but I'm seeing this error come up in the recent versions with SymbolicUtils 0.6:

nested task error: UndefRefError: access to undefined reference
        Stacktrace:
          [1] getindex
            @ ./array.jl:921 [inlined]
          [2] rehash!(h::Dict{Tuple{AbstractAlgebra.Ring, Vector{Symbol}, Symbol, Int64}, AbstractAlgebra.Ring}, newsz::Int64)
            @ Base ./dict.jl:201
          [3] rehash!(h::Dict{Tuple{AbstractAlgebra.Ring, Vector{Symbol}, Symbol, Int64}, AbstractAlgebra.Ring}, newsz::Int64) (repeats 862 times)
            @ Base ./dict.jl:216
          [4] _setindex!
            @ ./dict.jl:368 [inlined]
          [5] setindex!(h::Dict{Tuple{AbstractAlgebra.Ring, Vector{Symbol}, Symbol, Int64}, AbstractAlgebra.Ring}, v0::AbstractAlgebra.Generic.MPolyRing{BigInt}, key::Tuple{AbstractAlgebra.Integers{BigInt}, Vector{Symbol}, Symbol, Int64})
            @ Base ./dict.jl:390
          [6] setindex!
            @ ./abstractdict.jl:541 [inlined]
          [7] MPolyRing
            @ ~/.julia/packages/AbstractAlgebra/aF5Iw/src/generic/GenericTypes.jl:424 [inlined]
          [8] PolynomialRing(R::AbstractAlgebra.Integers{BigInt}, s::Vector{String}; cached::Bool, ordering::Symbol)
            @ AbstractAlgebra.Generic ~/.julia/packages/AbstractAlgebra/aF5Iw/src/generic/MPoly.jl:5064
          [9] PolynomialRing
            @ ~/.julia/packages/AbstractAlgebra/aF5Iw/src/generic/MPoly.jl:5061 [inlined]
         [10] to_mpoly(t::SymbolicUtils.Term{Real}, dicts::Tuple{OrderedCollections.OrderedDict{SymbolicUtils.Sym, Any}, OrderedCollections.OrderedDict{Any, SymbolicUtils.Sym}})
            @ SymbolicUtils ~/.julia/packages/SymbolicUtils/89Tzx/src/abstractalgebra.jl:85
         [11] #51
            @ ~/.julia/packages/SymbolicUtils/89Tzx/src/abstractalgebra.jl:39 [inlined]

For now I might try to disable simplification completely in PySR, for stability, until I figure this one out.

[BUG] Domain errors

Cross post with MilesCranmer/PySR#116

It seems like there are some very rare edge cases where a domain error will occur. For the vast majority of instances, an Inf or NaN will be caught at the source, and the evaluation function will immediately quit. However, it looks like sometimes this doesn't happen. Here is the traceback:

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:334 [inlined]
     [2] fetch
       @ ./task.jl:349 [inlined]
     [3] (::SymbolicRegression.var"#60#98"{Vector{Vector{Task}}, Int64, Int64})()
       @ SymbolicRegression ./task.jl:423
    
        nested task error: DomainError with Inf:
        cos(x) is only defined for finite x.
        Stacktrace:
          [1] cos_domain_error(x::Float32)
            @ Base.Math ./special/trig.jl:97
          [2] cos(x::Float32)
            @ Base.Math ./special/trig.jl:108
          [3] deg1_l1_ll0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{1}, #unused#::Val{2}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:175
          [4] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:31
          [5] deg2_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{4}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:66
          [6] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:49
          [7] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{2}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
          [8] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
          [9] deg2_l0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{2}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:245
         [10] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:45
         [11] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{1}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
         [12] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
         [13] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{4}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
         [14] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
         [15] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{1}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
         [16] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
         [17] deg2_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{4}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:68
         [18] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:49
         [19] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{1}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
         [20] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
         [21] deg1_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{1}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:84
         [22] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:33
         [23] deg2_r0_eval(tree::Node, cX::Matrix{Float32}, #unused#::Val{4}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:271
         [24] evalTreeArray(tree::Node, cX::Matrix{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss})
            @ SymbolicRegression.../EvaluateEquation.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/EvaluateEquation.jl:47
         [25] EvalLoss(tree::Node, dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}; allow_diff::Bool)
            @ SymbolicRegression.../LossFunctions.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/LossFunctions.jl:28
         [26] scoreFunc(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, tree::Node, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}; allow_diff::Bool)
            @ SymbolicRegression.../LossFunctions.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/LossFunctions.jl:47
         [27] scoreFunc
            @ ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/LossFunctions.jl:47 [inlined]
         [28] nextGeneration(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, member::PopMember{Float32}, temperature::Float32, curmaxsize::Int64, frequencyComplexity::Vector{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}; tmp_recorder::Dict{String, Any})
            @ SymbolicRegression.../Mutate.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/Mutate.jl:139
         [29] regEvolCycle(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, pop::Population{Float32}, temperature::Float32, curmaxsize::Int64, frequencyComplexity::Vector{Float32}, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}, record::Dict{String, Any})
            @ SymbolicRegression.../RegularizedEvolution.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/RegularizedEvolution.jl:62
         [30] SRCycle(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, pop::Population{Float32}, ncycles::Int64, curmaxsize::Int64, frequencyComplexity::Vector{Float32}; verbosity::Int64, options::Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}, record::Dict{String, Any})
            @ SymbolicRegression.../SingleIteration.jl ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/SingleIteration.jl:33
         [31] macro expansion
            @ ~/miniconda3/envs/pysr/share/julia/packages/SymbolicRegression/Z2pqJ/src/SymbolicRegression.jl:577 [inlined]
         [32] (::SymbolicRegression.var"#59#97"{Options{Tuple{typeof(*), typeof(/), typeof(+), typeof(-)}, Tuple{typeof(sin), typeof(cos), typeof(exp), typeof(log_abs)}, L2DistLoss}, Vector{Vector{Float32}}, Population{Float32}, Int64, Float32, SymbolicRegression.../Dataset.jl.Dataset{Float32}, Int64})()
            @ SymbolicRegression ./threadingconstructs.jl:178>

Interactive regression / printing epochs

Hello, and thanks for developing this library; I found very good uses for it in my research (granular and fluid dynamics). I noticed that in version 0.8 you introduced an interactive display of equation evolution that refreshes the terminal screen as the regression progresses. While nice for local, interactive workflows, the previous scheme of printing each epoch was much more useful for batch jobs on e.g. SLURM clusters, where stdout is collected into a file as the job runs. At the moment, the characters printed by the interactive display make the batch outputs incomprehensible, and I had to revert to v0.7.14.

Would it be possible to have an option to use non-interactive display of equations as the regression progresses, similar to the previous library version?

Cannot run the example from the documentation

Hi, I tried to run the example from the documentation but it always fails for me (both on Windows and Linux, and on Julia 1.5.3 and 1.6-beta1).
On freshly installed Julia I run the following (output of most of the commands omitted):

(@v1.6) pkg> generate Test2
(@v1.6) pkg> activate Test2
julia> using Pkg
julia> Pkg.add(url="https://github.com/MilesCranmer/SymbolicRegression.jl.git")
julia> using Distributed
julia> addprocs(4)
julia> @everywhere using SymbolicRegression
ERROR: On worker 2:
ArgumentError: Package SymbolicRegression [8254be44-1295-4e6a-a16d-46603ac705cb] is required but does not seem to be installed:
 - Run `Pkg.instantiate()` to install all recorded dependencies.

Stacktrace:
 [1] _require
   @ .\loading.jl:986
 [2] require
   @ .\loading.jl:910
 [3] #1
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\Distributed.jl:79
 [4] #103
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:274
 [5] run_work_thunk
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:63
 [6] run_work_thunk
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:72
 [7] #96
   @ .\task.jl:406

...and 3 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:364
 [2] macro expansion
   @ .\task.jl:383 [inlined]
 [3] _require_callback(mod::Base.PkgId)
   @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\Distributed.jl:76
 [4] #invokelatest#2
   @ .\essentials.jl:707 [inlined]
 [5] invokelatest
   @ .\essentials.jl:706 [inlined]
 [6] require(uuidkey::Base.PkgId)
   @ Base .\loading.jl:916
 [7] require(into::Module, mod::Symbol)
   @ Base .\loading.jl:897
 [8] top-level scope
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\macros.jl:204

[Performance] Single evaluation results

Here is the speed of SymbolicRegression.jl in evaluating a single expression with 48 nodes, over development history since v0.5.0:

v0.5.0   11.709 μs (58 allocations: 18.64 KiB)
v0.5.1   11.625 μs (58 allocations: 18.64 KiB)
v0.5.2   11.750 μs (58 allocations: 19.03 KiB)
v0.5.3   11.792 μs (58 allocations: 19.03 KiB)
v0.5.4   11.792 μs (58 allocations: 19.03 KiB)
v0.5.5   11.834 μs (58 allocations: 19.03 KiB)
v0.5.6   11.750 μs (58 allocations: 19.03 KiB)
v0.5.7   11.625 μs (58 allocations: 19.03 KiB)
v0.5.8   11.708 μs (58 allocations: 19.03 KiB)
v0.5.9   11.958 μs (58 allocations: 19.03 KiB)
v0.5.10   11.666 μs (58 allocations: 18.64 KiB)
v0.5.11   11.625 μs (58 allocations: 18.64 KiB)
v0.5.12   11.625 μs (58 allocations: 18.64 KiB)
v0.5.13   11.791 μs (58 allocations: 19.42 KiB)
v0.5.14   11.833 μs (58 allocations: 19.42 KiB)
v0.5.15   11.708 μs (58 allocations: 19.42 KiB)
v0.5.16   11.750 μs (58 allocations: 19.42 KiB)
v0.6.0   11.500 μs (58 allocations: 19.42 KiB)
v0.6.1   11.750 μs (58 allocations: 19.81 KiB)
v0.6.2   11.666 μs (58 allocations: 19.81 KiB)
v0.6.3   11.666 μs (58 allocations: 19.81 KiB)
v0.6.4   14.583 μs (58 allocations: 19.81 KiB)
v0.6.5   14.583 μs (58 allocations: 19.81 KiB)
v0.6.6   14.375 μs (58 allocations: 19.81 KiB)
v0.6.7   14.625 μs (58 allocations: 19.81 KiB)
v0.6.8   14.542 μs (58 allocations: 19.81 KiB)
v0.6.9   14.625 μs (58 allocations: 19.81 KiB)
v0.6.10   14.416 μs (58 allocations: 19.81 KiB)
v0.7.0   14.500 μs (58 allocations: 20.20 KiB)
v0.7.1   14.625 μs (58 allocations: 20.20 KiB)
v0.7.2   14.583 μs (58 allocations: 20.20 KiB)
v0.7.3   14.458 μs (58 allocations: 20.20 KiB)
v0.7.4   14.541 μs (58 allocations: 20.20 KiB)
v0.7.5   14.625 μs (58 allocations: 20.20 KiB)
v0.7.6   14.417 μs (58 allocations: 20.20 KiB)
v0.7.7   14.458 μs (58 allocations: 20.20 KiB)
v0.7.8   14.458 μs (58 allocations: 20.20 KiB)
v0.7.9   14.541 μs (58 allocations: 20.59 KiB)
v0.7.10   14.416 μs (58 allocations: 21.38 KiB)
v0.7.11   14.500 μs (58 allocations: 21.38 KiB)
v0.7.12   14.458 μs (58 allocations: 21.38 KiB)

As can be seen, a major performance regression happened from 0.6.3 to 0.6.4. The change can be seen here: v0.6.3...v0.6.4.

This was a necessary change to deal with NaNs and Infs, but I'm not sure it should impact performance that badly...

It looks like checking for NaNs/Infs within the SIMD loop is a major issue for the compiler. I will try checking whether moving the NaN/Inf checks out of the loop gets a performance improvement or not (a sketch of this idea follows the benchmark script below).

(run this with:

git tag > tags.txt

# (Remove up to v0.5.0)

# Collect data:
for x in $(cat tags.txt); do git checkout $x 2>&1 > /dev/null && echo -n "${x} " && julia --project=. -O3 single_eval.jl; done >> benchmark_results.txt

# Sort and parse (requires vim-stream)
cat benchmark_results.txt | grep -v HEAD | vims -l 'xf.r f.r ' | sort -k1n -k2n -k3n | vims -l 'Iv\<esc>f r.f r.' |vims -l 'fndf '
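As promised above, a sketch of the candidate fix (hypothetical code, not the package's actual kernel): keep the SIMD loop branch-free, and perform a single finiteness check after the loop instead of per element.

function branch_free_eval!(out::AbstractVector{T}, x::AbstractVector{T}) where {T<:Real}
    @inbounds @simd for i in eachindex(out, x)
        out[i] = cos(x[i])     # no NaN/Inf branch inside the hot loop
    end
    return all(isfinite, out)  # one check after the loop; cheap and vectorizable
end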

Help on handling numeric errors like Inf

Hi, @MilesCranmer. This is not an issue but I am writing to ask for your help on some design considerations.

In symbolic regression code, we usually use some protected version of primitive functions to avoid numerical errors like division by zero. Of course, division by zero is allowed in Julia and the NaN value can be propagated. I looked into your Operators.jl and found that most functions are not protected. I am not sure whether you handle potential errors somewhere else.

Consider the following case using functions in Operators.jl.

sin(div(1.0, 0.0))

which should throw an error as follows

julia> sin(1.0 / 0.0)
ERROR: DomainError with Inf:
sin(x) is only defined for finite x.

Also, the exp function may produce huge values or even Inf if the input is a large number, which can affect numerical stability.

julia> exp(1000)
Inf

Currently, I am also working on symbolic regression, but using a different genetic programming approach. It would be helpful if you could discuss how you handle such errors.
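For reference, one common "protected" style looks roughly like this (illustrative definitions, not the package's Operators.jl):

# Guard the domain error by propagating NaN for non-finite inputs:
safe_sin(x) = isfinite(x) ? sin(x) : oftype(x, NaN)
# Clamp the argument so exp cannot overflow to Inf:
safe_exp(x) = exp(min(x, oftype(x, 100)))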

Tournament selection p

For some reason having any tournament selection p < 1 hurts the performance significantly. I don't think this should be the case... So I am labelling this a bug.

  • Check the code of the tournament selection algorithm, and ensure the weights are properly normalized.

Outdated version 0.2.0 on Julia 1.6.1

I am using Julia 1.6.1. When I added SymbolicRegression, the version was very much outdated: 0.2.0. The first example in the docs didn't run because EquationSearch didn't exist. Can you fix the dependency so that the latest version supports Julia 1.6.1?

[Cleanup] Better implementation of batching

In the current code, using batching=true doesn't seem to get nearly as good a speedup as simply inputting a smaller dataset. This indicates either the random slicing is taking too long, or the full evaluation is performed too frequently.
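For scale, the random slicing in question amounts to something like this (illustrative, not the package's code):

# Sample a random batch of columns from a [features, rows] dataset:
function random_batch(X::AbstractMatrix, y::AbstractVector, batch_size::Int)
    idx = rand(1:length(y), batch_size)   # sample with replacement
    return X[:, idx], y[idx]              # these copy; views would avoid the allocation
end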

Questions on large datasets and best settings and scaling

I switched my code from Python to Julia (to generate the datasets). I'm starting with a "small" dataset of 1,257,984 items. So the inputX is Matrix{Float32}(10, 1257984) and outputY is Vector{Float32}(1257984). (The actual data is integers, but I cast them to float.) outputY is either 0 or 1, so it's a binary problem: essentially finding an equation that divides two sets of vectors?

http://sirisian.com/randomfiles/symbolicregression/inputX.bin
http://sirisian.com/randomfiles/symbolicregression/outputY.bin

using SymbolicRegression
using SymbolicUtils
using Serialization

inputX = deserialize("inputX.bin")
outputY = deserialize("outputY.bin")
options = SymbolicRegression.Options(
	binary_operators=(+, *, /, -, ^),
	unary_operators=(sqrt,),
	constraints=((^)=>(-1, 3),),
	# target == 0
	# We want the predicted value to be < 0
	# target == 1
	# We want the predicted value to be >= 0
	loss=(t,p) -> t == 0 ? (p < 0 ? 0 : p^2 + 0.0001) : (p >= 0 ? 0 : p^2 + 0.0001),
	maxsize=48,
	maxdepth=8,
	npopulations=20,
	batching=true,
	batchSize=1000,
	progress=true
)
hallOfFame = EquationSearch(inputX, outputY, niterations=1000, options=options, numprocs=0, multithreading=true)
dominating = calculateParetoFrontier(inputX, outputY, hallOfFame, options)
eqn = node_to_symbolic(dominating[end].tree, options)
println(simplify(eqn))
for member in dominating
	size = countNodes(member.tree)
	score = member.score
	string = stringTree(member.tree, options)
		
	println("$(size)\t$(score)\t$(string)")
end

I have a lot of questions. I don't know what's feasible, so I think it's easier to ask them in one place.

  1. What settings should I be using for best results given my problem. Say I have an i9-12900k, so 24 threads to work with. I'm having problems reasoning about the size of batching or if I need to use other settings to speed things up.

  2. The loss function I have defined does not appear to work. My thinking was to drive the system to return values that are negative when the output is 0 and positive when the output is 1. For some reason when I let the above run it returns things like:
    1 0.000e+00 Inf -0.7851169
    So I'm fundamentally not understanding something. Surely such a constant equation like -0.5 would return a large loss? Say the target was 1 for an element being evaluated. Then 1 == 0 ? (-0.5 < 0 ? 0 : (-0.5)^2 + 0.0001) : (-0.5 >= 0 ? 0 : (-0.5)^2 + 0.0001) = 0.2501
    I noticed there are loss functions for binary classification problems. I see cross entropy and hinge loss mentioned a lot. I don't see these in the listed loss functions. I noticed in other places the target is expected to be -1 or 1 not 0 or 1. Any tips for using such loss functions with this library?

  3. Is it possible to force the algorithm to include a minimum complexity of say including the 10 variables + 9 operators, so 19. Like the simplest possible equation I would ever expect to work is x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10. I can write a cubic function that gets close to the result.

  4. Is it possible to define custom expression subtrees that use variables with lower complexity? Ideally I'd expect it to use varMap? For example say I have varMap=["x0", "y0", "x1", "y1", "x2", "y2", "x3", "y3", "cx", "cy"]. While I'm not sure (x0 - x1) is definitely going to be in my equation the likelihood is much higher than other subtrees. So I might have a list like ["x0 - x1", "x1 - x0", "y0 - y1", "y1 - y1", ...] or however best it would be stored and instead of complexity 3 it's more like complexity 2. In my example this happens because they're control points in a curve where (x0, y0), (x1, y1), (x2, y2), (x3, y3) are sequential and vector subtraction is expected to show up.

  5. My complete dataset is ~20,000,000 trillion elements. Same structure with 10 integers and a 0 or 1 binary result, but at different resolutions. Obviously I can't compute perfect equations or even loop over this within a lifetime. It is a fixed number though that I can randomly sample (aka Monte Carlo) to verify the quality of the result. My question is if it's possible to support a kind of randomized window dataset. So rather than supplying a fixed dataset one would supply a function (or generator) that returns a random record. Let's call this getRandomRecord. Calling this method might be slightly expensive, so you wouldn't want to just call it and throw away the result after using it once.

    I assume this would be an extension of the batching system. The system could generate data (probably multithreaded) calling getRandomRecord() populating a few GBs then batching would create a random array view on this data for each of its batches. The algorithm wouldn't be able to ever loop over all of the real data, but it could create another random view and calculate a fuzzy loss. Periodically the whole random dataset would be regenerated.

    Even with my 1257984 record example above I don't think it's statistically significant to iterate over the whole dataset at the beginning. It would definitely be useful to quickly sample them. Maybe if the total loss is low then it would sample more data. The closer the total loss (what's the term for this, sum of all losses? error?) is to 0 the more of the whole dataset it would attempt to sample. So starting out it might be fine with using 1K then slowly ramp up to 100K for the view size?

  6. In addition to the above, could this be scaled over hundreds of computers in the cloud? I couldn't get numprocs to work without it crashing, so I'm not sure how stable it is. The concept would be that each computer generates its own randomized dataset, and they only talk to ensure they don't process the same equations. My experience with distributed computing is limited to C++ MPI stuff from years ago, so I'm not familiar with Julia's features yet. I don't know your algorithm, but is it within your scope to support such a distributed cloud setup in an easy way?

(Also, as a side note, I wouldn't be against sponsoring some of these changes for beer money, like 500 dollars. I'm interested in the potential of these kinds of projects even if they don't solve my specific problems.)

nested task error

Hi
I'm trying to see the result of this package on the example1 of the examples of the AI-Feynman package at

using the following script
using SymbolicRegression, CSV, DataFrames, SymbolicUtils
df = CSV.read("example1.txt", DataFrame, header=false);
df_mat = Matrix{Float64}(df[1:500, [1,2,3,4,5]]); #just first 500 rows of the dataset

X = df_mat[:, 1:4]'
y = df_mat[:, 5];

options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -, ^),
    #unary_operators=(cos, exp),
    npopulations=20,
    batching=true,
    batchSize=128,
)

hallOfFame = EquationSearch(X, y, niterations=5, options=options, numprocs=6)

This script starts normally, but when the process arrives at

Progress: 34 / 100 total iterations (34.000%)

I get the following error

TaskFailedException

nested task error: On worker 13:
BoundsError: attempt to access 4×128 Matrix{Float64} at index [-1, 1:128]
Stacktrace:
  [1] throw_boundserror
    @ ./abstractarray.jl:651
  [2] checkbounds
    @ ./abstractarray.jl:616 [inlined]
  [3] _getindex
    @ ./multidimensional.jl:831 [inlined]
  [4] getindex
    @ ./abstractarray.jl:1170 [inlined]
  [5] deg0_eval
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:90
  [6] evalTreeArray
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:22
  [7] deg1_eval
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:72
  [8] evalTreeArray
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:27
  [9] deg2_eval
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:54
 [10] evalTreeArray
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:37
 [11] deg2_eval
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:54
 [12] evalTreeArray
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:37
 [13] deg2_l0_eval
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:199
 [14] evalTreeArray
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/EvaluateEquation.jl:33
 [15] scoreFuncBatch
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/LossFunctions.jl:58
 [16] #nextGeneration#1
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/Mutate.jl:137
 [17] regEvolCycle
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/RegularizedEvolution.jl:57
 [18] #SRCycle#1
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/SingleIteration.jl:32
 [19] macro expansion
    @ ~/.julia/packages/SymbolicRegression/1URtS/src/SymbolicRegression.jl:476 [inlined]
 [20] #53
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/macros.jl:87
 [21] #103
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:274
 [22] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
 [23] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:72
 [24] #96
    @ ./task.jl:411
Stacktrace:
 [1] #remotecall_fetch#143
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:394 [inlined]
 [2] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:386
 [3] #remotecall_fetch#146
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421 [inlined]
 [4] remotecall_fetch
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421 [inlined]
 [5] call_on_owner
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:494 [inlined]
 [6] fetch(r::Distributed.Future)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:533
 [7] (::SymbolicRegression.var"#55#87"{Vector{Vector{Distributed.Future}}, Int64, Int64})()
   @ SymbolicRegression ./task.jl:411

Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] fetch
   @ ./task.jl:337 [inlined]
 [3] _EquationSearch(::SymbolicRegression.var"../ProgramConstants.jl".SRDistributed, datasets::Vector{SymbolicRegression.var"../Dataset.jl".Dataset{Float64}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(exp), typeof(cos)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
   @ SymbolicRegression ~/.julia/packages/SymbolicRegression/1URtS/src/SymbolicRegression.jl:387
 [4] EquationSearch(datasets::Vector{SymbolicRegression.var"../Dataset.jl".Dataset{Float64}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(exp), typeof(cos)}, L2DistLoss}, numprocs::Int64, procs::Nothing, multithreading::Bool, runtests::Bool)
   @ SymbolicRegression ~/.julia/packages/SymbolicRegression/1URtS/src/SymbolicRegression.jl:181
 [5] EquationSearch(X::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, y::Matrix{Float64}; niterations::Int64, weights::Nothing, varMap::Nothing, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(exp), typeof(cos)}, L2DistLoss}, numprocs::Int64, procs::Nothing, multithreading::Bool, runtests::Bool)
   @ SymbolicRegression ~/.julia/packages/SymbolicRegression/1URtS/src/SymbolicRegression.jl:145
 [6] #EquationSearch#22
   @ ~/.julia/packages/SymbolicRegression/1URtS/src/SymbolicRegression.jl:157 [inlined]
 [7] top-level scope
   @ In[9]:9
 [8] eval
   @ ./boot.jl:360 [inlined]
 [9] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

I checked several times, but the same error always appears. I'm on an Ubuntu laptop with an i7 CPU and 16 GB RAM. If I increase the number of rows in the dataset, the issue appears for other workers.

Data recorder

One of the most useful features for improving the performance of SymbolicRegression.jl, and for debugging it, would be a data recorder. This would log everything that happens in a structured format (not plaintext!), so that one can quantitatively study and tune the performance of the various algorithms.
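A minimal sketch of what such structured logging could look like, assuming the JSON3 package; the event fields here are illustrative, not a committed schema:

using JSON3

events = NamedTuple[]

# Record one structured event per algorithmic action (mutation, crossover,
# migration, ...), rather than printing plaintext.
function record_event!(events; kind, iteration, payload)
    push!(events, (; kind, iteration, payload, time=time()))
end

record_event!(events; kind=:mutation, iteration=1,
              payload=(; op=:add_node, accepted=true))

open("recorder.json", "w") do io
    JSON3.write(io, events)
end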

AssertionError: isfinite(phi_c) && isfinite(dphi_c)

Sometimes I see this issue during the step "Testing entire pipeline on workers", even without BFGS turned on; it appears to happen during Newton's method. I have only observed it with very nonlinear operators such as power laws and exponentials.

To fix this for BFGS, I changed the line search from HagerZhang (the default) to BackTracking. I suppose we should try this for the other optimization strategies as well? (A sketch of the swap is below.)
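For reference, the swap is a one-liner with Optim.jl and LineSearches.jl; a standalone sketch on a toy objective (not the package's internal call site):

using Optim, LineSearches

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# HagerZhang is the default line search; BackTracking tends to be more
# robust when the objective returns non-finite values.
result = optimize(rosenbrock, zeros(2),
                  BFGS(linesearch=LineSearches.BackTracking()))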

Another thing to note is that this error goes away when the precision is increased, so it seems to be a numerical issue encountered during optimization.

Consistent snake_case vs CamelCase

This is a problem with both SymbolicRegression.jl and PySR: parameters are a mix of snake_case and camelCase (and PascalCase, for some symbols). In 0.9.0 this will be fixed: types/constructors/functions will be PascalCase, and parameters will be snake_case. Existing symbols will be deprecated, with a warning message indicating which function or parameter to use instead, and will eventually be removed entirely.
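A sketch of the deprecation pattern with Base's @deprecate macro; the names below are illustrative, not the actual 0.9.0 symbols:

# New snake_case API.
run_search(x; batch_size=32) = x  # placeholder body

# Old camelCase name keeps working but warns, pointing at the replacement.
Base.@deprecate runSearch(x) run_search(x)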

Saving equations throughout runtime

Hey, thank you so much for all your work in this package.

I'm interested in evaluating the performance of this algorithm over time (i.e., saving the hallOfFame object after each iteration). Is there any way I've missed of achieving this without modifying the source code? (A workaround sketch follows.)
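A hedged workaround sketch without source changes: run the search one iteration at a time and serialize the returned hall of fame after each call. Note this restarts the search on every call as written, so it trades efficiency for a per-iteration snapshot; some versions expose a saved_state keyword (visible in stack traces elsewhere on this page) that could thread state through instead.

using SymbolicRegression, Serialization

X = randn(Float32, 2, 100)
y = 2f0 .* X[1, :] .+ X[2, :] .^ 2
options = SymbolicRegression.Options(binary_operators=(+, *, -))

for i in 1:10
    hof = EquationSearch(X, y; niterations=1, options=options)
    serialize("hall_of_fame_$(i).jls", hof)  # snapshot after each call
end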

Thanks!

Long-running parallel jobs have small percentage of processes hang

I suspect that processes sometimes hang during long-running jobs, since the load decreases over time. Very infrequently, when running with procs=0, I will see errors such as cos receiving Inf as input, even though I have special checks against such behavior. It is possible that this occurs during a long-running job and the process crashes without reporting anything.

Wrapping code in try in Julia slows it down quite a bit, so catching errors that way is not practical.

Once #27 is implemented, perhaps one could monitor processes over time to see whether they crash, and determine what caused it.

MethodError: Cannot `convert` an object of type SymbolicUtils.Term{Number, Nothing} to an object of type SymbolicUtils.Pow{Number, SymbolicUtils.Term{Number, Nothing}, Float32, Nothing}

Hi,

I hope you are fine.

After the regression terminates, following the package's example, when I try to print the best equation of the hallOfFame with

using SymbolicRegression, CSV, DataFrames, SymbolicUtils

df = CSV.read("example1.txt", DataFrame, header=false);
println(first(df, 5))
df_mat = Matrix{Float64}(df[rand(1:size(df)[1], 800), [1,2,3,4,5]]);

X = df_mat[:, 1:4]'
y = df_mat[:, 5];

options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -, ^),
    unary_operators=(sqrt,),
    npopulations=100,
)

hallOfFame = EquationSearch(X, y, niterations=10, options=options, numprocs=nothing, multithreading=true);

dominating = calculateParetoFrontier(X, y, hallOfFame, options);
eqn = node_to_symbolic(dominating[end].tree, options)
println(simplify(eqn))

I get the following error

MethodError: MethodError: Cannot `convert` an object of type SymbolicUtils.Term{Number, Nothing} to an object of type SymbolicUtils.Pow{Number, SymbolicUtils.Term{Number, Nothing}, Float32, Nothing}
Closest candidates are:
  convert(::DataType, ::SymbolicUtils.Symbolic, !Matched::Options; varMap) at /home/vahid/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:78
  convert(::DataType, !Matched::Symbol, !Matched::Options; varMap) at /home/vahid/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:73
  convert(::Type{var"#s16"} where var"#s16"<:Union{Number, T}, !Matched::MultivariatePolynomials.AbstractPolynomialLike{T}) where T at /home/vahid/.julia/packages/MultivariatePolynomials/vqcb5/src/conversion.jl:65
  ...

Stacktrace:
  [1] sqrt_abs(x::SymbolicUtils.Pow{Number, SymbolicUtils.Term{Number, Nothing}, Float32, Nothing})
    @ SymbolicRegression.var"../Operators.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/Operators.jl:74
  [2] parse_tree_to_eqs(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, index_functions::Bool, evaluate_functions::Bool)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:30
  [3] (::SymbolicRegression.var"../InterfaceSymbolicUtils.jl".var"#2#4"{Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, Bool, Bool})(x::Node)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:30
  [4] map(f::SymbolicRegression.var"../InterfaceSymbolicUtils.jl".var"#2#4"{Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, Bool, Bool}, t::Tuple{Node, Node})
    @ Base ./tuple.jl:214
  [5] parse_tree_to_eqs(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, index_functions::Bool, evaluate_functions::Bool)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:30
  [6] (::SymbolicRegression.var"../InterfaceSymbolicUtils.jl".var"#2#4"{Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, Bool, Bool})(x::Node)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:30
  [7] map(f::SymbolicRegression.var"../InterfaceSymbolicUtils.jl".var"#2#4"{Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, Bool, Bool}, t::Tuple{Node, Node})
    @ Base ./tuple.jl:214
  [8] parse_tree_to_eqs(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}, index_functions::Bool, evaluate_functions::Bool)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:30
  [9] node_to_symbolic(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss}; varMap::Nothing, evaluate_functions::Bool, index_functions::Bool)
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:119
 [10] node_to_symbolic(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-), typeof(pow)}, Tuple{typeof(sqrt_abs)}, L2DistLoss})
    @ SymbolicRegression.var"../InterfaceSymbolicUtils.jl" ~/.julia/packages/SymbolicRegression/VQnnx/src/InterfaceSymbolicUtils.jl:119
 [11] top-level scope
    @ ~/GoogleDrive/Coding/testing/teste_julia.ipynb:1
 [12] eval
    @ ./boot.jl:360 [inlined]
 [13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094
 [14] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [15] invokelatest
    @ ./essentials.jl:706 [inlined]
 [16] (::VSCodeServer.var"#98#99"{VSCodeServer.NotebookRunCellArguments, String})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/serve_notebook.jl:18
 [17] withpath(f::VSCodeServer.var"#98#99"{VSCodeServer.NotebookRunCellArguments, String}, path::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/repl.jl:185
 [18] notebook_runcell_request(conn::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, params::VSCodeServer.NotebookRunCellArguments)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/serve_notebook.jl:14
 [19] dispatch_msg(x::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, dispatcher::VSCodeServer.JSONRPC.MsgDispatcher, msg::Dict{String, Any})
    @ VSCodeServer.JSONRPC ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/JSONRPC/src/typed.jl:67
 [20] serve_notebook(pipename::String; crashreporting_pipename::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/packages/VSCodeServer/src/serve_notebook.jl:94
 [21] top-level scope
    @ ~/.vscode/extensions/julialang.language-julia-1.4.3/scripts/notebook/notebook.jl:10
 [22] include(mod::Module, _path::String)
    @ Base ./Base.jl:386
 [23] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:285
 [24] _start()
    @ Base ./client.jl:485

Options.npopulations = nothing, does not detect number of cores

Hello,

Reading the API documentation, my understanding is that the program should run as many populations as there are cores when Options.npopulations is set to nothing. However, that does not seem to be the case. Am I misunderstanding what this option is supposed to do?

Running Julia with -t 5:

using SymbolicRegression

x = rand(2, 30)
f(x) = x[1] + x[2]
y = f.(eachcol(x))

opt1 = Options()
opt2 = Options(npopulations = 5)

EquationSearch(x, y, niterations = 2, numprocs = 5, runtests = false, options = opt1)
# With opt1: 2 iterations in total

EquationSearch(x, y, niterations = 2, numprocs = 5, runtests = false, options = opt2)
# With opt2: 10 iterations in total as expected

Printing of sqrt as sqrtm

I noticed that sqrt is printed as sqrtm. Is this intentional or a typo?

12          1.963e+01  6.172e-02  (((46.020115 - x1) - x1) - ((sqrtm(x3) / 0.389079) / 0.5039604))

Q : recording # of function calls

I'd like to benchmark this library against others, and one natural metric is the total number of equation evaluations (summed over iterations, cores, etc.). Is there a flag (or combination of flags) that makes the library report this? (A counting sketch is below.)
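No such flag is documented as far as I can tell, but a rough proxy can be built by wrapping a custom element-wise loss with a counter. This is a sketch under assumptions: it counts per-datapoint loss calls on a single process only (each worker or thread region would keep its own counter under multiprocessing), and it assumes the custom-loss option in Options:

using SymbolicRegression

const LOSS_CALLS = Threads.Atomic{Int}(0)

function counting_loss(x, y)
    Threads.atomic_add!(LOSS_CALLS, 1)  # one increment per datapoint evaluated
    return abs2(x - y)
end

options = SymbolicRegression.Options(binary_operators=(+, *, -),
                                     loss=counting_loss)
# ...run EquationSearch with these options, then inspect LOSS_CALLS[]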

DomainError when computing pareto curve

The new changes to the Pareto-curve calculation seem to introduce issues when computing the score. Technically this shouldn't occur, since it is the dominating Pareto curve, but I should still use an abs to prevent it.

JULIA: DomainError with -1802.6609:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).
Stacktrace:
  [1] throw_complex_domainerror(f::Symbol, x::Float32)
    @ Base.Math ./math.jl:33
  [2] _log(x::Float32, base::Val{:ℯ}, func::Symbol)
    @ Base.Math ./special/log.jl:328
  [3] log(x::Float32)
    @ Base.Math ./special/log.jl:254
  [4] string_dominating_pareto_curve(hallOfFame::HallOfFame, baselineMSE::Float32, dataset::SymbolicRegression.var"../Dataset.jl".Dataset{Float32}, options::Options{Tuple{typeof(+), typeof(-), typeof(*), typeof(/)}, Tuple{typeof(exp)}, L2DistLoss}, avgy::Float32)
    @ SymbolicRegression.var"../HallOfFame.jl" ~/.julia/packages/SymbolicRegression/HMgOA/src/HallOfFame.jl:100
  [5] _EquationSearch(::SymbolicRegression.var"../ProgramConstants.jl".SRThreaded, datasets::Vector{SymbolicRegression.var"../Dataset.jl".Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(-), typeof(*), typeof(/)}, Tuple{typeof(exp)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/HMgOA/src/SymbolicRegression.jl:657
  [6] EquationSearch(datasets::Vector{SymbolicRegression.var"../Dataset.jl".Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(-), typeof(*), typeof(/)}, Tuple{typeof(exp)}, L2DistLoss}, numprocs::Int64, procs::Nothing, multithreading::Bool, runtests::Bool, saved_state::Nothing)

Variable subset selection for linear regression

I noticed that it took quite some time for the symbolic regression to beat simple variable-subset selection with linear regression. To clarify: I consider a large regressor matrix A in the problem y = A*b, where b are the parameters, and the task is to select a subset of the columns of A to include in the linear model. This can typically be solved to near-optimality with LASSO regression. After running long enough, the symbolic regression did find a better equation for my toy problem, at a similar number of estimated parameters to my subset selection, but it makes me wonder: could subset selection be used as a heuristic to seed some of the population members?

For reference, in case the explanation above didn't make sense, here is a naive brute-force algorithm that selects up to n variables from a regressor matrix A exactly:

using IterTools, TotalLeastSquares

function ls_selection(A, y, n)
    m = size(A, 2)
    errors = Float64[]
    indices = Vector{Int}[]
    for ni = 1:n                                # subset sizes 1..n
        for inds = IterTools.subsets(1:m, ni)   # all column subsets of size ni
            At = A[:, inds]
            θ = At \ y                          # least-squares fit on the subset
            E = y - At * θ
            e = sum(abs, E)                     # total absolute residual
            push!(errors, e)
            push!(indices, inds)
        end
    end
    mi = indices[argmin(errors)]                # best subset found
    return mi, errors, indices
end

Using recorder to only track specific information?

Hi there,

I'm using the recorder (which, by the way, doesn't show up anywhere in the documentation that I've seen; I found it in a GitHub issue). But instead of all the information, I only need a small portion: specifically, loss values. (I'm quantitatively representing convergence rate as loss vs. iterations.) Additionally, I noticed the JSON file contains both "population" entries and what I can only describe as "individual" entries: the "individuals" have keys that are long strings of numbers and contain values such as events, score, tree, parent, etc., while the population entries contain much simpler information.

Basically, I just need to know 2 things:

  1. Is there a way to filter what the recorder stores, so I'm not left with hundreds of MB of files to parse?
  2. Which set of information ("individual" entries vs. entries in the population dictionaries) should I use to check convergence rate? Or are these two identical, just with different levels of detail?

Thanks for your help!

Best,

John

(Non-)negative custom loss functions

Please forgive me if I'm missing something basic, but is there a reason that custom loss functions aren't allowed to go negative? I'm using PySR 0.7.9.

RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: DomainError with -5.748968e30:
Your loss function must be non-negative.

Is there something specific in the context of PySR that prevents the use of negative loss functions? My question stems from wanting to use a Poisson loss, e.g., the deviance loss(x, y) = 2 * (x * log(x / y) + y - x) or the likelihood-based loss(x, y) = y - x * log(y). These and similar losses are used in other ML contexts too (e.g., https://www.tensorflow.org/api_docs/python/tf/keras/losses/Poisson and https://github.com/JuliaML/LossFunctions.jl/).

(I presume it must have something to do with the choice of the underlying optimizer. If that choice is inflexible, then it would be useful to document this requirement for custom loss functions, which I see from other issue responses are a work in progress.)
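One note: the deviance form quoted above is itself non-negative for x, y > 0 (as a function of y it is minimized, at zero, when y = x), so it should pass the non-negativity check without modification; only the likelihood form can go negative. A quick sketch, with a guard for the x = 0 edge case where x * log(x / y) should be treated as 0:

poisson_deviance(x, y) = 2 * ((x > 0 ? x * log(x / y) : zero(x)) + y - x)

poisson_deviance(3.0, 2.5)  # > 0
poisson_deviance(2.5, 2.5)  # == 0 at the minimum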

Thanks!

Lost hall of fame members

This bug seems very problematic. Sometimes when running a search, I will notice that the hall of fame actually gets worse from one step to the next.

Initially I thought this could be an effect of batched loss calculations: perhaps when a member was first added to the hall of fame, its loss was artificially low because it was calculated on a smaller portion of the data.

However, looking further into this, it seems like this even happens for unbatched runs.

When I turn off hofMigration in the Options, it gets even worse, which seems to imply that hall-of-fame migration is the only way the hall of fame is being copied from one iteration to the next...

  • Ensure that the per-population hall of fame is correctly updated.
    • Simply overwrite the hall-of-fame entry for a given size if the new member is better; the Pareto frontier is calculated separately, so don't worry about domination (see the sketch below).
  • Ensure that the global hall of fame is correctly updated.
  • Ensure that best_seen is correctly updated during the search.
  • Ensure that PopMembers are correctly copied, rather than just passed, to the search code. It could be that they simply get overwritten...
  • Is calculateParetoFrontier messing them up?
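A minimal sketch of the overwrite rule from the first checklist item: keep one best member per expression size, and copy rather than alias when storing, so later in-place mutation can't corrupt the record. The names here are illustrative, not the package's internals:

struct Member
    tree::Any
    loss::Float64
end

function update_hof!(hof::Dict{Int,Member}, size::Int, member::Member)
    if !haskey(hof, size) || member.loss < hof[size].loss
        hof[size] = deepcopy(member)  # copy, don't alias
    end
    return hof
end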

[Performance] Constant optimization using autograd

The BFGS optimizer appears to be using finite differences rather than autograd. For the optimization, I am using a slower equation evaluation that is differentiable.

Therefore, I should either switch to the non-differentiable equation evaluation, or figure out whether the autograd version is faster and switch to that.

To force it to use autograd, I should pass autodiff=:forward as an option to the BFGS solver (see the sketch below).
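A standalone sketch of the flag in Optim.jl, on a toy objective rather than the package's internal optimizer call:

using Optim

f(c) = sum(abs2, c .- [1.0, 2.0])

# Without autodiff=:forward, BFGS falls back to finite differences here.
result = optimize(f, zeros(2), BFGS(); autodiff=:forward)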

SymbolicUtils Upper Bound

This upper bound (#63) is causing some issues when using SymbolicRegression.jl, which in turn is blocking updates in DataDrivenDiffEq.jl. @shashi, can you help figure out what's needed here? I think SymbolicRegression.jl was reaching into some SymbolicUtils.jl internals.

Switching from Float to UInt8 ?

Is it possible to switch from Float to UInt8 numbers? I'm new to Julia.

I would like to have SymbolicRegression.jl work on discrete numbers and then be able to use binary operators. (A workaround sketch follows.)
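The search itself runs on floating-point data, so a hedged workaround is to cast the discrete inputs to Float32 and expose bitwise operations as custom binary operators acting on rounded values. This is a sketch under those assumptions, not a supported discrete mode:

using SymbolicRegression

# Bitwise AND lifted to floats, guarded against non-finite inputs that the
# evolution may produce (returning NaN marks the evaluation as invalid).
function band(x, y)
    (isfinite(x) && isfinite(y)) || return NaN32
    Float32(Int(clamp(round(x), 0, 255)) & Int(clamp(round(y), 0, 255)))
end

X = Float32.(rand(UInt8, 3, 200))             # discrete data, cast for the search
y = Float32.(UInt8.(X[1, :]) .& UInt8.(X[2, :]))

options = SymbolicRegression.Options(binary_operators=(+, band))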

Verbosity

Even with verbosity=0 there are still a lot of messages printed, e.g.:

function activate_env_on_workers(procs, project_path::String)
    println("Activating environment on workers.")
    ...
end

I think all the offending lines are in the Utils.jl file. (A sketch of the fix is below.)
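The fix is presumably just to thread the verbosity setting through and gate each diagnostic print on it; a sketch (the signature here is illustrative):

function activate_env_on_workers(procs, project_path::String; verbosity::Int=1)
    verbosity > 0 && println("Activating environment on workers.")
    # ...
end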
