
Balsa


Balsa is a learned query optimizer. It learns to optimize SQL queries through trial and error, using deep reinforcement learning and sim-to-real learning.

Notably, Balsa is the first end-to-end learned optimizer that does not rely on learning from an existing expert optimizer's plans, while being able to surpass the performance of expert plans, sometimes by a sizable margin.

For technical details, see the SIGMOD 2022 paper, Balsa: Learning a Query Optimizer Without Expert Demonstrations [bibtex].

Setup | Quickstart | Experiment configs | Metrics and artifacts | Cluster mode | Q&A | Citation

Setup

To get started quickly, run the following on a single machine, which will run both the agent's neural network and query execution.

  1. Clone and install Balsa.

    git clone https://github.com/balsa-project/balsa.git ~/balsa
    cd ~/balsa
    # Recommended: run inside a Conda environment.
    # All commands that follow are run under this conda env.
    conda create -n balsa python=3.7 -y
    conda activate balsa
    
    pip install -r requirements.txt
    pip install -e .
    pip install -e pg_executor
  2. Install Postgres v12.5.

    This can be done in several ways. For example, installing from source:

    cd ~/
    wget https://ftp.postgresql.org/pub/source/v12.5/postgresql-12.5.tar.gz
    tar xzvf postgresql-12.5.tar.gz
    cd postgresql-12.5
    ./configure --prefix=/data/postgresql-12.5 --without-readline
    sudo make -j
    sudo make install
    
    echo 'export PATH=/data/postgresql-12.5/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc
  3. Install the pg_hint_plan extension v1.3.7.

    cd ~/
    git clone https://github.com/ossc-db/pg_hint_plan.git -b REL12_1_3_7
    cd pg_hint_plan
    # Modify Makefile: change line
    #   PG_CONFIG = pg_config
    # to
    #   PG_CONFIG = /data/postgresql-12.5/bin/pg_config
    vim Makefile
    make
    sudo make install
  4. Load data into Postgres & start it with the correct configuration.

    For example, load the Join Order Benchmark (JOB) tables:
    cd ~/
    mkdir -p datasets/job && pushd datasets/job
    wget -c http://homepages.cwi.nl/~boncz/job/imdb.tgz && tar -xvzf imdb.tgz && popd
    # Prepend headers to CSV files
    python3 ~/balsa/scripts/prepend_imdb_headers.py
    
    # Create and start the DB
    pg_ctl -D ~/imdb initdb
    
    # Copy custom PostgreSQL configuration.
    cp ~/balsa/conf/balsa-postgresql.conf ~/imdb/postgresql.conf
    
    # Start the server
    pg_ctl -D ~/imdb start -l logfile
    
    # Load data + run analyze (can take several minutes)
    cd ~/balsa
    bash load-postgres/load_job_postgres.sh ~/datasets/job

    Perform basic checks:

    psql imdbload
    # Check that both primary and foreign key indexes are built:
    imdbload=# \d title
    [...]
    
    # Check that data count is correct:
    imdbload=# select count(*) from title;
      count
    ---------
    2528312
    (1 row)

NOTE: Using one machine is only for quickly trying out Balsa. To cleanly reproduce results, use Cluster mode, which automates the above setup in the cloud and separates the training machine from the query execution machines.

Quickstart

First, run the baseline PostgreSQL plans (the expert) on the Join Order Benchmark:

python run.py --run Baseline --local

This will prompt you to log into Weights & Biases to track experiments and visualize metrics easily, which we highly recommend. (You can disable it by prepending the environment variable WANDB_MODE=disabled or WANDB_MODE=offline, e.g., WANDB_MODE=disabled python run.py --run Baseline --local.)

The first run may take a while due to warm-up. After it finishes, you should see messages like:

latency_expert/workload (seconds): 156.62 (113 queries)
...
wandb: latency_expert/workload 156.61727

Next, to launch a Balsa experiment:

python run.py --run Balsa_JOBRandSplit --local

The first time this is run, simulation data is collected for all training queries, which takes about 5 minutes (the data is cached for future runs):

...
I0323 04:15:48.212239 140382943663744 sim.py:738] Collection done, stats:
I0323 04:15:48.212319 140382943663744 sim.py:742]   num_queries=94 num_collected_queries=77 num_points=516379 latency_s=309.3
I0323 04:16:39.296590 140382943663744 sim.py:666] Saved simulation data (len 516379) to: data/sim-data-88bd801a.pkl

Balsa's simulation-to-reality approach first trains an agent in simulation, then continues training with real execution. To speed up the simulation phase, we provide pretrained checkpoints for the simulation agent:

...
I0323 04:18:26.856739 140382943663744 sim.py:985] Loaded pretrained checkpoint: checkpoints/sim-MinCardCost-rand52split-680secs.ckpt

Then, the agent starts the first iteration of real-execution learning: it plans all training queries, sends the plans off for execution, and waits for them to finish. Periodically, test queries are planned and executed for logging.
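
Conceptually, one real-execution iteration follows the loop sketched below (illustrative pseudocode only; the function and object names here are not the actual run.py API):

def run_one_iteration(agent, train_queries, executor, timeout_s):
    # 1. Plan every training query with the current value network (beam search).
    plans = [agent.plan(q) for q in train_queries]
    # 2. Execute the plans, bounded by the current per-plan timeout (safe execution),
    #    and record the observed latencies as new experience.
    for query, plan in zip(train_queries, plans):
        latency_s = executor.execute(plan, timeout_s=timeout_s)
        agent.add_experience(query, plan, latency_s)
    # 3. Retrain the value network on all experience collected so far.
    agent.train()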

Handy commands:

  • To kill the experiment(s): pkill -f run.py
  • To monitor a machine: dstat -mcdn

Experiment configs

All experiments and their hyperparameters are declared in experiments.py. To run an experiment with the local Postgres execution engine:

# <name> is a config registered in experiments.py.
python run.py --run <name> --local

Main Balsa agent:

Benchmark               Config
JOB (Random Split)      Balsa_JOBRandSplit
JOB Slow (Slow Split)   Balsa_JOBSlowSplit

Ablation: impact of the simulator (Figure 10):

Variant                  Config
Balsa Sim (main agent)   Balsa_JOBRandSplit
Expert Sim               JOBRandSplit_PostgresSim
No Sim                   JOBRandSplit_NoSim

NOTE: Running JOBRandSplit_PostgresSim for the first time will be slow (1.1 hours) due to simulation data being collected from EXPLAIN. This data is cached in data/ for future runs.

Ablation: impact of the timeout mechanism (Figure 11):

Variant                               Config
Balsa (safe execution) (main agent)   Balsa_JOBRandSplit
no timeout                            JOBRandSplit_NoTimeout

Ablation: impact of exploration schemes (Figure 12):

Variant                                 Config
Balsa (safe exploration) (main agent)   Balsa_JOBRandSplit
epsilon-greedy                          JOBRandSplit_EpsGreedy
no exploration                          JOBRandSplit_NoExplore

Ablation: impact of training schemes (Figure 13):

Variant                          Config
Balsa (on-policy) (main agent)   Balsa_JOBRandSplit
retrain                          JOBRandSplit_RetrainScheme

Comparison with learning from expert demonstrations (Neo-impl) (Figure 15):

Variant              Config
Balsa (main agent)   Balsa_JOBRandSplit
Neo-impl             NeoImpl_JOBRandSplit

Diversified experiences (Figure 16):

Variant                               Config
Balsa (Main agents)                   JOB: Balsa_JOBRandSplit; JOB Slow: Balsa_JOBSlowSplit
Balsa-8x (uses 8 main agents' data)   JOB: Balsa_JOBRandSplitReplay; JOB Slow: Balsa_JOBSlowSplitReplay

Generalizing to highly distinct join templates, Ext-JOB (Figure 17):

Variant                               Config
Balsa, data collection agent          Balsa_TrainJOB_TestExtJOB
Balsa-1x (uses 1 base agent's data)   Balsa1x_TrainJOB_TestExtJOB
Balsa-8x (uses 8 base agents' data)   Balsa8x_TrainJOB_TestExtJOB

NOTE: When running an Ext-JOB config for the first time, you may see the error Missing nodes in init_experience. This means data/initial_policy_data.pkl contains the expert latencies of all 113 JOB queries for printing (assuming you ran the previous configs first) but lacks the new Ext-JOB queries. To fix this, rename the previous .pkl file and rerun the new Ext-JOB config, which will automatically run the expert plans of the new query set to regenerate this file (as well as gather the simulation data).

To specify a new experiment, subclass an existing config (give the subclass a descriptive name), change the values of some hyperparameters, and register the new subclass.
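
A minimal sketch of such a subclass is shown below (the Params() override pattern and the hyperparameter names here are assumptions based on the existing configs; see experiments.py for the actual parameters and registration mechanism):

class Balsa_JOBRandSplit_SmallLR(Balsa_JOBRandSplit):

    def Params(self):
        p = super().Params()
        # Illustrative overrides only; consult experiments.py for real hyperparameter names.
        p.lr = 1e-4      # example: smaller learning rate
        p.epochs = 150   # example: more training epochs per iteration
        return p

# After registering the subclass the same way existing configs are registered:
#   python run.py --run Balsa_JOBRandSplit_SmallLR --local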

Metrics and artifacts

Each run's metrics and artifacts are logged to its log dir (path of the form ./wandb/run-20220323_051830-1dq64dx5/), managed by W&B. We recommend creating an account and running wandb login, so that these are automatically logged to their UI for visualization.

Key metrics to look at

  • Expert performance
    • latency_expert/workload: total time of expert (PostgreSQL optimizer) plans on training queries, in seconds
    • latency_expert_test/workload: for test queries
  • Agent performance
    • latency/workload: total time of Balsa's plans on training queries, in seconds
    • latency_test/workload: for test queries
  • Agent progress
    • curr_value_iter: which iteration the agent is in
    • num_query_execs: total number of unique query plans executed, for training queries
    • curr_iter_skipped_queries: for the current iteration, the number of training plans that were already executed before (and thus served from cache)
    • Wallclock duration is an x-axis choice in W&B: Relative Time (Wall)
    • curr_timeout: current iteration's timeout in seconds for any training plan; for Balsa's safe execution
  • Learning efficiency: plot latency/workload vs. wallclock duration
  • Data efficiency: plot latency/workload vs. num_query_execs

Other (and less important) metrics are also available, such as latency(_expert)(_test)/q<query number> for per-query latencies.

Example visualization: group runs by the config field cls in the W&B UI; a programmatic alternative is sketched below.
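
To pull these metrics programmatically instead, the sketch below uses the W&B public API (the project path my-entity/balsa and the config key cls are placeholders/assumptions; substitute your own):

# Sketch: data efficiency, i.e., latency/workload vs. num_query_execs, from W&B.
import matplotlib.pyplot as plt
import wandb

api = wandb.Api()
runs = api.runs('my-entity/balsa', filters={'config.cls': 'Balsa_JOBRandSplit'})
for run in runs:
    hist = run.history(keys=['num_query_execs', 'latency/workload'])
    plt.plot(hist['num_query_execs'], hist['latency/workload'], label=run.name)
plt.xlabel('num_query_execs')
plt.ylabel('latency/workload (s)')
plt.legend()
plt.show()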

Artifacts

Tracking metrics is sufficient for running and monitoring experiments. For advanced use cases, we provide a few artifacts (saved locally and automatically uploaded to W&B).

Agent checkpoints and the experience buffer accumulated so far are saved periodically (every 5 iterations) to:

Saved iter=124 checkpoint to: <Balsa dir>/wandb/run-20220323_051830-1dq64dx5/files/checkpoint.pt
Saved Experience to: <Balsa dir>/data/replay-Balsa_JOBSlowSplit-9409execs-11844nodes-37s-124iters-1dq64dx5.pkl

The experience buffers are useful for training on diversified experiences (see paper for details).

Best training plans so far: under <logdir>/files/best_plans/

  • *.sql: a hinted version of each training query, where the hint encodes the best plan found so far; these files can be piped into psql for re-execution
    • all.sql: concatenates all hinted training queries
  • latencies.txt: the best latency so far for each training query; the last row, "all", sums over all training queries

This means Balsa can be used to find the best possible plans for a set of high-value queries.

Hyperparameters of each run: <logdir>/files/params.txt. Hparams are also (1) printed at the beginning of each run and (2) captured in W&B's Config table. Typically, referring to each hparam config by its class name, such as Balsa_JOBSlowSplit, is sufficient, but if any change is made in run.py#Main(), the sources above will capture it.

Cluster mode

We recommend launching a multi-node Postgres cluster to support multiple agent runs in parallel. This is optional, but highly recommended: Agent runs can have some variance and thus it's valuable to measure the median and variance of agent performance across several runs of the same config.

Instructions below use Ray to launch a cluster on AWS; refer to the Ray documentation for credentials setup. See the same docs for launching on other clouds or an on-premise cluster.

NOTE: The AWS instance types used below may not have exactly the same performance as the Azure cluster reported in our paper. The overall magnitude/trends, however, should be the same. See Cluster hardware used in paper.

Launching a cluster

Launch the cluster from your local machine and Ray will handle installing all necessary dependencies and configuring both the GPU driver node (cluster head) and the Postgres execution nodes (cluster workers).

On your laptop, edit the cluster configuration file such that the number of nodes and file mounts are correctly configured; see all NOTE comments. By default it uses a g4dn.12xlarge head node (48 vCPUs, 192GB RAM, 4 T4 GPUs) and 20 r5a.2xlarge worker nodes (8 vCPUs, 64GB RAM, 256GB SSD per node).

cd ~/balsa
vim balsa/cluster/cluster.yml

Start the cluster.

ray up balsa/cluster/cluster.yml

Log in to the head node.

ray attach balsa/cluster/cluster.yml

On the head node, check if the workers are ready. It can take 10-20 minutes while the workers are getting set up.

# Run this on the head node
conda activate balsa
cd ~/balsa/balsa/cluster/
python check_cluster.py

NOTE: To monitor the detailed status of worker launching and setup, run on your laptop: ray exec cluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'.

Before warming up the Postgres worker nodes, you need to log in to the AWS console and add one rule to the security group:

1. Click any worker node's "Instance ID" on AWS EC2 web portal
2. Click "Security"
3. Click the security group "sg-xxxx (ray-autoscaler-<cluster name>)"
4. Click "Edit inbound rules"
5. Add rule -> "All TCP" and "Anywhere-IPv4"
6. Click "Save rules"

Warm up the Postgres worker nodes:

# Run this on the head node
cd ~/balsa/balsa/cluster
python warmup_pg_servers.py

Running experiments on the cluster

Log in to the head node and run:

ray attach balsa/cluster/cluster.yml
wandb login

Before launching parallel runs, launch a single run first:

python run.py --run Baseline
python run.py --run <name>

This ensures that the appropriate simulation data is generated and cached and that data/initial_policy_data.pkl is correct for the query split. After you've observed it run successfully into the first iteration of the real-execution stage (Planning took ...), safely kill it with pkill -f run.py. Then, launch multiple runs:

# Run this on the head node
# Usage: bash scripts/launch.sh <name> <N>
bash scripts/launch.sh Balsa_JOBRandSplit 8

# To monitor:
#   (1) W&B
#   (2) tail -f Balsa_JOBRandSplit-1.log

Editing code

You can edit Balsa code on your laptop under the working directory path specified under file_mounts in the balsa/cluster/cluster.yml file. Push updated code to the cluster with the following (this will interrupt ongoing runs):

ray up balsa/cluster/cluster.yml --restart-only

Now you can log in to the head node and launch new experiments with the updated changes.

Cluster hardware used in paper

For the experiments reported in our paper, we ran 8 parallel agent runs on the head node, sending queries to 20 Postgres nodes (i.e., 2.5 concurrent queries per agent run). We used Ubuntu 18.04 virtual machines on Azure:

  • head node (training/inference): a Standard_NV24_Promo VM
    • 24 vCPUs, 224GB memory, 4 M60 GPUs; each agent run uses 0.5 GPU
  • query execution nodes (the target hardware environment to optimize for): 20 Standard_E8as_v4 VMs
    • 8 vCPUs, 64GB memory, 128GB SSD (Premium SSD LRS) per VM

This is different from what is included in the cluster template above (AWS). To exactly reproduce, change the cluster template to launch the above or manually use the Azure portal to launch a VMSS.

If your cluster config differs in size or hardware, the numbers obtained may be different and the provided Postgres config may need to be adjusted. The overall trends and magnitudes should be similar.

Q&A

How to train a Balsa agent with diversified experiences

Once 8 main agent runs have finished, their experience buffers will be saved to, say, ./data/replay-Balsa_JOBRandSplit-*-499iters-*.pkl. Then, launch Balsa_JOBRandSplitReplay to load all experience buffers that match replay-Balsa_JOBRandSplit-* (specified by the glob pattern p.prev_replay_buffers_glob in experiments.py) and train a Balsa-8x agent.

python run.py --run Balsa_JOBRandSplitReplay
# Or:
bash scripts/launch.sh Balsa_JOBRandSplitReplay 8

If you have more buffers that match the glob pattern, be sure to change p.prev_replay_buffers_glob to specify the desired buffers.

We also recommend using one experience buffer as a hold-out validation set, by specifying p.prev_replay_buffers_glob_val, for more stable test performance. Note that in this case the buffers used for training and validation need to be placed in separate directories, and the glob patterns need to be changed accordingly.
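
For example, a config that trains on seven runs' buffers and validates on the eighth could look roughly like the sketch below (directory names are placeholders; only the two glob parameters are documented above, and the Params() override pattern mirrors the existing configs by assumption):

class MyBalsa_JOBRandSplitReplay(Balsa_JOBRandSplitReplay):

    def Params(self):
        p = super().Params()
        # Train on buffers from 7 runs; hold out 1 run's buffer for validation.
        p.prev_replay_buffers_glob = 'data/replay_train/replay-Balsa_JOBRandSplit-*.pkl'
        p.prev_replay_buffers_glob_val = 'data/replay_val/replay-Balsa_JOBRandSplit-*.pkl'
        return p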

Other diversified experience configs are: Balsa_JOBSlowSplitReplay, Balsa1x_TrainJOB_TestExtJOB, and Balsa8x_TrainJOB_TestExtJOB.

My experiment failed with a "Hint not respected" error

This occasionally happens when Postgres fails to respect the query plan hint generated by Balsa. This is a fundamental limitation of the pg_hint_plan extension. We suggest launching new runs to replace the failed ones.

Citation

@inproceedings{balsa,
  author =       {Yang, Zongheng and Chiang, Wei-Lin and Luan, Sifei and Mittal,
                  Gautam and Luo, Michael and Stoica, Ion},
  title =        {Balsa: Learning a Query Optimizer Without Expert
                  Demonstrations},
  booktitle =    {Proceedings of the 2022 International Conference on Management
                  of Data},
  year =         2022,
  pages =        {931--944},
  doi =          {10.1145/3514221.3517885},
  url =          {https://doi.org/10.1145/3514221.3517885},
  address =      {New York, NY, USA},
  isbn =         9781450392495,
  keywords =     {machine learning for systems, learned query optimization},
  location =     {Philadelphia, PA, USA},
  numpages =     14,
  publisher =    {Association for Computing Machinery},
  series =       {SIGMOD/PODS '22},
}

@misc{balsa_github,
  howpublished = {\url{https://github.com/balsa-project/balsa}},
  title =        {Balsa source code},
  year =         2022,
}


balsa's Issues

IndexError occurred when running python run.py --run Balsa_TPCH --local

I am interested in your code and tried to run it with TPC-H. I wrote a subclass of Balsa_JOBRandSplit and changed p as follows.

p.db = 'tpchload'
p.sim_checkpoint = None
p.query_dir = 'queries/myTpchTest'
p.query_glob = ['*.sql']
p.test_query_glob = TPCH_TEST_QUERIES

The PostgreSQL version and conda environment are the same as recommended in README.md. When I run it with python run.py --run Balsa_TPCH --local, an error occurs with the following traceback.

Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/xxx/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2150, in Main
    agent = BalsaAgent(p)
  File "run.py", line 754, in __init__
    self.exp, self.exp_val = self._MakeExperienceBuffer()
  File "run.py", line 809, in _MakeExperienceBuffer
    wi = self.GetOrTrainSim().training_workload_info
  File "run.py", line 1160, in GetOrTrainSim
    self.sim = TrainSim(p, self.loggers)
  File "run.py", line 379, in TrainSim
    sim.CollectSimulationData()
  File "/home/xxx/balsa/sim.py", line 728, in CollectSimulationData
    self.search.Run(query_node, query_node.info['sql_str'])
  File "/home/xxx/balsa/balsa/search.py", line 245, in Run
    dp_tables)
  File "/home/xxx/balsa/balsa/search.py", line 317, in _dp_bushy_search_space
    return list(dp_tables[num_rels].values())[0][1], dp_tables
IndexError: list index out of range

I use only three queries in the query_dir, for example:

select
  supp_nation,
  cust_nation,
  l_year,
  sum(volume) as revenue
from
  (
    select
      n1.n_name as supp_nation,
      n2.n_name as cust_nation,
      extract(year from l_shipdate) as l_year,
      l_extendedprice * (1 - l_discount) as volume
    from
      supplier,
      lineitem,
      orders,
      customer,
      nation n1,
      nation n2
    where
      s_suppkey = l_suppkey
      and o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey
      and (
        (n1.n_name = 'VIETNAM' and n2.n_name = 'UNITED KINGDOM')
        or (n1.n_name = 'UNITED KINGDOM' and n2.n_name = 'VIETNAM')
      )
      and l_shipdate between date '1995-01-01' and date '1996-12-31'
  ) as shipping
group by
  supp_nation,
  cust_nation,
  l_year
order by
  supp_nation,
  cust_nation,
  l_year;

Then I added print(join_graph) after Line 257 of balsa/balsa/search.py, which is

r = r_tup[1]

and it shows "Graph with 0 nodes and 0 edges". I suspect I am not getting a correct join graph at Line 224 of balsa/balsa/search.py, which is

join_graph, all_join_conds = query_node.GetOrParseSql()

I then checked the definition of GetOrParseSql(self) in balsa/balsa/util/plans_lib.py and printed graph and join_conds: it shows Graph with 0 nodes and 0 edges for the graph and [] for join_conds. Next, I checked simple_sql_parser in balsa/balsa/util/simple_sql_parser.py and printed the result of join_conds after

join_conds = join_cond_pat.findall(sql)

The sql is one of the queries in my query_dir, but join_conds is still []. I checked the regular expression and suspect it cannot handle expressions like c_custkey = o_custkey in my queries, since the regular expression expects dots (table-qualified column names).
As introduced in the paper, TPC-H is used as a benchmark. Could you please give me some hints about the parser problem above, or add some code for TPC-H? Many thanks in advance.
Another point of confusion: when I run the above command for the first time and set

p.query_glob = ['test1.sql', 'test2.sql', 'test3.sql']
p.test_query_glob = ['test1.sql']

it shows

3 train queries: ['test1', 'test2', 'test3']
0 test queries: []
wandb: (1) Create a W&B account

even though test_query_glob is ['test1.sql'] in the BalsaAgent params. I am also curious why we need to obtain the baseline PostgreSQL performance by running all test and training queries before training. I look forward to your reply!

IndexError occurred when executing python run.py --run Balsa_JOBRandSplit --local

I tried to run this project and ran into some problems. Following the instructions in README.md, I installed the requirements and used this command to run it.

python run.py --run Balsa_JOBRandSplit --local

The only difference is that I'm using PostgreSQL 13.5, and I modified the default connection settings in pg_executor/pg_executor/pg_executor.py to fit my environment.

LOCAL_DSN = "dbname=imdb user=postgres"

An error occurred the first time I tried to run it, with the following traceback:

Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/xxx/.conda/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/xxx/.conda/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2151, in Main
    agent.Run()
  File "run.py", line 2100, in Run
    has_timeouts = self.RunOneIter()
  File "run.py", line 1831, in RunOneIter
    model, dataset = self.Train()
  File "run.py", line 1221, in Train
    log=not train_from_scratch)
  File "run.py", line 883, in _MakeDatasetAndLoader
    skip_training_on_timeouts=p.skip_training_on_timeouts)
  File "/xxx/balsa/balsa/experience.py", line 560, in featurize
    skip_training_on_timeouts=skip_training_on_timeouts)
  File "/xxx/balsa/balsa/experience.py", line 393, in _featurize_dedup
    self.featurizer, all_subtrees)
  File "/xxx/balsa/balsa/experience.py", line 38, in TreeConvFeaturize
    plan_featurizer)
  File "/xxx/balsa/balsa/models/treeconv.py", line 268, in make_and_featurize_trees
    indexes = torch.from_numpy(_batch([_make_indexes(x) for x in trees])).long()
  File "/xxx/balsa/balsa/models/treeconv.py", line 268, in <listcomp>
    indexes = torch.from_numpy(_batch([_make_indexes(x) for x in trees])).long()
  File "/xxx/balsa/balsa/models/treeconv.py", line 218, in _make_indexes
    preorder_ids, _ = _make_preorder_ids_tree(root)
  File "/xxx/balsa/balsa/models/treeconv.py", line 197, in _make_preorder_ids_tree
    root_index=root_index + 1)
  File "/xxx/balsa/balsa/models/treeconv.py", line 197, in _make_preorder_ids_tree
    root_index=root_index + 1)
  File "/xxx/balsa/balsa/models/treeconv.py", line 197, in _make_preorder_ids_tree
    root_index=root_index + 1)
  [Previous line repeated 1 more time]
  File "/xxx/balsa/balsa/models/treeconv.py", line 199, in _make_preorder_ids_tree
    root_index=lhs_max_id + 1)
  File "/xxx/balsa/balsa/models/treeconv.py", line 197, in _make_preorder_ids_tree
    root_index=root_index + 1)
  File "/xxx/balsa/balsa/models/treeconv.py", line 199, in _make_preorder_ids_tree
    root_index=lhs_max_id + 1)
  File "/xxx/balsa/balsa/models/treeconv.py", line 198, in _make_preorder_ids_tree
    rhs, rhs_max_id = _make_preorder_ids_tree(curr.children[1],
IndexError: list index out of range

I used print() to show curr and found that the experience buffer contained a Bitmap Heap Scan node.

print(curr, curr.children, root_index)
// Bitmap Heap Scan [movie_keyword AS mk] cost=1131.96
//  Bitmap Index Scan [movie_keyword AS mk] cost=6.74
// [Bitmap Index Scan [movie_keyword AS mk] cost=6.74
//] 10

I worked around the issue by treating the Bitmap Heap Scan node as a leaf node, but I hit another error when I restarted run.py.

Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/xxx/.conda/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/xxx/.conda/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2151, in Main
    agent.Run()
  File "run.py", line 2100, in Run
    has_timeouts = self.RunOneIter()
  File "run.py", line 1831, in RunOneIter
    model, dataset = self.Train()
  File "run.py", line 1221, in Train
    log=not train_from_scratch)
  File "run.py", line 883, in _MakeDatasetAndLoader
    skip_training_on_timeouts=p.skip_training_on_timeouts)
  File "/xxx/balsa/balsa/experience.py", line 560, in featurize
    skip_training_on_timeouts=skip_training_on_timeouts)
  File "/xxx/balsa/balsa/experience.py", line 393, in _featurize_dedup
    self.featurizer, all_subtrees)
  File "/xxx/balsa/balsa/experience.py", line 38, in TreeConvFeaturize
    plan_featurizer)
  File "/xxx/balsa/balsa/models/treeconv.py", line 269, in make_and_featurize_trees
    _batch([_featurize_tree(x, node_featurizer) for x in trees
  File "/xxx/balsa/balsa/models/treeconv.py", line 269, in <listcomp>
    _batch([_featurize_tree(x, node_featurizer) for x in trees
  File "/xxx/balsa/balsa/models/treeconv.py", line 255, in _featurize_tree
    _bottom_up(curr_node)
  File "/xxx/balsa/balsa/models/treeconv.py", line 249, in _bottom_up
    left_vec = _bottom_up(curr.children[0])
  File "/xxx/balsa/balsa/models/treeconv.py", line 249, in _bottom_up
    left_vec = _bottom_up(curr.children[0])
  File "/xxx/balsa/balsa/models/treeconv.py", line 249, in _bottom_up
    left_vec = _bottom_up(curr.children[0])
  [Previous line repeated 1 more time]
  File "/xxx/balsa/balsa/models/treeconv.py", line 250, in _bottom_up
    right_vec = _bottom_up(curr.children[1])
  File "/xxx/balsa/balsa/models/treeconv.py", line 249, in _bottom_up
    left_vec = _bottom_up(curr.children[0])
  File "/xxx/balsa/balsa/models/treeconv.py", line 250, in _bottom_up
    right_vec = _bottom_up(curr.children[1])
  File "/xxx/balsa/balsa/models/treeconv.py", line 249, in _bottom_up
    left_vec = _bottom_up(curr.children[0])
  File "/xxx/balsa/balsa/models/treeconv.py", line 246, in _bottom_up
    vec = node_featurizer.FeaturizeLeaf(curr)
  File "/xxx/balsa/balsa/util/plans_lib.py", line 726, in FeaturizeLeaf
    scan_operator_idx = np.where(self.scan_ops == node.node_type)[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

I found that the error is caused by scan methods that are not included in the parameter search_space_scan_ops. In BalsaAgent._MakeWorkload() in run.py, JOB queries are loaded and PostgreSQL plans are obtained through explain (costs, format json). The loaded train_nodes are then used to initialize the experience set (run.py, line 826).

I'm also confused about the procedure. As described in the paper, Balsa bootstraps from a simulator and never uses an expert optimizer. So is it OK to simply replace train_nodes with an empty list? And if I want to use a different train/test split, how can I train the model with the simulator to get a checkpoint?

The version of CUDA

I chose the third option, i.e., (3) Don't visualize my results, to run this code, but ran into an error (possibly related to the CUDA version). The details are as follows:

Traceback (most recent call last):
  File "run.py", line 2157, in <module>
    app.run(Main)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2153, in Main
    agent.Run()
  File "run.py", line 2102, in Run
    has_timeouts = self.RunOneIter()
  File "run.py", line 1843, in RunOneIter
    is_test=False)
  File "run.py", line 1415, in PlanAndExecute
    avoid_eq_filters=is_test and p.avoid_eq_filters,
File "/home/xjc/balsa/balsa/optimizer.py", line 255, in plan
    **kwargs)
  File "/home/xjc/balsa/balsa/optimizer.py", line 508, in _beam_search_bk
    [join for join, _, _ in possible_plans])
  File "/home/xjc/balsa/balsa/optimizer.py", line 217, in infer
    cost1 = self.value_network(query_feat, plan_feat, pos_feat)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "run.py", line 544, in forward
    return self.model(query_feat, plan_feat, indexes)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xjc/balsa/balsa/models/treeconv.py", line 96, in forward
    query_embs = self.query_mlp(query_feats.unsqueeze(1))
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/xjc/anaconda3/envs/balsa/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

What should I do to solve it? Thanks a lot!

statement timeout and 'Forgot to call Stop()?' occurred when executing python run.py --run Balsa_JOBRandSplit --local

I tried to run this project and ran into some problems. Following the instructions in README.md, I installed the requirements and ran this command:
python run.py --run Balsa_JOBRandSplit --local
An error occurred the first time I tried to run it, with the following traceback:

Waiting on Ray tasks...value_iter=41
ray.get(tasks) received exception: ray::ExecuteSql() (pid=33391, ip=192.168.199.173)
  File "run.py", line 172, in ExecuteSql
    remote=not use_local_execution)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 89, in ExplainAnalyzeSql
    remote=remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 216, in _run_explain
    timeout_ms, cursor, remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 248, in _run_explain
    return pg_executor.Execute(s, verbose, geqo_off, timeout_ms, cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 163, in Execute
    _SetGeneticOptimizer('default', cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 102, in _SetGeneticOptimizer
    cursor.execute('set geqo = {};'.format(flag))
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
Canceling Ray tasks.
Retrying PlanAndExecute() (max_retries=3).
Traceback (most recent call last):
  File "run.py", line 1507, in PlanAndExecute
    refs = ray.get(tasks)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(QueryCanceled): ray::ExecuteSql() (pid=33391, ip=192.168.199.173)
  File "run.py", line 172, in ExecuteSql
    remote=not use_local_execution)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 89, in ExplainAnalyzeSql
    remote=remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 216, in _run_explain
    timeout_ms, cursor, remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 248, in _run_explain
    return pg_executor.Execute(s, verbose, geqo_off, timeout_ms, cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 163, in Execute
    _SetGeneticOptimizer('default', cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 102, in _SetGeneticOptimizer
    cursor.execute('set geqo = {};'.format(flag))
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
   File "run.py", line 2151, in Main
    agent.Run()
  File "run.py", line 2100, in Run
    has_timeouts = self.RunOneIter()
  File "run.py", line 1841, in RunOneIter
    is_test=False)
  File "run.py", line 1520, in PlanAndExecute
    max_retries=max_retries - 1)
  File "run.py", line 1401, in PlanAndExecute
    self.timer.Start('plan_test_set' if is_test else 'plan')
  File "/data/pjz_data/optimizer/balsa/train_utils.py", line 231, in Start
    assert self.curr_stage is None, 'Forgot to call Stop()?'
AssertionError: Forgot to call Stop()?
Traceback (most recent call last):
  File "run.py", line 1507, in PlanAndExecute
    refs = ray.get(tasks)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(QueryCanceled): ray::ExecuteSql() (pid=33391, ip=192.168.199.173)
  File "run.py", line 172, in ExecuteSql
    remote=not use_local_execution)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 89, in ExplainAnalyzeSql
    remote=remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 216, in _run_explain
    timeout_ms, cursor, remote)
  File "/data/pjz_data/optimizer/balsa/balsa/util/postgres.py", line 248, in _run_explain
    return pg_executor.Execute(s, verbose, geqo_off, timeout_ms, cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 163, in Execute
    _SetGeneticOptimizer('default', cursor)
  File "/data/pjz_data/optimizer/balsa/pg_executor/pg_executor/pg_executor.py", line 102, in _SetGeneticOptimizer
    cursor.execute('set geqo = {};'.format(flag))
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run.py", line 2155, in <module>
    app.run(Main)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/zju/anaconda3/envs/balsa/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run.py", line 2151, in Main
    agent.Run()
  File "run.py", line 2100, in Run
    has_timeouts = self.RunOneIter()
  File "run.py", line 1841, in RunOneIter
    is_test=False)
  File "run.py", line 1520, in PlanAndExecute
    max_retries=max_retries - 1)
  File "run.py", line 1401, in PlanAndExecute
    self.timer.Start('plan_test_set' if is_test else 'plan')
  File "/data/pjz_data/optimizer/balsa/train_utils.py", line 231, in Start
    assert self.curr_stage is None, 'Forgot to call Stop()?'
AssertionError: Forgot to call Stop()?
