
ml-compiler-opt's Introduction

Infrastructure for MLGO - a Machine Learning Guided Compiler Optimizations Framework.

MLGO is a framework for integrating ML techniques systematically in LLVM. It replaces human-crafted optimization heuristics in LLVM with machine learned models. The MLGO framework currently supports two optimizations:

  1. inlining-for-size (LLVM RFC);
  2. register-allocation-for-performance (LLVM RFC).

The compiler components are both available in the main LLVM repository. This repository contains the training infrastructure and related tools for MLGO.

We use two different ML algorithms to train policies: Policy Gradient and Evolution Strategies. Currently, this repository only supports Policy Gradient training; the release of Evolution Strategies training is on our roadmap.

Check out this demo for an end-to-end demonstration of how to train your own inlining-for-size policy from scratch with Policy Gradient, or check out this demo for a demonstration of how to train your own regalloc-for-performance policy.

For more details about MLGO, please refer to our paper MLGO: a Machine Learning Guided Compiler Optimizations Framework.

For more details about how to contribute to the project, please refer to contributions.

Pretrained models

We occasionally release pretrained models that may be used as-is with LLVM. Models are released as GitHub releases and are named [task]-[major-version].[minor-version] (e.g. inlining-Oz-v1.1). The versions are semantic: the major version corresponds to breaking changes on the LLVM/compiler side, and the minor version corresponds to model updates that are independent of the compiler.

When building LLVM, there is a flag -DLLVM_INLINER_MODEL_PATH which you may set to the path to your inlining model. If the path is set to download, then cmake will download the most recent compatible model from GitHub at build time. Example values for the flag:

# Model is in /tmp/model, i.e. there is a file /tmp/model/saved_model.pb along
# with the rest of the tensorflow saved_model files produced from training.
-DLLVM_INLINER_MODEL_PATH=/tmp/model

# Download the most recent compatible model
-DLLVM_INLINER_MODEL_PATH=download

Prerequisites

Currently, the assumptions for the system are:

  • Recent Ubuntu distro, e.g. 20.04
  • python 3.8.x/3.9.x/3.10.x
  • for local training, which is currently the only supported mode, we recommend a high-performance workstation (e.g. 96 hardware threads).

Training assumes a clang build with ML 'development-mode' enabled; please refer to the demo docs for how to build one.

The model-training-specific prerequisites are:

Pipenv:

pip3 install pipenv

The actual dependencies:

pipenv sync --system

Note that the above command will only work from the root of the repository since it needs to have Pipfile.lock in the working directory at the time of execution.

If you plan on doing development work, make sure you grab the development and CI categories of packages as well:

pipenv sync --system --categories "dev-packages ci"

Optionally, to run tests (run_tests.sh), you also need:

sudo apt-get install virtualenv

Note that the same tensorflow package is also needed for building the 'release' mode LLVM.

Docs

  • An end-to-end demo using Fuchsia as a codebase from which we extract a corpus and train a model.

  • A guide on how to add a feature, describing the extensibility model.


ml-compiler-opt's Issues

Plan to port Evolutionary Strategy algorithm

This is a port of the work by: Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, Adrian Weller: "Structured Evolution with Compact Architectures for Scalable Policy Optimization", https://arxiv.org/abs/1804.02395

This plan will follow three main steps:

  1. Port the existing codebase
  2. Add type annotations and follow the idioms of the ml-compiler-opt codebase
  3. Create a training pipeline for the Evolutionary Strategy

[infra] Support distributed compilation

This tracks incoming work that aims to support distributing the data collection steps (roughly what CompilationRunner implementations do today). The design is mainly aimed at (exclusively) supporting distribution via dask.distributed.
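For illustration, here is a minimal sketch of that direction, assuming dask.distributed's Client/map/gather API; the runner.collect_data method and module_specs argument are hypothetical stand-ins for whatever the CompilationRunner implementations actually expose.

from dask.distributed import Client

def distributed_collect(scheduler_address, runner, module_specs):
  # Fan the per-module data collection out to a dask cluster; each spec
  # becomes one remote task, mirroring what a local runner does today.
  client = Client(scheduler_address)
  futures = client.map(runner.collect_data, module_specs)
  return client.gather(futures)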

Running the demo's train_locally.py: Failed to create saved model evaluator

Hello, I'm running this command:

rm -rf $OUTPUT_DIR && \
  PYTHONPATH=$PYTHONPATH:. python3 \
  compiler_opt/rl/train_locally.py \
  --root_dir=$OUTPUT_DIR \
  --data_path=$CORPUS \
  --gin_bindings=clang_path="'$LLVM_INSTALLDIR/bin/clang'" \
  --gin_bindings=llvm_size_path="'$LLVM_INSTALLDIR/bin/llvm-size'" \
  --num_modules=100 \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin \
  --gin_bindings=train_eval.warmstart_policy_dir="$WARMSTART_OUTPUT_DIR/saved_policy"

The script told me --num_modules is not a usable flag, so I changed the command to --num_workers=100. But then I get the following errors:

2022-09-24 07:12:57.902522: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
I0924 07:12:58.107042 139987454576448 local_data_collector.py:78] Waiting for pending work from last iteration took 0.000004
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpeb9wk1gz/policy
Could not find TF_Output named: StatefulPartitionedCall
error: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpsc32ijpx/policy
Could not find TF_Output named: StatefulPartitionedCall
error: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmp0xz05pcf/policy
Could not find TF_Output named: StatefulPartitionedCall
error: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.

Do you have any idea?

progress on supporting lto

Hi @yundiqian @mtrofin, I haven't focused on this project for a long time because of the performance comparison between full LTO and the ML-trained policy. Is there any plan to improve the size reduction with full LTO enabled?

IR2Native Model?

Thanks for open-sourcing the code!
I would like to ask: does the repo contain the code for the IR2Native supervised model that estimates the function size of a caller?

From skimming the files in the repo, it seems to contain only the RL model.

Fix usage of TENSORFLOW_C_LIB_PATH in benchmarking utils

Currently the LLVM setup in the benchmarking utilities still passes TENSORFLOW_C_LIB_PATH rather than the updated -C <path to tflite build>/tflite.cmake that other documentation has been updated to use. I need to either directly fix the usage or rework how this is used in the benchmarking code.

Adding new features under InlineModelFeatureMaps.h results in the TF model pruner removing them at deployment

It might be a Tensorflow bug or incompatibility amongst installed libraries, but here is the issue:

When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

Also, when I load the frozen graph under model/policy/$ITERATION_NO/saved_policy/* using tf.saved_model.load("saved_model.pb"), the signature shows all the tensor names, including the newly added one. But when .local/lib/python3.6/site-packages/tensorflow/python/tools/saved_model_aot_compile.py runs _prune_removed_feed_nodes(signature_def, graph_def):

def _prune_removed_feed_nodes(signature_def, graph_def):
  """Identify the inputs in the signature no longer in graph_def, prune them.

  Args:
    signature_def: A `SignatureDef` instance.
    graph_def: A `GraphDef` instance.

  Returns:
    A new pruned `SignatureDef`.
  """
  node_names = set([n.name for n in graph_def.node])
  new_signature_def = meta_graph_pb2.SignatureDef()
  new_signature_def.CopyFrom(signature_def)
  for (k, v) in signature_def.inputs.items():
    tensor_name, _ = _parse_tensor_name(v.name)
    if tensor_name not in node_names:
      logging.warn(
          'Signature input key \'{}\', tensor name \'{}\', has been pruned '
          'while freezing the graph.  Removing it from the compiled signatures.'
          .format(k, tensor_name))
      del new_signature_def.inputs[k]
  return new_signature_def

when building LLVM for deployment, it prunes the newly added inputs that it does not find in graph_def.node, and as a result my final deployed model is incorrect:

1644209582.45: clang: /home/llvm-project/llvm/lib/Analysis/ReleaseModeModelRunner.cpp:62: {anonymous}::ReleaseModeModelRunner::ReleaseModeModelRunner(llvm::LLVMContext&): Assertion `Index >= 0 && "Cannot find Feature in inlining model"' failed.
1644209582.45: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.

Here are my other TF related packages:

MLGO commit: dac1b149a523b3271341ae72431484df215d8dd3

commit dac1b149a523b3271341ae72431484df215d8dd3 (origin/master, origin/HEAD, master)
Author: Mircea Trofin <[email protected]>
Date:   Thu Jan 21 11:04:00 2021 -0800

    Fix to demo OUTPUT_DIR

    (Thanks to @liyuqian for the fix)
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.7.0
tensorflow              2.4.1
tensorflow-addons       0.11.2
tensorflow-estimator    2.4.0
tensorflow-probability  0.12.2
tf-agents               0.7.1
tf-estimator-nightly    2.4.0.dev2020102201

Also, 2.4.1 was used to create the model:

>>> imported
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7f6bcc4f7a58>
>>> imported.tensorflow_version
'2.4.1'

Here is the debug output:

coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:34.286796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:34.290840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:34.290891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:34.293434: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:34.293855: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:34.296799: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:34.297732: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:34.297943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:34.299759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:34.299802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.075324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-06 21:56:35.075387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2022-02-06 21:56:35.075400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2022-02-06 21:56:35.078385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: $
000:82:00.0, compute capability: 6.0)
2022-02-06 21:56:35.096700: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2194810000 Hz
2022-02-06 21:56:35.236522: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] model_pruner failed: Invalid argument: Graph does not contain terminal node StatefulPartitionedCall_2.
2022-02-06 21:56:35.247615: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:928] Optimization results for grappler item: graph_to_optimize
  model_pruner: Graph size after: 38 nodes (-2), 48 edges (0), time = 0.987ms.
  implementation_selector: Graph size after: 38 nodes (0), 48 edges (0), time = 0.562ms.
  function_optimizer: Graph size after: 343 nodes (305), 581 edges (533), time = 22.39ms.
  common_subgraph_elimination: Graph size after: 303 nodes (-40), 541 edges (-40), time = 3.508ms.
  constant_folding: Graph size after: 227 nodes (-76), 387 edges (-154), time = 55.232ms.
  shape_optimizer: shape_optimizer did nothing. time = 0.41ms.
  arithmetic_optimizer: Graph size after: 238 nodes (11), 398 edges (11), time = 4.174ms.
  layout: Graph size after: 238 nodes (0), 398 edges (0), time = 5.838ms.
  remapper: Graph size after: 238 nodes (0), 398 edges (0), time = 1.459ms.
  loop_optimizer: Graph size after: 238 nodes (0), 397 edges (-1), time = 1.714ms.
  dependency_optimizer: Graph size after: 156 nodes (-82), 221 edges (-176), time = 3.391ms.
  memory_optimizer: Graph size after: 156 nodes (0), 221 edges (0), time = 6.835ms.
  model_pruner: Invalid argument: Graph does not contain terminal node StatefulPartitionedCall_2.
  implementation_selector: Graph size after: 156 nodes (0), 221 edges (0), time = 0.468ms.
  function_optimizer: function_optimizer did nothing. time = 0.127ms.
  common_subgraph_elimination: Graph size after: 146 nodes (-10), 211 edges (-10), time = 1.021ms.
  constant_folding: Graph size after: 146 nodes (0), 211 edges (0), time = 3.151ms.
  shape_optimizer: shape_optimizer did nothing. time = 0.133ms.
  arithmetic_optimizer: Graph size after: 146 nodes (0), 211 edges (0), time = 2.64ms.
  remapper: Graph size after: 146 nodes (0), 211 edges (0), time = 0.8ms.
  dependency_optimizer: Graph size after: 146 nodes (0), 211 edges (0), time = 1.752ms.

2022-02-06 21:56:35.281080: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-06 21:56:35.282047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:82:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:35.282083: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.282137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:35.282155: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:35.282172: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:35.282190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:35.282208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:35.282226: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:35.282243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:35.283971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:35.284299: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-06 21:56:35.285214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:82:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:35.285237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.285258: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:35.285277: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:35.285295: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:35.285311: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:35.285328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:35.285345: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:35.285363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:35.287113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:35.287143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-06 21:56:35.287153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2022-02-06 21:56:35.287161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2022-02-06 21:56:35.288959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /home/llvm-project/llvm/lib/Analysis/models/inliner/variables/variables
2022-02-06 21:56:35.354948: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
WARNING:tensorflow:From /home/.local/lib/python3.6/site-packages/tensorflow/python/tools/saved_model_aot_compile.py:332: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.convert_variables_to_constants`
WARNING:tensorflow:From /home/.local/lib/python3.6/site-packages/tensorflow/python/framework/convert_to_constants.py:856: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
WARNING:tensorflow:Signature input key 'XXX', tensor name 'action_XXX', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'discount', tensor name 'action_discount', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'XXX', tensor name 'action_XXX', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'reward', tensor name 'action_reward', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'step_type', tensor name 'action_step_type', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'inlining_default', tensor name 'action_inlining_default', has been pruned while freezing the graph.  Removing it from the compiled signatures.
INFO:tensorflow:Writing graph def to: /tmp/saved_model_clilxj7nh6h/frozen_graph.pb
INFO:tensorflow:Writing config_pbtxt to: /tmp/saved_model_clilxj7nh6h/config.pbtxt
INFO:tensorflow:Generating XLA AOT artifacts in: /home/llvm-project/build/lib/Analysis

The original question was also posted at tensorflow/tensorflow#54296, but I thought this repo would have a better audience. It would be nice to have some sort of compatibility table among all these needed libraries, as I suspect there might be a mismatch between two of them. I also checked https://github.com/google/ml-compiler-opt/blob/main/requirements.txt and its history, but it doesn't have that info available.

Thanks,
-Amir

Compiling LLVM fails with: error: use of undeclared identifier '__sanitizer_get_configuration'

Hi, I followed the instructions in demo.md and tried to replicate the process of training a model, but I hit an error when compiling LLVM.

I am using the release package ml-compiler-opt-inlining-Oz-v1.1 and have tried to replicate it on Ubuntu 20.04 several times; the error always comes up. Here is the full error message:

/home/xx/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_fuchsia.cpp:410:24: error: use of undeclared identifier '__sanitizer_get_configuration'
  zx_status_t status = __sanitizer_get_configuration(file_name, &vmo);
                       ^
1 error generated.

...

ninja: build stopped: subcommand failed.
FAILED: runtimes/runtimes-x86_64-unknown-fuchsia-stamps/runtimes-x86_64-unknown-fuchsia-build 

LLVM is set to commit fa4c3f70ff0768a270b0620dc6d158ed1205ec4e, and all instructions were run as a regular user. The Python version is 3.8.10.

Are there any packages that I missed in this replication?

Demo build instructions

  1. When I try to build the inlining demo, the instructions work until they require "cipd". This is not a command I have.
  2. In the preliminaries section, there is a typo: export WORKINGD_DIR=~ should be export WORKING_DIR=~

[pipenv][3.9] importlib_metadata dependency missing

For some reason, on the buildbot setup (Debian bullseye, Python 3.9), we also need to manually install importlib_metadata (which pulls in zlib). Curiously, this doesn't happen on our CI.

For the moment, unblocking the buildbots by manually installing importlib_metadata.

`extract_ir/convert_compile_command_to_objectfile` does not support `arguments`-style compilation databases

extract_ir errors on certain compilation databases.

In this case, I generated a compilation database for redis (which uses a Makefile) using bear:

PYTHONPATH=:/home/user/ml-opt/ml-compiler-opt python3 /home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir.py --cmd_filter=^-O2|-Os|-Oz$ --input=/home/user/ml-opt/redis_repo/compile_commands.json --input_type=json --llvm_objcopy_path=build/bin/llvm-objcopy --output_dir=/home/user/ml-opt/redis_corpus
Traceback (most recent call last):
  File "/home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir.py", line 142, in <module>
    app.run(main)
  File "/home/user/ml-opt/mlgo_venv/lib/python3.10/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/user/ml-opt/mlgo_venv/lib/python3.10/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir.py", line 107, in main
    objs = extract_ir_lib.load_from_compile_commands(
  File "/home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir_lib.py", line 229, in load_from_compile_commands
    objs = [
  File "/home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir_lib.py", line 230, in <listcomp>
    convert_compile_command_to_objectfile(cmd, output_dir)
  File "/home/user/ml-opt/ml-compiler-opt/compiler_opt/tools/extract_ir_lib.py", line 210, in convert_compile_command_to_objectfile
    cmd = command['command']
KeyError: 'command'

This occurs because convert_compile_command_to_objectfile in compiler_opt/tools/extract_ir_lib.py simply assumes the command key exists, when arguments may be present instead (see the JSON Compilation Database Format Specification).

This looks very easy to resolve: if command is missing but arguments is present, convert arguments to command by concatenation, as in the sketch below.
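A minimal sketch of that fallback, assuming the entry dict follows the JSON Compilation Database spec; the helper name get_command is hypothetical and only illustrates the fix, it is not the actual code in extract_ir_lib.py.

import shlex

def get_command(entry):
  # Return the compile command, synthesizing it from 'arguments' if needed.
  if 'command' in entry:
    return entry['command']
  if 'arguments' in entry:
    # Quote each argument so paths with spaces survive the round-trip.
    return ' '.join(shlex.quote(arg) for arg in entry['arguments'])
  raise KeyError(
      "compilation database entry has neither 'command' nor 'arguments'")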

Do you mind if I fix this and create a pull request?

How the inlining in the compiled project is identified

Hi, I have read your paper "MLGO: a Machine Learning Guided Compiler Optimizations Framework" and a few questions came to mind.

The paper states, "we train the warmstart policy to imitate the heuristic inlining decisions in LLVM using behavioral cloning algorithm". I am curious how the ground truth for inlining is identified in the compiled projects.

I noticed that the projects are compiled to emit bitcode. Does the debug information in it reflect the inlining? If so, could you explain how it is identified in the bitcode? And can the inlining in a linked binary be identified this way?

error: Could not setup Inlining Advisor for the requested mode and/or options

I'm trying to run the demo following this link: https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md
Everything goes fine until I run fx build (I run the commands in the order provided in the link, i.e. fx set core.x64 with the given arguments first). fx set runs successfully. However, fx build gives me this error:

[2754/67050] CXX host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_shared.cc.o
FAILED: host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_shared.cc.o
../../../llvm-install/bin/clang++ -MD -MF host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_shared.cc.o.d -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -I../.. -Ihost_x64/gen -I../../sdk -Ihost_x64/gen/sdk -I../../sdk/lib/fit-promise/include -I../../sdk/lib/fit/include -I../../sdk/lib/stdcompat/include -I../../zircon/system/public -fclang-abi-compat=13.0 -fcolor-diagnostics -fcrash-diagnostics-dir=clang-crashreports -Xclang -fembed-bitcode=all -ffp-contract=off --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -ffile-compilation-dir=. -no-canonical-prefixes -fomit-frame-pointer -fdata-sections -ffunction-sections -Os -mllvm -enable-ml-inliner=release -gdwarf-5 -Xclang -debug-info-kind=constructor -g3 -Wall -Wextra -Wconversion -Wextra-semi -Wimplicit-fallthrough -Wnewline-eof -Wstrict-prototypes -Wwrite-strings -Wno-sign-conversion -Wno-unused-parameter -Wnonportable-system-include-path -fvisibility=hidden -Werror -Wno-error=deprecated-declarations -Wa,--fatal-warnings --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -fPIE -fvisibility-inlines-hidden -stdlib=libc++ -std=c++17 -fno-exceptions -fno-rtti -stdlib=libc++ -c ../../sdk/lib/syslog/cpp/logging_backend_shared.cc -o host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_shared.cc.o
error: Could not setup Inlining Advisor for the requested mode and/or options
1 error generated.

[2755/67050] CXX host_x64/obj/sdk/lib/syslog/cpp/cpp.log_settings.cc.o
FAILED: host_x64/obj/sdk/lib/syslog/cpp/cpp.log_settings.cc.o
../../../llvm-install/bin/clang++ -MD -MF host_x64/obj/sdk/lib/syslog/cpp/cpp.log_settings.cc.o.d -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -I../.. -Ihost_x64/gen -I../../sdk -Ihost_x64/gen/sdk -I../../sdk/lib/fit-promise/include -I../../sdk/lib/fit/include -I../../sdk/lib/stdcompat/include -I../../zircon/system/public -fclang-abi-compat=13.0 -fcolor-diagnostics -fcrash-diagnostics-dir=clang-crashreports -Xclang -fembed-bitcode=all -ffp-contract=off --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -ffile-compilation-dir=. -no-canonical-prefixes -fomit-frame-pointer -fdata-sections -ffunction-sections -Os -mllvm -enable-ml-inliner=release -gdwarf-5 -Xclang -debug-info-kind=constructor -g3 -Wall -Wextra -Wconversion -Wextra-semi -Wimplicit-fallthrough -Wnewline-eof -Wstrict-prototypes -Wwrite-strings -Wno-sign-conversion -Wno-unused-parameter -Wnonportable-system-include-path -fvisibility=hidden -Werror -Wno-error=deprecated-declarations -Wa,--fatal-warnings --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -fPIE -fvisibility-inlines-hidden -stdlib=libc++ -std=c++17 -fno-exceptions -fno-rtti -stdlib=libc++ -c ../../sdk/lib/syslog/cpp/log_settings.cc -o host_x64/obj/sdk/lib/syslog/cpp/cpp.log_settings.cc.o
error: Could not setup Inlining Advisor for the requested mode and/or options
1 error generated.
[2757/67050] CXX host_x64/obj/sdk/lib/syslog/cpp/cpp.macros.cc.o
FAILED: host_x64/obj/sdk/lib/syslog/cpp/cpp.macros.cc.o
../../../llvm-install/bin/clang++ -MD -MF host_x64/obj/sdk/lib/syslog/cpp/cpp.macros.cc.o.d -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -I../.. -Ihost_x64/gen -I../../sdk -Ihost_x64/gen/sdk -I../../sdk/lib/fit-promise/include -I../../sdk/lib/fit/include -I../../sdk/lib/stdcompat/include -I../../zircon/system/public -fclang-abi-compat=13.0 -fcolor-diagnostics -fcrash-diagnostics-dir=clang-crashreports -Xclang -fembed-bitcode=all -ffp-contract=off --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -ffile-compilation-dir=. -no-canonical-prefixes -fomit-frame-pointer -fdata-sections -ffunction-sections -Os -mllvm -enable-ml-inliner=release -gdwarf-5 -Xclang -debug-info-kind=constructor -g3 -Wall -Wextra -Wconversion -Wextra-semi -Wimplicit-fallthrough -Wnewline-eof -Wstrict-prototypes -Wwrite-strings -Wno-sign-conversion -Wno-unused-parameter -Wnonportable-system-include-path -fvisibility=hidden -Werror -Wno-error=deprecated-declarations -Wa,--fatal-warnings --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -fPIE -fvisibility-inlines-hidden -stdlib=libc++ -std=c++17 -fno-exceptions -fno-rtti -stdlib=libc++ -c ../../sdk/lib/syslog/cpp/macros.cc -o host_x64/obj/sdk/lib/syslog/cpp/cpp.macros.cc.o
error: Could not setup Inlining Advisor for the requested mode and/or options
1 error generated.

[2758/67050] CXX host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_host.cc.o
FAILED: host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_host.cc.o
../../../llvm-install/bin/clang++ -MD -MF host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_host.cc.o.d -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -I../.. -Ihost_x64/gen -I../../sdk -Ihost_x64/gen/sdk -I../../sdk/lib/fit-promise/include -I../../sdk/lib/fit/include -I../../sdk/lib/stdcompat/include -I../../zircon/system/public -fclang-abi-compat=13.0 -fcolor-diagnostics -fcrash-diagnostics-dir=clang-crashreports -Xclang -fembed-bitcode=all -ffp-contract=off --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -ffile-compilation-dir=. -no-canonical-prefixes -fomit-frame-pointer -fdata-sections -ffunction-sections -Os -mllvm -enable-ml-inliner=release -gdwarf-5 -Xclang -debug-info-kind=constructor -g3 -Wall -Wextra -Wconversion -Wextra-semi -Wimplicit-fallthrough -Wnewline-eof -Wstrict-prototypes -Wwrite-strings -Wno-sign-conversion -Wno-unused-parameter -Wnonportable-system-include-path -fvisibility=hidden -Werror -Wno-error=deprecated-declarations -Wa,--fatal-warnings --sysroot=../../prebuilt/third_party/sysroot/linux --target=x86_64-unknown-linux-gnu -fPIE -fvisibility-inlines-hidden -stdlib=libc++ -std=c++17 -fno-exceptions -fno-rtti -stdlib=libc++ -c ../../sdk/lib/syslog/cpp/logging_backend_host.cc -o host_x64/obj/sdk/lib/syslog/cpp/backend_host.logging_backend_host.cc.o
error: Could not setup Inlining Advisor for the requested mode and/or options
1 error generated.

[2761/67050] CC efi_x64/obj/third_party/lz4/lib/liblz4.lz4hc.c.o
ninja: build stopped: subcommand failed.
Hint: run fx build with the option --log LOGFILE to generate a debug log if you are reporting a bug.

It'd be great to have some clarity on this.

"ld.lld: error: corrupt input file: version definition index 0 for symbol curl_jmpenv is out of bounds" for libtensorflow.so

Hello,
@kshiteejm @mtrofin
I am trying to learn MLGO and would like to contribute to it. I went through the main paper and documentation. While following the Fuchsia demo, I ran into an error with libtensorflow.so:

ninja distribution
[1/1273] Linking CXX executable bin/clang-ast-dump
FAILED: bin/clang-ast-dump
: && /usr/bin/c++ -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -ffile-prefix-map=/home/amd/tejas/MLGO/ninja-build=../ninja-build -ffile-prefix-map=/home/amd/tejas/MLGO/llvm-project/= -no-canonical-prefixes -fno-common -Woverloaded-virtual -fno-strict-aliasing -O3 -DNDEBUG -static-libstdc++ -fuse-ld=lld -Wl,--color-diagnostics    -Wl,--gc-sections tools/clang/lib/Tooling/DumpTool/CMakeFiles/clang-ast-dump.dir/ASTSrcLocProcessor.cpp.o tools/clang/lib/Tooling/DumpTool/CMakeFiles/clang-ast-dump.dir/ClangSrcLocDump.cpp.o -o bin/clang-ast-dump  -Wl,-rpath,"\$ORIGIN/../lib:/home/amd/tejas/MLGO/tensorflow/lib"  lib/libLLVMOption.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMSupport.a  lib/libclangAST.a  lib/libclangASTMatchers.a  lib/libclangBasic.a  lib/libclangDriver.a  lib/libclangFrontend.a  lib/libclangSerialization.a  lib/libclangToolingCore.a  lib/libclangDriver.a  lib/libLLVMWindowsDriver.a  lib/libLLVMOption.a  lib/libclangParse.a  lib/libclangSema.a  lib/libclangEdit.a  lib/libclangAnalysis.a  lib/libclangASTMatchers.a  lib/libclangAST.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMScalarOpts.a  lib/libLLVMAggressiveInstCombine.a  lib/libLLVMInstCombine.a  lib/libLLVMTransformUtils.a  lib/libLLVMAnalysis.a  lib/libLLVMProfileData.a  lib/libLLVMSymbolize.a  lib/libLLVMDebugInfoPDB.a  lib/libLLVMDebugInfoMSF.a  lib/libLLVMDebugInfoDWARF.a  /home/amd/tejas/MLGO/tensorflow/lib/libtensorflow.so  /home/amd/tejas/MLGO/tensorflow/lib/libtensorflow_framework.so  lib/libLLVMObject.a  lib/libLLVMMCParser.a  lib/libLLVMMC.a  lib/libLLVMDebugInfoCodeView.a  lib/libLLVMTextAPI.a  lib/libLLVMBitReader.a  lib/libLLVMCore.a  lib/libLLVMBinaryFormat.a  lib/libLLVMRemarks.a  lib/libLLVMBitstreamReader.a  lib/libclangRewrite.a  lib/libclangLex.a  lib/libclangBasic.a  lib/libLLVMSupport.a  -lrt  -ldl  -lm  /usr/lib/x86_64-linux-gnu/libz.so  lib/libLLVMDemangle.a && :
ld.lld: warning: found local symbol 'VERS_1.0' in global part of symbol table in file /home/amd/tejas/MLGO/tensorflow/lib/libtensorflow.so
ld.lld: error: corrupt input file: version definition index 0 for symbol curl_jmpenv is out of bounds
defined in /home/amd/tejas/MLGO/tensorflow/lib/libtensorflow.so
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

I am strictly following the demo steps, but my system is Ubuntu 22.04, so for Python compatibility I created a venv with Python 3.8.10 (all the Python libraries and tensorflow dependencies installed successfully). But now I am stuck here and couldn't find any solution. Is the libtensorflow version mentioned in the demo no longer compatible? I couldn't try any other version as I don't know how to get one.
Thanks!
Thanks!

Tasks assigned to LocalWorkerPool are not load balanced

Tracking issue, search for this issue number in code.

  • _schedule_jobs in local_data_collector load balances naively, in round-robin style
  • Ideally, each worker is only assigned a couple of jobs at a time, and gets more upon completing previous tasks; see the sketch after this list
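For illustration, here is a minimal, standard-library-only sketch of the pull-based scheduling described above; names like worker_fns and jobs are hypothetical and do not reflect LocalWorkerPool's actual API.

import queue
import threading

def run_jobs(worker_fns, jobs):
  # Each worker pulls its next job only when it finishes the previous one,
  # instead of receiving a fixed round-robin share up front.
  pending = queue.Queue()
  for job in jobs:
    pending.put(job)
  results = []
  results_lock = threading.Lock()

  def drain(worker):
    while True:
      try:
        job = pending.get_nowait()
      except queue.Empty:
        return  # No work left; this worker goes idle.
      result = worker(job)
      with results_lock:
        results.append(result)

  threads = [threading.Thread(target=drain, args=(w,)) for w in worker_fns]
  for t in threads:
    t.start()
  for t in threads:
    t.join()
  return results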

cancellation_manager.pause_all_work() is insufficient to pause work

It is possible for work to continue running if no processes are registered on the cancellation manager when pause_all_work() is called.

Possible solutions:
Also send SIGSTOP to the worker Python process. However, this has a race condition: pause_all_work -> new work scheduled -> pause Python -> work is still being done. Pausing Python first would prevent pause_all_work from running, so that's not an option.

Alternatively, have the cancellation manager send SIGSTOP upon process registration when paused. This should work, I think, and is probably the easiest solution; a sketch follows.
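As a rough illustration of that idea (assuming a POSIX system; the class below is hypothetical and not the repository's actual cancellation manager):

import os
import signal
import threading

class PausingCancellationManager:
  def __init__(self):
    self._paused = False
    self._pids = set()
    self._lock = threading.Lock()

  def register_process(self, pid):
    with self._lock:
      self._pids.add(pid)
      if self._paused:
        # A process that registers while we are paused is stopped
        # immediately, closing the race described above.
        os.kill(pid, signal.SIGSTOP)

  def pause_all_work(self):
    with self._lock:
      self._paused = True
      for pid in self._pids:
        os.kill(pid, signal.SIGSTOP)

  def resume_all_work(self):
    with self._lock:
      self._paused = False
      for pid in self._pids:
        os.kill(pid, signal.SIGCONT)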

To verify: add a time.sleep right under .pause_children() in train_locally and look at the CPU graph.

Convert documentation to TFLite

Some of the documentation that is around currently, particularly the inliner demo and some of the benchmarking documentation, as well as some of the benchmarking code itself, is stuck on the Tensorflow C API rather than TFLite. This documentation/code (just benchmarking) should be updated to work with TFLite.

I should be able to do this, just trying to find some time.

  • Update inlining demo
  • Update benchmarking code/documentation to work with TFLite

Model "replacing" compilation flags explicitly

We currently model flag changes as either "add" or "delete". "Replace" happens e.g. for profile file locations, and is better modeled explicitly rather than as a delete followed by an add, because it allows catching setup errors where, say, a profile was meant to be used during compilation but the experimenter forgets to pass one. Removing and not re-adding the flag will work (compilation will just chug along), but the results will be garbage, and it will be hard to tell why. Expressing it as "replace" would mean "we expect the flag to be present in the compilation, and a replacement provided, so if that's not the case -> error". A sketch of those semantics follows.
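A minimal sketch of the proposed "replace" semantics; the function name and the prefix=value flag shape are assumptions for illustration, not the repository's implementation.

def replace_flag(cmdline, prefix, new_value):
  # Replace the value of a `prefix=value` flag, erroring if it is absent,
  # so a missing profile (or similar) fails loudly instead of silently.
  out = []
  replaced = False
  for arg in cmdline:
    if arg.startswith(prefix + '='):
      out.append(prefix + '=' + new_value)
      replaced = True
    else:
      out.append(arg)
  if not replaced:
    raise ValueError('expected ' + prefix + ' in the compilation command; '
                     'refusing to continue with a silently missing flag')
  return out

For example, replace_flag(cmd, '-fprofile-instr-use', '/new/profdata') would fail if the original compilation never used a profile, surfacing the setup error immediately.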

TFLite build script fails

I'm trying to build TFLite on Fedora 38 using buildbot/build_tflite.sh. All prior projects build, but TFLite fails to build with the following:

[21/210] Building CXX object CMakeFiles/tensorflow-lite.dir/kernels/internal/spectrogram.cc.o
FAILED: CMakeFiles/tensorflow-lite.dir/kernels/internal/spectrogram.cc.o 
/usr/bin/clang++ -DCPUINFO_SUPPORTED_PLATFORM=1 -I/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow -I/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow-build/farmhash/src -I/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow-build/gemmlowp -isystem /home/fedora/ml-opt/tflite/eigen/include/eigen3 -isystem /home/fedora/ml-opt/tflite/ARM_NEON_2_x86_SSE/include -isystem /home/fedora/ml-opt/tflite/abseil-cpp/include -isystem /home/fedora/ml-opt/tflite/flatbuffers/include -isystem /home/fedora/ml-opt/tflite/ruy/include -isystem /home/fedora/ml-opt/tflite/cpuinfo/include -O3 -DNDEBUG -std=gnu++17 -fPIC -DTFL_STATIC_LIBRARY_BUILD -Wno-deprecated-declarations -MD -MT CMakeFiles/tensorflow-lite.dir/kernels/internal/spectrogram.cc.o -MF CMakeFiles/tensorflow-lite.dir/kernels/internal/spectrogram.cc.o.d -o CMakeFiles/tensorflow-lite.dir/kernels/internal/spectrogram.cc.o -c /home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:46:22: error: unknown type name 'uint32_t'
inline int Log2Floor(uint32_t n) {
                     ^
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:49:3: error: unknown type name 'uint32_t'
  uint32_t value = n;
  ^
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:52:5: error: unknown type name 'uint32_t'
    uint32_t x = value >> shift;
    ^
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:61:24: error: unknown type name 'uint32_t'
inline int Log2Ceiling(uint32_t n) {
                       ^
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:69:8: error: unknown type name 'uint32_t'
inline uint32_t NextPowerOfTwo(uint32_t value) {
       ^
/home/fedora/ml-opt/tflite/tensorflow/src/tensorflow/tensorflow/lite/kernels/internal/spectrogram.cc:69:32: error: unknown type name 'uint32_t'
inline uint32_t NextPowerOfTwo(uint32_t value) {
                               ^
6 errors generated.

Replace tensorflow C API (CPU) with the GPU version

Hi, I am using the demo to verify the size effect on the Fuchsia compile, but it failed at training the optimized model, which runs for 8 hours or so. I want to know if I can replace the tensorflow C API with tensorflow-gpu, so I used libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz, but sadly it failed when compiling LLVM:

ld.lld: warning: found local symbol 'VERS_1.0' in global part of symbol table in file /home/tiger/tensorflow/lib/libtensorflow.so
ld.lld: error: corrupt input file: version definition index 0 for symbol curl_jmpenv is out of bounds

MLGO Demo Problem

I have recently been following this project for optimization work. Is the demo of this repository too old? I compiled it and got many errors. Do you have a Docker image or the latest build commands?

Enable gin bindings for modifying compilation flags

Modifying compilation flags to be added / deleted / (replaced, in the future) should be gin-configurable, as was possible in the past. Currently, they've been refactored into methods in problem_configuration.py, namely flags_to_add() and flags_to_delete(), and changing these flags on a case-by-case basis requires modifying code instead of being able to modify them via gin_bindings. A sketch of what gin-configurable versions could look like follows.
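A minimal sketch of re-exposing these through gin; @gin.configurable is the real gin-config API, but the function bodies and parameter names here are hypothetical and only mirror the method names cited above.

import gin

@gin.configurable
def flags_to_add(extra_flags=()):
  # Overridable via gin, e.g. in a gin file:
  #   flags_to_add.extra_flags = ('-mllvm', '-enable-ml-inliner=development')
  return list(extra_flags)

@gin.configurable
def flags_to_delete(unwanted_flags=()):
  return list(unwanted_flags)

With bindings like these, a flag change becomes a --gin_bindings argument (as already used in this repo's command lines) instead of a code edit.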

Linker Error while reproducing demo

Hi,
I find your work very interesting and am trying to reproduce your results based on these instructions: https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md.
But I always fail to build the required LLVM version, with this error:

ld.lld: warning: found local symbol 'VERS_1.0' in global part of symbol table in file /tensorflow/lib/libtensorflow.so
ld.lld: error: corrupt input file: version definition index 0 for symbol curl_jmpenv is out of bounds
>>> defined in /tensorflow/lib/libtensorflow.so
collect2: error: ld returned 1 exit status

I'm not sure if I'm missing some needed dependency or setting, and I would much appreciate any hint to solve this.
For easier reproducibility, I added all my steps in the attached Dockerfile.txt.
I already tried different lld versions (6, 7, 8, 9, 10 and 12) and the newest libtensorflow version, but none of them solved the error.

regards
Alexander

Blackbox optimizers do not check for standard deviation of 0

In ES training, when all of the rewards are the same value, the standard deviation is 0. The blackbox optimizers do not check for this case before dividing by the standard deviation of the function values (which is zero), resulting in a list of nan for the gradient and thus setting the model weights to nan as well. This case needs to be addressed so that the model weights remain real float values; see the sketch below.
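A minimal numpy sketch of the missing guard; where exactly the division happens in the blackbox optimizers is not shown here, so treat the function below as illustrative.

import numpy as np

def normalized_rewards(function_values):
  values = np.asarray(function_values, dtype=np.float64)
  std = values.std()
  if std == 0.0:
    # All rewards identical: there is no gradient signal, so return zeros
    # instead of dividing by zero and poisoning the weights with NaNs.
    return np.zeros_like(values)
  return (values - values.mean()) / std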

ppo_nn_agent.gin hyperparam tuning

Hi @yundiqian. I was skimming through the hyperparams of https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin and it seems counterintuitive to me that both PPOAgent.normalize_rewards and PPOAgent.normalize_observations are set to False. Would you be able to provide some info on this? Looking at the TF Agents codebase (https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ppo/ppo_agent.py#L206), it is advised to normalize rewards and observations, so I was wondering if you had tried these out before?

Thanks!
-Amir

Separate specifics from general-purpose functions in policy_utils

The config tests are very problem-specific and should be separated from the general-purpose vectorization tests. create_actor_policy should depend less on the problem_config registry and instead should just be given the necessary parameters: the signature spec and the processing layer creator. By doing so, the policies created in the tests would not depend on problem specifics.

TFRecord generator not working

Thanks for the open-source code!

I have encountered several problems while trying to reproduce the model following the README file:

  1. I would really appreciate an explanation of how to use the extract_ir.py tool.
    Currently I am compiling the .bc code from a source .c file using clang -c -emit-llvm -O1 X.c, and using an empty file for X.cmd.
    While running generate_default_trace.py, the clang command in inlining_runner.py (i.e. clang +cmds -mllvm -enable-ml-inliner=development ... at line 119) does not generate any log file. I have tried running the clang command outside of the script, but still no log file is generated. I wish to know whether that is caused by cmds being empty. If yes, it would be useful to provide some instructions on how to use extract_ir.py to generate a correct .cmd file. If no, it might be worth figuring out why clang is not generating log files.
    I get around the log file problem by breaking the clang invocation into two commands:
    First, use opt -passes=scc-oz-module-inliner -ml-inliner-ir2native-model=<path to ir2native> -training-log=<path to training log output> -enable-ml-inliner=development -o <output> <module.o> to generate the log file.
    Then run clang to compile to native.
    I am not sure if this is a good replacement for the original code, and would really appreciate it if you can enlighten me on what's wrong with it.

  2. It seems like this script requires cuda10, while the others use cuda11.

  3. On running train_bc.py, it hangs forever on

experience = next(dataset_iter) (in trainer.py)

On closer inspection of dataset_iter, we find that when we read from the TFRecord file in _file_iterator_fn (defined in data_reader.py), the dataset is actually empty.
I am not sure if the problem is with how we read the file or with the TFRecord file generated in the previous step.

It will be really helpful if you can give some insights on these points!
