
Comments (26)

kshiteejm commented on May 2, 2024

So, what method do you suggest for generating the buckets a priori? Looking at the value distribution of the added features over a large dataset and generating 1000 buckets? Obviously, this has a chicken-and-egg problem for new features.

I have a tool to generate buckets for all features (including any new features). It is very close to release, hopefully in the next couple of days.

For generating buckets:

  1. You will have to execute the trace generator (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/tools/generate_default_trace.py) on your repository of IR files with your version of LLVM to generate tfrecord files containing raw feature values.
  2. The bucket generator tool that I will release takes these tfrecord files as input and generates vocabs for all features.

You can then point to these vocabulary files and hopefully that will resolve the above issues you are facing.
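
For a rough idea of what step 2 does, here is a hypothetical sketch (not the actual tool, whose interface may differ): it assumes the tfrecord files hold tf.Example protos with one int64 list per feature, and derives 1000 quantile boundaries per feature.

```python
# Hypothetical sketch of bucket generation; the released tool may work differently.
# Assumes each tfrecord contains tf.Example protos with int64 feature values.
import numpy as np
import tensorflow as tf

NUM_BUCKETS = 1000

def generate_vocab(trace_files, feature_name, out_path):
  values = []
  for record in tf.data.TFRecordDataset(trace_files):
    example = tf.train.Example.FromString(record.numpy())
    values.extend(example.features.feature[feature_name].int64_list.value)
  # 1000 evenly spaced quantiles approximate the feature's value distribution.
  buckets = np.quantile(values, np.linspace(0.0, 1.0, NUM_BUCKETS))
  with open(out_path, 'w') as f:
    f.write('\n'.join(str(b) for b in buckets))

# Example call (paths and feature name are placeholders):
# generate_vocab(['trace-0.tfrecord'], 'callee_basic_block_count',
#                'vocab/callee_basic_block_count.buckets')
```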

jacob-hegna commented on May 2, 2024

In the interim, I would recommend trying to add mock vocab files with some values (maybe 500 zeros followed by 500 ones, or counting 1...1000) to see if this resolves the issue with the compiled model in clang.
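
For illustration, a minimal sketch of writing such mock vocab files (the .buckets naming and vocab directory follow this thread; the feature name is a placeholder):

```python
# Hypothetical helper that writes a mock vocab file: 500 zeros followed by
# 500 ones, one bucket boundary per line, as suggested above.
import os

def write_mock_vocab(vocab_dir, feature_name, num_buckets=1000):
  values = [0] * (num_buckets // 2) + [1] * (num_buckets // 2)
  with open(os.path.join(vocab_dir, feature_name + '.buckets'), 'w') as f:
    f.write('\n'.join(str(v) for v in values))

write_mock_vocab('compiler_opt/rl/inlining/vocab', 'my_new_feature')
```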

mtrofin commented on May 2, 2024

It might be a TensorFlow bug or an incompatibility among the installed libraries, but here is the issue:

When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

@yundiqian would know more about this, but this might help - adding to those just says "we expect these features in the model". There should be an accompanying change in compiler_opt/rl/inlining/config.py.

(well... we should add a guide for how to extend the feature set)

jacob-hegna commented on May 2, 2024

In addition to what @mtrofin mentioned, you will have to add a corresponding vocabulary file to compiler_opt/rl/inlining/vocab. If I remember correctly, the file is a 1000-bucket histogram of the feature values, which is used to normalize the inputs.

jacob-hegna commented on May 2, 2024

The vocabulary files are used to create a preprocessing layer for each feature in the model that normalizes the feature between 0 and 1.

If you do not provide vocabulary files for the new features, then the features are disconnected from the component of the graph that contains the observed output node (because the preprocessing layers do not exist), so the graph pruner may safely delete those nodes without changing observable computations. This causes your segfault in clang because the model does not contain the features you requested.
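
To make this concrete, here is a hedged sketch of the idea (not the actual ml-compiler-opt code): the bucket boundaries from a vocab file map a raw feature value to a bucket index, and dividing by the bucket count normalizes it into [0, 1].

```python
# Hypothetical illustration of quantile-based normalization via a vocab file;
# the real preprocessing layer in ml-compiler-opt may be constructed differently.
import tensorflow as tf

def make_normalizer(vocab_path):
  with open(vocab_path) as f:
    boundaries = sorted({float(line) for line in f if line.strip()})
  discretize = tf.keras.layers.Discretization(bin_boundaries=boundaries)

  def normalize(x):
    # Bucket index in [0, len(boundaries)] divided by the bucket count -> [0, 1].
    return tf.cast(discretize(x), tf.float32) / float(len(boundaries))

  return normalize

# normalize = make_normalizer('vocab/callee_basic_block_count.buckets')
# normalize(tf.constant([3, 42, 10000], dtype=tf.int64))
```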

For some visibility into how this happens on the Python side, check out the following functions/lines:

Fair warning: I'm not a TensorFlow expert, so I might have described the graph pruning process slightly incorrectly, but this is my mental model. @yundiqian will be able to correct anything I said here.

yundiqian commented on May 2, 2024

Hi Amir,
There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see a lot, or only 0 there?) If it does not go well, can you paste the log it prints?

I didn't see interruptions; however, the $ITERATION_NO values were different in the two cases (the original features vs. the revised features). For the original features it went up to around ~1700, while for the revised features it reached around ~600 and then stopped saving new $ITERATION_NO directories, even though training overall ran for almost 500k iterations. The loss kept going up and down a bit and I saw no reason to carry on the training.

  2. What's the change you made to config.py?

Adding new features under feature_keys and then implementing the necessary code in LLVM to collect the features.

I think I understand where your issues come from. Your code repo hasn't been updated for a while (we no longer have 'feature_keys' now). In this old version, if the trainer does not see the relevant vocab file for a certain feature (you added the feature to config.py but did not generate a vocab for it), it is going to void that feature (it will still exist in saved_model.pb, as you see, but it disappears after you convert the model to AOT). This explains your observation perfectly.

As for the solution:

  1. Pull the repo and add the feature to observation_spec in rl/inlining/config.py (a sketch follows this list).
  2. Generate the vocab (we will soon release a tool and also update the demo with instructions).
  3. Train the model.
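
For reference, a hedged sketch of what step 1 could look like in rl/inlining/config.py (the exact structure varies across versions; 'my_new_feature' is a placeholder and must match the name exposed by your LLVM change):

```python
# Hypothetical sketch: register a new int64 scalar feature in the observation spec.
import tensorflow as tf

inlining_features = [
    'callee_basic_block_count',
    'callsite_height',
    # ... other existing features ...
    'my_new_feature',  # newly added feature, collected by your LLVM change
]

observation_spec = {
    key: tf.TensorSpec(dtype=tf.int64, shape=(), name=key)
    for key in inlining_features
}
```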

yundiqian commented on May 2, 2024

A few explanations about vocab:

The vocab files are here. They are used during training. What's currently in the repo is what we pre-produced for the current features in LLVM, so of course it does not include what you newly added in LLVM. With the tool we will release soon, you will be able to generate something similar, but including the new features. (Don't worry about how; we will update the demo with instructions.)

Almost every feature expects its corresponding vocab file during training; if the trainer does not find it, 1) in the old repo, it voids the feature; 2) in the latest repo, it breaks.

amirjamez commented on May 2, 2024

I can confirm that the issue is resolved after adding the buckets for the newly added features. The AOT model pruner doesn't touch them anymore. The only cost is that I have to retrain the RL agent (still in progress!).

@kshiteejm @jacob-hegna Until the generate_buckets tool becomes available, I can suggest two methods for those in need of one:

  1. If you already know the distribution of the data generated by the added features, you can fit an existing distribution to your data manually and then bucketize it (see the sketch after this list).
  2. If you don't, this is simply a density estimation problem: look into the default_trace and estimate one. I am guessing this is what is being worked on anyway.
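
As an illustration of method 1, a hypothetical sketch (assuming SciPy is available and that a log-normal shape roughly fits the feature): take the inverse CDF of the assumed distribution at 1000 evenly spaced probabilities as the bucket boundaries.

```python
# Hypothetical example of method 1: bucketize an assumed distribution.
import numpy as np
from scipy import stats

# The parameters below are made up; fit them to whatever you believe about the feature.
assumed = stats.lognorm(s=1.0, scale=50.0)
buckets = assumed.ppf(np.linspace(0.001, 0.999, 1000))

with open('my_new_feature.buckets', 'w') as f:  # placeholder file name
  f.write('\n'.join(str(b) for b in buckets))
```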

kshiteejm commented on May 2, 2024

@amirjamez We released a tool today to generate your own bucket files. You can find the instructions at https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md#collect-trace-and-generate-vocab. Please let us know in case you have any further questions.

yundiqian commented on May 2, 2024

OK, I think my hypothesis is correct then :) The solution is to pull the latest version; that will solve your problem.

The reason you see this phenomenon is that, with the old version of the code, the fake inlining_default feature is treated as a real feature once your newly generated vocab folder includes it. Since it is exactly the same as the label, the loss becomes 0 and then NaN due to numerical issues. With the new code, this problem no longer exists because we prune out the inlining_default feature explicitly here: https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/inlining/config.py#L92

mtrofin commented on May 2, 2024

Oh, and we should have automation for that (like really soon) @kshiteejm

amirjamez commented on May 2, 2024

It might be a TensorFlow bug or an incompatibility among the installed libraries, but here is the issue:
When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

@yundiqian would know more about this, but this might help - adding to those just says "we expect these features in the model". There should be an accompanying change in compiler_opt/rl/inlining/config.py.

(well... we should add a guide for how to extend the feature set)

Thanks. Yes, I already applied the necessary changes there.

amirjamez commented on May 2, 2024

In addition to what @mtrofin mentioned, you will have to add a corresponding vocabulary file to compiler_opt/rl/inlining/vocab. If I remember correctly, the file is a 1000-bucket histogram of the feature values, which is used to normalize the inputs.

Thanks. Maybe that could be the missing piece. Does it serve the purpose of quantizing/bucketizing the space the features live in (rather than a continuous space)?

I have difficulty drawing a connection between this and the reason the added features (new tensors) are missing from the graph def after it is optimized by the graph optimizer & pruner.

yundiqian commented on May 2, 2024

Hi Amir,

There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see a lot, or only 0 there?) If it does not go well, can you paste the log it prints?
  2. What's the change you made to config.py?

amirjamez commented on May 2, 2024

Hi Amir,

There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see a lot, or only 0 there?) If it does not go well, can you paste the log it prints?

I didn't see interruptions; however, the $ITERATION_NO values were different in the two cases (the original features vs. the revised features). For the original features it went up to around ~1700, while for the revised features it reached around ~600 and then stopped saving new $ITERATION_NO directories, even though training overall ran for almost 500k iterations. The loss kept going up and down a bit and I saw no reason to carry on the training.

  2. What's the change you made to config.py?

Adding new features under feature_keys and then implementing the necessary code in LLVM to collect the features.

amirjamez commented on May 2, 2024

The vocabulary files are used to create a preprocessing layer for each feature in the model that normalizes the feature between 0 and 1.

If you do not provide vocabulary files for the new features, then the features are disconnected from the component of the graph that contains the observed output node (because the preprocessing layers do not exist), so the graph pruner may safely delete those nodes without changing observable computations. This causes your segfault in clang because the model does not contain the features you requested.

For some visibility into how this happens on the Python side, check out the following functions/lines:

Fair warning: I'm not a TensorFlow expert, so I might have described the graph pruning process slightly incorrectly, but this is my mental model. @yundiqian will be able to correct anything I said here.

So, what method do you suggest for generating the buckets a priori? Looking at the value distribution of the added features over a large dataset and generating 1000 buckets? Obviously, this has a chicken-and-egg problem for new features.

amirjamez commented on May 2, 2024

Thank you all for the info. I'll give it a try.

amirjamez commented on May 2, 2024

Thanks @kshiteejm. I'll give it a try.
Edit: I did try it and it worked fine. It generated buckets for all features, including ones I did not already have and that were getting removed by the graph pruner later at deployment (reward.buckets, inlining_decision.buckets, and inlining_default.buckets).

Looking at the history (https://github.com/google/ml-compiler-opt/tree/15ff9bfcfe5093f7e325a17a8d33b9db6b9e20f0/compiler_opt/rl/inlining/vocab) as of the latest commit (8826749), the repo (https://github.com/google/ml-compiler-opt/tree/main/compiler_opt/rl/inlining/vocab) didn't have these three buckets in vocab. Maybe I am missing something, or maybe these shouldn't be added by sparse_bucket_generator.py?

amirjamez commented on May 2, 2024

@kshiteejm I have got another issue: the loss is converging to almost zero, literally zero, in warmstart, and that causes the loss of the train-optimize model to come back as nan now. Any suggestions?

Edit: removing those three newly added buckets mentioned above resolved the nan loss issue.

kshiteejm commented on May 2, 2024

@amirjamez glad you got it working. Those three additional features are not picked up during training in the current version of the code, and that is intended. More details follow.

The sparse_bucket_generator.py tool generates a superset of the features that are actually used during training. The reward.buckets, inlining_decision.buckets, and inlining_default.buckets files are not used during training (even though they are generated by the tool). Only buckets for features in observation_spec (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/inlining/config.py#L30; inlining_default is ignored) are picked up during training (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/agent_creators.py#L113). The reward_spec feature (reward.buckets) and action_spec feature (inlining_decision.buckets) are not picked up during training. I hope this further clarifies things!
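
Put differently, a hypothetical illustration of the filtering described above (the names and paths are assumptions based on this thread, not the actual training code):

```python
# Hypothetical sketch: only features in observation_spec (minus inlining_default)
# have their .buckets files picked up; reward/action buckets are ignored.
import os

def vocabs_used_in_training(vocab_dir, observation_feature_names):
  used = {}
  for name in observation_feature_names:
    if name == 'inlining_default':  # explicitly ignored, per config.py
      continue
    path = os.path.join(vocab_dir, name + '.buckets')
    if os.path.exists(path):
      used[name] = path
  return used
```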

amirjamez commented on May 2, 2024

Thanks @kshiteejm for clarifying the newly added buckets.

So what was the reason that including these three made my warm_start loss drop close to 0 and eventually led to a nan during training? Can you reproduce this issue on your end?

kshiteejm commented on May 2, 2024

Not sure what you mean by including the newly added buckets. Did you make edits to the code to do something more with these three buckets, OR did you leave the code untouched and just modify the set of bucket files in the compiler_opt/rl/inlining/vocab folder? This will help me reproduce the issue on my end.

amirjamez commented on May 2, 2024

So when I reran the warm_start script after executing sparse_bucket_generator.py, which added three new buckets (reward.buckets, inlining_decision.buckets, and inlining_default.buckets) to my vocab, the reported loss (also visible in Tensorboard) dropped dramatically after a few iterations; by iteration ~1000 (out of the 100k default) it was pretty close to zero and eventually became zero. What I had seen in previous warm_start training runs was the loss fluctuating between 0.18 and 0.02 by the end of the 100k stretch. Then, at fine-tuning, the warm_start model is picked up as the starting point via --gin_bindings=train_eval.warmstart_policy_dir=\"$WARMSTART_OUTPUT_DIR/saved_policy\", and it immediately reported the loss as nan as it carried on training. I had to stop to see what had changed, and when I removed those three bucket files from my vocab and redid the process, it resolved my issue. Hope this was clear.

I also tried skipping the warm_start and directly training an optimize model, but as you can see the loss was very low from the start and I don't think backprop learned anything at all (it stayed around 0.00000X for around 200k iterations). Not sure what causes the vanishing gradients here.

Thanks,
-Amir

yundiqian commented on May 2, 2024

Hi Amir,

Did you pull the latest version of the whole repo? If so, what is the change you made to compiler_opt/rl/inlining/config.py?

I have a fairly direct hypothesis, so I'd like to confirm these two points with you :)

amirjamez commented on May 2, 2024

No, unfortunately I am still on commit dac1b14 and have only cherry-picked sparse_bucket_generator.py.

Regarding config.py (https://github.com/google/ml-compiler-opt/blob/dac1b149a523b3271341ae72431484df215d8dd3/compiler_opt/rl/config.py), I only added the names of my custom features to feature_keys. I know there have been changes to the repo, but I don't understand how that could be the cause of this issue.

Looking forward to hearing your hypothesis :)
-Amir

amirjamez commented on May 2, 2024

Great. Thanks @yundiqian
