ivanaxu / ideeprec

DeepRec For Me https://github.com/alibaba/DeepRec

Home Page: https://deeprec.readthedocs.io/zh/latest/index.html

License: Apache License 2.0

Starlark 2.45% Shell 0.49% Batchfile 0.02% Python 33.30% Dockerfile 0.05% C++ 55.45% C 0.60% Java 0.58% CMake 0.15% Makefile 0.07% HTML 3.11% Cuda 0.14% Jupyter Notebook 1.93% MLIR 1.35% SWIG 0.11% Cython 0.01% LLVM 0.01% Objective-C 0.06% Objective-C++ 0.14% Ruby 0.01%

ideeprec's Introduction

བཀྲ་ཤིས་བདེ་ལེགས་ (Tashi Delek — greetings)

  • Practice: wakatime
  • Profile: website
  • Product: website
  • Progress: website
  • Poetry Daily

四 支

茶對酒,賦對詩,燕子對鶯兒。栽花對種竹,落絮對遊絲。四目頡,一足夔,鴝鵒對鷺鷥。半池紅菡萏,一架白荼蘼。幾陣秋風能應候,一犁春雨甚知時。智伯恩深,國士吞變形之炭;羊公德大,邑人豎墮淚之碑。

行對止,速對遲,舞劍對圍棋。花箋對草字,竹簡對毛錐。汾水鼎,峴山碑,虎豹對熊羆。花開紅錦繡,水漾碧琉璃。去婦因探鄰舍棗,出妻爲種後園葵。笛韻和諧,仙管恰從雲裏降;櫓聲咿軋,漁舟正向雪中移。

戈對甲,鼓對旗,紫燕對黃鸝。梅酸對李苦,青眼對白眉。三弄笛,一圍棋,雨打對風吹。海棠春睡早,楊柳晝眠遲。張駿曾爲槐樹賦,杜陵不作海棠詩。晉士特奇,可比一斑之豹;唐儒博識,堪爲五總之龜。

ideeprec's People

Contributors: ivanaxu

Forkers: nann-auto

ideeprec's Issues

part2 🍎 x0.4 Build with oneDNN + Eigen Threadpool worker pool, ABI=0 (CPU)

https://deeprec.readthedocs.io/zh/latest/oneDNN.html

oneDNN is Intel's open-source, cross-platform performance library for deep learning; its documentation lists the supported primitives. DeepRec ships with oneDNN support: adding the build option --config=mkl_threadpool to the DeepRec compile command enables oneDNN-accelerated operator computation. On machines that support the AVX512 instruction set (Skylake and later CPUs), adding --config=opt turns on the --copt=-march=native optimization by default, which can further speed up operator computation.

Tips: MKL-DNN was renamed DNNL, and later renamed again to oneDNN. Early TensorFlow used MKL to accelerate operators and gradually replaced it with oneDNN in later releases, but the old macro definitions are still kept.
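As a quick sanity check that oneDNN kernels are actually being dispatched after such a build, oneDNN's own verbose mode can be enabled via an environment variable before TensorFlow is imported. A minimal sketch (DNNL_VERBOSE applies to DNNL/oneDNN-era builds; MKLDNN_VERBOSE is the older MKL-DNN name):

```python
import os

# Must be set before TensorFlow is imported, so oneDNN reads it at load time.
# Each executed oneDNN primitive is then logged to stdout as a "dnnl_verbose" line.
os.environ["DNNL_VERBOSE"] = "1"      # oneDNN / DNNL builds
os.environ["MKLDNN_VERBOSE"] = "1"    # older MKL-DNN naming, kept for compatibility

# import tensorflow as tf  # any MatMul/Conv run now prints dnnl_verbose records
```

If no dnnl_verbose lines appear for compute-heavy ops, the wheel was likely built without --config=mkl_threadpool.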

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:


Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
        --config=mkl            # Build with MKL support.
        --config=monolithic     # Config for mostly static monolithic build.
        --config=gdr            # Build with GDR support.
        --config=verbs          # Build with libverbs support.
        --config=ngraph         # Build with Intel nGraph support.
        --config=numa           # Build with NUMA support.
        --config=dynamic_kernels        # (Experimental) Build kernels into separate shared objects.
        --config=v2             # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
        --config=noaws          # Disable AWS S3 filesystem support.
        --config=nogcp          # Disable GCP support.
        --config=nohdfs         # Disable HDFS support.
        --config=noignite       # Disable Apache Ignite support.
        --config=nokafka        # Disable Apache Kafka support.
        --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished

https://deeprec.readthedocs.io/zh/latest/DeepRec-Compile-And-Install.html

Build (GPU/CPU)

bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Build (GPU/CPU) with ABI=0

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Build with oneDNN + Eigen Threadpool worker pool (CPU)

bazel build -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Build with oneDNN + Eigen Threadpool worker pool, ABI=0 (CPU)

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Demo Submission in the Semi-Finals

Date: 2022-10-13 21:58:04
Score: 2082.8406
DIEN (s): 380.1314
DIN (s): 329.4982
DLRM (s): 345.6923
DeepFM (s): 341.6685
MMoE (s): 331.0767
WDL (s): 354.7736

part1 🌈 w0.1 CostModel-based Executor

https://deeprec.readthedocs.io/zh/latest/Executor-Optimization.html

CostModel-based Executor

By dynamically tracing the specified steps' Session.run calls, this feature collects and computes several groups of metrics, then uses a CostModel to derive a better scheduling strategy. It includes a critical-path-based scheduling policy and a policy that, guided by the CostModel, batches the execution of short-running ops.

Usage: first, specify which steps' Session.run calls to trace for metric collection; by default, metrics are collected for steps 100-200. The range can be customized with the following environment variables:

os.environ['START_NODE_STATS_STEP'] = "200"
os.environ['STOP_NODE_STATS_STEP'] = "500"

The example above traces the execution metrics of steps 200-500. If START_NODE_STATS_STEP is greater than STOP_NODE_STATS_STEP, tracing is disabled and the CostModel computation that follows will not run. In addition, the user script must add the following lines to enable the CostModel-based executor:

sess_config = tf.ConfigProto()
sess_config.executor_policy = tf.ExecutorPolicy.USE_COST_MODEL_EXECUTOR

with tf.train.MonitoredTrainingSession(
    master=server.target,
    ...
    config=sess_config) as sess:
  ...

inter_threads/intra_threads = 7/1

  • serving/processor/serving/model_config.cc
    (*config)->inter_threads = schedule_threads / (8/7); // 2

    (*config)->intra_threads = schedule_threads / (8/1); // 2
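A note on these divisors: in C++, 8/7 and 8/1 are integer divisions, so the expressions above reduce to schedule_threads / 1 and schedule_threads / 8 respectively — a 7/1 split in favor of inter-op threads. A small Python sketch, using // to mirror the C++ integer semantics (schedule_threads = 16 is an assumed example value):

```python
# Mirror the C++ integer arithmetic: 8/7 truncates to 1, 8/1 is 8.
schedule_threads = 16  # assumed example value; the real value comes from the config

inter_threads = schedule_threads // (8 // 7)  # 8 // 7 == 1  -> 16
intra_threads = schedule_threads // (8 // 1)  # 8 // 1 == 8  -> 2
```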

session_num = 2 & select_session_policy = "RR"

  • serving/processor/serving/model_config.h
  // session num of session group,
  // default num is 1
  int session_num = 2; // 1
  // In multi-session mode, we have two policy for
  // select session for each thread.
  // "RR": Round-Robin policy, threads will use all sessions in Round-Robin way
  // "MOD": Thread select session according unique id, uid % session_num
  std::string select_session_policy = "RR"; // MOD
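To illustrate the difference between the two policies (a hypothetical pure-Python sketch, not the processor's actual implementation): under "RR", successive calls rotate through all sessions, while under "MOD" a thread is pinned to sessions[uid % session_num]:

```python
import itertools

session_num = 2
sessions = [f"session-{i}" for i in range(session_num)]

# "RR": Round-Robin — successive requests take the next session in rotation.
rr = itertools.cycle(range(session_num))
rr_picks = [sessions[next(rr)] for _ in range(4)]
# -> ['session-0', 'session-1', 'session-0', 'session-1']

# "MOD": each thread is pinned to one session derived from its unique id.
def pick_mod(uid):
    return sessions[uid % session_num]

# Thread with uid=5 always gets sessions[5 % 2] == sessions[1].
```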

And More...

part1 🌈 w0.8 (based on w0.7)

Date: 2022-10-30 23:37:36
Score: 864000.0000
DIEN (s): 0.0000
DIN (s): 86.6291
DLRM (s): 123.3763
DeepFM (s): 118.7644
MMoE (s): 134.3123
WDL (s): 145.5450

Add configure & run / no configure & run

#!/bin/bash
# run.sh

#
echo
echo "> Run"

#
echo
echo ">> STEP@1"
cd /pro/DeepRec
ls -l

#
echo
echo ">> STEP@2"
# ./configure

#
echo
echo ">> STEP@3"
# bazel build -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

echo
echo ">> STEP@4"
# ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /pkg/tensorflow_pkg

echo
echo ">> STEP@5"
pip uninstall tensorflow -y
pip install /pkg/tensorflow_pkg/tensorflow-1.15.5+deeprec2206-cp36-cp36m-linux_x86_64.whl

echo
echo ">> STEP@6"
cd /pro/DeepRec/tianchi
python run_models.py

#
echo "> Run"

The Semi-Finals (复赛)

Preliminary round (July 19 - October 8, 2022, UTC+8)

Submission window and rules: July 19, 10:00 - October 8, 18:00. Each team may submit results up to 5 times per day; the system evaluates in real time and returns the latest score every hour. Teams are ranked by evaluation score in descending order, and the leaderboard shows each team's best historical score in this stage.
Advancing teams: at the end of the preliminary round, the leaderboard as of October 11, 18:00 is final. The organizing committee will review and disqualify teams for cheating or other improper behavior, with vacated slots filled by the next teams in line. The top 80 teams whose preliminary scores meet the requirements and who pass real-name verification advance to the semi-finals.

Semi-final round (October 13 - November 14, 2022, UTC+8)

Submission window and rules: October 13, 10:00 - November 14, 18:00. Each team may submit results once per day; the system evaluates in real time and returns the latest score every hour. Teams are ranked by evaluation score in descending order, and the leaderboard shows each team's best historical score in this stage.
Advancing teams: at the end of the semi-finals, the leaderboard published on November 16, 18:00 is final. The organizing committee will review and disqualify teams for cheating or other improper behavior, with vacated slots filled by the next teams in line. The top 6 teams whose semi-final scores meet the requirements and who pass real-name verification advance to the finals. Teams ranked 7-16 receive an excellence award and are invited to attend the final on site.

part1 🌈 w0.5 sess_config.graph_options.optimizer_options.micro_batch_num = 4

https://deeprec.readthedocs.io/zh/latest/Auto-Micro-Batch.html

The AutoMicroBatch feature depends on the user enabling the corresponding graph-optimization option. Note that if the user configures batch_size=1024 with micro_batch_num=2, the run is effectively equivalent, in convergence terms, to training with batch_size=2048. If the user previously trained with batch_size=512 and enables this large-batch feature with micro_batch_num=4, then to keep convergence unchanged it is recommended to also reduce batch_size to 128. The user interface is as follows:

config = tf.ConfigProto()
config.graph_options.optimizer_options.micro_batch_num = 4
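The arithmetic behind that recommendation, as a quick sketch using the values from the paragraph above: the effective batch size is batch_size × micro_batch_num, so keeping that product constant preserves convergence behavior:

```python
def effective_batch(batch_size, micro_batch_num):
    """Effective global batch size under AutoMicroBatch."""
    return batch_size * micro_batch_num

# batch_size=1024 with micro_batch_num=2 behaves like plain batch_size=2048.
assert effective_batch(1024, 2) == 2048

# To keep a previous batch_size=512 run's convergence with micro_batch_num=4,
# shrink the per-step batch to 512 // 4 == 128.
assert effective_batch(128, 4) == 512
```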

inter_threads/intra_threads = 1/1

  • serving/processor/serving/model_config.cc
    (*config)->inter_threads = schedule_threads / (8/1); // 2

    (*config)->intra_threads = schedule_threads / (8/1); // 2

part1 🌈 w0.6 sess_config.*

https://deeprec.readthedocs.io/zh/latest/Smart-Stage.html

  • sess_config.graph_options.optimizer_options.do_smart_stage = True

https://deeprec.readthedocs.io/zh/latest/Auto-Fusion.html

  • sess_config.graph_options.optimizer_options.do_op_fusion = True

https://deeprec.readthedocs.io/zh/latest/Async-Embedding-Stage.html

  • sess_config.graph_options.optimizer_options.do_async_embedding = True
  • sess_config.graph_options.optimizer_options.async_embedding_threads_num = 4
  • sess_config.graph_options.optimizer_options.async_embedding_capacity = 4

https://deeprec.readthedocs.io/zh/latest/Executor-Optimization.html

  • sess_config.executor_policy = tf.ExecutorPolicy.USE_INLINE_EXECUTOR

https://deeprec.readthedocs.io/zh/latest/XLA.html

  • sess_config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

part1 🌈 w0.3 storage_size=[1024*1024*1024, 10*1024*1024*1024]

Below is the interface definition for EmbeddingVariable multi-tier storage:

@tf_export(v1=["StorageOption"])
class StorageOption(object):
  def __init__(self,
               storage_type=None,
               storage_path=None,
               storage_size=[1024*1024*1024]):
    self.storage_type = storage_type
    self.storage_path = storage_path
    self.storage_size = storage_size

Parameter descriptions:

storage_type: the storage type to use; for example, DRAM_SSD uses DRAM and SSD as the embedding storage. The supported storage types are listed in Section 4.
storage_path: when SSD storage is used, this must be set to the directory in which embedding data is saved.
storage_size: the storage capacity, in bytes, that each tier may use. For example, to use 1 GB of DRAM and 10 GB of PMem with DRAM+PMem, configure [1024*1024*1024, 10*1024*1024*1024]. The default is 1 GB per tier; the current implementation cannot limit SSD usage.
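The byte values in that 1 GB DRAM / 10 GB PMem example work out as follows (a quick arithmetic check):

```python
GB = 1024 * 1024 * 1024  # 1 GiB in bytes

# 1 GB DRAM tier, 10 GB PMem tier, as in the storage_size example above.
storage_size = [1 * GB, 10 * GB]

assert storage_size == [1073741824, 10737418240]
```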

inter_threads/intra_threads = 4/8

  • serving/processor/serving/model_config.cc
    (*config)->inter_threads = schedule_threads / (8/4); // 2
    (*config)->intra_threads = schedule_threads / (8/8); // 2

Note

sess_config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.