
paddlecustomdevice's Introduction

PaddleCustomDevice

Simplified Chinese | English | Japanese


Custom hardware integration implementation for PaddlePaddle (『飞桨』).

User Guide

For the overall design, refer to the Custom Device integration overview; for the development guide, refer to the new-hardware integration example, whose sample code is located in CustomCPU.
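Usage, in brief (a hedged sketch, not taken from the linked docs): once a plugin backend such as the CustomCPU example is built and installed, it registers a custom device type that can be queried and selected from Python. The device type name custom_cpu below is an assumption based on the example backend.

```python
import paddle

# List the custom device types registered by installed PaddleCustomDevice plugins.
print(paddle.device.get_all_custom_device_type())  # e.g. ['custom_cpu'] (assumed name)

# Route subsequent kernels to the plugin backend.
paddle.set_device("custom_cpu")

x = paddle.ones([2, 3])
print(x.place)  # Place(custom_cpu:0)
```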

Hardware Backends

PaddlePaddle custom device integration supports the following hardware backends:

Copyright and License

PaddleCustomDevice is provided under the Apache-2.0 license.


paddlecustomdevice's Issues

[intel_gpu] BN args on Place(cpu) in fleet training

I am testing DDP ResNet-50 (RN50) training with fleet on a single machine with multiple "intel_gpu" cards. The forward pass runs fine, but the backward pass fails on an incorrect memcpy:

Copy 2048 Bytes from 0xffff8181ff0cc800(Place(cpu)) to 0x1ab0f000(Place(cpu))

Here the src is clearly a device buffer (device allocations in this log all carry the 0xffff81... prefix returned by the custom allocator), but it is wrongly treated as a host buffer, which causes the segfault.
From the logs I can see that some tensors, such as batch_norm2d_48.w_0, are still on Place(intel_gpu:0) when they are first filled by full, yet have become Place(cpu) by the time the BN kernel is called. Which directions should I investigate to find the cause of this?
The Paddle version is commit f55b387df0f473574f82c83da0c4c821829f35a7 (tag: v2.5.0-rc0, release/2.5); the only local change is adding the pointers to the memcpy log. The sketch below outlines the training setup; the trimmed log from worker0 follows it.
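For context, the run is set up roughly as in this minimal sketch (an assumed reconstruction using the standard fleet collective API, not the actual training script; the launcher command and data pipeline are omitted):

```python
import paddle
from paddle.distributed import fleet
from paddle.vision.models import resnet50

# Each worker selects its own "intel_gpu" card (card index handled by the launcher).
paddle.set_device("intel_gpu")
fleet.init(is_collective=True)  # multi-card collective (DDP-style) training

model = resnet50(num_classes=1000)
opt = paddle.optimizer.Momentum(learning_rate=0.1, parameters=model.parameters())

# Wrap model/optimizer for gradient synchronization (the reducer seen in the log).
model = fleet.distributed_model(model)
opt = fleet.distributed_optimizer(opt)

loss_fn = paddle.nn.CrossEntropyLoss()
x = paddle.randn([32, 3, 224, 224])      # stand-in for a real data loader
y = paddle.randint(0, 1000, [32, 1])

out = model(x)          # forward completes on Place(intel_gpu:0)
loss = loss_fn(out, y)
loss.backward()         # backward crashes around batch_norm_grad (see log)
opt.step()
opt.clear_grad()
```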

......

I0605 17:04:43.523797 1570317 dygraph_functions.cc:39262] Finish AD API: gaussian
I0605 17:04:43.523912 1570317 dygraph_functions.cc:39276] { Input: [],  
 Output: [ 
( out , [{Name: None, Initialized: 1, Ptr: 0x72edc60 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512, 512, 3, 3 ], ADInfo:[ None ]}]), ] } 
I0605 17:04:43.524639 1570317 eager.cc:653]  args_num: 5
I0605 17:04:43.524663 1570317 eager.cc:823] Calling case2's initializer.
I0605 17:04:43.524729 1570317 grad_node_info.cc:64] Construct GradNodeBase
I0605 17:04:43.524763 1570317 accumulation_node.h:27] Construct GradNodeAccumulation
I0605 17:04:43.524791 1570317 eager.cc:107] Tensor(batch_norm2d_48.w_0) have not GradNode, add GradNodeAccumulation0x85cdae0 for it.
I0605 17:04:43.524863 1570317 eager_properties.cc:198] eager_properties 'Shape' method, layout autotune  desired_layout: Undefined(AnyLayout) default_layout: Undefined(AnyLayout) tensor layout: NCHW tensor's shape size is : 1
I0605 17:04:43.524895 1570317 eager_op_function.cc:19529] Running Eager Final State API: full_
I0605 17:04:43.524907 1570317 eager_op_function.cc:19531] args count: 2
I0605 17:04:43.524927 1570317 eager_utils.cc:1424] type_name: str
I0605 17:04:43.525002 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:04:43.525034 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:04:43.525020 1570317 eager_op_function.cc:19562] CurrentDeviceId: 0 from 0
I0605 17:04:43.525058 1570317 dygraph_functions.cc:38666] Running AD API: full_
I0605 17:04:43.525069 1570317 dygraph_functions.cc:38672]  No AMP for full__ad_func because it is a inplace or cast api. 
I0605 17:04:43.525080 1570317 dygraph_functions.cc:38692] Running C++ API: full_
I0605 17:04:43.525146 1570317 dygraph_functions.cc:38703] { Input: [ 
( output , [{Name: batch_norm2d_48.w_0, Initialized: 0, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: Unknown, Place: Unknown, Shape: Unknown ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ]} 
I0605 17:04:43.525171 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:04:43.525214 1570317 api.cc:26073] full_ API kernel key: [intel_gpu, NCHW, float32]
I0605 17:04:43.525262 1570317 api.cc:26080] full kernel: {"input":[],"output":["intel_gpu, NCHW, float32"],"attribute":["IntArray","Scalar","DataType"]}
I0605 17:04:43.525293 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:04:43.525385 1570317 full_kernel.cc:25] FullValue type=float
I0605 17:04:43.525417 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:04:43.525431 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:04:43.525476 1570317 runtime.cc:234] request allocate size=2048 device=0
I0605 17:04:43.525560 1570317 runtime.cc:258] allocate success size=2048 left=1765120799
I0605 17:04:43.525624 1570317 auto_growth_best_fit_allocator.cc:118] Not found and reallocate 2048(0xffff8181fe020000), and remaining 0
I0605 17:04:43.525640 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181fe020000
I0605 17:04:43.525681 1570317 full_kernel.cc:29] FullValue size=512 sizeof(T)=4
I0605 17:04:43.526175 1570317 dygraph_functions.cc:38717] Finish AD API: full_
I0605 17:04:43.526319 1570317 dygraph_functions.cc:38734] { Input: [ 
( output , [{Name: batch_norm2d_48.w_0, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ],  
 Output: [ 
( out , [{Name: batch_norm2d_48.w_0, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ] } 

......

I0605 17:04:46.935668 1570317 reducer.cc:101] var[conv2d_48.w_0] 's type is float32
I0605 17:04:46.935672 1570317 reducer.cc:101] var[batch_norm2d_48.w_0] 's type is float32
I0605 17:04:46.935675 1570317 reducer.cc:101] var[batch_norm2d_48.b_0] 's type is float32
I0605 17:04:46.935679 1570317 reducer.cc:101] var[conv2d_49.w_0] 's type is float32

......

I0605 17:05:02.728678 1570317 eager_op_function.cc:16839] Running Eager Final State API: batch_norm
I0605 17:05:02.728682 1570317 eager_op_function.cc:16841] args count: 5
I0605 17:05:02.728744 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.728756 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.728751 1570317 eager_op_function.cc:16881] CurrentDeviceId: 0 from 0
I0605 17:05:02.728763 1570317 dygraph_functions.cc:33987] Running AD API: batch_norm
I0605 17:05:02.728767 1570317 dygraph_functions.cc:34050] Running C++ API: batch_norm
I0605 17:05:02.728855 1570317 dygraph_functions.cc:34073] { Input: [ 
( x , [{Name: None, Initialized: 1, Ptr: 0x1a0755d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [2]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1923ac40, ReluGradNode] },  ]SlotID: 1, StopGradients: 0, , Edges[  { [0, 0]: [0x72ed8a0, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( mean , [{Name: batch_norm2d_48.w_1, Initialized: 1, Ptr: 0x85cf840 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 1 ] ]}]),  
( variance , [{Name: batch_norm2d_48.w_2, Initialized: 1, Ptr: 0x82d3300 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 1 ] ]}]),  
( scale , [{Name: batch_norm2d_48.w_0, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( bias , [{Name: batch_norm2d_48.b_0, Initialized: 1, Ptr: 0x85cec70 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ]} 
I0605 17:05:02.728883 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.728912 1570317 api.cc:22540] batch_norm API kernel key: [intel_gpu, NCHW, float32]
I0605 17:05:02.728945 1570317 api.cc:22547] batch_norm kernel: {"input":["intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32"],"output":["intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32"],"attribute":["bool","float","float","string","bool","bool"]}
I0605 17:05:02.728962 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.729009 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.729014 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:02.729025 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.729063 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:02.729077 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.729082 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.729099 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.729101 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0ca000
I0605 17:05:02.729135 1570317 tensor_utils.cc:97] src:0x85e0000, dst:0xffff8181ff0ca000
I0605 17:05:02.729149 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0x85e0000) to Place(intel_gpu:0)(0xffff8181ff0ca000), stream=0
I0605 17:05:02.729158 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.729244 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.729259 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:02.729274 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181ff0ca000 src=0x85e0000 size=2048
I0605 17:05:02.729657 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.729671 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:02.729679 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.729691 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:02.729701 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.729704 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.729713 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.729717 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0ca800
I0605 17:05:02.729736 1570317 tensor_utils.cc:97] src:0x8841000, dst:0xffff8181ff0ca800
I0605 17:05:02.729748 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0x8841000) to Place(intel_gpu:0)(0xffff8181ff0ca800), stream=0
I0605 17:05:02.729758 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.729801 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.729820 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:02.729831 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181ff0ca800 src=0x8841000 size=2048
I0605 17:05:02.730211 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.730233 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:02.730244 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.730258 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:02.730268 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.730271 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.730279 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.730283 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cc000
I0605 17:05:02.730296 1570317 tensor_utils.cc:97] src:0x7c1a000, dst:0xffff8181ff0cc000
I0605 17:05:02.730306 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0x7c1a000) to Place(intel_gpu:0)(0xffff8181ff0cc000), stream=0
I0605 17:05:02.730315 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.730357 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.730368 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:02.730379 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181ff0cc000 src=0x7c1a000 size=2048
I0605 17:05:02.730762 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:02.730785 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:02.730795 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.730809 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:02.730818 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.730823 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.730829 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.730832 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cc800
I0605 17:05:02.730845 1570317 tensor_utils.cc:97] src:0xe650000, dst:0xffff8181ff0cc800
I0605 17:05:02.730856 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0xe650000) to Place(intel_gpu:0)(0xffff8181ff0cc800), stream=0
I0605 17:05:02.730866 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:02.730907 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:02.730918 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:02.730928 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181ff0cc800 src=0xe650000 size=2048
I0605 17:05:02.731307 1570317 api.cc:22582] Perform View between Output and Input Tensor, share allocation and inplace version.
I0605 17:05:02.731328 1570317 api.cc:22586] Perform View between Output and Input Tensor, share allocation and inplace version.
I0605 17:05:02.731400 1570317 dense_tensor.cc:139] Allocate data with bytes: 3211264
I0605 17:05:02.731405 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 3211264 bytes, aligned to 3211264
I0605 17:05:02.731415 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 3211264 bytes from chunk size 4194304, remaining 983040
I0605 17:05:02.731431 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 3211264 bytes, ptr = 0xffff81d5fdaf0000
I0605 17:05:02.731453 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.731457 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.731462 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.731464 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cd000
I0605 17:05:02.731470 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:02.731473 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:02.731477 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:02.731479 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cd800
I0605 17:05:02.731565 1570317 dense_tensor.cc:139] Allocate data with bytes: 200832
I0605 17:05:02.731571 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 200832 bytes, aligned to 200832
I0605 17:05:02.731577 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 200832 bytes from chunk size 262144, remaining 61312
I0605 17:05:02.731585 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 200832 bytes, ptr = 0xffff8181febeef80
onednn_verbose,exec,gpu:0,batch_normalization,ocl:ref:any,forward_training,data_f32::blocked:abcd:f0 diff_undef::undef::,attr-scratchpad:user ,flags:CH,mb32ic512ih7iw7,0.275146
I0605 17:05:02.732051 1570317 auto_growth_best_fit_allocator.cc:131] Free 200832 bytes, ptr = 0xffff8181febeef80
I0605 17:05:02.732097 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0cc800
I0605 17:05:02.732105 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0cc000
I0605 17:05:02.732123 1570317 grad_node_info.cc:64] Construct GradNodeBase
I0605 17:05:02.732144 1570317 grad_node_info.cc:238] Add Edges for slot: 0, the Edge is from BatchNormGradNode (addr: 0x198b7070)  to Conv2dGradNodeFinal (addr: 0x1a366ff0)
I0605 17:05:02.732151 1570317 grad_node_info.h:77] Reseting Edge's Grad Node
I0605 17:05:02.732168 1570317 grad_node_info.cc:238] Add Edges for slot: 3, the Edge is from BatchNormGradNode (addr: 0x198b7070)  to GradNodeAccumulation (addr: 0x85cdae0)
I0605 17:05:02.732172 1570317 grad_node_info.h:77] Reseting Edge's Grad Node
I0605 17:05:02.732177 1570317 grad_node_info.cc:238] Add Edges for slot: 4, the Edge is from BatchNormGradNode (addr: 0x198b7070)  to GradNodeAccumulation (addr: 0x85cf000)
I0605 17:05:02.732182 1570317 grad_node_info.h:77] Reseting Edge's Grad Node
I0605 17:05:02.732187 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732193 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732198 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732203 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732208 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732213 1570317 grad_node_info.cc:86] Set GradSlotMeta for Grad Inputs
I0605 17:05:02.732218 1570317 grad_node_info.cc:106] Skip Configuring GradSlotMeta for uninitialized GradInput Tensor
I0605 17:05:02.732223 1570317 dygraph_functions.cc:34183] Finish AD API: batch_norm
I0605 17:05:02.732421 1570317 dygraph_functions.cc:34224] { Input: [ 
( x , [{Name: None, Initialized: 1, Ptr: 0x1a0755d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [2]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1923ac40, ReluGradNode] },  ]SlotID: 1, StopGradients: 0, , Edges[  { [0, 0]: [0x72ed8a0, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( mean , [{Name: batch_norm2d_48.w_1, Initialized: 1, Ptr: 0x85cf840 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 1 ] ]}]),  
( variance , [{Name: batch_norm2d_48.w_2, Initialized: 1, Ptr: 0x82d3300 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 1 ] ]}]),  
( scale , [{Name: batch_norm2d_48.w_0, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( bias , [{Name: batch_norm2d_48.b_0, Initialized: 1, Ptr: 0x85cec70 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ],  
 Output: [ 
( out , [{Name: None, Initialized: 1, Ptr: 0x19561630 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( mean_out , [{Name: None, Initialized: 1, Ptr: 0x1a19dc90 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( variance_out , [{Name: None, Initialized: 1, Ptr: 0x18621490 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_mean , [{Name: None, Initialized: 1, Ptr: 0x188de050 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_variance , [{Name: None, Initialized: 1, Ptr: 0x19dce0f0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( reserve_space , [{Name: None, Initialized: 0, Ptr: 0x19f979d0 TensorInfo: [ Type: DenseTensor, Dtype: Unknown, Place: Unknown, Shape: Unknown ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ] } 
I0605 17:05:02.732524 1570317 eager_op_function.cc:11066] Running Eager Final State API: relu

......

I0605 17:05:07.889714 1570317 nodes.cc:14271] Finish AD API GRAD: relu_grad
I0605 17:05:07.889755 1570317 nodes.cc:14288] { Input: [ 
( grad_out , [{Name: None, Initialized: 1, Ptr: 0x7cda990 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]),  
( out , [{Name: @Saved, Initialized: 1, Ptr: 0x191d79d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x198b7070, BatchNormGradNode] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ],  
 Output: [ 
 ( grad_x , [{Name: None, Initialized: 1, Ptr: 0x19838cd0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]), ] } 
I0605 17:05:07.889763 1570317 backward.cc:283] retain_graph is false, need to clear the TensorWrapper of nodes.
I0605 17:05:07.889771 1570317 auto_growth_best_fit_allocator.cc:131] Free 3211264 bytes, ptr = 0xffff81ac030c1ea0
I0605 17:05:07.889791 1570317 backward.cc:312] Node: ReluGradNode addr:0x191da280, Found pending node: BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889796 1570317 backward.cc:339] Get Edge and grad_output_tensor with slot: 0, rank: 0 's name is: 
I0605 17:05:07.889798 1570317 grad_tensor_holder.h:32] Init GradTensorHolder with meta size: 6
I0605 17:05:07.889801 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889804 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889807 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889809 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889822 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889824 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.889827 1570317 backward.cc:348] Construct GradTensorHolder for grad node: BatchNormGradNode
I0605 17:05:07.889830 1570317 backward.cc:353] Sum or Move grad inputs for edge slot: 0, rank: 0
I0605 17:05:07.889834 1570317 grad_tensor_holder.cc:132] Move Tensor for buffer_ slot: 0, size: 1
I0605 17:05:07.889838 1570317 backward.cc:363] BatchNormGradNode ref_cnt is: 0
I0605 17:05:07.889843 1570317 backward.cc:243] Preparing GradNode:BatchNormGradNode addr:0x198b7070
I0605 17:05:07.889847 1570317 backward.cc:270] Run Backward Kernel with GradTensorHolder.
I0605 17:05:07.889849 1570317 nodes.cc:23093] Running AD API GRAD: batch_norm_grad
I0605 17:05:07.889856 1570317 grad_node_info.cc:43] float32 float32
I0605 17:05:07.889863 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889868 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889869 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889873 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode Conv2dGradNodeFinal addr: 0x1a366ff0
I0605 17:05:07.889878 1570317 tensor_wrapper.h:137] Recover tensor: batch_norm2d_48.w_0@Saved for wrapper
I0605 17:05:07.889880 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor 'batch_norm2d_48.w_0@Saved' is [ 0 ]
I0605 17:05:07.889883 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor 'batch_norm2d_48.w_0@Saved' is [ 0 ]
I0605 17:05:07.889886 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode GradNodeAccumulation addr: 0x85cdae0
I0605 17:05:07.889889 1570317 tensor_wrapper.h:137] Recover tensor: batch_norm2d_48.b_0@Saved for wrapper
I0605 17:05:07.889895 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor 'batch_norm2d_48.b_0@Saved' is [ 0 ]
I0605 17:05:07.889899 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor 'batch_norm2d_48.b_0@Saved' is [ 0 ]
I0605 17:05:07.889901 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode GradNodeAccumulation addr: 0x85cf000
I0605 17:05:07.889904 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889907 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889910 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889914 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889916 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889919 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889922 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889925 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889928 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889930 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889933 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889936 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889940 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889942 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889945 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889948 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889951 1570317 tensor_wrapper.h:137] Recover tensor: @Saved for wrapper
I0605 17:05:07.889953 1570317 tensor_wrapper.h:213]  The wrapper_version_snapshot of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889956 1570317 tensor_wrapper.h:216]  The tensor_version of Tensor '@Saved' is [ 0 ]
I0605 17:05:07.889959 1570317 tensor_wrapper.h:161] Recovered TensorWrapper with GradNode BatchNormGradNode addr: 0x198b7070
I0605 17:05:07.889963 1570317 nodes.cc:23146] Running C++ API: batch_norm_grad
I0605 17:05:07.890098 1570317 nodes.cc:23181] { Input: [ 
( grad_out , [{Name: None, Initialized: 1, Ptr: 0x19838cd0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]),  
( x , [{Name: @Saved, Initialized: 1, Ptr: 0x1a0755d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [2]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1923ac40, ReluGradNode] },  ]SlotID: 1, StopGradients: 0, , Edges[  { [0, 0]: [0x72ed8a0, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( scale , [{Name: batch_norm2d_48.w_0@Saved, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( bias , [{Name: batch_norm2d_48.b_0@Saved, Initialized: 1, Ptr: 0x85cec70 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( out_mean , [{Name: @Saved, Initialized: 1, Ptr: 0x1a19dc90 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( out_variance , [{Name: @Saved, Initialized: 1, Ptr: 0x18621490 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_mean , [{Name: @Saved, Initialized: 1, Ptr: 0x188de050 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_variance , [{Name: @Saved, Initialized: 1, Ptr: 0x19dce0f0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( reserve_space , [{Name: @Saved, Initialized: 0, Ptr: 0x19f979d0 TensorInfo: [ Type: DenseTensor, Dtype: Unknown, Place: Unknown, Shape: Unknown ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ]} 
I0605 17:05:07.890122 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890137 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890144 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890151 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890158 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890165 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890172 1570317 backward_api.cc:15475] batch_norm_grad API kernel key: [intel_gpu, NCHW, float32]
I0605 17:05:07.890183 1570317 backward_api.cc:15482] batch_norm_grad kernel: {"input":["intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32"],"output":["intel_gpu, NCHW, float32","intel_gpu, NCHW, float32","intel_gpu, NCHW, float32"],"attribute":["float","float","string","bool","bool","bool"]}
I0605 17:05:07.890195 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:07.890215 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:07.890220 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:07.890228 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:07.890240 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:07.890247 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:07.890251 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:07.890259 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:07.890261 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181febde000
I0605 17:05:07.890275 1570317 tensor_utils.cc:97] src:0x7c1a000, dst:0xffff8181febde000
I0605 17:05:07.890283 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0x7c1a000) to Place(intel_gpu:0)(0xffff8181febde000), stream=0
I0605 17:05:07.890290 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890336 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:07.890345 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:07.890355 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181febde000 src=0x7c1a000 size=2048
I0605 17:05:07.890707 1570317 runtime.cc:128] get-device() : device->id=0
I0605 17:05:07.890719 1570317 data_transform.cc:169] DeviceTransform in, src_place Place(cpu) dst_place: Place(intel_gpu:0)
I0605 17:05:07.890733 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:07.890744 1570317 tensor_utils.cc:50] TensorCopy 512 from Place(cpu) to Place(intel_gpu:0)
I0605 17:05:07.890753 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:07.890756 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:07.890764 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:07.890769 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cc000
I0605 17:05:07.890780 1570317 tensor_utils.cc:97] src:0xe650000, dst:0xffff8181ff0cc000
I0605 17:05:07.890790 1570317 memcpy.cc:66] memory::Copy 2048 Bytes from Place(cpu)(0xe650000) to Place(intel_gpu:0)(0xffff8181ff0cc000), stream=0
I0605 17:05:07.890800 1570317 runtime.cc:121] set-device : device->id=0
I0605 17:05:07.890838 1570317 context_pool.cc:62] DeviceContextPool Get: Place(intel_gpu:0)
I0605 17:05:07.890849 1570317 runtime.cc:324] sync-stream devid=0
I0605 17:05:07.890857 1570317 runtime.cc:374] memory-copy-h2d dst=0xffff8181ff0cc000 src=0xe650000 size=2048
I0605 17:05:07.891373 1570317 dense_tensor.cc:139] Allocate data with bytes: 3211264
I0605 17:05:07.891391 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 3211264 bytes, aligned to 3211264
I0605 17:05:07.891400 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 3211264 bytes from chunk size 4194304, remaining 983040
I0605 17:05:07.891409 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 3211264 bytes, ptr = 0xffff81aca4530000
I0605 17:05:07.891424 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:07.891427 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:07.891431 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:07.891434 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0cc800
I0605 17:05:07.891439 1570317 dense_tensor.cc:139] Allocate data with bytes: 2048
I0605 17:05:07.891443 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 2048 bytes, aligned to 2048
I0605 17:05:07.891446 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 2048 bytes from chunk size 2048, remaining 0
I0605 17:05:07.891449 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 2048 bytes, ptr = 0xffff8181ff0ce000
I0605 17:05:07.891551 1570317 dense_tensor.cc:139] Allocate data with bytes: 200832
I0605 17:05:07.891557 1570317 auto_growth_best_fit_allocator.cc:66] Allocate 200832 bytes, aligned to 200832
I0605 17:05:07.891562 1570317 auto_growth_best_fit_allocator.cc:76] Allocate 200832 bytes from chunk size 262144, remaining 61312
I0605 17:05:07.891569 1570317 auto_growth_best_fit_allocator.cc:123] Alloc 200832 bytes, ptr = 0xffff8181febeef80
onednn_verbose,exec,gpu:0,batch_normalization,ocl:ref:any,backward,data_f32::blocked:abcd:f0 diff_f32::blocked:abcd:f0,attr-scratchpad:user ,flags:CH,mb32ic512ih7iw7,0.158936
I0605 17:05:07.891780 1570317 auto_growth_best_fit_allocator.cc:131] Free 200832 bytes, ptr = 0xffff8181febeef80
I0605 17:05:07.891819 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0cc000
I0605 17:05:07.891827 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181febde000
I0605 17:05:07.891832 1570317 nodes.cc:23198] Fused api batch_norm_grad is called 
I0605 17:05:07.891839 1570317 nodes.cc:23285] Finish AD API GRAD: batch_norm_grad
I0605 17:05:07.892006 1570317 nodes.cc:23329] { Input: [ 
( grad_out , [{Name: None, Initialized: 1, Ptr: 0x19838cd0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]),  
( x , [{Name: @Saved, Initialized: 1, Ptr: 0x1a0755d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [2]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1923ac40, ReluGradNode] },  ]SlotID: 1, StopGradients: 0, , Edges[  { [0, 0]: [0x72ed8a0, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( scale , [{Name: batch_norm2d_48.w_0@Saved, Initialized: 1, Ptr: 0x72ed510 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( bias , [{Name: batch_norm2d_48.b_0@Saved, Initialized: 1, Ptr: 0x85cec70 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(cpu), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [1]: SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( out_mean , [{Name: @Saved, Initialized: 1, Ptr: 0x1a19dc90 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( out_variance , [{Name: @Saved, Initialized: 1, Ptr: 0x18621490 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_mean , [{Name: @Saved, Initialized: 1, Ptr: 0x188de050 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( saved_variance , [{Name: @Saved, Initialized: 1, Ptr: 0x19dce0f0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]),  
( reserve_space , [{Name: @Saved, Initialized: 0, Ptr: 0x19f979d0 TensorInfo: [ Type: DenseTensor, Dtype: Unknown, Place: Unknown, Shape: Unknown ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ BackwardOutMeta: [  {SlotSize: [5]: SlotID: 0, StopGradients: 0, , Edges[  { [0, 0]: [0x1a366ff0, Conv2dGradNodeFinal] },  ]SlotID: 1, StopGradients: , Edges[  ]SlotID: 2, StopGradients: , Edges[  ]SlotID: 3, StopGradients: 0, , Edges[  { [0, 0]: [0x85cdae0, GradNodeAccumulation] },  ]SlotID: 4, StopGradients: 0, , Edges[  { [0, 0]: [0x85cf000, GradNodeAccumulation] },  ]}  ], BackwardInMeta: [  {SlotSize: [SlotID: 0, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 1, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 2, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 3, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 4, StopGradients: 0, , Edges[ { NULL Edge } ]SlotID: 5, StopGradients: 0, , Edges[ { NULL Edge } ]]:  ] ], StopGradient: [ 0 ] ]}]), ],  
 Output: [ 
 ( grad_x , [{Name: None, Initialized: 1, Ptr: 0x6f3e3a0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 32, 512, 7, 7 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]),  
 ( grad_scale , [{Name: None, Initialized: 1, Ptr: 0x1a580460 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]),  
 ( grad_bias , [{Name: None, Initialized: 1, Ptr: 0x19ad40d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]), ] } 
I0605 17:05:07.892024 1570317 backward.cc:283] retain_graph is false, need to clear the TensorWrapper of nodes.
I0605 17:05:07.892030 1570317 auto_growth_best_fit_allocator.cc:131] Free 3211264 bytes, ptr = 0xffff81d5fcaf0000
I0605 17:05:07.892040 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0ca000
I0605 17:05:07.892047 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0ca800
I0605 17:05:07.892052 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0cd000
I0605 17:05:07.892058 1570317 auto_growth_best_fit_allocator.cc:131] Free 2048 bytes, ptr = 0xffff8181ff0cd800
I0605 17:05:07.892067 1570317 backward.cc:312] Node: BatchNormGradNode addr:0x198b7070, Found pending node: Conv2dGradNodeFinal addr: 0x1a366ff0
I0605 17:05:07.892071 1570317 backward.cc:339] Get Edge and grad_output_tensor with slot: 0, rank: 0 's name is: 
I0605 17:05:07.892074 1570317 grad_tensor_holder.h:32] Init GradTensorHolder with meta size: 1
I0605 17:05:07.892076 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.892079 1570317 backward.cc:348] Construct GradTensorHolder for grad node: Conv2dGradNodeFinal
I0605 17:05:07.892082 1570317 backward.cc:353] Sum or Move grad inputs for edge slot: 0, rank: 0
I0605 17:05:07.892086 1570317 grad_tensor_holder.cc:132] Move Tensor for buffer_ slot: 0, size: 1
I0605 17:05:07.892091 1570317 backward.cc:363] Conv2dGradNodeFinal ref_cnt is: 0
I0605 17:05:07.892094 1570317 backward.cc:312] Node: BatchNormGradNode addr:0x198b7070, Found pending node: GradNodeAccumulation addr: 0x85cdae0
I0605 17:05:07.892097 1570317 backward.cc:339] Get Edge and grad_output_tensor with slot: 3, rank: 0 's name is: 
I0605 17:05:07.892099 1570317 grad_tensor_holder.h:32] Init GradTensorHolder with meta size: 1
I0605 17:05:07.892102 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.892104 1570317 backward.cc:348] Construct GradTensorHolder for grad node: GradNodeAccumulation
I0605 17:05:07.892108 1570317 backward.cc:353] Sum or Move grad inputs for edge slot: 0, rank: 0
I0605 17:05:07.892112 1570317 grad_tensor_holder.cc:132] Move Tensor for buffer_ slot: 0, size: 1
I0605 17:05:07.892113 1570317 backward.cc:363] GradNodeAccumulation ref_cnt is: 0
I0605 17:05:07.892117 1570317 backward.cc:312] Node: BatchNormGradNode addr:0x198b7070, Found pending node: GradNodeAccumulation addr: 0x85cf000
I0605 17:05:07.892119 1570317 backward.cc:339] Get Edge and grad_output_tensor with slot: 4, rank: 0 's name is: 
I0605 17:05:07.892122 1570317 grad_tensor_holder.h:32] Init GradTensorHolder with meta size: 1
I0605 17:05:07.892124 1570317 grad_tensor_holder.h:35] Init GradTensorHolder with meta rank: 1
I0605 17:05:07.892127 1570317 backward.cc:348] Construct GradTensorHolder for grad node: GradNodeAccumulation
I0605 17:05:07.892130 1570317 backward.cc:353] Sum or Move grad inputs for edge slot: 0, rank: 0
I0605 17:05:07.892132 1570317 grad_tensor_holder.cc:132] Move Tensor for buffer_ slot: 0, size: 1
I0605 17:05:07.892135 1570317 backward.cc:363] GradNodeAccumulation ref_cnt is: 0
I0605 17:05:07.892140 1570317 auto_growth_best_fit_allocator.cc:131] Free 3211264 bytes, ptr = 0xffff81aca4a30000
I0605 17:05:07.892148 1570317 backward.cc:243] Preparing GradNode:GradNodeAccumulation addr:0x85cf000
I0605 17:05:07.892150 1570317 backward.cc:270] Run Backward Kernel with GradTensorHolder.
I0605 17:05:07.892153 1570317 accumulation_node.cc:103] Running AD API Grad: GradNodeAccumulation
I0605 17:05:07.892158 1570317 accumulation_node.cc:40] Move Tensor ptr: 0x19ad40d0
I0605 17:05:07.892163 1570317 reducer.cc:762] Tensor[146] [batch_norm2d_48.b_0@Grad] arrived and triggered disthook
I0605 17:05:07.892166 1570317 reducer.cc:778] Tensor[146][batch_norm2d_48.b_0] is marked ready.
I0605 17:05:07.892175 1570317 accumulation_node.cc:135] Finish AD API Grad: GradNodeAccumulation
I0605 17:05:07.892191 1570317 accumulation_node.cc:148] { Input: [], Output: [(grad_out, [{Name: None, Initialized: 1, Ptr: 0x19ad40d0 TensorInfo: [ Type: DenseTensor, Dtype: float32, Place: Place(intel_gpu:0), Shape: 512 ], ADInfo:[ Grad: [ {Name: None, Initialized: 0, Ptr: 0 TensorInfo: [ Unknown ], ADInfo:[ None ]} ],  GradNode: [ None ], StopGradient: [ 0 ] ]}]), ] } 
I0605 17:05:07.892197 1570317 backward.cc:283] retain_graph is false, need to clear the TensorWrapper of nodes.
I0605 17:05:07.892201 1570317 accumulation_node.h:47] Do nothing here now
I0605 17:05:07.892204 1570317 backward.cc:243] Preparing GradNode:GradNodeAccumulation addr:0x85cdae0
I0605 17:05:07.892207 1570317 backward.cc:270] Run Backward Kernel with GradTensorHolder.
I0605 17:05:07.892210 1570317 accumulation_node.cc:103] Running AD API Grad: GradNodeAccumulation
I0605 17:05:07.892215 1570317 accumulation_node.cc:40] Move Tensor ptr: 0x1a580460
I0605 17:05:07.892218 1570317 reducer.cc:762] Tensor[145] [batch_norm2d_48.w_0@Grad] arrived and triggered disthook
I0605 17:05:07.892221 1570317 reducer.cc:778] Tensor[145][batch_norm2d_48.w_0] is marked ready.
I0605 17:05:07.892226 1570317 reducer.cc:906] Group[0] is ready
I0605 17:05:07.892230 1570317 reducer.cc:1045] group [0] start fused_allreduce.
I0605 17:05:07.892242 1570317 api.cc:24921] empty API kernel key: [CPU, Undefined(AnyLayout), float32]
I0605 17:05:07.892256 1570317 api.cc:24928] empty kernel: {"input":[],"output":["CPU, NCHW, float32"],"attribute":["IntArray","DataType"]}
I0605 17:05:07.892292 1570317 dense_tensor.cc:139] Allocate data with bytes: 30261152
I0605 17:05:07.892329 1570317 context_pool.cc:62] DeviceContextPool Get: Place(cpu)
I0605 17:05:07.892359 1570317 memcpy.cc:743] memory::Copy 2048 Bytes from 0xffff8181ff0cc800(Place(cpu)) to 0x1ab0f000(Place(cpu))


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   egr::Backward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool)
1   egr::RunBackward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&)
2   egr::GradNodeAccumulation::operator()(paddle::small_vector<std::vector<paddle::Tensor, std::allocator<paddle::Tensor> >, 15u>&, bool, bool)
3   egr::GradNodeAccumulation::ApplyReduceHooks()
4   paddle::distributed::EagerReducer::MarkVarReady(unsigned long, bool)
5   paddle::distributed::EagerReducer::MarkGroupReady(unsigned long)
6   paddle::distributed::EagerReducer::FusedAllReduceSchedule(paddle::distributed::EagerGroup*, int)
7   paddle::distributed::EagerGroup::ConcatTensors(phi::Place const&)
8   paddle::operators::math::ConcatFunctor<phi::CPUContext, float>::operator()(phi::CPUContext const&, std::vector<phi::DenseTensor, std::allocator<phi::DenseTensor> > const&, int, phi::DenseTensor*)
9   phi::funcs::ConcatFunctor<phi::CPUContext, float>::operator()(phi::CPUContext const&, std::vector<phi::DenseTensor, std::allocator<phi::DenseTensor> > const&, int, phi::DenseTensor*)
10  phi::memory_utils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long)
11  phi::MemoryUtils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long)
12  void paddle::memory::Copy<phi::Place, phi::Place>(phi::Place, void*, phi::Place, void const*, unsigned long)

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1685955907 (unix time) try "date -d @1685955907" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0xffff8181ff0cc800) received by PID 1570317 (TID 0x7fa143155740) from PID 18446744073693612032 ***]

paddle.jit.to_static fail with custom device

CustomDevice currently does not support dygraph-to-static conversion; the reproduction steps are as follows.

  1. Install paddle-custom-npu based on README

  2. Running mnist_train.py on CPU succeeds:

from __future__ import print_function

import os
import shutil
import numpy as np
import paddle
from paddle import nn
import paddle.nn.functional as F

class ConvBNLayer(nn.Layer):
    def __init__(self,
                 num_channels,
                 num_filters,
                 filter_size,
                 stride,
                 padding,
                 num_groups=1):
        super().__init__()

        self.conv = nn.Conv2D(
            in_channels=num_channels,
            out_channels=num_filters,
            kernel_size=filter_size,
            stride=stride,
            padding=padding,
            groups=num_groups,
            weight_attr=None,
            bias_attr=False)
        self.bn = nn.BatchNorm(num_filters)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

class MNIST(nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()

        self.conv0 = ConvBNLayer(
                    num_channels=1,
                    num_filters=4,
                    filter_size=5,
                    stride=1,
                    padding=0,
                    num_groups=1)
        self.conv1 = ConvBNLayer(
                    num_channels=4,
                    num_filters=4,
                    filter_size=1,
                    stride=1,
                    padding=0,
                    num_groups=1)
        self.max_pool = nn.MaxPool2D(kernel_size=4, stride=4, padding=0)
        self.fc = nn.Linear(in_features=144, out_features=10)

    @paddle.jit.to_static()
    def forward(self, inputs, label=None):
        x = self.conv0(inputs)
        x1 = self.max_pool(x)
        x2 = self.conv1(x1)
        x = paddle.add(x=x1, y=x2)
        x = paddle.flatten(x, start_axis=1, stop_axis=-1)
        x = self.fc(x)
        out = F.softmax(x)
        if label is not None:
            acc = paddle.metric.accuracy(input=x, label=label)
            return out, acc
        else:
            return out

def test_mnist(test_reader, mnist_model):
    acc_set = []
    avg_loss_set = []

    for batch_id, data in enumerate(test_reader()):
        x_data = np.array([x[0].reshape(1, 28, 28) for x in data]).astype('float32')
        y_data = np.array([x[1] for x in data]).astype('int64').reshape(-1, 1)

        image = paddle.to_tensor(x_data)
        label = paddle.to_tensor(y_data)

        prediction, acc = mnist_model(image, label)
        loss = F.cross_entropy(input=prediction, label=label)
        avg_loss = paddle.mean(loss)

        acc_set.append(float(acc.numpy()))
        avg_loss_set.append(float(avg_loss.numpy()))

    acc_val_mean = np.array(acc_set).mean()
    avg_loss_val_mean = np.array(avg_loss_set).mean()
    return avg_loss_val_mean, acc_val_mean


def train_mnist(num_epochs, save_dirname):
    paddle.set_device('cpu')

    mnist = MNIST()
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=mnist.parameters())

    train_reader = paddle.batch(paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
    test_reader = paddle.batch(paddle.dataset.mnist.test(), batch_size=BATCH_SIZE, drop_last=True)

    for epoch in range(num_epochs):
        for batch_id, data in enumerate(train_reader()):
            x_data = np.array([x[0].reshape(1, 28, 28) for x in data]).astype('float32')
            y_data = np.array([x[1] for x in data]).astype('int64').reshape(-1, 1)

            image = paddle.to_tensor(x_data)
            label = paddle.to_tensor(y_data)

            cost, acc = mnist(image, label)
            loss = F.cross_entropy(cost, label)
            avg_loss = paddle.mean(loss)

            avg_loss.backward()
            adam.minimize(avg_loss)
            mnist.clear_gradients()

            if batch_id % 100 == 0:
                print("Loss at epoch {} step {}: {:}".format(epoch, batch_id, avg_loss.numpy()))

        mnist.eval()
        test_cost, test_acc = test_mnist(test_reader, mnist)
        mnist.train()
        print("Loss at epoch {} , Test avg_loss is: {}, acc is: {}".format(epoch, test_cost, test_acc))

    # save inference model
    if save_dirname is None:
        return
    # delete old model
    if  os.path.exists(save_dirname):
        shutil.rmtree(save_dirname)
        os.makedirs(save_dirname)
    # save inference model
    mnist.eval()
    model = paddle.jit.to_static(mnist, input_spec=[paddle.static.InputSpec([None, 1, 28, 28], 'float32', 'image')])
    paddle.jit.save(model, save_dirname)

if __name__ == '__main__':
    BATCH_SIZE = 64
    train_mnist(num_epochs=1, save_dirname='assets/mnist')
  3. Change paddle.set_device('cpu') to paddle.set_device('ascend'); it then fails with the following error:
(base) λ cann504 /workspace/my-demo-code/PaddlePaddle/mnistv2 {develop} python mnist_train.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0618 05:08:42.379591 88034 init.cc:259] ENV [CUSTOM_DEVICE_ROOT]=/opt/conda/lib/python3.7/site-packages/paddle-plugins
I0618 05:08:42.379639 88034 init.cc:147] Try loading custom device libs from: [/opt/conda/lib/python3.7/site-packages/paddle-plugins]
I0618 05:08:42.832442 88034 custom_device.cc:712] Successed in loading custom runtime in lib: /opt/conda/lib/python3.7/site-packages/paddle-plugins/libpaddle-custom-npu.so
I0618 05:08:42.834924 88034 custom_kernel.cc:70] Successed in loading 123 custom kernel(s) from loaded lib(s), will be used like native ones.
I0618 05:08:42.835026 88034 init.cc:159] Finished in LoadCustomDevice with libs_path: [/opt/conda/lib/python3.7/site-packages/paddle-plugins]
I0618 05:08:42.835060 88034 init.cc:265] CustomDevice: ascend, visible devices count: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 379, in __call__
    return partial_program_layer(args)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/partial_program.py", line 349, in __call__
    in_vars, out_vars = self._prepare(inputs)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/partial_program.py", line 443, in _prepare
    expected_place):
TypeError: _equals(): incompatible function arguments. The following argument types are supported:
    1. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.Place) -> bool
    2. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.CUDAPlace) -> bool
    3. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.CPUPlace) -> bool
    4. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.XPUPlace) -> bool
    5. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.NPUPlace) -> bool
    6. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.IPUPlace) -> bool
    7. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.CUDAPinnedPlace) -> bool
    8. (self: paddle.fluid.core_avx.Place, arg0: paddle.fluid.core_avx.MLUPlace) -> bool

Invoked with: Place(ascend:0), Place(ascend:0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mnist_train.py", line 158, in <module>
    train_mnist(num_epochs=1, save_dirname='assets/mnist')
  File "mnist_train.py", line 128, in train_mnist
    cost, acc = mnist(image, label)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 388, in __call__
    error_data.raise_new_exception()
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/error.py", line 328, in raise_new_exception
    new_exception = self.create_exception()
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/error.py", line 161, in create_exception
    message = self.create_message()
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/error.py", line 182, in create_message
    self._simplify_error_value()
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/error.py", line 267, in _simplify_error_value
    start_idx = error_value_lines_strip.index(start_trace)
ValueError: 'outputs = static_func(*inputs)' is not in list

[NPU] Test cases fail after building with the ON_INFER option enabled

The environment was prepared following the material below, setting the global variable export ON_INFER=ON before compiling:
https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/npu/README_cn.md

Issue 1: symbol _ZN3fLB15FLAGS_set_to_1dE cannot be found:
(py37env) λ ascend /home/code/PaddleCustomDevice/backends/npu {develop} python3 tests/unittests/test_custom_pass_npu.py
I0628 18:17:59.323984 44207 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/home/code/PaddleCustomDevice/backends/npu/build
I0628 18:17:59.324012 44207 init.cc:145] Try loading custom device libs from: [/home/code/PaddleCustomDevice/backends/npu/build]
Traceback (most recent call last):
File "tests/unittests/test_custom_pass_npu.py", line 20, in
import paddle
File "/opt/py37env/lib/python3.7/site-packages/paddle/init.py", line 31, in
from .framework import monkey_patch_variable
File "/opt/py37env/lib/python3.7/site-packages/paddle/framework/init.py", line 17, in
from . import random # noqa: F401
File "/opt/py37env/lib/python3.7/site-packages/paddle/framework/random.py", line 17, in
from paddle import fluid
File "/opt/py37env/lib/python3.7/site-packages/paddle/fluid/init.py", line 211, in
bootstrap()
File "/opt/py37env/lib/python3.7/site-packages/paddle/fluid/init.py", line 203, in bootstrap
core.init_devices()
ValueError: (InvalidArgument) Fail to open library: /home/code/PaddleCustomDevice/backends/npu/build/libpaddle-custom-npu.so with error: /home/code/PaddleCustomDevice/backends/npu/build/libpaddle-custom-npu.so: undefined symbol: _ZN3fLB15FLAGS_set_to_1dE
[Hint: dso_handle should not be null.] (at /home/code/Paddle/paddle/fluid/platform/init.cc:152)

Issue 2: after working around Issue 1:
(py37env) λ ascend /home/code/PaddleCustomDevice/backends/npu {develop} python3 tests/unittests/test_custom_pass_npu.py
I0628 18:22:27.276496 45602 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/home/code/PaddleCustomDevice/backends/npu/build
I0628 18:22:27.276530 45602 init.cc:145] Try loading custom device libs from: [/home/code/PaddleCustomDevice/backends/npu/build]
I0628 18:22:27.876804 45602 custom_device.cc:1115] Successed in loading custom runtime in lib: /home/code/PaddleCustomDevice/backends/npu/build/libpaddle-custom-npu.so
I0628 18:22:27.887941 45602 custom_kernel.cc:76] Successed in loading 316 custom kernel(s) from loaded lib(s), will be used like native ones.
I0628 18:22:27.888248 45602 init.cc:157] Finished in LoadCustomDevice with libs_path: [/home/code/PaddleCustomDevice/backends/npu/build]
I0628 18:22:27.888275 45602 init.cc:245] CustomDevice: npu, visible devices count: 1
/opt/py37env/lib/python3.7/site-packages/paddle/jit/api.py:945: UserWarning: What you save is a function, and jit.save will generate the name of the model file according to path you specify. When loading these files with jit.load, you get a TranslatedLayer whose inference result is the same as the inference result of the function you saved.
'What you save is a function, and jit.save will generate the name of the model file according to path you specify. When loading these files with jit.load, you get a TranslatedLayer whose inference result is the same as the inference result of the function you saved.'
/opt/py37env/lib/python3.7/site-packages/paddle/static/io.py:994: UserWarning: no variable in your model, please ensure there are any variables in your model to save
"no variable in your model, please ensure there are any variables in your model to save"
['generate_add_n']
I0628 18:22:28.240198 45602 analysis_predictor.cc:1502] CustomDevice is enabled
--- Running analysis [ir_graph_build_pass]
I0628 18:22:28.240448 45602 executor.cc:187] Old Executor is Running.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0628 18:22:28.240542 45602 allocator_facade.cc:331] GetAllocator Place(cpu) 1
I0628 18:22:28.240576 45602 allocator_facade.cc:331] GetAllocator Place(cpu) 1
I0628 18:22:28.240581 45602 allocator_facade.cc:331] GetAllocator Place(cpu) 0
I0628 18:22:28.240588 45602 allocator_facade.cc:331] GetAllocator Place(cpu) 0
--- Running analysis [ir_analysis_pass]
--- Running IR pass [generate_add_n]
I0628 18:22:28.244626 45602 ir_analysis_pass.cc:46] argument has no fuse statis
--- Running analysis [save_optimized_model_pass]
W0628 18:22:28.244675 45602 save_optimized_model_pass.cc:28] save_optim_cache_model is turned off, skip save_optimized_model_pass
--- Running analysis [ir_params_sync_among_devices_pass]
I0628 18:22:28.244690 45602 ir_params_sync_among_devices_pass.cc:142] Sync params from CPU to npu:0
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0628 18:22:28.244761 45602 memory_optimize_pass.cc:118] The persistable params in main graph are : 0MB
I0628 18:22:28.244805 45602 memory_optimize_pass.cc:246] Cluster name : y size: 128
I0628 18:22:28.244812 45602 memory_optimize_pass.cc:246] Cluster name : x size: 128
I0628 18:22:28.244814 45602 memory_optimize_pass.cc:246] Cluster name : z size: 128
--- Running analysis [ir_graph_to_program_pass]
I0628 18:22:28.245867 45602 analysis_predictor.cc:1676] ======= optimize end =======
I0628 18:22:28.245908 45602 naive_executor.cc:167] --- skip [feed], feed -> z
I0628 18:22:28.245914 45602 naive_executor.cc:167] --- skip [feed], feed -> y
I0628 18:22:28.245918 45602 naive_executor.cc:167] --- skip [feed], feed -> x
I0628 18:22:28.245944 45602 naive_executor.cc:167] --- skip [tmp_1], fetch -> fetch

E

ERROR: test_my_add_n (main.TestCustomPass)


Traceback (most recent call last):
File "tests/unittests/test_custom_pass_npu.py", line 77, in test_my_add_n
input_tensor.copy_from_cpu(np_inputs[i])
File "/opt/py37env/lib/python3.7/site-packages/paddle/inference/wrapper.py", line 46, in tensor_copy_from_cpu
self._copy_from_cpu_bind(data)
RuntimeError: (NotFound) No allocator found for the place, Place(npu:0)
[Hint: Expected iter != allocators.end(), but received iter == allocators.end().] (at /home/code/Paddle/paddle/fluid/memory/allocation/allocator_facade.cc:338)


Ran 1 test in 0.086s

FAILED (errors=1)

[intel_gpu] 从 "paddle/phi/capi/all.h" 切换到 "paddle/phi/extension.h"

Our previous implementation followed the CustomDevice documentation and implemented kernels by including "paddle/phi/capi/all.h". To stay consistent with other CustomDevice implementations and to meet some development needs, we are switching to "paddle/phi/extension.h". After compiling, running

python -c "import paddle"

reports the following error:

I0522 14:07:09.374876 4179087 init.cc:231] ENV [CUSTOM_DEVICE_ROOT]=/home/youlei/miniconda3/envs/pd/lib/python3.10/site-packages/paddle_custom_device
I0522 14:07:09.374919 4179087 init.cc:140] Try loading custom device libs from: [/home/youlei/miniconda3/envs/pd/lib/python3.10/site-packages/paddle_custom_device]
free(): invalid pointer


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::InitDevices()
1   paddle::framework::InitDevices(std::vector<int, std::allocator<int> >)
2   paddle::framework::LoadCustomDevice(std::string const&)
3   phi::KernelRegistrar::KernelRegistrar(phi::RegType, char const*, char const*, phi::DataLayout, phi::DataType, void (*)(phi::KernelKey const&, phi::KernelArgsDef*), void (*)(phi::KernelKey const&, phi::Kernel*), std::function<void (phi::KernelContext*)>, void*)
4   phi::KernelRegistrar::ConstructKernel(phi::RegType, char const*, char const*, phi::DataLayout, phi::DataType, void (*)(phi::KernelKey const&, phi::KernelArgsDef*), void (*)(phi::KernelKey const&, phi::Kernel*), std::function<void (phi::KernelContext*)>, void*)
5   phi::CustomKernelMap::RegisterCustomKernel(std::string const&, phi::KernelKey const&, phi::Kernel const&)
6   std::pair<paddle::detailv3::sherwood_v3_table<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::string, std::hash<std::string >, paddle::detailv3::KeyOrValueHasher<std::string, std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::hash<std::string > >, std::equal_to<std::string >, paddle::detailv3::KeyOrValueEquality<std::string, std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::equal_to<std::string > >, std::allocator<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > >, std::allocator<paddle::detailv3::sherwood_v3_entry<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > > > >::templated_iterator<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > >, bool> paddle::detailv3::sherwood_v3_table<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::string, std::hash<std::string >, paddle::detailv3::KeyOrValueHasher<std::string, std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::hash<std::string > >, std::equal_to<std::string >, paddle::detailv3::KeyOrValueEquality<std::string, std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > >, std::equal_to<std::string > >, std::allocator<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > >, std::allocator<paddle::detailv3::sherwood_v3_entry<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > > > >::emplace_new_key<std::string const&, paddle::flat_hash_map<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > >, std::hash<std::string >, std::equal_to<std::string >, std::allocator<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > > >::convertible_to_value>(signed char, paddle::detailv3::sherwood_v3_entry<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > >*, std::string const&, 
paddle::flat_hash_map<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > >, std::hash<std::string >, std::equal_to<std::string >, std::allocator<std::pair<std::string, paddle::flat_hash_map<phi::KernelKey, phi::Kernel, phi::KernelKey::Hash, std::equal_to<phi::KernelKey>, std::allocator<std::pair<phi::KernelKey, phi::Kernel> > > > > >::convertible_to_value&&)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1684735629 (unix time) try "date -d @1684735629" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3e9003fc48f) received by PID 4179087 (TID 0x7fa4a648c740) from PID 4179087 ***]

Aborted (core dumped)

The main changes made so far are:

  • Change every #include "paddle/phi/capi/all.h" to "paddle/phi/extension.h";
  • Change kernel signatures of the form
template <typename T>
void Kernel(const phi::Context& dev_ctx, ...

to:

template <typename T, typename Context>
void AssignValueKernel(const Context& dev_ctx, ...
  • Replace PD_BUILD_PHI_KERNEL with PD_REGISTER_PLUGIN_KERNEL;

  • Other related changes such as DDim and DataType.

I have not been able to find the cause; is there anything I missed?

[NPU][ResNet50] INT64 data types run on AI_CPU instead of AI_Core

Symptom: operators such as Mul and Cast whose inputs are INT64 run on AI_CPU instead of AI_CORE.
Cause: Ascend and similar domestic hardware has limited support for the INT64 data type, while the Paddle framework reads dataset labels as int64 by default.
Solution: modify the suite code to force the label data type to int32.
Analysis method: add the ACL profiler interface to the model to obtain performance profiling results.

Temporary workaround:

# The model code that invokes mul, cast, etc. is
cost = nn.CrossEntropyLoss()
loss = cost(outputs, labels) # invoked here

# The framework code that invokes mul, cast, etc. is
def cross_entropy(... ...
valid_label = paddle.multiply(paddle.cast(label != ignore_index, dtype=label.dtype), label)

# Fix: apply the diff below, or modify the dataset code inside the suite
diff --git a/python/paddle/vision/datasets/folder.py b/python/paddle/vision/datasets/folder.py
index 6ac0c4ca91..569396b482 100644
--- a/python/paddle/vision/datasets/folder.py
+++ b/python/paddle/vision/datasets/folder.py
@@ -271,8 +271,10 @@ class DatasetFolder(Dataset):
         sample = self.loader(path)
         if self.transform is not None:
             sample = self.transform(sample)
+        import numpy as np
+        return sample, np.array([target]).astype('int32')

-        return sample, target
+        #return sample, target

     def __len__(self):
         return len(self.samples)
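As an alternative to patching the framework's dataset code, the cast can also be done on the user side. A minimal sketch (an assumption of this note, not part of the original fix; it assumes cross_entropy accepts int32 labels) that forces labels to int32 before the loss so the Mul/Cast ops stay off AI_CPU:

import paddle

label = paddle.to_tensor([[3], [7]], dtype='int64')   # what the dataloader typically yields
label = paddle.cast(label, 'int32')                   # force int32 before computing the loss
loss = paddle.nn.functional.cross_entropy(paddle.rand([2, 10]), label)

If the int32 path is not accepted on a given stack, the dataset-level diff above remains the safer change.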

[intel_gpu] elementwise_sub op unittest failed

I ported the test_elementwise_sub_op_mlu.py test script from the mlu directory to intel_gpu, and the following test cases fail:
1 TestElementwiseSubOp_broadcast_4
2 TestElementwiseSubOp_commonuse_1
3 TestElementwiseSubOp_commonuse_2
The failures are all similar; the error message for TestElementwiseSubOp_broadcast_4 is:

ERROR: test_check_grad_ingore_x (main.TestElementwiseSubOp_broadcast_4)

Traceback (most recent call last):
File "test_elementwise_sub_op_intel_gpu.py", line 65, in test_check_grad_ingore_x
self.place, ["Y"], "Out", max_relative_error=0.005, no_grad_set=set("X")
File "/home/tcl/master/frameworks.ai.paddle.gpu/python/tests/op_test.py", line 2517, in check_grad_with_place
atol=atol,
File "/home/tcl/master/frameworks.ai.paddle.gpu/python/tests/op_test.py", line 2250, in _assert_is_close
diff_mat = np.abs(a - b) / abs_a
ValueError: operands could not be broadcast together with shapes (2,5,1,12) (2,5,12)

======================================================================
ERROR: test_check_grad_normal (main.TestElementwiseSubOp_broadcast_4)

Traceback (most recent call last):
File "test_elementwise_sub_op_intel_gpu.py", line 61, in test_check_grad_normal
self.check_grad_with_place(self.place, ["X", "Y"], "Out")
File "/home/tcl/master/frameworks.ai.paddle.gpu/python/tests/op_test.py", line 2517, in check_grad_with_place
atol=atol,
File "/home/tcl/master/frameworks.ai.paddle.gpu/python/tests/op_test.py", line 2250, in _assert_is_close
diff_mat = np.abs(a - b) / abs_a
ValueError: operands could not be broadcast together with shapes (2,5,1,12) (2,5,12)
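For context, a small sketch of the shape rule the gradient check relies on (the shapes here are illustrative and not necessarily the ones used by the failing cases): when y is broadcast along an axis, its gradient is reduced back to y's own shape, so both the analytic and the numeric gradient should carry that shape, which may help locate which of the two gradients ended up with the unexpected one.

import paddle

x = paddle.rand([2, 5, 3, 12])
y = paddle.rand([2, 5, 1, 12])
x.stop_gradient = False
y.stop_gradient = False
out = x - y
out.backward(paddle.ones_like(out))
print(y.grad.shape)   # [2, 5, 1, 12] -- the gradient keeps y's shape, not the broadcast shape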


[NPU][ResNet50] Paddle's CrossEntropy op launches redundant operators compared with Torch

Symptom: running the CrossEntropy op alone, Paddle issues many more calls to operators such as cast, multiply and full than Torch does.

Code location and fix

# ./nn/functional/loss.py
def cross_entropy(... ...
    if in_dygraph_mode():
        if soft_label == False: # by the op definition, valid_label is only needed when weight != None, not for every computation, so it can be skipped
            valid_label = (
                paddle.cast(label != ignore_index, dtype=label.dtype) * label
            )
# otherwise a lot of time is spent computing the full and cast operators
# also, the NPU computation originally used the raw label instead of valid_label; the code below was changed by MLU
# PR1: MLU reuses the NPU code https://github.com/PaddlePaddle/Paddle/pull/39523
# PR2: MLU changed it to use valid_label https://github.com/PaddlePaddle/Paddle/pull/45201 ==> confirming with Cambricon why this was changed
# Cambricon's reply: some labels contain the value 255, and valid_label filters 255 out; this was hit on the deeplabv3 model. Still to confirm: "ignore_index should only ignore one class, but 255 is also removed"?
        if core.is_compiled_with_npu() or core.is_compiled_with_mlu():
            if soft_label == False:
                _, _, out = _legacy_C_ops.softmax_with_cross_entropy(
                    input,
                    valid_label, # NPU originally passed label here, so the valid_label computation was not needed
                    'soft_label',
                    soft_label,
                    'ignore_index',
                    ignore_index,
                    'numeric_stable_mode',
                    True,
                    'axis',
                    axis,
                    'use_softmax',
                    use_softmax,
                )
 
 # after commenting out the following code
             valid_label = (
                paddle.cast(label != ignore_index, dtype=label.dtype) * label
            )
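The operator summaries below look like paddle.profiler output; a minimal sketch of how such a summary can be collected (assumed usage, the exact script behind the numbers below is not shown in this issue):

import paddle
import paddle.profiler as profiler

x = paddle.rand([8, 10])
label = paddle.randint(0, 10, [8])          # int64 labels, as read by default

p = profiler.Profiler(targets=[profiler.ProfilerTarget.CPU])
p.start()
loss = paddle.nn.functional.cross_entropy(x, label)
p.stop()
p.summary(op_detail=True, time_unit='ms')   # prints the Operator Summary table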

Profiling output comparison before and after the fix

# profiling output before and after the fix
 ----------------------------------------------------------------Operator Summary---------------------
Time unit: ms
----------------------------------------------------  ------  ---------------------------------------- 
Name                                                  Calls   CPU Total / Avg / Max / Min / Ratio(%) 
----------------------------------------------------  ------  ---------------------------------------- 
-----------------------------------------------------------Thread: All threads merged------------------
full dygraph                                          1       0.35 / 0.35 / 0.35 / 0.35 / 41.28
  full infer_meta                                     1       0.00 / 0.00 / 0.00 / 0.00 / 0.81 
  full compute                                        1       0.17 / 0.17 / 0.17 / 0.17 / 50.29
cross_entropy_with_softmax dygraph                    1       0.17 / 0.17 / 0.17 / 0.17 / 20.35
  cross_entropy_with_softmax infer_meta               1       0.00 / 0.00 / 0.00 / 0.00 / 1.10 
  cross_entropy_with_softmax compute                  1       0.16 / 0.16 / 0.16 / 0.16 / 91.65
not_equal dygraph                                     1       0.09 / 0.09 / 0.09 / 0.09 / 10.28
  not_equal infer_meta                                1       0.00 / 0.00 / 0.00 / 0.00 / 4.14 
  not_equal compute                                   1       0.07 / 0.07 / 0.07 / 0.07 / 79.23
multiply dygraph                                      1       0.08 / 0.08 / 0.08 / 0.08 / 10.03
  multiply infer_meta                                 1       0.00 / 0.00 / 0.00 / 0.00 / 3.14 
  multiply compute                                    1       0.07 / 0.07 / 0.07 / 0.07 / 83.12
cast dygraph                                          1       0.08 / 0.08 / 0.08 / 0.08 / 9.64 
  cast infer_meta                                     1       0.00 / 0.00 / 0.00 / 0.00 / 1.96 
  cast compute                                        1       0.07 / 0.07 / 0.07 / 0.07 / 83.00
mean_all dygraph                                      1       0.07 / 0.07 / 0.07 / 0.07 / 8.42 
  mean_all infer_meta                                 1       0.00 / 0.00 / 0.00 / 0.00 / 1.43 
  mean_all compute                                    1       0.06 / 0.06 / 0.06 / 0.06 / 84.33
----------------------------------------------------  ------  ---------------------------------------- 

# comparing before and after: a large amount of time spent in the full, cast and multiply operators is eliminated

----------------------------------------------------------------Operator Summary-----------------------
Time unit: ms
----------------------------------------------------  ------  ---------------------------------------- 
Name                                                  Calls   CPU Total / Avg / Max / Min / Ratio(%) 
----------------------------------------------------  ------  ---------------------------------------- 
-----------------------------------------------------------Thread: All threads merged------------------
cross_entropy_with_softmax dygraph                    1       0.36 / 0.36 / 0.36 / 0.36 / 85.35
  cross_entropy_with_softmax infer_meta               1       0.00 / 0.00 / 0.00 / 0.00 / 0.99 
  cross_entropy_with_softmax compute                  1       0.21 / 0.21 / 0.21 / 0.21 / 58.29
mean_all dygraph                                      1       0.06 / 0.06 / 0.06 / 0.06 / 14.65
  mean_all infer_meta                                 1       0.00 / 0.00 / 0.00 / 0.00 / 1.67 
  mean_all compute                                    1       0.05 / 0.05 / 0.05 / 0.05 / 83.78
----------------------------------------------------  ------  ---------------------------------------- 

[Paddle] Rolling out 0-D Tensor support

For background and the plan in the main framework, see: https://ku.baidu-int.com/knowledge/HFVrC7hq1Q/yKeL8Lljko/6UmOO2EkH2/Gv4bMvQxAw61WT

Current status

  • In the main repo, 0-D inputs are done and 0-D outputs are about 50% done; the rest is being upgraded (see the task breakdown - 0-D outputs) and is expected to be merged in April.
  • Once the main framework finishes all 0-D Tensor changes, PaddleCustomDevice needs to follow them, upgrade its kernel code to support 0-D Tensors, and add 0-D unit-test cases (see the sketch at the end of this section).

Development plan:

  • The main framework will finish all 0-D changes by the end of April and will then provide the PR list of those changes to PaddleCustomDevice for the corresponding modifications.
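A sketch of the kind of 0-D case the new unit tests need to cover (shape [] rather than [1]); the exact operators and cases should follow the PR list provided by the main framework:

import paddle

x = paddle.rand([])                      # 0-D tensor
y = paddle.nn.functional.relu(x)
assert x.shape == [] and y.shape == []   # both input and output stay 0-D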

[NPU][ResNet50] InferMeta fails after Storage Format conversion is supported

After storage-dims conversion, a tensor with origin_format: NCHW and origin_dims: [6] is converted to storage_format: NC1HWC0 with storage_dims: [1, 1, 1, 1, 16]. The NPU memory this tensor actually needs is just 16 elements, so after calling the DenseTensor's ResizeAndAllocate, the DDim stored by the framework also becomes [1, 1, 1, 1, 16]. The subsequent dims check in the BN op's BatchNormInferMeta then fails (the check requires the BN mean input tensor to have dim.size == 1, but after the format conversion dims.size is 5, so the check fails).

The output dims obtained from InferMeta are also wrong:

Input: x: format: NCHW, dims: [4, 1, 24, 24, 16], origin_format: 0, origin_dims: [4, 6, 24, 24], storage_format: 3, storage_dims: [4, 1, 24, 24, 16]
Input: running_mean: format: NCHW, dims: [1, 1, 1, 1, 16], origin_format: 0, origin_dims: [6], storage_format: 3, storage_dims: [1, 1, 1, 1, 16]
Input: running_var: format: NCHW, dims: [1, 1, 1, 1, 16], origin_format: 0, origin_dims: [6], storage_format: 3, storage_dims: [1, 1, 1, 1, 16]
Input: scale: format: NCHW, dims: [1, 1, 1, 1, 16], origin_format: 0, origin_dims: [6], storage_format: 3, storage_dims: [1, 1, 1, 1, 16]
Input: bias: format: NCHW, dims: [1, 1, 1, 1, 16], origin_format: 0, origin_dims: [6], storage_format: 3, storage_dims: [1, 1, 1, 1, 16]
Output: y: format: NCHW, dims: [4, 1, 24, 24, 16] ==> Not Initialized.
Output: mean_out: format: NCHW, dims: [1] ==> Not Initialized. # this should be [C] = [6], but C was read as 1 from Input: x, so it is wrong
Output: variance_out: format: NCHW, dims: [1] ==> Not Initialized.
Output: saved_mean: format: NCHW, dims: [1] ==> Not Initialized.
Output: saved_variance: format: NCHW, dims: [1] ==> Not Initialized.

PaddleNLP ChatGLM fails to run in the NPU environment

It seems some NPU kernels do not support fp16 inference yet.

import pprint
import time

import paddlenlp
from paddlenlp.transformers import ChatGLMForConditionalGeneration, ChatGLMTokenizer, ChatGLMConfig

llm = 'THUDM/chatglm-6b'
config = ChatGLMConfig.from_pretrained(llm)
model = ChatGLMForConditionalGeneration.from_pretrained(llm, 
                            load_state_as_np=True, dtype="float16", 
                            config=config)
model_tokenizer = paddlenlp.transformers.AutoTokenizer.from_pretrained(llm)

text_input = '你好'

inputs = model_tokenizer(text_input, return_tensors="pd",
            add_special_tokens=True,
            padding="max_length",
            max_length=32,
            truncation=True,
            truncation_side="left")

t_start = time.time()
output1 = model.generate(max_length=64, decode_strategy='sampling', top_k=1, 
                        bos_token_id=model_tokenizer.bos_token_id, 
                        eos_token_id=model_tokenizer.end_token_id,
                        pad_token_id=model_tokenizer.pad_token_id,
                        **inputs)

pprint.pprint(model_tokenizer.batch_decode(output1[0].tolist()))
print(time.time() - t_start)
[screenshot of the error]

Compatibility issue between the NPU plugin and sklearn

With paddle and paddle_custom_device installed following the tutorial, importing paddle first and then sklearn raises an error:

>>> import paddle
I0129 11:20:21.808104 11223 init.cc:266] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle-plugins
I0129 11:20:21.808182 11223 init.cc:150] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle-plugins]
I0129 11:20:26.164170 11223 custom_device.cc:1040] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle-plugins/libpaddle-custom-npu.so
I0129 11:20:26.169677 11223 custom_kernel.cc:76] Successed in loading 296 custom kernel(s) from loaded lib(s), will be used like native ones.
I0129 11:20:26.169983 11223 init.cc:162] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle-plugins]
I0129 11:20:26.170037 11223 init.cc:272] CustomDevice: npu, visible devices count: 8
>>> import sklearn
Traceback (most recent call last):
  File "/opt/py37env/lib/python3.7/site-packages/sklearn/__check_build/__init__.py", line 48, in <module>
    from ._check_build import check_build  # noqa
ImportError: /opt/py37env/lib/python3.7/site-packages/sklearn/__check_build/../../scikit_learn.libs/libgomp-d22c30c5.so.1.0.0: cannot allocate memory in static TLS block

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/py37env/lib/python3.7/site-packages/sklearn/__init__.py", line 81, in <module>
    from . import __check_build  # noqa: F401
  File "/opt/py37env/lib/python3.7/site-packages/sklearn/__check_build/__init__.py", line 50, in <module>
    raise_build_error(e)
  File "/opt/py37env/lib/python3.7/site-packages/sklearn/__check_build/__init__.py", line 43, in raise_build_error
    % (e, local_dir, "".join(dir_content).strip(), msg)
ImportError: /opt/py37env/lib/python3.7/site-packages/sklearn/__check_build/../../scikit_learn.libs/libgomp-d22c30c5.so.1.0.0: cannot allocate memory in static TLS block
___________________________________________________________________________
Contents of /opt/py37env/lib/python3.7/site-packages/sklearn/__check_build:
setup.py                  __pycache__               _check_build.cpython-37m-aarch64-linux-gnu.so
__init__.py
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.

Conversely, importing sklearn first works fine:

>>> import sklearn
>>> import paddle
I0129 11:20:43.997946 11705 init.cc:266] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle-plugins
I0129 11:20:43.998006 11705 init.cc:150] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle-plugins]
I0129 11:20:48.402534 11705 custom_device.cc:1040] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle-plugins/libpaddle-custom-npu.so
I0129 11:20:48.407981 11705 custom_kernel.cc:76] Successed in loading 296 custom kernel(s) from loaded lib(s), will be used like native ones.
I0129 11:20:48.408267 11705 init.cc:162] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle-plugins]
I0129 11:20:48.408321 11705 init.cc:272] CustomDevice: npu, visible devices count: 8
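A possible workaround sketch (an assumption of this note, not verified on this environment): "cannot allocate memory in static TLS block" is the classic symptom of a library that uses static TLS (libgomp here) being dlopen'ed too late, so preloading it before anything else is loaded usually removes the import-order requirement.

import os
import sys

# Re-exec the interpreter with libgomp preloaded so the paddle/sklearn import order
# no longer matters; the path below is the one reported in the error message.
if "LD_PRELOAD" not in os.environ:
    os.environ["LD_PRELOAD"] = "/opt/py37env/lib/python3.7/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0"
    os.execv(sys.executable, [sys.executable] + sys.argv)

import paddle
import sklearn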

NPU backward computation is inconsistent with CPU; test code below

import paddle
import paddle.nn as nn
import pdb
from copy import deepcopy
from paddle import ParamAttr
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
import numpy as np
import math

paddle.seed(123)
paddle.set_device('npu:4')

def cosine_similarity(x, y):
    x = x.numpy()
    y = y.numpy()
    x1 = x.flatten().astype(np.float64)
    y1 = y.flatten().astype(np.float64)
    dot = np.dot(x1, y1)
    lx = np.linalg.norm(x1)
    ly = np.linalg.norm(y1)
    cos = dot / (lx * ly)
    return cos


def get_bias_attr(k):
    stdv = 1.0 / math.sqrt(k * 1.0)
    initializer = paddle.nn.initializer.Uniform(-stdv, stdv)
    bias_attr = ParamAttr(initializer=initializer)
    return bias_attr

# pdb.set_trace()

# data [16, 64, 160, 160]
xn = paddle.rand((16, 64, 160, 160))
xn.stop_gradient=False

xc = xn.cpu()
xc.stop_gradient=False

zn_grad = paddle.rand((16, 64, 320, 320))
zc_grad = zn_grad.cpu()


# model
conv_n = nn.Conv2DTranspose(
                 in_channels=64,
                 out_channels=64,
                 kernel_size=2,
                 stride=2,
                 weight_attr=ParamAttr(
                     initializer=paddle.nn.initializer.KaimingUniform()),
                 bias_attr=get_bias_attr(64)
                 )
bn_n = nn.BatchNorm(
                 num_channels=64,
                 param_attr=ParamAttr(
                     initializer=paddle.nn.initializer.Constant(value=1.0)),
                 bias_attr=ParamAttr(
                     initializer=paddle.nn.initializer.Constant(value=1e-4)),
                 act="relu")

conv_c = deepcopy(conv_n).to('cpu')
bn_c = deepcopy(bn_n).to('cpu')

# forward
yn = conv_n(xn)
yc = conv_c(xc)

zn = bn_n(yn)
zc = bn_c(yc)

print("[acc_forward]conv: ", cosine_similarity(yn, yc))
print("[acc_forward]bn: ", cosine_similarity(zn, zc))

breakpoint()
# pdb.set_trace()
# backward
zn.backward(zn_grad)
zc.backward(zc_grad)

print("[acc_backward]conv: ", cosine_similarity(conv_n.weight.grad, conv_c.weight.grad))
print("[acc_backward]bn: ", cosine_similarity(bn_n.weight.grad, bn_c.weight.grad))

Registering kernels with PD_BUILD_PHI_KERNEL fails

Two questions:

  1. What exactly is the difference between PD_BUILD_PHI_KERNEL and PD_REGISTER_PLUGIN_KERNEL?
  2. I built the custom_cpu code, which registers kernels with PD_BUILD_PHI_KERNEL; at runtime the load log shows No custom kernel info found in loaded lib(s).

[screenshot]

[NPU] Paddle training problem on a Huawei Ascend server: OSError: (External) ACL error, the error code is : 500002

System environment: EulerOS 2.0 (SP8), NPU 910APro
Version: Paddle 2.4, PaddleOCR 2.6. Related components: npu_op_runner.cc
Command: export FLAGS_selected_npus=3, then python3 tools/train.py -c ./configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml -o Global.use_npu=true Global.use_gpu=false
Complete error message:
[screenshots of the error]

Paddle 2.4 was built successfully on the Huawei NPU, and PaddleDetection 2.4 can run YOLOv3 object-detection training, but PaddleOCR hits the above error when training the text-detection model ch_ppocr_server_v2.0_det. Before the error, the model has already been loaded onto the card and occupies memory.
The Paddle installation followed https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/hardware_support/npu_docs/paddle_install_cn.html
The training steps followed https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/hardware_support/npu_docs/train_example_cn.html

Verified: training runs normally on CPU, while training on NPU raises the above error.
Screenshot of training on CPU:
[screenshot]

Screenshot of training on NPU:
[screenshot]

Both Paddle 2.4.0 and 2.4.1 have been tried; both hit the above error.

The error recorded in the Ascend training log is:
[screenshot]

The kernel with key (npu, Undefined(AnyLayout), float16) of kernel `silu` is not registered and fail to fallback to CPU one

Training in an Ascend 910 NPU environment, the following error appears after the data is loaded:

[06/07 17:19:37] ppdet.data.source.coco INFO: Load [100625 samples valid, 0 samples invalid] in file /cache/train.json.
Traceback (most recent call last):
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/train.py", line 188, in
main()
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/train.py", line 184, in main
run(FLAGS, cfg)
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/train.py", line 137, in run
trainer.train(FLAGS.eval)
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/engine/trainer.py", line 414, in train
outputs = model(data)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1254, in call
return self.forward(*inputs, **kwargs)
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
out = self.get_loss()
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/modeling/architectures/yolov7.py", line 95, in get_loss
return self._forward()
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/modeling/architectures/yolov7.py", line 73, in _forward
body_feats = self.backbone(self.inputs)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1254, in call
return self.forward(*inputs, **kwargs)
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/modeling/backbones/yolov7_elannet.py", line 592, in forward
x = self.stem(x)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1254, in call
return self.forward(*inputs, **kwargs)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/container.py", line 606, in forward
input = layer(input)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1254, in call
return self.forward(*inputs, **kwargs)
File "/home/ma-user/work/01.paddle_npu_20230603/01.PaddleYOLO-release-2.5-init/ppdet/modeling/backbones/csp_darknet.py", line 87, in forward
y = self.act(x)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1254, in call
return self.forward(*inputs, **kwargs)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/layer/activation.py", line 1163, in forward
return F.silu(x, self._name)
File "/home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/paddle/nn/functional/activation.py", line 985, in silu
return _C_ops.silu(x)
RuntimeError: (NotFound) The kernel with key (npu, Undefined(AnyLayout), float16) of kernel silu is not registered and fail to fallback to CPU one. Selected wrong Backend npu. Paddle support following Backends: CPU.
[Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:259)
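A minimal check that isolates the failing kernel/dtype combination outside the full model (a sketch; it assumes the plugin registers the device as "npu"):

import paddle

paddle.set_device('npu')
x = paddle.ones([4], dtype='float16')
y = paddle.nn.functional.silu(x)   # raises the same NotFound error while the fp16 silu kernel is missing
print(y)

Until the fp16 silu kernel is added to the plugin, running the model in float32 is a way around the error.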

[NPU] PaddleDetection training occupies NPU HBM memory but does not use the NPU cores; IPS is extremely low

[screenshot]

The compiled paddle-aarch64 + paddle-npu packages installed successfully, and the example NPU training runs,
but when running PaddleDetection v2.6.0 with python -u tools/train.py -c configs/yolov3/yolov3_darknet5_270e_roadsign.yml,
with the config file modified as follows:
runtime.yml
use_gpu: false
use_xpu: false
use_mlu: false
use_npu: true
log_iter: 1
The symptom: the NPU's HBM memory is occupied, but the NPU cores are not used; IPS is extremely low, only a few hundredths of an image per second.

The official Ascend NPU examples support YOLO, and Ascend Torch can run YOLO on the NPU. Asking for help: what could cause the NPU cores not to be used, and what needs to change so that training actually uses the NPU cores?

[Paddle] Kernel output datatype changes in the main framework

Related issue: PaddlePaddle/Paddle#51292

Cause: some PHI kernels currently produce output tensors with the wrong data type. By default an output tensor's dtype follows the dtype the kernel is registered with; for example, an FP16 add kernel produces FP16 outputs. But 90+ operators are special cases, e.g. the top_k kernel always outputs int64_t indices regardless of the kernel dtype, so the output dtype differs from the kernel type and must be specified explicitly. See PR PaddlePaddle/Paddle#51233 for the code change.
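A quick dynamic-graph illustration of the special case described above (a sketch; the static-graph failure itself is not reproduced here):

import paddle

values, indices = paddle.topk(paddle.rand([8]), k=3)
print(values.dtype, indices.dtype)   # paddle.float32 paddle.int64 -- indices are int64 regardless of the kernel dtype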

Impact: some static-graph code fails to execute; dynamic graph is not affected for now. Since the major suites all run in dynamic-graph mode, the impact is manageable.

Fix plan

  1. The main framework will soon fix its internal CPU, GPU and XPU kernel code, but the PaddleCustomDevice code must be modified separately.
  2. The current strategy is to keep tracking kernel-code changes in the main framework; once the 90+ operators are fixed there, collect the PR list of those changes and then update the NPU and MLU code in one pass.

==== Update on 4/17 =====

Latest status
The external-developer issue is done: PaddlePaddle/Paddle#51292
For the PR list see https://ku.baidu-int.com/d/9dfab610c5bc49 (operators whose output markings still need to be completed)

How to test
Run the unit tests with FLAGS_new_executor_static_build=1; operators whose output markings have not been completed will make CI fail once this flag is enabled.

ScatterUpdate operator produces wrong output

input_tensor = [0, 0, 0, 0, 0, 0, 0, 0]
indices = [[1], [3], [4], [7]] or indices = [1, 3, 4, 7]
updates = [9, 10, 11, 12]

After execution, output = [0, 0, 0, 0, 0, 0, 0, 0] and input_tensor = [0, 0, 0, 0, 0, 0, 0, 0], which is not as expected; the result should be [0, 9, 0, 10, 11, 0, 0, 12].
Is something wrong with my input data, or is this a bug?
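For reference, a minimal sketch of the expected semantics through the Python API, using the data from this issue (it assumes the kernel under test is the one behind paddle.scatter with overwrite=True):

import paddle

x = paddle.zeros([8])
index = paddle.to_tensor([1, 3, 4, 7], dtype='int64')
updates = paddle.to_tensor([9., 10., 11., 12.])
out = paddle.scatter(x, index, updates, overwrite=True)
print(out.numpy())   # expected: [ 0.  9.  0. 10. 11.  0.  0. 12.]

Note that paddle.scatter is out-of-place, so x itself is not expected to change; only the all-zero out would indicate a kernel problem. The in-place variant x.scatter_(index, updates) exists if updating input_tensor directly is the intent.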

Ascend build error: symbol BIO_dgram_sctp_wait_for_dry version OPENSSL_1_1_0 not defined in file libcrypto.so.1.1 with link time reference

Environment: EulerOS v2r8; aarch64; CANN 5.1; Ascend 910

  1. Not using Docker; installed https://paddle-device.bj.bcebos.com/develop/cpu/paddlepaddle-0.0.0-cp37-cp37m-linux_aarch64.whl in the environment
  2. Ran the build script
    bash tools/compile.sh

Error:

+++ dirname tools/compile.sh
++ cd tools/../
++ pwd

  • SOURCE_ROOT=/home/ma-user/work/PaddleCustomDevice/backends/npu
  • mkdir -p /home/ma-user/work/PaddleCustomDevice/backends/npu/build
  • cd /home/ma-user/work/PaddleCustomDevice/backends/npu/build
    ++ uname -i
  • arch=aarch64
  • '[' aarch64 == x86_64 ']'
  • WITH_MKLDNN=OFF
  • WITH_ARM=ON
  • cat
    ========================================
    Configuring cmake in build ...
    -DCMAKE_BUILD_TYPE=Release
    -DWITH_KERNELS=ON
    -DWITH_TESTING=ON
    -DWITH_MKLDNN=OFF
    -DWITH_ARM=ON
    -DON_INFER=OFF
    ========================================
  • set +e
  • cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_KERNELS=ON -DWITH_TESTING=ON -DWITH_MKLDNN=OFF -DWITH_ARM=ON -DON_INFER=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
    Error: Can not import paddle core while this file exists: /home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    Traceback (most recent call last):
    File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/core.py", line 269, in
    from . import libpaddle
    ImportError: /usr/lib64/libssl.so.1.1: symbol BIO_dgram_sctp_wait_for_dry version OPENSSL_1_1_0 not defined in file libcrypto.so.1.1 with link time reference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/init.py", line 27, in
from .framework import monkey_patch_variable
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/framework/init.py", line 17, in
from . import random # noqa: F401
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/framework/random.py", line 16, in
import paddle.fluid as fluid
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/init.py", line 36, in
from . import framework
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 33, in
from . import core
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/core.py", line 347, in
if not avx_supported() and libpaddle.is_compiled_with_avx():
NameError: name 'libpaddle' is not defined
CMake Error at cmake/paddle.cmake:28 (message):
NO Installed Paddle Found in
Call Stack (most recent call first):
CMakeLists.txt:7 (include)

NPU build error

+++ dirname tools/compile.sh
++ cd tools/../
++ pwd

  • SOURCE_ROOT=/workspace/PaddleCustomDevice/backends/npu

  • mkdir -p /workspace/PaddleCustomDevice/backends/npu/build

  • cd /workspace/PaddleCustomDevice/backends/npu/build
    ++ uname -i

  • arch=aarch64

  • '[' aarch64 == x86_64 ']'

  • WITH_MKLDNN=OFF

  • WITH_ARM=ON

  • cat
    ========================================
    Configuring cmake in build ...
    -DCMAKE_BUILD_TYPE=Release
    -DWITH_KERNELS=ON
    -DWITH_TESTING=OFF
    -DWITH_MKLDNN=OFF
    -DWITH_ARM=ON
    -DON_INFER=OFF
    ========================================

  • set +e

  • cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_KERNELS=ON -DWITH_TESTING=OFF -DWITH_MKLDNN=OFF -DWITH_ARM=ON -DON_INFER=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
    -- Found PADDLE_CORE_LIB: /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    CMake Error at CMakeLists.txt:8 (include):
    include could not find load file:

    generic

-- FWKACLLIB_INC_DIR /usr/local/Ascend/ascend-toolkit/latest/fwkacllib/include
-- ASCEND_CL_DIR /usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64
-- Current Ascend Toolkit version is 6.0.0
-- Current Ascend Driver version is 22.0.4
-- CXX compiler: /opt/compiler/gcc-8.2/bin/c++, version: GNU 8.2.0
-- C compiler: /opt/compiler/gcc-8.2/bin/gcc, version: GNU 8.2.0
-- AR tools: /usr/bin/ar
CMake Error at cmake/third_party.cmake:25 (include):
include could not find load file:

external/gflags

Call Stack (most recent call first):
CMakeLists.txt:62 (include)

CMake Error at cmake/third_party.cmake:26 (include):
include could not find load file:

external/glog

Call Stack (most recent call first):
CMakeLists.txt:62 (include)

CMake Error at cmake/third_party.cmake:27 (include):
include could not find load file:

external/pybind11

Call Stack (most recent call first):
CMakeLists.txt:62 (include)

-- Configuring incomplete, errors occurred!
See also "/workspace/PaddleCustomDevice/backends/npu/build/CMakeFiles/CMakeOutput.log".

  • cmake_error=1
  • '[' 1 '!=' 0 ']'
  • echo 'CMake Error Found !!!'
    CMake Error Found !!!
  • exit 7

How to tell whether registration succeeded? Is it related to the paddle version?

Current problems

  1. Following the custom device integration guide at https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/custom_device_docs/index_cn.html, I have implemented the runtime interfaces based on the custom_cpu code (the other parts are still the original custom_cpu sources; does that matter?)
  2. After building and pip-installing successfully, paddle.device.get_all_custom_device_type() returns an empty list ([]). (How can I tell whether registration succeeded? See the check sketched below.)
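A minimal check of whether the plugin was loaded might look like the sketch below. It assumes the compiled .so was installed by pip into site-packages/paddle_custom_device, or that the directory containing it is exported through the CUSTOM_DEVICE_ROOT environment variable; running the script with GLOG_v=4 additionally prints the plugin discovery and kernel registration logs.

import os

# Hypothetical path: only needed when the plugin .so is NOT installed into
# site-packages/paddle_custom_device by pip.
# os.environ["CUSTOM_DEVICE_ROOT"] = "/path/to/dir/containing/plugin/so"

import paddle

# Should list the plugin's device type; an empty list usually means the .so
# was not found or failed to load.
print(paddle.device.get_all_custom_device_type())
print(paddle.device.get_all_device_type())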

Environment

  1. Phytium ARM CPU. Building paddle from source on the develop branch failed, so the current environment uses a release build of paddle (does using a non-develop build affect the custom device plugin?)

[NPU] Backward accuracy issue in a model scenario

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import paddle
from paddle import ParamAttr
import paddle.nn as nn
import paddle.nn.functional as F

from paddle.vision.ops import DeformConv2D
from paddle.regularizer import L2Decay
from paddle.nn.initializer import Normal, Constant, XavierUniform


import os
from copy import deepcopy
import numpy as np
paddle.seed(123)
paddle.set_device('npu:4')

def cosine_similarity(x, y):
    #x = x.numpy()
    #y = y.numpy()
    x1 = x.flatten().astype(np.float64)
    y1 = y.flatten().astype(np.float64)
    dot = np.dot(x1, y1)
    lx = np.linalg.norm(x1)
    ly = np.linalg.norm(y1)
    cos = dot / (lx * ly)
    return cos
    

class ConvBNLayer(nn.Layer):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride=1,
                 groups=1,
                 dcn_groups=1,
                 is_vd_mode=False,
                 act=None,
                 is_dcn=False):
        super(ConvBNLayer, self).__init__()

        self.is_vd_mode = is_vd_mode
        self._pool2d_avg = nn.AvgPool2D(
            kernel_size=2, stride=2, padding=0, ceil_mode=True)
        if not is_dcn:
            self._conv = nn.Conv2D(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=kernel_size,
                stride=stride,
                padding=(kernel_size - 1) // 2,
                groups=groups,
                bias_attr=False)
        else:
            self._conv = DeformableConvV2(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=kernel_size,
                stride=stride,
                padding=(kernel_size - 1) // 2,
                groups=dcn_groups,  #groups,
                bias_attr=False)
        self._batch_norm = nn.BatchNorm(out_channels, act=act)

    def forward(self, inputs):
        print(f"forward:{inputs.shape}")
        if self.is_vd_mode:
            inputs = self._pool2d_avg(inputs)
        print(f"forward:{inputs.shape}")
        y = self._conv(inputs)
        y = self._batch_norm(y)
        return y


# data
xn = paddle.rand((16, 512, 20, 20))
xn.stop_gradient=False

xc = xn.cpu()
xc.stop_gradient=False

grad_n = paddle.rand((16, 2048, 10, 10))
grad_c = grad_n.cpu()

# model
model_n = ConvBNLayer(
    in_channels=512,
    out_channels=2048,
    kernel_size=1,
    stride=1,
    groups=1,
    dcn_groups=1,
    is_vd_mode=True,
    act=None,
    is_dcn=False,
)

model_c = deepcopy(model_n).to('cpu')

print(model_c)
# forward
yn1 = model_n._pool2d_avg(xn)
yn2 = model_n._conv(yn1)
yn3 = model_n._batch_norm(yn2)

yc1 = model_c._pool2d_avg(xc)
yc2 = model_c._conv(yc1)
yc3 = model_c._batch_norm(yc2)


yn1.retain_grads()
yn2.retain_grads()
yn3.retain_grads()

yc1.retain_grads()
yc2.retain_grads()
yc3.retain_grads()


print("[acc_forward]y1=", cosine_similarity(yn1, yc1))
print("[acc_forward]y2=", cosine_similarity(yn2, yc2))
print("[acc_forward]y3=", cosine_similarity(yn3, yc3))

# backward
case = 1

if case == 1:
    yn3.backward(grad_n)
    yc3.backward(grad_c)
    
if case == 2:
    yn3.backward()
    yc3.backward()
    
if case == 3:
    yn3.sum().backward()
    yc3.sum().backward()    

print("[acc_backward]x.grad=", cosine_similarity(xn.grad, xc.grad))
print("[acc_backward]y1.grad=", cosine_similarity(yn1.grad, yc1.grad))
print("[acc_backward]y2.grad=", cosine_similarity(yn2.grad, yc2.grad))
print("[acc_backward]y3.grad=", cosine_similarity(yn3.grad, yc3.grad))


print("[acc_backward]model_c._conv.weight.grad=", cosine_similarity(model_c._conv.weight.grad, model_n._conv.weight.grad))
print("[acc_backward]model_c._batch_norm.weight.grad=", cosine_similarity(model_c._batch_norm.weight.grad, model_n._batch_norm.weight.grad))

The test script is shown above; three cases are exercised in the backward pass:

case == 1
The gradients computed on the NPU match those computed on the CPU.

case == 2
The gradients computed on the NPU do not match the CPU results.
After adding the environment variable "export CUSTOM_DEVICE_BLACK_LIST=batch_norm,batch_norm_grad,conv2d,conv2d_grad,pool2d,pool2d_grad", the NPU and CPU gradients match again (a script-level form of this workaround is sketched after the case descriptions).

case == 3
This corresponds to the real usage in a model, back-propagating after computing a loss; the NPU and CPU gradients do not match.
Adding the same environment variable again makes them match.
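The same black list can also be set from inside the test script; a sketch, assuming the variable is read before any custom-device kernel is dispatched, so it is exported before paddle is imported:

import os

# Force these ops onto the CPU kernels instead of the custom NPU kernels
# (same effect as the export shown above).
os.environ["CUSTOM_DEVICE_BLACK_LIST"] = (
    "batch_norm,batch_norm_grad,conv2d,conv2d_grad,pool2d,pool2d_grad"
)

import paddle
paddle.set_device("npu:4")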

Questions:

  1. What is the difference between the backward computation in case=1 and case=2?
  2. case=3 is the actual loss.backward() scenario used in the model, and the accuracy problem there needs to be resolved.

Compilation error

+++ dirname tools/compile.sh
++ cd tools/../
++ pwd

  • SOURCE_ROOT=/home/PaddleCustomDevice/backends/npu
  • mkdir -p /home/PaddleCustomDevice/backends/npu/build
  • cd /home/PaddleCustomDevice/backends/npu/build
    ++ uname -i
  • arch=x86_64
  • '[' x86_64 == x86_64 ']'
  • WITH_MKLDNN=ON
  • WITH_ARM=OFF
  • cat
    ========================================
    Configuring cmake in build ...
    -DCMAKE_BUILD_TYPE=Release
    -DWITH_KERNELS=ON
    -DWITH_TESTING=OFF
    -DWITH_MKLDNN=ON
    -DWITH_ARM=OFF
    -DON_INFER=OFF
    ========================================
  • set +e
  • cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_KERNELS=ON -DWITH_TESTING=OFF -DWITH_MKLDNN=ON -DWITH_ARM=OFF -DON_INFER=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
    -- Found PADDLE_CORE_LIB: /opt/py37env/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    CMake Error at cmake/generic.cmake:1:
    Parse error. Expected a command name, got unquoted argument with text
    "../../../Paddle/cmake/generic.cmake".
    Call Stack (most recent call first):
    CMakeLists.txt:8 (include)

-- Configuring incomplete, errors occurred!

  • cmake_error=1
  • '[' 1 '!=' 0 ']'
  • echo 'CMake Error Found !!!'
    CMake Error Found !!!
  • exit 7

Error when building the NPU backend on a Huawei Ascend server

Reference: https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/npu/README_cn.md
Docker cannot be used because of the server environment.
Errors encountered:
(screenshot)
Paddle was not found after installation; resolved by specifying the Paddle path in the build file.
Header files could not be found via relative paths; resolved by switching to absolute paths, namely /home/ma-user/work/PaddleCustomDevice/Paddle/paddle/phi/api/profiler/trace_event.h and /home/ma-user/work/PaddleCustomDevice/Paddle/paddle/phi/api/profiler/trace_event_collector.h
(screenshot)
The full error output is as follows:
(MindSpore) [ma-user npu]$bash tools/compile.sh
+++ dirname tools/compile.sh
++ cd tools/../
++ pwd

  • SOURCE_ROOT=/home/ma-user/work/PaddleCustomDevice/backends/npu
  • mkdir -p /home/ma-user/work/PaddleCustomDevice/backends/npu/build
  • cd /home/ma-user/work/PaddleCustomDevice/backends/npu/build
    ++ uname -i
  • arch=aarch64
  • '[' aarch64 == x86_64 ']'
  • WITH_MKLDNN=OFF
  • WITH_ARM=ON
  • cat
    ========================================
    Configuring cmake in build ...
    -DCMAKE_BUILD_TYPE=Release
    -DWITH_KERNELS=ON
    -DWITH_TESTING=OFF
    -DWITH_MKLDNN=OFF
    -DWITH_ARM=ON
    -DON_INFER=OFF
    ========================================
  • set +e
  • cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_KERNELS=ON -DWITH_TESTING=OFF -DWITH_MKLDNN=OFF -DWITH_ARM=ON -DON_INFER=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
    Error: Can not import paddle core while this file exists: /home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
    Traceback (most recent call last):
    File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/core.py", line 269, in
    from . import libpaddle
    ImportError: /lib64/libssl.so.1.1: symbol BIO_dgram_sctp_wait_for_dry version OPENSSL_1_1_0 not defined in file libcrypto.so.1.1 with link time reference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/init.py", line 27, in
from .framework import monkey_patch_variable
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/framework/init.py", line 17, in
from . import random # noqa: F401
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/framework/random.py", line 16, in
import paddle.fluid as fluid
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/init.py", line 36, in
from . import framework
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 34, in
from . import core
File "/home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/core.py", line 350, in
if not avx_supported() and libpaddle.is_compiled_with_avx():
NameError: name 'libpaddle' is not defined
-- Found PADDLE_CORE_LIB: /home/ma-user/.local/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
-- FWKACLLIB_INC_DIR /usr/local/Ascend/ascend-toolkit/latest/fwkacllib/include
-- ASCEND_CL_DIR /usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64
-- Current Ascend Toolkit version is 6.0.1
-- Current Ascend Driver version is 22.0.0
-- CXX compiler: /usr/bin/c++, version: GNU 7.3.0
-- C compiler: /usr/bin/cc, version: GNU 7.3.0
-- AR tools: /usr/bin/ar
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ma-user/work/PaddleCustomDevice/backends/npu/build

  • cmake_error=0
  • '[' 0 '!=' 0 ']'
  • '[' aarch64 == x86_64 ']'
  • make TARGET=ARMV8 -j8
    [ 1%] Built target ascend_cl
    [ 6%] Built target extern_pybind
    [ 11%] Built target extern_gflags
    [ 16%] Built target extern_glog
    [ 16%] Built target third_party
    Scanning dependencies of target paddle-custom-npu
    [ 17%] Building CXX object CMakeFiles/paddle-custom-npu.dir/runtime/runtime.cc.o
    [ 18%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/accuracy_kernel.cc.o
    [ 18%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/activation_kernel.cc.o
    [ 21%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/adagrad_kernel.cc.o
    [ 21%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/adam_kernel.cc.o
    [ 22%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/amp/check_finite_and_unscale_kernel.cc.o
    [ 22%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/add_n_kernel.cc.o
    [ 22%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/abs_kernel.cc.o
    [ 22%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/amp/update_loss_scaling_kernel.cc.o
    [ 23%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/arange_kernel.cc.o
    [ 24%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/arg_min_max_kernel.cc.o
    [ 24%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/argsort_grad_kernel.cc.o
    [ 25%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/argsort_kernel.cc.o
    [ 26%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/assign_kernel.cc.o
    [ 26%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/batch_norm_kernel.cc.o
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc: In function 'void custom_kernel::AssignValueKernel(const Context&, const std::vector<int>&, phi::DataType, const std::vector<paddle::experimental::ScalarBase<phi::DenseTensor> >&, phi::DenseTensor*)':
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:94:30: error: 'CppTypeToDataType' is not a member of 'phi'
    auto template_dtype = phi::CppTypeToDataType<T>::Type();
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:94:30: note: suggested alternative: 'ProtoDataType'
    auto template_dtype = phi::CppTypeToDataType<T>::Type();
    ^~~~~~~~~~~~~~~~~
    ProtoDataType
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:94:49: error: expected primary-expression before '>' token
    auto template_dtype = phi::CppTypeToDataType<T>::Type();
    ^
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:94:52: error: '::Type' has not been declared
    auto template_dtype = phi::CppTypeToDataType<T>::Type();
    ^~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:94:52: note: suggested alternative: 'dtype'
    auto template_dtype = phi::CppTypeToDataType<T>::Type();
    ^~~~
    dtype
    In file included from /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_enforce.h:21:0,
    from /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_funcs.h:19,
    from /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:15:
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:431:48: error: 'TYPE2' was not declared in this scope
    ::phi::details::CommonType1<TYPE1, TYPE2>;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:431:48: note: suggested alternative: 'TYPE1'
    ::phi::details::CommonType1<TYPE1, TYPE2>;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:431:57: error: template argument 2 is invalid
    ::phi::details::CommonType1<TYPE1, TYPE2>;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:433:57: error: type/value mismatch at argument 2 in template parameter list for 'template<class T1, class T2> using CommonType2 = typename std::add_lvalue_reference<typename std::add_const<typename phi::enforce::details::TypeConverter<T1, T2>::Type2>::type>::type'
    ::phi::details::CommonType2<TYPE1, TYPE2>;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:433:57: note: expected a type, got 'TYPE2'
    ::phi::details::CommonType2<TYPE1, TYPE2>;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:434:40: error: 'COMMON_TYPE1' does not name a type; did you mean 'INTMAX_TYPE'?
    bool __is_not_error = (static_cast<COMMON_TYPE1>(__val1))__CMP(
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:435:21: error: 'COMMON_TYPE2' does not name a type; did you mean 'INTMAX_TYPE'?
    static_cast<COMMON_TYPE2>(__val2));
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:440:48: error: type/value mismatch at argument 1 in template parameter list for 'template struct phi::enforce::details::CanToString'
    ::phi::details::CanToString<TYPE2>::kValue;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:440:48: note: expected a type, got 'TYPE2'
    ::phi::details::CanToString<TYPE2>::kValue;
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:448:15: error: the value of 'kCanToString' is not usable in a constant expression
    kCanToString>::Convert(#__VAL1, __val1),
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:448:31: error: the value of 'kCanToString' is not usable in a constant expression
    kCanToString>::Convert(#__VAL1, __val1),
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:448:31: note: in template argument for type 'bool'
    kCanToString>::Convert(#__VAL1, __val1),
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:450:15: error: the value of 'kCanToString' is not usable in a constant expression
    kCanToString>::Convert(#__VAL2, __val2));
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:450:31: error: the value of 'kCanToString' is not usable in a constant expression
    kCanToString>::Convert(#__VAL2, __val2));
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:450:31: note: in template argument for type 'bool'
    kCanToString>::Convert(#__VAL2, __val2));
    ^
    /home/ma-user/.local/lib/python3.7/site-packages/paddle/include/paddle/phi/core/enforce.h:457:3: note: in expansion of macro '__PADDLE_BINARY_COMPARE'
    __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, VA_ARGS)
    ^~~~~~~~~~~~~~~~~~~~~~~
    /home/ma-user/work/PaddleCustomDevice/backends/npu/kernels/assign_kernel.cc:95:3: note: in expansion of macro 'PADDLE_ENFORCE_EQ'
    PADDLE_ENFORCE_EQ(
    ^~~~~~~~~~~~~~~~~
    [ 27%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/bce_loss_kernel.cc.o
    [ 28%] Building CXX object CMakeFiles/paddle-custom-npu.dir/kernels/bitwise_kernel.cc.o
    make[2]: *** [CMakeFiles/paddle-custom-npu.dir/build.make:232: CMakeFiles/paddle-custom-npu.dir/kernels/assign_kernel.cc.o] Error 1
    make[2]: *** Waiting for unfinished jobs....
    make[1]: *** [CMakeFiles/Makefile2:75: CMakeFiles/paddle-custom-npu.dir/all] Error 2
    make: *** [Makefile:84: all] Error 2
  • make_error=2
  • '[' 2 '!=' 0 ']'
  • echo 'Make Error Found !!!'
    Make Error Found !!!
  • exit 7

submodule update leads to all UTs failing

@qili93 recently the paddle repo merged PaddlePaddle/Paddle#51033, which introduces many changes moving paddle.fluid.layers.utils --> paddle.utils

Short Term Actions

  • upgrade PaddleCustomDevice's Paddle submodule to a newer commit
  • link the needed .py files into PaddleCustomDevice's python/tests

Long Term Actions

  • move the paddle test base into the paddle wheel

Poor accuracy convergence in distributed training based on CustomDevice

  • Training vgg16 on a CustomDevice: single-card training converges normally, but single-node multi-card distributed training converges poorly.
  • Distributed training throughput is only half that of a single card.

The problem is shown in the screenshot below:
(screenshot: the left side is the single-card training log, the right side is the 4-card training log)

[FAQ] Custom PASS + custom OP for inference

Custom PASS:
PaddlePaddle/Paddle#35602
PaddlePaddle/Paddle#36095
Custom OP:
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/custom_op/index_cn.html#zidingyisuanzi

my_add_n.cc // compile this into the plugin so it ends up in the same .so file

// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <iostream>
#include <vector>

#include "paddle/extension.h"

std::vector<paddle::Tensor> MyAddNOp(const paddle::Tensor& x,
                                     const paddle::Tensor& y,
                                     const paddle::Tensor& z) {
  return {paddle::add(x, y)};
}

std::vector<std::vector<int64_t>> MyAddNOpInferShape(
    const std::vector<int64_t>& x_shape,
    const std::vector<int64_t>& y_shape,
    const std::vector<int64_t>& z_shape) {
  return {x_shape};
}

PD_BUILD_OP(my_add_n)
    .Inputs({"X", "Y", "Z"})
    .Outputs({"Out"})
    .SetKernelFn(PD_KERNEL(MyAddNOp))
    .SetInferShapeFn(PD_INFER_SHAPE(
        MyAddNOpInferShape));  // neccessary if the op has muti_inputs

run.py

import paddle
import numpy as np

paddle.utils.cpp_extension.extension_utils.load_op_meta_info_and_register_op('/opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so')

@paddle.incubate.passes.ir.RegisterPass
def generate_add_n():
    def pattern(x, y, z):
        return paddle.add(paddle.add(x, y), z)

    def replace(x, y, z):
        return paddle.incubate.passes.ir.PassDesc.OP.my_add_n(X=x, Y=y, Z=z)

    return pattern, replace

@paddle.jit.to_static(input_spec=[paddle.static.InputSpec([None, 32], 'float32', 'x'),  paddle.static.InputSpec([None, 32], 'float32', 'y'),  paddle.static.InputSpec([None, 32], 'float32', 'z')])
def func(x, y, z):
    return x + y + z

model_file = './saved_models/func'
paddle.jit.save(func, model_file)

# inference
config = paddle.inference.Config()
config.set_prog_file(model_file + '.pdmodel')
config.enable_memory_optim()
pass_builder = config.pass_builder()
pass_builder.append_pass('generate_add_n')
print(pass_builder.all_passes())
predictor = paddle.inference.create_predictor(config)

input_names = predictor.get_input_names()
for i, name in enumerate(input_names):
    input_tensor = predictor.get_input_handle(name)
    input_tensor.copy_from_cpu(np.random.randn(2, 32).astype('float32'))

predictor.run()
results = []
output_names = predictor.get_output_names()
for i, name in enumerate(output_names):
    output_tensor = predictor.get_output_handle(name)
    output_data = output_tensor.copy_to_cpu()
    results.append(output_data)
print(results)

GLOG_v=10 python run.py

I0517 18:23:54.884903 94348 operator.cc:750] Place(cpu) Op(my_add_n), inputs:{X[x:float[2, 32]({})(Place(cpu))], Y[y:float[2, 32]({})(Place(cpu))], Z[z:float[2, 32]({})(Place(cpu))]}, outputs:{Out[tmp_1:[0]({})()]}.
I0517 18:23:54.884943 94348 context_pool.cc:62] DeviceContextPool Get: Place(cpu)
I0517 18:23:54.884971 94348 operator.cc:2130] op type:my_add_n, expected_kernel_key:{data_type[RAW(runtime decided type)]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]}
I0517 18:23:54.884994 94348 context_pool.cc:62] DeviceContextPool Get: Place(cpu)
I0517 18:23:54.885025 94348 custom_operator.cc:424] Custom Operator: InferShape - get input ddim.
I0517 18:23:54.885046 94348 custom_operator.cc:505] Custom Operator: InferShape - calc output ddim.
I0517 18:23:54.885062 94348 custom_operator.cc:530] Custom Operator: InferShape - set output ddim: inplace_map.size() = 0, output_shapes.size() = 1
I0517 18:23:54.885083 94348 custom_operator.cc:1160] Custom Operator: run custom kernel func in lambda.
I0517 18:23:54.885099 94348 custom_operator.cc:64] Custom Operator: Start run KernelFunc.
I0517 18:23:54.885111 94348 custom_operator.cc:68] Custom Operator: input name - X
I0517 18:23:54.885135 94348 custom_operator.cc:68] Custom Operator: input name - Y
I0517 18:23:54.885149 94348 custom_operator.cc:68] Custom Operator: input name - Z
I0517 18:23:54.885154 94348 custom_operator.cc:185] Custom Operator: push outputs into CustomOpKernelContext.
I0517 18:23:54.885172 94348 custom_operator.cc:268] Custom Operator: Run ComputeFunc.
I0517 18:23:54.885187 94348 op_meta_info.cc:202] Custom opertor ConstructInplaceIndex no need to recompute.
I0517 18:23:54.885202 94348 op_meta_info.cc:245] Custom opertor update plain outputs map successfully.
I0517 18:23:54.885227 94348 api.cc:24106] add API kernel key: [CPU, NCHW, float32]
I0517 18:23:54.885249 94348 custom_device_op_list.cc:46] Custom Device Black List: 
I0517 18:23:54.885263 94348 api.cc:24113] add kernel: {"input":["CPU, NCHW, float32","CPU, NCHW, float32"],"output":["CPU, NCHW, float32"],"attribute":[]}
I0517 18:23:54.885291 94348 context_pool.cc:62] DeviceContextPool Get: Place(cpu)
I0517 18:23:54.885329 94348 dense_tensor.cc:139] Allocate data with bytes: 256
I0517 18:23:54.885344 94348 stats.h:84] Update peak_value, after update, peak_value = 1024 , current value = 1024
I0517 18:23:54.885383 94348 operator.cc:797] Place(cpu) Op(my_add_n), inputs:{X[x:float[2, 32]({})(Place(cpu))], Y[y:float[2, 32]({})(Place(cpu))], Z[z:float[2, 32]({})(Place(cpu))]}, outputs:{Out[tmp_1:float[2, 32]({})(Place(cpu))]}.
I0517 18:23:54.885411 94348 helper.h:464] after run : [cpu current allocated memory: 0.000976562MB], [cpu current reserved memory: 0MB], [cpu peak allocated memory: 0.000976562MB], [cpu peak reserved memory: 0MB]
I0517 18:23:54.885437 94348 reset_tensor_array.cc:45] Collect 0 arrays
[array([[ 0.58247435,  0.826475  ,  0.6871278 ,  0.4126696 , -0.2559116 ,
         0.65742874,  2.1384077 ,  0.24653143, -0.29847062, -2.2460418 ,
        -1.1594441 , -1.5321505 ,  3.0779753 ,  1.3047652 ,  5.319272  ,
        -3.2988782 ,  2.2765095 ,  0.8565507 , -3.34338   , -1.906771  ,
        -1.3918409 , -0.9324397 , -0.14787453, -0.4925239 , -0.24697244,
        -0.29773337, -2.2361014 , -2.4385114 ,  1.9175045 , -1.7525816 ,
        -2.0501115 ,  2.8168874 ],
       [-0.42592376, -1.5766194 ,  3.0644276 , -1.9179165 ,  2.8835368 ,
         0.28963447,  0.4251368 ,  1.146347  , -0.45447612, -0.9540442 ,
         1.8834621 ,  0.5726208 , -1.1495211 ,  2.1192973 , -0.1619632 ,
         1.1780676 , -3.423511  ,  0.31345803,  2.212157  ,  2.284046  ,
        -1.8597114 , -0.988636  ,  2.5586586 ,  0.6752815 , -0.8432386 ,
        -1.5520113 , -0.93274736,  0.7499885 , -2.2453508 ,  1.2411486 ,
         0.89078593,  0.02444351]], dtype=float32)]
I0517 18:23:54.887071 94348 imperative.cc:2204] Tracer(0x3b7d92b0) set expected place Place(npu:0)
I0517 18:23:54.887138 94348 mmap_allocator.cc:321] PID: 94348, MemoryMapFdSet: set size - 0
I0517 18:23:54.889010 94348 mmap_allocator.cc:321] PID: 94348, MemoryMapFdSet: set size - 0
I0517 18:23:55.128073 94348 mmap_allocator.cc:321] PID: 94348, MemoryMapFdSet: set size - 0

NPU DEMO: #578

[FAQ] Make every DeviceContext.template Alloc<T>(...) call the plugin-registered DeviceAllocate function / disable device memory reuse

Problem

DeviceContext.template Alloc<T>(...) does not call the runtime API on every invocation.

Cause

paddle manages device memory internally.

Solutions

1. Set an environment variable

export FLAGS_allocator_strategy=naive_best_fit

This switches the memory allocation strategy. paddle's default strategy is auto_growth, which does not guarantee that every allocation goes through the runtime API.

2. Make the plugin-registered device_max_chunk_size function return 0. Allocations larger than this size bypass paddle's internally managed memory and are served directly by the runtime API, so returning 0 means the runtime API is always used.

Either approach disables device memory reuse.
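For reference, a sketch of option 1 applied from Python instead of the shell, assuming the flag is read when paddle initializes its allocators and must therefore be set before the first paddle import:

import os

# Option 1 from above: switch the allocator strategy away from the default
# auto_growth before paddle is imported, so the flag takes effect at startup.
os.environ["FLAGS_allocator_strategy"] = "naive_best_fit"

import paddle
paddle.set_device("npu")  # replace with your plugin's device type
x = paddle.ones([64, 64])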

Kernel build error: paddle/fluid/platform/flags.h not found

Problem

The kernel part of the build fails with:
In file included from ……python3.8/site-packages/paddle/include/paddle/phi/core/ddim.h:21,
from ……python3.8/site-packages/paddle/include/paddle/phi/core/tensor_meta.h:22,
from ……python3.8/site-packages/paddle/include/paddle/phi/core/compat/convert_utils.h:21,
from ……python3.8/site-packages/paddle/include/paddle/phi/core/kernel_factory.h:26,
from ……python3.8/site-packages/paddle/include/paddle/phi/core/custom_kernel.h:17,
from ……python3.8/site-packages/paddle/include/paddle/phi/core/kernel_registry.h:24,
from ……PaddleCustomDevice/backends/mlu/kernels/add_n_kernel.cc:21:
fatal error: paddle/fluid/platform/flags.h: No such file or directory
101 | #include "paddle/fluid/platform/flags.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/paddle-custom-mlu.dir/build.make:76: CMakeFiles/paddle-custom-mlu.dir/kernels/add_n_kernel.cc.o] Error 1
But the file does exist at that path... how can this be resolved?

Basic information

  • paddle is the latest develop build, compiled in an AArch64 environment.
  • The add_n_kernel.cc implementation contains only a single cout statement, and its only include is #include "paddle/phi/core/kernel_registry.h"
  • The registration code is
    PD_REGISTER_PLUGIN_KERNEL(add_n,
    CustomMLU,
    ALL_LAYOUT,
    custom_kernel::AddNKernel,
    float,
    phi::dtype::float16,
    double) {}

Submodule update failure

When running git submodule update --remote --init --recursive, Paddle's third-party dependency leveldb cannot be fetched.

[***@*** PaddleCustomDevice]$ git submodule update --remote --init --recursive
Submodule 'Paddle' (https://github.com/PaddlePaddle/Paddle.git) registered for path 'Paddle'
Cloning into 'Paddle'...
remote: Enumerating objects: 552541, done.
remote: Counting objects: 100% (887/887), done.
remote: Compressing objects: 100% (664/664), done.
remote: Total 552541 (delta 408), reused 406 (delta 221), pack-reused 551654
Receiving objects: 100% (552541/552541), 349.40 MiB | 335.00 KiB/s, done.
Resolving deltas: 100% (466820/466820), done.
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 24 (delta 22), reused 5 (delta 3), pack-reused 0
Unpacking objects: 100% (24/24), done.
From https://github.com/PaddlePaddle/Paddle
   a195ef3..cbeff5f  develop    -> origin/develop
Submodule path 'Paddle': checked out 'cbeff5fc580672ff0882ca6bd9a1341fe8bef1fc'
Submodule 'third_party/dlpack' (https://github.com/dmlc/dlpack.git) registered for path 'third_party/dlpack'
Submodule 'third_party/eigen3' (https://gitlab.com/libeigen/eigen.git) registered for path 'third_party/eigen3'
Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/gflags'
Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/glog'
Submodule 'third_party/gloo' (https://github.com/ziyoujiyi/gloo.git) registered for path 'third_party/gloo'
Submodule 'third_party/leveldb' (https://github.com/google/leveldb) registered for path 'third_party/leveldb'
Submodule 'third_party/protobuf' (https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf'
Submodule 'third_party/threadpool' (https://github.com/progschj/ThreadPool.git) registered for path 'third_party/threadpool'
Submodule 'third_party/utf8proc' (https://github.com/JuliaStrings/utf8proc.git) registered for path 'third_party/utf8proc'
Submodule 'third_party/warpctc' (https://github.com/baidu-research/warp-ctc.git) registered for path 'third_party/warpctc'
Submodule 'third_party/warprnnt' (https://github.com/PaddlePaddle/warp-transducer.git) registered for path 'third_party/warprnnt'
Submodule 'third_party/xxhash' (https://github.com/Cyan4973/xxHash.git) registered for path 'third_party/xxhash'
Submodule 'third_party/zlib' (https://github.com/madler/zlib.git) registered for path 'third_party/zlib'
Cloning into 'third_party/dlpack'...
remote: Enumerating objects: 462, done.
remote: Counting objects: 100% (99/99), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 462 (delta 74), reused 69 (delta 59), pack-reused 363
Receiving objects: 100% (462/462), 1.70 MiB | 191.00 KiB/s, done.
Resolving deltas: 100% (162/162), done.
Submodule path 'Paddle/third_party/dlpack': checked out '3ec04430e89a6834e5a1b99471f415fa939bf642'
Cloning into 'third_party/eigen3'...
remote: Enumerating objects: 119484, done.
remote: Counting objects: 100% (1163/1163), done.
remote: Compressing objects: 100% (375/375), done.
remote: Total 119484 (delta 812), reused 1117 (delta 787), pack-reused 118321
Receiving objects: 100% (119484/119484), 103.62 MiB | 10.13 MiB/s, done.
Resolving deltas: 100% (98647/98647), done.
Submodule path 'Paddle/third_party/eigen3': checked out '07e4604b1961a32bbe21841a1e97fc274b50c443'
Cloning into 'third_party/gflags'...
remote: Enumerating objects: 2458, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 2458 (delta 34), reused 52 (delta 25), pack-reused 2387
Receiving objects: 100% (2458/2458), 1.53 MiB | 1.12 MiB/s, done.
Resolving deltas: 100% (1436/1436), done.
Submodule path 'Paddle/third_party/gflags': checked out 'a738fdf9338412f83ab3f26f31ac11ed3f3ec4bd'
Cloning into 'third_party/glog'...
remote: Enumerating objects: 4005, done.
remote: Counting objects: 100% (267/267), done.
remote: Compressing objects: 100% (161/161), done.
remote: Total 4005 (delta 153), reused 191 (delta 95), pack-reused 3738
Receiving objects: 100% (4005/4005), 2.27 MiB | 237.00 KiB/s, done.
Resolving deltas: 100% (2755/2755), done.
Submodule path 'Paddle/third_party/glog': checked out '22491eb1236c8b5c1dcba2ed3a213c74ce699988'
Cloning into 'third_party/gloo'...
remote: Enumerating objects: 3613, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 3613 (delta 0), reused 1 (delta 0), pack-reused 3606
Receiving objects: 100% (3613/3613), 1.07 MiB | 694.00 KiB/s, done.
Resolving deltas: 100% (2762/2762), done.
Submodule path 'Paddle/third_party/gloo': checked out '9877014465775fac31df9297e5425e599031be76'
Cloning into 'third_party/leveldb'...
remote: Enumerating objects: 3525, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 3525 (delta 29), reused 29 (delta 15), pack-reused 3457
Receiving objects: 100% (3525/3525), 1.67 MiB | 173.00 KiB/s, done.
Resolving deltas: 100% (2460/2460), done.
fatal: Needed a single revision
Unable to find current origin/master revision in submodule path 'third_party/leveldb'
Failed to recurse into submodule path 'Paddle'
