
oneflow_convert's People

Contributors

alive1024, bbuf, cpflame, daquexian, doombeaker, flowingsun007, jackalcooper, ldpe2g, leaves-zwx, liujuncheng, lixiang007666, lycheel1, marigoold, mosout, qmpzzpmq, xiezipeng-ml, zhongshsh

oneflow_convert's Issues

Calling the convert_to_onnx_and_check API fails with "Error No Op registered for Range with domain_version of 10"

When calling convert_to_onnx_and_check:

convert_to_onnx_and_check(t5_graph,
                  external_data=False, 
                  opset=None, 
                  flow_weight_dir=None, 
                  onnx_model_path="./", 
                  dynamic_batch_size=False)

the call fails with:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./model.onnx failed:This is an invalid model. In Node, ("model.t5_model.encoder.layers.0.self_attention-arange-18", Range, "", -1) : ("start85": tensor(int64),"limit80": tensor(int64),"delta81": tensor(int64),) -> ("model.t5_model.encoder.layers.0.self_attention-arange-18/out_0",) , Error No Op registered for Range with domain_version of 10

After searching online (NVIDIA/TensorRT#1658), this looks like an ONNX opset version issue: the Range op was only added to ONNX in opset 11, so a model exported at opset 10 cannot contain it.
Changing the call to:

convert_to_onnx_and_check(t5_graph,
                  external_data=False, 
                  opset=11, 
                  flow_weight_dir=None, 
                  onnx_model_path="./", 
                  dynamic_batch_size=False)

a new error appears:

Traceback (most recent call last):
  File "libai/onnx_export/t5_to_onnx.py", line 57, in <module>
    convert_to_onnx_and_check(t5_graph,
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow_onnx-0.5.5-py3.8.egg/oneflow_onnx/oneflow2onnx/util.py", line 99, in convert_to_onnx_and_check
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow_onnx-0.5.5-py3.8.egg/oneflow_onnx/oneflow2onnx/util.py", line 29, in run_onnx
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./model.onnx failed:This is an invalid model. Type Error: Type 'tensor(int64)' of input parameter (model.t5_model.encoder.layers.0.self_attention-scalar_add-25/out_0) of operator (Sum) in node (model.t5_model.encoder.layers.0.self_attention-add_n-39) is invalid.
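This second failure matches the ONNX spec: the Sum operator is only defined for floating-point tensors, so mapping oneflow's add_n over int64 tensors onto Sum produces an invalid graph. A minimal sketch of one possible workaround, decomposing the n-ary integer add into a chain of binary Add nodes (Add does support int64); the helper and all tensor names are illustrative, not the converter's actual internals:

from onnx import helper

def add_n_as_add_chain(input_names, output_name):
    # Fold an n-ary add into binary Add nodes, since ONNX Sum
    # is defined for float types only.
    assert len(input_names) >= 2
    nodes = []
    acc = input_names[0]
    for i, name in enumerate(input_names[1:]):
        is_last = i == len(input_names) - 2
        out = output_name if is_last else "%s_acc%d" % (output_name, i)
        nodes.append(helper.make_node("Add", [acc, name], [out]))
        acc = out
    return nodes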

Problems with the 0.3.2 wheel package

  1. PReLU is not supported (sketch below).
  2. During codegen the randomly generated values can be large, so networks without BN (e.g. GoogLeNet) may blow up in precision (this is a guess: it runs fine locally but CI fails intermittently, so codegen is not in CI for now).
  3. Models produced by oneflow2onnx are deleted immediately after creation; this is already fixed on master. The 0.3.3 wheel will be released once the two problems above are also fixed.

The plan is to fix these in two separate PRs.
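For item 1, the ONNX side is straightforward, since ONNX has a native PRelu operator taking the input and the learned slope. A minimal sketch of the emitted node (tensor names are placeholders; the wiring into oneflow_onnx's converter registry is not shown):

from onnx import helper

# PRelu(x, slope): "slope" would be the learned PReLU weight exported
# as an initializer; "x" and "y" are placeholder tensor names.
prelu_node = helper.make_node("PRelu", inputs=["x", "slope"], outputs=["y"])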

Unsqueeze op dtype inference crashes when opset version = 14

This is probably a bug in ONNX itself. Repro code:

"""
Copyright 2020 The OneFlow Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import tempfile
import oneflow as flow
from oneflow_onnx.oneflow2onnx.util import convert_to_onnx_and_check

class Conv2d(flow.nn.Module):
    def __init__(self) -> None:
        super(Conv2d, self).__init__()
        self.conv = flow.nn.Conv2d(3, 16, 3)

    def forward(self, x: flow.Tensor) -> flow.Tensor:
        return self.conv(x)

conv_module = Conv2d()
class Conv2dOpGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m = conv_module

    def build(self, x):
        out = self.m(x)
        return out


def test_conv2d():
    
    conv_graph = Conv2dOpGraph()
    conv_graph._compile(flow.randn(1, 3, 224, 224))

    with tempfile.TemporaryDirectory() as tmpdirname:
        flow.save(conv_module.state_dict(), tmpdirname)
        convert_to_onnx_and_check(conv_graph, flow_weight_dir=tmpdirname, onnx_model_path="/tmp", opset=14)

test_conv2d()

Error message:

ONNX Failed to infer shapes and dtypes for [m.conv-bias_add-131, type: Unsqueeze]
Traceback (most recent call last):
  File "/home/zhangxiaoyu/oneflow_convert/oneflow_onnx/schemas.py", line 184, in InferOnnxShapeDtype
    inferred_model = shape_inference.infer_shapes(model_proto)
  File "/home/zhangxiaoyu/miniconda3/envs/clang10/lib/python3.8/site-packages/onnx/shape_inference.py", line 41, in infer_shapes
    inferred_model_str = C.infer_shapes(model_str, check_type, strict_mode, data_prop)
RuntimeError: Input 1 is out of bounds.
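One plausible reading of "Input 1 is out of bounds" (not confirmed in this issue): Unsqueeze changed in opset 13, where axes moved from a node attribute to a second input tensor, so an exporter that still emits the single-input pre-13 form under opset 14 leaves shape inference looking for a missing input 1. The two forms, for comparison:

from onnx import helper, TensorProto

# opset <= 12: axes is an attribute
old_style = helper.make_node("Unsqueeze", inputs=["x"], outputs=["y"], axes=[0])

# opset >= 13: axes is a second (int64) input
axes_init = helper.make_tensor("axes", TensorProto.INT64, dims=[1], vals=[0])
new_style = helper.make_node("Unsqueeze", inputs=["x", "axes"], outputs=["y"])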

Possible conversion bug in some op

Yesterday Chi hit a conversion failure once while using oneflow->onnx (the precision error of the result was > 1e-4). Today I confirmed that the onnx file is written in "wb" mode, so it overwrites any existing onnx model; a corrupted model from appended writes is therefore not the cause. Some op probably has a conversion precision problem, and we need to rerun locally many times to reproduce and fix it.

Import problem with oneflow_onnx.oneflow2onnx.util import export_onnx_model

The import from oneflow_onnx.oneflow2onnx.util import export_onnx_model raises an error.

  • A problem encountered while updating the documentation.
  • The test code given in the documentation:
from oneflow_onnx.oneflow2onnx.util import export_onnx_model

export_onnx_model(graph,
                  external_data=False, 
                  opset=None, 
                  flow_weight_dir=None, 
                  onnx_model_path="/tmp", 
                  dynamic_batch_size=False)
  • The error message is shown below:
    [screenshot]
  • Two checks have been done so far:
  1. Checked whether the package's path is visible when running the Python program; the path is already on sys.path.
    [screenshot]
  2. Checked whether oneflow_onnx is missing an __init__.py file.
    [screenshot]
    Neither check resolved the problem.
  • The relevant package versions:
python    3.8
oneflow            0.8.1.dev20221201+cu112
oneflow-onnx       0.6.1
onnx               1.12.0
onnx-simplifier    0.4.10
onnxoptimizer      0.3.2
onnxruntime-gpu    1.13.1
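Without the screenshots the exact ImportError is unknown, but checking which module of the installed 0.6.1 wheel actually ships the symbol would narrow it down. A minimal diagnostic sketch (assuming nothing about where export_onnx_model lives in this version):

import oneflow_onnx
from oneflow_onnx.oneflow2onnx import util

# Confirm which installation is imported, then list what the util
# module actually exports in this version.
print(oneflow_onnx.__file__)
print([name for name in dir(util) if "onnx" in name.lower()])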

How to support the logical_slice_assign op

How can the logical_slice_assign op be supported?

Example:

import tempfile
import oneflow as flow
from oneflow_onnx.oneflow2onnx.util import convert_to_onnx_and_check

class logicalSliceAssign(flow.nn.Module):
    def __init__(self) -> None:
        super(logicalSliceAssign, self).__init__()
    
    def forward(self, x: flow.Tensor) -> flow.Tensor:
        x[:, 0 : 2] += x
        return x

logical_slice = logicalSliceAssign()
class logicalSliceOpGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m = logical_slice

    def build(self, x):
        out = self.m(x)
        return out

def test_logical_slice():
    logical_slice_graph = logicalSliceOpGraph()
    logical_slice_graph._compile(flow.randn(1, 2, 1, 1))
    print(logical_slice_graph._ops_repr())

    with tempfile.TemporaryDirectory() as tmpdirname:
        flow.save(logical_slice.state_dict(), tmpdirname)
        convert_to_onnx_and_check(logical_slice_graph, flow_weight_dir=tmpdirname, onnx_model_path="/tmp")

test_logical_slice()

Source: flowvision rexnet
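One possible lowering, sketched in numpy: when the assigned region is a contiguous slice along a single axis, the in-place update can be rewritten with slice, add, and concatenate only, each of which has a direct ONNX counterpart (Slice, Add, Concat). This illustrates the semantics; it is not the converter's actual implementation:

import numpy as np

def slice_assign_add(x, update, axis, start, stop):
    # Equivalent of "x[slice along axis] += update" without mutation.
    idx = [slice(None)] * x.ndim
    idx[axis] = slice(start, stop)
    region = x[tuple(idx)] + update            # Slice + Add

    before, after = list(idx), list(idx)
    before[axis] = slice(0, start)
    after[axis] = slice(stop, None)
    return np.concatenate(                     # Concat
        [x[tuple(before)], region, x[tuple(after)]], axis=axis
    )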

Unsupported ops when converting yolov5

Unsupported ops: Counter({'silu': 57, 'narrow': 9, 'max_pool_2d': 3, 'scalar_pow': 3, 'upsample_nearest_2d': 2})
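Of these, silu at least has a simple decomposition: SiLU(x) = x * sigmoid(x), so it can be emitted with two standard ONNX nodes even though ONNX has no dedicated SiLU op. A sketch (tensor names are placeholders; the registration into oneflow_onnx is not shown):

from onnx import helper

# SiLU(x) = x * sigmoid(x)
sigmoid_node = helper.make_node("Sigmoid", inputs=["x"], outputs=["x_sigmoid"])
mul_node = helper.make_node("Mul", inputs=["x", "x_sigmoid"], outputs=["y"])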

Unsupported ops for LiBai MT5

Unsupported ops: Counter({'broadcast_matmul': 89, 'scalar_div': 4, 'gather': 4, 'elementwise_minimum': 3, 'fill_': 2, 'scalar_logical_less': 2, 'where': 2, 'scalar_logical_greater': 1})

Face-recognition iresnet fails the numerical check when fuse_add_to_output is enabled in graph mode

Repro code:

import tempfile
import oneflow as flow
from oneflow_onnx.oneflow2onnx.util import convert_to_onnx_and_check
from flowvision.models.face_recognition import iresnet50

model = iresnet50().to("cuda")

class ModelGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.config.allow_fuse_add_to_output(True)
        self.backbone = model

    def build(self, x):
        x = x.to("cuda")
        out = self.backbone(x)
        return out

model.eval()
model_graph = ModelGraph(model)
model_graph._compile(flow.randn(1, 3, 112, 112))


with tempfile.TemporaryDirectory() as tmpdirname:
    flow.save(model.state_dict(), tmpdirname)
    convert_to_onnx_and_check(
        model_graph, flow_weight_dir=tmpdirname, onnx_model_path="./", print_outlier=True)

Error:

File ~/miniconda/lib/python3.9/site-packages/oneflow_onnx/oneflow2onnx/util.py:102, in compare_result(a, b, rtol, atol, print_outlier)
    100         if np.abs(a[i] - b[i]) > atol + rtol * np.abs(b[i]):
    101             print("a[{}]={}, b[{}]={}".format(i, a[i], i, b[i]))
--> 102 assert np.allclose(a, b, rtol=rtol, atol=atol)

AssertionError:

But if the line self.config.allow_fuse_add_to_output(True) is commented out, the conversion succeeds.

Eager Oneflow2onnx

The first step is to delete all decorators containing @flow, as well as flow.experimental, flow.checkpoint, and the like.

examples.models

Migrate the models to Graph

from typing import Any

# assume class AlexNet exists
def alexnet(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> AlexNet:
    model = AlexNet(**kwargs)
    return model


def test_alexnet():
    alexnet_module = alexnet()
    alexnet_module.eval().to("cuda")
	
    # Graph is added to oneflow_onnx/oneflow2onnx/util.py
    alexnet_graph = Graph(alexnet_module)
	
    # job has been replaced with graph here
    convert_to_onnx_and_check(
        alexnet_graph, 
        flow_weight_dir="./examples/oneflow2onnx/models/alexnet_oneflow_model", 
        onnx_model_path="/tmp"
    )

where Graph is:

class Graph(flow.nn.Graph):
    def __init__(self, module):
        super().__init__()
        self.m = module

    def build(self, x):
        out = self.m(x)
        return out

There is also a class named Graph here; I plan to rename it to OneflowGraph, consistent with tvm.

oneflow2onnx.util

About the convert_to_onnx_and_check() function

  1. Replace job with graph in this function (every function needs the same change).
  2. About the explicit_init parameter: I am not sure what this code actually does in lazy mode.

About export_onnx_model(), the main conversion function

  1. I plan to delete the if-else around flow_weight_dir; it can move into the later main Export function (I think a plain assert that model_dir_path must be supplied is better, which is what I did in tvm; see the sketch after this list).
  2. For the check on whether snapshot_done exists, tvm has a faster approach, which I ported over directly.
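A minimal sketch of the proposed fail-fast signature (parameter list abbreviated, names as in the current API):

def export_onnx_model(graph, flow_weight_dir=None, onnx_model_path="/tmp"):
    # Proposed: require the weight directory up front instead of
    # branching on it later in the export path.
    assert flow_weight_dir is not None, (
        "flow_weight_dir must point to a saved OneFlow model directory"
    )
    # ... conversion proper continues here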

oneflow2onnx.flow2onnx

About the ProcessFlowGraph function

  1. In lazy mode, node information was obtained through helper; in the current version it has to be extracted from repr(graph) by hand, because graph_proto (the counterpart of the old job_func).helper returns None. So the node-information extraction script shown in the tests needs to be added here, which also corresponds to the same place in tvm. It needs minor changes, because the latest repr(graph) now includes the data type as well, which can be extracted separately, as follows:
import re

shapes = {}
dtypes = {}
graph_str = repr(graph)
size_where = 2
if "cuda" in graph_str:
    size_where = 3

p_size = re.compile(r"size=\(.*?\)", re.S)
p_type = re.compile(r"dtype=.*?,", re.S)
types = ["INPUT", "PARAMETER", "BUFFER", "OUTPUT"]

for t in types:
    data = re.finditer(t + ":.*", graph_str)
    for i in data:
        attrs = i.group().split(":")
        size_str = re.findall(p_size, attrs[size_where])
        type_str = re.findall(p_type, attrs[size_where])
        assert size_str != [] or type_str != [], \
            "size should not be None, please check your inputs dtype"

        size_attr = size_str[0].replace("size=", "")
        type_attr = type_str[0].replace("dtype=", "")
        # strip the trailing commas captured by the regexes
        if size_attr[-2] == ",":
            size_attr = size_attr.replace(",", "")
        if type_attr[-1] == ",":
            type_attr = type_attr.replace(",", "")

        data_size = tuple(map(int, size_attr[1:-1].split(", ")))
        node_name = attrs[1]
        shapes[node_name] = data_size
        # STR_TO_FLOW: a mapping from the dtype string in repr(graph)
        # to the corresponding oneflow dtype, defined elsewhere.
        dtypes[node_name] = STR_TO_FLOW[type_attr]

Regarding how nodes are stored: the current oneflow2onnx keeps them in a list; it is not yet clear what problems building them with helper.make_node() might cause.
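For reference, the standard onnx pattern is to accumulate helper.make_node results in a plain Python list and pass that list to helper.make_graph, so a list-based node store maps onto it directly:

from onnx import helper

# Nodes are plain protobufs; a Python list of them is exactly what
# helper.make_graph(nodes, name, inputs, outputs) expects.
relu_node = helper.make_node("Relu", inputs=["x"], outputs=["y"], name="relu_0")
nodes = [relu_node]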

Also, the get_inputs and get_outputs functions there can be reworked to use node.user_conf.node_name, corresponding to this spot.

Of course, I think this could probably be merged into that spot; that is how tvm does it.

oneflow_onnx

  1. _GraphCheck can be changed to graph._is_compiled().
  2. There seem to be two copies of UpdateProto; one of them may need a new name.
  3. MakeSure can simply become an assert.
  4. For the data fetched by MakeSlice, those parameters can be stored ahead of time and read when needed; in tvm I store them first and fetch them later.
  5. Getting attrs looks more complicated, and is not limited to this one place:
def parse_attr(attr):
    """Convert an oneflow attribute map (AttrValue protos) into plain Python values."""
    attrs = {}
    for a in attr:
        attr_str = str(attr[a])

        if attr_str[0:7] == "at_list":
            attr_str_ = attr_str.split(" ")[0]

            if attr_str_ == "at_list_float":
                attrs[a] = tuple(attr[a].at_list_float.val)
            elif attr_str_ == "at_list_int32":
                attrs[a] = tuple(attr[a].at_list_int32.val)
            elif attr_str_ == "at_list_int64":
                attrs[a] = tuple(attr[a].at_list_int64.val)

        elif attr_str.split(":")[0] == "at_string":
            attrs[a] = attr[a].at_string

        elif attr_str.split(" ")[0] == "at_shape":
            attrs[a] = tuple(list(attr[a].at_shape.dim))

        else:
            attr_str_ = attr_str.split(":")[0]
            if attr_str_ == "at_bool":
                attrs[a] = attr[a].at_bool
            elif attr_str_ == "at_double":
                attrs[a] = attr[a].at_double
            elif attr_str_ == "at_float":
                attrs[a] = attr[a].at_float
            elif attr_str_ == "at_int32":
                attrs[a] = attr[a].at_int32
            elif attr_str_ == "at_int64":
                attrs[a] = attr[a].at_int64

    return attrs
    

I have not looked at the optimizer in detail; it probably has some problems similar to the above, e.g. renamings and the extraction of attrs from ops.
