alibaba / easyparallellibrary

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

License: Apache License 2.0

Makefile 0.69% Shell 0.16% C++ 6.52% Python 92.62%

deep-learning data-parallelism model-parallelism pipeline-parallelism memory-efficient distributed-training gpu

easyparallellibrary's Introduction

Easy Parallel Library

Overview

Easy Parallel Library (EPL) is a general and efficient library for distributed model training.

Usability - Users can implement different parallelism strategies with a few lines of annotations, including data parallelism, pipeline parallelism, tensor model parallelism, and their hybrids.
Memory Efficient - EPL provides various memory-saving techniques, including gradient checkpointing, ZeRO, and CPU offload. Users can train larger models with fewer computing resources.
High Performance - EPL provides an optimized communication library to achieve high scalability and efficiency.

For more information, you may read the docs.

EPL Model Zoo provides end-to-end parallel training examples.

Installation

To install EPL, please refer to the following instructions.

Examples

The examples below switch between parallelism strategies by changing only the annotations. Please refer to the API documentation for API details and to the tutorials for more examples.

Data Parallelism

The following example shows a basic data parallelism annotation. The data parallelism degree is determined by the number of allocated GPUs.

+ import epl
+ epl.init()
+ with epl.replicate(device_count=1):
    model()

Pipeline Parallelism

The following example shows pipeline parallelism with two pipeline stages, where each stage is computed on one GPU. If the total number of GPUs is 4, EPL automatically applies two-degree data parallelism over the model pipeline.

+ import epl
+ 
+ config = epl.Config({"pipeline.num_micro_batch": 4})
+ epl.init(config)
+ with epl.replicate(device_count=1, name="stage_0"):
    model_part1()
+ with epl.replicate(device_count=1, name="stage_1"):
    model_part2()
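
The arithmetic behind the automatic data parallelism degree can be sketched as follows. This is an illustration only, not EPL code: the function name data_parallel_degree is hypothetical, and the divisibility check is an assumption about how leftover GPUs should be treated.

```python
def data_parallel_degree(total_gpus, gpus_per_stage):
    """Return how many copies of the whole pipeline fit on the allocated GPUs.

    gpus_per_stage lists the device_count of each pipeline stage.
    """
    gpus_per_pipeline = sum(gpus_per_stage)
    # Assumption: the allocation must be a whole multiple of one pipeline.
    if total_gpus % gpus_per_pipeline != 0:
        raise ValueError("total_gpus must be a multiple of the GPUs one pipeline needs")
    return total_gpus // gpus_per_pipeline


# Two stages with one GPU each, four GPUs in total, as in the text above.
degree = data_parallel_degree(total_gpus=4, gpus_per_stage=[1, 1])
print(degree)  # 2
```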

Tensor Model Parallelism

The following example shows a tensor model parallelism annotation. We apply data parallelism to the ResNet part and tensor model parallelism to the classification part.

+ import epl
+ config = epl.Config({"cluster.colocate_split_and_replicate": True})
+ epl.init(config)
+ with epl.replicate(8):
    ResNet()
+ with epl.split(8):
    classification()

Publication

If you use EPL in your publication, please cite it by using the following BibTeX entry.

@inproceedings {jia2022whale,
	author = {Xianyan Jia and Le Jiang and Ang Wang and Wencong Xiao and Ziji Shi and Jie Zhang and Xinyuan Li and Langshi Chen and Yong Li and Zhen Zheng and Xiaoyong Liu and Wei Lin},
	title = {Whale: Efficient Giant Model Training over Heterogeneous {GPUs}},
	booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
	year = {2022},
	isbn = {978-1-939133-29-57},
	address = {Carlsbad, CA},
	pages = {673--688},
	url = {https://www.usenix.org/conference/atc22/presentation/jia-xianyan},
	publisher = {USENIX Association},
	month = jul,
}

Contact Us

Join the Official Discussion Group on DingTalk.

easyparallellibrary's Issues

Running resnet_split.py from the examples distributed across 2 servers waits indefinitely

Environment: a container created from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
Code: FastNN/resnet/resnet_split.py
Commands:
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
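
The two commands differ only in the task index. As a minimal sketch, the TF_CONFIG strings can be generated rather than typed by hand, which avoids quoting and index mistakes. The helper name build_tf_config is hypothetical; the JSON layout is the one used in the commands above.

```python
import json


def build_tf_config(worker_hosts, task_index):
    """Return the TF_CONFIG JSON string for one worker of the cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must refer to an entry in worker_hosts")
    config = {
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    }
    # Compact separators reproduce the whitespace-free form used in the commands.
    return json.dumps(config, separators=(",", ":"))


hosts = ["172.20.21.181:55375", "172.20.21.189:55376"]
for index in range(len(hosts)):
    # Print one launch command per server.
    print("TF_CONFIG='{}' bash scripts/train_split.sh".format(build_tf_config(hosts, index)))
```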

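Output on server 1:

Output on server 2:

As shown, server 1 printed the "still waiting" message only twice and then stopped, which suggests that it received a reply from server 2, yet it did not proceed any further.

Additional note: in the same environment BERT runs distributed without problems, so the servers can connect to each other and run distributed training normally.

Is this a problem with how I am running it, or does the code need to be changed?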

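How should the training steps of EPL single-machine single-GPU and single-machine multi-GPU runs be interpreted?

Single machine, single GPU:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh

Single machine, two GPUs:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh

I modified the code slightly: I removed the last_step limit and set the dataset to repeat=10. Rename the attached file from .txt to .py to run it.
resnet_dp.txt

How should I understand this? Did each GPU run 10 steps separately?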

AttributeError: 'NoneType' object has no attribute 'taskgraph'

Hi EPL team,

When I use the EPL library to run the following training code:

import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from PIL import Image
import tensorflow as tf
import epl

def preprocess_image(image):
    # Resize and crop
    width, height = image.size
    if width > height:
        new_width = int(224 * width / height)
        image = image.resize((new_width, 224))
        left = (new_width - 224) / 2
        image = image.crop((left, 0, left + 224, 224))
    else:
        new_height = int(224 * height / width)
        image = image.resize((224, new_height))
        top = (new_height - 224) / 2
        image = image.crop((0, top, 224, top + 224))

    # Normalize pixel values
    image = np.array(image, dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406])[None, None, :]
    std = np.array([0.229, 0.224, 0.225])[None, None, :]
    image = (image - mean) / std

    return image


def load_and_preprocess_image(path):
    image = Image.open(path).convert('RGB')
    return preprocess_image(image)


train_image_dir = '/users/Master/imagenet/train'
val_image_dir = '/users/Master/imagenet/val'
class_names = sorted(os.listdir(train_image_dir))
num_classes = len(class_names)

train_image_paths = []
train_labels = []
val_image_paths = []
val_labels = []

for label, class_name in enumerate(class_names):
    train_class_dir = os.path.join(train_image_dir, class_name)
    val_class_dir = os.path.join(val_image_dir, class_name)

    for img_name in os.listdir(train_class_dir):
        img_path = os.path.join(train_class_dir, img_name)
        train_image_paths.append(img_path)
        train_labels.append(label)

    for img_name in os.listdir(val_class_dir):
        img_path = os.path.join(val_class_dir, img_name)
        val_image_paths.append(img_path)
        val_labels.append(label)


def load_images_parallel(image_paths, num_workers=16):
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        images = list(executor.map(load_and_preprocess_image, image_paths))
    return np.array(images)


def load_images_chunk(image_paths, labels, batch_size):
    num_batches = int(np.ceil(len(image_paths) / batch_size))
    for i in range(num_batches):
        batch_image_paths = image_paths[i * batch_size:(i + 1) * batch_size]
        batch_labels = labels[i * batch_size:(i + 1) * batch_size]
        batch_images = load_images_parallel(batch_image_paths)
        batch_labels_one_hot = tf.keras.utils.to_categorical(batch_labels, num_classes=num_classes)
        yield batch_images, batch_labels_one_hot


def conv2d_bn(x, filters, kernel_size, strides=1, padding='same', activation=tf.nn.relu, name=None):
    x = tf.layers.conv2d(x, filters, kernel_size, strides=strides, padding=padding, use_bias=False, name=name)
    x = tf.layers.batch_normalization(x, training=True)
    if activation is not None:
        x = activation(x)
    return x


def identity_block(input_tensor, filters, stage, block):
    filters1, filters2, filters3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = conv2d_bn(input_tensor, filters1, 1, name=conv_name_base + '2a')
    x = conv2d_bn(x, filters2, 3, name=conv_name_base + '2b')
    x = conv2d_bn(x, filters3, 1, activation=None, name=conv_name_base + '2c')

    x = tf.add(x, input_tensor)
    x = tf.nn.relu(x)
    return x


def conv_block(input_tensor, filters, stage, block, strides=2):
    filters1, filters2, filters3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = conv2d_bn(input_tensor, filters1, 1, strides=strides, name=conv_name_base + '2a')
    x = conv2d_bn(x, filters2, 3, name=conv_name_base + '2b')
    x = conv2d_bn(x, filters3, 1, activation=None, name=conv_name_base + '2c')

    shortcut = conv2d_bn(input_tensor, filters3, 1, strides=strides, activation=None, name=conv_name_base + '1')

    x = tf.add(x, shortcut)
    x = tf.nn.relu(x)
    return x


def resnet50(input_tensor, classes):
    x = conv2d_bn(input_tensor, 64, 7, strides=2, name='conv1')
    x = tf.layers.max_pooling2d(x, 3, strides=2, padding='same', name='pool1')

    x = conv_block(x, [64, 64, 256], stage=2, block='a', strides=1)
    x = identity_block(x, [64, 64, 256], stage=2, block='b')
    x = identity_block(x, [64, 64, 256], stage=2, block='c')

    x = conv_block(x, [128, 128, 512], stage=3, block='a')
    x = identity_block(x, [128, 128, 512], stage=3, block='b')
    x = identity_block(x, [128, 128, 512], stage=3, block='c')
    x = identity_block(x, [128, 128, 512], stage=3, block='d')

    x = conv_block(x, [256, 256, 1024], stage=4, block='a')
    x = identity_block(x, [256, 256, 1024], stage=4, block='b')
    x = identity_block(x, [256, 256, 1024], stage=4, block='c')
    x = identity_block(x, [256, 256, 1024], stage=4, block='d')
    x = identity_block(x, [256, 256, 1024], stage=4, block='e')
    x = identity_block(x, [256, 256, 1024], stage=4, block='f')

    x = conv_block(x, [512, 512, 2048], stage=5, block='a')
    x = identity_block(x, [512, 512, 2048], stage=5, block='b')
    x = identity_block(x, [512, 512, 2048], stage=5, block='c')

    x = tf.layers.average_pooling2d(x, 7, strides=1, padding='valid', name='pool5')
    x = tf.layers.flatten(x)
    x = tf.layers.dense(x, classes, activation=None, name='fc1000')

    return x


def run_model():
    with tf.Session() as sess:
        input_tensor = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name="input_image")
        labels_tensor = tf.placeholder(tf.float32, shape=[None, num_classes], name="labels")
        learning_rate = 0.001

        logits = resnet50(input_tensor, num_classes)

        loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels_tensor))
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        train_op = optimizer.minimize(loss_op)

        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_tensor, 1))
        accuracy_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

        sess.run(tf.global_variables_initializer())

        epochs = 10
        batch_size = 64

        for epoch in range(epochs):
            step = 0

            for batch_images, batch_labels_one_hot in load_images_chunk(train_image_paths, train_labels, batch_size):
                _, loss, accuracy = sess.run(
                    [train_op, loss_op, accuracy_op],
                    feed_dict={input_tensor: batch_images, labels_tensor: batch_labels_one_hot}
                )
                print(f"Epoch {epoch + 1}/{epochs}, Step: {step}, Loss: {loss:.4f}, Accuracy: {accuracy:.4f}")
                step = step + 1

            # Validate the model
            val_accuracy_list = []
            for batch_images, batch_labels_one_hot in load_images_chunk(val_image_paths, val_labels, batch_size):
                accuracy = sess.run(accuracy_op,
                                    feed_dict={input_tensor: batch_images, labels_tensor: batch_labels_one_hot})
                val_accuracy_list.append(accuracy)
            val_accuracy = np.mean(val_accuracy_list)
            print(f"Validation Accuracy: {val_accuracy:.4f}")


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    config_json = {}
    epl.init(epl.Config(config_json))
    print(epl.Env.get().cluster.gpu_num_per_worker)
    if epl.Env.get().cluster.gpu_num_per_worker > 1:
        # Avoid NCCL hang.
        os.environ["NCCL_LAUNCH_MODE"] = "GROUP"
    epl.set_default_strategy(epl.replicate(device_count=1))
    run_model()

I get the following error:
Traceback (most recent call last):
  File "resnet50_split3.py", line 203, in <module>
    run_model()
  File "resnet50_split3.py", line 164, in run_model
    sess.run(tf.global_variables_initializer())
  File "/users/Master/anaconda3/envs/py37/lib/python3.7/site-packages/epl/parallel/hooks.py", line 453, in run
    assign_ops = _init_local_resources(self, fn)
  File "/users/Master/anaconda3/envs/py37/lib/python3.7/site-packages/epl/parallel/hooks.py", line 416, in _init_local_resources
    assign_ops = broadcast_variables()
  File "/users/Master/anaconda3/envs/py37/lib/python3.7/site-packages/epl/parallel/hooks.py", line 339, in broadcast_variables
    bcast_variables = taskgraph.get_variables(replica_idx)
  File "/users/Master/anaconda3/envs/py37/lib/python3.7/site-packages/epl/ir/taskgraph.py", line 409, in get_variables
    if id(var_tensor.taskgraph) != id(self):
AttributeError: 'NoneType' object has no attribute 'taskgraph'

Could you give me a hand when you are free? Thank you very much!

DingTalk QR code is outdated

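Does DistributedDense only support column-wise splitting?

Does DistributedDense only support splitting by column? If I want to implement the Megatron-LM scheme, which splits by column first and then by row, what should I do?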
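
For reference, the column-then-row scheme mentioned in this question can be checked numerically. The sketch below uses plain NumPy and does not use or describe any EPL API. The shapes, the shard count, and the tanh approximation of GELU are illustrative assumptions. It shows that splitting the first weight matrix by columns and the second by rows, then summing the partial outputs, reproduces the unsplit result.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU, applied element-wise.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


rng = np.random.default_rng(0)
num_shards = 2                      # illustrative number of devices
x = rng.standard_normal((4, 8))     # input activations
a = rng.standard_normal((8, 16))    # first dense layer weights
b = rng.standard_normal((16, 8))    # second dense layer weights

# Unsplit two-layer MLP.
reference = gelu(x @ a) @ b

# First layer is split by columns, second layer by rows.
a_shards = np.split(a, num_shards, axis=1)
b_shards = np.split(b, num_shards, axis=0)

# Each shard computes its partial output independently; GELU is element-wise,
# so no communication is needed between the two layers.
partials = [gelu(x @ a_i) @ b_i for a_i, b_i in zip(a_shards, b_shards)]

# Summing the partial outputs plays the role of the all-reduce.
output = sum(partials)

assert np.allclose(output, reference)
```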

Gradient Checkpoint with auto type got a TypeError

After I added the code below to my working function, I got a TypeError:

epl_config = epl.Config({
    "gradient_checkpoint.type": "auto",
    "zero.level": "v1",
    "amp.level": "O1", "amp.loss_scale": 128
})
epl.init(epl_config)
epl.set_default_strategy(epl.replicate(1))

Error info:

"/venv/lib/python2.7/site-packages/tensorflow/contrib/graph_editor/util.py", line 214, in get_unique_graph
    t) for t in check_types]), type(op)))
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Tensor'>), got: <class 'tensorflow.python.ops.resource_variable_ops.Resource

My working function (the modeling module is from robert):

import epl
import tensorflow as tf
from tensorflow.contrib import layers, metrics

epl_config = epl.Config({
    "gradient_checkpoint.type": "auto",
    "zero.level": "v1",
    "amp.level": "O1", "amp.loss_scale": 128
})
epl.init(epl_config)
epl.set_default_strategy(epl.replicate(1))
bert_path = 'robert_checkpoint_path'


def model_fn(features, labels, mode, params):
    is_train_bool = mode == tf.estimator.ModeKeys.TRAIN

    # Building BERT model
    bert_config = modeling.BertConfig.from_dict({
        "attention_probs_dropout_prob": 0.1,
        "directionality": "bidi",
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "hidden_size": 768,
        "initializer_range": 0.02,
        "intermediate_size": 3072,
        "max_position_embeddings": 512,
        "num_attention_heads": 12,
        "num_hidden_layers": 6,
        "pooler_fc_size": 768,
        "pooler_num_attention_heads": 12,
        "pooler_num_fc_layers": 3,
        "pooler_size_per_head": 128,
        "pooler_type": "first_token_transform",
        "type_vocab_size": 2,
        "vocab_size": 21128
    })
    bert = modeling.BertModel(
        config=bert_config,
        is_training=is_train_bool,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False,
        scope='bert'
    )
    # Getting BERT's outputs
    bert_output = bert.get_sequence_output()
    # Loading pre-trained BERT
    if is_train_bool:
        tvars = tf.trainable_variables()
        (
            assignment_map, initialized_names
        ) = modeling.get_assignment_map_from_checkpoint(
            tvars, bert_path
        )
        tf.train.init_from_checkpoint(bert_path, assignment_map)
        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            tf.logging.info("  name = {}, shape = {}{}".format(
                var.name, var.shape,
                ", *INIT_FROM_CKPT*" if var.name in initialized_names
                else ''
            ))

    with tf.variable_scope("network"):
        # MLP
        first_hidden_layer = tf.layers.dense(
            tf.concat(bert_output, axis=1), 128, activation=tf.nn.relu)
        second_hidden_layer = tf.layers.dense(
            first_hidden_layer, 128, activation=tf.nn.relu)
        logits = tf.layers.dense(second_hidden_layer, 1)
        predictions = tf.sigmoid(logits)

    predictions = tf.identity(predictions, name="predict")

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode, predictions={
                "predict": predictions,
                'label': features['label'],
            }
        )
    labels = tf.reshape(labels, [-1, 1])
    loss = tf.losses.sigmoid_cross_entropy(labels, logits)
    epl.add_to_collection(loss, epl.GraphKeys.GLOBAL_MEAN_OBJECTS)

    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(loss=loss,
                                  global_step=tf.train.get_global_step())
    predictions = tf.reshape(predictions, [-1, 1])
    eval_metric_ops = {
        "auc": tf.metrics.auc(labels, predictions),
        "f1": metrics.f1_score(labels, predictions),
        "precision": tf.metrics.precision_at_thresholds(
            labels, predictions, [0.5]
        ),
        "recall": tf.metrics.recall_at_thresholds(
            labels, predictions, [0.5]
        )
    }

    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        predictions={"predict": predictions},
        train_op=train_op,
        eval_metric_ops=eval_metric_ops)

During training, the step of every worker except the chief resets to 0 after each checkpoint save, and the whole process hangs after the second checkpoint save

Code:

"""Run downstream classification"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tensorflow as tf
import utils.optimizer as optimizer
import epl

FLAGS = tf.flags.FLAGS

tf.flags.DEFINE_integer("task_index", None, "Worker or server index")
tf.flags.DEFINE_string("worker_hosts", "", "worker hosts")

tf.flags.DEFINE_string("buckets", "", "tables info")
tf.flags.DEFINE_string("train_table", "", "tables info")
tf.flags.DEFINE_string("val_table", "", "tables info")
tf.flags.DEFINE_string("checkpoint_dir", '',
                       """Path to checkpoint folder""")

tf.flags.DEFINE_integer("num_epochs", 100,
                        """Number of training epochs (default: 20)""")
tf.flags.DEFINE_integer("max_steps", 10000, "")
tf.flags.DEFINE_integer("batch_size", 256, """Batch size (default: 64)""")
tf.flags.DEFINE_integer("display_step", 200,
                        """Number of steps to display log into TensorBoard (default: 20)""")
tf.flags.DEFINE_integer("save_checkpoints_steps", 1000,
                        "How often to save the model checkpoint.")
tf.flags.DEFINE_float("learning_rate", 0.001,
                      """Learning rate (default: 0.0005)""")
tf.flags.DEFINE_float("max_grad_norm", 5.0,
                      """Maximum value of the global norm of the gradients for clipping (default: 5.0)""")

tf.flags.DEFINE_integer("num_pipe_stages", 1, "number of pipeline stages")
tf.flags.DEFINE_integer("num_micro_batch", 1, "number of pipeline micro batches")


def str2list(str_in, shape, separator=' ', dtype=tf.int32):
    data = tf.string_split([str_in], separator)
    data = tf.string_to_number(data.values, dtype)
    return tf.reshape(data, shape)

def file_based_input_fn_builder(input_file, slice_id, slice_count, is_training, drop_remainder):
    """Creates an `input_fn` closure to be passed to TPUEstimator."""
    def _decode_record(*record):
        """Decodes a record to a TensorFlow example."""
        (cert_no, coll_case_no, embedding, dt, label) = record

        embedding = str2list(embedding, shape=[512], separator='\002', dtype=tf.float32)

        example = {'input_embed': embedding,
                   'label': label,
                   'dt': dt,
                   'cert_no': cert_no,
                   'coll_case_no': coll_case_no}
        return example

    def input_fn(params):
        """The actual input function."""
        d = tf.data.TableRecordDataset([input_file], record_defaults=['', '', '', '', 0])
        if is_training:
            d = d.repeat(FLAGS.num_epochs)
            d = d.shuffle(buffer_size=1000)

        d = d.apply(tf.contrib.data.map_and_batch(
                        lambda v1, v2, v3, v4, v5: _decode_record(v1, v2, v3, v4, v5),
                        batch_size=FLAGS.batch_size,
                        drop_remainder=drop_remainder))
        return d

    return input_fn

def create_model(input_embed, label):

    with tf.variable_scope("loss", reuse=tf.AUTO_REUSE):
        with tf.variable_scope("cls"):            
            logits = tf.layers.dense(
                input_embed,
                2,
                activation=None,
                kernel_initializer=tf.truncated_normal_initializer())

        one_hot_label = tf.one_hot(label, depth=2, dtype=tf.float32)
        loss = tf.losses.softmax_cross_entropy(one_hot_label, logits)
        probs = tf.nn.softmax(logits, axis=-1)
        predict = tf.argmax(probs, axis=-1, output_type=tf.int32)

        acc = tf.metrics.accuracy(label, predict)
        auc = tf.metrics.auc(label, probs[:,-1])
        return (loss, acc, auc)


def model_fn_builder(checkpoint_dir, learning_rate):
    """Returns `model_fn` closure for TPUEstimator."""
    def model_fn(features, mode):
        """The `model_fn` for Estimator."""

        input_embed = features['input_embed']
        label = features["label"]

        # create loss
        (loss, acc, auc) = create_model(input_embed, label)

        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            #rms optimizer
            tvars = tf.trainable_variables()
            grads = tf.gradients(loss, tvars)
            clipped_grads, global_norm = tf.clip_by_global_norm(grads, FLAGS.max_grad_norm)
            tf.summary.scalar('global_grad_norm', global_norm)

            global_step = tf.train.get_or_create_global_step()
            optimizer = tf.train.RMSPropOptimizer(learning_rate)
            train_op = optimizer.apply_gradients(zip(clipped_grads, tvars),
                                            name='train_op',
                                            global_step=global_step)

            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=loss,
                train_op=train_op)

        elif mode == tf.estimator.ModeKeys.EVAL:
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=loss,
                eval_metric_ops={'Accuracy':acc, "AUC":auc})
        else:
            raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))

        return output_spec

    return model_fn

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    tf.logging.info("############## Start #####################")
    checkpoint_dir = os.path.join(FLAGS.buckets, FLAGS.checkpoint_dir)
    train_file = FLAGS.train_table
    val_file = FLAGS.val_table

    worker_spec = FLAGS.worker_hosts.split(",")
    worker_count = len(worker_spec)
    task_index = FLAGS.task_index

    epl_env = epl.Env.get()
    total_device = len(epl_env.cluster.available_devices)
    num_replica = total_device // FLAGS.num_pipe_stages
    micro_batch = FLAGS.batch_size // epl_env.config.pipeline.num_micro_batch
    micro_batch = micro_batch // num_replica
  
    print("total_batch: {}, num_micro_batch: {}, num_replica: {}, micro_batch: {}".format(
            FLAGS.batch_size,
            epl_env.config.pipeline.num_micro_batch,
            num_replica,
            micro_batch))
    print("task_index:", task_index)
    print("total_device:", total_device)

    model_fn = model_fn_builder(checkpoint_dir, FLAGS.learning_rate)

    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        slice_id=task_index,
        slice_count=worker_count,
        is_training=True,
        drop_remainder=True
    )

    val_input_fn = file_based_input_fn_builder(
        input_file=val_file,
        slice_id=task_index,
        slice_count=worker_count,
        is_training=False,
        drop_remainder=False
    )

    sess_config = tf.ConfigProto(allow_soft_placement=True)
    config = tf.estimator.RunConfig(session_config=sess_config,
                                    save_checkpoints_steps=FLAGS.save_checkpoints_steps)

    estimator = tf.estimator.Estimator(
                model_fn=model_fn,
                config=config,
                model_dir=checkpoint_dir)

    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=FLAGS.max_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=val_input_fn, start_delay_secs=6, throttle_secs=1)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    tf.logging.info("#################  All process done.  ########################")

if __name__ == '__main__':
    env_dist = os.environ
    print(env_dist.get('TF_CONFIG'))
    config_json = {}
    config_json["pipeline.num_micro_batch"] = FLAGS.num_micro_batch
    epl.init(epl.Config(config_json))
    if FLAGS.num_pipe_stages == 1:
        epl.set_default_strategy(epl.replicate(device_count=1))
    tf.app.run()
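
The batch-size bookkeeping in main() above can be checked in isolation. This sketch repeats the same integer divisions outside TensorFlow. The function name is hypothetical, and the device count of 8 is an assumption; the other values are the flag defaults from the script.

```python
def per_replica_micro_batch(batch_size, total_devices, num_pipe_stages, num_micro_batch):
    """Mirror the integer arithmetic used in main() of the script above."""
    num_replica = total_devices // num_pipe_stages
    micro_batch = batch_size // num_micro_batch
    micro_batch = micro_batch // num_replica
    return num_replica, micro_batch


# batch_size=256, num_pipe_stages=1, num_micro_batch=1 are the flag defaults;
# total_devices=8 is assumed for illustration.
result = per_replica_micro_batch(
    batch_size=256, total_devices=8, num_pipe_stages=1, num_micro_batch=1)
print(result)  # (8, 32)
```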

SQL used to submit the training workers:


pai -name tensorflow1120_py3
-Dscript="***/resources/***.tar.gz"
-DentryFile="train_downstream_cls.py"
-Dbuckets="***"
-DuserDefinedParameters="--num_epochs=10 --max_steps=100000 --buckets=*** --checkpoint_dir=*** --train_table=*** --val_table=*** "
-Dtables="***, ***"
-Dcluster="{\"worker\":{\"count\":8,\"cpu\":400,\"gpu\":100}}"

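NCCL error in a 2-machine, 2-GPU experiment

I ran a 2-machine, 2-GPU experiment using two containers and got the error below. I would appreciate help resolving it.

Environment:

Containers built from nvcr.io/nvidia/tensorflow:21.12-tf1-py3

Script:

The resnet script from FastNN

Launch commands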

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

Error

2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.397497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.403631: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.433142: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[{{node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater}}]]

Traceback (most recent call last):
  File "resnet_dp.py", line 92, in <module>
    run_model()
  File "resnet_dp.py", line 67, in run_model
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
    res = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
    assign_ops = _init_local_resources(self, fn)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 423, in _init_local_resources
    fn(self, local_resources_init_op)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Data parallel model: program does not end after reaching the global step

I ran into a problem when using the EPL data parallel model.
The number of workers is set to 3, and each worker has its own TF data record input and model save_dir. The global step is set to 3500 for each worker. Training looked normal while the global step was below 3500, but the program did not end when it reached 3500. It seems that the chief worker (worker 0) did not know that the other workers had finished.