
flexflow's Introduction

FlexFlow Serve: Low-Latency, High-Performance LLM Serving



News🔥:

  • [09/02/2023] Added AMD GPU support; released Docker images for ROCm 5.3–5.6
  • [08/16/2023] Added StarCoder model support
  • [08/14/2023] Released Docker images for different CUDA versions

What is FlexFlow Serve

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. FlexFlow Serve is an open-source compiler and distributed system for low-latency, high-performance LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.

Performance comparison

Install FlexFlow Serve

Requirements

  • OS: Linux
  • GPU backend: HIP/ROCm or CUDA
    • CUDA version: 10.2 – 12.0
    • NVIDIA compute capability: 6.0 or higher
  • Python: 3.6 or higher
  • Package dependencies: see here

Install with pip

You can install FlexFlow Serve using pip:

pip install flexflow

Try it in Docker

If you run into any issue during the install, or if you would like to use the C++ API without needing to install from source, you can also use our pre-built Docker package for different CUDA versions (NVIDIA backend) and multiple ROCM versions (AMD backend). To download and run our pre-built Docker container:

docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-12.0:latest

To download a Docker container for a backend other than CUDA v12.0, you can replace the cuda-12.0 suffix with any of the following backends: cuda-11.1, cuda-11.2, cuda-11.3, cuda-11.4, cuda-11.5, cuda-11.6, cuda-11.7, cuda-11.8, hip_rocm-5.3, hip_rocm-5.4, hip_rocm-5.5, or hip_rocm-5.6. More info on the Docker images, with instructions to build a new image from source or run with additional configurations, can be found here.
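For example, to run the CUDA 11.8 container instead, substitute the suffix in the same command:

docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-11.8:latest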

Build from source

You can install FlexFlow Serve from source code by building the inference branch of FlexFlow. Please follow these instructions.

Quickstart

The following example shows how to deploy an LLM using FlexFlow Serve and accelerate its serving using speculative inference. First, we import flexflow.serve and initialize the FlexFlow Serve runtime. Note that memory_per_gpu and zero_copy_memory_per_node specify the size of device memory on each GPU (in MB) and zero-copy memory on each node (in MB), respectively. We need to make sure the aggregated GPU memory and zero-copy memory are both sufficient to store LLM parameters in non-offloading serving. FlexFlow Serve combines tensor and pipeline model parallelism for LLM serving.

import flexflow.serve as ff

ff.init(
        num_gpus=4,
        memory_per_gpu=14000,
        zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=4,
        pipeline_parallelism_degree=1
    )
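As a rough sanity check (assuming half-precision weights), Llama-2-7B has about 7 billion parameters, i.e. roughly 14 GB of weights. With the configuration above, the aggregated GPU memory is 4 × 14000 MB = 56 GB and the zero-copy memory is 30 GB per node, so both are comfortably sufficient for non-offloading serving.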

Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at supported models.

# Specify the LLM
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms = []
ssm = ff.SSM("JackFram/llama-68m")
ssms.append(ssm)

Next, we declare the generation configuration and compile both the LLM and SSMs. Note that all SSMs should run in the beam search mode, and the LLM should run in the tree verification mode to verify the speculated tokens from SSMs. You can also use the following arguments to specify serving configuration when compiling LLMs and SSMs:

  • max_requests_per_batch: the maximum number of requests to serve in a batch (default: 16)
  • max_seq_length: the maximum number of tokens in a request (default: 256)
  • max_tokens_per_batch: the maximum number of tokens to process in a batch (default: 128)

# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
            max_requests_per_batch = 16,
            max_seq_length = 256,
            max_tokens_per_batch = 128,
            ssms=ssms)

Next, we call llm.start_server() to start an LLM server running on a separate background thread, which allows users to perform computations in parallel with LLM serving. Finally, we call llm.generate to generate the output, which is organized as a list of GenerationResult objects, each of which includes the output tokens and text. After all serving requests are processed, you can either call llm.stop_server() to terminate the background thread or directly exit the Python program, which will automatically terminate the background server thread.

llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
llm.stop_server() # This invocation is optional
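To inspect the generated text from the result above, a minimal sketch looks like the following; note that output_text is an assumed attribute name, so check GenerationResult in your FlexFlow version for the exact fields.

# result may be a single GenerationResult or a list of them, depending on the input;
# output_text is an assumed attribute name (GenerationResult also carries the output tokens)
for r in (result if isinstance(result, list) else [result]):
    print(r.output_text)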

Incremental decoding

import flexflow.serve as ff

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
        num_gpus=4,
        memory_per_gpu=14000,
        zero_copy_memory_per_node=30000,
        tensor_parallelism_degree=4,
        pipeline_parallelism_degree=1
    )

# Create the FlexFlow LLM
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=True, temperature=0.9, topp=0.8, topk=1
)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
            max_requests_per_batch = 16,
            max_seq_length = 256,
            max_tokens_per_batch = 128)

# Generation begins!
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
llm.stop_server() # This invocation is optional
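As the comment above mentions, ff.init() also accepts a configuration dictionary (or the path to a JSON file with the same content) instead of keyword arguments. A minimal sketch, assuming the dictionary keys mirror the keyword argument names shown above:

ff.init(
    {
        "num_gpus": 4,
        "memory_per_gpu": 14000,
        "zero_copy_memory_per_node": 30000,
        "tensor_parallelism_degree": 4,
        "pipeline_parallelism_degree": 1,
    }
)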

C++ interface

If you'd like to use the C++ interface (mostly used for development and benchmarking purposes), you should install from source, and follow the instructions below.


Downloading models

Before running FlexFlow Serve, you should manually download the LLM and SSM(s) of interest using the inference/utils/download_hf_model.py script (see example below). By default, the script will download all of a model's assets (weights, configs, tokenizer files, etc.) into the cache folder ~/.cache/flexflow. If you would like to use a different folder, you can specify it via the --cache-folder parameter.

python3 ./inference/utils/download_hf_model.py <HF model 1> <HF model 2> ...
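For example, to fetch the LLM and SSM used in the Quickstart above into a custom cache folder (the folder path here is only an illustration):

python3 ./inference/utils/download_hf_model.py meta-llama/Llama-2-7b-hf JackFram/llama-68m --cache-folder ~/flexflow_models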

Running the C++ examples

A C++ example is available at this folder. After building FlexFlow Serve, the executable will be available at /build_dir/inference/spec_infer/spec_infer. You can use the following command-line arguments to run FlexFlow Serve:

  • -ll:gpu: number of GPU processors to use on each node for serving an LLM (default: 0)
  • -ll:fsize: size of device memory on each GPU in MB
  • -ll:zsize: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
  • -llm-model: the LLM model ID from HuggingFace (e.g. "meta-llama/Llama-2-7b-hf")
  • -ssm-model: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple -ssm-models in the command line to launch multiple SSMs.
  • -cache-folder: the folder from which to load the downloaded model assets (default: ~/.cache/flexflow)
  • -data-parallelism-degree, -tensor-parallelism-degree and -pipeline-parallelism-degree: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used.
  • -prompt: (optional) path to the prompt file. FlexFlow Serve expects a JSON-format file of prompts (see the example after this list). In addition, users can also register requests directly through the API.
  • -output-file: (optional) filepath to use to save the output of the model, together with the generation latency
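In its simplest form, a prompt file is a JSON list of prompt strings; the exact schema may differ across versions, so treat the following as an illustrative sketch:

[
    "Give three tips for staying healthy.",
    "Here are some travel tips for Tokyo:\n"
]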

For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use a collectively boost-tuned LLaMA-68M model for speculative inference.

./inference/spec_infer/spec_infer -ll:gpu 4 -ll:cpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
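Note the parallelism settings in this command: with -ll:gpu 4 and -tensor-parallelism-degree 4, the omitted data and pipeline degrees default to 1, so the product 4 × 1 × 1 matches the four GPUs per node as required.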

Speculative Inference

A key technique that enables FlexFlow Serve to accelerate LLM serving is speculative inference, which combines various collectively boost-tuned small speculative models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM’s output in parallel using a novel tree-based parallel decoding mechanism. FlexFlow Serve uses an LLM as a token tree verifier instead of an incremental decoder, which largely reduces the end-to-end inference latency and computational requirement for serving generative LLMs while provably preserving model quality.

A Speculative Inference Demo

Supported LLMs and SSMs

FlexFlow Serve currently supports all HuggingFace models with the following architectures:

  • LlamaForCausalLM / LLaMAForCausalLM (e.g. LLaMA/LLaMA-2, Guanaco, Vicuna, Alpaca, ...)
  • OPTForCausalLM (models from the OPT family)
  • RWForCausalLM (models from the Falcon family)
  • GPTBigCodeForCausalLM (models from the Starcoder family)

Below is a list of models that we have explicitly tested and for which an SSM may be available:

Model           | Model id on HuggingFace        | Boost-tuned SSMs
LLaMA-7B        | meta-llama/Llama-2-7b-hf       | LLaMA-68M, LLaMA-160M
LLaMA-13B       | decapoda-research/llama-13b-hf | LLaMA-68M, LLaMA-160M
LLaMA-30B       | decapoda-research/llama-30b-hf | LLaMA-68M, LLaMA-160M
LLaMA-65B       | decapoda-research/llama-65b-hf | LLaMA-68M, LLaMA-160M
LLaMA-2-7B      | meta-llama/Llama-2-7b-hf       | LLaMA-68M, LLaMA-160M
LLaMA-2-13B     | meta-llama/Llama-2-13b-hf      | LLaMA-68M, LLaMA-160M
LLaMA-2-70B     | meta-llama/Llama-2-70b-hf      | LLaMA-68M, LLaMA-160M
OPT-6.7B        | facebook/opt-6.7b              | OPT-125M
OPT-13B         | facebook/opt-13b               | OPT-125M
OPT-30B         | facebook/opt-30b               | OPT-125M
OPT-66B         | facebook/opt-66b               | OPT-125M
Falcon-7B       | tiiuae/falcon-7b               |
Falcon-40B      | tiiuae/falcon-40b              |
StarCoder-7B    | bigcode/starcoderbase-7b       |
StarCoder-15.5B | bigcode/starcoder              |

CPU Offloading

FlexFlow Serve also offers offloading-based inference for running large models (e.g., llama-7B) on a single GPU. CPU offloading keeps tensors in CPU memory and only copies them to the GPU when they are needed for computation. Currently, we selectively offload the largest weight tensors (the weight tensors of Linear and Attention layers). Moreover, since the small models occupy considerably less space and do not pose a bottleneck for GPU memory, while offloading adds extra runtime data movement and computational cost, we only offload the weights of the large model. [TODO: update instructions] You can run the offloading example by enabling the -offload and -offload-reserve-space-size flags.
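As an illustrative sketch only (the instructions above are marked TODO, and the value and unit passed to -offload-reserve-space-size are assumptions), the offloading flags would be appended to the C++ serving command along these lines:

./inference/spec_infer/spec_infer -ll:gpu 1 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -offload -offload-reserve-space-size 8000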

Quantization

FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored on the CPU side. Once copied to the GPU, these tensors undergo decompression and conversion back to their original precision. Please find the compressed weight files in our S3 bucket, or use this script from the FlexGen project to do the compression manually.

Prompt Datasets

We provide five prompt datasets for evaluating FlexFlow Serve: Chatbot instruction prompts, ChatGPT Prompts, WebQA, Alpaca, and PIQA.

TODOs

FlexFlow Serve is under active development. We are currently focusing on the following tasks and strongly welcome all contributions, from bug fixes to new features and extensions.

  • AMD benchmarking. We are actively working on benchmarking FlexFlow Serve on AMD GPUs and comparing it with the performance on NVIDIA GPUs.
  • Chatbot prompt templates and Multi-round conversations
  • Support for FastAPI server
  • Integration with LangChain for document question answering

Acknowledgements

This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as:

@misc{miao2023specinfer,
      title={SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification}, 
      author={Xupeng Miao and Gabriele Oliaro and Zhihao Zhang and Xinhao Cheng and Zeyu Wang and Rae Ying Yee Wong and Alan Zhu and Lijie Yang and Xiaoxiang Shi and Chunan Shi and Zhuoming Chen and Daiyaan Arfeen and Reyna Abhyankar and Zhihao Jia},
      year={2023},
      eprint={2305.09781},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

FlexFlow uses Apache License 2.0.

flexflow's People

Contributors

awgu, chenzhuofu, daiyaanarfeen, derrickylj, dycz0fx, eddy16112, efrainq07, eric-zheng, facebook-github-bot, ferdiko, goliaro, jhancox, jiazhihao, kadinlz, kateunger, lockshaw, mandeeplearning, mengdiz97, msbaines, reyna-abhyankar, soumyac1999, stas00, suranap, tnoyola, vincent-163, williamberman, wmdi, xinhaoc, yingyee0111, zwang86


flexflow's Issues

Potential Features

Here is a list of feature requests discussed during the FlexFlow office hour on Oct 22.

  • Include additional information in strategy configurations, such as the DNN model architecture and machine topology information.
  • Enable a way to simulate hardware that is not accessible
  • Enhance GPU-DRAM or GPU Direct support
  • Plan to support multi-stream optimizations

Can't add batch_norm to the C++ ResNet model

When I add a batch_norm layer to the ResNet model (see code below), I get a runtime error. There are no errors during compilation, only when I try to run, and I'm not sure how to fix it based on the error message.

Starting at line 92 in function top_level_task of resnet.cc:

// Add layers
Tensor t = input;
t = ff.conv2d(input, 64, 7, 7, 2, 2, 3, 3);
t = ff.batch_norm(t, false); //I added this line which should be batch_norm and no relu?
t = ff.pool2d(t, 3, 3, 2, 2, 1, 1);

This results in the following output/error:

root@46083842-0c93-44ec-9c8b-fd074d7b4ee2-77f877845f-g2r6d:/home/acc/FlexFlow# ./examples/cpp/ResNet/resnet -ll:gpu 2 --epochs 1 -ll:fsize 24000 -ll:zsize 8000 --search-budget 2000
[0 - 7fbaf9502fc0] 4.586106 {3}{Mapper}: No strategy file provided. Use default data parallelism.
[0 - 7fbaf9502fc0] 4.586708 {3}{ResNet}: batchSize(64) workersPerNodes(2) numNodes(1)
workSpaceSize (1024 MB)
workSpaceSize (1024 MB)
resnet: /home/acc/FlexFlow/legion/runtime/realm/dynamic_templates.inl:199: static void Realm::DynamicTemplates::IntList<MIN, MAX>::DemuxHelper<TARGET, MAX>::demux(int, T1, T2) [with T1 = unsigned int; T2 = int*; TARGET = Realm::DynamicTemplates::ListProduct2<Realm::DynamicTemplates::IntList<1, 4>, Realm::DynamicTemplates::TypeListElem<int, Realm::DynamicTemplates::TypeListElem<unsigned int, Realm::DynamicTemplates::TypeListElem<long long int, Realm::DynamicTemplates::TypeListTerm> > > >::DemuxHelper1Legion::Internal::NT_TemplateHelper::DimHelper; int MIN = 1; int MAX = 4]: Assertion `index == MAX' failed.
Aborted (core dumped)

It runs through the MCMC strategy search without the batch_norm layer.

Thanks for the help.

Alexnet crashes after the interface changes commit

@jiazhihao I noticed that FlexFlow crashes after your latest commit
https://github.com/flexflow/FlexFlow/commit/2bc9997a269cd420515549770aea6c750b25d511

Here is the error I get from Alexnet:
[wwu12@saturn AlexNet_old]$ mpirun -np 1 ./alexnet -ll:gpu 1 -ll:fsize 2048 -ll:zsize 12192
[0 - 7ffff7fcd800] {4}{threads}: reservation ('DMA request queue') cannot be satisfied
[0 - 7ffff7fa0800] {3}{Mapper}: No strategy file provided. Use default data parallelism.
[0 - 7ffff7fa0800] {3}{AlexNet}: batchSize(64) workersPerNodes(1) numNodes(1)
in update_metrics_task
workSpaceSize (1024 MB)
seed = 1804289383 scale = 0.1185
[0 - 7ffff7fa0800] {3}{AlexNet}: Use random dataset...
[0 - 7ffff7fa0800] {3}{AlexNet}: Number of random samples = 2560

alexnet: /home/wwu12/legion-cr/runtime/legion/region_tree.cc:5819: void Legion::Internal::RegionTreeForest::initialize_path(Legion::Internal::IndexTreeNode*, Legion::Internal::IndexTreeNode*, Legion::Internal::RegionTreePath&): Assertion `child->depth > 0' failed.

FlexFlow Release 2020.12

List of TODOs for the next FlexFlow release:

  • Docker images for stable builds
  • Conda installation
  • Website Improvement
  • Documentations
  • NCCL integration
  • CMake for stable builds
  • Support for PyTorch FX models and ONNX
  • Spack package for FlexFlow

TODOs for future releases

  • (Optional) Scalability tests
  • TensorBoard support using TensorBoardX
  • Distributed Data Loader in Python, Generator support for Keras
  • Support for AMD CPUs/GPUs

Legion Error: error_code_68

I am trying to run alexnet example with sysml19_ae branch.

My run command is:
./alexnet -b 64 -ll:gpu 1 -ll:fsize 5000 -ll:zsize 5000 --strategy strategies/alexnet_data_parallel.strategy

The output is:
-------- Start FlexFlow Runtime --------
batchSize(64) inputHeight(224) inputWdith(224)
workersPerNode(1) loadersPerNode(8)
datasetPath(synthetic data)
strategy(strategies/alexnet_data_parallel.strategy)
Default Data Parallelism
Create Alexnet:
Create conv layer: output(n=64 c=64 h=55 w=55)
Create pool2d layer: output(n=64 c=64 h=27 w=27)
Create conv layer: output(n=64 c=192 h=27 w=27)
Create pool2d layer: output(n=64 c=192 h=13 w=13)
Create conv layer: output(n=64 c=384 h=13 w=13)
Create conv layer: output(n=64 c=256 h=13 w=13)
Create conv layer: output(n=64 c=256 h=13 w=13)
Create pool2d layer: output(n=64 c=256 h=6 w=6)
Create flat layer: input(N=64 C=256 H=6 W=6) -> output(N=64 C=9216)
Create linear layer: output(n=64 c=4096))
Create linear layer: output(n=64 c=4096))
Create linear layer: output(n=64 c=1000))
[0 - 2ab352172700] {5}{runtime}: [error 68] LEGION ERROR: Region requirement 0 of operation linear_init_task (UID 110) in parent task top_level (UID 1) is using uninitialized data for field(s) 0 of logical region (82,1,41) with read-only privileges (from file legion/runtime/legion/legion_ops.cc:518)[error 68]
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_68

Cannot compile any example

I compiled the code following INSTALL.md.

But when I run ./ffcompile.sh examples/cpp/AlexNet/ (or another example), I get:

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(1761): error: identifier "__builtin_ia32_sqrtsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(1770): error: identifier "__builtin_ia32_sqrtss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(2728): error: identifier "__builtin_ia32_scalefsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(2737): error: identifier "__builtin_ia32_scalefss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(11265): error: identifier "__builtin_ia32_scalefsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(11274): error: identifier "__builtin_ia32_scalefss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1163): error: identifier "__builtin_ia32_reducesd" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1171): error: identifier "__builtin_ia32_reducess" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1179): error: identifier "__builtin_ia32_rangesd128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1189): error: identifier "__builtin_ia32_rangess128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1198): error: identifier "__builtin_ia32_rangesd128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1207): error: identifier "__builtin_ia32_rangess128_round" is undefined

12 errors detected in the compilation of "/tmp/tmpxft_00003251_00000000-18_simulator.compute_75.cpp1.ii".
/root/work/codes/FlexFlow/legion/runtime/runtime.mk:1103: recipe for target '/root/work/codes/FlexFlow/src/runtime/simulator.cu.o' failed
make: *** [/root/work/codes/FlexFlow/src/runtime/simulator.cu.o] Error 1
make: *** Waiting for unfinished jobs....

CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED

Hi, when I run the DLRM case, after compiling, I execute "./dlrm -ll:gpu 2 -ll:cpu 4 -ll:fsize 10000 -ll:zsize 9000 -ll:util 2 --arch-sparse-feature-size 64 --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-mlp-bot 64-512-512-64 --arch-mlp-top 576-1024-1024-1024-1 --epochs 20 --batch-size 64 -dm:memoize --strategy ../../../src/runtime/dlrm_strategy_gpu_2_node_1.pb"

I got the CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED error (screenshot attached).

I don't know where the problem is. It seems to be a memory problem.

Device info: two 2080 Ti GPUs with 11 GB of GPU memory each

Problem running FFConfig after installing Python FlexFlow

After installing FlexFlow again and trying to test with the examples, there was an error, particularly when trying to call FFConfig() (screenshot attached).

There were no errors when building, so the only problem appeared at the end when testing FlexFlow. Notice that the library can be imported with no problems, but the error happens anyway.
The steps I followed for the installation were these:
export FF_HOME=$HOME/FlexFlow/
export CUDNN_HOME=$CUDNN_ROOT_DIR
export PROTOBUF_DIR=/public/apps/protobuf/v3.10.1/gcc.7.3.0/
export LG_RT_DIR=$HOME/FlexFlow/legion/runtime
export CUDA_ARCH="60,70"
export GPU_ARCH="60,70"
git clone --recursive git@github.com:flexflow/FlexFlow.git
cd FlexFlow/nccl
make -j src.build NVCC_GENCODE="-gencode=arch=compute_60,code=sm_60"
cd ../python
make -j

Getting errors when building FlexFlow

Hi,

I tried to build the FlexFlow runtime following the instructions. When running "./ffcompile.sh examples/cpp/InceptionV3", some errors occur:

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(1761): error: identifier "__builtin_ia32_sqrtsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(1770): error: identifier "__builtin_ia32_sqrtss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(2728): error: identifier "__builtin_ia32_scalefsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(2737): error: identifier "__builtin_ia32_scalefss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(11265): error: identifier "__builtin_ia32_scalefsd_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512fintrin.h(11274): error: identifier "__builtin_ia32_scalefss_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1163): error: identifier "__builtin_ia32_reducesd" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1171): error: identifier "__builtin_ia32_reducess" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1179): error: identifier "__builtin_ia32_rangesd128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1189): error: identifier "__builtin_ia32_rangess128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1198): error: identifier "__builtin_ia32_rangesd128_round" is undefined

/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512dqintrin.h(1207): error: identifier "__builtin_ia32_rangess128_round" is undefined

12 errors detected in the compilation of "/tmp/tmpxft_00013d3f_00000000-18_simulator.compute_75.cpp1.ii".
/home/wisr/aoranwu/FlexFlow/legion/runtime/runtime.mk:1103: recipe for target '/home/wisr/aoranwu/FlexFlow/src/runtime/simulator.cu.o' failed
make: *** [/home/wisr/aoranwu/FlexFlow/src/runtime/simulator.cu.o] Error 1
make: *** Waiting for unfinished jobs....

Is there any way to solve that? Thanks!

How to train a model with multiple nodes in a cluster?

Hi. I wish to use FlexFlow to train a model with multiple nodes in a cluster.

According to your paper, FlexFlow should support training with more than one node and each node has multiple GPUs in clusters. However, there is no documentation mentioning how to run FlexFlow training on multiple nodes.

Should there be more than one training process in this case? Or does FlexFlow only support training with multiple GPUs on one node currently? Thanks!

How to generate a strategy for a DNN?

It seems that I haven't figured out how to use FlexFlow.
How should a strategy be used in training and how to generate one strategy?
Can you provide a tutorial? (use the simulator in the scripts?)

In the strategy folder, the strategies do not work and an error occurs (as follows):
'''
strategies.size() = 12
workSpaceSize (1024 MB)
Floating point exception (core dumped)
'''
However, with the default strategy (data parallelism), it runs successfully.
'''
......
forwardAlgo(7) time(0.67)
bwdFilterAlgo(5) time(0.77)
bwdDataAlgo(5) time(0.66)
init pool (input): n(64) c(256) h(13) w(13)
init pool (output): n(64) c(256) h(6) w(6)
init linear (input): in_dim(9216) out_dim(4096) batch_size(64)
init linear (input): in_dim(4096) out_dim(4096) batch_size(64)
init linear (input): in_dim(4096) out_dim(1000) batch_size(64)
ELAPSED TIME = 7.2131s, THROUGHPUT = 1135.71 samples/s

'''
Thanks.

Remove the virtual function Parameter* get_parameter(int index) = 0 from Op base class

@jiazhihao I plan to remove the virtual function virtual Parameter* get_parameter(int index) = 0; from the Op base class and move it down to child classes like Conv2d, Linear, etc., since not all Ops have parameters.

When calling it from Python, I can cast the Op* to Conv2d* (or another subclass), because when calling this function, users should already know what kind of Op it is.

Please let me know your suggestions.

DLRM Criteo example segmentation fault

Hello! I'm trying to run the DLRM Criteo example with the command

./run_criteo_kaggle.sh 1 $CRITEO_PATH

and I get this error

[0 - 7fc0b0691700] {3}{Mapper}: No strategy file provided. Use default data parallelism.
./run_criteo_kaggle.sh: line 8:  3031 Segmentation fault      (core dumped) ./dlrm -ll:gpu ${numgpu} -ll:cpu 1 -ll:fsize 6000 -ll:zsize 10000 --arch-sparse-feature-size 16 --arch-embedding-size 1396-550-1761917-507795-290-21-11948-608-3-58176-5237-1497287-3127-26-12153-1068715-10-4836-2085-4-1312273-17-15-110946-91-72655 --arch-mlp-bot 13-512-256-64-16 --arch-mlp-top 224-512-256-1 --dataset ${dataset} --epochs 100 --batch-size ${batchsize}

When I look into the dumped core, I see the last error message is

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffcf3d6700 (LWP 14806)]
[New Thread 0x7fffcebd5700 (LWP 14808)]
[New Thread 0x7fffce3d4700 (LWP 14809)]
[New Thread 0x7fffcdbd3700 (LWP 14810)]
[New Thread 0x7fffcd3d2700 (LWP 14811)]
[New Thread 0x7fffccbd1700 (LWP 14815)]
[New Thread 0x7fffccb90700 (LWP 14816)]
dlrm: ../../src/mapper/mapper.cc:292: void update_mappers(Legion::Machine, Legion::Runtime*, const std::set<Realm::Processor>&): Assertion `zc_query.count() == 1' failed.

Thread 8 "dlrm" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffccb90700 (LWP 14816)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

Anyone can help? Thanks in advance!

Python type error in flexflow_cbinding.py

I encountered the following type error when training AlexNet with the most recent Python implementation. The command line to reproduce the failure is ./flexflow_python example/alexnet.py -ll:py 1 -ll:gpu 1 -ll:zsize 30000 -ll:fsize 10000

[0 - 7f19c06f5fc0] {6}{python}: python exception occurred within task:
Traceback (most recent call last):
  File "/home/users/zhihao/legion/apps/FlexFlow/python/flexflow/core/flexflow_top.py", line 195, in flexflow_top_level_task
    run_path(args[start], run_name='__main__')
  File "/home/users/zhihao/legion/apps/FlexFlow/python/flexflow/core/flexflow_top.py", line 145, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "example/alexnet.py", line 174, in <module>
    top_level_task()
  File "example/alexnet.py", line 157, in top_level_task
    cbias = cbias_tensor.get_flat_array(ffconfig, DataType.DT_FLOAT)
  File "/home/users/zhihao/legion/apps/FlexFlow/python/flexflow/core/flexflow_cbinding.py", line 381, in get_flat_array
    array = np.asarray(initializer)
  File "/share/software/user/open/py-numpy/1.14.3_py27/lib/python2.7/site-packages/numpy/core/numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: __array_interface__ typestr must be a string

Run dlrm case error

When I run the DLRM case, I encounter another problem (screenshot attached). I'm not familiar with this code yet, so I still don't know how to solve it.
In addition, while reading the code, I could not tell at the code level when the simulator and MCMC sampling are used; for example, simulator.cc under the runtime folder seems to be doing simulation. In general, can you briefly explain at the code level how the optimal parallel strategy is found when training a network? Thanks!

fp16 training

Does flexflow support fp16 training? If so, how do I enable it and take advantage of Tensor Cores?

Thanks!

Implementing with text transformer model

Hi,

I have been going through the examples to try and figure out how to implement BERT with FlexFlow. However, as far as I can tell, none of the examples are for text input. I'm not sure how to approach rewriting the code to make it compatible with FlexFlow. Do you think you could add an explanation of the examples? While the AlexNet and InceptionV3 examples are easy to understand, the other two are harder to follow in terms of the changes needed to make them work with FlexFlow.

Error reproducing MLSys19 optimizing strategies

Branch Used - sysml19_ae

./alexnet -b 64 -ll:gpu 1 -ll:fsize 5000 -ll:zsize 5000 --strategy strategies/alexnet_optimized.strategy

--------  Start FlexFlow Runtime  --------
batchSize(64) inputHeight(224) inputWdith(224)
workersPerNode(1) loadersPerNode(4)
datasetPath(synthetic data)
strategy(strategies/alexnet_optimized.strategy)
FlexFlow Optimized Strategy
Create Alexnet:
    Create conv layer: output(n=64 c=64 h=55 w=55)
    Create pool2d layer: output(n=64 c=64 h=27 w=27)
    Create conv layer: output(n=64 c=192 h=27 w=27)
    Create pool2d layer: output(n=64 c=192 h=13 w=13)
    Create conv layer: output(n=64 c=384 h=13 w=13)
    Create conv layer: output(n=64 c=256 h=13 w=13)
    Create conv layer: output(n=64 c=256 h=13 w=13)
    Create pool2d layer: output(n=64 c=256 h=6 w=6)
    Create flat layer: input(N=64 C=256 H=6 W=6) -> output(N=64 C=9216)
    Create linear layer: output(n=64 c=4096))
    Create linear layer: output(n=64 c=4096))
    Create linear layer: output(n=64 c=1000))
[0 - 7f9f34c1ffc0]    4.390285 {5}{runtime}: [error 68] LEGION ERROR: Region requirement 0 of operation linear_init_task (UID 110) in parent task top_level (UID 1) is using uninitialized data for field(s) 0 of logical region (77,1,41) with read-only privileges (from file legion/runtime/legion/legion_ops.cc:623)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_68

Aborted (core dumped)

Running Error

Hi, I tried several times to run this project, but I still fail to run it. There are some errors as below:

sheldon@instance-2:~/FlexFlow$ ./ffcompile.sh alexnet
---> Linking objects into one binary: alexnet
g++ -o alexnet cnn.cc.o cnn_mapper.cc.o alexnet.cc.o conv_2d.cu.o ops.cu.o pool_2d.cu.o batch_norm.cu.o linear.cu.o softmax.cu.o concat.cu.o cnn_helper.cu.o -lcudnn -lcublas -lcurand -L. -llegion -lrealm -lrt -lpthread -ldl -rdynamic -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -lcuda -Xlinker -rpath=/usr/local/cuda/lib64 -lz
conv_2d.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_000029b2_00000000-5_conv_2d.cudafe1.cpp:(.text.startup+0x19): undefined reference to `__cudaRegisterFatBinaryEnd'
ops.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_000029b8_00000000-5_ops.cudafe1.cpp:(.text.startup+0x9b): undefined reference to `__cudaRegisterFatBinaryEnd'
pool_2d.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_000029b9_00000000-5_pool_2d.cudafe1.cpp:(.text.startup+0x19): undefined reference to `__cudaRegisterFatBinaryEnd'
batch_norm.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_000029b6_00000000-5_batch_norm.cudafe1.cpp:(.text.startup+0x19): undefined reference to `__cudaRegisterFatBinaryEnd'
linear.cu.o: In function `__sti____cudaRegisterAll()':
tmpxft_000029b7_00000000-5_linear.cudafe1.cpp:(.text.startup+0x19): undefined reference to `__cudaRegisterFatBinaryEnd'
softmax.cu.o:tmpxft_000029e9_00000000-5_softmax.cudafe1.cpp:(.text.startup+0x47): more undefined references to `__cudaRegisterFatBinaryEnd' follow
./librealm.a(cuda_module.cc.o): In function `Realm::Cuda::GPU::create_dma_channels(Realm::RuntimeImpl*)':
cuda_module.cc:(.text+0x20e7): undefined reference to `Realm::register_gpu_in_dma_systems(Realm::Cuda::GPU*)'
collect2: error: ld returned 1 exit status
/home/sheldon/install/legion/runtime/runtime.mk:601: recipe for target 'alexnet' failed
make: *** [alexnet] Error 1
sheldon@instance-2:~/FlexFlow$

Could you please help me with this? Thank you very much.

Building DNN model failed

Hi,

When I was trying to build DNN models, I ran into this issue:

root@ubuntu05:/app/FlexFlow# ./ffcompile.sh examples/InceptionV3
Use the FlexFlow protoc
auto-detected CUDA at: /usr/local/cuda
python /app/FlexFlow/legion/runtime/../tools/generate_defines.py  -DLEGION_USE_CUDA -DLEGION_GPU_REDUCTIONS -DLEGION_USE_ZLIB -DDEBUG_LEGION -DLEGION_MAX_DIM=4		 -c -i /app/FlexFlow/legion/runtime/../cmake/legion_defines.h.in -o /app/FlexFlow/examples/InceptionV3/legion_defines.h
python /app/FlexFlow/legion/runtime/../tools/generate_defines.py  -DREALM_USE_LIBDL -DREALM_USE_CUDA -DREALM_USE_CUDART_HIJACK -DDEBUG_REALM -DREALM_MAX_DIM=4		 -DCOMPILE_TIME_MIN_LEVEL=LEVEL_DEBUG	 -c -i /app/FlexFlow/legion/runtime/../cmake/realm_defines.h.in -o /app/FlexFlow/examples/InceptionV3/realm_defines.h
g++ -o ../../src/runtime/strategy.pb.cc.o -c ../../src/runtime/strategy.pb.cc  -std=c++11   -march=native -DUSE_CUDA -O0 -ggdb  -Wall -Wno-strict-overflow -I../../include/ -I/app/FlexFlow/protobuf/src -I/usr/local/cuda/include -I/app/FlexFlow/examples/InceptionV3 -I/app/FlexFlow/legion/runtime -I/app/FlexFlow/legion/runtime/mappers -I/usr/local/cuda/include -I/app/FlexFlow/legion/runtime/realm/transfer 
---> Linking objects into one binary: inception
g++ -o inception ../../src/runtime/model.cc.o ../../src/mapper/mapper.cc.o ../../src/runtime/initializer.cc.o ../../src/runtime/optimizer.cc.o ../../src/ops/embedding.cc.o ../../src/runtime/strategy.pb.cc.o ../../src/runtime/strategy.cc.o inception.cc.o ../../src/ops/conv_2d.cu.o ../../src/runtime/model.cu.o ../../src/ops/pool_2d.cu.o ../../src/ops/batch_norm.cu.o ../../src/ops/linear.cu.o ../../src/ops/softmax.cu.o ../../src/ops/concat.cu.o ../../src/ops/flat.cu.o ../../src/ops/embedding.cu.o ../../src/ops/mse_loss.cu.o ../../src/runtime/initializer_kernel.cu.o ../../src/runtime/optimizer_kernel.cu.o ../../src/runtime/accessor_kernel.cu.o ../../src/runtime/cuda_helper.cu.o  -lcudnn -lcublas -lcurand -lprotobuf -L/usr/local/lib -L/app/FlexFlow/protobuf/src/.libs -L/usr/local/cuda/lib64  -L. -llegion -lrealm -lrt -lpthread -ldl -rdynamic -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -lcuda -Xlinker -rpath=/usr/local/cuda/lib64 -lz 
../../src/runtime/model.cc.o: In function `void Legion::LegionTaskWrapper::legion_task_wrapper<&ElementUnary::forward_task>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)':
/app/FlexFlow/legion/runtime/legion/legion.inl:12788: undefined reference to `ElementUnary::forward_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)'
../../src/runtime/model.cc.o: In function `void Legion::LegionTaskWrapper::legion_task_wrapper<&ElementUnary::backward_task>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)':
/app/FlexFlow/legion/runtime/legion/legion.inl:12788: undefined reference to `ElementUnary::backward_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)'
../../src/runtime/model.cc.o: In function `void Legion::LegionTaskWrapper::legion_task_wrapper<&ElementBinary::forward_task>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)':
/app/FlexFlow/legion/runtime/legion/legion.inl:12788: undefined reference to `ElementBinary::forward_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)'
../../src/runtime/model.cc.o: In function `void Legion::LegionTaskWrapper::legion_task_wrapper<&ElementBinary::backward_task>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)':
/app/FlexFlow/legion/runtime/legion/legion.inl:12788: undefined reference to `ElementBinary::backward_task(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)'
collect2: error: ld returned 1 exit status
/app/FlexFlow/legion/runtime/runtime.mk:1003: recipe for target 'inception' failed
make: *** [inception] Error 1

I've checked that the LG_RT_DIR is set correctly.

Could you provide any suggestions please? Thanks!

FlexFlow Performance of Alexnet drastically drops in Multi-GPU

Environment CUDA V100-PCIE:

FlexFlow-AlexNet (1GPU):

  • data-parallel: 1172.39 samples/s
  • optimized: 2005.06 samples/s

FlexFlow-AlexNet (2GPU):

  • data-parallel: 840.98 samples/s
  • optimized: 2000.77 samples/s

FlexFlow-AlexNet (4GPU):

  • data-parallel: 331.25 samples/s
  • optimized: 1444.97 samples/s

Tensorflow-AlexNet (1GPU):

  • data-parallel: 3079.8 samples/s

Tensorflow-AlexNet (2GPU):

  • data-parallel: 3414.8 samples/s

Tensorflow-AlexNet (4GPU):

  • data-parallel: 3210.6 samples/s

How to fix it?

resnet example performance issues

When running the ResNet C++ example on a single V100 GPU with a batch size of 16 (I assume it's fp32 as well, since FlexFlow doesn't seem to have an fp16 option as far as I know), I'm seeing a throughput of 30-32 samples per second. That means an iteration time of about 500 ms.

In PyTorch I get an iteration time of 58 ms for fp32, batch size 16, and the same input size. That's a pretty large ~10x discrepancy. Is an iteration time of 500 ms expected? Are there any options I can pass to FlexFlow to increase performance?

The individual layer times from measure_compute_time seem closer to PyTorch's, so I don't understand why the actual run time is so much slower, especially when you take parallelization out of the equation.

Assertion failed when running Alexnet example

Hi, I tried to run the AlexNet example but failed with an assertion error and no meaningful error messages. I tried all three released versions (I can't get master compiled) and they all report the same error.

$ ./alexnet -b 4 -ll:gpu 2 --ll:fsize 16130 -ll:zsize 5000

--------  Start FlexFlow Runtime  --------
batchSize(4) inputHeight(224) inputWdith(224)
workersPerNode(2) loadersPerNode(4)
datasetPath(synthetic data)
strategy(Default Data Parallelism)
Create Alexnet:
    Create conv layer: output(n=4 c=64 h=55 w=55)
    Create pool2d layer: output(n=4 c=64 h=27 w=27)
    Create conv layer: output(n=4 c=192 h=27 w=27)
    Create pool2d layer: output(n=4 c=192 h=13 w=13)
    Create conv layer: output(n=4 c=384 h=13 w=13)
    Create conv layer: output(n=4 c=256 h=13 w=13)
    Create conv layer: output(n=4 c=256 h=13 w=13)
    Create pool2d layer: output(n=4 c=256 h=6 w=6)
    Create flat layer: input(N=4 C=256 H=6 W=6) -> output(N=4 C=9216)
    Create linear layer: output(n=4 c=4096))
    Create linear layer: output(n=4 c=4096))
    Create linear layer: output(n=4 c=1000))
alexnet: linear.cu:519: static void Linear::backward_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*): Assertion `acc_kernel_grad.accessor.is_dense_arbitrary(rect_kernel_grad)' failed.
Aborted (core dumped)

Ubuntu 16.04 + CUDA 10.1 + CuDNN 7 with 2 Tesla V100. Legion is compiled following the guide with default options except for USE_CUDA=1.

Can't make the Python API

Following the directions on the installation page results in the following error:

Makefile:76: cannot find libpython3.6*.so - falling back to using LD_LIBRARY_PATH
/home/acc/FlexFlow/legion/runtime//runtime.mk:345: cannot find libpython3.6*.so - falling back to using LD_LIBRARY_PATH
/home/acc/FlexFlow/protobuf/bin/protoc --proto_path=/home/acc/FlexFlow/src/runtime strategy.proto --cpp_out=.
make: /home/acc/FlexFlow/protobuf/bin/protoc: Command not found
Makefile:148: recipe for target 'strategy.pb.cc' failed
make: *** [strategy.pb.cc] Error 127

Any thoughts?

P.S. libpython3.6-stdlib is installed in the normal location, so I don't know why it can't find it.

Can't ffcompile examples/cpp/ResNet

I get the following error when I try to compile ResNet:

./ffcompile.sh examples/cpp/ResNet/
Use the FlexFlow protoc
make: *** No rule to make target '../../src/runtime/model.cc', needed by '../../src/runtime/model.cc.o'. Stop.

Adding measure compute time to Batch_Norm or Relu Ops

Is there any plan to add measure compute time implementations for the Batch_Norm or ElementUnary ops? It seems like, while they are really fast ops, there is still some time associated with them that could be factored in. Was there some reason they aren't in there? Or was it just to save time to work on the other ops? Thanks!

CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)

@jiazhihao I am training AlexNet with the cifar10 data_batch_1.bin on my server. I only use 2 GPUs, and it seems that parameters on different GPUs may fail to synchronize. The label is -1 and the logits in softmax_cross_entropy_calc_loss are nan (the same when using other batches). If I use synthetic data, it runs properly.

Thanks.

ERROR INFO:
...
acc_train_loss: inf train_accuracy: 9.38%(24/256)
acc_train_loss: inf train_accuracy: 8.01%(41/512)
../../src/ops/softmax.cu[0 - 7f3f69ffe700] {6}{gpu}: CUDA error reported on GPU 0: device-side assert triggered (CUDA_ERROR_ASSERT)
:226: void softmax_cross_entropy_calc_loss(const float *, const int *, PerfMetrics *, int, int): block: [0,0,0], thread: [96,0,0] Assertion my_label >= 0 failed.
../../src/ops/softmax.cu:226: void softmax_cross_entropy_calc_loss(const float *, const int *, PerfMetrics *, int, int): block: [0,0,0], thread: [97,0,0] Assertion my_label >= 0 failed.
../../src/ops/softmax.cu:226alexnet: /seu_share/home/yuchen/FlexFlow-test/FlexFlow/legion/runtime/realm/cuda/cuda_module.cc:234: bool Realm::Cuda::GPUStream::reap_events(): Assertion `0' failed.

Device: 2× RTX 2080 Ti, Ubuntu 18.04, CUDA 10.5, cuDNN 7.6.5

Error while compiling the given example Inception

Hi, I'm trying to run the examples, but having trouble compiling Inception. The error is as follows:

$ ./ffcompile.sh examples/InceptionV3/
Use the FlexFlow protoc
auto-detected CUDA at: /data/opt/cuda-10.1
/usr/bin/python ../../legion/runtime/../tools/generate_defines.py  -DLEGION_USE_LIBDL -DLEGION_USE_CUDA -DLEGION_GPU_REDUCTIONS -DLEGION_USE_ZLIB -DDEBUG_LEGION -DLEGION_MAX_DIM=4           -c -i ../../legion/runtime/../cmake/legion_defines.h.in -o /home/huzhongzhe/FF_v0729/FlexFlow/examples/InceptionV3/legion_defines.h
/usr/bin/python ../../legion/runtime/../tools/generate_defines.py  -DREALM_USE_LIBDL -DREALM_USE_CUDA -DREALM_USE_CUDART_HIJACK -DDEBUG_REALM -DREALM_MAX_DIM=4               -DCOMPILE_TIME_MIN_LEVEL=LEVEL_DEBUG    -c -i ../../legion/runtime/../cmake/realm_defines.h.in -o /home/huzhongzhe/FF_v0729/FlexFlow/examples/InceptionV3/realm_defines.h
make: *** No rule to make target 'inception.cu', needed by 'inception.cu.o'.  Stop.
make: *** Waiting for unfinished jobs....

I have already installed FlexFlow according to install.md, and successfully compiled and run Alexnet, why doesn't Inception work?

Running on multiple nodes using GASNet UDP conduit

Hi,

So I have two connected nodes, both of which have FlexFlow and GASNet installed. I set up GASNet and AlexNet on both nodes in the following way.

GASNet:

  1. Navigate to the directory where gasnet is installed.
  2. Run ./configure --enable-segment-fast --disable-pshm --with-max-segsize=16GB --disable-seq --disable-parsync --enable-udp --enable-mpi-compat --disable-mpi --disable-aligned-segments --disable-ibv
  3. Run CXXFLAGS=-fPIC make -e CONDUIT=udp
  4. Run make install

AlexNet:

USE_GASNET=1 LDFLAGS=-L/usr/lib/x86_64-linux-gnu/hdf5/serial <path_to_FlexFlow>/ffcompile.sh <path_to_FlexFlow>/examples/cpp/AlexNet

UDP conduit:

  1. Setup a passwordless ssh connection between the two nodes using ssh-copy-id.
  2. Suppose we want to run FlexFlow on remote node a from local node b.
  3. Set the following environment variables:
    GASNET_SPAWNFN='S'
    SSH_CMD='ssh'
    SSH_SERVERS='username@<node_a_ip_address>'
    In my case, I had to set SSH_OPTIONS='-p 2222' since we are port forwarding a
    container's ssh port.

To run AlexNet:

<path_to_executable>/alexnet 1 <FlexFlow_command_line_arguments>.

The parameter 1 indicates we are running on one remote node.

When I try this, I get the following error:

*** GASNET WARNING: int AMUDP_SPMDStartup_AMUDP_NDEBUG(int*, char***, int, int, amudp_spawnfn_t, uint64_t*, amudp_eb**, amudp_ep**) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
  from function AMUDP_SPMDStartup
  at ./amudp_spmd.cpp:971
  reason: worker failed DNSLookup on master host name

GASNet initialization encountered an error: "worker AMUDP_SPMDStartup() failed"
  in gasnetc_init at /usr/local/gasnet/udp-conduit/gasnet_core.c:242
*** WARNING (4e25f04e502e:10879): GASNet gasnetc_init returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /usr/local/gasnet/udp-conduit/gasnet_core.c:326
*** WARNING (4e25f04e502e:10879): GASNet gex_Client_Init_GASNET_202031PARnopshmFASTnodebugnotracenostatsnodebugmallocnosrclines returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /usr/local/gasnet/udp-conduit/gasnet_core.c:504
GASNET: gasnet_init(argc, const_cast<char ***>(argv)) = 3 (GASNET_ERR_RESOURCE, Problem with requested resource)
Terminated

Thanks.

Plan for the FlexFlow release 2021.3

CI Infrastructures:

  • Build tests and CircleCI configurations: Efrain
  • Finish unit tests for individual operators (potentially using the Python interface): Efrain, Wei
  • Performance unit tests against existing tensor libraries (PyTorch): Efrain
  • Accuracy tests for vision models and language models: Vinay, Wei
  • Multi-GPU performance tests (1-4 GPUs): Nirmal

Runtime improvements:

  • Finishing the NCCL integration (nccl -> master): Zhihao
  • Scalability evaluation on large models: Zhihao
  • Enhancing front-end support for vision models (torch.vision zoo) and language models (hugging face model zoo): Mandeep, Wei

Documentations, tutorials, websites:

  • API documentations: Wei
  • Detailed tutorials to demonstrate auto-tuning, debugging, and profiling: Zhihao

Data preprocessing:

  • Distributed data loader in Python
  • Distributed data generator support for Keras

Others:

  • Spack package for FlexFlow: Pat

TODOs for future releases

  • TensorBoard support using TensorBoardX
  • Support for AMD CPUs/GPUs
  • Integration of FlexFlow and Legate

Python import error

I am getting the following error when testing the Python interface:

Traceback (most recent call last):
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/core/flexflow_top.py", line 195, in flexflow_top_level_task
    run_path(args[start], run_name='__main__')
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/core/flexflow_top.py", line 145, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "example/mlp.py", line 3, in <module>
    from flexflow.keras.datasets import mnist
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/keras/__init__.py", line 2, in <module>
    from . import datasets
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/keras/datasets/__init__.py", line 5, in <module>
    from . import reuters
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/keras/datasets/reuters.py", line 9, in <module>
    from ..preprocessing.sequence import _remove_long_seq
  File "/autofs/nccs-svm1_home1/zhihao/FlexFlow/python/flexflow/keras/preprocessing/__init__.py", line 7, in <module>
    import keras_preprocessing

Is keras_preprocessing our package or an external package?

Supporting native Python

This issue tracks the progress of supporting a FlexFlow runtime library for the native Python interpreter. FlexFlow's Python support is largely based on the Legion Python bindings: https://github.com/StanfordLegion/legion/tree/stable/bindings/python.

The Legion Python bindings include a customized Python interpreter for executing leaf tasks implemented in Python. In contrast, FlexFlow mainly uses Python for model construction and for supporting existing frontends.

Runtime Hangs when Running ResNet at larger batch sizes

When running the ResNet model with a batch size of 128, the runtime hangs when it gets to the training loop. It seems to have something to do with Legion, because if I remove 'runtime->begin_trace(ctx, 111 /*trace_id*/);' and 'runtime->end_trace(ctx, 111 /*trace_id*/);' and launch with '-lg:inorder', it gets through the runtime OK. But I imagine there is a performance hit from using '-lg:inorder'.

I also tried with the debug flag enabled, but it still just hangs and no error is printed. And I tried using Legion Spy; the last thing in the log was '[0 - 7f4214849fc0] 19.405782 {3}{legion_spy}: Future Creation 4893 0 4 0 0 0 0'

I'm not sure what's going on, but it almost seems like it's not behaving in a multi-thread/process-safe manner. Otherwise, why would forcing in-order execution make a difference?

Any help or suggestions would be appreciated. Thanks!

CUDA variable is not defined

Hi, when I run ./ffcompile.sh alexnet, I get legion/runtime/runtime.mk:287: *** CUDA variable is not defined, aborting build. Stop. How can I solve this error?

create_tensor -> create_input

Currently both inputs and intermediate tensors are created using the following API:

Tensor FFModel::create_tensor(const int dims[],
                              const std::string& name,
                              DataType data_type,
                              bool create_grad);

This prevents the FlexFlow runtime from tracking which tensors are the original inputs. In the future, we should use the following API for creating inputs and only use the above API for intermediate tensors:

Tensor FFModel::create_input(const int dims[],
                             const std::string& name,
                             DataType data_type,
                             bool create_grad);

Working set of your application is too big for the allotted capacity

Hi, I tried running the AlexNet example and this is the error I am getting, even for the small CIFAR-10 subset (data size of 30,730,000 bytes) and smaller batch sizes like 5-10:

[0 - 7f0cbb7f5700] {5}{default_mapper}: Default mapper failed allocation of size 150994944 bytes for region requirement 0 of task Zero Init Task (UID 1906) in memory 1e00000000000001 for processor 1d00000000000002. This means the working set of your application is too big for the allotted capacity of the given memory under the default mapper's mapping scheme. You have three choices: ask Realm to allocate more memory, write a custom mapper to better manage working sets, or find a bigger machine

Also, can someone please explain what Realm is?

Owner Op incorrect

Tensor t = ff.conv2d(input, out_channels, 1, 1, 1, 1, 0, 0, AC_MODE_NONE); //owner is null
t = ff.batch_norm(t, false); //owner is Conv2d_11_100
t = ff.relu(t); //owner is Conv2d_11_100 but should be batch norm

Any idea why the owner ops are not propagating to the model layers correctly? It seems to be something with batch_norm because all the other layers are working?

How to run batch size smaller than the number of nodes for very large DNN

I am getting this error when I try to run batch size 2 on 4 GPUs:

CnnModel::CnnModel(int, int, int, int, int, int, int, int, bool, float, int, int, Legion::Context, Legion::Runtime*): Assertion `num_images * 3 % (config.num_loaders * config.num_nodes) == 0' failed.

If I run batch size 4, it gives an out-of-memory error.

FlexFlow Release 2020.08

List of TODOs for the next FlexFlow release:

  • Runtime support for Loss and Metrics functions
  • FlexFlow Keras support for Loss and Metrics functions
  • Future plans for multiple work branches (stable for future releases, master for internal development)
  • unit tests and regression tests for FlexFlow
  • (Optional) Finish the integration of the Candle benchmarks
  • (Optional) Docker images of stable builds

TODOs for future releases

  • (Optional) Scalability tests
  • Website
  • Documentations
  • CMake and Spack for stable builds
  • Support for PyTorch lazy model
  • TensorBoard support using TensorBoardX
  • Distributed Data Loader in Python, Generator support for Keras
  • Support for AMD CPUs/GPUs

dlrm case error

When I run the DLRM case, I encounter another problem (screenshot attached).
The command line is:
./dlrm -ll:gpu 2 -ll:cpu 4 -ll:fsize 8000 -ll:zsize 20000 -ll:util 2 --arch-sparse-feature-size 64 --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-mlp-bot 64-512-512-64 --arch-mlp-top 576-1024-1024-1024-1 --epochs 20 --batch-size 128 -dm:memoize --strategy ../../../src/runtime/dlrm_strategy_gpu_2_node_1.pb

tensor owner op incorrect

Tensor t = ff.conv2d(input, out_channels, 1, 1, 1, 1, 0, 0, AC_MODE_NONE); //owner is null
t = ff.batch_norm(t, false); //owner is Conv2d_11_100
t = ff.relu(t); //owner is Conv2d_11_100 but should be batch norm

Any idea why the owner ops are not propagating to the model layers correctly? It seems to be something with batch_norm because all the other layers are working?

Potential bug in cnn.cc

bool parse_strategy_file(const std::string &filename, FFConfig &config)
{
  FILE *file;
  if ((file = fopen(filename.c_str(), "r")) == NULL) {
    log_ff.print("Cannot open strategy file (%s)", filename.c_str());
    return false;
  }
  fclose(file);
  return true;
}

if (parse_strategy_file(config.strategyFile, config))
{
  log_ff.print("Error: cannot parse strategy file");
  return;
}

If parse_strategy_file returns true (i.e., the file was opened successfully), then log_ff.print("Error: cannot parse strategy file") is executed?

Build DLRM example

Hi, I tried both ./ffcompile.sh examples/cpp/DLRM and cmake .. -DBUILD_DLRM=ON, and both lead to a failed build. Can you share the DLRM build flags? Thanks.
