
openmp-101's Introduction

OpenMP-101

Optimization Notice

*(Intel optimization notice images: opt-img-01, opt-img-zh)*

0. Fast Guide: OMP in Caffe

0.1 What is OpenMP?

  • an easy, portable, and scalable way to parallelize applications for many cores, using a multi-threaded, shared-memory model (like pthreads)
  • a standard API
  • omp pragmas are supported by the major C/C++ and Fortran compilers (gcc, icc, etc.)

There are a lot of good tutorials online; see the tutorial sections below.

0.2 OpenMP programming model

*(figure: the OpenMP fork-join programming model)*

0.3 Example

naive implementation

#define N (100)

int main(int argc, char *argv[])
{
    int idx;
    float a[N], b[N], c[N];

    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
    }
    return 0;
}

omp implementation

#include <omp.h>
#define N (100)

int main(int argc, char *argv[])
{
    int idx;
    float a[N], b[N], c[N];

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
    }
    return 0;
}
A fuller version that also prints which thread computed each element. Two pitfalls are fixed here: `omp_get_num_threads()` returns 1 when called outside a parallel region, and `tid` must be private to each thread to avoid a data race.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N (100)

int main(int argc, char *argv[])
{
    int tid, idx;
    float a[N], b[N], c[N];

    /* omp_get_num_threads() returns 1 outside a parallel region;
       use omp_get_max_threads() here instead. */
    printf("Max threads = %d\n", omp_get_max_threads());

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    /* tid is made private so threads don't race on it */
    #pragma omp parallel for private(tid)
    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
        tid = omp_get_thread_num();
        printf("Thread %d: c[%d]=%f\n", tid, idx, c[idx]);
    }
    return 0;
}

0.4 Compiling, linking, etc.

You need to add the flag `-fopenmp` (for icc, `-qopenmp`; `-openmp` is the older, deprecated spelling):

# compile using gcc
gcc -fopenmp omp_vecadd.c -o vecadd

# compile using icc
icc -qopenmp omp_vecadd.c -o vecadd

Control the number of threads by setting an environment variable on the command line:

export OMP_NUM_THREADS=8

0.5 Exercise

  1. Implement:
     • vector dot-product: c = <x, y>
     • matrix-matrix multiply
     • 2D matrix convolution
  2. Add OpenMP support to the ReLU and max-pooling layers

note

Synchronization and critical sections:

  • have each thread accumulate into a private variable, then combine the results in a critical section; this also avoids false sharing of a shared accumulator array
  • BUT don't put critical sections inside tight loops, since doing so serializes the work

0.6 Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs

*(figure: improving performance for deep learning frameworks on CPUs)*

Tutorial1: Introduction to OpenMP

Tim Mattson's (Intel) Introduction to OpenMP video tutorial is now available.

Outline:

Unit 1: Getting started with OpenMP

  • Module 1: Introduction to parallel programming
  • Module 2: The boring bits: Using an OpenMP compiler (hello world)
  • Discussion 1: Hello world and how threads work

Unit 2: The core features of OpenMP

  • Module 3: Creating Threads (the Pi program)
  • Discussion 2: The simple Pi program and why it sucks
  • Module 4: Synchronization (Pi program revisited)
  • Discussion 3: Synchronization overhead and eliminating false sharing
  • Module 5: Parallel Loops (making the Pi program simple)
  • Discussion 4: Pi program wrap-up

Unit 3: Working with OpenMP

  • Module 6: Synchronize single masters and stuff
  • Module 7: Data environment
  • Discussion 5: Debugging OpenMP programs
  • Module 8: Skills practice … linked lists and OpenMP
  • Discussion 6: Different ways to traverse linked lists

Unit 4: a few advanced OpenMP topics

  • Module 9: Tasks (linked lists the easy way)
  • Discussion 7: Understanding Tasks
  • Module 10: The scary stuff … Memory model, atomics, and flush (pairwise synch).
  • Discussion 8: The pitfalls of pairwise synchronization
  • Module 11: Threadprivate Data and how to support libraries (Pi again)
  • Discussion 9: Random number generators

Unit 5: Recapitulation

Thanks go to the University Program Office at Intel for making this tutorial available.

Tutorial2: OpenMP

Author: Blaise Barney, Lawrence Livermore National Laboratory

Tutorial3: OpenMP tutorial | Goulas Programming Soup

https://goulassoup.wordpress.com/2011/10/28/openmp-tutorial/

openmp-101's People

Contributors

ysh329, zchrissirhcz


openmp-101's Issues

Survey Question on the For loop in source file pi/my_pi.c line 63

Hello Sir/Madam,
We are from a research group at Iowa State University, USA. We want to conduct a survey of GitHub developers on the methods they used to parallelize their code. For the survey, we would like to ask three questions:

  1. Have you ever tried to add a pragma for that 'for' loop?

  2. How much confidence do you have in the correctness of this implementation? You can choose from 1-5, with 1 as the lowest confidence score and 5 as the highest.

  3. (Optional) Did you actually run the code (compile it and pass input/get output)? Yes/No

  • If yes, can you provide information on the input and expected output of this program (the input that caused the program to run through this for loop)?

The for loop is at line 63 of the file https://github.com/ysh329/OpenMP-101/blob/master/pi/my_pi.c
Here is a part of the code:

for (i = start_step_num; i < finish_step_num; ++i)
{
    x = (i + 0.5) * step;
    sum = sum + (4.0 / (1.0 + (x * x)));
}

Sincerely thanks

omp parallel for raises "segmentation fault (core dumped)" when the number of iterations in the for loop exceeds a certain amount

I have been writing some code and have tried to parallelize it using #pragma omp parallel for. Everything works fine up to about 100000 iterations, but when I increase the number of iterations to about 200000, it compiles without errors yet immediately shows "segmentation fault (core dumped)" at runtime.
I have been told that it might be a local/global variable issue with the iteration variables, but I think that is unlikely because it runs just fine with fewer iterations.
I have also been told that it might be a stack issue, which I know nothing about.
