krzikalla avatar krzikalla commented on August 21, 2024

The following program fails reliably with 1024 processes on our cluster. Could someone please take a look? Apparently something in the collectives is broken (GPI2 v1.5.0), or my AllGatherValueImpl function is at fault.

//  mpicxx gaspi_segment.cpp -pthread -I$GPI2_HOME/include -L$GPI2_HOME/lib64 -lGPI2

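#include <algorithm>   // std::fill_n, std::copy, std::min
#include <iostream>
#include <cassert>
#include <stdexcept>   // std::runtime_error
#include <vector>
#include <mpi.h>
#include "GASPI.h"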


inline gaspi_return_t CheckGaspiResult(gaspi_return_t result, const char* what)
{
  if (result != GASPI_SUCCESS && result != GASPI_TIMEOUT)
  {
    throw std::runtime_error(what);
  }
  return result;
}

#define GASPI_CHECK( X ) CheckGaspiResult((X), #X)


using RankIndexT = unsigned int;

struct GASPICommunicator
{
  gaspi_rank_t numProcs_;
  gaspi_rank_t ownRank_;
  gaspi_number_t maxReduceElems_;

  GASPICommunicator()
  {
    GASPI_CHECK(gaspi_allreduce_elem_max(&maxReduceElems_));
    GASPI_CHECK(gaspi_proc_rank(&ownRank_));
    GASPI_CHECK(gaspi_proc_num(&numProcs_));
  }

  void AllGatherValueImpl(const unsigned int* values, unsigned int* data)
  {
    std::fill_n(data, numProcs_, 0);
    std::copy(values, values + 1, data + ownRank_);
    gaspi_number_t remainingElems = gaspi_number_t(numProcs_);
    while (remainingElems > 0)
    {
      auto reduceEles = std::min(maxReduceElems_, remainingElems);
      GASPI_CHECK(gaspi_allreduce(data, data, reduceEles, GASPI_OP_SUM, GASPI_TYPE_UINT, GASPI_GROUP_ALL, GASPI_BLOCK));
      remainingElems -= reduceEles;
      data += reduceEles;
    }
  }
};

void CheckAllreduce()
{
  GASPICommunicator communicator;
  unsigned int value = communicator.ownRank_;
  std::vector<unsigned int> allData (communicator.numProcs_, -1);
  communicator.AllGatherValueImpl(&value, allData.data());
  for (unsigned int i = 0; i < communicator.numProcs_; ++i)
  {
    if (i != allData[i])
    {
      std::cout << "At rank " << value << " first fail at " << i << ", content is " << allData[i] << std::endl;
      return;
    }
  }
  std::cout << "At rank " << value << " all OK." << std::endl;
}


int main(int argc, char** argv)
{
  int provided_thread_level;
  int mpi_init_result = MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided_thread_level);

  GASPI_CHECK(gaspi_proc_init(GASPI_BLOCK));

  CheckAllreduce();

  GASPI_CHECK(gaspi_proc_term(GASPI_BLOCK));
  MPI_Finalize();
}
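
To help separate a logic error in AllGatherValueImpl from a problem in the collectives, the chunked sum-based gather can be checked on a single process without GASPI. The following is only a minimal sketch under stated assumptions: SimulatedAllreduceSum is a hypothetical stand-in for gaspi_allreduce with GASPI_OP_SUM (it is not part of GPI-2), and SimulateAllGather and AllRanksGathered are illustrative helper names. Each simulated rank zero-fills its buffer, writes its own value at its own index, and the buffers are then summed chunk by chunk, as in the loop above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Buffers = std::vector<std::vector<unsigned int>>;

// Hypothetical stand-in for gaspi_allreduce(..., GASPI_OP_SUM, ...):
// sums `count` elements starting at `offset` across all simulated ranks
// and writes the result back into every rank's buffer.
void SimulatedAllreduceSum(Buffers& buffers, std::size_t offset, std::size_t count)
{
  std::vector<unsigned int> sum(count, 0u);
  for (const auto& buf : buffers)
    for (std::size_t i = 0; i < count; ++i)
      sum[i] += buf[offset + i];
  for (auto& buf : buffers)
    std::copy(sum.begin(), sum.end(),
              buf.begin() + static_cast<std::ptrdiff_t>(offset));
}

// Mirrors the chunking loop of AllGatherValueImpl for `numProcs` simulated
// ranks, reducing at most `maxReduceElems` elements per call.
Buffers SimulateAllGather(std::size_t numProcs, std::size_t maxReduceElems)
{
  assert(maxReduceElems > 0);  // a zero chunk size would never terminate

  // Every rank starts with zeros and its own rank number at its own index.
  Buffers buffers(numProcs, std::vector<unsigned int>(numProcs, 0u));
  for (std::size_t rank = 0; rank < numProcs; ++rank)
    buffers[rank][rank] = static_cast<unsigned int>(rank);

  std::size_t offset = 0;
  std::size_t remaining = numProcs;
  while (remaining > 0)
  {
    const std::size_t chunk = std::min(maxReduceElems, remaining);
    SimulatedAllreduceSum(buffers, offset, chunk);
    remaining -= chunk;
    offset += chunk;
  }
  return buffers;
}

// True if every simulated rank ends up with the sequence 0, 1, ..., N-1.
bool AllRanksGathered(const Buffers& buffers)
{
  for (const auto& buf : buffers)
    for (std::size_t i = 0; i < buf.size(); ++i)
      if (buf[i] != i)
        return false;
  return true;
}
```

Calling AllRanksGathered(SimulateAllGather(1024, 255)) exercises the same chunk boundaries as a 1024-process run, assuming a per-call limit of 255 elements (the actual limit is whatever gaspi_allreduce_elem_max reports). If this check passes, the chunking arithmetic itself is not the source of the failure, although it says nothing about the real network path.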

krzikalla avatar krzikalla commented on August 21, 2024

Update on this issue: the cause seems to be the PCI_WR_ORDERING setting. With per_mkey(0), everything works. With force_relax(1), races occur.
