e***@mail.ru
2011-01-01 18:13:22 UTC
Hi, I'm using MPI_Reduce to sum my data. As I understand it, MPI_Reduce
implements some tree algorithm to do this. The question is: does it
take the node locality of the processes into account while building
this tree?
My job runs on a bunch of SMP nodes, and I want as much of the summing
as possible done in each node's shared memory. I think my MPI's
low-level communication layer uses shared-memory exchange if the peers
HAPPEN to be on the same node. But does MPI build the tree for
MPI_Reduce so that there are many more intranode shared-memory
exchanges and fewer internode network ones?
For example: would I get any speedup if I manually split my MPI_Reduce
into two steps: first call a separate MPI_Reduce for each node
(creating a local communicator per node), summing the results on, say,
its rank zero, and then call MPI_Reduce on an internode communicator
consisting of those local zero ranks taken from MPI_COMM_WORLD?
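Below is a minimal sketch of what I mean by the two-step version. It
assumes an MPI-3 library that provides MPI_Comm_split_type for the
node-local split (on older MPIs the color would have to be derived
from MPI_Get_processor_name instead); the data is just a placeholder.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    double local_value = (double)world_rank;   /* stand-in for real data */

    /* Step 1: communicator of the ranks sharing this node (MPI-3). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Reduce inside the node; the partial sum lands on node-local rank 0. */
    double node_sum = 0.0;
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: communicator of the node leaders (node_rank == 0 on each node). */
    MPI_Comm leader_comm;
    int color = (node_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {
        /* Internode reduce over the partial sums; world rank 0 is leader 0. */
        double global_sum = 0.0;
        MPI_Reduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leader_comm);
        if (world_rank == 0)
            printf("global sum = %f\n", global_sum);
        MPI_Comm_free(&leader_comm);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

The first MPI_Reduce should only move data inside a node, and the
second should send exactly one value per node over the network; what I
don't know is whether the built-in MPI_Reduce already does the
equivalent internally.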