Discussion:
MPI and multicore processors
giuseppe
2007-11-30 16:17:59 UTC
Hi everyone,
I'm a computer science student with a little experience (exercises) in
programming PC clusters with MPI. I'd like some references on using the
MPI library with multicore processors.

thanks a lot,
Giuseppe
Greg Lindahl
2007-11-30 19:19:10 UTC
Post by giuseppe
I'd like some references on using the MPI library with multicore
processors.
It's basically the same as using MPI on multi-processor machines, which
have been around for a long time.

-- greg
Justin W
2007-11-30 23:40:02 UTC
If you're running a distributed/parallel program on a single physical
machine, there are more efficient ways to "pass information" around:
shared memory, for example (take a look at OpenMP, or implement your own
with pthreads). MPI's real advantage is multi-machine
communication/coordination.
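
Just to give a flavour of the shared-memory style, a rough (untested)
OpenMP sketch in C; nothing here is specific to any particular compiler
or machine:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N];
    double sum = 0.0;

    /* threads share the array directly; no messages are passed */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %g, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}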

If you just want to play around with MPI (which is what I'm doing), it
doesn't really matter what you run it on: single-core, multi-core,
multi-machine. It'll "run", but obviously different setups have
different advantages over one another.

-Justin
giuseppe
2007-12-01 08:32:19 UTC
Post by Justin W
If you're running a distributed/parallel program on a single physical
machine, there are more efficient ways to "pass information" around:
shared memory, for example (take a look at OpenMP, or implement your own
with pthreads). MPI's real advantage is multi-machine
communication/coordination.
Uhmm, you're right! It may be better to use an efficient and stable
library like OpenMP than to roll my own implementation (or not?)
Post by Justin W
If you just want to play around with MPI
I want to play around with something that lets me program real-life
machines, such as multicore processors, effectively (not an HPC system
at NASA).
Post by Justin W
it doesn't really matter what you run it on: single-core, multi-core,
multi-machine. It'll "run", but obviously different setups have
different advantages over one another.
That was my doubt; I hadn't considered that MPI is optimized for network
communication even though it can run on a single multicore machine.

Thanks,
Giuseppe
Sebastian Hanigk
2007-12-01 12:45:33 UTC
Post by giuseppe
Post by Justin W
it doesn't really matter what you run it on: single-core, multi-core,
multi-machine. It'll "run", but obviously different setups have
different advantages over one another.
That was my doubt; I hadn't considered that MPI is optimized for network
communication even though it can run on a single multicore machine.
MPI - as the expanded acronym says - is based on a message-passing
paradigm, which costs efficiency inside an SMP node: you not only have
to transfer the data, you also pay the messaging overhead (setting up
and tearing down the connections between processes and so on).

Good MPI implementations use something like shared-memory IPC inside an
SMP node, but if you're after the last bit of performance, a thread-based
programming model like OpenMP is better suited.
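
The two models can also be combined: MPI between nodes, OpenMP threads
within a node. A rough, untested sketch of that hybrid style (nothing
implementation-specific assumed):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* request an MPI library that tolerates threads (funneled suffices
       if only the master thread makes MPI calls) */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory parallelism inside the node */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    /* message passing between nodes would go here */
    MPI_Finalize();
    return 0;
}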


Sebastian
Greg Lindahl
2007-12-02 01:22:45 UTC
Post by Sebastian Hanigk
MPI - as the expanded acronym says - is based on a message-passing
paradigm, which costs efficiency inside an SMP node: you not only have
to transfer the data, you also pay the messaging overhead (setting up
and tearing down the connections between processes and so on).
And with MPI you gain the efficiency of never having false sharing and
other locality problems.

Which is why it's frequently the case that codes with both OpenMP and
MPI implementations run faster in pure MPI mode on big SMPs.

-- greg
Sebastian Hanigk
2007-12-05 10:17:11 UTC
Post by Greg Lindahl
And with MPI you gain the efficiency of never having false sharing and
other locality problems.
Which is why it's frequently the case that codes with both OpenMP and
MPI implementations run faster in pure MPI mode on big SMPs.
It's good that you mention the threading problems that can occur.

One of the major drawbacks of MPI on SMP machines is, in my opinion, the
synchronisation required for communication; one-sided communication
directives (which MPI supports only half-heartedly) are a really nice way
to achieve loose coupling, especially if your hardware supports them natively.
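
For readers who haven't used them, MPI-2's one-sided calls look roughly
like the following untested sketch (buffer names are made up; error
handling and window reuse omitted):

#include <mpi.h>

/* sketch: expose a buffer as a window and put data into a peer's copy */
void one_sided_put_sketch(double *win_buf, double *src, int n, int target)
{
    MPI_Win win;

    /* every process exposes win_buf for remote access */
    MPI_Win_create(win_buf, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open an access epoch      */
    MPI_Put(src, n, MPI_DOUBLE,             /* write src into the target */
            target, 0, n, MPI_DOUBLE, win); /* window; no matching recv  */
    MPI_Win_fence(0, win);                  /* close epoch: data visible */

    MPI_Win_free(&win);
}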


Sebastian
Greg Lindahl
2007-12-05 20:51:33 UTC
Post by Sebastian Hanigk
One of the major drawbacks of MPI on SMP machines is, in my opinion, the
synchronisation required for communication; one-sided communication
directives (which MPI supports only half-heartedly) are a really nice way
to achieve loose coupling, especially if your hardware supports them natively.
Yes, although many programmers are displeased to discover that they
often need just as much synchronization with one-sided communications.
So they end up sprinkling their code with barriers, and sometimes have
to resort to double-buffering.

-- greg
Sebastian Hanigk
2007-12-05 23:35:42 UTC
Post by Greg Lindahl
Yes, although many programmers are displeased to discover that they
often need just as much synchronization with one-sided communications.
So they end up sprinkling their code with barriers, and sometimes have
to resort to double-buffering.
I have had good experiences with one-sided communication in cases where
the data layout is unpredictable (in my case, plugging newly developed
algorithms into an existing legacy codebase). The buffering issues can
sometimes (often?) be turned into non-blocking communication, which is
especially useful if your interconnect supports some kind of RDMA operation.

Regarding the synchronisation subroutine calls: I surmise that MPI codes
usually employ send-receive pairs in which many if not all processes take
part, which implies an implicit synchronisation step at the end of every
communication epoch, whether it's needed or not; at least in theory one
could use less synchronisation, albeit explicit, by employing RDMA
communication. I'm currently using a Blue Gene for some tests, and the
low-level messaging layer lets you specify callbacks for the sender and
receiver of those messages, so you could, for example, simply notify the
target whenever you put something into its memory.


Sebastian
Greg Lindahl
2007-12-06 20:06:51 UTC
Post by Sebastian Hanigk
Regarding the synchronisation subroutine calls: I surmise that MPI codes
usually employ send-receive pairs in which many if not all processes take
part, which implies an implicit synchronisation step at the end of every
communication epoch, whether it's needed or not; at least in theory one
could use less synchronisation,
Many of the MPI codes I've looked at have the minimum of synchronization.

BTW, you may not want to use "RDMA" the way you're using it; it's been
hijacked by one community and redefined to mean both more and less than
actual remote direct memory access.
Post by Sebastian Hanigk
I'm currently using a Blue Gene for some tests, and the low-level
messaging layer lets you specify callbacks for the sender and receiver of
those messages, so you could, for example, simply notify the target
whenever you put something into its memory.
This is a typical feature -- it's needed because you still need
synchronization.

-- greg
Sebastian Hanigk
2007-12-06 21:33:33 UTC
Post by Greg Lindahl
Post by Sebastian Hanigk
Regarding the synchronisation subroutine calls: I surmise that MPI codes
usually employ send-receive pairs in which many if not all processes take
part, which implies an implicit synchronisation step at the end of every
communication epoch, whether it's needed or not; at least in theory one
could use less synchronisation,
Many of the MPI codes I've looked at have the minimum of
synchronization.
I think we're talking about slightly different things; if by
"synchronisation" you mean explicit calls to the barrier subroutine,
you're right. I was referring more to the (sometimes unnecessary)
synchronisation due to MPI's two-sided communication model (let's not
talk about eager vs. rendezvous for the moment).

Simple example: ghost cell exchange in a CFD code. In the MPI case,
every send/receive incurs synchronisation, but you could simply read the
remote process's memory without the target's explicit help. Of course,
you have to ensure that you're reading consistent data, but that is just
one barrier before the next update step.
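
In MPI-2 terms that could look roughly like this untested sketch (the
window and buffer names are made up; the single barrier mirrors the
argument above):

#include <mpi.h>

/* sketch: read the neighbour's boundary cells instead of send/recv pairs.
   'win' is assumed to expose each process's boundary layer. */
void ghost_read_sketch(double *ghost, int n, int neighbour, MPI_Win win)
{
    /* passive-target read: the neighbour posts no matching call */
    MPI_Win_lock(MPI_LOCK_SHARED, neighbour, 0, win);
    MPI_Get(ghost, n, MPI_DOUBLE, neighbour, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(neighbour, win);   /* the get is complete after unlock */

    /* one barrier so nobody overwrites its boundary while others
       may still be reading the old values */
    MPI_Barrier(MPI_COMM_WORLD);
}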
Post by Greg Lindahl
BTW, you may not want to use "RDMA" the way you're using it; it's been
hijacked by one community and redefined to mean both more and less than
actual remote direct memory access.
It is? I'm not really sure what the best terminology would be; I often
use RDMA, SHMEM or distributed shared memory when referring to (more or
less) passive-target, one-sided communication in a cluster.
Post by Greg Lindahl
Post by Sebastian Hanigk
I'm currently using a Blue Gene for some tests, and the low-level
messaging layer lets you specify callbacks for the sender and receiver of
those messages, so you could, for example, simply notify the target
whenever you put something into its memory.
This is a typical feature -- it's needed because you still need
synchronization.
That depends. My current work on a 3D FFT could be realised solely with
get communication on disjoint buffers, so barrier synchronisation is
barely needed. I've dabbled with an accumulation routine prototype in
which a put operation writes into remote memory and the corresponding
callback on the target process performs the accumulation, but I'm still
thinking about how to implement atomicity.
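
(For comparison, staying inside MPI: MPI_Accumulate with MPI_SUM already
gives element-wise atomic updates to a window, roughly as in the untested
sketch below, although that of course goes through the MPI layer rather
than the low-level messaging interface.)

#include <mpi.h>

/* sketch: element-wise atomic remote accumulation via MPI one-sided
   calls; concurrent accumulates to the same window location need no
   extra locking */
void remote_sum_sketch(double *local, int n, int target, MPI_Win win)
{
    MPI_Win_fence(0, win);
    MPI_Accumulate(local, n, MPI_DOUBLE,
                   target, 0, n, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);
}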


Sebastian
Greg Lindahl
2007-12-06 23:45:00 UTC
Post by Sebastian Hanigk
Post by Greg Lindahl
Many of the MPI codes I've looked at have the minimum of
synchronization.
I think we're talking about slightly different things; if by
"synchronisation" you mean explicit calls to the barrier subroutine,
you're right.
No, I'm referring to all forms of synchronization, including
2-sided communication synchronization.
Post by Sebastian Hanigk
Simple example: ghost cell exchange in a CFD code. In the MPI case,
every send/receive incurs synchronisation,
No, it doesn't. For example, I can irecv/isend and then waitall. That
results in one synchronization with my neighbors. Nothing extra.
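
For a 1-D halo the pattern is roughly the following untested sketch
(buffer and neighbour-rank names are made up):

#include <mpi.h>

/* sketch: non-blocking halo exchange with one completion step at the end */
void halo_exchange_sketch(double *send_lo, double *send_hi,
                          double *recv_lo, double *recv_hi,
                          int n, int lo, int hi)   /* neighbour ranks */
{
    MPI_Request req[4];

    MPI_Irecv(recv_lo, n, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_hi, n, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_lo, n, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_hi, n, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &req[3]);

    /* the only blocking point: one wait on the neighbour exchanges */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}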
Post by Sebastian Hanigk
but you could simply read the remote process's memory without the
target's explicit help. Of course, you have to ensure that you're reading
consistent data, but that is just one barrier before the next update step.
That's a synchronization, too. So there you have it: one in each case.

-- greg
Sebastian Hanigk
2007-12-07 11:07:38 UTC
Post by Greg Lindahl
Post by Sebastian Hanigk
Simple example: ghost cell exchange in a CFD code. In the MPI case,
every send/receive incurs synchronisation,
No, it doesn't. For example, I can irecv/isend and then waitall. That
results in one synchronization with my neighbors. Nothing extra.
But this only works for eager sends and receives! If the amount of data
you're about to transfer exceeds some buffer limit, even the i-routines
behave like the synchronous ones. Many MPI implementations let you fiddle
with that buffer limit, and you could use the more unusual buffered
immediate send routine (MPI_Ibsend).
Post by Greg Lindahl
Post by Sebastian Hanigk
but you could simply read the remote process's memory without the
target's explicit help. Of course, you have to ensure that you're reading
consistent data, but that is just one barrier before the next update step.
That's a synchronization, too. So there you have it: one in each case.
With one-sided communication it is one synchronisation per update cycle,
regardless of the number of dimensions etc., whereas in the MPI case the
number of synchronisations would be twice the number of exchange
dimensions under the rendezvous protocol; it can be brought down to one
synchronisation if the immediate routines are used and they do not have
to switch to a synchronous mode of communication.


Sebastian
Greg Lindahl
2007-12-07 19:50:38 UTC
Post by Sebastian Hanigk
But this only works for eager sends and receives! If the amount of data
you're about to transfer exceeds some buffer limit, even the i-routines
behave like the synchronous ones.
Not only is this implementation-dependent behavior, but your comment
doesn't make any sense. MPI_RECV always blocks until the data is
available. MPI_IRECV never does. So no, large transfers never make
MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at
the MPI_WAIT.

And there is usually only one MPI_WAIT, no matter how many dimensions
your halo exchange has.

Now perhaps you're using a funny definition of "synchronization". But
it doesn't sound like a useful one.

-- greg
Sebastian Hanigk
2007-12-07 22:17:59 UTC
Post by Greg Lindahl
Post by Sebastian Hanigk
But this only works for eager sends and receives! If the amount of data
you're about to transfer exceeds some buffer limit, even the i-routines
behave like the synchronous ones.
Not only is this implementation-dependent behavior, but your comment
doesn't make any sense. MPI_RECV always blocks until the data is
available. MPI_IRECV never does. So no, large transfers never make
MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at
the MPI_WAIT.
I'm sorry for any misunderstanding; my comment above was written in a
slight hurry ...

Regarding MPI_Irecv I cannot say anything at the moment - I strongly
assume it behaves as you describe. But its complementary sending routine
switches from an immediate return to blocking behaviour once the message
size exceeds an implementation-dependent threshold.
Post by Greg Lindahl
And there is usually only one MPI_WAIT, no matter how many dimensions
your halo exchange has.
Yes. But if your halo exchange buffer is larger than the
implementation's threshold, you end up with blocking behaviour on each
exchange, whereas zero-copy RDMA access (without whatever connotation
I'm perhaps unaware of) can obviate this.
Post by Greg Lindahl
Now perhaps you're using a funny definition of "synchronization". But
it doesn't sound like a useful one.
I don't think I have given or used an unusual definition of
synchronisation; in MPI there is an implicit synchronisation between the
sending and receiving parties hidden in the respective send and receive
calls, with the exception of the immediate versions of those routines,
whose behaviour depends on the transfer size.

Could it be that this discussion is going in circles because we're
misunderstanding each other? I'm in no way dismissing MPI as inferior,
but for some purposes it is very nice to have the means for one-sided,
passive-target communication available. Without doubt the RDMA scheme
has its own set of problems (I just remembered a short article:
<http://www.hpcwire.com/hpc/815242.html>); I'm still struggling with the
registration/pinning issues - compute-node kernels without swapping
capability are a godsend in that respect.


Sebastian
Greg Lindahl
2007-12-07 22:38:25 UTC
Post by Sebastian Hanigk
Regarding MPI_Irecv I cannot say anything at the moment - I strongly
assume it behaves as you describe. But its complementary sending routine
switches from an immediate return to blocking behaviour once the message
size exceeds an implementation-dependent threshold.
No. Isend returns immediately in all cases. What work it does before
returning is implementation dependent, and that's what you seem to be
referring to, incorrectly.
Post by Sebastian Hanigk
Could it be that this discussion is going in circles because we're
misunderstanding each other?
It's entirely possible.
Post by Sebastian Hanigk
I'm in no way dismissing MPI as inferior,
but for some purposes it is very nice to have the means for one-sided,
passive-target communication available.
Indeed, it is sometimes useful. But now you've returned to the
beginning of the discussion, and I have the same reply as before.

-- greg
Sebastian Hanigk
2007-12-08 00:58:28 UTC
Post by Greg Lindahl
No. Isend returns immediately in all cases. What work it does before
returning is implementation dependent, and that's what you seem to be
referring to, incorrectly.
I beg to differ. Now it seems that you have an unusual definition of
"immediately". Take a look at the data at
<http://www.cs.sandia.gov/smb/overhead.html> and you'll see in Fig. 2
(overhead as a function of message size for MPI_Isend) that
interconnects without good communication-offload capabilities suffer a
penalty proportional to the message size.
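
One naive way to see that host overhead - not necessarily how those
figures or my plots were produced - is simply to time the call itself,
as in this untested sketch:

#include <stdio.h>
#include <mpi.h>

/* sketch: the time spent inside MPI_Isend itself is work the host CPU
   cannot overlap with computation; without offload it tends to grow
   with the message size */
void isend_overhead_sketch(void *buf, int nbytes, int dest)
{
    MPI_Request req;
    double t0, t1;

    t0 = MPI_Wtime();
    MPI_Isend(buf, nbytes, MPI_BYTE, dest, 0, MPI_COMM_WORLD, &req);
    t1 = MPI_Wtime();

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("time inside MPI_Isend for %d bytes: %g s\n", nbytes, t1 - t0);
}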

A few weeks ago I did some measurements on the Blue Gene available to me
(<http://www.epcc.ed.ac.uk/facilities/blue-gene/>) and have just now put
the resulting data files and overhead plots on the web
(<http://www.fs.tum.de/~shanigk/mpi_overhead/>).


Sebastian
Greg Lindahl
2007-12-08 01:47:09 UTC
Post by Sebastian Hanigk
I beg to differ. Now it seems that you have an unusual definition of
"immediately".
OK, "does not block".

The fact that ISend sometimes does a significant amount of work before
returning has nothing to do with synchronization or blocking.
Post by Sebastian Hanigk
<http://www.cs.sandia.gov/smb/overhead.html>
You might want to ask Doug about my objections to his experimental method.

-- greg
Sebastian Hanigk
2007-12-08 02:01:18 UTC
Post by Greg Lindahl
The fact that ISend sometimes does a significant amount of work before
returning has nothing to do with synchronization or blocking.
Fair enough. Now I have to try to explain the difference to our users :-)

I'll try to do the same measurements with the MPI calls replaced by ARMCI
calls (that's the library I'm currently using) and post the results.
Post by Greg Lindahl
Post by Sebastian Hanigk
<http://www.cs.sandia.gov/smb/overhead.html>
You might want to ask Doug about my objections to his experimental method.
Care to explain?


Anyway, have a good night!

Sebastian
Greg Lindahl
2007-12-08 21:26:37 UTC
Post by Sebastian Hanigk
Post by Greg Lindahl
The fact that ISend sometimes does a significant amount of work before
returning has nothing to do with synchronization or blocking.
Fair enough. Now I have to try to explain the difference to our users :-)
How did they notice?
Post by Sebastian Hanigk
Post by Greg Lindahl
Post by Sebastian Hanigk
<http://www.cs.sandia.gov/smb/overhead.html>
You might want to ask Doug about my objections to his experimental method.
Care to explain?
Doug is asking "how much work can I get done while communicating?" But
he's measuring a loop that doesn't touch main memory. You've probably
heard of Don Becker's comment on zero copy: it's when you get someone
else to do the copy. Everyone likes to pretend that this copy is free,
but it isn't. Well, all that DMA memory traffic costs. So Doug's
number is an upper bound; if you used the Stream benchmark as the work
you'd get a lower bound. And a real app would be somewhere in between.
(Since you have a framework for measuring this, perhaps you could do the
stream measurement for us.)
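
Something along these lines, say - an untested sketch where the "work"
is a STREAM-style triad that actually touches main memory:

#include <mpi.h>

/* sketch: overlap test whose work loop is memory-bound, so it competes
   with the NIC's DMA traffic instead of running out of cache */
double overlap_with_stream_work(double *msg, int count, int dest,
                                double *a, const double *b, const double *c,
                                int n)
{
    MPI_Request req;
    double t0 = MPI_Wtime();

    MPI_Isend(msg, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    /* STREAM-style triad as the "useful work" */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return MPI_Wtime() - t0;   /* compare with the triad's time on its own */
}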

Another issue I have with Doug's paper is that many readers
misinterpreted it. It only applies to the modest fraction of codes
which do large messages and can overlap. Most codes aren't like that.

-- greg
Sebastian Hanigk
2007-12-09 00:57:49 UTC
Post by Greg Lindahl
How did they notice?
Parameter-space exploration in some newly implemented parallelisations;
mostly we noticed a (more or less) sudden rise in run time due to the
lack of overlap.
Post by Greg Lindahl
Doug is asking "how much work can I get done while communicating?" But
he's measuring a loop that doesn't touch main memory.
Yes, this is not really realistic and you have to be careful that your
optimiser does not remove the loop.

Part of my diploma thesis was the implementation of the SRUMMA matrix
multiplication algorithm, whose key idea is maximal overlap; the work
sandwiched between the get and wait calls was a BLAS call.
Post by Greg Lindahl
You've probably heard of Don Becker's comment on zero copy: it's when
you get someone else to do the copy. Everyone likes to pretend that
this copy is free, but it isn't. Well, all that DMA memory traffic
costs.
On the upside, you can probably decrease the transfer latency, and if
memory is tight, it helps to save the memory that would otherwise be
used for transfer buffers.
Post by Greg Lindahl
So Doug's number is an upper bound; if you used the Stream
benchmark as the work you'd get a lower bound. And a real app would be
somewhere in between. (Since you have a framework for measuring this,
perhaps you could do the stream measurement for us.)
I wouldn't call it a framework, but I think I can do something useful
with my allotted CPU time.
Post by Greg Lindahl
Another issue I have with Doug's paper is that many readers
misinterpreted it. It only applies to the modest fraction of codes
which do large messages and can overlap. Most codes aren't like that.
I had the luxury of tackling a very easy problem in that respect (matrix
multiplication), so for me that wasn't unusual; for other codes I do
concur with you.


Sebastian
