Discussion:
Receiving MPI messages of unknown size
Lars
2009-06-04 01:31:51 UTC
Permalink
Hi,

I'm trying to solve the problem of passing serializable, arbitrarily
sized objects around using MPI and non-blocking communication. The
problem I'm facing is what to do at the receiving end when expecting
an object of unknown size, without blocking while waiting for it.

When using blocking message passing, I have simply solved the problem
by first sending a small, fixed-size header containing the size of the
rest of the data, which is sent in a following MPI message. When using
non-blocking message passing, this doesn't seem to be such a good
idea, since we can't post the main data transfer until we have received
the message header... It seems to take away most of the advantages of
non-blocking I/O in the first place.
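In code, the receiving side of the blocking version looks roughly like
this (a simplified sketch; the tags and names are made up for the
example):

#include <stdlib.h>
#include <mpi.h>

#define TAG_HEADER 0
#define TAG_DATA   1

void recv_object(int src, MPI_Comm comm)
{
    int size;

    /* First message: fixed-size header carrying the payload size. */
    MPI_Recv(&size, 1, MPI_INT, src, TAG_HEADER, comm, MPI_STATUS_IGNORE);

    /* Second message: the payload itself, now that its size is known. */
    char *buf = malloc(size);
    MPI_Recv(buf, size, MPI_BYTE, src, TAG_DATA, comm, MPI_STATUS_IGNORE);

    /* ... deserialize the object from buf ... */
    free(buf);
}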


I've been thinking about solving this using MPI_Probe / MPI_IProbe,
but I'm worried about performance.


Question 1:

Will MPI_Probe or the underlying MPI implementation actually receive
the full message data (assuming reasonably sized message, like less
than 10MB) before MPI_Probe returns? Or will there be a significant
data transfer delay (for large messages) when calling MPI_Recv after a
successful MPI_Probe?



What I want is something like this:

1) post one or several non-blocking, variable sized message receives

2) do other, non-MPI work, while any incoming messages will be fully
received into buffers on the local machine.

3) perform completion of the receives posted in 1). I don't want to
unnecessarily wait here for data transfers that could have taken
place during 2).


Problems:

I can't post non-blocking MPI_Irecv() calls in 1), because I don't know
the sizes of incoming messages.

If I simply do nothing in 1) and call MPI_Probe in 3), I'm worried that
I won't get nice compute/transfer overlap, because the messages won't
actually be received locally until I post a Probe or Recv in 3).


Question 2:

How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?


Question 3:

A temporary kludge solution to the problem above might be to allocate
a temporary receive buffer of some arbitrary, constant maximum size
BUFSIZE in 1 for each non-blocking receive operation, make sure
messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
actually correct and portable to receive less data than specified in
the count argument to MPI_Irecv.

What if the message sent on the other end is 10 bytes, and
BUFSIZE=count=20. Would that be OK?


If anyone can shed any light on this, I'd be grateful. FYI, we're
using a cluster of 2-8 core x86-64 machines running Linux, connected
via ordinary 1 Gbit Ethernet.


Best regards,

Lars Andersson
Michael Hofmann
2009-06-04 08:48:17 UTC
Permalink
Post by Lars
Will MPI_Probe or the underlying MPI implementation actually receive
the full message data (assuming reasonably sized message, like less
than 10MB) before MPI_Probe returns? Or will there be a significant
data transfer delay (for large messages) when calling MPI_Recv after a
successful MPI_Probe?
That depends on your MPI implementation. In general one can assume that
small messages are sent/received immediately, while large messages require
the "rendezvous" with MPI_Recv. I don't think that MPI_Probe has any
significant influence on that. This means that for large messages the
"transfer delay" is all the additional time you spend between MPI_Probe
and MPI_Recv.
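For reference, the usual blocking pattern for receiving a message of
unknown size is something like this (a sketch; error handling omitted,
payload treated as raw bytes):

#include <stdlib.h>
#include <mpi.h>

/* Probe for the next message, ask for its size, then receive it. */
char *recv_any(int *count)
{
    MPI_Status status;
    char *buf;

    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_BYTE, count);  /* size of pending message */

    buf = malloc(*count);
    MPI_Recv(buf, *count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return buf;
}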
Post by Lars
1) post one or several non-blocking, variable sized message receives
2) do other, non-MPI work, while any incoming messages will be fully
received into buffers on the local machine.
3) perform completion of the receives posted in 1). I don't want to
unnecessarily wait here for data transfers that could have taken
place during 2).
I can't post non-blocking MPI_Irecv() calls in 1, because I don't know
the sizes of incoming messages.
If I simply do nothing in 1, and call MPI_Probe in 3, I'm worried that
I won't get nice compute/transfer overlap because the messages wont
actually be received locally until I post a Probe or Recv in 3.
How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?
I think your "kludge solution" is OK, even though it has some
disadvantages (how to choose BUFSIZE? you need a separate receive buffer
for each non-blocking receive). An alternative solution is to look for
incoming messages (repeatedly) with MPI_Iprobe during the non-MPI work
(e.g., if the work has some loop structure). If this is not possible, you
can also start a separate thread that waits for incoming messages using
MPI_Probe.
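E.g., if the work has a loop structure, a sketch could look like this
(work_remaining() and do_some_work() are placeholders for your work
chunks):

#include <stdlib.h>
#include <mpi.h>

extern int  work_remaining(void);   /* placeholder: more work to do? */
extern void do_some_work(void);     /* placeholder: one chunk of work */

void work_and_poll(void)
{
    while (work_remaining())
    {
        do_some_work();             /* one chunk of non-MPI work */

        int flag;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        if (flag)                   /* a message has arrived */
        {
            int count;
            MPI_Get_count(&status, MPI_BYTE, &count);
            char *buf = malloc(count);
            MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE,
                     status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... hand buf off for deserialization ... */
        }
    }
}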
Post by Lars
A temporary kludge solution to the problem above might be to allocate
a temporary receive buffer of some arbitrary, constant maximum size
BUFSIZE in 1 for each non-blocking receive operation, make sure
messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
actually correct and portable to receive less data than specified in
the count argument to MPI_Irecv.
What if the message sent on the other end is 10 bytes, and
BUFSIZE=count=20. Would that be OK?
Yes, that would be OK. The "count" argument of MPI_Recv and MPI_Irecv is
used to specify the total size of the receive buffer, but not the exact
number of elements you want to receive.

"The length of the received message must be less than or equal to the
length of the receive buffer."
(http://www.mpi-forum.org/docs/mpi21-report/node44.htm)

MPI_Get_count is used to determine the exact number of entries received
(using the "status" returned by receive operations).
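I.e., roughly like this (a sketch; BUFSIZE and the variable names are
just for illustration):

#include <stdlib.h>
#include <mpi.h>

#define BUFSIZE (10 * 1024 * 1024)  /* upper bound the sender must respect */

void oversized_recv(void)
{
    char *buf = malloc(BUFSIZE);
    MPI_Request req;
    MPI_Status status;
    int received;

    /* Post the receive with the full buffer size... */
    MPI_Irecv(buf, BUFSIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &req);

    /* ... do local work here ... */

    /* ...then complete it and ask how much actually arrived. */
    MPI_Wait(&req, &status);
    MPI_Get_count(&status, MPI_BYTE, &received);
    /* only the first 'received' bytes of buf are valid */
    free(buf);
}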


Michael
Lars
2009-06-05 00:59:16 UTC
Permalink
Post by Michael Hofmann
Post by Lars
How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?
I think your "kludge solution" is OK, even though it has some
disadvantages (how to choose BUFSIZE? you need a separate receive buffer
for each non-blocking receive). An alternative solution is to look for
incoming messages (repeatedly) with MPI_Iprobe during the non-MPI work
(e.g., if the work has some loop structure). If this is not possible, you
can also start a separate thread that waits for incoming messages using
MPI_Probe.
Thanks Michael, that was useful information.

I've implemented the "kludge solution", and yes, it works ok. Because
of how malloc() is implemented under Linux, I don't think there will
be any noticeable memory/performance impact of allocating large (and
mostly unused) buffers for each posted recv.

I've now run into a new problem, though. When trying to overlap local
work with data transfer, my MPI implementation (LAM or Open MPI)
doesn't really make any progress at all for large messages until I
call MPI_Wait(), or intersperse the local computation with MPI_Test()
calls. What MPI implementation do you use? Do you have any experience
solving this problem?

Assuming we have posted a relatively large recv (1-10 MB), I see three
possible solutions for making progress in both transfer and local
computation:

1) Spawning a thread doing MPI_Wait() while doing the local work in
the main thread.


2) Spawning a thread doing something like

while (!done)
{
    usleep(1000);                     /* back off between polls */
    for (int i = 0; i < nrequests; i++)
    {
        int flag;
        MPI_Test(&requests[i], &flag, MPI_STATUS_IGNORE);
    }
}

What amount of sleep would you recommend here?


3) Trying to intersperse my local computation with MPI_Test() calls?


I don't really like solution 3 because most of the local work is being
done in external library code, which means it's going to be hard/ugly
to intersperse it with MPI calls.

I also don't like solution 1, because MPI_Wait() will busy-wait under
Open MPI, stealing up to 50% of the CPU cycles from the thread trying
to do local work. Do you have any recommendations?

Cheers,

Lars
Michael
2009-06-05 11:09:00 UTC
Permalink
Has anyone even considered simply using Boost.MPI

http://www.boost.org/doc/libs/1_39_0/doc/html/mpi.html

over the raw MPI API?
Post by Lars
Post by Michael Hofmann
Post by Lars
How can I achieve the communication sequence described in 1,2,3 above,
with overlapping data transfer and local computation during 2?
I think your "kludge solution" is OK, even though it has some
disadvantages (how to choose BUFSIZE? you need a separate receive buffer
for each non-blocking receive). An alternative solution is to look for
incoming messages (repeatedly) with MPI_Iprobe during the non-MPI work
(e.g., if the work has some loop structure). If this is not possible, you
can also start a separate thread that waits for incoming messages using
MPI_Probe.
Thanks Michael, that was useful information.
I've implemented the "kludge solution", and yes, it works ok. Because
of how malloc() is implemented under Linux, I don't think there will
be any noticeable memory/performance impact of allocating large (and
mostly unused) buffers for each posted recv.
I've now run into a new problem, though. When trying to overlap local
work with data transfer, my MPI implementation (LAM or Open MPI)
doesn't really make any progress at all for large messages until I
call MPI_Wait(), or intersperse the local computation with MPI_Test()
calls. What MPI implementation do you use? Do you have any experience
solving this problem?
Assuming we have posted a relatively large recv (1-10 MB), I see three
possible solutions for making progress in both transfer and local
computation:
1) Spawning a thread doing MPI_Wait() while doing the local work in
the main thread.
2) Spawning a thread doing something like
while (!done)
{
    usleep(1000);                     /* back off between polls */
    for (int i = 0; i < nrequests; i++)
    {
        int flag;
        MPI_Test(&requests[i], &flag, MPI_STATUS_IGNORE);
    }
}
What amount of sleep would you recommend here?
3) Trying to intersperse my local computation with MPI_Test() calls?
I don't really like solution 3 because most of the local work is being
done in external library code, which means it's going to be hard/ugly
to intersperse it with MPI calls.
I also don't like solution 1, because MPI_Wait() will busy-wait under
Open MPI, stealing up to 50% of the CPU cycles from the thread trying
to do local work. Do you have any recommendations?
Cheers,
Lars
Katka
2009-06-05 13:02:19 UTC
Permalink
Post by Michael
Has anyone even considered simply using the BOOST MPI
http://www.boost.org/doc/libs/1_39_0/doc/html/mpi.html
over raw MPI API
My limited experience with Boost hasn't been very positive. I found it
generates bloated code, huge compile times, and page after page of
incomprehensible template error messages if you make a coding
mistake.

From a quick look at the documentation, I also don't see how it would
solve any concrete problems. It looks like just an MPI wrapper with
some handy features such as object serialization, etc.?

/Lars
Michael Hofmann
2009-06-09 07:42:32 UTC
Permalink
Post by Lars
I've implemented the "kludge solution", and yes, it works ok. Because
of how malloc() is implemented under Linux, I don't think there will
be any noticeable memory/performance impact of allocating large (and
mostly unused) buffers for each posted recv.
I've now run into a new problem, though. When trying to overlap local
work with data transfer, my MPI implementation (LAM or Open MPI)
doesn't really make any progress at all for large messages until I
call MPI_Wait(), or intersperse the local computation with MPI_Test()
calls. What MPI implementation do you use? Do you have any experience
solving this problem?
No, not really. Again, this depends strongly on the MPI implementation.
In the case of portable MPI implementations (LAM, Open MPI, MPICH), it
can also depend on the underlying communication device used or on
special support from the operating system.

For Open MPI you may have a look at the "--enable-progress-threads" and
"--with-threads" configuration options (disabled by default). From the
README file:

- Asynchronous message passing progress using threads can be turned on
  with the --enable-progress-threads option to configure.
  Asynchronous message passing progress is only supported for TCP,
  shared memory, and Myrinet/GM. Myrinet/GM has only been lightly
  tested.

--with-threads=value
  Since thread support (both support for MPI_THREAD_MULTIPLE and
  asynchronous progress) is only partially tested, it is disabled by
  default. To enable threading, use "--with-threads=posix". This is
  most useful when combined with --enable-mpi-threads and/or
  --enable-progress-threads.
Post by Lars
Assuming we have posted a relatively large recv (1-10 MB), I see three
possible solutions for making progress in both transfer and local
computation:
1) Spawning a thread doing MPI_Wait() while doing the local work in
the main thread.
2) Spawning a thread doing something like
while (!done)
{
    usleep(1000);                     /* back off between polls */
    for (int i = 0; i < nrequests; i++)
    {
        int flag;
        MPI_Test(&requests[i], &flag, MPI_STATUS_IGNORE);
    }
}
What amount of sleep would you recommend here?
3) Trying to intersperse my local computation with MPI_Test() calls?
I don't really like solution 3 because most of the local work is being
done in external library code, which means it's going to be hard/ugly
to intersperse it with MPI calls.
I also don't like solution 1, because MPI_Wait() will busy-wait under
Open MPI, stealing up to 50% of the CPU cycles from the thread trying
to do local work. Do you have any recommendations?
Solution 2 (but using MPI_Testany)? Sorry, but if the MPI implementation
does not provide these capabilities (asynchronous progress), it is very
hard to enforce them from the outside.
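
A rough sketch of such a polling thread (my assumptions: 'requests'
holds the posted receives, the thread is started with pthread_create,
and MPI was initialized via MPI_Init_thread with at least
MPI_THREAD_SERIALIZED, so this thread may call MPI while the main
thread does only non-MPI work):

#include <stddef.h>
#include <unistd.h>
#include <mpi.h>

struct progress_args {
    MPI_Request *requests;
    int          nrequests;
};

void *progress_thread(void *p)
{
    struct progress_args *args = p;
    int remaining = args->nrequests;

    while (remaining > 0)
    {
        int index, flag;
        MPI_Testany(args->nrequests, args->requests, &index, &flag,
                    MPI_STATUS_IGNORE);
        if (flag && index != MPI_UNDEFINED)
            remaining--;            /* one more request completed */
        else
            usleep(1000);           /* back off; trades latency for CPU */
    }
    return NULL;
}

Unlike MPI_Wait, the usleep() keeps the thread from monopolizing a
core; a larger sleep costs message latency, a smaller one CPU time.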


Michael