Discussion:
MPI_Allreduce failure
e***@gmail.com
2009-02-05 18:24:21 UTC
Good Afternoon,

I've recently encountered an error with MPI_Allreduce that I can't
figure out. I have a CFD code I play with and it's been working great
with some test grids, but with a larger grid I get the following
error:


Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(289).......................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)[cli_0]: aborting
job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(289).......................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215)...rank 2 in job 37 hermes03_45491
caused collective abort of all ranks
exit status of rank 2: killed by signal 6
..........: an error occurred while handling an event returned by
MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)


I'm presuming this is a timeout in the wait event? I get this error
running 2-4 processes on a quad-core (Opteron) machine, but running
the program with a single rank works just fine... any clues? And if
this is a timeout, how do I work around it when I'm processing larger
data sets?
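
For reference, the call in question boils down to something like the
following stand-alone sketch (variable names are placeholders, not the
actual CFD code; the parameters match the error stack above):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* stand-in for the per-rank partial result from the grid loop */
      double local_sum = (double)(rank + 1);
      double global_sum = 0.0;

      /* same parameters as in the error stack:
         count=1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD */
      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                    MPI_SUM, MPI_COMM_WORLD);

      printf("rank %d: global_sum = %f\n", rank, global_sum);

      MPI_Finalize();
      return 0;
  }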

Thanks,

Philip
e***@gmail.com
2009-02-05 19:25:05 UTC
Update: I stuck a barrier right before the Allreduce call, which
should reduce or eliminate any latency between ranks, and I still have
the same problem. I checked the local result in each rank and the
values are sane for an MPI_DOUBLE. I'm at a loss for what the problem
might be.
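
Roughly, the relevant lines now look like this (again, the names are
placeholders for the actual code):

  /* debugging: synchronize and print each rank's local value
     before the collective call */
  MPI_Barrier(MPI_COMM_WORLD);
  fprintf(stderr, "rank %d: local_sum = %g\n", rank, local_sum);

  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                MPI_SUM, MPI_COMM_WORLD);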

Again, it's data-set-size dependent: I have no issue with smaller data
sets, just larger ones. But I don't see why this would cause the
failure in Allreduce, and I don't quite understand how to parse this
MPI error stack.

Thanks,

Philip
Georg Bisseling
2009-02-09 11:35:14 UTC
On Thu, 05 Feb 2009 19:24:21 +0100, ***@gmail.com <***@gmail.com> wrote:

"connection reset by peer" might be a polite way to say that the
other side died from a segmentation fault.

Most probably the size-dependent error is in the application.

Try valgrind on the 1-process job. Maybe it will already find the
problem. If not, you will have to go through the exercise of tweaking
the mpiexec scripts to run the ranks under valgrind - but maybe they
can already do that.

In the error log below you see two processes complaining.
How many did you start? Did you search for core dumps?
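
Something along these lines should do for both (./your_cfd_program is
just a placeholder; the mpiexec syntax is for an MPICH-style launcher):

  # single-process run under valgrind
  valgrind ./your_cfd_program

  # multi-process run: let mpiexec start every rank under valgrind
  mpiexec -n 4 valgrind ./your_cfd_program

  # enable core dumps before the run, then inspect one in gdb
  # (the core file name may differ on your system)
  ulimit -c unlimited
  mpiexec -n 4 ./your_cfd_program
  gdb ./your_cfd_program core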
Post by e***@gmail.com
Good Afternoon,
I've recently encountered an error with MPI_Allreduce that I can't
figure out. I have a CFD code I play with and it's been working great
with some test grids, but with a larger grid I get the following
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)[cli_0]: aborting
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215)...rank 2 in job 37 hermes03_45491
caused collective abort of all ranks
exit status of rank 2: killed by signal 6
..........: an error occurred while handling an event returned by
MPIDU_Sock_Wait()
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)
I'm presuming this is a timeout in the wait event? I get this error
running 2-4 processes on a quad-core (Opteron) machine, but running
the program with a single rank works just fine... any clues? And if
this is a timeout, how do I work around it when I'm processing larger
data sets?
Thanks,
Philip
--
This signature intentionally left almost blank.
http://www.this-page-intentionally-left-blank.org/