e***@gmail.com
2009-02-05 18:24:21 UTC
Good Afternoon,
I've recently encountered an error with MPI_Allreduce that I can't
figure out. I have a CFD code I play with and it's been working great
with some test grids, but with a larger grid I get the following
error:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(289).......................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)[cli_0]: aborting
job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(707)........................: MPI_Allreduce
(sbuf=0x7fffd3caf780, rbuf=0x7fffd3caf788, count=1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(289).......................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215)...rank 2 in job 37 hermes03_45491
caused collective abort of all ranks
exit status of rank 2: killed by signal 6
..........: an error occurred while handling an event returned by
MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=3,errno=104:Connection reset by peer)
I'm presuming this is a timeout in the wait event? I get this error when
running with 2-4 processes on a quad-core (Opteron) machine, but with a
single rank the program works just fine... any clues? And if this is a
timeout, how do I work around it when processing larger data sets?
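For reference, here is a minimal standalone sketch of the same call shape
that appears in the trace (count=1, MPI_DOUBLE, MPI_SUM over
MPI_COMM_WORLD); the variable names are illustrative, not taken from the
CFD code. Running it at the same process count can help check whether the
Allreduce itself fails outside the application:

```c
/* Minimal reproducer sketch for the MPI_Allreduce call in the trace.
   Compile: mpicc repro.c -o repro    Run: mpiexec -n 4 ./repro */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Same call shape as the failing one: one double, summed. */
    double local = (double)rank;
    double sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %g\n", sum);

    MPI_Finalize();
    return 0;
}
```

If this runs cleanly at 2-4 processes, the collective itself is presumably
fine; note that signal 6 is SIGABRT, which often points at memory
corruption in the larger-grid code path rather than a timeout.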
Thanks,
Philip