Discussion:
How MPI handles network connection loss
Helen
2008-02-07 16:45:52 UTC
I'm experimenting with how MPI responds to a sudden loss of the network
connection (with both MPICH2 from Argonne Lab and Intel's implementation).
Even during a "sleep(20000)", when I unplug the network cable, MPI detects
it and aborts fatally (mpiexec.exe got abort). I cannot catch any
exceptions. The error messages that appear in my command window are:

op_read error on left context: generic socket failure, error stack:
MPIDU_Sock_wait(2571): The specified network name is no longer available. (errno 64)
unable to read the cmd header on the left context, generic socket failure, error stack:
MPIDU_Sock_wait(2571): The specified network name is no longer available. (errno 64).

I have spent hours on the web and haven't found any related information
yet. It seems that error handling is not very well defined in MPI.
Even the error handler seems to handle only MPI_XXX calls, not the
internal MPIDU_xxxxx ones.

Does anybody know anything about this?

Thanks a lot!

Helen
David Cronk
2008-02-07 17:12:14 UTC
This is the default behaviour for most MPI implementations. As you
noted, the standard is a little vague. Basically, the standard does not
define what to do on errors. However, there has been a fair amount of
work in this area. Research "fault tolerant MPI" and you should be able
to find a fair amount of information.

Hope this helps.

Dave.
--
Dr. David Cronk, Ph.D. phone: (865) 974-3735
Research Director fax: (865) 974-8296
Innovative Computing Lab http://www.cs.utk.edu/~cronk
University of Tennessee, Knoxville