Yet another mpi problem - cluster fault?

Discussion:

(too old to reply)

Ben Jones

2008-07-11 10:23:56 UTC

Hi all,

I have the following test code:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
int rank, size;

MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}

It is being launched from the backend of a PBS script. This script
looks as follows:

#!/bin/bash
#PBS -N allIn
#PBS -l nodes=5:ppn=1
#PBS -l walltime=00:01:00
#PBS -m bea
#PBS -M ***@omitted
#PBS -o jobB_.out
#PBS -e jobB_.err

count=`cat $PBS_NODEFILE|wc -l`

echo $count

cd $PBS_O_WORKDIR
mpiexec -verbose -comm mpich-p4 /bb/cca/bxj165/emp

These are my standard outs and standard errors. Standard out:

+--------------------------------------------------------------------------+
| Job starting at 2008-07-11 10:29:39 for bxj165 on the BlueBEAR Cluster
| Job identity jobid 347336 jobname allIn
| Job requests nodes=5:ppn=1,pvmem=5gb,walltime=00:01:00
| Job assigned to nodes u4n040 u4n050 u4n051 u4n052 u4n053
+--------------------------------------------------------------------------+
5
node 0: name u4n040.beowulf.cluster, cpu avail 1
node 1: name u4n050.beowulf.cluster, cpu avail 1
node 2: name u4n051.beowulf.cluster, cpu avail 1
node 3: name u4n052.beowulf.cluster, cpu avail 1
node 4: name u4n053.beowulf.cluster, cpu avail 1
rm_3776: p4_error: semget failed for setnum: 0
rm_13115: p4_error: semget failed for setnum: 0
p0_20491: p4_error: net_recv read: probable EOF on socket: 1
p0_20491: (8.218750) net_send: could not write to fd=4, errno = 32
rm_24596: p4_error: net_recv read: probable EOF on socket: 3
rm_26706: p4_error: net_recv read: probable EOF on socket: 3
+--------------------------------------------------------------------------+
| Job finished at 2008-07-11 10:29:48 for bxj165 on the BlueBEAR Cluster
| Job required cput=00:00:00,mem=6356kb,vmem=151232kb,walltime=00:00:08
+--------------------------------------------------------------------------+

This is my standard error:

mpiexec: resolve_exe: using absolute path "/bb/cca/bxj165/emp".
mpiexec: process_start_event: evt 2 task 0 on u4n040.beowulf.cluster.
mpiexec: read_p4_master_port: waiting for port from master.
mpiexec: read_p4_master_port: got port 48865.
mpiexec: process_start_event: evt 6 task 3 on u4n052.beowulf.cluster.
mpiexec: process_start_event: evt 5 task 2 on u4n051.beowulf.cluster.
mpiexec: process_start_event: evt 4 task 1 on u4n050.beowulf.cluster.
mpiexec: process_start_event: evt 7 task 4 on u4n053.beowulf.cluster.
mpiexec: All 5 tasks (spawn 0) started.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 4 others.
mpiexec: process_obit_event: evt 10 task 1 on u4n050.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 11 task 4 on u4n053.beowulf.cluster stat 139.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 2 others.
mpiexec: process_obit_event: evt 3 task 0 on u4n040.beowulf.cluster stat 1.
mpiexec: process_obit_event: evt 9 task 2 on u4n051.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 8 task 3 on u4n052.beowulf.cluster stat 139.
mpiexec: Warning: task 0 exited with status 1.
mpiexec: Warning: tasks 1-4 exited with status 139.

So without explaining the technical details of the cluster (of which I
have no knowledge of anyhow, since I don't personally maintain it,
although I have sent an email to my support team), would anybody like
to suggest what the problem could be?

Thanks.
Ben.

Ben Jones

2008-07-11 10:31:19 UTC

Permalink

Update, the support at my university says it could be a faulty IB
switch and that its currently being looked in to. Does anybody know
what an IB switch actually is and how it coud bring about the below
problem?

Thanks again,
Ben.

Post by Ben Jones
Hi all,
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
It is being launched from the backend of a PBS script. This script
#!/bin/bash
#PBS -N allIn
#PBS -l nodes=5:ppn=1
#PBS -l walltime=00:01:00
#PBS -m bea
#PBS -o jobB_.out
#PBS -e jobB_.err
count=`cat $PBS_NODEFILE|wc -l`
echo $count
cd $PBS_O_WORKDIR
mpiexec -verbose -comm mpich-p4 /bb/cca/bxj165/emp
+--------------------------------------------------------------------------+
| Job starting at 2008-07-11 10:29:39 for bxj165 on the BlueBEAR Cluster
| Job identity jobid 347336 jobname allIn
| Job requests nodes=5:ppn=1,pvmem=5gb,walltime=00:01:00
| Job assigned to nodes u4n040 u4n050 u4n051 u4n052 u4n053
+--------------------------------------------------------------------------+
5
node 0: name u4n040.beowulf.cluster, cpu avail 1
node 1: name u4n050.beowulf.cluster, cpu avail 1
node 2: name u4n051.beowulf.cluster, cpu avail 1
node 3: name u4n052.beowulf.cluster, cpu avail 1
node 4: name u4n053.beowulf.cluster, cpu avail 1
rm_3776: p4_error: semget failed for setnum: 0
rm_13115: p4_error: semget failed for setnum: 0
p0_20491: p4_error: net_recv read: probable EOF on socket: 1
p0_20491: (8.218750) net_send: could not write to fd=4, errno = 32
rm_24596: p4_error: net_recv read: probable EOF on socket: 3
rm_26706: p4_error: net_recv read: probable EOF on socket: 3
+--------------------------------------------------------------------------+
| Job finished at 2008-07-11 10:29:48 for bxj165 on the BlueBEAR Cluster
| Job required cput=00:00:00,mem=6356kb,vmem=151232kb,walltime=00:00:08
+--------------------------------------------------------------------------+
mpiexec: resolve_exe: using absolute path "/bb/cca/bxj165/emp".
mpiexec: process_start_event: evt 2 task 0 on u4n040.beowulf.cluster.
mpiexec: read_p4_master_port: waiting for port from master.
mpiexec: read_p4_master_port: got port 48865.
mpiexec: process_start_event: evt 6 task 3 on u4n052.beowulf.cluster.
mpiexec: process_start_event: evt 5 task 2 on u4n051.beowulf.cluster.
mpiexec: process_start_event: evt 4 task 1 on u4n050.beowulf.cluster.
mpiexec: process_start_event: evt 7 task 4 on u4n053.beowulf.cluster.
mpiexec: All 5 tasks (spawn 0) started.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 4 others.
mpiexec: process_obit_event: evt 10 task 1 on u4n050.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 11 task 4 on u4n053.beowulf.cluster stat 139.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 2 others.
mpiexec: process_obit_event: evt 3 task 0 on u4n040.beowulf.cluster stat 1.
mpiexec: process_obit_event: evt 9 task 2 on u4n051.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 8 task 3 on u4n052.beowulf.cluster stat 139.
mpiexec: Warning: task 0 exited with status 1.
mpiexec: Warning: tasks 1-4 exited with status 139.
So without explaining the technical details of the cluster (of which I
have no knowledge of anyhow, since I don't personally maintain it,
although I have sent an email to my support team), would anybody like
to suggest what the problem could be?
Thanks.
Ben.

Alan Woodland

2008-11-27 11:29:54 UTC

Permalink

Update, the support at my university says it could be a faulty IB switch
and that its currently being looked in to. Does anybody know what an IB
switch actually is and how it coud bring about the below problem?

In this context IB most probably = InfiniBand switch - normally used for
Node-Node comms in HPC systems. A failure of that might cause MPI to
miss/drop some/all messages in a possibly permanent way, which could
explain what you're seeing.

Alan