Ben Jones
2008-07-11 10:23:56 UTC
Hi all,
I have the following test code:
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int rank, size;

    MPI_Init (&argc, &argv); /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
    printf( "Hello world from process %d of %d\n", rank, size );
    MPI_Finalize (); /* shuts down MPI before exiting */
    return 0;
}
It is launched from a PBS script, which looks as follows:
#PBS -N allIn
#PBS -l nodes=5:ppn=1
#PBS -l walltime=00:01:00
#PBS -m bea
#PBS -M ***@omitted
#PBS -o jobB_.out
#PBS -e jobB_.err
count=`cat $PBS_NODEFILE|wc -l`
echo $count
mpiexec -verbose -comm mpich-p4 /bb/cca/bxj165/emp
Here are the job's standard output and standard error. Standard output:
| Job starting at 2008-07-11 10:29:39 for bxj165 on the BlueBEAR Cluster
| Job identity jobid 347336 jobname allIn
| Job requests nodes=5:ppn=1,pvmem=5gb,walltime=00:01:00
| Job assigned to nodes u4n040 u4n050 u4n051 u4n052 u4n053
node 0: name u4n040.beowulf.cluster, cpu avail 1
node 1: name u4n050.beowulf.cluster, cpu avail 1
node 2: name u4n051.beowulf.cluster, cpu avail 1
node 3: name u4n052.beowulf.cluster, cpu avail 1
node 4: name u4n053.beowulf.cluster, cpu avail 1
rm_3776: p4_error: semget failed for setnum: 0
rm_13115: p4_error: semget failed for setnum: 0
p0_20491: p4_error: net_recv read: probable EOF on socket: 1
p0_20491: (8.218750) net_send: could not write to fd=4, errno = 32
rm_24596: p4_error: net_recv read: probable EOF on socket: 3
rm_26706: p4_error: net_recv read: probable EOF on socket: 3
| Job finished at 2008-07-11 10:29:48 for bxj165 on the BlueBEAR Cluster
| Job required cput=00:00:00,mem=6356kb,vmem=151232kb,walltime=00:00:08
This is my standard error:
mpiexec: resolve_exe: using absolute path "/bb/cca/bxj165/emp".
mpiexec: process_start_event: evt 2 task 0 on u4n040.beowulf.cluster.
mpiexec: read_p4_master_port: waiting for port from master.
mpiexec: read_p4_master_port: got port 48865.
mpiexec: process_start_event: evt 6 task 3 on u4n052.beowulf.cluster.
mpiexec: process_start_event: evt 5 task 2 on u4n051.beowulf.cluster.
mpiexec: process_start_event: evt 4 task 1 on u4n050.beowulf.cluster.
mpiexec: process_start_event: evt 7 task 4 on u4n053.beowulf.cluster.
mpiexec: All 5 tasks (spawn 0) started.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 4 others.
mpiexec: process_obit_event: evt 10 task 1 on u4n050.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 11 task 4 on u4n053.beowulf.cluster stat 139.
mpiexec: wait_tasks: waiting for u4n040.beowulf.cluster and 2 others.
mpiexec: process_obit_event: evt 3 task 0 on u4n040.beowulf.cluster stat 1.
mpiexec: process_obit_event: evt 9 task 2 on u4n051.beowulf.cluster stat 139.
mpiexec: process_obit_event: evt 8 task 3 on u4n052.beowulf.cluster stat 139.
mpiexec: Warning: task 0 exited with status 1.
mpiexec: Warning: tasks 1-4 exited with status 139.
I can't give the technical details of the cluster, since I don't
maintain it and don't know them myself (I have sent an email to my
support team). Even so, can anybody suggest what the problem might be?