Ovidiu Gheorghies
2008-06-17 10:48:24 UTC
Hello all,
I have installed the latest Open MPI from source on an Intel quad-core
machine running 64-bit Fedora 7, and I have a new problem with another
master/slave program (listed below).
The "master" repeatedly sends a 13-byte string ('\0' included) to a
"slave" (an MPI_Send/MPI_Recv pair is used), and the slave performs
some idle logic on the received string (i.e., strlen).
At ~30K messages the program stalls, as follows:
- the master calls MPI_Finalize but never returns from it
- the slave apparently hangs during message processing
Running `top' shows two processes running at first, and then only one
remains running, keeping its core at 100%.
When the "[PRINT]" line is commented out in the do_slave function, the
following is printed:
test_mpi, last compiled: Jun 17 2008 13:30:21
Send/receive 30000 messages...
Delta time: 0
--- FINALIZING rank 0...
[---> program hangs, only one process remains active]
When the "[PRINT]" line is active, the last printed message from
do_slave (in the format iteration/string-length) is around iteration
26000 (but varies between runs), e.g.
[---> more data here ]
26381/12 26382/12 26383/12 26384/12 26385/12 26386/12 26387/12
26388/12 26389/12 26390/12 26391/12 26392/12
[---> program hangs, same behavior as above ]
The problem is apparent when I compile and run with:
$ mpicc test_mpi.c -o test_mpi; mpirun -c 2 ./test_mpi
However, when I optimize the program with -O3, everything works fine:
$ test_mpi, last compiled: Jun 17 2008 13:40:24
Send/receive 3000000 messages...
--- FINALIZING rank 1...
Delta time: 3
--- FINALIZING rank 0...
--- FINALIZED rank 0.
--- FINALIZED rank 1.
What could I do to diagnose this problem? I'm not sure that -O3 would
fix the problem in all cases, as this might be an issue depending on
how long the "processing" of the message takes on the slave.
Thanks in advance,
Ovidiu
------------ CODE FOLLOWS -----------------
#include "mpi.h"
#include <time.h>
#include <stdio.h>
#include <string.h>

#define TOTAL_COUNT 30000

void do_master(int rank);
void do_slave(int rank);

int main(int argc, char**argv)
{
    int numtasks, rank;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        fprintf(stderr, "test_mpi, last compiled: %s %s\n", __DATE__,
                __TIME__);
        fprintf(stderr, "Send/receive %d messages...\n", TOTAL_COUNT);
        do_master(rank);
    }
    else if (rank == 1) {
        do_slave(rank);
    }

    fprintf(stderr, "--- FINALIZING rank %d...\n", rank);
    MPI_Finalize();
    fprintf(stderr, "--- FINALIZED rank %d.\n", rank);
    return 1;
}

void do_master(int rank)
{
    int tag = 1;
    char* buffer="ABCxyzABCxyz";
    int size = strlen(buffer) + 1;
    int count = 0;

    time_t t1 = time(0);
    while (count++ < TOTAL_COUNT) {
        MPI_Send(buffer, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    }
    time_t t2 = time(0);
    printf("Delta time: %d\n", (int)(t2-t1));
}

void do_slave(int rank)
{
    MPI_Status Stat;
    int tag = 1;
    char buffer[255];
    int count = 0;

    while (count++ < TOTAL_COUNT) {
        MPI_Recv(buffer, 13, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &Stat);
        int s, i;
        for(i=0; i<100; i++) {
            s = strlen(buffer);
        }
        // printf("%d/%d ", count, s); fflush(stdout); // [PRINT]
    }
}