s***@gmail.com
2009-04-10 08:55:30 UTC
Hi, people!
I am using MPI to solve a problem that involves an iterative method.
There is a 2D array of data over which some calculations are performed.
Since each process works with its own memory, each process works on
just a part of its own copy of the array. The array is split into
contiguous parts: the union of the parts makes up the whole array, and
the parts do not overlap.
After each iteration I have to update the array on each process, that
is, exchange the data of the changed parts. But I don't want to create
separate buffers just to send the parts of the array (that is too
difficult in my case), so I use a buffer the size of the whole array.
The source code below should make this clear.
Each process has two storage areas for the array: v and vbuf (both are
pointers). v holds the data of the process, and vbuf is the buffer used
for the exchange. Both arrays are contiguous.
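To make the layout concrete, the setup looks roughly like this (K is
the total number of doubles in the flattened 2D array; lo, hi and the
simple block split are placeholder names, not my exact code):

    #include <stdlib.h>
    #include <mpi.h>

    /* Illustrative setup only: allocate the working copy and the exchange
       buffer, and compute a simple block partition [lo, hi) for this rank. */
    void setup(int K, double **v, double **vbuf, int *lo, int *hi)
    {
        int pid, pn;
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &pn);

        *v    = malloc(K * sizeof(double));   /* working copy of the whole array   */
        *vbuf = malloc(K * sizeof(double));   /* whole-array-sized exchange buffer */

        /* process pid owns the index range [lo, hi) */
        *lo = pid * K / pn;
        *hi = (pid + 1) * K / pn;
    }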
This is the source of the Update function:
for (p = 0; p < pn; ++p)
{
    if (pid == p)
    {
        ...
        /* update vbuf only on the part that belongs to the current
           process pid: vbuf <-- v(partial), symbolically */
        ...
        MPI_Bcast(vbuf, K, MPI_DOUBLE, p, MPI_COMM_WORLD);
        /* K is the size of vbuf; vbuf is contiguous */
    }
    MPI_Barrier(MPI_COMM_WORLD);
    /* v <-- vbuf, symbolically: update the whole domain */
}
pid — process id,
pn — number of processes
p — loop variable
So, here is what should happen. First, the process with pid=0 copies
the part that belongs to it into vbuf and broadcasts vbuf to all other
processes. Now every process has the data of pid=0 in vbuf. Then pid=1
does the same thing, so every process has a copy of the data of pid=0
and pid=1. The remaining processes update their parts in vbuf and
broadcast in their turn. In the end, each process has the whole array.
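To restate that intent in code form (this is only a sketch of the data
flow, not my actual Update function; the block bounds are the same
placeholder names as above, and it needs <string.h> for memcpy):

    /* Sketch of the intended exchange: each process p in turn fills its
       own slice of vbuf, that slice is broadcast to everyone, and at the
       end the whole buffer is copied back into v. */
    for (p = 0; p < pn; ++p)
    {
        int lo = p * K / pn;
        int hi = (p + 1) * K / pn;

        if (pid == p)   /* the owner fills its slice of vbuf */
            memcpy(vbuf + lo, v + lo, (hi - lo) * sizeof(double));

        /* after this, everyone has the slice of process p in vbuf */
        MPI_Bcast(vbuf + lo, hi - lo, MPI_DOUBLE, p, MPI_COMM_WORLD);
    }
    memcpy(v, vbuf, K * sizeof(double));   /* v <-- vbuf: the whole domain */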
Now, the problem: the program behaves unpredictably. I compile once and
run the same program several times using just 2 nodes (2 processes).
Sometimes the MPI_Bcast works fine, sometimes it fails and the program
finishes with "received signal SEGV (core dumped)". And this is with
just one iteration.
It seems to me that one process may run faster than the other and
something goes wrong (but that seems unlikely with MPI_Bcast, because,
as I understand it, it is a blocking function).
I don't know how to debug this properly. I use printf-debugging, but
with such unpredictable behaviour I don't know what to do.
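By printf-debugging I mean something like this (just an example,
tagging every message with the rank and flushing right away so the
output is not lost when a process dies):

    /* example of the tracing I use: tag each line with the rank and
       flush immediately so the output survives a crash */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] before MPI_Bcast, p = %d\n", rank, p);
    fflush(stdout);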
Need help.