Discussion:
sending failing?
gudny
2008-06-07 17:48:01 UTC
Hello
Lately, strange things have been happening to my program, and I hope
someone can give advice.
I have written a program that needs to send large amounts of data,
sometimes a third of the whole memory. That could probably be done in
a better way, but for a long time it worked fine. Now I have two
problems that might be related:
1) Sometimes nodes die when I am running my program on them. The
program spends basically all its time in the same for loop, so I
cannot tie the failure to any particular place in the program. The log
files contain no information.
2) This is the output of a run; in every iteration the real and
imaginary parts of an integral are written. After 300,000 iterations of
exchanging the borders and calculating correctly, the imaginary part
goes crazy:
time                  real part          imaginary part
.
.
.
32873.69999980212015  -0.01464190666462  -0.05132513592074
32873.79999980211869  -0.01470085544712  1569150716050824235461423375058804236368891348615907757856539862127342115243281091050882922531907468433171923942498361778050972528904800702644869914427392.00000000000000

My system administrator thinks that the dying nodes might have to do
with some kind of buffer overload. I am using sendrecv. Would it be
better to use a blocking send? Is it possible to say that I do not want
the message to be buffered?
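For illustration only: one common way to reduce the pressure a single very large message puts on internal buffers is to exchange it in smaller pieces. The sketch below is plain C and only plans the piece sizes; the helper names next_chunk and chunk_calls and the max_chunk parameter are hypothetical, and whether message size has anything to do with the dying nodes is not established in this thread. The one MPI fact it relies on is that the count arguments of MPI_Sendrecv are of type int.

```c
#include <stddef.h>

/* Number of elements to pass in the next call, given how many remain.
 * max_chunk is a tunable assumption; it must be at least 1 and is an int
 * because the count arguments of MPI_Sendrecv are ints. */
int next_chunk(size_t remaining, int max_chunk)
{
    return remaining > (size_t)max_chunk ? max_chunk : (int)remaining;
}

/* Number of calls needed to move 'total' elements in pieces of at most
 * max_chunk elements (written to avoid overflow for very large totals). */
size_t chunk_calls(size_t total, int max_chunk)
{
    size_t m = (size_t)max_chunk;
    return total / m + (total % m != 0 ? 1 : 0);
}

/* Intended use (not compiled here, needs <mpi.h>; buffer and peer names
 * are placeholders):
 *
 *   for (size_t off = 0; off < total; off += (size_t)n) {
 *       n = next_chunk(total - off, max_chunk);
 *       MPI_Sendrecv(sendbuf + off, n, MPI_DOUBLE, dest, tag,
 *                    recvbuf + off, n, MPI_DOUBLE, source, tag,
 *                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 *   }
 */
```

Both partners have to use the same total and the same max_chunk so that every piece sent is matched by a receive of the same size.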
The strange thing is that most of the time the program calculates
correctly and all nodes stay alive. But every once in a while a node
dies, and now, for the first time, I got this very strange result.
best wishes
gudny
Georg Bisseling
2008-06-10 08:34:55 UTC
Have you considered debugging your program?

You do not specify what MPI you use, but
chances are high that the MPI code went through
longer and more thorough testing and revising
than yours.

I suggest using a tool like valgrind to check
your code for errors. When debugging an MPI
code it is much simpler to use a shared memory device
so that all ranks run on the same machine.
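Alongside a memory checker, a cheap sanity check inside the loop can help pin down the first iteration in which a value goes wrong. The following is a minimal sketch in plain C, not code from the program under discussion: the function names and the limit parameter are hypothetical, and limit has to be chosen as a bound the integral can never legitimately exceed. Note that a value around 1.5e153 is still a finite double, so a test for NaN or infinity alone would not flag it.

```c
#include <math.h>
#include <stdio.h>

/* Returns 1 if v is finite and lies within [-limit, limit], else 0.
 * 'limit' is a problem-specific assumption supplied by the caller. */
int value_is_plausible(double v, double limit)
{
    return isfinite(v) && v <= limit && v >= -limit;
}

/* Checks both parts of one iteration's result. On failure it prints the
 * rank and iteration so the first bad step can be located, and returns 0.
 * Returns 1 if both parts pass. */
int check_integral(int rank, long iter, double re, double im, double limit)
{
    if (value_is_plausible(re, limit) && value_is_plausible(im, limit))
        return 1;
    fprintf(stderr,
            "rank %d, iteration %ld: implausible value re=%g im=%g\n",
            rank, iter, re, im);
    return 0;
}
```

Calling check_integral once per iteration on every rank, and stopping the run when it returns 0, shows which rank and which iteration first produced the bad value, which narrows down where to look with the debugger or valgrind.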

ciao
Georg