Discussion:
net_conn_to_listener failed????? Need help with error message
Saville
2008-03-05 01:02:31 UTC
Hi all,

I have a two node cluster running LINUX.

Each node has a full OS

Both nodes have the same username.

ssh has been set up so that the Master can log into the same username on the
slave without using a password.
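(For reference, a passwordless setup like this is typically done with OpenSSH keys, roughly as follows; "flashman" and "erik" are simply the names that come up later in this thread:)

$ ssh-keygen -t rsa            # accept the default file, leave the passphrase empty
$ ssh-copy-id flashman@erik    # appends the public key to ~/.ssh/authorized_keys on the slave
$ ssh flashman@erik hostname   # should print the slave's hostname without prompting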

The machine file has been edited to include the slave node (erik).

The /etc/hosts file was updated to include the IP addr/name of the slave.
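(For illustration, the two files end up looking roughly like this; the addresses are made up:)

# machine file: one host per line, master first
csc10
erik

# /etc/hosts on both nodes
192.168.0.10   csc10
192.168.0.11   erik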

Under that username on both nodes, there is a subdirectory that contains the
mpich install and I'm trying to run the test program called "cpi".

I have a window open on the slave and I am running "top" on it.

When I run the mpirun command, I see cpi starting on the slave (in top), and
then I get the following error message:


$ ../../bin/mpirun -np 2 cpi
rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
p0_12586: p4_error: Child process exited while making connection to remote
process on erik: 0
p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32

Can anyone give me a pointer to some information that would help me figure
out what the problem is?

thanks
Georg Bisseling
2008-03-06 13:57:20 UTC
Post by Saville
ssh has been set up so that the Master can log into the same username on the
slave without using a password.
That should be possible in both directions, just to be safe.
Post by Saville
The machine file has been edited to include the slave node (erik).
The /etc/hosts file was updated to include the IP addr/name of the slave.

On both nodes? Consider having a DNS server.
Post by Saville
Under that username on both nodes, there is a subdirectory that contains the
mpich install and I'm trying to run the test program called "cpi".
Why don't you use NFS to make it absolutely certain you have identical software?
(Same machine file etc.)
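(A minimal sketch of that, assuming the master exports /home with the stock nfs-utils and the slave mounts it; the network below is illustrative:)

# on the master, /etc/exports
/home   192.168.0.0/24(rw,sync)
# then, as root on the master
exportfs -ra
# and, as root on the slave, ideally before logging in as the MPI user
mount -t nfs csc10:/home /home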
Post by Saville
I have a window open on the slave and I am running "top" on it.
When I run the mpirun command, I see cpi starting on the slave (in top), and
$ ../../bin/mpirun -np 2 cpi
rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
p0_12586: p4_error: Child process exited while making connection to remote
process on erik: 0
p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32
Can anyone give me a pointer to some information that would help me figure
out what the problem is?
thanks
What MPI are you using and what device does it use?
If you configured and compiled it yourself you will
have to look into the respective log files to find
out which config was chosen.

The "p4" stuff indicates MPICH or MPICH2.

Maybe it uses ssh to start up the program on the other
node - but maybe it uses a set of daemons to do it, and
those must be started before mpirun will work.

RTFM and follow the instructions!

What Linux do you use? Does it come with restrictive
firewall software that prevents "erik" from being a
TCP server? My opinion is that firewall software should
not be present on a cluster...
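(On Fedora the stock packet filter is iptables; a quick check, and a temporary way to rule it out, would be something like this as root on both nodes:)

iptables -L -n            # list the current rules; an empty INPUT chain means nothing is filtered
service iptables status   # Fedora init-script wrapper
service iptables stop     # for a test only -- turn it back on afterwards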

--

Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Saville
2008-03-08 02:20:07 UTC
Post by Georg Bisseling
Post by Saville
ssh has been set up so that the Master can log into the same username on the
slave without using a password.
That should be possible in both directions, just to be safe.
It is. I can ssh from one node to the other.
Post by Georg Bisseling
Post by Saville
The machine file has been edited to include the slave node (erik).
The /etc/hosts file was updated to include the IP addr/name of the slave.
On both nodes? Consider having a DNS server.
Yes on both nodes.
Post by Georg Bisseling
Post by Saville
Under that username on both nodes, there is a subdirectory that contains the
mpich install and I'm trying to run the test program called "cpi".
Why don't you use NFS to make it absolutely certain you have identical
software? (Same machine file etc.)
But is this required? I can issue the following command on the Master and it
works (csc10 is the Master and csc11 is the slave):

ssh csc11 /home/flashman/mpich-1.2.7p1/examples/basic/cpi
This works just fine.
Post by Georg Bisseling
Post by Saville
I have a window open on the slave and I am running "top" on it.
When I run the mpirun command, I see cpi starting on the slave (in top),
$ ../../bin/mpirun -np 2 cpi
rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
p0_12586: p4_error: Child process exited while making connection to
remote process on erik: 0
p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32
Can anyone give me a pointer to some information that would help me
figure out what the problem is?
thanks
What MPI are you using and what device does it use?
mpich-1.2.7p1

I take the default device which seems to be ch_p4. That's what it says
inside of mpirun:

DEFAULT_DEVICE=ch_p4
RSHCOMMAND="ssh"
Post by Georg Bisseling
If you configured and compiled it yourself you will
have to look into the respective log files to find
out which config was chosen.
The "p4" stuff indicates MPICH or MPICH2.
Maybe it uses ssh to startup the program on the other
node - but maybe it uses a set of demons to do it and
that must be started before mpirun would work.
RTFM and follow the instructions!
I did and I did.
Post by Georg Bisseling
What Linux do you use?
Fedora Core 8 on both nodes.
Post by Georg Bisseling
Does it come with restrictive
firewall software that prevents "erik" from being a
TCP server? My opinion is that firewall software should
not be present on a cluster...
I selected no firewall.

I really would like to know what the error message means or where it comes
from:

rm_23914: p4_error: rm_start: net_conn_to_listener failed: 41488
p0_30790: p4_error: Child process exited while making connection to remote
process on csc11: 0
p0_30790: (11.355131) net_send: could not write to fd=4, errno = 32

thanks
Georg Bisseling
2008-03-08 16:56:07 UTC
On 08.03.2008 at 03:20, Saville <***@comcast.net> wrote:

Excuse my long list of gotchas. Seems you did everything right.
I never ran into error messages exactly like yours.

To make the p4 device work you need the fully
qualified host names in the machine file. I experienced weird
effects otherwise: incomprehensible mapping of processes to nodes.
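(So, as an example with a made-up domain, the machine file would carry entries like:)

csc10.example.edu
csc11.example.edu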

If you want to avoid NFS then you will have to take great care
that the processes on both ranks will start in the same working
directory using a compatible (not necessarily identical) set
of environment variables etc. NFS with shared config files
just makes that much easier. You may call it superstition,
but I am quite sure that your setup is not covered by Argonne's
regular tests.
Post by Saville
I really would like to know what the error message means or where it comes
from:
rm_23914: p4_error: rm_start: net_conn_to_listener failed: 41488
p0_30790: p4_error: Child process exited while making connection to remote
process on csc11: 0
p0_30790: (11.355131) net_send: could not write to fd=4, errno = 32

Whether you like it or not, the best place to look for an explanation
might be the source of the p4 device and the functions rm_start
and net_conn_to_listener. And in the logs of the remote machine.
Did you have a look into /var/log/messages?
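(Assuming the source tree is still unpacked under the home directory, a quick way to locate those functions is simply:)

$ grep -rn net_conn_to_listener ~/mpich-1.2.7p1
$ grep -rn rm_start ~/mpich-1.2.7p1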

But the fact that the error message mentions a child process
(presumably of mpirun) that cannot connect to a listener seems
to indicate that mpirun does not use ssh to start the remote
processes but tries to connect to a locally running daemon.
But that would be the p4mpd device. Weird.

There is a dedicated chp4 user's guide
ftp://info.mcs.anl.gov/pub/tech_reports/reports/ANL9217.ps.Z
maybe that can help.

You can make the p4 device more verbose by saying
mpirun -np 2 myprog -p4dbg 20 -p4rdbg
the exact meaning of these flags is explained in the user's guide mentioned above.

BTW: using the daemons gives you a much faster startup.
BTW2: mpich-1.2.7 is no longer maintained; if there are no
backward-compatibility concerns I would always recommend
OpenMPI.

Good Luck!
Georg


--

This signature was left intentionally almost blank.
http://www.this-page-intentionally-left-blank.org/
Georg Bisseling
2008-03-08 17:00:16 UTC
Post by Georg Bisseling
mpirun -np 2 myprog -p4dbg 20 -p4rdbg
It has to be:
mpirun -np 2 myprog -p4dbg 20 -p4rdbg 20
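(For the cpi case from this thread that would be, for example:)

$ ../../bin/mpirun -np 2 cpi -p4dbg 20 -p4rdbg 20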
Saville
2008-03-11 22:30:47 UTC
Post by Georg Bisseling
Excuse my long list of gotchas. Seems you did everything right.
I never ran into error messages exactly like yours.
I found the problem:

I opened up the firewall on the Master and everything worked.

So now I need to find what ports are used by MPI and only open those.
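(The p4 listener ports are not fixed - the numbers in the error messages above look like dynamically chosen ports - so one sketch, not from this thread, is to accept everything coming from the slave's address instead of hunting for individual ports; the address below is made up:)

iptables -I INPUT -s 192.168.0.11 -j ACCEPT   # as root on the master
service iptables save                          # keep the rule across restarts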

Thanks for all the help.
Georg Bisseling
2008-03-13 11:47:47 UTC
Post by Saville
So now I need to find what ports are used by MPI and only open those.
Thanks for all the help.
One route to ease the pain is to have two network
cards in the master: one outbound with the firewall
watching over it and one inbound to the other cluster
nodes that is considered internal==harmless by the
firewall. The master then acts as a gateway and router
for the cluster nodes.

Configuring that can be much easier, and a Fast Ethernet
card costs about $5.
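(A rough sketch of that, assuming eth0 faces the outside world and eth1 faces the cluster, run as root on the master:)

sysctl -w net.ipv4.ip_forward=1                        # let the master route between the two interfaces
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE   # NAT the cluster's outbound traffic
iptables -I INPUT -i eth1 -j ACCEPT                    # trust everything arriving on the internal interface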


Cheers
Georg
Saville
2008-03-15 13:52:33 UTC
Post by Georg Bisseling
Post by Saville
So now I need to find what ports are used by MPI and only open those.
Thanks for all the help.
One route to ease the pain is to have two network
cards in the master: one outbound with the firewall
watching over it and one inbound to the other cluster
nodes that is considered internal==harmless by the
firewall. The master then acts as a gateway and router
for the cluster nodes.
Configuring that can be much easier, and a Fast Ethernet
card costs about $5.
Hi Georg,

I already have two Ethernet cards in my Master. However, I didn't see
anything in the MPICH install document that allowed me to specify which
Ethernet card it should use.

I'd very much like to use that second card to isolate the cluster.

thanks

Gregg
Georg Bisseling
2008-03-16 22:12:49 UTC
Post by Saville
Hi Georg,
I already have two ethernet cards in my Master. However I didn't see
anyting in the MPICH install document that allowed me to specify which
ethernet card it should use.
I'd very much like to use that second card to isolate the cluster.
thanks
Gregg
Quite simple: put the two cards in the same IP network,
name the IP addresses (in /etc/hosts for a start) and put
the respective names in your machine file.
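(Concretely, with made-up internal addresses, that might look like:)

# /etc/hosts on both nodes
192.168.10.1   csc10-int
192.168.10.2   csc11-int

# machine file
csc10-int
csc11-int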
--
This signature was left intentionally almost blank.
http://www.this-page-intentionally-left-blank.org/