problem running jobs
mlm
2008-03-09 01:25:53 UTC
I'm using torque and mpich2 on a 64 node cluster

I am submitting a job with qsub that calls mpiexec. My
intent is to have the same program run on 2 different nodes. These
two programs will then talk to each other using MPI. Here are some of
my settings:

#PBS -l nodes=2:ppn=1
...
mpiexec -l -n 2 myprog

In my script, I print out the PBS_NODEFILE (using cat $PBS_NODEFILE),
and I'm seeing that two nodes are listed. However, both processes are
placed on one node (the first node listed in PBS_NODEFILE). The
programs run without any problems, but I just want them to run on
separate nodes due to their memory usage.

Can anyone help me out? Is this an mpich or pbs problem?

Thanks
Michael Hofmann
2008-03-10 08:30:53 UTC
Post by mlm
Can anyone help me out? Is this an mpich or pbs problem?
It might be MPICH. Have you started the process manager daemon "mpd"
(with mpdboot)?

Have a look at the MPICH2 FAQ: "Q: What determines the hosts on which
my MPI processes run?"

http://www.mcs.anl.gov/research/projects/mpich2/support/index.php?s=faqs#whererun


Michael
Javier Vazquez
2008-03-10 14:45:17 UTC
If you are using mpiexec, perhaps you can use the following command:

mpiexec -n 1 -host node1 myprog : -n 1 -host node2 myprog

This will spawn one task per node. See the documentation for more details.
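
Inside a Torque job the host names do not have to be typed by hand; they can be read from $PBS_NODEFILE. The following is only a sketch of that idea. The helper name per_host_args is hypothetical, and it assumes an mpd is already running on every host listed in the node file.

```shell
#!/bin/sh
# Print mpiexec arguments that place one copy of a program on each
# distinct host in a node file (per_host_args is a hypothetical helper).
per_host_args() {
    nodefile=$1
    prog=$2
    sep=""
    for h in $(sort -u "$nodefile"); do
        printf '%s-n 1 -host %s %s' "$sep" "$h" "$prog"
        sep=" : "
    done
    printf '\n'
}

# Only launch inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # The substitution is left unquoted on purpose so that the shell
    # splits it into separate mpiexec arguments.
    mpiexec $(per_host_args "$PBS_NODEFILE" myprog)
fi
```

For a node file that lists node1 and node2, the helper prints the same argument list as the command above.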


Regards,
Javier
mlm
2008-03-12 18:24:42 UTC
Thanks for the help, guys. I think the problem was that I wasn't
starting mpd on the correct nodes. When I executed the mpdboot
command, it was only starting mpd on 2 nodes because I (incorrectly)
was using the '-n 2' argument. The two nodes that had mpd started on
them didn't match the two nodes in $PBS_NODEFILE. So when mpiexec
read $PBS_NODEFILE and tried to use those nodes, they didn't have
an mpd running on them.

When I instructed mpdboot to start mpd's on all of the nodes, my
problem was fixed.
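
For anyone who finds this thread later, here is a rough sketch of a job script that starts the mpd ring on exactly the hosts Torque allocated, rather than on a fixed number of hosts. It assumes the MPD process manager from MPICH2 (mpdboot, mpdtrace, mpdallexit, and an mpiexec that accepts -machinefile). The helper name mpd_hosts and the temporary file path are hypothetical, and myprog is the program from the original post.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=1

# Write the distinct hosts from a node file ($1) to a host file ($2)
# and print how many there are (mpd_hosts is a hypothetical helper).
mpd_hosts() {
    sort -u "$1" > "$2"
    wc -l < "$2" | tr -d ' '
}

# Only run the launch steps inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    hosts="${TMPDIR:-/tmp}/mpd.hosts.$$"
    nhosts=$(mpd_hosts "$PBS_NODEFILE" "$hosts")

    mpdboot -n "$nhosts" -f "$hosts"   # one mpd per allocated host
    mpdtrace                           # hosts in the ring; compare with the node file
    mpiexec -l -machinefile "$PBS_NODEFILE" -n 2 myprog
    mpdallexit                         # shut the ring down when the job is done
    rm -f "$hosts"
fi
```

Deriving the -n value for mpdboot from the node file, instead of hard-coding it, keeps the ring and the allocation in step when the nodes= request changes.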
