Discussion:
LATENCY & DATA TRANSFER TIME in SHARED MEMORY ???
kis
2007-10-17 20:14:54 UTC
Hi All,

I use a cluster which consists of 8 processors with a single 32GB
memory being shared among the processors. Does a system like that
spend time on LATENCY and DATA TRANSFER when the processors in the
system are communicating the data using the MPI routine (e.g.
MPI_SEND) ?

Thanks,
IRFAN
Sebastian Hanigk
2007-10-18 07:47:06 UTC
Post by kis
I use a cluster which consists of 8 processors with a single 32GB
memory being shared among the processors. Does a system like that
spend time on LATENCY and DATA TRANSFER when the processors in the
system are communicating the data using the MPI routine (e.g.
MPI_SEND) ?
It depends on the MPI implementation. A clever one will use
shared-memory access and you should see transfer rates (asymptotically
for large transfer units) comparable to memcpy/bcopy or the like. As for
the latency, you will have to deal with the MPI overhead. From my
experience it's usually better to use a threading model, be it OpenMP or
pthreads, on shared-memory machines.
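
If you want to see both effects yourself, a ping-pong micro-benchmark is
the usual tool: small messages expose the per-message latency, large
messages should approach the node's memory-copy bandwidth when both ranks
sit on the same node. A minimal sketch (file name and iteration count are
made up, results depend on your MPI implementation):

/* Hypothetical ping-pong micro-benchmark between two MPI ranks.
 * Compile with e.g. "mpicc pingpong.c -o pingpong" (name made up)
 * and run with exactly two ranks on one node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, bytes, i, iters = 1000;
    char *buf;
    double t0, t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (bytes = 1; bytes <= (1 << 20); bytes *= 2) {
        buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %8.1f MB/s\n",
                   bytes, t * 1e6, bytes / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}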


Sebastian
kis
2007-10-18 12:05:09 UTC
Thanks, Sebastian, for your comment.

The reason I asked this question is that my program performed better
when running on 1 node with several processors sharing memory, and it
degraded when I distributed it across two nodes, for instance.

Just wondering: I quite often come across the term "vector
multiprocessors" in MPI implementations. Does it imply anything about
memory? Is it a cluster with several processors which share common
memory?

Thanks
Irfan
Post by Sebastian Hanigk
Post by kis
I use a cluster which consists of 8 processors with a single 32GB
memory being shared among the processors. Does a system like that
spend time on LATENCY and DATA TRANSFER when the processors in the
system are communicating the data using the MPI routine (e.g.
MPI_SEND) ?
It depends on the MPI implementation. A clever one will use
shared-memory access and you should see transfer rates (asymptotically
for large transfer units) comparable to memcpy/bcopy or the like. As for
the latency, you will have to deal with the MPI overhead. From my
experience it's usually better to use a threading model, be it OpenMP or
pthreads, on shared-memory machines.
Sebastian
Sebastian Hanigk
2007-10-19 08:12:15 UTC
Post by kis
The reason I asked this question is that my program performed better
when running on 1 node with several processors sharing memory, and it
degraded when I distributed it across two nodes, for instance.
Why is this astonishing? Even with a non-optimised MPI implementation,
you would benefit from running on a single node; processes on different
nodes have to communicate via the internode network which usually has a
much smaller bandwidth and higher latency.
Post by kis
Just wondering: I quite often come across the term "vector
multiprocessors" in MPI implementations. Does it imply anything about
memory? Is it a cluster with several processors which share common
memory?
You have to look at two aspects here: are your nodes single-processor or
SMP systems and what kind of CPU is used. In most high-performance
computers today, one builds a cluster of SMP-nodes (usually two to eight
cores per node), i.e. you have a shared-memory programming model inside
the node and a distributed model between them.
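
As a rough illustration of that hybrid model (a sketch only; array size,
file name and compile line are made up), one MPI process per node can use
OpenMP threads inside the node and message passing between nodes:

/* Hypothetical hybrid sketch: each process sums its own slice of an
 * array with shared-memory threads, then the per-node results are
 * combined across nodes with message passing.
 * Compile with e.g. "mpicc -fopenmp hybrid.c -o hybrid". */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 1000000            /* elements owned by each node */

static double x[N_LOCAL];

int main(int argc, char **argv)
{
    double local = 0.0, global = 0.0;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory part: the threads of one node all work on x[] */
#pragma omp parallel for reduction(+:local)
    for (i = 0; i < N_LOCAL; i++) {
        x[i] = 1.0;                /* stand-in for real per-node work */
        local += x[i];
    }

    /* distributed-memory part: explicit communication between nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f, threads per node = %d\n",
               global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}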

Now onto the CPU architecture. A vector architecture is distinguished
by a rather large number of registers and high-bandwidth direct memory
access; on the other hand you have cache-based architectures with a
typically smaller register count. The first kind is called a vector
architecture because the machine can apply a single instruction to a
whole range of registers (a vector ...) at the same time (look at
NEC's SX series for example), whereas "off-the-shelf" CPUs from AMD or
Intel have to work on every item of the data vector in a more serial
fashion. Of course, the distinction is not so clear-cut nowadays
because newer processor types like the Itanium have a rather large
register set, too, and SSE is an attempt to provide vectorising
support.
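
A toy example of the difference (a sketch only, using SSE intrinsics; it
assumes the arrays are 16-byte aligned and that n is a multiple of 4):

/* Hypothetical scalar-versus-vector illustration: the scalar loop
 * issues one add per element, the SSE loop processes four packed
 * floats per instruction. */
#include <xmmintrin.h>

void add_scalar(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];                   /* one element at a time */
}

void add_sse(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);       /* load 4 packed floats */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));  /* 4 additions at once */
    }
}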

To return to your question: a vector multiprocessor would most probably
mean an SMP node consisting of vector CPUs.


Hope that helps,

Sebastian
kis
2007-10-19 20:41:52 UTC
Thanks, for the info.

Irfan
Post by Sebastian Hanigk
Post by kis
The reason I asked this question is that my program performed better
when running on 1 node with several processors sharing memory, and it
degraded when I distributed it across two nodes, for instance.
Why is this astonishing? Even with a non-optimised MPI implementation,
you would benefit from running on a single node; processes on different
nodes have to communicate via the internode network which usually has a
much smaller bandwidth and higher latency.
Post by kis
Just wondering: I quite often come across the term "vector
multiprocessors" in MPI implementations. Does it imply anything about
memory? Is it a cluster with several processors which share common
memory?
You have to look at two aspects here: are your nodes single-processor or
SMP systems and what kind of CPU is used. In most high-performance
computers today, one builds a cluster of SMP-nodes (usually two to eight
cores per node), i.e. you have a shared-memory programming model inside
the node and a distributed model between them.
Now onto the CPU architecture. A vector architecture is distinguished
by a rather large number of registers and high-bandwidth direct memory
access; on the other hand you have cache-based architectures with a
typically smaller register count. The first kind is called a vector
architecture because the machine can apply a single instruction to a
whole range of registers (a vector ...) at the same time (look at
NEC's SX series for example), whereas "off-the-shelf" CPUs from AMD or
Intel have to work on every item of the data vector in a more serial
fashion. Of course, the distinction is not so clear-cut nowadays
because newer processor types like the Itanium have a rather large
register set, too, and SSE is an attempt to provide vectorising
support.
To return to your question: a vector multiprocessor would most probably
mean an SMP node consisting of vector CPUs.
Hope that helps,
Sebastian
Joachim Worringen
2007-10-23 11:23:34 UTC
Post by Sebastian Hanigk
Post by kis
The reason I asked this question is that my program performed better
when running on 1 node with several processors sharing memory, and it
degraded when I distributed it across two nodes, for instance.
Why is this astonishing? Even with a non-optimised MPI implementation,
you would benefit from running on a single node; processes on different
nodes have to communicate via the internode network which usually has a
much smaller bandwidth and higher latency.
Bandwidth-demanding applications often run better on multiple nodes
than on a single node with inadequate aggregate memory bandwidth. Both
memory bandwidth and network latency/bandwidth can be bottlenecks.
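
To see what "bandwidth-demanding" means in practice, a STREAM-triad-like
kernel is the classic test (a rough sketch; array size and the timer are
made up, and running one copy per core shows how soon a node's memory
system saturates):

/* Hypothetical STREAM-triad-style kernel: it does almost no arithmetic
 * per byte moved, so its speed is limited by memory bandwidth rather
 * than by the CPU. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L                /* ~160 MB per array of doubles */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0, secs;
    long i;
    clock_t t0;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = clock();
    for (i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];          /* triad: 2 loads, 1 store */
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* three arrays of doubles are touched once each */
    printf("triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}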

Joachim
Sebastian Hanigk
2007-10-23 15:17:37 UTC
Post by Joachim Worringen
Bandwidth-demanding applications often run better on multiple nodes
than on a single node with inadequate aggregate memory bandwidth. Both
memory bandwidth and network latency/bandwidth can be bottlenecks.
I think that would be an architectural flaw of the multiprocessor
nodes. It may not be a large sample of existing systems, but on every
(super-)computer I have been able to run codes on, intranode
communication has never been a problem, whereas the internode
connection (especially without a hybrid programming model) usually
hits its bandwidth limit very soon, be it Gigabit Ethernet, Myrinet or
InfiniBand.

Could you share experiences with bandwidth-limited codes?


Sebastian
Joachim Worringen
2007-11-14 15:35:20 UTC
Post by Sebastian Hanigk
Post by Joachim Worringen
Bandwidth-demanding applications often run better on multiple nodes
than on a single node with inadequate aggregate memory bandwidth. Both
memory bandwidth and network latency/bandwidth can be bottlenecks.
I think that would be an architectural flaw of the multiprocessor
nodes. It may not be a large sample of existing systems, but on every
(super-)computer I have been able to run codes on, intranode
communication has never been a problem, whereas the internode
connection (especially without a hybrid programming model) usually
hits its bandwidth limit very soon, be it Gigabit Ethernet, Myrinet or
InfiniBand.
I was not referring to intra-node communication, but to the available
memory bandwidth for each process running on a node in relation to the
inter-node communication overhead.
Post by Sebastian Hanigk
Could you share experiences with bandwidth-limited codes?
Lots of codes are bandwidth-limited more than anything else. CFD is
typically in this category.

Joachim
Greg Lindahl
2007-11-16 23:41:00 UTC
Post by Joachim Worringen
Lots of codes are bandwidth-limited more than anything else. CFD is
typically in this category.
CFD is a huge field involving many algorithms; the one I use in
astronomy is incredibly cache friendly.

-- greg
TheMask
2008-01-14 13:12:31 UTC
How do I read the cache size on Windows systems?
Post by Greg Lindahl
Post by Joachim Worringen
Lots of codes are bandwidth-limited more than anything else. CFD is
typically in this category.
CFD is a huge field involving many algorithms; the one I use in
astronomy is incredibly cache friendly.
-- greg
Michael Hofmann
2008-01-14 14:08:55 UTC
Post by TheMask
how to read cache size on windows os systems?
http://monetdb.cwi.nl/Calibrator/
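
If you only need the sizes rather than measured latencies, Windows can
also report them directly via GetLogicalProcessorInformation (available
from Windows XP SP3 / Server 2003 onwards). A rough sketch, with only
minimal error handling:

/* Hypothetical sketch: ask the OS for the cache hierarchy instead of
 * measuring it. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info;
    DWORD len = 0, count, i;

    /* first call only reports the required buffer size */
    GetLogicalProcessorInformation(NULL, &len);
    info = malloc(len);

    if (!GetLogicalProcessorInformation(info, &len)) {
        fprintf(stderr, "query failed\n");
        return 1;
    }

    count = len / sizeof(*info);
    for (i = 0; i < count; i++) {
        if (info[i].Relationship == RelationCache) {
            CACHE_DESCRIPTOR *c = &info[i].Cache;
            printf("L%u cache: %lu KB, line size %u bytes\n",
                   (unsigned)c->Level,
                   (unsigned long)(c->Size / 1024),
                   (unsigned)c->LineSize);
        }
    }
    free(info);
    return 0;
}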
