Discussion:
How does MPI_Bcast work?
l***@yahoo.com
2008-02-02 14:25:11 UTC
Hi all,
I would like to know how the one-to-all broadcast routine MPI_Bcast
transmits data from the root to all the other processors.

For example, say there are 8 processors and I want to send data from
P0 to the other 7 processors. Without using MPI_Bcast, I can
explicitly implement the broadcast in a number of ways, such as a
hypercube, or a pipeline where P0 sends to P1, then P2, P3, P4 etc.
Obviously the hypercube implementation would be more efficient. So how
does MPI_Bcast transmit data, and is it more efficient than other
implementations?
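
To make it concrete, here is roughly the kind of hand-written
"hypercube" broadcast I have in mind. This is just a sketch I put
together (the function name is made up, and the root is fixed at rank
0), not anything taken from an MPI library:

#include <mpi.h>

/* Sketch only: binomial-tree ("hypercube") broadcast from rank 0.
   In the round with offset "mask", every rank that already has the
   data sends it to the partner whose rank differs in that bit. */
void my_bcast_hypercube(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            int partner = rank | mask;          /* i.e. rank + mask */
            if (partner < size)
                MPI_Send(buf, count, type, partner, 0, comm);
        } else if (rank < (mask << 1)) {
            int partner = rank & ~mask;         /* i.e. rank - mask */
            MPI_Recv(buf, count, type, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}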

Thank you.

Regards,
Rayne
Georg Bisseling
2008-02-02 19:59:51 UTC
Post by l***@yahoo.com
Obviously the hypercube implementation would be more efficient. So how
does MPI_Bcast transmit data, and is it more efficient than other
implementations?
The MPI standard only specifies the semantics.

For open source MPI implementations you can check it yourself.
For closed source implementations you have to trust the vendor's
documentation.

In both cases the algorithm might be chosen depending on the
devices that are used and on the actual size of the communicator.

A Google search will probably turn up many papers on optimizing
broadcast and other collective patterns. One idea is to map an
MPI_Bcast onto an Ethernet broadcast operation; for a shared memory
device, another is to write the message to only one buffer that is
then read by all recipients, etc.
--
This signature was left intentionally almost blank.
http://www.this-page-intentionally-left-blank.org/
Christian
2008-02-03 15:59:18 UTC
... implement the broadcast in a number of ways, such as a
hypercube, or a pipeline where P0 sends to P1, then P2, P3, P4 etc.
Just to make sure that we are using the same terminology for the
algorithms:
"pipeline" is P0=>P1; P1=>P2; P2=>P3; P3=>P4 etc.
("linear" is P0=>P1; P0=>P2; P0=>P3; P0=>P4 etc.)
Obviously the hypercube implementation would be more efficient.
Well, it's not always that easy! Georg told you that it depends on
many parameters - I'd like to add another two important ones: the
underlying network topology and the length of the message. Two
(outlier?) examples:

Nowadays, almost every computer provides two network interfaces. Just
take a bunch of cables and your own "homemade" cluster (i.e. without
any switch) is ready. Your "hypercube" implementation will likely
perform worse in this scenario than a "pipeline" implementation
(because of the underlying ring topology).

What if the length parameter is set to "very long" (e.g. 100 MiB or 1
GiB)? Then both algorithms would perform amazingly badly. However, when
we allow a small modification called "fragmentation" (i.e. sending the
message in smaller chunks of e.g. 100 KiB) then we will see a
substantial change:
The fragmented pipeline broadcast will surely be several times more
efficient than your hypercube broadcast (with or without
fragmentation).
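
Again just a sketch (root fixed at rank 0, the buffer treated as raw
bytes, and the 100 KiB chunk size picked arbitrarily), a fragmented
pipeline broadcast could look like this:

#include <mpi.h>

/* Pipelined broadcast in fixed-size chunks: successive chunks travel
   down the chain, so different links carry different chunks at the
   same time.  A real implementation would use nonblocking calls to
   overlap receiving the next chunk with sending the current one. */
void bcast_pipeline_fragmented(char *buf, long nbytes, MPI_Comm comm)
{
    const long chunk = 100 * 1024;
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (long off = 0; off < nbytes; off += chunk) {
        int len = (int)(nbytes - off < chunk ? nbytes - off : chunk);
        if (rank > 0)
            MPI_Recv(buf + off, len, MPI_BYTE, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        if (rank < size - 1)
            MPI_Send(buf + off, len, MPI_BYTE, rank + 1, 0, comm);
    }
}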

Christian
l***@yahoo.com
2008-02-04 03:30:35 UTC
Thank you for all the replies!

I've read that "MPIR_BCAST_BLOCK_SIZE determines the size of blocks"
in MPI_Bcast. So what is the default value of MPIR_BCAST_BLOCK_SIZE
and how do I change it?
