Discussion:
I need an example of unbalanced MPI/IO in use
Jim Hill
2007-12-18 00:48:46 UTC
Hi, all. This is seemingly a simple issue but I'm not finding any
examples (and I thrive on examples). I've got an array of numbers
distributed across multiple processors, and I want to write it to a
file as efficiently as possible, such that from byte 0 of the file
you'll find the elements from PE 0, then PE 1, then PE 2, etc.

The kicker is that the load balance isn't perfect. No powers of two
here. I might have 9 elements on one PE, 6 on another, 15 on a third,
and so forth. The examples I've found all seem to assume a perfect load
balance of X items per PE.

Can someone point me to an example of how to pull this off? I don't
want to take up a lot of your time so unless you're just really gung-ho
to provide code here, please don't.

Thanks,


Jim
--
"I loathe people who say, 'I always read the ending of the book first.'
That really irritates me. It's like someone coming to dinner, just
opening the fridge and eating pudding, while you're standing there still
working on the starter. It's not on." -- JK Rowling
David Cronk
2008-01-02 16:25:40 UTC
I am not sure what you mean by example, but if you mean how to solve
this I will take a shot. First, a little clarification about the data
layout in the file.

Let's say 3 PEs, with 3, 7, and 5 elements each, respectively.

Does the file look like:

012012012121211

or

012012012_12_12_1__1_ where _ means no data written?

If case 1, it is a little more difficult. I would recommend that each
process figure out where its data goes in the file and describe it
with MPI_TYPE_CREATE_INDEXED_BLOCK. Use this to set your file view.
Then, when you write, simply use MPI_FILE_WRITE_ALL.
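
Off the top of my head (untested; assuming int elements, and my_count,
my_data, and the file name are just placeholders), the case 1 sketch in
C looks something like this. Each process works out its own element
positions from the gathered counts:

#include <mpi.h>
#include <stdlib.h>

void write_interleaved(int *my_data, int my_count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Every process needs every count to compute its element positions. */
    int *counts = malloc(nprocs * sizeof(int));
    MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

    /* My j-th element goes after everything written in rounds 0..j-1,
       plus the lower-ranked processes still active in round j. */
    int *disps = malloc(my_count * sizeof(int));
    int base = 0;                /* elements written in earlier rounds */
    for (int j = 0; j < my_count; j++) {
        int active = 0, before_me = 0;
        for (int p = 0; p < nprocs; p++) {
            if (counts[p] > j) {
                active++;
                if (p < rank) before_me++;
            }
        }
        disps[j] = base + before_me;
        base += active;
    }

    /* File view: one int at each computed position, then a collective write. */
    MPI_Datatype ftype;
    MPI_Type_create_indexed_block(my_count, 1, disps, MPI_INT, &ftype);
    MPI_Type_commit(&ftype);

    MPI_File fh;
    MPI_File_open(comm, "out.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, ftype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, my_data, my_count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&ftype);
    free(disps);
    free(counts);
}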

Case 2 is a little easier. Each process would use MPI_TYPE_VECTOR, with
count being the number of elements it provides to the file. Again, set
the file view and then use MPI_FILE_WRITE_ALL.
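
The same kind of sketch for case 2 (again untested, int elements,
placeholder names): each rank's j-th element lands in slot
j*nprocs + rank, which is exactly a vector type with stride nprocs,
shifted by the rank:

#include <mpi.h>

void write_strided(int *my_data, int my_count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* my_count blocks of 1 int, one every nprocs slots in the file. */
    MPI_Datatype ftype;
    MPI_Type_vector(my_count, 1, nprocs, MPI_INT, &ftype);
    MPI_Type_commit(&ftype);

    MPI_File fh;
    MPI_File_open(comm, "out.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* A byte displacement of rank*sizeof(int) shifts my column of slots. */
    MPI_File_set_view(fh, (MPI_Offset)(rank * sizeof(int)),
                      MPI_INT, ftype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, my_data, my_count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&ftype);
}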

Case 1 should provide better performance. Perhaps much better
performance unless the MPI-I/O implementation is very clever.

Hope this helps (and that this was not a homework problem).

Dave.
Post by Jim Hill
Hi, all. This is seemingly a simple issue but I'm not finding any
examples (and I thrive on examples). I've got an array of numbers
distributed across multiple processors, and I want to write it to a
file as efficiently as possible, such that from byte 0 of the file
you'll find the elements from PE 0, then PE 1, then PE 2, etc.
The kicker is that the load balance isn't perfect. No powers of two
here. I might have 9 elements on one PE, 6 on another, 15 on a third,
and so forth. The examples I've found all seem to assume a perfect load
balance of X items per PE.
Can someone point me to an example of how to pull this off? I don't
want to take up a lot of your time so unless you're just really gung-ho
to provide code here, please don't.
Thanks,
Jim
--
Dr. David Cronk, Ph.D. phone: (865) 974-3735
Research Director fax: (865) 974-8296
Innovative Computing Lab http://www.cs.utk.edu/~cronk
University of Tennessee, Knoxville
Jim Hill
2008-01-03 03:03:21 UTC
Post by David Cronk
I am not sure what you mean by example, but if you mean how to solve
this I will take a shot. First, a little clarification about the data
layout in the file.
Let's say 3 PEs, with 3, 7, and 5 elements each, respectively.
012012012121211
or
012012012_12_12_1__1_ where _ means no data written?
Actually, it looks like this:

000111111122222

which in theory makes this even easier.

I can do a quick parallel communication so that rank-1 knows there are 3
elements coming before it and rank-2 knows there are 10 elements coming
before it, so they can compute offsets relative to the beginning of the
file and then just *splat* their contents in a contiguous chunk
beginning at the offset. I just have to chase down the proper sequence
of MPI_something_something calls to achieve the task.
Post by David Cronk
Hope this helps (and that this was not a homework problem).
It does and it was not. The project that I work on makes extensive use
of visualization dumps. Currently we do that via a gather to rank-0,
which does the file I/O while all the other processors hang out, tell
fishing stories, check up on the wife and kids, the usual. When rank-0
finishes up, it's back to work. Based on rough profiling I've done,
we're spending about 75% of our viz dump time with the current file I/O.
Not elegant but it gets the job done.

However, the problems that our users are solving are getting ever-larger
and the portion of problem time dedicated to viz dumps is growing to
unacceptable levels. One way to minimize this is to ask the users to
request less-frequent dumps with less data per dump. This has been greeted
with a notable lack of enthusiasm. So yours truly suggested parallel
I/O and as the first person to pipe up received the plum assignment of
implementing it. We're reasonably well load-balanced but I think the
only place where you find a perfect 1/numprocs decomposition is in
reference materials which then take advantage of that decomposition to
present an unrealistic source code example (e.g., they all use an
identical value of "buflen" for the quantity of data a particular
processor will be writing so rank*buflen is an obvious offset into the
file).

I have talked to some of our HPC folks here about the task at hand and
they were pretty keen so it looks like I'll get more local support than
I was anticipating.

Thanks for taking the time to reply and you can sleep easy knowing you
weren't an inadvertent party to academic misconduct.


Jim
--
"I loathe people who say, 'I always read the ending of the book first.'
That really irritates me. It's like someone coming to dinner, just
opening the fridge and eating pudding, while you're standing there still
working on the starter. It's not on." -- JK Rowling
David Cronk
2008-01-03 14:44:39 UTC
Post by Jim Hill
Post by David Cronk
I am not sure what you mean by example, but if you mean how to solve
this I will take a shot. First, a little clarification about the data
layout in the file.
Let's say 3 PEs, with 3, 7, and 5 elements each, respectively.
012012012121211
or
012012012_12_12_1__1_ where _ means no data written?
000111111122222
which in theory makes this even easier.
Yes, it sure does.
Post by Jim Hill
I can do a quick parallel communication so that rank-1 knows there are 3
elements coming before it and rank-2 knows there are 10 elements coming
before it, so they can compute offsets relative to the beginning of the
file and then just *splat* their contents in a contiguous chunk
beginning at the offset. I just have to chase down the proper sequence
of MPI_something_something calls to achieve the task.
I suggest MPI_EXSCAN so each PE can figure out how many elements come
before its contribution in the file. After that you have many options,
and I recommend testing each for performance on your system. For your
data layout I suspect non-collective I/O may lead to the best results,
though you should test with collective I/O as well.

Basically you have explicit offsets (no need even for a new file view as
long as you convert offsets to bytes), individual file pointers, and
shared file pointers. Shared file pointers are unlikely to give you the
desired performance, but I would try them just for giggles. Once you
have done the setup part, changing between the different choices is
quite simple.
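
For your layout the explicit-offset variant is only a couple of calls.
A rough, untested sketch, again assuming int elements, with my_count,
my_data, and the file name as placeholders:

#include <mpi.h>

void write_contiguous_blocks(int *my_data, int my_count, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Total elements on all lower ranks; MPI_Exscan leaves rank 0's
       result undefined, so set it to 0 explicitly. */
    long long mine = my_count, before = 0;
    MPI_Exscan(&mine, &before, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0) before = 0;

    MPI_File fh;
    MPI_File_open(comm, "viz.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* The default file view is in bytes, so convert the element offset. */
    MPI_Offset offset = (MPI_Offset)before * sizeof(int);
    MPI_File_write_at_all(fh, offset, my_data, my_count, MPI_INT,
                          MPI_STATUS_IGNORE);
    /* Non-collective alternative: MPI_File_write_at with the same
       arguments -- worth timing both. */

    MPI_File_close(&fh);
}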
Post by Jim Hill
Post by David Cronk
Hope this helps (and that this was not a homework problem).
It does and it was not. The project that I work on makes extensive use
of visualization dumps. Currently we do that via a gather to rank-0,
which does the file I/O while all the other processors hang out, tell
fishing stories, check up on the wife and kids, the usual. When rank-0
finishes up, it's back to work. Based on rough profiling I've done,
we're spending about 75% of our viz dump time with the current file I/O.
Not elegant but it gets the job done.
More and more apps are running into the I/O bottleneck. I expect
parallel I/O to become essential to most HPC apps in the near future.
Sounds like the other PEs have no other work they can be doing while
waiting on the I/O, so perhaps non-blocking I/O will not help here.
Though, if possible, you may want to investigate making some changes to
the code to allow overlap of I/O and computation. Just a thought. You
know the code, so you should be a good judge of whether it is worth the
effort.
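
If you do find room for overlap, the shape is roughly this (untested;
it assumes the file, byte offset, and buffer are already set up as in
the earlier sketch, and the buffer must not be touched until the wait
completes):

#include <mpi.h>

/* Start the dump and return the request so the caller can keep
   computing, then complete it later with MPI_Wait. This is the
   non-blocking, non-collective, explicit-offset write from MPI-2. */
MPI_Request start_viz_dump(MPI_File fh, MPI_Offset byte_offset,
                           int *my_data, int my_count)
{
    MPI_Request req;
    MPI_File_iwrite_at(fh, byte_offset, my_data, my_count, MPI_INT, &req);
    return req;
}

/* Caller, roughly:
     MPI_Request req = start_viz_dump(fh, offset, my_data, my_count);
     compute_next_step();                  (hypothetical work to overlap)
     MPI_Wait(&req, MPI_STATUS_IGNORE);
*/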

One other thing: if performance becomes critical, the advanced features
of MPI-I/O are worth exploring, like having the app control file striping
and other settings that may be controllable through the "info" argument
in the MPI-I/O calls.
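
For example (the hint names below are reserved MPI-I/O hints, but
whether they have any effect depends on the implementation and
filesystem, and the values here are made up):

#include <mpi.h>

void open_with_hints(MPI_Comm comm, char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");      /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MiB stripes */
    MPI_Info_set(info, "collective_buffering", "true");

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, fh);
    MPI_Info_free(&info);
}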
Post by Jim Hill
However, the problems that our users are solving are getting ever-larger
and the portion of problem time dedicated to viz dumps is growing to
unacceptable levels. One way to minimize this is to ask the users to
request less-frequent dumps with less data per dump. This has been greeted
with a notable lack of enthusiasm. So yours truly suggested parallel
I/O and as the first person to pipe up received the plum assignment of
implementing it. We're reasonably well load-balanced but I think the
only place where you find a perfect 1/numprocs decomposition is in
reference materials which then take advantage of that decomposition to
present an unrealistic source code example (e.g., they all use an
identical value of "buflen" for the quantity of data a particular
processor will be writing so rank*buflen is an obvious offset into the
file).
I have talked to some of our HPC folks here about the task at hand and
they were pretty keen so it looks like I'll get more local support than
I was anticipating.
Thanks for taking the time to reply and you can sleep easy knowing you
weren't an inadvertent party to academic misconduct.
Glad I could help and that I did not fall into a trap of a student
looking for unfair help :)

Good luck with this effort.

Dave
Post by Jim Hill
Jim
--
Dr. David Cronk, Ph.D. phone: (865) 974-3735
Research Director fax: (865) 974-8296
Innovative Computing Lab http://www.cs.utk.edu/~cronk
University of Tennessee, Knoxville