![]() |
Blitz Devel : |
From: Todd Veldhuizen (tveldhui_at_[hidden])
Date: 1998-08-14 10:53:52
Here's a possible alternative to doing loop fusion the conventional
way. The initial performance results aren't as good as I had hoped,
but it is simpler to implement, and has some nice properties.
Suppose we have the pair of array expressions
E = A*B + C*D;
F = C*B + A*D;
If you don't fuse the loops, then you take a performance hit
because the A, B, C, D arrays must be fetched from main memory
twice.
Unfused implementation: Fused implementation:
for (int i=0; i < N; ++i) for (int i=0; i < N; ++i) {
e[i] = a[i]*b[i] + c[i]*d[i]; e[i] = a[i]*b[i] + c[i]*d[i];
for (int i=0; i < N; ++i) f[i] = c[i]*b[i] + a[i]*d[i];
f[i] = c[i]*b[i] + a[i]*d[i]; }
For lack of inspiration, I'll refer to the alternate approach as
"chunky loop fusion". The basic idea is that we can avoid fetching
the arrays twice from memory if we evaluate the expressions in
blocks which are small enough to fit in cache:
e.g. evaluate E = A*B + C*D for i=0..511
F = C*B + A*D for i=0..511
then E = A*B + C*D for i=512..1023
F = C*B + A*D for i=512..1023
then E = A*B + C*D for i=1024..1535
etc. etc.
This way, the arrays stay in cache and only need to be fetched
from main memory once. The penalty you pay is the overhead from
having multiple loops, and that you are fetching data from the
L1 cache instead of registers (as you would be if the loops
were fused).
It is much simpler to implement with expression templates; here's
a small example:
// Some global variables
int _chunk;
bool _done_chunks;
const int _chunk_size = 512;
#define fuse _chunk = 0; _done_chunks = false; \
for (; !_done_chunks; ++_chunk)
Then the expressions would be written as:
fuse {
E = A*B + C*D;
F = C*B + A*D;
}
Inside the expression templates code, you would consult
the globals _chunk and _chunk_size to determine which
chunk to evaluate:
int lbound = _chunk * _chunk_size;
int ubound = lbound + _chunk_size - 1;
if (ubound >= N)
{
_done_chunks = true;
ubound = N - 1;
}
for (int i=lbound; i <= ubound; ++i)
array[i] = expr[i];
A possible advantage is that you can do this "chunky loop fusion"
over much more complicated code. For example, inside the
fuse { } directive, you could call functions which contain
array expressions, and they would be "chunkily fused" too.
Sameer pointed out that this solves a problem for profilers: because
the two loops are help separate, you can accurately attribute the
time for each expression (hard when the loops are fused).
Performance results for the above expressions, using N=100000:
Unfused Fused Chunky
T3E/KCC 27.4 43.2 34.0 (mcurie.nersc.gov)
alpha/DECcxx 8.7 12.7 11.9 (hgar1.cwru.edu)
i686/egcs 14.4 10.4 19.0 (n2001.lbl.gov)
O2000/KCC 14.4 22.3 18.4 (bluemountain.acl.lanl.gov)
I'm not sure why fused loops are slower under i686/egcs; it could
be because the compiler has trouble scheduling fatter loops.
Cheers,
Todd
--------------------- blitz-dev list --------------------------------
* To subscribe/unsubscribe: mail to majordomo_at_[hidden], with
"subscribe blitz-dev" or "unsubscribe blitz-dev" in the body of the message
* Blitz++ web page: http://oonumerics.org/blitz/