Blitz Support : |
From: Fernando Perez (fperez_at_[hidden])
Date: 2003-08-29 17:36:22
Todd Veldhuizen wrote:
> Maybe it's best to visualize an outer product followed by a reduction.
> The indices refer to the dimensions of the outer product.
[snip]
> Your example:
> M(i1,j) * T(i2,i3,i4,j)
> This is a five dimensional array (i1,i2,i3,i4,j)
>
> sum(M(i1,j) * T(i2,i3,i4,j), j)
> This is a four dimensional array (i1,i2,i3,i4)
>
> Hope that helps!
Thanks a lot, that does help. But I now have a worry about performance: I am
wondering whether blitz is building the intermediate rank d+1 array above in
memory before the reduction, or doing something similar.
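To make the worry concrete, here is a minimal sketch in plain C++ (no blitz) of the two ways such a contraction could be evaluated for the simplest case, R(i1,i2) = sum over j of M(i1,j) * T(i2,j). The function names and the flat row-major storage are illustrative assumptions, and this says nothing about what blitz actually generates; it only shows the difference between materializing the outer product and fusing the reduction into the loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Both functions compute R(i1,i2) = sum_j M(i1,j) * T(i2,j).
// M, T and R are n x n arrays stored row-major in flat vectors.

// Variant A: build the full rank-3 product P(i1,i2,j), then reduce over j.
std::vector<double> contract_via_outer(const std::vector<double>& M,
                                       const std::vector<double>& T,
                                       std::size_t n) {
    assert(M.size() == n * n && T.size() == n * n);
    std::vector<double> P(n * n * n);  // temporary, n times larger than R
    for (std::size_t i1 = 0; i1 < n; ++i1)
        for (std::size_t i2 = 0; i2 < n; ++i2)
            for (std::size_t j = 0; j < n; ++j)
                P[(i1 * n + i2) * n + j] = M[i1 * n + j] * T[i2 * n + j];

    std::vector<double> R(n * n, 0.0);
    for (std::size_t i1 = 0; i1 < n; ++i1)
        for (std::size_t i2 = 0; i2 < n; ++i2)
            for (std::size_t j = 0; j < n; ++j)
                R[i1 * n + i2] += P[(i1 * n + i2) * n + j];
    return R;
}

// Variant B: fused loops, the product is accumulated directly and no
// temporary array is allocated (this is what hand-rolled loops do).
std::vector<double> contract_fused(const std::vector<double>& M,
                                   const std::vector<double>& T,
                                   std::size_t n) {
    assert(M.size() == n * n && T.size() == n * n);
    std::vector<double> R(n * n, 0.0);
    for (std::size_t i1 = 0; i1 < n; ++i1)
        for (std::size_t i2 = 0; i2 < n; ++i2) {
            double acc = 0.0;
            for (std::size_t j = 0; j < n; ++j)
                acc += M[i1 * n + j] * T[i2 * n + j];
            R[i1 * n + i2] = acc;
        }
    return R;
}
```

Both variants do the same number of multiply-adds and return the same values, but variant A also writes and re-reads a temporary that grows by a factor of n with each extra index.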
Here is the problem. Now that I understood how to fix the code, I went ahead
and timed the two versions: my hand-rolled loops (horrendous macro code), vs
the blitz-expanded one (beautiful, elegant one-liners).
The results are:
planck[~/mtinner]> ./mtinner 300
Memory usage requested: 300 MB
Rank   size   dif   t_hand   t_bz   ratio (bz/hand)
1      6270   0     0.49     0.4    0.816327
2      340    0     0.36     0.41   1.13889
3      79     0     0.36     0.53   1.47222
4      33     0     0.4      0.7    1.75
5      18     0     0.35     0.82   2.34286
6      12     0     0.38     1.17   3.07895
Times are in seconds. I estimate the size at rank 1 to fit in the requested
memory, and other sizes are computed to keep the total op-count approximately
constant (memory use goes down with rank).
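For reference, the size column can be reproduced with a small sketch like the one below. It rests on assumptions that are not spelled out in this message: that "MB" means 2^20 bytes, that elements are 8-byte doubles, that the rank-1 size N is chosen so an N x N array fits in the requested memory, and that the op-count at rank d scales as N^(d+1). The helper names are hypothetical, and the real mtinner code may compute its sizes differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Largest n such that n^power <= limit, using integer arithmetic only.
// Safe from overflow for limit < 2^32, which covers the sizes used here.
std::uint64_t largest_root(std::uint64_t limit, unsigned power) {
    assert(power >= 1);
    std::uint64_t n = 0;
    for (;;) {
        std::uint64_t p = 1;
        bool over = false;
        for (unsigned k = 0; k < power; ++k) {
            p *= (n + 1);
            if (p > limit) { over = true; break; }
        }
        if (over) return n;  // (n + 1)^power exceeds the limit
        ++n;
    }
}

// Sizes for ranks 1..max_rank under the assumptions stated above.
std::vector<std::uint64_t> sizes_for_budget(std::uint64_t megabytes,
                                            unsigned max_rank) {
    // Number of doubles that fit in the requested memory (MB = 2^20 bytes).
    const std::uint64_t doubles = megabytes * 1024 * 1024 / sizeof(double);
    // Rank 1: largest N such that an N x N array of doubles fits.
    const std::uint64_t n1 = largest_root(doubles, 2);
    // Assumed op-count model: N^(d+1) multiply-adds at rank d.
    const std::uint64_t ops = n1 * n1;
    std::vector<std::uint64_t> sizes;
    for (unsigned d = 1; d <= max_rank; ++d)
        sizes.push_back(largest_root(ops, d + 1));
    return sizes;
}
```

Calling sizes_for_budget(300, 6) under these assumptions gives 6270, 340, 79, 33, 18, 12, matching the table. Because the sizes are rounded down, the op-count is only approximately constant (about 39.3 million at rank 1 versus about 34.0 million at rank 5).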
The hand-rolled loops do indeed remain constant in time, but the
blitz-expanded code doesn't. This worries me a lot, since a factor of 3 is,
unfortunately, far more than I can afford to pay once I move to production
with this project.
I'd really love to be able to use blitz's fancy syntax with abandon, and I
wonder if I am doing something wrong, or if this is something which can be fixed.
These numbers are on a P-4 @ 2.8GHz, with 1 GB of RAM, RedHat 9.0. Code compiled
with gcc 3.2.2 (also tested 3.3.1, no significant difference), and
CXXFLAGS = -I$(BZINCDIR) -DNDEBUG -O3
CFLAGS = -DNDEBUG -O3
Next, I decided to test the Intel compiler (icc 7.0). Here's what I get, with
SSE2 optimizations turned on:
planck[~/mtinner]> ./mtinner 300
Memory usage requested: 300 MB
Rank   size   dif   t_hand   t_bz   ratio (bz/hand)
1      6270   0     0.58     0.36   0.62069
2      340    0     0.67     0.37   0.552239
3      79     0     0.79     0.47   0.594937
4      33     0     1.02     0.61   0.598039
5      18     0     0.92     0.7    0.76087
6      12     0     1.11     1.1    0.990991
It's quite interesting: icc is a bit faster than gcc for the templated code,
but far worse for the hand-rolled loops. In the end, the blitz ratio is
better with icc, but for all the wrong reasons :)
The opcount is approximately constant, so what gcc + hand loops gives is the
'ideal' situation: time _should_ remain constant.
In case I am doing something outright wrong, I've posted a tarball with the
necessary code so others can run the tests:
http://www-hep.colorado.edu/~fperez/tmp/mtinner.tgz
Just 'make', then './mtinner NNN' will run with NNN Mbytes of memory (if NNN
is not given, 50 is the default).
In summary: I'd love to know if it's possible to use the fancy tensor
notation with no performance penalty, and if so, how to do it. The project
I'm working on requires algorithms to be written for d=1..6, and that syntax
would make my life *much*, *much* easier.
Thanks in advance for any feedback.
Regards,
Fernando.
ps. And to the Blitz team, many thanks. Even if the tensor notation can't be
used for real work, I'm still amazed by blitz, and I'll keep using it with
loops. It's a joy to work with :)