BZDEV: Re: Blitz and PoomaII comparing (fwd)

From: Scott Haney (swhaney@lanl.gov)
Date: Tue Jan 12 1999 - 14:53:56 EST


>> /usr/bin/time eg++ -c \
>>   /home/spike/pooma-2.0/benchmarks/SimpleArray/atest.cpp \
>>   -o /home/spike/pooma-2.0/benchmarks/SimpleArray/SOLARISEGCS/atest.o \
>>   -ftemplate-depth-40 \
>>   -DNOPAssert -DNOCTAssert -O3 -funroll-loops -fstrict-aliasing \
>>   -I/home/spike/pooma-2.0/src \
>>   -I/home/spike/pooma-2.0/lib/SOLARISEGCS

>>              C             C
>>      N  w/o restrict  w/ restrict  CppTran  PoomaII
>>     10         35.24        52.64    6.754    4.881
>>     21         36.64        60.19    9.167    5.915
>>     46         34.53        36.24     8.36    6.723
>>    100          39.9        32.66     8.12     5.87
>>    215         39.66        35.78    7.812    5.556
>>    464         18.27        14.12    6.437    5.082
>>   1000         14.56        12.79    5.769    4.491

Alexander,

Your results that Blitz++ meets or exceeds C performance are not a surprise.
Nor is the fact that Pooma doesn't perform as well as Blitz++. Pooma has
a more general array/expression template implementation and evaluation
mechanism (e.g., to support parallelism) than Blitz++ which, in turn, means
increased overhead and difficulty in optimization. Also, of course, Todd has
done an excellent job tuning Blitz++. Blitz's good performance at small
vector sizes is particularly impressive. However, normally the difference is
in the ball-park of 20% except at small vector sizes, not a factor of 5-10.

"CppTran" above refers to writing the loops using explicit scalar indexing
of the arrays. We typically see that the CppTran results track the C without
restrict column pretty well. No reason why they shouldn't. However, that is
clearly not the case above. I presume this is due to some EGCS code
generation peculiarities on the Sun since I see better performance on the
SGI (see below).
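
To make the column labels concrete, here is a minimal sketch of the two loop
styles for the kernel w += x + y + z. The function names and the use of
std::vector as the array type are assumptions for illustration only; they are
not taken from the Pooma benchmark source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "C" style: raw pointers walked by an index. The "w/ restrict" variant
// would additionally qualify the pointers with the compiler's restrict
// extension, which is left out here because it is not standard C++.
void add_c_style(double* w, const double* x, const double* y,
                 const double* z, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        w[i] += x[i] + y[i] + z[i];
}

// "CppTran" style: the same loop, but indexing array objects one scalar
// element at a time through operator[]. std::vector stands in for the
// array class (an assumption).
void add_cpptran_style(std::vector<double>& w,
                       const std::vector<double>& x,
                       const std::vector<double>& y,
                       const std::vector<double>& z)
{
    assert(x.size() == w.size() && y.size() == w.size() &&
           z.size() == w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += x[i] + y[i] + z[i];
}
```

Both functions do the same arithmetic, which is why one would expect the
CppTran column to track the C w/o restrict column once operator[] is inlined.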

On an SGI Origin 2K with KCC 3.3 I get:

             C             C
     N  w/o restrict  w/ restrict  CppTran  PoomaII
    10         15.00        15.63    16.30     6.70
    21         30.62        21.76    38.46    11.33
    46         30.40        23.76    27.74    13.31
   100         21.16        20.88    19.35    14.16
   215         20.83        20.62    17.63    14.41
   464         11.96        12.32    11.65     9.76
  1000          9.39         9.52    10.72     9.85

Note: this test is with a newer version of Pooma and it uses an updated
version of the Benchmark class that exploits SGI's high-speed hardware
timers. The numbers reported on all columns are the highest MFLOPs for 50
trials (e.g., the command-line is atest --samples 50). With this many
trials, I get fairly good reproducibility of the results above, even though
the test is very short.
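
The "highest MFLOPs over N trials" measurement described above can be
sketched roughly as follows. This is not the Pooma Benchmark class: the
function names are hypothetical, std::chrono::steady_clock stands in for the
SGI hardware timers, and the caller has to supply the operation count (for
w += x + y + z one might count 3 additions per element, which is an
assumption about how the benchmark counts).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Convert an operation count and an elapsed time in seconds to MFLOPs.
inline double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / (seconds * 1.0e6);
}

// Run `kernel` `samples` times and return the highest MFLOP rate seen.
// Trials whose elapsed time is below the clock resolution are skipped.
template <class Kernel>
double best_mflops(int samples, double flop_count, Kernel kernel)
{
    double best = 0.0;
    for (int s = 0; s < samples; ++s) {
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        if (seconds > 0.0)
            best = std::max(best, mflops(flop_count, seconds));
    }
    return best;
}
```

Taking the maximum rather than the mean discards trials disturbed by other
activity on the machine, which helps when a single trial is very short.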

Here, we see the CppTran results compare well with the C w/o restrict
results and the Pooma results are slower by a lot at the low end and 10-30%
at the high end. These results aren't great, but they are not surprising
compared to what I've seen in the past and they're better than the factors
on the Sun.

With EGCS, again on the O2K, I get

             C             C
     N  w/o restrict  w/ restrict  CppTran  PoomaII
    10         14.42           15     12.1    1.769
    21         28.03        16.21    14.01    2.912
    46         24.05         22.8    20.09    2.941
   100         16.27        16.05    14.49     2.98
   215         13.63        13.64    11.72    2.953
   464         10.67        10.58    9.617    2.696
  1000         9.491        9.475    9.215    2.692

The CppTran results compare better than on the Sun, but the Pooma II results
are even worse. The factor of 3-6 compared to KCC is a bit worse than
results we've seen for other benchmarks, but generally consistent with
previous EGCS results.

Looking at the SimpleArray benchmark, I notice that the run() member
function actually calls setInitialConditions(), which is generating
overhead, but is not included in the MFLOP calculation. That is, we're
really trying to time

w += x + y + z;

not filling w, x, y, and z with numbers and, in particular, not measuring
the cost of the first touch of memory. Moving this call out of run() and
into initialize() gives the following results for KCC:

             C             C
     N  w/o restrict  w/ restrict  CppTran  PoomaII
    10         46.88        41.67    75.00    17.05
    21         78.75        82.69    91.88    53.35
    46         60.11        61.51    54.35    53.26
   100         72.53        69.57    62.81    60.58
   215         75.63        70.09    62.17    59.67
   464         24.91        25.71    25.48    23.13
  1000         23.15        23.87    22.90    23.88

and for EGCS:

             C             C
     N  w/o restrict  w/ restrict  CppTran  PoomaII
    10          62.5         62.5     37.5    4.412
    21         87.04        87.04    63.61    6.919
    46         53.98        53.98    45.09    6.719
   100          57.6        58.32    49.73    6.773
   215         59.38        59.26    48.29    6.636
   464         23.35         23.4    22.68    5.637
  1000         21.39        21.39    21.01    5.633

As expected, the MFLOPs are better across the board. Also, for KCC, Pooma is
much closer to CppTran and C, except at the small sizes where our overhead
is getting us. EGCS is doing a better job with CppTran, but the Pooma
results are still off by the large factor you reported.
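
The benchmark change discussed above, moving setInitialConditions() out of
run() and into initialize(), might look roughly like this. The class below is
a hypothetical stand-in that only borrows the member-function names mentioned
in this message; its layout and the std::vector storage are assumptions, not
the Pooma source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical benchmark class (layout is an assumption).
class SimpleArrayBench {
public:
    explicit SimpleArrayBench(std::size_t n) : w_(n), x_(n), y_(n), z_(n) {}

    // Untimed setup: fill the arrays with starting values.
    void initialize() { setInitialConditions(); }

    // Timed region: only the kernel under test, w += x + y + z.
    // Note that w accumulates across repeated calls.
    void run()
    {
        for (std::size_t i = 0; i < w_.size(); ++i)
            w_[i] += x_[i] + y_[i] + z_[i];
    }

    // Read back one element of w for checking.
    double w(std::size_t i) const
    {
        assert(i < w_.size());
        return w_[i];
    }

private:
    void setInitialConditions()
    {
        for (std::size_t i = 0; i < w_.size(); ++i) {
            w_[i] = 0.0;
            x_[i] = 1.0;
            y_[i] = 2.0;
            z_[i] = 3.0;
        }
    }

    std::vector<double> w_, x_, y_, z_;
};
```

With this split, a driver calls initialize() once and times only run(), so
the fill loop no longer dilutes the reported MFLOPs.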

I agree that Pooma needs more aggressive compiler optimization than Blitz++
to get C-like results. It looks to me that KCC can supply these, but EGCS
currently cannot. Specifically, my guess is that KCC more aggressively
inlines.
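
To illustrate why inlining matters so much here, the following is a heavily
simplified expression-template sketch. It is neither Pooma's nor Blitz++'s
implementation (both are far more general); the names Vec and Sum are
hypothetical. The statement w += x + y + z builds a small tree of Sum nodes,
and the single loop in operator+= only reduces to the hand-written loop if
the compiler inlines every nested operator[] call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node representing the elementwise sum of two sub-expressions.
// It holds references, so it is only valid within one full expression.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}

    double operator[](std::size_t i) const
    {
        assert(i < data.size());  // compiled out when NDEBUG is defined
        return data[i];
    }

    // One loop evaluates the whole expression tree element by element.
    template <class E>
    Vec& operator+=(const E& e)
    {
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] += e[i];
        return *this;
    }
};

// x + y builds a node instead of computing a temporary array.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b)
{
    return {a, b};
}

// (x + y) + z nests the existing node inside a new one.
template <class L, class R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b)
{
    return {a, b};
}
```

Each element of w += x + y + z goes through two levels of Sum::operator[]
plus three Vec::operator[] calls. If any of them is left as a real function
call, the per-element cost rises sharply, which would be consistent with the
EGCS numbers above, though that remains a guess.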

Now, admittedly, this realization doesn't do EGCS users interested
in Pooma that much good right now. However, we're working with EGCS
developers to improve the optimization and hope to see restrict plus better
inlining implemented sometime this year. These enhancements would presumably
benefit Blitz++ as well. Also, of course, we in the Pooma Team can work to
try to make our code more palatable to compilers.

Alexander and Todd, thanks for bringing this to our attention.

Scott

--
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Scott W. Haney                              | email: swhaney@lanl.gov
Technical Staff Member, Pooma Team          | phone: 505-667-5486
Los Alamos National Lab, CIC-ACL, MS B287   | fax:   505-665-4939
Los Alamos, NM 87545                        |
