At Peter.Bienstman's suggestion I finally took a look at the hao-he
benchmark. It times the expression "B += sqr(A)" where A, B are
complex<double> arrays.
Here are the initial timings:
In-cache:
Time using array notation b += a*a: 2.78
Time using array notation b += sqr(a): 2.24
Time using low-level version: 3.8
Time using really low-level version: 0.47
Out-of-cache:
Time using array notation b += a*a: 4.17
Time using array notation b += sqr(a): 3.54
Time using low-level version: 5.65
Time using really low-level version: 1.98
So the Array version is about 4.7 times slower in-cache, and
1.8 times slower out of cache. Usually you can get away with
terrible code out-of-cache and still get decent performance,
so this 1.8 figure is a clear sign something is wrong.
Next I generated assembler code (g++ -S). Here is a snippet
of the inner loop of the "really low-level" version (corresponding
to one unrolling of the loop):
.LL59667:
add %o5, -1, %o5
fmuld %f2, %f2, %f8
faddd %f2, %f2, %f10
ldd [%o4+8], %f4
ldd [%o3], %f12
add %o4, 16, %o4
ldd [%o3+8], %f6
fmuld %f4, %f4, %f2
fmuld %f10, %f4, %f4
fsubd %f8, %f2, %f2
faddd %f12, %f2, %f2
faddd %f6, %f4, %f4
std %f2, [%o3]
std %f4, [%o3+8]
add %o3, 16, %o3
ldd [%o4], %f2
.
. etc.
Here is some of the inner loop code from the "b += sqr(a)" version:
add %o4, %o0, %o5
ldd [%o5], %o2
mov %o5, %o4
std %o2, [%fp-336]
ldd [%fp-336], %f2
ldd [%o4+8], %o0
faddd %f2, %f2, %f8
std %o0, [%fp-328]
ldd [%fp-328], %f4
mov %l0, %o0
fmuld %f4, %f4, %f6
add %fp, -320, %o1
fmuld %f2, %f2, %f2
fmuld %f8, %f4, %f8
fsubd %f2, %f6, %f2
std %f2, [%fp-320]
call __doapl__H1Zd_Pt7complex1ZX01RCt7complex1ZX01_Rt7complex1ZX01, 0
std %f8, [%fp-312]
ld [%fp-352], %o4
.
.
Yikes! There's the culprit. There is a function call in the inner
loop. I couldn't find a demangler utility for g++ (anyone know if
there is one)? but it looks like some global function associated
with complex. Hunting through the library headers I find:
template <class _FLT>
inline complex<_FLT>&
__doapl (complex<_FLT>* ths, const complex<_FLT>& r)
{
ths->re += r.re;
ths->im += r.im;
return *ths;
}
class complex<double>
{
public:
complex& operator+= (const complex& r) { return __doapl (this, r); }
...
};
For some reason egcs is not inlining this call. A quick fix
is to change the complex header to:
complex& operator+= (const complex& r)
{
re += r.re;
im += r.im;
return *this;
}
Okay. Then I find the call to __doapl__... disappears, but now there
is a call to
add %i0, 8, %o0
call sqr__H1ZQ25blitzt5Array2Zt7complex1Zdi1__5blitzRCt6ETBase1ZX01_Q
25blitzt13_bz_ArrayExpr1ZQ25blitzt20_bz_ArrayExprUnaryOp2ZQ35blitzt6asExpr1ZX016
T_exprZQ25blitzt7_bz_sqr1ZQ45blitzt6asExpr1ZX016T_expr9T_numtype, 0
nop
unimp 20
ld [%fp-148], %o0
ld [%fp-144], %o1
st %o0, [%fp-172]
st %o1, [%fp-168]
st %o0, [%fp-356]
st %o1, [%fp-352]
ld [%i1+20], %o0
st %i1, [%fp-304]
ld [%i1+28], %o1
call .umul, 0
mov 0, %l2
ld [%fp-304], %o3
sll %o0, 4, %o0
ld [%o3+28], %o1
ld [%fp-356], %o4
st %o1, [%fp-292]
ld [%o4+28], %o2
ld [%i1+4], %o1
st %o2, [%l3+12]
...
which is the blitz sqr routine. Also there are scattered calls to .umul.
It's like playing "whack the gopher".
I tried various things but couldn't get everything inlined.
Possible fixes:
1. Figure out a way to tell egcs to inline everything
(anyone know an option for this?)
2. Use libstdc++, which has a different implementation of
complex. I tried building this under solaris but ran
into trouble -- maybe someone else will have better luck?
It is available from ftp://egcs.cygnus.com/pub/libstdc++
Cheers,
Todd
--------------------- blitz-dev list --------------------------------
* To subscribe/unsubscribe: mail to majordomo@oonumerics.org, with
"subscribe blitz-dev" or "unsubscribe blitz-dev" in the body of the message
* Blitz++ web page: http://oonumerics.org/blitz/
This archive was generated by hypermail 2b29 : Wed Feb 20 2002 - 04:30:09 EST