OONumerics User
From: Chris Kees (kees_at_[hidden])
Date: 2001-11-14 13:34:23
Hi,
I'm trying to optimize an OO numerical code in C++ on the Origin2000 and
am having trouble using their tools. Specifically, I've been trying to
generate assembler listings with the -S option to the MIPSpro C++
compiler, and I don't think I'm getting the listing that is used for
the executable. For a simple DAXPY test code I can't get the proper
scheduling (which should produce a loop capable of 1/3 of peak MFLOPS in
the ideal case for this architecture) even though the loop does get
unrolled. When I run the test code against stripped-down C++ that gets
unrolled AND pipelined correctly, I see no difference in performance,
and they both get roughly 1/3 of peak for reasonable vector lengths,
which suggests that the pipelining is being done in spite of the fact
that the assembler listing says it's not. Does anybody have any ideas
about what is happening here or have suggestions on how to proceed with
tuning? I'm afraid that the compiler doesn't properly optimize loops
with inlined functions until the final pass of "interprocedural
analysis" and that I'll never be able to tell except by performance
testing whether the loops are pipelined or not.
Below are two daxpy implementations with the same real-world performance
on the Origin, along with their loop schedules. The schedules suggest that
the first implementation shouldn't be getting the performance that it
does, in fact, achieve. SVec is a simple C++ vector class with storage
pointer p and an inlined operator[]. SDaxpy is just a class with the
daxpy routines as member functions.
--Chris
void SDaxpy::unrollThis(SVec& x, SVec& y, real a)
{
  int end=x.size;
  for (int i=0;i<end;i++)
    {
      y[i] = y[i] + a*x[i];
    }
}
#<sched> 16 flops ( 11% of peak) (madds count as 2)
#<sched> 8 flops ( 5% of peak) (madds count as 1)
#<sched> 8 madds ( 11% of peak)
#<sched> 24 mem refs ( 33% of peak)
#<sched> 3 integer ops ( 2% of peak)
#<sched> 35 instructions ( 12% of peak)
void SDaxpy::unrollThat(SVec& x, SVec& y, real a)
{
  int end=x.size;
  for (int i=0;i<end;i++)
    {
      y.p[i]=y.p[i]+a*x.p[i];
    }
}
#<swps> 12 estimated iterations before pipelining
#<swps> 8 unrollings before pipelining
#<swps> 24 cycles per 8 iterations
#<swps> 16 flops ( 33% of peak) (madds count as 2)
#<swps> 8 flops ( 16% of peak) (madds count as 1)
#<swps> 8 madds ( 33% of peak)
#<swps> 24 mem refs (100% of peak)
#<swps> 3 integer ops ( 6% of peak)
#<swps> 35 instructions ( 36% of peak)
--
Christopher E. Kees
UCAR Visiting Scientist
University of North Carolina, Chapel Hill, NC 27599-7400
(919) 966-7892, fax: (919) 966-7911, http://www.unc.edu/~ckees