Blitz Support :

From: Andreas R. (andreasreifschneider_at_[hidden])
Date: 2004-06-21 18:07:30


Hi.

Some time ago I wrote a C program to generate reaction-diffusion textures.
I am now trying to port it to Blitz, since Blitz's stencil notation is really
wonderful.
However, I cannot achieve comparable performance: in the best case the Blitz
code has about 115% overhead compared to the C code.
So I extracted the most important code (omitting the production of dummy values
at the boundaries, error handling, value rescaling, etc.) into
myprogram1.c and wrote the corresponding Blitz stencil code in myprogram3.cpp. I use
a few hacks to keep the code similar and short.
I use 2D matrices of size 68x68 and perform 30000 iterations (to be sure the
timing is not affected too much by the initialization):
time ./myprogram3 > test3.pgm

real 0m5.406s
user 0m5.404s
sys 0m0.002s

compared to C program result:
time ./myprogram1 > test1.pgm

real 0m2.506s
user 0m2.505s
sys 0m0.002s
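For reference, the overhead figures quoted in this post can be reproduced from the user times. The sketch below is only an illustration of that arithmetic; the function name `overhead_percent` is made up here and is not part of the attached programs.

```cpp
#include <cassert>

// Relative overhead of `candidate` over `baseline`, in percent.
// Both arguments are user times in seconds, as printed by `time`.
// (Hypothetical helper, not taken from the attached programs.)
double overhead_percent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate / baseline - 1.0) * 100.0;
}
```

With the times above, `overhead_percent(2.505, 5.404)` comes out at roughly 115.7, which matches the "about 115%" stated at the top.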

I then tried to find out the reason for that. I took the C
implementation and used Array<float, 2> instead of float**
(myprogram2.cpp), but that made it even slower than the stencil notation!
By the way, I didn't find a swap method in Array, so I hope that
matrix_a.swap(matrix_b) for std::vector is equivalent to
matrix_swap_tmp.reference(matrix_a); matrix_a.reference(matrix_a_out);
matrix_a_out.reference(matrix_swap_tmp); for blitz::Array.
time ./myprogram2 > test2.pgm

real 0m6.110s
user 0m6.108s
sys 0m0.002s
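To make the swap question concrete, here is a minimal sketch of the std::vector side of the comparison: a ping-pong update in which the two buffers are exchanged by `swap`, which trades the internal pointers without copying elements. The struct `Buffers`, the function `step_and_swap`, and the halving update rule are placeholders invented for this illustration. The three `reference()` calls quoted above are meant to achieve the same effect for blitz::Array by rebinding which storage each Array refers to; that equivalence is the poster's assumption and is not demonstrated by this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two buffers for a ping-pong update: read from `in`, write to `out`.
// (Hypothetical names, for illustration only.)
struct Buffers {
    std::vector<double> in;
    std::vector<double> out;
};

// One iteration followed by a constant-time exchange of the two buffers.
// The update rule (halving each value) is a placeholder, not the real kernel.
void step_and_swap(Buffers& b) {
    assert(b.in.size() == b.out.size());
    for (std::size_t i = 0; i < b.in.size(); ++i) {
        b.out[i] = 0.5 * b.in[i];
    }
    // swap exchanges the vectors' internal storage; no element is copied.
    b.in.swap(b.out);
}
```

After the call, `b.in` holds the freshly computed values and `b.out` holds the old input, ready to be overwritten in the next iteration.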

I took a look at Array's implementation. It uses a single array, so I
modified myprogram1.c to do the same, but that didn't affect the execution
time much:
time ./myprogram1b > test1b.pgm

real 0m2.636s
user 0m2.634s
sys 0m0.002s
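The single-array layout mentioned above can be sketched as follows: the 2D grid lives in one contiguous row-major block, and element (i, j) sits at offset i * cols + j. The 5-point Laplacian used here is only a stand-in kernel chosen for illustration (the actual update in the attached programs is not shown in this post), and the function name `laplacian_flat` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// 5-point Laplacian on the interior of a row-major rows x cols grid that is
// stored in one contiguous block. Boundary cells of `out` are left untouched.
// (Stand-in kernel for illustration; not the kernel from the attached code.)
void laplacian_flat(const double* in, double* out,
                    std::size_t rows, std::size_t cols) {
    assert(rows >= 3 && cols >= 3);
    for (std::size_t i = 1; i + 1 < rows; ++i) {
        // Row start pointers replace the second indirection of a float**.
        const double* up   = in + (i - 1) * cols;
        const double* mid  = in + i * cols;
        const double* down = in + (i + 1) * cols;
        double* dst = out + i * cols;
        for (std::size_t j = 1; j + 1 < cols; ++j) {
            dst[j] = up[j] + down[j] + mid[j - 1] + mid[j + 1] - 4.0 * mid[j];
        }
    }
}
```

Compared with a float** layout, this removes one pointer load per row access, which is consistent with the small timing difference reported above.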

I asked myself whether Array perhaps has too much overhead, so I rewrote
myprogram2.cpp to use a low-overhead version (without reference
counting, even omitting the freeing of memory), but that didn't help much
either:
time ./myprogram2b > test2b.pgm

real 0m4.903s
user 0m4.902s
sys 0m0.000s

Using std::vector< std::vector<float> > is slower than float**, but faster
than Array:
time ./myprogram1c > test1c.pgm

real 0m3.779s
user 0m3.763s
sys 0m0.002s

I didn't know exactly whether this had something to do with memory
alignment, so I just adjusted myprogram1b to use one std::vector instead of
a pointer. But this still caused a lot of overhead:
time ./myprogram1bb > test1bb.pgm

real 0m4.407s
user 0m4.348s
sys 0m0.001s

Low-overhead version of std::vector:
time ./myprogram1bbb > test1bbb.pgm

real 0m4.144s
user 0m4.128s
sys 0m0.001s

My last thought was: how would the execution time of the stencil change if I
used multiple stencils instead of one? (I will have to do that later anyway,
because this simple stencil already uses 8 arrays and only 11 are available.)
So I split the stencil and was pleasantly surprised that the running time
decreased; it now has only 70% overhead over the fastest C implementation.
time ./myprogram3b > test3b.pgm

real 0m4.388s
user 0m4.268s
sys 0m0.002s
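The idea of splitting one stencil into several can be illustrated with hand-written loops, independent of the Blitz stencil macros. The sketch below is an analogy only: the 1D layout, the two fields `a` and `b`, the coefficients, and the function names are all made up for this example and are not the code from myprogram3.cpp or myprogram3b.cpp. The fused version touches every array in one pass; the split version makes two passes that each write a single output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Field;

// Fused version: one pass reads a and b and writes both outputs.
// The update rule is a placeholder diffusion-plus-reaction step.
void update_fused(const Field& a, const Field& b, Field& a_out, Field& b_out) {
    assert(a.size() == b.size());
    assert(a.size() == a_out.size() && a.size() == b_out.size());
    for (std::size_t i = 1; i + 1 < a.size(); ++i) {
        const double react = 0.05 * a[i] * b[i];
        a_out[i] = a[i] + 0.1 * (a[i - 1] - 2.0 * a[i] + a[i + 1]) - react;
        b_out[i] = b[i] + 0.2 * (b[i - 1] - 2.0 * b[i] + b[i + 1]) + react;
    }
}

// Split version: two passes, each writing one output array, so each loop
// body works with fewer arrays at a time.
void update_split(const Field& a, const Field& b, Field& a_out, Field& b_out) {
    assert(a.size() == b.size());
    assert(a.size() == a_out.size() && a.size() == b_out.size());
    for (std::size_t i = 1; i + 1 < a.size(); ++i) {
        a_out[i] = a[i] + 0.1 * (a[i - 1] - 2.0 * a[i] + a[i + 1])
                 - 0.05 * a[i] * b[i];
    }
    for (std::size_t i = 1; i + 1 < b.size(); ++i) {
        b_out[i] = b[i] + 0.2 * (b[i - 1] - 2.0 * b[i] + b[i + 1])
                 + 0.05 * a[i] * b[i];
    }
}
```

Both versions compute the same values; which one runs faster depends on the compiler and on how many arrays it can keep track of in one loop, so it has to be measured rather than assumed.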

The fastest C++ code I get, with std::vector< std::vector<float> >, still has
50% overhead (I compare user times).
It seems that the mere use of C++ operators makes the code that slow.
Is this really the case?

Previously I thought that using std::vector would be just as fast as
accessing a float*, since (at least when using -O3 or -finline-functions) the
code should be inlined. Or is there something else going on? Can somebody
confirm my observations?
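One way to test the inlining assumption in isolation is to time two variants of the same loop: one that indexes the nested vector with operator[] in the inner loop, and one that first takes a raw pointer to the row. The sketch below only sets up the two variants; the function names and the summation kernel are hypothetical, and it makes no claim about which is faster with g++ 3.3.3. Comparing the assembly produced by `g++ -S` for both functions would show whether the compiler really reduces them to the same code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Variant 1: operator[] on the nested vector inside the inner loop.
double sum_indexed(const Matrix& m) {
    double total = 0.0;
    for (std::size_t i = 0; i < m.size(); ++i) {
        for (std::size_t j = 0; j < m[i].size(); ++j) {
            total += m[i][j];
        }
    }
    return total;
}

// Variant 2: the row pointer and row length are hoisted out of the inner
// loop, so the inner loop looks like plain pointer access.
double sum_row_pointer(const Matrix& m) {
    double total = 0.0;
    for (std::size_t i = 0; i < m.size(); ++i) {
        if (m[i].empty()) continue;  // &m[i][0] is only valid for non-empty rows
        const double* row = &m[i][0];
        const std::size_t n = m[i].size();
        for (std::size_t j = 0; j < n; ++j) {
            total += row[j];
        }
    }
    return total;
}
```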

I read in a forum
  http://lists.trolltech.com/qt-interest/2004-04/thread00525-0.html
that using double instead of float saves some time (this was the case on my
machine). I also tried a great many optimization options, so the times above
are actually the times of my latest code (with doubles instead of
floats) and the optimization options I found to work best:
-O3 -march=pentium4 -mcpu=pentium4 -msse -msse2 -ffast-math
-fomit-frame-pointer -mfpmath=sse -malign-double
Are there perhaps other options I could enable to speed up
execution with g++? Should I try Intel's compiler?

I use
gcc/g++ (GCC) 3.3.3 20040412 (Gentoo Linux 3.3.3-r6, ssp-3.3.2-2, pie-8.7.6)
on Linux 2.6.5-gentoo-r1 i686 Intel(R) Pentium(R) 4 CPU 2.60GHz GenuineIntel
cache size 512 KB and 1GB RAM.

I attach the programs so you can verify this on your machines if you
want (you will probably need to modify OPTIMIZ_FLAGS and replace
-march=pentium4 -mcpu=pentium4 with your architecture, then just type
 "make tests"). If you want to see the generated picture, open it with xv,
right-click on it, select Windows->Color Editor, and click on HistEq (near
the bottom left).

Any advice would be appreciated.

Regards, Andreas R.