All of the work in setting up the FFT is done at compile time. This includes calculating the weights, determining which elements to swap, unwinding all the loops, etc. There are no function calls or loops in the generated code.
So far this code will compile on Borland C++ V4.0 for cases up to N=16, Sunpro4 for cases up to N=64, KAI C++ for N=16. Andrew Dalgleish (andrewd@axonet.com.au) was able to compile with Microsoft Visual C++ 4.1 for N=256 (however, the FFT wasn't completely inlined). Thomas Kunert (kunert@Ptprs1.phy.tu-dresden.de) was able to compile with IBM xlC for N=256 (possibly also not completely inlined). Felix Kasza (felixk@mailbag.shd.de) compiled with N=512 on Visual C++ (Alpha XL 300, single processor, NT 4.0), requiring 80 Mb of memory. Geoff Leyland (geoff.leyland@it.dgm.epfl.ch) was able to compile up to N=128 using Metrowerks CodeWarrior V11 on a PowerMac. The current record holder is Serge Plagnol (splagnol@nocliche.com), who generated N=1024 using VC++ 5.0 SP3 x86.
I used the FFT routine from 'Numerical Algorithms in C' as a base. Using that routine as a benchmark, the inlined version runs 4.5 times faster on Borland C++ (N=16), and 3.7 times faster on Sunpro4 (N=16). These figures should be taken with a lot of salt, since the Numerical Algorithms version calculates weights on the fly, whereas the template metaprogram version precalculates them at compile time.
For huge N, four1(..) may be faster, since the inlined version will not fit in the cache, whereas four1 will. To tackle large transforms, it's best to use a routine like four1(..), and bottom out to an inlined version for N=16,32 or 64.
The program is useful as an illustration of template metaprogram techniques:
Return to the
Blitz++ Project
Home Page.