BZDEV: looptest (Intel C++ 4.0 beta)

From: Allan Stokes [cbi] (allan@stokes.ca)
Date: Wed Dec 09 1998 - 02:39:30 EST


Intel has just released their VTune 4.0 C++ compiler into beta testing.

Using this compiler I was able to compile Blitz essentially without error.
This compiler is used as a plug-in replacement for the Visual C++ 6.0
compiler and requires that environment (the Intel C++ compiler provides no
header files or libraries of its own).

The results here are very preliminary and DO NOT necessarily reflect what
the final Intel 4.0 C++ compiler will be able to achieve. I am posting
these results merely to show that Intel is making great progress.

Test machine: 200 MHz PPro, 256 KB unified L2 / 8 KB L1 d-cache, 60 ns EDO memory
Release build, no specialized optimizations selected
Theoretical peak rate: 200 MFLOPS

I noticed that another post described the peak rate as 400 MFLOPS for an
equivalent configuration. This is not accurate.

The PPro does have separate add/mul units, but it provides only one issue
port to serve both units. Furthermore, muls cannot be issued on
consecutive cycles, so the peak rate can only be achieved if additions
make up at least 50% of the mix.

Looking to the future, the Intel Katmai processor, due in 2Q99, will provide
4-way SIMD for single-precision floating point and will be introduced at
around 500 MHz. The peak rate for a Katmai processor will be on the order
of 2 GFLOPS (assuming a single dispatch path).
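For reference, the 2 GFLOPS figure is just the clock rate times the SIMD width, assuming one 4-wide operation is dispatched per cycle (the single dispatch path mentioned above). The same formula with a width of 1 reproduces the 200 MFLOPS PPro figure:

```latex
P_{\text{peak}} = f_{\text{clock}} \times w_{\text{SIMD}} \times n_{\text{dispatch}}
  = 500\,\text{MHz} \times 4 \times 1 = 2\,\text{GFLOPS}
\qquad
P_{\text{peak}}^{\text{PPro}} = 200\,\text{MHz} \times 1 \times 1 = 200\,\text{MFLOPS}
```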

It will be interesting to see what is required to adapt Blitz to make use of
the KNI (Katmai New Instructions) facilities.

In-cache:
 MFLOPS  Description
 61.035  for, indirection, unit stride
 61.035  for, indirection, unit stride, no +=
 82.748  for, indirection, unit stride, backwards loops
 90.503  for, unroll=4, unit stride, constants loaded into temps
 93.842  for, unroll=4, unit stride, constants loaded into temps, 4 read then 4 write
 92.142  for, unroll=4, unit stride, constants loaded into temps, no +=
 88.817  for, unroll=4, unit stride, constants loaded into temps, CSE for index offsets
 92.142  for, unroll=4, unit stride, constants loaded into temps, backwards
 90.503  for, unroll=8, unit stride, constants loaded into temps
 72.869  for, indirection, unit stride, constants into temps
 59.512  for, indirection, non-unit stride
 72.939  for, indirection, non-unit stride, constants loaded into temps
 85.627  while, pointer increment, unit stride
 95.726  while, pointer increment, unit stride, constants loaded into temps
 82.748  while, pointer increment, non-unit stride
 108.53  while, pointer increment, unroll=4, non-unit stride, constants loaded into temps
 90.396  for, unroll=4, unit stride, constants loaded into temps, prefetching
 39.388  interlaced, for, indirection, unit stride
 43.202  for, unroll=4, unit stride, interlaced, constants loaded into temps

Out of cache:
 MFLOPS  Description
 8.2874  for, indirection, unit stride
 8.2982  for, indirection, unit stride, no +=
 8.7884  for, indirection, unit stride, backwards loops
 8.8714  for, unroll=4, unit stride, constants loaded into temps
 8.8586  for, unroll=4, unit stride, constants loaded into temps, 4 read then 4 write
 8.8648  for, unroll=4, unit stride, constants loaded into temps, no +=
 8.8714  for, unroll=4, unit stride, constants loaded into temps, CSE for index offsets
 8.8586  for, unroll=4, unit stride, constants loaded into temps, backwards
 9.1854  for, unroll=8, unit stride, constants loaded into temps
 8.4711  for, indirection, unit stride, constants into temps
 8.2817  for, indirection, non-unit stride
 8.4478  for, indirection, non-unit stride, constants loaded into temps
 8.7441  while, pointer increment, unit stride
 8.7381  while, pointer increment, unit stride, constants loaded into temps
 8.7257  while, pointer increment, non-unit stride
 9.0891  while, pointer increment, unroll=4, non-unit stride, constants loaded into temps
 8.7819  for, unroll=4, unit stride, constants loaded into temps, prefetching
 7.2967  interlaced, for, indirection, unit stride
 7.2967  for, unroll=4, unit stride, interlaced, constants loaded into temps

--------------------- blitz-dev list --------------------------------
* To subscribe/unsubscribe: mail to majordomo@oonumerics.org, with
"subscribe blitz-dev" or "unsubscribe blitz-dev" in the body of the message
* Blitz++ web page: http://oonumerics.org/blitz/
