BZDEV: analysis of looptest for Intel C++

From: Allan Stokes [cbi] (allan@stokes.ca)
Date: Wed Dec 09 1998 - 21:18:35 EST


Hello again,

Yesterday I posted my preliminary results with the Intel C++ 4.0 beta
compiler. Today I took a closer look at the generated code.

Each iteration of the DAXPY loop requires two memory reads and one memory
write.
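
For reference, a minimal sketch of the kind of loop being counted. The
function name, the use of std::vector, and the single constant c are
assumptions made for illustration; this is not the looptest source itself.
The A vector is the one written, which matches the cache discussion later in
this message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// DAXPY-style update: a[i] = a[i] + c * b[i]
// Per iteration: read a[i], read b[i], write a[i]
//   -> two memory reads, one memory write, two floating-point operations.
void daxpy(double c, std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        a[i] += c * b[i];
}
```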

My best result with ICL so far is 116 Mflops on a 200MHz PPro processor.
This corresponds to a cache read or write operation completing on 91% of the
available cycles.
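
The arithmetic behind a figure like this can be sketched as follows. The
helper name is hypothetical, and the counts of two flops and three memory
operations per iteration are assumptions taken from the plain DAXPY form
described above.

```cpp
#include <cassert>

// Fraction of clock cycles on which an L1 read or write completes,
// given a measured flop rate and the per-iteration operation counts.
double cache_op_fraction(double mflops, double clock_mhz,
                         double flops_per_iter, double mem_ops_per_iter)
{
    assert(clock_mhz > 0.0 && flops_per_iter > 0.0);
    const double miters_per_sec = mflops / flops_per_iter;            // million iterations/s
    const double mmem_ops_per_sec = miters_per_sec * mem_ops_per_iter; // million memory ops/s
    return mmem_ops_per_sec / clock_mhz;                               // memory ops per cycle
}
```

With 116 Mflops, 200 MHz, 2 flops and 3 memory operations per iteration this
gives about 0.87, in the neighbourhood of the 91% quoted. The exact value
depends on the measured rate and on how the operations are counted.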

The Pentium Pro Processor System Architecture book from Mindshare claims
that the PPro is capable of performing a read and write to the L1 cache
concurrently (under some conditions), but I rarely see evidence of this feat
in practice.

Most of the difference in the loop results (which generally range from 60 to
110 Mflops) is accounted for by whether the constants c/d are loaded into
temporaries. When they are not, ICL generates one extra memory read per
DAXPY iteration. The overhead from short loops also matters.
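
A sketch of the two source forms, for illustration only. The Params struct
and both function names are hypothetical. The comment about aliasing is one
plausible reason for the extra read, not something confirmed from the ICL
output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for the loop constant; not taken from looptest.
struct Params {
    double c;
};

// Constant read through a pointer inside the loop. Unless the compiler can
// prove that the stores to a[i] never overlap p->c, it may re-read p->c on
// every iteration (one extra memory read).
void daxpy_no_temp(std::vector<double>& a, const std::vector<double>& b,
                   const Params* p)
{
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += p->c * b[i];
}

// Constant copied into a local before the loop; the local can stay in a
// register for the whole loop.
void daxpy_temp(std::vector<double>& a, const std::vector<double>& b,
                const Params* p)
{
    assert(a.size() == b.size());
    const double c = p->c;
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += c * b[i];
}
```

Both functions compute the same result; only the number of memory reads the
compiler has to emit per iteration may differ.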

The most interesting output was the anomalous result for the interlaced
case. Interlacing is where the A and B vectors are represented as if they
were--to use STL notation--vector<pair<double, double> >.
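
A minimal sketch of the interlaced layout, assuming the A element is stored
in .first and the B element in .second. The function names are hypothetical
and the code is not the looptest source.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<std::pair<double, double> > InterlacedVec;

// Build the composite AB vector from separate A and B vectors:
// memory order becomes a0 b0 a1 b1 a2 b2 ...
InterlacedVec interlace(const std::vector<double>& a,
                        const std::vector<double>& b)
{
    assert(a.size() == b.size());
    InterlacedVec ab;
    ab.reserve(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        ab.push_back(std::make_pair(a[i], b[i]));
    return ab;
}

// Same update as the plain loop, a[i] += c * b[i], but the element that is
// read only (b) and the element that is written (a) now share cache lines.
void daxpy_interlaced(double c, InterlacedVec& ab)
{
    for (std::size_t i = 0; i < ab.size(); ++i)
        ab[i].first += c * ab[i].second;
}
```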

The ICL-generated code for this case was absolutely flawless, and yet the
performance fell by more than half. It was very hard to explain.

    39.388 interlaced, for, indirection, unit stride
    43.202 for, unroll=4, unit stride, interlaced, constants loaded into temps

I began to suspect a cache effect so I reduced the size of the vectors by
half. The new results were interesting:

    61.035 interlaced, for, indirection, unit stride
    105.96 for, unroll=4, unit stride, interlaced, constants loaded into temps

I have a good idea about what might be happening here.

The PPro L1 cache (on the old 8K models) is two-way set-associative with
32-byte lines. In the non-interlaced cases the cache lines for the B
vector do not become dirty, whereas in the interlaced case, the entire
composite AB vector becomes dirty.

My theory is that the PPro has a rule that when _both_ lines in a cache set
become dirty this automatically triggers a write-back of one of the two
dirty lines within the set to the L2 backing cache.

If so, the PPro has an 8KB L1 cache, but it only allows 4KB of this cache
(one line out of each set) to be in the dirty state at any given time.
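
The numbers in this theory can be checked with a short sketch. The struct and
function names are hypothetical, the element count n is a free parameter (the
actual looptest vector lengths are not given here), and the "one dirty line
per set" limit is the conjecture above, not documented PPro behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Cache shape as described above: 8KB, two-way, 32-byte lines.
struct CacheGeometry {
    std::size_t total_bytes;
    std::size_t line_bytes;
    std::size_t ways;
};

// Number of sets = total size / (line size * associativity).
std::size_t num_sets(const CacheGeometry& g)
{
    assert(g.line_bytes > 0 && g.ways > 0);
    return g.total_bytes / (g.line_bytes * g.ways);
}

// Under the conjecture, at most one line per set may be dirty at a time.
std::size_t max_dirty_bytes(const CacheGeometry& g)
{
    return num_sets(g) * g.line_bytes;
}

// Separate A and B vectors: only A's lines are written.
std::size_t dirty_bytes_separate(std::size_t n)
{
    return n * sizeof(double);
}

// Interlaced AB vector: every line holds some A elements, so every line of
// the composite vector is written.
std::size_t dirty_bytes_interlaced(std::size_t n)
{
    return 2 * n * sizeof(double);
}
```

For the geometry {8192, 32, 2} this gives 128 sets and a 4096-byte dirty
limit. For any given n the interlaced layout dirties twice as many bytes as
the separate layout, which is consistent with halving the vectors restoring
most of the performance.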

Overall, I would have to say after inspection that the code generated by ICL
for these test cases could hardly be improved.

Allan

--------------------- blitz-dev list --------------------------------
* To subscribe/unsubscribe: mail to majordomo@oonumerics.org, with
"subscribe blitz-dev" or "unsubscribe blitz-dev" in the body of the message
* Blitz++ web page: http://oonumerics.org/blitz/


