BLAS benchmarks

The use of BLAS subroutines in scientific and technical codes has become widespread during the past decade. Hardware (and in some cases scientific software) vendors have come up with optimized versions of the BLAS subroutines which enable (in theory at least) a program using them as building blocks to get the most out of the performance of any platform, without the need for extensive hand-tuning. Thus the benchmarking of the performance of the BLAS subroutines themselves should give a good indication of the performance attainable on different platforms or of performance problems in a vendor's library.

Unfortunately, the vendors themselves rarely provide such information, and when they do, usually only peak performance figures are given. Notable exceptions are IBM's whitepaper Algorithm and Architecture Aspects of Producing ESSL BLAS on POWER2, the Parallel ESSL performance measurements, and Digital's DXML: A High-performance Scientific Subroutine Library. (I would be grateful for pointers to more publicly available information.)

Unlike the IBM and Digital reports, the tests whose results are presented below give performance figures as seen by the end user. In other words, no attempt was made to isolate the speed of the algorithm itself by timing the argument checking and subtracting it, since that time is a cost one actually pays when using BLAS subroutines in any code.

In the case of dcopy I measure bi-directional bandwidth (i.e. maximum traffic going in both directions); this should be half of what people usually call bandwidth.
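To make the factor of two concrete, here is a minimal sketch in C. It assumes one plausible reading of the convention above: the reported figure counts each copied 8-byte double once, while the total memory traffic counts it twice (one read plus one write). The function names are hypothetical and are not taken from the benchmark source.

```c
#include <assert.h>

/* Hypothetical helper: rate at which data is copied, in MB/s (10^6 bytes),
   counting each 8-byte double once per dcopy call. */
double dcopy_copied_mb_per_s(long n, long reps, double seconds)
{
    assert(n > 0 && reps > 0 && seconds > 0.0);
    return 8.0 * (double)n * (double)reps / seconds / 1.0e6;
}

/* Hypothetical helper: total memory traffic in MB/s, counting every
   element once for the read and once for the write. */
double dcopy_traffic_mb_per_s(long n, long reps, double seconds)
{
    return 2.0 * dcopy_copied_mb_per_s(n, reps, seconds);
}
```

For example, 10 copies of a vector of 1,000,000 doubles in one second of user time give 80 MB/s copied and 160 MB/s of total traffic under this assumption.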

The tests involve timing a for-loop of calls to some of the double-precision BLAS routines most widely used in CFD codes. Thus, for problem sizes that fit in cache, the benchmark measures the in-cache speed of the BLAS routines. A future revision of the code will provide for the choice of loading part or all of the operands from main memory at every invocation of the BLAS routines, thus providing a more complete description of the performance of the routines.

The timing routine used is either times (the value used is the user time, tms_utime) or the Fortran routine dtime, which in some cases (notably on the SGI Power Challenge) uses a higher-resolution internal timer. Unfortunately dtime is not present on all platforms, Crays being one example. In both cases the time used is just user time and not user+system time (though in most cases there would have been little difference). The reason is that system time is more variable, depending on the total memory of the machine, the memory use of other users, etc. This choice may disadvantage IBM hardware, as AIX apparently counts as user time some things that other operating systems count as system time.
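The measurement described above can be sketched in C roughly as follows. This is not the benchmark source: the routine daxpy_standin is a plain-C placeholder for the BLAS routine daxpy, so that the sketch needs no BLAS library, and all function names are hypothetical. The sketch also assumes that every call reuses the same operands. Only times, tms_utime and sysconf(_SC_CLK_TCK) are real POSIX interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/times.h>
#include <unistd.h>

/* Plain-C stand-in for daxpy (y = a*x + y); a real run would call the
   vendor's BLAS routine here instead. */
static void daxpy_standin(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* User CPU time consumed so far by this process, in seconds.
   System time (tms_stime) is deliberately ignored. */
static double user_seconds(void)
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    clock_t rc = times(&t);
    assert(ticks > 0 && rc != (clock_t)-1);
    return (double)t.tms_utime / (double)ticks;
}

/* Time `reps` back-to-back calls on the same operands and return the
   user time spent, in seconds. */
double time_daxpy_loop(size_t n, long reps, const double *x, double *y)
{
    double start = user_seconds();
    for (long r = 0; r < reps; r++)
        daxpy_standin(n, 1.0e-9, x, y);
    return user_seconds() - start;
}

/* daxpy performs 2*n floating-point operations per call. */
double daxpy_mflops(size_t n, long reps, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)n * (double)reps / seconds / 1.0e6;
}
```

A caller would pass the result of time_daxpy_loop to daxpy_mflops, after checking that the measured time is not zero, which can easily happen when the loop is shorter than one timer tick.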

Timings were done (when possible) in dedicated mode or on very lightly loaded machines, but the results did not differ much when the benchmarks were run on a loaded machine (with one exception). The low timer resolution (10 ms) of many modern systems (especially IBM RS/6000 under AIX and HP 9000 under HP-UX) meant that a lot of CPU time had to be used for the dgemv and dgemm benchmarks.
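The effect of a coarse timer on run length can be estimated with a short calculation, sketched here in C. It is an illustration only: the target relative error is an assumption of this sketch, not a figure used in the benchmarks, and the function names are hypothetical. If one timer tick must be at most a fraction rel_error of the measured interval, the interval must last at least tick/rel_error seconds, however fast or slow a single call is.

```c
#include <assert.h>

/* Shortest measured interval (seconds) for which one timer tick is at
   most `rel_error` of the total, e.g. rel_error = 0.01 for 1%. */
double min_interval_seconds(double tick_seconds, double rel_error)
{
    assert(tick_seconds > 0.0 && rel_error > 0.0);
    return tick_seconds / rel_error;
}

/* Number of calls needed to fill that interval, rounded up, at least 1. */
long reps_needed(double tick_seconds, double rel_error,
                 double seconds_per_call)
{
    assert(seconds_per_call > 0.0);
    double exact = min_interval_seconds(tick_seconds, rel_error)
                   / seconds_per_call;
    long reps = (long)exact;          /* truncate toward zero */
    if ((double)reps < exact)
        reps++;                       /* round up to a whole call */
    return reps < 1 ? 1 : reps;
}
```

With a 10 ms tick and a 1% target, every data point needs at least one second of user time, so a sweep over many problem sizes adds up quickly.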


(Last updated October 9th 1995 - will be updated with ddot results for IBM SP2 Wide and Thin nodes and all of the results for Convex SPP-1200, SGI Power Challenge/90MHz and (hopefully) Sun Ultra1 as soon as I find the time to produce the graphs)


(Last updated March 31st 1995)


(Last updated October 18th 1995: These tests were run on NCSA's SPP-1000/32 machine, lena. The machine was upgraded to an SPP-1200/64 a little while after the tests were performed but the results are still interesting for any SPP-1000 user)


(Last updated November 13th 1995: These tests were run on NCSA's new SPP-1200/64 machine, lena, in dedicated mode. For some reason that is still unclear, the CPU utilization reported by tcsh time for the benchmark runs was very low, never surpassing 85% and mostly staying around 60%.)


(Last updated November 13th 1995 - ran on loaded system at NCSA, awaiting results in dedicated mode)

