BLAS benchmarks
The use of BLAS subroutines in scientific and technical codes has become widespread during the past decade. Hardware (and in some cases scientific software) vendors have come up with optimized versions of the BLAS subroutines which enable (in theory at least) a program using them as building blocks to get the most out of the performance of any platform, without the need for extensive hand-tuning. Thus the benchmarking of the performance of the BLAS subroutines themselves should give a good indication of the performance attainable on different platforms or of performance problems in a vendor's library.
Unfortunately, the vendors themselves have mostly not provided such information, and even when they do, only peak performance figures are given. Notable exceptions are IBM's Algorithm and Architecture Aspects of Producing ESSL BLAS on POWER2 whitepaper and Parallel ESSL performance measurements, and Digital's DXML: A High-performance Scientific Subroutine Library. (I would be grateful for pointers to more publicly available information.)
Unlike the IBM & Digital cases, the tests whose results are presented below give performance figures as seen by the end-user. In other words, no attempt was made to measure the speed of the algorithm itself by timing the argument checking and compensating for it, as this time is a cost that one actually pays when using BLAS subroutines in any code.
In the case of dcopy I measure bi-directional bandwidth (i.e. maximum traffic going in both directions); this should be half of what people usually call bandwidth.
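As a rough illustration of this convention, the sketch below (Python, not part of the benchmark code) counts each 8-byte element once per copy, so the figure it returns is half of the combined read-plus-write traffic. The function and argument names are hypothetical, and the choice of 2^20 bytes per MB is an assumption; it is suggested by the 572MB/s peak quoted for the 75MHz R8000, which matches one double copied per clock cycle under that definition.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_MB = 2 ** 20  # assumption: "MB" means 2^20 bytes


def dcopy_rate_mb_per_s(n, reps, seconds):
    """Copy rate in MB/s, counting each copied element once.

    n       -- vector length (number of doubles) per dcopy call
    reps    -- number of dcopy calls that were timed
    seconds -- user time spent in those calls
    """
    bytes_copied = n * BYTES_PER_DOUBLE * reps
    return bytes_copied / seconds / BYTES_PER_MB


def total_traffic_mb_per_s(n, reps, seconds):
    """Read traffic plus write traffic: twice the copy rate."""
    return 2.0 * dcopy_rate_mb_per_s(n, reps, seconds)


# One double per cycle at 75 MHz, sustained for one second.
peak_75mhz = dcopy_rate_mb_per_s(75_000_000, 1, 1.0)
```

Under these assumptions `peak_75mhz` comes out at roughly 572, which is why the sketch uses 2^20 rather than 10^6 bytes per MB.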
The tests involve timing a for-loop of calls to some of the double-precision BLAS routines most widely used in CFD codes. Thus for problem sizes that fit in cache the benchmark calculates the in-cache speed of the BLAS routines. A future revision of the code will provide for the choice of loading part or all of the operands from main memory at every invocation of the BLAS routines, thus providing a more complete description of the performance of the routines.
The timing routine used is either times (the value used is the user time, tms_utime) or the Fortran routine dtime, which in some cases (notably the SGI Power Challenge) uses a higher-resolution internal timer. Unfortunately dtime is not present on all platforms, Crays being one example. In both cases the time used is just user time and not user+system time (though in most cases there would have been little difference). The reason is that system time is more variable, depending on the total memory of the machine, memory use by other users and so on. This choice may disadvantage IBM hardware, as AIX apparently counts as part of user time things that other O/Ses count as system time.
Timings were done (when possible) in dedicated mode or on very lightly loaded machines, but the results did not differ a lot when the benchmarks were run on a loaded machine (with one exception). The low timer resolution (10ms) of a lot of modern architectures (especially IBM RS/6000/AIX, HP9000/HP-UX) meant that for the dgemv and dgemm benchmarks a lot of CPU time had to be used.
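The general shape of such a measurement can be sketched in Python, whose os.times() exposes the same user-time value as the times call. This is only an illustration, not the benchmark code: the pure-Python daxpy stands in for a library BLAS call, all function names are hypothetical, and the flop count of 2n per daxpy call is the conventional one rather than something stated here.

```python
import os


def user_time():
    """User CPU time of this process in seconds (the tms_utime value)."""
    return os.times().user


def daxpy(alpha, x, y):
    """Stand-in for BLAS daxpy: y <- alpha*x + y, updated in place."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def time_daxpy(n, reps):
    """Time `reps` back-to-back calls on the same operands.

    Reusing the operands means that, for small n, the data stays in
    cache between calls, so the in-cache speed is what gets measured.
    """
    x = [1.0] * n
    y = [0.0] * n
    start = user_time()
    for _ in range(reps):
        daxpy(0.5, x, y)
    elapsed = user_time() - start  # user time only, not user+system
    return elapsed, y


def mflops(flops, seconds):
    """Convert an operation count and a time into MFlop/s."""
    return flops / seconds / 1.0e6
```

For a run of `reps` calls on vectors of length `n`, the rate would then be `mflops(2 * n * reps, elapsed)`, provided `elapsed` is well above the timer resolution.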
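With a 10ms timer tick, a measured interval can be off by about one tick, so the loop has to run for many ticks before the result means much. One way to choose the repetition count, sketched below in Python, is to keep doubling it until the measured user time reaches resolution divided by the relative error one is willing to accept. This is an assumption about how such a loop could be sized, not a description of the benchmark code; the function name, its parameters and the injectable clock are all hypothetical.

```python
import os


def calibrate_reps(kernel, resolution=0.01, rel_error=0.01,
                   clock=lambda: os.times().user, max_reps=1 << 30):
    """Double the repetition count until the timed interval is long enough.

    kernel     -- zero-argument callable wrapping one BLAS call
    resolution -- timer tick in seconds (10ms on the machines above)
    rel_error  -- acceptable relative timing error due to the tick
    clock      -- function returning user time in seconds
    Returns (reps, elapsed) for the first run that lasted long enough,
    or for the run at max_reps if that limit is reached first.
    """
    min_time = resolution / rel_error
    reps = 1
    while True:
        start = clock()
        for _ in range(reps):
            kernel()
        elapsed = clock() - start
        if elapsed >= min_time or reps >= max_reps:
            return reps, elapsed
        reps *= 2
```

With the defaults, a 1% error target on a 10ms timer asks for at least one second of user time per data point, which is consistent with the dgemv and dgemm runs needing a lot of CPU time.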
(Last updated October 9th 1995 - will be updated with ddot results for IBM SP2 Wide and Thin nodes and all of the results for Convex SPP-1200, SGI Power Challenge/90MHz and (hopefully) Sun Ultra1 as soon as I find the time to produce the graphs)
- BLAS level 1:
- BLAS level 2 - dgemv (for square matrices)
- vectorsize 1-200
- vectorsize 3-14 showing the difference between vendor library BLAS and specialized small matrix-vector multiply routines. (work in progress)
- BLAS level 3 - dgemm (for square matrices)
(Last updated March 31st 1995)
- BLAS tests on the IBM SP1 & SP2 Wide, Thin and Thin2 nodes (comparing the performance of Power & Power2 ESSL libraries with surprising results).
(Last updated October 18th 1995: These tests were run on NCSA's SPP-1000/32 machine, lena. The machine was upgraded to an SPP-1200/64 a little while after the tests were performed but the results are still interesting for any SPP-1000 user)
- BLAS tests on the Convex SPP-1000, comparing the performance of the Convex MLIB (libveclib) and the BLAS implementation in libblas against the Netlib BLAS source compiled with the Convex fc & the HP f77 compilers. A comparison of the Convex cc & the HP c89 C compilers was also conducted using the Numerical Recipes Brenner FFT routine (FOUR1). The compilation flags used were:
- fc & cc: -O2 -tm spp1000 -cache 1024 -ds -ipo -mrl -or all
- c89: +O4 +DA735 +Oall
- f77: +OP4 +OPK +Oall +O4 +DA735
- BLAS Level 1: The transition to working outside the cache has a sharper effect on the performance of the Convex MLIB, which shows abnormally bad performance (compared to the other 3) for datasets just larger than the cache.
- BLAS Level 2: For vectors larger than n=79 the Convex MLIB dgemv performs extremely poorly. The results suggest that libblas is just the Netlib BLAS source compiled with the HP f77 compiler. However the BLAS 1 results and especially ddot suggest this is not true.
- BLAS Level 3: The Convex MLIB dgemm displays very erratic behavior and, once outside the cache, sharing the system with other users running processes seems to produce a significant drop in performance. Taken from Netlib, DGEMM for IBM RS/6000 by Dongarra, Mayes, and Radicatti was compiled with both Fortran compilers and added to the comparison.
Unfortunately the resolution of the timer on the SPP-1000 is 10ms and this adds further noise to the erratic results seen. It appears that for small matrix sizes the Convex MLIB dgemm peaks at sizes that are multiples of four, which probably reflects the unrolling factor used in one of the loops. DGEMM for IBM RS/6000 by Dongarra, Mayes, and Radicatti uses a blocking factor of 64 and this shows in its performance curve when compiled with the HP compiler.
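To show why an unrolling factor of four would favour sizes that are multiples of four, here is a small Python sketch of a dot-product loop unrolled by four with a scalar cleanup loop. It is purely illustrative: the MLIB source is not available here, the function names are hypothetical, and a dot product is used instead of a full dgemm inner loop to keep the example short.

```python
def ddot_unrolled4(x, y):
    """Dot product with the main loop unrolled by four.

    Elements 0 .. main-1 go through the unrolled loop; the remaining
    n % 4 elements go through a slower one-at-a-time cleanup loop.
    """
    n = len(x)
    main = n - n % 4
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, main, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    total = (s0 + s1) + (s2 + s3)
    for i in range(main, n):  # cleanup loop, empty when n % 4 == 0
        total += x[i] * y[i]
    return total


def cleanup_iterations(n, unroll=4):
    """Number of elements left over for the cleanup loop."""
    return n % unroll
```

Sizes that are multiples of four skip the cleanup loop entirely, while a size of the form 4n - 1 leaves three elements for it, the largest possible remainder.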
(Last updated November 13th 1995: These tests were run on NCSA's new SPP-1200/64 machine, lena, in dedicated mode. For some reason which is unclear at this time, the CPU utilization reported by tcsh time for the benchmark runs was very low, never surpassing 85% and staying mostly around 60%.)
- BLAS tests on the Convex SPP-1200, comparing the performance of the Convex MLIB (libveclib) and the BLAS implementation in libblas against the Netlib BLAS source compiled with the Convex fc & the HP f77 compilers. A comparison of the Convex cc & the HP c89 C compilers was also conducted using the Numerical Recipes Brenner FFT routine (FOUR1). The compilation flags used were:
- fc & cc: -O2 -tm spp1200 -cache 256 -ds -ipo -mrl -or all
- c89: +O4 +DA735 +Oall
- f77: +OP4 +OPK +Oall +O4 +DA735
- BLAS Level 1: The problems seen with veclib on the SPP-1000 are not present here. Moreover, in the case of daxpy and ddot, veclib offers the best performance for vectors just fitting in cache, with the Netlib src/fc combination outperforming everyone in the same limit for dcopy.
- BLAS Level 2: For vectors larger than n=79 the Convex MLIB dgemv performs extremely poorly. libblas on the other hand offers excellent performance, unlike the case of the SPP-1000.
- BLAS Level 3: Both Convex MLIB dgemm and libblas dgemm peak out at array sizes that are multiples of 4, with significantly worse performance for sizes of the form (4 n - 1). Once outside the cache, unlike the usual behavior for blocked dgemm routines, there is a significant drop in performance, with Convex MLIB dgemm displaying a very erratic performance. Taken from Netlib, DGEMM for IBM RS/6000 by Dongarra, Mayes, and Radicatti was compiled with both Fortran compilers and added to the comparison.
As with the SPP-1000, the resolution of the timer on the SPP-1200 is 10ms and this adds further noise to the erratic results seen. DGEMM for IBM RS/6000 by Dongarra, Mayes, and Radicatti uses a blocking factor of 64 and this shows in its performance curve when compiled with the HP compiler, as seen on the SPP-1000 as well.
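For readers unfamiliar with blocking, the idea behind a blocking factor of 64 can be sketched as follows in Python. This is a generic blocked matrix multiply, not the Dongarra, Mayes, and Radicatti code; the function name and the list-of-lists matrix layout are assumptions made for the example.

```python
def blocked_matmul(a, b, n, block=64):
    """C = A * B for n-by-n matrices stored as lists of rows.

    The three outer loops walk over block-by-block tiles so that the
    tiles being worked on can stay in cache; the three inner loops do
    an ordinary multiply restricted to those tiles.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    c_row = c[i]
                    a_row = a[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a_row[k]
                        b_row = b[k]
                        for j in range(jj, min(jj + block, n)):
                            c_row[j] += a_ik * b_row[j]
    return c
```

When n is not a multiple of the block size, the last tile in each direction is smaller than the rest, which is one plausible way for a fixed blocking factor to leave a visible pattern in a performance curve.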
(Last updated November 13th 1995 - ran on loaded system at NCSA, awaiting results in dedicated mode)
- BLAS tests on the NCSA Power Challenge/MIPS R8000/90MHz comparing its performance with the older R8000/75MHz results and their scaling to 90MHz:
- dcopy: Also shown are the results for an assembly-coded dcopy (optimized for unit stride copies only), which achieves very close to the 572MB/s peak bi-directional bandwidth of the 75MHz R8000 for unit stride copies in cache.
- daxpy
- ddot
- dgemv
- dgemm
Notice the problem with even-dimensioned square matrices in dgemm (performance not exceeding 120MFlop/s) and the odd drop in performance before the cache limit is reached in the BLAS Level 1 routines.