Description
SSE3DNow benchmark measures Single Precision (SP) and Double Precision (DP) Floating Point speeds, data streaming from caches and RAM. It uses SSE (SP) and/or SSE2 (DP) and/or AMD 3DNow (SP) Single Instruction Multiple Data (SIMD) floating point instructions. Tests using normal i386 single precision floating point are also included to demonstrate performance gains of the new functions.
Note: SSE instructions are available on Intel Pentium III onwards and SSE2 came with the Pentium 4. AMD Athlon 4 CPUs also incorporated SSE instructions and Athlon 64 has SSE2.
The procedures used are memory to register operations (r = r + x[m] * y[m]) and memory to register to memory (x[m] = x[m] + y[m]). SSE instructions use 128 bit registers for 4 single precision floating point words with SSE2 using them for 2 double precision words. 3DNow uses the 64 bit MMX registers with two single precision words. The tests are run with a range of data sizes to provide measurement via L1 cache, L2 cache and RAM, where up to 25% of main memory size is automatically selected.
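The two procedures above can be sketched in C as follows. This is an illustrative sketch only: the benchmark itself uses assembly code, whereas this version assumes compiler intrinsics from `xmmintrin.h`. The function names are hypothetical, and unaligned loads are an assumption made for simplicity. Where SSE is not enabled at compile time, the SSE function falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics for single precision */
#endif

/* Memory to register: r = r + x[m] * y[m], normal scalar floating point. */
float mul_add_scalar(const float *x, const float *y, size_t n)
{
    float r = 0.0f;
    for (size_t m = 0; m < n; m++)
        r = r + x[m] * y[m];
    return r;
}

/* Memory to register to memory: x[m] = x[m] + y[m], scalar. */
void add_store_scalar(float *x, const float *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* Memory to register using one 128 bit register holding 4 SP words.
   Assumption: n is a multiple of 4 (hypothetical restriction). */
float mul_add_sse(const float *x, const float *y, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE__)
    __m128 r = _mm_setzero_ps();
    for (size_t m = 0; m < n; m += 4)
        r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(x + m),
                                     _mm_loadu_ps(y + m)));
    float lane[4];
    _mm_storeu_ps(lane, r);            /* sum the four partial results */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
#else
    return mul_add_scalar(x, y, n);    /* no SSE at compile time */
#endif
}
```

Note that the SSE version accumulates four partial sums, so with general data its result can differ in the last bits from the scalar loop because the additions occur in a different order.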
A pre-compiled version of the benchmark can be found in SSE3DNow.zip which also contains the source code, providing further explanatory comments.
SSE64, a version compiled for 64 bit operation, is in More64bit.zip. In this version, normal i387 floating point and 3DNow instructions are not supported. The SSE/SSE2 assembly code is the same as the original (see results below).
Information on maximum speeds when less processing is involved can be obtained from BusSpd2K results.htm and RandMem results.htm.
Other PC benchmarks and results can be found via My Main Page.
Results are given as Millions of Bytes Per Second (MB/s) memory reading speed. The second set of figures (memory to register to memory) can be multiplied by 1.5 for read/write speed. The results can also be converted to Millions of Floating Point Operations Per Second (MFLOPS). For memory to register tests, divide SSE2 MB/s by 8 and others by 4. For memory to register to memory tests, divide SSE2 MB/s by 16 and others by 8. Test results output includes maximum MFLOPS speeds.
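The conversion rules can be written as a small helper. This is a sketch with hypothetical names, not code from the benchmark source. The divisors follow from the procedures described earlier: r = r + x[m] * y[m] reads two words for two operations, and x[m] = x[m] + y[m] reads two words for one operation.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum procedure { MEM_TO_REG, MEM_TO_REG_TO_MEM };

/* Convert MB/s reading speed to MFLOPS.
   bytes_per_word is 4 for single precision (i386, SSE, 3DNow)
   and 8 for double precision (SSE2). */
double mbps_to_mflops(double mbps, int bytes_per_word, enum procedure p)
{
    assert(bytes_per_word == 4 || bytes_per_word == 8);
    /* Bytes read per floating point operation:
       memory to register           -> one word  (divide by 4 or 8)
       memory to register to memory -> two words (divide by 8 or 16) */
    int bytes_per_flop = (p == MEM_TO_REG) ? bytes_per_word
                                           : 2 * bytes_per_word;
    return mbps / bytes_per_flop;
}

/* Read/write speed for the memory to register to memory tests:
   two words read plus one written, so 1.5 times the reading speed. */
double read_write_mbps(double read_mbps)
{
    return read_mbps * 1.5;
}
```

For example, an SSE memory to register result of 8000 MB/s corresponds to 2000 MFLOPS, and the same figure from an SSE2 memory to register to memory test corresponds to 500 MFLOPS.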
A variation of the benchmark has also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm. Benchmark execution files, source code, and compile and run instructions can be downloaded in memory_benchmarks.tar.gz. Under Windows, the file downloaded wrongly as .tar.tar, but it is fine when renamed .tar.gz. See Linux results below.
Following is an example output for an Athlon 64 CPU. Variations in performance identify L1 and L2 cache sizes.
Results
Separate tables of speeds obtained via L1 cache, L2 cache and RAM are given below. Except when connected via the memory bus, performance via caches tends to be proportional to CPU MHz for a given type of processor, so only a sample of results is provided. Details of cache sizes, speed and range of CPU MHz can be found in CPUSpeed.htm, which also provides a range of performance comparisons based on %MFLOPS/MHz, including SSE3DNow results.
The tables also show the highest value for each column of results and associated MFLOPS. This makes it easier to see that one CPU is not the fastest on all tests.
L1 Cache Results - Comparing Intel Pentium 4 results with those from AMD CPUs of similar MHz shows that the latter have superior floating point performance, particularly on old i386 instructions. This also applies to the Pentium M, and both have larger L1 caches than the Pentium 4. The results also show that the later Pentium 4E (Prescott) performs significantly faster than earlier P4s on these tests. AMD Athlon 64 CPUs produce similar performance to earlier Athlons of the same MHz. Later results for Core 2 Duo show that this has outstanding performance on SSE and SSE2 tests relative to CPU MHz.
Results are sorted by the first i386 FP speeds (Sngl).
L2 Cache Results - These are sorted in the same order as for L1 cache. The i386 FP columns indicate relative improvements in Pentium 4 results, although AMD CPUs of the same MHz are still faster. The superior Pentium 4 (and Pentium M) L2 cache performance is reflected in SSE and SSE2 results, where AMD CPUs of the same MHz are slower. Some Athlon 64/Opteron results are much better than those of earlier AMD processors. Again, Core 2 Duo results are outstanding.
RAM Speed Results - Results include four Pentium 4s at 2533 MHz, showing the impact of different types of RAM and variations due to different mainboards. Performance depends on CPU and RAM speed, with Pentium 4s showing the best scores on all tests, although Athlon 64s at least take the lead on %MFLOPS/MHz. Later, the first Core 2 Duo results were via a board using the nForce 570 chipset, where most speeds were disappointing. Further results were via the Intel 965 chipset, which produced the best performance.
Linux Results - The 64-Bit and 32-Bit versions, produced to run under Linux Ubuntu, do not have the 3DNow tests. Instead, SSE functions of the form s=(s+x[m])*y[m] and s=s+x[m]+y[m] are used, again using assembly code. This can lead to higher speeds, as only memory to register calculations are used in the main timed loop.
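The two Linux function forms can be sketched in scalar C as below. This is only an illustration of the arithmetic: the benchmark uses SSE assembly code, and the function names and the starting value parameter here are hypothetical.

```c
#include <stddef.h>

/* s = (s + x[m]) * y[m], memory to register only: nothing is
   written back to x or y inside the loop. */
float add_mul(float s, const float *x, const float *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        s = (s + x[m]) * y[m];
    return s;
}

/* s = s + x[m] + y[m], also memory to register only. */
float add_add(float s, const float *x, const float *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        s = s + x[m] + y[m];
    return s;
}
```

Both loops read two words per pass and keep the running result in a register, which is why no store traffic appears in the timed loop.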