BENCH(5NEMO) manual page

This HTML automatically generated with rman for NEMO
Table of Contents

Synopsis

/usr/bin/time $NEMO/src/scripts/nemo.bench
make -f Benchfile clean all
nemobench LABEL COMMAND

Description

There are several methods to compare the performance of NEMO on different processors or environments:

0.

The command make bench5 from the $NEMO directory runs a small number of benchmarks designed to each takae 5 seconds on a given CPU. An average speed is then computed to compare different compilers and/or CPU’s, an idea simular to the geekbench suite of tests.

1.

The command make bench from the $NEMO directory runs a standard benchmark of about a dozen selected NEMO programs. It should run in about 30 seconds on a ~2020 computer. A short table (see below in RESULTS) lists results from some recent computers.

2.

The command make bench5 from the $NEMO directory runs a standard benchmark of (currently) 6 nbody runs, all calibrated to take 5 seconds on a specific machine. the average of those will result in a rating (speed) of 1000. Most machines post ~2020 should get a nemobench5 (SEE BELOW) of over 1000.

3.

Various Benchfile filles scattered through the $NEMO source tree contain benchmarks. There is no wrapper like "make check" has for Testfile’s

4.

Selected man pages have some examples. Over time these should be moved into the Benchfile in the source directory where this program is maintained. Some are mentioned below in this man page.

Unless otherwise noted, benchmarks are always performed at the max power settings of a laptop. In many cases the compiler version also will make a difference. An example is the NEMOBENCH5, which lost 10% in speed going from gcc-11 to gcc-12!
On KDE Ubuntu if the cpupower-gui is set to Powersave, one actually gets better performance, as the Performance setting also scales up many desktop apps, competing with the NEMO benchmark.

The NEMOBENCH5 benchmark (cd $NEMO;make bench5) runs a number (currently 6) N-body codes all calibrated to take 5.0 seconds on an i5-1135G7 in turbo mode (4.2GHz) using gcc-11. The NEMOBENCH5 rating is defined as 5000 divided by the average of those CPU times. Thus the i5-1135G7 has a score of 1000. This definition can be adjusted to 10s runs when the time comes that processors run too fast, there is a parameter internal to the nemo.bench script, and could also be adjusted if more codes are used.

M4-air            2000      ~3700(6)
i7-14700    1468    -
i9-13900K    -    2133
i9-12900K    -    n/a ~2000
M2          1540    n/a 1919, 1899
i9-12900H    -    n/a 1851
M1-Max    -    n/a 1785
i7-12800H    -    n/a 1654
i9-11980HK    -    n/a 1616
ultra 9 - 185H    1364    (hypothetical 5.1/4.8)
i9-12900K    1330    jansky gcc 11.2.0
ultra 7 - 155H    1283    darter76 gcc 11.4.0 (gcc-13 13% slower)
i7-1260P    1261    k2
i5-1240P    1155    chromebook
M1 (Macbook Air)    1269    n/a 1750 / 7xxxx
i7-1260P    1250    k2 - but only on full power
i5-1240P    1155    chromebook (frameworks)
i7-11800H        n/a    1474
i7-1185G7    1060    Dell Latitude 5420
i5-1135G7    1000    XPS13 1460 /
AMD threadripper 3960x    896    AlexBox
AMD Ryzen 7 5800H    891    beelink 
i5-10400H    700    Dell Precision 3351 (astroumd)
AMD EPYC 7302 @ 3.0GHz    630    lma 1000/ 28000
Intel Core i9-9900X    -    geminid 1214 / 9467
Xeon Gold 5218 @ 2.30 GHz    761    unity/node99 1100 / 8610
i5-10210U @ 1.60 GHz    822     X1Y4  clang-18
i5-10210U @ 1.60 GHz    732     X1Y4  1029 / 3194
Xeon(R) Silver 4114 @ 2.20GHz    -    terra 777 / 10213
i7-8550U @ 1.80 GHz    548     T480 (1600) 696 / 1935
Xeon E-2186G 3.80 GHz    -     aurora 1313 / 6721
i7-3820QM @ 2.70 GHz    498    MacBookPro (Retina, Mid 2012)
i7-3820 @ 3.60 GHz    414     dante 833 / 3693
Xeon(R) E3-1280 @ 3.50 GHz    384    chara 828 / 3117
i7-3630QM @ 2.40 GHz    429    T530 (1200) 762 / 3006 
Xeon E5-2623 v4 @ 2.60 GHz    -    kraken 782 / 5800
Xeon X5550  @ 2.67 GHz    256    sdp 566 / 3938
AMD Opteron(tm) 6380    234    yorp (2023)

A typical output of bench5 might look as follows:

CPU_USAGE  directcode  :  11.31  11.29  0.00  0.00  0.00  1787697842
CPU_USAGE  gyrfalcON   :  9.09   9.08   0.00  0.00  0.00  1787698975
CPU_USAGE  hackcode1   :  11.69  11.68  0.00  0.00  0.00  1787699885
CPU_USAGE  orbint      :  9.81   9.80   0.00  0.00  0.00  1787701056
CPU_USAGE  potcode     :  11.01  10.99  0.01  0.00  0.00  1787702038
CPU_USAGE  treecode1   :  10.94  10.93  0.01  0.00  0.00  1787703140
NEMOBENCH5 score: 469.85

where on this particular CPU there is a diversity of speeds. Some CPU show even more diversity.

Here is an example how to use nemobench5 is parallel mode on a 32 core Xeon, where a single NEMOBENCH5 has a score of 630. Since all codes are single core, gnu parallel is used to run them in parallel, which should scale linearly as long as memory is not exhausted (this particular CPU has 512GB memory):

for j in 2 4 8 12 16 24 32 48 64 128; do
    $NEMO/src/scripts/nemo.bench mode=5 np=$j | txtpar - p0=npt,1,2 p1=sum,1,2
p2=mean,1,2
done
  2  1256  628
  4  2512  628
  8  5020  628
 12  7512  626
 16 10047  628
 24 15024  626
 32 20025  626
 48 20589  429
 64 21517  336
128 20792  162
(numbers were rounded to improve readability)

Bench

Here are the results of the "make bench" benchmark. The time is the user CPU time. If two values are listed after the machine name, these are the GeekBench5 values.

Intel Core i5-1135G7    15.4    XPS13 1460 /
AMD EPYC 7302 @ 3.0GHz    23.9    lma 1000/ 28000
Intel Core i9-9900X    25.8    geminid 1214 / 9467
i5-10210U CPU @ 1.60GHz    29.4     X1Y4  1029 / 3194
Xeon(R) Silver 4114 @ 2.20GHz    37.1    terra 777 / 10213
i7-8550U CPU @ 1.80GHz    40.6     T480 (1600) 696 / 1935
Xeon E-2186G 3.80GHz    52.6     aurora 1313 / 6721
i7-3820 CPU @ 3.60GHz    ??/67.5     dante 833 / 3693
Xeon(R) CPU E3-1280 @ 3.50GHz    ??/70.3     chara 828 / 3117
i7-3630QM CPU @ 2.40GHz    76.2     T530 (1200) 762 / 3006 
Xeon(R) E5-2623 v4 @ 2.60GHz    42.1/79.4     kraken 782 / 5800
Xeon(R) X5550  @ 2.67GHz    115.4    sdp 566 / 3938

Keep in mind for most of these a default compilation was used. Some benchmarks are known to be able to improved by up to a factor of two with selected compiler and options changes.

The following tasks are run in the standard NEMO bench, for details see the src/scripts/nemo.bench script.

directcode nbody=3072 out=d0 seed=123 
hackcode1 nbody=10240  out=h0 seed=123 
mkplummer p0 10240 seed=123 
gyrfalcON p0 p1 kmax=6 tstop=2 eps=0.05
potcode p0 p2 freqout=10 freq=1000 tstop=2 potname=plummer
mkspiral s0 $nbody3 nmodel=40 seed=123 
ccdmath "" c0 ’ranu(0,1)’ size=256 seed=123
ccdpot c0 c1 
mkorbit o0 x=1 e=1 lz=1 potname=log
orbint o0 o1 nsteps=10000000 dt=0.001 nsave=100000

In addition each data file that is produced is checksummed and compared to a baseline version using bsf(1NEMO) if the argument bsf=1 is added.

Bench8

The bench8 measures how well your CPU performs simple OpenMP algorithm as more cores are employed. This can be an effective way to determine how well your CPU adjusts under increased power demands, for example laptops with thermal protection will scale down their cpu frequency as more cores are employed or as a long benchmark heats up the CPU.

   scaling2 umax=20000 np=1 iter=20
   scaling2 umax=20000 np=2 iter=40
   scaling2 umax=20000 np=4 iter=80

Look at the elapsed time to follow how well a particular CPU performs at increased load:

CPU         1    2    4    8    16
i7-1260P (pow)    7.0    7.1    8.5    18.7    24.6
i7-1260P (bat)    10.2    13.0    17.6    40.4    59.9
jansky       6.4    6.5    6.5    6.7    17.4
lma         12.2    12.3    12.2    12.3    12.4
M2          6.9    7.2    7.6    9.4    19.3

Bench10

Not really implemented, but this will be benchmarks orchestrated via the Benchfile’s found in the source tree.

Oldest Bench

At the inception of NEMO in 1986 there was no real benchmark, so for a while (as computers were relatively slow still) we used the default hackcode1(1NEMO) setting, where 128 particles in virial equilibrium are integrated for 64 timesteps:

      /usr/bin/time hackcode1 tstop=2  > /dev/null

On a Sun 3/50 (our development machine) this took about 5 seconds per step. Now, nearly 35 years later, my laptop runs this about 50,000 times faster. Looking in more detail at the original NEMO manual:

                       cpu/steps
sun 3/60:  20 MHz    2.28        
i5-1135G7: 4200 MHz    0.0000875

Despite that the cpu was 210 times faster, the code ran 26,000 faster. A very impressive factor of 120 improvement in chip and possibly some compiler technology. The NEMO users guide has an appendix in which this benchmark is listed for a variety of computers since 1986.

Caveats

Defining and running a benchmark can be very tricky stuff. It might be important to separate disk I/O from CPU usage. The unix time(1) command can be a help. The output from bash::time is a bit different form csh::time, and yet different from /usr/bin/time. Unless you find a special one, we prefer the csh::time, since the output clearly separates user, system and wall clock time, and also reports the I/O, viz.

   % time ls 
   0.012u 0.068s 0:00.77 9.0%    0+0k 8376+0io 0pf+0w
   2.324u 1.080s 0:09.25 36.7%    0+0k 1049384+2097160io 2pf+0w
   1.876u 0.788s 0:03.63 73.0%    0+0k 0+2097160io 0pf+0w

On linux the command

   echo 1 > > /proc/sys/vm/drop_caches

will clear the disk cache in memory, so your program will be forced to read from disk, with all possible interference from other programs

In NEMO another useful addition to the benchmark is that the output can be turned off easily, by using out=., viz.

    % sudo $NEMO/src/scripts/clearcache
    % time ccdsmooth n1 . dir=x
    0.852u 1.068s 0:12.41 15.3%    0+0k 2098312+0io 6pf+0w
    0.812u 0.400s 0:01.21 100.0%    0+0k 0+0io 0pf+0w
    0.820u 0.380s 0:01.20 100.0%    0+0k 0+0io 0pf+0w

where the last two instances were just re-running the same command, but now clearly showing the effect of reading the file from memory instead of disk. By repeating this whole series a few times, an lower bound to the wall clock time is more likely to properly account for the I/O overhead time.

Rule of thumb: always run a benchmark a few times to see if a hot CPU slows down the benchmark. If I/O is cached. Other tasks are interfering.

"others"

A few other man pages in NEMO also maintain their own list how its program compares under different compilers/options/cpu options:

CGS(1NEMO)
scfm(1NEMO)

Other industry benchmarks:

    Geekbench 5 (very wide variety of compute workloads - baseline is i3-8100)
    Linpack   (focus on floating point operations - Gflops)
    SPEC CPU 2017 ($$$) benchmark -

Tabbench

The table I/O benchmark uses a 100M row dataset with 3 columns, representing X,Y,Z of which the radius R=sqrt(X^2+Y^2+Z^2) is computed. This table is about 2.7 GB in size. Of course reading the table is all dependent on the HDD/SDD, but in the case described here this was a fast SSD, and took 2 sec to read, or just over 1000 MB/sec.

    /usr/bin/time tabgen tab3 100000000 3
    /usr/bin/time tabbench2 . mode=-1
    

this bench will need to be repeated for mode=0,1,2,3 to estimate the different
components as they
are added to the workflow. The tabgen(1NEMO) is dominated by
drawing random numbers and writing them using printf(3) , which is slow.

    80s   writing, using tabgen
     2s   reading in tabbench2
    22s   parsing in numbers  [np.loadtxt takes 748 sec!!!]
     6s   using fie(3NEMO) to compute radii
     1s   using np.sqrt(), and presumably C’s sqrt() as well

Parallel

The GNU parallel(1) tool can be of great use if your tasks are pure single core and you have enough cores (most laptops have at least 4 these days) and memory to fit your tasks. As an example, here is something contrived using mkplummer(1NEMO) that does not write to disk, so it should be highly parallizable:

    nbody=10000000
    /usr/bin/time mkplummer . $nbody
    2.80user 0.45system 0:03.26elapsed 99%CPU
    
    echo mkplummer . $nbody  > run.txt
    echo mkplummer . $nbody >> run.txt
    /usr/bin/time parallel --jobs 1 < run.txt
    5.89user 0.83system 0:06.71elapsed 100%CPU
    
    /usr/bin/time parallel --jobs 2 < run.txt
    6.00user 0.79system 0:03.44elapsed 197%CPU

which follows Amdahl’s law close to 100%!

$NEMO/src/scripts/nemo.bench    Script uses by make bench/bench5/bench10
$NEMO/data       standard repository area for (small) data files.
Benchfile    A Makefile that can orchestrate series of benchmarks
/tmp/nemobench.log    The nemobench keeps logfile

Update History

12-may-97    created      PJT
26-nov-03    finally added some data        PJT
17-feb-04    added bench0 comparison      PJT
31-mar-05    added some cygwin numbers, fixed input    PJT
6-may-11    added i7 and SHMEM/HDD comparison    PJT
27-sep-13    added caveats    PJT
6-jan-2018    updated for V4, more balanced benchmarks     PJT
27-dec-2019    nemo.bench; updated with potcode and orbint    PJT
26-jul-2020    added timings / added geekbench5     PJT
31-aug-2023    added bench8    PJT

Table of Contents

Name
Synopsis
Description
Nemobench5
Bench
Bench8
Bench10
Oldest Bench
Caveats
"others"
Tabbench
Parallel
Considerations
See Also
Author
Files
Update History