/usr/bin/time $NEMO/src/scripts/nemo.bench make -f Benchfile clean all nemobench LABEL COMMAND
- 0.
- The command make bench5 from the $NEMO directory runs a small number of benchmarks designed to each takae 5 seconds on a given CPU. An average speed is then computed to compare different compilers and/or CPU’s, an idea simular to the geekbench suite of tests.
- 1.
- The command make bench from the $NEMO directory runs a standard benchmark of about a dozen selected NEMO programs. It should run in about 30 seconds on a ~2020 computer. A short table (see below in RESULTS) lists results from some recent computers.
- 2.
- The command make bench5 from the $NEMO directory runs a standard benchmark of (currently) 6 nbody runs, all calibrated to take 5 seconds on a specific machine. the average of those will result in a rating (speed) of 1000. Most machines post ~2020 should get a nemobench5 (SEE BELOW) of over 1000.
- 3.
- Various Benchfile filles scattered through the $NEMO source tree contain benchmarks. There is no wrapper like "make check" has for Testfile’s
- 4.
- Selected man pages have some examples. Over time these should be moved into the Benchfile in the source directory where this program is maintained. Some are mentioned below in this man page.
Unless otherwise noted, benchmarks are always performed at the max power settings of a laptop. In many cases the compiler version also will make a difference. An example is the NEMOBENCH5, which lost 10% in speed going from gcc-11 to gcc-12!
The NEMOBENCH5 benchmark (cd $NEMO;make bench5) runs
a number (currently 6) N-body codes all calibrated to take 5.0 seconds on
an i5-1135G7 in turbo mode (4.2GHz) using gcc-11. The NEMOBENCH5 rating is
defined as 5000 divided by the average of those CPU times. Thus the i5-1135G7
has a score of 1000. This definition can be adjusted to 10s runs when the
time comes that processors run too fast, there is a parameter internal
to the nemo.bench script, and could also be adjusted if more codes are used.
i9-13900K - 2133 i9-12900K - n/a ~2000 M2 ~1450? n/a 1919, 1899 i9-12900H - n/a 1851 M1-Max - n/a 1785 i7-12800H - n/a 1654 i9-11980HK - n/a 1616 ultra 9 - 185H 1364 (hypothetical 5.1/4.8) i9-12900K 1330 jansky gcc 11.2.0 ultra 7 - 155H 1283 darter76 gcc 11.4.0 (gcc-13 13% slower) i7-1260P 1261 k2 i5-1240P 1155 chromebook M1 (Macbook Air) 1269 n/a 1750 / 7xxxx i7-1260P 1250 k2 - but only on full power i5-1240P 1155 chromebook (frameworks) i7-11800H n/a 1474 i7-1185G7 1060 Dell Latitude 5420 i5-1135G7 1000 XPS13 1460 / AMD threadripper 3960x 896 AlexBox AMD Ryzen 7 5800H 891 beelink i5-10400H 700 Dell Precision 3351 (astroumd) AMD EPYC 7302 @ 3.0GHz 630 lma 1000/ 28000 Intel Core i9-9900X - geminid 1214 / 9467 Xeon Gold 5218 @ 2.30 GHz 761 unity/node99 1100 / 8610 i5-10210U @ 1.60 GHz 822 X1Y4 clang-18 i5-10210U @ 1.60 GHz 732 X1Y4 1029 / 3194 Xeon(R) Silver 4114 @ 2.20GHz - terra 777 / 10213 i7-8550U @ 1.80 GHz 548 T480 (1600) 696 / 1935 Xeon E-2186G 3.80 GHz - aurora 1313 / 6721 i7-3820QM @ 2.70 GHz 498 MacBookPro (Retina, Mid 2012) i7-3820 @ 3.60 GHz 414 dante 833 / 3693 Xeon(R) E3-1280 @ 3.50 GHz 384 chara 828 / 3117 i7-3630QM @ 2.40 GHz 429 T530 (1200) 762 / 3006 Xeon E5-2623 v4 @ 2.60 GHz - kraken 782 / 5800 Xeon X5550 @ 2.67 GHz 256 sdp 566 / 3938 AMD Opteron(tm) 6380 234 yorp (2023)
A typical output of bench5 might look as follows:
CPU_USAGE directcode : 11.31 11.29 0.00 0.00 0.00 1787697842 CPU_USAGE gyrfalcON : 9.09 9.08 0.00 0.00 0.00 1787698975 CPU_USAGE hackcode1 : 11.69 11.68 0.00 0.00 0.00 1787699885 CPU_USAGE orbint : 9.81 9.80 0.00 0.00 0.00 1787701056 CPU_USAGE potcode : 11.01 10.99 0.01 0.00 0.00 1787702038 CPU_USAGE treecode1 : 10.94 10.93 0.01 0.00 0.00 1787703140 NEMOBENCH5 score: 469.85where on this particular CPU there is a diversity of speeds. Some CPU show even more diversity.
Here is an example how to use nemobench5 is parallel mode on a 32 core Xeon, where a single NEMOBENCH5 has a score of 630. Since all codes are single core, gnu parallel is used to run them in parallel, which should scale linearly as long as memory is not exhausted (this particular CPU has 512GB memory):
for j in 2 4 8 12 16 24 32 48 64 128; do $NEMO/src/scripts/nemo.bench mode=5 np=$j | txtpar - p0=npt,1,2 p1=sum,1,2 p2=mean,1,2 done 2 1256 628 4 2512 628 8 5020 628 12 7512 626 16 10047 628 24 15024 626 32 20025 626 48 20589 429 64 21517 336 128 20792 162 (numbers were rounded to improve readability)
Intel Core i5-1135G7 15.4 XPS13 1460 / AMD EPYC 7302 @ 3.0GHz 23.9 lma 1000/ 28000 Intel Core i9-9900X 25.8 geminid 1214 / 9467 i5-10210U CPU @ 1.60GHz 29.4 X1Y4 1029 / 3194 Xeon(R) Silver 4114 @ 2.20GHz 37.1 terra 777 / 10213 i7-8550U CPU @ 1.80GHz 40.6 T480 (1600) 696 / 1935 Xeon E-2186G 3.80GHz 52.6 aurora 1313 / 6721 i7-3820 CPU @ 3.60GHz ??/67.5 dante 833 / 3693 Xeon(R) CPU E3-1280 @ 3.50GHz ??/70.3 chara 828 / 3117 i7-3630QM CPU @ 2.40GHz 76.2 T530 (1200) 762 / 3006 Xeon(R) E5-2623 v4 @ 2.60GHz 42.1/79.4 kraken 782 / 5800 Xeon(R) X5550 @ 2.67GHz 115.4 sdp 566 / 3938
Keep in mind for most of these a default compilation was used. Some benchmarks are known to be able to improved by up to a factor of two with selected compiler and options changes.
The following tasks are run in the standard
NEMO bench, for details see the src/scripts/nemo.bench script.
directcode nbody=3072 out=d0 seed=123 hackcode1 nbody=10240 out=h0 seed=123 mkplummer p0 10240 seed=123 gyrfalcON p0 p1 kmax=6 tstop=2 eps=0.05 potcode p0 p2 freqout=10 freq=1000 tstop=2 potname=plummer mkspiral s0 $nbody3 nmodel=40 seed=123 ccdmath "" c0 ’ranu(0,1)’ size=256 seed=123 ccdpot c0 c1 mkorbit o0 x=1 e=1 lz=1 potname=log orbint o0 o1 nsteps=10000000 dt=0.001 nsave=100000In addition each data file that is produced is checksummed and compared to a baseline version using bsf(1NEMO) if the argument bsf=1 is added.
scaling2 umax=20000 np=1 iter=20 scaling2 umax=20000 np=2 iter=40 scaling2 umax=20000 np=4 iter=80
Look at the elapsed time to follow how well a particular CPU performs at increased load:
CPU 1 2 4 8 16 i7-1260P (pow) 7.0 7.1 8.5 18.7 24.6 i7-1260P (bat) 10.2 13.0 17.6 40.4 59.9 jansky 6.4 6.5 6.5 6.7 17.4 lma 12.2 12.3 12.2 12.3 12.4 M2 6.9 7.2 7.6 9.4 19.3
/usr/bin/time hackcode1 tstop=2 > /dev/null
On a Sun 3/50 (our development machine) this took about 5 seconds per step. Now, nearly 35 years later, my laptop runs this about 50,000 times faster. Looking in more detail at the original NEMO manual:
cpu/steps sun 3/60: 20 MHz 2.28 i5-1135G7: 4200 MHz 0.0000875Despite that the cpu was 210 times faster, the code ran 26,000 faster. A very impressive factor of 120 improvement in chip and possibly some compiler technology. The NEMO users guide has an appendix in which this benchmark is listed for a variety of computers since 1986.
% time ls 0.012u 0.068s 0:00.77 9.0% 0+0k 8376+0io 0pf+0w 2.324u 1.080s 0:09.25 36.7% 0+0k 1049384+2097160io 2pf+0w 1.876u 0.788s 0:03.63 73.0% 0+0k 0+2097160io 0pf+0wOn linux the command
echo 1 > > /proc/sys/vm/drop_cacheswill clear the disk cache in memory, so your program will be forced to read from disk, with all possible interference from other programs
In NEMO
another useful addition to the benchmark is that the output can be turned
off easily, by using out=., viz.
% sudo $NEMO/src/scripts/clearcache % time ccdsmooth n1 . dir=x 0.852u 1.068s 0:12.41 15.3% 0+0k 2098312+0io 6pf+0w 0.812u 0.400s 0:01.21 100.0% 0+0k 0+0io 0pf+0w 0.820u 0.380s 0:01.20 100.0% 0+0k 0+0io 0pf+0wwhere the last two instances were just re-running the same command, but now clearly showing the effect of reading the file from memory instead of disk. By repeating this whole series a few times, an lower bound to the wall clock time is more likely to properly account for the I/O overhead time.
Rule of thumb: always run a benchmark a few times to see if a hot CPU slows down the benchmark. If I/O is cached. Other tasks are interfering.
CGS(1NEMO) scfm(1NEMO)
Other industry benchmarks:
Geekbench 5 (very wide variety of compute workloads - baseline is i3-8100) Linpack (focus on floating point operations - Gflops) SPEC CPU 2017 ($$$) benchmark -
/usr/bin/time tabgen tab3 100000000 3 /usr/bin/time tabbench2 . mode=-1 this bench will need to be repeated for mode=0,1,2,3 to estimate the different components as they are added to the workflow. The tabgen(1NEMO) is dominated by drawing random numbers and writing them using printf(3) , which is slow. 80s writing, using tabgen 2s reading in tabbench2 22s parsing in numbers [np.loadtxt takes 748 sec!!!] 6s using fie(3NEMO) to compute radii 1s using np.sqrt(), and presumably C’s sqrt() as well
nbody=10000000 /usr/bin/time mkplummer . $nbody 2.80user 0.45system 0:03.26elapsed 99%CPU echo mkplummer . $nbody > run.txt echo mkplummer . $nbody >> run.txt /usr/bin/time parallel --jobs 1 < run.txt 5.89user 0.83system 0:06.71elapsed 100%CPU /usr/bin/time parallel --jobs 2 < run.txt 6.00user 0.79system 0:03.44elapsed 197%CPU
which follows Amdahl’s law close to 100%!
https://browser.geekbench.com/processor-benchmarks
$NEMO/src/scripts/nemo.bench Script uses by make bench/bench5/bench10 $NEMO/data standard repository area for (small) data files. Benchfile A Makefile that can orchestrate series of benchmarks /tmp/nemobench.log The nemobench keeps logfile
12-may-97 created PJT 26-nov-03 finally added some data PJT 17-feb-04 added bench0 comparison PJT 31-mar-05 added some cygwin numbers, fixed input PJT 6-may-11 added i7 and SHMEM/HDD comparison PJT 27-sep-13 added caveats PJT 6-jan-2018 updated for V4, more balanced benchmarks PJT 27-dec-2019 nemo.bench; updated with potcode and orbint PJT 26-jul-2020 added timings / added geekbench5 PJT 31-aug-2023 added bench8 PJT