Table of Contents


bench - NEMO benchmarks in NemoStones


Nemo Benchmarks (a.k.a. NemoStones)

bench1 tests the classic (1986) Barnes and Hut code in C, without quadrupole corrections, for 10240 particles and 64 timesteps. bench2 tests a more modern Treecode, this one is the O(N) version from Dehnen (2002?). Again 64 timesteps. bench0 has been added to compare a tree code with a direct N-body code, also 64 timesteps, but now for fewer particles to prevent the benchmark to dominate the whole series. bench3 benchmarks the I/O and creation of a large (>2GB) snapshot. bench4 benchmarks an image operation.


Here are the commands to execute the benchmarks, now the default in "cd $NEMO;make bench"
bench0:    time directcode nbody=10240
bench1:    time hackcode1 nbody=10240
bench2:    time mkplummer p1 10240; time gyrfalcON p1 . kmax=6 tstop=2 eps=0.05
bench3:    time mkspiral s000 1000000 nmodel=40
bench4:    ccdmath "" - ’ranu(0,1)’ size=128 | ccdpot - . help=c

and the data :

bench0 (directcode)
       I5/7200@2.5       47.7 
       I7/3630QM@3.6    174.2  (nemo2: 
       I7/3770@3.5    152.6
       P4/3.0            342.7
       G5/2.0            472.8
       AMD-opt64/2.0     300.3
       AMD-ath64/2.0     888.7
       P4/2.8 w/cygwin   365.9
       x86_64/3.0        516.4
       x86_64/3.2        480.6
       sparcv9+vis/0.36 3048.2
bench1 (hackcode1)
       I5/2.5            2.2
       I7/3.6        14.0  (nemo2: 1.2 - 3.6)  5.1
       I7/3.5         4.4
       P4/3.0           13.4
       G5/2.0           13.1
       AMD-opt64/2.0     9.36
       AMD-ath64/2.0    17.45
       P4/2.8 w/cygwin  15.1
       x86_64/3.0       11.7
       x86_64/3.2       10.7
       sparcv9+vis/0.36 81.5
bench2 (gyrfalcON)
       I5/2.5            2.5
       I7/3.6         8.1  (nemo2)  2.948 3.19 5.09
       I7/3.6        31.2  (?)
       P4/3.0           10.8
       G5/2.0           21.0
       AMD-opt64/2.0     8.1
       AMD-ath64/2.0     8.9
       P4/2.8 w/cygwin  45.5
       x86_64/3.2        8.6
       sparcv9+vis/0.36 85.1  
bench3 (mkspiral)
       I5/2.5    5.4u   1.6s
       I7/3.6    23.657u 3.856s 0:28.04 98.0%  (nemo2)
            10.760u 0.832s 0:20.33 57.0%
       I7/3.5    7.112u 1.520 0:09.37 98.5%
       P4/3.0    22.890u  5.980s 1:45.63 27.3%
       G5/2.0    28.400u 24.660s 1:05.41 81.1% 
       AMD-opt64/2.0    18.540u 10.921s 0:56.93 51.7% 
       AMD-ath64/2.0    29.311u 10.353s 0:59.88 66.2% (SATA)
       P4/2.8    25.541u 8.081s 0:59.98 56.0% (S/ATA)
       P4/2.8 w/cygwin    276.56u 26.35s 6:34.75 76.7% (using mkplummer V2.8)
       x86_64/3.0       21.651u 8.897s 0:48.05 63.5%    0+0k 0+0io 0pf+0w
       x86_64/3.2       21.950u 9.997s 0:39.37 81.1%  (SATA)
       i7/2.93          7.892u 3.170s 0:12.92 85.6% (HDD)
       i7/2.93          7.662u 1.467s 0:09.13 99.8% (SHMEM)
bench4 (ccdpot)
       I5/2.5   11.9
       I7/3.6    23.657u 3.856s 0:28.04 98.0%  (nemo2)


Defining and running a benchmark can be very tricky stuff. It might be important to separate disk I/O from CPU usage. The unix time(1) command can be a help. The output from bash::time is a bit different form csh::time, and yet different from /usr/bin/time. Unless you find a special one, we prefer the csh::time, since the output clearly separates user, system and wall clock time, and also reports the I/O, viz.
   % time ls 
   0.012u 0.068s 0:00.77 9.0%    0+0k 8376+0io 0pf+0w
   2.324u 1.080s 0:09.25 36.7%    0+0k 1049384+2097160io 2pf+0w
   1.876u 0.788s 0:03.63 73.0%    0+0k 0+2097160io 0pf+0w
On linux the command
   echo 1 > > /proc/sys/vm/drop_caches
will clear the disk cache in memory, so your program will be forced to read from disk, with all possible interference from other programs

In NEMO another useful addition to the benchmark is that the output can be turned off easily, by using out=., viz.

    % sudo $NEMO/src/scripts/clearcache
    % time ccdsmooth n1 . dir=x
    0.852u 1.068s 0:12.41 15.3%    0+0k 2098312+0io 6pf+0w
    0.812u 0.400s 0:01.21 100.0%    0+0k 0+0io 0pf+0w
    0.820u 0.380s 0:01.20 100.0%    0+0k 0+0io 0pf+0w
where the last two instances were just re-running the same command, but now clearly showing the effect of reading the file from memory instead of disk. By repeating this whole series a few times, an lower bound to the wall clock time is more likely to properly account for the I/O overhead time.


A few other manual pages in NEMO also maintain their own list how its program compares under different compilers/options/cpu options:

See Also

gyrfalcON(1NEMO) , data(5NEMO) , mkspiral(1NEMO) , mkplummer(1NEMO) , hackcode1(1NEMO) , nbody1(1NEMO) , scfm(1NEMO) , CGS(1NEMO) , triple(1NEMO) , accudate(lNEMO)


Peter Teuben


~/data       standard repository area for data files.

Update History

12-may-97    created      PJT
26-nov-03    finally added some data        PJT
17-feb-04    added bench0 comparison      PJT
31-mar-05    added some cygwin numbers, fixed input    PJT
6-may-11    added i7 and SHMEM/HDD comparison   PJT
27-sep-13    added caveats    PJT
6-jan-2018    updated for V4, more balanced benchmarks    PJT
27-dec-2019    nemo.bench; updated with potcode and orbint; now 10 tasks    PJT

Table of Contents