Each spectrum (called a row below) has nchan channels. Each of the nscan scans has 4 rows: the CalOn and CalOff (Hot and Cold, if you wish) for the Sig and the Ref (On and Off, if you wish). The following math is needed to construct a spectrum, thus the total number of operations scales as nscan*nchan*iter.
     Tsys     = Tc * <row1> / <row2 - row1>              i.e.  Tc * <cold> / <hot - cold>
     Ta       = Tsys * ( (row1+row2)/(row3+row4) - 1 )   i.e.  Tsys * (on/off - 1)
     Spectrum = <Ta>                                     (time averaging)
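The per-scan math above can be sketched as follows; this is a simplified illustration in plain Python (lists of floats), not the actual sdmath implementation, and the choice of which rows feed Tsys follows the formulas as written:

```python
def scan_spectrum(row1, row2, row3, row4, tc):
    # One scan: row1=CalOff (cold) and row2=CalOn (hot) of the Sig,
    # row3 and row4 the same pair for the Ref; tc is the cal temperature.
    mean = lambda r: sum(r) / len(r)
    # Tsys = Tc * <row1> / <row2 - row1>, where <> averages over channels
    tsys = tc * mean(row1) / (mean(row2) - mean(row1))
    # Ta = Tsys * ((row1+row2)/(row3+row4) - 1), per channel (on/off - 1)
    return [tsys * ((a + b) / (c + d) - 1.0)
            for a, b, c, d in zip(row1, row2, row3, row4)]

def average(spectra):
    # Spectrum = <Ta>: time-average the per-scan Ta spectra
    nchan = len(spectra[0])
    return [sum(s[i] for s in spectra) / len(spectra) for i in range(nchan)]
```

Looping this over nscan scans, iter times, gives the nscan*nchan*iter operation count quoted above.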
This block of memory, of size 4*nchan*nscan, can be visited iteratively to get a more reliable measure of the CPU usage. It can make a huge difference whether this block fits in one of the CPU caches.
The benchmark here is very basic: it only exercises the math on simulated data, without the need to pass through a fitsio-type library. The file I/O portion of this benchmark reads a large file to get an idea of the I/O overhead, even though many other tools exist to measure disk I/O, from hdparm(1) to iozone(1).
A slightly more realistic benchmark can be found in sdinfo(1NEMO), which reads actual sdfits(5NEMO) files and selects a set of operations.
The standard benchmark loops 10 times over 1000 scans of 100,000 channels (thus 1e9 row operations) in about 2-3 secs. We call this 1 GRop. The standard benchmark was extended to iter=100 to keep system time negligible; this amounts to 10 GRop. CPU times below are user-space seconds [nemobench5 scores listed where available]:
     Xeon E5620 @ 2.40GHz (fourier)    - 115.1  [218]
     AMD EPYC 7302 @ 3.0 GHz (lma)     -  34.1  [675]
     AMD Ryzen AI 9 HX PRO 370         -  23.4  [~1170]
     Ultra 7 155H @ 4.5 GHz (d76)      -  21.8  [~1200]
     M4 air                            -  15.4  [~2000]
To view the impact of OpenMP, here is the performance on an Ultra 7 155H, measured in wall clock seconds with iter=20, as a function of OMP_NUM_THREADS:

     OMP_NUM_THREADS=2 /usr/bin/time sdmath 1000 160000 iter=20

     nscan  nchan      1    2    4    8   12   16   20
     -----  ------   ---  ---  ---  ---  ---  ---  ---
      1000  160000   7.8  5.0  3.9  4.2  3.7  3.6  5.7
      2000   80000   7.8
      4000   40000   7.8
      8000   20000   7.8
     16000   10000   8.0  5.4  4.4  5.3  5.2  9.5  9.9
     -----  ------   ---  ---  ---  ---  ---  ---  ---
Somewhat surprisingly, barely a factor of 2 can be gained by going multi-core on a typical Intel laptop CPU. An AMD CPU fares much better: the Ryzen AI 9 HX PRO 370 keeps lowering the wall clock time all the way up to 12 cores.
To see the effect of the CPU cache, pick cases where nscan*nchan*iter is constant, but the size of the data block varies:

     nscan  nchan  iter    CPU        memory
     1e1    1e1    1e7     1.52sec
     1e1    1e2    1e6     1.42
     1e1    1e3    1e5     1.77
     1e1    1e4    1e4     1.82       3 MB
     1e1    1e5    1e3     2.33       30 MB
     1e1    1e6    1e2     2.52       300 MB
     1e1    1e7    1e1     3.23       3000 MB
     1e2    1e6    1e1     2.69       3000 MB
     1e2    1e5    1e2     2.31       300 MB
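The memory column is consistent with the 4*nscan*nchan data block stored as 8-byte values; the 8-byte (double) element size is an assumption here, not stated in the text. A quick check:

```python
def block_mb(nscan, nchan, bytes_per_value=8):
    # Size of the 4 * nscan * nchan data block in MB (1 MB = 1e6 bytes).
    # The 8-byte (double) element size is an assumption.
    return 4 * nscan * nchan * bytes_per_value / 1e6

# nscan=1e1, nchan=1e4 gives 3.2 MB, quoted as "3 MB" in the table above
```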
A read benchmark can be done with a large series of dummy Plummer models. The example file created here is about 5GB in size:

     mkplummer p1M 1000000 nmodel=100
     /usr/bin/time sdmath in=p1M
     /usr/bin/time sdmath in=p1M maxbuf=1000

The latter resets the buffer 5 times, where the last read is a partial one.
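The idea behind the I/O portion can be mimicked with a generic chunked-read timer; this is a sketch of the technique only (the function name and chunk size are arbitrary, and it shares no code with sdmath):

```python
import time

def read_bench(path, chunk=1 << 20):
    # Sequentially read a file in fixed-size chunks and return
    # (bytes read, throughput in MB/s).  A sketch of the I/O benchmark
    # idea, not the sdmath implementation.
    nbytes = 0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            nbytes += len(buf)
    dt = time.perf_counter() - t0
    return nbytes, nbytes / 1e6 / dt if dt > 0 else float("inf")
```

As with the maxbuf example above, a chunk size that does not divide the file size leaves the last read partial; the loop handles that by stopping on the first empty read.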
30-apr-2025 Created PJT