
The main purpose of this program is to measure the scheduling latencies,
under high system load, of programs which must do things in realtime.

Currently there are 5 operating-system stress classes:

- heavy graphics output, using x11perf to simulate large
  BitBlts

- heavy access to the /proc filesystem using "top" with an update frequency
  of 0.01 sec

- disk write stress (write a large file to disk)
- disk copy  stress (copy a large file to another)
- disk read  stress (read a large file from disk)


I wrote the benchmark to test the realtime audio (PCM I/O) capabilities
of Linux.
In the future I will extend the program to test other subsystems,
such as MIDI I/O, serial I/O and usleep()-based timing.

The playing is done strictly from RAM.
The player thread gets RT priority through sched_setscheduler()
and is scheduled with the FIFO policy at maximum priority.
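For reference, the priority setup can be sketched like this (a minimal sketch; the function name set_realtime_priority is mine, not latencytest's):

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sched.h>

/* Put the calling process into SCHED_FIFO at the maximum priority.
   Returns 0 on success, -1 on failure (typically EPERM without root). */
int set_realtime_priority(void)
{
    struct sched_param param;

    memset(&param, 0, sizeof(param));
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);

    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}
```

Note that sched_setscheduler() with SCHED_FIFO requires root privileges; as a non-root user the call fails with EPERM.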

The player sits in a loop which basically does the following:

while(1)
{
  time1=my_gettime();
  waste CPU for 80% of the duration of one audio fragment;
  time2=my_gettime();
  write(audio_fd,playbuffer,fragmentsize);
  time3=my_gettime();

}

time3-time1 = duration of one loop (CPU wasting + audio output)

If this time gets bigger than the audio buffer (n fragments),
you will hear an audio dropout.

time2-time1 = duration of the CPU-wasting loop; it should stay constant
at 80% of the fragment time, but can vary if there is some
device on the bus (or some kernel I/O routine) which steals cycles from the CPU.

On some graphics cards, heavy graphics output blocks the bus for too much
time; the process therefore gets blocked too long, and
the deadline (in this case the audio buffer time) is missed.
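The loop above can be fleshed out like this (a sketch, not latencytest's actual code: my_gettime() is assumed to be gettimeofday()-based, the helper names waste_cpu and audio_loop are illustrative, and /dev/null is used as a fallback when no OSS /dev/dsp device is present):

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* Generic timer: seconds as a double, microsecond resolution. */
double my_gettime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Busy-wait ("waste CPU") for the given number of seconds. */
void waste_cpu(double seconds)
{
    double end = my_gettime() + seconds;
    while (my_gettime() < end)
        ; /* spin */
}

/* A simplified version of the measurement loop. */
void audio_loop(int iterations)
{
    /* Hypothetical format: 256-byte fragments of 16-bit stereo at
       44.1 kHz, i.e. a fragment time of 256/(44100*4) ~ 1.45 ms. */
    double fragment_time = 256.0 / (44100.0 * 4.0);
    double cpu_load = 0.80;        /* 80%, as in latencytest.c */
    char playbuffer[256] = {0};

    /* Fall back to /dev/null if no sound device is available. */
    int audio_fd = open("/dev/dsp", O_WRONLY);
    if (audio_fd < 0)
        audio_fd = open("/dev/null", O_WRONLY);

    for (int i = 0; i < iterations; i++) {
        double time1 = my_gettime();
        waste_cpu(cpu_load * fragment_time);
        double time2 = my_gettime();
        if (write(audio_fd, playbuffer, sizeof(playbuffer)) < 0)
            perror("write");
        double time3 = my_gettime();

        printf("cpu: %.3f ms  total: %.3f ms\n",
               (time2 - time1) * 1e3, (time3 - time1) * 1e3);
    }
    close(audio_fd);
}
```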

So high scheduling latencies have two causes:

- hardware related (the operating system can't cure these problems,
  so the only solution is to buy non-problematic hardware)

  examples: DMA/PCI transfers which move around large amounts
  of data and block the CPU's access to the bus,
  like some graphics cards which block the bus during large
  BitBlts, or some older mainboards (Socket 7) which freeze the bus
  during harddisk DMA transfers.

- software related:

  the operating system holds some global spinlocks during disk I/O,
  so that other syscalls cannot access needed data structures
  for a long time, which causes missed scheduling deadlines.


My machine is a PII 400, mainboard ASUS P2B (BX), 256MB RAM,
harddisk IBM Deskstar 16GB EIDE UDMA.

The tests clearly show that modern hardware (PII + UDMA harddrives)
is capable of low scheduling latencies during high system load.

Unfortunately Linux has some problems and sometimes
blocks other system calls for too long (up to 100-130ms), so
stable low-latency (< 10-20ms) realtime audio becomes almost
impossible.

*UPDATE*: Ingo Molnar found the problems (certain parts of the kernel
do not reschedule for several msecs, leading to long scheduling latencies
for user processes).

With his patch, I get excellent results (about 2ms-3ms audio latency).

Get the latest news from my audio page:

http://www.gardena.net/benno/linux/audio



HOW TO USE THE BENCHMARK:


IMPORTANT:

If you have EIDE drives, tune *ALL* your disks before running the
benchmark, because an untuned EIDE drive
can produce very bad benchmark results.

You can tune a disk with the included script "tunedisk",
which calls hdparm and turns on all disk-tuning options.

For example, if you have 2 drives, hda and hdc,
run

./tunedisk /dev/hda
./tunedisk /dev/hdc


Turn off the screensaver too, because it could distort
the test results.



NOTE: to compile the program you need the gd GIF library,
which is installed by default on Red Hat.


To compile the program on non-Intel architectures (like Alpha, SPARC, PPC),
comment out
TIMER_OPTIONS=-DUSE_PENTIUM_TIMER
and uncomment
TIMER_OPTIONS=-DUSE_GENERIC_TIMER

( USE_GENERIC_TIMER uses the gettimeofday() syscall)

If you know that your non-Intel architecture has a cycle counter,
please let me know, so I can add that timer method to the benchmark.


Compile the benchmark with

make

The executable is named "latencytest".

To run the tests you should use "do_tests".

This is a script which calls latencytest
and starts several system-stressing scripts in the background.

stress_x11       : graphics card stress using x11perf
stress_proc      : /proc file system stress using top
stress_diskwrite : disk write stress
stress_diskcopy  : disk copy stress
stress_diskread  : disk read stress


example:

./do_tests none 3 256 0 350000000

"none" means produce a static sound
(you can pass a .WAV file as argument instead)

3 = number of audio fragments to use
256 = fragmentsize in bytes
0 = background syncing frequency (a thread which calls sync() every n msec)

(0 disables the sync thread; I left this option in only for experimenting.
Syncing 1-2 times per second may give you slightly better results, but it's
not the definitive solution, so it's better to leave it disabled.)
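Such a background sync thread can be sketched like this (sync_thread and sync_args are illustrative names, not latencytest's code; the benchmark runs it forever, the iteration limit here just makes the sketch testable):

```c
#include <pthread.h>
#include <unistd.h>

/* Arguments for the background sync thread: interval in msec,
   and how many times to sync (0 = forever, as in the benchmark). */
struct sync_args {
    int interval_ms;
    int iterations;
};

/* Periodically flush dirty buffers to disk with sync(). */
void *sync_thread(void *p)
{
    struct sync_args *a = p;
    for (int i = 0; a->iterations == 0 || i < a->iterations; i++) {
        sync();
        usleep(a->interval_ms * 1000);
    }
    return NULL;
}
```

It would be started with pthread_create(), e.g. `pthread_create(&tid, NULL, sync_thread, &args)` for a 500ms interval with `args = {500, 0}`.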


350000000 = file size for the disk stress tests

Use a file size of at least 1.5 times your RAM size
to avoid caching.

Note that the diskcopy test needs at least 2 times the filesize you gave
as argument, so make sure you have enough space left on disk.
(in the above example you need at least 700MB free on your disk)

There is another script, "runalltests", which runs several tests in
sequence and takes the file size as parameter.

example:

./runalltests 350000000

The script performs the tests with
3x256, 3x512 and 3x1024 (num. fragments x fragment size).


do_tests (or runalltests) writes the results to the
subdirectory "html".

The file is named, for example, "3x256.html", and contains
the tests performed for an audio buffer of 3 fragments of 256 bytes.


The HTML page shows 5 diagrams, one for each system stress type.

The green line is the time spent in the CPU-wasting loop,
which is calibrated at 80% of the fragment time.
If this time goes up, the cause can be bus contention caused by
other devices (like the gfx card, harddisk etc.).

NOTE: if you want to change the CPU load, change the value of the
variable cpu_load (default 0.80 = 80%) in latencytest.c and rtc_latencytest.c.

My tests show that this time remains pretty constant on my machine,
even under high disk I/O.

But if I turn off DMA with hdparm, things get extremely bad,
and even the RT-scheduled CPU loop is slowed down *EXTREMELY*
by the non-DMA EIDE transfers.


The white line is the actual measured scheduling latency of the entire
loop; if it goes beyond the red line (audio buffer length in ms),
you will hear a sound dropout.

The yellow line is the fragment time; the white line should not exceed
it by much.

"cpu latency" is the nominal time of the CPU loop
"fragment latency" is the fragent latency.
(the value enclosed in the braces is the max measured cpu latency,
this value should ideally differ only a fraction of ms from the nominal value,
you can notice this on the thickness of the green line too)

"max latency" is the max measured scheduling latency of the entire loop
( CPU + audio play)


"overruns" are the number of audio buffer overruns, which occurs everytime
the white line goes beyond the red line.

the white "between +/-1ms" tell how much % of time
the total scheduling latency (CPU + audio play) stays in the range fragment latency +/-1ms
idem for "between +/-2ms"

the green "between +/-0.2ms" tell how much % of time the
CPU loop latency stays in the range ( CPU loop latency +/-0.2ms)
idem for "between +/-0.1ms"


The CPU loop latency is pretty constant on my machine during
DMA disk I/O, staying in the +/-0.2ms range 99.98% of the time
with peaks of +0.5ms. This makes me think that we could
use <5ms audio buffers for realtime audio software and have
no problems during heavy disk I/O.


NEW FEATURES in latencytest 0.42:

Realtime clock (RTC) benchmarking:

added rtc_latencytest, which benchmarks
the realtime clock (RTC) device using async notification
(by installing a SIGIO handler).

To perform this test you must first apply the included rtc-async.patch,
because the standard RTC driver doesn't support async signal
notification.

rtc_latencytest is similar to latencytest but takes only one
argument, the RTC frequency (e.g. 256 for a 3.9ms interval).

"do_rtc_tests" is a script similar to "do_tests"
and takes 2 arguments: 
the RTC frequency and the filesize for the disk stress tests

For example

./do_rtc_tests 2048 350000000

sets the RTC frequency to 2048Hz (0.48ms period) and uses
a file size of 350MB for the disk I/O tests.

The benchmark measures the time differences between two calls to the
SIGIO signal handler (triggered by the RTC device).
The red deadline line is set to an arbitrary 3*RTC-period.

Again, with Ingo's patches the results are excellent (+/-500usec jitter).




NOTE FOR SMP SYSTEMS:

the CPU-wasting loop is not multithreaded yet; therefore it's
not a realistic simulation of a highly stressed SMP system.


For updated info and latest news check out:
http://www.gardena.net/benno/linux/audio


Comments, suggestions and feedback welcome.

regards,
Benno.

sbenno@gardena.net

