Author: Mugabi Siro

Category: General Linux

Summary:

This entry presents a few examples that help observe CPU affinity on a Linux system. A Linux 3.x kernel was used.

Tags: linux ftrace debugging/tracing realtime

Update

The approach taken for illustrating CPU affinity by the original demo program, i.e. ne_ftrace_read_char.c, was naive: real-world programs that benefit from CPU affinity are typically realtime (and multithreaded) with high periodic sampling rates. For this reason, the ne_ftrace_rt_proc_affinity.c program is included. It presents a template of a typical firm realtime application.

The tests presented in the CPU affinity for processes section are due for revision. For now, replace the tests for ne_ftrace_read_char.c with ne_ftrace_rt_proc_affinity.c. With the default settings, the latter program runs for 10 seconds before exiting, leaving the Ftrace results in ./ftrace_output.txt. Also note that a few other factors may have to be taken into consideration, e.g. the kernel's CONFIG_PREEMPT_* settings. S.M.
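
For orientation, a heavily simplified sketch of such a periodic, CPU-pinned firm realtime loop is shown below. It is not the actual ne_ftrace_rt_proc_affinity.c source (which additionally drives Ftrace and produces ./ftrace_output.txt); the period, priority and CPU number are arbitrary illustrative choices. On older glibc, link with -lrt for the clock_* functions.

/* Simplified sketch of a periodic firm-realtime loop pinned to one CPU.
 * Assumed structure only; the real ne_ftrace_rt_proc_affinity.c differs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PERIOD_NS   1000000L    /* 1 ms sampling period (assumption) */
#define RUNTIME_SEC 10          /* run for 10 seconds, then exit */

int main(void)
{
    struct sched_param sp = { .sched_priority = 80 };
    struct timespec next;
    cpu_set_t set;
    long i, iterations = RUNTIME_SEC * (1000000000L / PERIOD_NS);

    /* Hard CPU affinity: restrict execution to CPU 3. */
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    /* Realtime scheduling class (requires root or RT privileges). */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        exit(EXIT_FAILURE);
    }

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (i = 0; i < iterations; i++) {
        /* ... periodic sampling / control work would go here ... */

        /* Advance the absolute wakeup time by one period. */
        next.tv_nsec += PERIOD_NS;
        while (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}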

Table Of Contents

  • CPU affinity for processes
      • Soft CPU affinity
      • Hard CPU affinity
  • IRQ to CPU affinity

CPU affinity for processes

CPU affinity for processes is a scheduler property that restricts process execution to a given CPU, or set of CPUs, in the system. For performance reasons, the Linux SMP balancing and scheduling algorithms will (by default) strive to keep a given process running on a particular CPU, but may reschedule it on another if, say, the original CPU is busy; this is soft CPU affinity. However, it is still possible to enforce strict process execution on a given CPU or set of CPUs, i.e. hard CPU affinity.

Performance-related reasons why hard CPU affinity is desirable include the fact that when a process gets rescheduled on a new CPU, that CPU's cache is initially cold with respect to the process' data, so cache lines must first be (re)fetched. Significant performance gains can therefore be achieved with hard CPU affinity in firm/soft realtime and multi-threaded applications that share the same data, as a result of the reduction in cache misses. However, note that hard realtime applications such as motion control (e.g. motor/actuator control, robot mechanics control, etc.) have stringent deadlines and, therefore, the need to establish worst-case latencies overrides any statistical performance benefits offered by caches. In other words, while caches improve overall performance in soft/firm realtime applications, they introduce an element of non-determinism in the domain of hard realtime execution.

The examples presented in this entry rely on Ftrace to illustrate CPU affinity on a Linux system. See Function Tracing with Ftrace for an introduction to function call tracing with the Ftrace framework. To illustrate process-to-CPU affinity, the ne_ftrace_read_char.c program is presented. It takes a basic approach: a read(2) blocking on stdin puts the process to sleep, effectively scheduling it out. Only after user keyboard interaction will this foreground process get a chance to be rescheduled and proceed. Involving Ftrace in this procedure allows one to trace on which particular CPU these two events (i.e. process sleep and wakeup) occurred.

Compile the program with:

$ gcc -O2 -Wall -Wno-unused-result ne_ftrace_read_char.c

This program performs some Ftrace initialization before doing a blocking read(2) on stdin. Pressing ENTER (or Carriage Return) will then cause the program to run till completion.
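
In outline, the program resembles the following minimal sketch. This is not the actual ne_ftrace_read_char.c source (which performs additional setup, e.g. reporting kernel.ftrace_enabled); the paths assume debugfs is mounted at /sys/kernel/debug, and the program must be run as root.

/* Minimal sketch: write a trace marker, block on stdin, write a second
 * marker, then save the trace buffer to ./ftrace_output.txt. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define TRACING "/sys/kernel/debug/tracing/"

static void trace_write(const char *file, const char *str)
{
    char path[256];
    int fd;

    snprintf(path, sizeof(path), TRACING "%s", file);
    fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    if (write(fd, str, strlen(str)) < 0)
        perror("write");
    close(fd);
}

int main(void)
{
    char c;

    /* Ftrace initialization: select the function_graph tracer and
     * enable tracing. */
    trace_write("current_tracer", "function_graph");
    trace_write("tracing_on", "1");

    /* First trace marker: records the CPU just prior to the blocking read. */
    trace_write("trace_marker", "Before key press");

    printf("Strike ENTER to continue...\n");
    read(STDIN_FILENO, &c, 1);          /* blocking read(2) on stdin */

    /* Second trace marker: records the CPU just after process wakeup. */
    trace_write("trace_marker", "After key press");
    trace_write("tracing_on", "0");

    /* Save the trace buffer for later inspection. */
    system("cat " TRACING "trace > ./ftrace_output.txt");
    return 0;
}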

Now, this program could be run from a local terminal (/dev/ttyN or xterm), a serial port terminal or even a remote (e.g. ssh) terminal. The traces presented in this entry were performed from a local terminal. To reduce the number of recorded events to a bare minimum in the trace output, function filtering was set to:

$ sudo sh -c "echo i8042_interrupt > /sys/kernel/debug/tracing/set_ftrace_filter"

since the machine had an i8042 keyboard controller. This command instructs Ftrace to record only the occurrence of this particular keyboard interrupt handler in addition to the trace markers written by the ne_ftrace_read_char.c program. If your system has a different keyboard controller, you may specify the name of its interrupt handler. Alternatively, since the QEMU PC1 machine emulator emulates the i8042 keyboard controller, it can be used here instead. Otherwise, simply specify, say, do_IRQ.
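
If unsure of the exact symbol name on your machine, the list of functions that Ftrace can filter on may be searched, e.g.:

$ sudo grep i8042 /sys/kernel/debug/tracing/available_filter_functions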

The Ftrace trace markers are used here to determine on which particular CPU process execution occurred. The first trace marker is used to indicate on which logical CPU the process was executing just prior to the blocking read (i.e. process sleep). The second trace marker reveals the logical CPU on which process execution occurred just after user keyboard interaction (i.e. process wakeup).

If, on the other hand, the tests are performed from a serial or remote terminal, then note that stdin will be the terminal and not the local keyboard. So, if the keyboard interrupt function filtering command is specified, then no keyboard interrupt event will get recorded. But this is just fine, since what is of most interest is the recording of the trace markers. Also note that if do_IRQ was specified instead, then the occurrence of do_IRQ events will still appear in the trace even with a serial or remote terminal.

When done with the examples presented in this entry, do not forget to reset the Ftrace function filter:

$ sudo sh -c "echo > /sys/kernel/debug/tracing/set_ftrace_filter"

Soft CPU affinity

The objective in this section is to run the compiled program several times on a "loaded" system, each time inspecting the trace output in ./ftrace_output.txt for changes in CPU ID upon process rescheduling.

For example, the following basic stress factor was introduced on one xterm:

$ find /

and program execution performed on another:

$ sudo ./a.out 
kernel.ftrace_enabled = 1
Strike ENTER to continue...

    <hit-ENTER>

$ cat ftrace_output.txt

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  /* Before key press */
 ------------------------------------------
 0)  a.out-14778   =>   Xorg-1623   
 ------------------------------------------

 0) + 27.246 us   |  i8042_interrupt();
 ------------------------------------------
 0)   Xorg-1623    =>    <idle>-0   
 ------------------------------------------

 0) + 27.427 us   |  i8042_interrupt();
 ------------------------------------------
 0)    <idle>-0    =>  a.out-14778  
 ------------------------------------------

 0)               |  /* After key press */

In this first execution instance, the process was rescheduled on the same CPU 0 as indicated by the trace markers, /* Before key press */ and /* After key press */. The keyboard interrupt handler also happened to get serviced on CPU 0. But, in later iterations, events occasionally took a sudden twist with everything occurring on different logical CPUs, for instance:

$ sudo ./a.out 
kernel.ftrace_enabled = 1
Strike ENTER to continue...

    <hit-ENTER>

$ cat ftrace_output.txt

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)               |  /* Before key press */
 0) + 37.523 us   |  i8042_interrupt();
 ------------------------------------------
 0)    <idle>-0    =>  compiz-2526  
 ------------------------------------------

 0) + 38.286 us   |  i8042_interrupt();
 2)               |  /* After key press */

and,

$ sudo ./a.out 
kernel.ftrace_enabled = 1
Strike ENTER to continue...

    <hit-ENTER>

$ cat ftrace_output.txt

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 2)               |  /* Before key press */
 0) + 21.213 us   |  i8042_interrupt();
 ------------------------------------------
 0)  amaroka-3364  =>    <idle>-0   
 ------------------------------------------

 0) + 24.086 us   |  i8042_interrupt();
 3)               |  /* After key press */

Hard CPU affinity

Kernel interfaces

Linux features mechanisms to restrict process execution to a CPU or a set of CPUs. These include:

  • The CPU affinity system calls.

    sched_setaffinity(2) sets the CPU affinity of the process, while sched_getaffinity(2) retrieves the CPU affinity mask of the process. Using the taskset(1) command to launch an application and set its CPU affinity mask is the most straightforward and non-intrusive way of effecting hard CPU affinity for a process. This command internally invokes the affinity system calls. Otherwise, these calls could be invoked directly in the program's source code (see the sketch after this list).

  • The cpuset framework.

    This interface exported by the kernel allows for more sophisticated control over how the CPUs and memory are allocated to the process. See Documentation/cgroups/cpusets.txt.
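
For illustration, the following is a minimal sketch of invoking the affinity system calls directly from C. It simply pins the calling process to CPU 0 (equivalent to an affinity mask of 0x1); it is not part of the demo programs.

/* Pin the calling process to CPU 0 via sched_setaffinity(2). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);           /* equivalent to an affinity mask of 0x1 */

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    /* Read back and display the current affinity setting. */
    if (sched_getaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_getaffinity");
        exit(EXIT_FAILURE);
    }
    printf("CPU 0 in affinity mask: %s\n",
           CPU_ISSET(0, &set) ? "yes" : "no");

    return 0;
}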

With the affinity system calls, a bitmask is used to represent the CPU affinity, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. The masks are typically given in hexadecimal. For instance, specifying:

$ sudo taskset 0x1 ./a.out 
kernel.ftrace_enabled = 1
main:137:: Strike ENTER to continue...

always resulted in Ftrace output indicating that the trace markers were recorded on CPU 0, for instance:

$ cat ftrace_output.txt

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  /* Before key press */
 ------------------------------------------
 0)  a.out-14883   =>    <idle>-0   
 ------------------------------------------

 0) + 24.786 us   |  i8042_interrupt();
 ------------------------------------------
 0)    <idle>-0    =>  a.out-14883  
 ------------------------------------------

 0)               |  /* After key press */

and

$ sudo taskset 0x8 ./a.out 
kernel.ftrace_enabled = 1
main:137:: Strike ENTER to continue...

$ cat ftrace_output.txt

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)               |  /* Before key press */
 0) + 21.060 us   |  i8042_interrupt();
 0) + 21.131 us   |  i8042_interrupt();
 3)               |  /* After key press */

Of course, conclusions drawn from the few test iterations that were performed and the results presented here are based on the assumption that the scheduler will always honor the CPU affinity mask settings. A rigorous assessment and verification of the Linux SMP scheduling algorithms is beyond the scope of this entry.

Finally, to view the CPU number/ID on which a process is executing, a variant of the following command could be used:

$ ps -eo pid,comm,psr
     PID COMMAND         PSR
       1 init              2
       2 kthreadd          3
       3 ksoftirqd/0       0
      [...]              [...]
      35 kswapd0           0
      36 ksmd              0
      37 khugepaged        0
      [...]              [...]
    1169 rsyslogd          2
    1170 dbus-daemon       0
    1175 bluetoothd        0
      [...]              [...]
    9166 kworker/1:1       1
    9169 kworker/0:0       0
    9196 ps                1

The isolcpus kernel parameter

The isolcpus kernel command line parameter can be used to specify one or more logical CPUs for isolation from the general SMP balancing and scheduling algorithms. Isolation will be effected at least for userspace processes; kernel threads may still get scheduled on the isolated CPUs. A process can then be moved on or off an isolated CPU via the affinity syscalls or cpuset.

This feature, used in combination with the CPU affinity syscalls, is of particular interest in the embedded and realtime Linux domain: one or more CPUs may be reserved for a time-critical application while the rest of the processes are confined to the remaining CPUs.

For example, passing isolcpus=3 to the kernel at boot:

$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.2.0-35-generic root=UUID=57[...]bdb ro quiet splash vt.handoff=7 isolcpus=3

will isolate CPU 3 (i.e. the fourth logical CPU).
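
A process can then be explicitly placed on the isolated CPU, for instance by launching it with taskset's CPU-list option (reusing the a.out demo from the earlier examples):

$ sudo taskset -c 3 ./a.out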

Consult the isolcpus entry in Documentation/kernel-parameters.txt for details of its usage format. See Hacking ISR-to-Process Control Paths with Ftrace for a case study of isolcpus used in combination with the CPU affinity syscalls when function tracing the keyboard interrupt return path for a read(2) blocking on stdin.

IRQ to CPU affinity

This feature of the kernel enables one to "hook" an IRQ to one or a set of CPUs. It can also exclude a CPU, or a set of CPUs, from handling an IRQ. Since Linux 2.4, the /proc/irq directory has been made available to enable setting IRQ to CPU affinity.

$ ls /proc/irq/
0  10  12  14  16  2   3  40  42  44  5  7  9
1  11  13  15  18  23  4  41  43  45  6  8  default_smp_affinity

Each of these numbered directories corresponds to an active IRQ:

$ less /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       
  0:        320          0          0          0   IO-APIC-edge      timer
  1:      19259          0          0     118916   IO-APIC-edge      i8042
  8:          1          0          0          0   IO-APIC-edge      rtc0
  9:       2433          0          0          0   IO-APIC-fasteoi   acpi
 12:     564213          0          0          0   IO-APIC-edge      i8042
 16:        152          0       1781          0   IO-APIC-fasteoi   ehci_hcd:usb1
 23:         39          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
 40:     261631          0          0          0   PCI-MSI-edge      ahci
 41:      64525          0          0          0   PCI-MSI-edge      xhci_hcd
 42:    4473534          0          0          0   PCI-MSI-edge      i915
 43:         14          0          0          0   PCI-MSI-edge      mei
 44:          2          0          0          0   PCI-MSI-edge      eth0
 45:        386          0         91          0   PCI-MSI-edge      snd_hda_intel
 [...]     [...]     [...]      [...]      [...]      [...]            [...]

The default_smp_affinity mask applies to all non-active IRQs, i.e. the IRQs that have not yet been allocated/activated and hence lack a directory under /proc/irq/.
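
This default mask can be inspected (and modified) just like the per-IRQ files; the value displayed depends on the number of CPUs in the system:

$ sudo cat /proc/irq/default_smp_affinity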

Now, under each /proc/irq/IRQ/ directory e.g.

$ ls /proc/irq/1/
affinity_hint  i8042  node  smp_affinity  smp_affinity_list  spurious

is the smp_affinity file, a bitmask that can be used to specify which logical CPU(s) can handle the IRQ, e.g.:

$ sudo cat /proc/irq/1/smp_affinity
0f

$ sudo sh -c "echo 8 > /proc/irq/1/smp_affinity"

$ sudo cat /proc/irq/1/smp_affinity
08

The echo 8 > /proc/irq/1/smp_affinity command sets keyboard IRQ handling to be done exclusively by CPU 3 (the fourth logical CPU). Note that the written value is interpreted as hexadecimal (i.e. binary bitmask 1000). IRQ to CPU affinity can also be set for more than one CPU: echoing, say, 5 instead of 8 means that only the first and third logical CPUs can handle the IRQ. By default, every smp_affinity file initially contains the same mask.

The smp_affinity_list file allows specifying a CPU list (or range) instead of a bitmask, e.g.:

$ sudo cat /proc/irq/1/smp_affinity
08

$ sudo cat /proc/irq/1/smp_affinity_list
3

$ sudo cat /proc/irq/8/smp_affinity
0f

$ sudo cat /proc/irq/8/smp_affinity_list
0-3
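
The list file is also writable. For instance, restricting IRQ 1 to the first two logical CPUs should then be reflected in the corresponding bitmask file (exact output depends on the system):

$ sudo sh -c "echo 0-1 > /proc/irq/1/smp_affinity_list"

$ sudo cat /proc/irq/1/smp_affinity
03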

See Hacking ISR-to-Process Control Paths with Ftrace for a case study on using IRQ to CPU affinity in combination with isolcpus and the CPU affinity syscalls.

Also See

  • Function Tracing with Ftrace
  • Hacking ISR-to-Process Control Paths with Ftrace

Resources

  • Documentation/kernel-parameters.txt
  • Documentation/filesystems/proc.txt
  • http://www.linuxjournal.com/article/6799

Footnotes

1. See QEMU Intro [go back]