Author: Mugabi Siro

Category: Device Drivers

Summary:

A device-driver-oriented tutorial on using Ftrace to identify the kernel control path taken from a syscall down to a device driver operation/method. A write(2) to the /dev/tty1 console, bound to the Linux FB stack, is explored as a case study.

Tags: linux ftrace debugging/tracing

Preliminaries

An introduction to the function call tracing capabilities of the Ftrace framework can be found here.

The traces in this entry were run on an (aging) i3-2350M/4GB laptop (four logical CPUs) with a Linux 3.2.0-35-generic Ubuntu 12.04 kernel. Note that, until the Function filtering section below, the traces were performed from an xterm (X Windows), with program output appearing on the /dev/tty1 console.

Consider ne_ftrace_console_write.c, a program that writes a string to /dev/tty1. The gist of the program is:

write(tr_on_fd, "1", 1);    /* "echo 1 > ${debugfs}/tracing/tracing_on" */

str = "Before console write\n";
write(marker_fd, str, strlen(str)); /* "marker_fd" is "${debugfs}/tracing/trace_marker" file desc */

str = "Dunia, vipi?\n";
write(tty_fd, str, strlen(str)); /* "tty_fd" is '/dev/tty1' file desc */

str = "After console write\n";
write(marker_fd, str, strlen(str));

write(tr_on_fd, "0", 1); /* "echo 0 > ${debugfs}/tracing/tracing_on" */
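For reference, the control files referenced in the comments above would be opened along the following lines. This is a minimal sketch and not the actual ne_ftrace_console_write.c: it assumes debugfs is mounted at /sys/kernel/debug and that function_graph has already been selected as the current tracer, and it omits the step that produces ./ftrace_output.txt (presumably a copy of ${debugfs}/tracing/trace).

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TRACING "/sys/kernel/debug/tracing"   /* assumed debugfs mount point */

static int xopen(const char *path)            /* open for writing, or die */
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    return fd;
}

int main(void)
{
    int tr_on_fd  = xopen(TRACING "/tracing_on");    /* enables/disables tracing */
    int marker_fd = xopen(TRACING "/trace_marker");  /* user-space annotations   */
    int tty_fd    = xopen("/dev/tty1");              /* the console under test   */
    const char *str;

    write(tr_on_fd, "1", 1);                         /* start tracing            */

    str = "Before console write\n";
    write(marker_fd, str, strlen(str));

    str = "Dunia, vipi?\n";
    write(tty_fd, str, strlen(str));                 /* the write(2) under study */

    str = "After console write\n";
    write(marker_fd, str, strlen(str));

    write(tr_on_fd, "0", 1);                         /* stop tracing             */
    return 0;
}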

Performing:

$ gcc -Wall -O2 -Wno-unused-result ne_ftrace_console_write.c

$ sudo ./a.out

produces this function call graph in ./ftrace_output.txt when run from an xterm. Note that the console output will appear on /dev/tty1 (i.e. CTRL+ALT+F1).

Now, a few details have been glossed over. In fact, this particular trace instance was "luckily" uneventful: not a single timer (or other hardware) interrupt or process re-scheduling event occurred between or during the traced write(2) operations, and all logged activity happened on one CPU. The trace output was, therefore, brief and clean:

$ wc -l ftrace_output.txt 
394 ftrace_output.txt

Often, a trace will involve a more complex syscall scenario than a simple, non-blocking write(2), and, depending on factors such as system load, the trace output will include many more event entries: timer and scheduler activity, as well as activity on other CPU cores. For instance, introducing this basic stress factor in the system:

$ find /

in one xterm, then running the trace from another, yielded the following results:

$ sudo ./a.out
$ wc -l ftrace_output.txt
1058 ftrace_output.txt

$ cat ftrace_output.txt
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 [...]                  [...]                       [...]
 3)               |  sys_write() {
 [...]                  [...]                       [...]
 3)               |      /* Before console write */
 [...]                  [...]                       [...]
 3)               |  sys_write() {                          [ write("Dunia, vipi?") ]
 [...]                  [...]                       [...]
 3)   3.661 us    |      }
 3)               |      tty_write() {
 3)   ==========> |                                         [ timer interrupt on CPU 3 ]
 3)               |        smp_apic_timer_interrupt() {
 0)               |  ktime_get_real() {                     [ first recorded CPU 0 activity ]
 [...]                  [...]                       [...]
 3)               |          irq_enter() {
 2)   2.490 us    |  }                                      [ first recorded CPU 2 activity ]
 3)               |            rcu_irq_enter() {
 [...]                  [...]                       [...]
 1)   3.666 us    |  }                                      [ first recorded CPU 1 activity ]
 0)   3.666 us    |  }                                  
 1)               |  ns_to_timeval() {
 0)               |  ns_to_timeval() {
 [...]                  [...]                       [...]
 2)   ==========> |                                         [ timer interrupt on CPU 2 ]
 2)               |  smp_apic_timer_interrupt() {
 [...]                  [...]                       [...]
 2)   0.201 us    |    native_apic_mem_write();
 3)               |          hrtimer_interrupt() {
 1)   ==========> |                                         [ timer interrupt on CPU 1 ]
 1)               |  smp_apic_timer_interrupt() {
 0)   ==========> |                                         [ timer interrupt on CPU 0 ]
 0)               |  smp_apic_timer_interrupt() {
 [...]                  [...]                       [...]
 2)               |                rcu_report_qs_rnp() {
 3) ! 158.003 us  |        }
 3)   <========== |                                         [ timer IRQ on CPU 3 returns ]
 0)   0.105 us    |                _raw_spin_lock_irqsave();
 2)   0.077 us    |                  _raw_spin_unlock_irqrestore();
 1)   0.142 us    |                update_cfs_load();
 3)   0.211 us    |        tty_paranoia_check();            [ trace to "/dev/tty1" console resumes on CPU3 ]
 [...]                  [...]                       [...]
 3)               |        tty_write_lock() {
 2)   0.636 us    |              }
 1)               |                  update_group_power() {
 3)   0.164 us    |          mutex_trylock();
 [...]                  [...]                       [...]
 0)   0.072 us    |          __local_bh_enable();
 1)   1.374 us    |                  }
 2) + 23.467 us   |        }
 3)               |        n_tty_write() {
 0) + 20.773 us   |        }
 1)   2.553 us    |                }
 2) + 24.679 us   |      }
 [etc]                  [etc]                   [etc]

where:

  • ==========> indicates the occurrence of an interrupt, and
  • <========== shows its return.
  • Text appearing between square brackets, i.e. [ and ], consists of comments added by hand; they do not appear in the actual log.

To further complicate matters, on multiprocessor systems, a process that has been scheduled out will not necessarily resume on the CPU on which it last ran. While Linux tries to ensure soft CPU affinity for a process, it may reschedule it on another CPU if, say, the original CPU is busy. This can introduce an element of confusion when manually filtering out entries based on the value of the CPU field. Fortunately, there are a number of approaches that can be applied to address these issues, including ftrace filtering features such as filtering by the process' PID. One may even consider more exotic options, such as configuring hard CPU affinity to ensure that the process runs on a given CPU (see the sketch below). Ultimately, the most suitable approach, or combination of approaches, will depend on the task at hand. Check out the CPU Affinity entry for a discussion on CPU affinity in a Linux system.
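For instance, the process can pin itself to a single CPU with sched_setaffinity(2). The following is a minimal sketch, not part of ne_ftrace_console_write.c; the helper name is made up and the choice of CPU is arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to the given CPU so that all of its trace
 * entries carry the same value in the CPU field. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {  /* pid 0 == this process */
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    pin_to_cpu(0);
    /* ... proceed with the traced write(2) operations ... */
    return 0;
}

The same effect can be obtained from the shell with, say, taskset -c 0 ./a.out.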

Tracing based on Process ID

In light of the issues discussed above, tracing can be restricted to events belonging to a given process in order to reduce clutter/noise in the trace output. The /sys/kernel/debug/tracing/set_ftrace_pid control file is used to configure this mechanism.
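Presumably, the -DTRACE_PID build option used below enables something along these lines in ne_ftrace_console_write.c: before turning tracing on, the program writes its own PID into set_ftrace_pid. This is a hedged sketch rather than the program's actual code; the helper name is made up and the debugfs mount point is an assumption:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Restrict function tracing to the calling process by writing its PID
 * into set_ftrace_pid. Would be invoked at program startup, before
 * tracing is enabled. */
int trace_only_self(void)
{
    char buf[16];
    int fd, len;

    fd = open("/sys/kernel/debug/tracing/set_ftrace_pid", O_WRONLY);
    if (fd < 0)
        return -1;

    len = snprintf(buf, sizeof(buf), "%d", getpid());
    if (write(fd, buf, len) != len) {
        close(fd);
        return -1;
    }
    return close(fd);
}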

For example, recompile ne_ftrace_console_write.c with PID-based tracing enabled:

$ gcc -Wall -O2 -Wno-unused-result ne_ftrace_console_write.c -DTRACE_PID

Then run the program several times, each time checking the line count of the trace output:

$ sudo ./a.out 
kernel.ftrace_enabled = 1
$ wc -l ftrace_output.txt 
396 ftrace_output.txt

$ sudo ./a.out 
kernel.ftrace_enabled = 1
$ wc -l ftrace_output.txt 
396 ftrace_output.txt

$ sudo ./a.out 
kernel.ftrace_enabled = 1
$ wc -l ftrace_output.txt 
396 ftrace_output.txt

Now, introduce the basic stress factor in the system from a separate xterm:

$ find /

Then repeat the program execution and ftrace_output.txt line count checks:

$ sudo ./a.out 
kernel.ftrace_enabled = 1
$ wc -l ftrace_output.txt 
396 ftrace_output.txt

$ sudo ./a.out 
kernel.ftrace_enabled = 1
$ wc -l ftrace_output.txt 
396 ftrace_output.txt

Function filtering

The trace output can sometimes be a little overwhelming. Even on quad-CPU machines (low-end development machines by today's standards), examining trace output can quickly get depressing. While the set_ftrace_pid control file discussed in the previous section alleviates the situation somewhat, plenty of "distractions", i.e. uninteresting events, still get logged, and these also contribute to overflows in the ring buffer. Increasing the size of the ring buffer may only translate to more strain on the brain (more information to audit) rather than help solve matters.

Fortunately, ftrace provides a way to filter out unwanted functions. The set_ftrace_filter and set_ftrace_notrace control files are used to achieve this. Standard shell redirection, i.e. > and >>, can be used to update these files.

Only functions listed in set_ftrace_filter will be traced. For example:

  • Only trace sys_write:

    $ sudo sh -c "echo sys_write > /sys/kernel/debug/tracing/set_ftrace_filter"
    
    $ gcc -Wall -O2 -Wno-unused-result ne_ftrace_console_write.c -DTRACE_PID
    
    $ sudo ./a.out 
    kernel.ftrace_enabled = 1
    
    $ wc -l ftrace_output.txt 
    12 ftrace_output.txt
    
    $ cat ftrace_output.txt 
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     0)               |  sys_write() {
     0)               |  /* Before console write */
     0)   4.743 us    |  }
     0) + 24.884 us   |  sys_write();
     0)               |  sys_write() {
     0)               |  /* After console write */
     0)   1.257 us    |  }
     0)               |  sys_write() {
    
  • Adding tty_write to the trace:

    $ sudo sh -c "echo tty_write >> /sys/kernel/debug/tracing/set_ftrace_filter"
    
    $ sudo ./a.out 
    kernel.ftrace_enabled = 1
    
    $ wc -l ftrace_output.txt 
    14 ftrace_output.txt
    
    $ cat ftrace_output.txt 
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     0)               |  sys_write() {
     0)               |  /* Before console write */
     0)   4.380 us    |  }
     0)               |  sys_write() {
     0) + 12.778 us   |    tty_write();
     0) + 14.181 us   |  }
     0)               |  sys_write() {
     0)               |  /* After console write */
     0)   0.467 us    |  }
     0)               |  sys_write() {
    

On the other hand, any function listed in set_ftrace_notrace will not be traced. Typical candidates for set_ftrace_notrace include the spin_lock and rcu_read variants.

Note that, so far, the tests have been performed from an xterm under X Windows, with the program's output appearing on /dev/tty1, i.e. VT1. Switching to VT1 (CTRL+ALT+F1) and running the program from there results in the Dunia, vipi? message (finally) getting printed on that console. This means that the trace output will now include calls into the console driver stack.

In this particular setup, the console on /dev/tty1 was bound to the Linux FB framework:

$ cat /sys/class/vtconsole/vtcon1/name 
(M) frame buffer device

$ cat /sys/class/vtconsole/vtcon1/bind 
1

$ cat /proc/fb
0 inteldrmfb

$ lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)

$ lspci -s 00:02.0 -k
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
    Subsystem: QUANTA Computer Inc Device 0872
    Kernel driver in use: i915
    Kernel modules: i915

Major sources of noise in the trace output were the console_conditional_schedule and __cond_resched functions in the Linux FB stack. Even tracing this simple, non-blocking write(2) operation on VT1 more often than not resulted in ring buffer overflows!

Now, applying the following function filtering:

$ cat ne_console_write_notrace_filter.sh

#!/bin/sh

## clear "functions to trace" filter
echo > /sys/kernel/debug/tracing/set_ftrace_filter

## specify which functions "not to trace"
echo console_conditional_schedule *cond_resched* *srcu_read* \
rw_verify_area get_user_pages_fast put_page *fsnotify* \
*_spin_* > /sys/kernel/debug/tracing/set_ftrace_notrace

$ sudo chmod 744 ne_console_write_notrace_filter.sh

$ sudo ./ne_console_write_notrace_filter.sh

$ sudo cat /sys/kernel/debug/tracing/set_ftrace_notrace
get_user_pages_fast
__cond_resched
__cond_resched_lock
mutex_spin_on_owner
__cond_resched_softirq
_cond_resched
console_conditional_schedule
assert_spin_locked.part.17
__srcu_read_lock
__srcu_read_unlock
srcu_readers_active_idx.isra.0
_raw_spin_trylock
_raw_spin_lock
(...)

and then running the program:

$ gcc -Wall -O2 -Wno-unused-result ne_ftrace_console_write.c -DTRACE_PID

$ sudo ./a.out
kernel.ftrace_enabled = 1
Dunia, vipi?

produced this nice trace.

As can be seen, function calls for the Linux FB stack now appear in the trace.

  • FB console, e.g. fbcon_cursor, fbcon_scroll, fbcon_redraw, etc.; see drivers/video/console/fbcon.c.

  • Low-level FB driver stack, e.g. cfb_fillrect (see drivers/video/cfbfillrect.c) and cfb_imageblit (see drivers/video/cfbimgblt.c), which are the generic Linux FB ops that the Intel /lib/modules/3.2.0-35-generic/kernel/drivers/gpu/drm/i915/i915.ko driver uses (see drivers/gpu/drm/i915/intel_fb.c).

Finally, do not forget to reset the function filters when done. Simply echo a blank line into each control file:

$ sudo sh -c "echo > /sys/kernel/debug/tracing/set_ftrace_filter"

$ sudo sh -c "echo > /sys/kernel/debug/tracing/set_ftrace_notrace"

$ sudo cat /sys/kernel/debug/tracing/set_ftrace_filter
#### all functions enabled ####

$ sudo cat /sys/kernel/debug/tracing/set_ftrace_notrace

Resources and Further Reading

Ftrace function tracing, used in combination with a source code browsing utility such as cscope, tremendously facilitates hacking on and understanding the Linux kernel. This guide covered only a fraction of function tracing with ftrace. Also recall that function tracing is just one feature of the ftrace framework.

Also See