Author: Mugabi Siro

Category: general linux

Summary:

An introduction to the Function Tracing capabilities of the Ftrace framework.

Tags: linux ftrace debugging/tracing introduction

Ftrace Background

From Documentation/trace/ftrace.txt:

Ftrace is an internal tracer designed to help out developers and
designers of systems to find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and
performance issues that take place outside of user-space.

Although ftrace is typically considered the function tracer, it
is really a frame work of several assorted tracing utilities.
There's latency tracing to examine what occurs between interrupts
disabled and enabled, as well as for preemption and from a time
a task is woken to the task is actually scheduled in.

One of the most common uses of ftrace is the event tracing.
Through out the kernel is hundreds of static event points that
can be enabled via the debugfs file system to see what is
going on in certain parts of the kernel.

Documentation/trace/ftrace.txt (updated for Linux 3.10) is approximately 3000 lines of text (80x25 display) and thoroughly documents features of this truly versatile framework. It is fierce. The context of this guide is but a fraction of ftrace's staggering capabilities.

Ftrace uses the debugfs file system (i.e. CONFIG_DEBUG_FS) to hold the control files as well as the files to display output. This file system is configured into the kernel when any ftrace configuration option is selected. The directory /sys/kernel/debug should be available upon system boot. If not, and if ftrace was properly configured and built into the kernel, try:

$ sudo mount -t debugfs none /sys/kernel/debug/

Control and Output Files

Enabling any ftrace kernel configuration option will also result in the presence of the sys/kernel/debug/tracing directory. This is the directory that holds all control and output files. The following are the files relevant to the context of this guide:

  • available_tracers

    This file lists the available tracers compiled into the kernel.

  • current_tracer

    This file is used to set or display the current tracer.

  • tracing_on

    Echo 1 into this file to allow/start update of the ring buffer. Echoing 0 will prevent further writes into the ring buffer.

    Note:

    • disabling writes into the ring buffer does not stop tracing action, nor does it clear/reset the buffer; it just disables further ring buffer updates and tracing overhead may still be in effect. Also see Section Enabling/Disabling the Function Tracers below.

    • since a ring buffer is used to log events, newest data may overwrite the oldest data. For this reason, it is sometimes better to disable writes into the buffer directly from a program's source code, right at the point where the syscalls being traced ends.

  • trace_marker

    This is a very usefile file for synchronizing userspace with events happening in the kernel. The tracing_on switch used in combination with trace_marker are essentially what make the kind function tracing discussed in the context of this guide a very useful hacking tool. A string written into this file appears as a C-style comment in the trace output. This mechanism can be used to mark the start and end of the relevant/interesting parts of the trace.

  • trace

    This file holds the output of the trace in a human readable format.

Tracers

The relevant tracers in the context of this guide are:

  • function

    Function call tracer to trace all kernel functions.

  • function_graph

    Similar to the function tracer except that the function tracer probes the functions on their entry whereas the function graph tracer traces on both entry and exit of the functions. This makes it possible for the ftrace framework to output a C-style function call graph.

  • nop

    This is the "trace nothing" tracer. To remove all tracers from tracing, simply echo nop into current_tracer.

Kernel Configuration

Kernel configuration options for ftrace can be viewed under Kernel hacking > Tracers [CONFIG_FTRACE]. For purposes of this guide, the following are the relevant kernel config options:

  • CONFIG_FUNCTION_TRACER=y

    Enables the kernel to trace every kernel function.

  • CONFIG_FUNCTION_GRAPH_TRACER=y

    Enable the kernel to trace a function at both its return and entry. This is required by the function_graph tracer.

  • CONFIG_FTRACE_SYSCALLS=y

    Basic tracer to catch the syscall entry and exit events.

  • CONFIG_DYNAMIC_FTRACE=y

    Required for function filtering - and recommended in general.

Performing a function trace

Enabling/Disabling the function tracers

Enabling the function tracers can be done with the following one-liner:

$ sudo sysctl kernel.ftrace_enabled=1

The procfs sysctl(8) function tracing enable is one big on/off switch for ftrace. By default, it is enabled (when function tracing is enabled in the kernel). If it is disabled, then all function tracing is disabled. This not only includes the function tracers for ftrace, but also for any other uses (perf, kprobes, stack tracing, profiling, etc). Disable with care.

Generally, disabling/enabling can be done via:

sysctl kernel.ftrace_enabled=0
sysctl kernel.ftrace_enabled=1

or

echo 0 > /proc/sys/kernel/ftrace_enabled
echo 1 > /proc/sys/kernel/ftrace_enabled

NOTE:

  • Enabling/disabling tracers

    This is done via /proc/sys/kernel/ftrace_enabled as explained above. This file is the ftrace on/off switch.

  • The nop tracer

    Setting the current tracer to nop sets tracing to no-ops and clears the ring buffer. It does not switch off ftrace.

  • Enabling/disabling ring buffer updates

    Echoing 0 into /sys/kernel/debugfs/tracing/tracing_on only prevents further ring buffer updates. It does not disable the current tracer activity/overhead, nor does it clear the ring buffer, nor is it the ftrace on/off switch.

function tracer

Now, consider the following shell script:

$ cat script.sh

#!/bin/sh

dir=/sys/kernel/debug/tracing

sysctl kernel.ftrace_enabled=1
echo function > ${dir}/current_tracer
echo 1 > ${dir}/tracing_on ## note the spacing; "echo 1> ${dir}/tracing_on" will not work!
sleep 1
echo 0 > ${dir}/tracing_on
less ${dir}/trace

Executing it:

$ sudo ./script.sh
kernel.ftrace_enabled = 1
# tracer: function
#
# entries-in-buffer/entries-written: 19854/19854   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
       script.sh-17774 [001] .... 663131.217150: mutex_unlock <-rb_simple_write
       script.sh-17774 [001] .... 663131.217153: __fsnotify_parent <-vfs_write
       script.sh-17774 [001] .... 663131.217153: fsnotify <-vfs_write
       script.sh-17774 [001] .... 663131.217154: __srcu_read_lock <-fsnotify
       script.sh-17774 [001] .... 663131.217154: __srcu_read_unlock <-fsnotify
       script.sh-17774 [001] .... 663131.217155: __sb_end_write <-vfs_write
       script.sh-17774 [001] .... 663131.217160: SyS_dup2 <-system_call_fastpath
       script.sh-17774 [001] .... 663131.217160: SyS_dup3 <-SyS_dup2
       script.sh-17774 [001] .... 663131.217160: _raw_spin_lock <-SyS_dup3
       script.sh-17774 [001] .... 663131.217161: do_dup2 <-SyS_dup3
       script.sh-17774 [001] .... 663131.217161: filp_close <-do_dup2
       script.sh-17774 [001] .... 663131.217162: dnotify_flush <-filp_close

displays the tracer's output which consists of a small header section followed by a listing/sequence of traced events.

Entries at the top of the tracer output begin with a # and are what form its header section. The first entry lists the tracer that that produced the output - in this case, tracer: function.

The second header entry shows the number of event entries in the ring buffer as well as the total number of event entries that were written. The difference is the number of events that were lost due to (an) overflow(s) of the ring buffer. In this case, the ring buffer did not fill up and overflow, and so no events were lost (19854 - 19854 = 0).

The next entries in the header describe the fields of the event listing:

  • TASK - the task name, script.sh.
  • PID - the task PID, 17774.
  • CPU# - the CPU that the task was running on, 001.
  • irqs-off, need-resched, hardirq/softirq, preempt-depth delay - these yeild the latency format.
  • TIMESTAMP - This is in the form <secs>.<usecs>.
  • FUNCTION - This field contains a pair of function names. To the right is the name of the parent function that calls (<-) the function to the left. This information is incredibly useful for the hacker/student. For example, it is easy to see that sys_dup2 calls sys_dup3 which, in turn, invokes do_dup2. However, to the uninitiated eye, this display format might appear rather cryptic and it may take a while to figure out, say, function call depth. Fortunately, ftrace developers were kind enough to include another tracer, function_graph which makes this display format more readable for the less seasoned developer. It displays this information in the form of a C-style function call graph.

function_graph tracer

From Documentation/trace/ftrace.txt:

This tracer is similar to the function tracer except that it
probes a function on its entry and its exit. This is done by
using a dynamically allocated stack of return addresses in each
task_struct. On function entry the tracer overwrites the return
address of each function traced to set a custom probe. Thus the
original return address is stored on the stack of return address
in the task_struct.

Probing on both ends of a function leads to special features
such as:

- measure of a function's time execution
- having a reliable call stack to draw function calls graph

Now, modifying script.sh (above) to use function_graph:

$ cat script.sh
[...]
echo function_graph > ${dir}/current_tracer
echo 1 > ${dir}/tracing_on
[...]

$ sudo ./script.sh
kernel.ftrace_enabled = 1
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)   0.320 us    |  mutex_unlock();
 0)   0.077 us    |  __fsnotify_parent();
 0)               |  fsnotify() {
 0)   0.115 us    |    __srcu_read_lock();
 0)   0.098 us    |    __srcu_read_unlock();
 0)   1.239 us    |  }
 0)   0.105 us    |  __sb_end_write();
 0)               |  SyS_dup2() {
 0)               |    SyS_dup3() {
 0)   0.077 us    |      _raw_spin_lock();
 0)               |      do_dup2() {
 0)               |        filp_close() {
 0)   0.072 us    |          dnotify_flush();
 0)   0.071 us    |          locks_remove_posix();
 0)               |          fput() {
 0)               |            file_sb_list_del() {
 0)   0.085 us    |              lg_local_lock_cpu();
 0)   0.073 us    |              lg_local_unlock_cpu();
 0)   1.084 us    |            }

The start and end of a function are indicated by { and }, respectively. Leaf functions, which do not call other functions end with a ;. The DURATION column shows the time spent in the corresponding function - values appear against leaf functions and a }. The latter time value includes the overhead of all functions within a nested function as well as the overhead of the function graph tracer itself. Occassionally, these timing values will get prepended with a + or !. These annotations appear when the duration value exceeds 10 microseconds or 100 microseconds, respectively. Nevertheless, keep it in mind that these timings are mainly approximations.

The first column of the ftrace output indicates the logical CPU on which the corresponding event entries were executed. Typically, on multicore systems, the output will feature interleaved CPU traces like this:

 0)   0.073 us    |        mutex_unlock();
 3)   0.124 us    |                iov_iter_fault_in_readable();
 1)   0.105 us    |        fget_light();
 3)               |                ext3_write_begin() {
 0)   0.071 us    |        put_pid();
 3)               |                  ext3_writepage_trans_blocks() {
 3)   0.043 us    |                    journal_blocks_per_page();
 1)               |        sock_poll() {
 0)   3.126 us    |      }
 1)               |          unix_poll() {
 3)   0.390 us    |                  }
 0)   6.007 us    |    }
 3)               |                  grab_cache_page_write_begin() {
 1)               |            __pollwait() {
 3)               |                    find_lock_page() {
 0)   0.077 us    |    fput();
 3)   0.074 us    |                      find_get_page();
 1)               |              add_wait_queue() {
 1)   0.081 us    |                _raw_spin_lock_irqsave();

and it may become hard to read the trace. The grep(1) utility can be used to filter this output for a specific CPU. Alternatively, the trace output files under the per_cpu directory may be used:

$ sudo ls /sys/kernel/debug/tracing/per_cpu
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7

$ sudo less /sys/kernel/debug/tracing/per_cpu/cpu3/trace

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3) + 22.926 us   |                } /* ext3_ordered_write_end */
 3)   0.044 us    |                _cond_resched();
 3)   0.044 us    |                iov_iter_advance();
 3)   0.049 us    |                balance_dirty_pages_ratelimited_nr();
 3) + 31.098 us   |              } /* generic_perform_write */
 3) + 31.911 us   |            } /* generic_file_buffered_write */
 3) + 34.098 us   |          } /* __generic_file_aio_write */
 3)   0.047 us    |          mutex_unlock();
 3)   0.042 us    |          generic_write_sync();
 3)               |          blk_finish_plug() {
 3)   0.141 us    |            blk_flush_plug_list();
 3)   0.480 us    |          }

Function Filtering

Ftrace function tracers support additional features that facilitate trace output inspection. Notable among these are the function trace filters. Recall that CONFIG_DYNAMIC_FTRACE need be enabled in order to use the function filters. A few of these filters are described below:

  • set_ftrace_filter

    A trace can be limited to only the functions listed in this control file. For a case study of its usage, see Hacking Syscall-to-Driver Control Paths. This interface also allows for commands to be used. Check out the Filter commands section of Documentation/trace/ftrace.txt.

  • set_ftrace_notrace

    This function filter has an effect opposite to that of set_ftrace_filter. Any function that is added here will not be traced. If a function is listed in both set_ftrace_filter and this control file, then the function will not be traced. See Hacking Syscall-to-Driver Control Paths for a case study of its usage.

  • set_ftrace_pid

    This filter instructs Ftrace to trace only a single thread, based on its process ID. For case studies of its usage, check out:

  • set_graph_function

    This file sets a "trigger" function where tracing should start with the function graph tracer. For a case study of its usage, see Hacking ISR-to-Process Control Paths.

Also See

Resources and Further Reading

Function tracing is just one feature of the ftrace framework. Ftrace is capable of a lot more. Check out: