Author: Siro Mugabi

Category: General Linux

Summary:

This entry discusses a few considerations when debugging the Linux boot process with QEMU and GDB.

Tags: linux qemu debugging/tracing

Update

This site apparently shows a better way of fixing the remote GDB/x86-64 debugging problem on QEMU. The gist of the matter is that GDB requires a patch.

This entry is a prime candidate for my list of shame and is scheduled for removal.

Background

Several guides on the Internet suggest the following generic procedure when debugging the Linux boot process with GDB:

  • Start QEMU with the following debugging options:

    $ qemu-system-$ARCH -kernel $IMAGE -S -s
    

    where:

    • $IMAGE is a bootable kernel image that should have been built with debugging info (at least, CONFIG_DEBUG_INFO).

    • -S freezes the guest CPU at startup, so nothing executes until the debugger (or the QEMU monitor) resumes it.

    • -s is shorthand for -gdb tcp::1234. This results in QEMU's GDB server listening on TCP port 1234 for a remote connection from a gdb(1) client.

  • Separately load the corresponding ELF vmlinux object in gdb(1) and then connect to the waiting QEMU GDB server:

    $ gdb vmlinux
    (gdb) target remote tcp::1234
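
The two steps above can be sketched as a pair of small shell helpers. This is only an illustration: the function names qemu_debug_cmd and gdb_attach_cmd are made up for this example, and the helpers merely print the commands so that they can be reviewed and then run in two separate terminals. The gdb(1) -ex option runs a GDB command once the symbol file has been loaded.

```shell
#!/bin/sh
# Sketch only: print the two commands of the generic procedure so they can be
# reviewed and then run in two separate terminals.
# The function names below are hypothetical helpers, not part of QEMU or GDB.

# qemu_debug_cmd ARCH IMAGE
#   -S freezes the guest CPU at startup; -s is shorthand for -gdb tcp::1234.
qemu_debug_cmd() {
    printf 'qemu-system-%s -kernel %s -S -s\n' "$1" "$2"
}

# gdb_attach_cmd VMLINUX
#   -q suppresses the banner; -ex runs a GDB command once symbols are loaded.
gdb_attach_cmd() {
    printf "gdb -q %s -ex 'target remote tcp::1234'\n" "$1"
}

qemu_debug_cmd x86_64 arch/x86/boot/bzImage
gdb_attach_cmd vmlinux
```

Printing rather than executing keeps the sketch harmless on machines without QEMU installed. The paths are the ones used in the x86_64 example of this entry.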
    

Case 1

Depending on several factors, including the target architecture and the manner in which the QEMU guest kernel image gets loaded into the machine emulator's memory, this generic procedure may need some variation. Consider the following case:

$ make x86_64_defconfig

$ make menuconfig ## to enable "CONFIG_DEBUG_INFO"

$ make -jN

$ qemu-system-x86_64 -kernel arch/x86/boot/bzImage -S -s

QEMU starts up in a "stopped" state (courtesy of -S), with the QEMU GDB server waiting for a connection (on TCP port 1234 courtesy of -s) from a remote gdb(1) client. Now, if the following gdb(1) client connection,

$ gdb -q vmlinux
Reading symbols from /tmp/qgdb/linux/vmlinux...done.

(gdb) target remote tcp::1234
Remote debugging using tcp::1234
0x0000000000000000 in ?? ()

breakpoint specification e.g.

(gdb) break parse_early_param 
Breakpoint 1 at 0xffffffff81cdba4d: file init/main.c, line 422.

and continue command

(gdb) c
Continuing.

result in QEMU resuming execution without ever stopping at the specified breakpoints, and if further gdb(1) commands result in errors such as

<CTRL+C>
^CRemote 'g' packet reply is too long: 0000000000000000d81fc0[...]801f0000
(gdb)

(gdb) c
Continuing.
Remote 'g' packet reply is too long: 0000000000000000d81fc0[...]801f0000

(gdb) s
Remote 'g' packet reply is too long: 0000000000000000d81fc0[...]801f0000

then consider the following variation in procedure.

Case 2

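The problem in the case above (I think, but I am not sure) is that the remote gdb(1) client connected to the QEMU GDB server before the guest bzImage had been decompressed and loaded into the QEMU machine memory. At that point the CPU has not yet switched to 64-bit mode, which would explain why gdb(1) later complains about the size of the register ('g') packet.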

So, for the earliest possible debugging in such cases, GDB client connection should be made immediately after the guest kernel image has been decompressed and properly loaded into the QEMU machine memory. In scenarios like these, remote gdb(1) client connection ought to be made moments after

Decompressing Linux... Parsing ELF... done.
Booting the kernel.

appears on the QEMU (SDL) display [1]. Also notice that with this approach, the -S (capital S) option no longer applies. For instance:

[Screenshot: QGDB-LINUX-ADHOC]

In other words,

  • First load vmlinux in gdb(1) and type the remote connection command for the yet-to-be-started QEMU GDB server, but do not attempt the connection yet, e.g.:

    $ gdb -q vmlinux
    Reading symbols from /tmp/qgdb/linux/vmlinux...done.
    
    (gdb) target remote tcp::1234   ## but don't press "ENTER" yet
    
  • Then fire up the QEMU machine, e.g.:

    $ qemu-system-x86_64 -kernel arch/x86/boot/bzImage -s -no-kvm
    
  • Once the guest kernel is decompressed and loaded, hit enter on the gdb(1) console to initiate a remote connection with the QEMU GDB server, thereby interrupting guest kernel boot.

From this point on, GDB operations should be able to proceed in a manner more-or-less similar to debugging a userspace program.

Nevertheless, a problem with this case is that the earliest possible GDB trap into guest kernel execution depends largely on the user's timing and reflexes, since the QEMU -S option is no longer available for synchronisation. One may even consider disabling KVM (as shown above) to slow down QEMU execution, thereby increasing the chances of early debugging.

So, if you seek a more deterministic/synchronised way of connection between gdb(1) and the QEMU GDB server, and if you lack a better idea, then you may consider the hack presented in the next section.

Forcing Synchronisation

Continuing with Case 2 above, in order to force synchronisation between the QEMU GDB server and the remote gdb(1) client, some other form of a "timely delay" is required other than the use of the QEMU -S option.

The solution presented here is intrusive as it involves (a one-liner) modification of the kernel sources in order to insert an early boot "synchronisation mechanism". Essentially, an endless loop is inserted at the point of interest to stall the guest kernel boot process, thereby allowing for a synchronised connection between the remote gdb(1) client and the waiting QEMU GDB server. NOTE: The resulting kernel image can only be used for the QEMU/GDB debugging setups described in this section. Otherwise, its execution will always stall at the endless loop.

On x86(_64), the function responsible for decompressing bzImage is arch/x86/boot/compressed/misc.c:decompress_kernel():

asmlinkage void decompress_kernel(void *rmode, memptr heap,
                    unsigned char *input_data,
                    unsigned long input_len,
                    unsigned char *output)
{
    [...]
    console_init();
    debug_putstr("early console in decompress_kernel\n");
    [...]
    debug_putstr("\nDecompressing Linux... ");
    decompress(input_data, input_len, NULL, NULL, output, NULL, error);
    parse_elf(output);
    debug_putstr("done.\nBooting the kernel.\n");
    return;
}

However, at this point, the decompressed kernel is not quite ready yet for remote debugging with gdb(1), since some proper loading in (QEMU) machine memory is still pending. For illustration, an architecture-independent case of GDB stepping through init/main.c:start_kernel() will be considered. Nevertheless, note that remote debugging is quite possible with code as early as (on x86) arch/x86/kernel/head_{32,64}.S.

Including the hack

There are several ways of implementing an endless loop, some more elegant than others. The important thing to keep in mind is that Kbuild enables gcc(1) optimizations, so the loop must be written in such a way that the compiler cannot optimize it away.

  • C style example:

    472     asmlinkage void __init start_kernel(void)
    473     {
    474         char * command_line;
    475         extern const struct kernel_param __start___param[], __stop___param[];
    476
    477         {                           /* start of hack */
    478             volatile int xNzt;
    479             for(xNzt = 0; !xNzt; );
    480         }                           /* end of hack */
    481
    482         /*
    483          * Need to run as early as possible, to initialize the
    484          * lockdep hash:
    485          */
    486         lockdep_init();
    

    The volatile C qualifier forces the compiler to perform every access to the grotesquely named xNzt variable as a real memory access, in effect preventing the for loop from being optimized away.

  • Inline Assembly style example:

    472     asmlinkage void __init start_kernel(void)
    473     {
    474         char * command_line;
    475         extern const struct kernel_param __start___param[], __stop___param[];
    476
    477         __asm__ __volatile__ ("loop:\n\t" "jmp loop\n\t"::);
    478
    479         /*
    480          * Need to run as early as possible, to initialize the
    481          * lockdep hash:
    482          */
    483         lockdep_init();
    

    The volatile qualifier on the inline asm prevents gcc(1) from optimizing the statement away or moving it significantly, i.e. the inline asm lines execute where the programmer placed them. While this approach is cleaner than the C style hack above, it remains architecture specific.

Whichever the style used:

$ make -jN

Recall that CONFIG_DEBUG_INFO should at least have been enabled.
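
Before rebuilding, it may be worth confirming that the option really made it into the kernel configuration. The following is a minimal sketch, assuming the usual .config format of one CONFIG_NAME=y entry per line. The helper name has_debug_info is made up for this example, and the demonstration runs against a throwaway file rather than a real kernel tree.

```shell
#!/bin/sh
# Sketch only: check whether a kernel configuration file enables
# CONFIG_DEBUG_INFO. has_debug_info is a hypothetical helper name.

# has_debug_info FILE: succeed if FILE contains the line CONFIG_DEBUG_INFO=y
has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "$1"
}

# Demonstration on a temporary file standing in for a real .config.
sample=$(mktemp)
printf 'CONFIG_X86_64=y\nCONFIG_DEBUG_INFO=y\n' > "$sample"
if has_debug_info "$sample"; then
    echo "debug info enabled"
else
    echo "debug info missing"
fi
rm -f "$sample"
```

In a real kernel tree the call would be has_debug_info .config, run from the top of the source directory.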

Test Run

  • Boot against the new image e.g:

    $ qemu-system-x86_64 -kernel arch/x86/boot/bzImage -s -smp 2 -enable-kvm
    

    As shown below, once decompress_kernel completes and control eventually reaches start_kernel, QEMU's execution of the guest kernel gets tied up in an endless loop, chewing CPU cycles while the QEMU GDB server awaits remote gdb(1) client connection:

    [Screenshot: STAGE1_QGDB-LINUX-HACK-BOOT]

  • Remote gdb(1) connection, breakpoint setting and other GDB operations may now proceed e.g:

    [Screenshot: STAGE2_QGDB-LINUX-HACK-BOOT]

    Basically, upon remote gdb(1) client connection, the looping guest kernel was stopped at line:

    0xffffffff81cdbaac in start_kernel ( ) at init/main.c:479
    479                     for(xNzt = 0; !xNzt; );
    

    or, for the inline asm style hack:

    [Screenshot: STAGE2_QGDB-LINUX-HACK-BOOT]

    At this point, breakpoints for parse_early_param and console_init were inserted. The GDB jump command was then used to break out of the endless loop. Since the jump targets were lines in the respective init/main.c that come before any of the original start_kernel code has run, the jump skipped nothing and was harmless.

    From that point on, GDB operations proceeded as shown in the screenshots above. The Linux boot messages shown on the QEMU SDL interface appeared right after the execution of:

    Breakpoint 2, console_init () at drivers/tty/tty_io.c:3480
    3480    {
    (gdb) finish
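
The test run for the C style hack could also be scripted with a GDB command file. The sketch below is an assumption-laden illustration rather than the procedure used in this entry. It releases the loop with set var on the xNzt variable instead of the jump command described above, it assumes the guest is spinning inside start_kernel when gdb(1) attaches, and both the helper name write_gdb_script and the file name qgdb-hack.gdb are made up for this example.

```shell
#!/bin/sh
# Sketch only: write a GDB command file for the C style hack.
# Assumes the loop variable is named xNzt, as in the C style example, and that
# the guest is spinning inside start_kernel when gdb attaches.
# write_gdb_script and qgdb-hack.gdb are hypothetical names.

# write_gdb_script FILE: store the GDB commands in FILE
write_gdb_script() {
    cat > "$1" <<'EOF'
target remote tcp::1234
break parse_early_param
break console_init
set var xNzt = 1
continue
EOF
}

script=${QGDB_SCRIPT:-qgdb-hack.gdb}
write_gdb_script "$script"
echo "run: gdb -q vmlinux -x $script"
```

Setting xNzt to a non-zero value makes the loop condition fail on its next evaluation, so no line numbers need to be known in advance. For the inline asm style hack there is no variable to set, so the jump command remains the way out.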
    

Footnotes

1. See Redirecting Serial Line Terminals for alternative setups for a system console.