Author: Mugabi Siro

Category: Device Drivers

Summary:

A jump-start tutorial on writing a Linux PCI device driver. QEMU-PC's ivshmem virtual device is used here as the PCI hardware platform. Originally tested with QEMU v1.0 of the Ubuntu 12.04 repository and guest Linux v3.x. Also tested with upstream QEMU v1.0.1, v2.2.0 and v2.7.0-rc4. References to header files and other sources, e.g. drivers/pci/pci.c, are with respect to the root of the Linux kernel sources. Notation such as <linux/pci.h> implies include/linux/pci.h.

Tags: linux qemu pci

Table Of Contents

Preliminaries

Background

First things first, it must be emphasized that virtual platforms are not a replacement for physical hardware. However, for purposes of a demo/training/lab session, virtual setups are quite convenient and present an elegant solution to the problem of a universally accessible platform. Components of a virtual platform can also be easily added, extended or replaced.

The InterVM SHared MEMory (ivshmem) QEMU PCI device is used. This device was selected for a number of reasons, including its simplicity and a framework that facilitates interfacing and testing with external code on the VM host. The ivshmem device enables the guest OS to access a POSIX SHM region on the host. This framework emulates MMIO access on a physical device: the host SHM region appears as an MMIO region to the guest OS. The ivshmem framework also allows other ivshmem-enabled guests (or some other stand-alone program on the host) to send IRQs to the VM by way of the eventfd(2) mechanism.

Since ivshmem is not one of the default QEMU-PC devices, the corresponding -device option needs to be specified on the QEMU boot command line. To view the list of supported devices for a given machine architecture, run (for instance):

$ qemu-system-x86_64 -device ?

PCI Device Compatibility Note

The Linux API for PCI device drivers generally remains compatible across the family of PCI technologies, e.g. PCI, PCI-X and PCIe. The ivshmem device emulates a PCI device.

QEMU Version Compatibility Note

As of QEMU versions later than v2.5, three different ivshmem devices are available:

  • The legacy ivshmem device. This device supports two configurations: shared-memory-only and shared-memory-with-IRQ-support. Its IRQs can be configured for either traditional pin-based interrupts or for Message Signaled Interrupts (MSI).

    Being a legacy device, the -device ivshmem,shm=$SHMOBJ,size=$SIZE command line, e.g.:

    -device ivshmem,shm=ivshmem,size=1
    

    is deprecated and results in a warning similar to:

    ivshmem is deprecated, please use ivshmem-plain or ivshmem-doorbell instead
    

    upon QEMU launch. However, it is supported for backward compatibility. In fact, the device driver test instructions presented in this tutorial still rely on this backward compatibility.

  • The -device ivshmem-plain shared-memory-only device. This device does not support IRQs at all. With respect to the legacy -device ivshmem,shm=ivshmem,size=1 command line presented above, the equivalent command line for a POSIX SHM host mapping with this device is:

    -object memory-backend-file,id=mb1,size=1M,share,mem-path=/dev/shm/ivshmem \
    -device ivshmem-plain,memdev=mb1
    
  • The -device ivshmem-doorbell shared-memory-and-MSI-only device.

See docs/specs/ivshmem-spec.txt for more information.

The next section, PCI Background, assumes that an ivshmem-enabled QEMU instance with a Linux guest is already up and running. The instructions in that section were performed on a:

$ qemu-system-x86_64 -enable-kvm

instance. The legacy -device ivshmem or the -device ivshmem-plain command line switches presented above can be used:

  • For the legacy -device ivshmem switch, the shm property specifies the name of the POSIX SHM object. For example, shm=ivshmem implies /dev/shm/ivshmem (at least on Debian/Ubuntu systems). The object will be created if it does not exist and truncated to the specified size. If the object already exists, QEMU will not resize it but will verify that its size matches the value given with the size property.

  • The interpretation of the size property remains the same across the different ivshmem devices. The default unit is megabytes. The value specified for size must be a power of two, a restriction imposed by PCI memory regions.

The instructions presented in this entry have been tested with upstream QEMU v1.0.1 (and Ubuntu-QEMU v1.0) and v2.2.0 (Linux guest v3.2.0, v3.11.0, v3.18.7), and QEMU v2.7.0-rc4.

PCI Background

PCI Bus, Device and Function numbers

PCI devices are addressed using bus, device and function numbers:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 20)
  • For -device ivshmem, the PCI revision ID is 0:

    00:04.0 RAM memory: Red Hat, Inc Device 1110
    
  • For -device ivshmem-plain, the PCI revision ID is 1:

    00:04.0 RAM memory: Red Hat, Inc Device 1110 (rev 01)
    

The XX:YY.Z tuple at the beginning of each entry is interpreted as bus:device.function number. The following lspci output displays the PCI tree layout.

$ lspci -t -v
-[0000:00]-+-00.0  Intel Corporation 440FX - 82441FX PMC [Natoma]
           +-01.0  Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
           +-01.1  Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
           +-01.3  Intel Corporation 82371AB/EB/MB PIIX4 ACPI
           +-02.0  Cirrus Logic GD 5446
           +-03.0  Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
           \-04.0  Red Hat, Inc Device 1110

The [0000:00] value yields the PCI domain. A PCI domain can host up to 256 buses. This QEMU VM machine instance features a simple tree layout with one PCI bus hosting five devices. Each bus can have a maximum of 32 devices, and each device can feature up to 8 functions. For example, PCI device number 01 has 3 functions. The ivshmem device corresponds to 0000:00:04.0.

PCI Regions

PCI devices contain three addressable regions:

  • Configuration space

  • I/O Ports

  • Device Memory

Configuration Space

PCI devices are identified via VendorIDs, DeviceIDs, and Class Codes:

$ lspci -v -m -n -s 00:04.0
Device: 00:04.0
Class:  0500
Vendor: 1af4
Device: 1110
SVendor:    1af4
SDevice:    1100
PhySlot:    4

In other words:

$ lspci -v -m -s 00:04.0
Device: 00:04.0
Class:  RAM memory
Vendor: Red Hat, Inc
Device: Device 1110
SVendor:    Red Hat, Inc
SDevice:    Device 1100
PhySlot:    4

PCI VendorIDs are maintained and assigned globally by the PCI Special Interest Group (PCI SIG). Check out www.pcidatabase.com and http://pci-ids.ucw.cz/read/PC/. Also see include/linux/pci_ids.h of the Linux sources. The vendor then allocates DeviceID values for its devices. Vendor Red Hat, Inc. donates part of its DeviceID range to QEMU, to be used for virtual devices. The VendorIDs are 1af4 (formerly the Qumranet ID) and 1b36. The ivshmem device has been allocated 1af4:1110 as its VendorID and DeviceID, respectively. See docs/specs/pci-ids.txt of the QEMU sources.

The VendorID, DeviceID, Class Code, Subsystem VendorID (i.e. SVendor) and SubsystemID (i.e. SDevice) values are embedded in the PCI configuration space:

$ hexdump /sys/devices/pci0000\:00/0000\:00\:04.0/config 
0000000 1af4 1110 0003 0000 0000 0500 0000 0000
0000010 2000 feb2 0000 0000 0000 fea0 0000 0000
0000020 0000 0000 0000 0000 0000 0000 1af4 1100
0000030 0000 0000 0000 0000 0000 0000 010b 0000
0000040

As shown above, the first two bytes contain the VendorID, followed by the DeviceID in the next two bytes, the pair forming a unique 32-bit identifier for the device. Note, however, that the PCI configuration space is little endian, i.e.:

$ hd /sys/devices/pci0000\:00/0000\:00\:04.0/config 
00000000  f4 1a 10 11 03 00 00 00  00 00 00 05 00 00 00 00  |................|
00000010  00 20 b2 fe 00 00 00 00  00 00 a0 fe 00 00 00 00  |. ..............|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 f4 1a 00 11  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 0b 01 00 00  |................|
00000040

Accordingly, the 3-byte-wide Class Code register starts at offset 0x09, while the 16-bit Subsystem VendorID and SubsystemID registers start at offsets 0x2c and 0x2e, respectively.
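As a quick illustration, the following minimal userland sketch reads the standardized header from the sysfs config file of this QEMU instance (the same path as the hexdump above) and assembles a few of the little-endian fields; error handling and the program itself are illustrative only:

/*
 * Sketch: read the first 64 bytes of a PCI device's configuration
 * space via sysfs and extract a few little-endian fields.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define CFG_PATH "/sys/devices/pci0000:00/0000:00:04.0/config"

int main(void)
{
    uint8_t cfg[64];    /* standardized header region */
    FILE *fp = fopen(CFG_PATH, "rb");
    if (!fp || fread(cfg, 1, sizeof(cfg), fp) != sizeof(cfg)) {
        perror("config read");
        return EXIT_FAILURE;
    }
    fclose(fp);

    /* multi-byte fields are stored little-endian */
    uint16_t vendor    = cfg[0x00] | (cfg[0x01] << 8);
    uint16_t device    = cfg[0x02] | (cfg[0x03] << 8);
    uint16_t class_dev = cfg[0x0a] | (cfg[0x0b] << 8); /* sub-class | base class */
    uint8_t  revision  = cfg[0x08];

    printf("vendor=0x%04x device=0x%04x class=0x%04x rev=0x%02x\n",
           vendor, device, class_dev, revision);
    return 0;
}

For the ivshmem device above, this prints vendor=0x1af4 device=0x1110 class=0x0500.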

The organisation of the registers in the configuration space imposes a specific record structure on the first 256 bytes (PCI supports a 256-byte config space; PCI-X 2.0 and PCIe offer an extended config space of 4KB). The config space is divided into a predefined header region and a device-dependent region. The first 64 bytes form the standardized header region.

A few of the PCI configuration registers in the header region, as defined in include/uapi/linux/pci_regs.h, are shown below:

#define PCI_STD_HEADER_SIZEOF   64
#define PCI_VENDOR_ID       0x00    /* 16 bits */
#define PCI_DEVICE_ID       0x02    /* 16 bits */
#define PCI_COMMAND         0x04    /* 16 bits */
(...)
#define PCI_STATUS              0x06    /* 16 bits */
(...)
#define PCI_CLASS_REVISION      0x08    /* High 24 bits are class, low 8 revision */    
#define PCI_REVISION_ID     0x08    /* Revision ID */
#define PCI_CLASS_PROG      0x09    /* Reg. Level Programming Interface */
#define PCI_CLASS_DEVICE    0x0a    /* Device class */
(...)
#define PCI_BASE_ADDRESS_0  0x10    /* 32 bits */
#define PCI_BASE_ADDRESS_1  0x14    /* 32 bits [htype 0,1 only] */
#define PCI_BASE_ADDRESS_2  0x18    /* 32 bits [htype 0 only] */
#define PCI_BASE_ADDRESS_3  0x1c    /* 32 bits */
#define PCI_BASE_ADDRESS_4  0x20    /* 32 bits */
#define PCI_BASE_ADDRESS_5  0x24    /* 32 bits */
(...)
#define PCI_SUBSYSTEM_VENDOR_ID 0x2c
#define PCI_SUBSYSTEM_ID    0x2e
(...)
#define PCI_INTERRUPT_LINE  0x3c    /* 8 bits */
#define PCI_INTERRUPT_PIN   0x3d    /* 8 bits */

The Class Code identifies the generic function of the device, and in some cases, a specific register-level programming interface. The byte at offset 0x0b is a base class code which broadly classifies the type of function the device performs. The byte at offset 0x0a is a sub-class code which identifies more specifically the function of the device. The byte at offset 0x09 identifies a specific register-level programming interface (if any) so that device independent software can interact with the device. For the ivshmem device, the base class and sub-class codes are:

$ cat include/linux/pci_ids.h | grep CLASS | egrep '(\<0x05\>|0x0500)' 
#define PCI_BASE_CLASS_MEMORY       0x05
#define PCI_CLASS_MEMORY_RAM        0x0500

Other common class codes include:

$ less include/linux/pci_ids.h
(...)
#define PCI_BASE_CLASS_STORAGE          0x01
#define PCI_CLASS_STORAGE_SCSI          0x0100
#define PCI_CLASS_STORAGE_IDE           0x0101
(...)
#define PCI_BASE_CLASS_NETWORK          0x02
#define PCI_CLASS_NETWORK_ETHERNET      0x0200
(...)
#define PCI_BASE_CLASS_DISPLAY          0x03
#define PCI_CLASS_DISPLAY_VGA           0x0300  
#define PCI_CLASS_DISPLAY_XGA           0x0301
(...)
#define PCI_BASE_CLASS_COMMUNICATION    0x07
#define PCI_CLASS_COMMUNICATION_SERIAL  0x0700
#define PCI_CLASS_COMMUNICATION_PARALLEL 0x0701
(...)

BAR: I/O Spaces and Range Sizes

Central to the PCI device I/O addressing scheme are the Base Address Registers (BARs). A device driver uses these registers to determine the type, size, and location of I/O regions of the PCI device. BARs occupy a pre-defined portion of the PCI configuration space, from offset 0x10 through 0x24. PCI devices can have up to six 32-bit BARs. A 64-bit BAR consumes two consecutive 32-bit locations.

  • I/O Region Type

    A BAR may point to either a port I/O region or an MMIO region. Note that for architectures that support both port I/O and MMIO addressing, these two types of I/O regions lie in separate I/O spaces: port I/O access requires a special set of CPU instructions while MMIO access involves the same machine instructions as those for main memory access (MMIO address space is located within the range of processor addressable space for system memory).

    Port I/O spaces typically contain device registers while MMIO regions may hold device registers or device memory. Registers are typically used for device control or for obtaining device status. Device memory, on the other hand, could be used to support, say, a framebuffer for video. BARs that map to memory I/O space have bit 0 set to 0; otherwise, a value of 1 indicates a port I/O mapping.

    Further, an MMIO region may also be prefetchable, i.e. the device memory may be cached, with read operations on the MMIO region being satisfied from the cache rather than from device memory. In this case, devices hardwire bit 3 of the BAR to 1. Otherwise, for non-prefetchable mappings, this bit remains reset.

  • I/O Region Sizes

    A BAR associated with an MMIO region can be 32 or 64 bits wide (to support mapping into a 64-bit address space). In bits 2 and 1, a value of 00 means that the device's memory mapping can be located anywhere in a 32-bit address space; a value of 10 means it can be located anywhere within a 64-bit address space.

    BARs associated with port I/O regions are always 32 bits wide, and are limited to an I/O range size of 256 bytes.

From include/uapi/linux/pci_regs.h:

#define  PCI_BASE_ADDRESS_SPACE_IO  0x01
#define  PCI_BASE_ADDRESS_SPACE_MEMORY  0x00
#define  PCI_BASE_ADDRESS_MEM_TYPE_MASK 0x06
#define  PCI_BASE_ADDRESS_MEM_TYPE_32 0x00  /* 32 bit address */
#define  PCI_BASE_ADDRESS_MEM_TYPE_1M 0x02  /* Below 1M [obsolete] */
#define  PCI_BASE_ADDRESS_MEM_TYPE_64 0x04  /* 64 bit address */
#define  PCI_BASE_ADDRESS_MEM_PREFETCH  0x08  /* prefetchable? */

For illustration, compare the following instances of a graphics device configuration space output:

  • Performed on an Intel Core i3-2350M based laptop, 4GB

    $ lspci -v -s 00:02.0 -x
    00:02.0 VGA compatible controller: Intel Corporation 2nd Generation [...]
        Memory at d8000000 (64-bit, non-prefetchable) [size=4M]
        Memory at d0000000 (64-bit, prefetchable) [size=128M]
        I/O ports at 5000 [size=64]
        Kernel driver in use: i915
        Kernel modules: i915
    00: 86 80 16 01 07 04 90 00 09 00 00 03 00 00 00 00
    10: 04 00 00 d8 00 00 00 00 0c 00 00 d0 00 00 00 00
    20: 01 50 00 00 00 00 00 00 00 00 00 00 2d 15 72 08
    30: 00 00 00 00 90 00 00 00 00 00 00 00 0b 01 00 00
    
  • Performed on a qemu-system-x86_64 instance, 384MB

    $ lspci -v -s 00:02.0 -k -x
    00:02.0 VGA compatible controller: Cirrus Logic GD 5446
        [...]
        Memory at fc000000 (32-bit, prefetchable) [size=32M]
        Memory at febf0000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at febd0000 [disabled] [size=64K]
        Kernel driver in use: cirrus
        Kernel modules: cirrus, cirrusfb
    00: 13 10 b8 00 03 00 00 00 00 00 00 03 00 00 00 00
    10: 08 00 00 fc 00 00 bf fe 00 00 00 00 00 00 00 00
    20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
    30: 00 00 bd fe 00 00 00 00 00 00 00 00 00 00 00 00
    

A device will implement at least one BAR for device control operations. Device registers for these control operations may be port I/O or memory mapped. Some devices support both I/O spaces for their control functions, exposing separate BARs for each scheme. In such cases, the device driver may only engage one I/O space while the other remains unused. Graphics devices support at least two I/O ranges, one for device control operations and another for a frame buffer. Operations on the ivshmem device are similar to those on a frame buffer device. Ultimately, the semantics of I/O spaces and the sizes of I/O ranges remain device/hardware dependent.

For the ivshmem device:

$ lspci -v -s 00:04.0
00:04.0 RAM memory: Red Hat, Inc Device 1110
    Subsystem: Red Hat, Inc Device 1100
    Physical Slot: 4
    Flags: fast devsel, IRQ 11
    Memory at feb22000 (32-bit, non-prefetchable) [size=256]
    Memory at fea00000 (32-bit, non-prefetchable) [size=1M]

In QEMU, the monitor console interface can also be used to obtain similar info1:

(qemu) info pci
...
    Bus  0, device   4, function 0:
        RAM controller: PCI device 1af4:1110
            IRQ 11.
            BAR0: 32 bit memory at 0xfeb22000 [0xfeb220ff].
            BAR2: 32 bit memory at 0xfea00000 [0xfeafffff].
            id ""

As shown, BAR0 and BAR2 correspond to MMIO Region 0 (at 0xfeb22000, size 256 bytes) and MMIO Region 2 (at 0xfea00000, size 1MB), respectively. Depending on the QEMU command line options specified for the ivshmem device, BAR1 can also be made available; BAR1 is only present if Message Signaled Interrupts (MSI) are used.

The MMIO region associated with BAR0 contains four 32-bit registers. These registers are used for device control and status checks. Their byte offsets in the MMIO region are defined as follows (see hw/misc/ivshmem.c):

 /* registers for the Inter-VM shared memory device */
 enum ivshmem_registers {
         INTRMASK = 0,
         INTRSTATUS = 4,
         IVPOSITION = 8,
         DOORBELL = 12,
 };

QEMU's memory allocation for these registers remains within the private virtual address space of the VM instance and, therefore, these registers are not accessible from a separate process address space. For purposes of this guide, only the first two registers are used.
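As a hedged sketch (kernel context; the function name is mine), once BAR0 has been mapped with pci_iomap() as described under PCI Device Driver Specifics below, accesses to these two registers reduce to plain 32-bit MMIO reads and writes at the offsets above:

#include <linux/io.h>
#include <linux/pci.h>

/*
 * Illustrative only: `regs' is the __iomem cookie returned by
 * pci_iomap(pdev, 0, 0) for BAR0.
 */
static u32 ivshmem_unmask_and_read_status(void __iomem *regs)
{
        iowrite32(0xffffffff, regs + INTRMASK);  /* unmask all interrupt sources */
        return ioread32(regs + INTRSTATUS);      /* read pending interrupt status */
}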

On the other hand, the MMIO region associated with BAR2 translates to the POSIX SHM region on the host. This region can be shared between guests or some other external code on the host.

Summary

The more interesting details of the ivshmem configuration space (of this QEMU instance) can be summarised as:

Config Space Register       Byte Index (size)    Value
PCI_VENDOR_ID               0x00 (16 bits)       0x1af4
PCI_DEVICE_ID               0x02 (16 bits)       0x1110
PCI_CLASS_DEVICE            0x0a (16 bits)       0x0500
PCI_BASE_ADDRESS_0          0x10 (32 bits)       0xfeb22000
PCI_BASE_ADDRESS_2          0x18 (32 bits)       0xfea00000
PCI_SUBSYSTEM_VENDOR_ID     0x2c (16 bits)       0x1af4
PCI_SUBSYSTEM_ID            0x2e (16 bits)       0x1100
PCI_INTERRUPT_LINE          0x3c (8 bits)        0x0b
PCI_INTERRUPT_PIN           0x3d (8 bits)        0x01

The value of the PCI_INTERRUPT_LINE register tells which input of the system's interrupt controller(s) the device's interrupt pin is connected to. The PCI_INTERRUPT_PIN tells which interrupt pin the device (or device function) uses; values of 1, 2, 3 and 4 correspond to the INTA, INTB, INTC and INTD hardware interrupt pins, respectively.

PCI Region Access w/o Device Driver

A userland program can still access a PCI device's configuration space, and the regions associated with its BARs, even when the device's driver is absent or has not yet been loaded. This can be quite useful for preliminary checks or troubleshooting. Nevertheless, functionality with this mode of device access remains very limited. lspci(1) already provides some of this functionality. An example program that reads/writes a PCI device's I/O ports and MMIO regions is presented in /dev/mem and mmap(2).

PCI Device Driver Specifics

PCI Device Initialization

From Documentation/PCI/pci.txt, the following is the general flow when writing initialization code for a PCI device:

  • Register the device driver and find the device.

  • Enable the device.

  • Request MMIO/IOP resources

  • Set the DMA mask size (for both coherent and streaming DMA)

  • Allocate and initialize shared control data.

  • Access device configuration space (if needed)

  • Register IRQ handler.

  • Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)

  • Enable DMA/processing engines.

Unfortunately, since the ivshmem virtual PCI device was not designed to support DMA operations, the DMA related steps above will be skipped.

Unless noted otherwise, the structures and functions described below are defined in <linux/pci.h> and drivers/pci/pci.c.

Driver Registration

  • At bare minimum, initialize struct pci_driver with the driver's name, the list of PCI devices the driver supports, and its callback methods for the PCI core. The driver name must be unique among all PCI drivers in the kernel. It will appear under /sys/bus/pci/drivers. Essential callback methods are probe and remove. The types of supported devices are listed in struct pci_device_id (<linux/mod_devicetable.h>):

    struct pci_device_id {
        __u32 vendor, device;       /* Vendor and device ID or PCI_ANY_ID*/
        __u32 subvendor, subdevice; /* Subsystem ID's or PCI_ANY_ID */
        __u32 class, class_mask;    /* (class,subclass,prog-if) triplet */
        kernel_ulong_t driver_data; /* Data private to the driver */
    };
    

    The PCI_DEVICE(vendor, device) and PCI_DEVICE_CLASS(device_class, device_class_mask) macros may be used to initialize the respective fields of struct pci_device_id. To facilitate module loading and the hotplug mechanisms, export struct pci_device_id to userspace via the MODULE_DEVICE_TABLE macro.

  • Register struct pci_driver with the PCI layer:

    int pci_register_driver(struct pci_driver *drv);
    

    Invoking this function initiates probing for the device in the underlying PCI layer. This function returns 0 upon success. Otherwise, if something went wrong, a negative value (error code) is returned.
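Putting these registration pieces together, a minimal skeleton might look as follows. This is only a sketch: the ne_ivshmem_* names and the module boilerplate are illustrative, with the vendor and device IDs taken from the configuration space discussion above.

#include <linux/module.h>
#include <linux/pci.h>

/* ivshmem IDs, as per docs/specs/pci-ids.txt of the QEMU sources */
#define IVSHMEM_VENDOR_ID 0x1af4
#define IVSHMEM_DEVICE_ID 0x1110

static const struct pci_device_id ne_ivshmem_id_table[] = {
        { PCI_DEVICE(IVSHMEM_VENDOR_ID, IVSHMEM_DEVICE_ID) },
        { 0, },
};
MODULE_DEVICE_TABLE(pci, ne_ivshmem_id_table);

static int ne_ivshmem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* device initialization steps; see "The probe Method" below */
        return 0;
}

static void ne_ivshmem_remove(struct pci_dev *pdev)
{
        /* device shutdown steps; see "The remove Method" below */
}

static struct pci_driver ne_ivshmem_driver = {
        .name     = "ne_ivshmem",
        .id_table = ne_ivshmem_id_table,
        .probe    = ne_ivshmem_probe,
        .remove   = ne_ivshmem_remove,
};

static int __init ne_ivshmem_init(void)
{
        return pci_register_driver(&ne_ivshmem_driver);
}

static void __exit ne_ivshmem_exit(void)
{
        pci_unregister_driver(&ne_ivshmem_driver);
}

module_init(ne_ivshmem_init);
module_exit(ne_ivshmem_exit);
MODULE_LICENSE("GPL");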

The probe Method

When the PCI core finds a matching PCI device during the probing process, it prepares a struct pci_dev object that describes the PCI device. Eventually, it invokes the device driver's probe callback. The probe method typically handles the following initialization steps:

  • Enable the PCI device

    int pci_enable_device(struct pci_dev *dev);
    

    If called for the first time, this function enables the device. This may involve assigning I/O, memory and an interrupt line for the device. It wakes up the device if it was suspended. Otherwise, it simply increments the usage count for the device. It returns 0 on success; a negative value (error code) otherwise.

  • Mark ownership of the region

    int pci_request_region(struct pci_dev *dev, int bar, const char *res_name);
    

    This function marks the PCI region associated with BAR bar of PCI device dev as reserved by owner res_name. Do not access any address inside the PCI regions unless this call returns successfully. This function returns 0 on success and -EBUSY on error.

  • If need be, access the PCI configuration space with:

    int pci_read_config_[byte|word|dword](struct pci_dev *dev,
                                          int offset, [u8|u16|u32] *value);
    

    and

    int pci_write_config_[byte|word|dword](struct pci_dev *dev,
                                           int offset, [u8|u16|u32] value);
    

    For example, to read the IRQ number assigned to the device,

    pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &irq);
    

    where PCI_INTERRUPT_LINE is defined in include/linux/pci_regs.h.

  • Access PCI BAR info for the address, length and flags associated with the port I/O or MMIO regions:

    resource_size_t start, len, end;
    unsigned long flags;
    
    start = pci_resource_start(dev, bar);
    len  = pci_resource_len(dev, bar);
    end = pci_resource_end(dev, bar);
    flags = pci_resource_flags(dev, bar);
    

    where

    • dev is a pointer of type struct pci_dev
    • bar is the BAR number (an int)
    • resource_size_t is typedef'd in include/linux/types.h.

  • Obtain CPU access to the device port I/O space or MMIO regions

    The function pci_iomap (lib/pci_iomap.c) creates a virtual mapping cookie for a PCI BAR.

    void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen);
    

    maxlen specifies the maximum length to map. To get access to the complete BAR without checking for its length, specify 0.

    This function provides an abstraction such that both I/O port and MMIO (both prefetchable (cacheable) and non-prefetchable) access works transparently:

    void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
    {
        resource_size_t start = pci_resource_start(dev, bar);
        resource_size_t len = pci_resource_len(dev, bar);
        unsigned long flags = pci_resource_flags(dev, bar);
    
        if (!len || !start)
            return NULL;
        if (maxlen && len > maxlen)
            len = maxlen;
        if (flags & IORESOURCE_IO)
            return __pci_ioport_map(dev, start, len);
        if (flags & IORESOURCE_MEM) {
            if (flags & IORESOURCE_CACHEABLE)
                return ioremap(start, len);
            return ioremap_nocache(start, len);
        }
        /* What? */
        return NULL;
    }
    EXPORT_SYMBOL(pci_iomap);
    

    The ioread*() and iowrite*() functions (see include/asm-generic/iomap.h, lib/iomap.c, etc) can now be used transparently for both port I/O and MMIO access with the returned __iomem address; the plain read*() and write*() accessors (arch/ARCH/include/asm/io.h) apply only to MMIO mappings.

    There exist other less generic mapping functions, e.g. pci_ioremap_bar for mapping non-prefetchable device memory.

  • Register IRQ handler

    Many PCI devices - and indeed, ivshmem - can support both traditional pin-based interrupts and Message Signaled Interrupts (MSI). Such support is a requirement for PCI-X and PCIe devices. Pin-based interrupts will be covered here3.

    Setting aside device-specific semantics, such as programming an interrupt mask control register, registering a pin-based interrupt handler is simply a matter of calling:

    int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev);
    

    See include/linux/interrupt.h for its declaration. This function also enables the interrupt line. According to Documentation/PCI/pci.txt, all interrupt handlers for pin-based IRQ lines should be registered with the flags parameter set to IRQF_SHARED: pin-based interrupts are often shared amongst several devices, so the kernel must call each interrupt handler associated with the interrupt. request_irq will associate the interrupt handler handler and the device handle dev with the interrupt number irq. Since the kernel passes the opaque dev pointer back to the handler on each invocation, the handler can identify its device and then check a device status register (e.g. ivshmem's INTRSTATUS) to verify whether its device actually raised the interrupt.
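The following hedged sketch stitches these probe steps together for the ivshmem device. Error unwinding is kept minimal, the file-scope cookie variables and the ne_ivshmem_* names are assumptions of this sketch, and pci_request_regions() is used to reserve all BARs at once; see ne_ivshmem_ldd_basic.c for the complete driver.

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/pci.h>

static void __iomem *regs_base;         /* BAR0: control/status registers */
static void __iomem *data_base;         /* BAR2: shared memory region */

static irqreturn_t ne_ivshmem_interrupt(int irq, void *dev_id)
{
        u32 status = ioread32(regs_base + INTRSTATUS);

        if (!status)            /* not ours: the IRQ line is shared */
                return IRQ_NONE;

        pr_info("ne_ivshmem: interrupt (status = 0x%04x)\n", status);
        return IRQ_HANDLED;
}

static int ne_ivshmem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int err;

        err = pci_enable_device(pdev);
        if (err)
                return err;

        /* reserve all BARs in one go (pci_request_region() per BAR also works) */
        err = pci_request_regions(pdev, "ne_ivshmem");
        if (err)
                goto out_disable;

        regs_base = pci_iomap(pdev, 0, 0);      /* register region */
        data_base = pci_iomap(pdev, 2, 0);      /* shared memory region */
        if (!regs_base || !data_base) {
                err = -ENOMEM;
                goto out_unmap;
        }

        /* pin-based IRQ: the line may be shared with other devices */
        err = request_irq(pdev->irq, ne_ivshmem_interrupt, IRQF_SHARED,
                          "ne_ivshmem", pdev);
        if (err)
                goto out_unmap;

        iowrite32(0xffffffff, regs_base + INTRMASK);    /* unmask device interrupts */
        return 0;

out_unmap:
        if (regs_base)
                pci_iounmap(pdev, regs_base);
        if (data_base)
                pci_iounmap(pdev, data_base);
        pci_release_regions(pdev);
out_disable:
        pci_disable_device(pdev);
        return err;
}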

PCI Device Shutdown

Unregister the Driver

The following function should be called when unloading the driver.

void pci_unregister_driver(struct pci_driver *drv);

This function does not return until the PCI core invokes the driver's remove callback method.

The remove Method

From Documentation/PCI/pci.txt, the following sequence should be followed (wherever applicable) when shutting down the PCI device:

  • Disable the device from generating IRQs
  • Release the IRQ (free_irq)
  • Stop all DMA activity
  • Release DMA buffers (both streaming and consistent).
  • Unregister from other subsystems (e.g. ALSA, framebuffer, SCSI, etc)
  • Disable device from responding to MMIO/IO port addresses (pci_iounmap, pci_disable_device)
  • Release MMIO/Port resources (pci_release_region)
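Applied to the probe sketch above, a matching remove method might look as follows (a sketch only; ivshmem has no DMA engines or other subsystems to shut down):

static void ne_ivshmem_remove(struct pci_dev *pdev)
{
        iowrite32(0, regs_base + INTRMASK);     /* stop the device from raising IRQs */
        free_irq(pdev->irq, pdev);              /* release the (shared) IRQ line */
        pci_iounmap(pdev, data_base);
        pci_iounmap(pdev, regs_base);
        pci_release_regions(pdev);
        pci_disable_device(pdev);
}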

Demo Programs, Guest-Side

With respect to program build, the usual runtime considerations, particularly the guest environment C library and kernel version, must be observed2. The simplest and most straightforward approach would be to build the sources in the VM environment.

  • ne_ivshmem_ldd_basic.c

    This is a skeletal version of the original kvm_ivshmem.c "standard" device driver. It mainly features:

    • Interrupt handling for pin-based interrupts

    • The mmap file operation method

      Operations on the ivshmem device are similar to those on graphics memory. While read(2), write(2), lseek(2), ioctl(2), etc may be used by a userspace application to access the device's MMIO data region (via the corresponding device driver file operations), mmap(2) is used here. The mmap(2) system call memory maps the ivshmem device's MMIO data region into an area of the calling process' virtual address space. Upon successful mmap(2) completion, operations on this mapped memory region become a simple matter of pointer referencing. In other words, the userspace application is given direct access to the ivshmem MMIO data region. The overhead (context switching, kernel-user buffer copies, etc) associated with read, write, ioctl, etc operations is now avoided. This translates to a dramatic improvement in throughput for high performance applications.

      Implementing userspace mmap(2) support in the ivshmem device driver can be done with the conventional mmap file operation or by way of the User I/O (UIO) framework. The code presented here implements the conventional mmap file operation method; a minimal sketch of such a method is shown after this list. A discussion on mmaping device MMIO regions can be found here. Check out UIO device driver example for a case study of UIO with ivshmem.

  • ne_ivshmem_shm_guest_usr.c

    A demo userspace program that updates/reads the MMIO region associated with BAR2 of the ivshmem device.
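As referenced above, a minimal sketch of the conventional mmap file operation might look as follows. The data_mmio_start/data_mmio_len variables and the ne_ivshmem_* names are assumptions of this sketch (the real values come from pci_resource_start()/pci_resource_len() on BAR2 during probe); see ne_ivshmem_ldd_basic.c for the actual implementation.

#include <linux/fs.h>
#include <linux/mm.h>

static unsigned long data_mmio_start;   /* BAR2 physical/bus address */
static unsigned long data_mmio_len;     /* BAR2 region length */

static int ne_ivshmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
        unsigned long len = vma->vm_end - vma->vm_start;

        if (len > data_mmio_len)
                return -EINVAL;

        /* device memory: disable caching for the userspace mapping */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        return io_remap_pfn_range(vma, vma->vm_start,
                                  data_mmio_start >> PAGE_SHIFT,
                                  len, vma->vm_page_prot);
}

static const struct file_operations ne_ivshmem_fops = {
        .owner = THIS_MODULE,
        .mmap  = ne_ivshmem_mmap,
        /* open, release, etc. omitted */
};

A userspace process then simply open(2)s /dev/ivshmem0 and mmap(2)s it, which is what ne_ivshmem_shm_guest_usr does.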

Demo Programs, Host-side

The ivshmem server

To test IRQ generation and handling with an ivshmem device, services of an ivshmem server will be required. This program will provide centralized management for the ivshmem-enabled QEMU VMs or any other standalone ivshmem client programs on the host. Essentially, the ivshmem server will manage file-descriptor information for the UNIX domain socket IPC, eventfd(2) notification and the common region shared among the clients.

The ivshmem client-server protocol changed with QEMU v2.5 (or thereabouts) and breaks compatibility with (ad-hoc) ivshmem client programs that worked with previous QEMU versions. See docs/specs/ivshmem-spec.txt.

Pre-v2.5 QEMU

For tests with pre-v2.5 QEMU setups, consider:

  • ne_ivshmem_send_qeventfd.c

    A very ad-hoc and quite crippled ivshmem client program that periodically sends eventfd(2) notification to an ivshmem-server to generate IRQs on an ivshmem-enabled QEMU VM. It supports only one QEMU VM instance.

    To build this program:

    $ gcc -Wall -O2 ne_ivshmem_send_qeventfd.c -o ne_ivshmem_send_qeventfd
    
  • ivshmem-server.c, send_scm.{c,h}

    An ivshmem server program. Its sources were obtained from the Nahanni package by Cam Macdonell. Note that this ivshmem-server.c file may include one or two hacks to suit the current setup.

Post-v2.5 QEMU (Recommended)

The contrib/{ivshmem-client/,ivshmem-server/} directories of the QEMU tree contain sample programs that are compatible with the current client-server protocol. No special QEMU configuration options are required and they should automatically get built and installed along with any qemu-system-$ARCH binary that supports ivshmem, e.g.:

$ ./configure --target-list=i386-softmmu,x86_64-softmmu --enable-kvm --prefix=${QEMU_INSTALL}
$ make -jN
$ make install
$ ls ${QEMU_INSTALL}/bin/
ivshmem-client  qemu-ga   qemu-io   qemu-system-i386
ivshmem-server  qemu-img  qemu-nbd  qemu-system-x86_64

Execution of these programs requires no special privileges.

Utility Progs

The ne_ivshmem_shm_host_usr.c program is a stand-alone demo program that updates/reads the POSIX SHM region on the host. Note that use of this program does not require the services of the ivshmem server.

Build with:

$ gcc -Wall -O2 ne_ivshmem_shm_host_usr.c -o ne_ivshmem_shm_host_usr -lrt
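For reference, the essence of such host-side access boils down to the following hedged sketch (the SHM object name and size match the QEMU command lines used in this entry; ne_ivshmem_shm_host_usr itself does more, e.g. option parsing and write support):

/* Minimal host-side POSIX SHM read sketch; build with -lrt as above */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/ivshmem"             /* i.e. /dev/shm/ivshmem */
#define SHM_SIZE (1UL << 20)            /* 1MB, as given with size=1 */

int main(void)
{
        int fd = shm_open(SHM_NAME, O_RDWR, 0);
        if (fd < 0) {
                perror("shm_open");
                return EXIT_FAILURE;
        }

        char *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* assumes the guest wrote a NUL-terminated string at offset 0 */
        printf("read \"%s\"\n", p);

        munmap(p, SHM_SIZE);
        close(fd);
        return 0;
}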

Testing POSIX SHM update

Boot a QEMU instance with:

$ qemu-system-x86_64 -enable-kvm ... -device ivshmem,shm=ivshmem,size=1
  • NOTE: In order to use -device ivshmem-plain with ne_ivshmem_ldd_basic.c, remove all IRQ-related Linux API calls from the driver since this device does not support IRQs.

In the guest VM:

you@vm$ sudo insmod ne_ivshmem_ldd_basic.ko

At this point, udev should have automatically created the ivshmem special device file:

you@vm$ ls -l /dev/ivshmem0
crw------- 1 root root 250, 0 /dev/ivshmem0

If not, create it manually e.g.

you@vm$ cat /proc/devices | grep ivshmem
250 ivshmem

you@vm$ sudo mknod -m 600 /dev/ivshmem0 c 250 0

Run a write operation test on the ivshmem MMIO data region.

you@vm$ sudo ./ne_ivshmem_shm_guest_usr -w "Dunia, vipi?"
main:169:: writing "Dunia, vipi?"

Now, from an xterm(1) on the host machine, the following preliminary check may be executed:

$ hd /dev/shm/ivshmem 
00000000  44 75 6e 69 61 2c 20 76  69 70 69 3f 00 00 00 00  |Dunia, vipi?....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00100000

otherwise,

$ ./ne_ivshmem_shm_host_usr 
main:171:: read "Dunia, vipi?"

Note that both ne_ivshmem_shm_guest_usr and ne_ivshmem_shm_host_usr accept a number of options to modify their default settings. Run the programs with the -h option to view their usage.

Testing IRQ generation and handling

For this setup, services of ivshmem-server will be required.

Compatibility Note

For traditional pin-based interrupts, command line compatibility has been maintained4. For example:

-chardev socket,path=/tmp/ivshmem_socket,id=ivshmemid \
-device ivshmem,chardev=ivshmemid,size=1,msi=off

will continue to work with post-v2.5 QEMU.

The -chardev option5 is used to specify the UNIX domain socket used for IPC with ivshmem-server. Notice the omission of the shm=$SHMOBJ device property since ivshmem-server now manages the file-descriptor for the region shared among its clients.

Pre-v2.5 QEMU

Fire up ivshmem_server with the default settings:

$ ./ivshmem_server
listening socket: /tmp/ivshmem_socket
shared object: ivshmem
shared object size: 1048576 (bytes)
vm_sockets (0) =

Waiting (maxfd = 4)

Note that this program also accepts several options to adjust its default settings. Run the program with the -h option to view the available options.

Launch the pre-v2.5 QEMU with:

$ qemu-system-x86_64 -enable-kvm ... \
    -chardev socket,path=/tmp/ivshmem_socket,id=ivshmemid \
    -device ivshmem,chardev=ivshmemid,size=1,msi=off

The following messages should get displayed by ivshmem_server as the QEMU instance connects:

[NC] new connection
increasing vm slots
[NC] Live_vms[0]
    efd[0] = 6
[NC] trying to send fds to new connection
[NC] Connected (count = 0).
Live_count is 1
vm_sockets (1) = [5|6]

Waiting (maxfd = 5)

Once the QEMU guest boots up, log in and execute:

you@vm$ sudo sh -c "echo 8 > /proc/sys/kernel/printk"
you@vm$ sudo insmod ne_ivshmem_ldd_basic.ko
[  106.635224] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
[  106.637666] ivshmem 0000:00:04.0: data_mmio iomap base = 0xffffc90000700000 
[  106.639503] ivshmem 0000:00:04.0: data_mmio_start = 0xfea00000 data_mmio_len = 1048576
[  106.641867] ivshmem 0000:00:04.0: regs iomap base = 0xffffc9000053c000, irq = 10
[  106.645141] ivshmem 0000:00:04.0: regs_addr_start = 0xfeb22000 regs_len = 256

Then, from an xterm(1) on the host, initiate periodic eventfd(2) notification to the QEMU guest in order to generate IRQs on the ivshmem device:

$ ./ne_ivshmem_send_qeventfd
process_server_msg:87:: Our ID = 1
check_shm_size:71:: SHM size: 1048576
process_server_msg:114:: Using fd 7 for eventfd to VM ID 0
main:316:: sending eventfd to VM ID 0 using fd = 7
process_server_msg:120:: Our eventfd fd 9
main:316:: sending eventfd to VM ID 0 using fd = 7
main:316:: sending eventfd to VM ID 0 using fd = 7
main:316:: sending eventfd to VM ID 0 using fd = 7
main:316:: sending eventfd to VM ID 0 using fd = 7

and ivshmem_server should now display:

[NC] new connection
[NC] Live_vms[1]
    efd[0] = 8
[NC] trying to send fds to new connection
[NC] Connected (count = 1).
[UD] sending fd[1] to 0
    efd[0] = [8]
Live_count is 2
vm_sockets (2) = [5|6] [7|8]

Waiting (maxfd = 7)

On the VM, the following messages will get displayed as its ivshmem device receives IRQs via the eventfd(2) mechanism:

[  389.889621] ivshmem_interrupt:71:: interrupt (status = 0x0002)
[  390.890675] ivshmem_interrupt:71:: interrupt (status = 0x0001)
[  391.891999] ivshmem_interrupt:71:: interrupt (status = 0x0001)
[  392.893097] ivshmem_interrupt:71:: interrupt (status = 0x0001)

Typing CTRL+C against the ne_ivshmem_send_qeventfd foreground process will kill it. This will immediately terminate eventfd(2) notification to the QEMU guest, and ivshmem_server should report:

[DC] recv returned 0
[UD] sending kill of fd[1] to 0
Killing posn 1
rv is 0
vm_sockets (2) = [5|6]

Waiting (maxfd = 7)

Post-v2.5 QEMU

Launch the ivshmem-server. The -F (foreground) and -v (verbose) options are to facilitate debugging:

$ ivshmem-server -F -v
*** Example code, do not use in production ***
Using POSIX shared memory: ivshmem
create & bind socket /tmp/ivshmem_socket

By default, this program performs a shm_open(3) on the /dev/shm/ivshmem RAM-based POSIX SHM object. It also supports using an ordinary (disk) file for the memory mapping. The file descriptor will get passed to a connecting client, e.g. an ivshmem-enabled QEMU instance, which will then perform the actual mmap(2) operation.

  • NOTE: The following command may have to be executed:

    $ rm /tmp/ivshmem_socket
    

    to delete any stale UNIX domain socket that might be present prior to ivshmem-server execution. Otherwise, the program might complain and exit with:

    $ ./ivshmem-server -F -v
    *** Example code, do not use in production ***
    Using POSIX shared memory: ivshmem
    create & bind socket /tmp/ivshmem_socket
    cannot connect to /tmp/ivshmem_socket: Address already in use
    cannot bind
    

    Specify -h (help) to view available options.

Launch a post-v2.5 QEMU:

$ qemu-system-x86_64 -enable-kvm ... \
 -chardev socket,path=/tmp/ivshmem_socket,id=ivshmemid \
 -device ivshmem,chardev=ivshmemid,size=1,msi=off

At this point, ivshmem-server should report something like:

accept()=5
new peer id = 0
peer->sock_fd=5

Log in to the guest and load ne_ivshmem_ldd_basic.ko:

you@vm$ sudo sh -c "echo 8 > /proc/sys/kernel/printk"
you@vm$ sudo insmod ne_ivshmem_ldd_basic.ko
[   66.424979] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[   66.426990] ivshmem 0000:00:04.0: data_mmio iomap base = 0xffffc90000400000 
[   66.429297] ivshmem 0000:00:04.0: data_mmio_start = 0xfe800000 data_mmio_len = 1048576
[   66.431954] ivshmem 0000:00:04.0: regs iomap base = 0xffffc900001d6000, irq = 11
[   66.434453] ivshmem 0000:00:04.0: regs_addr_start = 0xfebf1000 regs_len = 256

Now execute the ivshmem-client. The -v (verbose) option is to facilitate debugging. Specify -h (help) to view available options:

$ ivshmem-client -v
dump: dump peers (including us)
int <peer> <vector>: notify one vector on a peer
int <peer> all: notify all vectors of a peer
int all: notify all vectors of all peers (excepting us)
cmd> connect to client /tmp/ivshmem_socket
our_id=1
shm_fd=4
listen on server socket 3
new peer id = 0
  new vector 0 (fd=5) for peer id 0
  new vector 0 (fd=6) for peer id 1

Press RETURN to display the ivshmem-client command prompt:

cmd>

Upon ivshmem-client connection, ivshmem-server should report something like:

accept()=7
new peer id = 1
peer->sock_fd=5
peer->sock_fd=7

On the ivshmem-client terminal, type dump to view client id and vector info, e.g.:

cmd> dump
our_id = 1
  vector 0 is enabled (fd=6)
peer_id = 0
  vector 0 is enabled (fd=5)

Enter int <peer> <vector> commands to send eventfd(2) notifications to the QEMU instance, e.g.:

cmd> int 0 0
notify peer 0 on vector 0, fd 5

Back in the guest environment, each int <peer> <vector> command on the ivshmem-client terminal should result in a ne_ivshmem_ldd_basic.ko IRQ message similar to:

[   84.734043] ivshmem_interrupt:71:: interrupt (status = 0x0001)

A SIGINT (i.e. CTRL+C) on the ivshmem-client terminal should result in the following disconnection message on the ivshmem-server terminal:

peer->sock_fd=5
peer->sock_fd=7
free peer 1

With this setup still running, a guest power down, for example:

you@vm$ sudo shutdown -h now

followed by QEMU exit will cause ivshmem-server to report:

peer->sock_fd=5
free peer 0

Send a SIGINT to kill the ivshmem-server instance. Delete the stale UNIX domain socket:

$ rm /tmp/ivshmem_socket

Also See

Resources and Further Reading

  • Cam Macdonell's Nahanni code base

  • Linux-3.x

  • Books

    • Essential Linux Device Drivers (ELDD), Sreekrishnan Venkateswaran, Prentice Hall.

    • Linux Device Drivers (LDD), 3rd Edition, Greg Kroah-Hartman et al., O'Reilly Media, Inc. (Available Online).

  • PCI Specifications

    • PCI Local Bus Specification, Revision 3.0

    • PCI Express 2.0 Base Specification, Revision 0.9

Footnotes

1. See QEMU Monitor Console for usage tips. [go back]

2. See Toolchain Intro for a discussion on toolchain and runtime host considerations. [go back]

3. MSI example still not yet included!? Bwana Siro please fix this ... you are cramping nairobi-embedded's style. [go back]

4. For MSI, either msi=on (default) for legacy -device ivshmem, or -device ivshmem-doorbell, required ... and so is a case-study -- can Bwana Siro please get this fixed ASAP? [go back]

5. See QEMU Command line: Character devices for an introduction to -chardev and -device usage. [go back]