Author: Mugabi Siro

Category: Device Drivers

Summary:

This entry describes mmaping device memory mapped I/O (MMIO) regions and Direct Memory Access (DMA) memory. A number of qemu-system virtual devices are considered here as the hardware platforms. References to Linux header files and sources, e.g. include/linux/mm_types.h, mm/memory.c, etc are with respect to the root of the kernel source tree. Based on Linux 3.x.

Tags: linux qemu arm pci

Preliminaries

To address the problem of a universally accessible hardware platform, the QEMU machine emulator is used here. However, keep in mind that machine emulators are not a replacement for physical hardware: Virtual Machine (VM) implementations may not fully or accurately model certain aspects of the physical target. Nevertheless, VM environments present a convenient platform for prototyping and are an ideal setup for a demo, training or teaching session. See QEMU Intro for an introduction to the QEMU machine emulator. The host machine used was Ubuntu 12.04.

Virtual Memory Areas (VMAs)

In Linux, a process' (sparsely populated) linear address space is organised in sets of virtual memory areas (VMAs). Each VMA is a contiguous chunk of related and allocated pages. The code segment and data segment [1] of each module of the application (i.e. the executable and its shared object dependencies), the heap area, and the user stack are all distinct VMAs. For instance:

$ cat /proc/self/maps 
00400000-0040b000 r-xp 00000000 08:08 884809                             /bin/cat
0060a000-0060b000 r--p 0000a000 08:08 884809                             /bin/cat
0060b000-0060c000 rw-p 0000b000 08:08 884809                             /bin/cat
010a0000-010c1000 rw-p 00000000 00:00 0                                  [heap]
7fbf48526000-7fbf48c09000 r--p 00000000 08:08 1040385                    /usr/lib/locale/locale-archive
7fbf48c09000-7fbf48dbe000 r-xp 00000000 08:08 1321013                    /lib/x86_64-linux-gnu/libc-2.15.so
7fbf48dbe000-7fbf48fbe000 ---p 001b5000 08:08 1321013                    /lib/x86_64-linux-gnu/libc-2.15.so
[...]           [...]           [...]           [...]                                   [...]
7fbf491eb000-7fbf491ec000 r--p 00022000 08:08 1321061                    /lib/x86_64-linux-gnu/ld-2.15.so
7fbf491ec000-7fbf491ee000 rw-p 00023000 08:08 1321061                    /lib/x86_64-linux-gnu/ld-2.15.so
7fffc3456000-7fffc3477000 rw-p 00000000 00:00 0                          [stack]
7fffc35ff000-7fffc3600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

and

[root@buildroot ~]# cat /proc/1021/maps 
08048000-080ef000 r-xp 00000000 08:02 54654                              /root/ne_ivshmem_shm_guest_usr
080ef000-080f1000 rw-p 000a6000 08:02 54654                              /root/ne_ivshmem_shm_guest_usr
080f1000-080f3000 rw-p 00000000 00:00 0 
085a9000-085cb000 rw-p 00000000 00:00 0                                  [heap]
f77a7000-f77a8000 rw-p 00000000 00:00 0 
f77a8000-f77a9000 rw-s 00001000 08:02 77808                              /dev/ivshmem0
f77a9000-f77aa000 r-xp 00000000 00:00 0                                  [vdso]
ffd29000-ffd4a000 rw-p 00000000 00:00 0                                  [stack]

Basically, each line corresponds to a VMA. The fields in each line are:

start-end permissions offset major:minor inode image

where:

  • start-end: start and end addresses of the VMA.

  • permissions: r (read), w (write), and x (execute). The p (private) and s (shared) flags indicate the type of mapping.

  • offset: the offset into the underlying object where the VMA mapping begins.

  • major:minor: the major and minor number pairs of the device holding the file that has been mapped. Note, for device mappings e.g. /dev/ivshmem0 above, this number pair refers to the disk partition holding the opened device special file rather than the device's major and minor:

    [root@buildroot ~]# ls -l /dev/ivshmem0 
    crw-r--r--    1 root     root      248,   0  /dev/ivshmem0
    
  • inode: the inode number of the mapped file.

  • image: the name of the mapped file.
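The field layout above can be picked apart with a single sscanf call. The following is only a minimal userspace sketch: the struct maps_entry type and the parse_maps_line helper are hypothetical names introduced for illustration, and an image name that contains spaces would be cut at the first space.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of /proc/PID/maps (hypothetical helper type). */
struct maps_entry {
    unsigned long long start, end, offset;
    unsigned long inode;
    unsigned int major, minor;   /* printed in hexadecimal by the kernel */
    char perms[5];               /* e.g. "rw-s" plus the terminating NUL */
    char image[256];             /* left empty for anonymous mappings */
};

/*
 * Parse "start-end permissions offset major:minor inode [image]".
 * Returns 0 on success, -1 if the mandatory fields are missing.
 */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
    int n;

    assert(line != NULL && e != NULL);
    e->image[0] = '\0';
    n = sscanf(line, "%llx-%llx %4s %llx %x:%x %lu %255s",
               &e->start, &e->end, e->perms, &e->offset,
               &e->major, &e->minor, &e->inode, e->image);
    return n >= 7 ? 0 : -1;
}
```

Feeding it the /dev/ivshmem0 line from the listing above should yield a one page (0x1000 byte) shared mapping at an offset of one page into the device.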

Each existing (allocated) virtual page is contained in some area, and any virtual page that is not part of some VMA does not exist and cannot be referenced by the process.

Now, the kernel maintains a distinct task_struct (see include/linux/sched.h) for each process. This structure contains all the information the kernel requires to run a process. Members of type struct mm_struct (see include/linux/mm_types.h) in task_struct characterize the current state of a process' virtual memory. The mm_struct structure contains a struct vm_area_struct *mmap field which points to the list of the process' VMAs. A vm_area_struct structure (see include/linux/mm_types.h) describes a virtual memory area. Each kind of VMA (code and data segments, heap areas, etc.) is treated by the page-fault handlers according to its own rules.
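To make the "an address either falls inside some VMA or does not exist" rule concrete, here is a small userspace model of walking such a list. It is a sketch only: struct vma_model and lookup_vma are hypothetical stand-ins, not the kernel's vm_area_struct or its lookup code, and the list is assumed to be sorted by address.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the few VMA fields used here. */
struct vma_model {
    unsigned long vm_start;      /* first address inside the area */
    unsigned long vm_end;        /* first address after the area */
    struct vma_model *vm_next;   /* next area, sorted by address */
};

/* Return the area containing addr, or NULL if addr falls in a hole. */
static struct vma_model *lookup_vma(struct vma_model *head, unsigned long addr)
{
    struct vma_model *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next) {
        assert(vma->vm_start < vma->vm_end);
        if (addr < vma->vm_start)
            break;               /* sorted list: addr lies in a gap */
        if (addr < vma->vm_end)
            return vma;
    }
    return NULL;
}
```

Note that the end address is exclusive, which matches the start-end ranges printed in /proc/PID/maps: with the heap and /dev/ivshmem0 areas of the second listing, 0xf77a8fff belongs to the device mapping while 0xf77a9000 does not.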

The mmap File Operation

To support mmap(2), a driver defines a mmap file operation. A pointer to the function's prototype is declared in struct file_operations (include/linux/fs.h) to the following effect:

int (*mmap)(struct file *filp, struct vm_area_struct *vma);

where:

  • filp is a pointer to the Linux structure that represents the open device file for the driver.

  • vma is a pointer to the structure that describes the newly allocated VMA for memory mapping. This is the process' VMA and is allocated by the kernel. The following is the relationship between the mmap(2) parameters:

    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
    

    and the relevant vma fields:

    • addr influences the virtual address assignment of vma->vm_start (the starting address of the VMA). If NULL was specified, then the kernel will assign a virtual address of its choice.

    • length determines the size of the VMA, or more precisely, the number of the pages to be allocated for memory mapping. In other words, it influences the kernel's assignment of vma->vm_end:

      • If length <= PAGE_SIZE, then vma->vm_end - vma->vm_start == PAGE_SIZE

      • If PAGE_SIZE < length <= (N * PAGE_SIZE), then vma->vm_end - vma->vm_start == (N * PAGE_SIZE), where N is the smallest integer (2, 3, 4, ...) that satisfies the inequality.

      where the PAGE_SIZE definition is included by <asm/page.h>.

    • prot determines access permissions of the VMA in vma->vm_page_prot. In addition to the user specified access permissions (PROT_READ, PROT_WRITE, etc), the device driver's mmap method may also have to explicitly disable caching for the VMA region for device memory mappings. This is usually achieved with:

      vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
      
    • flags determines the type of mapping (shared or private) in vma->vm_flags.

    • offset specifies (in multiples of page size) the offset into the shared object where the memory mapping will start from. It determines the value of vma->vm_pgoff. However, note that vma->vm_pgoff yields the offset value in form of the number of pages. To retrieve the offset size in bytes, the vma->vm_pgoff << PAGE_SHIFT operation is typically used (see Appendix PAGE_SHIFT).
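The length and offset rules above can be observed from userspace. The sketch below is illustrative only: vma_span, map_shared and offset_demo are hypothetical helper names, and a two-page temporary file stands in for the mapped object since no particular device node is assumed to exist.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Bytes spanned by the VMA for a request of `length` bytes. */
static size_t vma_span(size_t length, size_t page_size)
{
    /* page_size must be a non-zero power of two, as PAGE_SIZE is */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (length + page_size - 1) & ~(page_size - 1);
}

/* Map `length` bytes of fd starting at byte `offset`; NULL on failure. */
static void *map_shared(int fd, size_t length, off_t offset)
{
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, offset);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Self-check on a two-page temporary file. Returns 0 if mmap(2)
 * behaves as described in the text, -1 otherwise.
 */
static int offset_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *tmp = tmpfile();
    char *p;
    int fd, ret = -1;

    assert(page > 0);
    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    if (ftruncate(fd, 2 * page) != 0 || pwrite(fd, "B", 1, page) != 1)
        goto out;

    /* A one-page offset selects the second page of the object. */
    p = map_shared(fd, (size_t)page, (off_t)page);
    if (p == NULL)
        goto out;
    ret = (p[0] == 'B') ? 0 : -1;
    munmap(p, (size_t)page);

    /* An offset that is not a multiple of the page size is rejected. */
    if (map_shared(fd, (size_t)page, 1) != NULL)
        ret = -1;
out:
    fclose(tmp);
    return ret;
}
```

With 4096 byte pages, vma_span gives 4096 for a 1 byte request and 8192 for a 4097 byte request, which is the rounding described for vma->vm_end - vma->vm_start. A driver's mmap method would see the one-page offset of the demo as vma->vm_pgoff == 1.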

The remap_pfn_range interface

Defined in mm/memory.c, this is the function that remaps the object to the process' VMA. It is called from the driver's mmap file operation definition:

int remap_pfn_range(struct vm_area_struct *vma,
                    unsigned long addr,
                    unsigned long pfn,
                    unsigned long size,
                    pgprot_t prot);

where:

  • vma: a pointer to the structure that describes the process' VMA for mapping.
  • addr: the starting virtual address of the VMA.
  • pfn: the page frame number of the physical address. Given a physical address paddr, this value is typically derived from paddr >> PAGE_SHIFT (see Appendix PAGE_SHIFT).
  • size: the length of the VMA in bytes.
  • prot: the protection access bits of the VMA.

This function builds suitable page tables for the VMA and maps a corresponding range of physical addresses to it. As far as shared object mappings are concerned, this function marks the VMA with the following flags:

vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

where:

  • VM_IO specifies (among other things) that the VMA is a mapping of a device's I/O and should not be included in any process' core dump [2].
  • VM_PFNMAP tells the core MM that the base pages are just raw page frame number mappings, and do not have a struct page associated with them.
  • VM_DONTEXPAND disables VMA merging and expanding with mremap(2).
  • VM_DONTDUMP omits the VMA from core dumps, even when VM_IO is turned off.

There also exists the io_remap_pfn_range interface that accepts the same function arguments as remap_pfn_range. It is simply defined as the latter on most architectures. Nevertheless, for portability, remap_pfn_range should be used when mmaping main memory while io_remap_pfn_range should be used when mmaping I/O memory.

Example: mmaping MMIO Regions

An example of using io_remap_pfn_range for device MMIO access can be seen in ne_ivshmem_ldd_basic.c (see A PCI Device Driver Tutorial for instructions on its usage). This is a skeleton device driver for the ivshmem virtual PCI device of the QEMU PC platform. Basically, the driver's mmap method performs a sanity check on the requested range and converts the offset into a physical address within the device MMIO region:

unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;

if((offset + (vma->vm_end - vma->vm_start)) > ivshmem_dev.data_mmio_len)
  return -EINVAL;

offset += (unsigned long)ivshmem_dev.data_mmio_start;

and disables caching for the mapping:

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

before invoking io_remap_pfn_range:

if(io_remap_pfn_range(vma, vma->vm_start,
                      offset >> PAGE_SHIFT,
                      vma->vm_end - vma->vm_start,
                      vma->vm_page_prot))
  return -EAGAIN;

Note that the snippets above assume that the physical address, ivshmem_dev.data_mmio_start, is page aligned. In this particular example, this value is the address returned by pci_resource_start and happens to be aligned on a page boundary. But this may not always be the case in other device driver scenarios and the driver may have to perform certain alignment operations before invoking io_remap_pfn_range. In addition, the driver had to perform sanity checks to ensure that the values passed in mmap(2) were usable at all. To reduce this workload, newer kernels export vm_iomap_memory (see mm/memory.c), a wrapper function around io_remap_pfn_range which takes care of most of these preliminary checks. Nevertheless, the page access bits may still have to be set explicitly prior to invoking the function. For instance,

static int ivshmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return vm_iomap_memory(vma, ivshmem_dev.data_mmio_start, ivshmem_dev.data_mmio_len);
}

... can you dig it?

Example: mmaping DMA Allocations

This section will describe mmaping a coherent DMA allocation for the ARM PrimeCell PL11x Color LCD (CLCD) framebuffer device of the qemu-system-arm Versatile Express machine. The very same QEMU ARM setup described in ARM Toolchain for a QEMU Target is used here.

Central to the DMA allocation and mapping for this QEMU ARM target are two groups of functions:

  • three low-level functions defined in arch/arm/plat-versatile/clcd.c, namely: versatile_clcd_setup_dma, versatile_clcd_mmap_dma and versatile_clcd_remove_dma, and

  • three other corresponding functions defined in the CLCD framebuffer device driver (drivers/video/amba-clcd.c), namely: clcdfb_probe, clcdfb_mmap and clcdfb_remove.

Ultimately, these two groups of functions are tied together via the following structure initialization (see arch/arm/mach-vexpress/ct-ca9x4.c):

static struct clcd_board ct_ca9x4_clcd_data = {
    [...]         [...]
    .setup      = ct_ca9x4_clcd_setup,
    .mmap       = versatile_clcd_mmap_dma,
    .remove     = versatile_clcd_remove_dma,
};

DMA allocation and freeing

The CLCD driver gets loaded right before the FB console kicks in during system boot. The driver's probe method then invokes the low-level versatile_clcd_setup_dma function to initiate DMA allocation for the framebuffer:

static int clcdfb_probe(struct amba_device *dev, const struct amba_id *id)
{
    struct clcd_board *board = dev->dev.platform_data;
    struct clcd_fb *fb;
    [...]
    fb = kzalloc(sizeof(struct clcd_fb), GFP_KERNEL);
    [...]
    fb->dev = dev;
    fb->board = board;
    [...]
    ret = fb->board->setup(fb); /* invokes "ct_ca9x4_clcd_setup()" */
    [...]
}

The ct_ca9x4_clcd_setup function calculates the framebuffer size before invoking versatile_clcd_setup_dma:

int versatile_clcd_setup_dma(struct clcd_fb *fb, unsigned long framesize)
{
    dma_addr_t dma;

    fb->fb.screen_base = dma_alloc_writecombine(&fb->dev->dev, framesize,
                                                &dma, GFP_KERNEL);
    if (!fb->fb.screen_base) {
        pr_err("CLCD: unable to map framebuffer\n");
        return -ENOMEM;
    }

    fb->fb.fix.smem_start   = dma;
    fb->fb.fix.smem_len = framesize;

    return 0;
}

As can be seen, this function initializes the appropriate fields of the struct fb_info instance (embedded in struct clcd_fb). The dma_alloc_writecombine function internally invokes dma_alloc_attrs which essentially is a dma_alloc_coherent extension (arch/arm/include/asm/dma-mapping.h):

#define dma_alloc_coherent(d, s, h, f) dma_alloc_attrs(d, s, h, f, NULL)

where the d, s, h, and f parameters correspond to those of dma_alloc_writecombine. This is the function that actually handles allocation and mapping such that the buffer is placed in a location that works with DMA. The screen_base field of struct fb_info then gets set to point to the virtual address of the coherent DMA buffer while the smem_start field is set to its bus address.

In a similar fashion, the DMA buffer gets freed via clcdfb_remove -> versatile_clcd_remove_dma -> dma_free_writecombine -> dma_free_attrs (again, a dma_free_coherent extension):

#define dma_free_coherent(d, s, c, h) dma_free_attrs(d, s, c, h, NULL)

mmaping the DMA allocation

The CLCD FB device driver's mmap method:

static int clcdfb_mmap(struct fb_info *info,
                     struct vm_area_struct *vma)
{
    struct clcd_fb *fb = to_clcd(info);
    [...]
    ret = fb->board->mmap(fb, vma);
    return ret;
}

invokes

int versatile_clcd_mmap_dma(struct clcd_fb *fb, struct vm_area_struct *vma)
{
    return dma_mmap_writecombine(&fb->dev->dev, vma,
                         fb->fb.screen_base,
                         fb->fb.fix.smem_start,
                         fb->fb.fix.smem_len);
}

The dma_mmap_writecombine function internally invokes dma_mmap_attrs (once again, a dma_mmap_coherent extension):

#define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, NULL)

This function eventually invokes arm_dma_mmap in arch/arm/mm/dma-mapping.c (i.e. via ops->mmap) to perform the mapping of the DMA allocation to the userspace process' VMA via remap_pfn_range. In this particular case,

/*
 * Create userspace mapping for the DMA-coherent memory.
 */
int arm_dma_mmap(struct device *dev, struct vm_area_struct *vma,
         void *cpu_addr, dma_addr_t dma_addr, size_t size,
         struct dma_attrs *attrs)
{
    int ret = -ENXIO;
#ifdef CONFIG_MMU
    unsigned long nr_vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
    unsigned long nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
    unsigned long pfn = dma_to_pfn(dev, dma_addr);
    unsigned long off = vma->vm_pgoff;

    vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot);
    [...]
    if (off < nr_pages && nr_vma_pages <= (nr_pages - off)) {
        ret = remap_pfn_range(vma, vma->vm_start,
                            pfn + off,
                            vma->vm_end - vma->vm_start,
                            vma->vm_page_prot);
    }
#endif  /* CONFIG_MMU */

    return ret;
}

Note that the coherent DMA buffer must not be freed (dma_free_coherent) until this userspace mapping has been released (i.e. munmap(2)).
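The range test in arm_dma_mmap above can be exercised in isolation. This is a userspace sketch of that single condition only, assuming 4 KiB pages; dma_window_fits and the MODEL_ macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/*
 * Return 1 if a VMA of vma_len bytes, starting pgoff pages into a
 * DMA buffer of `size` bytes, fits inside the buffer; 0 otherwise.
 */
static int dma_window_fits(size_t size, unsigned long vma_len,
                           unsigned long pgoff)
{
    unsigned long nr_vma_pages, nr_pages;

    assert(vma_len % MODEL_PAGE_SIZE == 0);   /* VMAs span whole pages */
    nr_vma_pages = vma_len >> MODEL_PAGE_SHIFT;
    nr_pages = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

    return pgoff < nr_pages && nr_vma_pages <= nr_pages - pgoff;
}
```

Taking a 1536k buffer (384 pages) as an example, mapping the whole buffer at page offset 0 fits, the same length at page offset 1 does not, and a single page fits at page offset 383 but not at 384.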

So now, the df_window example:

root@genericarmv7a:~# df_window [--dfb:system=fbdev] [&]
[...]
(*) DirectFB/FBDev: Found 'CLCD FB' (ID 0) with frame buffer at 0x67000000, 1536k (MMIO 0x10020000, 4k)
[...]

employs mmap(2) (See DirectFB-1.7.0/systems/fbdev/fbdev.c):

static DFBResult
system_initialize( CoreDFB *core, void **data )
{
     [...]
     D_INFO( "DirectFB/FBDev: Found '%s' (ID %d) with frame buffer at 0x%08lx, %dk (MMIO 0x%08lx, %dk)\n",
                     shared->fix.id, shared->fix.accel,
                     shared->fix.smem_start, shared->fix.smem_len >> 10,
                     shared->fix.mmio_start, shared->fix.mmio_len >> 10 );

     /* Map the framebuffer */
     dfb_fbdev->framebuffer_base = mmap( NULL, shared->fix.smem_len,
                     PROT_READ | PROT_WRITE, MAP_SHARED,
                     dfb_fbdev->fd, 0 );
    [...]

to map the DMA allocated framebuffer via the clcdfb_mmap method into its address space to give:

[Figure: directfb on qemu-openembedded]
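Stripped of the DirectFB plumbing, the userspace side boils down to querying the fixed screen information and mapping smem_len bytes at offset 0. The helper below is a minimal sketch under that assumption: map_framebuffer is a hypothetical name, the device path is left to the caller, and error reporting is reduced to a NULL return.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fb.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the framebuffer behind the fbdev node `path` (e.g. "/dev/fb0").
 * On success returns the mapping and stores its length in *len_out;
 * returns NULL on any failure and leaves *len_out untouched.
 */
static void *map_framebuffer(const char *path, size_t *len_out)
{
    struct fb_fix_screeninfo fix;
    void *base;
    int fd;

    assert(path != NULL && len_out != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    if (ioctl(fd, FBIOGET_FSCREENINFO, &fix) != 0) {
        close(fd);
        return NULL;
    }
    base = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    close(fd);          /* the mapping stays valid after close(2) */
    if (base == MAP_FAILED)
        return NULL;
    *len_out = fix.smem_len;
    return base;
}
```

On the target, the mmap(2) call here is what ends up in the driver's mmap method. The caller is expected to munmap(2) the returned region when done.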

Resources and Further Reading

  • Linux 3.x sources

  • Recommended Reading

    • Linux Device Drivers, 3rd Edition, Jonathan Corbet, Alessandro Rubini and Greg Kroah-Hartman, 2005, O'Reilly Media, Inc. Somewhat dated but still a very useful reference.

    • Linux Kernel Development, 3rd Edition, Robert Love, 2010. The book is thorough.

Footnotes

1. Data segment typically spans two or more VMAs. For example, check out Sections-Segments-VMA mappings.

2. Also check out core(5).

Appendix

PAGE_SHIFT

A memory address - virtual or physical - is divided into a page number (in a manner of speaking) and an offset within the page. The PAGE_SHIFT macro (see arch/ARCH/include/asm/page.h for most archs; arch/x86/include/asm/page_types.h for x86) is defined as the number of bits that comprise the offset and, therefore, determines the page size. For example, if 4KB pages are used, then the 12 least significant bits hold the offset value. The PAGE_SHIFT macro, in this case, is defined as 12. The value of the PAGE_SIZE macro is derived from PAGE_SHIFT:

#define PAGE_SHIFT  12
#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)

Typical usage of PAGE_SHIFT in a driver's mmap file operation definition includes:

  • Obtaining the size of the offset in bytes given the number of offset pages in vma->vm_pgoff:

    vma->vm_pgoff << PAGE_SHIFT
    
  • Obtaining the page frame number for a given physical address paddr:

    paddr >> PAGE_SHIFT
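The two shifts above, plus the complementary mask that extracts the offset within a page, can be checked with a few lines of userspace C. This is a sketch assuming 4 KiB pages; the MODEL_ macros and the three helper names are hypothetical and merely mirror the kernel expressions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

/* vma->vm_pgoff << PAGE_SHIFT: offset in pages to offset in bytes */
static uint64_t pgoff_to_bytes(uint64_t pgoff)
{
    /* the shift must not push bits off the top */
    assert(pgoff <= (UINT64_MAX >> MODEL_PAGE_SHIFT));
    return pgoff << MODEL_PAGE_SHIFT;
}

/* paddr >> PAGE_SHIFT: physical address to page frame number */
static uint64_t paddr_to_pfn(uint64_t paddr)
{
    return paddr >> MODEL_PAGE_SHIFT;
}

/* The low PAGE_SHIFT bits: offset of an address within its page */
static uint64_t offset_in_page(uint64_t addr)
{
    return addr & (MODEL_PAGE_SIZE - 1);
}
```

For instance, the framebuffer address 0x67000000 reported earlier corresponds to page frame number 0x67000, and a vm_pgoff of 1 corresponds to the 0x1000 byte offset shown for /dev/ivshmem0 in the maps listing.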