Author: Mugabi Siro

Category: Device Drivers

Summary:

This entry presents a few examples that illustrate (dangerous) /dev/mem usage.

Tags: linux qemu exercises pci

Background

The /dev/mem character driver is implemented in drivers/char/mem.c and uses remap_pfn_range1 to memory-map a region of physical memory (e.g. system RAM, device memory, etc.) into the calling process' address space. From mem(4):

   mem  is  a character device file that is an image of the main memory of
   the computer.  It may be used, for example, to examine (and even patch)
   the system.

   Byte  addresses  in  mem  are interpreted as physical memory addresses.
   References to nonexistent locations cause errors to be returned.

   Examining and patching is likely to lead  to  unexpected  results  when
   read-only or write-only bits are present.

   It is typically created by:

          mknod -m 660 /dev/mem c 1 1
          chown root:kmem /dev/mem

In other words, once a userspace application (with sufficient privileges) successfully mmap(2)'s /dev/mem, the new Virtual Memory Area (VMA) has a one-to-one correspondence with the specified physical memory region. See this entry for a discussion of mmap(2) and memory-mapped I/O.
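
Stripped to its essentials, the pattern used throughout this entry is the classic "devmem peek": open(2) /dev/mem, then hand mmap(2) a page-aligned physical address as the offset. The following is only a minimal sketch of that pattern; PHYS_ADDR is an arbitrary placeholder (here the legacy VGA text-mode buffer, chosen purely for illustration):

    /* devmem_peek.c (sketch): read one byte at an arbitrary physical address.
     * PHYS_ADDR is a placeholder; substitute an address that is valid and
     * safe to read on your system. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PHYS_ADDR 0x000b8000UL   /* placeholder: legacy VGA text-mode buffer */

    int main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        off_t pgbase = PHYS_ADDR & ~((off_t)pagesz - 1);   /* mmap offset must be page aligned */

        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) {
            perror("open(/dev/mem)");
            return EXIT_FAILURE;
        }

        /* Byte addresses in /dev/mem are physical addresses: the new VMA maps
         * one-to-one onto the physical page containing PHYS_ADDR. */
        const volatile uint8_t *page = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, pgbase);
        if (page == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        printf("byte at 0x%lx = 0x%02x\n", (unsigned long)PHYS_ADDR, page[PHYS_ADDR - pgbase]);

        munmap((void *)page, pagesz);
        close(fd);
        return EXIT_SUCCESS;
    }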

NOTE:

  • These examples directly touch device memory, or (reserved) physical RAM in order to modify kernel data. It might be prudent to first perform the tests in a machine emulator environment such as QEMU2, or on a development machine/board.

  • At least for system RAM access, CONFIG_STRICT_DEVMEM may have to be disabled.

Examples

PCI MMIO Access

Memory mapping via /dev/mem makes it possible to access a PCI device's memory-mapped regions even when its device driver is absent or has not yet been loaded. This can be quite useful for preliminary tests/checks. Nevertheless, functionality with this mode of device access remains very limited.

The file upci.c is a library that contains functions to scan the PCI bus, find a particular PCI card, and read/write its I/O ports and MMIO regions. The ne_upci_rw.c program uses this library to illustrate memory mapping via /dev/mem, and MMIO access on a PCI device.

The general flow in ne_upci_rw.c is:

upci_scan_bus() --> upci_find_device() --> upci_open_region() --> upci_read_N()/upci_write_N()

These functions are defined in upci.c:

  • upci_scan_bus() scans the PCI bus by reading /proc/bus/pci/devices. For each device listed in this file, it initializes an entry in an internal table of data structures with the device's standard PCI configuration space info: DeviceID, VendorID, SubDeviceID, SubVendorID, address and size of regions associated with each available Base Address Register (BAR), etc.

  • upci_find_device() accepts user-supplied (Sub)DeviceID and (Sub)VendorID info, and returns an integer descriptor, a.k.a. the device number.

  • upci_open_region() accepts the device number and a user-supplied Base Address Register (BAR) index, a.k.a. the region number (starting from zero), and returns a new integer descriptor for the BAR, a.k.a. the data region. Internally, the region number is used as an index into the table prepared by upci_scan_bus() to retrieve BAR information. The interesting fields in this table entry are the base address, size and type of the I/O region (UPCI_REG_MEM for an MMIO region). Along with a file descriptor obtained via open(2) on /dev/mem (see incr_mem_usage()), the base address and size values are used as the offset and length parameters, respectively, of the mmap(2) interface.

At this point, a new VMA, associated with the MMIO region of the BAR, has been allocated within the virtual address space of the ne_upci_rw instance. The upci_read_N() and upci_write_N() interfaces (where N is the size of a data unit) require the data region (i.e. the BAR's descriptor) and an offset (in bytes) into the mapping. The upci_write_N() function additionally accepts a value to write at the specified offset, while upci_read_N() returns a copy of the data stored there.
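
Stripped of bookkeeping, upci_open_region()'s mapping step boils down to an open(2) on /dev/mem followed by an mmap(2) whose offset is the BAR's physical base address, after which upci_read_N()/upci_write_N() reduce to plain (volatile) pointer accesses. The following is only a sketch of that pattern, not the actual upci.c code; bar_base and bar_size are placeholders for the values upci_scan_bus() obtains from /proc/bus/pci/devices:

    /* Sketch only: map a PCI BAR's MMIO region via /dev/mem and access it.
     * bar_base and bar_size are placeholders; upci.c derives them from
     * /proc/bus/pci/devices. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        off_t bar_base = 0xfebf1000;   /* placeholder physical base address of the BAR */
        size_t bar_size = 0x100000;    /* placeholder BAR size (e.g. 1 MiB) */

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open(/dev/mem)");
            return EXIT_FAILURE;
        }

        /* The new VMA corresponds one-to-one with the BAR's MMIO region.
         * BAR bases of at least page size are naturally page aligned, as
         * mmap(2) requires. */
        volatile uint8_t *mmio = mmap(NULL, bar_size, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, bar_base);
        if (mmio == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return EXIT_FAILURE;
        }

        mmio[0] = 0x44;                      /* write one byte at offset 0 */
        printf("0x%02x\n", mmio[0]);         /* read it back */

        munmap((void *)mmio, bar_size);
        close(fd);
        return EXIT_SUCCESS;
    }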

The following tests were carried out on QEMU and should work with v1.x as well as v2.x releases. The Inter-VM Shared Memory (ivshmem) device is used.

Ivshmem Device Memory Access

For purposes of this entry, either the "ivshmem-plain" device (available only in QEMU versions later than v2.5) or the legacy "ivshmem" device can be used:

  • "ivshmem" boot:

    $ qemu-system-x86_64 ... \
        -device ivshmem,shm=ivshmem,size=1
    
  • "ivshmem-plain" boot:

    $ qemu-system-x86_64 ... \
        -object memory-backend-file,id=mb1,size=1M,share,mem-path=/dev/shm/ivshmem-plain \
        -device ivshmem-plain,memdev=mb1,id=ivshmem-plain
    

Recall that, upon QEMU boot, the ivshmem PCI device driver3 need not be present or loaded in the guest for the following tests.

  • Compile ne_upci_rw in the guest environment:

    $ make -f ne_upci_makefile
    
  • Check ne_upci_rw program usage, and obtain lspci(1) listing:

    $ ./ne_upci_rw -h   ## to view program options and arguments
    
    $ lspci             ## to view PCI device listing
    00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
    ...
    00:04.0 RAM memory: Red Hat, Inc Virtio Inter-VM shared memory
    
    $ lspci -s 00:04.0 -m -n -v
    Device: 00:04.0
    Class:  0500
    Vendor: 1af4
    Device: 1110
    SVendor:    1af4
    SDevice:    1100
    ...
    
  • Write operation: A string is written into the MMIO region associated with BAR2. Recall that the ivshmem device performs a POSIX SHM mapping on the host and presents this mapping as its BAR2 MMIO region to the Linux guest (a host-side C sketch of reading this mapping directly appears at the end of this list).

    $ sudo -s
    # DARRAY="68 117 110 105 97 44 118 105 112 105 63 13 0"
    # c=0; for i in $DARRAY
    > do 
    >   ./ne_upci_rw -D 0x1110 -V 0x1af4 -d 0x1100 -v 0x1af4 -r 2 -f u8 -o $c -a 1 -w $i
    >   c=$((c+1))
    >   printf "%3d %3d\n" $i $c
    > done
    

    Then on the host

    $ hd /dev/shm/ivshmem 
    00000000  44 75 6e 69 61 2c 76 69  70 69 3f 0d 00 00 00 00  |Dunia,vipi?.....|
    00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00100000
    
  • Read operation: Back in the guest environment, a read is performed on the device memory associated with BAR2. In this case, what was just written is returned.

    # i=0; d=0; while [ $d == 0 ]
    > do 
    >   val=$(./ne_upci_rw -D 0x1110 -V 0x1af4 -d 0x1100 -v 0x1af4 -r 2 -f u8 -o $i -a 0)
    >   if [ $val == "0x00" ]
    >   then
    >     d=1
    >   fi
    >   echo -n "$val "
    >   i=$((i+1))
    > done; echo
    0x44 0x75 0x6E 0x69 0x61 0x2C 0x76 0x69 0x70 0x69 0x3F 0x0D 0x00
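
As an alternative to the hd command above, the host-side POSIX SHM mapping can also be read programmatically. The following is only a sketch (not part of this entry's sources); the shm object name "/ivshmem" corresponds to the /dev/shm/ivshmem backing file created by the shm=ivshmem boot parameter, and older glibc versions may require linking with -lrt for shm_open():

    /* Sketch only: read the ivshmem backing object on the host via POSIX SHM.
     * "/ivshmem" corresponds to the shm=ivshmem QEMU parameter used above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = shm_open("/ivshmem", O_RDONLY, 0);
        if (fd < 0) {
            perror("shm_open");
            return EXIT_FAILURE;
        }

        struct stat st;
        if (fstat(fd, &st) < 0) {
            perror("fstat");
            return EXIT_FAILURE;
        }

        const unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        /* Dump the leading bytes: the string written from the guest should appear here */
        for (int i = 0; i < 16; i++)
            printf("%02x ", p[i]);
        putchar('\n');

        munmap((void *)p, st.st_size);
        close(fd);
        return EXIT_SUCCESS;
    }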
    

Ivshmem Device Register Access

The entry QEMU MMIO Memory Regions includes an example of accessing the memory mapped registers of a PCI device. The objective of that exercise was to obtain the function call trace taken by the QEMU vCPU thread function for the device's I/O callbacks.

System RAM Access

In this example, a location within reserved physical memory belonging to the kernel's data segment is directly accessed and patched by a userspace program. References to Linux header files and sources (e.g. include/uapi/linux/utsname.h, drivers/char/mem.c) are with respect to the root of the kernel source tree. Based on Linux 3.x.

Providentially, while scribbling content for a related entry, I stumbled across the following sentence in Sreekrishnan Venkateswaran's seminal Essential Linux Device Drivers, Chapter 5, Section "Pseudo Char Drivers":

As an exercise, change the hostname of your system by accessing /dev/mem.

Now, there exists more than one way of doing this. The approach taken here is simple: a tiny kernel module is used to obtain the physical address of the nodename field of struct new_utsname (defined in include/uapi/linux/utsname.h)4. Then /dev/mem is mmap(2)'d at this physical address in order to patch it.
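
The kernel stub can be as small as the following sketch, which relies on the standard utsname() accessor and virt_to_phys(). Note that this is only an illustration of the idea, not the actual ne_devmem_hostname_krn.c from the linked sources:

    /* Sketch only: print the physical address of the running kernel's
     * nodename field (struct new_utsname of the current UTS namespace). */
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/utsname.h>
    #include <asm/io.h>

    static int __init nodename_paddr_init(void)
    {
        char *nodename = utsname()->nodename;
        phys_addr_t paddr = virt_to_phys(nodename);

        pr_info("nodename: %s, phys addr: 0x%llx\n",
                nodename, (unsigned long long)paddr);
        return 0;
    }

    static void __exit nodename_paddr_exit(void)
    {
    }

    module_init(nodename_paddr_init);
    module_exit(nodename_paddr_exit);
    MODULE_LICENSE("GPL");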

The sources are available here. Simply run make to compile the kernel stub. Build ne_devmem_hostname_usr with:

$ gcc -Wall -O2 ne_devmem_hostname_usr.c
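
For orientation, the core of the userspace patcher amounts to the sketch below. This is not the actual ne_devmem_hostname_usr.c (which takes -l/-s options and prints its own diagnostics); the positional-argument interface here is invented for brevity:

    /* Sketch only: map the page of physical RAM containing the nodename
     * field via /dev/mem and overwrite the string in place. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <phys-addr> <new-hostname>\n", argv[0]);
            return EXIT_FAILURE;
        }

        off_t paddr = strtoul(argv[1], NULL, 0);   /* e.g. 0x1e102c5 from the kernel stub */
        const char *name = argv[2];
        long pagesz = sysconf(_SC_PAGESIZE);
        off_t pgbase = paddr & ~((off_t)pagesz - 1);   /* mmap offset must be page aligned */

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open(/dev/mem)");
            return EXIT_FAILURE;
        }

        char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, pgbase);
        if (page == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        char *nodename = page + (paddr - pgbase);   /* virtual address of the nodename field */
        printf("patching \"%s\" -> \"%s\"\n", nodename, name);
        strcpy(nodename, name);   /* nodename is a 65-byte array; keep the new name short */

        munmap(page, pagesz);
        close(fd);
        return EXIT_SUCCESS;
    }

Note how the page offset (paddr - pgbase) is preserved in the mapped virtual address, as the vaddr values reported in the session below show.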

Obtain the physical address of the nodename field of struct new_utsname:

Welcome to Buildroot
buildroot login: root

[root@buildroot ~]# hostname
buildroot

[root@buildroot ~]# echo 8 > /proc/sys/kernel/printk

[root@buildroot ~]# insmod ne_devmem_hostname_krn.ko 
[   22.319639] nodename: buildroot, phys addr: 0x1e102c5

which, indeed, lies within the range of physical memory reserved for the kernel data segment:

[root@buildroot ~]# less /proc/iomem 
...
00100000-17ffcfff : System RAM
    01000000-0182852c : Kernel code
    0182852d-01eee43f : Kernel data
    0203c000-02137fff : Kernel bss
...

So, proceeding with /dev/mem to patch the contents of this physical address:

[root@buildroot ~]# ./a.out -l 0x1e102c5
main:126:: Patching paddr. 0x1e102c5, vaddr. 0xf77dc2c5 (nodename "buildroot")
main:128:: with string "simsima"

[root@buildroot ~]# hostname               
simsima

[root@buildroot ~]# ./a.out -l 0x1e102c5 -s NinjaNinja
main:126:: Patching paddr. 0x1e102c5, vaddr. 0xf77382c5 (nodename "simsima")
main:128:: with string "NinjaNinja"

[root@buildroot ~]# hostname
NinjaNinja

[root@buildroot ~]# exit
logout

Welcome to Buildroot
NinjaNinja login: root
[root@NinjaNinja ~]#

Footnotes

1. See mmaping MMIO and DMA regions for a description of the remap_pfn_range Linux interface. [go back]

2. See QEMU Intro [go back]

3. See Writing a Linux PCI Device Driver, Tutorial with a QEMU Virtual Device. [go back]

4. See sethostname(2) and Ftrace for a commentary on how the struct new_utsname business was determined via Ftrace. [go back]