Author: Mugabi Siro

Category: GNU/Linux Toolchain

Summary:

An introduction to the C/C++ GNU/Linux toolchain. Needs update (last update Sep 2014).

Tags: gnu/linux toolchain gnu linux introduction

Toolchain Overview

A toolchain is the set of distinct development tools whose specific roles form the chain of events that culminates in a software build. This process essentially involves the compilation, assembly and linking of code that will execute on a given runtime host.

A toolchain is a critical component of the system. In addition to determining how code is built, pieces of the toolchain itself also end up in the final binaries, e.g. the language runtime support1. The performance and proper functioning of a program, or of the system as a whole, therefore depend on the toolchain. The effects of a single bug in a toolchain component may manifest in very unexpected ways during build or runtime, and may even end up compromising system integrity.

GNU/Linux Toolchain Components

The core components of a GNU/Linux based toolchain consist of the GNU Binary Utilities binutils, the GNU Compiler Collection GCC, a C library, and a set of Linux kernel headers for userspace development. In practice, additional components such as the GNU Debugger gdb, and various libraries, are also included in the mix.

GCC

GCC is a suite of compilers for several major programming languages including C, C++, Objective-C, Objective-C++, Java, Fortran and Ada. Originally, the acronym GCC stood for the GNU C Compiler; gcc(1) remains the command used to invoke the GNU C compiler. Correspondingly, the other language compilers have their own commands, such as g++ for the C++ compiler and gnat for the Ada compiler.

gcc(1) is actually a compiler driver program. Upon invocation, gcc(1) will by default build a program into an executable in the ELF Object File Format. It does so by automatically invoking the appropriate tool in a serialized fashion:

  • C preprocessor2
    Processes the programmer's C source code and outputs pure C code with all macros expanded and directives processed

  • compiler proper, cc1
    Translates the pure C language code into machine-dependent assembly language code

  • assembler (as(1))
    Produces a relocatable object file (see ELF Object File Types) containing machine code generated from the assembly language source file(s)

  • linkers (collect2 and ld(1))3
    Combine the user-supplied object files with the language runtime support and required libraries to generate an executable or a shared object (see ELF Object File Types)

The gcc(1) command line accepts different groups of options which it automatically passes to the respective tool during the build process.

C Library Alternatives

The C library and its headers act as a wrapper around the raw Linux kernel API and are used by most applications running on a Linux system.

GNU C library

The GNU C library, a.k.a. glibc, is a feature-rich and portable library that complies with all major C standards (see standards(7)). It includes support for name resolution, time zone information, and authentication via the Name Service Switch (NSS) and Pluggable Authentication Modules (PAM). The Native POSIX Threads Library (NPTL) has now replaced LinuxThreads for thread support. Also see libc(7) and feature_test_macros(7).

On GNU/Linux, glibc supports a variety of machine architectures including ARM (32- and 64-bit), AVR32, MIPS, PowerPC (32- and 64-bit), SPARC (32- and 64-bit), x86(-64), Xilinx MicroBlaze, etc. On these systems, the soname libc.so.6 implies glibc 2.x. Refer to this link for answers to glibc-related FAQs.

For devices with relatively limited storage, glibc (which primarily targets servers and workstations) might be rather bloated. In addition, applications and libraries compiled against its headers will generally be larger than if they had been built against other C libraries optimized for embedded systems.

Embedded GNU C library

Embedded GLIBC (eglibc) is a variant of glibc that is designed to work well on embedded systems. eglibc strives to be source and binary compatible with glibc. Its goals include a reduced footprint, configurable components, and better support for cross-compilation and cross-testing. Distributions such as Ubuntu currently ship with eglibc. Follow this link for more information on eglibc.

uClibc

Also released under the LGPL, uClibc features an even smaller footprint than eglibc. It is an independent project from the mainstream (e)glibc but strives to comply with the same standards: uClibc is designed to be source compatible with eglibc but not binary compatible. To use uClibc, the application must be recompiled.

Originally developed to support uClinux, a port of Linux for MMU-less microcontrollers e.g. the DragonBall, ColdFire and ARM7TDMI, it now also supports stock Linux systems e.g. IA32, ARM, PowerPC, SH, SPARC, MIPS, etc. Being less feature-rich, there might be a few surprises when compiling against this library. NPTL support for IA32, x86-64, ARM and other architectures has been included since uClibc 0.9.32. See this for more information on uClibc.

Others

Embedded C library alternatives include dietlibc and newlib. There also exists klibc, which is designed for use in an initramfs.

Linkers

collect2 and ld

collect2 is a GCC linker utility that eventually invokes the real (static) linker ld(1). The former is part of GCC while the latter belongs to the binutils package. Follow this link for a more detailed explanation.

Static Linker vs Dynamic Linker/Loader

The static linker, ld(1), is usually invoked in the last stage of a program's build. It processes relocatable object files by performing symbol resolution and relocation to produce an executable or loadable shared object file. The static linker is part of the binutils package.

The dynamic linker/loader (e.g. ld-linux.so(8), ld-uClibc.so, etc), on the other hand, is provided by the C library package. Basically, it loads the shared libraries needed by a program, prepares the program to run and then runs it. Unless a program is statically linked, it remains incomplete and requires the services of the dynamic linker/loader for further linking at load time and run-time.

Linux Kernel Headers

The kernel exports a "sanitised" subset of its headers for use by userspace programs that require kernel services. These kernel header files are used by the system's C library to define available system calls, as well as constants and structures to be used with these system calls. If, for example, the C library header files are located at /usr/include, then the sanitised kernel headers are likely to be found under /usr/include/linux and /usr/include/asm.

The make headers_install command is used to export this sanitised subset of the kernel headers. See Documentation/make/headers_install.txt of the Linux kernel sources for more info.

GNU/Linux Toolchain and ABI

The Application Binary Interface (ABI) defines the system interface between compiled programs. It defines how a generated binary interacts with itself, libraries, and the kernel. These runtime conventions are abstracted and implemented by the toolchain components (compiler, assembler, linkers, language runtime support, etc), and remain largely transparent to the C/C++ programmer. The ABI implemented by the toolchain determines code generation and support for things such as:

  • Data representation (size and alignment of data types, layout of structured types, etc)
  • Register usage conventions
  • Function, library and syscall calling conventions
  • Program loading and dynamic linking, and library behaviour
  • Interfaces for runtime arithmetic support
  • Binary object file formats4

In addition, the ABI implemented by a C++ compiler affects code generation and runtime support for:

  • Name mangling
  • Exception handling
  • Invoking constructors and destructors
  • Layout, alignment, and padding of classes
  • Layout and alignment of virtual tables
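Name mangling, for instance, can be observed with the binutils c++filt(1) utility, which decodes mangled C++ symbol names. A quick sketch; the symbols shown follow the common Itanium C++ ABI used by g++ on GNU/Linux:

```shell
# _Z3foov is the Itanium-ABI mangling of a function foo taking no
# arguments; c++filt (from binutils) reverses the encoding.
echo '_Z3foov' | c++filt
# prints: foo()

# A namespaced example: N...E brackets the nested name, i encodes int.
echo '_ZN3bar3bazEi' | c++filt
# prints: bar::baz(int)
```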

Note that the ABI is intimately tied to a specific machine architecture. Each machine architecture in Linux has its own ABI, with each ABI being referred to by the architecture name e.g. x86-64. However, some architectures also support a variety of calling conventions, instruction set extensions, etc. In other words, these architectures support more than one ABI. Examples include:

  • ISA extensions: i386/i486/i586/i686/MMX/SSE etc (x86 architecture)
  • ABI transitions: armel and armhf (ARM architecture)

Consult this link for more details.

A toolchain conforms to an ABI if it generates code that adheres to all of the specifications enumerated by that ABI. A library conforms to an ABI if it is implemented according to that ABI. An application conforms to an ABI if it is built using tools that conform to that ABI and does not contain source code that specifically changes behaviour specified by the ABI.

In short, the ABI guarantees binary compatibility i.e. binaries will function on any system with the same ABI without the need for recompilation. An understanding of the ABI is required when working at the machine-level or assembly. Otherwise, when programming in a high-level language such as C or C++, knowledge of the ABI is not strictly required. Nevertheless, some understanding of it allows for more control of program runtime behaviour5 and facilitates writing more optimized or efficient code; but, probably, at the risk of less portable code. A solid background of the ABI also heightens the chances of success in a hacking or reverse engineering endeavour.

Toolchain Types

A toolchain is built to support:

  • A specific architecture.
  • A particular application build set-up.

These factors result in the different types of toolchains. The terms build, host, and target take special meanings in the context of GNU/Linux toolchain build:

  • Build The system on which the toolchain is built.

  • Host The system on which the toolchain will run.

  • Target The system for which the toolchain generates code.

Native toolchain

A typical example of a native toolchain is a distro supplied toolchain. Here, the build, host and target machines are all of the same architecture. Follow this link for an example of installing a distro supplied (pre-compiled) native toolchain in Ubuntu.

However, note that the installation of a separate, native toolchain may be preferable for embedded development. Among other things, this approach will ensure consistency across distro updates/upgrades. In addition, certain features of the distro toolchain (e.g. the C library) may not be quite suitable or optimised for embedded development.

Cross-platform toolchain

A cross-platform toolchain (or simply cross-toolchain) is meant to execute on a development host of one architecture (typically, an x86 based workstation) while generating code for a target of a different architecture.

There are two cross-toolchain build scenarios:

  • build and host machines are of the same architecture. This is the usual build scenario. Below is the output of an ARM cross-compiler driver that was built with this approach:

    $ arm-linux-gnueabihf-gcc -v
    ...
    Configured with: ... --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf ...
    ...
    
  • build and host are of different machine architectures. This is sometimes referred to as a Canadian Cross and happens when cross-compiling a cross-toolchain. This build approach is rare.

Cross-toolchains are probably the most commonly used type of toolchain in embedded development. Note that a mismatch between the cross-toolchain components (especially the C library) and the target's native components may result in improperly functioning, or even non-functioning, binaries.

Check out Using a Cross-Toolchain for various examples of using a cross-toolchain on a GNU/Linux system.

Cross-native toolchain

In this case, while host and target machines are of the same architecture, the build machine used was of a different architecture.

Cross-back toolchain

This is an exotic cross-toolchain setup where build and target are of the same architecture and differ from host.

More Toolchain Lingo

Toolchain Tuple

A toolchain tuple is the name that gets prepended to the compiler and binutils components of the toolchain. For example, given the cross-compiler driver arm-linux-gnueabihf-gcc, and corresponding binutils components such as arm-linux-gnueabihf-as and arm-linux-gnueabihf-ld, then the tuple here is arm-linux-gnueabihf-. Note the trailing hyphen.

Tuples are generally written in a standardized format which follows the naming conventions of the GNU Triplets i.e. cpu-vendor-os, where os can be system or kernel-system. The following table describes the meanings of these fields:

cpu The type of processor used on the system e.g. x86_64 or arm.
vendor This may take any (reasonable) value. Typically, the name of the manufacturer of the system, for example pc, or simply unknown, e.g. x86_64-unknown-linux-gnu. In fact, this field is often omitted, e.g. arm-linux-gnueabi.
kernel This is used mainly for GNU/Linux systems e.g. i686-pc-linux-gnu. In other words, linux specifies kernel.
system Typically gnu* for GNU based systems.

Below is an instance of a listing of toolchain components prefixed by tuples:

$ ls /usr/bin/ | grep x86_64-linux-gnu-
x86_64-linux-gnu-cpp
x86_64-linux-gnu-cpp-4.6
x86_64-linux-gnu-g++
x86_64-linux-gnu-g++-4.6
x86_64-linux-gnu-gcc
x86_64-linux-gnu-gcc-4.6

$ ls -l /usr/bin | grep arm-unknown-linux-gnueabihf-
arm-unknown-linux-gnueabihf-addr2line
arm-unknown-linux-gnueabihf-ar
arm-unknown-linux-gnueabihf-as
...
arm-unknown-linux-gnueabihf-cpp
arm-unknown-linux-gnueabihf-ct-ng.config
...
arm-unknown-linux-gnueabihf-gcc
arm-unknown-linux-gnueabihf-gcc-4.7.4
...
arm-unknown-linux-gnueabihf-gcov
...
arm-unknown-linux-gnueabihf-gprof
...
arm-unknown-linux-gnueabihf-ld
arm-unknown-linux-gnueabihf-ldd
arm-unknown-linux-gnueabihf-nm

Note that toolchain components of the distro supplied (i.e. native and default) toolchain may not necessarily have the tuple prefixed.

sysroot

This is the root directory that the toolchain components such as gcc and ld search to find headers and libraries. For example, the --sysroot=dir command line option instructs gcc to use dir as the logical root directory for headers and libraries. So, if the compiler would normally search for headers in /usr/include and libraries in /usr/lib, it will instead search dir/usr/include and dir/usr/lib. See gcc(1), ld(1), etc for more details.

Using a Binary Package vs. Building from Source

General Issues with Binary Toolchain Packages

Binary toolchain packages typically originate from vendors (as part of a Board Support Package, BSP), third-party products, or initiatives and community projects built around certain hardware. These packages offer convenience, i.e. easy installation, and have typically already been widely tested and proven.

Nevertheless, they may lack support for newer processor features and optimizations due to the use of ageing toolchain components. Some may even have been optimised for a specific processor, or may have been built with unknown patchsets. Other issues include relocatability, i.e. the package may not fit your particular build setup.

A key component of a toolchain is the C library. On GNU/Linux, as on most other operating systems, it provides the kernel/userland interface. Accordingly, the C library is built against a set of "sanitised" kernel headers. But since this involves a specific kernel version (effectively a snapshot of the kernel releases), a few notable constraints result with respect to the dynamic nature of Linux:

More recent kernels

Applications compiled against the C library are only able to take advantage of features in kernel versions up to and including the kernel the C library was built against.

At least for (e)glibc, the kernel maintains backwards compatibility, and so applications will still run on later (more recent) kernel versions - only the new features introduced by those kernels will not be available to them.

For instance, info about the kernel version used to build the C library in Ubuntu x86-64 can be obtained via:

$ /lib/x86_64-linux-gnu/libc.so.6 
...
Compiled on a Linux 3.2.50 system on 2013-09-30.
...

... and this value coincides with the version of its Linux development headers:

$ cat /usr/include/linux/version.h
#define LINUX_VERSION_CODE 197170
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

i.e. (3 << 16) + (2 << 8) + 50 = 197170
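The packing performed by the KERNEL_VERSION macro can be checked with plain shell arithmetic:

```shell
# LINUX_VERSION_CODE packs major.minor.patch into a single integer:
# (major << 16) + (minor << 8) + patch, here for Linux 3.2.50.
echo $(( (3 << 16) + (2 << 8) + 50 ))
# prints: 197170
```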

Older kernels and the minimum supported kernel version

(e)glibc includes support for kernel versions older than the one it was compiled against - but only up to a certain point. For example, (e)glibc compiled against Linux 3.2.50 may still include support for kernels as old as v2.6.24. This oldest version is what is often referred to as the minimum supported kernel version in toolchain lingo. A few implications of this are:

  • Running applications compiled against this toolchain on an older (but supported) kernel than the one its (e)glibc was compiled against may result in sub-optimal performance: the C library may now have to provide its own implementations (in userspace) of some kernel features.

  • Attempts to use a kernel version older than the minimum supported kernel version will result in the:

    FATAL: kernel too old
    

    error message before the userspace application's main() is called. In a traditional system boot, this would be before /sbin/init gets called.

    Info about the minimum kernel version supported by the C library can be obtained by reading the .note.ABI-tag section of the ELF object file. On Ubuntu 12.04 x86-64, for instance:

    $ readelf -n /lib/x86_64-linux-gnu/libc-2.15.so | grep Linux
      OS: Linux, ABI: 2.6.24
    

    Alternatively, since file6 also reads this section of an ELF object file:

    $ file /lib/x86_64-linux-gnu/libc-2.15.so 
    ... for GNU/Linux 2.6.24 ...
    

Since these version-related constraints cannot be adjusted when working with pre-built toolchain binaries, the minimum supported kernel version should (at the very least) be established first when selecting a toolchain package or release version.

Ideally, the kernel headers that were used to build the toolchain's C Library should match the version of the Linux kernel on the target. Flexibility in selection of toolchain component versions is greater when building a toolchain from source.

General Issues with Building a Toolchain from Source

While pre-compiled toolchain packages offer convenience in terms of installation, building a toolchain from source presents its own advantages. Notably, it allows for flexibility in choosing toolchain component versions and for fine-tuning to meet other specific requirements. This includes the possibility of optimising for a particular processor and of configuring the toolchain to fit your build system. Since the origins are upstream sources, fixes are easy to apply.

However, rolling your own toolchain has its cons. Upstream may have poor support for a particular processor (missing, incomplete or improper patches). How much of an issue this will be depends on the nature of the problem, your skill level and/or your resources. As is typical with other community-based upstream projects, you may have to rely on community support. Validation of the toolchain may also present an issue.

There exist a number of automated toolchain build platforms (e.g. crosstool-NG). These greatly facilitate an otherwise painstaking enterprise: building a toolchain by hand requires a good understanding of the roles of the various (and numerous) packages involved. It is both a time-consuming and error-prone process due to the delicate dependencies that exist among versions of individual components.

Check out Using crosstool-ng for a description of its usage, and ARM Cross-toolchain with crosstool-ng for a real-world case study of generating and using a crosstool-ng toolchain.

Also See

Resources

  • https://elinux.org/ToolChains
  • https://sourceware.org/glibc/wiki/FAQ
  • http://www.eglibc.org/faq
  • http://www.uclibc.org/FAQ.html
  • http://gcc.gnu.org/onlinedocs/gcc/Compatibility.html
  • https://sourcery.mentor.com/sgpp/lite/arm/portal/kbentry32
  • https://wiki.ubuntu.com/ToolChain
  • Yann E. Morin, Embedded Linux Conference, Europe, October 2009

Footnotes

1. For example, the C Runtime. [go back]

2. gcc will by default use cc1's internal preprocessor. [go back]

3. cc1 and collect2 are part of GCC, whereas as(1) and ld(1) are part of binutils. [go back]

4. For example, check out ELF Object File Format. [go back]

5. For example, check out C run-time, Placing Functions or Data in Arbitrary Sections, etc. [go back]

6. Check out src/readelf.c of its source code. [go back]