Author: Mugabi Siro
Category: GNU/Linux Toolchain
An introduction to the C/C++ GNU/Linux toolchain. Update required (last update Sep 2014).
Tags: gnu/linux toolchain, gnu, linux, introduction
A toolchain is the set of distinct development tools whose specific roles form the chain of events that culminates in a software build. This process essentially involves the compilation, assembly and linking of code that will execute on a given runtime host.
A toolchain is a critical component of the system. In addition to determining how code is built, pieces of the toolchain itself also end up in the final binaries, e.g. the language runtime support1. Performance and proper functioning of a program, or of the system as a whole, therefore depend on the toolchain. The effects of a single bug in a toolchain component may manifest in very unexpected ways during build or runtime, and may even end up compromising system integrity.
The core components of a GNU/Linux based toolchain are the GNU Binary Utilities (binutils), the GNU Compiler Collection (GCC), a C library, and a set of Linux kernel headers for userspace development. In practice, additional components such as the GNU Debugger (gdb) and various libraries are also included in the mix.
GCC is a suite of compilers for several major programming languages including C, C++, Objective-C, Objective-C++, Java, Fortran and Ada. Note that, originally, the acronym GCC stood for the GNU C Compiler, and gcc remains the command used to invoke the GNU C compiler. Correspondingly, other language compilers have their own commands, such as g++ for the C++ compiler and gnat for the Ada compiler.
gcc(1) is actually a compiler driver program. By default, gcc(1) builds a program into an executable in the ELF Object File Format. It does so by automatically invoking the appropriate tool at each stage, in a serialized fashion:
1. Preprocessing: processes the programmer's C source code and outputs pure C language code, with macros and directives processed
2. Compilation: translates the pure C language code into machine-dependent assembly language code
3. Assembly: produces a relocatable object file (see ELF Object File Types) containing machine instructions or byte code from the assembly language source file(s)
4. Linking: combines the user-supplied object files along with the language runtime support and required libraries to generate an executable or a shared object (see ELF Object File Types)
The gcc(1) command line accepts different groups of options, which it automatically passes on to the respective tool during the build process.
The C library and its headers act as a wrapper around the raw Linux kernel API and are used by most applications running on a Linux system.
The GNU C library, a.k.a. glibc, is a feature-rich and portable library that complies with all major C standards. It includes support for name resolution, time zone information, and authentication via the Name Service Switch (NSS) and Pluggable Authentication Modules (PAM). The Native POSIX Threads Library (NPTL) has now replaced LinuxThreads for thread support.
On GNU/Linux, glibc supports a variety of machine architectures including ARM (32- and 64-bit), AVR32, MIPS, PowerPC (32- and 64-bit), SPARC (32- and 64-bit), x86(-64), Xilinx MicroBlaze, etc. On these systems, the soname libc.so.6 implies glibc 2.x. Refer to the glibc FAQ for answers to glibc-related questions.
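On a glibc-based distro, the installed C library version can be queried from the shell. A quick sketch (it assumes a glibc system; on other C libraries these commands behave differently or fail):

```shell
# GNU_LIBC_VERSION is a glibc-specific getconf variable:
getconf GNU_LIBC_VERSION    # e.g. "glibc 2.15"

# ldd ships with glibc and reports the same version in its banner:
ldd --version | head -n1
```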
For devices with relatively limited storage, glibc (which primarily targets servers and workstations) might be rather bloated. In addition, applications and libraries compiled against its headers will generally be larger than if they had been built against other C libraries optimised for embedded systems.
Embedded GLIBC (eglibc) is a variant of glibc designed to work well on embedded systems. It strives to be source and binary compatible with glibc. eglibc's goals include a reduced footprint, configurable components, and better support for cross-compilation and cross-testing. Distributions such as Ubuntu currently ship with eglibc.
Also released under the LGPL, uClibc features an even smaller footprint than eglibc. It is an independent project from the mainstream (e)glibc but strives to comply with the same standards. uClibc is designed to be source compatible with eglibc but not binary compatible: to use uClibc, applications must be recompiled.
Originally developed to support uClinux, a port of Linux for MMU-less microcontrollers (e.g. DragonBall, ColdFire and ARM7TDMI), it now also supports stock Linux systems, e.g. IA32, ARM, PowerPC, SH, SPARC, MIPS, etc. Being less feature-rich, there might be a few surprises when compiling against this library. NPTL support for IA32, x86-64, ARM and other architectures has been included since uClibc 0.9.32.
collect2 is a GCC linker utility that eventually invokes the real (static) linker, ld(1). The former is part of GCC while the latter belongs to the binutils package.
The static linker, ld(1), is usually invoked in the last stage of a program's build. It processes relocatable object files, performing symbol resolution and relocation to produce an executable or loadable shared object file. The static linker is part of the binutils package.
The dynamic linker/loader (e.g. ld-linux.so, ld-uClibc.so), on the other hand, is provided by the C library package. It loads the shared libraries needed by a program, prepares the program to run and then runs it. Unless a program is statically linked, it remains incomplete and requires the services of the dynamic linker/loader for further linking at load time and run time.
The kernel exports a "sanitised" subset of its headers for use by userspace programs that require kernel services. These kernel header files are used by the system's C library to define available system calls, as well as the constants and structures to be used with these system calls. If, for example, the C library header files are located in /usr/include, then the sanitised kernel headers are likely to be found under /usr/include/linux and /usr/include/asm. The make headers_install command is used to export this sanitised subset of the kernel headers. See Documentation/make/headers_install.txt in the Linux kernel sources for more info.
The Application Binary Interface (ABI) defines the system interface between compiled programs. It defines how a generated binary interacts with itself, libraries, and the kernel. These runtime conventions are abstracted and implemented by the toolchain components (compiler, assembler, linkers, language runtime support, etc.), and remain largely transparent to the C/C++ programmer. The ABI implemented by the toolchain determines code generation and support for things such as calling conventions, data type sizes and alignment, register usage, and system call invocation.
In addition, the ABI implemented by a C++ compiler affects code generation and runtime support for features such as name mangling, virtual table layout and exception handling.
Note that the ABI is intimately tied to a specific machine architecture. Each machine architecture in Linux has its own ABI, with each ABI being referred to by the architecture name, e.g. x86-64. However, some architectures also support a variety of calling conventions, instruction set extensions, etc. In other words, these architectures support more than one ABI. Examples include ARM (OABI vs. EABI) and MIPS (o32, n32, n64).
A toolchain conforms to an ABI if it generates code that adheres to all of the specifications enumerated by that ABI. A library conforms to an ABI if it is implemented according to that ABI. An application conforms to an ABI if it is built using tools that conform to that ABI and does not contain source code that specifically changes behaviour specified by the ABI.
In short, the ABI guarantees binary compatibility, i.e. binaries will function on any system with the same ABI without the need for recompilation. An understanding of the ABI is required when working at the machine or assembly level. Otherwise, when programming in a high-level language such as C or C++, knowledge of the ABI is not strictly required. Nevertheless, some understanding of it allows for more control of program runtime behaviour5 and facilitates writing more optimised or efficient code, though probably at the risk of less portable code. A solid background in the ABI also heightens the chances of success in a hacking or reverse engineering endeavour.
A toolchain is built to support a given combination of build, host and target systems. These factors result in the different types of toolchains. The terms build, host and target take on special meanings in the context of a GNU/Linux toolchain build:
Build: the system on which the toolchain is built.
Host: the system on which the toolchain will run.
Target: the system for which the toolchain generates code.
A typical example of a native toolchain is a distro-supplied toolchain. Here, the build, host and target machines are all of the same architecture.
However, note that the installation of a separate, native toolchain may be preferable for embedded development. Among other things, this approach ensures consistency across distro updates/upgrades. In addition, certain features of the distro toolchain (e.g. the C library) may not be quite suitable or optimised for embedded development.
A cross-platform toolchain (or simply cross-toolchain) is meant to execute on a development host of one architecture (typically, an x86 based workstation) while generating code for a target of a different architecture.
There are two cross-toolchain build scenarios:
The build and host machines are of the same architecture. This is the usual build scenario. Below is the output of an ARM cross-compiler driver that was built with this approach:
$ arm-linux-gnueabihf-gcc -v
...
Configured with: ... --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf ...
...
The build and host machines are of different architectures. This is sometimes referred to as a Canadian Cross and happens when cross-compiling a cross-toolchain. This build approach is rare.
Cross-toolchains are probably the most commonly used type of toolchain in embedded development. Note that a mismatch between the cross-toolchain components (especially the C library) and the target's native components may result in improperly functioning, or even non-functioning, binaries.
Check out Using a Cross-Toolchain for various examples of using a cross-toolchain on a GNU/Linux system.
In this case, while the host and target machines are of the same architecture, the build machine used was of a different architecture (a so-called cross-native toolchain).
This is an exotic cross-toolchain setup where the build and target machines are of the same architecture, and both differ from the host.
A toolchain tuple is the name that gets prepended to the compiler and binutils components of the toolchain. For example, given the compiler driver arm-unknown-linux-gnueabihf-gcc and corresponding binutils components such as arm-unknown-linux-gnueabihf-ld and arm-unknown-linux-gnueabihf-as, the tuple here is arm-unknown-linux-gnueabihf, i.e. without the trailing hyphen.
Tuples are generally written in a standardized format which follows the naming conventions of the GNU triplets, i.e. cpu-vendor-os, where os can take the form system or kernel-system.
The following table describes the meanings of these fields:
cpu: The type of processor used on the system, e.g. arm or x86_64.
vendor: This may take any (reasonable) value. Typically, the name of the manufacturer of the system, e.g. unknown or pc.
os: This is used mainly for GNU/Linux systems, e.g. linux-gnu or linux-gnueabihf.
Below is an instance of a listing of toolchain components prefixed by tuples:
$ ls /usr/bin/ | grep x86_64-linux-gnu-
x86_64-linux-gnu-cpp
x86_64-linux-gnu-cpp-4.6
x86_64-linux-gnu-g++
x86_64-linux-gnu-g++-4.6
x86_64-linux-gnu-gcc
x86_64-linux-gnu-gcc-4.6
$ ls -l /usr/bin | grep arm-unknown-linux-gnueabihf-
arm-unknown-linux-gnueabihf-addr2line
arm-unknown-linux-gnueabihf-ar
arm-unknown-linux-gnueabihf-as
...
arm-unknown-linux-gnueabihf-cpp
arm-unknown-linux-gnueabihf-ct-ng.config
...
arm-unknown-linux-gnueabihf-gcc
arm-unknown-linux-gnueabihf-gcc-4.7.4
...
arm-unknown-linux-gnueabihf-gcov
...
arm-unknown-linux-gnueabihf-gprof
...
arm-unknown-linux-gnueabihf-ld
arm-unknown-linux-gnueabihf-ldd
arm-unknown-linux-gnueabihf-nm
Note that the components of the distro-supplied (i.e. native and default) toolchain may not necessarily be prefixed with the tuple.
The sysroot is the root directory that toolchain components such as gcc(1) and ld(1) use to find headers and libraries. For example, the --sysroot=dir command line option instructs gcc to use dir as the logical root directory for headers and libraries. So, if the compiler would normally search for headers in /usr/include and for libraries in /usr/lib, it will instead search dir/usr/include and dir/usr/lib. See the man pages of gcc(1), ld(1), etc. for more details.
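The remapping performed by --sysroot can be demonstrated by pointing it at an empty directory, under which no usr/include exists. A sketch, assuming gcc is installed (the directory name is made up for illustration):

```shell
cat > s.c <<'EOF'
#include <stdio.h>
int main(void) { return 0; }
EOF

# Headers are found under the default sysroot:
gcc -E s.c > /dev/null && echo "stdio.h found"

# With the logical root remapped to an empty directory, the compiler
# now looks for /tmp/empty-sysroot/usr/include/stdio.h and fails:
mkdir -p /tmp/empty-sysroot
gcc --sysroot=/tmp/empty-sysroot -E s.c > /dev/null 2>&1 || echo "stdio.h not found"
```

In cross-development, the same option is used the other way around: the cross-compiler is pointed at a populated sysroot containing the target's headers and libraries.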
Binary toolchain packages typically originate from vendors (as part of a Board Support Package, BSP), third-party products, or initiatives and community projects built around certain hardware. These packages offer convenience, i.e. easy installation, and have typically already been widely tested and proven.
Nevertheless, they may lack support for newer processor features and optimisations due to the use of aging toolchain components. Some may even have been optimised for a specific processor, or may include unknown patchsets. Other issues include relocatability, i.e. a package may not fit your particular build system.
A key component of a toolchain is the C library. On GNU/Linux, as on most other operating systems, it provides the kernel/userland interface. Accordingly, the C library is built against a set of "sanitised" kernel headers. But since this involves a specific kernel version (effectively a snapshot of the kernel releases), a few notable constraints result with respect to the dynamic nature of Linux:
Applications compiled against the C library are only able to take advantage of features in kernel versions up to and including the kernel the C library was built against.
At least for (e)glibc, the kernel maintains backwards compatibility, and so applications will still run on later (more recent) kernel versions; only, the new features of these kernels will not be available to them.
For instance, info about the kernel version used to build the C library in Ubuntu x86-64 can be obtained via:
$ /lib/x86_64-linux-gnu/libc.so.6
...
Compiled on a Linux 3.2.50 system on 2013-09-30.
...
... and this value coincides with the version of its Linux development headers:
$ cat /usr/include/linux/version.h
#define LINUX_VERSION_CODE 197170
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
i.e. (3 << 16) + (2 << 8) + 50 = 197170
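The encoding and decoding of LINUX_VERSION_CODE can be verified with shell arithmetic:

```shell
# KERNEL_VERSION(3,2,50): major in bits 16+, minor in bits 8-15, patch in bits 0-7
echo $(( (3 << 16) + (2 << 8) + 50 ))                                   # prints 197170

# Decoding LINUX_VERSION_CODE back into major, minor and patch:
echo $(( 197170 >> 16 )) $(( (197170 >> 8) & 255 )) $(( 197170 & 255 )) # prints 3 2 50
```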
(e)glibc includes support for kernel versions older than the one it was compiled against, but only up to a certain version. For example, (e)glibc compiled against Linux 3.2.50 may still include support for kernels as old as v2.6.24. This oldest version is what is often referred to as the minimum supported kernel version in toolchain lingo. A few implications of this are:
Running applications compiled against this toolchain on an older (but supported) kernel than the one its (e)glibc was compiled against may result in sub-optimal performance: the C library may now have to provide its own userspace implementations of some kernel features.
Attempts to use a kernel version older than the minimum supported kernel version will result in the FATAL: kernel too old error message before the userspace application's main() is called. In a traditional system boot, this would be before /sbin/init gets called.
Info about the minimum kernel version supported by the C library can be obtained by reading the .note.ABI-tag section of the ELF object file. On Ubuntu 12.04 x86-64, for instance:
$ readelf -n /lib/x86_64-linux-gnu/libc-2.15.so | grep Linux
    OS: Linux, ABI: 2.6.24
$ file /lib/x86_64-linux-gnu/libc-2.15.so
... for GNU/Linux 2.6.24 ...
Since these version-related constraints cannot be adjusted when working with pre-built toolchain binaries, the minimum supported kernel version should (at the very least) be established first when selecting a toolchain package or release version.
Ideally, the kernel headers that were used to build the toolchain's C Library should match the version of the Linux kernel on the target. Flexibility in selection of toolchain component versions is greater when building a toolchain from source.
While pre-compiled toolchain packages offer convenience in terms of installation, building a toolchain from source presents its own advantages. Notably, it allows for flexibility in choosing toolchain component versions and for fine-tuning in order to meet other specific requirements. This includes the possibility of optimising for a particular processor and of configuring the toolchain to fit your build system. Since the origins are upstream sources, fixes are easy to apply.
However, rolling your own toolchain has its cons. Upstream may have poor support for a particular processor (missing, incomplete or improper patches). How much of an issue this will be depends on the nature of the problem, your skill level and/or resources. As is typical with other community-based upstream projects, you may have to rely on community support. Validation of the toolchain may also present an issue.
There exist a number of automated toolchain build platforms (e.g. crosstool-NG). These greatly facilitate an otherwise painstaking enterprise: building a toolchain by hand requires a good understanding of the roles of the various (and numerous) packages involved, and is both a time-consuming and error-prone process due to the delicate dependencies that exist among versions of individual components.
gcc will by default use cc1's internal preprocessor.
cc1 and collect2 are part of GCC, whereas as(1) and ld(1) are part of binutils.
6. Check out src/readelf.c of its source code.