This assignment will make you more familiar with the organization of ELF files. You can do this assignment on any operating system that supports the Unix API (Linux Openlab machines, your laptop that runs Linux or Linux VM, and even MacOS). You don't need to set up Xv6 for this assignment.
For MacOS users, the support of 32 bit applications is deprecated in the latest version of your system. So if you already updated your system to MacOS Catalina or have updated your XCode then we recommend you to do the homework at the Openlab machines.
Part 1: Take a Look at ELF files
Download the main.c, and elf.c and look them over. At a high level this homework asks you to implement a simple ELF loader (you will extend the main.c file) and use it to load a simple ELF object file (the one compiled from elf.c). However, before starting this task, lets make ourselves familiar with ELF files.
We provide a simple Makefile that compiles elf.o and main as ELF executables. Look over the Makefile and then compile both files by running:
Now, let's take a look at the ELF files that we just compiled. We will use the readelf tool:
ELF is the file format used for .o object files, binaries, shared libraries and core dumps in Linux. It's actually pretty simple and well thought-out.
ELF has the same layout for all architectures, however endianness and word size can differ; relocation types, symbol types and the like may have platform-specific values, and of course the contained code is architecture specific.
ELF files are used by two tools: the linker and the loader. A linker combines multiple ELF files into an executable or a library and a loader loads the executable ELF file in the memory of the process. On real operating systems, loading may require relocation (e.g., if the file is dynamically linked it has to be linked again with all the shared libraries it depends on). In this homework we will not do any relocation (it's too complicated), we'll simply load an ELF file in memory and run it.
Linker and loader need two different views of the ELF file, i.e., they access it differently.
On the one hand, the linker needs to know where the DATA, TEXT, BSS, and other sections are to merge them with sections from other libraries. If relocation is required the linker needs to know where the symbol tables and relocation information is.
On the other hand, the loader does not need any of these details. It simply needs to know which parts of the ELF file are code (executable), which are data and read-only data, and where to put the BSS in the memory of a process.
Hence, the ELF file provides two separate views on the data inside the ELF file: a linker view with several details, and a loader view, a higher level view with less details.
To provide these views each ELF file contains two tables (or arrays):
- Section Header Table With pointers to sections to be used by the linker.
- Program Header Table With pointers to segments to be used by the loader.
Both tables are simply arrays of entries that contain information about each part of the ELF file (e.g., where the sections/segments used by the linker/loader are inside the ELF file).
Here is a simple figure of a typical ELF file that starts with the ELF header. The header contains pointers to the locations of Section Header Table and Program Header Table within the ELF file. Then each tables have entries that point to the starting locations of individual sections and segments.
Typical ELF file. The linker uses the Section Header Table, and the loader uses the Program Header Table.
It's a bit annoying but the parts of the ELF file used by the linker are called "Sections", and the parts used by the loader are called "segments" (my guess is that different CPU segments were configured in the past for each part of the program loaded in memory, hence the name "segments", for example, an executable CPU segment was created for the executable parts of the ELF file (i.e., one segment that contained all executable sections like .text, and .init, etc.).
Also don't get confused: sections and segments do overlap. I.e., typically multiple sections (as the linker sees the ELF file, such as .text, and .init) are all contained in one executable segment (what the loader sees). Check the previous image and see how depending on the perspective, same parts of the ELF file belong to one section and one segment. Confusing, huh? It will become clear soon.
Linking View: Section Header Table (SHT)
The Section Header Table is an array in which every entry contains a pointer to one of the sections of the ELF file. Lets take a look at what inside the ELF file. Run this command, and scroll down to the Section headers you will see all "sections" of the ELF file that the linker can use:
Since elf.c is a very simple program, it has only .text section (i.e., code of the program), a bunch of sections that contain debugging information, and a .symtab section that contains imported and exported symbols.
Note, there is no .data or .bss sections for global variables (there are no globals in elf.c).
You can experiment by adding a not initialized global variable to elf.c, recompile, and see the difference in the SHT. Then add an initialized global variable, recompile and check again the SHT.
Make sure of removing the global variables and recompile before continuing with the assignment.)
Moreover, since we linked elf.c to be a static executable that is linked to run if loaded at address 0x0 (the -Ttext 0 in the Makefile tells the linker to relocate the executable at linking time to work at 0x0):
The symbol table contains these symbols
The main is our function (it's FUNC, and GLOBAL), the __bss_start, _edata, and _end are added by the linker to mark the start and end of the BSS, TEXT, and DATA sections.
If we take a look at the main executable, the ELF file is more complicated.
It contains all the section we've mentioned in class: .text (main code of the program), .data (data section for global variables), .rodata (data section for global read-only variables), .bss (uninitialized global variables), .init (init section to call the constructors that run before main()), .got(Global Offset Table), .plt (Procedure Linking Table for lazy linking of imported functions), and even the .interp (the section for the interpreter, i.e., the linker that links dynamically linked program before it runs, typically it's /lib/ld-linux.so.2 on Linux systems.
Execution view: Program Header Table (PHT)
The Program Header Table contains information for the kernel on how to start the program. The LOAD directives determinate what parts of the ELF file get mapped into program memory.
Again, in our elf example the program header defines only two segments. And only one of them should be loaded by the operating system in memory to run.
The only loadable section is linked to run at address 0x00000000. We can inspect the elf binary with the objdump tool to see what is there:
Well, no surprises: it's the main function compiled into machine code.
Putting it All Together
Neither the SHT nor the PHT have fixed positions, they can be located anywhere in an ELF file. To find them the ELF header is used, which is located at the very start of the file.
The first bytes contain the elf magic "\x7fELF"
, followed by the class ID (32 or 64 bit ELF file), the data format ID (little endian/big endian), the machine type, etc.
At the end of the ELF header are then pointers to the SHT and PHT. Specifically, the Segment Header Table which is used by the linker starts at byte 1120 in the ELF file, and the Program Header Table starts at byte 52 (right after the ELF header)
Finally, the entry point of this file is at address 0x0. This is exactly what we told the linker to do — link the program to run at address 0x00000000. And this is where the main function of the elf.c file is shown in the objdump.
Program Loading in the Kernel
The execution of a program starts inside the kernel, in the exec("/bin/wc",...) system call takes a path to the executable file. The kernel reads the ELF header and the program header table (PHT), followed by lots of sanity checks.
For statically linked executables:
- The kernel reads the PHT, and loads the parts specified in the LOAD directives into memory.
- Once the parts specified in LOAD directives are loaded, control can be transferred to the entry point of the program, which is main().
For dynamically linked executables:
- The kernel reads the PHT, and loads the parts specified in the INTERP and LOAD directives into memory. Dynamically linked programs always need /lib/ld-linux.so as interpreter because it includes some startup code, loads shared libraries needed by the binary, and performs relocations.
- Once the parts specified in INTERP and LOAD directives are loaded, control can be transferred to the interpreter.
- The dynamic linker (contained within the interpreter):
- Looks at the .dynamic section, whose address is stored in the PHT; and finds:
- The NEEDED entries determining which libraries have to be loaded before the program can be run.
- The *REL* entries containing the address of the relocation tables.
- The VER* entries which contain symbol versioning information.
- Etc...
- The dynamic linker loads the needed libraries and performs relocations (either directly at program startup or later, as soon as the relocated symbol is needed, depending on the relocation type).
- Finally, control is transferred to the address given by the symbol _start in the binary. Normally some gcc/glibc startup code lives there, which in the end calls main().
What do you need to do in this homework: load an ELF file
While ELF might look a bit intimidating, in practice the loading algorithm is trivial:
- Read the ELF header.
- One of the ELF header fields tells you the offset of the program header table inside the file.
- Read each entry of the program header table (i.e., read each program header)
- Each program header has an offset and size of a specific segment inside the ELF file (e.g., a executable code). You have to read it from the file and load it in memory.
- When done with all segments, jump to the entry point of the program. (Note since we don't control layout of the address space at the moment, we load the sections at some random place in memory (the place that is allocated for us by the mmap() function). Obviously the address of the entry point should be an offset within that random area. Such loading will not work for a real ELF file, but ours is simple: it's statically linked, and contains the code that can run at any location in memory. So even though it's linked to run at 0x0 it will run anywhere where you load it.
Looks manageable. We make a couple of simplifications. First we create a very simple ELF file out of elf.c: it contains only one function, it has no data, and the code can be placed anywhere in memory and will run ok (it simply does not refer to any global addresses, all variables are on the stack).
The main.c file provides definitions for the structs that match the ELF header and entries of the Program Header Table. So you can simply read the header out of the ELF file with the open, lseek, and read functions, read the offset of the program header table, get the number of the entries in the program header table from the ELF header, and read all entries one by one. If the entry has ELF_PROG_LOAD type, you will load it in memory.
To load the segment in memory, you can allocate executable memory with the following function
where ph.memsz is the size of the segment that we currently are loading.
You should figure out where the entry point of your program is (it is in one of the segments, and since you map the segments at the addresses returned by the mmap() function, you need to find where it ends up in your address space.
If you found the entry point then type cast it to the function pointer that matches the function signature of the sum function and call it.
Your program should take the name of the executable elf file as its only argument and return the result produced by the function main present in that elf file. In particular, if the source code of the elf is:
Then your loader should print the following result:
Please note that different elf files (one at the time) with different definitions of the function main will be passed to your loader in order to test the correctness of it.
Extra credit: (10% bonus)
Try loading elf-data.c file. Can you explain why it crashes?
Fix the crash by either ensuring that the file is linked to work at the address that is available inside your process (it's a bit tricky to ensure that any specific address is available, but we'll accept any meaningful solution), or by performing a simple relocation step for the ELF file when it is loaded.
These ELF tutorial and Executable and Linkable Format 101 Part 3: Relocations resources can be helpful.
Submit your work
Submit your solution (the zip archive) through Gradescope in the assignment called HW3 - The ELF Format. Please zip all of your files (main.c, Makefile) and submit them. If you have done extra credit then place that main.c and Makefile in an additional folder coled "extra". Please note that the non-optional part must be in the root of the zip archive, not inside yet another folder.
You can resubmit as many times as you wish. If you have any problems with the structure the autograder will tell you. The structure of the zip file should be the following: