250P Computer Architecture


Home
Homework 1: Simple Performance Measurements This assignment will make you more familiar with how to perform simple performance measurements on x86 machines. You should do this assignment on the openlab machines. NOTE: YOU CANNOT PUBLICLY RELEASE SOLUTIONS TO THIS HOMEWORK. It's ok to show your work to your future employer as a private github/gitlab repo, however any public release is prohibited. Download the main.c, and look it over. This is a skeleton for a simple UNIX program. To compile `main.c`, you need a C compiler, such as gcc. On Openlab machines, you can compile the skeleton with the following command: $ gcc main.c -o perf-hw1 This will produce an `perf-hw1` file, which you can run: $ ./perf-hw1 In the rest of this part of the assignment you will explore how to measure latency of individual parts of your program. Measuring latency of individual instrucitons with RDTSC In general measuring performance of individual instructions is hard. Start this homework by reading the following white paper by Intel: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. Preparing measurements Download the main.c file and look over it. It contains the definition for the `rdtsc` function that we will use to access the time-stamp counter (TSC). You can compile this code with the following command: gcc -O2 -g main.c -o perf-hw1 Here we're telling the `gcc` compiler to use `O2` level of optimizations and generate debug information (`-g`) so we can look at the source code while stepping through the program with GDB. Warmup: Function invocations Now imagine we would like to know how many cycles it takes to perform a simple function invocation, e.g. c = foo(a, b); It's a bit hard to measure, because precision of the `rdtsc` instruction does not allow us to measure the cost of one function invocation. An obvious strategy is to perform invocation in a loop for N iteration and then divide the total average by N. Lets introduce a simple helper function that does this for us: int __attribute__ ((noinline)) foo(int a, int b) { return a + b; } int test_func_invocation() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); for (int i = 0; i < runs; i++) { ret = foo(ret, i); } end = rdtsc(); printf("func invocation cycles:%lu\n", (end - start)/runs); return 0; } Here we take TSC measurements before and after the loop and report a total number of cycles devided by the number of iterations (`runs`). There is a couple of unusual things: 1) the `__attribute__ ((noinline))` is required to prevent the compiler from optimizing out the function invocation, 2) the `ret` value is declared with the `volatile` attribute again required to prevent compiler from optimizing out the function invocation entirely. Don't forget to invoke this helper function from `main`. Now you can compile the code and run it again ./perf-hw1 Lets run it a couple of time. The number of cycles on my machine stays the same, which means that our measurement is reasonably stable. Lets take a look at what code is actually running. You can take a look at the generated code either with the GDB or with objdump. Lets first use GDB, you can start it like gdb ./perf-hw1 If you never used GDB before, I suggest you quickly look over this homework from cs238p Operating Systems class: GDB Homework You can set the breakpoint on the function you would like to explore, i.e., (gdb) b test_func_invocation And run the program with (gdb) r Now switch to the split layout with (gdb) layout split You can see both source code and the generated assembly. You can single step individual instructions with the "step instruction" command, or `si` (gdb) si If you don't want to step into `foo` each time you can use the "next instruction" or `ni` command. Alternatively, you can take a look at the generated code with the objdump tool, you can invoke objdump like objdump -M intel -d perf-hw1 you can search for the `test_func_invocation` function and locate the main body of the loop we're measuring: 8b0: 8b 7c 24 0c mov edi,DWORD PTR [rsp+0xc] 8b4: 89 d6 mov esi,edx 8b6: e8 95 ff ff ff call 850 8bb: 83 c2 01 add edx,0x1 8be: 89 44 24 0c mov DWORD PTR [rsp+0xc],eax 8c2: 81 fa 00 e1 f5 05 cmp edx,0x5f5e100 8c8: 75 e6 jne 8b0 The loop body first saves `ret` that is allocated on the stack into the register that carries the first function argument (the calling convention is that the first argument is passed in `edi` and the second in `esi`). It then moves the value of `i` into the second argument and calls the function with `call`. The code then increments `i` which is in `edx`. The result is returned in `eax` (again this is the calling convention). The code compares `i` with `0x5f5e100` (which is hex for 100000000) and if they are not equal repeats the loop. What to submit: take a look at the provided template and fill in the cycles you measured. Addition instruction Now lets try to measure the cost of executing a simple addition instruction. Lets create another helper function int test_add() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); for (int i = 0; i < runs; i++) { ret = ret + i; } end = rdtsc(); printf("add cycles:%lu\n", (end - start)/runs); return ret; } If we compile and run the measurement you will see that addition is measured to take 3 cycles. This is a bit higher than expected, since remember we looked at the Agner Fog's instruction tables and there it was 1 cycle. Moreover, our pipeline seems to be able to execute addition in one cycle. It looks like addition is not measured correctly due to the high overhead of maintaining the loop. Let's try to come up with a better way to measure overhead of individual instructions. Lets introduce a macro that allows us to execute something 10000 times #define OP(x) { x; } #define OP10(x) OP(x) OP(x) OP(x) OP(x) \ OP(x) OP(x) OP(x) OP(x) \ OP(x) OP(x) #define OP100(x) OP10(x) OP10(x) OP10(x) OP10(x) \ OP10(x) OP10(x) OP10(x) OP10(x) \ OP10(x) OP10(x) #define OP1000(x) OP100(x) OP100(x) OP100(x) OP100(x) \ OP100(x) OP100(x) OP100(x) OP100(x) \ OP100(x) OP100(x) #define OP10000(x) OP1000(x) OP1000(x) OP1000(x) OP1000(x) \ OP1000(x) OP1000(x) OP1000(x) OP1000(x) \ OP1000(x) OP1000(x) Here the `OP10000(x)` macro expands recursively generating 10000 instances of `x`. Now if we add the following helper function int test_add_op() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); asm volatile ("":::"%rax"); OP10000({asm volatile ("add %rbx, %rax");}); end = rdtsc(); printf("add op cycles:%lu\n", (end - start)); return ret; } Now we can measure how many cycles it takes to execute the `add` instruction 10000 times. Here, we're using inline assembly to generate the specific instruction we would like to execute. Specifically, we're using the ATT syntax which means that we simply copy the `%rbx` register into `%rax`. Note, that the assembly instruction clobbers the `%rax` register and might break the code around it, if `%rax` contains something useful. I add `asm volatile ("":::"%rax");` telling the compiler that this empty assembly sequence clobbers the `%rax` (the compiler obeys this instruction and will generate the code in such a manner that either `%rax` is not used or it's saved on the stack. (Well strictly speaking there are no guarantees, but in practice this trick works). Note that if you invoke the program several times the measured number is constantly changing. The CPU changes frequency constantly to achieve an optimal power utilization. You can run the program multiple times with the following shell command if you're using bash. for i in {1..15}; do ./perf-hw1; done You can watch the frequency of your CPU cores changing with the following command watch "cat /proc/cpuinfo \| grep MHz" Let's also add a little routine that can warm up the core before we're taking the measurement putting the core to the highest available frequency. void warmup() { for(unsigned long i = 0; i < 1000; i++) { asm volatile ("":::"%rax"); OP10000({asm volatile ("add %rbx, %rax");}); } return; } This warmup routine takes roughly 6M cycles and is enough to warm up the core. Run this routine before taking your measurement. Finally, the kernel can also move the execution of your program between the cores. To prevent this, we can pin the program to a specific core, e.g., core 3 (please pick one at random on the openlab machines) with the following command: for i in {1..15}; do taskset -c 3 ./perf-hw1; done Now the question is what to report? Clearly some measurements are outliers and the system reaches a stable state only after some time. Lets report the average of the last five measurements. What to submit: take a look at the provided template and fill in the cycles you measured for both techniques and small explanations for the assembly code generated by both methods. More measurements Now since you understand the basics of taking performance measurements, your task is to conduct similar measurements for the following things (you need to create helping functions similar to the one above, fill in the same things into the template: explanations for the assembly code and performance measurements). Division Multiplication Malloc (loop method only, no asm) Copy from one array of 256 characters into another using a loop (loop method only, no asm) memcpy() of 256 bytes (loop method only, no asm) Hints The compiler might try to optimize out your memory allocations or memory copies. Try touching the values after you take the measurements by, for example, reading each element and adding it to the return value that you return from the function. Submit your work Submit your solution through Gradescope Gradescope CS250P Computer Architecture. You need to submit `main.c` and `template.md`. You can resubmit as many times as you wish.

Updated: September, 2021