Homework 1: Simple Performance MeasurementsThis assignment will make you more familiar with how to perform simple performance measurements on x86 machines. You should do this assignment on the openlab machines. NOTE: YOU CANNOT PUBLICLY RELEASE SOLUTIONS TO THIS HOMEWORK. It's ok to show your work to your future employer as a private github/gitlab repo, however any public release is prohibited. Download the main.c, and look it over. This is a skeleton for a simple UNIX program. To compile main.c, you need a C compiler, such as gcc. On Openlab machines, you can compile the skeleton with the following command: $ gcc main.c -o perf-hw1This will produce an perf-hw1 file, which you can run: $ ./perf-hw1In the rest of this part of the assignment you will explore how to measure latency of individual parts of your program. Measuring latency of individual instrucitons with RDTSCIn general measuring performance of individual instructions is hard. Start this homework by reading the following white paper by Intel: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures.Preparing measurementsDownload the main.c file and look over it. It contains the definition for the rdtsc function that we will use to access the time-stamp counter (TSC). You can compile this code with the following command:gcc -O2 -g main.c -o perf-hw1Here we're telling the gcc compiler to use O2 level of optimizations and generate debug information (-g) so we can look at the source code while stepping through the program with GDB. Warmup: Function invocationsNow imagine we would like to know how many cycles it takes to perform a simple function invocation, e.g.c = foo(a, b);It's a bit hard to measure, because precision of the rdtsc instruction does not allow us to measure the cost of one function invocation. An obvious strategy is to perform invocation in a loop for N iteration and then divide the total average by N. Lets introduce a simple helper function that does this for us: int __attribute__ ((noinline)) foo(int a, int b) { return a + b; } int test_func_invocation() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); for (int i = 0; i < runs; i++) { ret = foo(ret, i); } end = rdtsc(); printf("func invocation cycles:%lu\n", (end - start)/runs); return 0; }Here we take TSC measurements before and after the loop and report a total number of cycles devided by the number of iterations (runs). There is a couple of unusual things: 1) the __attribute__ ((noinline)) is required to prevent the compiler from optimizing out the function invocation, 2) the ret value is declared with the volatile attribute again required to prevent compiler from optimizing out the function invocation entirely. Don't forget to invoke this helper function from main. Now you can compile the code and run it again ./perf-hw1Lets run it a couple of time. The number of cycles on my machine stays the same, which means that our measurement is reasonably stable. Lets take a look at what code is actually running. You can take a look at the generated code either with the GDB or with objdump. Lets first use GDB, you can start it like gdb ./perf-hw1If you never used GDB before, I suggest you quickly look over this homework from cs238p Operating Systems class: GDB Homework You can set the breakpoint on the function you would like to explore, i.e., (gdb) b test_func_invocationAnd run the program with (gdb) rNow switch to the split layout with (gdb) layout splitYou can see both source code and the generated assembly. You can single step individual instructions with the "step instruction" command, or si (gdb) siIf you don't want to step into foo each time you can use the "next instruction" or ni command. Alternatively, you can take a look at the generated code with the objdump tool, you can invoke objdump like objdump -M intel -d perf-hw1you can search for the test_func_invocation function and locate the main body of the loop we're measuring: 8b0: 8b 7c 24 0c mov edi,DWORD PTR [rsp+0xc] 8b4: 89 d6 mov esi,edx 8b6: e8 95 ff ff ff call 850The loop body first saves ret that is allocated on the stack into the register that carries the first function argument (the calling convention is that the first argument is passed in edi and the second in esi). It then moves the value of i into the second argument and calls the function with call. The code then increments i which is in edx. The result is returned in eax (again this is the calling convention). The code compares i with 0x5f5e100 (which is hex for 100000000) and if they are not equal repeats the loop. What to submit: take a look at the provided template and fill in the cycles you measured. Addition instructionNow lets try to measure the cost of executing a simple addition instruction. Lets create another helper functionint test_add() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); for (int i = 0; i < runs; i++) { ret = ret + i; } end = rdtsc(); printf("add cycles:%lu\n", (end - start)/runs); return ret; }If we compile and run the measurement you will see that addition is measured to take 3 cycles. This is a bit higher than expected, since remember we looked at the Agner Fog's instruction tables and there it was 1 cycle. Moreover, our pipeline seems to be able to execute addition in one cycle. It looks like addition is not measured correctly due to the high overhead of maintaining the loop. Let's try to come up with a better way to measure overhead of individual instructions. Lets introduce a macro that allows us to execute something 10000 times #define OP(x) { x; } #define OP10(x) OP(x) OP(x) OP(x) OP(x) \ OP(x) OP(x) OP(x) OP(x) \ OP(x) OP(x) #define OP100(x) OP10(x) OP10(x) OP10(x) OP10(x) \ OP10(x) OP10(x) OP10(x) OP10(x) \ OP10(x) OP10(x) #define OP1000(x) OP100(x) OP100(x) OP100(x) OP100(x) \ OP100(x) OP100(x) OP100(x) OP100(x) \ OP100(x) OP100(x) #define OP10000(x) OP1000(x) OP1000(x) OP1000(x) OP1000(x) \ OP1000(x) OP1000(x) OP1000(x) OP1000(x) \ OP1000(x) OP1000(x)Here the OP10000(x) macro expands recursively generating 10000 instances of x. Now if we add the following helper function int test_add_op() { uint64_t start; uint64_t end; volatile int ret = 0; start = rdtsc(); asm volatile ("":::"%rax"); OP10000({asm volatile ("add %rbx, %rax");}); end = rdtsc(); printf("add op cycles:%lu\n", (end - start)); return ret; }Now we can measure how many cycles it takes to execute the add instruction 10000 times. Here, we're using inline assembly to generate the specific instruction we would like to execute. Specifically, we're using the ATT syntax which means that we simply copy the %rbx register into %rax. Note, that the assembly instruction clobbers the %rax register and might break the code around it, if %rax contains something useful. I add asm volatile ("":::"%rax"); telling the compiler that this empty assembly sequence clobbers the %rax (the compiler obeys this instruction and will generate the code in such a manner that either %rax is not used or it's saved on the stack. (Well strictly speaking there are no guarantees, but in practice this trick works). Note that if you invoke the program several times the measured number is constantly changing. The CPU changes frequency constantly to achieve an optimal power utilization. You can run the program multiple times with the following shell command if you're using bash. for i in {1..15}; do ./perf-hw1; doneYou can watch the frequency of your CPU cores changing with the following command watch "cat /proc/cpuinfo | grep MHz"Let's also add a little routine that can warm up the core before we're taking the measurement putting the core to the highest available frequency. void warmup() { for(unsigned long i = 0; i < 1000; i++) { asm volatile ("":::"%rax"); OP10000({asm volatile ("add %rbx, %rax");}); } return; }This warmup routine takes roughly 6M cycles and is enough to warm up the core. Run this routine before taking your measurement. Finally, the kernel can also move the execution of your program between the cores. To prevent this, we can pin the program to a specific core, e.g., core 3 (please pick one at random on the openlab machines) with the following command: for i in {1..15}; do taskset -c 3 ./perf-hw1; doneNow the question is what to report? Clearly some measurements are outliers and the system reaches a stable state only after some time. Lets report the average of the last five measurements. What to submit: take a look at the provided template and fill in the cycles you measured for both techniques and small explanations for the assembly code generated by both methods. More measurementsNow since you understand the basics of taking performance measurements, your task is to conduct similar measurements for the following things (you need to create helping functions similar to the one above, fill in the same things into the template: explanations for the assembly code and performance measurements).
HintsThe compiler might try to optimize out your memory allocations or memory copies. Try touching the values after you take the measurements by, for example, reading each element and adding it to the return value that you return from the function.Submit your workSubmit your solution through Gradescope Gradescope CS250P Computer Architecture. You need to submit main.c and template.md. You can resubmit as many times as you wish. |
|||
Updated: April, 2020
|