1 00:00:03,129 --> 00:00:07,360 *35C3 preroll music* 2 00:00:18,780 --> 00:00:23,869 Herald: So the next talk: Benjamin Kollenda and Philipp Koppe - they will refresh our 3 00:00:23,869 --> 00:00:30,529 memories because they already had a talk at 34C3 where they talked about the 4 00:00:30,529 --> 00:00:37,580 microcode ROM and today they're gonna give us more insights on how microcode works. And 5 00:00:37,580 --> 00:00:44,320 more details on the ROM itself. Benjamin is a PhD student and has a focus on 6 00:00:44,320 --> 00:00:51,280 software attacks and defenses and together with Philipp they will now abuse AMD 7 00:00:51,280 --> 00:00:55,190 microcode for fun and security. Please enjoy. 8 00:00:55,190 --> 00:00:58,730 *Applause* 9 00:01:01,320 --> 00:01:06,260 Benjamin: Thank you. So as mentioned we were able to reverse engineer the AMD 10 00:01:06,260 --> 00:01:11,599 microcode and the AMD microcode ROM and I'm going to talk about our journey: what 11 00:01:11,599 --> 00:01:16,369 we learned on the way and how we did it. So this is joint work with my colleagues at 12 00:01:16,369 --> 00:01:20,799 Ruhr-Universität Bochum. A quick outline of how we are going to do it: We're going to 13 00:01:20,799 --> 00:01:25,380 start with a quick crash course on microarchitectural basics and what microcode 14 00:01:25,380 --> 00:01:28,350 actually is. Then I talk about how we reconstructed the 15 00:01:28,350 --> 00:01:30,330 microcode ROM and what we learned 16 00:01:30,330 --> 00:01:35,389 along the way. Then I quickly give some examples of the applications we 17 00:01:35,389 --> 00:01:41,430 implemented with the knowledge we gained from the second step. And lastly I talk about 18 00:01:41,430 --> 00:01:47,649 a framework we used: how it works and what we can do with it. And also this framework 19 00:01:47,649 --> 00:01:51,899 is available on GitHub along with some other tools so you're free to continue our 20 00:01:51,899 --> 00:01:57,189 work. OK. 
So when I'm talking about microcode you can think of it essentially 21 00:01:57,189 --> 00:02:02,331 as a firmware for your processor. It handles multiple purposes for example 22 00:02:02,331 --> 00:02:06,440 you can use it to fix CPU bugs that you have in silicon and you want to fix later 23 00:02:06,440 --> 00:02:11,971 in the design phase. It is used for instruction decoding - I cover this one a 24 00:02:11,971 --> 00:02:17,970 bit more. It is also used for exception handling. For example, if an exception or 25 00:02:17,970 --> 00:02:22,200 interrupt is raised, microcode has a first chance of modifying this interrupt 26 00:02:22,200 --> 00:02:27,110 ignoring it or just passing it along to the operating system. It's also used for 27 00:02:27,110 --> 00:02:31,790 power management and some other complex features like Intel SGX. And most 28 00:02:31,790 --> 00:02:37,318 importantly for us microcode is updatable. This used to patch errors in the field. 29 00:02:37,318 --> 00:02:40,975 Everyone remembers Spectre / Meltdown patches and there's 30 00:02:40,975 --> 00:02:44,210 a microcode update. So your 31 00:02:44,210 --> 00:02:50,830 x86 CPU takes multiple steps to execute an instruction. The first step is decoding 32 00:02:50,830 --> 00:02:55,022 a x86 instruction into multiple smaller micro ops. 33 00:02:55,022 --> 00:02:57,150 These are then scheduled into the pipeline 34 00:02:57,150 --> 00:03:01,632 From there, they are dispatched to the different functional units 35 00:03:01,632 --> 00:03:03,532 like your ALU / AGU 36 00:03:03,532 --> 00:03:06,392 multiplication division units 37 00:03:06,392 --> 00:03:08,355 For our purposes the decode step is the 38 00:03:08,355 --> 00:03:12,190 most interesting one. In the decode step you have a instruction buffer that feeds 39 00:03:12,190 --> 00:03:17,030 instructions to some decoders. You have short decoders that handle really simple 40 00:03:17,030 --> 00:03:21,100 instructions. 
There are long decoders that can handle some more advanced instructions. 41 00:03:21,100 --> 00:03:25,260 And finally, the vector decoder. The vector decoder handles the most complex 42 00:03:25,260 --> 00:03:29,690 instructions with the help of microcode. So the microcode engine is essentially the 43 00:03:29,690 --> 00:03:31,247 vector decoder. 44 00:03:32,458 --> 00:03:36,570 The microcode engine in essence is comprised of a microcode 45 00:03:36,570 --> 00:03:40,770 ROM that stores the instructions for the microcode engine. Think of it as your 46 00:03:40,770 --> 00:03:48,190 standard instructions. Then there is also a writeable memory, the microcode RAM. This 47 00:03:48,190 --> 00:03:52,520 is where the microcode updates end up when you apply microcode updates. And of course 48 00:03:52,520 --> 00:03:57,310 around the storage there is a whole lot of things that make it actually run. For this 49 00:03:57,310 --> 00:04:00,860 talk, you only need to know what Match Registers are. Match Registers are 50 00:04:00,860 --> 00:04:05,650 essentially breakpoint registers. So if we write an address from inside the microcode 51 00:04:05,650 --> 00:04:10,670 ROM into a Match Register, whenever this address is fetched, execution control is 52 00:04:10,670 --> 00:04:17,570 transferred to the microcode RAM so our patch gets executed. And the microcode 53 00:04:17,570 --> 00:04:23,060 updates are usually loaded by the BIOS or by the kernel. Linux has an update driver, 54 00:04:23,060 --> 00:04:28,340 sometimes the BIOS updates it with a pre-installed version. The updates have a 55 00:04:28,340 --> 00:04:32,120 pretty simple structure: a partially documented header, followed by the 56 00:04:32,120 --> 00:04:37,730 actual microcode that is loaded inside the CPU. And so microcode is organized in 57 00:04:37,730 --> 00:04:42,650 something called triads. Each triad has three operations, essentially x86 58 00:04:42,650 --> 00:04:48,230 instructions, but with some differences. 
And lastly, you have a sequence word. The 59 00:04:48,230 --> 00:04:52,025 sequence word indicates which microcode instructions should be executed next. We 60 00:04:52,025 --> 00:04:57,950 have options of executing just the next triad, executing another one by branching 61 00:04:57,950 --> 00:05:01,936 to it, or just saying OK, I'm done with decoding this instruction continue with 62 00:05:01,936 --> 00:05:07,490 x86 code. These updates are protected by some weak authentication which we were 63 00:05:07,490 --> 00:05:13,260 able to break so we can create our own. We can analyze existing ones and we can apply 64 00:05:13,260 --> 00:05:20,620 these to your standard laptop and desktop. However there can only ever be one update 65 00:05:20,620 --> 00:05:26,534 loaded at the time and when you reboot your machine this update will be gone. 66 00:05:28,490 --> 00:05:32,990 Also for the talk we are going to look at some microcode and we will present this 67 00:05:32,990 --> 00:05:38,150 microcode using a register transfer language. It is heavily based on x86. I'm 68 00:05:38,150 --> 00:05:43,290 just going to cover the differences between these two. Most importantly the 69 00:05:43,290 --> 00:05:48,650 microcode can have three operands for an instruction in comparison to x86 which 70 00:05:48,650 --> 00:05:53,640 usually only has two. So you can specify a destination and two source operands. 71 00:05:55,618 --> 00:05:56,446 Also, 72 00:05:57,210 --> 00:06:02,240 microcode has some certain bit flags that need to be set and these we do we see with 73 00:06:02,240 --> 00:06:07,449 these annotations for example ".C" means says instruction also updates a carry flag 74 00:06:07,449 --> 00:06:14,050 based on the result. Then you have the instruction "jcc" which is a conditional 75 00:06:14,050 --> 00:06:19,570 branch and the first operand denotes the condition up on which this branch is 76 00:06:19,570 --> 00:06:24,100 taken. 
In this case branch if the carry flag is one and [the] second operand 77 00:06:24,100 --> 00:06:30,300 indicates the offset to add to the instruction pointer. Then we also have 78 00:06:30,300 --> 00:06:35,760 some sequence word annotations: "next", "complete", and "branch". Also it should 79 00:06:35,760 --> 00:06:39,958 be noted that the internal microcode architecture is a load-store architecture. 80 00:06:39,958 --> 00:06:45,350 You can't use memory operands in other instructions like you can on x86 you 81 00:06:45,350 --> 00:06:48,310 always need to load and store memory explicitly. 82 00:06:49,190 --> 00:06:51,710 Now we are going to talk about 83 00:06:51,710 --> 00:06:58,710 how we manage to recover the microcode ROM. The microcode ROM is baked into your 84 00:06:58,710 --> 00:07:06,860 CPU, you can't change it anymore. It is defined in the silicon during the 85 00:07:06,860 --> 00:07:12,930 fabrication process and in this picture you can see a die shot taken with a 86 00:07:12,930 --> 00:07:16,840 electron microscope and this is one of three regions that contains the bits for 87 00:07:16,840 --> 00:07:23,240 the microcode operations. And if you zoom in a bit more, each of these regions 88 00:07:23,240 --> 00:07:30,050 consist out of four arrays and these are further subdivided into blocks. Really 89 00:07:30,050 --> 00:07:34,660 interesting is "Array 2" which is a bit smaller than the other ones but it has 90 00:07:34,660 --> 00:07:42,160 some structures above it which are of a different visual layout. This is SRAM 91 00:07:42,160 --> 00:07:47,050 which stores the microcode update. So this is one-time reprogrammable memory that is 92 00:07:47,050 --> 00:07:53,860 still pretty fast. So the microcode RAM is located right next to the microcode ROM 93 00:07:53,860 --> 00:07:57,645 which also makes sense from a design standpoint. 94 00:08:00,445 --> 00:08:02,010 Just an overview of how we 95 00:08:02,010 --> 00:08:06,930 went ahead and how we went about. 
We started with pictures and then we used 96 00:08:06,930 --> 00:08:11,456 some OCR-like process to transform them into bit strings which we can then further 97 00:08:11,456 --> 00:08:17,169 process. These bit strings were then arranged into triads. We could already 98 00:08:17,169 --> 00:08:22,050 gather that we got individual triads right because there were data dependencies 99 00:08:22,050 --> 00:08:27,550 all over the place, but between triads, there were no or very few data 100 00:08:27,550 --> 00:08:33,699 dependencies so the ordering of the triads was still wrong and this was a 101 00:08:33,699 --> 00:08:38,860 major problem when we went ahead. What we had to reverse engineer is the 102 00:08:38,860 --> 00:08:43,870 mapping from a certain physical address of a triad that we gathered from the ROM 103 00:08:43,870 --> 00:08:48,050 readout to a virtual address that is used inside the microcode update or the 104 00:08:48,050 --> 00:08:53,690 microcode ROM. But after reverse engineering this, you can just do a linear sweep 105 00:08:53,690 --> 00:08:59,020 disassembly of the microcode ROM and arrive at human readable output. But this 106 00:08:59,020 --> 00:09:04,870 recovery was a bit tricky because we required physical-virtual address pairs. 107 00:09:04,870 --> 00:09:09,520 Gathering these is a bit harder: we worked through the 108 00:09:09,520 --> 00:09:14,040 available updates, but we could only find two pairs of them. These pairs were 109 00:09:14,040 --> 00:09:18,520 actually easy to find because every update replaces a certain triad inside your 110 00:09:18,520 --> 00:09:24,580 microcode ROM and this triad is usually also placed in the microcode update. So by 111 00:09:24,580 --> 00:09:31,260 matching the address this update replaces with the microcode ROM readout, you can just 112 00:09:31,260 --> 00:09:38,000 get your two data points. 
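
The matching step just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the ROM readout is modeled as a list of bit strings indexed by physical position, and `rom_readout`, `update`, and `find_physical_addresses` are hypothetical names with made-up data, not part of the speakers' tooling.

```python
def find_physical_addresses(rom_readout, triad_bits):
    """Return every physical index whose triad matches bit for bit."""
    return [i for i, triad in enumerate(rom_readout) if triad == triad_bits]

# Toy ROM readout: one bit string per triad, in physical (die) order.
rom_readout = ["1010", "0111", "1100", "0001"]

# An update names the virtual address it patches and also carries
# a copy of the ROM triad it replaces (values invented for the sketch).
update = {"virtual_address": 0x2A, "replaced_triad": "1100"}

matches = find_physical_addresses(rom_readout, update["replaced_triad"])

# Only an unambiguous match yields a usable (physical, virtual) pair.
pair = (matches[0], update["virtual_address"]) if len(matches) == 1 else None
```

A triad that occurs more than once in the readout gives no usable pair in this sketch, which is one plausible reason why only a few updates produce data points.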
But we had to get more data points so we generated these 113 00:09:38,000 --> 00:09:42,630 mappings by matching semantics of triads in the microcode ROM readout and the 114 00:09:42,630 --> 00:09:47,779 semantics when we force execution of a certain microcode address. And gathering 115 00:09:47,779 --> 00:09:52,330 the semantics of the read-out microcode, we implemented a simple microcode 116 00:09:52,330 --> 00:09:58,820 simulator. Essentially it works on triad level, so you give it an input state and a 117 00:09:58,820 --> 00:10:03,430 triad and it calculates the output state of it. Input and output state are 118 00:10:03,430 --> 00:10:08,460 comprised out of the x86-state which is your standard registers and also the 119 00:10:08,460 --> 00:10:12,320 internal microcode registers. There are multiple temporary registers that get 120 00:10:12,320 --> 00:10:18,350 reset for every new x86 instruction that is executed, but they can also be modified 121 00:10:18,350 --> 00:10:24,130 by microcode of course. Our emulator supports all known arithmetic operations 122 00:10:24,130 --> 00:10:29,230 and we have a white-list of operations that do not form or produce any observable 123 00:10:29,230 --> 00:10:32,950 change in state just so that we could process more triades and give them more 124 00:10:32,950 --> 00:10:41,310 data points. In total we gathered 54 additional data-address pairs which turned 125 00:10:41,310 --> 00:10:46,649 out to be enough to recover the whole mapping. This mapping, essentially you 126 00:10:46,649 --> 00:10:50,820 have the four different arrays that map to individual blocks and these blocks in 127 00:10:50,820 --> 00:10:56,750 these arrays or then again permuted a bit and then the triads inside these blocks 128 00:10:56,750 --> 00:11:02,330 have some table-based permutations. So this is not an obfuscation. 
This is just 129 00:11:02,330 --> 00:11:07,680 from a hardware design standpoint it can make sense to reroute it a bit differently 130 00:11:09,330 --> 00:11:14,629 Also now that we can actually map a certain address to the microcode ROM 131 00:11:14,629 --> 00:11:19,093 readout and we know the addresses of different x86 instructions from our 132 00:11:19,093 --> 00:11:24,240 earlier experiments, we can look at the implementation of instructions. So let's 133 00:11:24,240 --> 00:11:29,130 start with a pretty simple one. Shift- Right-Double which essentially takes a 134 00:11:29,130 --> 00:11:33,250 register, shift it by a given amount and shifts in bits from another register. So 135 00:11:33,250 --> 00:11:38,180 of course you would expect a lot of shifts and rolls in its implementation and this 136 00:11:38,180 --> 00:11:45,338 is exactly what we're seeing here. You have two shift-right operands and you can 137 00:11:45,338 --> 00:11:50,830 see regmd6 and regmd4. These are place holders. The microcode engine can 138 00:11:50,830 --> 00:11:55,630 replace certain bit combinations with the registers that are used in the x86 139 00:11:55,630 --> 00:12:01,560 operation. For example this one would be replaced by ECX or EAX depending on what 140 00:12:01,560 --> 00:12:08,339 you wrote in x86. And at this point we can also already gather more information about 141 00:12:08,339 --> 00:12:13,601 microcodes than we previously knew because we know "OK, so this is source, this is 142 00:12:13,601 --> 00:12:18,529 also a source and this is a destination". But this source which indicates the shift 143 00:12:18,529 --> 00:12:22,750 amount, this one was previously unknown, because it is a high temporary microcode 144 00:12:22,750 --> 00:12:28,279 register and we found out that these usually implement specific different 145 00:12:28,279 --> 00:12:31,800 purpose. 
They are not - if you write to them, sometimes the CPU behaves 146 00:12:31,800 --> 00:12:35,890 erratically, sometimes it crashes, sometimes nothing happens. But in this 147 00:12:35,890 --> 00:12:40,300 case, this seems to be the shift count, and the shift count is given by a third 148 00:12:40,300 --> 00:12:45,279 operand in the instruction. So in this case, we already learned "OK, if you want 149 00:12:45,279 --> 00:12:51,380 to read the third operand of an instruction, we need to read t41". And 150 00:12:51,380 --> 00:12:56,236 this is how we went about recovering more and more information about microcode. The 151 00:12:56,236 --> 00:13:00,160 rest of the implementation is essentially concerned with implementing the rest of 152 00:13:00,160 --> 00:13:05,721 the semantics of the x86 instruction and updating the flags correctly. OK, so now 153 00:13:05,721 --> 00:13:11,980 let's look at a instruction set that is a bit more complicated. If you check out 154 00:13:11,980 --> 00:13:19,620 rdtsc. rdtsc returns a internal cycle counter in EDX and EAX, so the upper part 155 00:13:19,620 --> 00:13:25,520 ends up in EDX, lower part in EAX. So in the end we want to see writes to these 156 00:13:25,520 --> 00:13:30,760 registers, potentially with a shift somewhere in there. But somewhere the CPU 157 00:13:30,760 --> 00:13:37,570 needs to gather the cycle counter. So in the beginning we have two load-style 158 00:13:37,570 --> 00:13:41,410 operations. This one is a proper load which we identified and this one is 159 00:13:41,410 --> 00:13:48,569 unknown. But despite that we do not know the instruction, we know the target 160 00:13:48,569 --> 00:13:52,720 because the result of this instruction will end up in t9 and the result of this 161 00:13:52,720 --> 00:13:58,060 instruction will end up in t10, so we can follow the uses of these two registers. 
So 162 00:13:58,060 --> 00:14:04,450 for simplicity I'm going to start with t10. As we later found out, this is 163 00:14:04,450 --> 00:14:09,730 another register which essentially denotes a specific internal register. And if you 164 00:14:09,730 --> 00:14:15,450 play around with these bits you notice that this combination encodes cr4. The x86 165 00:14:15,450 --> 00:14:22,987 will just see cr4. You can also address cr1 and cr2. And if you look further, t10 166 00:14:22,987 --> 00:14:29,160 is then ANDed with this bit mask and if you look in the manual you find out that 167 00:14:29,160 --> 00:14:34,930 this bit in cr4 is the bit that determines whether rdtsc is 168 00:14:34,930 --> 00:14:40,019 available from user space or not. So this is the check if this instruction should be 169 00:14:40,019 --> 00:14:48,170 executed. So now let's just keep in mind that t9 holds some other loaded value from 170 00:14:48,170 --> 00:14:53,930 some other internal register and we will come back to this one a bit later. For 171 00:14:53,930 --> 00:14:58,848 now, let's follow execution. This triad is essentially a padding triad. It is a 172 00:14:58,848 --> 00:15:04,885 common pattern we see. So let's look at where this branch takes us. 173 00:15:05,895 --> 00:15:07,180 And this branch 174 00:15:07,180 --> 00:15:15,959 takes us to a conditional branch triad. And if you look a bit up, this AND 175 00:15:15,959 --> 00:15:21,740 instruction actually updated this flag. So this is a conditional branch that 176 00:15:21,740 --> 00:15:26,360 determines whether this check was successful or not. So it branches toward 177 00:15:26,360 --> 00:15:32,570 the error triad or the success triad. But here we already see the exit. 
We see a 178 00:15:32,570 --> 00:15:41,170 write to RDX or EDX in this case with a shift from t9 by 32 bits, which is exactly 179 00:15:41,170 --> 00:15:45,910 what you would expect to write the upper 32 bits of the 180 00:15:45,910 --> 00:15:50,829 time stamp counter to edx. And you have an unknown instruction, but we know, okay, we 181 00:15:50,829 --> 00:15:57,877 move something from t9 to eax, which is the lower 32 bits. But we're not done 182 00:15:57,877 --> 00:16:02,690 here, because we can still look at the error path that is taken if the access is 183 00:16:02,690 --> 00:16:09,210 denied. So if you scroll a bit down we can see a move of an immediate into a certain 184 00:16:09,210 --> 00:16:14,530 internal register. And this immediate actually encodes a general protection 185 00:16:14,530 --> 00:16:21,790 fault interrupt code. 0xD denotes to the exception handler that this was a general 186 00:16:21,790 --> 00:16:28,680 protection fault. And later this triad branches to this address, and if you look 187 00:16:28,680 --> 00:16:34,013 at the uses of this address we can find other immediates that also correspond 188 00:16:34,013 --> 00:16:36,962 to x86 instructions. So now we learned 189 00:16:36,962 --> 00:16:39,947 how we can actually raise our own interrupts. We 190 00:16:39,947 --> 00:16:46,100 just need to load the code we want into the specific register and branch to this 191 00:16:46,100 --> 00:16:52,820 address. And now we learned a lot about how we can actually write microcode, but 192 00:16:52,820 --> 00:16:57,000 it's also interesting to see how certain instructions are implemented. So let's 193 00:16:57,000 --> 00:17:03,671 look at a pretty complicated one: wrmsr (Write MSR). wrmsr essentially writes some 194 00:17:03,671 --> 00:17:08,449 data it is given to a machine specific register. This machine specific register 195 00:17:08,449 --> 00:17:12,980 differs between CPUs, between vendors, sometimes between revisions. 
And these 196 00:17:12,980 --> 00:17:17,910 implement non-standard extensions or pretty complex features. For example, you 197 00:17:17,910 --> 00:17:23,949 trigger a microcode update by writing to a machine specific register. The register 198 00:17:23,949 --> 00:17:30,570 addresses you want to write to is given in ecx. And now we can see ecx is read and 199 00:17:30,570 --> 00:17:39,679 it is shifted by sixteen bits to t10. So again, we follow uses of t10 and we see 200 00:17:39,679 --> 00:17:46,070 it as XOR'd with a certain bitmask. And this bitmask is C000, which actually 201 00:17:46,070 --> 00:17:52,429 denotes a namespace of the model specific registers. In this case this should be an 202 00:17:52,429 --> 00:17:58,450 AMD-specific namespace. And, of course, this one again sets some flags, and you 203 00:17:58,450 --> 00:18:04,240 can see your conditional branch depending on these flags to what should be the 204 00:18:04,240 --> 00:18:06,235 handler for this namespace. 205 00:18:06,695 --> 00:18:10,770 Next one: We have another XOR that uses a different bit 206 00:18:10,770 --> 00:18:16,890 mask — in this case C001. C001 is the namespace where the microcode update 207 00:18:16,890 --> 00:18:25,050 routine is actually located in. So again, we branch to this handler. And if you just 208 00:18:25,050 --> 00:18:31,010 continue on, there are more operations on rcx, followed by more branches, and this 209 00:18:31,010 --> 00:18:35,790 continues until everything is dispatched to the correct handler. And this is how, 210 00:18:35,790 --> 00:18:40,340 internally, wrmsr is implemented, and also Read MSR is going to be implemented pretty 211 00:18:40,340 --> 00:18:43,640 similar, because it implements some kind of similar thing. 212 00:18:47,750 --> 00:18:49,190 OK, so now I showed you 213 00:18:49,190 --> 00:18:52,470 how we actually went ahead of reconstructing the knowledge we 214 00:18:52,470 --> 00:18:57,939 currently have. 
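
The wrmsr dispatch walked through above can be mirrored in a short Python sketch. It models only what was described: ecx shifted right by sixteen bits into a temporary, an XOR against 0xC000 and then 0xC001, and a branch when the result is zero. The handler labels and the fallback string are hypothetical placeholders, and the real microcode continues with further checks that are not modeled here.

```python
# Hypothetical labels for the two namespace handlers named in the talk.
NAMESPACE_HANDLERS = (
    (0xC000, "handler_c000"),
    (0xC001, "handler_c001"),  # namespace that contains the update routine
)

def dispatch_wrmsr(ecx):
    """Pick a handler from the upper 16 bits of the MSR address in ecx."""
    t10 = (ecx >> 16) & 0xFFFF          # the shift-by-sixteen into t10
    for mask, handler in NAMESPACE_HANDLERS:
        if t10 ^ mask == 0:             # XOR sets the zero flag on a match
            return handler              # conditional branch taken
    return "further_checks"             # remaining dispatch, not modeled
```

For example, `dispatch_wrmsr(0xC0010020)` selects the C001 handler; 0xC0010020 is the MSR that the Linux kernel uses as AMD's patch loader.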
And now I'm going to show you what we can actually do with it. And 215 00:18:57,939 --> 00:19:02,440 for this I am going to quickly cover what applications we wrote in microcode. We 216 00:19:02,440 --> 00:19:04,940 wrote a simple configurable rdtsc precision. 217 00:19:04,940 --> 00:19:07,710 This means a certain bit mask is AND'd to 218 00:19:07,710 --> 00:19:11,890 the result of rdtsc, so you can reduce the accuracy of it, which can 219 00:19:11,890 --> 00:19:18,284 sometimes prevent timing attacks. We also implemented microcode-assisted address 220 00:19:18,284 --> 00:19:23,260 sanitizer, which I'll cover quickly in a second. We also have some basic microcode 221 00:19:23,260 --> 00:19:29,070 instruction set randomization. Some microcode-assisted instrumentation. What 222 00:19:29,070 --> 00:19:33,520 this means is, you can write a filter for your instrumentation in microcode itself. 223 00:19:33,520 --> 00:19:37,580 So instead of hooking an instruction, instead of debugging your code or 224 00:19:37,580 --> 00:19:42,160 emulating it, you can just say whenever the instruction is executed filter if this 225 00:19:42,160 --> 00:19:47,180 is relevant for me, and if it is, call my x86 handler — entirely in microcode, 226 00:19:47,180 --> 00:19:52,470 without changing the instruction in the RAM. We also implemented some basic 227 00:19:52,470 --> 00:20:00,000 authenticated microcode updates. The usual update mechanism is weak — that's how we 228 00:20:00,000 --> 00:20:05,430 got our foot in the door in the first place. So we improved upon it a bit. Also 229 00:20:05,430 --> 00:20:09,799 we found out that microcode actually has some enclave-like features because once 230 00:20:09,799 --> 00:20:13,730 we're executing in Microcode, your kernel can't interupt you, your hypervisor can't 231 00:20:13,730 --> 00:20:18,610 interrupt you and any state you want visible to the outside world. You actually 232 00:20:18,610 --> 00:20:22,840 need to write explicitly. 
So all these microcode internal registers are not 233 00:20:22,840 --> 00:20:26,600 accessible from the outside world. So any computation you perform in microcode 234 00:20:26,600 --> 00:20:30,360 cannot be interfered with. So you can implement a simple enclave on top of this 235 00:20:30,360 --> 00:20:37,039 one. So our hardware-assisted address sanitizer variant is based on the work by 236 00:20:37,039 --> 00:20:41,970 the original authors. Address sanitizer is a software instrumentation that detects 237 00:20:41,970 --> 00:20:47,070 invalid memory accesses by using a shadow map, or shadow memory, to just say which memory 238 00:20:47,070 --> 00:20:50,746 is valid to be read and written to. 239 00:20:50,746 --> 00:20:53,840 The authors proposed hardware address sanitizer 240 00:20:53,840 --> 00:20:59,011 which is essentially doing the same checks but using a new instruction. And the 241 00:20:59,011 --> 00:21:03,940 instruction should raise a fault if an invalid access is detected. This algorithm 242 00:21:03,940 --> 00:21:07,670 they proposed - the details are not important. What is important is: in 243 00:21:07,670 --> 00:21:12,080 essence, it's pretty simple. You load from a certain address, perform some operations 244 00:21:12,080 --> 00:21:18,816 on it, and if the shadow value flags the access after these operations you just report a bug. 245 00:21:18,816 --> 00:21:24,910 Advantages of hardware address sanitizer are for example you get better performance 246 00:21:24,910 --> 00:21:29,170 out of it. Because you only have a single instruction, maybe you can do some fancy 247 00:21:29,170 --> 00:21:34,450 tricks inside your CPU that are faster than using x86 instructions. You get more 248 00:21:34,450 --> 00:21:38,880 compact code and you have the possibility of one-time configuration which is a bit 249 00:21:38,880 --> 00:21:45,210 hard with software address sanitizer. 
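
To make the shadow check concrete, here is a minimal Python sketch of the classic software address sanitizer test, assuming its usual encoding: one shadow byte per 8-byte granule, 0 meaning fully addressable, a positive k meaning only the first k bytes are addressable, and a negative value meaning poisoned. The talk does not spell out the exact operations of the microcode variant, so `SHADOW_OFFSET`, the dictionary used as shadow memory, and `is_invalid_access` are assumptions for illustration only.

```python
SHADOW_SCALE = 3          # one shadow byte covers 2**3 = 8 application bytes
SHADOW_OFFSET = 0x1000    # hypothetical base of the shadow region

def is_invalid_access(shadow, addr, size):
    """Return True if an access of `size` bytes (size <= 8) at `addr` is invalid."""
    # Load the shadow value for the granule that contains addr.
    k = shadow.get((addr >> SHADOW_SCALE) + SHADOW_OFFSET, 0)
    if k == 0:
        return False      # whole granule is addressable
    # Partially addressable (k > 0) or poisoned (k < 0) granule:
    # the last byte touched must lie below k.
    return (addr & 7) + size - 1 >= k

# Toy shadow memory: the granule at 0x100 has 4 valid bytes,
# the granule at 0x108 is poisoned.
shadow = {
    SHADOW_OFFSET + (0x100 >> SHADOW_SCALE): 4,
    SHADOW_OFFSET + (0x108 >> SHADOW_SCALE): -1,
}
```

In this model, a check that returns False corresponds to the no-op case of the instruction, and True corresponds to raising the fault.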
We implemented hardware address sanitizer our 250 00:21:45,210 --> 00:21:49,270 variant by replacing the bound instruction Bound is an old instruction that is no 251 00:21:49,270 --> 00:21:54,870 longer used by compilers because in fact it is slower to use bound instead of 252 00:21:54,870 --> 00:21:58,901 performing the checks with multiple x86 instructions. We changed the interface. 253 00:21:58,901 --> 00:22:04,090 The first argument is the register which holds the address you want to access. And 254 00:22:04,090 --> 00:22:07,835 the second argument holds the size you want this access to be. 255 00:22:07,835 --> 00:22:11,050 So, 1 byte, 2 byte and so on. 256 00:22:11,050 --> 00:22:14,950 This instruction is a no-op if the check succeeds. So if there is no bug it 257 00:22:14,950 --> 00:22:19,980 just continues on like nothing happened. However if we detect an invalid access we 258 00:22:19,980 --> 00:22:25,359 can take a configurable action, we can for example just raise your normal page fault 259 00:22:25,359 --> 00:22:29,630 or we can raise a bound interrupt, which is a custom interrupt, that only denotes 260 00:22:29,630 --> 00:22:34,299 this one or we can branch to an x86 handler that either performs additional 261 00:22:34,299 --> 00:22:39,760 checking, for example whitelisting, or it generates a pretty error report for you. 262 00:22:41,340 --> 00:22:47,480 Most importantly this is a single instruction. We also do not dirty any x86 263 00:22:47,480 --> 00:22:52,690 registers because they are some intermediate results. You need to store 264 00:22:52,690 --> 00:22:56,360 these somewhere and this you usually do in the x86 registers. So you increase 265 00:22:56,360 --> 00:23:00,010 register pressure. Maybe you cause spilling. So overall your performance gets 266 00:23:00,010 --> 00:23:07,230 worse. We also found out that we are actually faster than doing the checking 267 00:23:07,230 --> 00:23:12,390 using x86 instructions. 
So just by moving the implementation from x86 level to 268 00:23:12,390 --> 00:23:16,805 microcode, which in some way is still kind of like software, we already improved the 269 00:23:16,805 --> 00:23:22,160 performance. Also on top of this you get better cache utilization because you have 270 00:23:22,160 --> 00:23:27,020 less instructions, there are less bytes in the cache, so we get fuller cache lines. 271 00:23:27,020 --> 00:23:31,630 And also it is really easy to tell which is testing code and which is your actual 272 00:23:31,630 --> 00:23:40,080 program code. Lastly I'm going to show you just a rough overview of our framework 273 00:23:40,080 --> 00:23:45,920 which we used during our development and which you can also find on GitHub. Early 274 00:23:45,920 --> 00:23:50,079 on we found out that we are probably going to need to test a lot of microcode 275 00:23:50,079 --> 00:23:55,640 updates, because in the beginning you just throw everything at the CPU and see how it 276 00:23:55,640 --> 00:24:01,400 behaves and we wanted to do this in parallel. So we developed a small custom 277 00:24:01,400 --> 00:24:07,180 OS called "Angry OS" and deployed it to mainboards. These mainboards are just old 278 00:24:07,180 --> 00:24:13,270 AMD mainboards. All these mainboards were hooked up via serial for communication and 279 00:24:13,270 --> 00:24:19,400 GPIO to a Raspberry Pi. With the GPIO you can reset, support power on, power down 280 00:24:19,400 --> 00:24:23,890 and just have remote control of this mainboard and then you can connect to that 281 00:24:23,890 --> 00:24:28,719 Raspberry Pi from anywhere on earth and just deploy and play around with it. 282 00:24:28,719 --> 00:24:30,640 This was the first version. 283 00:24:30,640 --> 00:24:34,490 In the beginning we didn't really know much about electronics 284 00:24:34,490 --> 00:24:38,520 so we used one Raspberry Pi per mainboard. 
And it turns out Raspberry Pis are more 285 00:24:38,520 --> 00:24:43,970 expensive than these old mainboards, but we improved upon this and now we're down 286 00:24:43,970 --> 00:24:48,007 to one Raspberry Pi for four / five setups. 287 00:24:48,007 --> 00:24:51,587 For example you only need 3 GPIO ports per 288 00:24:51,587 --> 00:24:57,358 mainboard. You connect each of these to optocouplers just to separate the voltage 289 00:24:57,358 --> 00:25:01,860 levels and then you connect one side of the optocoupler to the GPIO the other side 290 00:25:01,860 --> 00:25:05,909 to your reset pin, to your power pin and for input to know whether your board is up 291 00:25:05,909 --> 00:25:11,230 or down you connect the power LED. And that way you can save a lot of space, a 292 00:25:11,230 --> 00:25:17,205 lot of money. And also if you're really constrained you can just remove the power 293 00:25:17,205 --> 00:25:23,530 LED sensing because usually you know it is in the state your setup is in. As I 294 00:25:23,530 --> 00:25:28,230 already said we wrote our custom operating system and it is intentionally really 295 00:25:28,230 --> 00:25:32,659 really minimal because the major feature we wanted is control over every 296 00:25:32,659 --> 00:25:36,740 instructions that's going to be executed from a certain point on, because we're 297 00:25:36,740 --> 00:25:40,780 playing around with instruction encoding and if we execute an instructions that we 298 00:25:40,780 --> 00:25:45,530 did not intend we might crash the CPU, we might go into an invalid state and we do 299 00:25:45,530 --> 00:25:50,850 not even know which instruction caused it. And Angry OS essentially only listens on 300 00:25:50,850 --> 00:26:00,150 the serial port for something to do. What it can do is apply an update. These 301 00:26:00,150 --> 00:26:04,820 updates are just microcode updates. They are streamed via serial. 
We can also 302 00:26:04,820 --> 00:26:10,039 stream x86 code which is then run by Angry OS and this is just so that we do not need 303 00:26:10,039 --> 00:26:14,409 to reflash the USB stick every time we want to update our testing code and the 304 00:26:14,409 --> 00:26:19,280 result, all the errors are reported back to the Raspberry Pi and thus they are 305 00:26:19,280 --> 00:26:26,852 forwarded to us. The framework we use most importantly has the microcode assembler 306 00:26:26,852 --> 00:26:30,713 and a pretty verbose disassembler. This disassembler generates the output I showed 307 00:26:30,713 --> 00:26:36,919 you earlier and using this you can just quickly write your own microcode. We also 308 00:26:36,919 --> 00:26:42,245 included an x86 assembler because we wanted to rapidly test different x86 309 00:26:42,245 --> 00:26:47,730 testing codes. Using this framework we were able to disassemble the existing 310 00:26:47,730 --> 00:26:53,500 updates and we also used it to disassemble our ROM after we reordered it and also 311 00:26:53,500 --> 00:27:01,169 during the process when we fed it to our emulator. And we can also create the 312 00:27:01,169 --> 00:27:07,909 proper binary files that can be loaded by the Linux kernel driver. We modified the 313 00:27:07,909 --> 00:27:12,777 stock one to just load any update you give it without checking if it's the correct 314 00:27:12,777 --> 00:27:20,060 CPU ID and all these things just for testing purposes. It's also available. And 315 00:27:20,060 --> 00:27:25,740 also of course the framework can control Angry OS to make your testing easier. And 316 00:27:25,740 --> 00:27:29,650 we implemented a pretty basic remote execution wrapper, so you can work on a 317 00:27:29,650 --> 00:27:33,389 remote Raspberry Pi as if you were using it locally. 318 00:27:34,809 --> 00:27:36,799 And this brings me to the end 319 00:27:36,799 --> 00:27:40,800 of talk. 
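The talk does not describe the wire format Angry OS uses on the serial line, so the following is purely a hypothetical framing sketch for the two kinds of payload mentioned above (a microcode update and x86 test code). The command numbers, the `Command` enum, and the header layout are assumptions made for illustration.

```python
import struct
from enum import IntEnum
from typing import Tuple


class Command(IntEnum):
    """Hypothetical command identifiers, not the real Angry OS protocol."""
    APPLY_UPDATE = 1  # payload is a microcode update
    RUN_X86 = 2       # payload is x86 machine code to execute


# Assumed header: 1-byte command, 4-byte little-endian payload length.
HEADER = struct.Struct("<BI")


def encode_frame(command: Command, payload: bytes) -> bytes:
    """Prefix the payload with the command and its length."""
    return HEADER.pack(command, len(payload)) + payload


def decode_frame(data: bytes) -> Tuple[Command, bytes]:
    """Parse one frame and reject input that is shorter than announced."""
    if len(data) < HEADER.size:
        raise ValueError("frame shorter than header")
    command, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return Command(command), payload
```

A host-side tool would write `encode_frame(Command.RUN_X86, code)` to the serial port and wait for the result to come back. An explicit length prefix is one simple way to let a minimal receiver know when a streamed payload is complete.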
And in conclusion we can say reversing the ROM opened up a lot of new 320 00:27:40,800 --> 00:27:44,809 possibilities. We learned a lot about how microcode works. We learned about how to 321 00:27:44,809 --> 00:27:49,720 actually use it properly instead of just inferring from a really small dataset 322 00:27:49,720 --> 00:27:55,060 that we have from the updates, or from the random bits we sent to the CPU to 323 00:27:55,060 --> 00:27:59,530 observe what happened. But there's a lot left to do. So if you really want to hack 324 00:27:59,530 --> 00:28:04,089 on it, just get in contact, we are happy to share our findings with you. And as I 325 00:28:04,089 --> 00:28:09,009 said the framework, Angry OS, the example programs that we implemented, and some 326 00:28:09,009 --> 00:28:13,850 other stuff like the wiring are available on GitHub. So that's that. And we are 327 00:28:13,850 --> 00:28:16,809 happy to answer any questions you might have. 328 00:28:16,809 --> 00:28:22,234 *applause* 329 00:28:24,910 --> 00:28:28,438 Herald Angel: Thank you very much. So we 330 00:28:28,438 --> 00:28:34,260 have 10 minutes for questions. Please line up at the microphones. We start with this 331 00:28:34,260 --> 00:28:39,220 one: microphone number 2. M2: Hi. Thanks for a nice talk. A few 332 00:28:39,220 --> 00:28:42,780 questions about your hardware address sanitizer. 333 00:28:42,780 --> 00:28:49,830 Benjamin: Mhm M2: As I understand you don't need the 334 00:28:49,830 --> 00:28:56,010 source code instrumentation because the microcode is responsible for checking the 335 00:28:56,010 --> 00:29:02,929 shadow memory, right? Benjamin: No... The original hardware 336 00:29:02,929 --> 00:29:07,950 sanitizer implementation is also based on a compiler extension that inserts a new 337 00:29:07,950 --> 00:29:12,200 instruction because it doesn't exist usually.
And it also inserts a bootstrap 338 00:29:12,200 --> 00:29:18,049 code that inits your shadow map and also instruments your allocators to update 339 00:29:18,049 --> 00:29:23,020 the shadow map during runtime and we essentially need the same component, but 340 00:29:23,020 --> 00:29:26,850 we do not need the software address sanitizer component that essentially 341 00:29:26,850 --> 00:29:33,740 inserts 10 or 20 x86 instructions before every memory access. So yes we still need 342 00:29:33,740 --> 00:29:37,647 a compile time component and we are still source code based in a sense. 343 00:29:39,388 --> 00:29:45,600 Herald: And, so.. M2: And I didn't see, maybe I missed the 344 00:29:45,600 --> 00:29:51,299 numbers. How much faster is it than this initial version? 345 00:29:51,299 --> 00:29:56,419 Benjamin: You mean the initial hardware sanitizer version or the software address 346 00:29:56,419 --> 00:29:59,900 sanitizer? M2: I mean let's say custom kernel address 347 00:29:59,900 --> 00:30:05,180 sanitizer for Linux kernel which is the usual one and your approach. 348 00:30:05,180 --> 00:30:10,270 Benjamin: We only performed a micro benchmark on Angry OS and we essentially 349 00:30:10,270 --> 00:30:16,059 took the instrumentation as emitted by the compiler for some memory access which is 350 00:30:16,059 --> 00:30:20,590 your standard software address sanitizer and compared it to our version using only 351 00:30:20,590 --> 00:30:24,640 the modified bound instruction. So I really can't talk about how it compares to 352 00:30:24,640 --> 00:30:28,820 KASAN or some real world implementation, because we only have the 353 00:30:28,820 --> 00:30:34,069 prototype and the basic instrumentation. M2: Thank you very much. 354 00:30:34,069 --> 00:30:36,490 Herald Angel: OK. Microphone number 4 please. 355 00:30:36,490 --> 00:30:51,145 M4: Hey thanks for the talk and did you find any weird microcode 356 00:30:51,145 --> 00:31:00,529 implementations.
I don't mean security wise, just something you wouldn't have expected to 357 00:31:00,529 --> 00:31:07,330 see implemented that way. 358 00:31:09,040 --> 00:31:11,700 Benjamin: The problem is there's a lot of 359 00:31:11,700 --> 00:31:20,270 microcode to begin with. You have f000 triads, each of which has 3 op-codes. So 360 00:31:20,270 --> 00:31:25,003 you have a lot of ground to cover and also we have read-out errors. Sometimes you are 361 00:31:25,003 --> 00:31:29,169 seeing bit flips, which kind of slows you down because you then need to always 362 00:31:29,169 --> 00:31:32,820 consider: OK, maybe this register is something else, maybe this address is 363 00:31:32,820 --> 00:31:37,420 wrong. And also sometimes you have a dust particle that kind of knocks out an 364 00:31:37,420 --> 00:31:42,550 entire region. So we only looked at the components we were pretty sure that we 365 00:31:42,550 --> 00:31:46,520 recovered correctly, and we only looked at a really tiny subset compared to all of 366 00:31:46,520 --> 00:31:52,940 the microcode ROM. It's just not feasible to go through it and look at 367 00:31:52,940 --> 00:31:57,330 everything. So no we didn't find anything funny but we also wouldn't know what funny 368 00:31:57,330 --> 00:32:00,790 looks like because we don't know what the official spec for microcode is. 369 00:32:01,180 --> 00:32:03,990 M4: Thanks. Herald Angel: Interesting. We have one 370 00:32:04,034 --> 00:32:05,809 question from the Internet, from the 371 00:32:05,809 --> 00:32:09,792 Signal Angel please. Signal Angel: Yes. Which AMD CPU 372 00:32:09,792 --> 00:32:15,510 generations does this apply to? Benjamin: Yeah this is still based on the 373 00:32:15,510 --> 00:32:21,289 work of our first talk and this only works on pretty old ones: K8, K10. So 374 00:32:21,289 --> 00:32:26,940 CPUs produced until 2013. Yeah this was the last year AMD produced anything like 375 00:32:26,940 --> 00:32:32,520 that.
Newer ones use some public key based cryptography from what we can tell and we 376 00:32:32,520 --> 00:32:36,559 haven't yet managed to break it. Same goes for Intel, they seem to be using public 377 00:32:36,559 --> 00:32:39,919 key cryptography and we haven't gotten a foot in the door yet. 378 00:32:40,989 --> 00:32:44,789 Herald Angel: Thank you. We go one round. Microphone number 3 please. 379 00:32:44,789 --> 00:32:51,290 M3: Yeah. Thank you. I would like to know how complex the microcode programs 380 00:32:51,290 --> 00:32:59,159 that you could write could be. So what's the complexity of new operations you could 381 00:32:59,159 --> 00:33:03,300 implement? Benjamin: The only limiting factor is the 382 00:33:03,300 --> 00:33:07,923 size of your microcode update RAM. But this one is really really limited. 383 00:33:07,923 --> 00:33:12,679 For example on K8, where we performed the majority of our experiments, we are 384 00:33:12,679 --> 00:33:19,050 limited to 32 triads, which comes down to ninety-six instructions and you also 385 00:33:19,050 --> 00:33:22,440 have some constraints on these instructions. For example the next triad 386 00:33:22,440 --> 00:33:27,809 will always be executed no matter what. Some operations can only go at the second 387 00:33:27,809 --> 00:33:33,859 slot. Some can only go on another slot, so it's really really hard. And you're also 388 00:33:33,859 --> 00:33:38,930 limited, to our knowledge, to loading 16 bit immediates instead of 32 bit or even 389 00:33:38,930 --> 00:33:44,470 64 bit immediates. So your whole program grows really fast if you're trying to do 390 00:33:44,470 --> 00:33:49,400 something complex.
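To make the size limit concrete, here is a small back-of-the-envelope sketch. It takes the 32 triads and the 16 bit immediates from the answer above and the three operations per triad from the earlier answer. The function names are hypothetical, and the assumption that each 16 bit piece of a constant costs at least one operation slot is a rough lower bound, not a statement about the actual microcode encoding.

```python
from typing import List

TRIADS = 32         # update RAM size on K8, as stated in the talk
OPS_PER_TRIAD = 3   # each triad holds three operations
IMMEDIATE_BITS = 16  # largest immediate that can be loaded at once


def total_slots() -> int:
    """Upper bound on operations that fit into the update RAM."""
    return TRIADS * OPS_PER_TRIAD


def split_immediate(value: int, width_bits: int) -> List[int]:
    """Split a constant into 16-bit pieces, least significant piece first."""
    mask = (1 << IMMEDIATE_BITS) - 1
    return [(value >> shift) & mask
            for shift in range(0, width_bits, IMMEDIATE_BITS)]


def min_slots_for_constant(width_bits: int) -> int:
    """Assumed lower bound: at least one operation slot per 16-bit piece."""
    return -(-width_bits // IMMEDIATE_BITS)  # ceiling division
```

Under these assumptions, `total_slots()` gives 96, and a single 64 bit constant already takes at least 4 of those slots before any real work is done, which illustrates why constant-heavy code fills the update RAM quickly.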
For example our authenticated microcode update mechanism 391 00:33:49,400 --> 00:33:54,440 is the most complex one we wrote. It nearly fills out the RAM and we used TEA – Tiny 392 00:33:54,440 --> 00:33:58,700 Encryption Algorithm – because that was the only one we managed to fit, mostly because 393 00:33:58,700 --> 00:34:04,510 other algorithms need S-boxes and other constants we would need to load. So it's really small. 394 00:34:04,510 --> 00:34:08,539 Herald Angel: Thank you. Microphone number 1. 395 00:34:08,539 --> 00:34:14,709 M1: So you said the microcode is used for instruction decoding and it needs to emit 396 00:34:14,709 --> 00:34:19,429 the micro-ops to the scheduler and micro queue in some way. Did you find out how 397 00:34:19,429 --> 00:34:27,519 that works? Benjamin: In essence we are not actually 398 00:34:27,519 --> 00:34:33,539 executing code inside the microcode engine. From what we understand, the 399 00:34:33,539 --> 00:34:38,569 microcode engine is just some kind of a software based recipe that describes how 400 00:34:38,569 --> 00:34:43,479 to decode an instruction, so you don't actually get execution, you just commit 401 00:34:43,479 --> 00:34:47,269 instructions into the pipelines that do what you want. And because we have some 402 00:34:47,269 --> 00:34:51,269 control flow possibility that is actually inside the microcode engine, because you 403 00:34:51,269 --> 00:34:55,268 can branch to different addresses, you can conditionally branch and loop, you kind of 404 00:34:55,268 --> 00:34:59,089 get an execution, but in essence you just commit stuff in the pipeline and the CPU 405 00:34:59,089 --> 00:35:01,440 does what you tell it to. 406 00:35:04,240 --> 00:35:07,161 Herald Angel: One more question. Microphone number 2, please. 407 00:35:07,161 --> 00:35:11,927 M2: How did you take the picture of the internal CPU? Did you open it? 408 00:35:11,927 --> 00:35:14,969 Benjamin: Yeah. We worked together with 409 00:35:14,969 --> 00:35:19,680 Chris.
He's our hardware guy. He has access to the equipment to delayer it and 410 00:35:19,680 --> 00:35:24,289 to take high resolution optical shots and he also takes shots with a scanning 411 00:35:24,289 --> 00:35:29,279 electron microscope. So I think about five or six CPUs were harmed in the making of 412 00:35:29,279 --> 00:35:30,357 this paper. 413 00:35:33,810 --> 00:35:37,815 Herald Angel: So we have one last question. Microphone number 2 please. 414 00:35:39,248 --> 00:35:41,390 M2: Are you aware of research done by 415 00:35:41,390 --> 00:35:49,400 Christopher Domas, where he mapped out the instruction set for x86 processors? 416 00:35:49,400 --> 00:35:57,119 Benjamin: You mean sandsifter? We actually talked with him and yeah we are 417 00:35:57,119 --> 00:36:02,910 aware that there's a map essentially of the instruction set and also maybe you can 418 00:36:02,910 --> 00:36:07,275 combine it, because in the beginning we reverse engineered where certain x86 419 00:36:07,275 --> 00:36:11,335 instructions are implemented in microcode. So if you plug these two together you kind 420 00:36:11,335 --> 00:36:15,170 of map out the whole microcode ROM at the same time that you map out the whole 421 00:36:15,170 --> 00:36:18,989 instruction set. However there are some components of the microcode ROM that are 422 00:36:18,989 --> 00:36:23,470 most likely not triggered by instructions. For example it seems like power management 423 00:36:23,470 --> 00:36:27,368 or everything that is behind a write MSR [wrmsr] or read MSR [rdmsr]. wrmsr is a 424 00:36:27,368 --> 00:36:31,249 single instruction, but depending on the arguments you give it, it just branches to 425 00:36:31,249 --> 00:36:36,442 totally different triads and the microcode update itself is implemented in microcode. And 426 00:36:36,442 --> 00:36:40,190 this one is a huge chunk you wouldn't even find without brute forcing all 427 00:36:40,190 --> 00:36:44,159 combinations for all instructions which is not really feasible.
428 00:36:46,483 --> 00:36:51,279 Herald Angel: Thank you. Thank you Benjamin. 429 00:36:51,279 --> 00:36:57,210 *applause* 430 00:36:57,210 --> 00:37:01,811 *35c3 postroll music* 431 00:37:01,811 --> 00:37:21,000 subtitles created by c3subtitles.de in the years 2019-2020. Join, and help us!