0 00:00:00,000 --> 00:00:30,000 Dear viewer, these subtitles were generated by a machine via the service Trint and therefore are (very) buggy. If you are capable, please help us to create good quality subtitles: https://c3subtitles.de/talk/379 Thanks! 1 00:00:10,540 --> 00:00:12,609 So hi, welcome to 2 00:00:12,610 --> 00:00:14,739 my talk on mining for box with 3 00:00:14,740 --> 00:00:16,149 graph database queries, 4 00:00:18,010 --> 00:00:20,379 it's been a while since I last 5 00:00:20,380 --> 00:00:22,629 talked at CCC and 6 00:00:22,630 --> 00:00:24,459 I actually planned on submitting every 7 00:00:24,460 --> 00:00:26,799 year, but then turned 8 00:00:26,800 --> 00:00:28,479 out last time that I submitted that it's 9 00:00:28,480 --> 00:00:30,010 actually been five years already. 10 00:00:31,900 --> 00:00:34,149 You probably saw most of the people 11 00:00:34,150 --> 00:00:36,249 in the room probably did not see the 12 00:00:36,250 --> 00:00:38,589 last talk, but to make it short, 13 00:00:38,590 --> 00:00:40,929 it was mostly about our 14 00:00:40,930 --> 00:00:43,149 hard earned bucks, meaning 15 00:00:44,200 --> 00:00:46,389 staring at code for a very, very 16 00:00:46,390 --> 00:00:48,759 long time, trying to find bugs 17 00:00:48,760 --> 00:00:51,459 and exploiting those bugs in the code. 18 00:00:51,460 --> 00:00:52,780 And since then, 19 00:00:54,220 --> 00:00:56,529 my interest has been in making 20 00:00:56,530 --> 00:00:58,929 this process 21 00:00:58,930 --> 00:01:01,599 just a little less painful. 22 00:01:01,600 --> 00:01:04,179 And what I'm presenting today is 23 00:01:04,180 --> 00:01:06,010 the results of that. 24 00:01:07,570 --> 00:01:09,669 So I want to give you the big picture 25 00:01:09,670 --> 00:01:11,769 first of 26 00:01:11,770 --> 00:01:13,419 what I actually want to achieve on the 27 00:01:13,420 --> 00:01:14,589 long run. 28 00:01:14,590 --> 00:01:17,049 So what I'm trying to do here is 29 00:01:17,050 --> 00:01:19,539 I'm trying to combine two 30 00:01:19,540 --> 00:01:21,639 well, my main two 31 00:01:21,640 --> 00:01:23,709 interests, which are on the 32 00:01:23,710 --> 00:01:26,289 one side, the kinds of things 33 00:01:26,290 --> 00:01:28,389 that we see at conferences like this, 34 00:01:28,390 --> 00:01:31,509 you see a lot, which is buck hunting, 35 00:01:31,510 --> 00:01:32,560 exploiting bugs, 36 00:01:33,640 --> 00:01:36,129 things that are nicely represented 37 00:01:36,130 --> 00:01:37,989 by the books that you see there on the 38 00:01:37,990 --> 00:01:40,569 top. So the Show Coders Handbook, 39 00:01:40,570 --> 00:01:42,649 the book diary, the art 40 00:01:42,650 --> 00:01:44,559 of software security assessment. 41 00:01:44,560 --> 00:01:46,659 And in the back you see 42 00:01:46,660 --> 00:01:48,729 Freck shining over all of 43 00:01:48,730 --> 00:01:49,730 this. Right. 44 00:01:50,620 --> 00:01:52,179 And then it becomes a little more 45 00:01:52,180 --> 00:01:54,579 academic with principles 46 00:01:54,580 --> 00:01:56,020 of program analysis. 47 00:01:57,310 --> 00:01:59,409 And what with these 48 00:01:59,410 --> 00:02:01,479 kinds of methods that are proposed 49 00:02:01,480 --> 00:02:03,699 in principles of program analysis 50 00:02:04,900 --> 00:02:06,519 have in common, is that they are very, 51 00:02:06,520 --> 00:02:08,198 very exact. 52 00:02:08,199 --> 00:02:10,599 Now, my other interest 53 00:02:10,600 --> 00:02:12,759 is in pattern recognition 54 00:02:12,760 --> 00:02:14,259 and machine learning, these kinds of 55 00:02:14,260 --> 00:02:16,569 methods, and they are actually 56 00:02:16,570 --> 00:02:17,739 very inexact. 57 00:02:17,740 --> 00:02:20,259 So they approach 58 00:02:20,260 --> 00:02:22,299 problems more from the engineering kind 59 00:02:22,300 --> 00:02:24,009 of perspective. 60 00:02:24,010 --> 00:02:26,079 And the idea now is, 61 00:02:26,080 --> 00:02:28,209 can we take the tools that we have down 62 00:02:28,210 --> 00:02:30,879 there, the the pattern recognition tools 63 00:02:30,880 --> 00:02:33,429 and kind of build tools 64 00:02:33,430 --> 00:02:35,619 to recognize bugs, you know, 65 00:02:35,620 --> 00:02:37,779 are looking looking at bugs, looking 66 00:02:37,780 --> 00:02:39,520 at code similar in the way 67 00:02:40,630 --> 00:02:42,909 a person auditing code would look at it 68 00:02:42,910 --> 00:02:44,110 and then just, you know, 69 00:02:45,160 --> 00:02:46,929 do an inexact analysis. 70 00:02:46,930 --> 00:02:49,419 But which scales very well, 71 00:02:49,420 --> 00:02:51,069 because the problems that you have with 72 00:02:51,070 --> 00:02:53,229 these exact methods is 73 00:02:53,230 --> 00:02:55,329 that scaling them to really large code 74 00:02:55,330 --> 00:02:57,429 base, like the Linux kernel, for example, 75 00:02:57,430 --> 00:02:58,720 is really, really hard. 76 00:03:00,460 --> 00:03:01,460 Now doing this. 77 00:03:02,470 --> 00:03:05,079 I want to have tools which are realistic, 78 00:03:05,080 --> 00:03:07,119 not those static analysis tools that give 79 00:03:07,120 --> 00:03:09,249 you like thousands of hits and you 80 00:03:09,250 --> 00:03:10,959 can't really tune them or do anything 81 00:03:10,960 --> 00:03:12,849 like that. But instead, I want something 82 00:03:12,850 --> 00:03:15,729 to assist auditors 83 00:03:15,730 --> 00:03:16,780 in their daily work. 84 00:03:18,630 --> 00:03:21,419 Now, zooming in a little, 85 00:03:21,420 --> 00:03:23,249 where you're going to be seeing today is 86 00:03:23,250 --> 00:03:25,349 I'm trying to look at two very 87 00:03:25,350 --> 00:03:27,599 different topics and showing 88 00:03:27,600 --> 00:03:29,009 that they actually have something in 89 00:03:29,010 --> 00:03:31,169 common, that they actually 90 00:03:31,170 --> 00:03:32,669 fit nicely together. 91 00:03:32,670 --> 00:03:34,769 So one of them is good 92 00:03:34,770 --> 00:03:36,299 old computer science, compiler 93 00:03:36,300 --> 00:03:38,489 construction, and the other 94 00:03:38,490 --> 00:03:41,129 is the shiny and new graph databases, 95 00:03:41,130 --> 00:03:42,750 or as some people like to say, 96 00:03:43,920 --> 00:03:45,299 big data. Right. 97 00:03:45,300 --> 00:03:47,459 And we're going to see that they 98 00:03:47,460 --> 00:03:48,749 actually fit nicely together. 99 00:03:48,750 --> 00:03:50,429 That's that's what I want to achieve in 100 00:03:50,430 --> 00:03:51,430 this talk. 101 00:03:52,370 --> 00:03:54,599 Now, to start off with let's take a look 102 00:03:54,600 --> 00:03:55,919 at an example back. 103 00:03:55,920 --> 00:03:58,019 So this is a bug found 104 00:03:58,020 --> 00:04:00,479 by Stefan Stephanson and presented 105 00:04:00,480 --> 00:04:02,819 in Siskin 13 talk. 106 00:04:02,820 --> 00:04:04,949 And well, in his talk, 107 00:04:04,950 --> 00:04:06,749 he presented a lot of different bugs. 108 00:04:06,750 --> 00:04:09,179 And this one is certainly 109 00:04:09,180 --> 00:04:10,469 not one of those. 110 00:04:10,470 --> 00:04:12,539 Well, this is more of a side finding, I 111 00:04:12,540 --> 00:04:14,699 would say, but it's interesting 112 00:04:14,700 --> 00:04:16,319 for us, as you'll see in a moment. 113 00:04:16,320 --> 00:04:17,320 So 114 00:04:18,720 --> 00:04:20,309 the one thing that's already interesting 115 00:04:20,310 --> 00:04:22,409 is that you see these kinds of 116 00:04:22,410 --> 00:04:24,539 bugs like these all over the place. 117 00:04:24,540 --> 00:04:26,879 So it's like a it's not a very unique 118 00:04:26,880 --> 00:04:28,289 bug. It's something that you will see 119 00:04:28,290 --> 00:04:30,329 again and again and again. 120 00:04:30,330 --> 00:04:32,639 Now, as you see, there's 121 00:04:32,640 --> 00:04:34,829 this variable, 32 122 00:04:34,830 --> 00:04:37,529 bit wide called name Lynn. 123 00:04:37,530 --> 00:04:39,839 And it's it's 124 00:04:42,180 --> 00:04:44,549 produced by taking this point of data 125 00:04:44,550 --> 00:04:47,009 and then converting from 126 00:04:47,010 --> 00:04:49,529 network byte to order. 127 00:04:49,530 --> 00:04:51,689 OK, so that kind of tells you this 128 00:04:51,690 --> 00:04:54,119 data is probably attacker 129 00:04:54,120 --> 00:04:56,219 controlled because it comes 130 00:04:56,220 --> 00:04:58,379 from a network and you would not be 131 00:04:58,380 --> 00:05:00,299 converting from network troublespot order 132 00:05:00,300 --> 00:05:02,339 if it didn't come from a network. 133 00:05:02,340 --> 00:05:04,589 Now, what happens next is 134 00:05:04,590 --> 00:05:06,419 that there's an allocation. 135 00:05:06,420 --> 00:05:08,729 So this a buffer called exit 136 00:05:08,730 --> 00:05:11,369 signal is being allocated 137 00:05:11,370 --> 00:05:13,469 and they use this variable name, Lenn, 138 00:05:13,470 --> 00:05:15,389 that has just been initialized by the 139 00:05:15,390 --> 00:05:17,519 attacker and they add one inside 140 00:05:17,520 --> 00:05:19,169 the allocation. 141 00:05:19,170 --> 00:05:21,539 And clearly, the problem is if 142 00:05:21,540 --> 00:05:24,569 a name Lenn is the maximum size 143 00:05:24,570 --> 00:05:27,299 of an unsigned integer, then 144 00:05:27,300 --> 00:05:29,429 this will overflow and actually 145 00:05:29,430 --> 00:05:31,349 something close to close to zero bytes 146 00:05:31,350 --> 00:05:33,059 will be allocated. 147 00:05:33,060 --> 00:05:35,399 And now finally, there's 148 00:05:35,400 --> 00:05:37,529 the operation and we copy into 149 00:05:37,530 --> 00:05:39,899 this buffer, which is close 150 00:05:39,900 --> 00:05:42,149 to zero bytes and we copy namely 151 00:05:42,150 --> 00:05:43,109 in bytes into there. 152 00:05:43,110 --> 00:05:45,059 So that's about four gigabytes. 153 00:05:45,060 --> 00:05:47,339 So there's a typical 154 00:05:47,340 --> 00:05:48,629 he based buffer overflow. 155 00:05:50,250 --> 00:05:51,959 Now, what's interesting is how he found 156 00:05:51,960 --> 00:05:54,629 this, and that's why I took this example. 157 00:05:54,630 --> 00:05:56,879 So in academic 158 00:05:56,880 --> 00:05:58,799 research, there are a lot of different 159 00:06:00,150 --> 00:06:02,219 different methods being 160 00:06:02,220 --> 00:06:03,809 proposed and how you could find this kind 161 00:06:03,810 --> 00:06:05,069 of stuff. And they all sound very 162 00:06:05,070 --> 00:06:07,349 advanced. So you hear things about white 163 00:06:07,350 --> 00:06:09,359 box phasers and hansell, symbolic 164 00:06:09,360 --> 00:06:11,669 execution and the machine learning 165 00:06:11,670 --> 00:06:14,009 powered anomaly detectors, maybe 166 00:06:14,010 --> 00:06:16,170 Thierer improving or model checking. 167 00:06:17,760 --> 00:06:19,889 But in practice, he 168 00:06:19,890 --> 00:06:21,600 used the regular expression for grep. 169 00:06:25,830 --> 00:06:27,039 That was pretty amusing. 170 00:06:27,040 --> 00:06:29,249 So with all the with 171 00:06:29,250 --> 00:06:31,379 all the all 172 00:06:31,380 --> 00:06:33,449 the advanced methods that you could use, 173 00:06:33,450 --> 00:06:34,560 he chose to do this. 174 00:06:36,600 --> 00:06:39,239 Now, I think this tells us a lot. 175 00:06:39,240 --> 00:06:41,309 So first of all, I think 176 00:06:41,310 --> 00:06:42,329 it tells us that 177 00:06:43,560 --> 00:06:45,839 if you use it right, even primitive 178 00:06:45,840 --> 00:06:48,089 tools like Repp are very, very 179 00:06:48,090 --> 00:06:49,469 powerful. 180 00:06:49,470 --> 00:06:52,469 You just need to make sure that the the 181 00:06:52,470 --> 00:06:55,169 the person auditing the code 182 00:06:55,170 --> 00:06:57,719 can actually introduce his knowledge 183 00:06:57,720 --> 00:06:59,669 into the process. 184 00:06:59,670 --> 00:07:00,809 Now, the second thing 185 00:07:02,030 --> 00:07:04,259 that becomes clear is that these 186 00:07:04,260 --> 00:07:06,359 kind of queries, so things 187 00:07:06,360 --> 00:07:08,519 like this grep regular 188 00:07:08,520 --> 00:07:10,049 expression. 189 00:07:10,050 --> 00:07:12,479 They actually encode knowledge. 190 00:07:12,480 --> 00:07:14,489 So I'm going to go back one slide. 191 00:07:14,490 --> 00:07:15,899 Yeah. If you look at this regular 192 00:07:15,900 --> 00:07:17,819 expression, you see what it tries to 193 00:07:17,820 --> 00:07:20,399 match. So it tries to look for 194 00:07:20,400 --> 00:07:21,629 an allocation. 195 00:07:21,630 --> 00:07:23,759 And inside there there's 196 00:07:23,760 --> 00:07:26,369 some sort of arithmetic 197 00:07:26,370 --> 00:07:29,339 operation. So plus minus 198 00:07:29,340 --> 00:07:30,839 something like that. 199 00:07:30,840 --> 00:07:32,909 And in a way, you could 200 00:07:32,910 --> 00:07:35,789 say this is a model 201 00:07:35,790 --> 00:07:36,839 for different kinds of 202 00:07:37,920 --> 00:07:40,049 bugs where where he's 203 00:07:40,050 --> 00:07:41,849 saying, you know, these kinds of things 204 00:07:41,850 --> 00:07:43,589 happen all over the place. 205 00:07:43,590 --> 00:07:45,659 So let's let's just search for 206 00:07:45,660 --> 00:07:46,660 this. 207 00:07:47,430 --> 00:07:49,829 And then finally, it kind of shows that 208 00:07:51,480 --> 00:07:54,029 false positive rates for bug hunting, 209 00:07:54,030 --> 00:07:55,769 that's not so much of a problem. 210 00:07:55,770 --> 00:07:57,959 You know, if if you have if you have like 211 00:07:57,960 --> 00:08:00,719 50 hits and in there 212 00:08:00,720 --> 00:08:02,789 there are seven days, well, 213 00:08:02,790 --> 00:08:03,899 then that's fine. 214 00:08:03,900 --> 00:08:05,729 You read a bit of extra code, but who 215 00:08:05,730 --> 00:08:06,930 cares? I mean, yeah. 216 00:08:08,610 --> 00:08:10,979 So that gave me the idea 217 00:08:10,980 --> 00:08:13,079 of, well, maybe we can 218 00:08:13,080 --> 00:08:15,179 build like a search engine for 219 00:08:15,180 --> 00:08:16,889 source code that can be used to find 220 00:08:16,890 --> 00:08:18,979 bugs. And that's what this talk 221 00:08:18,980 --> 00:08:19,969 is about. 222 00:08:19,970 --> 00:08:22,069 So overall, it's supposed to look 223 00:08:22,070 --> 00:08:24,469 like this, you take source code 224 00:08:24,470 --> 00:08:26,929 and you stuff it into some 225 00:08:26,930 --> 00:08:29,089 robust parser for the language 226 00:08:29,090 --> 00:08:31,879 and then you put it into a database 227 00:08:31,880 --> 00:08:34,219 and then the analyst, 228 00:08:34,220 --> 00:08:35,269 the auditor 229 00:08:36,530 --> 00:08:38,719 sits at one side and is 230 00:08:38,720 --> 00:08:41,209 able to query the database, 231 00:08:41,210 --> 00:08:43,849 see what comes back, and then eventually 232 00:08:43,850 --> 00:08:46,099 adapt the query to make 233 00:08:46,100 --> 00:08:47,809 well, to actually find what is 234 00:08:47,810 --> 00:08:50,089 interesting now 235 00:08:50,090 --> 00:08:51,589 in the back. 236 00:08:51,590 --> 00:08:53,509 You can also use this database for 237 00:08:53,510 --> 00:08:55,009 different kind of pattern recognition 238 00:08:55,010 --> 00:08:55,939 tools. 239 00:08:55,940 --> 00:08:57,529 But this is something that we're not 240 00:08:57,530 --> 00:08:58,570 going to be discussing today. 241 00:09:00,790 --> 00:09:02,739 Now, the question is, 242 00:09:03,760 --> 00:09:05,949 if we want to build a good search engine, 243 00:09:05,950 --> 00:09:07,929 what does our language need to be able to 244 00:09:07,930 --> 00:09:08,930 model? 245 00:09:10,790 --> 00:09:12,879 And overall, I 246 00:09:12,880 --> 00:09:14,979 think is the following are 247 00:09:14,980 --> 00:09:17,109 we need to we need to 248 00:09:17,110 --> 00:09:19,929 ask what does the statement look like 249 00:09:19,930 --> 00:09:21,939 and can we get from one statement to 250 00:09:21,940 --> 00:09:23,139 another? 251 00:09:23,140 --> 00:09:25,359 And how do the statements affect affect 252 00:09:25,360 --> 00:09:27,460 each other and. 253 00:09:30,570 --> 00:09:32,309 We can, of course, take a look at how 254 00:09:32,310 --> 00:09:35,070 people who design compilers 255 00:09:36,360 --> 00:09:39,089 have tackled all of these problems since 256 00:09:39,090 --> 00:09:40,559 when you try to write an optimizer or 257 00:09:40,560 --> 00:09:42,479 something like that, you need all this 258 00:09:42,480 --> 00:09:43,480 kind of information. 259 00:09:45,420 --> 00:09:47,609 And in compiler construction, 260 00:09:47,610 --> 00:09:49,169 well, they essentially have different 261 00:09:49,170 --> 00:09:50,789 graphic representations of code 262 00:09:51,870 --> 00:09:54,809 and all each of these 263 00:09:54,810 --> 00:09:56,909 highlights some aspect. 264 00:09:56,910 --> 00:09:59,249 So, for example, there's the abstract 265 00:09:59,250 --> 00:10:01,379 syntax tree that you see here in the 266 00:10:01,380 --> 00:10:02,879 top, the ESTIE. 267 00:10:02,880 --> 00:10:05,069 And this gives you like a 268 00:10:05,070 --> 00:10:07,319 hierarchical decomposition of the code 269 00:10:07,320 --> 00:10:08,940 into all of its language elements. 270 00:10:10,050 --> 00:10:11,969 Then there's the control flow graph. 271 00:10:11,970 --> 00:10:14,249 You probably notice that's essentially 272 00:10:14,250 --> 00:10:17,249 what you see in tools like ADA Pro, which 273 00:10:17,250 --> 00:10:19,499 tells you, well, the 274 00:10:19,500 --> 00:10:20,819 statements have been collapsed. 275 00:10:20,820 --> 00:10:21,989 Collapsed, right. 276 00:10:21,990 --> 00:10:24,089 And you see how you can get from one 277 00:10:24,090 --> 00:10:26,249 statement to the to the other and 278 00:10:26,250 --> 00:10:28,349 on condition nodes, you 279 00:10:28,350 --> 00:10:30,779 see whether the conditions to be true 280 00:10:30,780 --> 00:10:33,059 or false to get to another 281 00:10:33,060 --> 00:10:34,289 statement. 282 00:10:34,290 --> 00:10:36,479 And then this thing here, this is not so 283 00:10:36,480 --> 00:10:38,159 well known. This is called the PDG, the 284 00:10:38,160 --> 00:10:39,789 program dependance graph. 285 00:10:39,790 --> 00:10:42,059 And this has edges to model data. 286 00:10:42,060 --> 00:10:44,399 So this says a variable 287 00:10:44,400 --> 00:10:45,509 that is being produced, that one 288 00:10:45,510 --> 00:10:48,209 statement reaches 289 00:10:48,210 --> 00:10:49,199 another statement. 290 00:10:49,200 --> 00:10:51,299 So it's not changed 291 00:10:51,300 --> 00:10:52,619 on the way. 292 00:10:52,620 --> 00:10:53,999 So there's actually data flow from here 293 00:10:54,000 --> 00:10:55,000 to there. 294 00:10:55,410 --> 00:10:56,399 OK. 295 00:10:56,400 --> 00:10:58,679 Now, as I said, they all highlight some 296 00:10:58,680 --> 00:11:00,419 aspect of the code that we want to be 297 00:11:00,420 --> 00:11:02,189 able to model in the search for it. 298 00:11:02,190 --> 00:11:04,679 But none of them 299 00:11:04,680 --> 00:11:06,749 are good can really 300 00:11:06,750 --> 00:11:08,489 do it. All right. 301 00:11:08,490 --> 00:11:10,559 And another problem is if 302 00:11:10,560 --> 00:11:12,299 if you take a look at a typical query, 303 00:11:12,300 --> 00:11:14,549 then it's going to sound a bit like this. 304 00:11:14,550 --> 00:11:16,859 Find a call TIFU in a statement 305 00:11:16,860 --> 00:11:18,989 where data flow exists to a statement 306 00:11:18,990 --> 00:11:20,429 that contains a call to bar. 307 00:11:21,660 --> 00:11:23,799 That means what you're actually doing 308 00:11:23,800 --> 00:11:25,979 here is you're kind of transitioning 309 00:11:25,980 --> 00:11:28,059 from one representation to the other. 310 00:11:28,060 --> 00:11:29,819 Right. So first, you're in the syntax 311 00:11:29,820 --> 00:11:31,979 world. First you're saying find a call to 312 00:11:31,980 --> 00:11:33,209 Foo, which means. 313 00:11:33,210 --> 00:11:35,399 Well, in in the syntax tree, there 314 00:11:35,400 --> 00:11:37,859 needs to be a node, which is called 315 00:11:37,860 --> 00:11:39,959 foo. But then you want to take a look 316 00:11:39,960 --> 00:11:41,489 at dataflow. Now you want to see can I 317 00:11:41,490 --> 00:11:43,259 get from this statement to to another 318 00:11:43,260 --> 00:11:45,359 statement. And here the the 319 00:11:45,360 --> 00:11:47,279 abstract syntax tree fails completely. 320 00:11:47,280 --> 00:11:48,899 And instead you want to use something 321 00:11:48,900 --> 00:11:51,239 like the PJI, so you want to transition 322 00:11:51,240 --> 00:11:53,339 from the E to the PJI and 323 00:11:53,340 --> 00:11:54,390 then again to the EU. 324 00:11:56,520 --> 00:11:58,199 And what we want to have is a 325 00:11:58,200 --> 00:11:59,819 representation that can do all of these 326 00:11:59,820 --> 00:12:01,109 things. 327 00:12:01,110 --> 00:12:03,389 And the the primary 328 00:12:03,390 --> 00:12:04,679 observation here is that 329 00:12:06,570 --> 00:12:09,089 in all of these representations, the CFG, 330 00:12:09,090 --> 00:12:10,679 the PDG and the ESTIE 331 00:12:12,120 --> 00:12:14,369 for each of the statements, there's 332 00:12:14,370 --> 00:12:16,589 actually one designated node. 333 00:12:16,590 --> 00:12:18,689 Right. So if you look at this indexical 334 00:12:18,690 --> 00:12:20,909 source, um, there's one node 335 00:12:20,910 --> 00:12:23,579 in the CFG, there's another in the PDG 336 00:12:23,580 --> 00:12:25,619 and also in the Høst here. 337 00:12:25,620 --> 00:12:27,599 It's just a declaration node. 338 00:12:27,600 --> 00:12:29,159 Of course there's something else beneath 339 00:12:29,160 --> 00:12:30,929 that, but that doesn't matter. 340 00:12:30,930 --> 00:12:33,029 So why not try to join 341 00:12:33,030 --> 00:12:35,309 these data structures at statement notes? 342 00:12:36,390 --> 00:12:37,679 And that's what we did. 343 00:12:37,680 --> 00:12:39,869 So this is what we call the 344 00:12:39,870 --> 00:12:40,889 property graph. 345 00:12:40,890 --> 00:12:42,840 And what you see here is, 346 00:12:44,160 --> 00:12:46,589 well, you see parts of the høst 347 00:12:46,590 --> 00:12:47,519 which are still there. 348 00:12:47,520 --> 00:12:49,649 Right? So those little trees. 349 00:12:49,650 --> 00:12:51,809 But you can also get from from 350 00:12:51,810 --> 00:12:54,339 one of these statement 351 00:12:54,340 --> 00:12:56,699 subtree of the E to 352 00:12:56,700 --> 00:12:58,769 to another statement via 353 00:12:58,770 --> 00:13:01,079 data flow links or 354 00:13:01,080 --> 00:13:02,489 control flow links. 355 00:13:02,490 --> 00:13:04,979 Right. And now the idea is maybe 356 00:13:04,980 --> 00:13:07,349 maybe we can describe vulnerabilities 357 00:13:07,350 --> 00:13:09,570 as walks in this graph. 358 00:13:13,470 --> 00:13:15,659 Now, once we have that data structure, 359 00:13:15,660 --> 00:13:17,819 the question was, how are we 360 00:13:17,820 --> 00:13:19,349 going to store this? 361 00:13:19,350 --> 00:13:21,479 And it actually took me quite some 362 00:13:21,480 --> 00:13:23,759 time to to realize that, 363 00:13:23,760 --> 00:13:25,859 well, if you have a graph, then, 364 00:13:25,860 --> 00:13:27,629 you know, trying to map this to a 365 00:13:27,630 --> 00:13:29,939 classical relational database 366 00:13:29,940 --> 00:13:32,189 management system is not going to be much 367 00:13:32,190 --> 00:13:34,709 fun because you need to map a graph 368 00:13:34,710 --> 00:13:35,819 to tables. 369 00:13:35,820 --> 00:13:37,469 And it took me even longer because there 370 00:13:37,470 --> 00:13:39,239 are actually a lot of great reverse 371 00:13:39,240 --> 00:13:41,699 engineering tools which do exactly this. 372 00:13:41,700 --> 00:13:43,859 So for me, it seemed like the right way 373 00:13:43,860 --> 00:13:45,779 to do it. 374 00:13:45,780 --> 00:13:47,609 And then I started to look at different 375 00:13:47,610 --> 00:13:49,649 kinds of databases and I looked at 376 00:13:49,650 --> 00:13:52,109 document databases and 377 00:13:52,110 --> 00:13:54,299 mapping graphs through documents 378 00:13:54,300 --> 00:13:55,409 also kind of fails. 379 00:13:55,410 --> 00:13:57,449 And then I realized mapping graphs to 380 00:13:57,450 --> 00:13:59,669 graphs succeeds immediately. 381 00:14:03,720 --> 00:14:06,209 Now, um, why are 382 00:14:06,210 --> 00:14:08,549 relational database, relational 383 00:14:08,550 --> 00:14:10,659 database management systems 384 00:14:10,660 --> 00:14:12,209 not the right choice here? 385 00:14:12,210 --> 00:14:14,849 Well, as I already said, the underlying 386 00:14:14,850 --> 00:14:16,649 data structure that's being used here is 387 00:14:16,650 --> 00:14:17,999 the table. 388 00:14:18,000 --> 00:14:20,279 And the most obvious 389 00:14:20,280 --> 00:14:22,559 problem is that if you want to create 390 00:14:22,560 --> 00:14:25,079 relationships between nodes, 391 00:14:25,080 --> 00:14:27,269 as you have in a graph right just across 392 00:14:27,270 --> 00:14:30,029 from one note to another edges, 393 00:14:30,030 --> 00:14:31,619 then what you typically do. 394 00:14:31,620 --> 00:14:33,689 So pretty much all you can do 395 00:14:33,690 --> 00:14:35,909 is create a so-called joint table. 396 00:14:35,910 --> 00:14:38,549 And in this joint table, you will 397 00:14:38,550 --> 00:14:40,440 associate tables from 398 00:14:41,730 --> 00:14:43,979 items from a table 399 00:14:43,980 --> 00:14:46,379 on the left and a table on the right. 400 00:14:46,380 --> 00:14:48,749 And the lookup time 401 00:14:48,750 --> 00:14:50,819 will scale with the number 402 00:14:50,820 --> 00:14:52,379 of relationships. 403 00:14:52,380 --> 00:14:53,819 But what you really want is you want to 404 00:14:53,820 --> 00:14:55,769 be at one node and you want to have you 405 00:14:55,770 --> 00:14:57,599 want to have immediate access only to the 406 00:14:57,600 --> 00:14:59,729 relationships that are outgoing there. 407 00:14:59,730 --> 00:15:01,469 And that's what a graph database gives 408 00:15:01,470 --> 00:15:02,470 you. 409 00:15:02,760 --> 00:15:04,979 Now, I know 410 00:15:04,980 --> 00:15:07,079 that when you hear big data and cloud 411 00:15:07,080 --> 00:15:09,449 era noise, you're probably thinking, no, 412 00:15:09,450 --> 00:15:12,089 this can't possibly make sense, because 413 00:15:12,090 --> 00:15:13,679 the only people who use these kinds of 414 00:15:13,680 --> 00:15:15,330 words, you know, the guys in the suits. 415 00:15:16,830 --> 00:15:19,169 But, you know, this actually makes 416 00:15:19,170 --> 00:15:21,269 sense from a technical 417 00:15:21,270 --> 00:15:23,009 point of view. And that's what I hope to 418 00:15:23,010 --> 00:15:24,570 show in the next couple of slides. 419 00:15:27,370 --> 00:15:29,679 Yeah, so the native storage 420 00:15:29,680 --> 00:15:31,330 format for Gref Database's. 421 00:15:33,020 --> 00:15:35,239 It's the so-called property graph that's 422 00:15:35,240 --> 00:15:37,489 different from the tables, so 423 00:15:37,490 --> 00:15:39,109 property graph is just a graph, really, 424 00:15:39,110 --> 00:15:41,539 but there are properties 425 00:15:41,540 --> 00:15:44,209 attached to the nodes, which just means, 426 00:15:44,210 --> 00:15:46,369 well, you can think of it as having 427 00:15:46,370 --> 00:15:48,799 Python dictionaries attached to each node 428 00:15:48,800 --> 00:15:51,019 or Perl hashes or. 429 00:15:51,020 --> 00:15:53,419 Yeah, just a key value pairs, really, 430 00:15:53,420 --> 00:15:56,299 and also 431 00:15:56,300 --> 00:15:58,099 the edges. And this is important for us 432 00:15:58,100 --> 00:15:59,779 as well. They're labeled. 433 00:15:59,780 --> 00:16:02,209 So it's not just there's 434 00:16:02,210 --> 00:16:03,859 there's an edge from A to B, but there is 435 00:16:03,860 --> 00:16:06,019 an edge labeled as knows or 436 00:16:06,020 --> 00:16:07,699 created or something like that from A to 437 00:16:07,700 --> 00:16:09,859 B, and that's all 438 00:16:09,860 --> 00:16:11,359 that a property graph really is. 439 00:16:11,360 --> 00:16:12,919 And that's the native storage format of 440 00:16:12,920 --> 00:16:14,659 graph databases. 441 00:16:14,660 --> 00:16:17,059 But now the nice thing here is 442 00:16:17,060 --> 00:16:19,939 once you have your data in this format, 443 00:16:19,940 --> 00:16:22,549 you can make use of different languages 444 00:16:22,550 --> 00:16:24,649 which have been designed specifically 445 00:16:24,650 --> 00:16:27,799 to query these property graphs. 446 00:16:27,800 --> 00:16:29,899 And there are two languages which are 447 00:16:29,900 --> 00:16:31,039 currently very popular. 448 00:16:31,040 --> 00:16:33,349 One is the cipha query language 449 00:16:33,350 --> 00:16:34,609 by the four guys. 450 00:16:35,930 --> 00:16:37,669 This is not so well suited for what we 451 00:16:37,670 --> 00:16:39,859 want to do most 452 00:16:39,860 --> 00:16:42,199 of all, because the cipha query language 453 00:16:42,200 --> 00:16:44,269 doesn't allow you to have 454 00:16:44,270 --> 00:16:46,789 like the equivalent of stored 455 00:16:46,790 --> 00:16:47,719 procedures. 456 00:16:47,720 --> 00:16:49,429 And we really want to have this. 457 00:16:49,430 --> 00:16:50,779 As we'll see. 458 00:16:50,780 --> 00:16:52,849 The other one is Gremlin and Gremlin 459 00:16:52,850 --> 00:16:54,289 is really awesome and I hope to show 460 00:16:54,290 --> 00:16:55,549 this. 461 00:16:55,550 --> 00:16:57,649 So a typical typical Gramling 462 00:16:57,650 --> 00:16:58,969 query looks a bit like this. 463 00:16:58,970 --> 00:17:01,459 You say choose the set of start notes, 464 00:17:01,460 --> 00:17:02,749 then you have some start notes in your 465 00:17:02,750 --> 00:17:05,088 graph and then you describe where 466 00:17:05,089 --> 00:17:07,368 to walk the graph from there. 467 00:17:07,369 --> 00:17:09,679 And if that's possible, then the notes 468 00:17:09,680 --> 00:17:11,449 that are reached in that way will be 469 00:17:11,450 --> 00:17:12,618 returned. 470 00:17:12,619 --> 00:17:14,989 So as an example, return 471 00:17:14,990 --> 00:17:16,969 all objects created by people over the 472 00:17:16,970 --> 00:17:18,529 age of 30. 473 00:17:18,530 --> 00:17:20,059 So it looks like this. 474 00:17:20,060 --> 00:17:21,999 So here's the Gremlin query. 475 00:17:22,000 --> 00:17:24,169 So you begin by saying take all 476 00:17:24,170 --> 00:17:26,509 nodes where H 477 00:17:26,510 --> 00:17:27,589 is bigger than 30. 478 00:17:28,640 --> 00:17:30,469 So here in the example, that's Josh, 479 00:17:30,470 --> 00:17:32,569 who's two, and Peter who's 480 00:17:32,570 --> 00:17:33,769 thirty five. 481 00:17:33,770 --> 00:17:34,770 And now. 482 00:17:36,550 --> 00:17:38,619 Take all outgoing edges 483 00:17:38,620 --> 00:17:40,899 labeled by, create it, and then you reach 484 00:17:40,900 --> 00:17:41,900 this. 485 00:17:42,730 --> 00:17:43,730 And this note 486 00:17:45,520 --> 00:17:47,679 and now the nice thing is once that 487 00:17:47,680 --> 00:17:49,809 you've created a walk like this, you 488 00:17:49,810 --> 00:17:52,179 can give it a name, right? 489 00:17:52,180 --> 00:17:54,459 So instead of always saying filter, 490 00:17:54,460 --> 00:17:56,619 it is bigger, 30 491 00:17:56,620 --> 00:17:58,749 out created, you can now say people 492 00:17:58,750 --> 00:18:01,239 we can fire now and then you can reuse 493 00:18:01,240 --> 00:18:03,849 this again and again. 494 00:18:03,850 --> 00:18:05,919 Yeah. And this is what a definition 495 00:18:05,920 --> 00:18:07,149 like that would look like. 496 00:18:07,150 --> 00:18:09,519 So it's very similar to to like a stored 497 00:18:09,520 --> 00:18:10,989 procedure. 498 00:18:10,990 --> 00:18:12,519 It means you don't have to ever write it 499 00:18:12,520 --> 00:18:13,630 again. And 500 00:18:14,800 --> 00:18:16,539 this is what we're going to make use of 501 00:18:16,540 --> 00:18:18,639 now, of course, for social 502 00:18:18,640 --> 00:18:20,449 networks. This is extremely useful. 503 00:18:20,450 --> 00:18:22,599 And the best example is Facebook Graph 504 00:18:22,600 --> 00:18:24,699 Search. So they actually made 505 00:18:24,700 --> 00:18:27,009 a UI where people can click 506 00:18:27,010 --> 00:18:28,989 together, different kinds of graph 507 00:18:28,990 --> 00:18:30,189 database queries. 508 00:18:30,190 --> 00:18:31,809 So you could look for people who like 509 00:18:31,810 --> 00:18:33,669 English, Defense League and country, for 510 00:18:33,670 --> 00:18:35,859 example, or 511 00:18:35,860 --> 00:18:37,959 you could look for mothers of Catholics 512 00:18:37,960 --> 00:18:40,029 from Italy who like Durex 513 00:18:40,030 --> 00:18:41,030 and. 514 00:18:41,600 --> 00:18:42,630 Stuff like that, right? 515 00:18:44,180 --> 00:18:45,260 But the amazing thing is 516 00:18:46,970 --> 00:18:48,799 you can store other useful things in 517 00:18:48,800 --> 00:18:51,019 graph databases and 518 00:18:51,020 --> 00:18:53,329 the property graph by by definition 519 00:18:53,330 --> 00:18:54,829 really is a property graph. 520 00:18:54,830 --> 00:18:57,259 Right. So you can store it in a in the 521 00:18:57,260 --> 00:18:58,260 graph database 522 00:18:59,460 --> 00:19:01,579 and now we can 523 00:19:01,580 --> 00:19:03,469 use this this trick here, you know, this 524 00:19:03,470 --> 00:19:05,539 this taking walks 525 00:19:05,540 --> 00:19:08,029 in this graph and giving them a name 526 00:19:08,030 --> 00:19:10,759 to define domain specific 527 00:19:10,760 --> 00:19:12,139 query language for vulnerability, 528 00:19:12,140 --> 00:19:13,770 discovery, Steria. 529 00:19:15,170 --> 00:19:17,329 And we did all this and 530 00:19:17,330 --> 00:19:19,339 made it open source so you can play with 531 00:19:19,340 --> 00:19:22,039 it. So this tool is called your 532 00:19:22,040 --> 00:19:24,139 E robust chord analysis platform 533 00:19:24,140 --> 00:19:26,209 for C and C++. 534 00:19:26,210 --> 00:19:27,649 Don't be fooled. 535 00:19:27,650 --> 00:19:29,989 It's mostly C, but if you throw something 536 00:19:29,990 --> 00:19:32,719 like Firefox at it, which is 537 00:19:32,720 --> 00:19:34,819 let's say nice C++, then 538 00:19:34,820 --> 00:19:36,649 it will also work for you. 539 00:19:36,650 --> 00:19:38,179 But you should throw something like the 540 00:19:38,180 --> 00:19:40,399 Stelle into it, then it will 541 00:19:40,400 --> 00:19:42,889 fail miserably anyway. 542 00:19:42,890 --> 00:19:45,019 By definition, you get 543 00:19:45,020 --> 00:19:46,939 an extensible query language because you 544 00:19:46,940 --> 00:19:48,109 have gremlin, right? 545 00:19:48,110 --> 00:19:50,899 And we provided a lot of different, 546 00:19:50,900 --> 00:19:53,389 um, traversal. 547 00:19:53,390 --> 00:19:55,459 They're called like these pipes 548 00:19:55,460 --> 00:19:57,649 that you can immediately use. 549 00:19:57,650 --> 00:19:59,719 It's repeatable via Python 550 00:19:59,720 --> 00:20:02,089 because so many people like Python and 551 00:20:02,090 --> 00:20:04,249 I also wrote a couple of shell utilities 552 00:20:04,250 --> 00:20:05,809 that you can use for like day to day 553 00:20:05,810 --> 00:20:07,249 auditing work. 554 00:20:07,250 --> 00:20:08,839 And it's known to work on other people's 555 00:20:08,840 --> 00:20:10,940 machines, which I'm very, very proud of. 556 00:20:12,040 --> 00:20:13,040 OK, 557 00:20:14,930 --> 00:20:16,319 yeah, you can download here. 558 00:20:16,320 --> 00:20:19,099 OK, so let's take a quick look at 559 00:20:19,100 --> 00:20:20,909 what this tool looks like. 560 00:20:20,910 --> 00:20:22,789 So you have an importer, you start to 561 00:20:22,790 --> 00:20:25,459 import code that's here, OK, 562 00:20:25,460 --> 00:20:27,949 and then it creates a database 563 00:20:27,950 --> 00:20:30,949 called unindexed and 564 00:20:30,950 --> 00:20:33,019 this is then stored in 565 00:20:33,020 --> 00:20:34,189 the graph database. 566 00:20:34,190 --> 00:20:35,959 And then all you need to do is start the 567 00:20:35,960 --> 00:20:38,179 database server and it will open up. 568 00:20:39,260 --> 00:20:41,749 Well, it will make a rest API accessible. 569 00:20:41,750 --> 00:20:43,819 So this shows this 570 00:20:43,820 --> 00:20:45,979 kind of stuff comes from Web guys 571 00:20:45,980 --> 00:20:48,169 right now to make 572 00:20:48,170 --> 00:20:50,239 sure that you don't ever see that 573 00:20:50,240 --> 00:20:52,249 there's Web stuff underneath. 574 00:20:52,250 --> 00:20:54,439 I wrote a library called Python Uran that 575 00:20:54,440 --> 00:20:56,239 you can just include in your scripts and 576 00:20:56,240 --> 00:20:57,649 it will just communicate with the rest 577 00:20:57,650 --> 00:20:59,929 API for you and 578 00:20:59,930 --> 00:21:01,999 then you can run 579 00:21:02,000 --> 00:21:04,039 your scripts happily. 580 00:21:04,040 --> 00:21:05,959 So this is a complete working script 581 00:21:05,960 --> 00:21:07,699 except for the query right here. 582 00:21:07,700 --> 00:21:10,039 You insert your query, so you connect 583 00:21:10,040 --> 00:21:12,199 to the database, you run the query, 584 00:21:12,200 --> 00:21:14,629 you print the results, that's all. 585 00:21:14,630 --> 00:21:16,909 Alternatively, you can use the shell 586 00:21:16,910 --> 00:21:17,910 utilities. 587 00:21:19,010 --> 00:21:21,169 Now let's 588 00:21:21,170 --> 00:21:23,749 test this on some real projects. 589 00:21:23,750 --> 00:21:25,819 So the first project we're looking at is 590 00:21:25,820 --> 00:21:27,530 the Velzy media player. 591 00:21:28,730 --> 00:21:30,829 And here's just a 592 00:21:30,830 --> 00:21:32,209 short disclaimer about this. 593 00:21:32,210 --> 00:21:33,859 If you audit people's code and that 594 00:21:33,860 --> 00:21:35,929 doesn't mean that you hate them and it 595 00:21:35,930 --> 00:21:37,699 also doesn't mean that you disrespect 596 00:21:37,700 --> 00:21:38,989 them. 597 00:21:38,990 --> 00:21:42,049 Mostly, it means that their code base is 598 00:21:42,050 --> 00:21:44,629 interesting in one way or another. 599 00:21:44,630 --> 00:21:46,130 Right. So, um, 600 00:21:47,540 --> 00:21:49,879 the fact that this is popular made 601 00:21:49,880 --> 00:21:51,589 me want to audit it. 602 00:21:51,590 --> 00:21:53,149 I think the Velzy developers are doing a 603 00:21:53,150 --> 00:21:55,309 really good job and all of the bugs 604 00:21:55,310 --> 00:21:57,679 that are going to be disclosed now 605 00:21:57,680 --> 00:21:59,869 have actually been fixed now in 606 00:21:59,870 --> 00:22:02,119 the git, you know, in 607 00:22:02,120 --> 00:22:03,949 the jet version. 608 00:22:03,950 --> 00:22:05,569 And we'll probably then be fixed in the 609 00:22:05,570 --> 00:22:06,570 next versions as well. 610 00:22:07,520 --> 00:22:10,369 OK, so let's make this practical. 611 00:22:10,370 --> 00:22:12,469 You run the importer, you're 612 00:22:12,470 --> 00:22:15,079 on on the velzy code 613 00:22:15,080 --> 00:22:17,629 starts importing the code, you start 614 00:22:17,630 --> 00:22:20,149 the Java server and 615 00:22:20,150 --> 00:22:22,789 you can then point your browser to 616 00:22:22,790 --> 00:22:25,009 Port seven four seven four and 617 00:22:25,010 --> 00:22:27,079 you'll get some basic statistics 618 00:22:27,080 --> 00:22:28,969 of what's inside the database. 619 00:22:28,970 --> 00:22:31,460 So you can now see for Velzy, 620 00:22:32,750 --> 00:22:35,359 it's created about two million nodes, 621 00:22:35,360 --> 00:22:37,789 five million properties, four 622 00:22:37,790 --> 00:22:40,009 million relationships and so on, and 623 00:22:40,010 --> 00:22:42,379 it uses about seven hundred and five 624 00:22:42,380 --> 00:22:43,429 megabytes. 625 00:22:43,430 --> 00:22:45,229 So it's storable. 626 00:22:45,230 --> 00:22:47,719 So all of these experiments 627 00:22:47,720 --> 00:22:49,489 are being done on a laptop. 628 00:22:49,490 --> 00:22:51,649 So you don't need a server farm to to 629 00:22:51,650 --> 00:22:52,809 operate this kind of stuff. 630 00:22:52,810 --> 00:22:55,519 Right now, if you look at the unindexed, 631 00:22:55,520 --> 00:22:57,319 the database that has been created, 632 00:22:57,320 --> 00:22:59,239 you'll see that there's actually an 633 00:22:59,240 --> 00:23:01,579 Apache Lucene index here, right? 634 00:23:01,580 --> 00:23:02,779 Apache, Lucene. 635 00:23:02,780 --> 00:23:04,909 It's just something that you can use to 636 00:23:04,910 --> 00:23:06,469 actually document database. 637 00:23:06,470 --> 00:23:08,599 And this is used as part of the graph 638 00:23:08,600 --> 00:23:11,179 database to give you fast lookups 639 00:23:11,180 --> 00:23:13,249 of nodes by their properties. 640 00:23:13,250 --> 00:23:15,439 So you can say things like give me 641 00:23:15,440 --> 00:23:17,629 all calls to Mellark and you'll 642 00:23:17,630 --> 00:23:18,589 immediately have them. 643 00:23:18,590 --> 00:23:20,869 Right. So this takes up 644 00:23:20,870 --> 00:23:22,879 about as much space as the graph database 645 00:23:22,880 --> 00:23:23,880 you could see. 646 00:23:24,830 --> 00:23:26,989 Now, let's start using this. 647 00:23:26,990 --> 00:23:29,119 So here's a very simple query 648 00:23:29,120 --> 00:23:31,579 with the same query language. 649 00:23:31,580 --> 00:23:33,319 So as I said, there are shell utilities. 650 00:23:33,320 --> 00:23:34,639 You can look up as one of the shell 651 00:23:34,640 --> 00:23:36,499 utilities that you can use to 652 00:23:37,550 --> 00:23:39,649 quickly pipe a query 653 00:23:39,650 --> 00:23:40,989 into into it and. 654 00:23:40,990 --> 00:23:43,689 We'll get the results in here, it says, 655 00:23:43,690 --> 00:23:45,769 Give me all the files with 656 00:23:45,770 --> 00:23:48,009 the file, PAF contains 657 00:23:48,010 --> 00:23:50,109 the word Dymocks, so the 658 00:23:50,110 --> 00:23:52,239 monsters and 659 00:23:52,240 --> 00:23:54,489 of course, you could have also done 660 00:23:54,490 --> 00:23:56,049 this with a find. 661 00:23:56,050 --> 00:23:57,050 Right. 662 00:23:57,860 --> 00:23:59,349 But now it gets more interesting. 663 00:23:59,350 --> 00:24:01,150 You can also insert these 664 00:24:02,230 --> 00:24:03,489 gremlin queries. 665 00:24:03,490 --> 00:24:06,459 And here I've created custom query 666 00:24:06,460 --> 00:24:08,529 known as a custom 667 00:24:08,530 --> 00:24:11,049 step known as query note index. 668 00:24:11,050 --> 00:24:13,119 And you can pass Lucene query in here 669 00:24:13,120 --> 00:24:14,529 to get a start note. 670 00:24:14,530 --> 00:24:16,629 But instead of just returning those 671 00:24:16,630 --> 00:24:19,089 notes, we can now start traversing 672 00:24:19,090 --> 00:24:21,969 the graph starting from those notes 673 00:24:21,970 --> 00:24:24,189 so we can say things like from those 674 00:24:24,190 --> 00:24:26,499 file notes, go out 675 00:24:26,500 --> 00:24:28,329 and filter all the functions. 676 00:24:28,330 --> 00:24:30,069 So this will give you all the functions 677 00:24:30,070 --> 00:24:31,689 and print their name. 678 00:24:31,690 --> 00:24:33,789 So suddenly you're going 679 00:24:33,790 --> 00:24:35,859 from files to functions and 680 00:24:35,860 --> 00:24:38,589 you get all the functions in files 681 00:24:38,590 --> 00:24:39,849 named Dymocks. 682 00:24:41,590 --> 00:24:43,569 Yeah, that's what that looks like. 683 00:24:44,710 --> 00:24:46,779 Now let's see how much time I have 684 00:24:46,780 --> 00:24:47,890 left. Okay. 685 00:24:50,290 --> 00:24:51,579 Yeah. So of course you're going to be 686 00:24:51,580 --> 00:24:53,169 asking. OK, nice. 687 00:24:53,170 --> 00:24:55,059 But how do I know what's actually in the 688 00:24:55,060 --> 00:24:57,009 database. So how do I get from one note 689 00:24:57,010 --> 00:24:59,259 to another. And the nice thing is 690 00:24:59,260 --> 00:25:01,329 if you work at a university, you 691 00:25:01,330 --> 00:25:03,699 eventually have some some master students 692 00:25:03,700 --> 00:25:05,289 and one of them has the students. 693 00:25:06,610 --> 00:25:09,249 They wrote a tool 694 00:25:09,250 --> 00:25:11,649 to visualize the graph database contents. 695 00:25:11,650 --> 00:25:14,229 Right. So you can get it. 696 00:25:14,230 --> 00:25:16,359 This is, of course, printed 697 00:25:16,360 --> 00:25:18,429 in a way that you probably can't see from 698 00:25:18,430 --> 00:25:20,529 the back. The point is, just in those 699 00:25:20,530 --> 00:25:22,899 notes, you see all the properties and 700 00:25:22,900 --> 00:25:24,819 you see the different edges between 701 00:25:24,820 --> 00:25:27,099 there. And you also have the labels and 702 00:25:27,100 --> 00:25:28,720 also attributes on those. 703 00:25:30,320 --> 00:25:31,899 Uh, yeah. 704 00:25:31,900 --> 00:25:32,900 So. 705 00:25:33,730 --> 00:25:35,469 Yeah, you can use these tools, you can 706 00:25:35,470 --> 00:25:37,749 plot graph, 707 00:25:37,750 --> 00:25:40,059 for example, here we say 708 00:25:40,060 --> 00:25:41,469 give us the dataflow edges and the 709 00:25:41,470 --> 00:25:44,229 contraflow edges and then it will print 710 00:25:44,230 --> 00:25:46,299 nice things that you can put onto slides 711 00:25:46,300 --> 00:25:48,339 or study to actually know how you can get 712 00:25:48,340 --> 00:25:49,359 from one place to another. 713 00:25:50,450 --> 00:25:52,879 Now, let's take a look at Bux, finally, 714 00:25:52,880 --> 00:25:54,979 so I first wanted to 715 00:25:54,980 --> 00:25:57,229 look at things which are 716 00:25:57,230 --> 00:25:59,479 in a way similar to 717 00:25:59,480 --> 00:26:01,699 the Lipsius bug that 718 00:26:02,960 --> 00:26:04,129 I disclosed. 719 00:26:04,130 --> 00:26:06,199 So I 720 00:26:06,200 --> 00:26:08,719 usually begin this by formulating 721 00:26:08,720 --> 00:26:10,849 in in words what I'm looking for. 722 00:26:10,850 --> 00:26:13,039 And in this case, this was I get 723 00:26:13,040 --> 00:26:15,109 calls to mallock where the 724 00:26:15,110 --> 00:26:16,999 first argument contains an additive 725 00:26:17,000 --> 00:26:19,279 expression and a call to meme 726 00:26:19,280 --> 00:26:21,409 copy is reached by dataflow, where the 727 00:26:21,410 --> 00:26:23,839 third argument, a seamount to copy, 728 00:26:23,840 --> 00:26:26,329 also contains an additive expression 729 00:26:26,330 --> 00:26:28,339 and the two additive expressions are not 730 00:26:28,340 --> 00:26:29,479 equal. 731 00:26:29,480 --> 00:26:30,769 So these are the kinds of things that you 732 00:26:30,770 --> 00:26:33,379 can easily find with the tools that 733 00:26:33,380 --> 00:26:35,599 we wrote here and now 734 00:26:35,600 --> 00:26:37,069 as a query, it looks like this. 735 00:26:38,310 --> 00:26:39,439 This is a custom step. 736 00:26:39,440 --> 00:26:40,910 Get calls to, um, 737 00:26:42,020 --> 00:26:44,209 and, uh, well, you'll find 738 00:26:44,210 --> 00:26:45,169 a lot of different ones. 739 00:26:45,170 --> 00:26:47,269 There's, for example, get 740 00:26:47,270 --> 00:26:49,489 definitions and get functions by name, 741 00:26:49,490 --> 00:26:51,199 stuff like that. This gives you the start 742 00:26:51,200 --> 00:26:54,049 notes. So get close to Mellark 743 00:26:54,050 --> 00:26:55,249 from there. 744 00:26:55,250 --> 00:26:57,429 Walk to the first argument. 745 00:26:57,430 --> 00:26:59,299 Right. It's zero because we're computer 746 00:26:59,300 --> 00:27:00,950 scientists and we start counting at zero. 747 00:27:02,470 --> 00:27:04,669 And as a side effect, store 748 00:27:04,670 --> 00:27:06,679 the code that you see there. 749 00:27:06,680 --> 00:27:09,289 Right. That's the first document 750 00:27:09,290 --> 00:27:11,290 stored in a variable called Seante. 751 00:27:12,350 --> 00:27:14,479 Now from there, I'll 752 00:27:14,480 --> 00:27:16,579 look for an additive expression. 753 00:27:17,660 --> 00:27:19,639 And as I said, you know, if there's no 754 00:27:19,640 --> 00:27:22,399 additive expression, like 755 00:27:22,400 --> 00:27:24,469 an empty set will be returned here. 756 00:27:24,470 --> 00:27:26,569 So so those will 757 00:27:26,570 --> 00:27:27,979 be filtered. 758 00:27:27,980 --> 00:27:30,139 Yeah. And then from there, from 759 00:27:30,140 --> 00:27:31,879 all those Morlocks, where there's an 760 00:27:31,880 --> 00:27:33,379 additive expression in its first 761 00:27:33,380 --> 00:27:35,449 argument, go up 762 00:27:35,450 --> 00:27:38,509 to the statement that encloses this. 763 00:27:38,510 --> 00:27:40,819 And from there, follow dataflow 764 00:27:40,820 --> 00:27:42,619 links. Right. 765 00:27:42,620 --> 00:27:44,869 To calls to 766 00:27:44,870 --> 00:27:47,029 copy where the third 767 00:27:47,030 --> 00:27:49,099 argument is not equal 768 00:27:49,100 --> 00:27:50,989 to this C.A.T. 769 00:27:50,990 --> 00:27:52,099 that we just stored up here. 770 00:27:52,100 --> 00:27:54,319 Right. And there's an additive 771 00:27:54,320 --> 00:27:56,959 expression inside and that's it. 772 00:27:56,960 --> 00:27:59,089 And I can imagine that if you see 773 00:27:59,090 --> 00:28:00,119 something like this for the first time 774 00:28:00,120 --> 00:28:01,759 and it's really complicated, but you'll 775 00:28:01,760 --> 00:28:03,259 get very fluid at this kind of stuff, 776 00:28:04,280 --> 00:28:05,599 then we just pipe it into your own 777 00:28:05,600 --> 00:28:06,589 lookup. 778 00:28:06,590 --> 00:28:08,539 We sort it unique. 779 00:28:08,540 --> 00:28:10,699 We get the locations. 780 00:28:10,700 --> 00:28:11,779 You're on location, but. 781 00:28:13,380 --> 00:28:15,749 And we have a file containing 782 00:28:15,750 --> 00:28:18,329 locations that match, so getting 783 00:28:18,330 --> 00:28:21,059 that file gives us four hits 784 00:28:21,060 --> 00:28:23,789 and I immediately said, oh wow, OK, 785 00:28:23,790 --> 00:28:26,669 and before looks nice, I have some force. 786 00:28:26,670 --> 00:28:28,919 So let's look at the map 787 00:28:28,920 --> 00:28:30,659 for stuff and then you can pipe it into 788 00:28:30,660 --> 00:28:32,789 your own code, which will give you 789 00:28:32,790 --> 00:28:34,039 the code. 790 00:28:34,040 --> 00:28:35,009 Right. 791 00:28:35,010 --> 00:28:37,799 So this is what you find. 792 00:28:37,800 --> 00:28:40,049 You find a function called MP for Redbox 793 00:28:40,050 --> 00:28:41,640 name and. 794 00:28:43,050 --> 00:28:45,209 Well, yeah, I would have ask 795 00:28:45,210 --> 00:28:47,669 you how to trigger the bug, but 796 00:28:47,670 --> 00:28:50,249 I wrote it in the title of the slide, 797 00:28:50,250 --> 00:28:52,379 so it looks 798 00:28:52,380 --> 00:28:54,509 very similar to the bug that 799 00:28:54,510 --> 00:28:55,679 Stephane found. 800 00:28:55,680 --> 00:28:58,049 So here you have inside the malac 801 00:28:58,050 --> 00:29:00,119 there's an addition. 802 00:29:00,120 --> 00:29:02,459 You add one, you subtract 803 00:29:02,460 --> 00:29:04,080 eight. So you're 804 00:29:05,130 --> 00:29:07,259 subtracting seven, really. 805 00:29:07,260 --> 00:29:08,979 And down here in the U.S. 806 00:29:08,980 --> 00:29:09,980 use obstructing it. 807 00:29:10,980 --> 00:29:13,349 So now if you insert a seven, 808 00:29:14,370 --> 00:29:16,559 then you have seven plus one 809 00:29:16,560 --> 00:29:18,269 minus eight gives you zero. 810 00:29:18,270 --> 00:29:20,279 So you allocate something close to zero. 811 00:29:20,280 --> 00:29:21,239 Right. 812 00:29:21,240 --> 00:29:23,399 And here, if you insert seven, 813 00:29:23,400 --> 00:29:25,619 seven minus eight is minus 814 00:29:25,620 --> 00:29:27,179 one has to unsigned. 815 00:29:27,180 --> 00:29:29,489 Integer is Max. 816 00:29:29,490 --> 00:29:30,779 And there you go. You have your buffer 817 00:29:30,780 --> 00:29:33,119 overflow. So this is in the empty for 818 00:29:33,120 --> 00:29:34,120 the mixer. 819 00:29:44,230 --> 00:29:46,130 Let's see, yeah, OK. 820 00:29:48,060 --> 00:29:50,609 Now, interesting question to ask here is, 821 00:29:50,610 --> 00:29:52,169 well, how do you how do you come up with 822 00:29:52,170 --> 00:29:54,359 ways of so what am I going to search 823 00:29:54,360 --> 00:29:56,849 for in this code to actually find bugs? 824 00:29:56,850 --> 00:29:58,979 And there were some wise words 825 00:29:58,980 --> 00:30:01,019 set in the art of software security 826 00:30:01,020 --> 00:30:02,459 assessment. 827 00:30:02,460 --> 00:30:03,719 And I'd like to quote this. 828 00:30:03,720 --> 00:30:06,209 So they say, in fact, two or four authors 829 00:30:06,210 --> 00:30:08,459 typically start reviewing a new code base 830 00:30:08,460 --> 00:30:10,559 by finding the equivalent of the util 831 00:30:10,560 --> 00:30:12,659 directories and reading the framework and 832 00:30:12,660 --> 00:30:14,879 glue code line by line. 833 00:30:14,880 --> 00:30:16,259 And I found this really helpful. 834 00:30:16,260 --> 00:30:17,609 You know, if you start auditing a new 835 00:30:17,610 --> 00:30:19,709 project, take a look at the language 836 00:30:19,710 --> 00:30:22,019 that's being spoken in this code base, 837 00:30:22,020 --> 00:30:24,149 and those are the utility functions. 838 00:30:24,150 --> 00:30:26,369 So in this case, in the case 839 00:30:26,370 --> 00:30:28,439 of Velzy, I took a look at 840 00:30:28,440 --> 00:30:30,389 Velzy Stream H. 841 00:30:30,390 --> 00:30:32,549 This is the well, this is these 842 00:30:32,550 --> 00:30:35,099 are the this is the API that they use 843 00:30:35,100 --> 00:30:37,679 to analyze data streams 844 00:30:37,680 --> 00:30:39,509 of all kinds. 845 00:30:39,510 --> 00:30:41,699 So if we find fundamental difficulties in 846 00:30:41,700 --> 00:30:43,799 using this API, then we're 847 00:30:43,800 --> 00:30:44,940 going to find bugs. 848 00:30:46,440 --> 00:30:48,689 Now, what I found is there's 849 00:30:48,690 --> 00:30:50,640 a function called stream size, and 850 00:30:51,900 --> 00:30:54,179 as you might have suspected, it returns 851 00:30:54,180 --> 00:30:56,759 a 64 bit integer. 852 00:30:56,760 --> 00:30:58,739 That's because a stream, a data stream. 853 00:30:58,740 --> 00:31:01,179 So this might be just an empty for 854 00:31:01,180 --> 00:31:03,359 of course, you want it to be bigger than 855 00:31:03,360 --> 00:31:04,259 four gigabytes. 856 00:31:04,260 --> 00:31:07,019 Right. So any good 857 00:31:07,020 --> 00:31:09,089 HD movie is going to be bigger than four 858 00:31:09,090 --> 00:31:11,039 gigabytes and like that. 859 00:31:11,040 --> 00:31:13,229 So the problem here is 860 00:31:13,230 --> 00:31:14,849 that on 32 bit platforms, 861 00:31:16,350 --> 00:31:18,329 well, you can't really store 862 00:31:19,440 --> 00:31:23,039 forty two bit in a register. 863 00:31:23,040 --> 00:31:25,169 So the size of the 864 00:31:25,170 --> 00:31:27,389 stream is going to be larger than the 865 00:31:27,390 --> 00:31:29,729 max on the platform. 866 00:31:29,730 --> 00:31:31,919 And that means that all 867 00:31:31,920 --> 00:31:34,109 the locations that depend on stream 868 00:31:34,110 --> 00:31:35,819 size, you're going to be very you going 869 00:31:35,820 --> 00:31:37,289 have to be very careful of those. 870 00:31:38,570 --> 00:31:40,549 And it's realistic for an attack because 871 00:31:40,550 --> 00:31:43,059 know somebody well, if somebody downloads 872 00:31:43,060 --> 00:31:45,819 a movie off file sharing application 873 00:31:45,820 --> 00:31:48,399 and the caption is right, 874 00:31:48,400 --> 00:31:50,529 then people won't be thinking, 875 00:31:50,530 --> 00:31:51,639 oh, this is four gigabytes. 876 00:31:51,640 --> 00:31:52,629 I'm not going to start this. 877 00:31:52,630 --> 00:31:54,069 Right. So it's four gigabytes because 878 00:31:54,070 --> 00:31:55,519 it's a movie. 879 00:31:55,520 --> 00:31:56,950 Go. So, 880 00:31:58,510 --> 00:32:00,579 again, formulating this in your head, 881 00:32:00,580 --> 00:32:02,439 you get give me statements containing 882 00:32:02,440 --> 00:32:04,569 calls to stream size and 883 00:32:04,570 --> 00:32:06,819 the symbol in 60 40 where data 884 00:32:06,820 --> 00:32:08,979 flow exists to statements containing 885 00:32:08,980 --> 00:32:09,999 the symbol Malac. 886 00:32:10,000 --> 00:32:12,519 Right. So you get locations depending 887 00:32:12,520 --> 00:32:14,169 on the stream size. 888 00:32:14,170 --> 00:32:16,899 And this query looks well, 889 00:32:16,900 --> 00:32:19,149 the sentences not so long. 890 00:32:19,150 --> 00:32:20,739 And the query isn't either. 891 00:32:20,740 --> 00:32:22,819 So it's again, get close to what 892 00:32:22,820 --> 00:32:24,999 you just saw, a stream size and then go 893 00:32:25,000 --> 00:32:27,159 up to the statement and 894 00:32:27,160 --> 00:32:29,259 now you're leaving the the AST part 895 00:32:29,260 --> 00:32:31,549 and now 896 00:32:31,550 --> 00:32:33,699 filter all those statements which 897 00:32:33,700 --> 00:32:35,829 contain in 64 898 00:32:35,830 --> 00:32:38,319 and then follow the data flow to 899 00:32:38,320 --> 00:32:40,419 statements that contain the word 900 00:32:40,420 --> 00:32:42,159 malac. That's it. 901 00:32:42,160 --> 00:32:43,389 And you don't get so many hits. 902 00:32:43,390 --> 00:32:44,740 So those are all the hits that you get 903 00:32:45,850 --> 00:32:47,979 and well, they're 904 00:32:47,980 --> 00:32:48,879 all not so interesting. 905 00:32:48,880 --> 00:32:51,039 But this one is nice because this is the 906 00:32:51,040 --> 00:32:52,029 updater. 907 00:32:52,030 --> 00:32:54,639 And so I read plus one depending 908 00:32:54,640 --> 00:32:55,929 on stream says. 909 00:32:55,930 --> 00:32:58,089 So let's take a look at this code. 910 00:32:58,090 --> 00:33:00,039 So this is the Velzy automatic update to 911 00:33:00,040 --> 00:33:01,869 this thing that checks in the background 912 00:33:01,870 --> 00:33:04,779 whether new updates are available. 913 00:33:04,780 --> 00:33:05,780 And you can see 914 00:33:07,540 --> 00:33:09,459 it connects to some new URL. 915 00:33:09,460 --> 00:33:11,509 We'll see which one and then it can 916 00:33:11,510 --> 00:33:13,629 stream size and 917 00:33:13,630 --> 00:33:15,579 stores this. And I read and this is the 918 00:33:15,580 --> 00:33:17,499 64 bit integer. 919 00:33:17,500 --> 00:33:19,719 Then it costs mallock 920 00:33:19,720 --> 00:33:20,769 oneiric plus one, 921 00:33:21,910 --> 00:33:24,699 not one on a 32 bit platform, 922 00:33:24,700 --> 00:33:26,589 I read plus one. 923 00:33:26,590 --> 00:33:28,719 Well, this will first succeed and they 924 00:33:28,720 --> 00:33:31,029 will do a calculation with a 64 925 00:33:31,030 --> 00:33:32,289 bit most probably. 926 00:33:32,290 --> 00:33:34,359 But then no matter what happens, 927 00:33:34,360 --> 00:33:36,579 this is going to be truncated 928 00:33:36,580 --> 00:33:38,619 to six to 32 bit. 929 00:33:38,620 --> 00:33:41,019 Right, because the size T on a 32 930 00:33:41,020 --> 00:33:43,119 bit platform is four 931 00:33:43,120 --> 00:33:44,439 byte as opposed to eight. 932 00:33:45,960 --> 00:33:48,069 Now if I 933 00:33:48,070 --> 00:33:50,139 read is maxing 934 00:33:50,140 --> 00:33:52,309 out, then again, you get a buffer 935 00:33:52,310 --> 00:33:55,029 close close to the size of zero 936 00:33:55,030 --> 00:33:57,099 and now they call stream rate, which 937 00:33:57,100 --> 00:33:59,259 is an API function that is similar 938 00:33:59,260 --> 00:34:01,329 to mem copy in a way, but much better 939 00:34:01,330 --> 00:34:02,769 as you'll see in a moment. 940 00:34:02,770 --> 00:34:05,439 And the read 941 00:34:05,440 --> 00:34:07,479 I read characters into this buffer. 942 00:34:07,480 --> 00:34:08,880 So again, you have an overflow. 943 00:34:10,480 --> 00:34:12,129 Now you're going to be saying, yeah, but 944 00:34:12,130 --> 00:34:13,899 you know, your update is going to be 945 00:34:13,900 --> 00:34:15,399 verified. Right. 946 00:34:15,400 --> 00:34:16,658 Here's the nice thing. 947 00:34:16,659 --> 00:34:19,269 They actually do signature checking 948 00:34:19,270 --> 00:34:22,149 after they've downloaded using HTTP 949 00:34:22,150 --> 00:34:24,309 only. Um, but this 950 00:34:24,310 --> 00:34:26,259 buffer overflow happens before the 951 00:34:26,260 --> 00:34:27,340 signature is being checked. 952 00:34:29,440 --> 00:34:31,629 Right. So executer before verification. 953 00:34:31,630 --> 00:34:32,589 Nice. 954 00:34:32,590 --> 00:34:34,059 This could be something that we could be 955 00:34:34,060 --> 00:34:35,060 interested in. 956 00:34:35,800 --> 00:34:37,569 Now let's look into the binary and we'll 957 00:34:37,570 --> 00:34:39,439 see. OK, this is a static URL. 958 00:34:39,440 --> 00:34:41,589 It should be update vidalin and 959 00:34:41,590 --> 00:34:42,698 so on. 960 00:34:42,699 --> 00:34:45,359 And this year is. 961 00:34:45,360 --> 00:34:47,499 Yeah, if you follow the link you'll see 962 00:34:47,500 --> 00:34:49,149 this is the call to stream size. 963 00:34:49,150 --> 00:34:51,488 It's been in line and here you see 964 00:34:51,489 --> 00:34:53,349 the truncation in. 965 00:34:53,350 --> 00:34:55,269 It's in a very obvious form. 966 00:34:55,270 --> 00:34:57,429 So here the return value is 967 00:34:57,430 --> 00:34:59,829 actually stored in two registers 968 00:34:59,830 --> 00:35:00,789 because you can't really do it 969 00:35:00,790 --> 00:35:03,219 differently on a 32 bit platform. 970 00:35:03,220 --> 00:35:04,540 Right. But then 971 00:35:05,590 --> 00:35:07,779 as they as they add one, 972 00:35:07,780 --> 00:35:09,909 they really just add one to one of the 973 00:35:09,910 --> 00:35:11,349 registers because they're going to be 974 00:35:11,350 --> 00:35:13,389 using a 32 bit value anyway. 975 00:35:13,390 --> 00:35:15,939 So they might as well truncated here. 976 00:35:15,940 --> 00:35:18,159 And then this flows into the Marlock 977 00:35:18,160 --> 00:35:19,989 and there you go. 978 00:35:19,990 --> 00:35:22,029 You have your integer overflow leading to 979 00:35:22,030 --> 00:35:23,030 a heap buffer of. 980 00:35:24,910 --> 00:35:25,910 Now, 981 00:35:27,070 --> 00:35:28,510 I set up my little Web server 982 00:35:29,530 --> 00:35:32,409 and pointed ATC host to update 983 00:35:32,410 --> 00:35:34,479 at points to update Verlin or 984 00:35:34,480 --> 00:35:37,659 to that Web server and created a file 985 00:35:37,660 --> 00:35:40,089 that is four gigabytes long and contained 986 00:35:40,090 --> 00:35:41,199 all A's. 987 00:35:41,200 --> 00:35:43,299 I attached to debugger and I 988 00:35:43,300 --> 00:35:44,919 check for updates. 989 00:35:44,920 --> 00:35:47,019 And here's what happened. 990 00:35:49,540 --> 00:35:51,999 So that's very convenient because 991 00:35:52,000 --> 00:35:54,189 typically from the crash, you need 992 00:35:54,190 --> 00:35:56,359 to do all sorts of work to actually get 993 00:35:56,360 --> 00:35:58,449 VIP equals or 994 00:35:58,450 --> 00:36:00,249 your value, right? So this is forty one 995 00:36:00,250 --> 00:36:02,529 forty one forty one forty one, meaning 996 00:36:02,530 --> 00:36:04,749 it's all A's. So you control where 997 00:36:04,750 --> 00:36:06,909 the program jumps next. 998 00:36:06,910 --> 00:36:08,929 And the nice thing is, if you have this, 999 00:36:08,930 --> 00:36:11,229 then arguing in favor of patching just 1000 00:36:11,230 --> 00:36:12,699 gets very much easier. 1001 00:36:15,430 --> 00:36:18,189 OK, so some notes on exploitation 1002 00:36:18,190 --> 00:36:20,529 stream read is your friend, I just I just 1003 00:36:20,530 --> 00:36:21,549 said this already. 1004 00:36:21,550 --> 00:36:23,769 It's like memoires piece of copy stayed 1005 00:36:23,770 --> 00:36:25,959 in your buffer, but much better. 1006 00:36:25,960 --> 00:36:28,509 So first of all, the size depends 1007 00:36:28,510 --> 00:36:30,669 on the content. Langfield 1008 00:36:30,670 --> 00:36:33,159 So you don't have to actually transfair 1009 00:36:33,160 --> 00:36:35,229 four gigabytes of data or 1010 00:36:35,230 --> 00:36:37,569 you need to say as I am eventually 1011 00:36:37,570 --> 00:36:39,909 going to transmit four gigabytes 1012 00:36:39,910 --> 00:36:42,099 of data and here's some data 1013 00:36:42,100 --> 00:36:44,319 that the attacker fully 1014 00:36:44,320 --> 00:36:46,029 controls the amount of data to actually 1015 00:36:46,030 --> 00:36:48,699 copy. And you also control 1016 00:36:48,700 --> 00:36:50,859 how data is fragmented into 1017 00:36:50,860 --> 00:36:52,479 different blocks. 1018 00:36:52,480 --> 00:36:54,549 And now the nicest thing of all, 1019 00:36:54,550 --> 00:36:57,159 and this is why we control IP on this 1020 00:36:57,160 --> 00:36:59,319 in between those copy operations stream 1021 00:36:59,320 --> 00:37:01,479 read dereference as a function 1022 00:37:01,480 --> 00:37:02,409 pointer. 1023 00:37:02,410 --> 00:37:04,599 OK, so if you overwrite this 1024 00:37:04,600 --> 00:37:07,889 function pointer then you control 1025 00:37:07,890 --> 00:37:08,890 the IP. 1026 00:37:10,050 --> 00:37:13,259 Now, SLR and is enabled, 1027 00:37:13,260 --> 00:37:15,029 but there is a bit of position dependent 1028 00:37:15,030 --> 00:37:16,709 quote that we can use to build a small 1029 00:37:16,710 --> 00:37:17,999 Ruppe chain for the exploit, 1030 00:37:19,050 --> 00:37:21,149 the downside is this is not so 1031 00:37:21,150 --> 00:37:23,519 stable, at least for my poque exploit. 1032 00:37:23,520 --> 00:37:25,139 So I'm going to show you a pocket exploit 1033 00:37:25,140 --> 00:37:27,449 now, a proof of concept, but it's 1034 00:37:27,450 --> 00:37:28,979 well, I only took a couple of days for 1035 00:37:28,980 --> 00:37:31,349 this, so it's not very 1036 00:37:31,350 --> 00:37:33,449 stable. But yeah, 1037 00:37:33,450 --> 00:37:36,149 it will work at some point. 1038 00:37:36,150 --> 00:37:37,679 Let's see. OK, so 1039 00:37:39,030 --> 00:37:40,050 this is the platform. 1040 00:37:41,680 --> 00:37:43,869 Now I start Velzy and then 1041 00:37:43,870 --> 00:37:45,489 I use the automatic updater, 1042 00:37:46,750 --> 00:37:49,359 and it didn't work, and 1043 00:37:49,360 --> 00:37:50,360 I tried again. 1044 00:37:51,550 --> 00:37:53,469 And it didn't work. 1045 00:37:53,470 --> 00:37:55,689 Also, pork exploits have stage fright, 1046 00:37:55,690 --> 00:37:57,969 as you know, and it didn't work. 1047 00:37:59,470 --> 00:38:00,820 Can you give me five tries? 1048 00:38:03,280 --> 00:38:05,289 Almost. That's very close, that's very 1049 00:38:05,290 --> 00:38:06,380 close. Just a second. 1050 00:38:07,450 --> 00:38:08,450 Yes, there we go. 1051 00:38:19,760 --> 00:38:22,039 So the reason this didn't work 1052 00:38:22,040 --> 00:38:24,209 straight away is I did know 1053 00:38:24,210 --> 00:38:26,299 all sorts of work to ensure that 1054 00:38:26,300 --> 00:38:28,459 the heat is in any controllable 1055 00:38:28,460 --> 00:38:31,129 state. This is really just 1056 00:38:31,130 --> 00:38:33,409 these were a couple of luck shots and 1057 00:38:33,410 --> 00:38:36,049 then eventually it hit the right address. 1058 00:38:36,050 --> 00:38:38,239 Yeah, I mean, it it shows that you can 1059 00:38:38,240 --> 00:38:39,649 execute arbitrary code. 1060 00:38:39,650 --> 00:38:41,239 So this has to be enough to patch it. 1061 00:38:41,240 --> 00:38:42,379 Right. 1062 00:38:42,380 --> 00:38:43,380 Good. Next. 1063 00:38:44,500 --> 00:38:46,669 Yeah. So our second test subject was 1064 00:38:46,670 --> 00:38:47,670 the Linux kernel. 1065 00:38:49,160 --> 00:38:50,809 This is also well suited for static 1066 00:38:50,810 --> 00:38:53,299 analysis because there's a lot of drivers 1067 00:38:53,300 --> 00:38:55,279 in there. And to actually force these 1068 00:38:55,280 --> 00:38:57,499 drivers, you would actually need to have 1069 00:38:57,500 --> 00:39:00,109 the hardware to to fuzz this stuff. 1070 00:39:00,110 --> 00:39:01,789 So you find a lot of things statically as 1071 00:39:01,790 --> 00:39:02,790 well. 1072 00:39:03,530 --> 00:39:05,209 And also it's a logical base for large 1073 00:39:05,210 --> 00:39:07,429 users. So it's interesting right 1074 00:39:07,430 --> 00:39:09,079 now. Yeah, I want to show you something 1075 00:39:09,080 --> 00:39:10,549 and I know this is a lot of text on the 1076 00:39:10,550 --> 00:39:12,619 slide here and show you that we 1077 00:39:12,620 --> 00:39:15,199 can take very 1078 00:39:15,200 --> 00:39:17,509 complex, we can define 1079 00:39:17,510 --> 00:39:19,699 very complex traversal and then just 1080 00:39:19,700 --> 00:39:21,529 hide them in a single word, which in this 1081 00:39:21,530 --> 00:39:24,019 case is unsanitized, and then reuse 1082 00:39:24,020 --> 00:39:25,789 them again and again. 1083 00:39:25,790 --> 00:39:27,979 So for unsanitized, what I want to do is 1084 00:39:27,980 --> 00:39:29,929 I want to be able to find cases where I 1085 00:39:29,930 --> 00:39:33,199 have a flow from a source to a sink, 1086 00:39:33,200 --> 00:39:35,509 but certain checks are definitely 1087 00:39:35,510 --> 00:39:37,399 not in place. 1088 00:39:37,400 --> 00:39:39,469 So if you 1089 00:39:39,470 --> 00:39:41,939 formulate this in complete 1090 00:39:41,940 --> 00:39:44,619 form, it's the traversal return 1091 00:39:44,620 --> 00:39:46,909 control sources if and only if 1092 00:39:46,910 --> 00:39:48,079 there's enough missing. 1093 00:39:48,080 --> 00:39:49,039 If they meet the following two 1094 00:39:49,040 --> 00:39:51,199 conditions, there exists a path 1095 00:39:51,200 --> 00:39:52,789 from the source statement to the sync 1096 00:39:52,790 --> 00:39:55,189 statement in the control flow graph 1097 00:39:55,190 --> 00:39:57,769 such that no code on the path matches 1098 00:39:57,770 --> 00:39:59,809 any of the sanitizer descriptions. 1099 00:39:59,810 --> 00:40:01,279 So these descriptions are something that 1100 00:40:01,280 --> 00:40:02,629 you pass in there. 1101 00:40:02,630 --> 00:40:04,729 Now, a second, a variable defined 1102 00:40:04,730 --> 00:40:06,409 by the source and used by the sync 1103 00:40:06,410 --> 00:40:08,789 reaches the sync via the control 1104 00:40:08,790 --> 00:40:10,549 flow path, i.e. 1105 00:40:10,550 --> 00:40:12,739 the variable is not re defined by any 1106 00:40:12,740 --> 00:40:13,969 node on the path. 1107 00:40:13,970 --> 00:40:15,199 So that's pretty complex. 1108 00:40:15,200 --> 00:40:17,569 Right, but you can implement it once 1109 00:40:17,570 --> 00:40:19,309 and then put it into your step called 1110 00:40:19,310 --> 00:40:21,469 unsanitized and reuse it again and 1111 00:40:21,470 --> 00:40:22,639 again. 1112 00:40:22,640 --> 00:40:24,739 And that's what we did here. 1113 00:40:24,740 --> 00:40:26,749 So here you see typical. 1114 00:40:26,750 --> 00:40:29,029 So we say a query 1115 00:40:30,170 --> 00:40:32,539 here we look for buffer overflow 1116 00:40:32,540 --> 00:40:33,799 and write handlers. 1117 00:40:33,800 --> 00:40:35,659 So we start off by characterizing the 1118 00:40:35,660 --> 00:40:37,759 Sync's that we want to reach sensitive 1119 00:40:37,760 --> 00:40:38,689 operations. 1120 00:40:38,690 --> 00:40:40,519 So get functions 1121 00:40:41,600 --> 00:40:42,919 where the name matches. 1122 00:40:42,920 --> 00:40:44,659 Right, OK. 1123 00:40:44,660 --> 00:40:46,909 And from there on, 1124 00:40:46,910 --> 00:40:49,009 go to the arguments or 1125 00:40:49,010 --> 00:40:51,289 arguments it is now and. 1126 00:40:53,880 --> 00:40:55,979 Yeah, no, go, go, go 1127 00:40:55,980 --> 00:40:58,049 to calls to 1128 00:40:58,050 --> 00:41:00,359 copy from user or meme copy and 1129 00:41:00,360 --> 00:41:02,969 to their third arguments. 1130 00:41:02,970 --> 00:41:03,989 Yeah. 1131 00:41:03,990 --> 00:41:06,119 And now filter those for count 1132 00:41:06,120 --> 00:41:08,579 because count is typically something 1133 00:41:08,580 --> 00:41:10,829 that's that's a variable you can control 1134 00:41:10,830 --> 00:41:11,830 in the right hander. 1135 00:41:14,420 --> 00:41:16,219 And then here comes the unsanitized step 1136 00:41:16,220 --> 00:41:17,419 that's the important part, so the 1137 00:41:17,420 --> 00:41:19,609 unsanitized step here, you 1138 00:41:19,610 --> 00:41:21,259 characterize the different kinds of 1139 00:41:21,260 --> 00:41:23,329 checks that you want to 1140 00:41:23,330 --> 00:41:25,519 filter. So if this stuff occurs, 1141 00:41:25,520 --> 00:41:27,559 then you don't care for this sink. 1142 00:41:27,560 --> 00:41:29,779 So one of them is, is there some sort 1143 00:41:29,780 --> 00:41:31,009 of comparison in here? 1144 00:41:31,010 --> 00:41:33,169 We've got this is check or does the 1145 00:41:33,170 --> 00:41:34,909 code contain ELAC? 1146 00:41:34,910 --> 00:41:36,979 That means that the buffer 1147 00:41:36,980 --> 00:41:38,779 is probably allocated to have the correct 1148 00:41:38,780 --> 00:41:40,729 size. Of course, we don't know this, but, 1149 00:41:40,730 --> 00:41:42,079 you know, let's filter all of these. 1150 00:41:42,080 --> 00:41:44,089 Let's look for the cases where it's very 1151 00:41:44,090 --> 00:41:46,429 clear that this is a bug and 1152 00:41:46,430 --> 00:41:48,169 Min is not used. 1153 00:41:48,170 --> 00:41:50,569 So Min is often used instead of 1154 00:41:50,570 --> 00:41:51,800 instead of having a check. 1155 00:41:52,940 --> 00:41:54,919 Yes. A filter those as well. 1156 00:41:54,920 --> 00:41:57,199 And make sure that the sources 1157 00:41:57,200 --> 00:41:59,299 match count and we 1158 00:41:59,300 --> 00:42:00,769 have a similar one. We're not going to go 1159 00:42:00,770 --> 00:42:01,939 for this in detail. 1160 00:42:01,940 --> 00:42:04,039 But again, the unsanitized part is 1161 00:42:04,040 --> 00:42:06,409 exactly the same, right? 1162 00:42:06,410 --> 00:42:08,629 All we do now is we have a 1163 00:42:08,630 --> 00:42:10,759 slightly different characterization 1164 00:42:10,760 --> 00:42:13,159 of the Sync's and the sources, 1165 00:42:13,160 --> 00:42:14,929 and we ran these two queries 1166 00:42:16,010 --> 00:42:18,079 and, um, well, we 1167 00:42:18,080 --> 00:42:20,149 had seven audits and 11 hits 1168 00:42:20,150 --> 00:42:21,139 using this. 1169 00:42:21,140 --> 00:42:22,550 So that was very cool. 1170 00:42:24,320 --> 00:42:26,179 Now, here's one of them just to show you 1171 00:42:26,180 --> 00:42:27,679 what these kinds of bugs look like and 1172 00:42:27,680 --> 00:42:29,749 what what you can use this unsanitized 1173 00:42:29,750 --> 00:42:31,129 step for. 1174 00:42:31,130 --> 00:42:33,170 So this is an actual handler of 1175 00:42:34,280 --> 00:42:35,329 Q F that. 1176 00:42:35,330 --> 00:42:37,579 If that card you see here 1177 00:42:37,580 --> 00:42:39,799 call to copy from user, 1178 00:42:39,800 --> 00:42:41,589 which is being used to initialize the 1179 00:42:41,590 --> 00:42:44,719 variable Rick Lence request. 1180 00:42:44,720 --> 00:42:47,149 And that means that this 1181 00:42:47,150 --> 00:42:49,370 variable here is controlled by userspace. 1182 00:42:50,540 --> 00:42:52,759 Now there's an allocation 1183 00:42:52,760 --> 00:42:54,949 here on the way, but it's absolutely 1184 00:42:54,950 --> 00:42:56,809 not concerned with Rachlin. 1185 00:42:56,810 --> 00:42:59,509 Instead, Rocklyn is next used 1186 00:42:59,510 --> 00:43:01,729 in this copy operation down here as 1187 00:43:01,730 --> 00:43:03,439 a third argumentum copy. 1188 00:43:03,440 --> 00:43:05,629 And this buffer, well, the 1189 00:43:05,630 --> 00:43:07,099 size of this buffer has nothing to do 1190 00:43:07,100 --> 00:43:09,049 with Rocklyn. And it's a it's a static 1191 00:43:09,050 --> 00:43:11,269 buffer. So you have a classical 1192 00:43:11,270 --> 00:43:13,069 stack based buffer overflow. 1193 00:43:13,070 --> 00:43:15,139 And the nice thing this is why there 1194 00:43:15,140 --> 00:43:17,329 were only 11 hits is that we were 1195 00:43:17,330 --> 00:43:19,639 able to actually filter those cases where 1196 00:43:19,640 --> 00:43:21,529 there are different kinds of checks to 1197 00:43:21,530 --> 00:43:22,939 ensure that this kind of stuff doesn't 1198 00:43:22,940 --> 00:43:24,199 happen. Right. 1199 00:43:24,200 --> 00:43:26,209 So stack based buffer flow, we did not 1200 00:43:26,210 --> 00:43:28,279 write and exploit what looks looks 1201 00:43:28,280 --> 00:43:31,459 pretty exploitable now 1202 00:43:31,460 --> 00:43:33,399 for the practical evaluation of all the 1203 00:43:33,400 --> 00:43:34,159 stuff. 1204 00:43:34,160 --> 00:43:35,809 I mean, I was just able to show you a 1205 00:43:35,810 --> 00:43:37,489 couple of samples here. 1206 00:43:37,490 --> 00:43:40,429 We did a larger evaluation and 1207 00:43:40,430 --> 00:43:42,499 Niko worked on this 1208 00:43:43,520 --> 00:43:44,779 for a great deal. 1209 00:43:44,780 --> 00:43:47,329 So he used this 1210 00:43:47,330 --> 00:43:49,519 at well 1211 00:43:49,520 --> 00:43:51,919 in a real company for Qualcomm 1212 00:43:51,920 --> 00:43:54,059 and found about a 1213 00:43:54,060 --> 00:43:56,239 hundred issues in an internal audit. 1214 00:43:56,240 --> 00:43:58,369 Now, I don't know exactly what an issue 1215 00:43:58,370 --> 00:44:00,739 is here, so 1216 00:44:00,740 --> 00:44:03,199 can can comment on the quality 1217 00:44:03,200 --> 00:44:04,849 of those findings. 1218 00:44:04,850 --> 00:44:06,949 But then we also did 1219 00:44:06,950 --> 00:44:08,089 an evaluation. 1220 00:44:08,090 --> 00:44:10,939 So he did most of the work with this 1221 00:44:10,940 --> 00:44:13,279 on the Linux kernel mainline 1222 00:44:13,280 --> 00:44:15,499 and we formulated different 1223 00:44:15,500 --> 00:44:16,309 kinds of queries. 1224 00:44:16,310 --> 00:44:18,349 So two queries, buffer overflow, as you 1225 00:44:18,350 --> 00:44:20,239 just saw them, although they were 1226 00:44:20,240 --> 00:44:22,399 slightly rewritten. But yeah, we just saw 1227 00:44:22,400 --> 00:44:25,339 them then we had zero byte allocation, 1228 00:44:25,340 --> 00:44:27,559 memory mapping box and some memory 1229 00:44:27,560 --> 00:44:28,809 disclosure box. 1230 00:44:28,810 --> 00:44:31,009 This one was rather complicated and 1231 00:44:31,010 --> 00:44:32,389 need to publish it at some point. 1232 00:44:34,130 --> 00:44:36,709 Yeah. And we found a lot of blocks 1233 00:44:36,710 --> 00:44:38,389 kind of proud of that. 1234 00:44:38,390 --> 00:44:40,459 So using the. 1235 00:44:48,290 --> 00:44:50,689 Yeah, and 1236 00:44:50,690 --> 00:44:52,309 they were also acknowledged by the 1237 00:44:52,310 --> 00:44:54,199 developers, so that's always nice to know 1238 00:44:54,200 --> 00:44:56,419 that actually, you know, these are not 1239 00:44:56,420 --> 00:44:58,549 just thugs for security folks, 1240 00:44:58,550 --> 00:45:00,979 but actually they would also say it's OK, 1241 00:45:00,980 --> 00:45:01,669 10 minutes. 1242 00:45:01,670 --> 00:45:02,670 Yeah, that's good. 1243 00:45:03,710 --> 00:45:05,359 Uh, OK. 1244 00:45:05,360 --> 00:45:07,159 So conclusion. 1245 00:45:07,160 --> 00:45:09,229 Yeah, I showed you a system to mine, 1246 00:45:09,230 --> 00:45:10,910 quote, basis for vulnerabilities 1247 00:45:12,740 --> 00:45:13,639 on the long run. 1248 00:45:13,640 --> 00:45:15,709 Um, well, here I've 1249 00:45:15,710 --> 00:45:17,239 already built a bridge between program 1250 00:45:17,240 --> 00:45:19,429 analysis and graph databases on 1251 00:45:19,430 --> 00:45:21,319 the long run. It's you know, the larger 1252 00:45:21,320 --> 00:45:23,419 effort of this is can we 1253 00:45:23,420 --> 00:45:25,489 have typical pattern recognition, 1254 00:45:25,490 --> 00:45:27,769 data mining techniques to discover 1255 00:45:27,770 --> 00:45:28,669 vulnerabilities? 1256 00:45:28,670 --> 00:45:30,439 And this is one part of this. 1257 00:45:30,440 --> 00:45:32,809 And finally, as I just said already, 1258 00:45:32,810 --> 00:45:34,579 we found real exploitable bugs. 1259 00:45:34,580 --> 00:45:36,679 So, yeah, it seems 1260 00:45:36,680 --> 00:45:38,929 seems to work in some 1261 00:45:38,930 --> 00:45:39,930 cases. Thanks. 1262 00:45:50,670 --> 00:45:52,829 Okay, thank you very much. 1263 00:45:52,830 --> 00:45:55,169 As usual, we have some time for Q&A 1264 00:45:55,170 --> 00:45:57,449 left, so we again have four 1265 00:45:57,450 --> 00:45:59,249 microphones in the room, please. 1266 00:45:59,250 --> 00:46:00,149 All the people in the room will have 1267 00:46:00,150 --> 00:46:01,769 questions lined up behind the microphones 1268 00:46:01,770 --> 00:46:03,629 and for the people on the stream. 1269 00:46:03,630 --> 00:46:04,859 We also have a signal. 1270 00:46:04,860 --> 00:46:05,999 Angel in the room will take your 1271 00:46:06,000 --> 00:46:08,099 questions on Twitter Iasi and all 1272 00:46:08,100 --> 00:46:10,889 your online well means. 1273 00:46:10,890 --> 00:46:12,389 And we're going to go with the first 1274 00:46:12,390 --> 00:46:13,889 question from room for microphone number 1275 00:46:13,890 --> 00:46:15,549 two, please. 1276 00:46:15,550 --> 00:46:17,609 Yeah. So do you 1277 00:46:17,610 --> 00:46:19,919 have a database with 1278 00:46:19,920 --> 00:46:21,389 Curis ready? 1279 00:46:21,390 --> 00:46:23,279 So I don't think it would be a good idea 1280 00:46:23,280 --> 00:46:25,229 to integrate this in continuous 1281 00:46:25,230 --> 00:46:26,999 integration system. 1282 00:46:27,000 --> 00:46:29,009 Yes, we have a database already with such 1283 00:46:29,010 --> 00:46:31,079 Curis. And is there any way to 1284 00:46:31,080 --> 00:46:32,039 annotate called. 1285 00:46:32,040 --> 00:46:34,409 So if I go and if 1286 00:46:34,410 --> 00:46:36,419 this report is a problem, can I annotate 1287 00:46:36,420 --> 00:46:38,699 it in some way that we'll skip 1288 00:46:38,700 --> 00:46:41,339 it the next time if it 1289 00:46:41,340 --> 00:46:41,999 goes through? 1290 00:46:42,000 --> 00:46:44,279 Um, we currently 1291 00:46:44,280 --> 00:46:46,649 don't have a database of 1292 00:46:46,650 --> 00:46:48,149 like a lot of queries. 1293 00:46:48,150 --> 00:46:49,649 We have some examples. 1294 00:46:49,650 --> 00:46:51,989 Right, that you can look at, 1295 00:46:51,990 --> 00:46:54,029 but we definitely want to have this. 1296 00:46:54,030 --> 00:46:56,729 This is more of a question of 1297 00:46:56,730 --> 00:46:59,099 time that you're able to invest into 1298 00:46:59,100 --> 00:47:01,349 this. But I also think that the the idea 1299 00:47:01,350 --> 00:47:03,089 that you're I mean, essentially what 1300 00:47:03,090 --> 00:47:05,279 you're saying is, if I understand 1301 00:47:05,280 --> 00:47:07,439 correctly, how can we not build 1302 00:47:07,440 --> 00:47:09,629 a database of things that we know that 1303 00:47:09,630 --> 00:47:10,679 have broken in the past. 1304 00:47:10,680 --> 00:47:12,509 Right. So that we can immediately scan 1305 00:47:12,510 --> 00:47:14,849 the new code for this kind of stuff, 1306 00:47:14,850 --> 00:47:16,779 if I understand correctly. 1307 00:47:16,780 --> 00:47:19,079 Um, yeah, this would definitely be very 1308 00:47:19,080 --> 00:47:21,479 awesome. But needs some some. 1309 00:47:21,480 --> 00:47:22,739 Yeah. 1310 00:47:22,740 --> 00:47:24,629 Work to be done. 1311 00:47:24,630 --> 00:47:25,650 Just writing queries 1312 00:47:26,760 --> 00:47:28,019 in the annotation part. 1313 00:47:28,020 --> 00:47:30,239 I mean, can I mark some code 1314 00:47:30,240 --> 00:47:32,369 that this was checked and it's safe so 1315 00:47:32,370 --> 00:47:33,959 it doesn't go again. 1316 00:47:33,960 --> 00:47:34,960 So 1317 00:47:36,630 --> 00:47:38,129 yeah. If something interesting to say 1318 00:47:38,130 --> 00:47:39,539 about this, but I'd rather take this 1319 00:47:39,540 --> 00:47:42,029 offline because it's concerned with 1320 00:47:42,030 --> 00:47:44,219 the work that we haven't got 1321 00:47:44,220 --> 00:47:46,769 published yet, but 1322 00:47:46,770 --> 00:47:49,409 yeah. About the annotation. 1323 00:47:49,410 --> 00:47:51,089 Yeah. Maybe we can discuss this after the 1324 00:47:51,090 --> 00:47:51,469 talk. 1325 00:47:51,470 --> 00:47:53,579 I thank you from 1326 00:47:53,580 --> 00:47:55,169 your next question from microphone number 1327 00:47:55,170 --> 00:47:56,249 three please. 1328 00:47:56,250 --> 00:47:57,250 Yeah. 1329 00:47:58,130 --> 00:48:00,269 Give them that. You support C and C++. 1330 00:48:00,270 --> 00:48:02,429 I assume you're obtaining the 1331 00:48:02,430 --> 00:48:04,859 tier from CRANN compiler for example 1332 00:48:04,860 --> 00:48:05,860 now. 1333 00:48:07,110 --> 00:48:09,899 Yeah, this is another story, but 1334 00:48:09,900 --> 00:48:11,999 I was asked by several people to skip 1335 00:48:12,000 --> 00:48:13,769 it because it's a bit boring. 1336 00:48:13,770 --> 00:48:16,529 The thing is that if you 1337 00:48:16,530 --> 00:48:18,689 I said something about a robust 1338 00:48:18,690 --> 00:48:20,969 Pausa. Right. So the problem 1339 00:48:20,970 --> 00:48:23,099 really is the job 1340 00:48:23,100 --> 00:48:25,169 of compilers is to also check 1341 00:48:25,170 --> 00:48:27,269 whether the code that you're giving it is 1342 00:48:27,270 --> 00:48:28,979 actually correct. 1343 00:48:28,980 --> 00:48:30,749 That means and C, that you need the 1344 00:48:30,750 --> 00:48:32,969 complete code base and you need a working 1345 00:48:32,970 --> 00:48:34,979 build environment with all the libraries 1346 00:48:34,980 --> 00:48:36,059 and stuff like that. 1347 00:48:36,060 --> 00:48:37,319 I wanted something that's more like a 1348 00:48:37,320 --> 00:48:39,749 search engine. So I actually built 1349 00:48:39,750 --> 00:48:42,179 a parser for C that does robust 1350 00:48:42,180 --> 00:48:44,579 parsing. So it assumes that 1351 00:48:44,580 --> 00:48:46,889 the code that you give it is valid 1352 00:48:46,890 --> 00:48:49,559 C++ in some weird dialect, 1353 00:48:49,560 --> 00:48:50,789 some place on Earth. 1354 00:48:50,790 --> 00:48:53,279 Right. And then it it will just 1355 00:48:53,280 --> 00:48:55,229 pass the stuff that it understands and 1356 00:48:55,230 --> 00:48:56,729 stuff that it doesn't understand throws 1357 00:48:56,730 --> 00:48:57,929 away. 1358 00:48:57,930 --> 00:48:59,909 That's part of the project. 1359 00:48:59,910 --> 00:49:01,139 Um, yeah. 1360 00:49:01,140 --> 00:49:03,479 The problem that you have with the clang, 1361 00:49:03,480 --> 00:49:05,010 I mean you could definitely write 1362 00:49:06,270 --> 00:49:08,339 an interface for this, but the 1363 00:49:08,340 --> 00:49:10,709 problem is typically as soon as it 1364 00:49:10,710 --> 00:49:13,019 sees some code like a definition, 1365 00:49:13,020 --> 00:49:15,389 it doesn't know. So the rest 1366 00:49:15,390 --> 00:49:17,609 of the source file is not 1367 00:49:17,610 --> 00:49:19,289 parsed and we don't want that. 1368 00:49:20,670 --> 00:49:22,349 Okay. And a question from number three, 1369 00:49:22,350 --> 00:49:23,489 please. 1370 00:49:23,490 --> 00:49:24,919 Hello. 1371 00:49:24,920 --> 00:49:27,059 Are there any errors that one 1372 00:49:27,060 --> 00:49:29,159 would intuitively think 1373 00:49:29,160 --> 00:49:31,049 one should be able to find using this 1374 00:49:31,050 --> 00:49:33,479 method, but which are not possible 1375 00:49:33,480 --> 00:49:36,149 or, you know, too hard, too expensive? 1376 00:49:36,150 --> 00:49:37,979 Do you have any interesting examples of 1377 00:49:37,980 --> 00:49:38,980 that? 1378 00:49:39,390 --> 00:49:40,390 Well, in a way, 1379 00:49:41,620 --> 00:49:42,869 you take a look at this. 1380 00:49:44,710 --> 00:49:46,269 This is actually pretty intuitive, and I 1381 00:49:46,270 --> 00:49:47,559 would have thought that it had been found 1382 00:49:47,560 --> 00:49:49,059 a long time ago. 1383 00:49:49,060 --> 00:49:51,129 I think that what's 1384 00:49:51,130 --> 00:49:53,739 problematic here is maybe that 1385 00:49:53,740 --> 00:49:55,269 you can only trigger the bug if you 1386 00:49:55,270 --> 00:49:56,199 insert a seven. 1387 00:49:56,200 --> 00:49:58,359 So if you take a further 1388 00:49:58,360 --> 00:49:59,769 to find these things and I think most 1389 00:49:59,770 --> 00:50:02,289 people face this stuff, then 1390 00:50:02,290 --> 00:50:03,399 you're probably not going to hit the 1391 00:50:03,400 --> 00:50:04,419 seven. Right. 1392 00:50:04,420 --> 00:50:06,639 But in a way, I think this 1393 00:50:06,640 --> 00:50:08,799 this is this looks so intuitive. 1394 00:50:08,800 --> 00:50:11,079 But we still see that using 1395 00:50:11,080 --> 00:50:13,359 these simple methods, we can find bugs. 1396 00:50:13,360 --> 00:50:15,639 So, yeah, I don't know if that answers 1397 00:50:15,640 --> 00:50:16,629 your question. 1398 00:50:16,630 --> 00:50:18,099 Probably not. 1399 00:50:18,100 --> 00:50:20,179 Yeah. I think you 1400 00:50:20,180 --> 00:50:23,139 as you answered the other way around. 1401 00:50:23,140 --> 00:50:24,280 OK, so 1402 00:50:25,380 --> 00:50:26,949 I think that 1403 00:50:28,720 --> 00:50:30,399 the system is good and I was asking for 1404 00:50:30,400 --> 00:50:32,199 things that the system is surprisingly 1405 00:50:32,200 --> 00:50:32,509 bad. 1406 00:50:32,510 --> 00:50:34,809 It are surprisingly 1407 00:50:34,810 --> 00:50:35,810 bad at. 1408 00:50:38,420 --> 00:50:39,420 Um, 1409 00:50:43,640 --> 00:50:45,889 well, I mean, they are very 1410 00:50:45,890 --> 00:50:47,869 obvious limitations, right? 1411 00:50:47,870 --> 00:50:49,939 You know, we're not 1412 00:50:49,940 --> 00:50:52,999 really evaluating the expressions at all. 1413 00:50:53,000 --> 00:50:54,589 As I said in the beginning, it's more 1414 00:50:54,590 --> 00:50:56,839 like you try 1415 00:50:56,840 --> 00:50:58,909 to model what people do, who look at code 1416 00:50:58,910 --> 00:51:01,579 manually, look for additions, 1417 00:51:01,580 --> 00:51:03,889 inside allocations or stuff like that. 1418 00:51:03,890 --> 00:51:05,959 But you can't you don't really evaluate 1419 00:51:05,960 --> 00:51:07,010 the expression to see. 1420 00:51:08,300 --> 00:51:09,889 Yeah, that's still very abstract. 1421 00:51:09,890 --> 00:51:11,899 I can't name anything concrete, but I 1422 00:51:11,900 --> 00:51:13,639 mean, that's that's the biggest problem, 1423 00:51:13,640 --> 00:51:15,829 especially if you compare it to like 1424 00:51:15,830 --> 00:51:18,319 exact methods such as symbolic execution 1425 00:51:18,320 --> 00:51:20,869 or so where you actually calculate 1426 00:51:20,870 --> 00:51:23,089 what the real values could be, which 1427 00:51:23,090 --> 00:51:24,090 we don't do here. 1428 00:51:24,980 --> 00:51:25,980 Thank you. 1429 00:51:26,780 --> 00:51:28,429 Okay. Next question from microphone 1430 00:51:28,430 --> 00:51:29,540 number two, please. 1431 00:51:33,290 --> 00:51:34,879 Oh, that's OK. 1432 00:51:34,880 --> 00:51:37,189 I can just put my head up 1433 00:51:37,190 --> 00:51:40,139 thinking I'm wondering, 1434 00:51:40,140 --> 00:51:42,229 we have now seen that 1435 00:51:42,230 --> 00:51:44,329 you evaluate code being 1436 00:51:44,330 --> 00:51:45,889 C or C++. 1437 00:51:45,890 --> 00:51:48,109 Well, or you said that a C then 1438 00:51:48,110 --> 00:51:49,110 C++. 1439 00:51:50,150 --> 00:51:52,039 Is it that that you kind of use the 1440 00:51:52,040 --> 00:51:54,199 grammar to pause when you said you wrote 1441 00:51:54,200 --> 00:51:56,509 your Pausa? Because I'm mainly 1442 00:51:56,510 --> 00:51:58,819 working with Java and I'm interested 1443 00:51:58,820 --> 00:52:01,009 in searching for 1444 00:52:01,010 --> 00:52:03,499 bad coding and possible 1445 00:52:03,500 --> 00:52:04,500 bugs in Java. 1446 00:52:05,630 --> 00:52:08,449 Yes. So we wrote a grammar 1447 00:52:08,450 --> 00:52:10,549 for the ontology for 1448 00:52:10,550 --> 00:52:12,859 parser generator, which is a 1449 00:52:12,860 --> 00:52:14,089 really awesome tool. 1450 00:52:14,090 --> 00:52:16,159 So it's it's if you compare it to 1451 00:52:16,160 --> 00:52:18,589 stuff like Flexin Python, then, 1452 00:52:18,590 --> 00:52:20,899 you know, it's I mean with 1453 00:52:20,900 --> 00:52:22,729 X and bison, you spend lots and lots of 1454 00:52:22,730 --> 00:52:24,829 time making flexin bison. 1455 00:52:24,830 --> 00:52:27,019 Understand what you want to pass with 1456 00:52:27,020 --> 00:52:29,059 onto Lauritz. It's really simple. 1457 00:52:29,060 --> 00:52:31,699 And yet we've also had somebody 1458 00:52:31,700 --> 00:52:34,219 take a look at adapting this to Java. 1459 00:52:35,510 --> 00:52:37,129 There are people actually using this for 1460 00:52:37,130 --> 00:52:38,299 Java right now. 1461 00:52:38,300 --> 00:52:39,349 But clearly, 1462 00:52:40,640 --> 00:52:42,019 um. 1463 00:52:42,020 --> 00:52:44,119 Well, yeah, you're 1464 00:52:44,120 --> 00:52:46,789 feeding Java to see powers. 1465 00:52:46,790 --> 00:52:49,009 So, yeah, most 1466 00:52:49,010 --> 00:52:50,749 dataflow inside the function, for 1467 00:52:50,750 --> 00:52:52,549 example, that works very well. 1468 00:52:52,550 --> 00:52:54,689 But like, if you want to take a look at 1469 00:52:54,690 --> 00:52:56,719 class Hirak class hierarchies and stuff 1470 00:52:56,720 --> 00:52:59,359 like that, probably be a good idea to 1471 00:52:59,360 --> 00:53:01,189 just adapt the positive to Java. 1472 00:53:01,190 --> 00:53:02,190 Yeah. 1473 00:53:04,100 --> 00:53:05,750 OK, I close my front, I'm afraid. 1474 00:53:06,950 --> 00:53:08,539 Can you say something about the 1475 00:53:08,540 --> 00:53:10,669 performance? Do you need to tweak the 1476 00:53:10,670 --> 00:53:13,069 query to make them faster or is it OK? 1477 00:53:13,070 --> 00:53:15,169 Yeah. So it depends a lot on the 1478 00:53:15,170 --> 00:53:17,299 amount of stock notes that you choose. 1479 00:53:17,300 --> 00:53:18,300 Right. Because you're 1480 00:53:19,400 --> 00:53:21,139 if you have a lot of start notes where 1481 00:53:21,140 --> 00:53:23,419 you need to start to do and 1482 00:53:23,420 --> 00:53:25,489 this is also in the documentation is 1483 00:53:25,490 --> 00:53:28,159 you need to start to to to track this 1484 00:53:28,160 --> 00:53:30,949 because it will it will run all of these 1485 00:53:30,950 --> 00:53:33,139 it won't run these 1486 00:53:33,140 --> 00:53:34,879 queries. Well, it will run them all in 1487 00:53:34,880 --> 00:53:37,009 parallel because at some point it could 1488 00:53:37,010 --> 00:53:38,629 be that you want to merge the results 1489 00:53:38,630 --> 00:53:40,759 again. So you need to split this. 1490 00:53:40,760 --> 00:53:43,069 The queries that we had here, the 1491 00:53:43,070 --> 00:53:46,099 those we presented, the slowest 1492 00:53:46,100 --> 00:53:48,229 were about took about 1493 00:53:48,230 --> 00:53:50,389 five minutes to execute on on 1494 00:53:50,390 --> 00:53:51,390 my laptop. 1495 00:53:52,460 --> 00:53:54,679 So I guess that's acceptable in 1496 00:53:54,680 --> 00:53:56,779 if you do an audit, the 1497 00:53:56,780 --> 00:53:58,879 the part that takes most of 1498 00:53:58,880 --> 00:54:01,219 the time that was this unsanitized 1499 00:54:01,220 --> 00:54:03,169 step where you actually start traversing 1500 00:54:03,170 --> 00:54:04,909 the control flow graph again and again 1501 00:54:04,910 --> 00:54:07,789 and again to see if if there's any path, 1502 00:54:07,790 --> 00:54:09,019 any reasonable path that you're 1503 00:54:09,020 --> 00:54:11,209 interested in for 1504 00:54:11,210 --> 00:54:14,209 simpler stuff like the, uh, 1505 00:54:14,210 --> 00:54:15,300 like the 1506 00:54:17,510 --> 00:54:20,089 like this back here, for example, 1507 00:54:20,090 --> 00:54:21,469 or also the updated bug. 1508 00:54:22,610 --> 00:54:24,049 This is pretty instantaneous. 1509 00:54:24,050 --> 00:54:26,569 So it takes, um, like, well, 1510 00:54:26,570 --> 00:54:27,829 maybe not instantaneous. 1511 00:54:27,830 --> 00:54:30,139 It takes 10 to 15 seconds 1512 00:54:30,140 --> 00:54:31,349 or so. Yeah. 1513 00:54:32,730 --> 00:54:33,730 Oh, 1514 00:54:37,020 --> 00:54:39,329 OK, I see we don't have any more 1515 00:54:39,330 --> 00:54:40,739 room, is there maybe a question from the 1516 00:54:40,740 --> 00:54:41,759 Internet? 1517 00:54:41,760 --> 00:54:43,079 OK, no. 1518 00:54:43,080 --> 00:54:45,269 So please, thank you again for 1519 00:54:45,270 --> 00:54:46,349 your very interesting talk.