1 00:00:00,000 --> 00:00:13,119 *music* 2 00:00:13,119 --> 00:00:17,190 Herald: Good morning and welcome back to stage one. It's kind of going to be the 3 00:00:17,190 --> 00:00:21,490 second talk about physics on this day already and it's about big data and 4 00:00:21,490 --> 00:00:27,150 science and big data became something like Uber in science. It's everywhere, every 5 00:00:27,150 --> 00:00:33,370 discipline has it. Axel Naumann's working for CERN, the accelerator in Switzerland 6 00:00:33,370 --> 00:00:39,160 and he talks about how physics and computing bridge in this area and he works 7 00:00:39,160 --> 00:00:43,183 a lot with ROOT, a program that helps transform data into knowledge. A warm 8 00:00:43,183 --> 00:00:44,650 welcome. 9 00:00:44,650 --> 00:00:45,262 Axel Naumann: Thank you. 10 00:00:45,262 --> 00:00:51,260 *applause* 11 00:00:51,260 --> 00:00:57,850 AN: Thanks a lot. So, well you know, when, when I was discussing this abstract with 12 00:00:57,850 --> 00:01:00,950 the science track people they told me: "Well, you know about three hundred people 13 00:01:00,950 --> 00:01:06,000 might be in the audience." But well, hey, you are huge, that's much more than three 14 00:01:06,000 --> 00:01:10,940 hundred people. So thank you so much for inviting me over, it's a real honor. And of 15 00:01:10,940 --> 00:01:15,310 course originally when talking to 300 people who are all science interested I 16 00:01:15,310 --> 00:01:20,590 thought you know I pick something fairly narrow focuswise but then I learned I'm 17 00:01:20,590 --> 00:01:24,690 going to be in Saal one and that's different, so I decided to make the scope 18 00:01:24,690 --> 00:01:30,670 a little bit wider and that's what I ended up with. I'll talk a little bit about 19 00:01:30,670 --> 00:01:37,540 CERN in society as well if you so choose, you'll see what that means in a minute. 
So 20 00:01:37,540 --> 00:01:41,680 the things I'll cover here are obviously CERN, just a little bit of an introduction 21 00:01:41,680 --> 00:01:46,100 how we do physics, how we do computing, what data means to us and I can tell you 22 00:01:46,100 --> 00:01:51,810 it means everything, you heard about that already, right? How we do data analysis in 23 00:01:51,810 --> 00:01:56,159 high energy physics and just because we've been doing it for a while and 24 00:01:56,159 --> 00:02:00,530 because I've been doing it for more than ten years, I'm one of the guys who's 25 00:02:00,530 --> 00:02:07,250 providing the software to do data analysis in high energy physics, so, you 26 00:02:07,250 --> 00:02:11,360 know, because we know what we are doing and we have some experience, I thought 27 00:02:11,360 --> 00:02:18,110 maybe you might be interested in hearing what my forecast is for data analysis in 28 00:02:18,110 --> 00:02:25,430 general, in the future. So let's start with CERN. And so if you wonder what CERN 29 00:02:25,430 --> 00:02:31,510 is, you've all heard about CERN, about the fantastic fonts we love to use, then 30 00:02:31,510 --> 00:02:36,960 you've probably also heard that we are doing science. We were founded right after 31 00:02:36,960 --> 00:02:41,450 the Second World War or soon after the Second World War, basically as a way to 32 00:02:41,450 --> 00:02:47,458 entertain those freaky scientists. You know that was the idea: peace Europe-wide. 33 00:02:47,458 --> 00:02:52,349 And damn, that's working out really well and so well there's not just Europe 34 00:02:52,349 --> 00:02:57,530 anymore these days. We are located near Geneva, we are doing only fundamental 35 00:02:57,530 --> 00:03:02,269 research, so we don't do any weapons, nuclear stuff you 36 00:03:02,269 --> 00:03:10,230 know, these kind of things. The WWW was invented at CERN but that was just a, you 37 00:03:10,230 --> 00:03:14,586 know, side effect, happens sometimes, that we invent things. 
But usually we just do 38 00:03:14,586 --> 00:03:22,500 science. So what we do is, we take money, lots of it, and brains who like to discuss 39 00:03:22,500 --> 00:03:27,210 and think and come up with ideas and from that we generate knowledge. It's really 40 00:03:27,210 --> 00:03:33,000 all about curiosity. The things we try to answer are: what is mass? Which is a funny 41 00:03:33,000 --> 00:03:37,371 question, right? Like we all know what mass is but actually we don't. We know what 42 00:03:37,371 --> 00:03:42,360 mass is in the universe. We understand that masses attract one another: gravity. 43 00:03:42,360 --> 00:03:48,730 Which is beautifully correct. And in the small scale, our particles, we know that 44 00:03:48,730 --> 00:03:52,940 mass is energy and we can convert them. But we don't understand how these two 45 00:03:52,940 --> 00:03:58,319 things go together. Like there is no bridge, they contradict one another. So we 46 00:03:58,319 --> 00:04:04,930 are trying to understand what that bridge might be. Part of that mass thing is of 47 00:04:04,930 --> 00:04:08,650 course also what's out there in the universe? That's a big question. We only 48 00:04:08,650 --> 00:04:14,230 understand a few percent of that. 90 and some percent are completely unknown to 49 00:04:14,230 --> 00:04:20,349 us, and that's scary right? I mean we know gravity really well, we can deal with 50 00:04:20,349 --> 00:04:27,560 freaky things like black holes and yet we don't understand what's out there. Now to 51 00:04:27,560 --> 00:04:31,850 do all these things we are probing nature at the smallest scale as we call it, so 52 00:04:31,850 --> 00:04:36,190 that's particles, we are dealing with things like the Higgs particle and 53 00:04:36,190 --> 00:04:43,900 supersymmetry. Here's a little bit of a fact sheet. We have about 12,000 54 00:04:43,900 --> 00:04:47,500 physicists who are working with CERN. 
We are basically the workbench that you saw 55 00:04:47,500 --> 00:04:54,661 in Andre's talk before. We are the table that physicists use, okay? And, so they 56 00:04:54,661 --> 00:04:59,050 come to CERN once in a while, about 10,000 physicists a year, or they work 57 00:04:59,050 --> 00:05:02,810 remotely most of the time from about 120 nations. So you're seeing it's not 58 00:05:02,810 --> 00:05:10,650 European anymore, this is a global thing. CERN in itself has about 2,500 employees, 59 00:05:10,650 --> 00:05:15,490 you know those scrubbing the table, setting things up and so on. And our 60 00:05:15,490 --> 00:05:21,190 table is right here. In the far end we have the Alps, it's in Switzerland 61 00:05:21,190 --> 00:05:25,990 as I said, so the Alps are always close, with Mont Blanc, we have the 62 00:05:25,990 --> 00:05:31,639 Lake Geneva, we have the Jura, the French Mountains on the lower end here, it's just 63 00:05:31,639 --> 00:05:37,410 beautiful. It's really nice, but we needed to stick a 30-kilometer ring in 64 00:05:37,410 --> 00:05:43,861 there somewhere and people would have hated us had we put it like this. But 65 00:05:43,861 --> 00:05:49,671 luckily people were smart back then in the 70s, and built a tunnel, much better. So 66 00:05:49,671 --> 00:05:55,229 now we have this huge tunnel, and we send particles through in both directions near 67 00:05:55,229 --> 00:06:00,351 the speed of light and the tunnel is filled with magnets simply because if you 68 00:06:00,351 --> 00:06:08,110 don't use a magnet the particles will fly straight but we need them to turn around. 69 00:06:08,110 --> 00:06:13,560 Here you see what it's looking like, you also see these big halls there that have 70 00:06:13,560 --> 00:06:21,880 access shafts from the top and that's where the experiments are. That's sort of 71 00:06:21,880 --> 00:06:29,210 a sketch of one of the experiments. 
So the LHC is one of the, no, is the biggest 72 00:06:29,210 --> 00:06:35,889 particle accelerator at the moment, it's a ring with 27 kilometers circumference, 100 73 00:06:35,889 --> 00:06:40,300 meters below Switzerland and France, it has four big experiments and several 74 00:06:40,300 --> 00:06:45,270 small ones and we are expected to run until 2030. So you see that all of that 75 00:06:45,270 --> 00:06:50,150 is large-scale simply because we're trying to make good use of the money we have. 76 00:06:50,150 --> 00:06:56,020 Here, you see one of these caverns that are used by the experiments while it was 77 00:06:56,020 --> 00:07:01,490 empty. The experiment was then lowered through this hole in the roof, piece by 78 00:07:01,490 --> 00:07:07,190 piece, and these things are humongous. To give you an impression of how big it is, I 79 00:07:07,190 --> 00:07:12,520 put Waldo in there, so your job for the next three slides is to find Waldo. You 80 00:07:12,520 --> 00:07:15,800 know, that gives you the scale. He's friendlily waving at you, so it should be 81 00:07:15,800 --> 00:07:21,990 easy to find him. So then we put a detector in there. Here it's pulled apart 82 00:07:21,990 --> 00:07:26,160 a little bit, so it looks nicer, you can actually see something. You can for 83 00:07:26,160 --> 00:07:31,039 example see the beam pipe, so that's where the particles are flying through, and then 84 00:07:31,039 --> 00:07:34,880 they're coming from both directions and colliding in the center of the detector 85 00:07:34,880 --> 00:07:38,490 and then things happen and we try to understand what 86 00:07:38,490 --> 00:07:44,790 is happening. That's yet another view, frontal view on one of the detectors and 87 00:07:44,790 --> 00:07:51,060 now you have to imagine that, you know, you can't just open up Amazon and order an 88 00:07:51,060 --> 00:07:56,210 LHC experiment, right, that's not how it works. 
We do this stuff ourselves, like 89 00:07:56,210 --> 00:08:02,669 PhD students, postdocs, engineers. You know, that's all done by hand, just like 90 00:08:02,669 --> 00:08:06,940 the microscope you saw before. Of course you order the parts, but you know the 91 00:08:06,940 --> 00:08:11,060 design, the whole conception and actually screwing these things together, making 92 00:08:11,060 --> 00:08:16,970 sure that all fits, is all done by hand. And I find that just beautiful, I mean 93 00:08:16,970 --> 00:08:21,760 that's close to a miracle, right? That nations, like people no matter what 94 00:08:21,760 --> 00:08:26,819 nation, people across the globe work together to build such a huge thing and 95 00:08:26,819 --> 00:08:39,490 then you turn it on and it works. More or less, but you get it to work. That's not 96 00:08:39,490 --> 00:08:44,310 my applause, that's your applause, because you make this possible. Really, but it's, 97 00:08:44,310 --> 00:08:49,690 it's huge this is for me one of the things I love most about CERN: That is this 98 00:08:49,690 --> 00:08:55,279 international thing that just works smoothly. Now the detectors are like a 99 00:08:55,279 --> 00:09:01,310 massive camera. We have lots of pixels and we take many, many pictures a second. We 100 00:09:01,310 --> 00:09:06,680 do this to identify particles and then sort of estimate what has happened during 101 00:09:06,680 --> 00:09:15,470 the collision. Now, life at CERN is of course an important ingredient for 102 00:09:15,470 --> 00:09:19,529 scientists as well, and if you live at CERN then actually it's just work at CERN 103 00:09:19,529 --> 00:09:23,980 and that's what it's about. But it's not that bad, so we hang out together in our 104 00:09:23,980 --> 00:09:30,040 control rooms, make sure that the experiments work correctly. We also, you 105 00:09:30,040 --> 00:09:33,720 know, study the forces. 
*laughter* 106 00:09:33,720 --> 00:09:38,740 We have scientific discourse, in the sun, view on the Mont Blanc, with a good 107 00:09:38,740 --> 00:09:45,430 coffee. We have lectures and we are lectured and of course, as you, we have 108 00:09:45,430 --> 00:09:54,570 more laptops than people. And, then we do stuff and so this presentation is going to 109 00:09:54,570 --> 00:09:58,580 introduce you to some of the things we are doing, and more on the computing and the 110 00:09:58,580 --> 00:10:04,100 society side as I said. But because I have so much to talk about I decided that 111 00:10:04,100 --> 00:10:08,810 you just build your own talk, you tell me what you want to hear. So let's do this, 112 00:10:08,810 --> 00:10:14,410 you can choose between A, physics, and B, model simulation and data. You remember 113 00:10:14,410 --> 00:10:18,620 these books like from the old days when we were all young? It's that kind of thing, 114 00:10:18,620 --> 00:10:24,450 ok? You decide/design your own talk here. So, by applause, do you want to hear about 115 00:10:24,450 --> 00:10:27,720 physics? *applause* 116 00:10:27,720 --> 00:10:35,730 Okay. Or the model simulation data part? *louder applause* 117 00:10:35,730 --> 00:10:45,101 Okay, there we go. So, this is what we skip. Model simulation data it is. You're 118 00:10:45,101 --> 00:10:49,700 a strange crowd, first time I meet people who don't want to hear about physics... no 119 00:10:49,700 --> 00:10:51,450 I'm kidding. *laughter* 120 00:10:51,450 --> 00:10:53,800 Audience: *inaudible interjection* *laughter* 121 00:10:53,800 --> 00:11:00,079 So model simulation data it is. So our theory is actually incredibly precise. 122 00:11:00,079 --> 00:11:04,450 It's so precise that our basic job is really really boring, because we already 123 00:11:04,450 --> 00:11:10,514 understand everything. Whenever there is a collision, we know what's going to happen. 124 00:11:10,514 --> 00:11:15,430 Except for these very rare things. 
So we are trying to find these very rare things 125 00:11:15,430 --> 00:11:19,580 out of this haystack of fairly boring things that we really understand well. And 126 00:11:19,580 --> 00:11:25,589 the weird things are, for example, monopoles, supersymmetry, or black holes. 127 00:11:25,589 --> 00:11:32,060 Now the theorists' job is to tell us what we should be seeing in the detector, given 128 00:11:32,060 --> 00:11:42,347 some fancy physics. Then we use simulation to see how our detector would respond to 129 00:11:42,347 --> 00:11:53,476 that. Now, of course the question is: We are just counting, basically, when we do 130 00:11:53,476 --> 00:11:58,102 experiments and the question is: How often do we need to see something to say: "Well, 131 00:11:58,102 --> 00:12:03,310 that's not just the ordinary. That is something new, that's something that could 132 00:12:03,310 --> 00:12:09,870 be explained by a weird theory." We use the detector simulation as I said to basically 133 00:12:09,870 --> 00:12:15,029 predict how much we expect to see things. We use reconstruction software which 134 00:12:15,029 --> 00:12:20,680 tells us what has happened, or might have happened in the detector to count how 135 00:12:20,680 --> 00:12:25,400 often we saw something. And then we use statistics to compare these two and to say 136 00:12:25,400 --> 00:12:31,610 whether something is expected or not. Now, that's fairly abstract but it's fairly 137 00:12:31,610 --> 00:12:36,905 common, a fairly common approach. For example, if you look at climate versus 138 00:12:36,905 --> 00:12:40,331 weather, right, I mean we always have temperature fluctuations because of 139 00:12:40,331 --> 00:12:46,480 weather, and the question is: Is that rise in temperature because of a weather effect 140 00:12:46,480 --> 00:12:50,375 or because of a climate effect? Is that large-scale or just a short-term 141 00:12:50,375 --> 00:12:55,610 fluctuation? 
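The counting comparison described above (expected counts from simulation, observed counts from reconstruction, statistics to compare the two) can be sketched in a few lines. This is only a minimal illustration, assuming a plain Poisson counting experiment and the rough approximation that the excess in standard deviations is (observed - expected) / sqrt(expected). The function name and all numbers are made up for this sketch and are not taken from any CERN analysis.

```python
import math

def rough_significance(observed: int, expected: float) -> float:
    """Approximate number of standard deviations by which an observed
    count exceeds the expected count, assuming Poisson fluctuations
    (standard deviation = sqrt(expected))."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / math.sqrt(expected)

# Hypothetical numbers, for illustration only.
small_excess = rough_significance(observed=110, expected=100.0)
large_excess = rough_significance(observed=150, expected=100.0)
# Same 10 percent excess as the first case, but with 100 times more data.
more_data = rough_significance(observed=11000, expected=10000.0)

print(f"110 seen, 100 expected:     {small_excess:.1f} sigma")
print(f"150 seen, 100 expected:     {large_excess:.1f} sigma")
print(f"11000 seen, 10000 expected: {more_data:.1f} sigma")
```

In this toy setup the same relative excess becomes far more convincing once more collisions are counted, and a larger deviation stands out even with little data.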
So there, we have a very similar problem and here what you do is 142 00:12:55,610 --> 00:13:00,880 you measure temperatures, and you want to detect abnormal variations, and you can 143 00:13:00,880 --> 00:13:06,420 improve that by measuring longer, like, for 300 years instead of 20 years. That 144 00:13:06,420 --> 00:13:11,930 gives you a better prediction what you would expect in the future. Also, larger 145 00:13:11,930 --> 00:13:14,170 deviations help, right? If you look for something that 146 00:13:14,170 --> 00:13:19,700 is just 0.1 degree, then you might not be able to find it. If there is a deviation 147 00:13:19,700 --> 00:13:25,230 of 5 degrees, you will definitely find it. And for us it's very similar. So here we 148 00:13:25,230 --> 00:13:31,610 have a plot, one of the first Higgs discovery plots, and you can see that we 149 00:13:31,610 --> 00:13:38,800 have many ingredients there. So, the black dots are what we measure and they have 150 00:13:38,800 --> 00:13:43,829 certain uncertainty, because when we measure, we count and we might have, you 151 00:13:43,829 --> 00:13:48,977 know, not seen something, or we might have seen more than we should have seen, so 152 00:13:48,977 --> 00:13:54,970 there's always an uncertainty. And then we also have theory, which tells us you 153 00:13:54,970 --> 00:14:00,079 should have seen so many and so for the red part that's something that we know 154 00:14:00,079 --> 00:14:04,889 exists, it's nothing spectacular. It's simply what theory is telling us what we 155 00:14:04,889 --> 00:14:10,660 should be seeing. And you can see the data follows the red part fairly well. 
But then 156 00:14:10,660 --> 00:14:15,980 there is this other bump in our dots on the right-hand side or in the center and 157 00:14:15,980 --> 00:14:21,230 that does not make sense, unless you take the Higgs into account, right, which is 158 00:14:21,230 --> 00:14:26,889 the light blue part and so here you can see how this interplay between different 159 00:14:26,889 --> 00:14:38,280 sources of physics and statistics works for us. Now just as for the climate, more 160 00:14:38,280 --> 00:14:43,690 data helps. And there are two versions of more data: Either by having more 161 00:14:43,690 --> 00:14:48,079 collisions, which is why we are running 24/7, or more data by combining different 162 00:14:48,079 --> 00:14:52,060 analyses which is what's happening here. So here you see all these different 163 00:14:52,060 --> 00:14:56,990 analyses. If you combine them, of course you get a much stronger prediction of, in 164 00:14:56,990 --> 00:15:03,300 this case, the Higgs mass, than if you just take any single one of them. You see 165 00:15:03,300 --> 00:15:08,540 how similar what we are doing is to, you know, any of the big data analyses out 166 00:15:08,540 --> 00:15:16,414 there. Okay, so that was that part. Now comes the obligatory part again, 167 00:15:16,414 --> 00:15:22,930 computing. When we were designing the LHC, not me, when people were designing the 168 00:15:22,930 --> 00:15:31,120 LHC, they needed to project computing power from 1990 to 2000, 2010 and so on. 169 00:15:31,120 --> 00:15:34,140 And then they said: "Well, we need massive amount of computers" and for you 170 00:15:34,140 --> 00:15:38,420 there's now "Ughhh - everybody has it, we have it as well, we have our racks of 171 00:15:38,420 --> 00:15:44,240 computers". 
This is something that the big companies usually don't show: You know 172 00:15:44,240 --> 00:15:48,509 there is actually a ramp where the trucks arrive and they offload the things and 173 00:15:48,509 --> 00:15:53,820 then someone needs to screw them together and then it looks shiny. This is how we are 174 00:15:53,820 --> 00:16:00,870 spending our CPU time: We have about 60,000 cores that are spinning all the 175 00:16:00,870 --> 00:16:06,680 time for us, and they are distributed around the world. You can see that CERN, 176 00:16:06,680 --> 00:16:14,529 for example, is the red part there near the bottom. Yeah, so we make good use of 177 00:16:14,529 --> 00:16:20,829 that. We also monitor the efficiency, and because 100 percent efficient is for 178 00:16:20,829 --> 00:16:29,300 beginners we are actually about 700 percent efficient. Don't ask why. They 179 00:16:29,300 --> 00:16:33,920 decided if you are multi-threading, then we, you know, we multiply your efficiency 180 00:16:33,920 --> 00:16:39,950 by the number of threads you have. Makes no sense to me. We also have storage, 181 00:16:39,950 --> 00:16:44,930 currently we use about 0.7 exabytes. We also have available about 1.7 182 00:16:44,930 --> 00:16:49,130 exabytes, so that's good, we make use of the storage we have. Where it's, you know, 183 00:16:49,130 --> 00:16:55,529 tera- peta- exa-, so it's a lot, and here on the right hand side you 184 00:16:55,529 --> 00:16:59,610 see, for example, the tape usage on the bottom and you see this dip that was 185 00:16:59,610 --> 00:17:04,270 before we were starting the accelerator again, we needed to make some space so we 186 00:17:04,270 --> 00:17:09,089 monitor our hard disk usage all the time. Hey, here comes the next decision point: 187 00:17:09,089 --> 00:17:13,630 So, do you want to hear about, 1, distributed computing or 2, measure 188 00:17:13,630 --> 00:17:17,839 effects of bugs. 
So, 1, distributed computing 189 00:17:17,839 --> 00:17:26,470 *applause* and 2, measure the effects of bugs 190 00:17:26,470 --> 00:17:35,560 *similar amount of applause* Okay, so that's my call, and I would say 191 00:17:35,560 --> 00:17:41,455 we do... Measure the effects of bugs, because it's shorter. 192 00:17:41,455 --> 00:17:47,130 *laughter* So this is one of the views you can, you 193 00:17:47,130 --> 00:17:50,740 know, electronic views you can get from a detector and you see how we trace the 194 00:17:50,740 --> 00:17:55,380 particles that fly through the detector. Now, that software right, that's the 195 00:17:55,380 --> 00:17:59,927 result of software, and you might not believe it, if you have bugs in there, in 196 00:17:59,927 --> 00:18:00,808 that software. 197 00:18:02,849 --> 00:18:07,260 And you know, these bugs are sometimes wrong coordinate transformations, so 198 00:18:07,260 --> 00:18:12,590 things don't go this way but that way, it's kind of weird if you look at it, and 199 00:18:12,590 --> 00:18:17,470 the result is that our particles don't go through the path that they should have 200 00:18:17,470 --> 00:18:25,190 been going, but we are attributing them a different path. Now, the nice thing 201 00:18:25,190 --> 00:18:30,960 is that we are doing this a million times, right? So all of that is smeared. We are 202 00:18:30,960 --> 00:18:35,730 not systematically doing this wrong, it's just, we are always doing it a little bit 203 00:18:35,730 --> 00:18:41,669 wrong. And so the net result is that if we measure our particles, we will not measure 204 00:18:41,669 --> 00:18:46,861 the right thing but always a little bit wobbly left, wobbly right, you know? Things 205 00:18:46,861 --> 00:18:53,809 are not as precise. That's simply an uncertainty. 
So for us just like counting 206 00:18:53,809 --> 00:18:59,059 has an uncertainty and predictions have an uncertainty, software bugs introduce 207 00:18:59,059 --> 00:19:05,559 another source of uncertainties. And here you can see how we are tracking 208 00:19:05,559 --> 00:19:09,370 uncertainties for all of our analyses. We are trying to understand the 209 00:19:09,370 --> 00:19:16,220 different sources of uncertainties. And again, bugs are only one of the sources 210 00:19:16,220 --> 00:19:22,880 here, so if we find the bug then we reduce our uncertainty and we can find new 211 00:19:22,880 --> 00:19:27,760 physics earlier, instead of having to wait and collect more data. So for us 212 00:19:27,760 --> 00:19:32,210 finding bugs is really key, we really love finding bugs because it brings 213 00:19:32,210 --> 00:19:36,710 physics closer. I thought that was interesting. It's kind of rare that you're 214 00:19:36,710 --> 00:19:42,140 in an environment where you're able to measure the effect of bugs. Okay, so now 215 00:19:42,140 --> 00:19:47,870 we are talking, we'll be talking about data. I talked, told you that we are 216 00:19:47,870 --> 00:19:52,690 trying to find particle traces in our data and the way we do this is by using 217 00:19:52,690 --> 00:19:56,700 reconstruction programs and there are multiple gigabytes of binaries in shared 218 00:19:56,700 --> 00:20:01,799 libraries and stuff. They're huge, they're experiment specific and they are curated 219 00:20:01,799 --> 00:20:06,270 by the experiments, open-source for some of them, and we want them to be correct 220 00:20:06,270 --> 00:20:14,140 and efficient. The data format we use is not comma separated values, it's binary 221 00:20:14,140 --> 00:20:21,080 and for some strange reason it's our own custom binary format. The reason is that 222 00:20:21,080 --> 00:20:26,990 it's really targeted at the kind of data we are having. 
We have collisions 223 00:20:26,990 --> 00:20:32,230 that are independent, so we only need one in memory at any time and we have nested 224 00:20:32,230 --> 00:20:38,590 collections which makes the regular table layout a non-starter. We actually generate 225 00:20:38,590 --> 00:20:44,430 them from C++ objects so from classes, class definitions, C++ class definitions 226 00:20:44,430 --> 00:20:51,320 and we can read them back into C++ but also into JavaScript or Scala. Databases 227 00:20:51,320 --> 00:20:56,840 just didn't do it for us. They have the wrong model of data access, they don't 228 00:20:56,840 --> 00:21:02,940 scale, it's just not the kind of system that works for us. Also using a file 229 00:21:02,940 --> 00:21:09,390 system as a storage back-end might sound really very traditional and boring but it 230 00:21:09,390 --> 00:21:13,890 works amazingly well and seems to be future proof as well, so that's just the 231 00:21:13,890 --> 00:21:20,360 way to go for us. There are many other structured data formats out there, many of 232 00:21:20,360 --> 00:21:26,000 those did not exist when we started ROOT, our own data format. But they also miss 233 00:21:26,000 --> 00:21:30,250 many things. For example, we wanted to make sure that we have schema evolution 234 00:21:30,250 --> 00:21:33,970 support. We can change the class layout and still read back all data. We don't 235 00:21:33,970 --> 00:21:38,750 want to throw away all data just because we're changing the class. Also we do not 236 00:21:38,750 --> 00:21:43,370 trust people. That is a, you know, as a computer scientist or whatever you 237 00:21:43,370 --> 00:21:46,750 probably know what I'm talking about right? If people have to write their own 238 00:21:46,750 --> 00:21:50,630 streaming algorithm, there will be bugs and we will lose data. 239 00:21:50,630 --> 00:21:54,610 We really don't want to do this, so we were trying to automate this, based on the 240 00:21:54,610 --> 00:22:03,070 class definition. 
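The schema evolution idea mentioned above, changing the class layout and still reading back old data, can be illustrated with a toy example. This is a hedged sketch only: it is not ROOT's file format, API, or actual mechanism. The version table, the field names `energy` and `charge`, and the two helper functions are hypothetical and exist purely for this illustration.

```python
import json

# Hypothetical "class layouts": version 1 only had energy,
# version 2 adds a charge member. Values are the member defaults.
SCHEMA_DEFAULTS = {
    1: {"energy": 0.0},
    2: {"energy": 0.0, "charge": 0},
}
CURRENT_VERSION = 2

def write_record(fields: dict, version: int) -> str:
    """Serialize a record together with the layout version that wrote it."""
    return json.dumps({"version": version, "fields": fields})

def read_record(blob: str) -> dict:
    """Read a record written with any known version into the current layout.
    Members missing in old data get the current default value."""
    stored = json.loads(blob)
    if stored["version"] not in SCHEMA_DEFAULTS:
        raise ValueError(f"unknown layout version {stored['version']}")
    record = dict(SCHEMA_DEFAULTS[CURRENT_VERSION])
    for name, value in stored["fields"].items():
        if name in record:  # members dropped from the current layout are ignored
            record[name] = value
    return record

old_blob = write_record({"energy": 12.5}, version=1)
new_blob = write_record({"energy": 7.0, "charge": -1}, version=2)
print(read_record(old_blob))  # old data, read into the new layout
print(read_record(new_blob))
```

The point of the sketch is that the reader, not the person writing an analysis, deals with layout differences, which matches the stated goal of not having people write their own streaming code.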
So, last decision point for the story. Do you want to hear about 241 00:22:03,070 --> 00:22:10,409 cling, our C++ interpreter or about Open Data and Applied Science? Let's start with 242 00:22:10,409 --> 00:22:14,860 option 1, the C++ interpreter *applause* 243 00:22:14,860 --> 00:22:21,106 Okay and Open Data and Applied Science? 244 00:22:21,106 --> 00:22:29,679 *more applause than before* Yeah. I'm heading there. You miss a fish. 245 00:22:29,679 --> 00:22:35,299 You can look at the slides later. Okay, so there we go. Really? No. The slide number 246 00:22:35,299 --> 00:22:41,140 is wrong. Oh a bug! So, Open Data and Applied Science. Okay, you really wanted 247 00:22:41,140 --> 00:22:47,700 to know about our budget, I understand that. So we get from you about 1 billion a 248 00:22:47,700 --> 00:22:50,719 year and the currency doesn't really matter anymore at this, at this point of 249 00:22:50,719 --> 00:22:54,200 time. *laughter* 250 00:22:54,200 --> 00:23:01,230 And that is a lot of money. And you know? We try to do really wonderful things, I 251 00:23:01,230 --> 00:23:04,943 mean we really enjoy our job, we love it. It's fantastic to work in such an 252 00:23:04,943 --> 00:23:09,248 environment. And thank you very much for making that possible. Really, I mean it. 253 00:23:11,110 --> 00:23:16,691 But it also means that you decided as society to enable something like CERN. 254 00:23:17,473 --> 00:23:22,140 Which I think really deserves my applause and yours probably as well. I think it's a 255 00:23:22,140 --> 00:23:24,425 great decision to do something like this. 256 00:23:24,425 --> 00:23:30,211 *applause* 257 00:23:31,325 --> 00:23:35,690 So we realize this, right? We realized that we are basically, that we can do what 258 00:23:35,690 --> 00:23:40,210 we do because of you, and we are trying to react to that by giving back what we do. 259 00:23:40,210 --> 00:23:47,460 Software, research results, hardware and data. 
So the way we share research results 260 00:23:47,460 --> 00:23:52,600 is through open access. We have it, finally. It took us a long time to fight 261 00:23:52,600 --> 00:23:57,570 with publishers and, you know, the establishment, but now we have it. We 262 00:23:57,570 --> 00:23:59,220 also, yes thank you. 263 00:23:59,220 --> 00:24:03,395 *applause* 264 00:24:03,395 --> 00:24:07,520 We also put a lot of effort in communicating our results and what we are 265 00:24:07,520 --> 00:24:12,680 doing. And if you're in the region, it's definitely worth a visit. I mean the URL 266 00:24:12,680 --> 00:24:17,590 is really easy to remember, it's visit.cern, and you know, works. And you 267 00:24:17,590 --> 00:24:22,270 should go there by April, actually, if you can because then you can ask people how to 268 00:24:22,270 --> 00:24:27,580 get underground, because the accelerator is off at the moment. We also do applied 269 00:24:27,580 --> 00:24:32,320 research, for example we have this super cool experiment where we try to study how 270 00:24:32,320 --> 00:24:39,630 clouds form, based on cosmic rays. So the influence of cosmic rays on cloud 271 00:24:39,630 --> 00:24:45,770 formation. Which is a key element in the uncertainty of climate models. We are 272 00:24:45,770 --> 00:24:50,440 trying to, to think about, you know, how to make energy from nuclear waste. So 273 00:24:50,440 --> 00:24:54,830 getting rid of nuclear waste while making energy from it. And we are trying to 274 00:24:54,830 --> 00:25:02,070 repurpose detectors that we have and you know develop. We have something called 275 00:25:02,070 --> 00:25:08,330 open hardware, for example White Rabbit: deterministic ethernet, we have Open Data, 276 00:25:08,330 --> 00:25:12,789 and we have the LHC@home and some other programs, where either you can donate 277 00:25:12,789 --> 00:25:21,250 compute power or your brain and help us get better results. 
We explicitly try to 278 00:25:21,250 --> 00:25:25,747 use open source as much as possible, and also feed back, whenever we see issues. 279 00:25:27,700 --> 00:25:33,620 But we also create open source. For example, we create Geant, which is a 280 00:25:33,620 --> 00:25:37,831 program that allows you to simulate how particles fly through matter, for 281 00:25:37,831 --> 00:25:44,610 example used by NASA. We have Indico, which allows us to schedule meetings, 282 00:25:44,610 --> 00:25:48,940 upload slides, you know, these kind of things. Across the globe, lots of people, 283 00:25:48,940 --> 00:25:52,970 with access protection, all these kind of things. And it's open source. We have 284 00:25:52,970 --> 00:25:58,919 DaviX, did I mention we love HTTP? That's the NeXT machine of Tim Berners-Lee. And 285 00:25:58,919 --> 00:26:03,140 that's his futile effort in trying to prevent the cleaning personnel from 286 00:26:03,140 --> 00:26:07,530 switching it off. They don't speak English, they did not back then at least. 287 00:26:09,337 --> 00:26:15,500 So we use DaviX to transfer files over HTTP, with a high bandwidth. Or we 288 00:26:15,500 --> 00:26:21,241 have CVMFS, which allows us to distribute our binaries across the globe, and not 289 00:26:21,241 --> 00:26:26,570 rely on admins downloading stuff and making sure it actually runs, and these 290 00:26:26,570 --> 00:26:31,581 kind of things. That is a lifesaver, it's really fantastic, it's a great tool. But 291 00:26:31,581 --> 00:26:37,730 nobody knows it. And we have ROOT, but that's coming up. So now, the last 292 00:26:37,730 --> 00:26:42,534 official part of this, of this presentation, how do we do data analysis? 293 00:26:42,534 --> 00:26:44,950 Not like that. *laughter* 294 00:26:44,950 --> 00:26:52,210 *applause* We use, we use C++ and actually physicists 295 00:26:52,210 --> 00:26:58,140 need to write their own analysis in C++. 
We have very few people who have an actual 296 00:26:58,140 --> 00:27:03,876 education in programming, so that's sort of a clash. As I said, we need to keep one 297 00:27:04,607 --> 00:27:08,460 collision in memory. And, you know, what matters to us is throughput. We 298 00:27:08,460 --> 00:27:13,340 want to have, we want to analyze as many collisions as possible per second. What we 299 00:27:13,340 --> 00:27:17,390 can do is specialize our data format to match the analysis, because we don't want 300 00:27:17,390 --> 00:27:23,419 to waste I/O cycles, if we can, you know, if we can make use of the CPU better. ROOT 301 00:27:23,419 --> 00:27:29,110 has allowed us to do this for twenty years. It's really the workhorse for the analysis 302 00:27:29,110 --> 00:27:35,200 in high energy physics. And it's also an interface to complex software. We have 303 00:27:35,200 --> 00:27:40,950 serialization facilities, we have the statistical tools that people need, and 304 00:27:40,950 --> 00:27:44,480 we have graphics, because once you have done your analysis you need to communicate 305 00:27:44,480 --> 00:27:48,500 that to your peers and convince people, and publish, and so on, so that's part of 306 00:27:48,500 --> 00:27:54,169 the game. All of that is open source, and, of course, all of that is not just used by 307 00:27:54,169 --> 00:28:03,370 high energy physics. So, to conclude: We are here, because you make it possible. 308 00:28:03,370 --> 00:28:05,223 Thank you very much. It's fantastic to have you. 309 00:28:05,223 --> 00:28:10,860 *applause* We want to share and we have great people 310 00:28:10,860 --> 00:28:17,080 for science outreach, but we have nobody for software outreach, basically. So maybe 311 00:28:17,080 --> 00:28:24,570 it's worth a look to see what CERN is producing software-wise.
Scientific 312 00:28:24,570 --> 00:28:29,940 computing is nothing new, it has existed for a long time, but we had to start fairly 313 00:28:29,940 --> 00:28:35,490 early on a large scale. So when we were building it up, we had to take... we were 314 00:28:35,490 --> 00:28:39,960 trying to take pieces that existed and did not find much. So now we ended up 315 00:28:39,960 --> 00:28:45,179 with C++ data serialization, efficient computing for non-computer scientists 316 00:28:45,179 --> 00:28:49,660 even... In the part that I skipped and, you know, one of the alternate tracks, you 317 00:28:49,660 --> 00:28:54,289 would have seen that we have a Python binding as well for the whole software 318 00:28:54,289 --> 00:28:59,970 stack in C++. And for us, what matters most is scale. Now we are seeing that we 319 00:28:59,970 --> 00:29:04,309 are not the only ones. There are many more natural sciences arriving at a similar 320 00:29:04,309 --> 00:29:09,120 challenge of having to analyze large amounts of data. Now I promised you 321 00:29:09,120 --> 00:29:12,480 that I'll be bold and I'll try to make a few statements of what will happen with 322 00:29:12,480 --> 00:29:16,750 data analysis, not just in science. Because what we see is that we actually 323 00:29:16,750 --> 00:29:22,610 educate the people who will do data analysis, not just in science. What we see 324 00:29:22,610 --> 00:29:30,990 is that in the past, data volume mattered most. So more data meant more power. Now 325 00:29:30,990 --> 00:29:35,929 that's not the complete truth anymore. It's a lot about finding correlations. So 326 00:29:35,929 --> 00:29:40,880 even with the amount of data not growing anymore, because it's already humongous, 327 00:29:40,880 --> 00:29:46,320 we try to squeeze more knowledge out of it. And for that, I/O becomes important 328 00:29:46,320 --> 00:29:53,900 and CPU limitations are the crucial factor.
We see that multivariate techniques are 329 00:29:53,900 --> 00:29:59,029 still rising and they will just be part of the toolchain of the statistical tools; 330 00:29:59,852 --> 00:30:06,681 except for generative parts, which, I believe, will change the way we model. 331 00:30:10,232 --> 00:30:16,361 Now, based on what I just described, this is not a big surprise anymore. As we need 332 00:30:16,361 --> 00:30:21,210 throughput, we need to have a language for the core analysis part that is close to 333 00:30:21,210 --> 00:30:26,970 the metal, so something like C++. On the other hand, writing analyses is 334 00:30:26,970 --> 00:30:31,791 still complex, so you need a higher-level language and for that people could, for 335 00:30:31,791 --> 00:30:35,929 example, use Python. So, now language binding becomes relevant all of a sudden. 336 00:30:35,929 --> 00:30:42,010 It will be much more important in the future. And we need to tailor I/O to the actual 337 00:30:42,010 --> 00:30:48,910 analysis to not waste CPU cycles. So throughput is the king and, from my point of 338 00:30:48,910 --> 00:30:54,331 view, also in the future we will see much more effort in increasing the throughput. 339 00:30:55,600 --> 00:31:03,115 Okay, so that was it. In case you want to discuss anything with me, like "That's 340 00:31:03,115 --> 00:31:07,970 just wrong!", that's fine. I probably have several bugs in there. I'm still here 341 00:31:07,970 --> 00:31:12,909 until tomorrow. I don't know where yet, so I'll wander around and you can contact 342 00:31:12,909 --> 00:31:16,818 me by email or Twitter. Thank you very much for your attention. Thank you. 343 00:31:16,818 --> 00:31:20,525 *applause* 344 00:31:20,525 --> 00:31:27,990 *music* 345 00:31:27,990 --> 00:31:45,000 subtitles created by c3subtitles.de in the year 2017. Join, and help us!