1 00:00:00,000 --> 00:00:22,090 *36C3 preroll music* 2 00:00:22,090 --> 00:00:30,160 Okay so now to our speaker, he’s Lucas. He's a SPARQL magician I'm told, so and he 3 00:00:30,160 --> 00:00:35,230 will introduce you to his favorite querying language, SPARQL, and give you a 4 00:00:35,230 --> 00:00:40,020 little introduction and in the second part he will do some live coding which is 5 00:00:40,020 --> 00:00:45,840 always really interesting and funny and you can give him some things that he's 6 00:00:45,840 --> 00:00:50,070 querying for you and I'm sure we'll have lots of fun and interesting learning stuff 7 00:00:50,070 --> 00:00:53,790 here so give a warm round of applause to Lucas. 8 00:00:53,790 --> 00:00:55,730 [Applause] 9 00:01:01,030 --> 00:01:09,040 [inaudible] 10 00:01:09,040 --> 00:01:13,440 Is this better? Aha! It's a bit too loud so I'll just talk a bit until they have 11 00:01:13,440 --> 00:01:18,729 figured it out. Yeah so this is going to be kind of two parts but not really that 12 00:01:18,729 --> 00:01:22,420 separate but in the second part I'm basically going to write the queries that 13 00:01:22,420 --> 00:01:27,010 you suggest so if you – if you see what I'm going to do here and then think oh I 14 00:01:27,010 --> 00:01:31,270 have a great idea for something we could perhaps query then just remember that and 15 00:01:31,270 --> 00:01:34,869 we'll get back to that hopefully because otherwise the second half is going to be 16 00:01:34,869 --> 00:01:40,040 really short if I don't get any ideas from you. But yeah, so this is about querying 17 00:01:40,040 --> 00:01:46,190 linked data which allows you to do all kinds of crazy things and answer all kinds 18 00:01:46,190 --> 00:01:50,770 of crazy questions such as I think I had on the slides something like "what are the 19 00:01:50,770 --> 00:01:54,390 largest cities with a female mayor?" 
and if you wanted to find that out 20 00:01:54,390 --> 00:01:59,200 traditionally you could like go through Wikipedia and try to find all the largest 21 00:01:59,200 --> 00:02:03,320 cities and see which ones have a female mayor and which ones don't or perhaps 22 00:02:03,320 --> 00:02:06,550 there's a category with all the cities with a female mayor but then you have to 23 00:02:06,550 --> 00:02:12,200 sort them by population and it's a whole mess and with linked data you can find 24 00:02:12,200 --> 00:02:17,489 that out much more easily and also all kinds of other things but let's start with 25 00:02:17,489 --> 00:02:24,580 some simple fantasy linked data so this is a tiny snippet of linked data, some data 26 00:02:24,580 --> 00:02:30,049 graph. It's just composed of a load of nodes which are these ovals and rectangles 27 00:02:30,049 --> 00:02:35,049 here and they're connected with arrows and each of these forms kind of a triple 28 00:02:35,049 --> 00:02:39,820 consisting of the start node and then the arrow and then the end node and that's how 29 00:02:39,820 --> 00:02:45,250 we represent all the information you have in there, in this linked database. 
So for 30 00:02:45,250 --> 00:02:48,410 example we can read this as this talk right now happens in the Esszimmer or the 31 00:02:48,410 --> 00:02:51,959 dining room which is the name of this stage here and it's going to be followed 32 00:02:51,959 --> 00:02:55,930 by the live querying session which also happens in Esszimmer and the live querying 33 00:02:55,930 --> 00:03:00,900 session in turn follows this talk again and the Esszimmer, the dining room, is 34 00:03:00,900 --> 00:03:06,049 next to the kitchen, the Küche, and the kitchen is next to the dining room again 35 00:03:06,049 --> 00:03:09,890 and both of them are part of the WikipakaWG which is part of 36C3 and the 36 00:03:09,890 --> 00:03:17,340 talk happens right now and at the same time there's also some talk about how 37 00:03:17,340 --> 00:03:22,110 state elections are climate elections or something in the Chaos West stage, starts 38 00:03:22,110 --> 00:03:25,530 at the same time, Chaos West stage is part of the Chaos West Assembly which is part 39 00:03:25,530 --> 00:03:31,670 of 36C3 as well and so this graph has a few important properties, for example 40 00:03:31,670 --> 00:03:36,060 there's some redundant connections here, you could see, you could say, if this talk 41 00:03:36,060 --> 00:03:39,180 is followed by the live querying then you don't really need to know that live 42 00:03:39,180 --> 00:03:43,810 querying follows this talk, it's kind of redundant information. 
You already know 43 00:03:43,810 --> 00:03:48,900 it, but it doesn't hurt to have it, and it often makes your life easier if you have a 44 00:03:48,900 --> 00:03:53,650 little bit of redundancy in your graph and then if you find that one half of this 45 00:03:53,650 --> 00:03:57,269 connection is missing for example you can still investigate what's going on and also 46 00:03:57,269 --> 00:04:02,480 in here we have kind of bi-directional connection so Esszimmer is next to Küche 47 00:04:02,480 --> 00:04:07,510 which is next to Esszimmer but this is two separate arrows and could also be that 48 00:04:07,510 --> 00:04:11,790 only one of them is there so you don't have arrows which go into-, in both 49 00:04:11,790 --> 00:04:16,010 directions at once in this data model, it has to be, if you want something like this 50 00:04:16,010 --> 00:04:18,680 you have to have two separate arrows because that keeps the data model very 51 00:04:18,680 --> 00:04:25,720 simple. You just have subject predicate object and that's everything you have, and 52 00:04:25,720 --> 00:04:33,210 then to query this graph, you kind of select a tiny part of it and then you 53 00:04:33,210 --> 00:04:39,090 remove some part that you don't know about for example we know that this talk is 54 00:04:39,090 --> 00:04:43,650 followed by live querying and if we remove the live querying part, then we can ask 55 00:04:43,650 --> 00:04:50,650 something like... Okay, I did it the other way around. Never mind, this way. This 56 00:04:50,650 --> 00:04:53,530 talk is followed by which talk? 
and then you have a question but because you've 57 00:04:53,530 --> 00:05:00,449 left out this part and then if you ask this question to a query service it can, 58 00:05:00,449 --> 00:05:06,820 kind of, you can think of this like a, err, damn, I only know the German word for 59 00:05:06,820 --> 00:05:11,620 this one, a, Schablone, template, so you put this over the graph and this has to 60 00:05:11,620 --> 00:05:15,660 match the existing node this has to match the existing arrow and then you see which 61 00:05:15,660 --> 00:05:20,170 nodes can you put in here and in this case that's only the live querying or the other 62 00:05:20,170 --> 00:05:26,510 way around which talk follows this one so you can have the beginning of the triple 63 00:05:26,510 --> 00:05:31,490 can be a variable like this one or the end of the triple can be a variable like in 64 00:05:31,490 --> 00:05:38,540 this case and you can also have more complicated patterns like, no there's not 65 00:05:38,540 --> 00:05:42,280 a more complicated pattern, this is the same pattern. You have the question which 66 00:05:42,280 --> 00:05:46,139 talk happens in Esszimmer and you have two answers: this talk happens in Esszimmer 67 00:05:46,139 --> 00:05:51,819 and live querying happens in Esszimmer. But you can also combine more graph nodes 68 00:05:51,819 --> 00:05:57,659 like this, for example, which talk happens in some room, which is part of the 69 00:05:57,659 --> 00:06:02,389 Wikipaka-WG. So we have one free part here and one free part here. But we know that 70 00:06:02,389 --> 00:06:06,220 these two have to be connected with, "happens in", and then this has to be 71 00:06:06,220 --> 00:06:10,819 connected with "is part of" to the Wikipaka-WG. 
And you can kind of 72 00:06:10,819 --> 00:06:16,610 construct– if you can phrase your question as a kind of graph like this, where some 73 00:06:16,610 --> 00:06:19,439 parts are predetermined that you already know about and the other parts that you 74 00:06:19,439 --> 00:06:26,249 want to find. Those are these kind of variables which are here indicated with 75 00:06:26,249 --> 00:06:30,990 just dashed lines. Then you can ask that question to the graph and find the 76 00:06:30,990 --> 00:06:35,930 matching results. In this case, you have these two matches, this talk happens in 77 00:06:35,930 --> 00:06:40,139 Esszimmer, is part of Wikipaka-WG and live querying happens in Esszimmer, is part of 78 00:06:40,139 --> 00:06:46,759 Wikipaka-WG. And then, if we had more information in this graph 79 00:06:46,759 --> 00:06:51,770 here, we might also have other rooms. For example, there's this library over there 80 00:06:51,770 --> 00:06:56,080 which also is going to have some talks. If we had the whole schedule in here, we 81 00:06:56,080 --> 00:07:01,069 would find those as well. And we could also adapt the query so that we don't even 82 00:07:01,069 --> 00:07:06,860 make the Wikipaka-WG part fixed. We could ask for anything that happens in 36C3. So 83 00:07:06,860 --> 00:07:11,360 that would be some variable, happens in some room, is part of some assembly, is 84 00:07:11,360 --> 00:07:15,610 part of 36C3. And then we would find this thing as well because it fits the same 85 00:07:15,610 --> 00:07:21,530 kind of pattern: happens in, is part of, is part of 36C3. Does that make sense? 86 00:07:21,530 --> 00:07:31,539 Hopefully. I'm seeing a lot of nodding heads. OK, that's great. So then we can 87 00:07:31,539 --> 00:07:38,060 try to move ahead to actually ask some of these questions to a real query system. 
88 00:07:38,060 --> 00:07:43,029 Because in reality, you're not going to actually draw these graphs, but you have 89 00:07:43,029 --> 00:07:47,529 some kind of language where you phrase them instead, which looks a bit like this. 90 00:07:47,529 --> 00:07:52,719 So you have the part: SELECT anything WHERE, that is kind of like SQL, and then 91 00:07:52,719 --> 00:07:57,900 everything else is not like SQL. Forget SQL! I hear this is easier to understand 92 00:07:57,900 --> 00:08:03,499 if you don't know SQL. I didn't know SQL that much when I learned SPARQL, and I 93 00:08:03,499 --> 00:08:08,850 think it helped me, apparently. But what you write down here is this kind 94 00:08:08,850 --> 00:08:14,080 of description of the graph, and these dashed parts, which are the variables 95 00:08:14,080 --> 00:08:17,919 which you don't yet know. Those are marked with a question mark because that's kind 96 00:08:17,919 --> 00:08:21,150 of what you use to ask a question. In this case, I've just called it "?talk", but it 97 00:08:21,150 --> 00:08:27,069 could be any name, basically. And then instead of "happens in" as two words, I've 98 00:08:27,069 --> 00:08:32,510 just written "happensIn" as one and then with the prefix "36C3" and it happens in 99 00:08:32,510 --> 00:08:38,289 the 36C3 Esszimmer because I don't really have a separate dining room at home, but a 100 00:08:38,289 --> 00:08:43,320 lot of people do. So if we just wrote it happens in Esszimmer, that would be pretty 101 00:08:43,320 --> 00:08:48,110 ambiguous and no one would know which dining room you're talking about. 102 00:08:48,110 --> 00:08:52,780 And by adding this prefix we know we're talking about just the dining room 103 00:08:52,780 --> 00:08:58,510 at 36C3. I assume there's no other assembly that has 104 00:08:58,510 --> 00:09:01,370 something called the dining room. 
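The slide itself is not captured in the transcript, so what follows is only a minimal sketch of the single-pattern query being described. The prefix name `c3` and the IRI `https://example.org/36c3/` are made up for illustration. The talk uses "36C3" as the prefix, but as far as I can tell SPARQL prefix names have to start with a letter, hence the different spelling here.

```sparql
# Hypothetical namespace for the example graph (not a real vocabulary).
PREFIX c3: <https://example.org/36c3/>

# "Which talk happens in the Esszimmer?"
# ?talk is the unknown (dashed) node; the other two parts are fixed.
SELECT ?talk WHERE {
  ?talk c3:happensIn c3:Esszimmer .
}
```

Against the example graph from the slides, this pattern should match both this talk and the live querying session.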
If it does, then we would have to add something 105 00:09:01,370 --> 00:09:06,199 else here to make it clear. And I've used the same prefix for "happensIn" to make 106 00:09:06,199 --> 00:09:09,970 clear which kind of "happens in" relation we're talking about, that it's one 107 00:09:09,970 --> 00:09:15,650 specific to Congress events. And then you could ask this to a query service which 108 00:09:15,650 --> 00:09:21,750 has this example graph in it, and you might get the response that it's these two 109 00:09:21,750 --> 00:09:27,760 talks. And at the end, you have this period here because if you read the whole 110 00:09:27,760 --> 00:09:33,089 thing, it's kind of like a sentence again. Because the talk happens in Esszimmer. And 111 00:09:33,089 --> 00:09:36,810 if you have two sentences, then you have two periods. So the talk happens in some 112 00:09:36,810 --> 00:09:40,990 room. And this room is part of the Wikipaka-WG. And because we've used the 113 00:09:40,990 --> 00:09:47,510 same variable name here and down here, this has to be the same room. And it 114 00:09:47,510 --> 00:09:50,790 couldn't just be two different things. So if we use two different variable names 115 00:09:50,790 --> 00:09:55,500 here, room and something else, then we would just get all the combinations of 116 00:09:55,500 --> 00:09:59,260 talks happening somewhere and rooms being part of Wikipaka-WG without them being 117 00:09:59,260 --> 00:10:02,970 connected anyway, but because they use the same variable name they have to be 118 00:10:02,970 --> 00:10:08,840 connected like this. And then you would get these results we've seen earlier. What 119 00:10:08,840 --> 00:10:13,830 you can also do is leave out the room. So when I translate this into English, I 120 00:10:13,830 --> 00:10:18,410 could say, the talk happens in the room and the room is part of Wikipaka-WG. 
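The two-sentence query described here could be sketched roughly as follows. As before, the `c3` prefix, its IRI, and the local names are assumptions standing in for whatever the slide actually showed.

```sparql
# Hypothetical namespace for the example graph (not a real vocabulary).
PREFIX c3: <https://example.org/36c3/>

SELECT ?talk ?room WHERE {
  ?talk c3:happensIn ?room .          # the talk happens in some room
  ?room c3:isPartOf  c3:WikipakaWG .  # and that same room is part of the WikipakaWG
}
```

Because `?room` appears in both patterns, the two triples are joined on it. Renaming the second `?room` to something else would turn the query into a cross product of every talk location and every room of the WikipakaWG.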
But I 121 00:10:18,410 --> 00:10:23,160 could also say the talk happens in *some room, which is* part of the Wikipaka-WG, 122 00:10:23,160 --> 00:10:26,300 as kind of a– I don't know what that's called in English kind of a relative 123 00:10:26,300 --> 00:10:32,509 sentence sub-something-clause where we don't really talk about the room in itself 124 00:10:32,509 --> 00:10:36,600 just as a part of this larger sentence. And you can write that in SPARQL as well. 125 00:10:36,600 --> 00:10:44,220 And then it looks like this. And these square brackets kind of describe what the 126 00:10:44,220 --> 00:10:48,480 room looks like without giving it names. So in this case, you can only select the 127 00:10:48,480 --> 00:10:51,959 talk up here and we don't have a room variable. But if you don't care about what 128 00:10:51,959 --> 00:10:55,740 the room is, then that can be very useful. I've also changed something else here. 129 00:10:55,740 --> 00:11:04,149 I've replaced the 36C3 in "isPartOf" with schema, which is another prefix and schema 130 00:11:04,149 --> 00:11:09,380 is kind of this collection of useful prefixes and other nodes that you can 131 00:11:09,380 --> 00:11:14,189 reuse, for example, if you're describing things you have on your website, you might 132 00:11:14,189 --> 00:11:18,890 say you have an article with a schema:title and a schema:publicationDate. 133 00:11:18,890 --> 00:11:22,870 So this was mainly introduced by Google and some other search engines. But we can 134 00:11:22,870 --> 00:11:27,880 use the same vocabulary to talk about our talks because "isPartOf" is one of these 135 00:11:27,880 --> 00:11:35,829 standard terms we can use for that. And what else do I have. OK, the next thing I 136 00:11:35,829 --> 00:11:41,190 have is actual queries. So I think I'm just going to– I'm almost going to switch 137 00:11:41,190 --> 00:11:45,350 to Wikidata, so I should talk a bit about Wikidata. 
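The square-bracket form described above could look something like this sketch. The `c3` prefix and its IRI are again made up; `schema:isPartOf` is a real schema.org term, though whether the slide used exactly this IRI form is an assumption.

```sparql
# Hypothetical namespace for the example graph (not a real vocabulary).
PREFIX c3:     <https://example.org/36c3/>
PREFIX schema: <http://schema.org/>

# "The talk happens in some room, which is part of the WikipakaWG."
# The brackets describe the room without naming it, so only ?talk can be selected.
SELECT ?talk WHERE {
  ?talk c3:happensIn [ schema:isPartOf c3:WikipakaWG ] .
}
```

The trade-off is that the room is no longer available as a result column, which is fine as long as only the talks matter.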
So all these examples here were 138 00:11:45,350 --> 00:11:52,639 just on some example graph, which I made up here and threw on a slide with a lot of 139 00:11:52,639 --> 00:11:58,180 probably overengineered tikz LaTeX magic, which I shouldn't have wasted that much 140 00:11:58,180 --> 00:12:03,790 time about. But it looks nice. And… but if we want to write real queries, we could 141 00:12:03,790 --> 00:12:07,470 load this thing into a query service, but it wouldn't be that interesting because 142 00:12:07,470 --> 00:12:12,220 it's kind of small. But there are a lot of real data graphs out there that you can 143 00:12:12,220 --> 00:12:17,120 query with this query language, SPARQL. And one of the coolest ones, at least in 144 00:12:17,120 --> 00:12:21,170 my opinion, is called Wikidata or Wikidata. There's some kind of discussion 145 00:12:21,170 --> 00:12:27,980 about how it's pronounced. And it's kind of a free database of anything that's 146 00:12:27,980 --> 00:12:33,910 relevant. And it's part of the same family of projects as Wikipedia and Wikimedia 147 00:12:33,910 --> 00:12:37,769 Commons and other things. And it's also maintained by the same community of 148 00:12:37,769 --> 00:12:42,269 volunteers. And you can find all kinds of really interesting and cool and funny data 149 00:12:42,269 --> 00:12:46,009 there. So all of these example queries, which I have here, we're just going to ask 150 00:12:46,009 --> 00:12:57,380 to Wikidata. But first, I will just give you one or two minutes to try to imagine 151 00:12:57,380 --> 00:13:04,079 what this question would look like, either in the graph format or in the SPARQL 152 00:13:04,079 --> 00:13:09,339 format. Just try to figure out how you would formulate: "which software is 153 00:13:09,339 --> 00:13:15,100 written in bash" as a kind of, this kind of graph query. And then we can see what 154 00:13:15,100 --> 00:13:22,970 we can come up with. So. I didn't think this through. 
I need some waiting loop 155 00:13:22,970 --> 00:13:36,380 music now. Does anyone have a kind of idea of what the graph looks like, because I'm 156 00:13:36,380 --> 00:13:41,160 going to uncover it now and then you can compare, if it looks the same way. So it 157 00:13:41,160 --> 00:13:45,760 would look like this, at least using the Wikidata terminology. So instead of "is 158 00:13:45,760 --> 00:13:51,790 written in", the property is called "programming language". And this 159 00:13:51,790 --> 00:13:56,050 could be called "bash" or "Bourne Again Shell" or "GNU bash" or 160 00:13:56,050 --> 00:14:02,009 something. Doesn't really matter. And in SPARQL, it looks like this, which is a lot 161 00:14:02,009 --> 00:14:06,630 less readable, unfortunately, because one of the things about Wikidata is that it's 162 00:14:06,630 --> 00:14:14,290 multilingual. So instead of saying "programming language", we say "P277". And 163 00:14:14,290 --> 00:14:17,509 I think that's beautiful, haha. No, but this is a property ID and you can look up 164 00:14:17,509 --> 00:14:22,589 what this property is called in English or in German or in any other language. So if 165 00:14:22,589 --> 00:14:31,420 we look at Wikidata.org and look for – I think I forgot to zoom in. Yeah. There we 166 00:14:31,420 --> 00:14:40,180 go. I hope that's readable. Property P, what was it? 277. That is the property 167 00:14:40,180 --> 00:14:45,019 "programming language", at least in… okay, you can't read that. There you go. At 168 00:14:45,019 --> 00:14:48,220 least in English. In German it's "Programmiersprache", and it has tons of 169 00:14:48,220 --> 00:14:51,530 other languages too. So you can use Wikidata in any language you want, which 170 00:14:51,530 --> 00:14:56,640 is very nice. I could also show this page in a different language and then all of 171 00:14:56,640 --> 00:15:01,330 this would look different. 
The downside is that the SPARQL query is not quite as 172 00:15:01,330 --> 00:15:06,649 readable because you have to use all these numeric identifiers, but you don't have to 173 00:15:06,649 --> 00:15:14,920 memorize them at least. So let's… oops, try to write this query. SELECT * WHERE 174 00:15:14,920 --> 00:15:25,290 and we have the software, which is… which has the programming language "bash", and 175 00:15:25,290 --> 00:15:30,589 then we have to add these prefixes first, so bash is going to be a Wikidata item. So 176 00:15:30,589 --> 00:15:35,639 we abbreviate that with "wd" and that's a prefix. And then if I press control space, 177 00:15:35,639 --> 00:15:41,660 or I think on Macs command space works as well, then it searches for bash and shows 178 00:15:41,660 --> 00:15:46,850 me these suggestions and then I can just select the right one. In this case, "GNU 179 00:15:46,850 --> 00:15:50,959 bash", and then I have the ID, and if I move the mouse over it again, then I can 180 00:15:50,959 --> 00:15:55,760 see what this ID refers to. So it's not quite as bad as– so on the PDF slides, you 181 00:15:55,760 --> 00:16:00,879 just see the ID. But if you're actually on the query.wikidata.org website… let me 182 00:16:00,879 --> 00:16:05,370 make that a bit larger so you can all see it. And if you want to try that out on 183 00:16:05,370 --> 00:16:09,180 your laptop, I don't know, here it's a bit *audio outage* And for the programming 184 00:16:09,180 --> 00:16:17,290 language, we use a slightly different prefix, which is "wdt", which stands for 185 00:16:17,290 --> 00:16:21,270 "truthy". So we're only interested in "truthy" information and not all the 186 00:16:21,270 --> 00:16:28,529 information. And then we find this property P277. And if we run this query 187 00:16:28,529 --> 00:16:34,620 with control-enter or with this button here, then we get a collection of other 188 00:16:34,620 --> 00:16:40,240 IDs. Yeah. 
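What gets typed into query.wikidata.org at this point probably resembles the sketch below. P277 is the property named in the talk; the item ID for GNU Bash is not spoken aloud, so `Q189248` is filled in from memory and should be confirmed with the autocompletion described above.

```sparql
# wd: (items) and wdt: ("truthy" statements) are predefined on query.wikidata.org,
# so no PREFIX lines are needed there.
SELECT * WHERE {
  ?software wdt:P277 wd:Q189248 .  # programming language: GNU Bash (item ID assumed)
}
```

Run as written, this returns only bare item IDs, which is the unhelpful list the talk goes on to improve.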
Does anyone want to get software which is written in bash? This 189 00:16:40,240 --> 00:16:51,209 one has a very low ID that is going to be… Loading. There we go. Autopackage. Some 190 00:16:51,209 --> 00:16:55,060 package management system that I haven't even heard of, but it's written in bash. 191 00:16:55,060 --> 00:17:01,130 OK, so… wait. Er, so here you can see all these statements and "programming 192 00:17:01,130 --> 00:17:08,010 language: GNU Bash" is the one we looked for. And unfortunately… so this is not a 193 00:17:08,010 --> 00:17:11,720 very useful list. So one thing we can do in the Wikidata Query Service, which is 194 00:17:11,720 --> 00:17:17,140 pretty specific to Wikidata, is to add the so-called label service, which is 195 00:17:17,140 --> 00:17:21,300 basically magic that you don't need to understand. But you write something like 196 00:17:21,300 --> 00:17:25,650 "serv" or "service" and then with control+space again for autocompletion. 197 00:17:25,650 --> 00:17:30,600 And it suggests you this thing. And you just keep that in your query at all times, 198 00:17:30,600 --> 00:17:34,800 basically. And then you say, I would like to have not just a software, but also the 199 00:17:34,800 --> 00:17:41,200 software label. And then we get down here, the label of the software. And I can also 200 00:17:41,200 --> 00:17:46,270 add the software description. And then we also see what, what is described. At least 201 00:17:46,270 --> 00:17:53,150 if it has a description and then the query results are already a lot more usable. And 202 00:17:53,150 --> 00:17:59,170 I'm just going to rename this to "item" and then we can edit this query however we 203 00:17:59,170 --> 00:18:04,340 want and the variable name will always kind of match. Because the next query 204 00:18:04,340 --> 00:18:07,610 won't be about software anymore. So it'll be confusing if you just still call it 205 00:18:07,610 --> 00:18:13,210 "software". 
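With the label service added and the variable renamed to `?item`, the query from this passage might look roughly like this. The `SERVICE` line is the snippet that autocompletion inserts on the Wikidata Query Service; the Bash item ID is again an assumption to be checked.

```sparql
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item wdt:P277 wd:Q189248 .  # programming language: GNU Bash (item ID assumed)
  # Label service: fills in ?itemLabel and ?itemDescription for any ?item.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

The label service works by naming convention: for a variable `?item` it provides `?itemLabel` and `?itemDescription`, which is why keeping a generic variable name makes the query easy to reuse.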
But, yeah, there is some software here like Apache Yetus, Ruby 206 00:18:13,210 --> 00:18:18,780 Version Manager, Wikidata missing pictures, Pi-hole, all written in Bash. 207 00:18:18,780 --> 00:18:27,790 OK, I have several more example queries here, which are kind of simple, should I 208 00:18:27,790 --> 00:18:34,100 skip ahead or is it good if I do a few more simple examples? Skip ahead? Is that 209 00:18:34,100 --> 00:18:41,180 OK? OK, then let's. So who was born at sea is not all that interesting. Just "place of 210 00:18:41,180 --> 00:18:45,020 birth: at sea". We have a special value for that and it's not a very interesting list. 211 00:18:45,020 --> 00:18:48,780 I think a few results, just five or so, because most people are going to have 212 00:18:48,780 --> 00:18:51,890 "place of birth: Atlantic Ocean" or something. Which places are located on the 213 00:18:51,890 --> 00:18:57,180 White Elster, just something for the Leipzig people. And where does the 214 00:18:57,180 --> 00:19:00,750 Neverending Story take place? This is actually kind of cute. Let's do that. 215 00:19:00,750 --> 00:19:06,220 Also, this is a bit interesting because in this case, the variable is in the last 216 00:19:06,220 --> 00:19:13,330 place and not the first one. So that… and then we have the Neverending Story in the 217 00:19:13,330 --> 00:19:19,620 beginning and narrative location. And then the item is at the end instead of at the 218 00:19:19,620 --> 00:19:24,660 beginning of a triple. And it works just as well, except that a lot of these don't 219 00:19:24,660 --> 00:19:31,800 have a label in English. So let's add German as a fallback language. And then we 220 00:19:31,800 --> 00:19:37,630 get all of these places which someone added to Wikidata at some point. Let's see 221 00:19:37,630 --> 00:19:42,410 if there's any useful information about them. So they all have IDs in the same 222 00:19:42,410 --> 00:19:47,890 range. 
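A sketch of the "variable at the end" query with German as a fallback label language is given below. The talk selects the Neverending Story item directly via autocompletion, and that ID is not in the transcript; rather than guess it, this sketch looks the work up by its English label, which is an assumption and may match more than one item (for instance the novel and its film adaptations).

```sparql
SELECT ?item ?itemLabel WHERE {
  # Assumed lookup by label; the talk uses the item ID (wd:Q…) here instead.
  ?work rdfs:label "The Neverending Story"@en .
  ?work wdt:P840 ?item .   # narrative location; the variable is the object this time
  # Try English labels first, then fall back to German.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
}
```

The language list is ordered, so an item with no English label shows its German one instead of a bare ID.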
So it looks like they were all created at the same time because they 223 00:19:47,890 --> 00:19:51,880 are just increasing all the time. So the Gelichterland is a place from the 224 00:19:51,880 --> 00:19:55,261 Neverending Story, it's a fictional country. It has a capital, which 225 00:19:55,261 --> 00:20:00,740 is this fictional place. It's located on the… this terrain feature, it's present in 226 00:20:00,740 --> 00:20:05,600 the Neverending Story. And it depicts horror fiction. I'm not sure about that, 227 00:20:05,600 --> 00:20:12,350 but let's leave it alone for now. OK, yeah. And skip to a slightly more 228 00:20:12,350 --> 00:20:20,120 interesting query, which is this one, which popes had children. So what is the 229 00:20:20,120 --> 00:20:24,580 graph going to look like for this? How many triples are we going to 230 00:20:24,580 --> 00:20:29,090 have? So a triple is a node, an arrow, and another node; how many triples would you 231 00:20:29,090 --> 00:20:36,500 need for "Pope has a child"? Let's do a show of hands. Who thinks you need zero 232 00:20:36,500 --> 00:20:43,380 triples, OK? Who thinks you need one triple? Who thinks you need two triples? 233 00:20:43,380 --> 00:20:48,180 That's more people. Does anyone think you need three triples? No. OK, so mostly two, 234 00:20:48,180 --> 00:20:54,250 but some people think one. So the one… the people who think it might need one triple, 235 00:20:54,250 --> 00:21:02,650 perhaps are thinking of something like the Pope, which is the leader of the worldwide 236 00:21:02,650 --> 00:21:11,200 Catholic Church, has a child, this child or it's called item, but that's not going 237 00:21:11,200 --> 00:21:15,360 to have any results. Or it could be the other way around. And you could say that… 238 00:21:15,360 --> 00:21:26,370 oh let's just comment this out. The item has "father: the pope". And that doesn't 239 00:21:26,370 --> 00:21:30,650 work. 
Because the items are not… the children are not directly connected to the 240 00:21:30,650 --> 00:21:35,170 item for the office of the pope, instead it's going to be two levels. It's going to 241 00:21:35,170 --> 00:21:40,400 say the child has a father, some person, and then the person has the office pope or 242 00:21:40,400 --> 00:21:44,690 has the position pope or is a pope or something. So you need this level of 243 00:21:44,690 --> 00:21:49,150 indirection. So in the graph that looks either like this or it could be the other 244 00:21:49,150 --> 00:21:54,880 way around. So either the child has a father pope, which has "position held: 245 00:21:54,880 --> 00:22:00,930 pope" or the pope has a child and also a "position held", so that's kind of an 246 00:22:00,930 --> 00:22:04,090 example of the redundancy I mentioned earlier, we have the two directions 247 00:22:04,090 --> 00:22:11,300 "child" and also "father"/"mother", and- so you can ask your query in two ways, and 248 00:22:11,300 --> 00:22:14,170 it doesn't really make that much of a difference, assuming that the data is 249 00:22:14,170 --> 00:22:19,700 complete. And I think someone occasionally runs queries to check if any of these 250 00:22:19,700 --> 00:22:25,460 circles are missing. So let's try one of them, let's just stay with this one, so 251 00:22:25,460 --> 00:22:32,030 the item does not have "pope" as father, it has some pope, and then this pope has 252 00:22:32,030 --> 00:22:42,720 "position held: pope". And then let's add the "pope" label and… yeah, pope label is 253 00:22:42,720 --> 00:22:49,800 enough, and then we get 24 results! So we have a Duke of Parma which, who was the 254 00:22:49,800 --> 00:22:55,150 son of Paul III. Paul III had three children. Let's sort by this. Wow, 255 00:22:55,150 --> 00:23:04,390 Alexander VI was very busy. And some of them just have, oh oh oh, we have 256 00:23:04,390 --> 00:23:08,780 duplicates, Giovanni Borgia and Giovanni Borgia. 
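The two-triple query built up in this passage might be written as in the sketch below. The property and item IDs (P22 father, P39 position held, Q19546 pope) are not read out in the talk; they are filled in from memory and are worth confirming with autocompletion.

```sparql
SELECT ?item ?itemLabel ?popeLabel WHERE {
  ?item wdt:P22 ?pope .        # the child has some person as father
  ?pope wdt:P39 wd:Q19546 .    # and that person held the position "pope"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

The one-triple attempt fails because no child is linked directly to the office item `wd:Q19546`; the intermediate `?pope` variable supplies the missing hop through an actual person.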
Should I demonstrate Wikidata 257 00:23:08,780 --> 00:23:13,550 editing now or do we just ignore this? So, yeah, someone imported a lot of 258 00:23:13,550 --> 00:23:19,050 information from this peerage database and apparently we have some duplicate items 259 00:23:19,050 --> 00:23:24,140 here, let's just leave those alone for now. In fact, I think this and this also 260 00:23:24,140 --> 00:23:29,770 looks suspiciously similar. Giovanni Borgia, unless he had two children of that 261 00:23:29,770 --> 00:23:38,190 name. I mean, he could have. So this… we have a date of birth 1470s… 1498. No, that 262 00:23:38,190 --> 00:23:44,970 might actually be different children. OK, not a very creative father in the names. 263 00:23:44,970 --> 00:23:52,980 Yeah. And wait, that's a pope who's a child of another pope. Very interesting! 264 00:23:52,980 --> 00:23:56,460 And another one. And another one. We have three popes who are children of other 265 00:23:56,460 --> 00:24:02,000 popes. Let's search for those! So we would also need for that, that the item has 266 00:24:02,000 --> 00:24:11,380 "position held: Pope", and I could copy paste this, but just do this. So the item 267 00:24:11,380 --> 00:24:14,300 should be… child should have a "father: pope" and the item should have "position 268 00:24:14,300 --> 00:24:18,380 held: Pope", and the pope should also have "position held: pope". And in this case, 269 00:24:18,380 --> 00:24:22,690 it would probably be less confusing to call these "child" and "father", because 270 00:24:22,690 --> 00:24:26,480 this is also a pope now, but… variable names. One of the three hardest problems 271 00:24:26,480 --> 00:24:30,490 in computer science, right? Yeah, we have three children who are… three popes who 272 00:24:30,490 --> 00:24:36,540 are children of other popes. Wow. I'm actually going to save this query, popes 273 00:24:36,540 --> 00:24:42,470 who were children of other popes. 
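The "popes who were children of other popes" query adds one more pattern to the previous one. A sketch, with the same assumed IDs (P22 father, P39 position held, Q19546 pope):

```sparql
SELECT ?item ?itemLabel ?popeLabel WHERE {
  ?item wdt:P22 ?pope .        # the item has a father
  ?item wdt:P39 wd:Q19546 .    # the item itself held the position "pope"
  ?pope wdt:P39 wd:Q19546 .    # and so did the father
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

As the talk notes, `?child` and `?father` would be clearer names here, since both variables now refer to popes.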
But actually, we can future-proof this a 274 00:24:42,470 --> 00:24:47,910 little bit, because right now we've only said that the father should be a pope. But 275 00:24:47,910 --> 00:24:50,730 in case there's ever a female pope, let's just switch this around and say that the 276 00:24:50,730 --> 00:24:58,640 pope should have the child… item and then it's going to work, even if the pope 277 00:24:58,640 --> 00:25:03,140 happens to be female and is a mother instead of a father. There we go, same 278 00:25:03,140 --> 00:25:13,070 three results. OK, and let's keep that, and open a new tab for next queries. Yeah. 279 00:25:13,070 --> 00:25:18,340 Which Microsoft software runs on Linux. OK. That's not that funny. So perhaps we 280 00:25:18,340 --> 00:25:23,010 can just skip it… I don't know. That joke kind of ran out of steam a while ago. 281 00:25:23,010 --> 00:25:26,630 Basically looks like this and it's like Visual Studio Code and three other 282 00:25:26,630 --> 00:25:31,230 programs, meh. What are some compositions for organ and orchestra. This isn't funny 283 00:25:31,230 --> 00:25:35,710 at all, but I just find it very nice because it's just an awesome sound. And so 284 00:25:35,710 --> 00:25:40,860 that would be… the composition has the instrumentation "organ" and also 285 00:25:40,860 --> 00:25:52,670 "orchestra", which we can write as… item, item label… composition… instrumentation, 286 00:25:52,670 --> 00:26:11,650 this one, orchestra. And also, "composition… organ". And then, oops, 287 00:26:11,650 --> 00:26:18,120 yeah, this should be "item"… and also I forgot to add the label service. There we 288 00:26:18,120 --> 00:26:28,300 go. And we have 12 results, which is nice if you want to listen to any of those. We 289 00:26:28,300 --> 00:26:38,570 could also check if any of them have an audio file on Commons. Let's see. One, OK, 290 00:26:38,570 --> 00:26:46,460 and I think we've heard this one already. 
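The future-proofed variant described at the start of this passage swaps the "father" pattern for a "child" pattern pointing the other way. A sketch, assuming P40 is the "child" property and reusing the assumed IDs P39 (position held) and Q19546 (pope):

```sparql
SELECT ?item ?itemLabel ?popeLabel WHERE {
  ?pope wdt:P40 ?item .        # the pope has the child ?item (works for mothers too)
  ?item wdt:P39 wd:Q19546 .    # the child held the position "pope"
  ?pope wdt:P39 wd:Q19546 .    # and so did the parent
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

This only gives the same results as the "father" version if both directions of the relationship are recorded on Wikidata, which is the redundancy point made earlier in the talk.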
So, but… one thing that's kind of annoying 291 00:26:46,460 --> 00:26:50,420 here, I should have mentioned this in the last query, I think. So I had to repeat 292 00:26:50,420 --> 00:26:53,490 the item and the property ID, which is a bit annoying and makes the query difficult 293 00:26:53,490 --> 00:26:57,740 to read. And what you can do is leave that out and you can also do this in the 294 00:26:57,740 --> 00:27:04,660 previous case. So let's actually go one slide back. So here I didn't write twice 295 00:27:04,660 --> 00:27:07,350 that it's the software which should have the developer, and also the operating 296 00:27:07,350 --> 00:27:10,860 system. I just wrote the software has "developer: Microsoft" and also with a 297 00:27:10,860 --> 00:27:16,690 semicolon at the end instead of a period, it has "operating system: Linux". So if 298 00:27:16,690 --> 00:27:18,920 you read this as English it's just one sentence where you don't repeat the 299 00:27:18,920 --> 00:27:22,230 subject twice. The software has "developer: Microsoft" and "operating 300 00:27:22,230 --> 00:27:26,220 system: Linux", instead of "software has developer: Microsoft" and "software has 301 00:27:26,220 --> 00:27:31,350 operating system: Linux". And if you… if the property here is also the same thing, 302 00:27:31,350 --> 00:27:36,330 then you can even leave that out and add a comma at the end and just list the two 303 00:27:36,330 --> 00:27:41,170 values and you don't even have to repeat the instrumentation. So let's do that here 304 00:27:41,170 --> 00:27:47,450 and abbreviate this query. And it has the exact same 12 results, just slightly more 305 00:27:47,450 --> 00:27:54,720 convenient to read and… to write at least, hopefully also to read. I don't know. But 306 00:27:54,720 --> 00:27:56,840 you don't use the comma that much. 
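As an illustration of the semicolon abbreviation, the Microsoft-on-Linux query can be written with the subject stated only once. This is a sketch; the IDs (P178 developer, Q2283 Microsoft, P306 operating system, Q388 Linux) are assumptions not spelled out in the talk.

```sparql
# Software developed by Microsoft that runs on Linux (sketch; IDs assumed).
SELECT ?software ?softwareLabel WHERE {
  ?software wdt:P178 wd:Q2283 ;   # developer: Microsoft; ";" keeps the same subject
            wdt:P306 wd:Q388 .    # operating system: Linux; "." ends the sentence
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

When the property repeats as well, a comma lists several values, for example `?item wdt:P870 wd:Q1444, wd:Q42998 .` for the organ and orchestra query.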
The semicolon is pretty useful, like we could 307 00:27:56,840 --> 00:28:06,600 have written this as, the pope has, er, the child and also position held like 308 00:28:06,600 --> 00:28:10,530 this. It means exactly the same, but you can immediately see that both of these 309 00:28:10,530 --> 00:28:18,400 refer to the pope because there's just a bunch of blank space here. Yeah, so then 310 00:28:18,400 --> 00:28:27,690 we have this one. This isn't funny at all, but there are a lot of people who used to 311 00:28:27,690 --> 00:28:33,000 be in the Nazi Party during World War 2 and then who later just went back into a 312 00:28:33,000 --> 00:28:37,340 civil life and even received the Bundesverdienstkreuz, the order of merit 313 00:28:37,340 --> 00:28:42,320 of the Federal Republic of Germany. And you can find those… in this case I've done 314 00:28:42,320 --> 00:28:46,600 it with three triples, which is, the person was a member of this political 315 00:28:46,600 --> 00:28:51,510 party and received this award. And also I've added that they're "instance of: 316 00:28:51,510 --> 00:28:55,040 human", because we also have a lot of fictional data on Wikidata. You already 317 00:28:55,040 --> 00:28:57,701 saw that with the Neverending Story stuff earlier. So there might also be a 318 00:28:57,701 --> 00:29:02,040 fictional character who was a member of this political party and who received the 319 00:29:02,040 --> 00:29:07,300 award, and we're not really interested in those. So we add "instance of: human", and 320 00:29:07,300 --> 00:29:11,420 then we are certain that we only get real results and not fictional results. And it 321 00:29:11,420 --> 00:29:14,410 doesn't really cost us anything because the Query Service can optimize that pretty 322 00:29:14,410 --> 00:29:22,160 well. So let's write that… actually, let's do that here. 
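The three triples described here could be sketched as follows. The IDs are assumptions: Q5 (human), P102 (member of political party), Q7320 (Nazi Party) and P166 (award received) are ones I am fairly sure of, while Q697407 for the Order of Merit of the Federal Republic of Germany is only my recollection and must be verified on Wikidata.

```sparql
# Real people who were party members and directly received the order (sketch).
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31  wd:Q5 ;        # instance of: human (excludes fictional characters)
        wdt:P102 wd:Q7320 ;     # member of political party: Nazi Party
        wdt:P166 wd:Q697407 .   # award received: order of merit (ID assumed, verify)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```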
So the item should be 323 00:29:22,160 --> 00:29:31,670 "instance of: human", which is Q5, because it's a very common item, and "member of 324 00:29:31,670 --> 00:29:39,920 political party". And you can see I can search by the German abbreviation and find 325 00:29:39,920 --> 00:29:44,250 this, even though it's not a label, because there are search aliases. And also 326 00:29:44,250 --> 00:29:48,690 "award received", the Bundesverdienstkreuz, because I can't be 327 00:29:48,690 --> 00:29:54,300 bothered to type in the whole English name. There we go. And we find, I think… 328 00:29:54,300 --> 00:30:03,720 how many results? Eleven results. Yeah. And this actually isn't quite correct, 329 00:30:03,720 --> 00:30:10,250 because in theory, you don't get this order, this order has like 11 parts or 330 00:30:10,250 --> 00:30:15,310 something. You can get the Grand Cross with Distinction or you can get the Star 331 00:30:15,310 --> 00:30:19,280 or whatever. I think it's listed somewhere here. Yeah, you can get the Grand Cross 332 00:30:19,280 --> 00:30:22,580 Special Class, you can get the Grand Cross Special Issue, you can get the Grand Cross 333 00:30:22,580 --> 00:30:27,020 First Class, blah blah blah. And so, in theory, any of these people should have 334 00:30:27,020 --> 00:30:34,190 one of these awards and not just "order of merit". But I think when I checked, all of 335 00:30:34,190 --> 00:30:42,190 them just had… all the results, just had directly "order of merit". But actually, 336 00:30:42,190 --> 00:30:48,230 no we can try to search for the correct ones instead. So it would not be part of 337 00:30:48,230 --> 00:30:53,650 this directly, it would be… "award received" would be some award, such as 338 00:30:53,650 --> 00:31:03,310 this one, and then this award is part of the order of merit, so "award"… "part of"… 339 00:31:03,310 --> 00:31:14,670 Let's see if that finds any results. Oh. Oh. Oh, dear. 
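The variant that looks for the individual classes of the order, via an intermediate award that is "part of" the order, might be sketched like this. P361 (part of) is the only new ID; as before, Q697407 for the order itself is an unverified assumption.

```sparql
# People whose award is one of the parts of the order (sketch; order ID assumed).
SELECT ?item ?itemLabel ?awardLabel WHERE {
  ?item  wdt:P31  wd:Q5 ;
         wdt:P102 wd:Q7320 ;
         wdt:P166 ?award .       # some award received ...
  ?award wdt:P361 wd:Q697407 .   # ... which is part of the order of merit
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```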
Yeah, that, that… that's a 340 00:31:14,670 --> 00:31:21,210 lot of results. "Herbert von Karajan". That's depressing. OK, yeah. OK, so 341 00:31:21,210 --> 00:31:24,000 I think I… when I tried this out and didn't find any results, I just did 342 00:31:24,000 --> 00:31:30,430 something wrong because, this way we find a lot more results. And if we… so we don't 343 00:31:30,430 --> 00:31:35,660 actually select the award here, because we don't care what kind of award they got. So 344 00:31:35,660 --> 00:31:41,710 we could also use this abbreviation again, like this. So we just say they got some 345 00:31:41,710 --> 00:31:47,280 award, which is part of the order of merit. And in this case, we could even 346 00:31:47,280 --> 00:31:54,000 abbreviate that further and say, we put a slash here. And then, that kind of 347 00:31:54,000 --> 00:31:58,420 describes a path that you have to take from this item to this item and you have 348 00:31:58,420 --> 00:32:03,900 to first get to some award received. And then that has to be part of something 349 00:32:03,900 --> 00:32:08,020 else. And you can add as many elements here as you want. And then we get the 350 00:32:08,020 --> 00:32:17,540 exact same 802 results… and… lots of well-known names here. And if we want to find 351 00:32:17,540 --> 00:32:21,500 the original 11 ones that directly had the order of merit as the award received, we 352 00:32:21,500 --> 00:32:25,970 can add a question mark here, which is just like in a regular expression, it says 353 00:32:25,970 --> 00:32:32,360 this part is optional. They can have directly received this award or they can 354 00:32:32,360 --> 00:32:36,090 have received some award, which is part of the order of merit. And then we should get 355 00:32:36,090 --> 00:32:47,540 813. Yeah, 813 results, so 802, plus the 11 from earlier. 
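The slash and question-mark abbreviations from this passage are SPARQL property paths. A sketch of the final form follows; the IDs are assumptions, and Q697407 for the order of merit in particular is unverified.

```sparql
# "award received", optionally followed by "part of", must lead to the order.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31  wd:Q5 ;
        wdt:P102 wd:Q7320 ;
        wdt:P166/wdt:P361? wd:Q697407 .   # "/" chains two steps, "?" makes the second optional
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

Without the `?` the path only matches awards that are part of the order; with it, the directly recorded awards are included too, which matches the 802 plus 11 results reported in the talk.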
And… I'm starting this 356 00:32:47,540 --> 00:32:53,020 with "instance of: human", which… and the Query Service is going to re-order this 357 00:32:53,020 --> 00:32:57,210 because searching for all the humans and then filtering for the ones who are in 358 00:32:57,210 --> 00:33:01,270 this political party and so on wouldn't be efficient. So I don't have to worry about 359 00:33:01,270 --> 00:33:05,970 that. I could write it in this order, or I could shuffle it around. Doesn't make any 360 00:33:05,970 --> 00:33:10,020 difference. The Query Service already knows in which order to do these things. 361 00:33:10,020 --> 00:33:14,110 So you don't have to worry about that. You can just start with "is a human" and then 362 00:33:14,110 --> 00:33:23,310 add everything else. I think I have one more complicated query here. Yeah, so 363 00:33:23,310 --> 00:33:27,620 that's one of the examples I mentioned earlier, the largest cities by population 364 00:33:27,620 --> 00:33:33,200 with a female mayor. So the graph for that is, I think the largest one I prepared for 365 00:33:33,200 --> 00:33:37,570 the slides, except the one in the beginning. And it looks like this. We 366 00:33:37,570 --> 00:33:41,340 should have a city which is a city, "instance of: city", and it has a certain 367 00:33:41,340 --> 00:33:45,990 population, and it has… so for the mayor, we use the same property as for head of 368 00:33:45,990 --> 00:33:52,270 government. And if you don't know that, you could look at some city like Berlin 369 00:33:52,270 --> 00:33:59,280 and maybe you know what the mayor of Berlin is called… what was it?. Something 370 00:33:59,280 --> 00:34:04,540 "Müller", I think. Yeah. And then you can see, aha, the property for the mayor is 371 00:34:04,540 --> 00:34:13,909 "head of government". Or you could also search for, the city should have a mayor, 372 00:34:13,909 --> 00:34:19,490 and then you'll still find "head of government", the right property. 
And that 373 00:34:19,490 --> 00:34:24,879 mayor should be a human and she should have the gender "female". Oops. There's a 374 00:34:24,879 --> 00:34:28,369 question mark there for no reason at all. That's not a variable. That should be the 375 00:34:28,369 --> 00:34:36,940 fixed value. Sorry. So let's put that there. We have a city which is "instance 376 00:34:36,940 --> 00:34:49,759 of: city", and it also has a population which we're going to use later and it also 377 00:34:49,759 --> 00:34:55,139 has a head of government. No, that's wrong. Not the "office held by head of 378 00:34:55,139 --> 00:34:59,380 government", the "head of government" itself, which we call the mayor and then 379 00:34:59,380 --> 00:35:17,609 the mayor is "instance of: human" and gender should be female… come on… female. 380 00:35:17,609 --> 00:35:27,649 And let's select the city, cityLabel, mayorLabel and also the population. And 381 00:35:27,649 --> 00:35:31,220 then we find some 83 results. That's not yet the largest cities with a female 382 00:35:31,220 --> 00:35:37,269 mayor. That's just all of them. And in Wikidata we know about 83, apparently. And 383 00:35:37,269 --> 00:35:41,740 if your local hometown has a female mayor, just go ahead and add it to Wikidata and 384 00:35:41,740 --> 00:35:47,009 it's probably relevant. It's not– So the relevance criteria are not as strict as on 385 00:35:47,009 --> 00:35:52,529 Wikipedia fortunately. But if we want just the most populous ones, we can go a bit 386 00:35:52,529 --> 00:35:59,760 back into SQL land and say we want to ORDER BY the population and in SQL you 387 00:35:59,760 --> 00:36:03,420 would write DESC afterwards and in SPARQL it's different. You write 388 00:36:03,420 --> 00:36:09,700 DESC(?population). Erm, I think it's nicer that way. But perhaps it would have been 389 00:36:09,700 --> 00:36:13,740 nicer to just stick with the SQL syntax. I don't know. 
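The city query built up here, including the sort order, might look like the sketch below. The IDs (Q515 city, P1082 population, P6 head of government, P21 sex or gender, Q6581072 female) are assumptions that the talk does not spell out.

```sparql
# Cities with a female head of government, largest first (sketch; IDs assumed).
SELECT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city  wdt:P31   wd:Q515 ;      # instance of: city
         wdt:P1082 ?population ;  # population, used for sorting
         wdt:P6    ?mayor .       # head of government, i.e. the mayor
  ?mayor wdt:P31 wd:Q5 ;          # the mayor is a human
         wdt:P21 wd:Q6581072 .    # sex or gender: female (a fixed value, not a variable)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
ORDER BY DESC(?population)        # SPARQL wraps the expression: DESC(...)
```

Appending `LIMIT 10` after the `ORDER BY` line keeps only the ten most populous cities.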
And we want to limit this to 390 00:36:13,740 --> 00:36:19,160 just the ten most populous cities, for example. And here we go. Tokyo is 391 00:36:19,160 --> 00:36:25,819 currently the biggest one, then Hong Kong, Baghdad, Surabaya, Rome. Yeah. And, oh. 392 00:36:25,819 --> 00:36:37,190 This doesn't make that much sense, Caracas has two mayors. Anyone… yeah, exactly. So 393 00:36:37,190 --> 00:36:43,819 we're only supposed to get the current mayor. Head of government… yeah. Does 394 00:36:43,819 --> 00:36:51,890 anyone know which one is the current one? Or we could just check Wikipedia… Caracas, 395 00:36:51,890 --> 00:36:55,730 which hopefully doesn't get its information from Wikidata yet. So it's not 396 00:36:55,730 --> 00:37:07,940 circular. And the mayor is… Carolina, Carolina Cestari… Cestari, I don't know. 397 00:37:12,090 --> 00:37:14,660 *laughter* 398 00:37:14,660 --> 00:37:25,420 OK, so let's add a new one. Ah…? Doesn't have an item yet, is that… is that the 399 00:37:25,420 --> 00:37:31,369 mayor, or is chief of government something else? Doesn't occur anywhere else on the 400 00:37:31,369 --> 00:37:45,420 page, of course. Local government… mayor… no. OK, so let's just… I don't know, 401 00:37:45,420 --> 00:37:55,059 doesn't she have a Wikipedia article? No. Just appears in some lists and then she 402 00:37:55,059 --> 00:38:01,210 doesn't have a Wikidata item yet? No. Then… I don't know. We'll do some live 403 00:38:01,210 --> 00:38:04,660 Wikidata editing. It wasn't part of this talk, but let's just do it. Carolina 404 00:38:04,660 --> 00:38:17,270 Cestari… what country is that? Venezuela. Venezuelan politician, and that sounds 405 00:38:17,270 --> 00:38:22,609 like a female name, so I'm just going to guess and check that after the talk. So 406 00:38:22,609 --> 00:38:29,330 she's definitely a human. And gender is female and that is going to be enough for 407 00:38:29,330 --> 00:38:37,930 our query. Do this search again. There we go. 
And set this to preferred rank. So 408 00:38:37,930 --> 00:38:40,559 that's how the Query Service knows that this is the current value and it should 409 00:38:40,559 --> 00:38:44,500 only return this one. And ideally, one of the head of government values should have 410 00:38:44,500 --> 00:38:50,240 this preferred rank to mark it as the correct current value. And then all the 411 00:38:50,240 --> 00:38:53,640 other ones are additional data that you can use if you want. But it's not the main 412 00:38:53,640 --> 00:39:00,859 value and we are not going to get it in a simple query. And then there's some error 413 00:39:00,859 --> 00:39:06,259 because Caracas isn't some kind of political territorial entity and it should 414 00:39:06,259 --> 00:39:12,579 have a start time. I don't care right now. OK, so we run this query again and 415 00:39:12,579 --> 00:39:21,400 hopefully get just one result for Caracas this time. No. Uhm, we have to wait a bit 416 00:39:21,400 --> 00:39:26,450 until the Query Service is updated. Because it's kind of asynchronous. It just 417 00:39:26,450 --> 00:39:33,639 keeps watching for changes and eventually it will get the new data, but… okay. It 418 00:39:33,639 --> 00:39:42,079 might take a bit longer. Anyways. That's how that query works. Does that make kind 419 00:39:42,079 --> 00:39:51,710 of sense? OK, great. Yeah, I think this is almost exactly what I wrote here. Yeah. 420 00:39:51,710 --> 00:39:56,039 Except with some labels and the label service. Yeah. There is one problem here, 421 00:39:56,039 --> 00:40:02,019 which is, for example, I happen to know that Mexico City is a very large city with 422 00:40:02,019 --> 00:40:11,430 a population of… population: almost 9 million. So it should be right after Tokyo 423 00:40:11,430 --> 00:40:19,259 in front of Hong Kong. And the head of government is a Claudia Sheinbaum or 424 00:40:19,259 --> 00:40:23,980 something, which sounds like a woman. So we should get this result in the query. 
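The rank mechanism mentioned earlier in this passage can be inspected directly. The simple `wdt:` triples only return the best-ranked statements (the preferred ones if any exist, otherwise the normal ones), while the `p:`/`ps:` form exposes every statement together with its rank. The sketch below uses Berlin (Q64) purely as an example item; that choice is mine, not the talk's.

```sparql
# All "head of government" statements of one city, with their ranks (sketch).
SELECT ?mayor ?mayorLabel ?rank WHERE {
  wd:Q64 p:P6 ?statement .         # every P6 statement of the city, regardless of rank
  ?statement ps:P6 ?mayor ;        # the value of that statement
             wikibase:rank ?rank . # wikibase:PreferredRank, NormalRank or DeprecatedRank
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```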
425 00:40:23,980 --> 00:40:29,089 The reason we don't is that Mexico City is an instance of "big city" and we have 426 00:40:29,089 --> 00:40:35,470 searched for "instance of: city". And there's some debate about does this class 427 00:40:35,470 --> 00:40:39,860 even make sense at all? I think this is actually the German classification of, a 428 00:40:39,860 --> 00:40:43,859 big city is one with 100 000 inhabitants, and in other languages or countries, a big 429 00:40:43,859 --> 00:40:49,000 city might be something else, but for now that… the data is what it is. Fortunately, 430 00:40:49,000 --> 00:40:54,049 what we have here is the information, a "big city" is a subclass of a city/town, 431 00:40:54,049 --> 00:41:04,599 which is a subclass of "locality", which is a subclass of. Wait. We should arrive 432 00:41:04,599 --> 00:41:07,789 at city at some point, but I think we've already gone past that. It's also an 433 00:41:07,789 --> 00:41:12,080 instance of capital. Let's go down that instead. A capital is a subclass of city, 434 00:41:12,080 --> 00:41:16,670 there we go. So if we can tell the Query Service to follow these subclass 435 00:41:16,670 --> 00:41:22,609 connections, then we should find these cities. And one way to do that… to make it 436 00:41:22,609 --> 00:41:29,500 work for Mexico City would be to say, it has to be "instance of", some, with the 437 00:41:29,500 --> 00:41:37,160 path again, "subclass of: city" and then we would find Mexico City, but we would 438 00:41:37,160 --> 00:41:42,690 not find all the… oh, we would still find Tokyo because it's still a capital, I 439 00:41:42,690 --> 00:41:47,319 guess. But we've missed a lot of other cities, I think which we used to have… 440 00:41:47,319 --> 00:41:53,609 yeah. Rome, for example, is gone. Because it's… that's just an instance of city 441 00:41:53,609 --> 00:41:57,420 directly. And we've now made the subclass mandatory. 
What we should do is make it 442 00:41:57,420 --> 00:42:02,490 optional, or even better, we would– we should say there can be any number of this 443 00:42:02,490 --> 00:42:06,960 element. So there… it can be an instance of city or it can be an instance of a 444 00:42:06,960 --> 00:42:10,839 subclass of city, it can be an instance of a subclass of a subclass of city. You can 445 00:42:10,839 --> 00:42:14,359 follow any number of elements, that what this… that's what this star means, just 446 00:42:14,359 --> 00:42:19,390 like in a regular expression. And then we probably have to say we only want the 447 00:42:19,390 --> 00:42:24,359 distinct ones because there are like five different ways to go through the subclass 448 00:42:24,359 --> 00:42:30,050 tree until you've found "city". And we're not interested in the different ways. But 449 00:42:30,050 --> 00:42:35,330 now we should get Tokyo and Mexico City. And Rome is also here and Caracas is 450 00:42:35,330 --> 00:42:39,249 completely gone because we found enough other cities which we were missing 451 00:42:39,249 --> 00:42:45,810 earlier. So you kind of have to watch out and sometimes use elements like this… 452 00:42:45,810 --> 00:42:51,940 "subclass of"-tree is pretty common, or with a, something… order of merit, we had 453 00:42:51,940 --> 00:42:56,839 to use this "part of". You have to watch out if the results are plausible, or 454 00:42:56,839 --> 00:43:00,570 ideally, you know some item that should be in the results, and then you check, is it 455 00:43:00,570 --> 00:43:05,779 there? Why is it not there? And investigate like that. But that's a fixed 456 00:43:05,779 --> 00:43:10,910 version of the query. And… yeah, if we were not interested in the mayor, we could 457 00:43:10,910 --> 00:43:15,019 do the same trick again. But, yeah. It doesn't make that much of a difference. 458 00:43:15,019 --> 00:43:19,119 And I think… yeah, that was almost the only difference. 
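The fixed query with the star path and `DISTINCT` could be sketched as follows. The IDs are assumptions as before, with P279 (subclass of) added.

```sparql
# Ten largest cities with a female mayor, following the subclass tree (sketch).
SELECT DISTINCT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city  wdt:P31/wdt:P279* wd:Q515 ;  # instance of city, or of any (sub)subclass of city
         wdt:P1082 ?population ;
         wdt:P6    ?mayor .
  ?mayor wdt:P31 wd:Q5 ;
         wdt:P21 wd:Q6581072 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
ORDER BY DESC(?population)
LIMIT 10
```

`DISTINCT` collapses the duplicate rows that would otherwise appear when several subclass chains lead from a city's class to "city".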
Yeah, except that I 459 00:43:19,119 --> 00:43:22,829 removed the population so we can order by a variable that you don't select in the 460 00:43:22,829 --> 00:43:33,759 end if you want. And I think I am out of slides. So, yeah, if you want to see more 461 00:43:33,759 --> 00:43:38,029 queries, you can look at these Twitter or social media accounts. There's a huge list 462 00:43:38,029 --> 00:43:43,299 of example queries on Wikidata, which is so big that it's getting too big for a 463 00:43:43,299 --> 00:43:46,499 wiki page, and people had to move some queries out there and it's kind of just 464 00:43:46,499 --> 00:43:50,890 grown since 2015 or something. And there's a lot of garbage there, but also a lot of 465 00:43:50,890 --> 00:43:55,970 useful queries if you want to look at that. And I had two more queries in the 466 00:43:55,970 --> 00:44:00,900 talk description which we haven't talked about yet, and I think we have the time. I 467 00:44:00,900 --> 00:44:04,400 can just try to open these. "Which films starred more than one future head of 468 00:44:04,400 --> 00:44:15,210 government?" Does that work? It doesn't. Can I copy the URL here? Yeah, copy link 469 00:44:15,210 --> 00:44:20,700 address. So that's a kind of longer query, which is why it didn't really fit on one 470 00:44:20,700 --> 00:44:26,480 slide. But the important film is you have… er, the important part is you have some 471 00:44:26,480 --> 00:44:32,480 film… instance of, or subclass of film, it has a publication date and a cast member, 472 00:44:32,480 --> 00:44:41,070 which is the head of government. And the head of government held some position, 473 00:44:41,070 --> 00:44:47,009 some head of government, er, some subclass of head of government. And that should be 474 00:44:47,009 --> 00:44:53,330 after the film was published. And then you get a bunch of results. I think this takes 475 00:44:53,330 --> 00:45:00,069 like 11 seconds or something. 
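A condensed sketch of the film query as described above follows. It is not the query from the talk description, only my reconstruction of the pattern, and the IDs (Q11424 film, P577 publication date, P161 cast member, Q2285706 head of government, P580 start time) are assumptions. It may well hit the 60-second limit on a busy service.

```sparql
# Films with more than one cast member who later became head of government (sketch).
SELECT ?film ?filmLabel (COUNT(DISTINCT ?person) AS ?futureHeads) WHERE {
  ?film wdt:P31/wdt:P279* wd:Q11424 ;   # instance of film or of a subclass of film
        wdt:P577 ?published ;           # publication date
        wdt:P161 ?person .              # cast member
  ?person p:P39 ?statement .            # a "position held" statement ...
  ?statement ps:P39/wdt:P279* wd:Q2285706 ;  # ... for some kind of head of government
             pq:P580 ?start .           # start time qualifier on that statement
  FILTER(?start > ?published)           # the office began after the film came out
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
GROUP BY ?film ?filmLabel
HAVING(COUNT(DISTINCT ?person) > 1)
```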
And you get like films with Schwarzenegger and one 476 00:45:00,069 --> 00:45:05,750 other actor who became US governor. I don't remember the name. And you also get 477 00:45:05,750 --> 00:45:09,890 a lot of… or several films from World War II with future French heads of government, 478 00:45:09,890 --> 00:45:15,910 which is really cool. So, like a film that was shot about the liberation of Paris, 479 00:45:15,910 --> 00:45:20,289 where it's… it's kind of a stretch to call them cast members, but they're definitely 480 00:45:20,289 --> 00:45:26,190 in the film. And if we get the result, then I can tell you what the film is 481 00:45:26,190 --> 00:45:35,381 called. Yeah, it might be busy right now, so you get up to 60 seconds in the Query 482 00:45:35,381 --> 00:45:40,210 Service and then in the end your query is killed if it takes longer than that. So 483 00:45:40,210 --> 00:45:43,039 sometimes it can be a bit of a struggle to make the query work within 60 seconds. 484 00:45:43,039 --> 00:45:48,359 There we go, 50 seconds. That was close. So there's yeah, there's a "La Libération 485 00:45:48,359 --> 00:45:52,450 de Paris" with Charles de Gaulle, who was president of the Council and president of 486 00:45:52,450 --> 00:45:58,240 the provisional government, and also Georges Bidault, I think, who was prime 487 00:45:58,240 --> 00:46:02,700 minister and president of the Council, and other stuff. We have several Indian films 488 00:46:02,700 --> 00:46:09,589 with people who went on to become chief ministers. And then down here there's some 489 00:46:09,589 --> 00:46:14,490 Canadian politicians, apparently. And then here's Arnold Schwarzenegger and Jesse 490 00:46:14,490 --> 00:46:21,450 Ventura, who both became governors and also starred in several films. 
And the 491 00:46:21,450 --> 00:46:26,320 other thing was, we have a lot of data about the British government because a lot 492 00:46:26,320 --> 00:46:31,670 of volunteers have just been slaving away at that data and adding and adding more 493 00:46:31,670 --> 00:46:38,789 information. I think they've… they have all their parliaments, complete with party 494 00:46:38,789 --> 00:46:42,990 affiliations and everything for at least the last 100 years and some partial data 495 00:46:42,990 --> 00:46:47,020 for a lot more than that, because they have a very long parliamentary history. 496 00:46:47,020 --> 00:46:51,180 And then you can do queries like "how many people named John are there in 497 00:46:51,180 --> 00:46:56,420 parliament", and "how many women with any name". And you can see when the women were 498 00:46:56,420 --> 00:47:01,710 finally more than just the men who are named "John". And it's kind of an amusing 499 00:47:01,710 --> 00:47:08,160 graph. Or not so amusing. Takes a while as well. I hope it doesn't take 50 seconds, 500 00:47:08,160 --> 00:47:13,549 but it looks like the Query Service might be busy at the moment. But I think it was 501 00:47:13,549 --> 00:47:19,910 something like in 1991 or so is the crossover point. Oh yeah. And I should 502 00:47:19,910 --> 00:47:23,840 mention anyway, so everything we saw right now was just a lot of tables. But you can 503 00:47:23,840 --> 00:47:31,170 also show results in different ways, such as a line chart. There we go. So in 1992, 504 00:47:31,170 --> 00:47:35,390 this was the first parliament which had more women than Johns. And then the Johns 505 00:47:35,390 --> 00:47:41,480 have slightly declined and the women have gone up to 220. How many people are in the 506 00:47:41,480 --> 00:47:47,690 House of Commons in total? Does anyone know? No. So I don't know what percentage 507 00:47:47,690 --> 00:47:52,500 this is. 
Uh, but, this was… yeah, this latest election from 12 December already 508 00:47:52,500 --> 00:48:02,739 in there. Yeah. *indistinguishable*. What? So the query looks like this. So this one 509 00:48:02,739 --> 00:48:06,400 is broken into several parts. We first find all the members of parliament, so 510 00:48:06,400 --> 00:48:10,509 they should be human, again, no fictional people, and then they should have some 511 00:48:10,509 --> 00:48:15,540 "position held", which is a subclass of "member of parliament" in the House of 512 00:48:15,540 --> 00:48:22,440 Commons. And then there should also be, um, a parliamentary term on that, so that 513 00:48:22,440 --> 00:48:27,660 we know which parliament it is and when it starts. And then down here, we import all 514 00:48:27,660 --> 00:48:35,230 those MPs and filter for just the ones with the "given name: John". And then we 515 00:48:35,230 --> 00:48:39,989 filter for just the ones with "gender: female". And there's an optional "subclass 516 00:48:39,989 --> 00:48:44,259 of" in here, because currently the data model is that there is a separate item for 517 00:48:44,259 --> 00:48:49,410 transgender female and someone can have "gender: transfemale– transgender female", 518 00:48:49,410 --> 00:48:52,940 which is a subclass of "female". And there is a discussion right now to get rid of 519 00:48:52,940 --> 00:48:56,519 that and have a separate property for that instead. And then all the trans people 520 00:48:56,519 --> 00:48:59,390 just have "gender:", their right gender, and you don't have to mess with subclass. 521 00:48:59,390 --> 00:49:03,660 But right now we still… well, we need it in theory, I don't think there are any MPs 522 00:49:03,660 --> 00:49:08,540 in practice. But, you know, you know, you can just keep it in there. 
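The structure described here, one shared set of MPs that is then reused for two counts, can be expressed with named subqueries, a Blazegraph extension available on the Wikidata Query Service (`WITH { … } AS %name` and `INCLUDE %name`). The sketch below is my reconstruction, not the query shown in the talk, and every ID in it is an assumption to verify: Q16707842 (Member of Parliament of the United Kingdom), P2937 (parliamentary term), P735 (given name), Q4925477 (John).

```sparql
# Johns versus women per parliamentary term (sketch; IDs are assumptions).
SELECT ?term ?termLabel ?johns ?women
WITH {
  # Part 1: all real people with an MP position, plus the term from the qualifier.
  SELECT DISTINCT ?mp ?term WHERE {
    ?mp wdt:P31 wd:Q5 ;
        p:P39 ?statement .
    ?statement ps:P39/wdt:P279* wd:Q16707842 ;
               pq:P2937 ?term .
  }
} AS %mps
WITH {
  # Part 2: of those MPs, the ones with given name John.
  SELECT ?term (COUNT(DISTINCT ?mp) AS ?johns) WHERE {
    INCLUDE %mps .
    ?mp wdt:P735 wd:Q4925477 .
  } GROUP BY ?term
} AS %johns
WITH {
  # Part 3: of those MPs, the women; "?" also accepts a direct subclass of female.
  SELECT ?term (COUNT(DISTINCT ?mp) AS ?women) WHERE {
    INCLUDE %mps .
    ?mp wdt:P21/wdt:P279? wd:Q6581072 .
  } GROUP BY ?term
} AS %women
WHERE {
  INCLUDE %johns .
  INCLUDE %women .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

Because the two counts are joined on `?term`, a term with no Johns or no women at all would drop out of this simplified version.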
And then we 523 00:49:08,540 --> 00:49:15,359 import the results and get them here either as a line chart or as a table, if 524 00:49:15,359 --> 00:49:20,769 you want to sort it by the time… yeah, the data starts in 1919, apparently. So we 525 00:49:20,769 --> 00:49:25,450 have exactly a hundred years of history there. We can also show it as a bar chart, 526 00:49:25,450 --> 00:49:30,529 if that makes more sense. No it doesn't. That makes no sense. Line chart is the 527 00:49:30,529 --> 00:49:35,059 right one. Oh, right, but if you show the line chart again, then it breaks for some 528 00:49:35,059 --> 00:49:39,059 reason, there's some bug there. So let's just show it again. There we go. That's 529 00:49:39,059 --> 00:49:47,160 the right… chart. Yeah, and I guess… oh wow, it's already… 50 minutes, so I guess 530 00:49:47,160 --> 00:49:55,359 this is the point where we start moving to the live querying part, and I was told I 531 00:49:55,359 --> 00:49:58,690 should make at least a short break for the stream, so the Angels know where to cut 532 00:49:58,690 --> 00:50:02,770 between. But we could also take a 10-minute break and then start the next 533 00:50:02,770 --> 00:50:09,170 talk on time. Does that sound OK? Or is 10 minutes too long? Uhm, if you're going to 534 00:50:09,170 --> 00:50:13,670 stay here, which would be very nice, then please think of some example queries that 535 00:50:13,670 --> 00:50:16,820 you think we could write, and then I can try to write them, because otherwise I'm 536 00:50:16,820 --> 00:50:21,569 not going to have much to do. But yeah, let's do a 10 minute break and see you 537 00:50:21,569 --> 00:50:24,569 then. Thank you so far. 538 00:50:24,569 --> 00:50:27,219 *Applause* 539 00:50:27,219 --> 00:50:32,429 *Postroll Music* 540 00:50:32,429 --> 00:50:55,000 Subtitles created by c3subtitles.de in the year 2021. Join, and help us!