1 00:00:00,000 --> 00:00:19,480 *36C3 preroll music* 2 00:00:19,480 --> 00:00:24,140 Herald Angel: We have Tom and Max here. They have a talk here with a very 3 00:00:24,140 --> 00:00:28,140 complicated title that I don't quite understand yet. It's called "Interactively 4 00:00:28,140 --> 00:00:35,810 Discovering Implicational Knowledge in Wikidata". And they told me the point of 5 00:00:35,810 --> 00:00:39,190 the talk is that I would like to understand what it means and I hope I 6 00:00:39,190 --> 00:00:42,190 will. So good luck. Tom: Thank you very much. 7 00:00:42,190 --> 00:00:44,310 Herald: And have some applause, please. 8 00:00:44,310 --> 00:00:47,880 *applause* 9 00:00:47,880 --> 00:00:54,980 T: Thank you very much. Do you hear me? Does it work? Hello? Oh, very good. Thank 10 00:00:54,980 --> 00:00:58,789 you very much and welcome to our talk about interactively discovering 11 00:00:58,789 --> 00:01:05,110 implicational knowledge in Wikidata. It is more or less a fun project we started 12 00:01:05,110 --> 00:01:10,890 for finding rules that are implicit in Wikidata – entailed just by the data it 13 00:01:10,890 --> 00:01:18,850 has, that people inserted into the Wikidata database so far. And we will 14 00:01:18,850 --> 00:01:23,570 start with the explicit knowledge. So the explicit data in Wikidata, with Max. 15 00:01:23,570 --> 00:01:28,340 Max: So. Right. What is Wikidata? Maybe you have heard about Wikidata, then 16 00:01:28,340 --> 00:01:33,210 that's all fine. Maybe you haven't, then surely you've heard of Wikipedia. And 17 00:01:33,210 --> 00:01:36,790 Wikipedia is run by the Wikimedia Foundation and the Wikimedia Foundation 18 00:01:36,790 --> 00:01:41,330 has several other projects. And one of those is Wikidata. And Wikidata is 19 00:01:41,330 --> 00:01:45,490 basically a large graph that encodes machine readable knowledge in the form of 20 00:01:45,490 --> 00:01:51,730 statements.
And a statement basically consists of some entity that is connected 21 00:01:51,730 --> 00:01:58,200 – or some entities that are connected by some property. And these properties 22 00:01:58,200 --> 00:02:02,909 can then even have annotations on them. So, for example, we have Donna Strickland 23 00:02:02,909 --> 00:02:09,149 here and we encode that she has received a Nobel prize in physics last year by this 24 00:02:09,149 --> 00:02:16,290 property "awarded" and this has then a qualifier "time: 2018" and also "for: 25 00:02:16,290 --> 00:02:23,100 Chirped Pulse Amplification". And all in all, we have some 890 million statements 26 00:02:23,100 --> 00:02:31,960 on Wikidata that connect 71 million items using 7000 properties. But there's also a 27 00:02:31,960 --> 00:02:36,830 bit more. So we also know that Donna Strickland has "field of work: optics" and 28 00:02:36,830 --> 00:02:41,420 also "field of work: lasers" so we can use the same property to connect some entity 29 00:02:41,420 --> 00:02:46,480 with different other entities. And we don't even have to have knowledge that 30 00:02:46,480 --> 00:02:56,530 connects the entities. We can have a date of birth, which is 1959. Nineteen ninety. 31 00:02:56,530 --> 00:03:05,530 No. Nineteen fifty nine. Yes. And this is then just a plain date, not an entity. And 32 00:03:05,530 --> 00:03:11,510 now coming from the explicit knowledge then, well, we have some more: we have 33 00:03:11,510 --> 00:03:16,209 Donna Strickland has received a Nobel prize in physics and also Marie Curie has 34 00:03:16,209 --> 00:03:21,170 received the Nobel prize in physics. And we also know that Marie Curie has a Nobel 35 00:03:21,170 --> 00:03:27,780 prize ID that starts with "phys" and then "1903" and some random numbers that 36 00:03:27,780 --> 00:03:32,970 basically are this ID. Then Marie Curie also has received a Nobel prize in 37 00:03:32,970 --> 00:03:38,580 chemistry in 1911.
So she has another Nobel ID that starts with "chem" and has 38 00:03:38,580 --> 00:03:43,590 "1911" there. And then there's also Frances Arnold, who received the Nobel 39 00:03:43,590 --> 00:03:48,549 prize in chemistry last year. So she has a Nobel ID that starts with "chem" and has 40 00:03:48,549 --> 00:03:54,740 "2018" there. And now one could assume that, well, everybody who was awarded the 41 00:03:54,740 --> 00:04:00,156 Nobel prize should also have a Nobel ID. So everybody who was awarded the Nobel 42 00:04:00,156 --> 00:04:05,670 prize should also have a Nobel prize ID, and we could write that as some 43 00:04:05,670 --> 00:04:11,791 implication here. So "awarded(nobelPrize)" implies "nobelID". And well, if you 44 00:04:11,791 --> 00:04:16,349 look sharply at this picture, then there's this arrow here conspicuously missing: 45 00:04:16,349 --> 00:04:22,550 Donna Strickland doesn't have a Nobel prize ID. And indeed, there's 25 people 46 00:04:22,550 --> 00:04:26,669 currently on Wikidata that are missing Nobel prize IDs, and Donna Strickland is 47 00:04:26,669 --> 00:04:34,060 one of them. So we call these people that don't satisfy this implication – we call 48 00:04:34,060 --> 00:04:40,419 those counterexamples. And well, if you look at Wikidata on the scale of really 49 00:04:40,419 --> 00:04:45,350 these 890 million statements, then you won't find any counterexamples because 50 00:04:45,350 --> 00:04:52,550 it's just too big. So we need some way to automatically do that. And the idea is 51 00:04:52,550 --> 00:04:58,930 that, well, if we had this knowledge that while some implications are not satisfied, 52 00:04:58,930 --> 00:05:03,840 then this encodes maybe missing information or wrong information, and we 53 00:05:03,840 --> 00:05:10,870 want to represent that in a way that is easy to understand and also succinct. So 54 00:05:10,870 --> 00:05:16,090 it doesn't take long to write it down, it should have a short representation.
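The check just described — does any entity satisfy the premise of an implication but not its conclusion — can be sketched in a few lines of Python. The data below is a hand-made toy stand-in for the real Wikidata items, not actual query results:

```python
# Toy data: each entity with the (schematic) properties it has.
# Names and attribute labels are illustrative, not real Wikidata IDs.
entities = {
    "Marie Curie":      {"awarded(nobelPrize)", "nobelID"},
    "Frances Arnold":   {"awarded(nobelPrize)", "nobelID"},
    "Donna Strickland": {"awarded(nobelPrize)"},  # Nobel prize ID missing
}

def counterexamples(entities, premise, conclusion):
    """Entities satisfying the premise of an implication but not its conclusion."""
    return [name for name, props in entities.items()
            if premise <= props and not conclusion <= props]

# "awarded(nobelPrize)" implies "nobelID" – who violates it?
print(counterexamples(entities, {"awarded(nobelPrize)"}, {"nobelID"}))
# -> ['Donna Strickland']
```

In the real setting the same check is a SPARQL query against Wikidata; the implication holds in the data exactly when the result list is empty.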
So 55 00:05:16,090 --> 00:05:23,060 that rules out anything, including complex syntax or logical quantifiers. So no SPARQL 56 00:05:23,060 --> 00:05:27,480 queries as a description of that implicit knowledge. No description logics, if 57 00:05:27,480 --> 00:05:33,199 you've heard of that. And we also want something that we can actually compute on 58 00:05:33,199 --> 00:05:41,539 actual hardware in a reasonable timeframe. So our approach is we use Formal Concept 59 00:05:41,539 --> 00:05:46,889 Analysis, which is a technique that has been developed over the past several years 60 00:05:46,889 --> 00:05:52,070 to extract what is called propositional implications. So just logical formulas of 61 00:05:52,070 --> 00:05:56,240 propositional logic that are an implication in the form of this 62 00:05:56,240 --> 00:06:03,020 "awarded(nobelPrize)" implies "nobelID". So what exactly is Formal Concept 63 00:06:03,020 --> 00:06:08,500 Analysis? Off to Tom. T: Thank you. So what is Formal Concept 64 00:06:08,500 --> 00:06:14,420 Analysis? It was developed in the 1980s by Rudolf Wille and Bernard Ganter 65 00:06:14,420 --> 00:06:18,539 and they were restructuring lattice theory. Lattice theory is an ambiguous 66 00:06:18,539 --> 00:06:23,370 name in math, it has two meanings: One meaning is you have a grid and have a 67 00:06:23,370 --> 00:06:29,050 lattice there. The other thing is to speak about orders – order relations. So I like 68 00:06:29,050 --> 00:06:34,150 steaks, I like pudding and I like steaks more than pudding. And I like rice more 69 00:06:34,150 --> 00:06:40,960 than steaks. That's an order, right? And lattices are particular orders which can 70 00:06:40,960 --> 00:06:46,770 be used to represent propositional logic. So easy rules like "when it rains, the 71 00:06:46,770 --> 00:06:52,990 street gets wet", right?
So and the data representation those guys used back then, 72 00:06:52,990 --> 00:06:57,080 they called it a formal context, which is basically just a set of objects – they 73 00:06:57,080 --> 00:07:02,000 call them objects, it's just a name –, a set of attributes and some incidence, 74 00:07:02,000 --> 00:07:07,890 which basically means which object does have which attributes. So, for example, my 75 00:07:07,890 --> 00:07:13,150 laptop has the colour black. So this object has some property, right? So that's 76 00:07:13,150 --> 00:07:17,870 a small example on the right for such a formal context. So the objects there are 77 00:07:17,870 --> 00:07:24,379 some animals: a platypus – that's the fun animal from Australia, the mammal which is 78 00:07:24,379 --> 00:07:30,279 also laying eggs and which is also venomous –, a black widow – the spider –, 79 00:07:30,279 --> 00:07:35,449 the duck and the cat. So we see, the platypus has all the properties; it has 80 00:07:35,449 --> 00:07:39,729 being venomous, laying eggs and being a mammal; we have the duck, which is not a 81 00:07:39,729 --> 00:07:44,169 mammal, but it lays eggs, and so on and so on. And it's very easy to grasp some 82 00:07:44,169 --> 00:07:49,430 implicational knowledge here. An easy rule you can find is whenever you encounter a 83 00:07:49,430 --> 00:07:54,300 mammal that is venomous, it has to lay eggs. So this is a rule that falls out of 84 00:07:54,300 --> 00:07:59,639 this binary data table. Our main problem then or at this point is we do not have 85 00:07:59,639 --> 00:08:03,470 such a data table for Wikidata, right? We have the implicit graph, which is way more 86 00:08:03,470 --> 00:08:09,030 expressive than binary data, and we cannot even store Wikidata as a binary table. 87 00:08:09,030 --> 00:08:13,859 Even if you tried to, we have no chance to compute such rules from that.
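The animal context just described fits in a few lines of Python, and the rule "a venomous mammal has to lay eggs" can be checked mechanically against it. A toy sketch; the attribute names follow the slide:

```python
# The animal context from the slide as a binary table (object -> attributes).
animals = {
    "platypus":    {"venomous", "lays eggs", "mammal"},
    "black widow": {"venomous", "lays eggs"},
    "duck":        {"lays eggs"},
    "cat":         {"mammal"},
}

def holds(context, premise, conclusion):
    """An implication holds if every object that has all premise
    attributes also has all conclusion attributes."""
    return all(conclusion <= attrs
               for attrs in context.values()
               if premise <= attrs)

# The rule from the talk: a venomous mammal has to lay eggs.
print(holds(animals, {"venomous", "mammal"}, {"lays eggs"}))  # True (only the platypus qualifies)
print(holds(animals, {"lays eggs"}, {"mammal"}))              # False (the duck refutes it)
```

Note that the subset test `premise <= attrs` restricts the check to the objects that actually satisfy the premise — an implication with an unsatisfied premise holds vacuously.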
And for 88 00:08:13,859 --> 00:08:21,460 this, the people from Formal Concept Analysis proposed an algorithm to extract 89 00:08:21,460 --> 00:08:27,160 implicit knowledge from an expert. So our expert here could be Wikidata. It's an 90 00:08:27,160 --> 00:08:31,240 expert, you can ask Wikidata questions, right? Using this SPARQL interface, you 91 00:08:31,240 --> 00:08:34,739 can ask. You can ask "Is there an example for that? Is there a counterexample for 92 00:08:34,739 --> 00:08:39,880 something else?" So the algorithm is quite easy. The algorithm is the algorithm and 93 00:08:39,880 --> 00:08:45,380 some expert – in our case, Wikidata –, and the algorithm keeps notes for 94 00:08:45,380 --> 00:08:49,449 counterexamples and keeps notes for valid implications. So in the beginning, we do 95 00:08:49,449 --> 00:08:53,569 not have any valid implications, so this list on the right is empty, and in the 96 00:08:53,569 --> 00:08:56,780 beginning we do not have any counterexamples. So the list on the left, 97 00:08:56,780 --> 00:09:01,900 the formal context to build up is also empty. And all the algorithm does now is, 98 00:09:01,900 --> 00:09:09,170 it asks "is this implication, X follows Y, Y follows X or X implies Y, is it true?" 99 00:09:09,170 --> 00:09:14,000 So "is it true," for example, "that an animal that is a mammal and is venomous 100 00:09:14,000 --> 00:09:18,880 lays eggs?" So now the expert, which in our case is Wikidata, can answer it. We 101 00:09:18,880 --> 00:09:24,860 can query that. We showed in our paper we can query that. So we query it, and if the 102 00:09:24,860 --> 00:09:28,491 Wikidata expert does not find any counterexamples, it will say, ok, that's 103 00:09:28,491 --> 00:09:36,200 maybe a true, true thing; it's yes. Or if it's not a true implication in Wikidata, 104 00:09:36,200 --> 00:09:41,779 it can say, no, no, no, it's not true, and here's a counterexample.
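That question-and-answer loop can be sketched roughly as follows. This is a strong simplification: the real attribute exploration algorithm (due to Ganter) generates its questions in a canonical order so that no redundant ones are asked, whereas here the candidate implications are just a fixed list and the expert is a plain function standing in for Wikidata:

```python
def explore(candidates, expert):
    """Heavily simplified attribute exploration.

    expert(premise, conclusion) returns None to confirm the implication,
    or a (name, attributes) pair as a counterexample.
    """
    context = {}    # collected counterexamples (the table on the left)
    accepted = []   # confirmed implications (the list on the right)
    for premise, conclusion in candidates:
        # Skip questions already refuted by a known counterexample.
        if any(premise <= attrs and not conclusion <= attrs
               for attrs in context.values()):
            continue
        answer = expert(premise, conclusion)
        if answer is None:
            accepted.append((premise, conclusion))
        else:
            name, attrs = answer
            context[name] = attrs
    return accepted, context

# A toy expert that knows one counterexample: the duck lays eggs
# but is not venomous.
def expert(premise, conclusion):
    if premise <= {"lays eggs"} and "venomous" in conclusion:
        return ("duck", {"lays eggs"})
    return None

accepted, context = explore(
    [({"lays eggs"}, {"venomous"}),
     ({"mammal", "venomous"}, {"lays eggs"})],
    expert,
)
# The second rule ends up in `accepted`; the duck is recorded in `context`.
```

The tool described later in the talk adds a second expert — the human — who gets to answer after Wikidata, but the bookkeeping stays the same.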
So this is 105 00:09:41,779 --> 00:09:48,510 something you contradict by example. You say this rule cannot be true. For example, 106 00:09:48,510 --> 00:09:52,900 when the street is wet, that does not mean it has rained, right? It could be the 107 00:09:52,900 --> 00:10:01,380 cleaning service car or something else. So our idea now was to use Wikidata as an 108 00:10:01,380 --> 00:10:05,819 expert, but also include a human into this loop. So we do not just want to ask 109 00:10:05,819 --> 00:10:11,709 Wikidata, we also want to ask a human expert as well. So we first ask in our 110 00:10:11,709 --> 00:10:18,520 tool the Wikidata expert for some rule. After that, we also inquire the human 111 00:10:18,520 --> 00:10:22,080 expert. And he can also say "yeah, that's true, I know that," or "No, no. Wikidata 112 00:10:22,080 --> 00:10:27,200 is not aware of this counterexample, I know one." Or, in the other case "oh, 113 00:10:27,200 --> 00:10:32,770 Wikidata says this is true. I am aware of a counterexample." Yeah, and so on and so 114 00:10:32,770 --> 00:10:37,600 on. And you can represent this more or less – this is just some mathematical 115 00:10:37,600 --> 00:10:41,689 picture, it's not very important. But you can see on the left there's an exploration 116 00:10:41,689 --> 00:10:46,720 going on, just Wikidata with the algorithm, on the right an exploration, a 117 00:10:46,720 --> 00:10:51,419 human expert versus Wikidata which can answer all the queries. And we combined 118 00:10:51,419 --> 00:10:57,720 those two into one small tool, still under development. So, back to Max. 119 00:10:57,720 --> 00:11:02,980 M: Okay. So far for that to work, we basically need to have a way of viewing 120 00:11:02,980 --> 00:11:08,070 Wikidata, or at least parts of Wikidata, as a formal context. And this formal 121 00:11:08,070 --> 00:11:13,610 context, well, this was a binary table, so what do we do? 
We just take all the items 122 00:11:13,610 --> 00:11:18,880 in Wikidata as objects and all the properties as attributes of our context 123 00:11:18,880 --> 00:11:24,159 and then have an incidence relation that says "well, this entity has this 124 00:11:24,159 --> 00:11:30,549 property," so it is incident there, and then we end up with a context that has 71 125 00:11:30,549 --> 00:11:36,430 million rows and seven thousand columns. So, well, that might actually be a slight 126 00:11:36,430 --> 00:11:40,180 problem there, because we want to have something that we can run on actual 127 00:11:40,180 --> 00:11:45,811 hardware and not on a supercomputer. So let's maybe not do that and focus on 128 00:11:45,811 --> 00:11:50,900 a smaller set of properties that are actually related to one another through 129 00:11:50,900 --> 00:11:55,689 some kind of common domain, yeah? So it doesn't make any sense to have a property 130 00:11:55,689 --> 00:11:59,640 that relates to spacecraft and then a property that relates to books – that's 131 00:11:59,640 --> 00:12:05,050 probably not a good idea to try to find implicit knowledge between those two. But 132 00:12:05,050 --> 00:12:10,259 two different properties about spacecraft, that sounds good, right? And then the 133 00:12:10,259 --> 00:12:15,000 interesting question is just how do we define the incidence for our set of 134 00:12:15,000 --> 00:12:20,150 properties? And that actually depends very much on which properties we choose, 135 00:12:20,150 --> 00:12:25,550 because it does – for some properties, it makes sense to account for the direction 136 00:12:25,550 --> 00:12:32,679 of the statement: So there is a property called parent? 
Actually, no, it's child, 137 00:12:32,679 --> 00:12:38,309 and then there's father and mother, and you don't want to turn those around, as you 138 00:12:38,309 --> 00:12:43,760 do want to have "A is a child of B" be something different than "B 139 00:12:43,760 --> 00:12:48,930 is a child of A." Then there's the qualifiers that might be important for 140 00:12:48,930 --> 00:12:54,740 some properties. So receiving an award for something might be something different 141 00:12:54,740 --> 00:13:00,740 than receiving an award for something else. But while receiving an award in 2018 142 00:13:00,740 --> 00:13:06,549 and receiving one in 2017, that's probably more or less the same thing, so we don't 143 00:13:06,549 --> 00:13:11,930 necessarily need to differentiate that. And there's also a thing called subclasses 144 00:13:11,930 --> 00:13:15,470 and they form a hierarchy on Wikidata. And you might also want to take that into 145 00:13:15,470 --> 00:13:20,150 account because while winning something that is a Nobel prize, that means also 146 00:13:20,150 --> 00:13:25,190 winning an award itself, and winning the Nobel Peace prize means winning a peace 147 00:13:25,190 --> 00:13:32,586 prize. So there's also implications going on there that you want to respect. So, 148 00:13:32,586 --> 00:13:38,400 to see how we actually do that, let's look at an example. So we have here, well, this 149 00:13:38,400 --> 00:13:47,030 is Donna Strickland. And – I forgot his first name – Ashkin, this is one of the 150 00:13:47,030 --> 00:13:51,720 people that won the Nobel prize in physics with her last year. And also Gérard 151 00:13:51,720 --> 00:13:57,990 Mourou. That is the third one. They all got the Nobel prize in physics last year. 152 00:13:57,990 --> 00:14:04,190 So we have all these statements here, and these two have a qualifier that says 153 00:14:04,190 --> 00:14:10,260 "with: Gérard Mourou" here.
And I don't think the qualifier is on this statement 154 00:14:10,260 --> 00:14:15,160 here, actually, but it doesn't actually matter. So what we've done here is, 155 00:14:15,160 --> 00:14:21,190 put all the entities in the small graph as rows in the table. So we have Strickland 156 00:14:21,190 --> 00:14:27,850 and Mourou and Ashkin, and also Arnold and Curie that are not in the picture. But you 157 00:14:27,850 --> 00:14:33,290 can maybe remember that. And then here we have awarded, and we scaled that by the 158 00:14:33,290 --> 00:14:37,250 instance of the different Nobel prizes that people have won. So that's the 159 00:14:37,250 --> 00:14:42,209 physics Nobel in the first column, the chemistry Nobel Prize in the second column 160 00:14:42,209 --> 00:14:48,380 and just general Nobel prizes in the third column. There's awarded and that is scaled 161 00:14:48,380 --> 00:14:55,240 by the "with" qualifier, so awarded with Gérard Mourou. And then there's field of 162 00:14:55,240 --> 00:15:00,450 work, and we have lasers here and radioactivity, so we scale by the actual 163 00:15:00,450 --> 00:15:06,580 field of work that people have. And well then, if we look at what kind of incidence 164 00:15:06,580 --> 00:15:11,370 we get for Donna Strickland, she has a Nobel prize in physics and that is also a 165 00:15:11,370 --> 00:15:17,190 Nobel prize, and she has that together with Mourou. And she has "field of work: 166 00:15:17,190 --> 00:15:23,220 lasers," but not radioactivity. Then, Mourou himself: he has a Nobel prize in 167 00:15:23,220 --> 00:15:29,450 physics, and that is a Nobel prize, but none of the others. Ashkin gets the Nobel 168 00:15:29,450 --> 00:15:33,890 prize in physics, and that is still a Nobel prize, and he gets that with Gérard 169 00:15:33,890 --> 00:15:40,970 Mourou. And also he works on lasers, but not in radioactivity. So Frances Arnold 170 00:15:40,970 --> 00:15:47,230 has a Nobel prize in chemistry, and that is a Nobel prize. 
And Marie Curie, she has 171 00:15:47,230 --> 00:15:50,510 a Nobel prize in physics and one in chemistry, and they are both a Nobel 172 00:15:50,510 --> 00:15:55,319 prize. And she also works on radioactivity. But lasers didn't exist 173 00:15:55,319 --> 00:16:02,490 back then, so she doesn't get "field of work: lasers." And then basically this 174 00:16:02,490 --> 00:16:10,289 table here is a representation of our formal context. So and then we've actually 175 00:16:10,289 --> 00:16:14,840 gone ahead and started building a tool where you can interactively do all these 176 00:16:14,840 --> 00:16:20,320 things, and it will take care of building the context for you. You just put in the 177 00:16:20,320 --> 00:16:24,540 properties, and Tom will show you how that works. 178 00:16:24,540 --> 00:16:29,030 T: So here you see some first screenshots of this tool. So please do not comment on 179 00:16:29,030 --> 00:16:32,520 the graphic design. We have no idea about that, we have to ask someone about that. 180 00:16:32,520 --> 00:16:36,120 We're just into logics, more or less. On the left, you see the initial state of the 181 00:16:36,120 --> 00:16:41,120 game. On the left you have five boxes: they're called countries and borders, 182 00:16:41,120 --> 00:16:47,370 credit cards, use of energy, memory and computation – I think –, and space 183 00:16:47,370 --> 00:16:53,180 launches, which are just presets we defined. You can explore, for example, in 184 00:16:53,180 --> 00:16:57,050 the case of the credit card, you can explore the properties from Wikidata which 185 00:16:57,050 --> 00:17:02,170 are called "card network," "operator," and "fee," so you can just choose one of them, 186 00:17:02,170 --> 00:17:05,530 or on the right, "custom properties," you can just input the properties you're 187 00:17:05,530 --> 00:17:10,640 interested in Wikidata, whatever one of the seven thousand you like, or some 188 00:17:10,640 --> 00:17:15,140 number of them. 
On the right, I chose then the credit card thingy and I now want to 189 00:17:15,140 --> 00:17:21,860 show you what happens if you now explore these properties, right? The first step in 190 00:17:21,860 --> 00:17:25,750 the game is that the game will ask – I mean, the game, the exploration process – 191 00:17:25,750 --> 00:17:31,020 will ask, is it true that every entity in Wikidata will have these three properties? 192 00:17:31,020 --> 00:17:36,360 So are they common among all entities in your data, which is most probably not 193 00:17:36,360 --> 00:17:41,540 true, right? I mean, not everything in Wikidata has a fee, at least I hope. So, 194 00:17:41,540 --> 00:17:46,520 what I will do now, I would click the "reject this implication" button, since 195 00:17:46,520 --> 00:17:51,480 the implication "Nothing implies everything" is not true. In the second 196 00:17:51,480 --> 00:17:56,360 step now, the algorithm tries to find the minimal number of questions to obtain the 197 00:17:56,360 --> 00:18:01,820 domain knowledge, so to obtain all valid rules in this domain. So next question is 198 00:18:01,820 --> 00:18:06,120 "is it true that everything in Wikidata that has a 'card network' property also 199 00:18:06,120 --> 00:18:12,560 has a 'fee' and an 'operator' property?" And down here you can see Wikidata says 200 00:18:12,560 --> 00:18:18,110 "ok, there are 26 items which are counterexamples," so there's 26 items in 201 00:18:18,110 --> 00:18:22,670 Wikidata which have the "card network" property but do not have the other two 202 00:18:22,670 --> 00:18:28,200 ones. So, 26 is not a big number, this could mean "ok, that's an error, so 26 203 00:18:28,200 --> 00:18:32,860 statements are missing." Or maybe that's really the true case. 204 00:18:32,860 --> 00:18:36,890 That's also ok. But you can now choose what you think is right.
You can say, "oh, 205 00:18:36,890 --> 00:18:40,470 I would say it should be true" or you can say "no, I think that's ok, one of these 206 00:18:40,470 --> 00:18:46,380 counterexamples seems valid. Let's reject it." I, in this case, rejected it. The next 207 00:18:46,380 --> 00:18:51,020 question it asks: "is it true that everything that has an operator has also a 208 00:18:51,020 --> 00:18:56,290 fee and a card network?" Yeah, this is possibly not true. There's also more than 209 00:18:56,290 --> 00:19:03,110 1000 counterexamples, one being, I think, a telecommunication operator in Hungary or 210 00:19:03,110 --> 00:19:10,340 something. And so we can reject this as well. Next question: "everything that has 211 00:19:10,340 --> 00:19:15,360 an operator and a card network – so card network means Visa, MasterCard, whatever, 212 00:19:15,360 --> 00:19:21,690 all this stuff – is it true that they have to have a fee?" Wikidata says "no," it has 213 00:19:21,690 --> 00:19:27,570 23 items that contradict it. But one of the items, for example, is the American 214 00:19:27,570 --> 00:19:32,090 Express Gold Card. I suppose the American Express Gold Card has some fee. So this 215 00:19:32,090 --> 00:19:36,140 indicates, "oh, there is some missing data in Wikidata," there is something that 216 00:19:36,140 --> 00:19:40,680 Wikidata does not know but should know to reason correctly in Wikidata with your 217 00:19:40,680 --> 00:19:46,520 SPARQL queries. So we can now say, "yeah, that's, uh, that's not a reject, that's an 218 00:19:46,520 --> 00:19:51,470 accept," because we think it should be true. But Wikidata thinks otherwise. And 219 00:19:51,470 --> 00:19:55,800 you go on, we go on. This is then the last question: "Is it true that everything that 220 00:19:55,800 --> 00:20:00,950 has a fee and a card network should have an operator," and you see, "oh, no 221 00:20:00,950 --> 00:20:05,930 counterexamples."
This means Wikidata says "this is true," because it says there is no 222 00:20:05,930 --> 00:20:09,580 counterexample. If you're asking Wikidata it says this is a valid implication in the 223 00:20:09,580 --> 00:20:15,400 data set so far, which could also be indicating that something is missing, I'm 224 00:20:15,400 --> 00:20:20,310 not aware if this is possible or not, but ok, for me it sounds reasonable. Everything 225 00:20:20,310 --> 00:20:23,800 that has a fee and a card network should also have an operator, which means a bank or 226 00:20:23,800 --> 00:20:29,220 something like that. So I accept this implication. And then, yeah, you have won 227 00:20:29,220 --> 00:20:34,410 the exploration game, which essentially means you've won some knowledge. Thank 228 00:20:34,410 --> 00:20:40,300 you. And the knowledge is that you know which implications in Wikidata are true or 229 00:20:40,300 --> 00:20:44,340 should be true from your point of view. And yeah, this is more or less the state 230 00:20:44,340 --> 00:20:50,700 of the game so far as we programmed it in October. And the next step will be to 231 00:20:50,700 --> 00:20:54,970 show you some – "How much does your opinion of the world differ from the 232 00:20:54,970 --> 00:20:59,950 opinion that is now reflected in the data?" So is what you think about the data 233 00:20:59,950 --> 00:21:05,430 true, close to true to what is true in Wikidata. Or maybe Wikidata has wrong 234 00:21:05,430 --> 00:21:10,680 information. You can find it with that. But Max will tell you more about that. 235 00:21:10,680 --> 00:21:18,220 M: Ok. So let me just quickly come back to what we have actually done. So we 236 00:21:18,220 --> 00:21:23,670 offer a procedure that allows you to explore properties in Wikidata and the 237 00:21:23,670 --> 00:21:30,720 implicational knowledge that holds between these properties.
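The "scaling" that produced the extra columns in the Nobel example earlier can be sketched as a function that expands one statement into several binary attributes: one per value, one per subclass generalisation, and one per qualifier that matters for the property. The subclass map, the qualifier handling, and the attribute naming scheme below are simplified inventions, not the tool's actual implementation:

```python
# Toy subclass hierarchy: a Nobel prize in physics is a Nobel prize.
subclass_of = {"NobelPhysics": "NobelPrize", "NobelChemistry": "NobelPrize"}

def scale(prop, value, qualifiers):
    """Expand one statement into a set of binary attributes."""
    attrs = {f"{prop}:{value}"}
    # Walk up the subclass hierarchy: winning the physics Nobel prize
    # also counts as winning a Nobel prize.
    v = value
    while v in subclass_of:
        v = subclass_of[v]
        attrs.add(f"{prop}:{v}")
    # Keep only qualifiers that matter for this property: "with" does,
    # "time" does not (2017 vs. 2018 is treated as the same thing).
    for q, qv in qualifiers.items():
        if q == "with":
            attrs.add(f"{prop} {q}:{qv}")
    return attrs

attrs = scale("awarded", "NobelPhysics", {"with": "Mourou", "time": "2018"})
# attrs == {"awarded:NobelPhysics", "awarded:NobelPrize", "awarded with:Mourou"}
```

Applying this to every statement of every chosen item yields exactly the kind of binary cross table walked through on the Donna Strickland slide.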
And the key idea here is 238 00:21:30,720 --> 00:21:34,661 that when you look at these implications that you get, while there might be some 239 00:21:34,661 --> 00:21:39,280 that you don't actually want because they shouldn't be true, and there might also be 240 00:21:39,280 --> 00:21:46,220 ones that you don't get, but you expect to get because they should hold. And these 241 00:21:46,220 --> 00:21:51,840 unwanted and/or missing implications, they point to missing statements and items in 242 00:21:51,840 --> 00:21:56,130 Wikidata. So they show you where the opportunities to improve the knowledge in 243 00:21:56,130 --> 00:22:00,100 Wikidata are, and, well, sometimes you also get to learn something about the 244 00:22:00,100 --> 00:22:04,080 world, and in most cases, it's that the world is more complicated than you thought 245 00:22:04,080 --> 00:22:10,260 it was – and that's just how life is. But in general, implications can guide you in 246 00:22:10,260 --> 00:22:17,220 your way of improving Wikidata and the state of knowledge therein. So what's 247 00:22:17,220 --> 00:22:22,380 next? Well, so what we currently don't offer in the exploration game and what we 248 00:22:22,380 --> 00:22:27,710 definitely will focus next on is having configurable counterexamples and also 249 00:22:27,710 --> 00:22:32,030 filterable counterexamples – right now you just get a list of a random number of 250 00:22:32,030 --> 00:22:36,880 counterexamples. And you might want to search through this list for something you 251 00:22:36,880 --> 00:22:42,520 recognise and you might also want to explicitly say, well, this one should be a 252 00:22:42,520 --> 00:22:48,600 counterexample, and that's definitely coming next. Then, well, domain specific 253 00:22:48,600 --> 00:22:53,750 scaling of properties, there's still much work to be done. Currently, we only have 254 00:22:53,750 --> 00:23:00,500 some very basic support for that.
So you can have properties, but you can't do the 255 00:23:00,500 --> 00:23:03,780 fancy things where you say, "well, everything that is an award should be 256 00:23:03,780 --> 00:23:10,840 considered as one instance of this property." That's also coming, and then 257 00:23:10,840 --> 00:23:15,550 what Tom mentioned already: compare your knowledge that you have explored through 258 00:23:15,550 --> 00:23:21,610 this process against the knowledge that is currently on Wikidata as a form of seeing 259 00:23:21,610 --> 00:23:26,540 "where do you stand? What is missing in Wikidata? How can you improve Wikidata?" 260 00:23:26,540 --> 00:23:32,600 And well, if you have any more suggestions for features, then just tell us. There's a 261 00:23:32,600 --> 00:23:39,530 Github link on the implication game page. And here's the link to the tool again. So, 262 00:23:39,530 --> 00:23:46,140 yeah, just let us know. Open an issue and have fun. And if you have any questions, 263 00:23:46,140 --> 00:23:50,230 then I guess now would be the time to ask. T: Thank you. 264 00:23:50,230 --> 00:23:52,730 Herald: Thank you very much, Tom and Max. 265 00:23:52,730 --> 00:23:55,020 *applause* 266 00:23:55,020 --> 00:24:01,510 Herald: So we will switch microphones now because then I can hand this microphone to 267 00:24:01,510 --> 00:24:07,250 you if any of you have a question for our two speakers. Are there any questions or 268 00:24:07,250 --> 00:24:14,370 suggestions? Yes. Question: Hi. Thanks for the nice talk. I
The most interesting implication so far – 272 00:24:31,850 --> 00:24:36,010 T: The most basic thing: you would expect everything that is launched in space by 273 00:24:36,010 --> 00:24:41,920 humans – no, everything that landed from space, that has a landing date, also has a 274 00:24:41,920 --> 00:24:46,450 start date. So nothing landed on earth, which was not started here. 275 00:24:46,450 --> 00:24:55,200 M: Yes. Q: Right now, the game only helps you find 276 00:24:55,200 --> 00:25:00,710 out implications. Are you also planning to have that I can also add data? Like, for 277 00:25:00,710 --> 00:25:04,309 example, let's say I have twenty five Nobel laureates who don't have a Nobel 278 00:25:04,309 --> 00:25:08,220 laureate ID. Are there plans where you could give me a simple interface for me to 279 00:25:08,220 --> 00:25:12,760 Google and add that ID? Because it would make the process of adding new entities to 280 00:25:12,760 --> 00:25:17,400 Wikidata itself more simple. M: Yes. And that's partly hidden 281 00:25:17,400 --> 00:25:23,050 behind this "configurable and filterable counterexamples" thing. We will probably 282 00:25:23,050 --> 00:25:28,380 not have an explicit interface for adding stuff, but most likely interface with some 283 00:25:28,380 --> 00:25:32,270 other tool built around Wikidata, so probably something that will give you 284 00:25:32,270 --> 00:25:37,100 QuickStatements or something like that. But yes, adding data is definitely on the 285 00:25:37,100 --> 00:25:41,710 roadmap. Herald: Any more questions? Yes. 286 00:25:41,710 --> 00:25:48,860 Q: Wouldn't it be nice to do this in other languages, too? 287 00:25:48,860 --> 00:25:52,600 T: Actually it's language independent, so we use Wikidata and then as far as we 288 00:25:52,600 --> 00:25:58,110 know, Wikidata has no language itself.
You know, it has just items and properties, so 289 00:25:58,110 --> 00:26:02,640 Qs and Ps, and whatever language you use, it should be translated into the language of 290 00:26:02,640 --> 00:26:06,180 the properties, if there is a label for that property or for that item that you 291 00:26:06,180 --> 00:26:12,420 have. So if Wikidata is aware of your language, we are. 292 00:26:12,420 --> 00:26:15,020 Herald: Oh, yes. More! M: Of course, the tool still needs to be 293 00:26:15,020 --> 00:26:18,360 translated, but – T: The tool itself, it should be. 294 00:26:18,360 --> 00:26:21,850 Q: Hi, thanks for the talk. I have a question. Right now you can only find 295 00:26:21,850 --> 00:26:25,990 missing data with this, right? Or surplus data. Do you think you'd be able to 296 00:26:25,990 --> 00:26:31,560 find wrong information with a similar approach? 297 00:26:31,560 --> 00:26:37,001 T: Actually, we do. I mean, if Wikidata has a counterexample to something we would 298 00:26:37,001 --> 00:26:42,830 expect to be true, this could point to wrong data, right? If the counterexample 299 00:26:42,830 --> 00:26:47,450 is a wrong counterexample. If there is a missing property, or a property missing on an 300 00:26:47,450 --> 00:26:58,160 item. Q: OK, I get to ask a second question. So 301 00:26:58,160 --> 00:27:06,000 the horizontal axis in the incidence matrix. You said it has 7000, it spans 302 00:27:06,000 --> 00:27:10,300 7000 columns, right? M: Yes, because there are 7000 properties in 303 00:27:10,300 --> 00:27:13,850 Wikidata. Q: But it's actually way more columns, 304 00:27:13,850 --> 00:27:17,849 right? Because you multiply the properties by the arguments, right? 305 00:27:17,849 --> 00:27:21,360 M: Yes. So if you do any scaling, then of course that might give you multiple 306 00:27:21,360 --> 00:27:23,380 entries. Q: So that's what you mean by scaling, 307 00:27:23,380 --> 00:27:27,770 basically? M: Yes.
But already seven thousand is way 308 00:27:27,770 --> 00:27:35,580 too big to actually compute that. Q: How many would it be if you multiply by 309 00:27:35,580 --> 00:27:48,060 all the arguments? M: I have no idea, probably a few million. 310 00:27:48,060 --> 00:27:55,309 Q: Have you thought about a recursive method, as counterexamples may be wrong by 311 00:27:55,309 --> 00:28:00,350 other counterexamples, like in an argumentation graph or something like 312 00:28:00,350 --> 00:28:06,708 this? T: Actually, I don't get it. How can a 313 00:28:06,708 --> 00:28:14,040 counterexample be wrong through another counterexample? 314 00:28:14,040 --> 00:28:24,450 Q: Maybe some example says that cats can have golden hair, and then another example 315 00:28:24,450 --> 00:28:31,260 might say that this is not a cat. T: Ah, so the property to be a cat or 316 00:28:31,260 --> 00:28:38,000 something cat-ish is missing then. Okay. No, we have not considered deeper 317 00:28:38,000 --> 00:28:44,570 reasoning so far. This Horn-propositional logic, you know, it has no contradictions, 318 00:28:44,570 --> 00:28:47,740 because all you can do is contradict by counterexamples, but there 319 00:28:47,740 --> 00:28:52,740 can never be a rule that is not true, so far. Just in your or my opinion, maybe, 320 00:28:52,740 --> 00:28:56,370 but not in the logic. So what we have to think about is whether we have bigger 321 00:28:56,370 --> 00:29:01,780 reasoning, right? So. Q: Sorry, quick question. Because you're 322 00:29:01,780 --> 00:29:04,929 not considering all the 7000-odd properties for each of the entities, 323 00:29:04,929 --> 00:29:07,570 right? What's your current process of filtering? What are the relevant 324 00:29:07,570 --> 00:29:14,820 properties? I'm sorry, I didn't get that. M: Well, we basically handpick those. So 325 00:29:14,820 --> 00:29:19,940 you have this input field? Yeah, you can go ahead and select your properties.
We also 326 00:29:19,940 --> 00:29:26,870 have some predefined sets. Okay. And there are also some classes for groups of 327 00:29:26,870 --> 00:29:30,780 properties that are related that you could use if you want bigger sets. 328 00:29:30,780 --> 00:29:35,960 T: For example, space or family, or what was the other? 329 00:29:35,960 --> 00:29:43,410 M: Awards is one. T: It depends on the size of the class. 330 00:29:43,410 --> 00:29:47,390 For example, for space, it's not that much, I think it's 10 or 15 properties. It 331 00:29:47,390 --> 00:29:51,520 will take you some hours, but you can do it because there are only 15 or something like 332 00:29:51,520 --> 00:29:58,150 that. I think for family, it's way too much, it's like 40 or 50 properties. So a 333 00:29:58,150 --> 00:30:04,540 lot of questions. Herald: I don't see any more hands. Maybe 334 00:30:04,540 --> 00:30:09,760 someone who has not asked a question yet has another one; we could take that, 335 00:30:09,760 --> 00:30:14,270 otherwise we would be perfectly on time. And maybe you can tell us where you will 336 00:30:14,270 --> 00:30:18,860 be for deeper discussions, where people can find you. 337 00:30:18,860 --> 00:30:22,400 T: Probably at the couches. Herald: The couches, behind our stage. 338 00:30:22,400 --> 00:30:26,720 M: Or just running around somewhere. There are also our DECT numbers on the 339 00:30:26,720 --> 00:30:35,960 slides; it's 6284 for Tom and 6279 for me. So just call and ask where we're hanging 340 00:30:35,960 --> 00:30:38,470 around. Herald: Well then, thank you again. Have a 341 00:30:38,470 --> 00:30:40,210 round of applause. *applause* 342 00:30:40,210 --> 00:30:42,650 T: Thank you. M: Well, thanks for having us. 343 00:30:42,650 --> 00:30:45,310 *applause* 344 00:30:45,310 --> 00:30:49,740 *postroll music* 345 00:30:49,740 --> 00:31:12,000 subtitles created by c3subtitles.de in the year 2020. Join, and help us!