One of the most important steps in building data-intensive apps is to actually model all of this data in MongoDB, and so that's what we're going to talk about in this lecture. It's really crucial that you follow it all the way through, even if at first it's a lot to take in. All right, anyway, let's now get started.

Now, data modeling is probably a very new concept to you, so before we start, let's make clear what we're actually going to talk about. Data modeling is the process of taking unstructured data generated by a real-world scenario and then structuring it into a logical data model in a database. And we do that according to a set of criteria, which we're going to learn about in this video.

For example, let's say that we want to design an online shop data model. Initially there will be a ton of unstructured data that we know we need, right? Stuff like products, categories, customers, orders, shopping carts, suppliers, and so on and so forth. Our goal with data modeling is to then structure this data in a logical way, reflecting the real-world relationships that exist between some of these datasets, a bit like you can see in this example. This is of course just an imaginary situation, but you get the idea, right?

Now, many back-end developers say that data modeling is where we have to think the most, that it's the most demanding part of building an entire application, because it really is not always straightforward, and sometimes there are simply no right answers. So there's not just one unique, correct way of structuring the data. But anyway, I will do my best to lay out the process in this video.

And for that, we're going to go through four steps. In the first step, we'll learn how to identify the different types of relationships that exist between data. Then we're going to understand the difference between referencing, or normalization, and embedding, or denormalization.
In the next, and most important, step, I will show you my framework for deciding whether we should embed documents or reference other documents, based on a couple of different factors. And finally, we have to quickly talk about the different types of referencing, because that's important if referencing is the type of design that we choose for our data.

So this is, in fact, going to be quite a theoretical lecture, but also an absolutely essential one for your progress as a back-end developer, because the way we design our data, so the way we model our data, can make or break an entire application. And there will be a lot of examples along the way to make this process easier. All right.

And the first thing that we're going to talk about is the different types of relationships that can exist between data. There are three big types: one-to-one, one-to-many, and many-to-many. I'm going to use a movie application as an example in this slide, okay?

So first, a one-to-one relationship between data is basically when one field can only have one value. In our movie application example, one movie can only ever have one name, and so this is a simple example of a one-to-one relationship. These relationships are not really that important in terms of data modeling, though.

Now, the most important relationships are the one-to-many relationships. They are so important that in MongoDB we actually distinguish between three types of one-to-many relationships: one-to-few, one-to-many, and one-to-a-ton (or one-to-a-million, or something like that). So the difference here is based on the relative size of the "many". All right.

An example of a one-to-few relationship is that one movie can win many awards, but really just a few: a movie is not going to win a thousand awards, but it can win some. So this is a typical one-to-few relationship.
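Just to make this a bit more concrete, here is a minimal sketch of what such a movie document could look like in the MongoDB shell. The field names here are just my assumptions for illustration, not a schema we're committing to:

// a movie document: "name" is a one-to-one field,
// while "awards" is a one-to-few relationship
db.movies.insertOne({
  name: "Interstellar",              // one movie has exactly one name
  releaseYear: 2014,
  awards: ["Best Visual Effects"]    // a few values at most, never thousands
})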
So you see that, in general, a one-to-many relationship means that one document can relate to many other documents. Now, this might look a bit abstract without the JSON data, but that's actually the purpose here: I just want to show you a conceptual overview of these different types of relationships.

Anyway, in a one-to-many relationship, one document can relate to hundreds or thousands of other documents. For example, one movie can have thousands of reviews in our application, and so this is not really a one-to-few but a one-to-many relationship, okay?

And finally, we have the one-to-a-ton relationship. Imagine we wanted to implement some logging functionality in our app, so basically to know exactly what's going on on our server. These logs can easily grow to millions of documents, and so this is a very typical example of a one-to-a-ton relationship. The difference between "many" and "a ton" is of course a bit fuzzy, but just think that if something can grow almost to infinity, then it's definitely a one-to-a-ton relationship.

So again, the one-to-many relationships are the most important ones to know. By the way, in relational databases there is just "one-to-many", without quantifying how much that "many" actually is. In MongoDB databases, though, it is an extremely important distinction, because it's one of the factors that we're going to use to decide whether we should denormalize or normalize data, as you will learn a bit later.

Anyway, the last type of relationship is the many-to-many one, where one movie can have many actors, but at the same time one actor can play in many movies. So here the relationship basically goes in both directions, where before, in the other types, it only went in one direction. For example, one movie can have many reviews, but one specific review is only for that one movie, right? And the same goes for the awards: one specific award, like the one for best actor, goes to only one movie, not multiple ones.
But with movies and actors, it is indeed different. So again, one movie stars many actors, but one actor also plays in many movies, and so it's a many-to-many relationship. Okay, so keep all of this in mind as we now move forward in this lecture.

And probably the most important aspect that we need to learn about MongoDB databases is referencing and embedding two datasets. We actually already talked a little bit about this before, but let's review it here and also go a bit deeper.

So, each time we have two related datasets, we can either represent that related data in a referenced, or normalized, form, or in an embedded, or denormalized, form. And I keep using the two related terms together, like referencing and normalizing, because you will see both of them being used, and so it's important that you know all of them.

Anyway, in the referenced form we keep the two related datasets, and all their documents, separated. So again, all the data is nicely separated, which is exactly what normalized means. Continuing with the movie database example from before, we would have one movie document and one actor document for each actor. Now, how would we then make the connection between the movie and the actors, so that later in our app we can show which actors played in a particular movie? Because if they are all completely separate documents, the movie has no way of knowing about the actors, right?

Well, that's where the IDs come in. We use the actor IDs in order to create references on the movie document, effectively connecting movies with actors. So you see that in the movie document we have an array where we store the IDs of all the actors, so that when we request data about a certain movie, we can easily identify its actors. Does that make sense?

Now, this type of referencing is called child referencing, because it's the parent, in this case the movie, that references its children, in this case the actors. So we're really creating some sort of hierarchy here, right? There is also parent referencing, and we're going to talk about that a bit later.
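Here's a rough sketch of that referenced, or normalized, design in the MongoDB shell. I'm creating the actor IDs up front just to keep the example self-contained, and the collection and field names are again just assumptions for illustration:

// two standalone actor documents, each with their own ID
const actor1 = ObjectId()
const actor2 = ObjectId()
db.actors.insertMany([
  { _id: actor1, name: "Matthew McConaughey" },
  { _id: actor2, name: "Anne Hathaway" }
])

// child referencing: the parent (movie) stores an array
// with the IDs of its children (the actors)
db.movies.insertOne({
  name: "Interstellar",
  actors: [actor1, actor2]
})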
And by the way, in relational databases, all data is always represented in a normalized form like this. But in a NoSQL database like MongoDB, we can denormalize the data simply by embedding the related documents right into the main document. So now we have all the relevant data about the actors right inside the one main movie document, without the need for separate documents, collections, and IDs. Again, if we choose to denormalize, or embed, our data, we will have one main document containing all the main data as well as the related data. All right.

And the result of this is that our application will need to make fewer queries to the database, because we can get all the data about movies and actors at the same time, which will of course increase our performance. The downside is, of course, that we can't really query the embedded data on its own, and so if that's a requirement for the application, you would have to choose a normalized design.

And since we're talking about the pros and cons of the denormalized form, let's do the same for the normalized design. Basically, it's kind of the opposite of what we just talked about. So there is an improvement in performance when we often need to query the related data on its own, because then we can just query the data that we need, and not always movies and actors together. On the other hand, when we do need to query movies and actors together, we're going to need multiple queries to the database: first a query for the movie, and then from there another query for the actors, and that is of course worse for performance. So when designing your database, this is the kind of stuff that you need to keep in mind. All right.

And now, just as a side note, we could of course begin our thought process with denormalized data and then come to the conclusion that it's best to actually normalize the data. So when thinking about our data model, this way of organizing data works, of course, in both directions.
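And just so you can picture that embedded version as well, here's what the same movie from the earlier sketch could look like in denormalized form; again, the field names are just illustrative:

// denormalized design: the related actor data lives right inside
// the movie document, so one query returns everything at once
db.movies.insertOne({
  name: "Interstellar",
  actors: [
    { name: "Matthew McConaughey", role: "Cooper" },
    { name: "Anne Hathaway", role: "Brand" }
  ]
})

A single findOne() on the movies collection now gives us the movie and its actors together, but there's no easy way to query one of those actors on their own anymore.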
Now, how do we actually decide whether we should normalize or denormalize the data? Well, that's exactly what we're going to learn next.

So when we have two related datasets, we have to decide whether we're going to embed one dataset into the other, or keep them separated and reference from one dataset to the other. And I kind of developed this decision framework, which I'm going to show you, where we use three criteria to make that decision. First, we look at the type of relationship that exists between the datasets. Second, we try to determine the data access pattern of the dataset that we want to either embed or reference, and this just means analyzing how often data is read and written in that dataset. And then we also look at something that I call data closeness. Data closeness is a term that I actually just made up, but what it means is how much the data is really related, and how we want to query the data from the database. This will make more sense when we talk about it in a moment.

Now, to actually make the decision, we need to combine all three of these criteria, and not just use one of them in isolation. So, for example, just because criterion number one says to embed, it doesn't mean that we don't need to look at the other two criteria.

All right, let's start with the relationship type. Usually, when we have a one-to-few relationship, we will always embed the related dataset into the main dataset, just like we learned in the last slide, right? Now, in a one-to-many relationship, things are a bit more fuzzy, so it's okay to either embed or reference; in that case we will have to decide according to the other two criteria. On the other hand, in a one-to-a-ton or a many-to-many relationship, we usually always reference the data. That's because if we actually did embed in that case, we could quickly create way too large documents, potentially even surpassing the maximum size of 16 megabytes. And so the solution for that is, of course, referencing, or normalizing, the data.
As a quick example, let's say that in our movie database we have around 100 images associated with each movie. We could say it's a one-to-many relationship, but are we going to embed that dataset, or should we rather reference it here? Well, we don't really know yet, so let's take a look at the other two criteria.

The second criterion is about data access patterns, which is just a fancy description for evaluating whether a certain dataset is mostly written to or mostly read from. So if the dataset that we're deciding about is mostly read, and the data is not updated a lot, then we should probably embed that dataset. A high read/write ratio just means that there is a lot more reading than writing, and again, a dataset like that is a good candidate for embedding. The reason is that by embedding, we only need one trip to the database per query, while with referencing we need two trips, right? So if we embed data that is read a lot, in each query we save one trip to the database, making the entire process way more performant.

So I think that our movie image example would actually be a good candidate for embedding, because once the 100 images are saved to the database, they are not really updated anymore, since there isn't really anything to update about an image, right? So it's all about reading, and therefore, based on this criterion, we would embed the image documents.

Now, on the other hand, if our data is updated a lot, then we should consider referencing, or normalizing, the data. That's because it's more work for the database engine to update an embedded document than a simpler standalone document, and since our main goal is performance, we just normalize the dataset. In our example, let's say each movie has many reviews, and each review can be marked as helpful by a user. So each time someone clicks "this review was helpful" in our application, we need to update the corresponding review document.
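To give you an idea of what that looks like in practice, here's a sketch of that "mark as helpful" update, assuming the reviews live in their own collection and carry a simple helpful counter; both of those are just my assumptions for this example:

// with a normalized design, marking a review as helpful only touches
// that one small review document, not the whole movie
// (in a real app the review ID would come from the client's request)
const reviewId = db.reviews.findOne()._id
db.reviews.updateOne(
  { _id: reviewId },
  { $inc: { helpfulCount: 1 } }   // atomically bump the counter
)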
Because data like this can change all the time, it's a great candidate for normalizing. Again, that's because we don't want to be querying the movies all the time if all we really want to update is the reviews, by marking them as helpful. Okay, does that make sense?

And finally, the last criterion is what I call data closeness, which is just a measure of how much the data is related. If two datasets really, intrinsically belong together, then they should probably be embedded into one another. In our example, users can have many email addresses on their account, and since these are so intrinsically connected to the user, there is no doubt that the emails should be embedded into the user document.

Now, if we frequently need to query both datasets on their own, then that's a very good reason to normalize the data into two separate datasets, even if they are closely related. So imagine that in our app we have a quiz where users have to identify a movie based on its images. This means that we're going to query a lot of images on their own, without necessarily querying for the movies themselves. And so, if we apply this third criterion, we come to the conclusion that we should actually normalize the image dataset. All right? Because again, if we implement this quiz functionality, images are going to be queried on their own all the time.

So all of this shows that we should really look at all three criteria together, rather than just one of them in isolation, because that might lead to less optimal decisions. And I say "less optimal" instead of "wrong" because there aren't really completely right or completely wrong ways of modeling our data. There are no hard rules; these are just guidelines that you can follow to find the probably most correct way of structuring your data. But again, it's hard to be really, really wrong. Okay?
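And just coming back to that quiz example for a second, here's roughly what querying images on their own could look like once they're normalized into their own collection, with each image keeping a reference to its movie; the collection and field names are, once again, just assumptions:

// grab a random image for the quiz without touching the movies collection at all
const image = db.images.aggregate([{ $sample: { size: 1 } }]).next()

// and when the user answers, the stored reference still lets us
// look up the movie that the image belongs to
db.movies.findOne({ _id: image.movieId })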
Now, let's say that we have chosen to normalize our datasets, so in other words, to reference the data. After that, we still have to choose between three different types of referencing: child referencing, parent referencing, and two-way referencing.

The first type is child referencing, which is the referencing type I actually showed you before, okay? And let's now take the error-logging example that I mentioned earlier, where we could potentially have millions of log documents. In child referencing, we basically keep references to the related child documents in the parent document, and they are usually stored in an array. So you see that each log has an ID, and then in the app document there is an array with all of these IDs, right? However, the problem here is that this array of IDs can become very large if there are lots of children, and that is an anti-pattern in MongoDB, so something that we should avoid at all costs. Also, child referencing makes it so that parents and children are very tightly coupled, which is not always ideal.

But that's exactly why we have parent referencing, where it actually works the other way around. Here, in each child document, we keep a reference to the parent element, therefore the name parent referencing. In this example, the app ID is 23, and so each log has an app field with that 23 ID in it, so that the child always knows its parent. In this case, the parent actually knows nothing about the children, not who they are and not how many there are. So the children are way more isolated and more standalone, which can sometimes be beneficial.
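Here's a quick sketch of those two shapes side by side, with made-up fields for the app and its logs:

// child referencing: the parent app keeps an array with the IDs
// of all its log documents (this is the array that can grow out of control)
const logId1 = ObjectId()
const logId2 = ObjectId()
db.logs.insertMany([
  { _id: logId1, level: "error", message: "Something broke" },
  { _id: logId2, level: "info",  message: "Server restarted" }
])
db.apps.insertOne({ name: "child-ref-app", logs: [logId1, logId2] })

// parent referencing: the same data the other way around;
// each log points to its parent app, and the app knows nothing about its logs
const appId = ObjectId()
db.apps.insertOne({ _id: appId, name: "parent-ref-app" })
db.logs.insertMany([
  { app: appId, level: "error", message: "Something broke" },
  { app: appId, level: "info",  message: "Server restarted" }
])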
So, which of these two types is actually better for this data relationship? Remember how I said that there could be millions of logs, so let's suppose there are two million log documents. In the case of child referencing, that would mean there are two million ID references in the app document, right? And also remember how I said that there is a 16-megabyte limit on documents. So if we kept adding and adding these child IDs into the array on the parent, we would pretty quickly hit that 16-megabyte limit that each BSON document can hold, simply because the array would grow so much. So that's not really going to work, is it?

On the other hand, with parent referencing, that problem is not going to happen. We will simply have two million log documents, just like before, but each of them holds the ID of its parent. There is no array that grows indefinitely, and therefore parent referencing would be the best solution here.

So the conclusion of all this is that, in general, child referencing is best used for one-to-few relationships, where we know beforehand that the array of child documents won't grow that much. On the other hand, parent referencing is best used for one-to-many and one-to-a-ton relationships, like this one. Okay? So again, always keep in mind that one of the most important principles of MongoDB data modeling is that arrays should never be allowed to grow indefinitely, in order to never break that 16-megabyte limit. We also don't want to send our users an array with thousands of IDs each time they request a parent dataset. Okay?

So, did this logic make sense to you? Then let's move on to the third type of referencing, which is two-way referencing, and this time with the movie and actor example I showed you when we talked about many-to-many relationships. Remember that? So again, each movie has many actors, and each actor plays in many movies, and so that's a typical many-to-many relationship. We usually use two-way referencing to design many-to-many relationships, and it works like this: in each movie, we keep references to all the actors that star in that movie, so a bit like in child referencing. However, and at the same time, in each actor we also keep references to all the movies that the actor has played in, a bit like in the sketch below.
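Here's a minimal sketch of that two-way shape, again with illustrative field names:

// two-way referencing: the movie keeps references to its actors...
const movieId = ObjectId()
const actorId = ObjectId()
db.movies.insertOne({ _id: movieId, name: "Interstellar", actors: [actorId] })

// ...and the actor also keeps references to their movies
db.actors.insertOne({ _id: actorId, name: "Matthew McConaughey", movies: [movieId] })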
So movies and actors are connected in both directions, and therefore the name two-way referencing. This makes it really easy to search for both movies and actors completely independently, while also making it easy to find the actors associated with each movie and the movies associated with each actor.

(deep breath) This was quite a long lecture indeed, with a lot of new concepts and principles and guidelines to remember. So, in order to help you with that, here goes a quick summary, with some more general guidelines that you can take a look at whenever you need them.

The most important principle is: structure your data to match the ways that your application queries and updates data. Or, in other words: identify the questions that arise from your application's use cases first, and then model your data so that those questions can be answered in the most efficient way. For example: do I need to query movies and actors always together, or are there scenarios where I only query movies, or only actors? That kind of question is what your data model will be based on.

In general, always favor embedding, unless there is a good reason not to embed, especially in one-to-few and one-to-many relationships. Next up, a one-to-a-ton or a many-to-many relationship is usually a good reason to reference instead of embed. Also, favor referencing when data is updated a lot, and when you need to frequently access a dataset on its own. Use embedding when data is mostly read but rarely updated, and when two datasets belong intrinsically together. Don't allow arrays to grow indefinitely; therefore, if you need to normalize, use child referencing for one-to-few relationships, and parent referencing for one-to-many and one-to-a-ton relationships. And finally, use two-way referencing for many-to-many relationships. All right?

And that pretty much sums it up. I would actually recommend watching this video twice if you can, just because of how important this material really is. All right?
Anyway, see you in the next video.