One of the most important steps in building data-intensive apps is to actually model all of this data in MongoDB, and so that's what we're going to talk about in this lecture. It's really crucial that you follow it all the way through, even if at first it's a lot to take in. All right, anyway, let's now get started.

Now, data modeling is probably a very new concept to you, so before we start, let's make clear what we're actually going to talk about. Data modeling is the process of taking unstructured data generated by a real-world scenario and then structuring it into a logical data model in a database. And we do that according to a set of criteria, which we're going to learn about in this video.

For example, let's say that we want to design an online shop data model. Initially there will be a ton of unstructured data that we know we need, right? Stuff like products, categories, customers, orders, shopping carts, suppliers, and so on and so forth. Our goal with data modeling is to then structure this data in a logical way, reflecting the real-world relationships that exist between some of these datasets, a bit like you can see in this example. This is of course just an imaginary situation, but you get the idea, right?

Now, many back-end developers say that data modeling is where we have to think the most, that it's the most demanding part of building an entire application, because it really is not always straightforward, and sometimes there are simply no right answers. So there's not just one unique, correct way of structuring the data. But anyway, I will do my best to lay out the process in this video.

And for that, we're going to go through four steps. In the first step, we'll learn how to identify the different types of relationships that exist between data. Then we're going to understand the difference between referencing, or normalization, and embedding, or denormalization.
In the next, and most important, step, I will show you my framework for deciding whether we should embed documents or reference other documents, based on a couple of different factors. And finally, we have to quickly talk about the different types of referencing, because that's important if referencing is the type of design that we choose for our data.

So this is, in fact, going to be quite a theoretical lecture, but also an absolutely essential one for your progress as a back-end developer, because the way we design our data, so the way we model our data, can make or break an entire application. And there will be a lot of examples along the way to make this process easier. All right.

And the first thing that we're going to talk about is the different types of relationships that can exist between data. There are three big types: one-to-one, one-to-many, and many-to-many. I'm going to use a movie application as an example in this slide, okay?

So first, a one-to-one relationship between data is basically when one field can only have one value. In our movie application example, one movie can only ever have one name, and so this is a simple example of a one-to-one relationship. These relationships are not really that important in terms of data modeling, though.

Now, the most important relationships are the one-to-many relationships. They are so important that in MongoDB we actually distinguish between three types of one-to-many relationships: one-to-few, one-to-many, and one-to-a-ton (or one-to-a-million, or something like that). So the difference here is based on the relative size of the "many". All right.

An example of a one-to-few relationship is that one movie can win many awards, but really just a few: a movie is not going to win a thousand awards, but it can win some. So this is a typical one-to-few relationship.
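Just to make this a bit more concrete, here is a minimal sketch of what such a movie document could look like in the MongoDB shell. The field names here are just my assumptions for illustration, not a schema we're committing to:

// a movie document: "name" is a one-to-one field,
// while "awards" is a one-to-few relationship
db.movies.insertOne({
  name: "Interstellar",              // one movie has exactly one name
  releaseYear: 2014,
  awards: ["Best Visual Effects"]    // a few values at most, never thousands
})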
So you see that, in general, a one-to-many relationship means that one document can relate to many other documents. Now, this might look a bit abstract without the JSON data, but that's actually the purpose here: I just want to show you a conceptual overview of these different types of relationships.

Anyway, in a one-to-many relationship, one document can relate to hundreds or thousands of other documents. For example, one movie can have thousands of reviews in our application, and so this is not really a one-to-few but a one-to-many relationship, okay?

And finally, we have the one-to-a-ton relationship. Imagine we wanted to implement some logging functionality in our app, so basically to know exactly what's going on on our server. These logs can easily grow to millions of documents, and so this is a very typical example of a one-to-a-ton relationship. The difference between "many" and "a ton" is of course a bit fuzzy, but just think that if something can grow almost to infinity, then it's definitely a one-to-a-ton relationship.

So again, the one-to-many relationships are the most important ones to know. By the way, in relational databases there is just "one-to-many", without quantifying how much that "many" actually is. In MongoDB databases, though, it is an extremely important distinction, because it's one of the factors that we're going to use to decide whether we should denormalize or normalize data, as you will learn a bit later.

Anyway, the last type of relationship is the many-to-many one, where one movie can have many actors, but at the same time one actor can play in many movies. So here the relationship basically goes in both directions, where before, in the other types, it only went in one direction. For example, one movie can have many reviews, but one specific review is only for that one movie, right? And the same goes for the awards: one specific award, like the one for best actor, goes to only one movie, not multiple ones.
But with movies and actors, it is indeed different. So again, one movie stars many actors, but one actor also plays in many movies, and so it's a many-to-many relationship. Okay, so keep all of this in mind as we now move forward in this lecture.

And probably the most important aspect that we need to learn about MongoDB databases is referencing and embedding two datasets. We actually already talked a little bit about this before, but let's review it here and also go a bit deeper.

So, each time we have two related datasets, we can either represent that related data in a referenced, or normalized, form, or in an embedded, or denormalized, form. And I keep using the two related terms together, like referencing and normalizing, because you will see both of them being used, and so it's important that you know all of them.

Anyway, in the referenced form we keep the two related datasets, and all their documents, separated. So again, all the data is nicely separated, which is exactly what normalized means. Continuing with the movie database example from before, we would have one movie document and one actor document for each actor. Now, how would we then make the connection between the movie and the actors, so that later in our app we can show which actors played in a particular movie? Because if they are all completely separate documents, the movie has no way of knowing about the actors, right?

Well, that's where the IDs come in. We use the actor IDs in order to create references on the movie document, effectively connecting movies with actors. So you see that in the movie document we have an array where we store the IDs of all the actors, so that when we request data about a certain movie, we can easily identify its actors. Does that make sense?

Now, this type of referencing is called child referencing, because it's the parent, in this case the movie, that references its children, in this case the actors. So we're really creating some sort of hierarchy here, right? There is also parent referencing, and we're going to talk about that a bit later.
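Here's a rough sketch of that referenced, or normalized, design in the MongoDB shell. I'm creating the actor IDs up front just to keep the example self-contained, and the collection and field names are again just assumptions for illustration:

// two standalone actor documents, each with their own ID
const actor1 = ObjectId()
const actor2 = ObjectId()
db.actors.insertMany([
  { _id: actor1, name: "Matthew McConaughey" },
  { _id: actor2, name: "Anne Hathaway" }
])

// child referencing: the parent (movie) stores an array
// with the IDs of its children (the actors)
db.movies.insertOne({
  name: "Interstellar",
  actors: [actor1, actor2]
})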
And by the way, in relational databases, all data is always represented in a normalized form like this. But in a NoSQL database like MongoDB, we can denormalize the data simply by embedding the related documents right into the main document. So now we have all the relevant data about the actors right inside the one main movie document, without the need for separate documents, collections, and IDs. Again, if we choose to denormalize, or embed, our data, we will have one main document containing all the main data as well as the related data. All right.

And the result of this is that our application will need to make fewer queries to the database, because we can get all the data about movies and actors at the same time, which will of course increase our performance. The downside is, of course, that we can't really query the embedded data on its own, and so if that's a requirement for the application, you would have to choose a normalized design.

And since we're talking about the pros and cons of the denormalized form, let's do the same for the normalized design. Basically, it's kind of the opposite of what we just talked about. So there is an improvement in performance when we often need to query the related data on its own, because then we can just query the data that we need, and not always movies and actors together. On the other hand, when we do need to query movies and actors together, we're going to need multiple queries to the database: first a query for the movie, and then from there another query for the actors, and that is of course worse for performance. So when designing your database, this is the kind of stuff that you need to keep in mind. All right.

And now, just as a side note, we could of course begin our thought process with denormalized data and then come to the conclusion that it's best to actually normalize the data. So when thinking about our data model, this way of organizing data works, of course, in both directions.
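And just so you can picture that embedded version as well, here's what the same movie from the earlier sketch could look like in denormalized form; again, the field names are just illustrative:

// denormalized design: the related actor data lives right inside
// the movie document, so one query returns everything at once
db.movies.insertOne({
  name: "Interstellar",
  actors: [
    { name: "Matthew McConaughey", role: "Cooper" },
    { name: "Anne Hathaway", role: "Brand" }
  ]
})

A single findOne() on the movies collection now gives us the movie and its actors together, but there's no easy way to query one of those actors on their own anymore.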
Now, how do we actually decide whether we should normalize or denormalize the data? Well, that's exactly what we're going to learn next.

So when we have two related datasets, we have to decide whether we're going to embed one dataset into the other, or keep them separated and reference from one dataset to the other. And I kind of developed this decision framework, which I'm going to show you, where we use three criteria to make that decision. First, we look at the type of relationship that exists between the datasets. Second, we try to determine the data access pattern of the dataset that we want to either embed or reference, and this just means analyzing how often data is read and written in that dataset. And then we also look at something that I call data closeness. Data closeness is a term that I actually just made up, but what it means is how much the data is really related, and how we want to query the data from the database. This will make more sense when we talk about it in a moment.

Now, to actually make the decision, we need to combine all three of these criteria, and not just use one of them in isolation. So, for example, just because criterion number one says to embed, it doesn't mean that we don't need to look at the other two criteria.

All right, let's start with the relationship type. Usually, when we have a one-to-few relationship, we will always embed the related dataset into the main dataset, just like we learned in the last slide, right? Now, in a one-to-many relationship, things are a bit more fuzzy, so it's okay to either embed or reference; in that case we will have to decide according to the other two criteria. On the other hand, in a one-to-a-ton or a many-to-many relationship, we usually always reference the data. That's because if we actually did embed in that case, we could quickly create way too large documents, potentially even surpassing the maximum size of 16 megabytes. And so the solution for that is, of course, referencing, or normalizing, the data.
As a quick example, let's say that in our movie database we have around 100 images associated with each movie. We could say it's a one-to-many relationship, but are we going to embed that dataset, or should we rather reference it here? Well, we don't really know yet, so let's take a look at the other two criteria.

The second criterion is about data access patterns, which is just a fancy description for evaluating whether a certain dataset is mostly written to or mostly read from. So if the dataset that we're deciding about is mostly read, and the data is not updated a lot, then we should probably embed that dataset. A high read/write ratio just means that there is a lot more reading than writing, and again, a dataset like that is a good candidate for embedding. The reason is that by embedding, we only need one trip to the database per query, while with referencing we need two trips, right? So if we embed data that is read a lot, in each query we save one trip to the database, making the entire process way more performant.

So I think that our movie image example would actually be a good candidate for embedding, because once the 100 images are saved to the database, they are not really updated anymore, since there isn't really anything to update about an image, right? So it's all about reading, and therefore, based on this criterion, we would embed the image documents.

Now, on the other hand, if our data is updated a lot, then we should consider referencing, or normalizing, the data. That's because it's more work for the database engine to update an embedded document than a simpler standalone document, and since our main goal is performance, we just normalize the dataset. In our example, let's say each movie has many reviews, and each review can be marked as helpful by a user. So each time someone clicks "this review was helpful" in our application, we need to update the corresponding review document.
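To give you an idea of what that looks like in practice, here's a sketch of that "mark as helpful" update, assuming the reviews live in their own collection and carry a simple helpful counter; both of those are just my assumptions for this example:

// with a normalized design, marking a review as helpful only touches
// that one small review document, not the whole movie
// (in a real app the review ID would come from the client's request)
const reviewId = db.reviews.findOne()._id
db.reviews.updateOne(
  { _id: reviewId },
  { $inc: { helpfulCount: 1 } }   // atomically bump the counter
)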
Because data like this can change all the time, it's a great candidate for normalizing. Again, that's because we don't want to be querying the movies all the time if all we really want to update is the reviews, by marking them as helpful. Okay, does that make sense?

And finally, the last criterion is what I call data closeness, which is just a measure of how much the data is related. If two datasets really, intrinsically belong together, then they should probably be embedded into one another. In our example, users can have many email addresses on their account, and since these are so intrinsically connected to the user, there is no doubt that the emails should be embedded into the user document.

Now, if we frequently need to query both datasets on their own, then that's a very good reason to normalize the data into two separate datasets, even if they are closely related. So imagine that in our app we have a quiz where users have to identify a movie based on its images. This means that we're going to query a lot of images on their own, without necessarily querying for the movies themselves. And so, if we apply this third criterion, we come to the conclusion that we should actually normalize the image dataset. All right? Because again, if we implement this quiz functionality, images are going to be queried on their own all the time.

So all of this shows that we should really look at all three criteria together, rather than just one of them in isolation, because that might lead to less optimal decisions. And I say "less optimal" instead of "wrong" because there aren't really completely right or completely wrong ways of modeling our data. There are no hard rules; these are just guidelines that you can follow to find the probably most correct way of structuring your data. But again, it's hard to be really, really wrong. Okay?
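And just coming back to that quiz example for a second, here's roughly what querying images on their own could look like once they're normalized into their own collection, with each image keeping a reference to its movie; the collection and field names are, once again, just assumptions:

// grab a random image for the quiz without touching the movies collection at all
const image = db.images.aggregate([{ $sample: { size: 1 } }]).next()

// and when the user answers, the stored reference still lets us
// look up the movie that the image belongs to
db.movies.findOne({ _id: image.movieId })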
Now, let's say that we have chosen to normalize our datasets, so in other words, to reference the data. After that, we still have to choose between three different types of referencing: child referencing, parent referencing, and two-way referencing.

The first type is child referencing, which is the referencing type I actually showed you before, okay? And let's now take the error-logging example that I mentioned earlier, where we could potentially have millions of log documents. In child referencing, we basically keep references to the related child documents in the parent document, and they are usually stored in an array. So you see that each log has an ID, and then in the app document there is an array with all of these IDs, right? However, the problem here is that this array of IDs can become very large if there are lots of children, and that is an anti-pattern in MongoDB, so something that we should avoid at all costs. Also, child referencing makes it so that parents and children are very tightly coupled, which is not always ideal.

But that's exactly why we have parent referencing, where it actually works the other way around. Here, in each child document, we keep a reference to the parent element, therefore the name parent referencing. In this example, the app ID is 23, and so each log has an app field with that 23 ID in it, so that the child always knows its parent. In this case, the parent actually knows nothing about the children, not who they are and not how many there are. So the children are way more isolated and more standalone, which can sometimes be beneficial.
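Here's a quick sketch of those two shapes side by side, with made-up fields for the app and its logs:

// child referencing: the parent app keeps an array with the IDs
// of all its log documents (this is the array that can grow out of control)
const logId1 = ObjectId()
const logId2 = ObjectId()
db.logs.insertMany([
  { _id: logId1, level: "error", message: "Something broke" },
  { _id: logId2, level: "info",  message: "Server restarted" }
])
db.apps.insertOne({ name: "child-ref-app", logs: [logId1, logId2] })

// parent referencing: the same data the other way around;
// each log points to its parent app, and the app knows nothing about its logs
const appId = ObjectId()
db.apps.insertOne({ _id: appId, name: "parent-ref-app" })
db.logs.insertMany([
  { app: appId, level: "error", message: "Something broke" },
  { app: appId, level: "info",  message: "Server restarted" }
])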
So, which of these two types is actually better for this data relationship? Remember how I said that there could be millions of logs, so let's suppose there are two million log documents. In the case of child referencing, that would mean there are two million ID references in the app document, right? And also remember how I said that there is a 16-megabyte limit on documents. So if we kept adding and adding these child IDs into the array on the parent, we would pretty quickly hit that 16-megabyte limit that each BSON document can hold, simply because the array would grow so much. So that's not really going to work, is it?

On the other hand, with parent referencing, that problem is not going to happen. We will simply have two million log documents, just like before, but each of them holds the ID of its parent. There is no array that grows indefinitely, and therefore parent referencing would be the best solution here.

So the conclusion of all this is that, in general, child referencing is best used for one-to-few relationships, where we know beforehand that the array of child documents won't grow that much. On the other hand, parent referencing is best used for one-to-many and one-to-a-ton relationships, like this one. Okay? So again, always keep in mind that one of the most important principles of MongoDB data modeling is that arrays should never be allowed to grow indefinitely, in order to never break that 16-megabyte limit. We also don't want to send our users an array with thousands of IDs each time they request a parent dataset. Okay?

So, did this logic make sense to you? Then let's move on to the third type of referencing, which is two-way referencing, and this time with the movie and actor example I showed you when we talked about many-to-many relationships. Remember that? So again, each movie has many actors, and each actor plays in many movies, and so that's a typical many-to-many relationship. We usually use two-way referencing to design many-to-many relationships, and it works like this: in each movie, we keep references to all the actors that star in that movie, so a bit like in child referencing. However, and at the same time, in each actor we also keep references to all the movies that the actor has played in, a bit like in the sketch below.
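Here's a minimal sketch of that two-way shape, again with illustrative field names:

// two-way referencing: the movie keeps references to its actors...
const movieId = ObjectId()
const actorId = ObjectId()
db.movies.insertOne({ _id: movieId, name: "Interstellar", actors: [actorId] })

// ...and the actor also keeps references to their movies
db.actors.insertOne({ _id: actorId, name: "Matthew McConaughey", movies: [movieId] })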
So movies and actors are connected in both directions, and therefore the name two-way referencing. This makes it really easy to search for both movies and actors completely independently, while also making it easy to find the actors associated with each movie and the movies associated with each actor.

(deep breath) This was quite a long lecture indeed, with a lot of new concepts and principles and guidelines to remember. So, in order to help you with that, here goes a quick summary, with some more general guidelines that you can take a look at whenever you need them.

The most important principle is: structure your data to match the ways that your application queries and updates data. Or, in other words: identify the questions that arise from your application's use cases first, and then model your data so that those questions can be answered in the most efficient way. For example: do I need to query movies and actors always together, or are there scenarios where I only query movies, or only actors? That kind of question is what your data model will be based on.

In general, always favor embedding, unless there is a good reason not to embed, especially in one-to-few and one-to-many relationships. Next up, a one-to-a-ton or a many-to-many relationship is usually a good reason to reference instead of embed. Also, favor referencing when data is updated a lot, and when you need to frequently access a dataset on its own. Use embedding when data is mostly read but rarely updated, and when two datasets belong intrinsically together. Don't allow arrays to grow indefinitely; therefore, if you need to normalize, use child referencing for one-to-few relationships, and parent referencing for one-to-many and one-to-a-ton relationships. And finally, use two-way referencing for many-to-many relationships. All right?

And that pretty much sums it up. I would actually recommend watching this video twice if you can, just because of how important this material really is. All right?
Anyway, see you in the next video.