1
00:00:00,780 --> 00:00:05,080
In the last couple of videos we've discussed the ack mode, we've discussed

2
00:00:05,100 --> 00:00:09,870
queue groups, and this kind of graceful shutdown anytime a client is about to close down.

3
00:00:10,140 --> 00:00:13,250
And throughout these videos you might've been saying, Stephen, this is way too in-depth.

4
00:00:13,260 --> 00:00:18,000
There's no way I need to know how to correctly close down a connection and stuff like that.

5
00:00:18,000 --> 00:00:23,430
Well, it turns out all this stuff is extraordinarily important, for some reasons we're going to lay out

6
00:00:23,430 --> 00:00:24,420
inside this video.

7
00:00:24,630 --> 00:00:29,390
To be honest with you this video is probably the most important one inside the entire course that's

8
00:00:29,420 --> 00:00:33,960
going to help you understand how this entire asynchronous style of communication between microservices,

9
00:00:34,230 --> 00:00:39,420
and even microservices in general, is really, really hard to manage on the data side.

10
00:00:39,450 --> 00:00:44,990
So we're going to take a look at a couple of diagrams, and it's going to be kind of crazy, but whatever, let's

11
00:00:44,990 --> 00:00:45,920
just get through it.

12
00:00:45,980 --> 00:00:50,730
OK, so we're going to imagine that we're working on a totally different application for just a moment.

13
00:00:50,780 --> 00:00:56,270
The example I'm going to give you is kind of a classic example of how concurrency or handling events

14
00:00:56,270 --> 00:01:00,140
and whatnot can kind of go really wrong really quickly.

15
00:01:00,290 --> 00:01:06,020
So we're going to imagine that we are handling some kind of a banking application. In our banking application,

16
00:01:06,020 --> 00:01:12,530
we're going to have a publisher that can emit events of account deposits and account withdrawals.

17
00:01:12,540 --> 00:01:17,420
And as you can imagine well we're going to keep track of how much money a particular user has inside

18
00:01:17,420 --> 00:01:18,730
their account.

19
00:01:18,890 --> 00:01:21,140
So we're going to emit these two kinds of events.

20
00:01:21,290 --> 00:01:24,950
They're going to go over to these two different channels that have been created inside of our NATS Streaming

21
00:01:24,950 --> 00:01:31,660
Server. We're going to have two services, two copies of the same service, called account service.

22
00:01:31,880 --> 00:01:37,070
These two services are going to be members of the same queue group inside of each of these channels.

23
00:01:37,070 --> 00:01:41,850
So whenever an event flows into NATS Streaming Server, the event is only going to go to exactly

24
00:01:41,850 --> 00:01:47,330
one of these two instances. These two instances of the account service are going to watch for these

25
00:01:47,330 --> 00:01:52,610
incoming events and then depending upon whether it is a deposit or a withdrawal it is going to open

26
00:01:52,610 --> 00:01:57,890
up a file, some plain file on our hard drive, and update the amount of money that a user has.

27
00:01:57,890 --> 00:02:02,990
So by default this user will have zero dollars and we're going to increment or decrement that amount

28
00:02:03,020 --> 00:02:04,180
over time.

29
00:02:04,370 --> 00:02:07,990
Now, a real bank usually doesn't care that much

30
00:02:08,030 --> 00:02:14,210
if you go under zero dollars on your account balance. They're just gonna give you an overdraft fee and

31
00:02:14,210 --> 00:02:18,620
charge you some money for essentially borrowing money for some period of time. But we're going to imagine

32
00:02:18,620 --> 00:02:20,060
that our bank is a little bit different.

33
00:02:20,060 --> 00:02:25,150
We're going to say that a user can never ever have less than zero dollars. If they go below zero,

34
00:02:25,160 --> 00:02:30,130
that is a critical error, and it represents something going extremely wrong inside our application.

35
00:02:30,170 --> 00:02:32,330
So that is a hard requirement.

36
00:02:32,330 --> 00:02:37,160
Let's imagine how our app would work ideally in an ideal situation.

37
00:02:37,160 --> 00:02:42,760
So maybe our publisher comes online and publishes an event of account deposit, seventy dollars. That

38
00:02:42,770 --> 00:02:47,580
would go over to this channel. NATS Streaming Server would take a look at the members of the queue group

39
00:02:47,700 --> 00:02:50,640
and then send this event off to just one of those members.

40
00:02:50,640 --> 00:02:55,050
So in this case maybe it sends it off to this account service right here. The account service would then open up

41
00:02:55,050 --> 00:02:58,620
that file, increment to seventy dollars, and that's it.

42
00:02:58,620 --> 00:02:59,810
We're good to go.

43
00:02:59,820 --> 00:03:02,530
Next up is 40. Maybe that gets handled by this one.

44
00:03:02,730 --> 00:03:05,720
We go to 110, and that's it, all done.

45
00:03:06,230 --> 00:03:11,210
And then finally maybe the user tries to withdraw some money, so that will come down to this queue group

46
00:03:11,270 --> 00:03:16,220
in the channel down here, maybe go off to this account service. We withdraw one hundred dollars, we still

47
00:03:16,220 --> 00:03:23,100
have ten, which means we're still good to go. So that is the ideal situation. But it turns out that there is

48
00:03:23,100 --> 00:03:29,850
an almost infinite number of ways that this process can fail extremely easily.

49
00:03:29,850 --> 00:03:30,630
Incredibly easy.

50
00:03:30,660 --> 00:03:32,560
Just unbelievably easy.

51
00:03:32,610 --> 00:03:39,160
So let's walk through a couple of different ways that this entire system can fail catastrophically. So

52
00:03:39,160 --> 00:03:45,130
the first issue we're going to consider is if a listener fails to process the incoming event. So we're

53
00:03:45,130 --> 00:03:50,620
going to imagine once again maybe this account deposit goes out, gets assigned to this account service

54
00:03:50,620 --> 00:03:54,850
right here and then this account service tries to process this incoming event.

55
00:03:54,850 --> 00:04:00,430
So ideally this thing would open up some file on the hard drive, add in 70, and then save the file.

56
00:04:00,430 --> 00:04:02,610
But what can go wrong with that process?

57
00:04:02,620 --> 00:04:06,700
Well there's really an unbelievable number of things that can go wrong.

58
00:04:06,700 --> 00:04:11,110
This file could already be locked. In other words, some other program could already have this file open

59
00:04:11,350 --> 00:04:16,590
on the hard drive, which would prevent us from opening it and making changes to it. We could also have

60
00:04:16,590 --> 00:04:21,480
some faulty logic inside of here. Maybe before depositing some money we check to make sure that

61
00:04:21,480 --> 00:04:26,640
the user has the ability to deposit some additional money. Maybe there's, for example, a weekly deposit

62
00:04:26,640 --> 00:04:30,270
limit where we don't want any user to deposit too much money.

63
00:04:30,270 --> 00:04:34,920
So in that scenario, well, we might reject that event. If the file is locked, like I just mentioned a moment

64
00:04:34,920 --> 00:04:35,330
ago,

65
00:04:35,460 --> 00:04:37,190
that would be rejected.

66
00:04:37,410 --> 00:04:40,580
Maybe we've got some typo inside the file or something like that.

67
00:04:40,590 --> 00:04:45,330
Maybe there's some totally unpredictable issue where this event just fails to be processed.

68
00:04:45,840 --> 00:04:51,960
So whatever the issue is with our current setup remember if anything goes wrong inside of our listener

69
00:04:53,160 --> 00:04:59,250
ideally we would not acknowledge the event, and so eventually this event will be reprocessed. But it

70
00:04:59,250 --> 00:05:04,960
takes 30 seconds before NATS Streaming Server decides to actually reprocess this event and send

71
00:05:04,960 --> 00:05:08,310
it off to some other service like maybe this one over here.

72
00:05:08,310 --> 00:05:14,040
So while we are waiting those 30 seconds for this thing to be processed again the publisher might go

73
00:05:14,040 --> 00:05:20,460
ahead and publish the remaining two events. So it might say, OK, let's do a deposit of 40. Maybe that gets

74
00:05:20,460 --> 00:05:23,060
handled down here and maybe it gets handled successfully.

75
00:05:23,980 --> 00:05:29,590
And then after that, a couple of seconds later, we try to do a withdrawal. It gets handled down here, and oh, if

76
00:05:29,590 --> 00:05:34,000
we try to withdraw one hundred dollars off 40, we're now going to go into the negatives, and we have a

77
00:05:34,000 --> 00:05:42,960
critical business error. So if, for whatever reason, any event fails to be processed, it can cause a catastrophic

78
00:05:43,200 --> 00:05:45,540
error in the business logic of our program.

79
00:05:45,540 --> 00:05:49,410
And as you saw in the last couple of videos, it is super easy for that to happen.

80
00:05:50,540 --> 00:05:54,170
So what's the next case in which something can fail catastrophically?

81
00:05:54,170 --> 00:06:00,650
Well, if one listener runs more quickly than another. Let's imagine once again we send off 70. It gets

82
00:06:00,650 --> 00:06:05,420
handled by this service and maybe this service for some reason has a backlog of events.

83
00:06:05,480 --> 00:06:11,060
Maybe there's like 100 events that it's waiting to process because this virtual machine that that service

84
00:06:11,060 --> 00:06:17,640
is running on is right now overloaded or who knows what. So maybe this event gets sent over and we're

85
00:06:17,640 --> 00:06:22,170
waiting for this thing to be acknowledged and in the meantime we send over another event to the same

86
00:06:22,170 --> 00:06:27,540
service, and we're now waiting for both these things to be processed and acknowledged. Now these things

87
00:06:27,540 --> 00:06:32,310
have 30 seconds to be processed and it is entirely reasonable that the account service might process

88
00:06:32,310 --> 00:06:34,260
them within that 30 second window.

89
00:06:34,470 --> 00:06:40,130
But in the meantime as we are waiting for them to be processed we might also dispatch a withdrawal and

90
00:06:40,140 --> 00:06:44,580
then maybe that gets sent to this other account service down here that is really really fast.

91
00:06:44,640 --> 00:06:50,870
Maybe we just launched the thing and it has no events to be processed in its backlog. So in

92
00:06:50,870 --> 00:06:55,880
that case this instance of the account service is going to immediately take a look at that incoming event,

93
00:06:56,150 --> 00:07:01,100
try to withdraw one hundred dollars, and once again, whoops, we're in the negatives. Critical business

94
00:07:01,160 --> 00:07:08,480
error. So this is an entirely possible and likely situation. We might successfully eventually process

95
00:07:08,480 --> 00:07:13,880
these events but just because one event went to this service and the others went to this service well

96
00:07:13,880 --> 00:07:15,250
we're totally out of luck.

97
00:07:15,500 --> 00:07:21,680
So here's yet another scenario. As we just saw in the last couple of videos, NATS might have a client

98
00:07:21,690 --> 00:07:26,990
shut down, but it won't actually consider that client to be dead for 10, 20 seconds or so, depending upon

99
00:07:27,020 --> 00:07:28,740
those heartbeat settings.

100
00:07:28,790 --> 00:07:34,900
So let's imagine that this service right here gets shut down without it being a graceful shutdown. Maybe

101
00:07:34,910 --> 00:07:40,190
for whatever reason it just suddenly dies a hundred percent, but for some window of time, 10, 20 seconds

102
00:07:40,190 --> 00:07:43,760
or so, NATS Streaming Server is gonna think that thing is still alive.

103
00:07:43,760 --> 00:07:49,130
So in that scenario, once again, maybe we take the 70, and maybe NATS tries to allocate it to this dead service

104
00:07:49,160 --> 00:07:53,240
because it thinks it's still running. Maybe the same with this event right here.

105
00:07:53,540 --> 00:07:56,840
And then the hundred dollars gets sent over to this service right here.

106
00:07:56,870 --> 00:07:59,290
So once again these things are not going to be processed.

107
00:07:59,360 --> 00:08:04,430
They will eventually, after 30 seconds, when NATS doesn't get that acknowledgment and decides to reallocate

108
00:08:04,430 --> 00:08:06,630
them or assign them to some other service.

109
00:08:06,770 --> 00:08:10,920
But in that 30 second window well we're still going to be waiting.

110
00:08:10,970 --> 00:08:14,170
We're still going to go ahead and proceed with this hundred dollar withdrawal.

111
00:08:14,170 --> 00:08:17,660
And so once again we're going to try to withdraw a hundred dollars off of zero.

112
00:08:17,660 --> 00:08:22,860
Boom, everything fails yet again. All right, just one more little example here.

113
00:08:22,870 --> 00:08:26,680
So in all the slides I've shown you so far we really made the assumption that we were going to do the

114
00:08:26,740 --> 00:08:30,250
deposits and the withdrawal within absolute seconds of each other.

115
00:08:30,280 --> 00:08:34,510
In other words, these events were all going to be sent into NATS Streaming Server at pretty much the same

116
00:08:34,510 --> 00:08:35,350
time.

117
00:08:35,390 --> 00:08:40,390
But let's now imagine for a second that well maybe a user is kind of following what a user actually

118
00:08:40,390 --> 00:08:40,660
does.

119
00:08:40,660 --> 00:08:45,730
They don't make two deposits in a row and then withdraw within seconds. Maybe in this scenario we say

120
00:08:45,790 --> 00:08:48,040
that the first deposit happens on a Tuesday.

121
00:08:48,220 --> 00:08:51,740
Then another deposit on a Wednesday and then a withdrawal on Thursday.

122
00:08:51,910 --> 00:08:56,650
So maybe in this scenario there is a ton of time between each of these events being processed.

123
00:08:56,680 --> 00:09:02,050
So let's now imagine that a user does the deposit. It goes over to the service. Maybe the deposit initially

124
00:09:02,050 --> 00:09:06,590
fails. Maybe NATS Streaming Server sends it or reassigns it somewhere else.

125
00:09:06,640 --> 00:09:07,960
Totally fine if that happens.

126
00:09:07,960 --> 00:09:10,740
We've got a ton of time to actually process this event.

127
00:09:11,010 --> 00:09:17,200
And so eventually, even if this thing initially fails, eventually we deposit seventy dollars. We're good.

128
00:09:17,200 --> 00:09:19,740
So then on Wednesday maybe we do the same thing.

129
00:09:19,840 --> 00:09:24,070
It might get juggled a couple of times back and forth because it's failing to be processed but eventually

130
00:09:24,640 --> 00:09:30,060
we get our money in there and now here comes Thursday and let's do the withdrawal.

131
00:09:30,080 --> 00:09:35,870
So now with this withdrawal let's imagine for a second that the hard drive that we're storing this file

132
00:09:35,870 --> 00:09:43,550
on is really laggy for some reason. Maybe it takes twenty nine point nine nine seconds to open this file,

133
00:09:44,500 --> 00:09:51,060
read the value out, and then like another second to actually write the value in or update that value.

134
00:09:51,290 --> 00:09:56,450
Let's imagine what would happen if it took us twenty nine point nine nine nine seconds to open up that

135
00:09:56,450 --> 00:10:02,070
file off the hard drive. So at twenty nine point nine nine we open up this value and we get the 110 inside

136
00:10:02,150 --> 00:10:08,660
of our application, and then a millisecond later, like a fraction of a second, boom, we just hit 30 seconds,

137
00:10:09,080 --> 00:10:14,540
and at 30 seconds NATS assumes that this service failed to process that event.

138
00:10:15,100 --> 00:10:16,100
And so NATS decides,

139
00:10:16,130 --> 00:10:18,770
okay, well, I better go ahead and try to reprocess this thing.

140
00:10:18,830 --> 00:10:20,630
I'll send it to the other service.

141
00:10:20,630 --> 00:10:25,130
But keep in mind this thing is still successfully processing the event, and there's no actual timeout

142
00:10:25,160 --> 00:10:28,130
on the service to say stop processing after 30 seconds.

143
00:10:28,130 --> 00:10:32,690
Our assumption is at 30 seconds this thing has totally failed and we don't really need to do any cleanup.

144
00:10:32,720 --> 00:10:35,100
That's what we are kind of assuming right now.

145
00:10:35,120 --> 00:10:41,300
So then like two milliseconds later maybe at that point this service goes ahead and finally is able

146
00:10:41,300 --> 00:10:42,280
to update that value.

147
00:10:42,290 --> 00:10:44,510
It says, OK, we're gonna withdraw 100 dollars.

148
00:10:44,540 --> 00:10:50,910
We're down to just 10 and we're gonna save that back into that file. But at that point NATS has already

149
00:10:50,970 --> 00:10:53,880
redistributed that event over to this other service.

150
00:10:53,880 --> 00:10:57,340
And so this service is going to see the incoming event and say oh withdraw a hundred dollars.

151
00:10:57,340 --> 00:10:58,210
Yeah no problem.

152
00:10:58,230 --> 00:10:58,490
Okay.

153
00:10:58,500 --> 00:11:04,510
I can open the file. Maybe now the hard drive is no longer laggy, so it occurs instantaneously, and we're

154
00:11:04,510 --> 00:11:10,420
going to try to subtract a hundred dollars from it, and oh, once again it would take us below zero. Critical

155
00:11:10,480 --> 00:11:11,670
error.

156
00:11:11,710 --> 00:11:20,770
So at this point we've now gone through several scenarios that absolutely positively no two ways about

157
00:11:20,770 --> 00:11:25,420
it can and probably will happen inside of our application.

158
00:11:25,420 --> 00:11:32,740
So we can have some issue with processing these things, and they try to go out of order. We might fail because,

159
00:11:32,740 --> 00:11:38,350
well, they're going out of order because one instance of our service is running slowly or quickly.

160
00:11:39,350 --> 00:11:44,840
We can have the very core issue of one of these services crashing, and NATS tries to throw the event to some

161
00:11:44,840 --> 00:11:47,580
service that isn't actually running. Then, as we just saw,

162
00:11:47,600 --> 00:11:51,770
well, even if we don't run into these kinds of issues where everything is occurring at the same time, we

163
00:11:51,770 --> 00:11:56,600
might run into these kinds of strange corner cases where we try to process the same event twice in

164
00:11:56,600 --> 00:11:56,930
a row.

165
00:11:58,520 --> 00:12:02,240
So at this point hopefully you understand the gravity of these problems.

166
00:12:02,360 --> 00:12:07,610
These are core issues that we kind of have a really tough time addressing.

167
00:12:07,760 --> 00:12:13,140
And what's more, they are almost guaranteed to happen at some point in time. Almost guaranteed. Even in

168
00:12:13,140 --> 00:12:17,360
this scenario where we imagined that these events were occurring within days of each other as opposed

169
00:12:17,360 --> 00:12:18,360
to milliseconds.

170
00:12:18,410 --> 00:12:23,750
We still might run into issues if we try to receive and process the same event twice or even three

171
00:12:23,750 --> 00:12:24,530
times in a row.

172
00:12:25,560 --> 00:12:30,260
So again these are unbelievable issues that are really challenging to solve.

173
00:12:30,310 --> 00:12:35,670
And what's more, these are issues that we can't just somehow solve by using a different event bus. Everything

174
00:12:35,670 --> 00:12:40,920
we just saw is kind of typical of all event bus implementations. There's not really anything particular

175
00:12:40,920 --> 00:12:45,720
about NATS Streaming Server that makes it harder or more challenging to deal with these problems.

176
00:12:45,810 --> 00:12:47,710
Nonetheless we have to deal with them somehow.

177
00:12:48,090 --> 00:12:49,170
So how are we gonna do that?

178
00:12:49,170 --> 00:12:50,250
Well let's take a pause right here.

179
00:12:50,250 --> 00:12:54,060
We'll come back in the next video and take a look at some different ways that we're gonna solve all the

180
00:12:54,060 --> 00:12:55,260
issues we just discussed.