1 00:00:00,780 --> 00:00:05,080 In the last couple of videos we've discussed the ack mode, we've discussed 2 00:00:05,100 --> 00:00:09,870 queue groups, and this kind of graceful shutdown anytime a client is about to close down. 3 00:00:10,140 --> 00:00:13,250 And throughout these videos you might've been saying, "Stephen, this is way too in-depth. 4 00:00:13,260 --> 00:00:18,000 There's no way I need to know how to correctly close down a connection and stuff like that." 5 00:00:18,000 --> 00:00:23,430 Well, it turns out all this stuff is extraordinarily important, for some reasons we're going to lay out 6 00:00:23,430 --> 00:00:24,420 inside this video. 7 00:00:24,630 --> 00:00:29,390 To be honest with you, this video is probably the most important one inside the entire course. It's 8 00:00:29,420 --> 00:00:33,960 going to help you understand how this entire asynchronous style of communication between microservices, 9 00:00:34,230 --> 00:00:39,420 and even microservices in general, is really, really hard to manage on the data side. 10 00:00:39,450 --> 00:00:44,990 So we're going to take a look at a couple of diagrams, and it's going to be kind of crazy, but whatever, let's 11 00:00:44,990 --> 00:00:45,920 just get through it. 12 00:00:45,980 --> 00:00:50,730 OK, so we're going to imagine that we're working on a totally different application for just a moment. 13 00:00:50,780 --> 00:00:56,270 The example I'm going to give you is kind of a classic example of how concurrency, or handling events 14 00:00:56,270 --> 00:01:00,140 and whatnot, can go really wrong really quickly. 15 00:01:00,290 --> 00:01:06,020 So we're going to imagine that we are handling some kind of a banking application. In our banking application, 16 00:01:06,020 --> 00:01:12,530 we're going to have a publisher that can emit events of account deposits and account withdrawals. 
17 00:01:12,540 --> 00:01:17,420 And as you can imagine, we're going to keep track of how much money a particular user has inside 18 00:01:17,420 --> 00:01:18,730 their account. 19 00:01:18,890 --> 00:01:21,140 So we're going to emit these two kinds of events. 20 00:01:21,290 --> 00:01:24,950 They're going to go over to these two different channels that have been created inside of our NATS Streaming 21 00:01:24,950 --> 00:01:31,660 Server. We're going to have two services, two copies of the same service, called Account Service. 22 00:01:31,880 --> 00:01:37,070 These two services are going to be members of the same queue group inside of each of these channels. 23 00:01:37,070 --> 00:01:41,850 So whenever an event flows into NATS Streaming Server, the event is only going to go to exactly 24 00:01:41,850 --> 00:01:47,330 one of these two instances. These two instances of the Account Service are going to watch for these 25 00:01:47,330 --> 00:01:52,610 incoming events, and then, depending upon whether it is a deposit or a withdrawal, it is going to open 26 00:01:52,610 --> 00:01:57,890 up a file, some plain file on our hard drive, and update the amount of money that a user has. 27 00:01:57,890 --> 00:02:02,990 So by default this user will have zero dollars, and we're going to increment or decrement that amount 28 00:02:03,020 --> 00:02:04,180 over time. 29 00:02:04,370 --> 00:02:07,990 Now, a real bank usually doesn't care that much 30 00:02:08,030 --> 00:02:14,210 if you go under zero dollars on your account balance. They're just gonna give you an overdraft fee and 31 00:02:14,210 --> 00:02:18,620 charge you some money for essentially borrowing money for some period of time. But we're going to imagine 32 00:02:18,620 --> 00:02:20,060 that our bank is a little bit different. 33 00:02:20,060 --> 00:02:25,150 We're going to say that a user can never, ever have less than zero dollars. If they go below zero, 
34 00:02:25,160 --> 00:02:30,130 that is a critical error, and it represents something going extremely wrong inside our application. 35 00:02:30,170 --> 00:02:32,330 So that is a hard requirement. 36 00:02:32,330 --> 00:02:37,160 Let's imagine how our app would work in an ideal situation. 37 00:02:37,160 --> 00:02:42,760 So maybe our publisher comes online and publishes an event of account deposit, seventy dollars. So that 38 00:02:42,770 --> 00:02:47,580 would go over to this channel. NATS Streaming Server would take a look at the members of the queue group 39 00:02:47,700 --> 00:02:50,640 and then send this event off to just one of those members. 40 00:02:50,640 --> 00:02:55,050 So in this case maybe it sends it off to this account service right here. The account service will then open up 41 00:02:55,050 --> 00:02:58,620 that file, increment it to seventy dollars, and that's it. 42 00:02:58,620 --> 00:02:59,810 We're good to go. 43 00:02:59,820 --> 00:03:02,530 Next up is 40. Maybe that gets handled by this one. 44 00:03:02,730 --> 00:03:05,720 We go to 110, and that's it, all done. 45 00:03:06,230 --> 00:03:11,210 And then finally maybe the user tries to withdraw some money, so that will come down to this queue group, 46 00:03:11,270 --> 00:03:16,220 this channel down here, and maybe go off to this account service. We withdraw one hundred dollars, we still 47 00:03:16,220 --> 00:03:23,100 have ten, which means we're still good to go. So that is the ideal situation. But it turns out that there is 48 00:03:23,100 --> 00:03:29,850 an almost infinite number of ways that this process can fail extremely easily. 49 00:03:29,850 --> 00:03:30,630 Incredibly easily. 50 00:03:30,660 --> 00:03:32,560 Just unbelievably easily. 
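The ideal flow just described can be sketched as a small simulation. This is a minimal sketch in Python, not the course's actual service: the event shape, the `apply_event` helper, and the `InsufficientFunds` exception are hypothetical names chosen for illustration, and the balance is kept in memory rather than in a file on disk.

```python
class InsufficientFunds(Exception):
    """Raised when a withdrawal would take the balance below zero."""


def apply_event(balance: int, event: dict) -> int:
    """Return the new balance after applying one event (hypothetical event shape)."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    if event["type"] == "withdrawal":
        if event["amount"] > balance:
            # Hard requirement from the example: the balance may never go below zero.
            raise InsufficientFunds(f"cannot withdraw {event['amount']} from {balance}")
        return balance - event["amount"]
    raise ValueError(f"unknown event type: {event['type']}")


# Events processed in exactly the order they were published.
events = [
    {"type": "deposit", "amount": 70},
    {"type": "deposit", "amount": 40},
    {"type": "withdrawal", "amount": 100},
]

balance = 0
history = []
for event in events:
    balance = apply_event(balance, event)
    history.append(balance)

print(history)
```

When every event is processed once and in publish order, the balance moves through 70, 110, and 10, and the zero-dollar rule is never violated. The failure scenarios that follow all break one of those two conditions.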
51 00:03:32,610 --> 00:03:39,160 So let's walk through a couple of different ways that this entire system can fail catastrophically. So 52 00:03:39,160 --> 00:03:45,130 the first issue we're going to consider is if a listener fails to process the incoming event. So we're 53 00:03:45,130 --> 00:03:50,620 going to imagine once again maybe this account deposit goes out, gets assigned to this account service 54 00:03:50,620 --> 00:03:54,850 right here, and then this account service tries to process this incoming event. 55 00:03:54,850 --> 00:04:00,430 So ideally this thing would open up some file on the hard drive, add in 70, and then save the file. 56 00:04:00,430 --> 00:04:02,610 But what can go wrong with that process? 57 00:04:02,620 --> 00:04:06,700 Well, there's really an unbelievable number of things that can go wrong. 58 00:04:06,700 --> 00:04:11,110 This file could be already locked. In other words, some other program could already have this file open 59 00:04:11,350 --> 00:04:16,590 on the hard drive, which would prevent us from opening it and making changes to it. We could also have 60 00:04:16,590 --> 00:04:21,480 some faulty logic inside of here. Maybe before depositing some money we check to make sure that 61 00:04:21,480 --> 00:04:26,640 the user has the ability to deposit some additional money. Maybe there's, for example, a weekly deposit 62 00:04:26,640 --> 00:04:30,270 limit, where we don't want any user to deposit too much money. 63 00:04:30,270 --> 00:04:34,920 So in that scenario, well, we might reject that event. If the file is locked, like I just mentioned a moment 64 00:04:34,920 --> 00:04:35,330 ago, 65 00:04:35,460 --> 00:04:37,190 that would be rejected. 66 00:04:37,410 --> 00:04:40,580 Maybe we've got some typo inside the file or something like that. 67 00:04:40,590 --> 00:04:45,330 Maybe there's some totally unpredictable issue where this event just fails to be processed. 
68 00:04:45,840 --> 00:04:51,960 So whatever the issue is with our current setup, remember, if anything goes wrong inside of our listener, 69 00:04:53,160 --> 00:04:59,250 ideally we would not acknowledge the event, and so eventually this event will be reprocessed. But it 70 00:04:59,250 --> 00:05:04,960 takes 30 seconds before NATS Streaming Server decides to actually reprocess this event and send 71 00:05:04,960 --> 00:05:08,310 it off to some other service, like maybe this one over here. 72 00:05:08,310 --> 00:05:14,040 So while we are waiting those 30 seconds for this thing to be processed again, the publisher might go 73 00:05:14,040 --> 00:05:20,460 ahead and publish the remaining two events. So it might say, OK, let's do a deposit of 40. Maybe that gets 74 00:05:20,460 --> 00:05:23,060 handled down here, and maybe it gets handled successfully. 75 00:05:23,980 --> 00:05:29,590 And then after that, a couple of seconds later, we try to do a withdrawal. It gets handled down here, and oh, if 76 00:05:29,590 --> 00:05:34,000 we try to withdraw one hundred dollars off 40, we're now going to go into the negatives, and we have a 77 00:05:34,000 --> 00:05:42,960 critical business error. So if for whatever reason any event fails to be processed, it can cause a catastrophic 78 00:05:43,200 --> 00:05:45,540 error in the business logic of our program. 79 00:05:45,540 --> 00:05:49,410 And as you saw in the last couple of videos, it's super easy for that to happen. 80 00:05:50,540 --> 00:05:54,170 So what's the next case in which something can fail catastrophically? 81 00:05:54,170 --> 00:06:00,650 Well, if one listener runs more quickly than another. Let's imagine once again we send off 70. It gets 82 00:06:00,650 --> 00:06:05,420 handled by this service, and maybe this service for some reason has a backlog of events. 
83 00:06:05,480 --> 00:06:11,060 Maybe there's like 100 events that it's waiting to process because the virtual machine that that service 84 00:06:11,060 --> 00:06:17,640 is running on is right now overloaded, or who knows what. So maybe this event gets sent over and we're 85 00:06:17,640 --> 00:06:22,170 waiting for this thing to be acknowledged, and in the meantime we send over another event to the same 86 00:06:22,170 --> 00:06:27,540 service, and we're now waiting for both these things to be processed and acknowledged. Now, these things 87 00:06:27,540 --> 00:06:32,310 have 30 seconds to be processed, and it is entirely reasonable that the account service might process 88 00:06:32,310 --> 00:06:34,260 them within that 30 second window. 89 00:06:34,470 --> 00:06:40,130 But in the meantime, as we are waiting for them to be processed, we might also dispatch a withdrawal, and 90 00:06:40,140 --> 00:06:44,580 then maybe that gets sent to this other account service down here that is really, really fast. 91 00:06:44,640 --> 00:06:50,870 Maybe we just launched the thing and it has one or no events to be processed in its backlog. So in 92 00:06:50,870 --> 00:06:55,880 that case this instance of the account service is going to immediately take a look at that incoming event, 93 00:06:56,150 --> 00:07:01,100 try to withdraw one hundred dollars, and once again, whoops, we're in the negatives. Critical business 94 00:07:01,160 --> 00:07:08,480 error. So this is an entirely possible and likely situation. We might successfully eventually process 95 00:07:08,480 --> 00:07:13,880 these events, but just because one event went to this service and the others went to this service, well, 96 00:07:13,880 --> 00:07:15,250 we're totally out of luck. 
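The slow-listener scenario can be reproduced with a few lines of simulation. This is a rough sketch under stated assumptions: the instance names `slow` and `fast`, the delay values, and the `simulate` helper are all made up for illustration, and the only thing modeled is that events take effect in the order the instances finish them rather than the order they were published.

```python
# Assumed per-instance processing delays in seconds (illustrative numbers only).
DELAY = {"slow": 5.0, "fast": 0.1}

# (published_at, kind, amount, instance the queue group happened to pick)
deliveries = [
    (0.0, "deposit", 70, "slow"),
    (1.0, "deposit", 40, "slow"),
    (2.0, "withdrawal", 100, "fast"),
]


def simulate(deliveries):
    """Apply events in the order the instances finish them, not the order published."""
    finished = sorted(deliveries, key=lambda d: d[0] + DELAY[d[3]])
    balance, errors = 0, []
    for published_at, kind, amount, instance in finished:
        if kind == "deposit":
            balance += amount
        elif amount > balance:
            # The withdrawal overtook the deposits it depends on.
            errors.append(f"withdrawal of {amount} rejected at balance {balance}")
        else:
            balance -= amount
    return balance, errors


balance, errors = simulate(deliveries)
print(balance, errors)
```

Both deposits are handled well inside the 30 second window, yet the withdrawal still reaches the file first and is rejected against a balance of zero. Nothing failed in the usual sense; the instances simply ran at different speeds.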
97 00:07:15,500 --> 00:07:21,680 So here's yet another scenario. As we just saw in the last couple of videos, NATS might have a client 98 00:07:21,690 --> 00:07:26,990 shut down, but it won't actually consider that client to be dead for 10, 20 seconds or so, depending upon 99 00:07:27,020 --> 00:07:28,740 those heartbeat settings. 100 00:07:28,790 --> 00:07:34,900 So let's imagine that this service right here gets shut down without it being a graceful shutdown. Maybe 101 00:07:34,910 --> 00:07:40,190 for whatever reason it just suddenly dies, a hundred percent, but for some window of time, 10, 20 seconds 102 00:07:40,190 --> 00:07:43,760 or so, NATS Streaming Server is gonna think that thing is still alive. 103 00:07:43,760 --> 00:07:49,130 So in that scenario, once again, maybe we take the 70, and maybe NATS tries to allocate it to this dead service 104 00:07:49,160 --> 00:07:53,240 because it thinks it's still running. Maybe the same with this event right here. 105 00:07:53,540 --> 00:07:56,840 And then the hundred dollars gets sent over to this service right here. 106 00:07:56,870 --> 00:07:59,290 So once again these things are not going to be processed. 107 00:07:59,360 --> 00:08:04,430 They will be eventually, after 30 seconds, when NATS doesn't get that acknowledgment and decides to reallocate 108 00:08:04,430 --> 00:08:06,630 them or assign them to some other service. 109 00:08:06,770 --> 00:08:10,920 But in that 30 second window, well, we're still going to be waiting. 110 00:08:10,970 --> 00:08:14,170 We're still going to go ahead and proceed with this hundred dollar withdrawal. 111 00:08:14,170 --> 00:08:17,660 And so once again we're going to try to withdraw a hundred dollars off of zero. 112 00:08:17,660 --> 00:08:22,860 Boom, everything fails yet again. All right, just one more little example here. 
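The timing in the crashed-client scenario can be worked through with a little arithmetic. This is a simplified sketch, not a model of how NATS Streaming Server is implemented: `DETECTION_WINDOW` is an assumed value standing in for the "10, 20 seconds or so" that depends on heartbeat settings, and `time_reaching_live_instance` is a hypothetical helper.

```python
ACK_WAIT = 30.0          # seconds before an unacknowledged event is redelivered (from the example)
DETECTION_WINDOW = 20.0  # assumed time the server still believes the crashed client is alive


def time_reaching_live_instance(published_at, routed_to_crashed, crashed_at=0.0):
    """Return when the event first reaches an instance that can actually process it."""
    still_believed_alive = crashed_at <= published_at < crashed_at + DETECTION_WINDOW
    if routed_to_crashed and still_believed_alive:
        # Nobody acknowledges the event, so it only moves on after the ack wait expires.
        return published_at + ACK_WAIT
    # Simplification: otherwise the event goes straight to a live instance.
    return published_at


# (published_at, kind, amount, routed_to_crashed)
published = [
    (1.0, "deposit", 70, True),
    (2.0, "deposit", 40, True),
    (3.0, "withdrawal", 100, False),
]

arrival_order = sorted(published, key=lambda e: time_reaching_live_instance(e[0], e[3]))
kinds_in_arrival_order = [kind for _, kind, _, _ in arrival_order]
print(kinds_in_arrival_order)
```

Under these assumptions the two deposits only reach a working instance at about 31 and 32 seconds, while the withdrawal is handled at 3 seconds, so the withdrawal is applied against an empty account.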
113 00:08:22,870 --> 00:08:26,680 So in all the slides I've shown you so far, we really made the assumption that we were going to do the 114 00:08:26,740 --> 00:08:30,250 deposits and the withdrawal within seconds of each other. 115 00:08:30,280 --> 00:08:34,510 In other words, these events were all going to be sent into NATS Streaming Server at pretty much the same 116 00:08:34,510 --> 00:08:35,350 time. 117 00:08:35,390 --> 00:08:40,390 But let's now imagine for a second that, well, maybe a user is kind of following what a user actually 118 00:08:40,390 --> 00:08:40,660 does. 119 00:08:40,660 --> 00:08:45,730 They don't make two deposits in a row and then withdraw within seconds. Maybe in this scenario we say 120 00:08:45,790 --> 00:08:48,040 that the first deposit happens on a Tuesday, 121 00:08:48,220 --> 00:08:51,740 then another deposit on a Wednesday, and then a withdrawal on Thursday. 122 00:08:51,910 --> 00:08:56,650 So maybe in this scenario there is a ton of time between each of these events being processed. 123 00:08:56,680 --> 00:09:02,050 So let's now imagine that a user does the deposit. It goes over to the service. Maybe the deposit initially 124 00:09:02,050 --> 00:09:06,590 fails, and maybe NATS Streaming Server resends it or reassigns it somewhere else. 125 00:09:06,640 --> 00:09:07,960 Totally fine if that happens. 126 00:09:07,960 --> 00:09:10,740 We've got a ton of time to actually process this event. 127 00:09:11,010 --> 00:09:17,200 And so even if this thing initially fails, eventually we deposit seventy dollars. We're good. 128 00:09:17,200 --> 00:09:19,740 So then on Wednesday maybe we do the same thing. 129 00:09:19,840 --> 00:09:24,070 It might get juggled a couple of times back and forth because it's failing to be processed, but eventually 130 00:09:24,640 --> 00:09:30,060 we get our money in there. And now here comes Thursday, and let's do the withdrawal. 
131 00:09:30,080 --> 00:09:35,870 So now with this withdrawal, let's imagine for a second that the hard drive that we're storing this file 132 00:09:35,870 --> 00:09:43,550 on is really laggy for some reason. Maybe it takes twenty nine point nine nine seconds to open this file 133 00:09:44,500 --> 00:09:51,060 and read the value out, and then like another second to actually write the value in or update that value. 134 00:09:51,290 --> 00:09:56,450 Let's imagine what would happen if it took us twenty nine point nine nine nine seconds to open up that 135 00:09:56,450 --> 00:10:02,070 file off the hard drive. So at twenty nine point nine nine we open up this file and we get the 110 inside 136 00:10:02,150 --> 00:10:08,660 of our application, and then a millisecond later, like a fraction of a second, boom, we just hit 30 seconds, 137 00:10:09,080 --> 00:10:14,540 and at 30 seconds NATS assumes that this service failed to process that event. 138 00:10:15,100 --> 00:10:16,100 And so NATS decides, 139 00:10:16,130 --> 00:10:18,770 "OK, well, I better go ahead and try to reprocess this thing. 140 00:10:18,830 --> 00:10:20,630 I'll send it to the other service." 141 00:10:20,630 --> 00:10:25,130 But keep in mind this thing is still successfully processing the event, and there's no actual timeout 142 00:10:25,160 --> 00:10:28,130 on the service to say stop processing after 30 seconds. 143 00:10:28,130 --> 00:10:32,690 Our assumption is that at 30 seconds this thing has totally failed and we don't really need to do any cleanup. 144 00:10:32,720 --> 00:10:35,100 That's what we are kind of assuming right now. 145 00:10:35,120 --> 00:10:41,300 So then like two milliseconds later, maybe at that point this service goes ahead and finally is able 146 00:10:41,300 --> 00:10:42,280 to update that value. 147 00:10:42,290 --> 00:10:44,510 It says, OK, we're gonna withdraw 100 dollars. 
148 00:10:44,540 --> 00:10:50,910 We're down to just 10, and we're gonna save that back into that file. But at that point NATS has already 149 00:10:50,970 --> 00:10:53,880 redistributed that event over to this other service. 150 00:10:53,880 --> 00:10:57,340 And so this service is going to see the incoming event and say, oh, withdraw a hundred dollars? 151 00:10:57,340 --> 00:10:58,210 Yeah, no problem. 152 00:10:58,230 --> 00:10:58,490 OK. 153 00:10:58,500 --> 00:11:04,510 I can open the file. Maybe now the hard drive is no longer laggy, so it occurs instantaneously, and we're 154 00:11:04,510 --> 00:11:10,420 going to try to subtract a hundred dollars from it, and oh, once again it would take us below zero. Critical 155 00:11:10,480 --> 00:11:11,670 error. 156 00:11:11,710 --> 00:11:20,770 So at this point we've now gone through several scenarios that absolutely, positively, no two ways about 157 00:11:20,770 --> 00:11:25,420 it, can and probably will happen inside of our application. 158 00:11:25,420 --> 00:11:32,740 So we can have some issue with processing these things, and they go out of order. We might fail because, 159 00:11:32,740 --> 00:11:38,350 well, things are going out of order because one instance of our service is running slowly or quickly. 160 00:11:39,350 --> 00:11:44,840 We can have the very core issue of one of these services crashing, and NATS tries to throw the event to some 161 00:11:44,840 --> 00:11:47,580 service that isn't actually running. Then, as we just saw, 162 00:11:47,600 --> 00:11:51,770 well, even if we don't run into these kinds of issues where everything is occurring at the same time, we 163 00:11:51,770 --> 00:11:56,600 might run into these kinds of strange corner cases where we try to process the same event twice in 164 00:11:56,600 --> 00:11:56,930 a row. 165 00:11:58,520 --> 00:12:02,240 So at this point hopefully you understand the gravity of these problems. 
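The double-processing corner case from the laggy hard drive example can be sketched like this. It is an illustrative simulation only: `deliveries_for`, `naive_withdraw`, and the event shape are hypothetical, and it assumes, as in the walkthrough, that the slow instance saves its result just before the second instance reads the file.

```python
ACK_WAIT = 30.0  # seconds; after this the server redelivers an unacknowledged event


def deliveries_for(event, handler_seconds):
    """Return every delivery of one event: the original, plus a redelivery if the ack is late."""
    deliveries = [event]
    if handler_seconds > ACK_WAIT:
        # The first handler is still running, but the server cannot know that.
        deliveries.append(event)
    return deliveries


def naive_withdraw(balance, amount):
    """Listener logic with no memory of which events it has already applied."""
    if amount > balance:
        raise RuntimeError(f"critical error: cannot withdraw {amount} from {balance}")
    return balance - amount


balance = 110
withdrawal = {"id": "evt-3", "amount": 100}  # hypothetical event shape
outcome = []

# Reading takes 29.999 s and writing about 0.002 s more, so the ack arrives just after 30 s.
for delivery in deliveries_for(withdrawal, handler_seconds=29.999 + 0.002):
    try:
        balance = naive_withdraw(balance, delivery["amount"])
        outcome.append("applied")
    except RuntimeError:
        outcome.append("critical error")

print(balance, outcome)
```

The first delivery succeeds and leaves ten dollars; the redelivered copy of the very same withdrawal then hits the zero-dollar rule. Missing the acknowledgment deadline by a single millisecond is enough for one event to be treated as two.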
166 00:12:02,360 --> 00:12:07,610 These are core issues that we have a really tough time addressing. 167 00:12:07,760 --> 00:12:13,140 And what's more, they are almost guaranteed to happen at some point in time. Almost guaranteed. Even in 168 00:12:13,140 --> 00:12:17,360 this scenario where we imagined that these events were occurring within days of each other as opposed 169 00:12:17,360 --> 00:12:18,360 to milliseconds, 170 00:12:18,410 --> 00:12:23,750 we still might run into issues if we tried to receive and process the same event twice or even three 171 00:12:23,750 --> 00:12:24,530 times in a row. 172 00:12:25,560 --> 00:12:30,260 So again, these are unbelievable issues that are really challenging to solve. 173 00:12:30,310 --> 00:12:35,670 And what's more, these are issues that we can't just somehow solve by using a different event bus. Everything 174 00:12:35,670 --> 00:12:40,920 we just saw is kind of typical of all event bus implementations. There's not really anything particular 175 00:12:40,920 --> 00:12:45,720 about NATS Streaming Server that makes it harder or more challenging to deal with these problems. 176 00:12:45,810 --> 00:12:47,710 Nonetheless, we have to deal with them somehow. 177 00:12:48,090 --> 00:12:49,170 So how are we gonna do that? 178 00:12:49,170 --> 00:12:50,250 Well, let's take a pause right here. 179 00:12:50,250 --> 00:12:54,060 We'll come back in the next video and take a look at some different ways that we're gonna solve all the 180 00:12:54,060 --> 00:12:55,260 issues we just discussed.