1 00:00:03,910 --> 00:00:10,300 The Docker Health Check command. It was a new feature added in 1.12, which came out mid 2016, the 2 00:00:10,300 --> 00:00:16,480 same time that SwarmKit and Swarm Mode were available in Docker. It was added really as a part of 3 00:00:16,480 --> 00:00:22,870 that toolkit, but it still works in all the different files like the Dockerfile, the Compose file, 4 00:00:22,870 --> 00:00:27,790 the docker run command uses it, the Stack files support it, the service update and service create commands 5 00:00:27,790 --> 00:00:28,250 support it. 6 00:00:28,260 --> 00:00:29,630 It's everywhere. 7 00:00:29,950 --> 00:00:36,220 I highly recommend that when you're going to production, you do engage in testing options for this health 8 00:00:36,220 --> 00:00:37,090 check command. 9 00:00:37,150 --> 00:00:40,330 It's going to work right out of the box with an exec. 10 00:00:40,390 --> 00:00:45,040 It's going to execute that command inside the container just like if you were running your own exec 11 00:00:45,040 --> 00:00:45,700 command. 12 00:00:45,730 --> 00:00:50,020 So, it's not running it from outside the container; it's just running it inside, which means that even with 13 00:00:50,020 --> 00:00:55,360 simple workers that don't have exposed ports, you can run a simple command in them to validate whether 14 00:00:55,360 --> 00:00:57,780 they're returning good data or whatever. 15 00:00:58,090 --> 00:01:04,300 It's a simple execution of a command, which means it gets a simple return. It expects a 0 or a 1. 16 00:01:04,300 --> 00:01:09,850 In Linux and Windows, you have exit codes from commands and a 0 is a good thing. 17 00:01:09,850 --> 00:01:11,560 It means everything was fine. 18 00:01:11,590 --> 00:01:15,570 Anything other than a 0 is going to be an error in most applications. 19 00:01:15,580 --> 00:01:21,010 But in Docker, we need that application to exit with a 1 specifically. We'll show in a minute how you 20 00:01:21,010 --> 00:01:22,160 do that.
21 00:01:22,330 --> 00:01:28,420 There are only three states to a healthcheck in Docker. It starts out with starting. Starting is the 22 00:01:28,420 --> 00:01:32,480 first 30 seconds, by default, where it hasn't run a healthcheck command yet. 23 00:01:32,710 --> 00:01:34,130 Then it's going to run one. 24 00:01:34,180 --> 00:01:39,160 If that returns a 0, it'll change to the healthy state. 25 00:01:39,280 --> 00:01:45,640 It'll take that command and it'll run it every 30 seconds by default again. If it ever receives an unhealthy 26 00:01:45,640 --> 00:01:49,690 return, like an exit 1, then it marks it as an unhealthy container. 27 00:01:49,690 --> 00:01:52,790 We have options for controlling all of this including retries. 28 00:01:52,810 --> 00:01:53,940 We'll see that in a minute. 29 00:01:54,760 --> 00:02:00,010 This is a much better option than we've had in the past because Docker, until now, was just making 30 00:02:00,010 --> 00:02:01,870 sure the application was still running. 31 00:02:01,930 --> 00:02:05,740 It didn't have any insight into whether that application was doing what it was supposed to. 32 00:02:05,920 --> 00:02:10,120 Now we can do that inside the Docker container itself. 33 00:02:10,330 --> 00:02:13,980 But this isn't a replacement for your third party monitoring solution. 34 00:02:13,990 --> 00:02:20,440 This isn't going to give you graphs, or status over time, or any sort of third party tooling that you 35 00:02:20,440 --> 00:02:22,150 would expect out of a monitoring solution. 36 00:02:22,150 --> 00:02:27,780 This is about Docker understanding if the container itself has a basic level of health. 37 00:02:27,880 --> 00:02:36,490 So, with Nginx, it might request the root index file on localhost. A return of 200 or 300 is fine 38 00:02:36,520 --> 00:02:39,720 and gives it an exit code of 0, and it considers it healthy. 39 00:02:39,910 --> 00:02:43,190 That's not a super advanced monitoring tool.
40 00:02:43,360 --> 00:02:49,630 But if it did return a 404 or 500 error, it would then consider it unhealthy and we can do something about 41 00:02:49,630 --> 00:02:50,620 that. 42 00:02:50,620 --> 00:02:54,270 Where are we going to see this Docker healthcheck in the CLI? 43 00:02:54,520 --> 00:02:57,070 The first place is in container ls. 44 00:02:57,250 --> 00:02:59,430 You'll just see it as this new option. 45 00:02:59,430 --> 00:03:04,030 It's in the middle. We'll see in a second where it'll show us one of the three states if the health check 46 00:03:04,030 --> 00:03:06,530 is running, and that's how we actually know that there's a healthcheck. 47 00:03:06,580 --> 00:03:12,370 That's the easiest way, at least, to know. We'll see the history, the last five results of that healthcheck, 48 00:03:12,370 --> 00:03:19,040 show up in the inspect for that container. And we can see some basic trend over time there. 49 00:03:19,150 --> 00:03:23,460 But the docker run command does not take action on an unhealthy container. 50 00:03:23,620 --> 00:03:29,620 Once the healthcheck considers a container unhealthy, docker run is just going to indicate that in the ls 51 00:03:29,620 --> 00:03:32,690 command, and in the inspect, but it's not going to take action. 52 00:03:32,710 --> 00:03:36,050 That's where we expect the Swarm Services to take action. 53 00:03:36,070 --> 00:03:42,670 So the stacks and services will actually replace that container with a new task, on a new host possibly, 54 00:03:42,670 --> 00:03:44,100 depending on the scheduler. 55 00:03:44,410 --> 00:03:50,200 Even in the update command, we see a little extra bonus by using the healthchecks because the updates 56 00:03:50,550 --> 00:03:56,440 will consider the healthcheck as a part of the readiness for that container before it goes and changes 57 00:03:56,440 --> 00:03:57,370 the next one.
58 00:03:57,370 --> 00:04:02,710 If a container comes up, but it doesn't pass its health check, then the service update won't go to 59 00:04:02,710 --> 00:04:06,750 the next one. Or it'll take action based on the options you give it. 60 00:04:07,470 --> 00:04:10,400 Let's look at a few examples before we go to the command line. 61 00:04:10,410 --> 00:04:12,680 This is one that we're using on docker run. 62 00:04:12,690 --> 00:04:17,670 This allows us to use an existing image that doesn't have a health check in it, and we're adding 63 00:04:17,670 --> 00:04:19,680 the health check in at runtime. 64 00:04:19,710 --> 00:04:26,220 In this case, we're using the Elasticsearch image. You can see the command is a cURL of localhost 65 00:04:26,270 --> 00:04:32,430 9200, which is the port that Elasticsearch is running on inside the container, not the published port, 66 00:04:32,670 --> 00:04:35,380 but inside the container. For Elasticsearch, 67 00:04:35,390 --> 00:04:38,040 there is an actual health URL. 68 00:04:38,070 --> 00:04:39,560 So, we can use that here. 69 00:04:39,570 --> 00:04:43,750 You'll notice the two pipes with the false at the end of that command. 70 00:04:43,890 --> 00:04:45,300 And that's going to be pretty common 71 00:04:45,300 --> 00:04:50,760 if you're using something like cURL or another tool that will send out an error code that's other than 1. 72 00:04:50,810 --> 00:04:56,010 Remember when I mentioned that a while ago? We need it to exit with 1 if there's a problem. Because that's 73 00:04:56,010 --> 00:04:59,240 the one error code that Docker is going to do something about. 74 00:04:59,310 --> 00:05:06,630 We need to make sure that in this case, the shell will always return the exit code of false, which is 1, 75 00:05:06,640 --> 00:05:10,430 if there's anything coming out of that command other than 0. 76 00:05:10,530 --> 00:05:13,250 It's a nice way to get around that problem.
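The two-pipes trick can be tried without Docker at all. This is a minimal sketch, not the lecture's own example: `probe` is a hypothetical stand-in for a health command that fails with a code other than 1 (7 is the code cURL uses when it cannot connect).

```shell
# Hypothetical stand-in for a health command that fails with a code other than 1
# (7 is what cURL returns when it cannot connect to the host).
probe() { return 7; }

# Without the guard, the raw failure code leaks through.
raw=0
probe || raw=$?

# With "|| false" (or "|| exit 1"), any non-zero result collapses to exactly 1.
normalized=0
( probe || false ) || normalized=$?

echo "raw=$raw normalized=$normalized"
```

The subshell only keeps the sketch from exiting your own terminal; inside a real health command the `|| false` sits directly after the cURL call.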
77 00:05:13,310 --> 00:05:18,180 It just so happens with cURL, cURL will give other potential error codes and we don't want it to 78 00:05:18,180 --> 00:05:18,880 do that. 79 00:05:19,380 --> 00:05:25,290 In the actual Dockerfiles, we can add the same command. The format's a little bit different. But you see 80 00:05:25,290 --> 00:05:31,850 that we have these options here. We have the interval, the timeout, the start period (which is new), and retries. 81 00:05:31,950 --> 00:05:35,670 The interval is what you would think it is. It's, by default, every 30 seconds. 82 00:05:35,730 --> 00:05:41,520 That's how often it's going to run this health check. The timeout is how long it's going to wait before it errors 83 00:05:41,520 --> 00:05:48,030 out and returns a bad code, if maybe the app is slow. The start period is a new feature that allows us now, 84 00:05:48,060 --> 00:05:56,280 in 17.09 and newer, to give a longer wait period than the first 30 seconds of the duration. Before, it 85 00:05:56,280 --> 00:05:59,810 would always just wait the interval time before it started the healthcheck. 86 00:05:59,820 --> 00:06:04,710 But maybe you have a Java app, or database, or something that takes a lot longer to start. 87 00:06:04,710 --> 00:06:06,400 Maybe it takes five minutes. 88 00:06:06,540 --> 00:06:11,880 You could add that start period in there. It'll still do healthchecks. But what it will do is it won't 89 00:06:11,880 --> 00:06:17,010 alarm on an unhealthy check until that time has elapsed. 90 00:06:17,010 --> 00:06:22,750 So if you set two minutes in there, even though it's health checking every 30 seconds, it's going to only 91 00:06:22,750 --> 00:06:28,320 consider it unhealthy once it's past that two minute mark. The last one there, retries, means that 92 00:06:28,320 --> 00:06:33,650 we will try this health check x number of times before we consider it unhealthy.
93 00:06:33,720 --> 00:06:38,940 That gives maybe a potentially unstable app a chance to come back with a healthy and recover on 94 00:06:38,940 --> 00:06:42,420 its own before we consider this a truly unhealthy container. 95 00:06:42,510 --> 00:06:45,680 The basic healthcheck command you would use in a Dockerfile is called HEALTHCHECK, 96 00:06:45,720 --> 00:06:51,510 all capital letters there. The same format exists where if we're just doing a simple cURL of the localhost 97 00:06:51,630 --> 00:06:55,010 because maybe it's a PHP app or something. We can do that. 98 00:06:55,200 --> 00:06:59,760 This is how you would add all those options into a Dockerfile. You can see how I add the 99 00:06:59,770 --> 00:07:04,130 timeout, interval, and the retries before the command itself. 100 00:07:04,290 --> 00:07:09,450 The first one there for the basic command, notice I don't have to put in a CMD if I'm just giving it the 101 00:07:09,450 --> 00:07:14,230 command to run. But if I want to show options, if I want to give it custom options out of the box with 102 00:07:14,230 --> 00:07:19,190 the timeout and so on, then I have to specify which one is the command. 103 00:07:19,200 --> 00:07:20,550 Now these aren't two different lines. 104 00:07:20,550 --> 00:07:23,820 Notice the backslash on the end of the first line there. 105 00:07:23,940 --> 00:07:25,530 So don't get that confused. 106 00:07:26,290 --> 00:07:31,270 Here we have a simple example of what it might be like if you had a static application running inside 107 00:07:31,270 --> 00:07:32,450 an Nginx server. 108 00:07:32,500 --> 00:07:37,360 You could set the interval and the timeout from your Dockerfile, and you would just have it simply 109 00:07:37,360 --> 00:07:39,830 do a cURL command on the localhost. 110 00:07:39,850 --> 00:07:45,910 If it returns a 200 or 300, it considers that fine. If it returns a 400, or 500, or something else, it considers 111 00:07:45,910 --> 00:07:46,660 that an error.
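For reference, here is a rough sketch of the kind of Dockerfile described above, written out from the shell so it can be inspected without building anything. The option values, the scratch directory, and the `healthcheck-demo` tag are illustrative assumptions, and it assumes cURL is available in the image. The documented syntax is `HEALTHCHECK [OPTIONS] CMD command`, so the sketch keeps `CMD` in place.

```shell
# Write an illustrative Dockerfile into a scratch directory (hypothetical location).
workdir=$(mktemp -d)

cat > "$workdir/Dockerfile" <<'EOF'
FROM nginx
# Options come first, then CMD marks where the command starts.
# The trailing backslash makes this a single instruction.
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost/ || exit 1
EOF

# To build it (requires Docker): docker image build -t healthcheck-demo "$workdir"
healthcheck_lines=$(grep -c 'HEALTHCHECK' "$workdir/Dockerfile")
echo "HEALTHCHECK instructions: $healthcheck_lines"
```

The `-f` flag makes cURL fail on HTTP errors of 400 and above, and `|| exit 1` turns whatever code it fails with into the 1 that Docker expects.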
112 00:07:46,660 --> 00:07:51,050 You notice here that I have an exit 1, which is the same thing as a false. 113 00:07:51,100 --> 00:07:55,540 I did that just to show you that certain examples on the Internet will have a false. Certain examples 114 00:07:55,540 --> 00:07:56,680 will have an exit 1. 115 00:07:56,680 --> 00:07:58,090 They both do the same thing. 116 00:07:58,390 --> 00:08:00,500 Here's a little bit more advanced example. 117 00:08:00,580 --> 00:08:04,240 In this case, we're using a PHP app that's combined with Nginx. 118 00:08:04,270 --> 00:08:12,070 What I've done is, in the resources, you'll find a link to this PHP example. I've added in a custom 119 00:08:12,160 --> 00:08:18,360 Nginx config file that uses Nginx and PHP-FPM status URLs. 120 00:08:18,400 --> 00:08:24,640 Both of those applications have their own status page and sort of a healthcheck ping URL. 121 00:08:24,910 --> 00:08:29,950 You can use those in your apps if you're using PHP or Nginx. There are two different URLs, 122 00:08:30,100 --> 00:08:32,910 but you can use both of them inside the same healthcheck. 123 00:08:32,950 --> 00:08:37,900 In this case, we're using just one of them, and we're throwing in the localhost/ping, which is 124 00:08:37,930 --> 00:08:39,200 actually a PHP-FPM 125 00:08:39,200 --> 00:08:48,050 status command, but you have to enable that inside your PHP-FPM. Again, in the resources of this lecture, 126 00:08:48,070 --> 00:08:51,990 there's a link to PHP Docker Good Defaults. 127 00:08:52,090 --> 00:08:56,450 You can go check that out on GitHub, where I've shown this example in a little bit more detail. 128 00:08:56,470 --> 00:09:01,740 Next we have a Postgres example so in the Dockerfile I can use a different URL. 129 00:09:01,750 --> 00:09:08,140 Here we have a Postgres application where in the healthcheck command, I'm using a command of pg_isready. 130 00:09:08,140 --> 00:09:13,810 Now, with different apps, there's different tools.
Postgres comes with a built-in tool 131 00:09:13,810 --> 00:09:17,650 that's a very simple test of a connection to a Postgres server. 132 00:09:17,650 --> 00:09:22,330 It doesn't validate that you have good data, or that your database is mounted properly. It's simply 133 00:09:22,330 --> 00:09:26,430 going to say, 'Does this database server allow connections? Yes or no?' 134 00:09:26,440 --> 00:09:28,970 That's a neat one that you can do out of the box. 135 00:09:29,640 --> 00:09:32,710 Here's what it would look like in a Compose/stack file. 136 00:09:32,790 --> 00:09:33,990 Very similar. 137 00:09:33,990 --> 00:09:37,830 You'll notice that the start period down there requires a different version. 138 00:09:37,830 --> 00:09:44,370 Since the healthcheck command came out in 1.12, it was actually supported in 2.1 of this Compose 139 00:09:44,370 --> 00:09:45,090 file. 140 00:09:45,150 --> 00:09:50,790 But, if you're going to use the start period, that means you have to update your Compose file to version 141 00:09:50,870 --> 00:09:56,070 3.4 in order to support that. Because the start period came out over a year after the healthcheck 142 00:09:56,070 --> 00:09:58,690 command did. 143 00:09:58,700 --> 00:10:01,130 Let's start out with some simple run commands. 144 00:10:01,250 --> 00:10:06,050 What we're gonna do here is we're going to start a Postgres database server without the healthcheck 145 00:10:06,080 --> 00:10:08,040 because by default, it doesn't come with one. 146 00:10:08,210 --> 00:10:13,130 Then we're going to run it again with a manual healthcheck command that we'll add at the command 147 00:10:13,130 --> 00:10:18,730 line, and we'll see the difference. 148 00:10:18,730 --> 00:10:24,500 Here, we're just going to call the first one p1. We'll run it detached from the official Postgres 149 00:10:24,520 --> 00:10:25,280 image.
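Looking back at the Compose/stack file format for a moment: the snippet below is an illustrative sketch of a healthcheck section that uses the start period, written to a scratch file from the shell. The service name `web`, the durations, and the stack name `demo` are assumptions, and it assumes cURL exists in the image.

```shell
# Write an illustrative Compose/stack file into a scratch directory (hypothetical location).
stackdir=$(mktemp -d)

cat > "$stackdir/docker-compose.yml" <<'EOF'
version: "3.4"   # the start period needs file format 3.4 or newer
services:
  web:
    image: nginx
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 1m30s
      timeout: 10s
      retries: 3
      start_period: 1m
EOF

# To deploy it (requires a Swarm): docker stack deploy -c "$stackdir/docker-compose.yml" demo
grep -n 'start_period:' "$stackdir/docker-compose.yml"
```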
150 00:10:25,420 --> 00:10:31,220 If I do a docker container ls, you'll see that there's nothing indicating a healthcheck here. 151 00:10:31,420 --> 00:10:40,490 If we do that same command again, and call it p2 this time, we're going to add a health command. 152 00:10:44,140 --> 00:10:48,280 This time, we're going to use the pg_isready, which we talked about earlier, 153 00:10:49,850 --> 00:10:54,710 to test that the connections are available on this Postgres server. We're going to tell it that the 154 00:10:54,710 --> 00:10:59,600 user we need is the postgres user. We don't actually need to give it a password. It's not going to try 155 00:10:59,600 --> 00:11:04,680 to log in. It's just going to try to validate. We'll use the Postgres image. 156 00:11:06,230 --> 00:11:14,620 Now if we do a docker container ls, and I zoom out a little bit, you'll see that it says, 'Up 4 seconds 157 00:11:14,620 --> 00:11:16,050 health is starting.' 158 00:11:16,150 --> 00:11:22,510 Now we get this additional feature in our status of our ls command. It will stay in the starting 159 00:11:22,510 --> 00:11:23,940 state for the default 160 00:11:23,940 --> 00:11:27,450 30 seconds until it runs the healthcheck command for the first time. 161 00:11:27,760 --> 00:11:34,760 Now that we've waited over 30 seconds, you'll see that it's changed to a status of healthy. 162 00:11:34,810 --> 00:11:44,240 If we do a docker container inspect on that p2, we'll see at the very top of that that we have 163 00:11:44,240 --> 00:11:48,780 this new health status option. In this case, it's only been able to run twice. 164 00:11:49,070 --> 00:11:55,120 You can see the output there, that it's showing it's accepting connections. 165 00:11:55,140 --> 00:12:00,750 All right. Let's do some service create commands with that same database, and that same test healthcheck.
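The p1 and p2 steps above can be sketched roughly like this. It is a dry run by default, so it only prints the commands; the `run` helper and the `RUN` variable are hypothetical conveniences, not part of Docker. The `POSTGRES_PASSWORD` value is also an assumption: recent official Postgres images refuse to start without one, which the recording predates.

```shell
# Dry-run helper (hypothetical): prints each command; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

demo() {
  # p1: no healthcheck, so container ls shows nothing about health.
  run docker container run -d --name p1 -e POSTGRES_PASSWORD=example postgres

  # p2: same image, with a healthcheck added at runtime.
  run docker container run -d --name p2 -e POSTGRES_PASSWORD=example \
    --health-cmd="pg_isready -U postgres || exit 1" postgres

  # The STATUS column shows the health state; inspect holds the recent results.
  run docker container ls
  run docker container inspect --format '{{json .State.Health}}' p2
}

demo
```

Running it with `RUN=1` on a machine with Docker executes the same four commands for real.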
166 00:12:00,760 --> 00:12:05,680 What we'll see here when we do this is that there are three different states that a service goes through 167 00:12:05,680 --> 00:12:06,530 on starting up. 168 00:12:06,540 --> 00:12:11,800 It's preparing, which usually means it's downloading the image. It's starting, which means it's executing 169 00:12:11,890 --> 00:12:14,850 the container and bringing it up. Then it's running. 170 00:12:14,980 --> 00:12:19,600 Without the healthcheck command, the starting and running are very quick. They're almost instantaneous. 171 00:12:19,720 --> 00:12:25,170 We'll see that here with a docker service create --name p1 postgres. 172 00:12:25,480 --> 00:12:30,430 Once it's done preparing by downloading the image, you'll see that it goes immediately from starting 173 00:12:30,430 --> 00:12:35,590 to running, because there is no healthcheck. It doesn't have anything else to do other than start the 174 00:12:35,590 --> 00:12:37,610 container and say, 'Yep. The binary is running.' 175 00:12:37,750 --> 00:12:43,600 But let's do that same command, docker service create, and call it p2 like before, and give it that same 176 00:12:43,600 --> 00:12:44,240 health command. 177 00:12:49,110 --> 00:12:52,470 We start this service with the healthcheck command built in. 178 00:12:52,590 --> 00:12:58,050 What we'll see is that it'll go from preparing to starting, and it will sit at the starting state for 179 00:12:58,050 --> 00:13:01,780 the default 30 seconds until the first healthcheck runs. 180 00:13:01,890 --> 00:13:08,190 This is now the Docker service expecting a healthy state before it considers this service fully 181 00:13:08,190 --> 00:13:14,280 running. After the 30 seconds is over, it'll shift to the running state. Then we get the last little 182 00:13:14,280 --> 00:13:17,670 verify there, just to make sure that it's considered stable, and then we're done.
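The service version of the same comparison might look like the sketch below. As before it is a dry run that only prints the commands; the `run` helper, the `RUN` variable, the `POSTGRES_PASSWORD` value, and the explicit interval and retries values are illustrative assumptions. Running it for real needs a node in Swarm mode.

```shell
# Dry-run helper (hypothetical): prints each command; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

service_demo() {
  # p1: no healthcheck, so the task goes from starting to running almost at once.
  run docker service create --name p1 -e POSTGRES_PASSWORD=example postgres

  # p2: the task stays in starting until the first healthcheck passes.
  run docker service create --name p2 -e POSTGRES_PASSWORD=example \
    --health-cmd="pg_isready -U postgres || exit 1" \
    --health-interval=30s --health-retries=3 postgres

  # Shows the current state of each task in the service.
  run docker service ps p2
}

service_demo
```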
183 00:13:17,670 --> 00:13:21,560 You can already see, out of the box, that with services, as well as service updates, 184 00:13:21,570 --> 00:13:26,950 we're going to get this extra bonus of health concept if we use these commands whenever we can.