1
00:00:03,910 --> 00:00:10,300
The Docker Health Check command. It was a new feature added in 1.12, which came out mid 2016, the

2
00:00:10,300 --> 00:00:16,480
same time that Swarm Kit and Swarm Mode were available in Docker. It was added really as a part of

3
00:00:16,480 --> 00:00:22,870
that toolkit, but it still works in all the different files like the Dockerfile, the Compose file,

4
00:00:22,870 --> 00:00:27,790
the docker run command uses it, the Stack files support it, the service update and service create command

5
00:00:27,790 --> 00:00:28,250
support it.

6
00:00:28,260 --> 00:00:29,630
It's everywhere.

7
00:00:29,950 --> 00:00:36,220
I highly recommend that when you're going production, you do engage in testing options for this health

8
00:00:36,220 --> 00:00:37,090
check command.

9
00:00:37,150 --> 00:00:40,330
It's going to work right out of the box with an exec.

10
00:00:40,390 --> 00:00:45,040
It's going to execute that command inside the container just like if you were running your own exec

11
00:00:45,040 --> 00:00:45,700
command.

12
00:00:45,730 --> 00:00:50,020
So, it's not running it from outside the container; it's just running it inside which means that even

13
00:00:50,020 --> 00:00:55,360
simple workers that don't have exposed ports, you can run a simple command in them to validate whether

14
00:00:55,360 --> 00:00:57,780
they're returning good data or whatever.

15
00:00:58,090 --> 00:01:04,300
It's a simple execution of a command, which means it gets a simple return. It expects a 0 or a 1.

16
00:01:04,300 --> 00:01:09,850
In Linux and Windows, you have exit codes from commands and a 0 is a good thing.

17
00:01:09,850 --> 00:01:11,560
It means everything was fine.

18
00:01:11,590 --> 00:01:15,570
Anything other than a 0 is going to be an error in most applications.

19
00:01:15,580 --> 00:01:21,010
But in Docker, we need that application to exit a 1 specifically. We'll show in a minute how you

20
00:01:21,010 --> 00:01:22,160
do that.

21
00:01:22,330 --> 00:01:28,420
There's only three states to a healthcheck in Docker. It starts out with starting. Starting is the

22
00:01:28,420 --> 00:01:32,480
first 30 seconds, by default, where it hasn't run a healthcheck command yet.

23
00:01:32,710 --> 00:01:34,130
Then it's going to run one.

24
00:01:34,180 --> 00:01:39,160
If that returns a 0, it'll start with the healthy. It'll change to the healthy option.

25
00:01:39,280 --> 00:01:45,640
It'll take that command and it'll run it every 30 seconds by default again. If it ever receives an unhealthy

26
00:01:45,640 --> 00:01:49,690
return, like an exit 1, then it marks it as an unhealthy container.

27
00:01:49,690 --> 00:01:52,790
We have options for controlling all of this including retries.

28
00:01:52,810 --> 00:01:53,940
We'll see that in a minute.

29
00:01:54,760 --> 00:02:00,010
This is a much better option than we've had in the past because Docker, until now, was just making

30
00:02:00,010 --> 00:02:01,870
sure the application was still running.

31
00:02:01,930 --> 00:02:05,740
It didn't have any insight into whether that application was doing what it was supposed to.

32
00:02:05,920 --> 00:02:10,120
Now we can do that inside the Docker container itself.

33
00:02:10,330 --> 00:02:13,980
But this isn't a replacement for your third party monitoring solution.

34
00:02:13,990 --> 00:02:20,440
This isn't going to give you graphs, or status over time, or any sort of third party tooling that you

35
00:02:20,440 --> 00:02:22,150
would expect out of a monitoring solution.

36
00:02:22,150 --> 00:02:27,780
This is about Docker understanding if the container itself has a basic level of healthy.

37
00:02:27,880 --> 00:02:36,490
So, in a Nginx, it might return a localhost of the root index file. A return of 200 or 300 is fine

38
00:02:36,520 --> 00:02:39,720
and gives it an exit code of 0, and it considers it healthy.

39
00:02:39,910 --> 00:02:43,190
That's not a super advanced uh, you know, monitoring tool.

40
00:02:43,360 --> 00:02:49,630
But if it did return a 404 or 500 error, it would then consider it unhealthy and we can do something about

41
00:02:49,630 --> 00:02:50,620
that.

42
00:02:50,620 --> 00:02:54,270
Where are we going to see this Docker healthcheck in the GUI?

43
00:02:54,520 --> 00:02:57,070
The first place is in container ls.

44
00:02:57,250 --> 00:02:59,430
It'll just see it as this new option.

45
00:02:59,430 --> 00:03:04,030
It's in the middle. We'll see in a second where it'll show us one of the three states if the health check

46
00:03:04,030 --> 00:03:06,530
is running, and that's how we actually know that there's a healthcheck.

47
00:03:06,580 --> 00:03:12,370
That's the easiest way, at least, to know. We'll see the history, the last five of that healthcheck,

48
00:03:12,370 --> 00:03:19,040
show up in the inspect for that container. And we can see some basic trend over time there.

49
00:03:19,150 --> 00:03:23,460
But the docker run command does not take action on an unhealthy container.

50
00:03:23,620 --> 00:03:29,620
Once the healthcheck considers a container unhealthy, docker run is just going to indicate that in the ls

51
00:03:29,620 --> 00:03:32,690
command, and in the inspect, but it's not going to take action.

52
00:03:32,710 --> 00:03:36,050
That's where we expect the Swarm Services to take action.

53
00:03:36,070 --> 00:03:42,670
So the stacks and services will actually replace that container with a new task, on a new host possibly,

54
00:03:42,670 --> 00:03:44,100
depending on the scheduler.

55
00:03:44,410 --> 00:03:50,200
Even in the update command, we see a little extra bonus by using the healthchecks because the updates

56
00:03:50,550 --> 00:03:56,440
will consider the healthcheck as a part of the readiness for that container before it goes and changes

57
00:03:56,440 --> 00:03:57,370
the next one.

58
00:03:57,370 --> 00:04:02,710
If a container comes up, but it doesn't pass its health check, then the service update won't go to

59
00:04:02,710 --> 00:04:06,750
the next one. Or it'll take action based on the changes you give it.

60
00:04:07,470 --> 00:04:10,400
Let's look at a few examples before we go to the command line.

61
00:04:10,410 --> 00:04:12,680
This is one that we're using on docker run.

62
00:04:12,690 --> 00:04:17,670
This allows us to use an existing image that doesn't have a health check in it, and we're adding

63
00:04:17,670 --> 00:04:19,680
the health check in at runtime.

64
00:04:19,710 --> 00:04:26,220
In this case, we're using the Elasticsearch image. You can see the command is a cURL localhost

65
00:04:26,270 --> 00:04:32,430
9200, which is the port that the Elasticsearch is running on inside the container, not the published port,

66
00:04:32,670 --> 00:04:35,380
but inside the container. For Elasticsearch,

67
00:04:35,390 --> 00:04:38,040
there is an actual health URL.

68
00:04:38,070 --> 00:04:39,560
So, we can use that here.

69
00:04:39,570 --> 00:04:43,750
You'll notice the two pipes with the false at the end of that command.

70
00:04:43,890 --> 00:04:45,300
And that's going to be pretty common

71
00:04:45,300 --> 00:04:50,760
if using something like cURL or another tool that will send out an error code that's other than 1.

72
00:04:50,810 --> 00:04:56,010
Remember when I mentioned that while ago? We need it to exit with 1 if there's a problem. Because that's

73
00:04:56,010 --> 00:04:59,240
the one error code that Docker is going to do something about.

74
00:04:59,310 --> 00:05:06,630
We need to make sure that in this case, a shell will always return the false 1 exit code

75
00:05:06,640 --> 00:05:10,430
if there's anything coming out of that command other than 0.

76
00:05:10,530 --> 00:05:13,250
It's a nice way to get around that problem.

77
00:05:13,310 --> 00:05:18,180
It just so happens with cURL, cURL will give other potential error codes and we don't want it to

78
00:05:18,180 --> 00:05:18,880
do that.

79
00:05:19,380 --> 00:05:25,290
In the actual Docker files, we can add the same command. The format's a little bit different. But you see

80
00:05:25,290 --> 00:05:31,850
that we have these options here. We have the interval, the timeout, the start period (which is new), and retries.

81
00:05:31,950 --> 00:05:35,670
The interval is what you would think it is. It's, by default, every 30 seconds.

82
00:05:35,730 --> 00:05:41,520
How often it's going to run this health check. The time out is how long it's going to wait before it errors

83
00:05:41,520 --> 00:05:48,030
out and returns a bad code, if maybe the app is slow. The start period is a new feature that allows us now

84
00:05:48,060 --> 00:05:56,280
in 17.09 and newer, to give a longer wait period than the first 30 seconds of the duration. Before, it

85
00:05:56,280 --> 00:05:59,810
would always just wait the long...the interval time before it started the healthcheck.

86
00:05:59,820 --> 00:06:04,710
But maybe you have a Java app, or database, or something that takes a lot longer to start.

87
00:06:04,710 --> 00:06:06,400
Maybe it takes five minutes.

88
00:06:06,540 --> 00:06:11,880
You could add that start period in there. It'll still do healthchecks. But what it will do is it won't

89
00:06:11,880 --> 00:06:17,010
alarm on an unhealthy check until that time has elapsed.

90
00:06:17,010 --> 00:06:22,750
So if you set two minutes in there, even though it's health checking every 30 seconds, it's going to only

91
00:06:22,750 --> 00:06:28,320
consider it unhealthy once it's past that two minute mark. The last one there, retries, means that

92
00:06:28,320 --> 00:06:33,650
we will try this health check x number of times before we consider it unhealthy.

93
00:06:33,720 --> 00:06:38,940
That gives maybe a potentially unstable app a chance to come back with a healthy and recover on

94
00:06:38,940 --> 00:06:42,420
its own before we consider this a truly unhealthy container.

95
00:06:42,510 --> 00:06:45,680
The basic healthcheck command you would use in a Dockerfile is called HEALTHCHECK,

96
00:06:45,720 --> 00:06:51,510
all capital letters there. The same format exists where if we're just doing a simple cURL of the localhost

97
00:06:51,630 --> 00:06:55,010
because maybe it's PHP app or something. We can do that.

98
00:06:55,200 --> 00:06:59,760
This is how you would add all those options in to a Dockerfile so you would see how I add the

99
00:06:59,770 --> 00:07:04,130
timeout interval and the retries before the command itself.

100
00:07:04,290 --> 00:07:09,450
The first one there for the basic command, notice I don't have to put in a CMD if I'm just giving it the

101
00:07:09,450 --> 00:07:14,230
command to run. But if I want to show options, if I want to give it custom options out of the box with

102
00:07:14,230 --> 00:07:19,190
the timeout and so on, then I have to specify which one is the command.

103
00:07:19,200 --> 00:07:20,550
Now these aren't two different lines.

104
00:07:20,550 --> 00:07:23,820
Notice the back slash on the end of the first line there.

105
00:07:23,940 --> 00:07:25,530
So don't get that confused.

106
00:07:26,290 --> 00:07:31,270
Here we have a simple example of what it might be like if you had a static application running inside

107
00:07:31,270 --> 00:07:32,450
an Nginx server.

108
00:07:32,500 --> 00:07:37,360
You could set the interval and the time out from your Dockerfile, and you would just have it simply

109
00:07:37,360 --> 00:07:39,830
do a cURL command on the localhost.

110
00:07:39,850 --> 00:07:45,910
If it returns a 200 or 300, it considers that fine. If it returns a 4, or 5, or something else, it considers

111
00:07:45,910 --> 00:07:46,660
that an error.

112
00:07:46,660 --> 00:07:51,050
You notice here that I have an exit 1, which is the same thing as a false.

113
00:07:51,100 --> 00:07:55,540
I did that just to show you that certain examples on the Internet will have a false. Certain examples

114
00:07:55,540 --> 00:07:56,680
will have an exit 1.

115
00:07:56,680 --> 00:07:58,090
They both do the same thing.

116
00:07:58,390 --> 00:08:00,500
Here's a little bit more advanced example.

117
00:08:00,580 --> 00:08:04,240
In this case, we're using a PHP app that's combined with Nginx.

118
00:08:04,270 --> 00:08:12,070
What I've done is, in the resources, you'll find a link to this PHP example. I've added in a custom

119
00:08:12,160 --> 00:08:18,360
Nginx config file that uses Nginx and PHP-FPM status URLs.

120
00:08:18,400 --> 00:08:24,640
Both of those applications have their own status page and sort of a healthcheck ping URL.

121
00:08:24,910 --> 00:08:29,950
You can use those in your apps if you're using PHP or Nginx. There are two different URLs,

122
00:08:30,100 --> 00:08:32,910
but you can use both of them inside the same healthcheck.

123
00:08:32,950 --> 00:08:37,900
In this case, we're using just one of them, and we're throwing in the localhost/ping, which is

124
00:08:37,930 --> 00:08:39,200
actually a PHP-FPM

125
00:08:39,200 --> 00:08:48,050
status command, but you have to enable that inside your PHP-FPM. Again, in the resources of this lecture,

126
00:08:48,070 --> 00:08:51,990
there's a link to a PHP Docker Good Defaults.

127
00:08:52,090 --> 00:08:56,450
You can go check that out on a GitHub where I've shown in this example in a little bit more detail.

128
00:08:56,470 --> 00:09:01,740
Next we have a Postgres example so in the Dockerfile I can use a different URL.

129
00:09:01,750 --> 00:09:08,140
Here we have a Postgres application where in the healthcheck command, I'm using a command of pg isready.

130
00:09:08,140 --> 00:09:13,810
Now, with different apps, there's different tools. With Postgres, it comes with a built-in tool,

131
00:09:13,810 --> 00:09:17,650
that's a very simple testing of a connection to a Postgres server.

132
00:09:17,650 --> 00:09:22,330
It doesn't validate that you have good data, or that your database is mounted properly. It's simply

133
00:09:22,330 --> 00:09:26,430
going to say, 'Does this database server allow connections? Yes or no?'

134
00:09:26,440 --> 00:09:28,970
That's a neat one that you can do out of the box.

135
00:09:29,640 --> 00:09:32,710
Here's what it would look like in a composer/stack file.

136
00:09:32,790 --> 00:09:33,990
Very similar.

137
00:09:33,990 --> 00:09:37,830
You'll notice that the start period down there requires a different version.

138
00:09:37,830 --> 00:09:44,370
Since the healthcheck command came out in 1.12, it was actually supported in 2.1 of this Compose

139
00:09:44,370 --> 00:09:45,090
file.

140
00:09:45,150 --> 00:09:50,790
But, if you're going to use the start period, that means you have to update your Compose file to version

141
00:09:50,870 --> 00:09:56,070
3.4 in order to support that. Because the start period came out over a year later after the healthcheck

142
00:09:56,070 --> 00:09:58,690
command did.

143
00:09:58,700 --> 00:10:01,130
Let's start out with some simple run commands.

144
00:10:01,250 --> 00:10:06,050
What we're gonna do here is we're going to start a Postgres database server without the healthcheck

145
00:10:06,080 --> 00:10:08,040
because by default, it doesn't come with one.

146
00:10:08,210 --> 00:10:13,130
Then we're going to run it again with a manual healthcheck command that will add at the command

147
00:10:13,130 --> 00:10:18,730
line, and we'll see the difference.

148
00:10:18,730 --> 00:10:24,500
Here, we're just going to call the first one p1. We'll run it detached from the official Postgres

149
00:10:24,520 --> 00:10:25,280
image.

150
00:10:25,420 --> 00:10:31,220
If I do a docker container ls, you'll see that there's nothing indicating a healthcheck here.

151
00:10:31,420 --> 00:10:40,490
If we do that same command again, and call it p2 this time, we're going to add a health command.

152
00:10:44,140 --> 00:10:48,280
This time, we're going to use the pg isready, which we talked about earlier,

153
00:10:49,850 --> 00:10:54,710
to test that the connections are available on this Postgres server. We're going to tell it that the

154
00:10:54,710 --> 00:10:59,600
user we need is the postgres user. We don't actually need to give it a password. It's not going to try

155
00:10:59,600 --> 00:11:04,680
to log in. It's just going to try to validate. We'll use the Postgres image.

156
00:11:06,230 --> 00:11:14,620
Now if we do a docker container ls, and I zoom out a little bit, you'll see that it says, 'Up 4 seconds

157
00:11:14,620 --> 00:11:16,050
health is starting.'

158
00:11:16,150 --> 00:11:22,510
Now we get this additional feature in our status of our ls command. It will stay in the starting

159
00:11:22,510 --> 00:11:23,940
state for the default

160
00:11:23,940 --> 00:11:27,450
30 seconds until it runs the healthcheck command for the first time.

161
00:11:27,760 --> 00:11:34,760
Now that we've waited over 30 seconds, you'll see that it's changed to status of healthy.

162
00:11:34,810 --> 00:11:44,240
If we do a docker container inspect on that p2, we'll see at the very top of that that we had

163
00:11:44,240 --> 00:11:48,780
this new health status option. In this case, I've only been able to run it twice.

164
00:11:49,070 --> 00:11:55,120
You can see the output there, that it's showing it's accepting connections.

165
00:11:55,140 --> 00:12:00,750
All right. Let's do some service create commands to that same database, in that same test healthcheck.

166
00:12:00,760 --> 00:12:05,680
What we'll see here when we do this is that there are three different states that a service goes through

167
00:12:05,680 --> 00:12:06,530
on starting up.

168
00:12:06,540 --> 00:12:11,800
It's preparing, which usually means it's downloading the image. It's starting, which means it's executing

169
00:12:11,890 --> 00:12:14,850
the container and bringing it up. Then it's running.

170
00:12:14,980 --> 00:12:19,600
Without the healthcheck command, the starting and running are very quick. They're almost instantaneous.

171
00:12:19,720 --> 00:12:25,170
We'll see that here with a docker service create name p1 postgres.

172
00:12:25,480 --> 00:12:30,430
Once it's done preparing by downloading the image, you'll see that it goes immediately from starting

173
00:12:30,430 --> 00:12:35,590
to running, because there is no healthcheck. It doesn't have anything else to do other than start the

174
00:12:35,590 --> 00:12:37,610
container and say, 'Yep. The binary is running.'

175
00:12:37,750 --> 00:12:43,600
But if we do that same command, docker service create, and call it p2 like before, and give it that same

176
00:12:43,600 --> 00:12:44,240
health command.

177
00:12:49,110 --> 00:12:52,470
We start this service with the healthcheck command built in.

178
00:12:52,590 --> 00:12:58,050
What we'll see is that it'll go from preparing to starting, and it will sit at the starting state for

179
00:12:58,050 --> 00:13:01,780
the default 30 seconds until the first healthcheck runs.

180
00:13:01,890 --> 00:13:08,190
This is now the Docker service expecting a healthy state before it considers this service fully

181
00:13:08,190 --> 00:13:14,280
running. After the 30 seconds is over, it'll shift to the running state. Then we get the last little

182
00:13:14,280 --> 00:13:17,670
verify there, just to make sure that it's considered stable, and then we're done.

183
00:13:17,670 --> 00:13:21,560
You can already see, out of the box, that with services, as well as service updates,

184
00:13:21,570 --> 00:13:26,950
we're going to get this extra bonus of health concept if we use these commands whenever we can.