I'm back to the persons dataset we used earlier, in the persons collection. Now sometimes you want to get a feeling for the distribution of the data you have, and there is a useful stage that can help you with that: the $bucket stage. So let's prepare our pipeline for this and add the $bucket stage. Now what does $bucket do? The $bucket stage allows you to output your data in, well, in buckets for which you can calculate certain summary statistics. $bucket takes a groupBy parameter, a groupBy field, where you define by which field you want to put your data into buckets, and here I'll go for the age, so dob.age. So here I refer to $dob.age; this tells $bucket where my input data is, essentially, the data which I want to put into buckets. Then you define some boundaries, and these are essentially your categories. So you could say: I'm interested in the ages 0, 18, 30, 50, 80 and 120, something like this. This would now create your buckets, so the different categories you have, the categories you want to sort your data into. Now the question is what you want to output for these buckets, and here you define the structure of what you get back. So I could say that in each bucket, I want to have an array of the names; this can be done, just as in the $group stage, with the $push operator, and I simply push the name in there. Now name is an object here, it's this object, so maybe we just take the first names to keep it a bit shorter. You could push objects too, I just want to keep it shorter. So now each bucket will be a document where I see the names of the people who are in the bucket. I also want to see the average age, let's say. We can do that with the $avg operator you saw before, and here I simply point at $dob.age. And I also want to find out how many persons are in the bucket. For this we can use $sum with 1; just as in the $group stage, one will be added for every element in the bucket. Now let's give this a try. For that, I'll move pretty() and aggregate() into their positions, copy that and execute it, and now, well, we have a lot of names in there. It probably was not my best idea to put all the names into the buckets.
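As a rough sketch, the pipeline described so far could look like this in the shell; the output field names (nameList, averageAge, numPersons) are placeholders I'm choosing here, not names given in the video:

    // group persons into age buckets and collect summary statistics per bucket
    db.persons.aggregate([
      {
        $bucket: {
          groupBy: "$dob.age",                      // field used for bucketing
          boundaries: [0, 18, 30, 50, 80, 120],     // bucket borders (categories)
          output: {
            nameList: { $push: "$name.first" },     // first names of everyone in the bucket
            averageAge: { $avg: "$dob.age" },       // average age per bucket
            numPersons: { $sum: 1 }                 // one added per person in the bucket
          }
        }
      }
    ]).pretty();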
Here we have a bucket boundary; here we see basically which category this bucket is for, we see there are 2,300 people in there, and we see the average age. Now let me get rid of the names here, because it's really hard to read otherwise, so I'll just use the summary statistics now, and this is easier to read. So now we see we got three buckets, three categories essentially, and the reason for us having only three buckets is probably that we seem to have no persons younger than 18 or older than 80, so these boundaries and this starting point seem to be redundant. Let's quickly check this by running a query on persons and finding all persons who are younger than 18. We can of course do that by pointing at dob.age and then using the $lt operator to see who's younger than 18, and we get none. Now let's tweak this to greater than 80, and we get none there either. So this is correct, and if we summed this up, we would also get our 5,000 documents, by the way. And this is the $bucket stage and how we can use it to get an idea of the distribution. Now we can of course fine-tune this. Since we know that we got no one who's younger than 18, we can get rid of that boundary, and we want to keep our upper bound so that everyone fits in there, but we can add more levels in between, for 40 and 60. If I now run this, you see we got more buckets, for more granularity, and this gives us an idea of our distribution and the average age in every bucket and so on. Now there also is an alternative to this. Let me quickly create an aggregate pipeline: you can also run another stage which is called $bucketAuto. Now as the name suggests, $bucketAuto does the bucketing for you. What you do here is you simply define the groupBy key, because of course you need to tell it by which field to bucket, so $dob.age. You then also define the number of buckets you want to have, and the boundaries will then be derived automatically, so MongoDB will look at your data and see where it should draw the boundaries; we could say we want five buckets. And then you also define the output, of course, because you still define what you want to see, and I'll just copy the output from above.
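The check queries and the refined boundaries might look roughly like this; I'm adding count() here just to make the empty result explicit, and the output field names are still my own placeholders:

    // verify there are no persons outside the 18-80 range
    db.persons.find({ "dob.age": { $lt: 18 } }).count();   // 0 according to the video
    db.persons.find({ "dob.age": { $gt: 80 } }).count();   // 0 as well

    // refined buckets: drop the 0 boundary, add 40 and 60 for more granularity
    db.persons.aggregate([
      {
        $bucket: {
          groupBy: "$dob.age",
          boundaries: [18, 30, 40, 50, 60, 80, 120],
          output: {
            averageAge: { $avg: "$dob.age" },
            numPersons: { $sum: 1 }
          }
        }
      }
    ]).pretty();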
So now if I turn this into a usable format, I can copy that, and if I execute this now, you see I get this output: MongoDB tells me which boundaries it created for me, so I see that the youngest person seems to be 21 years old, and then I get these buckets with my summary statistics. Now each bucket holds almost the same number of values, because MongoDB tried to derive an equal distribution, and $bucketAuto can be an even quicker way of getting a feeling for your data.
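A minimal sketch of the $bucketAuto version, again with placeholder output field names:

    // let MongoDB derive five bucket boundaries from the data itself
    db.persons.aggregate([
      {
        $bucketAuto: {
          groupBy: "$dob.age",
          buckets: 5,                          // number of buckets to derive automatically
          output: {
            averageAge: { $avg: "$dob.age" },
            numPersons: { $sum: 1 }
          }
        }
      }
    ]).pretty();

Each result document then contains an _id with the min and max boundary MongoDB chose for that bucket, plus the summary statistics defined in output.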