I'm back to the persons dataset we used earlier, in the persons collection. Now sometimes you want to get a feeling for the distribution of the data you have, and there is a useful stage that can help you with that: the $bucket stage. So let's prepare our pipeline for this and add the $bucket stage. Now what does $bucket do? The $bucket stage allows you to output your data in, well, in buckets for which you can calculate certain summary statistics. $bucket takes a groupBy parameter, a groupBy field, where you define by which field you want to put your data into buckets, and here I'll go for the age, so dob.age. So here I refer to $dob.age; this tells $bucket where my input data is, essentially, the data which I want to put into buckets. Then you define some boundaries, and these are essentially your categories. So you could say: I'm interested in the ages 0, 18, 30, 50, 80 and 120, something like this. This would now create your buckets, so the different categories you have, the categories you want to sort your data into. Now the question is what you want to output for these buckets, and here you define the structure of what you get back. So I could say that in each bucket, I want to have an array of the names; this can be done, just as in the $group stage, with the $push operator, and I simply push the name in there. Now name is an object here, it's this object, so maybe we just take the first names to keep it a bit shorter. You could push objects too, I just want to keep it shorter. So now each bucket will be a document where I see the names of the people who are in the bucket. I also want to see the average age, let's say. We can do that with the $avg operator you saw before, and here I simply point at $dob.age. And I also want to find out how many persons are in the bucket. For this we can use $sum with 1; just as in the $group stage, one will be added for every element in the bucket. Now let's give this a try. For that, I'll move pretty() and aggregate() into their positions, copy that and execute it, and now, well, we have a lot of names in there. It probably was not my best idea to put all the names into the buckets.
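As a rough sketch, the pipeline described so far could look like this in the shell; the output field names (nameList, averageAge, numPersons) are placeholders I'm choosing here, not names given in the video:

    // group persons into age buckets and collect summary statistics per bucket
    db.persons.aggregate([
      {
        $bucket: {
          groupBy: "$dob.age",                      // field used for bucketing
          boundaries: [0, 18, 30, 50, 80, 120],     // bucket borders (categories)
          output: {
            nameList: { $push: "$name.first" },     // first names of everyone in the bucket
            averageAge: { $avg: "$dob.age" },       // average age per bucket
            numPersons: { $sum: 1 }                 // one added per person in the bucket
          }
        }
      }
    ]).pretty();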
Here we have a bucket boundary; here we see basically which category this bucket is for, we see there are 2,300 people in there, and we see the average age. Now let me get rid of the names here, because it's really hard to read otherwise, so I'll just use the summary statistics now, and this is easier to read. So now we see we got three buckets, three categories essentially, and the reason for us having only three buckets is probably that we seem to have no persons younger than 18 or older than 80, so these boundaries and this starting point seem to be redundant. Let's quickly check this by running a query on persons and finding all persons who are younger than 18. We can of course do that by pointing at dob.age and then using the $lt operator to see who's younger than 18, and we get none. Now let's tweak this to greater than 80, and we get none there either. So this is correct, and if we summed this up, we would also get our 5,000 documents, by the way. And this is the $bucket stage and how we can use it to get an idea of the distribution. Now we can of course fine-tune this. Since we know that we got no one who's younger than 18, we can get rid of that boundary, and we want to keep our upper bound so that everyone fits in there, but we can add more levels in between, for 40 and 60. If I now run this, you see we got more buckets, for more granularity, and this gives us an idea of our distribution and the average age in every bucket and so on. Now there also is an alternative to this. Let me quickly create an aggregate pipeline: you can also run another stage which is called $bucketAuto. Now as the name suggests, $bucketAuto does the bucketing for you. What you do here is you simply define the groupBy key, because of course you need to tell it by which field to bucket, so $dob.age. You then also define the number of buckets you want to have, and the boundaries will then be derived automatically, so MongoDB will look at your data and see where it should draw the boundaries; we could say we want five buckets. And then you also define the output, of course, because you still define what you want to see, and I'll just copy the output from above.
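The check queries and the refined boundaries might look roughly like this; I'm adding count() here just to make the empty result explicit, and the output field names are still my own placeholders:

    // verify there are no persons outside the 18-80 range
    db.persons.find({ "dob.age": { $lt: 18 } }).count();   // 0 according to the video
    db.persons.find({ "dob.age": { $gt: 80 } }).count();   // 0 as well

    // refined buckets: drop the 0 boundary, add 40 and 60 for more granularity
    db.persons.aggregate([
      {
        $bucket: {
          groupBy: "$dob.age",
          boundaries: [18, 30, 40, 50, 60, 80, 120],
          output: {
            averageAge: { $avg: "$dob.age" },
            numPersons: { $sum: 1 }
          }
        }
      }
    ]).pretty();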
So now if I turn this into a usable format, I can copy that, and if I execute this now, you see I get this output: MongoDB tells me which boundaries it created for me, so I see that the youngest person seems to be 21 years old, and then I get these buckets with my summary statistics. Now each bucket holds almost the same number of values, because MongoDB tried to derive an equal distribution, and $bucketAuto can be an even quicker way of getting a feeling for your data.
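A minimal sketch of the $bucketAuto version, again with placeholder output field names:

    // let MongoDB derive five bucket boundaries from the data itself
    db.persons.aggregate([
      {
        $bucketAuto: {
          groupBy: "$dob.age",
          buckets: 5,                          // number of buckets to derive automatically
          output: {
            averageAge: { $avg: "$dob.age" },
            numPersons: { $sum: 1 }
          }
        }
      }
    ]).pretty();

Each result document then contains an _id with the min and max boundary MongoDB chose for that bucket, plus the summary statistics defined in output.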