
A Guide to Common Sorting Algorithms (11) [Bucket Sort]

2017-04-24 20:30

The previous part covered radix sort. This part introduces bucket sort (also called bin sort). Like radix sort, bucket sort is a distribution sort.

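Basic idea of bucket sort

  1. Distribute the objects to be sorted into N buckets using an order-preserving hash, so that every object in bucket i is no larger than any object in bucket i+1
  2. Sort the objects inside each bucket
  3. Walk the N buckets and collect the sorted objects (subsequences) of all non-empty buckets into one sorted sequence

If each bucket uses linked storage, steps 1 and 2 can be merged: each object is inserted at its sorted position while it is being distributed, so every bucket stays sorted at all times.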
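To make the three steps concrete, here is a minimal sketch that keeps them separate. It uses one temporary array partitioned into buckets and the standard qsort() inside each bucket, rather than the linked lists used later in this post. The names bucketsort_sketch and bucket_of, the parameter limit (an exclusive upper bound on the values), and the index formula v * m / limit are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for ints; written to avoid overflow of (x - y) */
static int
cmp_int(const void *x, const void *y)
{
        int a = *(const int *)x;
        int b = *(const int *)y;
        return (a > b) - (a < b);
}

/* hypothetical hash: map v in [0, limit) to a bucket in [0, m) */
static size_t
bucket_of(int v, size_t m, int limit)
{
        return (size_t)((long long)v * (long long)m / limit);
}

/*
 * Hypothetical helper: sort a[0..n-1], every value in [0, limit),
 * with m buckets. Returns 0 on success, -1 if allocation fails
 * (a[] is then left untouched).
 */
static int
bucketsort_sketch(int a[], size_t n, size_t m, int limit)
{
        assert(m > 0 && limit > 0);
        if (n == 0)
                return 0;

        size_t *start = calloc(m + 1, sizeof (size_t));
        size_t *fill = malloc(m * sizeof (size_t));
        int *tmp = malloc(n * sizeof (int));
        if (start == NULL || fill == NULL || tmp == NULL) {
                free(start);
                free(fill);
                free(tmp);
                return -1;
        }

        /* step 1: scatter. Count bucket sizes, then copy into tmp[] */
        for (size_t i = 0; i < n; i++) {
                assert(a[i] >= 0 && a[i] < limit);
                start[bucket_of(a[i], m, limit) + 1]++;
        }
        for (size_t j = 0; j < m; j++)
                start[j + 1] += start[j]; /* start[j] = offset of bucket j */
        memcpy(fill, start, m * sizeof (size_t));
        for (size_t i = 0; i < n; i++)
                tmp[fill[bucket_of(a[i], m, limit)]++] = a[i];

        /* step 2: sort every bucket that holds more than one element */
        for (size_t j = 0; j < m; j++) {
                size_t len = start[j + 1] - start[j];
                if (len > 1)
                        qsort(tmp + start[j], len, sizeof (int), cmp_int);
        }

        /* step 3: gather. The buckets already sit in tmp[] in order */
        memcpy(a, tmp, n * sizeof (int));

        free(start);
        free(fill);
        free(tmp);
        return 0;
}
```

For instance, calling bucketsort_sketch(a, 8, 5, 50) on {29, 25, 3, 49, 9, 37, 21, 43} places 3 and 9 in bucket 0, 29, 25 and 21 in bucket 2, 37 in bucket 3, and 49 and 43 in bucket 4 before the per-bucket sort runs.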

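For example, suppose there are 5 buckets and the sequence to sort is {29, 25, 3, 49, 9, 37, 21, 43}.

1. Scatter (note: the original post links to the image source here)

2. Sort within each bucket

3. Gather

The three figures above give an intuitive picture of how bucket sort proceeds. After watching the Bucket Sort animation, I decided to write a C implementation that uses the hash algorithm given in the animation and stores each bucket as a singly linked list. The hash algorithm given in the animation is: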

Linked List Array index = Value * NUMBER_OF_ELEMENTS / (MAXIMUM_ARRAY_VALUE + 1)
e.g. (348 * 30) / 1000 = 10
     (15  * 30) / 1000 = 0
Note that NUMBER_OF_ELEMENTS is the number of buckets, which is 30. The divisor 1000 implies that MAXIMUM_ARRAY_VALUE is 999 in this example.
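The formula can be checked with a few lines of C. This is only a sketch: the function name bucket_index and its parameter names are hypothetical, and it assumes 0 <= value <= max_value.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the animation's formula:
 *   index = value * nbuckets / (max_value + 1)
 * Assumes nbuckets > 0 and 0 <= value <= max_value.
 */
static int
bucket_index(int value, int nbuckets, int max_value)
{
        assert(nbuckets > 0 && value >= 0 && value <= max_value);
        /* widen to long long so value * nbuckets cannot overflow int */
        return (int)((long long)value * nbuckets /
            ((long long)max_value + 1));
}
```

With 30 buckets and a maximum value of 999, bucket_index(348, 30, 999) is 10 and bucket_index(15, 30, 999) is 0, matching the two examples above. The largest value, 999, maps to the last bucket, 29.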

C implementation of bucket sort

1. Overview of the sorting principle

/*
 * Bucket Sort
 *
 *      Bucket sort (or bin sort) is a sorting algorithm that works by
 *      distributing the elements of an array into a number of buckets.
 *      Each bucket is then sorted individually, either using a different
 *      sorting algorithm, or by recursively applying the bucket sorting
 *      algorithm.
 *
 *      Typically, bucket sort works as follows:
 *      1. Set up an array of initially empty "buckets"
 *      2. Scatter: go over the original array, putting each object in
 *                  its bucket
 *      3. Sort each non-empty bucket
 *      4. Gather : visit the buckets in order and put all elements back
 *                  into the original array
 *
 *      Note that step#2 and step#3 are merged into one step here: each
 *      bucket is a singly linked list, and every element is placed with
 *      an insertion-sort step, so the list stays sorted while it is
 *      built and no separate sorting pass is needed.
 *
 *      In addition, we define N(=10) buckets and use the following hash
 *      algorithm:
 *              a) get max number of a[] as MAX
 *              b) get width of the max number (i.e. MAX) as WIDTH
 *                 e.g. MAX = 9,   WIDTH = 1;
 *                      MAX = 99,  WIDTH = 2;
 *                      MAX = 999, WIDTH = 3;
 *              c) index = a[i] * N / (10 ** WIDTH)
 *      then we can dispatch a[i] to bucket[index]
 *
 *      This hash assumes that every a[i] is non-negative.
 */

2. Singly linked list definition and basic operations

#include <stdio.h>      /* printf() */
#include <stdlib.h>     /* malloc(), free() */

typedef struct list_s {
        int data;
        struct list_s *next;
} list_t;

/*
 * Insert node into the ascending list *head so that the list stays
 * sorted. The caller must set node->next to NULL beforehand.
 * A node whose key equals an existing key is placed before it; compare
 * with <= instead of < if equal keys must keep their arrival order.
 */
static void
list_init(list_t **head, list_t *node)
{
        if (*head == NULL) {
                *head = node;
                return;
        }

        /* get both prev and next of the node to insert */
        list_t *node_prev = *head;
        list_t *node_next = NULL;
        for (list_t *p = *head; p != NULL; p = p->next) {
                if (p->data < node->data) {
                        node_prev = p;
                        continue;
                }

                node_next = p;
                break;
        }

        if (node_next == NULL) { /* append node to the tail */
                node_prev->next = node;
        } else {
                if (node_next == node_prev) { /* == *head */
                        node->next = *head;
                        *head = node;
                        return;
                }

                /* node_prev -> node -> node_next */
                node_prev->next = node;
                node->next = node_next;
        }
}

/* Print all elements of the list on one line */
static void
list_show(list_t *head)
{
        if (head == NULL)
                return;

        for (list_t *p = head; p != NULL; p = p->next)
                printf("%d ", p->data);
        printf("\n");
}

/* Free every node of the list */
static void
list_fini(list_t *head)
{
        list_t *p = head;
        while (p != NULL) {
                list_t *q = p;
                p = p->next;
                free(q);
        }
}
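The sorted insertion can also be written with a pointer-to-pointer walk, which removes the special cases for an empty list, the head, and the tail. The sketch below is an alternative, not the code used in this post; the names node_t and sorted_insert are hypothetical. It compares with <= so that equal keys keep their arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical node type, same layout as list_t in this post */
typedef struct node_s {
        int data;
        struct node_s *next;
} node_t;

/*
 * Insert node into the ascending list *head.
 * link always points at the pointer that should end up pointing to
 * node: first the head pointer itself, then some node's next field.
 */
static void
sorted_insert(node_t **head, node_t *node)
{
        assert(head != NULL && node != NULL);

        node_t **link = head;
        /* skip every node whose key is <= the new key (stable order) */
        while (*link != NULL && (*link)->data <= node->data)
                link = &(*link)->next;

        node->next = *link;
        *link = node;
}
```

Inserting 29, 25, 21 and then a second 25 yields the list 21, 25, 25, 29, with the second 25 placed after the first one.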

3. Core step one: scatter()

/*
 * Get width of a number
 * e.g.
 *   for i in [  0 .. 9  ] // width = 1
 *   for i in [ 10 .. 99 ] // width = 2
 *   for i in [100 .. 999] // width = 3
 *   ...
 */
static int
get_width_of_num(int num)
{
        int w = 1;
        for (int q = num / 10; q != 0; q /= 10)
                w++;
        return w;
}

/*
 * Get the hash base 10**WIDTH, where WIDTH is the width of the largest
 * element of a[]. Assumes n > 0 and max < 10**9, so that the base fits
 * in a 32-bit int.
 */
static int
get_hash_base(int a[], size_t n)
{
        /* get max one of a[] */
        int max = a[0];
        for (size_t i = 1; i < n; i++) {
                if (max < a[i])
                        max = a[i];
        }

        /* get hash base which is 10**N, N=1, 2, ... */
        int width = get_width_of_num(max);
        int base = 1;
        for (int i = 0; i < width; i++)
                base *= 10;

        return base;
}

static void
scatter(list_t **bucket, size_t m, int a[], size_t n)
{
        int base = get_hash_base(a, n);

        for (size_t i = 0; i < n; i++) {
                /* 1. create a new node for a[i] */
                list_t *nodep = (list_t *)malloc(sizeof (list_t));
                if (nodep == NULL) /* error: failed to malloc */
                        return;

                nodep->data = a[i];
                nodep->next = NULL;

                /* 2. dispatch the new node to bucket[j]; a[i] must be >= 0 */
                size_t j = (size_t)a[i] * m / (size_t)base;
                list_init(&(bucket[j]), nodep);
        }
}
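The index computation in scatter() relies on non-negative values and a power-of-ten base. As one possible alternative (not the method used in this post), the index can be derived from the actual value range [min, max], which also handles negative numbers. The function name bucket_index_range is hypothetical, and the sketch assumes m is small enough that (max - min) * m fits in a long long.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: map value in [min, max] to a bucket in [0, m).
 * The arithmetic is done in long long so that max - min cannot
 * overflow int, even for negative min.
 */
static size_t
bucket_index_range(int value, int min, int max, size_t m)
{
        assert(m > 0 && min <= value && value <= max);

        long long span = (long long)max - min + 1;   /* always >= 1 */
        long long off = (long long)value - min;      /* 0 .. span - 1 */
        return (size_t)(off * (long long)m / span);
}
```

With the test input used in this post (min 3, max 49, 10 buckets), 3 maps to bucket 0 and 49 to bucket 9, so the values spread over all ten buckets instead of only the first five. The caller would need one extra pass over a[] to find min as well as max.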

4. Core step two: gather()

static void
gather(list_t **bucket, size_t m, int a[], size_t n)
{
        size_t k = 0; /* next free slot in a[] */
        for (size_t i = 0; i < m; i++) {
                if (bucket[i] == NULL)
                        continue;

                /* copy the sorted bucket back into a[] */
                for (list_t *p = bucket[i]; p != NULL; p = p->next) {
                        a[k++] = p->data;

                        if (k >= n) /* overflow */
                                break;
                }

                /* the nodes are no longer needed */
                list_fini(bucket[i]);
        }
}

5. Bucket sort: bucketsort()

void
bucketsort(int a[], size_t n)
{
        /* alloc bucket[] and mark every bucket as empty */
#define BUCKET_NUM 10
        list_t **bucket = (list_t **)malloc(sizeof (list_t *) * BUCKET_NUM);
        if (bucket == NULL) /* error: failed to malloc */
                return;
        for (int i = 0; i < BUCKET_NUM; i++)
                bucket[i] = NULL;

        /* scatter elements in a[] to bucket[] */
        scatter(bucket, BUCKET_NUM, a, n);

        /* gather a[] by walking bucket[] */
        gather(bucket, BUCKET_NUM, a, n);

        free(bucket);
}

6. Complete C code

o bucketsort.c (or visit here)

/*
 * Bucket Sort
 *
 *      Bucket sort (or bin sort) is a sorting algorithm that works by
 *      distributing the elements of an array into a number of buckets.
 *      Each bucket is then sorted individually, either using a different
 *      sorting algorithm, or by recursively applying the bucket sorting
 *      algorithm.
 *
 *      Typically, bucket sort works as follows:
 *      1. Set up an array of initially empty "buckets"
 *      2. Scatter: go over the original array, putting each object in
 *                  its bucket
 *      3. Sort each non-empty bucket
 *      4. Gather : visit the buckets in order and put all elements back
 *                  into the original array
 *
 *      Note that step#2 and step#3 are merged into one step here: each
 *      bucket is a singly linked list, and every element is placed with
 *      an insertion-sort step, so the list stays sorted while it is
 *      built and no separate sorting pass is needed.
 *
 *      In addition, we define N(=10) buckets and use the following hash
 *      algorithm:
 *              a) get max number of a[] as MAX
 *              b) get width of the max number (i.e. MAX) as WIDTH
 *                 e.g. MAX = 9,   WIDTH = 1;
 *                      MAX = 99,  WIDTH = 2;
 *                      MAX = 999, WIDTH = 3;
 *              c) index = a[i] * N / (10 ** WIDTH)
 *      then we can dispatch a[i] to bucket[index]
 *
 *      This hash assumes that every a[i] is non-negative.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum bool_s {false, true} bool_t;

/* true: arguments are integers; false: arguments are single characters */
bool_t g_isint = true;

typedef struct list_s {
        int data;
        struct list_s *next;
} list_t;

/*
 * Insert node into the ascending list *head so that the list stays
 * sorted. The caller must set node->next to NULL beforehand.
 * A node whose key equals an existing key is placed before it; compare
 * with <= instead of < if equal keys must keep their arrival order.
 */
static void
list_init(list_t **head, list_t *node)
{
        if (*head == NULL) {
                *head = node;
                return;
        }

        /* get both prev and next of the node to insert */
        list_t *node_prev = *head;
        list_t *node_next = NULL;
        for (list_t *p = *head; p != NULL; p = p->next) {
                if (p->data < node->data) {
                        node_prev = p;
                        continue;
                }

                node_next = p;
                break;
        }

        if (node_next == NULL) { /* append node to the tail */
                node_prev->next = node;
        } else {
                if (node_next == node_prev) { /* == *head */
                        node->next = *head;
                        *head = node;
                        return;
                }

                /* node_prev -> node -> node_next */
                node_prev->next = node;
                node->next = node_next;
        }
}

/* Print all elements of the list on one line */
static void
list_show(list_t *head)
{
        if (head == NULL)
                return;

        for (list_t *p = head; p != NULL; p = p->next)
                printf("%d ", p->data);
        printf("\n");
}

/* Free every node of the list */
static void
list_fini(list_t *head)
{
        list_t *p = head;
        while (p != NULL) {
                list_t *q = p;
                p = p->next;
                free(q);
        }
}

/* Print a[] as integers or as characters, depending on g_isint */
static void
show(int a[], size_t n)
{
        if (g_isint) {
                for (size_t i = 0; i < n; i++)
                        printf("%-2d ", a[i]);
        } else {
                for (size_t i = 0; i < n; i++)
                        printf("%-2c ", a[i]);
        }
        printf("\n");
}

/*
 * Get width of a number
 * e.g.
 *   for i in [  0 .. 9  ] // width = 1
 *   for i in [ 10 .. 99 ] // width = 2
 *   for i in [100 .. 999] // width = 3
 *   ...
 */
static int
get_width_of_num(int num)
{
        int w = 1;
        for (int q = num / 10; q != 0; q /= 10)
                w++;
        return w;
}

/*
 * Get the hash base 10**WIDTH, where WIDTH is the width of the largest
 * element of a[]. Assumes n > 0 and max < 10**9, so that the base fits
 * in a 32-bit int.
 */
static int
get_hash_base(int a[], size_t n)
{
        /* get max one of a[] */
        int max = a[0];
        for (size_t i = 1; i < n; i++) {
                if (max < a[i])
                        max = a[i];
        }

        /* get hash base which is 10**N, N=1, 2, ... */
        int width = get_width_of_num(max);
        int base = 1;
        for (int i = 0; i < width; i++)
                base *= 10;

        return base;
}

static void
scatter(list_t **bucket, size_t m, int a[], size_t n)
{
        int base = get_hash_base(a, n);

        for (size_t i = 0; i < n; i++) {
                /* 1. create a new node for a[i] */
                list_t *nodep = (list_t *)malloc(sizeof (list_t));
                if (nodep == NULL) /* error: failed to malloc */
                        return;

                nodep->data = a[i];
                nodep->next = NULL;

                /* 2. dispatch the new node to bucket[j]; a[i] must be >= 0 */
                size_t j = (size_t)a[i] * m / (size_t)base;
                list_init(&(bucket[j]), nodep);

                /* NOTE: dump bucket[j] just for visual observation */
                printf("%zu:%zu\t\t%d\tbucket[%zu] : ", i, j, a[i], j);
                list_show(bucket[j]);
        }
}

static void
gather(list_t **bucket, size_t m, int a[], size_t n)
{
        size_t k = 0; /* next free slot in a[] */
        for (size_t i = 0; i < m; i++) {
                if (bucket[i] == NULL)
                        continue;

                /* copy the sorted bucket back into a[] */
                for (list_t *p = bucket[i]; p != NULL; p = p->next) {
                        a[k++] = p->data;

                        if (k >= n) /* overflow */
                                break;
                }

                /* the nodes are no longer needed */
                list_fini(bucket[i]);
        }
}

void
bucketsort(int a[], size_t n)
{
        /* alloc bucket[] and mark every bucket as empty */
#define BUCKET_NUM 10
        list_t **bucket = (list_t **)malloc(sizeof (list_t *) * BUCKET_NUM);
        if (bucket == NULL) /* error: failed to malloc */
                return;
        for (int i = 0; i < BUCKET_NUM; i++)
                bucket[i] = NULL;

        /* scatter elements in a[] to bucket[] */
        scatter(bucket, BUCKET_NUM, a, n);

        /* gather a[] by walking bucket[] */
        gather(bucket, BUCKET_NUM, a, n);

        free(bucket);
}

int
main(int argc, char *argv[])
{
        if (argc < 2) {
                fprintf(stderr, "Usage: %s <C1> [C2] ...\n", argv[0]);
                return -1;
        }

        /* skip the program name */
        argc--;
        argv++;

        int n = argc;
        int *a = (int *)malloc(sizeof(int) * n);
#define VALIDATE(p) do { if ((p) == NULL) return -1; } while (0)
        VALIDATE(a);

        /* ISINT=false switches to character mode; the default is integers */
        char *s = getenv("ISINT");
        if (s != NULL && strncmp(s, "true", 4) == 0)
                g_isint = true;
        else if (s != NULL && strncmp(s, "false", 5) == 0)
                g_isint = false;

        if (g_isint) {
                for (int i = 0; i < n; i++)
                        *(a+i) = atoi(argv[i]);
        } else {
                for (int i = 0; i < n; i++)
                        *(a+i) = argv[i][0];
        }

        /* header row: element indexes in hex */
        printf("                ");
        for (int i = 0; i < n; i++)
                printf("%-2x ", i);
        printf("\n");

        printf("Before sorting: "); show(a, n);
        bucketsort(a, n);
        printf("After  sorting: "); show(a, n);

#define FREE(p) do { free(p); p = NULL; } while (0)
        FREE(a);
        return 0;
}

o Compile and test

$ gcc -g -Wall -std=gnu99 -m32 -o bucketsort  bucketsort.c

$ ./bucketsort 29 25 3 49 9 37 21 43
                0  1  2  3  4  5  6  7
Before sorting: 29 25 3  49 9  37 21 43
0:2             29      bucket[2] : 29
1:2             25      bucket[2] : 25 29
2:0             3       bucket[0] : 3
3:4             49      bucket[4] : 49
4:0             9       bucket[0] : 3 9
5:3             37      bucket[3] : 37
6:2             21      bucket[2] : 21 25 29
7:4             43      bucket[4] : 43 49
After  sorting: 3  9  21 25 29 37 43 49

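The stability of bucket sort depends on the stability of the sort used inside each bucket: if every bucket is sorted with a stable method, bucket sort as a whole is stable. Note in particular that bucket sort is expensive in storage. The implementation above allocates n list nodes and 10 bucket heads, so its space complexity is O(n+k), where k is the number of buckets.

As for time, the best case of this implementation is O(n+k): during scattering the n elements spread evenly over the k (=10) buckets, and every bucket stays sorted without any real insertion-sort work as elements arrive; the gather phase then visits every bucket and every element once. The worst case is O(n**2): during scattering all n elements land in a single bucket X, and the linked-list insertion sort on X costs O(n**2). Wikipedia summarizes the time complexity of bucket sort as follows (the average-case figure assumes uniformly distributed input and a number of buckets proportional to n):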

Worst-case performance  O(n ** 2)
Best-case performance   Ω(n + k)
Average performance     Θ(n + k)
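The worst case can be made tangible by counting list nodes visited. The sketch below models a single bucket that receives the values 0..n-1 in ascending order and is searched from the head on every insertion, as a singly linked list without a tail pointer must be. The names cnode_t and count_visits_single_bucket are hypothetical, and the counter is an illustration rather than a benchmark.

```c
#include <assert.h>
#include <stdlib.h>

/* hypothetical node type for this experiment */
typedef struct cnode_s {
        int data;
        struct cnode_s *next;
} cnode_t;

/*
 * Insert 0, 1, ..., n-1 into one sorted list and return how many
 * existing nodes were stepped over while searching for each insert
 * position. Returns -1 if an allocation fails.
 */
static long long
count_visits_single_bucket(int n)
{
        assert(n >= 0);

        cnode_t *head = NULL;
        long long visits = 0;
        int ok = 1;

        for (int v = 0; v < n; v++) {
                cnode_t *node = malloc(sizeof (cnode_t));
                if (node == NULL) {
                        ok = 0;
                        break;
                }
                node->data = v;

                /* walk from the head until a key >= v or the end */
                cnode_t **link = &head;
                while (*link != NULL && (*link)->data < v) {
                        visits++;
                        link = &(*link)->next;
                }
                node->next = *link;
                *link = node;
        }

        /* release the list */
        while (head != NULL) {
                cnode_t *q = head;
                head = head->next;
                free(q);
        }
        return ok ? visits : -1;
}
```

The v-th insertion steps over v nodes, so the total is 0 + 1 + ... + (n-1) = n(n-1)/2, for example 4950 for n = 100. That quadratic growth is the O(n ** 2) worst case in the table.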

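Although bucket sort uses a lot of storage, it is far from useless. When parallel processing is needed to speed up sorting, bucket sort supports it well. For example, suppose there are 1000 buckets and 1,000,000 records to sort. We can start 1000 threads that scatter the 1,000,000 records and do the linked-list insertion sort in parallel, wait until all 1000 threads have finished, and then gather the data back from the 1000 buckets. Of course, a 32-bit program that processes 1,000,000 records (with an int data field) needs about 4*2*1000000 + 4*1000 = 8,004,000 bytes, roughly 8 MB, of extra storage, not counting allocator overhead. (In a 32-bit program, an int and a pointer each occupy 4 bytes.)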
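The storage estimate can be reproduced with a short calculation. The function name extra_bytes and its parameters are hypothetical, and the formula ignores structure padding and the allocator's per-block bookkeeping, so real usage will be somewhat higher.

```c
#include <assert.h>

/*
 * Hypothetical helper: extra storage for n list nodes and k bucket
 * heads. Each node holds one int and one next pointer; each bucket
 * head is one pointer. Padding and malloc overhead are ignored.
 */
static unsigned long long
extra_bytes(unsigned long long n, unsigned long long k,
    unsigned long long int_size, unsigned long long ptr_size)
{
        assert(int_size > 0 && ptr_size > 0);
        return n * (int_size + ptr_size) + k * ptr_size;
}
```

For the 32-bit case, extra_bytes(1000000, 1000, 4, 4) gives 8,004,000 bytes, which is the figure quoted above.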

References

  1. Algorithm Visualizations
  2. Bucket Sort from wikipedia

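Summary

With this part, all nine common sorting algorithms have been covered. The series took a month and a half from start to finish. It was full of hard work and also full of joy, and for a programmer who loves coding the fun was obviously huge. From now on I need not fear an Intel or AMD interviewer asking me what heap sort is :-). All the C implementations are stored on my GitHub; if you are interested, please browse vCodeHub/xdsa/sorting. Finally, let me quote a line of classical poetry to express how I feel right now: "What is learned from paper always feels shallow; to truly understand a thing, you must practice it yourself." To learn algorithms, first read the books written by the masters. Domestic data-structure textbooks are best avoided: they are obscure and hard to follow, and may well mislead you. But reading alone is not enough. Only by writing the code and experiencing it yourself can you understand both the what and the why, so that the knowledge leaves a deep impression and eventually comes together as a whole.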
