
RAID5 Source Code in the Linux Kernel Explained: Managing stripe_head

2015-07-24 14:01


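The previous posts covered the overall architecture of the system and the stripe_head structure, the basic processing unit of RAID5 in the kernel, so by now we have a general picture of the kernel's RAID5 module. Today we look at how RAID5 manages stripe_head structures (the text below sometimes says "stripe", which also means the stripe_head structure). Enough talk, let's go.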

Managing stripe_head

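We already know that RAID5 in the kernel processes requests using the stripe_head structure as its basic unit. Its in-memory layout can be pictured as follows:

[Figure: in-memory layout of a stripe_head for a 3+1 RAID5 array]

The figure represents a RAID5 array in 3+1 mode. Apart from some metadata, the rest of the structure is the buffers of the dev devices; the structure definitions can be found in raid5.h. One metadata field deserves attention: sector. This value is the offset of the stripe_head within the RAID5 array, and it is also what aligns the four device buffers that follow: each of the four buffers sits at offset sector on its own device, and that offset is fixed.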
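The layout described above can be sketched as a pair of C structs. This is only a simplified illustration, not the definition from raid5.h: the names stripe_head_model, r5dev_model, and dev_offset_on_disk are hypothetical, and most real fields are left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define NR_DISKS 4  /* 3 data + 1 parity, as in the 3+1 example */

/* Simplified per-device buffer; the real struct r5dev has many more fields. */
struct r5dev_model {
    unsigned char page[4096]; /* one page of data for this device */
    unsigned long flags;
};

/* Simplified stripe_head: metadata first, then one buffer per device. */
struct stripe_head_model {
    sector_t sector;          /* offset shared by every device in the stripe */
    unsigned long state;
    int pd_idx;               /* index of the parity device */
    struct r5dev_model dev[NR_DISKS];
};

/* Every buffer of a stripe lives at the same offset on its own device. */
static sector_t dev_offset_on_disk(const struct stripe_head_model *sh, int disk)
{
    assert(disk >= 0 && disk < NR_DISKS);
    return sh->sector; /* independent of which device is asked for */
}
```

Whatever device index is passed in, the answer is the stripe's own sector, which is the alignment property the paragraph describes.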

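An earlier post introduced the state bits of stripe_head, what each one means, and r5conf, the global RAID5 configuration; see my post "Basic Architecture and Data Structures of RAID5 in the Linux Kernel" for all of that. Now let's talk about how the kernel actually manages stripe_head structures.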

In one sentence: a stripe_head is placed on a different list depending on its state. The lists are handle_list, hold_list, and delayed_list in r5conf, plus inactive_list and temp_inactive_list. Each list means the following:

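handle_list: the stripe_heads that need to be handled

hold_list: the stripe_heads whose preread state is ready

delayed_list: the stripe_heads whose handling must be delayed because a precondition is missing

inactive_list: the inactive stripe_heads

temp_inactive_list: a staging cache for inactive_list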

Next we follow a stripe_head from birth to death and watch, step by step, how the kernel manages it.

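When a request arrives, the RAID5 entry point is make_request(). Since this post is about stripe_head management, the other parts are covered only in outline; their details will be explained in a later post. make_request() contains the following code: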

/* make_request() */
for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
    int previous;
    int seq;

    do_prepare = false;
retry:
    seq = read_seqcount_begin(&conf->gen_lock);
    previous = 0;
    if (do_prepare)
        prepare_to_wait(&conf->wait_for_overlap, &w,
                        TASK_UNINTERRUPTIBLE);
    if (unlikely(conf->reshape_progress != MaxSector)) {
        /* spinlock is needed as reshape_progress may be
         * 64bit on a 32bit platform, and so it might be
         * possible to see a half-updated value
         * Of course reshape_progress could change after
         * the lock is dropped, so once we get a reference
         * to the stripe that we think it is, we will have
         * to check again.
         */
        spin_lock_irq(&conf->device_lock);
        if (mddev->reshape_backwards
            ? logical_sector < conf->reshape_progress
            : logical_sector >= conf->reshape_progress) {
            previous = 1;
        } else {
            if (mddev->reshape_backwards
                ? logical_sector < conf->reshape_safe
                : logical_sector >= conf->reshape_safe) {
                spin_unlock_irq(&conf->device_lock);
                schedule();
                do_prepare = true;
                goto retry;
            }
        }
        spin_unlock_irq(&conf->device_lock);
    }

    new_sector = raid5_compute_sector(conf, logical_sector,
                                      previous,
                                      &dd_idx, NULL); /* compute the offset */
    pr_debug("raid456: make_request, sector %llu logical %llu\n",
             (unsigned long long)new_sector,
             (unsigned long long)logical_sector);

    sh = get_active_stripe(conf, new_sector, previous,
                           (bi->bi_rw & RWA_MASK), 0); /* fetch the stripe */

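Note the for loop at the top: it slices the bio request into units that one stripe_head can process. By default that unit is one page, because each stripe_head handles one page (4 KB by default) on each disk. Ignore the middle part for now. raid5_compute_sector() computes the on-disk offset of this page-sized slice of the bio; how it does so does not matter yet and will get a post of its own. Its result is new_sector, the offset on the current disk. As said earlier, every device in a stripe_head uses the same on-disk offset, so this slice of the bio, together with the data at offset new_sector on the other disks, makes up the dev buffers of one stripe_head.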
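The slicing done by the for loop can be modelled in user space. The sketch below is not kernel code: it assumes 512-byte sectors and 4 KB pages (so STRIPE_SECTORS is 8), it assumes the start sector is rounded down to a STRIPE_SECTORS boundary before the loop, and the helper name count_stripe_slices is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Assumption: 4 KB pages and 512-byte sectors, so one stripe unit is 8 sectors. */
#define STRIPE_SECTORS 8ULL

/*
 * Hypothetical helper: count how many stripe_head-sized slices a request
 * covering [start, start + nr_sectors) is cut into.
 */
static unsigned int count_stripe_slices(sector_t start, sector_t nr_sectors)
{
    sector_t logical_sector = start & ~(STRIPE_SECTORS - 1); /* round down */
    sector_t last_sector = start + nr_sectors;
    unsigned int slices = 0;

    assert(last_sector >= start); /* the model does not handle wrap-around */

    for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS)
        slices++;
    return slices;
}
```

Under these assumptions an aligned 8-sector request yields one slice, while an 8-sector request starting at sector 4 straddles two stripe units and yields two.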

Obtaining a stripe_head

Next, the offset new_sector is used to obtain a stripe_head. That is the job of get_active_stripe(), shown here:

static struct stripe_head *
get_active_stripe(struct r5conf *conf, sector_t sector,
                  int previous, int noblock, int noquiesce)
{
    struct stripe_head *sh;
    int hash = stripe_hash_locks_hash(sector);

    pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);

    spin_lock_irq(conf->hash_locks + hash);

    do {
        wait_event_lock_irq(conf->wait_for_stripe,
                            conf->quiesce == 0 || noquiesce,
                            *(conf->hash_locks + hash));
        sh = __find_stripe(conf, sector, conf->generation - previous);
        if (!sh) {
            if (!conf->inactive_blocked)
                sh = get_free_stripe(conf, hash);
            if (noblock && sh == NULL)
                break;
            if (!sh) {
                conf->inactive_blocked = 1;
                wait_event_lock_irq(
                    conf->wait_for_stripe,
                    !list_empty(conf->inactive_list + hash) &&
                    (atomic_read(&conf->active_stripes)
                     < (conf->max_nr_stripes * 3 / 4)
                     || !conf->inactive_blocked),
                    *(conf->hash_locks + hash));
                conf->inactive_blocked = 0;
            } else {
                init_stripe(sh, sector, previous);
                atomic_inc(&sh->count);
            }
        } else if (!atomic_inc_not_zero(&sh->count)) {
            spin_lock(&conf->device_lock);
            if (!atomic_read(&sh->count)) {
                if (!test_bit(STRIPE_HANDLE, &sh->state))
                    atomic_inc(&conf->active_stripes);
                BUG_ON(list_empty(&sh->lru) &&
                       !test_bit(STRIPE_EXPANDING, &sh->state));
                list_del_init(&sh->lru);
                if (sh->group) {
                    sh->group->stripes_cnt--;
                    sh->group = NULL;
                }
            }
            atomic_inc(&sh->count);
            spin_unlock(&conf->device_lock);
        }
    } while (sh == NULL);

    spin_unlock_irq(conf->hash_locks + hash);
    return sh;
}

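As mentioned before, RAID5 has 256 stripe_heads by default, and they are told apart only by their sector field, so stripe_heads are reused in rotation. To manage them efficiently, RAID5 uses hash lists and LRU lists: a hash value is computed from each sector (the new_sector computed above), and the stripe_head is then looked up in the hash list or the LRU list, which is much faster.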

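Looking at the code above: the hash value for sector is computed first, followed by one big do-while loop that ends only when sh != NULL. In other words, the function insists on getting a stripe_head and will not leave without one; the only exception is a caller that passes noblock, in which case the loop breaks and NULL is returned when no free stripe_head is available. Fair enough, so let's see how it gets one: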

1. __find_stripe() first searches the hash list selected by the hash value. Here is __find_stripe():

static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
                                         short generation)
{
    struct stripe_head *sh;

    pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
    hlist_for_each_entry(sh, stripe_hash(conf, sector), hash)
        if (sh->sector == sector && sh->generation == generation)
            return sh;
    pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
    return NULL;
}

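This is plainly one pass over the hash list that returns as soon as it meets an entry with the same sector (and the same generation). A hit means a stripe_head for offset sector is already in use, so there is no need to take an empty stripe_head for it; this is also what makes sector unique.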
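To make the lookup concrete, here is a minimal user-space sketch of a chained hash table keyed by sector. It is an illustration under assumptions, not the kernel's implementation: the bucket count NR_HASH, the hash function in bucket_of(), and the struct r5conf_model type are all hypothetical, and a plain singly linked chain stands in for the kernel's hlist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define STRIPE_SHIFT 3   /* assumption: 8 sectors per stripe unit */
#define NR_HASH      16  /* illustrative bucket count, not the kernel's */

struct stripe_head {
    struct stripe_head *hash_next; /* singly linked bucket chain */
    sector_t sector;
    short generation;
};

struct r5conf_model {
    struct stripe_head *buckets[NR_HASH];
};

/* Illustrative hash: drop the in-stripe bits, then mask to a bucket index. */
static unsigned int bucket_of(sector_t sector)
{
    return (unsigned int)((sector >> STRIPE_SHIFT) & (NR_HASH - 1));
}

/* Walk one bucket chain and match on both sector and generation. */
static struct stripe_head *find_stripe(struct r5conf_model *conf,
                                       sector_t sector, short generation)
{
    struct stripe_head *sh;

    for (sh = conf->buckets[bucket_of(sector)]; sh; sh = sh->hash_next)
        if (sh->sector == sector && sh->generation == generation)
            return sh;
    return NULL;
}

/* Insert at the head of the bucket; a sector may be hashed only once. */
static void insert_hash(struct r5conf_model *conf, struct stripe_head *sh)
{
    unsigned int b = bucket_of(sh->sector);

    assert(find_stripe(conf, sh->sector, sh->generation) == NULL);
    sh->hash_next = conf->buckets[b];
    conf->buckets[b] = sh;
}
```

Two sectors that land in the same bucket are still told apart by the comparison inside the loop, so a collision costs only a few extra steps along the chain.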

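2. Back in get_active_stripe(): if __find_stripe() fails, no stripe_head with offset sector is currently in use, so get_free_stripe() is called to obtain an empty stripe_head. The very first time, there are no active stripe_heads at all, so execution is certain to enter get_free_stripe(). Here it is: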

/* find an idle stripe, make sure it is unhashed, and return it. */
static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
{
    struct stripe_head *sh = NULL;
    struct list_head *first;

    if (list_empty(conf->inactive_list + hash))
        goto out;
    first = (conf->inactive_list + hash)->next; /* first node on the list */
    sh = list_entry(first, struct stripe_head, lru); /* stripe_head that embeds this lru node */
    list_del_init(first); /* take it off the inactive list */
    remove_hash(sh); /* remove sh from the hash list */
    atomic_inc(&conf->active_stripes); /* one more active stripe_head */
    BUG_ON(hash != sh->hash_lock_index);
    /* if this inactive list is now empty, count one more empty list */
    if (list_empty(conf->inactive_list + hash))
        atomic_inc(&conf->empty_inactive_list_nr);
out:
    return sh;
}

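The function starts with an if that checks the inactive_list selected by the hash value. Because __find_stripe() returned NULL before this function was called, the stripe_head we need should in principle come from inactive_list. If the condition holds, that is, inactive_list has no stripe_head to offer, control jumps to out: and NULL is returned; otherwise the stripe_head is taken and removed from both the LRU list and the hash list.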
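The same steps can be tried out in user space with a few stand-ins for the kernel's list helpers. This is a sketch under assumptions: the list functions below are minimal re-implementations written for this example, struct conf_model is hypothetical and keeps a single inactive list instead of one per hash lock, and the hashed flag stands in for hash-table membership.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-ins for the kernel's list helpers. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    INIT_LIST_HEAD(e);
}

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct stripe_head {
    struct list_head lru;
    int count;   /* reference count; idle stripes hold none */
    int hashed;  /* stand-in for membership in the hash table */
};

struct conf_model {
    struct list_head inactive_list;
    int active_stripes;
};

/* Take the oldest idle stripe, unhash it, and count it as active. */
static struct stripe_head *get_free_stripe(struct conf_model *conf)
{
    struct list_head *first;
    struct stripe_head *sh;

    if (list_empty(&conf->inactive_list))
        return NULL;
    first = conf->inactive_list.next;
    sh = list_entry(first, struct stripe_head, lru);
    assert(sh->count == 0);
    list_del_init(first);
    sh->hashed = 0;            /* models remove_hash(sh) */
    conf->active_stripes++;
    return sh;
}
```

Because stripes are appended at the tail when they go idle and taken from the head here, the stripe that has been idle longest is reused first.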

3. With a stripe_head in hand, control returns to get_active_stripe(). The stripe_head is still empty at this point and has to be initialized, so init_stripe() is called to set its metadata. Here is init_stripe():

static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
{
    struct r5conf *conf = sh->raid_conf;
    int i, seq;

    BUG_ON(atomic_read(&sh->count) != 0);
    BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
    BUG_ON(stripe_operations_active(sh));

    pr_debug("init_stripe called, stripe %llu\n",
             (unsigned long long)sector);
retry:
    seq = read_seqcount_begin(&conf->gen_lock);
    sh->generation = conf->generation - previous;
    /* number of devices (disks) in the stripe */
    sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
    sh->sector = sector; /* offset of the stripe */
    stripe_set_idx(sector, conf, previous, sh); /* parity disk index, etc. */
    sh->state = 0; /* clear all state bits */

    for (i = sh->disks; i--; ) { /* for each device buffer */
        struct r5dev *dev = &sh->dev[i];

        /*
         * The stripe is empty, so every request list on this buffer
         * must be empty and the buffer must not be locked.
         */
        if (dev->toread || dev->read || dev->towrite || dev->written ||
            test_bit(R5_LOCKED, &dev->flags)) {
            printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
                   (unsigned long long)sh->sector, i, dev->toread,
                   dev->read, dev->towrite, dev->written,
                   test_bit(R5_LOCKED, &dev->flags));
            WARN_ON(1);
        }
        dev->flags = 0;
        /*
         * Set the offset of each buffer. The buffers of a stripe share
         * the same offset on their own disks, but their offsets in the
         * address space of the whole RAID5 array differ.
         */
        raid5_build_block(sh, i, previous);
    }
    if (read_seqcount_retry(&conf->gen_lock, seq))
        goto retry;
    insert_hash(conf, sh); /* put the stripe back into the hash list */
    sh->cpu = smp_processor_id();
}

The comments say it all: the function sets up the stripe_head's metadata.

4. Back in get_active_stripe(), a little more global bookkeeping is done and the function ends. At this point the stripe_head has been obtained and its metadata has been set.

State transitions of stripe_head

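Once a stripe_head (call it sh) has been obtained, it is driven mainly by setting and clearing bits in sh->state with set_bit() and clear_bit(). After get_active_stripe() returns, execution continues in make_request() and enters the following section: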

if (test_bit(STRIPE_EXPANDING, &sh->state) ||
    !add_stripe_bio(sh, bi, dd_idx, rw)) { /* add_stripe_bio: attach the bio to this stripe_head */
    /* Stripe is busy expanding or
     * add failed due to overlap.  Flush everything
     * and wait a while
     */
    md_wakeup_thread(mddev->thread);
    release_stripe(sh);
    schedule();
    do_prepare = true;
    goto retry;
}
set_bit(STRIPE_HANDLE, &sh->state); /* mark the stripe as needing handling */
clear_bit(STRIPE_DELAYED, &sh->state); /* clear the delayed flag */
if ((bi->bi_rw & REQ_SYNC) &&
    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
    atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh); /* put the stripe on the appropriate list for handling */

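Since our focus is stripe_head management, all we need to know about add_stripe_bio() is that it adds the bio to the stripe. The place where the stripe_head is actually dispatched is release_stripe_plug(). That concludes make_request()'s part in handling the stripe_head, so we move the battlefield to release_stripe_plug():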

static void release_stripe_plug(struct mddev *mddev,
                                struct stripe_head *sh)
{
    struct blk_plug_cb *blk_cb = blk_check_plugged(
        raid5_unplug, mddev,
        sizeof(struct raid5_plug_cb));
    struct raid5_plug_cb *cb;

    if (!blk_cb) {
        release_stripe(sh);
        return;
    }

    cb = container_of(blk_cb, struct raid5_plug_cb, cb);

    if (cb->list.next == NULL) {
        int i;
        INIT_LIST_HEAD(&cb->list);
        for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
            INIT_LIST_HEAD(cb->temp_inactive_list + i);
    }

    if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
        list_add_tail(&sh->lru, &cb->list); /* queue it on the plug list through its lru node */
    else
        release_stripe(sh);
}

A close read of the code shows that the exit paths lead to release_stripe(), which is the real interface for dispatching a stripe_head. Here is release_stripe():

static void release_stripe(struct stripe_head *sh)
{
    struct r5conf *conf = sh->raid_conf;
    unsigned long flags;
    struct list_head list;
    int hash;
    bool wakeup;

    /* Avoid release_list until the last reference.
     */
    if (atomic_add_unless(&sh->count, -1, 1))
        return;

    if (unlikely(!conf->mddev->thread) ||
        test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
        goto slow_path;
    wakeup = llist_add(&sh->release_list, &conf->released_stripes);
    if (wakeup)
        md_wakeup_thread(conf->mddev->thread);
    return;
slow_path:
    local_irq_save(flags);
    /* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
    if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
        INIT_LIST_HEAD(&list);
        hash = sh->hash_lock_index;
        do_release_stripe(conf, sh, &list); /* put sh on a list chosen by its state */
        spin_unlock(&conf->device_lock);
        release_inactive_stripe_list(conf, &list, hash); /* move collected stripes to the inactive list */
    }
    local_irq_restore(flags);
}

When reading code, look for the key part. Here that is do_release_stripe(); the comment already says what it does. Here it is:

static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
                              struct list_head *temp_inactive_list)
{
    BUG_ON(!list_empty(&sh->lru));
    BUG_ON(atomic_read(&conf->active_stripes) == 0);
    if (test_bit(STRIPE_HANDLE, &sh->state)) { /* sh needs handling */
        if (test_bit(STRIPE_DELAYED, &sh->state) &&
            !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
            list_add_tail(&sh->lru, &conf->delayed_list); /* delayed: goes on delayed_list */
        else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
                 sh->bm_seq - conf->seq_write > 0)
            list_add_tail(&sh->lru, &conf->bitmap_list);
        else {
            clear_bit(STRIPE_DELAYED, &sh->state); /* clear the delayed state */
            clear_bit(STRIPE_BIT_DELAY, &sh->state); /* clear the waiting-for-bitmap state */
            if (conf->worker_cnt_per_group == 0) {
                list_add_tail(&sh->lru, &conf->handle_list); /* goes on handle_list */
            } else {
                raid5_wakeup_stripe_thread(sh);
                return;
            }
        }
        md_wakeup_thread(conf->mddev->thread);
    } else { /* nothing to handle: recycle it */
        BUG_ON(stripe_operations_active(sh));
        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
            if (atomic_dec_return(&conf->preread_active_stripes)
                < IO_THRESHOLD)
                md_wakeup_thread(conf->mddev->thread); /* wake the daemon */
        atomic_dec(&conf->active_stripes); /* one fewer active stripe */
        if (!test_bit(STRIPE_EXPANDING, &sh->state))
            list_add_tail(&sh->lru, temp_inactive_list); /* goes on the inactive list */
    }
}

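As the comments show, the function puts the stripe_head on a different list according to its state. do_release_stripe() also contains the code that recycles a stripe_head. So where is the entry point for recycling?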
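The branching in do_release_stripe() can be restated as a small pure function, which makes the state-to-list mapping easy to check. This is a hypothetical model, not kernel code: the bit values are illustrative rather than those in raid5.h, pick_list and enum target_list are invented names, bitmap_pending stands in for the sh->bm_seq - conf->seq_write > 0 test, and the handle_list branch assumes worker_cnt_per_group is 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are illustrative; the real ones live in raid5.h. */
enum {
    STRIPE_HANDLE         = 1u << 0,
    STRIPE_DELAYED        = 1u << 1,
    STRIPE_PREREAD_ACTIVE = 1u << 2,
    STRIPE_BIT_DELAY      = 1u << 3,
    STRIPE_EXPANDING      = 1u << 4,
    STRIPE_KNOWN_BITS     = (1u << 5) - 1,
};

enum target_list {
    ON_DELAYED_LIST,
    ON_BITMAP_LIST,
    ON_HANDLE_LIST,
    ON_TEMP_INACTIVE_LIST,
    ON_NO_LIST,
};

/* Decide which list a released stripe lands on, given its state bits. */
static enum target_list pick_list(unsigned int state, bool bitmap_pending)
{
    assert((state & ~(unsigned int)STRIPE_KNOWN_BITS) == 0);

    if (state & STRIPE_HANDLE) {
        if ((state & STRIPE_DELAYED) && !(state & STRIPE_PREREAD_ACTIVE))
            return ON_DELAYED_LIST;
        if ((state & STRIPE_BIT_DELAY) && bitmap_pending)
            return ON_BITMAP_LIST;
        return ON_HANDLE_LIST;
    }
    /* Not marked for handling: recycle, unless the stripe is expanding. */
    if (state & STRIPE_EXPANDING)
        return ON_NO_LIST;
    return ON_TEMP_INACTIVE_LIST;
}
```

Note how STRIPE_PREREAD_ACTIVE overrides STRIPE_DELAYED: a stripe carrying both bits goes to handle_list rather than delayed_list.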

Recycling stripe_head

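After a stripe_head has been processed it has to be recycled, because the whole RAID5 array has only 256 stripe_heads. Where is the entry point for recycling? Since do_release_stripe() is where recycling is handled, a global search for its callers turns up __release_stripe(), and that is indeed the entry point. The recycling steps themselves are given in the comments above; they take only a few lines and are simple. Note that recycling does not clear the dev buffers of the stripe_head: as long as the buffered data is still there, a later read request for it does not have to go to disk again, which improves performance.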
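The effect of leaving the buffers alone can be seen in a small user-space model of the whole life cycle. Everything here is a simplified assumption: the pool holds two stripes instead of 256, a linear scan stands in for the hash lookup, a release counter stands in for the LRU order of the inactive list, and the uptodate flag stands in for valid data in the dev buffers. The names pool, get_stripe, and put_stripe are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define NR_STRIPES 2  /* tiny pool for illustration; the article cites 256 */

struct stripe_head {
    sector_t sector;
    int count;                /* references held by users */
    int hashed;               /* still findable by sector */
    int uptodate;             /* stand-in for valid data in the dev buffers */
    unsigned long idle_since; /* release order, used to pick the oldest */
};

struct pool {
    struct stripe_head sh[NR_STRIPES];
    unsigned long clock;
};

static void pool_init(struct pool *p)
{
    for (int i = 0; i < NR_STRIPES; i++)
        p->sh[i] = (struct stripe_head){ 0 };
    p->clock = 0;
}

/* Look up by sector; idle stripes stay findable until they are reused. */
static struct stripe_head *find_stripe(struct pool *p, sector_t sector)
{
    for (int i = 0; i < NR_STRIPES; i++)
        if (p->sh[i].hashed && p->sh[i].sector == sector)
            return &p->sh[i];
    return NULL;
}

/* Non-blocking acquire: reuse a cached stripe or recycle the oldest idle one. */
static struct stripe_head *get_stripe(struct pool *p, sector_t sector)
{
    struct stripe_head *sh = find_stripe(p, sector);

    if (!sh) {
        for (int i = 0; i < NR_STRIPES; i++) {
            struct stripe_head *c = &p->sh[i];
            if (c->count == 0 && (!sh || c->idle_since < sh->idle_since))
                sh = c;
        }
        if (!sh)
            return NULL;      /* every stripe is busy */
        sh->sector = sector;  /* re-initialise for the new sector */
        sh->hashed = 1;
        sh->uptodate = 0;     /* old data belonged to another sector */
    }
    sh->count++;
    return sh;
}

/* Release: the stripe becomes idle, but its buffers are left untouched. */
static void put_stripe(struct pool *p, struct stripe_head *sh)
{
    assert(sh->count > 0);
    if (--sh->count == 0)
        sh->idle_since = ++p->clock;
}
```

In this model, acquiring sector 8, marking it uptodate, releasing it, and acquiring sector 8 again returns the same stripe with the flag still set, which corresponds to serving the read from the cached buffer. The data is lost only when the idle stripe is recycled for a different sector.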

Summary

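Managing stripe_head comes down to one idea: the state field of a stripe_head decides which list it sits on and what is done with it. Drawing a diagram may make this easier to grasp.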

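Careful readers will have noticed that hold_list was never discussed. It is simple: think of it as the go-between from delayed_list to handle_list. When a stripe_head on delayed_list gets the condition it was waiting for, it is first moved to hold_list; later, when stripes are picked for handling and handle_list does not have enough of them, stripes are taken from hold_list to make up the difference. I hope this helps you understand how RAID5 in the kernel manages stripe_head structures.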
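The role of hold_list as a go-between can be sketched with three small queues. This is a simplified model under assumptions, not the kernel's code: stripes are plain integer ids, a fixed-size FIFO stands in for each kernel list, the condition that triggers activation is left to the caller, and the names conf_model, activate_delayed, and next_stripe are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Tiny FIFO of stripe ids; stands in for a kernel list. */
struct queue {
    int items[QUEUE_CAP];
    size_t head, len;
};

static int queue_push(struct queue *q, int id)
{
    if (q->len == QUEUE_CAP)
        return -1;
    q->items[(q->head + q->len) % QUEUE_CAP] = id;
    q->len++;
    return 0;
}

static int queue_pop(struct queue *q)
{
    int id;

    if (q->len == 0)
        return -1;
    id = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->len--;
    return id;
}

struct conf_model {
    struct queue handle_list, hold_list, delayed_list;
};

/* Once the delaying condition is gone, delayed stripes move to hold_list. */
static void activate_delayed(struct conf_model *conf)
{
    int id;

    while (conf->hold_list.len < QUEUE_CAP &&
           (id = queue_pop(&conf->delayed_list)) >= 0) {
        int rc = queue_push(&conf->hold_list, id);
        assert(rc == 0); /* room was checked in the loop condition */
        (void)rc;
    }
}

/* Prefer handle_list; fall back to hold_list when it has nothing to offer. */
static int next_stripe(struct conf_model *conf)
{
    int id = queue_pop(&conf->handle_list);

    return id >= 0 ? id : queue_pop(&conf->hold_list);
}
```

A stripe on delayed_list is never picked directly in this model; it becomes eligible only after activate_delayed() has moved it to hold_list, and even then anything on handle_list is served first.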

The next post covers raid5d, the RAID5 daemon in the kernel, which acts as the command center of the whole RAID5 subsystem.
