
Communication between Linux kernel space and user space

2014-01-06 16:50

Source: http://blog.chinaunix.net/uid-24613712-id-3969750.html

This part covers procfs, one of the ways Linux kernel space and user space communicate.

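/proc mainly holds kernel control and status information, so most of it logically lives in memory managed by the kernel. Running ls -l under /proc shows that most files and directories have size 0, yet cd into a directory or cat a file and there is real content. That is because /proc registers itself with the virtual file system (VFS) layer like any regular file system, but it builds a directory or file from kernel data only when the VFS calls into it to request the inode of that file or directory (in other words, these virtual files are created on demand).

/proc was originally designed only so the kernel could report its state to user-space processes. Over time it has become a half-duplex channel between kernel space and user space: a user can read kernel information from /proc and can also change some kernel behavior by writing to the corresponding files.

First, a quick look at some of the files under /proc.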

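--cmdline: the command-line parameters passed to the kernel at boot

--cpuinfo: CPU hardware information (model, family, cache size, ...)

--filesystems: the file systems the running kernel supports; when mount(1) is not told a file system type, it tries the types listed in this file one by one

--interrupts: interrupt usage and trigger counts, reportedly very useful when debugging interrupts

--ioports: the I/O port ranges currently registered and in use

--kmsg: the kernel log that dmesg shows; reading it can replace the syslog(2) system call for fetching kernel log messages

--kallsyms (ksyms on 2.4 kernels): the kernel symbol table; this file holds the definitions of the symbols the kernel exports

--loadavg:

cat /proc/loadavg

4.61 4.36 4.15 9/84 5662

The fields mean:

lavg_1 (4.61): 1-minute load average

lavg_5 (4.36): 5-minute load average

lavg_15 (4.15): 15-minute load average

nr_running (9): the number of tasks on the run queue at sampling time; same meaning as procs_running in /proc/stat

nr_threads (84): the number of tasks alive in the system at sampling time (tasks that have already finished are not counted)

last_pid (5662): the most recently allocated PID, counting lightweight processes (threads) as well

--locks: the file locks currently held in the kernel

--modules: the list of loaded modules, as shown by lsmod

--mounts: the list of mounted file systems, as shown by mount without arguments

--mtd: the partition name of each MTD device

--partitions: the partitions the system recognizes

--stat: overall system statistics; figures such as CPU utilization are derived from it

--version: the kernel version

--sys: some kernel control parameters can be changed by writing to files under the sys directory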
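The /proc/loadavg fields listed above can be pulled apart with a single sscanf call. This is a minimal sketch: the struct and the function name parse_loadavg are hypothetical, only the field names come from the list.

```c
#include <stdio.h>

/* Field names follow the list above; the struct itself is hypothetical. */
struct loadavg {
    double lavg_1, lavg_5, lavg_15;
    int nr_running, nr_threads;
    int last_pid;
};

/* Parse one line in /proc/loadavg format. Returns 0 on success, -1 otherwise. */
int parse_loadavg(const char *line, struct loadavg *out)
{
    int n = sscanf(line, "%lf %lf %lf %d/%d %d",
                   &out->lavg_1, &out->lavg_5, &out->lavg_15,
                   &out->nr_running, &out->nr_threads, &out->last_pid);
    return n == 6 ? 0 : -1;
}
```

In a real program the line would come from fgets() on a FILE opened on /proc/loadavg.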

Now to the main topic.

1. The main functions needed to communicate between kernel space and user space through the proc file system:

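------------Create a directory: struct proc_dir_entry *proc_mkdir(const char *name, struct proc_dir_entry *parent)

name is the name of the directory to create and parent is the directory that will contain it; if parent is NULL the directory is created directly under /proc.

------------Create a file: struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode, struct proc_dir_entry *parent)

name and parent are as above; mode is the permission mode of the file to create. (This interface belongs to older kernels: create_proc_entry() and the read_proc/write_proc callbacks were removed in Linux 3.10 in favor of proc_create() with file operations.)

------------Both functions involve struct proc_dir_entry, which is defined in include/linux/proc_fs.h:

Only the members used in this walkthrough are listed here.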

struct proc_dir_entry{

...

read_proc_t *read_proc;

write_proc_t *write_proc;

...

};

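① About read_proc

typedef int (read_proc_t)(char *page, char **start, off_t off, int count, int *eof, void *data);

page is where the data is written; it lives in kernel space and is one page in size. count is the maximum number of bytes that may be written. start and off come into play when more than one page of data is returned (a page is usually 4 KB). Once all the data has been written, set *eof to 1. data is a pointer to private data.

When cat or any other operation reads a file under /proc, this function copies the requested kernel data into the kernel buffer that page points to. Because page is in kernel space there is no need to call copy_to_user; the procfs code copies the buffer to user space itself.

I have not found a good place to study read_proc in depth, so I cannot verify exactly how it works. What follows is only my own guess, and corrections from anyone who knows it well are welcome. My guesses: ① The return value is the number of bytes read by this call. On each call the system adds the previous return value to off, so the return value must be the byte count of this call; that keeps off pointing at the start of the memory that can be written next. ② If *eof is not set to 1 and read_proc returns a nonzero value, reading continues. ③ start points at the current page: for multi-page data, reading starts at offset off within the page *start. (The original post illustrated this with a figure.)

② About write_proc

typedef int (write_proc_t)(struct file *file, const char __user *buffer, unsigned long count, void *data);

file can be ignored for now. buffer is the string data passed in; it sits in a user-space buffer, so it cannot be read directly and must first be brought in with copy_from_user. count is the number of bytes in buffer to be written. data, as with read_proc, is a private data pointer passed to read_proc or write_proc when it is called.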
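The off/count/eof bookkeeping described for read_proc can be tried out in user space. The sketch below is a simplified model of that clamping, not the kernel's actual procfs read path; model_read and its parameters are hypothetical names, with src/len standing in for the kernel data and page for the one-page buffer.

```c
#include <string.h>

/*
 * User-space model of the bookkeeping a read_proc-style callback does.
 * Returns the number of bytes produced by this call and sets *eof once
 * the last byte has been delivered.
 */
int model_read(const char *src, int len, char *page,
               long off, int count, int *eof)
{
    if (off >= len) {          /* nothing left to read */
        *eof = 1;
        return 0;
    }
    if (count >= len - off) {  /* this call delivers the rest of the data */
        count = (int)(len - off);
        *eof = 1;
    }
    memcpy(page + off, src + off, (size_t)count);
    return count;
}
```

Reading "hello" in chunks of 3 yields 3 bytes, then 2 bytes with *eof set, then 0 bytes.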

Now let's write a demo program together.

Code:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/string.h>
#include <asm/uaccess.h>

#define PROC_NAME "yzy_test"
struct proc_dir_entry *my_proc_fs;
static char kernel_buf[128];

int read_proc(char *page, char **start, off_t off, int count, int *eof, void *data);

int write_proc(struct file *file, const char __user *buffer, unsigned long count, void *data);

static int __init module_test_init(void)
{
    memset(kernel_buf, 0x00, sizeof(kernel_buf));
    my_proc_fs = create_proc_entry(PROC_NAME, 0666, NULL);
    if (!my_proc_fs)    /* creating the entry can fail */
        return -ENOMEM;
    my_proc_fs->read_proc = read_proc;
    my_proc_fs->write_proc = write_proc;
    return 0;
}

static void __exit module_test_exit(void)
{
    remove_proc_entry(PROC_NAME, NULL);
}

int read_proc(char *page, char **start, off_t off, int count, int *eof, void *data)
{
    int len = strlen(kernel_buf);

    if (off >= len) {           /* nothing left to read */
        *eof = 1;
        return 0;
    }
    if (count >= len - off) {   /* this call delivers the rest of the data */
        count = len - off;
        *eof = 1;
    }
    memcpy(page + off, kernel_buf + off, count);
    /* page is not NUL-terminated, so print the source buffer instead */
    printk("read:%s\n", kernel_buf);
    return count;
}

int write_proc(struct file *file, const char __user *buffer, unsigned long count, void *data)
{
    if (count > sizeof(kernel_buf) - 1)     /* leave room for the terminating NUL */
        count = sizeof(kernel_buf) - 1;
    /* copy_from_user returns 0 on success, otherwise the number of bytes not copied */
    if (copy_from_user(kernel_buf, buffer, count))
        return -EFAULT;
    kernel_buf[count] = 0;
    printk("write:%s\n", kernel_buf);
    return count;
}

module_init(module_test_init);
module_exit(module_test_exit);

MODULE_AUTHOR("test");
MODULE_DESCRIPTION("procfs test module");
MODULE_LICENSE("GPL");

Makefile:
ifneq ($(KERNELRELEASE),)
procfs-objs := test_procfs.o
obj-m := test_procfs.o
else
K_VER ?= $(shell uname -r)
K_DIR := /lib/modules/$(K_VER)/build
M_DIR := $(shell pwd)
all:
	$(MAKE) -C $(K_DIR) M=$(M_DIR)
clean:
	rm -rf *.o *.ko *.symvers *.order *mod.c
endif

With the code written, build it.

Load the resulting .ko file with /sbin/insmod and unload it with /sbin/rmmod; lsmod shows whether the module loaded successfully.
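Once the module is loaded, the entry can be exercised from user space with ordinary file I/O. The following is a minimal sketch; proc_roundtrip is a hypothetical helper, and it works on any path, so it can be tried on a regular file first.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write msg to path, then read the file back into out (NUL-terminated).
 * outlen must be at least 1. Returns the number of bytes read back,
 * or -1 on error.
 */
long proc_roundtrip(const char *path, const char *msg, char *out, size_t outlen)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(msg, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)     /* buffered data is written out on close */
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(out, 1, outlen - 1, f);
    fclose(f);
    out[n] = '\0';
    return (long)n;
}
```

Against the demo module the call would be proc_roundtrip("/proc/yzy_test", "hello", buf, sizeof(buf)), which has the same effect as echo followed by cat on that file.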

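Source: http://www.blogbus.com/wanderer-zjhit-logs/151138377.html

1 Ordinary data operations by a user can only take place in user space; to operate on kernel data, a standard interface must be used.

There are many ways to read or modify kernel data.

1.1 Use a loadable module, which can work on kernel data structures the way user code works on its own data

1.2 Use the standard system calls, or an added system call, to operate on kernel data

1.3 Use procfs to read and write kernel data [the sysctl system call]

1.4 Use sysfs to read and write kernel data

1.5 Use the seq_file interface to operate on kernel data [adding your own kernel-side operations]

1.6 Use the kprobe, jprobe and kretprobe debugging facilities to operate on kernel data

1.7 Use netlink hook functions to operate on kernel data structures

1.8 Access a device file to operate on kernel data structures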

.....

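2 Differences between kernel mode and user mode

Kernel mode grew out of protected mode. In essence, Linux does not trust the user, nor does it assume the user can make good use of the powerful resource management the OS provides. The OS therefore runs in a space isolated from the user, at privilege level 0 (ring 0). The kernel-mode address space is the same for every process, because the code running there is the operating system itself, carrying out basic resource management. On 32-bit x86 its linear addresses lie at 0xC0000000 and above, and the directly mapped part follows the rule physical address = linear address - 3G. User space is private to each process [which is why processes differ from one another], and it is the space we normally deal with. It is managed through the mm member of task_struct, which divides memory into a number of regions (vm_area_struct) and relies on page tables to manage them. User space can be swapped in and out, whereas kernel space cannot be swapped out...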
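The address rule quoted above is simple arithmetic. The sketch below assumes the 32-bit x86 layout the text describes, with kernel linear addresses starting at 3G; the macro and function names are hypothetical, and the subtraction applies only to the directly mapped part of kernel memory.

```c
#include <stdint.h>

/* Assumption from the text: 32-bit x86, kernel linear addresses start at 3G. */
#define PAGE_OFFSET_32 0xC0000000u

/* Nonzero when the linear address belongs to the kernel part of the space. */
int is_kernel_addr(uint32_t linear)
{
    return linear >= PAGE_OFFSET_32;
}

/* physical address = linear address - 3G, for directly mapped kernel memory. */
uint32_t lowmem_virt_to_phys(uint32_t linear)
{
    return linear - PAGE_OFFSET_32;
}
```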

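3 The interface underlying kernel/user interaction

Section 1 listed many ways to interact with kernel space. This resembles how Linux accesses its different underlying file systems [at the VFS layer everything comes down to the same read and write system calls]. Kernel data exchange is similar, though seen from the opposite direction: the many methods above all come down to two basic functions [copy_from_user and copy_to_user]. So to grasp the essence, go for the throat and analyze copy_from_user.

4 copy_from_user() in detail, kernel version 2.6.35.3

4-1 arch/x86/include/asm/uaccess_32.h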

static inline unsigned long __must_check copy_from_user(void *to,
const void __user *from,
unsigned long n)
{
int sz = __compiletime_object_size(to);   /* a macro */

if (likely(sz == -1 || sz >= n))
n = _copy_from_user(to, from, n);
else
copy_from_user_overflow();

return n;
}

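Explanation: first check whether the kernel buffer to is large enough for n bytes (possible only when its size is known at compile time); if it is not, copy_from_user_overflow() emits WARN(1, "Buffer overflow detected!\n").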

4-2 _copy_from_user(to,from,n)

/**
* copy_from_user: - Copy a block of data from user space.
* @to:   Destination address, in kernel space.
* @from: Source address, in user space.
* @n:    Number of bytes to copy.
*
* Context: User context only.  This function may sleep.
*
* Copy data from user space to kernel space.
*
* Returns number of bytes that could not be copied.
* On success, this will be zero.
*
* If some data could not be copied, this function will pad the copied
* data to the requested size using zero bytes.
*/
unsigned long
_copy_from_user(void *to, const void __user *from, unsigned long n)
{
if (access_ok(VERIFY_READ, from, n))
n = __copy_from_user(to, from, n);
else
memset(to, 0, n);
return n;
}

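Explanation: the comment says it all. access_ok() checks that the user-space range is valid [does not go beyond 0xC0000000]; the question of user-space mappings is looked at later.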
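The contract stated in that comment (return the number of bytes not copied, zero-fill whatever could not be copied) can be modeled in user space. This is an illustrative sketch only: model_copy_from_user is a hypothetical name, and the readable parameter is an assumption standing in for a fault partway through the user buffer.

```c
#include <string.h>

/*
 * User-space model: only the first `readable` bytes of `from` can be read.
 * Copies what it can, zero-fills the rest of `to`, and returns the number
 * of bytes NOT copied, so 0 means complete success.
 */
unsigned long model_copy_from_user(void *to, const void *from,
                                   unsigned long n, unsigned long readable)
{
    unsigned long ok = n < readable ? n : readable;

    memcpy(to, from, ok);
    memset((char *)to + ok, 0, n - ok);
    return n - ok;
}
```

This is why kernel callers treat any nonzero return value as failure and typically return -EFAULT.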

4-3 __copy_from_user(to,from,n) arch/x86/include/asm/uaccess_32.h

/*
* An alternate version - __copy_from_user_inatomic() - may be called from
* atomic context and will fail rather than sleep.  In this case the
* uncopied bytes will *NOT* be padded with zeros.  See fs/filemap.h
* for explanation of why this is needed.
*/
static __always_inline unsigned long
__copy_from_user(void *to, const void __user *from, unsigned long n)
{
might_fault();
if (__builtin_constant_p(n)) {
unsigned long ret;

switch (n) {
case 1:
__get_user_size(*(u8 *)to, from, 1, ret, 1);
return ret;
case 2:
__get_user_size(*(u16 *)to, from, 2, ret, 2);
return ret;
case 4:
__get_user_size(*(u32 *)to, from, 4, ret, 4);
return ret;
}
}
return __copy_from_user_ll(to, from, n);
}

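Explanation: first check whether n is a compile-time constant of 1, 2 or 4 bytes; for those fixed sizes the copy is a single simple access. Otherwise call the general copy routine, which suits larger transfers.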

4-4 __copy_from_user_ll(to, from, n) /arch/x86/lib

unsigned long __copy_from_user_ll(void *to, const void __user *from,
unsigned long n)
{
if (movsl_is_ok(to, from, n))
__copy_user_zeroing(to, from, n);
else
n = __copy_user_zeroing_intel(to, from, n);
return n;
}

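Explanation: movsl_is_ok() first decides whether a plain rep movsl copy is suitable for this transfer (the check matters for large copies); it usually returns 1.

Note: Linux uses AT&T assembler syntax, in which the left operand is the source and the right operand is the destination, the reverse of Intel syntax.

4-5 __copy_user_zeroing(to, from, size) /arch/x86/lib, the heart of the copy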

#define  __copy_user_zeroing(to, from, size)    \
do {         \
int __d0, __d1, __d2;      \
__asm__ __volatile__(      \        #note: the bulk of the copy moves 4 bytes at a time
" cmp  $7,%0\n"     \               #compare size (ecx) with 7
" jbe  1f\n"     \                  #size <= 7: jump to label 1 and copy byte by byte
" movl %1,%0\n"     \               #ecx = edi, the destination address (size is still held in operand 3)
" negl %0\n"     \                  #ecx = -to (two's complement)
" andl $7,%0\n"     \               #ecx = (-to) & 7, the bytes needed to bring the destination to an 8-byte boundary
" subl %0,%3\n"     \               #X -= ecx: register X (operand 3) = size minus the leading bytes
"4: rep; movsb\n"     \             #copy the leading bytes one at a time
" movl %3,%0\n"     \               #ecx = X, the bytes still to copy
" shrl $2,%0\n"     \               #ecx = ecx / 4, the number of movsl iterations
" andl $3,%3\n"     \               #X = X % 4, the trailing bytes

" .align 2,0x90\n"    \
"0: rep; movsl\n"     \             #copy 4 bytes per iteration
" movl %3,%0\n"     \               #ecx = X, the trailing byte count
"1: rep; movsb\n"     \             #copy the trailing bytes (or the whole buffer when size <= 7)
"2:\n"       \
".section .fixup,\"ax\"\n"    \
"5: addl %3,%0\n"  \                #fault at label 4: ecx = leading bytes left + X = bytes not yet copied
" jmp 6f\n"     \                   #continue at label 6
"3: lea 0(%3,%0,4),%0\n"    \       #fault at label 0: ecx = ecx * 4 + X = bytes not yet copied
"6: pushl %0\n"     \               #save ecx; it becomes the return value
" pushl %%eax\n"     \              #save eax, which rep stosb uses as the fill value
" xorl %%eax,%%eax\n"    \          #eax = 0
" rep; stosb\n"     \               #zero the part of the kernel buffer that was not filled
" popl %%eax\n"     \               #restore eax
" popl %0\n"     \                  #ecx = number of bytes not copied
" jmp 2b\n"     \                   #back to label 2, i.e. leave the copy
".previous\n"      \
".section __ex_table,\"a\"\n"    \  #the exception table, used to recover from faults during the copy
" .align 4\n"     \                 #align to 4 bytes
" .long 4b,5b\n"     \              #label 5 is the fixup handler for label 4
" .long 0b,3b\n"     \              #label 3 is the fixup handler for label 0
" .long 1b,6b\n"     \              #label 6 is the fixup handler for label 1
".previous"      \
: "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \
: "3"(size), "0"(size), "1"(to), "2"(from)  \
: "memory");      \
} while (0)

Explanation:

1 Operand bindings

Operand	Register	Initial value	Output
0	ecx	size	size
1	edi	to pointer	__d0
2	esi	from pointer	__d1
3	any general register	size	__d2
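The arithmetic the macro performs before copying can be restated in plain C. The sketch below only computes how a copy is split into leading bytes, 4-byte words and trailing bytes; struct copy_plan and plan_copy are hypothetical names, and the comments point at the instruction each step mirrors.

```c
#include <stdint.h>

struct copy_plan {
    unsigned long head;   /* bytes copied by "4: rep; movsb" */
    unsigned long words;  /* iterations of "0: rep; movsl" */
    unsigned long tail;   /* bytes copied by "1: rep; movsb" */
};

struct copy_plan plan_copy(uintptr_t to, unsigned long size)
{
    struct copy_plan p = {0, 0, 0};

    if (size <= 7) {                         /* cmp $7 / jbe 1f */
        p.tail = size;
        return p;
    }
    p.head = (unsigned long)((0 - to) & 7);  /* negl / andl $7 */
    size -= p.head;                          /* subl */
    p.words = size >> 2;                     /* shrl $2 */
    p.tail = size & 3;                       /* andl $3 */
    return p;
}
```

For a destination of 0x1003 and 30 bytes this gives 5 leading bytes, 6 words and 1 trailing byte, which adds up to 30.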

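2 Analysis

2-1

Four sections handled by GNU gcc and ld are involved here: .text, .data, .fixup and __ex_table.

The .fixup section holds the recovery code that runs after a fault; it is otherwise much like .text.

The __ex_table section holds the exception address table.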

2-2

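When the CPU accesses memory, both kernel space and user space use linear addresses, and the CPU translates linear addresses to physical addresses automatically [in user mode and kernel mode alike, through the process page tables]. A valid linear address means two things: it lies inside one of the vm_area_struct regions of the process's task_struct->mm, and a mapping from the linear address to a physical address exists, i.e. the content is present in physical memory. When an access fails there are two possibilities. In the first, the linear address is inside the process's virtual address space but has no physical mapping yet; the page may have been swapped out, or the range may have only just been allocated [the kernel is lazy by design], and the page-fault handler allocates physical memory and sets up the mapping. In the second, the linear address is not in the process's virtual address space at all; this is the familiar bad pointer, which leads to the familiar segmentation fault [a protection fault from touching memory without permission is another possible cause]. If a bad pointer occurs in user mode, the worst outcome is that the process is killed [the classic example being the big red X that appears while playing Dota, after which the game exits]. If it occurs in kernel mode, the whole system may crash [the Windows XP blue screen is quite likely caused this way]. Linux naturally does not leave this to chance; its countermeasure is as follows:

For every instruction that might fault, the Linux kernel prepares a "fixup address", such as the .fixup part above, on the principle that whoever uses such an instruction is responsible for the repair. In the code above, label 5 is the fixup for label 4, label 3 for label 0, and label 6 for label 1. At build time, address pairs such as (4, 5) are stored in struct exception_table_entry { unsigned long insn, fixup; }. insn would be the address of label 4 and fixup the address of label 5. If the access at label 4 hits a bad address [one not in the virtual address space], the page-fault handler ends up at bad_area. If the fault happened in user mode, the process is simply killed. If it happened in kernel mode, the kernel first searches the exception table through search_exception_tables() and finds an exception_table_entry, say with insn = address of label 4 and fixup = address of label 5. The kernel then does:

regs->ip = fixup, that is, it changes the instruction pointer the kernel will resume at, pulling the kernel back from the brink; the fixup code at label 5 then lets it get away unharmed.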
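The lookup step can be illustrated with a small user-space model. This is a sketch under assumptions rather than the kernel's code: struct ex_entry mirrors the insn/fixup pair described above, find_fixup is a hypothetical name, and the table is assumed to be sorted by insn so a binary search applies.

```c
#include <stddef.h>

struct ex_entry {
    unsigned long insn;   /* address of the instruction that may fault */
    unsigned long fixup;  /* address to resume at if it does */
};

/*
 * Binary search over a table sorted by insn. Returns the fixup address,
 * or 0 when the faulting address has no entry.
 */
unsigned long find_fixup(const struct ex_entry *tab, size_t n,
                         unsigned long fault_ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].insn == fault_ip)
            return tab[mid].fixup;
        if (tab[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

A fault handler modeled this way would set the saved instruction pointer to the returned value when it is nonzero, and treat 0 as an unrecoverable kernel fault.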

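Source: http://www.ibm.com/developerworks/cn/linux/l-netlink/

Download location for imp1.tar.gz and imp2.tar.gz

Most Linux kernel-mode code needs to exchange data with user-space processes, but kernel mode cannot adequately support the traditional Linux methods of inter-process synchronization and communication. This article summarizes and compares several ways of communicating between kernel mode and user-mode processes, and recommends netlink sockets for communication between interrupt context and user-mode processes.

1 Introduction

Linux is an open-source operating system. Ordinary users and enterprises alike can write their own kernel code and trim the standard kernel to build an operating system that suits them. Many network devices for low-end and mid-range users run an operating system derived from standard Linux, which also shows that more and more people are joining the Linux kernel development community.

One or more kernel modules alone cannot meet the needs of typical Linux system software, because the kernel is too restrictive: it cannot print to a terminal, cannot do long-delay processing, and so on. When such things are needed, the data collected in kernel mode must be passed to one or more user-mode processes for handling. That makes communication between kernel mode and user-space processes especially important. The Linux kernel distribution does not describe such methods in detail, and no other article summarizes them, so this article lists several methods and analyzes their implementation and applicable environments in detail.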

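2 The run-time environment of Linux kernel modules and traditional inter-process communication

On a computer running Linux, a CPU is at any moment in one of the following four states:

[1] Handling a hard interrupt.

[2] Handling a soft interrupt, such as a softirq, tasklet or bh.

[3] Running in kernel mode with process context, i.e. associated with a process.

[4] Running a user-mode process.

[1], [2] and [3] run in kernel space, while [4] runs in user space. Apart from [4], each state can be preempted only by the states above it; for example, a soft interrupt can be preempted only by a hard interrupt.

A Linux kernel module is a piece of code that can be loaded into and unloaded from the kernel dynamically; once loaded it starts working in the kernel right away. Kernel code runs in three environments: user context, hard-interrupt context and soft-interrupt context. In terms of restrictions there are only two kinds, because soft-interrupt context is just a continuation of hard-interrupt context. Table 1 compares them.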

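Table 1

Kernel-mode environment	Description	Restrictions
User context	The kernel code runs on behalf of a user-space process, e.g. the code executed in a system call.	Local variables cannot be handed directly to user-mode memory, because kernel mode and user mode map memory differently.
Hard- and soft-interrupt context	The code runs during a hard or soft interrupt, e.g. the code that receives IP datagrams, network device drivers, and so on.	Data cannot be passed directly to user-mode memory; the code must not block while running.

Traditional Linux inter-process communication comes in many forms: pipes of various kinds, message queues, shared memory, semaphores and so on. None of them can be used between kernel mode and user mode, for the reasons given in Table 2.

Table 2

Communication method	Why it cannot be used between kernel mode and user mode
Pipes (excluding named pipes)	Limited to communication between parent and child processes.
Message queues	Cannot receive data without blocking in hard or soft interrupts.
Semaphores	Cannot be used between kernel mode and user mode.
Shared memory	Needs semaphores for coordination, and semaphores cannot be used.
Sockets	Cannot receive data without blocking in hard or soft interrupts.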

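3 Proposed methods for kernel-mode/user-mode process communication and their implementation

3.1 User context

Code running in user context may block, so message queues and UNIX domain sockets can be used between kernel mode and user mode. These methods transfer data rather inefficiently, however. The Linux kernel provides copy_from_user()/copy_to_user() for copying data between kernel mode and user mode, but both functions may block, so they cannot be used in hard or soft interrupts. They are normally used in system-call-like functions, which "shuttle" between kernel mode and user mode as they run. Figure 1 shows how this works.

Figure 1

The system calls involved have to be written by the user and loaded into the kernel. imp1.tar.gz is an example: the kernel module registers a set of socket-option functions so that a user-space process can call them to read and write kernel data. The source consists of three files: imp1.h is the common header defining macros used on both sides, imp1_k.c is the kernel module source, and imp1_u.c is the user-mode process source. The example has a user-mode process send the string "a message from userspace\n" to the user context, after which the user context sends the string "a message from kernel\n" back to the user-mode process.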

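3.2 Hard- and soft-interrupt context

Unlike user context, hard- and soft-interrupt context has nothing to do with any user-mode process, and code running there must not block.

3.2.1 Using ordinary inter-process communication

Traditional inter-process communication cannot be used directly. Hard and soft interrupts do have a synchronization mechanism of their own, the spinlock, which can synchronize one interrupt context with another and an interrupt context with a kernel thread. A kernel thread runs with process context, so it can use sockets or message queues to obtain data from user space and then hand the data to the interrupt handler through a critical section. Figure 2 shows the basic idea.

Figure 2

An interrupt handler cannot wait indefinitely for a user-mode process to send data, so a kernel thread receives the data from user space and passes it on through the critical section. Sending data from the interrupt handler to user space must be non-blocking. This model is unsatisfactory: the kernel thread competes with user-mode processes for the CPU in order to receive data, which is inefficient, so the interrupt handler cannot receive data from user space in real time.

3.2.2 netlink sockets

In Linux 2.4 and later kernels, almost all communication between interrupt handlers and user-mode processes uses netlink sockets. netlink is also used to implement the ip queue tool, but ip queue has its limitations and cannot be used freely from arbitrary interrupt handlers. Neither the kernel documentation nor other Linux articles explain in detail how netlink sockets are used between interrupt handlers and user space, which leaves many users with only a vague idea.

netlink socket communication is addressed by an identifier tied to a process, normally the process ID. When one end of the communication is an interrupt handler, that identifier is 0. When both ends are user-mode processes, netlink sockets are used much like message queues; when one end is an interrupt handler, usage differs. The outstanding feature of netlink sockets is their support for interrupt context: to receive user-space data in kernel space, the user no longer has to start a kernel thread, because a soft interrupt calls a receive function the user specified in advance. Figure 3 shows how this works.

Figure 3

Clearly, a soft interrupt rather than a kernel thread receives the data here, which keeps reception timely.

When a netlink socket is used between kernel space and user space, it is created in user space much like an ordinary socket, but creation in kernel space is different. Figure 4 shows how the netlink socket is created for this kind of communication.

Figure 4

An example application of netlink sockets follows. It intercepts ICMP datagrams at the NF_IP_PRE_ROUTING hook of netfilter and passes information about each datagram to a user-mode process, which prints it on the terminal. The source is in imp2.tar.gz. Kernel module code (explained piece by piece):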

(1) Module initialization and unloading

static struct sock *nlfd;
struct
{
__u32 pid;
rwlock_t lock;
}user_proc;
/* get_icmp() is the function hooked at the NF_IP_PRE_ROUTING point of the netfilter framework */
static struct nf_hook_ops imp2_ops =
{
.hook = get_icmp,		/* netfilter hook function */
.pf = PF_INET,
.hooknum = NF_IP_PRE_ROUTING,
.priority = NF_IP_PRI_FILTER -1,
};
static int __init init(void)
{
rwlock_init(&user_proc.lock);
/* Create a netlink socket in the kernel and name kernel_receive() as the receive function.
The protocol NL_IMP2 is user-defined. */
nlfd = netlink_kernel_create(NL_IMP2, kernel_receive);
if(!nlfd)
{
printk("can not create a netlink socket\n");
return -1;
}
/* Hook the function at the NF_IP_PRE_ROUTING point of netfilter */
return nf_register_hook(&imp2_ops);
}
static void __exit fini(void)
{
if(nlfd)
{
sock_release(nlfd->socket);
}
nf_unregister_hook(&imp2_ops);
}
module_init(init);
module_exit(fini);

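Fragment (1) is simple: on load, the module creates a netlink socket in kernel space and hooks a function at the NF_IP_PRE_ROUTING point of the netfilter framework. On unload, it releases the socket and unregisters the function it hooked into netfilter.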

(2) Receiving data from user space

DECLARE_MUTEX(receive_sem);
static void kernel_receive(struct sock *sk, int len)
{
do
{
struct sk_buff *skb;
if(down_trylock(&receive_sem))
return;

while((skb = skb_dequeue(&sk->receive_queue)) != NULL)
{
{
struct nlmsghdr *nlh = NULL;
if(skb->len >= sizeof(struct nlmsghdr))
{
nlh = (struct nlmsghdr *)skb->data;
if((nlh->nlmsg_len >= sizeof(struct nlmsghdr))
&& (skb->len >= nlh->nlmsg_len))
{
if(nlh->nlmsg_type == IMP2_U_PID)
{
write_lock_bh(&user_proc.lock);
user_proc.pid = nlh->nlmsg_pid;
write_unlock_bh(&user_proc.lock);
}
else if(nlh->nlmsg_type == IMP2_CLOSE)
{
write_lock_bh(&user_proc.lock);
if(nlh->nlmsg_pid == user_proc.pid) user_proc.pid = 0;
write_unlock_bh(&user_proc.lock);
}
}
}
}
kfree_skb(skb);
}
up(&receive_sem);
}while(nlfd && nlfd->receive_queue.qlen);
}

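Readers who have looked at the source of ip_queue.c or rtnetlink.c will notice that lines 03-18 and 31-38 of fragment (2) are the standard framework for receiving data on a netlink socket in kernel space. The framework takes all the data out of the socket buffer and checks whether each datagram is valid; a valid netlink datagram must start with an nlmsghdr header. The author uses self-defined message types here: IMP2_U_PID (the message carries the ID of the user-space process) and IMP2_CLOSE (the user-space process is closing). With SMP in mind, a read-write lock guards the critical section against access from different CPUs. kernel_receive() runs in soft-interrupt context.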

(3) Intercepting IP datagrams

static unsigned int get_icmp(unsigned int hook,
struct sk_buff **pskb,
const struct net_device *in,
const struct net_device *out,
int (*okfn)(struct sk_buff *))
{
struct iphdr *iph = (*pskb)->nh.iph;
struct packet_info info;
if(iph->protocol == IPPROTO_ICMP)	/* the transport protocol is ICMP */
{
read_lock_bh(&user_proc.lock);
if(user_proc.pid != 0)
{
read_unlock_bh(&user_proc.lock);
info.src = iph->saddr;	/* record the source address */
info.dest = iph->daddr;	/* record the destination address */
send_to_user(&info);		/* send the data */
}
else
read_unlock_bh(&user_proc.lock);
}
return NF_ACCEPT;
}

(4) Sending data

static int send_to_user(struct packet_info *info)
{
int ret;
int size;
unsigned char *old_tail;
struct sk_buff *skb;
struct nlmsghdr *nlh;
struct packet_info *packet;
size = NLMSG_SPACE(sizeof(*info));
/* allocate a new socket buffer */
skb = alloc_skb(size, GFP_ATOMIC);
if(!skb)	/* allocation can fail in atomic context */
return -1;
old_tail = skb->tail;
/* fill in the datagram header */
nlh = NLMSG_PUT(skb, 0, 0, IMP2_K_MSG, size-sizeof(*nlh));
packet = NLMSG_DATA(nlh);
memset(packet, 0, sizeof(struct packet_info));
/* the data passed to user space */
packet->src = info->src;
packet->dest = info->dest;
/* compute the actual data length after alignment */
nlh->nlmsg_len = skb->tail - old_tail;
NETLINK_CB(skb).dst_groups = 0;
read_lock_bh(&user_proc.lock);
ret = netlink_unicast(nlfd, skb, user_proc.pid, MSG_DONTWAIT); /* send the data */
read_unlock_bh(&user_proc.lock);
return ret;
nlmsg_failure: /* reached from NLMSG_PUT when the buffer has too little room: free the socket buffer */
if(skb)
kfree_skb(skb);
return -1;
}

The macros used in fragment (4) are listed below for reference:

/* byte alignment */
#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
/* length of the datagram including the header */
#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(sizeof(struct nlmsghdr)))
/* datagram length after alignment */
#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
/* fill in the header; this uses the nlmsg_failure label, so the calling function must define it */
#define NLMSG_PUT(skb, pid, seq, type, len) \
({ if (skb_tailroom(skb) < (int)NLMSG_SPACE(len)) goto nlmsg_failure; \
__nlmsg_put(skb, pid, seq, type, len); })
static __inline__ struct nlmsghdr *
__nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len)
{
struct nlmsghdr *nlh;
int size = NLMSG_LENGTH(len);
nlh = (struct nlmsghdr*)skb_put(skb, NLMSG_ALIGN(size));
nlh->nlmsg_type = type;
nlh->nlmsg_len = size;
nlh->nlmsg_flags = 0;
nlh->nlmsg_pid = pid;
nlh->nlmsg_seq = seq;
return nlh;
}
/* skip the header to reach the payload */
#define NLMSG_DATA(nlh)  ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))
/* access the netlink control block */
#define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
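The size arithmetic behind these macros is easy to check in user space. The sketch below mirrors them with local DEMO_ names and a local copy of the header layout so that it builds without kernel headers; everything prefixed demo_ or DEMO_ is hypothetical, and the 4-byte alignment is the value netlink uses for NLMSG_ALIGNTO.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct nlmsghdr so the sketch builds without kernel headers. */
struct demo_nlmsghdr {
    uint32_t nlmsg_len;
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;
    uint32_t nlmsg_pid;
};

#define DEMO_ALIGNTO      4u
#define DEMO_ALIGN(len)   (((len) + DEMO_ALIGNTO - 1) & ~(DEMO_ALIGNTO - 1))
#define DEMO_HDRLEN       DEMO_ALIGN(sizeof(struct demo_nlmsghdr))
#define DEMO_LENGTH(len)  ((len) + DEMO_HDRLEN)
#define DEMO_SPACE(len)   DEMO_ALIGN(DEMO_LENGTH(len))
#define DEMO_DATA(nlh)    ((void *)((char *)(nlh) + DEMO_LENGTH(0)))

/*
 * Build one message in buf: header followed by the payload.
 * Returns the aligned size the message occupies, or 0 if buf is too small.
 */
size_t demo_build_msg(void *buf, size_t buflen, uint16_t type,
                      uint32_t pid, const void *payload, size_t paylen)
{
    struct demo_nlmsghdr *nlh = buf;
    size_t space = DEMO_SPACE(paylen);

    if (space > buflen)
        return 0;
    memset(buf, 0, space);
    nlh->nlmsg_len = (uint32_t)DEMO_LENGTH(paylen);  /* header + payload, unpadded */
    nlh->nlmsg_type = type;
    nlh->nlmsg_pid = pid;
    memcpy(DEMO_DATA(nlh), payload, paylen);
    return space;
}
```

With a 16-byte header, a 5-byte payload gives a message length of 21 and occupies 24 bytes after alignment.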

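To run the example, first build the imp2_k.c module and load it into the kernel with insmod. Then run the compiled imp2_u program, which displays the source and destination addresses of the ICMP datagrams the machine is receiving. Ctrl+C terminates the user-space process, and starting it again causes no problems.

4 Summary

This article implemented kernel-space/user-space communication in different ways according to the environment the kernel code runs in, and analyzed how well each works in practice. It recommends netlink sockets for communication between interrupt context and user-mode processes, because netlink sockets were designed for exactly this kind of communication.

References

Linux 2.4 and later kernel source code;

www.netfilter.org;

RFC 3549;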

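Source: http://shentar.me/%E5%9F%BA%E4%BA%8Enetlink%E7%9A%84%E5%86%85%E6%A0%B8%E6%80%81%E4%B8%8E%E7%94%A8%E6%88%B7%E6%80%81%E5%BC%82%E6%AD%A5%E5%B9%B6%E5%8F%91%E6%95%B0%E6%8D%AE%E4%BC%A0%E8%BE%93%E6%A8%A1%E5%9E%8B/

User space uses the select model. At initialization it creates several netlink sockets and binds them, then sends a handshake message to the kernel on each one so that the kernel can remember the established connections and later pick an available one for sending data. After initialization and handshake, the kernel sends data to user space on its own initiative. The main user-space thread waits for read events on the socket descriptors and, when one is detected, submits a read-and-process task to a thread pool. This emulates a connection pool with event dispatch, so kernel data is read into the user-space program and processed promptly and concurrently.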
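The waiting step of this model can be sketched independently of netlink. In the example below any readable descriptors will do, so pipes can stand in for the netlink sockets when the kernel module is not available; wait_readable is a hypothetical helper, not part of the prototype that follows.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for any of fds[0..n-1] to become readable.
 * The fd_set is rebuilt on every call because select() overwrites it.
 * Returns the index of the first readable descriptor, -1 on timeout,
 * -2 on error.
 */
int wait_readable(const int *fds, int n, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int i, maxfd = -1;

    FD_ZERO(&rset);
    for (i = 0; i < n; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (ready < 0)
        return -2;
    if (ready == 0)
        return -1;
    for (i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &rset))
            return i;
    return -1;
}
```

A dispatcher loop would call this repeatedly and hand the descriptor at the returned index to a worker for reading.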

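On the kernel side, the netlink receive path is itself driven by the system call through which the business layer sends, so it is asynchronous by nature and performance is not a concern. When the kernel receives data, it only needs to hand it to a kernel thread for processing.

The prototype code follows:

Shared header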

#define NETLINK_TEST 21
#define MAX_DATA_LEN (768)

#define  MAX_PROCESS_COUNT 100

#define MAX_PAYLOAD 1024
#define MAX_PID_COUNT MAX_PROCESS_COUNT
#define MAX_REC_DATA_LEN 1024

#define STATIC_PEROID 1024*500

#define MSG_COUNT 10000

#define TRUE 1

User space

#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <string.h>
#include <asm/types.h>
#include <linux/netlink.h>
#include <linux/socket.h>
#include <sys/select.h>
#include <sys/time.h>
#include <pthread.h>
#include "conf.h"

struct endpoint
{
unsigned int pid;
struct sockaddr_nl src_addr;
struct sockaddr_nl dest_addr;
int sock_fd;
struct msghdr msg;
} *endpoits = NULL;

int listen_user(void);
int handwithknl(void);
int close_user(void);
int doselect(struct timeval wait);
void* threadProc(void* ed);
void* sendThreadProc(void* arg);
void flushcount(int len);
void staticsout(void);
void startSendThreads(void);
static int maxfd = 0;
static fd_set rset;
static struct timeval tmout;

unsigned char* readbuffer;

static struct timeval tmnow;
static struct timeval oldtime;
static long reccount = 0;

pthread_mutex_t mutex;
static long readbytes = 0;
static long readnum = 0;
static long sendbytes = 0;

int main(int argc, char* argv[])
{
endpoits = (struct endpoint*)malloc(sizeof(struct endpoint) * MAX_PID_COUNT);

readbuffer = (unsigned char*)malloc(MAX_REC_DATA_LEN);

// create an netlink socket and bind.
listen_user();

if (pthread_mutex_init(&mutex, NULL) != 0 )
{
printf("Init mutex error.\n");
return -1;
}

// send a handshake msg to the knl. let the knl to see this client.
handwithknl();

// sleep(6);

// startSendThreads();

gettimeofday(&oldtime, NULL);
tmout.tv_sec = 1;
tmout.tv_usec = 0;
// wait for event from the knl to dispach.
while (1)
{
doselect(tmout);

if (readbytes > STATIC_PEROID)
{
staticsout();
}

// sleep(2);
}

// close the socket.
close_user();
return 0;
}

int listen_user(void)
{
int pidcount;
struct endpoint* ed = NULL;
struct nlmsghdr* nlh = NULL;
unsigned int pid = getpid();

for (pidcount = 0; pidcount != MAX_PID_COUNT; pidcount++)
{
ed = endpoits + pidcount;
memset((void*)ed, 0, sizeof(struct endpoint));

ed->sock_fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_TEST);
ed->pid = pid + pidcount;
if (0 == ed->pid)
{
pid = ++ed->pid;
}

ed->src_addr.nl_family = AF_NETLINK;
ed->src_addr.nl_groups = 0;
ed->src_addr.nl_pid = ed->pid;

//TODO may be the src_addr.nl_pid is already in used, the bind will return a nonezero value.
bind(ed->sock_fd, (struct sockaddr *) &ed->src_addr, sizeof(ed->src_addr));

ed->dest_addr.nl_family = AF_NETLINK;
ed->dest_addr.nl_pid = 0;
ed->dest_addr.nl_groups = 0;

/* Fill in the netlink message payload */
ed->msg.msg_name = (void *) &ed->dest_addr;
ed->msg.msg_namelen = sizeof(ed->dest_addr);

printf("init the socket %d of pid %d successful.\n", ed->sock_fd, ed->pid);
usleep(10000);
}

return 0;

}

int handwithknl(void)
{
int pidcount;
struct endpoint* ed = NULL;
struct nlmsghdr* nlh = NULL;

for (pidcount = 0; pidcount != MAX_PID_COUNT; pidcount++)
{
ed = endpoits + pidcount;
ed->msg.msg_iovlen = 1;
ed->msg.msg_iov = malloc(sizeof(struct iovec));
ed->msg.msg_iov->iov_base = malloc(NLMSG_SPACE(MAX_PAYLOAD));
nlh = (struct nlmsghdr*)ed->msg.msg_iov->iov_base;
nlh->nlmsg_len = NLMSG_SPACE(MAX_PAYLOAD);
nlh->nlmsg_pid = ed->pid;
nlh->nlmsg_flags = 0;
ed->msg.msg_iov->iov_len = nlh->nlmsg_len;
snprintf((char*)NLMSG_DATA(nlh), MAX_PAYLOAD - 1, "Hello knl! This is %d!", ed->pid);

// printf(" Sending message from . ...\n", ed->pid);
sendmsg(ed->sock_fd, &ed->msg, 0);

if (ed->sock_fd > maxfd)
{
maxfd = ed->sock_fd;
}

FD_SET(ed->sock_fd, &rset);

//         pthread_t tid;
//         if (0 == pthread_create(&tid, NULL, &threadProc, (void*)ed))
//         {
//             printf("create a thread %u successful for the pid: %d. \n", tid, ed->pid);
//         }
}

return 0;
}

int close_user(void)
{
int pidcount;

for (pidcount = 0; pidcount != MAX_PID_COUNT; pidcount++)
{
close(endpoits[pidcount].sock_fd);
}

return 0;
}

int doselect(struct timeval wait)
{
int pidcount;
int selcount = 0;
struct endpoint* ed = NULL;
struct nlmsghdr* nlh = NULL;

selcount = select(maxfd + 1, &rset, NULL, NULL, &wait);

if (selcount == 0)
{
return 0;
}
else if (selcount < 0)
{
printf("selected error!\n");
return -1;
}
else
{
for (pidcount = 0; pidcount != MAX_PID_COUNT; pidcount++)
{
ed = endpoits + pidcount;
if (FD_ISSET(ed->sock_fd, &rset))
{
int count = 0;
int msglen = -1;
int readstatus = 0;
memset(readbuffer, 0, MAX_REC_DATA_LEN);
ed->msg.msg_iov->iov_base = (void*)readbuffer;
nlh = (struct nlmsghdr*)ed->msg.msg_iov->iov_base;

while (TRUE)
{
memset(readbuffer, 0, MAX_REC_DATA_LEN);
readstatus = recvmsg(ed->sock_fd, &ed->msg, 0);

if (readstatus == -1)
{
printf("recieved error! %d \n", ed->sock_fd);
break;
}
else if (readstatus == 0)
{
printf("recieved error peer orderly shutdown! %d \n", ed->sock_fd);
break;
}
else
{
count += readstatus;
// printf("count %d\n", count);
}

if (msglen == -1 && count >= 16)
{
msglen = nlh->nlmsg_len;
// printf("msg len is: %u\n", msglen);
}

if (msglen != -1 && count == msglen)
{
// printf("success read a msg: %u\n", count);

// pData = (int*)NLMSG_DATA(nlh);
// for (i = 0; i != MAX_DATA_LEN/4; i++)
// {
//     printf("%d,", *(pData + i));
// }
// printf("\n");

readnum++;
// printf("readnum is %d\n", readnum);
break;
}
}

// printf("received %u bytes.\n", count);
// printf("received %d bytes, the peer pid is %d, the local pid is %d.\n", nlh->nlmsg_len, nlh->nlmsg_pid, ed->pid);
flushcount(count);
}
FD_CLR(ed->sock_fd, &rset);
FD_SET(ed->sock_fd, &rset);
}
}

// printf("select one time.\n");
return 0;
}

void* sendThreadProc(void* arg)
{
int readcount = 0;
struct endpoint* ed;
struct nlmsghdr* nlh;
fd_set rset_c;
struct timeval wait_time;
int val;
int i;

ed = (struct endpoint*)arg;
ed->msg.msg_iov->iov_base = malloc(MAX_DATA_LEN);

wait_time.tv_sec = 2;
wait_time.tv_usec = 0;

printf("send data thread [%d] start!\n", ed->pid);

for (i = 0; i != MSG_COUNT; i++)
{
nlh = (struct nlmsghdr*)ed->msg.msg_iov->iov_base;
nlh->nlmsg_len = NLMSG_SPACE(MAX_DATA_LEN);
nlh->nlmsg_pid = ed->pid;
nlh->nlmsg_flags = 0;
ed->msg.msg_iov->iov_len = nlh->nlmsg_len;
readcount = sendmsg(ed->sock_fd, &ed->msg, MSG_DONTWAIT);
flushcount(readcount);
}
printf("thread %d end!\n", ed->pid);
return NULL;
}

void* threadProc(void* arg)
{
int readcount = 0;
struct endpoint* ed;
struct nlmsghdr* nlh;
fd_set rset_c;
struct timeval wait_time;
int val;

ed = (struct endpoint*)arg;
ed->msg.msg_iov->iov_base = malloc(MAX_DATA_LEN);

wait_time.tv_sec = 2;
wait_time.tv_usec = 0;

printf("thread %d start!\n", ed->pid);

for (;;)
{
FD_ZERO(&rset_c);
FD_SET(ed->sock_fd, &rset_c);
select(ed->sock_fd + 1, &rset_c, NULL, NULL, &wait_time);
if (FD_ISSET(ed->sock_fd, &rset_c))
{
recvmsg(ed->sock_fd, &ed->msg, 0);
nlh = (struct nlmsghdr*)ed->msg.msg_iov->iov_base;
flushcount(nlh->nlmsg_len);
}
}
printf("thread %d end!\n", ed->pid);
}

void flushcount(int len)
{
int val;
val = pthread_mutex_lock(&mutex);
if(val != 0)
{
printf("lock error. \n");
pthread_mutex_unlock(&mutex);
return;
}

if (len > 0)
{
readbytes += len;
readnum++;
}

pthread_mutex_unlock(&mutex);
}

void staticsout()
{
int millsec;
int val;
val = pthread_mutex_lock(&mutex);
if(val != 0)
{
printf("lock error. \n");
pthread_mutex_unlock(&mutex);
return;
}

if (readbytes > STATIC_PEROID)
{
gettimeofday(&tmnow, NULL);
millsec = (tmnow.tv_sec - oldtime.tv_sec) * 1000 + (tmnow.tv_usec - oldtime.tv_usec) / 1000;
printf("received %ld Kbytes, consumed time is: %dms, speed is: %5.3fK/s, %5.2f/s.\n",
readbytes / 1024,
millsec,
(float)(readbytes / 1024) * 1000 / millsec,
(float)(readnum * 1000 / millsec));

gettimeofday(&oldtime, NULL);
readbytes = 0;
readnum = 0;
}

pthread_mutex_unlock(&mutex);
}

void startSendThreads(void)
{
int pidcount;
struct endpoint* ed = NULL;
struct nlmsghdr* nlh = NULL;

for (pidcount = 0; pidcount != MAX_PID_COUNT; pidcount++)
{
pthread_t tid;
ed = endpoits + pidcount;
if (0 == pthread_create(&tid, NULL, &sendThreadProc, (void*)ed))
{
printf("create a sendthread %u successful for the pid: %d. \n", tid, ed->pid);
}
}
}

Kernel space

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/sched.h>
#include <net/sock.h>
#include <net/netlink.h>
#include <linux/kthread.h>
#include "conf.h"

#ifndef SLEEP_MILLI_SEC
#define SLEEP_MILLI_SEC(nMilliSec) \
do { \
long timeout = (nMilliSec) * HZ / 1000; \
while(timeout > 0) \
{ \
timeout = schedule_timeout(timeout); \
} \
}while(0);
#endif

struct sock* nl_sk = NULL;
EXPORT_SYMBOL_GPL(nl_sk);

static struct task_struct* task_test[MAX_PROCESS_COUNT];

static DECLARE_WAIT_QUEUE_HEAD(myevent_waitqueue);

static u32 pids[MAX_PID_COUNT] = {0};
static int pidindex = 0;

static int readBytes = 0;

static int childDataThread(void* index)
{
int threadindex = *((int*)index);
struct sk_buff* skb;
struct nlmsghdr* nlh;
int rc;
int len = NLMSG_SPACE(MAX_DATA_LEN);
int a = MSG_COUNT;
unsigned char randomindex;
wait_queue_head_t timeout_wq;
int devi = MAX_PID_COUNT / MAX_PROCESS_COUNT;

init_waitqueue_head(&timeout_wq);

printk("start thread [%d].\n", threadindex);

allow_signal(SIGKILL);

while (a-- && !kthread_should_stop())
{
int* pData;
int i = 0;
skb = alloc_skb(len, GFP_ATOMIC);
if (!skb)
{
printk(KERN_ERR "net_link: allocate failed.\n");
return -1;
}

nlh = nlmsg_put(skb, 0, 0, 0, MAX_DATA_LEN, 0);
NETLINK_CB(skb).pid = 0;
//         pData = (int*)NLMSG_DATA(nlh);
//         for (i = 0; i != MAX_DATA_LEN/4; i++)
//         {
//             *(pData + i) = i;
//         }

get_random_bytes(&randomindex, 1);
randomindex = randomindex % devi + threadindex * devi;
// printk("radmonindex is: %d, threadindex is %d\n", randomindex, threadindex);
if (pids[randomindex] != 0)
{
// printk("net_link: going to send, peer pid is: %d, a is: %d.\n", pids[randomindex], a);
rc = netlink_unicast(nl_sk, skb, pids[randomindex], MSG_DONTWAIT);
if (rc < 0)
{
printk(KERN_ERR "net_link: can not unicast skb: %d, a is: %d, peerpid is: %u\n", rc, a, pids[randomindex]);
interruptible_sleep_on_timeout(&timeout_wq, HZ / 10);
}
else
{
// printk(KERN_ERR "net_link:  unicast skb: %d, a is: %d, peerpid is: %u\n", rc, a, pids[randomindex]);
}
}

interruptible_sleep_on_timeout(&timeout_wq, HZ / 20);

if(signal_pending(current))
{
break;
}
}

printk("thread %d exit!\n", threadindex);
return 0;
}

void nl_data_ready(struct sk_buff* __skb)
{
struct sk_buff* skb;
struct nlmsghdr* nlh;

skb = skb_get(__skb);

if (skb->len >= NLMSG_SPACE(0))
{
nlh = nlmsg_hdr(skb);
// printk("net_link: recv %s.\n", (char *) NLMSG_DATA(nlh));

if (pidindex < MAX_PID_COUNT)
{
pids[pidindex] = nlh->nlmsg_pid;
if (pidindex == MAX_PID_COUNT - 1)
{
int i;
for (i = 0; i != MAX_PROCESS_COUNT; i++)
{
wake_up_process(task_test[i]);
printk("wake up the thread [%d].\n", i);
}
}
pidindex++;
}

readBytes += nlh->nlmsg_len;

if (readBytes > STATIC_PEROID)
{
printk("received  %d bytes.\n", readBytes);
readBytes = 0;
}
}
kfree_skb(skb);	/* drop the reference taken by skb_get() on every path */

return;
}

static int init_netlink(void)
{
nl_sk = netlink_kernel_create(&init_net, NETLINK_TEST, 0, nl_data_ready,
NULL, THIS_MODULE);
if (!nl_sk)
{
printk(KERN_ERR "net_link: Cannot create netlink socket.\n");
return -EIO;
}

printk("net_link: create socket ok.\n");
return 0;
}

int init_thread(void)
{
int i = 0;
char processName[64] = {0};
for (i = 0; i != MAX_PROCESS_COUNT; i++)
{
void* data = kmalloc(sizeof(int), GFP_ATOMIC);
*(int*)data = i;
snprintf(processName, 63, "childDataThread-%d", i);

task_test[i] = kthread_create(childDataThread, data, processName);
if (IS_ERR(task_test[i]))
{
return PTR_ERR(task_test[i]);
}
printk("init thread (%d) ok!\n", i);
}

return 0;
}

int knl_init(void)
{
init_netlink();
init_thread();
return 0;
}

void stop_kthreads(void)
{
int i;
for (i = 0; i != MAX_PROCESS_COUNT; i++)
{
kthread_stop(task_test[i]);
}
}

void knl_exit(void)
{
stop_kthreads();
if (nl_sk != NULL)
{
sock_release(nl_sk->sk_socket);
}

printk("net_link: remove ok.\n");
}

module_exit(knl_exit);
module_init(knl_init);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("r");

Makefile

MODULE_NAME := knl
obj-m += $(MODULE_NAME).o
KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
all:
	$(MAKE) -C $(KERNELDIR) M=$(PWD)
	gcc -g -o usr usr.c -lpthread
clean:
	rm -f *.ko *.o *.cmd usr $(MODULE_NAME).mod.c Module.symvers

in:clean rm all
	insmod knl.ko
rm:
	rmmod knl.ko
sp:
	cat /proc/net/knl
ru:
	./usr
sm:
	dmesg -c

Preliminary performance results:

init the socket 188 of pid 24056 successful.

init the socket 189 of pid 24057 successful.

init the socket 190 of pid 24058 successful.

init the socket 191 of pid 24059 successful.

init the socket 192 of pid 24060 successful.

init the socket 193 of pid 24061 successful.

init the socket 194 of pid 24062 successful.

init the socket 195 of pid 24063 successful.

init the socket 196 of pid 24064 successful.

init the socket 197 of pid 24065 successful.

init the socket 198 of pid 24066 successful.

init the socket 199 of pid 24067 successful.

init the socket 200 of pid 24068 successful.

init the socket 201 of pid 24069 successful.

init the socket 202 of pid 24070 successful.

received 30 Mbytes, consumed time is: 10227ms, speed is: 2.933M/s, 4012.00/s.

received 30 Mbytes, consumed time is: 10062ms, speed is: 2.982M/s, 4013.00/s.

received 30 Mbytes, consumed time is: 10052ms, speed is: 2.984M/s, 4012.00/s.

received 30 Mbytes, consumed time is: 10069ms, speed is: 2.979M/s, 4013.00/s.

received 30 Mbytes, consumed time is: 10113ms, speed is: 2.966M/s, 4012.00/s.

received 30 Mbytes, consumed time is: 10071ms, speed is: 2.979M/s, 4012.00/s.

received 30 Mbytes, consumed time is: 10289ms, speed is: 2.916M/s, 4014.00/s.

received 30 Mbytes, consumed time is: 10247ms, speed is: 2.928M/s, 4013.00/s.

received 30 Mbytes, consumed time is: 10347ms, speed is: 2.899M/s, 4013.00/s.

received 30 Mbytes, consumed time is: 10340ms, speed is: 2.901M/s, 4013.00/s.

received 30 Mbytes, consumed time is: 10107ms, speed is: 2.968M/s, 4012.00/s.

received 30 Mbytes, consumed time is: 10267ms, speed is: 2.922M/s, 4013.00/s.

Kernel-side results

Many send-failure messages were printed:

583 start thread [193].

584 start thread [194].

585 start thread [93].

586 start thread [195].

587 start thread [196].

588 start thread [197].

589 start thread [198].

590 start thread [199].

591 start thread [92].

592 start thread [91].

593 start thread [90].

594 start thread [89].

595 start thread [41].

596 start thread [40].

597 start thread [39].

598 start thread [17].

599 start thread [16].

600 start thread [1].

601 start thread [0].

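Message throughput reaches about 4000 messages per second, but the data volume is very low. It seems netlink is best used only for passing messages, with bulk data going through a shared-memory channel.

Since it serves as a message channel, not many connections are needed. Here is a run with 10 connections and 10 concurrent kernel threads: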

#define NETLINK_TEST 21
#define MAX_DATA_LEN (768)

#define  MAX_PROCESS_COUNT 10

#define MAX_PAYLOAD 1024
#define MAX_PID_COUNT MAX_PROCESS_COUNT
#define MAX_REC_DATA_LEN 1024

#define STATIC_PEROID 1024*500

#define MSG_COUNT 100000000

#define TRUE 1

received 501 Kbytes, consumed time is: 3303ms, speed is: 151.680K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3277ms, speed is: 152.579K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3285ms, speed is: 152.207K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3260ms, speed is: 153.374K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3281ms, speed is: 152.393K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3279ms, speed is: 152.486K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3282ms, speed is: 152.346K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3271ms, speed is: 152.858K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3262ms, speed is: 153.280K/s, 218.00/s.

received 500 Kbytes, consumed time is: 3295ms, speed is: 151.745K/s, 218.00/s.
