您的位置:首页 > 运维架构 > Linux

Linux下性能分析工具和内存泄露检测工具的简介(Valgrind和gprof)

2016-02-29 09:33 901 查看
valgrind通常用来成分析程序性能及程序中的内存泄露错误

Valgrind工具集简绍

Valgrind包含下列工具:

memcheck:检查程序中的内存问题,如泄漏、越界、非法指针等。

callgrind:检测程序代码的运行时间和调用过程,以及分析程序性能。

cachegrind:分析CPU的cache命中率、丢失率,用于进行代码优化。

helgrind:用于检查多线程程序的竞态条件。

massif:堆栈分析器,指示程序中使用了多少堆内存等信息。

这几个工具的使用是通过命令:valgrand –tool=name 程序名来分别调用的,当不指定tool参数时默认是 –tool=memcheck

Valgrind工具详解

Memcheck

最常用的工具,用来检测程序中出现的内存问题,所有对内存的读写都会被检测到,一切对malloc、free、new、delete的调用都会被捕获。所以,它能检测以下问题:


对未初始化内存的使用;

读/写释放后的内存块;

读/写超出malloc分配的内存块;

读/写不适当的栈中内存块;

内存泄漏,指向一块内存的指针永远丢失;

不正确的malloc/free或new/delete匹配;

memcpy()相关函数中的dst和src指针重叠。

这些问题往往是C/C++程序员最头疼的问题,Memcheck能在这里帮上大忙。

例如:

#include <stdlib.h>
#include <malloc.h>
#include <string.h>

void test()
{
int *ptr = malloc(sizeof(int)*10);

ptr[10] = 7; // 内存越界

memcpy(ptr +1, ptr, 5); // 踩内存

free(ptr);
free(ptr);// 重复释放

int *p1;
*p1 = 1; // 非法指针
}

int main(void)
{
test();
return 0;
}


将程序编译生成可执行文件后执行:valgrind –leak-check=full ./程序名

输出结果如下:

==4832== Memcheck, a memory error detector
==4832== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==4832== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==4832== Command: ./tmp
==4832==
==4832== Invalid write of size 4      // 内存越界
==4832==    at 0x804843F: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==  Address 0x41a6050 is 0 bytes after a block of size 40 alloc'd
==4832==    at 0x4026864: malloc (vg_replace_malloc.c:236)
==4832==    by 0x8048435: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==
==4832== Source and destination overlap in memcpy(0x41a602c, 0x41a6028, 5) // 踩内存
==4832==    at 0x4027BD6: memcpy (mc_replace_strmem.c:635)
==4832==    by 0x8048461: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==
==4832== Invalid free() / delete / delete[] // 重复释放
==4832==    at 0x4025BF0: free (vg_replace_malloc.c:366)
==4832==    by 0x8048477: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==  Address 0x41a6028 is 0 bytes inside a block of size 40 free'd
==4832==    at 0x4025BF0: free (vg_replace_malloc.c:366)
==4832==    by 0x804846C: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==
==4832== Use of uninitialised value of size 4 // 非法指针
==4832==    at 0x804847B: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==
==4832==
==4832== Process terminating with default action of signal 11 (SIGSEGV) //由于非法指针赋值导致的程序崩溃
==4832==  Bad permissions for mapped region at address 0x419FFF4
==4832==    at 0x804847B: test (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==    by 0x804848D: main (in /home/yanghao/Desktop/testC/testmem/tmp)
==4832==
==4832== HEAP SUMMARY:
==4832==     in use at exit: 0 bytes in 0 blocks
==4832==   total heap usage: 1 allocs, 2 frees, 40 bytes allocated
==4832==
==4832== All heap blocks were freed -- no leaks are possible
==4832==
==4832== For counts of detected and suppressed errors, rerun with: -v
==4832== Use --track-origins=yes to see where uninitialised values come from
==4832== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 11 from 6)
Segmentation fault


从valgrind的检测输出结果看,这几个错误都找了出来。

Callgrind

和gprof类似的分析工具,但它对程序的运行观察更是入微,能给我们提供更多的信息。和gprof不同,它不需要在编译源代码时附加特殊选项,但加上调试选项是推荐的。Callgrind收集程序运行时的一些数据,建立函数调用关系图,还可以有选择地进行cache模拟。在运行结束时,它会把分析数据写入一个文件。callgrind_annotate可以把这个文件的内容转化成可读的形式。

生成可视化的图形需要下载gprof2dot

这是个python脚本,把它下载之后修改其权限chmod +7 gprof2dot.py ,并把这个脚本添加到$PATH路径中的任一文件夹下,我是将它放到了/usr/bin目录下,这样就可以直接在终端下执行gprof2dot.py了。

Callgrind可以生成程序性能分析的图形,首先来说说程序性能分析的工具吧,通常可以使用gnu自带的gprof,它的使用方法是:在编译程序时添加-pg参数,例如:

#include <stdio.h>
#include <malloc.h>
void test()
{
sleep(1);
}
void f()
{
int i;
for( i = 0; i < 5; i ++)
test();
}
int main()
{
f();
printf("process is over!\n");
return 0;
}


首先执行
gcc -pg -o tmp tmp.c
,然后运行该程序
./tmp
,程序运行完成后会在当前目录下生成gmon.out文件(这个文件gprof在分析程序时需要),

再执行
gprof ./tmp | gprof2dot.py |dot -Tpng -o report.png
,打开report.png结果:



显示test被调用了5次,程序中耗时所占百分比最多的是test函数。

再来看 Callgrind的生成调用图过程吧,执行:
valgrind --tool=callgrind ./tmp
,执行完成后在目录下生成”callgrind.out.XXX”的文件这是分析文件,可以直接利用:
callgrind_annotate callgrind.out.XXX
打印结果,也可以使用:
gprof2dot.py -f callgrind callgrind.out.XXX |dot -Tpng -o report.png
来生成图形化结果:



它生成的结果非常详细,甚至连函数入口,及库函数调用都标识出来了。

Cachegrind

Cache分析器,它模拟CPU中的一级缓存I1,Dl和二级缓存,能够精确地指出程序中cache的丢失和命中。如果需要,它还能够为我们提供cache丢失次数,内存引用次数,以及每行代码,每个函数,每个模块,整个程序产生的指令数。这对优化程序有很大的帮助。

作一下广告:

valgrind自身利用该工具在过去几个月内使性能提高了25%-30%。据早先报道,kde的开发team也对valgrind在提高kde性能方面的帮助表示感谢。

它的使用方法也是:
valgrind --tool=cachegrind 程序名


Helgrind

它主要用来检查多线程程序中出现的竞争问题。Helgrind寻找内存中被多个线程访问,而又没有一贯加锁的区域,这些区域往往是线程之间失去同步的地方,而且会导致难以发掘的错误。Helgrind实现了名为“Eraser”的竞争检测算法,并做了进一步改进,减少了报告错误的次数。不过,Helgrind仍然处于实验阶段。

首先举一个竞态的例子吧:

#include <stdio.h>
#include <pthread.h>
#define NLOOP 50
int counter = 0; /* incremented by threads */
void *threadfn(void *);

int main(int argc, char **argv)
{
pthread_t tid1, tid2,tid3;

pthread_create(&tid1, NULL, &threadfn, NULL);
pthread_create(&tid2, NULL, &threadfn, NULL);
pthread_create(&tid3, NULL, &threadfn, NULL);

/* wait for both threads to terminate */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);
pthread_join(tid3, NULL);

return 0;
}

void *threadfn(void *vptr)
{
int i, val;
for (i = 0; i < NLOOP; i++) {
val = counter;
printf("%x: %d \n", (unsigned int)pthread_self(),  val+1);
counter = val+1;
}
return NULL;
}


这段程序的竞态在30~32行,我们想要的效果是3个线程分别对全局变量累加50次,最后全局变量的值为150,由于这里没有加锁,很明显竞态使得程序不能达到我们的目标。我们来看Helgrind是如何帮我们检测到竞态的。先编译程序:
gcc -o test thread.c -lpthread
,然后执行:
valgrind --tool=helgrind ./test
输出结果如下:

49c0b70: 1
49c0b70: 2
==4666== Thread #3 was created
==4666==    at 0x412E9D8: clone (clone.S:111)
==4666==    by 0x40494B5: pthread_create@@GLIBC_2.1 (createthread.c:256)
==4666==    by 0x4026E2D: pthread_create_WRK (hg_intercepts.c:257)
==4666==    by 0x4026F8B: pthread_create@* (hg_intercepts.c:288)
==4666==    by 0x8048524: main (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==
==4666== Thread #2 was created
==4666==    at 0x412E9D8: clone (clone.S:111)
==4666==    by 0x40494B5: pthread_create@@GLIBC_2.1 (createthread.c:256)
==4666==    by 0x4026E2D: pthread_create_WRK (hg_intercepts.c:257)
==4666==    by 0x4026F8B: pthread_create@* (hg_intercepts.c:288)
==4666==    by 0x8048500: main (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==
==4666== Possible data race during read of size 4 at 0x804a028 by thread #3
==4666==    at 0x804859C: threadfn (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==    by 0x4026F60: mythread_wrapper (hg_intercepts.c:221)
==4666==    by 0x4048E98: start_thread (pthread_create.c:304)
==4666==    by 0x412E9ED: clone (clone.S:130)
==4666==  This conflicts with a previous write of size 4 by thread #2
==4666==    at 0x80485CA: threadfn (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==    by 0x4026F60: mythread_wrapper (hg_intercepts.c:221)
==4666==    by 0x4048E98: start_thread (pthread_create.c:304)
==4666==    by 0x412E9ED: clone (clone.S:130)
==4666==
==4666== Possible data race during write of size 4 at 0x804a028 by thread #2
==4666==    at 0x80485CA: threadfn (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==    by 0x4026F60: mythread_wrapper (hg_intercepts.c:221)
==4666==    by 0x4048E98: start_thread (pthread_create.c:304)
==4666==    by 0x412E9ED: clone (clone.S:130)
==4666==  This conflicts with a previous read of size 4 by thread #3
==4666==    at 0x804859C: threadfn (in /home/yanghao/Desktop/testC/testmem/a.out)
==4666==    by 0x4026F60: mythread_wrapper (hg_intercepts.c:221)
==4666==    by 0x4048E98: start_thread (pthread_create.c:304)
==4666==    by 0x412E9ED: clone (clone.S:130)
==4666==
49c0b70: 3
......
55c1b70: 51
==4666==
==4666== For counts of detected and suppressed errors, rerun with: -v
==4666== Use --history-level=approx or =none to gain increased speed, at
==4666== the cost of reduced accuracy of conflicting-access information
==4666== ERROR SUMMARY: 8 errors from 2 contexts (suppressed: 99 from 31)


helgrind成功的找到了竞态的所在位置,标红所示。

Massif

堆栈分析器,它能测量程序在堆栈中使用了多少内存,告诉我们堆块,堆管理块和栈的大小。Massif能帮助我们减少内存的使用,在带有虚拟内存的现代系统中,它还能够加速我们程序的运行,减少程序停留在交换区中的几率。

Massif对内存的分配和释放做profile。程序开发者通过它可以深入了解程序的内存使用行为,从而对内存使用进行优化。这个功能对C++尤其有用,因为C++有很多隐藏的内存分配和释放。

此外,lackey和nulgrind也会提供。Lackey是小型工具,很少用到;Nulgrind只是为开发者展示如何创建一个工具。我们就不做介绍了。

使用Valgrind

Valgrind使用起来非常简单,你甚至不需要重新编译你的程序就可以用它。当然如果要达到最好的效果,获得最准确的信息,还是需要按要求重新编译一下的。比如在使用memcheck的时候,最好关闭优化选项。

valgrind命令的格式如下:

valgrind [valgrind-options] your-prog [your-prog options]
valgrind --tool=massif --stacks=yes ./test


(这个工具有个bug, 只有程序中出现new或者malloc之类的堆操作,才会统计栈的使用,否则只统计堆的使用)

一些常用的选项如下:

选项作用
-h –help显示帮助信息。
–version显示valgrind内核的版本,每个工具都有各自的版本。
q –quiet安静地运行,只打印错误信息。
-v –verbose打印更详细的信息。
–tool= [default: memcheck]最常用的选项。运行valgrind中名为toolname的工具。如果省略工具名,默认运行memcheck。
–db-attach= [default: no]绑定到调试器上,便于调试错误。

gprof

简介

GNU profiler(gprof)是GNU profiler工具。它可以为Linux平台上的程序精确分析性能瓶颈,它能够记录每个函数的调用次数,每个函数消耗的处理器时间,还能够显示“调用图”,包括函数的调用关系。能够为我们改进应用程序的性能提供很多有利的帮助。

原理

通过在编译和链接程序的时候使用-pg选项(编译和链接过程都需要),当我们使用”-pg”选项编译程序后,gcc会做三个工作:

程序的入口处(main函数之前)插入monstartup函数的调用代码,完成profile的初始化工作,包括分配保存信息的内存以及设置一个clock信号处理函数

在每个函数的入口处插入_mcount函数的调用代码,用于统计函数的调用信息:包括调用时间、调用次数以及调用栈信息

在程序退出处(注册 atexit()函数),插入_mcleanup()函数的调用代码,负责将profile信息输出到gmon.out中。

使用流程

在编译和链接时,加上-pg选项。

执行编译的二进制程序

程序正常退出后,在运行目录下 生成gmon.out文件。如果原来有gmon.out 文件,将会被覆盖。

用gprof工具分析gmon.out文件。

gprof输出分析

在gmon.out文件产生之后,可以通过GNU binutils中提供的工具gprof来分析数据,转换成容易阅读、理解的格式。

一般用法:

gprof Binary-file gmon.out >report.txt


其中,Binary-file指的是所运行的程序(也可以是程序调用到的库文件),gmon.out就是前面所输出的那个文件,report.txt就是生成的分析报告了。Gprof提供了丰富的参数选项,以控制报告输出的内容。

简单列表

用文本编辑器打开报告文件:



报告的第一部分是一个简单列表,列出了各个函数的调用情况,如上图所示。列表首先按时间降序排列,如果时间相同,再按调用次数降序排列。各个字段的含义如下:

%time该函数消耗时间占程序所有时间的百分比

Cumulative seconds累积执行时间。执行这个函数所消耗的时间,加上其上列函数消耗的时间总和

Self seconds函数自身消耗的时间(所有调用时间总和),列表首先以这个值的大小排序

Calls 函数被调用的次数,如果某个函数从未被调用,那么这个字段为空

Self Ts/call函数自身的平均执行时间

Total Ts/call函数及其衍生函数调用的平均时间

Name 函数名

其实,在列表的下方,给出了这些字段的详细说明:

%        the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
seconds   for by this function and those listed above it.

self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
listing.

calls      the number of times this function was invoked, if
this function is profiled, else blank.

self      the average number of milliseconds spent in this
ms/call    function per call, if this function is profiled,
else blank.

total     the average number of milliseconds spent in this
ms/call    function and its descendents per call, if this
function is profiled, else blank.

name       the name of the function.  This is the minor sort
for this listing. The index shows the location of
the function in the gprof listing. If the index is
in parenthesis it shows where it would appear in
the gprof listing if it were to be printed.


调用关系图



报告中的第二部分是个调用图,它给出了函数及其后代的时间消耗情况。列表按时间消耗降序排列,并且索引化组织,根据索引,很容易找出调用的整体关系。调用关系图之后,给出了图中各元素的说明,看起来很方便:

This table describes the call tree of the program, and was sorted by
the total amount of time spent in each function and its children.

Each entry in this table consists of several lines. The line with the
index number at the left hand margin lists the current function.
The lines above it list the functions that called this function,
and the lines below it list the functions this one called.
This line lists:
index  A unique number given to each element of the table.
Index numbers are sorted numerically.
The index number is printed next to every function name so
it is easier to look up where the function in the table.

% time This is the percentage of the `total' time that was spent
in this function and its children.  Note that due to
different viewpoints, functions excluded by options, etc,
these numbers will NOT add up to 100%.

self   This is the total amount of time spent in this function.

children   This is the total amount of time propagated into this
function by its children.

called This is the number of times the function was called.
If the function called itself recursively, the number
only includes non-recursive calls, and is followed by
a `+' and the number of recursive calls.

name   The name of the current function. The index number is
printed after it.  If the function is a member of a
cycle, the cycle number is printed between the
function's name and the index number.

For the function's parents, the fields have the following meanings:

self   This is the amount of time that was propagated directly
from the function into this parent.

children   This is the amount of time that was propagated from
the function's children into this parent.

called This is the number of times this parent called the
function `/' the total number of times the function
was called.  Recursive calls to the function are not
included in the number after the `/'.

name   This is the name of the parent. The parent's index
number is printed after it.  If the parent is a
member of a cycle, the cycle number is printed between
the name and the index number.

If the parents of the function cannot be determined, the word
`<spontaneous>' is printed in the `name' field, and all the other
fields are blank.

For the function's children, the fields have the following meanings:

self   This is the amount of time that was propagated directly
from the child into the function.

children   This is the amount of time that was propagated from the
child's children to the function.

called This is the number of times the function called
this child `/' the total number of times the child
was called.  Recursive calls by the child are not
listed in the number after the `/'.

name   This is the name of the child. The child's index
number is printed after it.  If the child is a
member of a cycle, the cycle number is printed
between the name and the index number.

If there are any cycles (circles) in the call graph, there is an
entry for the cycle-as-a-whole.  This entry shows who called the
cycle (as parents) and the members of the cycle (as children.)
The `+' recursive calls entry shows the number of function calls that
were internal to the cycle, and the calls entry for each member shows,
for that member, how many times it was called from other members of
the cycle.


利用DOT图形化

TXT格式的报告,对于小规模的程序已经足够了,但是对于大规模的程序来说,就显得还是太繁杂了,特别是我们把注意力放在调用关系上时,文本的跳跃总是让人不舒服。

把TXT报告转换成图片,需要python和dot,还要下载gprof2dot.py的脚本。

Dot是graphviz提供的一个工具,在CentOS下,可以执行下面命令安装:

yum install graphviz


安装之后,执行:

python gprof2dot.py report.txt | dot -Tpng -o ast.png


其中report.txt就是前面gprof输出的文本报告,这时,当前目录下就生成一个名为ast.png的文件了,打开看看。

问题

共享库支持

对于代码剖析的支持是由编译器增加的,因此如果希望从共享库中获得剖析信息,就需要使用-pg来编译这些库。

如果需要分析系统函数(如libc库),需要用–lc_p替换-lc。这样程序会链接libc_p.so或libc_p.a。只有这样才能监控到底层的C库函数的执行时间。

用户时间与内核时间

它只能分析应用程序在运行过程中所消耗掉的用户时间,无法得到程序内核空间的运行时间。对内核态的调用分析无能为力。如果程序系统调用比率比较大,就不适合。

此外,时间是通过采样分析得到的,结果精度不高,如果执行时间很少,那么可能采不到样,输出时,结果就忽略了,这也是很多地方看到的时间都是0.00的原因。

多线程

Gprof对多线程支持不好,因为gprof用ITIMER_PROF信号,而只有主线程才能处理这个信号。http://sam.zoy.org/writings/programming/gprof.html给了一个解决方法,就是嵌入个钩子,但我用它测试asterisk的时候,效果并不好,子线程的分析结果总是不对。

其它

只有进程退出才能生成gmon.out文件,用起来还是有些不方便。

参考文章列表:

gprof

Valgrind

linux下利用valgrind工具进行内存泄露检测和性能分析
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: