【定屏】vfp 原生bug导致开机概率定屏问题分析
2018-03-04 23:37
357 查看
1、问题描述
android5.1+kernel3.10环境下,手机下载软件,首次开机(1/500)低概率定屏在android开机动画最后一帧,旋转屏幕会出现开机向导部分界面。此问题,adb可以连接,按键有反应。
2、问题分析
拿到问题后,与客户沟通,确认adb可以连接,按键有反应,屏幕可以正常亮灭,system_sever正常、kernel工作正常,动画卡在最后一张图片播放未完成,开始认为是bootanimation模块异常,逻辑有问题联系模块处理。模块分析后认为bootanimation流程无异常,从log上看也无明显异常点。模块添加log请客户复现问题。确认是bootanimation流程没有走完,进入sleep,导致最后一张图片没有播放完,无法退出动画进程,经过log分析是frameDuration变量异常导致动画无法退出,在bool BootAnimation::movie()函数中模块添加log如下:
log输出如下:
frameDuration变量值这里莫名的改变了,从代码和log分析这里没有涉及到多线程操作,在开机过程中局部变量值突然改变。至此问题从模块交由我们来分析。
从流程看frameDuration是一个long long类型局部变量,初始化后就没有被赋值改变,通过查看log和代码,初步怀疑是栈数据异常导致变量值发生变化。此时需要抓取bootanimation进程的coredump进行内存分析。修改代码,在frameDuration发生异常时主动abort,修改如下:
由于复现概率太低,此修改未能复现,之后客户复现问题后,请客户之间给bootanimation进程发生signal 11,触发coredump进行分析。
发生问题时从汇编上下文看这里没有人操作过d8浮点寄存器,且没有将d8寄存器保存到栈上的操作,所以这里排除了用户空间栈数据破坏的可能。之后抓取了多份样本coredump进行分析,d8寄存器数据变化没有规律,每次都不相同,且整个浮点寄存器数据都不规律。几份样本数据如下:
结合上面数据,同时分析kernel log,发现每次发生frameDuration改变时,kernel log都有输出cpu拔core的log
这里猜测和cpu hotplug机制有关,于是抱着试试的心态,修改代码在开机时不做拔core操作,修改如下:
经过一周测试问题不复现,可以确定是和cpu hotplug机制有关。那么为什么cpu 做拔核操作会影响到上层进程浮点数寄存器值会发生变化呢?
查看vfp代码发现,vfp代码中有注册vfp_hotplug相关函数,当cpu进行拔核操作时,会通过内核通知链通知vfp模块调用vfp_hotplug函数,此函数会检查cpu状态,根据cpu状态来reload thread_info结构中保存的数据,而thread_info中就保存了进程的上下文数据,这里就包含有dx浮点数寄存器。当cpu拔核时,cpu_down会对cpu进行状态切换,进行下电操作,这就导致调用了vfp_hotplug函数。
vfp_hotplug函数会检查到cpu拔出通知,action==CPU_DYING,会调用 vfp_force_reload函数,这个函数会load thread_info中浮点寄存器,会对上下文从新赋值,由于cpu掉电,就导致浮点寄存器的值是一个无规律的无效值。再次调度到问题进程时,问题进程恢复上下文就会获取到一个错误的值。这里由于篇幅不对vfp_force_reload函数流程进行展开分析,这里涉及进程上下文切换和cpu hotplug机制会在后文中做详细分析,请看http://mp.blog.csdn.net/postedit/79445581文章。至此问题算是有个结论,真对此问题kernel中已经有高手合入了修改,只是我们的代码没有及时添加这个patch,导致了这样的问题。个人才疏学浅,分析有纰漏的地方还请各路大神指点。
修改如下:
Author: Yuanyuan Zhong <zyy@motorola.com>
Date: Wed Oct 30 17:31:49 2013 +0100
ARM: 7873/1: vfp: clear vfp_current_hw_state for dying cpu
The CPU_DYING notifier is called by cpu stopper task which
does not own the context held in the VFP hardware. Calling
vfp_force_reload() has no effect.
Replace it with clearing vfp_current_hw_state.
Signed-off-by: Yuanyuan Zhong <zyy@motorola.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 52b8f40..2f37e1d 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -642,9 +642,9 @@ int vfp_restore_user_hwstate(struct user_vfp __user *ufp,
static int vfp_hotplug(struct notifier_block *b, unsigned long action,
void *hcpu)
{
- if (action == CPU_DYING || action == CPU_DYING_FROZEN) {
- vfp_force_reload((long)hcpu, current_thread_info());
- } else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
+ if (action == CPU_DYING || action == CPU_DYING_FROZEN)
+ vfp_current_hw_state[(long)hcpu] = NULL;
+ else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
vfp_enable(NULL);
return NOTIFY_OK;
}
android5.1+kernel3.10环境下,手机下载软件,首次开机(1/500)低概率定屏在android开机动画最后一帧,旋转屏幕会出现开机向导部分界面。此问题,adb可以连接,按键有反应。
2、问题分析
拿到问题后,与客户沟通,确认adb可以连接,按键有反应,屏幕可以正常亮灭,system_sever正常、kernel工作正常,动画卡在最后一张图片播放未完成,开始认为是bootanimation模块异常,逻辑有问题联系模块处理。模块分析后认为bootanimation流程无异常,从log上看也无明显异常点。模块添加log请客户复现问题。确认是bootanimation流程没有走完,进入sleep,导致最后一张图片没有播放完,无法退出动画进程,经过log分析是frameDuration变量异常导致动画无法退出,在bool BootAnimation::movie()函数中模块添加log如下:
nsecs_t frameDuration = s2ns(1) /animation.fps; __android_log_print(ANDROID_LOG_ERROR, LOG_TAG,"msn---bootani---start--frameDuration=%lld===animation.fps==%lld",frameDuration/1000000,animation.fps); for (size_t i=0 ;i<pcount ; i++) { for (int r=0 ; !part.count ||r<part.count ; r++) { for (size_t j=0 ; j<fcount&& (!exitPending() || part.playUntilComplete) ; j++) { …… nsecs_t delay = frameDuration - (now - lastFrame); +__android_log_print(ANDROID_LOG_ERROR,LOG_TAG,"msn---bootani---f23--frameDurationlld==%lld",frameDuration); +__android_log_print(ANDROID_LOG_ERROR,LOG_TAG, "msn---bootani---f23--delaylld== %lld",delay); +__android_log_print(ANDROID_LOG_ERROR,LOG_TAG,"msn---bootani---f23--lastFramelld==%lld",lastFrame); lastFrame = now; +__android_log_print(ANDROID_LOG_ERROR,LOG_TAG, "msn---bootani---f24-- %lld", frameDuration); |
下面是正常log输出是frameDuration变量的值 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--frameDuration==0.000000 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--frameDurationlld==125000000 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--delay== 0.000000 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--delaylld== 85754394 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--lastFrame== 0.000000 01-02 08:46:53.931 258 329 E BootAnimation: msn---bootani---f23--lastFramelld== 196218922118 异常log输出如下: 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--frameDuration== -144.312836 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--frameDurationlld==36616418357296292 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--delay== 0.000000 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--delaylld== 36616418355129544 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--lastFrame== 0.000000 01-02 08:46:54.282 258 329 E BootAnimation: msn---bootani---f23--lastFramelld== 196605366210 |
从流程看frameDuration是一个long long类型局部变量,初始化后就没有被赋值改变,通过查看log和代码,初步怀疑是栈数据异常导致变量值发生变化。此时需要抓取bootanimation进程的coredump进行内存分析。修改代码,在frameDuration发生异常时主动abort,修改如下:
BootAnimation.h++ #include<stdlib.h> BootAnimation.cppboolBootAnimation::playAnimation(const Animation& animation){ ......-- nsecs_t frameDuration = s2ns(1) /animation.fps;++ const nsecs_t frameDuration = s2ns(1) /animation.fps;++ nsecs_t tempFD = frameDuration; ...... eglSwapBuffers(mDisplay, mSurface); nsecs_t now = systemTime();++ if(tempFD != frameDuration){ ++ abort();++ } nsecs_t delay = frameDuration - (now - lastFrame); |
下面是frameDuration初始化相关代码对应的汇编 nsecs_t lastFrame =systemTime(); nsecs_t frameDuration = s2ns(1) / animation.fps; 下面是对应的汇编代码 0xb6f673a8 <+1232>: add r1,pc, #656 ; (adr r1, 0xb6f6763c<android::BootAnimation::movie()+1892>) 0xb6f673aa <+1234>: ldrd r0, r1, [r1] 0xb6f673ae <+1238>: mov r2,r7 0xb6f673b0 <+1240>: asrs r3,r7, #31 0xb6f673b2 <+1242>: blx 0xb6f65014 0xb6f673b6 <+1246>: ldr r2,[pc, #640] ; (0xb6f67638<android::BootAnimation::movie()+1888>) 0xb6f673b8 <+1248>: movs r3,#0 0xb6f673ba <+1250>: vmov d8, r0, r1 //这里给d8赋值,也就是frameDuration 0xb6f673be <+1254>: blx 0xb6f65014<_ZN7android7RefBase13onLastWeakRefEPKv+420> 0xb6f673c2 <+1258>: strd r0,r1, [sp] 0xb6f673c6 <+1262>: vmov d10,r0, r1 通过gdb可以打印出d8寄存器的值如下 frameDuration = 36616418357296292--->0x82167041223CA4 delay = 36616418355129544 --->0x82167041012CC8 d8寄存器数据 d8 { u8= {0xa4, 0x3c, 0x22, 0x41, 0x70, 0x16, 0x82, 0x0}, u16= {0x3ca4, 0x4122, 0x1670, 0x82}, u32= {0x41223ca4, 0x821670}, u64 = 0x82167041223ca4, --->36616418357296292 f32= {0xa, 0x0}, f64= 0x0 } |
64#coredump ######################### d0 0x6e61746f6f622d2d, d1 0x2d3632662d2d2d69, d2 0xc07404fa77ed456e, d3 0x3fd34413509f792d, d4 0xc090a40000000000, d5 0xc074000000000000, d6 0x4c0becf1515fe, d7 0x4c0be, d8 0x82167041223ca4, d9 0x82169e07ba0c66, d10 0xc0620a02c0427212, d11 0x1a8c408420a83044, d12 0xa0a42cc1c008f282, d13 0xc80107081c243445, d14 0x2cb7c54cf018f182, d15 0x400040a4540012ec, d16 0x28000000, d17 0x3fd287a7636f4361, d18 0x3ff8000000000000, d19 0x0, d20 0x8000000000000000, d21 0x40620a02c0427212, d22 0x8000000000000000, d23 0x3f000000, d24 0x8000000000000000, d25 0x40620a02c0427212, d26 0x3ff7170dc2e4690d, d27 0x0, d28 0x8000000043700000, d29 0x0, d30 0x8000000000000000, d31 0x4000000000000000, | 301#coredump ####################### d0 0x6e61746f6f622d2d, d1 0x2d3632662d2d2d69, d2 0xc07408001779bc6e, d3 0x3fd34413509f792d, d4 0xc090a40000000000, d5 0xc074000000000000, d6 0x18d26549fa186, d7 0x18d26, d8 0x3d1204114a498488, d9 0x3d12042b3bd823f2, d10 0x308861552855500a, d11 0xb8b80a00da4290ac, d12 0x2802c06040a740e0, d13 0x95b8008912588410, d14 0x1051f370b3429280, d15 0x4024006401979122, d16 0x28000000, d17 0x3fd287a7636f4361, d18 0x3ff8000000000000, d19 0x0, d20 0x8000000000000000, d21 0x0, d22 0x8000000000000000, d23 0x3f000000, d24 0x8000000000000000, d25 0x0, d26 0x8000000000000000, d27 0x0, d28 0x8000000043700000, d29 0x0, d30 0x8000000000000000, d31 0x0 | 53#coredump####################### d0 081102052821a548 d1 0000000000000000 d2 0c04011013890485 d3 684305e003284648 d4 38698021003d0080 d5 e411412649661183 d6 10404a462100040d d7 0c100e80b007a008 d8 0c04011013890485 d9 0c040162386b5c0b d10 0c081851008054a4 d11 c30006804320081a d12 ac02030000338762 d13 2832c50055332002 d14 ca5130402f858329 d15 2b8184c7046a0898 d16 0000005224e25786 d17 cd9006480382b402 d18 82cd0841010d1601 d19 015c90804823c122 d20 9b128400c20b9184 d21 1119a90980c14082 d22 4050240413007294 d23 7039042b00e08000 d24 b8428608458c2895 d25081186251a008074 d26 011c8120048a130c d27 b862e4030215a080 d28 26202380c410c0d0 d29 1d80f04240000003 d30 5080e00257088016 d31 d34c005b4e44a8a2 |
在frameDuration变量发生突变时底层kernel有做cpu_down操作 01-03 06:50:51.810 271 320 E BootAnimation: msn--frameDuration= 125000000 01-03 06:50:52.060 271 320 E BootAnimation: msn--frameDuration= 2064657439864660112 01-03 06:50:52.310 271 320 E BootAnimation: msn--frameDuration= 2064657439864660112 对应时间点kernel log 01-03 06:50:51.768 <5>[ 51.071899] c0 CPU1: shutdown 01-03 06:50:51.768 <4>[ 51.074798] c0 !! 0 platform_cpu_kill 1 !! 01-03 06:50:51.777 <4>[ 51.079589] c0 powerdown_cpus i=6 !! 01-03 06:50:51.777 <4>[ 51.083251] c0 platform_cpu_kill finished i=0!! 01-03 06:50:51.865 <6>[ 51.165985] c0 cpufreq_sprdemand: !! we gonna unplug cpu2 !! 01-03 06:50:51.865 <5>[ 51.173217] c0 CPU2: shutdown 01-03 06:50:51.878 <4>[ 51.176452] c0 !! 0 platform_cpu_kill 2 !! 01-03 06:50:51.878 <4>[ 51.180999] c0 powerdown_cpus i=5 !! 01-03 06:50:51.878 <4>[ 51.184631] c0 platform_cpu_kill finished i=0!! 01-03 06:50:51.969 <6>[ 51.265991] c0 cpufreq_scx35: --xing-- set1200000 khz for cpu0 01-03 06:50:51.970 <6>[ 51.280517] c0 cpufreq_scx35: 768000 -->1200000, real=1200000, index=1 01-03 06:50:52.027 <6>[ 51.326049] c0 cpufreq_sprdemand: !! we gonna plugin cpu1 !! 01-03 06:50:52.027 <4>[ 51.333160] c1 CPU1: Booted secondaryprocessor 01-03 06:50:52.067 <6>[ 51.366027] c0 cpufreq_sprdemand: !! we gonna plugin cpu2 !! 01-03 06:50:52.067 <4>[ 51.373138] c2 CPU2: Booted secondaryprocessor |
修改/system/core/rootdir/init.rc 添加下面代码,验证是否是hotplug导致问题 on boot # basic network init .... ++ write/sys/devices/system/cpu/cpufreq/xxxx/cpu_hotplug_disable 1 |
查看vfp代码发现,vfp代码中有注册vfp_hotplug相关函数,当cpu进行拔核操作时,会通过内核通知链通知vfp模块调用vfp_hotplug函数,此函数会检查cpu状态,根据cpu状态来reload thread_info结构中保存的数据,而thread_info中就保存了进程的上下文数据,这里就包含有dx浮点数寄存器。当cpu拔核时,cpu_down会对cpu进行状态切换,进行下电操作,这就导致调用了vfp_hotplug函数。
/* *VFP hardware can lose all context when a CPU goes offline. * Aswe will be running in SMP mode with CPU hotplug, we will save the *hardware state at every thread switch. We clear our held state when * aCPU has been killed, indicating that the VFP hardware doesn't contain * athreads VFP state. When a CPU starts up,we re-enable access to the *VFP hardware. * *Both CPU_DYING and CPU_STARTING are called on the CPU which * isbeing offlined/onlined. */ static int vfp_hotplug(structnotifier_block *b, unsigned long action, void*hcpu) { if(action == CPU_DYING || action == CPU_DYING_FROZEN) { vfp_force_reload((long)hcpu, current_thread_info()); }else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN) vfp_enable(NULL); returnNOTIFY_OK; } |
修改如下:
Author: Yuanyuan Zhong <zyy@motorola.com>
Date: Wed Oct 30 17:31:49 2013 +0100
ARM: 7873/1: vfp: clear vfp_current_hw_state for dying cpu
The CPU_DYING notifier is called by cpu stopper task which
does not own the context held in the VFP hardware. Calling
vfp_force_reload() has no effect.
Replace it with clearing vfp_current_hw_state.
Signed-off-by: Yuanyuan Zhong <zyy@motorola.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 52b8f40..2f37e1d 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -642,9 +642,9 @@ int vfp_restore_user_hwstate(struct user_vfp __user *ufp,
static int vfp_hotplug(struct notifier_block *b, unsigned long action,
void *hcpu)
{
- if (action == CPU_DYING || action == CPU_DYING_FROZEN) {
- vfp_force_reload((long)hcpu, current_thread_info());
- } else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
+ if (action == CPU_DYING || action == CPU_DYING_FROZEN)
+ vfp_current_hw_state[(long)hcpu] = NULL;
+ else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
vfp_enable(NULL);
return NOTIFY_OK;
}
相关文章推荐
- 多线程问题导致的JDBMonitor的bug分析
- jdk1.6.0_29的bug导致c3p0获取sqlserver连接时被hang住甚至deadlock问题的分析
- 高通 MSM8K GPT异常导致无法开机问题分析
- 多线程问题导致的JDBMonitor的bug分析
- 有关ansj的IndexAnalysis的分词对elasticsearch的fast vector highlight高亮会产生BUG的问题分析
- MySQL Sending data导致查询很慢的问题详细分析
- 一段RUBY的脚本,分析姓名的分数,本来没甚么难的,就是ruby1.91的编码问题,导致一堆问题。
- JavaScript 中 Date 细节问题导致的程序 Bug
- 安装最新版本的ReSharper导致原生全局搜索工具的消失问题
- DllMain中不当操作导致死锁问题的分析--DisableThreadLibraryCalls对DllMain中死锁的影响
- 文《左右c++与java中国的垃圾问题的分析与解决》一bug分析
- MySQL Sending data导致查询很慢的问题详细分析
- 电脑电源问题,导致攒机电脑无法开机
- 关于烧写开机logo之后导致无法启动系统的问题
- ICMP Redirect 报文导致TCP连接建立不起来的问题分析...
- sudo bug导致的zabbix断图问题 推荐
- 某手机厂商是否应该尝试这样去定位手机开机卡死的低概率问题?
- Android - 通过真实案例学习解内存泄漏问题,最终发现Android原生Bug
- linux下错误使用pthread_mutex_lock导致程序奔溃问题分析
- rt-thread通过TCP连接(网络+shell)方式调用list_if()导致网络断开的问题分析