您的位置:首页 > 其它

【定屏】vfp 原生bug导致开机概率定屏问题分析

2018-03-04 23:37 357 查看
1、问题描述
android5.1+kernel3.10环境下,手机下载软件,首次开机(1/500)低概率定屏在android开机动画最后一帧,旋转屏幕会出现开机向导部分界面。此问题,adb可以连接,按键有反应。
2、问题分析
拿到问题后,与客户沟通,确认adb可以连接,按键有反应,屏幕可以正常亮灭,system_sever正常、kernel工作正常,动画卡在最后一张图片播放未完成,开始认为是bootanimation模块异常,逻辑有问题联系模块处理。模块分析后认为bootanimation流程无异常,从log上看也无明显异常点。模块添加log请客户复现问题。确认是bootanimation流程没有走完,进入sleep,导致最后一张图片没有播放完,无法退出动画进程,经过log分析是frameDuration变量异常导致动画无法退出,在bool BootAnimation::movie()函数中模块添加log如下:
nsecs_t frameDuration = s2ns(1) /animation.fps;
   __android_log_print(ANDROID_LOG_ERROR, LOG_TAG,"msn---bootani---start--frameDuration=%lld===animation.fps==%lld",frameDuration/1000000,animation.fps);
for (size_t i=0 ;i<pcount ; i++) {
for (int r=0 ; !part.count ||r<part.count ; r++) {
for (size_t j=0 ; j<fcount&& (!exitPending() || part.playUntilComplete) ; j++) {
……            
nsecs_t delay = frameDuration - (now - lastFrame);
+__android_log_print(ANDROID_LOG_ERROR,LOG_TAG,"msn---bootani---f23--frameDurationlld==%lld",frameDuration);
+__android_log_print(ANDROID_LOG_ERROR,LOG_TAG, "msn---bootani---f23--delaylld== %lld",delay);
+__android_log_print(ANDROID_LOG_ERROR,LOG_TAG,"msn---bootani---f23--lastFramelld==%lld",lastFrame);
 lastFrame = now;
+__android_log_print(ANDROID_LOG_ERROR,LOG_TAG, "msn---bootani---f24-- %lld", frameDuration);
log输出如下:
下面是正常log输出是frameDuration变量的值          
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--frameDuration==0.000000
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--frameDurationlld==125000000
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--delay== 0.000000
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--delaylld== 85754394
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--lastFrame== 0.000000
01-02 08:46:53.931   258  329 E BootAnimation: msn---bootani---f23--lastFramelld== 196218922118   
异常log输出如下:
01-02 08:46:54.282   258   329 E BootAnimation: msn---bootani---f23--frameDuration== -144.312836
01-02 08:46:54.282   258  329 E BootAnimation: msn---bootani---f23--frameDurationlld==36616418357296292
01-02 08:46:54.282   258  329 E BootAnimation: msn---bootani---f23--delay== 0.000000
01-02 08:46:54.282   258  329 E BootAnimation: msn---bootani---f23--delaylld== 36616418355129544
01-02 08:46:54.282   258  329 E BootAnimation: msn---bootani---f23--lastFrame== 0.000000
01-02 08:46:54.282   258  329 E BootAnimation: msn---bootani---f23--lastFramelld== 196605366210
frameDuration变量值这里莫名的改变了,从代码和log分析这里没有涉及到多线程操作,在开机过程中局部变量值突然改变。至此问题从模块交由我们来分析。
从流程看frameDuration是一个long long类型局部变量,初始化后就没有被赋值改变,通过查看log和代码,初步怀疑是栈数据异常导致变量值发生变化。此时需要抓取bootanimation进程的coredump进行内存分析。修改代码,在frameDuration发生异常时主动abort,修改如下:
BootAnimation.h++ #include<stdlib.h> BootAnimation.cppboolBootAnimation::playAnimation(const Animation& animation){    ......--  nsecs_t frameDuration = s2ns(1) /animation.fps;++  const nsecs_t frameDuration = s2ns(1) /animation.fps;++  nsecs_t tempFD = frameDuration;     ......    eglSwapBuffers(mDisplay, mSurface);        nsecs_t now = systemTime();++    if(tempFD != frameDuration){ ++          abort();++    }       nsecs_t delay = frameDuration - (now - lastFrame);
由于复现概率太低,此修改未能复现,之后客户复现问题后,请客户之间给bootanimation进程发生signal 11,触发coredump进行分析。
下面是frameDuration初始化相关代码对应的汇编
    nsecs_t lastFrame =systemTime();
    nsecs_t frameDuration = s2ns(1) / animation.fps;
下面是对应的汇编代码
   0xb6f673a8 <+1232>:     add r1,pc, #656   ; (adr r1, 0xb6f6763c<android::BootAnimation::movie()+1892>)
   0xb6f673aa <+1234>:     ldrd r0, r1, [r1]    
   0xb6f673ae <+1238>:     mov       r2,r7
   0xb6f673b0 <+1240>:     asrs r3,r7, #31
   0xb6f673b2 <+1242>:     blx  0xb6f65014
   0xb6f673b6 <+1246>:     ldr   r2,[pc, #640]       ; (0xb6f67638<android::BootAnimation::movie()+1888>)
   0xb6f673b8 <+1248>:     movs      r3,#0
   0xb6f673ba <+1250>:     vmov      d8, r0, r1       //这里给d8赋值,也就是frameDuration
   0xb6f673be <+1254>:     blx  0xb6f65014<_ZN7android7RefBase13onLastWeakRefEPKv+420>
   0xb6f673c2 <+1258>:     strd r0,r1, [sp]
   0xb6f673c6 <+1262>:     vmov      d10,r0, r1
通过gdb可以打印出d8寄存器的值如下
frameDuration = 36616418357296292--->0x82167041223CA4
delay = 36616418355129544               --->0x82167041012CC8
d8寄存器数据
d8 {
  u8= {0xa4, 0x3c, 0x22, 0x41, 0x70, 0x16, 0x82, 0x0},
  u16= {0x3ca4, 0x4122, 0x1670, 0x82},
  u32= {0x41223ca4, 0x821670},
  u64 = 0x82167041223ca4, --->36616418357296292
  f32= {0xa, 0x0},
  f64= 0x0
}  
发生问题时从汇编上下文看这里没有人操作过d8浮点寄存器,且没有将d8寄存器保存到栈上的操作,所以这里排除了用户空间栈数据破坏的可能。之后抓取了多份样本coredump进行分析,d8寄存器数据变化没有规律,每次都不相同,且整个浮点寄存器数据都不规律。几份样本数据如下:
64#coredump
#########################
d0 0x6e61746f6f622d2d,
d1 0x2d3632662d2d2d69,
d2 0xc07404fa77ed456e,
d3 0x3fd34413509f792d,
d4 0xc090a40000000000,
d5 0xc074000000000000,
d6 0x4c0becf1515fe,
d7 0x4c0be,
d8 0x82167041223ca4,
d9 0x82169e07ba0c66,
d10 0xc0620a02c0427212,
d11 0x1a8c408420a83044,
d12 0xa0a42cc1c008f282,
d13 0xc80107081c243445,
d14 0x2cb7c54cf018f182,
d15 0x400040a4540012ec,
d16 0x28000000,
d17 0x3fd287a7636f4361,
d18 0x3ff8000000000000,
d19 0x0,
d20 0x8000000000000000,
d21 0x40620a02c0427212,
d22 0x8000000000000000,
d23 0x3f000000,
d24 0x8000000000000000,
d25 0x40620a02c0427212,
d26 0x3ff7170dc2e4690d,
d27 0x0,
d28 0x8000000043700000,
d29 0x0,
d30 0x8000000000000000,
d31 0x4000000000000000,
 301#coredump
#######################
d0 0x6e61746f6f622d2d,
d1 0x2d3632662d2d2d69,
d2 0xc07408001779bc6e,
d3 0x3fd34413509f792d,
d4 0xc090a40000000000,
d5 0xc074000000000000,
d6 0x18d26549fa186,
d7 0x18d26,
d8 0x3d1204114a498488,
d9 0x3d12042b3bd823f2,
d10 0x308861552855500a,
d11 0xb8b80a00da4290ac,
d12 0x2802c06040a740e0,
d13 0x95b8008912588410,
d14 0x1051f370b3429280, 
d15 0x4024006401979122, 
d16 0x28000000,
d17 0x3fd287a7636f4361,
d18 0x3ff8000000000000,
d19 0x0,
d20 0x8000000000000000, 
d21 0x0,
d22 0x8000000000000000,
d23 0x3f000000,
d24 0x8000000000000000,
d25 0x0,
d26 0x8000000000000000,
d27 0x0,
d28 0x8000000043700000,
d29 0x0,
d30 0x8000000000000000,
d31 0x0
53#coredump#######################
d0 081102052821a548  

d1  0000000000000000
 d2  0c04011013890485  
 d3 684305e003284648
 d4  38698021003d0080 
 d5 e411412649661183
 d6  10404a462100040d  
 d7 0c100e80b007a008
 d8  0c04011013890485  
 d9 0c040162386b5c0b
 d10 0c081851008054a4  
 d11 c30006804320081a
   d12 ac02030000338762  
d13 2832c50055332002
   d14 ca5130402f858329  
d15 2b8184c7046a0898
   d16 0000005224e25786 
 d17 cd9006480382b402
   d18 82cd0841010d1601  
d19 015c90804823c122
   d20 9b128400c20b9184 
 d21 1119a90980c14082
   d22 4050240413007294
  d23 7039042b00e08000
   d24 b8428608458c2895 
 d25081186251a008074
   d26 011c8120048a130c  
d27 b862e4030215a080
   d28 26202380c410c0d0  
d29 1d80f04240000003
   d30 5080e00257088016  
d31 d34c005b4e44a8a2
结合上面数据,同时分析kernel log,发现每次发生frameDuration改变时,kernel log都有输出cpu拔core的log
在frameDuration变量发生突变时底层kernel有做cpu_down操作
01-03 06:50:51.810   271  320 E BootAnimation: msn--frameDuration= 125000000
01-03 06:50:52.060   271  320 E BootAnimation: msn--frameDuration= 2064657439864660112
01-03 06:50:52.310   271  320 E BootAnimation: msn--frameDuration= 2064657439864660112
对应时间点kernel log
01-03 06:50:51.768 <5>[   51.071899] c0 CPU1: shutdown
01-03 06:50:51.768 <4>[   51.074798] c0 !! 0  platform_cpu_kill 1 !!
01-03 06:50:51.777 <4>[   51.079589] c0 powerdown_cpus i=6 !!
01-03 06:50:51.777 <4>[   51.083251] c0 platform_cpu_kill finished i=0!!
01-03 06:50:51.865 <6>[   51.165985] c0 cpufreq_sprdemand: !!  we gonna unplug cpu2  !!
01-03 06:50:51.865 <5>[   51.173217] c0 CPU2: shutdown
01-03 06:50:51.878 <4>[   51.176452] c0 !! 0  platform_cpu_kill 2 !!
01-03 06:50:51.878 <4>[   51.180999] c0 powerdown_cpus i=5 !!
01-03 06:50:51.878 <4>[   51.184631] c0 platform_cpu_kill finished i=0!!
01-03 06:50:51.969 <6>[   51.265991] c0 cpufreq_scx35: --xing-- set1200000 khz for cpu0
01-03 06:50:51.970 <6>[   51.280517] c0 cpufreq_scx35: 768000 -->1200000, real=1200000, index=1
01-03 06:50:52.027 <6>[   51.326049] c0 cpufreq_sprdemand: !!  we gonna plugin cpu1  !!
01-03 06:50:52.027 <4>[   51.333160] c1 CPU1: Booted secondaryprocessor
01-03 06:50:52.067 <6>[   51.366027] c0 cpufreq_sprdemand: !!  we gonna plugin cpu2  !!
01-03 06:50:52.067 <4>[   51.373138] c2 CPU2: Booted secondaryprocessor
这里猜测和cpu hotplug机制有关,于是抱着试试的心态,修改代码在开机时不做拔core操作,修改如下:
修改/system/core/rootdir/init.rc
添加下面代码,验证是否是hotplug导致问题
on boot
   # basic network init
....
++     write/sys/devices/system/cpu/cpufreq/xxxx/cpu_hotplug_disable 1
经过一周测试问题不复现,可以确定是和cpu hotplug机制有关。那么为什么cpu 做拔核操作会影响到上层进程浮点数寄存器值会发生变化呢?
查看vfp代码发现,vfp代码中有注册vfp_hotplug相关函数,当cpu进行拔核操作时,会通过内核通知链通知vfp模块调用vfp_hotplug函数,此函数会检查cpu状态,根据cpu状态来reload thread_info结构中保存的数据,而thread_info中就保存了进程的上下文数据,这里就包含有dx浮点数寄存器。当cpu拔核时,cpu_down会对cpu进行状态切换,进行下电操作,这就导致调用了vfp_hotplug函数。
/*
 *VFP hardware can lose all context when a CPU goes offline.
 * Aswe will be running in SMP mode with CPU hotplug, we will save the
 *hardware state at every thread switch. We clear our held state when
 * aCPU has been killed, indicating that the VFP hardware doesn't contain
 * athreads VFP state.  When a CPU starts up,we re-enable access to the
 *VFP hardware.
 *
 *Both CPU_DYING and CPU_STARTING are called on the CPU which
 * isbeing offlined/onlined.
 */
static int vfp_hotplug(structnotifier_block *b, unsigned long action,
       void*hcpu)
{
       if(action == CPU_DYING || action == CPU_DYING_FROZEN) {
              vfp_force_reload((long)hcpu, current_thread_info());
       }else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
              vfp_enable(NULL);
       returnNOTIFY_OK;
}
vfp_hotplug函数会检查到cpu拔出通知,action==CPU_DYING,会调用 vfp_force_reload函数,这个函数会load thread_info中浮点寄存器,会对上下文从新赋值,由于cpu掉电,就导致浮点寄存器的值是一个无规律的无效值。再次调度到问题进程时,问题进程恢复上下文就会获取到一个错误的值。这里由于篇幅不对vfp_force_reload函数流程进行展开分析,这里涉及进程上下文切换和cpu hotplug机制会在后文中做详细分析,请看http://mp.blog.csdn.net/postedit/79445581文章。至此问题算是有个结论,真对此问题kernel中已经有高手合入了修改,只是我们的代码没有及时添加这个patch,导致了这样的问题。个人才疏学浅,分析有纰漏的地方还请各路大神指点。
修改如下:
Author: Yuanyuan Zhong <zyy@motorola.com>
Date:   Wed Oct 30 17:31:49 2013 +0100
    ARM: 7873/1: vfp: clear vfp_current_hw_state for dying cpu
    
    The CPU_DYING notifier is called by cpu stopper task which
    does not own the context held in the VFP hardware. Calling
    vfp_force_reload() has no effect.
    Replace it with clearing vfp_current_hw_state.
    
    Signed-off-by: Yuanyuan Zhong <zyy@motorola.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 52b8f40..2f37e1d 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -642,9 +642,9 @@ int vfp_restore_user_hwstate(struct user_vfp __user *ufp,
 static int vfp_hotplug(struct notifier_block *b, unsigned long action,
        void *hcpu)
 {
-       if (action == CPU_DYING || action == CPU_DYING_FROZEN) {
-               vfp_force_reload((long)hcpu, current_thread_info());
-       } else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
+       if (action == CPU_DYING || action == CPU_DYING_FROZEN)
+               vfp_current_hw_state[(long)hcpu] = NULL;
+       else if (action == CPU_STARTING || action == CPU_STARTING_FROZEN)
                vfp_enable(NULL);
        return NOTIFY_OK;
 }
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: