
Android Stability [Repost]

2016-03-23 20:47
1. APSS stability

2. MPSS stability

3. TZ stability

4. RPM stability

5. ADSP stability

6. Unknown reset

1.APSS Stability

Type | Type in detail | Log indication | How to debug? | Possible reasons

Kernel Panic

Data abort

Unable to handle kernel paging request at virtual address 00001098

pgd = ffffffc00007d000



Internal error: Oops: 96000045 [#1] PREEMPT SMP



1. load to T32

2. v.f, check call stack

3. d.l PC, check current PC, and why it loads to wrong address

4. Go through assembly code to see why the address value is wrong

5. Read code to check critical variables to find the final reason

1. Check whether it is a SW issue or a HW issue.

If crashes occur randomly with varying symptoms, it is generally a HW issue: cache error, register corruption, DDR memory corruption/bit flip, etc.

Actions:

1) If DDR memory corruption is seen, run a DDR stress test with QMesa;

2) Check the HW PDN;

3) Raise the core voltage, disable CPR, or disable one CPU to see whether the issue disappears;

4) If only a single device is affected, do RMA;

2. If the crash reproduces regularly with the same symptom, it is almost certainly a SW bug. Follow the debug steps above and read the code to fix the bug.

3. If the issue occurred only once, still follow the debug steps to check why the panic happened. If memory/cache/register corruption is identified, keep monitoring the device.
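The number printed after "Oops:" (96000045 above) and the "code" in the prefetch abort log (0x86000004) are ESR_EL1 values on arm64, so they can be split into fields before opening T32. The following is a minimal sketch in Python; the function name `decode_esr` and the partial lookup tables are illustrative assumptions, and only a few exception classes and fault status codes are listed.

```python
# Sketch: split an ARMv8 ESR_EL1 value into exception class and fault status.
# The tables are intentionally partial; unknown values are reported as "other".

EC_NAMES = {
    0x20: "Instruction abort from a lower EL",
    0x21: "Instruction abort from the current EL",
    0x24: "Data abort from a lower EL",
    0x25: "Data abort from the current EL",
}

FAULT_STATUS = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Return the EC and fault status code of an ESR_EL1 value."""
    ec = (esr >> 26) & 0x3F          # exception class, bits [31:26]
    iss = esr & 0x1FFFFFF            # instruction specific syndrome, bits [24:0]
    fsc = iss & 0x3F                 # fault status code, bits [5:0]
    info = {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "fsc": fsc,
        "fsc_name": FAULT_STATUS.get(fsc, "other"),
    }
    if ec in (0x24, 0x25):
        # WnR (bit 6) is only meaningful for data aborts: True means a write.
        info["write"] = bool(iss & (1 << 6))
    return info


if __name__ == "__main__":
    for value in (0x96000045, 0x86000004):
        print(hex(value), decode_esr(value))
```

For the data abort log above this reports a write that hit a level 1 translation fault, which fits an access through a near-NULL pointer (virtual address 00001098).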

Prefetch abort

Bad mode in Synchronous Abort handler detected, code 0x86000004



PC is at 0x30303063706920

LR is at 0x3030303063706920



1. Load to T32

2. D.l PC, check current PC

3. D.dump r(sp) or D.v %y.ll r(sp), check stack

4. Read code to check critical variables to find the final reason. This is often caused by stack corruption.
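A quick check that supports the stack corruption theory is to see whether the bogus PC/LR value is really text. The sketch below assumes a little-endian core (the usual Android configuration) and uses a hypothetical helper name, `address_as_ascii`; it returns the value as a string only when every byte is printable ASCII.

```python
# Sketch: test whether a 64-bit register value is actually ASCII text, which
# would suggest that a string overwrote a saved return address on the stack.

def address_as_ascii(value, width=8):
    """Return the bytes of `value` in memory order as text, or None."""
    raw = value.to_bytes(width, "little")   # assumption: little-endian memory layout
    if all(0x20 <= b < 0x7F for b in raw):  # printable ASCII only
        return raw.decode("ascii")
    return None


if __name__ == "__main__":
    print(repr(address_as_ascii(0x3030303063706920)))  # LR from the log above
    print(repr(address_as_ascii(0xFFFFFFC000263F20)))  # a normal kernel address
```

The LR value from the log decodes to the text " ipc0000", so the next step would be to look for code that formats such a string into a stack buffer.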

Stack_protector

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffc000263f20

arch/Kconfig:

CONFIG_HAVE_CC_STACKPROTECTOR=y

CONFIG_CC_STACKPROTECTOR=y

1. Load to T32

2. v.f check call stack

3. d.dump r(sp)

4. d.l current function

Undefined instruction

Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM

Modules linked in : …

1. load to T32

2. v.f

3. d.l PC

4. mmu.pt.list r(pc)

5. check why the instruction is undefined

Cache Error

SBE:

[ 26.079779] EDAC DEVICE0: CE: cache instance: cpu1 block: L1 'A53 L1 Correctable Error'

[ 26.087819] EDAC arm64: ARM64 CPU ERP: Single-bit error interrupt received on CPU 1!

[ 26.095487] EDAC arm64: Single-bit error information from CPU 1, MIDR=0x410fd031:

[ 26.102775] EDAC arm64: Cortex A53 CPU1 L1 Single-bit Error detected

[ 26.109107] EDAC arm64: CPUMERRSR value = 0x9281040183

[ 26.114056] EDAC arm64: L1 Instruction data RAM bank is 1

[ 26.119438] EDAC arm64: Repeated error count: 146

[ 26.124124] EDAC arm64: Other error count: 10

CCI Error:

[ 45.812121] EDAC arm64: CCI error interrupt received!

[ 45.817228] EDAC arm64: CCI imprecise error register: 00000000.

[ 45.823044] EDAC DEVICE0: UE: cache instance: cpu0 block: L3 'CCI Error'

[ 45.829707] Kernel panic - not syncing: EDAC cache: UE instance: cpu0 block L3 'CCI Error'

1. Check back trace

2. Check the access prior to these messages in dmesg and RTB logs, although they will not always be an exact indicator of what is causing a particular error.

Defconfig

CONFIG_EDAC_CORTEX_ARM64_PANIC_ON_CE

- Enables panic on a correctable single-bit error

- If this config is disabled, the command line argument has no effect

CONFIG_EDAC_CORTEX_ARM64_PANIC_ON_UE

- Enables panic on an uncorrectable double-bit error

CORTEX_ARM64_EDAC.PANIC_ON_CE=0

- Add this to the command line to disable panic on correctable single-bit errors

Non secure WD

APSS Non Secure WD bark/bite

In call stack



wdog_bark_handler()

handle_irq_event_percpu()



gic_handle_irq()

el1_irq()

-> exception

Some functions()



kthread()

Check points:

1. Timer list

2. Work list

3. Trace events log

4. Register information

Steps:

1. v.f

2. v.v wdog_data

Check wdog_data->alive_mask->bits.

3. Restore the kernel stack manually

4. Check trace event log
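When reading wdog_data->alive_mask->bits, it helps to turn the mask into a list of CPUs that did not answer the watchdog ping. The sketch below is illustrative only: it assumes that bit n set in the alive mask means CPU n responded, and the function name and mask values are hypothetical, not taken from a real dump.

```python
# Sketch: list online CPUs whose bit is missing from the watchdog alive mask.
# Assumption: bit n set in alive_mask means CPU n answered the ping.

def cpus_not_responding(alive_mask, online_mask):
    """Return the CPU numbers that are online but not marked alive."""
    missing = online_mask & ~alive_mask
    return [cpu for cpu in range(missing.bit_length()) if (missing >> cpu) & 1]


if __name__ == "__main__":
    # Hypothetical values: CPUs 0-3 online, CPU 3 did not respond.
    print(cpus_not_responding(alive_mask=0b0111, online_mask=0b1111))
```

The CPU reported here is the one whose call stack and run queue deserve the first look.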

Other tests:

1. Disable CPU3 and check whether the issue still reproduces;

2. Raise the APC0 voltage by 25 mV and check whether the issue still reproduces;

3. For HW issues: RMA.

1. Excessive serial logging.

a. Serial logging is costly

b. Too much time is spent with IRQs disabled

c. pr_err() in kmalloc()

2. RT thread

a. An RT thread runs too long without yielding the processor

b. Fastmixer

3. Low memory

a. No idle worker

b. kthreadd cannot create a worker_thread

4. DDR corruption

a. Spinlock corruption

5. An online CPU is unable to send an ACK back for an IPI.

Secure WD

Secure WD

RST_STAT.BIN stores GCC_RESET_STATUS.

< Reset registers(Saved by SDI) >:

GCC_RESET_STATUS : 0x03 or 0x23

PMIC Registers (SDI) :

PON_REASON1 : 0x00

PON_WARM_RESET_REASON1 : 0x21

PON_WARM_RESET_REASON2 : 0x00

POFF_REASON1 : 0x02

POFF_REASON2 : 0x00

PON_SOFT_RESET_REASON1 : 0x00

PON_SOFT_RESET_REASON2 : 0x00

< TZ log >:

CPU |Reset Reason |Reset Count

0 |0x00000000 (TZBSP_ERR_FATAL_NONE )|0x00000000

1 |0x00000000 (TZBSP_ERR_FATAL_NONE )|0x00000000

How to debug:

1. SDI saves the cache and CPU context when a secure WD bite happens.

2. If all CPUs are offline, check from the RPM side.

3. If at least one CPU is online, check from the AP side: RTB logs, TZ diag logs, etc.

4. In the RTB log, check whether some action disabled a clock. If so, that action is the prime suspect.

5. Normally this involves effort from multiple technical teams, and a series of tests is required.

Crash reason:

TZ failed to respond to FIQ

Possible reasons:

1. DDR lock up

2. APPS fails to wake up

3. Un-clocked register access

Bus hang

AHB timeout

In TZ log or F3 messages:

Fatal Error: AHB_TIMEOUT

Or

Fatal Error: NOC_ERROR

How to debug:

1. Check the TZ diag log

2. Parse the NOC error message

3. Check what the master was doing

4. Check whether the clock required by the slave was enabled

5. Check the RTB log.

TZ Diag log:

ERRLOG0: cause/type of the NoC error

ERRLOG1: InitFlow, TargFlow, TargSubRange, Source (PID, BID, MID), and SeqID parameters.

ERRLOG2: RouterID for the error when the router ID is wider than 32 bits

ERRLOG3: complete address, or offset of the address, for which the error was detected.

Reason:

The AHB timeout slave monitors the bus for hangs. When a hang is detected, it generates an interrupt to any processor.

NOC error handler software is integrated into the AHB timeout driver to provide complete coverage for bus hang detection.

Possible reasons:

Un-clocked register access

2.MPSS Stability

Type | Type in detail | Log indication | How to debug? | Possible reasons

Q6/QuRT exception

Processor exception

ExIPC: Exception recieved tid=161 inst=c0075d2c

1. Recover call stack

2. Check SSR to get the detailed exception reason. The last two hex digits (the low byte) indicate the exception reason:

0x70, TLB miss-RW

0x71, TLB miss-Write

0x15, invalid packet

0x3A, BADTRAP



3. d.l @ELR

4. d.dump SP

5. Read code to find why exception happened, why register value is not correct.
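The SSR lookup in step 2 can be written as a small table. This is a sketch that assumes the cause code sits in the low byte of SSR; the table contains only the four codes listed in this note, and the function name and the sample SSR value are hypothetical.

```python
# Sketch: map the low byte of the Hexagon SSR register to an exception cause.
# Only the codes mentioned in this note are listed.

SSR_CAUSES = {
    0x70: "TLB miss-RW",
    0x71: "TLB miss-Write",
    0x15: "invalid packet",
    0x3A: "BADTRAP",
}


def ssr_cause(ssr):
    """Return (cause code, description) for an SSR value."""
    code = ssr & 0xFF  # assumption: the cause is the low byte of SSR
    return code, SSR_CAUSES.get(code, "not listed in this note")


if __name__ == "__main__":
    code, text = ssr_cause(0x21970070)  # hypothetical SSR value
    print(hex(code), text)
```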

SW reasons:

1. NULL pointer;

2. Address is not correct;

3. Stack corrupted by other tasks;

HW reasons:

1. Bit flip happened.

2. Register corruption.

QuRT error

In kernel log:

SFR Init: wdog or kernel error suspected..

In modem log:

MODEM - ERR_PRECISE, a precise exception occurred at instruction 0x86842558 with BADVA 0x8B04BBC0

Modem - ERR_NMI, Error cause: ERR_NMI, an NMI occurred.

MODEM - ERR_TLBMISS, TLBMISS RW occurred at instruction 0xC003AE24 with BADVA 0x00000008

MODEM - ERR_ASSERT, Kernel Assert: at instruction 0xC13BF268 with BADVA 0xD1396C48

1. v.v QURT_error_info, check cause and cause2(If cause = 5).

2. V.f

3. D.l PC, check assembly.

4. Depending on the QuRT error reason (PRECISE, NMI, TLBMISS, ASSERT, etc.), decide which points to check.

5. Check MB.

6. Work through the findings until the root cause is identified.
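Before loading the dump, the modem log lines shown above can be pre-sorted with a small parser. The sketch below extracts the error type, the faulting instruction, and BADVA; treating a BADVA below 0x1000 as a likely NULL pointer plus offset is a heuristic assumption of this example, not a QuRT rule, and the names are hypothetical.

```python
import re

# Sketch: pull error type, instruction address and BADVA out of a modem log line.
QURT_ERR = re.compile(
    r"(ERR_\w+),.*?at instruction (0x[0-9A-Fa-f]+) with BADVA (0x[0-9A-Fa-f]+)"
)
NULL_WINDOW = 0x1000  # assumption: a BADVA this small hints at NULL + offset


def parse_qurt_error(line):
    """Return a dict for lines carrying an instruction and BADVA, else None."""
    m = QURT_ERR.search(line)
    if not m:
        return None
    badva = int(m.group(3), 16)
    return {
        "type": m.group(1),
        "instruction": int(m.group(2), 16),
        "badva": badva,
        "near_null": badva < NULL_WINDOW,
    }


if __name__ == "__main__":
    line = ("MODEM - ERR_TLBMISS, TLBMISS RW occurred at instruction "
            "0xC003AE24 with BADVA 0x00000008")
    print(parse_qurt_error(line))
```

Lines without an instruction address, such as the ERR_NMI message, return None and have to be handled from the call stack instead.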

SW reasons:

1. NULL pointer;

2. Address is not correct;

3. Stack corrupted by other tasks;

HW reasons:

1. Bit flip happened.

2. Register corruption.

Timer error

MODEM - timer_slaves.c:1011 No items on time slave task free cmd q on 0,Q_ele=24,Free_Q=0

1. v.v timer_slaves_cmd_q to see which timer is full/empty;

2. Check the call stacks of the 3 timer slave tasks. One task is probably blocked somewhere and cannot handle other commands;

3. Check timers_expired_slave1/2/3;

SW reasons:

1. Some timer callback (CB) functions did not finish in time.

HW reasons:

1. Bit flip happened.

2. Register corruption.

Heap error

MODEM - memheap.c:2127 In task 0xfa, Assertion !INTEGRITY_CHECK_ON_USED_HEADER(heap_ptr->magic_num_use

1. Run Heapwalker

2. Check modem_mem_heap -> incomingBlock;

3. D.dump SP

4. Check why the heap value is not correct according to stack and global variables;

SW reasons:

1. A client misuses the heap, e.g. frees the same block twice;

2. The input pointer is wrong and does not point to a valid heap block.

HW reasons:

1. Bit flip happened.

2. Register corruption.

Stack error

stack_protect.c:65 Stack Check Failed

This is a stack overflow issue, usually caused by a local array overflow. If the code was compiled with -fstack-protector, the stack checker reports this crash, which indicates that some function has a stack overflow or stack corruption.

When a stack is corrupted/overwritten by other tasks and -fstack-protector is not enabled, the corruption can instead surface as other exceptions.

1. V.f, check the call stack

2. D.dump SP, check the stack

3. Check assembly in detail to see which value is wrong, and whether the content on stack is corrupted.

4. Check whether 128 padding 0xF8 is corrupted in stack.

5. Check in functions, whether there is huge stack allocation
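The padding check in step 4 is easy to script once the stack region has been saved from T32 to a file. The sketch below assumes the padding is 128 bytes of 0xF8 located at the start of the dumped region; both the layout and the helper name are assumptions of this example.

```python
# Sketch: report which bytes of the 0xF8 stack padding were overwritten.
# Assumptions: the padding is 128 bytes long and sits at the start of the dump.

def padding_corruption(stack_bytes, pad_len=128, pad_value=0xF8):
    """Return the offsets inside the padding whose value is not pad_value."""
    return [i for i, b in enumerate(stack_bytes[:pad_len]) if b != pad_value]


if __name__ == "__main__":
    dump = bytearray([0xF8] * 128) + bytearray(64)  # hypothetical dump contents
    dump[5] = 0x41                                  # simulate one overwritten byte
    print(padding_corruption(bytes(dump)))
```

An empty list means the padding is intact, so the task's own stack did not overflow into it.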

SW reasons:

1. Huge stack allocation, e.g. big local variables or too many nested function calls.

To increase the stack size, raise the size of the task in modem_proc\build\bsp\modem_proc_img\build\modem_task_stksz.csv.

MPSS Watch Dog

Software watchdog timeout

dog.c:1498 Watchdog detects stalled initialization

1. v.v dog_state_table to see which task did not pet the DOG in time.

2. Check the call stack of this task to see whether any function may be stuck in an infinite loop;

3. If the task is waiting for a futex, check whether this futex is held by another task, i.e. whether a deadlock happened.

4. Check MB to see whether some task ran too long without petting the DOG.

1. Infinite loop in some task;

2. Deadlock;

3. Some task runs too long;

Hardware watchdog bark

Seldom happens. Normally a SW watchdog timeout or a HW WD bite occurs instead.

Hardware watchdog bite

In Dmesg:

Watchdog bite received from modem software!

It is always difficult to debug because the logs are insufficient.

The cache is NOT flushed.

Normally the ETB log needs to be enabled and the issue reproduced. Qualcomm needs to be involved to solve it.

SW reasons:

1. Some bugs in QuRT error handler;

2. Un-clocked register access

HW reasons:

1. RMA devices;

2. PDN issue;

Error fatal

Error fatal in core modules

Some error fatals are handled by BSP team, E.g.:

Diag:

diagcomm_sio.c:1269 Assertion 0 failed

MProc:

glink_channel_migration.c:426 Assertion status == GLINK_STATUS_SUCCESS failed

EFS:

fs_rmts_pm.c:962 2,0,0,Partition data not meant for this partition

Read code to find the error fatal reason, check call stack and input variables to see why error fatal happened.

Normally a SW bug.

Error fatal in protocol stack modules

Error fatals in protocol stack modules are handled by the modem team, e.g.:

LTE crash:

lte_ml1_mgr_modules.c:621:Assert stm_error_flag == STM_SUCCESS failed: LTE_ML1

GSM crash:

gl1_hw_sleep_ctl.c:5028 SLEEP:Error recovery attempt FAILED. g_slept 595990 duration 440613 missed_frames 4

RF crash:

rf_dispatch_snum.c:497:rf_dispatch_snum_pop_item: SNUM Node Item Exhausted (Ma

Handled by protocol team.

Handled by protocol team.

3.TZ Stability

Type | Type in detail | Log indication | How to debug? | Possible reasons

Non secure WDOG

In call stack



wdog_bark_handler()

handle_irq_event_percpu()

scm_call()



gic_handle_irq()

el1_irq()

-> exception

Some functions()



kthread()

1. A non-secure WD is checked from the APPS perspective first.

2. TZ is one of the possible root causes of an NS WDOG.

3. Look for an scm call in the HLOS call stack where scm_lock() did not return.

4. V.v scm_lock -> comm

5. Task.dtask, check the task state.

SW reasons

HW reasons:

XPU error

MPU error

xpu: ISR begin
XPU ERROR: Non Sec!!
xpu:>>> [1] XPU error dump, XPU id 3 (BIMC_MPU0)<<<
xpu: uErrorFlags: 00000016
xpu: HAL_XPU2_ERROR_F_CLIENT_PORT
xpu: HAL_XPU2_ERROR_F_MULTIPLE
uBusFlags: 00000521
xpu: HAL_XPU2_BUS_F_ERROR_AC
xpu: HAL_XPU2_BUS_F_APROTNS
xpu: HAL_XPU2_BUS_F_AOOO
xpu: HAL_XPU2_BUS_F_ABURST
xpu: uPhysicalAddress: 80c000f4
xpu: uMasterId: 00000000, uAVMID : 00000003
xpu: uATID : 00000000, uABID : 00000002
xpu: uAPID : 00000000, uALen : 00000000
xpu: uASize : 00000002, uAPReqPriority : 00000000
xpu: uAMemType: 00000000

1. Check B/P/M to find which client it is:

a) XPU ID – [3] is (BIMC_MPU0)

b) Virtual Master ID – uAVMID [3] is (TZBSP_VMID_AP)

c) Bus ID – uABID [2]

d) Port ID – uAPID[0] is (Kryo)

e) Master ID – uMasterId[0] is CL1 CPU 0

2. uPhysicalAddress: 80c000f4 is the address that the client is trying to access;

3. Check why this client accesses this address.

Read 80-NV396-70_XPU ERROR ANALYSIS FOR MSM8996 for details.

Possible reasons:

1. RMA devices

2. Un-clocked register access

3. APPS/MPSS wants to access some peripherals(e.g. SPI), but RPM didn’t give it access rights.

APU and RPU error

xpu: ISR begin
XPU ERROR: Non Sec!!
xpu:>>> [5] XPU error dump, XPU id 45 (BAM_BLSP1_DMA)<<<
xpu: uErrorFlags: 00000002
xpu: HAL_XPU2_ERROR_F_CLIENT_PORT
uBusFlags: 00080021
xpu: HAL_XPU2_BUS_F_ERROR_AC
xpu: HAL_XPU2_BUS_F_APROTNS
xpu: HAL_XPU2_BUS_F_NONSECURE_RG_MATCH
xpu: uPhysicalAddress: 00019000
xpu: uMasterId: 00000000, uAVMID : 00000003
xpu: uATID : 00000000, uABID : 00000002
xpu: uAPID : 00000000, uALen : 00000000
xpu: uASize : 00000000, uAPReqPriority : 00000000
xpu: uAMemType: 00000000
Fatal Error: XPU_VIOLATION
1. Check B/P/M to find which client it is:

a) XPU ID – [45] is (BAM_BLSP1_DMA)

b) Virtual Master ID – uAVMID [3] is (TZBSP_VMID_AP)

c) Bus ID – uABID [2]

d) Port ID – uAPID[0] is (Kryo 0)

e) Master ID – uMasterId[0] is CL1 CPU 0

2. Because it is an APU, the reported address is an offset from the BAM_BLSP1_DMA base address: [BAM_BLSP1_DMA address base MSB] + 00019000. The result is:

[0x07544000 + 0x00019000] = 0x755D000

The register at 0x755D000 is:

BLSP1_BLSP_BAM_P_CTRL_20
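The address arithmetic in step 2 can be checked with a few lines of Python. This sketch uses only the base address, offset, and register name quoted in this note; the one-entry register map and the function name are illustrative and would be replaced by a lookup in the chip's register documentation.

```python
# Sketch: turn an APU-reported offset into an absolute register address.
BAM_BLSP1_DMA_BASE = 0x07544000  # base address quoted in this note

# Illustrative one-entry map; real lookups come from the register documentation.
KNOWN_REGISTERS = {0x0755D000: "BLSP1_BLSP_BAM_P_CTRL_20"}


def apu_register(base, offset):
    """Return (absolute address, register name or None)."""
    address = base + offset
    return address, KNOWN_REGISTERS.get(address)


if __name__ == "__main__":
    address, name = apu_register(BAM_BLSP1_DMA_BASE, 0x00019000)
    print(hex(address), name)
```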

3. Check why this client accesses this address.

Read 80-NV396-70_XPU ERROR ANALYSIS FOR MSM8996 for details.

Possible reasons:

1. RMA devices

2. Un-clocked register access

3. APPS/MPSS wants to access some peripherals(e.g. SPI), but RPM didn’t give it access rights.

NOC error

SNOC Error – ERRLOG0 = 0x80030000

SNOC Error – ERRLOG1 = 0x6a52810e

SNOC Error – ERRLOG2 = 0x00000000

SNOC Error – ERRLOG3 = 0x014c5000

SNOC Error – ERRLOG4 = 0x00000000

Fatal Error – NOC_ERROR

PCNOC Error – ERRLOG0 = 0x80030000

PCNOC Error – ERRLOG1 = 0x14225024

PCNOC Error – ERRLOG3 = 0x00005000

PCNOC Error – ERRLOG4 = 0x00000000

Fatal Error – NOC_ERROR
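Collecting the ERRLOG registers per NoC is the first mechanical step before decoding them. The sketch below groups ERRLOG lines like the ones above by NoC name; the regular expression is an assumption about the line format, written loosely enough to accept a dash or a colon after "Error", and the function name is hypothetical.

```python
import re

# Sketch: group ERRLOGn values from TZ log lines by NoC name.
ERRLOG_LINE = re.compile(
    r"(\w+NOC) Error\s*\S\s*ERRLOG(\d)\s*=\s*(0x[0-9a-fA-F]+)", re.IGNORECASE
)


def collect_errlogs(lines):
    """Return {noc_name: {errlog_index: value}} for all matching lines."""
    result = {}
    for line in lines:
        m = ERRLOG_LINE.search(line)
        if m:
            noc = m.group(1).upper()
            result.setdefault(noc, {})[int(m.group(2))] = int(m.group(3), 16)
    return result


if __name__ == "__main__":
    log = [
        "SNOC Error - ERRLOG0 = 0x80030000",
        "SNOC Error - ERRLOG3 = 0x014c5000",
        "PCNOC Error - ERRLOG1 = 0x14225024",
        "Fatal Error - NOC_ERROR",
    ]
    for noc, regs in collect_errlogs(log).items():
        print(noc, {index: hex(value) for index, value in regs.items()})
```

ERRLOG1 then feeds the route ID decode and ERRLOG3 gives the address or offset, as described in the general flow.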

General flow:

1. Decode the route ID per the decomposition table of a specific NoC to get values for the InitFlow, TargFlow, TargSubRange, SrcId.PID, SrcId.BID, SrcId.MID, and SeqId parameters.

2. Look up master information (route ID composition of a specific NoC) from the InitFlow value.

3. Look up slave information (route ID composition of a specific NoC) from the TargFlow value.

4. Look up the source bus, port, and master by using SrcId.BID, SrcId.PID, and SrcId.MID values, respectively.

Example:

CNOC Error: 0x1690D079 Source: 0x05 Destination: 0x29 MID: 0x0F BID: 0x02 PID: 0x03

InitFlow qxm_snoc/l/0

TargetFlow qhs7/T/mss_cfg

Master ID A53 cluster

MID: 0x0F

BID: 0x02 //BIMC

PID: 0x03 //A53 cluster

For the mss_cfg, check from CNOC HDD, its base address is 0xfc800000

qhm0_rpm_M2 [0xfd000000:0xfc800000] qhs7_mss_cfg WR 1 bytes Req 81.3 ns

And the offset is 0x80040 from ERRLOG3.

CNOC ERROR: ERRLOG3 = 0x00080040

So the final address is 0xfc800000 + 0x80040 = 0xFC880040

According to ipcat, this address is MSS_QDSP6SS_NMI:

MSS_QDSP6SS_NMI | 0xFC880040 | Write | No

Check why APPS accessed MSS_QDSP6SS_NMI and caused the NOC error.
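The final address in this example is the slave base address plus the ERRLOG3 offset. The sketch below repeats that calculation from the raw log line; the slave table holds only the mss_cfg base quoted above, and the helper names are hypothetical.

```python
import re

# Sketch: compute the faulting register address from an ERRLOG3 log line.
SLAVE_BASE = {"qhs7_mss_cfg": 0xFC800000}  # base address quoted from the CNOC HDD


def parse_errlog3(line):
    """Return the ERRLOG3 value from a log line, or None if absent."""
    m = re.search(r"ERRLOG3\s*=\s*(0x[0-9a-fA-F]+)", line)
    return int(m.group(1), 16) if m else None


def noc_error_address(slave, errlog3):
    """Return slave base address plus the offset reported in ERRLOG3."""
    return SLAVE_BASE[slave] + errlog3


if __name__ == "__main__":
    offset = parse_errlog3("CNOC ERROR: ERRLOG3 = 0x00080040")
    print(hex(noc_error_address("qhs7_mss_cfg", offset)))
```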

Read 80-NV396-71_NoC Error Debug for MSM8996 User Guide for details.

Possible reasons:

1. RMA devices

2. Un-clocked register access

4.RPM Stability

Type | Type in detail | Log indication | How to debug? | Possible reasons

Bus Fault exception

< RPM log >

145.418005: Clock: gcc_dehr_clk Requested State = Enable. Reference Count = 1

191.852022: rpm_err_fatal (lr: 0x0010c149) (ipsr: 0x00000005)

1. Load RPM dump to T32

2. Run rpm_restore_from_core.cmm

3. Run rpm_m3_unstack.cmm

4. Run rpm_parse_faults.cmm

5. v.f, check call stack

6. d.l PC

7. Check why it tries to access this AHB bus

A bus error occurs when the AHB interface receives an error response from a bus slave, for example because RPM does not have access permission.
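The ipsr value printed by rpm_err_fatal can be read as a Cortex-M exception number, assuming the RPM core is a Cortex-M3 (as the rpm_m3_unstack.cmm script name suggests) and that the logged value is the IPSR register. The sketch below is a minimal decoder under those assumptions; the function name is hypothetical.

```python
# Sketch: name the Cortex-M exception held in an IPSR value.
# Assumption: the "ipsr" field in the RPM log is the Cortex-M IPSR register.

CORTEX_M_EXCEPTIONS = {
    0: "Thread mode (no exception)",
    1: "Reset",
    2: "NMI",
    3: "HardFault",
    4: "MemManage",
    5: "BusFault",
    6: "UsageFault",
    11: "SVCall",
    12: "DebugMonitor",
    14: "PendSV",
    15: "SysTick",
}


def decode_ipsr(ipsr):
    """Return a readable name for the exception number in IPSR[8:0]."""
    number = ipsr & 0x1FF
    if number >= 16:
        return f"external interrupt, exception number {number} (IRQ {number - 16})"
    return CORTEX_M_EXCEPTIONS.get(number, f"reserved exception number {number}")


if __name__ == "__main__":
    for value in (0x05, 0x49, 0x41, 0x00):
        print(hex(value), "->", decode_ipsr(value))
```

With the logs in this section, 0x05 decodes to BusFault, and 0x49 decodes to exception number 73, which matches the "unknown interrupt 73" text in the APPS non-secure WD log.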

APPS Non secure WD

< RPM log >

9.602248: rpm_halt_exit

9.602253: rpm_abort_interrupt_received (APPS_NON_SECURE_WD_BITE) … aborting

9.602256: rpm_error_fatal (lr: 0x0010c7b) (ipsr: 0x00000049) – “unknown interrupt 73”

When a non-secure WD bite occurs within APPS, RPM receives a notification interrupt from HW so that the RPM state can be preserved. This is not an RPM error; check the non-secure WD from the AP side.

Not an error in RPM.

LDO setting timeout

<RPM log >

0x000000002085DA3D: rpm_apply_request (resource type: ldoa) (resource id: 16)

0x000000002085DB3D: rpm_apply_request (resource type: ldoa) (resource id: 16)

0x000000002085DC3D: rpm_apply_request (resource type: ldoa) (resource id: 16)

0x000000002085DD3D: rpm_apply_request (resource type: ldoa) (resource id: 16)

0x000000002085DE3D: rpm_apply_request (resource type: ldoa) (resource id: 16)

0x0000000020867021: START Apply() VS0B

0x000000002086702D: START Post-Dep() VS1039B

0x000000002086703E: rpm_err_fatal (lr: 0x00013007) (ipsr: 0x00000000)

1. Check the RPM log to see who requested the LDO setting and which LDO is being set.

2. V.f in RPM T32. Check whether an error occurred in the RPM call stack.

3. Check with the HW team on such issues.

-000|abort()

-001|pm_rpm_check_vreg_settle_status(

| ?,

| estimated_settling_time_us = 200 = 0xC8,

| pwr_res = 0x00099FFC = DALPROP_StructPtrs_8996_xml[79],

| comm_ptr = 0x00099E64 = ,

| settling_err_en = 1 = 0x1)

| vreg_status = 0 = 0x0

| current_time = 545679961 = 0x20866A59

| settle_end_time = 8718783610880 = 0x000007EE00000000

| return of pm_pwr_is_vreg_ready_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1

| return of pm_rpm_check_battery_status = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1

| return of pm_pwr_status_reg_dump_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1

| return of pm_pwr_is_vreg_ready_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1

---|end of frame

Mostly it is a HW issue, such as:

1. Shorted pin

2. Insufficient headroom, etc.

RPM WD bark

< RPM log >

43862.139223: rpm_process_request (master: “APPS”) (resource type: clk2) (id: 0) (full name: bimc)



43862.139406: Clock: gcc_ddr_dim_cfg_clk Requested State = Enable. Reference Count = 1

43862.170521: rpm_err_fatal (lr: 0xfffffff9) (ipsr: 0x00000041)

1. Load RPM dump to T32

2. Run rpm_restore_from_core.cmm

3. v.f, check call stack

4. Normally an RPM WD bark happens while RPM is waiting for HW to clear some bit that never gets cleared.

E.g.:

Waiting for GCC_BIMC_DDR_CPLL_CMD_RCGR bit0 to clear when APPS requested a new BIMC frequency, from 547.2MHz to 777.6MHz.

Mostly it is a HW issue. Check it together with the HW team.

VDD MIN

System can’t enter VDD_MIN

1. Need to triage what is preventing VDD_MIN

Conditions that must be met to enter VDD_MIN

CXO = off

VDD_DIG = Retention level

VDD_MEM = Retention level

2. Check npa-dump.txt log

npa_client (name: APSS) (handle: 0x198728) (resource: 0x198548) (type: NPA_CLIENT_REQUIRED) (request: 1)

3. Check railway.txt

Follow Railway.rail_state.voter_list_head to walk through the voters.
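For step 2, the npa-dump.txt client lines can be filtered for clients that still hold a request. The sketch below parses lines in the format shown above; treating any non-zero request as an active vote is an assumption of this example, and the function name is hypothetical.

```python
import re

# Sketch: list NPA clients whose request value is non-zero.
CLIENT_LINE = re.compile(
    r"\(name:\s*([^)]+)\).*?\(type:\s*([^)]+)\).*?\(request:\s*(\d+)\)"
)


def active_clients(lines):
    """Return (name, type, request) for clients with a non-zero request."""
    found = []
    for line in lines:
        m = CLIENT_LINE.search(line)
        if m and int(m.group(3)) != 0:  # assumption: non-zero means an active vote
            found.append((m.group(1).strip(), m.group(2).strip(), int(m.group(3))))
    return found


if __name__ == "__main__":
    dump = [
        "npa_client (name: APSS) (handle: 0x198728) (resource: 0x198548) "
        "(type: NPA_CLIENT_REQUIRED) (request: 1)",
    ]
    print(active_clients(dump))
```

Each client returned here is a candidate for what keeps CXO on and therefore blocks VDD_MIN.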

Mostly, SW reasons prevent RPM from entering VDD_MIN.

5.ADSP Stability

Handled by the Audio team. Submit cases to the Qualcomm Audio ADSP team.

6.Abnormal Reset

Type | Type in detail | Log indication | How to debug? | Possible reasons

Unknown reset

In QCAP log, there is no error in APPS/Modem/TZ/RPM.

Read GCC_RESET_STATUS register from RST_STAT.BIN(0x8600760):

Bit  Field name

5    MSM_TSENSE_RESET_STATUS

4    PROC_HALT_CTI_STATUS

3    SRST_RESET_STATUS

2    MSM_TSENSE0_RESET_STATUS

1    PMIC_RESIN_RESET_STATUS

0    SECURE_WDOG_EXPIRE_RESET_STATUS

1. Read GCC_RESET_STATUS. If it is 0, continue with the PMIC registers:

2. Read PMIC warm reset reason 1 and 2. If either is not 0, there was a PMIC warm reset.

3. If both are 0, there was no PMIC warm reset; check the PON_REASON and POFF_REASON1 / POFF_REASON2 registers.
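The bit table and the three triage steps above can be sketched as a small helper. The bit names are copied from the table in this section; the function names and the returned strings are illustrative assumptions.

```python
# Sketch: decode GCC_RESET_STATUS and follow the triage steps from this note.
GCC_RESET_STATUS_BITS = {
    5: "MSM_TSENSE_RESET_STATUS",
    4: "PROC_HALT_CTI_STATUS",
    3: "SRST_RESET_STATUS",
    2: "MSM_TSENSE0_RESET_STATUS",
    1: "PMIC_RESIN_RESET_STATUS",
    0: "SECURE_WDOG_EXPIRE_RESET_STATUS",
}


def decode_gcc_reset_status(value):
    """Return the names of all bits set in GCC_RESET_STATUS, lowest bit first."""
    return [name for bit, name in sorted(GCC_RESET_STATUS_BITS.items())
            if (value >> bit) & 1]


def next_step(gcc_reset_status, warm_reset_reason1, warm_reset_reason2):
    """Suggest what to look at next, following steps 1 to 3 above."""
    if gcc_reset_status != 0:
        return "GCC_RESET_STATUS set: " + ", ".join(
            decode_gcc_reset_status(gcc_reset_status))
    if warm_reset_reason1 or warm_reset_reason2:
        return "PMIC warm reset"
    return "no PMIC warm reset: check PON_REASON and POFF_REASON1/POFF_REASON2"


if __name__ == "__main__":
    print(decode_gcc_reset_status(0x23))
    print(next_step(0x00, 0x21, 0x00))
```

The values 0x03 and 0x23 quoted in the Secure WD section both have bit 0 (SECURE_WDOG_EXPIRE_RESET_STATUS) set, which is why they point to a secure watchdog expiry.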

It mostly happens in the following tests:

Throw, drop, and roll tests

ESD

Possible reasons:

HW PDN

SW bug

Bad sample

Thermal
