Resolving/Debugging user space crashes and segmentation faults
2013-05-19 16:40
By Hai Shalom
Original article: http://www.rt-embedded.com/blog/archives/resolving-crashes-and-segmentation-faults/
==============================================================================================================================
March 29th, 2010 | Tags: crash, debug, embedded, fault, find, fix, Linux, pcd, real, resolve, rt, segmentation, time
Category: Debugging, Linux, Programming
What is a segmentation fault?
A segmentation fault is the most common error condition. It occurs when your program tries to access either an invalid memory location or a memory location it is not allowed to access. A few examples:
- Dereferencing a NULL pointer.
- Dereferencing an uninitialized pointer.
- Accessing memory with a wrong alignment.
- Writing to a read-only area.
- Writing or reading beyond program-allocated resources (buffer overflow).
- Memory corruption/overrun.
For our examples, let’s use the following function that simply writes the character ‘R’ to a given location provided by ptr:
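```c
void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
}
```
All the following code examples will cause the program to terminate and the message “Segmentation Fault” to appear on the console.
Example 1: Dereference a NULL pointer:
```c
int main( int argc , char *argv[] )
{
    rte_test_ptr(NULL);
    return 0;
}
```
Example 2: Write to a read-only location:
```c
char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    rte_test_ptr(ro_ptr);
    return 0;
}
```
Example 3: Dereference an uninitialized pointer:
```c
int main( int argc , char *argv[] )
{
    char *uninit_ptr;

    rte_test_ptr(uninit_ptr);
    return 0;
}
```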
What other crash types do we usually see?
Another common error is the alignment trap (also referred to as a bus error). A bus error occurs when the CPU tries to access a 32-bit or 16-bit variable at an unaligned memory address. See the article about alignment traps for more details.
Another type of crash is caused by an illegal instruction. Normally, this cannot happen because the compiler generates only legal instructions. However, if your program uses callbacks, it can happen when the program tries to jump to an uninitialized callback address.
The last type of error covered in this article is the floating point exception. If your program performs an illegal arithmetic operation (such as division by zero), the program will terminate and the message “Floating point exception” will appear on the console. Although the words “Floating point” are used, this type of error also covers errors caused by arithmetic operations on integers.
What happens when the program performs an illegal operation?
When one of the above happens, the kernel sends a fault signal (or an exception) to the program. A fault signal is a special signal that tells the program that something bad has happened and that it needs to be terminated. There are four fault signals in the system:
- SIGSEGV: In case of a segmentation fault.
- SIGBUS: In case of an alignment trap.
- SIGILL: In case of an illegal instruction.
- SIGFPE: In case of an illegal arithmetic operation.
Each signal needs to be handled and acknowledged. There are also the signals SIGQUIT and SIGINT, which are not caused by an error but must also be handled because the program needs to be terminated. If the program did not install such handlers, the default handlers for the fault signals just write a short error message, such as “Segmentation fault”, and then terminate the program. Unfortunately, this is not enough information to find and fix the bug, and the system may become unstable or unusable once the program has been terminated. Furthermore, this short error message might get lost among the many other messages which appear on the console while the system is running. Note that this behavior is usually unacceptable in a project which is meant for mass production (the user must manually power cycle the unit in this case).
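The default behavior described above can be observed from a parent process. The following is a minimal sketch, assuming a Linux/POSIX system; the helper names rte_crash_null and rte_run_and_get_signal are hypothetical and are not part of the original article. It runs the crashing code in a child process and reports which signal terminated it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same test function as in the article: writes 'R' through ptr. */
void rte_test_ptr(char *ptr)
{
    *ptr = 'R';
}

/* Crash on purpose. The pointer is read from a volatile variable so the
 * compiler cannot see that it is NULL and optimize the store away. */
void rte_crash_null(void)
{
    char *volatile bad_ptr = NULL;

    rte_test_ptr(bad_ptr);
}

/* Hypothetical helper: run fn() in a child process and return the number of
 * the signal that terminated it, 0 if it exited normally, -1 on error. */
int rte_run_and_get_signal(void (*fn)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0); /* Only reached if fn() did not crash. */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling rte_run_and_get_signal(rte_crash_null) is expected to return SIGSEGV when no handler is installed, which is all a supervising process learns about the crash by default.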
Enabling core dump
During debugging, it is possible to enable core dumps, which can be parsed offline on a host machine using gdb. The core dump contains useful information about the last known whereabouts of the program. Further information is available in the Enabling core dumps post.
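One common way for a program to allow core dumps for itself is to raise its core file size limit. The sketch below assumes a POSIX system; the helper name rte_enable_core_dumps is hypothetical, and the referenced post may describe a different method. Whether a core file is actually written also depends on the kernel configuration and the target file system.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int rte_enable_core_dumps(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return -1;
    /* A process may raise its soft limit up to its hard limit. */
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit);
}
```

The function would typically be called once, early in main( ), before any code that might crash.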
Handling fault signals, and extracting information from them
In every system, it is crucial to install exception handlers for these fault signals, because you can never know when a program might crash (trust me; it usually happens on the customer’s premises or during some qualification tests). The PCD provides an easy and convenient way to register its exception handlers, which provide a lot of useful debug information (see the PCD section later in this article for more details). If you want to write your own exception handler, you must consider the following issues:
- The exception can occur at any time; therefore, the handler must be written carefully.
- Many ANSI C functions, including printf( ), are not signal-safe; therefore, your signal handler must use only signal-safe functions. POSIX defines the list of signal-safe functions.
- Don’t try to call your main( ) function from your signal handler. It may appear that you’ve revived your program, but the consequences are unclear, because your program’s stack or heap could have been corrupted.
- Your exception handler must terminate the program once it has completed its work. Prefer _exit( ) over exit( ) for this: exit( ) runs atexit handlers and flushes stdio buffers, so it is not signal-safe, while _exit( ) is.
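The rules above can be sketched as a minimal handler that restricts itself to write( ) and _exit( ), both of which are on the POSIX list of async-signal-safe functions. The names rte_fault_handler, rte_install_fault_handlers and RTE_CRASH_EXIT_CODE are hypothetical, chosen only for this illustration; this is not the PCD handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define RTE_CRASH_EXIT_CODE 70 /* Arbitrary value chosen for this sketch. */

/* Enhanced (SA_SIGINFO) handler that only uses async-signal-safe calls. */
void rte_fault_handler(int sig, siginfo_t *info, void *context)
{
    static const char msg[] = "rte: fault signal caught, terminating\n";

    (void)sig;
    (void)info;
    (void)context;
    if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
        /* Nothing useful can be done about a failed write here. */
    }
    _exit(RTE_CRASH_EXIT_CODE); /* Never return to the faulting instruction. */
}

/* Install the handler for the four fault signals. Returns 0 on success. */
int rte_install_fault_handlers(void)
{
    static const int fault_signals[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = rte_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof(fault_signals) / sizeof(fault_signals[0]); i++) {
        if (sigaction(fault_signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}
```

A program would call rte_install_fault_handlers( ) at the start of main( ). The handler terminates the process instead of returning, because returning would re-execute the faulting instruction.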
The next step is to write the exception handler. There are two types of exception handlers you can use: a standard exception handler, which receives only the signal number, or an enhanced exception handler, which also receives more information that may differ between architectures: a pointer to a siginfo_t and a pointer to a ucontext_t (cast to void *). Let’s take a look at the sigaction structure, which is defined in signal.h:
```c
/* Structure describing the action to be taken when a signal arrives. */
struct sigaction
{
    /* Signal handler. */
    union
    {
        /* Used if SA_SIGINFO is not set. */
        /* Type of a signal handler. */
        void (*sa_handler) (int);
        /* Used if SA_SIGINFO is set. */
        void (*sa_sigaction) (int, siginfo_t *, void *);
    } __sigaction_handler;

    /* Additional set of signals to be blocked. */
    __sigset_t sa_mask;

    /* Special flags. */
    int sa_flags;

    /* Restore handler. */
    void (*sa_restorer) (void);
};
```
If we want a simple handler, we specify our handler in the sa_handler field. If we want the enhanced handler, we set the SA_SIGINFO flag in sa_flags and specify our handler in sa_sigaction. The following macros can be used for registering exception handlers:
```c
/* Both macros need <signal.h> and <string.h>. */
#define SETSIG(sa, sig, func) \
    { memset( &sa, 0, sizeof( struct sigaction ) ); \
      sa.sa_handler = func; \
      sa.sa_flags = SA_RESTART; \
      sigaction(sig, &sa, NULL); \
    }

#define SETSIGINFO(sa, sig, func) \
    { memset( &sa, 0, sizeof( struct sigaction ) ); \
      sa.sa_sigaction = func; \
      sa.sa_flags = SA_RESTART | SA_SIGINFO; \
      sigaction(sig, &sa, NULL); \
    }
```
The first macro is for the simple handler and the second macro is for the enhanced handler. The macros use the sigaction( ) function.
Once your handler has been activated, it means that an illegal operation has occurred. If you enabled the enhanced handler, you can read the siginfo_t structure, which contains information about the crash. This structure is large and contains a number of unions. I would like to mention only the most relevant members:
```c
typedef struct siginfo
{
    int si_signo;   /* Signal number. */
    int si_errno;   /* If non-zero, an errno value associated with
                       this signal, as defined in <errno.h>. */
    int si_code;    /* Signal code. */

    /* SIGILL, SIGFPE, SIGSEGV, SIGBUS. */
    struct
    {
        void *si_addr;  /* Faulting insn/memory ref. */
    } _sigfault;
} siginfo_t;
```
This structure provides the signal number and its code (see the siginfo.h header for a textual description of each signal code), the last known errno, and, in the si_addr pointer, the address which caused the error condition. In this case, this is the address that the CPU was trying to access and not the address of the instruction (which is held by the Program Counter). The ucontext_t structure contains the core register sets and their last known values (such as the Program Counter). It varies between architectures.
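Since printf( ) is not signal-safe, a handler that wants to report si_addr has to format the number itself. The following is a minimal sketch under that assumption; rte_format_hex and rte_report_fault are hypothetical names used only for illustration. The ucontext_t argument is deliberately left unused because its register layout is architecture specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: format value as "0x..." hexadecimal into buf without
 * calling printf(). buf must hold at least 2 + 2 * sizeof(uintptr_t) bytes.
 * Returns the number of characters written (no terminating NUL is added). */
size_t rte_format_hex(uintptr_t value, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof(uintptr_t)];
    size_t n = 0;
    size_t len = 0;

    do { /* Collect digits, least significant first. */
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0) /* Emit them most significant first. */
        buf[len++] = tmp[--n];
    return len;
}

/* Enhanced handler: report the fault address from si_addr, then terminate. */
void rte_report_fault(int sig, siginfo_t *info, void *context)
{
    static const char prefix[] = "Fault address: ";
    char line[2 + 2 * sizeof(uintptr_t) + 1];
    size_t len;

    (void)sig;
    (void)context; /* ucontext_t: register layout is architecture specific. */
    len = rte_format_hex((uintptr_t)info->si_addr, line);
    line[len++] = '\n';
    if (write(STDERR_FILENO, prefix, sizeof(prefix) - 1) < 0 ||
        write(STDERR_FILENO, line, len) < 0) {
        /* Nothing useful can be done about a failed write here. */
    }
    _exit(1);
}
```

rte_report_fault has the signature expected by the sa_sigaction field, so it can be registered with the SA_SIGINFO flag set. The same formatting helper could be reused for register values taken from the ucontext_t structure.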
What do we do with this information?
We can now understand the nature and root cause of the error, as well as the fault address that caused it. From the ucontext_t structure, we can extract the Program Counter (the last known execution address) and the Link Register (the return address on the ARM architecture).
Theoretically speaking, we could print this information to the console using the printf( ) function. However, as noted earlier, this function is not signal-safe and therefore must not be used, although I have seen some signal handlers that do use printf to print it. So how can we do it right? The solution is to have a crash daemon which listens on a socket. The signal handler sends this information to the socket, and the daemon prints it for us. The socket API is signal-safe and can be used without a problem. There is an easier way to do it, and I’ll present it next.
Let’s say for now that we extracted the value of the Program Counter from the ucontext_t structure, and that the CPU was executing the instruction at address 0x8548. Assuming we compiled our program with debug symbols, we can use the objdump utility and ask it to show mixed assembly and C code (using the -S option) around this address. Here’s a snippet from the output:
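```
# armeb-linux-uclibceabi-objdump -S segv
...
00008544 <rte_test_ptr>:
#include <stdio.h>
#include <pcdapi.h>

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
    8544:   e3a03052    mov     r3, #82 ; 0x52
    8548:   e5c03000    strb    r3, [r0]
}
    854c:   e12fff1e    bx      lr
...
```
The strb instruction at address 8548 caused the Segmentation Fault crash. An object file with debug symbols is presented in mixed mode, C and assembly. If the object was compiled without symbols, the only information we can extract near the address is the function name.
It is also possible to extract the file name and line number from the address using the addr2line utility. Here’s an example:
```
# armeb-linux-uclibceabi-addr2line -e segv -f 8548
rte_test_ptr
/home/hai/rte/segv.c:7
```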
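There are cases where the faulting instruction is not inside the program’s code, but inside a shared library’s code. Shared library code is mapped at much higher addresses than the program code. We could use the maps file in the proc filesystem to determine where the last command came from. However, in order to print it, we’ll have to reboot the system, because once the program has crashed, its proc entry is already gone. After we have rebooted and figured out the new PID, we print its maps file with the command “cat /proc/<PID>/maps”. Here’s an example output: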
```
# cat /proc/204/maps
00008000-00009000 r-xp 00000000 1f:07 59    /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:07 59    /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 231   /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 231   /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 231   /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175   /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175   /lib/libticc.so
0402c000-04067000 r-xp 00000000 1f:06 200   /lib/libuClibc-0.9.29.so
04067000-0406e000 ---p 04067000 00:00 0
0406e000-0406f000 r--p 0003a000 1f:06 200   /lib/libuClibc-0.9.29.so
0406f000-04070000 rw-p 0003b000 1f:06 200   /lib/libuClibc-0.9.29.so
04070000-04075000 rw-p 04070000 00:00 0
04075000-0407f000 r-xp 00000000 1f:06 137   /lib/libgcc_s.so.1
0407f000-04086000 ---p 0407f000 00:00 0
04086000-04087000 rw-p 00009000 1f:06 137   /lib/libgcc_s.so.1
0ece0000-0ecf5000 rwxp 0ece0000 00:00 0     [stack]
```
Look for the “x” symbol in the permission column to find the code sections. In this example, the Program Counter could theoretically reside in the ranges 0x8000-0x9000, 0x4000000-0x4005000, 0x400e000-0x4023000, 0x402c000-0x4067000, and 0x4075000-0x407f000. Matching the Program Counter and Link Register to one of these ranges identifies the faulty code section. We’ll see an example later.
In this example, we can see what the last executed command was, but not the cause of the problem. The signal info structure also contains the signal code, and together with the registers it could help us figure out the root cause. How can we extract this information easily? Well, just continue reading.
How can PCD help with debugging, resolving and preventing crashes?
When a program crashes due to an exception, it is terminated once the error message is displayed. This crash will not trigger any recovery action, and your system will probably become unstable or unusable. The PCD can help here in two fields: enhanced debugging capabilities and system recovery.
Once a program registers to the PCD exception handlers, they provide more information about the crash, including the Program Counter, the Link Register (return address), all the other registers, the last value of errno and the maps file of the process, captured right before it was terminated, without the need to reboot the system. The maps file helps you analyze the location of the error just by looking at the PCD’s error report. The PCD also triggers a recovery action once the crash is detected, and returns the system to functional mode. The crash information is also saved on non-volatile storage for later, offline analysis. Let’s take the piece of code from example 2 and instrument it with the PCD exception handlers (see how easily it is done):
```c
#include <stdio.h>
#include <pcdapi.h>

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
}

char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();

    /* Crash test */
    rte_test_ptr(ro_ptr);
    printf(ro_ptr);

    return 0;
}
```
Let’s configure a simple PCD rule to start and monitor a program:
```
RULE = TEST_SIGSEGV
START_COND = NONE
COMMAND = /usr/sbin/segv
SCHED = NICE,0
DAEMON = YES
END_COND = NONE
END_COND_TIMEOUT = -1
FAILURE_ACTION = REBOOT
ACTIVE = YES
```
Here is the output on the console once PCD has started this rule. Note that the selected recovery action here was “Reboot”, which is why the system reboots right after the crash. Pay attention to the pc and lr values in the register dump:
```
pcd: Starting process /usr/sbin/segv (Rule TEST_SIGSEGV).
pcd: Rule TEST_SIGSEGV: Success (Process /usr/sbin/segv (204)).

**************************************************************************
**************************** Exception Caught ****************************
**************************************************************************

Signal information:

Time: Thu Jan 1 00:00:12 1970
Process name: /usr/sbin/segv
PID: 204
Fault Address: 0x00008590
Signal: Segmentation fault
Signal Code: Invalid permissions for mapped object
Last error: Success (0)
Last error (by signal): 0

ARM registers:

trap_no=0x0000000e
error_code=0x0000081f
oldmask=0x00000000
r0=0x00008590
r1=0x0ecf4ba4
r2=0x00000000
r3=0x00000052
r4=0x00010690
r5=0x00000000
r6=0x0000846c
r7=0x00008418
r8=0x00000000
r9=0x00000000
r10=0x00000000
fp=0x00000000
ip=0x00000000
sp=0x0ecf4cf0
lr=0x0000856c
pc=0x00008548
cpsr=0x40000010
fault_address=0x00008590

Maps file:

00008000-00009000 r-xp 00000000 1f:07 59    /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:07 59    /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 231   /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 231   /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 231   /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175   /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175   /lib/libticc.so
0402c000-04067000 r-xp 00000000 1f:06 200   /lib/libuClibc-0.9.29.so
04067000-0406e000 ---p 04067000 00:00 0
0406e000-0406f000 r--p 0003a000 1f:06 200   /lib/libuClibc-0.9.29.so
0406f000-04070000 rw-p 0003b000 1f:06 200   /lib/libuClibc-0.9.29.so
04070000-04075000 rw-p 04070000 00:00 0
04075000-0407f000 r-xp 00000000 1f:06 137   /lib/libgcc_s.so.1
0407f000-04086000 ---p 0407f000 00:00 0
04086000-04087000 rw-p 00009000 1f:06 137   /lib/libgcc_s.so.1
0ece0000-0ecf5000 rwxp 0ece0000 00:00 0     [stack]

**************************************************************************
pcd: Error: Process /usr/sbin/segv (204) exited unexpectedly (Rule TEST_SIGSEGV).
pcd: Terminating PCD, rebooting system...
starting pid 205, tty '': '/bin/umount /var /sys'
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Restarting system.
```
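As we can see, the details of this crash provided by the PCD can help you find and resolve this issue easily and quickly. Let’s extract the file name and line number from the pc address using the addr2line utility mentioned earlier:
```
# armeb-linux-uclibceabi-addr2line -e segv -f 8548
rte_test_ptr
/home/hai/rte/segv.c:7
```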
Now let’s get back to the objdump output for a more detailed view:
```
00008544 <rte_test_ptr>:

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
    8544:   e3a03052    mov     r3, #82 ; 0x52
    8548:   e5c03000    strb    r3, [r0]
}
    854c:   e12fff1e    bx      lr

00008550 <main>:

char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    8550:   e92d4010    push    {r4, lr}
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();

    /* Crash test */
    rte_test_ptr(ro_ptr);
    8554:   e59f4020    ldr     r4, [pc, #32]   ; 857c <main+0x2c>
char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();
    8558:   e5910000    ldr     r0, [r1]
    855c:   e3a01000    mov     r1, #0  ; 0x0
    8560:   ebffffb8    bl      8448 <_init+0x30>

    /* Crash test */
    rte_test_ptr(ro_ptr);
    8564:   e5940000    ldr     r0, [r4]
    8568:   ebfffff5    bl      8544 <rte_test_ptr>
    printf(ro_ptr);
    856c:   e5940000    ldr     r0, [r4]
    8570:   ebffffb1    bl      843c <_init+0x24>

    return 0;
}
    8574:   e3a00000    mov     r0, #0  ; 0x0
    8578:   e8bd8010    pop     {r4, pc}
    857c:   00010690    .word   0x00010690
```
The strb instruction at 8548 (the pc value) is the last instruction that the CPU performed before the crash, and the return address held in the Link Register (856c) is the instruction right after the call to rte_test_ptr in main. In many cases, a function is called from various locations; the Link Register value tells us which specific call site was used, and is therefore also important. From the code, we can see that r4 is loaded with the word stored at address 0x857c (0x00010690), and that the value found at that address is passed to our rte_test_ptr function. We can use objdump to look up the variable’s name by grepping for the value of the word:
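```
# armeb-linux-uclibceabi-objdump -x segv | grep 10690
00010690 g     O .data  00000004              ro_ptr
```
Now we also know the variable name, although this was possible only because the variable is global and not on the stack.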
Suppose the faulty function is inside a shared library and not in our program. How can we debug this?
Let’s repeat the example, and place the rte_test_ptr inside a shared library. After we compile and link, we run the program again. We’ll reexamine the crash log:
```
**************************************************************************
**************************** Exception Caught ****************************
**************************************************************************

Signal information:

Time: Thu Jan 1 00:01:11 1970
Process name: /usr/sbin/segv
PID: 206
Fault Address: 0x000085f4
Signal: Segmentation fault
Signal Code: Invalid permissions for mapped object
Last error: Success (0)
Last error (by signal): 0

ARM registers:

trap_no=0x0000000e
error_code=0x0000081f
oldmask=0x00000000
r0=0x000085f4
r1=0x0eb72b74
r2=0x00000000
r3=0x00000052
r4=0x04078000
r5=0x00000000
r6=0x000084ac
r7=0x0000844c
r8=0x00000000
r9=0x00000000
r10=0x00000000
fp=0x0eb72cd4
ip=0x0402c4a0
sp=0x0eb72cc0
lr=0x000085c0
pc=0x0402c4a4
cpsr=0x00000010
fault_address=0x000085f4

Maps file:

00008000-00009000 r-xp 00000000 1f:06 315   /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:06 315   /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 232   /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 232   /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 232   /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175   /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175   /lib/libticc.so
0402c000-0402d000 r-xp 00000000 1f:06 212   /lib/libsegv.so
0402d000-04034000 ---p 0402d000 00:00 0
04034000-04035000 rw-p 00000000 1f:06 212   /lib/libsegv.so
04035000-04070000 r-xp 00000000 1f:06 200   /lib/libuClibc-0.9.29.so
04070000-04077000 ---p 04070000 00:00 0
04077000-04078000 r--p 0003a000 1f:06 200   /lib/libuClibc-0.9.29.so
04078000-04079000 rw-p 0003b000 1f:06 200   /lib/libuClibc-0.9.29.so
04079000-0407e000 rw-p 04079000 00:00 0
0407e000-04088000 r-xp 00000000 1f:06 137   /lib/libgcc_s.so.1
04088000-0408f000 ---p 04088000 00:00 0
0408f000-04090000 rw-p 00009000 1f:06 137   /lib/libgcc_s.so.1
0eb5e000-0eb73000 rwxp 0eb5e000 00:00 0     [stack]

**************************************************************************
pcd: Error: Process /usr/sbin/segv (206) exited unexpectedly (Rule TEST_SIGSEGV).
pcd: Terminating PCD, rebooting system...
starting pid 207, tty '': '/bin/umount -l /nvram /var /sys'
The system is going down NOW!
Sent SIGTERM to all processes
Requesting system reboot
Restarting system.
```
Now we can see that, unlike the previous example, the Program Counter is at a high address (0x0402c4a4 according to the log). This means that it resides outside the program’s code, somewhere in one of the linked shared libraries. As we already saw, we can determine which library executed the bad instruction by matching the PC value to the executable address ranges in the maps file. In this example, the address is in the executable range of /lib/libsegv.so, which is the library made for the purpose of this example.
To find the problem inside the library, we need to calculate the offset by subtracting the library’s base address from the PC value. In this example: 0x0402c4a4 - 0x0402c000 = 0x4a4, and this is the address of the problematic code inside the library. Let’s use the objdump utility again, but now we’ll specify the library name instead of the program’s name. We can truncate the output with the grep utility, matching the address we calculated plus a few lines before and after the match (using the -A and -B options):
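```
# armeb-linux-uclibceabi-objdump -S libsegv.so | grep 4a4 -B 4 -A 4
000004a0 <rte_test_ptr>:

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
 4a0:   e3a03052    mov     r3, #82 ; 0x52
 4a4:   e5c03000    strb    r3, [r0]
}
 4a8:   e12fff1e    bx      lr
```
At offset 4a4 we see the strb instruction that caused the crash, as expected, inside the rte_test_ptr function we moved to a shared library.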
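The range matching and the offset calculation used in this section can also be automated. Below is a minimal sketch, assuming the maps text has already been captured into a string; the helper name rte_find_code_mapping is hypothetical. It also assumes, as in the article’s example, that the executable mapping starts at file offset 0, so that subtracting the mapping’s start address gives the address to look up in the library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan the text of a /proc/<PID>/maps file for the
 * executable mapping that contains addr. On success, store the mapping's
 * start address in *base and its path in path, and return 0.
 * Return -1 if no executable mapping contains addr. */
int rte_find_code_mapping(const char *maps, unsigned long addr,
                          unsigned long *base, char *path, size_t path_len)
{
    while (*maps != '\0') {
        char line[512];
        char perms[5];
        char name[256];
        unsigned long start, end;
        size_t len = strcspn(maps, "\n");
        size_t copy = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
        int fields;

        /* Work on a private copy of one line so sscanf cannot run into
         * the following line. */
        memcpy(line, maps, copy);
        line[copy] = '\0';
        name[0] = '\0';

        /* Fields: start-end perms offset dev inode [path] */
        fields = sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
                        &start, &end, perms, name);
        if (fields >= 3 && perms[2] == 'x' && addr >= start && addr < end) {
            *base = start;
            snprintf(path, path_len, "%s", name);
            return 0;
        }

        maps += len;
        if (*maps == '\n')
            maps++; /* Move to the start of the next line. */
    }
    return -1;
}
```

With the maps file from the crash log above and the pc value 0x0402c4a4, the helper is expected to report /lib/libsegv.so with base 0x0402c000, and pc minus base gives the 0x4a4 used in the objdump command.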
In conclusion, we now know how to:
- Extract and understand the crash information provided to a fault signal handler.
- Find a bad instruction inside our program and inside a shared library.
Now we have all the required information to fix this crash. It can be done manually, and it can be done using the PCD.
Memory corruption
Crashes and segmentation faults may also be the result of memory corruption. Some code may unintentionally modify a memory region which it does not own, causing a mess for the rightful owner. Read here how to debug such errors.
Resources:
http://linux.die.net/man/2/signal
http://linux.die.net/man/2/sigaction
http://sourceforge.net/projects/pcd/
http://www.rt-embedded.com/blog/archives/enabling-core-dumps-in-embedded-systems/