5.5.24,一个怪异的死锁问题导致DB无法连接
2016-11-10 12:26
429 查看
故障现象:
mysql 5.5.24,突然所有新连接卡住,db无法连接,无任何报错故障原因分析:
找到对应的mysqld进程号,pstack $pid > /tmp/jj.sqlThread 28 (Thread 0x7fc08bb32700 (LWP 3402)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 27 (Thread 0x7fc048e6b700 (LWP 3403)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 26 (Thread 0x7fc04846a700 (LWP 3404)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 25 (Thread 0x7fc047a69700 (LWP 3405)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7fc047068700 (LWP 3406)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 23 (Thread 0x7fc046667700 (LWP 3407)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x7fc045c66700 (LWP 3408)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x7fc045265700 (LWP 3409)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x7fc044864700 (LWP 3410)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7fc043e63700 (LWP 3411)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7fc043462700 (LWP 3412)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7fc042a61700 (LWP 3413)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7fc042060700 (LWP 3414)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7fc04165f700 (LWP 3415)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7fc040c5e700 (LWP 3416)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7fc04025d700 (LWP 3417)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7fc03f85c700 (LWP 3418)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7fc03ee5b700 (LWP 3419)):
#0 0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1 0x0000000000870ad7 in os_aio_linux_handle ()
#2 0x0000000000835086 in fil_aio_wait ()
#3 0x00000000007d11a8 in io_handler_thread ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7fc03d450700 (LWP 3421)):
#0 0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000873de3 in os_event_wait_time_low ()
#2 0x00000000007cd900 in srv_lock_timeout_thread ()
#3 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7fc03ca4f700 (LWP 3422)):
#0 0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000873de3 in os_event_wait_time_low ()
#2 0x00000000007cd51b in srv_error_monitor_thread ()
#3 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7fc03c04e700 (LWP 3423)):
#0 0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000873de3 in os_event_wait_time_low ()
#2 0x00000000007d053b in srv_monitor_thread ()
#3 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fc03e45a700 (LWP 3424)):
#0 0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000873f26 in os_event_wait_low ()
#2 0x00000000007ce43e in srv_master_thread ()
#3 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fc088630700 (LWP 3444)):
#0 0x00007fc08c9b34b5 in sigwait () from /lib64/libpthread.so.0
#1 0x00000000005009eb in signal_hand ()
#2 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fc0885ef700 (LWP 3445)):
#0 0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000737573 in Event_queue::cond_wait(THD*, timespec*, char const*, char const*, unsigned int) ()
#2 0x00000000007375fb in Event_queue::get_top_for_execution_if_time(THD*, Event_queue_element_for_exec**) ()
#3 0x00000000007366d7 in Event_scheduler::run(THD*) ()
#4 0x000000000073681b in event_scheduler_thread ()
#5 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fc0885ae700 (LWP 3467)):
#0 0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000057bb6a in log_in_use(char const*) ()
#4 0x00000000006ce273 in MYSQL_BIN_LOG::purge_logs_before_date(long) ()
#5 0x00000000006d26f8 in MYSQL_BIN_LOG::rotate_and_purge(bool) ()
#6 0x000000000061243e in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()
#7 0x0000000000567359 in mysql_execute_command(THD*) ()
#8 0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#9 0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#10 0x00000000005f6386 in do_handle_one_connection(THD*) ()
#11 0x00000000005f63ea in handle_one_connection ()
#12 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fc08856d700 (LWP 5461)):
#0 0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fc08c9ad508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00007fc08c9ad3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000006d2eef in MYSQL_BIN_LOG::reset_logs(THD*) ()
#4 0x000000000057a0ac in reset_master(THD*) ()
#5 0x00000000006127c8 in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()
#6 0x0000000000567359 in mysql_execute_command(THD*) ()
#7 0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#8 0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#9 0x00000000005f6386 in do_handle_one_connection(THD*) ()
#10 0x00000000005f63ea in handle_one_connection ()
#11 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fc08852c700 (LWP 5528)):
#0 0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005016cf in one_thread_per_connection_end(THD*, bool) ()
#2 0x00000000005f62dd in do_handle_one_connection(THD*) ()
#3 0x00000000005f63ea in handle_one_connection ()
#4 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fc08cdce720 (LWP 3400)):
#0 0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000547ae2 in THD::THD() ()
#4 0x0000000000501985 in handle_connections_sockets() ()
#5 0x000000000050612f in mysqld_main(int, char**) ()
#6 0x00007fc08bb51d5d in __libc_start_main () from /lib64/libc.so.6
#7 0x00000000004fd9dd in _start ()
从堆栈信息看出,1号主线程处于锁等待,导致新连接无法连上,#0 0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
。3号线程执行了reset master的操作,4号线程看起来像是一个flush logs的动作,通过观察binlog看到故障时间点正好是一个binlog达到上限后切换的时间点,初步推测是flush logs和reset master操作之间存在死锁问题,通过查看源代码来验证这个观点。
信息从pstack看Thread 4先调用MYSQL_BIN_LOG::purge_logs_before_date,加锁语句为mysql_mutex_lock(&LOCK_index),完成后并无释放;后执行log_in_use,加锁语句为mysql_mutex_lock(&LOCK_thread_count);
也就是说Thread 4的加锁顺序是先LOCK_index,后LOCK_thread_count
以下是相关的函数代码
int MYSQL_BIN_LOG::purge_logs_before_date(time_t purge_time)
{
int error;
char to_log[FN_REFLEN];
LOG_INFO log_info;
MY_STAT stat_area;
THD *thd= current_thd;
DBUG_ENTER("purge_logs_before_date");
mysql_mutex_lock(&LOCK_index);
to_log[0]= 0;
if ((error=find_log_pos(&log_info, NullS, 0 /*no mutex*/)))
goto err;
while (strcmp(log_file_name, log_info.log_file_name) &&
!is_active(log_info.log_file_name) &&
!log_in_use(log_info.log_file_name))
{
if (!mysql_file_stat(m_key_file_log,
log_info.log_file_name, &stat_area, MYF(0)))
{
if (my_errno == ENOENT)
{
/*
It's not fatal if we can't stat a log file that does not exist.
*/
my_errno= 0;
}
else
{
/*
Other than ENOENT are fatal
*/
if (thd)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_WARN,
ER_BINLOG_PURGE_FATAL_ERR,
"a problem with getting info on being purged %s; "
"consider examining correspondence "
"of your binlog index file "
"to the actual binlog files",
log_info.log_file_name);
}
else
{
sql_print_information("Failed to delete log file '%s'",
log_info.log_file_name);
}
error= LOG_INFO_FATAL;
goto err;
}
}
else
{
if (stat_area.st_mtime < purge_time)
strmake(to_log,
log_info.log_file_name,
sizeof(log_info.log_file_name) - 1);
else
break;
}
if (find_next_log(&log_info, 0))
break;
}
error= (to_log[0] ? purge_logs(to_log, 1, 0, 1, (ulonglong *) 0) : 0);
err:
mysql_mutex_unlock(&LOCK_index);
DBUG_RETURN(error);
}
#endif /* HAVE_REPLICATION */
bool log_in_use(const char* log_name)
{
size_t log_name_len = strlen(log_name) + 1;
THD *tmp;
bool result = 0;
mysql_mutex_lock(&LOCK_thread_count);
I_List_iterator<THD> it(threads);
while ((tmp=it++))
{
LOG_INFO* linfo;
if ((linfo = tmp->current_linfo))
{
mysql_mutex_lock(&linfo->lock);
result = !memcmp(log_name, linfo->log_file_name, log_name_len);
mysql_mutex_unlock(&linfo->lock);
if (result)
break;
}
}
mysql_mutex_unlock(&LOCK_thread_count);
return result;
}
信息从pstack看Thread 3调用了MYSQL_BIN_LOG::reset_logs,加锁语句为mysql_mutex_lock(&LOCK_thread_count);
mysql_mutex_lock(&LOCK_log);
mysql_mutex_lock(&LOCK_index);
也就是说先给LOCK_thread_count加锁,再给LOCK_index加锁,Thread 4和Thread 3的加锁顺序正好相反,出现了死锁现象,这个死锁mysql无法检测出来。
以下是相关函数的代码
bool MYSQL_BIN_LOG::reset_logs(THD* thd)
{
LOG_INFO linfo;
bool error=0;
int err;
const char* save_name;
DBUG_ENTER("reset_logs");
ha_reset_logs(thd);
/*
The following mutex is needed to ensure that no threads call
'delete thd' as we would then risk missing a 'rollback' from this
thread. If the transaction involved MyISAM tables, it should go
into binlog even on rollback.
*/
mysql_mutex_lock(&LOCK_thread_count);
/*
We need to get both locks to be sure that no one is trying to
write to the index log file.
*/
mysql_mutex_lock(&LOCK_log);
mysql_mutex_lock(&LOCK_index);
/* Save variables so that we can reopen the log */
save_name=name;
name=0; // Protect against free
close(LOG_CLOSE_TO_BE_OPENED);
而1号主线程显示也是等待编号为_L_lock_995的锁,从代码推测也就是等待LOCK_thread_count的锁释放,但因为3和4线程陷入死锁,永远也无法释放,从而一直卡着。
复现方法
后来了解到用户为了节约磁盘空间,定期会执行reset master清理所有binlog。查看了下mysql5.6没法复现,已经修复了这个bug。5.5.24的复现方法如下:
1 gdb -p $pidofmysqld
2 在log_in_use函数处设置断点,c继续运行
3 打开1个连接,执行flush logs
4 再打开一个连接,执行reset master
5 gdb执行到断点处,按c继续
6 这时再打开新连接,发现无法连接
7 pstack查看信息跟之前的pstack信息吻合,故障复现
相关文章推荐
- 远程桌面无法连接的一个奇怪问题
- 怪异的问题:Wince6下ActiveSync无法自动连接
- 64位win7下PL/SQL Developer 报“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 万恶的ie,一个空格导致的ie js无法执行的问题
- 64位win7下PL/SQL Developer 报“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 使用PLSQL Developer时,“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 解决PCoIP连接View 5.0虚拟桌面超时断开导致无法再次登录的问题
- 64位win7下PL/SQL Developer 报“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 一个jdbc无法连接informix的问题
- 局域网内VSS无法连接的一个“恶心他妈给恶心开门”的问题
- 使用PLSQL Developer时,“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 一个 vs2008 无法连接到Asp.net Development Server的问题
- 64位win7下PL/SQL Developer 报“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 解决PCoIP连接View 5.0虚拟桌面超时断开导致无法再次登录的问题
- 使用PLSQL Developer时,“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法
- 怪异的问题:Wince6下Remote Tools无法连接成功!
- 使用PLSQL Developer时,“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法[转]
- 64位win7下PL/SQL Developer 报“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法[转帖]
- 使用PLSQL Developer时,“ORA-12154: TNS:无法解析指定的连接标识符”问题的一个解决办法 .
- Mac OS ssh-copy-id Command not found 导致无法对目标服务器上传一个公钥的问题解决方案