您的位置:首页 > 其它

5.5.24,一个怪异的死锁问题导致DB无法连接

2016-11-10 12:26 429 查看

故障现象:

mysql 5.5.24,突然所有新连接卡住,db无法连接,无任何报错

故障原因分析:

找到对应的mysqld进程号,pstack $pid > /tmp/jj.sql

Thread 28 (Thread 0x7fc08bb32700 (LWP 3402)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x7fc048e6b700 (LWP 3403)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7fc04846a700 (LWP 3404)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7fc047a69700 (LWP 3405)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7fc047068700 (LWP 3406)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7fc046667700 (LWP 3407)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7fc045c66700 (LWP 3408)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7fc045265700 (LWP 3409)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7fc044864700 (LWP 3410)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7fc043e63700 (LWP 3411)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fc043462700 (LWP 3412)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fc042a61700 (LWP 3413)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fc042060700 (LWP 3414)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fc04165f700 (LWP 3415)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fc040c5e700 (LWP 3416)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fc04025d700 (LWP 3417)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fc03f85c700 (LWP 3418)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fc03ee5b700 (LWP 3419)):

#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1

#1  0x0000000000870ad7 in os_aio_linux_handle ()

#2  0x0000000000835086 in fil_aio_wait ()

#3  0x00000000007d11a8 in io_handler_thread ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc03d450700 (LWP 3421)):

#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000000000873de3 in os_event_wait_time_low ()

#2  0x00000000007cd900 in srv_lock_timeout_thread ()

#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc03ca4f700 (LWP 3422)):

#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000000000873de3 in os_event_wait_time_low ()

#2  0x00000000007cd51b in srv_error_monitor_thread ()

#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc03c04e700 (LWP 3423)):

#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000000000873de3 in os_event_wait_time_low ()

#2  0x00000000007d053b in srv_monitor_thread ()

#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc03e45a700 (LWP 3424)):

#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000000000873f26 in os_event_wait_low ()

#2  0x00000000007ce43e in srv_master_thread ()

#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc088630700 (LWP 3444)):

#0  0x00007fc08c9b34b5 in sigwait () from /lib64/libpthread.so.0

#1  0x00000000005009eb in signal_hand ()

#2  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#3  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc0885ef700 (LWP 3445)):

#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000000000737573 in Event_queue::cond_wait(THD*, timespec*, char const*, char const*, unsigned int) ()

#2  0x00000000007375fb in Event_queue::get_top_for_execution_if_time(THD*, Event_queue_element_for_exec**) ()

#3  0x00000000007366d7 in Event_scheduler::run(THD*) ()

#4  0x000000000073681b in event_scheduler_thread ()

#5  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#6  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc0885ae700 (LWP 3467)):

#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0

#2  0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0

#3  0x000000000057bb6a in log_in_use(char const*) ()

#4  0x00000000006ce273 in MYSQL_BIN_LOG::purge_logs_before_date(long) ()

#5  0x00000000006d26f8 in MYSQL_BIN_LOG::rotate_and_purge(bool) ()

#6  0x000000000061243e in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()

#7  0x0000000000567359 in mysql_execute_command(THD*) ()

#8  0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()

#9  0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()

#10 0x00000000005f6386 in do_handle_one_connection(THD*) ()

#11 0x00000000005f63ea in handle_one_connection ()

#12 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#13 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc08856d700 (LWP 5461)):

#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007fc08c9ad508 in _L_lock_854 () from /lib64/libpthread.so.0

#2  0x00007fc08c9ad3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0

#3  0x00000000006d2eef in MYSQL_BIN_LOG::reset_logs(THD*) ()

#4  0x000000000057a0ac in reset_master(THD*) ()

#5  0x00000000006127c8 in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()

#6  0x0000000000567359 in mysql_execute_command(THD*) ()

#7  0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()

#8  0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()

#9  0x00000000005f6386 in do_handle_one_connection(THD*) ()

#10 0x00000000005f63ea in handle_one_connection ()

#11 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#12 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc08852c700 (LWP 5528)):

#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00000000005016cf in one_thread_per_connection_end(THD*, bool) ()

#2  0x00000000005f62dd in do_handle_one_connection(THD*) ()

#3  0x00000000005f63ea in handle_one_connection ()

#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0

#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc08cdce720 (LWP 3400)):

#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0

#2  0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0

#3  0x0000000000547ae2 in THD::THD() ()

#4  0x0000000000501985 in handle_connections_sockets() ()

#5  0x000000000050612f in mysqld_main(int, char**) ()

#6  0x00007fc08bb51d5d in __libc_start_main () from /lib64/libc.so.6

#7  0x00000000004fd9dd in _start ()

从堆栈信息看出,1号主线程处于锁等待,导致新连接无法连上,#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0

。3号线程执行了reset  master的操作,4号线程看起来像是一个flush logs的动作,通过观察binlog看到故障时间点正好是一个binlog达到上限后切换的时间点,初步推测是flush logs和reset master操作之间存在死锁问题,通过查看源代码来验证这个观点。

信息从pstack看Thread 4先调用MYSQL_BIN_LOG::purge_logs_before_date,加锁语句为mysql_mutex_lock(&LOCK_index),完成后并无释放;后执行log_in_use,加锁语句为mysql_mutex_lock(&LOCK_thread_count);

也就是说Thread 4的加锁顺序是先LOCK_index,后LOCK_thread_count

以下是相关的函数代码

int MYSQL_BIN_LOG::purge_logs_before_date(time_t purge_time)

{

  int error;

  char to_log[FN_REFLEN];

  LOG_INFO log_info;

  MY_STAT stat_area;

  THD *thd= current_thd;

  DBUG_ENTER("purge_logs_before_date");

  mysql_mutex_lock(&LOCK_index);

  to_log[0]= 0;

  if ((error=find_log_pos(&log_info, NullS, 0 /*no mutex*/)))

    goto err;

  while (strcmp(log_file_name, log_info.log_file_name) &&

         !is_active(log_info.log_file_name) &&

         !log_in_use(log_info.log_file_name))

  {

    if (!mysql_file_stat(m_key_file_log,

                         log_info.log_file_name, &stat_area, MYF(0)))

    {

      if (my_errno == ENOENT)

      {

        /*

          It's not fatal if we can't stat a log file that does not exist.

        */

        my_errno= 0;

      }

      else

      {

        /*

          Other than ENOENT are fatal

        */

        if (thd)

        {

          push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_WARN,

                              ER_BINLOG_PURGE_FATAL_ERR,

                              "a problem with getting info on being purged %s; "

                              "consider examining correspondence "

                              "of your binlog index file "

                              "to the actual binlog files",

                              log_info.log_file_name);

        }

        else

        {

          sql_print_information("Failed to delete log file '%s'",

                                log_info.log_file_name);

        }

        error= LOG_INFO_FATAL;

        goto err;

      }

    }

    else

    {

      if (stat_area.st_mtime < purge_time)

        strmake(to_log,

                log_info.log_file_name,

                sizeof(log_info.log_file_name) - 1);

      else

        break;

    }

    if (find_next_log(&log_info, 0))

      break;

  }

  error= (to_log[0] ? purge_logs(to_log, 1, 0, 1, (ulonglong *) 0) : 0);

err:

  mysql_mutex_unlock(&LOCK_index);

  DBUG_RETURN(error);

}

#endif /* HAVE_REPLICATION */

bool log_in_use(const char* log_name)

{

  size_t log_name_len = strlen(log_name) + 1;

  THD *tmp;

  bool result = 0;

  mysql_mutex_lock(&LOCK_thread_count);

  I_List_iterator<THD> it(threads);

  while ((tmp=it++))

  {

    LOG_INFO* linfo;

    if ((linfo = tmp->current_linfo))

    {

      mysql_mutex_lock(&linfo->lock);

      result = !memcmp(log_name, linfo->log_file_name, log_name_len);

      mysql_mutex_unlock(&linfo->lock);

      if (result)

        break;

    }

  }

  mysql_mutex_unlock(&LOCK_thread_count);

  return result;

}

信息从pstack看Thread 3调用了MYSQL_BIN_LOG::reset_logs,加锁语句为mysql_mutex_lock(&LOCK_thread_count);

 mysql_mutex_lock(&LOCK_log);

mysql_mutex_lock(&LOCK_index);

也就是说先给LOCK_thread_count加锁,再给LOCK_index加锁,Thread 4和Thread 3的加锁顺序正好相反,出现了死锁现象,这个死锁mysql无法检测出来。

以下是相关函数的代码

bool MYSQL_BIN_LOG::reset_logs(THD* thd)

{

  LOG_INFO linfo;

  bool error=0;

  int err;

  const char* save_name;

  DBUG_ENTER("reset_logs");

  ha_reset_logs(thd);

  /*

    The following mutex is needed to ensure that no threads call

    'delete thd' as we would then risk missing a 'rollback' from this

    thread. If the transaction involved MyISAM tables, it should go

    into binlog even on rollback.

  */

  mysql_mutex_lock(&LOCK_thread_count);

  /*

    We need to get both locks to be sure that no one is trying to

    write to the index log file.

  */

  mysql_mutex_lock(&LOCK_log);

  mysql_mutex_lock(&LOCK_index);

  /* Save variables so that we can reopen the log */

  save_name=name;

  name=0;                                       // Protect against free

  close(LOG_CLOSE_TO_BE_OPENED);

而1号主线程显示也是等待编号为_L_lock_995的锁,从代码推测也就是等待LOCK_thread_count的锁释放,但因为3和4线程陷入死锁,永远也无法释放,从而一直卡着。



复现方法

后来了解到用户为了节约磁盘空间,定期会执行reset master清理所有binlog。

查看了下mysql5.6没法复现,已经修复了这个bug。5.5.24的复现方法如下:

1 gdb -p $pidofmysqld

2 在log_in_use函数处设置断点,c继续运行

3 打开1个连接,执行flush logs

4 再打开一个连接,执行reset  master

5 gdb执行到断点处,按c继续

6 这时再打开新连接,发现无法连接

7 pstack查看信息跟之前的pstack信息吻合,故障复现
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: 
相关文章推荐