您的位置:首页 > 运维架构 > 网站架构

MySQL的MMM高可用测试(转自老金)

2011-06-13 10:02 274 查看

一、双主零从

==============================================================
主机:

功能IP主机名
monitor10.1.30.216mysql-3
DBdb6Server-id=610.1.30.226net-1
db7Server-id=710.1.30.230net-2
虚拟ip:

IP010.1.30.219(Write)
IP110.1.30.221(Read)
IP210.1.30.243(Read)




——————–[1] mmm_common.conf ——————–

active_master_role writer

<host default>

cluster_interface eth0 #网卡,如果配置错误mmm会自动修正

pid_path /var/run/mmm_agentd.pid

bin_path /usr/lib/mysql-mmm/

replication_user repl

replication_password repl

agent_user mmm_agent

agent_password mmm_agent

</host>

<host db6>

ip 10.1.30.226

mysql_port 3306

mode master

peer db7

</host>

<host db7>

ip 10.1.30.230

mysql_port 3306

mode master

peer db6

</host>

<role writer>

hosts db6, db7 #能够作为writer的host

ips 10.1.30.219 #writer的虚拟IP,应用的写请求直接访问这里的ip即可

mode exclusive #只有两种模式:exclusive是排他,在这种模式下任何时候只能一个host拥有该角色,balanced模式下可以多个host同时拥有此角色。一般writer是ex,reader是ba

</role>

<role reader>

hosts db6, db7 #能够作为reader的host

ips 10.1.30.221, 10.1.30.243 #reader的虚拟IP,应用的读请求直接访问这里的ip即可

mode balanced

</role>

——————–[2] mmm_mon.conf ———————–

include mmm_common.conf

<monitor>

ip 10.1.30.216

pid_path /var/run/mmm_mond.pid

bin_path /usr/lib/mysql-mmm/

status_path /var/lib/misc/mmm_mond.status

ping_ips 10.1.30.214, 10.1.30.215, 10.1.30.226, 10.1.30.230

</monitor>

<host default>

monitor_user mmm_mon

monitor_password mmm_mon

</host>

debug 0

——————–[4] mmm_agent.conf ——————–

include mmm_common.conf

this db6

查看状态:

[root@mysql-3 ~]# mmm_control show

db6(10.1.30.226) master/ONLINE. Roles: reader(10.1.30.221), writer(10.1.30.219)

db7(10.1.30.230) master/ONLINE. Roles: reader(10.1.30.243)

[root@mysql-3 ~]# mmm_control checks

db7 ping [last change: 2010/09/19 17:19:09] OK

db7 mysql [last change: 2010/09/19 17:19:09] OK

db7 rep_threads [last change: 2010/09/19 17:19:09] OK

db7 rep_backlog [last change: 2010/09/19 17:19:09] OK: Backlog is null

db6 ping [last change: 2010/09/19 17:19:09] OK

db6 mysql [last change: 2010/09/19 17:19:09] OK

db6 rep_threads [last change: 2010/09/19 17:24:57] OK

db6 rep_backlog [last change: 2010/09/19 17:19:09] OK: Backlog is null

[root@mysql-3 ~]# mmm_control mode

ACTIVE

模拟宕机切换:

随便找一个客户端,执行写操作:

@client

[root@mysql-1 ~]# more /root/script/inserting-into-db.sh

#!/bin/bash

while true;

do

mysql -udba -pdba -h10.1.30.219 -e “insert into jcr.t1 values(null);”

sleep 1 ;

done;

[root@mysql-1 ~]# ./inserting-into-db.sh &

在两个数据库主机,通过脚本监测产生的binlog日志:

脚本内容

[root@net-1 binlog]# more mysqlbinlog_tail.sh

binlogidx=$1

while true;

binlog=`tail -1 $binlogidx`

do

mysqlbinlog $binlog|tail -20

sleep 2;

done;

监测效果:

[root@net-1 binlog]# ./mysqlbinlog_tail.sh mysql-bin.index

# at 478124

#100920 10:57:13 server id 6 end_log_pos 478152 Intvar

SET INSERT_ID=5166/*!*/;

# at 478152

#100920 10:57:13 server id 6 end_log_pos 478242 Query thread_id=43343 exec_time=1 error_code=0

SET TIMESTAMP=1284951433/*!*/;

insert into jcr.t1 values(null)

/*!*/;

DELIMITER ;

# End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

……

可以看到两个db中的binlog显示的server id都是6,也就是说当前情况下db6是作为写库。

停止db6

[root@net-1 ~]# /etc/init.d/mysql stop

Shutting down MySQL. [ OK ]

依次检查mmm各日志:

/var/log/mysql-mmm/mmm_mond.log

2010/09/20 09:49:38 WARN Check ‘rep_threads’ on ‘db6′ is in unknown state! Message: UNKNOWN:Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:49:38 WARN Check ‘rep_backlog’ on ‘db6′ is in unknown state! Message: UNKNOWN:Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:49:48 ERROR Check ‘rep_threads’ on ‘db7′ has failed for 10 seconds! Message: ERROR: Replication is broken

2010/09/20 09:49:50 FATAL State of host ‘db7′ changed from ONLINE to REPLICATION_FAIL

2010/09/20 09:49:50 INFO Removing all roles from host ‘db7′:

2010/09/20 09:49:50 INFO Removed role ‘reader(10.1.30.243)’ from host ‘db7′

2010/09/20 09:49:50 INFO Orphaned role ‘reader(10.1.30.243)’ has been assigned to ‘db6′

2010/09/20 09:49:52 ERROR Check ‘mysql’ on ‘db6′ has failed for 10 seconds! Message: ERROR: Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:49:53 FATAL State of host ‘db6′ changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)

2010/09/20 09:49:53 INFO Removing all roles from host ‘db6′:

2010/09/20 09:49:53 INFO Removed role ‘reader(10.1.30.243)’ from host ‘db6′

2010/09/20 09:49:53 INFO Removed role ‘reader(10.1.30.221)’ from host ‘db6′

2010/09/20 09:49:53 INFO Removed role ‘writer(10.1.30.219)’ from host ‘db6′

2010/09/20 09:49:56 FATAL State of host ‘db7′ changed from REPLICATION_FAIL to ONLINE

2010/09/20 09:49:56 INFO Orphaned role ‘writer(10.1.30.219)’ has been assigned to ‘db7′

2010/09/20 09:49:56 INFO Orphaned role ‘reader(10.1.30.243)’ has been assigned to ‘db7′

2010/09/20 09:49:56 INFO Orphaned role ‘reader(10.1.30.221)’ has been assigned to ‘db7′

先熟悉一下db6和db7二者目前的角色:

db6(10.1.30.226)master/ ONLINE.Roles:reader(10.1.30.221), writer(10.1.30.219)
db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.243)

从日志可以看出,db6停止之后,mmm提示connect error,然后db7的io复制线程失败,检查项rep_threads在10秒后出现报警replication is broken,由于当前的写库是db6,于是mmm认为db7上的数据已经不能和db6保持一致了,故把db7的读角色(reader)迁移到db6上。变成了:

db6(10.1.30.226) master/ONLINE. Roles: reader(10.1.30.221), writer(10.1.30.219), reader(10.1.30.243)

db7(10.1.30.230) master/ REPLICATION_FAIL.

db6(10.1.30.226)master/ ONLINE.Roles:

reader(10.1.30.221),

writer(10.1.30.219),

reader(10.1.30.243)

db7(10.1.30.230)master/ REPLICATION_FAIL.
但是,这时mmm的”mysql”检查项在10秒后出现报警,认为db6已经彻底失败,因此又把db6设置状态为hard_offline,把db7从replication_fail状态切换到online状态(因为db7的mysql至少还活着)同时把上面的所有角色切换到db7上。状态最终变为:

db6(10.1.30.226)master/ HARD_OFFLINE.
db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.221),

writer(10.1.30.219),

reader(10.1.30.243)

疑问:mmm为什么会多此一举,先把因db6宕机引起复制失败的db7的reader角色交给写库db6,然后再把db6的所有角色交给正常的db7?mmm在处理复制失败的时候能不能直接判断写库的mysql状态,如果失败,那就直接由db7接管db6,而不需要再来回切换角色了。

db6: /var/log/mysql-mmm/mmm_agentd.log

2010/09/20 09:48:28 FATAL Couldn’t allow writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:48:30 FATAL Couldn’t allow writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:48:33 FATAL Couldn’t allow writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:48:36 FATAL Couldn’t allow writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:48:39 INFO We have some new roles added or old rules deleted!

2010/09/20 09:48:39 INFO Added: reader(10.1.30.243)

2010/09/20 09:48:40 FATAL Couldn’t allow writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 09:48:42 INFO We have some new roles added or old rules deleted!

2010/09/20 09:48:42 INFO Deleted: reader(10.1.30.221), reader(10.1.30.243), writer(10.1.30.219)

2010/09/20 09:48:43 FATAL Couldn’t deny writes: ERROR: Can’t connect to MySQL (host = 10.1.30.226:3306, user = mmm_agent)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

从db6的agent日志也验证了刚才的分析:先增加reader(243),然后移除所有角色。

db7: /var/log/mysql-mmm/mmm_agentd.log

2010/09/20 09:48:39 INFO We have some new roles added or old rules deleted!

2010/09/20 09:48:39 INFO Deleted: reader(10.1.30.243)

2010/09/20 09:48:45 INFO We have some new roles added or old rules deleted!

2010/09/20 09:48:45 INFO Added: reader(10.1.30.221), reader(10.1.30.243), writer(10.1.30.219)

db7的日志可以看出,db7先删除reader(243),然后mmm把所有角色都添加到了db7下面。

db6恢复正常

[root@net-1 ~]# /etc/init.d/mysql start

Starting MySQL. [ OK ]

/var/log/mysql-mmm/mmm_mond.log

2010/09/20 09:50:43 INFO Check ‘rep_threads’ on ‘db6′ is ok!

2010/09/20 09:50:43 INFO Check ‘rep_backlog’ on ‘db6′ is ok!

2010/09/20 09:50:47 INFO Check ‘mysql’ on ‘db6′ is ok!

2010/09/20 09:50:48 FATAL State of host ‘db6′ changed from HARD_OFFLINE to AWAITING_RECOVERY

2010/09/20 09:50:50 FATAL State of host ‘db6′ changed from AWAITING_RECOVERY to ONLINE because it was down for only 58 seconds

2010/09/20 09:50:50 INFO Moving role ‘reader(10.1.30.243)’ from host ‘db7′ to host ‘db6′

2010/09/20 09:51:38 INFO Check ‘rep_threads’ on ‘db7′ is ok!

db6: /var/log/mysql-mmm/mmm_agentd.log

2010/09/20 09:49:39 INFO We have some new roles added or old rules deleted!

2010/09/20 09:49:39 INFO Added: reader(10.1.30.243)

db7: /var/log/mysql-mmm/mmm_agentd.log

2010/09/20 09:49:39 INFO We have some new roles added or old rules deleted!

2010/09/20 09:49:39 INFO Deleted: reader(10.1.30.243)

Mmm的处理步骤大致是:

db6的“mysql”check恢复正常,然后把db6切换到awaiting_recovery状态。然后mmm判断db6的宕机时间在正常范围内,不属于flapping,因此自动切换为online状态。

把db7中的一个reader角色迁移到db6上。

目前写库是db7。(从binlog中就可以看出server id=7)

注:可以在exclusive 的<role writer>中设置prefer=db6,这样在db6恢复正常之后,就可以再次被切换为写库了。

从客户端的监测发现,其中写操作大概停了17秒。

其他架构:

二、双主多从





其中的纯数字是ip的D段值(10.1.30.*)

配置:

/etc/mysql-mmm/mmm_common.conf

……<host default>略……

<host db1>

ip 10.1.30.214

mysql_port 3306

mode slave

</host>

<host db6>

ip 10.1.30.226

mysql_port 3306

mode master

peer db7

</host>

<host db7>

ip 10.1.30.230

mysql_port 3306

mode master

peer db6

</host>

<role writer>

hosts db6, db7

ips 10.1.30.219

mode exclusive

prefer db6

</role>

<role reader>

hosts db1, db7

ips 10.1.30.221, 10.1.30.243

mode balanced

</role>

目前各自分配的角色:

db2(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.243)

db6(10.1.30.226)master/ONLINE.Roles:

writer(10.1.30.219)

db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.221)

# Role writer is assigned to it’s preferred host db6.

测试1:db7挂

mmm把reader(10.1.30.221) 切换到db2

2010/09/20 17:38:27 WARN Check ‘rep_backlog’ on ‘db7′ is in unknown state! Message: UNKNOWN: Connect error (host = 10.1.30.230:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:38:27 WARN Check ‘rep_threads’ on ‘db7′ is in unknown state! Message: UNKNOWN: Connect error (host = 10.1.30.230:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:38:37 ERROR Check ‘mysql’ on ‘db7′ has failed for 10 seconds! Message: ERROR: Connect error (host = 10.1.30.230:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:38:37 ERROR Check ‘rep_threads’ on ‘db6′ has failed for 10 seconds! Message: ERROR: Replication is broken

2010/09/20 17:38:38 FATAL State of host ‘db7′ changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)

2010/09/20 17:38:38 INFO Removing all roles from host ‘db7′:

2010/09/20 17:38:38 INFO Removed role ‘reader(10.1.30.221)’ from host ‘db7′

2010/09/20 17:38:38 INFO Orphaned role ‘reader(10.1.30.221)’ has been assigned to ‘db1′

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.221),

reader(10.1.30.243)

db6(10.1.30.226)master/ONLINE.Roles:

writer(10.1.30.219)

db7(10.1.30.230)master/HARD_OFFLINE.Roles:
恢复正常之后:

2010/09/20 17:42:17 INFO Check ‘mysql’ on ‘db7′ is ok!

2010/09/20 17:42:17 INFO Check ‘rep_backlog’ on ‘db7′ is ok!

2010/09/20 17:42:17 INFO Check ‘rep_threads’ on ‘db7′ is ok!

2010/09/20 17:42:19 FATAL State of host ‘db7′ changed from HARD_OFFLINE to AWAITING_RECOVERY

2010/09/20 17:42:28 INFO Check ‘rep_threads’ on ‘db6′ is ok!

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.221),

reader(10.1.30.243)

db6(10.1.30.226)master/ONLINE.Roles:

writer(10.1.30.219)

db7(10.1.30.230)master/AWAITING_RECOVERY.Roles:
mmm_control set_online db7

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.221)

db6(10.1.30.226)master/ONLINE.Roles:

writer(10.1.30.219)

db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.243)

测试2:db6挂

mmm把db6拿掉,把写操作转给db7,同时把db1状态改为replication_fail,把db1的读操作切到db7。

这时,mmm检查发现db1的复制又恢复了正常(master已经成功切换到db7),于是又恢复了读访问。

2010/09/20 17:52:44 WARN Check ‘rep_backlog’ on ‘db6′ is in unknown state! Message: UNKNOWN: Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:52:44 WARN Check ‘rep_threads’ on ‘db6′ is in unknown state! Message: UNKNOWN: Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:52:54 ERROR Check ‘mysql’ on ‘db6′ has failed for 10 seconds! Message: ERROR: Connect error (host = 10.1.30.226:3306, user = mmm_mon)! Lost connection to MySQL server at ‘reading initial communication packet’, system error: 111

2010/09/20 17:52:54 ERROR Check ‘rep_threads’ on ‘db7′ has failed for 10 seconds! Message: ERROR: Replication is broken

2010/09/20 17:52:54 ERROR Check ‘rep_threads’ on ‘db1′ has failed for 10 seconds! Message: ERROR: Replication is broken

2010/09/20 17:52:55 FATAL State of host ‘db6′ changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)

2010/09/20 17:52:55 INFO Removing all roles from host ‘db6′:

2010/09/20 17:52:55 INFO Removed role ‘writer(10.1.30.219)’ from host ‘db6′

2010/09/20 17:52:55 INFO Orphaned role ‘writer(10.1.30.219)’ has been assigned to ‘db7′

2010/09/20 17:52:58 FATAL State of host ‘db1′ changed from ONLINE to REPLICATION_FAIL

2010/09/20 17:52:58 INFO Removing all roles from host ‘db1′:

2010/09/20 17:52:58 INFO Removed role ‘reader(10.1.30.221)’ from host ‘db1′

2010/09/20 17:52:58 INFO Orphaned role ‘reader(10.1.30.221)’ has been assigned to ‘db7′

2010/09/20 17:52:59 INFO Check ‘rep_threads’ on ‘db1′ is ok!

2010/09/20 17:53:01 FATAL State of host ‘db1′ changed from REPLICATION_FAIL to ONLINE

2010/09/20 17:53:01 INFO Moving role ‘reader(10.1.30.243)’ from host ‘db7′ to host ‘db1′

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.243)

db6(10.1.30.226)master/ HARD_OFFLINE.Roles:
db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.221),

writer(10.1.30.219)

恢复正常

2010/09/20 18:02:10 INFO Check ‘mysql’ on ‘db6′ is ok!

2010/09/20 18:02:10 INFO Check ‘rep_backlog’ on ‘db6′ is ok!

2010/09/20 18:02:10 INFO Check ‘rep_threads’ on ‘db6′ is ok!

2010/09/20 18:02:13 FATAL State of host ‘db6′ changed from HARD_OFFLINE to AWAITING_RECOVERY

2010/09/20 18:02:45 INFO Check ‘rep_threads’ on ‘db7′ is ok!

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.243)

db6(10.1.30.226)master/AWAITING_RECOVERY.Roles:
db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.221),

writer(10.1.30.219)

mmm_control set_online db6

2010/09/20 18:05:28 FATAL Admin changed state of ‘db6′ from AWAITING_RECOVERY to ONLINE

2010/09/20 18:05:28 INFO Moving role ‘writer(10.1.30.219)’ from host ‘db7′ to preferred host ‘db6′

db1(10.1.30.214)slave/ONLINE.Roles:

reader(10.1.30.243)

db6(10.1.30.226)master/ONLINE.Roles:

writer(10.1.30.219)

db7(10.1.30.230)master/ONLINE.Roles:

reader(10.1.30.221)

在整个过程中,db1对应的master从db6切换到db7,然后再由db7切换到db6,而db6失败然后重新打开,在这期间每一秒都有数据写入,最终检查数据一致性状况良好。

[root@mysql-3 ~]# ./check_repl.sh

+———-+

| count(*) |

+———-+

| 30598 |

+———-+

+———-+

| count(*) |

+———-+

| 30598 |

+———-+

+———-+

| count(*) |

+———-+

| 30598 |

+———-+

总结

MMM对MySQL复制提供了一种高可用的解决方案。

它包含主机监控、mysql服务监控、复制线程监控等多方位监控手段,在MySQL多主环境中,使用它可以在任何一个节点(master或slave)出现故障时能够及时作出判断和调整。它使用虚拟ip技术使数据库架构内的角色切换对上层的应用程序透明。

原文链接:http://www.mysqlops.com/2011/06/11/mysql-mmm-architecture.html
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: