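Case Scenario
Today I found a problem in production: because monitoring did not cover it, the disk on one machine filled up, which broke MySQL master-slave replication. The symptom was as follows: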
localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-bin.001605
                Relay_Log_Pos: 9489761
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 13121
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
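A status dump like the one above is easy to check by script. The following is a minimal sketch, assuming the vertical (\G) output has been captured as text; the function names parse_slave_status and replication_healthy are hypothetical helpers for illustration, not part of MySQL or any client library.

```python
import re


def parse_slave_status(text):
    """Parse vertical SHOW SLAVE STATUS output into a dict of field -> value."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "*** 1. row ***" separator.
        if not line or line.startswith("*"):
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        # Accept only identifier-like keys, so wrapped continuation lines of a
        # long Last_Error message are not mistaken for fields.
        if sep and re.fullmatch(r"\w+", key) and key not in status:
            status[key] = value.strip()
    return status


def replication_healthy(status):
    """Replication is treated as healthy only if both threads report Yes."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes")


sample = """
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.xx.xx.xx
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 13121
"""
status = parse_slave_status(sample)
print(replication_healthy(status), status["Last_Errno"])  # prints: False 13121
```

A check like this could feed an alert, so that a broken replica is noticed before anyone has to log in and look.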
So I checked the error log, which contained the following:
2021-03-31T11:34:39.367173+08:00 11 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2021-03-31T11:34:39.368161+08:00 12 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': binlog truncated in the middle of event; consider out of disk space
2021-03-31T11:34:39.368191+08:00 12 [ERROR] [MY-013121] [Repl] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: MY-013121
2021-03-31T11:34:39.368205+08:00 12 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000446' position 9489626
As the messages show, the error log is fairly informative: it pointed to a disk problem and suggested that we "consider out of disk space".
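Since the error log already names the likely cause, it can be scanned automatically. Below is a rough sketch that assumes each entry sits on a single line in the format shown above (timestamp, thread id, level, code, subsystem, message); the regular expression and the function name disk_space_hints are illustrative assumptions, not an official log parser.

```python
import re

# Assumed line layout, based on the entries quoted in this article:
#   <timestamp> <thread id> [<level>] [<code>] [<subsystem>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+\[(?P<level>\w+)\]\s+"
    r"\[(?P<code>MY-\d+)\]\s+\[(?P<subsystem>\w+)\]\s+(?P<msg>.*)$"
)


def disk_space_hints(lines):
    """Return (timestamp, code, message) for ERROR entries that hint at a full disk."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Ignore lines that do not match the layout or are not errors.
        if not m or m.group("level") != "ERROR":
            continue
        if "out of disk space" in m.group("msg"):
            hits.append((m.group("ts"), m.group("code"), m.group("msg")))
    return hits


log = [
    "2021-03-31T11:34:39.367173+08:00 11 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure",
    "2021-03-31T11:34:39.368161+08:00 12 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': binlog truncated in the middle of event; consider out of disk space",
]
for ts, code, msg in disk_space_hints(log):
    print(ts, code)  # prints: 2021-03-31T11:34:39.368161+08:00 MY-010596
```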
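Resolving the Problem
After logging in to the server, I quickly found that disk usage on the MySQL host had reached 100%, which matched what the error log reported.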
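The root cause here was a disk that no monitoring was watching. A very small check along the following lines could close that gap. This is only a sketch: the 85% threshold and the "/" path are assumptions to replace with the mount point that holds the MySQL data directory and relay logs, and check_disk is a hypothetical helper name.

```python
import shutil


def usage_percent(total, used):
    """Percentage of the filesystem in use, rounded to one decimal place."""
    return round(used / total * 100, 1) if total else 0.0


def check_disk(path, threshold=85.0):
    """Return (percent_used, over_threshold) for the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    pct = usage_percent(usage.total, usage.used)
    return pct, pct >= threshold


# "/" is a placeholder; point this at the MySQL data directory's mount.
pct, alarm = check_disk("/")
print(f"{pct}% used, alarm={alarm}")
```

Run from cron or a monitoring agent, a check like this would raise an alarm well before the disk reaches 100%. Note that shutil.disk_usage computes usage slightly differently from df, so the numbers may not match exactly.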
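Now to fix it. The basic idea is to free up disk space and then rebuild the replication relationship. That sounds simple, but in practice rebuilding replication produced the following errors: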
### GTID-based replication; trying to rebuild the replication relationship
localhost.(none)>reset slave;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset
localhost.(none)>reset slave all;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset
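Step 1: Because replication is GTID-based, it should be enough to record the output of show slave status, run reset slave, and then rebuild replication with a change master statement.
Instead, the errors above appeared. The message says MySQL could not purge the relay logs, which seemed odd. Fine: if MySQL cannot purge the relay logs itself, I will do it by hand.
Step 2: Delete all the relay logs manually with rm -f. The error then changed to: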
localhost.(none)>reset slave all;
ERROR 1374 (HY000): I/O error reading log index file
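So the problem was still not solved.
Then I wondered: since reset slave could not clean up the relay logs, would it work to simply run stop slave and then change master?
Step 3: Run stop slave and then change master directly, without running reset slave all. The result: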
localhost.(none)>change master to master_host='10.13.224.31',
    ->master_user='replica',
    ->master_password='eHnNCaQE3ND',
    ->master_port=5510,
    ->master_auto_position=1;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset
No luck; the same error again.
Step 4: Replication was already broken anyway, so I ran start slave to see what would happen. The result was dramatic:
localhost.(none)>start slave;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    262
Current database: *** NONE ***

Query OK, 0 rows affected (0.01 sec)

localhost.(none)>
[root@~]#
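After start slave was executed, the instance crashed outright.
At this point replication was completely broken and the replica instance was down.
Step 5: Check whether the instance can still be restarted. It came back up. After the restart, the replication status looked like this: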
localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-bin.001605
                Relay_Log_Pos: 9489761
        Relay_Master_Log_File:
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 13121
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0
Replication was still reporting the error.
Step 6: Try reset slave all again. This time it succeeded.
localhost.(none)>stop slave;
Query OK, 0 rows affected (0.00 sec)
localhost.(none)>reset slave all;
Query OK, 0 rows affected (0.03 sec)
Step 7: Rebuild the replication relationship and start replication.
localhost.(none)>change master to master_host='10.xx.xx.xx',
    ->master_user='replica',
    ->master_password='xxxxx',
    ->master_port=5511,
    ->master_auto_position=1;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
localhost.(none)>start slave;
Query OK, 0 rows affected (0.00 sec)
localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
                ...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
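The replication relationship was now established successfully.
Summary
Once the disk fills up, the MySQL server can no longer write to its metadata tables, and the relay log may already be incomplete. If you simply free up disk space and then run change master to rebuild replication, you may hit errors and be unable to repair it directly, because this is not an ordinary replication break.
So the correct procedure should be: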
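1. Free up disk space on the server
2. Restart the replica instance whose replication has broken
3. Run reset slave all and change master again to rebuild the replication relationship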
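The SQL part of step 3 can be written down as an ordered list of statements. The sketch below only builds the statement text, mirroring the commands used in steps 6 and 7; it does not connect to MySQL. The function name rebuild_replication_statements is hypothetical, the host and password values are the masked placeholders from this article, and values are interpolated without escaping, so treat it as an illustration rather than production tooling.

```python
def rebuild_replication_statements(host, port, user, password):
    """Build, in order, the statements used to rebuild GTID-based replication.

    Intended to be run only after disk space has been freed and the replica
    instance has been restarted (steps 1 and 2 of the procedure).
    """
    return [
        "STOP SLAVE;",
        "RESET SLAVE ALL;",
        (
            "CHANGE MASTER TO "
            f"MASTER_HOST='{host}', "
            f"MASTER_USER='{user}', "
            f"MASTER_PASSWORD='{password}', "
            f"MASTER_PORT={int(port)}, "
            "MASTER_AUTO_POSITION=1;"
        ),
        "START SLAVE;",
        # Final check: both Slave_IO_Running and Slave_SQL_Running should be Yes.
        "SHOW SLAVE STATUS\\G",
    ]


for stmt in rebuild_replication_statements("10.xx.xx.xx", 5511, "replica", "xxxxx"):
    print(stmt)
```

Keeping the sequence in one place makes it harder to skip the restart or to run change master before reset slave all, which is exactly the ordering that failed in steps 1 through 3.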
If you know of a better approach, I would be glad to hear it.