MHA: Advanced Usage

2022-01-25 19:18:39

All packages used in this article can be downloaded via this Baidu Cloud link (extraction code: iiqg).

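Environment Preparation

Prepare the following hosts:

Hostname  IP          Role
db01      10.0.1.200  MySQL (initial master)
db02      10.0.1.201  MySQL (slave)
db03      10.0.1.202  MySQL (slave)
manager   10.0.1.204  MHA manager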

1. Install MySQL on db01, db02, and db03, following "Installing MySQL 5.7 from a Binary Package".

2. On every node, add the following hosts entries to simplify later steps:

$ cat << EOF >> /etc/hosts
10.0.1.204 manager
10.0.1.200 db01
10.0.1.201 db02
10.0.1.202 db03
EOF
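
Because the heredoc above appends unconditionally, running it twice duplicates the entries. A minimal sketch of an idempotent variant follows; the function name add_host_entry and its file argument are illustrative and not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "IP NAME" to a hosts file only if NAME is not mapped yet.
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  # Look for NAME as a whole field (column 2 or later) in any non-comment line.
  if awk -v n="$name" '
       !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }
     ' "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage (as root):
#   add_host_entry /etc/hosts 10.0.1.204 manager
#   add_host_entry /etc/hosts 10.0.1.200 db01
```

Matching the name as a whole field avoids treating db01 as present just because some other entry contains it as a substring.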

3. On db01, db02, and db03, create the directory that stores the binlog files:

$ mkdir -p /data/binlog && chown -R mysql:mysql /data/

Replication Configuration

1. Edit the MySQL configuration file on db01:

$ cat > /etc/my.cnf <<EOF
[mysqld]
basedir=/data/app/mysql
datadir=/data/3306/data
socket=/tmp/mysql.sock
server_id=1
port=3306
secure-file-priv=/tmp
autocommit=0
log_bin=/data/binlog/mysql-bin
binlog_format=row
gtid-mode=on
enforce-gtid-consistency=true
log-slave-updates=1
[mysql]
prompt='db01 [\\d]> '
EOF

2. Edit the MySQL configuration file on db02:

$ cat > /etc/my.cnf <<EOF
[mysqld]
basedir=/data/app/mysql
datadir=/data/3306/data
socket=/tmp/mysql.sock
server_id=2
port=3306
secure-file-priv=/tmp
autocommit=0
log_bin=/data/binlog/mysql-bin
binlog_format=row
gtid-mode=on
enforce-gtid-consistency=true
log-slave-updates=1
[mysql]
prompt='db02 [\\d]> '
EOF

3. Edit the MySQL configuration file on db03:

$ cat > /etc/my.cnf <<EOF
[mysqld]
basedir=/data/app/mysql
datadir=/data/3306/data
socket=/tmp/mysql.sock
server_id=3
port=3306
secure-file-priv=/tmp
autocommit=0
log_bin=/data/binlog/mysql-bin
binlog_format=row
gtid-mode=on
enforce-gtid-consistency=true
log-slave-updates=1
[mysql]
prompt='db03 [\\d]> '
EOF
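
The three configuration files above differ only in server_id and in the host name shown in the prompt. As a rough sketch, they could be produced from one template; render_my_cnf is a hypothetical helper name, and the settings are copied from the files above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the my.cnf used in this article for one node.
#   $1 = server_id, $2 = host name shown in the mysql prompt
render_my_cnf() {
  local server_id="$1" host="$2"
  # Unquoted heredoc: "\\d" is written to the output as "\d".
  cat <<EOF
[mysqld]
basedir=/data/app/mysql
datadir=/data/3306/data
socket=/tmp/mysql.sock
server_id=${server_id}
port=3306
secure-file-priv=/tmp
autocommit=0
log_bin=/data/binlog/mysql-bin
binlog_format=row
gtid-mode=on
enforce-gtid-consistency=true
log-slave-updates=1
[mysql]
prompt='${host} [\\d]> '
EOF
}

# Usage on db02 (as root):
#   render_my_cnf 2 db02 > /etc/my.cnf
```

Each node must still get a unique server_id, otherwise replication will not work correctly.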

4. Restart the MySQL service on db01, db02, and db03:

$ systemctl restart mysqld

5. On db01, create the MySQL replication account and grant it privileges:

mysql> create user repluser@'10.0.1.%' identified by '123';
mysql> grant replication slave,replication client on *.* to repluser@'10.0.1.%';
mysql> flush privileges;

6. On db02 and db03, enable replication:

mysql> change master to 
master_host='10.0.1.200',
master_user='repluser',
master_password='123' ,
MASTER_AUTO_POSITION=1;

mysql> start slave;
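
After starting the slave threads it is worth confirming that both of them are actually running. The following is a minimal sketch that reads the output of show slave status\G on standard input; replication_ok is an illustrative name, not a MySQL or MHA tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both replication threads report "Yes".
# Reads the output of: mysql -e 'show slave status\G'
replication_ok() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }              # strip the left padding of \G output
    $1 == "Slave_IO_Running"  { io  = $2 }
    $1 == "Slave_SQL_Running" { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}

# Usage on db02 or db03:
#   mysql -e 'show slave status\G' | replication_ok && echo "replication is running"
```

The field names are compared exactly, so Slave_SQL_Running_State is not mistaken for Slave_SQL_Running.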

MHA Base Environment Setup

Passwordless SSH Configuration

1. On the manager host, generate a key pair and authorize it for the host itself:

$ ssh-keygen 
$ mv /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
$ chmod 600 /root/.ssh/authorized_keys

2. Copy the .ssh directory to db01, db02, and db03:

$ scp -rp /root/.ssh db01:/root/
$ scp -rp /root/.ssh db02:/root/
$ scp -rp /root/.ssh db03:/root/
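
Before moving on, a quick loop can confirm that key-based login works without a password prompt. This is only a sketch: check_ssh is a hypothetical helper, and it assumes the root user and the host names from /etc/hosts above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify non-interactive SSH login to each given host.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
  local host failed=0
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true; then
      echo "ok: ${host}"
    else
      echo "FAILED: ${host}"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on any of the four hosts:
#   check_ssh db01 db02 db03 manager
```

Since every node received the same key pair, the same check can be repeated from each host to cover all directions.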

MHA Package Installation

1. On the manager host, install both the manager and node packages:

$ yum localinstall mha4mysql-manager-0.58-0.el7.centos.noarch.rpm mha4mysql-node-0.58-0.el7.centos.noarch.rpm -y

2. On db01, db02, and db03, install the node package:

$ yum localinstall mha4mysql-node-0.58-0.el7.centos.noarch.rpm -y

3. On each MySQL host, create symlinks for mysql and mysqlbinlog in /usr/local/bin/:

$ ln -s /data/app/mysql/bin/mysql /usr/local/bin/
$ ln -s /data/app/mysql/bin/mysqlbinlog /usr/local/bin/

4. On db01, create the MySQL user that MHA uses for management:

mysql> grant all privileges on *.* to mhauser@'10.0.1.%' identified by 'mha';
mysql> flush privileges;

Configuration File Preparation

1. On the manager host, create the configuration directory and the log directory:

$ mkdir -p /etc/mha && mkdir -p /var/log/mha/app1

2. On the manager host, write the MHA configuration file:

$ cat > /etc/mha/app1.cnf <<EOF
[server default]
# MHA manager log file
manager_log=/var/log/mha/app1/manager
# MHA working directory
manager_workdir=/var/log/mha/app1
# binlog directory on the master
master_binlog_dir=/data/binlog
# MHA monitoring user and password
user=mhauser
password=mha
# heartbeat (ping) interval in seconds
ping_interval=2
# replication user and password
repl_user=repluser
repl_password=123
# user for key-based SSH access
ssh_user=root

# node definitions
[server1]
hostname=10.0.1.200
port=3306

[server2]
hostname=10.0.1.201
port=3306
candidate_master=1

[server3]
hostname=10.0.1.202
port=3306
EOF
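
A small sanity check can list which nodes the configuration file actually declares. The sketch below assumes the layout used in this article (one [serverN] section per node with a hostname= line); list_mha_servers is a hypothetical helper, not an MHA command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "section hostname" for every [serverN] block
# in an MHA configuration file laid out like /etc/mha/app1.cnf above.
list_mha_servers() {
  awk '
    /^\[server[0-9]+\]/ {                      # [server1], [server2], ...
      section = substr($0, 2, index($0, "]") - 2)
      next
    }
    /^\[/ { section = ""; next }               # any other section, e.g. [server default]
    section != "" && /^hostname=/ {
      sub(/^hostname=/, "")
      print section, $0
    }
  ' "$1"
}

# Usage on the manager host:
#   list_mha_servers /etc/mha/app1.cnf
```

This is handy later on, because MHA is started with --remove_dead_master_conf and therefore removes the failed master's section from the file after a failover.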

Checks and Startup

All of the following operations are performed on the manager host.

1. Check that SSH connectivity works:

$ masterha_check_ssh   --conf=/etc/mha/app1.cnf
Mon May 18 19:50:22 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon May 18 19:50:22 2020 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Mon May 18 19:50:22 2020 - [info] Reading server configuration from /etc/mha/app1.cnf..
Mon May 18 19:50:22 2020 - [info] Starting SSH connection tests..
Mon May 18 19:50:23 2020 - [debug] 
Mon May 18 19:50:22 2020 - [debug]  Connecting via SSH from root@10.0.1.200(10.0.1.200:22) to root@10.0.1.201(10.0.1.201:22)..
Mon May 18 19:50:22 2020 - [debug]   ok.
Mon May 18 19:50:22 2020 - [debug]  Connecting via SSH from root@10.0.1.200(10.0.1.200:22) to root@10.0.1.202(10.0.1.202:22)..
Mon May 18 19:50:22 2020 - [debug]   ok.
Mon May 18 19:50:23 2020 - [debug] 
Mon May 18 19:50:22 2020 - [debug]  Connecting via SSH from root@10.0.1.201(10.0.1.201:22) to root@10.0.1.200(10.0.1.200:22)..
Mon May 18 19:50:22 2020 - [debug]   ok.
Mon May 18 19:50:22 2020 - [debug]  Connecting via SSH from root@10.0.1.201(10.0.1.201:22) to root@10.0.1.202(10.0.1.202:22)..
Mon May 18 19:50:23 2020 - [debug]   ok.
Mon May 18 19:50:24 2020 - [debug] 
Mon May 18 19:50:23 2020 - [debug]  Connecting via SSH from root@10.0.1.202(10.0.1.202:22) to root@10.0.1.200(10.0.1.200:22)..
Mon May 18 19:50:23 2020 - [debug]   ok.
Mon May 18 19:50:23 2020 - [debug]  Connecting via SSH from root@10.0.1.202(10.0.1.202:22) to root@10.0.1.201(10.0.1.201:22)..
Mon May 18 19:50:23 2020 - [debug]   ok.
Mon May 18 19:50:24 2020 - [info] All SSH connection tests passed successfully.

2. Check that the replication environment is healthy:

$ masterha_check_repl  --conf=/etc/mha/app1.cnf 
Mon May 18 19:51:06 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon May 18 19:51:06 2020 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Mon May 18 19:51:06 2020 - [info] Reading server configuration from /etc/mha/app1.cnf..
Mon May 18 19:51:06 2020 - [info] MHA::MasterMonitor version 0.58.
Mon May 18 19:51:07 2020 - [info] GTID failover mode = 1
Mon May 18 19:51:07 2020 - [info] Dead Servers:
Mon May 18 19:51:07 2020 - [info] Alive Servers:
Mon May 18 19:51:07 2020 - [info]   10.0.1.200(10.0.1.200:3306)
Mon May 18 19:51:07 2020 - [info]   10.0.1.201(10.0.1.201:3306)
Mon May 18 19:51:07 2020 - [info]   10.0.1.202(10.0.1.202:3306)
Mon May 18 19:51:07 2020 - [info] Alive Slaves:
Mon May 18 19:51:07 2020 - [info]   10.0.1.201(10.0.1.201:3306)  Version=5.7.28-log (oldest major version between slaves) log-bin:enabled
Mon May 18 19:51:07 2020 - [info]     GTID ON
Mon May 18 19:51:07 2020 - [info]     Replicating from 10.0.1.200(10.0.1.200:3306)
Mon May 18 19:51:07 2020 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon May 18 19:51:07 2020 - [info]   10.0.1.202(10.0.1.202:3306)  Version=5.7.28-log (oldest major version between slaves) log-bin:enabled
Mon May 18 19:51:07 2020 - [info]     GTID ON
Mon May 18 19:51:07 2020 - [info]     Replicating from 10.0.1.200(10.0.1.200:3306)
Mon May 18 19:51:07 2020 - [info] Current Alive Master: 10.0.1.200(10.0.1.200:3306)
Mon May 18 19:51:07 2020 - [info] Checking slave configurations..
Mon May 18 19:51:07 2020 - [info]  read_only=1 is not set on slave 10.0.1.201(10.0.1.201:3306).
Mon May 18 19:51:07 2020 - [info]  read_only=1 is not set on slave 10.0.1.202(10.0.1.202:3306).
Mon May 18 19:51:07 2020 - [info] Checking replication filtering settings..
Mon May 18 19:51:07 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Mon May 18 19:51:07 2020 - [info]  Replication filtering check ok.
Mon May 18 19:51:07 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Mon May 18 19:51:07 2020 - [info] Checking SSH publickey authentication settings on the current master..
Mon May 18 19:51:07 2020 - [info] HealthCheck: SSH to 10.0.1.200 is reachable.
Mon May 18 19:51:07 2020 - [info] 
10.0.1.200(10.0.1.200:3306) (current master)
 +--10.0.1.201(10.0.1.201:3306)
 +--10.0.1.202(10.0.1.202:3306)

Mon May 18 19:51:07 2020 - [info] Checking replication health on 10.0.1.201..
Mon May 18 19:51:07 2020 - [info]  ok.
Mon May 18 19:51:07 2020 - [info] Checking replication health on 10.0.1.202..
Mon May 18 19:51:07 2020 - [info]  ok.
Mon May 18 19:51:07 2020 - [warning] master_ip_failover_script is not defined.
Mon May 18 19:51:07 2020 - [warning] shutdown_script is not defined.
Mon May 18 19:51:07 2020 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

3. Start MHA:

$ nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log 2>&1 &

4. Check the current MHA status:

$ masterha_check_status --conf=/etc/mha/app1.cnf
app1 (pid:1952) is running(0:PING_OK), master:10.0.1.200
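
For use in other scripts, the master address can be pulled out of that status line. A minimal sketch, assuming the exact line format shown above; mha_current_master is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the master address from a masterha_check_status
# line of the form "app1 (pid:N) is running(0:PING_OK), master:ADDRESS".
# Prints nothing if the line does not report PING_OK.
mha_current_master() {
  sed -n 's/.*is running(0:PING_OK), master:\(.*\)$/\1/p'
}

# Usage on the manager host:
#   masterha_check_status --conf=/etc/mha/app1.cnf | mha_current_master
```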

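Failover and Notification

Preparing the Scripts

On the manager host, prepare the following scripts, make them executable, and copy them to /usr/local/bin/:

$ ls | xargs -n1
master_ip_failover      # failover script
master_ip_online_change # online switchover script
mha_check.sh            # MHA management script
send_report             # mail notification script
$ chmod +x *
$ cp -a * /usr/local/bin/

The contents of these scripts are as follows: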

master_ip_failover

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);
my $prefix = '24';
my $vip = '10.0.1.205';
my $full_vip = "$vip/$prefix";
my $key = '1';
my $iface_name = 'eth0';
my $ssh_start_vip = "/sbin/ifconfig $iface_name:$key $full_vip";
my $ssh_stop_vip = "/sbin/ifconfig $iface_name:$key down";
my $ssh_Bcast_arp= "/sbin/arping -I $iface_name -c 3 -A $vip";
GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $full_vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
     return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
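
To make the effect of the script above easier to follow, here is a dry-run sketch that only prints the remote commands involved in moving the VIP. The first two lines mirror stop_vip and start_vip. The third line is an assumption: the script defines $ssh_Bcast_arp but never runs it, and sending a gratuitous ARP from the new master is shown here only as a possible extra step. The function name vip_failover_plan and the root user are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical dry run: print (do not execute) the VIP commands for a failover.
#   $1 = old master host, $2 = new master host
vip_failover_plan() {
  local old_master="$1" new_master="$2"
  local vip="10.0.1.205" prefix="24" iface="eth0" key="1"
  # Same command as $ssh_stop_vip in master_ip_failover
  echo "ssh root@${old_master} /sbin/ifconfig ${iface}:${key} down"
  # Same command as $ssh_start_vip in master_ip_failover
  echo "ssh root@${new_master} /sbin/ifconfig ${iface}:${key} ${vip}/${prefix}"
  # Assumption: announce the new location of the VIP (not called by the script)
  echo "ssh root@${new_master} /sbin/arping -I ${iface} -c 3 -A ${vip}"
}

# Usage:
#   vip_failover_plan 10.0.1.200 10.0.1.201
```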

master_ip_online_change

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;
my $_tstart;
my $_running_interval = 0.1;
my (
  $command,              $orig_master_is_new_slave, $orig_master_host,
  $orig_master_ip,       $orig_master_port,         $orig_master_user,
  $orig_master_password, $orig_master_ssh_user,     $new_master_host,
  $new_master_ip,        $new_master_port,          $new_master_user,
  $new_master_password,  $new_master_ssh_user,
);
 
###########################################################################
my $vip = "10.0.1.205";
my $key = "1";
my $iface_name = 'eth0';
my $ssh_start_vip = "/sbin/ifconfig $iface_name:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig $iface_name:$key $vip down";
my $ssh_Bcast_arp= "/sbin/arping -I $iface_name -c 3 -A $vip";
###########################################################################
 
GetOptions(
  'command=s'                => \$command,
  'orig_master_is_new_slave' => \$orig_master_is_new_slave,
  'orig_master_host=s'       => \$orig_master_host,
  'orig_master_ip=s'         => \$orig_master_ip,
  'orig_master_port=i'       => \$orig_master_port,
  'orig_master_user=s'       => \$orig_master_user,
  'orig_master_password=s'   => \$orig_master_password,
  'orig_master_ssh_user=s'   => \$orig_master_ssh_user,
  'new_master_host=s'        => \$new_master_host,
  'new_master_ip=s'          => \$new_master_ip,
  'new_master_port=i'        => \$new_master_port,
  'new_master_user=s'        => \$new_master_user,
  'new_master_password=s'    => \$new_master_password,
  'new_master_ssh_user=s'    => \$new_master_ssh_user,
);
exit &main();
sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}
sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}
sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;
  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();
  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );
    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }
    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }
    push @threads, $ref;
  }
  return @threads;
}
sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();
      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();
      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );
      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      $orig_master_handler->disable_log_bin_local();
      print current_time_us() . " Dropping app user on the orig master..\n";
###########################################################################
      #FIXME_xxx_drop_app_user($orig_master_handler);
###########################################################################
      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }
      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }
###########################################################################
      print "disable the VIP on old master: $orig_master_host \n";
      &stop_vip();
###########################################################################
      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      $orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();
      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();
      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      ## Set read_only=0 on the new master
      $new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();
      ## Creating an app user on the new master
      print current_time_us() . " Creating app user on the new master..\n";
###########################################################################
      #FIXME_xxx_create_app_user($new_master_handler);
###########################################################################
      $new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();
      ## Update master ip on the catalog database, etc
###############################################################################
      print "enable the VIP: $vip on the new master: $new_master_host \n ";
      &start_vip();
###############################################################################
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {
    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}
###########################################################################
sub start_vip() {
        `ssh $new_master_ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
        `ssh $orig_master_ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
###########################################################################
sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}

mha_check.sh

#!/bin/bash

choice=$1
stop="masterha_stop  --conf=/etc/mha/app1.cnf"
status="masterha_check_status --conf=/etc/mha/app1.cnf"
repl="masterha_check_repl --conf=/etc/mha/app1.cnf"
conf_file="/etc/mha/app1.cnf"
log_file="/var/log/mha/app1/manager.log"


case "$choice" in 
  stop)
    $stop
    ;;
  start)
    nohup masterha_manager --conf=$conf_file --remove_dead_master_conf --ignore_last_failover  < /dev/null> $log_file 2>&1 &
    echo "it is start !"
    ;;
  restart)
    $stop 
    sleep 1
    nohup masterha_manager --conf=$conf_file --remove_dead_master_conf --ignore_last_failover  < /dev/null> $log_file 2>&1 &
    echo "it is start !"
    ;;
  status)
    $status
    ;;
  repl)
    $repl
    ;;
    *)
      echo "Usages: $0  {start|stop|restart|status|repl}"
      exit 1
esac

send_report

#!/usr/bin/perl
#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
 
## Note: This is a sample script and is not complete. Modify the script based on your environment.
 
use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;
 
#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp='smtp.qq.com';
my $mail_from='158641875@qq.com';
my $mail_user='158641875';
my $mail_pass='svzgmphqiaoxbhci';
#my $mail_to=['to1@qq.com','to2@qq.com'];
my $mail_to='632404164@qq.com';
 
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);
 
# Do whatever you want here
mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);
 
sub mailToContacts {
        my ($smtp, $mail_from, $mail_user, $mail_pass, $mail_to, $subject, $msg ) = @_;
        open my $DEBUG, ">/tmp/mail.log"
                or die "Can't open the debug    file:$!\n";
        my $sender = new Mail::Sender {
                ctype           => 'text/plain;charset=utf-8',
                encoding        => 'utf-8',
                smtp            => $smtp,
                from            => $mail_from,
                auth            => 'LOGIN',
                TLS_allowed     => '0',
                authid          => $mail_user,
                authpwd         => $mail_pass,
                to              => $mail_to,
                subject         => $subject,
                debug           => $DEBUG
        };
        $sender->MailMsg(
                {
                        msg => $msg,
                        debug => $DEBUG
                }
        ) or print $Mail::Sender::Error;
        return 1;
}
 
exit 0;

Applying the Configuration

1. Add the following entries to the MHA configuration file:

$ vim /etc/mha/app1.cnf 
[server default]
master_ip_failover_script=/usr/local/bin/master_ip_failover
report_script=/usr/local/bin/send_report

2. Manually configure the VIP on the host of the current master, db01:

$ ifconfig eth0:1 10.0.1.205/24

3. Restart MHA:

$ mha_check.sh restart

Binlog Redundancy

1. On the manager host, create the directory that holds the redundant binlog copies:

$ mkdir -p /data/binlog_server/ && chown -R mysql.mysql /data/*

2. On db01, check which binlog file the master is currently using:

$ mysql -e 'show master status\G' | grep 'File'
             File: mysql-bin.000001
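
The file name is needed again whenever the binlog puller is restarted, so it can help to extract it programmatically. A minimal sketch that reads show master status\G output on standard input; current_binlog_file is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the "File:" field from the output of
#   mysql -e 'show master status\G'
current_binlog_file() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }          # strip the left padding of \G output
    $1 == "File" { print $2; exit }
  '
}

# Usage on the current master:
#   mysql -e 'show master status\G' | current_binlog_file
```

The printed name can then be passed as the last argument of the mysqlbinlog command used to pull binlogs.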

3. Copy the mysqlbinlog binary from any of the MySQL hosts to /usr/bin/ on the manager host, then run mysqlbinlog to continuously pull binlog files into the redundancy directory:

$ cd  /data/binlog_server/
$ nohup mysqlbinlog  -R --host=10.0.1.200 --user=mhauser --password=mha --raw  --stop-never mysql-bin.000001 &

4. Add the following section to the MHA configuration file:

$ vim /etc/mha/app1.cnf 
[binlog1]
no_master=1
hostname=10.0.1.204
master_binlog_dir=/data/binlog_server/

5. Restart MHA:

$ mha_check.sh restart
Stopped app1 successfully.
it is start !

Failure Simulation and Recovery

Failure Simulation

1. On the manager host, watch the MHA log in a console:

$ tailf /var/log/mha/app1/manager

2. On db01, stop the MySQL service:

$ systemctl stop mysqld

3. The MHA log shows the following; the master has been switched automatically to 10.0.1.201, that is, db02:

$ tailf /var/log/mha/app1/manager
...
----- Failover Report -----

app1: MySQL Master failover 10.0.1.200(10.0.1.200:3306) to 10.0.1.201(10.0.1.201:3306) succeeded

Master 10.0.1.200(10.0.1.200:3306) is down!

Check MHA Manager logs at centos7-204:/var/log/mha/app1/manager for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.0.1.200(10.0.1.200:3306)
Selected 10.0.1.201(10.0.1.201:3306) as a new master.
10.0.1.201(10.0.1.201:3306): OK: Applying all logs succeeded.
10.0.1.201(10.0.1.201:3306): OK: Activated master IP address.
10.0.1.202(10.0.1.202:3306): OK: Slave started, replicating from 10.0.1.201(10.0.1.201:3306)
10.0.1.201(10.0.1.201:3306): Resetting slave info succeeded.
Master failover to 10.0.1.201(10.0.1.201:3306) completed successfully.
Mon May 18 21:03:30 2020 - [info] Sending mail..

4. Check the mailbox to confirm that the notification email has arrived.

5. On db02, check whether the VIP has moved over:

$ ifconfig | grep '10.0.1.205'
        inet 10.0.1.205  netmask 255.255.255.0  broadcast 10.0.1.255
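
A plain grep for the address also matches longer addresses that merely start with it (for example 10.0.1.2050 is impossible, but 10.0.1.20 would match inside 10.0.1.205). The sketch below compares the address exactly; vip_present is a hypothetical helper and accepts the output of either ifconfig or ip addr on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given IPv4 address appears as an
# "inet" address in ifconfig / "ip addr" output read from stdin.
vip_present() {
  local vip="$1"
  awk -v vip="$vip" '
    $1 == "inet" {
      addr = $2
      sub(/^addr:/, "", addr)    # older net-tools print "inet addr:10.0.1.205"
      sub(/\/.*$/, "", addr)     # "ip addr" prints "10.0.1.205/24"
      if (addr == vip) found = 1
    }
    END { exit !found }
  '
}

# Usage on the host that should hold the VIP:
#   ifconfig | vip_present 10.0.1.205 && echo "VIP is here"
```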

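Failure Recovery

1. Check that the MySQL nodes have started normally.

2. Identify the current master. The final log output in "Failure Simulation" shows that the new master is db02; check which binlog file it is currently using: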

$ mysql -e 'show master status\G' | grep 'File'
             File: mysql-bin.000001

3. Start the MySQL service on db01:

$ systemctl start mysqld

4. Make db01 a slave that replicates from the new master, db02:

mysql> change master to 
master_host='10.0.1.201',
master_user='repluser',
master_password='123' ,
MASTER_AUTO_POSITION=1;

mysql> start slave;

5. On the manager node, add db01 back into the MHA configuration:

$ masterha_conf_host --command=add --conf=/etc/mha/app1.cnf --hostname=10.0.1.200 --block=server1 --params="port=3306"

6. On the manager node, repair the binlog server (the puller process is no longer running, so restart it against the new master):

$ ps aux | grep mysqlbinlog
root      19798  0.0  0.2 112712   968 pts/1    R+   21:51   0:00 grep --color=auto mysqlbinlog
$ rm -rf /data/binlog_server/*
$ cd /data/binlog_server/
$ nohup mysqlbinlog  -R --host=10.0.1.201 --user=mhauser --password=mha --raw  --stop-never mysql-bin.000001 &

7. Check that SSH and replication are healthy:

$ masterha_check_ssh  --conf=/etc/mha/app1.cnf
$ masterha_check_repl  --conf=/etc/mha/app1.cnf 

8. Start MHA:

$ mha_check.sh start
it is start !

9. Check the MHA status:

$ mha_check.sh status
app1 (pid:19908) is running(0:PING_OK), master:10.0.1.201

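Online Switchover

Switching by Command

Keep the following points in mind when switching this way:

  1. Run FTWRL (FLUSH TABLES WITH READ LOCK) on the original master first; otherwise the master and the slaves can become inconsistent;
  2. Switch the VIP manually;
  3. Pull the binlog from the new master again.

1. Taking a switch to db01 as the new master as an example, run the following command on the manager host: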

$ masterha_master_switch  --conf=/etc/mha/app1.cnf --master_state=alive --new_master_host=10.0.1.200 --orig_master_is_new_slave --running_updates_limit=10000
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.0.1.201(10.0.1.201:3306)? (YES/no): yes

Starting master switch from 10.0.1.201(10.0.1.201:3306) to 10.0.1.200(10.0.1.200:3306)? (yes/NO): yes

master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes

2. Switch the VIP manually by running the following commands on the manager host:

# Remove the VIP from db02
$ ssh db02 'ifconfig eth0:1 down';
# Configure the VIP on db01
$ ssh db01 'ifconfig eth0:1 10.0.1.205/24';

3. On the manager host, pull the binlog from db01 again:

$ pkill mysqlbinlog
$ cd /data/binlog_server && rm -rf ./*
# Check which binlog file db01 is currently using
$ ssh db01 'mysql -e "show master  status;"';
File    Position        Binlog_Do_DB    Binlog_Ignore_DB        Executed_Gtid_Set
mysql-bin.000002        194                     cc1a793f-941e-11ea-991d-000c29ec139d:1-4
# Continuously pull that binlog file
$ nohup mysqlbinlog -R --host=10.0.1.200 --user=mhauser --password=mha --raw --stop-never mysql-bin.000002 &

4. Restart MHA and check its status:

# Restart
$ mha_check.sh restart
# Check the status
$ mha_check.sh status
app1 (pid:21891) is running(0:PING_OK), master:10.0.1.200
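
The manual sequence above is easy to get out of order, so it can be useful to print it as a checklist first. The following dry-run sketch only prints the commands from steps 1 to 4 and executes nothing; manual_switch_plan and its arguments are hypothetical, and the binlog file name must still be looked up on the new master.

```shell
#!/usr/bin/env bash
# Hypothetical dry run: print (do not execute) the commands of a manual switchover.
#   $1 = old master host name   $2 = new master host name
#   $3 = new master IP          $4 = binlog file currently used by the new master
manual_switch_plan() {
  local old="$1" new="$2" new_ip="$3" binlog="$4"
  cat <<EOF
masterha_master_switch --conf=/etc/mha/app1.cnf --master_state=alive --new_master_host=${new_ip} --orig_master_is_new_slave --running_updates_limit=10000
ssh ${old} 'ifconfig eth0:1 down'
ssh ${new} 'ifconfig eth0:1 10.0.1.205/24'
pkill mysqlbinlog
cd /data/binlog_server && rm -rf ./*
nohup mysqlbinlog -R --host=${new_ip} --user=mhauser --password=mha --raw --stop-never ${binlog} &
mha_check.sh restart
EOF
}

# Usage on the manager host:
#   manual_switch_plan db02 db01 10.0.1.200 mysql-bin.000002
```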

Switching by Script

1. On the manager host, add the following entry to the MHA configuration file:

$ vim /etc/mha/app1.cnf
[server default]
master_ip_online_change_script=/usr/local/bin/master_ip_online_change

2. On the manager host, stop the MHA service:

$ mha_check.sh stop
Stopped app1 successfully.

3. Taking a switch to db03 as the new master as an example, run the following command on the manager host:

$ masterha_master_switch  --conf=/etc/mha/app1.cnf --master_state=alive --new_master_host=10.0.1.202 --orig_master_is_new_slave --running_updates_limit=10000
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.0.1.200(10.0.1.200:3306)? (YES/no): yes

Starting master switch from 10.0.1.200(10.0.1.200:3306) to 10.0.1.202(10.0.1.202:3306)? (yes/NO): yes

Because the online switchover script is now configured, the VIP has already been moved to db03 automatically.

4. On the manager host, pull the binlog from db03 again:

$ pkill mysqlbinlog
$ cd /data/binlog_server && rm -rf ./*
# Check which binlog file db03 is currently using
$ ssh db03 'mysql -e "show master  status;"';
File    Position        Binlog_Do_DB    Binlog_Ignore_DB        Executed_Gtid_Set
mysql-bin.000001        1079                    cc1a793f-941e-11ea-991d-000c29ec139d:1-4
# Continuously pull that binlog file
$ nohup mysqlbinlog -R --host=10.0.1.202 --user=mhauser --password=mha --raw --stop-never mysql-bin.000001 &

5. Restart MHA and check its status:

# Restart
$ mha_check.sh restart
# Check the status
$ mha_check.sh status
app1 (pid:22119) is running(0:PING_OK), master:10.0.1.202