How to handle a Galera crash?
First, make sure you are running the latest stable Galera release so that you do not run into bugs that have already been fixed. Start by inspecting the MySQL error log on each Galera node, since Galera writes its messages to this file. Look for any line that indicates an error or a failure. If the Galera nodes are still responsive, also collect the output of the following statements:
mysql> SHOW FULL PROCESSLIST;
mysql> SHOW PROCESSLIST;
mysql> SHOW ENGINE INNODB STATUS;
mysql> SHOW STATUS LIKE 'wsrep%';
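Once the `SHOW STATUS LIKE 'wsrep%'` output has been collected, a few of its variables give a quick health summary. The following is a minimal Python sketch, assuming the rows have already been fetched as (name, value) pairs by whatever client you use. The function name `check_wsrep_status` and the choice of which variables to check are illustrative assumptions, not part of Galera itself. Note that a healthy node can legitimately report a state other than Synced (for example while acting as a donor), so treat the result as a hint rather than a verdict.

```python
def check_wsrep_status(rows):
    """Return a list of human-readable warnings for a node's wsrep status.

    `rows` is an iterable of (variable_name, value) pairs, such as the
    rows returned by SHOW STATUS LIKE 'wsrep%'.
    """
    status = {name: value for name, value in rows}

    # Values usually seen on a healthy, fully joined node (assumed baseline).
    expected = {
        "wsrep_cluster_status": "Primary",
        "wsrep_connected": "ON",
        "wsrep_ready": "ON",
        "wsrep_local_state_comment": "Synced",
    }

    problems = []
    for name, want in expected.items():
        got = status.get(name)
        if got is None:
            problems.append(f"{name} missing from output")
        elif got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems


if __name__ == "__main__":
    # Illustrative rows only, not taken from a real cluster.
    sample = [
        ("wsrep_cluster_status", "non-Primary"),
        ("wsrep_connected", "ON"),
        ("wsrep_ready", "OFF"),
        ("wsrep_local_state_comment", "Initialized"),
    ]
    for line in check_wsrep_status(sample):
        print(line)
```

An empty list means none of the checked variables deviated from the assumed baseline; it does not rule out other problems, so the error log remains the primary source.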
Next, inspect the system resources: check the network, firewall, disk usage and memory utilization, and review the general system logs (syslog, messages, dmesg). If there is still no indication of the problem, you may have hit a bug, which you can report on the Galera bugs page on Launchpad, or you can request technical support directly from the vendor (Codership, Percona or MariaDB). You can also join the Galera Google Group mailing list to ask for community help.
- If you are using rsync for state transfer and a node crashes before the state transfer is complete, the rsync process might hang forever, occupying the port and preventing the node from restarting. The problem shows up as ‘port in use’ in the server error log. Find the orphaned rsync process and kill it manually.
- Before re-initializing the cluster, you can determine which DB node has the most up-to-date data by comparing the wsrep_last_committed value across nodes. The node holding the highest value is the recommended reference node for bootstrapping the cluster.
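The comparison described in the last point can be sketched in a few lines of Python. This assumes the wsrep_last_committed value has already been collected from every node (for example with SHOW STATUS LIKE 'wsrep_last_committed'). The function name `pick_reference_node` and the node names are hypothetical, and the numbers are made up for illustration.

```python
def pick_reference_node(last_committed):
    """Return the node name with the highest wsrep_last_committed value.

    `last_committed` maps a node name to its wsrep_last_committed value.
    Values may be ints or numeric strings, as a MySQL client returns them.
    On a tie, the first node in the mapping's order is returned.
    """
    if not last_committed:
        raise ValueError("no nodes supplied")
    return max(last_committed, key=lambda node: int(last_committed[node]))


if __name__ == "__main__":
    # Hypothetical node names and values.
    values = {"db1": "1042", "db2": "1057", "db3": "1049"}
    print(pick_reference_node(values))  # prints: db2
```

The values are converted with int() on purpose: comparing them as strings would rank "999" above "1057".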