
We wanted to optimize and test backups for one of our new large-scale setups before finalizing the backup plan. The challenge was that the data volume was almost 30x our normal volume. With that much data, incremental backups looked like a good candidate to test. We wrote a wrapper script to save state between full and incremental backups and ran some tests with smaller data sets. It worked fine. The version of xtrabackup we were testing was 1.6.2.
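
A wrapper of this kind could be sketched roughly as follows. This is not our actual script: the JSON state layout, the function names and the example paths are all assumptions made for illustration, and the sketch only decides what the next run should be based on; it does not invoke xtrabackup itself.

```python
import json
import os


def load_state(path):
    """Read the saved state, or return an empty state if none exists yet."""
    if not os.path.exists(path):
        return {"last_full": None, "last_backup": None}
    with open(path) as fh:
        return json.load(fh)


def plan_next_backup(state, new_dir, force_full=False):
    """Decide whether the next run is a full or an incremental backup."""
    if force_full or state["last_full"] is None:
        return {"type": "full", "target": new_dir, "base": None}
    # An incremental is taken relative to the most recent backup of any kind.
    return {"type": "incremental", "target": new_dir, "base": state["last_backup"]}


def record_backup(state, plan):
    """Return the state as it should look after the planned backup succeeded."""
    new_state = dict(state)
    if plan["type"] == "full":
        new_state["last_full"] = plan["target"]
    new_state["last_backup"] = plan["target"]
    return new_state


def save_state(path, state):
    """Write the state via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)
```

The state should only be recorded after a backup has finished successfully, so that a failed run never becomes the base of the next incremental.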

The incremental backups were completing well within 30 minutes. We then set out to test what would happen when the incremental diff was large. During this test, we started getting backup failures with the following error:

20110726_225312.log-110727 00:56:10 innobackupex: Starting to backup .frm, .MRG, .MYD, .MYI,
20110726_225312.log-innobackupex: .TRG, .TRN, .ARM, .ARZ, .CSM, .CSV and .opt files in
20110726_225312.log-innobackupex: subdirectories of ‘/var/lib/mysql’
20110726_225312.log:innobackupex: Error: Broken pipe at /var/backups/xtrabackup/bin/innobackupex line 336.
20110726_225312.log-Backup failed.
20110726_225312.log-Deleting bad data directory…
20110726_225312.log-Done. Removed /ddmysql/backups/20110726_225312

Initially we thought this was a one-off and reran the backup several times. It failed consistently. We needed to get to the bottom of it in order to finalize the backup plan, so we ran the innobackupex script through a debugger to find out where it was failing. The broken pipe error was triggered when the script tried to ping the MySQL server so that the connection would not time out. It sends this ping after the number of seconds given in the “mysql_keep_alive_timeout” variable, which was set to around 1800. Our conclusion was that with a smaller incremental, the backup finished before the script ever needed to send the keep-alive ping. We reran the script with a much larger timeout value for a larger incremental data set, and the backup completed successfully.
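
The timing argument can be made concrete with a small sketch. This is not innobackupex code, only an illustration of the reasoning: the function name is hypothetical, the durations other than 1800 are made-up example values, and it assumes a ping is attempted every time the keep-alive interval elapses while the backup is still running.

```python
def keep_alive_pings(copy_seconds, keep_alive_seconds):
    """Return the elapsed times (in seconds) at which a keep-alive ping
    would be attempted during a backup lasting copy_seconds."""
    if keep_alive_seconds <= 0:
        raise ValueError("keep_alive_seconds must be positive")
    # One ping each time the interval elapses before the backup finishes.
    return list(range(keep_alive_seconds, copy_seconds + 1, keep_alive_seconds))


# Small incremental (25 minutes, illustrative): finishes before the first ping.
print(keep_alive_pings(1500, 1800))   # []

# Large incremental (illustrative duration): pings are attempted, and this is
# where we hit the broken pipe.
print(keep_alive_pings(4000, 1800))   # [1800, 3600]

# Same large incremental with a much larger timeout: the ping is never reached.
print(keep_alive_pings(4000, 36000))  # []
```

Seen this way, raising the timeout does not repair the connection; it only means the script never gets as far as pinging on it.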

The script seems to lose its connection handle during the course of execution, and when it tries to ping on that handle it encounters a broken pipe. We are wondering whether there are other tried and tested ways to overcome this issue. We have logged a bug with the xtrabackup team to see if there are other options.