Monday, October 18, 2010

Mysql - LOAD DATA INFILE could lose data

Last week, when we encountered a sudden data loss in mysql, the first thing to do was to pour through the bin logs looking for "DELETE" statements. It took quite a while to figure out that the data "loss" was more of a change in Ids in one table, and that had to do with a LOAD DATA INFILE statement.

To illustrate assume we have these two tables:

vegetables (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  name varchar(255) NOT NULL,
  UNIQUE KEY uk_name (name)

sales (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  saledate datetime,
  vegetableId  bigint(10) NOT NULL

The sales table has information about vegetable sales, each sales record has a connection to a vegetable record.

If we load data into the vegetables table using a LOAD csv of this sort, it is possible to break the existing foreign key relationships from the sales records to the vegetables records:

load data infile '/path/to/file.csv' replace into table vegetables;

The problem is the keyword "REPLACE" used here. It tells mysql what to do if it finds that any record in the csv file violates a key constraint. In this example, if the csv file had a record with the vegetable name "brocolli", and brocolli was already in the vegetables table, then the keyword "REPLACE" tells mysql to replace the row on the table with the new data from the csv file.

This still seems to be not a problem as replacing "brocolli" with "brocolli" is certainly not earth shattering. But what about the id? This is an auto_increment field which means that mysql will over-write the old id with a new auto incremented one. And that means that any sales record that referenced brocolli is now left with a pointer to nowhere. In other words, as far as practical matters go, all sales of brocolli have now disappeared.

This problem would not have happened if the sales table explicitly specified the FOREIGN KEY constraint on its vegetableId field.

In any case, is this data loss so terrible? Once we see the error we made, could we not quickly correct it, after all it is just a change in the ids, and the old sales records are still there in the table. Well think about it, even though those records are there, what if we had replacements for more than one vegetable? Let's assume both "brocolli" and "arugala" were replaced. How can we determine which vegetableIds were for brocolli and which were for arugala? The ids in the vegetable table have been replaced so it is impossible to make the connection to here from the sales table. No, this is a bad one.

In fact, an incident similar to this happened where I work. We had to restore the db to get this data back, it was not a pleasant experience.

No comments: