Friday, July 15, 2011

Mac / Microsoft Excel / newlines (\r \n)

It is a frequently the case that the business department hands over Excel files to the engineering department for some type of data processing. The first step here is to convert this to a proper comma separated text file (csv).

If you are doing this conversion using Microsoft Excel on a Mac, you'll note that the resulting file does not have Unix-style newlines. A Unix new line is the 0x0a character, also written as \n. What Excel produces is the 0x0d character, also written as \r.

Most Linux commands do not recognize \r as a line ending. There are several ways to convert the \r characters to proper Linux style line endings. Using the vi editor is a common method. However, there is also the issue that sometimes if the Excel spreadsheet has blank columns, Excel insists on writing a possibly large number of \r characters at the end of the  csv file. The vi method would write a newline per each of these \r characters and that is not ideal.

Instead, you could use this perl one-liner to accomplish both : turn all \r into \n except for the trailing \r characters :

perl -ne 's/([^\r])\r/$1\n/g; s/\r//g; print;'  imported.csv

The regular expression replaces any non \r character followed by \r with the non \r character followed by a \n. Since the trailing \r characters do not match this pattern, they are thus ignored. The second regexp removes these \r characters.

No comments: