Description
What could go wrong? I was exporting an Excel spreadsheet to a CSV (and parsing it with Ruby) for an anonymous Fortune 500 quasi-government/quasi-private corporation. The spreadsheet looked something like this:
| Asset ID | Software Asset Name | Other Column 1 | Other Column 2 |
| --- | --- | --- | --- |
| SAN12344 | Some App | | |
| SAN8272 | Some Other App | | |
Code
Can you spot the bug? 😜 No exceptions are thrown, but the resulting `@inventory_list` is an empty array. Unit tests with some manually generated CSVs pass.
require 'csv'

csv = CSV.open(path, headers: true, skip_blanks: true)

# Keep only rows that have an Asset ID, strip whitespace from keys and values,
# and drop empty cells.
@inventory_list = csv
  .to_a
  .map(&:to_h)
  .select { |row| row.key?('Asset ID') }
  .map { |row| row.transform_keys { |k| k && k.strip } }
  .map { |row| row.transform_values { |v| v&.strip } }
  .map { |row| row.delete_if { |k, v| k.nil? || k.empty? || v.nil? || v.empty? } }
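To see why the `select` comes back empty, here's a minimal repro sketch. The fixture filename is hypothetical; any spreadsheet saved with Excel's "CSV UTF8" option should behave the same way:

```ruby
require 'csv'

# Hypothetical fixture exported via Excel's "CSV UTF8" option.
path = 'software_inventory_list_bom_spec.csv'

first_row = CSV.open(path, headers: true, skip_blanks: true).first.to_h

first_row.keys.first       # => "\uFEFFAsset ID" -- the BOM is glued onto the first header
first_row.key?('Asset ID') # => false, so the select above drops every single row
```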
Fix
csv = CSV.open(path, 'r:bom|utf-8', headers: true, skip_blanks: true)
Catch the change?
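Here's the same check with the fixed mode string, sketched under the same assumptions as the repro above:

```ruby
require 'csv'

path = 'software_inventory_list_bom_spec.csv'  # same hypothetical fixture

# 'r:bom|utf-8' tells Ruby to consume a leading BOM (if any) before parsing.
first_row = CSV.open(path, 'r:bom|utf-8', headers: true, skip_blanks: true).first.to_h

first_row.keys.first       # => "Asset ID"
first_row.key?('Asset ID') # => true, so the pipeline works again
```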
Root Cause
$ xxd software_inventory_list_bom_spec.csv | head -1
00000000: efbb bf41 7373 6574 2049 442c 2053 6f66 ...Asset ID, Sof
Turns out the file has a UTF-8 BOM! I'd only heard of these, and what little I knew was that a BOM is supposed to be handled transparently by a language's file-parsing library, as long as it supports UTF-8.
And, turns out, MS Excel (2016, to be more specific) inserts a BOM if you select the "CSV UTF8" option in "Save As"!
And both vim and Notepad++ auto-detected the BOM, so I didn't even suspect that my Ruby code was somehow the culprit.
Well, after more hours of debugging than I'd like to admit (the lag of ssh through Chrome, through Remote Desktop, is pretty high), it turns out that by default (at least on Amazon Linux 2, Ruby 2.8), the BOM is not auto-detected and is just included as part of the "Asset ID" column.
The bytes 0xEF,0xBB,0xBF were included in the string! And worse, irb didn't render them as characters! It didn't render them at all, so you can't tell by printing the strings to the console that anything is different. The only way to tell that the string "Asset ID" had those three extra bytes at the beginning was to do a byte count. 🤦
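A quick way to make the invisible visible is to compare character and byte counts, or dump the leading bytes. A sketch, with the BOM'd header hard-coded for illustration:

```ruby
header = "\uFEFFAsset ID"  # what the first header looks like without the fix

puts header       # prints what looks like a plain "Asset ID" -- the BOM doesn't show up
header.length     # => 9  (8 characters of "Asset ID" plus the one-character BOM, U+FEFF)
header.bytesize   # => 11 (8 bytes plus the 3 BOM bytes)
header.bytes.first(3).map { |b| format('0x%02X', b) }  # => ["0xEF", "0xBB", "0xBF"]
```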
It turns out, if you ask Excel to export a CSV by using the "CSV UTF8" option, it'll insert a BOM, BUT it won't insert a BOM if you use the "CSV" option, the one that doesn't mention UTF8. What confusing UI...
If you want to insert a BOM in a file for testing (and, say, make a unit test out of it like I did 😼 ), there's a vim command for it:
vim -e -s +"set bomb|set encoding=utf-8|wq" filename
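If you'd rather build the fixture from Ruby (say, right inside the spec), writing the BOM yourself works too; the filename below is just an example:

```ruby
# U+FEFF encodes to the three bytes 0xEF 0xBB 0xBF in UTF-8.
File.open('software_inventory_list_bom_spec.csv', 'w:utf-8') do |f|
  f.write("\uFEFF")
  f.write("Asset ID, Software Asset Name\nSAN12344, Some App\n")
end
```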
Takeaways
- A CSV seems simple, like a pumpkin: nice and smooth on the outside, but inside it can get pretty gnarly.
- When in MS Excel, make sure you know whether you're exporting a CSV with or without a BOM.
- The `file` utility is helpful.
- Ruby doesn't automatically handle a BOM. You have to ask it to check whether it sees one, which seems like a strange default (see the sketch after this list).
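The "asking" amounts to passing a `bom|utf-8` external encoding when opening the file; a minimal sketch, using the same hypothetical fixture name as above:

```ruby
# 'bom|utf-8' means: consume a leading BOM if present, otherwise just read as UTF-8.
File.open('software_inventory_list_bom_spec.csv', 'r:bom|utf-8') do |f|
  f.read.start_with?("\uFEFF")  # => false -- the BOM never makes it into the string
end
```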
Extra Reading
- https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
- https://estl.tech/of-ruby-and-hidden-csv-characters-ef482c679b35
- https://stackoverflow.com/q/543225/423943
- https://theonemanitdepartment.wordpress.com/2014/12/15/the-absolute-minimum-everyone-working-with-data-absolutely-positively-must-know-about-file-types-encoding-delimiters-and-data-types-no-excuses/