I’ve been running into some issues parsing CSVs generated by Excel. It looks like Excel (at least on macOS) adds a zero-width character at the beginning as a byte order mark, and it ends up being returned as part of the string in the first cell of the first row by org.clojure/data.csv. Thoughts on whether that should be handled in org.clojure/data.csv or whether it should be a consumer’s responsibility? My understanding is that it’s part of the metadata of the file and not intended to be part of the content. If the consensus is that data.csv should handle it, I can create a ticket and a patch.
Do you know if it is a standard Unicode byte order mark byte sequence, or something different? I have a half-memory that some Java Reader implementations might skip over those for you, but I don’t know whether data.csv uses those.
I think it’s standard. I guess it might be a consequence of Excel writing UTF-16 with a BOM and the file being read as UTF-8, where the BOM is interpreted as part of the content instead of indicating the byte order?
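To illustrate the symptom: Java’s built-in UTF-8 decoder does not treat a leading BOM specially, so it comes through as U+FEFF at the start of the decoded string, which is exactly what shows up in the first cell. A minimal stdlib sketch (not data.csv-specific):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // A UTF-8 BOM (EF BB BF) followed by the header content "id,name"
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                        'i', 'd', ',', 'n', 'a', 'm', 'e'};
        String s = new String(bytes, StandardCharsets.UTF_8);
        // Java's UTF-8 decoder does NOT strip the BOM; it decodes to U+FEFF
        System.out.println(s.charAt(0) == '\uFEFF'); // prints true
        System.out.println(s.length());              // prints 8, i.e. "\uFEFFid,name"
    }
}
```

Any CSV parser sitting on top of such a Reader will see `\uFEFFid` as the first field unless something upstream strips the mark.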
If data.csv lets you pass in a Java Reader that you create yourself, would it be easy to try an experiment with creating a UTF-16 encoding Reader?
Yeah, I can give that a shot. Thanks.
It looks like there is a section of data.csv's README that mentions byte order marks, with a couple of suggested ways of handling them.
🤦
Thanks! FWIW, reading with a UTF-16 encoding Reader stripped that char too.
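That matches the documented behavior of Java’s `UTF-16` charset: it consumes a leading BOM to pick the byte order, so the mark never reaches the caller. A small sketch of the experiment described above (the byte values here are illustrative, not taken from the actual Excel file):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf16ReaderDemo {
    public static void main(String[] args) throws IOException {
        // "a" encoded as UTF-16BE with a BOM: FE FF 00 61
        byte[] bytes = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16)) {
            // The UTF-16 charset uses the BOM to determine endianness and
            // strips it, so the first char the Reader hands back is 'a'
            System.out.println((char) r.read()); // prints a
        }
    }
}
```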
FWIW, I run into this problem trying to use the CLI/`deps.edn` on Windows sometimes. If I create a `deps.edn` file via `echo` or something similar in PowerShell, it ends up with a UTF-16 BOM, and tools.deps reads it in (and barfs on the content) rather than skipping it.
Yeah. Looks like the safe option is to use a `BOMInputStream` and auto-detect/skip the BOM.
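`BOMInputStream` comes from Apache Commons IO; the same peek-and-skip idea can be hand-rolled with just the stdlib, which may be easier to reason about. A minimal sketch that handles only the UTF-8 BOM (Commons IO’s version also covers the UTF-16/32 variants):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSkip {
    // Peek at the first 3 bytes and consume a UTF-8 BOM (EF BB BF) if present;
    // otherwise push the bytes back so the caller sees the stream unchanged.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.read(head, 0, 3);
        boolean bom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: restore what we peeked
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) in.read()); // prints h
    }
}
```

Wrapping the stream before building the Reader means the BOM never reaches the CSV parser at all, regardless of which layer above is doing the decoding.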
There is a huge amount of backstory on BOMs and Java Readers going back years.