clojure-dev

Issues: https://clojure.atlassian.net/browse/CLJ | Guide: https://insideclojure.org/2015/05/01/contributing-clojure/
rutledgepaulv 2019-08-28T00:42:42.013600Z

I’ve been running into some issues parsing CSVs generated by Excel. It looks like Excel (at least on OS X) adds a zero-width character at the beginning as a byte order mark, and it ends up being returned as part of the string in the first cell of the first row by org.clojure/data.csv. Thoughts on whether that should be handled in org.clojure/data.csv or whether it should be the consumer’s responsibility? My understanding is that it’s part of the metadata of the file and not intended to be part of the content. If the consensus is that data.csv should handle it, I can create a ticket and patch.

2019-08-28T00:48:02.015100Z

Do you know if it is a standard Unicode byte order mark byte sequence, or something different than those? I have a half memory that some Java Reader implementations might skip over those for you, but do not know whether data.csv uses those.

rutledgepaulv 2019-08-28T00:54:45.016200Z

I think it’s standard. I guess it might be a consequence of Excel using UTF-16 with a BOM and the file being read as UTF-8, where the BOM is interpreted as part of the content instead of indicating the byte order?

2019-08-28T00:55:58.017Z

If data.csv lets you pass in a Java Reader that you create yourself, would it be easy to try an experiment with creating a UTF-16 encoding Reader?

rutledgepaulv 2019-08-28T00:56:39.017500Z

yeah I can give that a shot. thanks.

2019-08-28T00:59:16.018Z

It looks like there is a section of data.csv's README that mentions byte order marks, with a couple of suggested ways of handling them.

rutledgepaulv 2019-08-28T01:02:03.018300Z

🤦

rutledgepaulv 2019-08-28T01:02:28.018900Z

thanks! fwiw reading with a utf-16 encoding reader stripped that char too
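
For anyone following along, here is a minimal sketch of the decoder behavior discussed above, using only the JDK. The class name `BomDecodeDemo` and the sample bytes are made up for illustration; this is not code from data.csv. It assumes the usual JDK behavior: the `UTF-8` charset leaves a leading BOM in the decoded text as U+FEFF, while the `UTF-16` charset uses the BOM to pick the byte order and consumes it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Decode every byte through an InputStreamReader, which is roughly
    // what a CSV parser sees when it is handed a Reader.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "a,b" encoded as UTF-8 with a BOM (EF BB BF) in front.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        // "a,b" encoded as big-endian UTF-16 with a BOM (FE FF) in front.
        byte[] utf16WithBom = {(byte) 0xFE, (byte) 0xFF, 0, 'a', 0, ',', 0, 'b'};

        String s8 = decode(utf8WithBom, StandardCharsets.UTF_8);
        String s16 = decode(utf16WithBom, StandardCharsets.UTF_16);

        // The UTF-8 decoder keeps the BOM as a leading U+FEFF character.
        System.out.println(s8.length() + " " + (s8.charAt(0) == '\uFEFF'));  // 4 true
        // The UTF-16 decoder consumes the BOM.
        System.out.println(s16.length() + " " + s16.equals("a,b"));          // 3 true
    }
}
```

If that holds, the stray character in the first cell is just U+FEFF surviving UTF-8 decoding.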

seancorfield 2019-08-28T01:11:36.020500Z

FWIW, I run into this problem trying to use CLI/`deps.edn` on Windows sometimes. If I create a `deps.edn` file via echo or something similar in PowerShell, it ends up with a UTF-16 BOM, and tools.deps reads it in (and barfs on the content) rather than skipping it.

rutledgepaulv 2019-08-28T01:13:12.021200Z

yeah. looks like the safe option is to use a BOMInputStream and auto-detect / skip
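
A rough sketch of the detect-and-skip idea, using only the JDK rather than a real `BOMInputStream` (Apache Commons IO ships a class by that name, which also covers other encodings). The class and method names here are hypothetical, and this version only handles the UTF-8 BOM: it peeks at the first three bytes and pushes them back if they are not `EF BB BF`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class SkipUtf8Bom {
    // Hypothetical helper: returns a UTF-8 Reader positioned after a leading
    // UTF-8 BOM if one is present, or at the very start of the stream otherwise.
    static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; short streams may supply fewer.
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: put the bytes back so the caller sees the whole content.
        if (!hasBom && n > 0) {
            pb.unread(head, 0, n);
        }
        return new InputStreamReader(pb, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd', ',', 'n'};
        try (Reader r = utf8ReaderSkippingBom(new ByteArrayInputStream(csv))) {
            // First character is 'i', not U+FEFF.
            System.out.println((char) r.read());
        }
    }
}
```

The resulting Reader could then be handed to whatever consumes it, for example a CSV parser that accepts a `java.io.Reader`.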

alexmiller 2019-08-28T01:38:18.022Z

There is a huge amount of backstory on bom and java readers going back years