I have been using fwrite() and fread() from the data.table package for years. However, I recently noticed that a change in version 1.11.0 broke my code. This is not a recent change; I must not have updated my R packages for a while. Or I haven't used the related code for a while. Or I simply hadn't noticed the change. The change is (quoting the NEWS):
Numeric data that has been quoted is now detected and read as numeric.
Quoted numbers used to be read as character precisely because they were quoted. Now, for whatever reason, data.table has decided to read quoted numbers as numeric, quotes notwithstanding.
The old code still runs; I just don't get the data I expect. I have data with values like "0001, 0002, 0003, …" in the id column. Now the id column is read as "1, 2, 3, …". The change does not generate an error message immediately; the error surfaces ten steps down the line, where I do character operations on the id column.
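A minimal reproduction of the behavior, along with the quick fix I used in the meantime (colClasses is data.table's standard way to force a column type; the sample values are the zero-padded ids above):

```r
library(data.table)

csv <- 'id\n"0001"\n"0002"\n"0003"\n'

# Since data.table 1.11.0, quoted numbers are detected and read as numeric,
# so the leading zeros are silently lost.
fread(csv)$id

# Quick workaround: force the id column back to character.
fread(csv, colClasses = c(id = "character"))$id   # "0001" "0002" "0003"
```

This works, but it means every call site that reads this file must remember to pass colClasses.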
At first, I was angry at data.table for making this change. After taking a deep breath, I agree the root of the problem is not data.table. Rather, it is the lack of metadata in the CSV format. To fundamentally solve this problem, that is, to let R programs always read CSV files as expected, the solution is to embed column definitions inside the CSV file. Such a solution, which I didn't know about before, actually exists.
The CSVY format adds a YAML frontmatter block to a CSV file. Besides other descriptive information, the frontmatter includes column definitions, including each column's type, like this:

- name: x
  type: character
- name: y
  type: integer
- name: z
  type: number
With this information saved inside the file itself, the CSV can always be read back as expected.
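For illustration, here is a sketch of what a small CSVY file with a character id column might look like on disk (simplified; real tools typically record further metadata, such as the source and the column separator):

```
---
schema:
  fields:
  - name: id
    type: character
  - name: value
    type: number
---
id,value
"0001",3.14
"0002",2.72
```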
Using the CSVY format with data.table is straightforward. Both fwrite() and fread() have had the boolean parameter yaml since version 1.12.4, so we can use fwrite(..., yaml = TRUE) to save a CSVY file, and then fread(..., yaml = TRUE) to load it back. This feature provides a long-term solution for attaching column definitions to CSV files.
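A round-trip sketch of the workflow (the temporary file and the column names are illustrative):

```r
library(data.table)

dt <- data.table(id = c("0001", "0002", "0003"), value = c(1.5, 2.5, 3.5))

f <- tempfile(fileext = ".csvy")

# yaml = TRUE prepends a YAML frontmatter block that records each
# column's name and type ahead of the usual CSV content.
fwrite(dt, f, yaml = TRUE)

# fread() parses the frontmatter and applies the recorded types,
# so id comes back as character with its leading zeros intact.
dt2 <- fread(f, yaml = TRUE)
str(dt2$id)
```

No colClasses at the call site, and no ten-steps-later surprise: the type information travels with the data.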