

Use CSVY Format for Data Storage

Feng Jiang
Last Updated: 2020-11-18

I have been using fread() and fwrite() from the data.table package for years. However, I recently noticed that a change in version 1.11.0 broke my code. This is not a recent change: either I had not updated my R packages for a while, or I had not run the affected code for a while, or I simply had not noticed. The change is (quote):

Numeric data that has been quoted is now detected and read as numeric.

Quoted numbers used to be read as characters, precisely because they are quoted. Now, for whatever reason, data.table reads them as numbers despite the quotes.

The old code still runs; I just don't get the data I expect. My data uses "0001, 0002, 0003, ..." in the id column, and now the id column is read as "1, 2, 3, ...". The change does not produce an error message immediately. The error shows up ten steps down the line, where I need to do character operations on the id column.
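To make the problem concrete, here is a minimal sketch. The CSV content and column names are made up for illustration, and the colClasses workaround is my own addition rather than something the change note prescribes. It assumes data.table 1.11.0 or later.

```r
library(data.table)

# Hypothetical CSV lines with quoted, zero-padded ids
csv_lines <- c(
  'id,value',
  '"0001",10',
  '"0002",20',
  '"0003",30'
)

# Default read: on data.table >= 1.11.0 the quoted ids are expected
# to be detected as numeric, so the leading zeros are lost
dt_default <- fread(text = csv_lines)
class(dt_default$id)

# Stating the column type explicitly keeps the ids as character
dt_typed <- fread(text = csv_lines, colClasses = list(character = "id"))
class(dt_typed$id)

# Character operations on the id column work again
substr(dt_typed$id, 1, 3)
```

This works, but the type information lives in the reading code, not in the data file, so every script that reads the file has to repeat it.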

At first, I was angry at data.table for making this change. After taking a deep breath, I agreed that the root of the problem is not data.table. Rather, it is the lack of meta information in the CSV format. To solve this problem fundamentally, meaning that R programs can always read CSV files as expected, the solution is to embed column definitions inside the CSV files. This solution, which I didn't know about before, actually exists.

CSVY format adds YAML frontmatter to CSV files. Besides other descriptive information, the YAML frontmatter includes column definitions like this:

schema:
  fields:
  - name: x
    type: numeric
  - name: y
    type: character
  - name: z
    type: POSIXct

With this information saved inside the CSV file, the file can be read as expected.

Using the CSVY format with R data.table is straightforward. Both fread() and fwrite() have had the boolean parameter yaml since 1.12.4, so we can just use fwrite(..., yaml = TRUE) to save the CSVY file, and then use fread(..., yaml = TRUE) to load the CSVY file. This feature provides a long-term solution for giving column definitions to CSV files.
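A round trip might look like the sketch below. The table contents and the temporary file are hypothetical, and I am assuming data.table 1.12.4 or later with the yaml package installed, since the yaml option relies on it.

```r
library(data.table)

# Hypothetical table with zero-padded character ids
dt <- data.table(
  id    = c("0001", "0002", "0003"),
  value = c(10.5, 20.25, 30)
)

# Write a CSVY file: YAML front matter with the column definitions,
# followed by the usual CSV body
path <- tempfile(fileext = ".csvy")
fwrite(dt, path, yaml = TRUE)

# Print the raw file to inspect the YAML header
writeLines(readLines(path))

# Read it back; the column types are taken from the YAML header
dt_back <- fread(path, yaml = TRUE)
class(dt_back$id)
```

The reading code no longer has to know the column types. They travel with the file.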


© 2020 fengjiang.me