Originally published on RPubs, June 2016
After testing the new Feather package and noticing how fast it is, I wanted to learn how Feather compares with other libraries and whether its relative performance degrades on larger data sets.
Here are the highlights of the results:
- On average, Feather read the files 218 times faster than ‘read.csv’, and 33 times faster than ‘fread’
- ‘read.csv’ reads 1 gigabyte of CSV data in 80 seconds, ‘fread’ in 12 seconds and Feather in less than half a second.
- ‘write.csv’ took 90 seconds to write 1 gigabyte of CSV data; writing the same data frame with Feather took 0.67 seconds
- Feather kept its performance even on the largest files tested
- Feather’s files were consistently half the size of the CSV files
- When loaded in memory, the data read by ‘read.csv’ and by Feather were the same size. The data read by ‘fread’ was consistently larger
I tested 32 CSV files. Each file contains 80 variables. The smallest file is 50 megabytes and the largest is a bit under 2 gigabytes. Each file is 50 megabytes larger than the previous one.
The measurements taken are:
- Time it takes to read the file into memory
- Time it takes to write the data into a file
- Size of the file
- Memory usage when file is loaded
The R Markdown document with the test details is available in my GitHub account
1 – Time it takes to read the file into memory
The following plot traces the time it takes ‘read.csv’ and ‘fread’ to read the CSV files, and the time it takes ‘read_feather’ to load the Feather file that contains the same data as the original CSV file.
To calculate the Performance Increase, I divided the time it took ‘read.csv’ and ‘fread’ to read each CSV file by the time it took ‘read_feather’ to load the Feather file containing the same data.
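As a rough illustration of that calculation, here is a minimal sketch in R. It is not the original test script: the data frame is a small synthetic stand-in, and the helper names ‘elapsed’ and ‘performance_increase’ are made up for this example. The timing part only runs if the feather and data.table packages are installed.

```r
# Hypothetical helper: elapsed wall-clock seconds for one evaluation of `expr`.
elapsed <- function(expr) system.time(expr)[["elapsed"]]

# Hypothetical helper: CSV read time divided by Feather read time.
performance_increase <- function(csv_seconds, feather_seconds) {
  csv_seconds / feather_seconds
}

# The timing part needs feather and data.table; it is skipped if either is missing.
if (requireNamespace("feather", quietly = TRUE) &&
    requireNamespace("data.table", quietly = TRUE)) {
  # Synthetic stand-in: 80 numeric columns, far smaller than the files tested.
  df <- as.data.frame(matrix(rnorm(5000 * 80), ncol = 80))
  csv_path <- tempfile(fileext = ".csv")
  feather_path <- tempfile(fileext = ".feather")
  write.csv(df, csv_path, row.names = FALSE)
  feather::write_feather(df, feather_path)

  # Time each reader once on the same data.
  t_read_csv <- elapsed(read.csv(csv_path))
  t_fread    <- elapsed(data.table::fread(csv_path))
  t_feather  <- elapsed(feather::read_feather(feather_path))

  print(performance_increase(c(read.csv = t_read_csv, fread = t_fread), t_feather))
}
```

On a file this small the Feather read can finish below the timer's resolution, in which case the ratio prints as Inf. The ratios only become meaningful at sizes closer to the ones tested here.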
2 – Time it takes to write the data into a file
Here is a comparison of the time it takes ‘write.csv’ and ‘write_feather’ to create files from the same data frame.
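A write comparison of this kind could be sketched as follows. This is an illustration under assumptions, not the original test code: the data frame is synthetic, ‘elapsed’ is a made-up helper, and the ‘row.names = FALSE’ argument is my choice rather than something the original test is known to have used. The Feather timing is skipped when the feather package is not installed.

```r
# Hypothetical helper: elapsed wall-clock seconds for one evaluation of `expr`.
elapsed <- function(expr) system.time(expr)[["elapsed"]]

# Synthetic stand-in data frame with 80 numeric columns.
df <- as.data.frame(matrix(rnorm(5000 * 80), ncol = 80))

# Time the base R CSV writer.
csv_path <- tempfile(fileext = ".csv")
write_times <- c(write.csv = elapsed(write.csv(df, csv_path, row.names = FALSE)))

# Time the Feather writer on the same data frame, if the package is available.
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  write_times[["write_feather"]] <- elapsed(feather::write_feather(df, feather_path))
}

print(write_times)
```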
3 – Size of the file
A comparison of the sizes of the files that ‘write.csv’ and ‘write_feather’ created from the same data frame. The ‘Function’ label says ‘read.csv’ and ‘read_feather’ because the measurement was taken at the time of running those commands.
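One way to take such a measurement is base R's ‘file.size’, which returns the size in bytes. The sketch below assumes a small synthetic data frame and is not the original test code; the Feather file is only written and measured when the feather package is installed.

```r
# Synthetic stand-in data frame with 80 numeric columns.
df <- as.data.frame(matrix(rnorm(5000 * 80), ncol = 80))

# Write the CSV file and record its size on disk in bytes.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
sizes <- c(csv = file.size(csv_path))

# Write the same data frame as a Feather file and record its size too.
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather::write_feather(df, feather_path)
  sizes[["feather"]] <- file.size(feather_path)
}

# Report the sizes in megabytes.
print(sizes / 1024^2)
```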
4 – Memory usage when file is loaded
Here is a comparison of the size of the data loaded via each of the commands.
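In-memory size can be approximated with ‘object.size’ from the utils package. That function is an assumption on my part; the original test may have measured memory differently. The sketch uses a small synthetic data frame, and the ‘fread’ and ‘read_feather’ measurements only run when data.table and feather are installed.

```r
# Synthetic stand-in data, written once as CSV.
df <- as.data.frame(matrix(rnorm(5000 * 80), ncol = 80))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Size in bytes of the object returned by read.csv.
mem <- c(read.csv = as.numeric(object.size(read.csv(csv_path))))

# Size of the data.table returned by fread, if data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  mem[["fread"]] <- as.numeric(object.size(data.table::fread(csv_path)))
}

# Size of the object returned by read_feather, if feather is available.
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather::write_feather(df, feather_path)
  mem[["read_feather"]] <- as.numeric(object.size(feather::read_feather(feather_path)))
}

# Report the sizes in megabytes.
print(mem / 1024^2)
```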