As our industry has embraced new strategies for handling production workloads, such as containers (read: K8s) and serverless (read: Functions as a Service), developers no longer have the luxury of unlimited computing resources in production. Gone are the days when it was easy to acquire a large virtual machine with many cores and plenty of memory for an application deployment. As a .NET developer, even though you work with managed code and rely on the GC (garbage collector) to do its job, the onus is now on you to write highly performant code that can run anywhere, from Docker containers to IoT devices. With the advent of C# 8 and .NET Core, the Microsoft .NET team has been very conscious of memory allocations, and each new version ships modern APIs that can perform several times better than the older, traditional APIs. In this blog post I will showcase file I/O with three different techniques and benchmark each one. The benchmark results make it evident that System.IO.Pipelines wins by a considerable margin in both execution time and memory allocation.
We will experiment with reading a large CSV file (100,000 records with 5 fields each) of employee data. I’m sure you have run into this challenge many times in your career: parsing a large CSV file puts considerable pressure on the GC. Garbage collection is particularly important when thinking about performance, because it takes up CPU time that would otherwise be spent on the actual data processing. On top of that, each time a collection is triggered, much of the application's work is suspended so that the remaining references can be evaluated. This can drastically affect the time taken to process the data. I’ve chosen three techniques for this challenge.
- Using CsvHelper: a popular library for parsing CSV files in the .NET ecosystem.
- Using IAsyncEnumerable: this API was introduced in C# 8 and lets the file be processed as a stream of data (in chunks) instead of being loaded whole.
- Using System.IO.Pipelines: this API shipped with .NET Core 2.1 and is used internally by Kestrel, the ASP.NET Core web server, to process many requests per second received from the socket. It is available as a NuGet package. David Fowler, who architected these APIs, has an excellent post introducing them.
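To make the second technique more concrete, here is a minimal sketch of streaming lines with IAsyncEnumerable. It is not the code from the repo: the AsyncLineReader class and its method names are hypothetical, and the sketch assumes the first line of the input is a header row.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class AsyncLineReader
{
    // Yields one line at a time, so the whole file never has to sit in memory.
    public static async IAsyncEnumerable<string> ReadLinesAsync(TextReader reader)
    {
        while (true)
        {
            var line = await reader.ReadLineAsync();
            if (line is null)
            {
                yield break; // end of input
            }
            yield return line;
        }
    }

    // Counts data records, skipping the header row (assumed to be the first line).
    public static async Task<int> CountRecordsAsync(TextReader reader)
    {
        int count = 0;
        bool isHeader = true;

        await foreach (string line in ReadLinesAsync(reader))
        {
            if (isHeader)
            {
                isHeader = false;
                continue;
            }
            if (line.Length > 0)
            {
                count++;
            }
        }
        return count;
    }
}
```

Each line still becomes a new string here, so this approach reduces peak memory but not the number of allocations.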
The source code for this post is available on my GitHub repo. The repo also contains tests with the benchmark results.
So let’s dive straight into the code.
The public entry point is the ProcessFileAsync method. It creates an instance of the PipeReader class and reads the data into a buffer of type ReadOnlySequence<byte>. This buffer is then passed by ref to the ParseLines method along with the position of the PipeReader, which is 0 because the reader is at the beginning. ParseLines walks the buffer line by line, using NewLine as the delimiter, and this process continues until the end position of the PipeReader is reached. After parsing has finished, the PipeReader position is moved to the end of the buffer and the data is marked as processed (lines 35 and 43).
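The read loop described here can be sketched roughly as follows. This is a simplified illustration rather than the code from the repo: the PipelineLineCounter class is hypothetical, it only counts lines instead of building Employee objects, and it uses SequenceReader<byte> (available from .NET Core 3.0) to find the delimiter. On frameworks that do not ship System.IO.Pipelines, the NuGet package is assumed to be referenced.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

// Hypothetical class, for illustration only.
public static class PipelineLineCounter
{
    private const byte NewLine = (byte)'\n';

    // Reads the stream through a PipeReader and returns the number of lines seen.
    public static async Task<int> CountLinesAsync(Stream stream)
    {
        PipeReader reader = PipeReader.Create(stream);
        int lines = 0;

        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            // Handle every complete line currently in the buffer.
            SequencePosition consumed = ParseLines(buffer, out int found);
            lines += found;

            if (result.IsCompleted)
            {
                // A final line without a trailing newline still counts.
                if (!buffer.Slice(consumed).IsEmpty)
                {
                    lines++;
                }
                reader.AdvanceTo(buffer.End);
                break;
            }

            // Consumed: up to the last newline. Examined: the whole buffer,
            // so the next ReadAsync waits for new data before returning.
            reader.AdvanceTo(consumed, buffer.End);
        }

        await reader.CompleteAsync();
        return lines;
    }

    // Returns the position just past the last complete line in the buffer.
    private static SequencePosition ParseLines(ReadOnlySequence<byte> buffer, out int found)
    {
        found = 0;
        var sequenceReader = new SequenceReader<byte>(buffer);
        while (sequenceReader.TryReadTo(out ReadOnlySequence<byte> line, NewLine))
        {
            found++; // a real parser would hand `line` to the field parser here
        }
        return sequenceReader.Position;
    }
}
```

The important detail is the pair of positions passed to AdvanceTo: the bytes of an incomplete trailing line are not consumed, so they come back at the start of the next buffer.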
The actual data processing takes place in the ParseLine method of the static LineParser class. After skipping the header row of the CSV file, it captures each field value by finding the position of the next "," and then extracts the string value with UTF-8 encoding, using the index offsets and ranges pattern. Each field is processed one by one and it is tracked by
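As an illustration of the comma-splitting idea, here is a minimal sketch that works on a single line of bytes. It is an assumption-laden stand-in, not the LineParser from the repo: the CsvLineSketch class, the ParseFields method, and the fieldCount variable are hypothetical names, and quoted fields containing commas are not handled.

```csharp
using System;
using System.Text;

// Hypothetical class, for illustration only.
public static class CsvLineSketch
{
    private const byte Comma = (byte)',';

    // Splits one CSV line into `fields` and returns how many fields were filled.
    // Fields beyond fields.Length are ignored.
    public static int ParseFields(ReadOnlySpan<byte> line, string[] fields)
    {
        // Drop a trailing carriage return left over from Windows line endings.
        if (!line.IsEmpty && line[^1] == (byte)'\r')
        {
            line = line[..^1];
        }

        int fieldCount = 0;
        while (fieldCount < fields.Length)
        {
            int comma = line.IndexOf(Comma);

            // Take everything up to the comma, or the rest of the line if there is none.
            ReadOnlySpan<byte> field = comma >= 0 ? line[..comma] : line;
            fields[fieldCount++] = Encoding.UTF8.GetString(field);

            if (comma < 0)
            {
                break; // that was the last field
            }
            line = line[(comma + 1)..]; // continue after the comma
        }
        return fieldCount;
    }
}
```

Slicing a span with ranges does not copy any bytes; the only allocations are the strings produced by GetString.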
Let’s see how we can consume this. We start by creating an ArrayPool of the Employee type.
ArrayPool<T> is a high-performance pool of managed arrays. The pool is thread safe and can have a custom maximum array length. The next step is to call the Rent method, which requires you to specify the minimum length of the buffer. Keep in mind that what Rent returns might be bigger than what you asked for. Once you are done with the array, you Return it to the SAME pool. The Return method accepts an optional clearArray flag that clears the buffer, so a subsequent consumer calling Rent will not see the previous consumer's content. By default the contents are left unchanged.
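The Rent and Return cycle can be sketched as follows. The Employee class here is a hypothetical one-property stand-in for the real type, and the sketch uses the shared pool; ArrayPool<T>.Create can be used instead when a custom maximum array length is needed.

```csharp
using System;
using System.Buffers;

// Hypothetical stand-in for the real Employee type.
public sealed class Employee
{
    public string Name { get; set; } = "";
}

public static class ArrayPoolSketch
{
    // Rents a buffer, fills the first `count` slots, and returns how many were filled.
    public static int FillAndCount(int count)
    {
        ArrayPool<Employee> pool = ArrayPool<Employee>.Shared;

        // The rented array has at least `count` elements, possibly more.
        Employee[] rented = pool.Rent(count);
        try
        {
            for (int i = 0; i < count; i++)
            {
                rented[i] = new Employee { Name = $"Employee {i}" };
            }

            // Only the first `count` slots belong to us; ignore the rest.
            int filled = 0;
            foreach (Employee employee in rented.AsSpan(0, count))
            {
                if (employee.Name.Length > 0)
                {
                    filled++;
                }
            }
            return filled;
        }
        finally
        {
            // Return to the same pool and clear it so the next renter
            // does not see these Employee references.
            pool.Return(rented, clearArray: true);
        }
    }
}
```

Because the rented array can be longer than requested, always track the logical length yourself rather than relying on rented.Length.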
I’ve used the BenchmarkDotNet library to measure the performance.
As the results show, the Pipelines method is the clear winner: it took just 143.1 milliseconds to process the data, with only 44 MB of memory allocated.