Reading Large Binary Files: Child Play


Yes, that was funny title, but after your experience it you will agree to me. So, here is the story. I have been working on a software that read some recording from hardware device to database, we have 45 records per second for 30 days, so it is about 30x24x60x60 record entry with 45 columns in it. We have to make a desktop application so we choose .NET for it. The first version of software was released by my company 3 yrs ago, and the reading well, we were inexperience at that time to manage that data, and what we get is about 2 hrs to read all that data to database. Oh, I forgot to tell that 30 days entry was from one hardware device and we have 3-4 device :), so we took 2 hr to read all say 4 devices. Now that is not acceptable thing. So, we decide to rewrite the complete software to make use of some parallelism, as that is only way my team thing it is going to work.

I start the rewrite, and with only hope to reduce to 2 hr work to 30-45 minute I start writing code, but this time we make a exception from last time, instead of using TEXT ascii file or SQLite database, we opt to use SQL Server to store our data. Reason, well first pre-release version of software use Text file, we never get that part working for more than 15 days, and it always get out of memory for one or other reason. Then we start using Sqlite which is 5 times lighter on hardware and speed the reading and information access, however, there is part which still use text file. So, in order to avoid two source we opt for database only, and Since client already have Sql Server on seperate machine, we thought it is good to have seperate machine storing database, for long term and obviously LAN benefits. Since client already have SQL Server and we are using .NET I decide to go with Sql Server, only.

So, we start reading binary file, and start putting insert query for each record [just for testing the lowest speed], and it goes through in 12-13 minutes. wow, we already reduce 30-40 minute job to 12 minutes, just by using full time database. Now, the next challenge is to speed it up with known bulk import methods. couple of them that I tried are

1. Using Dataaset and Update feature of Dataset,

2. Using Long Insert query to send 100 or 1000 records in one “Execute”

3. Using SqlBulkCopy feature in .NET. This is obvious choice on speed, but in few cases it fails for me, so I have to look for first two options as well.

So, at end we get SqlBulkCopy as our tool to go with, now it doesn’t simply import the data in Database. We have to prepare background for it, SqlBulkCopy is used to send data from CSV files, so we create CSV file from binary read, than import this file to Staging Sql Table, and then transfer data from Staging Table to main table, all this is done in 2-3 minute flat. Yes, a 40 minute work is done in 3 minutes. Period.

The trick is, we reduce the number of operation needed to write to disk, we still create CSV, but perviously we are creating XML File, secondly, we have multiple file write procedure going, we remove all this. Infact to achieve that speed we even stop writing LOG file to trace error, we use SQL server to record those error for us. Infact removing Log Text file with SQL error log, itself speed things from 5 minutes to 3 minutes. though logs ar enot that long, but disk writing is very slow as compare to database insert.

All said, I will hardly use plain text file in future for long data read, it will be some sort of database for sure.

, , , ,