## Sorting a 2GB file with one string per line

If you have a 2GB file with one string per line, which sorting algorithm would you use to sort the file and why?

My initial thoughts:
Because 2GB size of strings are way too huge to be put into main memory, I came up with two ways:

1. K-way merge sort. Divide the file into K pieces, transfer them into main memory and sort them.
2. Bucket sort. Sort each character in order.

Solution:
When an interviewer gives a size limit of 2GB, it should tell you something – in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory..
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.

1. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any $O(nlogn)$ algorithm. Save the lines back to the file.
2. Now bring the next chunk into memory and sort.
3. Once we’re done, merge them one by one.

The above algorithm is also known as external sort. Step 3 is known as N-way merge.
The rationale behind using external sort is the size of data. Since the data is too huge and we can’t bring it all into memory, we need to go for a disk based sorting algorithm.

Advertisements

### 1 Comment (+add yours?)

1. thanks, that really could be a tricky question, if you don’t know the answer. 🙂