Generate an integer which is not in a file with 4 billion integers, limited memory

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
FOLLOW UP
What if you have only 10 MB of memory?

My initial thoughts:
In Java, an integer is 32-bit. 4 billion integers would quickly eat up 15 GB of memory. Therefore we cannot put them all in the main memory. We can instead use a boolean array. Each cell in position i of the array represent whether number i is present in the file or not. Then when generating new numbers, we just need to do a search through the array to find a spot whose value is false, meaning that number is absent from the file. However, in Java, the maximum length of an array is around 2^32. So we can make a 2D array with indexing to do this.

My initial codes:

	public static void generateNumber(String fileName) {
		final int SIZE = Integer.MAX_VALUE / 2;
		boolean[][] array = new boolean[3][SIZE];
		try {
			DataInputStream in = new DataInputStream(
					new FileInputStream(fileName);
			BufferedReader reader = new BufferedReader(
					new InputStreamReader(in));
			String line = new String();

			while ((line = reader.readLine()) != null) {
				long number = Long.parseLong(line);
				int which = (int) (number / SIZE);
				int pos = (int) (number % SIZE + which * SIZE);
				array[which][pos] = true;
			}
		} catch (IOException e) {
			e.printStackTrace();
		}

		for (int i = 0; i < array.length; ++i) {
			for (int j = 0; j < array[i].length; ++j)
				if (!array[i][j]) {
					System.out.println(i * SIZE + j);
					break;
				}
		}
	}

Solution:

There are a total of 2^32, or 4 billion, distinct integers possible. We have 1 GB of memory, or 8 billion bits.
Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:

  1. Create a bit vector (BV) of size 4 billion.
  2. Initialize BV with all 0’s.
  3. Scan all numbers (num) from the file and write BV[num] = 1;
  4. Now scan again BV from 0th index
  5. Return the first index which has 0 value.
	public static void findOpenNumber2(String fileName)
			throws FileNotFoundException {
		byte[] bitfield = new byte[0xFFFFFFF / 8];
		Scanner in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();

			// Finds the corresponding number in the bitfield by using the
			// OR operator to set the nth bit of a byte (e.g.. 10 would
			// correspond to the 2nd bit of index 2 in the byte array).

			bitfield[n / 8] |= 1 << (n % 8);
		}

		for (int i = 0; i < bitfield.length; i++) {
			for (int j = 0; j < 8; j++) {

				// Retrieves the individual bits of each byte. When 0 bit
				// is found, finds the corresponding value.

				if ((bitfield[i] & (1 << j)) == 0) {
					System.out.println(i * 8 + j);
					return;
				}
			}
		}
	}

Follow Up: What if we have only 10 MB memory?

It’s possible to find a missing integer with just two passes of the data set. We can divide up the integers into blocks of some size (we’ll discuss how to decide on a size later). Let’s just assume that we divide up the integers into blocks of 1000. So, block 0 represents the numbers 0 through 999, block 1 represents blocks 1000 – 1999, etc. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know that that is in block 0, we increment counter[0]. If we see 1425, we know that that is in block 1, so we increment counter[1].
At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know
which number (or numbers) is missing.
Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array with 2^12 block counters and a bit vector in 2^17 bytes. Both of these can comfortably fit in 10*2^20 bytes.
What’s the smallest footprint? When the array of block counters occupies the same memory as the bit vector. Let N = 2^32.

counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4

It’s possible to find a missing integer with just under 65KB (or, more exactly, sqrt(2)*2^15 bytes).

	int bitsize = 1048576; // 2^20 bits (2^17 bytes)
	int blockNum = 4096; // 2^12
	byte[] bitfield = new byte[bitsize / 8];
	int[] blocks = new int[blockNum];

	void findOpenNumber(String fileName) throws FileNotFoundException {
		int starting = -1;
		Scanner in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();
			blocks[n / (bitfield.length * 8)]++;
		}

		for (int i = 0; i < blocks.length; i++) {
			if (blocks[i] < bitfield.length * 8) {
				// if value < 2^20, then at least 1 number is missing in
				// that section.
				starting = i * bitfield.length * 8;
				break;
			}
		}

		in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();
			// If the number is inside the block that’s missing numbers,
			// we record it
			if (n >= starting && n < starting + bitfield.length * 8) {
				bitfield[(n - starting) / 8] |= 1 << ((n - starting) % 8);
			}
		}

		for (int i = 0; i < bitfield.length; i++) {
			for (int j = 0; j < 8; j++) {
				// Retrieves the individual bits of each byte. When 0 bit
				// is found, finds the corresponding value.
				if ((bitfield[i] & (1 << j)) == 0) {
					System.out.println(i * 8 + j + starting);
					return;
				}
			}
		}
	}
Advertisements

7 Comments (+add yours?)

  1. jim
    Aug 31, 2012 @ 10:46:54

    hi, I was wondering that for the followup solution, is it possible that block counters all reach up to 1000? Great blog and solution btw

    Reply

  2. john
    Oct 05, 2012 @ 21:03:12

    Hi,about follow up solution, you assume the number in the file is distinct?

    Reply

  3. David
    Jun 05, 2013 @ 17:35:07

    Hi, why will the array of block counters occupy the same memory as the bit vector?

    And why the size of block multiplied by 4 in “counters (bytes): blocks * 4”?

    Reply

  4. David
    Jun 05, 2013 @ 17:40:08

    One more question. What is 65KB and how to get it? It seems not be obtained from sqrt(2)*2^15 bytes.

    Reply

  5. Rami Jiossy
    Feb 25, 2014 @ 03:38:12

    I got a question about u’r initial code; one without mem limitaitons.
    what pos updates does :
    int pos = (int) (number % SIZE + which * SIZE);
    pos can be at most SIZE; so the way u update it looks illigal.
    can u please explain it?

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: