How to detect duplicate documents in billions of urls

You have a billion urls, where each is a huge page. How do you detect the duplicate documents?

My initial thoughts:
We can only use some heuristics to do the detection:

  1. If the two documents have exactly the same links inside the page
  2. They have the same title
  3. Their creation time are the same

Solution:
Observations:

  1. Pages are huge, so bringing all of them in memory is a costly affair. We need a shorter representation of pages in memory. A hash is an obvious choice for this.
  2. Billions of urls exist so we don’t want to compare every page with every other page (that would be O(n^2)).

Based on the above two observations we can derive an algorithm which is as follows:

  1. Iterate through the pages and compute the hash table of each one.
  2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, then keep the url and insert it in into the hash table.

This algorithm will provide us a list of unique urls. But wait, can this fit on one computer?

  • How much space does each page take up in the hash table?
    • Each page hashes to a four byte value.
    • Each url is an average of 30 characters, so that’s another 30 bytes at least.
    • Each url takes up roughly 34 bytes.
  • 34 bytes * 1 billion = 31.6 gigabytes. We’re going to have trouble holding that all in memory!

What do we do?

  • We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.
  • We could hash to disk. Size wouldn’t be a problem, but access time might. A hash table on disk would require a random access read for each check and write to store a viewed url. This could take msecs waiting for seek and rotational latencies. Elevator algorithms could elimate random bouncing from track to track.
  • Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.
    • First, we hash the document to get a hash value v
    • v\%n tells us which machine this document’s hash table can be found on.
    • v / n is the value in the hash table that is located on its machine.

Find duplicate elements in the array with 4KB memory only

You have an array with all the numbers from 1 to N, where N is at most 32,000. The array may have duplicate entries and you do not know what N is. With only 4KB of memory available, how would you print all duplicate elements in the array?

My initial thoughts:
4KB = 32768 bits. We can use a byte array to represent if we have seen number i. If so, we make it 1. If we encounter a duplicate, we would have marked the particular position with 1, so we would know we have a duplicate. Then we print it out.

My initial codes:

	public static void findDuplicates(int[] array) {
		byte[] bitVector = new byte[32000 / 8];
		for (int number : array) {
			int posInArray = number >> 3;
			int posInBit = number % 8;
			if ((bitVector[posInArray] & (1 << posInBit)) > 0)
				// found duplicates
				System.out.println(number);
			else
				bitVector[posInArray] |= (1 << posInBit);
		}
	}

Solution:
We have 4KB of memory which means we can address up to 8 * 4 * (2^10) bits. Note that 32*(2^10) bits is greater than 32000. We can create a bit vector with 32000 bits, where each bit represents one integer.
NOTE: While this isn’t an especially difficult problem, it’s important to implement this cleanly. We will define our own bit vector class to hold a large bit vector.

	public static void checkDuplicates(int[] array) {
		BitSet bs = new BitSet(32000);
		for (int i = 0; i < array.length; i++) {
			int num = array[i];
			int num0 = num - 1; // bitset starts at 0, numbers start at 1
			if (bs.get(num0)) {
				System.out.println(num);
			} else {
				bs.set(num0);
			}
		}
	}

	class BitSet {
		int[] bitset;
		
		public BitSet(int size) {
			bitset = new int[size >> 5]; // divide by 32
		}

		boolean get(int pos) {
			int wordNumber = (pos >> 5); // divide by 32
			int bitNumber = (pos & 0x1F); // mod 32
			return (bitset[wordNumber] & (1 << bitNumber)) != 0;
		}

		void set(int pos) {
			int wordNumber = (pos >> 5); // divide by 32
			int bitNumber = (pos & 0x1F); // mod 32
			bitset[wordNumber] |= 1 << bitNumber;
		}
	}

Comments:
bitset starts at 0 while numbers start at 1. I didn’t pay attention to that.

Generate an integer which is not in a file with 4 billion integers, limited memory

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
FOLLOW UP
What if you have only 10 MB of memory?

My initial thoughts:
In Java, an integer is 32-bit. 4 billion integers would quickly eat up 15 GB of memory. Therefore we cannot put them all in the main memory. We can instead use a boolean array. Each cell in position i of the array represent whether number i is present in the file or not. Then when generating new numbers, we just need to do a search through the array to find a spot whose value is false, meaning that number is absent from the file. However, in Java, the maximum length of an array is around 2^32. So we can make a 2D array with indexing to do this.

My initial codes:

	public static void generateNumber(String fileName) {
		final int SIZE = Integer.MAX_VALUE / 2;
		boolean[][] array = new boolean[3][SIZE];
		try {
			DataInputStream in = new DataInputStream(
					new FileInputStream(fileName);
			BufferedReader reader = new BufferedReader(
					new InputStreamReader(in));
			String line = new String();

			while ((line = reader.readLine()) != null) {
				long number = Long.parseLong(line);
				int which = (int) (number / SIZE);
				int pos = (int) (number % SIZE + which * SIZE);
				array[which][pos] = true;
			}
		} catch (IOException e) {
			e.printStackTrace();
		}

		for (int i = 0; i < array.length; ++i) {
			for (int j = 0; j < array[i].length; ++j)
				if (!array[i][j]) {
					System.out.println(i * SIZE + j);
					break;
				}
		}
	}

Solution:

There are a total of 2^32, or 4 billion, distinct integers possible. We have 1 GB of memory, or 8 billion bits.
Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:

  1. Create a bit vector (BV) of size 4 billion.
  2. Initialize BV with all 0’s.
  3. Scan all numbers (num) from the file and write BV[num] = 1;
  4. Now scan again BV from 0th index
  5. Return the first index which has 0 value.
	public static void findOpenNumber2(String fileName)
			throws FileNotFoundException {
		byte[] bitfield = new byte[0xFFFFFFF / 8];
		Scanner in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();

			// Finds the corresponding number in the bitfield by using the
			// OR operator to set the nth bit of a byte (e.g.. 10 would
			// correspond to the 2nd bit of index 2 in the byte array).

			bitfield[n / 8] |= 1 << (n % 8);
		}

		for (int i = 0; i < bitfield.length; i++) {
			for (int j = 0; j < 8; j++) {

				// Retrieves the individual bits of each byte. When 0 bit
				// is found, finds the corresponding value.

				if ((bitfield[i] & (1 << j)) == 0) {
					System.out.println(i * 8 + j);
					return;
				}
			}
		}
	}

Follow Up: What if we have only 10 MB memory?

It’s possible to find a missing integer with just two passes of the data set. We can divide up the integers into blocks of some size (we’ll discuss how to decide on a size later). Let’s just assume that we divide up the integers into blocks of 1000. So, block 0 represents the numbers 0 through 999, block 1 represents blocks 1000 – 1999, etc. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know that that is in block 0, we increment counter[0]. If we see 1425, we know that that is in block 1, so we increment counter[1].
At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know
which number (or numbers) is missing.
Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array with 2^12 block counters and a bit vector in 2^17 bytes. Both of these can comfortably fit in 10*2^20 bytes.
What’s the smallest footprint? When the array of block counters occupies the same memory as the bit vector. Let N = 2^32.

counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4

It’s possible to find a missing integer with just under 65KB (or, more exactly, sqrt(2)*2^15 bytes).

	int bitsize = 1048576; // 2^20 bits (2^17 bytes)
	int blockNum = 4096; // 2^12
	byte[] bitfield = new byte[bitsize / 8];
	int[] blocks = new int[blockNum];

	void findOpenNumber(String fileName) throws FileNotFoundException {
		int starting = -1;
		Scanner in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();
			blocks[n / (bitfield.length * 8)]++;
		}

		for (int i = 0; i < blocks.length; i++) {
			if (blocks[i] < bitfield.length * 8) {
				// if value < 2^20, then at least 1 number is missing in
				// that section.
				starting = i * bitfield.length * 8;
				break;
			}
		}

		in = new Scanner(new FileReader(fileName));
		while (in.hasNextInt()) {
			int n = in.nextInt();
			// If the number is inside the block that’s missing numbers,
			// we record it
			if (n >= starting && n < starting + bitfield.length * 8) {
				bitfield[(n - starting) / 8] |= 1 << ((n - starting) % 8);
			}
		}

		for (int i = 0; i < bitfield.length; i++) {
			for (int j = 0; j < 8; j++) {
				// Retrieves the individual bits of each byte. When 0 bit
				// is found, finds the corresponding value.
				if ((bitfield[i] & (1 << j)) == 0) {
					System.out.println(i * 8 + j + starting);
					return;
				}
			}
		}
	}