How to detect duplicate documents in billions of urls

You have a billion urls, where each is a huge page. How do you detect the duplicate documents?

My initial thoughts:
We can only use some heuristics to do the detection:

  1. If the two documents have exactly the same links inside the page
  2. They have the same title
  3. Their creation time are the same

Solution:
Observations:

  1. Pages are huge, so bringing all of them in memory is a costly affair. We need a shorter representation of pages in memory. A hash is an obvious choice for this.
  2. Billions of urls exist so we don’t want to compare every page with every other page (that would be O(n^2)).

Based on the above two observations we can derive an algorithm which is as follows:

  1. Iterate through the pages and compute the hash table of each one.
  2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, then keep the url and insert it in into the hash table.

This algorithm will provide us a list of unique urls. But wait, can this fit on one computer?

  • How much space does each page take up in the hash table?
    • Each page hashes to a four byte value.
    • Each url is an average of 30 characters, so that’s another 30 bytes at least.
    • Each url takes up roughly 34 bytes.
  • 34 bytes * 1 billion = 31.6 gigabytes. We’re going to have trouble holding that all in memory!

What do we do?

  • We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.
  • We could hash to disk. Size wouldn’t be a problem, but access time might. A hash table on disk would require a random access read for each check and write to store a viewed url. This could take msecs waiting for seek and rotational latencies. Elevator algorithms could elimate random bouncing from track to track.
  • Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.
    • First, we hash the document to get a hash value v
    • v\%n tells us which machine this document’s hash table can be found on.
    • v / n is the value in the hash table that is located on its machine.
    Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: