How to avoid getting into infinite loops when designing a web crawler

If you were designing a web crawler, how would you avoid getting into infinite loops?

My initial thoughts:

  1. Keep a list of visited webpage. If the webpage is already visited, do not follow that link.
  2. Make a webpage with links less than a threshold as the base-page. Stop searching at base pages.

Solution:
First, how does the crawler get into a loop? The answer is very simple: when we re-parse an already parsed page. This would mean that we revisit all the links found in that page, and this would continue in a circular fashion.
Be careful about what the interviewer considers the “same” page. Is it URL or content? One could easily get redirected to a previously crawled page.
So how do we stop visiting an already visited page? The web is a graph-based structure, and we commonly use DFS (depth first search) and BFS (breadth first search) for traversing graphs. We can mark already visited pages the same way that we would in a BFS/DFS.
We can easily prove that this algorithm will terminate in any case. We know that each step of the algorithm will parse only new pages, not already visited pages. So, if we assume that we have N number of unvisited pages, then at every step we are reducing N (N-1) by 1. That proves that our algorithm will continue until they are only N steps.

Advertisements

1 Comment (+add yours?)

  1. Julian
    Jul 28, 2014 @ 20:45:04

    Excellent beat ! I would like to apprentice at the same time as you amend your site,
    how can i subscribe for a weblog site? The account helped me a applicable deal.

    I had been tiny bit acquainted of this your broadcast provided brilliant transparent idea

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: