A web crawler is a program that automatically visits web pages, reads their content, and follows links to discover more pages. Search engines like Google use web crawlers to scan the web, collect page data, and build an index so users can search for information later.
Product Design Requirements
Functional Requirements
- The system should be able to crawl web pages starting from a set of seed URLs.
- The system should be able to fetch HTML content from each URL, parse the content, and extract new URLs.
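To make the "parse the content and extract new URLs" requirement concrete, here is a minimal sketch using only the Python standard library. The names `LinkExtractor` and `extract_links` are illustrative and not part of the design; the choice to keep only `http`/`https` links and to drop `#fragment` suffixes is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip non-web schemes such as mailto: or javascript:.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list[str]:
    """Return the outgoing links of a page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Relative links are resolved against the URL of the page they were found on, which is why the extractor needs the base URL in addition to the HTML.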
Non-Functional Requirements
- The system should be highly available, especially for crawl scheduling and data retrieval.
- The system should be fault tolerant; failures in workers or network calls should not stop the crawl.
- The system should be polite to external websites by enforcing rate limits per domain.
- The system should be highly scalable and support crawling billions of pages.
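The politeness requirement above can be sketched as a per-domain rate limiter. This is a single-process illustration only: the class name `DomainRateLimiter`, the fixed minimum delay, and the injectable `clock` parameter are all assumptions made for this example, not part of the design.

```python
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds: float = 1.0, clock=time.monotonic) -> None:
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Seconds the caller should still wait before fetching this URL."""
        host = urlsplit(url).hostname or ""
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url: str) -> None:
        """Mark that a request to this URL's host was just sent."""
        host = urlsplit(url).hostname or ""
        self._next_allowed[host] = self.clock() + self.min_delay
```

A worker would call `wait_time(url)` before each fetch, sleep or requeue the URL if the result is positive, and call `record_fetch(url)` once the request is sent. Requests to different hosts do not delay each other.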
Design Setup
Because a web crawler is not primarily a user-facing system, we do not need to start with traditional client APIs. Instead, we can define the system boundary: what it takes in, what it produces, and how data moves through the system. This gives us a clear foundation before moving into the high-level architecture.
System Interface
The crawler has a simple external interface:
- Input: Seed URLs to begin crawling from
- Output: Extracted text or metadata from crawled web pages
The seed URLs act as the starting point. From there, the crawler discovers additional URLs by parsing links from fetched pages.
Data Flow
At a high level, the crawler repeatedly performs the following steps:
- Take a URL from the crawl frontier
- Resolve the domain using DNS
- Fetch the HTML from the target web server
- Parse the HTML and extract useful text
- Store the extracted text in a database
- Extract linked URLs from the page
- Add newly discovered URLs back into the crawl frontier
- Repeat until the frontier is empty or the crawl reaches its limit (for example, a maximum number of pages)
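The steps above can be sketched as a single-threaded loop. This is a simplified illustration, not the final architecture: the function name `crawl`, the `max_pages` limit, and the injected `fetch` and `parse` callables are assumptions of this sketch. DNS resolution is assumed to happen inside `fetch`, and an in-memory dictionary stands in for the database.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str, str], tuple[str, list[str]]],
    max_pages: int = 100,
) -> dict[str, str]:
    """Breadth-first crawl; returns a mapping of URL -> extracted text."""
    # dict.fromkeys removes duplicate seeds while preserving their order.
    frontier = deque(dict.fromkeys(seed_urls))
    seen = set(frontier)
    store: dict[str, str] = {}

    while frontier and len(store) < max_pages:
        url = frontier.popleft()          # take a URL from the frontier
        try:
            html = fetch(url)             # DNS resolution + HTTP fetch
        except Exception:
            continue                      # a failed fetch must not stop the crawl
        text, links = parse(url, html)    # extract text and outgoing links
        store[url] = text                 # store the extracted text
        for link in links:
            if link not in seen:          # only enqueue URLs not seen before
                seen.add(link)
                frontier.append(link)
    return store
```

Passing `fetch` and `parse` in as callables keeps the loop independent of any particular HTTP client or HTML parser, which mirrors the component boundaries the data flow suggests.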
This flow helps separate the system into clear components: URL frontier, DNS resolution, page fetcher, HTML parser, content storage, URL extraction, and deduplication, which keeps URLs that have already been seen from re-entering the frontier. Once this pipeline is clear, we can design each component in more detail.
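As a small illustration of the deduplication component, the sketch below normalizes URLs before checking whether they have been seen. The normalization rules chosen here (lowercase scheme and host, drop default ports and fragments, treat an empty path as `/`) and the names `normalize_url` and `SeenUrls` are assumptions for this example. The in-memory set is used only for illustration and would not hold billions of URLs.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Map trivially different spellings of a URL to one canonical form.

    Simplified: userinfo is dropped and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenUrls:
    """Tracks which normalized URLs have already entered the frontier."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this in place, `HTTP://Example.COM:80/a#top` and `http://example.com/a` count as the same URL, so only the first one would be added to the frontier.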