Given a URL, the crawler outputs a simple textual sitemap showing the links between pages. Crawling is limited to one subdomain: starting with https://example.com/, it crawls all pages within example.com but does not follow external links, for example to facebook.com or blog.example.com.
The crawler client is a ReactJS SPA. The interface contains a text input for the URL and a numeric input to limit the maximum depth to crawl.
The sitemap is rendered as an unordered list (`ul`) of items (`li`), using indentation to show the hierarchy/relationships between pages.
The crawler service (`Service.API`) is a .NET Core web API written in C#. The API exposes a single route:
- `POST ~/crawl?url=[url]&max-depth=[max-depth]`
Breadth-first search approach, traversing the tree one level at a time. At each step, crawl tasks are spawned in parallel to generate the nodes for the next level.
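The level-by-level traversal can be sketched as follows, with a caller-supplied fetch function standing in for `IContentProvider` (the names here are illustrative, not the service's actual API):

```typescript
// Crawl breadth-first: each iteration fetches every URL in the current
// level in parallel, and their combined children form the next level.
async function bfsCrawl(
  root: string,
  maxDepth: number,
  fetchLinks: (url: string) => Promise<string[]>
): Promise<Map<string, string[]>> {
  const sitemap = new Map<string, string[]>();
  let level = [root];
  for (let depth = 0; depth <= maxDepth && level.length > 0; depth++) {
    // One parallel crawl task per node in the current level.
    const results = await Promise.all(
      level.map(async (url) => ({ url, links: await fetchLinks(url) }))
    );
    const next: string[] = [];
    for (const { url, links } of results) {
      sitemap.set(url, links);
      for (const link of links) {
        // Only enqueue URLs that were neither crawled nor already queued.
        if (!sitemap.has(link) && !next.includes(link)) next.push(link);
      }
    }
    level = next;
  }
  return sitemap;
}
```

Because each level is awaited as a whole, parallelism is bounded by the width of the current level rather than by an explicit concurrency limit.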
Depth-first search approach, recursing from the root down through the sitemap tree.
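The recursive variant can be sketched the same way: each call fully explores one child subtree before moving to the next sibling, with a shared visited set to guard against cycles. Again a sketch against a caller-supplied fetch function, not the service's actual code:

```typescript
interface SitemapNode {
  url: string;
  children: SitemapNode[];
}

// Crawl depth-first: recurse into each unvisited child before moving on
// to the next sibling. `visited` is shared across the whole traversal.
async function dfsCrawl(
  url: string,
  maxDepth: number,
  fetchLinks: (url: string) => Promise<string[]>,
  visited: Set<string> = new Set()
): Promise<SitemapNode> {
  visited.add(url);
  const node: SitemapNode = { url, children: [] };
  if (maxDepth <= 0) return node; // depth budget exhausted: leaf node
  for (const link of await fetchLinks(url)) {
    if (!visited.has(link)) {
      node.children.push(await dfsCrawl(link, maxDepth - 1, fetchLinks, visited));
    }
  }
  return node;
}
```

Unlike the BFS sketch, this produces the sitemap tree directly, which maps naturally onto the client's nested `ul`/`li` rendering.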
A refined version of `DFSCrawler` that accounts for errors and retries. A queue-processor implementation maintains a list of processing tasks bounded by a maximum concurrency parameter. When a node fails to crawl, it is re-queued up to a maximum number of times.
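The queue processor can be sketched like this: a fixed pool of workers drains a shared queue, and a failed job is re-queued until it exhausts its retry budget (the parameter and type names are illustrative):

```typescript
interface CrawlJob {
  url: string;
  attempts: number; // number of failed attempts so far
}

// Drain a queue of crawl jobs with at most `maxConcurrency` workers.
// A job that throws is re-queued until it has failed more than `maxRetries` times.
async function processQueue(
  urls: string[],
  maxConcurrency: number,
  maxRetries: number,
  fetchLinks: (url: string) => Promise<string[]>
): Promise<{ done: Map<string, string[]>; failed: string[] }> {
  const queue: CrawlJob[] = urls.map((url) => ({ url, attempts: 0 }));
  const done = new Map<string, string[]>();
  const failed: string[] = [];

  async function worker(): Promise<void> {
    // Each worker pulls jobs until the queue is empty.
    for (let job = queue.shift(); job; job = queue.shift()) {
      try {
        done.set(job.url, await fetchLinks(job.url));
      } catch {
        job.attempts++;
        if (job.attempts > maxRetries) failed.push(job.url);
        else queue.push(job); // re-queue for another attempt
      }
    }
  }

  // Spawn the worker pool and wait for every worker to drain the queue.
  await Promise.all(Array.from({ length: maxConcurrency }, worker));
  return { done, failed };
}
```

The concurrency limit here is enforced by the pool size rather than by counting in-flight tasks, which keeps the sketch short; a worker that re-queues a job keeps looping, so re-queued work is never stranded.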
A test project (`Service.Tests`) contains:
- `UtilsTests` - Unit tests for the `Utils` class, which contains helper methods for URL validation, parsing, and transformation.
- `BFSCrawlerTests`, `DFSCrawlerTests`, and `FaultTolerantCrawlerTests` - Using a mock implementation of the `IContentProvider` interface, these cover basic functionality of the different crawler strategy implementations.
The crawler client starts on port 3000 by default (`react-scripts`) using:

`cd crawler-client && yarn install && yarn start`
The crawler service starts on port 3001 (`crawler-service/Service.API/Properties/launchSettings.json`) using:

`cd crawler-service && dotnet run --project Service.API`
In order to deliver this implementation in a reasonable time, some trade-offs were made. To make it production-ready, the following should be addressed:
- **Crawler strategy - state management and recovery**: Regardless of the strategies implemented to make the crawling process more resource-efficient, failure scenarios such as node crashes and network outages should be addressed by providing state management and recovery.
- **Logs**: All errors should be tracked in a logging system with relevant context information, such as time, error type/description, number of retries, etc.
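One minimal direction for the state-management point above (a sketch of a possible approach, not part of the current implementation, with hypothetical names): periodically serialize the visited set and the crawl frontier, so a restarted process can resume from the checkpoint instead of re-crawling from the root.

```typescript
// Hypothetical checkpoint of crawler state: enough to resume a crawl
// after a crash without re-fetching already-visited pages.
interface CrawlCheckpoint {
  rootUrl: string;
  maxDepth: number;
  visited: string[]; // URLs already crawled
  frontier: string[]; // URLs queued but not yet crawled
}

function saveCheckpoint(state: CrawlCheckpoint): string {
  return JSON.stringify(state);
}

function loadCheckpoint(raw: string): CrawlCheckpoint {
  // A restored crawl resumes from `frontier`, skipping anything in `visited`.
  return JSON.parse(raw) as CrawlCheckpoint;
}
```

In production the serialized state would go to durable storage (a database or file) rather than a string, and would be written at level boundaries or queue milestones.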