Monitoring a Website's Sitemap Using a Crawler

Currently in alpha • Available in Professional plan and above

What is sitemap monitoring?

A sitemap helps search engines understand a website’s structure. While XML sitemaps are common, they’re often outdated or incomplete. Distill’s sitemap monitor uses a crawler to discover all pages on a website—even hidden or dynamically generated ones—and alerts you when URLs are added or removed.

How sitemap monitoring works

Distill’s crawler creates a comprehensive list of URLs by:

  1. Starting at your specified URL
  2. Finding all links on that page
  3. Following each link to discover more pages
  4. Repeating until all reachable pages are found

Crawling rules

The crawler will not follow:

  • Links outside the website’s domain
  • Links not in the same subdomain/subpath
  • Links matching your exclusion filter

Example: A crawl starting at https://distill.io will include https://distill.io/blog and https://distill.io/help, but not https://forums.distill.io (a different subdomain).
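
In code, the crawl loop and these scoping rules amount to something like the sketch below (illustration only, not Distill's actual crawler; the start URL, the exclusion pattern, and the fetch_links helper are placeholders you would supply):

```python
import re
from urllib.parse import urljoin, urlsplit

# Sketch for illustration only -- not Distill's implementation.
START_URL = "https://distill.io"               # your Start URL
EXCLUDE = re.compile(r"\.(png|jpe?g|gif)$")    # optional exclusion filter

def in_scope(url: str) -> bool:
    """Follow only links on the same subdomain and under the same subpath,
    skipping anything that matches the exclusion filter."""
    start, link = urlsplit(START_URL), urlsplit(url)
    return (
        link.netloc == start.netloc              # same subdomain
        and link.path.startswith(start.path)     # same subpath
        and not EXCLUDE.search(url)              # not excluded
    )

def crawl(fetch_links):
    """Breadth-first discovery; fetch_links(url) returns the hrefs on a page."""
    seen, queue = {START_URL}, [START_URL]
    while queue:
        page = queue.pop(0)
        for href in fetch_links(page):
            url = urljoin(page, href)
            if url not in seen and in_scope(url):
                seen.add(url)
                queue.append(url)
    return seen  # the discovered sitemap
```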

Two-step workflow

  1. Crawling — Discovers all links on a website and creates a sitemap. Configure crawl frequency and exclusions at this stage
  2. Monitoring — Import discovered URLs from the sitemap into your watchlist to monitor their content

Create a sitemap monitor

  1. Open the Watchlist at https://monitor.distill.io

  2. Click Add Monitor → Sitemap

    button to add a sitemap monitor

  3. Enter the Start URL

    The crawler will follow links within the same subdomain and subpath only.

    Example: Starting at https://distill.io, the crawler will visit:

    • https://distill.io/blog
    • https://distill.io/help

    It will not visit https://forums.distill.io (a different subdomain).
  4. (Optional) Add a regular expression filter to exclude specific links

  5. Click Done to open the Options page. Configure actions and conditions, then save

You’ll receive alerts whenever URLs are added or removed. Click the monitor preview to view change history.

Change history of a sitemap monitor

Use regular expressions to exclude specific links from crawling. Image links are excluded by default.

Add the filter when creating the monitor or edit it later from the crawler detail page.

Regex filter for exclusion
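
As an illustration of how an exclusion pattern filters links (the pattern and URLs below are made up, and the sketch uses Python's re module rather than whatever regex engine Distill applies):

```python
import re

# Illustrative filter: skip tag archive pages and PDF files.
EXCLUDE = re.compile(r"/tag/|\.pdf$")

links = [
    "https://example.com/blog/post-1",
    "https://example.com/tag/news",
    "https://example.com/files/brochure.pdf",
]

# Links matching the filter are never followed or added to the sitemap.
crawlable = [u for u in links if not EXCLUDE.search(u)]
print(crawlable)  # ['https://example.com/blog/post-1']
```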

Crawl frequency

The default crawl frequency is once per day. To change it:

  1. Click the hamburger menu → Crawlers
view crawler's detail page
  2. Select your crawler and click Edit Crawler
  3. Modify the Schedule settings
edit crawler's config

Page macros

Page macros run validation steps on each page before crawling. Use them to prevent errors or skip irrelevant pages.

Add page macros

  1. Click the hamburger menu → Crawlers
  2. Select your crawler → Edit Crawler
  3. Next to Page Macros, click Edit Steps
Adding steps to be executed before crawling a page
  4. Add the steps to execute before crawling each URL

Use case: Stop crawling when site is down

When a site returns 503 errors (temporarily unavailable), pages won’t contain the original links. Crawling them creates an incomplete list and triggers false alerts. Use a page macro to stop the crawl:

  1. Add an assert step and expand the options
  2. Check if element_has_text contains keywords like “Maintenance”
  3. Click Set optional Arguments and add an error message: “Crawler stopped: Site under maintenance”
Stop crawl when under maintenance
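
In plain code, the check this macro expresses looks roughly like the following (a sketch of the logic only; the actual steps are configured in Distill's UI, and the page fetch here is just for illustration):

```python
import urllib.request

# Sketch of the macro's logic, not Distill's macro engine.
def assert_not_under_maintenance(url: str) -> None:
    """Abort before crawling if the page looks like a maintenance notice."""
    with urllib.request.urlopen(url) as resp:
        page_text = resp.read().decode("utf-8", errors="ignore")
    if "Maintenance" in page_text:
        # Mirrors the assert step's optional error message.
        raise RuntimeError("Crawler stopped: Site under maintenance")
```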

Use case: Skip URLs based on content

Skip irrelevant URLs to avoid unnecessary notifications. For example, skip out-of-stock products:

  1. Add an if...else block
  2. Check if element_has_text contains “in stock” and negate this step using the overflow button (adds NOT)
  3. In the condition body, use skipURL

This skips all URLs that don’t contain “in stock”.

Stop crawling irrelevant pages
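
The condition amounts to the following (again a sketch of the logic rather than Distill's macro engine; page_text stands in for the element text the step inspects):

```python
# Sketch of the condition, not Distill's macro engine.
def should_crawl(page_text: str) -> bool:
    """Negated element_has_text check: keep only pages that mention 'in stock'."""
    return "in stock" in page_text

# Pages failing the check are skipped, like the skipURL step.
print(should_crawl("Only 2 left in stock"))   # True  -> crawl and monitor
print(should_crawl("Currently unavailable"))  # False -> skip this URL
```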

URL rewrite

URL rewriting normalizes URLs before crawling to prevent duplicates. Use it to:

  • Consolidate URLs with different parameter orders
  • Remove unnecessary query parameters
  • Manage redirects

Rewrite presets

Example URL: https://example.com/products?size=10&color=blue&category=shoes

  1. Sort query parameters — Alphabetically sorts parameters

    Result: https://example.com/products?category=shoes&color=blue&size=10

    Must use with “Return constructed URL” preset

  2. Remove all query parameters — Strips all parameters

    Result: https://example.com/products

  3. Return constructed URL — Rebuilds and returns the URL from its components after the other presets run; the Sort query parameters preset must be combined with it
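
Combining Sort query parameters with Return constructed URL gives a normalization roughly like the one below (a Python sketch for illustration; the function name is ours, not part of Distill):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustration of the presets' combined effect; not Distill's implementation.
def normalize_url(url: str) -> str:
    """Sort the query parameters and rebuild the URL so that
    equivalent URLs compare equal."""
    p = urlsplit(url)
    query = urlencode(sorted(parse_qsl(p.query)))
    return urlunsplit((p.scheme, p.netloc, p.path, query, p.fragment))

print(normalize_url("https://example.com/products?size=10&color=blue&category=shoes"))
# https://example.com/products?category=shoes&color=blue&size=10
```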

Configure URL rewrite

  1. Select your crawler and click Edit Crawler
  2. Next to URL Rewrite, click Edit Steps
Rewrite URL
  3. Select your desired presets


Use case: Eliminate duplicate URLs

URLs with different parameter orders can appear as different pages, causing false duplicate alerts.

Example: These URLs lead to the same page:

  • https://bookstore.com/search?sort=price_asc&category=fiction
  • https://bookstore.com/search?category=fiction&sort=price_asc

Normalize them by sorting parameters:

  1. Click Edit Steps in URL Rewrite
  2. Select Sort Query Params
  3. Select Return Constructed URL

The crawler now treats both URLs as the same entry.

Manage crawlers

View all crawlers

  1. Click the hamburger menu → Crawlers
view and manage crawler
  2. View crawler details: Name, Start URL, Creation date, Last Run Summary, and State

View crawler jobs

Click any crawler to see its job history. Click Edit Crawler to modify schedule, exclusions, page macros, or URL rewrite settings.

viewing crawler jobs and status

Job statistics:

  • Total — Total URLs found in the sitemap
  • Queued — URLs waiting to be crawled
  • Crawled — Successfully crawled URLs
  • Errored — URLs that encountered errors

View crawl details

Click the Started on field for any job to see the list of discovered URLs and their crawl status.

viewing crawler jobs

Click the caret > icon next to a URL to view its crawl path.

viewing crawler jobs
viewing crawl path

Import crawled URLs

After crawling completes, import discovered URLs into your watchlist to monitor their content.

  1. Open the sitemap monitor’s Change History
  2. Click Import Monitors
Import Monitors button
  3. View the list of crawled URLs with their details
  4. Filter by Show All URLs to see only unmonitored URLs
  5. Select the URLs you want to monitor
  6. Click Import Selected Monitors and configure the options
import monitors button

Export crawled URLs

Export the list of crawled URLs to CSV:

  1. From your watchlist, click the sitemap monitor’s preview
  2. Click Download
export URLs to CSV

CSV fields:

  • URL — The discovered URL
  • Content Type — Resource type
  • Status Code — HTTP response code
  • Diff Type — Change status:
    • Addition — Newly found URL
    • Unchanged — URL present in previous crawl
    • Deleted — URL removed since last crawl
export URLs to CSV
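
If you post-process the export, the fields can be read with any CSV tooling; for example (a sketch that assumes a downloaded file named sitemap_export.csv with the column names listed above, which may differ slightly in the actual export):

```python
import csv

# Column names are assumed from the field list above.
# List URLs that were newly discovered in the latest crawl.
with open("sitemap_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["Diff Type"] == "Addition":
            print(row["URL"], row["Status Code"])
```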

Data retention

Crawler jobs generate significant data. To manage storage:

  • Only the latest 10 jobs are retained
  • Jobs that detected changes are preserved until change history is cleared
  • Older jobs without changes are automatically removed

This ensures access to important historical data while managing storage efficiently.
