AEObot

Sitemap URL Extractor: How to Extract All URLs From a Sitemap (2026)

A sitemap URL extractor pulls every page URL out of a sitemap.xml file in seconds. Learn what these tools do, why you'd extract sitemap URLs for audits and migrations, and how to do it with online tools, your browser, or the command line.

Tags: sitemap url extractor, extract urls from sitemap, sitemap.xml, technical seo, site migration

A sitemap URL extractor is a tool that reads a site's sitemap.xml file and pulls out a clean, plain-text list of every page URL it contains. Instead of squinting at raw XML — or manually copying links one by one — you point the extractor at a sitemap and get back a tidy list you can paste into a spreadsheet, feed to another tool, or audit page by page. It is one of the small, unglamorous tasks that makes almost every bigger SEO job faster, from a content audit to a full site migration.

This guide explains exactly what a sitemap URL extractor does, the situations where you'll reach for one, and several ways to extract URLs from a sitemap — online tools, your browser, and the command line, including a short copy-paste example. We'll finish with what to do once you have the list, and why your sitemap quietly matters for AI crawlers in 2026.

What Does a Sitemap URL Extractor Do?

To understand the tool, start with the file it reads. A sitemap is an XML document that lists the URLs a site wants search engines to know about. Each entry sits inside a <url> block with a <loc> tag holding the actual address, often alongside optional metadata like <lastmod> (last modified date), <changefreq>, and <priority>. A trimmed example looks like this:

<url>
  <loc>https://example.com/blog/post-one</loc>
  <lastmod>2026-05-30</lastmod>
</url>

A sitemap extractor does one job well: it parses that XML and returns just the values inside the <loc> tags — the URLs — stripped of all the surrounding markup. The good ones also handle the wrinkles you hit on real sites:

  • Sitemap index files. Large sites don't use one giant sitemap. The sitemaps protocol caps a single file at 50,000 URLs and 50MB uncompressed, so big sites split their pages across multiple sitemaps and tie them together with a sitemap index — a sitemap that lists other sitemaps. A capable extractor follows the index, opens each child sitemap, and merges everything into one master list.
  • Gzipped sitemaps. Sitemaps are often served compressed as .xml.gz to save bandwidth. A good extractor decompresses them automatically.
  • Deduplication and counts. It removes duplicate URLs and tells you the total — a useful first sanity check against how many pages you think you have.

The end result is the same regardless of tool: a flat list of every URL the site has declared, ready to use. When people say they want to get URLs from sitemap.xml, this clean export is what they're after.
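The deduplication and count step is simple enough to sketch in shell. This is a minimal illustration, assuming the URLs have already been extracted to plain text with one URL per line; the helper names and the file name sitemap-urls.txt are hypothetical.

```shell
# dedupe_urls: hypothetical helper. Reads one URL per line on stdin,
# drops blank lines, and prints each distinct URL once, sorted.
dedupe_urls() {
  grep -v '^[[:space:]]*$' | sort -u
}

# count_urls: hypothetical helper. Prints the number of lines on stdin.
# tr strips the padding that some wc implementations add.
count_urls() {
  wc -l | tr -d ' '
}

# Usage (sitemap-urls.txt is an assumed file name):
#   dedupe_urls < sitemap-urls.txt > sitemap-urls.unique.txt
#   count_urls < sitemap-urls.unique.txt
```

If the count is far from the number of pages you expect, that is usually worth investigating before doing anything else with the list.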

Why Would You Extract Sitemap URLs?

Pulling a list of URLs sounds trivial until you realize how many workflows start with exactly that list. Here are the most common reasons to extract URLs from a sitemap:

  • Content and SEO audits. A full URL inventory is the foundation of any audit. With every page in a spreadsheet, you can crawl each one to check titles, meta descriptions, status codes, and word counts — and spot thin, duplicate, or orphaned content. A sitemap is the fastest way to enumerate a site you own.
  • Site migrations and redesigns. Migrations are where missing URLs become expensive. Before you move a site or change its URL structure, export the old sitemap to build a complete map of existing pages. That list becomes your redirect plan — every old URL needs a destination — and your post-launch QA checklist. Skip it and pages quietly fall off the map, taking their rankings and backlinks with them.
  • Indexing and coverage checks. Comparing your sitemap URLs against the pages actually indexed (via Google Search Console) reveals gaps both ways: pages you submitted that Google ignored, and indexed pages missing from your sitemap. The URL list is what makes that comparison possible.
  • Feeding other tools. Many tools — crawlers, broken-link checkers, screenshot utilities, AI-visibility scanners — accept a list of URLs as input. Extracting from the sitemap is the quickest way to hand them the full set of pages instead of relying on a slower discovery crawl.
  • Research and benchmarking. A competitor's sitemap is a public table of contents — extract it for a quick, ethical snapshot of how they structure content, how much they've published, and which sections they prioritize.

In nearly every case, the URL list isn't the destination — it's the starting point that makes the next, more valuable task possible.
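The indexing and coverage check from the list above shows how little tooling the next step can need. The sketch below assumes two plain-text files with one URL per line: one extracted from the sitemap and one exported by hand from Google Search Console. The helper name and both file names are hypothetical.

```shell
# compare_lists: hypothetical helper. Takes two files of URLs (one per line)
# and reports the gaps in both directions. comm requires sorted input, so
# both files are sorted first under the same locale.
compare_lists() {
  sitemap_file=$1
  indexed_file=$2
  tmpdir=$(mktemp -d) || return 1
  LC_ALL=C sort -u "$sitemap_file" > "$tmpdir/sitemap.sorted"
  LC_ALL=C sort -u "$indexed_file" > "$tmpdir/indexed.sorted"

  echo "In sitemap but not indexed:"
  # -23 hides lines unique to file 2 and lines common to both
  LC_ALL=C comm -23 "$tmpdir/sitemap.sorted" "$tmpdir/indexed.sorted"
  echo "Indexed but not in sitemap:"
  # -13 hides lines unique to file 1 and lines common to both
  LC_ALL=C comm -13 "$tmpdir/sitemap.sorted" "$tmpdir/indexed.sorted"

  rm -rf "$tmpdir"
}

# Usage (both file names are assumptions):
#   compare_lists sitemap-urls.txt indexed-urls.txt
```

The comparison is an exact string match, so differences such as a trailing slash or http versus https will show up as gaps. Normalize both lists first if that matters for your site.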

How to Extract URLs From a Sitemap: 3 Methods

There are three practical ways to extract URLs from a sitemap, scaling from zero-setup to fully automated. Pick the one that matches your comfort level and how often you need to do it.

1. Online sitemap URL extractor tools

The fastest path for a one-off job. A web-based sitemap URL extractor asks for nothing but a URL: paste in https://example.com/sitemap.xml, click extract, and copy the list. Popular options include tools from Sitebulb, XML-Sitemaps, and various free single-purpose extractors. Most follow sitemap index files automatically and export to CSV or plain text.

Best for: non-technical users and quick checks of a public sitemap. Trade-offs: very large sitemaps may be truncated on free tiers, and you're pasting a URL into a third-party service — fine for public sitemaps, worth noting for sensitive ones.

2. Your browser

Every browser can already open a sitemap — just navigate to the sitemap.xml URL directly. Modern browsers render the XML in a readable tree, and you can use the browser's find function or "view source" to scan entries. For a sitemap with a handful of URLs, you can select all, copy, and clean up the result by hand.

This is clumsy for anything large, but genuinely useful for a quick look: confirming a sitemap exists, checking it isn't erroring, or eyeballing whether a specific page is included. No tools, no setup.

3. The command line

For repeatable work, large sites, or anything you want to script, the command line is the most powerful option, and it's simpler than it sounds. On Linux, or any system with GNU grep installed, you can fetch a sitemap and pull out the URLs with a single piped command using curl and grep:

# Fetch a sitemap and extract every URL from its <loc> tags
curl -s https://example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  > sitemap-urls.txt

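Here curl -s downloads the sitemap quietly, grep -oP '(?<=<loc>)[^<]+' uses a lookbehind to print only the text inside each <loc> tag, and > sitemap-urls.txt saves the clean list to a file. For a gzipped sitemap, add a decompression step (curl -s … | gunzip | grep …). The -P flag is a GNU grep feature, so the grep that ships with macOS doesn't have it. On such a system, a sed one-liner like sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' does the same job as long as each <loc> sits on its own line. On a minified, single-line sitemap the greedy pattern returns only the last URL, so split the tags onto separate lines before running it.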
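For sitemaps that are minified onto one line, or whose URLs contain an ampersand (stored as &amp; in XML), a slightly more defensive variant may help. The following is a sketch, not a full XML parser: it assumes each <loc> element opens and closes on the same line and contains no CDATA section, and the function name extract_locs is hypothetical.

```shell
# extract_locs: hypothetical helper. Reads sitemap XML on stdin and prints
# the text of every <loc> element, one per line.
extract_locs() {
  # grep -o prints each match on its own line, so this also works when the
  # whole sitemap sits on a single line. It needs no -P support.
  grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/^<loc>[[:space:]]*//' \
          -e 's/[[:space:]]*<\/loc>$//' \
          -e 's/&amp;/\&/g'   # decode the XML-escaped ampersand
}

# Usage:
#   curl -s https://example.com/sitemap.xml | extract_locs > sitemap-urls.txt
```

Only &amp; is decoded here because it is the entity most likely to appear in a URL. For sitemaps that don't fit these assumptions, a real XML parser is the safer choice.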

For a sitemap index, run the same command against the index first to get the list of child sitemaps, then loop over those to extract their URLs. Once it's a script, you can re-run it any time the site changes.
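The index-then-loop approach might look like the sketch below. It assumes the index lists only ordinary sitemaps (the protocol doesn't allow an index to list other indexes), that compressed children have names ending in .gz, and that each <loc> sits on one line. All three function names are hypothetical.

```shell
# extract_locs: prints the text of every <loc> element found on stdin.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/<loc>//' -e 's/<\/loc>//' -e 's/&amp;/\&/g'
}

# fetch_xml: downloads a URL and decompresses it when the name ends in .gz.
# -f makes curl fail on HTTP errors, -L follows redirects.
fetch_xml() {
  case $1 in
    *.gz) curl -sSfL "$1" | gunzip -c ;;
    *)    curl -sSfL "$1" ;;
  esac
}

# urls_from_index: reads the child sitemap URLs from a sitemap index, then
# extracts the page URLs from each child and merges them into one
# deduplicated list.
urls_from_index() {
  fetch_xml "$1" | extract_locs | while IFS= read -r child; do
    fetch_xml "$child" | extract_locs
  done | sort -u
}

# Usage (the index URL is an assumption):
#   urls_from_index https://example.com/sitemap_index.xml > sitemap-urls.txt
```

For a large site, consider adding a short sleep inside the loop so the script doesn't hit the server with back-to-back requests.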

Best for: developers, recurring audits, automation, and very large sitemaps where online tools choke.

What to Do With Your Sitemap URL List

A clean sitemap URL list is the easy part. The value comes from what you do next. Once you've extracted the URLs:

  1. Crawl them. Run the list through a crawler (Screaming Frog, Sitebulb, or similar) to capture status codes, titles, meta descriptions, canonical tags, and word counts for every page in one pass.
  2. Check status codes. Flag anything returning a 404 or 500 — those are dead pages still being advertised in your sitemap, which wastes crawl budget and confuses search engines. Your sitemap should list only live, canonical, indexable URLs.
  3. Compare against what's indexed. Cross-reference the list with Google Search Console's coverage data to find pages submitted but not indexed, and indexed pages missing from the sitemap.
  4. Build your redirect map. For a migration, pair each old URL with its new destination so nothing gets orphaned at launch.
  5. Audit content quality. With every URL in a spreadsheet, prioritize thin pages to improve, duplicates to consolidate, and high-value pages to refresh.

Each step turns a plain list into a concrete action plan — the whole point of extracting it.
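The status-code check in step 2 can be roughed out with curl alone before you reach for a full crawler. This sketch assumes a plain-text list with one URL per line and uses HEAD requests, which some servers reject or answer differently from GET. The function names and file names are hypothetical.

```shell
# http_status: prints the HTTP status code for one URL.
# -I sends a HEAD request, -o /dev/null discards the headers, and the
# -w '%{http_code}' write-out prints only the response code.
http_status() {
  curl -s -o /dev/null -I -w '%{http_code}' --max-time 15 "$1"
}

# check_urls: reads URLs on stdin and prints "CODE URL" for each one.
check_urls() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    printf '%s %s\n' "$(http_status "$url")" "$url"
  done
}

# flag_problems: keeps only the lines whose status code is not 200,
# which covers redirects as well as 404s and 500s.
flag_problems() {
  awk '$1 != "200"'
}

# Usage (file names are assumptions):
#   check_urls < sitemap-urls.txt | flag_problems > sitemap-problems.txt
```

A code of 000 means curl got no HTTP response at all, for example after a timeout or a DNS failure.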

Sitemaps and AI Crawler Discovery

There's a newer reason to care about your sitemap: AI answer engines crawl the web too. Crawlers like OpenAI's GPTBot, Anthropic's ClaudeBot, and PerplexityBot gather content for ChatGPT, Claude, and Perplexity, while Google's AI Overviews draw on what Googlebot has already crawled. Like Googlebot, these crawlers can use a sitemap to discover pages, although their operators publish far less detail about how heavily they rely on it.

That makes a complete, accurate sitemap part of your AI-visibility foundation. A few principles:

  • List every page you want cited. If a page isn't in your sitemap and isn't well linked internally, AI crawlers may never find it — and a page that's never crawled can never be quoted in an answer. This is the same discoverability problem at the heart of answer engine optimization.
  • Keep it clean. A sitemap full of dead URLs, redirects, or noindexed pages sends mixed signals. Extracting your URLs and checking their status codes (the workflow above) is the easiest way to keep it tidy.
  • Reference it in robots.txt. Adding a Sitemap: line to your robots.txt is the fastest way for any crawler — Google's or an AI engine's — to find your full URL set on its first visit. (If you're on WordPress, our WordPress robots.txt guide shows where to add it.)

Being discoverable is necessary but not sufficient. Once crawlers can find your pages, the next question is whether AI engines actually understand and cite them — which is what AI search visibility measures.

Curious whether AI engines can find, read, and cite your content today? Run a free scan at aeobot.io/scan to see exactly where your pages stand — and what to fix first.

Frequently Asked Questions

What is the easiest way to extract all URLs from a sitemap?

For a quick, one-off job, an online sitemap URL extractor is easiest: paste your sitemap.xml address into a web-based tool and copy the list it returns. Most handle sitemap index files automatically and export to CSV or plain text, with no installation or technical skill required. For recurring or large-scale extraction, a curl plus grep command line is faster and scriptable.

How do I extract URLs from a sitemap index file?

A sitemap index is a sitemap that lists other sitemaps rather than pages. First extract the URLs from the index to get the list of child sitemaps, then extract the page URLs from each of those. Most good online extractors and crawlers do this automatically — they detect the index, follow each child sitemap, and merge everything into one combined list. On the command line, run your extraction command against the index, then loop over the resulting sitemap URLs.

Where do I find a website's sitemap.xml?

The most common location is https://example.com/sitemap.xml. If that returns nothing, check the site's robots.txt file (at https://example.com/robots.txt), where sites typically declare their sitemap with a Sitemap: line. Sites that split their sitemap often expose an index at a path like /sitemap_index.xml (the Yoast SEO default on WordPress), and CMS platforms sometimes generate numbered files such as /sitemap-1.xml. Google Search Console also lists the sitemaps submitted for any site you own.
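The robots.txt lookup can be scripted as well. A minimal sketch, assuming the file follows the usual "Sitemap: URL" convention; the function name is hypothetical.

```shell
# sitemaps_from_robots: hypothetical helper. Reads robots.txt on stdin and
# prints the URL from each "Sitemap:" line. The field name is matched
# case-insensitively, and Windows line endings are stripped first.
sitemaps_from_robots() {
  tr -d '\r' \
    | grep -i '^[[:space:]]*sitemap:' \
    | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Usage:
#   curl -s https://example.com/robots.txt | sitemaps_from_robots
```

A site can declare several sitemaps this way, so expect zero, one, or many lines of output.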

Can I extract URLs from any website's sitemap?

You can extract URLs from any publicly accessible sitemap, since sitemaps are designed to be read by crawlers and contain only the URLs a site has chosen to publish — which is what makes them useful for competitor research. Just be reasonable with the data: extracting a public list of URLs is fine; aggressively scraping the pages behind them may run into rate limits or terms-of-service rules.

Why are some pages missing from a sitemap?

A page can be absent because it's intentionally excluded (noindexed, or filtered out by the CMS or an SEO plugin), because the sitemap is outdated and was never regenerated, or because the page is orphaned and the sitemap generator never discovered it. This is exactly why comparing your extracted sitemap URLs against a full crawl and your indexed pages is valuable: the gaps tell you what your sitemap is silently leaving out.