There are several reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
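If you do turn up an old sitemap file, extracting its URLs takes only a few lines. Here's a minimal Python sketch; the filename is a placeholder for whatever export you find:

```python
# Minimal sketch: pull the <loc> URLs out of a saved sitemap.xml file.
# "old-sitemap.xml" is a placeholder for whatever export you dug up.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(path: str) -> list[str]:
    """Return every <loc> value from a standard sitemap file."""
    tree = ET.parse(path)
    return [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS) if loc.text]

if __name__ == "__main__":
    for url in urls_from_sitemap("old-sitemap.xml"):
        print(url)
```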
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
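As an alternative to scraping the interface, the Wayback Machine also exposes a CDX API you can query directly. This is a minimal sketch rather than the workflow described above; the domain is a placeholder and the 10,000-row cap is our own limit:

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine's CDX API.
# "example.com" is a placeholder; the limit mirrors the UI's 10,000-URL cap.
import requests

def wayback_urls(domain: str, limit: int = 10000) -> list[str]:
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",
            "output": "json",
            "fl": "original",        # only return the original URL field
            "collapse": "urlkey",    # deduplicate repeat captures of the same URL
            "limit": limit,
        },
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()
    return [row[0] for row in rows[1:]]  # first row is the header

print(len(wayback_urls("example.com")))
```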
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
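If you go the API route, paging through the Search Analytics endpoint might look like the sketch below. The credential setup, property name, and date range are all assumptions you'd fill in for your own site:

```python
# Minimal sketch: page the Search Analytics API to collect pages with impressions.
# Assumes OAuth or service-account credentials ("creds") with access to the property;
# the property name you pass (e.g. "sc-domain:example.com") is a placeholder.
from googleapiclient.discovery import build

def gsc_pages(creds, site_url: str, start: str, end: str) -> list[str]:
    service = build("searchconsole", "v1", credentials=creds)
    pages, start_row = [], 0
    while True:
        body = {
            "startDate": start,
            "endDate": end,
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        }
        response = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
        rows = response.get("rows", [])
        if not rows:
            break
        pages.extend(row["keys"][0] for row in rows)
        start_row += len(rows)
    return pages
```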
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
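If the UI export becomes unwieldy, the GA4 Data API can pull the same pagePath dimension programmatically. A minimal sketch, assuming the google-analytics-data client library is installed, default credentials are configured, and the property ID is a placeholder:

```python
# Minimal sketch: pull page paths from the GA4 Data API instead of the UI export.
# Assumes application-default credentials; "123456789" stands in for your GA4 property ID.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

def ga4_page_paths(property_id: str) -> list[str]:
    client = BetaAnalyticsDataClient()
    request = RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[Dimension(name="pagePath")],
        metrics=[Metric(name="screenPageViews")],
        date_ranges=[DateRange(start_date="365daysAgo", end_date="today")],
        limit=100000,
    )
    response = client.run_report(request)
    return [row.dimension_values[0].value for row in response.rows]

print(len(ga4_page_paths("123456789")))
```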
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path queried by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process, and even a short script goes a long way (see the sketch below).
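As a starting point, here is a minimal sketch that reduces a raw access log to a set of unique URLs. It assumes the common/combined log format; the filename and hostname are placeholders, and your CDN's logs may be laid out differently:

```python
# Minimal sketch: extract unique request paths from an access log in the common/combined
# log format. "access.log" and the host are placeholders; adjust the regex for your CDN.
import re

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*"')

def urls_from_log(path: str, host: str = "https://example.com") -> set[str]:
    urls = set()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LOG_LINE.search(line)
            if match:
                urls.add(host + match.group("path"))
    return urls

print(len(urls_from_log("access.log")))
```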
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
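For larger sets, a short pandas script in a Jupyter Notebook handles the normalization and deduplication. A minimal sketch, assuming each source was saved as a one-column file of URLs with no header (all filenames are placeholders):

```python
# Minimal sketch: combine URL lists from the sources above, normalize the formatting,
# and deduplicate. Filenames are placeholders; each file is assumed to hold one URL per line.
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

SOURCES = ["wayback.csv", "moz_links.csv", "gsc_pages.csv", "ga4_pages.csv", "log_urls.csv"]

def normalize(url: str) -> str:
    """Lowercase the scheme/host, drop fragments, and trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

frames = [pd.read_csv(f, header=None, names=["url"]) for f in SOURCES]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str).map(normalize)

urls.drop_duplicates().sort_values().to_csv("all_urls.csv", index=False, header=["url"])
```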
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!