Engineering · 2026-06-02 · 7 min

Pruning 13,000 pages for SEO without losing content

John C. Thomas

Founder, BlueWave Projects

One of our products is a webcam directory with roughly 13,000 pages. Google was crawling it, but a large slice of those pages were doing nothing for us — thin tag pages with one camera on them, auto-generated code pages, near-empty aggregations. They were not ranking, and worse, they were spending crawl budget that should have gone to the pages that could rank. Here is how we pulled about 900 pages out of the index without deleting a single piece of real content.

Thin pages are a tax, not just dead weight

Search engines allocate a finite crawl budget per site. Every low-value URL a crawler spends time on is a high-value URL it visits less often. A directory that auto-generates a page for every tag, every category, every cross-section quickly accumulates hundreds of pages that each have almost nothing on them — a single listing, a heading, a footer. Individually harmless. In aggregate, they dilute the site's perceived quality and waste the crawler's time.

In our case the offenders were specific and identifiable:

  • Tag pages with exactly one item (a tag that only ever applied to one thing)
  • Auto-generated pages keyed off junk codes that no human would ever search
  • Category aggregations that duplicated content already on better pages

About 474 of them. None worth indexing. All worth keeping accessible, since some users do land on them, just not worth a slot in the index.
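
A minimal sketch of the kind of audit that surfaces the first class of offender, single-item tag pages. The `tag_items` mapping and the sample tags are hypothetical stand-ins for whatever a directory's data layer exposes, not our actual schema.

```python
from collections import Counter


def find_single_item_tags(tag_items: dict[str, list[str]]) -> list[str]:
    """Return the tags whose page would list exactly one item, sorted by name."""
    return sorted(tag for tag, items in tag_items.items() if len(items) == 1)


def page_size_histogram(tag_items: dict[str, list[str]]) -> Counter:
    """Count how many tag pages exist at each item count."""
    return Counter(len(items) for items in tag_items.values())


if __name__ == "__main__":
    # Hypothetical sample data: tag -> items carrying that tag.
    sample = {
        "beach": ["cam-1", "cam-2", "cam-3"],
        "harbor": ["cam-4"],
        "volcano": ["cam-5"],
        "city": ["cam-6", "cam-7"],
    }
    print(find_single_item_tags(sample))
    print(page_size_histogram(sample))
```

The histogram is there because the tail is usually the interesting part: it shows how many pages sit at one item, two items, and so on before you commit to a cutoff.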

noindex, not delete

The instinct is to delete thin pages. That is usually wrong. Deleting breaks any inbound links, loses pages that occasionally serve a user, and throws away content you might consolidate later. The right tool is almost always noindex.

A noindex directive tells search engines "you may crawl this, but do not put it in the index." The page still works for the human who lands on it from a direct link; it just stops competing for and diluting your search presence. We added noindex to the thin pages programmatically: the same template that generated them learned to mark the low-value ones.
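
For readers who have not set one before, here is a hedged sketch of the two standard ways to express the directive: a robots meta tag in the page's head, or an `X-Robots-Tag` response header. The helper names are hypothetical, and the post does not depend on which of the two a given stack uses.

```python
def robots_meta_tag(indexable: bool) -> str:
    """Return the robots meta tag to place in <head>, or "" for indexable pages.

    Indexable pages need no tag at all: indexing is the default behaviour.
    "follow" keeps the links on a noindexed page eligible to be followed.
    """
    if indexable:
        return ""
    return '<meta name="robots" content="noindex, follow">'


def robots_header(indexable: bool) -> dict[str, str]:
    """Equivalent HTTP response header, handy for non-HTML responses."""
    return {} if indexable else {"X-Robots-Tag": "noindex"}


if __name__ == "__main__":
    print(robots_meta_tag(False))
    print(robots_header(False))
```

Either form only works if the crawler is allowed to fetch the page. A URL blocked in robots.txt is never fetched, so its noindex is never seen.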

The rule we encoded: a tag page with fewer than a threshold of real items, or matching a junk-code pattern, gets noindex. Everything above the threshold stays indexable. The logic lives in one place so it stays consistent as the directory grows.
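
The rule can be sketched as a single function. The threshold value and the junk-code pattern below are placeholders chosen for illustration; the real cutoff and pattern depend on the site's data and are not given here.

```python
import re

# Hypothetical values: tune both against your own audit.
MIN_ITEMS = 2
JUNK_CODE_RE = re.compile(r"[a-z]{2}\d{4,}")  # e.g. "zz12345"


def is_indexable(slug: str, item_count: int) -> bool:
    """Single source of truth for whether a tag page should be indexed."""
    if item_count < MIN_ITEMS:
        return False  # too thin to be worth a slot in the index
    if JUNK_CODE_RE.fullmatch(slug):
        return False  # auto-generated code nobody searches for
    return True


if __name__ == "__main__":
    for slug, count in [("beach", 5), ("harbor", 1), ("zz12345", 10)]:
        print(slug, count, is_indexable(slug, count))
```

Keeping the decision in one pure function is what lets the template, the sitemap, and any tests all ask the same question.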

Drop the noindexed pages from the sitemap too

A sitemap is a list of the URLs you are explicitly asking the search engine to crawl and consider for indexing. Listing a noindexed page in your sitemap sends two contradictory signals: please index this, and do not index this. Pick one. We pruned the sitemap to match the noindex logic: if a page is noindexed, it is not in the sitemap. The sitemap went from 13,974 URLs to 13,108, a drop of 866, and every remaining URL is one we actually want ranked.

A subtle implementation note: your sitemap generator and your noindex logic have to agree, or you reintroduce the contradiction. We made the sitemap compute the same "is this indexable" check the pages use, so the two can never drift apart.
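
One way to wire that up, as a sketch under assumptions: the `Page` type, the example URLs, and the simplified item-count check are all hypothetical. The point is only that the page template and the sitemap builder call the same function.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass(frozen=True)
class Page:
    url: str
    item_count: int


def is_indexable(page: Page, min_items: int = 2) -> bool:
    """Shared check (simplified here to the item-count rule only)."""
    return page.item_count >= min_items


def robots_meta(page: Page) -> str:
    """Template side: emit noindex for pages that fail the shared check."""
    if is_indexable(page):
        return ""
    return '<meta name="robots" content="noindex, follow">'


def build_sitemap(pages: Iterable[Page]) -> str:
    """Sitemap side: list only pages that pass the same shared check."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if not is_indexable(page):
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page.url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    pages = [
        Page("https://example.com/tags/beach", 3),
        Page("https://example.com/tags/harbor", 1),
    ]
    print(build_sitemap(pages))
```

A cheap regression test follows from this shape: for every page, assert that it appears in the sitemap exactly when `robots_meta` returns an empty string.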

What we did NOT touch

The content pages themselves, the actual thing people search for, were healthy and stayed fully indexable. This was not a content cull. It was removing the auto-generated chaff around good content so the good content gets the crawler's attention. The distinction matters: prune the aggregations and junk, never the pages that answer a real query.

Reading the results in Search Console

After a prune like this, expect the noindex bucket in Search Console's page indexing report (formerly the coverage report), labelled "Excluded by 'noindex' tag", to rise. That is the intended outcome, not a regression. The number you want to watch fall over the following weeks is "Crawled - currently not indexed," which is the engine's way of saying "I spent budget here and decided it was not worth indexing." Move those pages to an explicit noindex and out of the sitemap, and over time the crawler spends fewer visits on them and reallocates to the pages that earn their place.

What I would tell another team

  • Thin auto-generated pages are a crawl-budget tax. Audit how many your templates silently produce.
  • noindex, do not delete. Keep the page working for direct visitors; just pull it from the index race.
  • Make your sitemap and your noindex logic compute the same indexability check, so they can never contradict each other.
  • Expect the noindex count to rise and "crawled not indexed" to fall. That is the prune working.

The whole change was a few lines of template logic plus a sitemap that respects them. No content lost, about 900 low-value pages out of the index, and the crawler pointed at what matters.

If your site auto-generates more pages than you can name, [we can help you find the chaff](https://bluewaveprojects.com/booking).
