Refactor Sitemap and URL Classification Architecture

Status: Planning

Problem Statement

Current implementation has split responsibilities and duplicated logic:

  1. Sitemap matchers were extended to classify individual URLs (duplicating pattern logic)
  2. URL classification patterns exist in multiple places:
     • SitemapProcessor.looksLikeProductUrl() - negative patterns
     • ProductUrlPatternBuilder - positive patterns from database
     • AbstractSitemapPatternMatcher.NON_PRODUCT_URL_PATTERNS - duplicate negative patterns
  3. Misleading confidence scores: pattern-based classification returns 0/40/60/80/null, but only 0 and null are meaningful (everything else gets overwritten by AI)
  4. Squarespace needs flat sitemap support, which is a generic sitemap structure pattern, not platform-specific

Agreed Architecture

Principle: Separation of Concerns

  1. Sitemap Matchers → Classify sitemap structure ONLY
  2. URL Classification → Single-responsibility service for all URL confidence scoring
  3. Two-stage confidence system remains (sketched below):
     • Stage 1: Pattern-based (fast, free) - binary filter (0 or null)
     • Stage 2: AI-based (slow, expensive) - refined scoring (0-100)
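
A minimal sketch of how the two stages compose, assuming nothing beyond the contract above. The callables stand in for the real services (CrawlUrlFactory for Stage 1, UrlPatternClassificationService for Stage 2); the function and its name are illustrative, not part of the planned API.

<?php

// Illustrative only: shows the two-stage confidence contract.
function scoreUrl(string $url, callable $patternStage, callable $aiStage): ?int
{
    // Stage 1: pattern-based, fast and free. Binary outcome only.
    $initial = $patternStage($url); // 0 = definite non-product, null = unknown

    if ($initial === 0) {
        return null; // filtered out; never reaches the AI stage
    }

    // Stage 2: AI-based, slow and expensive. Refines unknowns to 0-100.
    return $aiStage($url);
}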

Refactoring Plan

Phase 1: Revert URL Classification from Matchers

Goal: Remove URL-level classification from sitemap matchers (return to sitemap-only responsibility)

1.1 Remove URL Classification Interface Methods

  • File: src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php
  • Action: Remove (the slimmed interface is sketched after this list):
  • canClassifyUrls(): bool
  • classifyUrl(string $url): UrlClassification
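
For orientation, the interface after the removal might look like the sketch below. Only canHandle() is confirmed by the code samples later in this plan; the sitemap-classification method name and getPlatformName() are assumptions inferred from the matcher constants and test descriptions.

<?php

namespace App\Service\Crawler\Processing\SitemapPattern;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;

interface SitemapPatternMatcherInterface
{
    // Confirmed by the FlatSitemapMatcher sample in Phase 2.
    public function canHandle(string $sitemapUrl, ?RoasterCrawlConfig $config = null): bool;

    // Method name assumed for illustration: classifies sitemap URLs
    // (e.g. INDEX, UNKNOWN), never individual page URLs.
    public function classifySitemap(string $sitemapUrl): SitemapTypeEnum;

    // Assumed accessor for the PLATFORM_NAME constant.
    public function getPlatformName(): string;
}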

1.2 Remove URL Pattern Constants from Abstract Matcher

  • File: src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php
  • Action: Remove:
  • PRODUCT_URL_PATTERNS constant
  • NON_PRODUCT_URL_PATTERNS constant
  • canClassifyUrls() method
  • classifyUrl() method

1.3 Simplify Squarespace Matcher

  • File: src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php
  • Action: Remove all URL pattern constants, keep only sitemap-level constants

1.4 Remove DTOs/Enums for URL Classification

  • Files to delete:
  • src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php
  • src/DTO/UrlClassification.php

1.5 Update SitemapClassification

  • File: src/DTO/SitemapClassification.php
  • Action: Remove matcher property (no longer needed for URL classification); a sketch of the slimmed DTO follows
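
A sketch of the slimmed DTO under the assumption that it carries the sitemap URL and its detected type; the actual remaining properties are not shown in this plan, so the field names here are hypothetical.

<?php

namespace App\DTO;

use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;

// Field names are hypothetical; the point is that the matcher reference
// is gone, so consumers can no longer reach back into a matcher to
// classify individual URLs.
final readonly class SitemapClassification
{
    public function __construct(
        public string $sitemapUrl,
        public SitemapTypeEnum $type,
    ) {}
}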

1.6 Update SitemapPatternRegistry

  • File: src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php
  • Action: Don't pass matcher to SitemapClassification constructor

1.7 Revert SitemapProcessor Changes

  • File: src/Service/Crawler/Processing/SitemapProcessor.php
  • Action: Remove platform-specific URL classification logic, use only looksLikeProductUrl()

1.8 Delete Tests for URL Classification

  • File: tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php - delete only the URL classification test cases; keep the sitemap-level tests

Phase 2: Create Generic Flat Sitemap Matcher

Goal: Support platforms like Squarespace that use single sitemap.xml without sub-sitemaps

2.1 Create FlatSitemapMatcher

  • File: src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php
  • Purpose: Generic matcher for platforms with flat sitemap structure
  • Activation: Config-based (sitemap_platform_name = 'flat')
  • Characteristics:
  • No IDENTIFIER_PATTERNS (requires config)
  • PRODUCT_PATTERNS = [] (no separate product sitemaps)
  • NON_PRODUCT_PATTERNS = [] (no separate non-product sitemaps)
  • INDEX_PATTERNS = ['/sitemap\.xml$/'] (single sitemap file)
  • Returns SitemapTypeEnum::INDEX for sitemap.xml
  • Returns SitemapTypeEnum::UNKNOWN for anything else
<?php

namespace App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\AbstractSitemapPatternMatcher;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

/**
 * Generic matcher for platforms with flat sitemap structure.
 * Single sitemap.xml file containing all URLs without sub-sitemaps.
 * Used by: Squarespace, and other platforms with similar structure.
 */
#[AutoconfigureTag('app.sitemap_matcher')]
final class FlatSitemapMatcher extends AbstractSitemapPatternMatcher
{
    protected const array IDENTIFIER_PATTERNS = [];
    protected const array PRODUCT_PATTERNS = [];
    protected const array NON_PRODUCT_PATTERNS = [];
    protected const array INDEX_PATTERNS = ['/sitemap\.xml$/'];
    protected const string PLATFORM_NAME = 'flat';

    public function canHandle(string $sitemapUrl, ?RoasterCrawlConfig $config = null): bool
    {
        return $config?->getSitemapPlatformName() === self::PLATFORM_NAME;
    }
}

2.2 Create Tests for FlatSitemapMatcher

  • File: tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php
  • Test cases (a starter sketch follows the list):
  • Config-based activation
  • Sitemap type detection (INDEX for sitemap.xml)
  • Platform name returns 'flat'
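
A starter sketch for the activation test. It assumes AbstractSitemapPatternMatcher needs no constructor arguments and that RoasterCrawlConfig can be mocked; the sitemap-type assertion is omitted because the classification method's name is not shown in this plan.

<?php

namespace App\Tests\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms\FlatSitemapMatcher;
use PHPUnit\Framework\TestCase;

final class FlatSitemapMatcherTest extends TestCase
{
    public function testActivatesOnlyViaConfiguredPlatformName(): void
    {
        $matcher = new FlatSitemapMatcher();

        $config = $this->createMock(RoasterCrawlConfig::class);
        $config->method('getSitemapPlatformName')->willReturn('flat');

        // Config-based activation: no config means no match.
        self::assertTrue($matcher->canHandle('https://example.com/sitemap.xml', $config));
        self::assertFalse($matcher->canHandle('https://example.com/sitemap.xml', null));
    }
}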

2.3 Update Squarespace Roasters Configuration

  • Set sitemap_platform_name = 'flat' for Squarespace roasters in the database (SQL in Migration Steps below)
  • Document that Squarespace uses flat sitemap structure

Phase 3: Extract URL Classification to Service

Goal: Centralize URL classification logic in dedicated service

3.1 Create CrawlUrlFactory Service

  • File: src/Service/Crawler/Processing/CrawlUrlFactory.php
  • Purpose: Create CrawlUrl DTOs with initial confidence scoring
  • Responsibilities:
  • Encapsulate pattern-based URL classification
  • Create CrawlUrl objects with appropriate confidence
  • Single source of truth for "what makes a URL worth analyzing?"
<?php

namespace App\Service\Crawler\Processing;

use App\DTO\CrawlUrl;
use Psr\Log\LoggerInterface;

final readonly class CrawlUrlFactory
{
    // ProductUrlPatternBuilder is deliberately not injected here: positive
    // patterns serve the AI stage only (see Phase 4).
    public function __construct(
        private LoggerInterface $logger
    ) {}

    /**
     * Create CrawlUrl DTOs from raw URL strings.
     * Applies pattern-based filtering to determine initial confidence.
     *
     * @param string[] $urls Raw URL strings
     * @return CrawlUrl[] Filtered URLs with initial confidence
     */
    public function createFromUrls(array $urls): array
    {
        $crawlUrls = [];

        foreach ($urls as $url) {
            $crawlUrl = $this->createFromUrl($url);

            if ($crawlUrl !== null) {
                $crawlUrls[] = $crawlUrl;
            }
        }

        return $crawlUrls;
    }

    /**
     * Create a single CrawlUrl from a URL string.
     * Returns null if URL should be filtered out (obvious non-product).
     */
    public function createFromUrl(string $url): ?CrawlUrl
    {
        $confidence = $this->looksLikeProductUrl($url);

        // Filter out obvious non-products
        if ($confidence === 0) {
            return null;
        }

        return new CrawlUrl($url, $confidence);
    }

    /**
     * Determine if a URL looks like a product page using pattern matching.
     *
     * Returns:
     * - 0 = Definitely NOT a product (will be filtered out)
     * - null = Unknown, needs AI analysis
     *
     * Note: This is Stage 1 (pattern-based) classification.
     * Stage 2 (AI-based) will refine null values to 0-100 scores.
     */
    private function looksLikeProductUrl(string $url): ?int
    {
        $url = strtolower($url);

        // Negative patterns - obvious non-products
        $negativePatterns = [
            '%/(cart|checkout|account|login|register|wishlist|search)/%',
            '%/(blog|news|article|post)/%',
            '%/(about|contact|faq|help|support|careers)/%',
            '%/(category|collection|tag|pages?)/%',
            '%/(policy|policies|terms|privacy|shipping|returns)/%',
        ];

        foreach ($negativePatterns as $pattern) {
            if (preg_match($pattern, $url)) {
                $this->logger->debug('URL rejected by negative pattern', [
                    'url'     => $url,
                    'pattern' => $pattern,
                ]);

                return 0;
            }
        }

        // If it doesn't match negative patterns, it's unknown
        // AI will analyze it in Stage 2
        return null;
    }
}

3.2 Update SitemapProcessor to Use CrawlUrlFactory

  • File: src/Service/Crawler/Processing/SitemapProcessor.php
  • Changes:
  • Inject CrawlUrlFactory
  • Replace URL processing loop with factory call
  • Remove looksLikeProductUrl() method (moved to factory)
// Before (in processSitemap):
$discoveredUrls = [];
foreach ($urls as $urlElement) {
    $url = (string) $urlElement;
    $confidence = $this->looksLikeProductUrl($url);

    if ($confidence === 0) {
        $stats['urls_skipped']++;
        continue;
    }

    $discoveredUrls[] = new CrawlUrl($url, $confidence);
}

// After:
$urlStrings = array_map(fn($el) => (string) $el, $urls);
$discoveredUrls = $this->crawlUrlFactory->createFromUrls($urlStrings);
$stats['urls_skipped'] += count($urls) - count($discoveredUrls); // += mirrors the old loop's increment

3.3 Update Tests

  • Update SitemapProcessorIntegrationTest to verify filtering still works
  • Create unit tests for CrawlUrlFactory (starter sketch below)
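
A starter sketch for the factory tests, assuming the constructor shape from 3.1 (logger only) and PSR's NullLogger; the example URLs are illustrative.

<?php

namespace App\Tests\Service\Crawler\Processing;

use App\Service\Crawler\Processing\CrawlUrlFactory;
use PHPUnit\Framework\TestCase;
use Psr\Log\NullLogger;

final class CrawlUrlFactoryTest extends TestCase
{
    public function testFiltersObviousNonProductUrls(): void
    {
        $factory = new CrawlUrlFactory(new NullLogger());

        // Matches a negative pattern - rejected at Stage 1.
        self::assertNull($factory->createFromUrl('https://example.com/cart/items'));

        // No negative pattern matched - kept for AI analysis in Stage 2.
        self::assertNotNull($factory->createFromUrl('https://example.com/ethiopia-natural'));
    }

    public function testBatchCreationDropsFilteredUrls(): void
    {
        $factory = new CrawlUrlFactory(new NullLogger());

        $urls = [
            'https://example.com/checkout/',        // filtered out
            'https://example.com/colombia-washed',  // kept
        ];

        self::assertCount(1, $factory->createFromUrls($urls));
    }
}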

Phase 4: Consider ProductUrlPatternBuilder Enhancement (Optional)

Decision Point: Do we need to make ProductUrlPatternBuilder aware of positive indicators?

Current State:

  • ProductUrlPatternBuilder generates patterns from the DB (origins, varieties, processing)
  • Has static positive indicators: /coffee/, /beans/, /shop/, /products/, /farm/
  • SitemapProcessor.looksLikeProductUrl() used these for scoring (40/60/80)

With Option 2 (0 or null only):

  • Positive patterns are NOT used for filtering
  • They're only used by AI in Stage 2
  • ProductUrlPatternBuilder stays as-is (database-driven patterns)
  • No platform-awareness needed

Recommendation: Leave ProductUrlPatternBuilder unchanged. It serves AI classification in Stage 2.

Migration Steps

For Existing Squarespace Roasters:

-- Update roaster configs using Squarespace
UPDATE roaster_crawl_config
SET sitemap_platform_name = 'flat'
WHERE base_url IN (
    'https://www.alohastarcoffee.com',
    'https://www.19coffee.com',
    'https://www.bigshoulderscoffee.com'
    -- Add other Squarespace roasters
);

For Future Platforms:

  • Wix: Evaluate if flat or needs custom matcher
  • Custom platforms: Use 'flat' if single sitemap.xml, otherwise create specific matcher

Testing Strategy

Unit Tests

  • ✅ FlatSitemapMatcher (config activation, sitemap type detection)
  • ✅ CrawlUrlFactory (URL filtering, confidence scoring)

Integration Tests

  • ✅ SitemapProcessor with FlatSitemapMatcher
  • ✅ Full flow: Sitemap → URL extraction → Filtering → AI classification

Regression Tests

  • ✅ Existing matcher tests (WordPress, Shopify, Wix) still pass
  • ✅ URL filtering still prevents non-products from being stored
  • ✅ AI classification still receives correct URLs

Benefits

  1. Single Responsibility: Each component has one clear job
     • Sitemap matchers → classify sitemaps
     • CrawlUrlFactory → filter and create URL DTOs
     • ProductUrlPatternBuilder → provide domain patterns for AI
     • UrlPatternClassificationService → AI-based refinement

  2. DRY Principle: URL filtering logic in one place (CrawlUrlFactory)

  3. Honest Confidence Scores:
     • Pattern-based: 0 (reject) or null (unknown)
     • AI-based: 0-100 (refined score)
     • No misleading intermediate scores

  4. Reusable Components:
     • FlatSitemapMatcher works for any platform with a single sitemap
     • CrawlUrlFactory can be used outside the sitemap context

  5. Clear Architecture:
     • Stage 1: Fast, cheap filtering (pattern-based)
     • Stage 2: Slow, expensive refinement (AI-based)

Open Questions

  1. Should we add positive pattern matching in CrawlUrlFactory to provide better initial confidence?
     • Pro: Could help prioritize URLs for AI analysis
     • Con: Adds complexity, and AI overwrites the score anyway
     • Decision: No, keep it simple (0 or null only)

  2. Should CrawlUrlFactory have a "priority" score separate from confidence?
     • Pro: Could process URLs with /coffee/ before generic /shop/
     • Con: Additional complexity, unclear benefit
     • Decision: Defer until we have performance issues

  3. Should we batch AI classification based on initial confidence?
     • Pro: Process higher-confidence URLs first
     • Con: Current system already batches by discovery time
     • Decision: Current batching is fine

Success Criteria

  • ✅ All existing tests pass
  • ✅ Sitemap matchers only handle sitemap classification
  • ✅ URL filtering centralized in CrawlUrlFactory
  • ✅ Confidence scores are honest (0 or null for patterns, 0-100 for AI)
  • ✅ Squarespace roasters work with FlatSitemapMatcher
  • ✅ No code duplication between components
  • ✅ Architecture follows SOLID principles

Timeline

  • Phase 1: Revert URL classification (1-2 hours)
  • Phase 2: Create FlatSitemapMatcher (30 min)
  • Phase 3: Extract CrawlUrlFactory (1 hour)
  • Testing and validation (1 hour; Phase 4 is a decision only, no implementation planned)

Total Estimated Time: 3-4 hours

To Modify:

  • src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php
  • src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php
  • src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php
  • src/DTO/SitemapClassification.php
  • src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php
  • src/Service/Crawler/Processing/SitemapProcessor.php

To Create:

  • src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php
  • src/Service/Crawler/Processing/CrawlUrlFactory.php
  • tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php
  • tests/Service/Crawler/Processing/CrawlUrlFactoryTest.php

To Delete:

  • src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php
  • src/DTO/UrlClassification.php
  • tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php (URL classification tests only; sitemap-level tests stay)

Unchanged:

  • src/Service/Crawler/Processing/ProductUrlPatternBuilder.php
  • src/Service/Crawler/ContentDetection/UrlPatternClassificationService.php
  • src/EventListener/CrawlUrlsDiscoveredListener.php