Refactor Sitemap and URL Classification Architecture

Status: Planning

Problem Statement

Current implementation has split responsibilities and duplicated logic:

  1. Sitemap matchers were extended to classify individual URLs (duplicating pattern logic)
  2. URL classification patterns exist in multiple places:
     • SitemapProcessor.looksLikeProductUrl() - negative patterns
     • ProductUrlPatternBuilder - positive patterns from database
     • AbstractSitemapPatternMatcher.NON_PRODUCT_URL_PATTERNS - duplicate negative patterns
  3. Misleading confidence scores: pattern-based classification returns 0/40/60/80/null, but only 0 and null are meaningful (everything else gets overwritten by AI)
  4. Squarespace needs flat sitemap support, which is a generic sitemap structure pattern, not platform-specific

Agreed Architecture

Principle: Separation of Concerns

  1. Sitemap Matchers → Classify sitemap structure ONLY
  2. URL Classification → Single-responsibility service for all URL confidence scoring
  3. Two-stage confidence system remains (sketched below):
     • Stage 1: Pattern-based (fast, free) - binary filter (0 or null)
     • Stage 2: AI-based (slow, expensive) - refined scoring (0-100)
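
A minimal sketch of how the two stages compose, assuming nothing beyond the contract above. The callables stand in for the real services (CrawlUrlFactory for Stage 1, UrlPatternClassificationService for Stage 2); the function and its name are illustrative, not part of the planned API.

<?php

// Illustrative only: shows the two-stage confidence contract.
function scoreUrl(string $url, callable $patternStage, callable $aiStage): ?int
{
    // Stage 1: pattern-based, fast and free. Binary outcome only.
    $initial = $patternStage($url); // 0 = definite non-product, null = unknown

    if ($initial === 0) {
        return null; // filtered out; never reaches the AI stage
    }

    // Stage 2: AI-based, slow and expensive. Refines unknowns to 0-100.
    return $aiStage($url);
}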

Refactoring Plan

Phase 1: Revert URL Classification from Matchers

Goal: Remove URL-level classification from sitemap matchers (return to sitemap-only responsibility)

1.1 Remove URL Classification Interface Methods

  • File: src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php
  • Action: Remove (the slimmed interface is sketched after this list):
  • canClassifyUrls(): bool
  • classifyUrl(string $url): UrlClassification
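
For orientation, the interface after the removal might look like the sketch below. Only canHandle() is confirmed by the code samples later in this plan; the sitemap-classification method name and getPlatformName() are assumptions inferred from the matcher constants and test descriptions.

<?php

namespace App\Service\Crawler\Processing\SitemapPattern;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;

interface SitemapPatternMatcherInterface
{
    // Confirmed by the FlatSitemapMatcher sample in Phase 2.
    public function canHandle(string $sitemapUrl, ?RoasterCrawlConfig $config = null): bool;

    // Method name assumed for illustration: classifies sitemap URLs
    // (e.g. INDEX, UNKNOWN), never individual page URLs.
    public function classifySitemap(string $sitemapUrl): SitemapTypeEnum;

    // Assumed accessor for the PLATFORM_NAME constant.
    public function getPlatformName(): string;
}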

1.2 Remove URL Pattern Constants from Abstract Matcher

  • File: src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php
  • Action: Remove:
  • PRODUCT_URL_PATTERNS constant
  • NON_PRODUCT_URL_PATTERNS constant
  • canClassifyUrls() method
  • classifyUrl() method

1.3 Simplify Squarespace Matcher

  • File: src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php
  • Action: Remove all URL pattern constants, keep only sitemap-level constants

1.4 Remove DTOs/Enums for URL Classification

  • Files to delete:
  • src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php
  • src/DTO/UrlClassification.php

1.5 Update SitemapClassification

  • File: src/DTO/SitemapClassification.php
  • Action: Remove matcher property (no longer needed for URL classification); a sketch of the slimmed DTO follows
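
A sketch of the slimmed DTO under the assumption that it carries the sitemap URL and its detected type; the actual remaining properties are not shown in this plan, so the field names here are hypothetical.

<?php

namespace App\DTO;

use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;

// Field names are hypothetical; the point is that the matcher reference
// is gone, so consumers can no longer reach back into a matcher to
// classify individual URLs.
final readonly class SitemapClassification
{
    public function __construct(
        public string $sitemapUrl,
        public SitemapTypeEnum $type,
    ) {}
}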

1.6 Update SitemapPatternRegistry

  • File: src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php
  • Action: Don't pass matcher to SitemapClassification constructor

1.7 Revert SitemapProcessor Changes

  • File: src/Service/Crawler/Processing/SitemapProcessor.php
  • Action: Remove platform-specific URL classification logic, use only looksLikeProductUrl()

1.8 Delete Tests for URL Classification

  • File: tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php - delete only the URL classification test cases; keep the sitemap-level tests

Phase 2: Create Generic Flat Sitemap Matcher

Goal: Support platforms like Squarespace that use single sitemap.xml without sub-sitemaps

2.1 Create FlatSitemapMatcher

  • File: src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php
  • Purpose: Generic matcher for platforms with flat sitemap structure
  • Activation: Config-based (sitemap_platform_name = 'flat')
  • Characteristics:
  • No IDENTIFIER_PATTERNS (requires config)
  • PRODUCT_PATTERNS = [] (no separate product sitemaps)
  • NON_PRODUCT_PATTERNS = [] (no separate non-product sitemaps)
  • INDEX_PATTERNS = ['/sitemap\.xml$/'] (single sitemap file)
  • Returns SitemapTypeEnum::INDEX for sitemap.xml
  • Returns SitemapTypeEnum::UNKNOWN for anything else
<?php

namespace App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\AbstractSitemapPatternMatcher;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

/**
 * Generic matcher for platforms with flat sitemap structure.
 * Single sitemap.xml file containing all URLs without sub-sitemaps.
 * Used by: Squarespace, and other platforms with similar structure.
 */
#[AutoconfigureTag('app.sitemap_matcher')]
final class FlatSitemapMatcher extends AbstractSitemapPatternMatcher
{
    protected const array IDENTIFIER_PATTERNS = [];
    protected const array PRODUCT_PATTERNS = [];
    protected const array NON_PRODUCT_PATTERNS = [];
    protected const array INDEX_PATTERNS = ['/sitemap\.xml$/'];
    protected const string PLATFORM_NAME = 'flat';

    public function canHandle(string $sitemapUrl, ?RoasterCrawlConfig $config = null): bool
    {
        return $config?->getSitemapPlatformName() === self::PLATFORM_NAME;
    }
}

2.2 Create Tests for FlatSitemapMatcher

  • File: tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php
  • Test cases (a starter sketch follows the list):
  • Config-based activation
  • Sitemap type detection (INDEX for sitemap.xml)
  • Platform name returns 'flat'
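
A starter sketch for the activation test. It assumes AbstractSitemapPatternMatcher needs no constructor arguments and that RoasterCrawlConfig can be mocked; the sitemap-type assertion is omitted because the classification method's name is not shown in this plan.

<?php

namespace App\Tests\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms\FlatSitemapMatcher;
use PHPUnit\Framework\TestCase;

final class FlatSitemapMatcherTest extends TestCase
{
    public function testActivatesOnlyViaConfiguredPlatformName(): void
    {
        $matcher = new FlatSitemapMatcher();

        $config = $this->createMock(RoasterCrawlConfig::class);
        $config->method('getSitemapPlatformName')->willReturn('flat');

        // Config-based activation: no config means no match.
        self::assertTrue($matcher->canHandle('https://example.com/sitemap.xml', $config));
        self::assertFalse($matcher->canHandle('https://example.com/sitemap.xml', null));
    }
}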

2.3 Update Squarespace Roasters Configuration

  • Set sitemap_platform_name = 'flat' for Squarespace roasters in the database (SQL in Migration Steps below)
  • Document that Squarespace uses flat sitemap structure

Phase 3: Extract URL Classification to Service

Goal: Centralize URL classification logic in dedicated service

3.1 Create CrawlUrlFactory Service

  • File: src/Service/Crawler/Processing/CrawlUrlFactory.php
  • Purpose: Create CrawlUrl DTOs with initial confidence scoring
  • Responsibilities:
  • Encapsulate pattern-based URL classification
  • Create CrawlUrl objects with appropriate confidence
  • Single source of truth for "what makes a URL worth analyzing?"
<?php

namespace App\Service\Crawler\Processing;

use App\DTO\CrawlUrl;
use Psr\Log\LoggerInterface;

final readonly class CrawlUrlFactory
{
    // ProductUrlPatternBuilder is deliberately not injected here: positive
    // patterns serve the AI stage only (see Phase 4).
    public function __construct(
        private LoggerInterface $logger
    ) {}

    /**
     * Create CrawlUrl DTOs from raw URL strings.
     * Applies pattern-based filtering to determine initial confidence.
     *
     * @param string[] $urls Raw URL strings
     * @return CrawlUrl[] Filtered URLs with initial confidence
     */
    public function createFromUrls(array $urls): array
    {
        $crawlUrls = [];

        foreach ($urls as $url) {
            $crawlUrl = $this->createFromUrl($url);

            if ($crawlUrl !== null) {
                $crawlUrls[] = $crawlUrl;
            }
        }

        return $crawlUrls;
    }

    /**
     * Create a single CrawlUrl from a URL string.
     * Returns null if URL should be filtered out (obvious non-product).
     */
    public function createFromUrl(string $url): ?CrawlUrl
    {
        $confidence = $this->looksLikeProductUrl($url);

        // Filter out obvious non-products
        if ($confidence === 0) {
            return null;
        }

        return new CrawlUrl($url, $confidence);
    }

    /**
     * Determine if a URL looks like a product page using pattern matching.
     *
     * Returns:
     * - 0 = Definitely NOT a product (will be filtered out)
     * - null = Unknown, needs AI analysis
     *
     * Note: This is Stage 1 (pattern-based) classification.
     * Stage 2 (AI-based) will refine null values to 0-100 scores.
     */
    private function looksLikeProductUrl(string $url): ?int
    {
        $url = strtolower($url);

        // Negative patterns - obvious non-products
        $negativePatterns = [
            '%/(cart|checkout|account|login|register|wishlist|search)/%',
            '%/(blog|news|article|post)/%',
            '%/(about|contact|faq|help|support|careers)/%',
            '%/(category|collection|tag|pages?)/%',
            '%/(policy|policies|terms|privacy|shipping|returns)/%',
        ];

        foreach ($negativePatterns as $pattern) {
            if (preg_match($pattern, $url)) {
                $this->logger->debug('URL rejected by negative pattern', [
                    'url'     => $url,
                    'pattern' => $pattern,
                ]);

                return 0;
            }
        }

        // If it doesn't match negative patterns, it's unknown
        // AI will analyze it in Stage 2
        return null;
    }
}

3.2 Update SitemapProcessor to Use CrawlUrlFactory

  • File: src/Service/Crawler/Processing/SitemapProcessor.php
  • Changes:
  • Inject CrawlUrlFactory
  • Replace URL processing loop with factory call
  • Remove looksLikeProductUrl() method (moved to factory)
// Before (in processSitemap):
$discoveredUrls = [];
foreach ($urls as $urlElement) {
    $url = (string) $urlElement;
    $confidence = $this->looksLikeProductUrl($url);

    if ($confidence === 0) {
        $stats['urls_skipped']++;
        continue;
    }

    $discoveredUrls[] = new CrawlUrl($url, $confidence);
}

// After:
$urlStrings = array_map(fn($el) => (string) $el, $urls);
$discoveredUrls = $this->crawlUrlFactory->createFromUrls($urlStrings);
$stats['urls_skipped'] += count($urls) - count($discoveredUrls); // += mirrors the old loop's increment

3.3 Update Tests

  • Update SitemapProcessorIntegrationTest to verify filtering still works
  • Create unit tests for CrawlUrlFactory (starter sketch below)
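
A starter sketch for the factory tests, assuming the constructor shape from 3.1 (logger only) and PSR's NullLogger; the example URLs are illustrative.

<?php

namespace App\Tests\Service\Crawler\Processing;

use App\Service\Crawler\Processing\CrawlUrlFactory;
use PHPUnit\Framework\TestCase;
use Psr\Log\NullLogger;

final class CrawlUrlFactoryTest extends TestCase
{
    public function testFiltersObviousNonProductUrls(): void
    {
        $factory = new CrawlUrlFactory(new NullLogger());

        // Matches a negative pattern - rejected at Stage 1.
        self::assertNull($factory->createFromUrl('https://example.com/cart/items'));

        // No negative pattern matched - kept for AI analysis in Stage 2.
        self::assertNotNull($factory->createFromUrl('https://example.com/ethiopia-natural'));
    }

    public function testBatchCreationDropsFilteredUrls(): void
    {
        $factory = new CrawlUrlFactory(new NullLogger());

        $urls = [
            'https://example.com/checkout/',        // filtered out
            'https://example.com/colombia-washed',  // kept
        ];

        self::assertCount(1, $factory->createFromUrls($urls));
    }
}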

Phase 4: Consider ProductUrlPatternBuilder Enhancement (Optional)

Decision Point: Do we need to make ProductUrlPatternBuilder aware of positive indicators?

Current State:

  • ProductUrlPatternBuilder generates patterns from the DB (origins, varieties, processing)
  • Has static positive indicators: /coffee/, /beans/, /shop/, /products/, /farm/
  • SitemapProcessor.looksLikeProductUrl() used these for scoring (40/60/80)

With Option 2 (0 or null only):

  • Positive patterns are NOT used for filtering
  • They're only used by AI in Stage 2
  • ProductUrlPatternBuilder stays as-is (database-driven patterns)
  • No platform-awareness needed

Recommendation: Leave ProductUrlPatternBuilder unchanged. It serves AI classification in Stage 2.

Migration Steps

For Existing Squarespace Roasters:

-- Update roaster configs using Squarespace
UPDATE roaster_crawl_config
SET sitemap_platform_name = 'flat'
WHERE base_url IN (
    'https://www.alohastarcoffee.com',
    'https://www.19coffee.com',
    'https://www.bigshoulderscoffee.com'
    -- Add other Squarespace roasters
);

For Future Platforms:

  • Wix: Evaluate if flat or needs custom matcher
  • Custom platforms: Use 'flat' if single sitemap.xml, otherwise create specific matcher

Testing Strategy

Unit Tests

  • ✅ FlatSitemapMatcher (config activation, sitemap type detection)
  • ✅ CrawlUrlFactory (URL filtering, confidence scoring)

Integration Tests

  • ✅ SitemapProcessor with FlatSitemapMatcher
  • ✅ Full flow: Sitemap → URL extraction → Filtering → AI classification

Regression Tests

  • ✅ Existing matcher tests (WordPress, Shopify, Wix) still pass
  • ✅ URL filtering still prevents non-products from being stored
  • ✅ AI classification still receives correct URLs

Benefits

  1. Single Responsibility: Each component has one clear job
     • Sitemap matchers → classify sitemaps
     • CrawlUrlFactory → filter and create URL DTOs
     • ProductUrlPatternBuilder → provide domain patterns for AI
     • UrlPatternClassificationService → AI-based refinement

  2. DRY Principle: URL filtering logic in one place (CrawlUrlFactory)

  3. Honest Confidence Scores:
     • Pattern-based: 0 (reject) or null (unknown)
     • AI-based: 0-100 (refined score)
     • No misleading intermediate scores

  4. Reusable Components:
     • FlatSitemapMatcher works for any platform with a single sitemap
     • CrawlUrlFactory can be used outside the sitemap context

  5. Clear Architecture:
     • Stage 1: Fast, cheap filtering (pattern-based)
     • Stage 2: Slow, expensive refinement (AI-based)

Open Questions

  1. Should we add positive pattern matching in CrawlUrlFactory to provide better initial confidence?
     • Pro: Could help prioritize URLs for AI analysis
     • Con: Adds complexity, and AI overwrites the score anyway
     • Decision: No, keep it simple (0 or null only)

  2. Should CrawlUrlFactory have a "priority" score separate from confidence?
     • Pro: Could process URLs with /coffee/ before generic /shop/
     • Con: Additional complexity, unclear benefit
     • Decision: Defer until we have performance issues

  3. Should we batch AI classification based on initial confidence?
     • Pro: Process higher-confidence URLs first
     • Con: Current system already batches by discovery time
     • Decision: Current batching is fine

Success Criteria

  • ✅ All existing tests pass
  • ✅ Sitemap matchers only handle sitemap classification
  • ✅ URL filtering centralized in CrawlUrlFactory
  • ✅ Confidence scores are honest (0 or null for patterns, 0-100 for AI)
  • ✅ Squarespace roasters work with FlatSitemapMatcher
  • ✅ No code duplication between components
  • ✅ Architecture follows SOLID principles

Timeline

  • Phase 1: Revert URL classification (1-2 hours)
  • Phase 2: Create FlatSitemapMatcher (30 min)
  • Phase 3: Extract CrawlUrlFactory (1 hour)
  • Testing and validation (1 hour; Phase 4 is a decision only, no implementation planned)

Total Estimated Time: 3-4 hours

To Modify:

  • src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php
  • src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php
  • src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php
  • src/DTO/SitemapClassification.php
  • src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php
  • src/Service/Crawler/Processing/SitemapProcessor.php

To Create:

  • src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php
  • src/Service/Crawler/Processing/CrawlUrlFactory.php
  • tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php
  • tests/Service/Crawler/Processing/CrawlUrlFactoryTest.php

To Delete:

  • src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php
  • src/DTO/UrlClassification.php
  • tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php (URL classification tests only; sitemap-level tests stay)

Unchanged:

  • src/Service/Crawler/Processing/ProductUrlPatternBuilder.php
  • src/Service/Crawler/ContentDetection/UrlPatternClassificationService.php
  • src/EventListener/CrawlUrlsDiscoveredListener.php