# Refactor Sitemap and URL Classification Architecture
**Status:** Planning
## Problem Statement
The current implementation has split responsibilities and duplicated logic:

- Sitemap matchers were extended to classify individual URLs (duplicating pattern logic)
- URL classification patterns exist in multiple places:
  - `SitemapProcessor.looksLikeProductUrl()` - negative patterns
  - `ProductUrlPatternBuilder` - positive patterns from the database
  - `AbstractSitemapPatternMatcher.NON_PRODUCT_URL_PATTERNS` - duplicate negative patterns
- Misleading confidence scores: pattern-based classification returns `0`/`40`/`60`/`80`/`null`, but only `0` and `null` are meaningful (everything else gets overwritten by AI)
- Squarespace needs flat sitemap support, which is a generic sitemap structure pattern, not platform-specific
## Agreed Architecture
### Principle: Separation of Concerns
- Sitemap Matchers → classify sitemap structure ONLY
- URL Classification → a single-responsibility service for all URL confidence scoring
- The two-stage confidence system remains (sketched below):
  - Stage 1: Pattern-based (fast, free) - binary filter (`0` or `null`)
  - Stage 2: AI-based (slow, expensive) - refined scoring (0-100)
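As a minimal sketch of that two-stage contract (all function names here are illustrative placeholders, not the project's real services):

```php
<?php

// Illustrative only: stageOnePatternFilter() and the AI scorer are
// placeholder names, not the project's real service APIs.

// Stage 1 (fast, free): 0 = definitely not a product, null = unknown.
function stageOnePatternFilter(string $url): ?int
{
    return preg_match('%/(cart|checkout|blog)/%', strtolower($url)) ? 0 : null;
}

// Stage 2 (slow, expensive) runs only for URLs that survive Stage 1
// and refines the unknown (null) confidence into a 0-100 score.
function classify(string $url, callable $aiScorer): int
{
    if (stageOnePatternFilter($url) === 0) {
        return 0; // rejected before any AI cost is incurred
    }

    return $aiScorer($url); // int in [0, 100]
}
```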
## Refactoring Plan
### Phase 1: Revert URL Classification from Matchers
**Goal:** Remove URL-level classification from sitemap matchers (return to sitemap-only responsibility)
#### 1.1 Remove URL Classification Interface Methods
- File: `src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php`
- Action: Remove:
  - `canClassifyUrls(): bool`
  - `classifyUrl(string $url): UrlClassification`
#### 1.2 Remove URL Pattern Constants from Abstract Matcher
- File: `src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php`
- Action: Remove:
  - `PRODUCT_URL_PATTERNS` constant
  - `NON_PRODUCT_URL_PATTERNS` constant
  - `canClassifyUrls()` method
  - `classifyUrl()` method
#### 1.3 Simplify Squarespace Matcher
- File: `src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php`
- Action: Remove all URL pattern constants; keep only sitemap-level constants
#### 1.4 Remove DTOs/Enums for URL Classification
- Files to delete:
  - `src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php`
  - `src/DTO/UrlClassification.php`
#### 1.5 Update SitemapClassification
- File: `src/DTO/SitemapClassification.php`
- Action: Remove the `matcher` property (no longer needed for URL classification); see the sketch below
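For orientation only, a hypothetical sketch of the slimmed-down DTO; every property other than the removed `matcher` is an assumption about the real class:

```php
<?php

namespace App\DTO;

use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;

// Hypothetical sketch: the remaining properties are assumptions,
// shown only to illustrate that the matcher reference goes away.
final readonly class SitemapClassification
{
    public function __construct(
        public SitemapTypeEnum $type,        // assumed existing property
        public ?string $platformName = null, // assumed existing property
        // removed: public SitemapPatternMatcherInterface $matcher
    ) {}
}
```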
#### 1.6 Update SitemapPatternRegistry
- File: `src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php`
- Action: Don't pass the matcher to the `SitemapClassification` constructor
#### 1.7 Revert SitemapProcessor Changes
- File: `src/Service/Crawler/Processing/SitemapProcessor.php`
- Action: Remove platform-specific URL classification logic; use only `looksLikeProductUrl()`
#### 1.8 Delete Tests for URL Classification
- File to delete: `tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php` (URL classification tests)
### Phase 2: Create Generic Flat Sitemap Matcher
**Goal:** Support platforms like Squarespace that use a single sitemap.xml without sub-sitemaps
#### 2.1 Create FlatSitemapMatcher
- File: `src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php`
- Purpose: Generic matcher for platforms with a flat sitemap structure
- Activation: Config-based (`sitemap_platform_name = 'flat'`)
- Characteristics:
  - No `IDENTIFIER_PATTERNS` (activation requires config)
  - `PRODUCT_PATTERNS = []` (no separate product sitemaps)
  - `NON_PRODUCT_PATTERNS = []` (no separate non-product sitemaps)
  - `INDEX_PATTERNS = ['/sitemap\.xml$/']` (single sitemap file)
  - Returns `SitemapTypeEnum::INDEX` for sitemap.xml
  - Returns `SitemapTypeEnum::UNKNOWN` for anything else
```php
<?php

namespace App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\AbstractSitemapPatternMatcher;
use App\Service\Crawler\Processing\SitemapPattern\Enum\SitemapTypeEnum;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

/**
 * Generic matcher for platforms with a flat sitemap structure:
 * a single sitemap.xml file containing all URLs, without sub-sitemaps.
 *
 * Used by: Squarespace, and other platforms with a similar structure.
 */
#[AutoconfigureTag('app.sitemap_matcher')]
final class FlatSitemapMatcher extends AbstractSitemapPatternMatcher
{
    protected const array IDENTIFIER_PATTERNS = [];
    protected const array PRODUCT_PATTERNS = [];
    protected const array NON_PRODUCT_PATTERNS = [];
    protected const array INDEX_PATTERNS = ['/sitemap\.xml$/'];

    protected const string PLATFORM_NAME = 'flat';

    public function canHandle(string $sitemapUrl, ?RoasterCrawlConfig $config = null): bool
    {
        // Activation is config-driven only: there are no identifier patterns.
        return $config?->getSitemapPlatformName() === self::PLATFORM_NAME;
    }
}
```
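Because `IDENTIFIER_PATTERNS` is empty, this matcher never self-activates based on sitemap contents: it only runs when a roaster's config explicitly opts in via `sitemap_platform_name = 'flat'`, which keeps it from competing with the platform-specific matchers.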
#### 2.2 Create Tests for FlatSitemapMatcher
- File: `tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php`
- Test cases (see the sketch after this list):
  - Config-based activation
  - Sitemap type detection (`INDEX` for sitemap.xml)
  - Platform name returns `'flat'`
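A sketch of the activation test, assuming the matcher has no required constructor arguments, `RoasterCrawlConfig` is mockable, and the test namespace follows the usual `App\Tests` convention; the sitemap-type and platform-name cases depend on the abstract matcher's real method names, so they are omitted here:

```php
<?php

namespace App\Tests\Service\Crawler\Processing\SitemapPattern\Matchers\Cms;

use App\Entity\RoasterCrawlConfig;
use App\Service\Crawler\Processing\SitemapPattern\Matchers\Cms\FlatSitemapMatcher;
use PHPUnit\Framework\TestCase;

final class FlatSitemapMatcherTest extends TestCase
{
    public function testActivatesOnlyWhenConfigOptsIn(): void
    {
        $matcher = new FlatSitemapMatcher();

        $config = $this->createMock(RoasterCrawlConfig::class);
        $config->method('getSitemapPlatformName')->willReturn('flat');

        self::assertTrue($matcher->canHandle('https://example.com/sitemap.xml', $config));

        // Without a config there is nothing to opt in, so it must not activate.
        self::assertFalse($matcher->canHandle('https://example.com/sitemap.xml', null));
    }
}
```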
#### 2.3 Update Squarespace Roasters Configuration
- Set `sitemap_platform_name = 'flat'` for Squarespace roasters in the database
- Document that Squarespace uses a flat sitemap structure
### Phase 3: Extract URL Classification to Service
**Goal:** Centralize URL classification logic in a dedicated service
#### 3.1 Create CrawlUrlFactory Service
- File: `src/Service/Crawler/Processing/CrawlUrlFactory.php`
- Purpose: Create `CrawlUrl` DTOs with initial confidence scoring
- Responsibilities:
  - Encapsulate pattern-based URL classification
  - Create `CrawlUrl` objects with appropriate confidence
  - Single source of truth for "what makes a URL worth analyzing?"
```php
<?php

namespace App\Service\Crawler\Processing;

use App\DTO\CrawlUrl;
use Psr\Log\LoggerInterface;

final readonly class CrawlUrlFactory
{
    public function __construct(
        // Injected for domain-pattern access; the Stage 1 filter below
        // currently uses only the static negative patterns.
        private ProductUrlPatternBuilder $patternBuilder,
        private LoggerInterface $logger,
    ) {}

    /**
     * Create CrawlUrl DTOs from raw URL strings.
     * Applies pattern-based filtering to determine initial confidence.
     *
     * @param string[] $urls Raw URL strings
     * @return CrawlUrl[] Filtered URLs with initial confidence
     */
    public function createFromUrls(array $urls): array
    {
        $crawlUrls = [];

        foreach ($urls as $url) {
            $crawlUrl = $this->createFromUrl($url);

            if ($crawlUrl !== null) {
                $crawlUrls[] = $crawlUrl;
            }
        }

        return $crawlUrls;
    }

    /**
     * Create a single CrawlUrl from a URL string.
     * Returns null if the URL should be filtered out (obvious non-product).
     */
    public function createFromUrl(string $url): ?CrawlUrl
    {
        $confidence = $this->looksLikeProductUrl($url);

        // Filter out obvious non-products.
        if ($confidence === 0) {
            return null;
        }

        return new CrawlUrl($url, $confidence);
    }

    /**
     * Determine if a URL looks like a product page using pattern matching.
     *
     * Returns:
     * - 0    = definitely NOT a product (will be filtered out)
     * - null = unknown, needs AI analysis
     *
     * Note: this is Stage 1 (pattern-based) classification.
     * Stage 2 (AI-based) will refine null values to 0-100 scores.
     */
    private function looksLikeProductUrl(string $url): ?int
    {
        $url = strtolower($url);

        // Negative patterns - obvious non-products.
        $negativePatterns = [
            '%/(cart|checkout|account|login|register|wishlist|search)/%',
            '%/(blog|news|article|post)/%',
            '%/(about|contact|faq|help|support|careers)/%',
            '%/(category|collection|tag|pages?)/%',
            '%/(policy|policies|terms|privacy|shipping|returns)/%',
        ];

        foreach ($negativePatterns as $pattern) {
            if (preg_match($pattern, $url)) {
                $this->logger->debug('URL rejected by negative pattern', [
                    'url' => $url,
                    'pattern' => $pattern,
                ]);

                return 0;
            }
        }

        // No negative pattern matched: confidence is unknown,
        // so the AI analyzes it in Stage 2.
        return null;
    }
}
```
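From a caller's perspective, usage looks like this (`$patternBuilder` and `$logger` stand in for the DI-provided services; the URLs are illustrative):

```php
$factory = new CrawlUrlFactory($patternBuilder, $logger);

$crawlUrls = $factory->createFromUrls([
    'https://example.com/products/ethiopia-natural', // kept, confidence null (AI decides)
    'https://example.com/blog/brew-guide',           // rejected: matches /blog/
    'https://example.com/cart/',                     // rejected: matches /cart/
]);

// Only the first URL survives Stage 1; its confidence stays null
// until Stage 2 (AI) assigns a 0-100 score.
assert(count($crawlUrls) === 1);
```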
#### 3.2 Update SitemapProcessor to Use CrawlUrlFactory
- File: `src/Service/Crawler/Processing/SitemapProcessor.php`
- Changes:
  - Inject `CrawlUrlFactory`
  - Replace the URL processing loop with a factory call
  - Remove the `looksLikeProductUrl()` method (moved to the factory)
```php
// Before (in processSitemap):
$discoveredUrls = [];
foreach ($urls as $urlElement) {
    $url = (string) $urlElement;
    $confidence = $this->looksLikeProductUrl($url);

    if ($confidence === 0) {
        $stats['urls_skipped']++;
        continue;
    }

    $discoveredUrls[] = new CrawlUrl($url, $confidence);
}

// After:
$urlStrings = array_map(fn ($el) => (string) $el, $urls);
$discoveredUrls = $this->crawlUrlFactory->createFromUrls($urlStrings);
$stats['urls_skipped'] = count($urls) - count($discoveredUrls);
```
#### 3.3 Update Tests
- Update `SitemapProcessorIntegrationTest` to verify filtering still works
- Create unit tests for `CrawlUrlFactory` (sketch below)
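A sketch of those unit tests, assuming `ProductUrlPatternBuilder` is mockable and using PSR's `NullLogger`:

```php
<?php

namespace App\Tests\Service\Crawler\Processing;

use App\Service\Crawler\Processing\CrawlUrlFactory;
use App\Service\Crawler\Processing\ProductUrlPatternBuilder;
use PHPUnit\Framework\TestCase;
use Psr\Log\NullLogger;

final class CrawlUrlFactoryTest extends TestCase
{
    private CrawlUrlFactory $factory;

    protected function setUp(): void
    {
        $this->factory = new CrawlUrlFactory(
            $this->createMock(ProductUrlPatternBuilder::class),
            new NullLogger(),
        );
    }

    public function testObviousNonProductsAreFilteredOut(): void
    {
        // Both URLs hit negative patterns, so Stage 1 rejects them.
        self::assertNull($this->factory->createFromUrl('https://example.com/checkout/step-1'));
        self::assertNull($this->factory->createFromUrl('https://example.com/blog/espresso-tips'));
    }

    public function testUnknownUrlsSurviveForAiAnalysis(): void
    {
        $urls = [
            'https://example.com/products/kenya-aa', // unknown -> kept for Stage 2
            'https://example.com/cart/',             // non-product -> dropped
        ];

        self::assertCount(1, $this->factory->createFromUrls($urls));
    }
}
```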
### Phase 4: Consider ProductUrlPatternBuilder Enhancement (Optional)
**Decision point:** Do we need to make ProductUrlPatternBuilder aware of positive indicators?

**Current state:**

- ProductUrlPatternBuilder generates patterns from the database (origins, varieties, processing)
- It has static positive indicators: `/coffee/`, `/beans/`, `/shop/`, `/products/`, `/farm/`
- `SitemapProcessor.looksLikeProductUrl()` used these for scoring (40/60/80)

**With Option 2 (0 or null only):**

- Positive patterns are NOT used for filtering
- They are only used by the AI in Stage 2
- ProductUrlPatternBuilder stays as-is (database-driven patterns)
- No platform-awareness needed

**Recommendation:** Leave ProductUrlPatternBuilder unchanged. It serves AI classification in Stage 2.
## Migration Steps
### For Existing Squarespace Roasters
```sql
-- Update roaster configs using Squarespace
UPDATE roaster_crawl_config
SET sitemap_platform_name = 'flat'
WHERE base_url IN (
    'https://www.alohastarcoffee.com',
    'https://www.19coffee.com',
    'https://www.bigshoulderscoffee.com'
    -- Add other Squarespace roasters
);
```
### For Future Platforms
- Wix: evaluate whether a flat structure suffices or a custom matcher is needed
- Custom platforms: use `'flat'` if there is a single sitemap.xml; otherwise create a specific matcher
## Testing Strategy
### Unit Tests
- ✅ FlatSitemapMatcher (config activation, sitemap type detection)
- ✅ CrawlUrlFactory (URL filtering, confidence scoring)
### Integration Tests
- ✅ SitemapProcessor with FlatSitemapMatcher
- ✅ Full flow: Sitemap → URL extraction → Filtering → AI classification
### Regression Tests
- ✅ Existing matcher tests (WordPress, Shopify, Wix) still pass
- ✅ URL filtering still prevents non-products from being stored
- ✅ AI classification still receives correct URLs
## Benefits
- **Single Responsibility:** Each component has one clear job
  - Sitemap matchers → classify sitemaps
  - `CrawlUrlFactory` → filter and create URL DTOs
  - `ProductUrlPatternBuilder` → provide domain patterns for AI
  - `UrlPatternClassificationService` → AI-based refinement
- **DRY Principle:** URL filtering logic lives in one place (`CrawlUrlFactory`)
- **Honest Confidence Scores:**
  - Pattern-based: `0` (reject) or `null` (unknown)
  - AI-based: 0-100 (refined score)
  - No misleading intermediate scores
- **Reusable Components:**
  - `FlatSitemapMatcher` works for any platform with a single sitemap
  - `CrawlUrlFactory` can be used outside the sitemap context
- **Clear Architecture:**
  - Stage 1: fast, cheap filtering (pattern-based)
  - Stage 2: slow, expensive refinement (AI-based)
## Open Questions
1. Should we add positive pattern matching in `CrawlUrlFactory` to provide better initial confidence?
   - Pro: could help prioritize URLs for AI analysis
   - Con: adds complexity, and the AI overwrites the score anyway
   - **Decision:** No, keep it simple (`0` or `null` only)
2. Should `CrawlUrlFactory` have a "priority" score separate from confidence?
   - Pro: could process URLs with `/coffee/` before generic `/shop/`
   - Con: additional complexity, unclear benefit
   - **Decision:** Defer until we have performance issues
3. Should we batch AI classification based on initial confidence?
   - Pro: process higher-confidence URLs first
   - Con: the current system already batches by discovery time
   - **Decision:** Current batching is fine
## Success Criteria
- ✅ All existing tests pass
- ✅ Sitemap matchers only handle sitemap classification
- ✅ URL filtering centralized in CrawlUrlFactory
- ✅ Confidence scores are honest (0 or null for patterns, 0-100 for AI)
- ✅ Squarespace roasters work with FlatSitemapMatcher
- ✅ No code duplication between components
- ✅ Architecture follows SOLID principles
## Timeline
- Phase 1: Revert URL classification (1-2 hours)
- Phase 2: Create FlatSitemapMatcher (30 min)
- Phase 3: Extract CrawlUrlFactory (1 hour)
- Phase 4: Testing and validation (1 hour)
**Total Estimated Time:** 3-4 hours
## Related Files
### To Modify
- `src/Service/Crawler/Processing/SitemapPattern/SitemapPatternMatcherInterface.php`
- `src/Service/Crawler/Processing/SitemapPattern/AbstractSitemapPatternMatcher.php`
- `src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcher.php`
- `src/DTO/SitemapClassification.php`
- `src/Service/Crawler/Processing/SitemapPattern/SitemapPatternRegistry.php`
- `src/Service/Crawler/Processing/SitemapProcessor.php`
### To Create
- `src/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcher.php`
- `src/Service/Crawler/Processing/CrawlUrlFactory.php`
- `tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/FlatSitemapMatcherTest.php`
- `tests/Service/Crawler/Processing/CrawlUrlFactoryTest.php`
### To Delete
- `src/Service/Crawler/Processing/SitemapPattern/Enum/UrlTypeEnum.php`
- `src/DTO/UrlClassification.php`
- `tests/Service/Crawler/Processing/SitemapPattern/Matchers/Cms/SquarespaceSitemapMatcherTest.php` (URL tests)
### Unchanged
- `src/Service/Crawler/Processing/ProductUrlPatternBuilder.php`
- `src/Service/Crawler/ContentDetection/UrlPatternClassificationService.php`
- `src/EventListener/CrawlUrlsDiscoveredListener.php`