StructuredDataExtractor Implementation Plan
Problem Statement
Currently, the content processing pipeline strips all <script> tags before converting HTML to markdown. This removes valuable structured data that could improve LLM extraction accuracy:
- JSON-LD product schemas (<script type="application/ld+json">)
- Platform-specific product data (Shopify's _RSConfig.product, etc.)
- Meta tags are preserved but not explicitly extracted
Proposed Solution
Create a StructuredDataExtractor service that extracts structured data from HTML before it's cleaned, and includes it as YAML frontmatter in the markdown output.
Architecture
src/Service/Crawler/Extraction/StructuredData/
├── StructuredDataExtractor.php              # Main orchestrator
├── ExtractedStructuredData.php              # Value object for results
└── Extractors/
    ├── StructuredDataExtractorInterface.php
    ├── JsonLdExtractor.php                  # JSON-LD schemas (priority 100)
    ├── MetaTagExtractor.php                 # OG/meta tags (priority 50)
    └── PlatformSpecific/
        └── ShopifyDataExtractor.php         # Shopify _RSConfig (priority 75)
Tagged Services
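Extractors are registered through Symfony's tagged services pattern and injected in priority order, highest value first. The method named in default_priority_method must be static, because Symfony calls it on the class rather than on an instance: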
services:
    App\Service\Crawler\Extraction\StructuredData\StructuredDataExtractor:
        arguments:
            $extractors: !tagged_iterator { tag: 'app.structured_data_extractor', default_priority_method: 'getPriority' }
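To make the ordering concrete, here is a minimal framework-free sketch of how the orchestrator might consume the injected extractors. The interface methods (`getPriority()`, `extract()`), the plain-array return shape, and the explicit sort are assumptions for illustration; the plan names the classes but not their signatures, and in the real service the tagged iterator would already deliver the extractors in priority order.

```php
<?php
declare(strict_types=1);

// Hypothetical contract: the plan names this interface but not its methods.
interface StructuredDataExtractorInterface
{
    // Static so that it can serve as Symfony's default_priority_method.
    public static function getPriority(): int;

    /** @return array<string, mixed> extracted fields, empty if nothing was found */
    public function extract(string $html): array;
}

final class StructuredDataExtractor
{
    /** @var StructuredDataExtractorInterface[] */
    private array $extractors = [];

    /** @param iterable<StructuredDataExtractorInterface> $extractors */
    public function __construct(iterable $extractors)
    {
        foreach ($extractors as $extractor) {
            $this->extractors[] = $extractor;
        }
        // Highest priority first. The sort only keeps this sketch
        // self-contained; a tagged iterator arrives pre-sorted.
        usort(
            $this->extractors,
            static fn ($a, $b): int => $b::getPriority() <=> $a::getPriority()
        );
    }

    /** @return array<string, array<string, mixed>> results keyed by extractor class */
    public function extract(string $html): array
    {
        $results = [];
        foreach ($this->extractors as $extractor) {
            $results[$extractor::class] = $extractor->extract($html);
        }

        return $results;
    }
}
```

Keying the results by extractor class keeps every source separate, so this sketch deliberately takes no position on how conflicting values should be resolved.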
Integration Point
File: src/Service/Crawler/Step/Processors/ContentProcessingStepProcessor.php
Location: In convertHtmlToMarkdown(), call the extractor BEFORE $this->htmlCleaner->cleanHtml($html), because cleaning strips the script tags the extractor needs.
// Assumes these imports at the top of the file:
// use DateTimeImmutable;
// use Symfony\Component\Yaml\Yaml;
private function convertHtmlToMarkdown(string $html, string $url): string
{
    // Extract structured data BEFORE cleaning
    $structuredData = $this->structuredDataExtractor->extract($html);

    // Clean HTML (removes script tags)
    $html = $this->htmlCleaner->cleanHtml($html);

    // Convert to markdown
    $markdown = $this->markdownConverter->convert($html);

    // Build frontmatter with structured data
    $metadata = [
        'url' => $url,
        'processed_at' => (new DateTimeImmutable())->format('Y-m-d H:i:s'),
        'content_length' => strlen($html), // byte length of the cleaned HTML
    ];

    // Merge extracted data
    if (!$structuredData->isEmpty()) {
        $metadata = array_merge($metadata, $structuredData->toArray());
    }

    // Inline nesting from level 4, two-space indentation
    $frontmatter = Yaml::dump($metadata, 4, 2);

    return "---\n{$frontmatter}---\n\n{$markdown}";
}
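The method above relies only on `isEmpty()` and `toArray()` from the value object. A minimal sketch of what `ExtractedStructuredData` could look like follows; the three constructor sections mirror the top-level frontmatter keys used in this plan, but the constructor signature itself is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical shape: the plan only requires isEmpty() and toArray().
final class ExtractedStructuredData
{
    /**
     * @param array<string, mixed> $meta           OG/meta tag fields
     * @param array<string, mixed> $structuredData JSON-LD fields
     * @param array<string, array<string, mixed>> $platformData keyed by platform name
     */
    public function __construct(
        private array $meta = [],
        private array $structuredData = [],
        private array $platformData = [],
    ) {
    }

    public function isEmpty(): bool
    {
        return $this->toArray() === [];
    }

    /** @return array<string, mixed> only the sections that contain data */
    public function toArray(): array
    {
        // array_filter() without a callback drops empty arrays, so unused
        // sections never show up in the frontmatter.
        return array_filter([
            'meta' => $this->meta,
            'structured_data' => $this->structuredData,
            'platform_data' => $this->platformData,
        ]);
    }
}
```

Dropping empty sections inside `toArray()` means the caller needs no per-section checks before merging the result into the metadata array.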
Expected Output Format
---
url: https://example-roaster.com/products/ethiopia
processed_at: "2025-11-19 10:30:00"
content_length: 45230
meta:
  title: "Ethiopia Yirgacheffe - Single Origin Coffee"
  description: "Bright and fruity Ethiopian coffee"
  image: "https://example-roaster.com/cdn/ethiopia.jpg"
  canonical: "https://example-roaster.com/products/ethiopia"
structured_data:
  product_name: "Ethiopia Yirgacheffe"
  product_description: "Bright and fruity Ethiopian coffee..."
  brand: "Example Roaster"
  category: "Coffee > Single Origin"
  price: "18.50"
  currency: "EUR"
  availability: "InStock"
platform_data:
  shopify:
    product_tags: ["Single Origin", "Light Roast", "Africa"]
    product_type: "Coffee"
    vendor: "Example Roaster"
---
[markdown content]
Data to Extract
Meta Tags (MetaTagExtractor)
| Field | Sources (priority order) |
|---|---|
| title | og:title, <title> |
| description | og:description, meta[name="description"] |
| image | og:image |
| canonical | link[rel="canonical"] |
| product_price | product:price:amount |
| product_currency | product:price:currency |
| product_availability | product:availability |
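The fallback order in the table can be sketched as a lookup that tries each source in turn and keeps the first non-empty value. The function name and the XPath expressions below are illustrative assumptions, and the sketch uses PHP's built-in DOM extension rather than any particular crawler component.

```php
<?php
declare(strict_types=1);

/**
 * For each field, evaluate the sources in priority order and keep the
 * first non-empty value. Name and return shape are assumptions.
 *
 * @return array<string, string>
 */
function extractMetaTags(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Field => XPath expressions in priority order (mirrors the table).
    $sources = [
        'title' => ['//meta[@property="og:title"]/@content', '//title'],
        'description' => [
            '//meta[@property="og:description"]/@content',
            '//meta[@name="description"]/@content',
        ],
        'image' => ['//meta[@property="og:image"]/@content'],
        'canonical' => ['//link[@rel="canonical"]/@href'],
        'product_price' => ['//meta[@property="product:price:amount"]/@content'],
        'product_currency' => ['//meta[@property="product:price:currency"]/@content'],
        'product_availability' => ['//meta[@property="product:availability"]/@content'],
    ];

    $result = [];
    foreach ($sources as $field => $expressions) {
        foreach ($expressions as $expression) {
            // string(...) yields the first matching node's text, or ''.
            $value = trim((string) $xpath->evaluate("string($expression)"));
            if ($value !== '') {
                $result[$field] = $value;
                break; // a higher-priority source matched
            }
        }
    }

    return $result;
}
```

One caveat worth checking during validation: `loadHTML()` may fall back to ISO-8859-1 when a page declares no charset, so non-ASCII titles could need explicit encoding handling.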
JSON-LD (JsonLdExtractor)
Extract from <script type="application/ld+json">:
- Product schema: name, description, image, brand, category
- Offer schema: price, priceCurrency, availability
- Handle @graph format (multiple schemas in one block)
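A rough sketch of the JSON-LD handling described above: collect Product nodes from every JSON-LD block, flatten `@graph` wrappers, and reduce each node to the flat fields shown in the expected output. Both function names are hypothetical, and taking only the first Offer is a simplifying assumption that sidesteps the open question about variants.

```php
<?php
declare(strict_types=1);

/**
 * Collect all nodes of @type Product, including those inside @graph.
 *
 * @return list<array<string, mixed>>
 */
function extractJsonLdProducts(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $products = [];
    $scripts = (new DOMXPath($dom))->query('//script[@type="application/ld+json"]');
    foreach ($scripts as $script) {
        $decoded = json_decode($script->textContent, true);
        if (!is_array($decoded)) {
            continue; // malformed block: skip it, keep the others
        }
        // A block may hold a @graph wrapper, a list of nodes, or one node.
        $nodes = is_array($decoded['@graph'] ?? null)
            ? $decoded['@graph']
            : (array_is_list($decoded) ? $decoded : [$decoded]);
        foreach ($nodes as $node) {
            // @type may be a string or a list of strings.
            if (is_array($node) && in_array('Product', (array) ($node['@type'] ?? []), true)) {
                $products[] = $node;
            }
        }
    }

    return $products;
}

/** @return array<string, string> flat fields; absent values are omitted */
function summarizeProduct(array $product): array
{
    $brand = $product['brand'] ?? null;
    $offers = $product['offers'] ?? [];
    // "offers" may be one Offer or a list; this sketch takes the first.
    $offer = is_array($offers) && array_is_list($offers) ? ($offers[0] ?? []) : $offers;
    $availability = $offer['availability'] ?? null;

    return array_filter([
        'product_name' => $product['name'] ?? null,
        'brand' => is_array($brand) ? ($brand['name'] ?? null) : $brand,
        'price' => isset($offer['price']) ? (string) $offer['price'] : null,
        'currency' => $offer['priceCurrency'] ?? null,
        // "https://schema.org/InStock" is reduced to "InStock".
        'availability' => is_string($availability) ? basename($availability) : null,
    ], static fn ($value): bool => $value !== null);
}
```

`array_is_list()` requires PHP 8.1 or later; on older versions the list check would need a manual replacement.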
Shopify (ShopifyDataExtractor)
Extract from inline scripts:
- _RSConfig.product: tags, type, vendor, variants
- ShopifyAnalytics.meta.product: similar data
- Inline var product = {...} patterns
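Pulling an object literal out of inline JavaScript is the fragile part of this extractor. One possible approach, sketched below, locates the assignment with a regular expression, scans forward to the matching closing brace, and hands the literal to `json_decode()`. The function name is hypothetical, and the sketch assumes the literal is valid JSON (double-quoted keys, no trailing commas or comments); whether real Shopify pages satisfy that needs to be confirmed against crawled samples.

```php
<?php
declare(strict_types=1);

/**
 * Find "<assignment> = { ... }" in script text and decode the literal.
 * Returns null when the assignment is missing, the braces are unbalanced,
 * or the literal is not valid JSON.
 *
 * @return array<string, mixed>|null
 */
function extractAssignedJson(string $script, string $assignment): ?array
{
    $pattern = '/' . preg_quote($assignment, '/') . '\s*=\s*\{/';
    if (preg_match($pattern, $script, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // Offset of the opening brace, which is the last character of the match.
    $start = $match[0][1] + strlen($match[0][0]) - 1;
    $depth = 0;
    $inString = false;
    $length = strlen($script);

    for ($i = $start; $i < $length; $i++) {
        $char = $script[$i];
        if ($inString) {
            if ($char === '\\') {
                $i++; // skip the escaped character
            } elseif ($char === '"') {
                $inString = false;
            }
            continue; // braces inside strings do not count
        }
        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $depth++;
        } elseif ($char === '}' && --$depth === 0) {
            $json = substr($script, $start, $i - $start + 1);
            $decoded = json_decode($json, true);

            return is_array($decoded) ? $decoded : null;
        }
    }

    return null; // ran out of input before the braces balanced
}
```

For example, `extractAssignedJson($scriptText, '_RSConfig.product')` or `extractAssignedJson($scriptText, 'var product')` would cover two of the patterns listed above with the same helper.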
Implementation Phases
Phase 1: Core Infrastructure
- ExtractedStructuredData value object
- StructuredDataExtractorInterface
- StructuredDataExtractor orchestrator
- Service configuration
Phase 2: Basic Extractors
- MetaTagExtractor (OG tags, meta description)
- JsonLdExtractor (Product/Offer schemas)
Phase 3: Integration
- Update ContentProcessingStepProcessor
- Unit tests for extractors
- Integration tests
Phase 4: Platform Extractors
- ShopifyDataExtractor
- Additional platforms as needed (Shopware, WooCommerce)
Error Handling
- Each extractor wraps JSON parsing in try-catch
- Invalid data logged as warning, not error
- Processing continues with remaining extractors
- Empty results gracefully omitted from frontmatter
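The four rules above can be sketched as a single loop that isolates each extractor. To stay self-contained, this sketch passes extractors as plain callables and uses a `$logWarning` callable where the real service would presumably call a PSR-3 logger's `warning()` method; both shapes are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Run every extractor, log failures as warnings, and drop empty results.
 *
 * @param array<string, callable(string): array> $extractors label => extractor
 * @param callable(string): void $logWarning stand-in for a logger's warning()
 * @return array<string, array> non-empty results keyed by extractor label
 */
function runExtractors(array $extractors, string $html, callable $logWarning): array
{
    $results = [];
    foreach ($extractors as $name => $extract) {
        try {
            $data = $extract($html);
        } catch (Throwable $e) {
            // Warning, not error: one malformed block should not fail the page.
            $logWarning(sprintf('Extractor "%s" failed: %s', $name, $e->getMessage()));
            continue; // carry on with the remaining extractors
        }
        if ($data !== []) {
            $results[$name] = $data; // empty results never reach the frontmatter
        }
    }

    return $results;
}
```

Inside an extractor, passing `JSON_THROW_ON_ERROR` to `json_decode()` turns malformed JSON into a `JsonException`, which the `catch` above then handles like any other failure.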
Validation Required
Before implementing, investigate real crawled URLs to confirm:
- JSON-LD presence: How many roaster sites include JSON-LD Product schemas?
- Data quality: Is the structured data accurate and useful?
- Platform coverage: What platforms are most common? (Shopify, Shopware, etc.)
- Value add: What specific fields would improve extraction that aren't in the description?
Open Questions
- Should we extract variant information (sizes, weights, prices)?
- How to handle conflicting data between sources (e.g., different prices)?
- Should platform-specific data be normalized to a common format?
- Should there be a maximum frontmatter size?
Status
Current: Planning / Validation
Next: Investigate sample crawled URLs to validate assumptions