Phase 2: Broken Image Detection Job

Priority: Medium-term
Complexity: Medium
Dependencies: None (but enhances Phase 3 if implemented first)


Goal

Periodically verify that each image URL returns a successful HTTP response (200, or 206 for a range request) and that the body contains valid image data.


Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Storage | Separate ImageCheck table | Audit trail, check history, cleaner entity |
| Validation | GET + magic bytes | Most thorough - verifies actual image data |

Entity: ImageCheck

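namespace App\Entity;  // assumed from src/Entity/ImageCheck.php

use App\Enum\ImageCheckStatus;
use App\Repository\ImageCheckRepository;
use DateTimeImmutable;
use Doctrine\ORM\Mapping as ORM;
use Symfony\Component\Uid\Uuid;  // assumes symfony/uid; swap for the project's UUID class

#[ORM\Entity(repositoryClass: ImageCheckRepository::class)]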
class ImageCheck
{
    #[ORM\Id]
    #[ORM\Column(type: 'uuid')]
    private Uuid $id;

    #[ORM\ManyToOne(targetEntity: CoffeeBean::class)]
    #[ORM\JoinColumn(nullable: false, onDelete: 'CASCADE')]
    private CoffeeBean $coffeeBean;

    #[ORM\Column(length: 255)]
    private string $imageUrl;  // Snapshot of URL at check time

    #[ORM\Column(type: 'datetime_immutable')]
    private DateTimeImmutable $checkedAt;

    #[ORM\Column(enumType: ImageCheckStatus::class)]
    private ImageCheckStatus $status;  // VALID, BROKEN, TIMEOUT, ERROR

    #[ORM\Column(nullable: true)]
    private ?int $httpStatusCode = null;

    #[ORM\Column(length: 100, nullable: true)]
    private ?string $contentType = null;

    #[ORM\Column(length: 20, nullable: true)]
    private ?string $detectedFormat = null;  // jpeg, png, webp, gif

    #[ORM\Column(nullable: true)]
    private ?int $contentLength = null;

    #[ORM\Column(type: 'text', nullable: true)]
    private ?string $errorMessage = null;
}

Validation: GET + Magic Bytes

Image magic byte signatures:

  • JPEG: FF D8 FF
  • PNG: 89 50 4E 47 0D 0A 1A 0A
  • GIF: 47 49 46 38 (GIF8)
  • WebP: 52 49 46 46 ... 57 45 42 50 (RIFF at offset 0, a 4-byte size field, then WEBP at offset 8)
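The signatures above can be matched with plain string comparisons. The following is a minimal sketch, not the project's implementation: `detectImageFormat` is a hypothetical helper name, and the return values simply mirror the formats listed for `detectedFormat` in the entity.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns 'jpeg', 'png', 'gif' or 'webp' when the
 * leading bytes match one of the signatures listed above, or null otherwise.
 */
function detectImageFormat(string $bytes): ?string
{
    if (str_starts_with($bytes, "\xFF\xD8\xFF")) {
        return 'jpeg';
    }
    if (str_starts_with($bytes, "\x89PNG\r\n\x1A\n")) {
        return 'png';
    }
    if (str_starts_with($bytes, 'GIF8')) {
        return 'gif';
    }
    // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP" at offset 8.
    if (str_starts_with($bytes, 'RIFF') && substr($bytes, 8, 4) === 'WEBP') {
        return 'webp';
    }

    return null;
}
```

Because the WebP marker ends at byte 12, the 16 bytes requested by the validation process are enough for every format in the list. Other RIFF files (for example WAV audio) share the first four bytes, which is why the offset-8 check is needed.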

Validation Process

  1. Send GET request with Range: bytes=0-15 header (fetch first 16 bytes only)
  2. Check HTTP status code (206 if the server honors the range, 200 if it ignores it and returns the full body)
  3. Verify Content-Type header starts with image/
  4. Match magic bytes against known signatures
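Step 1 can be sketched as follows. To keep the example dependency-free it uses PHP's built-in `http://` stream wrapper (which requires `allow_url_fopen`) as a stand-in for the project's HttpClient, whose API is not shown in this document. `fetchFirstBytes` and `parseResponseHeaders` are hypothetical names, and the 10-second timeout is an assumed default.

```php
<?php
declare(strict_types=1);

/**
 * Extracts the final status code and Content-Type from raw header lines.
 * After redirects the wrapper reports several responses; the last one wins.
 */
function parseResponseHeaders(array $headerLines): array
{
    $status = null;
    $contentType = null;
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m) === 1) {
            $status = (int) $m[1];
            $contentType = null; // a new status line starts a new response
        } elseif (stripos($line, 'Content-Type:') === 0) {
            $contentType = trim(substr($line, strlen('Content-Type:')));
        }
    }

    return ['status' => $status, 'contentType' => $contentType];
}

/**
 * Hypothetical helper: requests only the first $length bytes of $url.
 * Returns ['status' => ?int, 'contentType' => ?string, 'bytes' => string].
 */
function fetchFirstBytes(string $url, int $length = 16, float $timeout = 10.0): array
{
    $context = stream_context_create(['http' => [
        'method'        => 'GET',
        'header'        => sprintf("Range: bytes=0-%d\r\n", $length - 1),
        'timeout'       => $timeout,  // assumed value, in seconds
        'ignore_errors' => true,      // keep status and headers on 4xx/5xx
    ]]);

    $handle = @fopen($url, 'rb', false, $context);
    if ($handle === false) {
        throw new RuntimeException(sprintf('Could not open %s', $url));
    }

    try {
        // Read at most $length bytes, even if the server ignores the Range
        // header and answers 200 with the full body.
        $bytes = (string) stream_get_contents($handle, $length);
        $headerLines = stream_get_meta_data($handle)['wrapper_data'] ?? [];
    } finally {
        fclose($handle);
    }

    return parseResponseHeaders($headerLines) + ['bytes' => $bytes];
}
```

Capping the read on the client side matters because not every server supports range requests; without the cap, a 200 response would download the whole image.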

Implementation

Service: ImageValidationService

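class ImageValidationService
{
    public function validate(string $url): ImageCheckResult
    {
        // Range GET, status and Content-Type checks, magic-byte match
        throw new \LogicException('Not implemented yet.');
    }
}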
  • Uses HttpClient with Range header
  • Magic byte detection logic
  • Returns structured result with status, format, errors
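One possible shape for the structured result, together with the status, Content-Type and signature checks, is sketched below. The enum backing values, the fields of `ImageCheckResult`, the `evaluateResponse` function and the error message wording are all assumptions for illustration. How TIMEOUT and ERROR are assigned is not specified in this plan; the sketch only covers the VALID and BROKEN outcomes of a completed request.

```php
<?php
declare(strict_types=1);

// Assumed backing values; the plan only names the four cases.
enum ImageCheckStatus: string
{
    case VALID = 'valid';
    case BROKEN = 'broken';
    case TIMEOUT = 'timeout';
    case ERROR = 'error';
}

// Assumed shape, mirroring the nullable columns of the ImageCheck entity.
final class ImageCheckResult
{
    public function __construct(
        public readonly ImageCheckStatus $status,
        public readonly ?int $httpStatusCode = null,
        public readonly ?string $contentType = null,
        public readonly ?string $detectedFormat = null,
        public readonly ?string $errorMessage = null,
    ) {
    }
}

/**
 * Hypothetical helper: applies the status, Content-Type and signature checks
 * to a completed response. $detectedFormat is the outcome of magic byte
 * detection (null when no signature matched).
 */
function evaluateResponse(int $httpStatusCode, ?string $contentType, ?string $detectedFormat): ImageCheckResult
{
    $broken = static fn (string $reason): ImageCheckResult => new ImageCheckResult(
        ImageCheckStatus::BROKEN,
        $httpStatusCode,
        $contentType,
        $detectedFormat,
        $reason,
    );

    if (!in_array($httpStatusCode, [200, 206], true)) {
        return $broken(sprintf('Unexpected HTTP status %d', $httpStatusCode));
    }
    if ($contentType === null || stripos($contentType, 'image/') !== 0) {
        return $broken(sprintf('Unexpected Content-Type "%s"', $contentType ?? '(none)'));
    }
    if ($detectedFormat === null) {
        return $broken('No known image signature in the first bytes');
    }

    return new ImageCheckResult(ImageCheckStatus::VALID, $httpStatusCode, $contentType, $detectedFormat);
}
```

Keeping the decision logic free of I/O makes it easy to unit test: for example, `evaluateResponse(200, 'text/html', null)` yields a BROKEN result, which is the typical case of a shop serving an HTML error page with a 200 status.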

Scheduler: ImageValidationSchedulerService

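  • Cron: 0 3 */3 * * (03:00 on days 1, 4, 7, ... of each month; the pattern restarts on the 1st, so the gap after day 31 is one day)
  • Query: CoffeeBeans that have an imageUrl and no ImageCheck within the last 3 days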
  • Dispatch ImageValidationMessage per bean
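The "no ImageCheck in last 3 days" rule has a boundary case worth illustrating: a bean checked a few seconds after 03:00 is still slightly less than three days old when the next run starts at 03:00, so a strict cutoff would skip it until the run after that. The sketch below shows one option, a small tolerance on the cutoff. `isDueForCheck` and the 60-minute tolerance are assumptions, not part of the plan.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decides whether a bean should be re-checked.
 * The tolerance shortens the interval slightly so that a check finished
 * shortly after the previous run still counts as due on the next run.
 */
function isDueForCheck(
    ?DateTimeImmutable $lastCheckedAt,
    DateTimeImmutable $now,
    int $intervalDays = 3,
    int $toleranceMinutes = 60,
): bool {
    if ($lastCheckedAt === null) {
        return true; // never checked before
    }

    $cutoff = $now
        ->sub(new DateInterval(sprintf('P%dD', $intervalDays)))
        ->add(new DateInterval(sprintf('PT%dM', $toleranceMinutes)));

    return $lastCheckedAt <= $cutoff;
}
```

In the real scheduler the same cutoff would more likely be passed as a parameter to the repository query rather than evaluated per bean in PHP.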

Message & Handler

  • ImageValidationMessage: Contains CoffeeBean UUID
  • ImageValidationHandler: Validates and persists ImageCheck record
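A rough sketch of the message and handler pair follows. To stay self-contained it replaces the repository, validation service and entity manager with three small hypothetical interfaces (`BeanImageUrlProvider`, `ImageUrlValidator`, `ImageCheckStore`) and passes the status as a plain string; the actual handler would presumably use the project's own classes and be registered with Symfony Messenger, for instance via the `#[AsMessageHandler]` attribute.

```php
<?php
declare(strict_types=1);

// Carries only the identifier, so the handler always loads fresh data.
final class ImageValidationMessage
{
    public function __construct(public readonly string $coffeeBeanId)
    {
    }
}

// Hypothetical ports standing in for the repository, service and persistence.
interface BeanImageUrlProvider
{
    public function findImageUrl(string $coffeeBeanId): ?string;
}

interface ImageUrlValidator
{
    public function validate(string $url): string; // status name, e.g. 'VALID'
}

interface ImageCheckStore
{
    public function record(string $coffeeBeanId, string $imageUrl, string $status): void;
}

final class ImageValidationHandler
{
    public function __construct(
        private BeanImageUrlProvider $beans,
        private ImageUrlValidator $validator,
        private ImageCheckStore $store,
    ) {
    }

    public function __invoke(ImageValidationMessage $message): void
    {
        $url = $this->beans->findImageUrl($message->coffeeBeanId);
        if ($url === null) {
            return; // bean deleted or image removed since the message was dispatched
        }

        try {
            $status = $this->validator->validate($url);
        } catch (Throwable $e) {
            $status = 'ERROR'; // assumed mapping for unexpected failures
        }

        // The URL is stored with the result as a snapshot of what was checked.
        $this->store->record($message->coffeeBeanId, $url, $status);
    }
}
```

Sending only the UUID keeps messages small and avoids validating a stale URL if the bean changes between dispatch and handling.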

Command: app:validate-images

Manual trigger with options:

  • --dry-run: Preview without persisting
  • --limit=N: Limit number of beans to check
  • --force: Ignore last check date


Files to Create

| File | Description |
|---|---|
| src/Entity/ImageCheck.php | Entity |
| src/Repository/ImageCheckRepository.php | Repository |
| src/Enum/ImageCheckStatus.php | Status enum (VALID, BROKEN, TIMEOUT, ERROR) |
| src/Service/Image/ImageValidationService.php | Validation logic |
| src/Scheduler/ImageValidationSchedulerService.php | Scheduler provider |
| src/Message/ImageValidationMessage.php | Message class |
| src/MessageHandler/ImageValidationHandler.php | Handler |
| src/Command/ValidateImagesCommand.php | CLI command |
| Migration | image_check table |

Reference Files

  • src/Scheduler/AvailabilityCrawlSchedulerService.php - Scheduler pattern
  • src/MessageHandler/CrawlStepHandler.php - Message handler pattern
  • src/Service/Crawler/Implementations/Abstract/Http/AbstractHttpClient.php - HTTP client pattern

Integration with Phase 3

If Phase 3 (Image Caching) is implemented:

  • Broken image detection can trigger cache invalidation
  • ImageCheck BROKEN status → CachedImage STALE status
  • Re-cache job picks up STALE images for re-fetch