Phase 2: Broken Image Detection Job
- Priority: Medium-term
- Complexity: Medium
- Dependencies: None (but enhances Phase 3 if implemented first)
Goal
Periodically verify that each image URL returns a successful HTTP response (200, or 206 for a ranged request) and serves valid image content.
Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Storage | Separate ImageCheck table | Audit trail, check history, cleaner entity |
| Validation | GET + magic bytes | Most thorough - verifies actual image data |
Entity: ImageCheck
```php
// Imports assume Symfony's default App\ namespace and the symfony/uid component.
namespace App\Entity;

use App\Enum\ImageCheckStatus;
use App\Repository\ImageCheckRepository;
use Doctrine\ORM\Mapping as ORM;
use Symfony\Component\Uid\Uuid;

#[ORM\Entity(repositoryClass: ImageCheckRepository::class)]
class ImageCheck
{
    #[ORM\Id]
    #[ORM\Column(type: 'uuid')]
    private Uuid $id;

    #[ORM\ManyToOne(targetEntity: CoffeeBean::class)]
    #[ORM\JoinColumn(nullable: false, onDelete: 'CASCADE')]
    private CoffeeBean $coffeeBean;

    #[ORM\Column(length: 255)]
    private string $imageUrl; // Snapshot of URL at check time

    #[ORM\Column(type: 'datetime_immutable')]
    private \DateTimeImmutable $checkedAt;

    #[ORM\Column(enumType: ImageCheckStatus::class)]
    private ImageCheckStatus $status; // VALID, BROKEN, TIMEOUT, ERROR

    #[ORM\Column(nullable: true)]
    private ?int $httpStatusCode = null;

    #[ORM\Column(length: 100, nullable: true)]
    private ?string $contentType = null; // Content-Type response header

    #[ORM\Column(length: 20, nullable: true)]
    private ?string $detectedFormat = null; // jpeg, png, webp, gif

    #[ORM\Column(nullable: true)]
    private ?int $contentLength = null;

    #[ORM\Column(type: 'text', nullable: true)]
    private ?string $errorMessage = null;
}
```
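The `status` column uses Doctrine's `enumType`, which needs a backed enum. A minimal sketch of what `ImageCheckStatus` might look like follows; the four cases come from the plan, while the string backing values are an assumption, and the namespace is left out to keep the snippet standalone.

```php
<?php

declare(strict_types=1);

// String-backed so Doctrine's enumType can persist it.
// The lowercase backing values are an assumption, not part of the plan.
enum ImageCheckStatus: string
{
    case VALID = 'valid';     // Response OK and magic bytes matched
    case BROKEN = 'broken';   // Bad status, wrong Content-Type, or no signature match
    case TIMEOUT = 'timeout'; // Request timed out
    case ERROR = 'error';     // Any other transport or unexpected failure
}
```

How failures are split between BROKEN and ERROR is a design choice the plan leaves open; the comments above show one plausible reading.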
Validation: GET + Magic Bytes
Image magic byte signatures:
- JPEG: FF D8 FF
- PNG: 89 50 4E 47 0D 0A 1A 0A
- GIF: 47 49 46 38 (GIF8)
- WebP: 52 49 46 46 ?? ?? ?? ?? 57 45 42 50 (RIFF, a 4-byte file size, then WEBP at offset 8)
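The signatures above can be matched with a few string comparisons. The following is a sketch only: `detectImageFormat` is a hypothetical helper name, and it assumes the caller passes at least the first 12 bytes of the response body.

```php
<?php

declare(strict_types=1);

/**
 * Detect an image format from the leading bytes of a response body.
 * Returns 'jpeg', 'png', 'gif', 'webp', or null when nothing matches.
 * (Hypothetical helper; the name is not part of the plan.)
 */
function detectImageFormat(string $head): ?string
{
    // Fixed-prefix signatures from the list above.
    $prefixes = [
        'jpeg' => "\xFF\xD8\xFF",
        'png'  => "\x89PNG\r\n\x1A\n",
        'gif'  => 'GIF8',
    ];

    foreach ($prefixes as $format => $signature) {
        if (str_starts_with($head, $signature)) {
            return $format;
        }
    }

    // WebP: "RIFF", a 4-byte size field, then "WEBP" at offset 8.
    if (str_starts_with($head, 'RIFF') && substr($head, 8, 4) === 'WEBP') {
        return 'webp';
    }

    return null;
}
```

Checking the `WEBP` marker at offset 8 matters because other RIFF containers, such as WAV and AVI, share the same first four bytes.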
Validation Process
- Send GET request with `Range: bytes=0-15` header (fetch first 16 bytes only)
- Check HTTP status code (200 if the server ignores the Range header, 206 if it honors it)
- Verify Content-Type header starts with `image/`
- Match magic bytes against known signatures
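The decision logic in the steps above can be sketched as a pure function, separate from the HTTP call. `classifyImageResponse` is a hypothetical name, the status is returned as a plain string to keep the snippet standalone, and mapping every failed check to BROKEN is an assumption; TIMEOUT and ERROR would presumably come from transport exceptions handled elsewhere.

```php
<?php

declare(strict_types=1);

/**
 * Classify a ranged GET response.
 * $detectedFormat is the outcome of magic-byte matching (null if nothing matched).
 * Returns 'VALID' or 'BROKEN' (hypothetical helper, not part of the plan).
 */
function classifyImageResponse(int $httpStatus, ?string $contentType, ?string $detectedFormat): string
{
    // 206: server honored the Range header; 200: it ignored it and sent the full body.
    if ($httpStatus !== 200 && $httpStatus !== 206) {
        return 'BROKEN';
    }

    // Header values may vary in case and carry parameters, e.g. "image/png; charset=binary".
    if ($contentType === null || !str_starts_with(strtolower(trim($contentType)), 'image/')) {
        return 'BROKEN';
    }

    // Status and header look fine; the body must also match a known signature.
    return $detectedFormat !== null ? 'VALID' : 'BROKEN';
}
```

Keeping this logic free of I/O makes it straightforward to unit test without mocking the HTTP client.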
Implementation
Service: ImageValidationService
- Uses HttpClient with Range header
- Magic byte detection logic
- Returns structured result with status, format, errors
Scheduler: ImageValidationSchedulerService
- Cron: `0 3 */3 * *` (3 AM on every third day of the month: the 1st, 4th, 7th, and so on)
- Query: CoffeeBeans that have an imageUrl and no ImageCheck within the last 3 days
- Dispatch `ImageValidationMessage` per bean
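The selection rule behind the query can be illustrated in plain PHP. This is a sketch under assumptions: `isDueForImageCheck` is a hypothetical helper, and the real implementation would more likely express the same condition in a repository query.

```php
<?php

declare(strict_types=1);

/**
 * Decide whether a bean is due for an image check.
 * $lastCheckedAt is the newest ImageCheck::checkedAt for the bean,
 * or null if the bean has never been checked.
 */
function isDueForImageCheck(
    ?DateTimeImmutable $lastCheckedAt,
    DateTimeImmutable $now,
    int $intervalDays = 3
): bool {
    if ($lastCheckedAt === null) {
        return true; // Never checked before
    }

    // Due when the latest check is at or before the cutoff.
    $cutoff = $now->sub(new DateInterval('P' . $intervalDays . 'D'));

    return $lastCheckedAt <= $cutoff;
}
```

Note the boundary: a bean checked a few seconds after 3 AM on the previous run is still inside the 3-day window when the next run starts at exactly 3 AM, so it would be skipped until the following run unless the window is made slightly shorter than the cron period.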
Message & Handler
- `ImageValidationMessage`: Contains CoffeeBean UUID
- `ImageValidationHandler`: Validates and persists ImageCheck record
Command: app:validate-images
Manual trigger with options:
- --dry-run: Preview without persisting
- --limit=N: Limit number of beans to check
- --force: Ignore last check date
Files to Create
| File | Description |
|---|---|
| `src/Entity/ImageCheck.php` | Entity |
| `src/Repository/ImageCheckRepository.php` | Repository |
| `src/Enum/ImageCheckStatus.php` | Status enum (VALID, BROKEN, TIMEOUT, ERROR) |
| `src/Service/Image/ImageValidationService.php` | Validation logic |
| `src/Scheduler/ImageValidationSchedulerService.php` | Scheduler provider |
| `src/Message/ImageValidationMessage.php` | Message class |
| `src/MessageHandler/ImageValidationHandler.php` | Handler |
| `src/Command/ValidateImagesCommand.php` | CLI command |
| Migration | image_check table |
Reference Files
- `src/Scheduler/AvailabilityCrawlSchedulerService.php` - Scheduler pattern
- `src/MessageHandler/CrawlStepHandler.php` - Message handler pattern
- `src/Service/Crawler/Implementations/Abstract/Http/AbstractHttpClient.php` - HTTP client pattern
Integration with Phase 3
If Phase 3 (Image Caching) is implemented:
- Broken image detection can trigger cache invalidation
- ImageCheck BROKEN status → CachedImage STALE status
- Re-cache job picks up STALE images for re-fetch