
Feature Implementation Plan: fix-duplicate-bean-creation

📋 Todo Checklist

  • [ ] Update CoffeeBean and CrawlUrl entity definitions.
  • [ ] Create a database migration to alter the schema and merge existing duplicates.
  • [ ] Update CoffeeBeanPersister to handle the new relationship and deduplicate intelligently.
  • [ ] Refactor codebase to use the new entity methods (getPrimaryCrawlUrl, getCrawlUrls).
  • [ ] Write tests to validate the fix.
  • [ ] Final Review and Testing.

🔍 Analysis & Investigation

Codebase Structure

  • Entities: src/Entity/CoffeeBean.php, src/Entity/CrawlUrl.php.
  • Persistence Logic: src/Service/Crawler/Persistance/CoffeeBeanPersister.php.
  • Consumers of getCrawlUrl(): src/Service/Crawler/AvailabilityCrawlSchedulerService.php, src/Controller/Admin/CoffeeBeanCrudController.php (and potentially others).

Current Architecture

The core issue stems from an incorrect database relationship definition. A CoffeeBean is incorrectly mapped as having only one CrawlUrl (OneToOne), enforced by a unique constraint on crawl_url.coffee_bean_id. However, a single product can have multiple URLs (e.g., for different languages like /en/product and /de/product). This architectural flaw forces the system to create a new, duplicate CoffeeBean for each URL variant instead of associating all relevant URLs with a single bean.

Claude's analysis is correct. The solution is to change the relationship to OneToMany (CoffeeBean can have many CrawlUrls) and ManyToOne (CrawlUrl belongs to one CoffeeBean). This requires schema changes, a complex data migration to merge existing duplicates, and updates to the application logic that handles bean creation and retrieval.

Dependencies & Integration Points

  • Doctrine ORM: The entity annotations and migration process are managed by Doctrine.
  • Doctrine Migrations Bundle: Used to generate and execute the schema and data migration.
  • Symfony Console: A command might be needed to assist with the data migration if it's too complex for a single SQL query.

Considerations & Challenges

  • Data Migration Complexity: The most critical part of this task is the data migration. It must correctly identify all duplicate CoffeeBean entities based on a normalized URL or product name, select one "master" bean, re-associate all related CrawlUrls and other relations to it, and then safely delete the duplicate beans. This process must be atomic and reversible. A database backup is mandatory before running it.
  • Identifying the Primary URL: The concept of an is_primary flag on CrawlUrl is essential for consumers of the data (like the availability checker) that need a single, canonical URL for a bean. The logic to determine which URL is primary must be clearly defined.
  • Code Refactoring: All existing calls to $coffeeBean->getCrawlUrl() will break and must be located and updated, which carries a risk of introducing regressions if any are missed.

📝 Implementation Plan

Prerequisites

  • MANDATORY: Create a full backup of the production database before attempting to run the migration.
  • The migration should be tested thoroughly in a staging environment that mirrors production data as closely as possible.

Step-by-Step Implementation

  1. Step 1: Update Entity Definitions

    • File to modify: src/Entity/CrawlUrl.php
      • Add a new isPrimary property:
        #[ORM\Column(type: 'boolean', options: ['default' => false])]
        private bool $isPrimary = false;
        
      • Add the corresponding getter and setter (isPrimary, setIsPrimary).
      • Change the relationship from OneToOne to ManyToOne:
        // Change this
        #[ORM\OneToOne(targetEntity: CoffeeBean::class, inversedBy: 'crawlUrl')]
        // To this
        #[ORM\ManyToOne(targetEntity: CoffeeBean::class, inversedBy: 'crawlUrls')]
        #[ORM\JoinColumn(nullable: true, onDelete: 'SET NULL')]
        private ?CoffeeBean $coffeeBean = null;
        
    • File to modify: src/Entity/CoffeeBean.php
      • Change the $crawlUrl property to $crawlUrls and initialize it as a collection in the constructor.
        // In __construct()
        $this->crawlUrls = new ArrayCollection();
        
        // Property definition
        #[ORM\OneToMany(targetEntity: CrawlUrl::class, mappedBy: 'coffeeBean')]
        private Collection $crawlUrls;
        
      • Remove getCrawlUrl() and setCrawlUrl().
      • Add new methods to manage the collection:
        public function getCrawlUrls(): Collection
        {
            return $this->crawlUrls;
        }
        
        public function addCrawlUrl(CrawlUrl $crawlUrl): static
        {
            if (!$this->crawlUrls->contains($crawlUrl)) {
                $this->crawlUrls->add($crawlUrl);
                $crawlUrl->setCoffeeBean($this);
            }
            return $this;
        }
        
        public function removeCrawlUrl(CrawlUrl $crawlUrl): static
        {
            if ($this->crawlUrls->removeElement($crawlUrl)) {
                if ($crawlUrl->getCoffeeBean() === $this) {
                    $crawlUrl->setCoffeeBean(null);
                }
            }
            return $this;
        }
        
      • Add a helper method to get the primary URL:
        public function getPrimaryCrawlUrl(): ?CrawlUrl
        {
            foreach ($this->crawlUrls as $crawlUrl) {
                if ($crawlUrl->isPrimary()) {
                    return $crawlUrl;
                }
            }
            return $this->crawlUrls->first() ?: null;
        }
        
  2. Step 2: Create Database Migration

    • Generate a new migration file: php bin/console make:migration.
    • In the up() method:
      • Schema Changes: The generated migration should automatically contain the SQL to drop the unique index on crawl_url.coffee_bean_id and add the is_primary column. Verify this is correct.
      • Data Migration (PHP code within the migration):
        1. Fetch all CoffeeBean entities.
        2. Group them in-memory by a normalized key (e.g., roaster_id + normalized_url_slug). A good normalization strategy is to remove protocol, www., and language path segments like /en/, /de/.
        3. Iterate through each group of duplicates. For each group:
          • Designate the first bean as the primaryBean.
          • Iterate through the other duplicateBeans in the group.
          • Re-assign all CrawlUrls from the duplicateBean to the primaryBean.
          • Crucially, re-assign any other relations from the duplicate to the primary bean if necessary.
          • After re-assigning, add the duplicateBean to a list for deletion.
        4. Use a bulk DELETE (raw SQL or DQL) to remove all the collected duplicateBeans.
        5. Run a final query to set the is_primary flag. For each CoffeeBean, mark one of its CrawlUrls as primary (e.g., the one with the oldest creation date).
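    • The grouping and merge-planning steps above can be sketched with plain arrays. The key format, row shape, and normalization rules below are assumptions for illustration; the real migration would operate on Doctrine entities or raw database rows:

```php
<?php
// Normalize a URL into a dedup key: strip protocol, www., and
// two-letter language path segments such as /en/ or /de/.
function normalizeUrl(string $url): string
{
    $url = preg_replace('#^https?://(www\.)?#', '', $url);
    $url = preg_replace('#/[a-z]{2}(/|$)#', '/', $url);
    return rtrim($url, '/');
}

/**
 * Group bean rows by (roasterId, normalized URL). The first row in each
 * group becomes the "primary"; later rows are flagged as duplicates.
 *
 * @param array<array{id:int, roasterId:int, url:string}> $beans
 * @return array{primaries: int[], duplicates: array<int,int>} duplicate id => primary id
 */
function planMerge(array $beans): array
{
    $primaries = [];
    $duplicates = [];
    $seen = [];
    foreach ($beans as $bean) {
        $key = $bean['roasterId'] . '|' . normalizeUrl($bean['url']);
        if (!isset($seen[$key])) {
            $seen[$key] = $bean['id'];
            $primaries[] = $bean['id'];
        } else {
            // Duplicate: its CrawlUrls (and other relations) should be
            // re-assigned to $seen[$key] before this bean is deleted.
            $duplicates[$bean['id']] = $seen[$key];
        }
    }
    return ['primaries' => $primaries, 'duplicates' => $duplicates];
}
```

      The returned duplicates map tells the migration which UPDATE/DELETE statements to issue: re-point crawl_url.coffee_bean_id from each duplicate id to its primary id, then delete the duplicates.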
  3. Step 3: Update Persistence Logic

    • File to modify: src/Service/Crawler/Persistance/CoffeeBeanPersister.php
    • Changes needed:
      • In findExistingBean():
        • Implement a URL normalization function (e.g., using regex preg_replace('#/([a-z]{2})/#', '/', $url)).
        • Instead of findOneBy(['url' => $url]), create a custom repository method findBeanByNormalizedUrl(string $normalizedUrl, string $roasterId) that joins CrawlUrl and checks against the normalized version of CrawlUrl.url.
      • In processCoffeeBeanFromDTO():
        • Replace $bean->setCrawlUrl($crawlUrl) with $bean->addCrawlUrl($crawlUrl).
        • Add logic to set the isPrimary flag. If the bean is new, the first CrawlUrl added becomes primary. If the bean exists, the flag is not changed.
  4. Step 4: Refactor Codebase

    • Search the entire project for usages of getCrawlUrl().
    • File to modify: src/Service/Crawler/AvailabilityCrawlSchedulerService.php (and any others)
      • Replace $bean->getCrawlUrl() with $bean->getPrimaryCrawlUrl(), and handle the null case where a bean has no URLs.
    • File to modify: src/Controller/Admin/CoffeeBeanCrudController.php (or similar admin file)
      • Update the field that displays the URL to either show the primary URL or list all associated URLs.
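
Since getPrimaryCrawlUrl() can return null, every refactored call site needs a guard. A minimal sketch of the pattern, with illustrative stubs standing in for the real entities (the stub classes and scheduleAvailabilityCheck() are not the actual codebase API):

```php
<?php
// Illustrative stubs; the real CrawlUrl and CoffeeBean are Doctrine entities.
class CrawlUrl
{
    public function __construct(
        private string $url,
        private bool $isPrimary = false,
    ) {}

    public function isPrimary(): bool { return $this->isPrimary; }
    public function getUrl(): string { return $this->url; }
}

class CoffeeBean
{
    /** @param CrawlUrl[] $crawlUrls */
    public function __construct(private array $crawlUrls = []) {}

    public function getPrimaryCrawlUrl(): ?CrawlUrl
    {
        foreach ($this->crawlUrls as $crawlUrl) {
            if ($crawlUrl->isPrimary()) {
                return $crawlUrl;
            }
        }
        return $this->crawlUrls[0] ?? null;
    }
}

// Refactored consumer: never assume a URL exists.
function scheduleAvailabilityCheck(CoffeeBean $bean): ?string
{
    $primary = $bean->getPrimaryCrawlUrl();
    if ($primary === null) {
        return null; // bean has no URLs; skip scheduling
    }
    return $primary->getUrl();
}
```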

Testing Strategy

  1. Unit Tests:
    • Write a unit test for the new URL normalization logic to ensure it handles various cases correctly.
  2. Integration Tests:
    • Write a test for CoffeeBeanPersister that simulates crawling two URL variants for the same product and asserts that only one CoffeeBean is created and it has two CrawlUrls associated with it.
  3. Migration Test:
    • On a staging database populated with production-like duplicates, run the migration.
    • After the migration, run SQL queries to verify:
      • There are no more duplicate beans.
      • No data (like flavor notes, regions) was lost from the merged beans.
      • Each CoffeeBean has at least one CrawlUrl marked as primary.
  4. End-to-End Test:
    • Manually test the crawling process for a roaster known to have language-variant URLs.
    • Check the admin interface to ensure the bean and its URLs are displayed correctly.
    • Verify the availability checker and other services that relied on the old method still function correctly.
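
The integration test in item 2 can be approximated in plain PHP before wiring up the real Doctrine test. The stub persister below (names assumed; not the real CoffeeBeanPersister API) demonstrates the expected behavior: two language-variant URLs resolve to one bean, and only the first URL is marked primary:

```php
<?php
// Illustrative stand-ins for the real entities and persister.
class CrawlUrl
{
    public bool $isPrimary = false;
    public function __construct(public string $url) {}
}

class CoffeeBean
{
    /** @var CrawlUrl[] */
    public array $crawlUrls = [];
    public function addCrawlUrl(CrawlUrl $crawlUrl): void
    {
        $this->crawlUrls[] = $crawlUrl;
    }
}

class InMemoryPersister
{
    /** @var array<string, CoffeeBean> keyed by normalized URL */
    private array $beans = [];

    public function persist(string $url): CoffeeBean
    {
        $key = $this->normalize($url);
        $bean = $this->beans[$key] ??= new CoffeeBean();
        $crawlUrl = new CrawlUrl($url);
        // The first URL added to a new bean becomes the primary;
        // an existing bean's primary flag is left untouched.
        $crawlUrl->isPrimary = ($bean->crawlUrls === []);
        $bean->addCrawlUrl($crawlUrl);
        return $bean;
    }

    private function normalize(string $url): string
    {
        $url = preg_replace('#^https?://(www\.)?#', '', $url);
        return preg_replace('#/[a-z]{2}(/|$)#', '/', $url);
    }
}
```

The real test would assert the same invariants through CoffeeBeanPersister against a test database.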

🎯 Success Criteria

  • Running the migration successfully merges all existing duplicate CoffeeBean entities without data loss.
  • After the fix, crawling multiple URL variants for a single product results in only one CoffeeBean entity with multiple CrawlUrl entities linked to it.
  • The application remains stable, and all features that previously used a bean's URL continue to work correctly by using the primary URL.
  • The number of CoffeeBean entities in the database accurately reflects the number of unique products.