Feature Implementation Plan: fix-duplicate-bean-creation
📋 Todo Checklist
- [ ] Update `CoffeeBean` and `CrawlUrl` entity definitions.
- [ ] Create a database migration to alter the schema and merge existing duplicates.
- [ ] Update `CoffeeBeanPersister` to handle the new relationship and deduplicate intelligently.
- [ ] Refactor codebase to use the new entity methods (`getPrimaryCrawlUrl`, `getCrawlUrls`).
- [ ] Write tests to validate the fix.
- [ ] Final Review and Testing.
🔍 Analysis & Investigation
Codebase Structure
- Entities: `src/Entity/CoffeeBean.php`, `src/Entity/CrawlUrl.php`.
- Persistence Logic: `src/Service/Crawler/Persistance/CoffeeBeanPersister.php`.
- Consumers of `getCrawlUrl()`: `src/Service/Crawler/AvailabilityCrawlSchedulerService.php`, `src/Controller/Admin/CoffeeBeanCrudController.php` (and potentially others).
Current Architecture
The core issue stems from an incorrect database relationship definition. A `CoffeeBean` is mapped as having only one `CrawlUrl` (`OneToOne`), enforced by a unique constraint on `crawl_url.coffee_bean_id`. However, a single product can have multiple URLs (e.g., for different languages like `/en/product` and `/de/product`). Because a bean cannot hold a second URL, the system creates a new, duplicate `CoffeeBean` for each URL variant instead of associating all relevant URLs with a single bean.
Claude's analysis is correct. The solution is to change the relationship to `OneToMany` on the bean side (a `CoffeeBean` can have many `CrawlUrl`s) and `ManyToOne` on the URL side (each `CrawlUrl` belongs to one `CoffeeBean`). This requires schema changes, a complex data migration to merge existing duplicates, and updates to the application logic that handles bean creation and retrieval.
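As a rough illustration of the `ManyToOne` side, here is a minimal sketch of what `CrawlUrl` might look like after the change. It assumes the project uses PHP 8 attributes for Doctrine mapping (this plan mentions annotations, so the real syntax may differ), that the join column is nullable, and that `is_primary` defaults to `false`. The empty `CoffeeBean` class and the `id` column are placeholders added only so the sketch stands alone.

```php
<?php

declare(strict_types=1);

use Doctrine\ORM\Mapping as ORM;

// Placeholder so the sketch is self-contained; the real entity lives in
// src/Entity/CoffeeBean.php and holds the inverse OneToMany side.
class CoffeeBean
{
}

#[ORM\Entity]
class CrawlUrl
{
    // Assumed primary key definition; the existing entity already has one.
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;

    // Many CrawlUrls may now reference the same CoffeeBean, so the unique
    // constraint that came with the OneToOne mapping is no longer wanted.
    #[ORM\ManyToOne(targetEntity: CoffeeBean::class, inversedBy: 'crawlUrls')]
    #[ORM\JoinColumn(name: 'coffee_bean_id', referencedColumnName: 'id', nullable: true)]
    private ?CoffeeBean $coffeeBean = null;

    // Assumption: a URL is not primary until it is explicitly marked.
    #[ORM\Column(name: 'is_primary', type: 'boolean', options: ['default' => false])]
    private bool $isPrimary = false;

    public function getCoffeeBean(): ?CoffeeBean
    {
        return $this->coffeeBean;
    }

    public function setCoffeeBean(?CoffeeBean $coffeeBean): static
    {
        $this->coffeeBean = $coffeeBean;

        return $this;
    }

    public function isPrimary(): bool
    {
        return $this->isPrimary;
    }

    public function setIsPrimary(bool $isPrimary): static
    {
        $this->isPrimary = $isPrimary;

        return $this;
    }
}
```

The setter accepts `null` because the collection helper planned for `CoffeeBean` calls `setCoffeeBean(null)` when a URL is removed; if orphaned URLs should never exist, a non-nullable join column with orphan removal would be the alternative.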
Dependencies & Integration Points
- Doctrine ORM: The entity annotations and migration process are managed by Doctrine.
- Doctrine Migrations Bundle: Used to generate and execute the schema and data migration.
- Symfony Console: A command might be needed to assist with the data migration if it's too complex for a single SQL query.
Considerations & Challenges
- Data Migration Complexity: The most critical part of this task is the data migration. It must correctly identify all duplicate `CoffeeBean` entities based on a normalized URL or product name, select one "master" bean, re-associate all related `CrawlUrl`s and other relations to it, and then safely delete the duplicate beans. This process must be atomic and reversible. A database backup is mandatory before running it.
- Identifying the Primary URL: The concept of an `is_primary` flag on `CrawlUrl` is essential for consumers of the data (like the availability checker) that need a single, canonical URL for a bean. The logic to determine which URL is primary must be clearly defined.
- Code Refactoring: All existing calls to `$coffeeBean->getCrawlUrl()` will break and must be located and updated, which carries a risk of introducing regressions if any are missed.
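One way to pin down the "which URL is primary" rule is sketched below. It is only an illustration under stated assumptions: `CrawlUrlCandidate` is a hypothetical stand-in for the real `CrawlUrl` entity, `createdAt` is an assumed field, and the fallback to the oldest URL simply mirrors the "oldest creation date" suggestion made for the migration.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the CrawlUrl entity; createdAt is an assumed field.
final class CrawlUrlCandidate
{
    public function __construct(
        public string $url,
        public bool $isPrimary,
        public DateTimeImmutable $createdAt,
    ) {
    }
}

/**
 * Returns the flagged primary URL, or the oldest URL when none is flagged,
 * or null when the bean has no URLs at all.
 *
 * @param list<CrawlUrlCandidate> $crawlUrls
 */
function pickPrimaryCrawlUrl(array $crawlUrls): ?CrawlUrlCandidate
{
    if ($crawlUrls === []) {
        return null;
    }

    // An explicitly flagged URL always wins.
    foreach ($crawlUrls as $crawlUrl) {
        if ($crawlUrl->isPrimary) {
            return $crawlUrl;
        }
    }

    // Fallback (assumption): treat the oldest URL as the canonical one.
    usort(
        $crawlUrls,
        static fn (CrawlUrlCandidate $a, CrawlUrlCandidate $b): int => $a->createdAt <=> $b->createdAt
    );

    return $crawlUrls[0];
}
```

The same rule could back a `getPrimaryCrawlUrl()` helper on `CoffeeBean`; returning `null` for an empty collection forces callers such as the availability scheduler to handle beans without URLs explicitly.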
📝 Implementation Plan
Prerequisites
- MANDATORY: Create a full backup of the production database before attempting to run the migration.
- The migration should be tested thoroughly in a staging environment that mirrors production data as closely as possible.
Step-by-Step Implementation
- Step 1: Update Entity Definitions
  - File to modify: `src/Entity/CrawlUrl.php`
    - Add a new `isPrimary` property.
    - Add the corresponding getter and setter (`isPrimary`, `setIsPrimary`).
    - Change the relationship from `OneToOne` to `ManyToOne`.
  - File to modify: `src/Entity/CoffeeBean.php`
    - Change the `$crawlUrl` property to `$crawlUrls` and initialize it as a collection in the constructor.
    - Remove `getCrawlUrl()` and `setCrawlUrl()`.
    - Add new methods to manage the collection:
      ```php
      public function getCrawlUrls(): Collection
      {
          return $this->crawlUrls;
      }

      public function addCrawlUrl(CrawlUrl $crawlUrl): static
      {
          // Keep both sides of the relationship in sync.
          if (!$this->crawlUrls->contains($crawlUrl)) {
              $this->crawlUrls->add($crawlUrl);
              $crawlUrl->setCoffeeBean($this);
          }

          return $this;
      }

      public function removeCrawlUrl(CrawlUrl $crawlUrl): static
      {
          if ($this->crawlUrls->removeElement($crawlUrl)) {
              // Only unset the owning side if it still points at this bean.
              if ($crawlUrl->getCoffeeBean() === $this) {
                  $crawlUrl->setCoffeeBean(null);
              }
          }

          return $this;
      }
      ```
    - Add a helper method to get the primary URL (`getPrimaryCrawlUrl()`).
- Step 2: Create Database Migration
  - Generate a new migration file: `php bin/console make:migration`.
  - In the `up()` method:
    - Schema Changes: The generated migration should automatically contain the SQL to drop the unique index on `crawl_url.coffee_bean_id` and add the `is_primary` column. Verify this is correct.
    - Data Migration (PHP code within the migration):
      - Fetch all `CoffeeBean` entities.
      - Group them in-memory by a normalized key (e.g., `roaster_id` + `normalized_url_slug`). A good normalization strategy is to remove protocol, `www.`, and language path segments like `/en/`, `/de/`.
      - Iterate through each group of duplicates. For each group:
        - Designate the first bean as the `primaryBean`.
        - Iterate through the other `duplicateBean`s in the group.
        - Re-assign all `CrawlUrl`s from the `duplicateBean` to the `primaryBean`.
        - Crucially, re-assign any other relations from the duplicate to the primary bean if necessary.
        - After re-assigning, add the `duplicateBean` to a list for deletion.
      - Use a DQL query to delete all the collected `duplicateBean`s.
      - Run a final query to set the `is_primary` flag. For each `CoffeeBean`, mark one of its `CrawlUrl`s as primary (e.g., the one with the oldest creation date).
- Step 3: Update Persistence Logic
  - File to modify: `src/Service/Crawler/Persistance/CoffeeBeanPersister.php`
  - Changes needed:
    - In `findExistingBean()`:
      - Implement a URL normalization function (e.g., using regex `preg_replace('#/([a-z]{2})/#', '/', $url)`).
      - Instead of `findOneBy(['url' => $url])`, create a custom repository method `findBeanByNormalizedUrl(string $normalizedUrl, string $roasterId)` that joins `CrawlUrl` and checks against the normalized version of `CrawlUrl.url`.
    - In `processCoffeeBeanFromDTO()`:
      - Replace `$bean->setCrawlUrl($crawlUrl)` with `$bean->addCrawlUrl($crawlUrl)`.
      - Add logic to set the `isPrimary` flag. If the bean is new, the first `CrawlUrl` added becomes primary. If the bean exists, the flag is not changed.
- Step 4: Refactor Codebase
  - Search the entire project for usages of `getCrawlUrl()`.
  - File to modify: `src/Service/Crawler/AvailabilityCrawlSchedulerService.php` (and any others)
    - Replace `$bean->getCrawlUrl()` with `$bean->getPrimaryCrawlUrl()`. Make sure to handle the `null` case if a bean somehow has no URLs.
  - File to modify: `src/Controller/Admin/CoffeeBeanCrudController.php` (or similar admin file)
    - Update the field that displays the URL to either show the primary URL or list all associated URLs.
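The normalization and grouping described in Steps 2 and 3 could look roughly like the sketch below. The function names `normalizeCrawlUrl` and `groupDuplicateBeans`, the row shape (`bean_id`, `roaster_id`, `url`), and the `roaster.example` URLs are all hypothetical. Unlike the regex quoted in Step 3, this version strips only a leading two-letter path segment, on the assumption that language prefixes appear at the start of the path; a roaster with other two-letter leading segments would need an explicit allowlist instead.

```php
<?php

declare(strict_types=1);

/**
 * Reduces a crawl URL to "host/path" without protocol, "www." prefix,
 * leading two-letter language segment, query string, or trailing slash.
 */
function normalizeCrawlUrl(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException(sprintf('Cannot normalize URL "%s".', $url));
    }

    $host = (string) preg_replace('#^www\.#', '', strtolower($parts['host']));
    $path = $parts['path'] ?? '';

    // Assumption: the language code is the first path segment (/en/..., /de/...).
    $path = (string) preg_replace('#^/[a-z]{2}(?=/|$)#i', '', $path);

    return $host . rtrim($path, '/');
}

/**
 * Groups bean ids by roaster and normalized URL and returns only the groups
 * that contain more than one bean, i.e. the ones that need merging.
 *
 * @param list<array{bean_id: int, roaster_id: string, url: string}> $rows
 * @return array<string, array{primary: int, duplicates: list<int>}>
 */
function groupDuplicateBeans(array $rows): array
{
    $groups = [];
    foreach ($rows as $row) {
        $key = $row['roaster_id'] . '|' . normalizeCrawlUrl($row['url']);
        // Keyed by bean id so a bean is counted once per group.
        $groups[$key][$row['bean_id']] = true;
    }

    $merges = [];
    foreach ($groups as $key => $beanIds) {
        $ids = array_keys($beanIds);
        if (count($ids) < 2) {
            continue;
        }
        // The first bean encountered becomes the merge target.
        $merges[$key] = ['primary' => $ids[0], 'duplicates' => array_slice($ids, 1)];
    }

    return $merges;
}
```

Because "first bean encountered" decides the merge target, the order of the input rows matters; sorting them by bean id or creation date before grouping would make the choice deterministic.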
Testing Strategy
- Unit Tests:
  - Write a unit test for the new URL normalization logic to ensure it handles various cases correctly.
- Integration Tests:
  - Write a test for `CoffeeBeanPersister` that simulates crawling two URL variants for the same product and asserts that only one `CoffeeBean` is created and it has two `CrawlUrl`s associated with it.
- Migration Test:
  - On a staging database populated with production-like duplicates, run the migration.
  - After the migration, run SQL queries to verify:
    - There are no more duplicate beans.
    - No data (like flavor notes, regions) was lost from the merged beans.
    - Each `CoffeeBean` has at least one `CrawlUrl` marked as primary.
- End-to-End Test:
  - Manually test the crawling process for a roaster known to have language-variant URLs.
  - Check the admin interface to ensure the bean and its URLs are displayed correctly.
  - Verify the availability checker and other services that relied on the old method still function correctly.
🎯 Success Criteria
- Running the migration successfully merges all existing duplicate `CoffeeBean` entities without data loss.
- After the fix, crawling multiple URL variants for a single product results in only one `CoffeeBean` entity with multiple `CrawlUrl` entities linked to it.
- The application remains stable, and all features that previously used a bean's URL continue to work correctly by using the primary URL.
- The number of `CoffeeBean` entities in the database accurately reflects the number of unique products.