Feature Implementation Plan: Crawler Feedback Loop
📋 Todo Checklist
- [ ] Create a new `DiscardedProduct` entity and repository to log items filtered out by the crawler's confidence score.
- [ ] Update the `CoffeeBeanPersister` to create `DiscardedProduct` records when it rejects a product.
- [ ] Create a new admin interface (`DiscardedProductCrudController`) for reviewing these records.
- [ ] Implement a "Rescue Product" feature in the admin to override the crawler's decision.
- [ ] Write unit and integration tests for the new entities and the rescue logic.
- [ ] Final Review and Testing
🔍 Analysis & Investigation
Codebase Structure
- This plan will introduce a new entity, `src/Entity/DiscardedProduct.php`, and its repository.
- It will modify `src/Service/Crawler/Persistance/CoffeeBeanPersister.php` to populate this new table.
- It will add a new admin controller, `src/Controller/Admin/DiscardedProductCrudController.php`, to provide the user interface for the feedback loop.
Current Architecture & Problem
- Problem: The current crawler uses an LLM confidence score to automatically discard products that don't appear to be whole coffee beans. This is efficient but imperfect: if the LLM misjudges a product, there is no mechanism to correct it, and valuable data may be lost permanently.
- Solution: This plan introduces a "purgatory" for discarded products. Instead of being dropped, they are stored in a separate table for human review. This creates a feedback loop that lets administrators correct the system's mistakes and builds a dataset that can be used to fine-tune the crawler in the future.
Dependencies & Integration Points
- Doctrine: Used to create the new `DiscardedProduct` entity and its relationship with the crawling process.
- EasyAdminBundle: Will be used to quickly create the admin interface for reviewing and managing discarded products.
Considerations & Challenges
- Rescue Logic: The process of "rescuing" a product needs to be robust. It involves taking the raw data from the `DiscardedProduct` and re-running the persistence logic, but this time forcing it to bypass the confidence check.
- Data Volume: The `discarded_product` table could grow large over time. A periodic cleanup job might be necessary in the long run, but it is not required for the initial implementation.
- Admin UI: The admin interface should be designed for efficiency, allowing an admin to quickly see why a product was discarded and to rescue or permanently delete it with a single click.
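The rescue path only works if the stored raw data is enough to rebuild the DTO. Below is a minimal sketch of that JSON round trip in plain PHP. The `CoffeeBeanData` class shown here is a hypothetical stand-in: its three fields and the `toArray()`/`fromArray()` helpers are assumptions for illustration, since the real DTO's shape is not listed in this plan.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the crawler DTO (assumed fields, not the real class).
final class CoffeeBeanData
{
    public function __construct(
        public readonly string $name,
        public readonly string $url,
        public readonly float $confidenceScore,
    ) {}

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return [
            'name' => $this->name,
            'url' => $this->url,
            'confidenceScore' => $this->confidenceScore,
        ];
    }

    /** @param array<string, mixed> $data */
    public static function fromArray(array $data): self
    {
        return new self(
            (string) $data['name'],
            (string) $data['url'],
            (float) $data['confidenceScore'],
        );
    }
}

// What the discard step would store in rawData.
$dto = new CoffeeBeanData('Ethiopia Guji', 'https://example.com/guji', 0.42);
$rawData = json_encode($dto->toArray(), JSON_THROW_ON_ERROR);

// What the rescue step would do before calling the persister again.
$restored = CoffeeBeanData::fromArray(
    json_decode($rawData, true, 512, JSON_THROW_ON_ERROR)
);
```

If the real DTO contains nested objects or dates, a serializer component would likely be a better fit than hand-written array mapping; the point here is only that whatever is written to `rawData` must be losslessly readable at rescue time.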
📝 Implementation Plan
Prerequisites
- No new external dependencies are required.
Step-by-Step Implementation
- Create `DiscardedProduct` Entity
  - Files to create: `src/Entity/DiscardedProduct.php` and `src/Repository/DiscardedProductRepository.php`.
  - Changes needed: The entity should store all the information needed to potentially re-create the coffee bean. This includes:
    - `id` (UUID)
    - `productName` (string)
    - `url` (string)
    - `reason` (string, e.g., "Low confidence score")
    - `confidenceScore` (float)
    - `rawData` (json, to store the complete DTO from the crawler)
    - `createdAt` (datetime)
- Update `CoffeeBeanPersister` to Log Discards
  - Files to modify: `src/Service/Crawler/Persistance/CoffeeBeanPersister.php`
  - Changes needed:
    - Inject the `DiscardedProductRepository` (or `EntityManagerInterface`).
    - In the `processCoffeeBeanFromDTO` method, inside the `if ($score < $threshold)` block:
      - Instead of just logging and returning, create a new `DiscardedProduct` entity.
      - Populate it with the name, URL, score, and the full `CoffeeBeanData` DTO (serialized as JSON).
      - Persist the new `DiscardedProduct` entity.
- Create Admin Interface for Review
  - Files to create: `src/Controller/Admin/DiscardedProductCrudController.php`
  - Changes needed:
    - Create a standard EasyAdmin CRUD controller for the `DiscardedProduct` entity. Display key fields like `productName`, `url`, `confidenceScore`, and `reason`.
    - Add a custom admin action/button named "Rescue Product". This can be done using `AdminUrlGenerator` and `Action::new()`.
    - This action should link to a new method within the CRUD controller.
- Implement the "Rescue" Logic
  - Files to modify: `src/Controller/Admin/DiscardedProductCrudController.php`
  - Changes needed:
    - Create the public method for the "Rescue" action.
    - This method will:
      - Fetch the `DiscardedProduct` entity.
      - Deserialize its `rawData` back into a `CoffeeBeanData` DTO.
      - Call the `CoffeeBeanPersister->processCoffeeBeanFromDTO()` method, but with an additional optional parameter, e.g., `processCoffeeBeanFromDTO($dto, $crawlUrl, $force = true)`, to bypass the confidence check.
      - After successful persistence, delete the `DiscardedProduct` record.
      - Add a success flash message and redirect back to the list of discarded products.
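The discard and rescue steps above can be sketched end to end in plain PHP, without Doctrine or EasyAdmin, to show how the `$force` flag interacts with the confidence check. Everything here is an assumption made for illustration: `SketchPersister`, the in-memory arrays standing in for database tables, the array-shaped DTO, the `rescueProduct()` helper (in the plan this logic lives in the CRUD controller), and the `0.7` threshold.

```php
<?php
declare(strict_types=1);

// Illustrative value object; the real entity would be a Doctrine-mapped class.
final class DiscardedProduct
{
    public function __construct(
        public readonly string $productName,
        public readonly string $url,
        public readonly string $reason,
        public readonly float $confidenceScore,
        public readonly array $rawData, // complete crawler payload
    ) {}
}

// Hypothetical persister using arrays in place of database tables.
final class SketchPersister
{
    /** @var array<string, array<string, mixed>> accepted beans keyed by URL */
    public array $beans = [];
    /** @var array<string, DiscardedProduct> discarded items keyed by URL */
    public array $discarded = [];

    public function __construct(private float $threshold = 0.7) {}

    /** Returns true when the bean was persisted, false when it was discarded. */
    public function processCoffeeBeanFromDTO(array $dto, string $crawlUrl, bool $force = false): bool
    {
        $score = (float) $dto['confidenceScore'];
        if (!$force && $score < $this->threshold) {
            // Record the rejection instead of silently dropping the product.
            $this->discarded[$crawlUrl] = new DiscardedProduct(
                (string) $dto['name'], $crawlUrl, 'Low confidence score', $score, $dto
            );
            return false;
        }
        $this->beans[$crawlUrl] = $dto;
        return true;
    }
}

/** Re-runs persistence with the check bypassed, then removes the discard record. */
function rescueProduct(SketchPersister $persister, string $crawlUrl): bool
{
    $record = $persister->discarded[$crawlUrl] ?? null;
    if ($record === null) {
        return false;
    }
    if (!$persister->processCoffeeBeanFromDTO($record->rawData, $record->url, true)) {
        return false;
    }
    unset($persister->discarded[$crawlUrl]);
    return true;
}

$persister = new SketchPersister(0.7);
$dto = ['name' => 'Ethiopia Guji', 'confidenceScore' => 0.42];
$accepted = $persister->processCoffeeBeanFromDTO($dto, 'https://example.com/guji');
$rescued = rescueProduct($persister, 'https://example.com/guji');
```

Defaulting `$force` to `false` keeps existing crawler call sites unchanged, and deleting the discard record only after persistence succeeds means a failed rescue leaves the item available for another attempt.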
Testing Strategy
- Unit Tests:
  - Write a unit test to confirm that the `CoffeeBeanPersister` creates a `DiscardedProduct` when it rejects a bean.
  - Write a unit test for the "Rescue" logic in the `DiscardedProductCrudController` to ensure it correctly calls the persister and deletes the record.
- Integration Tests:
  - Write an integration test that simulates a crawl result with a low confidence score and asserts that a `DiscardedProduct` record is created in the database.
- Manual Testing:
  - Go to the admin panel, view the list of discarded products, and click the "Rescue" button. Verify that the product is successfully created as a `CoffeeBean` and removed from the discard list.
🎯 Success Criteria
- Products rejected by the crawler based on confidence score are no longer lost but are logged in a new `discarded_product` table.
- An admin interface exists to review these discarded products.
- An administrator can successfully "rescue" a mistakenly discarded product with a single click.
- The crawler's accuracy can be monitored, and its mistakes can be corrected, improving data quality over time.