Feature Implementation Plan: Crawler Feedback Loop

📋 Todo Checklist

  • [ ] Create a new DiscardedProduct entity and repository to log items filtered out by the crawler's confidence score.
  • [ ] Update the CoffeeBeanPersister to create DiscardedProduct records when it rejects a product.
  • [ ] Create a new admin interface (DiscardedProductCrudController) for reviewing these records.
  • [ ] Implement a "Rescue Product" feature in the admin to override the crawler's decision.
  • [ ] Write unit and integration tests for the new entities and the rescue logic.
  • [ ] Perform a final review and end-to-end testing.

🔍 Analysis & Investigation

Codebase Structure

  • This plan will introduce a new entity, src/Entity/DiscardedProduct.php, and its repository.
  • It will modify src/Service/Crawler/Persistance/CoffeeBeanPersister.php to populate this new table.
  • It will add a new admin controller, src/Controller/Admin/DiscardedProductCrudController.php, to provide the user interface for the feedback loop.

Current Architecture & Problem

  • Problem: The current crawler uses an LLM confidence score to automatically discard products that don't appear to be whole coffee beans. This is efficient but imperfect. If the LLM makes a mistake, there is no mechanism to correct it, and valuable data may be lost permanently.
  • Solution: This plan introduces a "purgatory" for discarded products. Instead of dropping rejected products, the crawler stores them in a separate table for human review. This creates a vital feedback loop: administrators can correct the system's mistakes, and the reviewed records form a dataset that can be used to fine-tune the crawler in the future.

Dependencies & Integration Points

  • Doctrine: Used to map the new DiscardedProduct entity and its repository. The new discarded_product table also requires a schema change (for example a Doctrine migration, if the project uses them).
  • EasyAdminBundle: Will be used to quickly create the admin interface for reviewing and managing discarded products.

Considerations & Challenges

  • Rescue Logic: The process of "rescuing" a product needs to be robust. It involves taking the raw data from the DiscardedProduct and re-running the persistence logic, but this time forcing it to bypass the confidence check.
  • Data Volume: The discarded_product table could grow large over time. A periodic cleanup job might be necessary in the long run, but it is not required for the initial implementation.
  • Admin UI: The admin interface should be designed for efficiency, allowing an admin to quickly see why a product was discarded and to rescue or permanently delete it with a single click.
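
The rescue flow described under "Rescue Logic" can be sketched without the framework. The example below is a minimal, in-memory illustration and not the project's code: PersisterSketch, its 0.7 threshold, the rescue() helper, and the three-field CoffeeBeanData are hypothetical stand-ins, and the $crawlUrl argument of the real method is left out for brevity. It shows a low-scoring product being parked with its raw JSON, then restored by re-running the same persistence method with the confidence check bypassed.

```php
<?php

declare(strict_types=1);

// Hypothetical, reduced stand-in for the crawler DTO; the real CoffeeBeanData is not shown in this plan.
final class CoffeeBeanData
{
    public function __construct(
        public readonly string $name,
        public readonly string $url,
        public readonly float $confidenceScore,
    ) {
    }
}

// In-memory stand-in for CoffeeBeanPersister plus the discarded_product table.
final class PersisterSketch
{
    /** @var array<string, array{score: float, rawData: string}> discards keyed by product URL */
    public array $discarded = [];

    /** @var list<CoffeeBeanData> beans that would be stored as CoffeeBean */
    public array $persisted = [];

    // The default threshold is an assumption made for this example.
    public function __construct(private readonly float $threshold = 0.7)
    {
    }

    public function processCoffeeBeanFromDTO(CoffeeBeanData $dto, bool $force = false): bool
    {
        // Normal crawl: low scores are parked for review instead of being dropped.
        if (!$force && $dto->confidenceScore < $this->threshold) {
            $this->discarded[$dto->url] = [
                'score' => $dto->confidenceScore,
                'rawData' => json_encode($dto, JSON_THROW_ON_ERROR),
            ];

            return false;
        }

        $this->persisted[] = $dto;

        return true;
    }

    public function rescue(string $url): bool
    {
        if (!isset($this->discarded[$url])) {
            return false;
        }

        // Rebuild the DTO from the stored JSON, then re-run persistence with the check bypassed.
        $raw = json_decode($this->discarded[$url]['rawData'], true, 512, JSON_THROW_ON_ERROR);
        $dto = new CoffeeBeanData($raw['name'], $raw['url'], (float) $raw['confidenceScore']);

        if ($this->processCoffeeBeanFromDTO($dto, force: true)) {
            // Remove the discard record only after the bean was stored.
            unset($this->discarded[$url]);

            return true;
        }

        return false;
    }
}
```

Because $force defaults to false, ordinary crawls keep the confidence check and only the rescue path opts out of it. Deleting the discard record only after persistence succeeds means a failed rescue leaves the record available for another attempt.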

📝 Implementation Plan

Prerequisites

  • No new external dependencies are required.

Step-by-Step Implementation

  1. Create DiscardedProduct Entity

    • Files to create: src/Entity/DiscardedProduct.php and src/Repository/DiscardedProductRepository.php.
    • Changes needed: The entity should store all the information needed to re-create the coffee bean later. This includes:
      • id (UUID)
      • productName (string)
      • url (string)
      • reason (string, e.g., "Low confidence score")
      • confidenceScore (float)
      • rawData (json, to store the complete DTO from the crawler)
      • createdAt (datetime)
  2. Update CoffeeBeanPersister to Log Discards

    • Files to modify: src/Service/Crawler/Persistance/CoffeeBeanPersister.php
    • Changes needed:
      • Inject the DiscardedProductRepository (or EntityManagerInterface).
      • In the processCoffeeBeanFromDTO method, inside the if ($score < $threshold) block, instead of just logging and returning:
        • Create a new DiscardedProduct entity.
        • Populate it with the name, URL, reason, score, and the full CoffeeBeanData DTO (serialized as JSON).
        • Persist the new DiscardedProduct entity, then return as before so that no CoffeeBean is created.
  3. Create Admin Interface for Review

    • Files to create: src/Controller/Admin/DiscardedProductCrudController.php
    • Changes needed:
      • Create a standard EasyAdmin CRUD controller for the DiscardedProduct entity. Display key fields like productName, url, confidenceScore, and reason.
      • Add a custom admin action/button named "Rescue Product". This can be done using AdminUrlGenerator and Action::new().
      • This action should link to a new method within the CRUD controller.
  4. Implement the "Rescue" Logic

    • Files to modify: src/Controller/Admin/DiscardedProductCrudController.php
    • Changes needed:
      • Create the public method for the "Rescue" action.
      • This method will:
        1. Fetch the DiscardedProduct entity.
        2. Deserialize its rawData back into a CoffeeBeanData DTO.
        3. Call CoffeeBeanPersister->processCoffeeBeanFromDTO() with the confidence check bypassed. This requires adding an optional parameter that defaults to false, e.g., processCoffeeBeanFromDTO($dto, $crawlUrl, bool $force = false), and passing true from the rescue action so that normal crawls keep the check.
        4. After successful persistence, delete the DiscardedProduct record.
        5. Add a success flash message and redirect back to the list of discarded products.
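
As an illustration of step 1, the field list could map to a Doctrine entity roughly as follows. This is a sketch under assumptions rather than the final class: the column types, the 2048-character URL length, the use of DateTimeImmutable for createdAt, and passing a pre-generated UUID string into the constructor are choices made for the example, not details taken from the codebase.

```php
<?php

declare(strict_types=1);

namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

// Sketch only: column types and lengths below are assumptions.
#[ORM\Entity(repositoryClass: \App\Repository\DiscardedProductRepository::class)]
#[ORM\Table(name: 'discarded_product')]
class DiscardedProduct
{
    // UUID stored as a string; generating it is left to the caller in this sketch.
    #[ORM\Id]
    #[ORM\Column(type: 'guid')]
    private string $id;

    #[ORM\Column(type: 'string', length: 255)]
    private string $productName;

    #[ORM\Column(type: 'string', length: 2048)]
    private string $url;

    #[ORM\Column(type: 'string', length: 255)]
    private string $reason;

    #[ORM\Column(type: 'float')]
    private float $confidenceScore;

    // Doctrine's json type converts between a PHP array and a JSON column.
    #[ORM\Column(type: 'json')]
    private array $rawData;

    #[ORM\Column(type: 'datetime_immutable')]
    private \DateTimeImmutable $createdAt;

    /**
     * @param array<string, mixed> $rawData complete crawler DTO, normalized to an array
     */
    public function __construct(
        string $id,
        string $productName,
        string $url,
        string $reason,
        float $confidenceScore,
        array $rawData,
    ) {
        $this->id = $id;
        $this->productName = $productName;
        $this->url = $url;
        $this->reason = $reason;
        $this->confidenceScore = $confidenceScore;
        $this->rawData = $rawData;
        $this->createdAt = new \DateTimeImmutable();
    }

    public function getId(): string
    {
        return $this->id;
    }

    public function getProductName(): string
    {
        return $this->productName;
    }

    public function getUrl(): string
    {
        return $this->url;
    }

    public function getReason(): string
    {
        return $this->reason;
    }

    public function getConfidenceScore(): float
    {
        return $this->confidenceScore;
    }

    /** @return array<string, mixed> */
    public function getRawData(): array
    {
        return $this->rawData;
    }

    public function getCreatedAt(): \DateTimeImmutable
    {
        return $this->createdAt;
    }
}
```

The entity has no setters, so a discard record stays an unmodified snapshot of what the crawler saw, which keeps it usable both for rescue and as later training data.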

Testing Strategy

  • Unit Tests:
    • Write a unit test to confirm that the CoffeeBeanPersister creates a DiscardedProduct when it rejects a bean.
    • Write a unit test for the "Rescue" logic in the DiscardedProductCrudController to ensure it correctly calls the persister and deletes the record.
  • Integration Tests:
    • Write an integration test that simulates a crawl result with a low confidence score and asserts that a DiscardedProduct record is created in the database.
  • Manual Testing:
    • Go to the admin panel, view the list of discarded products, and click the "Rescue" button. Verify that the product is successfully created as a CoffeeBean and removed from the discard list.

🎯 Success Criteria

  • Products rejected by the crawler based on confidence score are no longer lost but are logged in a new discarded_product table.
  • An admin interface exists to review these discarded products.
  • An administrator can successfully "rescue" a mistakenly discarded product with a single click.
  • The crawler's accuracy can be monitored, and its mistakes can be corrected, improving data quality over time.