
Refactor HtmlCleaner God Object

Priority: 🔴 CRITICAL
Status: Planning
Related QA Analysis: qa-analysis-overview.md

Problem Statement

Service/Crawler/HtmlCleaner.php has become a god object: its class complexity is 107 against a threshold of 60, and four of its methods exceed the per-method complexity limits.

Violations

  • Class Complexity: 107/60 (EXTREME - 78% over threshold)
  • Multiple Method Violations:
    • Line 121: transformSelectElements() - CC: 18, NPath: 18,818
    • Line 327: findRadioGroupContainer() - CC: 11
    • Line 414: transformInputElements() - CC: 10, NPath: 258
    • Line 453: removeNoisyElements() - CC: 15, NPath: 738

Impact

  • Data Quality: Core crawler component affects data quality platform-wide
  • Maintainability: Extremely difficult to test, debug, and modify
  • Bug Risk: High risk of introducing bugs during any modification
  • Extensibility: Adding new HTML cleaning logic is painful

Guideline Violations

  • God Objects Anti-pattern: Class has too many responsibilities
  • Long Methods Anti-pattern: Multiple methods exceed guidelines
  • SOLID - Single Responsibility Principle: Class handles too many concerns

Current Responsibilities

Based on method names, HtmlCleaner appears to handle:

  1. Select element transformation
  2. Radio group container detection
  3. Input element transformation
  4. Noisy element removal
  5. General HTML cleaning/sanitization
  6. Form element processing
  7. DOM manipulation

Proposed Refactoring Strategy

Step 1: Analyze Current Class Structure

  • Document all public methods and their responsibilities
  • Identify cohesive groups of methods
  • Map dependencies between method groups
  • Identify shared state and utilities
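
A first inventory of the public surface can be generated rather than compiled by hand. The sketch below uses PHP's Reflection API; `listOwnPublicMethods` is a hypothetical helper name, and the class name passed to it must be the real fully qualified name of HtmlCleaner, which this document does not state.

```php
<?php

/**
 * Returns the sorted names of public methods declared directly on $class
 * (inherited methods are skipped so the list reflects this class only).
 *
 * @param class-string $class
 * @return string[]
 */
function listOwnPublicMethods(string $class): array
{
    $ref = new ReflectionClass($class);
    $names = [];

    foreach ($ref->getMethods(ReflectionMethod::IS_PUBLIC) as $method) {
        // Keep only methods whose declaring class is the class under analysis.
        if ($method->getDeclaringClass()->getName() === $ref->getName()) {
            $names[] = $method->getName();
        }
    }

    sort($names);

    return $names;
}
```

The resulting list is only a starting point: grouping the methods by responsibility and tracing shared state still needs a manual read of the class.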

Step 2: Design Service Decomposition

Break down into focused services following Single Responsibility Principle:

Potential Services:

  • FormElementTransformer - Handle form elements (select, input, radio)
  • HtmlSanitizer - Remove noisy/unwanted elements
  • DomManipulator - Core DOM manipulation utilities
  • HtmlCleaner (reduced) - Orchestrate cleaning pipeline
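
One way the reduced HtmlCleaner could orchestrate the pipeline is sketched below, assuming PHP 8 and the DOM extension. The `CleaningStep` interface, the `apply()` and `clean()` signatures, and the choice of `<script>`/`<style>` as noisy tags are all assumptions made for illustration; the existing class's actual API is not shown in this document.

```php
<?php

// Hypothetical contract: each step mutates the document in place.
interface CleaningStep
{
    public function apply(DOMDocument $dom): void;
}

// Hypothetical sanitizer step; the real list of noisy elements may differ.
final class HtmlSanitizer implements CleaningStep
{
    public function apply(DOMDocument $dom): void
    {
        foreach (['script', 'style'] as $tag) {
            // Copy the live node list first: removing nodes while
            // iterating a live DOMNodeList would skip elements.
            foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

// Reduced HtmlCleaner: parses, runs the steps in order, serializes.
final class HtmlCleaner
{
    /** @param CleaningStep[] $steps */
    public function __construct(private array $steps)
    {
    }

    public function clean(string $html): string
    {
        $dom = new DOMDocument();

        // Collect parser warnings internally instead of emitting them;
        // crawled HTML is frequently malformed.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($this->steps as $step) {
            $step->apply($dom);
        }

        return $dom->saveHTML();
    }
}
```

With this shape, adding a new cleaning rule means adding a new step class and registering it, without touching the orchestrator.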

Step 3: Extract Complex Methods

Each overly complex method should be either:

  • Broken into smaller, focused methods, or
  • Extracted into a specialized service

In both cases the target is a cyclomatic complexity below 10 per method.
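
As an illustration of breaking one branching method into focused ones, the sketch below splits a "is this element noisy?" decision into named predicates. The rules themselves (tag list, hidden attributes, empty `div`) are invented for the example, since the actual criteria inside removeNoisyElements() are not shown here; `NoisyElementRules` is a hypothetical class name.

```php
<?php

// Hypothetical rule set: each predicate carries one decision, so no
// single method accumulates a high cyclomatic complexity.
final class NoisyElementRules
{
    private const NOISY_TAGS = ['script', 'style', 'noscript'];

    public function isNoisy(DOMElement $el): bool
    {
        return $this->hasNoisyTag($el)
            || $this->isHidden($el)
            || $this->isEmptyContainer($el);
    }

    private function hasNoisyTag(DOMElement $el): bool
    {
        return in_array(strtolower($el->tagName), self::NOISY_TAGS, true);
    }

    private function isHidden(DOMElement $el): bool
    {
        return $el->hasAttribute('hidden')
            || $el->getAttribute('aria-hidden') === 'true';
    }

    private function isEmptyContainer(DOMElement $el): bool
    {
        return strtolower($el->tagName) === 'div'
            && !$el->hasChildNodes();
    }
}
```

Each predicate can be unit tested on its own, and a new rule adds one small method plus one term in `isNoisy()` instead of another nested branch.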

Step 4: Create Service Boundaries

  • Define clear interfaces for each service
  • Ensure proper dependency injection
  • Remove tight coupling between services
  • Use composition over inheritance
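
The boundary between the form transformer and the shared DOM utilities might look like the following sketch. `DomManipulatorInterface`, `replaceWithText()`, and `transformSelects()` are hypothetical names, and replacing a `<select>` with the text of its selected option is an assumed behaviour chosen only to make the example concrete.

```php
<?php

// Hypothetical interface for the shared DOM utilities.
interface DomManipulatorInterface
{
    public function replaceWithText(DOMElement $el, string $text): void;
}

final class DomManipulator implements DomManipulatorInterface
{
    public function replaceWithText(DOMElement $el, string $text): void
    {
        $textNode = $el->ownerDocument->createTextNode($text);
        $el->parentNode->replaceChild($textNode, $el);
    }
}

// Composition: the transformer receives the utilities through its
// constructor instead of inheriting from a shared base class.
final class FormElementTransformer
{
    public function __construct(private DomManipulatorInterface $dom)
    {
    }

    // Assumed behaviour: each <select> becomes its selected option's text.
    public function transformSelects(DOMDocument $doc): void
    {
        // Copy the live node list before mutating the document.
        foreach (iterator_to_array($doc->getElementsByTagName('select')) as $select) {
            $this->dom->replaceWithText($select, $this->selectedLabel($select));
        }
    }

    private function selectedLabel(DOMElement $select): string
    {
        $options = iterator_to_array($select->getElementsByTagName('option'));

        foreach ($options as $option) {
            if ($option->hasAttribute('selected')) {
                return trim($option->textContent);
            }
        }

        // Fall back to the first option, or an empty string if there is none.
        return $options === [] ? '' : trim($options[0]->textContent);
    }
}
```

Because the transformer depends on the interface only, a test can pass in a stub manipulator and assert on the calls it receives without building a full document.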

Step 5: Test Coverage

  • Ensure existing tests still pass
  • Add unit tests for new services
  • Add integration tests for cleaning pipeline
  • Verify data quality is maintained
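
To verify that data quality is maintained, the old and new implementations can be run over the same HTML fixtures and their outputs compared. The helper below is a framework-agnostic sketch; `findRegressions` is a hypothetical name, and both cleaners are passed as plain callables because the real class API is not shown in this document.

```php
<?php

/**
 * Runs both implementations over the same fixtures and returns the
 * names of the fixtures whose cleaned output differs.
 *
 * @param array<string, string>   $fixtures   fixture name => raw HTML
 * @param callable(string):string $legacy     current implementation
 * @param callable(string):string $refactored new implementation
 * @return string[]
 */
function findRegressions(array $fixtures, callable $legacy, callable $refactored): array
{
    $regressions = [];

    foreach ($fixtures as $name => $html) {
        // Strict comparison: any byte-level difference counts as a regression.
        if ($legacy($html) !== $refactored($html)) {
            $regressions[] = $name;
        }
    }

    return $regressions;
}
```

Fixtures captured from real crawled pages make this most useful; an empty result means the refactored pipeline reproduced the legacy output for every fixture.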

Success Criteria

  • HtmlCleaner class complexity < 60
  • All methods have cyclomatic complexity < 10
  • Clear separation of concerns
  • Improved testability
  • No regression in crawler data quality
  • Easier to extend with new cleaning rules

Risk Assessment

High Risk:

  • Core crawler component - bugs affect entire platform
  • Complex refactoring with many moving parts
  • Must maintain backward compatibility

Mitigation:

  • Comprehensive test coverage before refactoring
  • Incremental approach (one service at a time)
  • Put the new implementation behind a feature flag if possible
  • Extensive testing on staging environment
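
A feature flag for the new implementation could be as small as a wrapper that delegates to one of two cleaners. This is a sketch under assumptions: `HtmlCleanerInterface`, `FeatureFlaggedCleaner`, and the boolean flag are hypothetical, and how the flag value is sourced (configuration, environment, flag service) is left open.

```php
<?php

// Hypothetical common contract shared by the old and new cleaners.
interface HtmlCleanerInterface
{
    public function clean(string $html): string;
}

// Delegates to the legacy or the refactored cleaner based on a flag,
// so the switch can be reverted without a code change.
final class FeatureFlaggedCleaner implements HtmlCleanerInterface
{
    public function __construct(
        private HtmlCleanerInterface $legacy,
        private HtmlCleanerInterface $refactored,
        private bool $useRefactored,
    ) {
    }

    public function clean(string $html): string
    {
        return $this->useRefactored
            ? $this->refactored->clean($html)
            : $this->legacy->clean($html);
    }
}
```

Callers depend only on the interface, so the wrapper can be removed once the refactored cleaner has proven itself on staging.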

Estimated Effort

High - This is a significant refactoring requiring:

  • Deep analysis of existing code
  • Careful decomposition planning
  • Comprehensive testing
  • Possibly multiple iterations

Dependencies

None. This refactoring can be addressed independently.

  • Method complexity issues (lines 121, 327, 414, 453) will be resolved as part of this refactoring

Notes

  • This is the MOST CRITICAL QA violation (highest complexity: 107/60)
  • Should be prioritized above all other refactoring work
  • Consider pairing on this refactoring given its complexity
  • May uncover additional architectural issues during refactoring