Refactor HtmlCleaner God Object¶
Priority: 🔴 CRITICAL
Status: Planning
Related QA Analysis: qa-analysis-overview.md
Problem Statement¶
Service/Crawler/HtmlCleaner.php has become a god object with extreme complexity:
Violations¶
- Class Complexity: 107/60 (EXTREME - 78% over threshold)
- Multiple Method Violations:
  - Line 121: transformSelectElements() - CC: 18, NPath: 18,818
  - Line 327: findRadioGroupContainer() - CC: 11
  - Line 414: transformInputElements() - CC: 10, NPath: 258
  - Line 453: removeNoisyElements() - CC: 15, NPath: 738
Impact¶
- Data Quality: Core crawler component affects data quality platform-wide
- Maintainability: Extremely difficult to test, debug, and modify
- Bug Risk: High risk of introducing bugs during any modification
- Extensibility: Adding new HTML cleaning logic is painful
Guideline Violations¶
- God Objects Anti-pattern: Class has too many responsibilities
- Long Methods Anti-pattern: Multiple methods exceed guidelines
- SOLID - Single Responsibility Principle: Class handles too many concerns
Current Responsibilities¶
Based on method names, HtmlCleaner appears to handle:
- Select element transformation
- Radio group container detection
- Input element transformation
- Noisy element removal
- General HTML cleaning/sanitization
- Form element processing
- DOM manipulation
Proposed Refactoring Strategy¶
Step 1: Analyze Current Class Structure¶
- Document all public methods and their responsibilities
- Identify cohesive groups of methods
- Map dependencies between method groups
- Identify shared state and utilities
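One possible way to start the method inventory is PHP's built-in Reflection API. The sketch below lists the public methods a class declares itself, with their parameter counts. `listPublicMethods()` and `ExampleCleaner` are hypothetical names used only for illustration; in practice the real `HtmlCleaner` class name would be passed in.

```php
<?php
/**
 * Lists the public methods declared directly on a class as "name/arity".
 * Hypothetical helper, not part of the existing codebase.
 *
 * @return string[]
 */
function listPublicMethods(string $className): array
{
    $reflection = new ReflectionClass($className);
    $methods = [];

    foreach ($reflection->getMethods(ReflectionMethod::IS_PUBLIC) as $method) {
        // Skip methods inherited from parent classes.
        if ($method->getDeclaringClass()->getName() !== $reflection->getName()) {
            continue;
        }
        $methods[] = $method->getName() . '/' . $method->getNumberOfParameters();
    }

    sort($methods);

    return $methods;
}

// Stand-in class so the sketch runs on its own (assumption, not the real HtmlCleaner).
final class ExampleCleaner
{
    public function clean(string $html): string
    {
        return $html;
    }

    public function reset(): void
    {
    }

    private function helper(): void
    {
    }
}

$inventory = listPublicMethods(ExampleCleaner::class);
```

The resulting list is only a starting point: grouping the methods by responsibility and mapping shared state still needs a manual read of the class.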
Step 2: Design Service Decomposition¶
Break down into focused services following Single Responsibility Principle:
Potential Services:
- FormElementTransformer - Handle form elements (select, input, radio)
- HtmlSanitizer - Remove noisy/unwanted elements
- DomManipulator - Core DOM manipulation utilities
- HtmlCleaner (reduced) - Orchestrate cleaning pipeline
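To make the decomposition more concrete, here is a minimal sketch of how a reduced `HtmlCleaner` could orchestrate the other services as pipeline steps. It is an assumption-laden illustration, not the existing implementation: the `CleaningStep` interface, the `clean()` signature, and the simplified behavior of both steps are all hypothetical.

```php
<?php
// Hypothetical sketch: interface name, method signatures and step behavior
// are assumptions, not taken from the existing HtmlCleaner.

interface CleaningStep
{
    public function apply(DOMDocument $dom): void;
}

/** Stand-in for the proposed HtmlSanitizer: drops <script> and <style> nodes. */
final class HtmlSanitizer implements CleaningStep
{
    public function apply(DOMDocument $dom): void
    {
        foreach (['script', 'style'] as $tag) {
            // Snapshot the live node list; removing while iterating skips nodes.
            foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

/** Stand-in for the proposed FormElementTransformer: replaces each <select> with its selected option text. */
final class FormElementTransformer implements CleaningStep
{
    public function apply(DOMDocument $dom): void
    {
        foreach (iterator_to_array($dom->getElementsByTagName('select')) as $select) {
            $text = $dom->createTextNode($this->selectedOptionText($select));
            $select->parentNode->replaceChild($text, $select);
        }
    }

    private function selectedOptionText(DOMElement $select): string
    {
        $options = $select->getElementsByTagName('option');
        foreach ($options as $option) {
            if ($option->hasAttribute('selected')) {
                return trim($option->textContent);
            }
        }

        // Fall back to the first option when none is marked as selected.
        return $options->length > 0 ? trim($options->item(0)->textContent) : '';
    }
}

/** Reduced HtmlCleaner: parses once, then delegates to the injected steps in order. */
final class HtmlCleaner
{
    /** @var CleaningStep[] */
    private array $steps;

    public function __construct(CleaningStep ...$steps)
    {
        $this->steps = $steps;
    }

    public function clean(string $html): string
    {
        $dom = new DOMDocument();
        // Collect parser warnings internally instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($this->steps as $step) {
            $step->apply($dom);
        }

        return trim($dom->saveHTML());
    }
}

$cleaner = new HtmlCleaner(new HtmlSanitizer(), new FormElementTransformer());
$result = $cleaner->clean(
    '<div><script>x()</script><select><option>A</option><option selected>B</option></select></div>'
);
```

With this shape, adding a new cleaning rule would mean adding one more `CleaningStep` rather than growing the orchestrator.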
Step 3: Extract Complex Methods¶
Each overly complex method should be:
- Broken into smaller, focused methods
- Or extracted into a specialized service
- Reduced to a cyclomatic complexity below 10
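As an illustration of the extraction approach, the sketch below shows what a noisy-element remover might look like once its branching is split into small predicates. The class name `NoisyElementRemover`, the tag list, and the class-name hints are assumptions made for the example; they are not the rules the current `removeNoisyElements()` applies.

```php
<?php
// Illustrative only: the tag list and class-name hints are assumptions,
// not the rules the existing removeNoisyElements() applies.

final class NoisyElementRemover
{
    private const NOISY_TAGS = ['script', 'style', 'noscript', 'iframe'];
    private const NOISY_CLASS_HINTS = ['cookie', 'banner', 'advert'];

    public function remove(DOMDocument $dom): void
    {
        // Snapshot the live list so removals do not disturb iteration.
        foreach (iterator_to_array($dom->getElementsByTagName('*')) as $element) {
            if ($element->parentNode !== null && $this->isNoisy($element)) {
                $element->parentNode->removeChild($element);
            }
        }
    }

    // Each rule lives in its own predicate, so no single method branches much.
    private function isNoisy(DOMElement $element): bool
    {
        return $this->hasNoisyTag($element)
            || $this->isHidden($element)
            || $this->hasNoisyClass($element);
    }

    private function hasNoisyTag(DOMElement $element): bool
    {
        return in_array(strtolower($element->tagName), self::NOISY_TAGS, true);
    }

    private function isHidden(DOMElement $element): bool
    {
        return $element->hasAttribute('hidden')
            || $element->getAttribute('aria-hidden') === 'true';
    }

    private function hasNoisyClass(DOMElement $element): bool
    {
        $classes = strtolower($element->getAttribute('class'));
        foreach (self::NOISY_CLASS_HINTS as $hint) {
            if (strpos($classes, $hint) !== false) {
                return true;
            }
        }

        return false;
    }
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<div><p>Keep me</p><div class="cookie-banner">Accept</div><script>track()</script></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
(new NoisyElementRemover())->remove($dom);
$html = $dom->saveHTML();
```

Every method here has only a handful of decision points, and each predicate can be unit-tested without building a full document.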
Step 4: Create Service Boundaries¶
- Define clear interfaces for each service
- Ensure proper dependency injection
- Remove tight coupling between services
- Use composition over inheritance
Step 5: Test Coverage¶
- Ensure existing tests still pass
- Add unit tests for new services
- Add integration tests for cleaning pipeline
- Verify data quality is maintained
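One way to verify that data quality is maintained is a characterization check: run the legacy and refactored cleaners over the same HTML fixtures and report any fixture where the output differs. The following is a minimal sketch under that assumption. `findRegressions()`, the fixture names, and both stand-in closures are hypothetical; real fixtures would be pages captured from the crawler.

```php
<?php
/**
 * Returns the names of fixtures whose cleaned output differs between the
 * two implementations. Hypothetical helper for illustration.
 *
 * @param callable(string): string $legacy
 * @param callable(string): string $refactored
 * @param array<string, string>    $fixtures   fixture name => raw HTML
 *
 * @return string[]
 */
function findRegressions(callable $legacy, callable $refactored, array $fixtures): array
{
    $regressions = [];

    foreach ($fixtures as $name => $html) {
        if ($legacy($html) !== $refactored($html)) {
            $regressions[] = $name;
        }
    }

    return $regressions;
}

// Stand-ins for the two implementations (assumption: both expose string in, string out).
$legacy = static function (string $html): string {
    return trim(strip_tags($html));
};
$refactored = static function (string $html): string {
    return trim(strip_tags($html));
};

// Tiny inline fixtures so the sketch runs on its own.
$fixtures = [
    'simple_paragraph' => '<p>hello world</p>',
    'select_form' => '<form><select><option>small</option></select></form>',
];

$regressions = findRegressions($legacy, $refactored, $fixtures);
```

An empty result means the two implementations agree on every fixture, so the fixture set should cover select, radio, and input markup as well as noisy elements.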
Success Criteria¶
- HtmlCleaner class complexity < 60
- All methods have cyclomatic complexity < 10
- Clear separation of concerns
- Improved testability
- No regression in crawler data quality
- Easier to extend with new cleaning rules
Risk Assessment¶
High Risk:
- Core crawler component - bugs affect entire platform
- Complex refactoring with many moving parts
- Must maintain backward compatibility
Mitigation:
- Comprehensive test coverage before refactoring
- Incremental approach (one service at a time)
- Feature flag new implementation if possible
- Extensive testing on staging environment
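The feature-flag mitigation could be sketched as a thin wrapper that chooses between the legacy and refactored implementations at runtime. All names below (`HtmlCleanerInterface`, `FeatureFlaggedCleaner`, the `clean()` signature) are hypothetical, and the two anonymous classes are stand-ins so the example runs on its own.

```php
<?php
// Hypothetical names throughout; the real HtmlCleaner API may differ.

interface HtmlCleanerInterface
{
    public function clean(string $html): string;
}

/** Routes calls to the legacy or refactored cleaner depending on a flag. */
final class FeatureFlaggedCleaner implements HtmlCleanerInterface
{
    private HtmlCleanerInterface $legacy;
    private HtmlCleanerInterface $refactored;
    private bool $useRefactored;

    public function __construct(
        HtmlCleanerInterface $legacy,
        HtmlCleanerInterface $refactored,
        bool $useRefactored
    ) {
        $this->legacy = $legacy;
        $this->refactored = $refactored;
        $this->useRefactored = $useRefactored;
    }

    public function clean(string $html): string
    {
        $active = $this->useRefactored ? $this->refactored : $this->legacy;

        return $active->clean($html);
    }
}

// Stand-in implementations so the sketch runs on its own.
$legacy = new class implements HtmlCleanerInterface {
    public function clean(string $html): string
    {
        return 'legacy:' . strip_tags($html);
    }
};
$refactored = new class implements HtmlCleanerInterface {
    public function clean(string $html): string
    {
        return 'refactored:' . strip_tags($html);
    }
};

// In a real deployment the flag would come from configuration rather than a literal.
$flagOff = new FeatureFlaggedCleaner($legacy, $refactored, false);
$flagOn = new FeatureFlaggedCleaner($legacy, $refactored, true);
```

Because callers depend only on the interface, switching back to the legacy path would be a configuration change rather than a code rollback.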
Estimated Effort¶
High - This is a significant refactoring requiring:
- Deep analysis of existing code
- Careful decomposition planning
- Comprehensive testing
- Possibly multiple iterations
Dependencies¶
None - can be addressed independently
Related Issues¶
- Method complexity issues (lines 121, 327, 414, 453) will be resolved as part of this refactoring
Notes¶
- This is the MOST CRITICAL QA violation (highest complexity: 107/60)
- Should be prioritized above all other refactoring work
- Consider pairing on this refactoring given complexity
- May uncover additional architectural issues during refactoring