Consultation Patchset Backlog (31–38) Discussion
This article covers the Consultation Patchset Backlog (31–38), a set of critical enhancements to backend robustness, test reliability, observability, and overall product polish. Patchsets 1–30 are assumed to be applied or parked as noted; this backlog emphasizes the backend improvements essential for a stable and scalable system.
Patchset 31.0 – Test Isolation v2: Achieving a Full Suite Green
In this critical patchset, the goal is to ensure the full backend pytest suite passes reliably. Achieving this involves eliminating order-dependent failures and hidden shared-state bugs. The key focus is to create a testing environment where each test operates in isolation, providing confidence in the correctness of the codebase.
Key Tasks for Test Isolation
Full-suite reliability depends on several key tasks. The first is fixing the roughly ten remaining tests that fail only when the full suite runs. These failures cluster around critical functionality such as leaderboards, profiles, Google authentication, models, and ratings, so resolving them will significantly improve the stability of the system.
Next, implementing a single, consistent isolation strategy is essential. Two primary strategies are considered: using an ephemeral database per test or module, or employing a robust truncate-all-tables strategy between tests. The chosen strategy must work effectively even when application code creates its own sessions. This ensures that each test starts with a clean slate, preventing interference from previous tests.
Ensuring billing plan seeding and other required seed data are visible from any session used in tests is another crucial step. Seed data provides a known baseline for tests, allowing them to operate predictably. Finally, finalizing and documenting a canonical testing pattern is necessary. This pattern should include central fixtures in conftest.py, usage of settings_context or environment overrides, and an example test file demonstrating the correct pattern. This standardized approach will make it easier for developers to write reliable tests and maintain consistency across the codebase.
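As a sketch of the truncate-all strategy, the reset step between tests might look like the following, demonstrated here with stdlib sqlite3 rather than the real engine. The table names and billing-plan seed rows are hypothetical; the real suite would derive them from the application's metadata and seed modules, and wrap reset_tables in a central fixture in conftest.py:

```python
import sqlite3

# Hypothetical tables and seed rows; the real fixture would derive these
# from the application's metadata and billing seed modules.
MUTABLE_TABLES = ["users", "debates", "ratings"]
PLAN_SEED = [("free",), ("pro",)]

def create_schema(conn: sqlite3.Connection) -> None:
    for table in MUTABLE_TABLES:
        conn.execute(f"CREATE TABLE {table} (payload TEXT)")
    conn.execute("CREATE TABLE billing_plans (name TEXT)")

def reset_tables(conn: sqlite3.Connection) -> None:
    """Give the next test a clean slate, then re-apply required seed data."""
    for table in MUTABLE_TABLES:
        conn.execute(f"DELETE FROM {table}")
    conn.execute("DELETE FROM billing_plans")
    conn.executemany("INSERT INTO billing_plans (name) VALUES (?)", PLAN_SEED)
    conn.commit()
```

In a pytest conftest, reset_tables would run in an autouse fixture so every test starts from the same seeded baseline, regardless of which sessions earlier tests opened.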
Acceptance Criteria for Patchset 31.0
The acceptance criteria for Patchset 31.0 are stringent, ensuring a high level of reliability. The pytest command cd apps/api && source .venv/bin/activate && pytest -q must pass 100% locally. The CI backend job should run the full suite, or a well-defined large subset, and return green. Additionally, a short tests/TEST_ISOLATION.md or similar document should explain the chosen testing pattern. Meeting these criteria will guarantee that the test suite is robust and reliable, providing a solid foundation for future development.
Patchset 32.0 – Settings & Error Handling Consolidation: Consistency and Safety
This patchset aims to make configuration and error handling consistent, testable, and safe for production. Proper configuration management and error handling are crucial for the stability and maintainability of any application. This patchset addresses these concerns through a series of key tasks.
Key Tasks for Settings and Error Handling
The settings refactor is a core component of this patchset. The primary objective is to enforce settings.* as the single source of truth, eliminating stray os.getenv() reads in application code. This centralized approach simplifies configuration management and reduces the risk of inconsistencies. Extending and standardizing SettingsContext and settings_context utilities used in tests is also important. These utilities provide a consistent way to manage settings in test environments.
Adding validation hooks to catch conflicting or unsafe settings is another critical task. For example, the system should detect if STRIPE_WEBHOOK_VERIFY=1 but the secret is missing. This proactive approach helps prevent configuration errors that could lead to security vulnerabilities or application failures. Cleaning up test-only code, specifically moving any test-only routes or flags out of production main.py into test fixtures/routers, is also essential. Keeping the production app focused on real routes only improves its clarity and reduces the risk of unintended side effects.
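A minimal sketch of such a validation hook, using a plain dataclass as a stand-in for the real settings object (the actual code would more likely attach this to a pydantic model validator); the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Settings:
    stripe_webhook_verify: bool = False
    stripe_webhook_secret: str = ""

    def validate(self) -> None:
        """Fail fast at startup on conflicting or unsafe combinations."""
        if self.stripe_webhook_verify and not self.stripe_webhook_secret:
            raise ValueError(
                "STRIPE_WEBHOOK_VERIFY=1 requires a webhook secret to be set"
            )
```

Running this check once at application startup turns a silent misconfiguration into an immediate, explicit failure.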
Standardizing error handling is the final major task. Introducing a base AppError and domain-specific subclasses in exceptions.py will provide a consistent way to represent application errors. Adding a global exception handler that returns a consistent shape, such as { "code": "...", "detail": "..." }, is crucial for API stability. Migrating ad-hoc HTTPException uses to AppError where appropriate ensures that error handling is uniform throughout the application.
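A sketch of the error hierarchy and the consistent response shape; the subclass name is illustrative, and in FastAPI the error_response mapping would live in a global exception handler registered with app.add_exception_handler:

```python
class AppError(Exception):
    """Base class for known application errors."""
    code = "app_error"
    status_code = 400

    def __init__(self, detail: str):
        super().__init__(detail)
        self.detail = detail

class QuotaExceededError(AppError):
    code = "quota_exceeded"
    status_code = 429

def error_response(exc: AppError) -> dict:
    # The consistent body shape every known error returns to clients.
    return {"code": exc.code, "detail": exc.detail}
```

Because every AppError carries its own code and status_code, one global handler can serve all domain errors without per-route boilerplate.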
Acceptance Criteria for Patchset 32.0
The acceptance criteria for Patchset 32.0 are designed to ensure consistency and safety. No application code should read raw environment variables; everything must go through settings. Error responses should be consistent across the API for known application errors. Tests must use shared settings helpers instead of ad-hoc environment mutations. Meeting these criteria will significantly improve the reliability and maintainability of the application.
Patchset 33.0 – Structured Logging & Observability v1: Gaining Operational Insights
This patchset focuses on making operational behavior observable through structured logs, provider health visibility, and clear traces for debates. Observability is crucial for understanding how an application is performing in production and for diagnosing issues quickly.
Key Tasks for Structured Logging and Observability
Standardizing the logging format on structured (JSON-style) logs is the first key task. Structured logs make it easier to parse and analyze log data, enabling better monitoring and alerting. Adding contextual fields, such as request ID, user ID, debate ID, provider/model, and circuit state, where relevant will further enhance the usefulness of logs.
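One way to sketch this with the stdlib, assuming the contextual fields are passed through logging's extra mechanism (field names follow the list above):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with known context fields."""
    CONTEXT_FIELDS = ("request_id", "user_id", "debate_id", "provider", "circuit_state")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)
```

Call sites would then attach context like `logger.info("debate completed", extra={"debate_id": "d-42", "provider": "openai"})`, and downstream tooling can filter on those keys directly.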
Ensuring critical events coverage is also essential. Key lifecycle events, such as debate created, debate completed/failed, retries, circuit open/close, rate-limit events, billing usage, and webhook failures, should be logged. This comprehensive logging provides a detailed view of the system's operation.
Extending the existing Admin Ops endpoint/UI to show clearer provider health information is another crucial step. This includes displaying error rate windows, circuit open/closed status, and cooldown time remaining. This information is vital for administrators to monitor and manage the system effectively. Finally, creating an OBSERVABILITY.md or docs/OPS.md document explaining how to consume logs and what signals are available is necessary. This documentation will help developers and operations staff leverage the observability features effectively.
Acceptance Criteria for Patchset 33.0
The acceptance criteria for Patchset 33.0 emphasize the quality and usability of the observability features. Logs for core flows (debates, billing, rate limits, provider health) must be structured and include useful metadata. Admin Ops should show basic provider health/circuit breaker data. Observability documentation must exist and match the implementation. Meeting these criteria will provide valuable insights into the system's operation and facilitate proactive issue detection.
Patchset 34.0 – Orchestrator & Debate Pipeline Refactor: Enhancing Testability and Extensibility
The primary goal of this patchset is to turn the orchestration logic into a clean, testable pipeline that is easier to extend and reason about. A well-structured orchestrator is essential for managing complex workflows, such as debates, and for ensuring the system's long-term maintainability.
Key Tasks for Orchestrator Refactor
Modularization is a key task in this refactor. Splitting the current orchestrator into focused modules, such as orchestration/engine.py for the core pipeline runner, orchestration/stages.py for stage implementations, orchestration/state.py for shared debate state, and orchestration/finalization.py for scoring and billing updates, will improve the clarity and maintainability of the code.
Introducing a DebatePipeline concept that runs a list of DebateStage objects in sequence is another crucial step. This pipeline pattern provides a structured way to manage the debate workflow. Integrating existing failure-tolerance settings (failure_threshold, min_required_seats) into the pipeline ensures that the system remains robust in the face of failures.
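A minimal sketch of the pipeline shape, with hypothetical stage names; the real stages would call LLM providers, and the abort check would use the configured failure_threshold and min_required_seats rather than this simplified count:

```python
from dataclasses import dataclass, field

@dataclass
class DebateState:
    topic: str
    transcript: list[str] = field(default_factory=list)
    failed_seats: int = 0

class DebateStage:
    name = "stage"
    def run(self, state: DebateState) -> None:
        raise NotImplementedError

class OpeningStage(DebateStage):
    name = "opening"
    def run(self, state: DebateState) -> None:
        state.transcript.append(f"[{self.name}] {state.topic}")

class RebuttalStage(DebateStage):
    name = "rebuttal"
    def run(self, state: DebateState) -> None:
        state.transcript.append(f"[{self.name}] {state.topic}")

class DebatePipeline:
    def __init__(self, stages: list[DebateStage], failure_threshold: int = 2):
        self.stages = stages
        self.failure_threshold = failure_threshold

    def run(self, state: DebateState) -> DebateState:
        for stage in self.stages:
            # Abort once too many seats have failed, mirroring failure_threshold.
            if state.failed_seats >= self.failure_threshold:
                raise RuntimeError(f"aborting before {stage.name}: too many failed seats")
            stage.run(state)
        return state
```

Each stage reads and mutates only the shared DebateState, which is what makes individual stages unit-testable in isolation.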
Adding VCR-style integration tests that run full debates using recorded LLM outputs is also essential. These tests validate that the full multi-stage flow behaves as expected after refactors. This ensures that the refactoring does not introduce unintended side effects. The use of VCR-style tests allows for repeatable and reliable testing of the orchestration logic.
Acceptance Criteria for Patchset 34.0
The acceptance criteria for Patchset 34.0 focus on the structure and testability of the orchestrator. The orchestrator logic must be split into smaller, well-named modules. Integration tests for an end-to-end debate flow using recorded LLM responses are required. Existing retry/failure-tolerance semantics must be preserved and covered by tests. Meeting these criteria will result in a more maintainable and robust orchestration system.
Patchset 35.0 – Frontend State & Data Layer (Zustand + TanStack Query): Improving Frontend Reliability
This patchset aims to make live debates and replay more robust and maintainable on the frontend by using proper state and data management. Effective state and data management are crucial for creating a responsive and user-friendly frontend.
Key Tasks for Frontend Improvements
Introducing a Zustand store for debates is a key task. This small store will manage the state for live debates and replay, including the active debate ID, current round, active speaker/seat, SSE connection state, and replay playback position and controls. Removing ad-hoc prop drilling and scattered useState for these concerns will simplify the codebase and improve maintainability.
Using TanStack Query (React Query) for data fetching is another important step. Wrapping core data endpoints (runs list, leaderboard, admin ops, usage charts, etc.) in React Query hooks enables caching, background refetching, and consistent loading/error states. This improves the performance and reliability of the frontend.
Polishing SSE reliability is also essential. Reviewing the existing SSE client implementation and adding better reconnect/backoff behavior and user feedback (banners/toasts when live data is temporarily unavailable) will enhance the user experience. Clear communication of connection status is crucial for maintaining user trust.
Acceptance Criteria for Patchset 35.0
The acceptance criteria for Patchset 35.0 focus on the structure and reliability of the frontend. Live debate and replay UIs must rely on a central store rather than heavily nested prop/state chains. Core lists/views should use React Query with clear loading/error states. SSE disruptions must be handled gracefully and communicated in the UI. Meeting these criteria will result in a more robust and user-friendly frontend.
Patchset 36.0 – Safety & Rate-Limit UX v2: Enhancing User Experience
This patchset aims to turn low-level safety mechanisms (PII scrub, rate limits) into clear, user-visible, and admin-understandable behavior. Transparent safety measures and rate limits are crucial for building user trust and ensuring a positive user experience.
Key Tasks for Safety and Rate-Limit UX
Enhancing PII scrubbing is a key task. Extending the current PII scrubber (emails/phones) with optional patterns (names/addresses) behind config flags allows for more comprehensive data protection. Exposing simple metrics in Admin Ops, such as the number of redactions over a recent window, provides administrators with visibility into the effectiveness of the scrubbing process.
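A sketch of the scrubber core, with the existing email/phone patterns approximated (these regexes are illustrative, not the production ones); the returned redaction count is the kind of figure the Admin Ops metric would aggregate over a window:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> tuple[str, int]:
    """Replace emails and phone numbers; return the text and redaction count."""
    redactions = 0
    for pattern, label in ((EMAIL_RE, "[EMAIL]"), (PHONE_RE, "[PHONE]")):
        text, n = pattern.subn(label, text)
        redactions += n
    return text, redactions
```

Optional name/address patterns would be appended to that pattern list only when the corresponding config flags are enabled.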
Documenting prompt-injection guardrails is another important step. Documenting the existing defensive prompting approach for agents/parliament and optionally adding lightweight logging for clearly malicious prompt-injection attempts helps protect the system from abuse. Clear documentation of safety measures builds user trust and confidence.
Improving the rate limit UX is also crucial. When a user hits a rate limit, showing contextual messages, such as "You reached X of Y limit" with a direct link to pricing/upgrade, provides actionable feedback. Ensuring that dashboards and banners reflect current usage and limits in a friendly way (both EN/TR) enhances the user experience. Clear and helpful messaging around rate limits prevents user frustration.
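As a sketch, a hypothetical helper building that contextual message (the real implementation would route the copy through the app's existing EN/TR localization layer, and /pricing is a placeholder path):

```python
def rate_limit_message(used: int, limit: int, pricing_url: str = "/pricing") -> str:
    """Contextual, actionable copy instead of a generic 429 error."""
    return (
        f"You reached {used} of your {limit} debate limit. "
        f"Upgrade at {pricing_url} to keep going."
    )
```

The same used/limit figures would also drive the dashboard usage banners, so the numbers a user sees are consistent everywhere.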
Acceptance Criteria for Patchset 36.0
The acceptance criteria for Patchset 36.0 emphasize the transparency and effectiveness of safety measures. PII scrub behavior must be configurable and visible in admin metrics. Rate-limit events should produce clear, actionable UX (toasts/banners/links) instead of generic errors. Safety behavior (what is scrubbed, when, and why) must be documented. Meeting these criteria will result in a safer and more user-friendly system.
Patchset 37.0 – API Keys & Programmatic Access: Enabling Secure Access
This patchset focuses on providing a clean API-key based access model for programmatic usage, aligned with billing and rate limits. Secure API access is essential for developers who want to integrate with the platform programmatically.
Key Tasks for API Key Implementation
Implementing a backend API key model and storage is a key task. API keys must be stored hashed (never plain text) and only shown once on creation. Adding CRUD endpoints for listing, creating, and revoking keys provides administrators with control over API access.
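A sketch of the create/verify core using stdlib hashing; the ck_ prefix and key length are assumptions, and a production version might additionally use a keyed hash and a separate key-ID prefix for database lookup:

```python
import hashlib
import secrets

def create_api_key() -> tuple[str, str]:
    """Return (plaintext shown once to the user, hash stored in the DB)."""
    plaintext = "ck_" + secrets.token_urlsafe(32)
    stored_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, stored_hash

def verify_api_key(candidate: str, stored_hash: str) -> bool:
    digest = hashlib.sha256(candidate.encode()).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return secrets.compare_digest(digest, stored_hash)
```

Because only the hash is persisted, a database leak never exposes usable keys, and "shown once on creation" falls out naturally: the plaintext exists only in the creation response.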
Integrating authentication and rate limits is also crucial. API keys should authenticate requests (with clear precedence vs. cookies), and per-key quotas and rate limits mapped to billing plans must be enforced. This ensures that API usage is properly managed and aligned with billing.
On the frontend, an "API Access" section in settings lets users and teams create, view, and revoke their keys. A short "Using the Consultation API" documentation section, with examples and security notes, completes the feature.
Acceptance Criteria for Patchset 37.0
The acceptance criteria for Patchset 37.0 focus on the security and usability of API keys. API keys must work end-to-end (create → use → revoke) and be correctly rate-limited. A UI for managing keys must exist. Documentation should clearly explain how to use and protect keys. Meeting these criteria will provide a secure and user-friendly API access mechanism.
Patchset 38.0 – Pydantic v2, Type Safety & Pre-commit: Modernizing the Codebase
This patchset aims to modernize type usage and enforce code quality via tooling. Code modernization and quality enforcement are crucial for the long-term maintainability and stability of the codebase.
Key Tasks for Code Modernization
Migrating to Pydantic v2 is a key task. Replacing legacy Config usage with ConfigDict or model_config, updating all parse_obj/from_orm patterns to their v2 equivalents, and resolving remaining Pydantic deprecation warnings are necessary steps. This ensures that the codebase is using the latest Pydantic features and best practices.
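A before/after sketch of the migration, assuming Pydantic v2 is installed; the model and attribute names are illustrative:

```python
from pydantic import BaseModel, ConfigDict

class UserOut(BaseModel):
    # v1: `class Config: orm_mode = True`
    # v2: model_config with from_attributes
    model_config = ConfigDict(from_attributes=True)
    name: str

class UserRow:
    """Stand-in for a SQLAlchemy row object."""
    def __init__(self, name: str):
        self.name = name

# v1: UserOut.from_orm(row)      -> v2: UserOut.model_validate(row)
# v1: UserOut.parse_obj(payload) -> v2: UserOut.model_validate(payload)
user = UserOut.model_validate(UserRow("ada"))
```

model_validate handles both ORM objects and plain dicts, so the two legacy entry points collapse into one.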
Implementing type checking (mypy) and linting (ruff) is also crucial. Introducing mypy with a reasonably strict config and pydantic/sqlalchemy plugins, and introducing ruff for linting and import organization, helps catch errors early in the development process.
Configuring pre-commit hooks to run formatting (black or a preferred formatter), ruff, mypy, and a small, fast test subset ensures consistent code quality. Adding pytest-cov and a coverage threshold for the backend provides visibility into test coverage.
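A sketch of what the .pre-commit-config.yaml might look like; the pinned rev values are placeholders to update at adoption time, and the fast-test path is an assumption:

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0          # placeholder, pin at adoption time
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.3          # placeholder
    hooks:
      - id: ruff
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2         # placeholder
    hooks:
      - id: mypy
  - repo: local
    hooks:
      - id: fast-tests
        name: fast backend test subset
        entry: bash -c 'cd apps/api && pytest -q tests/fast'  # hypothetical path
        language: system
        pass_filenames: false
```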
Acceptance Criteria for Patchset 38.0
The acceptance criteria for Patchset 38.0 focus on code quality and modernization. No Pydantic deprecation warnings should remain. CI must run mypy + ruff and fail on serious issues. Pre-commit hooks should be configured and documented for contributors. Coverage reports must be available and above the agreed minimum threshold. Meeting these criteria will result in a more modern, maintainable, and reliable codebase.
In conclusion, the Consultation Patchset Backlog (31–38) outlines critical enhancements to backend robustness, test reliability, observability, and overall product polish. Each patchset addresses a specific area of improvement, keeping the system stable, secure, and maintainable as the Consultation platform grows.