AIOps Root Cause Analysis: Java Architecture

by Alex Johnson

In the realm of AIOps (Artificial Intelligence for IT Operations), identifying the root cause of issues is paramount. This article delves into the Java layer architecture designed to facilitate efficient root cause analysis within an AIOps framework, focusing on three key API pillars: Tool Manifest, Knowledge Base, and Executor.

API Pillar 1: GET /api/v1/tools/manifest (Tool Manifest)

The Tool Manifest API is the agent's initial point of contact: when an agent (typically a Python script) starts, it effectively asks the Java backend, "Hey, what tools do you have available for me to use?" The primary purpose of this API, especially in versions 3.1 and 4.0, is to enable Tool RAG (Retrieval Augmented Generation): through it, the agent discovers the tool_name values it can invoke. In essence, it is a discovery service for available functionality.

The response body is a JSON object listing every available tool, its description, and the schema for its parameters. For example, the query_metric_by_name tool lets the agent query a metric by its registered name. The description provides context, such as "V3.1 - Query a metric based on its standard name from the 'metric manifest'." The parameters_schema then defines the expected input: query_metric_by_name requires metricName (a string holding the unique metric name) and timeDuration (a string indicating the query duration, such as '1h' or '5m'), and accepts optional filters (key-value pairs for tag filtering).

Another crucial tool is execute_sandboxed_analysis, introduced in V5.1, which executes a Python script within a secure sandbox against the raw data of a query. Its parameters are query_name (the name of the metric or log to query), query_type (whether the target is a metric, log, or trace), python_script (the LLM-generated Python code for the analysis), and timeFilters (the time range for the query). This opens the door to custom analysis tailored to a specific problem.

The get_holistic_dashboard tool, available in V6.0, provides a "holistic view" by fetching all critical SLOs (Service Level Objectives) and SLIs (Service Level Indicators) for a given service, giving the agent a comprehensive picture of service performance.

Finally, the search_knowledge_base tool (V4.0/V6.0) queries the "knowledge RAG," which spans SOP (Standard Operating Procedure) documents and dynamic Slack messages. It is invaluable for quickly locating the resolution or escalation procedure needed to close out a production issue.

In summary, the Tool Manifest API is the cornerstone that lets agents discover and use the tools available in the AIOps ecosystem, and in practice it has greatly improved Mean Time To Resolution (MTTR).
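To make the shape of the manifest concrete, here is a minimal sketch of the endpoint as a Spring Boot controller. The framework choice, DTO names, and loose map-based schema are assumptions for illustration; only the tool names, descriptions, and parameter fields come from the manifest described above.

```java
import java.util.List;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// One entry per tool. The parameter schema is represented loosely as a map
// for this sketch; wire keys such as tool_name / parameters_schema would be
// pinned with Jackson's @JsonProperty, omitted here for brevity.
record ToolDefinition(String toolName, String description,
                      Map<String, String> parametersSchema) {}

@RestController
@RequestMapping("/api/v1/tools")
class ToolManifestController {

    // GET /api/v1/tools/manifest: the agent's discovery call at startup.
    @GetMapping("/manifest")
    public List<ToolDefinition> getManifest() {
        return List.of(
            new ToolDefinition(
                "query_metric_by_name",
                "V3.1 - Query a metric based on its standard name from the 'metric manifest'.",
                Map.of(
                    "metricName", "string, required - unique metric name",
                    "timeDuration", "string, required - e.g. '1h' or '5m'",
                    "filters", "object, optional - key-value tag filters")),
            new ToolDefinition(
                "execute_sandboxed_analysis",
                "V5.1 - Run an LLM-generated Python script in a secure sandbox against raw query data.",
                Map.of(
                    "query_name", "string, required - metric or log to query",
                    "query_type", "string, required - metric | log | trace",
                    "python_script", "string, required - analysis code",
                    "timeFilters", "object, required - time range")));
    }
}
```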

API Pillar 2: GET /api/v1/knowledge/list (Knowledge Base)

The Knowledge Base API acts as the agent's "RAG ammunition depot": it tells the agent which parameter values (metricName, logName) are valid and what the data schemas look like when calling the V3.1 and V5.1 tools. Essentially, it is the agent's guide to using the tools correctly. Its purpose is twofold: first, to help the agent discover valid metricName values (V3.1); second, to provide schema_example values (V5.1).

The API uses a query parameter to specify the type of knowledge requested. For instance, GET /api/v1/knowledge/list?type=metrics retrieves the list of available metrics, while GET /api/v1/knowledge/list?type=logs retrieves the list of available logs.

The response body for the logs endpoint is a JSON object containing a list of log definitions. Each item includes the name of the log (e.g., pro.bff.error.logs), a description explaining its purpose (e.g., "BFF application error logs, automatically generated by V3.2 Harvester"), the schema_type (e.g., list[dict]), and a schema_example. The schema_example is particularly valuable for V5.1 because it gives the agent a sample entry from which to learn the log's structure: the pro.bff.error.logs example shows fields like @timestamp, level, service, message, and mdc_context, while pro.druid.slowquery.logs includes fields like timestamp, query_time_ms, and sql_text.

This context and structure is what makes schema-aware analysis possible: the agent understands the data format before processing it, which reduces errors and improves the accuracy of the analysis. In short, the Knowledge Base API is the compass that guides agents through the landscape of available data, ensuring they pass valid parameters and understand the shape of what comes back.
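Under the same assumed Spring Boot framing, a sketch of the knowledge-list endpoint might look like the following. The LogDefinition fields mirror the prose above; the concrete values inside the schema_example map are invented purely to illustrate the shape of one entry.

```java
import java.util.List;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// One log definition: name, description, schema_type, schema_example,
// matching the fields described above.
record LogDefinition(String name, String description, String schemaType,
                     Map<String, Object> schemaExample) {}

@RestController
@RequestMapping("/api/v1/knowledge")
class KnowledgeBaseController {

    // GET /api/v1/knowledge/list?type=logs (or type=metrics)
    @GetMapping("/list")
    public List<LogDefinition> list(@RequestParam("type") String type) {
        if (!"logs".equals(type)) {
            // Metrics would be served analogously; omitted in this sketch.
            return List.of();
        }
        return List.of(new LogDefinition(
                "pro.bff.error.logs",
                "BFF application error logs, automatically generated by V3.2 Harvester",
                "list[dict]",
                // schema_example: field values below are illustrative only.
                Map.of(
                        "@timestamp", "2024-01-01T00:00:00Z",
                        "level", "ERROR",
                        "service", "bff-prod",
                        "message", "NullPointerException while rendering response",
                        "mdc_context", Map.of("traceId", "abc-123"))));
    }
}
```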

API Pillar 3: POST /api/v1/tools/execute (Executor)

The Executor API serves as the single, unified entry point for executing actions. Every action an agent performs goes through this interface, which makes it the only gateway for agent actions and guarantees a consistent, secure execution environment.

The request body, ToolExecutionRequest, contains everything needed to execute a tool. When an LLM (Large Language Model) wants to query a metric (V3.1), tool_name names the tool to run (query_metric_by_name) and the parameters object supplies the inputs: metricName (pro.bff.req.latency.sum), timeDuration (1h), and filters (service: bff-prod). An analysisHint adds context to guide the execution, such as "User is asking about BFF slowness. Focus summary on peaks.", and a request_id makes the call traceable. For a sandboxed analysis (V5.1), tool_name is execute_sandboxed_analysis and parameters carries query_name, query_type, python_script, and timeFilters; the analysisHint could be something like "User wants a custom ratio of NPE vs Timeout." The get_holistic_dashboard tool (V6.0) is invoked with tool_name set to get_holistic_dashboard and parameters containing the service_name, with an analysisHint such as "User asked 'BFF is slow', give me the full picture."

The response body, ToolExecutionResponse, reports the outcome of the execution. A successful V3.1 or V6.0 response includes a status (e.g., "SUCCESS" or "SUCCESS_WITH_SUMMARY"), the request_id, the tool_name, a summary of the execution, and the data returned by the tool; data might include metrics (names and data points) and log patterns (patterns and counts), while debug_info exposes execution details such as the queries that were run. A successful V5.1 (sandboxed analysis) response carries the same basic fields, but its data field contains the JSON output of the sandbox script and its debug_info includes details such as the sandbox execution time and the amount of data pulled.

In short, the Executor API provides a centralized, secure way to execute any action, whether querying metrics, analyzing logs in a sandbox, or fetching a holistic dashboard, ensuring that every action runs consistently and with appropriate context.
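To make the contract concrete, here is a minimal sketch of the request and response DTOs plus the gateway endpoint, under the same assumed Spring Boot framing. The dispatch logic is stubbed and the request_id value is a placeholder; the tool name, parameters, and hint come from the V3.1 example above.

```java
import java.util.List;
import java.util.Map;

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Everything the backend needs for one tool call. Wire keys such as
// tool_name / request_id would be pinned with Jackson's @JsonProperty,
// omitted here for brevity.
record ToolExecutionRequest(String toolName, Map<String, Object> parameters,
                            String analysisHint, String requestId) {}

// Outcome of one call: status, echoed identifiers, a natural-language
// summary, the tool's data payload, and execution diagnostics.
record ToolExecutionResponse(String status, String requestId, String toolName,
                             String summary, Map<String, Object> data,
                             Map<String, Object> debugInfo) {}

@RestController
class ToolExecutorController {

    // POST /api/v1/tools/execute: the single gateway for all agent actions.
    @PostMapping("/api/v1/tools/execute")
    public ToolExecutionResponse execute(@RequestBody ToolExecutionRequest request) {
        // A real implementation would dispatch on tool_name to the metric
        // store, the sandbox, or the dashboard service; stubbed here.
        return new ToolExecutionResponse(
                "SUCCESS",
                request.requestId(),
                request.toolName(),
                "Stub summary for illustration.",
                Map.of(),
                Map.of("executed_queries", List.of()));
    }

    // The V3.1 metric-query request described in the prose above.
    static ToolExecutionRequest exampleMetricQuery() {
        return new ToolExecutionRequest(
                "query_metric_by_name",
                Map.of(
                        "metricName", "pro.bff.req.latency.sum",
                        "timeDuration", "1h",
                        "filters", Map.of("service", "bff-prod")),
                "User is asking about BFF slowness. Focus summary on peaks.",
                "req-42"); // request_id value is a placeholder
    }
}
```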

In summary, these three API pillars – Tool Manifest, Knowledge Base, and Executor – form the foundation of a robust Java layer architecture for AIOps root cause analysis. They enable agents to discover available tools, understand the data they're working with, and execute actions in a consistent and secure manner, ultimately accelerating the identification and resolution of issues.

For more information on AIOps and root cause analysis, BMC Software's AIOps Guide is a useful external resource offering further insights and best practices for improving the efficiency and effectiveness of IT operations.