Publications
Papers, preprints, and current submissions in AI for Software Engineering and AIOps.
2026
- Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics
  Yifan Yang, Aoyang Fang, Songhan Zhang, and Pinjia He*
  Apr 2026. Accepted at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2026)
Distributed tracing in microservices is critical for diagnostics but generates overwhelming data volumes, necessitating intelligent sampling. To maximize fidelity, state-of-the-art (SOTA) tail-based samplers analyze complete (or even log-enriched) traces by modeling them as graphs. However, this reliance on computationally expensive graph analysis creates a performance bottleneck that prohibits their use in online settings. To this end, we propose Gleaner, an online tail-sampling framework that breaks this trade-off. It is founded on the key insight that explicit graph structures are unnecessary for high-fidelity trace grouping. Instead, Gleaner represents each trace as a “bag-of-edges” augmented with log semantics, replacing slow graph algorithms with highly efficient set-based operations. It also employs an alarm-driven quota and a diversity-preserving strategy to prioritize anomalous and rare traces for downstream Root Cause Analysis (RCA). Experimentally, Gleaner processes traces at 0.74ms each, improving Trace Pattern Coverage by up to 128.7% and Shannon Entropy by up to 32.9% over baselines. At just a 1% sampling rate, Gleaner improves RCA accuracy by 42%–107% over the next-best sampler. Moreover, RCA on Gleaner’s sampled data is more accurate than with the entire, unsampled dataset. This result reframes intelligent sampling from a data reduction technique to a powerful signal enhancement paradigm for automated operations.
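To illustrate the "bag-of-edges" idea in the Gleaner abstract, here is a minimal sketch, not the authors' implementation. It assumes a trace can be reduced to a list of (caller, callee) service pairs and compares traces with a multiset Jaccard similarity; the function names, the sample service names, and the choice of Jaccard are all assumptions made for illustration, and the log-semantics augmentation is omitted.

```python
from collections import Counter


def bag_of_edges(call_pairs):
    """Collapse a trace into a multiset of (caller, callee) edges.

    `call_pairs` is a hypothetical input format: one tuple per span,
    naming the calling and the called service.
    """
    return Counter(call_pairs)


def jaccard(bag_a, bag_b):
    """Multiset Jaccard similarity: shared edge count over combined edge count."""
    if not bag_a and not bag_b:
        return 1.0  # two empty traces are treated as identical
    shared = sum((bag_a & bag_b).values())    # per-edge minimum counts
    combined = sum((bag_a | bag_b).values())  # per-edge maximum counts
    return shared / combined


# Hypothetical traces: A and B share a call path, C is unrelated.
trace_a = bag_of_edges([("gateway", "cart"), ("cart", "db"), ("cart", "db")])
trace_b = bag_of_edges([("gateway", "cart"), ("cart", "db")])
trace_c = bag_of_edges([("gateway", "auth"), ("auth", "cache")])

sim_ab = jaccard(trace_a, trace_b)
sim_ac = jaccard(trace_a, trace_c)
print(sim_ab, sim_ac)
```

Because the comparison needs only counter intersections and unions, no graph traversal or graph matching is involved, which is the kind of set-based operation the abstract contrasts with graph analysis.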
2025
- Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark
  Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, and Pinjia He*
  Oct 2025. Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)
While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been proposed to address this challenge. However, we find that the benchmarks used to evaluate these models are often too simple to reflect real-world scenarios. Our preliminary study reveals that simple rule-based methods can achieve performance comparable to or even surpassing state-of-the-art (SOTA) models on four widely used public benchmarks. This finding suggests that the oversimplification of existing benchmarks might lead to an overestimation of the performance of RCA methods. To further investigate the oversimplification issue, we conduct a systematic analysis of popular public RCA benchmarks, identifying key limitations in their fault injection strategies, call graph structures, and telemetry signal patterns. Based on these insights, we propose an automated framework for generating more challenging and comprehensive benchmarks that include complex fault propagation scenarios. Our new dataset contains 1,430 validated failure cases from 9,152 fault injections, covering 25 fault types across 6 categories, dynamic workloads, and hierarchical ground-truth labels that map failures from services down to code-level causes. Crucially, to ensure the failure cases are relevant to IT operations, each case is validated to have a discernible impact on user-facing SLIs. Our re-evaluation of 11 SOTA models on this new benchmark shows that they achieve low Top@1 accuracies, averaging 0.21, with the best-performing model reaching merely 0.37, and execution times escalating from seconds to hours.
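For readers unfamiliar with the Top@1 metric reported above, the following is a minimal sketch of how a Top@k accuracy is commonly computed for RCA: the fraction of failure cases whose true root cause appears among the first k ranked candidates. The function name, the input format, and the sample data are assumptions for illustration and are not taken from the benchmark's code.

```python
def top_k_accuracy(rankings, true_causes, k=1):
    """Fraction of cases whose true root cause is within the top-k candidates.

    `rankings` holds one ranked candidate list per failure case (most
    suspicious first); `true_causes` holds the matching ground-truth labels.
    Both formats are hypothetical.
    """
    if not true_causes:
        raise ValueError("at least one failure case is required")
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(true_causes)


# Three hypothetical failure cases with ranked service candidates.
rankings = [
    ["cart", "db", "auth"],
    ["auth", "cart", "db"],
    ["db", "cart", "auth"],
]
true_causes = ["cart", "cart", "auth"]

top1 = top_k_accuracy(rankings, true_causes, k=1)
top3 = top_k_accuracy(rankings, true_causes, k=3)
print(top1, top3)
```

Under this definition, an average Top@1 of 0.21 means the top-ranked candidate is the true root cause in roughly one case out of five.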
- DynaCausal: Dynamic Causality-Aware Root Cause Analysis for Distributed Microservices
  Songhan Zhang, Aoyang Fang, Yifan Yang, Ruiyi Cheng, Xiaoying Tang, and Pinjia He*
  Oct 2025. Under peer review; ASE 2026 submission
Cloud-native microservices enable rapid iteration and scalable deployment but also create complex, fast-evolving dependencies that challenge reliable diagnosis. Existing root cause analysis (RCA) approaches, even with multi-modal fusion of logs, traces, and metrics, remain limited in capturing dynamic behaviors and shifting service relationships. Three critical challenges persist: (i) inadequate modeling of cascading fault propagation, (ii) vulnerability to noise interference and concept drift in normal service behavior, and (iii) over-reliance on service deviation intensity that obscures true root causes. To address these challenges, we propose DynaCausal, a dynamic causality-aware framework for RCA in distributed microservice systems. DynaCausal unifies multi-modal dynamic signals to capture time-varying spatio-temporal dependencies through interaction-aware representation learning. It further introduces a dynamic contrastive mechanism to disentangle true fault indicators from contextual noise and adopts a causal-prioritized pairwise ranking objective to explicitly optimize causal attribution. Comprehensive evaluations on public benchmarks demonstrate that DynaCausal consistently surpasses state-of-the-art methods, attaining an average AC@1 of 0.63 with absolute gains from 0.25 to 0.46, and delivering both accurate and interpretable diagnoses in highly dynamic microservice environments.
2024
- Metis: An Interpretable and Unified Troubleshooting Framework for Microservices using Multi-modal Data
  Zhouruixing Zhu, Yifan Yang, Aoyang Fang, Yidan Wang, and Pinjia He*
  2024. Under review at ACM Transactions on Software Engineering and Methodology (TOSEM)
While microservices architecture improves scalability, resilience, and agility, it also introduces significant reliability challenges due to its complexity and dynamic nature. Troubleshooting is essential for maintaining system reliability, typically comprising three key stages: anomaly detection (AD), root cause localization (RCL), and fault type identification (FTI). Effective and interpretable troubleshooting enables engineers to swiftly detect, localize, and resolve faults, ensuring system stability and robustness. However, existing approaches suffer from three main limitations: (1) lack of a unified troubleshooting framework that connects all stages; (2) limited traceability between model predictions and observability data; and (3) insufficient fault coverage due to reliance on single-modal anomaly detectors. To address these challenges, we propose Metis, a unified troubleshooting framework for microservices that integrates AD, RCL, and FTI into a single end-to-end pipeline. Unlike prior methods, Metis systematically incorporates multi-modal observability signals, including logs, metrics, and traces, at every stage, enabling comprehensive and cross-modal fault diagnosis. It also generates interpretable multi-modal events that directly link model predictions to raw observability data, enhancing the transparency and trustworthiness of automatic outputs. Experiments on two widely used microservice benchmarks demonstrate that Metis outperforms all baselines in AD (26.58%–115.52% F1 improvement), FTI (59.09%–738.78% F1 improvement), and RCL (27.39%–1558.18% Avg@5 improvement). In addition, Metis achieves strong efficiency and scalability across datasets of different sizes. Ablation studies further confirm the importance of modeling multi-modal data throughout the troubleshooting pipeline.
* Corresponding author