Industrial Ph.D. Student, Lund University
Data-Driven Continuous Learning in DevOps
We have developed a smart, machine learning (ML) based filter to support operations in filtering out observations in operations data that significantly deviate from the normal pattern, namely anomalies. We specifically address challenges in evaluating such ML-inspired solutions, as each identified anomaly, reported as an alert to development, may represent a real issue, a warning, or even a false disturbance. Due to this uncertainty, many reported alerts may be disregarded by the DevOps team, which in the long run may cause floods of alerts in development. Thus, we explore approaches to iterative and continuous learning about real issues, which is needed both by the DevOps team and by the ML model employed for detecting anomalies. Our results are based on collaboration with the DevOps team of a Swedish company responsible for ticket management and sales in public transportation.
Thanks to the DevOps concept of continuous monitoring, a vast amount of operations data is available for analysis and visualization through cloud platforms such as Microsoft Azure Monitor. A proper analysis of this data may help reveal unexpected and unwanted software behavior. Designing such an analysis includes selecting suitable ML approaches, which depends on the available data: its type, volume, and annotations. Through collaboration with the case company, we explored the available operations data and its characteristics, and proposed a smart filter as a replacement for their existing heuristic-based approach with manually set thresholds for detecting anomalies. The smart filter is a kind of autonomous monitor that processes performance metrics across multiple services and discovers outliers by employing neural networks capable of learning complex and non-linear dependencies in the data.
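The idea of replacing manually set thresholds with a learned notion of "normal" can be sketched as follows. This is a minimal illustration, not the company's implementation: a linear reconstruction model (PCA via SVD) stands in for the neural network, and the alert threshold is calibrated from training data rather than set by hand.

```python
import numpy as np

def fit_normal_model(X, k=1):
    """Learn a low-dimensional model of 'normal' behaviour from
    multi-metric operations data (rows = observations, cols = metrics).
    A linear reconstruction (PCA) stands in for the neural network."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]  # mean and top-k principal directions

def anomaly_scores(X, mu, components):
    """Reconstruction error: observations the model cannot reconstruct
    from the learned normal pattern receive high scores."""
    Z = (X - mu) @ components.T
    X_hat = mu + Z @ components
    return np.linalg.norm(X - X_hat, axis=1)

def smart_filter(X_train, X_new, quantile=0.99, k=1):
    """Flag observations in X_new whose score exceeds a threshold
    calibrated on training data, instead of a manually set one."""
    mu, comps = fit_normal_model(X_train, k)
    threshold = np.quantile(anomaly_scores(X_train, mu, comps), quantile)
    return anomaly_scores(X_new, mu, comps) > threshold
```

A neural autoencoder plays the same role in practice: it learns a compressed representation of normal metric patterns, and observations with high reconstruction error are reported as anomalies.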
In this talk, we will present how feedback from the development team can be used to evaluate and iteratively improve the ML model, based on generated labeled data denoting true positive alerts, true negative alerts, warnings, and unspecified alerts. This means that reported alerts need to be continuously examined by the DevOps team and their resolutions saved and used for updating existing or training new ML models. Over time, more and more labeled anomalous data will be collected and used to optimize the selected, best-performing ML models. With the DevOps team in the loop, the ML approach will be iteratively improved based on this feedback, while the overall strategy will help the DevOps team fully understand how the software operates and address any detected disturbances in a timely manner.
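Closing this loop can be sketched as a small bookkeeping component that records alert resolutions and turns them into evaluation figures and labeled training data. The class and method names below are illustrative assumptions, not the company's system; only the four resolution labels come from the approach described above.

```python
from collections import Counter
from dataclasses import dataclass, field

# Label vocabulary from the described approach; "unspecified" means the
# alert has not (yet) been conclusively examined by the DevOps team.
LABELS = {"true_positive", "true_negative", "warning", "unspecified"}

@dataclass
class FeedbackLoop:
    """Hypothetical sketch: collect DevOps resolutions of reported alerts
    and turn them into labeled data for evaluating and retraining the model."""
    resolutions: dict = field(default_factory=dict)  # alert_id -> label

    def resolve(self, alert_id, label):
        """Record the DevOps team's resolution of a reported alert."""
        if label not in LABELS:
            raise ValueError(f"unknown resolution label: {label}")
        self.resolutions[alert_id] = label

    def precision(self):
        """Share of real issues among alerts confirmed as issue/non-issue,
        one way to evaluate the deployed model from feedback."""
        counts = Counter(self.resolutions.values())
        examined = counts["true_positive"] + counts["true_negative"]
        return counts["true_positive"] / examined if examined else None

    def training_set(self, alert_features):
        """Pair each resolved alert's features with its label so an
        existing model can be updated or a new one trained."""
        return [(alert_features[a], lbl)
                for a, lbl in self.resolutions.items()
                if lbl != "unspecified" and a in alert_features]
```

Each retraining round then consumes `training_set(...)` while `precision()` tracks whether the model is actually improving from the team's perspective.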
Brian Fitzgerald and Klaas-Jan Stol. Continuous software engineering: A roadmap and agenda. Journal of Systems and Software, 2017.
Adha Hrusto, Per Runeson, and Emelie Engström. Closing the Feedback Loop in DevOps Through Autonomous Monitors in Operations. SN Computer Science, 2(6):447, August 2021. DOI: 10.1007/s42979-021-00826-y.
Adha Hrusto, Emelie Engström, and Per Runeson. Optimization of Anomaly Detection in a Microservice System Through Continuous Feedback from Development. In Proceedings of the 10th ACM/IEEE International Workshop on Software Engineering for Systems-of-Systems and Software Ecosystems (SESoS 2022).
Tanja Hagemann and Katerina Katsarou. A systematic review on anomaly detection for cloud computing environments. In 2020 3rd Artificial Intelligence and Cloud Computing Conference (AICCC 2020), pages 83–96, New York, NY, USA, 2020. Association for Computing Machinery.
Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon. Deep Learning for Anomaly Detection in Time-Series Data: Review, Analysis, and Guidelines. IEEE Access, 9:120043–120065, 2021.