Background
- The goal was to implement an AI anomaly detection module to enhance product competitiveness and attract more clients.
- At the time, there were no data scientists or in-house personnel capable of developing an AI engine, so an external company was contracted to build the AI module and transfer the source code for productization.
- The external company needed to tailor the AI module to our internal system, which required close collaboration. I was responsible for requirements, feedback, progress management, and QA.
Development Details
Performance Data Anomaly Detection
- Collection and learning of performance data.
- Evaluation of real-time performance data for anomalies based on training data.
- Creation and delivery of anomaly detection rules for all performance data.
- Generation of learning, evaluation, and rule reports.
- Prediction of performance data for specific dates.
Log Data Anomaly Detection
- Collection and parsing of log data.
- Use of language AI models to learn from parsed log data.
- Detection of real-time log anomalies based on training data.
- Log learning and anomaly report generation.
- Cross-report generation between performance and log data for the same time period.
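As a rough illustration of the log parsing step listed above, a common approach is to reduce each raw line to a template by masking its variable fields before any model learns from it. The sketch below is an assumption for illustration, not the delivered module's parser: the masking patterns, the placeholder tokens, and the function names `to_template` and `count_templates` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules. Order matters: more specific patterns come first
# so that an IP address is not broken up by the generic number rule.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable fields in a log line with placeholder tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    # Collapse repeated whitespace so equivalent lines map to one template.
    return " ".join(line.split())


def count_templates(lines):
    """Count how often each template occurs in a batch of raw log lines."""
    return Counter(to_template(line) for line in lines)
```

Lines that differ only in IDs or addresses collapse into the same template, which keeps the vocabulary small for the learning step.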
Key AI Algorithms
Development Environment
- OS Platform: Ubuntu Linux
- Programming Language: Python 3.10
- Major Libraries: TensorFlow, Keras (PyTorch was not used)
Metric Anomaly Detection Algorithms
- Learning and anomaly detection: AutoEncoder.
- Anomaly evaluation: SHAP, LIME.
- Anomaly rule creation: RIPPER.
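With an AutoEncoder, anomalies are typically scored by reconstruction error: the model is trained on normal data, and inputs it reconstructs poorly are flagged. The sketch below shows only that scoring step under stated assumptions. The mean-plus-k-standard-deviations threshold, the default `k=3.0`, and the function names are hypothetical, and the model itself is omitted (reconstructions are passed in as plain lists).

```python
from statistics import mean, pstdev


def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def fit_threshold(train_errors, k=3.0):
    """Assumed rule: threshold = mean + k * std of errors seen on training data."""
    return mean(train_errors) + k * pstdev(train_errors)


def is_anomaly(error, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return error > threshold
```

In practice the threshold rule is one of the settings that has to be tuned, since a lower `k` raises sensitivity at the cost of more false alarms.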
Log Anomaly Detection Algorithms
- Log word learning: Word2vec, BERT.
- Log learning and anomaly detection: LSTM + AutoEncoder.
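Sequence models such as an LSTM usually consume fixed-length windows of events rather than single lines. The following sketch shows one way to prepare such input, assuming each parsed log line has already been reduced to an event string. The integer encoding, the reserved ID 0 for unseen events, and the window parameters are assumptions for illustration; the Word2vec or BERT embedding step is left out.

```python
def build_vocab(events):
    """Assign a stable integer ID to each distinct event; 0 is reserved for unknowns."""
    vocab = {}
    for event in events:
        if event not in vocab:
            vocab[event] = len(vocab) + 1
    return vocab


def encode(events, vocab):
    """Map events to IDs, using 0 for events not seen during training."""
    return [vocab.get(event, 0) for event in events]


def sliding_windows(ids, size, step=1):
    """Cut an ID sequence into overlapping windows of a fixed size."""
    return [ids[i:i + size] for i in range(0, len(ids) - size + 1, step)]
```

Each window would then be fed to the sequence model, which is asked to reconstruct it; windows that reconstruct poorly are candidates for anomalies.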
My Role
Key Responsibilities during External Development
(March 2023 – January 2024)
- Progress management of the external development team.
- Requirements management and feedback provision.
- Unit and integration testing, and QA execution.
- Development of collector programs to provide real-time performance data.
- Data augmentation from client data for training purposes.
- Review of external development processes and architecture, and creation of internal project documentation.
Productization after AI Module Handover
(February 2024 – Present)
- Implemented a mock banking system for AI data generation.
- Developed and applied JMeter scripts to inject traffic into the mock banking system.
- Integrated Filebeat and a Kafka cluster with the collector program for real-time log data transmission.
- Currently working on the productization of the AI module into an engine.
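To illustrate the collector side of the Filebeat and Kafka integration, the sketch below normalizes one event as it might arrive in a Kafka message value. It assumes Filebeat publishes JSON events carrying `@timestamp`, `message`, and `log.file.path` fields; that layout should be checked against the deployed Filebeat version. The function name and the output keys are hypothetical, and the Kafka consumer loop itself is omitted.

```python
import json
from datetime import datetime


def parse_filebeat_event(raw: bytes) -> dict:
    """Extract the fields the collector needs from one JSON-encoded event."""
    event = json.loads(raw.decode("utf-8"))
    ts = event.get("@timestamp")
    return {
        # A trailing "Z" is rewritten as an explicit UTC offset so that
        # datetime.fromisoformat also accepts it on Python 3.10.
        "timestamp": datetime.fromisoformat(ts.replace("Z", "+00:00")) if ts else None,
        "message": event.get("message", ""),
        "source": event.get("log", {}).get("file", {}).get("path"),
    }
```

Keeping this step separate from the consumer makes it easy to unit test with captured sample events.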
Acquiring AI Engine Development Skills
Before this project, I had theoretical knowledge of AI but no practical experience, so I studied machine learning (ML) and deep learning (DL) on my own. Although I did not implement the algorithms myself, I needed a working knowledge of them to manage the external team’s progress. Through this project, I gained foundational knowledge and hands-on work experience in AI. It was also my first time managing and supporting external contractors, and it taught me the importance of communication: keeping communication clear helped ensure a smooth code transfer and steady overall progress.
Efforts to Overcome AI Engine Development Limitations
There were certain challenges during development, and I made several efforts to address them:
- We were able to obtain only two weeks of client data, despite requests for more from the technical team, and could not obtain separate event data marking performance anomalies. Recognizing that such limited data would hinder the external team’s development, I augmented the data so that development could proceed more smoothly.
- When implementing log anomaly detection, I proposed and built a Filebeat and Kafka cluster pipeline to collect log data in real time. This allowed us to develop an anomaly detection module that handles various types of log data.
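For the limited-data problem in the first item, one common way to stretch a short history of metrics is to generate jittered copies of each series. The sketch below illustrates that general idea only and is not presented as the exact procedure used in the project; the noise ratio, the number of copies, the seed, and the function names are hypothetical.

```python
import random


def jitter(series, noise_ratio=0.05, rng=None):
    """Return a copy with each value scaled by a random factor in [1 - r, 1 + r]."""
    if rng is None:
        rng = random.Random()
    return [v * (1.0 + rng.uniform(-noise_ratio, noise_ratio)) for v in series]


def augment(series, copies=3, noise_ratio=0.05, seed=0):
    """Generate several jittered copies; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [jitter(series, noise_ratio, rng) for _ in range(copies)]
```

Multiplicative noise keeps the daily shape of a metric intact while varying its level, which suits data where the pattern matters more than the absolute values.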
Constructing Test Environments
After developing an engine for productization, we conducted internal testing. To identify meaningful anomalies in our current performance data, various configurations were required, including hyperparameter tuning, anomaly detection algorithm settings, and loss function configurations. However, our testing environment was limited, consisting of just one Linux server and two Windows PCs.
The ideal approach would have been to use EKS or GKE in the cloud, but that wasn’t feasible under the given conditions, so we used Docker containers as an alternative. The current system is a typical monolith that runs as an ordinary Python application, so we built custom Docker images for the collector and the AI anomaly detection engine. Running them as containers allowed us to test more configurations at the same time.
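Running many configurations side by side can be scripted by expanding a grid of settings into one container per combination. The sketch below only builds the `docker run` command lines and does not execute them. The image name, the container names, and the environment variable names are hypothetical, and it assumes the engine reads its settings from environment variables.

```python
import shlex
from itertools import product


def config_grid(options):
    """Expand {setting: [values]} into a list of one dict per combination."""
    keys = sorted(options)
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]


def docker_run_command(image, name, env):
    """Build a detached `docker run` command that passes settings via -e."""
    cmd = ["docker", "run", "-d", "--rm", "--name", name]
    for key in sorted(env):
        cmd += ["-e", f"{key}={env[key]}"]
    cmd.append(image)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Hypothetical settings; real names depend on how the engine is configured.
    grid = config_grid({"LOSS": ["mse", "mae"], "THRESHOLD_K": [2, 3]})
    for i, env in enumerate(grid):
        print(docker_run_command("anomaly-engine:test", f"engine-{i}", env))
```

On a single server, the number of containers started at once would still need to be capped to fit the available CPU and memory.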
This testing experience taught me that, while testing under optimal, production-like conditions is ideal, it is equally important to find the best possible configuration within limited resources, taking the company’s situation into account. In this case, container technology provided that configuration.