I am a Research Scientist at Meta. I obtained my Ph.D. degree from UNC Charlotte, supervised by Prof. Dong Dai. Before that, I worked as a Machine Learning Engineer focusing on Natural Language Processing at iFLYTEK Research Lab. I obtained my bachelor’s degree from the University of Science and Technology of China (USTC).
My research interests include Reinforcement Learning, Scheduling and Anomaly Detection. My current research focuses on applying deep learning models to anomaly detection and automated batch job schedulers.
Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters.
Di Zhang, Monish Soundar Raj, Bing Xie, Sheng Di, Dong Dai
Accepted to appear in the 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS’24), 2024.
[PDF] (TBA)
A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs.
Elliot Kolker-Hicks, Di Zhang, Dong Dai
14th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS’23), 2023.
[PDF]
Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs.
Chris Egersdoerfer, Di Zhang, Dong Dai
In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC’23 Poster), 2023.
[PDF]
Optimizing Resource Management for Machine Learning Workloads in High-Performance Clusters
Di Zhang, Dong Dai
In Proceedings of the 37th IEEE International Parallel & Distributed Processing Symposium Workshop (IPDPS’23 PhD Forum), 2023.
[PDF] [Poster]
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis.
Di Zhang, Chris Egersdoerfer, Tabassum Mahmud, Mai Zheng, Dong Dai
37th IEEE International Parallel & Distributed Processing Symposium (IPDPS’23), 2023.
[PDF] [Code] [Slides]
ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection
Chris Egersdoerfer, Di Zhang, Dong Dai
Workshop on Fault Tolerance for HPC at eXtreme Scale (with SC’22).
[PDF]
SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning
Di Zhang, Dong Dai, Bing Xie
31st International ACM Symposium on High-Performance Parallel and Distributed Computing. HPDC’22.
[PDF] [Code] [Slides]
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, Jonathan Cook
ACM Transactions on Storage. TOS’22.
[PDF]
SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis
Di Zhang, Dong Dai, Runzhou Han, Mai Zheng
13th ACM Workshop on Hot Topics in Storage and File Systems. HotStorage’21.
[PDF] [Slides]
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Di Zhang, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie
International Conference for High Performance Computing, Networking, Storage and Analysis. SC’20.
[PDF] [Code] [Slides]