CLOUDRAID: DETECTING DISTRIBUTED CONCURRENCY BUGS VIA LOG MINING AND ENHANCEMENT
DOI: https://doie.org/10.5281/bk9rv424
Keywords: Distributed Systems, Concurrency Bugs, Bug Detection, Cloud Computing
Abstract
Cloud systems are plagued by distributed concurrency bugs, which frequently result in data loss and service outages. This work presents CLOUDRAID, an automated tool for detecting distributed concurrency bugs quickly and effectively. Such bugs are notoriously difficult to find because they are triggered by untimely message orderings across nodes. To detect them efficiently, CLOUDRAID automatically analyzes and tests only those message orderings that are likely to expose errors. Specifically, CLOUDRAID mines logs from previous runs to identify message orderings that are feasible but have not been sufficiently exercised. In addition, we propose a log-enhancement technique that automatically injects additional log statements into the system under test; these extra logs further improve CLOUDRAID's effectiveness without introducing any noticeable performance overhead. Because our approach relies only on logs, it is well suited for live production systems. We applied CLOUDRAID to six representative distributed systems, and it discovered nine new bugs, all of which have been confirmed by the original developers; three of them are critical and have already been fixed.
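To make the log-mining step concrete, the following minimal Python sketch illustrates the underlying idea under simplifying assumptions; it is not CLOUDRAID's actual implementation. It assumes a hypothetical log line format "<epoch_ms> SEND <msg_id>", treats each run as a single totally ordered event sequence, and omits the happens-before and feasibility analysis the real tool performs across nodes. The names parse_run and mine_order_candidates are illustrative only.

from collections import defaultdict
from itertools import combinations

def parse_run(log_lines):
    # Extract (timestamp, message_id) events from one run's log.
    # Assumed hypothetical line format: "<epoch_ms> SEND <msg_id>".
    events = []
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[1] == "SEND":
            events.append((int(parts[0]), parts[2]))
    # Return message ids in timestamp order.
    return [msg for _, msg in sorted(events)]

def mine_order_candidates(runs):
    # For every pair of message types, count how often each relative
    # order was observed across past runs. A pair seen in only one
    # order yields its flipped order as a candidate to exercise.
    order_counts = defaultdict(lambda: [0, 0])  # (a, b) -> [a-before-b, b-before-a]
    for messages in runs:
        first_index = {}
        for idx, msg in enumerate(messages):
            first_index.setdefault(msg, idx)  # first occurrence per run
        for a, b in combinations(sorted(first_index), 2):
            if first_index[a] < first_index[b]:
                order_counts[(a, b)][0] += 1
            else:
                order_counts[(a, b)][1] += 1
    candidates = []
    for (a, b), (ab, ba) in order_counts.items():
        if ab == 0 or ba == 0:
            # The unobserved order is a candidate schedule for testing.
            candidates.append((b, a) if ba == 0 else (a, b))
    return candidates

# Example usage (paths are placeholders):
#   runs = [parse_run(open(p).read().splitlines()) for p in ["run1.log", "run2.log"]]
#   print(mine_order_candidates(runs))

Given such per-run message sequences, any pair observed in only one order across all runs produces its reversed order as a candidate schedule to enforce during testing, which mirrors the intuition described above; the actual tool additionally checks that the reversed order is feasible before testing it.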