School of Computing Research Colloquia

A Methodology for Characterizing Workloads in Google MapReduce Cloud to Derive Realistic Resource Utilization Models

Peter Garraghan, Distributed Systems and Services Research Group

Analyzing the behavioral patterns of workloads is critical to understanding Cloud computing environments. However, until now only a limited number of real-world Cloud datacenter trace logs have been available for analysis, which has led to a lack of methodologies for capturing the diversity of patterns that exist in such datasets. This paper presents the first large-scale analysis of real-world Cloud data, using a recently released dataset that features traces from over 12,000 servers spanning a period of one month. Based on this analysis, we develop a novel methodology for characterizing workloads that, for the first time, considers Cloud workloads in the context of both users and tasks in order to derive resource estimation and utilization models.

We present statistical properties of the data, outline the diversity of workload types, and discuss their behavioral patterns. The derived model helps in understanding the relationship between users and tasks within a workload, and enables further work in areas such as resource optimization, energy consumption analysis, and failure correlation. Our approach is evaluated by comparing the logged data against the results of simulation experiments; our results show that the derived models accurately describe the operational environment, and confirm the great variability of patterns that exist in Cloud computing.