Clustering in the presence of data errors with applications to sales forecasts in retail merchandising
(Research Seminar, October 9th, 2003)
Nitin Patel
MIT-Sloan and Cytel Software
Abstract
Many methods exist for clustering objects into groups that consist of similar objects based on measurements of attributes of the objects. These methods have been applied in diverse disciplines that include biology, medicine, astronomy, engineering, marketing and finance. Virtually all the work in clustering has assumed that the data has no measurement error. In the presence of such errors, popular clustering methods, like k-means and hierarchical clustering, may perform poorly. The fundamental question that this talk addresses is: "What are appropriate clustering methods in the presence of measurement errors?"
The talk is divided into three parts. In the first part, I will motivate the importance of recognizing error in a clustering algorithm using several artificial examples. In the second part, I will describe error-based clustering algorithms that we have developed for data with errors that follow the multivariate normal distribution. These algorithms are generalizations of the k-means and Ward's hierarchical clustering methods. The algorithms have the important property of being scale-invariant, so that the clustering results are independent of the measurement units. In the third part, I will focus on an application of error-based clustering to sales forecasting in retail merchandising, where it outperforms the k-means and hierarchical clustering methods. I will conclude with a brief description of current work in applying these ideas to improving forecasting in popular statistical models such as multiple linear regression, generalized linear regression, auto-regressive and moving average time series models, and repeated measures models.
The work I will be reporting was done jointly with Mahesh Kumar, a doctoral student at the Operations Research Center, MIT, and Professor James B. Orlin at MIT-Sloan.
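The abstract does not spell out the algorithms, so the following is only a minimal sketch of how a k-means-style method might take measurement error into account, and it should not be read as the method presented in the talk. It assumes a simplified setting: each measurement has independent normal errors with a known variance per coordinate (a diagonal error covariance). Under that assumption, distances are scaled by the error variances and cluster centers are inverse-variance weighted means. The function name `error_weighted_kmeans` and its parameters are hypothetical.

```python
def error_weighted_kmeans(points, variances, init_centers, n_iter=20):
    """Illustrative k-means variant that weights by known error variances.

    points[i][j]    : measured value of attribute j for object i
    variances[i][j] : assumed known error variance of that measurement
    init_centers    : starting centers, one list of coordinates per cluster
    """
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(n_iter):
        # Assignment step: squared distance with each coordinate
        # divided by the error variance of the point being assigned.
        new_labels = []
        for x, v in zip(points, variances):
            dists = [
                sum((xj - cj) ** 2 / vj for xj, cj, vj in zip(x, c, v))
                for c in centers
            ]
            new_labels.append(dists.index(min(dists)))

        # Update step: inverse-variance weighted mean per coordinate,
        # so precise measurements pull the center harder than noisy ones.
        for k in range(len(centers)):
            members = [i for i, lab in enumerate(new_labels) if lab == k]
            if not members:
                continue  # leave the center of an empty cluster unchanged
            for j in range(len(centers[k])):
                weight_sum = sum(1.0 / variances[i][j] for i in members)
                centers[k][j] = (
                    sum(points[i][j] / variances[i][j] for i in members)
                    / weight_sum
                )

        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centers


# Usage: four one-dimensional measurements with unequal error variances.
points = [[0.0], [3.0], [10.0], [13.0]]
variances = [[1.0], [2.0], [1.0], [2.0]]
labels, centers = error_weighted_kmeans(points, variances, [[0.0], [10.0]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[1.0], [11.0]]
```

In this sketch, changing the measurement unit multiplies each value by a constant s and each variance by s squared, so every variance-scaled distance is unchanged and the cluster assignments stay the same. This mirrors the scale-invariance property mentioned in the abstract, at least for this simplified diagonal case.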