Cluster Analysis in Practice: Dealing with Outliers in Managerial Research




Humberto Elias Garcia Lopes
https://orcid.org/0000-0002-6207-2726
Marlusa de Sevilha Gosling
https://orcid.org/0000-0002-7674-2866

Abstract

Context: in recent years, cluster analysis has encouraged researchers to explore new ways of understanding data behavior, partly because the method is computationally easy and can generate consistent outputs even in small datasets. However, researchers are often mistaken in assuming that anything goes in clustering. The literature shows the opposite: care is needed, especially regarding the effect of outliers on cluster formation. Objective: in this tutorial paper, we contribute to this discussion by presenting four clustering techniques and their respective advantages and disadvantages in the treatment of outliers. Methods: we analyzed a managerial dataset using the k-means, PAM, DBSCAN, and FCM techniques. Results: our analyses indicate that the techniques deal with outliers in distinct ways, so researchers can select among them according to how outliers should be treated. Conclusion: researchers need a more diversified repertoire of clustering techniques, which gives them two relevant empirical alternatives: choosing the most appropriate technique for their research objectives or adopting a multi-method approach.
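To make the abstract's concern about outliers concrete, the following minimal sketch compares the two kinds of cluster center that k-means and PAM rely on: a centroid (the mean of the members) and a medoid (an actual member that minimizes total distance to the others). The data values and the helper name `medoid` are hypothetical and chosen only for illustration; they are not the managerial dataset or the code used in the paper.

```python
from statistics import mean


def medoid(points):
    """Return the member with the smallest total absolute distance to all members."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Hypothetical one-dimensional cluster (illustrative values, not the paper's data).
cluster = [10.0, 11.0, 12.0, 13.0, 14.0]
# The same cluster with one extreme observation appended.
with_outlier = cluster + [100.0]

# k-means style center: the arithmetic mean of the members.
centroid_clean = mean(cluster)
centroid_outlier = mean(with_outlier)

# PAM style center: a representative member of the cluster.
medoid_clean = medoid(cluster)
medoid_outlier = medoid(with_outlier)

print("centroid shift:", centroid_outlier - centroid_clean)
print("medoid shift:  ", medoid_outlier - medoid_clean)
```

In this toy case the single extreme value pulls the centroid far from the bulk of the observations, while the medoid stays among the original members. This is one reason a medoid-based technique may be preferred when outliers are present, although the appropriate choice still depends on the research objectives.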





Article Details

How to Cite
Lopes, H. E. G., & Gosling, M. de S. (2020). Cluster Analysis in Practice: Dealing with Outliers in Managerial Research. Journal of Contemporary Administration, 25(1), e200081. https://doi.org/10.1590/1982-7849rac2021200086
Section
Articles
