Cluster Analysis in Practice: Dealing with Outliers in Managerial Research




Humberto Elias Garcia Lopes
Marlusa de Sevilha Gosling

Abstract

Context: in recent years, cluster analysis has encouraged researchers to explore new ways of understanding how data behave. The method's computational ease and its ability to produce consistent results, even with small datasets, partly explain this. However, researchers often err in assuming that clustering is a territory where anything goes. The literature shows the opposite: they must be careful, especially about the effect of outliers on cluster formation. Objective: in this tutorial article, we contribute to this discussion by presenting four clustering techniques, with their respective advantages and disadvantages in handling outliers. Methods: to that end, we work with a managerial dataset and analyze it using the k-means, PAM, DBSCAN, and FCM techniques. Results: our analyses indicate that researchers have different clustering techniques at their disposal to treat outliers appropriately. Conclusion: we conclude that researchers need a more diverse repertoire of clustering techniques. This would give them two relevant empirical alternatives: choosing the technique best suited to their research objectives or adopting a multimethod approach.
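The abstract's concern about outliers can be illustrated with a minimal sketch. It is not taken from the article: the data points are hypothetical, and the code does not run the full k-means or PAM algorithms. It only compares the two kinds of cluster center they rely on. A k-means center is a coordinate-wise mean, whereas a PAM center (medoid) is an observed point that minimizes the total distance to the other points.

```python
import math
from statistics import mean


def centroid(points):
    """Coordinate-wise mean: the kind of center k-means uses."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def medoid(points):
    """Observed point with the smallest total distance to all points:
    the kind of center PAM uses."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)


# Hypothetical group of five observations on two standardized indicators.
group = [(1.0, 1.0), (0.8, 1.0), (1.2, 1.0), (1.0, 0.8), (1.0, 1.2)]
# Hypothetical extreme observation added to the same group.
outlier = (10.0, 10.0)

clean_centroid = centroid(group)
shifted_centroid = centroid(group + [outlier])
clean_medoid = medoid(group)
shifted_medoid = medoid(group + [outlier])

# How far each center moves once the outlier is included.
centroid_shift = math.dist(clean_centroid, shifted_centroid)
medoid_shift = math.dist(clean_medoid, shifted_medoid)

print(f"centroid moved by {centroid_shift:.2f}")
print(f"medoid moved by {medoid_shift:.2f}")
```

In this toy case the mean is dragged toward the extreme observation while the medoid stays on the same observed point. This is one reason a medoid-based technique may be less sensitive to outliers than a mean-based one.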





Article Details

How to Cite
Lopes, H. E. G., & Gosling, M. de S. (2020). Análise de Clusters na Prática: Lidando com Outliers na Pesquisa Gerencial. Revista De Administração Contemporânea, 25(1), e200081. https://doi.org/10.1590/1982-7849rac2021200086
Section
Articles
