Volume 4, Issue 1
Statistical Learning in Imbalanced Data Classification
Classifying imbalanced data is a persistent problem in statistical learning. Because the classes are unevenly distributed, some classes contain only a small number of samples and are poorly represented. Combined with the inherent limitations of standard evaluation metrics, this poses a significant challenge to statistical learning. Several families of methods address it. Data-level methods include resampling, which redraws the training samples, and feature selection, which retains only the key features. Algorithm-level methods include cost-sensitive learning, which assigns different costs to different kinds of error, and ensemble learning, which combines multiple models. Hybrid methods combine data-level and algorithm-level approaches. All of these can mitigate the imbalanced classification problem to a certain extent. In practice, several matters still require careful consideration: understanding the characteristics of the data, choosing appropriate evaluation metrics, validating the model rigorously, and keeping the model interpretable.
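As a rough illustration of the ideas above, the following minimal Python sketch shows one data-level step (random oversampling), one algorithm-level ingredient (class weights for a cost-sensitive loss), and the gap between plain accuracy and a class-balanced metric. The function names, the toy 95/5 dataset, and the choice of balanced accuracy are assumptions made for this example and are not taken from the paper. The weight formula n / (k * n_class) is the common "balanced" heuristic, used here only as one possible choice.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Data-level sketch: duplicate minority rows at random until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Algorithm-level sketch: weight each class by n / (k * n_class), so
    errors on rare classes cost more in a weighted loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; unlike plain accuracy, it does not
    reward a model for ignoring the minority class."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 95 majority (0) and 5 minority (1) samples.
labels = [0] * 95 + [1] * 5
rows = [[float(i)] for i in range(100)]

# A classifier that always predicts the majority class.
always_majority = [0] * 100
print(accuracy(labels, always_majority))           # 0.95
print(balanced_accuracy(labels, always_majority))  # 0.5
print(balanced_class_weights(labels))              # minority weight is 10.0
print(Counter(random_oversample(rows, labels)[1]))  # 95 rows per class
```

On this toy data, a model that never predicts the minority class still scores 0.95 on accuracy but only 0.5 on balanced accuracy, which is one way to see why the choice of evaluation metric matters. Random oversampling is the simplest resampling scheme; synthetic approaches and undersampling would replace `random_oversample` in the same pipeline position.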