With the rapid development of infrastructure such as power grids, managers in the power industry face an increasingly severe problem of electricity theft. Electricity theft has negative socioeconomic effects, hindering both the stable economic growth of power enterprises and broader social development. Traditional anti-theft measures require officers to check the integrity of kilowatt-hour meters and the correctness of wiring house by house, which consumes enormous manpower and material resources. With the advancement of data collection technology, power enterprises now possess relatively complete databases of power consumption. As a result, mining these existing databases to identify abnormal users has become a hot topic in the field of information technology.
The purpose of this paper is to identify abnormal users with machine learning algorithms, giving power enterprises a means of detecting electricity theft at lower cost. The contributions of this paper are as follows:
1) Propose various features that describe users’ behavior. First, univariate features (mean, difference, coefficient of variation, range, standard deviation), bivariate features (cosine similarity, Pearson product-moment correlation coefficient) and multivariate features are calculated from user power consumption aggregated monthly, seasonally, yearly, per holiday and per workday. Principal component analysis is then used to reduce the dimensionality of the dataset. Finally, the dataset is scaled and oversampled to form the feature dataset.
2) Study various classification algorithms and optimize their hyperparameters, providing models to detect electricity theft. In this paper, the mechanisms behind the support vector machine, the back-propagation neural network, random forest and XGBoost are studied, and the hyperparameters of each are optimized.
3) Compare and evaluate different classifiers, and provide suggestions for real-world application. In this paper, the classifiers are evaluated with several metrics, and the best classifier is recommended for each application scenario based on a real-world dataset.
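To make the features named in contribution 1 concrete, the following is a minimal sketch of some of the univariate statistics and the two pairwise measures, written with the Python standard library only. The function names, the use of the population standard deviation, and the choice of which consumption series are compared are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math
from statistics import mean, pstdev


def univariate_features(series):
    """Summary statistics for one consumption series (e.g. 12 monthly totals)."""
    mu = mean(series)
    sigma = pstdev(series)  # assumption: population standard deviation
    return {
        "mean": mu,
        "range": max(series) - min(series),
        "std": sigma,
        # Coefficient of variation is undefined when the mean is zero.
        "cv": sigma / mu if mu != 0 else float("nan"),
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length consumption series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return float("nan")  # undefined for an all-zero series
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length series."""
    mu_a, mu_b = mean(a), mean(b)
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    var_a = sum((x - mu_a) ** 2 for x in a)
    var_b = sum((y - mu_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_a * var_b)
```

One plausible use, again an assumption, is to compare a user's monthly profile in one year against the same user's profile in the previous year: a sharp drop in similarity or correlation could then serve as one input feature among many for the classifiers.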