M. Alcañiz, L. Ramon, M. Santolino
strategies resulted in a significant reduction in computing
time. When predicting alcohol impairment, the area of
control (built-up or not), hour of day and driver’s age were
the most relevant variables for classification.
Keywords: Imbalanced data, positive, drunk driving, police,
checkpoint, machine learning.
Tree-based models have attracted increasing attention from
researchers in recent years; however, analyses of how such models
perform when the classes are very unequally distributed
are scarce. This is particularly true of binary data where one class
includes the majority of cases and the other represents just a small
portion. Imbalanced datasets of this kind are very common in
fields such as medical diagnosis, online advertising, fraud detection,
network intrusion and road safety.
Many classification algorithms perform well on balanced datasets;
yet, when they are applied to imbalanced data, model fitting may be
biased towards the majority class. As a result, the model may predict
the minority class poorly, even though it is usually the class of
greatest interest. Kumar and Sheshadri
, He and
review problems of class imbalance and
alternative solutions. Here, the performance of two strategies for
dealing with imbalanced data, namely sampling and cost-sensitive
methods, is compared, and the interpretability of their respective
results is discussed.
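The two strategies can be illustrated with a minimal sketch. It is not the implementation used in this study: the toy data (95 negative and 5 positive cases) and the helper names `random_undersample`, `balanced_class_weights` and `recall` are assumptions introduced purely for illustration. The sketch shows why accuracy is misleading under imbalance, how random undersampling rebalances the classes, and how class weights inversely proportional to class frequency (one common cost-sensitive heuristic) equalise the total weight of each class.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Sampling strategy: draw from every class a random subset whose
    size equals that of the smallest class."""
    rng = random.Random(seed)
    rows_by_class = {}
    for xi, yi in zip(X, y):
        rows_by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in rows_by_class.values())
    X_bal, y_bal = [], []
    for label, rows in rows_by_class.items():
        for xi in rng.sample(rows, n_min):
            X_bal.append(xi)
            y_bal.append(label)
    return X_bal, y_bal


def balanced_class_weights(y):
    """Cost-sensitive strategy: weight class c by n / (k * n_c), so that
    errors on rare classes cost more during model fitting."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def recall(y_true, y_pred, positive=1):
    """Share of truly positive cases that are predicted as positive."""
    n_pos = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if t == positive and p == positive)
    return hits / n_pos if n_pos else 0.0


# Hypothetical toy data: 95 negatives (0) and 5 positives (1).
y = [0] * 95 + [1] * 5
X = [[i] for i in range(len(y))]

# A classifier that always predicts the majority class looks accurate
# but never detects a single positive case.
majority_pred = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, majority_pred)) / len(y)
minority_recall = recall(y, majority_pred)

X_bal, y_bal = random_undersample(X, y)
weights = balanced_class_weights(y)

print(accuracy, minority_recall)   # accuracy vs. recall of the trivial rule
print(Counter(y_bal))              # class counts after undersampling
print(weights)                     # per-class misclassification weights
```

Undersampling discards majority-class cases and so reduces the amount of data to be processed, whereas weighting keeps every observation and changes only the cost of misclassifying each class.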
Specifically, we illustrate the performance and features of
tree-based models by applying them to the classification of
alcohol-impaired drivers in Catalonia (Spain).
for breath alcohol content (BrAC) over the legal limits, highly
imbalanced results are obtained –clearly, most drivers are not