

M. Alcañiz, L. Ramon, M. Santolino

strategies resulted in a significant reduction in computing

time. When predicting alcohol impairment, the area of

control (built-up or not), hour of day and driver’s age were

the most relevant variables for classification.

Keywords: Imbalanced data, positive, drunk driving, police, checkpoint, machine learning.

1. Introduction

Tree-based models have attracted the increasing attention of

researchers in recent years; however, analyses of the use of such

models when there is a highly unequal distribution between classes

are scarce. This is particularly true of binary data where one class

includes the majority of cases and the other represents just a small

portion. Imbalanced datasets of this kind are very common in

disciplines such as medical diagnosis, online advertising, fraud detection,

network intrusion and road safety.

Many classification algorithms behave well for balanced datasets;

yet, when applied to imbalanced data, model fitting may be biased

towards the majority class. As a result, the model may provide a

poor predictive performance for the minority class, which is usually

the most interesting one. Kumar and Sheshadri [20], He and Garcia [16]

and Chawla [9] review problems of class imbalance and

alternative solutions. Here, the performance of two strategies for

dealing with imbalanced data, namely sampling and cost-sensitive

methods, is compared, and the interpretability of their respective

results is discussed.
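
To make the two families of strategies concrete, the following minimal Python sketch contrasts them on toy data. It is an illustration only, not the procedure used in this paper: the 5% positive rate, the function names and the inverse-frequency weighting rule are all assumptions chosen for the example. It first shows why a model biased towards the majority class can still look accurate, then balances the data by random undersampling (a sampling method) and, alternatively, derives per-class misclassification weights (a cost-sensitive method).

```python
import random
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def random_undersample(rows, labels, seed=0):
    """Sampling strategy (illustrative): keep every minority-class row and an
    equally sized random subset of each larger class, so classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    new_rows, new_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels


def balanced_class_weights(labels):
    """Cost-sensitive strategy (illustrative): weight each class inversely to
    its frequency, w_c = n / (k * n_c), so errors on the rare class cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical toy data: 1000 observations, 5% positive (label 1).
labels = [1] * 50 + [0] * 950
rows = list(range(len(labels)))  # row indices stand in for feature vectors

baseline = majority_baseline_accuracy(labels)            # high despite missing every positive
under_rows, under_labels = random_undersample(rows, labels)
weights = balanced_class_weights(labels)
```

In this toy setting, undersampling discards most majority-class rows before fitting, whereas the weighting approach keeps all rows and instead passes the weights to a learner that supports per-class costs; which trade-off is preferable depends on the data and the model.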

Specifically, we illustrate the performance and features of

tree-based models by applying them to the classification of

alcohol-impaired drivers in Catalonia (Spain).

When testing for breath alcohol content (BrAC) over the legal limits, highly

imbalanced results are obtained –clearly, most drivers are not