
Boletín de Estadística e Investigación Operativa

Vol. 33, No. 3, November 2017, pp. 189-222

Estadística

A comparative analysis of tree-based models classifying imbalanced breath alcohol data

Manuela Alcañiz and Miguel Santolino

Department of Econometrics

University of Barcelona

malcaniz@ub.edu, msantolino@ub.edu

Lluís Ramon

Data Scientist

Digital Origin

lramon@digitalorigin.com

Abstract

When applied to binary data, most classification algorithms behave well provided the dataset is balanced. However, when a single class includes the majority of cases, good predictive performance for the minority class is not easy to achieve. We examine the strengths and weaknesses of three tree-based models when dealing with imbalanced data. We also explore sampling and cost-sensitive methods as strategies for improving machine learning algorithms. An application to a large dataset of breath alcohol content tests performed in Catalonia (Spain) to detect drunk drivers is shown. The Random Forest method proved to be the model of choice if high performance is required, while down-sampling

© 2017 SEIO