Abstract:This paper introduces RobotCycle, a novel ongoing project that leverages Autonomous Vehicle (AV) research to investigate how cycling infrastructure influences cyclist behaviour and safety during real-world journeys. The project's requirements were defined in collaboration with key stakeholders (i.e. city planners, cyclists, and policymakers), informing the design of risk and safety metrics and the data collection criteria. We propose a data-driven approach relying on a novel, rich dataset of diverse traffic scenes captured through a custom-designed wearable sensing unit. We extract road-user trajectories and analyse deviations suggesting risk or potentially hazardous interactions in correlation with infrastructural elements in the environment. Driving profiles and trajectory patterns are associated with local road segments, driving conditions, and road-user interactions to predict traffic behaviour and identify critical scenarios. Moreover, leveraging advancements in AV research, the project extracts detailed 3D maps, traffic flow patterns, and trajectory models to provide an in-depth assessment and analysis of the behaviour of all traffic agents. This data can then inform the design of cyclist-friendly road infrastructure, improving road safety and cyclability, as it provides valuable insights for enhancing cyclist protection and promoting sustainable urban mobility.
Abstract:We study the often overlooked phenomenon, first noted in \cite{breiman2001random}, that random forests appear to reduce bias compared to bagging. Motivated by an interesting paper by \cite{mentch2020randomization}, where the authors argue that random forests reduce effective degrees of freedom and only outperform bagging ensembles in low signal-to-noise ratio (SNR) settings, we explore how random forests can uncover patterns in the data missed by bagging. We empirically demonstrate that in the presence of such patterns, random forests reduce bias along with variance and increasingly outperform bagging ensembles when SNR is high. Our observations offer insights into the real-world success of random forests across a range of SNRs and enhance our understanding of the difference between random forests and bagging ensembles with respect to the randomization injected into each split. Our investigations also yield practical insights into the importance of tuning $mtry$ in random forests.
Abstract:We present FAST, an optimization framework for fast additive segmentation. FAST segments piecewise constant shape functions for each feature in a dataset to produce transparent additive models. The framework leverages a novel optimization procedure to fit these models $\sim$2 orders of magnitude faster than existing state-of-the-art methods, such as explainable boosting machines \citep{nori2019interpretml}. We also develop new feature selection algorithms in the FAST framework to fit parsimonious models that perform well. Through experiments and case studies, we show that FAST improves the computational efficiency and interpretability of additive models.
Abstract:We present FIRE, Fast Interpretable Rule Extraction, an optimization-based framework to extract a small but useful collection of decision rules from tree ensembles. FIRE selects sparse representative subsets of rules from tree ensembles, that are easy for a practitioner to examine. To further enhance the interpretability of the extracted model, FIRE encourages fusing rules during selection, so that many of the selected decision rules share common antecedents. The optimization framework utilizes a fusion regularization penalty to accomplish this, along with a non-convex sparsity-inducing penalty to aggressively select rules. Optimization problems in FIRE pose a challenge to off-the-shelf solvers due to problem scale and the non-convexity of the penalties. To address this, making use of problem-structure, we develop a specialized solver based on block coordinate descent principles; our solver performs up to 40x faster than existing solvers. We show in our experiments that FIRE outperforms state-of-the-art rule ensemble algorithms at building sparse rule sets, and can deliver more interpretable models compared to existing methods.
Abstract:Existing data collection methods for traffic operations and control usually rely on infrastructure-based loop detectors or probe vehicle trajectories. Connected and automated vehicles (CAVs) not only can report data about themselves but also can provide the status of all detected surrounding vehicles. Integration of perception data from multiple CAVs as well as infrastructure sensors (e.g., LiDAR) can provide richer information even under a very low penetration rate. This paper aims to develop a cooperative data collection system, which integrates Lidar point cloud data from both infrastructure and CAVs to create a cooperative perception environment for various transportation applications. The state-of-the-art 3D detection models are applied to detect vehicles in the merged point cloud. We test the proposed cooperative perception environment with the max pressure adaptive signal control model in a co-simulation platform with CARLA and SUMO. Results show that very low penetration rates of CAV plus an infrastructure sensor are sufficient to achieve comparable performance with 30% or higher penetration rates of connected vehicles (CV). We also show the equivalent CV penetration rate (E-CVPR) under different CAV penetration rates to demonstrate the data collection efficiency of the cooperative perception environment.
Abstract:ControlBurn is a Python package to construct feature-sparse tree ensembles that support nonlinear feature selection and interpretable machine learning. The algorithms in this package first build large tree ensembles that prioritize basis functions with few features and then select a feature-sparse subset of these basis functions using a weighted lasso optimization criterion. The package includes visualizations to analyze the features selected by the ensemble and their impact on predictions. Hence ControlBurn offers the accuracy and flexibility of tree-ensemble models and the interpretability of sparse generalized additive models. ControlBurn is scalable and flexible: for example, it can use warm-start continuation to compute the regularization path (prediction error for any number of selected features) for a dataset with tens of thousands of samples and hundreds of features in seconds. For larger datasets, the runtime scales linearly in the number of samples and features (up to a log factor), and the package support acceleration using sketching. Moreover, the ControlBurn framework accommodates feature costs, feature groupings, and $\ell_0$-based regularizers. The package is user-friendly and open-source: its documentation and source code appear on https://pypi.org/project/ControlBurn/ and https://github.com/udellgroup/controlburn/.
Abstract:Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
Abstract:Tree ensembles distribute feature importance evenly amongst groups of correlated features. The average feature ranking of the correlated group is suppressed, which reduces interpretability and complicates feature selection. In this paper we present ControlBurn, a feature selection algorithm that uses a weighted LASSO-based feature selection method to prune unnecessary features from tree ensembles, just as low-intensity fire reduces overgrown vegetation. Like the linear LASSO, ControlBurn assigns all the feature importance of a correlated group of features to a single feature. Moreover, the algorithm is efficient and only requires a single training iteration to run, unlike iterative wrapper-based feature selection methods. We show that ControlBurn performs substantially better than feature selection methods with comparable computational costs on datasets with correlated features.
Abstract:Model interpretations are often used in practice to extract real world insights from machine learning models. These interpretations have a wide range of applications; they can be presented as business recommendations or used to evaluate model bias. It is vital for a data scientist to choose trustworthy interpretations to drive real world impact. Doing so requires an understanding of how the accuracy of a model impacts the quality of standard interpretation tools. In this paper, we will explore how a model's predictive accuracy affects interpretation quality. We propose two metrics to quantify the quality of an interpretation and design an experiment to test how these metrics vary with model accuracy. We find that for datasets that can be modeled accurately by a variety of methods, simpler methods yield higher quality interpretations. We also identify which interpretation method works the best for lower levels of model accuracy.
Abstract:Deep learning based methods have been dominating the text recognition tasks in different and multilingual scenarios. The offline handwritten Chinese text recognition (HCTR) is one of the most challenging tasks because it involves thousands of characters, variant writing styles and complex data collection process. Recently, the recurrent-free architectures for text recognition appears to be competitive as its highly parallelism and comparable results. In this paper, we build the models using only the convolutional neural networks and use CTC as the loss function. To reduce the overfitting, we apply dropout after each max-pooling layer and with extreme high rate on the last one before the linear layer. The CASIA-HWDB database is selected to tune and evaluate the proposed models. With the existing text samples as templates, we randomly choose isolated character samples to synthesis more text samples for training. We finally achieve 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language model correction.