Using decision trees to compare datasets

Supervised machine learning is an incredibly useful technique for making predictions from data. When dealing with a large dataset with many variables, it can also provide significant insight into which variables matter most for the prediction being made.

In this example, I’m using data from a single-particle mass spectrometer to provide information about the chemistry of different types of mineral-rich particles. Using the SPIN ice cloud chamber that I’ve worked on developing, I can generate a complementary data set for the ice nucleation efficiency of these particles, which I classify as “good,” “moderate,” or “bad” ice nuclei at a given temperature.

I use the chemical data from the mass spectrometer as my independent variables or “predictors” and the ice nucleation efficiency classes as my dependent variables or “response.” I supply bootstrap samples of these data to train an aggregation of decision trees (i.e. bagged decision trees). Because each tree is trained on a bootstrap sample, a fraction of the observations are left out of that tree’s training data, and these “out-of-bag” observations give an estimate of the model’s classification error. Permuting the values of a given predictor in the out-of-bag data and measuring how much the error grows then provides a basis for ranking the predictors by their importance.
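As a rough illustration of this workflow, here is a minimal sketch in Python using scikit-learn. The array names X and y, the dataset shapes, and the specific choices (RandomForestClassifier with max_features=None to reduce it to plain bagged trees, permutation importance computed on a held-out split rather than on the out-of-bag samples) are assumptions made for the example; the actual analysis may have used a different toolbox and settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical placeholders: in the real analysis, X would hold one row per
# particle with the mass-spectral predictor variables, and y would hold the
# ice nucleation class ("good", "moderate", or "bad") from the SPIN data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.choice(["good", "moderate", "bad"], size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagged decision trees: every tree is fit to a bootstrap sample of the
# training data. max_features=None makes each split consider all predictors,
# i.e. plain bagging rather than a random forest. oob_score=True reports the
# accuracy measured on each tree's out-of-bag observations.
model = RandomForestClassifier(
    n_estimators=200,
    max_features=None,
    oob_score=True,
    random_state=0,
)
model.fit(X_train, y_train)
print(f"Out-of-bag error: {1.0 - model.oob_score_:.3f}")

# Rank predictors by how much shuffling each one degrades the predictions.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Predictors ranked by importance:", ranking)
```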

Using the predictors that have been determined to be important, I can run smaller, more efficient models with only these predictors included; a sketch of this step follows the figure below. Plotted below is the classification error vs. the number of predictors included in one example. The order in which predictors are added is determined by the importance ranking from the full model with all predictors. The final error of the full model is shown in red.

[Figure: classification error vs. number of predictors included; the full-model error is marked in red.]

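A sketch of that pruning loop, continuing from the snippet above (X, y, and ranking carry over). Using the out-of-bag error of each reduced model is an assumption made here; the plotted curve may be based on a different error metric.

```python
from sklearn.ensemble import RandomForestClassifier

def error_vs_num_predictors(X, y, ranking):
    """Out-of-bag classification error as top-ranked predictors are added one at a time."""
    errors = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]  # keep only the k most important predictors
        reduced = RandomForestClassifier(
            n_estimators=200, max_features=None, oob_score=True, random_state=0
        )
        reduced.fit(X[:, cols], y)
        errors.append(1.0 - reduced.oob_score_)  # OOB error of the reduced model
    return errors

# Classification error as a function of how many top-ranked predictors are included.
errors = error_vs_num_predictors(X, y, ranking)
```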
Acknowledgements: Maria Zawadowicz, for providing the mass spectrometry data from the Particle Analysis by Laser Mass Spectrometry (PALMS) instrument, and Dan Cziczo, my PhD advisor.