Apply Machine Learning Algorithms to Predict Continuous Body Mass Index (BMI) Data

6 min read · Jul 20, 2024

Ernest Bonat, Ph.D., Jesse W. L. Stuart, BSCS

1. Overview
2. Dataset Selection
3. Exploratory Data Analysis
4. Apply Machine Learning Regressor Algorithms
5. Conclusions

1. Overview

Many factors contribute to obesity, and many of them reflect personal choices. Personal habits, genetics, and demographics affect obesity to greater or lesser degrees, and these factors are often recorded as categorical data. Body Mass Index (BMI) is a medically accepted continuous measure of obesity. Since body fat is largely a consequence of personal habits, it should be possible to predict continuous BMI values from the categorical data describing those habits using statistical regression.

This study seeks to determine whether continuous BMI data can be regressed from categorical data, using a publicly available dataset from Kaggle. The data describes weight, height, gender, age, and lifestyle features, along with BMI values. By comparing lifestyle answers, this analysis seeks to predict BMI from personal habits and to determine which habits contribute most to BMI values.

This paper is an example of how to apply Machine Learning (ML) regression algorithms to predict BMI.

2. Dataset Selection

The Obesity Dataset Cleaned and Data Synthetic dataset on Kaggle describes health and lifestyle attributes (food consumption habits, exercise habits, age, gender, height, and weight) for survey participants from Colombia, Mexico, and Peru. The dataset contains 2084 records after removing duplicates and consists of nineteen columns. It is available on Kaggle as Obesity Dataset Cleaned and Data Synthetic.

Table 1: The Complete List of Original Column Names of Obesity Dataset Cleaned and Data Synthetic.

The Obesity Dataset Cleaned and Data Synthetic dataset is based on a dataset from ScienceDirect. For more information, see the ScienceDirect link on the Kaggle webpage.

Table 2: Column Name Meanings and Values.

Column Name Change:

The “family_history_with_overweight” column will be referred to as “FHWO” in this document and in program output. Except where another type is indicated in Table 2, all values are strings; the NCP feature is an integer from 1 through 4.

Table 3: Dropped Columns and Reasons Why.

3. Exploratory Data Analysis

The dataset contains 2111 records with no missing values or NaN entries. Analysis found 27 duplicate records, which were removed, leaving 2084 unique records.
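A minimal sketch of this check in Python, assuming the Kaggle CSV has been saved locally (the file name obesity_cleaned.csv is a placeholder):

```python
import pandas as pd

# Load the dataset (file name is a placeholder; point this at your local copy).
df = pd.read_csv("obesity_cleaned.csv")

# Confirm there are no missing or NaN values.
print(df.isna().sum().sum())   # expected: 0

# Count and drop the duplicate records.
print(df.duplicated().sum())   # expected: 27
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))                 # expected: 2084
```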

This analysis concerns BMI values predicted from categorical data. The categorical features in the dataset mostly describe a person’s chosen behaviors; in addition, the two genetic features, Gender and FHWO, are also categorical. The non-categorical features removed from the dataset before processing are id, Age, Height, Weight, and NObeyesdad. NObeyesdad is a column in the original dataset that classifies the BMI values into seven obesity categories.

BMI was calculated from Height and Weight as BMI = Weight / Height² (with Weight in kilograms and Height in meters). Since BMI is the y variable, it was also removed from X. Removing the non-categorical features leaves thirteen columns: Gender, FHWO, FAVC, FCVC, NCP, CAEC, CALC, CH2O, FAF, TUE, MTRANS, SMOKE, and SCC. These can be further grouped as diet, exercise, habits, and genetics. The Diet group consists of FAVC, FCVC, NCP, CAEC, CALC, and CH2O; the Exercise group of FAF, TUE, and MTRANS; the Habits group of SMOKE and SCC; and the Genetic group of Gender and FHWO.
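A short sketch of this step, continuing from the loading snippet above; the column names follow Table 1, and the formula assumes Weight in kilograms and Height in meters:

```python
# BMI = Weight / Height**2 (Weight in kilograms, Height in meters).
df["BMI"] = df["Weight"] / df["Height"] ** 2

# BMI is the y variable; drop it and the non-categorical features from X.
y = df["BMI"]
X = df.drop(columns=["id", "Age", "Height", "Weight", "NObeyesdad", "BMI"])
```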

Table 4: Features Used in Study Grouped by Type.

Preprocessing was performed on the original dataset. Since all features determined by choice are categorical, these features were converted to integers. Ordinality was preserved, and where possible the greater integer values correspond to greater agreement or effect. For example, the CAEC feature values “no”, “sometimes”, “frequently”, and “always” are converted to 1, 2, 3, and 4, respectively. Conversion from categorical feature values to integers uses a data dictionary.
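A minimal sketch of the dictionary-based conversion; the exact value strings and casing in the raw data are assumptions and should be checked against Table 5:

```python
# Ordinal map for CAEC as described above; the other multi-level
# features (e.g. CALC) follow the same pattern.
caec_map = {"no": 1, "Sometimes": 2, "Frequently": 3, "Always": 4}
X["CAEC"] = X["CAEC"].map(caec_map)

# Binary yes/no features can share a single map.
yes_no_map = {"no": 0, "yes": 1}
for col in ["FAVC", "SMOKE", "SCC", "FHWO"]:
    X[col] = X[col].map(yes_no_map)
```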

Table 5: Categorical Features — Original Values and Numeric Equivalents

4. Apply Machine Learning Regressor Algorithms

Based on LazyRegressor output, this analysis relies on RandomForestRegressor.
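A sketch of the LazyRegressor comparison behind Table 6, assuming the lazypredict package is installed; the split size and random state are illustrative:

```python
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit dozens of candidate regressors and rank them by R² score.
reg = LazyRegressor(verbose=0, ignore_warnings=True)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
print(models)
```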

Table 6: LazyRegressor Output Used for Determining Best Regressor.

Hyperparameter optimization was performed for RandomForestRegressor using both GridSearchCV and RandomizedSearchCV. GridSearchCV selected the best hyperparameters with criterion = ‘poisson’, hereafter referred to as the “Poisson” hyperparameters. RandomizedSearchCV selected the best hyperparameters with criterion = ‘squared_error’, referred to as the “Squared_Error” hyperparameters.
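A sketch of the two searches; the parameter grid below is illustrative, since the exact search space is not listed here:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Illustrative search space only.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "criterion": ["squared_error", "poisson"],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid, cv=5, scoring="r2", n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)    # criterion='poisson' in this study

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid, n_iter=10, cv=5, scoring="r2",
    n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_)  # criterion='squared_error' in this study
```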

Cross validation was performed with 13 folds, resulting in mean CV scores around 0.80 for both criteria, Squared_Error and Poisson. Regression R² scores, adjusted R² scores, mean squared error, and root mean squared error were compared between the two criteria.
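A sketch of the 13-fold cross validation; for brevity only the criterion is set here, whereas the study used the full tuned hyperparameter sets:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# 13-fold CV, matching the fold count reported above.
model = RandomForestRegressor(criterion="squared_error", random_state=42)
cv_scores = cross_val_score(model, X, y, cv=13, scoring="r2")
print(cv_scores.mean())   # ~0.80 for both criteria in this study
```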

Analysis shows R² / mean squared error scores of 0.8075 / 12.2136 for the Squared_Error hyperparameters and 0.8049 / 12.3811 for the Poisson hyperparameters. With the better R² and mean squared error scores, this analysis relies on the Squared_Error criterion hyperparameters; thus, the scatter plots and feature importance plots below use the Squared_Error criterion.
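A sketch of the metric comparison for one criterion; the adjusted R² formula below is the standard one and is an assumption about how it was computed here:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1),
# with n test samples and p features.
n, p = X_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adjusted_r2, mse, rmse)
```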

Table 7: Testing hyperparameters Set 1: Squared_Error.

Table 8: Testing hyperparameters Set 2: Poisson.

Correlation matrices were created using the matplotlib library’s pyplot module, and the correlation matrix images were rendered as false-color heatmap charts using the seaborn library.
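A sketch of the heatmap rendering, joining the target back onto the encoded features to reproduce a chart like Figure B:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# False-color heatmap of the feature correlation matrix.
corr = X.join(y).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix: Categorical Features + BMI")
plt.tight_layout()
plt.show()
```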

Figure A: Correlation Matrix for Full Dataset.

Figure B: Correlation Matrix for Categorical Features + BMI.

The scatter plot of true vs. predicted BMI values (Figure C) indicates that the model, based on categorical features, does indeed predict the continuous BMI values well.

Figure C: Scatter Plot for True vs. Predicted BMI Values.
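A sketch of Figure C, reusing the test predictions from the fitted Squared_Error model above:

```python
# True vs. predicted BMI with an ideal y = x reference line.
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, "r--")
plt.xlabel("True BMI")
plt.ylabel("Predicted BMI")
plt.show()
```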

Figure D: Random Forest Feature Importance.
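A sketch of Figure D, reading the impurity-based importances off the fitted random forest:

```python
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(figsize=(8, 6))
plt.xlabel("Feature Importance")
plt.tight_layout()
plt.show()
```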

Table 9: Alphabetical List of Categorical Features Contributing to BMI.

Table 10: Categorical Features Contributing to BMI by Correlation Value.

Note: Correlation values are listed by decreasing absolute magnitude.

5. Conclusions

1. Based on the RandomForestRegressor R² value of 0.8075 (Table 7) and the scatter plot of true vs. predicted BMI values (Figure C), categorical features representing contributions to a continuous value can indeed produce a strong model for predicting that value.

2. The data shows no single feature dominating the contribution to BMI values. Per both the feature importance plot (Figure D) and the categorical features ranked by correlation value (Table 10), the greatest single contributor is FHWO, a genetic feature (family history with overweight), followed by diet-related features.

3. Per the feature importance plot (Figure D), most of the highest-ranked features relate to diet: CAEC, FCVC, and NCP. The genetic feature FHWO contributes most, and Gender also contributes.

4. Of the top seven correlation values listed in Table 10, five concern diet or dieting: CAEC, FAVC, FCVC, SCC, and CALC. FHWO shows a positive correlation with BMI, indicating that individuals from families with overweight members are likely to have a higher BMI. CAEC, SCC, and FAF show negative correlations with BMI, indicating that an increase in these features correlates with a lower BMI. CALC, CH2O, FAVC, and FCVC also show positive correlations with BMI.


Ernest Bonat, Ph.D.

I’m a Senior Machine Learning Developer. I work on Machine Learning application projects for Life Sciences using Python and the Python Data Ecosystem.