Using Linear Regression to Predict Medical Expenses (R)

Gabriel Campregher
Mar 21, 2022

1. Introduction

The goal here is straightforward: to use a multiple linear regression model to predict medical insurance charges. The data was taken from the book Machine Learning with R by Brett Lantz. It is simulated on the basis of demographic statistics from the US Census Bureau, so it reflects, to some degree, the reality of the United States health insurance industry. The data set is not very large: it contains 1,336 observations and 7 variables, namely the six predictors described below plus the charges themselves. If you want to explore the data on your own, there is a link at the end of this article.

Multiple Linear Regression Equation
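
The equation from the original figure is not reproduced here. As a reminder, the standard textbook form of a multiple linear regression with k explanatory variables is written below; the symbols (y, x, β, ε) are conventional notation and are an assumption about what the figure showed.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i
```

Here y_i is the dependent variable (the charges), the x terms are the explanatory variables, the β terms are the coefficients to be estimated, and ε_i is the error term.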

1.1 Variables

Age : indicates the age of the main beneficiary. The maximum value is 64 years, since people older than that are usually covered by the government.

Sex : indicates the gender of the policyholder.

BMI : the body mass index, which tells us whether a person is over or under an ideal weight relative to their height. The index is calculated by dividing the weight (in kilograms) by the height (in meters) squared. Roughly, an individual should have a BMI between 18.5 and 24.9. A person with a BMI of 30 or above is considered obese.

Children : number of children or dependents covered by health insurance.

Smoker : indicates if the beneficiary smokes tobacco regularly.

Region : indicates the policyholder's place of residence in the USA. This variable is divided into four geographical categories: northeast, southeast, southwest, and northwest.
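
The BMI definition in the variable list above can be sketched in a few lines of R. This is only an illustration: the function names bmi and bmi_category and the example person are hypothetical, and the cut-offs are the ones quoted in the text (18.5 to 24.9 as the reference range, 30 or more as obese).

```r
# BMI = weight in kilograms divided by height in meters squared
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Label a BMI value using the thresholds mentioned in the article
# (hypothetical helper, not part of the data set)
bmi_category <- function(x) {
  ifelse(x >= 30, "obese",
         ifelse(x >= 18.5 & x < 25, "reference range", "other"))
}

# Hypothetical person: 95 kg, 1.75 m
example_bmi      <- bmi(weight_kg = 95, height_m = 1.75)
example_category <- bmi_category(example_bmi)
```

Both functions are vectorised, so they can be applied to a whole column of a data frame at once.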

1.2 Model and Prediction

The starting-point equation is the one below. It contains all the variables in their default form. Before making any changes to the model, it is necessary to conduct an exploratory analysis to understand the relationship between the charges and the independent variables. After that, we are going to test different models in order to find the one with the best possible fit.

First Model
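
The figure with the first model is not reproduced here. Going by the description above (all variables in their default form), it presumably looks like the sketch below. The β symbols and the error term ε are conventional notation assumed for illustration, and region, which has four categories, would enter a fitted model as three dummy variables rather than a single term.

```latex
\text{charges} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex} + \beta_3\,\text{bmi}
               + \beta_4\,\text{children} + \beta_5\,\text{smoker} + \beta_6\,\text{region} + \varepsilon
```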

2. Exploratory Analysis

Smoker: As we can see, smokers have paid considerably higher charges. The average premium paid by non-smokers is 8,434 dollars, roughly a quarter of the average paid by smokers. Another interesting point is that 45% of the policyholders who smoke have paid more than the highest charge paid by any non-smoker (36,911 US$).
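
The two comparisons in the smoker paragraph (group averages, and the share of smokers who paid more than the most expensive non-smoker) can be computed along these lines. The data frame toy is invented purely for illustration and does not come from the insurance data, so its numbers will not match the figures quoted above.

```r
# Toy data, invented for illustration only (not the real insurance data)
toy <- data.frame(
  smoker  = c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  charges = c(42000, 38000, 21000, 17000, 3000, 5000, 8000, 12000, 9000, 30000)
)

# Average charges per group
group_means <- tapply(toy$charges, toy$smoker, mean)

# Highest charge paid by any non-smoker
max_non_smoker <- max(toy$charges[toy$smoker == "no"])

# Share of smokers who paid more than that maximum
share_above <- mean(toy$charges[toy$smoker == "yes"] > max_non_smoker)
```

With the real data, the same three lines would be applied to the smoker and charges columns of the insurance data frame.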

Region: Region does not seem to help explain the dependent variable. Each region contains about 24% of the observations, except for the southeast, which has 27% of all beneficiaries.

Sex: The gender of the policyholder doesn't appear to have much influence either, although the male group seems to have a higher variance in the third and fourth quartiles.

Age: As age goes up, charges tend to rise as well.

BMI: Many points lie above the 30,000-dollar line once the BMI reaches 30 or more. All of those people are considered obese. In fact, 52% of the observations have a BMI greater than or equal to 30.

3. Linear Regression Model

3.1 Model A

Model A

It is the simplest one, since it contains all the basic variables of the data set. The null hypothesis of the individual significance test (t-test) was not rejected in two cases, sex and region. That means neither variable is statistically significant in explaining the charges, so I removed both from the next model. The R squared was 74.67%.
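
A model of this kind can be fitted in R with lm(), and the t-tests and R squared read off summary(). The sketch below is not the original analysis: it runs on simulated data (the data frame sim and the numbers used to generate it are invented), with column names chosen to mirror the variables described in section 1.1.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the insurance data (assumption, not the real file)
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)

# Invented data-generating process: sex and region have no effect by construction
sim$charges <- 250 * sim$age + 300 * sim$bmi +
  20000 * (sim$smoker == "yes") + rnorm(n, sd = 5000)

# Model with all variables in their default form
model_a <- lm(charges ~ age + sex + bmi + children + smoker + region, data = sim)

# One row per coefficient: estimate, standard error, t value and p-value
coef_table <- summary(model_a)$coefficients
r_squared  <- summary(model_a)$r.squared
```

A large p-value in the last column of coef_table corresponds to not rejecting the null hypothesis that the coefficient is zero, which is the criterion used above to drop sex and region.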

3.2 Model B

Model B

Besides removing sex and region from the equation, I transformed a quantitative variable (bmi) into a binary one. The new variable (bmi30) divides the data into two groups, “obese” and “not obese”. After these two changes, we got a slightly better R squared (74.91%).
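
Creating such an indicator is a one-liner in R. The bmi values below are made up for illustration, and the cut-off of 30 is the obesity threshold mentioned in the exploratory analysis.

```r
# Made-up BMI values for illustration
bmi <- c(22.5, 29.9, 30.0, 35.2)

# 1 = obese (BMI of 30 or more), 0 = not obese
bmi30 <- ifelse(bmi >= 30, 1, 0)
```

On the real data the same expression would be applied to the bmi column and stored as a new column before refitting the model.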

3.3 Model C

Model C

For the third model, I added an interaction between two independent variables, in order to capture the harmful effect the two can have when combined: an obese person who smokes. The interaction was introduced into the model simply by multiplying “smoker” and “bmi30”. The R squared of Model C is 86%.
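
The multiplication described above can be done by hand or left to R's formula syntax, where smoker * bmi30 expands to both main effects plus their product. The sketch below checks that the two routes give the same coefficients. It uses simulated data with invented effect sizes, codes smoker as 0/1 for simplicity, and assumes the remaining predictors are age and children; the exact specification of the original Model C is not shown in the text.

```r
set.seed(2)
n <- 300

# Simulated data (assumption, not the real insurance file)
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  children = sample(0:5, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  smoker   = rbinom(n, 1, 0.2)            # 1 = smoker, 0 = non-smoker
)
sim$bmi30 <- ifelse(sim$bmi >= 30, 1, 0)

# Invented process with an extra cost for obese smokers
sim$charges <- 2000 + 250 * sim$age + 500 * sim$children +
  13000 * sim$smoker + 20000 * sim$smoker * sim$bmi30 + rnorm(n, sd = 4000)

# Route 1: build the product column explicitly
sim$smoker_bmi30 <- sim$smoker * sim$bmi30
fit_product <- lm(charges ~ age + children + smoker + bmi30 + smoker_bmi30, data = sim)

# Route 2: let the formula create the interaction term
fit_formula <- lm(charges ~ age + children + smoker * bmi30, data = sim)

same_coefficients <- isTRUE(all.equal(unname(coef(fit_product)),
                                      unname(coef(fit_formula))))
```

The coefficient on the product term measures the additional charge for someone who is both obese and a smoker, beyond the sum of the two separate effects.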

3.4 Comparing the Models

The third model proves to be the best of the three:

Model A: R squared 74.67%, Root Mean Squared Error 5,517

Model B: R squared 74.91%, Root Mean Squared Error 5,520

Model C: R squared 86%, Root Mean Squared Error 4,196
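
For reference, the two metrics in this comparison can be computed from actual and predicted values as in the sketch below. The helper names rmse and r_squared and the four example values are hypothetical; the article does not say whether its figures were computed on the full data or on a held-out sample.

```r
# Root Mean Squared Error: typical size of a prediction error, in dollars
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R squared: share of the variance in the charges explained by the model
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up values for illustration
actual    <- c(1000, 2000, 3000, 4000)
predicted <- c(1100, 1900, 3200, 3800)

example_rmse <- rmse(actual, predicted)
example_r2   <- r_squared(actual, predicted)
```

A lower RMSE and a higher R squared both point to a better fit, which is why Model C stands out in the list above.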

4. Example of Prediction

Model C — Prediction vs. Real Values

Imagine a woman who is 23 years old, has no kids, is obese, and smokes. According to the linear regression model, her annual health insurance expenses would be, on average, 37,008 dollars.
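
A prediction like this one is obtained by passing the person's characteristics to predict(). The sketch below shows the mechanics on simulated data with invented coefficients, so its result will not reproduce the 37,008 dollars above. The model formula (age, children, smoker, bmi30 and their interaction) is an assumption about Model C, and smoker is coded as 0/1 for simplicity.

```r
set.seed(3)
n <- 300

# Simulated data (assumption, not the real insurance file)
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  children = sample(0:5, n, replace = TRUE),
  smoker   = rbinom(n, 1, 0.2),   # 1 = smoker
  bmi30    = rbinom(n, 1, 0.5)    # 1 = obese
)
sim$charges <- 2000 + 250 * sim$age + 500 * sim$children +
  13000 * sim$smoker + 20000 * sim$smoker * sim$bmi30 + rnorm(n, sd = 4000)

fit <- lm(charges ~ age + children + smoker * bmi30, data = sim)

# The person from the example: 23 years old, no children, obese, smoker
new_person <- data.frame(age = 23, children = 0, smoker = 1, bmi30 = 1)
predicted  <- predict(fit, newdata = new_person)

# Same result by plugging the values into the fitted equation by hand:
# intercept, age, children, smoker, bmi30, smoker x bmi30
by_hand <- sum(coef(fit) * c(1, 23, 0, 1, 1, 1))
```

The by_hand line makes explicit that a prediction is simply the estimated coefficients multiplied by the person's values and summed.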

Data:

https://www.kaggle.com/datasets/mirichoi0218/insurance.

References :

LANTZ, B. Machine Learning with R. Birmingham: Packt Publishing Ltd, 2013.

GUJARATI, D.; PORTER, D. Basic Econometrics. New York: McGraw-Hill/Irwin, 2009.

https://www.kaggle.com/code/ruslankl/health-care-cost-prediction-w-linear-regression/report.

https://www.kaggle.com/code/grosvenpaul/regression-eda-and-statistics-tutorial.
