How to estimate risk ratios using regression models with R?

Let’s learn together how to use regression models to get valid estimates of risk ratios. Isn’t already odd that people prefer odds ratios over risk ratios?
Biostatistics
GLM
R
Author

Abdullah Abdelaziz

Published

April 22, 2024

Our Goal

In this post, we will learn why we, as researchers and epidemiologists, should be using risk ratios more often than we do. We will also learn a simple way to use regression models to easily estimate risk ratios.


Introduction

Important note

Basically, I am arguing here that risk ratios are generally better than odds ratios. If you already know this, and you are just interested in the implementation, feel free to jump to Prerequisites

You might remember your first classes in epidemiology\biostatistics where you were first taught how to use regression models. If you were taught the same way I was, then you must have seen people telling you that if you have a binary outcome (life vs death, receive treatment vs don’t receive treatment, etc.), the only way to model it is to use a logistic regression model to study the association between your exposure (aka predictor or independent variable depending on your field of study) and your outcome (aka response variable or dependent variable).

When we use logistic regression models, the association between the binary outcome variable and the exposure variable will be expressed in the odds ratios scale, which represents the odds of the outcome in the exposed (treated) individuals divided by the odds of the outcome in the unexposed (untreated) individuals. However, from your first epidemiology class, you might remember that we are mainly interested in risk ratios, not odds ratios. Risk ratios come from probabilities, which have a really more intuitive meaning than odds. Also, due to their mathematical properties, odds ratios are almost always non-collapsible (risk ratios also have this problem to some extent, but they manifest in slightly different circumstances)1. In cohort studies, odds ratios will always exaggerate the effect size when compared to the risk ratios if your outcome variable is common in your dataset.

1 I will create another blog post to describe why non-collapsiblity could be a problem when we try to estimate causal estimates of the treatment effect, but for now you have to trust me that it’s a problem 😅

So you might be asking: why is everyone using odds ratios?

That’s a great question! The reason is largely due to the simplicity of implementing logistic regression in statistical software. From a generalized linear model (GLM) standpoint, logistic regression is more stable and is more likely to converge2 when compared to regression models used to estimate risk ratios3.

2 Convergence simply means that the algorithm that the model uses managed to find a set of coefficients (the \(\beta\)’s) that gives the best fit possible to the data. If convergence failed, that means the model couldn’t estimate those coefficients.

3 For more details why this is the case, check this nice article(Mittinty and Lynch 2022)

Linear regression, logistic regression, Poisson regression, Gamma regression, etc. You might have heard about those folks before. All of them are actually variants of a large family of models called the Generalized Linear Models, or GLM for short. There are a lot of nice books out there for learning GLMs (and I strongly recommend you to learn about them), but for the sake of the post, you only need to learn how we specify a GLM.

To specify a GLM, you need to specify two elements in the model:

  1. A probability distribution that best describes the outcome variable.
  2. A link function that determines the scale of the regression coefficients (which represent the measures of associations e.g. risk ratios, risk difference, mean difference, etc.)

Let’s see an applied example to see how this works. We will use the mtcars dataset (loaded by default in R). We will model the variable vs (which is a binary variable 0/1) against the variable hp, which is a continuous variable. Don’t worry about the interpretation for now. I am just trying to demonstrate the concept of GLMs.

# install.packages(c("tidyverse","broom")) # Run run this if you don't have the package installed

# Load packages
library(tidyverse)
library(broom)

# Let's build and run GLM
my_model <- glm(vs ~ hp, 
    family = binomial(link = "logit"),
    data = mtcars)

my_model |> 
  tidy()
term estimate std.error statistic p.value
(Intercept) 8.3780240 3.2159316 2.605162 0.0091831
hp -0.0685607 0.0273995 -2.502265 0.0123401

The code above describes a GLM in which the distribution specified is the binomial distribution, which makes sense because the outcome variable is a binary (0/1) variable. We used the link ="logit" to specify that we need the coefficients to be expressed in the log odds ratio scale. I built the model and stored it in the my_model object.

We can interpret the coefficient -0.0685607 as the following “With each additional unit of the hp variable, the log odds of outcome vs will on average decrease by 0.0685607. And if you exponentiated this -0.0685607, you would get 0.9337368, and this would be your odds ratio!! Now you can interpret this as the following: With each additional unit of the hp variable, the odds of the outcome vs will, on average, decrease by 0.07, which is the excess odds given by \(1-OR\).

If you haven’t noticed by now, this is essentially a logistic regression model. Logistic regression is a GLM where the distribution is binomial, and the link function is the logit function. Depending on your needs, you can vary the choices for both the distribution and link functions. However, things are not that straightforward in some situations, and I will explain later in the post why this might be the case.


Prerequisites

To be able to follow up with the post, I assume the following:

  • You have some familiarity with R and Rstudio.

  • You know what regression means (just high-level understanding). Reading A super quick background on GLMs can be helpful.


Implementation

Now, let’s do the cool stuff! Remember that multiple ways to estimate risk ratios using regression models exist, but I will present the one I am particularly comfortable with.

1. Load the necessary packages

library(tidyverse) #This is for data manipulation and wrangling stuff
library(broom)
library(fixest) # This is a wonderful package to run GLMs
library(sandwich) # To deal with standard errors

2. Simple implementation

Following on the same reasoning used here, theoretically, we can use the glm function in R, with family = binomial(link = "log)) to estimate the log probability ratio, which is equivalent to the log risk ratio4. We can exponentiate the coefficients to get risk ratios similar to what we have done with the log odds ratio.

4 Risk in the epidemiology world is actually equivalent to what we call in our daily lives probabilities

Let’s try this

# Let's build and run GLM to estimate risk ratio
my_model <- glm(vs ~ hp, 
    # we set link to "log" since we want risk not odds
    family = binomial(link = "log"), # we set link to "log" since we want risk not odds
    data = mtcars)
Error: no valid set of coefficients has been found: please supply starting values

Okay, R does not know how to fit the model! Should we panic? Is it the end of the world?? Should we abandon risk ratios?? Should we stop using R?? Well. Not really. This is expected and often happens with all kinds of statistical software. The problem is that the maximum likelihood algorithm (the algorithm used by GLMs) cannot estimate proper coefficients to fit the data. There is a good mathy reason behind such kinds of errors, described (Mittinty and Lynch 2022). But we don’t need to worry about that for now; focus on the big picture.

This error can happen with any GLM for many reasons, but it more often than not happens specifically when you try to use the binomial distribution with a log link to estimate risk ratios.

3. Give me my risk ratios!!

But don’t worry; this blog post is specifically made to address this issue. There are multiple ways to address this issue, and they are elegantly described in this article (Naimi and Whitcomb 2020). I will present the simplest and quickest method, in my opinion.

The modified Poisson regression

The whole idea behind the solution is to use another member of the GLM family; the Poisson model. Our friendly Poisson model is usually used to model count variables (0, 1, 2, 3, ….. etc.). Let’s see if that is going to work

# Let's build and run GLM to estimate risk ratio
my_model <- glm(vs ~ hp, 
    # we set link to "log" since we want risk not odds
    family = poisson(link = "log"), # We use the Poisson distribution now
    data = mtcars)

my_model |> 
  tidy()
term estimate std.error statistic p.value
(Intercept) 1.5701827 0.7243790 2.167626 0.0301872
hp -0.0211491 0.0073697 -2.869744 0.0041080

This worked! The -0.0211491 is our log risk ratio. After exponentiation, we get 0.979073. We can interpret this as the following: For every unit increase of hp, we expect the risk of getting the outcome vs to decrease, on average, by 0.02 or 2% (Compare this to the 7% decrease in the odds ratios scale).

4. We are not done yet (Check your standard errors)!!

Yup! As the title says, if it were that straightforward, we would’ve routinely used the Poisson regression model in our daily risk ratio estimation. But since we are using a Poisson distribution instead of a binomial distribution, our standard errors are not correct.

Thankfully, we can adjust and correct those standard errors! To do that, we need the sandwich package, which contains a wide range of standard error estimators robust to various modeling issues. Usually, correcting standard errors using sandwich can get clunky in terms of code implementation. This is where the fixest package comes to the rescue. The fixest package is awesome for doing all the work (running the GLM model and correcting standard errors in one step).

Let’s do this

# Create the model
my_model_corrected_se <- feglm(vs ~ hp, 
             data=mtcars,
             family = poisson(link="log"),
      vcov = sandwich
) 

# Show the output
my_model_corrected_se |> 
  tidy()
term estimate std.error statistic p.value
(Intercept) 1.5701827 0.2485830 6.316533 0
hp -0.0211491 0.0030141 -7.016667 0

Let’s contrast the standard errors from the two models

my_model |> 
  tidy()
term estimate std.error statistic p.value
(Intercept) 1.5701827 0.7243790 2.167626 0.0301872
hp -0.0211491 0.0073697 -2.869744 0.0041080
my_model_corrected_se |> 
  tidy()
term estimate std.error statistic p.value
(Intercept) 1.5701827 0.2485830 6.316533 0
hp -0.0211491 0.0030141 -7.016667 0

The standard error before the correction was about 2 times larger than the one after the correction5. This shows the importance of correcting standard errors when using the Poisson distribution with GLM to estimate risk ratios. This procedure was reported about two decades ago in this paper (Zou 2004), yet you don’t see it widely implemented in the literature. One of the reasons, in my opinion, is that its software implementation is not that straightforward.

5 When to correct your standard errors is actually a good thing to know, but we will leave that for another blog post.

The use of the feglm function from the fixest package can make life very easy if you want to estimate risk ratios or other measures of effect.

And that’s it!

I hope you enjoyed the post. If you have any suggestions, corrections, or questions, please send them to my email at aabdel51@uic.edu

References

Mittinty, Murthy N, and John Lynch. 2022. “Reflection on Modern Methods: Risk Ratio Regressionsimple Concept yet Complex Computation.” International Journal of Epidemiology 52 (1): 309–14. https://doi.org/10.1093/ije/dyac220.
Naimi, Ashley I, and Brian W Whitcomb. 2020. “Estimating Risk Ratios and Risk Differences Using Regression.” American Journal of Epidemiology 189 (6): 508–10. https://doi.org/10.1093/aje/kwaa044.
Zou, Guangyong. 2004. “A Modified Poisson Regression Approach to Prospective Studies with Binary Data.” American Journal of Epidemiology 159 (7): 702–6. https://doi.org/10.1093/aje/kwh090.