<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>Categorical Data Analysis - Logistic Regression</title>
<style type="text/css">
body p strong {
	color: #000;
}
body {
	color: #000;
}
body h2 {
	text-align: right;
	font-size: 14px;
}
</style>
</head>

<body><div align="left">
<h2> <a href="http://pulmonarychronicles.com/ojs/index.php?journal=pulmonarychronicles&page=article&op=view&path%5B%5D=148&path%5B%5D=369" title="PDF" target="_blank">PDF</a></h2>
<h3><strong><a id="TOP"></a>Categorical Data Analysis - Logistic Regression</strong></h3>
<p><strong>Shengping Yang PhD<sup>a</sup>, Gilbert Berdine MD<sup>b</sup></strong></p>
<p>Correspondence to Shengping Yang PhD.&nbsp;
  <br>
Email: <a href="mailto:Shengping.yang@ttuhsc.edu">Shengping.yang@ttuhsc.edu</a></p>
<style type="text/css">
 .row { vertical-align: top; height:auto !important; }
 .list {display:none; }
 .show {display: none; }
 .hide:target + .show {display: inline; }
 .hide:target {display: none; }
 .hide:target ~ .list {display:inline; }
 @media print { .hide, .show { display: none; } }
 </style>
<div class="row">
 <a href="#hide1" class="hide" id="hide1">+ Author Affiliation</a>
 <a href="#show1" class="show" id="show1">- Author Affiliation</a>
 <div class="list">
  <div><sup><strong>a</strong></sup> a biostatistician in the Department of Pathology at TTUHSC in Lubbock, TX.</div>
  <div><sup><strong>b</strong></sup> a pulmonary physician in the Department of Internal Medicine, TTUHSC in Lubbock, TX.</div>
 </div>
</div>
<p><em>SWRCCC</em> 2014;2(7):51-54 &nbsp;&nbsp;<br>
<strong>doi:</strong> 10.12746/swrccc2014.0207.094</p>
<p align="center">
...................................................................................................................................................................................................................................................................................................................................</p>
<p>&nbsp;</p>
<font face="Times New Roman, Times, serif" size="+1"><em>
I am planning a case-control study on lung cancer and body mass index (BMI). I think that this information would not
fit a normal distribution. I would like to know more about how to analyze data from such a study.</em></font>
<p>&nbsp;</p>
<p align="center">
...................................................................................................................................................................................................................................................................................................................................</p>
<p><span style="font-size: large"><strong><em>C</em></strong></span>ase-control studies are widely used to investigate
the potential relationship between a suspected
risk factor and a disease or outcome of interest. By
looking retrospectively and comparing how frequently
exposure to a risk factor is present in subjects
who have the disease (cases) with how frequently it is present in those who do not
have the disease (controls), the relationship can be
evaluated. The outcome variable takes exactly
two values, conventionally labeled “case”
and “control”. This type of variable is called a
  <strong>categorical/nominal</strong> variable<a href="#References"><sup>1</sup></a> (data that have two or
more categories with no intrinsic ordering to
the categories).<br>
<br>
Since a categorical outcome variable can take
only a few (two in case-control studies) possible values,
its distribution can be very different from normal.
Thus many of the statistical methods developed for
analyzing data with normally-distributed outcome
variables are not suitable for analyzing data with categorical
outcomes. <em>Note that those methods are also
not suitable for analyzing data with <strong>ordinal </strong>(a statistical
data type consisting of numerical scores that exist
on a rank scale) or <strong>cardinal</strong> (a type of data in which
observations can take only the non-negative integer
values {0, 1, 2, 3, ...}, and where these integers arise
from counting rather than ranking) outcome variables.</em><br>
<br>
Binary logistic regression (we will drop “binary” for
simplicity) is widely used to analyze data from case-control studies. In this column, we provide
some details on the application, assumptions, interpretation,
and pitfalls of logistic regression.</p>
<br>
<h3><strong>1. The basics of logistic regression</strong></h3>
<p>In the previous article, we showed how linear regression
“fits” data point pairs of a continuous independent
variable x and a continuous dependent variable y to the linear
function <em>y=mx+b</em>. Our case-control study cannot
use this method, because our outcome variable y can
take only the values ‘case’ or ‘control’. Logistic regression
solves this problem by transforming a non-linear
equation into a linear form.<br>
<br>
The first step is the use of the logistic function:
<p align="center"><img src="../Formula/StatColumn_clip_image002_0002.gif" alt="f1" width="132" height="33">.</p>
 The variable <em>t</em> can take any value from -∞ to +∞.
The variable <em>t</em> will be ‘fit’ using regression methods
to a linear function of our explanatory variable <em>x</em>. The
explanatory variable in our example would be <em>BMI</em>.<br>
<br>
The linear model is:
<p align="center"><img src="../Formula/StatColumn_clip_image002_0003.gif" alt="f2" width="80" height="22">.</p>
The logistic function becomes:
<p align="center"><img src="../Formula/StatColumn_clip_image002_0000.gif" alt="f3" width="198" height="35"></p>
The physical meaning of <em>β<sub>0</sub></em> is the ‘intercept’ or
log-odds of being a ‘case’ when the explanatory variable
has a value of 0, if 0 is achievable. The physical meaning of <em>β<sub>1</sub></em> is the parameter which defines the rate
of change in the log-odds with changes in the explanatory
variable (<em>BMI</em>).<br>
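<p>The relationship between the linear model and the logistic function can be sketched in a few lines of Python. This is only an illustration: the coefficient values <em>beta0</em> and <em>beta1</em> below are hypothetical numbers chosen for the example, not estimates from any study, and the logistic function is written in the equivalent form 1/(1+e<sup>-t</sup>).</p>

```python
import math

def logistic(t):
    """Logistic function F(t) = 1 / (1 + exp(-t)); maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(p):
    """Log-odds (logit) of a probability p; the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration (not study estimates).
beta0 = -3.0   # log-odds of being a case when BMI = 0
beta1 = 0.08   # change in log-odds per one-unit increase in BMI

def prob_case(bmi):
    """Modeled probability of being a case: F(beta0 + beta1 * BMI)."""
    return logistic(beta0 + beta1 * bmi)

for bmi in (20, 25, 30):
    p = prob_case(bmi)
    print(f"BMI={bmi}: probability={p:.3f}, log-odds={log_odds(p):.3f}")

# On the log-odds scale the model is linear: one extra BMI unit adds beta1.
slope = log_odds(prob_case(26)) - log_odds(prob_case(25))
print(f"change in log-odds per BMI unit = {slope:.3f}")
```

<p>The probabilities change non-linearly with BMI, but the log-odds change by the same amount (<em>beta1</em>) for every one-unit increase, which is what makes the model linear after transformation.</p>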
<br>
In order to estimate the regression coefficients,
numeric methods, such as the Newton-Raphson iteration,
are usually used because it is not possible to
find a closed-form expression for the coefficient values.
The Newton-Raphson iteration takes the form:
<p align="center"><img src="../Formula/StatColumn_clip_image002.gif" alt="formula4" width="104" height="33">,</p>
<p>where <em>x</em><sub>1</sub> is the new estimate, <em>x</em><sub>0</sub> is the previous
estimate, <span style="font-size: large"><em><strong>f</strong></em></span>(<em>x</em><sub>0</sub>) is the value of the function at
the previous estimate, and <span style="font-size: large"><em><strong>f</strong></em></span>′(<em>x</em><sub>0</sub>) is the value of the
first derivative at the previous estimate. In logistic regression, the function whose root is sought
is the first derivative of the log-likelihood, so the root is the maximum likelihood estimate. The Newton
method is well suited to automated computing provided
the function is differentiable and the estimates
converge to a single defined value.</p>
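<p>As a minimal sketch of the iteration, the Python code below applies the Newton-Raphson update to the simplest possible case: a logistic model with an intercept only. The data (3 cases among 10 subjects) are made up for illustration, and the function names are our own. In this special case the answer is known in closed form (the log of cases divided by non-cases), so the iteration can be checked against it.</p>

```python
import math

def logistic(t):
    """Logistic function F(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def newton_raphson(f, f_prime, x0, n_iter=25):
    """Apply x1 = x0 - f(x0) / f'(x0) a fixed number of times; return the last estimate."""
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / f_prime(x)
    return x

# Toy data, made up for illustration: 3 cases among 10 subjects, no explanatory variable.
n_subjects, n_cases = 10, 3

def score(b0):
    """First derivative of the log-likelihood of the intercept-only model."""
    return n_cases - n_subjects * logistic(b0)

def score_prime(b0):
    """Derivative of the score with respect to b0."""
    p = logistic(b0)
    return -n_subjects * p * (1.0 - p)

b0_hat = newton_raphson(score, score_prime, x0=0.0)
closed_form = math.log(n_cases / (n_subjects - n_cases))
print(f"Newton-Raphson estimate: {b0_hat:.6f}")
print(f"log(cases / non-cases):  {closed_form:.6f}")
```

<p>With explanatory variables in the model, the same idea is applied to a vector of coefficients, with the first derivative replaced by a matrix of second derivatives of the log-likelihood; statistical software handles this automatically.</p>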
<br>
<br>
<h3><strong>2. Application of logistic regression in case-control studies.</strong></h3>
<p>In the example of a lung cancer study, the objective
is to assess whether lung cancer is significantly
associated with <em>BMI</em>. The two possible outcomes are
developed lung cancer and no lung cancer,
and we want to evaluate the effect of <em>BMI</em> on
lung cancer while controlling for smoking and other
risk factors.<br>
<br>
A variety of software packages can be used to perform
logistic regression analysis, such as SAS, Stata,
SPSS, S-Plus/R, and Minitab. Since SAS is one of
the most widely used statistical software packages, we provide a SAS code example below for analyzing the lung cancer study data.<br>
<br>
<font face="Courier New, Courier, monospace" color="#0000CC">proc logistic</font> <font face="Courier New, Courier, monospace" color="#3399FF">descending;<br>
class</font><font face="Courier New, Courier, monospace"> smoking;</font><br>
<font face="Courier New, Courier, monospace" color="#3399FF">model</font> <font face="Courier New, Courier, monospace">disease = BMI smoking &lt;other risk factors&gt;;</font><br>
<font face="Courier New, Courier, monospace" color="#0000CC">run;</font></p>
<br>
The <em>proc logistic</em> statement invokes the procedure used for modeling the
probability of developing lung cancer. The outcome
variable <em>disease</em> is a categorical variable, coded as
“1” for subjects who developed cancer and “0” for
those who did not. While <em>BMI</em> is treated as a continuous
variable (we can later treat <em>BMI</em> as a categorical
variable as well to see how it is associated with lung
cancer), the class statement tells SAS that smoking is
a categorical variable. The descending option is specified because,
by default, <em>proc logistic</em> models the probability of the lower-ordered
outcome value (here “0”); with descending, the model describes the
probability that the outcome is “1” (developed cancer), consistent with how the outcome variable
is coded.<br>
<br>
<h3><strong>3. Assumptions of logistic regression.</strong></h3>
There are several assumptions underlying a logistic
regression model. Since some of them are quite technical,
we will skip them and focus only on the following
three that are particularly relevant to a case-control
study.<br>
<br>
(a) No important variables are omitted.<br>
<br>
Not including known risk factor(s) in a logistic
regression model creates estimation bias, because
compensating for the missing risk factor(s) results in
over- or underestimating the effect of other risk factors.
Therefore, it is important for researchers to make
sure that all known potential risk factor/confounder
data are collected. For example, in the lung cancer
study, while our objective is to investigate the association
between lung cancer and BMI, we still need to
simultaneously collect data on smoking, family history
of cancer, exposure to pollution, and any other known
confounding variables.<br>
<br>
(b) The observations are independent.<br>
<br>
When this assumption is violated, the estimated
standard errors are incorrect, as are the inferences.
To avoid this violation, the study design and sampling
plan have to be developed properly.<br>
<br>
(c) No severe collinearity among independent variables is present.<br>
<br>
Collinearity occurs when two or more predictor
variables in a multiple regression model are highly
correlated. For example, gestational age and birth
weight are highly correlated, i.e., low (high) gestational
age is usually associated with low (high) birth weight.
Including both variables in a logistic regression model
will cause collinearity. Severe collinearity inflates
the standard errors for the coefficients, which causes
the estimated coefficients to be unreliable. Therefore,
potential collinearity among candidate predictors should be
considered at the study planning stage.<br>
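<p>One simple way to screen for collinearity between two candidate predictors is to look at their correlation. The Python sketch below does this for gestational age and birth weight using values made up purely for illustration (they are not real data). It also reports the variance inflation factor (VIF), a common collinearity diagnostic that is not discussed further in this article; with only two predictors it equals 1/(1 - r<sup>2</sup>).</p>

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only (not real data).
gestational_age_weeks = [28, 30, 32, 34, 36, 38, 40]
birth_weight_grams = [1150, 1400, 1850, 2150, 2650, 2950, 3450]

r = pearson_r(gestational_age_weeks, birth_weight_grams)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print(f"correlation r = {r:.3f}")
print(f"variance inflation factor = {vif:.1f}")
```

<p>A correlation close to 1 (and a correspondingly large VIF) suggests that the two variables carry nearly the same information, so including both in the model would inflate the standard errors as described above.</p>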
<h3><strong>4. Interpretation of logistic regression.</strong></h3>
By definition, the odds of an event (disease) is the
ratio of the probability that an event will occur to the
probability that the event will not occur. In the lung
cancer study, suppose that we have the following
data:</p>
<p>
<table border="1" cellspacing="0" cellpadding="0" width="450" frame="hsides" rules="groups">
  <thead>
  <tr>
    <td width="114" valign="top"><p align="center">&nbsp;</p></td>
    <td width="168" valign="top"><p align="center">Developed    lung cancer</p></td>
    <td width="168" valign="top"><p align="center">No lung    cancer</p></td>
  </tr>
  </thead>
  <tr>
    <td width="114" valign="top"><p>Smoker</p></td>
    <td width="168" valign="top"><p align="center"><em>nsc</em></p></td>
    <td width="168" valign="top"><p align="center"><em>nsn</em></p></td>
  </tr>
  <tr>
    <td width="114" valign="top"><p>Non-smoker</p></td>
    <td width="168" valign="top"><p align="center"><em>nnc</em></p></td>
    <td width="168" valign="top"><p align="center"><em>nnn</em></p></td>
  </tr>
</table>
</p>
<p>
The odds of developing lung cancer for smokers
is <img src="../Formula/StatColumn_clip_image002_0004.gif" alt="f6" width="50" height="28" valign="middle">, and for non-smokers is <img src="../Formula/StatColumn_clip_image002_0005.gif" alt="f7" width="53" height="28" valign="middle">. The odds
ratio (OR) is the ratio of these two, thus,<br>
<p align="center"><img src="../Formula/StatColumn_clip_image002_0006.gif" alt="f8" width="146" height="42"> .</p>
<p>Numerically, suppose <em>nsc</em>=400, <em>nnc</em>=100, <em>nsn</em>=300, and <em>nnn</em>=700; then <img src="../Formula/StatColumn_clip_image002_0007.gif" alt="f9" width="144" height="30" valign="middle">.</p>
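<p>The arithmetic for the raw odds ratio can be checked with a short Python sketch. The counts are the ones given in the numerical example; the variable names simply mirror the cell labels of the table.</p>

```python
# Counts from the numerical example in the text.
nsc, nsn = 400, 300   # smokers: developed lung cancer / no lung cancer
nnc, nnn = 100, 700   # non-smokers: developed lung cancer / no lung cancer

# Odds of lung cancer within each exposure group.
odds_smokers = nsc / nsn
odds_nonsmokers = nnc / nnn

# The odds ratio is the ratio of the two odds ...
odds_ratio = odds_smokers / odds_nonsmokers

# ... which is the same as the cross-product ratio of the 2x2 table.
cross_product = (nsc * nnn) / (nsn * nnc)

print(f"odds for smokers     = {odds_smokers:.3f}")
print(f"odds for non-smokers = {odds_nonsmokers:.3f}")
print(f"odds ratio           = {odds_ratio:.2f}")
print(f"cross-product ratio  = {cross_product:.2f}")
```

<p>Both calculations give an odds ratio of about 9.33 for these counts.</p>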
<br>
In the above example, there is only one risk factor
(smoking), and the odds ratio calculated is called the <strong>raw</strong> odds ratio. Logistic regression analysis can handle
models with multiple risk factors and provide an odds
ratio estimate for each risk factor while adjusting for
all other risk factors (called the <strong>adjusted</strong> odds ratio). Now
suppose that the adjusted odds ratio for smoking is
8.55 (with a <em>P</em> value less than a pre-specified significance
level); then we can interpret it as follows: the odds of
lung cancer are 8.55 times as high for smokers as for
non-smokers, with other risk factors held equal.<br>
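<p>In a fitted logistic regression model, the adjusted odds ratio for a risk factor is the exponential of its regression coefficient. The Python sketch below illustrates the conversion in both directions. The value 8.55 is the adjusted odds ratio from the example above; the BMI coefficient <em>beta_bmi</em> is a hypothetical number used only to show how a continuous variable is handled.</p>

```python
import math

# Adjusted odds ratio for smoking quoted in the text.
adjusted_or_smoking = 8.55

# The corresponding coefficient on the log-odds scale is its natural logarithm.
beta_smoking = math.log(adjusted_or_smoking)
print(f"coefficient for smoking = {beta_smoking:.3f}")

# Hypothetical coefficient for BMI (illustration only, not a study estimate).
beta_bmi = 0.05

# Odds ratio for a one-unit and for a five-unit increase in BMI.
or_per_unit = math.exp(beta_bmi)
or_per_5_units = math.exp(5 * beta_bmi)
print(f"odds ratio per 1 BMI unit  = {or_per_unit:.3f}")
print(f"odds ratio per 5 BMI units = {or_per_5_units:.3f}")
```

<p>For a continuous variable, the odds ratio refers to a one-unit increase; the odds ratio for a larger increase is obtained by multiplying the coefficient by the size of the increase before exponentiating.</p>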
<br>
<h3><strong>5. Pitfalls in interpretation of logistic regression.</strong></h3>
A major limitation of logistic regression applied to
observational data is that it can be used only
for detecting association, rather than causation. For
example, if we find a significant association
between lung cancer and smoking, we cannot
conclude from that alone that smoking causes lung cancer, because
there are alternative explanations - “The same thing
that causes people to smoke may predispose them to
lung cancer.”<a href="#References"><sup>3</sup></a> Therefore, further studies have to be
conducted to verify that a causal effect does exist.<br>
<br>
Another issue associated with logistic regression
is the interpretation of the odds ratio. Clinicians think in
probabilities, not odds. Although odds ratios are valid
measures of the strength of an association, they are
often not good approximations of the relative risk
(RR; the ratio of the probability of an event occurring
in an exposed group to the probability of the event
occurring in a non-exposed group). In fact, the odds ratio
can be used as a proxy for the relative risk only when
the “rare” event assumption is met.<a href="#References"><sup>2</sup></a> For a “rare”
event, the probabilities of the event in both the exposed
and non-exposed groups are very small, i.e.,
we have both P(event | exposure) ≈ 0 and P(event |
non-exposure) ≈ 0. Therefore,</p>
<p>OR=<img src="../Formula/StatColumn_clip_image002_0008.gif" alt="f10" width="504" height="37" valign="middle">RR.</p>
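<p>The effect of the “rare” event assumption can be seen numerically. The Python sketch below compares the odds ratio with the relative risk for two sets of event probabilities. Both sets are hypothetical and chosen only for illustration; note also that in an actual case-control study these probabilities cannot be estimated directly, which is why the odds ratio is reported.</p>

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def relative_risk(p_exposed, p_unexposed):
    """Ratio of the event probabilities in the exposed and non-exposed groups."""
    return p_exposed / p_unexposed

def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds in the exposed and non-exposed groups."""
    return odds(p_exposed) / odds(p_unexposed)

# Hypothetical event probabilities (exposed, non-exposed), for illustration only.
scenarios = {
    "rare event": (0.002, 0.001),
    "common event": (0.40, 0.20),
}

results = {}
for label, (p1, p0) in scenarios.items():
    rr = relative_risk(p1, p0)
    oratio = odds_ratio(p1, p0)
    results[label] = (rr, oratio)
    print(f"{label}: RR = {rr:.3f}, OR = {oratio:.3f}")
```

<p>In both scenarios the relative risk is 2, but the odds ratio stays close to 2 only when the event is rare; for the common event it is noticeably larger than the relative risk.</p>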
<p>Sample size calculation is critical to the success
of a case-control study. In general, the required sample size increases
as the effect to be detected becomes smaller and as the pre-defined
Type I and Type II error rates become smaller. We will discuss sample size
calculation issues in future articles.</p><br>
<h3><strong><em><a id="References"></a>References</em></strong></h3>
<ol>
<li> Agresti A. Categorical Data Analysis. <em>John Wiley &
Sons, Inc.</em> 2013. 1-7; 163-191. Print.</li>
<li> Grimes DA, Schulz KF. Making sense of odds and odds
ratios. <em>Obstetrics & Gynecology</em> 2008; 111(2): 423-426.</li>
<li> Milberger S, et al. Tobacco manufacturers’ defense
against plaintiffs’ claims of cancer causation: throwing
mud at the wall and hoping some of it will work. <em>Tob
Control</em> 2006; 15(Suppl 4): iv17-iv26.</li>
</ol>
<p align="center">...................................................................................................................................................................................................................................................................................................................................</p>
<br>
<strong>Received:</strong> 05/02/2014<br>
<strong>Accepted:</strong> 06/01/2014<br>
<strong>Published electronically:</strong> 07/15/2014<br>
<strong>Conflict of Interest Disclosures:</strong> none<br>
<p>&nbsp;</p>
<p><strong><a href="#TOP">Return to top</a></strong>
</p>
</body>
</html>
