Shengping Yang PhD^{a}, Gilbert Berdine MD^{b}
Correspondence to Shengping Yang MD.
Email: Shengping.yang@ttuhsc.edu
SWRCCC 2014;2(7):51-54
doi: 10.12746/swrccc2014.0207.094
I am planning a case-control study on lung cancer and body mass index (BMI). I think that this information would not fit a normal distribution. I would like to know more about how to analyze data from such a study.
Case-control studies are widely used to investigate the potential relationship between a suspected risk factor and a disease or outcome of interest. By looking retrospectively and comparing how frequently exposure to a risk factor is present in subjects who have the disease (cases) with how frequently it is present in those who do not (controls), the relationship can be evaluated. The outcome variable can take exactly two values, conventionally labeled “case” and “control”. This type of variable is called a categorical/nominal variable^{1} (data that have two or more categories with no intrinsic ordering to the categories).
Since a categorical outcome variable can take only a few possible values (two in case-control studies), its distribution can be very different from normal. Thus many of the statistical methods developed for analyzing data with normally distributed outcome variables are not suitable for analyzing data with categorical outcomes. Note that those methods are also not suitable for analyzing data with ordinal outcome variables (numerical scores that exist on a rank scale) or cardinal outcome variables (observations that can take only the non-negative integer values {0, 1, 2, 3, ...}, where these integers arise from counting rather than ranking).
Binary logistic regression (we will drop “binary” for simplicity) is widely used in analyzing data from case-control studies. In this column, we will provide some details on the application, assumptions, interpretation, and pitfalls of logistic regression.
In the previous article, we showed how linear regression “fits” data point pairs of a continuous explanatory variable x and a continuous dependent variable y to the linear function y=mx+b. Our case-control study cannot use this method, because our outcome variable y can take only values of ‘case’ or ‘control’. Logistic regression solves this problem by transforming a nonlinear equation into a linear form.
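The first step is the use of the logistic function:
$$F(t) = \frac{e^{t}}{1+e^{t}} = \frac{1}{1+e^{-t}}.$$
The variable t can take any value from −∞ to +∞, while F(t) always lies between 0 and 1 and can therefore be read as a probability. The variable t will be ‘fit’ using regression methods to a linear function of our explanatory variable x, that is, t = β_{0} + β_{1}x. The explanatory variable in our example would be BMI.
The logistic function becomes:
$$F(x) = \frac{1}{1+e^{-(\beta_{0}+\beta_{1}x)}}.$$
Equivalently, the log-odds of being a ‘case’ is linear in x:
$$\ln\frac{F(x)}{1-F(x)} = \beta_{0}+\beta_{1}x.$$
The physical meaning of β_{0} is the ‘intercept’ or log-odds of being a ‘case’ when the explanatory variable has a value of 0, if 0 is achievable. The physical meaning of β_{1} is the parameter which defines the rate of change in the log-odds with changes in the explanatory variable (BMI).
Unlike linear regression, the estimates of β_{0} and β_{1} have no closed-form solution and are found iteratively, for example by the Newton method:
$$x_{1} = x_{0} - \frac{f(x_{0})}{f'(x_{0})},$$
where x_{1} is the new estimate, x_{0} is the previous estimate, f(x_{0}) is the value of the function for the previous estimate, and f'(x_{0}) is the value of the first derivative for the previous estimate. The Newton method is well suited to automated computing provided the function is differentiable and the estimates converge to a single defined value.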
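As a rough illustration of how such an iterative fit might proceed, the Python sketch below applies a two-parameter version of the Newton update to the logistic regression log-likelihood. The BMI values and outcomes are hypothetical, the function names `logistic` and `fit_logistic_newton` and the stopping rule are our own choices, and the sketch is not meant to describe what any particular statistical package does internally.

```python
import math


def logistic(t):
    """Logistic function F(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_newton(x, y, max_iter=25, tol=1e-10):
    """Estimate (b0, b1) in log-odds = b0 + b1 * x by Newton's method.

    x: explanatory values; y: outcomes coded 1 (case) or 0 (control).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # first derivatives of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative second derivatives
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system H * d = g for the update d.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data, for illustration only: BMI and case (1) / control (0).
bmi = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
case = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
b0, b1 = fit_logistic_newton(bmi, case)
# exp(b1) is the estimated odds ratio per one-unit increase in BMI.
print(b0, b1, math.exp(b1))
```

The update is the matrix analogue of the scalar formula: the first derivatives play the role of f(x_{0}) and the second derivatives play the role of f'(x_{0}).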
In the example of a lung cancer study, the objective is to assess whether lung cancer is significantly associated with BMI. The two possible outcomes are developed lung cancer and no lung cancer, and we want to evaluate the effect of BMI on lung cancer while controlling for smoking and other risk factors.
A variety of software can be used for performing logistic regression analysis, such as SAS, Stata, SPSS, S-Plus/R, and Minitab. Since SAS is one of the most widely used statistical software packages, below we provide a SAS code example for analyzing the lung cancer study data.
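/* Assumes disease is coded 1 = case, 0 = control; descending models the probability of disease = 1. */
proc logistic descending;
   class smoking;   /* treat smoking as a categorical covariate */
   model disease = BMI smoking <other risk factors>;   /* replace the placeholder with the remaining covariates */
run;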

| | Developed lung cancer | No lung cancer |
|---|---|---|
| Smoker | n_{sc} | n_{sn} |
| Nonsmoker | n_{nc} | n_{nn} |
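The odds of developing lung cancer for smokers are n_{sc}/n_{sn}, and for nonsmokers are n_{nc}/n_{nn}. The odds ratio (OR) is the ratio of these two, thus,
$$\mathrm{OR} = \frac{n_{sc}/n_{sn}}{n_{nc}/n_{nn}} = \frac{n_{sc}\,n_{nn}}{n_{sn}\,n_{nc}}.$$
Numerically, suppose n_{sc}=400, n_{nc}=100, n_{sn}=300, and n_{nn}=700; then
$$\mathrm{OR} = \frac{400 \times 700}{300 \times 100} \approx 9.33.$$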
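The arithmetic in this numerical example can be checked with a short Python sketch. The argument names mirror the cells of the 2×2 table; the function name `odds_ratio` is our own and is used for illustration only.

```python
def odds_ratio(n_sc, n_sn, n_nc, n_nn):
    """Odds ratio for a 2x2 table: (n_sc / n_sn) / (n_nc / n_nn)."""
    odds_smokers = n_sc / n_sn       # odds of lung cancer among smokers
    odds_nonsmokers = n_nc / n_nn    # odds of lung cancer among nonsmokers
    return odds_smokers / odds_nonsmokers


# Counts from the numerical example in the text.
or_estimate = odds_ratio(n_sc=400, n_sn=300, n_nc=100, n_nn=700)
print(round(or_estimate, 2))  # 9.33
```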
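OR≠RR. The odds ratio is not the same as the relative risk (RR), although it approximates the RR when the disease is rare.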
Sample size calculation is critical to the success of a case-control study. In general, the required sample size increases as the effect to be detected becomes smaller and as the predefined Type I and Type II error rates become smaller. We will discuss sample size calculation issues in future articles.