LOGISTIC REGRESSION (aka VARBRUL) in SPSS.
This tutorial continues where the January 5 tutorial left off.
PREAMBLE:
Today we look at actually building a quantitative model for
final /-s/ deletion in Panama Spanish. Model building can have a
number of purposes:
o in order to test a hypothesis about the relationship between two
variables in your data, you may need to build a model as a way of
controlling for interactions between variables. Even so-called
"simple" hypothesis testing implicitly involves building a (simple)
model of your data.
o a model can tell you how much of the variability in your data you
are able to account for using information you have encoded in your
dataset. As such, it can tell you how well you are able to explain
the phenomenon.
o a model can serve as a compact, extensible representation of your
data that you apply for some other purpose. A probabilistic model
of verbal subcategorization frames, for example, could be useful to
a psycholinguist who is interested in how knowledge of
subcategorization trends affects reading times.
Our models will be of the form
P( Y=y | X=x )
where Y is the DEPENDENT variable and X is some set of INDEPENDENT
variables. For now, we'll treat logistic regression as a black box,
and focus on what you do in general with models of the form outlined
above.
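Although we treat logistic regression as a black box for now, its functional form is simple: a weighted sum of the independent variables passed through the logistic (inverse-logit) function, which squashes any real number into a probability. A minimal Python sketch (the coefficient values are invented, purely for illustration):

```python
import math

def logistic(z):
    # Maps any real-valued linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# A logistic regression model has the form
#   P(Y = 1 | X = x) = logistic(b0 + b1*x1 + b2*x2 + ...)
# Hypothetical coefficients, for illustration only:
b0, b1 = -1.0, 0.8
p = logistic(b0 + b1 * 2.0)   # predicted probability for a case with x1 = 2
```

Whatever the coefficients, the output is always a legal probability, which is what makes this form suitable for a binary dependent variable like Deletion.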
7) Aggregated, or grouped, views of the data. There are 8,846
examples in our dataset, but many of them are indistinguishable,
and given the number of values that each variable takes, there are
only 2*3*5*4 = 120 possible combinations of variable values. It
can be more useful to look at the data in aggregate form -- that
is, only one row per unique combination of variables.
When we aggregate our data, we have a choice: we can either combine
cases that differ only in Deletion, or treat them separately. In
the former case, we probably want to include in the aggregation the
frequency of deletion for each row. The latter case is a bit
simpler in SPSS, though, so we'll do it first:
Select Data | Aggregate. Move Deletion, POS, Environment, and class
all into the "Break Variables" block. Check "Number of cases", and
name it something -- I'll use "N". Finally, under "Save" select
"Create new data file" and name it something appropriate -- I'll
use "cedergren-aggregated.sav". Click OK. Then go back to the
main menu, choose File | Open, and select the file that you just
created. In the new file, each row has a unique combination of
variable values, and also has the number of cases under the "N"
column.
To aggregate data differing in value of Deletion, go back to the
original data file and choose Data | Aggregate again. Move POS,
Environment, and class into the "Break Variables" block. Then move
Deletion into the "Aggregated Variables" box. By default it will
say in that box:
Deletion_mean = MEAN(Deletion)
This means that a new variable called Deletion_mean will be
created, and its values will be the average value of Deletion for
all the cases of a given combination of the break variables. As
before, check "Number of cases" and name the new variable, and
choose to create a new file. Click OK, and open the new file.
Note that there are only about half as many rows now, and that each
row has a Deletion_mean variable valued between 0 and 1.
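What SPSS is doing in both aggregations is easy to sketch in code: group the cases by the break variables, then record the case count (N) and the mean of Deletion for each group. A toy Python example (the rows are invented, not the Cedergren data):

```python
from collections import defaultdict

# Each case: (POS, Environment, class, Deletion) -- invented toy rows
cases = [
    ("noun", "pause", 1, 1),
    ("noun", "pause", 1, 0),
    ("noun", "pause", 1, 1),
    ("verb", "vowel", 2, 0),
]

# Group cases by the break variables (POS, Environment, class):
groups = defaultdict(list)
for pos, env, cls, deletion in cases:
    groups[(pos, env, cls)].append(deletion)

# One row per unique combination of break variables, with N and MEAN(Deletion):
aggregated = {
    key: {"N": len(vals), "Deletion_mean": sum(vals) / len(vals)}
    for key, vals in groups.items()
}
```

Because Deletion is coded 0/1, its mean within a group is exactly the observed deletion frequency for that combination of variable values.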
8) Logistic Regression (aka VARBRUL). Select Analyze | Regression |
Binary Logistic. This gives you the interface for setting up
logistic regressions. Recall that we're interested in looking at
the effects of grammatical category (POS), following environment,
and social class on likelihood of /-s/ deletion. We'll start with
building a simple model of deletion where POS, environment, and
class have independent effects. Since deletion is the dependent
variable in our dataset, move "Deletion" into the "Dependent" box
and move "POS", "Environment", and "class" into the "Covariates"
box. Also, even though we've coded social class with numerical
values, we're actually going to treat it as a _categorical_ variable
-- that is, classes 1, 2, 3, 4 are not on any kind of scale. Do this
by clicking on "Categorical" at the bottom of the window. This will
open up a "Define Categorical Variables" window; move "class" into
the right-hand "Categorical Covariates" box in this window. You can
leave "Change Contrast" at "Indicator". Then click Continue, and
back in the Logistic Regression interface window click OK.
A series of tables will then be printed in the Output window.
There will be three major sections of output, looking like this
(See the outline on the left side of the Output window):
Logistic Regression
Case Processing Summary
...
Dependent variable encoding
...
Categorical Variables Codings
...
Block 0: Beginning Block
Classification Table
...
Variables in the Equation
...
Variables not in the Equation
...
Block 1: Method = Enter
Omnibus Tests of Model Coefficients
...
Model Summary
...
Classification Table
...
Variables in the Equation
...
At the moment, we're not interested in the specifics of how
logistic regression works, though some such knowledge is required
to understand parts of the "Variables (not) in the Equation"
tables. Instead, we want to
focus on looking at how well a model fits a dataset, and comparing
the fit of two models. The information relevant for this is in
three tables:
I) Model Summary: this table in Block 1 contains three pieces of
information. The most important one is
-2 Log likelihood
which is calculated from the likelihood of the data under the
model. Lower values of this statistic indicate more likely
models, since it is a negative log (think this through if it's
not self-evident). All else being equal, we want to use models
that have high data likelihood.
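To see concretely why lower values mean more likely models: the likelihood of a binary dataset is the product of the model's predicted probabilities for the observed outcomes, and -2 log likelihood is just -2 times the log of that product. A sketch with made-up outcomes and predictions:

```python
import math

def minus_two_log_likelihood(y, p):
    # y: observed 0/1 outcomes; p: model's predicted P(y = 1) for each case.
    # Each case contributes log(p) if the outcome was 1, log(1 - p) if 0.
    ll = sum(math.log(pi if yi == 1 else 1 - pi) for yi, pi in zip(y, p))
    return -2 * ll

y = [1, 0, 1, 1]
good = minus_two_log_likelihood(y, [0.9, 0.1, 0.8, 0.7])  # confident, correct
bad  = minus_two_log_likelihood(y, [0.5, 0.5, 0.5, 0.5])  # uninformative
```

Since each probability is at most 1, every log term is at most 0, so -2LL is never negative; the better the model's predictions match the outcomes, the closer it gets to 0.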
The Cox & Snell R square and Nagelkerke R square are closely
related statistics, and basically summarize how much of the
variability in your data is successfully explained away by your
model. Larger values of these R squares (the Nagelkerke has a
maximum value of 1) indicate that your model captures more of the
data variability. We'll look more at this idea later in the
course, in the context of linear regression.
Block 0 should really have a Model Summary table that shows the
data likelihood, but SPSS for some reason doesn't put it there.
If you think about it, there is a way of getting around this!
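One such workaround (there are others, e.g. fitting a model with no covariates): the Block 0 model predicts the overall deletion rate for every case, so its -2 log likelihood can be computed by hand from just the number of deletions k and the total number of cases n. A sketch, with an invented deletion count:

```python
import math

def intercept_only_minus_two_ll(k, n):
    # The intercept-only model predicts p = k/n for every case.
    p = k / n
    # k cases have outcome 1 (contributing log p each),
    # n - k cases have outcome 0 (contributing log(1 - p) each).
    return -2 * (k * math.log(p) + (n - k) * math.log(1 - p))

# Hypothetical count of deleted cases (NOT the real Cedergren figure):
block0_m2ll = intercept_only_minus_two_ll(4000, 8846)
```

Comparing this value against the Block 1 -2LL shows how much the covariates improve on knowing only the overall deletion rate.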
II) Classification table: this table shows the percentage of
individual cases correctly classified by a simple "majority rule"
that says for each case, the model guesses the class it considers
most likely.
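Since the dependent variable is binary, "the class it considers most likely" just means predicting Deletion = 1 whenever the predicted probability exceeds 0.5. A sketch with invented outcomes and predictions:

```python
def classification_accuracy(y, p, cutoff=0.5):
    # Predict Deletion = 1 whenever the model thinks it more likely than not.
    predictions = [1 if pi > cutoff else 0 for pi in p]
    correct = sum(1 for yi, pred in zip(y, predictions) if yi == pred)
    return 100.0 * correct / len(y)

# Invented outcomes and predicted probabilities:
y = [1, 0, 1, 0]
p = [0.7, 0.4, 0.3, 0.2]
accuracy = classification_accuracy(y, p)   # percent correctly classified
```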
III) Omnibus Tests of Model Coefficients: this table shows a
chi-square statistic comparing the model with a simpler model (in
this case, the model with only an "intercept"). Larger values of
the chi-square statistic indicate a bigger difference in fit
between this and the simpler model. The "Sig." column shows the
statistical significance of this difference (again, we'll discuss
this idea more thoroughly later on.) In our case, all three rows
are the same, but you can also have SPSS build a model
incrementally, in which case the rows can differ.
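The omnibus chi-square is simply the drop in -2 log likelihood when moving from the simpler model to the fuller one; SPSS then looks that value up in a chi-square distribution whose degrees of freedom equal the number of added parameters. A sketch with made-up -2LL values (the p-value formula shown is the closed form for df = 1 only; general df requires the incomplete gamma function, which SPSS handles for you):

```python
import math

# Made-up -2 log likelihood values for a simpler and a fuller model:
minus2ll_simple = 11200.0
minus2ll_full = 10850.0

# The omnibus chi-square statistic is the improvement in -2LL:
chi_square = minus2ll_simple - minus2ll_full

def p_value_df1(chi2):
    # Upper-tail chi-square probability; exact only for df = 1.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For example, at the conventional df = 1 critical value of 3.841 this formula returns a p-value of about 0.05, which is why larger chi-square values correspond to smaller "Sig." entries.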
9) Visualizing data: scatterplots. Displaying the data in some
graphical form is often useful for understanding patterns in it.
Right now we will visualize the fit of the logistic regression
model by plotting predicted frequency of deletion against actual
frequency of deletion. To do this, we're going to make use of the
data aggregation function again. Go back into the binary logistic
regression interface, click "Save", check Predicted Values:
Probabilities, and click "Continue". Then click "OK" from the main
logistic regression interface. The output will be the same as
before, but in the Data Editor window there will be a new variable
PRE_1 whose value is the predicted probability
P( Deletion = 1 | POS, Environment, class)
To usefully plot predicted versus actual frequency, we need to
aggregate the data as before. Select Data | Aggregate, move POS,
Environment, and class into "Break Variables" as before, and move
both Deletion and the new Predicted variable into "Aggregated
Variables." (The MEAN of the Predicted variable is fine because
this will always be the same for a given combination of break
variables -- why?) Save the aggregated data to file and open that
file.
The Data Editor window now has both Predicted_mean and
Deletion_mean variables. We'll plot these against each other. Go
to Graphs | Scatter/Dot, highlight "Simple Scatter", and click
"Define." Move Predicted_mean to the "X axis" box and
Deletion_mean to the "Y axis" box. When you click OK, a graph will
appear in the Output window, with Predicted_mean on the X axis and
Deletion_mean on the Y axis.
Now double click on the chart, and you'll enter the Chart Editor.
You can do all sorts of useful things from this window, but we'll
focus on two. First, select Options | Reference Line from
Equation. In the ideal model, the predicted deletion frequency
always exactly equals the actual deletion frequency, so we can plot
this ideal model by selecting a Y-intercept of 0 and a slope of 1.
Click "Apply" and a line representing this ideal fit will
appear. Points close to this line represent combinations of
independent variables (often called COVARIATE VECTORS) where the
model closely matches the data. Points farther away have a poorer
fit. Points extremely far away may be OUTLIERS -- that is, cases
where there is a fundamental mismatch between the assumptions
inherent in the model and the actual dynamics in the data.
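Distance from the y = x reference line is just the difference between observed and predicted frequency for each covariate vector, so candidate outliers can be flagged numerically as well as visually. A sketch with invented (predicted, observed) pairs and an arbitrary 0.2 threshold:

```python
# (Predicted_mean, Deletion_mean) per covariate vector -- invented values
points = [(0.82, 0.80), (0.45, 0.47), (0.30, 0.72)]

# Residual = vertical distance from the ideal y = x reference line.
residuals = [observed - predicted for predicted, observed in points]

# Flag points unusually far from the line as candidate outliers
# (the 0.2 cutoff here is arbitrary, chosen just for illustration):
outliers = [pt for pt, r in zip(points, residuals) if abs(r) > 0.2]
```

Cases flagged this way are exactly the ones worth inspecting with "Go to case", described below.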
You can also examine individual cases by left-clicking on points,
then right-clicking on them and selecting "Go to case". This can
be very useful.