|
FALL 2004 DR SUSAN CAROL LOSH |
|
PLEASE READ THIS GUIDE FIRST! Before working with "point estimation," we will first examine relationships between two variables. I will follow the three questions in the guides in this order: (1) is any apparent relationship between two variables real or a statistical accident? (2) If the relationship is probably "real" (i.e., not due to chance sampling variations), how strong is it? (3) If the relationship is probably real and non-trivial, what is the possible causal structure of a relationship between two variables? I treat the material in Guides 4 and 5 as a unit. A lot of material in Agresti and Finlay below will make much more sense after you have finished the material in Guide 5. If you hate that queasy, "I can't understand what's going on" feeling, I strongly recommend that you go over the material below after reading Guide 4. Return and REREAD this material after we complete Guide 5. KEY TO: Huff, Chapter 7, pp. 74-86.
A bivariate distribution simultaneously and jointly cross-classifies the scores on a case for two variables.
For example, if we have a bivariate distribution of gender and support for President Bush (favorable/unfavorable) we can simultaneously cross-classify people as favorable males, favorable females, unfavorable males and unfavorable females.
The jointly cross-classified cases form the "cells" or interior of the table. Each cell has a frequency of cases that have a JOINT score considering both variables simultaneously.
The univariate summaries for each variable separately (for example, male or female) are at the bottom of the table for the independent variable and at the far right of the table for the dependent variable. Because these row and column totals sit in the margins of the table, they are called "the marginals". Remember that the cells are labelled with the row number first, then the column number.
The grand total is usually presented in the lower right corner of the bivariate table.
Title: Generic Bivariate Table
| | Variable X, Value 1 | Variable X, Value 2 | Row Totals |
| Variable Y, Value 1 | (Cell 1,1) | (Cell 1,2) | Marginal Total Variable Y, Value 1 |
| Variable Y, Value 2 | (Cell 2,1) | (Cell 2,2) | Marginal Total Variable Y, Value 2 |
| Column Totals | Marginal Total Variable X, Value 1 | Marginal Total Variable X, Value 2 | Grand Total |
Then, with values supplied for each variable:
Title: Attitude toward President Bush by Gender
| | MALE | FEMALE | |
| FAVORABLE | Male-Favorable (Cell 1,1) | Female-Favorable (Cell 1,2) | Total Favorable |
| UNFAVORABLE | Male-Unfavorable (Cell 2,1) | Female-Unfavorable (Cell 2,2) | Total Unfavorable |
| | Total Male | Total Female | Grand Total |
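To make the idea of joint cross-classification concrete, here is a minimal Python sketch. The six cases are hypothetical, invented only for illustration, and the function name `crosstab` is my own; the sketch simply counts each case once in a cell, once in a row marginal and once in a column marginal.

```python
from collections import Counter

# Hypothetical case-level data: (gender, attitude) pairs invented for illustration.
cases = [
    ("Male", "Favorable"), ("Female", "Unfavorable"), ("Male", "Unfavorable"),
    ("Female", "Favorable"), ("Female", "Favorable"), ("Male", "Favorable"),
]

def crosstab(pairs):
    """Return cell counts, row marginals, column marginals and the grand total."""
    cells = Counter()       # joint frequencies keyed by (row value, column value)
    row_totals = Counter()  # marginals for the dependent (row) variable
    col_totals = Counter()  # marginals for the independent (column) variable
    for col_value, row_value in pairs:
        cells[(row_value, col_value)] += 1
        row_totals[row_value] += 1
        col_totals[col_value] += 1
    return cells, row_totals, col_totals, len(pairs)

cells, row_totals, col_totals, grand_total = crosstab(cases)
print("Favorable males:", cells[("Favorable", "Male")])
print("Total favorable:", row_totals["Favorable"])
print("Total male:", col_totals["Male"])
print("Grand total:", grand_total)
```

Note that the cell key puts the row value first and the column value second, matching the labelling convention described above.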
|
|
The size of a crosstabulation table (which is the total number of cells) depends on how many rows and columns are in the table.
In turn, the number of rows or columns depends on how many values or categories each variable has.
If the row variable has 3 categories and the column variable has 4 categories, the result is a "3 by 4" table.
CONVENTION: The row number always comes first.
Square tables have the same number of rows and columns (e.g., a 2 X 2 table such as the example above).
Size is important because it plays a role
in the type of statistic you choose for your data and how well that statistic
will work. For example, some statistics such as the correlation coefficient
phi work better on square tables. The larger the number of cells, all else
equal, the larger the Chi-Square statistic becomes, so measures of the
statistical significance of this statistic will take the table size into
account.
|
|
The Bivariate Percentage
Table is
just a variation on our old friend, the univariate percentage table. However,
the bivariate table gives more information: it allows us to compare
and contrast group similarities and differences. I have the very simplest
bivariate table, which is a 2 X 2 table, below. There is one column each
for women and men, one row for the correct answer and one row for the incorrect
answer. The first table shown (also found in Guide 2) is the Bivariate
Frequency
Distribution:
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| NOTE: By convention, categories of the independent variable typically form the COLUMNS of the table. | Male | Female | Total |
| Answer to Question: | |||
| Sun goes around Earth (WRONG) | 104 (r1, c1) | 283 | 387 |
| Earth goes around Sun (RIGHT) | 649 | 538 | 1187 |
| Total (at the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined) | 753 | 821 | 1574 |
Source: NSF Surveys of Public Understanding
of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574
A key issue is whether to percentize
down the columns or across the rows.
Make no mistake about it, this IS
a key issue and not a matter of semantics. Percentizing in "the wrong direction"
will totally change the meaning of the results that you present.
CONVENTION: Values of the independent variable create the columns of the table.
For example, the two values of gender:
male and female, head each column in my sample table.
Remember, gender might cause science knowledge,
but we know science knowledge CANNOT cause biological sex.
Therefore, gender is the independent variable.
Science knowledge is the possible effect, or dependent variable.
CONVENTION: Percentize separately within values of the independent variable.
In my example, this means that first I calculate the percent giving correct and incorrect responses for men.
I then repeat the process, calculating the percent giving correct and incorrect responses for women.
Once I have done so, I can now specify the percentage of men who give the right answer (the Earth goes around the Sun) and the percentage of women who give the right answer, and then directly compare women and men.
These percentages within gender are different numbers, and they mean something entirely different from the following question:
among those who think the Earth goes around the Sun, what percent are female? (Answer: 538/1187 X 100 or 45.3%. Since women are 821/1574 X 100 or 52.2 percent of the sample, we can see that women are underrepresented among those giving the correct answer. Notice below that neither column has a percent figure of 45.3%.)
CONVENTION: Remember that when the columns are formed by categories of the independent variable, a percent sign ONLY goes at the top of each column (in this case, the "wrong answer") and after the 100 percent at the bottom of each column. (Note: this is because the values of the independent variable form the columns of the table.)
These conventions
are particularly important as the number of values for each variable grows.
They help your reader to immediately discern which way the percentages
are calculated and they make your table much easier to read.
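The difference between the two directions of percentizing can be checked with a short Python sketch. The frequencies are the ones in the frequency table above; the function names `column_percents` and `row_percents` are my own labels, not part of any statistics package.

```python
# Observed frequencies from the frequency table (rows = answer, columns = gender).
observed = {
    "wrong": {"Male": 104, "Female": 283},
    "right": {"Male": 649, "Female": 538},
}

def column_percents(table):
    """Percentize within each column, i.e. within each value of the independent variable."""
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {r: {c: 100 * n / col_totals[c] for c, n in row.items()}
            for r, row in table.items()}

def row_percents(table):
    """Percentize within each row, i.e. within each value of the dependent variable."""
    return {r: {c: 100 * n / sum(row.values()) for c, n in row.items()}
            for r, row in table.items()}

col_pct = column_percents(observed)
row_pct = row_percents(observed)

# Percent of women who give the wrong answer (percentized down the column).
print(round(col_pct["wrong"]["Female"], 1))
# Percent of right answers that come from women (percentized across the row).
print(round(row_pct["right"]["Female"], 1))
```

The first figure answers "what percent of women are wrong?"; the second answers "among those who are right, what percent are women?". They are different questions with different answers.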
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| | Male | Female |
| Answer to Question: | | |
| Sun goes around Earth (WRONG) | 13.8% | 34.5% |
| Earth goes around Sun (RIGHT) | 86.2 | 65.5 |
| Total | 100.0% | 100.0% |
| Casebase | 753 | 821 |
Source: NSF Surveys of Public Understanding
of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574
|
|
There is only so much that we can do with a single variable. But with two variables, the analytic possibilities open up!
We can discuss prediction, from one variable to another.
We can discuss causality. A first step in establishing causality is to examine the joint frequencies and see if one variable covaries with a second.
Covariation or correlation means that scores on a second variable change in some systematic way as scores on a first variable change.
If one variable causes a second, scores on the two variables should systematically correlate or covary. If two variables are, in fact, causally related, they should have a "statistically significant" and substantively important correlation.
Causation implies correlation.
However, the reverse does not hold.
Covariation alone does NOT mean that one variable causes a second. Two variables can be correlated, yet not be causally related to one another.
That is: correlation does not equal causation.
Virtually all scientists of whatever kind (educational, behavioral, social, life, physical, etc.) and educators care about cause and effect. If we understand what causes a phenomenon, whether the event is a speech disability or AIDS, the potential for changing that event is much greater. Knowledge about causes [independent variables] means greater chances for understanding, predicting and controlling effects [dependent variables].
|
|
FIRST, we ask whether any apparent relationship between two variables in sample data is a statistical ACCIDENT caused by sampling error (sampling variability) or whether the relationship is REAL, that is, non-zero or "statistically significant".
What are the odds that an observed non-zero relationship in a sample is simply due to chance?
This is the question of statistical significance or statistical inference. We generally test statistical significance with different sampling distributions and a probability density function (pdf).
The smaller the odds that an observed relationship is due to chance, generally the more confident we are that an observed relationship is REAL, that is, non-zero, in the population too, and not a chance accident that only holds for one particular sample.
Notice that we do NOT address strength of a relationship at this point,
only whether the true association in the population is zero or not.
We often check substantive significance through the value of a correlation coefficient. Over the coming weeks, we will examine the properties of several different correlation coefficients so that we can choose the more appropriate one for our data, or so that we can assess the appropriateness of the chosen correlation coefficients in professional research projects.
This topic will be addressed in more depth
in Guide 6 because analytically it involves multivariate analyses.
One possibility
is that the variables are locked in a symmetric relationship and we cannot
tease out which variable is the cause and which variable is the effect.
One
example is the correlation between marital status and reported mental health
in men over 30. Married men over 30 report better mental health than never
married men over 30.
But what's cause and what's effect? Some family researchers speak of the "buffer effects" of marriage, instilling greater mental health. There may also be self-selection effects, i.e., older men in poorer mental health are less likely to marry in the first place. When the cause is indeterminate, we speak of a symmetric relationship in which cause and effect cannot be unequivocally established.
To address question (3) typically requires three variables: the original two, and a third variable used as a statistical control variable.
For example, remember the case of the correlation
between ice cream consumption and assault rates over the months of the
year? Eating ice cream incites assault, perhaps? Or maybe people get so
hot and sweaty assaulting others that a nice, cold ice cream hits the spot.
Or, maybe the true situation looks like this instead:
[Diagram: temperature causes both ice cream consumption and the assault rate.]
Any apparent "causal" relationship between ice cream consumption and assaults is "spurious," it occurred ONLY because a third variable, temperature in this example, caused both ice cream consumption and the assault rate.
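A spurious relationship of this kind can be imitated with simulated numbers. The Python sketch below is only an illustration: the data are randomly generated, not real ice cream or crime figures, and the variable names and noise levels are assumptions of mine. It uses the standard first-order partial correlation formula to "control" for temperature.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation of x and y after statistically controlling for z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated data: temperature drives BOTH other variables; they never affect each other.
rng = random.Random(0)
temperature = [rng.gauss(0, 1) for _ in range(2000)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
assaults = [t + rng.gauss(0, 0.5) for t in temperature]

r_raw = pearson_r(ice_cream, assaults)
r_controlled = partial_r(ice_cream, assaults, temperature)
print("ice cream vs. assaults:", round(r_raw, 2))
print("same, controlling for temperature:", round(r_controlled, 2))
```

In this simulation the two-variable correlation is large, but it shrinks to roughly zero once the third variable is held constant, which is the signature of a spurious relationship.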
Here's another example of a spurious relationship between two variables, uncovered by the use of a third, or control variable.
Did you know there is a positive relationship
between the number of fire engines at a fire and the amount of dollars
in fire damage? Yes, there is. Better avoid calling the fire department
next time there is a fire! That's a sure way to pump up those insurance
claims! Or, perhaps the situation really looks like the chart below instead:
|
It should be apparent by this time that it is often difficult to tease out the causal structure of a relationship between two variables. We will focus the most attention on causal structure in non-experimental studies in Guide 6.
|
|
|
BASICS |
|
|
|
I will use the terms association, relationship and correlation more or less interchangeably.
We will start with question one:
|
If your data are from a total population, stop here! You do not need to infer from sample results to the population because you already have the population. You may proceed directly to question #2 and ask about relationship strength and effect size.
|
As you now know, most of the data that we collect come from a sample. Most populations are too large and unwieldy to study feasibly without an enormous expenditure of time, effort and finances. (Even attempts to study an entire population, such as the decennial U.S. Census, often undercount particular segments of the U.S. population.)
However, even from very good samples, the results can vary or fluctuate considerably from one sample to another. Thus, because most of our data are from samples or subsets of the population, what appears to be a relationship could, in fact, instead be an accidental finding.
The TRUE population parameter of the association between two variables could be zero, even though it misrepresented itself as nonzero in any one sample finding.
Stated a bit differently, even if the true relationship were zero in the total population, in any one specific sample, the results could appear as though two variables were related (such as a correlation coefficient of .25)--but the observed numeric correlation could simply be a sampling fluctuation from sample to sample around a true population value of zero.
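Sampling fluctuation around a true value of zero is easy to see in a simulation. The Python sketch below is an illustration under assumptions I chose (normally distributed scores, 500 repeated samples, sample sizes of 20 and 500); it draws two variables that are unrelated in the "population" and records the correlation found in each sample.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)

def sample_r(n):
    """Correlation in one sample of size n when the true correlation is zero."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # generated independently of x
    return pearson_r(x, y)

small = [sample_r(20) for _ in range(500)]   # 500 small samples
large = [sample_r(500) for _ in range(500)]  # 500 large samples

share_small = sum(abs(r) >= 0.25 for r in small) / len(small)
share_large = sum(abs(r) >= 0.25 for r in large) / len(large)
largest_small = max(abs(r) for r in small)
largest_large = max(abs(r) for r in large)
print("share of small samples with |r| >= .25:", share_small)
print("share of large samples with |r| >= .25:", share_large)
```

Even though the population correlation is zero by construction, a noticeable share of the small samples shows a correlation of .25 or more in absolute size, while the large samples stay much closer to zero.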
In statistical inference, we begin with a "null hypothesis". You may be familiar with the idea of a null hypothesis, typically written as Ho, from your thesis or dissertation research. The idea in statistics is related to these kinds of uses.
One of the easiest mathematical
and conceptual starting points is to assume that there is NO
relationship between two variables in the population, i.e., the correlation
between two variables is really zero.
While we actually could hypothesize that
the true value of the correlation could be some specific non-zero number,
typically we don't have sufficient information to guess at what that number
could be. (Later in the semester, we will consider how prior knowledge
may lead us to make more precise estimates about the value of a statistic.)
Further, if we guess at a particular number for a null hypothesis, generally
each different starting number can generate a pdf with a different form,
shape, or peak. The EASIEST starting number to work with is a correlation
of ZERO between two variables.
|
If the odds or the probability of observing our results by chance is high (say 80 samples out of 100 would have similar results when the true population relationship is zero), we conclude that the apparent relationship may well be an ACCIDENT, and we do not reject the possibility that the true relationship in the population is zero.
The lower or smaller the odds or the probability of observing our results solely by chance (say, only 5 samples in 100 would have similar results when the true population relationship is really zero) the more confident we become that the apparent relationship is REAL and not a chance fluctuation. After all, if we only take ONE sample, what are our odds of getting one of those unusual samples that "look real--but aren't" solely by chance? They are only one in twenty, and those are slim odds.
We write the probability (p) of observing a relationship solely by chance as p = , p < , or p > some figure between 0 and 1.
Probabilities are always between 0 and 1.
Here are some examples:
If there were
NO relationship in the population between two variables (the
correlation in the population is really zero) then:
| PROBABILITY STATEMENT | WHAT IT MEANS IN WORDS |
| p < .01 OR "p is less than 1 chance in 100" | the results we observed in our sample would occur by chance less than once in 100 samples if the true population correlation = 0 |
| p = .10 OR "p equals 1 chance in 10" | the results in our sample would occur by chance in exactly 10 out of 100 samples if the true population correlation = 0 |
| p > .05 OR "p is greater than 5 chances in 100" | the results in our sample would occur by chance in more than 5 in 100 samples if the true population correlation = 0 |
The cutoff probability that indicates
that the sample relationship is probably real (non-zero) in the
population for most behavioral or social science data generally is p <
.05, that is:
|
Less than 5 chances in 100 is considered a sufficiently rare event or sample that it is unlikely to have occurred by accident, and therefore the relationship is probably REAL, that is, non-zero--"there is something there." We reject our beginning or "null" hypothesis of a zero correlation in the total population.
Note, however, that at this probability level you DO run a risk of being wrong: if the relationship in the population really is zero, about 5 samples in 100 would still produce a result this extreme.
Perhaps you had one of the rare 5 samples in 100 that has a large result when the population result is zero. The chances of this happening are small (less than 5 in 100), but it does happen.
When p < .05, we often say the relationship is statistically significant (or REAL or non-zero).
[NOTE: If you have a hard time with the < and > symbols, you can use LT for "less than" and GT for "more than" on exams.]
If you wanted to be even more confident that your data were unlikely to be due to a chance fluctuation, you might want to go out to p < .01 or even p < .001 (the second number is the odds of less than 1 in 1000 that you would get these results by chance if there were a zero correlation in the population.)
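The cutoff logic above can be written as a small lookup. This Python sketch is only a convenience for reading probability statements; the function name `significance_label` and the wording of the labels are my own, and the three cutoffs are the conventional ones named in this Guide.

```python
# Conventional cutoffs, strictest first, paired with how each is usually reported.
CUTOFFS = [(0.001, "p < .001"), (0.01, "p < .01"), (0.05, "p < .05")]

def significance_label(p):
    """Return the strictest conventional cutoff that p falls below."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    for cutoff, label in CUTOFFS:
        if p < cutoff:
            return label
    return "not significant at .05"

for p in (0.20, 0.03, 0.004, 0.0004):
    print(p, "->", significance_label(p))
```

A result labelled "p < .05" leads us to reject the null hypothesis of a zero relationship; a result labelled "not significant at .05" does not.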
|
|
What contributes in general to findings that are "statistically significant" (or, in our terms, "real")?
The "statistical significance" of a sample correlation coefficient or association depends on where it is located in a probability distribution or curve (PDF) for the sampling distribution. This location will determine the relevant area under the curve. In turn, the relevant area under the curve gives the odds of the result you observed in one sample of occurring by chance.
If the statistic for the correlation is extreme, or in the "far tail" of the distribution, that correlation would be very unlikely to occur if there were NO relationship between your two variables. A small odds of occurrence by chance means the relationship is probably "real" or non-zero.
Some correlations follow an approximately normal sampling distribution, especially when the case base is quite large, so let's take another look at the normal curve:
[Figure: normal curve showing the distribution under the null hypothesis assumption H0: correlation = 0. Left tail: very large negative sample correlation. Center: close to zero. Right tail: very large positive sample correlation.]
Under most null hypotheses, the mean of the sampling distribution of a correlation would be assumed to be zero, that is, no relationship.
Very large positive correlations, or numerically large negative correlations under this null assumption would occur at the extremes, or way out in the tails of the distribution. If the true population correlation were really zero, we would be very unlikely to get a sample result with either a strong positive or a strong negative correlation. Such an event, if the true population value were zero, would only happen by chance in a very few extreme--and unrepresentative--samples.
Besides the normal pdf, common probability distributions include the t distribution (which acts like the normal distribution if the sample size is large). The t is a flatter distribution than the normal curve with larger standard error distances when the casebase is less than 120. We will examine the t-distribution at the very end of this Guide. Another common pdf is the Chi-square distribution (X2) which we will examine shortly.
Contributing factors to the location of your statistic in its associated probability density function include:
The casebase. All things equal, results from larger samples are more likely to be "statistically significant" than results from smaller samples.
That is because large sample results are quite stable, they show very little variability from sample to sample and they have very small standard errors. A larger sample will give us a more precise estimate of a population parameter than a smaller sample will.
As a result, it takes a smaller "difference from zero" to be a stable, reliable difference in big samples than it does in small samples, where the standard error is larger, and an extreme correlation is more likely to occur by chance in a smaller sample.
The size of the correlation. All things equal, larger correlations
are more likely to be statistically significant than smaller correlations.
Larger effects are more likely to represent truly non-zero correlation
findings than smaller effects.
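The effect of the casebase can be illustrated with the Pearson Chi-square statistic that this Guide introduces further on. In the Python sketch below, the first table holds the gender by science question frequencies used throughout this Guide; the second table is hypothetical, made by doubling every cell so that the percentages stay identical while the casebase doubles. The function name `pearson_chi_square` is my own.

```python
def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells of a frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

original = [[104, 283], [649, 538]]                     # frequencies from this Guide
doubled = [[2 * cell for cell in row] for row in original]  # hypothetical: same percentages, twice the cases

chi2_original = pearson_chi_square(original)
chi2_doubled = pearson_chi_square(doubled)
print(chi2_doubled / chi2_original)
```

With identical percentages, doubling the casebase doubles the statistic, so the same pattern of results sits further out in the tail of the distribution when the sample is larger.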
If our findings are "statistically significant," we reject the Null Hypothesis (Ho) of no relationship and accept an alternative hypothesis (generically: HA).
What does our alternative hypothesis look like?
It depends on the statistic we are working with.
Some statistics can ONLY take on positive values by definition. The Chi-square statistic is one example. As a squared measure, it cannot be less than zero.
For those statistics, the alternative to the null hypothesis is always positive: HA > 0.
Some statistics can take on either values larger than zero (+) or smaller than zero (-).
If you are unwilling to guess on direction, you can just say HA ≠ 0.
This is called a "two tailed" hypothesis because the result could be larger or smaller than zero and occur in either "tail" of the distribution.
If you are willing to guess on a direction IN ADVANCE, and specify whether you think the correlation is larger OR smaller than zero, but not both, i.e.:
HA > 0 OR HA < 0 but not both
you have what is called a "one-tailed hypothesis" because you specified direction IN ADVANCE. This is a more powerful and precise hypothesis.
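The difference between one-tailed and two-tailed probabilities can be sketched in Python for a statistic that follows the normal curve under the null hypothesis. The standardized value `z` below is a hypothetical test statistic, and the function names are my own; the sketch relies only on the complementary error function in the standard `math` module.

```python
import math

def two_tailed_p(z):
    """Area in BOTH tails beyond |z| under the standard normal curve."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_tailed_p(z, direction="greater"):
    """Area in ONE tail; the direction must be chosen IN ADVANCE."""
    upper = 0.5 * math.erfc(z / math.sqrt(2))  # area at or above z
    return upper if direction == "greater" else 1 - upper

z = 1.96  # hypothetical standardized sample result
print("two-tailed:", round(two_tailed_p(z), 3))
print("one-tailed, predicted direction:", round(one_tailed_p(z, "greater"), 3))
print("one-tailed, wrong direction:", round(one_tailed_p(z, "less"), 3))
```

When the result falls in the predicted direction, the one-tailed probability is half the two-tailed one, which is why a directional hypothesis is more powerful. If the result falls in the opposite direction, the one-tailed probability is very large and the null hypothesis is not rejected.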
|
|
First things first: the Chi-square statistic IS NOT a correlation coefficient or a measure of association.
Instead, X2 is a statistic with a pdf that CAN answer our first question: is our correlation simply an accidental variation around a true population value of zero? Or is our correlation "real," that is, something different from zero?
The Chi-square probability density function can help tell us whether an apparent sample relationship between two nominal variables is real or accidental.
Chi-square can be a nominal statistical significance measure. You can use it when one or both variables are nominal. (However, if your independent variable is nominal and your dependent variable is numeric, you have some better choices: see the ANOVA and t-test descriptions later in this Guide.)
Also use Chi-square if the relationship between two variables is nonlinear or curvilinear (we will examine this issue in Guide Five) even if both variables are ordinal or interval.
Chi-square is the probability distribution that is used to test whether the phi and Cramer's V correlation coefficients are zero in the population or something nonzero. Because it is a squared measure, X2 can only be positive.
This is the PEARSON Chi-Square statistic, named after the statistician Karl Pearson. There ARE other X2 statistics in common use. One example is the Likelihood-Ratio X2 statistic, which is presented, along with the Pearson statistic X2, in many computer program outputs (such as the SDA or SPSS output).
Each of these different X2s has different properties, different calculations, and different uses.
For this course, we will use ONLY the Pearson X2 statistic.
If you take later statistics courses, you will meet some of the other X2s.
The formula for the Chi-square statistic
compares
deviations between the observed frequencies in a bivariate distribution
and the frequencies that would be expected by chance if the two variables
were totally unrelated or had a zero relationship.
The mathematical probability density
function (PDF) that produces the X2 distribution is pretty messy
and more complex than the normal distribution. You don't have to memorize
this one, but if you applied the mathematical operation of integration
to this function and you have a 2 X 2 table, the Chi-square distribution
when the POPULATION X2 was assumed to be zero would look
like my drawing below. (The graph for X2 looks different depending
on the number of rows and columns in the table.) Here's the PDF for 2 rows
and 2 columns, or a 4-cell table (courtesy of Dr. Ken Brewer's statistics
book):
Here's what the Chi-square distribution would look like for a 2 X 2 table if we drew it as a graph. (Well, approximately, anyway.) The shape of the Chi-square distribution will depend on the number of degrees of freedom--for our purposes that will be the number of rows and columns in the table. In this example, for a 2 by 2 (total = 4 cells) table, the df =1.
TERMINOLOGY:
DF or df is short for the
term "degrees of freedom" which we will discuss shortly.
The "O" is the OBSERVED frequency in a specific cell, say, row 1 column 1. (O = Observed) The "E" is the EXPECTED frequency in the identical cell, say, row 1 column 1. (E = Expected)
How do we obtain these "expected frequencies" that would occur if the correlation between two variables is zero? The expected frequencies are determined by the marginals and the sample size.
EXAMPLE: CALCULATING AND INTERPRETING CHI-SQUARE
Let's take a look at the relationship between gender and the Earth and the Sun again.
Below are the OBSERVED FREQUENCIES:
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| Male | Female | Total | |
| Answer to Question: | |||
| Sun goes around Earth (WRONG) | 104 (r1, c1) | 283 | 387 |
| Earth goes around Sun (RIGHT) | 649 | 538 | 1187 |
| Total (at the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined) | 753 | 821 | 1574 |
Source: NSF Surveys of Public Understanding
of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574
Now, below are the OBSERVED COLUMN PERCENTAGES.
Notice
I have added a column for percents on the planets question for the total
sample to the far right:
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| OBSERVED PERCENTAGES | Male | Female | Total Sample |
| Answer to Question: | | | |
| Sun goes around Earth (WRONG) | 13.8% | 34.5% | 24.6% |
| Earth goes around Sun (RIGHT) | 86.2 | 65.5 | 75.4 |
| Total | 100.0% | 100.0% | 100.0% |
| Casebases | 753 | 821 | 1574 |
Source: NSF Surveys of Public Understanding
of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574
If there were NO ASSOCIATION between
gender and the science question, what would the percentages in each column
look like? The answer is that the percentages for women and men
would be the same as they are for the total sample, as you see below for
the EXPECTED PERCENTAGES:
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| EXPECTED PERCENTAGES | Male | Female | Total Sample |
| Answer to Question: | | | |
| Sun goes around Earth (WRONG) | 24.6% | 24.6% | 24.6% |
| Earth goes around Sun (RIGHT) | 75.4 | 75.4 | 75.4 |
| Total | 100.0% | 100.0% | 100.0% |
| Casebases | 753 | 821 | 1574 |
We can use our knowledge of the relationship among frequencies, percentages, and proportions to turn the percentages in each column back into EXPECTED FREQUENCIES.
The expected frequencies
are the frequencies that would be expected if there were no association
between gender and the science question in the population at large. This
is sometimes called the "independence hypothesis," that is, in this instance,
gender and answers to the science question would be independent from one
another.
NUTS
AND BOLTS: CALCULATING EXPECTED FREQUENCIES
For row 1 column 1, the expected percent of men giving the wrong answer is 24.6 and the expected proportion is .246.
Multiply the expected proportion of men giving the wrong answer by the total casebase FOR MEN ONLY:
.246 X 753 = 185.2
Notice that with a 2 X 2 table, I have to calculate the expected frequency for ONLY ONE CELL in the table. See the yellow cell in the table below. Because of the marginal totals in the far right column and the bottom row, I can get the other three cells by subtraction. We use the OBSERVED ROW AND COLUMN TOTALS to calculate the EXPECTED CELL frequencies.
For example, the expected cell frequency for the number of women giving the wrong answer will be:
387 - 185.2 = 201.8 or
Total observed frequency giving wrong answer - expected frequency MEN giving wrong answer =
expected frequency WOMEN giving wrong answer
Be sure to distinguish among observed and expected frequencies in the "right places."
The expected cell frequency for the number of men giving the right answer will be:
753 -185.2 = 567.8 or
Total observed male casebase - expected frequency MEN giving wrong answer =
expected frequency MEN
giving right answer
| How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun? |
| EXPECTED CELL FREQUENCIES | Male | Female | Total Sample |
| Answer to Question: | | | |
| Sun goes around Earth (WRONG) | 185.2 | 201.8 | 387 |
| Earth goes around Sun (RIGHT) | 567.8 | 619.2 | 1187 |
| Casebases | 753 | 821 | 1574 |
And, again by subtraction, we find that the expected frequency of WOMEN giving the RIGHT answer = 619.2
Notice that I only had to calculate the expected frequency for ONE CELL (Male, Wrong) (the "yellow cell") out of the four cells in the table, then I could obtain the other three expected frequencies in the table by subtraction from the marginal totals.
This means I only have ONE "degree of freedom" in my 2 X 2 table, or one "independent" piece of information. Once the expected number of males giving the wrong answer is calculated, all the other three interior cells of the table are calculated through subtraction.
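The "one cell, then subtraction" logic can be followed step by step in a short Python sketch. The marginal totals are the observed ones from this Guide; the variable names are my own. One assumption to flag: the sketch uses the unrounded proportion 387/1574 rather than the rounded .246, so its figures differ from the hand calculation in the first decimal place (about 185.1 rather than 185.2 for the first cell).

```python
# Observed marginal totals from the frequency table.
row_totals = {"wrong": 387, "right": 1187}
col_totals = {"Male": 753, "Female": 821}
n = 1574

# Step 1: calculate ONE expected cell (Male, Wrong):
# expected proportion wrong in the total sample x male casebase.
e_male_wrong = row_totals["wrong"] / n * col_totals["Male"]

# Step 2: the other three cells follow by subtraction from the marginals.
e_female_wrong = row_totals["wrong"] - e_male_wrong
e_male_right = col_totals["Male"] - e_male_wrong
e_female_right = col_totals["Female"] - e_female_wrong

for label, value in [("Male, wrong", e_male_wrong), ("Female, wrong", e_female_wrong),
                     ("Male, right", e_male_right), ("Female, right", e_female_right)]:
    print(label, round(value, 1))
```

Only the first cell required a real calculation; the marginals fixed everything else, which is what having one degree of freedom means here.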
Using both the expected and the observed frequencies for our gender by science question table and the Chi-Square calculation formula, here are the calculations:
|
|
SCROLL BACK TO EQUATION 7.10 TO CHECK
OUT THE X2 FORMULA.
We start with the row 1,1 cell, then
the 1,2 cell.
Thus, calculations are done for the first
row, left to right.
We proceed to the second row to calculate
the X2 component for each of the cells in turn, from left to
right.
We do NOT use the row or column
marginals in the Chi-Square formula, only the internal cells in the table
itself.
[(104 - 185.2)^2 ÷ 185.2] + [(283 - 201.8)^2 ÷ 201.8] + [(649 - 567.8)^2 ÷ 567.8] + [(538 - 619.2)^2 ÷ 619.2]
OR
[(-81.2)^2 ÷ 185.2] + [(81.2)^2 ÷ 201.8] + [(81.2)^2 ÷ 567.8] + [(-81.2)^2 ÷ 619.2]
OR
[6593.44 ÷ 185.2] + [6593.44 ÷ 201.8] + [6593.44 ÷ 567.8] + [6593.44 ÷ 619.2]
(Remember, the square of a negative number in arithmetic is a positive number.)
OR
35.60 + 32.67 + 11.61 + 10.65 = 90.53
Chi-square might be small, or it might be large, but Chi-square should ALWAYS be a positive number.
Writing it the way X2 is typically presented:
X2 (1) = 90.53
The (1) subscript means that there is ONE degree of freedom in the table. You must include this information for your reader (or you) to accurately assess the value of Chi-Square.
Although you will not need to calculate Chi-Square by hand, you do need to follow the logic in its calculations above.
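To follow the logic without the arithmetic, the hand calculation above can be replayed in Python. The sketch takes each cell's observed frequency O and expected frequency E exactly as given in this Guide, computes (O - E)^2 ÷ E for each cell, and adds the four pieces. The variable names are my own. Like the hand calculation, it rounds each piece to two decimals before adding; adding the unrounded pieces gives a total about one hundredth larger.

```python
# (observed, expected) pairs, row by row and left to right, as in the hand calculation.
cells = [
    (104, 185.2),  # men, wrong answer
    (283, 201.8),  # women, wrong answer
    (649, 567.8),  # men, right answer
    (538, 619.2),  # women, right answer
]

# One component per interior cell; the marginals are NOT used here.
components = [(observed - expected) ** 2 / expected for observed, expected in cells]
rounded_components = [round(c, 2) for c in components]

chi_square = sum(rounded_components)  # sum of the rounded pieces, as done by hand
print(rounded_components)
print(round(chi_square, 2))
```

Every component is positive because each deviation is squared, so the total can never fall below zero.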
IS THE ASSOCIATION REAL OR ACCIDENTAL? X2 AND STATISTICAL SIGNIFICANCE
REMEMBER: Our null hypothesis is that Chi-Square is zero, which means that there is NO association between gender and the science question in the population.
Given this null hypothesis, our alternative hypothesis is that there is some association between gender and the science question in the population, no matter how small, and, therefore, Chi-Square is bigger than zero (the Pearson X2 is a squared measure so it is positive.)
In symbols we write:
Ho : X² = 0
HA : X² > 0
We always evaluate X² relative to its associated degrees of freedom, because the greater the number of rows and columns in a table, that is, the larger the size of the table, the larger X² tends to become. A moment's reflection, and examination of the calculation formula for X², should convince you of this, because with each added row or column we add in more pieces of X².
Whether a relationship is different from zero or not should NOT depend on the table size.
So we need to control for the size of the table, and that's what considering the degrees of freedom (or df or DF for short) does. In addition, every change in table size changes what the distribution of X² even looks like, and what the probability values are.
The df for the number of rows in the table is calculated as (# rows - 1).
This is because if you know the row total, and the frequencies for all but one of the cells, you can calculate the frequency in the last cell in the row by subtraction.
The df for the number of columns in the table is calculated as (# columns - 1).
This is because if you know the column total, and the frequencies for all but one of the cells, you can calculate the frequency in the last cell in the column by subtraction.
For the entire table, the df = (# rows - 1) * (# columns - 1). For example, a 2 X 2 table would have one total degree of freedom. We start with four degrees of freedom (one for each cell in the table), then:
subtract 1 df for the row total
subtract 1 df for the column total
subtract 1 df for the grand total
That leaves one degree of freedom: if we know the frequency of just one cell in the table, and the three pieces of information above, we can calculate the frequencies in the other three cells of the table through subtraction.
Thus, in shorthand, for our association between two variables, the degrees of freedom are:
df = ( r - 1 ) * ( c - 1)
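As a quick check of the shorthand formula, here is a minimal Python sketch. The function name is made up for this example.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for an r X c crosstabulation: (r - 1) * (c - 1)."""
    return (n_rows - 1) * (n_cols - 1)

print(degrees_of_freedom(2, 2))   # the 2 X 2 gender by science question table
print(degrees_of_freedom(4, 5))   # a 4 X 5 table
```

A 2 X 2 table has 1 df and a 4 X 5 table has 12 df, however many cases the table contains.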
These days, most computer statistical programs automatically calculate exact probability levels for you so that you can assess whether the probable association in the population is zero or nonzero (see my SDA example below and Assignment 3). However, in "the bad old days" we had to look up the value of X² in a table and decide whether it was an accident (really zero in the population) or real (really nonzero in the population), and we needed to know the df (that is, control for table size) to do this.
See Table C (page 670) in Agresti and Finlay for an example. Let's look at the pieces of such an X² table (somewhat abbreviated for this page):
The far left column is for degrees of freedom (df). Remember this depends on table SIZE.
In a 2 X 2 table like the example above, there is 1 degree of freedom, so we look at "row 1". (Marked in blue.)
The "p" levels across the top are the probabilities that an X² at least that large will happen in any one sample, GIVEN THAT X² IS REALLY ZERO IN THE POPULATION.
For example, if X² were really zero in the population, and we had a 2 by 2 table, X²s as large as (OR larger than) 3.84 would occur by chance in only 5 samples in 100 (or ".05").
| df / p | .250 | .100 | .05 | .... | .01 | .005 | .001 |
|---|---|---|---|---|---|---|---|
| 1 | | | 3.84 | | | | 10.83 |
| 2 | | | | | | | |
| 3 | | | | | | | |
| 4 | | | | | | | |
| 5 | | | | | | | |
| 6 | | | | | | | |
| 7 | | | | | | | |
| ... | | | | | | | |
| 30 | | | | | | | |
Thus, for the gender and science question table, the X² (with 1 df) was 90.53.
Notice that this result is WAY larger than the X² value listed in the table for the p = .001 level, which is only 10.83.
Thus, we REJECT the null hypothesis that X² is really 0 in the population for the association between these two variables, and conclude that the relationship between gender and the science question is real, or nonzero.
However, WE COULD BE WRONG! X² might really be 0 in the population and we just got one of those rare, unrepresentative samples where X² looked like it was real (but was really just a sampling accident).
We DO know the odds of our being wrong. They are the probability of our results being an accident if X² is really zero. Those odds are less than one in 1000 samples or p < .001.
Thus, we REJECT the null hypothesis that X² is really 0 in the population for the association between these two variables, and conclude that the relationship is real, or nonzero. And we include the chances of our being wrong when we present the X² statistic:
X² (1) = 90.53, p < .001
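For readers curious how a program arrives at such probability levels, here is a rough Python sketch that uses only the standard library. The function name is hypothetical, and the approach (a closed form for 1 or 2 df, extended two df at a time) is one standard way to compute the upper-tail area of the Chi-square distribution for whole-number df; it is not necessarily what SDA or any other package does internally.

```python
import math

def chi_square_p_value(x, df):
    """Upper-tail probability of a Chi-square at least as large as x (whole-number df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        p, nu = math.erfc(math.sqrt(half)), 1   # exact tail area for df = 1
    else:
        p, nu = math.exp(-half), 2              # exact tail area for df = 2
    while nu < df:
        # Step from nu to nu + 2 degrees of freedom.
        p += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return p

print(round(chi_square_p_value(3.84, 1), 3))    # table value for p = .05, 1 df
print(round(chi_square_p_value(10.83, 1), 4))   # table value for p = .001, 1 df
print(chi_square_p_value(90.53, 1) < 0.001)     # our gender by science question result
```

The two table values quoted in this guide (3.84 and 10.83 for 1 df) come back as approximately .05 and .001, and the probability for 90.53 is far smaller than .001.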
What you see below is an example of a larger, 4 X 5 table from the Current Population Survey August 2000 data that looks at the association between educational level ("reduc," ORDINAL VARIABLE) and the number of computers in the household ("pcnum," RATIO VARIABLE).
By examining the column percents from left to right, we can see that (1) the percentage of households with NO computer decreases as we go from those with less than a high school diploma to those with an advanced college degree and (2) the percentage of households with at least 3 computers increases as we go from the poorly educated to the well-educated. Notice that the table also shows SDA's "color coding," which lets you see quickly how relationships work. "Red" or "pink" cells have more cases than would be expected by chance. "Blue" cells have fewer cases than would be expected by chance.
Frequency Distribution. Cells contain: column percent, then (N of cases).
| pcnum (rows) by reduc (columns) | 1 12th grade No Diploma or less | 2 High School Grad-Diploma Or Equiv (GED) | 3 Some College But No Degree / Associate Deg. | 4 Bachelor's Degree (ex. ba, ab, bs) | 5 Master's Deg. / Professional Deg. / Doctorate Deg. | ROW TOTAL |
|---|---|---|---|---|---|---|
| 0 | 64.7 (12,758) | 53.7 (15,906) | 34.0 (8,204) | 23.2 (3,359) | 18.2 (1,255) | 43.7 (41,482) |
| 1 | 26.2 (5,169) | 36.5 (10,803) | 47.0 (11,334) | 49.6 (7,175) | 47.8 (3,293) | 39.8 (37,775) |
| 2 | 6.2 (1,220) | 7.2 (2,133) | 13.1 (3,164) | 18.3 (2,645) | 21.9 (1,510) | 11.3 (10,671) |
| 3 | 2.9 (576) | 2.7 (786) | 5.9 (1,424) | 8.8 (1,274) | 12.1 (833) | 5.2 (4,892) |
| COL TOTAL | 100.0 (19,723) | 100.0 (29,628) | 100.0 (24,126) | 100.0 (14,452) | 100.0 (6,891) | 100.0 (94,821) |
| Means | .47 | .59 | .91 | 1.13 | 1.28 | .78 |
| Std Devs | .74 | .74 | .84 | .87 | .90 | .84 |
Color coding: <-2.0, <-1.0, <0.0 (N in each cell smaller than expected); >0.0, >1.0, >2.0 (N in each cell larger than expected); T.
You can look at the SDA "color coding" to see at a glance that we have more cases than would be expected by chance (IF THERE WERE NO RELATIONSHIP) in the high education-lots of computers cells, and more cases than would be expected by chance in the low education-no computer cells.
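To see where "more than expected" and "fewer than expected" come from, the sketch below recomputes the expected frequencies for the 4 X 5 table from its marginals and marks each cell "+" or "-". It uses the rounded cell counts printed in the table, so its Chi-square will only be close to, not identical with, the value SDA reports from the weighted data. It also does not try to reproduce SDA's exact color-coding statistic, only the direction of each difference.

```python
# Observed counts: rows are pcnum 0, 1, 2, 3; columns are reduc 1 through 5.
observed = [
    [12758, 15906, 8204, 3359, 1255],
    [5169, 10803, 11334, 7175, 3293],
    [1220, 2133, 3164, 2645, 1510],
    [576, 786, 1424, 1274, 833],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
signs = []
for i, row in enumerate(observed):
    sign_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count if NO relationship
        chi_square += (o - e) ** 2 / e
        sign_row.append("+" if o > e else "-")
    signs.append(sign_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)

for sign_row in signs:
    print(" ".join(sign_row))
print(round(chi_square, 1), "with df =", df)
```

The "+" cells cluster in the low education-no computer corner and the high education-more computers corner, matching the pattern described above.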
Eta, R, Somers' d, Gamma, Tau-b, and Tau-c are correlation coefficients (which we will examine in Guide 5).
* I have been a committee member on two doctoral committees over the past summer where the doctoral candidate incorrectly put a row of zeros as a probability level because they did not know how to interpret their computer output.
| Summary Statistics | | | | | | |
|---|---|---|---|---|---|---|
| Eta* = | .32 | Gamma = | .40 | Chisq(P) = | 11,296.52 | (p= .0000) |
| R = | .31 | Tau-b = | .29 | Chisq(LR) = | 11,466.23 | (p= .0000) |
| Somers' d* = | .26 | Tau-c = | .27 | df = | 12 | |
*Row variable treated as the dependent variable.
Because of the shape of the Chi-Square Distribution, MOST sample X²s will be near 0 if there is no relationship between two variables in the population.
However, if there is at least some relationship between two variables in the population, the value of X² in any particular sample can be very large.
If the value of Chi-square is quite large, we conclude the apparent sample relationship is REAL (nonzero) in the population. If the Chi-square value is near 0, we conclude the apparent relationship is an ACCIDENT of this particular sample.
The size of X² is influenced by:
table size
sample size and
how strong the relationship is
Since three different factors influence Chi-square, a large X² DOES NOT MEAN that you have a strong or large relationship. Relationship strength or effect size is a separate issue (see Guide 5). Relationship strength is the second question we must answer about the relationship between two variables.
You could have a large X² just because you have 4000 cases and the results are very reliable (large samples are more reliable).
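A small Python sketch makes the sample-size point concrete. The two tables below are hypothetical (they are not from any survey): the second has exactly the same percentages as the first, just ten times as many cases.

```python
def chi_square(table):
    """Pearson Chi-square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total

# Hypothetical counts: same pattern, ten times the sample size.
small_sample = [[30, 20], [20, 30]]
large_sample = [[300, 200], [200, 300]]

print(chi_square(small_sample))
print(chi_square(large_sample))
```

The strength of the relationship (the percentages) is identical in both tables, yet the Chi-square for the larger sample is ten times as big. That is why X² by itself cannot tell you how strong a relationship is.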
|
|
One way of looking at the association between two variables is to construct a crosstabulation table. You can create such a table whether your variables are nominal, ordinal, interval, or ratio. However, a table can quickly become difficult to read as the number of rows and columns grows.
What if an independent variable has only two categories, but the other variable is not only interval, but has several categories, perhaps even several dozen categories, such as the variable years of age? As you already know, it is difficult to work with a variable that has several dozen categories in a table. Here are some examples:
The difference between means on a scale of several items of basic science knowledge for women and men.
The difference in response to a new antibiotic among individuals with a bacterial respiratory infection. Control Group 1 gets the current antibiotic, and Intervention Group 2 gets the new drug. We compare the mean number of days to return to health in each group. The range could be well over a dozen days to recovery.
Two different methods of teaching reading are compared. The dependent variable is the number of vocabulary words learned over six weeks among second grade students. A pupil could learn dozens of vocabulary words over that time.
Mean weight loss in pounds is compared between two groups who used different exercise methods for three weeks.
Instead of a crosstabulation table, you can examine the difference in means across two (or more) groups. This is a slightly different form of examining an association between the two variables, but the same general logic holds.
Once again, it is easiest, both logically and mathematically, to postulate that if you had both of the total populations (say, women and men), the means for both groups would be the same. Or, put slightly differently, that there was no difference in the means for the two populations. Or, again put slightly differently, that:
Ho : µ1 = µ2 and thus µ1 - µ2 = 0
And if you have MORE than two groups:
Ho : µ1 = µ2 = ... = µK where K is the number of groups
The generic idea is that IN THE POPULATION, the means on the DEPENDENT variable for each value of the INDEPENDENT variable are all the same.
FOR EXAMPLE:
In other words, if the t-test for a difference in means between two groups is at least 1.96 in absolute value in a large sample, the population relationship is probably REAL. (That's the t-test statistic, not just mean #1 minus mean #2.)
The t-test can be viewed as a special case of the F-test when the number of groups (k) equals only 2: with two groups, F is simply the square of t. For this reason, many computer programs, including the SDA, only calculate F-ratios, not ts.
If your computer output (as ours does) only gives the F-ratio AND you only have two groups to compare (such as men and women), SIMPLY TAKE THE SQUARE ROOT OF THE F-RATIO.
The result will be the absolute value of the t (that is, no positive or negative sign; see Assignment 3 for what to do if you do want a positive or negative sign for the t.) The probability level will be OK, but you need to do n - 2 for the t-distribution degrees of freedom (not n - 1).
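The square-root shortcut can be verified on a toy example. The scores below are invented for illustration, and the sketch assumes the usual pooled-variance (equal variances) form of the two-group t-test, which is the version whose square equals the one-way F-ratio.

```python
import math
from statistics import mean

def pooled_t(a, b):
    """Two-group t-test statistic using the pooled variance (df = n - 2)."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    std_err = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / std_err

def one_way_f(groups):
    """F-ratio from a one-way analysis of variance."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores for two groups of five people each.
group_1 = [2, 4, 4, 5, 7]
group_2 = [5, 6, 8, 9, 9]

t = pooled_t(group_1, group_2)
f = one_way_f([group_1, group_2])
print(round(t, 3), round(f, 3), round(math.sqrt(f), 3))
```

The square root of F matches the absolute value of t; the sign of t has to come from looking at which group mean is larger.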
Here is what the PDF (probability density function) that draws the curve for the t-test looks like (the one for the F distribution is even more complex):
[Figure: GET A LOAD OF "STUDENT'S t"]
Below is the computational formula for the t-test for the difference between mean scores on two values of the same nominal variable. Again, computers calculate these formulas faster and more accurately than people can, but there are a few things I want you to notice:
COMPUTATIONAL FORMULA FOR THE MEAN DIFFERENCES BETWEEN TWO GROUPS:
|
|
When we compare mean scores on an interval-level dependent variable across a nominal independent variable with only two values or groups, we call the statistic a "t-test."
If we have MORE THAN two values on the nominal variable, we call this analysis a "one-way analysis of variance" ("one way" means only one nominal variable) or a one-way "ANOVA" for short.
The output below compares mean scores on the number of computers in the household across educational levels. In other words, we are trying to make the same assessment that we did in the crosstabulation table, but this time we are looking at the mean number of computers instead of the bivariate percentage distribution.
The table below of mean scores across groups makes it easy to spot that the average number of computers per household rises steadily with educational level, from an average of 0.47 computers for those with less than a high school degree to an average of 1.28 computers for households with advanced degrees.
| Main Statistics: reduc | Mean | SRS Std Err | Std Dev | N of cases | Weighted N |
|---|---|---|---|---|---|
| 1 12th grade No Diploma or less | .47 | .005 | .740 | 19,420 | 19,723.3 |
| 2 High School Grad-Diploma Or Equiv (GED) | .59 | .004 | .739 | 30,068 | 29,627.7 |
| 3 Some College But No Degree / Associate Deg. | .91 | .005 | .836 | 24,337 | 24,126.4 |
| 4 Bachelor's Degree (ex. ba, ab, bs) | 1.13 | .007 | .867 | 14,519 | 14,452.1 |
| 5 Master's Deg. / Professional Deg. / Doctorate Deg. | 1.28 | .011 | .898 | 6,952 | 6,891.0 |
| COL TOTAL | .78 | .003 | .841 | 95,296 | 94,820.6 |
Recode for 'pcnum' 0 = -1 "0"; 1 = 1 "1"; 2 = 2 "2"; 3 = 3 "3"
Color coding: <-2.0, <-1.0, <0.0 (mean in each cell smaller than average); >0.0, >1.0, >2.0 (mean in each cell larger than average); T.
As with Chi-square, the computer will calculate the exact odds of obtaining these SAMPLE results if all the means for number of computers across educational levels IN THE POPULATION were the same (that is, there are NO mean differences in computer ownership across levels of education).
In our example, if there were no mean differences in the number of computers across educational levels, we would expect to observe an F-ratio as large as the one below in:
LESS THAN ONE IN 10,000 SAMPLES JUST BY CHANCE!
That's what the row of zeros under the "P" means.
For our purposes the parts of the table that I have colored yellow are the ones that we will use for this part of the course. We will look at these pieces later in Guide 5 and Assignment 3.
| Analysis of Variance | SSQ | Eta_sq | df | MSQ | F | P |
|---|---|---|---|---|---|---|
| reduc | 6,799.039 | .101 | 4 | 1,699.760 | 2,687.880 | .0000 |
| Residual | 60,260.052 | .899 | 95,291 | .632 | | |
| Total | 67,059.091 | 1.000 | 95,295 | | | |
Remember: the row of ".0000" under the "P" REALLY means p < .0001
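The pieces of the Analysis of Variance table fit together with simple arithmetic, which the short Python sketch below retraces using the sums of squares (SSQ) and df printed in the table. The variable names are illustrative only.

```python
# Values copied from the Analysis of Variance table.
ss_between = 6799.039     # SSQ for reduc (between educational groups)
ss_within = 60260.052     # SSQ Residual (within groups)
df_between = 4            # number of groups - 1
df_within = 95291         # n - number of groups

# Mean squares are sums of squares divided by their df.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of the two mean squares; Eta_sq is the share of the
# total sum of squares that lies between groups.
f_ratio = ms_between / ms_within
eta_sq = ss_between / (ss_between + ss_within)

print(round(ms_between, 2), round(ms_within, 3))
print(round(f_ratio, 1), round(eta_sq, 3))
```

The results agree with the MSQ, F, and Eta_sq columns of the output, up to rounding.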
Now, it's decision time:
Do you accept the null hypothesis and say, "well, I just got one of those weird 1 in 10,000 samples and there really are no educational level differences in computer home ownership" (admittedly a VERY rare event, but it could happen)?
Or, do you reject the null hypothesis (F = 0 or t = 0, which represents NO mean differences across educational groups) and say that the relationship between educational level and number of household computers is nonzero and REAL? That at least one educational group has higher home computer ownership than the others?
Most of us would say the relationship was real, but one out of 10,000 times we would be wrong. The population differences among means really could be zero (so the population value of F would be 0) but you got the 1 in 10,000 samples where you had an extreme result, an enormous F, and it was just a weird and unrepresentative sample.
We write this as:
F (4, 95291) = 2687.88, p < .0001
(DO NOT write a row of zeros here!)
What are the degrees of freedom (df) for this example?
For the F-ratio, they are the number of groups minus one for the first figure (in this example: 5 - 1 = 4).
For the second figure, the df are n - k (where k = the number of groups; in this example: 95,296 - 5 = 95,291).
IF YOU ONLY HAD TWO GROUPS (not 5):
Take the square root of the F and you have the absolute t-value.
The degrees of freedom will be n - 2.
The probability level will be IDENTICAL to the value in your output under the "P" header.
PROBABILITY VERSUS NON-PROBABILITY SAMPLES
If you have a population, stop here! You do not need inference measures under most circumstances.
However, most of the data that we collect come from samples. Therefore, how the sample is taken is critically important.
We can put the laws of probability to work for us and we can use the properties in probability density functions only under certain specified conditions.
#1. Our data come from a probability sample. This means that systematic, non-human judgment means of collecting the data were used.
Each element in a probability sample MUST have an a priori known, non-zero chance of selection.
The chances of selection can be unequal. (Only novices think that probability samples mean equal chances of selection. For example, disproportionate stratified samples are also probability samples.)
Common probability samples include:
The simple random sample (srs), akin to drawing well-mixed numbers from a jar. Random digit dial telephone surveys approximate srs.
The systematic sample, in which every nth element is selected after a random start; for example, every 100th registered FSU student might be selected after a random starting point.
The cluster sample. One example is when a set of entire classrooms (each classroom is a "cluster") is selected from a district school, perhaps with a simple random sample of classrooms.
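The three sampling designs above can be sketched with Python's standard random module. Everything in this example is hypothetical: the population of 1,000 ID numbers, the 20 classrooms of 25 pupils, the sample sizes, and the seed are all made up for illustration.

```python
import random

rng = random.Random(2004)                # fixed seed so the sketch is repeatable
population = list(range(1, 1001))        # hypothetical list of 1,000 student IDs

# Simple random sample: like drawing well-mixed numbers from a jar.
simple_random = rng.sample(population, 10)

# Systematic sample: every 100th element after a random start.
interval = 100
start = rng.randrange(interval)
systematic = population[start::interval]

# Cluster sample: draw whole classrooms at random, then take every pupil in them.
classrooms = {room: [room * 100 + seat for seat in range(25)] for room in range(1, 21)}
chosen_rooms = rng.sample(sorted(classrooms), 3)
cluster = [pupil for room in chosen_rooms for pupil in classrooms[room]]

print(len(simple_random), len(systematic), len(cluster))
```

In each design a chance mechanism, not human judgment, decides who is selected, and every element has a known, non-zero chance of selection.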
|
|
#2. Some form of probability method is used to assign elements to different intervention or treatment groups.
Typically, this is the random assignment ("randomization") used in "true" experiments (and possibly the random assignment of intact groups in some "quasi-experimental" conditions).
However, if you have not selected your entire sample of elements from the population using probability methods, you can NOT generalize to a larger population. This is because your cases do not represent any larger population; they only represent themselves.
If you have used random assignment to place cases in intervention groups, you CAN use inferential methods to compare the differences across the two or more groups. This is because we can assume all your intervention groups were approximately the same (on the average) at the beginning of your study, and the only difference was the coin flip or random number that put a person in Treatment Group 1 rather than in Treatment Group 2.
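Random assignment itself takes only a few lines. In this sketch the 20 participant labels and the seed are hypothetical; the point is simply that a shuffle, rather than the researcher, decides who lands in which group.

```python
import random

rng = random.Random(7)                               # fixed seed for a repeatable example
participants = [f"P{i:02d}" for i in range(1, 21)]   # hypothetical participant labels

# Shuffle a copy of the list, then split it down the middle.
shuffled = participants[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group_1 = shuffled[:half]
treatment_group_2 = shuffled[half:]

print(treatment_group_1)
print(treatment_group_2)
```

Every participant ends up in exactly one group, and on average the two groups should look alike before the intervention begins.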
A relationship or correlation may be statistically significant (i.e., non-zero) but that does NOT mean it is a large or even moderate size association. With very large samples, very small correlations can be statistically real and non-zero because the standard errors are small and the estimates relatively stable. We see this occur in epidemiological studies where the federal government has samples of 50,000.
How large or how strong a relationship is ("effect size") is a different question from how statistically significant it is. However, we need to address the statistical significance question FIRST.
If your results in the population are probably really zero, then no matter how impressive your sample correlation appears to be, you know the strength of it already: it is zero.
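To illustrate how a tiny correlation can be statistically significant in a very large sample, the sketch below uses the standard t-test statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r²)). The correlation of .02 and the two sample sizes are hypothetical values chosen for the example.

```python
import math

def t_for_correlation(r, n):
    """t-test statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.02   # a very weak (hypothetical) correlation
t_small_sample = t_for_correlation(r, 100)
t_large_sample = t_for_correlation(r, 50000)

print(round(t_small_sample, 2), round(t_large_sample, 2))
```

With 100 cases the t is nowhere near 1.96, but with 50,000 cases the very same correlation clears 1.96 easily. The relationship is equally weak in both cases; only our confidence that it is non-zero has changed.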
Susan Carol Losh, September 23, 2004