
EXAM 1 IS WEDNESDAY SEPTEMBER 29


 
EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH


 

PLEASE READ THIS GUIDE FIRST! Before working with "point estimation," we will first examine relationships between two variables. I will follow the three questions in the guides in this order: (1) is any apparent relationship between two variables real or a statistical accident? (2) If the relationship is probably "real" (i.e., not due to chance sampling variations), how strong is it? (3) If the relationship is probably real and non-trivial, what is the possible causal structure of a relationship between two variables?

I treat the material in Guides 4 and 5 as a unit. A lot of material in Agresti and Finlay below will make much more sense after you have finished the material in Guide 5. If you hate that queasy, "I can't understand what's going on" feeling, I strongly recommend that you go over the material below after reading Guide 4. Return and REREAD this material after we complete Guide 5.

KEY TO: Huff, Chapter 7, pp. 74-86. 
KEY TO: Agresti and Finlay, in order: Chapter 8, pp. 248-266; Chapter 8, pp. 272-278; Chapter 8, pp. 282-286; Chapter 6, pp. 154-167; pp. 171-179; pp. 193-198.


 
 
GUIDE 4: SOME BASICS ON BIVARIATE DISTRIBUTIONS

 
BIVARIATE DISTRIBUTIONS
BIVARIATE PERCENTAGE TABLES
 QUESTIONS ABOUT RELATIONSHIPS
SAMPLE RELATIONSHIPS
 WHAT KIND OF SAMPLE?



BIVARIATE DISTRIBUTIONS

A bivariate distribution simultaneously and jointly cross-classifies the scores on a case for two variables.

For example, if we have a bivariate distribution of gender and support for President Bush (favorable/unfavorable) we can simultaneously cross-classify people as favorable males, favorable females, unfavorable males and unfavorable females.

The jointly cross-classified cases form the "cells" or interior of the table. Each cell has a frequency of cases that have a JOINT score considering both variables simultaneously.

The univariate summaries for each variable separately (for example, male or female) appear at the bottom of the table for the independent variable and at the far right of the table for the dependent variable. Because these row and column totals sit in the margins of the table, they are called "the marginals". Remember that the cells are labelled with the row number first, then the column number.

The grand total is usually presented in the lower right corner of the bivariate table.

Title: Generic Bivariate Table

                       Variable X, Value 1    Variable X, Value 2    Row Totals
Variable Y, Value 1    (Cell 1,1)             (Cell 1,2)             Marginal Total, Variable Y, Value 1
Variable Y, Value 2    (Cell 2,1)             (Cell 2,2)             Marginal Total, Variable Y, Value 2
Column Totals          Marginal Total,        Marginal Total,        Grand Total
                       Variable X, Value 1    Variable X, Value 2

Then, with values supplied for each variable:

Title: Attitude toward President Bush by Gender

               MALE                           FEMALE                           TOTAL
FAVORABLE      Male-Favorable (Cell 1,1)      Female-Favorable (Cell 1,2)      Total Favorable
UNFAVORABLE    Male-Unfavorable (Cell 2,1)    Female-Unfavorable (Cell 2,2)    Total Unfavorable
TOTAL          Total Male                     Total Female                     Grand Total
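For readers who want to see cross-classification done by machine, here is a minimal Python sketch. The eight cases are invented purely for illustration (they are not survey data), and the variable names are hypothetical. It counts the joint scores that fill the cells, then the marginals and the grand total.

```python
from collections import Counter

# Hypothetical raw cases: one (gender, attitude) pair per respondent.
cases = [
    ("male", "favorable"), ("male", "unfavorable"),
    ("female", "favorable"), ("female", "unfavorable"),
    ("male", "favorable"), ("female", "unfavorable"),
    ("female", "favorable"), ("male", "favorable"),
]

# Each cell counts the cases that share a JOINT score on both variables.
cells = Counter(cases)

# Marginals are the univariate counts for each variable on its own.
column_totals = Counter(gender for gender, _ in cases)    # independent variable
row_totals = Counter(attitude for _, attitude in cases)   # dependent variable
grand_total = len(cases)

# Print the table row by row, with the row marginal at the far right.
for attitude in ("favorable", "unfavorable"):
    row = [cells[(gender, attitude)] for gender in ("male", "female")]
    print(attitude, row, row_totals[attitude])
print("total", [column_totals[g] for g in ("male", "female")], grand_total)
```

The cell counts always add up to the grand total, and each marginal is the sum of the cells in its row or column.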

 
THE "SIZE" OF A CROSSTABULATION TABLE

The size of a crosstabulation table (which is the total number of cells) depends on how many rows and columns are in the table.

In turn, the number of rows or columns depends on how many values or categories each variable has.

If the row variable has 3 categories and the column variable has 4 categories, the result is a "3 by 4" table.

CONVENTION: The row number always comes first.

Square tables have the same number of rows and columns (e.g., a 2 X 2 table such as the example above).

Size is important because it plays a role in the type of statistic you choose for your data and how well that statistic will work. For example, some statistics such as the correlation coefficient phi work better on square tables. The larger the number of cells, all else equal, the larger the Chi-Square statistic becomes, so measures of the statistical significance of this statistic will take the table size into account.
 


 BIVARIATE PERCENTAGE TABLES

The Bivariate Percentage Table is just a variation on our old friend, the univariate percentage table. However, the bivariate table gives more information: it allows us to compare and contrast group similarities and differences. I have the very simplest bivariate table, which is a 2 X 2 table, below. There is one column each for women and men, one row for the correct answer and one row for the incorrect answer. The first table shown (also found in Guide 2) is the Bivariate Frequency Distribution:
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?
NOTE: By convention, categories of the independent variable typically form the COLUMNS of the table.

Answer to Question:                 Male            Female    Total
Sun goes around Earth (WRONG)       104 (r1, c1)    283       387
Earth goes around Sun (RIGHT)       649             538       1187
Total                               753             821       1574

(At the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined.)

Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574

A key issue is whether to percentize down the columns or across the rows.
Make no mistake about it, this IS a key issue and not a matter of semantics. Percentizing in "the wrong direction" will totally change the meaning of the results that you present.

CONVENTION: Values of the independent variable create the columns of the table.

For example, the two values of gender: male and female, head each column in my sample table.
Remember, gender might cause science knowledge, but we know science knowledge CANNOT cause biological sex.
Therefore, gender is the independent variable. Science knowledge is the possible effect, or dependent variable.

CONVENTION: Percentize separately within values of the independent variable.

In my example, this means that first I calculate the percent giving correct and incorrect responses for men.

I then repeat the process, calculating the percent giving correct and incorrect responses for women.

Once I have done so, I can now specify the percentage of men who give the right answer (the Earth goes around the Sun) and the percentage of women who give the right answer, and then directly compare women and men.

These percentages within gender are different numbers, and they mean something entirely different from the following question:

among those who think the Earth goes around the Sun, what percent are female?
(Answer: 538/1187 X 100 or 45.3%. Since women are 821/1574 X 100 or 52.2 percent of the sample, we can see that women are underrepresented among those giving the correct answer. Notice below that neither column has a percent figure of 45.3%.)
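The difference between the two directions of percentizing can be checked with a short Python sketch. It uses the observed frequencies from the table above; the dictionary layout and variable names are my own assumptions for illustration.

```python
# Observed frequencies; the columns (genders) are the independent variable.
observed = {
    "Male":   {"wrong": 104, "right": 649},
    "Female": {"wrong": 283, "right": 538},
}

# Percentize WITHIN each value of the independent variable (down the columns).
column_percents = {}
for gender, counts in observed.items():
    casebase = sum(counts.values())
    column_percents[gender] = {
        answer: round(100 * n / casebase, 1) for answer, n in counts.items()
    }
    print(gender, column_percents[gender], "casebase:", casebase)

# Percentizing ACROSS a row answers a different question:
# among those who gave the right answer, what percent are female?
right_total = sum(counts["right"] for counts in observed.values())
pct_female_among_right = round(100 * observed["Female"]["right"] / right_total, 1)
print("Percent female among right answers:", pct_female_among_right)
```

The column percentages compare men with women; the row percentage describes the makeup of one answer group. They come from the same four cells but use different casebases.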

CONVENTION: When categories of the independent variable form the columns, a percent sign goes ONLY after the first figure at the top of each column (in this case, the "wrong answer" row) and after the 100 percent at the bottom of each column.

These conventions are particularly important as the number of values for each variable grows. They help your reader to immediately discern which way the percentages are calculated and they make your table much easier to read.
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

Answer to Question:                 Male      Female
Sun goes around Earth (WRONG)       13.8%     34.5%
Earth goes around Sun (RIGHT)       86.2      65.5
Total                               100.0%    100.0%
Casebase                            753       821

Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574

QUESTIONS ABOUT BIVARIATE RELATIONSHIPS

There is only so much that we can do with a single variable. But with two variables, the analytic possibilities open up!

We can discuss prediction, from one variable to another.

We can discuss causality. A first step in establishing causality is to examine the joint frequencies and see if one variable covaries with a second.

Covariation or correlation means that scores on a second variable change in some systematic way as scores on a first variable change.

If one variable causes a second, scores on the two variables should systematically correlate or covary. If two variables are, in fact, causally related, they should have a "statistically significant" and substantively important correlation.

Causation implies correlation.

However, the reverse does not hold.

Covariation alone does NOT mean that one variable causes a second. Two variables can be correlated, yet not be causally related to one another. That is:  correlation does not equal causation.

Virtually all scientists of whatever kind (educational, behavioral, social, life, physical, etc.) and educators care about cause and effect. If we understand what causes a phenomenon, whether the event is a speech disability or AIDS, the potential for changing that event is much greater. Knowledge about causes [independent variables] means greater chances for understanding, predicting and controlling effects [dependent variables].
 

We ask THREE MAJOR QUESTIONS about the relationship between two variables.

FIRST, we ask whether any apparent relationship between two variables in sample data is a statistical ACCIDENT caused by sampling error (sampling variability) or whether the relationship is REAL, that is, non-zero or "statistically significant".

What are the odds that an observed non-zero relationship in a sample is simply due to chance?

This is the question of statistical significance or statistical inference.  We generally test statistical significance with different sampling distributions and a probability density function (pdf).

The smaller the odds that an observed relationship is due to chance, generally the more confident we are that an observed relationship is REAL, that is, non-zero, in the population too, and not a chance accident that only holds for one particular sample.

Notice that we do NOT address strength of a relationship at this point, only whether the true association in the population is zero or not.



SECOND, if (and only if) the apparent relationship is probably REAL or "statistically significant", we ask HOW STRONG the relationship is. This is substantive significance or "effect size".

We often check substantive significance through the value of a correlation coefficient. Over the coming weeks, we will examine the properties of several different correlation coefficients so that we can choose the more appropriate one for our data, or so that we can assess the appropriateness of the chosen correlation coefficients in professional research projects.



THIRD, if the relationship is REAL and the strength is NONTRIVIAL, we ask about causality in the relationship.

This topic will be addressed in more depth in Guide 6 because analytically it involves multivariate analyses.
 


One possibility is that the variables are locked in a symmetric relationship and we cannot tease out which variable is the cause and which variable is the effect. One example is the correlation between marital status and reported mental health in men over 30. Married men over 30 report better mental health than never married men over 30.

But what's cause and what's effect? Some family researchers speak of the "buffer effects" of marriage, instilling greater mental health. There may also be self-selection effects, i.e., older men in poorer mental health are less likely to marry in the first place. When the cause is indeterminate, we speak of a symmetric relationship in which cause and effect cannot be unequivocally established.

To address question (3) typically requires three variables: the original two, and a third variable used as a statistical control variable.

For example, remember the case of the correlation between ice cream consumption and assault rates over the months of the year? Eating ice cream incites assault, perhaps? Or maybe people get so hot and sweaty assaulting others that a nice, cold ice cream hits the spot. Or, maybe the true situation looks like this instead:
 

 

          +   /  ICE CREAM CONSUMPTION
             /          | 
            /           | 
TEMPERATURE/            |   = 0 once temperature
IN DEGREES \            |     is controlled
            \           |
          +  \          |
              \  ASSAULT RATE
 

Any apparent "causal" relationship between ice cream consumption and assaults is "spurious," it occurred ONLY because a third variable, temperature in this example, caused both ice cream consumption and the assault rate.
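A small simulation can make the idea of a spurious relationship concrete. The Python sketch below is only an illustration under stated assumptions: the data are invented, the slopes and noise levels are arbitrary, and the helper pearson_r is my own. Temperature is built to drive both ice cream consumption and assaults, which never influence each other. The first-order partial correlation formula then removes the influence of temperature.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(5400)  # fixed seed so the sketch is repeatable

# Invented data: temperature drives BOTH outcomes.
temperature = [random.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + random.gauss(0, 10) for t in temperature]
assaults = [1.5 * t + random.gauss(0, 10) for t in temperature]

r_xy = pearson_r(ice_cream, assaults)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(assaults, temperature)

# Partial correlation of ice cream with assaults, controlling for temperature.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw correlation:     {r_xy:.2f}")
print(f"partial correlation: {partial:.2f}")
```

The raw correlation is sizable, yet the partial correlation hovers near zero once temperature is controlled, which is the pattern the diagram describes.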

Here's another example of a spurious relationship between two variables, uncovered by the use of a third, or control variable.

Did you know there is a positive relationship between the number of fire engines at a fire and the amount of dollars in fire damage? Yes, there is. Better avoid calling the fire department next time there is a fire! That's a sure way to pump up those insurance claims! Or, perhaps the situation really looks like the chart below instead:
 

 

          +   /  NUMBER OF FIRE ENGINES AT FIRE
             /          | 
            /           | 
   SIZE OF /            |   = 0 once size of
      FIRE \            |     fire is controlled
            \           |
          +  \          | 
              \  AMOUNT OF FIRE DAMAGE IN DOLLARS
 

It should be apparent by this time that it is often difficult to tease out the causal structure of a relationship between two variables. We will focus the most attention on causal structure in non-experimental studies in Guide 6.

A SAMPLE RELATIONSHIP: REAL OR ACCIDENTAL?
INFERENCE BASICS
COMMON DISTRIBUTIONS
CHI-SQUARE
DIFFERENCES IN MEANS ACROSS GROUPS

I will use the terms association, relationship and correlation more or less interchangeably.

We will start with question one:
 
 

 
Are the associations between two variables that we observe in any particular sample simply random variations from a true population correlation of 0, that can naturally occur from sample to sample, or is the association "real," i.e., somehow different from zero in a stable way?

If your data are from a total population, stop here! You do not need to infer from sample results to the population because you already have the population. You may proceed directly to question #2 and ask about relationship strength and effect size.
 

 
 
REVIEW! IMPORTANT
POPULATION PARAMETERS AND SAMPLE STATISTICS

When our measures include every possible case  that we can study in a particular group, or the entire collection of the elements that we wish to study, we have a census or population. It is typically difficult to study an entire population. 

Instead we take a sample or subset of cases. If we have a representative sample, we can make very good generalizations about our population (the inference function of statistics), always remembering that results will vary from sample to sample. Representative usually means PROBABILITY SAMPLE and we will have a section on samples later in this course.

We call the descriptive measures we calculate on a population parameters.

We call the descriptive measures we calculate on a sample statistics or statistical estimates.

As you now know, most of the data that we collect come from a sample. Most populations are too large and unwieldy to study feasibly without an enormous expenditure in time, effort and finances.  (Even attempts to study an entire population, such as the decennial U.S. Census, often undercount particular segments of the U.S. population.)

However, even from very good samples, the results can vary or fluctuate considerably from one sample to another. Thus, because most of our data are from samples or subsets of the population, what appears to be a relationship could, in fact, instead be an accidental finding.

The TRUE population parameter of the association between two variables could be zero, even though it misrepresented itself as nonzero in any one sample finding.

Stated a bit differently, even if the true relationship were zero  in the total population, in any one specific sample, the results could appear as though two variables were related (such as a correlation coefficient of .25)--but the observed numeric correlation could simply be a sampling fluctuation from sample to sample around a true population value of zero.

In statistical inference, we begin with a "null hypothesis".  You may be familiar with the idea of a null hypothesis, typically written as Ho, from your thesis or dissertation research. The idea in statistics is related to these kinds of uses.


One of the easiest mathematical and conceptual starting points is to assume that there is NO relationship between two variables in the population, i.e., the correlation between two variables is really zero.

While we actually could hypothesize that the true value of the correlation could be some specific non-zero number, typically we don't have sufficient information to guess at what that number could be. (Later in the semester, we will consider how prior knowledge may lead us to make more precise estimates about the value of a statistic.) Further, if we guess at a particular number for a null hypothesis, generally each different starting number can generate a pdf with a different form, shape, or peak. The EASIEST starting number to work with is a correlation of ZERO between two variables.
 

 
Restudy the paragraph below because it contains the heart of the question that we ask in what is called "classical statistical inference."
 
 
We ask ourselves this question: "if the population relationship between two variables really were zero, what are the odds that we would observe the results we found in this specific sample solely by accident or by chance"?

If the odds or the probability of observing our results by chance is high (say 80 samples out of 100 would have similar results when the true population relationship is zero), we conclude that any apparent relationship may well be an ACCIDENT, and we retain the assumption that the true relationship in the population is zero.

The lower or smaller the odds or the probability of observing our results solely by chance (say, only 5 samples in 100 would have similar results when the true population relationship is really zero), the more confident we become that the apparent relationship is REAL and not a chance fluctuation. After all, if we only take ONE sample, what are our odds of getting one of those unusual samples that "look real--but aren't" solely by chance? They are only one in twenty, and those are slim odds.
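The logic of "how often would this happen by chance" can be imitated with a simulation. The Python sketch below is an illustration only: the sample size of 30, the 2000 repetitions, and the .25 cutoff are assumptions chosen for the example, and pearson_r is my own helper. It draws many samples from a population in which the two variables are truly unrelated and counts how often the sample correlation still looks as large as .25.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2004)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30   # cases per sample (assumed for illustration)
N_SAMPLES = 2000   # number of independent samples drawn
CUTOFF = 0.25      # size of correlation that "looks real"

extreme = 0
for _ in range(N_SAMPLES):
    x = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    y = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]  # independent of x
    if abs(pearson_r(x, y)) >= CUTOFF:
        extreme += 1

share = extreme / N_SAMPLES
print(f"Share of samples with |r| >= {CUTOFF}: {share:.3f}")
```

With only 30 cases per sample, the share comes out well above 5 in 100, so a correlation of .25 in such a small sample is not convincing evidence of a real relationship.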

We write the probability (p) of observing a relationship solely by chance as:

p  =  (some figure between 0 and 1)     or
p  <  (some figure between 0 and 1)     or
p  >  (some figure between 0 and 1)

Probabilities are always between 0 and 1.
Here are some examples:

If there were NO relationship in the population between two variables (the correlation in the population is really zero)  then:
 

PROBABILITY STATEMENT: p < .01, OR "p is less than 1 chance in 100"
WHAT IT MEANS IN WORDS: the results we observed in our sample would occur by chance less than once in 100 samples if the true population correlation = 0

PROBABILITY STATEMENT: p = .10, OR "p equals 1 chance in 10"
WHAT IT MEANS IN WORDS: the results in our sample would occur by chance in exactly 10 out of 100 samples if the true population correlation = 0

PROBABILITY STATEMENT: p > .05, OR "p is greater than 5 chances in 100"
WHAT IT MEANS IN WORDS: the results in our sample would occur by chance in more than 5 in 100 samples if the true population correlation = 0

 


The cutoff probability that indicates that the sample relationship is probably real (non-zero) in the population for most behavioral or social science data generally is p < .05, that is:
 

 
If there really were NO relationship in the population, then the observed sample results would occur by chance in less than 5 in 100 independent samples of the same size and the same type (taken at about the same time).

Less than 5 chances in 100 is considered a sufficiently rare event or sample that it is unlikely to have occurred by accident, and therefore the relationship is probably REAL, that is, non-zero--"there is something there." We reject our beginning or "null" hypothesis of a zero correlation in the total population.

Note, however, that at this probability level you DO have a 5/100 chance of being wrong: you could declare the relationship real when the relationship in the population really is zero.

Perhaps you had one of the rare 5 samples in 100 that has a large result when the population result is zero. The chances of this happening are small (less than 5 in 100), but it does happen.

When p < .05, we often say the relationship is statistically significant (or REAL or non-zero).

[NOTE: If you have a hard time with the < and > symbols, you can use LT for "less than" and GT for "greater than" on exams.]

If you wanted to be even more confident that your data were unlikely to be due to a chance fluctuation, you might want to go out to p < .01 or even p < .001 (the second number is the odds of less than 1 in 1000 that you would get these results by chance if there were a zero correlation in the population.)



SOME WIDELY USED STATISTICAL DISTRIBUTIONS

What contributes in general to findings that are "statistically significant" (or, in our terms, "real")?

The "statistical significance" of a sample correlation coefficient or association depends on where it is located in a probability distribution or curve (PDF) for the sampling distribution. This location determines the relevant area under the curve. In turn, the relevant area under the curve gives the odds that the result you observed in one sample would occur by chance.

If the statistic for the correlation is extreme, or in the "far tail" of the distribution, that correlation would be very unlikely to occur if there were NO relationship between your two variables. Small odds of occurrence by chance mean the relationship is probably "real" or non-zero.

Some correlations follow an approximately normal sampling distribution, especially when the case base is quite large, so let's take another look at the normal curve:


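[Figure: normal curve showing the sampling distribution of a correlation under the null hypothesis assumption Ho: correlation = 0. Left tail: very large negative sample correlation. Center: sample correlation close to zero. Right tail: very large positive sample correlation.]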

Under most null hypotheses, the mean of the sampling distribution of a correlation would be assumed to be zero, that is, no relationship.

Very large positive correlations, or numerically large negative correlations under this null assumption would occur at the extremes, or way out in the tails of the distribution. If the true population correlation were really zero, we would be very unlikely to get a sample result with either a strong positive or a strong negative correlation. Such an event, if the true population value were zero, would only happen by chance in a very few extreme--and unrepresentative--samples.

Besides the normal pdf, common probability distributions include the t distribution (which acts like the normal distribution if the sample size is large). The t  is a flatter distribution than the normal curve with larger standard error distances when the casebase is less than 120. We will examine the t-distribution at the very end of this Guide. Another common pdf is the Chi-square distribution (X2) which we will examine shortly.

Contributing factors to the location of your statistic in its associated probability density function include:

The casebase. All things equal, results from larger samples are more likely to be "statistically significant" than smaller samples.

That is because large sample results are quite stable, they show very little variability from sample to sample and they have very small standard errors. A larger sample will give us a more precise estimate of a population parameter than a smaller sample will.

As a result, it takes a smaller "difference from zero" to be a stable, reliable difference in big samples than it does in small samples, where the standard error is larger, and an extreme correlation is more likely to occur by chance in a smaller sample.

The size of the correlation. All things equal, larger correlations are more likely to be statistically significant than smaller correlations. Larger effects are more likely to represent truly non-zero correlation findings than smaller effects.
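The two contributing factors above can be put into rough numbers. The Python sketch below rests on an approximation that I am supplying, not one stated in this Guide: when the true correlation is zero and the data are roughly normal, the standard error of a sample correlation is about 1 / sqrt(n - 1), and about 1.96 standard errors marks the two-tailed .05 cutoff. The casebases listed are arbitrary examples.

```python
import math

# Approximate |r| needed for p < .05 (two-tailed) at several casebases,
# assuming a true correlation of zero and roughly normal data.
needed_by_n = {}
for n in (25, 100, 400, 1600):
    standard_error = 1 / math.sqrt(n - 1)       # approximate SE of r under Ho
    needed_by_n[n] = 1.96 * standard_error      # normal-approximation cutoff
    print(f"n = {n:4d}  SE = {standard_error:.3f}  |r| needed = {needed_by_n[n]:.3f}")
```

The cutoff shrinks steadily as the casebase grows: a correlation that is far too small to trust in a sample of 25 can be a stable, reliable difference from zero in a sample of 1600.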

If our findings are "statistically significant," we reject the Null Hypothesis (Ho) of no relationship and accept an alternative hypothesis (generically: HA).


What does our alternative hypothesis look like?
It depends on the statistic we are working with.
Some statistics can ONLY take on positive values by definition.

For those statistics, the alternative to the null hypothesis is always positive: H > 0.
The Chi-square statistic is one example. As a squared measure, it cannot be less than zero.
Some statistics can take on either values larger than zero (+) or smaller than zero (-).
If you are unwilling to guess on direction, you can just say H ≠ 0.
This is called a "two tailed" hypothesis because the result could be larger or smaller than zero and occur in either "tail" of the distribution.

If you are willing to guess on a direction IN ADVANCE, and specify whether you think the correlation is larger OR smaller than zero, but not both, i.e.:

H > 0       OR         H < 0          but not both

you have what is called a "one-tailed hypothesis" because you specified direction IN ADVANCE. This is a more powerful and precise hypothesis.
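To see why a one-tailed hypothesis is called more powerful, compare the two probabilities for the same result. The Python sketch below assumes a statistic that follows the normal curve and uses a made-up value of 1.8 standard errors above zero; both the value and the variable names are hypothetical.

```python
from statistics import NormalDist

z = 1.8  # hypothetical test statistic, in standard-error units above zero

upper_tail = 1 - NormalDist().cdf(z)   # area in the right tail only
one_tailed_p = upper_tail              # direction (H > 0) was stated IN ADVANCE
two_tailed_p = 2 * upper_tail          # result could fall in either tail

print(f"one-tailed p = {one_tailed_p:.3f}")
print(f"two-tailed p = {two_tailed_p:.3f}")
```

The same result clears the p < .05 cutoff under the one-tailed hypothesis but not under the two-tailed one. That advantage is legitimate only when the direction really was specified before looking at the data.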



These are some of the basics of statistical significance, but we will revisit this topic throughout the semester. At this point, you need to be familiar with the basic terminology and the basic logic.

AN ASSOCIATION BETWEEN TWO VARIABLES AND THE CHI-SQUARE STATISTIC

First things first: the Chi-square statistic IS NOT a correlation coefficient or a measure of association.

Instead, X2 is a pdf that CAN answer our first question: is our correlation simply an accidental variation around a true population value of zero? Or is our correlation "real," that is, something different from zero?

The Chi-square probability density function can help tell us whether an apparent sample relationship between two nominal variables is real or accidental.

Chi-square can be a nominal statistical significance measure. You can use it when one or both variables are nominal. (However, if your independent variable is nominal and your dependent variable is numeric, you have some better choices: see the ANOVA and t-test descriptions later in this Guide.)

Also use Chi-square if the relationship between two variables is nonlinear or curvilinear (we will examine this issue in Guide Five) even if both variables are ordinal or interval.

Chi-square is the probability distribution that is used to test whether the phi and Cramer's V correlation coefficients are zero in the population or something nonzero. Because it is a squared measure, X2 can only be positive.

This is the PEARSON Chi-Square statistic, named after the statistician Karl Pearson. There ARE other X2 statistics in common use. One example is the Likelihood-Ratio X2 statistic, which is presented, along with the Pearson statistic X2, in many computer program outputs (such as the SDA or SPSS output).

Each of these different X2s has different properties, different calculations, and different uses.

For this course, we will use ONLY the Pearson X2 statistic.

If you take later statistics courses, you will meet some of the other X2s.

The formula for the Chi-square statistic compares deviations between the observed frequencies in a bivariate distribution and the frequencies that would be expected by chance if the two variables were totally unrelated or had a zero relationship.
 


The mathematical probability density function (PDF) that produces the X2 distribution is pretty messy and more complex than the normal distribution. You don't have to memorize this one, but if you applied the mathematical operation of integration to this function and you have a 2 X 2 table, the Chi-square distribution when the POPULATION X2 was assumed to be zero would look like my drawing below. (The graph for X2 looks different depending on the number of rows and columns in the table.) Here's the PDF for 2 rows and 2 columns, or a 4-cell table (courtesy of Dr. Ken Brewer's statistics book):


 

Here's what the Chi-square distribution would look like for a 2 X 2 table if we drew it as a graph. (Well, approximately, anyway.) The shape of the Chi-square distribution depends on the number of degrees of freedom, which for our purposes is determined by the number of rows and columns in the table: df = (rows - 1) X (columns - 1). In this example, for a 2 by 2 (total = 4 cells) table, df = (2 - 1) X (2 - 1) = 1.

TERMINOLOGY: DF   or    df    is short for the term "degrees of freedom" which we will discuss shortly.
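If you would like to trace the curve yourself, the Python sketch below evaluates the standard Chi-square density and, for df = 1 only, the area to the right of a given value. The function names are my own, and the closed-form tail area using erfc applies to the df = 1 case alone; treat this as an illustration, not as the formula reproduced from any particular textbook.

```python
import math

def degrees_of_freedom(rows, columns):
    """df for a crosstabulation: (rows - 1) * (columns - 1)."""
    return (rows - 1) * (columns - 1)

def chi_square_pdf(x, df):
    """Height of the Chi-square curve at x (for x > 0)."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))

def upper_tail_df1(x):
    """Area to the right of x under the df = 1 curve."""
    return math.erfc(math.sqrt(x / 2))

df = degrees_of_freedom(2, 2)   # a 2 X 2 table
for x in (0.5, 1.0, 2.0, 3.84):
    height = chi_square_pdf(x, df)
    area = upper_tail_df1(x)
    print(f"X2 = {x:4.2f}  height = {height:.3f}  area to the right = {area:.3f}")
```

For df = 1 the curve starts very high near zero and drops quickly, and only about 5 percent of the area lies to the right of 3.84.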

 


And, finally, here is the formula that we use to calculate Chi-square in bivariate distribution tables:

$$X^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is taken over every cell in the table.


The "O" is the OBSERVED frequency in a specific cell, say, row 1 column 1.  (O = Observed) The "E" is the EXPECTED frequency in the identical cell, say, row 1 column 1. (E = Expected)


How do we obtain these "expected frequencies" that would occur if the correlation between two variables is zero? The expected frequencies are determined by the marginals and the sample size.
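One common way to get expected frequencies straight from the marginals is: row total times column total, divided by the sample size. The Python sketch below applies that rule to the gender by Sun/Earth frequencies used in this Guide; the list layout and variable names are assumptions for illustration.

```python
# Observed frequencies: rows are answers, columns are male, female.
observed = [
    [104, 283],   # Sun goes around Earth (WRONG)
    [649, 538],   # Earth goes around Sun (RIGHT)
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if the variables were unrelated:
# row total * column total / sample size.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

for row in expected:
    print([round(value, 1) for value in row])
```

Hand calculations that first round the expected proportion to three decimals can differ from these figures by about a tenth of a case, which is harmless.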

EXAMPLE: CALCULATING AND INTERPRETING CHI-SQUARE

Let's take a look at the relationship between gender and the Earth and the Sun again.

Below are the OBSERVED FREQUENCIES:
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

Answer to Question:                 Male            Female    Total
Sun goes around Earth (WRONG)       104 (r1, c1)    283       387
Earth goes around Sun (RIGHT)       649             538       1187
Total                               753             821       1574

(At the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined.)

Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574

Now, below are the OBSERVED COLUMN PERCENTAGES. Notice I have added a column for percents on the planets question for the total sample to the far right:
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

OBSERVED PERCENTAGES
Answer to Question:                 Male      Female    Total Sample
Sun goes around Earth (WRONG)       13.8%     34.5%     24.6%
Earth goes around Sun (RIGHT)       86.2      65.5      75.4
Total                               100.0%    100.0%    100.0%
Casebases                           753       821       1574

Source: NSF Surveys of Public Understanding of Science and Technology, 2001, Director, Opinion Research Corporation/MACRO.
n = 1574

If there were NO ASSOCIATION between gender and the science question, what would the percentages in each column look like? The answer is that the percentages for women and men would be the same as they are for the total sample, as you see below for the EXPECTED PERCENTAGES:
 
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

HYPOTHETICAL EXPECTED PERCENTAGES
Answer to Question:                 Male      Female    Total Sample
Sun goes around Earth (WRONG)       24.6%     24.6%     24.6%
Earth goes around Sun (RIGHT)       75.4      75.4      75.4
Total                               100.0%    100.0%    100.0%
Casebases                           753       821       1574

We can use our knowledge of the relationship among frequencies, percentages, and proportions to turn the percentages in each column back into EXPECTED FREQUENCIES.

The expected frequencies are the frequencies that would be expected if there were no association between gender and the science question in the population at large. This is sometimes called the "independence hypothesis," that is, in this instance, gender and answers to the science question would be independent from one another.
 


 NUTS AND BOLTS: CALCULATING EXPECTED FREQUENCIES

For row 1 column 1, the expected percent of men giving the wrong answer is 24.6 and the expected proportion is .246.

Multiply the expected proportion of men giving the wrong answer by the total casebase FOR MEN ONLY:

.246 X 753 = 185.2

Notice that with a 2 X 2 table, I have to calculate the expected frequency for ONLY ONE CELL in the table. See the yellow cell in the table below. Because of the marginal totals in the far right column and the bottom row, I can get the other three cells by subtraction. We use the OBSERVED ROW AND COLUMN TOTALS to calculate the EXPECTED CELL frequencies.

For example, the expected cell frequency for the number of women giving the wrong answer will be:

387 - 185.2 = 201.8   or

Total observed frequency giving wrong answer - expected frequency MEN giving wrong answer =

expected frequency WOMEN giving wrong answer

Be sure to distinguish among observed and expected frequencies in the "right places."

The expected cell frequency for the number of men giving the right answer will be:

753 - 185.2 = 567.8   or

Total observed male casebase - expected frequency MEN giving wrong answer =

expected frequency MEN giving right answer
 
 

How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?

EXPECTED CELL FREQUENCIES          Male     Female   Total Sample
Answer to Question:
Sun goes around Earth (WRONG)      185.2    201.8     387
Earth goes around Sun (RIGHT)      567.8    619.2    1187
Casebases                          753      821      1574

And, again by subtraction, we find that the expected frequency of WOMEN giving the RIGHT answer = 619.2

Notice that I only had to calculate the expected frequency for ONE CELL (Male, Wrong) (the "yellow cell") out of the four cells in the table, then I could obtain the other three expected frequencies in the table by subtraction from the marginal totals.

This means I only have ONE "degree of freedom" in my 2 X 2 table, or one "independent" piece of information. Once the expected number of males giving the wrong answer is calculated, all the other three interior cells of the table are calculated through subtraction.
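
The steps above can be sketched in Python. This is only an illustration under one assumption: it uses the unrounded expected proportion (387 / 1574) rather than the rounded .246, so the first cell comes out a few hundredths below the 185.2 shown in the guide. The function name `expected_frequency` is hypothetical.

```python
# Observed marginal totals from the gender by science question table.
row_totals = {"wrong": 387, "right": 1187}
col_totals = {"Male": 753, "Female": 821}
grand_total = 1574

def expected_frequency(row_total, col_total, n):
    """Expected cell frequency under independence: (row total / n) * column total."""
    return row_total / n * col_total

# Calculate ONE cell directly (Male, Wrong) ...
e_male_wrong = expected_frequency(row_totals["wrong"], col_totals["Male"], grand_total)

# ... and obtain the other three by subtraction from the observed marginals.
e_female_wrong = row_totals["wrong"] - e_male_wrong
e_male_right = col_totals["Male"] - e_male_wrong
e_female_right = col_totals["Female"] - e_female_wrong

print(round(e_male_wrong, 1), round(e_female_wrong, 1))
print(round(e_male_right, 1), round(e_female_right, 1))
```

Because the three remaining cells are fixed once the first is known, the sketch needs only one call to `expected_frequency`, which is the one degree of freedom in action.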

Using both the expected and the observed frequencies for our gender by science question table and the Chi-Square calculation formula, here are the calculations:
 
 
CALCULATING THE CHI-SQUARE

SCROLL BACK TO EQUATION 7.10 TO CHECK OUT THE X2 FORMULA. 

We start with the row 1,1 cell, then the 1,2 cell.
Thus, calculations are done for the first row, left to right.
We proceed to the second row to calculate the X2 component for each of the cells in turn, from left to right.
We do NOT use the row or column marginals in the Chi-Square formula, only the internal cells in the table itself.

[ (104 - 185.2)^2 ÷ 185.2 ] +
[ (283 - 201.8)^2 ÷ 201.8 ] +
[ (649 - 567.8)^2 ÷ 567.8 ] +
[ (538 - 619.2)^2 ÷ 619.2 ]

OR

[ (-81.2)^2 ÷ 185.2 ] + [ (81.2)^2 ÷ 201.8 ] + [ (81.2)^2 ÷ 567.8 ] + [ (-81.2)^2 ÷ 619.2 ]

OR

[ 6593.44 ÷ 185.2 ] + [ 6593.44 ÷ 201.8 ] + [ 6593.44 ÷ 567.8 ] + [ 6593.44 ÷ 619.2 ]

(Remember, the square of a negative number in arithmetic is a positive number.)

OR

35.60  + 32.67  + 11.61 + 10.65  =  90.53

Chi-square might be small, or it might be large, but Chi-square can NEVER be a negative number.

Writing it the way X2 is typically presented:

X2 (1) = 90.53

The (1) subscript means that there is ONE degree of freedom in the table. You must include this information for your reader (or you) to accurately assess the value of Chi-Square.

Although you will not need to calculate Chi-Square by hand, you do need to follow the logic in its calculations above.
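
To follow the logic without hand arithmetic, here is a minimal Python sketch of the same sum over the four interior cells. It assumes the rounded expected frequencies used in the guide; the function name `pearson_chi_square` is illustrative. Because it does not round each component before adding, the total can differ from the hand calculation in the second decimal place.

```python
# Interior cells only, row by row, left to right (no marginals).
observed = [104, 283, 649, 538]
expected = [185.2, 201.8, 567.8, 619.2]

def pearson_chi_square(obs, exp):
    """Sum of (observed - expected)^2 / expected over every interior cell."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Each cell's separate contribution, then the total.
components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = pearson_chi_square(observed, expected)

print([round(c, 2) for c in components])
print(chi_square)
```

Every component is a squared quantity divided by a positive expected frequency, which is why the total cannot be negative.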

IS THE ASSOCIATION REAL OR ACCIDENTAL?  X2 AND STATISTICAL SIGNIFICANCE

REMEMBER: Our null hypothesis is that Chi-Square is zero, which means that there is NO association between gender and the science question in the population.

Given this null hypothesis, our alternative hypothesis is that there is some association between gender and the science question in the population, no matter how small, and, therefore, Chi-Square is bigger than zero (the Pearson X2  is a squared measure so it is positive.)

In symbols we write:

Ho : X2  = 0
HA : X2  > 0
 
 
AN INTRODUCTION TO THE "DEGREES OF FREEDOM"

We always evaluate X2 relative to its associated degrees of freedom, because the greater the number of rows and columns in a table, that is, the larger the size of the table, the larger X2  becomes. A moment's reflection, and examination of the calculation formula for X2, should convince you of this because with each added row or column, we add in more pieces of X2 .

Whether a relationship is different from zero or not should NOT depend on the table size.

So we need to control for the size of the table, and that's what considering the degrees of freedom (or df  or DF for short) does. In addition, every change in table size changes what the distribution of  X2 even looks like, and what the probability values are.

The df for the number of rows in the table is calculated as (# rows - 1).
This is because if you know the row total, and the frequencies for all but one of the cells, you can calculate the frequency in the last cell in the row by subtraction.

The df for the number of columns in the table is calculated as (# columns - 1).
This is because if you know the column total, and the frequencies for all but one of the cells, you can calculate the frequency in the last cell in the column by subtraction.

For the entire table, the df  = (# rows - 1) * (# columns - 1).  For example, a 2 X 2 table would have one total degree of freedom. We start with four degrees of freedom (one for each cell in the table), then:

subtract 1 df for the row total
subtract 1 df for the column total
subtract 1 df for the grand total

That leaves one degree of freedom: if we know the frequency of just one cell in the table, and the three pieces of information above, we can calculate the frequencies in the other three cells of the table through subtraction.

Thus, in shorthand, for our association between two variables, the degrees of freedom are:

df = ( r - 1 ) * ( c - 1)
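
The formula is short enough to sketch as a one-line Python function. The function name is an illustrative assumption.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a crosstabulation: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

print(degrees_of_freedom(2, 2))  # the gender by science question table
print(degrees_of_freedom(4, 5))  # a table with 4 rows and 5 columns
```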



LOOKING UP X2 IN A TABLE

These days, most computer statistical programs automatically calculate exact probability levels for you so that you can assess whether the probable association in the population is zero or nonzero (see my SDA example below and Assignment 3). However, in "the bad old days" we had to look up the value of X2  in a table and decide whether it was an accident (really zero in the population) or real (really nonzero in the population) and we need to know the df  (or control for table size) to do this.

See Table C (page 670) in Agresti and Finlay for an example. Let's look at the pieces of such a X2 table (somewhat abbreviated for this page):

The far left column is for degrees of freedom (df). Remember this depends on table SIZE.

 In a 2 X 2 table like the example above, there is 1 degree of freedom, so we look at "row 1". (Marked in blue.)

The "p" levels across the top are the probabilities that an X2 at least as large as the value listed in the table will happen in any one sample, GIVEN THAT X2  IS REALLY ZERO IN THE POPULATION.

For example, if X2  were really zero in the population, and we had a 2 by 2 table,  X2s as large (OR larger) as 3.84 would occur by chance in only 5 samples in 100 (or ".05").
 

df / p    .250     .100     .05     ....     .01     .005     .001
1         1.32     2.71     3.84             6.63     7.88    10.83
2         2.77     4.61     5.99             9.21    10.60    13.82
3         4.11     6.25     7.81            11.34    12.84    16.27
4         5.39     7.78     9.49            13.28    14.86    18.47
5         6.63     9.24    11.07            15.09    16.75    20.52
6         7.84    10.64    12.59            16.81    18.55    22.46
7         9.04    12.02    14.07            18.48    20.28    24.32
...
30       34.80    40.26    43.77            50.89    53.67    59.70

Thus, for the gender and science question table, the X2 (with 1 df) was 90.53
Notice that this result is WAY larger than the X2  value listed in the table for the p = .001 level, which is only 10.83.
Thus, we REJECT the null hypothesis that X2 is really 0 in the population for the association between these two variables, and conclude that the relationship between gender and the science question is real, or nonzero.

However, WE COULD BE WRONG!  X2 might really be 0 in the population and we just got one of those rare, unrepresentative samples where X2  looked like it was real (but was really just a sampling accident).

We DO know the odds of our being wrong. They are the probability of our results being an accident if X2 is really zero. Those odds are less than one in 1000 samples or p < .001.

Thus, we REJECT the null hypothesis that X2 is really 0 in the population for the association between these two variables, and conclude that the relationship is real, or nonzero. And we include the chances of our being wrong when we present the X2  statistic:

X2 (1) = 90.53,  p < .001
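
Instead of a printed table, the probability can be computed directly. The sketch below is one possible way to do it with only the Python standard library, using the closed-form expressions for the upper tail of the Chi-square distribution when df is a whole number. The function name `chi_square_p_value` is an assumption for illustration; statistical packages provide their own routines.

```python
from math import erfc, exp, pi, sqrt

def chi_square_p_value(x, df):
    """Probability of a Chi-square at least as large as x, given no association.

    Uses the exact closed forms for integer df: a finite series for even df,
    and the complementary error function plus a finite series for odd df.
    """
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return exp(-half) * total
    total = erfc(sqrt(half))
    term = sqrt(2 * x / pi) * exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# Critical values from the 1 df row of the table, then our sample result.
for value in (3.84, 6.63, 10.83, 90.53):
    print(value, chi_square_p_value(value, 1))
```

For very large values of X2 the result underflows toward zero in floating point; it should still be reported as "p < .001" (or a similar bound), never as zero.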

 

 
To completely present your Chi-square results for a particular table in an analysis, you must include:
  • The X2 symbol
  • The degrees of freedom for the table
  • The X2 value itself
  • The probability sign (and <  >  or = )
  • And the fractional odds of your making a mistake if X2 is really zero, i.e., "the probability level"
 


 
AN SDA EXAMPLE: BIVARIATE DISTRIBUTION

What you see below is an example of a larger, 4 X 5 table from the Current Population Survey August 2000 data. It looks at the association between educational level ("reduc," ORDINAL VARIABLE) and the number of computers in the household ("pcnum," RATIO VARIABLE).

By examining the column percents from left to right, we can see that (1) the percentage of households with NO computer decreases as we go from those with less than a high school diploma to those with an advanced college degree and (2) the percentage of households with at least 3 computers increases as we go from the poorly educated to the well-educated. Notice that the table also shows SDA's "color coding" to examine how relationships work quickly. "Red" or "pink" cells have more frequencies than would be expected by chance. "Blue" cells have fewer frequencies than would be expected by chance.
 
 
Frequency Distribution
Cells contain: column percent (first line), N of cases (second line)

Columns (reduc):
  1 = 12th grade, No Diploma or less
  2 = High School Grad-Diploma Or Equiv (GED)
  3 = Some College But No Degree / Associate Deg.
  4 = Bachelor's Degree (ex. ba, ab, bs)
  5 = Master's Deg. / Professional Deg. / Doctorate Deg.

pcnum            1         2         3         4         5    ROW TOTAL
0             64.7      53.7      34.0      23.2      18.2       43.7
            12,758    15,906     8,204     3,359     1,255     41,482
1             26.2      36.5      47.0      49.6      47.8       39.8
             5,169    10,803    11,334     7,175     3,293     37,775
2              6.2       7.2      13.1      18.3      21.9       11.3
             1,220     2,133     3,164     2,645     1,510     10,671
3              2.9       2.7       5.9       8.8      12.1        5.2
               576       786     1,424     1,274       833      4,892
COL TOTAL    100.0     100.0     100.0     100.0     100.0      100.0
            19,723    29,628    24,126    14,452     6,891     94,821
Means          .47       .59       .91      1.13      1.28        .78
Std Devs       .74       .74       .84       .87       .90        .84
Color coding: <-2.0 <-1.0 <0.0 >0.0 >1.0 >2.0 T
N in each cell: Smaller than expected Larger than expected

 

 
The SDA program also calculates the mean scores and standard deviations for each column. The mean is the arithmetic average on the dependent variable for each separate category of the independent variable. This is OK for this particular example ONLY BECAUSE "pcnum" IS A RATIO VARIABLE. If the row variable were ordinal or nominal, you should simply ignore the means and SDs.

You can look at the SDA "color coding" to see at a glance that we have more cases than would be expected by chance (IF THERE WERE NO RELATIONSHIP) in the high education-lots of computers cells, and more cases than would be expected by chance in the low education-no computer cells.
 
 

Below are the statistics that the SDA program gives you. USE THE Chisq(P) = ONLY. This is the Pearson Chi-Square.

Eta, R, Somers' d, Gamma, Tau-b, and Tau-c are correlation coefficients (which we will examine in  Guide 5).
 
 
 
IMPORTANT IN DATA INTERPRETATION:    Associated with the Pearson Chi-Square in the SDA output is the p = value of ".0000".

This REALLY means that p < .0001 (one chance in ten-thousand) because most computer statistical programs truncate at three or four decimal places. SPSS does the same thing.

If you get a result like this when you do an assignment or analyse your own data, please be sure to write p < .0001 (DO NOT put a row of zeros!)*
 

* I have been a committee member on two doctoral committees over the past summer where the doctoral candidate incorrectly put a row of zeros as a probability level because they did not know how to interpret their computer output.
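
A small helper can guard against reporting a row of zeros. This is a hypothetical sketch, not a feature of SDA or SPSS; the function name `format_p` and the four-decimal cutoff are assumptions that mirror the four-decimal output shown in this guide.

```python
def format_p(p):
    """Format a probability level for a write-up, never as a row of zeros."""
    if p < 0.0001:
        # Output that displays ".0000" has simply run out of decimal places.
        return "p < .0001"
    # Drop the leading zero, as is conventional when reporting probabilities.
    return "p = " + f"{p:.4f}".lstrip("0")

print(format_p(0.0))
print(format_p(0.0312))
```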
 
Summary Statistics
Eta* = .32 Gamma = .40 Chisq(P) = 11,296.52 (p= .0000)
R = .31 Tau-b = .29 Chisq(LR) = 11,466.23 (p= .0000)
Somers' d* = .26 Tau-c = .27 df = 12
*Row variable treated as the dependent variable.

REVIEW: THE CHI-SQUARE PDF, STATISTICAL SIGNIFICANCE, AND "PRACTICAL SIGNIFICANCE"

Because of the shape of the Chi-Square Distribution, MOST sample X2s will be near 0 if there is no relationship between two variables in the population.

However, if there is at least some relationship between two variables in the population, the value of X2 in any particular sample can be very large.

If the value of Chi-square is quite large, we conclude the apparent sample relationship is REAL (nonzero) in the population. If the Chi-square value is near 0, we conclude the apparent relationship is an ACCIDENT of this particular sample.

The size of X2 is influenced by:

  table size
  sample size and
  how strong the relationship is

Since three different factors influence Chi-square, a large X2 DOES NOT MEAN that you have a strong or large relationship. Relationship strength or effect size is a separate issue (see Guide 5). Relationship strength is the second question we must answer about the relationship between two variables.

You could have a large X2 just because you have 4000 cases and the results are very reliable (large samples are more reliable).
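
To see the effect of sample size alone, the sketch below computes X2 for the gender by science question table and then for a hypothetical table with exactly the same percentages but ten times as many cases. The enlarged table is invented purely for illustration, and the function name is an assumption. Expected frequencies are computed without rounding, so the first result is slightly below the hand-calculated value.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square for the 2 X 2 table [[a, b], [c, d]]."""
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# The observed table: wrong answers in row 1, right answers in row 2.
actual = chi_square_2x2(104, 283, 649, 538)

# HYPOTHETICAL: identical percentages, ten times the casebase.
ten_times = chi_square_2x2(1040, 2830, 6490, 5380)

print(actual, ten_times, ten_times / actual)
```

The relationship is no stronger in the second table (the column percentages are identical), yet X2 is ten times larger, which is why relationship strength has to be assessed separately.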
 

THE T-DISTRIBUTION AND THE F-DISTRIBUTION

One way of looking at the association between two variables is to construct a crosstabulation table. You can create such a table whether your variables are nominal, ordinal, interval, or ratio. However, a table can quickly become difficult to read as the number of rows and columns grows.

What if an independent variable has only two categories, but the other variable is not only interval, but has several categories--perhaps even several dozen categories, such as the variable years of age? As you already know, it is difficult to work with a variable that has several dozen categories in a table. Here are some examples:

The difference between means on a scale of several items of basic science knowledge for women and men.

The difference in response to a new antibiotic among individuals with a bacterial respiratory infection. Control Group 1 gets the current antibiotic, and Intervention Group 2 gets the new drug. We compare the mean number of days to return to health in each group. The range could be well over a dozen days to recovery.

Two different methods of teaching reading are compared. The dependent variable is the number of vocabulary words learned over six weeks among second grade students. A pupil could learn dozens of vocabulary words over that time.

Mean weight loss in pounds is compared among two groups who used different exercise methods for three weeks.

Instead of a crosstabulation table, you can examine the difference in means across two (or more) groups. This is a slightly different form of examining an association between the two variables, but the same general logic holds.

Once again, it is easiest, both logically and mathematically, to postulate that if you had both of the total populations (say, women and men), the means for both groups would be the same. Or, put slightly differently, that there was no difference in the means for the two populations. Or, again put slightly differently, that:
 
 

Ho: µ1 = µ2
and thus
µ1 - µ2 = 0

And if you have MORE than two groups:
 
 

Ho: µ1 = µ2 = µ3 = ... = µk

where k is the number of groups

The generic idea is that IN THE POPULATION, the means on the DEPENDENT variable for each value of the INDEPENDENT variable are all the same.

FOR EXAMPLE:

We test whether at least one of the means is different from the others (and at this stage, it doesn't matter which one) using the t-test (two groups) or the F-ratio (two or more groups).

In addition to testing the difference in means across two groups, the t-distribution is a probability distribution that is often used with ordinal or interval correlation coefficients (such as r or tau-b). The t resembles the normal distribution (e.g., bell shape) but is flatter on top. t can be positive or negative just like a Z score. An absolute value of t greater than 1.96 in a sample of several hundred cases usually corresponds to p < .05.

In other words, if the absolute value of the t-test for a difference in means between two groups is at least 1.96 in a large sample, the population relationship is probably REAL. (That's the t-test, not just mean #1 minus mean #2.)

The t-distribution can be viewed as a special case of the F distribution when the number of groups (k) equals only 2. For this reason, many computer programs, including the SDA, only calculate F-ratios, not ts.

If your computer output (as ours does) only gives the F-ratio AND
you only have two groups to compare (such as men and women),
SIMPLY TAKE THE SQUARE ROOT OF THE F-RATIO.

The result will be the absolute value of the t (that is, no positive or negative sign; see Assignment 3 for what to do if you do want a positive or negative sign for the t.) The probability level will be OK, but you need to do n - 2 for the t-distribution degrees of freedom (not n - 1).
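
The square-root relationship can be checked numerically. The sketch below assumes the pooled-variance form of the two-group t-test (the form that goes with n - 2 degrees of freedom) and a standard one-way F-ratio; the scores are made-up illustration data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean

def pooled_t(group1, group2):
    """Two-group t using a pooled variance estimate (df = n1 + n2 - 2)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (m1 - m2) / standard_error

def one_way_f(groups):
    """F-ratio: between-group mean square divided by within-group mean square."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    k, n = len(groups), len(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# HYPOTHETICAL science knowledge scores for two small groups.
women = [12, 15, 11, 14, 13, 16]
men = [10, 12, 9, 13, 11, 8]

t_value = pooled_t(women, men)
f_ratio = one_way_f([women, men])
print(t_value, f_ratio, sqrt(f_ratio))
```

The square root of F recovers only the size of t. The sign has to come from looking at which group mean is larger.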
 


 Here is what the PDF that draws the curve for the t-test looks like (the one for the F distribution is even more complex):
 
AND YOU THOUGHT THE NORMAL DISTRIBUTION AND THE X2 PDF WERE BAD: 
GET A LOAD OF "STUDENT'S t"
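
For reference, the standard textbook form of the probability density function for Student's t can be written as below. The symbols are assumptions of this sketch, not notation taken from the guide: ν (nu) stands for the degrees of freedom and Γ is the gamma function.

```latex
f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}
            {\sqrt{\nu \pi}\; \Gamma\left(\frac{\nu}{2}\right)}
       \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu + 1}{2}}
```

As ν grows large, this curve approaches the normal distribution, which is why 1.96 works as a rule of thumb in samples of several hundred cases.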

 
 
THE T-DISTRIBUTION AND MEAN DIFFERENCES BETWEEN GROUPS

Below is the computational formula for the t-test for the difference between mean scores on two values of the same nominal variable. Again, computers calculate these formulas faster and more accurately than people can, but there are a few things I want you to notice:

The computations for the F-ratio are much more complicated than for the t and we will examine them in depth when we look at multivariate interval-level variable distributions in the multivariate section of our course.

COMPUTATIONAL FORMULA FOR THE MEAN DIFFERENCES BETWEEN TWO GROUPS:


 
AN SDA EXAMPLE: THE DIFFERENCE IN MEANS ACROSS SEVERAL GROUPS

When we compare mean scores on an interval-level dependent variable across a nominal independent variable with only two values or groups, we call the statistic a "t-test."

If we have MORE THAN two values on the nominal variable, we call this analysis a "one-way analysis of variance" ("one way" means only one nominal variable) or a one-way "ANOVA" for short.

The output below compares mean scores on the number of computers in the household across educational levels. In other words, we are trying to make the same assessment that we did in the crosstabulation table, but this time we are looking at mean number of computers instead of the bivariate percentage distribution.

The table below of mean scores across groups makes it easy to  spot that the average number of computers per household rises steadily with educational level, from 0.47 average computers for those with less than a high school degree to an average 1.28 computers for households with advanced degrees.
 
 
Main Statistics
Cells contain: means, SRS std errs, std devs, N of cases, weighted N

reduc                                                Mean   SRS Std Err   Std Dev   N of cases   Weighted N
1 12th grade No Diploma or less                       .47      .005         .740      19,420      19,723.3
2 High School Grad-Diploma Or Equiv(GED)              .59      .004         .739      30,068      29,627.7
3 Some College But No Degree/Associate Deg.           .91      .005         .836      24,337      24,126.4
4 Bachelor's Degree(ex. ba,ab,bs)                    1.13      .007         .867      14,519      14,452.1
5 Master's Deg./Professional Deg./Doctorate Deg.     1.28      .011         .898       6,952       6,891.0
COL TOTAL                                             .78      .003         .841      95,296      94,820.6

Recode for 'pcnum' 0 = -1 "0"; 1 = 1 "1"; 2 = 2 "2"; 3 = 3 "3"
 
Color coding: <-2.0 <-1.0 <0.0 >0.0 >1.0 >2.0 T
Mean in each cell: Smaller than average Larger than average

As with the Chi-square distribution, the computer will calculate the exact odds of obtaining these SAMPLE results if all the means for number of computers across educational levels IN THE POPULATION were the same (that is, there are NO mean differences in computer ownership across levels of education).

In our example, if there were no mean differences in the number of computers across educational level, we would expect to observe an F-ratio as large as the one below in:

LESS THAN ONE IN 10,000 SAMPLES JUST BY CHANCE!
That's what the row of zeros under the "P" means.

For our purposes the parts of the table that I have colored yellow are the ones that we will use for this part of the course. We will look at these pieces later in Guide 5 and Assignment 3.
 
Analysis of Variance
                  SSQ   Eta_sq       df         MSQ           F       P
reduc       6,799.039     .101        4   1,699.760   2,687.880   .0000
Residual   60,260.052     .899   95,291        .632
Total      67,059.091    1.000   95,295

Remember: the row of ".0000" under the "P" REALLY means p < .0001

Now, it's decision time:

Do you accept the null hypothesis and say, "well, I just got one of those weird 1 in 10,000 samples and there really are no educational level differences in computer home ownership" (admittedly a VERY rare event, but it could happen)?

Or, do you reject the null hypothesis (F = 0 or t = 0, which represents NO mean differences across educational groups) and say that the relationship between educational level and number of household computers is nonzero and REAL? That at least one educational group has higher home computer ownership than the others?

Most of us would say the relationship was real, but one out of 10,000 times we would be wrong. The population differences among means really could be zero (so the population value of F would be 0) but you got the 1 in 10,000 samples where you had an extreme result, an enormous F--but it was just a weird and unrepresentative sample.

We write this as:

F (4, 95291) = 2687.88, p < .0001
(DO NOT write a row of zeros here!)

What are the degrees of freedom (df) for this example?
For the F-ratio, they are the number of groups minus one for the first figure (in this example: 5 -1 = 4).

For the second figure, the df are n - k (where k = the number of groups; in this example: 95,296 - 5 = 95,291).

IF YOU ONLY HAD TWO GROUPS (not 5):

Take the square root of the F and you have the absolute t-value.
The degrees of freedom will be n - 2.
The probability level will be IDENTICAL to the value in your output under the "P" header.
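
The pieces of the Analysis of Variance output fit together arithmetically, as the sketch below shows. It takes the sums of squares (SSQ) and the number of cases from the SDA output above and rebuilds the degrees of freedom, mean squares (MSQ), F-ratio, and Eta squared. Variable names are illustrative assumptions.

```python
# Figures copied from the SDA output above.
ssq_between = 6799.039    # SSQ for reduc (between educational groups)
ssq_residual = 60260.052  # SSQ residual (within groups)
n_cases = 95296           # N of cases in the COL TOTAL row
k_groups = 5              # number of educational levels

# Degrees of freedom: k - 1 between groups, n - k within groups.
df_between = k_groups - 1
df_residual = n_cases - k_groups

# Mean squares are sums of squares divided by their degrees of freedom.
msq_between = ssq_between / df_between
msq_residual = ssq_residual / df_residual

# The F-ratio compares the two mean squares.
f_ratio = msq_between / msq_residual

# Eta squared: share of the total sum of squares that lies between groups.
eta_squared = ssq_between / (ssq_between + ssq_residual)

print(df_between, df_residual)
print(round(f_ratio, 2), round(eta_squared, 3))
```

The residual degrees of freedom in the output (95,291) equal n - k, and the total (95,295) equals n - 1.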

WHAT KIND OF SAMPLE DID YOU TAKE?
PROBABILITY VERSUS NON-PROBABILITY SAMPLES

If you have a population, stop here! You do not need inference measures under most circumstances.

However, most of the data that we collect come from samples. Therefore, how the sample is taken is critically important.

We can put the laws of probability to work for us and we can use the properties in probability density functions only under certain specified conditions.

#1. Our data come from a probability sample. This means that the cases were selected by systematic means that do not depend on human judgment.

Each element in a probability sample MUST have an a priori known, non-zero chance of selection.

The chances of selection can be unequal.
(Only novices think that probability samples mean equal chances of selection. For example, disproportionate stratified samples are also probability samples.)

Common probability samples include:

The simple random sample (srs), akin to drawing well-mixed numbers from a jar. Random digit dial telephone surveys approximate srs.

The systematic sample, in which every nth element is selected after a random start, for example, every 100th registered FSU student might be selected after a random starting point.

The cluster sample. One example is when a set of entire classrooms  (each classroom is a "cluster") is selected from a school district, perhaps with a simple random sample of classrooms.
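
The three designs listed above can be sketched with Python's standard `random` module. Everything in the example is hypothetical: the population of 1,000 numbered students, the 30 classrooms of 20 pupils each, and the function names. Real surveys involve sampling frames and fieldwork that this sketch ignores.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n elements without replacement, each with an equal chance."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Take every step-th element after a random starting point."""
    start = rng.randrange(step)
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(2004)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL populations.
students = list(range(1, 1001))
classrooms = {
    f"room_{i}": [f"pupil_{i}_{j}" for j in range(20)] for i in range(30)
}

srs = simple_random_sample(students, 50, rng)
every_100th = systematic_sample(students, 100, rng)
three_rooms = cluster_sample(classrooms, 3, rng)

print(len(srs), len(every_100th), len(three_rooms))
```

In all three designs a random mechanism, not a person, decides who is selected, and every element has a known, non-zero chance of selection.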
 
 

OR, INSTANCE NUMBER 2:

#2. Some form of probability method is used to assign elements to different intervention or treatment groups.

Typically, this is the random assignment ("randomization") used in "true" experiments (and possibly the random assignment of intact groups in some "quasi-experimental" conditions).

However, if you have not selected your entire sample of elements from the population using probability methods, you can NOT generalize to a larger population. This is because your cases do not represent any larger population; they only represent themselves.

If you have used random assignment to place cases in intervention groups, you CAN use inferential methods to compare the differences across the two or more groups. This is because we can assume all your intervention groups were approximately the same (on the average) at the beginning of your study, and the only difference was the coin flip or random number that put a person in Treatment Group 1 rather than in Treatment Group 2.


A relationship or correlation may be statistically significant (i.e., non-zero) but that does NOT mean it is a large or even moderate size association. With very large samples, very small correlations can be statistically real and non-zero because the standard errors are small and the estimates relatively stable. We see this occur in epidemiological studies where the federal government has samples of 50,000.

How large or how strong a relationship ("effect size") is a different question from how statistically significant it is. However, we need to address the statistical significance question FIRST.

If your results in the population are probably really zero, then no matter how impressive your sample correlation appears to be, you know the strength of it already: it is zero.



 
 

Susan Carol Losh September 23, 2004