BE SURE TO CHECK THE NEW DUE DATE!!


 

DUE FRIDAY


ASSIGNMENT 4 DUE NOVEMBER 12 by 3 PM


 

PUT IN MY MAILBOX
307 STONE BUILDING
EARLY ASSIGNMENTS ACCEPTED


 
EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH

PLEASE NOTE THE DUE DATE:  IT IS FRIDAY  NOVEMBER 12 BY 3 PM!
 

 

ASSIGNMENT 4: CROSSTABULATIONS INCLUDING A CONTROL VARIABLE 
ISSUES OF CAUSALITY IN NON-EXPERIMENTAL DATA
20 POINTS


DUE TO THE INTENSE NATURE OF THIS COURSE
LATE PAPERS ARE NOT ACCEPTED
EARLY ASSIGNMENTS ARE ACCEPTED

 


 
LAST DAY QUESTIONS ABOUT THIS ASSIGNMENT? PLEASE SEE ME OR MARIA.

OR YOU MAY ALSO EMAIL US. PLEASE DO NOT E-MAIL AFTER 8 PM THURSDAY NIGHT.
Different e-mail providers may take a long time to deliver their mail & we may not receive it in time. We are not responsible for late delivery of e-mail by either your provider or ours, or for server viruses that slow transmission, so please leave enough time!

IF YOU E-MAIL ME ON FRIDAY MORNING I WILL NOT BE ABLE TO RESPOND TO YOU. 

REMEMBER: NO E-MAIL ATTACHMENTS! THANK YOU.
FAX IS OK 850-644-8776  REMEMBER YOUR NAME AND EDF5400.

ASSIGNMENT STATS
YOUR TASKS TODAY
CROSSTABULATIONS COMPUTER RUN
THREE WAY CROSSTABS
WHAT YOU TURN IN

OF COURSE YOU'LL DO COMPUTER RUNS IMMEDIATELY!
Strange things can happen...
 

PROGRAM GLITCH
(they all have them)

Remember this one? One glitch that may happen is that your probability level is a set of zeros, like this: .00 or .0000 (SDA and SPSS both do this).
The program truncated your probability level because the value needed more decimal places than the output displays.
 

What this means:
.00  =  p < .01
.000 = p < .001
.0000 = p < .0001

Please observe the correct terminology (e.g., p < .01) in reporting your results depending on the number of zeros that appear in your output.
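
The zero-counting rule above can be sketched in a few lines of Python. This is only an illustration of the rule: the function name reported_p is made up for this example and is not part of SDA or SPSS.

```python
def reported_p(displayed: str) -> str:
    """Turn a printed probability such as '.000' into the form to report."""
    text = displayed.strip()
    whole, _, digits = text.partition(".")
    # Only an all-zero display has been cut off by the program;
    # any other value is reported exactly as printed.
    if digits and set(digits) == {"0"} and whole in ("", "0"):
        return "p < ." + "0" * (len(digits) - 1) + "1"
    return "p = " + text

print(reported_p(".00"))     # p < .01
print(reported_p("0.0000"))  # p < .0001
print(reported_p(".03"))     # p = .03
```

The number of zeros in the output decides the threshold, so the function counts digits rather than converting the text to a number.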

In this assignment, you'll use the SDA system and the General Social Survey to see how a control variable may (or may not) alter the original correlation between two variables.

This assignment gives you practical experience on (1) the first two basic questions we ask about a relationship between two variables, then on (2) the THIRD basic question we ask about an association between two variables.
REVIEW:

Here are the first two basic questions we ask about a relationship between two variables:
 
 

 
#1. Is the bivariate relationship zero or nonzero in the population?

This question typically examines what is called the "statistical significance" of the association.
 
 

 
#2. If the bivariate relationship is probably nonzero in the population, how strong is the relationship?

Question 2 requires you to:

Question 2 is about "effect size", or what is also called "substantive" or "practical significance".

If the correlation coefficient is statistically significant and at least weak (at least |.11| in magnitude, but ideally moderate or more), continue to question 3.
 
 
 
#3. If the bivariate relationship is probably nonzero and nontrivial in magnitude, what is the apparent causal structure of the bivariate relationship in non-experimental data?

Question 3 requires you to add at least one "control variable."
 

REVIEW

Below are the six possible outcomes to the bivariate correlation when you add a third or control variable:

First, you must decide whether ANY CHANGES have occurred in the original bivariate relationship when you examine it across the separate values of the control variable.

THEN STOP!!

If changes do occur, decide which ONE of the following patterns you have:

If you have interaction effects, assess the pattern of these effects.

THEN STOP!!


If you do not have extraneous, joint, or interaction effects, study the pattern of your results for one of the following:


BE SURE TO KEEP THE DIFFERENCE STRAIGHT BETWEEN:

(1) THE PROBABILITY LEVEL USED TO TEST WHETHER THE RELATIONSHIP IS ZERO AND
(2) THE CORRELATION COEFFICIENT USED TO ESTIMATE THE STRENGTH OF THE ASSOCIATION.
 

ASSIGNMENT STATS

This total assignment is worth 20 points.

Correctly following all programming information for running frequency distributions on the three variables and the crosstabulations, and turning in all output = 2 points.

Although your actual output does not count very heavily, I MUST receive your output in order for you to receive credit on this assignment.

QUESTIONS YOU WILL NEED TO ANSWER, BASED ON YOUR OUTPUT DATA
 
 
Which correlation coefficient should you use to measure any association between degree and sex? BRIEFLY, WHY should you use that particular correlation coefficient? [If you use phi, examine the table size to see if you should use Cramer's V instead.]

Was there any association between degree and sex in your table for the total sample? (Was the association real or accidental?)

What was that significance level for the association between degree and sex?
(Remember:  a row of .00s here really means p < .01.)

Was this relationship curvilinear (non-linear), monotonic, approximately linear, or couldn't you tell? How did you know?

What was the numeric value of this correlation? Describe the strength [and direction if applicable] of this association in words.

NEXT: Which correlation coefficient should you use to measure any association between year and degree? BRIEFLY, WHY should you use that particular correlation coefficient? [If you use phi, examine the table size to see if you should use Cramer's V instead.]
Was there any association between year and degree in your table for the total sample? (Was the association real or accidental?)

What was that significance level for the association between year and degree?
(Remember:  a row of .00s here really means p < .01.)

Was this relationship curvilinear (non-linear), monotonic, approximately linear, or couldn't you tell? How did you know?

What was the numeric value of this correlation? Describe the strength [and direction if applicable] of this association in words.

Now, what happens when you use your control variable?

Use the same kind of correlation coefficient that you chose for the relationship between year and degree IN THE TOTAL SAMPLE. What were the numeric values of the correlations between year and degree for Men and for Women, each taken separately?

Were either or both of these correlations statistically significant?
What were the levels of statistical significance for each of the two "partial" correlations?
Describe the strength of each partial correlation in words.

What are your causal conclusions about the type of relationship considering all three variables together?

Do you have an extraneous relationship, a joint relationship, an interaction effect, an intervening (indirect or mediated) relationship, a spurious relationship, or a suppressed relationship?

Remember, you can make ONLY ONE CHOICE from among the SIX outcomes immediately above. (If your correlations for Men and Women differ by at least |.10| , you have an interaction effect and that is what you call it.)

In a sentence or so, discuss the reasoning behind your decision.

Lots of questions here, but take everything step by step and you should be able to answer them all.


YOUR TASKS FOR ASSIGNMENT FOUR

OVERALL

FIRST, access SDA and the General Social Survey.

SECOND, run frequencies on all variables for this computer session (year, degree, sex). It is always a good idea to click the percents box to get an idea of the relative distributions.

THIRD, you will run your assigned crosstabulations. You will select the most appropriate statistics for your data and then decide what kind of causal relationship you have.

SPECIFICALLY

1. You will run univariate FREQUENCIES on the three variables for this exercise. You will watch for out of order codes, missing data, "wild punches," and other anomalies so that you can recode these if needed or restrict the range of valid codes.

You will run frequencies on:

sex (male or female)

degree (respondent's highest degree level, in five degrees)

year (year of study--we will only use 1972 and 2002)
 

2.   You will then conduct a crosstabulation of degree by sex for the total sample. In your program, you will request statistics and column percentages.

3.   Based on your crosstabulation, you will select the BEST or MOST APPROPRIATE correlation coefficient for your data. You will decide whether any apparent association between degree and the respondent's gender is ACCIDENTAL or REAL. You will report the level of "statistical significance" of this association.

4.  You will report the magnitude of the correlation coefficient, its direction (if appropriate), its form (IF APPROPRIATE), and how strong the correlation coefficient is in words.

5.  Next, you will conduct a crosstabulation of degree  by year for the total sample. In your program, you will request statistics and column percentages. (NOTE: this will be part of your control variable run. It comes out as the very last table in your output when you run the three way contingency table.)

6.  Based on your crosstabulation, you will select the BEST or MOST APPROPRIATE correlation coefficient for your data. You will decide whether any apparent association between year and the respondent's degree level is ACCIDENTAL or REAL. You will report the level of "statistical significance" of this association.

7.  You will report the magnitude of the correlation coefficient, its direction (if appropriate), its form (if appropriate), and how strong the correlation coefficient is in words.

8.  At the same time you conduct the crosstabulation for  degree by year for the total sample, you will request:

Using the same type of correlation coefficient that you used for the relationship between degree by year in the total sample, you will assess the bivariate relationship between degree by year separately each for Men and for Women.

You will assess the statistical significance of the correlation separately for each sex, its magnitude, direction (if appropriate), its form (if appropriate), and its strength in words.

At step 8, you have conducted a multivariate crosstabulation (sometimes called a "three-way crosstab").
For most of you it is your very first one. Congratulations!

9.  You will decide on the general causal status of the three way crosstabulation. Is it:

direct
joint
statistical interaction
intervening
spurious or
suppressed?
You may choose only ONE of these six alternatives.
Then, you will briefly give a rationale for your choice.


OLD FEATURES in this exercise include:

    Accessing an online database (the General Social Survey file) and the SDA system.
    Running univariate frequency and percentage distributions.
    Using a selection filter to select only two years: here, 1972 and 2002
    Recoding the values of a variable into fewer values.
    Conducting a crosstabulation to generate:

    Assessing the statistical significance of an association between two variables.
    Assessing the magnitude of the relationship between two variables.

NEW FEATURES include:


MAKE LIFE EASIER!

 
 
ALWAYS A GOOD IDEA TO PRINT OUT A COPY OF THIS ASSIGNMENT.

CHECK OFF EACH INSTRUCTION AS YOU COMPLETE IT AND YOU WILL HAVE A SPEEDY AND NONTRAUMATIC EXPERIENCE RUNNING THE SDA SYSTEM. 


 
 
SDA REVIEW

Use the RIGHT toe of your mouse to click on this link:
 
 
http://www.icpsr.umich.edu/GSS/

When the menu opens on the link, click on:         Open in New Window

Click on the  button to pull up the statistical program selection screen.
 
 
Remember: WHAT YOU SEE BELOW IS A NONWORKING COPY. 
YOU MUST GO TO THE NEW PAGE THAT OPENS ON YOUR MONITOR TO ACCESS THE SDA PROGRAM.

You can always switch back to this screen by clicking on the box at the very bottom of the monitor screen that reads "Assignment 4". Or you can print out the pages of Assignment 4.

Once again, you will bring up the "radio buttons" screen to select an analytic option, in this case, Frequencies or crosstabulation. The directions that you place in the SDA boxes will direct the program to perform a series of crosstabulations.
 

Survey Documentation and Analysis

Study: GSS 1972-2002 Cumulative Datafile

Select an action:
Browse codebook in this window
Frequencies or crosstabulation
Comparison of means
Correlation matrix
Comparison of correlations
Multiple regression
Logit/Probit
List values of individual cases
Recode variables (into public work area)
Compute a new variable
List/delete derived variables

Download a customized subset

Suggestion for running analysis programs:
Click the "Open Extra Codebook Window" button above. This allows you to "copy-and-paste" the names of variables you wish to analyze from the codebook window to the analysis windows.
Return to SDA Home Page

In the Study: GSS 1972-2002 Cumulative Datafile screen that opens: first click on:
 
Open Extra Codebook Window

to open up the codebook window. Remember to click on the "Standard Codebook" link.

Use the radio button to select Frequencies or crosstabulation. Then, click on :
 
Start

so that you have the SDA program window active as well.
 


PRELIMINARY FREQUENCIES

REMEMBER! The first step in working with data is to ALWAYS display the total frequencies on all the variables that you plan to analyze for a particular research project.
 
 

 
Next to Selection Filter(s): Add the words: year (1972,2002)
so that you only include respondents from the years 1972 and 2002.

After you have accessed the General Social Survey data and the SDA system, be sure to run the original frequencies for:

1. sex

2. degree and

3. year

Obtain the frequencies ONLY. (This includes percentages too!) You DO NOT need measures of central tendency, dispersion, or other univariate measures for this assignment.

BE SURE TO IMMEDIATELY PRINT THESE FREQUENCIES PAGES AND INCLUDE THEM WITH YOUR OUTPUT.

See the directions below for "what goes in the boxes" below the SDA Program box.
 
RUNNING YOUR SDA SHORT PROGRAM FOR CROSSTABULATIONS

Pull up the window for the SDA Program Screen. Again for your review, below is the example screen for running frequencies or crosstabulations that appeared when you clicked the "START" button on the SDA beginning window.

We can be reasonably sure that time will influence educational level in the United States because people in more recent years are better educated. Most states and the federal government have tried to prevent high school dropouts and encourage high school graduates to attend college.  I cannot imagine any circumstances under which one's level of education would influence year (try the "giggle factor" here if nothing else).

We can be totally sure that neither the study year nor someone's level of education will influence  whether they are male or female! On the other hand, one sex or the other may have increased their educational level disproportionately over time.


FIRST, you will run the crosstabulation between degree and sex.

Row: degree

Make sex your Column: variable

REMEMBER: Where it says Selection Filter(s):  put: year (1972,2002)
to select only respondents from 1972 and 2002.

Be sure that the "Column" box is checked under Percentaging:

Now, click on the boxes to the LEFT of:

"Statistics"                    AND
"Question text"

If you like, you can change the number of decimal places ONLY for statistics to "4".
However there is a good  chance this won't work and you will only get 2 decimal places anyway under statistical significance...

Click on the gray box at the bottom left that says:
 
Run the Table

IMMEDIATELY PRINT THESE PAGES BEFORE YOU BEGIN THE THREE-WAY CROSSTABULATION RUN BELOW!

 
THE BOX BELOW WILL NOT WORK TO RUN YOUR PROGRAM. 
YOU MUST GO TO THE NEW PAGE THAT OPENS ON YOUR MONITOR TO RUN YOUR PROGRAM.

 


 
SDA Tables Program
(Selected Study: GSS 1972-2002 Cumulative Datafile)
Help: General / Recoding Variables
REQUIRED Variable names to specify
Row:

OPTIONAL Variable names to specify
Column:
Control:
Selection Filter(s):  (Example: age(18-50) gender(1))
Weight:

Percentaging:  Column   Row   Total

Other options
Statistics   Suppress table   Question text
Color coding   Show Z-statistic

Chart options
Type of chart (if any) to display:
Bar chart options:
   Orientation: Vertical Horizontal
   Visual Effects: 2-D 3-D
Size of chart - width:  height: 




Change number of decimal places to display
For percents:
For statistics:

 

Now for the three-way crosstabulation. You will also be able to get the BIVARIATE crosstabulation of degree by year for the total sample from this run. It will be the VERY LAST TABLE (and include all the valid cases) on the run output.

In the Row box, type: degree

degree goes in the "row" line because it is your dependent variable.

Next to the Column: line, type: year
This will make year, your independent variable, the column variable, per MOST conventions.

Next to the Control: line, type: sex

Be sure to switch "sex" to the Control: line. This will make sex your control variable. This command will generate THREE crosstabulation tables for the association between degree level and time: one for Men, one for Women, and one for the total sample (males and females combined).

REMEMBER: Next to Selection Filter(s):  put: year (1972,2002)

Next to Percentaging: be sure the little box to the LEFT of "Column" is checked.
(Click to check this box if it is blank.)

This WILL produce column percents because you have both rows and columns.
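
To see what the Row, Column, Control, Selection Filter, and Percentaging settings do together, here is a minimal Python sketch. It does not use SDA or real GSS data: the handful of records is made up, and the function names crosstab and column_percents are hypothetical helpers for this illustration only.

```python
from collections import Counter

# Made-up respondent records: (year, sex, degree). NOT real GSS data.
records = [
    (1972, "male", "less than high school"), (1972, "male", "high school"),
    (1972, "female", "less than high school"), (1972, "female", "less than high school"),
    (1988, "male", "bachelor"),  # dropped by the year filter below
    (2002, "male", "high school"), (2002, "male", "bachelor"),
    (2002, "female", "high school"), (2002, "female", "bachelor"),
]

def crosstab(rows):
    """Count degree (row variable) by year (column variable)."""
    return Counter((degree, year) for year, _sex, degree in rows)

def column_percents(table):
    """Percentage each cell on its column (year) total."""
    totals = Counter()
    for (_degree, year), n in table.items():
        totals[year] += n
    return {cell: 100.0 * n / totals[cell[1]] for cell, n in table.items()}

# Same idea as the selection filter year(1972,2002).
kept = [r for r in records if r[0] in (1972, 2002)]

# Sex on the Control line: one table per sex, plus the total-sample table.
tables = {
    "male": crosstab(r for r in kept if r[1] == "male"),
    "female": crosstab(r for r in kept if r[1] == "female"),
    "total": crosstab(kept),
}

for name, table in tables.items():
    print(name, column_percents(table))
```

Within each table the percentages in a year column add to 100, which is what lets you compare 1972 with 2002 directly even when the two years have different numbers of respondents.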

Now, click on the boxes to the LEFT of:

"Statistics"                    AND
"Question text"

Remember to ONLY use the P or Pearson Chi-Square for this assignment.
Also recall that if you choose to use Phi for any of your correlations, that you can easily calculate it if you:

(1) Take the Pearson Chi-Square and
(2) Divide it by the casebase (n), then
(3) Take the SQUARE ROOT of the result at Step 2.

That's it, that's Phi.
If you want Phi-squared instead, just stop at Step 2.

Use the Cramer's V correction in the denominator if you have a 3 by 3 size table OR LARGER. The phi formula will work for a "2 by anything" table.
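
The three steps for Phi, and the Cramer's V correction, can be sketched in Python as follows. The Chi-Square value and casebase here are made-up numbers for illustration, not GSS results, and the V formula used is the standard one: divide by n times the smaller of (rows - 1) and (columns - 1) before taking the square root.

```python
import math

def phi(chi_square: float, n: int) -> float:
    """Steps 1-3: Pearson Chi-Square divided by n, then the square root."""
    return math.sqrt(chi_square / n)

def cramers_v(chi_square: float, n: int, n_rows: int, n_cols: int) -> float:
    """Phi with the extra term min(rows - 1, columns - 1) in the denominator."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi_square / (n * k))

# Hypothetical output values, chosen only to make the arithmetic easy.
chi_square, n = 45.0, 2000

print(round(phi(chi_square, n), 3))              # 0.15
print(round(cramers_v(chi_square, n, 5, 2), 3))  # 0.15, a "2 by anything" table
print(round(cramers_v(chi_square, n, 5, 3), 3))  # 0.106, a larger table
```

For a "2 by anything" table the correction term equals 1, so Cramer's V and Phi come out the same. That is why the plain Phi formula is safe for those tables.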

Now click the gray box at the bottom left that says:
 
Run the Table

Here is your output for your first multivariate crosstabulation run using the SDA system.

Go ahead and print your "three-way crosstab" NOW so that you have these pages to study while you complete this assignment. They also form part of the Assignment 4 that you will turn in to me.
 
 

 
REMEMBER!  If you get a row of zeros (.00) next to "p =" or underneath the "P" header, this means that your results could have occurred by accident less than once in 100 samples if the association were really zero in the population. This really means p<.01 (or even p<.0001 if you have four zeros) so that is what you put. Such a result is VERY statistically significant and means your relationship is probably REAL.

 


HERE'S WHAT YOU TURN IN BY 3 PM ON FRIDAY NOVEMBER 12 2004
My mailbox 307 Stone (you may also turn in at class 11-10-04)



Here's what you turn in to me by 3 PM Friday November 12 (you may add a short explanation to your answer to any of these questions). Points for each part are in parentheses:

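Your printed output (2 points maximum) for the frequencies on sex, degree, and year, the crosstabulation of degree by sex, and the three-way crosstabulation of degree by year with sex as the control variable.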

THEN, YOUR ANSWERS TO QUESTIONS 1-15 BELOW:

(1) (2 points) Which correlation coefficient should you use to measure any association between degree and sex? BRIEFLY, WHY should you use that particular correlation coefficient? [If you use phi, examine the table size to see if you should use Cramer's V instead.]

(2) (1 point) Was there any association between degree and sex in your table for the total sample? (Was the association real or accidental?) (Was the association zero or non-zero? These three questions are equivalent.)

(3) (1 point) What was that significance level for the association between degree and sex? (Remember:  a row of .00s or 1.00 here really means p < .01.)

(4) (1 point) Was this relationship curvilinear (non-linear), monotonic, approximately linear, or couldn't you tell? [This is often called the FORM of the relationship.] How did you know?

(5) (1 point) What was the numeric value of this correlation? Describe the strength [and direction if applicable] of this association in words.
 

CLICK HERE TO REVIEW THE CHART ON CORRELATION COEFFICIENT STRENGTH

(6) (2 points) Which correlation coefficient should you use to measure any association between year and degree? BRIEFLY, WHY should you use that particular correlation coefficient? [If you use phi, examine the table size to see if you should use Cramer's V instead.]

(7) (1 point) Was there any association between year and degree in your table for the total sample? (Was the association real or accidental?) Remember, the crosstabulation table for males and females combined will be the THIRD table in your program output.

(8) (1 point) What was that significance level for the association between year and degree? (Remember:  a row of .00s or 1.00 here really means p < .01.)

(9) (1 point) Was this relationship curvilinear (non-linear), monotonic, approximately linear, or couldn't you tell? [This is often called the FORM of the relationship.] How did you know?

(10) (1 point) What was the numeric value of this correlation? Describe the strength [and direction if applicable] of this association in words.

Now, what happens when you use your control variable?

(11) (1 point) Use the same kind of correlation coefficient that you chose for part (6) above [NOT part (1)]. What were the numeric values of the correlations between year and degree for Men and for Women, each taken separately?

(12) (1 point) Were either or both of these correlations statistically significant?

(13) (1 point) What were the levels of statistical significance for each of the two correlations in part (11)?

(14) (1 point) Describe the strength of each correlation (direction, form if applicable) in part (11) in words.

(15) (2 points) What are your causal conclusions about the type of relationship considering all three variables together?

Do you have an extraneous relationship, a joint relationship, an interaction effect, an intervening (indirect) relationship, a spurious relationship, or a suppressed relationship?

Remember, you can make ONLY ONE CHOICE from among the outcomes immediately above. (If your correlations for Males and Females differ by at least |.10| , you have an interaction effect and that is what you must call it.)

In a sentence or so, discuss the reasoning behind your decision.
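
The |.10| rule for spotting an interaction effect can be written as a short Python check. This is a sketch of one reading of the rule (the absolute difference between the two signed correlations), and the function name and the correlation values are made up for illustration.

```python
def interaction_by_rule(corr_men: float, corr_women: float,
                        threshold: float = 0.10) -> bool:
    """True when the two correlations differ by at least the threshold."""
    # Round first so that a difference of exactly .10 is not lost
    # to floating-point arithmetic.
    return round(abs(corr_men - corr_women), 4) >= threshold

# Hypothetical correlations, not assignment answers.
print(interaction_by_rule(0.30, 0.12))  # True
print(interaction_by_rule(0.22, 0.18))  # False
print(interaction_by_rule(0.25, 0.15))  # True
```

If the check comes back False, you still need to compare the two correlations with the total-sample correlation to choose among the remaining outcomes.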
 

 
REMEMBER! CHI-SQUARE IS NOT A CORRELATION COEFFICIENT. 
Instead, it tests the statistical significance level of a correlation coefficient.

Correlation coefficients are bounded by the absolute values of 0 to 1. If some number in your output is larger than 1, you can safely assume it is NOT a correlation coefficient.

TECHNICAL NOTE: The SDA program (SPSS too) assumes that either

(1) the top left-hand cell contains the highest values for both the independent and the dependent variables and the bottom right-hand cell contains the lowest values for both OR
(2) the top left-hand cell contains the lowest values for both the independent and the dependent variables and the bottom right-hand cell contains the highest values for both.

Obviously this is not a problem if one of your variables is nominal in the cross-tab table but it could be a problem if both your variables are at least ordinal. Because the top left-hand cell is low for both variables in this example (year = 1972 and degree = less than high school) your correlation coefficient this time around will be OK.

However, in other variable codings (e.g., the top left-hand cell is "high-low"), the sign of the correlation coefficient may be reversed and the analyst must change the signs of the correlation coefficients. Be alert to this issue in analyses you may do later on for theses, dissertations, conference papers, reports, articles, etc. and how your independent and dependent variables are coded. Remember that computers are robots and they go by table position, they don't really know which cells you meant were the "high"  cells. SPSS and SAS follow similar conventions for tables.
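
Here is a small Python sketch of the sign problem. It is an illustration only: it uses an ordinary Pearson correlation on numeric codes as a stand-in (not necessarily the coefficient you will pick from your SDA output), and the codes are made up rather than taken from the GSS.

```python
import math

def pearson_r(xs, ys):
    """Ordinary Pearson correlation between two lists of numeric codes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical codes: year 0 = 1972, 1 = 2002; degree 0 (lowest) to 4 (highest).
year = [0, 0, 0, 0, 1, 1, 1, 1]
degree = [0, 0, 1, 1, 1, 2, 3, 4]

# Same data, but with degree coded "backwards": 0 = highest, 4 = lowest.
degree_reversed = [4 - d for d in degree]

r_original = pearson_r(year, degree)
r_reversed = pearson_r(year, degree_reversed)
print(round(r_original, 3), round(r_reversed, 3))
```

The two results have the same magnitude and opposite signs. The data did not change, only the coding did, so it is up to the analyst to report the sign that matches what "high" and "low" really mean.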
 
 

Susan Carol Losh November 2, 2004