THERE IS NO CLASS WEDNESDAY NOVEMBER 24. HAVE A HAPPY THANKSGIVING!
OVERVIEW
 

ASSIGNMENT 5 DUE NOVEMBER 29


 
EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH
EDUCATIONAL PSYCHOLOGY AND LEARNING SYSTEMS

EXAM GUIDE 3 WILL BE AVAILABLE HERE

PLEASE NOTE THE EXAM 3 DATE: WEDNESDAY DECEMBER 8 5:30 PM
 

 

ASSIGNMENT 5: BASIC MULTIPLE REGRESSION
20 POINTS

DUE MONDAY NOVEMBER 29 2004 BY CLASS
DUE TO THE INTENSE NATURE OF THIS COURSE
LATE PAPERS ARE NOT ACCEPTED
EARLY ASSIGNMENTS ARE ACCEPTED

 
LAST DAY QUESTIONS? PLEASE SEE ME OR MARIA.
I WILL BE IN MONDAY AFTERNOON 3:30-5:00 PM NOVEMBER 22 & NOVEMBER 29
MARIA HAS OFFICE HOURS TUESDAY 3:30-5:00.
CHECK WITH EMAIL FOR RYAN

OR YOU MAY ALSO EMAIL US. HOWEVER, PLEASE DO NOT E-MAIL AFTER 8 PM SUNDAY NIGHT.
Different e-mail providers may take a long time to deliver their mail & we may not receive it in time. We are not responsible for late delivery of e-mail by either your provider or ours, or for server viruses that slow transmission, so please leave enough time!

IF YOU E-MAIL ME MONDAY MORNING I WILL NOT HAVE ANY TIME TO RESPOND TO YOU. 

REMEMBER: I DO NOT TAKE E-MAIL ATTACHMENTS! THANK YOU.
FAX IS OK 850-644-8776  PLEASE REMEMBER YOUR NAME AND EDF5400!

ASSIGNMENT STATS
YOUR TASKS 
TODAY
REGRESSION COMPUTER RUN
SOME GUIDANCE HINTS
WHAT YOU TURN IN ON NOVEMBER 29

OF COURSE YOU'LL DO COMPUTER RUNS IMMEDIATELY! WE RECOMMEND  YOU HAVE YOUR OUTPUT PRESENT DURING MONDAY CLASS NOVEMBER 22 & 29 FOR THE REGRESSION SEGMENT OF THE COURSE.
 

PROGRAM PECULIARITIES AND GLITCHES
This also holds for your regression assignment.
One glitch that may happen is that your probability level prints as a set of zeros, like this: .00   or  .0000      (SDA and SPSS both do this)

The program truncated your probability level because the first non-zero digit fell beyond the number of decimal places displayed.

What this means:
.00  =  p < .01
.000 = p < .001
.0000 = p < .0001
In order to receive full credit, please observe the correct terminology (e.g., p < .01) in reporting your results depending on the number of zeros that appear in your output.
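
If it helps to see the rule above spelled out, here is a minimal sketch in Python. The function name report_p is made up for this illustration; it is not part of SDA or SPSS. It simply counts the printed zeros and builds the matching "p <" statement.

```python
def report_p(printed: str) -> str:
    """Turn a printed all-zero probability such as '.000' into 'p < .001'.

    report_p is a hypothetical helper for illustration only.
    """
    text = printed.strip()
    # Keep only the digits after the decimal point.
    digits = text.split(".", 1)[1] if "." in text else ""
    if digits and set(digits) == {"0"}:
        # Same number of decimal places, with the last zero replaced by 1:
        # '.00' -> '.01', '.000' -> '.001', '.0000' -> '.0001'
        return "p < ." + "0" * (len(digits) - 1) + "1"
    # Any other value was not truncated to zeros, so report it as printed.
    return "p = " + text


for shown in [".00", ".000", ".0000", ".032"]:
    print(shown, "->", report_p(shown))
```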
 

In this assignment, you'll use the SDA system and the 2002 General Social Survey to see how three independent variables differentially affected an individual's highest year of education (educ).

 
Notice that I am specifying the direction of each independent variable on the dependent variable, years of education, in advance: positive for mother's education or being brought up in a two parent family and negative for sibs. THOUGHT QUESTION: Why might I want to specify whether the effect of an independent variable was positive or negative in advance? What kind of advantage might that give?
 

REVIEW

This assignment requires you to draw upon considerable amounts of previous course material.

#1. Examining means and standard deviations.

#2. Examining bivariate correlation coefficients (Pearson's r)

#3. Assessing the strength and direction of correlation coefficients in words.

BUT THERE ARE SEVERAL NEW ELEMENTS

#1. Examining a regression equation that has three independent variables.

#2. Describing how each of the three independent variables influenced the dependent variable in words.

#3. Assessing whether the entire regression equation was statistically significant (i.e., was at least one B reliably different from zero?)

#4. Deciding how strong the combined total effect of the three independent variables was on the dependent variable, educ.

#5. Deciding whether each separate regression B was statistically different from zero or essentially zero within sampling error.

#6. Ranking each independent variable from most to least important in terms of how it influenced the years of respondent education.

#7. Assessing the relative net strength of each of the independent variables on predicting the years of respondent education.



ASSIGNMENT STATS

This total assignment is worth 20 points.

Correctly following all programming information for running frequency distributions and all the regression statistics, and turning in all output = 2 points.

Although your actual output does not count very heavily, I MUST receive your output in order for you to receive credit on this assignment.
 
QUESTIONS YOU WILL NEED TO ANSWER, BASED ON YOUR OUTPUT DATA

You will describe the numeric values of  the three zero-order (bivariate) Pearson r correlations between each of your independent variables and your dependent variable educ. (Include DIRECTION: the positive or negative sign.)

You will describe the strength and direction of each of these three bivariate correlations above in words.

Overall, how much variance (i.e., R2) did you predict in educ with your three independent variables?

Was that value of R2 statistically significant (that is, "REAL")? What was the "significance" level or the probability level of the R2 for this regression?

You will write out the estimated numeric regression equation for educ using your independent variables of sibs, family16, and maeduc.

What was the probability level (significance level) for the effect of each independent variable [those are found through the t-statistic for each separate B] on educ?

You can construct a chart showing the dependent variable, educ, and the independent variables, Bs, beta weights, and significance levels. This is a short way to present the numeric results but please note that such a chart cannot substitute for describing the effects in words. (Chart examples are presented in Guide 8.)

You will interpret each of the effects of sibs, family16, and maeduc on educ in words [i.e., how much the years of respondent education rises or falls for a one year change in mother's education or how many more or fewer years of respondent education there were if the person grew up in a two parent family as opposed to an "other" situation.]

You will describe the relative impact (the BETA weight) of each of your independent variables on  educ and describe these results numerically for each independent variable, the strength (remember to use our STRENGTH chart!), and the direction of each beta weight on educ.

You'll rate each independent variable from most to least important in terms of how much each predictor influences the dependent variable and decide whether to use the "B"s or the "Beta Weights."

And, YES, you really should be able to do all this by November 29!

Take each question step by step and you should be able to answer them all.


YOUR TASKS FOR ASSIGNMENT FIVE

OVERALL

FIRST, access SDA and the 2002 General Social Survey.

SECOND, run frequencies on all variables for this computer session (educ, sibs, family16, and maeduc). Click the percentages box to get an idea of the relative distributions. You don't need to run measures of central tendency or variation for this assignment.

REMEMBER: Where it says Selection Filter(s):  put: year (2002)
to select only respondents from 2002.

YOU WILL NEED TO RECODE SIBS AND FAMILY16 WHEN YOU DO THE REGRESSION RUN.
See below for directions.

THIRD, you will run your regression program. You will interpret several statistics about the regression equation for educ.

SPECIFICALLY

1. You will run FREQUENCIES on the four variables for this exercise because you ALWAYS check the frequencies on all the variables that you plan to use at the beginning of any analysis session. You will watch for out of order codes, missing data, "wild punches," and other anomalies so that you can recode these if needed or restrict the range of valid codes.

You will run frequencies on:

family16 (the respondent's living situation when s/he was 16 years old)

maeduc (respondent's mother's YEARS of education)

sibs (the respondent's NUMBER of brothers and sisters)

educ (number of respondent's YEARS of education)

When you actually do the regression, you will have your first experience with a "dummy variable" or a dichotomized variable coded only into the categories 0 and 1. This will be the variable "family16" that will be recoded to Two parent family (1) and Other (0).

2.   You will then conduct a multiple regression. Your dependent variable will be educ. Your independent variables will be number of sibs, family16, and years of maeduc.

3.   Based on your results, you will decide on the statistical significance of the total regression equation, its substantive importance, and the statistical and substantive significance of the NET effect of each of the independent variables on educ.

4.  You will examine the relative NET magnitude of the influence of each independent variable on the years of respondent education, and decide which type of regression coefficient (B or Beta) is the more appropriate kind of regression coefficient to use to present your results.
 



OLD FEATURES in this exercise include:

    Accessing an online database (the GSS file) and the SDA system.
    Running univariate frequency and percentage distributions.
    Assessing the statistical significance of an association between two variables.
    Assessing the magnitude and direction of the relationship between two variables.
    Filtering for the study year (2002 ONLY).

NEW FEATURES include:

Conducting a multiple regression analysis.
Assessing R2 and the F-Test for regression.
Assessing metric Bs and the t-test for each B.
Using the Beta Weights to assess the relative net impact of each independent variable on your dependent variable. 




MAKE LIFE EASIER!

 
 
ALWAYS A GOOD IDEA TO PRINT OUT A COPY OF THIS ASSIGNMENT.

CHECK OFF EACH INSTRUCTION AS YOU COMPLETE IT AND YOU WILL HAVE A SPEEDY AND NONTRAUMATIC EXPERIENCE RUNNING THE SDA SYSTEM. 


 
SDA REVIEW

Use the RIGHT toe of your mouse to click on this link:
 
http://www.icpsr.umich.edu/gss

When the menu opens on the link, click on:         Open in New Window

Click on the  button to pull up the statistical program selection screen.
 
 
Remember: WHAT YOU SEE BELOW IS A NONWORKING COPY.
YOU MUST GO TO THE NEW PAGE THAT OPENS ON YOUR MONITOR TO ACCESS THE PROGRAM.

You can always switch back to this screen by clicking on the box at the very bottom of the monitor screen that reads "Assignment 5". Or you can print out the pages of Assignment 5.

Once again, you will bring up the "radio buttons" screen to select an analytic option. First, you will click on "Frequencies or crosstabulation" to run the frequencies on family16, sibs, maeduc, and educ.

In the Study: GSS 1972-2002 Cumulative Datafile screen that opens: first click on:
 
Open Extra Codebook Window

to open up the codebook window. Then, click on:
 
Start

so that you have the SDA program window active as well.
 

Survey Documentation and Analysis

Study: GSS 1972-2002 Cumulative Datafile

Select an action:
Browse codebook in this window
Frequencies or crosstabulation
Comparison of means
Correlation matrix
Comparison of correlations
Multiple regression
Logit/Probit
List values of individual cases
Recode variables (into public work area)
Compute a new variable
List/delete derived variables

Download a customized subset

Suggestion for running analysis programs:
Click the "Open Extra Codebook Window" button above. This allows you to "copy-and-paste" the names of variables you wish to analyze from the codebook window to the analysis windows.
Return to SDA Home Page



PRELIMINARY FREQUENCIES

REMEMBER! The first step in working with data is to ALWAYS display the total frequencies on all the variables that you plan to analyze for a particular research project.
 

 
Next to Selection Filter(s): Add the words: year (2002)
so that you only include respondents from the year 2002.

After you have accessed the General Social Survey data and the SDA system, be sure to run the original 2002 frequencies for:

1. family16

2. maeduc

3. sibs and

4. educ

Obtain the frequencies and click the Column box on Percentaging:  You DO NOT need measures of central tendency, dispersion, or other univariate measures for this assignment, just the frequencies and the percents.

The percentages will be the BOLD numbers in each cell.
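
For those who like to see what a frequency run is doing behind the scenes, here is a small sketch in Python. It is NOT how SDA works internally; the function name frequency_table and the sample codes are invented for illustration. It counts each code and converts the counts to percentages, which is all a frequency distribution is.

```python
from collections import Counter


def frequency_table(values):
    """Return (code, count, percent) rows sorted by code.

    frequency_table is a hypothetical helper for illustration only.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (code, n, round(100.0 * n / total, 1))
        for code, n in sorted(counts.items())
    ]


# Made-up codes, NOT real GSS data. A stray code such as 99 would show
# up as its own row, which is how you spot "wild punches".
made_up_codes = [0, 1, 1, 2, 2, 2, 3, 99]
for code, n, pct in frequency_table(made_up_codes):
    print(code, n, pct)
```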

REMEMBER: Where it says Selection Filter(s):  put: year (2002)
to select only respondents from 2002.

BE SURE TO IMMEDIATELY PRINT THESE PAGES AND INCLUDE THEM WITH YOUR OUTPUT.

Remember in the Standard Codebook to click on the blue underlined shorthand abbreviation or mnemonic to examine the basic univariate frequency distribution for the particular variable of interest, including missing value codes and frequencies. Recall that the SDA program sometimes, BUT DEFINITELY NOT ALWAYS, omits the cases with missing values when it executes an analysis.

After you have run the original frequencies, here is what you will do in the context of your regression computer run:

You will create a "dummy variable" for family16. A "dummy variable" is a dichotomy where the cases are ONLY coded as zero or one. For this exercise, the category "Mom-Dad" (two parent family) will be coded 1 and "Other"  will be coded 0 because individuals from two parent families may have been able to afford more education more easily.

You will truncate the values for "sibs" which can range into the twenties into the values 0-10 by using a recode statement in the context of the regression run. This will prevent outlying values (e.g., 28 brothers and sisters) from having a disproportionate effect on the results.
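
Here is a rough sketch in Python of what these two recodes do to an individual case. It mirrors the recode statements you will type into the regression boxes in this assignment (1-3 become 1 and 0 or 4-9 become 0 for family16; 10-28 become 10 for sibs). The function names are invented, and returning None for any code outside those ranges is my own assumption about how to treat leftover codes.

```python
def recode_family16(code):
    """Dummy variable: 1 = "Mom - Dad", 0 = "Other".

    Follows the assignment's recode statement: 1=1-3; 0=0,4-9.
    Returning None for anything else is an assumption of this sketch.
    """
    if 1 <= code <= 3:
        return 1
    if code == 0 or 4 <= code <= 9:
        return 0
    return None


def recode_sibs(n):
    """Leave 0 through 9 alone and collapse 10-28 into 10 ("10 or more")."""
    if 0 <= n <= 9:
        return n
    if 10 <= n <= 28:
        return 10
    return None  # assumption: codes outside 0-28 are treated as missing


print(recode_family16(1), recode_family16(5))
print(recode_sibs(3), recode_sibs(28))
```

Notice that the dummy variable throws away detail on purpose: every case ends up as either 1 or 0, so its B is simply the difference between the two groups.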
 
 
RUNNING YOUR SDA REGRESSION PROGRAM

The first thing to do is to return to the Study: GSS 1972-2002 Cumulative Datafile window.


This time, you will click on the Multiple Regression radio button and this will pull up a totally new screen for you that will look like the screen below:
 
 
 THE BOX BELOW WILL NOT WORK TO RUN YOUR PROGRAM. 
YOU MUST GO TO THE NEW PAGE THAT OPENS ON YOUR MONITOR TO RUN YOUR PROGRAM.

 


 
SDA Regression Program
Selected Study: GSS 1972-2002 Cumulative Datafile
Help: General / Dummy vars / Product terms
 
   Dependent:

   Independent: (You can tab from one input box to the next)
1: 2: 3: 4:
5: 6: 7: 8:
9: 10: 11: 12:
13: 14: 15: 16:
   More independent variables


   Selection Filter(s):    Example: age(18-50) gender(1)
   Weight:

   Other statistics:
T-tests   Global F-test   Univariate stats
Correlation matrix   Covariance matrix

Color coding   Question text



Change number of decimal places to display:
   For coefficients:
   For t-tests:
   For F-test:
   For univariate stats:
   For correlation matrix:
   For covariance matrix:


   More independent variables
17: 18: 19: 20:
21: 22: 23: 24:
25: 26: 27: 28:
29: 30: 31: 32:
33: 34: 35: 36:
37: 38: 39: 40:
41: 42: 43: 44:
45: 46: 47: 48:
49: 50: 51: 52:
Continue with other specifications
 

In the Dependent: box type:          educ

You will see that there are MANY boxes for independent variables (FIFTY TWO boxes if you scroll down to the bottom of the screen!) We will only use the first three boxes. Be sure that you place specifications ONLY in boxes 1, 2, and 3 going across the row.

In box #1 type:      sibs (r:0=0;1=1;2=2;3=3;4=4;5=5;6=6;7=7;8=8;9=9;10=10-28 "10 or more")

Yes, this really will all fit in the little box. If you prefer, you can just cut and paste the line beginning with "sibs" and including the recode statement into the little box (make sure to copy it all, including the closing parenthesis.) Sibs can go over 30 brothers and sisters (although not very often) so you will recode it to a manageable number to avoid outliers.

In box #2 type:      family16 (r:1=1-3 "Mom - Dad";0=0,4-9 "Other")

This statement will also fit into the little box. Again, you can cut and paste this line beginning with "family16" if you prefer. This will recode those individuals growing up with two parents as "1" and those in all other living arrangements as "0".

In box #3 type:     maeduc

You won't have to recode this variable at all, so just put the variable name.

REMEMBER: Where it says Selection Filter(s):  put: year (2002)
to select only respondents from 2002.

Under "Other statistics:" click on the boxes to the LEFT of:

T-tests
Global F-test
Univariate stats (this is so you can double check your frequency figures)
Correlation matrix

Color coding and Question text are optional, BUT I find the color coding helpful. Blue coefficients are negative and red coefficients are positive.

Although you may be familiar with many of the questionnaire codes for these variables, I strongly recommend that you have the codes from your univariate frequencies in front of you so that you can translate the coefficients into English words more easily while you do this assignment. This will be especially true for the "family16" dummy variable.

Click on the gray box at the bottom left that says:
 
Run Regression

Then blink! Here is your output.

P.S. Did you notice that you can change the number of printed decimal places all the way to 6 decimal places? However, I don't recommend this. It won't help very much. All you will do is clutter up the data presentation and you have enough to worry about. It won't help with the tests of statistical significance either.
 

 
REMEMBER! If you get a row of zeros (e.g., .000) next to "p ="  or underneath the "P" or "Probability" header, this means that your results could have occurred by accident fewer than once in 1,000 (or even 10,000) samples if the "B" coefficient or the R2 were really zero. Report this as p < .001 (or p < .0001 if you have four zeros). Such a result is VERY statistically significant and it means your relationship is probably REAL, i.e., non-zero.

IMMEDIATELY PRINT THESE PAGES BEFORE YOU ANSWER THE ASSIGNMENT QUESTIONS BELOW!


HERE'S SOME GUIDANCE TO HELP YOU EVALUATE YOUR MULTIPLE REGRESSION RESULTS

FIRST examine your univariate and bivariate statistics: the means, standard deviations, and the correlation coefficients.
Note any unusually strong or very weak correlations (e.g., under |.10| or over |.50|).
MAKE SURE YOU KNOW WHAT THE METRIC IS OF YOUR DEPENDENT VARIABLE (pounds of weight? number of weekly workhours? years of education? number of library books?) ! This will be the metric you will use for the Bs. (HINT: Your dependent variable in this exercise is years of respondent education.)
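
As a refresher on the bivariate step, here is a minimal Python sketch that computes Pearson's r from two lists of scores and flags values under |.10| or over |.50|. The function names and the labels "very weak" and "unusually strong" are invented for this sketch. They are NOT the course strength chart, which you should still use for your written answers.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the deviations of two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlation(r):
    """Flag correlations worth a second look (labels are hypothetical)."""
    if abs(r) < 0.10:
        return "very weak"
    if abs(r) > 0.50:
        return "unusually strong"
    return "in between"


# Made-up scores, NOT GSS data.
r = pearson_r([8, 12, 12, 16], [10, 12, 14, 16])
print(round(r, 2), flag_correlation(r))
```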

SECOND see if the overall R2 is statistically significant. Use the Global F-Test results and look at the "P" for probability level.

The null hypothesis is, Ho : R2 = 0

The alternative hypothesis is  HA:  R2 > 0.

Because R2 is a squared measure, it cannot be a negative number.

If the significance or probability level for the F test is small (p < .05), then the R2 is REAL (non-zero).
Usually this means at least one B is non-zero.
Go to step 3.

If the R2 is basically 0 (p > .05), any apparent influence of the predictors on the dependent variable is an ACCIDENT. STOP HERE IN THIS CASE! GO NO FURTHER!
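
If you are curious where the Global F comes from, it can be computed directly from R2, the number of cases, and the number of independent variables. Here is a short Python sketch using the standard formula F = (R2 / k) / ((1 - R2) / (n - k - 1)). The function name and the numbers in the example are made up. This sketch does not compute the probability level; read that from the "P" on your output.

```python
def global_f(r2, n, k):
    """Global F statistic for a regression.

    r2: the (unadjusted) R2 from the output
    n:  number of cases actually used in the regression
    k:  number of independent variables (3 in this assignment)
    """
    df_regression = k
    df_residual = n - k - 1
    return (r2 / df_regression) / ((1 - r2) / df_residual)


# Made-up numbers for illustration only.
print(round(global_f(0.25, n=104, k=3), 2))
```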

THIRD see if the STRENGTH of R2 is at least weak (.11 plus).
If yes, continue to step 4.
If R2 is .10 or smaller, your results are real but not practically important.
Interpret any Bs with extreme caution.
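
The SECOND and THIRD steps amount to a small decision rule, sketched below in Python. The function name next_step and the three return labels are invented for illustration, and rounding R2 to two decimals before comparing it with .11 is my assumption about how the cutoff is applied.

```python
def next_step(p_for_f, r2):
    """Decide what to do after looking at the Global F-test and R2.

    Returns "stop", "caution", or "continue" (hypothetical labels).
    """
    if not p_for_f < 0.05:
        # R2 is not reliably different from zero: go no further.
        return "stop"
    if round(r2, 2) >= 0.11:
        # R2 is real and at least weak: go on to examine each B.
        return "continue"
    # R2 is real but too small to matter much in practice.
    return "caution"


print(next_step(0.20, 0.30))
print(next_step(0.001, 0.05))
print(next_step(0.001, 0.25))
```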

FOURTH NOW examine each of the Bs.

The null hypothesis for each separate B, Ho : B = 0

The alternative hypothesis is  HA:  B > 0 (or B < 0, whichever direction you predicted) for a 1-tailed test

You can use a 1-tailed (1-sided) test if you predicted the direction of the effect of each independent variable on the dependent variable in advance. (Re-examine the box in the early part of this assignment.)

or HA:  B =/= 0 for a 2-tailed test

If you use a 1-tailed test and the sign of B is in the direction you predicted, you can cut the probability level associated with that B in half. (The probability levels the program prints are for 2-tailed tests.) That means that some of the smaller Bs in a regression will be statistically significant where they would not be were a 2-tailed test used.
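
Here is a minimal Python sketch of the halving rule. The function name one_tailed_p is made up. Note the catch built into it: halving only applies when B comes out in the direction you predicted. If B has the opposite sign, the 1-tailed probability is large (1 minus half the 2-tailed value), not small.

```python
def one_tailed_p(p_two_tailed, b, predicted_sign):
    """Convert a printed 2-tailed probability to a 1-tailed one.

    predicted_sign: +1 if you predicted a positive B, -1 if negative.
    """
    same_direction = (b > 0 and predicted_sign > 0) or (b < 0 and predicted_sign < 0)
    if same_direction:
        return p_two_tailed / 2
    # B came out opposite to the prediction (or exactly zero).
    return 1 - p_two_tailed / 2


# Made-up values: a negative B was predicted and a negative B was found.
print(one_tailed_p(0.08, b=-0.3, predicted_sign=-1))
```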

B can be positive or negative.
The test for the statistical significance of each B simultaneously tests whether the accompanying Beta Weight (BETA) is zero, too.

Any B smaller in absolute value than twice its own standard error will usually have a significance level greater than .05.
(Yes, you read that right.)
This means any apparent influence of that B is so small that it is a sampling ACCIDENT and that B is really 0.
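
The "twice its own standard error" rule is just the t-statistic in disguise: t is B divided by its standard error. A quick Python sketch follows, with invented function names and made-up numbers. Treat it as a rule of thumb for large samples like the GSS, and rely on the printed probability level for your actual answer.

```python
def t_ratio(b, se):
    """The t-statistic for a regression coefficient: B over its standard error."""
    return b / se


def passes_rule_of_thumb(b, se):
    """True when |B| is at least twice its standard error (|t| >= 2)."""
    return abs(t_ratio(b, se)) >= 2


# Made-up coefficients and standard errors for illustration.
print(t_ratio(0.5, 0.1), passes_rule_of_thumb(0.5, 0.1))
print(t_ratio(-0.15, 0.1), passes_rule_of_thumb(-0.15, 0.1))
```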

Use a marker to note the Bs with statistical significance p < .05.
These are REAL or nonzero.

Discuss how the statistically significant Bs raise or lower scores on the dependent variable (see my example for how pounds of weight works and follow it along: For example, for each 15 minute period a woman exercised, she would weigh 1 pound less.)

CLICK HERE TO REVIEW THE WEIGHT EXAMPLE.
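
To see how Bs turn into statements about years of education, here is a small Python sketch of a regression equation with three predictors. THE COEFFICIENTS BELOW ARE MADE UP for illustration. They are not the answers to this assignment, so plug in the values from your own output. The function name predict_educ is also invented.

```python
def predict_educ(sibs, family16, maeduc, a, b_sibs, b_family16, b_maeduc):
    """Predicted years of education: a + B1*sibs + B2*family16 + B3*maeduc."""
    return a + b_sibs * sibs + b_family16 * family16 + b_maeduc * maeduc


# Hypothetical constant and Bs (NOT real results).
coefs = dict(a=10.0, b_sibs=-0.2, b_family16=0.5, b_maeduc=0.3)

base = predict_educ(sibs=2, family16=1, maeduc=12, **coefs)
one_more_year = predict_educ(sibs=2, family16=1, maeduc=13, **coefs)

# The difference between the two predictions equals the B for maeduc:
# the net change in educ for a one year change in mother's education.
print(round(base, 2), round(one_more_year - base, 2))
```

The same logic applies to the dummy variable: switching family16 from 0 to 1 while holding everything else constant changes the prediction by exactly the B for family16.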

FIFTH Look at the BETA weights of the SIGNIFICANT Bs. (Remember that the Bs that were not statistically significant are really 0 in the population and so are the corresponding Beta Weights.)

Rank the Beta Weights from most to least important in terms of absolute value size.
Discuss the strength and direction of each statistically significant beta weight.
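
Here is a short Python sketch of the ranking step. A Beta Weight is the B rescaled into standard deviation units (Beta = B times the standard deviation of the independent variable, divided by the standard deviation of the dependent variable), and the ranking ignores the sign. The function names and all of the numbers are made up for illustration. They are not results from the GSS.

```python
def beta_weight(b, sd_x, sd_y):
    """Standardized coefficient: B rescaled by the two standard deviations."""
    return b * sd_x / sd_y


def rank_by_beta(betas):
    """Variable names ordered from largest to smallest absolute Beta."""
    return sorted(betas, key=lambda name: abs(betas[name]), reverse=True)


# Hypothetical Beta Weights (NOT real output).
made_up_betas = {"sibs": -0.15, "family16": 0.05, "maeduc": 0.40}
print(rank_by_beta(made_up_betas))
```

Because the Betas are all in the same standard deviation units, they can be compared with each other even when the independent variables are measured in different metrics (years, number of siblings, or a 0-1 dummy).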
 


HERE'S WHAT YOU TURN IN BY CLASS ON MONDAY NOVEMBER 29 2004



  Here's what you turn in to me by class Monday November 29 (you may add a short explanation to your answer to any of these questions). Points for each part are in parentheses:

Your printed output (2 points maximum) for:

The regression itself

The global F-Test statistics and t-tests for each B

The univariate means and standard deviations and

The correlation matrix of all four variables in this regression.

REMEMBER TO RECODE SIBS AND FAMILY16

THEN, YOUR ANSWER TO QUESTIONS 1- 12 BELOW:

(1) (1 point) Describe the numeric values of  the three zero-order (bivariate) Pearson r correlations between each of your independent variables and your dependent variable educ. (Include DIRECTION.)(Form is assumed to be linear because these are Pearson's r.)

(2) (1 point) Describe the strength and direction of each of these three correlations from part (1) in words.
 

CLICK HERE TO REVIEW THE CHART ON CORRELATION COEFFICIENT STRENGTH
YOU WILL ALSO USE THIS CHART FOR THE BETA WEIGHT STRENGTH

(3) (2 points) Overall, how much variance ALL TOGETHER (i.e., R2) did you predict in educ with your three independent variables?

(4) (1 point) Was the value of R2 statistically significant (that is, "REAL" or non-zero)?

(5) (1 point) What was the "significance" level or the probability level of the R2 for this regression?

(6) (2 points) Write out the estimated numeric regression equation for educ  using your independent variables of sibs, family16, and maeduc.

(7) (2 points) What was the probability level (significance level) for the effect of each independent variable [those are the Bs] on educ?

You can construct a chart that shows the dependent variable (educ), and how educ was affected by each independent variable, showing the Bs, beta weights, and significance levels. This is a short way to present the numeric results but such a chart cannot substitute for describing the effects in words. (See examples in Guide 8).

(8) (2 points) Interpret each of the effects of sibs, family16, and maeduc on educ in words [i.e., how much the years of respondent education rose or fell for a one year change in maeduc, or how many more or fewer years of education the respondent had if they grew up in a two parent family.]

(9) (1 point) What was the relative impact (the BETA weight) of each of your independent variables on educ?
Describe these results numerically for each independent variable.

(10) (1 point) Describe the strength (remember to use our chart!) and the direction of each of the beta weights on educ.

(11) (2 points) Rate each independent variable from most to least important in order in terms of how much each predictor influenced the dependent variable (remember to ignore the + or - signs while you actually rank the effects of the independent variables and use the absolute value! ).

(12) (2 points) Did you use the "B"s or the "Betas" for part (11)?
BRIEFLY describe the reason behind your choice.
 
 

 
REMEMBER! THE F-TEST IS NOT A CORRELATION COEFFICIENT. 
THE T-TEST IS NOT A CORRELATION COEFFICIENT.
Instead, the F tests the statistical significance level of the correlation coefficient R2.
Each t tests the statistical significance of the associated B.

If the entity is greater than one, it CANNOT BE  a correlation coefficient.




Susan Carol Losh November 21, 2004