EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH


 
 

ASSIGNMENT 5: BASIC MULTIPLE REGRESSION
20 POINTS

GENERAL FEEDBACK ASSIGNMENT 5

This assignment is worth 5 PERCENT toward your final grade.
Remember! I use plus and minus grading on assignments and for the final grade.


This Feedback page is generic. If you feel it does not address the score on your paper, please quickly make an appointment and Maria or I will go over your paper.

Again, despite the "AQ" (Anxiety Quotient), most people did quite well. The median score was 20/20 and the mean was 18.97 points. (Please see me if those terms are unfamiliar.) This is very consistent with scores on the earlier assignments.

REPEAT: Most of us get nervous when we learn new material. There is that ghastly feeling of not quite having one's feet on the floor. But, as you know by this time, such a feeling dissipates with practice.

Are you a "regression whiz," ready to tackle the most complex elaborations and complications on regression? Nope. Should you be able to read (or perform) basic analyses and interpret your own or others' basic analytic results? You should.
 

The 18-20 point paper
 (2 points).

As questions 1 and 2 requested, you only examined how the dependent variable, educ, related to sibs, family16 and mother's educational level, because educ is what you want to predict using sibs, family16 and mother's educational level as independent variables. You observed but did NOT include the intercorrelations among the INDEPENDENT variables in this question.

You used only the bivariate rs for this question and no other measures, not Bs, not Beta weights.

Bivariate correlation (r) of educ (respondent's education) with:

VARIABLE                   NUMERIC VALUE   STRENGTH    DIRECTION
Sibs                       -.25            weak        negative
Family16 (1 = 2 parent)    +.10            very weak   positive
Mother's education         +.37            moderate    positive

You did NOT say that any of the BIVARIATE correlations were zero or nonzero, because your correlation output in the regression package did not report statistical significance, and we have no way of knowing it without a formal statistical test of the bivariate correlation. (Such tests exist, but we did not do them on this assignment. The correlations program of SDA will include statistical significance tests for the bivariate rs. SPSS is similar to SDA in this respect.)

We CANNOT extrapolate from multivariate results to bivariate results, i.e., if the B for family16 were zero, that does NOT mean the bivariate correlation is zero. It might be--and it might not be. Often, a bivariate correlation is larger and statistically significant, but the B, which is a NET result controlling all the other variables in the equation, is not statistically significant. That's a big reason we do multivariate analyses in the first place.

Similarly, we certainly cannot extrapolate from bivariate results to multivariate results (the Bs) because the Bs are multivariate effects that are net of the statistical controls of all the other variables in the regression equation. In fact, we often expect that the bivariate correlations will be larger than the partial (controlled) correlations.

Yes, dummy variables DO have a direction. In the case here of a dichotomy, they tell us about mean differences on the dependent variable by the two categories of the independent variable. A positive correlation between family16 and years of education means that adults who were adolescents in two parent families have more years of education--just as this also appears for the B and beta coefficients.


You correctly identified the percentage variance explained in educ by sibs, family16 and mother's years of education as 16.4% because you realized that R2 X 100 = the percent variance explained in the dependent variable.

You identified R2 as "real," NOT because its value was .164 (real but weak) and NOT because the case base was large, but BECAUSE the probability level associated with this R2 was less than .001. The formal F-test associated with R2, and nothing else, is the basis for that conclusion.

DO NOT TRUST SAMPLE VALUES without substantiating evidence (such as a probability level). Many sample values "look real," that is, they LOOK nonzero, when in fact they are simply sampling fluctuations around zero no matter how large they appear. Yes, as novices and beginning analysts, we may "eyeball" some results, and it is true that small results are very often statistically significant with large samples of 1000 or more. However, when the formal statistical test (in this case, an F-ratio) is right there on the output, this is what to use.
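To see why the size of R2 alone cannot tell you whether it is "real," here is a minimal Python sketch. It uses the standard textbook formula for the overall F-ratio, F = (R2 / k) / ((1 - R2) / (n - k - 1)), where k is the number of independent variables and n is the number of cases. The sample sizes below are HYPOTHETICAL (they are not the case base from the assignment output) and the function names are mine, chosen only for illustration.

```python
def percent_variance_explained(r_squared):
    """R2 X 100 = percent of variance explained in the dependent variable."""
    return r_squared * 100


def f_statistic(r_squared, n_cases, n_predictors):
    """Overall F-ratio for R2, using the standard textbook formula."""
    df_model = n_predictors                 # numerator degrees of freedom (k)
    df_error = n_cases - n_predictors - 1   # denominator degrees of freedom (n - k - 1)
    return (r_squared / df_model) / ((1 - r_squared) / df_error)


R2 = 0.164   # from the assignment output
K = 3        # sibs, family16, mother's education

print(round(percent_variance_explained(R2), 1))  # 16.4

# HYPOTHETICAL sample sizes: the same R2 yields very different F-ratios.
for n in (30, 300, 3000):
    print(n, round(f_statistic(R2, n, K), 2))
```

The same R2 of .164 produces a much larger F-ratio as the number of cases grows, which is why the probability level on the output, not the eyeballed size of R2, is what to report.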

The numeric regression equation for this problem was:
 
 

Ŷ (predicted educ) = 10.652 - 0.178 X1 + 0.627 X2 + 0.266 X3

where:

X1 was number of siblings
X2 (D1) was family16 (2 parents = 1, otherwise 0) and
X3 was mother's years of education

By convention, the constant term (10.652) goes first.
By convention, the value of the b coefficient precedes the variable designation (.627 X2), as you see it above.
Be sure to include the variable names, otherwise we can't know which slope goes with which variable. (Some people left them off; typically you lost 1 point of credit if you did.)

Be sure to include the constant. It is what makes this equation a prediction equation.

The metric regression Bs were what needed to be in this equation. Not the bivariate rs, not the beta weights, and not the probability levels. This is a prediction equation. You lost 1 to 2 points credit, depending on what else you put.

And you noted that the Bs for all three independent variables (and the constant term too) were statistically significant at the p <.001 level.
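As a rough illustration of why this is called a prediction equation, the metric Bs and the constant can be dropped straight into a small Python function. The function name, argument names, and the example respondent are hypothetical; only the four coefficients come from the assignment output.

```python
# Metric coefficients from the assignment's regression output.
INTERCEPT = 10.652
B_SIBS = -0.178      # X1: number of siblings
B_FAMILY16 = 0.627   # X2: dummy variable, 2 parents at age 16 = 1, otherwise 0
B_MA_EDUC = 0.266    # X3: mother's years of education


def predict_educ(sibs, two_parents_at_16, mother_educ):
    """Return predicted years of education for one (hypothetical) respondent."""
    family16 = 1 if two_parents_at_16 else 0
    return (INTERCEPT
            + B_SIBS * sibs
            + B_FAMILY16 * family16
            + B_MA_EDUC * mother_educ)


# Hypothetical respondent: 2 siblings, two-parent family, mother had 12 years.
print(round(predict_educ(2, True, 12), 3))  # 14.115
```

Notice that the answer comes out in years of education, the unit of the dependent variable. Leaving out the constant would lower every prediction by 10.652 years, which is why the constant must be in the equation.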

PARENTHETICAL NOTE: I generally suggest including ALL terms in the initial predictive regression equation, even if some are not statistically significant. Each term had a conceptual reason for being there, otherwise it wouldn't have been in (your) regression in the first place.
 

 
In professional papers and articles, you will often see two estimated regression equations. The first contains all the terms for all the independent variables that were expected to be predictors of the dependent variable. This equation shows at a glance the independent variables that the researcher thought would be important.

The second is the "streamlined" equation, which only contains the terms for the independent variables that were statistically significant. The investigator then goes on to compare the two equations and speculate why variables initially thought to be conceptually important were not statistically different from zero in their effects on the dependent or response variable.
 

DESCRIBING THE RESULTS IN WORDS GAVE THE MOST TROUBLE

Each additional year of mother's education increased one's own years of education by .266 years (about one-quarter of a year), controlling for the number of siblings and family composition at age 16.

If you think about it, this really is a substantial finding as well as a statistically significant one. Every four years of a mother's education translates into a one year increment in the adult child's completed schooling. Considering that educational levels in the USA rose nearly SIX YEARS on the average from 1900 to 2000, mother's education makes a very substantial contribution to child's education. As Peter Blau and Otis Dudley Duncan asserted nearly 40 years ago, providing for a child's schooling is apparently a major way in which occupational and social class inheritance "works" in modern industrial societies.

Controlling mother's education and number of siblings, adults who were in two parent families at age 16 had about two-thirds of a year more education (.627) than those who grew up in other circumstances.

For each additional brother or sister, net of family16 and mother's education, the individual had about 1/6 of one year less education (-0.178). Thus, adults who had 6 siblings had, on the average, about one year of education LESS than only children (who had no siblings).
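The arithmetic behind these three statements can be checked with a short Python sketch. Each metric B is multiplied by a change in its own independent variable, holding the other two constant. The dictionary and function names are hypothetical; the coefficients are the metric Bs from the assignment output.

```python
# Metric Bs from the assignment output (units: years of respondent's education).
B = {"sibs": -0.178, "family16": 0.627, "ma_educ": 0.266}


def net_change(variable, change_in_x):
    """Predicted change in years of education for a given change in one
    independent variable, controlling for the other two."""
    return B[variable] * change_in_x


print(round(net_change("ma_educ", 4), 3))   # 1.064: four more years of mother's education
print(round(net_change("sibs", 6), 3))      # -1.068: six siblings versus none
print(round(net_change("family16", 1), 3))  # 0.627: two parent family versus other
```

Every result is in YEARS of education, not a percentage, a likelihood, or a correlation.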
 

 
DO NOT interpret metric Bs as percentages. THEY ARE NOT. Since the dependent variable is years of respondent's education, the units come out as years (or fractions of years).

DO NOT interpret metric Bs in OLS regression as likelihoods, rates or ratios. THEY ARE NOT. They come out in the metric or the data unit OF THE DEPENDENT VARIABLE. If the dependent variable is number of years of education completed, so are the metric Bs.

DO NOT interpret metric Bs as "strong," "weak," "moderate," etc. They are metric measures, not standardized ones. They come out in the metric or unit of the dependent variable. If the dependent variable is number of years of education completed, so are the metric Bs. Metric Bs are slopes. A correlation coefficient is not a slope.

DO NOT compare metric Bs with the zero order correlation coefficients. The r is a standardized measure. The metric B is NOT standardized. 
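For readers who like to see it written out, the standard textbook relationship between a metric B and its standardized Beta Weight is given below. This formula is general background, not something printed on the assignment output, and the symbols s_{X_j} and s_Y (the sample standard deviations of independent variable j and of the dependent variable) are my notation.

```latex
\beta_j = B_j \cdot \frac{s_{X_j}}{s_Y}
```

Because B_j carries the units of the dependent variable per unit of X_j, while the Beta Weight and r are unit-free, a metric B cannot be placed on the same scale as a correlation coefficient.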
 

One problem was that a few students confused metric Bs with standardized Beta Weights. A couple of others confused metric Bs with correlation coefficients.


How the standardized regression coefficients (the Betas) influenced educ (respondent's years of education):

VARIABLE    RANK   BETA NUMERIC VALUE   STRENGTH    DIRECTION
Ma-educ     1      .320                 moderate    positive
Sibs        2      -.153                weak        negative
Family16    3      .085                 very weak   positive

You used the Beta Weights to rank order the independent variables in terms of their influence on educ BECAUSE:

Beta Weights are standardized. They come out in standard deviation units.
Therefore, you can use them to directly compare the net effects of the independent variables WITHIN a single equation because the metric is the same now for all independent variables, just as it was when we did z-scores. The new metric is "standard deviation units".

Use the ABSOLUTE VALUE OF THE BETA WEIGHT to rank order their effects, then add back the positive or negative sign.

Use the metric Bs to compare across groups or populations, or to have a prediction equation.

You don't choose Beta Weights (or Bs) because they are "more realistic" (I'm not sure what that means; both the Bs and Betas are realistic because they are both based on observed data) or more moderate, or by any other subjective criteria. The Bs have a special role to play in regression analysis and the Beta Weights have a different special role to play in regression analysis.
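The ranking rule (rank on the absolute value, then put the sign back) can be sketched in a few lines of Python. The Beta Weights are the ones from the assignment output; the function and variable names are hypothetical.

```python
# Standardized Beta Weights from the assignment output.
betas = {"ma_educ": 0.320, "sibs": -0.153, "family16": 0.085}


def rank_by_absolute_beta(beta_weights):
    """Order variables from largest to smallest ABSOLUTE Beta Weight.
    The original sign is kept so the direction can still be reported."""
    return sorted(beta_weights.items(),
                  key=lambda item: abs(item[1]),
                  reverse=True)


for rank, (name, beta) in enumerate(rank_by_absolute_beta(betas), start=1):
    direction = "positive" if beta > 0 else "negative"
    print(rank, name, beta, direction)
```

Sorting on the signed values instead would wrongly put sibs (-.153) last, behind family16 (.085), even though sibs has the larger net effect.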



 
YOU LOST CREDIT IF

You used the correlations among the INDEPENDENT variables for question 2.

Your output was not complete.

You said the R2 was real or not real because of its size (.164) or the sample size, instead of looking at the probability level. WE TEST FOR STATISTICAL SIGNIFICANCE BECAUSE SAMPLE RESULTS CAN BE MISLEADING. They can look nonzero even when they really are sampling fluctuations around a true population value of zero. Conversely, in large samples, it is not unusual to find a small R2 that is highly statistically significant. Use the F-test results and the p-level to see if R2 is statistically significant or not.

You left out the "X"s in the regression equation (it was OK to put variable names such as age or gender; it was NOT OK to totally omit the identity of each independent variable.)

You didn't use the metric B slopes in the estimated regression equation but instead put Beta Weights, correlation coefficients or some other number that was not a metric B.

You interpreted metric B slopes as percentages, likelihoods, rates, ratios, or correlation coefficients. You did not use the unit of the dependent variable to describe the Bs.

You didn't rank order the effects of the independent variables using the beta weights.

You put the probability levels instead of either the Bs or the Beta Weights (or conversely, you put the Bs or Beta Weights instead of the p-levels when asked.) Remember, you must be able to tell these entities apart when you read in order to assess the author's conclusions!

You thought the beta weight was a correlation coefficient (it is NOT; it is a standardized regression coefficient.) You thought the beta weight was a probability (it IS NOT).

You messed up the decimal places. This may seem trivial, but, in fact, is important. A correlation coefficient of 0.06 is negligible. A correlation of 0.60 is strong. Confusion over decimal places could lead you to believe results were statistically significant when they were not, or vice versa; or, this confusion could mean the Beta Weights were misinterpreted. Assess whether the confusion over decimal places was a careless error or if you really have trouble knowing the difference. I consider the latter a mild form of math dyslexia, but it CAN be overcome with help and with practice.

Although there were some errors, most were "novice errors," that is, you will be unlikely to repeat them with more practice. 77% were 19 or 20 point papers! Congratulations!

For those continuing on to EDF5401, you will see many of these terms again! Good luck!
 
PLEASE STUDY YOUR ASSIGNMENT. COMMENTS ARE ON IT AS APPROPRIATE.
THERE WILL BE A SIMILAR PROBLEM ON EXAM 3.
LOOK FOR ANOTHER REGRESSION EXAMPLE ON THE EXAM 3 STUDY GUIDE.



November 30 2004
Susan Carol Losh