 
 
 

ASSIGNMENT 4 IS DUE FRIDAY 11-12 BY 3 PM


 
EDF 5400 INTRODUCTORY STATISTICS
FALL 2004
GUIDE 7: REGRESSION BASICS

DR SUSAN CAROL LOSH

ASSIGNMENT FIVE SPECIFICATIONS
GENERAL FEEDBACK EXAM 2

Yes, it is true. Time really flies!
Here is what the rest of the Fall term looks like:

ASSIGNMENT OR EXAM     DUE DATE
Assignment 4           November 12 (Friday, 3 PM, my box)
Assignment 5           NEW DATE!!! November 29 (Monday)
Exam 3                 December 8 (Wednesday, 5:30)

 
READ THIS GUIDE FIRST!
KEY TO: Agresti and Finlay, Chapter 9, pp 301-342; Chapter 11, pp 382-404 and pp 411-421

I treat bivariate and multiple regression as a comprehensive unit because I believe it is easier to learn this way. Therefore, it is a good idea to read through Guides 7 and 8 which translate much of this material, then go back and read both selections of Agresti and Finlay's Chapters 9 and 11.
 


 
AN EXAMPLE
EXTENDING THE EXAMPLE
BEGINNER'S RULES
"DUMMY VARIABLES"
STANDARDIZED COEFFICIENTS

Multiple crosstabulation or contingency tables can take us only so far.

TEXT DISCREPANCY NOTE: Agresti and Finlay use the term "statistical control" in its generic sense, i.e., adding a third or "control" variable. However, in crosstabulation tables, we are using what most statisticians call PHYSICAL CONTROL, because we have literally physically divided cases into separate tables. In line with the more common usage, I reserve the term STATISTICAL CONTROL for regression type analyses in which the sample or population is analyzed as a whole, and control variables are added through mathematical techniques that solve systems of simultaneous equations.

On the other hand, the technique of Regression is an extremely elegant way to summarize interrelationships between a set of independent variables and a single dependent variable. It is also a relatively easy technique for novices to grasp because it is additive and linear.
 
SOME BASIC REGRESSION RULES
Instead of the physical control that you used with multivariate crosstabulation analysis, regression uses statistical control. All predictors are entered in one equation. Mathematically, regression adjusts the prediction coefficient of each independent variable on the dependent variable for the effects of all other independent variables.

The resulting predictive regression coefficients are net effects, controlling for every other independent variable in the regression equation.

This is one terrific technique!
IF YOU CAN MEET THE ASSUMPTIONS THAT ARE REQUIRED TO USE IT.

  A WEIGHTY EXAMPLE TO START US OFF

You are a nutritionist--or a sports psychologist--or a medical doctor--or perhaps just a weight watcher--who is trying to predict the weights of a group of adult women enrolled in Educational Psychology courses. You have measured the weight in pounds of each woman enrolled in these classes. (You have also measured several other variables, but we will get to those in a little bit.)

With no other information, your best guess for each woman's weight will be the average or mean weight in pounds of all enrolled women.

Obviously, with no other information to go by, you will be wrong much of the time as you try to predict each woman's weight. Mean weight gives us very little information about how much any one individual weighs. For weights close to the female mean, you will be reasonably accurate, but if there is any kind of variation in weight across your sample of women, you will make many, many mistakes or errors.

How much will you be wrong (on the average)? We can measure that with an old friend:
You can measure your average error with the standard deviation of weight. That is the "average distance" each woman's weight is from the mean weight score.

It's also your average "error" or "mistake" in guessing weight if you use the mean weight score as a predictor.

Actually, we are going to use a "cousin" of the standard deviation. This time, instead of the average deviation, we look at a total error estimate for the entire sample, which we call "the total sum of squared deviations around the mean," or the "Total Sum of Squares," or just TSS for short. You met the total sum of squares in Guide 3 on the way to learning the standard deviation.

All we do is add up the entire set of squared deviations between an individual score and the mean score, as you see below. We will use "y" to represent our dependent variable score.
 

 
THE FORMULA BELOW IS FOR THE TOTAL SUM OF SQUARES
$TSS = \sum_{i} (y_i - \bar{y})^2$, summed over all cases
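As a minimal sketch of the Total Sum of Squares, the short Python example below adds up the squared deviations of each score from the mean. The five weights are hypothetical values invented for illustration; they are not data from the course.

```python
# Hypothetical weights in pounds for five women (illustrative values only).
weights = [110, 120, 125, 130, 140]


def total_sum_of_squares(scores):
    """Add up the squared deviation of each score from the mean score."""
    mean = sum(scores) / len(scores)
    return sum((y - mean) ** 2 for y in scores)


tss = total_sum_of_squares(weights)
print("mean weight:", sum(weights) / len(weights))  # 125.0
print("TSS:", tss)                                  # 500.0
```

The mean here is 125 pounds, so the deviations are -15, -5, 0, 5 and 15, and their squares add up to 500.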


Aha! A jolly statistician comes to your rescue to reduce your errors in prediction.
"Did you also measure each woman's height in feet and inches?" she asks.
"Of course," you respond.

"Then," she says, "I have a formula for you that will help you predict weight better than simply using the mean weight score alone and here it is:

We can write this easy formula this way:

Weight in pounds = 100 pounds for the first five feet of height + 5 pounds per inch over five feet.

That's pretty verbose, so let's call the number of inches of height over five feet "x" and pounds of weight "y". Then we can rewrite our formula very simply as:

y  =  100  + 5x

Now, let's see if you can do a better job predicting weight."

Oops. We left some things out.

After all, do all five feet tall women weigh 100 pounds?
Of course not, some weigh more and some weigh less.

Do all five feet five inches tall women weigh 125 pounds?
No, some do, but some weigh more and some weigh less.

So we really need to rewrite the equation this way:

y  =  100  + 5x + e

Where "e" stands for the ERROR TERM, or the difference between the woman's actual observed weight and the weight we would predict for her if we knew her height score.

The "e" term helps us balance the equation exactly so that the left hand side and the right hand side of the equation match. Sometimes it is called the RESIDUAL TERM (see below).

and we can write the equation more generically still as:

y  =  a  + bx + e

NOTE: these are terms that are typically used with SAMPLE data. Very often, with POPULATION data, you will see the Greek letters used, $a = \alpha$ and $b = \beta$, or:

$y = \alpha + \beta x + \epsilon$

where "a" (or alpha) is the "intercept term" and "b" (or beta) is the "slope" term and "e" (or epsilon) is the "error" term. We can graph "perfect" height and weight with no errors in prediction this way in the following graph. The formula creates a straight line.

The "intercept term" or "constant" term ("a") is where the line crosses the "y" axis or the dependent variable score. It is the Y score we would expect if the X score were "0". For example, if a woman were exactly five feet tall, her "x" score for the number of inches over five feet would be zero and she would weigh exactly 100 pounds.

The "b" term is a slope. It is the rate of increase (or decrease) in the dependent variable for a ONE UNIT CHANGE IN THE INDEPENDENT VARIABLE.

The "b" term tells us how many units to go up or down in the dependent variable for a one unit change in the independent variable.

IN OUR EXAMPLE, FOR "X" or "Height," ONE UNIT IS ONE INCH OF HEIGHT (over five feet).  The slope term, weight change or "b" term is five POUNDS.

IMPORTANT: The b terms ARE ALWAYS IN THE UNITS OF THE DEPENDENT VARIABLE. In my example, the bs will always come out as pounds.


Thus, for each additional inch of height over five feet tall, the woman weighs five more pounds.
PLEASE BECOME USED TO SAYING THESE RESULTS IN WORDS. YOU WILL HAVE TO DO SO ON ASSIGNMENT 5 AND ON EXAM 3.
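
The intercept and slope can be tried out with a few lines of Python. This is only a sketch of the guide's example line (a = 100, b = 5); the function name and the heights chosen are illustrative assumptions.

```python
def predict_weight(inches_over_five_feet, a=100, b=5):
    """Predicted weight in pounds: a (intercept) plus b (slope) times x."""
    return a + b * inches_over_five_feet


# Each extra inch of height adds b = 5 pounds to the predicted weight.
for x in (0, 1, 5, 10):
    print(f"{x} inches over five feet -> {predict_weight(x)} pounds")
```

In words: a woman exactly five feet tall (x = 0) is predicted to weigh the intercept, 100 pounds, and each additional inch of height raises the prediction by five pounds.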
 


 y or Weight in pounds 

              150     |                                        X 
              145     |                                    X 
              140     |                                X 
              135     |                            X
              130     |                        X 
              125     |                    X
              120     |                X
              115     |            X 
              110     |        X
              105     |    X 
              100     |X______________________________________________________
                      5' 5'1" 5'2" 5'3" 5'4" 5'5" 5'6" 5'7" 5'8" 5'9" 5'10"

 x or height in inches (5 feet plus)
TECHNICAL NOTE: these Xs here are supposed to form a straight line...

If you were to draw a line connecting these points, it should look like a straight line, with all the dots or "x"s on the line. This is an example of a "perfect linear relationship" with
r = 1.0
 

We call the b (or $\beta$ in the population) term the METRIC REGRESSION COEFFICIENT. Notice that the metric b term always comes out in the metric units of the dependent variable, such as pounds.

In our example here, b always comes out IN POUNDS and weight in pounds is our dependent variable, or what we are trying to predict.

If you were trying to predict Graduate Record Exam scores, the metric b term would be in metric units of GRE points.

Sometimes the metric regression coefficient is called the "unstandardized" regression coefficient. Use the metric coefficient when you want to make a definitive prediction (e.g., a person's weight in pounds or the dollars of a person's salary income).


In the "real world," we probably can't perfectly predict weight from height. Instead of connecting the points to form a straight line, we are more likely to have a "cloud of points" with a certain amount of variability around the line. Each point represents a combined height and weight score.
 


 y or Weight in pounds 
                                                                                                                     .
              150     |                           .         .   .     X 
              145     |                           .    .    .   X     . 
              140     |                      .    .    .    X   .
              135     |            .    .    .    .    X    .   .
              130     |            .    .    .    X    .        . 
              125     |        .   .    .    X    .    .
              120     |    .   .   .    X    .
              115     |.   .   .   X    .
              110     |.   .   X   .    . 
              105     |.   X   .
              100     |X   .___________________________________________________
                      5' 5'1" 5'2" 5'3" 5'4" 5'5" 5'6" 5'7" 5'8" 5'9" 5'10"

 x or height in inches (5 feet plus)
TECHNICAL NOTE: these Xs here are supposed to form a straight line...
r ~= 0.50

 


Even with some variability, though, we do suspect that ON THE AVERAGE we will do a better job predicting each woman's weight if we know her height than if we had only the weight scores and the mean weight.

What do I mean by "a better job"? The primary way statistical analysts do it is to look at the difference between:

the actual weight scores and the predicted weight scores --  or the "error term" e.
Remember the error term a little ways back?
The DIFFERENCE BETWEEN THE OBSERVED AND THE PREDICTED SCORE?

For example, if we had a woman who actually weighed 130 pounds and we had a predicted weight score (using height) of 125 pounds we would have 130 - 125 or a deviation score of "5".

If we had a woman who actually weighed 120 pounds and we had a predicted weight score (using height) of 125 pounds we would have 120 - 125 or a deviation score of "- 5".

One simple symbolic way to describe the difference (deviation) between the OBSERVED SCORE and the PREDICTED (ESTIMATED) SCORE is to:
 
call the OBSERVED SCORE simply $y_i$
call the PREDICTED (ESTIMATED) SCORE $\hat{y}_i$
(or "y-hat")

(Sometimes you will see the predicted  score written as y' or as "y-prime".)


We then call the difference between the observed and the estimated (or predicted) score:
 

 $e_i = y_i - \hat{y}_i$

The "e" or ERROR TERM gives us some flexibility because we know that we probably will not be able to EXACTLY predict the dependent variable.

In the population, e will be denoted by an epsilon or "$\epsilon$" term.

RESIDUALS

Sometimes the "e" term is called THE RESIDUAL TERM, because the "e" term is what is "left over" when you have done your best job of predicting the dependent variable and produced an estimated dependent variable score for each person.

Another way to think about the residual is as a  deviation from the regression line or plane like the scattered points around the regression line in my box above. More examples of residuals are provided for you below as I extend the example about height and weight. Graphically, the residual is a vertical deviation from the predicted weight score.

There are a lot of regression assumptions about the distribution of the residual terms (do they resemble a normal distribution? are they clearly some other kind of distribution? do they incorporate some type of statistical bias?)  that we will examine in Guide 8.

We can then find an average deviation score and compare the average "e" deviation with the standard deviation. If height is a good predictor of weight, then the average deviation between the observed weight and the estimated weight will be very small (the estimated weight will be very close to the observed weight), and this average "e" deviation will be much smaller than the standard deviation.

To ascertain the TOTAL ERRORS for all our cases put together, we look at the sum of the squared differences between the observed and the predicted dependent variable score for each individual (the "unexplained" or residual sum of squares).
 

 
TECHNICAL NOTE: if we simply added the positive and negative deviations from the observed scores, the heavier than average and lighter than average people would cancel each other out and the sum (and the average) would equal zero.

So, before we add the deviation scores, we square them first, resulting in the equation below:

"SUM OF SQUARED ERRORS" = $\sum (y_i - \hat{y}_i)^2 = \sum e_i^2$

 

REMEMBER THIS PHRASE! We call the sum of the squared differences between the observed and predicted scores on the dependent variable for each individual (what you see in the box immediately above):

(THE AGRESTI AND FINLAY TERM)

The "Sum of Squared Error" or SSE

Makes sense: this is the total sum of all the squared residuals. It is the variation in the dependent variable that all the independent variables put together cannot account for or "explain" in the dependent variable. Hopefully, it is a form of random error or variation from one person to another.
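
A short Python sketch can make the Sum of Squared Error concrete. The six (height, weight) pairs below are hypothetical, chosen so that the positive and negative residuals around the line y = 100 + 5x cancel out. The example shows why we square the residuals before adding them.

```python
# Hypothetical (inches over five feet, observed weight in pounds) pairs.
data = [(0, 104), (2, 108), (4, 115), (5, 130), (5, 120), (8, 143)]


def predict(x):
    """Predicted weight from the guide's example line, y-hat = 100 + 5x."""
    return 100 + 5 * x


# Residual for each woman: observed score minus predicted score.
residuals = [y - predict(x) for x, y in data]

# Positive and negative residuals cancel, so the plain sum is uninformative.
plain_sum = sum(residuals)

# Squaring first keeps every error from cancelling another.
sse = sum(e ** 2 for e in residuals)

print("residuals:", residuals)   # [4, -2, -5, 5, -5, 3]
print("plain sum:", plain_sum)   # 0
print("SSE:", sse)               # 104
```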
 

ALTERNATIVE NAMES FOR THE SUM OF SQUARED ERROR:

Now comes the hard part. This useful diagnostic entity has several different names. It has been called:

"the residual sum of squares" (we don't want that one. Too easy to confuse with the "regression sum of squares.") It has been called

"the error sum of squares" (we don't want that one either, we reserve the abbreviation "ESS" for the explained sum of squares).

the "unexplained sum of squares" (USS) which is one of the more popular terms.

so don't be surprised if you run across any of these three terms instead as you read or attend conferences.



  EXTENDING OUR WEIGHTY EXAMPLE 

With simple regression, you have ONE independent variable. With multiple regression, you have AT LEAST TWO independent variables.

In fact, if you look at the Berkeley SDA program, it allows you to enter 52 independent variables (but I recommend you do not try this at home).


Surely, height is not the whole story for weight, or we would be able to predict each woman's weight perfectly by knowing her height score.

So, let's examine some other possible predictors. How about bone structure (measured at the wrist), daily calories, and weekly exercise?

Let's look at a couple of examples to see how this works:

100 pounds for the first 5 feet
add 5 pounds for each inch over 5 feet
add 10 pounds for each 1/2 inch wrist measurement over 6 inches

(subtract 10 pounds for each 1/2 inch wrist measurement under 6 inches)
add 2 pounds per 1000 daily consumed calories
SUBTRACT 1 pound for each weekly 15 minute exercise period

 
Woman #1 is 5 feet 4 inches tall. Her wrist measurement is 6 inches. She eats 1500 calories per day and has seven 15-minute exercise periods per week. Her predicted weight is:
100 + (5 X 4 inches over 5 feet) + (10 X 0 wrist measure) + (2 X 1.5 kcal) - (1 X 7 exercise periods) 

OR

100 + (5 X 4) + (10 X 0) + (2 X 1.5) - (1 X 7) 

OR

100 + 20 + 0 + 3 - 7 = 116 predicted pounds of weight

Her observed weight is 115 pounds so her "e" or residual score is the difference between her observed score (115 pounds) minus her predicted score (116 pounds): 115 - 116 = - 1 pound.
 


 
 
 
Woman #2 is also 5 feet 4 inches tall. Her wrist measurement is 7 inches. She eats 3000 calories per day but has only ONE 15 minute exercise period per week. Her predicted weight is:
100 + (5 X 4 inches over 5 feet) + (10 X 2 wrist measure*) + (2 X 3 kcal) - (1 X 1 exercise periods) 

*2 half-inch increments over a 6 inch wrist

OR

100 + (5 X 4) + (10 X 2) + (2 X 3) - (1 X 1) 

OR

100 + 20 + 20 + 6 - 1 = 145 predicted pounds of weight

Her observed weight is 150 pounds  and her predicted weight is 145 pounds, so her "e" or residual score is: 150 - 145 = + 5 pounds.
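
The two worked examples can be checked with a small Python sketch. It uses the coefficients from the weight rules above; the function and argument names are illustrative assumptions, not part of the guide.

```python
def predict_weight(inches_over_5ft, wrist_half_inches_over_6,
                   kcal_per_day, exercise_periods):
    """Predicted weight in pounds from the four example predictors."""
    return (100
            + 5 * inches_over_5ft            # 5 pounds per inch over 5 feet
            + 10 * wrist_half_inches_over_6  # 10 pounds per half inch of wrist
            + 2 * kcal_per_day               # 2 pounds per 1000 daily calories
            - 1 * exercise_periods)          # minus 1 pound per exercise period


# Woman #1: 5'4", 6 inch wrist, 1500 calories, 7 exercise periods.
predicted_1 = predict_weight(4, 0, 1.5, 7)
# Woman #2: 5'4", 7 inch wrist, 3000 calories, 1 exercise period.
predicted_2 = predict_weight(4, 2, 3.0, 1)

# Residual = observed weight minus predicted weight.
residual_1 = 115 - predicted_1
residual_2 = 150 - predicted_2

print(predicted_1, residual_1)  # 116.0 -1.0
print(predicted_2, residual_2)  # 145.0 5.0
```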

Notice that even after adding all these predictors, we STILL have not predicted each woman's weight perfectly. However, we should do much better than we would have done knowing only each woman's height, and we will virtually certainly do better than if we knew only mean weight score for the total group of women.

So, once again, we will have a new error term for each woman, that represents the difference between her observed weight and the weight that we predicted for her knowing her height, wrist measurement, kilo calories per day and weekly exercise habits. And we can write this exactly the same way as we did for using only one predictor, that is:

We still call the difference between the observed and the estimated (predicted) score:
 

 $e_i = y_i - \hat{y}_i$

This, again is the RESIDUAL SCORE for each woman.
Each person we study has a unique "e" or residual score.

For example, if woman #1 really weighs 115 pounds, her error "e" score = 115 - 116 or   - 1 pound.
She weighs one pound LESS than expected.

If woman #2 really weighs 150 pounds, her error "e" score = 150 - 145 or   5 pounds.
She weighs 5 pounds MORE than expected.

However, now that we have included three additional predictors, the regression equation (or prediction equation) looks more complicated. Generically, we can now write this equation as you see below (remember for a population we would use the Greek letter $\beta$ instead of the bs):
 
 


 
$\hat{y} = b_0 + b_{y1}x_1 + b_{y2}x_2 + b_{y3}x_3 + b_{y4}x_4$    (estimated)
$y = b_0 + b_{y1}x_1 + b_{y2}x_2 + b_{y3}x_3 + b_{y4}x_4 + e$    (observed)

The difference is that the observed equation includes the "e" term and the estimated equation drops the "e" term. The estimated equation is more generic, what we typically use when we refer to the entire sample.

We are now working with multiple regression. In multiple regression, you have several xs and bs in the equation, one for each INDEPENDENT variable (you still have only ONE DEPENDENT variable.).

And, for our specifics in this weighty example, it becomes:
 

$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4$    (estimated)

$y = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4 + e$    (observed)

where: y = observed weight  in pounds

x1 = number of inches over five feet (negative for under five feet in height)
x2 = number of half-inches in wrist measurement over six inches
x3 = number of kilo calories (1000s of calories) per day
x4 = number of weekly 15 minute exercise periods
e  = error term ($y_i - \hat{y}_i$)
 

This whole process is what we call STATISTICAL CONTROL. You do not need to physically separate your case base into distinct groups as you did with multivariate contingency tables. Instead, as we shall shortly see, you will control through the covariances (a form of correlation) that each variable has with every other variable -- the relationships between your dependent variable (weight in this example) and each of your independent variables in turn -- as well as the correlations that your independent variables have with one another.

Sound immensely complicated? Well...it's not exactly easy, but as we take each piece of the regression separately, it is manageable knowledge. While there are computational formulae to solve for the bs if you have one or two independent variables, more complex models use matrix algebra or partial derivatives from calculus.

But first, let's see what data assumptions must be met to use regression analysis in the first place.
 


ASSUMPTIONS BEHIND REGRESSION ANALYSIS

In regression analysis, we make numeric predictions about scores on a numeric dependent variable.

We also use our independent variables to make precise "more than" or "less than" statements about values on the dependent variable. For example, we say that people who eat more weigh more--and we then make a precise prediction about how many more pounds someone will weigh for each kilo calorie consumed daily.
 
 
BEGINNER'S RULES

This means that:

A nonlinear relationship between the dependent variable and an independent variable will not work with the BEGINNER'S RULES.

The regression equation (like our height, bone structure, kilo calories, exercise and weight example above) defines the "best fitting" straight line or n-dimensional geometric plane to describe our data.

"a" and "b" (or the "bs") are chosen to minimize the "average e". More precisely, the regression line or plane is calculated to minimize the sum of the squared error terms (the SSE) for a particular data set. We will examine the formulas that produce the intercept and slope terms in Guide 8. They are cumbersome to calculate, but that is what computers are for.
 
 

When we have one independent and one dependent variable, we call this SIMPLE REGRESSION.
When we have at least two independent and one dependent variable, we call this MULTIPLE REGRESSION.
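
For simple regression, the way "a" and "b" are chosen to keep the errors as small as possible can be sketched in Python. The example uses the standard least-squares formulas for one predictor (slope = covariation of x and y divided by the sum of squares of x). The notation may differ from the formulas presented in Guide 8, and the five data points are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = covariation / ss_x   # slope
    a = y_bar - b * x_bar    # intercept: the line passes through the two means
    return a, b


def sum_of_squared_errors(xs, ys, a, b):
    """SSE for the line y-hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: inches over five feet and observed weight in pounds.
heights = [0, 2, 4, 6, 8]
weights = [102, 108, 121, 131, 138]

a, b = fit_simple_regression(heights, weights)
sse_fitted = sum_of_squared_errors(heights, weights, a, b)
sse_rule_of_thumb = sum_of_squared_errors(heights, weights, 100, 5)

print("a =", a, "b =", b)                          # a = 101.0 b = 4.75
print("SSE, fitted line:", sse_fitted)             # 11.5
print("SSE, y = 100 + 5x:", sse_rule_of_thumb)     # 14.0
```

For these data the fitted line has a smaller SSE than the rule-of-thumb line y = 100 + 5x, which is what "best fitting" means here.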

 
ANOTHER VIEW OF PEARSON'S r

You originally learned Pearson's r as the correlation coefficient between one interval (or ratio) variable and a second interval (ratio) variable. If you revisit the original complex formula for Pearson's r, here it is:
 
 

 
Here's the formula for Pearson's r:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \; \sqrt{\sum (Y_i - \bar{Y})^2}}$$


 
 
 
The NUMERATOR of Pearson's r is called the COVARIATION BETWEEN THE TWO INTERVAL (RATIO) VARIABLES X & Y.

Divided by the number of cases, this becomes the COVARIANCE OF X AND Y, and the two terms are often used interchangeably.

Sometimes we write $(x - \bar{x})(y - \bar{y})$ more simply as just xy (in italics).

This is the covariation of (the deviation of the independent variable from its own mean) times
(the deviation of the dependent variable from its own mean).

The DENOMINATOR of Pearson's r is the standard deviation of the first variable multiplied by the standard deviation of the second variable. There's no n in the formula because n is in both the original numerator and the original denominator and cancels out.

Pearson's r ranges from -1 to +1. It is usually a fraction with 2 decimal places.
It is zero when there is no LINEAR relationship.
It is +1 when there is a perfect positive linear relationship.
It is -1 when there is a perfect negative or inverse linear relationship.
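
These properties can be checked with a small Python sketch of the deviation-score formula for Pearson's r. The three data sets are hypothetical and were chosen to give a perfect positive, a perfect negative, and a purely curved (nonlinear) relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y over the product of the
    square roots of each variable's sum of squared deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return covariation / (sqrt(ss_x) * sqrt(ss_y))


inches = [0, 1, 2, 3, 4]
r_positive = pearson_r(inches, [100, 105, 110, 115, 120])  # perfect positive
r_negative = pearson_r(inches, [120, 115, 110, 105, 100])  # perfect negative

# y = x squared is a perfect relationship, but it is not LINEAR.
xs = [-2, -1, 0, 1, 2]
r_curved = pearson_r(xs, [x ** 2 for x in xs])

print(round(r_positive, 2), round(r_negative, 2), round(r_curved, 2))
```

The curved example shows why the word LINEAR matters: y is completely determined by x, yet r is zero.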
 

But there are some other ways to conceptualize Pearson's r or $\rho$ (in the population).

We can also see Pearson's r as:

The square of r is $r^2$. When $r^2 = 1$, we say that we have "explained all the variation in the dependent variable."
We will visit other characteristics of Pearson's r later on, but below are some important terms:
 
 
When you have:   ONE INDEPENDENT VARIABLE        AT LEAST TWO INDEPENDENT VARIABLES
                 (and one dependent variable)    (and one dependent variable)
Use:             r                               R
Call it:         The zero order correlation      The multiple correlation coefficient
                 The bivariate correlation       The multivariate correlation

Recall from Guide 5 that $R^2$ (or $r^2$) is THE PRE measure.
R tells us how precisely we can predict scores on the dependent variable from scores on at least one independent variable. As we will see in Guide 8, $R^2$ tells us how much of the variation we can explain in the dependent variable, knowing the scores of all the independent variables in the regression equation.

$R^2$ tells us how close the data are to the regression line or regression plane that you can draw with the regression equation.

Use the strength chart from Guide 5 to evaluate $R^2$ from very weak to very strong ($R^2$ is always positive).

$R^2$ X 100 is often called the percent variance explained in the dependent variable.
When you are asked, "how much variation did you explain in the dependent variable?", your answer will be the value of $R^2$.
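
As a PRE (proportional reduction in error) measure, R squared compares the errors made using only the mean (TSS) with the errors left over after using the regression equation (SSE). The sketch below assumes the usual form, (TSS - SSE) / TSS. Guide 8 treats this fully. The data and the prediction line (a = 101, b = 4.75) are hypothetical.

```python
# Hypothetical data: inches over five feet and observed weight in pounds.
heights = [0, 2, 4, 6, 8]
weights = [102, 108, 121, 131, 138]

# Predictions from an assumed fitted line, y-hat = 101 + 4.75x.
predicted = [101 + 4.75 * x for x in heights]

mean_weight = sum(weights) / len(weights)

# Errors when we guess the mean for everyone.
tss = sum((y - mean_weight) ** 2 for y in weights)

# Errors left over when we use the regression line instead.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(weights, predicted))

# Proportional reduction in error.
r_squared = (tss - sse) / tss

print("TSS:", tss)                        # 914.0
print("SSE:", sse)                        # 11.5
print("R squared:", round(r_squared, 4))
print("Percent of variance explained:", round(100 * r_squared, 1))
```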

(SMALL NOTE: There is also something called the "adjusted" $R^2$, which is adjusted for the number of independent variables or predictors. What happens is that the "adjusted" $R^2$ "shrinks" in size if you include many independent variables which have trivial correlations with the dependent variable. With three predictors, the values of the "adjusted" $R^2$ and the $R^2$ will probably be about the same.)
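
The guide does not give a formula for the adjusted R squared. One common textbook version is 1 - (1 - R squared)(n - 1) / (n - k - 1), where n is the number of cases and k the number of predictors. The sketch below assumes that version (statistical packages may differ in detail) and uses made-up values of R squared, n and k.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for n cases and k predictors
    (a common textbook formula, assumed here for illustration)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Same R squared and sample size, different numbers of predictors.
few_predictors = adjusted_r_squared(0.50, n=100, k=3)
many_predictors = adjusted_r_squared(0.50, n=100, k=30)

print(round(few_predictors, 3))   # 0.484
print(round(many_predictors, 3))  # 0.283
```

With three predictors the adjustment is small; with thirty predictors and the same R squared, the adjusted value shrinks considerably.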



I also introduce here a new concept: the PARTIAL CORRELATION COEFFICIENT.

The partial correlation coefficient is the correlation between an independent variable and the one dependent variable, statistically controlling for at least one other independent variable.

You actually had an indirect introduction to the partial correlation coefficient when you examined the partial subtables in the three-way crosstabulation assignment. However, in the case of regression, you have ONLY ONE PARTIAL CORRELATION COEFFICIENT PER INDEPENDENT VARIABLE (per equation). In the case of regression, your partial correlation coefficient is a kind of weighted average across all the values of the control variable (the second independent variable).
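
For the simplest case (one independent variable x, one control variable z), the partial correlation can be computed from the three zero-order correlations. The sketch below uses the standard first-order partial correlation formula, which the guide itself does not present. The correlation values are made up for illustration.

```python
from math import sqrt


def partial_r(r_yx, r_yz, r_xz):
    """Correlation between y and x, statistically controlling for z,
    computed from the three zero-order correlations."""
    return (r_yx - r_yz * r_xz) / sqrt((1 - r_yz ** 2) * (1 - r_xz ** 2))


# Hypothetical zero-order correlations:
# weight with height (0.5), weight with calories (0.4),
# height with calories (0.3).
controlled = partial_r(r_yx=0.5, r_yz=0.4, r_xz=0.3)
print(round(controlled, 3))  # 0.435

# If the control variable is unrelated to both x and y,
# the partial correlation equals the zero-order correlation.
print(partial_r(r_yx=0.5, r_yz=0.0, r_xz=0.0))  # 0.5
```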
 
GLOSSARY REVIEW

I am introducing a lot of new terms here, so you may want to pause, look back over this section, and ensure that you are comfortable with what is meant by the following terms:
 
 
 
standard deviation of the mean
observed dependent variable score
estimated dependent variable score
a or "intercept" term
b or "slope" term
simple regression
multiple regression
zero-order r
multiple R
covariance
partial correlation coefficient
e or "error term" ("residual")
Sum of Squared Error (SSE)
Total Sum of Squares (TSS)
y or "observed score"
$\hat{y}$ (y-hat) or "predicted score"

 "DUMMY VARIABLES" IN REGRESSION

In REGRESSION, we try to predict scores on an interval-ratio (that is, numeric) dependent variable using numeric independent variables. We are about to break one of our regression assumptions for a very special case.

Normally, our independent variables are ALSO numeric. However, we can include ordinal or nominal variables as independent variables in regression IF and only if these variables take what is called "dummy variable" form.

Contrary to their name, dummy variables are very smart variables indeed.
A dummy variable is a dichotomized variable that can ONLY take on the scores of 0 or 1.
One value of the dichotomy is scored "1"; the other value is coded "0".
Mathematically, this enables us to do several things:

EXAMPLE: Let's add a dummy variable to the weight equation below called "D1" for "eats desserts daily" scored "1" if the person eats a dessert every day (everyone else scores "0"). Let's assume that on the average, and all else equal, daily dessert-eaters weigh 5 pounds more than those who don't -- and let's see what happens:
 
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4 + 5D_1$       the estimated equation

where: $\hat{y}$ = estimated weight in pounds
x1 = number of inches over five feet (negative for under five feet in height)
x2 = number of half-inches in wrist measurement over six inches
x3 = number of kilo calories (1000s of calories) per day
x4 = number of weekly 15 minute exercise periods
D1 = eats dessert daily (1 = yes; 0 = no)
e  = the error term ($y_i - \hat{y}_i$) will be added to the observed equation
 

What happens to this estimated equation for the cases scored "0" on eating dessert?
What happens to this estimated equation for the cases scored "1" on eating dessert?
(Without the "e" or error term at the end of the equation, it is an ESTIMATED [or approximate] prediction equation.)

For simplicity's sake the ONLY variable I will touch in this equation is our "dummy variable" for "eats dessert." All the other terms are just as shown in the box immediately above.
 

Person eats dessert, D1 = 1:
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4 + 5 \times 1$
Since 1 X "anything" = "anything", 5 X 1 = 5:
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4 + 5$

Non-dessert eater, D1 = 0:
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4 + 5 \times 0$
Since 0 X "anything" = 0, 5 X 0 = 0:
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4$

For DESSERT EATERS ONLY, we can rearrange the terms in the regression equation (notice there is no "X" or independent variable category associated with D1). So our comparison now looks like this:
 

Person eats dessert, D1 = 1:
$\hat{y} = 100 + 5 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4$
Add 100 + 5 for dessert-eaters:
$\hat{y} = 105 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4$

And the equation for non-dessert eaters is just:
$\hat{y} = 100 + 5x_1 + 10x_2 + 2x_3 + (-1)x_4$

What this means is that THE INTERCEPT (or "$b_0$" term) will be 5 pounds higher for our dessert eaters. No matter what else they do, all that sugar and fat catches up with them and they will weigh 5 pounds more on the average than non-dessert eaters, all other things (height, bone structure, kilo calories, exercise) equal.
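
The intercept shift can be seen directly in a short Python sketch. It uses the example coefficients from the dessert equation above; the function name and the particular woman's scores are illustrative assumptions.

```python
def predict_weight(x1, x2, x3, x4, d1):
    """Estimated weight in pounds, including the dessert dummy variable d1
    (1 = eats dessert daily, 0 = does not)."""
    return 100 + 5 * x1 + 10 * x2 + 2 * x3 - 1 * x4 + 5 * d1


# Two women with identical scores on every predictor except dessert:
# 5'4", 6 inch wrist, 1500 calories per day, 7 exercise periods per week.
dessert_eater = predict_weight(4, 0, 1.5, 7, d1=1)
non_dessert_eater = predict_weight(4, 0, 1.5, 7, d1=0)

print(dessert_eater)                      # 121.0
print(non_dessert_eater)                  # 116.0
print(dessert_eater - non_dessert_eater)  # 5.0
```

Whatever values the other predictors take, the difference between the two predictions is always the dummy variable's coefficient, 5 pounds.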

          Here are a couple of questions you should ask about dummy variables:
 

1. My variable has several categories. How should I collapse it down to just two categories, scored 0 and 1?

2. How do I decide which category should be scored 1 and which one should be scored 0?

First, you CAN create more than one dummy variable from a single nominal or ordinal variable. However, your results will be more complicated to interpret. If you believe that one group or category in your variable has something unique, you are better off just dichotomizing because it will be easier to interpret your results.

However, if you have k categories, where "k" is the number of categories in your nominal or ordinal variable, you can create NO MORE THAN k - 1 dummy variables from this. For example, if your nominal or ordinal independent variable has 3 categories, you can only create 2 dummy variables.

One category must ALWAYS be scored zero, no matter how many other categories you have (in a two category variable, it's easy, one category is scored 1 and the other is scored 0).

Let's look at the example below. You have three categories for the variable "marital status": NEVER married, EVER married (divorced, separated, widowed) or CURRENTLY married. You decide that you will compare the other marital status groups with those who are currently married.

Since you have 3 values or categories, you can create TWO dummy variables. The category "currently married" will be scored zero for both dummy variables.

The "currently married" will be the OMITTED or the REFERENCE CATEGORY for the two dummy variables, "never married" and "ever married".
 

                             DUMMY VARIABLE
CATEGORY SCORE FOR THE:      D1 -- NEVER MARRIED     D2 -- EVER MARRIED
Currently Married            0                       0
Never Married                1                       0
Ever Married                 0                       1

The "never married" are coded 1 on the "never married" dummy variable, D1. Everyone else is coded zero.
The "ever married" are coded 1 on the "ever married" dummy variable, D2. Everyone else is coded zero.
The "currently married" are our reference group in this example and are ALWAYS coded zero on BOTH dummy variables, D1 and D2.

You would do this if you had reason to believe that people who were never married and people who had once been married (but weren't any more) differed in some way from the people who were currently married.
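
The marital status coding can be sketched in Python as follows. The function name and the lowercase category labels are illustrative assumptions; the coding itself (k - 1 = 2 dummy variables, with "currently married" as the reference category) follows the example above.

```python
def dummy_code(marital_status):
    """Return (D1, D2) for a three-category marital status variable.

    D1 = 1 for the never married, D2 = 1 for the ever married
    (divorced, separated, widowed). The currently married are the
    reference category and score 0 on both dummy variables.
    """
    d1 = 1 if marital_status == "never married" else 0
    d2 = 1 if marital_status == "ever married" else 0
    return d1, d2


for status in ("currently married", "never married", "ever married"):
    print(status, dummy_code(status))
```

Notice that no case can score 1 on both dummy variables, and that the reference category is identified by scoring 0 on both.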

Second, who gets the "high score" of 1 on each dummy variable?

The choice of which category to code 1 and which to code 0 is somewhat arbitrary.
If you have a conceptual reason to believe that one group is relatively unique with respect to the dependent variable, this reasoning takes precedence and you would code this group "1".

For example, suppose we were studying the "digital divide" and home computer ownership, and we want to create a dichotomy for race. Research on the "digital divide" suggests that White and Asian Americans are wealthier and therefore can afford more home computers. Black, Hispanic, and Native Americans are less wealthy and can afford fewer computers. Therefore we could code Whites and Asians as "1" and everyone else "0".

We would then interpret this particular dummy variable b coefficient as how many more household computers Whites and Asians owned compared with all the other groups combined (or how many fewer computers if the b coefficient is negative).
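To see why a dummy variable b can be read as a difference between groups, here is a small sketch with made-up numbers (these are NOT Current Population Survey figures). When the dummy variable is the only predictor, the intercept a equals the mean for the group coded 0 and b equals the difference between the two group means. The helper simple_regression is written here just for the illustration.

```python
from statistics import mean

def simple_regression(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Made-up data for illustration only.
group = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = group coded "1", 0 = everyone else
computers = [2, 1, 3, 2, 1, 0, 1, 2]  # number of household computers

a, b = simple_regression(group, computers)
mean_1 = mean([c for g, c in zip(group, computers) if g == 1])
mean_0 = mean([c for g, c in zip(group, computers) if g == 0])

print("a (mean of group coded 0):", a, "vs", mean_0)
print("b (difference in means):  ", b, "vs", mean_1 - mean_0)
```

With more than one independent variable in the equation, b is the difference between the groups after the other variables are controlled, so it will generally not equal the simple difference in means.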

Another justification is that you suspect the group coded "1" will have higher scores on the dependent variable (conceptual issues aside). Most of us have an easier time interpreting positive coefficients than negative ones (BUT conceptual issues override this).

It is important to be consistent with how you code dummy variables within the scope of a single research project. For example, if I am analyzing the digital divide, and I began by coding Whites and Asians as "1" I would stay with that coding throughout my digital divide analysis.
 
 

 
IMPORTANT NOTE: Try this type of recoding ONLY for independent (NOT dependent) nominal or ordinal variables. 

DO NOT CREATE DUMMY DEPENDENT VARIABLES. This opens up many regression analysis complications that are way beyond the scope of this course!!
 


 METRIC AND STANDARDIZED REGRESSION COEFFICIENTS

Up until now, we have dealt with METRIC regression coefficients and what is called a PREDICTION EQUATION.

Metric prediction equations literally predict values of the dependent variable, just as we predicted weight in pounds for the earlier examples in this Guide.

Metric prediction equations are VERY widely used. Variations on them predict college grades (from high school grades and Scholastic Aptitude Test scores), gross national product, income variations, longevity in years (using your health habits and the longevity of your ancestors), and many other variables of scientific and/or economic or social interest.

However, in metric form, it is nearly impossible to compare the influence of each independent variable on the dependent variable. The different metrics of the independent variables make it like comparing "apples and oranges."

How does one year of age compare to being White/Asian versus not?

How does an extra inch of height compare to one more 15 minute weekly exercise period?

And, the answer is -- they don't compare at all.

Generally, the wider the range of values on your independent variable, the smaller the metric b for that variable tends to be. For example, the age in years variable from the Current Population Survey data has a valid adult range from 18 to 90 years, or a range of 72 years! On the other hand, the number of personal computers in the household basically ranges from 0 to 3. Obviously the standard deviation for age will be far greater than the standard deviation for the number of household computers.

Since the standard deviation of the independent variable is in the DENOMINATOR for the formula for the "b" slope term, the bigger the standard deviation of the independent variable (all else equal), the smaller the number for the "b" slope will be. Guide 8 will give you this formula for the simpler cases of one or two independent variables.
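A quick sketch may help here. For the case of ONE independent variable, the slope can be written as b = r (sy ÷ sx), where r is the correlation, so sx sits in the denominator. The made-up data below measure the same ages in years and then in decades: nothing about the relationship changes, but the b for years is one-tenth the size because the standard deviation of age in years is ten times larger. The helper functions and data are assumptions for this illustration only.

```python
from statistics import mean, stdev

def slope(x, y):
    """Least-squares slope b for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def pearson_r(x, y):
    """Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up data for illustration only.
age_years = [20, 30, 40, 50, 60, 70]
computers = [3, 3, 2, 2, 1, 1]
age_decades = [a / 10 for a in age_years]  # same ages, smaller standard deviation

b_years = slope(age_years, computers)
b_decades = slope(age_decades, computers)
r = pearson_r(age_years, computers)

print("b per year:  ", b_years)
print("b per decade:", b_decades)
print("r * (sy / sx):", r * stdev(computers) / stdev(age_years))
```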

Therefore, it is very helpful to be able to STANDARDIZE the regression equation and to turn the metric regression coefficients into STANDARDIZED SLOPES or STANDARDIZED REGRESSION COEFFICIENTS.

We can directly compare standardized regression coefficients WITHIN A SINGLE EQUATION (do NOT USE THESE TO COMPARE ACROSS EQUATIONS).

Standardized regression coefficients are all in STANDARD DEVIATION UNITS of THE DEPENDENT VARIABLE, no matter what the original metric of the variable was.

Because all the regression coefficients are in standard deviation units instead of their original metric scores, we can rank them in absolute value order from largest to smallest, or from the most important to the least important in terms of how much they influence the dependent variable. The largest standardized regression coefficient (in absolute value) is the most important influence.
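Ranking by absolute value is easy to get wrong when some coefficients are negative, so here is a tiny sketch. The variable names and beta weights are invented for the illustration and do not come from any real analysis.

```python
# Hypothetical beta weights from a single equation (made-up values).
betas = {"age": -0.40, "education": 0.35, "income": 0.10}

# Rank from most to least important by ABSOLUTE value, ignoring the sign.
ranked = sorted(betas.items(), key=lambda item: abs(item[1]), reverse=True)

for name, beta in ranked:
    print(f"{name:10s} beta = {beta:+.2f}")
```

Here age ranks first even though its coefficient is negative; the sign tells you the direction of the influence, not its size.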

Because they are in standard deviation units, standardized regression coefficients can in theory range from negative infinity to positive infinity. In practice, however, they usually fall between -1 and +1. A value of 0 means that the independent variable has no effect on the dependent variable, and independent variables with standardized regression coefficients close to +1 or -1 typically have a very strong influence on the dependent variable.
 
 

 
If you observe a standardized regression coefficient larger than 1 in absolute value in your own analysis, or in the analyses of others, beware! You don't have a super influential independent variable. Instead, you have just received a diagnostic for some common problems in regression that will be treated in depth in your higher level statistics courses.

OOPS! THAT TERMINOLOGY THING

I have mentioned in class that when I was young and naive, I really expected that mathematicians and statisticians would be consistent! They would only have one name for one thing -- not the same name for three different things or three different names for the same one thing. This, of course, would keep things clear and make it easier.

BUT, I was wrong. And, except for some very basic univariate stuff, statisticians are just as awful and inconsistent about terminology as all the other disciplines.

I mention this as a preamble because:

Given the utility of the standardized regression coefficient, this is indeed a shame.

The two most popular designations for the standardized regression coefficient are:

However, since the metric regression coefficient is often designated as a "Beta" (especially when we speak of a population), you can see the problem.

So here is the terminology WE will use for the rest of the semester:
 

UNSTANDARDIZED REGRESSION COEFFICIENT = METRIC regression coefficient
STANDARDIZED REGRESSION COEFFICIENT = b* or BETA WEIGHT

How do you calculate a BETA WEIGHT?

1. The first way to do so is to simply standardize the scores on all the variables (including the dependent variable) you are using in the regression and turn them into Z-scores. See Guide 3 for a review if you don't remember standardized variables very clearly.

Each of your variables will now have a mean of 0 and a variance of 1.

The regression line or plane will now go through the origin and the constant term will disappear (because it will now be zero).

All your regression coefficients are now Beta Weights.
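Here is a minimal sketch of this first method, using one independent variable and made-up height and weight figures. The helper functions z_scores and simple_regression are written just for this illustration. After standardizing, the constant comes out as zero (apart from tiny rounding error) and the slope is the Beta Weight.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def simple_regression(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Made-up data for illustration only.
height = [62, 64, 66, 68, 70, 72]        # inches
weight = [120, 136, 148, 160, 171, 185]  # pounds

z_height = z_scores(height)
z_weight = z_scores(weight)

constant, beta = simple_regression(z_height, z_weight)
print("constant:   ", round(constant, 10))
print("beta weight:", beta)
```

With only one independent variable, the Beta Weight is the same number as the correlation between the two variables. That is not true once you add more independent variables.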

2. The second way to do so is to convert each metric regression coefficient using the following formula.

The regression line or plane will now go through the origin and the constant term will disappear (because it will now be zero).

All your regression coefficients are now Beta Weights.
 

b* = b (sx ÷ sy )

1. In words, take each independent variable's unstandardized regression b coefficient.

2. Create a ratio by dividing the standard deviation of the particular independent variable by the standard deviation of the dependent variable. That's (sx ÷ sy).

3. Now, multiply each metric b by the ratio of the standard deviation of the independent variable to the standard deviation of the dependent variable.

4. Make sure you are using the standard deviation for the independent variable in the numerator of the ratio that matches the independent variable b you are examining. (Of course, the computer programs will calculate all this for you.)

5. Make sure you do this for ALL of the predictor variables, one at a time.

6. sy is always the standard deviation of the dependent variable in the equation and it always forms the denominator of the ratio.
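The steps above can be sketched as follows, again with one independent variable and made-up height and weight figures. The sketch also checks the answer against the Z-score method, since both routes should give the same Beta Weight. The helper functions are assumptions written for this illustration.

```python
from statistics import mean, stdev

def slope(x, y):
    """Least-squares slope b for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Made-up data for illustration only.
height = [62, 64, 66, 68, 70, 72]        # inches (independent variable)
weight = [120, 136, 148, 160, 171, 185]  # pounds (dependent variable)

b = slope(height, weight)   # metric b: pounds per inch
sx = stdev(height)          # standard deviation of the independent variable
sy = stdev(weight)          # standard deviation of the dependent variable

beta_from_formula = b * (sx / sy)                            # b* = b (sx / sy)
beta_from_z = slope(z_scores(height), z_scores(weight))      # Z-score method

print("metric b:           ", b)
print("beta (formula):     ", beta_from_formula)
print("beta (from Z-scores):", beta_from_z)
```

In a multiple regression you would repeat the conversion for each predictor, pairing each metric b with the standard deviation of its own independent variable.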


Use the beta weights:
(1) to see how relatively important each independent variable is
(2) to assess the strength of each beta weight, from very weak to very strong, using the strength chart.
    REVIEW THE STRENGTH CHART HERE.



 
MANY, MANY UNANSWERED QUESTIONS

This Guide has sidestepped much of the technical aspect of regression analysis.

How do we get those "a" and "b" terms in simple regression?
How do we get the "b" terms in multiple regression?
What are these covariances, anyway?
What's a correlation matrix?
What does it mean to say that R² represents "the percent of variance explained" in the dependent variable?
How do we test for the statistical significance of the R² and of each separate metric b?
What are some of the problems that a large Beta Weight helps diagnose?

Continue with Guide 8 to find out.
 
 


Susan Carol Losh November 10, 2004