
Thank you to Karen Hutzel, survivor of Summer 2003 for providing picture files for some of the most widely used course symbols!
 
Assignment 1
due September 15



 

EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH
EDUCATIONAL PSYCHOLOGY AND LEARNING SYSTEMS


 
GUIDE 3: UNIVARIATE STATISTICS
BASIC STATISTICS AND DISPLAYS FOR UNIVARIATE DISTRIBUTIONS

 
 
READ THIS GUIDE FIRST!
KEY TO: Huff, Huff, Chapters 4, 5 & 6, pp. 53-73 
KEY TO: Agresti and Finlay, Chapter 4, SKIM entire chapter (and I DO mean skim to get general ideas). Focus on: pp. 86-89 AND pp. 94-111. OPTIONAL (won't be on exams, but many people find useful; take a glance and see if you are one): stem & leaf plot material.

 
MEASURES OF CENTRAL TENDENCY
MEASURES OF DISPERSION
SAMPLING DISTRIBUTIONS
STANDARDIZED Z-SCORES
NORMAL CURVE BASICS
UNIVARIATE GRAPHIC DISPLAYS


A VERY SMALL COMPUTER PRIMER

High speed computers have revolutionized data analysis. They accomplish in seconds what it took hours--even weeks--to calculate by hand and they have removed much of the drudgery from statistics and data analysis. BUT most of the time computers are basically robots. They do what they are told--EXACTLY as they are told, even if the results do not make sense. GIGO is short for "garbage in, garbage out", indicating that your results are only as good as your input.

A case in point is how statistical programs such as the SDA system or SPSS (Statistical Package for the Social Sciences) use numbers. Often, the information management specialist who built the computer file for the data assigned numbers to all categories of each variable, even nominal data such as gender. This is for data-processing ease, especially speed, and this does NOT mean that you really have a numeric variable because the "computer said so".

As long as the categories of a variable are stored in the computer as numbers, a computer will calculate numeric means on nominal or ordinal data or give you a "number" for the range on a truly ordinal variable. YOU must decide which kind of data you have and which extenuating circumstances dictate the selection of the proper statistic.

Each year, issues of data analysis and computers become more sophisticated.

There are now hundreds of online databases and archives that are posted on the Internet. The General Social Survey data that we are working with now is one example of such a database.

OPTIONAL: To see many more online databases as well as some considerations when you use such an archive, click HERE and follow the links.

In the last few years, the Internet has truly become an interactive partner in data analysis. Relatively simple statistical programs, such as the Berkeley SDA system (currently the most common in use), have been developed so that online databases can be analyzed online, instead of downloading data files to your home computer (although SDA can do that too). As you know, SDA is an incredibly fast statistical program that can "tear through" thousands of cases in a couple of seconds.

Computer hardware describes the physical computer equipment. Examples are CPUs (central processing units), RAM (random access memory), modems to connect you to telephone sources, monitors, scanners, CD-"burners," or printers.

Software refers to the actual programs or written commands to computers. There are many kinds of software. Some software, such as Windows, coordinates the whole system. It contains general commands that allow you to copy data or access programs. (Viruses are programs, too.)

Then there are specialized programs, such as word processors (e.g., Word Perfect or Word), spread sheet programs (e.g., Lotus or EXCEL), learning tools (e.g., Treasure Mountain), card games, Internet connectors and browsers (e.g., Netscape or Explorer) and statistical packages.

Statistical packages such as SDA or SPSS are "bundles" of programs that execute various statistical estimates, such as univariate frequency displays, measures of central tendency and variation, or crosstabulations. Most statistical packages can transform variable categories, create new variables from old ones (e.g., an added index), assign missing data codes, and create tables and charts. Thus, your experiences with SDA will make learning SPSS and other more complicated programs much easier.

Try as they do, the "thinking" that computer programs do is currently only as good as the human brain that wrote the program. The computer does not know the level of measurement of your data or the nuances of missing cases. Computers are very literal, and they do not do what you meant, only what you told them to do.

Only you can make the selection of the appropriate statistic to use. What the computer WILL do is spare you the drudgery of calculations, or working with n-dimensional matrices. In fact, the computer has made routine the use of more complex analytic techniques, such as logistic regression, where the mathematics had been known for decades, but the calculations were staggeringly cumbersome.


MEASURES OF CENTRAL TENDENCY OR LOCATION

The problem that motivates us to use measures of central tendency is the same problem that motivates us to construct a table. You (or an archive) have a lot of data. You want to present the information accurately, but succinctly, and with the least amount of detail necessary. In other words, you want to reduce and describe the data.

One way to do so, of course, is to construct a frequency or percentage distribution table. As you know, when a variable has several categories, it is difficult to quickly summarize the meaning of all the data. Remember the General Social Survey computer output on "year of birth?" There were several dozen categories. Yet, even when you condense the data down to four or five categories, this can still be cumbersome when you want to describe the data to someone else.

Furthermore, researchers often work with several variables at a time in a study. This could ultimately lead to a gigantic series of tabular displays, one for each variable, which are lengthy and complicated to read. Ouch!

Ideally, many statisticians and data analysts like to use a SINGLE CATEGORY, or, if our data are actually numeric, a single number to summarize the "average" or the "typical" category or score for an entire distribution. A single category score is easy to remember and, if it is "average" or "typical," it is easy to grasp.

We often summarize a univariate distribution with a measure of central tendency. Sometimes this is called a measure of "central location." This is a single category or "most typical" score from the distribution of your variable.

The measures of central tendency that we will study are modes, medians, and means.

You must make some decisions before you can apply these techniques. For example, you must decide the level of measurement, whether each variable you wish to describe is nominal, ordinal or interval/ratio. You should also examine the shape of the distribution. If your scores are numeric, you need to know whether you have a few extremely large (positive skew) or extremely small (negative skew) scores.

NOTE: At this point, you may have some reservations about describing an "average score." What if scores are "heaped at the extremes"? For example, you might look at the quiz scores from a recent quiz for a course you are teaching. You immediately notice that students tended to do very well or very poorly on the quiz, as shown in my example below:

Student        Quiz Score (maximum possible = 10)
Debby A.        3
Sam B.          1
Jack C.         8
Anne D.         8
Tanisha E.      9
Sari F.         2
Zack G.        10
Juan H.         3
Ken I.          9
Janet J.        2

How can one category score or one number hope to do justice to such a distribution?

This is one example of looking at "the shape" of the distribution. And, as we will shortly see, there are ways to augment our measures of central tendency. Later, we will capture some of the diversity in the distribution with a measure of dispersion or variation.
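To make the worry concrete, here is a minimal Python sketch (not the SDA or SPSS software used in this course; the variable names are only illustrative) that computes the three summary values we are about to study for the ten quiz scores listed above.

```python
from statistics import mean, median, multimode

# Quiz scores for the ten students listed above (maximum possible = 10)
quiz_scores = [3, 1, 8, 8, 9, 2, 10, 3, 9, 2]

quiz_mean = mean(quiz_scores)        # arithmetic average
quiz_median = median(quiz_scores)    # middle of the rank-ordered scores
quiz_modes = multimode(quiz_scores)  # every score tied for most frequent

print("mean:", quiz_mean)
print("median:", quiz_median)
print("modes:", sorted(quiz_modes))
```

Both the mean and the median come out at 5.5, a score that no student actually earned, and four different scores tie for the mode. A single number does not describe this distribution well.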


THE MODE: A MEASURE FOR ALL KINDS OF DATA

The mode is the category that contains the largest frequency or the greatest number of scores. In other words, the mode is the score that occurs the most often in your variable of interest.

If your variable is a nominal measure, the mode is the only measure of central tendency that you can use.

(But remember, all is not lost: you still can do percents, rates, ratios and compare groups on a nominal variable.)

Look at the table below to locate the mode. You can use either the frequencies per category or the category percentage to do so.

Number of Computers per United States Household

How many computers or laptops        Number of Cases    Percent of Total Cases
are there in this household?
    No computer                           50646               41.6%
    1   MODE--largest # of cases          50710               41.7
    2                                     14075               11.5
    3 or more                              6314                5.2
Total                                   121,745              100.0%

Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241

Some data have so many modes ("multi" modal) that the concept becomes meaningless. On the other hand, you might want to use the mode when you have an ordinal variable that has only a few (or even two) categories. All that is required for the mode is to be able to tally the number of cases in each category. It does not matter if the categories are ordered or numeric.
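As a rough illustration, the tally-and-compare logic for the mode can be sketched in Python using the CPS frequencies from the table above. The dictionary and variable names are my own and are not part of any statistical package.

```python
# Frequencies from the CPS table above (category label -> number of cases)
computers_per_household = {
    "No computer": 50646,
    "1": 50710,
    "2": 14075,
    "3 or more": 6314,
}

# The mode is the category with the largest frequency. No ordering of the
# categories and no arithmetic on the labels is needed.
modal_category = max(computers_per_household, key=computers_per_household.get)

total_cases = sum(computers_per_household.values())
modal_percent = 100 * computers_per_household[modal_category] / total_cases

print(modal_category, round(modal_percent, 1))
```

Because only counts are compared, the same approach would work for a nominal variable such as college major.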
 
 
THE MEDIAN: FOR ORDINAL, INTERVAL, AND RATIO VARIABLES

The median is a measure of central tendency that can be used with ordinal, interval, or ratio data.

In a set of cases in which the cases have been rank-ordered from highest to lowest (or lowest to highest), the median is the middle score.

Another way of viewing the median is that the median is the 50th percentile.

YOU MUST RANK ORDER THE CASES (or categories) FIRST, or the median will be nonsense.

In a set of cases (for example, the order of finishers in a race), you will need the rank order of each case.

EXAMPLE: in a small footrace, we had (in score position:)

1, 2, 3, 4, 5, 6, 7   The middle score is the "4th position" with three scores above it, and three below.

EXAMPLE: Here are the grade point averages of the top seven students at Lion High School ranked from the highest down to the lowest:

4.00   3.80   3.70   3.69   3.68    3.60    3.60   The value of the score in the "4th position" = the median = 3.69


In general, with an odd-numbered set of ranked cases, the median is the [(n+1)/2]th case. In my example above, it would be (7+1)/2 = 8/2 = the 4th score position.

What if you have an even number of cases? Then, you will have two middle scores, and you will take the arithmetic average of those two.

EXAMPLE: 1, 2, 3, 4, 5, 6, 7, 8   The two middle scores are in the "4th position" and the "5th position" and their average in this example is 4.5

EXAMPLE: Here is the grade point example again, with one lower GPA added to the seven grades:

4.00   3.80   3.70   3.69   3.68    3.60    3.60   3.20   Our two middle scores are 3.69 and 3.68.

Their average or the median is 3.685.

The second example illustrates a nice point about the median. The median is less affected than numeric averages by extremely high or low scores. Although the student with the 3.20 average was substantially below the others, adding this lower score only caused the median to drop from 3.69 (with seven students) to 3.685 (with eight students).
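The odd and even rules above can be sketched in a few lines of Python. This is only an illustration using the Lion High School grade point averages; the function name is hypothetical.

```python
top_seven = [4.00, 3.80, 3.70, 3.69, 3.68, 3.60, 3.60]
top_eight = top_seven + [3.20]   # add the student with the 3.20 average

def median_by_position(scores):
    """Rank-order the scores, then take the middle one (odd n)
    or the average of the two middle ones (even n)."""
    ranked = sorted(scores)      # rank ordering MUST come first
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]    # the [(n+1)/2]th case, counting from 1
    return (ranked[middle - 1] + ranked[middle]) / 2

median_seven = median_by_position(top_seven)
median_eight = median_by_position(top_eight)

print(median_seven, round(median_eight, 3))
```

Python's built-in `statistics.median` follows the same rule, so it can be used as a check on hand calculations.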

The United States federal government typically reports income by median rather than numeric averages (e.g., median income per educational attainment category). That is because income is "skewed," that is, a few extremely high scores (Bill Gates, maybe?) raise arithmetic averages way, way up. But the median will hardly change at all.


Another way to look at the median is that it is the 50th percentile. If you have become comfortable with cumulative percentages, the category containing the median is the lowest-ranked category where the cumulative percentage jumps to over 50 percent. This is an easy way to find the median value in data that are presented in tabular array, as is often the case for data presented in journals or the mass media. Let's stay with our CPS data about computers in the home:

In the category "1 computer in the household" the cumulative percent jumps from 41.6 to over 50.0 percent (in fact, it jumps up to 83.3 percent).

Therefore, the median category score is "1 computer per household".

Number of Computers per United States Household

How many computers or laptops        Number of    Percent of     Cumulative
are there in this household?         Cases        Total Cases    (down)
    No computer                       50646        41.6%          41.6
    1   MEDIAN--50th percentile       50710        41.7           83.3
    2                                 14075        11.5           94.8
    3 or more                          6314         5.2          100.0
Total                               121,745       100.0%
 

Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241
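The cumulative-percentage shortcut for finding the median category can be sketched in Python as follows. This is a minimal illustration built from the CPS frequencies above; the function name is hypothetical, and the categories must already be listed in rank order.

```python
# (category label, number of cases), listed in rank order as in the table above
categories = [
    ("No computer", 50646),
    ("1", 50710),
    ("2", 14075),
    ("3 or more", 6314),
]

def percentile_category(ordered_categories, percentile):
    """Return the first category whose cumulative percent reaches `percentile`."""
    total = sum(count for _, count in ordered_categories)
    running = 0
    for label, count in ordered_categories:
        running += count
        if 100 * running / total >= percentile:
            return label
    return ordered_categories[-1][0]

median_category = percentile_category(categories, 50)
print(median_category)
```

Note that the function returns the category LABEL, not a computed number, which is what we want for ordinal data.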
 
 

 
PLEASE NOTE: THE MEDIAN CATEGORY SHOULD BE A VERBAL CATEGORY WHEN YOU HAVE TRULY ORDINAL DATA. "Agree" is a word, not a number. Do NOT use a numerical category from your output if the median category is a word or words, rather than a true interval-level number.

It is a very common mistake to assign a number for the median to ordinal data. Use the verbal label for the median category in ordinal data.
 

 



 
FOR INTERVAL OR RATIO DATA: THE FAMILIAR ARITHMETIC MEAN

 
 

Below is the definitional formula for the ARITHMETIC MEAN. Statisticians sometimes call the mean "the first moment" or the "center" of a distribution of numeric scores. We are using a sample here (small "n" and small "x"). Some statisticians just use the phrase "x-bar".

$$\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$$

In words, here's how we obtain the arithmetic mean:

STEP ONE: For the chosen variable, start with the score for the first case, add that score to the score for the second case, then add in the score for the third case, and keep adding in the scores until you have added in the score for the very last case on that variable. This is the sum of all the scores on that variable.

STEP TWO: Divide the sum of all the scores from Step One by the number of cases that you have (n).

The result is the arithmetic mean.

Here's what each symbol means:

Σ is a capital, or upper case, Greek letter sigma. Mathematicians use Σ as a shorthand way of saying "to sum" or "to add".

What is added are the scores to the right of the sigma sign.

Xi means a single, particular score.

i = 1  means to start with the very first case on that variable.

n is the total number of cases ("the casebase") IN A SAMPLE OF SCORES.

A sample is some subset of the entire population of scores. Most of the time, in the behavioral and social sciences (and in many biological or physical sciences, too) we work with a sample of scores, not the entire population.

x̄ ("x-bar") is the arithmetic mean for a SAMPLE of scores.

If you have the ENTIRE POPULATION OF SCORES, the symbol for the casebase is capital N, i.e., N.

The symbol for the ENTIRE POPULATION ARITHMETIC MEAN is the Greek letter Mu or µ.
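Step One and Step Two can be written out literally in Python. This is only a sketch; it reuses the seven Lion High School grade point averages from the median section as a small sample, and the variable names are my own.

```python
# A small SAMPLE of scores: the seven grade point averages used earlier
gpas = [4.00, 3.80, 3.70, 3.69, 3.68, 3.60, 3.60]

# STEP ONE: start with the first score and keep adding until the last case
total = 0.0
for score in gpas:
    total += score

# STEP TWO: divide the sum by the number of cases, n
n = len(gpas)
sample_mean = total / n

print(round(sample_mean, 3))
```

The loop is the sigma sign spelled out: it starts at the first case (i = 1) and stops at the last case (i = n).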
 



 
 
WITH MANY HUNDREDS OR THOUSANDS OF CASES, YOU WILL NOT ENTER EACH CASE BY HAND AND ADD IT IN. IT'S TOO EASY TO MAKE A MISTAKE. INSTEAD YOU WILL USE SOME FORM OF GROUPED DATA, SUCH AS THE EXAMPLE BELOW.

Here's the arithmetic mean, applied to the number of computers per household. However, we will have to take some shortcuts because no one is going to add the scores for 121,745 cases by hand.

Number of Computers per United States Household (in frequencies)

How many computers in household?     Number of Cases
    No computer                           50646
    1                                     50710
    2                                     14075
    3 or more                              6314
Total                                   121,745 (13241 missing)

Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)

Instead of adding each case separately, we will take the value of each category (call it C) and multiply that value by the number of cases in the category (f_c). As follows:

Category Score (C)    Category Frequency (f_c)    C X f_c
    0                       50646                 0 X 50646   =       0
    1                       50710                 1 X 50710   =   50710
    2                       14075                 2 X 14075   =   28150
    3                        6314                 3 X  6314   =   18942
Total Sum                                                         97802

(for purposes of this exercise, we will treat the category "3 or more" as just "3")

Then Σ(C X f_c)/N becomes 97802/121,745 = 0.80 for the mean number of computers per household.
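The grouped-data shortcut is easy to check in Python. The sketch below follows the same assumption as the exercise, treating "3 or more" as just 3; the variable names are illustrative.

```python
# (category score C, category frequency f_c); "3 or more" is treated as 3
grouped = [(0, 50646), (1, 50710), (2, 14075), (3, 6314)]

weighted_sum = sum(c * f for c, f in grouped)   # sum of C x f_c
n_cases = sum(f for _, f in grouped)            # total casebase
grouped_mean = weighted_sum / n_cases

print(weighted_sum, n_cases, round(grouped_mean, 2))
```

Because any household with four or more computers is counted as 3, this mean slightly understates the true average.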
 


 
 
 
POPULATION PARAMETERS AND SAMPLE STATISTICS

When our measures include every possible case  that we can study in a particular group, or the entire collection of the elements that we wish to study, we have a census or population. It is usually too cumbersome, time-consuming, and inaccurate (undercounts) to study an entire population. 

Instead we take a sample or subset of cases. If we have a representative sample, we can make very good generalizations about our population (the inference function of statistics), always remembering that results will vary from sample to sample. 

We call the descriptive measures we calculate on a population parameters and we usually denote them with Greek letters.

We call the descriptive measures we calculate on a sample statistics or statistical estimates and usually denote them with Roman or English letters. Sometimes we use capital letters for the population and small letters for the sample.

I wish I could tell you that statistics terminology is consistent. That's only sometimes true. After all, we don't use the Greek letter "nu" (that's ν) for the population casebase. Instead, typically the capital English letter "N" is used for the population casebase and the lower case English "n" is used for the sample casebase.

(Maybe that's because the Greek letter "nu" looks like a "v" instead of an "n".)
P.S. PLEASE AVOID CALLING ALL REPRESENTATIVE SAMPLES "random samples." Random sample IS NOT a technical statistical term. Thank you.
 

The arithmetic mean has some interesting properties (if you're curious, you can take a calculator and check them out yourself).

The entity:

$$X_i - \bar{x}$$

is called the deviation of each score from the mean or the "mean deviation score".

The total sum of all the mean deviation scores added up equals 0 within rounding error. The large and small scores essentially cancel each other out. This is one reason why the mean is considered the "center" of a set of interval-ratio data.
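If you would rather check this property with a few lines of Python than with a calculator, here is a minimal sketch using the ten quiz scores from the start of this guide. The variable names are my own.

```python
quiz_scores = [3, 1, 8, 8, 9, 2, 10, 3, 9, 2]

sample_mean = sum(quiz_scores) / len(quiz_scores)

# Deviation of each score from the mean
deviations = [x - sample_mean for x in quiz_scores]

# Positive and negative deviations cancel each other out
sum_of_deviations = sum(deviations)

print(deviations)
print(sum_of_deviations)
```

With other data the sum may come out as a tiny nonzero number because of rounding in the computer's arithmetic, which is the "within rounding error" caveat above.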

Because you are using numeric operations (such as addition and division) to calculate the mean, of course, your  data must be numeric too. The mean, or "arithmetic average" is also a number scaled with equal intervals (such as one year or one dollar). This means your variable must be interval or ratio.
 
 

 
Try calculating an arithmetic average for college major, when your categories include "humanities," "social sciences," "consumer sciences," "physical sciences," and so forth. Under these circumstances, whatever would a number imply? Would an arithmetic mean of "2" make any sense when the category labels are chemistry, art, and early childhood education?

 

 
A NOTE ON CATEGORY MIDPOINTS ON INTERVAL DATA, ESPECIALLY IN GROUPED DATA

We become interested in the category midpoint for two main reasons:

First, our measurements, especially in continuous data, are often approximations, so there is some "wiggle room" in the category.

Second, the categories may have been pregrouped, such as "9 to 11 years of school." For a number of reasons, you may want to estimate the "middle point" of the category.

Sometimes we use the midpoint of categories to calculate means and standard deviations in cases where the data categories were grouped or collapsed (see above).

Finding the category midpoint means making some estimate of the upper and the lower boundaries (the "true limits") of the categories. Then we do the arithmetic average of the upper and lower boundaries. (Most of the time, of course, we just go with the integer values.)

For a single score, such as "3," typically we can go .5 in either direction. So, in this example, the boundaries would range from 2.5 to 3.5 with an average midpoint of (2.5 + 3.5) / 2 = 3.

For a grouped category, we again add the lower and upper boundary of the category, then divide by 2.

In my example "9 to 11 years of school" category, the midpoint is (8.5+11.5) / 2 =10.

Of course, if you had even more precise estimates of the upper and lower boundaries of the categories you would use those precise estimates instead of ± .5
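The midpoint rule can be sketched as a small Python function. The function name and the `half_unit` parameter are hypothetical; the default of .5 is the usual "wiggle room" described above and can be replaced by a more precise boundary estimate.

```python
def category_midpoint(lowest_value, highest_value, half_unit=0.5):
    """Midpoint of a category running from lowest_value to highest_value,
    where the true limits extend half_unit beyond each end."""
    lower_limit = lowest_value - half_unit
    upper_limit = highest_value + half_unit
    return (lower_limit + upper_limit) / 2

single_score_midpoint = category_midpoint(3, 3)   # true limits 2.5 and 3.5
grouped_midpoint = category_midpoint(9, 11)       # true limits 8.5 and 11.5

print(single_score_midpoint, grouped_midpoint)
```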

MEASURES OF DISPERSION

Remember those lopsided quiz scores way at the top? 

A mean score in these conditions looks misleading and because of the "heaps" at each end, the median isn't a whole lot better.

However, adding a measure of dispersion or variability will help to describe the data. Consider another example with the following two sets of quiz scores:
 
QUIZ SCORES

STUDENT      CLASS A SCORE      STUDENT      CLASS B SCORE
   1               1               1               3
   2               1               2               3
   3               3               3               3
   4               3               4               3
   5               5               5               3
   6               5               6               3
CLASS MEAN         3                               3

Although the two classes look quite different, the class means are identical.
 


The purpose of measures of dispersion or variability is to say something about how much, "on the average," a score varies or deviates from a measure of central tendency. For example, Class A above has a greater diversity in quiz scores than Class B. Once we have both a measure of central tendency and a measure of average variability or dispersion, we know a lot about a set of scores.
 
 
THE INDEX OF DISPERSION, D

Measures of dispersion or variation include the Index of Dispersion, D, (sometimes also called the index of qualitative variation [IQV]), the range, and the standard deviation of the mean. The only measure available for a set of nominal scores is the Index of Dispersion which varies from 0 (all cases are in the same category: a constant) to 1.00 when cases are evenly or uniformly distributed across categories so that each category has the same number of cases.

The quiz scores for Class B above would have a "D" of 0 because all the scores are a "3." There is no dispersion at all. On the other hand, the D for Class A would be 1: there are three categories with scores (1, 3 and 5), and each category has the same number of cases, two.

The D is cumbersome to calculate and impractical if your variable has over 10 categories. However, it is useful for nominal OR ordinal data when the variable only has a few categories. Unfortunately, most statistical software packages do NOT calculate this measure (if there were many categories and lots of frequencies, this would probably crash the computer). Below is the formula for D should you want to use it at some future time:
 
 

$$D = \frac{k\left(n^2 - \sum f^2\right)}{n^2 (k-1)}$$

where k = the number of categories

n = the TOTAL sample size or total case base (N in the case of a population)

and f is the observed frequency in each category of the variable.

Or, in words:

Square the frequency in each category of the variable, add up all the squared frequencies.

Subtract this sum from the square of the casebase.

Multiply this entire numerator mess by the number of categories.

In the denominator, multiply the square of the sample size by (the number of categories - 1).

Divide the numerator by the denominator.
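The steps above translate directly into a short Python function. This is a sketch only: the function name is hypothetical, and I assume that Class B is tallied over the same three score categories as Class A (1, 3 and 5), so that k = 3 in both cases. With only one category, k - 1 would be zero and D could not be computed.

```python
def index_of_dispersion(frequencies):
    """D = k(n^2 - sum of f^2) / (n^2 (k - 1)).
    `frequencies` lists the count in EVERY category, including empty ones."""
    k = len(frequencies)                            # number of categories
    n = sum(frequencies)                            # total casebase
    if k < 2 or n == 0:
        raise ValueError("D needs at least two categories and one case")
    sum_squared = sum(f * f for f in frequencies)   # add up squared frequencies
    numerator = k * (n * n - sum_squared)
    denominator = n * n * (k - 1)
    return numerator / denominator

# Quiz score categories 1, 3 and 5 for the two classes above
d_class_a = index_of_dispersion([2, 2, 2])   # two students at each score
d_class_b = index_of_dispersion([0, 6, 0])   # all six students scored 3

print(d_class_a, d_class_b)
```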
 

 
You DO NOT have to memorize this formula or memorize how to calculate D.

You DO have to remember that there is a measure of dispersion (D) for nominal variables that have relatively few categories. D varies from 0 (all cases in one category) to 1 (an equiprobable distribution).
 


 
ORDINAL MEASURES OF DISPERSION

For ordinal data, we have two measures of dispersion: the range and the inter-quartile range.

The generic definition of the range is to list the two end-points: the highest and the lowest category scores. This definition will work for both truly ordinal variables and for interval or ratio variables.

A second, very common, definition of the range is the highest category score minus the lowest category score. YOU CANNOT USE THIS VERSION OF THE RANGE ON TRULY ORDINAL DATA BECAUSE THIS VERSION PRODUCES A NUMBER! What is a meaningful number for "strongly agree" minus "strongly disagree"? There isn't one!  (However, you can give the two endpoints, and this is meaningful.) Another problem with using the numeric version of the range is that it is insensitive to the absolute magnitude of the scores. 16 - 1 = 15 (say, for years of school completed) is clearly more comprehensive than 1016 - 1001 = 15 (say, for weekly salary in dollars). I urge you to use the two endpoints (with verbal  labels, if applicable) for the range.

The inter-quartile range is probably more useful than the range and spans the middle 50 percent of the cases. The IQR goes from the 25th to the 75th percentile. When your data are numeric, you can subtract the number that corresponds to the category that contains the 25th percentile from the number that corresponds to the category that contains the 75th percentile. Again, more generic for ordinal, interval, and ratio variables are the verbal end points (or numbers, in the case of truly numeric data) for the categories that contain the 25th and the 75th percentile. Once more, the cumulative percentage makes it easy to find the end points of the inter-quartile range.

Thus, the inter-quartile range contains the middle 50 percent of the scores.

Looking at the Current Population Survey again, the 25th percentile category is "no computer". The 75th cumulative percentile category is "1 computer." The Interquartile Range goes from "no computer" to "1 computer."

Number of Computers per United States Household

How many computers or laptops        Number of    Percent of     Cumulative
are there in this household?         Cases        Total Cases    (down)
    No computer  (25th percentile)    50646        41.6%          41.6
    1  (75th percentile)              50710        41.7           83.3
    2                                 14075        11.5           94.8
    3 or more                          6314         5.2          100.0
Total                               121,745       100.0%
 

Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241
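Here is a minimal Python sketch of reading the inter-quartile range off the cumulative percentages in the table. The names are illustrative, and the result is reported as two category labels rather than as a subtraction.

```python
# (category label, cumulative percent) in rank order, from the table above
cumulative_percent = [
    ("No computer", 41.6),
    ("1", 83.3),
    ("2", 94.8),
    ("3 or more", 100.0),
]

def category_at(percentile):
    """First category whose cumulative percent reaches the given percentile."""
    for label, cumulative in cumulative_percent:
        if cumulative >= percentile:
            return label
    return cumulative_percent[-1][0]

# The IQR runs from the 25th percentile category to the 75th percentile category
iqr_endpoints = (category_at(25), category_at(75))

print(iqr_endpoints)
```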



 
THE STANDARD DEVIATION OF THE MEAN

With interval or ratio data and the arithmetic mean, we can use the standard deviation of the mean.  Here's how:

1. Subtract the mean from each score on your chosen variable.

2. Square each deviation difference.

3. Now add up all the squared differences.

4. This sum is called the "Total Sum of Squares" (TSS for short).

5. Take the TSS and divide it by either the total number of cases for a population or
     by (the total number of cases -1) for a sample.

This quantity in step five is called the variance. The variance is the average squared deviation from the mean. We square each deviation from the mean first because if we did not, the sum of the deviations from the mean would be zero in every case. That would not distinguish among the different degrees of variability across samples such as those in the Class A and Class B example above.

6. Now, take the square root of the variance and you have the standard deviation of the mean, or the "average deviation" a score is from the mean.

Everything that I just said in steps 1-6 is summed up in the definitional formula presented below.
 
 
TERMINOLOGY
  σ        for a population

  s        for a sample

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$

Unless you have a very small sample, you will not calculate a standard deviation by hand. Once the casebase exceeds a few dozen, even the short-hand computational formulae that you see in textbooks, or hand-calculation procedures using grouped categories, rapidly become tedious and difficult to execute without error.

Each variable in a particular sample has its own unique standard deviation.
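For a very small set of scores, steps 1 through 6 can be followed line by line in Python. This is a sketch, not the routine any particular statistical package uses; the function name is my own, and it is applied to the Class A and Class B quiz scores from earlier.

```python
from math import sqrt
from statistics import stdev

def standard_deviation(scores, population=False):
    mean = sum(scores) / len(scores)
    deviations = [x - mean for x in scores]      # step 1: score minus mean
    squared = [d ** 2 for d in deviations]       # step 2: square each deviation
    tss = sum(squared)                           # steps 3-4: Total Sum of Squares
    divisor = len(scores) if population else len(scores) - 1
    variance = tss / divisor                     # step 5: the variance
    return sqrt(variance)                        # step 6: square root

class_a = [1, 1, 3, 3, 5, 5]
class_b = [3, 3, 3, 3, 3, 3]

s_class_a = standard_deviation(class_a)                       # sample version
s_class_b = standard_deviation(class_b)
sigma_class_a = standard_deviation(class_a, population=True)  # population version

# Cross-check the sample version against Python's standard library
matches_library = abs(s_class_a - stdev(class_a)) < 1e-9

print(round(s_class_a, 3), s_class_b, round(sigma_class_a, 3), matches_library)
```

The two classes share a mean of 3, but only Class A has a standard deviation above zero, which is exactly the difference the mean alone could not show.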



SAMPLES AND SAMPLING DISTRIBUTIONS: AN INTRODUCTION

The goal of much research is to predict the true POPULATION VALUE. We want to minimize ANY deviation from the true population value when we make such a prediction. However, because many populations are very large, it is too expensive, time consuming, or even practically impossible to measure every unit or case in the population.

So, most of the time we take a subset, or a SAMPLE, from the population. A well-chosen sample, nearly always a PROBABILITY SAMPLE, in which each case has a KNOWN chance of selection, often allows us to make very good inferences to the total population. However, because a sample is a subset of cases, we do expect random variations from case to case and from sample to sample. Positive fluctuations cancel out negative ones IN THE LONG RUN, although not necessarily in any ONE particular sample.

When we observe sample univariate results, such as a mean or a percentage, we often put error limits around that result in an attempt to estimate what is happening in the population.

What we want to do is make an estimate of the "average sample," the one that would occur if we took repeated samples of the same size and the same type (typically around the same time) from the same population. Each sample would provide an estimate of the parameter we wanted to know.

EXAMPLE: if we wanted to know the average number of years of completed education among general public adults, and we took repeated samples, we could estimate a mean years of completed education for each sample. We could also estimate the average variability from sample to sample.

EXAMPLE:  If we wanted to know the population percentage that would vote for President Bush in 2004, we could examine several polls (say, each one 1500 cases, and each one obtained through a Random Digit Dial telephone survey) over the next several months. Each poll would have a sample estimate of the percentage of voters choosing President Bush.

Thus, we have a SET OF SAMPLE ESTIMATES (such as a set of mean years of education or percentage endorsing President Bush) FROM SEVERAL SAMPLES.

We call this set of sample estimates THE SAMPLING DISTRIBUTION.

When we have a subset of individual cases, we have a sample.
The unit is an individual case, such as a United States adult.

When we have a set of samples, with a statistic, such as a mean, from each sample, we have a sampling distribution.
The unit is AN INDIVIDUAL SAMPLE.

Sometimes, we actually physically take a set of repeated samples and we can create a sampling distribution from these. One example is all the polls estimating who will win an election. These typically occur around the same time, are about the same size, and are taken the same way.

More often, we make generalizations from SAMPLING DISTRIBUTIONS.
We do so very often with only a single sample.

SAMPLING DISTRIBUTIONS are hypothetical distributions of a sample statistic (such as a mean) taken from an infinite number of samples of the same size and the same type taken around the same time period (say, n = 900 for each sample and each sample is a Random Digit Dial survey).

Remember that each element in a sampling distribution is a separate sample.

In the long run, we hope that the center of the sampling  distribution, such as the "mean of the means" (the grand mean) will be the same as the true population value (such as the true population mean.) This is often called the expected value.

If we do a good job on sampling, we can estimate the population mean or percentage from just one sample and put approximate limits of variability (called "confidence intervals") around our estimate.

The sampling distribution also has a measure of variation. The standard deviation of the sampling distribution is calculated in a way similar to that of a sample. Let M equal the number of SAMPLES. Add up the sample means from all the samples and divide by M, the total number of samples. This gives us the mean of the sampling distribution.
 

  1. Take each sample mean, and subtract from it the mean of the sampling distribution. This will give a sample mean deviation score.
  2. Square each sample mean deviation score.
  3. Add up all the squared mean deviation scores.
  4. Divide by M, the number of SAMPLES.
  5. Take the square root of step 4.
 
We give the standard deviation of the sampling distribution a special name: THE STANDARD ERROR.

The standard error is the standard deviation of the sampling distribution.

It is the standard error of some statistic distributed over the sampling distribution, such as the "standard error of the mean."

It tells us how much the sample mean score varies, on the average, from sample to sample.
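The five steps above can be sketched in a few lines of Python. This is only an illustration: the "population" of ages is made up, and the function name standard_error_from_sample_means, the seed, and the choice of M = 1,000 samples of n = 900 are all assumptions for the example, not part of the course data.

```python
import math
import random

def standard_error_from_sample_means(sample_means):
    """Standard deviation of a set of sample means, following the five steps."""
    m = len(sample_means)                                  # M, the number of SAMPLES
    grand_mean = sum(sample_means) / m                     # mean of the sampling distribution
    deviations = [x - grand_mean for x in sample_means]    # step 1: deviation scores
    squared = [d ** 2 for d in deviations]                 # step 2: square each one
    total = sum(squared)                                   # step 3: add them up
    variance = total / m                                   # step 4: divide by M
    return math.sqrt(variance)                             # step 5: square root

# Hypothetical population: 100,000 made-up "ages" between 18 and 89.
rng = random.Random(2004)  # fixed seed so the run is repeatable
population = [rng.randint(18, 89) for _ in range(100_000)]

# Draw M = 1,000 samples of n = 900 cases each and keep each sample's mean.
sample_means = []
for _ in range(1000):
    sample = rng.sample(population, 900)
    sample_means.append(sum(sample) / len(sample))

standard_error = standard_error_from_sample_means(sample_means)
print("Standard error of the mean:", round(standard_error, 2))
```

Notice that each element of sample_means is one whole SAMPLE, not one individual case.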
 

When I write:  √n

That's short for "the square root of n".

To summarize, for measures of central tendency and variation for samples, populations, and sampling distributions:
 
 

                        Unit                 Measure of central tendency               Measure of variation
Sample                  An individual case   Mean: X̄                                   Standard deviation: s
Population              An individual case   Mean: µ                                   Standard deviation: σ
Sampling Distribution   A single SAMPLE      Mean: grand mean ("mean of the means")    Standard error: s.e. (estimated as s / √n)

 STANDARDIZED VARIABLES AND Z-SCORES

Standardized scores allow us to compare how extreme a score is across different variables no matter what the metric of the variable may be. For example, Marilyn Vos Savant, who is a syndicated speaker and columnist, is supposed to have the highest measured intelligence test score in the WORLD.

Let's suppose that Marilyn's IQ score is 175. This metric is in IQ points.

Does Marilyn's stratospheric IQ transfer into megabucks too? (For causal purposes, we will assume that even the entire budget of the United States will not make Marilyn a genius, so the causal arrow must run from Marilyn's IQ to her income in dollars.)

Let's assume that Marilyn's annual income is $80,000 in U.S. dollars. This metric is in dollars.

Standardized or "normal" variables have a mean of 0 and a variance and standard deviation of 1, no matter what the original metric of the variable was (e.g., IQ points, dollars of income or years of age). This is what enables us to compare mean scores across different groups and even different variables.

NOTE: YOU CAN ONLY CALCULATE STANDARD SCORES WITH INTERVAL-RATIO DATA!
You are using arithmetic operations to calculate a standard score.

Here's how to obtain a standardized score, often called a "Z score" or a "normal score":

1. Take each score of a given variable

2. Subtract the mean from each score.

3. Divide the deviation score by the standard deviation for that variable.

In symbols FOR A POPULATION:

 Z =   (Xi - µ) / σ

For example, common IQ measures are calibrated to have a mean of 100 and a standard deviation of 15.

If Marilyn Vos Savant's IQ is 175, her Z score would be:  (175 - 100) / 15  = + 5.00

Marilyn's IQ is five standard deviations above the average U.S. IQ.

How about Marilyn's income? Mean family income in the United States is about $50,000 per year. Let's suppose the standard deviation for income is $15,000.

So Marilyn's Z score on income would be: (80,000 - 50,000) / 15,000 = +2.00 or two standard deviations above the average.

So, although Marilyn's IQ is WAY above average, her income is above average, but not nearly to the same degree. Although IQ and income are two different measures, calibrated on two different metrics, the Z score allows us to directly compare both measures for a particular person.
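Here is a minimal sketch of the same arithmetic in Python. The function name z_score is just a label chosen for this example; the numbers are the ones assumed above for Marilyn.

```python
def z_score(score, mean, standard_deviation):
    """Subtract the mean from the score, then divide by the standard deviation."""
    return (score - mean) / standard_deviation

# IQ: mean of 100, standard deviation of 15.
iq_z = z_score(175, 100, 15)

# Income: assumed mean of $50,000 and assumed standard deviation of $15,000.
income_z = z_score(80_000, 50_000, 15_000)

print("IQ Z score:", iq_z)
print("Income Z score:", income_z)
```

Because both results are in standard deviation units, they can be compared directly even though one started in IQ points and the other in dollars.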

Z scores are particularly valuable to use with the Normal Curve (see below). Because the areas under the normal curve are known by definition, if your data conform to a normal distribution, you can tell whether your score is about average or extremely high or low...and even a score's percentile if your variable follows a normal distribution.
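As a rough sketch of how a Z score turns into a percentile, Python's standard library has a NormalDist class (Python 3.8 or later). The helper name percentile_from_z is made up for this example, and the answer is only meaningful IF the variable really follows a normal distribution (income, for instance, usually does not).

```python
from statistics import NormalDist

def percentile_from_z(z):
    """Percent of cases at or below a given Z score under the standard normal curve."""
    # NormalDist() with no arguments has a mean of 0 and a standard deviation of 1.
    return 100 * NormalDist().cdf(z)

for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, round(percentile_from_z(z), 1))
```

A Z score of 0 sits at the 50th percentile, exactly in the middle of the distribution.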
 


 NORMAL CURVE 101

The Normal Curve is a mathematically derived hypothetical distribution of scores.

To understand the basics of the normal curve, you should now be familiar with the mean, median and mode, and standard deviations and standard errors.

Below is the basic function that produces the normal curve. A function of this kind is called a probability density function or PDF; the mathematical process of integration aggregates the area under the curve, and those areas give us probabilities. Virtually every statistic has its own unique PDF that will draw a curve. The normal curve is actually one of the simplest PDFs in the field of statistics.

PDFs allow us to make very useful inference statements because each PDF is a collection of mathematical properties. Because the PDF itself is hypothetically defined, it always has the same theoretical mathematical properties, regardless of the specific sample involved (although the specific numbers will depend on the data itself.)

Here's the formula that produces the normal curve. (This copy is courtesy of Dr. Brewer in EPLS' book:)

f(X) = [1 / (σ √(2π))] * e^( -(X - µ)² / (2σ²) )

If you look at some of the components of this function for the normal curve, you will see some familiar symbols: the population mean for the variable (µ), the population variance (σ²), and the population standard deviation (σ).
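To see how µ and σ enter the function, here is a small Python sketch of the standard normal curve formula. The function name normal_pdf and the IQ-style values (µ = 100, σ = 15) are choices made for this illustration. The total area is only approximated here by adding up thin rectangles; it is not an exact integration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / (sigma * math.sqrt(2 * math.pi))

MU, SIGMA = 100, 15  # illustrative values on an IQ-style scale

# The curve is highest at the mean.
peak = normal_pdf(MU, MU, SIGMA)

# Approximate the total area from 8 standard deviations below to 8 above the mean.
n_steps = 24_000
start = MU - 8 * SIGMA
step = (16 * SIGMA) / n_steps
area = sum(normal_pdf(start + (i + 0.5) * step, MU, SIGMA) * step for i in range(n_steps))

print("Height at the mean:", round(peak, 4))
print("Approximate total area:", round(area, 4))
```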
 
 
PROPERTIES OF THE NORMAL CURVE

You can use the normal curve only with numeric data.

The curve is bell-shaped.

The curve is symmetric: each side is a "mirror image" of the other.

The distribution has a center. The mean, median and mode are the same number and they are all in the exact center of the distribution of scores.

The total area under the curve is set to 100% or 1.00.

With normally distributed data, about 68 percent of cases are within one standard deviation on either side of the mean, 95 percent of cases are within ± 1.96 standard deviations of the mean, and over 99 percent (about 99.7 percent) of the cases are within ± 3 standard deviations of the mean.
 
 

 
If the distribution of scores on your variable meets all of these criteria that define the normal curve, then we say that your variable follows a normal distribution.

If your data happen to resemble this (and a lot of data distributions such as height do), you can do more numerically with your data. Having a normally distributed variable gives you a lot more statistical options.
 


 

The total area under the normal curve is set to 1.00 or 100 percent. We can calculate the various areas under the normal curve (tables at the very back of your book can help you do so, and computers will calculate this too). For example, 34 percent of the cases (or the area under the curve) are found between the mean and 1 standard deviation above the mean, or to use the symbolic terminology, between:

µ and µ + σ
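If you would rather let the computer look up the areas, the sketch below uses NormalDist from Python's standard library (Python 3.8 or later). The helper name area_within is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(z):
    """Share of the area under the normal curve between -z and +z standard deviations."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

# Area between the mean and one standard deviation above it.
mean_to_one_sd = standard_normal.cdf(1) - standard_normal.cdf(0)
print("Mean to +1 standard deviation:", round(mean_to_one_sd, 4))

for z in (1, 1.96, 3):
    print("Within", z, "standard deviations:", round(area_within(z), 4))
```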

We will revisit the normal curve later in this course.


THE SAMPLING DISTRIBUTION REVISITED AND THE NORMAL CURVE

Imagine that you have taken several samples of exactly the same size and the same type (say, n = 1500 telephone Random Digit Dial samples).

You now have M samples. You calculate a mean from each of these M samples. Then you average these separate sample means to find the "mean of the means" (often called the "grand mean").

The standard deviation around the "mean of the means" or the grand mean has a special name: we call it the standard error of the mean so that we know that we are dealing with a sampling distribution of samples and not one sample of individual cases.

The standard error behaves analogously to the standard deviation for the normal curve except that the standard error is a measure of variability across separate entire samples.

(The standard deviation is a measure of variability across individual cases in a single sample or a single population.)

We can apply the Normal Curve to either a sample distribution of cases OR to a set of sample statistics (such as a set of means across several samples).


CRITICALLY IMPORTANT: The results from a set of samples may be normally distributed even if the cases from a single sample do NOT have a normal distribution.

If each sample is big enough ("the law of large numbers"), the results will vary less from sample to sample and form a normal distribution (the "central limit theorem"), for example, of a mean or a proportion.

Because we know the defined math qualities of the normal distribution in advance (they are mathematically defined, remember), we can use these properties with our data if the cases themselves are normally distributed or the sample is large (several hundred cases).

For example, we expect about 95 percent of SAMPLE MEANS to be within two STANDARD ERRORS on either side of the grand mean.
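You can check this claim with a small simulation. In the sketch below the individual cases come from a deliberately skewed (exponential) made-up population, so single cases are NOT normally distributed; the seed, M = 2,000 samples, and n = 400 cases per sample are all arbitrary choices for the illustration.

```python
import random
import statistics

rng = random.Random(5400)  # fixed seed so the run is repeatable

def draw_sample_mean(n):
    """Mean of one sample of n cases from a skewed (exponential) population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

# M = 2,000 samples, each with n = 400 cases. Each list element is one SAMPLE mean.
sample_means = [draw_sample_mean(400) for _ in range(2000)]

grand_mean = statistics.fmean(sample_means)        # the "mean of the means"
standard_error = statistics.pstdev(sample_means)   # standard deviation of the sample means

# Share of sample means within two standard errors of the grand mean.
within_two_se = sum(
    abs(m - grand_mean) <= 2 * standard_error for m in sample_means
) / len(sample_means)

print("Grand mean:", round(grand_mean, 3))
print("Standard error:", round(standard_error, 3))
print("Share within two standard errors:", round(within_two_se, 3))
```

Even though the cases are skewed, the share of sample means within two standard errors should come out close to 95 percent.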
 
 
PUTTING IT ALL TOGETHER: CONFIDENCE INTERVALS #1

Because when we take a sample we expect some variability around the outcome, we can place a confidence interval around our estimate of the sample mean. This gives us some idea of the average amount we can expect the mean to vary from sample to sample.

If we go out two estimated standard error units on either side of the sample mean, and construct confidence intervals in the following way, 95 percent of those intervals will contain the true population mean.

Obviously, any ONE confidence interval constructed from a single sample either will contain the population mean or it will not. Our faith is in the long run PROCESS--that 95 percent of the confidence intervals constructed in such a way WILL contain the population parameter.

If we are going out almost two standard errors (1.96 to be exact) about the sample mean in either direction, this will capture the population mean in about 95 percent of the samples (same size and type from the same population at about the same time) that we could take. In 5 percent of the samples we take, the confidence interval will NOT contain the population mean. We never know which sample is a good one or a bad one, but our faith is that we got one of the 95 accurate samples out of every 100 and not one of the 5 bad samples.
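The long run PROCESS can be imitated on a computer, where (unlike real life) we get to know the true population mean. The sketch below is a simulation under made-up assumptions: a normal population with a hypothetical mean of 46 and standard deviation of 17, and 2,000 repeated samples of n = 400 cases each.

```python
import math
import random
import statistics

rng = random.Random(2002)  # fixed seed so the run is repeatable
TRUE_MEAN = 46.0           # hypothetical population mean
TRUE_SD = 17.0             # hypothetical population standard deviation

def interval_contains_true_mean(n):
    """Draw one sample, build its 95% confidence interval, and check it."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    sample_mean = statistics.fmean(sample)
    estimated_se = statistics.stdev(sample) / math.sqrt(n)
    lower = sample_mean - 1.96 * estimated_se
    upper = sample_mean + 1.96 * estimated_se
    return lower <= TRUE_MEAN <= upper

trials = 2000
hits = sum(interval_contains_true_mean(400) for _ in range(trials))
coverage = hits / trials
print("Share of intervals containing the true mean:", coverage)
```

The share should land near 0.95. Any single interval either contains the true mean or it does not; the 95 percent describes the whole set.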

The general formula for the 95% confidence interval around the mean is:

X̄ ± 1.96 * (s.e.)

The 1.96 means we are capturing the population mean score in 95 percent of the samples whether higher than the sample mean in this particular sample [+ 1.96 * (s.e.)] or lower than the mean in this particular sample [-1.96 * (s.e.)].

EXAMPLE:

Suppose we are trying to estimate adult age in the United States.
We can use the 2002 General Social Survey data to do so.

Mean age for 2002 according to the GSS = 46.28 years.
The standard deviation is 17.37 and the n = 2751.
This makes our estimated standard error 17.37 / √2751 or   17.37/52.45  = .33

When I write:  √n

That's short for "the square root of n". So in this case √2751 means "the square root of" 2751.

Our 95 percent confidence interval (going out almost two standard errors or 1.96 standard errors from the sample mean on either side) will be:

46.28 ± 1.96 (.33) or 46.28 ± 0.65
46.28 - 0.65 = 45.63  years of age
46.28 + 0.65 = 46.93  years of age

So our best estimate is that mean adult U.S. age (in 2002) is between 45.63 and 46.93 years of age.

We calculated just one confidence interval using just one sample (the 2002 General Social Survey). 95% of the confidence intervals we construct in this way will contain the true population mean for age.
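The arithmetic for this example can be reproduced with a few lines of Python. The function name confidence_interval_95 is made up for this sketch; the inputs are the 2002 GSS figures quoted above.

```python
import math

def confidence_interval_95(mean, standard_deviation, n):
    """Return (lower, upper) limits: mean plus or minus 1.96 estimated standard errors."""
    estimated_se = standard_deviation / math.sqrt(n)
    margin = 1.96 * estimated_se
    return mean - margin, mean + margin

# 2002 General Social Survey: mean age 46.28, standard deviation 17.37, n = 2751.
lower, upper = confidence_interval_95(46.28, 17.37, 2751)
print("Lower limit:", round(lower, 2))
print("Upper limit:", round(upper, 2))
```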

As the case base becomes larger, the standard error becomes smaller (remember you divide by √n). So larger samples have smaller, hence more precise, confidence intervals and estimates of the population parameter.
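To see how sample size affects the standard error, the sketch below holds the standard deviation at the GSS value of 17.37 and tries several sample sizes. Apart from 2751, the sample sizes are arbitrary choices for illustration, and the function name standard_error is just a label.

```python
import math

def standard_error(standard_deviation, n):
    """Estimated standard error of the mean: s divided by the square root of n."""
    return standard_deviation / math.sqrt(n)

SD = 17.37  # standard deviation of age from the 2002 GSS example

for n in (100, 400, 1600, 2751):
    se = standard_error(SD, n)
    width = 2 * 1.96 * se  # full width of the 95% confidence interval
    print(n, round(se, 2), round(width, 2))
```

Because you divide by the square root of n, it takes four times as many cases to cut the standard error in half.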
 
 



GRAPHIC DISPLAYS AND PICTORIAL REPRESENTATIONS

 
CHARTS AND GRAPHS

Often the basic elements of tabular displays are presented instead in a chart or a graph. Or icons may be used to represent the frequencies in each category of a selected variable.

Histograms look like bar charts. On the horizontal, or "x" axis, are the categories of the variable. On the vertical, or "y" axis, are the frequencies for each category.

Line graphs, also called frequency polygons, connect the midpoints of (at least ordinal) categories to create lines.

Pie charts use "pie wedges" and percentages (make sure they add to 100%!)

Pictographs use representational icons such as small houses or moneybags to show relative frequencies, often across groups.

Using the number of computers per household, below is one example each of a histogram, a frequency polygon, and a pie chart.

 




 
PICTORIAL ICONS OR "PICTOGRAPHS"

Some writers of reports like to use little pictorial symbols or icons to represent frequencies or relative frequencies.

For example, suppose that we find that 40 percent of the households where the top degree is a high school diploma have at least one personal computer (or laptop) and 80 percent of the households where the top degree is a college degree have at least one personal computer (or laptop). Pictorially, the artist might represent this comparison as follows:

CORRECT DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE: 40 PERCENT
HOUSEHOLDS WITH COLLEGE DEGREE: 80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

However, suppose our graphic artist, who doesn't have any statistics background, decides that he doesn't want MULTIPLE icons because he feels that this clutters up his display. Instead, he decides to change the size of a pictorial icon in the college household to show the relative differences. So he tries the following change and makes the "80 percent" icon twice as high as the 40 percent icon:

CORRECT (BUT NOT "PRETTY") DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE: 40 PERCENT
HOUSEHOLDS WITH COLLEGE DEGREE: 80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

Only now, the artist decides that the tall, skinny computer looks weird and out of proportion (it does), so he juggles the dimensions of the tall, skinny computer to make it "look better", producing the following comparison:

INCORRECT (but pretty) DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE: 40 PERCENT
HOUSEHOLDS WITH COLLEGE DEGREE: 80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

There! Now the big computer is more in proportion! Isn't that better?

Well, no, it's not. It's actually very misleading. Our artist, in a desire to make the computer icon pretty, now has not only made it twice as tall--but also twice as wide. The total icon for those with a college degree is now FOUR TIMES larger than the icon for those with a high school diploma, even though they are only twice as likely to have a home computer.
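The arithmetic behind the problem is simple enough to check in a couple of lines. In this sketch the icon is treated as a plain rectangle measured in arbitrary units; the function name icon_area and the 1 by 1 starting size are assumptions for the illustration.

```python
def icon_area(height, width):
    """Area of a rectangular icon, in arbitrary square units."""
    return height * width

high_school_icon = icon_area(1, 1)      # the 40 percent icon
tall_skinny_icon = icon_area(2, 1)      # twice as tall, same width
tall_and_wide_icon = icon_area(2, 2)    # twice as tall AND twice as wide

print("Tall, skinny icon is", tall_skinny_icon / high_school_icon, "times the area")
print("Tall and wide icon is", tall_and_wide_icon / high_school_icon, "times the area")
```

The eye responds to the area of the icon, so only the tall, skinny version matches the true two-to-one difference.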

Thus, it is easy to misrepresent icons in pictographs, although the artist may have the best of intentions. Be sure that the icon used is always the same size for the entire pictograph. You can use multiples of the same size icon to convey group differences (as in the first, and correct, pictograph of computers).

Other tips: Be sure the x and y axes of frequency polygons and histograms use equal interval units across the bottom and up the side. If you look at the Radon example in my handout (last page), it uses equal intervals up the side but unequal intervals across the bottom so that the effect of Radon exposure on smokers looks much more dramatic than it really is. (There will be a PAPER handout coming on graphic displays!)

Be sure the y axis starts at zero. If the graphic display is truncated instead (that is, it omits part of the y axis to concentrate on where all the frequencies are displayed), be sure that the graph uses CLEAR truncation marks (see the page in the handout that compares the "good graph" and the "bad graph" for consumer confidence on the same page).

Be sure that the graphic display, whether histogram, frequency polygon, pie chart, pictorial representation, and so forth, tells you (when appropriate) the total case base, valid case base, and the source of the data.

GREAT BOOK TIP: Darrell Huff does the best job I have ever seen depicting the mismanagement of graphic displays of data. This section of Guide 3 owes a lot to How to Lie with Statistics. So make sure to check out the reading with lots of examples.
 
 
 


Susan Carol Losh Slightly revised at top September 18 2004
When it's 90 degrees, THINK SNOW