PLEASE NOTE: IF ANY STUDENT QUESTIONS ARE POSTED TO ME THAT WOULD CHANGE OR CLARIFY THE CONTENT OF THIS SITE, I WILL MAKE CORRECTIONS HERE. HOWEVER, NO CHANGES WILL BE MADE IN THIS SITE AFTER TUESDAY  9/28 AT 7 PM.

ERROR IN APPLICATION FOUND: This evening (9-27), discussing Assignment 2 with several students after class, I discovered that several students were using percentiles to see if the required numbers of cases met normal curve criteria. THE NORMAL CURVE IS NOT IN PERCENTILES. IT IS IN STANDARD DEVIATION UNITS. Using cumulative percents will NOT help you here. Closely examine the box about two-thirds of the way down the site in the Assignment 2 Feedback page where I work the "maeduc" example. For the first (plus or minus 1 standard deviation) criterion, subtract one standard deviation from the mean (11.45 - 3.49 = 7.96). That is the lower bound score. Add one standard deviation to the mean; that is the upper bound score (11.45 + 3.49 = 14.94). Add the cases between the lower and upper bound scores (8 through 14) and divide by the case base. To meet normal curve criteria, 68% of the cases should be within one positive and one negative standard deviation of the mean. (If you interpolate, you will get the same results I did, about 75.5% rather than 68%, within rounding error.)
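The bound arithmetic above can be sketched in a few lines of Python. This is only an illustration and not part of the course software: the mean and standard deviation are the "maeduc" figures quoted above, but the helper name `share_within` and the frequency counts in `made_up_freqs` are hypothetical, not the real "maeduc" distribution.

```python
import math

mean, sd = 11.45, 3.49          # "maeduc" statistics quoted above

lower = mean - sd               # lower bound score, about 7.96
upper = mean + sd               # upper bound score, about 14.94

# Whole-year scores that fall between the two bounds
first_score = math.ceil(lower)  # 8
last_score = math.floor(upper)  # 14

def share_within(freqs, lo, hi):
    """Percent of valid cases with scores from lo through hi (inclusive)."""
    total = sum(freqs.values())
    inside = sum(n for score, n in freqs.items() if lo <= score <= hi)
    return 100 * inside / total

# Hypothetical frequency table (score -> number of cases), for illustration only
made_up_freqs = {7: 10, 8: 20, 14: 50, 15: 20}
pct_inside = share_within(made_up_freqs, first_score, last_score)
print(first_score, last_score, pct_inside)
```

You would then compare the resulting percent with the 68% criterion. Notice that nothing in this calculation uses percentiles or cumulative percents.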

OVERVIEW


EXAM 1 IS ON
SEPTEMBER 29

GUIDE 1: INTRODUCTION
GUIDE 2: CONSTRUCTING A TABLE
GUIDE 3: UNIVARIATE STATISTICS AND DISPLAYS

PLUS
ASSIGNMENT 1 FEEDBACK

ASSIGNMENT 2 FEEDBACK

TO EDF 5400 READINGS AND ASSIGNMENTS

 

EDF 5400 INTRODUCTORY STATISTICS
FALL 2004


GENERAL GUIDE:   EXAM ONE
EXAM 1 IS ON WEDNESDAY SEPTEMBER 29

DR SUSAN CAROL LOSH


 
LAST DAY QUESTIONS ABOUT EXAM ONE? PLEASE SEE ME OR MARIA.

OR YOU MAY EMAIL US. HOWEVER, PLEASE DO NOT E-MAIL AFTER 8 PM TUESDAY NIGHT.
Different e-mail providers may take a long time to deliver their mail & we may not receive it in time. We are not responsible for late delivery of e-mail by either your provider or ours, or for server viruses that slow transmission, so please leave enough time!

IF YOU E-MAIL ME WEDNESDAY MORNING I WILL NOT HAVE ANY TIME TO RESPOND TO YOU. 
HOWEVER I WILL HAVE OFFICE HOURS 9-29 FROM 3:30-5:15


 
COVERAGE
A FEW DIFFERENCES FROM THE TEXT
BASIC CONCEPTS
SAMPLE QUESTIONS

Exam One is 100 points and should take about one hour to complete. It counts 25 percent toward your final grade.

The exam is in our classroom, regular time, CLOSED BOOK, CLOSED NOTE.
You don't need to bring a blue book but a pencil (rather than a pen) is recommended. You may wish to bring a calculator too.

In some cases you will be asked to choose the sections of a question that you answer, e.g., select three out of four sections. The purpose of this is to allow you to show off the areas that you know the best. DO NOT answer all choices in such instances. No extra credit! If you answer all the selections, we grade only the designated number of them, starting from the first. So what can happen is that (for example, in a 3 out of 4 selection question) you get parts 1, 2 and 4 right, but we only grade parts 1, 2, and 3, so your credit is lower than if you had simply answered 1, 2 and 4.

The exam is a mix of multiple choice, true-false, short essay, and data interpretation questions. You may add a SHORT explanation to any short-answer question.

The data interpretation questions will be comparable to the assignments. You will see an example below under the SAMPLE QUESTIONS section.

GENERAL EXAM ONE COVERAGE

I RECOMMEND BRINGING AN INEXPENSIVE HAND-HELD CALCULATOR (e.g., a TI 30).

This exam covers the following  in Huff:

1. Introduction (pp. 7-9) and Chapter 1, pp. 10-26
2. Chapters 2 & 3, pp. 27-52
3. Chapters 4, 5 & 6, pp. 53-73

This exam covers the following  in Agresti and Finlay:

1. Preface (entire); Chapter 1 (pp. 1-9), Chapter 2 (pp. 12-17 ONLY)
2. Chapter 3, pp. 45-67 THEN Agresti & Finlay, Chapter 3, pp. 35-44.
3. Chapter 4, SKIM entire chapter to get general ideas. Focus on: pp. 86-89 AND pp. 94-111.

PLUS:

WHAT WON'T BE ON THE EXAM

You will not have to work any formulae, although you may have to calculate percentages, a cumulative percent, or a percent change over time; hence the calculator suggestion.

Statistics such as means or standard deviations will be provided for you as they are with your computer output.

You will have to know basic differences between a sample and a sampling distribution.
However, we will cover types of samples later, after Exam 1.

You will not have to calculate a confidence interval. However, you need to know the "ingredients" of a confidence interval, notably: the sample standard deviation, the number of standard error units to go out on either side, and the sample size. You DO need to know there is a tradeoff between confidence and precision. The more standard error units on either side of the mean you include, the greater your confidence that your sample is a "good sample," that is, one close to the population mean. However, by going out more standard error units, your confidence interval is wider and, thus, less precise. By increasing the sample size, you can make your estimates more precise.
 
WHAT WILL BE ON THE EXAM

I expect you to know what a variable is and how a variable differs from a constant.

I expect you to know about properties of variable category systems.

You will need to be very familiar with levels of measurement in data. You will need to be able to receive information about a variable and accurately classify that variable as nominal, ordinal, interval, or ratio. Think of the skills that you needed and used for Assignment 2.

You will need to know the necessary components to include in constructing a table. You will need to know if any of these components are missing in the tables you examine. (Or if the table contains "too much" information.)

You need to know what percentages, rates, percent change over time, and ratios are.

You will need to be very familiar with measures of central tendency and measures of dispersion (variability).

You need to know when you can use a number for these entities and when you must use the verbal labels, that is, when true numbers are inapplicable.

You need to know how skew (extreme scores) may influence these measures, and what other measures you can use instead if skew is strongly positive or negative.

Do you know how to turn a set of metric scores (i.e., in the original metric of the variable, such as IQ points or years of age) into a set of standardized or Z-scores?
 

 
Standardized scores allow us to compare two different cases on the same variable, to see, relatively speaking, which case has the more extreme score. 

Standardized scores also allow us to compare the same case on two variables, to see whether a person's extreme score on one variable (Marilyn Vos Savant's IQ score) is matched by an extreme score on a second variable (Marilyn Vos Savant's income). We will encounter the second use of standardized scores in the second part of our course when we examine an interval-ratio level measure of correlation, Pearson's r.
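To make the conversion from metric scores to Z-scores concrete, here is a minimal sketch. The scores and the helper name `z_scores` are hypothetical, and the sketch assumes the usual definition (score minus mean, divided by the standard deviation), using the standard deviation of the listed scores themselves.

```python
import statistics

def z_scores(scores):
    """Convert metric scores to standardized (Z) scores."""
    m = statistics.mean(scores)
    sd = statistics.pstdev(scores)   # standard deviation of this set of scores
    return [(x - m) / sd for x in scores]

# Hypothetical test scores in their original metric
scores = [70, 85, 100, 115, 130]
z = z_scores(scores)

# Two statistically useful properties: Z-scores have mean 0 and standard deviation 1
z_mean = statistics.mean(z)
z_sd = statistics.pstdev(z)
print([round(v, 2) for v in z], round(z_mean, 2), round(z_sd, 2))
```

Because every Z-score is expressed in standard deviation units, a Z of +2 on one variable can be compared directly with a Z of +2 on another, whatever the original metrics were.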

We also use standardized units, counted in standard errors, when we set a confidence interval around a proportion (percentage) or a mean.
 

Do you know some statistically useful properties of a Z-score?

Can you QUICKLY tell whether a set of scores matches the criteria for the normal curve or not?
 
 

 
Try these for a fast decision:
  • If the data are NOT numeric, the data CAN'T follow a normal distribution.
  • This means nominal AND ordinal data do NOT follow a normal distribution.
  • If the histogram for the set of scores looks NOTHING like a bell-shaped curve, your scores DO NOT follow a normal distribution.
  • If the mean, median, and mode are very different numbers, your scores DO NOT follow a normal distribution.
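The last check in the list above can be tried out with a short sketch. Both sets of scores are made up for illustration, and the helper name `quick_summary` is hypothetical; how far apart the three numbers must be before you call them "very different" is still your judgment, not something the code decides.

```python
import statistics

def quick_summary(scores):
    """Mean, median, and mode of a set of NUMERIC scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
    }

# Hypothetical data: a roughly symmetric set and a positively skewed set
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 10, 25]

symmetric_summary = quick_summary(symmetric)
skewed_summary = quick_summary(skewed)
print(symmetric_summary)
print(skewed_summary)
```

In the symmetric set all three measures coincide; in the skewed set the extreme scores pull the mean well above the median and the mode.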

Do you remember the basic differences between A SAMPLE and a SAMPLING DISTRIBUTION?

Which one is a set of cases?
Which one is a set of sample results for a series of samples of the same size and type, taken at about the same time?

Although both have a mean, which entity uses a standard deviation as a measure of variability?
Which entity uses the standard error as its measure of variability?

Can you tell an accurate graphic display from a misleading graphic display?
 


A FEW DIFFERENCES FROM AGRESTI & FINLAY

In general, I agree with most of the terms and usage in Agresti and Finlay. Here are a few exceptions or clarifications.
 
1. Many variables are NOT comprised of numeric values

Agresti and Finlay define nominal, ordinal and interval-ratio data. Later on, however, it seems like they forgot the distinction and assume that most data are numeric. That's why, for example, they subtract the smallest from the largest value in a set of scores and come up with just a single number for the range when the range is an ordinal measure.

You can use subtraction when variables have a common unit (such as one year or one dollar) that separates adjacent categories. Then, the variables are truly arithmetic as they are for interval or ratio data.

It's also true that data in a computer may be coded in numeric form to speed data retrieval and processing. You have already discovered that with the General Social Survey dataset.

However, for many variables, the "numbers" stored in the computer are NOT true numeric codes. The true values of the variable, instead of being numbers, are verbal labels. For example, the nominal variable "race" might be coded "white," "black" and "other". The ordinal variable "approval of local police performance" might be graded "A" for "Excellent" through "E" for "Failure". If you used the range for a variable such as this one, you would have to give the "A" and "E" endpoints. It would make no sense at all to try subtraction and come up with "a number."

You may also find it more informative to use the generic definition of the range (that is, the minimum and the maximum scores) most of the time because these scores convey where the distribution is located. After all, 90 years of age - 85 = 5. Twenty years of age - 15 years of age also equals 5. Giving the minimum and maximum scores tells us more about where the age groups are. Personally, I find the end points of the inter-quartile range (the middle 50 percent) even more informative.

Be sure to carefully check what the level of measurement of each variable that you study or read about really means.
 
2. Are ORDINAL DATA numeric?

I am trying to be honest with you, although this topic is complex--Agresti and Finlay are trying to be honest with you, too. So call this one a clarification of the text (not a difference).

In an ordinal variable, the categories can be rank ordered from lowest to highest. However, there is not a common unit, metric, or interval that separates adjacent categories. This means that the categories of ordinal variables are not really numbers.

Do some researchers use numeric statistics (such as a mean) on ordinal data? Yes, they do, and Agresti and Finlay are trying to make you aware of that.

When researchers do use numeric statistics on ordinal data, do they always mislead? No, they do not, not always.

Do many ordinal variables conceptually have an underlying metric, although they are not measured that way? Yes, this is true. One can argue, for example, that there is some kind of metric (even if we don't know exactly how to measure it) for spouse's level of education. Other ordinal variables (the A to F "Report Card" for politicians, for example) do NOT have a clear underlying metric.

Does this mean that YOU can just go ahead and calculate means and standard deviations on ordinal variables, or assess them for a "normal distribution"? NO, YOU CANNOT!
 

 
As a beginner, please observe the level or scale of data rules. Please use only ordinal (or nominal, which you can use on any level of data) measures for ordinal variables.

And, of course, you can only use nominal measures and statistics if your data are nominal.
 
3. Don't worry about the bar graph/histogram distinction

You really CAN get the bars to touch each other when you do EXCEL charts. However, many people don't know how. And some software makes it very difficult to make the bars touch.

Concentrate on creating a clear, easy to read histogram. Did you include a title, the proper case base, and appropriate missing data? Did you use a common scale for the Y axis? Did you truncate clearly if you needed to?
 
4. Interval and ratio data

Agresti and Finlay don't really mention ratio data. Both interval and ratio data use a common and equal metric, unit or interval. However, ratio data also have a fixed or "real" zero. Count variables (number of children, siblings, books or dollars of income) have such an absolute zero. Ratio data are actually much more common in nature than interval data. We find interval data more often on humanly constructed scales, such as the GRE.

Agresti and Finlay probably don't mention ratio data because an absolute zero doesn't make much difference for the measures we examine this semester. You can do arithmetic operations (e.g., a mean) on either ratio or interval data.

You'll notice throughout this semester I group these two types of variables together, e.g., "interval/ratio variables".

If you take ratios (twice as tall; one-half as heavy), then you need a variable with an absolute zero.
 
 
5. Computer basics

Agresti and Finlay discuss computers in general and show you computer output, especially as the book progresses. However, they don't address many specifics about statistical programs.

However, we are REALLY doing computer statistical analyses and interpreting the results. If you analyze data and don't know computer basics, you can be DANGEROUS! You are more likely to overestimate what computers can do and substitute computer processing for common sense.

So, you need to know:

what is hardware (e.g., your floppy drive) and what is software (e.g., Windows)
what statistical programs can and can't do (e.g., they can do fast and accurate calculations; they CAN'T think)
what the SDA program that we use does; the SDA data routines are also typical of those in other statistical programs, such as SPSS:

as well as selecting the statistics that are most appropriate for each variable, which you did in Assignment 2
 
 
6. X-bar or Y-bar?

The mean of a variable can be designated with a letter, such as X or Y, with a little bar on the top to designate the mean score of a variable, e.g.,
X̄ (X-bar)
or
Ȳ (Y-bar)

When we are only dealing with univariate distributions, it really doesn't matter, and most statisticians use "X-bar".

However, as soon as we move to bivariate distributions, and we designate an independent and a dependent variable, everyone (including me and Agresti and Finlay) uses Y-bar to designate the mean on a dependent variable. We will do so in the later sections of our course.
 
 
7. Sampling, sampling distributions, sampling errors, and standard errors

We will discuss beginning sampling after Exam 1 in the context of inferential statistics.
We will also discuss different kinds of samples at the end of the course (time permitting).

Sampling error is a generic term to designate random variations from one sample to another.

It is different from a standard error.

A standard error (of some statistic) is essentially the standard deviation of a particular statistic in a sampling distribution, such as the standard error of the mean. Remember, in contrast: the unit in a sample is an individual case.

A sampling distribution is a set of sample results (such as means) from a series of samples of the same size and type, and the unit is an individual sample.
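A small simulation can make the sample versus sampling distribution contrast concrete. Everything here is hypothetical: the population is generated at random, and the sample size and number of samples are arbitrary choices. The sketch draws many samples of the same size, keeps each sample's mean, and compares the standard deviation of those means with the standard deviation divided by the square root of n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 20,000 scores
population = [random.gauss(100, 15) for _ in range(20_000)]

n = 25               # size of each sample
num_samples = 2_000  # how many samples make up the sampling distribution

# In ONE sample the unit is an individual case...
one_sample = random.sample(population, n)
sd_within_sample = statistics.stdev(one_sample)

# ...in the SAMPLING DISTRIBUTION the unit is an individual sample result
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(num_samples)]

se_observed = statistics.stdev(sample_means)                 # spread of the sample means
se_expected = statistics.pstdev(population) / n ** 0.5       # standard deviation / sqrt(n)
print(round(sd_within_sample, 2), round(se_observed, 2), round(se_expected, 2))
```

The spread of cases inside one sample is a standard deviation; the much smaller spread of the sample means is the standard error of the mean.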
 
 

BASIC CONCEPTS: A BARE BONES LIST

What's a VARIABLE?
HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 



What are some useful characteristics for a good category system to have?

HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 

AND REVIEW OVER HERE TOO 



Do you remember the basic components of a univariate TABLE?

HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 


Can you recognize examples of each of the following and can you define the following types of variables?

LEVELS OF MEASUREMENT 
Nominal variable
Ordinal variable
Interval variable
Ratio variable


Do you remember your percent basics?
HINT: CHECK OUT THIS CLASS WEB SITE FOR THIS TERM 



How about RATES, RATIOS, and the PERCENT CHANGE OVER TIME?

CHECK THEM OUT HERE 


What are the basics of THE NORMAL CURVE?

HINT: CHECK OUT THIS CLASS WEB SITE 



What are basic measures OF CENTRAL TENDENCY?
Which one is the most appropriate for YOUR data?
HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 



What are basic measures OF DISPERSION OR VARIATION?

When is it the most appropriate to use each one?

HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 



REVIEW the basics behind a SAMPLING DISTRIBUTION.

HINT: CHECK OUT THIS CLASS WEB SITE FOR THESE TERMS 



Graphic displays can be a lively, attractive way to array our data--or a misleading set of graphs and pictures.
Check out the example reproduced below of the animated computers.


THEN: Go back and look through the class handout. Notice how the Radon page (last page) is misleading because the bottom axis uses UNEQUAL intervals. It magnifies the effects that radon has for cigarette smokers and makes these effects look far worse than they would had equal intervals been used on the X axis. The right side axis also uses unequal intervals at the very top of it which also helps to magnify the effects of smoking cigarettes.*

*(and we now know the artist did not mean to mislead. But, because she did not have a statistics course, she knew not what was the effect of her art.)

Compare the "good" and the "bad" consumer confidence graphs (or frequency polygons). Notice how the truncation marks COULD be used (as they are in the "bad" graph) to magnify small dips and increases into large fluctuations.



 
A BIT MORE ON CONFIDENCE INTERVALS

Why do we go to the trouble to construct a confidence interval (CI)?

We know that sample results will vary from sample to sample.
On the average, the standard error will tell us how much results vary from sample to sample.

But how confident are we in our estimates?
The CI tells us what to expect for samples that produce results relatively close to the population parameter.

If we go out approximately 2 standard error units on either side of our sample mean or proportion (that means, plus or minus), 95 percent of the confidence intervals constructed in this way will contain the population mean or proportion. Multiplying the number of units we go out by the sample standard error tells us how precise we can expect our estimates to be.

Our sample precision is inversely related to the number of standard error units we go out. The more standard error units we go out on either side of the mean: (a) the more likely the interval is to contain the population mean or proportion but (b) the wider--hence less precise our estimate becomes.

We can make our estimates more precise by increasing the sample size, n. The square root of n is the denominator of the standard error (the sample standard deviation is the numerator). If we divide by a larger number, the result is a smaller standard error.
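You will not have to calculate a confidence interval on the exam, but seeing the "ingredients" combined may help. The following is a minimal sketch under stated assumptions: the function name `confidence_interval` is hypothetical, the mean and standard deviation are borrowed from the "maeduc" example on the Assignment 2 Feedback page, and the sample sizes of 100 and 400 are made up.

```python
import math

def confidence_interval(mean, sd, n, units=2):
    """Interval of `units` standard errors on either side of the sample mean."""
    standard_error = sd / math.sqrt(n)   # sd is the numerator, sqrt(n) the denominator
    margin = units * standard_error
    return mean - margin, mean + margin

mean, sd = 11.45, 3.49   # borrowed statistics; the sample sizes below are hypothetical

low_100, high_100 = confidence_interval(mean, sd, n=100)
low_400, high_400 = confidence_interval(mean, sd, n=400)
low_3u, high_3u = confidence_interval(mean, sd, n=100, units=3)

width_100 = high_100 - low_100
width_400 = high_400 - low_400
width_3u = high_3u - low_3u
print(round(width_100, 3), round(width_400, 3), round(width_3u, 3))
```

Quadrupling the sample size halves the width of the interval (more precision), while going out 3 standard error units instead of 2 widens it (more confidence, less precision).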

Of course, any one constructed CI either does contain the population mean (or proportion) or it does not. But our faith is in the process. If we take probability samples and go out 2 standard error units, about 95 out of every 100 CIs will give good estimates and only about 5 will not. When we take ONE sample, we are betting that we have one of the 95 good ones and not one of the 5 bad ones.

And (if we go out 2 standard error units) we even know the odds of being wrong: 5 in 100 for getting a bad sample.
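The "faith in the process" idea can be checked with a simulation. This is a sketch with a made-up population (mean 100, standard deviation 15) and arbitrary choices for the sample size and the number of repetitions; it simply counts how often an interval of 2 standard error units on either side of the sample mean contains the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100, 15   # hypothetical population values
n = 50                       # cases per sample
repetitions = 1000           # number of samples drawn
units = 2                    # standard error units on either side

hits = 0
for _ in range(repetitions):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    # Does this sample's interval contain the population mean?
    if m - units * se <= POP_MEAN <= m + units * se:
        hits += 1

coverage = 100 * hits / repetitions
print(coverage)
```

The percent of "good" intervals should land near 95, although it will not be exactly 95 in any one run.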
 


SAMPLE QUESTIONS: EXAM 1

This is NOT an inclusive list. However, it should serve to give you samples of the kinds of questions that will be on Exam One.

You can add a few words or a brief explanatory sentence to any answer.

Multiple choice. Select the one best or most appropriate alternative response for each question.

If you do a univariate frequencies computer run and check the statistics option, which of the following does the SDA system calculate for you?

[   ]A. A bivariate frequency distribution
[   ]B. A Chi-square
[   ]C. A standard deviation
[   ]D. None of the above

Only the standard deviation is a univariate measure.

When you examine a univariate frequency distribution for father's educational level, you notice a lot of cases in the "don't know" and "other" categories. You decide:

[   ]A. To include the invalid cases in your percentage table
[   ]B. To investigate further to find out why there are so many missing cases
[   ]C. To simply eliminate the invalid cases
[   ]D. All of the above

Lots of missing cases? Better check out why before you proceed any further.

Which ONE of the following estimates the precision of your sample statistic?

[   ]A. The confidence interval
[   ]B. The histogram
[   ]C. The median
[   ]D. The ratio

Of the terms above, only the confidence interval estimates the variability of results from sample to sample at a given level of confidence.


_____TRUE or _____ FALSE? Computers can determine the level of measurement in your variable.

Sorry, computers can do lots of things, but you should know by now that it's tough enough for a human to do this one. Computers do even worse than we do.

_____TRUE or _____ FALSE? Computers do calculations more accurately than humans.

Absolutely! Than virtually all humans, anyway.


Examples of some symbols that you should know

µ  The symbol for the POPULATION mean. (mu)

X̄  The symbol for the SAMPLE mean. (X-bar)

N  The symbol for the POPULATION total case base.

σ  The symbol for the POPULATION standard deviation. (sigma)
s  The symbol for the SAMPLE standard deviation.

Σ  The symbol meaning "to add" or "to sum". (capital sigma)

I expect you to know that the population statistical symbols and the sample statistical symbols are typically different.



INTERPRET DATA QUESTION

Here is output produced from the 2002 General Social Survey on mother's highest degree level for adult Americans:

MADEG     MOTHERS HIGHEST DEGREE
                                                        Valid     Cum
Value Label                 Value  Frequency  Percent  Percent  Percent

LT HIGH SCHOOL                  0       779     28.2     31.4     31.4
HIGH SCHOOL                     1      1289     46.6     52.0     83.5
JUNIOR COLLEGE                  2       119      4.3      4.8     88.3
BACHELOR                        3       210      7.6      8.5     96.8
GRADUATE                        4        80      2.9      3.2    100.0
NAP                             7       168      6.1   Missing
DK                              8       103      3.7   Missing
NA                              9        17       .6   Missing
                                     -------  -------  -------
                            Total      2765    100.0    100.0

Mean          1.000      Std err        .020      Median        1.000
Mode          1.000      Std dev        .996      Skewness      1.300
S E Skew       .049      Minimum        .000      Maximum       4.000
 

Percentile    Value      Percentile    Value      Percentile    Value
  25.00        .000        50.00       1.000        75.00       1.000

Valid cases    2477      Missing cases    288


Describe ONE piece of information that is missing in this data array that would be needed to make a complete table. (If you believe the table is complete, simply state this.)

This "table" lacks a title. (This is really computer output, not a table.) It also lacks a data source.

Describe ONE piece of information in this data array that is extraneous, or unnecessary. (If you believe there is no extraneous information, simply state this.)

Having both frequencies and percents clutters the data array. You probably don't want to include the cumulative percentages in your presentation table on top of the regular percents (although you may want to USE the cumulative percents for some calculations).

Is the "MADEG" variable an (CHECK ONE:)
 [   ]A.  Interval
 [   ]B. Nominal
 [   ]C. Ordinal or
 [   ]D. Ratio variable?

Briefly explain the rationale behind your answer: The categories can be rank ordered from highest to lowest. A mother who has a graduate degree has a higher degree level than a mother with a junior college degree. Because we can't see HOW MUCH more (or less), these data are ordinal, not numeric.

The measure of central tendency most appropriate to this variable would be (CHECK ONE):
[   ]A. Mean
[   ]B. Median
[   ]C. Mode
[   ]D. Standard deviation

The median is a measure of central tendency that is more informative than the mode and it is the most sophisticated to use on ordinal data.

This measure of central tendency would approximately correspond to a category value label of:

_________(FILL IN THE RIGHT CATEGORY VALUE LABEL FROM THE STATISTICS)

"1", that is, "High School". We use the VERBAL CATEGORY (NOT a number) when we are dealing with ordinal data that has verbal category labels.

What is the best measure of dispersion or variation to use to describe this data array?

 [   ] A. The Index of Dispersion (D)
 [   ] B. The inter-quartile range
 [   ] C. The standard deviation
 [   ] D. The standard error

Typically, the range or the inter-quartile range is the most descriptive measure for ordinal data. The IQR is generally better to use than the range when your variable has several values to identify the "center" of the distribution or the middle 50 percent.

Remember to be consistent with your measure of central tendency and dispersion. If you pick the median, the range or the inter-quartile range are the accompanying measures of dispersion.
 
 

MEASURE OF CENTRAL TENDENCY    ASSOCIATED MEASURE OF DISPERSION
Mode                           Index of Dispersion
Median                         Range or the Inter-Quartile Range
Mean                           Standard Deviation

What is (are) the value(s) of this measure of dispersion?

Use the quartile cutpoints to locate the 25th and the 75th percentiles. Those are "0" and "1"--those correspond to verbal labels of "less than high school degree" and "high school degree".
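If you want to see where those quartile cutpoints come from, here is a minimal sketch that walks the valid cases of the MADEG output in rank order. The labels and frequencies are taken from the output above; the function name `category_at_percentile` is hypothetical, and the rule used (the first category whose cumulative percent reaches the target) is one common convention, assumed here for illustration.

```python
# Valid categories of MADEG in rank order, with their frequencies from the output
labels = ["LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE"]
freqs = [779, 1289, 119, 210, 80]

def category_at_percentile(labels, freqs, pct):
    """Return the verbal label of the category containing the given percentile."""
    total = sum(freqs)          # valid cases only; missing cases are excluded
    running = 0
    for label, f in zip(labels, freqs):
        running += f
        if 100 * running / total >= pct:
            return label
    return labels[-1]

q1 = category_at_percentile(labels, freqs, 25)
median = category_at_percentile(labels, freqs, 50)
q3 = category_at_percentile(labels, freqs, 75)
print(q1, "|", median, "|", q3)
```

Note that the function returns verbal labels, not numbers, which is what we report for ordinal data.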



NOTE: If you have truly ordinal data, be sure to give the verbal label endpoints. DO NOT SUBTRACT TO CREATE A SINGLE NUMBER with truly ordinal data because that doesn't make ANY sense.

What percent of mothers have at least a high school degree?

This means A HIGH SCHOOL DEGREE OR MORE, i.e., a high school degree, junior college (or vocational) degree, baccalaureate, or graduate degree, or: 52.0% + 4.8 + 8.5 + 3.2 = 68.5% (all percents add to 99.9% due to rounding error).

68.5% of respondents' mothers have at least a high school diploma.

What percent of mothers have at most a junior college degree?

This means A JUNIOR COLLEGE DEGREE OR LESS, i.e., less than high school, high school degree, or junior college degree or: 31.4% + 52.0 + 4.8 = 88.2%

88.2% of respondents' mothers have at most a junior college degree.

(You can also just read this off the cumulative percent column, the 0.1% difference is due to rounding error; either figure would be correct.)
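The same two answers can be computed directly from the frequencies rather than by adding rounded percents. This is only a sketch; the frequencies come from the MADEG output above, and the variable names are made up for illustration.

```python
# Valid-case frequencies from the MADEG output, in rank order
freqs = {
    "LT HIGH SCHOOL": 779,
    "HIGH SCHOOL": 1289,
    "JUNIOR COLLEGE": 119,
    "BACHELOR": 210,
    "GRADUATE": 80,
}
valid_cases = sum(freqs.values())   # 2477; the missing cases are left out

# "At least a high school degree" = high school degree OR MORE
at_least_hs = 100 * (valid_cases - freqs["LT HIGH SCHOOL"]) / valid_cases

# "At most a junior college degree" = junior college degree OR LESS
at_most_jc = 100 * (freqs["LT HIGH SCHOOL"] + freqs["HIGH SCHOOL"]
                    + freqs["JUNIOR COLLEGE"]) / valid_cases

print(round(at_least_hs, 2), round(at_most_jc, 2))
```

Working from the raw frequencies can differ by about 0.1 from sums of the already rounded valid percents; as noted above, either figure would be correct.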


Briefly describe TWO (and ONLY two) properties of a good variable category system:
 
CHECK HERE
REVIEW HERE TOO


For each of the following variables, please indicate (1) whether the variable is nominal, ordinal, interval, or ratio and (2) IN ONLY ONE SHORT SENTENCE describe the reason behind your decision:
 
1. Years of Education
2. GRE Scores
3. Gender of teacher
4. Class ranking (e.g., Valedictorian, Salutatorian, etc.)

Years of education is truly numeric, equal intervals, with a fixed zero. It is RATIO.

GRE scores are standardized to have equal interval units. They are INTERVAL. Standardized tests are typically created to be interval-level variables.

Gender categories are simply names or tags. We can't rank them. Gender is NOMINAL.

Class ranking means that we have placed students in the class in order according to their grade point average. This variable is ORDINAL.



For each of the following variables, please indicate the best or most appropriate measure of central tendency or central location for that variable:
 
1. Years of education
2. GRE Scores
3. Gender of teacher
4. Class ranking (e.g., Valedictorian, Salutatorian, etc.)

Years of education is ratio, use the mean. This is the arithmetic average of the scores and can only be used on numeric (interval or ratio) data.

GRE score is interval, use the mean.

Gender is nominal, use the mode. The mode is the most frequent score, or the category with the largest number of cases. If your data are nominal, you can ONLY use the mode.

Class ranking is ordinal, use the median. The median is the middle position or the 50th percentile. Be sure you rank all your scores in order BEFORE you apply the median.

NOTE: If you expect your interval level variable to have a skew (a disproportionate number of extremely high or low scores that would "yank" the mean up or down), such as income scores, use the median instead (but BE SURE to explain your answer).


For each of the following variables, please indicate the best or most appropriate measure of dispersion or variability for that variable:
 
 
1. Years of education
2. GRE Scores
3. Gender of teacher
4. Class ranking (e.g., Valedictorian, Salutatorian, etc.)

Years of education is ratio, use the standard deviation. This is the "average" deviation of a score from the mean. (Remember, first you square the difference between the score and the mean to eliminate negative signs; at the end of the process, you take the square root of the variance.) (Also remember: if you decided to use the median for central tendency, use the range or inter-quartile range for your measure of variation.)

GRE score is interval, use the standard deviation. (See the note above if you decided to use the median rather than the mean.)

Gender is nominal, use the Index of Dispersion (it will be 1 if you have a 50-50 male-female split). If you have nominal data, D is the ONLY measure of dispersion that you can use.

Class ranking is ordinal, use the range. The most general definition of the range is that it is the two end points of the distribution, the highest score and the lowest score (after all the scores are placed in order). Make sure you use the verbal category labels when the variable you examine is ordinal and remember: you can't add or subtract the values of ordinal variables because the categories aren't numbers.

You could also use the inter-quartile range, which is even more informative. This is the middle 50 percent.
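You will not work these formulae on the exam, but the following sketch may help you see what two of the measures above do. Several things here are assumptions: the scores and counts are made up, the standard deviation divides by n - 1 (the sample version), and the Index of Dispersion is written in a commonly used form, D = k(N² - Σf²) / (N²(k - 1)), where k is the number of categories and f is each category's frequency. Check this against the class materials before relying on it.

```python
import math
import statistics

def standard_deviation(scores):
    """Sample standard deviation, following the steps described above."""
    m = sum(scores) / len(scores)
    squared_diffs = [(x - m) ** 2 for x in scores]   # squaring removes negative signs
    variance = sum(squared_diffs) / (len(scores) - 1)
    return math.sqrt(variance)                        # square root of the variance

def index_of_dispersion(category_counts):
    """Assumed form of D: 0 = all cases in one category, 1 = evenly spread."""
    k = len(category_counts)
    n = sum(category_counts)
    sum_sq = sum(f ** 2 for f in category_counts)
    return k * (n ** 2 - sum_sq) / (n ** 2 * (k - 1))

# Hypothetical years of education (ratio data)
years = [10, 12, 14, 16, 18]
sd_years = standard_deviation(years)

# Hypothetical gender counts (nominal data)
d_even = index_of_dispersion([50, 50])       # 50-50 split
d_one_group = index_of_dispersion([100, 0])  # everyone in one category
print(round(sd_years, 3), d_even, d_one_group)
```

With a 50-50 male-female split this version of D equals 1, which agrees with the statement above; with everyone in one category it equals 0.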




REVIEW: Here is the series of pictographs from Guide 3, where the researcher wants to compare better and less well educated households on their computer ownership. Remember those cheerful computers?
 
 

CORRECT DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE    HOUSEHOLDS WITH COLLEGE DEGREE
40 PERCENT                            80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

The pictograph above is a good representation. All the computer icons are the same size. So we can legitimately say that the percentage of households owning at least one computer is twice as high among households with a college degree as among households with only a high school diploma.

CORRECT (BUT UGLY) DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE    HOUSEHOLDS WITH COLLEGE DEGREE
40 PERCENT                            80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

The pictograph above is accurate, but most graphic artists won't like it. Because the larger computer is the same width and twice as high, it looks out of proportion.
 
 

INCORRECT (BUT GOOD-LOOKING) DEPICTION
Percent of United States Households Owning at least One Personal Computer by Education

HOUSEHOLDS WITH HIGH SCHOOL DEGREE    HOUSEHOLDS WITH COLLEGE DEGREE
40 PERCENT                            80 PERCENT

Source = Current Population Survey Internet and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)

Graphic artists and most laypeople will like the third pictograph better than the second. It is "prettier." It is also misleading and inaccurate, because the big computer is now four times the size of the smaller computer, instead of an accurate twice as large.

MORAL:  Stick with multiple representations of the SAME SIZE ICON. Maybe it won't be as dramatic as a large icon, but you will have an accurate summary to display.
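The arithmetic behind the three computer pictographs can be sketched as follows. This assumes the "good-looking" icon was produced by doubling both the width and the height of the small icon, which is the usual way an icon ends up four times the size; the unit dimensions and the function name `icon_area` are hypothetical.

```python
def icon_area(width, height):
    """Area of an icon; viewers tend to judge icon size by area."""
    return width * height

small = icon_area(1, 1)            # icon standing for 40 percent
tall = icon_area(1, 2)             # same width, twice as high: accurate but ugly
scaled = icon_area(2, 2)           # width AND height doubled: pretty but misleading

accurate_ratio = tall / small      # 2.0, matches 80 percent versus 40 percent
misleading_ratio = scaled / small  # 4.0, overstates the difference
print(accurate_ratio, misleading_ratio)
```

Repeating the same-size icon twice gives the same accurate 2-to-1 area ratio as the tall icon, without the distorted look.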

Here's another, similar example. Let's suppose you are looking at annual housing starts and you want to compare communities. Below is a nice house icon:

You can let each house icon represent 10,000 houses that began construction that year. Suppose Tallahassee had 20,000 housing starts and Jacksonville, Florida had 50,000 housing starts. Compare Jacksonville and Tallahassee below:

NUMBER OF HOUSING STARTS PER COMMUNITY 1998. EACH HOUSE = 10,000 HOUSING STARTS
 

JACKSONVILLE, FLORIDA TALLAHASSEE, FLORIDA

QUESTIONS ON THIS PICTOGRAPH:

Which of the following is true about the findings in this pictograph?

True or False?  Jacksonville had two and a half times as many housing starts as Tallahassee in 1998.

50,000 versus 20,000

True or False? Tallahassee had 30,000 fewer housing starts than Jacksonville in 1998.

20,000 versus 50,000

True or False? The portrayal of the pictorial icons in this exhibit shows them out of proportion.

Each little house is the same size and each stands for 10,000 housing starts.

True or False? We cannot compare the number of housing starts because this is nominal data.

We can't take means or medians, but we can still see that 50,000 is over twice as many housing starts as 20,000. We can compare two groups with the same variable on any kind of data.
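The comparisons in these pictograph questions come down to a ratio and a difference. Here is a minimal sketch using the housing start figures given above; the variable names are made up for illustration.

```python
STARTS_PER_ICON = 10_000   # each house icon = 10,000 housing starts

jacksonville = 50_000
tallahassee = 20_000

# Number of same-size icons drawn for each community
jacksonville_icons = jacksonville // STARTS_PER_ICON   # 5
tallahassee_icons = tallahassee // STARTS_PER_ICON     # 2

ratio = jacksonville / tallahassee        # how many TIMES as many starts
difference = jacksonville - tallahassee   # how many MORE starts
print(jacksonville_icons, tallahassee_icons, ratio, difference)
```

Because every icon is the same size, the 5-to-2 icon count shows the same two-and-a-half-to-one ratio as the underlying numbers.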



In a similar vein, double check that your intervals on the x and y axes of a graph are equal intervals. The radon graph in the handout is a BAD EXAMPLE. Its use of unequal intervals makes for dramatic reading (wow--look how strongly radon affects smokers as opposed to nonsmokers) but this graph gives the reader a false impression because the intervals on the x or bottom axis are not equal, and neither are the intervals on the y or far right vertical axis.


PLEASE NOTE: IF ANY QUESTIONS ARE POSTED TO ME THAT WOULD CHANGE OR CLARIFY THE CONTENT OF THIS SITE, I WILL CORRECT IT AT THE TOP OF THIS SITE.
 
 
ATTENTION OFFICE HOURS REMINDER:

Teaching Assistant Maria Teresa Ferreira  office and office hours

Maria meets students at the LRC which is 124 Stone Building
Office hours are 3:15-5:00 PM Tuesday and Thursday

(My normal office hours are Monday and Wednesday 3:30-5:00 PM in 307K Stone)
 


 


Susan Carol Losh September 19, 2004