AT THIS WRITING, FSU IS FULLY OPERATIONAL. PLEASE ACCESS THE "STORM ALERTS" LINK ON http://www.fsu.edu FOR UPDATED INFORMATION. I WILL MAKE AN ANNOUNCEMENT ON BLACKBOARD IF OR WHEN THE SITUATION CHANGES.
OVERVIEW

EXAM 1 IS SEPTEMBER 29
HERE IS THE STUDY GUIDE

GUIDE 1: INTRODUCTION
GUIDE 2: CONSTRUCTING A TABLE
GUIDE 3: UNIVARIATE STATISTICS AND DISPLAYS
GUIDE 4: BIVARIATE BASICS
GUIDE 5: BIVARIATE CORRELATIONS
GUIDE 6: MULTIVARIATE CROSSTABULATIONS
GUIDE 7: BASIC REGRESSION
GUIDE 8: REGRESSION SPECIFICS
GUIDE 9: SAMPLING
TO EDF 5400 READINGS AND ASSIGNMENTS
 
RETURN TO 
ASSIGNMENT PORTAL


 
EDF 5400 INTRODUCTORY STATISTICS
FALL 2004

DR SUSAN CAROL LOSH


 
 

ASSIGNMENT 2: UNIVARIATE DISTRIBUTIONS: SELECTING MEASURES OF CENTRAL TENDENCY AND DISPERSION, AND UNIVARIATE DISPLAYS

REVIEW ASSIGNMENT 2 HERE

GENERAL FEEDBACK ASSIGNMENT 2


 
PLEASE LOOK OVER THIS SITE AND YOUR EXERCISE WHEN YOU RECEIVE IT BACK.
I WILL CORRECT ANY ADDITION ERRORS OR TRANSLATE MY HANDWRITING DURING BREAK.
ANY OTHER ISSUES, PLEASE WAIT UNTIL AFTER CLASS. THANK YOU.

LAST DAY QUESTIONS ABOUT EXAM 1?

YOU MAY EMAIL ME OR MARIA. PLEASE DO NOT E-MAIL AFTER 8 PM TUESDAY NIGHT.

Different e-mail providers may take a long time to deliver their mail & we may not receive it in time. We are not responsible for late delivery of e-mail by either your provider or ours, or for server viruses that slow transmission, so please leave enough time!

IF YOU E-MAIL US WEDNESDAY MORNING WE WILL NOT HAVE TIME TO RESPOND TO YOU. 
 

MARIA WILL BE AVAILABLE IN THE LRC TUESDAY AFTERNOON 3:15-5.
I WILL BE AVAILABLE MONDAY AFTERNOON 3:30-5 AND WEDNESDAY 2-5 FOR LAST MINUTE QUESTIONS.
RYAN WILKE, A FORMER TA FOR THIS COURSE, CAN ALSO HELP OUT IF YOU CAN COME OUT TO THE HIGH MAGNETIC LAB (INNOVATION PARK NEAR ALUMNI VILLAGE) WHERE HE WORKS ON TUESDAY. EMAIL RYAN AT: raw7447@garnet.acns.fsu.edu FOR QUESTIONS AND DIRECTIONS.


 

This assignment is worth 5 PERCENT toward your final grade.
Remember! I use plus and minus grading on assignments and for the final grade.


This Feedback page is generic. If you feel it does not address the score on your paper, please make an appointment and we will go over your paper.

Please do not ask me or Maria to address your individual paper during class time or break, although after class is fine.

I am at least as interested in how you arrive at your answer as what your answer is.

This assignment is a good example. Let's suppose you misidentified mother's years of education as an ordinal variable (it is really a ratio variable with a fixed 0 and equal intervals of 1 year). Do you know the best measures to use with an ordinal variable? In this example, if you said the median and the Inter-Quartile Range, you would receive credit on those questions, because generally these are the best methods to use with an ordinal variable.

On the other hand, continuing with this example, if you misidentified  maeduc as ordinal, then said the mode and the Index of Dispersion were the best measures of central tendency and dispersion to use, you LOST credit. Why? Because I am looking for consistency between what you thought the level of the variable was and the best methods to use with that kind of variable. Although it's the best we can do for nominal data, generally, the mode is a poor measure of central tendency for ordinal or numeric data.
 

The 18-20 point paper
 (2 points).



Your histogram (3 points total) used equal intervals on the frequency or percentage axis (the "Y axis") and equal-width bars for the "babies" variable (on the "X axis"). Either frequencies or percentages were acceptable, BUT NOT BOTH. That clutters the histogram.

Similarly, the good histogram EXCLUDED percentages (or frequencies) in the body of the histogram because anything that clutters the interior of the histogram makes it more difficult to read. Certainly, the good histogram did NOT include percentages (or frequencies) in BOTH the body of the histogram AND on the "Y axis".

The "Y axis" could either be percentages of households or the frequencies. It was NOT "relative frequencies" (which could also refer to proportions or rates). Further, to have full credit, you need to label your bars correctly (i.e., you lost credit if you called percentages frequencies instead or vice-versa.)

Although the bars should touch, I recognize that it might be difficult to locate the option in your computer program that makes them do so. Therefore it does NOT count against you if the bars did not touch.

However, your histogram should be complete: it should include a title, data source, total valid cases (that the histogram is based on) and missing cases.

It should be ACCURATE. The data source for this exercise was the 2002 General Social Survey. There were only 28 missing cases on the Babies variable.

Do not count the filtered cases for the years 1972-2000, which were never accessed, as missing cases.



Your designation of the level of measurement must relate to the variable as presented in the output. For example, although religious attendance ("attend") COULD have been measured in times per month or per year (a ratio scale), the categories for this variable as gathered in the data came in unequal and irregular intervals, and thus the category system is not ratio. Therefore, "attend" as presented is ORDINAL.

Ordinal data is non-numeric. It cannot follow a normal distribution.
Nominal data is non-numeric too, so it can't be normally distributed.

On the other hand, "maeduc" (mother's years of education) is RATIO. One year is the unit and you can't have less than zero years of education. (If you said interval, that was OK for purposes of this assignment.) The key thing is that you recognized that "maeduc" is numeric.  Even if the last category was truncated, this is basically a ratio variable. There are several reasons why the "high category" may be truncated or collapsed: there may be very few people in this category; the extreme high scores may be so extreme that they will distort univariate--and more complex--statistics; a very extreme score might enable a nosey individual to identify that person (the federal US government is very sensitive to this issue).

marital  (individual's marital status)  is NOMINAL.  We can say whether two people have the same or different marital status, but there is no inherent rank order to the categories.

The "marital" category values are not numbers or even graded positions. The order of the categories themselves is arbitrary, so it doesn't matter if or how you rearrange them.  That is the meaning of nominal data. If you rearranged categories for an ordinal or interval/ratio variable, it would make a difference. The categories would no longer have a rank order.

babies (number of household members under 6) is RATIO. You can count the number of household members under 6 and you can't have fewer than zero.

This section counted 4 points.


I examine your designated measures of central tendency and dispersion in the context of your answer about level of measurement. For example, if you incorrectly designated "marital" as ordinal, I expected you to choose the median as the best measure of central tendency and the range or inter-quartile range as the best measure of dispersion. While you lost one point for the earlier question, you would get 1 FULL POINT for correctly choosing the best measures for an incorrectly identified ordinal variable. However, if you designated "marital" as ordinal then selected the mode as the most appropriate measure of central tendency, you lost credit (unless you had a really good explanation) because the median is usually the best measure of central tendency for ordinal data.
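The consistency rule described above can be sketched as a small lookup. This is only an illustration in Python, not the actual grading procedure: the names BEST_MEASURES and is_consistent are hypothetical, and the pairings are simply the ones stated on this page (interval is treated like ratio).

```python
# Hypothetical helper: pairs each level of measurement with the measures
# this page treats as the best choices for that level.
BEST_MEASURES = {
    "nominal": {"center": {"mode"}, "dispersion": {"index of dispersion"}},
    "ordinal": {"center": {"median"}, "dispersion": {"inter-quartile range", "range"}},
    "interval": {"center": {"mean"}, "dispersion": {"standard deviation"}},
    "ratio": {"center": {"mean"}, "dispersion": {"standard deviation"}},
}


def is_consistent(level, center, dispersion):
    """Return True if both measures match the level the student identified."""
    best = BEST_MEASURES[level.lower()]
    return center.lower() in best["center"] and dispersion.lower() in best["dispersion"]


# A variable identified as ordinal, paired with ordinal measures: consistent.
print(is_consistent("ordinal", "median", "inter-quartile range"))
# The same variable paired with nominal measures: inconsistent.
print(is_consistent("ordinal", "mode", "index of dispersion"))
```

The sketch deliberately ignores judgment calls, such as choosing the median for a heavily skewed ratio variable, which need an explanation rather than a table lookup.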

Because "attend" as presented is ordinal, its best measure of central tendency is the median. You must use the VERBAL CATEGORY that corresponds to category number 3, which is "several times a year," because the variable values are not really numbers.

Because "maeduc" is a ratio variable, its best measure of central tendency is the mean, which is the number "11.45."
 

 
Remember that it's OK to have a fractional mean FOR A GROUP. Although the number of children under 6 ("babies") is an integer or "whole number," for a single household, it can be a fraction (0.18) for a group presentation. "Babies" was also a sizably skewed variable, with the largest concentration at 0. If you gave the "skew" argument and used the median, this was fine. BUT: just remember that if you use the median as your best measure, use the inter-quartile range or the range as your associated measure of dispersion.
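To see how a count variable can have a fractional group mean while its median stays a whole number, here is a minimal Python sketch. The frequency table is invented for illustration (it is NOT the GSS "babies" distribution), so the numbers it prints will not match the 0.18 reported above.

```python
# Hypothetical frequency table: number of children under 6 -> households.
# These counts are invented for illustration; they are NOT the GSS figures.
counts = {0: 850, 1: 100, 2: 40, 3: 8, 4: 2}

n = sum(counts.values())

# Group mean: total number of children divided by total number of households.
group_mean = sum(value * freq for value, freq in counts.items()) / n

# Median: the first value at which the cumulative count reaches half the cases.
running = 0
for value in sorted(counts):
    running += counts[value]
    if running >= n / 2:
        group_median = value
        break

print(f"mean = {group_mean:.2f}, median = {group_median}")
```

With most households at zero, the mean is pulled above the median by the few large values, which is the "skew" argument for preferring the median.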

Because marital is a nominal variable, its best measure of central tendency is the mode. You must use the VERBAL CATEGORY that corresponds to category number 1, which is "married."

Combining both the best measure of central tendency for each variable AND the correct value for the mean, median or mode was worth a total of 4 points.



Because marital is a nominal variable, its best measure of dispersion is the Index of Dispersion "D." Like many statistical programs, the SDA system does not calculate "D." (If you calculated it by hand, the value was 0.86.)

If you decided the mode was the best measure of central tendency for any of the other three variables, again, the measure of dispersion to use is the Index of Dispersion "D." The SDA system does not calculate "D." If you did it by hand, you probably had a hard time for either "attend" or mother's years of education.

You did not lose credit if you miscalculated "D" unless you presented a number larger than one.
D varies between zero and one. D CANNOT be larger than 1.
For fractional indices such as "D," use TWO decimal places.
(The convention will be two decimal places for correlation coefficients also.)
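For anyone computing "D" by hand, the sketch below may help with checking the arithmetic. It ASSUMES the common textbook definition D = k(N^2 - sum of f^2) / (N^2 (k - 1)), where k is the number of categories, N the number of cases and f each category frequency; if your course guide defines "D" differently, use that definition instead. The function name and the frequencies are hypothetical.

```python
def index_of_dispersion(frequencies):
    """D = k * (N^2 - sum of f^2) / (N^2 * (k - 1)) for k categories, N cases.

    Assumed definition; it equals 0 when every case falls in one category
    and 1 when cases are spread evenly across all categories.
    """
    k = len(frequencies)
    n = sum(frequencies)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and at least one case")
    sum_of_squares = sum(f * f for f in frequencies)
    return k * (n * n - sum_of_squares) / (n * n * (k - 1))


# Invented frequencies for a five-category nominal variable (not GSS data).
hypothetical = [500, 100, 150, 50, 200]
d = index_of_dispersion(hypothetical)

# Sanity check from the rule above: D can never be larger than 1.
assert 0.0 <= d <= 1.0
print(f"D = {d:.2f}")  # two decimal places, per the convention above
```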

Because "maeduc" is a ratio variable, its best measure of dispersion is the standard deviation, which is the number "3.49."

Because "attend" is ordinal, its best measure of dispersion is also ordinal. The Inter-Quartile Range is much more informative than the Range (the range simply runs from the lowest attendance category to the highest). The IQR, that is, the endpoints of the "middle 50 percent," runs from less than once a year to nearly once a week.

Do NOT subtract for either the range or the IQR when you have ordinal data. The categories are not numbers and they cannot be added or subtracted.

Do your cumulative percents carefully! Some people were one category off on the high or the low end when they calculated their IQR.
 

 
What if you are "almost there" with a cumulative percent of 74.6 and you just "fall short" of the 75th percentile? BE CAREFUL! This looks like a case where it "makes sense" to stop short, and I do see the point. But once you start fudging, where do you stop? One of the problems with an ordinal distribution with relatively few categories is that the jump to the next category may be very large. As a general rule, stay with the 25th and 75th percentiles and be consistent.
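The "no fudging" rule can be written down as a short procedure: walk up the cumulative percents and stop at the first category that actually reaches the target percentile. The Python sketch below uses invented category labels and percentages (not the "attend" output), chosen so that one category stops at 74.6 percent.

```python
def percentile_category(labels, percents, target):
    """Return the first category whose cumulative percent reaches the target."""
    cumulative = 0.0
    for label, pct in zip(labels, percents):
        cumulative += pct
        if cumulative >= target:
            return label
    return labels[-1]  # guards against totals like 99.9 caused by rounding


# Hypothetical ordinal variable; cumulative percents are 20.0, 44.6, 74.6, 89.6, 100.
labels = ["low", "medium-low", "medium", "medium-high", "high"]
percents = [20.0, 24.6, 30.0, 15.0, 10.4]

q1 = percentile_category(labels, percents, 25)
q3 = percentile_category(labels, percents, 75)

# Report the endpoints as verbal categories; do not subtract them.
print(f"IQR runs from '{q1}' to '{q3}'")
```

Because 74.6 falls short of 75, the upper endpoint moves to the next category, however large that jump feels.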

This is good practice for the decisions we make next about accepting or rejecting the null hypothesis. Generally, the cutoff is "p < .05"; that is, if there were no relationship in the population, you could expect to observe results as extreme as yours 5 times in 100 or fewer. Do you "fudge" if p = .053? If p = .058? And if so, where do you draw the line?

Combining both the best measure of dispersion for each variable AND identifying the correct value for that measure (or indicating that "D" was unavailable) was worth a total of 4 points.


Are any of your variables normally distributed? Let's make it easy.

You need a numeric variable to examine a normal distribution. (How else can you discuss the mean or standard deviation which require arithmetic operations such as subtraction or division?)

"Attend" and "marital" are NOT numeric variables. Therefore they can't be normally distributed.

"Babies" IS a numeric variable, so let's examine a second or even a third criterion for a normal distribution.

"Babies" (1) has a large positive skew (those few cases with 3 or 4 household members under age 6) and (2) it doesn't look anything like a "bell" shape. Instead, the frequency distribution for "babies" looks like a backward "J". So rule out "babies" as following a normal distribution.

"Maeduc" is numeric and approximately bell-shaped. Its skew is negative and relatively small (-.82).

The mean, median and mode are the same number when data follow a normal distribution. For "maeduc" in this sample, the mean is 11.45 years, the median is 12 years, and the mode is 12 years. Are these "almost" the same? The mean is more than 2 standard error units (.07 X 1.96) away from the median but it's still pretty close.
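The standard error comparison above is quick to verify. This short Python check uses only the figures quoted on this page (mean 11.45, median 12, standard error .07); the variable names are just for illustration.

```python
mean, median = 11.45, 12.0   # values reported above for "maeduc"
standard_error = 0.07        # standard error of the mean, as quoted above

margin = 1.96 * standard_error           # about 0.14 years
gap = abs(median - mean)                 # 0.55 years
gap_in_se_units = gap / standard_error   # roughly 7.9 standard errors

print(f"1.96 SE = {margin:.2f}; gap = {gap:.2f}; gap / SE = {gap_in_se_units:.1f}")
```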

"Maeduc" is close enough to apply the standard deviation property of the normal curve.
 
 

 
The mean + 1 standard deviation = 11.45 + 3.49 = 14.94, or about 15.
The mean - 1 standard deviation = 11.45 - 3.49 = 7.96, or about 8.
The cumulative percent from (and including) 8 through 14 =
9.3 + 2.8 + 5 + 3.5 + 41.6 + 4.8 + 8.5 = 75.5%
Under a normal distribution it would be about 68%.

The mean + 1.96 standard deviations = 11.45 + (3.49 x 1.96) = 11.45 + 6.84 = 18.29.
The mean - 1.96 standard deviations = 11.45 - (3.49 x 1.96) = 11.45 - 6.84 = 4.61.
The cumulative percent from (and including) 4 through 17 =

1.1 + 1 + 3 + 1.5 + 9.3 + 2.8 + 5 + 3.5 + 41.6 + 4.8 + 8.5 + 1.7 + 8.6 + .5 = 92.9%
Under a normal distribution it would be about 95%.

(Strictly speaking, I would have taken 0.39 x 1.1 for category 4 and 0.29 x 2.2 for category 18 to reflect how cases are spread through each of those two categories; by rounding I took all of category 4 and none of category 18, which is a rough estimate.)
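If you want to re-add the percentages yourself, the Python sketch below sums the category percents listed in this section, for both the whole-category version and the finer "strictly speaking" version. The percentages are copied from the sums on this page; the variable and function names are hypothetical.

```python
# Percent of valid cases at each year of mother's education, copied from
# the sums on this page (years 4 through 18 only).
percent_by_year = {
    4: 1.1, 5: 1.0, 6: 3.0, 7: 1.5, 8: 9.3, 9: 2.8, 10: 5.0, 11: 3.5,
    12: 41.6, 13: 4.8, 14: 8.5, 15: 1.7, 16: 8.6, 17: 0.5, 18: 2.2,
}
mean, sd = 11.45, 3.49


def percent_between(low_year, high_year):
    """Sum the percentages for whole years low_year..high_year inclusive."""
    return sum(p for year, p in percent_by_year.items() if low_year <= year <= high_year)


within_1_sd = percent_between(8, 14)     # compare with 68% for a normal curve
within_196_sd = percent_between(4, 17)   # compare with 95% for a normal curve

# Finer estimate: take only the part of categories 4 and 18 inside the limits.
low, high = mean - 1.96 * sd, mean + 1.96 * sd
fraction_of_4 = 5 - low      # share of the 4-to-5 interval above the lower limit
fraction_of_18 = high - 18   # share of the 18-to-19 interval below the upper limit
refined = (fraction_of_4 * percent_by_year[4]
           + percent_between(5, 17)
           + fraction_of_18 * percent_by_year[18])

print(f"within 1 SD: {within_1_sd:.1f}%")
print(f"within 1.96 SD (whole categories): {within_196_sd:.1f}%")
print(f"within 1.96 SD (partial end categories): {refined:.1f}%")
```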
 

A "statistical purist" would say, no, not normal.
A "somewhat impurist" (me) would say "approximately normal".
Either way, I looked for your reasoning, not a simple "yes" or "no".  I looked to see what evidence you used and if you used it correctly.
And I expected you to cite at least two properties of the normal curve in your answer. That's what could enable you, for example, to distinguish between "maeduc" and "babies".

Assessing whether any of your variables were normally distributed, and why, was worth a total of 3 points.
 
 

 
Are ordinal variables numeric? NO! The intervals that divide adjacent categories are uneven or irregular. There is not one common unit such as a year of age, a single sibling, or one dollar of income. 

I know Agresti and Finlay state that sometimes interval level statistical techniques are applied to ordinal level data. Virtually anyone who is a professional in your discipline can tell you that too. Interval level statistics are typically more powerful than ordinal level statistics and use more information about the data, so the temptation is very strong to use means or other interval measures on ordinal data. Just because this is often done doesn't mean it's the right thing to do. 
 


 
YOU LOST CREDIT IF

You used numeric categories for a nominal or ordinal value of central tendency or dispersion and didn't state what the verbal values were. Similarly, you shouldn't do subtraction for either the range or the inter-quartile range UNLESS your data are numeric (even then, the end points are more informative).

You used number categories that appeared NOWHERE on your output.

You incorrectly identified the level of measurement for the type of variable that you had.

BEWARE! You will have similar problems on Exam 1.
Further, you must be able to identify the type of variables that you have in order to identify the best measure of correlation for those variables.

You chose an inappropriate measure of central tendency or dispersion for the kind of data that you identified.

You overused the mode. Often, the mode is uninformative. If the data are relatively evenly spread across categories, the mode does not give useful information (and what do you do if you have TWO OR MORE modes?). The mode does not incorporate every single score. We use a mode (always) with nominal data because we can't use anything else. But the median or the mean, especially coupled with their associated measures of dispersion, give us more information about what the "typical score" really looks like.

Your measures of central tendency and dispersion were inconsistent. For example, you selected a median, then a standard deviation, for "maeduc" or for "babies" (this happened a lot). Stay with the same level (e.g., ordinal or interval/ratio) for your measures of central tendency and dispersion for a particular variable. (This was also mentioned in class Monday 9-20 and Wednesday 9-22.)

Some students first misidentified the level of measurement. Next, they specified a measure of central tendency that did correspond to the true level of measurement--but not to the level the student identified. Then, they specified a measure of dispersion inconsistent with either one.

BAD EXAMPLE: First (incorrectly) designating "maeduc" as ordinal.
Then specifying the mean as the best measure. You cannot do a mean on ordinal data.
I want to know if YOU know which type of measure goes with which level of category system.
Then, designating the Index of Dispersion ("D") for the best measure of dispersion. D is a nominal measure.

Your histogram was incomplete. For example, it lacked a title, a data source, or even omitted a category.

Your histogram stated an incorrect number of valid or missing cases.

Your histogram was needlessly cluttered. In addition to the percentage or frequency scale on the "Y axis," you also labelled the bars with percents (and some people labelled percents and then added the frequencies too). KEEP IT SIMPLE to make it readable. The most common error was to place the percentages both along the side AND in the body of the histogram. Eliminate percents (or frequencies) in the body of the histogram.

You didn't mention at all what your "y axis" was or had an incorrect label. Frequencies? Percentages? (Since you had a choice, it was important to label this for your reader.)

You didn't understand why "maeduc" had an approximately normal distribution in this assignment.

You didn't realize that non-numeric variables (nominal or ordinal) cannot be normally distributed.

You didn't explain AT ALL why your variables were or were not normally distributed.

You not only used a range on nominal data, you SUBTRACTED the lowest from the highest category and produced a number for nominal or ordinal data. Neither nominal nor truly ordinal data are numeric, so you cannot do numeric operations on them. You can't subtract one category from another with ordinal data either, because the values of an ordinal variable are not numbers.

You miscalculated doing cumulative percents, either for the Inter-quartile Range or while assessing the normality of "maeduc".
 
 
PLEASE STUDY YOUR ASSIGNMENT. COMMENTS ARE ON IT AS APPROPRIATE.

 

Susan Carol Losh September 26 2004

You mean that's who we get to blame? 

Welcme to Flrida!