Statistics Individual Data Distribution
This lesson covers statistics individual data distribution and population observation.
statistics is the science of collecting, organizing, summarizing, and analyzing information to draw a conclusion and answer questions. in addition, statistics is about providing a measure of confidence in any conclusions
a person or object that is a member of the population being studied
consists of organizing and summarizing information collected
uses methods that generalize results obtained from a sample to the population and measure the reliability of the results
a numerical summary of a sample
a numerical summary of a population
the characteristics of the individuals of the population being studied
a sample of seniors is selected and it is found that 45% own a television
this is a statistic because the value is a numerical measurement describing a characteristic of a sample
the average annual salary of 50 of a company’s 800 employees is $54,000
this is a statistic, because the data set of salaries of 50 employees is a sample
nation of origin
the variable is qualitative because it is an attribute characteristic
medal won in race
the variable is qualitative because it is an attribute characteristic
area of a park
the variable is continuous because it is not countable
height of an office building
the variable is continuous because it is not countable
a polling organization contacts 2526 undergraduates who attend a university and live in the United States and asks whether or not they had spent more than $200 on food in the last month
population: undergraduates who attend a university and live in the united states
sample: the 2526 undergraduates who attend a university and live in the united states
setup- a, b, c, d, e
size- 48, 40, 59, 41, 43
screen type- plasma, projection, projection, plasma, projection
number of channels available- 299, 111, 425, 270, 290
individuals being studied: the high-definition television setups a through e
variables and their corresponding data being studied: size (48, 40, 59, 41, 43), screen type (plasma, projection, projection, plasma, projection), and number of channels available (299, 111, 425, 270, 290)
a study conducted by researchers was designed “to determine if the application of duct tape is as effective as cryotherapy in the treatment of common warts.” the researchers randomly divided 50 patients into two groups. the 25 patients in group 1 had their warts treated by applying duct tape. the 25 patients in group 2 had their warts treated by cryotherapy. once the treatments were complete, it was determined that 66% of the patients in group 1 and 86% of the patients in group 2 had complete resolution of their warts. the researchers concluded that cryotherapy is significantly more effective in treating warts than duct tape.
research objective: to determine if duct tape is as effective as cryotherapy in treating warts
sample: the 50 patients with warts
measures the value of the response variable without attempting to influence the value of either the response or explanatory variables. that is, in an observational study, the researcher observes the behavior of the individuals without trying to influence the outcome of the study
if a researcher assigns the individuals in a study to a certain group, intentionally changes the value of an explanatory variable, and then records the value of the response variable for each group
occurs when the effects of two or more explanatory variables are not separated. therefore, any relation that may exist between an explanatory variable and the response variable may be due to some other variable or variables not accounted for in the study
an explanatory variable that was not considered in a study, but that affects the value of the response variable in the study. in addition, lurking variables are typically related to explanatory variables considered in the study
three major categories of observational studies
1. cross-sectional studies: collect information about individuals at a specific point in time or over a very short period of time
2. case-control studies: retrospective, meaning that they require individuals to look back in time or require the researcher to look at existing records
3. cohort studies: first identify a group of individuals to participate in the study (the cohort), then observe them over a long period of time
a list of all individuals in a population along with certain characteristics of each individual
a study is conducted to determine if there is a relationship between Parkinson’s disease and childhood head trauma. doctors look at the hospital records for patients with parkinson’s disease for any childhood head trauma
the study is an observational study because the study examines individuals in a sample, but does not try to influence the response variable
while shopping, 350 people are asked to perform a taste test in which they drink two randomly placed, unmarked coffees. they are then asked which coffee they prefer
the study is an experiment because the researcher intentionally applies the treatments (the two unmarked coffees each person tastes) and then records the response (which coffee is preferred)
researchers wanted to determine if having a tv in the bedroom is associated with obesity. the researchers administered a questionnaire to 380 twelve-year-old adolescents. after analyzing the results, researchers determined that the body mass index of the adolescents who had a tv in their bedroom was significantly higher than that of the adolescents who did not have a tv in their bedroom
this is an observational study because the researchers observe the behavior of the individuals in the study without trying to influence an explanatory variable of the study
the response variable is the body mass index of the adolescents
the explanatory variable is whether the adolescent has a tv in the bedroom or not
possible lurking variables might be eating habits and the amount of exercise per week
“these results remain significant after adjustment for socioeconomic status” means that the researchers made an effort to avoid confounding by accounting for potential lurking variables
a television in the bedroom and obesity are associated because the body mass index of the adolescents who had a tv in their bedroom was significantly higher than that of the adolescents who did not have a tv in their bedroom
which sampling method does not require a frame?
obtained by dividing the population into groups and selecting all individuals within a random sample of the groups
obtained by dividing the population into homogeneous groups and randomly selecting individuals from each group
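The two definitions above are easy to mix up, so here is a minimal sketch contrasting them. The population, the group labels, and the function names `stratified_sample` and `cluster_sample` are all hypothetical, chosen only for illustration.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Randomly select `per_stratum` individuals from EVERY group."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select `n_clusters` whole groups and keep ALL their individuals."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

# Hypothetical population: 3 groups (A, B, C) of 10 labelled individuals each.
groups = {g: [f"{g}{i}" for i in range(10)] for g in "ABC"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

strat = stratified_sample(groups, 2, rng)  # 2 individuals from each of the 3 groups
clust = cluster_sample(groups, 1, rng)     # every individual from 1 random group
print(len(strat), len(clust))              # 6 10
```

A stratified sample touches every group; a cluster sample touches only the selected groups but takes everyone in them.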
when taking a systematic random sample of size n, every group of size n from the population has the same chance of being selected
false, because certain groups would never be selected
a simple random sample is always preferred because it obtains the same information as other sampling plans but requires a smaller sample size
false, because other sampling techniques may provide more information for less cost than a simple random sample
when conducting a cluster sample, it is better to have fewer clusters with more individuals when the clusters are heterogeneous
true, because when the clusters are heterogeneous, they are scaled down versions of the population
inferences based on voluntary response samples are generally not reliable
true, because it is often the case that the individuals who volunteer do not accurately represent the population
when obtaining a stratified sample, the number of individuals included within each stratum must be equal
false. within stratified samples, the number of individuals sampled from each stratum should be proportional to the size of the strata in the population
to estimate the percentage of defects in a recent manufacturing batch, a quality control manager at IBM selects every 14th computer that comes off the assembly line starting with the fourth until she obtains a sample of 30 computers
to determine customer opinion of their pricing, greyhound lines randomly selects 60 busses during a certain week and surveys all passengers on the busses
a salesperson obtained a systematic sample of size 30 from a list of 600 clients. to do so, he randomly selected a number from 1 to 20, obtaining the number 12. he included in the sample the 12th client on the list and every 20th client thereafter. list the numbers that correspond to the 30 clients selected
12, 32, …, 592
the human resource department at a certain company wants to conduct a survey regarding worker benefits. the department has an alphabetical list of all 7358 employees at the company and wants to conduct a systematic sample of size 70.
k = 105 (7358/70 ≈ 105.11, rounded down to the nearest integer)
determine the individuals who will be administered the survey. randomly select a number from 1 to k. suppose that we randomly select 4. starting with the first individual selected, the individuals in the survey will be 4, 109, …, 7249
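The systematic-sampling exercises follow one recipe: compute k as N/n rounded down, pick a random start between 1 and k, then take every kth individual. A minimal sketch, where the function name `systematic_sample` is an assumption made for illustration:

```python
def systematic_sample(population_size, sample_size, start, k=None):
    """Return the list positions chosen by a systematic sample.

    If k is not supplied, it is taken as population_size // sample_size
    (the ratio N/n rounded down).
    """
    if k is None:
        k = population_size // sample_size
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and k")
    return [start + k * i for i in range(sample_size)]

employees = systematic_sample(7358, 70, start=4)  # k = 7358 // 70 = 105
clients = systematic_sample(600, 30, start=12)    # k = 600 // 30 = 20
print(employees[:2], employees[-1])  # [4, 109] 7249
print(clients[:2], clients[-1])      # [12, 32] 592
```

The last selected position is always start + k(n - 1), which is how 7249 and 592 are obtained.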
what does it mean when a part of the population is under-represented?
a part of the population is under-represented when it is proportionally smaller in a sample than in its population
the owner of a shopping mall wishes to expand the number of shops available in the food court. he has a market researcher survey the first 110 customers who come into the food court during weekday afternoons to determine what types of food the shoppers would like to see added to the food court
cause of bias: sampling bias
best way to remedy this problem: ask customers throughout the day on both weekdays and weekends
a pro-life advocate wants to estimate the percentage of people who favor closing abortion clinics. she conducts a nationwide survey of 1980 randomly selected adults 18 years and older. the interviewer asks the respondents, “do you favor protecting unborn children by closing abortion clinics?”
a polling organization conducts a study to estimate the percentage of households that homeschool their children. it mails a questionnaire to 1958 randomly selected households across the United States and asks the head of each household whether the household homeschools its children. of the 1958 households selected, 18 responded.
a polling organization conducts a study to estimate the percentage of households that speak a foreign language as the primary language. it mails a questionnaire to 1,023 randomly selected households and asks the head of household if a foreign language is the primary language spoken at home. of the 1,023 households selected, 12 responded. this survey has bias.
possible remedy: conduct face-to-face or telephone interviews
to determine the public’s opinion of the police department, the police chief obtains a cluster sample of 15 census tracts within his jurisdiction and samples all households in the randomly selected tracts. uniformed police officers go door to door to conduct the survey
possible remedy: have the survey conducted by interviewers who are not uniformed police officers
surveys tend to suffer from low response rates. based on past experience, a researcher determines that the typical response rate for an email survey is 40%. she wishes to obtain a sample of 400 respondents, so she emails the survey to 2000 randomly selected email addresses. assuming the response rate for her survey is 40%, will respondents form an unbiased sample?
no. the survey still suffers from undercoverage (sampling bias), nonresponse bias, and potentially response bias
what are some solutions to nonresponse?
offer rewards and incentives, attempt callbacks
what are the advantages of having a presurvey with open questions to assist in constructing a questionnaire that has closed questions?
the researcher can learn common answers
a person, object, or some other well-defined item upon which a treatment is applied
any combination of the values of the factors (explanatory variables)
the quantitative or qualitative variable for which the experimenter wishes to determine how its value is affected by the explanatory variable
a variable whose effect on the response variable is to be assessed by the experimenter
an innocuous medication, such as a sugar tablet, that looks, tastes, and smells like the experimental medication
the effects of two factors (explanatory variables) on the response variable cannot be distinguished
grouping together similar experimental units and then randomly assigning the experimental units within each group to a treatment
generally the goal of an experiment is to determine the effect that the treatment will have on the response variable
a school psychologist wants to test the effectiveness of a new method of teaching statistics. she recruits 200 second-grade students and randomly divides them into two groups. group 1 is taught by means of the new method, while group 2 is taught via traditional methods. the same teacher is assigned to both groups. at the end of the year, an achievement test is administered and the results of the two groups compared
response variable: the score on the achievement test
explanatory variable manipulated: method of teaching
2 levels of treatment
type of experimental design: completely randomized design
subjects: 200 students
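In a completely randomized design, every subject is assigned to a treatment purely by chance. A minimal sketch of that assignment for the 200 students; the function name `completely_randomized` and the student ID numbers are assumptions for illustration.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

students = list(range(1, 201))  # hypothetical ID numbers 1..200
groups = completely_randomized(
    students, ["new method", "traditional"], random.Random(1)
)
print({name: len(members) for name, members in groups.items()})
# {'new method': 100, 'traditional': 100}
```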
researchers wanted to evaluate whether a certain herb improved memory in elderly adults as measured by objective tests. to do this, they recruited 98 men and 125 women older than 65 years and in good health. participants were randomly assigned to receive the herb, 45 mg 3 times a day, or a matching placebo. a measure of memory improvement was determined by a standardized test of learning and memory
type of experimental design: completely randomized design
population being studied: adults older than 65 years and in good health
response variable: score on standardized test of learning and memory
what is the factor? the herb
treatments: 45 mg 3 times a day or a matching placebo
experimental units: the 98 men and 125 women older than 65 and in good health who participated in the study
a marketing research firm wishes to determine the most effective method of promoting a rock band: print, radio, television, or online. the researcher segments volunteers by their ages. of the 490 volunteers, 140 are under 20 years old, 70 are 20-39 years old, 140 are 40-59 years old, and 140 are 60 years old or older. the volunteers from each group are randomly assigned to either the print advertising group, the radio group, the television group, or the online group. each group is exposed to the advertising. after 1 hour, a recall exam is given with the proportion of correct answers recorded.
randomized block design
response variable: the scores on the recall exam
explanatory variable manipulated: type of advertising
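A randomized block design randomizes separately inside each block (here, each age group). The sketch below uses the block sizes from the exercise; the function name `randomized_block` and the volunteer labels are hypothetical.

```python
import random
from collections import Counter

def randomized_block(blocks, treatments, rng):
    """Within each block, shuffle the units and deal them out to the treatments."""
    assignment = {}
    for block, units in blocks.items():
        shuffled = units[:]
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment

sizes = {"under 20": 140, "20-39": 70, "40-59": 140, "60 or older": 140}
blocks = {b: [f"{b} #{i}" for i in range(n)] for b, n in sizes.items()}
treatments = ["print", "radio", "television", "online"]
plan = randomized_block(blocks, treatments, random.Random(2))

# Number of volunteers per (age group, advertising type) pair.
cell_counts = Counter(plan.values())
print(len(plan), cell_counts[("under 20", "print")])  # 490 35
```

Because randomization happens inside each age group, every advertising type is seen by a near-equal share of each group, so age cannot be confounded with the treatment.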
researchers wish to know if there is a link between hypertension (high blood pressure) and consumption of salt. past studies have indicated that the consumption of fruits and vegetables offsets the negative impact of salt consumption. it is also known that there is quite a bit of person-to-person variability as far as the ability of the body to process and eliminate salt. however, no method exists for identifying individuals who have a higher ability to process salt. it is recommended that daily intake of salt should not exceed 2300 milligrams (mg). the researchers want to keep the design simple, so they choose to conduct their study using a completely randomized design.
response variable: blood pressure
three factors that have been identified: daily consumption of fruits and vegetables, daily consumption of salt, body’s ability to process salt
blood pressure- not a factor
daily consumption of salt- can be controlled
daily consumption of fruits and vegetables- can be controlled
body’s ability to process salt- cannot be controlled
age- not a factor
gender- not a factor
if a factor cannot be controlled, what should be done to reduce variability in the response variable? experimental units should be randomized to each treatment group
to determine customer opinion of their safety features, DaimlerChrysler randomly selects 120 service centers during a certain week and surveys all customers visiting the service centers
the manager of a shopping mall wishes to expand the number of shops available in the food court. he has a market researcher survey the first 120 customers who come into the food court during weekend evenings to determine what types of food the shoppers would like to see added to the food court
cause of bias: sampling bias
best way to remedy this problem: ask customers throughout the day on both weekdays and weekends
a polling organization conducts a study to estimate the percentage of households that have two incomes. it mails a questionnaire to 1841 randomly selected households across the united states and asks the head of each household if the household has two incomes. of the 1841 households selected, 42 responded.
a salesperson obtained a systematic sample of size 25 from a list of 500 clients. to do so, he randomly selected a number from 1 to 20, obtaining the number 13. he included in the sample the 13th client on the list and every 20th client thereafter. list the numbers that correspond to the 25 clients selected.
13, 33, …, 493
lists the number of occurrences of each category of data
relative frequency distribution
lists the proportion of occurrences of each category of data
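As a small illustration of both distributions, the sketch below tallies the screen-type data from the television table earlier in these notes. Using `collections.Counter` is just one convenient way to do the counting.

```python
from collections import Counter

# Screen types of televisions a through e, from the table in these notes.
screen_types = ["plasma", "projection", "projection", "plasma", "projection"]

frequency = Counter(screen_types)  # occurrences of each category
total = sum(frequency.values())
relative_frequency = {
    category: count / total for category, count in frequency.items()
}

print(dict(frequency))      # {'plasma': 2, 'projection': 3}
print(relative_frequency)   # {'plasma': 0.4, 'projection': 0.6}
```

The relative frequencies always sum to 1.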
a horizontal or vertical representation of the frequency or relative frequency of the categories. the height of each rectangle represents the category’s frequency or relative frequency
a bar graph whose bars are drawn in decreasing order of frequency or relative frequency
the categories by which data are grouped
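stem-and-leaf plots are particularly useful for large sets of data
false, because stem-and-leaf plots work best for small data sets; with a large data set the display becomes unwieldy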
a histogram of a set of data indicates that the distribution of the data is skewed right. which measure of central tendency will likely be larger, the mean or the median? why?
the mean will likely be larger because the extreme values in the right tail tend to pull the mean in the direction of the tail
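A quick check of this claim on a made-up, right-skewed data set (the values are hypothetical and chosen only so that one observation sits far out in the right tail):

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is far to the right.
values = [1, 2, 2, 3, 3, 4, 20]

data_mean = mean(values)      # 35 / 7 = 5
data_median = median(values)  # middle of the 7 ordered values = 3
print(data_mean > data_median)  # True
```

Replacing 20 with 5 would leave the median at 3 but drop the mean to about 2.86, which shows how strongly the tail value pulls the mean.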
a data set will always have exactly one mode
false, because a data set can have no mode, one mode, or more than one mode
for a large sporting event the broadcasters sold 51 ad slots for a total revenue of $135 million. what was the mean price per ad slot?
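The mean here is simply total revenue divided by the number of slots. A minimal sketch of that arithmetic (variable names are illustrative):

```python
total_revenue_millions = 135  # total revenue, in millions of dollars
ad_slots = 51

mean_price_millions = total_revenue_millions / ad_slots
print(f"${mean_price_millions:.2f} million per slot")  # $2.65 million per slot
```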
the median for the given set of six ordered data values is 29.5
an insurance company crashed four cars of the same model at 5 mph. the costs of repair for each of the four crashes were 411, 443, 468, and 232. compute the mean, median, and mode cost of repair.
mode does not exist
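The three measures for the repair costs can be checked with a short sketch. The mode rule used here (the most frequent value or values, but only if something repeats) is the usual textbook convention and is an assumption about what these notes intend.

```python
from collections import Counter
from statistics import mean, median

costs = [411, 443, 468, 232]  # repair costs from the exercise

counts = Counter(costs)
top = max(counts.values())
# A mode exists only if at least one value occurs more than once.
modes = [value for value, count in counts.items() if count == top] if top > 1 else []

print(mean(costs))    # 388.5
print(median(costs))  # 427.0  (average of 411 and 443)
print(modes or "no mode")  # no mode
```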
which measure of central tendency best describes the “center” of the distribution?
the sum of the deviations about the mean always equals zero
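A short derivation of why this sum is zero for any data set, written with the sample mean \bar{x} (the same steps hold for a population mean):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

This is why the variance squares the deviations before adding them: the unsquared deviations always cancel.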
complete the paragraph
the standard deviation is used in conjunction with the mean to numerically describe distributions that are bell shaped. the mean measures the center of the distribution, while the standard deviation measures the spread of the distribution
when comparing two populations, the larger the standard deviation, the more dispersion the distribution has, provided that the variable of interest from the two populations has the same unit of measure
true, because the standard deviation describes how far, on average, each observation is from the typical value. a larger standard deviation means that observations are more distant from the typical value, and therefore, more dispersed.
chebyshev’s inequality applies to all distributions regardless of shape, but the empirical rule holds only for distributions that are bell shaped
true, chebyshev’s inequality is less precise than the empirical rule, but will work for any distribution, while the empirical rule only works for bell-shaped distributions
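To see how much less precise Chebyshev's inequality is, the sketch below compares its guaranteed minimum, (1 - 1/k²) × 100%, with the empirical rule's approximate percentages (68%, 95%, 99.7%). The function name is illustrative.

```python
def chebyshev_minimum_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100

# Approximate percentages from the empirical rule (bell-shaped distributions only).
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}

for k in (2, 3):
    print(k, round(chebyshev_minimum_percent(k), 1), EMPIRICAL_PERCENT[k])
# 2 75.0 95.0
# 3 88.9 99.7
```

Chebyshev gives a floor that holds for every distribution; the empirical rule gives a closer estimate, but only for bell-shaped data.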
find the sample variance and standard deviation: 23, 13, 6, 10, 9
find the population variance and standard deviation: 8, 11, 15, 17, 19
population variance: 16
standard deviation: 4
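The only difference between the sample and population formulas is the divisor: n - 1 for a sample, N for a population. A minimal sketch applied to both data sets above (the helper name `variance` and its `sample` flag are assumptions for illustration):

```python
from math import sqrt

def variance(data, sample):
    """Sum of squared deviations divided by n - 1 (sample) or n (population)."""
    m = sum(data) / len(data)
    squared_deviations = sum((x - m) ** 2 for x in data)
    divisor = len(data) - 1 if sample else len(data)
    return squared_deviations / divisor

sample_var = variance([23, 13, 6, 10, 9], sample=True)
population_var = variance([8, 11, 15, 17, 19], sample=False)

print(round(sample_var, 1), round(sqrt(sample_var), 2))  # 42.7 6.53
print(population_var, sqrt(population_var))              # 16.0 4.0
```

For the sample data the mean is 12.2 and the squared deviations sum to 170.8, so the sample variance is 170.8/4 = 42.7.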
compute the range and sample standard deviation for strength of the concrete (in psi): 3970, 4140, 3400, 3200, 2910, 3840, 4140, 4040
the range is 1230 psi
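Both requested values can be checked with Python's standard library; `statistics.stdev` uses the sample (n - 1) formula.

```python
from statistics import stdev

# Concrete strengths in psi, from the exercise.
strengths = [3970, 4140, 3400, 3200, 2910, 3840, 4140, 4040]

data_range = max(strengths) - min(strengths)  # 4140 - 2910
sample_sd = stdev(strengths)                  # divides by n - 1

print(data_range, round(sample_sd, 1))  # 1230 472.0
```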
the weight of an organ in adult males has a bell-shaped distribution with a mean of 300 grams and a standard deviation of 35 grams. use the empirical rule to determine the following
(a) about 95% of organs will be between what weights?
(b) what percentage of organs weighs between 265 grams and 335 grams?
(c) what percentage of organs weighs less than 265 grams or more than 335 grams?
(d) what percentage of organs weighs between 195 grams and 370 grams?
(a) 230 and 370 grams
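All four parts follow from the empirical rule's approximate percentages (68%, 95%, 99.7% within 1, 2, and 3 standard deviations). A minimal sketch of the arithmetic, with illustrative variable names:

```python
mean_weight, sd = 300, 35  # grams

# Approximate percent of observations within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

# (a) about 95% lie within 2 standard deviations
a = (mean_weight - 2 * sd, mean_weight + 2 * sd)  # (230, 370)
# (b) 265 and 335 are 1 standard deviation either side of the mean
b = WITHIN[1]                                     # 68%
# (c) the complement of (b)
c = 100 - WITHIN[1]                               # 32%
# (d) 195 is 3 sd below the mean and 370 is 2 sd above it,
#     so take half of the 3-sd band plus half of the 2-sd band
d = WITHIN[3] / 2 + WITHIN[2] / 2                 # 49.85% + 47.5% = 97.35%

print(a, b, c, round(d, 2))
```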
what makes the range less desirable than the standard deviation as a measure of dispersion?
the range does not use all the observations