### Organizing and Summarizing Data Chapters 2 & 3

In Chapters 2 and 3, we look at organizing and summarizing data, both graphically and numerically.

**raw data**

data obtained from either observational studies or designed experiments, before it is organized into a meaningful form.

**frequency distribution**

lists each category of data and the number of occurrences for each category of data

**relative frequency**

the proportion (or percent) of observations within a category

**relative frequency equation**

relative frequency = frequency / sum of all frequencies

**relative frequency distribution**

Lists each category of data together with its relative frequency. The relative frequencies should sum to 1.
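
A minimal Python sketch of a frequency distribution and a relative frequency distribution. The survey data and variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: favorite colors reported by 10 survey respondents.
raw_data = ["red", "blue", "blue", "green", "red",
            "blue", "red", "blue", "green", "blue"]

# Frequency distribution: number of occurrences of each category.
frequency = Counter(raw_data)

# relative frequency = frequency / sum of all frequencies
total = sum(frequency.values())
relative_frequency = {category: count / total
                      for category, count in frequency.items()}

# Print categories in decreasing order of frequency.
for category, count in frequency.most_common():
    print(f"{category:<6} {count:>2}  {relative_frequency[category]:.2f}")
```

Listing the categories with `most_common()` puts them in decreasing order of frequency, which is also the bar order a Pareto chart uses.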

**bar graph**

Constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis. Rectangles of equal width are drawn for each category. The height of each rectangle represents the category’s frequency or relative frequency.

**Pareto chart**

a bar graph whose bars are drawn in decreasing order of frequency or relative frequency

**side-by-side bar graph**

Compares two sets of data by aligning the bars for one data set with the bars for another data set, by class. Use relative frequencies for the comparison so that differences in the sizes of the two data sets do not distort it.

**pie chart**

A circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.

**histogram**

Constructed by drawing rectangles for each class of data. The height of each rectangle is the frequency or relative frequency of the class. The width of each rectangle is the same and the rectangles touch each other.

**lower class limit**

the smallest value within a class

**upper class limit**

the largest value within a class

**class width**

the difference between consecutive lower class limits

**open ended**

a frequency table whose first class has no lower class limit, or whose last class has no upper class limit

**guidelines for determining the lower class limit of the first class and class width**

1. choose the lower class limit of the first class by choosing the smallest observation in the data set or a number slightly lower than the smallest observation in the data set

2. determine the class width by deciding on the number of classes, then compute and round up:

class width = (largest data value – smallest data value)/number of classes
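
The two guidelines above can be sketched in Python as follows. The data set and the choice of 5 classes are hypothetical, and "round up" is taken here to mean rounding up to the next whole number.

```python
import math

# Hypothetical quantitative data and a chosen number of classes.
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 49]
number_of_classes = 5

# Guideline 1: start the first class at the smallest observation.
first_lower_limit = min(data)

# Guideline 2: (largest - smallest) / number of classes, then round up.
raw_width = (max(data) - min(data)) / number_of_classes  # (49 - 12) / 5 = 7.4
class_width = math.ceil(raw_width)                       # 8

# Consecutive lower class limits differ by the class width.
lower_limits = [first_lower_limit + i * class_width
                for i in range(number_of_classes)]

# Count the observations falling in each class [low, low + width).
class_counts = [
    sum(1 for x in data if low <= x < low + class_width)
    for low in lower_limits
]
```

If the unrounded width is already a whole number, the largest observation can fall just outside the last class, so a slightly larger width or a lower starting limit may be needed.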

**stem-and-leaf plot**

a method of representing quantitative data graphically by using the digits to the left of the rightmost digit to form the stem, and the rightmost digit to form the leaf
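
A small sketch of how stems and leaves can be split, assuming nonnegative whole-number data so that the leaf is simply the units digit. The function name `stem_and_leaf` and the data are illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group nonnegative integers by stem (all digits but the last)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # leaf = rightmost digit
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical data set.
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 49]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```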

**dot plot**

graph drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed

**distribution shapes**

1. uniform distribution

2. bell-shaped distribution

3. skewed right

4. skewed left

**uniform distribution**

the frequencies are evenly spread out across the values of the variable

**bell-shaped distribution**

the highest frequency occurs in the middle and frequencies tail off to the left and right of the middle

**skewed right**

the tail to the right of the peak is longer than the tail to the left of the peak

**skewed left**

the tail to the left of the peak is longer than the tail to the right of the peak

**time-series data**

data collected on the same element for the same variable at different points in time or for different time periods

**time-series plot**

Obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis. Line segments are then drawn connecting the points.

**misleading graph**

a graph that unintentionally creates an incorrect impression

**deceptive graph**

a graph that purposely attempts to create an incorrect impression

**graphical misrepresentations of data**

1. misrepresentation of data

2. misrepresentation of data by manipulating the vertical scale

3. misleading graphs

**guidelines for constructing good graphics**

1. Title and label the graphic axes clearly. Include units of measurement and a data source.

2. Avoid distortion. Never lie about the data.

3. Minimize the white space. Clearly indicate truncated scales.

4. Avoid clutter.

5. Avoid three-dimensional graphs.

6. Do not use more than one design in the same graphic.

7. Avoid relative graphs that are devoid of data or scales.

**arithmetic mean**

computed by determining the sum of all the values of the variable in the data set and dividing by the number of observations

**population arithmetic mean**

a parameter computed using all the individuals in a population

**sample arithmetic mean**

statistic computed using sample data

**μ (mu)**

population arithmetic mean

**x̄ (x-bar)**

sample arithmetic mean

**∑ (sigma)**

the terms to be added, the sum

**population mean equation**

where $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population:

$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$

**sample mean**

where $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample:

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$

**median**

The value that lies in the middle of the data when arranged in ascending order. If the number of observations is even, the median is the average of the two middle observations.

**M**

median

**resistant numerical summary**

A numerical summary is resistant if extreme values (very large or very small relative to the data) do not affect its value substantially. The median is resistant; the mean is not.

**skewed left distribution shape**

mean is substantially smaller than median

**skewed right distribution shape**

mean is substantially larger than median

**symmetric distribution shape**

mean is roughly equal to median

**mode**

the most frequent observation of the variable that occurs in the data set

**no mode**

no observation occurs more than once

**bimodal**

a data set with two modes: two observations tie for the highest frequency

**multimodal**

a data set with three or more modes: three or more observations tie for the highest frequency
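
A short Python sketch of the three measures of center, using the standard `statistics` module (`multimode` needs Python 3.8 or later). The data are hypothetical. The second data set replaces the largest value with an extreme one to illustrate resistance.

```python
import statistics

# Hypothetical data set.
data = [1, 2, 2, 3, 4, 6]

mean = statistics.mean(data)        # 18 / 6 = 3
median = statistics.median(data)    # average of 2 and 3 = 2.5
modes = statistics.multimode(data)  # every value tied for most frequent

# Replace the largest value with an extreme observation.
data_with_outlier = [1, 2, 2, 3, 4, 60]

mean_with_outlier = statistics.mean(data_with_outlier)      # 72 / 6 = 12
median_with_outlier = statistics.median(data_with_outlier)  # still 2.5

print(mean, median, modes)
print(mean_with_outlier, median_with_outlier)
```

The mean moves from 3 to 12 while the median stays at 2.5. The extreme value also leaves the mean well above the median, as expected for a right-skewed data set.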

**dispersion**

the degree to which the data are spread out

**range**

the difference between the largest data value and the smallest data value

**R**

range

**deviation about the mean**

How far an observation is from the mean: $x_i - \mu$ for a population, $x_i - \bar{x}$ for a sample.

**sum of all deviations about the mean**

must equal zero:

$\sum (x_i - \mu) = 0$ and $\sum (x_i - \bar{x}) = 0$

**σ**

population standard deviation

**population variance**

the sum of the squared deviations about the population mean divided by the number of observations in the population

**σ²**

population variance

**population variance equation**

$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$

or

$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

**sample variance**

computed by determining the sum of the squared deviations about the sample mean and dividing this result by n-1

**s²**

sample variance

**sample variance equation**

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

**biased**

a statistic that consistently overestimates or underestimates a parameter

**degrees of freedom**

With n observations, n – 1 of the deviations about the mean are free to take any value, but the nth is then fixed, because the deviations must sum to zero. The sample variance therefore has n – 1 degrees of freedom.

**population standard deviation**

σ = √σ²

**sample standard deviation**

s = √s²
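
A worked sketch of the variance and standard deviation formulas. The same hypothetical data set is treated first as a whole population (divide by N) and then as a sample (divide by n – 1) so the two results can be compared.

```python
import math

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n                        # 40 / 8 = 5.0
deviations = [x - mean for x in data]       # deviations about the mean
sum_of_squares = sum(d ** 2 for d in deviations)  # 32.0

# Treating the data as an entire population: divide by N.
population_variance = sum_of_squares / n    # 4.0
population_sd = math.sqrt(population_variance)  # 2.0

# Treating the data as a sample: divide by n - 1.
sample_variance = sum_of_squares / (n - 1)  # 32 / 7
sample_sd = math.sqrt(sample_variance)

print(sum(deviations))  # the deviations about the mean sum to zero
print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

The standard library functions `statistics.pvariance`, `statistics.variance`, `statistics.pstdev`, and `statistics.stdev` compute the same four quantities.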

**empirical rule**

if a distribution is roughly bell shaped:

approximately 68% of the data will lie within 1 standard deviation of the mean

approximately 95% of the data will lie within 2 standard deviations of the mean

approximately 99.7% of the data will lie within 3 standard deviations of the mean
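
As an illustration, the intervals the empirical rule refers to can be computed from a mean and a standard deviation. The values 100 and 15 below, and the helper name `empirical_rule_intervals`, are hypothetical.

```python
# Approximate percentages from the empirical rule, keyed by the
# number of standard deviations from the mean.
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_rule_intervals(mean, sd):
    """Return {k: (lower, upper, approximate percent)} for k = 1, 2, 3."""
    return {
        k: (mean - k * sd, mean + k * sd, percent)
        for k, percent in EMPIRICAL_RULE.items()
    }

# Hypothetical bell-shaped distribution with mean 100 and standard deviation 15.
intervals = empirical_rule_intervals(100, 15)

for k, (lower, upper, percent) in intervals.items():
    print(f"about {percent}% of the data lie between {lower} and {upper}")
```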

**Chebyshev’s inequality**

For any data set, regardless of the shape of the distribution, at least (1 – 1/k²) · 100% of the observations will lie within k standard deviations of the mean, where k is any number greater than 1.
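
A small sketch that evaluates the bound in Chebyshev's inequality for a given k. The function name is illustrative.

```python
def chebyshev_minimum_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k ** 2) * 100

within_2 = chebyshev_minimum_percent(2)  # (1 - 1/4) * 100 = 75.0
within_3 = chebyshev_minimum_percent(3)  # (1 - 1/9) * 100, about 88.9

print(within_2, within_3)
```

For bell-shaped data the empirical rule gives much tighter figures (about 95% and 99.7%); Chebyshev's bound is weaker because it has to hold for every distribution shape.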

**class midpoint**

found by adding consecutive lower class limits and dividing the result by 2

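**approximate population mean from a frequency distribution**

$\mu = \frac{\sum x_i f_i}{\sum f_i}$

where $x_i$ is the midpoint or value of the ith class and $f_i$ is the frequency of the ith class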

**steps to approximate the mean**

1. Determine the class midpoint of each class

2. Compute the sum of the frequencies

3. Multiply the class midpoint by the frequency to obtain $x_i f_i$ for each class

4. Compute $\sum x_i f_i$

5. Calculate the mean using $\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$
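
The five steps above can be sketched in Python. The class limits and frequencies are hypothetical. The list of lower limits includes one extra entry, the lower limit the next class would have, so that every class midpoint can be computed from consecutive lower limits.

```python
# Hypothetical grouped data: classes 0-9, 10-19, 20-29, 30-39.
lower_limits = [0, 10, 20, 30, 40]  # 40 = lower limit of the class that would follow
frequencies = [2, 5, 8, 5]

# Step 1: class midpoint = (sum of consecutive lower class limits) / 2
midpoints = [(low + next_low) / 2
             for low, next_low in zip(lower_limits, lower_limits[1:])]

# Step 2: sum of the frequencies
total_frequency = sum(frequencies)

# Steps 3 and 4: multiply each midpoint by its frequency, then add the products
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))

# Step 5: approximate mean
approximate_mean = sum_xf / total_frequency

print(midpoints, total_frequency, sum_xf, approximate_mean)
```

The result is only an approximation because every observation in a class is treated as if it sat at the class midpoint.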

**approximate sample mean from a frequency distribution**

$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$

**weighted mean**

mean calculated when certain data values have a higher importance or weight associated with them

**weighted mean equation**

$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$

multiply each value of the variable by its corresponding weight, sum the products, and divide the result by the sum of the weights
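
A minimal sketch of the weighted mean, using a hypothetical grade point average in which grade points are weighted by course credits.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: grade points earned, weighted by course credits.
grade_points = [4.0, 3.0, 2.0]
credits = [3, 4, 3]

gpa = weighted_mean(grade_points, credits)  # (12 + 12 + 6) / 10 = 3.0
print(gpa)
```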

**approximate population variance from a frequency distribution**

$\sigma^2 = \frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}$

where $x_i$ is the midpoint or value of the ith class

$f_i$ is the frequency of the ith class

**approximate sample variance from a frequency distribution**

$s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}$

where $x_i$ is the midpoint or value of the ith class

$f_i$ is the frequency of the ith class
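
A sketch of both approximations for a hypothetical frequency distribution, given as class midpoints and frequencies. Note that the sample version subtracts 1 from the whole sum of the frequencies.

```python
# Hypothetical grouped data: class midpoints and class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]

total_frequency = sum(frequencies)  # 20

# Approximate mean from the frequency distribution.
approximate_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / total_frequency

# Sum of squared deviations of each midpoint, weighted by its frequency.
weighted_squares = sum((x - approximate_mean) ** 2 * f
                       for x, f in zip(midpoints, frequencies))

# Population version divides by the sum of the frequencies ...
approximate_population_variance = weighted_squares / total_frequency

# ... the sample version divides by (sum of the frequencies) - 1.
approximate_sample_variance = weighted_squares / (total_frequency - 1)

print(approximate_population_variance, approximate_sample_variance)
```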

**z-score**

The distance that a data value is from the mean, measured in standard deviations. Obtained by subtracting the mean from the data value and dividing the result by the standard deviation. A z-score is unitless; the z-scores of a data set have mean 0 and standard deviation 1.

**population z-score**

z = (x – µ)/σ

**sample z-score**

z = (x – x̄)/s
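
A short sketch that converts a hypothetical data set, treated as a population, to z-scores and checks that the z-scores have mean 0 and standard deviation 1.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical data set, treated as a population.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)       # population mean
sigma = statistics.pstdev(data)  # population standard deviation

z_scores = [z_score(x, mu, sigma) for x in data]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

A negative z-score means the value lies below the mean, and a positive z-score means it lies above.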

**kth percentile**

a value such that k percent of the observations are less than or equal to the value

**Pk**

kth percentile

**quartiles**

Percentiles that divide the data into fourths.

First quartile = 25th percentile

Second quartile = 50th percentile

Third quartile = 75th percentile

**Finding quartiles**

1. Arrange the data in ascending order

2. Determine the median, M, or second quartile, Q₂.

3. Determine the first and third quartiles, Q₁ and Q₃, by dividing the data set into two halves. Q₁ is the median of the bottom half, Q₃ is the median of the top half.
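
The three steps above can be sketched as follows. One detail is an assumption not stated in the steps: when the number of observations is odd, this sketch leaves the median out of both halves. The function name and data are illustrative.

```python
import statistics

def quartiles(data):
    """Return (Q1, M, Q3) by the median-of-halves method (needs 2+ values)."""
    ordered = sorted(data)  # Step 1: ascending order
    n = len(ordered)
    half = n // 2

    median = statistics.median(ordered)  # Step 2: M = Q2

    # Step 3: split into bottom and top halves.
    # Assumption: for odd n the median itself belongs to neither half.
    bottom_half = ordered[:half]
    top_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]

    return statistics.median(bottom_half), median, statistics.median(top_half)

# Hypothetical data sets with an even and an odd number of observations.
even_result = quartiles([1, 3, 5, 7, 9, 11, 13, 15])
odd_result = quartiles([13, 1, 5, 3, 9, 7, 11])

print(even_result, odd_result)
```

Software packages often use interpolation-based definitions of quartiles, so they may report slightly different values than this hand method.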

**interquartile range**

the range of the middle 50% of the observations in a data set; the difference between the first and third quartiles

**interquartile range equation**

IQR = Q₃ – Q₁

**describe the distribution**

1. describe the shape (skewed left, skewed right, or symmetric)

2. describe the center (mean or median)

3. describe the spread (standard deviation or interquartile range)

**outliers**

extreme observations in the data set

**checking for outliers using quartiles**

1. Determine the first and third quartiles of the data

2. Compute the interquartile range

3. Determine the lower and upper fences

4. If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier

**fences**

cutoff points for determining outliers

Lower fence = Q₁ – 1.5(IQR)

Upper fence = Q₃ + 1.5(IQR)
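
A minimal sketch of the fence calculation and the outlier check. The data are hypothetical, and the quartiles are supplied directly (they are the median-of-halves quartiles of this data set).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Hypothetical data set with Q1 = 4.5 and Q3 = 8.5, so IQR = 4.
data = [2, 4, 5, 6, 7, 8, 9, 30]
q1, q3 = 4.5, 8.5

lower_fence, upper_fence = fences(q1, q3)  # 4.5 - 6 and 8.5 + 6
outliers = find_outliers(data, q1, q3)

print(lower_fence, upper_fence, outliers)
```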

**exploratory data analysis**

examination of data in order to describe their main features using statistical tools and ideas

**five-number summary**

MINIMUM Q₁ M Q₃ MAXIMUM

**Constructing a boxplot**

1. Determine the lower and upper fences

2. Draw vertical lines at Q₁, M, and Q₃. Enclose those vertical lines in a box.

3. Label the lower and upper fences.

4a. Draw a line (whisker) from Q₁ to the smallest data value that is larger than the lower fence.

4b. Draw a line (whisker) from Q₃ to the largest data value that is smaller than the upper fence.

5. Mark any data value less than the lower fence or greater than the upper fence (outliers) with an asterisk.
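
A sketch that computes the numbers needed to draw a boxplot by hand; it does not draw anything. Two assumptions are made: quartiles come from the median-of-halves method with the median left out of both halves when n is odd, and a value exactly equal to a fence is treated as a non-outlier that a whisker may reach. The function name and data are hypothetical.

```python
import statistics

def boxplot_elements(data):
    """Return quartiles, fences, whisker ends, and outliers (needs 2+ values)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2

    # Assumption: for odd n the median is left out of both halves.
    q1 = statistics.median(ordered[:half])
    m = statistics.median(ordered)
    q3 = statistics.median(ordered[half:] if n % 2 == 0 else ordered[half + 1:])

    # Step 1: fences.
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Steps 4a and 4b: whiskers end at the most extreme non-outliers.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]

    # Step 5: outliers lie beyond the fences.
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

    return {
        "five_number_summary": (ordered[0], q1, m, q3, ordered[-1]),
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical data set.
elements = boxplot_elements([2, 4, 5, 6, 7, 8, 9, 30])
print(elements)
```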