IntroductionAs the statistics course, the first weekly project of the course asked me to use both graphical descriptive statistics and numerical descriptive statistics to find out the relationship between the number of dogs vs.
cats in the US. (exclude Alaska and Hawaii) The project asked me to choose different variables and to analyze the data. 1) Choose the top 10 states with the largest number of households to create a Pareto chart. Pareto chart is also called a Pareto distribution diagram, are histograms drawn in order of frequency of occurrence, showing how many results are due to the identified type or category, the individual values are represented in descending order by bars, and the cumulative total is represented by the line. 1 In other word, it is showing what is the major problem that contribution the most of result.2) Choose three different variables “percentage of households with pets”, “percentage of dog owners” and for the “percentage of cat owners”. After that, I have to create relative frequency distribution histogram, cumulative frequency line plot, and Perform numerical descriptive statistics. A histogram is a type of bar plot which can show the underlying frequency distribution of a set of data.
2 In this case, the histogram show the relative frequency for each variable, which is comparing each frequency with the total count. The cumulative frequency line plot is a curve graphically showing the cumulative frequency distribution, it can show us how the process of the frequency accumulation. The numerical descriptive statistics are ways of summarizing large sets of quantitative (numerical) information. The descriptive statistics have two important parts measures of central tendency and measures of variability. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation or variance, the minimum and maximum variables, and the kurtosis and skewness. 3 After I did all the analysis, I should find out the shape of the distribution for these three variables and what are similarities or dissimilarities in the distributions of the three variables analyzed. Also, I have to determine whether there are any outliers in these data.
3) Choose the “mean number of dogs per household” and “mean number of cats” as the variables, and create a scatter plot to show the correlation between these two variables. Scatter plot is similar to the line plot, but it just shows the trend of the distribution of points. That could tell us how one variable is affected by another, which is the correlation.AnalysisIn order to create a Pareto chart for the top 10 states with the largest number of households, the first step I have to use the Sort& Filter to sort the households number in order as largest to smallest for the original data. Excel States Then I selected and copied the first 10 data and pasted to Pareto chart sheet in Excel. Since the cumulative frequency is one of an important part of the Pare chart, now I have to calculate the cumulative frequency of these data.
First, I am going to calculate the cumulative number of households. The cumulative number for each state is equal to the state’s number plus the last cumulative number. For example, Texas’ cumulative number equal to Texas’ number plus California’s cumulative number.
(12,974 + 9,002 = 21,976) You should know here, the cumulative number for California is equal to California’s number (12,976), because there is no state before California. I just need to keep adding the numbers until I got North Carolina’s cumulative number equal the sum of all ten numbers. (63,259) After I got the cumulative number of households, I can calculate the cumulative frequency.
The formula of the cumulative frequency is each point of cumulative number divided by sum of all ten numbers. (showing as the Pareto chart sheet in Excel) Now, I got all data for the Pareto chart, I selected the Location, Number of Households, and Cumulative Frequency three columns, then insert a Pareto chart. For the second part of the project, I just did the same process to three different variables “percentage of households with pets”, “percentage of dog owners” and for the “percentage of cat owners”. First, I copied each data and pasted into separated excel sheet (% households with pets, % of dog owners, and % of cat owners) Then, I choose “Data Analysis” from “Data” menu of Excel, (If “Data Analysis” is not present, it must be added in by choosing the “Add-Ins” option and checking the “Data Analysis” tool pack in the opened dialogue box.) selected the “Histogram” and click “OK”. After that, a “Histogram” dialogue box will open, then I selected the data cell B2 to B50 for the “Input Range”, and check the “Labels” option, “Chart output”, and “Output range”, then click “OK”.
Excel will come up with a two columns chart and a histogram of the frequency. In the chart Excel has divided the data into eight different classes or categories (“Bins”), and each being represented by its midpoint. For example, in “% households with pets”, there are 8 bins, each with a width of 7 units, the first “bin” consists of the percentage between 18.4 and 25.4 (the interval). This bin is represented by “21.9”, which is the midpoint of the above interval. And “frequency” column, the 1 means that only one measurement falls between 18.
4 and 25.4. Since the histogram I got is for frequency, to create a relative frequency histogram, I must first calculate the relative frequencies. I can use the formula Relative Frequency=Frequency/Total. Thus we know the frequency for each bin, we can get Total is summing of all frequency, which is 49.
So the relative frequency= Frequency/49, then we got a new column showing as Relative Frequency. Now, I just copy the “frequency histogram” and label as “relative frequency histogram”, then I change the Y- axis to “Relative Frequency” from “Frequency”. For next step, I need to create a “Cumulative Frequency line plot”, as I did before, I just keep adding each frequency until I got 49. Then I use the “Insert” to insert a line chart. I selected “bin” as the X- axis, and “Cumulative Frequency” as the Y- axis, then I got a “Cumulative Frequency line plot”. For last part, I just simply click “Data Analysis” in the “Data” menu, and choose “Descriptive Statistics”, then input the “percentage of households with pets” data, check the “Labels”, “Summary statistics” and choose an “output range”. After that, Excel will come up with a Numerical Descriptive Statistics chart for the “percentage of households with pets”. For “percentage of dog owners” and for the “percentage of cat owners”, I just repeat above steps.
For last part of the project, I copied and pasted the “mean number of dogs per household” and “mean number of cats” data into the Dogs vs Cats sheet. Then I choose the “mean number of dogs per household” as the Y- axis and “mean number of cats” as the X- axis to create a scatter plot. After that, I add the trendline to the scatter plot and display the trendline equation. ConclusionFor the Pareto chat of top 10 states with the largest number of households, the result does not show any significant few states contribute more than 80% of the households.
I think it could tell us that the population for this statistic is reasonable. (not too many households in few states) For the second part of the project, I compare the results of the three different variables “percentage of households with pets”, “percentage of dog owners” and for the “percentage of cat owners”. They both have a negatively skewed distribution and outliers. For the last part of the project, I can say that there is a positive correlation relation between “mean number of dogs per household” and “mean number of cats”, which means once “mean number of cats” goes up, the “mean number of dogs per household” also goes up. According to the analysis I did, I found that all the result we got is based on US households that only can reflect part of the relationship between Dogs and Cats, because there are many people have pets but not a household. Thus if we want the statistical results more accurate, we can just get data from US pets owner.