R Handbook: Descriptive Statistics (2025)

Packages used in this chapter

The packages used in this chapter include:

• psych

• DescTools

• Rmisc

• FSA

• plyr

• boot

The following commands will install these packages if they are not already installed:

if(!require(psych)){install.packages("psych")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(Rmisc)){install.packages("Rmisc")}
if(!require(FSA)){install.packages("FSA")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}

Descriptive statistics

Descriptive statistics are used to summarize data in a way that provides insight into the information contained in the data. This might include examining the mean or median of numeric data or the frequency of observations for nominal data. Plots can be created that show the data and indicate summary statistics.

Choosing which summary statistics are appropriate depends on the type of variable being examined. Different statistics are usually used for interval/ratio, ordinal, and nominal data.

In describing or examining data, you will typically be concerned with measures of location, variation, and shape.

Location is also called central tendency. It is a measure of the magnitude of the values of the data. For example, are the values close to 10 or 100 or 1000? Measures of location include mean and median, as well as somewhat less common statistics like M-estimators or Winsorized means.

Variation is also called dispersion. It is a measure of how far the data points lie from one another. Common statistics include standard deviation and coefficient of variation. For data that aren’t symmetrically distributed, percentiles, interquartile range, or median absolute deviation might be used.

Shape refers to the distribution of values. The best tools to evaluate the shape of data are histograms and related plots. Statistics include skewness and kurtosis, though they are often less useful than visual inspection. We can describe data shape as normally distributed, log-normal, uniform, skewed, bi-modal, and others.

Descriptive statistics for interval/ratio data

For this example, imagine that Ren and Stimpy have each held eight workshops educating the public about water conservation at home. They are interested in how many people showed up to the workshops.

Because the data are housed in a data frame, we can use the convention Data$Attendees to access the variable Attendees within the data frame Data.

Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South 14
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North 15
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
")

Data ### Will output the data frame called Data

str(Data) ### Shows the structure of the data frame

summary(Data) ### Summarizes variables in the data frame

Functions sum and length

The sum of a variable can be found with the sum function, and the number of observations can be found with the length function.


sum(Data$Attendees)


232


length(Data$Attendees)

16

Statistics of location for interval/ratio data

Mean

The mean is the arithmetic average, and is a common statistic used with interval/ratio data. It is simply the sum of the values divided by the number of values. The mean function in R will return the mean.


sum(Data$Attendees) / length(Data$Attendees)


14.5


mean(Data$Attendees)

14.5

Caution should be used when reporting mean values with skewed data, as the mean may not be representative of the center of the data. For example, imagine a town with 10 families, nine of whom have an income of less than $50,000 per year, but with one family with an income of $2,000,000 per year. The mean income for families in the town would be $233,300, but this may not be a reasonable way to summarize the income of the town.

Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000)

mean(Income)

233300
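One way to see how strongly the single large income pulls the mean is to compare it with a trimmed mean, which drops a fraction of the lowest and highest values before averaging. The following is a minimal sketch using the trim argument of the base mean function. The choice of trim = 0.1 and the names Mean.all and Mean.trimmed are illustrative assumptions only.

```r
Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000)

Mean.all = mean(Income)                   ### Ordinary mean, pulled upward by the largest value

Mean.trimmed = mean(Income, trim = 0.1)   ### Drops the lowest 10% and highest 10% of values first

Mean.all

Mean.trimmed
```

With ten observations, trim = 0.1 discards one value from each end, so the trimmed mean here is the average of the eight middle incomes.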

Median

The median is defined as the value below which 50% of the observations are found. To find this value manually, you would order the observations, and separate the lowest 50% from the highest 50%. For data sets with an odd number of observations, the median is the middle value. For data sets with an even number of observations, by convention, the median falls half-way between the two middle values.

The median is a robust statistic in that it is not affected by adding extreme values. For example, if we changed Stimpy’s last Attendees value from 23 to 1000, it would not affect the median.


median(Data$Attendees)

15

Attendees.2 = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11,19, 1000)

Attendees.2

7 22 6 15 12 13 14 16 18 17 15 9 15 11 19 1000

median(Attendees.2)

15

The median is appropriate for either skewed or unskewed data. The median income for the town discussed above is $40,500. Half the families in the town have an income above this amount, and half have an income below this amount.

Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000,36000, 2000000)

median(Income)

40500

Note that medians are sometimes reported as the “average person” or “typical family”. Saying, “The average American family earned $54,000 last year” means that the median income for families was $54,000. The “average family” is that one with the median income.

Mode

The mode is a summary statistic that is used rarely in practice, but is often included in a discussion of means and medians. When there are discrete values for a variable, the mode is simply the value which occurs most frequently. For example, in the Statistics Learning Center video in the Required Readings below, Dr. Nic gives an example of counting the number of pairs of shoes each student owns. The most common answer was 10, and therefore 10 is the mode for that data set.

For our Ren and Stimpy example, the value 15 occurs three times and so is the mode.

The Mode function can be found in the package DescTools.

library(DescTools)

Mode(Data$Attendees)

15
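If DescTools is not available, the mode can also be found by counting values with the base table function. This is only a sketch of one possible approach, and the names Counts and Modes are assumptions made for illustration.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

Counts = table(Attendees)     ### Frequency of each distinct value

### Keep the value or values that have the highest frequency
Modes = as.numeric(names(Counts)[as.vector(Counts) == max(Counts)])

Modes
```

Because every value with the maximum count is kept, this approach returns more than one value if the data have tied modes.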

Statistics of variation for interval/ratio data

Standard deviation

The standard deviation is a measure of variation which is commonly used with interval/ratio data. It’s a measurement of how close the observations in the data set are to the mean.

There’s a handy rule of thumb that, for normally distributed data, 68% of data points fall within the mean ± 1 standard deviation, 95% of data points fall within the mean ± 2 standard deviations, and 99.7% of data points fall within the mean ± 3 standard deviations.
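The percentages in this rule of thumb can be checked against the standard normal distribution with the base pnorm function, which returns the proportion of a normal distribution falling below a given value. The names Within.1, Within.2, and Within.3 are chosen here only for illustration.

```r
Within.1 = pnorm(1) - pnorm(-1)   ### Proportion within 1 standard deviation of the mean
Within.2 = pnorm(2) - pnorm(-2)   ### Proportion within 2 standard deviations
Within.3 = pnorm(3) - pnorm(-3)   ### Proportion within 3 standard deviations

round(c(Within.1, Within.2, Within.3), 3)
```

The results are approximately 0.683, 0.954, and 0.997, which round to the 68%, 95%, and 99.7% in the rule.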

Because the mean is often represented with the letter mu, and the standard deviation is represented with the letter sigma, saying someone is “a few sigmas away from mu” indicates they are rather a rare character. (I originally heard this joke on an episode of Car Talk, for which I cannot find a reference or transcript.)


sd(Data$Attendees)


4.830459

Standard deviation may not be as useful for skewed data as it is for symmetric or normal data.
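To make the calculation behind sd concrete, the following sketch computes the sample standard deviation by hand: the squared deviations from the mean are summed, divided by n - 1, and the square root is taken. The names Deviations and SD.manual are illustrative.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

n = length(Attendees)

Deviations = Attendees - mean(Attendees)         ### Distance of each value from the mean

SD.manual = sqrt(sum(Deviations^2) / (n - 1))    ### Sample standard deviation

SD.manual

sd(Attendees)                                    ### Should agree with the manual result
```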

Median absolute deviation

The median absolute deviation is also a measure of variation used with interval/ratio data. It’s useful to measure the dispersion of the data when the distribution of the data is not symmetric or not known. By default, the mad() function in R multiplies the median absolute deviation by a constant (1.4826) so that, for normally distributed data, the result is comparable to the standard deviation. For an unadjusted calculation, the constant = 1 option could be used.


mad(Data$Attendees)

4.4478

mad(Data$Attendees, constant=1)

3
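The two results above can be reproduced by hand, which may help clarify what the adjustment does. This is a minimal sketch, and the name MAD.raw is an assumption for illustration.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

### Median of the absolute distances from the median
MAD.raw = median(abs(Attendees - median(Attendees)))

MAD.raw              ### Same as mad(Attendees, constant=1)

MAD.raw * 1.4826     ### Same as the default mad(Attendees)
```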

Standard error of the mean

Standard error of the mean is a measure that estimates how close a calculated mean from a sample is likely to be to the true mean of that population. It is commonly used in tables or plots where multiple means are presented together. For example, we might want to present the mean attendees for Ren with the standard error for that mean and the mean attendees for Stimpy with the standard error for that mean.

The standard error is the standard deviation divided by the square root of the number of observations. It can also be found in the output for the describe function in the psych package, labelled se.

sd(Data$Attendees) /
sqrt(length(Data$Attendees))

1.207615

library(psych)

describe(Data$Attendees)

vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 16 14.5 4.83 15 14.5 4.45 6 23 17 -0.04 -0.88 1.21

### se indicates the standard error of the mean

Five-number summary, quartiles, percentiles

The median is the same as the 50th percentile, because 50% of values fall below this value. Other percentiles for a data set can be identified to provide more information. Typically, the 0th, 25th, 50th, 75th, and 100th percentiles are reported. This is sometimes called the five-number summary.

These values can also be called the minimum, 1st quartile, 2nd quartile, 3rd quartile, and maximum.

The five-number summary is a useful measure of variation for skewed interval/ratio data or for ordinal data. 25% of values fall below the 1st quartile and 25% of values fall above the 3rd quartile. This leaves the middle 50% of values between the 1st and 3rd quartiles, giving a sense of the range of the middle half of the data. This range is called the interquartile range (IQR).

Percentiles and quartiles are relatively robust, as they aren’t affected much by a few extreme values. They are appropriate for both skewed and unskewed data.


summary(Data$Attendees)

Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 11.75 15.00 14.50 17.25 23.00

### The five-number summary and the mean
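The interquartile range can be obtained either by subtracting the 1st quartile from the 3rd quartile or with the base IQR function. The sketch below shows both; the names Q and IQR.manual are illustrative.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

Q = quantile(Attendees, c(0.25, 0.75))   ### 1st and 3rd quartiles

Q

IQR.manual = unname(Q[2] - Q[1])         ### 3rd quartile minus 1st quartile

IQR.manual

IQR(Attendees)                           ### Should agree with the manual result
```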

Optional technical note on calculating percentiles

It may have struck you as odd that the 3rd quartile for Attendees was reported as 17.25. After all, if you were to order the values of Attendees, the 75th percentile would fall between 17 and 18. But why does R go with 17.25 and not 17.5?


sort(Data$Attendees)

6 7 9 11 12 13 14 15 15 15 16 17 18 19 22 23

The answer is that there are several different methods to calculate percentiles, and they may give slightly different answers. For details on the calculations, see ?quantile.

For Attendees, the default type 7 calculation yields a 75th percentile value of 17.25, whereas the type 2 calculation simply splits the difference between 17 and 18 and yields 17.5. The type 1 calculation doesn’t average the two values, and so just returns 17.

quantile(Data$Attendees, 0.75, type=7)


75%
17.25

quantile(Data$Attendees, 0.75, type=2)


75%
17.5

quantile(Data$Attendees, 0.75, type=1)

75%
17
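To see where 17.25 comes from, the default type 7 result can be reproduced by interpolating between the two neighboring ordered values. This is a sketch of that interpolation, and the names Sorted, h, Lower, Upper, and Q3.type7 are assumptions for illustration.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

Sorted = sort(Attendees)
n      = length(Sorted)
p      = 0.75

h = (n - 1) * p + 1             ### Position in the ordered data; here 12.25

Lower = Sorted[floor(h)]        ### 12th ordered value
Upper = Sorted[ceiling(h)]      ### 13th ordered value

### Move the fractional part of h from Lower toward Upper
Q3.type7 = Lower + (h - floor(h)) * (Upper - Lower)

Q3.type7
```

The position 12.25 means the result sits one quarter of the way from the 12th value (17) to the 13th value (18).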

Percentiles other than the 25th, 50th, and 75th can be calculated with the quantile function. For example, to calculate the 95th percentile:


quantile(Data$Attendees, .95)

95%
22.25

Statistics for grouped interval/ratio data

In many cases, we will want to examine summary statistics for a variable within groups. For example, we may want to examine statistics for the workshops led by Ren and those led by Stimpy.

Summarize in FSA

The Summarize function in the FSA package returns the number of observations, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum for grouped data.

Note the use of formula notation: Attendees is the dependent variable (the variable you want to get the statistics for); and Instructor is the independent variable (the grouping variable). Summarize allows you to summarize over the combination of multiple independent variables by listing them to the right of the ~ separated by a plus sign (+).

library(FSA)

Summarize(Attendees ~ Instructor,
data=Data)

Instructor n nvalid mean sd min Q1 median Q3 max
1 Ren 8 8 13.125 5.083236 6 10.75 13.5 15.25 22
2 Stimpy 8 8 15.875 4.454131 9 14.00 16.0 18.25 23


Summarize(Attendees ~ Instructor + Location,
data=Data)


Instructor Location n nvalid mean sd min Q1 median Q3 max
1 Ren North 4 4 12.50 7.505554 6 6.75 11.0 16.75 22
2 Stimpy North 4 4 14.75 4.031129 9 13.50 16.0 17.25 18
3 Ren South 4 4 13.75 1.707825 12 12.75 13.5 14.50 16
4 Stimpy South 4 4 17.00 5.163978 11 14.00 17.0 20.00 23
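Grouped statistics can also be obtained without additional packages. The following sketch uses the base functions tapply and aggregate to compute group means; the data frame is rebuilt here so the example runs on its own, and the names Means.instructor and Means.both are illustrative.

```r
Data = data.frame(
  Instructor = rep(c("Ren", "Stimpy"), each = 8),
  Location   = rep(rep(c("North", "South"), each = 4), times = 2),
  Attendees  = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23))

### Mean of Attendees for each Instructor
Means.instructor = tapply(Data$Attendees, Data$Instructor, mean)

Means.instructor

### Mean of Attendees for each combination of Instructor and Location
Means.both = aggregate(Attendees ~ Instructor + Location, data = Data, FUN = mean)

Means.both
```

Other summary functions, such as median or sd, could be passed in place of mean.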

summarySE in Rmisc

The summarySE function in the Rmisc package outputs the number of observations, mean, standard deviation, standard error of the mean, and confidence interval for the mean for grouped data. The summarySE function allows you to summarize over the combination of multiple independent variables by listing them as a vector, e.g. c("Instructor", "Location").

library(Rmisc)

summarySE(data=Data,
"Attendees",
groupvars="Instructor",
conf.interval = 0.95)

Instructor N Attendees sd se ci
1 Ren 8 13.125 5.083236 1.797195 4.249691
2 Stimpy 8 15.875 4.454131 1.574773 3.723747

summarySE(data=Data,
"Attendees",
groupvars = c("Instructor", "Location"),
conf.interval = 0.95)

Instructor Location N Attendees sd se ci
1 Ren North 4 12.50 7.505553 3.7527767 11.943011
2 Ren South 4 13.75 1.707825 0.8539126 2.717531
3 Stimpy North 4 14.75 4.031129 2.0155644 6.414426
4 Stimpy South 4 17.00 5.163978 2.5819889 8.217041
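The ci values in the output above are consistent with the half-width of a confidence interval based on the t distribution. The following sketch reproduces the se and ci values for Ren’s workshops in base R, under the assumption that the interval uses n - 1 degrees of freedom; the names SE and CI are illustrative.

```r
Ren = c(7, 22, 6, 15, 12, 13, 14, 16)    ### Attendees for Ren's eight workshops

n = length(Ren)

SE = sd(Ren) / sqrt(n)                   ### Standard error of the mean

CI = SE * qt(0.975, df = n - 1)          ### Half-width of the 95% confidence interval

SE

CI

c(mean(Ren) - CI, mean(Ren) + CI)        ### Lower and upper confidence limits
```

The confidence limits are found by subtracting the ci value from the mean and adding it to the mean.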

describeBy in psych

The describeBy function in the psych package returns the number of observations, mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skew, kurtosis, and standard error of the mean for grouped data. describeBy allows you to summarize over the combination of multiple independent variables by combining terms with a colon (:).

library(psych)

describeBy(Data$Attendees,
group = Data$Instructor,
digits= 4)


group: Ren
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 8 13.12 5.08 13.5 13.12 2.97 6 22 16 0.13 -1.08 1.8
-------------------------------------------------------------------------
group: Stimpy
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 8 15.88 4.45 16 15.88 3.71 9 23 14 -0.06 -1.26 1.57

describeBy(Data$Attendees,
group = Data$Instructor : Data$Location,
digits= 4)


group: Ren:North
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 4 12.5 7.51 11 12.5 6.67 6 22 16 0.26 -2.14 3.75
-------------------------------------------------------------------------
group: Ren:South
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 4 13.75 1.71 13.5 13.75 1.48 12 16 4 0.28 -1.96 0.85
-------------------------------------------------------------------------
group: Stimpy:North
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 4 14.75 4.03 16 14.75 2.22 9 18 9 -0.55 -1.84 2.02
-------------------------------------------------------------------------
group: Stimpy:South
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 4 17 5.16 17 17 5.93 11 23 12 0 -2.08 2.58

Summaries for data frames

Often we will want to summarize variables in a whole data frame, either to get some summary statistics for individual variables, or to check that variables have the values we expected in inputting the data, to be sure there wasn’t some error.

The str function

The str function in the native utils package will list variables for a data frame, with their types and levels. Data is the name of the data frame, created above.

str(Data)

'data.frame': 16 obs. of 3 variables:
$ Instructor: Factor w/ 2 levels "Ren","Stimpy": 1 1 1 1 1 1 1 1 2 2 ...
$ Location : Factor w/ 2 levels "North","South": 1 1 1 1 2 2 2 2 1 1 ...
$ Attendees : int 7 22 6 15 12 13 14 16 18 17 ...

### Instructor is a factor (nominal) variable with two levels.
### Location is a factor (nominal) variable with two levels.
### Attendees is an integer variable.

The summary function

The summary function in the native base package will summarize all variables in a data frame, listing the frequencies for levels of nominal variables; and for interval/ratio data, the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

summary(Data)

Instructor Location Attendees
Ren :8 North:8 Min. : 6.00
Stimpy:8 South:8 1st Qu.:11.75
Median :15.00
Mean :14.50
3rd Qu.:17.25
Max. :23.00

The headTail function in psych

The headTail function in the psych package reports the first and last observations for a data frame.

library(psych)

headTail(Data)

Instructor Location Attendees
1 Ren North 7
2 Ren North 22
3 Ren North 6
4 Ren North 15
... <NA> <NA> ...
13 Stimpy South 15
14 Stimpy South 11
15 Stimpy South 19
16 Stimpy South 23

The head and tail functions in the native utils package report the first and last observations for a data frame.


head(Data, n=10)

Instructor Location Attendees
1 Ren North 7
2 Ren North 22
3 Ren North 6
4 Ren North 15
5 Ren South 12
6 Ren South 13
7 Ren South 14
8 Ren South 16
9 Stimpy North 18
10 Stimpy North 17

tail(Data, n=10)

Instructor Location Attendees
7 Ren South 14
8 Ren South 16
9 Stimpy North 18
10 Stimpy North 17
11 Stimpy North 15
12 Stimpy North 9
13 Stimpy South 15
14 Stimpy South 11
15 Stimpy South 19
16 Stimpy South 23

The describe function in psych

The describe function in the psych package reports number of observations, mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skew, kurtosis, and standard error for variables in a data frame.

Note that factor variables are labeled with an asterisk (*), and the levels of the factors are coded as 1, 2, 3, etc.

library(psych)

describe(Data)


vars n mean sd median trimmed mad min max range skew kurtosis se
Instructor* 1 16 1.5 0.52 1.5 1.5 0.74 1 2 1 0.00 -2.12 0.13
Location* 2 16 1.5 0.52 1.5 1.5 0.74 1 2 1 0.00 -2.12 0.13
Attendees 3 16 14.5 4.83 15.0 14.5 4.45 6 23 17 -0.04 -0.88 1.21

Dealing with missing values

Sometimes a data set will have missing values. This can occur for a variety of reasons, such as a respondent not answering a specific question, or a researcher being unable to make a measurement due to temporarily malfunctioning equipment.

In R, a missing value is indicated with NA.

By default, different functions in R will handle missing values in different ways. But most have options to change how they treat missing data.

In general, you should scan your data for missing data, and think carefully about the best way to handle observations with missing values.

Data2 = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South NA
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North NA
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
")

### Note: This data frame will be called Data2 to distinguish it
### from Data above.

Data2

The na.rm option for missing values with a simple function

Many common functions in R have a na.rm option. If this option is set to FALSE, the function will return an NA result if there are any NA’s in the data values passed to the function. If set to TRUE, observations with NA values will be discarded, and the function will continue to calculate its output.

Note that the na.rm option operates only on the data values actually passed to the function. In the following example with median, only Attendees is passed to the function; if there were NA’s in other variables, this would not affect the function.

Not all functions have the same default for the na.rm option. To determine the default, use e.g. ?median, ?mean, ?sd.


median(Data2$Attendees,
na.rm = FALSE)

NA

### na.rm=FALSE. Since there is an NA in the data, report NA.

median(Data2$Attendees,
na.rm = TRUE)

15

### na.rm=TRUE. Drop observations with NA and then calculate the median.
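Before choosing how to handle missing values, it can help to count them and find where they occur. The following sketch uses the base is.na function on a vector with the same values as Attendees in Data2; the names N.missing and Which.missing are illustrative.

```r
Attendees = c(7, 22, 6, 15, 12, 13, NA, 16, 18, 17, NA, 9, 15, 11, 19, 23)

N.missing = sum(is.na(Attendees))         ### Number of NA values

Which.missing = which(is.na(Attendees))   ### Positions of the NA values

N.missing

Which.missing

mean(Attendees)                  ### Returns NA because na.rm defaults to FALSE for mean

mean(Attendees, na.rm = TRUE)    ### Mean of the non-missing values
```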

Missing values with Summarize in FSA

Summarize in FSA will indicate invalid values, including NA’s, with the count of valid observations outputted as the variable nvalid.

library(FSA)

Summarize(Attendees ~ Instructor,
data=Data2)

Instructor n nvalid mean sd min Q1 median Q3 max
1 Ren 8 7 13 5.477226 6 9.5 13 15.5 22
2 Stimpy 8 7 16 4.795832 9 13.0 17 18.5 23

### This function removes missing values, but indicates the number of
### missing values by not including them in the count for nvalid.

Missing values indicated with summary function

The summary function counts NA’s for numeric variables.


summary(Data2)

Instructor Location Attendees
Ren :8 North:8 Min. : 6.00
Stimpy:8 South:8 1st Qu.:11.25
Median :15.00
Mean :14.50
3rd Qu.:17.75
Max. :23.00
NA's :2

### Indicates two NA’s in Attendees.

Missing values in the describe function in psych

The describe function in the psych package removes NA’s by default.

library(psych)

describe(Data2$Attendees)

vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 14 14.5 5.19 15 14.5 5.19 6 23 17 -0.04 -1.17 1.39

### Note that two NA’s were removed by default, reporting an n of 14.

Missing values in the summarySE function in Rmisc

By default, the summarySE function does not remove NA’s, but can be made to do so with the na.rm=TRUE option.

library(Rmisc)

summarySE(data=Data2,
"Attendees")

.id N Attendees sd se ci
1 <NA> 16 NA NA NA NA

### Note an N of 16 is reported, and statistics are reported as NA.

library(Rmisc)

summarySE(data=Data2,
"Attendees",
na.rm=TRUE)

.id N Attendees sd se ci
1 <NA> 14 14.5 5.185038 1.38576 2.993752

### Note an N of 14 is reported, and statistics are calculated with
### NA’s removed.

Advanced techniques
Discarding subjects

Certain psychology studies use scales that are calculated from responses of several questions. Because answers to all questions are needed to reliably calculate the scale, any observation (subject, person) with missing answers will simply be discarded. Some types of analyses in other fields also follow this approach.

Subjects can be deleted with the subset function. The following code creates a new data frame, Data3, with all observations with NA in the variable Attendees removed from Data2.

Data3 = subset(Data2,
!is.na(Attendees))

Data3

Instructor Location Attendees
1 Ren North 7
2 Ren North 22
3 Ren North 6
4 Ren North 15
5 Ren South 12
6 Ren South 13
8 Ren South 16
9 Stimpy North 18
10 Stimpy North 17
12 Stimpy North 9
13 Stimpy South 15
14 Stimpy South 11
15 Stimpy South 19
16 Stimpy South 23

### Note in the results that observations 7 and 11 were deleted
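When an observation should be discarded if it has a missing value in any variable, and not just in Attendees, the base complete.cases function is one option. The following is a sketch under that assumption; the data frame is rebuilt here so the example runs on its own, and the name Data3.complete is illustrative.

```r
Data2 = data.frame(
  Instructor = rep(c("Ren", "Stimpy"), each = 8),
  Location   = rep(rep(c("North", "South"), each = 4), times = 2),
  Attendees  = c(7, 22, 6, 15, 12, 13, NA, 16, 18, 17, NA, 9, 15, 11, 19, 23))

### Keep only the rows that have no NA in any column
Data3.complete = Data2[complete.cases(Data2), ]

nrow(Data3.complete)

rownames(Data3.complete)    ### Rows 7 and 11 are absent
```

For these data, only Attendees contains NA’s, so the result keeps the same 14 observations as the subset approach.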

Imputation of values

Missing values can be assigned a likely value through the process of imputation. An algorithm is used that determines a value based on the values of other variables for that observation relative to values for other variables. The mice package in R can perform imputation.

Optional code: removing missing values in vectors

If functions don’t have a na.rm option, it is possible to remove NA observations manually. Here we’ll create a vector called valid that just contains those values of Attendees that are not NA’s.

The ! operator is a logical not. The brackets serve as a list of observations to include for the preceding variable. So, the code essentially says, “Define a vector valid as the values of Attendees where the values of Attendees are not NA.”

valid = Data2$Attendees[!is.na(Data2$Attendees)]

valid

7 22 6 15 12 13 16 18 17 9 15 11 19 23

summary(valid)

Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 11.25 15.00 14.50 17.75 23.00

Statistics of shape for interval/ratio data

Perhaps the most common statistics describing the shape of a data distribution are skewness and kurtosis.

Skewness and kurtosis values are not reported too commonly. Instead, the shape of observed data is often presented visually with a plot, such as a histogram or quantile-quantile plot.

Skewness

Skewness indicates the degree of asymmetry in a data set. If the distribution shows a long tail to the right, the distribution is positively skewed or right skewed. Negative or left skew is the opposite.

A symmetric distribution has a skewness of 0. The skewness value for a positively skewed distribution is positive, and a negative value describes a negatively skewed distribution. Sometimes a skew with an absolute value greater than 1 or 1.5 or 2 is considered highly skewed. There is not agreement on this interpretation.

There are different methods to calculate skew and kurtosis. The describe function in the psych package has three options for how to calculate them.

library(psych)

describe(Data$Attendees,
type=3) ### Type of calculation for skewness and kurtosis

vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 16 14.5 4.83 15 14.5 4.45 6 23 17 -0.04 -0.88 1.21

### Skewness and kurtosis among other statistics
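As a rough illustration of what a skewness statistic measures, the following sketch computes one common version by hand: the average cubed deviation from the mean, divided by the cube of the sample standard deviation. This reproduces the skew value shown above for Attendees, though it should not be assumed to match every calculation type. The names Deviations and Skew are illustrative.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

n = length(Attendees)

Deviations = Attendees - mean(Attendees)

### Average cubed deviation, scaled by the cube of the standard deviation
Skew = (sum(Deviations^3) / n) / sd(Attendees)^3

round(Skew, 2)
```

Cubing the deviations keeps their sign, so long tails to the left produce negative values and long tails to the right produce positive values.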


The normal curve is symmetrical around its center. The positively-skewed distribution has a longer, thicker tail to the right. The negatively-skewed distribution has a longer, thicker tail to the left.

Normal distribution

The normal distribution is symmetric, bell-shaped, and follows a specific mathematically-defined distribution. No real observed data is exactly normally distributed, but some real-world data are approximately normal in distribution.

For more information on the normal distribution, see the video on “Normal Distribution” from Statistics Learning Center in the “Optional Readings” section. Additional thoughts on the normal distribution and real-world data distributions are in the article by Dr. Nic in the “Optional Readings” section.

Kurtosis

Kurtosis measures the degree to which the distribution of data has either fewer and less extreme outliers, or more and more extreme outliers, where outliers are values that are far from the center of the distribution. This can be thought of as how thick the tails are in a distribution. A normal distribution is mesokurtic.

In general, the higher the kurtosis, the sharper the peak and the fatter the tails. This is called leptokurtic, and is indicated by positive kurtosis values. The opposite, platykurtosis, has negative kurtosis values, with a flatter peak and thinner tails.

However, kurtosis doesn’t necessarily measure the “peakedness” of the distribution, or even the fatness of the tails of the distribution. It’s a difficult concept to encapsulate in the general case.

Confusingly, the kurtosis for a normal distribution is sometimes defined as 0, and sometimes defined as 3. The former is often called “excess kurtosis”.

Sometimes an excess kurtosis with an absolute value greater than 2 or 3 is considered a high deviation from being mesokurtic. There is not agreement on this interpretation.
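The difference between kurtosis and excess kurtosis can be seen in a small hand calculation. The sketch below uses one common version of the statistic: the average fourth power of the deviations from the mean, divided by the fourth power of the sample standard deviation. Subtracting 3 gives the excess kurtosis, which for the Ren and Stimpy attendance values agrees with the -0.88 reported by describe earlier in this chapter. The names Kurtosis and Excess.kurtosis are illustrative.

```r
Attendees = c(7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23)

n = length(Attendees)

Deviations = Attendees - mean(Attendees)

### Average fourth-power deviation, scaled by the standard deviation
Kurtosis = (sum(Deviations^4) / n) / sd(Attendees)^4

Excess.kurtosis = Kurtosis - 3    ### A normal distribution would be near 0 on this scale

round(Kurtosis, 2)

round(Excess.kurtosis, 2)
```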

Descriptive statistics for ordinal data

Descriptive statistics for ordinal data are more limited than those for interval/ratio data. You’ll remember that for ordinal data, the levels can be ordered, but we can’t say that the intervals between the levels are equal. For example, we can’t say that an Associate’s degree and Master’s degree somehow average out to a Bachelor’s degree.

Because of this fact, several common descriptive statistics are usually inappropriate for use with ordinal data. These include mean and standard deviation.

Ordinal data can be described by either: 1) treating the data as numeric and using appropriate statistics such as median and quartiles; or, 2) treating the data as nominal, and looking at counts or proportions of the data for each level.

A more-complete discussion of descriptive statistics for ordinal data can be found in the Descriptive Statistics for Likert Data chapter.

Example of descriptive statistics for ordinal data

For this example, imagine that Arthur and Buster have each held a workshop educating the public about preventing water pollution at home. They are interested in the education level of people who attended the workshops. We can consider education level an ordinal variable. They also collected information on the gender and county of participants.

For this data, we have manually coded Education with numbers in a separate variable Ed.code. These numbers list the education in order: High school < Associate’s < Bachelor’s < Master’s < Ph.D.

Of course we could have had R do this coding for us, but the code is slightly messy, so I did the coding manually, and listed the optional R code below.

Ideally, we would want to treat Education as an ordered factor variable in R. But unfortunately, most common functions in R won’t handle ordered factors well. Later in this book, ordered factor data will be handled directly with cumulative link models (CLM), permutation tests, and tests for ordered tables.

Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Date Instructor Student Gender County Education Ed.code
'2015-11-01' 'Arthur Read' a female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' b female 'Bear Lake' PHD 5
'2015-11-01' 'Arthur Read' c male 'Elwood' BA 3
'2015-11-01' 'Arthur Read' d female 'Elwood' MA 4
'2015-11-01' 'Arthur Read' e male 'Elwood' HS 1
'2015-11-01' 'Arthur Read' f female 'Bear Lake' MA 4
'2015-11-01' 'Arthur Read' g male 'Elwood' HS 1
'2015-11-01' 'Arthur Read' h female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' i female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' j female 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' k male 'Elwood' MA 4
'2015-12-01' 'Buster Baxter' l male 'Bear Lake' MA 4
'2015-12-01' 'Buster Baxter' m female 'Elwood' AA 2
'2015-12-01' 'Buster Baxter' n male 'Elwood' AA 2
'2015-12-01' 'Buster Baxter' o other 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' p female 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' q female 'Bear Lake' PHD 5
")

### Check the data frame Data

str(Data)


'data.frame': 17 obs. of 7 variables:
$ Date : Factor w/ 2 levels "2015-11-01","2015-12-01": 1 1 1 1 1 1 1 1 1 1 ...
$ Instructor: Factor w/ 2 levels "Arthur Read",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Student : Factor w/ 17 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 3 levels "female","male",..: 1 1 2 1 2 1 2 1 1 1 ...
$ County : Factor w/ 2 levels "Bear Lake","Elwood": 2 1 2 2 2 1 2 2 2 2 ...
$ Education : Factor w/ 5 levels "AA","BA","HS",..: 2 5 2 4 3 4 3 2 2 2 ...
$ Ed.code : int 3 5 3 4 1 4 1 3 3 3 ...

summary(Data)

Date Instructor Student Gender County
2015-11-01:10 Arthur Read :10 a : 1 female:10 Bear Lake: 4
2015-12-01: 7 Buster Baxter: 7 b : 1 male : 6 Elwood :13
c : 1 other : 1
d : 1
e : 1
f : 1
(Other):11
Education Ed.code
AA :2 Min. :1.000
BA :7 1st Qu.:3.000
HS :2 Median :3.000
MA :4 Mean :3.118
PHD:2 3rd Qu.:4.000
Max. :5.00

Optional code to assign values to a variable based on another variable


Data$Ed.code[Data$Education=="HS"] = 1
Data$Ed.code[Data$Education=="AA"] = 2
Data$Ed.code[Data$Education=="BA"] = 3
Data$Ed.code[Data$Education=="MA"] = 4
Data$Ed.code[Data$Education=="PHD"] = 5

Optional code to change a factor variable to an ordered factor variable

Data$Education.ordered = factor(Data$Education,
ordered = TRUE,
levels = c("HS", "AA","BA", "MA", "PHD")
)
str(Data$Education.ordered)

summary(Data$Education.ordered)


HS AA BA MA PHD
2 2 7 4 2

median(Data$Education.ordered)

[1] "BA"
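If median does not accept an ordered factor in your version of R, one possible workaround is to take the median of the underlying integer codes and then look up the corresponding level. This is only a sketch, and the names Codes, Median.code, and Median.level are illustrative. It assumes the median falls exactly on a level, which is the case here because there is an odd number of observations.

```r
Education = factor(c("BA", "PHD", "BA", "MA", "HS", "MA", "HS", "BA", "BA", "BA",
                     "MA", "MA", "AA", "AA", "BA", "BA", "PHD"),
                   ordered = TRUE,
                   levels  = c("HS", "AA", "BA", "MA", "PHD"))

Codes = as.integer(Education)                ### 1 = HS, 2 = AA, 3 = BA, 4 = MA, 5 = PHD

Median.code = median(Codes)                  ### Median of the integer codes

Median.level = levels(Education)[Median.code]   ### Translate the code back to a level

Median.level
```

With an even number of observations, the median code could fall half-way between two levels, in which case this lookup would not be meaningful.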

Statistics of location for ordinal data

Using the mean is usually not appropriate for ordinal data. Instead, the median could be used as a measure of location. We will use our numerically-coded variable Ed.code for the analysis.

median(Data$Ed.code)

3

### Remember that 3 meant Bachelor’s degree in our coding

Statistics of variation for ordinal data

Statistics like standard deviation and standard error of the mean are usually inappropriate for ordinal data. Instead, the five-number summary can be used.


summary(Data$Ed.code)

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 3.000 3.118 4.000 5.000

### Remember to ignore the mean value for ordinal data.

### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.

Statistics for grouped ordinal data

Because the Summarize function in the FSA package reports the five-number summary, it is a useful function for descriptive statistics for grouped ordinal data.

library(FSA)

Summarize(Ed.code ~ Gender,
data=Data)


Gender n nvalid mean sd min Q1 median Q3 max
1 female 10 10 3.5 0.9718253 2 3.00 3.0 4.00 5
2 male 6 6 2.5 1.3784049 1 1.25 2.5 3.75 4
3 other 1 1 3.0 NA 3 3.00 3.0 3.00 3

### Remember to ignore the mean and sd values for ordinal data.

library(FSA)

Summarize(Ed.code ~ Gender + County,
data=Data)


Gender County n nvalid mean sd min Q1 median Q3 max
1 female Bear Lake 3 3 4.666667 0.5773503 4 4.5 5 5 5
2 male Bear Lake 1 1 4.000000 NA 4 4.0 4 4 4
3 female Elwood 7 7 3.000000 0.5773503 2 3.0 3 3 4
4 male Elwood 5 5 2.200000 1.3038405 1 1.0 2 3 4
5 other Elwood 1 1 3.000000 NA 3 3.0 3 3 3

### Remember to ignore the mean and sd values for ordinal data.

Statistics for ordinal data treated as nominal data

The summary function in the native base package and the xtabs function in the native stats package provide counts for levels of a nominal variable.

The prop.table function translates a table into proportions. The margin=1 option indicates that the proportions are calculated for each row.

First, we will order the levels of Education, otherwise R will report results in alphabetical order.

Data$Education = factor(Data$Education,
levels = c("HS", "AA","BA", "MA", "PHD"))

### Order factors, otherwise R will alphabetize them

summary(Data$Education)


HS AA BA MA PHD
2 2 7 4 2
### Counts of each level of Education

XT = xtabs( ~ Education,
data=Data)

XT

Education
HS AA BA MA PHD
2 2 7 4 2

prop.table(XT)

Education
HS AA BA MA PHD
0.1176471 0.1176471 0.4117647 0.2352941 0.1176471
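For a one-way table, the proportions are simply each count divided by the total count. The following sketch reproduces the values above by hand from the counts shown in this section. Counts and Proportions are hypothetical names.

```r
### Counts of each level of Education, copied from the table above
Counts = c(HS = 2, AA = 2, BA = 7, MA = 4, PHD = 2)

### Each count divided by the total number of observations
Proportions = Counts / sum(Counts)

Proportions

### The same values expressed as percentages
round(100 * Proportions, 1)
```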

Grouped data

XT = xtabs( ~ Gender + Education,
data = Data)

XT

Education
Gender HS AA BA MA PHD
female 0 1 5 2 2
male 2 1 1 2 0
other 0 0 1 0 0

prop.table(XT,
margin = 1)

Education
Gender HS AA BA MA PHD
female 0.0000000 0.1000000 0.5000000 0.2000000 0.2000000
male 0.3333333 0.1666667 0.1666667 0.3333333 0.0000000
other 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

### Proportion of responses for each row
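To see what the margin option changes, the sketch below rebuilds the Gender by Education counts as a plain matrix and computes both row and column proportions. The counts are copied from the table above. Row.props and Col.props are hypothetical names, and margin = 2 is shown only for comparison.

```r
### Counts from the Gender x Education table, entered column by column
XT = matrix(c(0, 2, 0,    # HS
              1, 1, 0,    # AA
              5, 1, 1,    # BA
              2, 2, 0,    # MA
              2, 0, 0),   # PHD
            nrow = 3,
            dimnames = list(Gender    = c("female", "male", "other"),
                            Education = c("HS", "AA", "BA", "MA", "PHD")))

### margin = 1: each row is divided by its row total
Row.props = prop.table(XT, margin = 1)

### margin = 2: each column is divided by its column total
Col.props = prop.table(XT, margin = 2)

rowSums(Row.props)    ### Every row sums to 1

colSums(Col.props)    ### Every column sums to 1
```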

Two-way grouped data

XT = xtabs(~ Gender + Education + County,
data=Data)

XT

### Note that the dependent variable, Education, is in the middle
### of the variable list.

, , County = Bear Lake

Education
Gender HS AA BA MA PHD
female 0 0 0 1 2
male 0 0 0 1 0
other 0 0 0 0 0

, , County = Elwood

Education
Gender HS AA BA MA PHD
female 0 1 5 1 0
male 2 1 1 1 0
other 0 0 1 0 0
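A three-way table prints as one slice per level of the last variable. As an alternative view, the ftable function in the native stats package flattens the same counts into a single table. The sketch below rebuilds the three relevant columns of this chapter's data so that it can run on its own. FT is a hypothetical name, and the choice of row variables is only one possible layout.

```r
### Gender, Education, and County for the 17 students, in data frame order
Gender = c("female", "female", "male", "female", "male", "female",
           "male", "female", "female", "female", "male", "male",
           "female", "male", "other", "female", "female")

Education = c("BA", "PHD", "BA", "MA", "HS", "MA", "HS", "BA", "BA",
              "BA", "MA", "MA", "AA", "AA", "BA", "BA", "PHD")

County = c("Elwood", "Bear Lake", "Elwood", "Elwood", "Elwood",
           "Bear Lake", "Elwood", "Elwood", "Elwood", "Elwood",
           "Elwood", "Bear Lake", "Elwood", "Elwood", "Elwood",
           "Elwood", "Bear Lake")

Data = data.frame(Gender, County,
                  Education = factor(Education,
                                     levels = c("HS", "AA", "BA", "MA", "PHD")))

XT = xtabs(~ Gender + Education + County,
           data = Data)

### Rows are County and Gender combinations; columns are Education levels
FT = ftable(XT,
            row.vars = c("County", "Gender"))

FT
```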

Descriptive statistics for nominal data

Descriptive statistics for nominal data consist of listing or plotting counts for levels of the nominal data, often by levels of a grouping variable. Tables of this information are often called contingency tables.

For this example, we will look again at Arthur and Buster’s data, but this time considering Gender to be the dependent variable of interest.

If levels of a nominal variable are coded with numbers, remember that the numbers will be arbitrary. For example, if you assign female = 1, male = 2, and other = 3, it makes no sense to say that the average gender in Arthur’s class was 1.3, or that the average gender in Buster’s class was greater than that in Arthur’s. It also makes no sense to say that the median gender was female.

Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Date Instructor Student Gender County Education Ed.code
'2015-11-01' 'Arthur Read' a female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' b female 'Bear Lake' PHD 5
'2015-11-01' 'Arthur Read' c male 'Elwood' BA 3
'2015-11-01' 'Arthur Read' d female 'Elwood' MA 4
'2015-11-01' 'Arthur Read' e male 'Elwood' HS 1
'2015-11-01' 'Arthur Read' f female 'Bear Lake' MA 4
'2015-11-01' 'Arthur Read' g male 'Elwood' HS 1
'2015-11-01' 'Arthur Read' h female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' i female 'Elwood' BA 3
'2015-11-01' 'Arthur Read' j female 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' k male 'Elwood' MA 4
'2015-12-01' 'Buster Baxter' l male 'Bear Lake' MA 4
'2015-12-01' 'Buster Baxter' m female 'Elwood' AA 2
'2015-12-01' 'Buster Baxter' n male 'Elwood' AA 2
'2015-12-01' 'Buster Baxter' o other 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' p female 'Elwood' BA 3
'2015-12-01' 'Buster Baxter' q female 'Bear Lake' PHD 5
")


### Check the data frame

Data

str(Data)

summary(Data)

Example of descriptive statistics for nominal data

The summary function in the native base package and the xtabs function in the native stats package provide counts for levels of a nominal variable.

The prop.table function translates a table into proportions. The margin=1 option indicates that the proportions are calculated for each row.

One-sample data


summary(Data$Gender)

female male other
10 6 1

### Counts of each level of Gender

One-way data


xtabs(~ Date + Gender,
data=Data)

Gender
Date female male other
2015-11-01 7 3 0
2015-12-01 3 3 1

XT = xtabs(~ Date + Gender,
data=Data)

prop.table(XT,
margin = 1)

Gender
Date female male other
2015-11-01 0.7000000 0.3000000 0.0000000
2015-12-01 0.4285714 0.4285714 0.1428571

### Proportion of each level of Gender for each row

sum(XT)

[1] 17

### Sum of observations in the table

rowSums(XT)

2015-11-01 2015-12-01
10 7

### Sum of observations in each row of the table

colSums(XT)

female male other
10 6 1

### Sum of observations in each column of the table
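The three sums above can also be shown in one step. As a sketch, the addmargins function in the native stats package appends row, column, and grand totals to a table. The counts below are copied from the Date by Gender table in this section. With.totals is a hypothetical name.

```r
### Counts from the Date x Gender table, entered column by column
XT = as.table(matrix(c(7, 3,     # female
                       3, 3,     # male
                       0, 1),    # other
                     nrow = 2,
                     dimnames = list(Date   = c("2015-11-01", "2015-12-01"),
                                     Gender = c("female", "male", "other"))))

### Append a "Sum" row and a "Sum" column to the table
With.totals = addmargins(XT)

With.totals
```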

Two-way data

xtabs(~ County + Gender + Date,
data=Data)

### Note that the dependent variable, Gender, is in the middle
### of the variable list.

, , Date = 2015-11-01

Gender
County female male other
Bear Lake 2 0 0
Elwood 5 3 0

, , Date = 2015-12-01

Gender
County female male other
Bear Lake 1 1 0
Elwood 2 2 1

Levels for factor variables

Most of the time in R, nominal variables will be handled by the software as factor variables.

The order of levels of factor variables is important because most functions, including plotting functions, will handle levels of the factor in order. The order of the levels can be changed, for example, to change the order that groups are plotted in a plot, or which groups are at the top of the table.

The stringsAsFactors=TRUE option tells R to interpret character data as factor variables in the read.table call. By default, R alphabetizes the levels of the factors.

Looking at the Arthur and Buster data, note that Instructor, Student, and Gender, among other variables, are treated as factor variables.


str(Data)

'data.frame': 17 obs. of 7 variables:
$ Date : Factor w/ 2 levels "2015-11-01","2015-12-01": 1 1 1 1 1 1 1 1 1 1 ...
$ Instructor: Factor w/ 2 levels "Arthur Read",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Student : Factor w/ 17 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 3 levels "female","male",..: 1 1 2 1 2 1 2 1 1 1 ...
$ County : Factor w/ 2 levels "Bear Lake","Elwood": 2 1 2 2 2 1 2 2 2 2 ...
$ Education : Factor w/ 5 levels "AA","BA","HS",..: 2 5 2 4 3 4 3 2 2 2 ...
$ Ed.code : int 3 5 3 4 1 4 1 3 3 3 ...

Note also that the levels of the factor variables were alphabetized by default. That is, even though Elwood was found in the data before Bear Lake, R treats Bear Lake as the first level in the variable County.


summary(Data)

Date Instructor Gender County
2015-11-01:10 Arthur Read :10 female:10 Bear Lake: 4
2015-12-01: 7 Buster Baxter: 7 male : 6 Elwood :13
other : 1

We can order factor levels by the order in which they were read in the data frame.

Data$County = factor(Data$County,
levels=unique(Data$County))

levels(Data$County)

[1] "Elwood" "Bear Lake"

We can also order factor levels in an order of our choosing.

Data$Gender = factor(Data$Gender,
levels=c("male" , "female", "other"))

levels(Data$Gender)

[1] "male" "female" "other"

Note that in the actions above, we are not changing the order in the data frame, simply which level is treated internally by the software as "1" or "2", and so on.
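The point that only the internal codes change can be checked with a small sketch. The three-value County vector below is a made-up stand-in for Data$County, and County.reordered is a hypothetical name.

```r
### Hypothetical stand-in for Data$County
County = factor(c("Elwood", "Bear Lake", "Elwood"))

levels(County)        ### Alphabetical by default: "Bear Lake" "Elwood"

as.integer(County)    ### Internal codes: 2 1 2

### Reorder the levels by order of first appearance in the data
County.reordered = factor(County,
                          levels = unique(County))

levels(County.reordered)        ### "Elwood" "Bear Lake"

as.integer(County.reordered)    ### Internal codes are now 1 2 1

### The values themselves, and their order in the data, are unchanged
as.character(County.reordered)
```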

Optional note on reporting summary statistics honestly

Every time we report a descriptive statistic or the result of a statistical test, we are condensing information, which may have been a whole data set, into one or a few pieces of information. It is the job of the analyst to choose the best way to present descriptive and graphical information, and to choose the correct statistical test or method.

SAT score example

Imagine a set of changes in SAT scores from seven students who took a certain SAT preparation course. The change in scores, Increase, represents the difference in score; that is, the score after taking the course minus their initial score.

Increase = c(50, 60, 120, -80, -10, 10, 0)

Our first instinct in assessing the course might be to look at the average (mean) change in score. The result is a mean increase of 21 points, which could be considered a success, perhaps.

mean(Increase)

[1] 21.42857

But we would be remiss to not look at other summary statistics and plots.

Using the Summarize function in the FSA package, we find that the median increase was only 10 points. The increase for the first quartile was negative, suggesting at least 25% of students got a lower score afterwards.

library(FSA)

Summarize(Increase,
digits = 2)

n nvalid mean sd min Q1 median Q3 max
7.00 7.00 21.43 63.09 -80.00 -5.00 10.00 55.00 120.00
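The quartiles, and the share of students whose score dropped, can also be checked in base R. This is a sketch using the same Increase values. Share.lower is a hypothetical name.

```r
Increase = c(50, 60, 120, -80, -10, 10, 0)

### Minimum, quartiles, and maximum with the default quantile type
quantile(Increase)

### Proportion of students whose score decreased
Share.lower = mean(Increase < 0)

Share.lower
```

Two of the seven students, or about 29%, had a lower score after the course.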

A histogram of the changes in scores suggests a slight right skew to the data, but that the mass of the data sits not far from zero.


hist(Increase,
col="gray")



Finally, we’ll compute the 95% confidence interval for the mean change in score. Looking at the results for the percentile method, the confidence interval includes zero, suggesting the change in scores for this course was not statistically different from zero.

library(rcompanion)

Data = data.frame(Increase)

groupwiseMean(Increase ~ 1,
data = Data,
traditional = FALSE,
percentile = TRUE)

.id n Mean Conf.level Percentile.lower Percentile.upper
1 <NA> 7 21.4 0.95 -21.4 65.7
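For readers curious about what a percentile confidence interval involves, here is a minimal base R sketch of a percentile bootstrap. It is not the rcompanion implementation. The seed, the number of resamples, and the names Boot.means and CI are assumptions made for this illustration, and the limits will vary somewhat from run to run and from the output above.

```r
set.seed(1)     ### Arbitrary seed, only so the sketch is reproducible

Increase = c(50, 60, 120, -80, -10, 10, 0)

N.resamples = 5000

### Resample the data with replacement and record the mean each time
Boot.means = replicate(N.resamples,
                       mean(sample(Increase, replace = TRUE)))

### The 2.5th and 97.5th percentiles of the resampled means
CI = quantile(Boot.means,
              probs = c(0.025, 0.975))

CI
```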

Based on the data exploration, it seems that it would be irresponsible or dishonest to simply report that the average increase in test scores was roughly 20 points.

For descriptive statistics, we might report the mean and 95% confidence interval. Or perhaps the five-number summary for the change in scores. Or show the histogram of values.

Optional analyses

Robust estimators: trimmed mean and Winsorized mean

Robust estimators of central tendency are used to describe the location of a data set without undue influence of extreme values. Robust estimators include trimmed means, Winsorized means, and other estimates like M-estimators (not discussed here).

Example

The New Jersey Envirothon is a statewide competition for high school students covering various aspects of natural resources science. In the Team Presentation station, student teams are judged by five judges.

Currently, scorekeepers drop the highest and lowest scores for each team to avoid the effect of aberrant low or high scores. This is an example of a trimmed mean. In this case, the mean is trimmed to 60% (with 20% removed from each side).

Another option to ameliorate the effect of extreme scores is using Winsorized means. A Winsorized mean removes extreme observations, but replaces them with the closest observations in terms of magnitude.

In this example, two teams received five scores for their presentations that have the same mean and median. We will look at robust estimators to determine if one team should be scored higher than the other.


Team.A = c(100, 90, 80, 60, 20)

median(Team.A)

[1] 80

mean(Team.A)

[1] 70

mean(Team.A,
trim = .20) # This trims to the inner 60% of observations

[1] 76.66667
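To make the trimming explicit, the sketch below computes the same trimmed mean by hand: sort the scores, drop the lowest and highest 20% of observations, and average what remains. Sorted, k, and Trimmed are hypothetical names.

```r
Team.A = c(100, 90, 80, 60, 20)

Sorted = sort(Team.A)

n = length(Sorted)

### Number of observations to drop from each end: 20% of 5 is 1
k = floor(0.20 * n)

### Keep only the inner observations
Trimmed = Sorted[(k + 1):(n - k)]

Trimmed

mean(Trimmed)    ### Agrees with mean(Team.A, trim = .20)
```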

library(psych)

winsor(Team.A,
trim = 0.20) # This Winsorizes to the inner 60% of observations

[1] 92 90 80 60 52

### Note that the Winsorized values at the extremes appear to be calculated
### with a function analogous to the quantile(x, probs, type = 7) function.

winsor.mean(Team.A,
trim = 0.20) # This Winsorizes to the inner 60% of observations

[1] 74.8
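Following the note above about quantiles, the Winsorized values can be reproduced in base R as a sketch: compute the 20th and 80th percentiles with type = 7, then pull any score outside those limits back to the nearer limit. This mirrors the output shown above but is not taken from the psych source code. Limits and Winsorized are hypothetical names.

```r
Team.A = c(100, 90, 80, 60, 20)

### Lower and upper limits: the 20th and 80th percentiles
Limits = unname(quantile(Team.A,
                         probs = c(0.20, 0.80),
                         type  = 7))

Limits

### Raise scores below the lower limit, then lower scores above the upper limit
Winsorized = pmin(pmax(Team.A, Limits[1]), Limits[2])

Winsorized

mean(Winsorized)
```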

Team.B = c(80, 80, 80, 80, 30)

median(Team.B)

[1] 80

mean(Team.B)

[1] 70

mean(Team.B,
trim = .20) # This trims to the inner 60% of observations

[1] 80

library(psych)

winsor(Team.B,
trim = 0.20) # This Winsorizes to the inner 60% of observations

[1] 80 80 80 80 70

### Note that the Winsorized values at the extremes appear to be calculated
### with a function analogous to the quantile(x, probs, type = 7) function.

winsor.mean(Team.B,
trim = 0.20) # This Winsorizes to the inner 60% of observations

[1] 78

In this example, the means and medians for Team A and Team B were identical. However, the trimmed mean for Team A was less than for Team B (77 vs. 80), and the Winsorized mean for Team A was less than for Team B (75 vs. 78). According to either the trimmed mean method or the Winsorized mean method, Team B had a higher score.

Note also that the Winsorized means were lower than the trimmed means, for both teams. This is because the Winsorized mean still takes each team's low score into account, replacing it with a less extreme value, whereas the trimmed mean discards it entirely.

Geometric mean

The geometric mean is used to summarize certain measurements, such as average return for investments, and for certain scientific measurements, such as bacteria counts in environmental water. It is useful when data are log-normally distributed.

Practically speaking, using the geometric mean ameliorates the effect of outlying values in the data.

To get the geometric mean, the log of each value is taken, these values are averaged, and then the base of the log is raised to this average.

Imagine a series of 9 counts of bacteria from lake water samples, called Bacteria here. The geometric mean can be calculated with nested log, mean, and exp functions. Or more simply, the geometric.mean function in the psych package can be used.

Bacteria example

Bacteria = c(20, 40, 50, 60, 100, 120, 150, 200, 1000)

exp(mean(log(Bacteria)))

[1] 98.38887

library(psych)

geometric.mean(Bacteria)

[1] 98.38887
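As a sketch of why the geometric mean is less affected by the one very large count, the code below compares it with the arithmetic mean and the median for the same data. It also shows an equivalent definition, the n-th root of the product of the values. Geo.mean is a hypothetical name.

```r
Bacteria = c(20, 40, 50, 60, 100, 120, 150, 200, 1000)

n = length(Bacteria)

### Equivalent definition: the n-th root of the product of the values
Geo.mean = prod(Bacteria)^(1/n)

Geo.mean

### The arithmetic mean is pulled upward by the count of 1000
mean(Bacteria)

### The median, for comparison
median(Bacteria)
```

Here the geometric mean sits close to the median of 100, while the arithmetic mean is nearly twice as large.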

hist(Bacteria, ### Produce a histogram of values
col="darkgray",
breaks="FD")

Annual return example

Geometric means are also used to calculate the average annual return for investments. Each value in this vector represents the return of the investment for one year, expressed as a proportion (for example, 0.20 is a 20% gain). The arithmetic mean will not give the correct answer for average annual return.

Return = c(-0.80, 0.20, 0.30, 0.30, 0.20, 0.30)

library(psych)

geometric.mean(Return + 1)-1

[1] -0.07344666

### The geometric mean will give the correct answer for average
### annual return

This can also be calculated manually. If you start with 100 dollars in this investment, you will end up with about 63 dollars after six years, equivalent to an average annual return of about -0.07.


100 * (1-0.80) * 1.20 * 1.30 * 1.30 * 1.20 * 1.30

[1] 63.2736
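The connection between the ending value and the geometric mean can be checked with a short sketch: compounding the geometric mean return for six years recovers the ending value, while compounding the arithmetic mean does not. Annual.geo, Annual.arith, and Years are hypothetical names.

```r
Return = c(-0.80, 0.20, 0.30, 0.30, 0.20, 0.30)

Years = length(Return)

### Geometric mean annual return, computed without the psych package
Annual.geo = prod(1 + Return)^(1/Years) - 1

### Arithmetic mean annual return
Annual.arith = mean(Return)

### Compounding the geometric mean for six years gives the actual ending value
100 * (1 + Annual.geo)^Years

### Compounding the arithmetic mean overstates the ending value
100 * (1 + Annual.arith)^Years
```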

Harmonic mean

The harmonic mean is another type of average that is used in certain situations, such as averaging rates or speeds.

To get the harmonic mean, the inverse of each value is taken, these values are averaged, and then the inverse of this result is reported.

Imagine a series of 9 speeds, called Speed here. The harmonic mean can be calculated with nested 1/x, mean, and 1/x functions. Or more simply, the harmonic.mean function in the psych package can be used.

Speed = c(20, 40, 50, 60, 100, 120, 150, 200, 1000)

1/mean(1/Speed)


63.08411

library(psych)

harmonic.mean(Speed)

63.08411
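As a sketch of why the harmonic mean suits speeds, consider a trip with two legs of equal distance driven at different speeds. The numbers below are invented for illustration. The average speed for the whole trip is total distance divided by total time, and this equals the harmonic mean of the two speeds, not their arithmetic mean.

```r
### Hypothetical trip: two legs of 60 miles each, at 30 and 60 miles per hour
Leg.distance = 60
Leg.speed    = c(30, 60)

### Time spent on each leg, in hours
Leg.time = Leg.distance / Leg.speed

### Average speed for the whole trip: total distance over total time
Trip.speed = (2 * Leg.distance) / sum(Leg.time)

Trip.speed

### Harmonic mean of the two speeds gives the same value
1/mean(1/Leg.speed)

### The arithmetic mean overstates the average speed
mean(Leg.speed)
```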

Required readings

[Video] “Understanding Summary statistics: Mean, Median, Mode” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=rAN6DBctgJ0.

Optional readings

Statistics of location: mean and median

“Statistics of central tendency” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/central.html.

"The median outclasses the mean" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/median/.

“Measures of the Location of the Data”, Section 2.3 in Openstax. 2013. Introductory Statistics. openstax.org/textbooks/introductory-statistics.

“Measures of the Center of the Data”, Section 2.5 in Openstax. 2013. Introductory Statistics. openstax.org/textbooks/introductory-statistics.

“Skewness and the Mean, Median, and Mode”, Section 2.6 in Openstax. 2013. Introductory Statistics. openstax.org/textbooks/introductory-statistics.

Statistics of variation: standard deviation and range

“Statistics of dispersion” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/dispersion.html.

“Standard error of the mean” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/standarderror.html.

“Measures of the Spread of the Data”, Section 2.7 in Openstax. 2013. Introductory Statistics. openstax.org/textbooks/introductory-statistics.

The normal distribution

[Video] “Normal Distribution” from Statistics Learning Center (Dr. Nic). 2016. www.youtube.com/watch?v=mtH1fmUVkfE.

"The normal distribution – three tricky bits" from Dr. Nic. 2016. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/the-normal-distribution/.

Exercises C

1. Considering Ren and Stimpy's workshops together,

a. How many total attendees were there?

b. How many total workshops were there?

c. How many locations were there?

d. What was the mean number of attendees?

e. What was the median number of attendees?

2. Considering Ren and Stimpy's workshops separately,

a. Which workshops had a higher mean number of attendees, and what was the mean? Ren’s or Stimpy’s?

b. Which workshops had a higher mean number of attendees, and what was the mean? Ren North or Ren South?

3. Considering Arthur and Buster's workshops together,

a. How many students attended in total?

b. How many distinct levels of education were reported?

c. What was the median level of education reported?

d. 75% of students had what level of education or lower?

e. What was the median education level for females?

f. What was the median education level for males?

g. What was the median education level for females from Elwood County?

h. What was the median education level for males from Elwood County?

4. Considering Arthur and Buster's workshops together,

a. How many students were male?

b. What percentage is this of total students?

c. How many students in the November workshop were male?

d. What percentage is this of total students in the November workshop?

e. How many male students in the November workshop were from Bear Lake County?

5. As part of a nutrition education program, extension educators had students keep diaries of what they ate for a day and then calculated the calories students consumed. The following data are the result. Rating indicates the score, from 1 to 5, that students gave as to the usefulness of the program. It should be considered an ordinal variable.


Student Teacher Gender Calories Rating
a Tetsuo male 2300 3
b Tetsuo female 1800 3
c Tetsuo male 1900 4
d Tetsuo female 1700 5
e Tetsuo male 2200 4
f Tetsuo female 1600 3
g Tetsuo male 1800 3
h Tetsuo female 2000 3
i Kaneda male 2100 4
j Kaneda female 1900 5
k Kaneda male 1900 4
l Kaneda female 1600 4
m Kaneda male 2000 4
n Kaneda female 2000 5
o Kaneda male 2100 3
p Kaneda female 1800 4

a. What are the variables in this data set and what type of variable is each? (Types are “nominal”, “ordinal”, and “interval/ratio”, not how their types are reported by R.)

For each of the following, answer the question, and show the output from the analyses you used to answer the question.

b. How many students were involved in this data set?

c. What was the mean caloric intake?

d. What was the median caloric intake?

e. What was the standard deviation of caloric intake?

f. What was the mean caloric intake for females?

g. What was the mean caloric intake for females in Kaneda’s class?

h. How many males are in Kaneda’s class?


6. What do you conclude in practical terms about the caloric intake data for Tetsuo’s and Kaneda’s classes? You might consider the mean or median values between males and females or between Tetsuo and Kaneda. Think about the real world. Are the differences you are looking at important in a practical sense?

Be sure to pay attention to the spread of data as you think about this, for example using minimums and maximums, Q1 and Q3 values, or standard deviation values.

Be quantitative in your answer. That is, if you are talking about a difference in medians, mention the actual numerical difference. If expressing values as a percentage makes sense, feel free to do so.
