First, we will examine the dataset.
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
skimr::skim(yrbss)
| Name | yrbss |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.0 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.6 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.2 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.0 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.0 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
From the glimpse function used in the chunk above, we can see there are 1004 missing values from the weight category.
yrbss %>%
ggplot(aes(x=weight)) +
geom_histogram(height = 0.5) +
labs(title = "Weight distribution", y="Count", x="Weight (kg)" )+
NULL

favstats(~weight, data=yrbss)
## min Q1 median Q3 max mean sd n missing
## 29.9 56.2 64.4 76.2 181 67.9 16.9 12579 1004
Next, consider the possible relationship between a high schooler’s weight and their physical activity.
Let’s create a new variable in the dataframe yrbss, called physical_3plus , which will be yes if they are physically active for at least 3 days a week, and no otherwise.
#reuse code from snap_insta
yrbss_new <- yrbss %>%
mutate(physical_3plus = ifelse(physically_active_7d >=3, 'yes',
'no'))
glimpse(yrbss_new)
## Rows: 13,583
## Columns: 14
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
## $ physical_3plus <chr> "yes", "no", "yes", "no", "no", "no", "yes", …
yrbss_prop <- yrbss_new %>%
filter(!is.na(physical_3plus)) %>%
group_by(physical_3plus) %>%
summarise(count = n()) %>%
mutate(prop=count/sum(count))
glimpse(yrbss_prop)
## Rows: 2
## Columns: 3
## $ physical_3plus <chr> "no", "yes"
## $ count <int> 4404, 8906
## $ prop <dbl> 0.331, 0.669
Provide a 95% confidence interval for the population proportion of high schools that are NOT active 3 or more days per week
ci_yrbss_prop <- yrbss_new %>%
group_by(physical_3plus) %>%
filter(!is.na(physical_3plus)) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
count_weight = n(),
se_weight = sd_weight / sqrt(count_weight),
ci_weight_up = mean_weight + qt(.975, count_weight-1)*se_weight ,
ci_weight_dw = mean_weight - qt(.975, count_weight-1)*se_weight
)
ci_yrbss_prop
## # A tibble: 2 × 7
## physical_3plus mean_weight sd_weight count_weight se_weight ci_weight_up
## <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 no 66.7 17.6 4404 0.266 67.2
## 2 yes 68.4 16.5 8906 0.175 68.8
## # … with 1 more variable: ci_weight_dw <dbl>
Make a boxplot of physical_3plus vs. weight.
yrbss_new %>%
filter(!is.na(physical_3plus)) %>%
ggplot(aes(x = physical_3plus, y = weight)) +
geom_boxplot()+
labs(title = 'Relationship between weight and excercising more than 3 times a week', x = 'Count of physical_3plus', y = 'Weight' )+
NULL

ci_using_formulas <- yrbss_new %>%
group_by(physical_3plus) %>%
filter(!is.na(physical_3plus)) %>%
summarise(mean_weight = mean(weight, na.rm = T),
median_weight = median(weight, na.rm = T),
sd_weight = sd(weight, na.rm = T),
count = n(),
# get t-critical value with (n-1) degrees of freedom
t_critical = qt(0.975, count-1),
se_weight = sd_weight/sqrt(count),
margin_of_error = t_critical * se_weight,
weight_low = mean_weight - margin_of_error,
weight_high = mean_weight + margin_of_error
) %>%
arrange(desc(mean_weight))
ci_using_formulas
## # A tibble: 2 × 10
## physical_3plus mean_weight median_weight sd_weight count t_critical se_weight
## <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 yes 68.4 65.8 16.5 8906 1.96 0.175
## 2 no 66.7 62.6 17.6 4404 1.96 0.266
## # … with 3 more variables: margin_of_error <dbl>, weight_low <dbl>,
## # weight_high <dbl>
\(H _0: \overline{x}_1 - \overline{x}_2 = 0\ vs.\ H_1: \overline{x}_1-\overline{x}_2 \neq 0\)
t.test(weight ~ physical_3plus, data = yrbss_new)
##
## Welch Two Sample t-test
##
## data: weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -2.42 -1.12
## sample estimates:
## mean in group no mean in group yes
## 66.7 68.4
inferobs_diff <- yrbss_new %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1.77
null_dist <- yrbss_new %>%
# specify variables
specify(weight ~ physical_3plus) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("yes", "no"))
null_dist
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 -0.573
## 2 2 -0.0306
## 3 3 -0.0364
## 4 4 0.223
## 5 5 0.117
## 6 6 -0.0646
## 7 7 0.181
## 8 8 0.126
## 9 9 -0.424
## 10 10 0.457
## # … with 990 more rows
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()+
labs(title = "Distribution for the null hypothesis", x = "Stat",y = "Count")+
NULL

To see how many of these null permutations have a difference of at least obs_stat of 1.77
obs_stat <- obs_diff %>%
pull() %>%
round(2)
obs_stat
## [1] 1.77
We can also calculate the p-value for your hypothesis test using the function infer::get_p_value().
null_dist %>% visualize() +
shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
This the standard workflow for performing hypothesis tests.
\(H_0: \mu_A - \mu_B = 0\) \(H_1 : \mu_A - \mu_B \neq 0\)
Load the data and examine its structure
movies <- read_csv(here::here("data1", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
#calculate the week4 and week5 confidence intervals
rating_comparison <- movies %>%
filter(director=="Steven Spielberg" | director=="Tim Burton") %>%
group_by(director) %>%
summarise(avg_rating = mean(rating),
sd_rating = sd(rating, na.rm=TRUE),
count_rating = n(),
se_rating = sd_rating / sqrt(count_rating),
ci_rating_up = avg_rating + qt(.975, count_rating-1)*se_rating,
ci_rating_dw = avg_rating - qt(.975, count_rating-1)*se_rating
)
#plot the confidence intervals
rating_comparison %>%
ggplot(aes(x=avg_rating, y=director, color=director))+
geom_rect(fill="grey",alpha=0.5, color = "grey",
aes(xmin=max(ci_rating_dw),
xmax=min(ci_rating_up),
ymin=-Inf,
ymax=Inf))+
geom_errorbarh(aes(xmin=ci_rating_dw,xmax=ci_rating_up))+
geom_point(aes(x=avg_rating, y=director), size=3)+
geom_text(aes(label=round(avg_rating, digits=2)), vjust = -1.5)+
labs(title="Do Spielberg and Burton have the same mean IMDB ratings?",
subtitle = "95% confidence intervals overlap",
x = "Mean IMDB Rating")

#calculate via t-test
comparison_rating <- movies %>%
filter(director=="Steven Spielberg" | director=="Tim Burton")
t.test(rating ~ director, data = comparison_rating)
##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
#calculate using infer package
set.seed(1234)
ratings_in_null <- comparison_rating %>%
specify(rating ~ director) %>%
hypothesise(null="independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in means", order = c("Tim Burton",
"Steven Spielberg"))
#calculate the observed difference
observed_difference <- comparison_rating %>%
specify(rating ~ director) %>%
calculate(stat = "diff in means")
#calculate the p-value of the differences
ratings_in_null %>%
get_pvalue(obs_stat=observed_difference, direction="both")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.02
omega <- read_csv(here::here("data1", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size,
# the t-critical value, the standard error, the margin of error,
# and the low/high endpoints of a 95% condifence interval
salary_gender <- omega %>%
group_by(gender) %>%
summarise (mean = mean(salary), SD = sd(salary), sample_size = n()) %>%
mutate(se = sqrt(SD^2/sample_size), t_value = qt(p=.05/2, df=sample_size-1, lower.tail=FALSE),
margin_of_error = t_value*se, salary_low = mean-t_value*se, salary_high = mean+t_value*se)
salary_gender
## # A tibble: 2 × 9
## gender mean SD sample_size se t_value margin_of_error salary_low
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 female 64543. 7567. 26 1484. 2.06 3056. 61486.
## 2 male 73239. 7463. 24 1523. 2.07 3151. 70088.
## # … with 1 more variable: salary_high <dbl>
Run a hypothesis testing, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money.
# hypothesis testing using t.test()
t.test(salary ~ gender, data = omega)
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
set.seed(1234)
salary_gender_boot <- omega %>%
# Specify the variable of interest "salary" and group by gender
specify(salary ~ gender) %>%
# Hypothesize a null of no (or zero) difference
hypothesize (null = "independence") %>%
# Generate a bunch of simualted samples
generate (reps = 1000, type = "permute") %>%
# Find the mean diffference of each sample
calculate(stat = "diff in means",
order = c("female", "male"))
# Select the low and high endpoint from the formula-calculated CIs
formula_ci <- salary_gender %>%
select(salary_low,salary_high)
# Generate 95% percentile of the difference in the two genders' salaries from the bootstrap data
percentile_ci <- salary_gender_boot %>%
get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -4829. 5025.
observed_difference <- salary_gender$mean[1]-salary_gender$mean[2]
visualize(salary_gender_boot) +
annotate("rect", xmin=-Inf, xmax=observed_difference, ymin=0, ymax=Inf, alpha=0.3, fill = "pink") +
annotate("rect", xmin=-observed_difference, xmax=Inf, ymin=0, ymax=Inf, alpha=0.3, fill = "pink") +
#shade_ci(endpoints = percentile_ci,fill = "khaki")+
labs(title='Differences in Female and Male Mean Salary in a world where there is no difference',
subtitle = "Observed difference marked in red",
x = "Mean (female salary - male salary)", y = "Count")+
geom_vline(xintercept = observed_difference, colour = "red", linetype="solid", size=1.2)+
theme_bw()+
NULL

salary_gender_boot %>%
get_pvalue(obs_stat = observed_difference, direction = "both")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)
## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
t.test(experience ~ gender, data = omega)
##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
salary_exp <- omega %>%
ggplot(aes(x = experience, y = salary))+
geom_point()+
labs(title = "Relationship between salary and number of years of experience", x = "Year(s) of experience",y = "Salary")+
theme_bw()+
NULL
salary_exp

omega %>%
select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()

#Import the csv file
brexit <- read_csv(here::here("data1", "brexit_results.csv"))
glimpse(brexit) # examine the data frame
## Rows: 632
## Columns: 11
## $ Seat <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015 <dbl> 50.6, 52.0, 53.0, 44.0, 60.8, 22.4, 52.5, 22.1, 50.7, 53.0…
## $ lab_2015 <dbl> 18.3, 22.4, 26.7, 34.8, 11.2, 41.0, 18.4, 49.8, 15.1, 21.3…
## $ ld_2015 <dbl> 8.82, 3.37, 8.38, 2.98, 7.19, 14.83, 5.98, 2.42, 10.62, 5.…
## $ ukip_2015 <dbl> 17.87, 19.62, 8.01, 15.89, 14.44, 21.41, 18.82, 21.76, 19.…
## $ leave_share <dbl> 57.9, 67.8, 38.6, 65.3, 49.7, 70.5, 59.9, 61.8, 51.8, 50.3…
## $ born_in_uk <dbl> 83.1, 96.1, 90.5, 97.3, 93.3, 97.0, 90.5, 90.7, 87.0, 88.8…
## $ male <dbl> 49.9, 48.9, 48.9, 49.2, 48.0, 49.2, 48.5, 49.2, 49.5, 49.5…
## $ unemployed <dbl> 3.64, 4.55, 3.04, 4.26, 2.47, 4.74, 3.69, 5.11, 3.39, 2.93…
## $ degree <dbl> 13.87, 9.97, 28.60, 9.34, 18.78, 6.09, 13.12, 7.90, 17.80,…
## $ age_18to24 <dbl> 9.41, 7.33, 6.44, 7.75, 5.73, 8.21, 7.82, 8.94, 7.56, 7.61…
temp_brexit <- brexit %>%
select(con_2015, lab_2015, ld_2015, ukip_2015, leave_share) %>%
pivot_longer(cols = 1:4, names_to = "party", values_to = "party_percentage")
cols <- c("con_2015" = "#0087dc", "lab_2015" = "#d50000", "ld_2015" = "#FDBB30", "ukip_2015" = "#EFE600")
temp_brexit %>%
ggplot(aes(x=party_percentage,y=leave_share, group=party, color=party)) +
geom_point(alpha = 0.5)+
geom_smooth(method=lm, size = 0.5)+
labs(title = "How political affiliation translated to Brexit Voting", x = "Party % in the UK 2015 general election", y = "Leave % in the 2016 Brexit referendum")+
scale_colour_manual(labels = c("Conservative", "Labour","Lib Dems","UKIP"), values = cols)+
theme_bw()+
theme(legend.position = "bottom",plot.title = element_text(face = "bold", size = 13))+
theme(legend.title = element_blank())+
NULL

UN_GDP_data <- read_excel(here::here("data1", "Download-GDPconstant-USD-countries.xls"), # Excel filename
sheet="Download-GDPconstant-USD-countr", # Sheet name
skip=2) # Number of rows to skip
# Tidying data - Making data long and expressing all figures in billions
tidy_GDP_data_1 <- UN_GDP_data %>%
pivot_longer(4:51, names_to = "year", values_to = "indicator_data") %>%
mutate(indicator_data = indicator_data / 1000000000)
# Renaming indicators into shorter counterparts
tidy_GDP_data_1 <- tidy_GDP_data_1
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Exports of goods and services"] <- "Exports"
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "General government final consumption expenditure"] <- "Government expenditure"
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Household consumption expenditure (including Non-profit institutions serving households)"] <- "Household expenditure"
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Imports of goods and services"] <- "Imports"
glimpse(tidy_GDP_data_1)
## Rows: 176,880
## Columns: 5
## $ CountryID <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
## $ IndicatorName <chr> "Final consumption expenditure", "Final consumption exp…
## $ year <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976",…
## $ indicator_data <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")
# Required Indicators names for graph
list_indicator_name = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")
tidy_GDP_data_1 %>%
filter(Country%in% country_list) %>% #Filtering by aforementioned country_list
filter(IndicatorName %in% list_indicator_name ) %>% #Filtering by list_indicator_name
group_by(IndicatorName) %>%
# Line plot for GDP over time graph
ggplot(aes(x = year, y = indicator_data, color = IndicatorName, group = IndicatorName)) +
geom_line(aes(x = year, y = indicator_data, color = IndicatorName)) +
facet_wrap(~ Country) +
theme_bw() +
theme(legend.position="right") +
scale_x_discrete(breaks = seq(1970, 2017, 10)) +
scale_color_discrete("Components of GDP", breaks = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")) +
labs(title = "GDP Components over time", subtitle = "In constant 2010 USD", x = "", y = "Billion US$") +
theme(plot.title = element_text(face="bold"))

#converting to tidy wide format
tidy_GDP_data_2 <- tidy_GDP_data_1%>%
pivot_wider(
names_from = IndicatorName,
values_from = indicator_data)
glimpse(tidy_GDP_data_2)
## Rows: 10,560
## Columns: 20
## $ CountryID <dbl> …
## $ Country <chr> …
## $ year <chr> …
## $ `Final consumption expenditure` <dbl> …
## $ `Household expenditure` <dbl> …
## $ `Government expenditure` <dbl> …
## $ `Gross capital formation` <dbl> …
## $ `Gross fixed capital formation (including Acquisitions less disposals of valuables)` <dbl> …
## $ Exports <dbl> …
## $ Imports <dbl> …
## $ `Gross Domestic Product (GDP)` <dbl> …
## $ `Agriculture, hunting, forestry, fishing (ISIC A-B)` <dbl> …
## $ `Mining, Manufacturing, Utilities (ISIC C-E)` <dbl> …
## $ `Manufacturing (ISIC D)` <dbl> …
## $ `Construction (ISIC F)` <dbl> …
## $ `Wholesale, retail trade, restaurants and hotels (ISIC G-H)` <dbl> …
## $ `Transport, storage and communication (ISIC I)` <dbl> …
## $ `Other Activities (ISIC J-P)` <dbl> …
## $ `Total Value Added` <dbl> …
## $ `Changes in inventories` <dbl> …
# Using GDP formula to calculate GDP using GDP components - the sum of Household Expenditure (Consumption *C*), Gross Capital Formation (business investment *I*), Government Expenditure (G) and Net Exports (exports - imports)
tidy_GDP_data_2 <- tidy_GDP_data_2 %>%
mutate(calculated_GDP = tidy_GDP_data_2$`Household expenditure`+ tidy_GDP_data_2$`Gross capital formation`+ tidy_GDP_data_2$`Government expenditure`+ tidy_GDP_data_2$`Exports`- tidy_GDP_data_2$`Imports`)
#Calculating % difference between calculated GDP and UN_GDP_data GDP Figures
tidy_GDP_data_2 <- tidy_GDP_data_2 %>%
mutate(percentage_difference_calculatedGDP_columnGDP = (((calculated_GDP / tidy_GDP_data_2$`Gross Domestic Product (GDP)`)-1) * 100))
tidy_GDP_data_2
## # A tibble: 10,560 × 22
## CountryID Country year `Final consumpt… `Household expe… `Government exp…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 4 Afghanistan 1970 5.56 5.07 0.372
## 2 4 Afghanistan 1971 5.33 4.84 0.382
## 3 4 Afghanistan 1972 5.20 4.70 0.402
## 4 4 Afghanistan 1973 5.75 5.21 0.421
## 5 4 Afghanistan 1974 6.15 5.59 0.431
## 6 4 Afghanistan 1975 6.32 5.65 0.598
## 7 4 Afghanistan 1976 6.37 5.68 0.627
## 8 4 Afghanistan 1977 6.90 6.15 0.676
## 9 4 Afghanistan 1978 7.09 6.30 0.725
## 10 4 Afghanistan 1979 6.92 6.15 0.708
## # … with 10,550 more rows, and 16 more variables:
## # Gross capital formation <dbl>,
## # Gross fixed capital formation (including Acquisitions less disposals of valuables) <dbl>,
## # Exports <dbl>, Imports <dbl>, Gross Domestic Product (GDP) <dbl>,
## # Agriculture, hunting, forestry, fishing (ISIC A-B) <dbl>,
## # Mining, Manufacturing, Utilities (ISIC C-E) <dbl>,
## # Manufacturing (ISIC D) <dbl>, Construction (ISIC F) <dbl>, …
# Plotting a graph, utilizing Percentage Difference of Calculated GDP and UN_GDP_data GDP Figures to compare this difference across Germany, India, and United States
tidy_GDP_data_2 %>%
filter(Country %in% country_list) %>%
ggplot(aes(x=year, y=percentage_difference_calculatedGDP_columnGDP)) +
geom_line(group = 1, color = "black", size = 0.2) + geom_line(group = 1,color = "black",y=0, size = 0.2) +
facet_wrap(~ Country) +
theme_bw() +
scale_x_discrete(breaks = seq(1970, 2017, 10)) +
geom_ribbon(aes(ymin = 0, ymax = pmin(0, percentage_difference_calculatedGDP_columnGDP), group=1), alpha=0.2, fill = "blue") +
geom_ribbon(aes(ymin = percentage_difference_calculatedGDP_columnGDP, ymax = pmin(0, percentage_difference_calculatedGDP_columnGDP), group=1), alpha=0.2, fill = "yellow") +
labs(title = "Percentage Difference of Calculated GDP and UN_GDP_data GDP Figures",
subtitle = "Yellow: Calculated GDP > UN_GDP_data GDP
Blue: Calculated GDP < UN_GDP_data GDP",
x = "Time period (1970-2017)",
y = "Percentage Difference",
)+
NULL

In the case of India, from the period of 1970 to 1990, the graph depicts a yellow region wherein calculated GDP is higher than the UN_GDP_data GDP figures. However, post 1990 (excluding 2007 and 2010), the graph depicts a blue region wherein calculated GDP is lower than the UN_GDP_data GDP figures. In 2007, the % difference between what you calculated as GDP and the GDP figure is 2.1309% and in 2010, the % difference between what you calculated as GDP and the GDP figure is 1.0660%
In the case of Germany, the graph depicts a yellow region throughout from 1970 to 2017 wherein calculated GDP is higher than the UN_GDP_data GDP figures.The % difference between calculated GDP and the GDP figure was maximum in 1972 at 3.56e+00. In the United States, the graph depicts a mostly yellow region i.e.calculated GDP is higher than the UN_GDP_data GDP figures, excluding the time period from 2005-2008, 2010, and 2016-2017 which depicted a blue region i.e. calculated GDP is lower than the UN_GDP_data GDP figures.
country_list_1 <- c("Austria","United Kingdom", "Italy")
list_indicator_name = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")
tidy_GDP_data_1 %>%
filter(Country%in% country_list_1) %>% #Filtering by aforementioned country_list
filter(IndicatorName %in% list_indicator_name ) %>% #Filtering by list_indicator_name
group_by(IndicatorName) %>%
# Line plot for GDP over time graph
ggplot(aes(x = year, y = indicator_data, color = IndicatorName, group = IndicatorName)) +
geom_line(aes(x = year, y = indicator_data, color = IndicatorName)) +
facet_wrap(~ Country) +
theme_bw() +
theme(legend.position="right") +
scale_x_discrete(breaks = seq(1970, 2017, 10)) +
scale_color_discrete("Components of GDP", breaks = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")) +
labs(title = "GDP Components over time", subtitle = "In constant 2010 USD", x = "", y = "Billion US$") +
theme(plot.title = element_text(face="bold"))
