Am01 applied statistics homework

Homework 3

Homework 3

Youth Risk Behavior Surveillance

Load the data

First, we will examine the dataset.

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
skimr::skim(yrbss)
Table 1: Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅

Exploratory Data Analysis

From the glimpse function used in the chunk above, we can see there are 1004 missing values from the weight category.

yrbss %>% 
  ggplot(aes(x=weight)) +
  geom_histogram(height = 0.5) +
  labs(title = "Weight distribution", y="Count", x="Weight (kg)" )+
  NULL

favstats(~weight, data=yrbss)
##   min   Q1 median   Q3 max mean   sd     n missing
##  29.9 56.2   64.4 76.2 181 67.9 16.9 12579    1004

Next, consider the possible relationship between a high schooler’s weight and their physical activity.

Let’s create a new variable in the dataframe yrbss, called physical_3plus , which will be yes if they are physically active for at least 3 days a week, and no otherwise.

#reuse code from snap_insta
yrbss_new <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d >=3, 'yes',
                                    'no'))

glimpse(yrbss_new)
## Rows: 13,583
## Columns: 14
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
## $ physical_3plus           <chr> "yes", "no", "yes", "no", "no", "no", "yes", …
yrbss_prop <- yrbss_new %>%
  filter(!is.na(physical_3plus)) %>% 
  group_by(physical_3plus) %>% 
  summarise(count = n()) %>% 
  mutate(prop=count/sum(count))
  
glimpse(yrbss_prop)
## Rows: 2
## Columns: 3
## $ physical_3plus <chr> "no", "yes"
## $ count          <int> 4404, 8906
## $ prop           <dbl> 0.331, 0.669

Provide a 95% confidence interval for the population proportion of high schools that are NOT active 3 or more days per week

ci_yrbss_prop <- yrbss_new %>% 
  group_by(physical_3plus) %>% 
  filter(!is.na(physical_3plus)) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            sd_weight = sd(weight, na.rm = TRUE),
            count_weight = n(),
            se_weight = sd_weight / sqrt(count_weight),
            ci_weight_up = mean_weight + qt(.975, count_weight-1)*se_weight ,
            ci_weight_dw = mean_weight - qt(.975, count_weight-1)*se_weight 
            )  

ci_yrbss_prop
## # A tibble: 2 × 7
##   physical_3plus mean_weight sd_weight count_weight se_weight ci_weight_up
##   <chr>                <dbl>     <dbl>        <int>     <dbl>        <dbl>
## 1 no                    66.7      17.6         4404     0.266         67.2
## 2 yes                   68.4      16.5         8906     0.175         68.8
## # … with 1 more variable: ci_weight_dw <dbl>

Make a boxplot of physical_3plus vs. weight.

yrbss_new %>% 
  filter(!is.na(physical_3plus)) %>% 
  ggplot(aes(x = physical_3plus, y = weight)) +
  geom_boxplot()+
  labs(title = 'Relationship between weight and excercising more than 3 times a week', x = 'Count of physical_3plus', y = 'Weight' )+
  NULL

  • We would expect the kids which are physically active for 3 or more hours a week to weigh less on average than those who exercise less. This is the result we expect because as they are more physically active they consume more calories and so should weigh less. It appears that kids who do exercise tend to weigh more, the opposite of our prediction. There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is at least 95% statistically significant.

Confidence Interval

ci_using_formulas <- yrbss_new %>%
  group_by(physical_3plus) %>% 
  filter(!is.na(physical_3plus)) %>% 
  summarise(mean_weight = mean(weight, na.rm = T),
            median_weight = median(weight, na.rm = T),
            sd_weight = sd(weight, na.rm = T),
            count = n(),
            # get t-critical value with (n-1) degrees of freedom
            t_critical = qt(0.975, count-1),
            se_weight = sd_weight/sqrt(count),
            margin_of_error = t_critical * se_weight,
            weight_low = mean_weight - margin_of_error,
            weight_high = mean_weight + margin_of_error
  ) %>% 
  arrange(desc(mean_weight))

ci_using_formulas
## # A tibble: 2 × 10
##   physical_3plus mean_weight median_weight sd_weight count t_critical se_weight
##   <chr>                <dbl>         <dbl>     <dbl> <int>      <dbl>     <dbl>
## 1 yes                   68.4          65.8      16.5  8906       1.96     0.175
## 2 no                    66.7          62.6      17.6  4404       1.96     0.266
## # … with 3 more variables: margin_of_error <dbl>, weight_low <dbl>,
## #   weight_high <dbl>

Hypothesis test with formula

\(H _0: \overline{x}_1 - \overline{x}_2 = 0\ vs.\ H_1: \overline{x}_1-\overline{x}_2 \neq 0\)

t.test(weight ~ physical_3plus, data = yrbss_new)
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -2.42 -1.12
## sample estimates:
##  mean in group no mean in group yes 
##              66.7              68.4

Hypothesis test with infer

obs_diff <- yrbss_new %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.77
null_dist <- yrbss_new %>%
  # specify variables
  specify(weight ~ physical_3plus) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.573 
##  2         2 -0.0306
##  3         3 -0.0364
##  4         4  0.223 
##  5         5  0.117 
##  6         6 -0.0646
##  7         7  0.181 
##  8         8  0.126 
##  9         9 -0.424 
## 10        10  0.457 
## # … with 990 more rows

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()+
  labs(title = "Distribution for the null hypothesis", x = "Stat",y = "Count")+
  NULL

To see how many of these null permutations have a difference of at least obs_stat of 1.77

obs_stat <- obs_diff %>% 
  pull() %>% 
  round(2)

obs_stat
## [1] 1.77

We can also calculate the p-value for your hypothesis test using the function infer::get_p_value().

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

This the standard workflow for performing hypothesis tests.

IMDB ratings: Differences between directors

\(H_0: \mu_A - \mu_B = 0\) \(H_1 : \mu_A - \mu_B \neq 0\)

  • t-stat is equal to 3 and p-value is equal to 0.01. Therefore, we can reject the null hypothesis. Therefore, we can conclude that we are 95% confident that movies of Stephen Spielberg have higher average ratings than those of Tim Burton.

Load the data and examine its structure

movies <- read_csv(here::here("data1", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
#calculate the week4 and week5 confidence intervals
rating_comparison <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton") %>% 
  group_by(director) %>% 
  summarise(avg_rating = mean(rating),
            sd_rating = sd(rating, na.rm=TRUE),
            count_rating = n(),
            se_rating = sd_rating / sqrt(count_rating),
            ci_rating_up = avg_rating + qt(.975, count_rating-1)*se_rating,
            ci_rating_dw = avg_rating - qt(.975, count_rating-1)*se_rating
            )

#plot the confidence intervals
rating_comparison %>% 
  ggplot(aes(x=avg_rating, y=director, color=director))+
    geom_rect(fill="grey",alpha=0.5, color = "grey",
            aes(xmin=max(ci_rating_dw),
                xmax=min(ci_rating_up),
                ymin=-Inf,
                ymax=Inf))+
  geom_errorbarh(aes(xmin=ci_rating_dw,xmax=ci_rating_up))+
  geom_point(aes(x=avg_rating, y=director), size=3)+
  geom_text(aes(label=round(avg_rating, digits=2)), vjust = -1.5)+
  labs(title="Do Spielberg and Burton have the same mean IMDB ratings?",
       subtitle = "95% confidence intervals overlap",
       x = "Mean IMDB Rating")

#calculate via t-test
comparison_rating <- movies %>% 
  filter(director=="Steven Spielberg" | director=="Tim Burton")

t.test(rating ~ director, data = comparison_rating)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
#calculate using infer package
set.seed(1234)
ratings_in_null <- comparison_rating %>% 
  
  specify(rating ~ director) %>% 
  
  hypothesise(null="independence") %>% 
  
  generate(reps = 100, type = "permute") %>%
  
  calculate(stat = "diff in means", order = c("Tim Burton",
                                              "Steven Spielberg"))

#calculate the observed difference
observed_difference <- comparison_rating %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means")

#calculate the p-value of the differences
ratings_in_null %>% 
  get_pvalue(obs_stat=observed_difference, direction="both")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.02

Omega Group plc- Pay Discrimination

Loading the data

omega <- read_csv(here::here("data1", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

Relationship Salary - Gender ?

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size, 
# the t-critical value, the standard error, the margin of error, 
# and the low/high endpoints of a 95% condifence interval

salary_gender <- omega %>% 
  group_by(gender) %>% 
  summarise (mean = mean(salary), SD = sd(salary), sample_size = n()) %>% 
  mutate(se = sqrt(SD^2/sample_size), t_value = qt(p=.05/2, df=sample_size-1, lower.tail=FALSE),
         margin_of_error = t_value*se, salary_low = mean-t_value*se, salary_high = mean+t_value*se)

salary_gender
## # A tibble: 2 × 9
##   gender   mean    SD sample_size    se t_value margin_of_error salary_low
##   <chr>   <dbl> <dbl>       <int> <dbl>   <dbl>           <dbl>      <dbl>
## 1 female 64543. 7567.          26 1484.    2.06           3056.     61486.
## 2 male   73239. 7463.          24 1523.    2.07           3151.     70088.
## # … with 1 more variable: salary_high <dbl>
  • The 95% confidence interval for female is from 61486 to 67599, while that for male is from 70088 to 76490. Since their confidence intervals do not have any overlap, it can be concluded that the null hypothesis can be rejected. There is a significant difference in the mean of salary for female and male.

Run a hypothesis testing, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package
set.seed(1234)

salary_gender_boot <- omega %>% 
  # Specify the variable of interest "salary" and group by gender
  specify(salary ~ gender) %>% 
  
  # Hypothesize a null of no (or zero) difference
  hypothesize (null = "independence") %>% 
  
  # Generate a bunch of simualted samples
  generate (reps = 1000, type = "permute") %>% 
  
  # Find the mean diffference of each sample
  calculate(stat = "diff in means",
            order = c("female", "male"))


# Select the low and high endpoint from the formula-calculated CIs
formula_ci <- salary_gender %>%
  select(salary_low,salary_high)

# Generate 95% percentile of the difference in the two genders' salaries from the bootstrap data
percentile_ci <- salary_gender_boot %>% 
  get_confidence_interval(level = 0.95, type = "percentile")

percentile_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   -4829.    5025.
observed_difference <- salary_gender$mean[1]-salary_gender$mean[2]

visualize(salary_gender_boot) + 
  annotate("rect", xmin=-Inf, xmax=observed_difference, ymin=0, ymax=Inf, alpha=0.3, fill = "pink") +
  annotate("rect", xmin=-observed_difference, xmax=Inf, ymin=0, ymax=Inf, alpha=0.3, fill  = "pink") +
  #shade_ci(endpoints = percentile_ci,fill = "khaki")+
  labs(title='Differences in Female and Male Mean Salary in a world where there is no difference', 
       subtitle = "Observed difference marked in red",
       x = "Mean (female salary - male salary)", y = "Count")+
  geom_vline(xintercept = observed_difference, colour = "red", linetype="solid", size=1.2)+
  theme_bw()+
  NULL

salary_gender_boot %>% 
  get_pvalue(obs_stat = observed_difference, direction = "both")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
  • With bootstrap, the confidence interval for the difference in the two genders’ salary is constructed, while in the null world. As a result, this CI does not include the observed difference in real world, which means that the null hypothesis should be rejected. There is a significant difference in the two genders’ salaries.

Relationship Experience - Gender?

# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0
t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12
  • The t-stat value is -5, which has a larger absolute value than 1.96, indicating that there is a significant difference in the two genders’ experience.

Relationship Salary - Experience ?

salary_exp <- omega %>% 
  ggplot(aes(x = experience, y = salary))+
  geom_point()+
  labs(title = "Relationship between salary and number of years of experience", x = "Year(s) of experience",y = "Salary")+
  theme_bw()+
  NULL

salary_exp

Check correlations between the data

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

  • Generally, the distribution of years of experience for male is more widely distributed than female in the scatter plot. There is also more female than male at 0 year of experience, and there is no female with more than 30 years of experience. Overall, there is a positive relation between experience and salary. It can be seen in the gender - experience box plot that male has a higher mean value for years of experience than female, and in the gender - salary box plot male also has a higher mean salary, which is predicted. However, the difference in mean salary is smaller than that in mean experience. This indicate that salary is less likely to be dependent on gender. Moreover, gender has even narrowed the gap between the difference in the two genders’ experience. While there is no female within the 95% CI that has a higher experience level than male in their 95% CI, the CI for salaries do overlap.

Challenge 2: Brexit plot

#Import the csv file

brexit <- read_csv(here::here("data1", "brexit_results.csv"))
glimpse(brexit) # examine the data frame
## Rows: 632
## Columns: 11
## $ Seat        <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015    <dbl> 50.6, 52.0, 53.0, 44.0, 60.8, 22.4, 52.5, 22.1, 50.7, 53.0…
## $ lab_2015    <dbl> 18.3, 22.4, 26.7, 34.8, 11.2, 41.0, 18.4, 49.8, 15.1, 21.3…
## $ ld_2015     <dbl> 8.82, 3.37, 8.38, 2.98, 7.19, 14.83, 5.98, 2.42, 10.62, 5.…
## $ ukip_2015   <dbl> 17.87, 19.62, 8.01, 15.89, 14.44, 21.41, 18.82, 21.76, 19.…
## $ leave_share <dbl> 57.9, 67.8, 38.6, 65.3, 49.7, 70.5, 59.9, 61.8, 51.8, 50.3…
## $ born_in_uk  <dbl> 83.1, 96.1, 90.5, 97.3, 93.3, 97.0, 90.5, 90.7, 87.0, 88.8…
## $ male        <dbl> 49.9, 48.9, 48.9, 49.2, 48.0, 49.2, 48.5, 49.2, 49.5, 49.5…
## $ unemployed  <dbl> 3.64, 4.55, 3.04, 4.26, 2.47, 4.74, 3.69, 5.11, 3.39, 2.93…
## $ degree      <dbl> 13.87, 9.97, 28.60, 9.34, 18.78, 6.09, 13.12, 7.90, 17.80,…
## $ age_18to24  <dbl> 9.41, 7.33, 6.44, 7.75, 5.73, 8.21, 7.82, 8.94, 7.56, 7.61…
temp_brexit <- brexit %>% 
  select(con_2015, lab_2015, ld_2015, ukip_2015, leave_share) %>% 
  pivot_longer(cols = 1:4, names_to = "party", values_to = "party_percentage")

cols <- c("con_2015" = "#0087dc", "lab_2015" = "#d50000", "ld_2015" = "#FDBB30", "ukip_2015" = "#EFE600")

temp_brexit %>% 
  ggplot(aes(x=party_percentage,y=leave_share, group=party, color=party)) +
  geom_point(alpha = 0.5)+
  geom_smooth(method=lm, size = 0.5)+
  labs(title = "How political affiliation translated to Brexit Voting", x = "Party % in the UK 2015 general election", y = "Leave % in the 2016 Brexit referendum")+
  scale_colour_manual(labels = c("Conservative", "Labour","Lib Dems","UKIP"), values = cols)+
  theme_bw()+
  theme(legend.position = "bottom",plot.title = element_text(face = "bold", size = 13))+
  theme(legend.title = element_blank())+
  NULL

Challenge 3:GDP components over time and among countries

UN_GDP_data  <-  read_excel(here::here("data1", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip
# Tidying data - Making data long and expressing all figures in billions
tidy_GDP_data_1 <- UN_GDP_data %>%
  pivot_longer(4:51, names_to = "year", values_to = "indicator_data") %>%
  mutate(indicator_data = indicator_data / 1000000000)
# Renaming indicators into shorter counterparts
tidy_GDP_data_1 <- tidy_GDP_data_1
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Exports of goods and services"] <-  "Exports"
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "General government final consumption expenditure"] <-  "Government expenditure" 
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Household consumption expenditure (including Non-profit institutions serving households)"] <-  "Household expenditure" 
tidy_GDP_data_1$IndicatorName[tidy_GDP_data_1$IndicatorName == "Imports of goods and services"] <-  "Imports"

glimpse(tidy_GDP_data_1)
## Rows: 176,880
## Columns: 5
## $ CountryID      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ Country        <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
## $ IndicatorName  <chr> "Final consumption expenditure", "Final consumption exp…
## $ year           <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976",…
## $ indicator_data <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")
# Required Indicators names for graph
list_indicator_name = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")
tidy_GDP_data_1 %>%
  filter(Country%in% country_list) %>% #Filtering by aforementioned country_list
  filter(IndicatorName %in% list_indicator_name ) %>% #Filtering by list_indicator_name
  group_by(IndicatorName) %>%  
  # Line plot for GDP over time graph
  ggplot(aes(x = year, y = indicator_data, color = IndicatorName, group = IndicatorName)) +
  geom_line(aes(x = year, y = indicator_data, color = IndicatorName)) +
  facet_wrap(~ Country) +
  theme_bw() +
  theme(legend.position="right") + 
   scale_x_discrete(breaks = seq(1970, 2017, 10)) +
   scale_color_discrete("Components of GDP", breaks = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")) + 
  labs(title = "GDP Components over time", subtitle = "In constant 2010 USD", x = "", y = "Billion US$") +
  theme(plot.title = element_text(face="bold")) 

#converting to tidy wide format 
tidy_GDP_data_2 <- tidy_GDP_data_1%>%
  pivot_wider(
    names_from = IndicatorName, 
    values_from = indicator_data)
glimpse(tidy_GDP_data_2)
## Rows: 10,560
## Columns: 20
## $ CountryID                                                                            <dbl> …
## $ Country                                                                              <chr> …
## $ year                                                                                 <chr> …
## $ `Final consumption expenditure`                                                      <dbl> …
## $ `Household expenditure`                                                              <dbl> …
## $ `Government expenditure`                                                             <dbl> …
## $ `Gross capital formation`                                                            <dbl> …
## $ `Gross fixed capital formation (including Acquisitions less disposals of valuables)` <dbl> …
## $ Exports                                                                              <dbl> …
## $ Imports                                                                              <dbl> …
## $ `Gross Domestic Product (GDP)`                                                       <dbl> …
## $ `Agriculture, hunting, forestry, fishing (ISIC A-B)`                                 <dbl> …
## $ `Mining, Manufacturing, Utilities (ISIC C-E)`                                        <dbl> …
## $ `Manufacturing (ISIC D)`                                                             <dbl> …
## $ `Construction (ISIC F)`                                                              <dbl> …
## $ `Wholesale, retail trade, restaurants and hotels (ISIC G-H)`                         <dbl> …
## $ `Transport, storage and communication (ISIC I)`                                      <dbl> …
## $ `Other Activities (ISIC J-P)`                                                        <dbl> …
## $ `Total Value Added`                                                                  <dbl> …
## $ `Changes in inventories`                                                             <dbl> …
# Using GDP formula to calculate GDP using GDP components - the sum of Household Expenditure (Consumption *C*), Gross Capital Formation (business investment *I*), Government Expenditure (G) and Net Exports (exports - imports)
tidy_GDP_data_2 <- tidy_GDP_data_2 %>%
  mutate(calculated_GDP = tidy_GDP_data_2$`Household expenditure`+ tidy_GDP_data_2$`Gross capital formation`+ tidy_GDP_data_2$`Government expenditure`+ tidy_GDP_data_2$`Exports`- tidy_GDP_data_2$`Imports`)
#Calculating % difference between calculated GDP and UN_GDP_data GDP Figures
tidy_GDP_data_2 <- tidy_GDP_data_2 %>%
  mutate(percentage_difference_calculatedGDP_columnGDP = (((calculated_GDP / tidy_GDP_data_2$`Gross Domestic Product (GDP)`)-1) * 100))

tidy_GDP_data_2
## # A tibble: 10,560 × 22
##    CountryID Country     year  `Final consumpt… `Household expe… `Government exp…
##        <dbl> <chr>       <chr>            <dbl>            <dbl>            <dbl>
##  1         4 Afghanistan 1970              5.56             5.07            0.372
##  2         4 Afghanistan 1971              5.33             4.84            0.382
##  3         4 Afghanistan 1972              5.20             4.70            0.402
##  4         4 Afghanistan 1973              5.75             5.21            0.421
##  5         4 Afghanistan 1974              6.15             5.59            0.431
##  6         4 Afghanistan 1975              6.32             5.65            0.598
##  7         4 Afghanistan 1976              6.37             5.68            0.627
##  8         4 Afghanistan 1977              6.90             6.15            0.676
##  9         4 Afghanistan 1978              7.09             6.30            0.725
## 10         4 Afghanistan 1979              6.92             6.15            0.708
## # … with 10,550 more rows, and 16 more variables:
## #   Gross capital formation <dbl>,
## #   Gross fixed capital formation (including Acquisitions less disposals of valuables) <dbl>,
## #   Exports <dbl>, Imports <dbl>, Gross Domestic Product (GDP) <dbl>,
## #   Agriculture, hunting, forestry, fishing (ISIC A-B) <dbl>,
## #   Mining, Manufacturing, Utilities (ISIC C-E) <dbl>,
## #   Manufacturing (ISIC D) <dbl>, Construction (ISIC F) <dbl>, …
# Plotting a graph, utilizing Percentage Difference of Calculated GDP and UN_GDP_data GDP Figures to compare this difference across Germany, India, and United States
tidy_GDP_data_2 %>%
  filter(Country %in% country_list) %>%
  ggplot(aes(x=year, y=percentage_difference_calculatedGDP_columnGDP)) +
  geom_line(group = 1, color = "black", size = 0.2) + geom_line(group = 1,color = "black",y=0, size = 0.2) +
  facet_wrap(~ Country) +
  theme_bw() +
  scale_x_discrete(breaks = seq(1970, 2017, 10)) +
  geom_ribbon(aes(ymin = 0, ymax = pmin(0, percentage_difference_calculatedGDP_columnGDP), group=1), alpha=0.2, fill = "blue") +
  geom_ribbon(aes(ymin = percentage_difference_calculatedGDP_columnGDP, ymax = pmin(0, percentage_difference_calculatedGDP_columnGDP), group=1), alpha=0.2, fill = "yellow") +
  labs(title = "Percentage Difference of Calculated GDP and UN_GDP_data GDP Figures",
       subtitle = "Yellow: Calculated GDP > UN_GDP_data GDP 
Blue: Calculated GDP < UN_GDP_data GDP",
         x = "Time period (1970-2017)", 
         y = "Percentage Difference", 
        )+
  NULL

  • In the case of India, from the period of 1970 to 1990, the graph depicts a yellow region wherein calculated GDP is higher than the UN_GDP_data GDP figures. However, post 1990 (excluding 2007 and 2010), the graph depicts a blue region wherein calculated GDP is lower than the UN_GDP_data GDP figures. In 2007, the % difference between what you calculated as GDP and the GDP figure is 2.1309% and in 2010, the % difference between what you calculated as GDP and the GDP figure is 1.0660%

  • In the case of Germany, the graph depicts a yellow region throughout from 1970 to 2017 wherein calculated GDP is higher than the UN_GDP_data GDP figures.The % difference between calculated GDP and the GDP figure was maximum in 1972 at 3.56e+00. In the United States, the graph depicts a mostly yellow region i.e.calculated GDP is higher than the UN_GDP_data GDP figures, excluding the time period from 2005-2008, 2010, and 2016-2017 which depicted a blue region i.e. calculated GDP is lower than the UN_GDP_data GDP figures.

country_list_1 <- c("Austria","United Kingdom", "Italy")
list_indicator_name = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")
tidy_GDP_data_1 %>%
  filter(Country%in% country_list_1) %>% #Filtering by aforementioned country_list
  filter(IndicatorName %in% list_indicator_name ) %>% #Filtering by list_indicator_name
  group_by(IndicatorName) %>%  
  # Line plot for GDP over time graph
  ggplot(aes(x = year, y = indicator_data, color = IndicatorName, group = IndicatorName)) +
  geom_line(aes(x = year, y = indicator_data, color = IndicatorName)) +
  facet_wrap(~ Country) +
  theme_bw() +
  theme(legend.position="right") + 
   scale_x_discrete(breaks = seq(1970, 2017, 10)) +
   scale_color_discrete("Components of GDP", breaks = c("Gross capital formation","Exports", "Government expenditure", "Household expenditure", "Imports")) + 
  labs(title = "GDP Components over time", subtitle = "In constant 2010 USD", x = "", y = "Billion US$") +
  theme(plot.title = element_text(face="bold"))