Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package.

library(fivethirtyeight)
data(drinks)

Check for variable types and missing values

glimpse(drinks)

## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …

For country it is character, for beer_serving, spirit_servings and wine_servings they are integer, for total_litres_of_pure_alcohol it is double.

skim(drinks)

(#tab:glimpse_skim_data)Data summary

Name	drinks
Number of rows	193
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	3	28	0	193	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
beer_servings	1	106.16	101.14	20.0	76.0	188.0	376.0	▇▃▂▂▁
spirit_servings	1	80.99	88.28	4.0	56.0	128.0	438.0	▇▃▂▁▁
wine_servings	1	49.45	79.70	1.0	8.0	59.0	370.0	▇▁▁▁▁
total_litres_of_pure_alcohol	1	4.72	3.77	1.3	4.2	7.2	14.4	▇▃▅▃▁

No missing values as seen in the result below in column “n_missing”.

Make a plot that shows the top 25 beer consuming countries

top_25_beer <- drinks %>%
  top_n(25,beer_servings)

ggplot(top_25_beer,aes(y=fct_reorder(country,beer_servings),x=beer_servings))+
  geom_col(fill="green")+
  labs(title = "Top 25 beer consuming countries",
       x = "Beer servings", y = "Country")+
  NULL

Make a plot that shows the top 25 wine consuming countries

top_25_wine <- drinks %>%
  top_n(25,wine_servings)

ggplot(top_25_wine,aes(y=fct_reorder(country,wine_servings),x=wine_servings))+
  geom_col(fill="red")+
  labs(title = "Top 25 wine consuming countries",
       x = "Wine servings", y = "Country")+
  NULL

Finally, make a plot that shows the top 25 spirit consuming countries

top_25_spirit <- drinks %>%
  top_n(25,spirit_servings)

ggplot(top_25_spirit,aes(y=fct_reorder(country,spirit_servings),
                         x=spirit_servings))+
  geom_col(fill="blue")+
  labs(title = "Top 25 spirit consuming countries",
       x = "Spirit servings", y = "Country")+
  NULL

Namibia has the highest beer consumption, which is unexpected. This maybe because it’s a former German colony, in which beer culture could have been established historically, leading to a higher beer consumption. France has the highest wine consumption, which is expected. It is a culture in France to associate every meal with a wine-pairing, and France is the second largest wine producing country in the world. Portugal has the second highest wine consumption, which is also expected, given its tradition for producing and consuming fortified wines, such as Port and Madeira. Grenada has the highest spirit consumption. This is probably due to the fact that spirits make up 66% of all alcohol consumption in this country.

Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data1", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

skim(movies)

(#tab:load_movies)Data summary

Name	movies
Number of rows	2961
Number of columns	11
_______________________
Column type frequency:
character	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	1	83	2907
genre	1	5	11	17
director	1	3	32	1366

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2.00e+03	9.95e+00	1920.0	2.00e+03	2.00e+03	2.01e+03	2.02e+03	▁▁▁▂▇
duration	1	1.10e+02	2.22e+01	37.0	9.50e+01	1.06e+02	1.19e+02	3.30e+02	▃▇▁▁▁
gross	1	5.81e+07	7.25e+07	703.0	1.23e+07	3.47e+07	7.56e+07	7.61e+08	▇▁▁▁▁
budget	1	4.06e+07	4.37e+07	218.0	1.10e+07	2.60e+07	5.50e+07	3.00e+08	▇▂▁▁▁
cast_facebook_likes	1	1.24e+04	2.05e+04	0.0	2.24e+03	4.60e+03	1.69e+04	6.57e+05	▇▁▁▁▁
votes	1	1.09e+05	1.58e+05	5.0	1.99e+04	5.57e+04	1.33e+05	1.69e+06	▇▁▁▁▁
reviews	1	5.03e+02	4.94e+02	2.0	1.99e+02	3.64e+02	6.31e+02	5.31e+03	▇▁▁▁▁
rating	1	6.39e+00	1.05e+00	1.6	5.80e+00	6.50e+00	7.10e+00	9.30e+00	▁▁▆▇▁

There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

gross : The gross earnings in the US box office, not adjusted for inflation
budget: The movie’s budget
cast_facebook_likes: the number of facebook likes cast memebrs received
votes: the number of people who voted for (or rated) the movie in IMDB
reviews: the number of reviews for that movie
rating: IMDB average rating

unique_movies <- movies %>%
  distinct(title,genre,director,year,.keep_all=TRUE)

new_unique_movies <- movies %>%
  summarise("number of unique movies"=n())

Produce a table with the count of movies by genre, ranked in descending order

count_movie_genre <- unique_movies %>%
  group_by(genre) %>%
  count(sort=TRUE)

colnames(count_movie_genre) <- c("Genre", "Number of Movies")

count_movie_genre %>% 
  kable()

Genre	Number of Movies
Comedy	844
Action	719
Drama	484
Adventure	281
Crime	198
Biography	135
Horror	128
Animation	35
Fantasy	26
Documentary	25
Mystery	15
Sci-Fi	7
Family	3
Musical	2
Romance	2
Western	2
Thriller	1

Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order

return_movie_genre <- unique_movies %>%
  group_by(genre) %>% 
  summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>% 
  mutate(return_on_budget = 
           ((mean_gross - mean_budget)/mean_budget)) %>% 
  arrange(-return_on_budget)

colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]",
                                  "Return on budget")
  
return_movie_genre %>% 
  kable()

Genre	Mean gross [$]	Mean budget [$]	Return on budget
Musical	9.21e+07	3189500	27.871
Family	1.49e+08	14833333	9.056
Western	2.08e+07	3465000	5.009
Documentary	1.74e+07	5887852	1.947
Horror	3.78e+07	13804379	1.737
Fantasy	4.19e+07	18484615	1.267
Comedy	4.25e+07	24458506	0.737
Mystery	6.91e+07	41500000	0.665
Animation	9.84e+07	61701429	0.595
Biography	4.52e+07	28543696	0.584
Adventure	9.44e+07	64692313	0.458
Drama	3.68e+07	25832605	0.423
Crime	3.76e+07	26527405	0.417
Romance	3.13e+07	25107500	0.245
Action	8.63e+07	70774558	0.219
Sci-Fi	2.98e+07	27607143	0.079
Thriller	2.47e+03	300000	-0.992

Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.

top_15_directors <- unique_movies %>%
  select(director, gross) %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross), 
            mean_gross = mean(gross), 
            median_gross = median(gross), 
            sd_gross = sd(gross)) %>% 
  top_n(15,total_gross) %>% 
  arrange(-total_gross)

colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]",
                                "Median gross [$]", 
                                "Standard deviation of gross [$]")
  
top_15_directors %>% 
  kable()

Director	Total gross [$]	Mean gross [$]	Median gross [$]	Standard deviation of gross [$]
Steven Spielberg	4.01e+09	1.75e+08	1.64e+08	1.01e+08
Michael Bay	2.20e+09	1.83e+08	1.68e+08	1.26e+08
James Cameron	1.91e+09	3.18e+08	1.76e+08	3.09e+08
Christopher Nolan	1.81e+09	2.27e+08	1.97e+08	1.87e+08
George Lucas	1.74e+09	3.48e+08	3.80e+08	1.46e+08
Robert Zemeckis	1.62e+09	1.25e+08	1.01e+08	9.13e+07
Tim Burton	1.56e+09	1.11e+08	6.98e+07	9.93e+07
Sam Raimi	1.44e+09	1.80e+08	1.38e+08	1.75e+08
Clint Eastwood	1.38e+09	7.25e+07	4.67e+07	7.55e+07
Francis Lawrence	1.36e+09	2.72e+08	2.82e+08	1.35e+08
Ron Howard	1.34e+09	1.11e+08	1.02e+08	8.19e+07
Gore Verbinski	1.33e+09	1.90e+08	1.23e+08	1.54e+08
Andrew Adamson	1.14e+09	2.84e+08	2.80e+08	1.21e+08
Shawn Levy	1.13e+09	1.03e+08	8.55e+07	6.55e+07
Ridley Scott	1.13e+09	8.06e+07	4.78e+07	6.88e+07

Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.

dist_by_genre <- unique_movies %>%
  select(genre, rating) %>% 
  group_by(genre) %>% 
  summarise(mean_rating = mean(rating), 
            min_rating = min(rating), 
            max_rating = max(rating), 
            sd_rating = sd(rating)) 

colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating",
                             "Standard deviation of rating")


dist_by_genre %>% 
  kable()

Genre	Mean rating	Min rating	Max rating	Standard deviation of rating
Action	6.23	2.1	9.0	1.039
Adventure	6.51	2.3	8.6	1.106
Animation	6.65	4.5	8.0	0.968
Biography	7.11	4.5	8.9	0.760
Comedy	6.11	1.9	8.8	1.024
Crime	6.92	4.8	9.3	0.853
Documentary	6.66	1.6	8.5	1.767
Drama	6.74	2.1	8.8	0.915
Family	6.50	5.7	7.9	1.217
Fantasy	6.08	4.3	7.9	0.953
Horror	5.79	3.6	8.5	0.987
Musical	6.75	6.3	7.2	0.636
Mystery	6.84	4.6	8.5	0.910
Romance	6.65	6.2	7.1	0.636
Sci-Fi	6.66	5.0	8.2	1.094
Thriller	4.80	4.8	4.8	NA
Western	5.70	4.1	7.3	2.263

unique_movies %>%
  ggplot(aes(x=rating)) + 
  geom_histogram() +
  facet_wrap(vars(genre)) +
  labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")

  NULL

## NULL

Examine the relationship between gross and cast_facebook_likes.

unique_movies %>%
  ggplot(aes(y=gross, x=cast_facebook_likes)) + 
  geom_point() +
  labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", 
       y = "Gross [$]")+
  NULL

It’s unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable “cast_facebook_likes”, Y-axis is the dependent variable “gross”. There is no obvious correlation between the two variables.

Examine the relationship between gross and budget.

unique_movies %>%
  ggplot(aes(y=gross, x=budget)) + 
  geom_point() +
  labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
  NULL

It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.

Examine the relationship between gross and rating.

unique_movies %>%
  ggplot(aes(y=gross, x=rating)) + 
  geom_point() +
  facet_wrap(vars(genre))+
  labs(title = "Gross vs. the rating of the movie genre", x = "Rating", 
       y = "Gross [$]")+
  NULL

In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it’s hard to determine what the relationship is between the two variables.

Returns of financial stocks

nyse <- read_csv(here::here("data1","nyse.csv"))

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

companies_per_sector <- nyse %>%
  group_by(sector) %>%
  count(sort=TRUE)

colnames(companies_per_sector) <- c("Sector", "Number of Companies")
  
companies_per_sector %>% 
  kable()

Sector	Number of Companies
Finance	97
Consumer Services	79
Public Utilities	60
Capital Goods	45
Health Care	45
Energy	42
Technology	40
Basic Industries	39
Consumer Non-Durables	31
Miscellaneous	12
Transportation	10
Consumer Durables	8

colnames(companies_per_sector) <- c("sector", "number_of_companies")

companies_per_sector %>% 
  ggplot(aes(reorder(sector,-number_of_companies), number_of_companies)) +
  geom_bar(stat="identity")+
  labs(title = "Number of companies in each sector", x = "Sector", 
       y = "Number of Companies")+
  NULL

Next, let’s choose some stocks and their ticker symbols and download some data.

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame

## Rows: 18,781
## Columns: 8
## Groups: symbol [7]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high     <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low      <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close    <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume   <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…

First, calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Create a table to summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

summarise_monthly_returns <- myStocks_returns_monthly %>%
  group_by(symbol) %>% 
  summarise(min_monthly_return = min(monthly_returns), 
            max_monthly_return = max(monthly_returns),
            median_monthly_return = median(monthly_returns), 
            mean_monthly_return = mean(monthly_returns), 
            sd_monthly_return = sd(monthly_returns)) 

colnames(summarise_monthly_returns) <- c("Stock","Minimum Monthly Return [$]", 
                                         "Maximum Monthly Return [$]", 
                                         "Median of Monthly Return [$]", 
                                         "Mean of Monthly Return [$]", 
                                         "Standard Deviation of Monthly Return [$]")
 

summarise_monthly_returns %>% 
  kable()

Stock	Minimum Monthly Return [$]	Maximum Monthly Return [$]	Median of Monthly Return [$]	Mean of Monthly Return [$]	Standard Deviation of Monthly Return [$]
AAPL	-0.181	0.217	0.026	0.024	0.078
ANF	-0.421	0.507	0.003	0.009	0.145
DIS	-0.179	0.234	0.009	0.016	0.068
DPZ	-0.188	0.342	0.025	0.031	0.075
SPY	-0.125	0.127	0.017	0.012	0.038
TSLA	-0.224	0.811	0.015	0.052	0.176
XOM	-0.262	0.233	0.002	0.003	0.066

Plot a density plot, using geom_density(), for each of the stocks

myStocks_returns_monthly %>%
  ggplot(aes(x=monthly_returns)) + 
  geom_density() +
  facet_wrap(vars(symbol))+
  labs(title = "Monthly return of each of the stocks", x = "Monthly return [$]",
       y = "Count")+
  NULL

The risk of the stock can be determined by visually analyzing the proportion of the graph area that lies in the negative direction of the x-axis. This determines how the stocks were performing and whether they were likely to produce negative returns. Based on this, the riskiest stock would be XOM, because its past performance has shown that it often gives negative returns. The least risky stock would be DPZ, because its past performance indicates that it is likely to produce positive returns.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis.

library(ggrepel)
myStocks_returns_monthly %>% 
  group_by(symbol) %>% 
  summarise(mean_monthly_return = mean(monthly_returns), 
            sd_monthly_return = sd(monthly_returns)) %>% 
  ggplot(aes(y=sd_monthly_return, x=mean_monthly_return, label = symbol)) + 
  geom_point() +
  geom_text_repel()+
  labs(title = "The risk of each of the stocks expressed as standard deviation 
       vs. the monthly return", x = "Monthly return [$]", 
       y = "Standard deviation [$]")+
  NULL

Based on this graph, ANF had a broad distribution of monthly returns but not a higher expected return when compared to other stocks. Therefore it is of high risk but not of high expected return.

On your own: IBM HR Analytics

For this task, you will analyse a data set on Human Resoruce Analytics.

First let us load the data

hr_dataset <- read_csv(here::here("data1", 
                                  "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)

## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)

How often do people leave the company (attrition)

leave_count <- hr_cleaned %>% 
  select(attrition) %>%
  group_by(attrition) %>% 
  summarise(attrition_count = count(attrition))

leave_count %>%
  ggplot(aes(x=attrition,y=attrition_count)) + 
  geom_col() +
  labs(title = "Number of people who leave the company expressed as attrition", 
       x = "Attrition", y = "Count")+
  NULL

leave_count[2,2]/sum(leave_count$attrition_count)

##   attrition_count
## 1           0.161

16.1% of the people in this survey leave the company.

How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?

employee_info <- hr_cleaned %>% 
  select(age, years_at_company, monthly_income, years_since_last_promotion) 

age_plot <- ggplot(employee_info, aes(x=age)) + 
  geom_density() +
  labs(title = "Density plot of employees' age", x = "Age", y = "Density")+
  NULL

years_plot <- ggplot(employee_info, aes(x=years_at_company)) + 
  geom_density() +
  labs(title = "Density plot of how long employees have stayed in the company", 
       x = "Years at company", y = "Density")+
  NULL

income_plot <- ggplot(employee_info, aes(x=monthly_income)) + 
  geom_density() +
  labs(title = "Density plot of employees' monthly income", x = "Monthly income"
       , y = "Density")+
  NULL

promotion_plot <- ggplot(employee_info, aes(x=years_since_last_promotion)) + 
  geom_density() +
  labs(title = "Density plot of how long it has been since employees' 
       are last promoted", x = "Years since last promotion", y = "Density")+
  NULL

ggarrange(age_plot,years_plot,income_plot,promotion_plot, ncol=2, nrow=2)

By looking at the above plots, the “age_plot” is closer to normal distribution.

How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total

job_satis_dist <- hr_cleaned %>% 
  select(job_satisfaction) %>%
  group_by(job_satisfaction) %>%
  summarise(cnt = n()) %>%
  mutate(freq = percent(cnt / sum(cnt)))
  

  
job_satis_dist %>% 
  ggplot(aes(reorder(job_satisfaction,-cnt), cnt)) +
  geom_bar(stat="identity")+
  geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
  labs(title = "Distribution of job satisfaction", x="Job satisfaction",
       y="Count")+
  NULL

work_life_dist <- hr_cleaned %>% 
  select(work_life_balance) %>%
  group_by(work_life_balance) %>%
  summarise(cnt = n()) %>%
  mutate(freq = percent(cnt / sum(cnt)))
  

  
work_life_dist %>% 
  ggplot(aes(reorder(work_life_balance,-cnt), cnt)) +
  geom_bar(stat="identity")+
  geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
  labs(title = "Distribution of work life balance", x="Work life balance", 
       y="Count")+
  NULL

Is there any relationship between monthly income and education? Monthly income and gender?

hr_cleaned %>% 
  select(education,monthly_income) %>% 
  ggplot(aes(x=education, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and education", 
       x = "Education", y = "Monthly income") +
  NULL

income_gender <- hr_cleaned %>% 
  group_by(gender) %>% 
  ggplot(aes(x=gender, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and gender", 
       x = "Gender", y = "Monthly income") +
  NULL

income_gender

Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first

hr_cleaned %>% 
  mutate(job_role = fct_rev(fct_reorder(job_role, monthly_income, median))) %>% 
  ggplot(aes(x=job_role, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and job role", 
       x = "Job role", y = "Mean monthly income") +
  NULL

Calculate and plot a bar chart of the mean (or median?) income by education level.

income_edu <- hr_cleaned %>% 
  group_by(education) %>% 
  summarise(mean_income = mean(monthly_income)) %>% 
  ggplot(aes(x=education, y=mean_income)) +
  geom_col()+
  labs(title = "Relationship between monthly income and education", 
       x = "Education", y = "Mean monthly income") +
  NULL

Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes

hr_cleaned %>% 
  ggplot(aes(x=monthly_income)) +
  ggthemes::theme_wsj() +
  geom_density() +
  facet_wrap(vars(education)) +
  labs(title = "Distribution of monthly 
income by education level", x = "Education", y = "Density") +
  NULL

Plot income vs age, faceted by job_role

hr_cleaned %>% 
  ggplot(aes(y=monthly_income, x=age)) +
  geom_point() +
  facet_wrap(vars(job_role))+
  labs(title = "Relationship between income and age", x = "Age", 
       y = "Monthly income") +
  NULL

Challenge 1: Replicating a chart

First we look at the data if there’s missing values.

skim(election2020_results)

Table 1: Data summary

Name	election2020_results
Number of rows	22093
Number of columns	12
_______________________
Column type frequency:
character	8
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1	4	20	51
state_po	0	1	2	2	51
county_name	0	1	3	21	1865
fips	9	1	5	5	3153
office	0	1	9	12	2
candidate	0	1	5	17	4
party	0	1	5	11	5
mode	0	1	4	20	16

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1	2020	0	2020	2020	2020	2020	2020	▁▁▇▁▁
candidatevotes	1	1	7173	42426	0	15	177	2341	3028885	▇▁▁▁▁
totalvotes	5	1	50118	140287	66	6190	13669	36638	4264365	▇▁▁▁▁
version	0	1	20210622	0	20210622	20210622	20210622	20210622	20210622	▁▁▇▁▁

#We can see that there are 9 missing values.

We need to clean up the missing values.

#remove missing values
election2020_results <- na.omit(election2020_results)

Then, calculate the rate of votes for Trump in each county.

vote_data <- election2020_results %>% 
  select(county_name, fips, candidate, candidatevotes, totalvotes) %>% 
  filter(candidate == "DONALD J TRUMP") %>% 
  group_by(county_name, fips, totalvotes) %>% 
  summarise(vote_sum = sum(candidatevotes)) %>% 
  mutate(vote_rate = vote_sum/totalvotes)

Then, filter the most recent data from vaccinations, in this case, september data.

sept_vaccinations <- vaccinations %>% 
  select(date, fips,series_complete_pop_pct) %>% 
  filter(date == "09/04/2021")

#After we filtered the dataset, we can conduct cleaning so that the code is 
#more efficient regarding to a smaller dataset.
sept_vaccinations <- na.omit(sept_vaccinations) %>% 
  unique()

Next, join the september vaccination percentage to the percentage of votes for each county by each candidate.

pct_trump_votes_vacc <- inner_join (vote_data, 
                               sept_vaccinations , 
                               by="fips")

Next, join the population size for each county into the percentage trump votes vaccination table.

votes_vacc_pop <- inner_join (pct_trump_votes_vacc, 
                               population , 
                               by="fips")

Then, plot the relationship between percentage of trump vote and percentage of population vaccinated.

vac_pop_plot <- votes_vacc_pop %>% 
  ggplot(aes(x=vote_rate,y=series_complete_pop_pct/100, size=pop_estimate_2019)) + 
  
  #fit a regression line to the plot
  geom_smooth(method=lm) + 
  
  #aethetics for population size
  geom_point(color = "black", alpha = 0.2) +
  geom_point(size = 0.3)+
  scale_size(range = c(0, 30)) +
  
  #draw the dash lines
  geom_hline(yintercept=0.85, linetype="dashed") +
  geom_hline(yintercept=0.539, linetype="dashed") +
  geom_hline(yintercept=0.508, linetype="dashed") +
  
  #break the graph into 3x3 grids with solid lines
  geom_hline(yintercept=0.4, linetype="solid") +
  geom_hline(yintercept=0.6, linetype="solid") +
  geom_vline(xintercept=0.4, linetype="solid") +
  geom_vline(xintercept=0.6, linetype="solid") +
  labs(title = "COVID 19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY", 
       x = "2020 Trump Vote %", 
       y = "% of Total Population Vaccinated") +
  NULL

#add the text messages
vac_pop_plot + annotate("text", x = 0.2,y = 0.89, 
                        label = "Herd Immunity Threshold (?)",
                        color = 'blue',size = 3) +
  annotate("text", x = 0.05, y = 0.56, label = "TARGET:53.9%", 
           fontface = 'italic', color = 'blue',size = 3) +
  annotate("text", x = 0.05, y = 0.525, label = "ACTUAL:50.8%", 
           fontface = 'italic', color = 'blue',size = 3)+
  annotate("text", x = 0.3, y = 0.1, label = "09/04/2021", 
           fontface = 'italic', color = 'red') +
  annotate("text", x = 0.45, y = 1, label = "EVERY U.S. COUNTY", size = 6) +
  annotate("text", x = 0.1, y = 0.06, label = deparse(bquote(~R^2 ==~ 0.4216)),
            parse = T, size = 3, color = 'red')+
  annotate("text", x = 0.1, y = 0.1, label = "y = -0.4224x + 0.6451", 
           color = 'red',size = 3) +
  scale_y_continuous(labels = scales::percent)+
  scale_x_continuous(labels = scales::percent, limits = c(0,1))+
  theme(legend.position="none")

Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% 
  # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  
  # drop the first row, as it contains again the variable names and last row 
  # that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last 
         # date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date
         # object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, 
         # if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )

glimpse(german_election_polls)

## Rows: 239
## Columns: 16
## $ polling_firm   <chr> "INSA", "Forsa", "INSA", "Ipsos", "Forschungsgruppe Wah…
## $ fieldwork_date <chr> "17–20 Sep 2021", "14–20 Sep 2021", "13–17 Sep 2021", "…
## $ samplesize     <chr> "2,054", "2,502", "1,502", "2,000", "1,406", "1,512", "…
## $ abs            <chr> "–", "25", "–", "–", "22", "–", "–", "–", "–", "–", "–"…
## $ union          <dbl> 22.0, 22.0, 21.0, 21.0, 22.0, 22.0, 20.0, 23.0, 20.0, 2…
## $ spd            <dbl> 25, 25, 26, 27, 25, 26, 27, 25, 25, 26, 26, 25, 25, 26,…
## $ af_d           <dbl> 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 10.0, 11.0, 1…
## $ fdp            <dbl> 12.0, 11.0, 12.0, 10.0, 11.0, 11.0, 12.5, 12.0, 10.0, 1…
## $ linke          <dbl> 6.5, 6.0, 6.0, 7.0, 6.0, 6.0, 6.5, 6.0, 8.0, 6.0, 6.5, …
## $ grune          <dbl> 15.0, 17.0, 15.0, 18.0, 16.0, 15.0, 15.0, 17.0, 15.0, 1…
## $ fw             <chr> "–", "3", "3", "–", "–", "3", "–", "–", "3", "–", "–", …
## $ others         <chr> "8.5", "5", "6", "6", "9", "6", "8", "7", "7", "8", "8"…
## $ lead           <chr> "3", "3", "5", "6", "3", "4", "7", "2", "5", "6", "5.5"…
## $ end_date       <date> 2021-09-20, 2021-09-20, 2021-09-17, 2021-09-16, 2021-0…
## $ month          <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ week           <dbl> 38, 38, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 36,…

#selecting the columns that are useful
selected_data <- german_election_polls %>% 
  select (samplesize, union, spd, af_d, fdp, linke, grune, end_date)

#change "samplesize" to double
selected_data$samplesize <- as.numeric(gsub(",","",selected_data$samplesize)) 

election_data_raw <- selected_data %>% 
  mutate (number_of_vote_union = samplesize*union/100, 
          number_of_vote_spd = samplesize*spd/100, 
          number_of_vote_af_d = samplesize*af_d/100, 
          number_of_vote_fdp = samplesize*fdp/100, 
          number_of_vote_linke = samplesize*linke/100, 
          number_of_vote_grune = samplesize*grune/100) %>% 
  
  #calculate the number of votes for each party (sample size*percenrage of vote)
  group_by(end_date) %>% 
  mutate (totalsample = sum(samplesize)) %>% 
  
  #the total sample size of the result given by each company on the same date 
  mutate (sum_vote_union = sum(number_of_vote_union), 
          sum_vote_spd = sum(number_of_vote_spd), 
          sum_vote_af_d = sum(number_of_vote_af_d), 
          sum_vote_fdp = sum(number_of_vote_fdp), 
          sum_vote_linke = sum(number_of_vote_linke), 
          sum_vote_grune = sum(number_of_vote_grune)) %>% 
  
    # calculate the average percentage of vote by using the total number of votes 
    # for the party / the total sample size
  mutate (average_rate_union = sum_vote_union/totalsample, 
          average_rate_spd = sum_vote_spd/totalsample, 
          average_rate_af_d = sum_vote_af_d/totalsample, 
          average_rate_fdp = sum_vote_fdp/totalsample, 
          average_rate_linke = sum_vote_linke/totalsample, 
          average_rate_grune = sum_vote_grune/totalsample)

#remove the duplicated rows for the same date
election_data <- election_data_raw[!duplicated(election_data_raw$totalsample), ]

election_data = na.omit(election_data)

roll_mean_union = zoo::rollmean(election_data$average_rate_union,k = 14, 
                                fill = NA, align = "left")

roll_mean_spd = zoo::rollmean(election_data$average_rate_spd,k = 14, 
                              fill = NA, align = "left")

roll_mean_af_d = zoo::rollmean(election_data$average_rate_af_d,k = 14, 
                               fill = NA, align = "left")

roll_mean_fdp = zoo::rollmean(election_data$average_rate_fdp,k = 14, 
                              fill = NA, align = "left")

roll_mean_linke = zoo::rollmean(election_data$average_rate_linke,k = 14, 
                                fill = NA, align = "left")

roll_mean_grune= zoo::rollmean(election_data$average_rate_grune,k = 14, 
                               fill = NA, align = "left")
election_data = cbind(election_data,
                      roll_mean_union=roll_mean_union,
                      roll_mean_spd=roll_mean_spd,
                      roll_mean_af_d=roll_mean_af_d,
                      roll_mean_fdp=roll_mean_fdp,
                      roll_mean_linke=roll_mean_linke,
                      roll_mean_grune=roll_mean_grune)

colors = c("CDU/CSU" = "black", "SPD" = "red3", "AfD" = "midnightblue", 
           "FDP" = "orange1",
          "Linke" = "mediumvioletred", "Grune" = "green")

election_data %>% 
  ggplot(aes(x = end_date))+
  geom_point(aes(y = average_rate_union), alpha=0.3, size=1) + 
  geom_point(aes(y = average_rate_spd), color="red3", alpha=0.3, size=1) +
  geom_point(aes(y = average_rate_af_d), color="midnightblue", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_fdp), color="orange1", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_linke), color="mediumvioletred", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_grune), color="green", alpha=0.3, size=1)+
  geom_line(aes(y = roll_mean_union, color="CDU/CSU"), size=1) + 
  geom_line(aes(y = roll_mean_spd, color="SPD"),size=1) +
  geom_line(aes(y = roll_mean_af_d, color="AfD"),size=1)+
  geom_line(aes(y = roll_mean_fdp, color="FDP"),size=1)+
  geom_line(aes(y = roll_mean_linke, color="Linke"),size=1)+
  geom_line(aes(y = roll_mean_grune,  color="Grune"),size=1)+
  labs(title = "German Election Poll", x="Date", y="Percentage of vote")+
  scale_y_continuous(labels = scales::percent)+
  scale_x_date(labels = date_format("%B-%Y"))+
  theme(legend.position = "right",legend.background = element_rect(fill="snow", 
                                  size=0.5, linetype="solid")) +
  theme_bw()+
  scale_color_manual(values = colors, name = "Parties")

Homework 1

Where Do People Drink The Most Beer, Wine And Spirits?

Analysis of movies- IMDB dataset

Returns of financial stocks

On your own: IBM HR Analytics

Challenge 1: Replicating a chart

Challenge 2: Opinion polls for the 2021 German elections