Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package.
library(fivethirtyeight)
data(drinks)
glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
For country it is character, for beer_serving, spirit_servings and wine_servings they are integer, for total_litres_of_pure_alcohol it is double.
skim(drinks)
| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
No missing values as seen in the result below in column “n_missing”.
top_25_beer <- drinks %>%
top_n(25,beer_servings)
ggplot(top_25_beer,aes(y=fct_reorder(country,beer_servings),x=beer_servings))+
geom_col(fill="green")+
labs(title = "Top 25 beer consuming countries",
x = "Beer servings", y = "Country")+
NULL

top_25_wine <- drinks %>%
top_n(25,wine_servings)
ggplot(top_25_wine,aes(y=fct_reorder(country,wine_servings),x=wine_servings))+
geom_col(fill="red")+
labs(title = "Top 25 wine consuming countries",
x = "Wine servings", y = "Country")+
NULL

top_25_spirit <- drinks %>%
top_n(25,spirit_servings)
ggplot(top_25_spirit,aes(y=fct_reorder(country,spirit_servings),
x=spirit_servings))+
geom_col(fill="blue")+
labs(title = "Top 25 spirit consuming countries",
x = "Spirit servings", y = "Country")+
NULL

Namibia has the highest beer consumption, which is unexpected. This maybe because it’s a former German colony, in which beer culture could have been established historically, leading to a higher beer consumption. France has the highest wine consumption, which is expected. It is a culture in France to associate every meal with a wine-pairing, and France is the second largest wine producing country in the world. Portugal has the second highest wine consumption, which is also expected, given its tradition for producing and consuming fortified wines, such as Port and Madeira. Grenada has the highest spirit consumption. This is probably due to the fact that spirits make up 66% of all alcohol consumption in this country.
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data1", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
skim(movies)
| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflationbudget: The movie’s budgetcast_facebook_likes: the number of facebook likes cast memebrs receivedvotes: the number of people who voted for (or rated) the movie in IMDBreviews: the number of reviews for that movierating: IMDB average ratingunique_movies <- movies %>%
distinct(title,genre,director,year,.keep_all=TRUE)
new_unique_movies <- movies %>%
summarise("number of unique movies"=n())
count_movie_genre <- unique_movies %>%
group_by(genre) %>%
count(sort=TRUE)
colnames(count_movie_genre) <- c("Genre", "Number of Movies")
count_movie_genre %>%
kable()
| Genre | Number of Movies |
|---|---|
| Comedy | 844 |
| Action | 719 |
| Drama | 484 |
| Adventure | 281 |
| Crime | 198 |
| Biography | 135 |
| Horror | 128 |
| Animation | 35 |
| Fantasy | 26 |
| Documentary | 25 |
| Mystery | 15 |
| Sci-Fi | 7 |
| Family | 3 |
| Musical | 2 |
| Romance | 2 |
| Western | 2 |
| Thriller | 1 |
gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending orderreturn_movie_genre <- unique_movies %>%
group_by(genre) %>%
summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>%
mutate(return_on_budget =
((mean_gross - mean_budget)/mean_budget)) %>%
arrange(-return_on_budget)
colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]",
"Return on budget")
return_movie_genre %>%
kable()
| Genre | Mean gross [$] | Mean budget [$] | Return on budget |
|---|---|---|---|
| Musical | 9.21e+07 | 3189500 | 27.871 |
| Family | 1.49e+08 | 14833333 | 9.056 |
| Western | 2.08e+07 | 3465000 | 5.009 |
| Documentary | 1.74e+07 | 5887852 | 1.947 |
| Horror | 3.78e+07 | 13804379 | 1.737 |
| Fantasy | 4.19e+07 | 18484615 | 1.267 |
| Comedy | 4.25e+07 | 24458506 | 0.737 |
| Mystery | 6.91e+07 | 41500000 | 0.665 |
| Animation | 9.84e+07 | 61701429 | 0.595 |
| Biography | 4.52e+07 | 28543696 | 0.584 |
| Adventure | 9.44e+07 | 64692313 | 0.458 |
| Drama | 3.68e+07 | 25832605 | 0.423 |
| Crime | 3.76e+07 | 26527405 | 0.417 |
| Romance | 3.13e+07 | 25107500 | 0.245 |
| Action | 8.63e+07 | 70774558 | 0.219 |
| Sci-Fi | 2.98e+07 | 27607143 | 0.079 |
| Thriller | 2.47e+03 | 300000 | -0.992 |
top_15_directors <- unique_movies %>%
select(director, gross) %>%
group_by(director) %>%
summarise(total_gross = sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
sd_gross = sd(gross)) %>%
top_n(15,total_gross) %>%
arrange(-total_gross)
colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]",
"Median gross [$]",
"Standard deviation of gross [$]")
top_15_directors %>%
kable()
| Director | Total gross [$] | Mean gross [$] | Median gross [$] | Standard deviation of gross [$] |
|---|---|---|---|---|
| Steven Spielberg | 4.01e+09 | 1.75e+08 | 1.64e+08 | 1.01e+08 |
| Michael Bay | 2.20e+09 | 1.83e+08 | 1.68e+08 | 1.26e+08 |
| James Cameron | 1.91e+09 | 3.18e+08 | 1.76e+08 | 3.09e+08 |
| Christopher Nolan | 1.81e+09 | 2.27e+08 | 1.97e+08 | 1.87e+08 |
| George Lucas | 1.74e+09 | 3.48e+08 | 3.80e+08 | 1.46e+08 |
| Robert Zemeckis | 1.62e+09 | 1.25e+08 | 1.01e+08 | 9.13e+07 |
| Tim Burton | 1.56e+09 | 1.11e+08 | 6.98e+07 | 9.93e+07 |
| Sam Raimi | 1.44e+09 | 1.80e+08 | 1.38e+08 | 1.75e+08 |
| Clint Eastwood | 1.38e+09 | 7.25e+07 | 4.67e+07 | 7.55e+07 |
| Francis Lawrence | 1.36e+09 | 2.72e+08 | 2.82e+08 | 1.35e+08 |
| Ron Howard | 1.34e+09 | 1.11e+08 | 1.02e+08 | 8.19e+07 |
| Gore Verbinski | 1.33e+09 | 1.90e+08 | 1.23e+08 | 1.54e+08 |
| Andrew Adamson | 1.14e+09 | 2.84e+08 | 2.80e+08 | 1.21e+08 |
| Shawn Levy | 1.13e+09 | 1.03e+08 | 8.55e+07 | 6.55e+07 |
| Ridley Scott | 1.13e+09 | 8.06e+07 | 4.78e+07 | 6.88e+07 |
dist_by_genre <- unique_movies %>%
select(genre, rating) %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating),
sd_rating = sd(rating))
colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating",
"Standard deviation of rating")
dist_by_genre %>%
kable()
| Genre | Mean rating | Min rating | Max rating | Standard deviation of rating |
|---|---|---|---|---|
| Action | 6.23 | 2.1 | 9.0 | 1.039 |
| Adventure | 6.51 | 2.3 | 8.6 | 1.106 |
| Animation | 6.65 | 4.5 | 8.0 | 0.968 |
| Biography | 7.11 | 4.5 | 8.9 | 0.760 |
| Comedy | 6.11 | 1.9 | 8.8 | 1.024 |
| Crime | 6.92 | 4.8 | 9.3 | 0.853 |
| Documentary | 6.66 | 1.6 | 8.5 | 1.767 |
| Drama | 6.74 | 2.1 | 8.8 | 0.915 |
| Family | 6.50 | 5.7 | 7.9 | 1.217 |
| Fantasy | 6.08 | 4.3 | 7.9 | 0.953 |
| Horror | 5.79 | 3.6 | 8.5 | 0.987 |
| Musical | 6.75 | 6.3 | 7.2 | 0.636 |
| Mystery | 6.84 | 4.6 | 8.5 | 0.910 |
| Romance | 6.65 | 6.2 | 7.1 | 0.636 |
| Sci-Fi | 6.66 | 5.0 | 8.2 | 1.094 |
| Thriller | 4.80 | 4.8 | 4.8 | NA |
| Western | 5.70 | 4.1 | 7.3 | 2.263 |
unique_movies %>%
ggplot(aes(x=rating)) +
geom_histogram() +
facet_wrap(vars(genre)) +
labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")

NULL
## NULL
gross and cast_facebook_likes.unique_movies %>%
ggplot(aes(y=gross, x=cast_facebook_likes)) +
geom_point() +
labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes",
y = "Gross [$]")+
NULL

It’s unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable “cast_facebook_likes”, Y-axis is the dependent variable “gross”. There is no obvious correlation between the two variables.
gross and budget.unique_movies %>%
ggplot(aes(y=gross, x=budget)) +
geom_point() +
labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
NULL

It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.
gross and rating.unique_movies %>%
ggplot(aes(y=gross, x=rating)) +
geom_point() +
facet_wrap(vars(genre))+
labs(title = "Gross vs. the rating of the movie genre", x = "Rating",
y = "Gross [$]")+
NULL

In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it’s hard to determine what the relationship is between the two variables.
nyse <- read_csv(here::here("data1","nyse.csv"))
companies_per_sector <- nyse %>%
group_by(sector) %>%
count(sort=TRUE)
colnames(companies_per_sector) <- c("Sector", "Number of Companies")
companies_per_sector %>%
kable()
| Sector | Number of Companies |
|---|---|
| Finance | 97 |
| Consumer Services | 79 |
| Public Utilities | 60 |
| Capital Goods | 45 |
| Health Care | 45 |
| Energy | 42 |
| Technology | 40 |
| Basic Industries | 39 |
| Consumer Non-Durables | 31 |
| Miscellaneous | 12 |
| Transportation | 10 |
| Consumer Durables | 8 |
colnames(companies_per_sector) <- c("sector", "number_of_companies")
companies_per_sector %>%
ggplot(aes(reorder(sector,-number_of_companies), number_of_companies)) +
geom_bar(stat="identity")+
labs(title = "Number of companies in each sector", x = "Sector",
y = "Number of Companies")+
NULL

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 18,781
## Columns: 8
## Groups: symbol [7]
## $ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))
SPY; min, max, median, mean, SD.summarise_monthly_returns <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(min_monthly_return = min(monthly_returns),
max_monthly_return = max(monthly_returns),
median_monthly_return = median(monthly_returns),
mean_monthly_return = mean(monthly_returns),
sd_monthly_return = sd(monthly_returns))
colnames(summarise_monthly_returns) <- c("Stock","Minimum Monthly Return [$]",
"Maximum Monthly Return [$]",
"Median of Monthly Return [$]",
"Mean of Monthly Return [$]",
"Standard Deviation of Monthly Return [$]")
summarise_monthly_returns %>%
kable()
| Stock | Minimum Monthly Return [$] | Maximum Monthly Return [$] | Median of Monthly Return [$] | Mean of Monthly Return [$] | Standard Deviation of Monthly Return [$] |
|---|---|---|---|---|---|
| AAPL | -0.181 | 0.217 | 0.026 | 0.024 | 0.078 |
| ANF | -0.421 | 0.507 | 0.003 | 0.009 | 0.145 |
| DIS | -0.179 | 0.234 | 0.009 | 0.016 | 0.068 |
| DPZ | -0.188 | 0.342 | 0.025 | 0.031 | 0.075 |
| SPY | -0.125 | 0.127 | 0.017 | 0.012 | 0.038 |
| TSLA | -0.224 | 0.811 | 0.015 | 0.052 | 0.176 |
| XOM | -0.262 | 0.233 | 0.002 | 0.003 | 0.066 |
geom_density(), for each of the stocksmyStocks_returns_monthly %>%
ggplot(aes(x=monthly_returns)) +
geom_density() +
facet_wrap(vars(symbol))+
labs(title = "Monthly return of each of the stocks", x = "Monthly return [$]",
y = "Count")+
NULL

The risk of the stock can be determined by visually analyzing the proportion of the graph area that lies in the negative direction of the x-axis. This determines how the stocks were performing and whether they were likely to produce negative returns. Based on this, the riskiest stock would be XOM, because its past performance has shown that it often gives negative returns. The least risky stock would be DPZ, because its past performance indicates that it is likely to produce positive returns.
library(ggrepel)
myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(mean_monthly_return = mean(monthly_returns),
sd_monthly_return = sd(monthly_returns)) %>%
ggplot(aes(y=sd_monthly_return, x=mean_monthly_return, label = symbol)) +
geom_point() +
geom_text_repel()+
labs(title = "The risk of each of the stocks expressed as standard deviation
vs. the monthly return", x = "Monthly return [$]",
y = "Standard deviation [$]")+
NULL

Based on this graph, ANF had a broad distribution of monthly returns but not a higher expected return when compared to other stocks. Therefore it is of high risk but not of high expected return.
For this task, you will analyse a data set on Human Resoruce Analytics.
First let us load the data
hr_dataset <- read_csv(here::here("data1",
"datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)
attrition)leave_count <- hr_cleaned %>%
select(attrition) %>%
group_by(attrition) %>%
summarise(attrition_count = count(attrition))
leave_count %>%
ggplot(aes(x=attrition,y=attrition_count)) +
geom_col() +
labs(title = "Number of people who leave the company expressed as attrition",
x = "Attrition", y = "Count")+
NULL

leave_count[2,2]/sum(leave_count$attrition_count)
## attrition_count
## 1 0.161
16.1% of the people in this survey leave the company.
age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?employee_info <- hr_cleaned %>%
select(age, years_at_company, monthly_income, years_since_last_promotion)
age_plot <- ggplot(employee_info, aes(x=age)) +
geom_density() +
labs(title = "Density plot of employees' age", x = "Age", y = "Density")+
NULL
years_plot <- ggplot(employee_info, aes(x=years_at_company)) +
geom_density() +
labs(title = "Density plot of how long employees have stayed in the company",
x = "Years at company", y = "Density")+
NULL
income_plot <- ggplot(employee_info, aes(x=monthly_income)) +
geom_density() +
labs(title = "Density plot of employees' monthly income", x = "Monthly income"
, y = "Density")+
NULL
promotion_plot <- ggplot(employee_info, aes(x=years_since_last_promotion)) +
geom_density() +
labs(title = "Density plot of how long it has been since employees'
are last promoted", x = "Years since last promotion", y = "Density")+
NULL
ggarrange(age_plot,years_plot,income_plot,promotion_plot, ncol=2, nrow=2)

By looking at the above plots, the “age_plot” is closer to normal distribution.
job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of totaljob_satis_dist <- hr_cleaned %>%
select(job_satisfaction) %>%
group_by(job_satisfaction) %>%
summarise(cnt = n()) %>%
mutate(freq = percent(cnt / sum(cnt)))
job_satis_dist %>%
ggplot(aes(reorder(job_satisfaction,-cnt), cnt)) +
geom_bar(stat="identity")+
geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
labs(title = "Distribution of job satisfaction", x="Job satisfaction",
y="Count")+
NULL

work_life_dist <- hr_cleaned %>%
select(work_life_balance) %>%
group_by(work_life_balance) %>%
summarise(cnt = n()) %>%
mutate(freq = percent(cnt / sum(cnt)))
work_life_dist %>%
ggplot(aes(reorder(work_life_balance,-cnt), cnt)) +
geom_bar(stat="identity")+
geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
labs(title = "Distribution of work life balance", x="Work life balance",
y="Count")+
NULL

hr_cleaned %>%
select(education,monthly_income) %>%
ggplot(aes(x=education, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and education",
x = "Education", y = "Monthly income") +
NULL

income_gender <- hr_cleaned %>%
group_by(gender) %>%
ggplot(aes(x=gender, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and gender",
x = "Gender", y = "Monthly income") +
NULL
income_gender

hr_cleaned %>%
mutate(job_role = fct_rev(fct_reorder(job_role, monthly_income, median))) %>%
ggplot(aes(x=job_role, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and job role",
x = "Job role", y = "Mean monthly income") +
NULL

income_edu <- hr_cleaned %>%
group_by(education) %>%
summarise(mean_income = mean(monthly_income)) %>%
ggplot(aes(x=education, y=mean_income)) +
geom_col()+
labs(title = "Relationship between monthly income and education",
x = "Education", y = "Mean monthly income") +
NULL
ggthemeshr_cleaned %>%
ggplot(aes(x=monthly_income)) +
ggthemes::theme_wsj() +
geom_density() +
facet_wrap(vars(education)) +
labs(title = "Distribution of monthly
income by education level", x = "Education", y = "Density") +
NULL

job_rolehr_cleaned %>%
ggplot(aes(y=monthly_income, x=age)) +
geom_point() +
facet_wrap(vars(job_role))+
labs(title = "Relationship between income and age", x = "Age",
y = "Monthly income") +
NULL

skim(election2020_results)
| Name | election2020_results |
| Number of rows | 22093 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
| state_po | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| county_name | 0 | 1 | 3 | 21 | 0 | 1865 | 0 |
| fips | 9 | 1 | 5 | 5 | 0 | 3153 | 0 |
| office | 0 | 1 | 9 | 12 | 0 | 2 | 0 |
| candidate | 0 | 1 | 5 | 17 | 0 | 4 | 0 |
| party | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
| mode | 0 | 1 | 4 | 20 | 0 | 16 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2020 | 0 | 2020 | 2020 | 2020 | 2020 | 2020 | ▁▁▇▁▁ |
| candidatevotes | 1 | 1 | 7173 | 42426 | 0 | 15 | 177 | 2341 | 3028885 | ▇▁▁▁▁ |
| totalvotes | 5 | 1 | 50118 | 140287 | 66 | 6190 | 13669 | 36638 | 4264365 | ▇▁▁▁▁ |
| version | 0 | 1 | 20210622 | 0 | 20210622 | 20210622 | 20210622 | 20210622 | 20210622 | ▁▁▇▁▁ |
#We can see that there are 9 missing values.
#remove missing values
election2020_results <- na.omit(election2020_results)
vote_data <- election2020_results %>%
select(county_name, fips, candidate, candidatevotes, totalvotes) %>%
filter(candidate == "DONALD J TRUMP") %>%
group_by(county_name, fips, totalvotes) %>%
summarise(vote_sum = sum(candidatevotes)) %>%
mutate(vote_rate = vote_sum/totalvotes)
sept_vaccinations <- vaccinations %>%
select(date, fips,series_complete_pop_pct) %>%
filter(date == "09/04/2021")
#After we filtered the dataset, we can conduct cleaning so that the code is
#more efficient regarding to a smaller dataset.
sept_vaccinations <- na.omit(sept_vaccinations) %>%
unique()
pct_trump_votes_vacc <- inner_join (vote_data,
sept_vaccinations ,
by="fips")
votes_vacc_pop <- inner_join (pct_trump_votes_vacc,
population ,
by="fips")
vac_pop_plot <- votes_vacc_pop %>%
ggplot(aes(x=vote_rate,y=series_complete_pop_pct/100, size=pop_estimate_2019)) +
#fit a regression line to the plot
geom_smooth(method=lm) +
#aethetics for population size
geom_point(color = "black", alpha = 0.2) +
geom_point(size = 0.3)+
scale_size(range = c(0, 30)) +
#draw the dash lines
geom_hline(yintercept=0.85, linetype="dashed") +
geom_hline(yintercept=0.539, linetype="dashed") +
geom_hline(yintercept=0.508, linetype="dashed") +
#break the graph into 3x3 grids with solid lines
geom_hline(yintercept=0.4, linetype="solid") +
geom_hline(yintercept=0.6, linetype="solid") +
geom_vline(xintercept=0.4, linetype="solid") +
geom_vline(xintercept=0.6, linetype="solid") +
labs(title = "COVID 19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
x = "2020 Trump Vote %",
y = "% of Total Population Vaccinated") +
NULL
#add the text messages
vac_pop_plot + annotate("text", x = 0.2,y = 0.89,
label = "Herd Immunity Threshold (?)",
color = 'blue',size = 3) +
annotate("text", x = 0.05, y = 0.56, label = "TARGET:53.9%",
fontface = 'italic', color = 'blue',size = 3) +
annotate("text", x = 0.05, y = 0.525, label = "ACTUAL:50.8%",
fontface = 'italic', color = 'blue',size = 3)+
annotate("text", x = 0.3, y = 0.1, label = "09/04/2021",
fontface = 'italic', color = 'red') +
annotate("text", x = 0.45, y = 1, label = "EVERY U.S. COUNTY", size = 6) +
annotate("text", x = 0.1, y = 0.06, label = deparse(bquote(~R^2 ==~ 0.4216)),
parse = T, size = 3, color = 'red')+
annotate("text", x = 0.1, y = 0.1, label = "y = -0.4224x + 0.6451",
color = 'red',size = 3) +
scale_y_continuous(labels = scales::percent)+
scale_x_continuous(labels = scales::percent, limits = c(0,1))+
theme(legend.position="none")

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>%
# the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>%
# drop the first row, as it contains again the variable names and last row
# that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last
# date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date
# object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date,
# if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)
glimpse(german_election_polls)
## Rows: 239
## Columns: 16
## $ polling_firm <chr> "INSA", "Forsa", "INSA", "Ipsos", "Forschungsgruppe Wah…
## $ fieldwork_date <chr> "17–20 Sep 2021", "14–20 Sep 2021", "13–17 Sep 2021", "…
## $ samplesize <chr> "2,054", "2,502", "1,502", "2,000", "1,406", "1,512", "…
## $ abs <chr> "–", "25", "–", "–", "22", "–", "–", "–", "–", "–", "–"…
## $ union <dbl> 22.0, 22.0, 21.0, 21.0, 22.0, 22.0, 20.0, 23.0, 20.0, 2…
## $ spd <dbl> 25, 25, 26, 27, 25, 26, 27, 25, 25, 26, 26, 25, 25, 26,…
## $ af_d <dbl> 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 10.0, 11.0, 1…
## $ fdp <dbl> 12.0, 11.0, 12.0, 10.0, 11.0, 11.0, 12.5, 12.0, 10.0, 1…
## $ linke <dbl> 6.5, 6.0, 6.0, 7.0, 6.0, 6.0, 6.5, 6.0, 8.0, 6.0, 6.5, …
## $ grune <dbl> 15.0, 17.0, 15.0, 18.0, 16.0, 15.0, 15.0, 17.0, 15.0, 1…
## $ fw <chr> "–", "3", "3", "–", "–", "3", "–", "–", "3", "–", "–", …
## $ others <chr> "8.5", "5", "6", "6", "9", "6", "8", "7", "7", "8", "8"…
## $ lead <chr> "3", "3", "5", "6", "3", "4", "7", "2", "5", "6", "5.5"…
## $ end_date <date> 2021-09-20, 2021-09-20, 2021-09-17, 2021-09-16, 2021-0…
## $ month <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ week <dbl> 38, 38, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 36,…
#selecting the columns that are useful
selected_data <- german_election_polls %>%
select (samplesize, union, spd, af_d, fdp, linke, grune, end_date)
#change "samplesize" to double
selected_data$samplesize <- as.numeric(gsub(",","",selected_data$samplesize))
election_data_raw <- selected_data %>%
mutate (number_of_vote_union = samplesize*union/100,
number_of_vote_spd = samplesize*spd/100,
number_of_vote_af_d = samplesize*af_d/100,
number_of_vote_fdp = samplesize*fdp/100,
number_of_vote_linke = samplesize*linke/100,
number_of_vote_grune = samplesize*grune/100) %>%
#calculate the number of votes for each party (sample size*percenrage of vote)
group_by(end_date) %>%
mutate (totalsample = sum(samplesize)) %>%
#the total sample size of the result given by each company on the same date
mutate (sum_vote_union = sum(number_of_vote_union),
sum_vote_spd = sum(number_of_vote_spd),
sum_vote_af_d = sum(number_of_vote_af_d),
sum_vote_fdp = sum(number_of_vote_fdp),
sum_vote_linke = sum(number_of_vote_linke),
sum_vote_grune = sum(number_of_vote_grune)) %>%
# calculate the average percentage of vote by using the total number of votes
# for the party / the total sample size
mutate (average_rate_union = sum_vote_union/totalsample,
average_rate_spd = sum_vote_spd/totalsample,
average_rate_af_d = sum_vote_af_d/totalsample,
average_rate_fdp = sum_vote_fdp/totalsample,
average_rate_linke = sum_vote_linke/totalsample,
average_rate_grune = sum_vote_grune/totalsample)
#remove the duplicated rows for the same date
election_data <- election_data_raw[!duplicated(election_data_raw$totalsample), ]
election_data = na.omit(election_data)
roll_mean_union = zoo::rollmean(election_data$average_rate_union,k = 14,
fill = NA, align = "left")
roll_mean_spd = zoo::rollmean(election_data$average_rate_spd,k = 14,
fill = NA, align = "left")
roll_mean_af_d = zoo::rollmean(election_data$average_rate_af_d,k = 14,
fill = NA, align = "left")
roll_mean_fdp = zoo::rollmean(election_data$average_rate_fdp,k = 14,
fill = NA, align = "left")
roll_mean_linke = zoo::rollmean(election_data$average_rate_linke,k = 14,
fill = NA, align = "left")
roll_mean_grune= zoo::rollmean(election_data$average_rate_grune,k = 14,
fill = NA, align = "left")
election_data = cbind(election_data,
roll_mean_union=roll_mean_union,
roll_mean_spd=roll_mean_spd,
roll_mean_af_d=roll_mean_af_d,
roll_mean_fdp=roll_mean_fdp,
roll_mean_linke=roll_mean_linke,
roll_mean_grune=roll_mean_grune)
colors = c("CDU/CSU" = "black", "SPD" = "red3", "AfD" = "midnightblue",
"FDP" = "orange1",
"Linke" = "mediumvioletred", "Grune" = "green")
election_data %>%
ggplot(aes(x = end_date))+
geom_point(aes(y = average_rate_union), alpha=0.3, size=1) +
geom_point(aes(y = average_rate_spd), color="red3", alpha=0.3, size=1) +
geom_point(aes(y = average_rate_af_d), color="midnightblue", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_fdp), color="orange1", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_linke), color="mediumvioletred", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_grune), color="green", alpha=0.3, size=1)+
geom_line(aes(y = roll_mean_union, color="CDU/CSU"), size=1) +
geom_line(aes(y = roll_mean_spd, color="SPD"),size=1) +
geom_line(aes(y = roll_mean_af_d, color="AfD"),size=1)+
geom_line(aes(y = roll_mean_fdp, color="FDP"),size=1)+
geom_line(aes(y = roll_mean_linke, color="Linke"),size=1)+
geom_line(aes(y = roll_mean_grune, color="Grune"),size=1)+
labs(title = "German Election Poll", x="Date", y="Percentage of vote")+
scale_y_continuous(labels = scales::percent)+
scale_x_date(labels = date_format("%B-%Y"))+
theme(legend.position = "right",legend.background = element_rect(fill="snow",
size=0.5, linetype="solid")) +
theme_bw()+
scale_color_manual(values = colors, name = "Parties")
