Human well-being is complex and hard to measure, and happiness in particular is difficult to capture with data. Still, the welfare of humanity matters, especially to world organizations such as the WHO and the UN. The UN even dedicates a day to celebrating happiness across the world's cultures. The World Happiness Report (WHR) publishes independent research on how happy people are around the world. The 2020 report discusses in detail the categories used to group its variables, and how the environment affects a population's happiness, especially social environments such as personal connections and a country's institutions. It then compares happiness in urban areas with happiness elsewhere. The report is an interesting analysis of what determines happiness around the world, and it brings the question into an analytic point of view.
In this tutorial, we use this dataset from Kaggle, which was gathered from the WHR. Our goal is to tidy the data and provide insight into what it tells us. We would like to see whether there is a formula that tells us how strongly each variable affects people's happiness. We also use the data to look at how the world changes over time, since people may value different factors from year to year. Other groupings are worth a look as well, for example whether different regions, or first and third world countries, value different things. Finally, the results could be compared with other analyses of happiness to see whether they differ. We hope to show how the measure of happiness shifts as time goes on, as more technology becomes available and different philosophies arise. Hopefully we can teach people about how different countries see happiness.
Using Python 3, we import several libraries to help with data munging, analysis, and visualization: pandas (as pd) for organizing the data, matplotlib.pyplot (as plt) and seaborn (as sns) for visualizations, numpy (as np) for math operations, sklearn's LinearRegression for training the linear regression model, and r2_score and mean_squared_error for evaluating the regression model's accuracy. Finally, we pip install folium to visualize the data of different regions on a map.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
!pip install folium
import folium
Before the data can be used, we have to import it into the notebook and clean it up so that it is manageable, easier to represent, and ready for analysis. We downloaded the files from the website and put them into the GitHub repository so that we can read from them. Then, using pandas' CSV reader, we read each file into a pandas DataFrame.
data2015 = pd.read_csv("Data/2015")
data2016 = pd.read_csv("Data/2016")
data2017 = pd.read_csv("Data/2017")
data2018 = pd.read_csv("Data/2018")
data2019 = pd.read_csv("Data/2019")
data2020 = pd.read_csv("Data/2020")
The data collected by the WHR differs from year to year: the report gathered whatever information it considered important to that year's research, and some columns were renamed. For example, the 2015 data has a feature called Family, which the 2020 data replaces with Social support. Columns are often named differently but represent the same information; "Freedom" in one year could be "Explained by: Freedom to make life choices" in another. These columns have to be renamed so that the datasets are easier to merge into one big set. In addition, the 2020 region names differ from those of other years, so the 2020 countries cannot be matched to the earlier regions as they stand; we rename the 2020 regions to match the other tables.
data2020['Happiness Rank'] = range(1, len(data2020.index)+1)
data2020['Year'] = 2020
# Drop the raw 'Generosity' column so that 'Explained by: Generosity' can take its name in the
# rename below; otherwise the concat later on would see two 'Generosity' columns
data2020 = data2020.drop(['Generosity'], axis = 1)
data2020 = data2020.rename(columns = {
    'Country name' : 'Country',
    'Regional indicator' : 'Region',
    'Ladder score' : 'Happiness Score',
    'Standard error of ladder score' : 'Standard Error',
    'Explained by: Log GDP per capita' : 'Economy (GDP per Capita)',
    'Explained by: Social support' : 'Family',
    'Explained by: Healthy life expectancy' : 'Health (Life Expectancy)',
    'Explained by: Freedom to make life choices' : 'Freedom',
    'Explained by: Perceptions of corruption' : 'Trust (Government Corruption)',
    'Explained by: Generosity' : 'Generosity'})
# Rename some regions in 2020, since the report changed some regions around. The Commonwealth of
# Independent States is treated here as part of Central and Eastern Europe. For the Asian regions,
# the 2020 report uses the noun forms of the cardinal directions rather than the adjectives.
data2020['Region'] = data2020['Region'].replace(['Commonwealth of Independent States'],'Central and Eastern Europe')
data2020['Region'] = data2020['Region'].replace(['South Asia'],'Southern Asia')
data2020['Region'] = data2020['Region'].replace(['Southeast Asia'],'Southeastern Asia')
data2020['Region'] = data2020['Region'].replace(['East Asia'],'Eastern Asia')
data2020['Region'] = data2020['Region'].replace(['Middle East and North Africa'],'Middle East and Northern Africa')
# Split the combined 2020 region into North America and Australia and New Zealand
for i, row in data2020.loc[data2020['Region'] == 'North America and ANZ'].iterrows():
    if row['Country'] == 'United States' or row['Country'] == 'Canada':
        data2020.at[i,'Region'] = 'North America'
    else:
        data2020.at[i,'Region'] = 'Australia and New Zealand'
data2020
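The row-by-row split of the combined region can also be written with boolean masks. The following is a minimal sketch on a toy table rather than the real 2020 data; `toy2020` and its rows are placeholders introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for data2020 (hypothetical rows, only the two columns needed here)
toy2020 = pd.DataFrame({
    "Country": ["United States", "Canada", "Australia", "New Zealand", "Finland"],
    "Region": ["North America and ANZ"] * 4 + ["Western Europe"],
})

# Rows that belong to the combined 2020 region
combined = toy2020["Region"] == "North America and ANZ"
# Of all rows, the two North American countries
north_america = toy2020["Country"].isin(["United States", "Canada"])

# Assign the new region names without an explicit loop
toy2020.loc[combined & north_america, "Region"] = "North America"
toy2020.loc[combined & ~north_america, "Region"] = "Australia and New Zealand"

print(toy2020)
```

Both masks are computed before any assignment, so the second assignment still targets the rows that originally carried the combined label.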
# temp is just a table for the merge to be conducted on to match countries with their regions
temp = data2016[['Country','Region']]
data2019 = data2019.rename(columns = {'Overall rank' : 'Happiness Rank', 'Country or region' : 'Country',
'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
'Social support': 'Family', 'Healthy life expectancy' : 'Health (Life Expectancy)',
'Freedom to make life choices':'Freedom','Perceptions of corruption' : 'Trust (Government Corruption)'})
# Match the countries with their respective regions
data2019 = pd.merge(data2019, temp, how='left', on=['Country'])
data2019['Year'] = 2019
data2019
data2018 = data2018.rename(columns = {'Happiness.Rank' : 'Happiness Rank', 'Country or region':'Country', 'Overall rank':'Happiness Rank',
'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
'Social support': 'Family', 'Healthy life expectancy' : 'Health (Life Expectancy)',
'Freedom to make life choices':'Freedom','Perceptions of corruption' : 'Trust (Government Corruption)'})
data2018 = pd.merge(data2018, temp, how='left', on=['Country'])
data2018['Year'] = 2018
data2018
data2017 = data2017.rename(columns = {'Happiness.Rank' : 'Happiness Rank', 'Country or region' : 'Country',
'Happiness.Score' : 'Happiness Score', 'Economy..GDP.per.Capita.' : 'Economy (GDP per Capita)',
'Health..Life.Expectancy.' : 'Health (Life Expectancy)',
'Freedom to make life choices':'Freedom','Trust..Government.Corruption.' : 'Trust (Government Corruption)'})
data2017 = pd.merge(data2017, temp, how='left', on=['Country']).dropna()
data2017['Year'] = 2017
data2017
data2016['Year'] = 2016
data2015['Year'] = 2015
data2016
data2015
dataAll contains a concatenation of all 6 years. We use dropna to drop any rows with missing values, specifically missing values in the Region column. Because Region is categorical data, we cannot extrapolate it: we do not know what the WHR's standards are for placing countries into regions.
dataAll = pd.concat([data2020,data2019,data2018,data2017,data2016,data2015])
dataAll=dataAll[['Country','Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)',
'Family', 'Health (Life Expectancy)', 'Freedom','Trust (Government Corruption)',
'Generosity', 'Year',]].dropna()
dataAll
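Since dropna silently removes every country that found no region in the left merge, it can help to list those countries first. The sketch below is one way to do that with the `indicator` option of `pd.merge`; the two toy tables (`lookup`, `later_year`) and their values are hypothetical stand-ins, not rows from the WHR files.

```python
import pandas as pd

# Toy stand-ins (hypothetical): a region lookup like `temp`, and a later year's table
lookup = pd.DataFrame({
    "Country": ["Finland", "Canada"],
    "Region": ["Western Europe", "North America"],
})
later_year = pd.DataFrame({
    "Country": ["Finland", "Canada", "Gambia"],
    "Happiness Score": [7.0, 6.0, 5.0],  # placeholder values
})

# indicator=True adds a "_merge" column saying where each row was found
merged = pd.merge(later_year, lookup, how="left", on="Country", indicator=True)

# Rows found only in the left table got no region and would be removed by dropna()
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Running a check like this on each year before the concatenation shows how many countries the final table loses.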
For the meaning of each column/variable, refer to the index here. The first two pages explain what the report asked in its survey and how it converted the answers to numbers.
# data to plot
n_groups = 10
economy = dataAll.groupby("Region")["Economy (GDP per Capita)"].mean()
fam = dataAll.groupby("Region")["Family"].mean()
free = dataAll.groupby("Region")["Freedom"].mean()
# create plot
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.25
opacity = 0.8
# Freedom
rects1 = plt.bar(index - bar_width, free, bar_width,
alpha=opacity,
color='r',
label='Freedom',
align='center')
# Economy
rects2 = plt.bar(index, economy, bar_width,
alpha=opacity,
color='b',
label='Economy (GDP per Capita)',
align='center')
# Family
rects3 = plt.bar(index + bar_width, fam, bar_width,
alpha=opacity,
color='g',
label='Family',
align='center')
#Labels
plt.xlabel('Regions')
plt.ylabel('Mean of freedom, economy and family')
plt.title('Comparison of Mean of Freedom, Economy and Family Based on Each Region')
plt.xticks(index, ('Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', 'Latin America and Caribbean', 'Middle East and Northern Africa', 'North America', 'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa', 'Western Europe'))
plt.xticks(rotation=90)
#create a key
plt.legend()
plt.show()
The above bar graph shows the mean of each region's freedom, economy and family scores. We chose to analyze the mean because the cumulative score of each region would give an inaccurate picture of which region scores highest. The number of countries differs from region to region, so regions with more countries would seem to have a higher score when, in reality, they simply have more scores rather than higher ones.
Based on the graph, Central and Eastern Europe, Middle East and Northern Africa, and Sub-Saharan Africa have the lowest freedom scores, while Australia and New Zealand has the highest. Looking at the economy, Sub-Saharan Africa has the lowest score while North America has the highest. Additionally, Australia and New Zealand has the highest family score, and Southern Asia and Sub-Saharan Africa have the lowest.
As a whole, Sub-Saharan Africa consistently has one of the lowest scores in each of the three categories, while North America and Australia and New Zealand have the highest. Western Europe also scores high in all three categories, although not the highest.
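The reason for comparing means rather than totals can be illustrated with a small made-up example. The regions "A" and "B" and their scores below are hypothetical; the point is only that a region with more countries can have the larger sum while having the smaller mean.

```python
import pandas as pd

# Hypothetical scores: region A has four countries, region B has two
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "A", "B", "B"],
    "Freedom": [0.3, 0.3, 0.3, 0.3, 0.5, 0.5],
})

# Count, sum and mean of the same column, side by side for each region
by_region = toy.groupby("Region")["Freedom"].agg(["count", "sum", "mean"])
print(by_region)
```

Region A wins on the sum only because it contributes more rows, which is the distortion the mean avoids.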
Next, we use folium to create a map of happiness score across regions.
# Create an interactive map of the mean Happiness Score in each region
mapp = folium.Map(zoom_start=20)
happiness = dataAll.groupby("Region")["Happiness Score"].mean()
# Approximate center (latitude, longitude) and marker color for each region
region_markers = {
    'Australia and New Zealand': (-40.9006, 174.8860, 'Green'),
    'Central and Eastern Europe': (52.0055, 37.9587, 'Gray'),
    'Eastern Asia': (38.7946, 106.5348, 'Yellow'),
    'Latin America and Caribbean': (21.4691, -78.6569, 'Pink'),
    'Middle East and Northern Africa': (29.2985, 42.5510, 'Red'),
    'North America': (54.5260, -105.2551, 'Blue'),
    'Southeastern Asia': (-2.2180, 115.6628, 'Purple'),
    'Southern Asia': (25.0376, 76.4563, 'Black'),
    'Sub-Saharan Africa': (-23.806078, 11.288452, 'Cyan'),
    'Western Europe': (46.2022, 1.2644, 'Orange'),
}
for region, (lat, lon, color) in region_markers.items():
    # Mean happiness score of this region
    score = float(happiness.loc[region])
    folium.CircleMarker(
        # folium expects [latitude, longitude]
        location=[lat, lon],
        # doubled so that the radius is visible
        radius = score*2,
        tooltip = region,
        popup = score,
        color = color,
        fill = True,
    ).add_to(mapp)
mapp
The above map shows the mean of each region's happiness score. The longitude and latitude of each region were found on Google. We made the radius of each circle proportional to the mean happiness score of its region so that it is easier to see which regions have the highest and lowest scores. We also enabled the popup feature so that the user can click a circle to see the region's score. Additionally, when the user hovers the mouse over a circle, the region name is shown.
From this visualization, North America and Australia and New Zealand have the highest average happiness scores; clicking on the circles shows that the latter has a slightly higher score. Sub-Saharan Africa has a visibly smaller circle than the other regions, meaning it has the lowest happiness score.
Combining both visualizations, Sub-Saharan Africa has the lowest scores in both analyses, and Australia and New Zealand and North America have the highest. From this, it seems that the factors (family, freedom, economy) all play a role in each region's happiness score.
We then want to see how Happiness Score has changed over the years. We calculate the mean Happiness Score in each year and plot the result.
# Get mean of happiness score in all countries over years
mean_happiness_score_by_year_df = dataAll \
.groupby("Year")["Happiness Score"] \
.mean()
# Plot Mean of happiness scores of all countries over years
plt.figure(figsize=(12,8))
mean_happiness_score_by_year_df.plot (
x="Year", y="Happiness Score",
kind="line", marker="o",
title="Mean of happiness scores of all countries over years",
xlabel="Year", ylabel="Mean of Happiness Scores"
)
# Show the result
plt.show()
Similarly, we calculate the standard deviation of Happiness Score across years, and plot the result.
# Get stddev of happiness score in all countries over years
std_happiness_score_by_year_df = dataAll \
.groupby("Year")["Happiness Score"] \
.std()
# Plot Standard Deviations of happiness scores of all countries over years
plt.figure(figsize=(12,8))
std_happiness_score_by_year_df.plot (
x="Year", y="Happiness Score",
kind="line", marker="o",
title="Standard Deviations of happiness scores of all countries over years",
xlabel="Year", ylabel="Standard Deviations of Happiness Scores"
)
# Show the result
plt.show()
The mean Happiness Score fluctuated in the 2015-2017 period but has increased since 2018, which suggests that people are getting happier over time. It is also interesting that the standard deviation of Happiness Score trended downward in the 2015-2019 period, meaning that happiness has become more evenly spread around the world.
We want to predict the Happiness Score based on other variables in the dataset. From the graph "Mean of happiness scores of all countries over years" in part 3 (Exploratory Analysis and Data Visualization), we observe an increasing trend in Happiness Score over the years.
Our hypothesis: Happiness Score is dependent on Year. To test the hypothesis, we plot the distribution of Happiness Score over the years using plt and sns.
#Plot the happiness score distribution
plt.figure(figsize=(14,6)) #Format figure size
sns.violinplot(x='Year', y='Happiness Score', data=dataAll).set( xlabel="Year", ylabel="Happiness score ")
plt.title("Distribution of happiness score across countries over years")
# Show the plot
plt.show()
The ranges of Happiness Score are quite similar (from ~2 to ~8.5) across years. The distributions of Happiness Score are fairly symmetric, but not normal. In 2018, 2019, and 2020 in particular, the distributions tend to have bimodal shapes.
It is hard to see the trend of the mean Happiness Score in this plot because its year-to-year changes are very small compared to the range of Happiness Score.
Although the mean Happiness Score has an increasing trend, the increase is not significant compared to the range of Happiness Score. As we are unsure about the relationship between Year and Happiness Score, we use a linear regression model to predict Happiness Score based on Year, and use the R Squared Score as well as the Mean Squared Error to evaluate the prediction accuracy.
# Function for linear regression from now on
def linear_reg (features_X, happiness_Y, feature_name):
    # Linear Regression
    regr = LinearRegression()
    regr.fit(features_X, happiness_Y)
    # Get predicted value
    happiness_Y_predict = regr.predict(features_X)
    # Print mean squared error & R^2 score:
    print()
    print('Mean squared error: %.2f'
          % mean_squared_error(happiness_Y, happiness_Y_predict))
    print('R squared score: %.2f\n'
          % r2_score(happiness_Y, happiness_Y_predict))
    # Print parameters
    print("PARAMETERS:")
    params_df = pd.DataFrame([regr.coef_],
                             columns = feature_name)
    params_df["Intercept"] = regr.intercept_
    print(params_df)
    # Return the predicted values, the coefficients, and the intercept
    return happiness_Y_predict, regr.coef_, regr.intercept_
#### MODEL 1: PREDICT HAPPINESS SCORE BASED ON YEAR #####
# Get X and Y axes
year_X = dataAll['Year'].to_numpy().reshape(-1, 1)
happiness_Y = dataAll['Happiness Score']
# Linear regression
happiness_Y_predict, coeff, intercept = linear_reg(year_X, happiness_Y, ["Year"])
# Regression Line
print(f'\nRegression line:\n\tHappiness Score = {intercept} + Year * {coeff[0]} \n', )
# Add the residual column to dataAll
dataAll['Happiness Residual 1'] = happiness_Y_predict - happiness_Y
# Plot the regression model (Happiness Score Vs. Year)
fig, ax = plt.subplots(figsize=(14,6)) #Format figure size
# Pass ax so that the scatter plot is drawn on the sized figure instead of a new one
dataAll.plot.scatter(x='Year', y='Happiness Score', ax=ax)
ax.set_title("Scatter plot of Happiness Score vs Year")
ax.set_xlabel("Year")
ax.set_ylabel("Happiness Score")
ax.plot(dataAll['Year'], coeff[0] * dataAll['Year'] + intercept)
plt.show()
The slope of the linear regression line is 0.0197, which shows that the relationship between Year and Happiness Score is weak. Also, since the R Squared Score is 0.0 (far from 1) and the Mean Squared Error is 1.25 (quite high given that Happiness Score ranges from ~2 to ~8.5), Year is not a good predictor of Happiness Score.
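To make the two metrics concrete, here is a small worked computation of Mean Squared Error and R Squared Score using their standard definitions. The arrays `y_true` and `y_pred` are made-up numbers, not values from the dataset.

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 5.0, 6.0, 9.0])

# Mean squared error: average squared gap between observed and predicted values
mse = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# A "model" that always predicts the overall mean scores exactly 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"MSE = {mse:.2f}, R^2 = {r2:.2f}, baseline R^2 = {r2_baseline:.2f}")
# prints: MSE = 0.50, R^2 = 0.90, baseline R^2 = 0.00
```

An R Squared Score of about 0, as reported for the Year model, therefore means the model does no better than predicting the overall mean for every row.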
Let's take a look at the Residual Plot of the above Linear Regression Model (Happiness Score Vs. Year)
#Plot the residual distribution of Happiness Score over years
plt.figure(figsize=(14,6)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 1', data=dataAll) \
.set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of Happiness Score residual over time (Model 1: Happiness Score based on Year)")
# Show the plot
plt.show()
Although the means of the residuals are around 0, their distributions are not normal. This again suggests that Year is not a good predictor of Happiness Score, so we reject the hypothesis that Happiness Score is dependent on Year.
Now we need to choose another factor as a predictor of Happiness Score. We believe that Happiness Score varies by region. For example, people might feel happier when they live in a region with a high living standard; on the other hand, people could feel less happy when they live in a region that is politically unstable.
Our next hypothesis: Happiness Score is dependent on Region.
To test our hypothesis, we plot Happiness Score in different regions and different years to see how it differs with respect to Year and with respect to Region. As Happiness Score varies within each region, we calculate the mean Happiness Score in each region for each year and plot these mean values.
# Get means of happiness score by year and region
mean_score_yr_region = dataAll.groupby(["Year", "Region"])["Happiness Score"].mean()
# Create a dataframe with Year and Region as regular columns
mean_score_yr_region_df = mean_score_yr_region.reset_index()
# Plot the mean of happiness score over years across regions
plt.figure(figsize=(20,12))
ax = sns.lineplot(data=mean_score_yr_region_df,
                  x='Region', y='Happiness Score',
                  hue='Year', marker="o",
                  palette=["red", "blue", "orange", "purple", "green", "gray" ])
plt.xticks(rotation=90) #Make x labels vertical
plt.xlabel("Region")
plt.title("Mean Happiness Score by Region and Year")
# Show the plot
plt.show()
As can be observed from the above plot, Happiness Score varies significantly by region. Australia and New Zealand, North America, and Western Europe are the regions with the highest Happiness Scores. This matches our assumption that people living in regions with high living standards tend to feel happier than others. Southern Asia and Sub-Saharan Africa are the regions with the lowest Happiness Scores. One possible explanation is that many countries in Southern Asia and Sub-Saharan Africa are either economically weak or politically unstable.
Plotting the mean Happiness Score by Year and by Region also allows us to compare the effect of Year on Happiness Score with the effect of Region. In the above plot, it is hard to distinguish the Happiness Score among years, while it is easy to distinguish it among regions. This means that Happiness Score is more likely to depend on Region than on Year.
Region is a categorical variable with 10 distinct values, corresponding to 10 dummy variables that we will create. The names of the 10 dummy variables match the values of the "Region" column.
### Function to assign values for dummy variables that represent region
def label_continent(row, region):
    if row['Region'] == region:
        return 1
    else:
        return 0
#Add new columns for regions
dataAll['Australia and New Zealand'] = dataAll.apply (lambda row: label_continent(row,'Australia and New Zealand'), axis=1)
dataAll['Central and Eastern Europe'] = dataAll.apply (lambda row: label_continent(row,'Central and Eastern Europe'), axis=1)
dataAll['Eastern Asia'] = dataAll.apply (lambda row: label_continent(row,'Eastern Asia'), axis=1)
dataAll['Latin America and Caribbean'] = dataAll.apply (lambda row: label_continent(row,'Latin America and Caribbean'), axis=1)
dataAll['Middle East and Northern Africa'] = dataAll.apply (lambda row: label_continent(row,'Middle East and Northern Africa'), axis=1)
dataAll['North America'] = dataAll.apply (lambda row: label_continent(row,'North America'), axis=1)
dataAll['Southeastern Asia'] = dataAll.apply (lambda row: label_continent(row,'Southeastern Asia'), axis=1)
dataAll['Southern Asia'] = dataAll.apply (lambda row: label_continent(row,'Southern Asia'), axis=1)
dataAll['Sub-Saharan Africa'] = dataAll.apply (lambda row: label_continent(row,'Sub-Saharan Africa'), axis=1)
dataAll['Western Europe'] = dataAll.apply (lambda row: label_continent(row,'Western Europe'), axis=1)
dataAll
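The same kind of 0/1 columns can also be produced in one call with `pd.get_dummies`. This is a minimal sketch on a toy column, shown as an alternative to the row-wise function rather than as the tutorial's method; `toy` is a placeholder table.

```python
import pandas as pd

# Toy stand-in for the "Region" column of dataAll
toy = pd.DataFrame({"Region": ["Western Europe", "Southern Asia", "Western Europe"]})

# One 0/1 column per distinct region, named after the region itself
dummies = pd.get_dummies(toy["Region"], dtype=int)

# Attach the dummy columns next to the original column
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

The resulting column names equal the region names, so a feature list like the one used for Model 2 would work unchanged.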
#### MODEL 2: PREDICT HAPPINESS SCORE BASED ON REGION #####
feature_names = ['Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', \
'Latin America and Caribbean', 'Middle East and Northern Africa', 'North America', \
'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa', 'Western Europe']
# Get data of independent variables and dependent variable
region_X = dataAll[feature_names]
happiness_Y = dataAll['Happiness Score']
#Linear Regression
happiness_Y_predict, coeff, intercept = linear_reg(region_X, happiness_Y, feature_names)
#Regression line:
print(f"\nRegression Line: \nHappiness Score = {intercept}\n \
+ {coeff[0]} * {feature_names[0]} + {coeff[1]} * {feature_names[1]} \n \
+ {coeff[2]} * {feature_names[2]} + {coeff[3]} * {feature_names[3]} \n \
+ {coeff[4]} * {feature_names[4]} + {coeff[5]} * {feature_names[5]} \n \
+ {coeff[6]} * {feature_names[6]} + {coeff[7]} * {feature_names[7]} \n \
+ {coeff[8]} * {feature_names[8]} + {coeff[9]} * {feature_names[9]}")
# Add the residual column
dataAll['Happiness Residual 2'] = happiness_Y_predict - happiness_Y
#Plot the residual distribution over years
plt.figure(figsize=(10,5)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 2', data=dataAll) \
.set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of happiness score residual over time (Model 2: Happiness Score based on Region)")
plt.show()
Analysis:
Model 2 has a lower Mean Squared Error, a higher R Squared Score, and better-behaved residuals than Model 1.
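It may help to see what a regression on region dummies actually predicts: for every row, the fitted value is the mean score of that row's region. The sketch below checks this on made-up data. It uses `np.linalg.lstsq` as a stand-in for sklearn's LinearRegression (both are ordinary least squares), and the regions "A" and "B" and their scores are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for two regions
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "Happiness Score": [7.0, 6.0, 4.0, 5.0, 3.0],
})

dummies = pd.get_dummies(toy["Region"], dtype=float)
# Design matrix: intercept column plus one dummy per region
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
y = toy["Happiness Score"].to_numpy()

# Least-squares fit; lstsq returns the minimum-norm solution here
# because the dummy columns add up to the intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ beta

# Mean score of each row's own region
region_mean = toy.groupby("Region")["Happiness Score"].transform("mean").to_numpy()
print(np.allclose(predicted, region_mean))
```

Because the dummies sum to the intercept column, the individual coefficients are not uniquely determined even though the predictions are. Leaving out one dummy as a reference category is a common way to make the coefficients interpretable.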
In this model, we predict the happiness score based on Economy, Family, Health, Freedom, Trust, and Generosity, and compare the result with the previous models. From that, we choose the model that performs best.
#### MODEL 3: PREDICT HAPPINESS SCORE BASED ON OTHER FACTORS #####
feature_names = ['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', \
'Freedom', 'Trust (Government Corruption)', 'Generosity']
#Get X and Y axes
features_X = dataAll[feature_names]
happiness_Y = dataAll['Happiness Score']
# Linear regression
happiness_Y_predict, coeff, intercept = linear_reg (features_X, happiness_Y, feature_names)
# Regression line:
print(f"\nRegression Line: \nHappiness Score = {intercept}\n \
+ {coeff[0]} * {feature_names[0]} + {coeff[1]} * {feature_names[1]} \n \
+ {coeff[2]} * {feature_names[2]} + {coeff[3]} * {feature_names[3]} \n \
+ {coeff[4]} * {feature_names[4]} + {coeff[5]} * {feature_names[5]}")
# Add the residual column
dataAll['Happiness Residual 3'] = happiness_Y_predict - happiness_Y
#Plot the residual distribution over years
plt.figure(figsize=(10,5)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 3', data=dataAll) \
.set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of happiness score residual over time (Model 3: Happiness Score based on Economy, Family, Health, Freedom, Trust, Generosity)")
plt.show()
The Mean Squared Error and R Squared Score of Model 3 improve on those of Model 2. However, the residual plot shows that the means of the residuals are not 0 and that the residual distributions of Model 3 are not as close to normal as those of Model 2.
Conclusion: we choose Model 2 (predict Happiness Score based on Region) as the model for predicting happiness score.
Now that we have a model to predict happiness score, our next goal is to check whether the factors that affect Happiness Score shift as time goes on. To test that, we use linear regression to predict the Happiness Score based on Economy, Family, Health, Freedom, Trust, and Generosity in each year. From the parameters of the regression lines for different years, we can compare the effect of each factor on Happiness Score over the years.
feature_names = ['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', \
'Freedom', 'Trust (Government Corruption)', 'Generosity']
def factors_affected_happiness_by_year(year):
    # Get year dataframe
    year_df = dataAll[dataAll["Year"] == year]
    #Get X and Y axes
    features_X = year_df[feature_names]
    happiness_Y = year_df['Happiness Score']
    # Label the output so each year's parameters can be told apart
    print(f"Year: {year}")
    happiness_Y_predicted, coeff, intercept = linear_reg (features_X, happiness_Y, feature_names)
factors_affected_happiness_by_year(2015)
factors_affected_happiness_by_year(2016)
factors_affected_happiness_by_year(2017)
factors_affected_happiness_by_year(2018)
factors_affected_happiness_by_year(2019)
factors_affected_happiness_by_year(2020)
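Comparing six separate printouts by eye is error-prone, so one option is to collect the per-year coefficients into a single table. The sketch below does this on made-up data with one feature; `toy`, `features`, and `coef_by_year` are names introduced for illustration, and the numbers are constructed so that the slope is exactly 2 in 2015 and 3 in 2016.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: the score depends on "Family" with slope 2 in 2015 and slope 3 in 2016
toy = pd.DataFrame({
    "Year": [2015] * 3 + [2016] * 3,
    "Family": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "Happiness Score": [1.0, 3.0, 5.0, 1.0, 4.0, 7.0],
})
features = ["Family"]

# Fit one regression per year and keep its coefficients
rows = {}
for year, year_df in toy.groupby("Year"):
    regr = LinearRegression().fit(year_df[features], year_df["Happiness Score"])
    rows[year] = pd.Series(regr.coef_, index=features)

# One row per year, one column per feature
coef_by_year = pd.DataFrame(rows).T
coef_by_year.index.name = "Year"
print(coef_by_year.round(3))
```

With the real data, `features` would be the six factor columns and each row of the table would show how much weight the fitted line gives each factor in that year.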
Happiness is an important factor in keeping the overall population healthy and in building a progressive country. It is not often that you find data covering as many countries and features as the dataset created by the WHR. From this dataset we were able to learn what the major contributors to people's overall happiness are. It appears that, in recent years, certain features have had a decreasing effect on what makes a person happy (e.g., family decreased from 1.408892 in 2015 to 1.153 in 2020), while other features have had an increasing effect (e.g., generosity rose from 0.159494 in 2016 to 0.620785 in 2020). It seems that people around the world are slowly shifting their concept of what makes them happy toward a more holistic view of what benefits others rather than themselves. This tutorial provided some insight into how happiness is gauged in different regions of the world. Hopefully, from our tutorial you were able to learn the steps needed to wrangle, display, and analyze data and present it in organized documentation for others to read.
Countless factors go into what makes people happy in different countries, and we only used some basic information from a single dataset. Other factors that may affect happiness include whether the country is currently in a military conflict, how much infrastructure the country has, and whether the country is suffering or recovering from a natural disaster. Also, our tutorial stratified the data by region; you could gather more data from other sources and group it by population or by the countries themselves. There are countless ways to group the data and analyze the factors that result in the happiness score. This tutorial is just a sample of data collected around the world by a single organization dedicated to providing information about the world population. We hope that this tutorial taught you something new and inspires you to go out and analyze more data.