Happiness Within Countries¶

Iris Truong, Jason Zhong, Yun Qi¶

INTRODUCTION¶

Humans are complex beings and hard to gauge how things are going. As so, happiness is also a hard aspect to gauge with data. The welfare of humanity is an important aspect that we should care about. This is especially important to world organizations like the WHO and the UN. The UN even has a day dedicated celebrating the happiness in the culture of people around the world. There have been independent researches done by the World Happiness Report on how happy people are in the world. In the 2020 report, the WHR discusses in detail what categories they used to group the variables. It discusses how the environment affects the happiness of the population, especially different social environments like connections and institutions in its country. The report then continues to discuss the differences in happiness in urban areas compared to other places. The report is an interesting analysis of what determines happiness for people around the world and bring it to an analytic point of view.

In this tutorial, we will be using this dataset from Kaggle which was gathered from the WHR. Our goal is to tidy up the data given to us and provide insight what the data tells us. We would like to see if there is a formula that would tell us how effective a each variable is in affect the happiness of people. In addition, using the data we could see any continuity of how the world changes overtime as people may value certain factors over the years. There could be other groupings we could look at like different regions might value happiness or whether first and third world countries have different values. Finally, with the given results we could compare the results with other analysis on happiness and see if there any difference between the analysis. We hope to show the progression of humanity as time goes on, the measure of happiness would shift to something else as more technology becomes avaliable and different philosophy rises. Hopefully we can teach people about how different countries sees happiness.

1. DATA COLLECTION¶

Using Python3 we will import some libraries to help with the data munging, analysis, and visual representation. The libraries imported are pandas as pd for orginizing the data, matplotlib.pyplot as plt and seaborn as sns for visualizations, numpy as np for math operation, sklearn Linear Regression for training the linear regression model, R Squared Score and Mean Squared Error for evaluating regression model accuracy. Finally we pip install folium to visualize the data of different regions.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
!pip install folium
import folium

Before the data can be used to we have to import it into the notebook and clean up the data so then it can be managable, easier to represent, and help withe the analysis. So we downloaded the files from the website and put it into the github repository in a so then we can read from them. Then using pandas csv reader we were able to read the files and put them into pandas dataframes.

data2015 = pd.read_csv("Data/2015")
data2016 = pd.read_csv("Data/2016")
data2017 = pd.read_csv("Data/2017")
data2018 = pd.read_csv("Data/2018")
data2019 = pd.read_csv("Data/2019")
data2020 = pd.read_csv("Data/2020")

2. DATA PROCESSING¶

The data collected by the WHR differed from year to year as they gathered information that they thought was important to the research they were doing and some columns were renamed. For example in the 2015 data, there is a feature called family while in 2020 family was replaced with social support. The columns of each dataset are named differently but represents the same informations. For example "Freedom" for one year could be "Explained by: Freedom to make life choices" in another year. So these columns would have to be renamed so it would be easier to merge the datasets into one big set. In addtion, in 2020 the names of the regions are completely different from other years such that it would not be feasible to map the countries to their regions, so we have to rename them to make with other tables.

data2020['Happiness Rank'] =  range(1, len(data2020.index)+1)
data2020['Year'] = 2020

# Drop the extraneous generosity so that concat can work later on in the code
data2020 = data2020.drop(['Generosity'], axis = 1)

data2020 = data2020.rename(columns = {'Country name' : 'Country', 'Regional indicator': 'Region', 'Ladder score' : 'Happiness Score', 
                        'Explained by: Log GDP per capita' : 'Economy (GDP per Capita)', 'Explained by: Social support' : 'Family', 
                                    'Explained by: Healthy life expectancy' : 'Health (Life Expectancy)',
                        'Explained by: Freedom to make life choices' : 'Freedom', 
                                    'Explained by: Perceptions of corruption' : 'Trust (Government Corruption)',
                                      'Explained by: Generosity' : 'Generosity',
                                     'Standard error of ladder score' : 'Standard Error', 'Regional indicator' : 'Region'})

#Just renaming some regions in 2020 as they changed some regions around. Commonwealth of Independent States are all
#Central and Eastern European countries. The Asian regions the report decided to use the noun versions of the
#cardinal directions rather the adjectives.
data2020['Region'] = data2020['Region'].replace(['Commonwealth of Independent States'],'Central and Eastern Europe')
data2020['Region'] = data2020['Region'].replace(['South Asia'],'Southern Asia')
data2020['Region'] = data2020['Region'].replace(['Southeast Asia'],'Southeastern Asia')
data2020['Region'] = data2020['Region'].replace(['East Asia'],'Eastern Asia')
data2020['Region'] = data2020['Region'].replace(['Middle East and North Africa'],'Middle East and Northern Africa')

#Split North America regions from Australia and New Zealand
for i, row in data2020.loc[data2020['Region'] == 'North America and ANZ'].iterrows():
    if row['Country'] == 'United States' or row['Country'] == 'Canada':
        data2020.at[i,'Region'] = 'North America'
    else:
        data2020.at[i,'Region'] = 'Australia and New Zealand'

data2020

# temp is just a table for the merge to be conducted on to match countries with their regions
temp = data2016[['Country','Region']]

data2019 = data2019.rename(columns = {'Overall rank' : 'Happiness Rank', 'Country or region' : 'Country', 
                                      'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
                                      'Social support': 'Family', 'Healthy life expectancy' : 'Health (Life Expectancy)',
                                      'Freedom to make life choices':'Freedom','Perceptions of corruption' : 'Trust (Government Corruption)'})
#Matches the countries with their repective regions
data2019 = pd.merge(data2019, temp, how='left', on=['Country'])
data2019['Year'] = 2019
data2019

data2018 = data2018.rename(columns = {'Happiness.Rank' : 'Happiness Rank', 'Country or region':'Country', 'Overall rank':'Happiness Rank',
                                      'Score' : 'Happiness Score', 'GDP per capita' : 'Economy (GDP per Capita)',
                                      'Social support': 'Family', 'Healthy life expectancy' : 'Health (Life Expectancy)',
                                      'Freedom to make life choices':'Freedom','Perceptions of corruption' : 'Trust (Government Corruption)'})
data2018 = pd.merge(data2018, temp, how='left', on=['Country'])
data2018['Year'] = 2018
data2018

data2017 = data2017.rename(columns = {'Happiness.Rank' : 'Happiness Rank', 'Country or region' : 'Country', 
                                      'Happiness.Score' : 'Happiness Score', 'Economy..GDP.per.Capita.' : 'Economy (GDP per Capita)',
                                      'Health..Life.Expectancy.' : 'Health (Life Expectancy)',
                                      'Freedom to make life choices':'Freedom','Trust..Government.Corruption.' : 'Trust (Government Corruption)'})
data2017 = pd.merge(data2017, temp, how='left', on=['Country']).dropna()
data2017['Year'] = 2017
data2017

data2016['Year'] = 2016
data2015['Year'] = 2015

data2016

data2015

The dataAll contains a concatenation of all 6 years. The dropna was used to drop and pieces of data that contained any missing values, specifically missing values in the region column as it is a categorical data, we cannot extrapolate data because we do not know what WHR's standards are for placing countries into different regions.

dataAll = pd.concat([data2020,data2019,data2018,data2017,data2016,data2015])
dataAll=dataAll[['Country','Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)',
                                   'Family', 'Health (Life Expectancy)', 'Freedom','Trust (Government Corruption)',
                                   'Generosity', 'Year',]].dropna()
dataAll

For the meaing of each column/variable refer to the index here. The first two pages explain what the report asked in their survey and how it converted the information to numbers.

3. EXPLORATORY ANALYSIS AND DATA VISUALIZATION:¶

# data to plot
n_groups = 10
economy = dataAll.groupby("Region")["Economy (GDP per Capita)"].mean()
fam = dataAll.groupby("Region")["Family"].mean()
free = dataAll.groupby("Region")["Freedom"].mean()

# create plot
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.25
opacity = 0.8

# Freedom
rects1 = plt.bar(index - bar_width, free, bar_width, 
alpha=opacity,
color='r',
label='Freedom',
align='center')

# Economy
rects2 = plt.bar(index, economy, bar_width, 
alpha=opacity,
color='b',
label='Economy (GDP per Capita)',
align='center')

# Family
rects3 = plt.bar(index + bar_width, fam, bar_width, 
alpha=opacity,
color='g',
label='Family',
align='center')

#Labels
plt.xlabel('Regions')
plt.ylabel('Mean of freedom, economy and family')
plt.title('Comparison of Mean of Freedom, Economy and Family Based on Each Region')
plt.xticks(index, ('Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', 'Latin America and Caribbean', 'Middle East and Northern Africa', 'North America', 'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa', 'Western Europe'))
plt.xticks(rotation=90)

#create a key
plt.legend()
plt.show()

The above bar graph shows the mean of each region's freedom, economy and family score. We chose to analyze the mean because analyzing the cumulative score of each region would be an incurrate analysis of what region has the highest score. This is because the number of countries in each region are different so countries with more regions would seem to have a higher score when in reality, there are just more scores instead of higher scores.

Based off the graph, we can see that Central and Eastern Europe, Middle East and Northern Africa and Sub-Saharan Africa have the lowest freedom scores while Australia and New Zealand have the highest freedom scores. Looking at the economy, Sub-Saharan Africa has the lowest while North America has the highest. Additionally, Australia and New Zealand has the highest family score and Southern Asia and Sub-Saharan Africa have the lowest family scores.

As a whole, it seems like Sub-Saharan Africa consistently has one of the lowest scores in each of the three categories while North America and Australia and New Zealand have the highest scores in the three categories. Western Europe also has high scores in all three categories, although not the highest.

Next, we use folium to create a map of happiness score across regions.

# Create an interactive map of mean Happiness Score in each region
mapp = folium.Map(zoom_start=20)
tab = ['Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', \
       'Latin America and Caribbean', 'Middle East and Northern Africa', 'North America', \
       'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa', 'Western Europe']

happiness = dataAll.groupby("Region")["Happiness Score"].mean()

for i in tab:
    if i == 'Western Europe':
        radius = happiness.loc[i]
        long = 46.2022
        lat = 1.2644
        color = 'Orange'
    
    elif i == 'North America':
        radius = happiness.loc[i]
        long = 54.5260
        lat = -105.2551
        color = 'Blue'
       
    elif i == 'Australia and New Zealand':
        radius = happiness.loc[i]
        long = -40.9006
        lat = 174.8860
        color = 'Green'
       
    elif i == 'Middle East and Northern Africa':
        radius = happiness.loc[i] 
        long = 29.2985
        lat = 42.5510
        color = 'Red'
        
    elif i == 'Latin America and Caribbean':
        radius = happiness.loc[i]
        long = 21.4691
        lat = -78.6569
        color = 'Pink'
        
    elif i == 'Southeastern Asia':
        radius = happiness.loc[i]
        long = -2.2180
        lat = 115.6628
        color = 'Purple'
        
    elif i == 'Central and Eastern Europe':
        radius = happiness.loc[i]
        long = 52.0055
        lat = 37.9587
        color = 'Gray'
       
    elif i == 'Eastern Asia':
        radius = happiness.loc[i]
        long = 38.7946
        lat = 106.5348
        color = 'Yellow'
        
    elif i == 'Sub-Saharan Africa':
        radius = happiness.loc[i] 
        long = -23.806078
        lat = 11.288452
        color = 'Cyan'
        
    elif i == 'Southern Asia':
        radius = happiness.loc[i] 
        long = 25.0376
        lat = 76.4563
        color = 'Black'
        
    else:
        continue
    folium.CircleMarker(
    location=[long, lat],
    #so that radius is visible
    radius = radius*2,
    tooltip = i,
    popup = radius,
    color = color,
    fill = True, 
).add_to(mapp)

mapp

Requirement already satisfied: folium in /opt/conda/lib/python3.8/site-packages (0.11.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from folium) (2.24.0)
Requirement already satisfied: branca>=0.3.0 in /opt/conda/lib/python3.8/site-packages (from folium) (0.4.1)
Requirement already satisfied: jinja2>=2.9 in /opt/conda/lib/python3.8/site-packages (from folium) (2.11.2)
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from folium) (1.19.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (1.25.10)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (2020.6.20)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.8/site-packages (from jinja2>=2.9->folium) (1.1.1)

The above map shows the mean of each region's happiness score. The longitude and latitude of each region were found on Google. We made the radius of each circle the mean of their respective region's happiness score so that it is easier to visualize which regions have the highest and which have the lowest scores. We also enabled the popup feature so that the user can click to see the region's score. Additionally, when the user hover the mouse over the circle, the region name can be shown.

From this visualization, North America and Australia and New Zealand have the highest average happiness score, with the latter have a slightly higher score when clicking on the circle. Sub-Saharan Africa has an evidently smaller circle than the other regions, meaning it has the smallest happiness score.

Combining the analysis of both visualizations, it can be seen that Sub-Saharan Africa has the lowest scores in both analyses and Australia and New Zealand and North America with the highest scores. From this, it seems like the factors (family, freedom, economy) all play a role in each region's happiness score.

We then want to see how Happiness Score has changes over years. We calculate the mean Happiness Score in each year and plot the result.

# Get mean of happiness score in all countries over years
mean_happiness_score_by_year_df = dataAll \
                                .groupby("Year")["Happiness Score"] \
                                .mean()

# Plot Mean of happiness scores of all countries over years
plt.figure(figsize=(12,8))
mean_happiness_score_by_year_df.plot (
    x="Year", y="Happiness Score",
    kind="line", marker="o",
    title="Mean of happiness scores of all countries over years",
    xlabel="Year", ylabel="Mean of Happiness Scores"
)

# Show the result
plt.show()

Similarly, we calculate the standard deviation of Happiness Score across years, and plot the result.

# Get stddev of happiness score in all countries over years
std_happiness_score_by_year_df = dataAll \
                                .groupby("Year")["Happiness Score"] \
                                .std()

# Plot Standard Deviations of happiness scores of all countries over years
plt.figure(figsize=(12,8))
std_happiness_score_by_year_df.plot (
    x="Year", y="Happiness Score",
    kind="line", marker="o",
    title="Standard Deviations of happiness scores of all countries over years",
    xlabel="Year", ylabel="Standard Deviations of Happiness Scores"
)

# Show the result
plt.show()

The mean of Happiness Score fluctuated in the 2015-2017 period but has increased since 2018, meaning that people are getting happier over times. It's also interesting that the standard deviation of Happiness Score had a decreasing trend in the 2015-2019 period, meaning that the happiness has been spreaded more evenly around the world.

4. ANALYSIS, MACHING LEARNING, AND HYPOTHESIS TESTING:¶

We want to predict the Happiness Score based on other variables in the dataset. From the graph "Mean Happiness scores of all countries over years" in part 3 (Exploratory Analysis and Data Visualization), we observe that there is an increasing trend in Happiness Score over years.

Our hypothesis: Happiness Score is dependent on year. In order to test the hypothesis, we plot the Distribution of happiness score over years using plt and sns.

#Plot the happiness score distribution
plt.figure(figsize=(14,6)) #Format figure size
sns.violinplot(x='Year', y='Happiness Score', data=dataAll).set( xlabel="Year", ylabel="Happiness score ")
plt.title("Distribution of happiness score across countries over years")

# Show the plot
plt.show()

The ranges of Happiness Score are quite similar (from ~2 to ~8.5) for different years. The distributions of Happiness Score are quite symmetric, but not normal. Especially, in 2018, 2019, and 2020, the distributions tend to have bimodal shapes.

It's quite hard to see the trend of mean Happiness Score from this plot because the changes of mean Happiness Score over years are very small as compared to the range of Happiness Score.

MODEL 1: PREDICT HAPPINESS SCORE BASED ON YEAR¶

Although mean of Happiness Score has an increasing trend, the increasing factor is not significant as compared to the Happiness Score range. As we are unsure about the relationship between year and Happiness Score, we will use linear regression model to predict the Happiness Score based on Year, and use the R Squared Score as well as Mean Squared Error to evaluate the prediction accuracy.

# Function for linear regression from now on
def linear_reg (features_X, happiness_Y, feature_name):
    # Linear Regression
    regr = LinearRegression()
    regr.fit(features_X, happiness_Y)
    
    # Get predicted value
    happiness_Y_predict = regr.predict(features_X)

    # Print mean squared error & R^2 score:
    print()
    print('Mean squared error: %.2f'
          % mean_squared_error(happiness_Y, happiness_Y_predict))
    print('R squared score: %.2f\n'
          % r2_score(happiness_Y, happiness_Y_predict))
    
    # Print parameters
    print("PARAMETERS:")
    params_df = pd.DataFrame([regr.coef_],
                            columns = feature_name)
    params_df["Intercept"] = regr.intercept_
    print(params_df)
    

    #Return happiness_Y_predict, coefficient, interception
    return happiness_Y_predict, regr.coef_, regr.intercept_

#### MODEL 1: PREDICT HAPPINESS SCORE BASED ON YEAR #####
# Get X and Y axes
year_X = dataAll['Year'].to_numpy().reshape(-1, 1)
happiness_Y = dataAll['Happiness Score']

# Linear regression
happiness_Y_predict, coeff, intercept = linear_reg(year_X, happiness_Y, ["Year"])

# Regression Line
print(f'\nRegression line:\n\tHappiness Score = {intercept} + Year * {coeff[0]} \n', )

# Add the residual column to dataAll
dataAll['Happiness Residual 1'] = happiness_Y_predict - happiness_Y


# Plot the regression model (Happiness Score Vs. Year)
plt.figure(figsize=(14,6)) #Format figure size
ax = dataAll.plot.scatter(x='Year', y='Happiness Score')
ax.set_title("Scatter plot of Happiness Score vs Year")
ax.set_xlabel("Year")
ax.set_ylabel("Happiness Score")
plt.plot(dataAll['Year'], coeff[0] * dataAll['Year'] + intercept)
plt.show()

Mean squared error: 1.25
R squared score: 0.00

PARAMETERS:
       Year  Intercept
0  0.019734 -34.404176

Regression line:
	Happiness Score = -34.404175810438126 + Year * 0.019733564289423727

<Figure size 1008x432 with 0 Axes>

Scope for the linear regression line is 0.0197, which shows that the relationship between Year and Happiness Score is weak. Also, since R-Squared Error is 0.0 (which is not close to 1), and Mean Squared Error is 1.25 (which is quite high as the Happiness Score is in range from ~2 to ~8.5), Year is not a good predictor for Happiness Score.

Let's take a look at the Residual Plot of the above Linear Regression Model (Happiness Score Vs. Year)

#Plot the residual distribution of Happiness Score over years
plt.figure(figsize=(14,6)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 1', data=dataAll) \
    .set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of Happiness Score residual over time (Model 1: Happiness Score based on Year)")

# Show the plot
plt.show()

Although means of residuals are around 0, the distributions of residuals are not normal. This, again, confirms that Year is not a good predictor for Happiness Score. We reject the hypothesis that Happiness Score is dependent on year.

Now we need to choose another factor to be a predictor for Happiness Score. We believe that Happiness Score varies by regions. For example, people might feel happier when they live in a region that has a high living standard; on the other hand, people could feel less happy when they live in a region that is politically instable.

Our next hypothesis: Happiness Score is dependent on Region.

In order to test our hypothesis, we plot Happiness Score in different regions and different years to see the differences in Happiness Score with respected to Year and with respected to Region. As Happiness Score varies in each region, we calculate the mean of Happiness Score in each region for each year and plot these mean values.

# Get means of happiness score by year and region
mean_score_yr_region = dataAll.groupby(["Year", "Region"])["Happiness Score"].mean()

# Create a dataframe
frame = { 'Happiness Score': mean_score_yr_region }
mean_score_yr_region_df = pd.DataFrame(frame)

# Plot the mean of happiness score over years accross regions
plt.figure(figsize=(20,12))
ax = sns.lineplot(data=mean_score_yr_region_df,
                x='Region', y='Happiness Score',
                hue='Year', marker="o",
                palette=["red", "blue", "orange", "purple", "green", "gray" ])
plt.xticks(rotation=90) #Make x labels vertical
plt.xlabel("Region")
plt.title("Mean Happiness Score by Region and Year")
plt.figure()

<Figure size 432x288 with 0 Axes>

<Figure size 432x288 with 0 Axes>

As can be observed from the above plot, Happiness Score does significantly vary by regions. Australia and New Zealand, North America, and Western Europe are the regions that have the highest Happiness Score. This matches with our assumption that people living in high-living-standard regions tend to feel happier than others. Southern Asia and Sub-Saharan Africa are the regions that have the smallest Happiness Scores. One explanation can be that, many countries in Southern Asia and Sub-Saharan Africa are either economically insufficient or politically instable.

Plotting the mean Happiness Score by Year and by Region also allows us to compare the effect of Year on Happiness Score and the effect of Region on Happiness Score. In the above plot, it's hard to distinguish the Happiness Score among years, while it's easy to distinguish the Happiness Score among regions. This means that Happiness Score is more likely to be dependent on Region than on Year.

MODEL 2: PREDICT HAPPINESS SCORE BASED ON REGION¶

Since "Region" is a categorical varible, we make dummy binary variables to represent it in the Linear Regression Model. There are 10 different values for column "Region" of the dataset.
- Australia and New Zealand
- Central and Eastern Europe
- Eastern Asia
- Latin America and Caribbean
- Middle East and Northern Africa
- North America
- Southeastern Asia
- Southern Asia
- Sub-Saharan Africa
- Western Europe

corresponding to 10 dummy variables that we will create. The names of 10 dummy variables are similar to the values of the "Region" column.

For example, if the value of column "Region" of a country is "Australia and New Zealand", the value of column "Australia and New Zealand" of that country will be 1, the value of column "Central and Eastern Europe" will be 0, the value of other dummy-variable columns will also be 0.

### Function to assign values for dummy variables that represent region
def label_continent(row, region):
    if row['Region'] == region:
        return 1
    else:
        return 0
    
#Add new columns for regions
dataAll['Australia and New Zealand'] = dataAll.apply (lambda row: label_continent(row,'Australia and New Zealand'), axis=1)
dataAll['Central and Eastern Europe'] = dataAll.apply (lambda row: label_continent(row,'Central and Eastern Europe'), axis=1)
dataAll['Eastern Asia'] = dataAll.apply (lambda row: label_continent(row,'Eastern Asia'), axis=1)
dataAll['Latin America and Caribbean'] = dataAll.apply (lambda row: label_continent(row,'Latin America and Caribbean'), axis=1)
dataAll['Middle East and Northern Africa'] = dataAll.apply (lambda row: label_continent(row,'Middle East and Northern Africa'), axis=1)
dataAll['North America'] = dataAll.apply (lambda row: label_continent(row,'North America'), axis=1)
dataAll['Southeastern Asia'] = dataAll.apply (lambda row: label_continent(row,'Southeastern Asia'), axis=1)
dataAll['Southern Asia'] = dataAll.apply (lambda row: label_continent(row,'Southern Asia'), axis=1)
dataAll['Sub-Saharan Africa'] = dataAll.apply (lambda row: label_continent(row,'Sub-Saharan Africa'), axis=1)
dataAll['Western Europe'] = dataAll.apply (lambda row: label_continent(row,'Western Europe'), axis=1)

dataAll

#### MODEL 2: PREDICT HAPPINESS SCORE BASED ON REGION #####

feature_names = ['Australia and New Zealand', 'Central and Eastern Europe', 'Eastern Asia', \
                 'Latin America and Caribbean', 'Middle East and Northern Africa', 'North America', \
                 'Southeastern Asia', 'Southern Asia', 'Sub-Saharan Africa', 'Western Europe']

# Get data of independent variables and dependent variable
region_X = dataAll[feature_names]
happiness_Y = dataAll['Happiness Score']

#Linear Regression
happiness_Y_predict, coeff, intercept = linear_reg(region_X, happiness_Y, feature_names)

#Regression line:
print(f"\nRegression Line: \nHappiness Score = {intercept}\n \
                + {coeff[0]} * {feature_names[0]} + {coeff[1]} * {feature_names[1]} \n \
                + {coeff[2]} * {feature_names[2]} + {coeff[3]} * {feature_names[3]} \n \
                + {coeff[4]} * {feature_names[4]} + {coeff[5]} * {feature_names[5]} \n \
                + {coeff[6]} * {feature_names[6]} + {coeff[7]} * {feature_names[7]} \n \
                + {coeff[8]} * {feature_names[8]} + {coeff[9]} * {feature_names[9]}")

# Add the residual column
dataAll['Happiness Residual 2'] = happiness_Y_predict - happiness_Y

#Plot the residual distribution over years
plt.figure(figsize=(10,5)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 2', data=dataAll) \
    .set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of happiness score residual over time (Model 2: Happiness Score based on Region)")
plt.show()

Mean squared error: 0.49
R squared score: 0.61

PARAMETERS:
   Australia and New Zealand  Central and Eastern Europe  Eastern Asia  \
0               1.543091e+13                1.543091e+13  1.543091e+13   

   Latin America and Caribbean  Middle East and Northern Africa  \
0                 1.543091e+13                     1.543091e+13   

   North America  Southeastern Asia  Southern Asia  Sub-Saharan Africa  \
0   1.543091e+13       1.543091e+13   1.543091e+13        1.543091e+13   

   Western Europe     Intercept  
0    1.543091e+13 -1.543091e+13  

Regression Line: 
Happiness Score = -15430911074955.162
                 + 15430911074962.441 * Australia and New Zealand + 15430911074960.633 * Central and Eastern Europe 
                 + 15430911074960.8 * Eastern Asia + 15430911074961.207 * Latin America and Caribbean 
                 + 15430911074960.473 * Middle East and Northern Africa + 15430911074962.322 * North America 
                 + 15430911074960.463 * Southeastern Asia + 15430911074959.713 * Southern Asia 
                 + 15430911074959.395 * Sub-Saharan Africa + 15430911074961.94 * Western Europe

Analysis:
- The mean square error of Model 2 has been significantly improved compared to that of Model 1 (Model 1: 1.25, Model 2: 0.49)
- R Squared Score of Model 2 also gets closer to 1 (Model 1: 0.00, Model 2: 0.61)
- The Residual Distribution: normally distributed, mean is 0
Model 2 has a better Mean Square Error, a better R Squared Score, and better residuals than Model 1
Model 2 is also a good Model for predicting Happiness Score. Our hypothesis that Happiness Score is dependent on Region holds true

MODEL 3: PREDICT HAPPINESS SCORE BASED ON OTHER FACTORS¶

We predict the happiness score based on Economy, Family, Health, Freedom, Trust, Generosity in this model and compare the result with the previous models. From that we will choose the model that performs the best.

#### MODEL 3: PREDICT HAPPINESS SCORE BASED ON OTHER FACTORS #####

feature_names = ['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', \
                 'Freedom', 'Trust (Government Corruption)', 'Generosity']

#Get X and Y axes
features_X = dataAll[feature_names]
happiness_Y = dataAll['Happiness Score']

# Linear regression
happiness_Y_predict, coeff, intercept = linear_reg (features_X, happiness_Y, feature_names)

# Regression line:
print(f"\nRegression Line: \nHappiness Score = {intercept}\n \
                + {coeff[0]} * {feature_names[0]} + {coeff[1]} * {feature_names[1]} \n \
                + {coeff[2]} * {feature_names[2]} + {coeff[3]} * {feature_names[3]} \n \
                + {coeff[4]} * {feature_names[4]} + {coeff[5]} * {feature_names[5]}")


# Add the residual column
dataAll['Happiness Residual 3'] = happiness_Y_predict - happiness_Y

#Plot the residual distribution over years
plt.figure(figsize=(10,5)) #Format figure size
sns.violinplot(x='Year', y='Happiness Residual 3', data=dataAll) \
    .set( xlabel="Year", ylabel="Residual of Happiness Score")
plt.title("Distribution of happiness score residual over time (Model 3: Happiness Score based on Economy,  Family, Health, Freedom, Trust, Generosity)")
plt.show()

Mean squared error: 0.30
R squared score: 0.76

PARAMETERS:
   Economy (GDP per Capita)    Family  Health (Life Expectancy)   Freedom  \
0                  1.134221  0.694356                  0.966204  1.446482   

   Trust (Government Corruption)  Generosity  Intercept  
0                       0.886863    0.631095    2.15112  

Regression Line: 
Happiness Score = 2.1511196155787027
                 + 1.1342213701811763 * Economy (GDP per Capita) + 0.6943559464575613 * Family 
                 + 0.9662037038417511 * Health (Life Expectancy) + 1.4464819344520123 * Freedom 
                 + 0.8868631626939317 * Trust (Government Corruption) + 0.6310947959907783 * Generosity

The Mean Squared Error and R Squared Score of Model 3 have been improved from Model 2. However, the Residual Plot shows that: Means of Residuals are not 0, Distribution of Residuals of Model 3 are not as normal as that of Model 2.

Conclustion: Choose Model 2 (Predict Happiness Score based on Region) as a model for predicting happiness score.

ANALYZE HOW FACTORS THAT AFFECT HAPPINESS SCORE CHANGE OVER YEARS¶

Given that the model to predict happiness score has been done, our next goal is to check if there is a progression of humanity in the factors that affect Happiness Score as time goes on. In order to test that, we use Linear Regression to predict the Happiness Score based on Ecomomy, Family, Heath, Freedom, Trust, and Generosity in each year. From the parameters of the Linear Regression lines for different years, we could compare the effect of each factor on Happiness Score over years.

feature_names = ['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', \
                 'Freedom', 'Trust (Government Corruption)', 'Generosity']

def factors_affected_happiness_by_year(year):
    # Get year dataframe
    year_df = dataAll[dataAll["Year"] == year]
    
    #Get X and Y axes
    features_X = year_df[feature_names]
    happiness_Y = year_df['Happiness Score']
    
    happiness_Y_predicted, coeff, intercept = linear_reg (features_X, happiness_Y, feature_names)

factors_affected_happiness_by_year(2015)

Mean squared error: 0.29
R squared score: 0.78

PARAMETERS:
   Economy (GDP per Capita)    Family  Health (Life Expectancy)   Freedom  \
0                  0.860657  1.408892                  0.975309  1.333433   

   Trust (Government Corruption)  Generosity  Intercept  
0                       0.784538    0.388933   1.860185

factors_affected_happiness_by_year(2016)

Mean squared error: 0.28
R squared score: 0.79

PARAMETERS:
   Economy (GDP per Capita)    Family  Health (Life Expectancy)   Freedom  \
0                  0.721413  1.229754                  1.436403  1.513935   

   Trust (Government Corruption)  Generosity  Intercept  
0                       0.918927    0.159494   2.190294

factors_affected_happiness_by_year(2017)

Mean squared error: 0.23
R squared score: 0.81

PARAMETERS:
   Economy (GDP per Capita)   Family  Health (Life Expectancy)   Freedom  \
0                  0.794919  1.14852                  1.330857  1.407979   

   Trust (Government Corruption)  Generosity  Intercept  
0                        0.95305    0.393118   1.684468

factors_affected_happiness_by_year(2018)

Mean squared error: 0.26
R squared score: 0.79

PARAMETERS:
   Economy (GDP per Capita)    Family  Health (Life Expectancy)   Freedom  \
0                  0.888664  1.221054                  0.931914  1.355804   

   Trust (Government Corruption)  Generosity  Intercept  
0                       0.800946    0.463499   1.746957

factors_affected_happiness_by_year(2019)

Mean squared error: 0.28
R squared score: 0.77

PARAMETERS:
   Economy (GDP per Capita)    Family  Health (Life Expectancy)   Freedom  \
0                  0.766022  1.244737                  1.008123  1.407431   

   Trust (Government Corruption)  Generosity  Intercept  
0                       1.072692    0.406235   1.734016

factors_affected_happiness_by_year(2020)

Mean squared error: 0.31
R squared score: 0.75

PARAMETERS:
   Economy (GDP per Capita)  Family  Health (Life Expectancy)   Freedom  \
0                  0.739118   1.153                  0.980704  1.482473   

   Trust (Government Corruption)  Generosity  Intercept  
0                       0.972943    0.620785   1.887211

5. INSIGHT AND CONCLUSION:¶

Happiness is an important factor to keeping the overall population healthy and raise a progressive country. It is not often that you find a data on so many countries with many features that the one created by the WHR. From these data set we were able to learn what is a major contributor to the overall happiness of people. It appears that in the past years, certain features have decreased in affecting what makes a person happy (e.g.: family decreased from 1.408892 in 2015 to 1.153 in 2020), while other features increased (e.g.: generosity rose from 0.159494 in 2016 to 0.620785 in 2020). It seems that people in the world are slowly shifting their concept of what makes them happy to a more holistic view of what benefits other rather than themselves. This tutorial provided some insight on how happiness is gauged in different regions of the world. Hopefully, from our tutorial you are about to learn about the steps needed to wrangle, display, and analyze data and present it in an organized documentation for other to read.

There are countless factors that goes into what makes people happy in different countries and we only use some basic information from a single data set. Other factors that may affect happiness include, if the country is currently in a military conflict, how much infrastructure does the country have, or is the country suffering/recovering from a natural disaster. Also, our tutorial stratified the data based on the region, you gather more data from other sources and group the data by population or by the countries themselves. There are countless ways to group the data and analyze the factors that results in the happiness score. This tutorial is just a sample of data that was collected around the world by single organization dedicated to providing information about the world population. We hope that this tutorial was able to teach you something new and inspire you to go out and analyze more data.

	Country	Region	Happiness Score	Standard Error	upperwhisker	lowerwhisker	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	...	Ladder score in Dystopia	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Generosity	Trust (Government Corruption)	Dystopia + residual	Happiness Rank	Year
0	Finland	Western Europe	7.8087	0.031156	7.869766	7.747634	10.639267	0.954330	71.900825	0.949172	...	1.972317	1.285190	1.499526	0.961271	0.662317	0.159670	0.477857	2.762835	1	2020
1	Denmark	Western Europe	7.6456	0.033492	7.711245	7.579955	10.774001	0.955991	72.402504	0.951444	...	1.972317	1.326949	1.503449	0.979333	0.665040	0.242793	0.495260	2.432741	2	2020
2	Switzerland	Western Europe	7.5599	0.035014	7.628528	7.491272	10.979933	0.942847	74.102448	0.921337	...	1.972317	1.390774	1.472403	1.040533	0.628954	0.269056	0.407946	2.350267	3	2020
3	Iceland	Western Europe	7.5045	0.059616	7.621347	7.387653	10.772559	0.974670	73.000000	0.948892	...	1.972317	1.326502	1.547567	1.000843	0.661981	0.362330	0.144541	2.460688	4	2020
4	Norway	Western Europe	7.4880	0.034837	7.556281	7.419719	11.087804	0.952487	73.200783	0.955750	...	1.972317	1.424207	1.495173	1.008072	0.670201	0.287985	0.434101	2.168266	5	2020
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
148	Central African Republic	Sub-Saharan Africa	3.4759	0.115183	3.701658	3.250141	6.625160	0.319460	45.200001	0.640881	...	1.972317	0.041072	0.000000	0.000000	0.292814	0.253513	0.028265	2.860198	149	2020
149	Rwanda	Sub-Saharan Africa	3.3123	0.052425	3.415053	3.209547	7.600104	0.540835	61.098846	0.900589	...	1.972317	0.343243	0.522876	0.572383	0.604088	0.235705	0.485542	0.548445	150	2020
150	Zimbabwe	Sub-Saharan Africa	3.2992	0.058674	3.414202	3.184198	7.865712	0.763093	55.617260	0.711458	...	1.972317	0.425564	1.047835	0.375038	0.377405	0.151349	0.080929	0.841031	151	2020
151	South Sudan	Sub-Saharan Africa	2.8166	0.107610	3.027516	2.605684	7.425360	0.553707	51.000000	0.451314	...	1.972317	0.289083	0.553279	0.208809	0.065609	0.209935	0.111157	1.378751	152	2020
152	Afghanistan	Southern Asia	2.5669	0.031311	2.628270	2.505530	7.462861	0.470367	52.590000	0.396573	...	1.972317	0.300706	0.356434	0.266052	0.000000	0.135235	0.001226	1.507236	153	2020

	Happiness Rank	Country	Happiness Score	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Generosity	Trust (Government Corruption)	Region	Year
0	1	Finland	7.769	1.340	1.587	0.986	0.596	0.153	0.393	Western Europe	2019
1	2	Denmark	7.600	1.383	1.573	0.996	0.592	0.252	0.410	Western Europe	2019
2	3	Norway	7.554	1.488	1.582	1.028	0.603	0.271	0.341	Western Europe	2019
3	4	Iceland	7.494	1.380	1.624	1.026	0.591	0.354	0.118	Western Europe	2019
4	5	Netherlands	7.488	1.396	1.522	0.999	0.557	0.322	0.298	Western Europe	2019
...	...	...	...	...	...	...	...	...	...	...	...
151	152	Rwanda	3.334	0.359	0.711	0.614	0.555	0.217	0.411	Sub-Saharan Africa	2019
152	153	Tanzania	3.231	0.476	0.885	0.499	0.417	0.276	0.147	Sub-Saharan Africa	2019
153	154	Afghanistan	3.203	0.350	0.517	0.361	0.000	0.158	0.025	Southern Asia	2019
154	155	Central African Republic	3.083	0.026	0.000	0.105	0.225	0.235	0.035	NaN	2019
155	156	South Sudan	2.853	0.306	0.575	0.295	0.010	0.202	0.091	Sub-Saharan Africa	2019

	Happiness Rank	Country	Happiness Score	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Generosity	Trust (Government Corruption)	Region	Year
0	1	Finland	7.632	1.305	1.592	0.874	0.681	0.202	0.393	Western Europe	2018
1	2	Norway	7.594	1.456	1.582	0.861	0.686	0.286	0.340	Western Europe	2018
2	3	Denmark	7.555	1.351	1.590	0.868	0.683	0.284	0.408	Western Europe	2018
3	4	Iceland	7.495	1.343	1.644	0.914	0.677	0.353	0.138	Western Europe	2018
4	5	Switzerland	7.487	1.420	1.549	0.927	0.660	0.256	0.357	Western Europe	2018
...	...	...	...	...	...	...	...	...	...	...	...
151	152	Yemen	3.355	0.442	1.073	0.343	0.244	0.083	0.064	Middle East and Northern Africa	2018
152	153	Tanzania	3.303	0.455	0.991	0.381	0.481	0.270	0.097	Sub-Saharan Africa	2018
153	154	South Sudan	3.254	0.337	0.608	0.177	0.112	0.224	0.106	Sub-Saharan Africa	2018
154	155	Central African Republic	3.083	0.024	0.000	0.010	0.305	0.218	0.038	NaN	2018
155	156	Burundi	2.905	0.091	0.627	0.145	0.065	0.149	0.076	Sub-Saharan Africa	2018

	Country	Happiness Rank	Happiness Score	Whisker.high	Whisker.low	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Generosity	Trust (Government Corruption)	Dystopia.Residual	Region	Year
0	Norway	1	7.537	7.594445	7.479556	1.616463	1.533524	0.796667	0.635423	0.362012	0.315964	2.277027	Western Europe	2017
1	Denmark	2	7.522	7.581728	7.462272	1.482383	1.551122	0.792566	0.626007	0.355280	0.400770	2.313707	Western Europe	2017
2	Iceland	3	7.504	7.622030	7.385970	1.480633	1.610574	0.833552	0.627163	0.475540	0.153527	2.322715	Western Europe	2017
3	Switzerland	4	7.494	7.561772	7.426227	1.564980	1.516912	0.858131	0.620071	0.290549	0.367007	2.276716	Western Europe	2017
4	Finland	5	7.469	7.527542	7.410458	1.443572	1.540247	0.809158	0.617951	0.245483	0.382612	2.430182	Western Europe	2017
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
149	Togo	150	3.495	3.594038	3.395962	0.305445	0.431883	0.247106	0.380426	0.196896	0.095665	1.837229	Sub-Saharan Africa	2017
150	Rwanda	151	3.471	3.543030	3.398970	0.368746	0.945707	0.326425	0.581844	0.252756	0.455220	0.540061	Sub-Saharan Africa	2017
151	Syria	152	3.462	3.663669	3.260331	0.777153	0.396103	0.500533	0.081539	0.493664	0.151347	1.061574	Middle East and Northern Africa	2017
152	Tanzania	153	3.349	3.461430	3.236570	0.511136	1.041990	0.364509	0.390018	0.354256	0.066035	0.621130	Sub-Saharan Africa	2017
153	Burundi	154	2.905	3.074690	2.735310	0.091623	0.629794	0.151611	0.059901	0.204435	0.084148	1.683024	Sub-Saharan Africa	2017

	Country	Region	Happiness Rank	Happiness Score	Lower Confidence Interval	Upper Confidence Interval	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Trust (Government Corruption)	Generosity	Dystopia Residual	Year
0	Denmark	Western Europe	1	7.526	7.460	7.592	1.44178	1.16374	0.79504	0.57941	0.44453	0.36171	2.73939	2016
1	Switzerland	Western Europe	2	7.509	7.428	7.590	1.52733	1.14524	0.86303	0.58557	0.41203	0.28083	2.69463	2016
2	Iceland	Western Europe	3	7.501	7.333	7.669	1.42666	1.18326	0.86733	0.56624	0.14975	0.47678	2.83137	2016
3	Norway	Western Europe	4	7.498	7.421	7.575	1.57744	1.12690	0.79579	0.59609	0.35776	0.37895	2.66465	2016
4	Finland	Western Europe	5	7.413	7.351	7.475	1.40598	1.13464	0.81091	0.57104	0.41004	0.25492	2.82596	2016
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
152	Benin	Sub-Saharan Africa	153	3.484	3.404	3.564	0.39499	0.10419	0.21028	0.39747	0.06681	0.20180	2.10812	2016
153	Afghanistan	Southern Asia	154	3.360	3.288	3.432	0.38227	0.11037	0.17344	0.16430	0.07112	0.31268	2.14558	2016
154	Togo	Sub-Saharan Africa	155	3.303	3.192	3.414	0.28123	0.00000	0.24811	0.34678	0.11587	0.17517	2.13540	2016
155	Syria	Middle East and Northern Africa	156	3.069	2.936	3.202	0.74719	0.14866	0.62994	0.06912	0.17233	0.48397	0.81789	2016
156	Burundi	Sub-Saharan Africa	157	2.905	2.732	3.078	0.06831	0.23442	0.15747	0.04320	0.09419	0.20290	2.10404	2016

	Country	Region	Happiness Rank	Happiness Score	Standard Error	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Trust (Government Corruption)	Generosity	Dystopia Residual	Year
0	Switzerland	Western Europe	1	7.587	0.03411	1.39651	1.34951	0.94143	0.66557	0.41978	0.29678	2.51738	2015
1	Iceland	Western Europe	2	7.561	0.04884	1.30232	1.40223	0.94784	0.62877	0.14145	0.43630	2.70201	2015
2	Denmark	Western Europe	3	7.527	0.03328	1.32548	1.36058	0.87464	0.64938	0.48357	0.34139	2.49204	2015
3	Norway	Western Europe	4	7.522	0.03880	1.45900	1.33095	0.88521	0.66973	0.36503	0.34699	2.46531	2015
4	Canada	North America	5	7.427	0.03553	1.32629	1.32261	0.90563	0.63297	0.32957	0.45811	2.45176	2015
...	...	...	...	...	...	...	...	...	...	...	...	...	...
153	Rwanda	Sub-Saharan Africa	154	3.465	0.03464	0.22208	0.77370	0.42864	0.59201	0.55191	0.22628	0.67042	2015
154	Benin	Sub-Saharan Africa	155	3.340	0.03656	0.28665	0.35386	0.31910	0.48450	0.08010	0.18260	1.63328	2015
155	Syria	Middle East and Northern Africa	156	3.006	0.05015	0.66320	0.47489	0.72193	0.15684	0.18906	0.47179	0.32858	2015
156	Burundi	Sub-Saharan Africa	157	2.905	0.08658	0.01530	0.41587	0.22396	0.11850	0.10062	0.19727	1.83302	2015
157	Togo	Sub-Saharan Africa	158	2.839	0.06727	0.20868	0.13995	0.28443	0.36453	0.10731	0.16681	1.56726	2015