By Tom Grigg and Erika Russi (DS-10.22.2018)
Total Fertility Rate (TFR) is defined as the "average number of children that would be born to a woman by the time she ended childbearing if she were to pass through all her childbearing years conforming to the age-specific fertility rates of a given year." TFR is the best single measure to compare fertility across populations.
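For intuition, this verbal definition corresponds to a standard demographic identity (not something we compute below): writing $f_a$ for the age-specific fertility rate of the five-year age group $a$,

$$\mathrm{TFR} = 5 \sum_{a \,\in\, \{15\text{-}19,\ \ldots,\ 45\text{-}49\}} f_a .$$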
We built a model that predicts the TFR (births per woman) of a country in a given year. The independent variables in our model consisted of the following predictors by country and year (unless specified otherwise, all data pulled from the World Bank DataBank):
- GDP per capita (current US$)
- Population Density (people per sq. km of land area)
- Life expectancy at birth, total (years)
- Population ages 0-14 (% of total)
- Population ages 15-64 (% of total)
- Population ages 65 and above (% of total)
- Mortality rate, under-5 (per 1,000 live births)
- Mortality rate, infant (per 1,000 live births)
And, from the Geert Hofstede model for Cultural Dimensions (kept constant for each country across time):
- Long Term Orientation Index (measured 0 - 100 for each country)
- Indulgence vs. Restraint Index (measured 0 - 100 for each country)
Our data comes from two main sources: the World Bank DataBank and the Geert Hofstede model for Cultural Dimensions. These two sources store the data we are interested in differently and contain excess data that needs to be trimmed and cleaned. We defined a data-massaging class called Masseuse to handle this step.
The Masseuse object reads in the necessary CSV files from the specified path and merges them into the main dataframe. The World Bank datasets hold each feature's values in one column per year, but we wanted this data to be indexable by country and year. We elected to reframe our data with a MultiIndex of the form (ISO-3 Country Code, Year) so that the geography was unambiguous. This also involved filtering out overlapping regional aggregates to avoid double-counting later in our analysis.
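To illustrate what that reshaping involves, here is a minimal, self-contained sketch of the wide-to-long pivot in pandas; the toy values and column layout are invented for illustration, and the actual Masseuse implementation may differ.
import pandas as pd
# Toy World Bank-style frame: one column per year (values are invented).
wide = pd.DataFrame({'Country Code': ['ALB', 'AUS'],
                     '1984': [3.2, 1.9],
                     '1985': [3.1, 1.9]})
# Melt the year columns into rows, then index by (ISO-3 code, year).
long = wide.melt(id_vars='Country Code', var_name='Years', value_name='fertility')
long['Years'] = long['Years'].astype(int)
# Time-constant scores (e.g. the Hofstede indices) are broadcast across every year of a country.
hofstede = pd.DataFrame({'Country Code': ['ALB', 'AUS'],
                         'long term orientation': [61.0, 21.0]})
long = long.merge(hofstede, on='Country Code', how='left')
long = long.set_index(['Country Code', 'Years']).sort_index()
print(long.loc[('ALB', 1984)])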
We first import the Masseuse class and tell it to construct our data.
from masseuse import Masseuse
m = Masseuse(csv_dir='/Users/griggles/Documents/FLATIRON/PROJECT_2/csv/')
data = m.build_data()
We then build dummy variables for our categorical data. Our only categorical feature was each datapoint's continental region. We drop one of the dummy variables from the data in order to avoid introducing multicollinearity (the dummy-variable trap). We also found multicollinearity caused by the percentage of people between 15 and 64, so we dropped that column too (more on this later).
import pandas as pd
df = pd.get_dummies(data, columns=['continent'], drop_first=True)
df = df.drop(columns=['pop_percentage_15_to_64'])
The population density data we pulled was very spread out. For instance, in 2017 Macao had nearly 21,000 people per sq. km, while Mongolia had only 2. We researched other studies of population density and found that the convention is to analyse the log of population density. With this transformation we no longer needed to correct for outliers like Macao, because the spread of the data was greatly reduced.
import numpy as np
import seaborn as sns
df['log_density'] = np.log(df['density'])
sns.pairplot(df, vars=['density', 'fertility'])
sns.pairplot(df, vars=['log_density', 'fertility'])
from matplotlib import pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn import metrics
features = df.iloc[:, 1:]  # predictors: every column except the target in column 0
target = df['fertility']
features.head()
| Country Code | Years | mort_rate_under_5 | pop_percentage_above_65 | mort_rate_infant | pop_percentage_under_14 | life_expectancy | density | gdp | long term orientation | indulgence vs restraint | continent_Americas | continent_Asia | continent_Europe | continent_Oceania | log_density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 1984 | 58.2 | 5.390937 | 48.9 | 34.310923 | 71.134 | 106.001058 | 662.520052 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.663449 |
| ALB | 1985 | 54.0 | 5.399862 | 45.7 | 33.987349 | 71.388 | 108.202993 | 662.914793 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.684009 |
| ALB | 1986 | 50.4 | 5.413536 | 42.9 | 33.709853 | 71.605 | 110.315146 | 719.157296 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.703341 |
| ALB | 1987 | 47.2 | 5.417392 | 40.5 | 33.476210 | 71.760 | 112.540328 | 699.384292 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.723312 |
| ALB | 1988 | 44.5 | 5.423440 | 38.3 | 33.263863 | 71.843 | 114.683796 | 676.566733 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.742179 |
Initially we scaled our variables in order to standardise their variance. After a few iterations of our model, however, we noticed that this standardisation was hurting our p-values (particularly because some of our data, such as the percentage-based features, is effectively pre-scaled). We therefore decided not to standardise the data in our final OLS regression.
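For reference, the scaling step we experimented with and then discarded looked roughly like the sketch below (using sklearn's StandardScaler; variable names follow the notebook, but this is not part of the final pipeline).
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Standardise every predictor to zero mean and unit variance (discarded step).
scaler = StandardScaler()
features_scaled = pd.DataFrame(scaler.fit_transform(features),
                               index=features.index,
                               columns=features.columns)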
Once we reviewed the scatterplot matrix illustrating how the variables interacted with each other, we noticed that some of them appeared to have a quadratic relationship with fertility. For the preprocessing of our polynomial regression model, we therefore generated polynomial and interaction terms up to degree 2.
sns.pairplot(df, vars=['pop_percentage_under_14', 'fertility'])
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=32, test_size=0.2)
from sklearn import preprocessing
from sklearn import pipeline
poly = preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
features_135_train = pd.DataFrame(poly.fit_transform(X_train), index=X_train.index, columns=poly.get_feature_names(X_train.columns))
features_135_test = pd.DataFrame(poly.transform(X_test), index=X_test.index, columns=poly.get_feature_names(X_test.columns))
We began by removing low-variance variables from our regression analysis. Because of the differing scales of our independent variables, this required us to standardise our variable matrix so that variances could be compared; we removed all variables whose variance fell below our threshold of 0.5.
from sklearn.feature_selection import VarianceThreshold
thresholder = VarianceThreshold(threshold=.5)
def variance_threshold_selector(data, threshold=0.5):
selector = VarianceThreshold(threshold)
selector.fit(data)
return data[data.columns[selector.get_support(indices=True)]]
features_selected_train = variance_threshold_selector(features_135_train)
After reviewing our correlation matrix (visualised as a heat map), we removed one variable from any pair with a pairwise correlation above 0.95 (i.e. any predictors that were highly correlated with each other). Doing so reduces multicollinearity, which could otherwise make our estimates very sensitive to slight changes in the model.
After initial runs of our model, we noticed that the parameter 'Population ages 15-64 (% of total)' was inflating the p-values of the other variables with which it interacted. We surmised that this population metric was highly correlated with the other two percentage-based population metrics, so we ultimately removed it from the analysis. Once we did so, the p-values fell below the .05 threshold for a 95% confidence level.
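A quick way to quantify that kind of redundancy is the variance inflation factor (VIF). This check was not part of our pipeline, but a minimal sketch with statsmodels (assuming the column names used earlier in the notebook) would look like:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# The three population-share columns, as named in our merged data.
pop_cols = ['pop_percentage_under_14', 'pop_percentage_15_to_64', 'pop_percentage_above_65']
pop = sm.add_constant(data[pop_cols].dropna())
# Skip the constant (column 0); VIFs well above ~10 flag near-redundant predictors.
for i, col in enumerate(pop_cols, start=1):
    print(col, variance_inflation_factor(pop.values, i))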
import seaborn as sns
sns.set(style="white")
# Compute the correlation matrix
corr = features_selected_train.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": 1})
def calculate_cols_to_drop(upper):
    # Walk the upper triangle of the correlation matrix and collect one
    # column from each pair whose correlation exceeds 0.95.
    row_index = upper.index
    col_index = upper.columns
    to_drop = []
    for row in range(upper.shape[0]):
        for col in range(upper.shape[1]):
            if upper.iloc[row, col] > 0.95:
                if row_index[row] not in to_drop and col_index[col] not in to_drop:
                    to_drop.append(col_index[col])
    return to_drop
# Create correlation matrix
corr_matrix = features_selected_train.corr().abs()
# Select upper triangle of correlation matrix (so as not to include self-correlations of 1)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
features_selected_train.drop(columns=to_drop, inplace=True)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
def information_selector(X, y, scoring, k=5):
selector = SelectKBest(score_func=scoring, k=k)
selector.fit(X, y)
return X[X.columns[selector.get_support(indices=True)]]
# Keep the 30 features with the highest mutual-information score against fertility.
test = SelectKBest(score_func=mutual_info_regression, k=30)
fit = test.fit(features_selected_train, y_train)
features_selected_train = information_selector(features_selected_train, y_train, mutual_info_regression, k=30)
lm = linear_model.LinearRegression(fit_intercept=True)
model = lm.fit(features_selected_train, y_train)
import statsmodels.api as sm
import pylab
import statsmodels.formula.api as smf
est = sm.OLS(y_train, features_selected_train)
est2 = est.fit()
hi_p = est2.pvalues
p_limit = 0.05
hi_p = hi_p[hi_p > p_limit]
print(hi_p.index.tolist())
['long term orientation', 'mort_rate_under_5^2', 'mort_rate_under_5 pop_percentage_above_65', 'mort_rate_under_5 density', 'mort_rate_under_5 long term orientation', 'mort_rate_under_5 indulgence vs restraint', 'pop_percentage_under_14 continent_Europe', 'long term orientation indulgence vs restraint']
# Retrain with the high p-value features dropped
features_selected_train = features_selected_train.drop(columns=hi_p.index.tolist())
# New model
test = SelectKBest(score_func=mutual_info_regression, k='all')
fit = test.fit(features_selected_train, y_train)
features_selected_train = information_selector(features_selected_train, y_train, mutual_info_regression, k='all')
# fit a model
lm2 = linear_model.LinearRegression(fit_intercept=True)
model2 = lm2.fit(features_selected_train, y_train)
features_selected_test = features_135_test[features_selected_train.columns]
y_pred = lm2.predict(features_selected_test)
print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
from matplotlib import pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.plot(np.unique(y_test), np.poly1d(np.polyfit(y_test, y_pred, 1))(np.unique(y_test)), color='r')
Mean Absolute Error:  0.30255669762836795
Mean Squared Error:  0.1742911858907531
Root Mean Squared Error:  0.4174819587607985
est3 = sm.OLS(y_train, features_selected_train)
est4 = est3.fit()
print(est4.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: fertility R-squared: 0.988
Model: OLS Adj. R-squared: 0.988
Method: Least Squares F-statistic: 1.197e+04
Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
Time: 14:33:47 Log-Likelihood: -1613.9
No. Observations: 3136 AIC: 3272.
Df Residuals: 3114 BIC: 3405.
Df Model: 22
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
mort_rate_under_5 -0.0046 0.001 -5.785 0.000 -0.006 -0.003
pop_percentage_above_65 0.1874 0.020 9.468 0.000 0.149 0.226
pop_percentage_under_14 0.2822 0.009 29.990 0.000 0.264 0.301
life_expectancy -0.0246 0.005 -5.202 0.000 -0.034 -0.015
gdp 9.798e-06 2.04e-06 4.808 0.000 5.8e-06 1.38e-05
indulgence vs restraint -0.0290 0.004 -6.515 0.000 -0.038 -0.020
log_density -1.4751 0.112 -13.141 0.000 -1.695 -1.255
mort_rate_under_5 log_density 0.0032 0.000 16.397 0.000 0.003 0.004
pop_percentage_above_65 pop_percentage_under_14 -0.0045 0.000 -10.842 0.000 -0.005 -0.004
pop_percentage_above_65 long term orientation -0.0005 0.000 -4.310 0.000 -0.001 -0.000
pop_percentage_above_65 indulgence vs restraint 0.0008 0.000 4.168 0.000 0.000 0.001
pop_percentage_above_65 log_density -0.0109 0.003 -3.829 0.000 -0.016 -0.005
pop_percentage_under_14 life_expectancy -0.0011 0.000 -6.530 0.000 -0.001 -0.001
pop_percentage_under_14 long term orientation -0.0005 3.76e-05 -12.921 0.000 -0.001 -0.000
pop_percentage_under_14 indulgence vs restraint 0.0002 8.2e-05 2.680 0.007 5.9e-05 0.000
pop_percentage_under_14 log_density -0.0055 0.001 -4.082 0.000 -0.008 -0.003
life_expectancy log_density 0.0161 0.001 11.281 0.000 0.013 0.019
density continent_Europe -0.0002 8.61e-05 -2.188 0.029 -0.000 -1.95e-05
gdp long term orientation -1.939e-07 3.76e-08 -5.161 0.000 -2.68e-07 -1.2e-07
long term orientation log_density 0.0040 0.000 12.421 0.000 0.003 0.005
indulgence vs restraint log_density 0.0037 0.000 9.033 0.000 0.003 0.004
continent_Europe log_density 0.0604 0.008 7.989 0.000 0.046 0.075
==============================================================================
Omnibus: 253.354 Durbin-Watson: 2.054
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1086.426
Skew: 0.281 Prob(JB): 1.22e-236
Kurtosis: 5.828 Cond. No. 1.55e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.55e+07. This might indicate that there are
strong multicollinearity or other numerical problems.




