By Tom Grigg and Erika Russi (DS-10.22.2018)
Total Fertility Rate (TFR) is defined as the "average number of children that would be born to a woman by the time she ended childbearing if she were to pass through all her childbearing years conforming to the age-specific fertility rates of a given year." TFR is the best single measure to compare fertility across populations.
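For intuition, this verbal definition corresponds to a standard demographic identity (not something we compute below): writing $f_a$ for the age-specific fertility rate of the five-year age group $a$,

$$\mathrm{TFR} = 5 \sum_{a \,\in\, \{15\text{-}19,\ \ldots,\ 45\text{-}49\}} f_a .$$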
We built a model that predicts the TFR (births per woman) of a country in a given year. The independent variables in our model consisted of the following predictors by country and year (unless specified otherwise, all data pulled from the World Bank DataBank):
- GDP per capita (current US$)
- Population Density (people per sq. km of land area)
- Life expectancy at birth, total (years)
- Population ages 0-14 (% of total)
- Population ages 15-64 (% of total)
- Population ages 65 and above (% of total)
- Mortality rate, under-5 (per 1,000 live births)
- Mortality rate, infant (per 1,000 live births)
And, from the Geert Hofstede model for Cultural Dimensions (kept constant for each country across time):
- Long Term Orientation Index (measured 0 - 100 for each country)
- Indulgence vs. Restraint Index (measured 0 - 100 for each country)
Our data comes from two main sources: the World Bank DataBank and the Geert Hofstede model for Cultural Dimensions. These two sources store the data we are interested in differently and contain excess data that needs to be trimmed and cleaned. We defined a data-massaging class called Masseuse to handle this step.
The Masseuse object reads in the necessary CSV files from the specified path and merges them into the main dataframe. The World Bank datasets hold each feature's values in one column per year, but we wanted this data to be indexable by country and year. We elected to reframe our data with a MultiIndex of the form (ISO-3 Country Code, Year) so that the geography was unambiguous. This also involved filtering out overlapping regional aggregates to avoid double-counting later in our analysis.
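To illustrate what that reshaping involves, here is a minimal, self-contained sketch of the wide-to-long pivot in pandas; the toy values and column layout are invented for illustration, and the actual Masseuse implementation may differ.
import pandas as pd
# Toy World Bank-style frame: one column per year (values are invented).
wide = pd.DataFrame({'Country Code': ['ALB', 'AUS'],
                     '1984': [3.2, 1.9],
                     '1985': [3.1, 1.9]})
# Melt the year columns into rows, then index by (ISO-3 code, year).
long = wide.melt(id_vars='Country Code', var_name='Years', value_name='fertility')
long['Years'] = long['Years'].astype(int)
# Time-constant scores (e.g. the Hofstede indices) are broadcast across every year of a country.
hofstede = pd.DataFrame({'Country Code': ['ALB', 'AUS'],
                         'long term orientation': [61.0, 21.0]})
long = long.merge(hofstede, on='Country Code', how='left')
long = long.set_index(['Country Code', 'Years']).sort_index()
print(long.loc[('ALB', 1984)])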
We first import the Masseuse class and tell it to construct our data.
from masseuse import Masseuse
m = Masseuse(csv_dir='/Users/griggles/Documents/FLATIRON/PROJECT_2/csv/')
data = m.build_data()
We then build dummy variables for our categorical data. Our only categorical feature was each datapoint's continental region. We drop one of the dummy variables from the data in order to avoid introducing multicollinearity (the dummy-variable trap). We also found multicollinearity caused by the percentage of people between 15 and 64, so we dropped that column too (more on this later).
import pandas as pd
df = pd.get_dummies(data, columns=['continent'], drop_first=True)
df = df.drop(columns=['pop_percentage_15_to_64'])
The population density data we pulled was very spread out. For instance, in 2017 Macao had nearly 21,000 people per sq. km, while Mongolia had only 2. We researched other studies of population density and found that the convention is to analyse the log of population density. With this transformation we no longer needed to correct for outliers like Macao, because the spread of the data was greatly reduced.
import numpy as np
import seaborn as sns
df['log_density'] = np.log(df['density'])
sns.pairplot(df, vars=['density', 'fertility'])
sns.pairplot(df, vars=['log_density', 'fertility'])
from matplotlib import pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn import metrics
features = df.iloc[:, 1:]  # predictors: every column except the target in column 0
target = df['fertility']
features.head()
| Country Code | Years | mort_rate_under_5 | pop_percentage_above_65 | mort_rate_infant | pop_percentage_under_14 | life_expectancy | density | gdp | long term orientation | indulgence vs restraint | continent_Americas | continent_Asia | continent_Europe | continent_Oceania | log_density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 1984 | 58.2 | 5.390937 | 48.9 | 34.310923 | 71.134 | 106.001058 | 662.520052 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.663449 |
| ALB | 1985 | 54.0 | 5.399862 | 45.7 | 33.987349 | 71.388 | 108.202993 | 662.914793 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.684009 |
| ALB | 1986 | 50.4 | 5.413536 | 42.9 | 33.709853 | 71.605 | 110.315146 | 719.157296 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.703341 |
| ALB | 1987 | 47.2 | 5.417392 | 40.5 | 33.476210 | 71.760 | 112.540328 | 699.384292 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.723312 |
| ALB | 1988 | 44.5 | 5.423440 | 38.3 | 33.263863 | 71.843 | 114.683796 | 676.566733 | 61.0 | 15.0 | 0 | 0 | 1 | 0 | 4.742179 |
Initially we scaled our variables in order to standardise their variance. After a few iterations of our model, however, we noticed that this standardisation was hurting our p-values (particularly because some of our data, such as the percentage-based features, is effectively pre-scaled). We therefore decided not to standardise the data in our final OLS regression.
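For reference, the scaling step we experimented with and then discarded looked roughly like the sketch below (using sklearn's StandardScaler; variable names follow the notebook, but this is not part of the final pipeline).
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Standardise every predictor to zero mean and unit variance (discarded step).
scaler = StandardScaler()
features_scaled = pd.DataFrame(scaler.fit_transform(features),
                               index=features.index,
                               columns=features.columns)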
Once we reviewed the scatterplot matrix illustrating how the variables interacted with each other, we noticed that some of them appeared to have a quadratic relationship with fertility. For the preprocessing of our polynomial regression model, we therefore generated polynomial and interaction terms up to degree 2.
sns.pairplot(df, vars=['pop_percentage_under_14', 'fertility'])
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=32, test_size=0.2)
from sklearn import preprocessing
from sklearn import pipeline
poly = preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
features_135_train = pd.DataFrame(poly.fit_transform(X_train), index=X_train.index, columns=poly.get_feature_names(X_train.columns))
features_135_test = pd.DataFrame(poly.transform(X_test), index=X_test.index, columns=poly.get_feature_names(X_test.columns))
We began by removing low-variance variables from our regression analysis. Because of the differing scales of our independent variables, this required us to standardise our variable matrix so that variances could be compared; we removed all variables whose variance fell below our threshold of 0.5.
from sklearn.feature_selection import VarianceThreshold
thresholder = VarianceThreshold(threshold=.5)
def variance_threshold_selector(data, threshold=0.5):
selector = VarianceThreshold(threshold)
selector.fit(data)
return data[data.columns[selector.get_support(indices=True)]]
features_selected_train = variance_threshold_selector(features_135_train)
After reviewing our correlation matrix (visualised as a heat map), we removed one variable from any pair with a pairwise correlation above 0.95 (i.e. any predictors that were highly correlated with each other). Doing so reduces multicollinearity, which could otherwise make our estimates very sensitive to slight changes in the model.
After initial runs of our model, we noticed that the parameter 'Population ages 15-64 (% of total)' was inflating the p-values of the other variables with which it interacted. We surmised that this population metric was highly correlated with the other two percentage-based population metrics, so we ultimately removed it from the analysis. Once we did so, the p-values fell below the .05 threshold for a 95% confidence level.
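A quick way to quantify that kind of redundancy is the variance inflation factor (VIF). This check was not part of our pipeline, but a minimal sketch with statsmodels (assuming the column names used earlier in the notebook) would look like:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# The three population-share columns, as named in our merged data.
pop_cols = ['pop_percentage_under_14', 'pop_percentage_15_to_64', 'pop_percentage_above_65']
pop = sm.add_constant(data[pop_cols].dropna())
# Skip the constant (column 0); VIFs well above ~10 flag near-redundant predictors.
for i, col in enumerate(pop_cols, start=1):
    print(col, variance_inflation_factor(pop.values, i))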
import seaborn as sns
sns.set(style="white")
# Compute the correlation matrix
corr = features_selected_train.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": 1})
def calculate_cols_to_drop(upper):
    # Walk the upper triangle of the correlation matrix and collect one
    # column from each pair whose correlation exceeds 0.95.
    row_index = upper.index
    col_index = upper.columns
    to_drop = []
    for row in range(upper.shape[0]):
        for col in range(upper.shape[1]):
            if upper.iloc[row, col] > 0.95:
                if row_index[row] not in to_drop and col_index[col] not in to_drop:
                    to_drop.append(col_index[col])
    return to_drop
# Create correlation matrix
corr_matrix = features_selected_train.corr().abs()
# Select upper triangle of correlation matrix (so as not to include self-correlations of 1)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
features_selected_train.drop(columns=to_drop, inplace=True)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
def information_selector(X, y, scoring, k=5):
selector = SelectKBest(score_func=scoring, k=k)
selector.fit(X, y)
return X[X.columns[selector.get_support(indices=True)]]
# Keep the 30 features with the highest mutual-information score against fertility.
test = SelectKBest(score_func=mutual_info_regression, k=30)
fit = test.fit(features_selected_train, y_train)
features_selected_train = information_selector(features_selected_train, y_train, mutual_info_regression, k=30)
lm = linear_model.LinearRegression(fit_intercept=True)
model = lm.fit(features_selected_train, y_train)
import statsmodels.api as sm
import pylab
import statsmodels.formula.api as smf
est = sm.OLS(y_train, features_selected_train)
est2 = est.fit()
hi_p = est2.pvalues
p_limit = 0.05
hi_p = hi_p[hi_p > p_limit]
print(hi_p.index.tolist())
['long term orientation', 'mort_rate_under_5^2', 'mort_rate_under_5 pop_percentage_above_65', 'mort_rate_under_5 density', 'mort_rate_under_5 long term orientation', 'mort_rate_under_5 indulgence vs restraint', 'pop_percentage_under_14 continent_Europe', 'long term orientation indulgence vs restraint']
# Retrain with the high p-value features dropped
features_selected_train = features_selected_train.drop(columns=hi_p.index.tolist())
# New model
test = SelectKBest(score_func=mutual_info_regression, k='all')
fit = test.fit(features_selected_train, y_train)
features_selected_train = information_selector(features_selected_train, y_train, mutual_info_regression, k='all')
# fit a model
lm2 = linear_model.LinearRegression(fit_intercept=True)
model2 = lm2.fit(features_selected_train, y_train)
features_selected_test = features_135_test[features_selected_train.columns]
y_pred = lm2.predict(features_selected_test)
print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
from matplotlib import pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.plot(np.unique(y_test), np.poly1d(np.polyfit(y_test, y_pred, 1))(np.unique(y_test)), color='r')
Mean Absolute Error:  0.30255669762836795
Mean Squared Error:  0.1742911858907531
Root Mean Squared Error:  0.4174819587607985
est3 = sm.OLS(y_train, features_selected_train)
est4 = est3.fit()
print(est4.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: fertility R-squared: 0.988
Model: OLS Adj. R-squared: 0.988
Method: Least Squares F-statistic: 1.197e+04
Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
Time: 14:33:47 Log-Likelihood: -1613.9
No. Observations: 3136 AIC: 3272.
Df Residuals: 3114 BIC: 3405.
Df Model: 22
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
mort_rate_under_5 -0.0046 0.001 -5.785 0.000 -0.006 -0.003
pop_percentage_above_65 0.1874 0.020 9.468 0.000 0.149 0.226
pop_percentage_under_14 0.2822 0.009 29.990 0.000 0.264 0.301
life_expectancy -0.0246 0.005 -5.202 0.000 -0.034 -0.015
gdp 9.798e-06 2.04e-06 4.808 0.000 5.8e-06 1.38e-05
indulgence vs restraint -0.0290 0.004 -6.515 0.000 -0.038 -0.020
log_density -1.4751 0.112 -13.141 0.000 -1.695 -1.255
mort_rate_under_5 log_density 0.0032 0.000 16.397 0.000 0.003 0.004
pop_percentage_above_65 pop_percentage_under_14 -0.0045 0.000 -10.842 0.000 -0.005 -0.004
pop_percentage_above_65 long term orientation -0.0005 0.000 -4.310 0.000 -0.001 -0.000
pop_percentage_above_65 indulgence vs restraint 0.0008 0.000 4.168 0.000 0.000 0.001
pop_percentage_above_65 log_density -0.0109 0.003 -3.829 0.000 -0.016 -0.005
pop_percentage_under_14 life_expectancy -0.0011 0.000 -6.530 0.000 -0.001 -0.001
pop_percentage_under_14 long term orientation -0.0005 3.76e-05 -12.921 0.000 -0.001 -0.000
pop_percentage_under_14 indulgence vs restraint 0.0002 8.2e-05 2.680 0.007 5.9e-05 0.000
pop_percentage_under_14 log_density -0.0055 0.001 -4.082 0.000 -0.008 -0.003
life_expectancy log_density 0.0161 0.001 11.281 0.000 0.013 0.019
density continent_Europe -0.0002 8.61e-05 -2.188 0.029 -0.000 -1.95e-05
gdp long term orientation -1.939e-07 3.76e-08 -5.161 0.000 -2.68e-07 -1.2e-07
long term orientation log_density 0.0040 0.000 12.421 0.000 0.003 0.005
indulgence vs restraint log_density 0.0037 0.000 9.033 0.000 0.003 0.004
continent_Europe log_density 0.0604 0.008 7.989 0.000 0.046 0.075
==============================================================================
Omnibus: 253.354 Durbin-Watson: 2.054
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1086.426
Skew: 0.281 Prob(JB): 1.22e-236
Kurtosis: 5.828 Cond. No. 1.55e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.55e+07. This might indicate that there are
strong multicollinearity or other numerical problems.




