Investigation of the contribution of women's rights to the gender gap

An experiment in applying several learning algorithms to the problem of predicting a country's gender gap from women's rights data. The data comes from the World Bank and covers 38 specific rights for each country in 2013. Each right is a binary value stating whether it is implemented by law, e.g. "Does the law mandate paid or unpaid paternity leave?" In addition, four continuous values are used: GDP per capita, Gini coefficient, mean age at first birth and educational level.

In decision trees, entropy or information gain is used to determine the most efficient split, i.e. which feature should be used next to split the data set so that as many examples as possible are isolated. This can also be interpreted as the feature that is most important for explaining the gender gap. Below is a list of the rights with the highest weight:
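To make the split criterion concrete, here is a minimal sketch of information gain for a single binary right. The function names and the toy data are hypothetical and purely illustrative; they are not taken from the project's code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting `labels` on the values of `feature`."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical toy data: 1 = the right is in law, 0 = it is not.
gap_class = ["high", "high", "high", "high", "low", "low", "low", "low"]
separating_right = [1, 1, 1, 1, 0, 0, 0, 0]    # splits the classes perfectly
unrelated_right = [1, 1, 0, 0, 1, 1, 0, 0]     # tells us nothing about the class

print(information_gain(separating_right, gap_class))
print(information_gain(unrelated_right, gap_class))
```

A right that separates the classes well gets a high gain and is chosen early in the tree, which is why the gain can be read as a rough importance weight.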

When the continuous values, which might contribute greatly to a country's gender gap, are included, they all end up among the 5 highest-weighted features:

The gender gap is a value between 0 and 1. To enable these classifiers to predict a continuous value, the gender gap is binned into 8 and 10 classes respectively, using equal-sized bins from the minimum to the maximum gender gap. K-NN is run on both splits and scores very poorly.
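The binning and the K-NN step can be sketched roughly as follows. This is an illustration under assumptions: the function names, the example scores and the choice of Hamming distance for the binary rights are all hypothetical, since the distance measure actually used is not stated here.

```python
from collections import Counter

def bin_equal_width(values, n_bins):
    """Map each value to a class 0..n_bins-1 using equal-width bins from min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into a single class.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority class among the k training rows closest to `query`."""
    # Assumption: Hamming distance (number of rights answered differently).
    distances = sorted(
        (sum(a != b for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical gender gap scores, binned into 8 and 10 classes.
gap_scores = [0.52, 0.61, 0.66, 0.70, 0.74, 0.81, 0.88]
classes_8 = bin_equal_width(gap_scores, 8)
classes_10 = bin_equal_width(gap_scores, 10)
print(classes_8)
print(classes_10)
```

Note that equal-width bins can be very unevenly populated when scores cluster in a narrow range, which leaves a nearest-neighbour vote with few examples in the outer classes.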

Histogram of how wrong the classifier is in 5-fold cross-validation on the optimised set (10 highest-weighted rights + continuous values)
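One way to build such a histogram is to count how many bins each prediction lies from the true class. The sketch below assumes that "how wrong" means the signed bin distance and that the predictions from all folds are pooled; the function name and the sample values are hypothetical.

```python
from collections import Counter

def error_histogram(true_classes, predicted_classes):
    """Count predictions by signed distance in bins from the true class (0 = correct)."""
    return Counter(p - t for t, p in zip(true_classes, predicted_classes))

# Hypothetical pooled predictions from the cross-validation folds.
true_classes = [0, 1, 2, 3, 4, 4]
predicted_classes = [0, 2, 2, 1, 4, 5]

histogram = error_histogram(true_classes, predicted_classes)
for offset in sorted(histogram):
    # Print a simple text bar per offset.
    print(f"{offset:+d}: {'#' * histogram[offset]}")
```

Because the classes are ordered bins, the distance from zero shows whether the classifier is slightly off or badly off, which plain accuracy hides.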

The results suggest that the women's rights data set does not contain sufficient information. This is emphasised by a multiple regression on the four continuous values that was used as a classifier: it scored remarkably better than classification by women's rights. Therefore, it cannot be concluded that the rights highlighted above are in fact the greatest drivers of a higher (better) gender gap. Contact me if you want to read the complete research.

https://github.com/andbis/genderGap