Introduction¶
In this report, we'll examine a dataset for the NDBI048 Data Science course.
Dataset file¶
We're using the elections dataset from the course's GitLab.
The selected dataset contains all-around information about Czech elections from 2013 to 2017.
Below is the content of the volby_description.csv file, which describes the dataset's columns.
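A minimal sketch for loading the description file (assuming it sits next to volby.sav and uses a standard comma-separated layout - both unverified assumptions) could look like this:
import pandas as pd
# Hedged sketch: load the column-description file; the path, separator and encoding are assumptions.
desc = pd.read_csv('volby_dataset/volby_description.csv')
print(desc.head(20))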
As we can see, it includes a lot of unclear, duplicate, inconsistent and uninteresting values, as well as information about the 2016 regional elections. This raises concerns about the dataset's future expandability right away.
The volby.sav file itself, which we're using, is roughly 12 MB in size and has a little under 15,000 rows and 183 columns.
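As a quick, hedged check of those figures (assuming the same file path used later in this report):
import os
import pyreadstat
# Hedged sketch: confirm the reported file size and dimensions.
print(f"File size: {os.path.getsize('volby_dataset/volby.sav') / 1e6:.1f} MB")
df_check, meta_check = pyreadstat.read_sav('volby_dataset/volby.sav')
print(f"Shape: {df_check.shape}")  # expected: a little under 15,000 rows and 183 columns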
Overall, the dataset's structure is really messy.
Goals¶
The reason for carrying out the analysis is to explore and possibly predict voting preferences of specific population units. We expect to find certain voter bases of large political entities.
We want to explore parliamentary elections only.
The analysis could point at deprived regional units as well as commonalities between certain sections of the population.
Data¶
We'll focus on the dataset's structural and semantic information here.
Structure¶
Let's have a look at a random sample of 3000 rows (first 5 shown).
import pandas as pd
import pyreadstat
K, meta = pyreadstat.read_sav('volby_dataset/volby.sav')
K.columns = K.columns.str.lower()
K = K.sample(n=3000, axis=0, random_state=42)
K.head()
id_n | obec | nazev_obce | okres_nazev | kraj_nazev | obec_okrsek_final2016 | obec_okrsek | par_2017_1_občanskádemokratickástrana | par_2017_2_řádnárodavlasteneckáunie | par_2017_3_cestaodpovědnéspolečnosti | ... | vzd_nast_a | vzd_str_sm_a | vzd_str_be_a | vzd_zakl_a | nab_rimsko_a | eko_zam_a | eko_post_p_a | eko_nezam_a | nepracduch_a | vaha | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1743 | 1744.0 | 539911.0 | Příbram | Příbram | Středočeský kraj | 5399119.0 | 539911_9 | 60.0 | 1.0 | 0.0 | ... | 2.879377 | 23.891051 | 26.459144 | 15.642023 | 8.638132 | 44.669261 | 5.758755 | 4.824903 | 23.735409 | 1.685789 |
12414 | 12457.0 | 597520.0 | Krnov | Bruntál | Moravskoslezský kraj | 5975204.0 | 597520_4 | 19.0 | 1.0 | 0.0 | ... | 1.560624 | 21.008403 | 30.852341 | 15.726291 | 6.962785 | 40.216086 | 4.321729 | 6.842737 | 26.770708 | 1.066376 |
1593 | 1594.0 | 531821.0 | Tachlovice | Praha-západ | Středočeský kraj | 5318211.0 | 531821_1 | 66.0 | 2.0 | 0.0 | ... | 2.264601 | 20.619785 | 27.056019 | 10.607867 | 11.442193 | 51.489869 | 10.369487 | 3.337306 | 13.230036 | 1.222989 |
1858 | 1859.0 | 541231.0 | Rožmitál pod Třemšínem | Příbram | Středočeský kraj | 5412311.0 | 541231_1 | 44.0 | 1.0 | 0.0 | ... | 1.955307 | 23.882682 | 31.284916 | 19.413408 | 9.636872 | 41.899441 | 7.681564 | 6.145251 | 27.653631 | 1.282819 |
13650 | 13694.0 | 598836.0 | Václavovice | Ostrava-město | Moravskoslezský kraj | 5988361.0 | 598836_1 | 90.0 | 3.0 | 0.0 | ... | 2.286336 | 25.585193 | 26.238432 | 12.411541 | 15.133370 | 45.944475 | 5.498095 | 3.483941 | 22.972237 | 2.731049 |
5 rows × 183 columns
At first sight, the dataset is a tragedy.
There are plenty of so-called technical columns, which we won't take into consideration. They are described in the volby_description.csv file above.
Apart from the dataset's wrong encoding, we are mainly concerned about having political parties in separate columns (as well as their vote percentages - columns ending with _p), since in the upcoming exploratory analysis we'll probably need to group by political party, which would be less intuitive with this layout.
For now, let's have a look at the sample's basic info.
K.info()
<class 'pandas.core.frame.DataFrame'> Index: 3000 entries, 1743 to 10739 Columns: 183 entries, id_n to vaha dtypes: float64(174), object(9) memory usage: 4.2+ MB
Columns that contain textual values are:
K[K.select_dtypes(include=['object']).columns].head()
nazev_obce | okres_nazev | kraj_nazev | obec_okrsek | chyba_kraj16 | poriz_okr_kraj16 | okr_time_kraj16 | poriz_time_kraj16 | párování | |
---|---|---|---|---|---|---|---|---|---|
1743 | Příbram | Příbram | Středočeský kraj | 539911_9 | 0 | 00:00:01 | 02:48:44 | 05:50:01 | 539911_9 |
12414 | Krnov | Bruntál | Moravskoslezský kraj | 597520_4 | 0 | 00:00:01 | 03:02:29 | 03:43:43 | 597520_4 |
1593 | Tachlovice | Praha-západ | Středočeský kraj | 531821_1 | 0 | 00:00:01 | 03:28:26 | 06:07:34 | 531821_1 |
1858 | Rožmitál pod Třemšínem | Příbram | Středočeský kraj | 541231_1 | 0 | 00:00:01 | 02:48:21 | 03:42:08 | 541231_1 |
13650 | Václavovice | Ostrava-město | Moravskoslezský kraj | 598836_1 | 0 | 00:00:01 | 05:55:34 | 08:39:03 | 598836_1 |
It seems like there are no duplicate rows.
duplicate_rows = K[K.duplicated(keep=False)]
print("Number of duplicate rows: ", duplicate_rows.shape[0])
Number of duplicate rows: 0
But in terms of columns, there seem to be issues that can be addressed.
Apart from the technical columns, the columns 'obec_okrsek_final2016', 'obec_okrsek' and 'par_2017_okrsek' seem to share some values.
K[['obec_okrsek_final2016', 'obec_okrsek', 'par_2017_okrsek']].head()
obec_okrsek_final2016 | obec_okrsek | par_2017_okrsek | |
---|---|---|---|
1743 | 5399119.0 | 539911_9 | 9.0 |
12414 | 5975204.0 | 597520_4 | 4.0 |
1593 | 5318211.0 | 531821_1 | 1.0 |
1858 | 5412311.0 | 541231_1 | 1.0 |
13650 | 5988361.0 | 598836_1 | 1.0 |
We can see from the result below that 'obec_okrsek_final2016' and 'obec_okrsek' columns share almost every value.
satisfying_rows, unsatisfying_rows, mismatched_values = [], [], []
spec_K = K[['obec_okrsek_final2016', 'obec_okrsek']]
for index, row in spec_K.iterrows():
val1, val2 = row['obec_okrsek_final2016'], row['obec_okrsek']
if pd.isna(val1) or pd.isna(val2):
continue
if str(int(val1)) != val2.replace("_", ""):
unsatisfying_rows.append(index)
mismatched_values.append((val1, val2))
else:
satisfying_rows.append(index)
print(f"Rows where mismatch between non-nan values occurred:\n{unsatisfying_rows[:min(15, len(unsatisfying_rows))]}")
if (len(unsatisfying_rows) > 0):
for i in range(min(5, len(mismatched_values))):
print(f"{mismatched_values[i][0]} != {mismatched_values[i][1]}")
if len(unsatisfying_rows) > 5:
print("...")
print(f"Number of equal values:\n{len(satisfying_rows)} out of {len(spec_K.index)}")
Rows where mismatch between non-nan values occurred: [5749, 6128, 13527, 3838, 6086, 13402, 9526, 9742, 9839, 3745, 9688, 9671, 3817, 9538, 6156] 56789223.0 != 554804_1023 55690437.0 != 563889_37 54622431.0 != 554821_15031 54600357.0 != 554791_3057 5568912.0 != 563889_1002 ... Number of equal values: 2542 out of 3000
We don't have enough information to determine why this happens, but it seems these values would not fit the usual concatenated format the way the matching values do - so in these cases, an unusual numbering appears.
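A small hedged check of that guess - comparing digit counts of the mismatched pairs (purely illustrative):
# Hedged sketch: if the mismatched values were simple concatenations like the matching ones,
# both sides of each pair should have the same number of digits.
for val1, val2 in mismatched_values[:5]:
    print(f"{len(str(int(val1)))} digits vs {len(val2.replace('_', ''))} digits")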
pd.DataFrame({
'nan_values': K[['obec_okrsek_final2016', 'obec_okrsek']].isna().sum()
})
nan_values | |
---|---|
obec_okrsek_final2016 | 225 |
obec_okrsek | 0 |
From the table, we can't be one hundred percent sure, but we can conclude that the values that don't match are either unusual (in 'obec_okrsek_final2016'), missing or null.
So 'obec_okrsek_final2016' should be dropped, and the 'obec_okrsek' column should be split into two separate columns.
Flaws¶
Let's look at missing values and the dataset's other problems in general.
na_counts = K.isnull().sum()
rows = K.shape[0]
na_percentage = round((na_counts / rows) * 100, 2)
na_K = pd.DataFrame({
'nan_values': na_counts,
'percentage': na_percentage,
})
na_K = na_K[na_K['nan_values'] > 0].sort_values(by='percentage', ascending=False)
na_K
nan_values | percentage | |
---|---|---|
par_2017_okrsek | 234 | 7.8 |
par_2017_id_okrsky | 234 | 7.8 |
par_2017_oprava | 234 | 7.8 |
par_2017_chyba | 234 | 7.8 |
par_2017_okres | 234 | 7.8 |
... | ... | ... |
par_2017_24_křesť.demokr.uniečs.str.lid._p | 6 | 0.2 |
par_2017_23_sprrepubl.str.čsl.m.sládka_p | 6 | 0.2 |
par_2017_22_dobrávolba2016_p | 6 | 0.2 |
par_2017_21_ano2011_p | 6 | 0.2 |
par_2017_6_radostnéčesko | 6 | 0.2 |
138 rows × 2 columns
We should also look at how they group together, supposing that columns with the same percentage of NaN values share the same flaws or inconsistencies.
na_K.value_counts()
nan_values percentage 6 0.20 63 8 0.27 51 234 7.80 12 225 7.50 9 199 6.63 3 Name: count, dtype: int64
Unfortunately, almost every column contains missing or NaN values. Let's examine what could be the reason behind that, keeping in mind that we'll ignore values we won't use in our exploratory analysis (that is, the "kraj16" values).
This is how it looks if we're ignoring the regional elections:
na_K = na_K.loc[[idx for idx in na_K.index if not idx.endswith('kraj16') and not idx.endswith('final2016')]]
print(na_K.value_counts())
nan_values percentage 6 0.20 63 8 0.27 51 234 7.80 12 Name: count, dtype: int64
But now, let's see which are the worst:
unique_na_K = sorted(list(set(na_K['percentage'])))
na_K[na_K['percentage'] == unique_na_K[-1]]
nan_values | percentage | |
---|---|---|
par_2017_okrsek | 234 | 7.8 |
par_2017_id_okrsky | 234 | 7.8 |
par_2017_oprava | 234 | 7.8 |
par_2017_chyba | 234 | 7.8 |
par_2017_okres | 234 | 7.8 |
par_2017_kc_1 | 234 | 7.8 |
par_2017_kc_2 | 234 | 7.8 |
par_2017_vol_seznam | 234 | 7.8 |
par_2017_vyd_obalky | 234 | 7.8 |
par_2017_odevz_obal | 234 | 7.8 |
par_2017_pl_hl_celk | 234 | 7.8 |
par_2017_typ_form | 234 | 7.8 |
We will drop those columns all at once later on, so we don't need to worry about them right now.
Let's move on to the political parties' columns. First, we'll take a look at how NaN or null values are distributed across parties in 2013 and 2017. Second, we'll try to find out where those values are missing geographically.
We also have to point out that the structure of the party columns is inconsistent between the years.
cols_votes_17 = [col for col in K.columns if col[:4] == "par_" and col[9].isdigit() and not col.endswith("_p")]
cols_votes_13 = [col for col in K.columns if col[:4] == "par1" and not col.endswith("_p")]
miss_df_13 = K[cols_votes_13].isna().sum()
miss_df_17 = K[cols_votes_17].isna().sum()
print(
f"Number of Nan or Null values in 2013 parties:\n sum={miss_df_13.sum()}, mean={miss_df_13.mean()}, min={miss_df_13.min()}, max={miss_df_13.max()}\n\n" + \
f"Number of Nan or Null values in 2017 parties:\n sum={miss_df_17.sum()}, mean={miss_df_17.mean()}, min={miss_df_17.min()}, max={miss_df_17.max()}"
)
Number of Nan or Null values in 2013 parties: sum=184, mean=8.0, min=8, max=8 Number of Nan or Null values in 2017 parties: sum=186, mean=6.0, min=6, max=6
It seems like the missing or NaN values are year-specific. Which rows are affected, what do they look like, and what do they contain?
2013:
nan_13_head = K[K[cols_votes_13].isna()].head()
nan_13_head
id_n | obec | nazev_obce | okres_nazev | kraj_nazev | obec_okrsek_final2016 | obec_okrsek | par_2017_1_občanskádemokratickástrana | par_2017_2_řádnárodavlasteneckáunie | par_2017_3_cestaodpovědnéspolečnosti | ... | vzd_nast_a | vzd_str_sm_a | vzd_str_be_a | vzd_zakl_a | nab_rimsko_a | eko_zam_a | eko_post_p_a | eko_nezam_a | nepracduch_a | vaha | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1743 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12414 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1593 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1858 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13650 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 183 columns
2017:
nan_17_head = K[K[cols_votes_17].isna()].head()
nan_17_head
id_n | obec | nazev_obce | okres_nazev | kraj_nazev | obec_okrsek_final2016 | obec_okrsek | par_2017_1_občanskádemokratickástrana | par_2017_2_řádnárodavlasteneckáunie | par_2017_3_cestaodpovědnéspolečnosti | ... | vzd_nast_a | vzd_str_sm_a | vzd_str_be_a | vzd_zakl_a | nab_rimsko_a | eko_zam_a | eko_post_p_a | eko_nezam_a | nepracduch_a | vaha | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1743 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12414 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1593 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1858 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13650 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 183 columns
Is there even any non-NaN value in these rows?
print("Row idx and the sum of the number of non-NaN values in:")
print(f"2013's head:\n{nan_13_head.notna().sum(axis=1)}")
print(f"2017's head:\n{nan_17_head.notna().sum(axis=1)}")
Row idx and the sum of the number of non-NaN values in: 2013's head: 1743 0 12414 0 1593 0 1858 0 13650 0 dtype: int64 2017's head: 1743 0 12414 0 1593 0 1858 0 13650 0 dtype: int64
No, so they should be fine to remove - probably an error in the data. This also means there is no way to find the missing values' geographical locations.
Let's do simple sanity checks.
party_cols = cols_votes_13 + cols_votes_17
def check_value(x):
    # Flag values that are negative or not whole numbers (NaNs and zeros are fine)
    if isinstance(x, (int, float)) and pd.notna(x):
        return x < 0 or x != int(x)
    return False
if K[party_cols].applymap(check_value).any().any():
    print("There is at least one element in party columns that is negative or not a whole number.")
else:
    print("All non-NaN elements in party columns are non-negative whole numbers.")
All non-NaN elements in party columns are non-negative whole numbers.
It is also important to check that values describing the number of people in certain population groups sum up, for example, to the total number of people living in the regional unit.
It should be noted that it is at least very strange to have the total number of people (and other similar columns) without the year taken into account.
Let's check 'muzi', 'zeny' and then all educational levels.
# Check if 'muzi' + 'zeny' == 'oby_celkem'
sub_K = K[K[['muzi', 'zeny', 'oby_celkem']].notna().any(axis=1)][['muzi', 'zeny', 'oby_celkem']].head()
sub_K['muzi_zeny'] = sub_K['muzi'] + sub_K['zeny']
print(f"The number of rows where 'muzi' + 'zeny' does not equal 'oby_celkem' is {len(sub_K[sub_K['oby_celkem'] != sub_K['muzi_zeny']])}")
# Check if all levels of education sum up to 'oby_celkem'
levels = ['vzd_vysok', 'vzd_vos', 'vzd_nast', 'vzd_str_sm', 'vzd_str_be', 'vzd_zakl']
sub_K = K[K[levels].notna().any(axis=1)][levels + ['oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
sub_K['levels_sum'] = sub_K[levels].sum(axis=1)
differ_rows = sub_K[sub_K['oby_celkem'] != sub_K['levels_sum']]
print(f"The number of rows where summing across the levels of education does not equal 'oby_celkem' is {len(differ_rows)} \
which is {100 * len(differ_rows) / sub_K.shape[0]} percent of the checked (head) rows.")
sub_K[['levels_sum', 'oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
The number of rows where 'muzi' + 'zeny' does not equal 'oby_celkem' is 0 The number of rows where summing across the levels of education does not equal 'oby_celkem' is 5 which is 100.0 percent of the checked (head) rows.
levels_sum | oby_celkem | průměrzvolici | par_2017_vol_seznam | |
---|---|---|---|---|
1743 | 1047.0 | 1285.0 | 958.0 | 939.0 |
12414 | 629.0 | 833.0 | 606.0 | 599.0 |
1593 | 660.0 | 839.0 | 695.0 | 754.0 |
1858 | 599.0 | 716.0 | 729.0 | 712.0 |
13650 | 1522.0 | 1837.0 | 1552.0 | 1578.0 |
It seems like there is a small discrepancy here. This could potentially be attributed to how the number of people with a given educational level is stored - a person's education might only be counted once they reach their 18th year - so, hypothetically speaking, if we were to subtract people aged 0 to 17, the difference could disappear.
On the other hand, this hypothesis seems to be wrong, since the columns with eligible voters ('průměrzvolici', 'par_2017_vol_seznam') should be telling us exactly that number, and they do not match either. We could think it is something in between those values, but that seems wrong as well. So we have to conclude that there are inconsistencies present, and we don't really know why, since the method of data collection is not known to us.
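One way to probe the age hypothesis a bit further - using 'sum_v0_14', the closest available age column, as a rough proxy for minors (an assumption on our part, since we would really need ages 0 to 17) - is a quick check like this:
# Hedged sketch: does excluding children narrow the gap between the education-level sum
# and the total population? 'sum_v0_14' only covers ages 0-14, so this is approximate.
check = K[levels + ['oby_celkem', 'sum_v0_14']].dropna().copy()
check['levels_sum'] = check[levels].sum(axis=1)
check['pop_15plus'] = check['oby_celkem'] - check['sum_v0_14']
print((check['pop_15plus'] - check['levels_sum']).describe().round(2))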
work_status = ['eko_zam', 'eko_post_p', 'eko_nezam', 'nepracduch']
sub_K = K[K[work_status].notna().any(axis=1)][work_status + ['oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
sub_K['work_status_sum'] = sub_K[work_status].sum(axis=1)
print(f"The number of rows where summing across the work status does not equal 'oby_celkem' \
is {len(sub_K[sub_K['work_status_sum'] != sub_K['oby_celkem']])}")
sub_K[['work_status_sum', 'oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
The number of rows where summing across the work status does not equal 'oby_celkem' is 5
work_status_sum | oby_celkem | průměrzvolici | par_2017_vol_seznam | |
---|---|---|---|---|
1743 | 1015.0 | 1285.0 | 958.0 | 939.0 |
12414 | 651.0 | 833.0 | 606.0 | 599.0 |
1593 | 658.0 | 839.0 | 695.0 | 754.0 |
1858 | 597.0 | 716.0 | 729.0 | 712.0 |
13650 | 1431.0 | 1837.0 | 1552.0 | 1578.0 |
Everything said above applies here as well, with the exception that you can probably have a work status even if you're not eligible to vote.
Adjustments¶
First, we'll remove all the columns unnecessary for our exploratory analysis - that includes technical columns, duplicates, regional elections and so on.
# delete columns with uninteresting or duplicate values
K.drop(columns=[col for col in K.columns if col.endswith('_p') \
or col.endswith('kraj16')] + ['par_2017_id_okrsky', 'par_2017_typ_form', 'par_2017_oprava', 'par_2017_chyba', 'par_2017_okres'] \
+ ['par_2017_kc_1', 'par_2017_kc_2'] + ['vaha'] + ["ztracenéobálky", "párování", "poc_budov"] + ['obec_okrsek_final2016'] \
+ ['par_2017_okrsek'] + ['par_2017_vyd_obalky', 'par_2017_odevz_obal', 'průměrzvydob', 'průměrzodob'] \
+ ['celkovýsoučet', 'par_2017_vol_seznam', 'par_2017_pl_hl_celk', 'průměrzvolici', 'průměrzhlasycel'], inplace=True)
Our main concern lies with each party having its own separate column. To simplify the exploratory analysis, we can "melt" those columns into 'rok_kandidatka_strana' and 'hlasy' columns.
In the end, what we want to get is 4 distinct columns - 'rok', 'kandidatka', 'strana' and 'hlasy' - which would make it much easier to navigate and work with the dataset.
But first, we'll rename our party columns so that they all share the same naming structure, that is: "YYYY_CandidateListNum_PartyName".
renamed_cols_17 = [col[4:] for col in cols_votes_17]
renamed_cols_13 = [f'20{col[3:]}' for col in cols_votes_13]
rename_map = {}
for old_name, new_name in zip(cols_votes_13 + cols_votes_17, renamed_cols_13 + renamed_cols_17):
rename_map[old_name] = new_name
K = K.rename(columns=rename_map)
K.head()
id_n | obec | nazev_obce | okres_nazev | kraj_nazev | obec_okrsek | 2017_1_občanskádemokratickástrana | 2017_2_řádnárodavlasteneckáunie | 2017_3_cestaodpovědnéspolečnosti | 2017_4_českástr.sociálnědemokrat | ... | vzd_vos_a | vzd_nast_a | vzd_str_sm_a | vzd_str_be_a | vzd_zakl_a | nab_rimsko_a | eko_zam_a | eko_post_p_a | eko_nezam_a | nepracduch_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1743 | 1744.0 | 539911.0 | Příbram | Příbram | Středočeský kraj | 539911_9 | 60.0 | 1.0 | 0.0 | 39.0 | ... | 1.089494 | 2.879377 | 23.891051 | 26.459144 | 15.642023 | 8.638132 | 44.669261 | 5.758755 | 4.824903 | 23.735409 |
12414 | 12457.0 | 597520.0 | Krnov | Bruntál | Moravskoslezský kraj | 597520_4 | 19.0 | 1.0 | 0.0 | 26.0 | ... | 1.080432 | 1.560624 | 21.008403 | 30.852341 | 15.726291 | 6.962785 | 40.216086 | 4.321729 | 6.842737 | 26.770708 |
1593 | 1594.0 | 531821.0 | Tachlovice | Praha-západ | Středočeský kraj | 531821_1 | 66.0 | 2.0 | 0.0 | 13.0 | ... | 2.383790 | 2.264601 | 20.619785 | 27.056019 | 10.607867 | 11.442193 | 51.489869 | 10.369487 | 3.337306 | 13.230036 |
1858 | 1859.0 | 541231.0 | Rožmitál pod Třemšínem | Příbram | Středočeský kraj | 541231_1 | 44.0 | 1.0 | 0.0 | 30.0 | ... | 0.977654 | 1.955307 | 23.882682 | 31.284916 | 19.413408 | 9.636872 | 41.899441 | 7.681564 | 6.145251 | 27.653631 |
13650 | 13694.0 | 598836.0 | Václavovice | Ostrava-město | Moravskoslezský kraj | 598836_1 | 90.0 | 3.0 | 0.0 | 71.0 | ... | 1.197605 | 2.286336 | 25.585193 | 26.238432 | 12.411541 | 15.133370 | 45.944475 | 5.498095 | 3.483941 | 22.972237 |
5 rows × 94 columns
Now we can melt the party columns into 'hlasy' (which would contain values of every party column) and 'rok_kandidatka_strana'.
fixed_vars = list(set(K.columns) - set(renamed_cols_13 + renamed_cols_17)) # get fixed columns
K = pd.melt(K, id_vars=fixed_vars, var_name='rok_kandidatka_strana', value_name='hlasy') # melt
K[K.columns[-5:]].head()
zeny | vzd_zakl | okres_nazev | rok_kandidatka_strana | hlasy | |
---|---|---|---|---|---|
0 | 702.0 | 201.0 | Příbram | 2017_1_občanskádemokratickástrana | 60.0 |
1 | 385.0 | 131.0 | Bruntál | 2017_1_občanskádemokratickástrana | 19.0 |
2 | 426.0 | 89.0 | Praha-západ | 2017_1_občanskádemokratickástrana | 66.0 |
3 | 357.0 | 139.0 | Příbram | 2017_1_občanskádemokratickástrana | 44.0 |
4 | 926.0 | 228.0 | Ostrava-město | 2017_1_občanskádemokratickástrana | 90.0 |
The 'rok_kandidatka_strana' column should contain political party names only. To achieve that, we need to split it into three distinct columns - 'rok', 'kandidatka' and 'strana' - which will effectively let us group by political party, as we wanted from the beginning.
def split_col(col_name, num_of_cols, df):
df[col_name.split('_', maxsplit=num_of_cols-1)] = df[col_name].str.split('_', n=num_of_cols-1, expand=True)
return df.drop(col_name, axis=1)
K = split_col('rok_kandidatka_strana', 3, K) # split columns with votes
K[K.columns[-5:]].head()
okres_nazev | hlasy | rok | kandidatka | strana | |
---|---|---|---|---|---|
0 | Příbram | 60.0 | 2017 | 1 | občanskádemokratickástrana |
1 | Bruntál | 19.0 | 2017 | 1 | občanskádemokratickástrana |
2 | Praha-západ | 66.0 | 2017 | 1 | občanskádemokratickástrana |
3 | Příbram | 44.0 | 2017 | 1 | občanskádemokratickástrana |
4 | Ostrava-město | 90.0 | 2017 | 1 | občanskádemokratickástrana |
According to the README.txt file in the dataset's folder, some political parties, even though they are the same entity, have slightly different names across the years. Let's rename those subjects and abbreviate them to their standard abbreviations.
naming = {
'ČSSD': [1, 4],
'SSO': [2, 12],
'PIR': [3, 15],
'TOP': [4, 20],
'ODS': [6, 1],
'KDU': [11, 24],
'PRBL': [12, 5],
'SPO': [15, 30],
'SPD': [17, 29],
'DSSS': [18, 28],
'ANO': [20, 21],
'KSČM': [21, 8],
'SZ': [23, 9],
}
for party, values in naming.items():
K.loc[(K['rok'] == '2013') & (K['kandidatka'] == str(values[0])), 'strana'] = party
K.loc[(K['rok'] == '2017') & (K['kandidatka'] == str(values[1])), 'strana'] = party
K.loc[K['strana'] == 'starostovéanezávislí', 'strana'] = 'STAN'
K[K.columns[-5:]].head()
okres_nazev | hlasy | rok | kandidatka | strana | |
---|---|---|---|---|---|
0 | Příbram | 60.0 | 2017 | 1 | ODS |
1 | Bruntál | 19.0 | 2017 | 1 | ODS |
2 | Praha-západ | 66.0 | 2017 | 1 | ODS |
3 | Příbram | 44.0 | 2017 | 1 | ODS |
4 | Ostrava-město | 90.0 | 2017 | 1 | ODS |
Having dropped the columns that duplicate 'obec_okrsek', we'll split it into two distinct columns, 'obec' and 'okrsek', since it holds two values in one.
K = split_col('obec_okrsek', 2, K) # split obec_okrsek in two distinct columns
This is what we're working with now:
K.head()
vzd_nast | nar_romska | vzd_str_be_a | obec | v65avic | vzd_nast_a | nepracduch_a | sum_v0_14_a | eko_zam | v65avic_a | ... | sum_v0_14 | eko_zam_a | zeny | vzd_zakl | okres_nazev | hlasy | rok | kandidatka | strana | okrsek | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 37.0 | 0.0 | 26.459144 | 539911 | 233.0 | 2.879377 | 23.735409 | 14.163424 | 574.0 | 18.132296 | ... | 182.0 | 44.669261 | 702.0 | 201.0 | Příbram | 60.0 | 2017 | 1 | ODS | 9 |
1 | 13.0 | 0.0 | 30.852341 | 597520 | 111.0 | 1.560624 | 26.770708 | 15.846339 | 335.0 | 13.325330 | ... | 132.0 | 40.216086 | 385.0 | 131.0 | Bruntál | 19.0 | 2017 | 1 | ODS | 4 |
2 | 19.0 | 0.0 | 27.056019 | 531821 | 79.0 | 2.264601 | 13.230036 | 18.235995 | 432.0 | 9.415971 | ... | 153.0 | 51.489869 | 426.0 | 89.0 | Praha-západ | 66.0 | 2017 | 1 | ODS | 1 |
3 | 14.0 | 0.0 | 31.284916 | 541231 | 140.0 | 1.955307 | 27.653631 | 14.385475 | 300.0 | 19.553073 | ... | 103.0 | 41.899441 | 357.0 | 139.0 | Příbram | 44.0 | 2017 | 1 | ODS | 1 |
4 | 42.0 | 1.0 | 26.238432 | 598836 | 279.0 | 2.286336 | 22.972237 | 15.459989 | 844.0 | 15.187806 | ... | 284.0 | 45.944475 | 926.0 | 228.0 | Ostrava-město | 90.0 | 2017 | 1 | ODS | 1 |
5 rows × 44 columns
K.columns
Index(['vzd_nast', 'nar_romska', 'vzd_str_be_a', 'obec', 'v65avic', 'vzd_nast_a', 'nepracduch_a', 'sum_v0_14_a', 'eko_zam', 'v65avic_a', 'vzd_str_sm_a', 'eko_nezam', 'vzd_vysok_a', 'muzi_a', 'vzd_vos', 'vzd_vos_a', 'muzi', 'vzd_zakl_a', 'rodst_rozv', 'vzd_str_be', 'zeny_a', 'nepracduch', 'kraj_nazev', 'nazev_obce', 'nar_romska_a', 'id_n', 'nab_rimsko_a', 'oby_celkem', 'rodst_rozv_a', 'vzd_str_sm', 'eko_nezam_a', 'eko_post_p_a', 'nab_rimsko', 'vzd_vysok', 'sum_v0_14', 'eko_zam_a', 'zeny', 'vzd_zakl', 'okres_nazev', 'hlasy', 'rok', 'kandidatka', 'strana', 'okrsek'], dtype='object')
Let's move to another section of the report.
Exploratory analysis¶
In this section, we'll look at how the values are distributed and explore them in greater depth. For starters, a simple overview can do the job.
K.describe().apply(lambda x: x.round(2))
vzd_nast | nar_romska | vzd_str_be_a | v65avic | vzd_nast_a | nepracduch_a | sum_v0_14_a | eko_zam | v65avic_a | vzd_str_sm_a | ... | vzd_str_sm | eko_nezam_a | eko_post_p_a | nab_rimsko | vzd_vysok | sum_v0_14 | eko_zam_a | zeny | vzd_zakl | hlasy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | ... | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 162000.00 | 161630.00 |
mean | 16.61 | 0.34 | 30.68 | 111.59 | 2.18 | 23.74 | 14.23 | 307.75 | 16.60 | 22.01 | ... | 163.56 | 4.88 | 5.88 | 72.98 | 75.33 | 99.59 | 42.96 | 358.07 | 105.52 | 12.55 |
std | 14.02 | 1.82 | 6.99 | 85.10 | 1.04 | 7.05 | 3.33 | 225.55 | 6.21 | 4.87 | ... | 125.48 | 2.32 | 2.24 | 84.40 | 84.94 | 74.30 | 5.72 | 252.28 | 75.77 | 28.39 |
min | 0.00 | 0.00 | 4.14 | 3.00 | 0.00 | 1.90 | 0.00 | 0.00 | 1.20 | 2.56 | ... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 | 2.00 | 0.00 |
25% | 4.00 | 0.00 | 26.51 | 36.00 | 1.52 | 19.70 | 12.28 | 93.00 | 13.06 | 19.09 | ... | 45.00 | 3.40 | 4.35 | 20.00 | 12.00 | 34.00 | 39.92 | 113.00 | 41.00 | 0.00 |
50% | 14.00 | 0.00 | 31.35 | 94.00 | 2.15 | 23.27 | 14.16 | 293.00 | 15.79 | 22.40 | ... | 151.00 | 4.49 | 5.63 | 49.00 | 45.00 | 91.00 | 43.33 | 354.50 | 95.00 | 1.00 |
75% | 27.00 | 0.00 | 35.39 | 170.00 | 2.82 | 27.17 | 16.22 | 468.25 | 19.16 | 25.33 | ... | 260.00 | 5.91 | 7.15 | 95.00 | 112.00 | 147.00 | 46.53 | 556.00 | 150.00 | 10.00 |
max | 93.00 | 59.00 | 63.64 | 503.00 | 9.43 | 100.00 | 28.32 | 1734.00 | 93.40 | 41.07 | ... | 860.00 | 36.18 | 26.32 | 939.00 | 687.00 | 671.00 | 62.03 | 1812.00 | 513.00 | 559.00 |
8 rows × 36 columns
Below is a correlation heatmap to check column dependencies.
%matplotlib inline
import seaborn.objects as so
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
mod_K = K.select_dtypes(exclude=['object'])
mod_K['rok'], mod_K['kandidatka'] = K['rok'].astype('int'), K['kandidatka'].astype('int')
plt.figure(figsize=(12, 8))
sns.heatmap(mod_K.corr(), annot=False, cmap="YlGnBu")
plt.show()
We've just uncovered mostly unsurprising correlations that make sense if we take a closer look.
For example, if there is a higher share of people aged over 65 ('v65avic_a'), then there is a higher share of unemployed retired people ('nepracduch_a').
Another example: the higher the population share of men, the lower the share of women.
Then there are trivially explainable ones, like the more women (as a count), the more employed unretired people ('eko_zam'). But that's just because columns with counts, rather than population shares, say next to nothing: put bluntly, the more people there are, the more people there are, no matter what kind.
The share of people with elementary education ('vzd_zakl_a') is negatively correlated with employment in productive age ('eko_zam_a') and with the share of people with higher education ('vzd_vysok_a').
There is also a less visible negative correlation between the share of divorced people ('rodst_rozv_a') and the share of Catholics ('nab_rimsko_a').
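To back these readings up, here is a small hedged sketch that pulls the specific pairwise correlations mentioned above out of the correlation matrix (the exact values will depend on the sample):
# Hedged sketch: print the individual correlations discussed above.
corr_matrix = mod_K.corr()
pairs = [
    ('v65avic_a', 'nepracduch_a'),
    ('muzi_a', 'zeny_a'),
    ('zeny', 'eko_zam'),
    ('vzd_zakl_a', 'eko_zam_a'),
    ('vzd_zakl_a', 'vzd_vysok_a'),
    ('rodst_rozv_a', 'nab_rimsko_a'),
]
for a, b in pairs:
    print(f"corr({a}, {b}) = {corr_matrix.loc[a, b]:.2f}")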
Votes¶
Let's focus on votes. First, we should compensate for the deleted "_p" columns by creating a 'hlasy_p' column.
total_13, total_17 = K.loc[K['rok'] == '2013', 'hlasy'].sum(), K.loc[K['rok'] == '2017', 'hlasy'].sum()
K.loc[K['rok'] == '2013', 'hlasy_p'] = 100 * K['hlasy'] / total_13
K.loc[K['rok'] == '2017', 'hlasy_p'] = 100 * K['hlasy'] / total_17
Below, we can see a hint of the election results (from the sample's point of view) as percentages of votes.
cutoff = 2
party_votes = K.groupby(['rok', 'strana'])['hlasy_p'].sum().reset_index().sort_values(by='hlasy_p', ascending=False)
party_votes.loc[party_votes['hlasy_p'] < cutoff, 'strana'] = 'jiné'
party_votes = party_votes.groupby(['rok', 'strana'])['hlasy_p'].sum().reset_index().sort_values(by='hlasy_p', ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(data=party_votes, x="rok", y="hlasy_p", hue="strana", estimator=sum, order=['2013', '2017'])
plt.xlabel('Year')
plt.ylabel('Percentage of Votes (%)')
plt.title('Percentage of Votes by Political Party in Parliamentary elections')
plt.show()
Let's see a sketch of how many votes were from which municipality with the help of a word cloud.
from wordcloud import WordCloud
votes_per_obec = K[['hlasy', 'nazev_obce']].groupby('nazev_obce')['hlasy'].sum()
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate_from_frequencies(votes_per_obec.to_dict())
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
What must not be overlooked are the correlations between the percentage of votes for certain parties and other columns (those describing shares of people in a population unit).
column_names_map = {
'nab_rimsko_a': 'Catholics',
'nar_romska_a': 'Romani',
'vzd_vysok_a': 'Higher Ed',
'vzd_vos_a': 'Higher Vocational Ed',
'vzd_nast_a': 'Post-secondary Supplementary Ed',
'vzd_str_sm_a': 'Secondary Ed',
'vzd_str_be_a': 'Secondary Ed without Maturita',
'vzd_zakl_a': 'Elementary Ed or None',
'oby_celkem': 'Total Population',
'rodst_rozv_a': 'Divorced',
'eko_zam_a': 'Employed unretired',
'eko_post_p_a': 'Employed retired',
'eko_nezam_a': 'Unemployed unretired',
'nepracduch_a': 'Unemployed retired',
}
K_renamed = K.rename(columns=column_names_map)
grouped_by_strana = K_renamed.groupby('strana')
correlations = {}
for party, group in grouped_by_strana:
correlation = group[['Catholics', 'Romani', 'hlasy_p', 'Higher Ed', 'Higher Vocational Ed', 'Post-secondary Supplementary Ed',
'Secondary Ed', 'Secondary Ed without Maturita', 'Elementary Ed or None',
'Total Population']].corr()['hlasy_p'].drop('hlasy_p')
correlations[party] = correlation
corr_df = pd.DataFrame(correlations)
plt.figure(figsize=(14, 10))
sns.heatmap(corr_df.T, annot=False, cmap="Blues")
plt.xticks(rotation=90, fontsize='small')
plt.show()
This shows us very interesting correlations that should be explained later.
For example, it seems that the bigger the city is - population-wise - the better the chances are for the top parties.
Or we can see, without much surprise, that the KDU party, which is in fact the 'Christian and Democratic Union', has a higher chance of receiving votes from municipalities with a higher percentage of Catholics living there.
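A hedged sketch to read those two claims directly off the correlation table (assuming 'KDU' is among the party labels in corr_df, which it should be after the renaming above):
# Hedged sketch: the KDU-Catholics correlation and the parties most correlated with total population.
print("KDU vs share of Catholics:", round(corr_df.loc['Catholics', 'KDU'], 2))
print("Correlation of vote share with total population, strongest first:")
print(corr_df.loc['Total Population'].sort_values(ascending=False).head().round(2))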
Status¶
What we should consider exploring is, generally speaking, educational, social and economic status.
The problem, however, is that we don't have enough columns to determine each of these statuses for a population unit. But we can at least try to shape what we have into something that points us in the right direction.
Firstly, let's see the distribution of educational status.
edu_columns = ['Higher Ed', 'Higher Vocational Ed', 'Post-secondary Supplementary Ed',
'Secondary Ed', 'Secondary Ed without Maturita', 'Elementary Ed or None']
long_form = K_renamed.melt(value_vars=edu_columns, var_name='education', value_name='share_of_ppl_w_edu')
sns.catplot(data=long_form, x='education', y='share_of_ppl_w_edu', kind='bar', errorbar='sd', aspect=2)
plt.xticks(rotation=0, fontsize='x-small')
plt.xlabel('Education Level')
plt.ylabel('Share of People with Education Level (%)')
plt.show()
What about social status? That could help us determine the conservative-progressive axis (for example, fewer divorces and more religious people could suggest that the region in question is more conservative - or at least values traditions). A higher share of the Roma population in a region can, sadly enough, correlate with a more socially deprived status of the region itself.
Let's plot something about population's values.
soc_columns = ['Divorced', 'Catholics']
long_form = K_renamed.melt(value_vars=soc_columns, var_name='social', value_name='share_of_ppl_w_social')
plt.figure(figsize=(12, 6))
sns.violinplot(x='social', y='share_of_ppl_w_social', data=long_form)
plt.xlabel('Social Status')
plt.ylabel('Share of People with Social Status (%)')
plt.show()
And we should check whether the share of the Roma population varies widely.
plt.figure(figsize=(10, 5))
sns.kdeplot(K['nar_romska_a'], fill=True)
print(f"The maximum is {round(K['nar_romska_a'].max(), 2)} \
\nThe minimum is {round(K['nar_romska_a'].min(), 2)} \
\nand the mean is {round(K['nar_romska_a'].mean(), 2)} \
\nand the variance is {round(K['nar_romska_a'].var(), 2)}")
plt.xlabel('Share of Roma People (%)')
plt.ylabel('Density')
plt.show()
The maximum is 7.04 The minimum is 0.0 and the mean is 0.04 and the variance is 0.06
And finally, closely tied, economic status.
eco_columns = ['Employed unretired', 'Employed retired', 'Unemployed unretired', 'Unemployed retired']
long_form = K_renamed.melt(value_vars=eco_columns, var_name='work', value_name='share_of_ppl_w_work_status')
sns.catplot(data=long_form, x='work', y='share_of_ppl_w_work_status', kind='bar', errorbar='sd', aspect=2)
plt.title('Share of People with Work Status\n(Employment with respect to Retirement)')
plt.xlabel('Work Status')
plt.ylabel('Share of People with Work Status (%)')
plt.show()
We will use that in further analysis, hoping that the share of unemployed people points to deprived regions.
So far we've explored how different variables (which account for shares of the population) correlate with shares of votes for certain parties. We've also managed to visualize some interesting information about the distributions of educational, social and economic status.
In the next part we would like to dive deeper and scrutinize whether any of the explored variables actually tell us something about voting preferences.
Explanatory analysis¶
What we would like to do is to learn something about geographical units with similar characteristics.
At the very end we should try to learn which parameters are valuable for more or less correct prediction of district results.
We would like to answer some burning questions (although the results will heavily depend on our metric):
Does deprivation of a region push voters toward certain kinds of parties? Do deprived regions vote the same way?
Based on features that we have, can we predict the election result?
Clustering¶
First, we'll need to group similar regions.
Deprivation¶
Let's start by trying to cluster deprived regions. For this we'll use the K-means algorithm with columns that we think should hint at possible deprivation of a region:
Elementary education or no education at all - this hints at the region's lack of resources.
Unemployed unretired people, because jobless people in their active age should give us insight into how the region is doing economically (either a low number of job opportunities or an inability to find a job for whatever reason).
Share of the Romani population. Population groups that lack financial and social capital are indirectly pushed into similar places, thus generating concentrated groups of people with similar socio-economic status. Sadly, a higher share of Romani people in a region tells us that this has most likely occurred.
Total population and geographical location. Geographical surroundings will of course tell us something about the district itself, and information about the region alone would not be sufficient, because, for example, the regional capital of a poor region can be doing quite well economically while its neighboring villages might not. NOTE: using this turned out to be the wrong way to go, since the clustering just split the data according to it. For this to be utilized effectively, we would need more data.
It might be worth looking into voter turnout as well, because people without the time to be interested in politics might be an indicator of something. This would be calculated as total votes divided by the total population of the district (or, more effectively, total votes in a given municipality divided by the total population of the municipality).
This "metric" will most likely not work very well - there are a lot of unknowns about the region and we would need more variables - but it should give us an idea. It also might be worth exploring the municipality instead of the district.
As a regional unit we'll try to group municipalities (obec), not districts (okrsek). We'll be using these columns:
K.dropna(inplace=True)
data = K[['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'rok', 'obec']].copy()
data.drop_duplicates(inplace=True)
# get municipality population
unique_muni_year = K[['rok', 'obec', 'oby_celkem']].drop_duplicates()
muni_population_df = unique_muni_year.groupby(['rok', 'obec'])['oby_celkem'].sum().reset_index()
muni_population_df.rename(columns={'oby_celkem': 'muni_population'}, inplace=True)
data = data.merge(muni_population_df, on=['rok', 'obec'], how='left')
# get municipality turnout
unique_muni_year_strana = K[['rok', 'obec', 'strana', 'hlasy']].drop_duplicates()
muni_turnout_df = unique_muni_year_strana.groupby(['rok', 'obec'])['hlasy'].sum().reset_index()
muni_turnout_df.rename(columns={'hlasy': 'muni_votes_sum'}, inplace=True)
data = data.merge(muni_turnout_df, on=['rok', 'obec'], how='left')
data.loc[:, 'muni_turnout'] = data['muni_votes_sum'] / data['muni_population']
data = data[data['muni_turnout'] <= 1]
data = data.groupby('obec').mean().reset_index()
data.drop(columns=['rok', 'muni_votes_sum', 'muni_population'], inplace=True)
data.drop_duplicates(inplace=True)
cluster_data = data.drop(columns=['obec'])
cluster_data.head()
vzd_zakl_a | nar_romska_a | eko_nezam_a | muni_turnout | |
---|---|---|---|---|
0 | 6.741604 | 0.017507 | 2.835240 | 0.550225 |
1 | 8.371457 | 0.000000 | 2.968587 | 0.385568 |
2 | 6.422949 | 0.013548 | 2.682892 | 0.452729 |
3 | 36.529680 | 0.000000 | 14.155251 | 0.415525 |
4 | 8.229658 | 0.097575 | 3.218088 | 0.434454 |
This is what we're dealing with ['Population share of people with elementary education', 'Share of Romani people', 'Share of unemployed people', 'Municipality turnout']. Let's rescale those values before passing them to the actual algorithm.
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
data_transformer = ColumnTransformer(
transformers=[
("ScaleRobust", RobustScaler(), ["vzd_zakl_a", "nar_romska_a", "eko_nezam_a", "muni_turnout"]),
]
)
features = data_transformer.fit_transform(cluster_data)
pd.DataFrame(features, columns=['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'muni_turnout']).head()
vzd_zakl_a | nar_romska_a | eko_nezam_a | muni_turnout | |
---|---|---|---|---|
0 | -1.725250 | 0.017507 | -0.679902 | 0.237246 |
1 | -1.457444 | 0.000000 | -0.627379 | -1.276041 |
2 | -1.777609 | 0.013548 | -0.739909 | -0.658794 |
3 | 3.169320 | 0.000000 | 3.778829 | -1.000721 |
4 | -1.480744 | 0.097575 | -0.529106 | -0.826753 |
Each column has been robustly scaled - centred on its median and divided by its interquartile range - rather than standardized to zero mean. Let's run the K-means algorithm for 2 to 20 clusters and choose the number of clusters with the highest silhouette score.
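Before that, a quick hedged check of the scaling (the medians of the transformed columns should now be close to zero):
# Hedged sketch: inspect the medians of the robust-scaled features.
scaled_check = pd.DataFrame(features, columns=['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'muni_turnout'])
print(scaled_check.median().round(3))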
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
min_cluster_num = 2
silhouette_scores = []
for i in range(min_cluster_num, 21):
kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
kmeans.fit(features)
score = silhouette_score(features, kmeans.labels_)
silhouette_scores.append(score)
# Finding the number of clusters with the highest Silhouette Score
optimal_clusters_num = np.argmax(silhouette_scores) + min_cluster_num
kmeans = KMeans(n_clusters=optimal_clusters_num, random_state=0, n_init='auto')
kmeans.fit(features)
cluster_data['cluster'] = kmeans.labels_
data['cluster'] = kmeans.labels_
print(f'Optimal cluster number: {optimal_clusters_num}, Silhouette score: {silhouette_score(features, kmeans.labels_)}')
cluster_data.head()
Optimal cluster number: 2, Silhouette score: 0.31666351891199335
vzd_zakl_a | nar_romska_a | eko_nezam_a | muni_turnout | cluster | |
---|---|---|---|---|---|
0 | 6.741604 | 0.017507 | 2.835240 | 0.550225 | 0 |
1 | 8.371457 | 0.000000 | 2.968587 | 0.385568 | 0 |
2 | 6.422949 | 0.013548 | 2.682892 | 0.452729 | 0 |
3 | 36.529680 | 0.000000 | 14.155251 | 0.415525 | 1 |
4 | 8.229658 | 0.097575 | 3.218088 | 0.434454 | 0 |
Well, we wanted more than two clusters, but we can see that the "optimal" solution (picking the number with the best silhouette score) consists of just two. And the silhouette score for these clusters is moderate, but not necessarily bad.
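To see how decisively two clusters win, we can plot the silhouette scores collected in the loop above (a small illustrative sketch):
# Hedged sketch: silhouette score for each tried number of clusters.
plt.figure(figsize=(8, 4))
plt.plot(range(min_cluster_num, 21), silhouette_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()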
Let's try a different perspective - creating a column derived from PCA. We'll use the same variables. Then we can compare the two approaches.
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
principal_component = pca.fit_transform(features)
data['PCA_label'] = principal_component[:, 0]
plt.figure(figsize=(10, 3))
sns.scatterplot(data=data, x='PCA_label', y='cluster', hue='cluster', palette='deep')
plt.title('PCA based on clusters')
plt.ylim(-0.5, 1.5)
plt.ylabel('Cluster')
plt.yticks([0, 1])
plt.show()
It seems like the PCA got the gist of it the same way the clustering did. Past a certain value of PCA_label, the cluster does not change - and neither does it in the opposite direction (ignoring some wiggle room in the middle, around the value 0 of our PCA_label).
Let's check whether it has the correlations we would expect from our metric.
# plot correlations between features and PCA
plt.figure(figsize=(5, 5))
ax = sns.heatmap(pd.DataFrame(pca.components_, columns=['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'muni_turnout']).T, annot=True, cmap="Blues")
plt.title('Correlation between features and PCA')
ax.set_xticks([])
plt.xlabel('PCA')
plt.show()
This looks great, as it captures the nature of our presumed metric. Without constructing a score or metric that would be subjectively weighted by hand, we arrived at something that resembles what we wanted (correlation-wise). It even gave the most weight to unemployment, which is undoubtedly one of the biggest indicators of a region's prosperity.
So we say: the higher the PCA_label, the theoretically higher the deprivation of a municipality. This is a bit of a stretch, but let's work with it. It is nothing but an index that tells us something common about a region - in this case most likely unemployment (which at least hints at something). We'll rename our PCA_label to depriv_index - still an abstract term, but more readable.
Why did we do all that? Let's examine the correlation between votes for a party and our new abstractly-called deprivation index.
# rename PCA_label to depriv_index
data.rename(columns={'PCA_label': 'depriv_index'}, inplace=True)
# merge data with the original dataset K
K = K.merge(data[['obec', 'muni_turnout', 'cluster', 'depriv_index']], on='obec', how='left')
# drop rows with nan values
K.dropna(inplace=True)
# correlations between votes for a party and depriv_index
grouped_by_strana = K.groupby('strana')
#
correlations = {}
for party, group in grouped_by_strana:
correlation = group[['depriv_index', 'hlasy_p']].corr()['hlasy_p'].drop('hlasy_p')
correlations[party] = correlation
corr_df = pd.DataFrame(correlations)
max_corr = corr_df.abs().max().max()
vmin, vmax = -max_corr, max_corr
plt.figure(figsize=(5, 10))
sns.heatmap(corr_df.T, annot=False, cmap="PiYG", vmin=vmin, vmax=vmax)
plt.xlabel('Deprivation Index')
plt.xticks([])
plt.show()
We expected to find stronger correlations with, for example, 'DSSS' or 'česká národní fronta'. What we got instead are negative correlations with the most popular economically liberal right-wing parties. None of the correlations are strong, but let's still have a look at the most positively and the most negatively correlated pairs.
# make a figure with 4 subplots showing correlations between votes for a party and depriv_index
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
sorted_corr_df = corr_df.loc['depriv_index'].sort_values(ascending=False)
# get two highest correlations and two lowest correlations
top_corr = sorted_corr_df.index[:2]
bot_corr = sorted_corr_df.index[-2:]
corr_list = list(top_corr) + list(bot_corr)
# get parties with correlations in corr_list
corr_df_sel = corr_df[corr_list]
# plot correlations with regression line and confidence interval for each party in corr_df_sel
for i, party in enumerate(corr_df_sel.columns):
sns.regplot(ax=axes[i // 2, i % 2],
data=K[K['strana'] == party],
x='depriv_index',
y='hlasy_p',
line_kws={'color': 'red'},
scatter_kws={'s': 3, 'alpha': 0.1})
axes[i // 2, i % 2].set_title(party)
if i % 2 == 0:
axes[i // 2, i % 2].set_ylabel('Percentage of Votes (%)')
else:
axes[i // 2, i % 2].set_ylabel('')
axes[i // 2, i % 2].set_xlabel('Deprivation Index')
plt.show()
And now, just for the sake of learning statistical testing, let's formulate a null hypothesis $H_{0}$ saying that there is no linear relationship between the deprivation index and the votes for a given political party. We should test this for these four parties - we have 4 hypotheses $H_0$, each with a different null distribution, but we'll still perform a significance-level correction (Bonferroni), which gives us $\alpha = 0.05 / 4 = 0.0125$. We'll use the t-statistic.
import scipy.stats as stats
for i, party in enumerate(corr_df_sel.columns):
x = K[K['strana'] == party]['depriv_index']
y = K[K['strana'] == party]['hlasy_p']
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
if p_value < 0.05 / 4:
print(f'******** Testing for party: {party} ********')
print(f'R-value: {np.round(r_value, 3)}')
print(f'P-value: {"{:.3e}".format(p_value)}')
print(f'Standard error: {"{:.3e}".format(std_err)}\n')
******** Testing for party: DSSS ******** R-value: 0.111 P-value: 8.945e-18 Standard error: 3.712e-06 ******** Testing for party: KSČM ******** R-value: 0.074 P-value: 1.057e-08 Standard error: 4.141e-05 ******** Testing for party: ODS ******** R-value: -0.284 P-value: 5.084e-111 Standard error: 4.134e-05 ******** Testing for party: TOP ******** R-value: -0.305 P-value: 1.877e-128 Standard error: 4.751e-05
To avoid unintended data dredging we should have tested this on a different sample, but as we can see the p-values are so low that we can safely conclude we haven't really p-hacked in any way. Our results are statistically significant - mostly because of the large sample size - therefore we can reject our $H_0$, but the question remains whether they are practically significant. A quick look at the data above shows that it does not follow a linear trend very clearly and the relationship is not strong at all. We can believe there is some sort of weak trend for the TOP 09 party.
To conclude this part: although we have found results that are statistically significant (just to make an example), there is no reason to work further with the deprivation index, since its linear relationships with the percentage of votes for a party are very weak and it is not a sufficient predictor on its own. As a model of socio-economic deprivation, this index hardly captures its complexity, if at all. This section was about trying out clustering, dimensionality reduction and hypothesis testing.
Predicting¶
GBDT¶
What we can try to do is to build a black box that predicts the winning party for a given district. We should then compare it to the average (or some simple model) to test whether a more complex model even makes sense in this case.
We'll use Gradient Boosted Decision Trees, since they perform very well on tabular (low-dimensional) data. In the end, we would like to be able to say which parties compete for votes and which have very distinct voter bases (in terms of districts).
But first we'll need to prepare our data. Our targets will be the winning parties of the considered districts - that's what we need to transform. With this approach we'll lose a lot of information, since we will be ignoring the parties that come a close second, etc. But let's continue with predicting the winning party only, for the sake of simplicity. And we'll of course select the columns from which we're going to predict.
We're going to select columns that describe shares rather than absolute amounts, and we're going to get rid of columns that correlate strongly with each other.
This is what the predictions will be based on:
economic = ['eko_post_p_a', 'eko_zam_a', 'eko_nezam_a']
educational = ['vzd_zakl_a', 'vzd_vysok_a', 'vzd_nast_a', 'vzd_vos_a', 'vzd_str_be_a', 'vzd_str_sm_a']
location = ['okrsek', 'obec']
social = ['sum_v0_14_a', 'v65avic_a', 'nab_rimsko_a', 'rodst_rozv_a', 'nar_romska_a']
sex = ['zeny_a']
political = ['strana', 'hlasy', 'rok', 'muni_turnout']
# 'rodst_rozv', 'id_n', 'v65avic', 'muzi', 'zeny', 'nar_romska', 'nab_rimsko', 'sum_v0_14',
# 'vzd_zakl', 'vzd_str_sm', 'vzd_vysok', 'vzd_str_be', 'vzd_nast', 'vzd_vos',
# 'eko_nezam', 'eko_zam', 'nepracduch'
# 'hlasy_p'
# 'muzi_a', since heavily correlated with 'zeny_a'
# 'okres_nazev', 'nazev_obce', since i dont want the model to fit to the names of the districts
# 'kandidatka', since i dont want this data to be used for prediction
# 'oby_celkem' because we use 'muni_population' instead
# 'kraj_nazev'
# 'nepracduch_a' because of correlations
data = K[economic + educational + social + sex + political + location].copy()
data = data.merge(muni_population_df, on=['rok', 'obec'], how='left')
data.dropna(inplace=True)
example_df = data.drop(['strana', 'okrsek', 'rok', 'obec', 'hlasy'], axis=1)
example_df.head()
eko_post_p_a | eko_zam_a | eko_nezam_a | vzd_zakl_a | vzd_vysok_a | vzd_nast_a | vzd_vos_a | vzd_str_be_a | vzd_str_sm_a | sum_v0_14_a | v65avic_a | nab_rimsko_a | rodst_rozv_a | nar_romska_a | zeny_a | muni_turnout | muni_population | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.758755 | 44.669261 | 4.824903 | 15.642023 | 11.517510 | 2.879377 | 1.089494 | 26.459144 | 23.891051 | 14.163424 | 18.132296 | 8.638132 | 14.630350 | 0.000000 | 54.630350 | 0.507470 | 8166.0 |
1 | 4.321729 | 40.216086 | 6.842737 | 15.726291 | 5.282113 | 1.560624 | 1.080432 | 30.852341 | 21.008403 | 15.846339 | 13.325330 | 6.962785 | 10.564226 | 0.000000 | 46.218487 | 0.383460 | 5127.0 |
2 | 10.369487 | 51.489869 | 3.337306 | 10.607867 | 15.733015 | 2.264601 | 2.383790 | 27.056019 | 20.619785 | 18.235995 | 9.415971 | 11.442193 | 7.151371 | 0.000000 | 50.774732 | 0.550060 | 839.0 |
3 | 7.681564 | 41.899441 | 6.145251 | 19.413408 | 6.145251 | 1.955307 | 0.977654 | 31.284916 | 23.882682 | 14.385475 | 19.553073 | 9.636872 | 9.636872 | 0.000000 | 49.860335 | 0.530726 | 716.0 |
4 | 5.498095 | 45.944475 | 3.483941 | 12.411541 | 15.133370 | 2.286336 | 1.197605 | 26.238432 | 25.585193 | 15.459989 | 15.187806 | 15.133370 | 6.641263 | 0.054437 | 50.408274 | 0.596625 | 1837.0 |
def get_first_k_parties_places(k, df):
# Create a copy of the DataFrame to avoid modifying the original one
df_copy = df.copy()
parties_list = []
for i in range(k):
# Group by 'rok' and 'okrsek' and get the kth highest 'hlasy' indices
indices = df_copy.groupby(['rok', 'okrsek'])['hlasy'].nlargest(i+1).groupby(['rok', 'okrsek']).nth(i).index.get_level_values(2)
# Extract the parties that came in kth place using the indices
parties = df_copy.loc[indices, ['rok', 'okrsek', 'strana']].reset_index(drop=True)
# Rename the 'strana' column to reflect the kth place
parties.rename(columns={'strana': f'place_{i+1}_party'}, inplace=True)
parties_list.append(parties)
for parties_df in parties_list:
df = df.merge(parties_df, on=['rok', 'okrsek'], how='left')
return df
data = get_first_k_parties_places(1, data)
data[['place_1_party']].value_counts()
place_1_party ANO 99723 TOP 42609 ČSSD 15878 PIR 1426 ODS 1271 KSČM 368 STAN 31 Name: count, dtype: int64
Unfortunately, we can see that we've lost valuable information about the parties that came second, etc. For example SPD, a party that had a relatively big result in terms of total votes, didn't win any district in our sample. And of course the target classes are heavily imbalanced, so we have to adjust for that in the model.
from sklearn.metrics import accuracy_score, log_loss, f1_score
def print_scores(probabilities, predictions, targets):
print(f'Accuracy: {accuracy_score(targets, predictions)}')
print(f'F1-score (macro): {f1_score(targets, predictions, average="macro")}')
print(f'F1-score (micro): {f1_score(targets, predictions, average="micro")}')
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import log_loss
features_df = data.drop(['place_1_party', 'strana', 'okrsek', 'rok', 'obec', 'hlasy'], axis=1)
X = features_df # Features
y = data['place_1_party'] # Target
# transform data
data_transformer = ColumnTransformer(
transformers=[
("StandardScale", StandardScaler(), [
'eko_post_p_a', 'eko_zam_a', 'eko_nezam_a', # economic
'vzd_zakl_a', 'vzd_vysok_a', 'vzd_nast_a', 'vzd_vos_a', 'vzd_str_be_a', 'vzd_str_sm_a', # educational
'sum_v0_14_a', 'v65avic_a', 'nab_rimsko_a', 'rodst_rozv_a', # social
'zeny_a', # sex
'muni_turnout' # political
]),
("RobustScale", RobustScaler(), ['nar_romska_a', 'muni_population']), # location and social
]
)
X = data_transformer.fit_transform(X)
### NOTE: we should also have a separate validation set here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Encode the target variables if they are strings
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
# Consider class weights
encoded_classes = label_encoder.transform(y_train.unique())
class_counts = {cls: (y_train_encoded == cls).sum() for cls in encoded_classes}
num_classes = len(encoded_classes)
# Calculate sample weights
sample_weight = np.array([len(y_train_encoded) / (class_counts[cls] * num_classes) for cls in y_train_encoded])
# Initialize and train the classifier
classifier = GradientBoostingClassifier(random_state=42)
classifier.fit(X_train, y_train_encoded, sample_weight=sample_weight)
GradientBoostingClassifier(random_state=42)
Let's see the scores.
# Predict probabilities
probabilities = classifier.predict_proba(X_test)
predictions = classifier.predict(X_test)
print_scores(probabilities, predictions, y_test_encoded)
Accuracy: 0.42970057652966337 F1-score (macro): 0.4851621119123744 F1-score (micro): 0.42970057652966337
We can see right off the bat that the model is not very capable of predicting the winners. The accuracy is very low.
But still let's have a look at the training curves.
train_errors, test_errors = [], []
# Calculate the error at each stage
for y_train_pred in classifier.staged_predict_proba(X_train):
train_errors.append(log_loss(y_train_encoded, y_train_pred, sample_weight=sample_weight))
for y_test_pred in classifier.staged_predict_proba(X_test):
test_errors.append(log_loss(y_test_encoded, y_test_pred))
# Plotting
plt.figure(figsize=(10, 5))
plt.plot(train_errors, label='Training Error', color='blue')
plt.plot(test_errors, label='Test Error', color='red')
plt.title('Error at Each Boosting Stage')
plt.xlabel('Boosting Iterations')
plt.ylabel('Log Loss')
plt.legend()
plt.show()
It does not seem like there's much room for improvement (in terms of the number of boosting iterations).
Which features were the most important for predictions?
feature_importances = classifier.feature_importances_
column_labels = list(features_df.columns)
feature_nums = column_labels
plt.figure(figsize=(10, 4))
plt.barh(feature_nums, feature_importances)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.show()
We can see that, surprisingly, the model considered the share of people with college-entry-level education the most important criterion. It is followed by, as we might have guessed, the (scaled) number of people living in the municipality where the district is located.
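To back up that reading of the bar chart, a small hedged sketch printing the importances in sorted order:
# Hedged sketch: the same feature importances, sorted from most to least important.
importance_sr = pd.Series(classifier.feature_importances_, index=features_df.columns)
print(importance_sr.sort_values(ascending=False).round(3))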
Let us now compare it to a simple model, to see whether it even made sense to do it this way. I'll choose kNN as a baseline model and then compare the two.
# perform a kNN algorithm as a baseline model to compare the complex one to
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train_encoded)
KNeighborsClassifier()
knn_predictions = knn.predict(X_test)
knn_probabilities = knn.predict_proba(X_test)
print_scores(knn_probabilities, knn_predictions, y_test_encoded)
Accuracy: 0.5937635608455768 F1-score (macro): 0.48140470832275417 F1-score (micro): 0.5937635608455768
So it looks like our simple model performs better than the complex one (at least in terms of accuracy), possibly due to the unweighted target classes in the simpler model. But let's test it. Let's state the null hypothesis $H_0$ that the simpler model performs the same as the complex one. We'll use bootstrap resampling and compare the macro F1-score.
f1_diffs = []
n_bootstrap = 1000
y_test_array = y_test_encoded
# Perform the bootstrap sampling
for _ in range(n_bootstrap):
indices = np.random.choice(range(len(X_test)), size=len(X_test), replace=True)
X_sample, y_sample = X_test[indices], y_test_array[indices]
# Evaluate GBDT
y_pred_gbdt = classifier.predict(X_sample)
f1_gbdt = f1_score(y_sample, y_pred_gbdt, average='macro')
# Evaluate kNN
y_pred_knn = knn.predict(X_sample)
f1_knn = f1_score(y_sample, y_pred_knn, average='macro')
# Calculate the difference in performance
f1_diffs.append(f1_gbdt - f1_knn)
# Statistical Analysis
mean_diff = np.mean(f1_diffs)
std_diff = np.std(f1_diffs)
confidence_interval = np.percentile(f1_diffs, [2.5, 97.5])
print(f'The mean difference in F1-score is {mean_diff:.3f}')
print(f'The standard deviation of the differences is {std_diff:.3f}')
print(f'The 95% confidence interval is {confidence_interval}')
The mean difference in F1-score is -0.000 The standard deviation of the differences is 0.032 The 95% confidence interval is [-0.05605834 0.04883434]
Zero is included in the confidence interval, so we cannot reject the null hypothesis and therefore cannot conclude which model is better. But the point stands: the complex model performs no better than the simple one.
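As one extra sanity check (not part of the original comparison), we could also pit both models against a trivial majority-class baseline; a hedged sketch using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
# Hedged sketch: a majority-class baseline to put both models' scores into perspective.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train_encoded)
print_scores(dummy.predict_proba(X_test), dummy.predict(X_test), y_test_encoded)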
Discussion and conclusion¶
Let's sum up what we've done. We got a pretty messy dataset with a lot of redundant information. We explored our data and tried to see whether we could get some in-depth view from its socio-economic features. With that in mind we attempted to cluster it. The best clustering we could get had only two clusters, so we explored a different approach: we performed PCA and reduced our selection to one dimension - one new column. Under inspection, it seemed to perform well. But as we later found out, this factor is only very weakly relevant to voting preferences, and only for some parties. Finally, we tried a simple prediction task where we attempted to guess the winning party of a district and then compared our complex model to a simpler one.
We can speculate about lower deprivation (in terms of our constructed index) having a slightly positive effect on the chances of the top economically liberal right-wing parties. But the result we've got is heavily dependent on how we set up our metric - this needs to be addressed. So, to answer our question of whether deprivation of a region pushes voters toward certain kinds of parties: there is only a really weak (though statistically significant) push towards authoritarian parties (like DSSS and KSČM), but this could easily have resulted from setting up our metric in a specific way, so we cannot conclude much. To answer this question with confidence, a more reasonable and robust metric is needed.
When we tried to predict the winning party from the information we had at our disposal, it didn't go well. So, to answer the second question we set, we conclude that more information in general is needed to fully capture the nature of districts' voting preferences.