Parliamentary elections analysis

Introduction¶

In this report, we'll examine a dataset for NDBI048 Data Science course.

Dataset file¶

We're using elections dataset from the course's GitLab.

The selected dataset contains all-around information about Czech elections from 2013 to 2017.

Down below, this is the content of the volby_description.csv file describing columns of the dataset.

    column_name_cz;column_meaning
    ID_N;District ID
    OBEC;Municipality code
    Nazev_obce;Municipality name
    Okres_nazev;County name
    Kraj_nazev;Region name
    Obec_okrsek_final2016;Municipality code with district number concatenated
    OBEC_Okrsek;Municipality code _ district number
    PAR_2017_1_Občansk�.demokratick�.strana;Count of votes in the district in elections 2017 for party
    PAR_2017_2_�.�.dn�.rodaVlasteneck�.unie;
    PAR_2017_3_CESTAODPOVĚDN�.SPOLEČNOSTI;
    PAR_2017_4_Česk�.str.soci�.ln�.demokrat;
    PAR_2017_5_VoltePr.Blokwww.cibulka.net;
    PAR_2017_6_Radostn�.Česko;
    PAR_2017_7_STAROSTOV�.ANEZ�.VISLÍ;
    PAR_2017_8_Komunistick�.str.ČechaMoravy;
    PAR_2017_9_Stranazelen�.ch;
    PAR_2017_10_ROZUMNÍstopmigracidikt�.t.EU;
    PAR_2017_11_Společ.protiv�.st.vProk.údol�.;
    PAR_2017_12_Stranasvobodn�.chobčanů;
    PAR_2017_13_Blokprotiislam.Obran.domova;
    PAR_2017_14_Občansk�.demokratick�.aliance;
    PAR_2017_15_Česk�.pir�.tsk�.strana;
    PAR_2017_16_OBČAN�.2011SPRAVEDL.PROLIDI;
    PAR_2017_17_UnieH.A.V.E.L;
    PAR_2017_18_Česk�.n�.rodn�.fronta;
    PAR_2017_19_ReferendumoEvropsk�.unii;
    PAR_2017_20_TOP09;
    PAR_2017_21_ANO2011;
    PAR_2017_22_Dobr�.volba2016;
    PAR_2017_23_SPRRepubl.str.Čsl.M.Sl�.dka;
    PAR_2017_24_K�.esť.demokr.unieČs.str.lid;
    PAR_2017_25_Česk�.stranan�.rodn�.soci�.ln�.;
    PAR_2017_26_REALIST�.;
    PAR_2017_27_SPORTOVCI;
    PAR_2017_28_D�.lnic.str.soci�.ln�.spravedl;
    PAR_2017_29_Svob.ap�..dem.T.OkamuraSPD;
    PAR_2017_30_StranaPr�.vObčanů;
    PAR_2017_31_N�.rodSob�.;
    Celkov�.součet;Total count of votes in the district in elections 2017
    PAR_2017_1_Občansk�.demokratick�.strana_p;Percentages of votes for party
    PAR_2017_2_�.�.dn�.rodaVlasteneck�.unie_p;
    PAR_2017_3_CESTAODPOVĚDN�.SPOLEČNOSTI_p;
    PAR_2017_4_Česk�.str.soci�.ln�.demokrat._p;
    PAR_2017_5_VoltePr.Blokwww.cibulka.net_p;
    PAR_2017_6_Radostn�.Česko_p;
    PAR_2017_7_STAROSTOV�.ANEZ�.VISLÍ_p;
    PAR_2017_8_Komunistick�.str.ČechaMoravy_p;
    PAR_2017_9_Stranazelen�.ch_p;
    PAR_2017_10_ROZUMNÍstopmigracidikt�.t.EU_p;
    PAR_2017_11_Společ.protiv�.st.vProk.údol�._p;
    PAR_2017_12_Stranasvobodn�.chobčanů_p;
    PAR_2017_13_Blokprotiislam.Obran.domova_p;
    PAR_2017_14_Občansk�.demokratick�.aliance_p;
    PAR_2017_15_Česk�.pir�.tsk�.strana_p;
    PAR_2017_16_OBČAN�.2011SPRAVEDL.PROLIDI_p;
    PAR_2017_17_UnieH.A.V.E.L._p;
    PAR_2017_18_Česk�.n�.rodn�.fronta_p;
    PAR_2017_19_ReferendumoEvropsk�.unii_p;
    PAR_2017_20_TOP09_p;
    PAR_2017_21_ANO2011_p;
    PAR_2017_22_Dobr�.volba2016_p;
    PAR_2017_23_SPRRepubl.str.Čsl.M.Sl�.dka_p;
    PAR_2017_24_K�.esť.demokr.unieČs.str.lid._p;
    PAR_2017_25_Česk�.stranan�.rodn�.soci�.ln�._p;
    PAR_2017_26_REALIST�._p;
    PAR_2017_27_SPORTOVCI_p;
    PAR_2017_28_D�.lnic.str.soci�.ln�.spravedl._p;
    PAR_2017_29_Svob.ap�..dem.T.OkamuraSPD_p;
    PAR_2017_30_StranaPr�.vObčanů_p;
    PAR_2017_31_N�.rodSob�._p;
    PAR_2017_ID_OKRSKY;Technical records, not necessary to keep
    PAR_2017_TYP_FORM;
    PAR_2017_OPRAVA;
    PAR_2017_CHYBA;
    PAR_2017_OKRES;
    PAR_2017_OKRSEK;District number within a municipality
    PAR_2017_KC_1;Technical records, not necessary to keep
    PAR_2017_KC_2;
    PAR_2017_VOL_SEZNAM;Count of eligible voters in the district (in 2017)
    PAR_2017_VYD_OBALKY;Count of issued envelopes
    PAR_2017_ODEVZ_OBAL;Count of envelopes in the box
    PAR_2017_PL_HL_CELK;Count of valid votes
    VOL_SEZNAM_KRAJ16;Statistical data from the regional elections in 2016
    VYD_OBALKY_KRAJ16;
    ODEVZ_OBAL_KRAJ16;
    PL_HL_CELK_KRAJ16;
    Ztracene_obalky_KRAJ16;
    Neplatne_KRAJ16;
    Podil_neplatn�.ch_KRAJ16;
    Vol_ucast_KRAJ16;
    CHYBA_KRAJ16;
    PORIZ_OKR_KRAJ16;
    OKR_TIME_KRAJ16;
    PORIZ_TIME_KRAJ16;
    PAR13_1_Česk�._strsoci�.ln�._demokrat;Count of votes in the district in elections 2013 for party
    PAR13_2_Strana_svobodn�.ch_občanů;
    PAR13_3_Česk�._pir�.tsk�._strana;
    PAR13_4_TOP_09;
    PAR13_5_HLAVU_VZH�.RU__volebn�._blok;
    PAR13_6_Občansk�._demokratick�._strana;
    PAR13_7_Romsk�._demokratick�._strana;
    PAR13_8_Klub_angažovan�.ch_nestran�.ků;
    PAR13_9_politick�._hnut�._Zm�.na;
    PAR13_10_Strana_soukromn�.ků_ČR;
    PAR13_11_K�.esťdemokrunieČsstrlid;
    PAR13_12_Volte_PrBlok_wwwcibulkanet;
    PAR13_13_SuverStrana_zdrav�.ho_rozumu;
    PAR13_14_Aktiv_nez�.visl�.ch_občanů;
    PAR13_15_Strana_Pr�.v_Občanů_ZEMANOVCI;
    PAR13_16_OBČAN�._2011;
    PAR13_17_Úsvit_p�.�.m�._demokrTOkamury;
    PAR13_18_D�.lnicstrsoci�.ln�._spravedl;
    PAR13_20_ANO_2011;
    PAR13_21_Komunistick�._strČech_a_Moravy;
    PAR13_22_LEV_21N�.rodn�._socialist�.;
    PAR13_23_Strana_zelen�.ch;
    PAR13_24_Koruna_Česk�._monarchstrana;
    PAR13_1_Česk�._strsoci�.ln�._demokrat_p;Percentages of votes for party
    PAR13_2_Strana_svobodn�.ch_občanů_p;
    PAR13_3_Česk�._pir�.tsk�._strana_p;
    PAR13_4_TOP_09_p;
    PAR13_5_HLAVU_VZH�.RU__volebn�._blok_p;
    PAR13_6_Občansk�._demokratick�._strana_p;
    PAR13_7_Romsk�._demokratick�._strana_p;
    PAR13_8_Klub_angažovan�.ch_nestran�.ků_p;
    PAR13_9_politick�._hnut�._Zm�.na_p;
    PAR13_10_Strana_soukromn�.ků_ČR_p;
    PAR13_11_K�.esťdemokrunieČsstrlid_p;
    PAR13_12_Volte_PrBlok_wwwcibulkanet_p;
    PAR13_13_SuverStrana_zdrav�.ho_rozumu_p;
    PAR13_14_Aktiv_nez�.visl�.ch_občanů_p;
    PAR13_15_Strana_Pr�.v_Občanů_ZEMANOVCI_p;
    PAR13_16_OBČAN�._2011_p;
    PAR13_17_Úsvit_p�.�.m�._demokrTOkamury_p;
    PAR13_18_D�.lnicstrsoci�.ln�._spravedl_p;
    PAR13_20_ANO_2011_p;
    PAR13_21_Komunistick�._strČech_a_Moravy_p;
    PAR13_22_LEV_21N�.rodn�._socialist�._p;
    PAR13_23_Strana_zelen�.ch_p;
    PAR13_24_Koruna_Česk�._monarchstrana_p;
    Prům�.rzVOLICI;Count of eligible voters in the district (in 2013)
    Prům�.rzVYDOB;Count of issued envelopes
    Prům�.rzODOB;Count of envelopes in the box
    Prům�.rzHLASYCEL;Count of valid votes
    Ztracen�.ob�.lky;Lost envelopes
    P�.rov�.n�.;Technical records, not necessary to keep
    Poc_budov;Count of buildings(?) - meaning is unclear
    Oby_celkem;Total population in the district
    Muzi;Total men
    Zeny;Total women
    Rodst_rozv;Count of divorced
    Sum_v0_14;Count of children up to 14 yrs
    V65avic;Count of people 65+ yrs
    Nar_romska;Gipsy population
    Vzd_vysok;Count of people with university education (highest)
    Vzd_vos;Count of people with education level 5
    Vzd_nast;Count of people with education level 4
    Vzd_str_sm;Count of people with education level 3
    Vzd_str_be;Count of people with education level 2
    Vzd_zakl;Count of people with at most grammar school (lowest)
    Nab_rimsko;Count of Roman Catholics
    Eko_zam;Count of employed people - unretired
    Eko_post_p;Count of employed people - retired but working
    Eko_nezam;Count of unemployed people
    Nepracduch;Count of retired not working people
    Muzi_A;Population share - men
    Zeny_A;Population share - women
    Rodst_rozv_A;Population share of divorced
    Sum_v0_14_A;Population share of children up to 14 yrs
    V65avic_A;Population share of people 65+ yrs
    Nar_romska_A;Gipsy population
    Vzd_vysok_A;Population share of people with university education (highest)
    Vzd_vos_A;Population share of people with education level 5
    Vzd_nast_A;Population share of people with education level 4
    Vzd_str_sm_A;Population share of people with education level 3
    Vzd_str_be_A;Population share of people with education level 2
    Vzd_zakl_A;Population share of people with at most grammar school (lowest)
    Nab_rimsko_A;Population share of Roman Catholics
    Eko_zam_A;Population share of employed people - unretired
    Eko_post_p_A;Population share of employed people - retired but working
    Eko_nezam_A;Population share of unemployed people
    Nepracduch_A;Population share of retired not working people
    vaha;Weight(?) - meaning unclear

As we can see, a lot of unclear, duplicate, inconsistent and uninteresting values are included, as well as information about 2016 regional elections. We're concerened with the dataset's future expandibility right away.

The volby.sav file itself, which we're using, is circa 12 MB big and has a little under 15 k rows and 183 columns.

The dataset's structure is over-all really messy.

Goals¶

The reason for carrying out the analysis is to explore and possibly predict voting preferences of specific population units. We expect to find certain voter bases of large political entities.

We want to explore parliamentary elections only.

The analysis could point at deprived regional units as well as commonalities between certain sections of the population.

Data¶

We'll focus on the dataset's structural and semantic information here.

Structure¶

Let's have a look at a random sample of 3000 rows (first 5 shown).

import pandas as pd
import pyreadstat

K, meta = pyreadstat.read_sav('volby_dataset/volby.sav')

K.columns = K.columns.str.lower()

K = K.sample(n=3000, axis=0, random_state=42)

K.head()

	id_n	obec	nazev_obce	okres_nazev	kraj_nazev	obec_okrsek_final2016	obec_okrsek	par_2017_1_občanskádemokratickástrana	par_2017_2_řádnárodavlasteneckáunie	...	vzd_nast_a	vzd_str_sm_a	vzd_str_be_a	vzd_zakl_a	nab_rimsko_a	eko_zam_a	eko_post_p_a	eko_nezam_a	nepracduch_a	vaha
1743	1744.0	539911.0	Příbram	Příbram	Středočeský kraj	5399119.0	539911_9	60.0	1.0	...	2.879377	23.891051	26.459144	15.642023	8.638132	44.669261	5.758755	4.824903	23.735409	1.685789
12414	12457.0	597520.0	Krnov	Bruntál	Moravskoslezský kraj	5975204.0	597520_4	19.0	1.0	...	1.560624	21.008403	30.852341	15.726291	6.962785	40.216086	4.321729	6.842737	26.770708	1.066376
1593	1594.0	531821.0	Tachlovice	Praha-západ	Středočeský kraj	5318211.0	531821_1	66.0	2.0	...	2.264601	20.619785	27.056019	10.607867	11.442193	51.489869	10.369487	3.337306	13.230036	1.222989
1858	1859.0	541231.0	Rožmitál pod Třemšínem	Příbram	Středočeský kraj	5412311.0	541231_1	44.0	1.0	...	1.955307	23.882682	31.284916	19.413408	9.636872	41.899441	7.681564	6.145251	27.653631	1.282819
13650	13694.0	598836.0	Václavovice	Ostrava-město	Moravskoslezský kraj	5988361.0	598836_1	90.0	3.0	...	2.286336	25.585193	26.238432	12.411541	15.133370	45.944475	5.498095	3.483941	22.972237	2.731049

5 rows × 183 columns

At first sight, the dataset is a tragedy.

There are plenty of, so-called, technical columns which we won't take into consideration. They are described above in volby_description.csv file.

Apart from dataset's wrong encoding, we are mainly concerned about having political parties on separate columns (as well as their percentage - columns that end with _p) since in the upcoming exploratory analysis we'll probably have to group by political party which would be less intuitive here.

For now, let's have a look at the sample's basic info.

K.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3000 entries, 1743 to 10739
Columns: 183 entries, id_n to vaha
dtypes: float64(174), object(9)
memory usage: 4.2+ MB

Columns that contain textual values are:

K[K.select_dtypes(include=['object']).columns].head()

	nazev_obce	okres_nazev	kraj_nazev	obec_okrsek	poriz_okr_kraj16	okr_time_kraj16	poriz_time_kraj16	párování
1743	Příbram	Příbram	Středočeský kraj	539911_9	00:00:01	02:48:44	05:50:01	539911_9
12414	Krnov	Bruntál	Moravskoslezský kraj	597520_4	00:00:01	03:02:29	03:43:43	597520_4
1593	Tachlovice	Praha-západ	Středočeský kraj	531821_1	00:00:01	03:28:26	06:07:34	531821_1
1858	Rožmitál pod Třemšínem	Příbram	Středočeský kraj	541231_1	00:00:01	02:48:21	03:42:08	541231_1
13650	Václavovice	Ostrava-město	Moravskoslezský kraj	598836_1	00:00:01	05:55:34	08:39:03	598836_1

It seems like there are no duplicities in terms of rows.

duplicate_rows = K[K.duplicated(keep=False)]
print("Number of duplicate rows: ", duplicate_rows.shape[0])

Number of duplicate rows:  0

But in terms of columns, there seem to be issues that can be addressed.

Apart technical-valued columns, columns 'obec_okrsek_final2016' , 'obec_okrsek' and 'par_2017_okrsek' seem to be sharing some values.

K[['obec_okrsek_final2016', 'obec_okrsek', 'par_2017_okrsek']].head()

	obec_okrsek_final2016	obec_okrsek	par_2017_okrsek
1743	5399119.0	539911_9	9.0
12414	5975204.0	597520_4	4.0
1593	5318211.0	531821_1	1.0
1858	5412311.0	541231_1	1.0
13650	5988361.0	598836_1	1.0

We can see from the result below that 'obec_okrsek_final2016' and 'obec_okrsek' columns share almost every value.

satisfying_rows, unsatisfying_rows, mismatched_values = [], [], []
spec_K = K[['obec_okrsek_final2016', 'obec_okrsek']]

for index, row in spec_K.iterrows():
    val1, val2 = row['obec_okrsek_final2016'], row['obec_okrsek']
    
    if pd.isna(val1) or pd.isna(val2):
        continue

    if str(int(val1)) != val2.replace("_", ""):
        unsatisfying_rows.append(index)
        mismatched_values.append((val1, val2))
    else:
        satisfying_rows.append(index)

print(f"Rows where mismatch between non-nan values occured:\n{unsatisfying_rows[:min(15, len(unsatisfying_rows))]}")
if (len(unsatisfying_rows) > 0):
    for i in range(min(5, len(mismatched_values))):
        print(f"{mismatched_values[i][0]} != {mismatched_values[i][1]}")
    if len(unsatisfying_rows) > 5:
        print("...")
print(f"Number of equal values:\n{len(satisfying_rows)} out of {len(spec_K.index)}")

Rows where mismatch between non-nan values occured:
[5749, 6128, 13527, 3838, 6086, 13402, 9526, 9742, 9839, 3745, 9688, 9671, 3817, 9538, 6156]
56789223.0 != 554804_1023
55690437.0 != 563889_37
54622431.0 != 554821_15031
54600357.0 != 554791_3057
5568912.0 != 563889_1002
...
Number of equal values:
2542 out of 3000

We don't have enough information to determine why that happens. But it seems like there would be an overflow error if it were connected like the other equal values - so in these cases, weird numbering appears.

pd.DataFrame({
    'nan_values': K[['obec_okrsek_final2016', 'obec_okrsek']].isna().sum()
})

	nan_values
obec_okrsek_final2016	225
obec_okrsek	0

From the table, we can't be one hundred percent sure but we can conclude that the values that are not equal are weird (in 'obec_okrsek_final2016'), missing or null.

So, 'obec_okrsek_final2016' should be dropped. And 'obec_okrsek' column should be split in two separate columns.

Flaws¶

Let's look at missing values and other dataset's problems in general.

na_counts = K.isnull().sum()
rows = K.shape[0]
na_percentage = round((na_counts / rows) * 100, 2)

na_K = pd.DataFrame({
    'nan_values': na_counts,
    'percentage': na_percentage,
})

na_K = na_K[na_K['nan_values'] > 0].sort_values(by='percentage', ascending=False)

na_K

	nan_values	percentage
par_2017_okrsek	234	7.8
par_2017_id_okrsky	234	7.8
par_2017_oprava	234	7.8
par_2017_chyba	234	7.8
par_2017_okres	234	7.8
...	...	...
par_2017_24_křesť.demokr.uniečs.str.lid._p	6	0.2
par_2017_23_sprrepubl.str.čsl.m.sládka_p	6	0.2
par_2017_22_dobrávolba2016_p	6	0.2
par_2017_21_ano2011_p	6	0.2
par_2017_6_radostnéčesko	6	0.2

138 rows × 2 columns

We should look how they group together as well - if we suppose that the same flaws / inconsistencies happen between those with the same percentage of nan_values.

na_K.value_counts()

nan_values  percentage
6           0.20          63
8           0.27          51
234         7.80          12
225         7.50           9
199         6.63           3
Name: count, dtype: int64

Unluckily, almost every column contains missing or NaN values. Let's examine what could be the reason behind that keeping in mind that we'll ignore values that we won't use in our exploratory analysis (that is "kraj16" values).

This is how it looks if we're ignoring the regional elections:

na_K = na_K.loc[[idx for idx in na_K.index if not idx.endswith('kraj16') and not idx.endswith('final2016')]]

print(na_K.value_counts())

nan_values  percentage
6           0.20          63
8           0.27          51
234         7.80          12
Name: count, dtype: int64

But now, let's see which are the worst:

unique_na_K = sorted(list(set(na_K['percentage'])))

na_K[na_K['percentage'] == unique_na_K[-1]]

	nan_values	percentage
par_2017_okrsek	234	7.8
par_2017_id_okrsky	234	7.8
par_2017_oprava	234	7.8
par_2017_chyba	234	7.8
par_2017_okres	234	7.8
par_2017_kc_1	234	7.8
par_2017_kc_2	234	7.8
par_2017_vol_seznam	234	7.8
par_2017_vyd_obalky	234	7.8
par_2017_odevz_obal	234	7.8
par_2017_pl_hl_celk	234	7.8
par_2017_typ_form	234	7.8

We will drop those columns all at once later on. So we don't worry about them right now.

Let's move on to political parties' columns. Firstly, we'll take a look at how are NaN or Null values distributed across parties in 2013 and 2017. And secondly, we'll try to find where those values miss geographically.

We also have to point out that the structure of party columns is inconsistent between years.

cols_votes_17 = [col for col in K.columns if col[:4] == "par_" and col[9].isdigit() and not col.endswith("_p")]
cols_votes_13 = [col for col in K.columns if col[:4] == "par1" and not col.endswith("_p")]

miss_df_13 = K[cols_votes_13].isna().sum()
miss_df_17 = K[cols_votes_17].isna().sum()

print(
    f"Number of Nan or Null values in 2013 parties:\n sum={miss_df_13.sum()}, mean={miss_df_13.mean()}, min={miss_df_13.min()}, max={miss_df_13.max()}\n\n" + \
    f"Number of Nan or Null values in 2017 parties:\n sum={miss_df_17.sum()}, mean={miss_df_17.mean()}, min={miss_df_17.min()}, max={miss_df_17.max()}"
)

Number of Nan or Null values in 2013 parties:
 sum=184, mean=8.0, min=8, max=8

Number of Nan or Null values in 2017 parties:
 sum=186, mean=6.0, min=6, max=6

It seems like the missing or NaN values are year-specific. Which rows have that, how do they look and what do they contain?

2013:

nan_13_head = K[K[cols_votes_13].isna()].head()
nan_13_head

	id_n	obec	nazev_obce	okres_nazev	kraj_nazev	obec_okrsek_final2016	obec_okrsek	par_2017_1_občanskádemokratickástrana	par_2017_2_řádnárodavlasteneckáunie	par_2017_3_cestaodpovědnéspolečnosti	...	vzd_nast_a	vzd_str_sm_a	vzd_str_be_a	vzd_zakl_a	nab_rimsko_a	eko_zam_a	eko_post_p_a	eko_nezam_a	nepracduch_a	vaha
1743	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12414	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1593	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1858	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13650	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 183 columns

2017:

nan_17_head = K[K[cols_votes_17].isna()].head()
nan_17_head

	id_n	obec	nazev_obce	okres_nazev	kraj_nazev	obec_okrsek_final2016	obec_okrsek	par_2017_1_občanskádemokratickástrana	par_2017_2_řádnárodavlasteneckáunie	par_2017_3_cestaodpovědnéspolečnosti	...	vzd_nast_a	vzd_str_sm_a	vzd_str_be_a	vzd_zakl_a	nab_rimsko_a	eko_zam_a	eko_post_p_a	eko_nezam_a	nepracduch_a	vaha
1743	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12414	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1593	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1858	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13650	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 183 columns

Is there even any non-NaN value in these rows?

print("Row idx and the sum of the number of non-NaN values in:")
print(f"2013's head:\n{nan_13_head.notna().sum(axis=1)}")
print(f"2017's head:\n{nan_17_head.notna().sum(axis=1)}")

Row idx and the sum of the number of non-NaN values in:
2013's head:
1743     0
12414    0
1593     0
1858     0
13650    0
dtype: int64
2017's head:
1743     0
12414    0
1593     0
1858     0
13650    0
dtype: int64

No, so they should be ok to remove. Probably an error. So there's no way how to proceed in finding the missing values' geographical locations.

Let's do simple sanity checks.

party_cols = cols_votes_13 + cols_votes_17

def check_value(x):
    if isinstance(x, (int, float)) and pd.notna(x) and x != 0 and (x < 0 or x % int(x) != 0):
        return True
    return False

if K[party_cols].applymap(check_value).any().any():
    print("There is at least one element in party columns that is not a whole number or less than 0.")
else:
    print("All non-na elements in party columns are either whole numbers or greater than or equal to 0.")

All non-na elements in party columns are either whole numbers or greater than or equal to 0.

It is also important to assure ourselves that all values, that tell us something about the number of people in certain population groups, sum, for example, to the total number of people living in the regional unit.

What needs to be said is that it is at least very weird to have the total number of people (and other similar columns) without year taken into account.

Let's check 'muzi', 'zeny' and then all educational levels.

# Check if 'muzi' + 'zeny' == 'oby_celkem'
sub_K = K[K[['muzi', 'zeny', 'oby_celkem']].notna().any(axis=1)][['muzi', 'zeny', 'oby_celkem']].head()
sub_K['muzi_zeny'] = sub_K['muzi'] + sub_K['zeny']
print(f"The number of rows where 'muzi' + 'zeny' does not equal 'oby_celkem' is {int(sub_K[sub_K['oby_celkem'] != sub_K['muzi_zeny']].sum().sum())}")

# Check if all levels of education sum up to 'oby_celkem'
levels = ['vzd_vysok', 'vzd_vos', 'vzd_nast', 'vzd_str_sm', 'vzd_str_be', 'vzd_zakl']
sub_K = K[K[levels].notna().any(axis=1)][levels + ['oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
sub_K['levels_sum'] = sub_K[levels].sum(axis=1)

differ_rows = sub_K[sub_K['oby_celkem'] != sub_K['levels_sum']]

print(f"The number of rows where summing across the levels of education does not equal 'oby_celkem' is {int(differ_rows.sum().sum())} \
which is {100 * len(differ_rows) / sub_K.shape[0]} percent of total rows.")

sub_K[['levels_sum', 'oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()

The number of rows where 'muzi' + 'zeny' does not equal 'oby_celkem' is 0
The number of rows where summing across the levels of education does not equal 'oby_celkem' is 23546 which is 100.0 percent of total rows.

	levels_sum	oby_celkem	průměrzvolici	par_2017_vol_seznam
1743	1047.0	1285.0	958.0	939.0
12414	629.0	833.0	606.0	599.0
1593	660.0	839.0	695.0	754.0
1858	599.0	716.0	729.0	712.0
13650	1522.0	1837.0	1552.0	1578.0

It seems like there is a little discrepancy here. This could potentially be attributed to the method of storing the number of people having a certain educational level - a person's education could be weighted in only after they reach their's 18th year - so, hypothetically speaking, if we were to subtract people aged 0 to 17, then the difference could disappear.

On the other hand, this hypothesis seems to be wrong since columns with eligible voters should be telling exactly that number ('průměrzvolici', 'par_2017_vol_seznam') and they are not equal as well. We could think that it could be something in between those values, but that seems to be wrong as well. So we have to conclude that there are inconsistencies present and we don't really know why since the method of data collection is not known to us.

work_status = ['eko_zam', 'eko_post_p', 'eko_nezam', 'nepracduch']

sub_K = K[K[work_status].notna().any(axis=1)][work_status + ['oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()
sub_K['work_status_sum'] = sub_K[work_status].sum(axis=1)

print(f"The number of rows where summing across the work status does not equal 'oby_celkem' \
is {int(sub_K[sub_K['work_status_sum'] != sub_K['oby_celkem']].sum().sum())}")

sub_K[['work_status_sum', 'oby_celkem', 'průměrzvolici', 'par_2017_vol_seznam']].head()

The number of rows where summing across the work status does not equal 'oby_celkem' is 23336

	work_status_sum	oby_celkem	průměrzvolici	par_2017_vol_seznam
1743	1015.0	1285.0	958.0	939.0
12414	651.0	833.0	606.0	599.0
1593	658.0	839.0	695.0	754.0
1858	597.0	716.0	729.0	712.0
13650	1431.0	1837.0	1552.0	1578.0

All said above applies here as well. With the exception that you can probably have a working status even if you're not eligible to vote.

Adjustments¶

At first, we'll remove all the unnecessary columns for our exploratory analysis - that includes technical columns, duplicities, regional elections and so on.

# delete columns with uninteresting or duplicate values

K.drop(columns=[col for col in K.columns if col.endswith('_p') \
    or col.endswith('kraj16')] + ['par_2017_id_okrsky', 'par_2017_typ_form', 'par_2017_oprava', 'par_2017_chyba', 'par_2017_okres'] \
    + ['par_2017_kc_1', 'par_2017_kc_2'] + ['vaha'] + ["ztracenéobálky", "párování", "poc_budov"] + ['obec_okrsek_final2016'] \
    + ['par_2017_okrsek'] + ['par_2017_vyd_obalky', 'par_2017_odevz_obal', 'průměrzvydob', 'průměrzodob'] \
    + ['celkovýsoučet', 'par_2017_vol_seznam', 'par_2017_pl_hl_celk', 'průměrzvolici', 'průměrzhlasycel'], inplace=True)

Our main concern lies with each party having one separate column. To simplify the exploratory analysis, what we can do is "melt" those columns into 'rok_kandidatka_strana' and 'hlasy' columns.

In the end, what we possibly want to get is 4 distinct columns: 'rok', 'kandidatka', 'strana' and 'hlasy' which would effectively help us with orienting and all-around work with the dataset.

But first, we'll rename our party columns so that they all share the same naming structure, that is: "YYYY_CandidateListNum_PartyName".

renamed_cols_17 = [col[4:] for col in cols_votes_17]
renamed_cols_13 = [f'20{col[3:]}' for col in cols_votes_13]

rename_map = {}
for old_name, new_name in zip(cols_votes_13 + cols_votes_17, renamed_cols_13 + renamed_cols_17):
    rename_map[old_name] = new_name

K = K.rename(columns=rename_map)

K.head()

	id_n	obec	nazev_obce	okres_nazev	kraj_nazev	obec_okrsek	2017_1_občanskádemokratickástrana	2017_2_řádnárodavlasteneckáunie	2017_4_českástr.sociálnědemokrat	...	vzd_vos_a	vzd_nast_a	vzd_str_sm_a	vzd_str_be_a	vzd_zakl_a	nab_rimsko_a	eko_zam_a	eko_post_p_a	eko_nezam_a	nepracduch_a
1743	1744.0	539911.0	Příbram	Příbram	Středočeský kraj	539911_9	60.0	1.0	39.0	...	1.089494	2.879377	23.891051	26.459144	15.642023	8.638132	44.669261	5.758755	4.824903	23.735409
12414	12457.0	597520.0	Krnov	Bruntál	Moravskoslezský kraj	597520_4	19.0	1.0	26.0	...	1.080432	1.560624	21.008403	30.852341	15.726291	6.962785	40.216086	4.321729	6.842737	26.770708
1593	1594.0	531821.0	Tachlovice	Praha-západ	Středočeský kraj	531821_1	66.0	2.0	13.0	...	2.383790	2.264601	20.619785	27.056019	10.607867	11.442193	51.489869	10.369487	3.337306	13.230036
1858	1859.0	541231.0	Rožmitál pod Třemšínem	Příbram	Středočeský kraj	541231_1	44.0	1.0	30.0	...	0.977654	1.955307	23.882682	31.284916	19.413408	9.636872	41.899441	7.681564	6.145251	27.653631
13650	13694.0	598836.0	Václavovice	Ostrava-město	Moravskoslezský kraj	598836_1	90.0	3.0	71.0	...	1.197605	2.286336	25.585193	26.238432	12.411541	15.133370	45.944475	5.498095	3.483941	22.972237

5 rows × 94 columns

Now we can melt the party columns into 'hlasy' (which would contain values of every party column) and 'rok_kandidatka_strana'.

fixed_vars = list(set(K.columns) - set(renamed_cols_13 + renamed_cols_17)) # get fixed columns

K = pd.melt(K, id_vars=fixed_vars, var_name='rok_kandidatka_strana', value_name='hlasy') # melt

K[K.columns[-5:]].head()

	zeny	vzd_zakl	okres_nazev	rok_kandidatka_strana	hlasy
0	702.0	201.0	Příbram	2017_1_občanskádemokratickástrana	60.0
1	385.0	131.0	Bruntál	2017_1_občanskádemokratickástrana	19.0
2	426.0	89.0	Praha-západ	2017_1_občanskádemokratickástrana	66.0
3	357.0	139.0	Příbram	2017_1_občanskádemokratickástrana	44.0
4	926.0	228.0	Ostrava-město	2017_1_občanskádemokratickástrana	90.0

'rok_kandidatka_strana' column should contain political party names only. To achive that we need to split it in three distinct columns: 'rok', 'kandidatka' and 'strana' which would effectively let us group by political party as we wanted in the beginning.

def split_col(col_name, num_of_cols, df):
    df[col_name.split('_', maxsplit=num_of_cols-1)] = df[col_name].str.split('_', n=num_of_cols-1, expand=True)
    return df.drop(col_name, axis=1)

K = split_col('rok_kandidatka_strana', 3, K) # split columns with votes

K[K.columns[-5:]].head()

	okres_nazev	hlasy	rok	kandidatka	strana
0	Příbram	60.0	2017	1	občanskádemokratickástrana
1	Bruntál	19.0	2017	1	občanskádemokratickástrana
2	Praha-západ	66.0	2017	1	občanskádemokratickástrana
3	Příbram	44.0	2017	1	občanskádemokratickástrana
4	Ostrava-město	90.0	2017	1	občanskádemokratickástrana

According to README.txt file, which is contained in the dataset's folder, some political parties, even though they are the same entity, have slightly different names throughout the years. Let's rename those subjects and abbreviate them to their standard.

naming = {
    'ČSSD': [1, 4], 
    'SSO': [2, 12],
    'PIR': [3, 15],
    'TOP': [4, 20],
    'ODS': [6, 1],
    'KDU': [11, 24],
    'PRBL': [12, 5],
    'SPO': [15, 30],
    'SPD': [17, 29],
    'DSSS': [18, 28],
    'ANO': [20, 21],
    'KSČM': [21, 8],
    'SZ': [23, 9],
}

for party, values in naming.items():
    K.loc[(K['rok'] == '2013') & (K['kandidatka'] == str(values[0])), 'strana'] = party
    K.loc[(K['rok'] == '2017') & (K['kandidatka'] == str(values[1])), 'strana'] = party

K.loc[K['strana'] == 'starostovéanezávislí', 'strana'] = 'STAN'

K[K.columns[-5:]].head()

	okres_nazev	hlasy	rok	kandidatka	strana
0	Příbram	60.0	2017	1	ODS
1	Bruntál	19.0	2017	1	ODS
2	Praha-západ	66.0	2017	1	ODS
3	Příbram	44.0	2017	1	ODS
4	Ostrava-město	90.0	2017	1	ODS

After dropping duplicate columns of 'obec_okrsek' we'll split it in two distinct columns 'obec' and 'okrsek' since it holds two values in one.

K = split_col('obec_okrsek', 2, K) # split obec_okrsek in two distinct columns

This is what we're working with now:

K.head()

	vzd_nast	nar_romska	vzd_str_be_a	obec	v65avic	vzd_nast_a	nepracduch_a	sum_v0_14_a	eko_zam	v65avic_a	...	sum_v0_14	eko_zam_a	zeny	vzd_zakl	okres_nazev	hlasy	rok	kandidatka	strana	okrsek
0	37.0	0.0	26.459144	539911	233.0	2.879377	23.735409	14.163424	574.0	18.132296	...	182.0	44.669261	702.0	201.0	Příbram	60.0	2017	1	ODS	9
1	13.0	0.0	30.852341	597520	111.0	1.560624	26.770708	15.846339	335.0	13.325330	...	132.0	40.216086	385.0	131.0	Bruntál	19.0	2017	1	ODS	4
2	19.0	0.0	27.056019	531821	79.0	2.264601	13.230036	18.235995	432.0	9.415971	...	153.0	51.489869	426.0	89.0	Praha-západ	66.0	2017	1	ODS	1
3	14.0	0.0	31.284916	541231	140.0	1.955307	27.653631	14.385475	300.0	19.553073	...	103.0	41.899441	357.0	139.0	Příbram	44.0	2017	1	ODS	1
4	42.0	1.0	26.238432	598836	279.0	2.286336	22.972237	15.459989	844.0	15.187806	...	284.0	45.944475	926.0	228.0	Ostrava-město	90.0	2017	1	ODS	1

5 rows × 44 columns

K.columns

Index(['vzd_nast', 'nar_romska', 'vzd_str_be_a', 'obec', 'v65avic',
       'vzd_nast_a', 'nepracduch_a', 'sum_v0_14_a', 'eko_zam', 'v65avic_a',
       'vzd_str_sm_a', 'eko_nezam', 'vzd_vysok_a', 'muzi_a', 'vzd_vos',
       'vzd_vos_a', 'muzi', 'vzd_zakl_a', 'rodst_rozv', 'vzd_str_be', 'zeny_a',
       'nepracduch', 'kraj_nazev', 'nazev_obce', 'nar_romska_a', 'id_n',
       'nab_rimsko_a', 'oby_celkem', 'rodst_rozv_a', 'vzd_str_sm',
       'eko_nezam_a', 'eko_post_p_a', 'nab_rimsko', 'vzd_vysok', 'sum_v0_14',
       'eko_zam_a', 'zeny', 'vzd_zakl', 'okres_nazev', 'hlasy', 'rok',
       'kandidatka', 'strana', 'okrsek'],
      dtype='object')

Let's move to another section of the report.

Exploratory analysis¶

In this section, we'll look at how the values are distributed and explore them in greater depth. For starters, a simple overview can do the job.

K.describe().apply(lambda x: x.round(2))

	vzd_nast	nar_romska	vzd_str_be_a	v65avic	vzd_nast_a	nepracduch_a	sum_v0_14_a	eko_zam	v65avic_a	vzd_str_sm_a	...	vzd_str_sm	eko_nezam_a	eko_post_p_a	nab_rimsko	vzd_vysok	sum_v0_14	eko_zam_a	zeny	vzd_zakl	hlasy
count	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	...	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	162000.00	161630.00
mean	16.61	0.34	30.68	111.59	2.18	23.74	14.23	307.75	16.60	22.01	...	163.56	4.88	5.88	72.98	75.33	99.59	42.96	358.07	105.52	12.55
std	14.02	1.82	6.99	85.10	1.04	7.05	3.33	225.55	6.21	4.87	...	125.48	2.32	2.24	84.40	84.94	74.30	5.72	252.28	75.77	28.39
min	0.00	0.00	4.14	3.00	0.00	1.90	0.00	0.00	1.20	2.56	...	1.00	0.00	0.00	0.00	0.00	0.00	0.00	10.00	2.00	0.00
25%	4.00	0.00	26.51	36.00	1.52	19.70	12.28	93.00	13.06	19.09	...	45.00	3.40	4.35	20.00	12.00	34.00	39.92	113.00	41.00	0.00
50%	14.00	0.00	31.35	94.00	2.15	23.27	14.16	293.00	15.79	22.40	...	151.00	4.49	5.63	49.00	45.00	91.00	43.33	354.50	95.00	1.00
75%	27.00	0.00	35.39	170.00	2.82	27.17	16.22	468.25	19.16	25.33	...	260.00	5.91	7.15	95.00	112.00	147.00	46.53	556.00	150.00	10.00
max	93.00	59.00	63.64	503.00	9.43	100.00	28.32	1734.00	93.40	41.07	...	860.00	36.18	26.32	939.00	687.00	671.00	62.03	1812.00	513.00	559.00

8 rows × 36 columns

Correlation table to check column dependencies.

%matplotlib inline

import seaborn.objects as so
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()

mod_K = K.select_dtypes(exclude=['object'])
mod_K['rok'], mod_K['kandidatka'] = K['rok'].astype('int'), K['kandidatka'].astype('int')

plt.figure(figsize=(12, 8))
sns.heatmap(mod_K.corr(), annot=False, cmap="YlGnBu")

plt.show()

We've just uncovered uninteresting correlations that make sense if we take a closer look.

For example if there is higher share of people aged over 65 ('v65avic_a') then there is higher share of unemployed retired people ('nepracduch_a').
Another example would be the higher population share of men the less of women.
More explainable ones like the more women (count) the more employed unretired people ('eko_zam'). But that's just because columns with counts and not shares of population say next to nothing. Generally, weirdly said, the more people there is the more people there is, no matter what kind.
Elementary education level share ('vzd_zakl_a') is negatively correlated with employment in productive age ('eko_zam_a') and higher education level share ('vzd_vysok_a')
There is also a less visible negative correlation between the share of divorced people ('rodst_rozv_a') and the share of Catholics ('nab_rimsko_a').

Votes¶

Let's focus on votes. At first, we should compensate for deleted "_p" columns by creating 'hlasy_p' column.

total_13, total_17 = K.loc[K['rok'] == '2013', 'hlasy'].sum(), K.loc[K['rok'] == '2013', 'hlasy'].sum()

K.loc[K['rok'] == '2013', 'hlasy_p'] = 100 * K['hlasy'] / total_13
K.loc[K['rok'] == '2017', 'hlasy_p'] = 100 * K['hlasy'] / total_17

Down below, we can see a hint of election results (from the sample's point of view) as percentage of votes.

cutoff = 2
party_votes = K.groupby(['rok', 'strana'])['hlasy_p'].sum().reset_index().sort_values(by='hlasy_p', ascending=False)
party_votes.loc[party_votes['hlasy_p'] < cutoff, 'strana'] = 'jiné'
party_votes = party_votes.groupby(['rok', 'strana'])['hlasy_p'].sum().reset_index().sort_values(by='hlasy_p', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(data=party_votes, x="rok", y="hlasy_p", hue="strana", estimator=sum, order=['2013', '2017'])

plt.xlabel('Year')
plt.ylabel('Percentage of Votes (%)')
plt.title('Percentage of Votes by Political Party in Parliamentary elections')

plt.show()

Let's see a sketch of how many votes were from which municipality with the help of a word cloud.

from wordcloud import WordCloud

votes_per_obec = K[['hlasy', 'nazev_obce']].groupby('nazev_obce')['hlasy'].sum()

wordcloud = WordCloud(width=800, height=400, background_color ='white').generate_from_frequencies(votes_per_obec.to_dict())

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.show()

What needs not to be overlooked are correlations between the percentage of votes for certain parties against other columns (which describe shares of people in a population unit).

column_names_map = {
    'nab_rimsko_a': 'Catholics',
    'nar_romska_a': 'Romani',
    'vzd_vysok_a': 'Higher Ed',
    'vzd_vos_a': 'Higher Vocational Ed',
    'vzd_nast_a': 'Post-secondary Supplementary Ed',
    'vzd_str_sm_a': 'Secondary Ed',
    'vzd_str_be_a': 'Secondary Ed without Maturita',
    'vzd_zakl_a': 'Elementary Ed or None',
    'oby_celkem': 'Total Population',

    'rodst_rozv_a': 'Divorced',

    'eko_zam_a': 'Employed unretired',
    'eko_post_p_a': 'Emloyed retired',
    'eko_nezam_a': 'Unemployed unretired',
    'nepracduch_a': 'Unemployed retired', 
}

K_renamed = K.rename(columns=column_names_map)
grouped_by_strana = K_renamed.groupby('strana')

correlations = {}

for party, group in grouped_by_strana:
    correlation = group[['Catholics', 'Romani', 'hlasy_p', 'Higher Ed', 'Higher Vocational Ed', 'Post-secondary Supplementary Ed', 
                         'Secondary Ed', 'Secondary Ed without Maturita', 'Elementary Ed or None', 
                         'Total Population']].corr()['hlasy_p'].drop('hlasy_p')
    correlations[party] = correlation

corr_df = pd.DataFrame(correlations)

plt.figure(figsize=(14, 10))
sns.heatmap(corr_df.T, annot=False, cmap="Blues")

plt.xticks(rotation=90, fontsize='small')

plt.show()

This shows us very interesting correlations that should be explained later.

For example it seems like the bigger the city is - population-wise - the bigger the chances are for the top parties.
Or we can see, without much surprise, that KDU party which is in fact a 'Christian and Democratic Union' has higher chances of recieving votes from cities that have higher percentage of Catholics living there.

Status¶

What we should consider exploring is, generally speaking, educational, social and economic status.

The problem is however that we don't have sufficient-enough columns that can determine each status of a population unit. But we can at least try to tame what we have into something that is able to point us in the right direction.

Firstly, let's see the distribution of educational status.

edu_columns = ['Higher Ed', 'Higher Vocational Ed', 'Post-secondary Supplementary Ed',
               'Secondary Ed', 'Secondary Ed without Maturita', 'Elementary Ed or None']

long_form = K_renamed.melt(value_vars=edu_columns, var_name='education', value_name='share_of_ppl_w_edu')

sns.catplot(data=long_form, x='education', y='share_of_ppl_w_edu', kind='bar', errorbar='sd', aspect=2)

plt.xticks(rotation=0, fontsize='x-small')
plt.xlabel('Education Level')
plt.ylabel('Share of People with Education Level (%)')


plt.show()

What about social status? That could help us determine the conservative-progressive axis (for example less divorces and more religious people could be suggesting that the region in question is more conservative - or maybe values traditions). Higher share of Roma population in the region can be, sadly enough, correlated with more socially deprived status of the region itself.

Let's plot something about population's values.

soc_columns = ['Divorced', 'Catholics']
long_form = K_renamed.melt(value_vars=soc_columns, var_name='social', value_name='share_of_ppl_w_social')

plt.figure(figsize=(12, 6))
sns.violinplot(x='social', y='share_of_ppl_w_social', data=long_form)

plt.xlabel('Social Status')
plt.ylabel('Share of People with Social Status (%)')

plt.show()

And we should look if the share of Roma population varies widely.

plt.figure(figsize=(10, 5))

sns.kdeplot(K['nar_romska_a'], fill=True)

print(f"The maximum is {round(K['nar_romska_a'].max(), 2)} \
\nThe minimum is {round(K['nar_romska_a'].min(), 2)} \
\nand the mean is {round(K['nar_romska_a'].mean(), 2)} \
\nand the variance is {round(K['nar_romska_a'].var(), 2)}")

plt.xlabel('Share of Roma People (%)')
plt.ylabel('Density')

plt.show()

The maximum is 7.04 
The minimum is 0.0 
and the mean is 0.04 
and the variance is 0.06

And finally, closely tied, economic status.

eco_columns = ['Employed unretired', 'Emloyed retired', 'Unemployed unretired', 'Unemployed retired']
long_form = K_renamed.melt(value_vars=eco_columns, var_name='work', value_name='share_of_ppl_w_work_status')

sns.catplot(data=long_form, x='work', y='share_of_ppl_w_work_status', kind='bar', errorbar='sd', aspect=2)

plt.title('Share of People with Work Status\n(Employment with respect to Retirement)')
plt.xlabel('Work Status')
plt.ylabel('Share of People with Work Status (%)')

plt.show()

We will use that in further analysis, hoping that the share of unemployed people points at deprived regions.

So far we've explored how different variables (which account for shares of population) correlate with shares of votes for certain parties. We've also managed to visualize some interesting informations about the distributions of educational, social and economic status.

In the next part we would like dive deeper and scrutinize if any of the explored variables actually tell us something about voting preferences.

Explanatory analysis¶

What we would like to do is to learn something about geographical units with similar characteristics.

At the very end we should try to learn which parameters are valuable for more or less correct prediction of district results.

We would like to answer some burning questions (although the results will heavily depend on our metric):

Does deprivation of a region push voters to vote for some sort of parties? Do deprivated regions vote the same?
Based on features that we have, can we predict the election result?

Clustering¶

At first we'll need to group similar regions.

Deprivation¶

Let's start with trying to cluster deprived regions. For this we'll use K-means algorithm with columns that we think should hint at possible deprivation of the region:

Elementary education or without an education. Hints at the lack of resources of the region.
Unemployed unretired because jobless people in their active age should give us insight in how the region is doing economically (either low number of job opportunities or the incapibility of finding a job for whatever reason).
Share of Romani population. Population groups that lack financial and social capital are indirectly evicted to similar places thus generating concentrated groups of people with similar socio-economic status. Sadly, higher share of Romani people in a region tells us that this has most likely occured.
Total population and geographical location. Georaphical surroundings will of course tell us something about the district itself. But the information about the region alone would not be sufficient because for example the regional city in a poor region can be quite well economically but its neighboring villages might not. NOTE: using this turned out to be a wrong way since the clustering just split the data according to it. For this to be effectively utilized we would need more data.
It might be worth looking into Voter turnout as well because people without time to be interested in politics might be an indicator of something. This would be calculated as total votes divided by the total population of the district (or more effectively - total votes in a given municipality / total population of the municipality).

This "metric" will most likely not work very well - there are lot of unknowns about the region - we would need more variables. But it should give us an idea. It also might be worth exploring the municipality instead of the district.

As a regional unit we'll try to group municipalities (obec), not districts (okrsek). We'll be using these columns:

K.dropna(inplace=True)

data = K[['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'rok', 'obec']].copy()
data.drop_duplicates(inplace=True)

# get municipality population
unique_muni_year = K[['rok', 'obec', 'oby_celkem']].drop_duplicates()
muni_population_df = unique_muni_year.groupby(['rok', 'obec'])['oby_celkem'].sum().reset_index()
muni_population_df.rename(columns={'oby_celkem': 'muni_population'}, inplace=True)

data = data.merge(muni_population_df, on=['rok', 'obec'], how='left')

# get municipality turnout
unique_muni_year_strana = K[['rok', 'obec', 'strana', 'hlasy']].drop_duplicates()
muni_turnout_df = unique_muni_year_strana.groupby(['rok', 'obec'])['hlasy'].sum().reset_index()
muni_turnout_df.rename(columns={'hlasy': 'muni_votes_sum'}, inplace=True)

data = data.merge(muni_turnout_df, on=['rok', 'obec'], how='left')

data.loc[:, 'muni_turnout'] = data['muni_votes_sum'] / data['muni_population']
data = data[data['muni_turnout'] <= 1]

data = data.groupby('obec').mean().reset_index()

data.drop(columns=['rok', 'muni_votes_sum', 'muni_population'], inplace=True)
data.drop_duplicates(inplace=True)
cluster_data = data.drop(columns=['obec'])

cluster_data.head()

	vzd_zakl_a	nar_romska_a	eko_nezam_a	muni_turnout
0	6.741604	0.017507	2.835240	0.550225
1	8.371457	0.000000	2.968587	0.385568
2	6.422949	0.013548	2.682892	0.452729
3	36.529680	0.000000	14.155251	0.415525
4	8.229658	0.097575	3.218088	0.434454

This is what we're dealing with ['Population share of people with elementary education', 'Share of Romani people', 'Share of unemployed people', 'Municipality turnout']. Let's rescale those values before passing them to the actual algorithm.

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer

data_transformer = ColumnTransformer(
    transformers=[
        ("ScaleRobust", RobustScaler(), ["vzd_zakl_a", "nar_romska_a", "eko_nezam_a", "muni_turnout"]),
    ]
)

features = data_transformer.fit_transform(cluster_data)

pd.DataFrame(features, columns=['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'muni_turnout']).head()

	vzd_zakl_a	nar_romska_a	eko_nezam_a	muni_turnout
0	-1.725250	0.017507	-0.679902	0.237246
1	-1.457444	0.000000	-0.627379	-1.276041
2	-1.777609	0.013548	-0.739909	-0.658794
3	3.169320	0.000000	3.778829	-1.000721
4	-1.480744	0.097575	-0.529106	-0.826753

Every row column has been scaled in a way that the mean is 0. Let's run the K-means algorithm for 2 to 20 clusters and choose the one that has the highest silhouette score.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

min_cluster_num = 2
silhouette_scores = []
for i in range(min_cluster_num, 21):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
    kmeans.fit(features)
    score = silhouette_score(features, kmeans.labels_)
    silhouette_scores.append(score)

# Finding the number of clusters with the highest Silhouette Score
optimal_clusters_num = np.argmax(silhouette_scores) + min_cluster_num
kmeans = KMeans(n_clusters=optimal_clusters_num, random_state=0, n_init='auto')
kmeans.fit(features)

cluster_data['cluster'] = kmeans.labels_
data['cluster'] = kmeans.labels_

print(f'Optimal cluster number: {optimal_clusters_num}, Silhouette score: {silhouette_score(features, kmeans.labels_)}')
cluster_data.head()

Optimal cluster number: 2, Silhouette score: 0.31666351891199335

	vzd_zakl_a	nar_romska_a	eko_nezam_a	muni_turnout	cluster
0	6.741604	0.017507	2.835240	0.550225	0
1	8.371457	0.000000	2.968587	0.385568	0
2	6.422949	0.013548	2.682892	0.452729	0
3	36.529680	0.000000	14.155251	0.415525	1
4	8.229658	0.097575	3.218088	0.434454	0

Well, we wanted more then two clusters but we can see that the "optimal" solution (picking the one with the best silhouette score) consists of them. And the silhouette score for these clusters is moderate but not necesarilly bad.

Let's try a different perspective - let's try to create a column derived from PCA. We'll use the same variables. Then we can compare those two.

from sklearn.decomposition import PCA

pca = PCA(n_components=1)
principal_component = pca.fit_transform(features)

data['PCA_label'] = principal_component[:, 0]

plt.figure(figsize=(10, 3))
sns.scatterplot(data=data, x='PCA_label', y='cluster', hue='cluster', palette='deep')
plt.title('PCA based on clusters')
plt.ylim(-0.5, 1.5) 
plt.ylabel('Cluster')
plt.yticks([0, 1])
plt.show()

It seems like it got the gist of it the same way the clustering has. From a specific value of PCA_label, the cluster does not change - neither it does in the opposite direction (ignoring some wiggle space in the middle - around the value 0 of our PCA_label).

Let's look if it has correlations which we would expect from our metric.

# plot correlations between features and PCA
plt.figure(figsize=(5, 5))
ax = sns.heatmap(pd.DataFrame(pca.components_, columns=['vzd_zakl_a', 'nar_romska_a', 'eko_nezam_a', 'muni_turnout']).T, annot=True, cmap="Blues")
plt.title('Correlation between features and PCA')
ax.set_xticks([])
plt.xlabel('PCA')
plt.show()

This looks great. As it captures the nature of our presumpted metric. Without doing a score or a metric, that would be subjectively weighted by hand, we arrived at something that resembled what we wanted (correlation-wise). It even gave the most importance to employment which is undoubtedly one of the biggest estimators of prosperity of a region.

So we say, with higher PCA_label the theoretically higher deprivation of a municipality. This is a bit of a stretch but let's work with that. It is nothing but an index that tells us something common about a region - in this case it is most likely unemployment (which of course at least hints at something). We'll rename our PCA_label to depriv_index - still an abstract term but more readable.

Why did we do all that? Let's examine the correlation between votes for a party and our new abstractly-called deprivation index.

# rename PCA_label to depriv_index
data.rename(columns={'PCA_label': 'depriv_index'}, inplace=True)

# merge data with the original dataset K
K = K.merge(data[['obec', 'muni_turnout', 'cluster', 'depriv_index']], on='obec', how='left')

# drop rows with nan values
K.dropna(inplace=True)

# correlations between votes for a party and depriv_index
grouped_by_strana = K.groupby('strana')

# 

correlations = {}

for party, group in grouped_by_strana:
    correlation = group[['depriv_index', 'hlasy_p']].corr()['hlasy_p'].drop('hlasy_p')
    correlations[party] = correlation


corr_df = pd.DataFrame(correlations)

max_corr = corr_df.abs().max().max()
vmin, vmax = -max_corr, max_corr

plt.figure(figsize=(5, 10))
sns.heatmap(corr_df.T, annot=False, cmap="PiYG", vmin=vmin, vmax=vmax)

plt.xlabel('Deprivation Index')
plt.xticks([])
plt.show()

We've expected to find stronger correlations with for example 'DSSS' or 'česká národní fronta'. But what we got instead are negative correlations with the most popular liberal (in the economic sense) right-wing parties. All the correlations are not strong but let's still have a look at the most positively and the most negatively correlated pairs.

# make a figure with 4 subplots showing correlations between votes for a party and depriv_index
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

sorted_corr_df = corr_df.loc['depriv_index'].sort_values(ascending=False)

# get two highest correlations and two lowest correlations
top_corr = sorted_corr_df.index[:2]
bot_corr = sorted_corr_df.index[-2:]

corr_list = list(top_corr) + list(bot_corr)

# get parties with correlations in corr_list
corr_df_sel = corr_df[corr_list]

# plot correlations with regression line and confidence interval for each party in corr_df_sel
for i, party in enumerate(corr_df_sel.columns):
    sns.regplot(ax=axes[i // 2, i % 2],
                data=K[K['strana'] == party],
                x='depriv_index',
                y='hlasy_p',
                line_kws={'color': 'red'},
                scatter_kws={'s': 3, 'alpha': 0.1})
    
    axes[i // 2, i % 2].set_title(party)
    if i % 2 == 0:
        axes[i // 2, i % 2].set_ylabel('Percentage of Votes (%)')
    else:
        axes[i // 2, i % 2].set_ylabel('')
    axes[i // 2, i % 2].set_xlabel('Deprivation Index')

plt.show()

And now, just for the sake of learning statistical testing, let's formulate a null hypothesis $H_{0}$ which says that there is no linear relationship between the deprivation index and the votes for a given political party. We should test this for these four - we have 4 $H_0$ each with a different null distribution but still we'll perform significance level correction (Bonferroni). That gives us $\alpha = 0.05 / 4 = 0.0125$). We'll use t-statistic.

import scipy.stats as stats

for i, party in enumerate(corr_df_sel.columns):
    x = K[K['strana'] == party]['depriv_index']
    y = K[K['strana'] == party]['hlasy_p']
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

    if p_value < 0.05 / 4:
        print(f'******** Testing for party: {party} ********')
        print(f'R-value: {np.round(r_value, 3)}')
        print(f'P-value: {"{:.3e}".format(p_value)}')
        print(f'Standard error: {"{:.3e}".format(std_err)}\n')

******** Testing for party: DSSS ********
R-value: 0.111
P-value: 8.945e-18
Standard error: 3.712e-06

******** Testing for party: KSČM ********
R-value: 0.074
P-value: 1.057e-08
Standard error: 4.141e-05

******** Testing for party: ODS ********
R-value: -0.284
P-value: 5.084e-111
Standard error: 4.134e-05

******** Testing for party: TOP ********
R-value: -0.305
P-value: 1.877e-128
Standard error: 4.751e-05

To avoid unintended data-dredging we should have had tested in on a different sample but as we can see the p-values are so low that we can safely conclude that we haven't really p-hacked in any way. Our results are statistically significant - mostly because of the large sample size - therefore, we can reject our $H_0$ but the question still is if they are practically significant. Just with a quick look at the data above, it does not seem to follow a linear trend very clearly and it is not strong at all. We can believe that there is some sort of a weak trend for TOP09 party.

To conclude this part, although we have found results that are statistically significant (just to make an example) there is no reason to further work with the deprivation index since its linear relations with the percent of votes for a party are very weak and it is not a sufficient predictor alone. To model socio-economic deprivation, this index does not capture the complexity of it at all if any. This section was about trying out clustering, dimensionality reduction and hypothesis testing.

Predicting¶

GBDT¶

What we can try to do is to get a black box that tries to predict the winning party for given district. We should then compare it to the average (or some simple model) to test if a more complex model even makes sense in this case.

We'll use Gradient Boosted Decision Trees since they perform very well on tabular (low-dimensional) data. In the end, we would like to be able to say which parties compete for votes and which have very distinct voter bases (in terms of districts).

But at first we'll need to prepare our data. Our targets will be the winning parties of the considered districts - thats what we need to transform. With this approach we'll lose a lot of information since we will be ignoring the parties that come closely second etc. But let's continue with prediction of the winning party only just for the sake of simplicity. And we'll of course make a selection of the columns from which we're going to predict.

We're going to select columns that talk about shares and not the absolute amounts. And we're going to get rid of columns that correlate greatly with each other.

This is what the predictions will be based on:

economic = ['eko_post_p_a', 'eko_zam_a', 'eko_nezam_a']
educational = ['vzd_zakl_a', 'vzd_vysok_a', 'vzd_nast_a', 'vzd_vos_a', 'vzd_str_be_a', 'vzd_str_sm_a']
location = ['okrsek', 'obec']
social = ['sum_v0_14_a', 'v65avic_a', 'nab_rimsko_a', 'rodst_rozv_a', 'nar_romska_a']
sex = ['zeny_a'] 
political = ['strana', 'hlasy', 'rok', 'muni_turnout']

# 'rodst_rozv', 'id_n', 'v65avic', 'muzi', 'zeny', 'nar_romska', 'nab_rimsko', 'sum_v0_14',
# 'vzd_zakl', 'vzd_str_sm', 'vzd_vysok', 'vzd_str_be', 'vzd_nast', 'vzd_vos',
# 'eko_nezam', 'eko_zam', 'nepracduch'
# 'hlasy_p'
# 'muzi_a', since heavily correlated with 'zeny_a'
# 'okres_nazev', 'nazev_obce', since i dont want the model to fit to the names of the districts
# 'kandidatka', since i dont want this data to be used for prediction
# 'oby_celkem' because we use 'muni_population' instead
# 'kraj_nazev'
# 'nepracduch_a' because of correlations

data = K[economic + educational + social + sex + political + location].copy()

data = data.merge(muni_population_df, on=['rok', 'obec'], how='left')
data.dropna(inplace=True)

example_df = data.drop(['strana', 'okrsek', 'rok', 'obec', 'hlasy'], axis=1)
example_df.head()

	eko_post_p_a	eko_zam_a	eko_nezam_a	vzd_zakl_a	vzd_vysok_a	vzd_nast_a	vzd_vos_a	vzd_str_be_a	vzd_str_sm_a	sum_v0_14_a	v65avic_a	nab_rimsko_a	rodst_rozv_a	nar_romska_a	zeny_a	muni_turnout	muni_population
0	5.758755	44.669261	4.824903	15.642023	11.517510	2.879377	1.089494	26.459144	23.891051	14.163424	18.132296	8.638132	14.630350	0.000000	54.630350	0.507470	8166.0
1	4.321729	40.216086	6.842737	15.726291	5.282113	1.560624	1.080432	30.852341	21.008403	15.846339	13.325330	6.962785	10.564226	0.000000	46.218487	0.383460	5127.0
2	10.369487	51.489869	3.337306	10.607867	15.733015	2.264601	2.383790	27.056019	20.619785	18.235995	9.415971	11.442193	7.151371	0.000000	50.774732	0.550060	839.0
3	7.681564	41.899441	6.145251	19.413408	6.145251	1.955307	0.977654	31.284916	23.882682	14.385475	19.553073	9.636872	9.636872	0.000000	49.860335	0.530726	716.0
4	5.498095	45.944475	3.483941	12.411541	15.133370	2.286336	1.197605	26.238432	25.585193	15.459989	15.187806	15.133370	6.641263	0.054437	50.408274	0.596625	1837.0

def get_first_k_parties_places(k, df):
    # Create a copy of the DataFrame to avoid modifying the original one
    df_copy = df.copy()
    parties_list = []

    for i in range(k):
        # Group by 'rok' and 'okrsek' and get the kth highest 'hlasy' indices
        indices = df_copy.groupby(['rok', 'okrsek'])['hlasy'].nlargest(i+1).groupby(['rok', 'okrsek']).nth(i).index.get_level_values(2)
        
        # Extract the parties that came in kth place using the indices
        parties = df_copy.loc[indices, ['rok', 'okrsek', 'strana']].reset_index(drop=True)
        
        # Rename the 'strana' column to reflect the kth place
        parties.rename(columns={'strana': f'place_{i+1}_party'}, inplace=True)
        
        parties_list.append(parties)

    for parties_df in parties_list:
        df = df.merge(parties_df, on=['rok', 'okrsek'], how='left')

    return df

data = get_first_k_parties_places(1, data)

data[['place_1_party']].value_counts()

place_1_party
ANO              99723
TOP              42609
ČSSD             15878
PIR               1426
ODS               1271
KSČM               368
STAN                31
Name: count, dtype: int64

Unluckily, we can see that we've lost valuable information about the parties that came second etc. For example SPD, a party that has had relatively big result in terms of total votes, didn't win any district in our sample. And of course the target classes are really uneven so we have to adjust for that in the model.

from sklearn.metrics import accuracy_score, log_loss, f1_score

def print_scores(probabilities, predictions, targets):
    print(f'Accuracy: {accuracy_score(targets, predictions)}')
    print(f'F1-score (macro): {f1_score(targets, predictions, average="macro")}')
    print(f'F1-score (micro): {f1_score(targets, predictions, average="micro")}')

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import log_loss

features_df = data.drop(['place_1_party', 'strana', 'okrsek', 'rok', 'obec', 'hlasy'], axis=1)

X = features_df # Features
y = data['place_1_party']  # Target

# transform data
data_transformer = ColumnTransformer(
    transformers=[
        ("StandardScale", StandardScaler(), [
            'eko_post_p_a', 'eko_zam_a', 'eko_nezam_a', # economic
            'vzd_zakl_a', 'vzd_vysok_a', 'vzd_nast_a', 'vzd_vos_a', 'vzd_str_be_a', 'vzd_str_sm_a', # educational
            'sum_v0_14_a', 'v65avic_a', 'nab_rimsko_a', 'rodst_rozv_a', # social
            'zeny_a',  # sex
            'muni_turnout' # political
        ]),
        ("RobustScale", RobustScaler(), ['nar_romska_a', 'muni_population']), # location and social
    ]
)

X = data_transformer.fit_transform(X)

### MEL BYCH MIT JESTE VALIDATION SET
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Encode the target variables if they are strings
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Consider class weights
encoded_classes = label_encoder.transform(y_train.unique())
class_counts = {cls: (y_train_encoded == cls).sum() for cls in encoded_classes}
num_classes = len(encoded_classes)

# Calculate sample weights
sample_weight = np.array([len(y_train_encoded) / (class_counts[cls] * num_classes) for cls in y_train_encoded])

# Initialize and train the classifier
classifier = GradientBoostingClassifier(random_state=42)
classifier.fit(X_train, y_train_encoded, sample_weight=sample_weight)

GradientBoostingClassifier(random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Let's see the scores.

# Predict probabilities
probabilities = classifier.predict_proba(X_test)
predictions = classifier.predict(X_test)

print_scores(probabilities, predictions, y_test_encoded)

Accuracy: 0.42970057652966337
F1-score (macro): 0.4851621119123744
F1-score (micro): 0.42970057652966337

We can see right of the bat that the model is not very capable of predicting the winners. The accuracy is very low.

But still let's have a look at the training curves.

train_errors, test_errors = [], []
# Calculate the error at each stage
for y_train_pred in classifier.staged_predict_proba(X_train):
    train_errors.append(log_loss(y_train_encoded, y_train_pred, sample_weight=sample_weight))

for y_test_pred in classifier.staged_predict_proba(X_test):
    test_errors.append(log_loss(y_test_encoded, y_test_pred))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(train_errors, label='Training Error', color='blue')
plt.plot(test_errors, label='Test Error', color='red')
plt.title('Error at Each Boosting Stage')
plt.xlabel('Boosting Iterations')
plt.ylabel('Log Loss')
plt.legend()
plt.show()

It does not seem like there's much room for an improvement (in terms of number of iterations).

Which features were the most important for predictions?

feature_importances = classifier.feature_importances_
column_labels = list(features_df.columns)
feature_nums = column_labels

plt.figure(figsize=(10, 4))
plt.barh(feature_nums, feature_importances)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.show()

We can see that, surprisingly, the model considered the share of people with college-entry level education as the most important criterium. Following with, as we would have guessed, the scaled number of people living in the municipality where the district is located.

Let us compare it to the simple model. If it made sense to do it this way. I'll choose kNN as a base-line model. And then compare these two.

# perform a kNN algorithm as a baseline model to compare the complex one to
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train_encoded)

KNeighborsClassifier()

knn_predictions = knn.predict(X_test)
knn_probabilities = knn.predict_proba(X_test)

print_scores(knn_probabilities, knn_predictions, y_test_encoded)

Accuracy: 0.5937635608455768
F1-score (macro): 0.48140470832275417
F1-score (micro): 0.5937635608455768

So it looks like our simple model performs better than the complex one. It might be due to unweigted target classes in the simpler model. But let's test it. Let's set null hypothesis $H_0$ that the simpler model performs the same as the complex one. We'll use bootstrap resampling and we'll compare F1 macro score.

f1_diffs = []

n_bootstrap = 1000
y_test_array = y_test_encoded

# Perform the bootstrap sampling
for _ in range(n_bootstrap):
    indices = np.random.choice(range(len(X_test)), size=len(X_test), replace=True)
    X_sample, y_sample = X_test[indices], y_test_array[indices]

    # Evaluate GBDT
    y_pred_gbdt = classifier.predict(X_sample)
    f1_gbdt = f1_score(y_sample, y_pred_gbdt, average='macro')

    # Evaluate kNN
    y_pred_knn = knn.predict(X_sample)
    f1_knn = f1_score(y_sample, y_pred_knn, average='macro')

    # Calculate the difference in performance
    f1_diffs.append(f1_gbdt - f1_knn)

# Statistical Analysis
mean_diff = np.mean(f1_diffs)
std_diff = np.std(f1_diffs)
confidence_interval = np.percentile(f1_diffs, [2.5, 97.5])

print(f'The mean difference in F1-score is {mean_diff:.3f}')
print(f'The standard deviation of the differences is {std_diff:.3f}')
print(f'The 95% confidence interval is {confidence_interval}')

The mean difference in F1-score is -0.000
The standard deviation of the differences is 0.032
The 95% confidence interval is [-0.05605834  0.04883434]

Zero is included in the confidence interval. So we can not reject the null hypothesis therefore we can not conclude which model is better. But the point was that the complex model is worse or just as bad as the simple one.

Discussion and conclusion¶

Let's sum up what we've done. We got a pretty messy dataset with lot of redundant information. We've explored our data and tried to see if we can get some in-depth view from socio-economic features of our data. With that in mind we've attempted to cluster it. The best clusters that we could get were only two so we needed to explore a different approach. We performed PCA and reduced our selection to one dimension - one new column. Under inspection, it seemed to be performing well. But as we later figured out, this factor is relevant for voting preferences but very very weakly and was relevent only to some parties. To conclude this, we've tried a simple prediction task where we tried to guess the winning party of a district and then compared our complex model to a simpler one.

We can speculate about lower deprivation (in terms of our constructed index) having a slightly positive effect on the chances of the top right-wing liberal (economically) parties. But the result we've got is heavily dependent on how we set our metric - this needs to be addressed. So to answer our question if deprivation of a region pushes voters to vote for some sort of parties - there is only a really really weak (but statistically significant) push towards authoritarian parties (like DSSS and KSČM). But this could have easily resulted from setting our metric in a specific way. So we can not conclude much. To be sure in search of answers to this question, a more reasonable and robust metric is needed.

When we tried to predict a winning party from the information that we had at our disposal, it didn't go well. So to answer our second question that we set, We conclude that more information in general is needed to fully capture the nature of districts' voting preferences.