by Tanveer Sayyed

# The penalty of missing values in Data Science

#### And using a “soft” method to impute the same.

This post focuses more on a conceptual level rather than coding skills and is divided into two parts. Part-I describes the problems with missing values and when and why should we use mean/median/mode. Part-II **repudiates** Part-I and argues why we should use none of them and use the *soft* method instead— random but proportional representation.

### Part-I: Why do we remove missing values? When to use mean, median, mode? And Why?

The problem with missing data is that there is no fixed way of dealing with them, and the problem is universal. Missing values affect our performance and predictive capacity. They have the potential to change all our statistical parameters. The way they interact with outliers once again affects our statistics. Conclusions can thus be misleading.

The different missing values can be:

1. NaN

2. None

3.

4. “Null”

5. “missing”

6. “not available”

7. “NA”

While the last four are string values, pandas by default identify NaN(no assigned number) and None. However, both are not the same; the code snippet below shows why.

The problem is that if we do not remove the NaNs then, we are in for double jeopardy. Firstly we already are suffering from loss of true data and secondly if not handled with care NaNs start ‘*devouring*’ our true data and might get propagated throughout the data-set as we proceed. Let's instantiate two series and see.

Now let’s see what happens when we perform certain operations on those lists.

We can see how true data (integers 1, 2) were lost while performing operations (Out: 21, 22). Another thing to note is conflicting results in the in-built python method and the series method due to the presence of NaNs (Out: 23, 24).

Now let’s create a data-frame which has all the missing values stated above as well as a garbage value(‘#$%’). We’ll remove missing values by wrangling with this tiny toy data-set.

The data-frame has one complete row(i1) and one complete column(c2) filled with only NaNs. Other missing value identifiers have also been deliberately scattered.

Above we can see that the last term in c3 should be “True”(for “not available”). To make it so we’ll have to read the data-frame again. This time we’ll *force* pandas to identify “missing”/“not available” /”NA” as NaNs.

All missing values have been identified, shall we get rid of them by dropping them altogether?

Looks like we lost all the values! The *.dropna()* deletes the complete row(index) even if a “single” value is missing. Hence *.dropna()* comes with the price of losing data which may be valuable.

#### How to proceed?

Now one may presume imputing all missing values with — zero. But there is a fundamental problem with this approach: *the sanctity/veracity of our data is lost, as in the real world a missing term can take ANY value. But we are forcing it to take only one rigid value, i.e. 0.*

The *official document** *of sklearn mentions: (emphasis mine)

… infer them(missing values) from theknownpart of the data.

So what should we do now? A better option is to use **mean** as it at least is a better “representative” of a feature than zero. Why? Because for continuous/numeric features no matter how many times we add mean, it still gets conserved. Here is how:

Three numbers — 2, 6, 7 — have, mean = (2 + 6 + 7)/3 = 5

Assuming this list has an infinite number of missing values, lets impute it with mean: — 2, 6, 7, 5, 5, 5, 5….. The mean will remain 5 no matter how many times we add it!

But there are problems with mean. Firstly it is heavily influenced by outliers, mean(2 + 6 + 7+ **55**) = 17.5! Secondly, although it does ‘*represent’* a feature, it is also the *worst* in reflecting central tendency of a normal data(see below bullets **b & c** [respectively for right skewed and left skewed data]).

As we can clearly observe in bullets b & c, the mode best reflects the central tendency. **Mode** is the most frequent value in our data set. But when it comes to continuous data then mode can create *ambiguities*. There might be more than one mode or (rarely)none at all if none of the values are repeated. Mode is thus used to impute missing values in columns which are **categorical** in nature.

After mode, it is the median that reflects the central tendency the best. Which implies that for **continuous** data, the use of the median *is better* than mean! **Median** is the middle score of data-points when arranged in order. And unlike the mean, the median is not influenced by outliers of the data set — the median of the already arranged numbers (2, 6, 7, **55**) is 6.5!

So for categorical data using mode makes more sense and for continuous data the median. So why do we still use mean for continuous data?

#### Legacy

Previously in the world with no computers, it was easier to calculate the mean than the median. In those times it definitely made sense as re-arranging thousands of entries manually, each time the data-set is updated, and then finding the median was indeed a cumbersome task. But should we continue this legacy, when we have the power of computation at our fingertips, today? **No**, that would imply under-utilizing our potential.

But again, the rigidity remains, as we are still using a *single* value — mean/median/mode. We’ll discuss more about this in the next section shortly. For now let’s replace values with mean(in c0), median(in c1) and mode(in c3). Before, let’s deal with the garbage value ‘#$%’ at (‘i2’, ‘c3’).

The respective values are:

We’ll use 3 different methods to replace NaNs.

It looks like we’ll have to drop the c2 column altogether as it contains no data. *Note* that in the beginning one row and one column were completely filled with NaNs but we were only able to successfully manipulate the rows but not the columns. Dropping c2.

We finally got rid of all the missing values!

### Part-II: Random but proportional replacement (RBPR)

The above methods, *I think*, can be described as ** hard imputation approaches**, as they rigidly accept only one value. Now let’s focus on a “

*soft”*imputation approach. Soft because it makes use of

**. Here we are**

*probabilities**not forced*to pick a single value. We’ll replace NaNs

*randomly in a ratio which is “proportional” to the population without NaNs*(the proportion is calculated using probabilities but with a touch of randomness).

An explanation with an example would be better. Assume a list having 15 elements with *one-third data missing*:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, ** NaN, NaN, NaN, NaN, NaN**] — — — (

*original*)

Now notice in the *original *list there are sets of **4** ones, **4** twos, **2** threes, and **5** NaNs. Thus the ones & twos are in ** majority** while threes are in

**. Now let's begin by calculating the probabilities and expected values.**

*minority**prob*(1 occurring in NaNs)

= (no. of 1s)/(population without NaNs)

= 4/10

= 2/5- Expected value/counts of 1

= (prob) * (total no. of NaNs)

= (2 / 5) * (5)

=**2**

Similarly expected value of prob(2 occurring in NaNs) is **2** and prob(3 occurring in NaNs) is **1** (Note that **2+2+1=5**, is equal to the number of NaNs). Thus our list will now look like this:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, ** 1, 1, 2, 2, 3**] — — — (

*replaced_by_proportion*)

The ratio of ones, twos, and threes replacing NaNs is thus **2 : 2 : 1**. That is when we have ‘*nothing*’ it is ** highly likely** that ‘ones’ and ‘twos’ form the major part of it than ‘threes’, instead of

*a single hard*mean/mode/median.

If we simply impute NaNs by mean(1.8), then our list looks like:

[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, ** 1.8, 1.8, 1.8, 1.8, 1.8**] — — — (

*replaced_by_mean*)

Let’s *box-plot* these three lists and draw **conclusions** from the same:

**First**, the list with proportional replacement has far better *data distribution* than the mean replaced one. **Second**, observe how mean affects the distribution with ‘3’*(a minority): *it was originally *not* an outlier, suddenly turned so in plot-2 but regained its original status in plot-3. This shows that plot-3 distribution is less *biased*. **Third**, this approach is also fairer, it gave ‘3’(*the minority*) a “*chance”* in the missing values which otherwise it would have *never *got. The **fourth** beauty of this approach is that we still have successfully conserved the mean!

**Fifth**, the distribution (based on probability) ensures, without doubt, that the chances of this method to over-fit a model is definitely lesser than imputing with the *hard* approach. **Sixth**, if NaNs are replaced *“randomly”* then applying a little logic we can easily calculate that there are: 5!/(2!*2!*1!) = 30, different arrangements (permutations) possible:

… 1, 1, 2, 2, 3],

… 1, 1, 2, 3, 2],

… 1, 1, 3, 2, 2],

… 1, 3, 1, 2, 2],

… 3, 1, 1, 2, 2] and 25 more!

To make this dynamism even clearer and intuitive see this gif with 4 NaNs. Each color represents a different NaN value.

Notice how different arrangements generate different *interaction*s between columns each time we run the code. Per se, we are not ‘generating’ new data here, as we are only resourcefully utilizing the already available data. We are only generating newer and newer interactions. *And these fluctuating interactions are the real penalty of NaNs.*

#### Code:

Now let’s code this concept and bound it. The code for dealing with numerical features can be found ** here** and for categorical features

**. (I am purposely avoiding displaying the code here as the focus is on the concept, also it would needlessly make the article lengthy. If you do find the code useful and are [algorithmically]**

*here**greedy*enough to optimize it further, I’ll be

*glad*if you revert). How to use the code?

`random.seed = 0 np.random.seed = 0# important so that results are reproducible`

`# The df_original is free of impurities(eg. no '$' or ',' in price # field) and df_original.dtypes are all set.`

`1. df = df_original.copy()2. Call the CountAll() function given in the code3. categorical list = [all categorical column names in df]4. numerical list = [all numerical column names in df]5. run a for loop to fill NaNs through numerical list, using the Fill_NaNs_Numeric() function6. run a for loop to fill NaNs through categorical list, using the Fill_NaNs_Catigorical() function7. perform a train test split and check for the accuracy(do not specify the random_state)`

`(After step 7 we require a bit of imputation tuning. Ensuring steps 1-7 are in a single cell, repeatedly run it 15-20 times manually to get an idea of the 'range' of accuracies as it'll keep fluctuating due to randomness. 7th step helps one get an estimate of the limits of accuracies and helps us to boil down to the "best accuracy")`

`8.("skip" this step if df is extremely huge) run a conditioned while loop again through 1 to 7 this time to directly get our desired(tuned) accuracy.`

`(One may want to write down and save this 'updated'-df for future use to save oneself from repeating this process).`

** Here** is a complete example, with all the steps just mentioned, on the famous Iris data-set included in sklearn library. 20% values from each column, including the target, have been randomly deleted. Then the NaNs in this data-set is imputed using

*this*approach. By step-7 its easily identifiable that after imputation we can

*tune*our

*recall*at-least ≥ 0.7 for

**“**, and the same is the condition in the 8-th step. After running several times few reports are as follows:

*each” class of the iris plant*Next, for a second confirmation, we plot PR-curves post-tuning, this time with a RandomForestClassifier (n_estimators= 100). [the *classes* are {0 :’setosa’, 1: ‘versicolor’, 2: ‘virginica’}].

These figures look okay. Now shifting our attention to *hard* imputation. *One* of the many classification reports is shown below: [observe the 1s(to be discussed shortly) in *precision* and *recall* along with the** class imbalance** in *support*]

` precision recall f1-score support`

`setosa 1.00 0.52 0.68 25versicolor 0.45 1.00 0.62 9virginica 0.67 0.73 0.70 11`

#### The law of large numbers

Now lets put to use the *law of large numbers *using DecisionTreeClassifier to perform 500 iterations, each with a different randomly removed set of values, over the same Iris dataset ** without tuning **the imputations; that is, we skip the tuning stage to “

*deliberately*” obtain the

**worst**

*soft*scores. The code is

*here*. The final comparisons in terms of precision and recall scores, for both hard and soft imputation, are as follows:

*Precision* and *recall* come handy mainly when we observe class imbalance. Although initially, we did have a well-balanced target but ** hard** imputing with it with mode made it imbalanced. Observe a large number of

*hard*recalls and

*hard*precisions having value = 1. Here the use of the word

*“over-fit”*would be incorrect as these are test scores not train ones. So the correct way to put would be: the prophetic

*hard*model already knew what to predict, or the use of mode ensured the two scores to overshoot.

Now observe the ** soft** scores. Despite any

*tuning*as well as much fewer values being= 1, the

*soft*scores are

*still*able to

**catch up/converge**with the

*hard*scores (except in two cases —

*versicolor-recall*and

*stetosa-precision*— for obvious reasons where a huge number of prophetic 1s forcefully pull up the average). Also, observe the

*soft-stetosa-recall*(despite the presence of large 1s in the

*hard counterpart),*and, the increased s

*oft-versicolor-precision*. The last thing to note is the overall reduction in variation and standard deviation in the

*soft*approach.

For reference the f1 scores and accuracy scores are: (once again note the reduced variation and standard deviation in *soft* approach)

Thus we can observe that in the long run, evenwithoutsoft imputationtuningwe have obtained results which match the performance of the hard imputation strategy. Thusaftertuning we can obtain even better results.

### Conclusion

Why are we doing this? The only reason is to improve our chances of *dealing with uncertainty*. *We never penalize ourselves for missing values!* Whenever we find a missing value we simply anchor our ship in the ‘middle’ of the sea falsely presuming that our anchor has successfully fathomed the deepest trench of “uncertainties”. The attempt here is to keep the ship sailing by employing the resources available at hand — the wind speed and direction, the location of stars, the energy of the waves and tides, etc. to get the best ‘*diversified*’ catch, for a *better* return.

(If you identify anything wrong/incorrect, please do respond. Criticism is welcomed).