Handling Missing Data

Missing data is a common hurdle in data analysis because it undermines the reliability of insights drawn from a dataset. Python offers a range of solutions to this problem, some of which we discussed in earlier weeks. In this notebook, we look at three widely used imputation methods from scikit-learn (SimpleImputer, KNNImputer, and IterativeImputer), covering how each works and the practical considerations for using it. We’ll explore these techniques using the weather dataset.

# install the libraries for this demonstration
! pip install dataidea==0.2.5

from dataidea.packages import *
from dataidea.datasets import loadDataset

The line from dataidea.packages import * imports np, pd, plt, etc. for us, and loadDataset lets us load the datasets built into the dataidea library.

 weather = loadDataset('weather') 
 weather 
          day  temperature  windspead   event
0  01/01/2017         32.0        6.0    Rain
1  04/01/2017          NaN        9.0   Sunny
2  05/01/2017         28.0        NaN    Snow
3  06/01/2017          NaN        7.0     NaN
4  07/01/2017         32.0        NaN    Rain
5  08/01/2017          NaN        NaN   Sunny
6  09/01/2017          NaN        NaN     NaN
7  10/01/2017         34.0        8.0  Cloudy
8  11/01/2017         40.0       12.0   Sunny
 weather.isna().sum() 
day            0
temperature    4
windspead      4
event          2
dtype: int64
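For readers following along without the dataidea package, the sketch below rebuilds the table above by hand and repeats the missing-value count, adding the share of missing values per column and the number of rows with at least one gap. The names weather_demo, missing_counts, missing_share, and rows_with_gaps are illustrative and not part of dataidea; the values are copied from the output shown above.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the weather table shown above (an assumption standing in
# for loadDataset('weather')); column names, including 'windspead', match the output.
weather_demo = pd.DataFrame({
    "day": ["01/01/2017", "04/01/2017", "05/01/2017", "06/01/2017", "07/01/2017",
            "08/01/2017", "09/01/2017", "10/01/2017", "11/01/2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"],
})

# Number of missing values per column (same call as in the notebook)
missing_counts = weather_demo.isna().sum()

# Fraction of each column that is missing, rounded for readability
missing_share = weather_demo.isna().mean().round(2)

# Number of rows that have at least one missing value
rows_with_gaps = int(weather_demo.isna().any(axis=1).sum())

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

Looking at the share of missing values alongside the raw counts helps put the gaps in context on a table this small, where nearly half of each numeric column is missing.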

Let’s demonstrate how to use these three imputation methods (SimpleImputer, KNNImputer, and IterativeImputer) on this small weather dataset.

 # select the numeric columns (temperature and windspead) from the data
 temp_wind = weather[['temperature', 'windspead']].copy()
 temp_wind_imputed = temp_wind.copy() 
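As a quick orientation, here is a minimal sketch of the interface the three imputers share: each is constructed with its own settings and then fills the gaps through fit_transform. It assumes scikit-learn is installed and rebuilds the two numeric columns by hand so that it runs on its own. The names temp_wind_demo, imputers, and results are illustrative, and the parameter values (strategy="mean", n_neighbors=2, max_iter=10, random_state=0) are arbitrary choices for demonstration, not recommendations. Note that IterativeImputer is still marked experimental in scikit-learn and has to be enabled explicitly before it can be imported.

```python
import numpy as np
import pandas as pd

# IterativeImputer is experimental: this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Hand-built copy of the two numeric weather columns (values taken from the table above)
temp_wind_demo = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

# Illustrative settings only; each imputer has more options than shown here
imputers = {
    "simple_mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

results = {}
for name, imputer in imputers.items():
    # fit_transform returns a NumPy array, so wrap it back into a DataFrame
    filled = imputer.fit_transform(temp_wind_demo)
    results[name] = pd.DataFrame(
        filled, columns=temp_wind_demo.columns, index=temp_wind_demo.index
    )

for name, frame in results.items():
    print(name)
    print(frame.round(2))
```

Each entry in results has the same shape as the input and no remaining NaN values. The originally observed values are left unchanged, so the methods differ only in what they put in the gaps.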

SimpleImputer from scikit-learn: