Missing data is a common hurdle in data analysis, and it undermines the reliability of insights drawn from a dataset. Python offers a range of solutions, some of which we discussed in earlier weeks. In this notebook, we look at three widely used imputers from scikit-learn (SimpleImputer, KNNImputer, and IterativeImputer), covering how each one works and the practical considerations involved. We’ll explore these techniques using the weather dataset.
# install the libraries for this demonstration
! pip install dataidea==0.2.5
from dataidea.packages import *
from dataidea.datasets import loadDataset
The line `from dataidea.packages import *` imports np, pd, plt, and so on for us, and `loadDataset` lets us load the datasets built into the dataidea library.
weather = loadDataset('weather')
weather
| | day | temperature | windspead | event |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
weather.isna().sum()
day            0
temperature    4
windspead      4
event          2
dtype: int64
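Raw counts are easier to judge as a share of the rows. The sketch below rebuilds the weather table by hand (an assumption, made so the example runs without the dataidea package) and reports the fraction of missing values per column. The name `missing_share` is illustrative and does not appear in the notebook.

```python
import numpy as np
import pandas as pd

# weather table copied from the output above, rebuilt by hand
weather = pd.DataFrame({
    "day": ["01/01/2017", "04/01/2017", "05/01/2017", "06/01/2017", "07/01/2017",
            "08/01/2017", "09/01/2017", "10/01/2017", "11/01/2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"],
})

# isna() gives a boolean frame; the mean of booleans is the fraction that is True
missing_share = weather.isna().mean()
print(missing_share.round(2))
```

With four of nine values missing in each numeric column, almost half of the temperature and windspead readings have to be imputed, which is worth keeping in mind when reading the results that follow.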
Let’s apply each of the three imputers (SimpleImputer, KNNImputer, and IterativeImputer) to the numeric columns of this small weather dataset.
# select the numeric columns that contain missing values
temp_wind = weather[['temperature', 'windspead']].copy()
temp_wind_imputed = temp_wind.copy()
from sklearn.impute import SimpleImputer

# replace each missing value with the mean of its column
simple_imputer = SimpleImputer(strategy='mean')
temp_wind_simple_imputed = simple_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so rebuild the DataFrame
temp_wind_simple_imputed_df = pd.DataFrame(temp_wind_simple_imputed, columns=temp_wind.columns)
Let’s have a look at the outcome. Each missing value is replaced by its column mean (33.2 for temperature, 8.4 for windspead):
temp_wind_simple_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.0 | 6.0 |
1 | 33.2 | 9.0 |
2 | 28.0 | 8.4 |
3 | 33.2 | 7.0 |
4 | 32.0 | 8.4 |
5 | 33.2 | 8.4 |
6 | 33.2 | 8.4 |
7 | 34.0 | 8.0 |
8 | 40.0 | 12.0 |
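The mean strategy applies only to numeric columns, but the event column also has two missing values. One option for a categorical column is `strategy='most_frequent'`. The sketch below is an illustration rather than part of the original notebook: it copies the event values from the weather table by hand, and the names `events`, `mode_imputer`, and `events_imputed_df` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# event column copied from the weather table; object dtype keeps NaN as the missing marker
events = pd.DataFrame(
    {"event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"]},
    dtype=object,
)

# most_frequent fills each gap with the column's most common value and works on strings
mode_imputer = SimpleImputer(strategy="most_frequent")
events_imputed_df = pd.DataFrame(mode_imputer.fit_transform(events), columns=events.columns)
print(events_imputed_df["event"].tolist())
```

Sunny appears three times, more than any other event, so both gaps are filled with Sunny. On a column this small the most common value is a weak guess, so treat the result as a placeholder rather than a recovered observation.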
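from sklearn.impute import KNNImputer

# fill each missing value with the average from the 2 most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
temp_wind_knn_imputed = knn_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so rebuild the DataFrame
temp_wind_knn_imputed_df = pd.DataFrame(temp_wind_knn_imputed, columns=temp_wind.columns)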
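Let’s take a look at the outcome. Rows 5 and 6 have no observed values from which to measure a distance, so they fall back to the column means: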
temp_wind_knn_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.0 | 6.0 |
1 | 33.0 | 9.0 |
2 | 28.0 | 7.0 |
3 | 33.0 | 7.0 |
4 | 32.0 | 7.0 |
5 | 33.2 | 8.4 |
6 | 33.2 | 8.4 |
7 | 34.0 | 8.0 |
8 | 40.0 | 12.0 |
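To see where a KNN-imputed value comes from, the sketch below recomputes the temperature for row 3 by hand and compares it with KNNImputer. It assumes the default settings (uniform weights and the NaN-aware Euclidean distance that scikit-learn exposes as `nan_euclidean_distances`). The array `X` is copied from the table, and `target`, `donors`, and `manual_value` are illustrative names.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# temperature and windspead columns copied from the weather table
X = np.array([
    [32.0, 6.0],
    [np.nan, 9.0],
    [28.0, np.nan],
    [np.nan, 7.0],
    [32.0, np.nan],
    [np.nan, np.nan],
    [np.nan, np.nan],
    [34.0, 8.0],
    [40.0, 12.0],
])

target = 3  # row 3 has windspead 7.0 but no temperature

# distance from the target row to every row, using only coordinates both rows have
dist = nan_euclidean_distances(X[[target]], X)[0]

# a row can donate only if it has a temperature and a defined distance to the target
donors = np.flatnonzero(~np.isnan(X[:, 0]) & ~np.isnan(dist))

# average the temperature of the two closest donors
nearest = donors[np.argsort(dist[donors])[:2]]
manual_value = X[nearest, 0].mean()

knn_value = KNNImputer(n_neighbors=2).fit_transform(X)[target, 0]
print(manual_value, knn_value)
```

The two closest donors are rows 0 and 7 (windspead 6.0 and 8.0), so both values come out as 33.0, the average of 32.0 and 34.0, which matches row 3 in the table.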
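# IterativeImputer is still experimental, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# model each column with missing values as a function of the other columns
iterative_imputer = IterativeImputer()
temp_wind_iterative_imputed = iterative_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so rebuild the DataFrame
temp_wind_iterative_imputed_df = pd.DataFrame(temp_wind_iterative_imputed, columns=temp_wind.columns)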
Let’s take a look at the outcome:
temp_wind_iterative_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.000000 | 6.000000 |
1 | 35.773287 | 9.000000 |
2 | 28.000000 | 3.321648 |
3 | 33.042537 | 7.000000 |
4 | 32.000000 | 6.238915 |
5 | 33.545118 | 7.365795 |
6 | 33.545118 | 7.365795 |
7 | 34.000000 | 8.000000 |
8 | 40.000000 | 12.000000 |
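IterativeImputer starts from a simple fill, then repeatedly predicts each column from the others. The sketch below is a hedged variation on the cell above: it pins `random_state` for reproducibility and writes out `max_iter=10` (the default) to make the number of rounds visible. The array `X` is copied from the table, and the exact imputed numbers may differ from the table depending on the scikit-learn version, so the checks only cover properties that should hold regardless.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# temperature and windspead columns copied from the weather table
X = np.array([
    [32.0, 6.0],
    [np.nan, 9.0],
    [28.0, np.nan],
    [np.nan, 7.0],
    [32.0, np.nan],
    [np.nan, np.nan],
    [np.nan, np.nan],
    [34.0, 8.0],
    [40.0, 12.0],
])

# up to 10 rounds of predicting each column from the other
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

# observed values pass through unchanged; only the gaps are filled
observed = ~np.isnan(X)
print(np.allclose(X_imputed[observed], X[observed]))
print(np.isnan(X_imputed).any())
```

Rows 5 and 6 have no observed values at all, so they start from the same initial fill and end up with identical estimates, as the table shows.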
Datawig is a library specifically designed for imputing missing values in tabular data using deep learning models.
# import datawig

# # Impute missing values
# df_imputed = datawig.SimpleImputer.complete(weather)
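These imputation methods trade off computational cost, sensitivity to missing-data patterns, and ease of use. SimpleImputer is the fastest and simplest but ignores relationships between columns; KNNImputer borrows values from similar rows, which costs more as the dataset grows; IterativeImputer models each column from the others and is still marked experimental in scikit-learn. The right choice depends on the characteristics of the dataset and the requirements of the analysis.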
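One way to compare the scikit-learn imputers is to put their estimates for the same column next to each other. The sketch below is an illustration under stated assumptions: the data is copied by hand from the weather table, `random_state=0` is an arbitrary choice, and the `imputers` dictionary and `comparison` frame are names invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# numeric columns copied from the weather table
temp_wind = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

imputers = {
    "simple": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(random_state=0),
}

# impute with each method and keep only the temperature column (index 0)
comparison = pd.DataFrame({
    name: imputer.fit_transform(temp_wind)[:, 0]
    for name, imputer in imputers.items()
})
comparison.insert(0, "original", temp_wind["temperature"])
print(comparison.round(2))
```

Rows where the original value is present are identical across all columns, so any disagreement between methods shows up only in the rows that were missing. With nine rows this is a reading aid, not an accuracy benchmark.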