How to Handle Missing Data in Datasets — CBSE AI Students Guide

Real data is never perfect. Sensors malfunction, survey respondents skip questions, spreadsheets get corrupted. Before you train any AI model, you must deal with missing values — or your model will either crash or produce wrong predictions.

This guide covers the complete missing data workflow required in Class 11, Unit 5: Data Literacy — Data Pre-processing and Class 12, Unit 1: Python Programming – II of the CBSE AI syllabus (Subject Code 843, 2025-26).

What You’ll Learn

  • What missing data is and why it breaks AI models
  • How to detect missing values with Pandas
  • Three strategies: fill, drop, or replace — when to use each
  • Complete programs ready for your practical file and Lab Test

Why Missing Data Matters in AI

Most machine learning algorithms expect complete, clean data. When values are missing, many algorithms either stop with an error or silently produce wrong results. Both outcomes cost more than the 10 minutes it takes to clean the data first.

In India’s agricultural data collected by state governments, district-level crop yield records routinely have missing entries for remote areas. A predictive model trained on that data without handling missing values would produce unreliable yield forecasts. At school scale, the same problem shows up as a practical program that fails.

Missing values in Python are represented as NaN (Not a Number). Pandas uses NaN for any missing numeric or text value.
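
A minimal sketch of how NaN behaves, using only standard NumPy and Pandas calls. The marks in the Series are made-up illustration values. It suggests why missing values are tested with isnull() or pd.isna() rather than with ==.

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)            # False

# So use pd.isna() / isnull() to test for missing values
print(pd.isna(np.nan))              # True

# Illustrative marks: None in a numeric column is stored as NaN,
# and the column becomes float64 so that it can hold NaN
marks = pd.Series([85, None, 78])
print(marks.dtype)                  # float64
print(marks.isnull().tolist())      # [False, True, False]
```

This is also why the Maths, Science and AI columns in the programs below print as 85.0 rather than 85.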


Part 1 — Detecting Missing Values

Before fixing missing data, you must find it.

python

# Program to detect missing values in a dataset

import pandas as pd
import numpy as np

# Create a dataset with deliberate missing values
data = {
    "Student"    : ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha"],
    "Maths"      : [85, np.nan, 78, 95, np.nan, 88],
    "Science"    : [90, 82, np.nan, 88, 74, np.nan],
    "AI"         : [92, 95, 88, np.nan, 79, 91],
    "City"       : ["Delhi", "Mumbai", np.nan, "Chennai", "Delhi", "Mumbai"]
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)
print("\n--- Missing Value Analysis ---")

# Count missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())

# Total missing values across entire DataFrame
print("\nTotal missing values:", df.isnull().sum().sum())

# Percentage missing per column
print("\nPercentage missing per column:")
print((df.isnull().sum() / len(df) * 100).round(1))

# Which rows have at least one missing value
print("\nRows with missing values:")
print(df[df.isnull().any(axis=1)])

Expected Output:

Dataset:
  Student  Maths  Science    AI     City
0   Arjun   85.0     90.0  92.0    Delhi
1   Priya    NaN     82.0  95.0   Mumbai
2   Kiran   78.0      NaN  88.0      NaN
3   Meena   95.0     88.0   NaN  Chennai
4   Rohan    NaN     74.0  79.0    Delhi
5   Sneha   88.0      NaN  91.0   Mumbai

--- Missing Value Analysis ---

Missing values per column:
Student    0
Maths      2
Science    2
AI         1
City       1
dtype: int64

Total missing values: 6

Percentage missing per column:
Student     0.0
Maths      33.3
Science    33.3
AI         16.7
City       16.7
dtype: float64

Rows with missing values:
  Student  Maths  Science    AI     City
1   Priya    NaN     82.0  95.0   Mumbai
2   Kiran   78.0      NaN  88.0      NaN
3   Meena   95.0     88.0   NaN  Chennai
4   Rohan    NaN     74.0  79.0    Delhi
5   Sneha   88.0      NaN  91.0   Mumbai

Key methods for detection:

| Method | What It Returns |
|---|---|
| df.isnull() | DataFrame of True/False, with True where a value is missing |
| df.isnull().sum() | Count of missing values per column |
| df.isnull().sum().sum() | Total missing values in entire DataFrame |
| df.info() | Shows non-null counts per column |
| df.isnull().any(axis=1) | True for rows that have at least one missing value |
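
df.info() appears in the list of detection methods but not in the program above. The sketch below shows it on a small made-up DataFrame, along with isna() and notnull(), two related Pandas methods that this guide does not otherwise use.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset (values are made up for this sketch)
df = pd.DataFrame({
    "Student": ["Arjun", "Priya", "Kiran"],
    "Maths": [85, np.nan, 78],
    "City": ["Delhi", "Mumbai", np.nan],
})

# info() prints one line per column with its non-null count and dtype
df.info()

# isna() is an alias of isnull(); both return the same True/False table
same_result = df.isna().equals(df.isnull())
print("isna() matches isnull():", same_result)

# notnull() is the opposite: True where a value is present
present_per_column = df.notnull().sum()
print(present_per_column)
```

A column whose non-null count is lower than the number of rows has missing values.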

Part 2 — Strategy 1: Fill Missing Values (fillna)

fillna() replaces missing values with something — without losing any rows.

Fill with Column Mean (Most Common for Numeric Data)

python

# Program to fill missing numeric values with column mean

import pandas as pd
import numpy as np

data = {
    "Student": ["Arjun","Priya","Kiran","Meena","Rohan"],
    "Maths"  : [85, np.nan, 78, 95, np.nan],
    "Science": [90, 82, np.nan, 88, 74]
}
df = pd.DataFrame(data)

print("Before filling:")
print(df)
print("\nMissing:", df.isnull().sum().sum())

# Fill with column mean (assign the result back to the column)
df["Maths"] = df["Maths"].fillna(df["Maths"].mean())
df["Science"] = df["Science"].fillna(df["Science"].mean())

print("\nAfter filling with mean:")
print(df.round(2))
print("Missing:", df.isnull().sum().sum())

Expected Output:

Before filling:
  Student  Maths  Science
0   Arjun   85.0     90.0
1   Priya    NaN     82.0
2   Kiran   78.0      NaN
3   Meena   95.0     88.0
4   Rohan    NaN     74.0

Missing: 3

After filling with mean:
  Student  Maths  Science
0   Arjun   85.0     90.0
1   Priya   86.0     82.0
2   Kiran   78.0     83.5
3   Meena   95.0     88.0
4   Rohan   86.0     74.0
Missing: 0

Why mean? The mean preserves the overall average of the column — adding a mean value doesn’t shift the centre of your data.
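
The claim about the centre of the data can be checked directly. The sketch below reuses the Maths marks from the program above and compares the mean and the standard deviation before and after filling. The standard deviation is included here as an extra check that the guide itself does not discuss.

```python
import numpy as np
import pandas as pd

maths = pd.Series([85, np.nan, 78, 95, np.nan])
filled = maths.fillna(maths.mean())

# mean() and std() skip NaN by default
mean_before, mean_after = maths.mean(), filled.mean()
std_before, std_after = maths.std(), filled.std()

print("Mean before/after:", round(mean_before, 2), round(mean_after, 2))
print("Std  before/after:", round(std_before, 2), round(std_after, 2))
```

The mean stays at 86.0, while the standard deviation drops, because every filled value sits exactly at the centre. Mean filling keeps the average but makes the column look less spread out than it really is.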

Fill with Median (Better When Data Has Outliers)

python

# Fill with median — better when data has extreme values

import pandas as pd
import numpy as np

salaries = pd.Series([25000, 28000, np.nan, 30000, 500000, np.nan, 27000])

print("Mean  :", salaries.mean().round(0))     # pulled up by 500000
print("Median:", salaries.median())             # not affected by 500000

# Fill with median to avoid inflating missing values
salaries_filled = salaries.fillna(salaries.median())
print("\nAfter filling with median:")
print(salaries_filled)

Expected Output:

Mean  : 122000.0
Median: 28000.0

After filling with median:
0     25000.0
1     28000.0
2     28000.0
3     30000.0
4    500000.0
5     28000.0
6     27000.0
dtype: float64

Mean vs Median for missing data — when to choose:

| Situation | Use |
|---|---|
| Data is roughly symmetric (most marks, heights) | Mean |
| Data has outliers (salaries, house prices) | Median |
| Categorical column (city names, grades) | Mode or placeholder |

Fill Categorical/Text Columns

python

# Fill text column with mode (most common value) or placeholder

import pandas as pd
import numpy as np

data = {"City": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"]}
df = pd.DataFrame(data)

# Option A: fill with most common value (mode)
mode_city = df["City"].mode()[0]
df["City"] = df["City"].fillna(mode_city)
print("Filled with mode:")
print(df)

Expected Output:

Filled with mode:
      City
0    Delhi
1   Mumbai
2    Delhi   ← filled with "Delhi" (mode)
3    Delhi
4    Delhi   ← filled with "Delhi" (mode)
5  Chennai
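
The other option for text columns is a placeholder. The sketch below fills the same City values with "Unknown", then shows what mode() returns when two values are equally common. The tied Series is a made-up example.

```python
import numpy as np
import pandas as pd

city = pd.Series(["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"])

# Option B: keep "missing" visible as its own category
with_placeholder = city.fillna("Unknown")
print(with_placeholder.tolist())
# ['Delhi', 'Mumbai', 'Unknown', 'Delhi', 'Unknown', 'Chennai']

# When two values tie, mode() returns both in sorted order,
# so mode()[0] picks the first one alphabetically
tied = pd.Series(["Mumbai", "Delhi", np.nan])
tied_modes = tied.mode().tolist()
print(tied_modes)             # ['Delhi', 'Mumbai']
```

A placeholder does not pretend to know the missing city, which may be preferable when many values are missing or when there is no clear most common value.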

Part 3 — Strategy 2: Drop Missing Values (dropna)

dropna() removes rows (or columns) that contain missing values.

python

# Program to demonstrate dropna() with different options

import pandas as pd
import numpy as np

data = {
    "Name"    : ["Arjun","Priya","Kiran","Meena","Rohan"],
    "Maths"   : [85, np.nan, 78, 95, np.nan],
    "Science" : [90, 82, np.nan, 88, 74],
    "AI"      : [92, 95, 88, np.nan, 79]
}
df = pd.DataFrame(data)
print("Original shape:", df.shape)

# Drop rows where ANY column has a missing value
df_any = df.dropna()
print("After dropna() — any missing:", df_any.shape)

# Drop rows where ALL columns are missing (rare, more conservative)
df_all = df.dropna(how="all")
print("After dropna(how='all')      :", df_all.shape)

# Drop rows only where a specific column is missing
df_maths = df.dropna(subset=["Maths"])
print("After dropna(subset=['Maths']):", df_maths.shape)

Expected Output:

Original shape: (5, 4)
After dropna() — any missing: (1, 4)
After dropna(how='all')      : (5, 4)
After dropna(subset=['Maths']): (3, 4)
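
dropna() has two more options that the program above does not show: thresh, which keeps only rows with at least a given number of non-missing values, and axis=1, which drops columns instead of rows. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative data: Kiran's row is missing most of its values
df = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran"],
    "Maths": [85, np.nan, np.nan],
    "Science": [90, 82, np.nan],
    "AI": [92, 95, np.nan],
})

# Keep rows that have at least 3 non-missing values
mostly_complete = df.dropna(thresh=3)
print(mostly_complete.shape)      # (2, 4): Arjun and Priya stay

# Drop every column that contains any missing value
complete_columns = df.dropna(axis=1)
print(list(complete_columns.columns))   # ['Name']
```

thresh is a gentler alternative to plain dropna(): Priya's row survives even though one of her marks is missing.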

Part 4 — Choosing the Right Strategy

This is the decision that most students skip — and it is the most important one.

| Situation | Best Strategy | Reason |
|---|---|---|
| Missing values are few (< 5% of rows) | dropna() | Small loss, clean data |
| Missing values are many (> 20%) | fillna() with mean/median | Can’t afford to lose rows |
| Column is numeric, symmetric distribution | fillna(mean) | Preserves average |
| Column is numeric, has outliers | fillna(median) | Median is robust to extremes |
| Column is text or category | fillna(mode) or "Unknown" | Mean/median don’t apply to text |
| Row is missing most of its values | dropna() for that row | Row adds no useful information |
| Column is missing more than 50% | Drop the column | Column is unreliable |

python

# Program: complete missing data decision workflow

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")    # Replace with your file

# Step 1: Diagnose
missing_pct = (df.isnull().sum() / len(df)) * 100
print("Missing percentage per column:")
print(missing_pct.round(1))

# Step 2: Drop columns with > 50% missing
cols_to_drop = missing_pct[missing_pct > 50].index
df.drop(columns=cols_to_drop, inplace=True)
print(f"\nDropped columns: {list(cols_to_drop)}")

# Step 3: Fill remaining numeric columns with median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Step 4: Fill remaining text columns with placeholder
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].fillna("Unknown")

# Step 5: Verify
print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Step 6: Save
df.to_csv("data_cleaned.csv", index=False)
print("\nCleaned file saved.")
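
To try the workflow without a CSV file, the same steps can be wrapped in a function and run on a small in-memory DataFrame. This is a sketch under assumptions: the function name clean_missing, its col_threshold parameter, and the sample data are all hypothetical. It also treats every non-numeric column as text.

```python
import numpy as np
import pandas as pd

def clean_missing(df, col_threshold=50.0):
    """Hypothetical helper: apply the decision workflow to a DataFrame."""
    # Step 1: percentage of missing values per column
    missing_pct = df.isnull().mean() * 100

    # Step 2: drop columns above the threshold (returns a new DataFrame)
    cols_to_drop = missing_pct[missing_pct > col_threshold].index
    cleaned = df.drop(columns=cols_to_drop)

    # Step 3: fill numeric columns with their median
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

    # Step 4: fill the remaining (non-numeric) columns with a placeholder
    text_cols = cleaned.select_dtypes(exclude="number").columns
    cleaned[text_cols] = cleaned[text_cols].fillna("Unknown")
    return cleaned

# Made-up sample: "Hobby" is 75% missing, so it should be dropped
sample = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran", "Meena"],
    "Maths": [85, np.nan, 78, 95],
    "City": ["Delhi", np.nan, "Mumbai", "Delhi"],
    "Hobby": [np.nan, np.nan, np.nan, "Chess"],
})

cleaned = clean_missing(sample)
print(cleaned)
print("Missing after cleaning:", int(cleaned.isnull().sum().sum()))
```

Priya's Maths mark is filled with the median (85.0) and her City becomes "Unknown". The original sample DataFrame is left unchanged because the function works on a new DataFrame.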

Part 5 — Forward Fill and Backward Fill (Time-Series Data)

These methods are used when data has a natural order — like daily temperatures or monthly rainfall. Instead of filling with a fixed value, you carry forward the last known value or fill backward from the next known value.

python

# Program to demonstrate forward fill and backward fill

import pandas as pd
import numpy as np

rainfall = pd.Series([120, np.nan, np.nan, 95, np.nan, 110, 130])

# tolist() converts the values to plain Python floats for clean printing
print("Original :", rainfall.tolist())

# Forward fill: use the last known value
ffill = rainfall.ffill()
print("Forward  :", ffill.tolist())

# Backward fill: use the next known value
bfill = rainfall.bfill()
print("Backward :", bfill.tolist())

Expected Output:

Original : [120.0, nan, nan, 95.0, nan, 110.0, 130.0]
Forward  : [120.0, 120.0, 120.0, 95.0, 95.0, 110.0, 130.0]
Backward : [120.0, 95.0, 95.0, 95.0, 110.0, 110.0, 130.0]
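
Two edge cases are worth knowing. Forward fill cannot fill a gap at the very start of the data, because there is no earlier value to carry forward. The limit parameter caps how many missing values in a row get filled. A minimal sketch with made-up rainfall values:

```python
import numpy as np
import pandas as pd

# Illustrative series that starts with a missing value
rainfall = pd.Series([np.nan, 120, np.nan, np.nan, 95])

# The leading NaN has no earlier value, so it stays missing
forward = rainfall.ffill()
print(forward.tolist())    # [nan, 120.0, 120.0, 120.0, 95.0]

# limit=1 fills at most one missing value in each gap
limited = rainfall.ffill(limit=1)
print(limited.tolist())    # [nan, 120.0, 120.0, nan, 95.0]

# Forward fill followed by backward fill covers the leading gap too
both = rainfall.ffill().bfill()
print(both.tolist())       # [120.0, 120.0, 120.0, 120.0, 95.0]
```

Whether a limit makes sense depends on the data: carrying one month of rainfall forward for many months may be a poor guess.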

The CBSE Class 11 Unit 5 content on data preprocessing and Class 12 Unit 1 (handling missing values in DataFrames) both expect you to know fillna() and dropna() at minimum. Forward/backward fill is a useful additional technique to demonstrate in your Viva.


Quick Revision Box

| Method | What It Does |
|---|---|
| df.isnull() | Returns True where values are missing |
| df.isnull().sum() | Count of missing values per column |
| df.isnull().sum().sum() | Total missing values in DataFrame |
| df.fillna(value) | Replaces NaN with given value |
| df["col"].fillna(df["col"].mean()) | Fills column NaN with column mean |
| df["col"].fillna(df["col"].median()) | Fills with column median |
| df["col"].fillna(df["col"].mode()[0]) | Fills with most common value |
| df.dropna() | Drops all rows with any NaN |
| df.dropna(how="all") | Drops rows where ALL values are NaN |
| df.dropna(subset=["col"]) | Drops rows where specific column is NaN |
| df.ffill() | Forward fill: use previous value |
| df.bfill() | Backward fill: use next value |
| inplace=True | Modifies the original DataFrame directly |

Practice Questions

Q1 (2 marks): Write Python code to read a CSV file, check for missing values in each column, and fill all numeric missing values with the column mean.

Model Answer:

python

import pandas as pd

df = pd.read_csv("data.csv")
print("Missing values:", df.isnull().sum())

numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print("After filling:", df.isnull().sum())

Q2 (MCQ): Which method removes rows from a DataFrame where a specific column has missing values?

a) df.fillna(subset=["col"])
b) df.dropna(subset=["col"])
c) df.isnull(subset=["col"])
d) df.remove(na=["col"])

Answer: b) df.dropna(subset=["col"]) — the subset parameter tells Pandas to only drop rows where the specified column(s) have missing values, leaving other rows intact.


Frequently Asked Questions

Q1: Should I always fill missing values rather than dropping them?

Not always. The decision depends on how much data you can afford to lose and why values are missing. If only 2–3 rows out of 500 have missing values, dropping them is clean and harmless. If 30% of your Maths scores are missing, dropping those rows destroys too much data, so fill instead. As a general rule: if missing rows are less than 5% of your dataset, dropping is fine.

Q2: Does filling missing values with the mean make my model less accurate?

Filling with mean introduces a small inaccuracy, because you are guessing values that were not observed. However, the alternative (dropping rows or leaving NaN) is usually worse: models that cannot handle NaN will crash, and dropping rows loses real data. Mean/median imputation is an acceptable trade-off for CBSE practicals and most real-world scenarios.

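Q3: fillna() doesn’t seem to work. The NaN values are still there after I run it.

This is a very common issue. By default, fillna() returns a new DataFrame without modifying the original. Either save the result back with df = df.fillna(value), or call it on the whole DataFrame with inplace=True: df.fillna(value, inplace=True). For a single column, always assign the result back: df["col"] = df["col"].fillna(value). Using inplace=True on a selected column, as in df["col"].fillna(value, inplace=True), may leave the DataFrame unchanged in recent Pandas versions.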
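
A minimal sketch of this behaviour on a small made-up DataFrame, showing that fillna() leaves the original unchanged until the result is assigned back:

```python
import numpy as np
import pandas as pd

# Illustrative marks with one missing value in each column
df = pd.DataFrame({"Maths": [85, np.nan, 78], "Science": [np.nan, 82, 90]})

# The result is not stored anywhere, so df still has its NaN values
df.fillna(0)
still_missing = int(df.isnull().sum().sum())
print("Still missing:", still_missing)      # 2

# Single column: assign the filled column back
df["Maths"] = df["Maths"].fillna(df["Maths"].mean())

# Whole DataFrame: assign the filled DataFrame back
df = df.fillna(0)
missing_after = int(df.isnull().sum().sum())
print("Missing after:", missing_after)      # 0
```

Assigning the result back works the same way for a column and for the whole DataFrame, which makes it the simpler habit to keep.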