How to Handle Missing Data in Datasets — CBSE AI Students Guide

Real data is never perfect. Sensors malfunction, survey respondents skip questions, spreadsheets get corrupted. Before you train any AI model, you must deal with missing values — or your model will either crash or produce wrong predictions.

This guide covers the complete missing data workflow required in Class 11, Unit 5: Data Literacy — Data Pre-processing and Class 12, Unit 1: Python Programming – II of the CBSE AI syllabus (Subject Code 843, 2025-26).

What You’ll Learn

What missing data is and why it breaks AI models
How to detect missing values with Pandas
Three strategies: fill, drop, or forward/backward fill, and when to use each
Complete programs ready for your practical file and Lab Test

Why Missing Data Matters in AI

Most machine learning algorithms expect complete, clean data. When values are missing, many of them either stop with an error or silently produce wrong results. Both outcomes cost more than the 10 minutes it takes to clean the data first.

District-level crop yield records collected by India’s state governments routinely have missing entries for remote areas. A predictive model trained on that data without handling the missing values would produce unreliable yield forecasts. At school scale, the same problem shows up as a practical program that fails.

In Pandas, a missing value is shown as NaN (Not a Number), a special floating-point value that comes from NumPy. Pandas treats both NaN and Python’s None as missing, in numeric and text columns alike.
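A minimal sketch of how NaN behaves, assuming only NumPy and Pandas are installed. The variable name marks is just an illustration.

```python
import numpy as np
import pandas as pd

# NaN is a special float value; it is not equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)        # False, so == cannot be used to find missing values

# pd.isna() is the reliable test, and it treats Python's None as missing too
print(pd.isna(np.nan))          # True
print(pd.isna(None))            # True

# A numeric column that contains NaN is stored as float64
marks = pd.Series([85, np.nan, 78])
print(marks.dtype)              # float64
print(marks.isna().tolist())    # [False, True, False]
```

This is why the detection programs in this guide use isnull() rather than comparing values with ==.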

Part 1 — Detecting Missing Values

Before fixing missing data, you must find it.

python

# Program to detect missing values in a dataset

import pandas as pd
import numpy as np

# Create a dataset with deliberate missing values
data = {
    "Student"    : ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha"],
    "Maths"      : [85, np.nan, 78, 95, np.nan, 88],
    "Science"    : [90, 82, np.nan, 88, 74, np.nan],
    "AI"         : [92, 95, 88, np.nan, 79, 91],
    "City"       : ["Delhi", "Mumbai", np.nan, "Chennai", "Delhi", "Mumbai"]
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)
print("\n--- Missing Value Analysis ---")

# Count missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())

# Total missing values across entire DataFrame
print("\nTotal missing values:", df.isnull().sum().sum())

# Percentage missing per column
print("\nPercentage missing per column:")
print((df.isnull().sum() / len(df) * 100).round(1))

# Which rows have at least one missing value
print("\nRows with missing values:")
print(df[df.isnull().any(axis=1)])

Expected Output:

Dataset:
  Student  Maths  Science    AI     City
0   Arjun   85.0     90.0  92.0    Delhi
1   Priya    NaN     82.0  95.0   Mumbai
2   Kiran   78.0      NaN  88.0      NaN
3   Meena   95.0     88.0   NaN  Chennai
4   Rohan    NaN     74.0  79.0    Delhi
5   Sneha   88.0      NaN  91.0   Mumbai

--- Missing Value Analysis ---

Missing values per column:
Student    0
Maths      2
Science    2
AI         1
City       1
dtype: int64

Total missing values: 6

Percentage missing per column:
Student     0.0
Maths      33.3
Science    33.3
AI         16.7
City       16.7
dtype: float64

Rows with missing values:
  Student  Maths  Science    AI     City
1   Priya    NaN     82.0  95.0   Mumbai
2   Kiran   78.0      NaN  88.0      NaN
3   Meena   95.0     88.0   NaN  Chennai
4   Rohan    NaN     74.0  79.0    Delhi
5   Sneha   88.0      NaN  91.0   Mumbai

Key methods for detection:

Method	What It Returns
`df.isnull()`	DataFrame of True/False — True where value is missing
`df.isnull().sum()`	Count of missing values per column
`df.isnull().sum().sum()`	Total missing values in entire DataFrame
`df.info()`	Shows non-null counts per column
`df.isnull().any(axis=1)`	True for rows that have at least one missing value
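Two related checks can be sketched the same way: counting missing values per row, and counting the values that are present. The small DataFrame here is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Student": ["Arjun", "Priya", "Kiran"],
    "Maths": [85, np.nan, 78],
    "Science": [90, np.nan, np.nan],
})

# isna() is an alias of isnull(); both return the same True/False mask
same_mask = df.isna().equals(df.isnull())
print("isna() matches isnull():", same_mask)    # True

# Count missing values per ROW instead of per column
missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row.tolist())                 # [0, 2, 1]

# notnull() is the opposite mask: count the values that ARE present
present_per_column = df.notnull().sum()
print(present_per_column.to_dict())             # {'Student': 3, 'Maths': 2, 'Science': 1}
```

Per-row counts are useful later when deciding whether a row is missing too much to keep.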

Part 2 — Strategy 1: Fill Missing Values (`fillna`)

fillna() replaces each missing value with a value you choose, so no rows are lost.

Fill with Column Mean (Most Common for Numeric Data)

python

# Program to fill missing numeric values with column mean

import pandas as pd
import numpy as np

data = {
    "Student": ["Arjun","Priya","Kiran","Meena","Rohan"],
    "Maths"  : [85, np.nan, 78, 95, np.nan],
    "Science": [90, 82, np.nan, 88, 74]
}
df = pd.DataFrame(data)

print("Before filling:")
print(df)
print("\nMissing:", df.isnull().sum().sum())

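# Fill with column mean
# Assign the result back to the column; inplace=True on a selected
# column may not update df in recent Pandas versions
df["Maths"] = df["Maths"].fillna(df["Maths"].mean())
df["Science"] = df["Science"].fillna(df["Science"].mean())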

print("\nAfter filling with mean:")
print(df.round(2))
print("Missing:", df.isnull().sum().sum())

Expected Output:

Before filling:
  Student  Maths  Science
0   Arjun   85.0     90.0
1   Priya    NaN     82.0
2   Kiran   78.0      NaN
3   Meena   95.0     88.0
4   Rohan    NaN     74.0

Missing: 3

After filling with mean:
  Student  Maths  Science
0   Arjun   85.0     90.0
1   Priya   86.0     82.0
2   Kiran   78.0     83.5
3   Meena   95.0     88.0
4   Rohan   86.0     74.0
Missing: 0

Why mean? The mean preserves the overall average of the column — adding a mean value doesn’t shift the centre of your data.
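A quick check of that claim, as a sketch using the Maths marks from the program above. It also shows a side effect worth knowing: the mean stays the same, but the spread (standard deviation) gets smaller, because the filled values add no variation.

```python
import numpy as np
import pandas as pd

maths = pd.Series([85, np.nan, 78, 95, np.nan])
filled = maths.fillna(maths.mean())

# The mean is unchanged because the filled values sit exactly at the mean
print("Mean before:", maths.mean())    # 86.0
print("Mean after :", filled.mean())   # 86.0

# The spread shrinks, because every filled value equals the mean
print("Std before :", round(maths.std(), 2))
print("Std after  :", round(filled.std(), 2))
```

If many values are filled this way, the column looks less varied than the real data was, which is one reason to check the missing percentage first.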

Fill with Median (Better When Data Has Outliers)

python

# Fill with median — better when data has extreme values

import pandas as pd
import numpy as np

salaries = pd.Series([25000, 28000, np.nan, 30000, 500000, np.nan, 27000])

print("Mean  :", salaries.mean().round(0))     # pulled up by 500000
print("Median:", salaries.median())             # not affected by 500000

# Fill with median to avoid inflating missing values
salaries_filled = salaries.fillna(salaries.median())
print("\nAfter filling with median:")
print(salaries_filled)

Expected Output:

Mean  : 122000.0
Median: 28000.0

After filling with median:
0     25000.0
1     28000.0
2     28000.0
3     30000.0
4    500000.0
5     28000.0
6     27000.0
dtype: float64

Mean vs Median for missing data — when to choose:

Situation	Use
Data is roughly symmetric (most marks, heights)	Mean
Data has outliers (salaries, house prices)	Median
Categorical column (city names, grades)	Mode or placeholder
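The table can be turned into a small helper, sketched below. The function choose_fill_value and its skew_limit cut-off of 1.0 are assumptions made for this illustration (a common rule of thumb), not part of Pandas or the CBSE syllabus.

```python
import numpy as np
import pandas as pd

def choose_fill_value(column, skew_limit=1.0):
    """Hypothetical helper: return (strategy, value) for one column.

    skew_limit=1.0 is an assumed rule-of-thumb cut-off, not a Pandas default.
    """
    if not pd.api.types.is_numeric_dtype(column):
        return "mode", column.mode()[0]        # text/category: most common value
    if abs(column.skew()) > skew_limit:
        return "median", column.median()       # lopsided data, likely outliers
    return "mean", column.mean()               # roughly symmetric data

marks = pd.Series([85, np.nan, 78, 95, 88, 90])
salaries = pd.Series([25000, 28000, np.nan, 30000, 500000, np.nan, 27000])
cities = pd.Series(["Delhi", "Mumbai", np.nan, "Delhi"])

for name, col in [("marks", marks), ("salaries", salaries), ("cities", cities)]:
    strategy, value = choose_fill_value(col)
    print(name, "->", strategy, value)
```

With this sample data the marks get the mean, the salaries get the median, and the cities get the mode. For a practical file, looking at the data and justifying the choice in words matters more than the exact cut-off.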

Fill Categorical/Text Columns

python

# Fill text column with mode (most common value) or placeholder

import pandas as pd
import numpy as np

data = {"City": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"]}
df = pd.DataFrame(data)

# Option A: fill with most common value (mode)
mode_city = df["City"].mode()[0]
df["City"] = df["City"].fillna(mode_city)   # assign back instead of inplace=True
print("Filled with mode:")
print(df)

Expected Output:

Filled with mode:
      City
0    Delhi
1   Mumbai
2    Delhi   ← filled with "Delhi" (mode)
3    Delhi
4    Delhi   ← filled with "Delhi" (mode)
5  Chennai
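The other option, a placeholder, might look like the sketch below. The label "Unknown" is an arbitrary choice; any text that cannot be confused with a real city works.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"]})

# Option B: fill with a placeholder so it stays visible that the value was missing
df["City"] = df["City"].fillna("Unknown")
print(df["City"].tolist())
# ['Delhi', 'Mumbai', 'Unknown', 'Delhi', 'Unknown', 'Chennai']
```

A placeholder avoids guessing a city for students whose city is not known, while the mode keeps the column limited to real city names.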

Part 3 — Strategy 2: Drop Missing Values (`dropna`)

dropna() removes rows (or columns) that contain missing values.

python

# Program to demonstrate dropna() with different options

import pandas as pd
import numpy as np

data = {
    "Name"    : ["Arjun","Priya","Kiran","Meena","Rohan"],
    "Maths"   : [85, np.nan, 78, 95, np.nan],
    "Science" : [90, 82, np.nan, 88, 74],
    "AI"      : [92, 95, 88, np.nan, 79]
}
df = pd.DataFrame(data)
print("Original shape:", df.shape)

# Drop rows where ANY column has a missing value
df_any = df.dropna()
print("After dropna() — any missing:", df_any.shape)

# Drop rows where ALL columns are missing (rare, more conservative)
df_all = df.dropna(how="all")
print("After dropna(how='all')      :", df_all.shape)

# Drop rows only where a specific column is missing
df_maths = df.dropna(subset=["Maths"])
print("After dropna(subset=['Maths']):", df_maths.shape)

Expected Output:

Original shape: (5, 4)
After dropna() — any missing: (1, 4)
After dropna(how='all')      : (5, 4)
After dropna(subset=['Maths']): (3, 4)
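dropna() has two more options that are handy in practice: thresh and axis. The sketch below uses a made-up DataFrame with a mostly empty Remarks column to show both.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":    ["Arjun", "Priya", "Kiran", "Meena", "Rohan"],
    "Maths":   [85, np.nan, 78, 95, np.nan],
    "Science": [90, np.nan, np.nan, 88, 74],
    "Remarks": [np.nan, np.nan, np.nan, np.nan, "Good"],
})

# thresh=N keeps only rows with at least N non-missing values
df_thresh = df.dropna(thresh=3)
print("Rows with at least 3 values:", df_thresh.shape)              # (3, 4)

# axis=1 drops COLUMNS that contain any missing value
df_cols = df.dropna(axis=1)
print("Columns with no missing values:", list(df_cols.columns))     # ['Name']
```

thresh is a middle path between dropping every incomplete row and keeping rows that are almost empty.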

Part 4 — Choosing the Right Strategy

This is the decision that most students skip — and it is the most important one.

Situation	Best Strategy	Reason
Missing values are few (< 5% of rows)	`dropna()`	Small loss, clean data
Missing values are many (> 20%)	`fillna()` with mean/median	Can’t afford to lose rows
Column is numeric, symmetric distribution	`fillna(mean)`	Preserves average
Column is numeric, has outliers	`fillna(median)`	Median is robust to extremes
Column is text or category	`fillna(mode)` or `"Unknown"`	Mean/median don’t apply to text
Row is missing most of its values	`dropna()` for that row	Row adds no useful information
Column is missing more than 50%	Drop the column	Column is unreliable

python

# Program: complete missing data decision workflow

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")    # Replace with your file

# Step 1: Diagnose
missing_pct = (df.isnull().sum() / len(df)) * 100
print("Missing percentage per column:")
print(missing_pct.round(1))

# Step 2: Drop columns with > 50% missing
cols_to_drop = missing_pct[missing_pct > 50].index
df.drop(columns=cols_to_drop, inplace=True)
print(f"\nDropped columns: {list(cols_to_drop)}")

# Step 3: Fill remaining numeric columns with median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Step 4: Fill remaining text columns with placeholder
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].fillna("Unknown")

# Step 5: Verify
print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Step 6: Save
df.to_csv("data_cleaned.csv", index=False)
print("\nCleaned file saved.")
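To try the workflow without a CSV file, the same steps can be wrapped in a function and run on a small in-memory DataFrame. This is only a sketch: clean_missing, its max_missing_pct parameter, and the sample data are hypothetical, and it assumes every non-numeric column holds text.

```python
import numpy as np
import pandas as pd

def clean_missing(df, max_missing_pct=50):
    """Hypothetical helper: same steps as the workflow, returns a new DataFrame."""
    df = df.copy()                                    # leave the caller's data untouched
    missing_pct = df.isnull().mean() * 100            # mean of True/False = share missing
    df = df.drop(columns=missing_pct[missing_pct > max_missing_pct].index)

    # Numeric columns: fill with each column's median
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Remaining columns (assumed to be text): fill with a placeholder
    other_cols = df.columns.difference(numeric_cols)
    df[other_cols] = df[other_cols].fillna("Unknown")
    return df

sample = pd.DataFrame({
    "Student": ["Arjun", "Priya", "Kiran", "Meena"],
    "Maths":   [85, np.nan, 78, 95],
    "City":    ["Delhi", np.nan, "Mumbai", "Delhi"],
    "Hobby":   [np.nan, np.nan, np.nan, "Chess"],     # 75% missing, so it is dropped
})

cleaned = clean_missing(sample)
print(cleaned)
print("Missing after cleaning:", cleaned.isnull().sum().sum())
```

Returning a new DataFrame keeps the original available for comparison, which is useful when explaining the cleaning steps in a Viva.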

Part 5 — Forward Fill and Backward Fill (Time-Series Data)

These methods are used when data has a natural order — like daily temperatures or monthly rainfall. Instead of filling with a fixed value, you carry forward the last known value or fill backward from the next known value.

python

# Program to demonstrate forward fill and backward fill

import pandas as pd
import numpy as np

rainfall = pd.Series([120, np.nan, np.nan, 95, np.nan, 110, 130])

print("Original :", list(rainfall))

# Forward fill — use last known value
ffill = rainfall.ffill()
print("Forward  :", list(ffill))

# Backward fill — use next known value
bfill = rainfall.bfill()
print("Backward :", list(bfill))

Expected Output:

Original : [120.0, nan, nan, 95.0, nan, 110.0, 130.0]
Forward  : [120.0, 120.0, 120.0, 95.0, 95.0, 110.0, 130.0]
Backward : [120.0, 95.0, 95.0, 95.0, 110.0, 110.0, 130.0]
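Two edge cases are worth a short sketch: a gap at the very start of the series, and a long run of missing values. The limit parameter of ffill() caps how many consecutive values get filled. The rainfall numbers here are made up.

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([np.nan, 120, np.nan, np.nan, np.nan, 95])

# Forward fill cannot fill a gap at the very start: nothing comes before it
forward = rainfall.ffill()
print(forward.tolist())
# [nan, 120.0, 120.0, 120.0, 120.0, 95.0]

# limit=1 fills at most one consecutive missing value, leaving long gaps visible
forward_limited = rainfall.ffill(limit=1)
print(forward_limited.tolist())
# [nan, 120.0, 120.0, nan, nan, 95.0]

# Chaining ffill() then bfill() also covers the leading gap
both = rainfall.ffill().bfill()
print(both.tolist())
# [120.0, 120.0, 120.0, 120.0, 120.0, 95.0]
```

Carrying one reading across a long gap can hide the fact that several measurements were never taken, so a limit is a reasonable safeguard.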

The CBSE Class 11 Unit 5 content on data preprocessing and Class 12 Unit 1 (handling missing values in DataFrames) both expect you to know fillna() and dropna() at minimum. Forward/backward fill is a useful additional technique to demonstrate in your Viva.

Quick Revision Box

Method	What It Does
`df.isnull()`	Returns True where values are missing
`df.isnull().sum()`	Count of missing values per column
`df.isnull().sum().sum()`	Total missing values in DataFrame
`df.fillna(value)`	Replaces NaN with given value
`df["col"].fillna(df["col"].mean())`	Fills column NaN with column mean
`df["col"].fillna(df["col"].median())`	Fills with column median
`df["col"].fillna(df["col"].mode()[0])`	Fills with most common value
`df.dropna()`	Drops all rows with any NaN
`df.dropna(how="all")`	Drops rows where ALL values are NaN
`df.dropna(subset=["col"])`	Drops rows where specific column is NaN
`df.ffill()`	Forward fill — use previous value
`df.bfill()`	Backward fill — use next value
`inplace=True`	Modifies the original DataFrame directly; for a single column, assign the result back instead

Practice Questions

Q1 (2 marks): Write Python code to read a CSV file, check for missing values in each column, and fill all numeric missing values with the column mean.

Model Answer:

python

import pandas as pd

df = pd.read_csv("data.csv")
print("Missing values:", df.isnull().sum())

numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print("After filling:", df.isnull().sum())

Q2 (MCQ): Which method removes rows from a DataFrame where a specific column has missing values?

a) df.fillna(subset=["col"]) b) df.dropna(subset=["col"]) c) df.isnull(subset=["col"]) d) df.remove(na=["col"])

Answer: b) df.dropna(subset=["col"]) — the subset parameter tells Pandas to only drop rows where the specified column(s) have missing values, leaving other rows intact.

Frequently Asked Questions

Q1: Should I always fill missing values rather than dropping them? Not always. The decision depends on how much data you can afford to lose and why values are missing. If only 2–3 rows out of 500 have missing values, dropping them is clean and harmless. If 30% of your Maths scores are missing, dropping those rows destroys too much data — fill instead. As a general rule: if missing rows are less than 5% of your dataset, dropping is fine.
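The 5% rule of thumb from this answer can be sketched in code. The marks below are made up, and the median fill in the else branch is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Maths":   [85, np.nan, 78, 95, 88, 91, 67, 72, 80, 99],
    "Science": [90, 82, 75, 88, 74, 69, 93, 85, 77, 81],
})

# Share of rows that have at least one missing value
rows_missing_pct = df.isnull().any(axis=1).mean() * 100
print("Rows with missing values (%):", round(rows_missing_pct, 1))

if rows_missing_pct < 5:          # the 5% rule of thumb from the answer above
    cleaned = df.dropna()
    print("Dropped rows, new shape:", cleaned.shape)
else:
    cleaned = df.fillna(df.median())
    print("Filled with median, shape kept:", cleaned.shape)
```

Here 1 row out of 10 has a missing value, which is above 5%, so the sketch fills instead of dropping.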

Q2: Does filling missing values with the mean make my model less accurate? Filling with mean introduces a small inaccuracy — you are guessing values that were not observed. However, the alternative (dropping rows or leaving NaN) is usually worse: models that cannot handle NaN will crash, and dropping rows loses real data. Mean/median imputation is an acceptable trade-off for CBSE practicals and most real-world scenarios.

Q3: fillna() doesn’t seem to work. The NaN values are still there after I run it. This is a very common issue. By default, fillna() returns a new DataFrame without modifying the original. Save the result back with df = df.fillna(value), or pass inplace=True on the whole DataFrame: df.fillna(value, inplace=True). For a single column, always assign the result back, as in df["col"] = df["col"].fillna(value), because inplace=True on a selected column may not update the DataFrame in recent Pandas versions.
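A short sketch that reproduces the issue and the fix. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Maths": [85, np.nan, 78]})

# Calling fillna() without saving the result leaves df unchanged
df.fillna(0)
missing_after_bare_call = int(df["Maths"].isnull().sum())
print("Missing after bare call :", missing_after_bare_call)     # 1

# Assigning the result back is what actually updates df
df = df.fillna(0)
missing_after_assignment = int(df["Maths"].isnull().sum())
print("Missing after assignment:", missing_after_assignment)    # 0
```

Printing df.isnull().sum() right after the fill is a quick way to confirm that the change was actually stored.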
