Real data is never perfect. Sensors malfunction, survey respondents skip questions, spreadsheets get corrupted. Before you train any AI model, you must deal with missing values — or your model will either crash or produce wrong predictions.
This guide covers the complete missing data workflow required in Class 11, Unit 5: Data Literacy — Data Pre-processing and Class 12, Unit 1: Python Programming – II of the CBSE AI syllabus (Subject Code 843, 2025-26).
What You’ll Learn
- What missing data is and why it breaks AI models
- How to detect missing values with Pandas
- Three strategies (fill, drop, or carry a neighbouring value forward or backward) and when to use each
- Complete programs ready for your practical file and Lab Test
Why Missing Data Matters in AI
Most machine learning algorithms expect complete, clean data. When values are missing, they either stop with an error or silently produce wrong results. Both outcomes cost more than the 10 minutes it takes to clean the data first.
In India’s agricultural data collected by state governments, district-level crop yield records routinely have missing entries for remote areas. A predictive model trained on that data without handling missing values would produce unreliable yield forecasts — the same problem at a school scale is a failing practical program.
Pandas marks a missing value as NaN (Not a Number), which comes from NumPy as np.nan. NaN can appear in numeric columns and in text columns alike, and Pandas also treats Python's None as missing.
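A quick check of how Pandas treats the two common missing markers. This is a minimal sketch; the Series values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Both np.nan and Python's None count as missing in Pandas
marks = pd.Series([85.0, np.nan, 78.0])
cities = pd.Series(["Delhi", None, "Mumbai"])

print(marks.isnull().tolist())   # [False, True, False]
print(cities.isnull().tolist())  # [False, True, False]

# NaN is never equal to itself, so == cannot find it; use isnull() instead
print(np.nan == np.nan)          # False
```

This is why the detection programs in this guide rely on isnull() rather than comparing values with ==.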
Part 1 — Detecting Missing Values
Before fixing missing data, you must find it.
python
# Program to detect missing values in a dataset
import pandas as pd
import numpy as np
# Create a dataset with deliberate missing values
data = {
"Student" : ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha"],
"Maths" : [85, np.nan, 78, 95, np.nan, 88],
"Science" : [90, 82, np.nan, 88, 74, np.nan],
"AI" : [92, 95, 88, np.nan, 79, 91],
"City" : ["Delhi", "Mumbai", np.nan, "Chennai", "Delhi", "Mumbai"]
}
df = pd.DataFrame(data)
print("Dataset:")
print(df)
print("\n--- Missing Value Analysis ---")
# Count missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())
# Total missing values across entire DataFrame
print("\nTotal missing values:", df.isnull().sum().sum())
# Percentage missing per column
print("\nPercentage missing per column:")
print((df.isnull().sum() / len(df) * 100).round(1))
# Which rows have at least one missing value
print("\nRows with missing values:")
print(df[df.isnull().any(axis=1)])
Expected Output:
Dataset:
Student Maths Science AI City
0 Arjun 85.0 90.0 92.0 Delhi
1 Priya NaN 82.0 95.0 Mumbai
2 Kiran 78.0 NaN 88.0 NaN
3 Meena 95.0 88.0 NaN Chennai
4 Rohan NaN 74.0 79.0 Delhi
5 Sneha 88.0 NaN 91.0 Mumbai
--- Missing Value Analysis ---
Missing values per column:
Student 0
Maths 2
Science 2
AI 1
City 1
dtype: int64
Total missing values: 6
Percentage missing per column:
Student 0.0
Maths 33.3
Science 33.3
AI 16.7
City 16.7
dtype: float64
Rows with missing values:
Student Maths Science AI City
1 Priya NaN 82.0 95.0 Mumbai
2 Kiran 78.0 NaN 88.0 NaN
3 Meena 95.0 88.0 NaN Chennai
4 Rohan NaN 74.0 79.0 Delhi
5 Sneha 88.0 NaN 91.0 Mumbai
Key methods for detection:
| Method | What It Returns |
|---|---|
| df.isnull() | DataFrame of True/False, with True where a value is missing |
| df.isnull().sum() | Count of missing values per column |
| df.isnull().sum().sum() | Total missing values in entire DataFrame |
| df.info() | Shows non-null counts per column |
| df.isnull().any(axis=1) | True for rows that have at least one missing value |
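The methods in the table can be combined into a single summary. The helper name missing_report below is hypothetical (it is not a Pandas function), and the small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return one row per column with its missing count and percentage."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})

df = pd.DataFrame({
    "Maths": [85, np.nan, 78, 95],
    "City": ["Delhi", "Mumbai", np.nan, "Chennai"],
})
report = missing_report(df)
print(report)

# df.info() prints the non-null count of each column directly to the screen
df.info()
```

A report like this is a convenient first screenshot for a practical file, because it shows the count and the percentage side by side.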
Part 2 — Strategy 1: Fill Missing Values (fillna)
fillna() replaces missing values with something — without losing any rows.
Fill with Column Mean (Most Common for Numeric Data)
python
# Program to fill missing numeric values with column mean
import pandas as pd
import numpy as np
data = {
"Student": ["Arjun","Priya","Kiran","Meena","Rohan"],
"Maths" : [85, np.nan, 78, 95, np.nan],
"Science": [90, 82, np.nan, 88, 74]
}
df = pd.DataFrame(data)
print("Before filling:")
print(df)
print("\nMissing:", df.isnull().sum().sum())
# Fill with column mean
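# Assign the result back; inplace=True on a single column may not update df in recent Pandas versions
df["Maths"] = df["Maths"].fillna(df["Maths"].mean())
df["Science"] = df["Science"].fillna(df["Science"].mean())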
print("\nAfter filling with mean:")
print(df.round(2))
print("Missing:", df.isnull().sum().sum())
Expected Output:
Before filling:
Student Maths Science
0 Arjun 85.0 90.0
1 Priya NaN 82.0
2 Kiran 78.0 NaN
3 Meena 95.0 88.0
4 Rohan NaN 74.0
Missing: 3
After filling with mean:
Student Maths Science
0 Arjun 85.0 90.0
1 Priya 86.0 82.0
2 Kiran 78.0 83.5
3 Meena 95.0 88.0
4 Rohan 86.0 74.0
Missing: 0
Why mean? The mean preserves the overall average of the column — adding a mean value doesn’t shift the centre of your data.
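One side effect is worth knowing for the Viva: mean filling keeps the average but narrows the spread, because every filled value sits exactly at the centre. A minimal sketch using the Maths marks from the program above:

```python
import numpy as np
import pandas as pd

maths = pd.Series([85, np.nan, 78, 95, np.nan])
filled = maths.fillna(maths.mean())

# The mean is the same before and after filling
print("Mean before:", maths.mean(), "| after:", filled.mean())

# The standard deviation becomes smaller after filling
print("Std before :", round(maths.std(), 2), "| after:", round(filled.std(), 2))
```

The more values you fill with the mean, the stronger this narrowing becomes, which is one reason to check the missing percentage first.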
Fill with Median (Better When Data Has Outliers)
python
# Fill with median — better when data has extreme values
import pandas as pd
import numpy as np
salaries = pd.Series([25000, 28000, np.nan, 30000, 500000, np.nan, 27000])
print("Mean :", salaries.mean().round(0)) # pulled up by 500000
print("Median:", salaries.median()) # not affected by 500000
# Fill with median to avoid inflating missing values
salaries_filled = salaries.fillna(salaries.median())
print("\nAfter filling with median:")
print(salaries_filled)
Expected Output:
Mean : 122000.0
Median: 28000.0
After filling with median:
0 25000.0
1 28000.0
2 28000.0
3 30000.0
4 500000.0
5 28000.0
6 27000.0
dtype: float64
Mean vs Median for missing data — when to choose:
| Situation | Use |
|---|---|
| Data is roughly symmetric (most marks, heights) | Mean |
| Data has outliers (salaries, house prices) | Median |
| Categorical column (city names, grades) | Mode or placeholder |
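The first two rows of the table can be turned into a small helper. This is only a sketch: the function name pick_fill_value and the skew_limit of 1.0 are assumptions chosen for illustration, not a CBSE or Pandas rule. It uses Series.skew() to measure how lopsided the data is.

```python
import numpy as np
import pandas as pd

def pick_fill_value(series, skew_limit=1.0):
    """Return the median for strongly skewed data, otherwise the mean.

    skew_limit is an illustrative threshold, not a fixed rule.
    """
    if abs(series.skew()) > skew_limit:
        return series.median()
    return series.mean()

marks = pd.Series([70, 80, np.nan, 90])
salaries = pd.Series([25000, 28000, np.nan, 30000, 500000, np.nan, 27000])

print(pick_fill_value(marks))     # symmetric data, so the mean: 80.0
print(pick_fill_value(salaries))  # one extreme value, so the median: 28000.0
```

In a practical file it is usually clearer to state your choice in a comment than to automate it, but the helper shows the reasoning behind the table.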
Fill Categorical/Text Columns
python
# Fill text column with mode (most common value) or placeholder
import pandas as pd
import numpy as np
data = {"City": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"]}
df = pd.DataFrame(data)
# Option A: fill with most common value (mode)
mode_city = df["City"].mode()[0]
df["City"] = df["City"].fillna(mode_city)
print("Filled with mode:")
print(df)
Expected Output:
Filled with mode:
City
0 Delhi
1 Mumbai
2 Delhi ← filled with "Delhi" (mode)
3 Delhi
4 Delhi ← filled with "Delhi" (mode)
5 Chennai
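The program above shows Option A only. A sketch of the placeholder alternative (Option B), using the same City data, could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Chennai"]})

# Option B: mark missing cities explicitly instead of guessing the mode
df["City"] = df["City"].fillna("Unknown")
print(df["City"].tolist())
# ['Delhi', 'Mumbai', 'Unknown', 'Delhi', 'Unknown', 'Chennai']
```

A placeholder is the safer choice when you do not want to claim that a student lives in Delhi just because Delhi is the most common city.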
Part 3 — Strategy 2: Drop Missing Values (dropna)
dropna() removes rows (or columns) that contain missing values.
python
# Program to demonstrate dropna() with different options
import pandas as pd
import numpy as np
data = {
"Name" : ["Arjun","Priya","Kiran","Meena","Rohan"],
"Maths" : [85, np.nan, 78, 95, np.nan],
"Science" : [90, 82, np.nan, 88, 74],
"AI" : [92, 95, 88, np.nan, 79]
}
df = pd.DataFrame(data)
print("Original shape:", df.shape)
# Drop rows where ANY column has a missing value
df_any = df.dropna()
print("After dropna() — any missing:", df_any.shape)
# Drop rows where ALL columns are missing (rare, more conservative)
df_all = df.dropna(how="all")
print("After dropna(how='all') :", df_all.shape)
# Drop rows only where a specific column is missing
df_maths = df.dropna(subset=["Maths"])
print("After dropna(subset=['Maths']):", df_maths.shape)
Expected Output:
Original shape: (5, 4)
After dropna() — any missing: (1, 4)
After dropna(how='all') : (5, 4)
After dropna(subset=['Maths']): (3, 4)
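dropna() can also remove columns, and it can keep rows that are mostly complete. A minimal sketch on the same data, using the axis and thresh parameters of dropna():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":    ["Arjun", "Priya", "Kiran", "Meena", "Rohan"],
    "Maths":   [85, np.nan, 78, 95, np.nan],
    "Science": [90, 82, np.nan, 88, 74],
    "AI":      [92, 95, 88, np.nan, 79],
})

# Drop every COLUMN that has at least one missing value; only Name survives
df_cols = df.dropna(axis=1)
print(df_cols.shape)    # (5, 1)

# Keep rows that have at least 3 non-missing values; every row here qualifies
df_thresh = df.dropna(thresh=3)
print(df_thresh.shape)  # (5, 4)
```

Dropping columns this way is drastic, as the (5, 1) result shows, so it suits columns that are mostly empty rather than columns with one or two gaps.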
Part 4 — Choosing the Right Strategy
This is the decision that most students skip — and it is the most important one.
| Situation | Best Strategy | Reason |
|---|---|---|
| Missing values are few (< 5% of rows) | dropna() | Small loss, clean data |
| Missing values are many (> 20%) | fillna() with mean/median | Can’t afford to lose rows |
| Column is numeric, symmetric distribution | fillna(mean) | Preserves average |
| Column is numeric, has outliers | fillna(median) | Median is robust to extremes |
| Column is text or category | fillna(mode) or "Unknown" | Mean/median don’t apply to text |
| Row is missing most of its values | dropna() for that row | Row adds no useful information |
| Column is missing more than 50% | Drop the column | Column is unreliable |
python
# Program: complete missing data decision workflow
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv") # Replace with your file
# Step 1: Diagnose
missing_pct = (df.isnull().sum() / len(df)) * 100
print("Missing percentage per column:")
print(missing_pct.round(1))
# Step 2: Drop columns with > 50% missing
cols_to_drop = missing_pct[missing_pct > 50].index
df.drop(columns=cols_to_drop, inplace=True)
print(f"\nDropped columns: {list(cols_to_drop)}")
# Step 3: Fill remaining numeric columns with median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Step 4: Fill remaining text columns with placeholder
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].fillna("Unknown")
# Step 5: Verify
print("\nMissing values after cleaning:")
print(df.isnull().sum())
# Step 6: Save
df.to_csv("data_cleaned.csv", index=False)
print("\nCleaned file saved.")
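The workflow above needs a data.csv file. If you do not have one yet, the same steps can be tried on a small in-memory DataFrame. This is a sketch under assumptions: the function name clean_missing and the sample rows are made up, and it selects text columns with exclude="number" rather than include='object'.

```python
import numpy as np
import pandas as pd

def clean_missing(df, max_missing_pct=50):
    """Drop sparse columns, then fill numbers with the median and text with 'Unknown'."""
    df = df.copy()  # leave the caller's DataFrame untouched

    # Steps 1 and 2: diagnose, then drop columns above the limit
    missing_pct = df.isnull().mean() * 100
    df = df.drop(columns=missing_pct[missing_pct > max_missing_pct].index)

    # Step 3: numeric columns get their own median
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Step 4: remaining (text) columns get a placeholder
    text_cols = df.select_dtypes(exclude="number").columns
    df[text_cols] = df[text_cols].fillna("Unknown")
    return df

raw = pd.DataFrame({
    "Student": ["Arjun", "Priya", "Kiran", "Meena"],
    "Maths":   [85, np.nan, 78, 95],
    "City":    ["Delhi", np.nan, "Mumbai", "Delhi"],
    "Remarks": [np.nan, np.nan, np.nan, "Good"],  # 75% missing, so it is dropped
})
cleaned = clean_missing(raw)
print(cleaned)
print("Missing after cleaning:", cleaned.isnull().sum().sum())
```

Wrapping the steps in a function keeps the original DataFrame available, so you can print a before-and-after comparison.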
Part 5 — Forward Fill and Backward Fill (Time-Series Data)
These methods are used when data has a natural order — like daily temperatures or monthly rainfall. Instead of filling with a fixed value, you carry forward the last known value or fill backward from the next known value.
python
# Program to demonstrate forward fill and backward fill
import pandas as pd
import numpy as np
rainfall = pd.Series([120, np.nan, np.nan, 95, np.nan, 110, 130])
print("Original :", rainfall.tolist())
# Forward fill: use the last known value
ffill = rainfall.ffill()
print("Forward :", ffill.tolist())
# Backward fill: use the next known value
bfill = rainfall.bfill()
print("Backward :", bfill.tolist())
Expected Output:
Original : [120.0, nan, nan, 95.0, nan, 110.0, 130.0]
Forward : [120.0, 120.0, 120.0, 95.0, 95.0, 110.0, 130.0]
Backward : [120.0, 95.0, 95.0, 95.0, 110.0, 110.0, 130.0]
The CBSE Class 11 Unit 5 content on data preprocessing and Class 12 Unit 1 (handling missing values in DataFrames) both expect you to know fillna() and dropna() at minimum. Forward/backward fill is a useful additional technique to demonstrate in your Viva.
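One limitation of these two methods: forward fill has nothing to copy into a gap at the very start, and backward fill has nothing to copy into a gap at the very end. A minimal sketch with made-up rainfall values:

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([np.nan, 120, np.nan, 95, np.nan])

# Forward fill cannot fill the first value: nothing comes before it
print(rainfall.ffill().tolist())  # [nan, 120.0, 120.0, 95.0, 95.0]

# Backward fill cannot fill the last value: nothing comes after it
print(rainfall.bfill().tolist())  # [120.0, 120.0, 95.0, 95.0, nan]

# Chaining the two fills every gap
both = rainfall.ffill().bfill()
print(both.tolist())              # [120.0, 120.0, 120.0, 95.0, 95.0]
```

Always run isnull().sum() after a forward or backward fill to confirm that no edge values were left behind.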
Quick Revision Box
| Method | What It Does |
|---|---|
| df.isnull() | Returns True where values are missing |
| df.isnull().sum() | Count of missing values per column |
| df.isnull().sum().sum() | Total missing values in DataFrame |
| df.fillna(value) | Replaces NaN with given value |
| df["col"].fillna(df["col"].mean()) | Fills column NaN with column mean |
| df["col"].fillna(df["col"].median()) | Fills with column median |
| df["col"].fillna(df["col"].mode()[0]) | Fills with most common value |
| df.dropna() | Drops all rows with any NaN |
| df.dropna(how="all") | Drops rows where ALL values are NaN |
| df.dropna(subset=["col"]) | Drops rows where specific column is NaN |
| df.ffill() | Forward fill: use previous value |
| df.bfill() | Backward fill: use next value |
| inplace=True | Modifies the original DataFrame directly |
Practice Questions
Q1 (2 marks): Write Python code to read a CSV file, check for missing values in each column, and fill all numeric missing values with the column mean.
Model Answer:
python
import pandas as pd
df = pd.read_csv("data.csv")
print("Missing values:", df.isnull().sum())
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print("After filling:", df.isnull().sum())
Q2 (MCQ): Which method removes rows from a DataFrame where a specific column has missing values?
a) df.fillna(subset=["col"]) b) df.dropna(subset=["col"]) c) df.isnull(subset=["col"]) d) df.remove(na=["col"])
Answer: b) df.dropna(subset=["col"]) — the subset parameter tells Pandas to only drop rows where the specified column(s) have missing values, leaving other rows intact.
Frequently Asked Questions
Q1: Should I always fill missing values rather than dropping them? Not always. The decision depends on how much data you can afford to lose and why values are missing. If only 2–3 rows out of 500 have missing values, dropping them is clean and harmless. If 30% of your Maths scores are missing, dropping those rows destroys too much data — fill instead. As a general rule: if missing rows are less than 5% of your dataset, dropping is fine.
Q2: Does filling missing values with the mean make my model less accurate? Filling with mean introduces a small inaccuracy — you are guessing values that were not observed. However, the alternative (dropping rows or leaving NaN) is usually worse: models that cannot handle NaN will crash, and dropping rows loses real data. Mean/median imputation is an acceptable trade-off for CBSE practicals and most real-world scenarios.
Q3: fillna() doesn’t seem to work and the NaN values are still there after I run it. This is a very common issue. By default, fillna() returns a new DataFrame without modifying the original. Either save the result back: df = df.fillna(value), or use inplace=True on the whole DataFrame: df.fillna(value, inplace=True). For a single column, always assign back: df["col"] = df["col"].fillna(value), because inplace=True on a selected column may not update the parent DataFrame in recent Pandas versions.
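The behaviour described in Q3 can be seen in a few lines. A minimal sketch with a made-up Maths column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Maths": [85, np.nan, 78]})

# The result is thrown away here, so df still has its NaN
df.fillna(0)
before = int(df["Maths"].isnull().sum())
print("Missing without saving:", before)   # 1

# Saving the result back actually changes df
df = df.fillna(0)
after = int(df["Maths"].isnull().sum())
print("Missing after saving  :", after)    # 0
```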
