If NumPy is for working with numbers, Pandas is for working with data. The moment you have a table — student records, weather readings, sales data — Pandas is what you reach for. This tutorial covers everything Class 11 needs: creating DataFrames, reading CSV files, cleaning data, and analysing it, with every program ready for your practical file.
This tutorial covers Unit 3: Python Programming (Level 2) and supports Unit 5: Data Literacy – Data Pre-processing of the CBSE AI Class 11 syllabus (Subject Code 843, 2025-26). The same programs also map directly to Class 12, Unit 1: Python Programming – II sample programs.
What You’ll Learn
- What a DataFrame is and how Pandas thinks about data
- How to create DataFrames from dictionaries, lists, and CSV files
- Essential operations: exploring, filtering, sorting, grouping data
- Handling missing values — the most important data cleaning skill
- Exporting cleaned data back to CSV
- All practical file programs with code and expected output
What Is Pandas?
Pandas is a Python library for working with structured, tabular data — data that has rows and columns, like a spreadsheet or a database table.
Think of it this way: NumPy is excellent at fast maths on arrays of numbers. But real datasets have mixed types — names (text), marks (numbers), dates, categories. Pandas handles all of that in a single structure called a DataFrame.
In India’s agriculture sector, government agencies use Pandas to load district-level crop production data from CSV files published on data.gov.in, clean missing entries for districts that didn’t report, and compute state-wise averages. The exact same workflow — read_csv(), isnull(), fillna(), groupby() — is what you will practise here.
```python
import pandas as pd  # Standard alias; always use this
```
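The read_csv(), isnull(), fillna(), groupby() workflow described above can be sketched in a few lines. This is only an illustration: the district names and production figures are invented, and the CSV is read from a string with io.StringIO so the example runs without downloading a file.

```python
import io
import pandas as pd

# Hypothetical crop data; names and numbers are invented for illustration
csv_text = """State,District,Production_tonnes
Punjab,Ludhiana,1200
Punjab,Amritsar,
Kerala,Kochi,400
Kerala,Kollam,600
"""

# read_csv(): load the table (a filename works the same way as this string buffer)
crops = pd.read_csv(io.StringIO(csv_text))

# isnull(): count the districts that did not report
missing = crops["Production_tonnes"].isnull().sum()
print("Missing entries:", missing)

# fillna(): replace the gap with the mean of the reported values
mean_production = crops["Production_tonnes"].mean()
crops["Production_tonnes"] = crops["Production_tonnes"].fillna(mean_production)

# groupby(): compute the state-wise average
state_avg = crops.groupby("State")["Production_tonnes"].mean()
print(state_avg)
```

Each of these four steps is covered in detail in the parts that follow.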
Part 1 — Creating DataFrames
A DataFrame is Pandas’ core data structure. Think of it as a table with labelled rows (index) and labelled columns (column names).
From a Dictionary
The most common way to create a DataFrame for practice programs:
```python
# Program to create a Pandas DataFrame using a dictionary of lists
# (each list is a sequence data type) and perform basic display operations
import pandas as pd

data = {
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha"],
    "Marks": [85, 92, 78, 95, 70, 88],
    "Grade": ["B", "A", "C", "A", "D", "B"],
    "Attendance": [88, 95, 72, 98, 65, 91]
}
df = pd.DataFrame(data)

# a) Display the full DataFrame
print("Full DataFrame:")
print(df)

# b) Display first 5 records
print("\nFirst 5 records:")
print(df.head(5))

# c) Display last 10 records (only 6 rows exist, so all are shown)
print("\nLast 10 records:")
print(df.tail(10))

# d) Display the number of missing values
print("\nMissing values in each column:")
print(df.isnull().sum())
```
Expected Output:
Full DataFrame:
Name Marks Grade Attendance
0 Arjun 85 B 88
1 Priya 92 A 95
2 Kiran 78 C 72
3 Meena 95 A 98
4 Rohan 70 D 65
5 Sneha 88 B 91
First 5 records:
Name Marks Grade Attendance
0 Arjun 85 B 88
1 Priya 92 A 95
2 Kiran 78 C 72
3 Meena 95 A 98
4 Rohan 70 D 65
Last 10 records:
Name Marks Grade Attendance
0 Arjun 85 B 88
...
Missing values in each column:
Name 0
Marks 0
Grade 0
Attendance 0
dtype: int64
📌 Class 12 note: This program directly maps to the Class 12 (Subject Code 843, 2025-26) Unit 1 sample program: “Write Python code to create a Pandas DataFrame using any sequence data type”. Strictly speaking, a dictionary is a mapping type, not a sequence; it is the lists stored inside the dictionary that are the sequence data type.
From a List of Lists
```python
# Program to create a DataFrame from a list of lists
import pandas as pd

rows = [
    ["Aarav", 82, "B"],
    ["Diya", 91, "A"],
    ["Ishaan", 74, "C"],
    ["Kavya", 96, "A"]
]
df = pd.DataFrame(rows, columns=["Name", "Marks", "Grade"])
print(df)
```
Expected Output:
Name Marks Grade
0 Aarav 82 B
1 Diya 91 A
2 Ishaan 74 C
3 Kavya 96 A
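A closely related pattern, shown here as an optional sketch rather than a syllabus program, is a list of dictionaries. Each dictionary is one row and its keys become the column names, so no columns argument is needed. The variable names records and df_records are chosen only for this example.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Name": "Aarav", "Marks": 82, "Grade": "B"},
    {"Name": "Diya", "Marks": 91, "Grade": "A"},
    {"Name": "Ishaan", "Marks": 74, "Grade": "C"},
]
df_records = pd.DataFrame(records)

print(df_records)
print("Shape:", df_records.shape)
```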
Part 2 — Exploring a DataFrame
Before analysing any dataset, you always explore it first. The five attributes and methods used in the program below are a standard starting sequence in data science projects.
```python
# Program to explore a DataFrame using standard methods
import pandas as pd

data = {
    "City": ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Kolkata"],
    "Population": [20667656, 32941309, 13193000, 10971108, 14850066],
    "Area_km2": [603, 1484, 741, 426, 205],
    "Literacy_%": [89.2, 86.3, 87.7, 90.2, 87.1]
}
df = pd.DataFrame(data)

print("Shape (rows, columns):", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)
print("\nFirst 3 rows:\n", df.head(3))
print("\nStatistical Summary:\n", df.describe())
```
Expected Output:
Shape (rows, columns): (5, 4)
Column Names: ['City', 'Population', 'Area_km2', 'Literacy_%']
Data Types:
City object
Population int64
Area_km2 int64
Literacy_% float64
dtype: object
First 3 rows:
City Population Area_km2 Literacy_%
0 Mumbai 20667656 603 89.2
1 Delhi 32941309 1484 86.3
2 Bengaluru 13193000 741 87.7
Statistical Summary:
Population Area_km2 Literacy_%
count 5.000000e+00 5.000000 5.000000
mean 1.852463e+07 691.800000 88.100000
...
The exploration sequence — memorise this for Viva:
| Method | What It Tells You |
|---|---|
| df.shape | Number of rows and columns |
| df.columns | Column names |
| df.dtypes | Data type of each column |
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.info() | Column types + non-null counts + memory |
| df.describe() | Count, mean, std, min, quartiles, and max for numeric columns |
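The program above does not call df.info() or df.tail(), so here is a short sketch of both on a cut-down version of the same city data. Note that info() prints its report by itself and returns None, so it is not wrapped in print(). The names cities and last_two are chosen only for this example.

```python
import pandas as pd

cities = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Kolkata"],
    "Area_km2": [603, 1484, 741, 426, 205],
})

# info() prints column names, non-null counts, dtypes, and memory usage
cities.info()

# tail(2) returns the last two rows and keeps their original index labels
last_two = cities.tail(2)
print(last_two)
```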
Part 3 — Reading and Writing CSV Files
CSV (Comma-Separated Values) is one of the most common formats for sharing tabular data, and many real-world AI projects begin by loading a CSV file.
Reading a CSV File
```python
# Program to read a CSV file and perform statistical analysis
# (Download dataset from Kaggle, data.gov.in, or use rainfall.csv from CBSE)
import pandas as pd

# Read the CSV
df = pd.read_csv("rainfall.csv")  # Replace with your filename

# a) Basic exploration
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# b) Statistical summary
print("\nStatistical Summary:")
print(df.describe())

# c) Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# d) Column-wise statistics
print("\nMean of each numeric column:")
print(df.mean(numeric_only=True))
```
Which file to use: The CBSE Class 11 Unit 5 syllabus specifically mentions rainfall.csv for data literacy programs. Ask your teacher for this file, or download Indian rainfall data from data.gov.in or IMD (India Meteorological Department).
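If the file is not in the same folder as your program, read_csv() stops with a FileNotFoundError. The sketch below shows one way to handle that while practising: it tries to load rainfall.csv and, if the file is missing, falls back to a tiny stand-in table. The column names Month and Rainfall_mm and the values in the fallback are invented for illustration and are not taken from the CBSE file.

```python
import pandas as pd

filename = "rainfall.csv"  # assumed filename; change it to match your file

try:
    rain = pd.read_csv(filename)
    print("Loaded", rain.shape[0], "rows from", filename)
except FileNotFoundError:
    print("Could not find", filename, "- check the name and the folder.")
    # Hypothetical stand-in data so the rest of the program can still run
    rain = pd.DataFrame({
        "Month": ["Jun", "Jul", "Aug"],
        "Rainfall_mm": [120.5, 310.0, 280.25],
    })

print(rain.head())
```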
Writing a DataFrame to CSV
```python
# Program to export a DataFrame to a CSV file
import pandas as pd

data = {
    "Student": ["Arjun", "Priya", "Kiran"],
    "Score": [85, 92, 78]
}
df = pd.DataFrame(data)

# Save to CSV; index=False prevents adding an extra index column
df.to_csv("student_scores.csv", index=False)
print("File saved successfully as student_scores.csv")

# Read it back to verify
df_verify = pd.read_csv("student_scores.csv")
print(df_verify)
```
Expected Output:
File saved successfully as student_scores.csv
Student Score
0 Arjun 85
1 Priya 92
2 Kiran 78
index=False explained: Without it, Pandas writes the row numbers (0, 1, 2…) as an unnamed first column of your CSV. When you read the file back, those numbers appear as an extra column called Unnamed: 0 next to the new index. Use index=False when saving, unless the index holds labels you actually want to keep.
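To see the difference for yourself, the sketch below writes the same small DataFrame both ways. It relies on the fact that to_csv() returns the CSV text as a string when no filename is given, and it reads that text back through io.StringIO, so no files are created.

```python
import io
import pandas as pd

df = pd.DataFrame({"Student": ["Arjun", "Priya"], "Score": [85, 92]})

# With no filename, to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()                # index=True is the default
without_index = df.to_csv(index=False)

print("With index:")
print(with_index)
print("Without index:")
print(without_index)

# Reading the first version back produces the extra column
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```

The last line prints ['Unnamed: 0', 'Student', 'Score'], which is the extra column that index=False avoids.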
Part 4 — Handling Missing Values
This is the most important data cleaning skill in AI. Real datasets are almost always incomplete — sensors fail, forms are left blank, values get corrupted. Before training any model, you must handle missing data.
```python
# Program to detect and handle missing values in a DataFrame
import pandas as pd
import numpy as np

# Dataset with deliberate missing values (NaN = Not a Number)
data = {
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan"],
    "Marks": [85, np.nan, 78, 95, np.nan],
    "Attendance": [88, 95, np.nan, 98, 65],
    "Grade": ["B", "A", np.nan, "A", "D"]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nTotal missing values:", df.isnull().sum().sum())

# Strategy 1: Fill numeric missing values with column mean
# (assign the result back to the column; calling fillna(..., inplace=True)
# on df["Marks"] raises a warning in recent Pandas and may not update df)
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())

# Strategy 2: Fill numeric missing values with a specific value
df["Attendance"] = df["Attendance"].fillna(0)

# Strategy 3: Fill text/category column with a placeholder
df["Grade"] = df["Grade"].fillna("Unknown")

print("\nDataFrame after handling missing values:")
print(df)
print("\nMissing values after cleaning:")
print(df.isnull().sum())
```
Expected Output:
Original DataFrame:
Name Marks Attendance Grade
0 Arjun 85.0 88.0 B
1 Priya NaN 95.0 A
2 Kiran 78.0 NaN NaN
3 Meena 95.0 98.0 A
4 Rohan NaN 65.0 D
Missing values per column:
Name 0
Marks 2
Attendance 1
Grade 1
dtype: int64
Total missing values: 4
DataFrame after handling missing values:
Name Marks Attendance Grade
0 Arjun 85.0 88.0 B
1 Priya 86.0 95.0 A
2 Kiran 78.0 0.0 Unknown
3 Meena 95.0 98.0 A
4 Rohan 86.0 65.0 D
Missing values after cleaning:
Name 0
Marks 0
Attendance 0
Grade 0
dtype: int64
Four strategies for missing values:
| Strategy | When to Use | Code |
|---|---|---|
| Fill with mean | Numeric column, data is roughly symmetric | df["col"] = df["col"].fillna(df["col"].mean()) |
| Fill with median | Numeric column with outliers | df["col"] = df["col"].fillna(df["col"].median()) |
| Fill with placeholder | Text/category column | df["col"] = df["col"].fillna("Unknown") |
| Drop rows | Very few rows missing, can afford to lose them | df.dropna(inplace=True) |
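The program above uses the mean and a placeholder, so here is a short sketch of the other two strategies in the table: filling with the median and dropping rows. The marks are made up for this example, and the value 10 is a deliberate outlier that shows why the median can be the safer choice.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan"],
    "Marks": [85, np.nan, 78, 95, 10],  # 10 is a deliberate outlier
})

# The outlier drags the mean down; the median is less affected
print("Mean  :", scores["Marks"].mean())    # 67.0
print("Median:", scores["Marks"].median())  # 81.5

# Fill with median (work on a copy so the original stays available)
filled = scores.copy()
filled["Marks"] = filled["Marks"].fillna(filled["Marks"].median())
print(filled)

# Drop rows: returns a new DataFrame without Priya's row
dropped = scores.dropna()
print(dropped)
```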
Part 5 — Filtering, Sorting, and Selecting Data
```python
# Program to filter, sort, and select data from a DataFrame
import pandas as pd

data = {
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha", "Dev"],
    "Marks": [85, 92, 78, 95, 70, 88, 63],
    "Grade": ["B", "A", "C", "A", "D", "B", "D"],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi", "Chennai"]
}
df = pd.DataFrame(data)

# Filter: students scoring above 80
high_scorers = df[df["Marks"] > 80]
print("Students scoring above 80:")
print(high_scorers)

# Filter with multiple conditions: marks > 75 AND city is Delhi
delhi_toppers = df[(df["Marks"] > 75) & (df["City"] == "Delhi")]
print("\nDelhi students scoring above 75:")
print(delhi_toppers)

# Select specific columns
name_marks = df[["Name", "Marks"]]
print("\nName and Marks only:")
print(name_marks)

# Sort by Marks descending
sorted_df = df.sort_values("Marks", ascending=False)
print("\nSorted by Marks (highest first):")
print(sorted_df)
```
Expected Output:
Students scoring above 80:
Name Marks Grade City
0 Arjun 85 B Delhi
1 Priya 92 A Mumbai
3 Meena 95 A Chennai
5 Sneha 88 B Delhi
Delhi students scoring above 75:
Name Marks Grade City
0 Arjun 85 B Delhi
5 Sneha 88 B Delhi
Name and Marks only:
Name Marks
0 Arjun 85
...
Sorted by Marks (highest first):
Name Marks Grade City
3 Meena 95 A Chennai
1 Priya 92 A Mumbai
...
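The program above combines conditions with & (AND). As an optional extension, the sketch below shows three related patterns on the same student data: | for OR, isin() for matching any value in a list, and sorting by two columns at once. The variable names either, metro, and ordered are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha", "Dev"],
    "Marks": [85, 92, 78, 95, 70, 88, 63],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi", "Chennai"],
})

# OR: marks above 90 OR city is Chennai (each condition needs its own brackets)
either = df[(df["Marks"] > 90) | (df["City"] == "Chennai")]
print(either)

# isin(): keep rows whose City is any value in the list
metro = df[df["City"].isin(["Delhi", "Mumbai"])]
print(metro)

# Sort by City (A to Z), then by Marks (highest first) within each city
ordered = df.sort_values(["City", "Marks"], ascending=[True, False])
print(ordered)
```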
Part 6 — Grouping and Aggregation
GroupBy is one of the most used operations in data analysis — it answers questions like “what is the average marks by city?” or “how many students per grade?”
```python
# Program to demonstrate groupby and aggregation in Pandas
import pandas as pd

data = {
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha", "Dev", "Anita"],
    "Marks": [85, 92, 78, 95, 70, 88, 63, 91],
    "Grade": ["B", "A", "C", "A", "D", "B", "D", "A"],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi", "Chennai", "Mumbai"]
}
df = pd.DataFrame(data)

# Average marks by Grade
print("Average marks by Grade:")
print(df.groupby("Grade")["Marks"].mean())

# Count of students by City
print("\nNumber of students per City:")
print(df.groupby("City")["Name"].count())

# Multiple aggregations at once
print("\nMarks summary by City:")
print(df.groupby("City")["Marks"].agg(["mean", "min", "max"]))
```
Expected Output:
Average marks by Grade:
Grade
A 92.666667
B 86.500000
C 78.000000
D 66.500000
Name: Marks, dtype: float64
Number of students per City:
City
Chennai 2
Delhi 3
Mumbai 3
Name: Name, dtype: int64
Marks summary by City:
mean min max
City
Chennai 79.000000 63 95
Delhi 83.666667 78 88
Mumbai 84.333333 70 92
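The second question from the introduction to this part, "how many students per grade?", can be answered in the same way. The sketch below uses the same eight students and shows two options: groupby() with size(), and the shortcut value_counts(), which gives the same counts sorted from most to least frequent.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran", "Meena", "Rohan", "Sneha", "Dev", "Anita"],
    "Grade": ["B", "A", "C", "A", "D", "B", "D", "A"],
})

# size() counts the rows in each group
per_grade = df.groupby("Grade").size()
print(per_grade)

# value_counts() gives the same counts, most frequent grade first
grade_counts = df["Grade"].value_counts()
print(grade_counts)
```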
Part 7 — Adding and Dropping Columns
```python
# Program to add new columns and drop unwanted columns
import pandas as pd

data = {
    "Name": ["Arjun", "Priya", "Kiran", "Meena"],
    "Marks": [85, 92, 78, 95],
    "Max_Marks": [100, 100, 100, 100]
}
df = pd.DataFrame(data)

# Add a new column: Percentage
df["Percentage"] = (df["Marks"] / df["Max_Marks"]) * 100

# Add a new column based on condition: Pass/Fail
df["Result"] = df["Marks"].apply(lambda x: "Pass" if x >= 33 else "Fail")
print("DataFrame with new columns:")
print(df)

# Drop the Max_Marks column (no longer needed)
df.drop(columns=["Max_Marks"], inplace=True)
print("\nAfter dropping Max_Marks:")
print(df)
```
Expected Output:
DataFrame with new columns:
Name Marks Max_Marks Percentage Result
0 Arjun 85 100 85.0 Pass
1 Priya 92 100 92.0 Pass
2 Kiran 78 100 78.0 Pass
3 Meena 95 100 95.0 Pass
After dropping Max_Marks:
Name Marks Percentage Result
0 Arjun 85 85.0 Pass
1 Priya 92 92.0 Pass
2 Kiran 78 78.0 Pass
3 Meena 95 95.0 Pass
lambda in one line: lambda x: "Pass" if x >= 33 else "Fail" is a small anonymous function that runs on each value in the Marks column. apply() passes each value through it and returns the result. It is the clean way to add a calculated category column.
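If the lambda still looks unfamiliar, it may help to see it next to an ordinary named function that does the same job. The sketch below also includes np.where, a NumPy alternative that is not part of the program above and is shown only for comparison. The student Tara and her mark of 28 are invented so that at least one row fails.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Tara"],  # Tara is a hypothetical failing student
    "Marks": [85, 92, 28],
})

# Option 1: the lambda from the program above
df["Result_lambda"] = df["Marks"].apply(lambda x: "Pass" if x >= 33 else "Fail")

# Option 2: the same logic as a named function
def pass_or_fail(marks):
    return "Pass" if marks >= 33 else "Fail"

df["Result_named"] = df["Marks"].apply(pass_or_fail)

# Option 3: np.where checks the whole column at once, without apply()
df["Result_where"] = np.where(df["Marks"] >= 33, "Pass", "Fail")

print(df)
```

All three columns hold the same values, so pick whichever form reads most clearly to you.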
NumPy vs Pandas — Knowing Which to Use
Students often get confused about when to use NumPy and when to use Pandas. Here is the rule:
| Use Case | NumPy | Pandas |
|---|---|---|
| Pure number crunching (arrays, matrices) | ✅ | Not needed |
| Tabular data (rows + columns, mixed types) | ❌ | ✅ |
| Statistical analysis on a single column | ✅ | ✅ |
| Reading CSV files | ❌ | ✅ |
| Filtering rows by condition | Possible but verbose | ✅ |
| Input to Scikit-learn | ✅ (arrays) | ✅ (DataFrames) |
| Matrix operations | ✅ | ❌ |
In practice, they work together: Pandas loads and cleans the data, NumPy does the maths underneath, Scikit-learn trains the model.
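As a small illustration of the two libraries working together, the sketch below pulls one column out of a DataFrame as a NumPy array with to_numpy() and computes the mean both ways. The variable name marks_array is chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Arjun", "Priya", "Kiran"],
    "Marks": [85, 92, 78],
})

# to_numpy() hands the column's values to NumPy as an array
marks_array = df["Marks"].to_numpy()
print(type(marks_array))

# The same statistic, computed by NumPy and by Pandas
numpy_mean = np.mean(marks_array)
pandas_mean = df["Marks"].mean()
print(numpy_mean, pandas_mean)  # 85.0 85.0
```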
Quick Revision Box
| Function | What It Does |
|---|---|
| pd.DataFrame(data) | Creates a DataFrame from a dictionary or list |
| pd.read_csv("file.csv") | Reads a CSV file into a DataFrame |
| df.to_csv("file.csv", index=False) | Saves DataFrame to CSV without extra index column |
| df.shape | Returns (rows, columns) |
| df.head(n) | First n rows |
| df.tail(n) | Last n rows |
| df.info() | Column types and null counts |
| df.describe() | Statistical summary of numeric columns |
| df.isnull().sum() | Count of missing values per column |
| df.fillna(value, inplace=True) | Fill missing values with given value |
| df.dropna(inplace=True) | Remove rows with any missing value |
| df[df["col"] > value] | Filter rows by condition |
| df.sort_values("col") | Sort by column (ascending by default) |
| df.groupby("col").mean() | Group rows and compute mean per group |
| df["new"] = ... | Add a new column |
| df.drop(columns=["col"]) | Remove a column |
| df["col"].apply(func) | Apply a function to every value in a column |
Practice Questions
Q1 (2 marks): Write a Python program to create a Pandas DataFrame of 3 students with columns Name, Marks, and Grade. Display the first 2 rows and count the missing values.
Model Answer:
```python
import pandas as pd

data = {
    "Name": ["Arjun", "Priya", "Kiran"],
    "Marks": [85, 92, 78],
    "Grade": ["B", "A", "C"]
}
df = pd.DataFrame(data)
print(df.head(2))
print("Missing values:", df.isnull().sum())
```
Q2 (MCQ): Which Pandas method removes rows that contain missing values?
a) df.fillna() b) df.isnull() c) df.dropna() d) df.remove()
Answer: c) df.dropna() — removes all rows containing at least one missing (NaN) value.
Frequently Asked Questions
Q1: What is the difference between df.fillna() and df.dropna()? fillna() replaces missing values with something — a number, a string, or the column mean — so you keep all your rows. dropna() deletes any row that has at least one missing value. Use fillna() when you cannot afford to lose data (small datasets, important rows). Use dropna() when you have enough data and the missing rows are few and random.
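Q2: Why does inplace=True appear in so many Pandas operations? By default, Pandas operations return a new DataFrame and leave the original unchanged. inplace=True tells the method to modify the existing object instead of returning a new one. Without it, you must write df = df.fillna(0) to save the change. With it, df.fillna(0, inplace=True) modifies df directly. On a whole DataFrame, both approaches give the same result. On a single selected column, such as df["col"].fillna(0, inplace=True), recent Pandas versions raise a warning and the change may not reach df, so use the assignment form df["col"] = df["col"].fillna(0) there.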
Q3: What CSV file should I use for the Unit 5 practical programs? The CBSE Class 11 AI syllabus specifically references rainfall.csv for Unit 5 programs. Ask your teacher for this file. Alternatively, download Indian district-level rainfall data from the IMD website or data.gov.in. Any properly formatted CSV works — the programs are not tied to a specific dataset.
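To make the answer to Q2 concrete, the sketch below applies fillna() to a whole DataFrame in both styles and checks that the results match. It also shows what happens when you use neither: the original is left unchanged. The variable names original, a, and b are chosen only for this example.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"Marks": [85, np.nan, 78]})

# Style 1: assign the returned DataFrame back to the variable
a = original.copy()
a = a.fillna(0)

# Style 2: inplace=True on the whole DataFrame
b = original.copy()
b.fillna(0, inplace=True)

print(a.equals(b))  # True: both styles give the same result

# Neither style: the result is thrown away, so original still has its NaN
original.fillna(0)
print(original["Marks"].isnull().sum())  # 1
```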