Learn Pandas for Data Analysis: A Beginner's Guide
If you are a beginner who wants to learn pandas for data analysis, you are in the right place. Pandas is one of the most widely used Python libraries for handling, cleaning, and transforming data. In this pandas tutorial in Python, you will work through example datasets step by step.
You will first learn how to use pandas to read files, create DataFrames, and explore datasets. After that, we will move on to data transformation, where you will learn how to modify a DataFrame, filter data, and apply operations.
This end-to-end pandas tutorial is designed for beginners who want practical knowledge with worked examples. By the end of this guide, you will feel confident working with datasets and performing data analysis using pandas in Python.
Getting Started with Pandas for Data Analysis
Getting started with pandas for data analysis may feel confusing at first. However, once you understand the basics, you can quickly perform data cleaning using pandas, handle missing values, and prepare raw data for analysis. In fact, pandas makes data preprocessing simple and beginner friendly.
What is Pandas in Python?
Pandas is an open-source Python library used for data analysis and data manipulation. It helps you work with structured data easily.
- In simple words, Pandas allows you to read, clean, transform, and analyze data with just a few lines of code.
- Primarily, it is used for handling tabular data such as CSV files, Excel files, and databases.
- The main data structure in Pandas is the DataFrame, which looks like a table with rows and columns. Therefore, it is very easy for beginners to understand.
- Additionally, Pandas provides built-in functions for data cleaning, such as removing missing values, deleting duplicates, and fixing data types.
- Moreover, it supports data preprocessing, which means you can prepare raw data before analysis.
- For example, you can filter rows, sort values, group data, and modify a DataFrame without writing complex logic.
- As a result, if you are a beginner learning pandas for data analysis, understanding these basics is the first and most important step.
Why Should Beginners Learn Pandas for Data Analysis?
- First of all, Pandas is beginner friendly and easy to understand, especially if you are new to data analysis in Python.
- Because it uses simple syntax, you can perform complex data manipulation tasks with very little code.
- Most importantly, Pandas helps you work with real-world datasets such as CSV and Excel files. Therefore, it is very useful for practical projects.
- Another important reason is that Pandas is widely used in companies for data analysis, reporting, and business intelligence. As a result, learning it can improve your job opportunities.
- Furthermore, it integrates well with other Python libraries like NumPy, Matplotlib, and Scikit-learn for complete data analysis projects.
- Finally, mastering Pandas gives beginners a strong foundation for advanced data science and machine learning.
How to Install Pandas in Python Step by Step
- Before installing Pandas, make sure Python is already available on your system. Using a recent Python version helps with performance and compatibility.
- To begin the installation, open the command prompt or terminal, depending on your operating system.
- For installing Pandas, use the following command:
pip install pandas
- At this stage, press the Enter key to start the installation process. Pandas downloads and installs automatically.
- After the installation completes, the library becomes available in your Python environment.
- If you are using Anaconda, the following command can be used instead:
conda install pandas
- To verify the installation, open Python and import the library:
import pandas as pd
- If no error appears, the setup is successful and you can continue with this tutorial.
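As an optional extra check, here is a minimal sketch that prints the installed pandas version. The variable name `installed_version` is only illustrative.

```python
import pandas as pd

# pd.__version__ holds the version string of the installed library
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If this prints a version number, pandas is installed in the Python environment you are currently running.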
How to Use Pandas in Python for the First Time
- When starting for the first time, open your Python editor, such as VS Code, Jupyter Notebook, or any IDE you prefer.
- To work with data in Python, the Pandas library must be imported:
import pandas as pd
- For understanding how Pandas organizes data, create a simple DataFrame:
data = {
"Name": ["Rahul", "Amit", "Sneha"],
"Age": [23, 25, 22],
"City": ["Delhi", "Mumbai", "Pune"]
}
df = pd.DataFrame(data)
- To view the data in tabular form, display the DataFrame:
print(df)
- At this point, notice how the data appears in rows and columns. This structure is known as a DataFrame in Pandas.
| Name | Age | City |
|---|---|---|
| Rahul | 23 | Delhi |
| Amit | 25 | Mumbai |
| Sneha | 22 | Pune |
- As a basic operation, try selecting a single column:
print(df["Name"])
- Output looks like:
0 Rahul
1 Amit
2 Sneha
Name: Name, dtype: object
- You can also perform simple data manipulation with Pandas, such as adding a new column:
df["Salary"] = [30000, 35000, 28000]
- With regular practice, these steps help beginners build confidence with pandas for data analysis.
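The steps above can be sketched as one small script. It rebuilds the same DataFrame, adds the Salary column, and then checks the result with `shape` and `columns`, two standard DataFrame attributes. The variable names `n_rows`, `n_cols`, and `column_names` are illustrative.

```python
import pandas as pd

# Same sample data as in the steps above
df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha"],
    "Age": [23, 25, 22],
    "City": ["Delhi", "Mumbai", "Pune"],
})

# Add a new column; the list needs one value per row
df["Salary"] = [30000, 35000, 28000]

# shape is a (rows, columns) tuple; columns lists the column labels
n_rows, n_cols = df.shape
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
```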
Basic Pandas Tutorial in Python: How to Use Pandas in Python
In this basic pandas tutorial, you will learn how to use pandas in Python step by step. Pandas is one of the most important libraries for working with structured data, especially for beginners in data analysis. It provides simple functions that help you read files, create DataFrames, explore datasets, and clean data.
Creating a DataFrame in Pandas with Example
- To begin, import pandas in Python so you can start working with data:
import pandas as pd
- Next, create a simple dictionary to store sample data. This method is very helpful while you are learning pandas.
data = {
"Name": ["Rahul", "Amit", "Sneha"],
"Age": [23, 25, 22],
"City": ["Delhi", "Mumbai", "Pune"]
}
- Then, convert this dictionary into a DataFrame using pandas:
df = pd.DataFrame(data)
- After that, display the DataFrame to see the output:
print(df)
| Index | Name | Age | City |
|---|---|---|---|
| 0 | Rahul | 23 | Delhi |
| 1 | Amit | 25 | Mumbai |
| 2 | Sneha | 22 | Pune |
- Now, observe how pandas organizes the data into rows and columns. This table structure is called a DataFrame.
- Finally, practicing this example helps you build a strong foundation before moving to data cleaning using pandas and advanced data analysis tasks.
How to Read CSV and Excel Files Using Pandas
- First, import pandas in Python so you can start working with external files:
import pandas as pd
- Next, read a CSV file using the read_csv() function. This function loads the file directly into a DataFrame.
df = pd.read_csv("data.csv")
- Then, display the DataFrame to check the loaded data:
print(df)
- Now, observe how pandas automatically creates rows and columns from the CSV file. The result is the same kind of DataFrame you get when you create one manually.
- After that, read an Excel file using the read_excel() function:
df_excel = pd.read_excel("data.xlsx")
- Similarly, print the Excel DataFrame to view the data:
print(df_excel)
- In addition, you can select a specific sheet in Excel by using:
df_excel = pd.read_excel("data.xlsx", sheet_name="Sheet1")
- As a result, you can easily load real-world datasets for data cleaning and further data transformation using pandas.
- Finally, reading CSV and Excel files is one of the most important steps for beginners learning pandas, because real projects usually start with external data.
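If you do not have a file at hand, the following sketch shows the same `read_csv()` call on in-memory text. The CSV content is hypothetical sample data standing in for `data.csv`, and `usecols` is a standard `read_csv()` parameter that loads only the listed columns.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real "data.csv" file
csv_text = """Name,Age,City
Rahul,23,Delhi
Amit,25,Mumbai
Sneha,22,Pune
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# usecols restricts loading to the listed columns
df_small = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "City"])

print(df.shape)
print(list(df_small.columns))
```

With a real file, you would pass the file path in place of the `io.StringIO(...)` object.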
If you want to learn more about automating Excel reports, you can read my detailed guide on automating Excel files using pandas.
Exploring Data in a Pandas DataFrame
You can download the sample dataset used in this tutorial (a CSV file for practice) and use it to follow along.
Understanding head() in Pandas
- First, load your dataset into a DataFrame using read_csv() or read_excel() so you can begin exploring the data.
import pandas as pd
df = pd.read_csv("data.csv")
- Next, use the head() function to display the first few rows of the DataFrame:
df.head()
| | Actor | Film | Year | Genre | BoxOffice(INR Crore) | IMDb |
|---|---|---|---|---|---|---|
| 0 | Shah Rukh Khan | Pathaan | 2023 | Action | 1050 | 7.2 |
| 1 | Salman Khan | Tiger Zinda Hai | 2017 | Action | 565 | 6.0 |
| 2 | Aamir Khan | Dangal | 2016 | Biography | 2024 | 8.4 |
| 3 | Ranbir Kapoor | Brahmastra | 2022 | Fantasy | 431 | 5.6 |
| 4 | Ranveer Singh | Padmaavat | 2018 | Historical | 585 | 7.0 |
- By default, pandas shows the first 5 rows. However, you can pass a number to choose how many rows to display:
df.head(3)
| | Actor | Film | Year | Genre | BoxOffice(INR Crore) | IMDb |
|---|---|---|---|---|---|---|
| 0 | Shah Rukh Khan | Pathaan | 2023 | Action | 1050 | 7.2 |
| 1 | Salman Khan | Tiger Zinda Hai | 2017 | Action | 565 | 6.0 |
| 2 | Aamir Khan | Dangal | 2016 | Biography | 2024 | 8.4 |
Using tail() to Inspect Data from the End
- Then, use the tail() function to display the last few rows of the dataset:
df.tail()
| | Actor | Film | Year | Genre | BoxOffice(INR Crore) | IMDb |
|---|---|---|---|---|---|---|
| 7 | Hrithik Roshan | War | 2019 | Action | 475 | 6.5 |
| 8 | Akshay Kumar | Good Newwz | 2019 | Comedy | 318 | 7.0 |
| 9 | Kartik Aaryan | Bhool Bhulaiyaa 2 | 2022 | Horror Comedy | 266 | 5.9 |
| 10 | Varun Dhawan | Badrinath Ki Dulhania | 2017 | Romantic Comedy | 201 | 6.1 |
| 11 | Vicky Kaushal | Uri: The Surgical Strike | 2019 | Action | 342 | 8.2 |
Getting Dataset Overview with info()
- After that, use the info() function to get a summary of the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Actor 12 non-null object
1 Film 12 non-null object
2 Year 12 non-null int64
3 Genre 12 non-null object
4 BoxOffice(INR Crore) 12 non-null int64
5 IMDb 12 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 708.0+ bytes
The info() function shows:
- Number of rows
- Column names
- Data types of each column
- Non-null (non-missing) values
- As a result, you can quickly understand the structure of your dataset before you start cleaning it.
- In addition, these functions help you identify missing values and incorrect data types, which matter for later data transformation.
- Therefore, exploring data with head(), tail(), and info() is an essential step for beginners learning pandas for data analysis.
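The three functions can be tried together without downloading anything. This sketch uses a few rows copied from the sample dataset above, with fewer columns to keep it short. The `buf=` parameter of `info()` redirects the summary into a text buffer so it can be stored in a variable; the variable names are illustrative.

```python
import io
import pandas as pd

# A few rows copied from the sample dataset shown above
df = pd.DataFrame({
    "Actor": ["Shah Rukh Khan", "Salman Khan", "Aamir Khan", "Ranbir Kapoor"],
    "Film": ["Pathaan", "Tiger Zinda Hai", "Dangal", "Brahmastra"],
    "Year": [2023, 2017, 2016, 2022],
    "IMDb": [7.2, 6.0, 8.4, 5.6],
})

first_two = df.head(2)   # first 2 rows
last_two = df.tail(2)    # last 2 rows

# info() prints to the screen by default; buf= captures the text instead
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()

print(first_two)
print(last_two)
print(summary_text)
```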
How to Select Rows and Columns in Pandas DataFrame
- First, make sure you have already created or loaded a DataFrame:
import pandas as pd
data = {
"Name": ["Rahul", "Amit", "Sneha"],
"Age": [23, 25, 22],
"City": ["Delhi", "Mumbai", "Pune"]
}
df = pd.DataFrame(data)
Selecting Columns in Pandas
- To select a single column, use the column name inside square brackets:
df["Name"]
0 Rahul
1 Amit
2 Sneha
Name: Name, dtype: object
- If multiple columns are needed, pass a list of column names:
df[["Name", "City"]]
Name City
0 Rahul Delhi
1 Amit Mumbai
2 Sneha Pune
- As a result, pandas returns only the selected columns from the DataFrame.
Selecting Rows in Pandas
- For selecting rows by index position, use the iloc[] indexer:
df.iloc[0]
Name Rahul
Age 23
City Delhi
Name: 0, dtype: object
- When multiple rows are required, provide a range of index values:
df.iloc[0:3]
| | Name | Age | City |
|---|---|---|---|
| 0 | Rahul | 23 | Delhi |
| 1 | Amit | 25 | Mumbai |
| 2 | Sneha | 22 | Pune |
- To select rows using labels, use the loc[] indexer:
df.loc[0]
Name Rahul
Age 23
City Delhi
Name: 0, dtype: object
- Specific rows and columns together can be selected like this:
df.loc[0:2, ["Name", "City"]]
| | Name | City |
|---|---|---|
| 0 | Rahul | Delhi |
| 1 | Amit | Mumbai |
| 2 | Sneha | Pune |
- As a result, you can quickly access only the data you need, which is very important in data cleaning.
- Moreover, selecting and filtering data plays a major role in data transformation and helps you modify a DataFrame efficiently.
- Therefore, mastering row and column selection is an essential step for beginners learning pandas for data analysis.
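One detail in the examples above is easy to miss: a slice with `iloc[]` excludes the end position, while a slice with `loc[]` includes the end label. A minimal sketch on the same sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha"],
    "Age": [23, 25, 22],
    "City": ["Delhi", "Mumbai", "Pune"],
})

# Position-based slice: positions 0 and 1 only (end is excluded)
by_position = df.iloc[0:2]

# Label-based slice: labels 0, 1, and 2 (end is included)
by_label = df.loc[0:2]

print(len(by_position))
print(len(by_label))
```

This is why `df.iloc[0:3]` and `df.loc[0:2]` both return all three rows of this DataFrame.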
Data Cleaning Using Pandas: How to Preprocess Data in Pandas
Before performing any analysis, raw data must be prepared properly. Data cleaning using pandas helps you remove errors, handle missing values, and correct inconsistent information inside a dataset. In fact, knowing how to preprocess data in pandas is one of the most important skills for beginners in data analysis.
Moreover, clean data improves accuracy and makes data transformation much easier. With simple functions, you can detect null values, remove duplicates, fix column names, and modify a DataFrame without writing complex code. Therefore, this section will guide you step by step through practical techniques that help you prepare real-world datasets for analysis. By the end, you will be able to handle messy data confidently in your own pandas projects.
Handling Missing Values in Pandas DataFrame
Missing values are common in real-world datasets. Therefore, handling them correctly is an important step in data cleaning and preprocessing with pandas.
Create Example Data with Missing Values
- First, create a sample DataFrame that contains some missing values:
import pandas as pd
import numpy as np
data = {
"Name": ["Rahul", "Amit", "Sneha", "Karan"],
"Age": [23, None, 22, 24],
"Salary": [30000, 35000, np.nan, 28000]
}
df = pd.DataFrame(data)
print(df)
| Index | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23.0 | 30000.0 |
| 1 | Amit | NaN | 35000.0 |
| 2 | Sneha | 22.0 | NaN |
| 3 | Karan | 24.0 | 28000.0 |
Detect Missing Values
- Next, check how many missing values exist in each column:
df.isnull().sum()
Name 0
Age 1
Salary 1
dtype: int64
Remove Missing Values
- If only a few rows contain missing values, you can remove those rows:
df_clean = df.dropna()
print(df_clean)
| Index | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23.0 | 30000.0 |
| 3 | Karan | 24.0 | 28000.0 |
Fill Missing Values
- Instead of removing rows, you can fill missing values.
- Fill with the column mean:
df["Age"] = df["Age"].fillna(df["Age"].mean())
| | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23.0 | 30000.0 |
| 1 | Amit | 23.0 | 35000.0 |
| 2 | Sneha | 22.0 | NaN |
| 3 | Karan | 24.0 | 28000.0 |
- This method is useful when you want to keep all rows.
Handling missing values is a crucial skill for beginners learning pandas for data analysis. Once you manage missing data properly, further analysis becomes much easier and more accurate.
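When several columns have gaps, `fillna()` also accepts a dictionary that maps column names to fill values. The sketch below fills Age with its mean and Salary with its median. Using the median for Salary is only an illustrative choice, and the names `fill_values` and `df_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha", "Karan"],
    "Age": [23, None, 22, 24],
    "Salary": [30000, 35000, np.nan, 28000],
})

# One fill value per column; columns not listed are left untouched
fill_values = {
    "Age": df["Age"].mean(),          # mean of 23, 22, 24
    "Salary": df["Salary"].median(),  # median of 30000, 35000, 28000
}

# fillna returns a new DataFrame; the original df keeps its NaN values
df_filled = df.fillna(fill_values)

print(df_filled)
print(df_filled.isnull().sum())
```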
How to Remove Duplicate Data Using Pandas
Duplicate data often appears in real-world datasets. Therefore, removing duplicates is an important step in data cleaning using pandas and proper data preprocessing.
Create Example Data with Duplicates
- First, create a sample DataFrame that contains duplicate rows:
import pandas as pd
data = {
"Name": ["Rahul", "Amit", "Sneha", "Rahul"],
"Age": [23, 25, 22, 23],
"City": ["Delhi", "Mumbai", "Pune", "Delhi"]
}
df = pd.DataFrame(data)
print(df)
| Index | Name | Age | City |
|---|---|---|---|
| 0 | Rahul | 23 | Delhi |
| 1 | Amit | 25 | Mumbai |
| 2 | Sneha | 22 | Pune |
| 3 | Rahul | 23 | Delhi |
Detect Duplicate Rows
- Next, check for duplicate rows in the DataFrame:
print(df.duplicated())
- This function returns True for each row that repeats an earlier row.
0 False
1 False
2 False
3 True
dtype: bool
- To count duplicate rows:
df.duplicated().sum()
Remove Duplicate Rows
- Now, remove duplicate rows using:
df_clean = df.drop_duplicates()
print(df_clean)
- This command keeps the first occurrence and removes repeated rows.
| Index | Name | Age | City |
|---|---|---|---|
| 0 | Rahul | 23 | Delhi |
| 1 | Amit | 25 | Mumbai |
| 2 | Sneha | 22 | Pune |
Remove Duplicates Based on Specific Columns
- Sometimes, duplicates should be checked only for certain columns:
print(df.drop_duplicates(subset=["Name"]))
| | Name | Age | City |
|---|---|---|---|
| 0 | Rahul | 23 | Delhi |
| 1 | Amit | 25 | Mumbai |
| 2 | Sneha | 22 | Pune |
- This removes duplicate names while keeping the first record.
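Keeping the first record is the default behaviour. As a short sketch, the `keep` parameter of `drop_duplicates()` can change which copy survives: `keep="last"` keeps the final occurrence, and `keep=False` drops every row that has a duplicate. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha", "Rahul"],
    "Age": [23, 25, 22, 23],
    "City": ["Delhi", "Mumbai", "Pune", "Delhi"],
})

# Keep the last occurrence of each name instead of the first
keep_last = df.drop_duplicates(subset=["Name"], keep="last")

# Drop every row whose name appears more than once
drop_all = df.drop_duplicates(subset=["Name"], keep=False)

print(keep_last)
print(drop_all)
```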
Renaming Columns and Fixing Data Types in Pandas
Clear column names and correct data types make analysis easier. Therefore, renaming columns and fixing data types is an important step in data cleaning using pandas and proper data preprocessing.
Create Example Data
- First, create a sample DataFrame with unclear column names and incorrect data types:
import pandas as pd
data = {
"emp_name": ["Rahul", "Amit", "Sneha"],
"emp_age": ["23", "25", "22"], # Age stored as string
"emp_salary": ["30000", "35000", "28000"] # Salary stored as string
}
df = pd.DataFrame(data)
print(df)
| | emp_name | emp_age | emp_salary |
|---|---|---|---|
| 0 | Rahul | 23 | 30000 |
| 1 | Amit | 25 | 35000 |
| 2 | Sneha | 22 | 28000 |
Rename Columns
- Next, rename the columns to make them simple and readable:
df.rename(columns={
"emp_name": "Name",
"emp_age": "Age",
"emp_salary": "Salary"
}, inplace=True)
print(df)
| | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23 | 30000 |
| 1 | Amit | 25 | 35000 |
| 2 | Sneha | 22 | 28000 |
Fix Data Types
- Now, convert string columns into proper numeric types:
df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(float)
print(df.dtypes)
Name object
Age int64
Salary float64
dtype: object
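`astype()` raises an error if any value cannot be converted. For messier columns, one possible alternative is `pd.to_numeric()` with `errors="coerce"`, which turns unparseable entries into NaN so they can be handled as missing values. The sample data below, including the "unknown" entry, is hypothetical.

```python
import pandas as pd

# Hypothetical messy column: one entry is not a number
raw = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha"],
    "Age": ["23", "25", "unknown"],
})

# errors="coerce" converts values that cannot be parsed into NaN
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

print(raw)
print(raw.dtypes)
```

The column becomes float64 because NaN is a floating-point value.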
Data Transformation Using Pandas to Modify DataFrame
After cleaning the dataset, the next important step is data transformation using pandas. In this stage, you reshape, filter, and update your data so it becomes more useful for analysis. While data cleaning focuses on fixing errors, transformation helps you organize and modify a DataFrame according to your analysis needs.
Moreover, proper transformation makes reports clearer and supports better decision-making. For example, you can create new columns, group data, apply calculations, or sort values easily. Therefore, understanding how to transform data is essential for beginners learning pandas for data analysis.
How to Modify a DataFrame Using Pandas
Modifying a DataFrame allows you to update, add, or change data according to your analysis needs. These operations help you prepare datasets for deeper insights.
Create Example Data
- To begin, create a simple DataFrame:
import pandas as pd
data = {
"Name": ["Rahul", "Amit", "Sneha"],
"Age": [23, 25, 22],
"Salary": [30000, 35000, 28000]
}
df = pd.DataFrame(data)
print(df)
| | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23 | 30000 |
| 1 | Amit | 25 | 35000 |
| 2 | Sneha | 22 | 28000 |
Add a New Column
- Next, add a new column to the DataFrame:
df["Bonus"] = [2000, 2500, 1800]
print(df)
| | Name | Age | Salary | Bonus |
|---|---|---|---|---|
| 0 | Rahul | 23 | 30000 | 2000 |
| 1 | Amit | 25 | 35000 | 2500 |
| 2 | Sneha | 22 | 28000 | 1800 |
Update Existing Values
- Sometimes, specific values need correction. In that case, update them directly:
df.loc[0, "Salary"] = 32000
print(df)
| | Name | Age | Salary | Bonus |
|---|---|---|---|---|
| 0 | Rahul | 23 | 32000 | 2000 |
| 1 | Amit | 25 | 35000 | 2500 |
| 2 | Sneha | 22 | 28000 | 1800 |
Apply a Calculation to a Column
- In addition, calculations can be applied to modify values:
df["Salary"] = df["Salary"] + 1000
print(df)
| | Name | Age | Salary | Bonus |
|---|---|---|---|---|
| 0 | Rahul | 23 | 33000 | 2000 |
| 1 | Amit | 25 | 36000 | 2500 |
| 2 | Sneha | 22 | 29000 | 1800 |
Remove Columns
- If a column is not required, remove it:
df.drop("Bonus", axis=1, inplace=True)
print(df)
| | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23 | 33000 |
| 1 | Amit | 25 | 36000 |
| 2 | Sneha | 22 | 29000 |
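A new column can also be calculated from existing columns instead of typed in by hand. This sketch adds a hypothetical Total column as the sum of Salary and Bonus, using the original sample values; the column name is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha"],
    "Salary": [30000, 35000, 28000],
    "Bonus": [2000, 2500, 1800],
})

# Column arithmetic works row by row
df["Total"] = df["Salary"] + df["Bonus"]

print(df)
```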
Filtering and Sorting Data in Pandas
Filtering and sorting help you organize data in a meaningful way. In data transformation using pandas, these operations allow you to focus only on relevant records and arrange them properly for analysis.
Create Example Data
import pandas as pd
data = {
"Name": ["Rahul", "Amit", "Sneha", "Karan"],
"Age": [23, 25, 22, 24],
"Salary": [30000, 35000, 28000, 40000]
}
df = pd.DataFrame(data)
print(df)
| | Name | Age | Salary |
|---|---|---|---|
| 0 | Rahul | 23 | 30000 |
| 1 | Amit | 25 | 35000 |
| 2 | Sneha | 22 | 28000 |
| 3 | Karan | 24 | 40000 |
Filtering Data in Pandas
Filtering allows you to select rows based on conditions.
Filter Rows Based on a Condition:
- To display employees with salary greater than 30000:
print(df[df["Salary"] > 30000])
- Only rows that meet the condition will appear.
Apply Multiple Conditions
- More specific filtering can be done like this:
print(df[(df["Age"] > 22) & (df["Salary"] > 30000)])
- Both conditions must be true for a row to appear.
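The filters above can be run together on the sample data to see which rows survive. The sketch also shows the `|` operator, which keeps a row when at least one condition is true. Each condition needs its own parentheses. The thresholds in the last filter are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha", "Karan"],
    "Age": [23, 25, 22, 24],
    "Salary": [30000, 35000, 28000, 40000],
})

# Single condition: salary strictly greater than 30000
high_salary = df[df["Salary"] > 30000]

# AND: both conditions must hold
both = df[(df["Age"] > 22) & (df["Salary"] > 30000)]

# OR: at least one condition must hold
either = df[(df["Age"] > 24) | (df["Salary"] < 30000)]

print(high_salary)
print(both)
print(either)
```

Note that Rahul is not in `high_salary`, because 30000 is not greater than 30000.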
Sorting Data in Pandas
- Sorting arranges data in ascending or descending order.
- To sort salaries in ascending order:
print(df.sort_values("Salary"))
- For descending order:
print(df.sort_values("Salary", ascending=False))
- Data can also be sorted by more than one column:
print(df.sort_values(["Age", "Salary"]))
- Here, pandas sorts first by Age, then by Salary.
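Sorting by two columns only makes a visible difference when the first column has ties. The sketch below uses hypothetical data with repeated ages and passes a list to `ascending` so that Age is sorted upward and Salary downward within each age.

```python
import pandas as pd

# Hypothetical data with tied ages, so the second sort key matters
df = pd.DataFrame({
    "Name": ["Rahul", "Amit", "Sneha", "Karan"],
    "Age": [23, 25, 23, 25],
    "Salary": [30000, 35000, 28000, 40000],
})

# One ascending flag per sort column: Age ascending, Salary descending
sorted_df = df.sort_values(["Age", "Salary"], ascending=[True, False])

print(sorted_df)
```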
GroupBy Operations in Pandas with Simple Example
GroupBy operations help you summarize and analyze data based on categories. In data transformation using pandas, this method allows you to combine similar records and perform calculations easily.
Create Example Data
import pandas as pd
data = {
"Product": ["Laptop", "Mobile", "Laptop", "Tablet", "Mobile", "Tablet"],
"Region": ["North", "South", "East", "West", "North", "South"],
"Sales": [50000, 30000, 45000, 20000, 35000, 25000]
}
df = pd.DataFrame(data)
print(df)
| | Product | Region | Sales |
|---|---|---|---|
| 0 | Laptop | North | 50000 |
| 1 | Mobile | South | 30000 |
| 2 | Laptop | East | 45000 |
| 3 | Tablet | West | 20000 |
| 4 | Mobile | North | 35000 |
| 5 | Tablet | South | 25000 |
- This dataset contains product names, regions, and sales amounts.
Group by One Column
- Next, group the data by Product and calculate total sales:
print(df.groupby("Product")["Sales"].sum())
Product
Laptop 95000
Mobile 65000
Tablet 45000
Name: Sales, dtype: int64
- As a result, pandas combines rows with the same product and adds their sales values.
Group by Another Column
- Similarly, group by Region to see total sales in each area:
print(df.groupby("Region")["Sales"].sum())
Region
East 45000
North 85000
South 55000
West 20000
Name: Sales, dtype: int64
Apply Multiple Calculations
- Moreover, multiple statistics can be calculated at once:
print(df.groupby("Product")["Sales"].agg(["sum", "mean", "max"]))
| Product | sum | mean | max |
|---|---|---|---|
| Laptop | 95000 | 47500.0 | 50000 |
| Mobile | 65000 | 32500.0 | 35000 |
| Tablet | 45000 | 22500.0 | 25000 |
- Here, pandas shows total sales, average sales, and maximum sales for each product.
Group by Multiple Columns
- In addition, grouping can be done using more than one column:
print(df.groupby(["Product", "Region"], as_index=False)["Sales"].sum())
| | Product | Region | Sales |
|---|---|---|---|
| 0 | Laptop | East | 45000 |
| 1 | Laptop | North | 50000 |
| 2 | Mobile | North | 35000 |
| 3 | Mobile | South | 30000 |
| 4 | Tablet | South | 25000 |
| 5 | Tablet | West | 20000 |
- This provides detailed insights based on both product and region.
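When the result columns need clearer names than sum and mean, pandas supports named aggregation in `agg()`. In this sketch, the output column names `total_sales` and `avg_sales` are illustrative choices; each one is defined as a (column, function) pair.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mobile", "Laptop", "Tablet", "Mobile", "Tablet"],
    "Region": ["North", "South", "East", "West", "North", "South"],
    "Sales": [50000, 30000, 45000, 20000, 35000, 25000],
})

# Named aggregation: new_column_name=(source_column, function)
summary = df.groupby("Product").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
)

print(summary)
```

The result is indexed by Product, so a single value can be read with, for example, `summary.loc["Laptop", "total_sales"]`.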
Frequently Asked Questions (FAQ)
What is Pandas in Python used for?
Why should beginners learn Pandas for data analysis?
What is a DataFrame in Pandas?
How do I perform data cleaning using Pandas?
Use functions such as dropna() and drop_duplicates().
Can I use Pandas for Excel files?
Yes. Use read_excel() to load data and to_excel() to export data.