How To Use Pandas and Matplotlib To Perform EDA In Python

In this article, we will explore how to use two popular Python libraries, Pandas and Matplotlib, to perform EDA.

By Stylianos Kampakis · May 29, 2023 · Tutorial

Exploratory Data Analysis (EDA) is an essential step in any data science project, as it allows us to understand the data, detect patterns, and identify potential issues. In this article, we will explore how to use two popular Python libraries, Pandas and Matplotlib, to perform EDA. Pandas is a powerful library for data manipulation and analysis, while Matplotlib is a versatile library for data visualization. We will cover the basics of loading data into a pandas DataFrame, exploring the data using pandas functions, cleaning the data, and finally, visualizing the data using Matplotlib. By the end of this article, you will have a solid understanding of how to use Pandas and Matplotlib to perform EDA in Python.

Importing Libraries and Data

Importing Libraries

To use the pandas and Matplotlib libraries in your Python code, you need to first import them. You can do this using the import statement followed by the name of the library.

Python
 
import pandas as pd
import matplotlib.pyplot as plt


In this example, we're importing pandas and aliasing it as 'pd', which is a common convention in the data science community. We're also importing matplotlib.pyplot and aliasing it as 'plt'. By importing these libraries, we can use their functions and methods to work with data and create visualizations.

Loading Data

Once you've imported the necessary libraries, you can load the data into a pandas DataFrame. Pandas provides several methods to load data from various file formats, including CSV, Excel, JSON, and more. The most common method is read_csv, which reads data from a CSV file and returns a DataFrame.

Python
 
# Load data into a pandas DataFrame
data = pd.read_csv('path/to/data.csv')


In this example, we're loading data from a CSV file located at 'path/to/data.csv' and storing it in a variable called 'data'. You can replace 'path/to/data.csv' with the actual path to your data file.
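
Pandas can load other formats in much the same way; for example, read_excel() and read_json() also return DataFrames. The file paths below are placeholders for illustration (and reading Excel files requires an engine such as openpyxl to be installed).

Python
# Load data from other common formats (placeholder paths)
excel_data = pd.read_excel('path/to/data.xlsx')
json_data = pd.read_json('path/to/data.json')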

By loading data into a pandas DataFrame, we can easily manipulate and analyze the data using pandas' functions and methods. The DataFrame is a 2-dimensional table-like data structure that allows us to work with data in a structured and organized way. It provides functions for selecting, filtering, grouping, aggregating, and visualizing data.
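
As a brief illustration of those operations, the following sketch selects, filters, and groups data; the column names 'column_name' and 'group_col' are placeholders you would replace with columns from your own dataset.

Python
# Select a single column as a Series (placeholder column names)
values = data['column_name']

# Filter rows with a boolean condition
subset = data[data['column_name'] > 0]

# Group by one column and average another
averages = data.groupby('group_col')['column_name'].mean()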

Data Exploration

head() and tail()

The head() and tail() functions are used to view the first few and last few rows of the data, respectively. By default, these functions display the first/last five rows of the data, but you can specify a different number of rows as an argument.

Python
 
# View the first 5 rows of the data
print(data.head()) 
# View the last 10 rows of the data
print(data.tail(10))


info()

The info() function provides information about the DataFrame, including the number of rows and columns, the data types of each column, and the number of non-null values. This function is useful for identifying missing values and determining the appropriate data types for each column.

Python
 
# Get information about the data
print(data.info())


describe()

The describe() function provides summary statistics for numerical columns in the DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartiles. This function is useful for getting a quick overview of the distribution of the data.

Python
 
# Get summary statistics for the data
print(data.describe())


value_counts()

The value_counts() function is used to count the number of occurrences of each unique value in a column. This function is useful for identifying the frequency of specific values in the data.

Python
 
# Count the occurrences of each unique value in a column
print(data['column_name'].value_counts())


These are just a few examples of the pandas functions you can use to explore data. There are many others, depending on your specific data exploration needs, such as isnull() to check for missing values, groupby() to group data by a specific column, and corr() to calculate correlation coefficients between columns; a quick sketch of the last two follows.
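
This sketch assumes placeholder column names ('category_column' and 'numeric_column') and restricts corr() to numeric columns so it runs on mixed-type data.

Python
# Average a numeric column within each group (placeholder column names)
print(data.groupby('category_column')['numeric_column'].mean())

# Correlation coefficients between numeric columns
print(data.select_dtypes('number').corr())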

Data Cleaning 

isnull()

The isnull() function is used to check for missing or null values in the DataFrame. It returns a DataFrame of the same shape as the original, with True values where the data is missing and False values where the data is present. You can use the sum() function to count the number of missing values in each column.

Python
 
# Check for missing values
print(data.isnull().sum())


dropna()

The dropna() function is used to remove rows or columns with missing or null values. By default, this function removes any row that contains at least one missing value. You can use the subset argument to specify which columns to check for missing values and the how argument to specify whether to drop rows with any missing values or only rows where all values are missing.

Python
 
# Drop rows with missing values
data = data.dropna()
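
If you only want to consider specific columns, or drop a row only when every value is missing, the subset and how arguments can be combined as in this sketch (the column names are placeholders):

Python
# Drop rows only when 'column1' or 'column2' is missing (placeholder names)
data = data.dropna(subset=['column1', 'column2'])

# Drop rows only when every value in the row is missing
data = data.dropna(how='all')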


drop_duplicates()

The drop_duplicates() function is used to remove duplicate rows from the DataFrame. By default, this function removes all rows that have the same values in all columns. You can use the subset argument to specify which columns to check for duplicates.

Python
 
# Drop duplicate rows
data = data.drop_duplicates()
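
To treat rows as duplicates based on a subset of columns rather than all of them, the subset argument might be used like this (placeholder column names):

Python
# Consider rows duplicates when they match on these columns only (placeholder names)
data = data.drop_duplicates(subset=['column1', 'column2'])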


replace()

The replace() function is used to replace values in a column with new values. You specify the value to replace and the value to use in its place. This function is useful for handling data quality issues such as misspellings or inconsistent formatting.

Python
 
# Replace values in a column
data['column_name'] = data['column_name'].replace('old_value', 'new_value')


These are just a few examples of the pandas functions you can use to clean data. There are many others, depending on your specific data-cleaning needs, such as fillna() to fill missing values with a specific value or method, astype() to convert column data types, and clip() to trim outliers; a brief sketch of these follows.
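
A brief, illustrative sketch of those three functions, with placeholder column names and values:

Python
# Fill missing values with a constant (placeholder column name and value)
data['column_name'] = data['column_name'].fillna(0)

# Convert a column to a specific data type
data['column_name'] = data['column_name'].astype('float64')

# Trim outliers to a fixed range
data['column_name'] = data['column_name'].clip(lower=0, upper=100)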

Data cleaning plays a crucial role in preparing data for analysis, and automating the process can save time and ensure data quality. In addition to the pandas functions mentioned earlier, automation techniques can be applied to streamline data-cleaning workflows. For instance, you can create reusable functions or pipelines to handle missing values, drop duplicates, and replace values across multiple datasets. Moreover, you can leverage advanced techniques like imputation to fill in missing values intelligently, or regular expressions to identify and correct inconsistent formatting. By combining the power of pandas functions with automation strategies, you can efficiently clean and standardize data, improving the reliability and accuracy of your exploratory data analysis (EDA).
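
One possible shape for such a reusable cleaning function is sketched below; the function name and the choice of steps are illustrative, not a fixed recipe.

Python
def clean_dataframe(df, required_columns=None):
    """Apply a few standard cleaning steps and return a new DataFrame."""
    cleaned = df.drop_duplicates()
    if required_columns:
        # Only drop rows that are missing values in the required columns
        cleaned = cleaned.dropna(subset=required_columns)
    else:
        cleaned = cleaned.dropna()
    return cleaned

# Reuse the same steps across datasets
data = clean_dataframe(data, required_columns=['column1'])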

Data Visualization 

Data visualization is a critical component of data science, as it allows us to gain insights from data quickly and easily. Matplotlib is a popular Python library for creating a wide range of data visualizations, including scatter plots, line plots, bar charts, histograms, box plots, and more.

Here are a few examples of how to create these types of visualizations using Matplotlib:

Scatter Plot

A scatter plot is used to visualize the relationship between two continuous variables. You can create a scatter plot in Matplotlib using the scatter() function.

Python
 
# Create a scatter plot
plt.scatter(data['column1'], data['column2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.show()


In this example, we're creating a scatter plot with column1 on the x-axis and column2 on the y-axis. We're also adding labels to the x-axis and y-axis using the xlabel() and ylabel() functions.

Histogram

A histogram is used to visualize the distribution of a single continuous variable. You can create a histogram in Matplotlib using the hist() function.

Python
 
# Create a histogram
plt.hist(data['column'], bins=10)
plt.xlabel('Column')
plt.ylabel('Frequency')
plt.show()


In this example, we're creating a histogram of the column variable with 10 bins. We're also adding labels to the x-axis and y-axis using the xlabel() and ylabel() functions.

Box Plot

A box plot is used to visualize the distribution of a single continuous variable and to identify outliers. You can create a box plot in Matplotlib using the boxplot() function.

Python
 
# Create a box plot
plt.boxplot(data['column'])
plt.ylabel('Column')
plt.show()


In this example, we're creating a box plot of the column variable. We're also adding a label to the y-axis using the ylabel() function.

These are just a few examples of what you can do with Matplotlib for data visualization. There are many other functions and techniques you can use, depending on the specific requirements of your project.
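
For instance, the bar charts and line plots mentioned earlier could be sketched roughly as follows (column names are placeholders):

Python
# Bar chart of category counts (placeholder column name)
counts = data['category_column'].value_counts()
plt.bar(counts.index, counts.values)
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

# Line plot of one column against another
plt.plot(data['column1'], data['column2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.show()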

Conclusion

Exploratory data analysis (EDA) is a crucial step in any data science project, and Python provides powerful tools to perform EDA effectively. In this article, we have learned how to use two popular Python libraries, Pandas and Matplotlib, to load, explore, clean, and visualize data. Pandas provides a flexible and efficient way to manipulate and analyze data, while Matplotlib provides a wide range of options to create visualizations. By leveraging these two libraries, we can gain insights from data quickly and easily. With the skills and techniques learned in this article, you can start performing EDA on your own datasets and uncover valuable insights that can drive data-driven decision-making.

Data analysis Data science Exploratory data analysis Matplotlib Pandas Python (language)

Opinions expressed by DZone contributors are their own.
