Visualize your data in Python using Matplotlib
Once you have manipulated and analyzed your financial data with Pandas, visualizing it is crucial to gain deeper insights into trends and relationships among variables. In this chapter, we'll explore how to create meaningful charts using Matplotlib, one of the most popular visualization libraries in Python.
Introduction to Matplotlib
Matplotlib is a versatile data visualization library for Python, enabling the creation of a wide array of charts such as bar graphs, histograms, scatter plots, and line graphs. It is especially well-suited for financial data visualization and is commonly used in conjunction with Pandas.
Installing and Importing Matplotlib
Before using Matplotlib, you need to install it by executing the following command in your terminal or command prompt:
pip install matplotlib
Once installed, you can import Matplotlib into your Python script with the following line:
import matplotlib.pyplot as plt
The alias "plt" is the conventional name for the matplotlib.pyplot module and keeps calls to its functions short.
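To confirm that the installation worked, a quick check along the following lines can help. It is only a sketch: it prints the installed version and the name of the active backend, both of which will differ from one machine to another.

```python
import matplotlib
import matplotlib.pyplot as plt

# Printing the version confirms that the package can be imported
print(matplotlib.__version__)

# The backend is the component that draws the chart (a window, a notebook cell, a file, ...)
print(matplotlib.get_backend())
```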
Creating Charts with Matplotlib and Pandas
With Matplotlib imported, we can begin creating charts to visualize our financial data. This section will cover some common chart types used in the finance industry.
First Use: Line Graphs
The most basic chart is the line graph, created using the plt.plot() function. For example:
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
plt.plot(x, y)
plt.show()
Here, we built a chart from two lists, one for the x-axis and one for the y-axis.
Adding Titles and Labels
Enhance your chart by adding titles and labels:
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
plt.plot(x, y)
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.title("Plot test")
plt.show()
Here we've added a label to each axis and an overall chart title.
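When a chart holds more than one line, a legend tells the reader which is which. The sketch below uses two made-up series (the names "Series A" and "Series B" are placeholders): each call to plt.plot() receives a label, and plt.legend() draws the legend from those labels.

```python
import matplotlib.pyplot as plt

# Made-up values, for illustration only
x = [1, 2, 3, 4, 5]
y_a = [10, 20, 30, 40, 50]
y_b = [12, 18, 33, 38, 55]

# The label argument gives each line the name shown in the legend
plt.plot(x, y_a, label="Series A")
plt.plot(x, y_b, label="Series B", linestyle="--")

plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.title("Two series on one chart")
legend = plt.legend()  # builds the legend box from the labels above
plt.show()
```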
Bar Charts
Another commonly used chart is the bar chart, which you can create using the plt.bar() function. The example reuses the x and y lists defined earlier:
plt.bar(x, y)
plt.show()
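Bar charts are most often used with named categories rather than numbers on the x-axis. As a minimal sketch, assuming some hypothetical quarterly revenue figures, plt.bar() can take a list of strings directly:

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly revenue figures, for illustration only
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 128, 150]

# plt.bar accepts text categories on the x-axis
bars = plt.bar(quarters, revenue, color="steelblue")
plt.ylabel("Revenue (thousands of dollars)")
plt.title("Revenue by quarter (made-up data)")
plt.show()
```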
Scatter Plots
Scatter plots are useful for seeing how two variables relate to each other, with one dot per observation. You can create one using the plt.scatter() function.
plt.scatter(x, y)
plt.show()
Visualizing Complex Data with Pandas and Matplotlib
Using a real-world dataset, let's visualize complex data with Matplotlib. We'll work with the "House Sales in King County, USA" dataset, which contains detailed information about house sales in the area. You can download the dataset from https://github.com/RobotTraders/Python_For_Finance/blob/main/house_sales_in_king_county_usa.csv.
Start by loading the dataset with Pandas:
import pandas as pd
df = pd.read_csv('house_sales_in_king_county_usa.csv')
Here is a description of the different columns in the dataset.
- id: This is the unique identifier for each house sold.
- date: This is the date the house was sold. It can be important because real estate market trends can vary over time.
- price: This is the selling price of the house. It is often the target variable in real estate prediction models.
- bedrooms: This is the number of bedrooms in the house.
- bathrooms: This is the number of bathrooms in the house. Half bathrooms (toilet without shower or bathtub) can be counted as 0.5.
- sqft_living: This is the living space of the house in square feet.
- sqft_lot: This is the total land area of the house in square feet.
- floors: This is the number of floors in the house.
- waterfront: This variable indicates whether the house has a view of water (a lake, river, sea, etc.). It is often a binary variable (0 = no water view, 1 = water view).
- view: This is an assessment of the view quality from the house. The rating system can vary, but often a scale of 0 to 4 is used.
- condition: This is an assessment of the general condition of the house. As with view, it is often rated on a scale, with a higher value indicating a better condition.
- grade: This is an assessment of the quality of construction and design of the house. Houses with higher-quality construction materials and more elaborate designs will receive a higher rating.
- sqft_above: This is the above-ground living space of the house in square feet.
- sqft_basement: This is the basement area of the house in square feet.
- yr_built: This is the year the house was built.
- yr_renovated: This is the year the house was last renovated. If the house has never been renovated, this value could be 0.
- zipcode: This is the postal code of the house. It can be important because location has a major impact on house prices.
- lat: This is the latitude of the house, a component of its geographical position.
- long: This is the longitude of the house, another component of its geographical position.
- sqft_living15: This is the average living space of the 15 closest houses, in square feet.
- sqft_lot15: This is the average land area of the 15 closest houses, in square feet.
You can use df.describe() to get summary statistics for each numeric column. The dataset is quite clean, so you normally will not need to do any data cleaning.
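A quick way to check that claim for yourself is to combine describe() with a count of missing values. The sketch below runs on a tiny made-up sample that mimics three of the columns; with the real file you would apply the same calls to df.

```python
import pandas as pd

# Tiny made-up sample that mimics three of the dataset's columns
sample = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "sqft_living": [1180, 2570, 770, 1960],
})

# Summary statistics (count, mean, std, min, quartiles, max) for each numeric column
summary = sample.describe()
print(summary)

# Number of missing values per column; all zeros means nothing to fill in or drop
missing = sample.isna().sum()
print(missing)
```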
Histograms
Histograms are great for visualizing data distributions. Create one using plt.hist() with Matplotlib. The bins parameter specifies the number of bins you want.
plt.hist(df["price"], bins=100)
plt.show()
The x-axis scale is in scientific notation (1e6 represents 1 x 10 to the power of 6, which means 1,000,000). We can see that the vast majority of houses in our dataset are priced between 0 and 2 million dollars. However, we note that some houses have sold for more than 7 million dollars. If we wish to better observe the distribution between 0 and 2 million dollars, we can apply a filter:
plt.hist(df.loc[df["price"] < 2_000_000, "price"], bins=100)
plt.show()
We observe here that the distribution peaks between roughly 250k and 500k dollars. Each bar represents the number of houses within a certain price range.
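If the scientific notation on the x-axis is hard to read, the tick labels can be reformatted. One possible approach, sketched here with made-up prices, uses FuncFormatter from matplotlib.ticker. The helper name millions is hypothetical, and the price list stands in for df["price"].

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Made-up prices standing in for df["price"]
prices = [250_000, 320_000, 450_000, 480_000, 750_000, 1_200_000, 1_900_000]

def millions(value, position):
    """Format a tick value in dollars as millions, e.g. 1500000 -> '1.5M'."""
    return f"{value / 1_000_000:.1f}M"

plt.hist(prices, bins=5)
# gca() returns the current axes, whose x-axis tick labels we reformat
plt.gca().xaxis.set_major_formatter(FuncFormatter(millions))
plt.xlabel("Price (dollars)")
plt.ylabel("Number of houses")
plt.show()
```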
The bins parameter sets the number of bars: the higher the value, the finer the detail in the data distribution.
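To see the effect of bins, it can help to draw the same data twice. The sketch below uses synthetic, right-skewed values in place of the real prices (the lognormal parameters are arbitrary choices) and overlays a coarse and a fine histogram. plt.hist() also returns the bar heights and the bin edges, which the example keeps for inspection.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed "prices" standing in for df["price"]
rng = np.random.default_rng(seed=0)
prices = rng.lognormal(mean=13, sigma=0.5, size=1_000)

# plt.hist returns the bar heights, the bin edges, and the drawn bars
counts_coarse, edges_coarse, _ = plt.hist(prices, bins=10, alpha=0.5, label="bins=10")
counts_fine, edges_fine, _ = plt.hist(prices, bins=100, alpha=0.5, label="bins=100")

plt.xlabel("Price")
plt.ylabel("Number of houses")
plt.legend()
plt.show()
```

With n bins there are n bar heights and n + 1 edges, and the heights always add up to the number of observations.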
Scatter Plots with Variable Colors
It would be interesting to see if there is a relationship between the size of the house and its sale price. A scatter plot is perfect for this purpose.
plt.scatter(df['sqft_living'], df['price'], s=5, c=df['sqft_living'], cmap='inferno')
plt.title('Price as a function of the living area')
plt.xlabel('Living Area')
plt.ylabel('Price')
plt.show()
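The s parameter defines the size of the dots, while c supplies the values that determine each dot's color. Here c is set to the living area, so brighter colors indicate larger houses; pass c=df['price'] instead if you want the color to follow the price. The cmap parameter selects the colormap, that is, the range of colors used.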
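A colorbar makes the color scale readable. As a sketch on synthetic data (the linear relation and the noise level are arbitrary assumptions, not properties of the real dataset), the example below colors each dot by its price and attaches a labeled colorbar with plt.colorbar().

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for df["sqft_living"] and df["price"]
rng = np.random.default_rng(seed=1)
living_area = rng.uniform(500, 5_000, size=200)
price = 200 * living_area + rng.normal(0, 100_000, size=200)

# Color each point by its price and keep the returned object for the colorbar
points = plt.scatter(living_area, price, s=5, c=price, cmap="inferno")
colorbar = plt.colorbar(points)  # adds a scale that maps colors back to values
colorbar.set_label("Price")

plt.xlabel("Living Area")
plt.ylabel("Price")
plt.show()
```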
Time Graphs
Another type of chart useful in finance is the time series chart, which illustrates how variables change over time. Let's consider an example with our dataset. If we want to examine the trend of average house prices over time, we first need to convert the date column into datetime format.
df['date'] = pd.to_datetime(df['date'])
Next, we can group the data by month and calculate the average price:
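# On pandas 2.2+, use 'ME' instead of 'M' (deprecated there); numeric_only skips non-numeric columns
df_grouped = df.resample('M', on='date').mean(numeric_only=True)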
Finally, we can plot the evolution of the average price:
plt.plot(df_grouped.index, df_grouped['price'])
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.title('Trend of Average House Prices Over Time')
plt.show()
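The monthly average can also be computed with groupby() on calendar-month periods, which is an alternative to resample() rather than the method used above. The sketch below runs on five made-up sales, and the YYYYMMDD date strings are an assumption made for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sales; the YYYYMMDD date strings are an assumption for this example
sales = pd.DataFrame({
    "date": ["20140502", "20140520", "20140610", "20140625", "20140703"],
    "price": [300_000, 500_000, 450_000, 350_000, 600_000],
})
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%d")

# Group by calendar month (a Period such as 2014-05) and average the prices
monthly_avg = sales.groupby(sales["date"].dt.to_period("M"))["price"].mean()
print(monthly_avg)

# Convert the periods back to timestamps so Matplotlib can place them on the axis
plt.plot(monthly_avg.index.to_timestamp(), monthly_avg.values, marker="o")
plt.xlabel("Date")
plt.ylabel("Average Price")
plt.show()
```

Working on a small sample like this makes it easy to verify the averages by hand before running the same steps on the full dataset.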
Other Charts
The Matplotlib gallery at https://matplotlib.org/stable/gallery/index shows the full range of charts that can be created with the library.
Seaborn, Another Visualization Library
While Matplotlib is a powerful and flexible library for data visualization, its interface can become verbose for more advanced visualization tasks. That's where Seaborn comes in. Seaborn is a Python data visualization library built on top of Matplotlib. It offers a high-level interface that makes complex charts simpler to create. Seaborn is particularly useful for producing attractive and informative statistical graphics and for visualizing data with multiple variables. It provides a variety of chart types, such as distribution plots and matrix plots (heatmaps, for example). Seaborn also integrates well with Pandas DataFrames, so you can plot data directly from your DataFrame. You'll find a list of examples at https://seaborn.pydata.org/examples/index.html.
Here is an example of its use:
import seaborn as sns # pip install seaborn to install the library
sns.scatterplot(x='sqft_living', y='price', data=df)
plt.show()
Conclusion
As you can see, Matplotlib is a powerful tool for data visualization in Python. It offers great flexibility and can be used to create a wide range of charts and visualizations, which can be extremely useful in the financial field to understand trends and relationships between different variables.
Remember that the key to good visualization is that it should be informative and easy to understand. Therefore, it is important to always take the time to properly format your charts, add appropriate titles, legends, and axis labels, and choose the types of charts that best illustrate your data.
Practical Exercise: Financial Data Analysis and Visualization with Matplotlib and Seaborn
For this exercise, we will continue using the "House Sales in King County, USA" dataset. The goal of this exercise is to deepen your understanding of financial data analysis and visualization.
Part 1: Display the 5 Largest Sales Volumes by Year
Your first challenge is to extract the year from the 'date' column and count the number of sales per year. Then, display a bar chart of the five years with the highest sales volume. Use Matplotlib for this.
Part 2: Display the Price Distribution for Houses with vs. Without a Water View
Use Matplotlib's hist() function to display two histograms on the same chart: one for the price distribution of houses with a water view and one for those without a water view. Make sure to add a legend to distinguish the two histograms.
Part 3: Correlation Diagram
Use Seaborn to create a heatmap of the correlation between the variables 'price', 'sqft_living', 'grade', 'sqft_above', and 'sqft_living15'. What conclusions can you draw from this heatmap?
Part 4: Boxplot of Price by Grade
The 'grade' in our dataset is an evaluation of the construction quality and design of the house. Use Seaborn to create a boxplot of the house price by 'grade'. What do you observe about the relationship between 'grade' and 'price'?
Part 5: Scatterplot of 'Price' vs. 'Sqft_Living' with Linear Regression
Create a scatterplot of 'price' as a function of 'sqft_living'. Use Seaborn's lmplot() function to also display a linear regression line. What relationship can you observe between the size of the house and its price?
A correction example can be found here: https://github.com/RobotTraders/Python_For_Finance/blob/main/exercise_correction_chapter_6.ipynb.