Master financial analysis with Pandas and avoid loops

Pandas is a Python library that offers powerful and easy-to-use data structures for data analysis. It's particularly effective for working with structured or tabular data, such as financial data.

One of the major benefits of using Pandas is its speed. By employing vectorized operations instead of loops, Pandas can perform operations on entire datasets much faster than traditional loop-based code would allow. Vectorized operations are also more readable and concise, making the code easier to understand and maintain.

In this episode, we'll show you how to unlock the full potential of Pandas for analyzing cryptocurrency market data. We'll focus on concrete examples to illustrate the concepts, hoping that you can then apply them to your own financial data analysis problems.

Importing and Preparing Data

Before diving into the analysis, we first need to prepare our data. In this training, we'll use cryptocurrency data available on the https://github.com/RobotTraders/Python_For_Finance GitHub repository . You should be able to download 10 CSV files representing the prices of various cryptocurrencies. These data are organized in CSV files, each containing the OHLCV (Open, High, Low, Close, Volume) data for a specific cryptocurrency.

Each CSV file includes a series of measurements, including the opening price, the highest price, the lowest price, the closing price, and the trading volume for each period (hourly period).

However, to effectively analyse these data, it would be more practical to consolidate them into a single DataFrame. This will allow us to work with all the data in a unified manner, rather than handling each file separately. Additionally, using a single DataFrame, we can easily perform comparative analyses between different cryptocurrencies.

Importing Data

The first step is to download all the available files from this link, and then place them in your directory.

This code will read the CSV file for each cryptocurrency, transform it into a DataFrame, and store it in a dictionary named df_list which has the names of the cryptocurrencies as keys.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

crypto_list = ["ADA", "BCH", "BNB", "BTC", "ETC", "ETH", "LINK", "LTC", "XLM", "XRP"]
df_list = {}
for crypto in crypto_list:
    df_list[crypto] = pd.read_csv(f"{crypto}-USDT.csv")
    df_list[crypto]["date"] = pd.to_datetime(df_list[crypto]["date"], unit="ms")
    df_list[crypto] = df_list[crypto].set_index("date")

We notice that the files always have the format {crypto}-USDT.csv. To read the files, we will create a list with the names of the cryptocurrencies that we will iterate through with a for loop to read the csv file and create an associated DataFrame. We will then transform the date column into datetime so that it is usable and set it as the index.

Data Cleaning

To verify data quality, you can use .describe().

df_list["ADA"].describe()  # df_list["ADA"] allows us to access the ADA DataFrame

At first glance, we do not notice any specific issues. However, a good practice when modifying indexes is also to check the uniqueness (no duplicates) of these indexes (as a reminder, we have set the dates as indexes). For this, we can use the .nunique() method to count the number of unique values and compare it with the number of rows in our DataFrame.

Here, we notice that there are more rows in our DataFrame than unique indexes.

We can check if there are duplicate indexes in our DataFrame using the duplicated method of Pandas. The keep='last' parameter means that we keep the last occurrence in case of duplicates.

print(df_list["ADA"].index.duplicated(keep='last'))

Our first line of code displays an array of False and True, where each line will be True if the index is duplicated.

To display the concerned rows, simply use DataFrame.loc[filter]. Here, our DataFrame is df_list["ADA"] and our filter is df_list["ADA"].index.duplicated(keep='last').

From there, we can delete all the duplicated rows. We start by storing our duplicated indexes and selecting our DataFrame without the duplicated indexes (to make an inverse selection, you can use the ~ symbol).

duplicate_index = df_list["ADA"].index.duplicated(keep='last')
df_list["ADA"] = df_list["ADA"][~duplicate_index]

Finally, we will apply this operation to all our DataFrames in the for loop to have "clean" data:

crypto_list = ["ADA", "BCH", "BNB", "BTC", "ETC", "ETH", "LINK", "LTC", "XLM", "XRP"]
df_list = {}
for crypto in crypto_list:
    df_list[crypto] = pd.read_csv(f"{crypto}-USDT.csv")
    df_list[crypto]["date"] = pd.to_datetime(df_list[crypto]["date"], unit="ms")
    df_list[crypto] = df_list[crypto].set_index("date")
    duplicate_index = df_list[crypto].index.duplicated(keep='last')
    df_list[crypto] = df_list[crypto][~duplicate_index]

Creating a Single DataFrame

We will now merge all this data into a single DataFrame, with one column for each cryptocurrency. To do this, we'll iterate through df_list and add the close column from each individual DataFrame.

df = pd.DataFrame()
for crypto in df_list:
    df[crypto] = df_list[crypto]["close"]

We will then remove all rows containing NaN (Not a Number) values using the dropna method from Pandas.

df = df.dropna()

We now have a DataFrame where each column represents the historical data of a cryptocurrency.

Data Visualization

Next, we can visualize the closing prices of each cryptocurrency using Matplotlib.

for col in df.columns:
    fig, ax = plt.subplots(figsize=(15, 7))
    df[col].plot(label=col)
    ax.set_ylabel(col) # Displays the name of the cryptocurrency on the Y-axis
    plt.show()

Correlation Analysis

Next, we can compute the correlation matrix of our data using the corr method from Pandas.

Correlation measures the degree of association between two variables. This association can be positive (when one variable increases, so does the other) or negative (when one variable increases, the other decreases). The value of the correlation ranges from -1 to 1. A 1 indicates a perfect positive correlation, a -1 indicates a perfect negative correlation, and a 0 indicates no correlation.

The correlation matrix thus allows us to display whether each of our cryptocurrencies is correlated with others.

df.corr()

The Seaborn visualization library offers a method for visualizing matrices, such as a correlation matrix, using 'heatmap' type graphs.

f, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(df.loc[:].corr(), annot=True)

We can notice from this graph that the highest correlation is between LINK and XLM. However, here, we calculate the correlation directly on the price, meaning two assets that are generally bullish will be highly correlated. It is much more relevant to calculate the correlation on the percentage change.

Therefore, we will calculate the percentage changes for each cryptocurrency using the pct_change method of Pandas, and visualize the correlation matrix of these percentage changes.

df_pct = df.pct_change()

Now, we are displaying the correlation matrix only for the data from 2023."

f, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(df_pct.loc["2023"].corr(), annot=True)

In 2023, the most significant correlation in price evolution is between Bitcoin (BTC) and Ethereum (ETH), with a 89% correlation.

Comparing Two Assets

Next, we will compare two assets by plotting their prices on two different y-axes but sharing the same x-axis. This will allow us to visually observe the correlation.

Let's start with the two most correlated assets, BTC and ETH, which have a 89% correlation.

fig, ax = plt.subplots(figsize=(15, 8))

data = df.loc["2023"]
asset1 = "BTC"
asset2 = "ETH"
# Plot the first asset on the primary axis
ax.plot(data.index, data[asset1], color='blue')
ax.set_ylabel(asset1, color='blue')
ax.tick_params(axis='y', labelcolor='blue')

# Create a second axis sharing the same x-axis
ax2 = ax.twinx()

# Plot the second asset on the secondary axis
ax2.plot(data.index, data[asset2], color='red')
ax2.set_ylabel(asset2, color='red')
ax2.tick_params(axis='y', labelcolor='red')
plt.show()

We can observe that there is an almost perfect correlation. Conversely, let's consider the least correlated assets, XRP and BCH, with a correlation of 43%

We notice that there seemed to be a strong correlation up until March 2023, but after that, the two assets were no longer correlated at all.

Analysis of Average Returns and Volatility

Which is the best cryptocurrency to have in a portfolio? It's hard to answer this question, but one response might be the crypto with the best returns and the lowest risk.

Therefore, we will attempt to address this question by displaying a chart with returns on one axis and risks (or more precisely, volatility) on the other axis for our cryptocurrencies.

To do this, we will calculate the return for each crypto as the average of the hourly returns, and the risk as the standard deviation of these hourly returns. We will then insert these values into lists so that we can display a scatter plot.

mean_return_list = []
volatility_list = []

temp_data = df_pct.loc[:]

fig, ax = plt.subplots(figsize=(15, 8))

for col in temp_data.columns:
    mean_return = temp_data[col].mean() * 100
    volatility = temp_data[col].std() * 100
    mean_return_list.append(mean_return)
    volatility_list.append(volatility)

    ax.text(mean_return, volatility, col)
    ax.scatter(mean_return, volatility, label=col)

ax.grid(True, linestyle='--', alpha=0.6)
ax.axvline(0, color='black', linestyle='--')
ax.axhline(0, color='black', linestyle='--')

ax.set_xlabel('Average Return', fontsize=12)
ax.set_ylabel('Volatility', fontsize=12)

ax.legend()

plt.show()

According to the graph, the cryptocurrency with the highest average return is BNB, while the most volatile is ETC.

From this graph, it's challenging to compare BTC and ETH directly, as BTC has lower volatility, but ETH offers higher returns. However, based on this graph, we can infer that historically, it was preferable to invest in ETH rather than LINK, since ETH both yields better returns and has lower volatility. The three most appealing cryptocurrencies to hold seem to be BTC, ETH, and BNB, depending on your risk aversion.

It could be insightful to generate this graph specifically for the year 2023.

temp_data = df_pct.loc["2023"]

In the year 2023, the best cryptocurrencies to have were BTC and BCH.

Analysis of the Sharpe Ratio

When discussing return/risk ratio, we cannot overlook the Sharpe Ratio. The Sharpe Ratio is a well-known financial metric that precisely calculates the return/risk couple. The formula is:

SharpeRatio = \frac{Average[R_p]}{{StandardDeviation[R_p]}} \times \sqrt{N}

Rp: Return (in percentage)

N: Number of periods per year

This formula provides a number (often around 1). If the result is below 0, it indicates poor returns. If it is between 0 and 1, it suggests that the risks incurred are too high relative to the expected returns. Finally, if it is above 1, it means that the expected returns compensate for the incurred risks.

We can start by displaying the Sharpe Ratio for each of the cryptocurrencies in our list:

temp_data = df_pct.loc[:]

for col in temp_data.columns:
    sharpe_ratio = (8760**0.5)*(temp_data[col].mean()/temp_data[col].std())
    print(col, sharpe_ratio)

The operator ** is used to raise to a power. To perform a square root, you raise to the power of 0.5.

We multiply by the square root of 8760 because our data is hourly, and there are 8760 hours in a year.

Next, we will visualize these Sharpe Ratios in a bar chart

sharpe_ratios = []

temp_data = df_pct.loc[:]

for col in temp_data.columns:
    sharpe_ratio = (8760**0.5)*(temp_data[col].mean()/temp_data[col].std())
    sharpe_ratios.append(sharpe_ratio)
    print(col, sharpe_ratio)

df_sharpe = pd.DataFrame(sharpe_ratios, index=temp_data.columns, columns=['Sharpe Ratio'])
df_sharpe = df_sharpe.sort_values(by='Sharpe Ratio')

plt.figure(figsize=(10, 6))
plt.barh(df_sharpe.index, df_sharpe['Sharpe Ratio'], color='lightgreen')
plt.xlabel('Sharpe Ratio')
plt.title('Sharpe Ratio per asset')
plt.show()

As could be anticipated from the scatter plot, the asset with the best Sharpe Ratio over our analysis period is BNB.

Analyzing Drawdowns

A drawdown represents a decline in an asset's price from its highest peak. This metric is crucial as it allows us to assess the most significant drops an asset has experienced in the past. If an asset has encountered a maximum drawdown of -80%, it's vital to consider that when purchasing this asset, there's a potential for it to lose 80% of its value. However, it's important to remember that past performance does not necessarily predict future results.

To calculate the drawdown, instead of doing it hourly, we can make it more readable by calculating it on a daily basis (this will be more readable on a graph). To transform our hourly data into daily data, we can use the .resample() method. Also, it's worth mentioning that we will be using our df DataFrame, which contains the prices, not their percentage changes.

To perform our calculation, we need the asset's maximum value over time. The .cummax() method (for cumulative maximum) allows us to do just that. For instance, if we have a series s containing 2, 4, 7, 5, 1, 8, 4, then s.cummax() will yield 2, 4, 7, 7, 7, 8, 8, representing the highest value reached on a rolling basis. Finally, to calculate our drawdown, we simply compare our current value with the cumulative maximum and convert it into a percentage.

In this code, we calculate the daily drawdown for BTC and ETH and display it on a graph.

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
temp_data = df.resample('D').last()
for col in ["BTC", "ETH"]:
    running_max = temp_data[col].cummax()
    drawdown = (temp_data[col] - running_max) / running_max
    plt.plot(drawdown, label=col)

plt.title('Asset Drawdown Over Time')
plt.xlabel('Time')
plt.ylabel('Drawdown')
plt.legend()
plt.grid(True)
plt.show()

We notice here that during the COVID crash in March 2020, Bitcoin experienced a drawdown of about 50%, and Ethereum a bit more than 60%. Lastly, in July 2022, both assets underwent a drawdown of more than 70%. For instance, if we consider Bitcoin, its peak was at $68,000 in November 2021, and in July 2022, its price was at $16,000.

(16000-68000)/68000 = -0.76

This equates to a drop of 76%.

Practical Exercise and Conclusion

This session was in a way a practical exercise in itself. However, if you wish to practice further, we invite you to, for example, recreate the scatter plot we saw earlier, replacing volatility with maximum drawdown. You could also try calculating the number of drawdowns exceeding 10% for each asset (how often the asset dropped more than 10% from its peak). A correction example can be found here: https://github.com/RobotTraders/Python_For_Finance/blob/main/exercise_correction_chapter_7.ipynb.

It's crucial to understand what was done in this session. If you've grasped it well, you should be able to apply these methods for in-depth analyses of different financial assets according to your interests.

So, you now know how to use Pandas effectively for financial data analysis. The key is understanding how Pandas works and leveraging its features to your advantage, aiming to avoid writing unnecessary loops and making your code more efficient. In the next episode, we will focus on mastering dates, an important component when dealing with numerous time series as in finance.