This tutorial focuses on the Spotify Audio Features data set, described below.
Spotify Audio Features – Available on Kaggle, posted by tomigelo
This Spotify Audio Features dataset is intriguing to me because of its relation to my capstone project. The dataset describes songs (entries) with several variables, including artist name, track ID, and track name. The remaining 18 variables are the audio features themselves, which include common musical features like duration, tempo, time signature, and key signature, as well as other features coined by Spotify, such as acousticness and danceability. There is information about roughly 130k songs in the dataset. My primary source of interest in this dataset is the connection with my capstone project, and the prospect of having multiple opportunities across different classes to familiarize myself with this information and how to work with it efficiently. I also think that the finalized tutorial for this course, done on the Spotify API, would be a valuable resource to include with the capstone project, and it could even be edited to describe exactly the processes involved in the capstone. For the tutorial, this data will be analyzed in several ways: various audio features are combined and compared in order to reach informative conclusions. For example, what effect does tempo have on danceability? What connection might mode (the musical mode, or scale, of a piece) have with Spotify's variable valence (essentially a happiness meter)? The dataset is also alluring because of its size and uniformity - 130k songs is a lot, and Spotify describes the exact format of each feature on its website.
Link: https://www.kaggle.com/tomigelo/spotify-audio-features/data
My question when examining this data has to do with the generation of some of the song 'scores' that Spotify calculates and uses to rate songs on their musical features and feel. How are these scores generated? Are they grounded in the other quantitative data about the songs that Spotify has also made available?
I'll be using the Spotify data set from April 2019 because it's the largest and most recent. Since the data set comes as a CSV file, it's easy to load into Pandas - which I do below.
import pandas as pd
import numpy as np
import seaborn as sns
!pip install lxml #parser backend used by pandas.read_html when scraping the features table below
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
spotify_df = pd.read_csv("./data/SpotifyAudioFeaturesApril2019.csv")
spotify_df.head()
This data is already well maintained - each row records a single song and each column records a specific attribute of that song. I'm going to drop the track_id, key, and popularity columns, which we won't need to analyze the traits of the music.
spotify_df = spotify_df.drop(columns = ['track_id', 'key', 'popularity'])
spotify_df.head()
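As a quick sanity check (an extra step I'm adding, not part of the original cleaning), the cell below confirms the shape and dtypes of the remaining columns and counts missing values. The exact column set assumes the April 2019 CSV layout.
print(spotify_df.shape)        # should be roughly 130k rows and 14 columns after the drop
print(spotify_df.dtypes)       # artist/track names load as objects, the rest as numeric types
print(spotify_df.isna().sum()) # missing values per column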
The code below scrapes a table from Spotify's API documentation, found at the URL in the code. This table contains the data types and descriptions of each column key in the main dataset above.
URL = "https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/"
r = requests.get(URL)
html_page = r.content
soup = BeautifulSoup(html_page, 'html.parser')
#soup.prettify()
table = soup.findAll('table')[2] #the table we want is the 3rd one on the page
table_str = str(table)
table_body = table.find('tbody')
spotify_features_df = pd.read_html(table_str, flavor = None)
spotify_features_df[0]
spotify_features_df = spotify_features_df[0]
spotify_features_df = spotify_features_df.drop([1, 13, 14, 15, 16, 17]) #removing unnecessary entries that aren't in the original dataset
spotify_features_df['Value Description'] = spotify_features_df['Value Description'].str.replace(r'The distribution of values for this feature look like this:', '') #removing unnecessary text from certain entries' Value Descriptions
pd.set_option('display.max_colwidth', None) #set option to display full text in the table (None replaces the deprecated -1)
display(spotify_features_df)
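To make the scraped reference table easier to use later, here's a small helper sketch for looking up a feature's description by name. It assumes the scraped table has a 'Key' column holding the feature names alongside the 'Value Description' column used above; the column name may differ if Spotify changes the page layout.
# Hypothetical helper: look up the description for a given audio feature.
# Assumes the feature names live in a 'Key' column of the scraped table.
def describe_feature(name, features_df=spotify_features_df):
    match = features_df[features_df['Key'] == name]
    if match.empty:
        return "No description found for '{}'".format(name)
    return match['Value Description'].iloc[0]

print(describe_feature('danceability'))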
Now that we know what all the song description keys refer to, let's see generally how they are related. The code below calculates and displays the correlation between each quantitative variable in the original dataset.
correlation = spotify_df.select_dtypes(include=np.number).corr() #correlation between all numeric columns; artist and track name columns are excluded
sns.heatmap(correlation, center = 0)
This chart displays the correlation between all the quantifiable variables in the set. Black indicates no correlation, blue hues indicate negative correlation, and red hues indicate positive correlation.
This plot largely follows common sense. Variables like duration_ms wouldn't be expected to correlate with variables that describe the feel of the music, like acousticness, danceability, or energy. On the other hand, it makes sense that energy and loudness are positively correlated, and that speechiness and instrumentalness are negatively correlated.
Basically -- nothing too surprising here.
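To back up that visual reading with numbers, a short sketch like the one below lists the strongest pairwise correlations from the matrix computed above.
# Strongest pairwise correlations (upper triangle only, to skip self- and duplicate pairs)
mask = np.triu(np.ones(correlation.shape, dtype=bool), k=1)
corr_pairs = correlation.where(mask).stack()
print(corr_pairs.reindex(corr_pairs.abs().sort_values(ascending=False).index).head(10))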
Next we'll view the distributions of the different variables.
columns = spotify_df.columns.tolist()
# Remove the first two columns (artist_name and track_name), whose distributions don't make sense to plot
columns = columns[2:]
# Separate the attributes measured on a 0-1 scale from the others, for display purposes
# columns_01: acousticness, danceability, energy, instrumentalness, liveness, speechiness, valence
columns_01 = columns[0:2] + columns[3:6]
columns_01.append(columns[8])
columns_01.append(columns[11])
# columns_other: tempo, time_signature, duration_ms, loudness, mode
columns_other = columns[9:11]
columns_other.append(columns[2])
columns_other.append(columns[6])
columns_other.append(columns[7])
fig, axs = plt.subplots(7, 1, figsize=(15,15))
a = axs.ravel()
for idx, ax in enumerate(a):
    ax.hist(spotify_df[columns_01[idx]], bins=50)
    ax.set_xlabel(columns_01[idx])
plt.tight_layout()
The graphs above display the distributions of the variables for which Spotify calculates its own score for each song. These variables are all measured between 0 and 1. The graphs below show the distributions of the variables that describe the song itself.
fig, axs = plt.subplots(5, 1, figsize=(15,15))
a = axs.ravel()
for idx, ax in enumerate(a):
    ax.hist(spotify_df[columns_other[idx]], bins=50)
    ax.set_xlabel(columns_other[idx])
plt.tight_layout()
There's a lot of information displayed in these various distributions. I see three main trends across the different variables: values clustered at both extremes (near 0 and near 1), values spread more evenly but with clear peaks, and values clustered near 0.
Most of the variables clearly display one or more of these trends. Acousticness and instrumentalness are grouped around 0 and 1. Danceability and energy are more evenly spread but still have clear peaks. Liveness and speechiness are grouped mostly around 0.
Loudness has an interesting scale because it's measured in decibels, with values mostly falling below zero. Tempo measures beats per minute. Time signature measures beats per measure, and it seems Spotify records only the lowest factor (no 6 or 8 or 10, interestingly enough).
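The numeric summaries below are a quick way to double-check these observations; they just apply describe() to the same two column groups used for the histograms.
# Summary statistics for the 0-1 'score' variables and for the other descriptive variables
display(spotify_df[columns_01].describe().round(3))
display(spotify_df[columns_other].describe().round(3))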
In this tutorial, I am going to measure the predictability of certain variables which describe the feel or sound of the music based on variables which technically describe the music itself. To me, the former set of variables is more subjective - Spotify has developed algorithms to assign each song a score. The latter set of variables is more factual - a song might be in 4/4 time and play at 140 bpm with an average loudness of -10 dB, and these traits are more or less indisputable. In this process, I will essentially be attempting to reverse engineer Spotify's scores given all the concrete data about the music that they also provide in the API.
I'm going to focus on three 'score' variables - danceability, energy, and valence - and see what their relationships with the factual variables can tell us, and whether it's possible to make predictions between these two sets of variables.
First I'll start by calculating and viewing some specific correlations between these variables. This correlation matrix contains the same information as the heatmap from above, but narrows it down to our variables of focus.
Pay specific attention to the relationships between 'score' and 'factual' variables.
I predict that energy and tempo will be positively correlated. I also predict that mode and valence will be extremely positively correlated.
Generally, my hypothesis is that the scores will not be easy to reverse engineer given the other, more 'factual' data that Spotify provides. This is my hunch because the 'score' variables aren't ones that could be described easily using only the other provided variables. For instance, it's difficult to identify a range of tempos that makes a song 'more danceable' - that judgment is more subjective.
variables = ['danceability', 'energy', 'valence', 'loudness', 'tempo', 'time_signature', 'mode']
spotify_df[variables].corr()
This information contains some surprises. I expected mode and valence to be positively correlated, because a mode of 1 indicates a major key and major keys are often associated with happy music (high valence), yet the actual correlation is negligible. In fact, none of the subjective-to-non-subjective correlations exceed about 0.23 in absolute value.
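As a quick follow-up check on the mode/valence relationship, the sketch below compares average valence for minor (mode = 0) and major (mode = 1) tracks; near-identical group means would be consistent with the weak correlation above.
# Mean and median valence for minor (0) vs. major (1) tracks
print(spotify_df.groupby('mode')['valence'].agg(['mean', 'median', 'count']))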
Next I'll create some visuals of specific relationships.
spotify_df.plot.scatter(x = 'time_signature', y = 'danceability', c = 'tempo', alpha = 0.5, sharex=False)
Pretty even spread of danceability across tempo and time signature.
spotify_df.plot.scatter(x = 'tempo', y = 'valence', c = 'mode', sharex = False)
What a mess! Not much to see here.
spotify_df.plot.scatter(x = 'loudness', y = 'energy', c = 'tempo', alpha = 0.5, sharex = False)
Finally a positive correlation, though tempo doesn't seem to have much correlation with energy or loudness.
These visuals aren't particularly clear or helpful, illustrating yet again the lack of any strong correlations between the sets of subjective and non-subjective variables. This doesn't bode well for predicting values of the subjective variables from the non-subjective variables, but that is exactly what we'll try to do next.
Below I use the scikit-learn API to fit a k-nearest neighbors regression model to the data and use it to predict the score variables. I evaluate the prediction model with 5-fold cross-validation, and choose the number of neighbors k that minimizes the cross-validated error.
Note: The source for much of this code is lab11 - ModelErrorandTuning
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
After several trials (and leaving my computer running overnight), I determined that the full dataset is too large to run through these processes in a reasonable amount of time. To address this, I work with only the first sixth of the rows, restrict the search over k to a limited range, and use 5-fold cross-validation:
spotify_df_1 = np.array_split(spotify_df, 6)[0]
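# Note (an aside of mine): np.array_split takes the first sixth of the rows in whatever order they
# appear in the CSV. If that order is systematic, a shuffled sample may be safer; a minimal
# alternative (not what I used above) would be:
# spotify_df_sample = spotify_df.sample(frac=1/6, random_state=0)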
features = ["loudness", "tempo",
"time_signature", "mode"] #including the relevant 'non-subjective' variables
X_dict = spotify_df_1[features].to_dict(orient="records")
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
# Define a function (adapted from lab11) that computes the cross-validation MSE for a given
# number of neighbors k; it's used below to find the k that minimizes MSE for each target variable
def get_cv_error(k, y):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y,
        cv=5, scoring='neg_mean_squared_error'
    ))
    return mse
Next I create a function which returns the root mean squared error for a given target variable - I'll apply it to danceability, energy, and valence - using several steps:
def get_rmse(y):
    # apply the function above for k from 10 to 40 to find the optimal k
    ks = pd.Series(range(10, 41))
    ks.index = range(10, 41)
    test_errs = ks.apply(get_cv_error, args=(y,))
    # use the k that minimizes the cross-validation error for the remainder of the trial
    my_k = test_errs.idxmin()
    # build the pipeline model using the optimized k
    model = KNeighborsRegressor(n_neighbors=my_k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    # calculate scores with 5-fold cross-validation
    scores = cross_val_score(pipeline, X_dict, y,
                             cv=5, scoring="neg_mean_squared_error")
    # return the root of the mean of the (negated) scores
    return np.sqrt(np.mean(-scores))
Valence, energy, and danceability are all measured between 0 and 1, so I'll run the function for each to determine which is the most easily predictable.
rmse_scores = []
rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))
data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data)
rmse_scores_df
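For context, it helps to compare these against a naive baseline: a model that always predicts the mean of the target has an RMSE equal to the target's standard deviation. The sketch below (an addition of mine, not part of the original pipeline) prints that baseline for each score variable.
# Baseline RMSE: always predicting the mean gives an RMSE equal to the target's standard deviation
for target in ['valence', 'energy', 'danceability']:
    print("{}: baseline RMSE = {:.3f}".format(target, spotify_df_1[target].std()))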
These RMSE values are alright, though not especially low. I'm now going to see whether excluding some of the variables from the machine learning model will lower the error values.
features = ["loudness", "tempo",
"time_signature"] #excluding mode for this trial
X_dict = spotify_df_1[features].to_dict(orient="records")
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
rmse_scores = []
rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))
data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data)
rmse_scores_df
features = ["loudness", "tempo", "mode"] #excluding time signature for this trial
X_dict = spotify_df_1[features].to_dict(orient="records")
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
rmse_scores = []
rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))
data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data)
rmse_scores_df
features = ["loudness", "mode", "time_signature"] #excluding tempo for this trial
X_dict = spotify_df_1[features].to_dict(orient="records")
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
rmse_scores = []
rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))
data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data)
rmse_scores_df
features = ["mode", "time_signature", "tempo"] #excluding loudness for this trial
X_dict = spotify_df_1[features].to_dict(orient="records")
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
rmse_scores = []
rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))
data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data)
rmse_scores_df
The errors were lowest, by a tiny margin, when all four variables (tempo, time signature, loudness, and mode) were used in the machine learning model. However, the small variation in error when excluding each variable is another indication that these variables don't have much predictive power for the scores being generated. No single variable helps particularly more than the others when attempting to reverse engineer the scores.
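As a side note, the four trials above repeat the same cell with different feature lists. If I were rerunning this, the whole ablation could be consolidated into a single loop along these lines (a sketch; it reuses get_rmse and is just as slow as the separate cells).
# Consolidated feature-ablation sketch: loop over the same subsets used in the trials above.
# X_dict is reassigned globally because get_rmse / get_cv_error read it from the enclosing scope.
all_features = ["loudness", "tempo", "time_signature", "mode"]
targets = ["valence", "energy", "danceability"]
rows = []
for excluded in [None] + all_features:
    feats = [f for f in all_features if f != excluded]
    X_dict = spotify_df_1[feats].to_dict(orient="records")
    rows.append({"excluded": excluded or "none",
                 **{t: get_rmse(spotify_df_1[t]) for t in targets}})
display(pd.DataFrame(rows))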
In my attempt to reverse engineer three Spotify 'score' variables - valence, energy, and danceability - using the other, more factual variables, I achieved errors of roughly 25% and under. Of course, these errors are still fairly high and indicate that more information is needed to generate these scores.
My hypothesis that reverse engineering the scores with the given information would be difficult was correct - more information is needed than the factual variables alone provide.
Surface-level research does not yield any resources explaining how Spotify generates these scores, what they take into account, or what the algorithms might look like. For now, these scores and the algorithms behind them will remain a mystery.