Final Tutorial- Jacob Gartenstein

This tutorial focuses on the Spotify Audio Features data set, described below.

Spotify Audio Features – Available on Kaggle, posted by tomigelo

This Spotify Audio Features dataset is intriguing to me because of its relation to my capstone project. The dataset describes songs (entries) with several identifying variables, including artist name, track ID, and track name. The remaining variables are the audio features themselves, which include common musical features like duration, tempo, time signature, and key signature, as well as features coined by Spotify, such as acousticness and danceability. The dataset contains information on roughly 130k songs. My primary interest in this dataset is its connection with my capstone project, and the prospect of multiple opportunities across different classes to familiarize myself with this information and learn to work with it efficiently. I also think that the finalized tutorial for this course, built around the Spotify audio features data, would be a valuable resource to include with the capstone project, and it could even be edited to describe exactly the processes involved in the capstone. For the tutorial, this data will be analyzed in several ways: various audio features are combined and compared in order to reach informative conclusions. For example, what effect does tempo have on danceability? What connection might mode (the musical mode, or scale, of a piece) have with Spotify's valence variable (essentially a happiness meter)? The dataset is also alluring because of its size and uniformity - 130k songs is a lot, and Spotify describes the exact format of each feature on its website.

Link: https://www.kaggle.com/tomigelo/spotify-audio-features/data

My question when examining this data has to do with the generation of some of the song 'scores' that Spotify calculates and uses to rate songs on their musical features and feel. How are these scores generated? Are they grounded in other quantitative data about the songs that Spotify has also made available?

Extraction, Transform, and Load (ETL) + Exploratory Data Analysis

I'll be using the Spotify data set from April 2019 because it's the largest and most recent. Since the data set comes as a CSV file, it's easy to load into Pandas - which I do below.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
!pip install lxml  # lxml is required by pd.read_html, used later to parse the scraped features table
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
Requirement already satisfied: lxml in /opt/conda/lib/python3.7/site-packages (4.4.2)
In [2]:
spotify_df = pd.read_csv("./data/SpotifyAudioFeaturesApril2019.csv")
spotify_df.head()
Out[2]:
artist_name track_id track_name acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence popularity
0 YG 2RM4jf1Xa9zPgMGRDiht8O Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj 0.005820 0.743 238373 0.339 0.000 1 0.0812 -7.678 1 0.4090 203.927 4 0.118 15
1 YG 1tHDG53xJNGsItRA3vfVgs BAND DRUM (feat. A$AP Rocky) 0.024400 0.846 214800 0.557 0.000 8 0.2860 -7.259 1 0.4570 159.009 4 0.371 0
2 R3HAB 6Wosx2euFPMT14UXiWudMy Radio Silence 0.025000 0.603 138913 0.723 0.000 9 0.0824 -5.890 0 0.0454 114.966 4 0.382 56
3 Chris Cooq 3J2Jpw61sO7l6Hc7qdYV91 Lactose 0.029400 0.800 125381 0.579 0.912 5 0.0994 -12.118 0 0.0701 123.003 4 0.641 0
4 Chris Cooq 2jbYvQCyPgX3CdmAzeVeuS Same - Original mix 0.000035 0.783 124016 0.792 0.878 7 0.0332 -10.277 1 0.0661 120.047 4 0.928 0

This data is already well maintained - each row records a single song and each column records a specific attribute of that song. I'm going to drop the track_id, key, and popularity columns, which we won't need to analyze the traits of the music.
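First, though, a quick sanity check (a minimal sketch using the DataFrame loaded above) to confirm the data really is as complete as it looks:

print(spotify_df.isnull().sum().sum())   # total count of missing cells; 0 would confirm the data is complete
print(spotify_df.duplicated().sum())     # count of fully duplicated rows
print(spotify_df.shape)                  # number of rows and columns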

In [3]:
spotify_df = spotify_df.drop(columns = ['track_id', 'key', 'popularity'])
spotify_df.head()
Out[3]:
artist_name track_name acousticness danceability duration_ms energy instrumentalness liveness loudness mode speechiness tempo time_signature valence
0 YG Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj 0.005820 0.743 238373 0.339 0.000 0.0812 -7.678 1 0.4090 203.927 4 0.118
1 YG BAND DRUM (feat. A$AP Rocky) 0.024400 0.846 214800 0.557 0.000 0.2860 -7.259 1 0.4570 159.009 4 0.371
2 R3HAB Radio Silence 0.025000 0.603 138913 0.723 0.000 0.0824 -5.890 0 0.0454 114.966 4 0.382
3 Chris Cooq Lactose 0.029400 0.800 125381 0.579 0.912 0.0994 -12.118 0 0.0701 123.003 4 0.641
4 Chris Cooq Same - Original mix 0.000035 0.783 124016 0.792 0.878 0.0332 -10.277 1 0.0661 120.047 4 0.928

The code below scrapes a table from Spotify's API documentation at the URL shown in the code. This table contains the data types and descriptions of each column key in the main dataset (above).

In [4]:
URL = "https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/"

r = requests.get(URL) 
html_page = r.content
soup = BeautifulSoup(html_page, 'html.parser')

#soup.prettify()
In [5]:
table = soup.findAll('table')[2] #the table we want is the 3rd one on the page

table_str = str(table)
table_body = table.find('tbody')

spotify_features_df = pd.read_html(table_str, flavor = None)
spotify_features_df[0]

spotify_features_df = spotify_features_df[0]
spotify_features_df = spotify_features_df.drop([1, 13, 14, 15, 16, 17]) #removing entries that aren't in the original dataset
spotify_features_df['Value Description'] = spotify_features_df['Value Description'].str.replace(r'The distribution of values for this feature look like this:', '') #removing unnecessary boilerplate text from certain entries' Value Descriptions
In [6]:
pd.set_option('display.max_colwidth', -1) #set option to display full text in the table

display(spotify_features_df)
Key Value Type Value Description
0 duration_ms int The duration of the track in milliseconds.
2 mode int Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
3 time_signature int An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
4 acousticness float A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
5 danceability float Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
6 energy float Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
7 instrumentalness float Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
8 liveness float Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
9 loudness float The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
10 speechiness float Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
11 valence float A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
12 tempo float The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

Now that we know what all the song description keys refer to, let's see generally how they are related. The code below calculates and displays the correlation between each quantitative variable in the original dataset.

In [7]:
correlation = spotify_df.corr() #non-numeric columns are dropped automatically
sns.heatmap(correlation, center = 0)
Out[7]:
[heatmap: correlations between the quantitative audio features, centered at 0]

This chart displays the correlation between all the quantifiable variables in the set. Black indicates no correlation. Blue hue indicates negative correlation and red hue indicates positive correlation.

This plot largely follows common sense. Variables like duration (ms) wouldn't be expected to correlate with variables that describe the feel of the music, like acousticness, danceability, and energy. On the other hand, it makes sense that energy and loudness are positively correlated, and that speechiness and instrumentalness are negatively correlated.

Basically - nothing too surprising here.
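To pull the specific pairs mentioned above directly out of the matrix (a small sketch using the correlation DataFrame computed in the previous cell):

print(correlation.loc['energy', 'loudness'])              # expected to be clearly positive
print(correlation.loc['speechiness', 'instrumentalness']) # expected to be negative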

Next we'll view the distributions of the different variables.

In [8]:
columns = spotify_df.columns.tolist()
# removing artist_name and track_name, whose distributions don't make sense to plot

columns = columns[2:]

#separating the attributes which are scored from 0 to 1 from the others, for display purposes
columns_01 = columns[0:2] + columns[3:6]
columns_01.append(columns[8])
columns_01.append(columns[11])
columns_other = columns[9:11]
columns_other.append(columns[2])
columns_other.append(columns[6])
columns_other.append(columns[7])


fig, axs = plt.subplots(7,1,figsize=(15,15))
a = axs.ravel()

for idx,ax in enumerate(a):
    ax.hist(spotify_df[columns_01[idx]], bins=50)
    ax.set_xlabel(columns_01[idx])

plt.tight_layout()

The graphs above display the distributions of the variables for which Spotify calculates its own score for each song. These variables are all measured between 0 and 1. The graphs below show the distributions of the variables that describe the song itself.

In [16]:
fig, axs = plt.subplots(5, 1 ,figsize=(15,15))
a = axs.ravel()

for idx,ax in enumerate(a):
    ax.hist(spotify_df[columns_other[idx]], bins=50)
    ax.set_xlabel(columns_other[idx])

plt.tight_layout()

There's a lot of information displayed in these various distributions. I see three main trends that are present for different variables:

  • clustered at both extremes (near 0 and near 1)
  • more or less evenly distributed
  • clustered at one extreme (near 0 or near 1)

Most of the variables clearly display one of these trends. Acousticness and instrumentalness are grouped around both 0 and 1. Danceability and energy are more evenly spread but still have clear peaks. Liveness and speechiness are grouped mostly near 0.

Loudness has an interesting scale because it's measured in decibels, whose values here fall below zero. Tempo measures beats per minute. Time signature measures beats per measure, and it appears only a handful of reduced values occur (no 6, 8, or 10, interestingly enough).
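One quick way to verify which time signature values actually appear (a small sketch):

print(spotify_df['time_signature'].value_counts().sort_index()) # count of songs per time signature value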

Analysis

In this analysis, I am going to measure how predictable certain variables that describe the feel or sound of the music are from variables that technically describe the music itself. To me, the former set of variables is more subjective - Spotify has developed algorithms to assign each song a score. The latter set is more factual - a song might be in 4/4 time, play at 140 bpm, and have an average loudness of -10 dB. These traits are more or less indisputable. In this process, I will essentially be attempting to reverse engineer Spotify's scores given the concrete data about the music that they also provide in the API.

I'm going to focus on three 'score' variables - danceability, energy, and valence - and see what their relationships with factual variables can tell us, and whether it's possible to make predictions between these two sets of variables.

First I'll start by calculating and viewing some specific correlations between these variables. This correlation matrix contains the same information as the heatmap from above, but narrows it down to our variables of focus.

Pay specific attention to the relationships between 'score' and 'factual' variables.
I predict that energy and tempo will be positively correlated. I also predict that mode and valence will be strongly positively correlated.

Generally, my hypothesis is that the scores will not be easy to reverse engineer from the other, more 'factual' data that Spotify provides. This is my hunch because the 'score' variables aren't ones that could be described easily using only the other provided variables. For instance, it's difficult to identify a range of tempos that makes a song 'more danceable' - that's more subjective.

In [20]:
variables = ['danceability', 'energy', 'valence', 'loudness', 'tempo', 'time_signature', 'mode']
spotify_df[variables].corr()
Out[20]:
danceability energy valence loudness tempo time_signature mode
danceability 1.000000 0.286196 0.461468 0.431554 0.081791 0.206328 -0.057912
energy 0.286196 1.000000 0.314768 0.766697 0.229930 0.165030 -0.069263
valence 0.461468 0.314768 1.000000 0.319881 0.104857 0.069162 0.011082
loudness 0.431554 0.766697 0.319881 1.000000 0.223067 0.179679 -0.036081
tempo 0.081791 0.229930 0.104857 0.223067 1.000000 0.083759 -0.000249
time_signature 0.206328 0.165030 0.069162 0.179679 0.083759 1.000000 -0.036244
mode -0.057912 -0.069263 0.011082 -0.036081 -0.000249 -0.036244 1.000000

This information contains some surprises. I expected mode and valence to be positively correlated, because a mode of 1 indicates a major key and major keys are often associated with happy music (high valence); instead their correlation is essentially zero. Aside from loudness, none of the score-to-factual correlations exceeds about 0.23 in absolute value.
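To view just the score-to-factual block of this matrix in one place, we can slice it (a small sketch reusing the variables list defined above):

score_vars = ['danceability', 'energy', 'valence']
factual_vars = ['loudness', 'tempo', 'time_signature', 'mode']
print(spotify_df[variables].corr().loc[score_vars, factual_vars]) # rows: scores, columns: factual variables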

Next I'll create some visualizations of specific relationships.

In [21]:
spotify_df.plot.scatter(x = 'time_signature', y = 'danceability', c = 'tempo', alpha = 0.5, sharex=False)
Out[21]:
[scatter plot: danceability vs. time_signature, points colored by tempo]

Pretty even spread of danceability across tempo and time signature.
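A grouped summary can put rough numbers on that impression (a small sketch; mean danceability for each time signature value):

print(spotify_df.groupby('time_signature')['danceability'].mean()) # roughly flat values would match the even spread seen above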

In [22]:
spotify_df.plot.scatter(x = 'tempo', y = 'valence', c = 'mode', sharex = False)
Out[22]:
[scatter plot: valence vs. tempo, points colored by mode]

What a mess! Not much to see here.
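With well over 100k points, an ordinary scatter plot saturates. A hexbin plot (a sketch with arbitrary gridsize and colormap choices) shows density rather than individual points and might reveal more structure:

spotify_df.plot.hexbin(x = 'tempo', y = 'valence', gridsize = 40, cmap = 'viridis') # density view of the same relationship
plt.show()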

In [23]:
spotify_df.plot.scatter(x = 'loudness', y = 'energy', c = 'tempo', alpha = 0.5, sharex = False)
Out[23]:
[scatter plot: energy vs. loudness, points colored by tempo]

Finally, a clear positive correlation - between loudness and energy - though tempo doesn't appear strongly related to either.

These visuals aren't particularly clear or helpful, illustrating yet again the lack of any strong correlations between the sets of subjective and non-subjective variables. This doesn't bode well for predicting values of the subjective variables from the non-subjective variables, but that is exactly what we'll try to do next.

Below I use the scikit-learn API to fit a k-nearest neighbors regression model to the data and use it to predict held-out data. I grade the prediction model using 10-fold cross-validation to tune k and estimate the validation error.

Note: The source for much of this code is lab11 - ModelErrorandTuning

In [24]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

After several trials (and leaving my computer running overnight), I determined that the data set is too large to run through these processes in a reasonable amount of time. I implement a three-part solution to address this problem:

  • reducing the data set to 1/6 of its original size
  • slightly truncating the range of k values that are tested
  • performing 5-fold cross-validation rather than 10-fold
In [30]:
spotify_df_1 = np.array_split(spotify_df, 6)[0]
In [31]:
features = ["loudness", "tempo",
            "time_signature", "mode"] #including the relevant 'non-subjective' variables

X_dict = spotify_df_1[features].to_dict(orient="records")

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
In [32]:
def get_cv_error(k, y):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y, 
        cv=5, scoring='neg_mean_squared_error'
    ))
    return mse
#Here I define a function (taken from lab11) that will be used to test the optimal k value to minimize MSE for each variable being tested

Next I define a function that returns the root mean squared error for a given target variable - danceability, energy, or valence - using several steps:

  • build a model using the same four variables listed above as prediction features and the assigned y variable
  • calculate the k value that minimizes error for that y variable
  • use 5-fold cross-validation to run 5 trials of prediction using the optimized k value
  • average the trials to obtain a final root mean squared error for that y variable
In [33]:
def get_rmse(y):
    #applying the above function for y and k 10-40 to optimize k 
    ks = pd.Series(range(10, 41))
    ks.index = range(10, 41)
    test_errs = ks.apply(get_cv_error, args=(y,))
    
    #assigning optimized k as k for the remainder of trials
    my_k = test_errs.idxmin()
    
    #creating pipeline model using optimized k
    model = KNeighborsRegressor(n_neighbors=my_k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    
    #calculating scores for 5 fold cross validation
    scores = cross_val_score(pipeline, X_dict, y, 
                         cv=5, scoring="neg_mean_squared_error")
    
    #converting the mean negative MSE into a root mean squared error and returning it
    return np.sqrt(np.mean(-scores))

Valence, energy, and danceability are all measured between 0 and 1, so I'll run the function for each to determine which is the most easily predictable.

In [34]:
rmse_scores = []

rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))

data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data) 
  

rmse_scores_df
Out[34]:
variable rmse_score
0 valence 0.239873
1 energy 0.151358
2 danceability 0.154630

These RMSE values are moderate - on a 0-to-1 scale, errors of roughly 0.15 to 0.24 leave plenty unexplained. I'm now going to see whether excluding some of the variables from the machine learning model will lower the error values.

In [35]:
features = ["loudness", "tempo",
            "time_signature"] #excluding mode for this trial

X_dict = spotify_df_1[features].to_dict(orient="records")

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()


rmse_scores = []

rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))

data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data) 
  

rmse_scores_df
Out[35]:
variable rmse_score
0 valence 0.239568
1 energy 0.151185
2 danceability 0.154519
In [36]:
features = ["loudness", "tempo", "mode"] #excluding time signature for this trial

X_dict = spotify_df_1[features].to_dict(orient="records")

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()


rmse_scores = []

rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))

data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data) 
  

rmse_scores_df
Out[36]:
variable rmse_score
0 valence 0.240642
1 energy 0.151711
2 danceability 0.157018
In [37]:
features = ["loudness", "mode", "time_signature"] #excluding tempo for this trial

X_dict = spotify_df_1[features].to_dict(orient="records")

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()


rmse_scores = []

rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))

data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data) 
  

rmse_scores_df
Out[37]:
variable rmse_score
0 valence 0.242121
1 energy 0.153711
2 danceability 0.163393
In [38]:
features = ["mode", "time_signature", "tempo"] #excluding loudness for this trial

X_dict = spotify_df_1[features].to_dict(orient="records")

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()


rmse_scores = []

rmse_scores.append(get_rmse(spotify_df_1['valence']))
rmse_scores.append(get_rmse(spotify_df_1['energy']))
rmse_scores.append(get_rmse(spotify_df_1['danceability']))

data = {'variable':['valence', 'energy', 'danceability'], 'rmse_score':[rmse_scores[0], rmse_scores[1], rmse_scores[2]]}
rmse_scores_df = pd.DataFrame(data) 
  

rmse_scores_df
Out[38]:
variable rmse_score
0 valence 0.250236
1 energy 0.234008
2 danceability 0.160782

Contrary to my expectation, the errors were not lowest with all four variables: dropping mode actually lowered every error by a hair, and dropping time signature or tempo changed the errors only slightly. Only dropping loudness noticeably hurt the predictions, particularly for energy. Overall, the small variation in error when excluding each variable is another indication that these variables individually carry little predictive power for the scores - no single variable, with the partial exception of loudness, helps much when attempting to reverse engineer them.
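For reference, the four trials above could be condensed into a single loop over feature subsets - a sketch of the same procedure (it is just as slow to run, and it relies on the fact that get_rmse and get_cv_error read the global X_dict):

# condensed version of the ablation above: rebuild the global X_dict for each feature subset
all_features = ["loudness", "tempo", "time_signature", "mode"]
score_vars = ["valence", "energy", "danceability"]

results = {}
for excluded in [None] + all_features:
    feats = [f for f in all_features if f != excluded]
    X_dict = spotify_df_1[feats].to_dict(orient="records")
    label = "all features" if excluded is None else "minus " + excluded
    results[label] = [get_rmse(spotify_df_1[v]) for v in score_vars]

pd.DataFrame(results, index=score_vars) # one column per trial, one row per score variable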

Conclusion

In my attempt to reverse engineer three Spotify 'score' variables - valence, energy, and danceability - using other, more factual variables, I achieved root mean squared errors of roughly 0.15 to 0.24 on a 0-to-1 scale. Of course, these errors are still fairly high and indicate that more information is needed to generate these scores.

My hypothesis that reverse engineering the scores with the given information would be difficult was correct - the 'factual' variables alone are not enough to generate these scores.

Surface-level research does not yield any resources explaining how Spotify generates these scores, what they take into account, or what the algorithms might look like. For now, these scores and the algorithms behind them will remain a mystery.