Spotify Data Analysis using Synapse and Power BI

In this article, I created a very simple end-to-end Business Intelligence (BI) project using Azure Synapse and Power BI. But before that, what is BI? In short, BI is a process of collecting, storing, analyzing, and converting raw data into actionable insights which help organizations make better data-driven decisions.
This project was created using Azure Synapse Analytics and a Spotify dataset. Azure Synapse is a unified platform to ingest, explore, prepare, transform, manage, and serve data for immediate BI and machine learning needs.

Below are the steps followed in this project (please refer to the above):
- Extracted data from the Spotify API through an Azure Synapse notebook (powered by Apache Spark)
- Transformed the data using the same Synapse notebook
- Loaded the data into Azure Data Lake from the Synapse notebook
- Analysed the data using the Synapse notebook
- Connected the data from the data lake to Power BI and built the dashboard.
We can also use an Azure Synapse pipeline for extracting the data instead of a notebook. However, I used a notebook in this project.
Here, I copied the code from my Synapse notebook and pasted it into this article, as there is no option for embedding a Synapse notebook into Medium, and notebook screenshots are not good for reading.
This article won't cover how to create a Synapse notebook in Azure or any cloud concepts. Please check here for how to create a Synapse notebook.
- Spotify Client ID and Client Secret
- Azure Synapse subscription
- Power BI Desktop
- Basic cloud knowledge
The first step is to get the Client ID and Client Secret from Spotify. For that, we need to sign up on the Spotify website and create an app to capture the Client ID and Client Secret.
Spotipy is a lightweight Python wrapper around the Spotify Web API, and it makes extracting the data very easy.
In a Synapse notebook, we can add an external Python library through a text file and add it as a package in the Spark pool. I named the text file requirements.txt and added the library name. Please check the documentation. Also, we need to create a Spark pool (used for compute) for running the Synapse notebook. Please check here how to create a Spark pool in Synapse Studio.
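For reference, the requirements.txt uploaded to the Spark pool is just a plain list of package names, one per line (shown unpinned here; pinning a specific version is optional):

```
spotipy
```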

Import the spotipy library, initialise it, and provide your Client ID and Client Secret, stored in variables called client_id and client_secret.
# Import spotipy and initialise
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = 'YOUR SPOTIFY CLIENT ID'
client_secret = 'YOUR SPOTIFY CLIENT SECRET'
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
Import the other necessary libraries; I used Plotly for the visualization.
# Import necessary libraries
import pandas as pd
import numpy as np
import chart_studio.plotly as py
import chart_studio.tools as tls
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import chart_studio
The Spotify API provides tracks, artists and audio features. First, we extracted the track and artist features. Then, we extracted the audio features and merged everything into one dataset.
Spotipy provides an sp.search() method for querying tracks. Please check the documentation here. Please note we can only fetch a maximum of 50 tracks at a time, and can have a max of 1,000 records. I am very keen to see how the data relates to the Indian market, and hence I have extracted only tracks that relate to India. We need to mention the country code in the market parameter of the search query. For more details on the market code, please check here.
# Creating empty lists for track features
artist = []
track = []
track_popularity = []
artist_id = []
track_id = []
length = []

# Querying track features
for i in range(0, 1000, 50):
    # market = 'IN' -> IN is India
    tracks = sp.search(q='year:2022', type='track', market='IN', limit=50, offset=i)
    for j in tracks['tracks']['items']:
        artist.append(j['artists'][0]['name'])
        artist_id.append(j['artists'][0]['id'])
        track.append(j['name'])
        track_id.append(j['id'])
        track_popularity.append(j['popularity'])
        length.append(j['duration_ms'])
Convert the track features into a data frame:
# Store in a dataframe
df = pd.DataFrame({'Artist': artist, 'Track': track, 'Track_id': track_id, 'Track_popularity': track_popularity, 'Artist_id': artist_id, 'Length': length})
print(df.shape)
df

Now, we need to add the artist features for the tracks we extracted above. The API provides three artist features: artist popularity, genres and followers.
# Creating empty lists for artist features
artist_popularity = []
genres = []
followers = []

for i in artist_id:
    artist = sp.artist(i)
    artist_popularity.append(artist['popularity'])
    genres.append(artist['genres'])
    followers.append(artist['followers']['total'])

# Assigning the above features to dataframe 'df'
df = df.assign(Artist_Popularity=artist_popularity, Genres=genres, Followers=followers)
df

Last, we extracted the numerical audio features of the above tracks. It is quick and simple to get these numeric features. Please check the documentation for the feature details. It is interesting to see features like danceability, energy, loudness, speechiness, acousticness, liveness, valence and tempo.
# Creating an empty list
track_features = []

# Extracting the audio features by looping over the track ids and creating a new dataframe called "tfeatures"
for i in df['Track_id']:
    feature = sp.audio_features(i)
    track_features.append(feature)

tfeatures = pd.DataFrame(columns=['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature'])
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
for j in track_features:
    for k in j:
        tfeatures = pd.concat([tfeatures, pd.DataFrame([k])], ignore_index=True)
tfeatures.head()

We finished extracting the data from the API. The next step is to transform the data.
I did some small transformations, as below:
- Merged the track, artist and audio features into a single data frame.
- Dropped the unwanted columns.
- Capitalized the Artist and Track features; also changed some datatypes.
- Converted the duration column from milliseconds into minutes.
# Merging track and audio features and storing in 'df1'
df1 = pd.merge(df, tfeatures, how='inner', left_on='Track_id', right_on='id')

# Dropping the unwanted features
df1 = df1[['Artist', 'Track', 'Track_id', 'Track_popularity', 'Artist_id', 'Artist_Popularity', 'Genres', 'Followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']]

# Capitalizing the Artist and Track
df1['Artist'] = df1['Artist'].str.capitalize()
df1['Track'] = df1['Track'].str.capitalize()

# Changing the data types
df1['key'] = df1['key'].astype(int)
df1['mode'] = df1['mode'].astype(int)
df1['instrumentalness'] = df1['instrumentalness'].astype(float)
df1['duration_ms'] = df1['duration_ms'].astype(int)

# duration_ms is in milliseconds; convert into minutes
df1['duration_ms'] = df1.apply(lambda x: round(x['duration_ms'] / 60000, 2), axis=1)
I saved the data frame df1 into Azure Data Lake as a CSV file called spotify.csv.
We need to write our spotify.csv into the Data Lake, and it is very simple with one line of code. Please note you need write permission to save the data into a data lake. The path follows this pattern:
abfss://container_name@storage_account_name.dfs.core.windows.net/filepath
Here, my container name is spotify and my storage account name is storagegen2dstest.
# Saving spotify.csv into the data lake
df1.to_csv('abfss://spotify@storagegen2dstest.dfs.core.windows.net/spotify.csv', sep=',', encoding='utf-8', index=False)
We finished loading the data into the data lake. Now, we can start doing our analysis. You can directly begin the analysis without saving the data into a data lake; however, I wanted to keep it there as I want to connect it with Power BI later. Also, we can automate the above extraction, transformation and load, as the data changes every day in Spotify.
I wanted to find the answers to the below questions:
How many Indian songs are in the Top 10 list?
What are the top 10 least popular songs?
What are the top 10 genres? What is the share of songs in the Pop category?
What are the top 15 artists by followers? Are any Indian artists there?
Which song is the best to dance to?
Identify 5 artists with more than one song in the dataset.
Is there any relationship between the audio features?
Reading our CSV from the data lake:
df1 = pd.read_csv('abfss://spotify@storagegen2dstest.dfs.core.windows.net/spotify.csv')
How many Indian songs are in the Top 10 list?
# Sorting the dataframe by Track_popularity
df1.sort_values('Track_popularity', ascending=False, inplace=True)

# Taking the Track and Track_popularity and storing them in a separate dataframe called "top"
top = df1[['Track', 'Track_popularity']]
top = top.head(15)

# Building visuals
fig = px.bar(top, x='Track', y='Track_popularity', orientation='v', title='Top 10 most popular songs',
             labels={'Track': 'Track', 'Track_popularity': 'Track Popularity'})
fig.show()
Our data pertains to the Indian market, yet I don't see any Indian songs in the top 10 list.
# Sorting the dataframe by Track_popularity
df1.sort_values('Track_popularity', ascending=True, inplace=True)
top = df1[['Track', 'Track_popularity', 'Artist']]
top = top.head(10)

Interesting: there are no Indian songs in the 10 least popular list either.
# Taking Track, Track_id, Genres and storing them in a new dataframe called "genre"
genre = df1[['Track', 'Track_id', 'Genres']]
genre = genre.explode('Genres')
genre['Genres'] = genre['Genres'].str.capitalize()
genre = genre.groupby('Genres')['Track_id'].count()
genre = genre.to_frame().reset_index()
genre.sort_values('Track_id', ascending=False, inplace=True)
genre = genre.head(10)

# Building visuals
fig = px.pie(genre, values='Track_id', names='Genres', title='Top genres by total songs', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
Pop dominates the top 10 genres. Also, we can see a lot of Indian genres in the list, like Bollywood, desi hip hop, filmi, etc. Close to 25% of songs belong to the 'pop' category.
# Grouping by artist and finding the top 15 by followers
followers = df1[['Artist', 'Followers']].groupby('Artist').sum().sort_values('Followers', ascending=False).head(15).reset_index()

# Building visuals
fig = px.bar(followers, y='Followers', x='Artist', orientation='v', title='Top 15 Artists by Followers',
             labels={'Artist': 'Artist', 'Followers': 'Total Followers'})
fig.show()
I'm not sure whether an artist's followers are counted globally or per market. However, I can see two Indian artists, 'Anirudh Ravichander' and 'A.R. Rahman', in the top 15 list.
We can use the danceability feature to find which song is best to dance to. Danceability values range from 0 to 1: 0.0 is least danceable and 1.0 is most danceable.
# Finding the top songs best to dance to
dance = df1[['Track', 'danceability']].groupby('Track').mean().sort_values('danceability', ascending=False).head(15).reset_index()

# Building visuals
fig = px.bar(dance, x='Track', y='danceability', orientation='v', title='Which song is the best to dance to?',
             labels={'Track': 'Track', 'danceability': 'Danceability'})
fig.show()
An Indian song is on the list, but not in the top 5.
# Finding the top 5 artists with more than one song
number = df1[['Artist', 'Track_id']].groupby('Artist').count().sort_values('Track_id', ascending=False).head(5).reset_index()
number.rename(columns={'Track_id': 'Total Songs'}, inplace=True)
colors = ['#fae588', '#f79d65', '#f9dc5c', '#e8ac65', '#e76f51', '#ef233c', '#b7094c']  # color palette

# Building visuals
fig = px.treemap(number, path=['Artist'], values='Total Songs', title='Top 5 artists with more than one song', width=800, height=400)
fig.update_layout(treemapcolorway=colors,  # defines the colors in the treemap
                  margin=dict(t=50, l=25, r=25, b=25))
fig.show()
I can see one Indian artist (Anirudh Ravichander) in the top 5 list.
# Separating the audio features from the dataframe
features = df1[['danceability', 'energy', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]

# Finding the correlation
df = features.corr()

# Building visuals
fig = go.Figure()
fig.add_trace(go.Heatmap(x=df.columns, y=df.index, z=np.array(df), text=df.values, texttemplate='%{text:.2f}'))
fig.update_layout(width=800, height=800, title='Correlation Matrix for Track Features', coloraxis1=dict(colorscale='Plasma'))
fig.show()
This is the correlation matrix among the audio features. I don't see any high correlations among the features, and there are also some negative correlations.
The above results are dynamic, as the data in the Spotify API changes every day based on the users.
I have done the entire above analysis in an Azure Synapse notebook. As I mentioned, there is no option for embedding a Synapse notebook into Medium, and a picture of the notebook wouldn't be easy to read. Hence, I have just copied the code into this article.

This is an interesting dataset to tell a story with. I just connected the CSV file from the data lake to Power BI and built a simple report. You can check the interactive link here.

We can use other cloud platforms as well to execute the above project; I just used Azure as I work with it predominantly on a daily basis. We can use Python, MySQL and Power BI if it is an on-premises solution.
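As a rough sketch of that on-premises route (using Python's built-in SQLite here as a lightweight stand-in for MySQL — with MySQL you would swap the connection for a MySQL connector or SQLAlchemy engine; the table name and sample rows below are illustrative assumptions, not from the project):

```python
import sqlite3

import pandas as pd

# Illustrative stand-in for the extracted Spotify data
df1 = pd.DataFrame({
    'Artist': ['Artist one', 'Artist two'],
    'Track': ['Track x', 'Track y'],
    'Track_popularity': [90, 75],
})

# Write the dataframe into a local database table;
# Power BI Desktop can then connect to this database as a source
conn = sqlite3.connect('spotify.db')
df1.to_sql('spotify_tracks', conn, if_exists='replace', index=False)

# Quick sanity check that the load worked
check = pd.read_sql(
    'SELECT Track, Track_popularity FROM spotify_tracks ORDER BY Track_popularity DESC',
    conn)
print(check)
conn.close()
```

With a real MySQL instance, only the connection line changes; the rest of the ETL code stays the same.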
Thanks for reading.