# Implement the Spectrogram from scratch in python

Spectrogram is an awesome tool to analyze the properties of signals that evolve over time. There are lots of Spect4ogram modules available in python e.g. matplotlib.pyplot.specgram. Users need to specify parameters such as "window size", "the number of time points to overlap" and "sampling rates". So without correct understanding of Spectrogram, I believe that it is difficult to utilize these existing modules.

In this blog post, I will implement Spectrogram from scratch so that its implementation is cristal clear. This blog post assumes that the audience understand Discrete Fourier Transform (DFT). If you do not understand it, I kindly ask you to read my previous blog post Review on Discrete Fourier Transform.

Among all the youtube tutorial about Spectrogram, I found The Short Time Fourier Transform | Digital Signal Processing the most useful. This blog post also try to reproduce the plots shown in this video.

## Generate synthetic data¶

First, I will simulate 7-second analog dial sound signal.

The first 3 seconds is the digit 1 sound, the next 2 seconds is the slience, and the last 3 seconds i the digit 2 second.

According to The Short Time Fourier Transform | Digital Signal Processing every analog telephone buttom in dial pad generates 2 sine waves.
For example, pressing digit 1 buttom generates the sin waves at frequency 697Hz and 1209Hz. The frequency 697Hz means that the sin wave repeats its fulle cycle 697 times within a second. Two sin waves at two different frequencies mean that the signal is sum of these two waves.

I will assume that sampling rate of 4000, meaning that every second, 4000 sample points are taken. This means that there are 28,000 sample points in total.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

def get_signal_Hz(Hz,sample_rate,length_ts_sec):
## 1 sec length time series with sampling rate
ts1sec = list(np.linspace(0,np.pi*2*Hz,sample_rate))
## 1 sec length time series with sampling rate
ts = ts1sec*length_ts_sec
return(list(np.sin(ts)))

sample_rate   = 4000
length_ts_sec = 3
## --------------------------------- ##
## 3 seconds of "digit 1" sound
## Pressing digit 2 buttom generates
## the sine waves at frequency
## 697Hz and 1209Hz.
## --------------------------------- ##
ts1  = np.array(get_signal_Hz(697, sample_rate,length_ts_sec))
ts1 += np.array(get_signal_Hz(1209,sample_rate,length_ts_sec))
ts1  = list(ts1)

## -------------------- ##
## 2 seconds of silence
## -------------------- ##
ts_silence = [0]*sample_rate*1

## --------------------------------- ##
## 3 seconds of "digit 2" sounds
## Pressing digit 2 buttom generates
## the sine waves at frequency
## 697Hz and 1336Hz.
## --------------------------------- ##
ts2  = np.array(get_signal_Hz(697, sample_rate,length_ts_sec))
ts2 += np.array(get_signal_Hz(1336,sample_rate,length_ts_sec))
ts2  = list(ts2)

## -------------------- ##
## Add up to 7 seconds
## ------------------- ##
ts = ts1 + ts_silence  + ts2


## Let's listen to the generated sound signal.¶

Great. The sounds of telephone dialing in good old days.

In [2]:
from IPython.display import Audio
Audio(ts, rate=sample_rate)

Out[2]:

## Plot the generated sound signal in time domain.¶

The plot shows when the two digit sounds start and end. You can also see the magnitude of the sounds from amplitudes. However, this time representation of the signal hides frequency infomation, meaning that you cannot tell which digits are pressed or which frequency waves create this noise pattern.

In [3]:
total_ts_sec = len(ts)/sample_rate
print("The total time series length = {} sec (N points = {}) ".format(total_ts_sec, len(ts)))
plt.figure(figsize=(20,3))
plt.plot(ts)
plt.xticks(np.arange(0,len(ts),sample_rate),
np.arange(0,len(ts)/sample_rate,1))
plt.ylabel("Amplitude")
plt.xlabel("Time (second)")
plt.title("The total length of time series = {} sec, sample_rate = {}".format(len(ts)/sample_rate, sample_rate))
plt.show()

The total time series length = 7.0 sec (N points = 28000)


## Plot the generated sound signal in frequency domain.¶

### Discrete Fourier Transform Functions¶

These DTF functions are previously defined in Review on Discrete Fourier Transform.

Notice that get_xns only calculate the Fourier coefficients up to the Nyquest limit. Also the absolute value of each Fourier coefficient is doubled to account for the symmetry of the Fourier coefficients around the Nyquest Limit.

It is impossible to calculate the frequencies above Nyquest Limit = sample rate / 2

The discussion of the Nyquest Limit can be found in my previous blog Review on Discrete Fourier Transform. Section "Interpreting Fourier coefficients in Hertz".

Please also see this video that discusses Nyquest limit.

In [4]:
def get_xn(Xs,n):
'''
calculate the Fourier coefficient X_n of
Discrete Fourier Transform (DFT)
'''
L  = len(Xs)
ks = np.arange(0,L,1)
xn = np.sum(Xs*np.exp((1j*2*np.pi*ks*n)/L))/L
return(xn)

def get_xns(ts):
'''
Compute Fourier coefficients only up to the Nyquest Limit Xn, n=1,...,L/2
and multiply the absolute value of the Fourier coefficients by 2,
to account for the symetry of the Fourier coefficients above the Nyquest Limit.
'''
mag = []
L = len(ts)
for n in range(int(L/2)): # Nyquest Limit
mag.append(np.abs(get_xn(ts,n))*2)
return(mag)
mag = get_xns(ts)


### DFT on entire dataset to visualize the signals at frequency domain for all k=1,...L.¶

The aboslute values of Fourier coefficients $X_k$ are calculated for all $k=1,\cdots,L/2$. The frequency plot shows three peaks, which probablly corresponds to 697Hz, 1209Hz and 1336Hz. However, because the units of x axis is not Hz, we cannot tell if they are really corresponding to these frequencies. So Let's fix this.

In [5]:
# the number of points to label along xaxis
Nxlim = 10

plt.figure(figsize=(20,3))
plt.plot(mag)
plt.xlabel("Frequency (k)")
plt.title("Two-sided frequency plot")
plt.ylabel("|Fourier Coefficient|")
plt.show()


As disucssed in Review on Discrete Fourier Transform Section "Interpreting Fourier coefficients in Hertz", $X_k$, Fourier coefficient's frequency ($i.e. k$ in $x_k$) can be translated to frequency in herts as:

$$\frac{\textrm{Sample Rate *} k}{\textrm{Total N of Sample Points} }\;\;\;\; \textrm{(Hz)}$$

Let's create a frequency domain from with xaxis unit in Hz. As expected, the three peaks are at 697Hz, 1209Hz and 1336Hz.

In [6]:
def get_Hz_scale_vec(ks,sample_rate,Npoints):
freq_Hz = ks*sample_rate/Npoints
freq_Hz  = [int(i) for i in freq_Hz ]
return(freq_Hz )

ks   = np.linspace(0,len(mag),Nxlim)
ksHz = get_Hz_scale_vec(ks,sample_rate,len(ts))

plt.figure(figsize=(20,3))
plt.plot(mag)
plt.xticks(ks,ksHz)
plt.title("Frequency Domain")
plt.xlabel("Frequency (Hz)")
plt.ylabel("|Fourier Coefficient|")
plt.show()


# Create Spectrogram¶

While frequency domain plot shows the the signal's underlying frequency in Hz, this plot does not contain temporal infomation. For example, we cannot tell when the sin wave at 697Hz is observed within the 7 seconds.

If we perform DFT on subwindow of the original time series and slide down the subwindow across the signal, then we can obtain the time-dependent Fourier coefficients. This is essentially the short term DFT (SDFT).

Spectrogram is a clever way to visualize the time-varing frequency infomation created by SDFT. most python modules for spectrogram requires users to specify the following two parameters. For example, matplotlib.pyplot.specgram) requires the following three parameters:

• NFFT: The number of data points used in each block for the DFT.　(FFT is part of the name probablly because Fast Fourier Transform is used internaly in matplotlib.pyplot.specgram) rather than DFT).
• Fs: the number of points sampled per second, so called sample_rate
• noverlap: The number of points of overlap between blocks.

Finally the absolute values of Spectrogram is rescaled to have better color code the magnitudes.
$$10\log_{10}(|X_k|)$$

## SDFT with window size L for Spectogram¶

In [7]:
def create_spectrogram(ts,NFFT,noverlap = None):
'''
ts: original time series
NFFT: The number of data points used in each block for the DFT.
Fs: the number of points sampled per second, so called sample_rate
noverlap: The number of points of overlap between blocks. The default value is 128.
'''
if noverlap is None:
noverlap = NFFT/2
noverlap = int(noverlap)
starts  = np.arange(0,len(ts),NFFT-noverlap,dtype=int)
# remove any window with less than NFFT sample size
starts  = starts[starts + NFFT < len(ts)]
xns = []
for start in starts:
# short term discrete fourier transform
ts_window = get_xns(ts[start:start + NFFT])
xns.append(ts_window)
specX = np.array(xns).T
# rescale the absolute value of the spectrogram as rescaling is standard
spec = 10*np.log10(specX)
assert spec.shape[1] == len(starts)
return(starts,spec)

L = 256
noverlap = 84
starts, spec = create_spectrogram(ts,L,noverlap = noverlap )


Let's plot the hand-made spectrogram! You can clearly see that the first 3 seconds contains 697Hz and 1209Hz frequencies, followed by 2 seconds of slience and then the last 3 seconds contains the 693Hz and 1336Hz waves. This is great.

In [8]:
def plot_spectrogram(spec,ks,sample_rate, L, starts, mappable = None):
plt.figure(figsize=(20,8))
plt_spec = plt.imshow(spec,origin='lower')

## create ylim
Nyticks = 10
ks      = np.linspace(0,spec.shape[0],Nyticks)
ksHz    = get_Hz_scale_vec(ks,sample_rate,len(ts))
plt.yticks(ks,ksHz)
plt.ylabel("Frequency (Hz)")

## create xlim
Nxticks = 10
ts_spec = np.linspace(0,spec.shape[1],Nxticks)
ts_spec_sec  = ["{:4.2f}".format(i) for i in np.linspace(0,total_ts_sec*starts[-1]/len(ts),Nxticks)]
plt.xticks(ts_spec,ts_spec_sec)
plt.xlabel("Time (sec)")

plt.title("Spectrogram L={} Spectrogram.shape={}".format(L,spec.shape))
plt.colorbar(mappable,use_gridspec=True)
plt.show()
return(plt_spec)
plot_spectrogram(spec,ks,sample_rate,L, starts)

Out[8]:

## Wideband spectrogram vs narrowband spectrogram¶

Finally I will discuss the uncertainty principle in spectrogram.

**Uncertainty principle** We cannot arbitrarily narrow our focus both in time and in frequency. If we want higher time resolusion, we need to give up frequency resolusion and vise verse.
This part of discussion is associated to the second half of my favorite youtube spectrogram video by Free Engineering Lectures.

In the previous spectrogram, the the window size was set to 256, and sampling rate was 4000. Therefore, each window contains

$$\textrm{time resolusion}:\;\;\;\frac{\textrm{window size}}{\textrm{sample rate}} = \frac{256}{4000}= 0.064\; \textrm{ second}$$

The frequency resoulsion has the reciprocal relation and it is: $$\textrm{freq resolusion}:\;\;\;\frac{\textrm{sample rate}}{\textrm{window size}} = \frac{4000}{256}= 15.6\; \textrm{ Hz}$$

The following plots show the compromize between frequency resoulsion and the time resolsions. You can see that when Spectrogram is narrowband (i.e., higher window size) then the frequency is more precise while when Spectrogram is widerband, temporal infomation is more clear.

In [9]:
plt_spec1 = None
for iL, (L, bandnm) in enumerate(zip([150, 200, 400],["wideband","middleband","narrowband"])):
print("{:20} time resoulsion={:4.2f}sec, frequency resoulsion={:4.2f}Hz".format(bandnm,L/sample_rate,sample_rate/L))
starts, spec = create_spectrogram(ts,L,noverlap = 1 )
plt_spec = plot_spectrogram(spec,ks,sample_rate, L, starts,
mappable = plt_spec1)
if iL == 0:
plt_spec1 = plt_spec

wideband             time resoulsion=0.04sec, frequency resoulsion=26.67Hz

middleband           time resoulsion=0.05sec, frequency resoulsion=20.00Hz

narrowband           time resoulsion=0.10sec, frequency resoulsion=10.00Hz