Yumi's Blog

Decode the dial-up sounds using Spectrogram

In [1]:
from IPython.display import IFrame
src = "https://www.youtube.com/embed/gsNaR6FRuO0"
IFrame(src, width=990/2, height=800/2)
Out[1]:

The first 5 seconds of The youtube video above contains the Dual Tone Multi Frequency (DTMF) signals of keytones commonly heard on telephone dial pads. I hear 11 distinct keytones here in the first six seconds. The goal of this blog is to find the corresponding 11 digits in the dial pads using Spectogram.

This is the third blog post related to Discrete Fourier Transform (DFT). I have reviewed DFT's theory (See Review on Discrete Fourier Transform) and implemented Spectrogram from scratch in python (See Implement the Spectrogram from scratch in python ). So if you are interested in concep of DFT or Spectrogram, so please refer to the previous posts.

Import Youtube sound data

I will download this youtube video using Youtube to mp3 and save it with the file name "The Sound of dial-up Internet.mp3" at my current directory. First, let's listen to this mp3 to makes sure that we correctly downloaded data.

In [2]:
from IPython.display import Audio
dial_up_internet = "The Sound of dial-up Internet.mp3"
display(Audio(dial_up_internet))

While, I can use the Spectrogram module that I wrote from scratch in Implement the Spectrogram from scratch in python, it is not computationally optimized. So instead, I will use librosa and matplotlib.pyplot.specgram to calcualte and plot the Spectrogram.

The line below reads in the signal time series using librosa.

In [3]:
import librosa
fs_dial_up0, sample_rate = librosa.load(dial_up_internet)
print("MP3 from the Youtube")
print("sample_rate {}".format(sample_rate))
print("N of time points {}".format(len(fs_dial_up0)))
print("The length of time series {:3.2f} seconds".format(float(len(fs_dial_up0))/sample_rate))
MP3 from the Youtube
sample_rate 22050
N of time points 634176
The length of time series 28.76 seconds

I will only keep the first 6 seconds of this data as the 11 keytones are recorded within the first 6 seconds.

In [4]:
import numpy as np
fs_dial_up = np.array(fs_dial_up0)[:int(6*sample_rate)]

Plot the signals in time domain

The time domain plot shows 11 peaks, each probablly corresponds to a single keytone .

In [5]:
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 
Nxlim   = 10
ts_orig = np.arange(0,len(fs_dial_up),sample_rate)
ts_sec  = np.arange(0,len(ts_orig),1) 
plt.figure(figsize=(17,5))
plt.plot(fs_dial_up)
plt.xticks(ts_orig,ts_sec)
plt.xlabel("time (sec)")
plt.ylabel("Ampritude")
plt.title("The time domain plot")
plt.show()

Plot the signals in frequency domain

In [6]:
fig = plt.figure(figsize=(17,5))
Pxx, freqs, bins, im = plt.specgram(fs_dial_up,
                                    Fs=sample_rate,
                                    NFFT=1000, noverlap=20)
plt.colorbar()
plt.xlabel("time (sec)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.show()

The spectrogram above shows that each sounds consists of two frequencies. This makes sense because under Dual Tone Multi Frequency (DTMF), keytone of a single digit is designed to contain two distinct frequencies.

The following line shows the composition of the keytone based on two distinct frequencies of sin waves: For example, you can see that the keytone for digit 1 consists of sin wave of 697Hz and 1209Hz. If we can decode two distinct frequencies for each keytone, we can associate it to the correct dial pad number.

In [7]:
# dual tone multi frequency (DTMF) signaling
DTFT_dials = np.array(
    [[697,1209], # 1
     [697,1336], # 2
     [697,1477], # 3
     [770,1209], # 4  
     [770,1336], # 5
     [770,1477], # 6
     [852,1209], # 7
     [852,1336], # 8 
     [852,1477], # 9
     [941,1209], # *
     [941,1336], # 0
     [941,1477]] # #
)
telephone_keypad = ["{}".format(i) for i in range(1,10)] + ["*","0","#"]

for t,hzs in zip(telephone_keypad,DTFT_dials):
    print("digit={:}: ({}Hz,{}Hz)".format(t,*hzs))
digit=1: (697Hz,1209Hz)
digit=2: (697Hz,1336Hz)
digit=3: (697Hz,1477Hz)
digit=4: (770Hz,1209Hz)
digit=5: (770Hz,1336Hz)
digit=6: (770Hz,1477Hz)
digit=7: (852Hz,1209Hz)
digit=8: (852Hz,1336Hz)
digit=9: (852Hz,1477Hz)
digit=*: (941Hz,1209Hz)
digit=0: (941Hz,1336Hz)
digit=#: (941Hz,1477Hz)

Find the Dual-tone frequencies of each keytone via clustering on the spectrogram

I will look for the scaled FFT's magnitudes greater than -50 to hopefully extract the frequencies corresponding to 11 dial-up sounds. I will remove the frequencies lower than 500 because this frequency sound seems to be the starting flat dial tone (the sounds) not associated to any keytones.

In [8]:
ifreqs, ibins = np.where(np.log10(Pxx )*10 > -50)
bins_tone = bins[ibins]
freqs_tone = freqs[ifreqs]
pick  = freqs_tone > 500
X = np.array([bins_tone[pick],
              freqs_tone[pick]]).T

Currently, our data looks like this. A lot of points are sampled for each (frequency, time) group, and it is rather difficult to find the right pair to define keytones.

In [9]:
## The extracted point data looks 
for icoord in range(X.shape[0]):
    print("Time ={:4.2f}sec, Frequency={:5.2f}Hz".format(*X[icoord,:]))
Time =1.98sec, Frequency=683.55Hz
Time =2.02sec, Frequency=683.55Hz
Time =2.07sec, Frequency=683.55Hz
Time =2.78sec, Frequency=683.55Hz
Time =2.82sec, Frequency=683.55Hz
Time =2.87sec, Frequency=683.55Hz
Time =3.00sec, Frequency=683.55Hz
Time =3.04sec, Frequency=683.55Hz
Time =3.98sec, Frequency=683.55Hz
Time =4.02sec, Frequency=683.55Hz
Time =4.07sec, Frequency=683.55Hz
Time =1.98sec, Frequency=705.60Hz
Time =2.02sec, Frequency=705.60Hz
Time =2.07sec, Frequency=705.60Hz
Time =2.78sec, Frequency=705.60Hz
Time =2.82sec, Frequency=705.60Hz
Time =2.87sec, Frequency=705.60Hz
Time =3.00sec, Frequency=705.60Hz
Time =3.04sec, Frequency=705.60Hz
Time =3.98sec, Frequency=705.60Hz
Time =4.02sec, Frequency=705.60Hz
Time =4.07sec, Frequency=705.60Hz
Time =2.78sec, Frequency=727.65Hz
Time =2.20sec, Frequency=749.70Hz
Time =2.24sec, Frequency=749.70Hz
Time =3.18sec, Frequency=749.70Hz
Time =3.22sec, Frequency=749.70Hz
Time =3.27sec, Frequency=749.70Hz
Time =2.20sec, Frequency=771.75Hz
Time =2.24sec, Frequency=771.75Hz
Time =3.18sec, Frequency=771.75Hz
Time =3.22sec, Frequency=771.75Hz
Time =3.27sec, Frequency=771.75Hz
Time =2.20sec, Frequency=793.80Hz
Time =2.24sec, Frequency=793.80Hz
Time =3.18sec, Frequency=793.80Hz
Time =3.22sec, Frequency=793.80Hz
Time =3.27sec, Frequency=793.80Hz
Time =2.38sec, Frequency=837.90Hz
Time =2.42sec, Frequency=837.90Hz
Time =2.47sec, Frequency=837.90Hz
Time =2.38sec, Frequency=859.95Hz
Time =2.42sec, Frequency=859.95Hz
Time =2.47sec, Frequency=859.95Hz
Time =2.38sec, Frequency=882.00Hz
Time =2.60sec, Frequency=926.10Hz
Time =2.64sec, Frequency=926.10Hz
Time =3.40sec, Frequency=926.10Hz
Time =3.44sec, Frequency=926.10Hz
Time =3.58sec, Frequency=926.10Hz
Time =3.62sec, Frequency=926.10Hz
Time =3.67sec, Frequency=926.10Hz
Time =3.80sec, Frequency=926.10Hz
Time =3.84sec, Frequency=926.10Hz
Time =2.60sec, Frequency=948.15Hz
Time =2.64sec, Frequency=948.15Hz
Time =3.40sec, Frequency=948.15Hz
Time =3.44sec, Frequency=948.15Hz
Time =3.58sec, Frequency=948.15Hz
Time =3.62sec, Frequency=948.15Hz
Time =3.67sec, Frequency=948.15Hz
Time =3.80sec, Frequency=948.15Hz
Time =3.84sec, Frequency=948.15Hz
Time =3.58sec, Frequency=970.20Hz
Time =3.67sec, Frequency=970.20Hz
Time =1.98sec, Frequency=1190.70Hz
Time =2.02sec, Frequency=1190.70Hz
Time =2.07sec, Frequency=1190.70Hz
Time =2.38sec, Frequency=1190.70Hz
Time =2.42sec, Frequency=1190.70Hz
Time =2.47sec, Frequency=1190.70Hz
Time =3.18sec, Frequency=1190.70Hz
Time =3.22sec, Frequency=1190.70Hz
Time =3.27sec, Frequency=1190.70Hz
Time =1.98sec, Frequency=1212.75Hz
Time =2.02sec, Frequency=1212.75Hz
Time =2.07sec, Frequency=1212.75Hz
Time =2.38sec, Frequency=1212.75Hz
Time =2.42sec, Frequency=1212.75Hz
Time =2.47sec, Frequency=1212.75Hz
Time =3.18sec, Frequency=1212.75Hz
Time =3.22sec, Frequency=1212.75Hz
Time =3.27sec, Frequency=1212.75Hz
Time =1.98sec, Frequency=1234.80Hz
Time =2.02sec, Frequency=1234.80Hz
Time =2.07sec, Frequency=1234.80Hz
Time =2.38sec, Frequency=1234.80Hz
Time =2.42sec, Frequency=1234.80Hz
Time =2.47sec, Frequency=1234.80Hz
Time =3.18sec, Frequency=1234.80Hz
Time =3.22sec, Frequency=1234.80Hz
Time =3.27sec, Frequency=1234.80Hz
Time =2.78sec, Frequency=1300.95Hz
Time =2.87sec, Frequency=1300.95Hz
Time =3.58sec, Frequency=1300.95Hz
Time =3.67sec, Frequency=1300.95Hz
Time =2.20sec, Frequency=1323.00Hz
Time =2.24sec, Frequency=1323.00Hz
Time =2.60sec, Frequency=1323.00Hz
Time =2.64sec, Frequency=1323.00Hz
Time =2.78sec, Frequency=1323.00Hz
Time =2.82sec, Frequency=1323.00Hz
Time =2.87sec, Frequency=1323.00Hz
Time =3.40sec, Frequency=1323.00Hz
Time =3.44sec, Frequency=1323.00Hz
Time =3.58sec, Frequency=1323.00Hz
Time =3.62sec, Frequency=1323.00Hz
Time =3.67sec, Frequency=1323.00Hz
Time =3.80sec, Frequency=1323.00Hz
Time =3.84sec, Frequency=1323.00Hz
Time =2.20sec, Frequency=1345.05Hz
Time =2.24sec, Frequency=1345.05Hz
Time =2.60sec, Frequency=1345.05Hz
Time =2.64sec, Frequency=1345.05Hz
Time =2.78sec, Frequency=1345.05Hz
Time =2.82sec, Frequency=1345.05Hz
Time =2.87sec, Frequency=1345.05Hz
Time =3.40sec, Frequency=1345.05Hz
Time =3.44sec, Frequency=1345.05Hz
Time =3.58sec, Frequency=1345.05Hz
Time =3.62sec, Frequency=1345.05Hz
Time =3.67sec, Frequency=1345.05Hz
Time =3.80sec, Frequency=1345.05Hz
Time =3.84sec, Frequency=1345.05Hz
Time =2.78sec, Frequency=1367.10Hz
Time =2.87sec, Frequency=1367.10Hz
Time =3.58sec, Frequency=1367.10Hz
Time =3.67sec, Frequency=1367.10Hz
Time =3.00sec, Frequency=1455.30Hz
Time =3.04sec, Frequency=1455.30Hz
Time =3.98sec, Frequency=1455.30Hz
Time =4.02sec, Frequency=1455.30Hz
Time =4.07sec, Frequency=1455.30Hz
Time =3.00sec, Frequency=1477.35Hz
Time =3.04sec, Frequency=1477.35Hz
Time =3.98sec, Frequency=1477.35Hz
Time =4.02sec, Frequency=1477.35Hz
Time =4.07sec, Frequency=1477.35Hz
Time =3.00sec, Frequency=1499.40Hz
Time =3.04sec, Frequency=1499.40Hz
Time =3.98sec, Frequency=1499.40Hz
Time =4.02sec, Frequency=1499.40Hz
Time =4.07sec, Frequency=1499.40Hz

I want to have two frequency estimates for each keytone without going through the list of sounds as shown above. For this, I will conduct Kmeans clustering with cluster size 11 * 2 on the extracted (time,frequency) samples!

In [10]:
from sklearn.cluster import KMeans

Xcolstd = np.std(X,axis=0)
X_scale = X/Xcolstd

km = KMeans(n_clusters = 22)
km.fit(X_scale)
Out[10]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=22, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

Kmeans results

Kmeans is able to cluster the 22 observed frequencies.

In [11]:
y_pred = km.predict(X_scale)

plt.scatter(X[:,0],X[:,1],c=y_pred)
plt.xlabel("Time (sec)")
plt.ylabel("Frequencies (Hz)")
plt.show()

I will use the cluster centers along y-axis as the identified frequencies of each clusters.

In [12]:
# rescale the cluster center
ccenter = km.cluster_centers_*Xcolstd
ccenter = np.array([np.round(ccenter[:,0],2),
                    np.round(ccenter[:,1],0)]).T
## reorder according to the time (sec)
ccenter = ccenter[np.argsort(ccenter[:,0]),:]
for icenter in range(ccenter.shape[0]):
    print("{:5.2f} sec {:7.0f}Hz".format(ccenter[icenter][0],
                                         ccenter[icenter][1]))
 2.02 sec    1213Hz
 2.02 sec     695Hz
 2.22 sec     772Hz
 2.22 sec    1334Hz
 2.42 sec    1213Hz
 2.42 sec     854Hz
 2.62 sec    1334Hz
 2.62 sec     937Hz
 2.82 sec     699Hz
 2.82 sec    1334Hz
 3.02 sec     695Hz
 3.02 sec    1477Hz
 3.22 sec    1213Hz
 3.22 sec     772Hz
 3.42 sec     937Hz
 3.42 sec    1334Hz
 3.62 sec    1334Hz
 3.62 sec     945Hz
 3.82 sec     937Hz
 3.82 sec    1334Hz
 4.02 sec     695Hz
 4.02 sec    1477Hz

Reshape and reorder the matrix so that

  • each row corresponds to one keytone
  • columns are ordered so that the lower frequency appears in the 0th column and higher in the 1st column.
In [13]:
ccenter_xy0 = np.array([ccenter[1::2,1],
                       ccenter[::2,1]]).T

ccenter_xy = []
for icenter in range(ccenter_xy0.shape[0]):
    ccenter_xy.append(np.sort(ccenter_xy0[icenter]))
ccenter_xy = np.array(ccenter_xy)
ccenter_xy
Out[13]:
array([[ 695., 1213.],
       [ 772., 1334.],
       [ 854., 1213.],
       [ 937., 1334.],
       [ 699., 1334.],
       [ 695., 1477.],
       [ 772., 1213.],
       [ 937., 1334.],
       [ 945., 1334.],
       [ 937., 1334.],
       [ 695., 1477.]])

Finally report the estimated dials for the 11 keytone sounds.

In [14]:
def distDTFT(DTFT_dials, cent):
    return(np.sum((DTFT_dials - cent )**2,axis=1))
    
freq = []
for icenter in range(ccenter_xy.shape[0]):
    cent = ccenter_xy[icenter]
    dist2center = distDTFT(DTFT_dials, cent)
    i = np.argmin(dist2center)
    freq.append(telephone_keypad[i])
    
for i, f in enumerate(freq,1):
    print("{:02} dial pad button={}".format(i,f))
01 dial pad button=1
02 dial pad button=5
03 dial pad button=7
04 dial pad button=0
05 dial pad button=2
06 dial pad button=3
07 dial pad button=4
08 dial pad button=0
09 dial pad button=0
10 dial pad button=0
11 dial pad button=3

You can use Online Tone Generator to check if this sequence of number sounds the same as the youtube video.

The two keytone sound similar to me!

Comments