I noticed that tensorflow does not have functions to compute Mahalanobis distance between two groups of samples. So here I go and provide the code with explanation. The blog is organized and explain the following topics
- Euclidean distance with Scipy
- Euclidean distance with Tensorflow v2
- Mahalanobis distance with Scipy
- Mahalanobis distance with Tensorflow v2
import tensorflow as tf
print("tensorflow:",tf.__version__)
Euclidean distance with Spicy¶
Here is Scipy version of calculating the Euclidean distance between two group of samples:
$$ \boldsymbol{a}, R^{\textrm{M1 x n_feat}} \boldsymbol{b} \in R^{\textrm{M2 x n_feat}} $$
At the end we want a distance matrix of size
$$ npeuc \in R^{M1 x M2} $$
In the following code, there is a double loop and distance is calculated for each sample pair one by one. This is not very efficient!
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import distance
M1, M2, n_feat = 5, 4, 2
# Scipy calculation
a = np.random.rand(M1, n_feat).astype(np.float32)
b = np.random.rand(M2, n_feat).astype(np.float32)
print("a")
print(a)
print("b")
print(b)
npeuc = [[0 for _ in range(b.shape[0])]
for _ in range(a.shape[0])]
for ib in range(b.shape[0]):
for ia in range(a.shape[0]):
npeuc[ia][ib] = distance.euclidean(a[ia],b[ib])
npeuc = np.array(npeuc)
print("Euclidean distances with scipy")
print(npeuc.shape)
Euclidean distance with Tenworflow v2¶
def Euclidean(A,B):
v = tf.expand_dims(tf.reduce_sum(tf.square(A), 1), 1)
p1 = tf.reshape(tf.reduce_sum(v,axis=1),(-1,1))
v = tf.reshape(tf.reduce_sum(tf.square(B), 1), shape=[-1, 1])
p2 = tf.transpose(tf.reshape(tf.reduce_sum(v,axis=1),(-1,1)))
res = tf.sqrt(tf.add(p1, p2) - 2 * tf.matmul(A, B, transpose_b=True))
return(res)
A = tf.Variable(a)
B = tf.Variable(b)
print("A")
print(A)
print("B")
print(B)
res = Euclidean(A,B)
tfeuc = res.numpy()
print("Euclidean distances with tensorflow")
tfeuc
#with tf.Session() as sess: tensorflow v2!!
# print(sess.run(res))
Let's compare the aggreement between the two Euclidean distances:
print("average difference in the np calculation and tf calculations are:",np.mean(np.abs(tfeuc - npeuc)))
Maharanobis distance with Scipy¶
Now we are ready to calculate Mahalanobis distance. As before, let's start with Scipy version.
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import distance
M1, M2, n_feat = 5, 4, 2
# Scipy calculation
a = np.random.rand(M1, n_feat).astype(np.float32)
b = np.random.rand(M2, n_feat).astype(np.float32)
# inverse of the covariance matrix
S = np.random.rand(n_feat,n_feat).astype(np.float32)
S = np.matmul(S,np.transpose(S))
print("a")
print(a)
print("b")
print(b)
npmah = [[0 for _ in range(b.shape[0])]
for _ in range(a.shape[0])]
for ib in range(b.shape[0]):
for ia in range(a.shape[0]):
npmah[ia][ib] = distance.mahalanobis(a[ia],b[ib],S)
npmah = np.array(npmah)
print("Maharanobis distances with scipy")
npmah
Mahalanobis distance with tensorflow¶
The mahalanobis distances of two samples $\boldsymbol{x}$ and $\boldsymbol{y}$ $\in R^{Nfeat}$ with covariance matrix $\Sigma \in R^{\textrm{N_feat x N_feat}}$ are defined as:
\begin{array}{rl} (\boldsymbol{x}-\boldsymbol{y})^T \Sigma^{-1}(\boldsymbol{x}-\boldsymbol{y}) \in R^1 \end{array}
My calcualtion of Mahalanobis distance utilizes the calculation of Euclidean distance described above. How? Let me show you!
The key is to use Cholesky decomposition. Cholesky decomposition allows decomposing real symmetric matrix $A$ into two matrix $A=LL^T$. Fortunately covariance matrix $\Sigma$ is symmetric matrix so we can use this decomposition and get a matrix $\boldsymbol{C}$
\begin{array}{rl} \Sigma^{-1} = \boldsymbol{C}^T\boldsymbol{C}, \end{array} where $\boldsymbol{C}\in R^{\textrm{N_feat x N_feat}}$
Then consider standardized vectors:
- $\boldsymbol{x}^*=\boldsymbol{C} \boldsymbol{x}\;\;\in R^{Nfeat}$
- $\boldsymbol{y}^*=\boldsymbol{C} \boldsymbol{y}$
Their Euclidean distances are:
\begin{array}{rl} ||\boldsymbol{y}^* - \boldsymbol{x}^* || &= (\boldsymbol{y}^* - \boldsymbol{x}^* )^T(\boldsymbol{y}^* - \boldsymbol{x}^* )\\ &= (\boldsymbol{C}\boldsymbol{y} - \boldsymbol{C}\boldsymbol{x})^T(\boldsymbol{C}\boldsymbol{y}- \boldsymbol{C}\boldsymbol{x} )\\ &= (\boldsymbol{y} - \boldsymbol{x})^T\boldsymbol{C}^T\boldsymbol{C}(\boldsymbol{y}- \boldsymbol{x} ) &= (\boldsymbol{y} - \boldsymbol{x})^T\Sigma^{-1}(\boldsymbol{y}- \boldsymbol{x} ) \end{array}
Notice that the Euclidean distance between $\boldsymbol{x}^*$ and $\boldsymbol{y}^*$ is Mahalanobis distance between $\boldsymbol{x}$ and $\boldsymbol{y}$.
Using this idea, we calculate the Mahalanobis distances.
def EfficientMaharanobis(A,B,invS):
'''
A : tensor, N sample1 by N feat
B : tensor, N sample2 by N feat
S : tensor, N feat by N feat
Output:
marahanobis distance of each pair (A[i],B[j]) with inv variance S
'''
S_half = tf.linalg.cholesky(invS)
A_star = tf.matmul(A,S_half)
B_star = tf.matmul(B,S_half)
res = Euclidean(A_star,B_star)
return(res)
# TF calculation
A = tf.Variable(a)
B = tf.Variable(b)
S = tf.Variable(S)
print("A")
print(A)
print("B")
print(B)
print("S")
print(S)
tfmah = EfficientMaharanobis(A,B,S)
print("Maharanobis distances with tensorflow v2")
tfmah.numpy()
print("average difference in the np calculation and tf calculations are:",np.mean(np.abs(tfmah - npmah)))