Introduction¶
How many samples do you need to estimate the proportion of fake accounts in a social network?
Obviously, if humans manually checked every single account in the social network, we could get the actual proportion of fake accounts. But this would be far too expensive and time consuming, especially since social networks nowadays contain billions of active accounts: Facebook has about 2 billion users, and LinkedIn has about 0.5 billion.
Instead, a more practical approach is to sample just enough accounts to make an accurate estimate.
This will be a long blog post. It will cover:
- sample size calculation
- concept of confidence
- finite sample correction
Sample size calculation¶
Let's call the sample size $n$, and let $x_i = 1$ if the $i$th sampled account is fake and $x_i = 0$ otherwise. Then the point estimate of the proportion of fake accounts is:
$$ \hat{p}_n = \frac{\sum_{k=1}^n x_k}{n} $$
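As a minimal sketch of what this estimator looks like in code (the 3% true proportion and the sample size here are made-up values, purely for illustration):

import numpy as np

rng = np.random.default_rng(seed=0)

p_true = 0.03   # hypothetical true proportion of fake accounts (illustration only)
n = 10_000      # sample size

# x[i] = 1 if the i-th sampled account is fake, 0 otherwise
x = rng.binomial(1, p_true, size=n)

# point estimate: the sample mean of the 0/1 labels
p_hat = x.sum() / n
print("point estimate:", p_hat)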
Error rate of 0.01%¶
The question is "how large $n$ should be to have an acceptable accuracy?" Suppose that this social network is really large, like Facebook and it has 1 billion users. Then 1% error in prediction means that 10 million accounts are wrongly classified as fake, which is pretty large. Suppose that I can only accept the error upto 100,000 accounts, then the error has to be within 0.01%. Let's use this 0.01% as an acceptable error rate for now.
Confidence level of 99.9%¶
In practice, I only sample once, and this sample could be a good one, in the sense that it is a good representation of the population, or a bad one: in the worst case, all the sampled accounts could happen to be fake! We obviously want to avoid such a badly biased sample by increasing the sample size. If the sample size is large enough, then the sample is a good representation of the population; the extreme case is when sample = population, where the sample is a perfect representation. How sure we want to be of drawing a good sample is controlled by the confidence level.
Although I only sample once in practice, this is just "a" sample. Every time you randomly sample from the population, the sample is different, and hence the estimate is (unfortunately) also different. Ideally we want every sample's estimate to stay within the acceptable error rate (0.01%, defined in the previous section). But there is always a chance of error, unless the population proportion is exactly 0 or 1 (no account is fake, or every account is). Let's set our confidence level to 99.9% for now, to ensure that 99.9% of the time the point estimate stays within the acceptable error range.
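To see this sample-to-sample variation concretely, here is a small simulation (again with a made-up true proportion of 3%); with a sample of only 10,000 accounts, almost no estimate lands within the 0.01% tolerance:

import numpy as np

rng = np.random.default_rng(seed=1)

p_true = 0.03    # hypothetical ground truth, for illustration only
n = 10_000       # deliberately small, so the variation is visible
n_trials = 1_000

# each trial draws a fresh sample and records its point estimate
estimates = rng.binomial(n, p_true, size=n_trials) / n

print("smallest estimate:", estimates.min())
print("largest estimate: ", estimates.max())
print("fraction within 0.01% of p:", np.mean(np.abs(estimates - p_true) <= 0.0001))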
Formulating the business problem as an optimization statement with probabilities¶
We can now formulate the problem in probability statements.
We want the smallest $n$ such that $$ Pr\Big(|\hat{p}_n - p| \le 0.01\%\Big) > 0.999 $$ where $p$ is the ground truth proportion of fake accounts.
$$ Pr\Big(-0.0001\le \hat{p}_n - p \le 0.0001\Big) > 0.999 $$ Let $SD(\hat{p}_n)$ be the standard deviation of the estimate $\hat{p}_n$, which describes how much the point estimate varies from sample to sample in theory. This quantity obviously depends on the sample size. We can rewrite the statement above as: $$ Pr\Big(-\frac{0.0001}{SD(\hat{p}_n)}\le \frac{\hat{p}_n - p}{SD(\hat{p}_n)} \le \frac{0.0001}{SD(\hat{p}_n)}\Big) > 0.999 $$ Notice that $$ Z = \frac{\hat{p}_n - p}{SD(\hat{p}_n)} $$ approximately follows the standard normal distribution $N(0,1)$ for large enough $n$ by the central limit theorem, and if we choose the 0.0005 quantile ($=(1- 0.999)/2$) and the 0.9995 quantile ($=1 - 0.0005$) of the standard normal distribution as the lower and upper bounds of the inequality, the statement holds. Since the normal distribution is symmetric around 0, the 0.9995 quantile is simply the negation of the 0.0005 quantile. $$ Pr\Big(-\frac{0.0001}{SD(\hat{p}_n)}\le Z \le \frac{0.0001}{SD(\hat{p}_n)}\Big) > 0.999 $$
$$ Pr\Big(q_{0.0005}\le Z \le - q_{0.0005}\Big) > 0.999 $$
We can find the value of $q_{0.0005}$ using Python as below: $q_{0.0005} \approx -3.2905$.

from scipy.stats import norm

quantile = 0.0005
# 0.0005 quantile of the standard normal distribution
print("{} quantile is".format(quantile), norm(0, 1).ppf(quantile))
This gives us the equality: \begin{array}{r} q_{0.0005 } = -3.290=-\frac{0.0001}{SD(\hat{p}_n)} \end{array}
To solve this equality for $n$, we need to know the form of $SD(\hat{p}_n)$. We first show the simple approach under the assumption of an infinite population size (<-- False!), and then show its shortcoming.
The form of $SD(\hat{p}_n)$¶
Let's consider its squared form, variance: \begin{array}{rl} Var\Big(\hat{p}_n\Big) &= Var\Big(\frac{\sum_{k=1}^n X_k}{n}\Big) \\ &= \frac{1}{n^2}\sum_{k=1}^nVar( X_k) \;\; (\textrm{for now, let's assume the independence of each account})\\ &= \frac{1}{n}Var( X_i)\\ \end{array}
$X_k$ is a Bernoulli random variable with success probability $p$, so: $Var( X_k) = E(X_k^2) - E(X_k)^2 = p - p^2 = p(1-p)$
\begin{array}{rl} Var\Big(\hat{p}_n\Big) &= \frac{1}{n}p(1-p)\\ \end{array} Therefore, the estimate of $Var\Big(\hat{p}_n\Big)$ is $$ \hat{Var}\Big(\hat{p}_n\Big) = \frac{1}{n}\hat{p}_n(1-\hat{p}_n)\\ $$ and its standard deviation is: $$ \hat{SD}\Big(\hat{p}_n\Big) = \sqrt{\frac{1}{n}\hat{p}_n(1-\hat{p}_n)}\\ $$ Unfortunately, we do not know the value of this term without actually drawing a sample and computing $\hat{p}_n$. The largest (hence most conservative) value of $\hat{SD}$ is attained at $\hat{p}_n=0.5$, so we substitute this value and get $\hat{SD}\Big(\hat{p}_n\Big) = \sqrt{\frac{0.5^2}{n}}=\frac{0.5}{\sqrt{n}}$
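A quick numerical check (with an arbitrary $n$) confirms that the standard deviation peaks at $\hat{p}_n = 0.5$, so substituting 0.5 can only overestimate, never underestimate, the required sample size:

import numpy as np

n = 10_000  # arbitrary sample size, for illustration

# sqrt(p(1-p)/n) over a grid of proportions: it is largest at p = 0.5
for p in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print("p = {:.2f}  SD = {:.6f}".format(p, np.sqrt(p * (1 - p) / n)))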
Now we are back to the equality: \begin{array}{ll} q_{0.0005} &=-\frac{0.0001}{SD(\hat{p}_n)}\\ q_{0.0005} &=-\frac{0.0001}{\frac{0.5}{\sqrt{n}}}\\ q_{0.0005} &=-\frac{0.0001 \sqrt{n} }{0.5}\\ \end{array} We solve this equality for $n$ as:
\begin{array}{ll} -q_{0.0005} \frac{0.5}{0.0001}&= \sqrt{n} \\ \end{array}
Finally, substitute $q_{0.0005}=−3.290$ and we get:
\begin{array}{ll} n&=\Big( \frac{0.5 \times 3.290}{ 0.0001 }\Big)^2 = 270,602,500\\ \end{array}
WOOO, to get an acceptable estimation accuracy, we need THIS MANY samples.
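As a sanity check of the arithmetic, using scipy's exact quantile instead of the rounded $-3.290$ (which gives roughly 270.7 million, essentially the same):

from scipy.stats import norm

delta = 0.0001              # acceptable error rate
q = norm(0, 1).ppf(0.0005)  # about -3.2905
n = (0.5 * q / delta)**2
print("required sample size:", n)  # about 2.707e8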
Wait a second, what would happen if we want the acceptable error rate to be 0.001% instead of 0.01%? This simply changes the denominator in the formula above from 0.0001 to 0.00001, which gives a sample size of 27 billion!! This is larger than the Facebook population size (larger than the population of Earth, in fact). So where did my calculation go wrong?
\begin{array}{ll} n&=\Big( \frac{0.5 \times 3.290}{ 0.00001 }\Big)^2 = 27,060,250,000\\ \end{array}
The form of $SD(\hat{p}_n)$, finite sample¶
The secret lies in the calculation of $SD(\hat{p}_n)$, where we made the independence assumption: $$ Var\Big(\hat{p}_n\Big) = Var\Big(\frac{\sum_{k=1}^n X_k}{n}\Big) = \frac{1}{n^2}\sum_{k=1}^nVar( X_k) $$
$X_1,\cdots,X_n$ are not quite independent when we sample without replacement from a finite population: sampling account $i$ decreases the chance of sampling account $k$ ($k\ne i$), and this matters when the sample size is large relative to the population.
To consider this finite sample scenario, let's reconsider this variance.
Define the random variable $$ Z_i = 1 \;\;\textrm{if unit $i$ is sampled},\;\; Z_i = 0 \;\;\textrm{otherwise}. $$
Denoting the total population size by $N$, the sample average can be written as $$ \sum_{k=1}^n \frac{X_k}{n} = \sum_{i=1}^N Z_i \frac{x_i}{n} $$ In this representation, the population values $x_i$ are fixed quantities and the $Z_i$ are random. The chance of $Z_i=1$ is the chance of unit $i$ being sampled: $Pr(Z_i = 1) = n/N$.
The variance of $Z_i$ is $$ Var(Z_i) = E(Z_i^2) - E(Z_i)^2 = \frac{n}{N} - \left(\frac{n}{N}\right)^2 = \frac{n}{N}\left(1-\frac{n}{N} \right) $$
To calculate $Cov(Z_i,Z_j)$ for $i \ne j$, I first need to calculate $E(Z_iZ_j)$:
\begin{array}{rl} E(Z_i Z_j) &= 1 \times 1 \times Pr(Z_i=1,Z_j=1) + 1 \times 0 \times Pr(Z_i=1,Z_j=0) + 0 \times 1 \times Pr(Z_i=0,Z_j=1) + 0 \times 0 \times Pr(Z_i=0,Z_j=0)\\ &=Pr(Z_i=1,Z_j=1)\\ &=Pr(Z_i=1|Z_j=1)Pr(Z_j=1)\\ &=\frac{n-1}{N-1} \frac{n}{N} \end{array}
Now, calculate $Cov(Z_i,Z_j)$:
\begin{array}{rl} Cov(Z_i,Z_j) &= E(Z_i Z_j) - E(Z_i)E(Z_j)\\ &= \frac{n-1}{N-1} \frac{n}{N} - \left( \frac{n}{N}\right)^2\\ &= \frac{n}{N} \left( \frac{n-1}{N-1} - \frac{n}{N}\right)\\ &= \frac{n}{N} \left( \frac{N(n-1)-n(N-1)}{(N-1)N} \right)\\ &= \frac{n}{N} \frac{-N+n}{(N-1)N} \\ &= -\frac{1}{N-1} \frac{n}{N} \frac{N-n}{N} \\ &= -\frac{1}{N-1} \frac{n}{N} \left( 1 - \frac{n}{N}\right) \\ \end{array}
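These moments are easy to verify empirically. The sketch below (with a made-up population of $N=50$ and sample size $n=10$) simulates many without-replacement draws and compares the empirical variance and covariance of the inclusion indicators against the formulas:

import numpy as np

rng = np.random.default_rng(seed=2)

N, n = 50, 10        # tiny population, so the effect is clearly visible
n_trials = 100_000

# Z[t, i] = 1 if unit i is included in the t-th without-replacement sample
Z = np.zeros((n_trials, N))
for t in range(n_trials):
    Z[t, rng.choice(N, size=n, replace=False)] = 1

f = n / N
print("Var(Z_i):     empirical", Z[:, 0].var(), " theory", f * (1 - f))
print("Cov(Z_i,Z_j): empirical", np.cov(Z[:, 0], Z[:, 1])[0, 1],
      " theory", -f * (1 - f) / (N - 1))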
Finally, we can calculate \begin{array}{rl} Var\Big(\hat{p}_n\Big) &= Var\Big(\frac{\sum_{k=1}^N Z_k x_k}{n}\Big) \\ &= Cov\Big(\frac{\sum_{k=1}^N Z_k x_k}{n},\frac{\sum_{k=1}^N Z_k x_k}{n}\Big) \\ &= \frac{1}{n^2}Cov\Big(\sum_{k=1}^N Z_k x_k,\sum_{k=1}^N Z_k x_k\Big) \\ &= \frac{1}{n^2} \sum_{k=1}^N \sum_{j=1}^N x_k x_j Cov\Big( Z_k , Z_j\Big) \\ &= \frac{1}{n^2} \left ( \sum_{k=1}^N x_k^2 Var(Z_k) + \sum_{k=1}^N \sum_{j=1;j\ne k}^N x_k x_j Cov ( Z_k , Z_j) \right)\\ &= \frac{1}{n^2} \left ( \sum_{k=1}^N x_k^2 \frac{n}{N}\left(1 - \frac{n}{N} \right) - \sum_{k=1}^N \sum_{j=1;j\ne k}^N x_k x_j\frac{1}{N-1} \frac{n}{N} \left( 1 - \frac{n}{N}\right) \right)\\ &= \frac{1}{n^2} \left ( \frac{n}{N}\left(1 - \frac{n}{N} \right) \sum_{k=1}^N x_k^2 - \frac{1}{N-1} \frac{n}{N} \left( 1 - \frac{n}{N}\right) \sum_{k=1}^N \sum_{j=1;j\ne k}^N x_k x_j\right)\\ &= \frac{1}{n^2} \frac{n}{N}\left(1 - \frac{n}{N} \right) \left (\sum_{k=1}^N x_k^2 - \frac{1}{N-1} \sum_{k=1}^N \sum_{j=1;j\ne k}^N x_k x_j\right)\\ &= \frac{1}{n} \frac{1}{N}\left(1 - \frac{n}{N} \right) \frac{1}{N-1} \left ( (N-1)\sum_{k=1}^N x_k^2 - \sum_{k=1}^N \sum_{j=1;j\ne k}^N x_k x_j\right) \end{array}
Trick: Notice that $(y_1+\cdots+y_N)^2 = \sum_{i=1}^N y_i^2 + \sum_{i\ne j} y_i y_j$. So,
\begin{array}{rl} Var\left(\hat{p}_n \right) &= \frac{1}{n} \frac{1}{N}\left(1 - \frac{n}{N} \right) \frac{1}{N-1} \left( (N-1)\sum_{k=1}^N x_k^2 -\left( \Big(\sum_{i=1}^N x_i\Big)^2 - \sum_{i=1}^N x_i^2 \right)\right)\\ &= \frac{1}{n} \frac{1}{N}\left(1 - \frac{n}{N} \right) \frac{1}{N-1} \left( N\sum_{k=1}^N x_k^2 -\Big(\sum_{i=1}^N x_i\Big)^2 \right)\\ \end{array}
Trick: Notice also that: \begin{array}{rl} N\sum_{k=1}^N x_k^2 -\Big(\sum_{i=1}^N x_i\Big)^2 &= N^2\left(\frac{\sum_{k=1}^N x_k^2}{N} -\Big(\sum_{i=1}^N \frac{x_i}{N}\Big)^2 \right)\\ &= N^2\left(\bar{x^2} - \bar{x}^2\right)\\ &= N^2\left(\frac{1}{N}\sum_{i=1}^N( x_i - \bar{x})^2\right)\\ &= N^2 \sigma^2 \\ \end{array} where $\sigma^2=\left(\frac{1}{N}\sum_{i=1}^N( x_i - \bar{x})^2\right)$
This formula is essentially the same idea as $Var(X) = E(X^2)-E(X)^2$, and easy to prove: \begin{array}{rl} \sigma^2 = \frac{1}{N}\sum_{i=1}^N( x_i - \bar{x})^2 &= \frac{1}{N}\sum_{i=1}^N( x_i^2 - 2x_i \bar{x} + \bar{x}^2 )\\ &= \frac{1}{N}( \sum_{i=1}^Nx_i^2 - 2\bar{x} \sum_{i=1}^N x_i + N\bar{x}^2 )\\ &= \frac{1}{N}( \sum_{i=1}^Nx_i^2 - 2N \bar{x} \bar{x} + N\bar{x}^2 )\\ &= \frac{1}{N} \sum_{i=1}^Nx_i^2 - \bar{x}^2\\ \end{array}
Finally, finally, we obtain the variance formula in the finite population setting. \begin{array}{rl} Var\left(\hat{p}_n \right) &= \frac{1}{n} \frac{1}{N}\left(1 - \frac{n}{N} \right) \frac{1}{N-1} N^2 \sigma^2\\ &= \frac{1}{n} \frac{N-n}{N} \frac{N}{N-1} \sigma^2 \\ &= \frac{N-n}{N-1} \frac{\sigma^2}{n} \\ \end{array} Recall that under the infinite population assumption, $Var\left(\hat{p}_n \right) = \frac{\sigma^2}{n} = \frac{p(1-p)}{n} $. So the only difference is the term:
$$ FPC = \frac{N-n}{N-1} $$
This term has a name: the Finite Population Correction (FPC). $FPC \approx 1$ when the sample size is very small relative to the population size ($N \gg n$). On the other hand, when the sample size is nearly the same as the population size ($n \approx N$), the FPC approaches zero.
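We can verify the corrected variance formula with a quick simulation (using a made-up population of 50 accounts, 15 of them fake, and a sample of 30, so that the correction is large):

import numpy as np

rng = np.random.default_rng(seed=3)

# tiny made-up population: 50 accounts, 15 of them fake (illustration only)
N, n = 50, 30
x = np.zeros(N)
x[:15] = 1
sigma2 = x.var()   # population variance sigma^2 = p(1-p) = 0.3 * 0.7

n_trials = 100_000
p_hats = np.array([x[rng.choice(N, size=n, replace=False)].mean()
                   for _ in range(n_trials)])

fpc = (N - n) / (N - 1)
print("empirical Var(p_hat): ", p_hats.var())
print("theory with FPC:      ", fpc * sigma2 / n)
print("theory, infinite pop.:", sigma2 / n)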
To recap, in the infinite population scenario, the sample size calculation can be done as:
$$ n_{infinite} = \frac{\sigma^2 q_{0.0005}^2}{\delta^2} $$ where $\delta$ is the acceptable error; we previously considered $\delta=0.0001$ and $\delta = 0.00001$.
This formula was derived by solving the equality: \begin{array}{rl} q_{0.0005} = - \frac{\delta}{SD(\hat{p}_n)} = - \frac{\delta \sqrt{n}}{\sigma} \end{array} In the finite population case, $SD(\hat{p}_n) = \sqrt{\frac{N-n}{N-1}\frac{\sigma^2}{n}}$, and solving the same equality for $n$ gives:
$$ n_{finite} = \frac{N}{\frac{N-1}{n_{infinite}}+1} = \frac{n_{infinite} N}{N-1 + n_{infinite}} $$
Notice that when $N$ goes to infinity, $n_{finite} \rightarrow n_{infinite}$.
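A quick illustration of this limit, using the $n_{infinite} = 270{,}602{,}500$ computed earlier and growing the population size:

# as N grows, n_finite approaches n_infinite
n_inf = 270_602_500
for N in [10**9, 10**10, 10**11, 10**12]:
    print("N = {:.0e}  n_finite/n_infinite = {:.4f}".format(N, N / (N - 1 + n_inf)))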
Once again, assuming the conservative population variance of $\sigma^2 = p(1-p) = 0.5 \times 0.5 = 0.25$, and $N=2,000,000,000$, the required sample sizes are computed below.
The difference in sample size becomes considerable, especially when the acceptable error rate is small.
import numpy as np

N = 2*(10**9)             # population size (2 billion accounts)
q = -3.29                 # 0.0005 quantile of the standard normal
sigma = np.sqrt(0.5*0.5)  # conservative population standard deviation

def sample_size(delta):
    # delta: acceptable error rate
    n_infinite = (sigma * q / delta)**2
    n_finite = n_infinite * N / (N - 1 + n_infinite)
    print("n_infinite:", n_infinite, "n_finite:", n_finite,
          "n_finite/n_infinite:", n_finite/n_infinite)

sample_size(0.0001)
sample_size(0.00001)
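Running this, the two calls print roughly $n_{infinite} \approx 2.71 \times 10^8$ vs. $n_{finite} \approx 2.38 \times 10^8$ (ratio $\approx 0.88$) for $\delta = 0.0001$, and $n_{infinite} \approx 2.71 \times 10^{10}$ vs. $n_{finite} \approx 1.86 \times 10^9$ (ratio $\approx 0.069$) for $\delta = 0.00001$. With the correction, the required sample size can never exceed the population size, no matter how small the acceptable error.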