# MIT Probability Reference

### Set Equations

P(A|B) = (P(A nn B)) / (P(B)) provided P(B)!=0

 = (|A nn B|)/(|B|) (when all outcomes are equally likely)

"Bayes Theorem" = P(B|A) = (P(A|B)*P(B))/(P(A))

"Bayes Theorem" = P(B|A) " " alpha " " P(A|B)*P(B) where alpha means "is proportional to"

"De Morgan's laws":

(A uu B)^c = A^c nn B^c

(A nn B)^c = A^c uu B^c

A is independent of B iff any one of the following equivalent conditions holds:

P(A nn B) = P(A)*P(B)

P(A|B) = P(A)

P(B|A) = P(B)

"Addition of probabilities" = P(A) = P(A|B) * P(B) + P(A|B^c)*P(B^c)

P(E^c) = "complement of E" = 1 - P(E)

O(E) = "odds" = (P(E)) / (1 - P(E)) = (P(E))/(P(E^c))

O(E^c) = 1 / (O(E))

P(E) = (O(E)) / (1 + O(E))

BF = "Bayes factor" = (P(A|H))/(P(A|H^c)

If C and D are conditionally independent given H (and given H^c):

• O(H|C,D) = BF_C * BF_D * O(H)
• ln(O(H|C,D)) = ln(BF_C) + ln(BF_D) + ln(O(H))
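A sketch of updating odds with Bayes factors and converting back to a probability (the prior and both Bayes factors are made-up numbers):

```python
# Odds <-> probability conversions from the formulas above.
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

prior_odds = prob_to_odds(0.10)   # O(H) = 1/9
BF_C, BF_D = 3.0, 2.0             # Bayes factors of two conditionally independent pieces of evidence

# O(H|C,D) = BF_C * BF_D * O(H)
posterior_odds = BF_C * BF_D * prior_odds
posterior_prob = odds_to_prob(posterior_odds)
print(posterior_prob)             # odds 2/3 -> probability 0.4
```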

### Bernoulli Distribution

Range X : 0, 1

P(X=1) = p

P(X=0) = 1 - p

X ~ "Bernoulli"(p) or "Ber"(p)

"Ber"(p) = "Bin"(1,p)

E(X) = p

Var(X) = (1-p)p

### Binomial Distribution

This is the sum of n independent Bernoulli(p) variables.

Range X : 0, 1, … n

X ~ "Binomial"(n,p)

p(k) = "Bin"(n,k) = ((n),(k)) p^k (1-p)^(n-k) = dbinom(k,n,p)

"Bin"(n,n) = p^n

E(X) = np

Var(X)=np(1-p)
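The pmf can be checked directly with Python's `math.comb`; a small sketch verifying that the probabilities sum to 1 and that the mean is np (n and p below are arbitrary):

```python
import math

# Binomial pmf from the formula above; plays the role of R's dbinom(k, n, p).
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)                                      # should be 1
mean = sum(k * q for k, q in zip(range(n + 1), pmf))  # should be n*p = 3
```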

### Geometric Distribution

This models the number of failures before the first success in a sequence of independent Bernoulli(p) trials (e.g. repeated coin flips).

Range X : 0, 1, 2, ...

X ~ "geometric"(p) or "geo"(p)

p(k) = P(X=k) = (1-p)^k p = dgeom(k,p)

E(X) = (1-p)/p

P(X=n+k | X >= n) = P(X=k)

Var(X) = (1-p) / p^2
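The memoryless property P(X=n+k | X >= n) = P(X=k) can be checked numerically (p, n, k below are arbitrary):

```python
# Geometric pmf: k failures before the first success.
def geom_pmf(k, p):
    return (1 - p)**k * p

p, n, k = 0.25, 4, 3
P_X_ge_n = (1 - p)**n                 # P(X >= n): the first n trials all fail
lhs = geom_pmf(n + k, p) / P_X_ge_n   # P(X = n+k | X >= n)
rhs = geom_pmf(k, p)                  # P(X = k)
print(lhs, rhs)                       # equal: the distribution "forgets" the first n failures
```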

### Uniform Distribution

Where all values in the interval [a, b] are equally likely (continuous uniform).

Range X : [a, b]

E(X) = (a + b) / 2

X ~ "uniform"(a,b) or U(a,b)

f(x) = 1 / (b-a) for a <= x <= b

F(x) = (x-a)/(b-a) for a <= x <= b

Var(X) = (b - a)^2/12
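The mean and variance formulas can be checked by numerically integrating the density (a and b below are arbitrary; midpoint rule is used as a simple sketch):

```python
# Midpoint-rule integration of f(x) = 1/(b-a) on [a, b].
a, b = 2.0, 5.0
N = 100_000
h = (b - a) / N
xs = [a + (i + 0.5) * h for i in range(N)]
f = 1 / (b - a)

mean = sum(x * f * h for x in xs)             # (a+b)/2 = 3.5
var = sum((x - mean)**2 * f * h for x in xs)  # (b-a)^2/12 = 0.75
```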

### Exponential Distribution

Models: Waiting times

X ~ exponential(lambda) or exp(lambda)

Parameter: lambda (called the rate parameter)

Range: [0,oo)

Density: f(x) = lambda e^(-lambda x) for x >= 0 = dexp(x,lambda)

F(x) = 1 - e^(-lambda x)

E(X) = mu = 1/lambda

Var(X) = 1/lambda^2
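A simulation sketch checking that the mean is 1/lambda (lambda is arbitrary; the seed is fixed for reproducibility and the tolerance is statistical, not exact):

```python
import random

random.seed(42)
lam = 2.0
samples = [random.expovariate(lam) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)   # close to 1/lam = 0.5
```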

### Normal distribution

Models: Measurement error, intelligence/ability, height, averages of lots of data.

• Range: (-oo,oo)
• Parameters: mu (mean), sigma (standard deviation)
• Notation: "normal"(mu,sigma^2) or N(mu,sigma^2)
• Density: f(x)=1/(sigma sqrt(2pi)) e^(-(x-mu)^2/(2sigma^2)) = dnorm(x,mu,sigma)
• Distribution: F(x) has no formula, so use tables or software such as pnorm in R to compute F(x).
• pnorm(.6,0,1) returns P(Z <= .6) for the standard normal; the .6 quantile is given by qnorm(.6,0,1).
• Standard Normal Cumulative Distribution : N(0,1) = Phi(z) : has mean 0 and variance 1.
• Standard Normal Density: phi(z) = 1/sqrt(2pi) e^(-z^2/2)
• N(mu,sigma^2) has mean mu, variance sigma^2, and standard deviation sigma.
• P(-1 <= Z <= 1) ~~ 0.6826895, P(-2 <= Z <= 2) ~~ 0.9544997, P(-3 <= Z <= 3) ~~ 0.9973002
• Phi(x) = P(Z <= x)
• Phi(1) = P(Z <= 1) ~~ 0.8413447 = "pnorm"(1,0,1)
• Phi(2) = P(Z <= 2) ~~ 0.9772499 = "pnorm"(2,0,1)
• Phi(3) = P(Z <= 3) ~~ 0.9986501 = "pnorm"(3,0,1)
• P(|Z|) = "pnorm"(Z) - "pnorm"(-Z)

#### Normal-normal update formulas for n data points

The normal distribution is its own conjugate prior (for the mean, with known variance): a normal prior combined with a normal likelihood gives a normal posterior.

So, if prior is N(mu_"prior", sigma_"prior"^2) and likelihood is N(theta, sigma^2) then posterior is N(mu_"post", sigma_"post"^2).

• a = 1 / sigma_"prior"^2
• b = n / sigma^2
• bar x = (x_1 + … + x_n) / n
• mu_"post" = (a mu_"prior" + b bar x) / (a + b)
• sigma_"post"^2 = 1 / (a+b)

### Beta distribution

f(theta) = c theta^(a-1) (1-theta)^(b-1) = dbeta(theta,a,b)

c = ((a+b-1)!) / ((a-1)!(b-1)!) for integer a, b; in general c = (Gamma(a+b))/(Gamma(a)Gamma(b))

Beta distribution is a conjugate prior of the binomial distribution. This means if the prior is a beta and the likelihood is a binomial then the posterior is also beta.

So, if prior is dbeta(p,a,b) and likelihood is dbinom(k,n,p) then posterior is dbeta(p,a+k,b+n-k).

If prior is dbeta(p,a,b) and likelihood is dgeom(k,p) then posterior is dbeta(p,a+1,b+k).
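The beta-binomial update is just parameter bookkeeping; a sketch with made-up prior and data (the posterior mean (a+k)/(a+b+n) is shown as a sanity check):

```python
# Beta-binomial conjugate update: prior Beta(a, b), data k successes in n trials,
# posterior Beta(a + k, b + n - k).
a, b = 2, 2          # prior pseudo-counts
n, k = 10, 7         # observed: 7 successes in 10 trials
a_post, b_post = a + k, b + n - k
post_mean = a_post / (a_post + b_post)   # = (a+k)/(a+b+n)
print(a_post, b_post, post_mean)
```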

### Discrete Random Variables

Random variable X assigns a number to each outcome: X : Omega -> R

X = a " denotes event " {omega | X(omega) = a}

"probability mass function (pmf) of X is given by: " p(a) = P(X=a)

"Cumulative distribution function (cdf) of X is given by: " F(a) = P(X<=a)

### Continuous random variables

"Cumulative distribution function (cdf)" = F(x) = P(X<=x) = int_-oo^x f(t) dt

"Probability density function (pdf)" = P(c<=x<=d) = int_c^d f(x) dx "for" f(x)>=0

• cdf of X is F_X(x)=P(X <= x)
• pdf of X is f_X(x)=F'_X(x)
Properties of the cdf (same as for discrete distributions):
• (Definition) F(x) = P(X<=x)
• 0 <= F(x) <= 1
• non-decreasing
• lim_(x->-oo) F(x) = 0
• lim_(x->oo) F(x) = 1
• P(c < X <= d) = F(d) - F(c)
• F'(x) = f(x)

### Expected Value (mean or average)

• weighted average = E(X) = sum_(i=1)^n x_i * p(x_i)
• E(X+Y) = E(X) + E(Y)
• E(aX+b) = a*E(X) + b
• E(h(X)) = sum_i h(x_i) * p(x_i)
• E(X-mu_X) = 0
• E(X) = int_a^b x * f(x) dx for a continuous random variable with density f(x)
• E(XY) = int_c^d int_a^b xy * f(x,y) dx dy for joint density f(x,y)

### Variance

• "mean" = E(X) = mu
• variance of X = Var(X) = E((X-mu)^2) = sigma^2 = sum_(i=1)^n p(x_i)(x_i-mu)^2
• standard deviation = sigma = sqrt(Var(X))
• Var(aX+b) = a^2 Var(X)
• Var(X) = E(X^2) - E(X)^2 = E(X^2) - mu^2
• If X and Y are independent then: Var(X+Y) = Var(X) + Var(Y)

### Covariance

Measure of how much two random variables vary together.

A positive value means an increase in one tends to accompany an increase in the other; a negative value means an increase in one tends to accompany a decrease in the other. A value of 0 does not mean they are independent, though (e.g. Y=X^2 with X symmetric about 0 has zero covariance).

• "Cov"(X,Y)="Covariance"=E((X-mu_X)(Y-mu_Y))
• "Cov"(X,Y)=int_c^d int_a^b (x-mu_x)(y-mu_y)f(x,y)dxdy
•  = (int_c^d int_a^b xy f(x,y) dxdy) - mu_x mu_y
Properties:
• "Cov"(aX+b, cY+d) = ac"Cov"(X,Y)
• "Cov"(X_1+X_2,Y) = "Cov"(X_1,Y) + "Cov"(X_2,Y)
• "Cov"(X,X) = "Var"(X)
• "Cov"(X,Y) = E(XY) - mu_X mu_Y
• "Var"(X+Y) = "Var"(X) + "Var"(Y) + 2"Cov"(X,Y)
• If X and Y are independent then Cov(X,Y) = 0.
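Covariance from a small (made-up) joint pmf, plus a numeric check of the identity "Var"(X+Y) = "Var"(X) + "Var"(Y) + 2"Cov"(X,Y):

```python
# Joint pmf over pairs (x, y); probabilities sum to 1.
joint = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}

# Expectation of any function g(X, Y) under the joint pmf.
E = lambda g: sum(g(x, y) * p for (x, y), p in joint.items())
mu_X, mu_Y = E(lambda x, y: x), E(lambda x, y: y)

cov = E(lambda x, y: (x - mu_X) * (y - mu_Y))        # Cov(X, Y)
var_X = E(lambda x, y: (x - mu_X)**2)
var_Y = E(lambda x, y: (y - mu_Y)**2)
var_sum = E(lambda x, y: (x + y - mu_X - mu_Y)**2)   # Var(X + Y)
print(cov, var_sum)
```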

### Correlation Coefficient

Measures the strength of the linear relationship between two variables. It does not capture higher-order (nonlinear) relationships, like Y=X^2.

rho is the covariance of the standardizations of X and Y.

• "Cor"(X,Y) = rho = ("Cov"(X,Y))/(sigma_X sigma_Y)
• rho is dimensionless (it's a ratio)
• -1 <= rho <= 1
• rho = +1 iff Y=aX+b with a > 0
• rho = -1 iff Y=aX+b with a < 0
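A sketch checking that a perfect increasing linear relationship gives rho = +1 (the data and the line Y = 2X + 1 are made up):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]   # Y = aX + b with a > 0
n = len(xs)

# Population-style covariance and standard deviations, then rho = Cov/(sigma_X sigma_Y).
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx)**2 for x in xs) / n)
sy = math.sqrt(sum((y - my)**2 for y in ys) / n)
rho = cov / (sx * sy)
print(rho)   # +1 for a perfect increasing linear relationship
```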

### Quantiles

The 60th percentile is the same as the 0.60 quantile and the 6th decile. The 3rd quartile would be 75th percentile.

• median is x for which P(X<=x) = P(X>=x)
• median is when cdf F(x) = P(X<=x) = .5
• The pth quantile of X is the value q_p such that F(q_p)=P(X<=q_p)=p. In this notation q_.5 is the median.

### Central Limit Theorem & Law of Large Numbers

• LoLN: As n grows, the probability that the sample mean bar X_n is close to mu goes to 1.
• LoLN: lim_(n->oo) P(|bar X_n - mu| < alpha) = 1 for any alpha > 0
• CLT: As n grows, the distribution of bar X_n converges to the normal distribution N(mu,sigma^2/n)
• S_n = sum_(i=1)^n X_i
• bar X_n = S_n / n
• E(S_n) = mu_S = n mu
• "Var"(S_n) = n sigma^2
• sigma_(S_n) = sqrt(n) sigma
• For large n, Z = standardization of S_n = (S_n - mu_S)/(sigma_(S_n)) = (S_n - n mu) / (sqrt(n) sigma) = (bar X - mu) / (sigma/sqrt(n))
• Z has mean 0, standard deviation 1
• If X has a normal distribution, Z is the standard normal distribution
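A simulation sketch of the CLT: standardized means of uniform(0,1) samples (mu = 1/2, sigma^2 = 1/12) should behave like a standard normal. The seed, sample size, and repetition count are arbitrary, and the assertions are statistical tolerances:

```python
import random, statistics

random.seed(1)
mu, sigma, n = 0.5, (1 / 12)**0.5, 30

zs = []
for _ in range(2000):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / n**0.5))   # standardize the sample mean

print(statistics.mean(zs), statistics.stdev(zs))  # near 0 and 1
```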

### Joint Distributions

• range [a,b] x [c,d]
• f(x,y) = joint density
• F(x,y)="joint cdf"=P(X<=x,Y<=y)=int_c^y int_a^x f(u,v) du dv
• f(x,y)=(del^2 F)/(del x del y)(x,y)
• marginal pdf = f_X (x) = int_c^d f(x,y) dy
• marginal cdf = F_X (x) = F(x,d)
• X and Y are independent if f(x,y)=f_X (x) f_Y (y)
• X and Y are independent if F(x,y)=F_X (x) F_Y (y)

### Maximum Likelihood Estimates

likelihood function = f(x_1,...,x_n|p) = product of the individual densities/pmfs (if the observations are independent)

log likelihood = ln(f(x_1,...,x_n|p))

hat p = maximum likelihood estimate for p : solve (d "likelihood function") / (dp) = 0

Equivalently (and usually easier): solve (d "log likelihood function") / (dp) = 0

For "uniform"(a,b), the maximum likelihood estimates are:

• hat a = "min"(x_1, ..., x_n)
• hat b = "max"(x_1, ..., x_n)
Notations:
• θ is the value of the hypothesis.
• p(θ) is the prior probability mass function of the hypothesis.
• p(θ|D) is the posterior probability mass function of the hypothesis given the data.
• p(D|θ) is the likelihood function. (This is not a pmf!)
Two-step Bayes update table (second update reuses the likelihood column):

| hypothesis | prior | likelihood | Bayes numerator | posterior | posterior predictive probability | posterior #2 |
|---|---|---|---|---|---|---|
| θ | P(θ) | P(D\|θ) | P(θ) * P(D\|θ) | P(θ\|D) | P(θ\|D) * P(D\|θ) | P(θ\|D2) |
| A | aP | aL | a = aP * aL | aP2 = a / SUM | a2 = aP2 * aL | a2 / SUM2 |
| B | bP | bL | b = bP * bL | bP2 = b / SUM | b2 = bP2 * bL | b2 / SUM2 |
| C | cP | cL | c = cP * cL | cP2 = c / SUM | c2 = cP2 * cL | c2 / SUM2 |
| total | 1 | | SUM | 1 | SUM2 | 1 |

Law of total probability: prior predictive probability = p(x) = int_a^b p(x|theta) f(theta) d theta
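A numeric check of the prior predictive integral: with a flat Beta(1,1) prior on theta and a "Bin"(n,theta) likelihood, p(k) = 1/(n+1) for every k (n, k, and the midpoint-rule resolution below are arbitrary):

```python
import math

n, k, N = 5, 2, 10_000
h = 1 / N

# Midpoint rule for p(k) = int_0^1 C(n,k) theta^k (1-theta)^(n-k) d theta,
# since the Beta(1,1) prior density f(theta) = 1 on [0, 1].
p_x = sum(
    math.comb(n, k) * t**k * (1 - t)**(n - k) * h
    for t in ((i + 0.5) * h for i in range(N))
)
print(p_x)   # close to 1/(n+1) = 1/6
```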

Bayes update table for discrete and continuous hypotheses:

| hypothesis | prior | likelihood | Bayes numerator | posterior |
|---|---|---|---|---|
| H | P(H) | P(D\|H) | P(D\|H)P(H) | P(H\|D) |
| θ (discrete) | p(θ) | p(x\|θ) | p(x\|θ)p(θ) | p(θ\|x) |
| θ (continuous) | f(θ) dθ | p(x\|θ) | p(x\|θ)f(θ) dθ | f(θ\|x) dθ |