NEXT: section 9, reading 17a
Set Equations
`P(A|B) = (P(A nn B)) / (P(B))` provided `P(B)!=0`
` = (|A nn B|)/(|B|)` (when all outcomes are equally likely)
`"Bayes Theorem" = P(B|A) = (P(A|B)*P(B))/(P(A))`
`"Bayes Theorem" = P(B|A) " " alpha " " P(A|B)*P(B)` where `alpha` means "is proportional to"
`(A uu B)^c = A^c nn B^c`
`(A nn B)^c = A^c uu B^c`
A and B are independent means any (equivalently, all) of the following hold:
`P(A nn B) = P(A)*P(B)`
`P(A|B) = P(A)`
`P(B|A) = P(B)`
`"Addition of probabilities" = P(A) = P(A|B) * P(B) + P(A|B^c)*P(B^c)`
`P(E^c) = "probability of the complement of E" = 1 - P(E)`
`O(E^c) = 1 / (O(E))`
`O(E) = "odds" = (P(E)) / (1 - P(E))`
`P(E) = (O(E)) / (1 + O(E))`
`BF = "Bayes factor" = (P(A|H))/(P(A|H^c)`
If C and D are conditionally independent given H (and given `H^c`):
- `O(H|C,D) = BF_C * BF_D * O(H)`
- `ln(O(H|C,D)) = ln(BF_C) + ln(BF_D) + ln(O(H))`
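A quick sketch of this update in R, with hypothetical prior odds and Bayes factors:

```r
# Hypothetical numbers: prior odds for H and Bayes factors for evidence C and D
prior_odds <- 0.5                         # O(H), assumed
BF_C <- 3                                 # P(C|H) / P(C|H^c), assumed
BF_D <- 2                                 # P(D|H) / P(D|H^c), assumed

post_odds <- BF_C * BF_D * prior_odds     # O(H|C,D) = 3
post_prob <- post_odds / (1 + post_odds)  # back to a probability: 0.75
```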
Bernoulli Distribution
Range X : 0, 1
`P(X=1) = p`
`P(X=0) = 1 - p`
`X ~ "Bernoulli"(p)` or `"Ber"(p)`
`"Ber"(p) = "Bin"(1,p)`
`E(X) = p`
`Var(X) = (1-p)p`
Binomial Distribution
Models the number of successes in n independent Bernoulli(p) trials, i.e., the sum of n independent Bernoulli(p) variables.
Range X : 0, 1, ..., n
`X ~ "Binomial"(n,p)`
`p(k) = "Bin"(n,k) = ((n),(k)) p^k (1-p)^(n-k)` = dbinom(k,n,p)
`"Bin"(n,n) = p^n`
`E(X) = np`
Var(X)=np(1-p)
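A quick R check of the binomial pmf, mean, and variance (numbers chosen arbitrarily):

```r
n <- 10; p <- 0.3; k <- 0:n
# pmf formula vs dbinom
all.equal(choose(n, k) * p^k * (1 - p)^(n - k), dbinom(k, n, p))   # TRUE
# mean and variance computed from the pmf match np and np(1-p)
sum(k * dbinom(k, n, p))               # 3   = n*p
sum((k - n * p)^2 * dbinom(k, n, p))   # 2.1 = n*p*(1-p)
```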
Geometric Distribution
Models the number of failures before the first success in a sequence of independent Bernoulli(p) trials (e.g., coin flips).
Range X : 0, 1, 2, ...
`X ~ "geometric"(p) or "geo"(p)`
`p(k) = P(X=k) = (1-p)^k p` = dgeom(k,p)
`E(X) = (1-p)/p`
`P(X=n+k | X >= n) = P(X=k)`
`Var(X) = (1-p) / p^2`
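A small R check of the geometric pmf and the memoryless property (p chosen arbitrarily):

```r
p <- 0.25; k <- 0:5; n <- 3
# pmf formula vs dgeom (dgeom counts failures before the first success)
all.equal((1 - p)^k * p, dgeom(k, p))            # TRUE
# memorylessness: P(X = n + k | X >= n) = P(X = k)
lhs <- dgeom(n + k, p) / (1 - pgeom(n - 1, p))   # P(X >= n) = 1 - P(X <= n - 1)
all.equal(lhs, dgeom(k, p))                      # TRUE
```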
Uniform Distribution
Models a situation where every value in the range is equally likely.
Range X : `[a, b]`
`X ~ "uniform"(a,b)` or `U(a,b)`
`f(x) = 1 / (b-a)` for `a <= x <= b`
`F(x) = (x-a)/(b-a)` for `a <= x <= b`
`E(X) = (a + b) / 2`
`Var(X) = (b - a)^2/12`
Exponential Distribution
Models: Waiting times
`X ~ "exponential"(lambda)` or `"exp"(lambda)`
Parameter: `lambda` (called the rate parameter)
Range: `[0,oo)`
Density: `f(x) = lambda e^(-lambda x)` for `x >= 0` = dexp(x,`lambda`)
`F(x) = 1 - e^(-lambda x)`
`E(X) = mu = 1/lambda`
`Var(X) = 1/lambda^2`
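A quick R sketch of the exponential cdf and mean (rate chosen arbitrarily):

```r
lambda <- 2                       # rate parameter, chosen arbitrarily
# F(1) from the formula and from pexp agree
1 - exp(-lambda * 1)              # 0.8646647
pexp(1, rate = lambda)            # 0.8646647
# mean waiting time is 1/lambda
mean(rexp(1e6, rate = lambda))    # approximately 0.5
```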
Normal Distribution
Models: Measurement error, intelligence/ability, height, averages of lots of data.
- Range: `(-oo,oo)`
- Parameters: `mu` `sigma`
- Notation: `"normal"(mu,sigma^2)` or `N(mu,sigma^2)`
- Density: `f(x)=1/(sigma sqrt(2pi)) e^(-(x-mu)^2/(2sigma^2))` = dnorm(x,μ,σ)
- Distribution: F(x) has no formula, so use tables or software such as pnorm in R to compute F(x).
- pnorm(.6,0,1) returns `P(Z <= .6) = F(.6)` for the standard normal; qnorm(.6,0,1) returns the .6 quantile.
- Standard normal distribution: `N(0,1)`, with mean 0 and variance 1; its cdf is written `Phi(z)`.
- Standard Normal Density: `phi(z) = 1/sqrt(2pi) e^(-z^2/2)`
- `N(mu,sigma^2)` has mean `mu`, variance `sigma^2`, and standard deviation `sigma`.
- `P(-1 <= Z <= 1) ~~ 0.6826895`, `P(-2 <= Z <= 2) ~~ 0.9544997`, `P(-3 <= Z <= 3) ~~ 0.9973002`
- `Phi(x) = P(Z <= x)`
- `Phi(1) = P(Z <= 1) ~~ 0.8413447 = "pnorm"(1,0,1)`
- `Phi(2) = P(Z <= 2) ~~ 0.9772499 = "pnorm"(2,0,1)`
- `Phi(3) = P(Z <= 3) ~~ 0.9986501 = "pnorm"(3,0,1)`
- `P(|Z|) = "pnorm"(Z) - "pnorm"(-Z)`
The normal distribution is its own conjugate prior (for the unknown mean, with known variance): if the prior and the likelihood are normal, the posterior is normal.
So, if prior is `N(mu_"prior", sigma_"prior"^2)` and likelihood is `N(theta, sigma^2)` then posterior is `N(mu_"post", sigma_"post"^2)`.
- `a = 1 / sigma_"prior"^2`
- `b = n / sigma^2`
- `bar x = (x_1 + … + x_n) / n`
- `mu_"post" = (a mu_"prior" + b bar x) / (a + b)`
- `sigma_"post"^2 = 1 / (a+b)`
Beta Distribution
`f(theta) = c theta^(a-1) (1-theta)^(b-1)` = dbeta(θ,a,b)
`c = ((a+b-1)!) / ((a-1)!(b-1)!)` for integer `a, b` (in general `c = (Gamma(a+b))/(Gamma(a)Gamma(b))`)
Beta distribution is a conjugate prior of the binomial distribution. This means if the prior is a beta and the likelihood is a binomial then the posterior is also beta.
So, if prior is dbeta(p,a,b) and likelihood is dbinom(k,n,p) then posterior is dbeta(p,a+k,b+n-k).
if prior is dbeta(p,a,b) and likelihood is dgeom(k,p) then posterior is dbeta(p,a+1,b+k).
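A sketch of the beta-binomial update in R, with made-up prior parameters and data; it checks the conjugate shortcut against a direct prior-times-likelihood calculation:

```r
a <- 2; b <- 2      # prior: dbeta(p, 2, 2), assumed
n <- 10; k <- 7     # made-up data: 7 successes in 10 trials
p <- 0.6            # point at which to compare the two computations

# conjugate update: posterior is dbeta(p, a + k, b + n - k)
post_direct <- dbeta(p, a + k, b + n - k)
# same density computed as prior * likelihood, normalized numerically
norm_const <- integrate(function(t) dbeta(t, a, b) * dbinom(k, n, t), 0, 1)$value
post_bayes <- dbeta(p, a, b) * dbinom(k, n, p) / norm_const
c(post_direct, post_bayes)   # equal up to numerical error
```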
Discrete Random Variables
Random variable X assigns a number to each outcome:
`X : Omega -> R`
`X = a " denotes event " {omega | X(omega) = a}`
`"probability mass function (pmf) of X is given by: " p(a) = P(X=a)`
`"Cumulative distribution function (cdf) of X is given by: " F(a) = P(X<=a)`
Continuous random variables
`"Cumulative distribution function (cdf)" = F(x) = P(X<=x) = int_-oo^x f(t) dt`
`"Probability density function (pdf)" = P(c<=x<=d) = int_c^d f(x) dx "for" f(x)>=0`
- cdf of X is `F_X(x)=P(X <= x)`
- pdf of X is `f_X(x)=F'_X(x)`
Properties of the cdf
(Same as for discrete distributions)
- (Definition) `F(x) = P(X<=x)`
- `0 <= F(x) <= 1`
- non-decreasing
- `lim_(x->-oo) F(x) = 0`
- `lim_(x->oo) F(x) = 1`
- `P(c < X <= d) = F(d) - F(c)`
- `F'(x) = f(x)`
Expected Value (mean or average)
- weighted average = `E(X) = sum_(i=1)^n x_i * p(x_i)`
- `E(X+Y) = E(X) + E(Y)`
- `E(aX+b) = a*E(X) + b`
- `E(h(X)) = sum_i h(x_i) * p(x_i)`
- `E(X-mu_x) = 0`
- For continuous X with density f(x): `E(X) = int_a^b x * f(x) dx` (units for f(x) are probability/dx).
- For jointly distributed X and Y with joint density f(x,y): `E(XY) = int_c^d int_a^b xy * f(x,y) dx dy`
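The continuous expectation can be checked numerically in R with integrate(); here using the exponential(2) density as a concrete example:

```r
# E(X) for X ~ exponential(rate = 2): integral of x * f(x) over [0, Inf)
integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)$value    # 0.5 = 1/lambda
# E(h(X)) with h(x) = x^2 gives E(X^2)
integrate(function(x) x^2 * dexp(x, rate = 2), lower = 0, upper = Inf)$value  # 0.5 = 2/lambda^2
```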
Variance
- `"mean" = E(X) = mu`
- variance of X = `Var(X) = E((X-mu)^2) = sigma^2 = sum_(i=1)^n p(x_i)(x_i-mu)^2`
- standard deviation = `sigma = sqrt(Var(X))`
- `Var(aX+b) = a^2 Var(X)`
- `Var(X) = E(X^2) - E(X)^2 = E(X^2) - mu^2`
- If X and Y are independent then: `Var(X+Y) = Var(X) + Var(Y)`
Covariance
Measure of how much two random variables vary together.
A positive value means that when one variable is above its mean, the other tends to be above its mean too; a negative value means the opposite. A value of 0 does not imply independence, though (e.g., X symmetric about 0 with `Y=X^2`: the covariance is 0 but Y is completely determined by X).
- `"Cov"(X,Y)="Covariance"=E((X-mu_X)(Y-mu_Y))`
- `"Cov"(X,Y)=int_c^d int_a^b (x-mu_x)(y-mu_y)f(x,y)dxdy`
- ` = (int_c^d int_a^b xy f(x,y) dxdy) - mu_x mu_y`
Properties:
- `"Cov"(aX+b, cY+d) = ac"Cov"(X,Y)`
- `"Cov"(X_1+X_2,Y) = "Cov"(X_1,Y) + "Cov"(X_2,Y)`
- `"Cov"(X,X) = "Var"(X)`
- `"Cov"(X,Y) = E(XY) - mu_X mu_Y`
- `"Var"(X+Y) = "Var"(X) + "Var"(Y) + 2"Cov"(X,Y)`
- If X and Y are independent then Cov(X,Y) = 0.
Correlation Coefficient
Measures the strength of the linear relationship between two variables; it does not capture higher-order relationships (like `Y=X^2`). See the R sketch after this list.
`rho` is the covariance of the standardizations of X and Y.
- `"Cor"(X,Y) = rho = ("Cov"(X,Y))/(sigma_X sigma_Y)`
- `rho` is dimensionless (it's a ratio)
- `-1 <= rho <= 1`
- `rho = +1` iff `Y=aX+b` with a > 0
- `rho = -1` iff `Y=aX+b` with a < 0
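A small R simulation illustrating covariance and correlation, including the `Y=X^2` caveat:

```r
set.seed(1)
x <- runif(1e5, -1, 1)    # X symmetric about 0
y <- x^2                  # Y is completely determined by X
cov(x, y)                 # approximately 0: zero covariance, yet not independent
cor(x, y)                 # approximately 0 as well

z <- 2 * x + rnorm(1e5, sd = 0.1)   # nearly linear function of X
cor(x, z)                 # close to +1
```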
Quantiles
The 60th percentile is the same as the 0.60 quantile and the 6th decile. The 3rd quartile would be 75th percentile.
- median is x for which `P(X<=x) = P(X>=x)`
- median is when cdf `F(x) = P(X<=x) = .5`
- The pth quantile of X is the value `q_p` such that `F(q_p)=P(X<=q_p)=p`. In this notation `q_.5` is the median.
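In R the q-functions invert the cdf, so they return quantiles (pnorm gives F, qnorm gives `q_p`):

```r
qnorm(0.50)          # 0:         the median of N(0,1)
qnorm(0.75)          # 0.6744898: the 3rd quartile = 75th percentile
qnorm(0.60)          # 0.2533471: the 0.60 quantile = 60th percentile = 6th decile
pnorm(qnorm(0.60))   # 0.6:       F(q_p) = p
```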
Central Limit Theorem & Law of Large Numbers
- LoLN: As n grows, the probability that the sample mean `bar X_n` is close to `mu` goes to 1.
- LoLN: for any `epsilon > 0`, `lim_(n->oo) P(|bar X_n - mu| < epsilon) = 1`
- CLT: As n grows, the distribution of `bar X_n` approaches the normal distribution `N(mu,sigma^2/n)`
- `S_n = sum_(i=1)^n X_i`
- `bar X_n = S_n / n`
- `E(S_n) = mu_(S_n) = n mu`
- `"Var"(S_n) = n sigma^2`
- `sigma_(S_n) = sqrt(n) sigma`
- For large n, Z = standardization = `(S_n - n mu) / (sqrt(n) sigma) = (bar X_n - mu) / (sigma/sqrt(n))`
- Z has mean 0, standard deviation 1
- If the `X_i` are themselves normal, then Z is exactly standard normal.
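A quick R simulation of the CLT: means of uniform(0,1) samples (`mu = 0.5`, `sigma^2 = 1/12`) are approximately `N(mu, sigma^2/n)`:

```r
set.seed(2)
n <- 50
xbar <- replicate(1e4, mean(runif(n)))   # 10,000 sample means of uniform(0,1) data
mean(xbar)    # approximately 0.5     = mu
var(xbar)     # approximately 1/12/n  = 0.00167
# standardized sample means behave like a standard normal
z <- (xbar - 0.5) / (sqrt(1/12) / sqrt(n))
mean(abs(z) <= 1)    # approximately 0.68
```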
Joint Distributions
- range [a,b] x [c,d]
- f(x,y) = joint density
- `F(x,y)="joint cdf"=P(X<=x,Y<=y)=int_c^y int_a^x f(u,v) du dv`
- `f(x,y)=(d^2 F)/(dx dy)(x,y)`
- marginal pdf = `f_X (x) = int_c^d f(x,y) dy`
- marginal cdf = `F_X (x) = F(x,d)`
- X and Y are independent if `f(x,y)=f_X (x) f_Y (y)`
- X and Y are independent if `F(x,y)=F_X (x) F_Y (y)`
Maximum Likelihood Estimates
likelihood function = `f(x_1,...,x_n|p)` = the joint density of the data given the parameter; for independent observations it is the product of the individual densities `f(x_i|p)`.
log likelihood = `ln(f(x_1,...,x_n|p))`
`hat p` = maximum likelihood estimate for p : the value of p where `(del "likelihood function") / (del p) = 0`
Equivalently (and usually easier), solve `(del "log likelihood function") / (del p) = 0`
For uniform distributions, max likelihood results from:
- `hat a = "min"(x_1, ..., x_n)`
- `hat b = "max"(x_1, ..., x_n)`
Notations:
- θ is the value of the hypothesis.
- p(θ) is the prior probability mass function of the hypothesis.
- p(θ|D) is the posterior probability mass function of the hypothesis given the data.
- p(D|θ) is the likelihood function. (This is not a pmf!)
| hypothesis | prior | likelihood | Bayes numerator | posterior | posterior predictive probability | posterior #2 |
|---|---|---|---|---|---|---|
| θ | P(θ) | P(D\|θ) | P(θ) * P(D\|θ) | P(θ\|D) | P(θ\|D) * P(D\|θ) | P(θ\|D2) |
| A | aP | aL | a = aP * aL | aP2 = a / SUM | a2 = aP2 * aL | a2 / SUM2 |
| B | bP | bL | b = bP * bL | bP2 = b / SUM | b2 = bP2 * bL | b2 / SUM2 |
| C | cP | cL | c = cP * cL | cP2 = c / SUM | c2 = cP2 * cL | c2 / SUM2 |
| total | 1 | | SUM | 1 | SUM2 | 1 |
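The same table computed in R with hypothetical priors aP, bP, cP and likelihoods aL, bL, cL:

```r
prior      <- c(A = 0.5, B = 0.3, C = 0.2)   # aP, bP, cP (assumed)
likelihood <- c(A = 0.1, B = 0.4, C = 0.7)   # aL, bL, cL (assumed)

bayes_numerator <- prior * likelihood
posterior <- bayes_numerator / sum(bayes_numerator)       # first update

# second update with the same likelihoods: the posterior becomes the new prior
bayes_numerator2 <- posterior * likelihood
posterior2 <- bayes_numerator2 / sum(bayes_numerator2)
rbind(prior, likelihood, posterior, posterior2)
```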
Law of total probability: prior predictive probability = `p(x) = int_a^b p(x|theta) f(theta) d theta`
| | hypothesis | prior | likelihood | Bayes numerator | posterior |
|---|---|---|---|---|---|
| | H | P(H) | P(D\|H) | P(D\|H)P(H) | P(H\|D) |
| Discrete θ | θ | p(θ) | p(x\|θ) | p(x\|θ)p(θ) | p(θ\|x) |
| Continuous θ | θ | f(θ) dθ | p(x\|θ) | p(x\|θ)f(θ) dθ | f(θ\|x) dθ |
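A sketch of the continuous case in R: the prior predictive probability of the data computed as an integral over θ, using an assumed Beta(2,2) prior and binomial likelihood:

```r
# p(x) = integral of p(x|theta) f(theta) dtheta
# assumed example: theta ~ Beta(2, 2) prior, data x = 3 successes in 10 trials
prior_predictive <- integrate(
  function(theta) dbinom(3, 10, theta) * dbeta(theta, 2, 2),
  lower = 0, upper = 1
)$value
prior_predictive    # approximately 0.11
```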