NEXT: section 9, reading 17a

Set Equations

`P(A|B) = (P(A nn B)) / (P(B))` provided `P(B)!=0`

` = (|A nn B|)/(|B|)` (when all outcomes in the sample space are equally likely)

`"Bayes Theorem" = P(B|A) = (P(A|B)*P(B))/(P(A))`

`"Bayes Theorem" = P(B|A) prop P(A|B)*P(B)` where `prop` means "is proportional to"

`(A uu B)^c = A^c nn B^c`

`(A nn B)^c = A^c uu B^c`

Each of these equivalent statements means A is independent of B:

`P(A nn B) = P(A)*P(B)`

`P(A|B) = P(A)`

`P(B|A) = P(B)`

`"Law of total probability" = P(A) = P(A|B) * P(B) + P(A|B^c)*P(B^c)`

`P(E^c) = "complement of E" = 1 - P(E)`

`O(E^c) = 1 / (O(E))`

`O(E) = "odds" = (P(E)) / (1 - P(E))`

`P(E) = (O(E)) / (1 + O(E))`

`BF = "Bayes factor" = (P(A|H))/(P(A|H^c))`

If evidence C and D are conditionally independent given H (and given `H^c`), then (numeric check in R below):

  • `O(H|C,D) = BF_C * BF_D * O(H)`
  • `ln(O(H|C,D)) = ln(BF_C) + ln(BF_D) + ln(O(H))`
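
A quick numeric check of the odds-update formulas in R (the prior odds and Bayes factors are made-up values):

  O_H  <- 1 / 4                                # made-up prior odds for H
  BF_C <- 3                                    # made-up Bayes factor for evidence C
  BF_D <- 5                                    # made-up Bayes factor for evidence D
  post_odds <- BF_C * BF_D * O_H               # O(H | C, D) = 3.75
  exp(log(BF_C) + log(BF_D) + log(O_H))        # same answer via log odds
  post_odds / (1 + post_odds)                  # back to a probability, ~ 0.79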

Bernoulli Distribution

Range X : 0, 1

`P(X=1) = p`

`P(X=0) = 1 - p`

`X ~ "Bernoulli"(p)` or `"Ber"(p)`

`"Ber"(p) = "Bin"(1,p)`

`E(X) = p`

`Var(X) = (1-p)p`

Binomial Distribution

This is the sum of n independent Bernoulli(p) variables.

Range X : 0, 1, … n

`X ~ "Binomial"(n,p)`

`p(k) = P(X=k) = ((n),(k)) p^k (1-p)^(n-k)` = dbinom(k,n,p)

`p(n) = P(X=n) = p^n`

`E(X) = np`

`Var(X) = np(1-p)`
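
A quick R check with made-up n and p:

  n <- 10; p <- 0.3
  dbinom(3, size = n, prob = p)          # P(X = 3), the pmf at k = 3
  pbinom(3, size = n, prob = p)          # P(X <= 3), the cdf at 3
  sum(dbinom(0:n, size = n, prob = p))   # the pmf sums to 1
  n * p                                  # E(X) = 3
  n * p * (1 - p)                        # Var(X) = 2.1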

Geometric Distribution

This models the number of failures before the first success in a sequence of independent Bernoulli(p) trials (e.g. repeated flips of a coin that lands heads with probability p).

Range X : 0, 1, 2, ...

`X ~ "geometric"(p)` or `"geo"(p)`

`p(k) = P(X=k) = (1-p)^k p` = dgeom(k,p)

`E(X) = (1-p)/p`

`P(X=n+k | X >= n) = P(X=k)` (memorylessness)

`Var(X) = (1-p)/p^2`
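
A quick R check with made-up values of p, n, and k, including the memoryless property:

  p <- 0.3; n <- 4; k <- 2
  dgeom(k, prob = p)                      # P(X = k) = (1-p)^k * p
  dgeom(n + k, prob = p) /
    pgeom(n - 1, prob = p, lower.tail = FALSE)   # P(X = n+k | X >= n): same value
  (1 - p) / p                             # E(X)
  (1 - p) / p^2                           # Var(X)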

Uniform Distribution

Where all outcomes are equally likely.

Range X : `[a, b]`

`p(k) = 1 / (b-a)`

`E(X) = (a + b) / 2`

`X ~ "uniform"(a,b)` or `U(a,b)`

`f(x) = 1 / (b-a)` for `a <= x <= b`

`F(x) = (x-a)/(b-a)` for `a <= x <= b`

`Var(X) = (b - a)^2/12`
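
A quick R check with made-up endpoints:

  a <- 2; b <- 5
  dunif(3, min = a, max = b)   # f(3) = 1/(b-a) = 1/3
  punif(3, min = a, max = b)   # F(3) = (3-a)/(b-a) = 1/3
  (a + b) / 2                  # E(X) = 3.5
  (b - a)^2 / 12               # Var(X) = 0.75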

Exponential Distribution

Models: Waiting times

`X ~ "exponential"(lambda)` or `"exp"(lambda)`

Parameter: `lambda` (called the rate parameter)

Range: `[0,oo)`

Density: `f(x) = lambda e^(-lambda x)` for `x >= 0` = dexp(x,`lambda`)

`F(x) = 1 - e^(-lambda x)`

`E(X) = mu = 1/lambda`
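
A quick R check with a made-up rate:

  lambda <- 2
  dexp(1, rate = lambda)   # f(1) = lambda * e^(-lambda)
  pexp(1, rate = lambda)   # F(1) = 1 - e^(-lambda)
  1 / lambda               # mean waiting time mu = 0.5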

Normal distribution

Models: Measurement error, intelligence/ability, height, averages of lots of data.

  • Range: `(-oo,oo)`
  • Parameters: `mu` `sigma`
  • Notation: `"normal"(mu,sigma^2)` or `N(mu,sigma^2)`
  • Density: `f(x)=1/(sigma sqrt(2pi)) e^(-(x-mu)^2/(2sigma^2))` = dnorm(x,μ,σ)
  • Distribution: F(x) has no formula, so use tables or software such as pnorm in R to compute F(x).
    • pnorm(.6,0,1) returns `P(Z <= 0.6)` for the standard normal distribution (the 0.6 quantile is given by qnorm(.6,0,1)).
  • Standard Normal Distribution: `N(0,1)` has mean 0 and variance 1; its cumulative distribution function is written `Phi(z)`.
  • Standard Normal Density: `phi(z) = 1/sqrt(2pi) e^(-z^2/2)`
  • `N(mu,sigma^2)` has mean `mu`, variance `sigma^2`, and standard deviation `sigma`.
  • `P(-1 <= Z <= 1) ~~ 0.6826895`, `P(-2 <= Z <= 2) ~~ 0.9544997`, `P(-3 <= Z <= 3) ~~ 0.9973002`
  • `Phi(x) = P(Z <= x)`
  • `Phi(1) = P(Z <= 1) ~~ 0.8413447 = "pnorm"(1,0,1)`
  • `Phi(2) = P(Z <= 2) ~~ 0.9772499 = "pnorm"(2,0,1)`
  • `Phi(3) = P(Z <= 3) ~~ 0.9986501 = "pnorm"(3,0,1)`
  • `P(|Z| <= z) = "pnorm"(z) - "pnorm"(-z)`
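
Quick R examples (note that pnorm and dnorm take the standard deviation `sigma`, not the variance):

  pnorm(1, mean = 0, sd = 1)       # Phi(1) ~ 0.8413
  pnorm(1) - pnorm(-1)             # P(-1 <= Z <= 1) ~ 0.6827
  qnorm(0.975)                     # ~ 1.96, the z with Phi(z) = 0.975
  pnorm(115, mean = 100, sd = 15)  # a non-standard normal, made-up IQ-style scale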

Normal-normal update formulas for n data points

The normal distribution is its own conjugate prior (for estimating the mean when the variance is known): a normal prior and a normal likelihood give a normal posterior.

So, if the prior on `theta` is `N(mu_"prior", sigma_"prior"^2)` and each data point is drawn from `N(theta, sigma^2)` (with `sigma` known), then the posterior is `N(mu_"post", sigma_"post"^2)`.

  • `a = 1 / sigma_"prior"^2`
  • `b = n / sigma^2`
  • `bar x = (x_1 + … + x_n) / n`
  • `mu_"post" = (a mu_"prior" + b bar x) / (a + b)`
  • `sigma_"post"^2 = 1 / (a+b)`
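
A small R helper (the function name and numbers are my own) computing the posterior parameters from the formulas above:

  normal_update <- function(mu_prior, sigma2_prior, sigma2, x) {
    n <- length(x)
    a <- 1 / sigma2_prior        # prior precision
    b <- n / sigma2              # data precision
    mu_post     <- (a * mu_prior + b * mean(x)) / (a + b)
    sigma2_post <- 1 / (a + b)
    c(mu_post = mu_post, sigma2_post = sigma2_post)
  }
  # made-up prior N(0, 1), known data variance 4, and three observations
  normal_update(mu_prior = 0, sigma2_prior = 1, sigma2 = 4, x = c(2.1, 1.4, 3.0))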

Beta distribution

`f(theta) = c theta^(a-1) (1-theta)^(b-1)` = dbeta(θ,a,b)

`c = ((a+b-1)!) / ((a-1)!(b-1)!)`

Beta distribution is a conjugate prior of the binomial distribution. This means if the prior is a beta and the likelihood is a binomial then the posterior is also beta.

So, if prior is dbeta(p,a,b) and likelihood is dbinom(k,n,p) then posterior is dbeta(p,a+k,b+n-k).

If prior is dbeta(p,a,b) and likelihood is dgeom(k,p) then posterior is dbeta(p,a+1,b+k).
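
A quick R check of the beta-binomial update (made-up prior parameters and data): the ratio below is the same for every p, so prior times likelihood is proportional to the stated posterior.

  a <- 2; b <- 3                  # made-up prior parameters
  n <- 10; k <- 7                 # made-up data: 7 successes in 10 trials
  p <- c(0.2, 0.5, 0.8)
  dbeta(p, a, b) * dbinom(k, n, p) / dbeta(p, a + k, b + n - k)
  # constant in p: the common value is the normalizing constant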

Discrete Random Variables

A random variable X assigns a number to each outcome: `X : Omega -> R`

`X = a " denotes event " {omega | X(omega) = a}`

`"probability mass function (pmf) of X is given by: " p(a) = P(X=a)`

`"Cumulative distribution function (cdf) of X is given by: " F(a) = P(X<=a)`

Continuous random variables

`"Cumulative distribution function (cdf)" = F(x) = P(X<=x) = int_-oo^x f(t) dt`

`"Probability density function (pdf)": f(x) >= 0 " with " P(c<=X<=d) = int_c^d f(x) dx`

  • cdf of X is `F_X(x)=P(X <= x)`
  • pdf of X is `f_X(x)=F'_X(x)`
Properties of the cdf (Same as for discrete distributions)
  • (Definition) `F(x) = P(X<=x)`
  • `0 <= F(x) <= 1`
  • non-decreasing
  • `lim_(x->-oo) F(x) = 0`
  • `lim_(x->oo) F(x) = 1`
  • `P(c < X <= d) = F(d) - F(c)`
  • `F'(x) = f(x)`

Expected Value (mean or average)

  • weighted average = `E(X) = sum_(i=1)^n x_i * p(x_i)`
  • `E(X+Y) = E(X) + E(Y)`
  • `E(aX+b) = a*E(X) + b`
  • `E(h(X)) = sum_i h(x_i) * p(x_i)`
  • `E(X-mu_x) = 0`
  • `E(X) = int_a^b x * p(x) dx` (units for p(x) are probability/dx).
  • `E(XY) = int_c^d int_a^b xy * f(x,y) dx dy` where `f(x,y)` is the joint density (see Joint Distributions below)

Variance

  • `"mean" = E(X) = mu`
  • variance of X = `Var(X) = E((x-mu)^2) = sigma^2 = sum_(i=1)^n p(x_i)(x_i-mu)^2`
  • standard deviation = `sigma = sqrt(Var(X))`
  • `Var(aX+b) = a^2 Var(X)`
  • `Var(X) = E(X^2) - E(X)^2 = E(X^2) - mu^2`
  • If X and Y are independent then: `Var(X+Y) = Var(X) + Var(Y)`
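
A quick R check of the shortcut formula on a small made-up pmf:

  x <- c(0, 1, 2); p <- c(0.25, 0.5, 0.25)   # made-up pmf
  mu <- sum(x * p)                            # E(X) = 1
  sum(p * (x - mu)^2)                         # Var(X) from the definition = 0.5
  sum(x^2 * p) - mu^2                         # same via E(X^2) - E(X)^2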

Covariance

Measure of how much two random variables vary together.

A positive value means an increase in one tends to go with an increase in the other; a negative value means an increase in one tends to go with a decrease in the other. A value of 0 does not mean they are independent, though (e.g. `Y=X^2` with X symmetric about 0 has covariance 0 even though Y is determined by X).

  • `"Cov"(X,Y)="Covariance"=E((X-mu_X)(Y-mu_Y))`
  • `"Cov"(X,Y)=int_c^d int_a^b (x-mu_x)(y-mu_y)f(x,y)dxdy`
  • ` = (int_c^d int_a^b xy f(x,y) dxdy) - mu_x mu_y`
Properties:
  • `"Cov"(aX+b, cY+d) = ac"Cov"(X,Y)`
  • `"Cov"(X_1+X_2,Y) = "Cov"(X_1,Y) + "Cov"(X_2,Y)`
  • `"Cov"(X,X) = "Var"(X)`
  • `"Cov"(X,Y) = E(XY) - mu_X mu_Y`
  • `"Var"(X+Y) = "Var"(X) + "Var"(Y) + 2"Cov"(X,Y)`
  • If X and Y are independent then Cov(X,Y) = 0.

Correlation Coefficient

Measures the strength of the linear relationship between variables; it does not capture higher-order relationships (like `Y=X^2`).

`rho` is the covariance of the standardizations of X and Y.

  • `"Cor"(X,Y) = rho = ("Cov"(X,Y))/(sigma_X sigma_Y)`
  • `rho` is dimensionless (it's a ratio)
  • `-1 <= rho <= 1`
  • `rho = +1` iff `Y=aX+b` with a > 0
  • `rho = -1` iff `Y=aX+b` with a < 0
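
A small R simulation (simulated data, my own choice of examples) showing that correlation only captures linear relationships:

  set.seed(1)
  x <- rnorm(1e5)
  cor(x, 3 * x + 2)   # ~ +1: exact linear relation with a > 0
  cor(x, x^2)         # ~ 0 even though Y = X^2 is completely determined by X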

Quantiles

The 60th percentile is the same as the 0.60 quantile and the 6th decile. The 3rd quartile would be 75th percentile.

  • median is x for which `P(X<=x) = P(X>=x)`
  • median is when cdf `F(x) = P(X<=x) = .5`
  • The pth quantile of X is the value `q_p` such that `F(q_p)=P(X<=q_p)=p`. In this notation `q_.5` is the median.
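
In R, qnorm gives quantiles of the normal distribution (it is the inverse of pnorm):

  qnorm(0.5)          # median of N(0,1) = 0
  qnorm(0.75)         # 3rd quartile (75th percentile) ~ 0.674
  pnorm(qnorm(0.6))   # 0.6: qnorm undoes pnorm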

Central Limit Theorem & Law of Large Numbers

  • LoLN: As n grows, the probability that the sample mean `bar X_n` is close to `mu` goes to 1.
  • LoLN: `lim_(n->oo) P(|bar X_n - mu| < alpha) = 1` for any fixed `alpha > 0`
  • CLT: As n grows, the distribution of `bar X_n` is well approximated by the normal distribution `N(mu,sigma^2/n)`
  • `E(S_n) = mu_S = n mu`
  • `"Var"(S_n) = n sigma^2`
  • `sigma_(S_n) = sqrt(n) sigma`
  • `S_n = sum_(i=1)^n X_i`
  • `bar X_n = S_n / n`
  • For large n, Z = standardization of `S_n` = `(S_n - mu_(S_n))/sigma_(S_n) = (S_n - n mu) / (sqrt(n) sigma) = (bar X_n - mu) / (sigma/sqrt(n))`
  • Z has mean 0, standard deviation 1
  • If the `X_i` are normal, then Z has exactly the standard normal distribution
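
A small R simulation sketch of the CLT, using exponential(1) samples (so `mu = sigma = 1`) and a made-up sample size:

  set.seed(42)
  n <- 50
  xbar <- replicate(1e4, mean(rexp(n, rate = 1)))   # many sample means
  mean(xbar)   # ~ mu = 1
  sd(xbar)     # ~ sigma / sqrt(n) = 1 / sqrt(50) ~ 0.14
  hist(xbar)   # roughly bell-shaped, close to N(mu, sigma^2 / n)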

Joint Distributions

  • range [a,b] x [c,d]
  • f(x,y) = joint density
  • `F(x,y)="joint cdf"=P(X<=x,Y<=y)=int_c^y int_a^x f(u,v) du dv`
  • `f(x,y)=(del^2 F)/(del x del y)(x,y)`
  • marginal pdf = `f_X (x) = int_c^d f(x,y) dy`
  • marginal cdf = `F_X (x) = F(x,d)`
  • X and Y are independent if `f(x,y)=f_X (x) f_Y (y)`
  • X and Y are independent if `F(x,y)=F_X (x) F_Y (y)`

Maximum Likelihood Estimates

likelihood function = `f(x_1,...,x_n|p)` = the individual densities (or pmfs) multiplied together (when the observations are independent)

log likelihood = `ln(f(x_1,...,x_n|p))`

`hat p` = maximum likelihood estimate for p : `(d "likelihood function") / (d p) = 0`

`hat p` = maximum likelihood estimate for p : `(d "log likelihood function") / (d p) = 0` (the log turns products into sums and has the same maximizer)
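
A numeric sketch in R: maximize the log likelihood for made-up Bernoulli data (the data and interval bounds are my own choices); the answer matches the analytic MLE `hat p = bar x`:

  x <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)   # made-up Bernoulli data
  loglik <- function(p) sum(dbinom(x, size = 1, prob = p, log = TRUE))
  optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
  # ~ 0.7, the same as mean(x)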

For uniform distributions, max likelihood results from:

  • `hat a = "min"(x_1, ..., x_n)`
  • `hat b = "max"(x_1, ..., x_n)`
Notations:
  • θ is the value of the hypothesis.
  • p(θ) is the prior probability mass function of the hypothesis.
  • p(θ|D) is the posterior probability mass function of the hypothesis given the data.
  • p(D|θ) is the likelihood function. (This is not a pmf!)
hypothesis | prior | likelihood | Bayes numerator | posterior | Bayes numerator #2 | posterior #2
θ | P(θ) | P(D|θ) | P(θ) * P(D|θ) | P(θ|D) | P(θ|D) * P(D|θ) | P(θ|D2)
A | aP | aL | a = aP * aL | aP2 = a / SUM | a2 = aP2 * aL | a2 / SUM2
B | bP | bL | b = bP * bL | bP2 = b / SUM | b2 = bP2 * bL | b2 / SUM2
C | cP | cL | c = cP * cL | cP2 = c / SUM | c2 = cP2 * cL | c2 / SUM2
total | 1 |  | SUM | 1 | SUM2 | 1

The last two columns carry out a second update using the posterior as the new prior: each Bayes numerator #2 entry is posterior times likelihood, and the column total SUM2 is the posterior predictive probability of the new data point.
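
A small R version of such a discrete update table (the coin hypotheses, flat prior, and single-toss data are made up):

  theta      <- c(0.25, 0.5, 0.75)                 # made-up hypotheses for a coin's p
  prior      <- rep(1/3, 3)                        # P(theta)
  likelihood <- dbinom(1, size = 1, prob = theta)  # P(D | theta) for data = one head
  numerator  <- prior * likelihood                 # Bayes numerator
  posterior  <- numerator / sum(numerator)         # P(theta | D), sums to 1
  data.frame(theta, prior, likelihood, numerator, posterior)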

Law of total probability: prior predictive probability = `p(x) = int_a^b p(x|theta) f(theta) d theta`

 | hypothesis | prior | likelihood | Bayes numerator | posterior
 | H | P(H) | P(D|H) | P(D|H)P(H) | P(H|D)
Discrete θ | θ | p(θ) | p(x|θ) | p(x|θ)p(θ) | p(θ|x)
Continuous θ | θ | f(θ) dθ | p(x|θ) | p(x|θ)f(θ) dθ | f(θ|x) dθ
