Sampling and Point estimation
Point Estimation
Back to "Bayesics"
Although an event may give the observed evidence the highest likelihood (maximum likelihood), we wouldn't choose it if the event itself is unlikely.
So the conditional probability alone is not what we want to maximize; instead, we want to maximize the probability of the two events happening together, $P(A \cap B) = P(A|B)P(B)$, and maximizing this product over candidate events is exactly what Bayes' theorem formalizes.
With the given example, seeing popcorn on the floor is more likely during a popcorn-throwing contest than during a movie. Still, the prior probability of a popcorn-throwing contest is much lower than that of a movie, since watching movies is common and popcorn-throwing contests are not.
So we multiply each event's prior probability by the conditional probability of the evidence, and that product is what we want to maximize in the end; this is Bayes' theorem.
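The comparison above can be sketched numerically. All probabilities here are made-up illustrative values (not from the course), chosen so the likelihood favors the contest while the joint probability favors the movie:

```python
# Hypothetical numbers for the popcorn example: the likelihood
# P(popcorn on floor | event) favors the contest, but the prior
# P(event) favors the movie, so the joint probability picks the movie.
priors = {"movie": 0.9, "popcorn contest": 0.1}          # P(event), assumed
likelihoods = {"movie": 0.5, "popcorn contest": 0.99}    # P(popcorn | event), assumed

# Joint probability P(popcorn and event) = P(popcorn | event) * P(event)
joints = {event: likelihoods[event] * priors[event] for event in priors}

best_by_likelihood = max(likelihoods, key=likelihoods.get)
best_by_joint = max(joints, key=joints.get)
print(best_by_likelihood)  # popcorn contest
print(best_by_joint)       # movie
```

The likelihood alone would pick the contest; weighting by the prior flips the decision to the movie, which is the point of the example.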
Bayesian Statistics - Frequentist vs. Bayesian
Bayesian Statistics - MAP
Non-informative prior: Since there’s no information, the probabilities are all equal (uniform).
With a conservative prior belief, new evidence barely moves the probabilities.
Maximum a posteriori (MAP) estimation picks the hypothesis with the highest posterior probability (we update our belief to the option the posterior favors most).
Bayesian Statistics - Updating Priors
Here, the priors of 0.75 and 0.25 update to the posteriors of 0.652 and 0.348.
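This update can be reproduced in a few lines. The priors 0.75 and 0.25 are from the notes; the likelihoods 0.5 and 0.8 are hypothetical values chosen so that the posteriors come out to 0.652 and 0.348:

```python
# Bayes update: posterior = prior * likelihood / evidence.
priors = [0.75, 0.25]
likelihoods = [0.5, 0.8]  # assumed P(evidence | hypothesis), not from the notes

unnormalized = [p * l for p, l in zip(priors, likelihoods)]
evidence = sum(unnormalized)                  # P(evidence), the normalizer
posteriors = [u / evidence for u in unnormalized]
print([round(p, 3) for p in posteriors])      # [0.652, 0.348]
```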
Bayesian Statistics - Full Worked Example
We use Bayes’ theorem with the given situation to update the prior to a posterior.
With the coin example given above, we represent the probability of heads with a continuous random variable $\Theta$ because for now, we do not know what it is.
We collect the data of coin flips into a random variable $\bf{X}$, which has data from $X_1$ to $X_{10}$, representing each coin flip with 1 if heads and 0 if tails.
If we were to know the probability of heads, we could model each flip as a Bernoulli distribution.
Since the coin flips are independent, the joint probability of this outcome is a multiplication of all flips.
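The product described above can be written out explicitly, using $k$ for the number of heads among the $n = 10$ flips:

$$P(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{k}(1-\theta)^{n-k}$$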
If our prior (belief) of a coin flip was that anything could happen, then our prior would be the uniform distribution, any number between 0 and 1.
Luis drops the $0 \le \theta \le 1$ condition from the uniform PDF for theta to simplify the notation; the density value of $1$ still appears in both the prior and the posterior.
The new expression says the posterior PDF (probability density function) on the left is proportional to the product on the right.
The second factor on the right, $1$, is the uniform prior: it assigns the same density to every candidate value of $\theta$ (every model), and the result is proportional to a PDF.
To summarize our beliefs in a single number, we use MAP (maximum a posteriori), which is the mode of the posterior distribution (the value of theta that maximizes the posterior).
This is also why we can ignore the normalizing constant from the previous expression: the constant in the denominator only scales the posterior PDF and changes neither the shape of the curve nor the value of theta at which the maximum occurs.
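As a sketch, a brute-force grid search recovers the MAP estimate under a uniform prior. The count of 8 heads in 10 flips is an assumption, chosen so the two batches add up to the 14-of-20 total quoted later in these notes:

```python
# MAP estimate for the coin under a uniform prior, by grid search.
# 8 heads in 10 flips is an assumed count, not stated in the course notes.
n, k = 10, 8

# With a uniform prior, the posterior is proportional to the likelihood
# theta^k * (1 - theta)^(n - k); the normalizing constant is irrelevant
# when we only need the location of the maximum.
grid = [i / 1000 for i in range(1001)]
posterior = [t**k * (1 - t)**(n - k) for t in grid]
theta_map = grid[posterior.index(max(posterior))]
print(theta_map)  # 0.8, i.e. k/n
```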
Starting from a prior with a different distribution is the same as updating the posterior we already obtained from the uniform prior, so we can continue from the previous result to get the new posterior.
The results highlight two important elements of Bayesian statistics:
- This time the prior was informative and it affected the result: a frequentist looking only at this batch of flips would have estimated 6/10 or 0.6.
- Out of 20 total coin flips, 14 were heads, so a frequentist using all the data would get 14/20 or 0.7, which is the same conclusion the Bayesian with this prior reaches. This shows that whether we incorporate all the data at once or in chunks, our final posterior beliefs will be the same.
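This batch-vs-sequential equivalence can be sketched with the conjugate Beta prior. The split of 8 heads then 6 heads across the two batches is an assumption consistent with the 14-of-20 total:

```python
# Sequential vs. batch updating with a Beta prior.
# Beta(a, b) has mode (a - 1) / (a + b - 2), which is the MAP estimate.
def update(a, b, heads, tails):
    # Beta is conjugate to the Bernoulli: counts simply add to the parameters.
    return a + heads, b + tails

def mode(a, b):
    return (a - 1) / (a + b - 2)

# Sequential: uniform prior Beta(1, 1), then batch 1 (8H/2T), then batch 2 (6H/4T).
a, b = update(1, 1, 8, 2)        # posterior after batch 1: Beta(9, 3), mode 0.8
a, b = update(a, b, 6, 4)        # posterior after batch 2: Beta(15, 7)

# Batch: all 20 flips at once against the same uniform prior.
a2, b2 = update(1, 1, 14, 6)     # Beta(15, 7) again

print(mode(a, b), mode(a2, b2))  # both 0.7, i.e. 14/20
```

Either way the final posterior is Beta(15, 7), so the MAP estimate is the same 0.7.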
Relationship between MAP, MLE, and Regularization
Like the popcorn-throwing example, if we account for only the conditional probability, then the most complex model (model 3) will be selected.
However, when we also account for the prior probability of each model along with the conditional probabilities, model 2 comes out with the highest probability.
The probability of a model, P(model), is the prior of the model; we obtain it by evaluating the model's coefficients under the standard normal distribution.
The likelihood P(data | model) comes from evaluating the distances between the data points and the model's predictions under the standard normal distribution.
The goal is to maximize this product. Taking logs, each Gaussian contributes a negative squared term, so maximizing the posterior is equivalent to minimizing the sum of squared distances plus the sum of squared coefficients (the inner formula).
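A minimal sketch of this trade-off, using made-up fit and complexity numbers for the three models (the specific values are illustrative assumptions, not from the course):

```python
# Hypothetical summary numbers for three candidate models: sum of squared
# distances to the data (smaller = better fit) and sum of squared
# coefficients (larger = more complex model).
models = {
    "model 1": {"sq_distances": 9.0, "sq_coeffs": 0.5},   # underfits
    "model 2": {"sq_distances": 2.0, "sq_coeffs": 1.5},   # balanced
    "model 3": {"sq_distances": 0.1, "sq_coeffs": 20.0},  # overfits
}

# MLE looks only at the fit term, so it picks the most complex model.
mle_choice = min(models, key=lambda m: models[m]["sq_distances"])

# MAP minimizes fit + penalty (the negative log posterior under
# standard-normal assumptions on residuals and coefficients).
map_choice = min(models, key=lambda m: models[m]["sq_distances"] + models[m]["sq_coeffs"])

print(mle_choice)  # model 3
print(map_choice)  # model 2
```

The coefficient penalty is exactly an L2 regularization term, which is the promised link between MAP and regularization.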
All the information provided is based on the course Probability & Statistics for Machine Learning & Data Science (Coursera, DeepLearning.AI).