Sampling and Point estimation
Point Estimation
Back to "Bayesics"
Although an event may give the observed evidence the highest likelihood (maximum likelihood), we wouldn't choose it if the event itself is unlikely.
So the conditional probability alone is not what we want to maximize; instead, we want to maximize the probability of the two events happening together, $P(A \cap B) = P(A|B)P(B)$, and maximizing this product over candidate events is exactly what Bayes' theorem formalizes.
With the given example, seeing popcorn on the floor is more likely during a popcorn-throwing contest than during a movie. Still, the prior probability of a popcorn-throwing contest is much lower than that of a movie, since watching movies is common and popcorn-throwing contests are not.
So we multiply each event's prior probability by the conditional probability of the evidence, and that product is what we want to maximize in the end; this is Bayes' theorem.
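The comparison above can be sketched numerically. All probabilities here are made-up illustrative values (not from the course), chosen so the likelihood favors the contest while the joint probability favors the movie:

```python
# Hypothetical numbers for the popcorn example: the likelihood
# P(popcorn on floor | event) favors the contest, but the prior
# P(event) favors the movie, so the joint probability picks the movie.
priors = {"movie": 0.9, "popcorn contest": 0.1}          # P(event), assumed
likelihoods = {"movie": 0.5, "popcorn contest": 0.99}    # P(popcorn | event), assumed

# Joint probability P(popcorn and event) = P(popcorn | event) * P(event)
joints = {event: likelihoods[event] * priors[event] for event in priors}

best_by_likelihood = max(likelihoods, key=likelihoods.get)
best_by_joint = max(joints, key=joints.get)
print(best_by_likelihood)  # popcorn contest
print(best_by_joint)       # movie
```

The likelihood alone would pick the contest; weighting by the prior flips the decision to the movie, which is the point of the example.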
Bayesian Statistics - Frequentist vs. Bayesian
Bayesian Statistics - MAP
Non-informative prior: Since there’s no information, the probabilities are all equal (uniform).
With a conservative prior belief, new evidence barely moves the probabilities.
Maximum a posteriori (MAP) estimation picks the hypothesis with the highest posterior probability (we update our belief to the option the posterior favors most).
Bayesian Statistics - Updating Priors
Here, the priors of 0.75 and 0.25 update to the posteriors of 0.652 and 0.348.
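This update can be reproduced in a few lines. The priors 0.75 and 0.25 are from the notes; the likelihoods 0.5 and 0.8 are hypothetical values chosen so that the posteriors come out to 0.652 and 0.348:

```python
# Bayes update: posterior = prior * likelihood / evidence.
priors = [0.75, 0.25]
likelihoods = [0.5, 0.8]  # assumed P(evidence | hypothesis), not from the notes

unnormalized = [p * l for p, l in zip(priors, likelihoods)]
evidence = sum(unnormalized)                  # P(evidence), the normalizer
posteriors = [u / evidence for u in unnormalized]
print([round(p, 3) for p in posteriors])      # [0.652, 0.348]
```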
Bayesian Statistics - Full Worked Example
We use Bayes’ theorem with the given situation to update the prior to a posterior.
With the coin example given above, we represent the probability of heads with a continuous random variable $\Theta$ because for now, we do not know what it is.
We collect the data of coin flips into a random variable $\bf{X}$, which has data from $X_1$ to $X_{10}$, representing each coin flip with 1 if heads and 0 if tails.
If we were to know the probability of heads, we could model each flip as a Bernoulli distribution.
Since the coin flips are independent, the joint probability of this outcome is a multiplication of all flips.
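The product described above can be written out explicitly, using $k$ for the number of heads among the $n = 10$ flips:

$$P(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{k}(1-\theta)^{n-k}$$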
If our prior (belief) of a coin flip was that anything could happen, then our prior would be the uniform distribution, any number between 0 and 1.
Luis drops the $0 \le \theta \le 1$ condition from the uniform PDF for theta to simplify the notation; the density value of $1$ still appears in both the prior and the posterior.
The new expression says the posterior PDF (probability density function) on the left is proportional to the product on the right.
The second factor on the right, $1$, is the uniform prior: it assigns the same density to every candidate value of $\theta$ (every model), and the result is proportional to a PDF.
To summarize our beliefs in a single number, we use MAP (maximum a posteriori), which is the mode of the posterior distribution (the value of theta that maximizes the posterior).
This is also why we can ignore the normalizing constant from the previous expression: the constant in the denominator only scales the posterior PDF and changes neither the shape of the curve nor the value of theta at which the maximum occurs.
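As a sketch, a brute-force grid search recovers the MAP estimate under a uniform prior. The count of 8 heads in 10 flips is an assumption, chosen so the two batches add up to the 14-of-20 total quoted later in these notes:

```python
# MAP estimate for the coin under a uniform prior, by grid search.
# 8 heads in 10 flips is an assumed count, not stated in the course notes.
n, k = 10, 8

# With a uniform prior, the posterior is proportional to the likelihood
# theta^k * (1 - theta)^(n - k); the normalizing constant is irrelevant
# when we only need the location of the maximum.
grid = [i / 1000 for i in range(1001)]
posterior = [t**k * (1 - t)**(n - k) for t in grid]
theta_map = grid[posterior.index(max(posterior))]
print(theta_map)  # 0.8, i.e. k/n
```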
Starting from a prior with a different distribution is the same as updating the posterior we already obtained from the uniform prior, so we can continue from the previous result to get the new posterior.
The results highlight two important elements of Bayesian statistics:
- This time the prior was informative and it affected the result: a frequentist looking only at this batch of flips would have estimated 6/10 or 0.6.
- Out of 20 total coin flips, 14 were heads, so a frequentist using all the data would get 14/20 or 0.7, which is the same conclusion the Bayesian with this prior reaches. This shows that whether we incorporate all the data at once or in chunks, our final posterior beliefs will be the same.
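This batch-vs-sequential equivalence can be sketched with the conjugate Beta prior. The split of 8 heads then 6 heads across the two batches is an assumption consistent with the 14-of-20 total:

```python
# Sequential vs. batch updating with a Beta prior.
# Beta(a, b) has mode (a - 1) / (a + b - 2), which is the MAP estimate.
def update(a, b, heads, tails):
    # Beta is conjugate to the Bernoulli: counts simply add to the parameters.
    return a + heads, b + tails

def mode(a, b):
    return (a - 1) / (a + b - 2)

# Sequential: uniform prior Beta(1, 1), then batch 1 (8H/2T), then batch 2 (6H/4T).
a, b = update(1, 1, 8, 2)        # posterior after batch 1: Beta(9, 3), mode 0.8
a, b = update(a, b, 6, 4)        # posterior after batch 2: Beta(15, 7)

# Batch: all 20 flips at once against the same uniform prior.
a2, b2 = update(1, 1, 14, 6)     # Beta(15, 7) again

print(mode(a, b), mode(a2, b2))  # both 0.7, i.e. 14/20
```

Either way the final posterior is Beta(15, 7), so the MAP estimate is the same 0.7.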
Relationship between MAP, MLE, and Regularization
Like the popcorn-throwing example, if we account for only the conditional probability, then the most complex model (model 3) will be selected.
However, when we also account for the prior probability of each model along with the conditional probabilities, model 2 comes out with the highest probability.
The probability of a model, P(model), is the prior of the model; we obtain it by evaluating the model's coefficients under the standard normal distribution.
The likelihood P(data | model) comes from evaluating the distances between the data points and the model's predictions under the standard normal distribution.
The goal is to maximize this product. Taking logs, each Gaussian contributes a negative squared term, so maximizing the posterior is equivalent to minimizing the sum of squared distances plus the sum of squared coefficients (the inner formula).
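A minimal sketch of this trade-off, using made-up fit and complexity numbers for the three models (the specific values are illustrative assumptions, not from the course):

```python
# Hypothetical summary numbers for three candidate models: sum of squared
# distances to the data (smaller = better fit) and sum of squared
# coefficients (larger = more complex model).
models = {
    "model 1": {"sq_distances": 9.0, "sq_coeffs": 0.5},   # underfits
    "model 2": {"sq_distances": 2.0, "sq_coeffs": 1.5},   # balanced
    "model 3": {"sq_distances": 0.1, "sq_coeffs": 20.0},  # overfits
}

# MLE looks only at the fit term, so it picks the most complex model.
mle_choice = min(models, key=lambda m: models[m]["sq_distances"])

# MAP minimizes fit + penalty (the negative log posterior under
# standard-normal assumptions on residuals and coefficients).
map_choice = min(models, key=lambda m: models[m]["sq_distances"] + models[m]["sq_coeffs"])

print(mle_choice)  # model 3
print(map_choice)  # model 2
```

The coefficient penalty is exactly an L2 regularization term, which is the promised link between MAP and regularization.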
All the information provided is based on the course Probability & Statistics for Machine Learning & Data Science (Coursera, DeepLearning.AI).