Rediscovering Entropy


Entropy

Entropy is the core idea of information theory, and it tells us how much uncertainty is present in some events. Let’s look at an example. Imagine you are playing some gacha game, looking for new items:

| Rarity | Drop rate |
|--------|-----------|
| Gold   | 5%        |
| Silver | 20%       |
| Bronze | 75%       |

When a gold item drops, you will be more surprised, right? Entropy can be explained pretty well with the notion of surprise. If you repeat the action multiple times, then entropy is the average surprise that you feel.

So what is the formula for entropy? From that example alone, we can make an educated guess. Surprise is, roughly, the inverse of probability: when the probability is low, the surprise is high, and vice versa. The simplest "inverse" we can think of right now is the reciprocal.

$$\text{surprise} \approx \frac{1}{\text{prob}(x)}$$

But it turns out the more suitable definition is the log base 2 of the reciprocal.

$$\text{surprise} = \log_2 \left( \frac{1}{\text{prob}(x)} \right)$$

The reason we put a log here is not really important right now. We will see why as we read on.
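As a quick sanity check, here is a small Python sketch that computes the surprise for each rarity in the drop table above:

import math

drop_rates = {"gold": 0.05, "silver": 0.20, "bronze": 0.75}

for rarity, prob in drop_rates.items():
    surprise = math.log2(1 / prob)
    print(f"{rarity}: {surprise:.2f} bits of surprise")

# gold: 4.32, silver: 2.32, bronze: 0.42
# the rarer the item, the bigger the surprise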

Going back to the first sentence of this post, entropy can be used to measure uncertainty. Entropy works not only for the drop rate table above, but for any discrete probability distribution. This type of distribution is called discrete because it has a countable set of outcomes (gold, silver, bronze). Another super popular discrete probability distribution is a coin flip with heads and tails.

Imagine you have 2 coins. The first one is a fair coin with a heads-to-tails ratio of 50:50, and the second one is an unfair coin with a heads-to-tails ratio of 999:1. Can you feel that the first coin is more uncertain than the second one? With the second coin we can guess heads every time and win almost always, so it has less uncertainty. Feeling alone is not enough to quantify the coins' entropy, though. We need to calculate it.

Remember, entropy is the average of surprise. For now, let's use $H$ for entropy, $p$ for probability, and $X$ for the probability distribution; the base 2 will also be omitted.

$$\begin{align*} H(X_{\text{fair}}) &= p(\text{head}) \log \left( \frac{1}{p(\text{head})} \right) + p(\text{tail}) \log \left( \frac{1}{p(\text{tail})} \right) \\
&= -p(\text{head}) \log (p(\text{head})) - p(\text{tail}) \log (p(\text{tail})) \\
&= -\frac{1}{2} \log \left(\frac{1}{2}\right) - \frac{1}{2} \log \left(\frac{1}{2}\right) \\
&= 1 \end{align*}$$

The fair coin has an entropy of 1. What we just did is take the average of the surprise. Since $p(\text{head}) + p(\text{tail}) = 1$, this is a weighted average. The negative signs come from the logarithm identities. Now, let's calculate the unfair coin's entropy. We will also generalize the formula a little bit because, as you read before, entropy works for any probability distribution, not just for coins.

$$\begin{align*} H(X_{\text{unfair}}) &= -\sum_{x \in X} p(x) \log (p(x)) \\
&= -p(\text{head}) \log (p(\text{head})) - p(\text{tail}) \log (p(\text{tail})) \\
&= -\frac{999}{1000} \log \left(\frac{999}{1000}\right) - \frac{1}{1000} \log \left(\frac{1}{1000}\right) \\
&= 0.0114 \end{align*}$$

$x \in X$ is similar to for x in ["head", "tail"] in Python. We are just rewriting the formula of entropy so that it works for any discrete set of outcomes.

The unfair coin has an entropy of 0.0114. Since entropy measures uncertainty, the first coin with an entropy of 1 is more uncertain than the second coin. Then there is this argument: "If I happen to get a tail with the unfair coin, I will be super surprised. Why doesn't the entropy reflect that?" Well, it does! $-\log_2(1/1000) \approx 9.97$ is quite high, but thanks to the weighted average (the probability), this high value gets diluted because you see tails so rarely compared to heads.
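Here is a small Python sketch of the general formula, applied to both coins; the numbers match the calculations above:

import math

def entropy(dist):
    # H(X) = -sum_x p(x) * log2(p(x)), the average surprise
    return -sum(p * math.log2(p) for p in dist.values())

fair_coin = {"head": 0.5, "tail": 0.5}
unfair_coin = {"head": 0.999, "tail": 0.001}

print(entropy(fair_coin))    # 1.0
print(entropy(unfair_coin))  # ~0.0114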

We can see that probability affects entropy. How does the entropy evolve as the probability changes? Below is the relationship between them for a probability distribution with 2 outcomes (e.g. a coin): entropy is highest when heads and tails are equally likely, a fair 50:50 coin.
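That curve is easy to compute yourself; a small sketch that sweeps p(head) from 0 to 1:

import math

def binary_entropy(p_head):
    # entropy of a coin with p(head) = p_head and p(tail) = 1 - p_head
    if p_head in (0.0, 1.0):
        return 0.0  # no uncertainty at all
    p_tail = 1 - p_head
    return -(p_head * math.log2(p_head) + p_tail * math.log2(p_tail))

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p(head) = {p:.2f} -> entropy = {binary_entropy(p):.3f}")

# the entropy peaks at exactly 1 bit when p(head) = 0.5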

The last question: why do we need to use log base 2? In Shannon's original paper, the unit of entropy is "bits". This has a nice intuition. Entropy can also be described as the average number of bits needed to represent one event from a probability distribution. What does it mean to represent? Represent is just another word for encode. Below we will see why that makes sense.

Imagine you have 64 balls numbered from 0 to 63. If you put them all in a bag, then the probability of getting any specific number is uniform, that is $p(X=x)=\frac{1}{64}$. Now your friend grabs a ball from the bag and asks you to figure out its number by asking yes/no questions. What is the best way to correctly guess the ball with the fewest possible questions? Brute force is one way to do it; just ask him like this:

| Your question           | Answer |
|-------------------------|--------|
| Is the ball number 0?   | No     |
| Is the ball number 1?   | No     |
| Is the ball number 47?  | Yes    |

(you stop asking now)

With this strategy, you will be asking about 32 questions on average. We can do better. By asking your friend to partition the remaining balls into 2 groups every time, you can pin down the number with just 6 questions.

| Your question             | Answer                      |
|---------------------------|-----------------------------|
| Is the ball number 0-31?  | No (then it must be 32-63)  |
| Is the ball number 32-47? | Yes (then it must be 32-47) |
| Is the ball number 32-39? | No (then it must be 40-47)  |
| Is the ball number 40-43? | No (then it must be 44-47)  |
| Is the ball number 44-45? | No (then it must be 46-47)  |
| Is the ball number 46?    | No (then the answer is 47)  |

If you look at the answers above, every single ball from 0 to 63 gets a distinct yes-no pattern. This is what it means to represent, or to encode: each yes-no sequence uniquely maps to one ball. Moreover, since we are halving the possible outcomes at every step, this is where log base 2 comes from. For every bit (answer) received, we cut the number of possibilities in half. And indeed, $\log_2(64) = 6$ questions, which is exactly the entropy of a uniform distribution over 64 outcomes.
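That question strategy is just binary search. A small sketch that plays the game and records the answers as bits (yes = 1, no = 0), for the same ball 47 as in the table:

def guess(target, lo=0, hi=63):
    # repeatedly ask "is the ball in the lower half?" until one number remains,
    # recording each answer as a bit (yes = 1, no = 0)
    bits = []
    while lo < hi:
        mid = (lo + hi) // 2
        if target <= mid:
            bits.append(1)
            hi = mid
        else:
            bits.append(0)
            lo = mid + 1
    return bits

print(guess(47))  # [0, 1, 0, 0, 0, 0] -- the same 6 answers as in the table above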

Cross entropy

In the section above, we have been playing with perfect information, i.e. we always know the probability beforehand. Now, assume the true probability is hidden from us.

| Weather      | True probability | Assumption |
|--------------|------------------|------------|
| Sunny        | 50%              | 25%        |
| Overcast     | 12.5%            | 25%        |
| Rain         | 25%              | 25%        |
| Thunderstorm | 12.5%            | 25%        |

What is the entropy of the weather? Entropy is calculated with the true probability. Turns out the entropy is 1.75 bits.

$$H(X_{\text{weather}}) = -\sum_{x \in X} p(x) \log (p(x)) = 1.75$$

"The more the assumption differs from the truth, the more surprised one feels." Think about that. Suppose we calculate the average surprise anyway, but using the assumed probabilities; since the assumption does not match the true probability, the surprise we get should be bigger than 1.75. $H_{\text{assumption}}$ is not the same as $H$!

$$H_{\text{assumption}}(X_{\text{weather}}) = -\sum_{x \in X} p_{\text{true}}(x) \log (p_{\text{assumption}}(x)) = 2$$

We changed the probability inside the log, not the one outside, because (remember the formula for surprise?) the term inside the log is the surprise, and the term outside is just the weight for the average. The surprise we get if we incorrectly assume the weather probabilities is 2 bits, which is higher than the true entropy.

Spoiler alert: what we just did is called cross entropy! I didn't want to use the word "entropy" during the calculation for the assumption back then, because the real name is cross entropy (and I didn't want to spoil the fun). So, cross entropy is the average surprise we get if we use another probability distribution to encode some probability distribution. The real formula is really similar too; it only differs in naming. Move the assumed probability distribution "inside" the $H$, shorten the mouthful names, and that's it.

$$\begin{align*} H(X_{\text{truth}}, X_{\text{assumption}}) &= -\sum_{x \in X} p_{\text{true}}(x) \log (p_{\text{assumption}}(x)) \\
H(P, Q) &= -\sum_{x \in X} p(x) \log (q(x)) \end{align*}$$

Some people have pointed out that the $H(P, Q)$ notation is kinda misleading, since it looks like another math concept called joint entropy. It also looks like we could swap $P$ and $Q$ and get the same answer, but no, cross entropy is not symmetric, i.e. $H(P,Q) \neq H(Q,P)$. Good to know.
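A small sketch for the weather example, which also shows the asymmetry (swapping the true and assumed distributions gives a different number):

import math

def cross_entropy(p, q):
    # H(P, Q) = -sum_x p(x) * log2(q(x))
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p_true = [0.5, 0.125, 0.25, 0.125]   # sunny, overcast, rain, thunderstorm
q_assumed = [0.25, 0.25, 0.25, 0.25]

print(cross_entropy(p_true, p_true))     # 1.75, plain entropy H(P)
print(cross_entropy(p_true, q_assumed))  # 2.0,  our surprise under the wrong assumption
print(cross_entropy(q_assumed, p_true))  # 2.25, not symmetric!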

KL divergence

Looking back, if we calculate the "excess" surprise caused by the wrong assumption, we get 2 - 1.75 = 0.25 bits. This difference is called the Kullback-Leibler (KL) divergence, or relative entropy. Formally,

$$D_{\text{KL}}(P\,||\,Q) = H(P,Q) - H(P)$$

Some interesting facts about KL divergence:

  • Just like cross entropy, KL divergence is not symmetric either.
  • If the assumed probability distribution is the same as the truth, $Q=P$, then cross entropy $H(P,Q)$ becomes $H(P,P)$, which is the same as $H(P)$. Because of this, the KL divergence will be 0.
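Continuing the weather example, here is a tiny sketch. Expanding $H(P,Q) - H(P)$ gives the equivalent direct form $D_{\text{KL}}(P\,||\,Q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$, which the code uses:

import math

p = [0.5, 0.125, 0.25, 0.125]   # true weather probabilities
q = [0.25, 0.25, 0.25, 0.25]    # assumed probabilities

# D_KL(P||Q) = sum_x p(x) * log2(p(x) / q(x)) = H(P,Q) - H(P)
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(kl)  # 0.25, the excess surprise

# and it is 0 when the assumption matches the truth (Q = P)
print(sum(pi * math.log2(pi / pi) for pi in p))  # 0.0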

CE as loss function

Let's do one example. ImageNet is a super popular dataset for pretraining computer vision models, and it has 1000 classes. Now imagine we are looking at just one sample, say one labeled class 578. Anytime 2 distributions appear we can compare them with cross entropy, and minimizing it is the training objective for multiclass classification tasks.

The first distribution is the label (true) distribution $P$. The label is a scalar, so first we have to convert it to a one-hot vector: everything is at 0% probability except for class 578 itself, which is at 100%. The second distribution $Q$ is the predicted probability (assumption) from our model.

Oh yes, in most neural network libraries the base used is $e$, not 2, so we use the natural logarithm for all entropy-related functions.

$$\begin{align*} H(P, Q) &= -\sum_{x \in X} p(x) \ln (q(x)) \\
&= -p(x_0)\ln(q(x_0)) - \ldots - p(x_{999})\ln(q(x_{999})) \\
&= -0\cdot\ln(q(x_0)) - \ldots - 1\cdot\ln(q(x_{578})) - \ldots - 0\cdot\ln(q(x_{999})) \\
&= -\ln(q(x_{578})) \end{align*}$$

If you are training with hard labels (one-hot labels), cross entropy reduces to just the negative log probability of the labeled class, since every other term is multiplied by 0.

Naming things is one of the two hard things in computer science, they said. nn.CrossEntropyLoss in PyTorch is not exactly the same as our formula above. This is because the raw model output is not a probability distribution (it does not sum to 1); instead, we call these raw outputs logits. To transform logits into a probability distribution, we apply softmax first. nn.CrossEntropyLoss accepts raw logits instead of probabilities, which is nice.

The relationship between our cross entropy and the PyTorch implementation can be seen in the code below.

import torch
import torch.nn.functional as F

# H(P, Q) = -sum_x p(x) * ln(q(x))
def my_cross_entropy(p, q):
    return -(p * q.log()).sum()

# prepare some random data: 4 classes, the true class is 1
logits = torch.tensor([3.1, 2.2, 1.0, -1.5])
probs = logits.softmax(-1)                        # logits -> probability distribution
labels = torch.tensor(1)                          # class index
labels_1hot = torch.tensor([0, 1, 0, 0]).float()  # one-hot version of the label

# these should all print the same value
print(my_cross_entropy(labels_1hot, probs))
print(F.nll_loss(F.log_softmax(logits, dim=-1), labels))
print(F.cross_entropy(logits, labels))

Extras

CE vs BCE

For a binary classification task, we actually have 2 choices for structuring the training code:

  • Use 1 neuron with BCEWithLogitsLoss
  • Or use 2 neurons and CrossEntropyLoss

Turns out they are the same. Read more. This is because for binary classification (2 classes), BCE is actually a special case of CE. In the general case though, unlike CE, the BCE probabilities do not have to sum to 1. BCE is suitable for multilabel classification tasks, where a single sample can belong to anywhere from zero to all of the classes.
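A minimal sketch of that equivalence, assuming the single BCE logit stands for the positive class and is taken as the difference of the two CE logits ($z_1 - z_0$):

import torch
import torch.nn.functional as F

# 2-neuron setup: one logit per class, integer class label
two_logits = torch.tensor([[0.3, 1.4]])  # shape (1, 2)
label = torch.tensor([1])                # class index

# 1-neuron setup: a single logit for the positive class.
# softmax([z0, z1])[1] == sigmoid(z1 - z0), so we use the logit difference.
one_logit = two_logits[:, 1] - two_logits[:, 0]
label_float = label.float()              # BCE expects float targets

print(F.cross_entropy(two_logits, label))                          # ~0.2873
print(F.binary_cross_entropy_with_logits(one_logit, label_float))  # ~0.2873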

Verify loss @ init

Go read Karpathy's recipe for training neural nets and come back. Can you explain why this is true?

Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization.

Because a properly initialized model will output roughly the same probability (close to uniform) for all classes, hence $q(x)=\frac{1}{\text{n\_classes}}$. Or, using log identities, we can even simplify the loss to $\log(\text{n\_classes})$. Here, log is the natural log.
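A quick sketch to check this, using a hypothetical final layer whose weights and bias are zeroed out so the softmax output starts exactly uniform:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 1000
head = nn.Linear(64, n_classes)   # hypothetical final layer of some model
nn.init.zeros_(head.weight)       # all logits equal -> uniform softmax
nn.init.zeros_(head.bias)

features = torch.randn(8, 64)               # a fake batch of features
labels = torch.randint(0, n_classes, (8,))  # random labels
loss = F.cross_entropy(head(features), labels)

print(loss.item())          # ~6.9078
print(math.log(n_classes))  # ln(1000) ~ 6.9078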

Further reading