
Shannon Entropy, Boltzmann Entropy, and Variety

I've been learning a bit about entropy in information theory recently, and I've had some realizations about the relationships between Shannon entropy, Boltzmann entropy, and variety. I am not saying anything particularly original here... If I am, that probably means I made a mistake. Hopefully I am at least saying things that are correct.

The variety \(v\) of a system, the cybernetics concept defined by Ashby, is its number of distinguishable states. Or rather, measured in bits, it's the log of the number of states.1 If the state space is \( S \), the variety is:2 \[ v_S=\log |S| \] If my system can be 16 different ways, then it has 16 possible states, and its variety is \( v = \log 16 = 4 \) bits.
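To make that concrete, here is a minimal sketch in Python (the helper name `variety` is just mine, not anything standard):

```python
import math

def variety(states):
    """Variety in bits: the log (base 2) of the number of distinguishable states."""
    return math.log2(len(states))

# A system that can be 16 different ways has a variety of log2(16) = 4 bits.
print(variety(range(16)))  # 4.0
```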

Ashby uses this concept in establishing his Law of Requisite Variety. In order for a controller \(C\) to perfectly steer a system \(S\), it must, at minimum, match the system's variety: \( v_C \geq v_S \).

This seems all well and good until you consider a case like the following. A system \(S=A \cup B\) is always either in a state from \(A\) or a state from \(B\). The varieties are: \[ v_A = 1000 \text{ bits} \] \[ v_B = 10^{999999999999} \text{ bits} \] In order to perfectly steer \(S\), according to Ashby's law, we need a controller with \( v_C \approx 10^{999999999999} \) bits. (Note that there are about \(10^{80}\) particles in the observable universe.) But suppose that \(S\) spends 99.9% of its time in \(A\), only rarely wandering into \(B\) and then quickly back to \(A\). In that case, we can get about 99.9% steering power over \(S\) just by ignoring \(B\), and we only need a controller with variety \( v_C = 1000 \) bits. For readers who are rusty with mathematics, I will remind you that 1000 is a little bit smaller than \(10^{999999999999}\), and will be a little bit easier to make a controller for.

To get a better grasp on how difficult it really is to steer something, we clearly need to account for frequency. Those bajillion extra states don't matter much if the system rarely takes advantage of them. This is where we graduate to the concept of entropy. Given some probability distribution3 on states, the entropy (which I'll write as \(H\), to keep it distinct from the state space \(S\)) is: \[ H = - \sum_i p_i \log p_i \] ...where \(p_i\) is the probability of the \(i\)'th state. The graph of \(-p\log p\) is a lopsided little mound, having value 0 at \(p=0\) and \(p=1\), and peaking at about 0.53 bits for \(p = 1/e \approx 0.37\). Thus, probability distributions of complete confidence, being 1 at a point and 0 everywhere else, have 0 entropy. Distributions which are more spread out, and thus more uncertain, have higher entropy. The entropy of a system is the expected amount of information one needs to specify its state.
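As a sketch of the formula (the function name is my own, and the example distributions are made up):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero-probability states."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0   -- complete confidence
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 -- somewhat spread out
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0   -- maximally spread out
```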

What if we are maximally uncertain, so our probability distribution is a flat \( p_i = 1/|S| \)? That is to say, we are equally likely to be in any state, so each probability is just one over the total number of states. Then the entropy is: \[ H = -\sum_{i=1}^{|S|} \frac{1}{|S|} \log \frac{1}{|S|} \] \[ = -|S| \left( \frac{1}{|S|} \log \frac{1}{|S|} \right) \] \[ = -\log \frac{1}{|S|} \] \[ = \log|S| \] \[ = v\] The variety is the maximum entropy! (A flat distribution is in fact the one that maximizes entropy, so this really is the ceiling.) Or more specifically, it is the entropy in the particular case where we have no idea which state will come next, and must account for all states equally. It's the worst-case entropy, from the perspective of a steersman.
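The same check numerically, using the entropy helper from the sketch above (redefined here so this snippet stands on its own):

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 16                        # |S|: the number of possible states
flat = [1 / n] * n            # maximally uncertain: every state equally likely
print(shannon_entropy(flat))  # 4.0
print(math.log2(n))           # 4.0 -- the variety, matching the derivation
```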



Now let's consider Boltzmann entropy. The definition carved into Boltzmann's tombstone is: \[ S = k \log W \] ...where \(k\) is just a constant (Boltzmann's constant), which I will disregard. \(W\) is the important part: it captures the relation between a higher-level and a lower-level description of a system. We consider two descriptions: the more fundamental, low-level state space \(m\), whose states are called microstates, and the higher-level state space \(M\), whose states are called macrostates. There is a (surjective) mapping: \[ f: m\to M \] ...which takes each microstate to its corresponding macrostate.

For intuition, consider a stoplight. We know that the stoplight is made of a bajillion atoms which may be configured in one of a flajillion ways; these configurations are the microstates \(m\). But in practice, all we care about is the state space \( M = \{\text{Red, Yellow, Green}\} \). The function \(f: m\to M\) takes an atomic configuration and tells you which colored light is on.

Boltzmann's equation assigns entropies to macrostates. Given that I am in some known macrostate, \(W\) is the number of microstates I could be in.4 More formally, the \(W\) value of a macrostate \(y\) is the cardinality of \( \{x\in m \text{ | } f(x)=y \}\). The \(W\) of a red stoplight is the number of possible atomic configurations for which the stoplight is red.
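Here is a toy version of the stoplight in Python. The microstate names and the mapping \(f\) are entirely made up (a real stoplight has a flajillion microstates, not eight); the point is just how \(W\) falls out of the preimage of each macrostate:

```python
import math
from collections import Counter

# Hypothetical "atomic configurations" (microstates), mapped by f onto macrostates.
f = {
    "config_1": "Red",    "config_2": "Red",
    "config_3": "Red",    "config_4": "Red",
    "config_5": "Yellow", "config_6": "Yellow",
    "config_7": "Green",  "config_8": "Green",
}

# W of a macrostate is the number of microstates that map onto it.
W = Counter(f.values())
for macrostate, w in W.items():
    # log2(W) is the Boltzmann entropy of that macrostate (dropping the constant k).
    print(macrostate, w, math.log2(w))
# Red 4 2.0
# Yellow 2 1.0
# Green 2 1.0
```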

So... how does this relate to Shannon's entropy?5

It turns out, the answer is very similar to the derivation I did with variety. Consider the Boltzmann entropy as a matter of microstate uncertainty: I know my macrostate, and I am speculating about what my microstate might be. If I know \(f\), I know all the microstates I could be in, but I have no idea which of those are more or less likely. Thus, I can define a simple flat probability distribution, where all the \(W\) microstates which map to my macrostate are equally likely, with value \( p_i = 1/W \). Then: \[ H = -\sum_{i=1}^{W} \frac{1}{W} \log \frac{1}{W} \] \[ = -W \left( \frac{1}{W} \log \frac{1}{W} \right) \] \[ = -\log \frac{1}{W} \] \[ = \log W \] I have derived the Boltzmann entropy from the Shannon entropy, in almost exactly the same way that I derived the variety. I can connect the two explicitly by noting that \(W\) is the variety of a subset of possible states in the micro-description. The variety is the maximum entropy in general, and the Boltzmann entropy is the variety of the possible microstates with respect to a given macrostate. Or put differently, the Boltzmann entropy is the Shannon entropy in the case where our \(p_i\) values are uniform over some subset of states, and 0 everywhere else.
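And to connect the two numerically: take a distribution that is uniform over the \(W\) microstates compatible with a known macrostate and zero elsewhere, and its Shannon entropy comes out to \(\log W\). (The helper and the numbers here are just my own toy example.)

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

W = 4                             # microstates compatible with my known macrostate
others = 4                        # microstates ruled out by that knowledge
p = [1 / W] * W + [0.0] * others  # flat over the compatible microstates, 0 elsewhere
print(shannon_entropy(p))  # 2.0
print(math.log2(W))        # 2.0 -- the Boltzmann entropy, with k dropped
```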

  1. The log is just there for convenience, so that the total variety of two systems is the sum of their individual varieties, rather than the product. \(|A\times B| = |A||B|\), but \( \log|A\times B|= \log|A| + \log|B| \).
  2. Log base 2 is implied.
  3. I'm guessing a good a priori probability distribution might be derivable from a state transition function, where states are more likely if they have more incoming transitions from other likely states. Something like: \[ p_j = \lambda \sum_{k \,:\, f(s_k) = s_j} p_k \] ...where \(f\) here is the transition function, and \(\lambda\) is chosen so that \(\sum_j p_j=1\)? Maybe? (A rough sketch of this iteration appears after these notes.)
  4. In the continuous case, where there are infinitely many microstates, we can use volumes instead of counting.
  5. Or Gibbs Entropy, equivalently.
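
Regarding note 3: here is a rough sketch of what that might look like as an iteration, for a made-up deterministic transition function. This is just one possible reading of the idea, not anything established.

```python
# A toy deterministic transition function on five states (entirely made up).
transition = {0: 1, 1: 2, 2: 0, 3: 0, 4: 2}

# Start flat and repeatedly push each state's probability along its transition,
# renormalizing each round (the role lambda plays in note 3; for a deterministic
# transition function the total is conserved, so lambda works out to 1).
p = {s: 1 / len(transition) for s in transition}
for _ in range(100):
    new = {s: 0.0 for s in transition}
    for s, prob in p.items():
        new[transition[s]] += prob
    total = sum(new.values())
    p = {s: prob / total for s, prob in new.items()}

print(p)  # states 3 and 4 end up at 0; the mass circulates around the 0 -> 1 -> 2 cycle
```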