# An Ill-Defined Probability Model and the Uniqueness of Labeling

The following started out as a thought experiment about where and how probability theory connects to our perception of the physical world around us. There may be a hole or two in my reasoning, and I'm always open to hearing opinions or questions -- so feel free to comment or contact me about this post!

## Review: The Canonical Probability Model

The typical probability model by definition consists of the following:

1. A set of outcomes Ω.
2. A set of measurable events S such that
    (a) ∀ s ∈ S, s ⊆ Ω,
    (b) Ω, ∅ ∈ S,
    (c) S is closed under union and complementation.
3. A function P: S → ℝ such that
    (a) P(α) ≥ 0 ∀ α ∈ S,
    (b) P(Ω) = 1,
    (c) α, β ∈ S, α ∩ β = ∅ ⇒ P(α ∪ β) = P(α) + P(β).

I'll call this the "canonical" probability model.
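For concreteness, these requirements can be checked mechanically for any finite model. Below is a minimal sketch in Python (the helper name `is_canonical` and the frozenset encoding are my own, not part of any standard library):

```python
from itertools import combinations

def is_canonical(omega, S, P):
    """Check requirements 2(a)-(c) and 3(a)-(c) for a finite model.

    omega: frozenset of outcomes; S: set of frozensets (events);
    P: dict mapping each event in S to its probability.
    """
    omega = frozenset(omega)
    # 2(a): every event is a subset of omega
    if not all(s <= omega for s in S):
        return False
    # 2(b): omega and the empty set are events
    if frozenset() not in S or omega not in S:
        return False
    # 2(c): closure under complementation and union
    for a in S:
        if omega - a not in S:
            return False
        if any(a | b not in S for b in S):
            return False
    # 3(a): non-negativity; 3(b): P(omega) = 1
    if any(P[s] < 0 for s in S) or P[omega] != 1:
        return False
    # 3(c): additivity over disjoint pairs of events
    return all(P[a | b] == P[a] + P[b]
               for a, b in combinations(S, 2) if not (a & b))

# A fair coin satisfies all of the requirements:
omega = frozenset({"H", "T"})
S = {frozenset(), frozenset({"H"}), frozenset({"T"}), omega}
P = {frozenset(): 0.0, frozenset({"H"}): 0.5, frozenset({"T"}): 0.5, omega: 1.0}
print(is_canonical(omega, S, P))  # True
```

We'll return to a model that passes every check except 3(c) shortly.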

Putting ourselves in the shoes of a designer of such requirements, 3(a) and 3(b) seem clearly physically motivated.

1. We tend to relate positive numbers to the presence of something, in this case the presence of likelihood. Further, it doesn't make sense for there to be anti-likelihoods, but at the same time we can fathom a nil-likelihood. Taken altogether, we require strictly non-negative probabilities (3(a)).
2. We want to assume that something will occur when measured. We don't envision the universe suddenly disappearing (and if it did, we wouldn't be around to measure it). Though one might be tempted to try P(Ω) = ∞ ∈ ℝ ∪ {∞}, why not first keep things simple with P(Ω) = 1, which is just as intuitive (3(b)).

And then we come to 3(c). Where's the explanation in physical terms for that?

In this post, we shall construct a real-world-based probability model that violates this third requirement. We will then alter our model to satisfy 3(c), and compare the two for insights. First though, let's get physical...

## Crossing over into the Physical Realm

At some point, a probabilistic model isn't useful unless you can sample the real world for comparison against your model. The mathematical definitions above place no direct requirement on how your real-world sampling might work (i.e. how it produces elements of Ω). Let's bridge the gap between the math world and the real world.

Let's construct a real-world sampling function M: A → 2^Ω, where A is a set containing imaginably possible physical states (not directly knowable as true or not), and 2^Ω is the power set of the set Ω of possible outcomes; each element of 2^Ω represents the presence or absence of one or more outcomes possibly perceived by our sampler. It is impossible to know the precise state of the real world -- we are always relying on samplers (whether biological, electronic, or otherwise) for our physical information. Having our sampler return data in terms of Ω enables us to compare real-world data directly against our probability model built upon Ω.

Before moving on, let's talk about the set A and the rationale behind choosing 2^Ω.

Regarding A, it's perhaps difficult (or philosophically tedious?) to define all of the elements of A. One can imagine each element of A as a specific arrangement of molecules for some bounded area of the universe. One such element of A might correspond to the molecular arrangement of a quarter facing heads up; or likewise a molecular arrangement corresponding to tails facing up. Regardless, elements of A represent possible or imaginable states of the physical world.

The rationale for 2^Ω, and not simply Ω, is two-fold. Firstly, the same physical state can be perceived in different ways: is that vision yonder a tree, a cluster of green and brown shapes, or an arrangement of molecules? Our perceptions and the labels we attach to them are not absolute, but perception is all we have. Thus the best we can do, while maintaining a sort of philosophical generality, is to imagine a construction of Ω that contains many perceptions of the same physical state. Continuing with this idea, we construct our sampler M such that it returns the subset of Ω containing each possible interpretable outcome for a given element or subset of elements in A. Thus we formally define the codomain of M to be 2^Ω. As an example, consider Ω = { ..., tree, green & brown cluster of form #1234, molecular arrangement of form #5678901, ... }, and for some A1 ⊆ A, M(a∈A1) = { tree, organism, green & brown cluster of form #1234, molecular arrangement of form #5678901 }.

Secondly, it makes sense (if Ω is constructed for such) for our sampler to be able to return two or more outcomes that together describe the state of the physical world. As an example, consider two quarters in our bounded sub-universe, with Ω = { Q1_Heads, Q1_Tails, Q2_Heads, Q2_Tails }, and for some A1 ⊆ A, M(a∈A1) = { Q1_Heads, Q2_Tails }.
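The two-quarter example can be sketched as a toy sampler in Python. This is purely illustrative: the encoding of a physical state as a (q1, q2) pair is an assumption of mine, standing in for the unknowable elements of A:

```python
# Toy sampler M: A -> 2^Omega for the two-quarter example. A physical
# state is abstracted here as a pair like ("H", "T"); actual elements
# of A (molecular arrangements) are of course not representable.
OMEGA = {"Q1_Heads", "Q1_Tails", "Q2_Heads", "Q2_Tails"}

def M(state):
    q1, q2 = state
    # Return the set of all outcomes perceivable in this state.
    return {"Q1_Heads" if q1 == "H" else "Q1_Tails",
            "Q2_Heads" if q2 == "H" else "Q2_Tails"}

print(M(("H", "T")) == {"Q1_Heads", "Q2_Tails"})  # True
```

Note that M returns an element of 2^Omega (a set of outcomes), never a bare outcome, matching the codomain chosen above.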

Now that we've defined a real-world sampler and the means for comparison against a probability model, let's see how well an example holds up...

## The Ill-Defined Probability Model

Let's construct a probability model from the ground up, starting with A, Ω, and M. Suppose we have a simplified roulette wheel with only two possible values: 1 (red) and 2 (black). We can say "1" is an outcome, "red" is an outcome, "2" is an outcome, and "black" is an outcome. Thus we define Ω = { 1, 2, red, black }. It should be clear that we can create a mapping M: A → 2^Ω such that some subset of A maps to { 1, red } ∈ 2^Ω, and the remaining subset maps to { 2, black } ∈ 2^Ω (we assume the roulette ball always lands in one of the two slots). We can now sample the real world for outcomes, for comparison later against the rest of our probability model.

Now let's construct S within our probability model. We can attempt to build a simple version of S by pulling single elements from Ω, but S must be closed under complementation and union. Thus we end up with S = { ∅, {1}, {2}, {red}, {black}, {1, 2}, {1, red}, ..., Ω } = 2^Ω.
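The closure step can be carried out mechanically: start from the singleton events plus ∅ and Ω, then keep adding complements and unions until nothing new appears. A sketch (the helper name `close_events` is hypothetical):

```python
def close_events(omega, seeds):
    """Smallest superset of the seed events closed under complement and union."""
    omega = frozenset(omega)
    S = {frozenset(), omega} | {frozenset(s) for s in seeds}
    changed = True
    while changed:                 # iterate to a fixed point
        changed = False
        for a in list(S):
            new = {omega - a} | {a | b for b in S}
            fresh = new - S
            if fresh:
                S |= fresh
                changed = True
    return S

S = close_events({"1", "2", "red", "black"},
                 [{"1"}, {"2"}, {"red"}, {"black"}])
print(len(S))  # 16 -- the full power set 2^Omega
```

Since every subset of Ω is a union of singletons, seeding with all four singletons forces S to grow into the entire power set, exactly as claimed above.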

And finally, let's construct P. Say we know, as a physical matter of fact, that the roulette wheel is fair, equally favoring each of the two slots. Thus P({1}) = P({2}) = c for some value c. Since P(Ω) = 1 represents the certainty that some event occurs, and {1} and {2} map back to the only two physically possible states, c + c = 1, so P({1}) = P({2}) = 0.5.

We also know that "red" occurs whenever "1" does, so we can say P({red}) = P({1}) = 0.5, and similarly P({black}) = P({2}) = 0.5. We also know certain outcomes never co-occur: our sampler never returns both "1" and "2", so the event { 1, 2 } has probability 0. We then have:

    P(α) = {
        0.0 if α = ∅,
        0.5 if α = {1}, {red}, or {1, red},
        0.5 if α = {2}, {black}, or {2, black},
        1.0 if α = Ω,
        0.0 for all other α
    }

It is clear that 2(a), 2(b), 2(c), 3(a), and 3(b) all hold. Testing 3(c) though, our model falls short.

It's true, for instance, that {1} ∩ {red} = ∅, yet 3(c) would then give P({1} ∪ {red}) = 0.5 + 0.5 = 1, implying we always land on the red "1" slot. This contradicts our knowledge of how our roulette wheel behaves physically (and our own assignment P({1, red}) = 0.5). Thus our model (A, Ω, M, S, P) is not valid as a canonical probability model.
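We can watch the failure happen in code. Below, the ill-defined P is encoded as a lookup table exactly as written above, and the disjoint pair {1}, {red} breaks additivity (a sketch; the table-based encoding is mine):

```python
# The ill-defined P from above, as a lookup table over frozenset events.
OMEGA = frozenset({"1", "2", "red", "black"})
TABLE = {frozenset(): 0.0,
         frozenset({"1"}): 0.5, frozenset({"red"}): 0.5,
         frozenset({"1", "red"}): 0.5,
         frozenset({"2"}): 0.5, frozenset({"black"}): 0.5,
         frozenset({"2", "black"}): 0.5,
         OMEGA: 1.0}

def P(alpha):
    return TABLE.get(frozenset(alpha), 0.0)  # 0.0 for all other events

a, b = {"1"}, {"red"}
print(a & b == set())   # True: the events are disjoint
print(P(a) + P(b))      # 1.0, but ...
print(P(a | b))         # 0.5 -- requirement 3(c) fails
```

Every other requirement passes; only the additivity check trips, exactly as argued above.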

## A 3(c)-compliant Model

We know A corresponds directly to imaginable states of the physical world, so there is no changing A in any meaningful way. Further, Ω and M are constructed on a philosophically sound basis, I believe. So let's reconsider the construction of S, our set of measurable events.

We, as sophisticated observers, can perceive both the difference between red and black and the presence of two distinct symbols, "1" and "2". But perception of the red slot's occurrence is inseparable from perception of the "1" slot's occurrence, so let's simplify S as S' = { ∅, { 1, red }, { 2, black }, { 1, 2, red, black } }. It can be verified that S' is closed under union and complementation.

Rebuilding P yields:

    P'(α) = {
        0.0 if α = ∅,
        0.5 if α = { 1, red },
        0.5 if α = { 2, black },
        1.0 if α = Ω,
        0.0 for all other α
    }

Again, 2(a), 2(b), 2(c), 3(a), and 3(b) hold. Further, it should be clear that 3(c) now holds in this case for all α, β ∈ S', and so our revised model (A,Ω,M,S',P') is valid as a canonical probability model.
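As a sanity check, 3(c) can be verified exhaustively for the revised model, since S' has only four events (a sketch using the same frozenset encoding as before):

```python
from itertools import combinations

# Exhaustive check of 3(c) for the revised model (S', P').
OMEGA = frozenset({"1", "2", "red", "black"})
S_prime = [frozenset(), frozenset({"1", "red"}), frozenset({"2", "black"}), OMEGA]
P_prime = {frozenset(): 0.0,
           frozenset({"1", "red"}): 0.5,
           frozenset({"2", "black"}): 0.5,
           OMEGA: 1.0}

ok = all(P_prime[a | b] == P_prime[a] + P_prime[b]
         for a, b in combinations(S_prime, 2) if not (a & b))
print(ok)  # True: finite additivity holds on S'
```

The only non-trivial disjoint pair is { 1, red } and { 2, black }, whose union is Ω with probability 0.5 + 0.5 = 1, as required.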

## What is the Essence Here?

Comparing our non-canonical and canonical models, what is the difference? Essentially, the canonical model requires uniqueness in the labeling of measurable events. In our initial model (A, Ω, M, S, P), the same fundamental physical state is ultimately mapped to multiple labels (elements) in S. In our revised model (A, Ω, M, S', P'), each unique (in a vague sense of the word) physical state is mapped to a single element in S'.

In fact any set S'' of measurable events that contains one or more elements not unique to a particular subset of physical states will lead to a non-canonical probability model.

Proof:

Let (A'', Ω'', M'', S'', P'') be a model such that:

1. Requirements 1, 2(a)-(c), and 3(a)-(b) of the canonical probability model hold.
2. A'' is a set of imaginable physical states.
3. M'' is a sampler function A'' → 2^Ω'' such that for each ε1 ∈ Image(M''), there is a unique subset A1 ⊆ A'' such that M''(a∈A1) = ε1 and M''(b∉A1) ∩ ε1 = ∅.
4. For a subset A2 ⊆ A'' there exists an ε2 ∈ Image(M'') such that M''(a∈A2) = ε2 and |ε2| ≥ 2.
5. There exists some non-empty s1 ∈ S'' such that s1 ⊂ ε2 (strict subset) and P''(s1) > 0.

Because s1 is a strict subset of ε2, the set ε2 - s1 is non-empty, so there exists some non-empty s2 ⊆ ε2 - s1. Because s1 and s2 are subsets of ε2, the sample M''(a∈A2) = ε2 always indicates both s1 and s2. Further, because s1 and s2 are found only when states of A2 are sampled, P''(s1) = P''(s2). Let γ = P''(s1). From above, γ > 0.

Let's now assume 3(c) holds. Because s2 ⊆ ε2 - s1, we have s1 ∩ s2 = ∅. Thus by 3(c), P''(s1 ∪ s2) = P''(s1) + P''(s2) = 2γ. However, s1 ∪ s2 ⊆ ε2, thus s1 ∪ s2 is sampled by M'' whenever s1 and s2 are, which implies P''(s1) = P''(s1 ∪ s2), or γ = 2γ. But γ > 0, thus we have a contradiction, and 3(c) does not hold. Q.E.D.
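The arithmetic at the heart of the contradiction is tiny and can be spelled out numerically (γ = 0.5 here is an arbitrary positive value standing in for P''(s1)):

```python
gamma = 0.5                        # any P''(s1) = P''(s2) = gamma > 0
p_by_additivity = gamma + gamma    # what 3(c) forces for P''(s1 U s2)
p_by_cooccurrence = gamma          # what the sampler forces: s1 U s2 occurs iff s1 does
print(p_by_additivity == p_by_cooccurrence)  # False -- the contradiction
```

The two computations agree only when gamma = 0, which condition 5 rules out.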

By contrapositive, 3(c) in the canonical model implies that each subset of A'' with a corresponding element in S'' has a unique such element. In other words, S'' has unique labels for physical states.

We now have an explanation in a physical (or philosophical :)) context for requiring 3(c) in the first place -- it demands we (arbitrarily) choose a unique way of perceiving and labeling each class of physical outcomes!