# Hunting for Simpson’s Paradox (part 2)

Simpson’s paradox illustrates an intuitive error in reasoning that’s hard to accept without credible examples.  It occurs when a correlation or trend that is present in groups is reversed when the groups are combined.  The example we’ve been working with deals with batting averages.  A hypothetical situation was presented in the first part of the article in which we stated that one batter had a higher batting average against left-handed pitchers and against right-handed pitchers, separately.  However, when the overall batting average was calculated, that batter had a lower average.

Here’s the table once more just for reference:

Certainly it was an interesting example but it was entirely made-up.  I just tweaked the numbers a bit to get the table to work.  I set my sights on finding some of my own examples of Simpson’s paradox based on REAL data.

### Attempt 1

Yahoo Sports provides some easily accessible baseball team split stats, e.g., home vs. road, pre-all star break vs. post-all star break, turf vs. grass, indoor vs. outdoor (http://yhoo.it/hKAiLn).  I started by pulling the overall team batting averages and then their averages split by home games versus road games (for the 2010 season).  By pasting into Excel and sorting, then manually checking for a good 15 minutes, I was able to establish that there was no example of Simpson’s paradox in this data set. Bummer.

### Attempt 2

This process had to be automated because there was no way I was going to spend 15+ minutes checking every split, every season, until I found an example.  While this could be done fairly easily with a VBA script (i.e., a macro) in Excel, I decided to move over to MATLAB simply because I’m better versed in scripting there.  Algorithmically speaking, it really is a piece of cake to automate.

NOTE: If the details of the script don’t interest you, you can skip down to the results.  And yes, I found some examples.  WOOHOOO!

First, you need your data read into the program.  In Excel, I had a table with the team names in the first column, the overall batting average in the second column and the split stats (i.e., team batting averages for road vs. home) in the third and fourth columns.  To move this into MATLAB, I created a cell array called teams which would contain team names.

teams = {}

Then, after opening the teams variable in the variable editor window (double-click it in the Workspace), I pasted in the team names (right-click, “Paste Excel Data”).  Next, I created a variable x and pasted in the three columns of batting average data.

At this point a very simple little script (simphunt.m) tests every pair of teams for Simpson’s paradox.  The main step of the script is to compare the overall averages for each pair and the split stats for each pair.  If the relation of the split is reversed in the overall, then we have an example.

if (x(i,1)>x(j,1) && x(i,2)<x(j,2) && x(i,3)<x(j,3)) || ...
(x(i,1)<x(j,1) && x(i,2)>x(j,2) && x(i,3)>x(j,3))
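For readers who don't use MATLAB, the same pairwise test can be sketched in Python. This is only an illustration of the comparison above, not the simphunt.m script itself; the function name `find_simpson_pairs`, the team labels and the averages are all invented for the example.

```python
from itertools import combinations


def find_simpson_pairs(x):
    """Return index pairs (i, j) whose overall ordering is the reverse
    of the ordering in both splits.

    Each row of x is [overall, split_a, split_b], mirroring the layout
    of the MATLAB variable x described above.
    """
    pairs = []
    for i, j in combinations(range(len(x)), 2):
        a, b = x[i], x[j]
        i_ahead_overall_only = a[0] > b[0] and a[1] < b[1] and a[2] < b[2]
        j_ahead_overall_only = a[0] < b[0] and a[1] > b[1] and a[2] > b[2]
        if i_ahead_overall_only or j_ahead_overall_only:
            pairs.append((i, j))
    return pairs


# Hypothetical averages, invented for illustration (not the Yahoo data).
teams = ["Team A", "Team B", "Team C"]
x = [
    [0.270, 0.260, 0.280],
    [0.265, 0.262, 0.285],
    [0.250, 0.240, 0.255],
]

pairs = find_simpson_pairs(x)
print("There are %d cases of Simpson's Paradox" % len(pairs))
for i, j in pairs:
    print(teams[i], teams[j])
```

Checking every unordered pair once is enough because the condition is symmetric in i and j.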

### Results

The first split I tested was, again, home vs. road (to which I already knew the answer):

There are no cases of Simpson’s Paradox

The next was day vs. night.  Disappointed again.

There are no cases of Simpson’s Paradox

Fortunately, it was only taking a minute or so to fetch the data and move it into MATLAB.  Nevertheless, I was still feeling a little disheartened.  I was beginning to expect to run through dozens of splits, maybe from multiple seasons, or even moving on to player vs. player stats instead of teams.

Then I went with indoor vs. outdoor.  JACKPOT!!!

There are 5 cases of Simpson’s Paradox

' Chicago Cubs'          ' Oakland Athletics'
' Cleveland Indians'     ' Tampa Bay Rays'
' Colorado Rockies'      ' Milwaukee Brewers'
' Los Angeles Angels'    ' Tampa Bay Rays'
' New York Mets'         ' Toronto Blue Jays'

Five Cases!!!  Looking back at the data we can confirm:

I could clearly keep hunting but I’ve satisfied my desire for finding my own examples.  Next time I bring up Simpson’s paradox in any of my classes, I’ll start with my manufactured example but quickly move on to these REAL examples.

# Hunting for Simpson’s Paradox, part 1

Let’s say we happen to know the batting averages for two baseball players.  Overall, player 1 has a higher average than player 2.  However, if you consider only how each player hits against left-handed pitchers, we find that player 2 actually has a better average than player 1.  In this hypothetical scenario, it turns out that player 2 also has a better average against right-handed pitchers.  How is that possible?

Doesn’t it make intuitive sense that if player 2 is better than player 1 against left- and right-handed pitchers separately, he must be better than player 1 against all pitchers?  While that may be what our intuition tells us, it turns out that it’s not necessarily true.

Consider the following table.  Note that batting average is simply the ratio of a player’s number of hits to the number of at-bats.

This table presents exactly the hypothetical scenario described above.  Separately, player 2 has a higher average than player 1 against left- and right-handed pitchers, but overall player 1 has a higher average than player 2.

This phenomenon is commonly known in statistics as Simpson’s paradox.  It demonstrates how our intuition can get us into trouble.  Briefly stated, Simpson’s paradox occurs when a correlation or trend that is present in groups is reversed when the groups are combined.
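To see the reversal in actual arithmetic, here is a small Python sketch. The hits and at-bats below are made up for this illustration and are not the figures from the table; the names `player1`, `player2`, `average` and `overall` are likewise just for the example.

```python
# Made-up (hits, at-bats) figures chosen to produce the reversal.
player1 = {"vs_left": (20, 100), "vs_right": (90, 300)}
player2 = {"vs_left": (66, 300), "vs_right": (32, 100)}


def average(hits, at_bats):
    """Batting average: hits divided by at-bats."""
    return hits / at_bats


def overall(player):
    """Pool the hits and at-bats from both splits, then divide."""
    hits = sum(h for h, _ in player.values())
    at_bats = sum(ab for _, ab in player.values())
    return average(hits, at_bats)


for split in ("vs_left", "vs_right"):
    print(split,
          round(average(*player1[split]), 3),
          round(average(*player2[split]), 3))
print("overall", round(overall(player1), 3), round(overall(player2), 3))
```

In this sketch player 2 wins each split, yet player 1 wins overall, because most of player 1's at-bats come against right-handers (where both players hit better) while most of player 2's come against left-handers.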

I was recently reminded of Simpson’s paradox when @Math_Bits posted a link on Twitter to an article, “Instances of Simpson’s Paradox” by Thomas R. Knapp.  It got me thinking.  Sure, I can manufacture an example, and I’ve seen a few examples in papers, texts and even Wikipedia.  But I want to find my own examples.  And of course, manufactured examples like the table above don’t count.  I need real data.

I figured the best place to start for real data that’s easy to find is in sports, say baseball, while we’re thinking of it.  I started simple and pulled up the first split data set I found.  Over on Yahoo Sports, I pulled up team statistics for the full 2010 season, see http://yhoo.it/hKAiLn.  I pulled the data over into Excel and began (manually, ug…) hunting for an example of Simpson’s paradox.  Of course, I would start the hardest way possible.  I looked at batting averages (overall, home and on the road).  I made three lists, one for each category: overall, home and road.  Then, I sorted the teams from highest to lowest and began looking one-by-one for pairs of teams where one had a higher average overall but a lower average both at home and on the road.

… to no avail.  I even reversed the process by sorting from lowest to highest.

I’m beginning to believe that Knapp was right when he claimed that examples of Simpson’s paradox are extremely rare.

The next step was to automate this process.  As a programmer, I began devising some simple code that takes an overall list and a list for each group and identifies all the pairs that exhibit Simpson’s paradox.

In the next post, I’ll walk through the progress I made using MATLAB to do my dirty work.

# Unlisted Entries in Your Word Finds

Over dinner tonight, my daughter told us a story about how they were working on a Word Find in school today and she found the word “ELF”.  She was fascinated by the fact that the theme of the Word Find was supposed to be Thanksgiving and that the word was not listed among the words to find.

The story started me thinking.

### What do you suppose is the probability that in an m x m Word Find (meaning m rows and m columns) you would find a given three letter word, like “ELF”?

To answer the question, you must first lay out some ground rules, or assumptions as we call them in mathematical modeling.  First, let’s assume that the letters are distributed randomly and uniformly.  Second, we’ll assume that there are three orientations for finding the word: vertical, horizontal and diagonal.  As with most of these puzzles I’ve ever done, and especially the ones my daughter is now doing as a 10-year-old, we’ll also expect to find the word forward or backward.

For a single 3 letter random string, the probability that it matches our given word would clearly be

$$p=\frac{2}{26^3}$$

because the likelihood of each letter matching is 1/26 and because we assumed random letters, a forward match has a probability of 1/26 * 1/26 * 1/26, as does a backward match.

The next question to answer would be, how many 3-letter sequences are there for an m x m matrix of letters?

In each row, there would be (m-2) 3-letter sequences, which means m*(m-2) horizontal sequences.  The same goes for the columns.  Now consider the diagonals.  On the main diagonal there are m letters so there would be (m-2) 3-letter sequences.  On the first super-diagonal there are (m-1) letters so there would be (m-3) 3-letter sequences.  The same is true for the first sub-diagonal.  If you keep marching up the diagonals, you’ll see (m-4), then (m-5), all the way down to 1.

So we’ll have a total of

$$(m-2) + 2(m-3)+2(m-4)+\cdots+2(1)$$

for the number of forward diagonal 3-letter sequences.  Since our word-find is assumed to be square, we’ll have the same number going on the back diagonal.  So here is a total count on 3-letter sequences: horizontal + vertical + diagonal.

$$\mathrm{Total} = 2m(m-2)+2(m-2)+4(m-3)+\cdots + 4(1)$$

To simplify this, recall that if you have the sum of the first k integers,

$$1+2+\cdots+k=\frac{k(k+1)}{2}$$

So we can simplify the Total as first combining the first two terms,

$$\mathrm{Total}=(2m+2)(m-2)+4(m-3)+\cdots+4(1)$$

Then we combine the last terms using the sum formula above,

$$\mathrm{Total}=2(m+1)(m-2)+4\frac{(m-3)(m-2)}{2}$$

Some minimal algebra finally gives us the formula for the number of 3-letter sequences as a function of m:

$$\mathrm{Total}=4(m-1)(m-2)$$
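If you'd rather not trust the algebra, a brute-force count is a quick sanity check. The Python sketch below (the function name `count_sequences` is just for this example) walks every cell of an m x m grid and counts the 3-letter positions that fit in each of the four orientations: horizontal, vertical and the two diagonals. Reading backward is not counted separately here, because it is already accounted for in p.

```python
def count_sequences(m, length=3):
    """Count the straight-line positions of the given length in an m x m grid."""
    # One direction per orientation: right, down, down-right, down-left.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    count = 0
    for r in range(m):
        for c in range(m):
            for dr, dc in directions:
                end_r = r + (length - 1) * dr
                end_c = c + (length - 1) * dc
                # The sequence fits if its last letter is still in the grid.
                if 0 <= end_r < m and 0 <= end_c < m:
                    count += 1
    return count


# Compare the brute-force count with the closed form for a range of sizes.
for m in range(3, 11):
    assert count_sequences(m) == 4 * (m - 1) * (m - 2)
print("formula agrees with the brute-force count for m = 3..10")
```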

That’s surprisingly simple!

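Back to the main question now: we want the probability that at least one of these 3-letter sequences matches our given word.  Under our assumptions, and treating each sequence as an independent trial (an approximation, since overlapping sequences share letters), this is just a binomial distribution.  Woohoo!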

Recall that a binomial distribution with n trials and probability p of success in a single trial has the probability mass function

$$P(X=x)=\binom{n}{x} p^x (1-p)^{n-x}$$

Thus, we are looking for “at least one success” which means we calculate $$1-P(X=0)$$.  We know that $$n=4(m-1)(m-2)$$ and $$p=\frac{2}{26^3}$$

Now, you can plug and chug in your calculator or run out and stick a formula in Excel, Wolfram Alpha, Maple, etc.  Either way, we now have an easy answer to a couple of interesting questions:

### 1. If we have a 20×20 grid of randomly distributed letters, what is the probability that the word “ELF” appears?

IN EXCEL:  =1-binomdist(0,4*(m-2)*(m-1),2/26^3,FALSE)
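The same calculation can be sketched in Python. It uses the fact that P(X=0) is simply (1-p)^n, so no binomial routine is needed; the function name `prob_word_appears` is made up for this example.

```python
def prob_word_appears(m):
    """Probability that a given 3-letter word appears at least once in an
    m x m grid of random letters, under the assumptions of this post."""
    n = 4 * (m - 1) * (m - 2)  # number of 3-letter sequences
    p = 2 / 26**3              # a sequence matches forward or backward
    return 1 - (1 - p) ** n    # 1 - P(X = 0)


print(round(prob_word_appears(20), 4))
```

For the 20×20 grid this comes out to roughly a 14% chance.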

### 2.  What size of a grid do we need to have a greater than 50% chance that the word “ELF” appears?

You could find this using “What-If Analysis” in Excel.  Alternatively, a little algebra on the formulas above shows that the answer is the smallest whole number m at or above the solution of the equation below (solving it is left as an exercise for the reader).

$$4(m-1)(m-2)=\frac{-\ln(2)}{\ln(1-p)}$$.
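If you would rather let a computer do the exercise, here is one possible Python sketch. It computes the right-hand side of the equation and then steps m upward until the number of sequences exceeds it; the function name `smallest_grid` and its `target` parameter are invented for this example.

```python
import math


def smallest_grid(target=0.5):
    """Smallest m for which the chance of at least one match exceeds target."""
    p = 2 / 26**3
    # Number of trials at which 1 - (1 - p)**n equals the target exactly.
    n_needed = math.log(1 - target) / math.log(1 - p)
    m = 3  # smallest grid that can hold a 3-letter word
    while 4 * (m - 1) * (m - 2) <= n_needed:
        m += 1
    return m


print(smallest_grid())
```

Stepping through whole values of m avoids solving the quadratic by hand, at the cost of a short loop.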

Just for fun, here’s a graph of the probability of at least one match as a function of m.

Well, now I can tell my daughter it’s not so special that she found the word “ELF” in her Thanksgiving puzzle.  Or… I can just smile and nod, like a daddy should.

(And, by the way, to all you fellow math enthusiasts, I welcome your comments and critiques.  If I missed something or messed something up, let me know what I did and we’ll find a better answer.)

# How do you measure two-thirds?

From an article by Mary Ann Bragg which appeared on CapeCodOnline and was also printed in this month’s College Mathematics Journal:

TRURO — Voters narrowly approved one of four zoning amendments late Tuesday night at the annual town meeting. But town officials were still looking at the exact vote count on that article yesterday.

In a vote of 136 to 70, voters passed a new time limit on how quickly a cottage colony, cabin colony, motel or hotel can be converted to condominiums. The new limit requires that those properties be in operation for three years before being converted to condominiums.

The idea behind the zoning amendment is to slow the pace of condominium development in Truro and preserve more affordable accommodations for tourists, according to citizens proposing the warrant article.

Currently Truro does not allow condominiums complexes to be built outright in its zoning bylaws. Instead, property owners must build a cottage colony, cabins, motel or hotel first and then covert it to condominiums through a special permit.

The exact count of the vote — 136 to 70 —had town officials hitting their calculators yesterday. The zoning measure needed a two-thirds vote to pass. A calculation by town accountant Trudy Brazil indicated that 136 votes are two-thirds of 206 total votes, said Town Clerk Cynthia Slade.

But is it?  Is 136 a sufficient number of votes to be considered two-thirds of the total 206 votes?  Let’s check:

If you use the fact that $$\frac{2}{3} \approx 0.66$$ and then proceed to multiply 206 by 0.66 you get 135.96.  There were 136 votes in favor which is  more than 135.96 so that means it passes, right?  If you think so, then you’d be WRONG!!!

The main problem is the rounding.  In fact, $$\frac{2}{3} = 0.666666\ldots$$ or, using repeating decimal notation, $$\frac{2}{3} = 0.\bar{6}$$.  When you round, you are actually creating an error that, in this case, makes a pretty significant difference.

Think of it another way: let’s compare 136 / 206 to 2 / 3.  First, just do it by decimal approximation:

$$\frac{136}{206} \approx 0.660194174757 < 0.6666666667 \approx \frac{2}{3}$$

My calculator cannot exactly represent either of these fractions, but it’s accurate to 12 decimal places and I can clearly see that 136/206 < 2/3, so the vote should not pass.

Do you remember another way you can compare fractions?  Find a common denominator and convert each fraction, then compare.

$$\frac{136}{206} \cdot \frac{3}{3} = \frac{408}{618}$$

$$\frac{2}{3} \cdot \frac{206}{206} = \frac{412}{618}$$

So, here we see that, again,

$$\frac{136}{206} = \frac{408}{618} < \frac{412}{618} = \frac{2}{3}$$

This second method of checking is even better than the first because there are no approximations involved.  We’ve confirmed, absolutely, that 136 votes out of a total of 206 does NOT constitute two-thirds.
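The exact comparison is also easy to hand to a computer. The Python sketch below uses the standard library's `Fraction` type, which never rounds, and also works out how many yes votes would have been enough; the variable names are just for this example.

```python
from fractions import Fraction
import math

yes_votes, no_votes = 136, 70
total = yes_votes + no_votes

share = Fraction(yes_votes, total)  # exactly 136/206, no decimals involved
threshold = Fraction(2, 3)

passes = share >= threshold
# Cross-multiplying gives the same test using whole numbers only.
passes_by_cross_multiplying = 3 * yes_votes >= 2 * total
# Fewest yes votes that reach two-thirds of the total.
votes_needed = math.ceil(threshold * total)

print(passes, passes_by_cross_multiplying, votes_needed)
```

With 206 votes cast, two-thirds requires 138 in favor, so the measure fell two votes short.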

Fortunately, a good citizen made an anonymous call in Truro, MA, to clear this up.  What perplexes me is that they decided they needed to let the State Attorney General’s office decide on the correct count. The mathematical explanation wasn’t good enough. Can you say quantitative illiteracy?

# Shortest Sudoku Solver in Python

Well over two years ago on this blog (have I really been around that long?), I posted a link to a story that Sudoku had been solved.  (The original link to the Math-Forge story is broken, so here is an alternative version of the story.) While just about every computer scientist and programmer I know has thought up a quick little code to solve a Sudoku puzzle, the interesting element of the above story is that the algorithm solving Sudoku was connected to techniques used in diffraction microscopy.

Now, when I say “quick little code”, I meant an easy algorithm to implement, but not necessarily an elegant or amazingly small code that would accomplish the solution.  Here is definitely the smallest (shortest) code I’ve seen that will do it.

def r(a):i=a.find('0');~i or exit(a);[m
in[(i-j)%9*(i/9^j/9)*(i/27^j/27|i%9/3^j%9/3)or a[j]for
j in range(81)]or r(a[:i]+m+a[i+1:])for m in'%d'%5**18]
from sys import*;r(argv[1])

Here’s one in Perl that is slightly longer (185 bytes as opposed to the 178 above):

use integer;sub R{for$i(grep!$A[$_],@x=0..80){ %t=map{$_/27-$i/27|$_%9/3-$i%9/3&&$_/9-$i/9&&($_-$i)%9?0:$A[$_]=>1}@x; R($A[$i]=$_)for grep!$t{$_},1..9;return$A[$i]=0}
die@A}@A=split//,<>;R

HT: Scott’s Blog

# Forest Fire Simulation in MATLAB

In my Fall course of Math Models, I have three groups working on projects to finish up the semester.  One of the groups has an assignment to explore a model of the spread of a forest fire.  The assumptions are that the trees are on a rectangular grid, or a lattice.  The time is a discrete variable and at each time step the probability that the fire spreads from one point in the lattice to an adjacent point (up, down, left or right) is given by p.  For simplicity, the event that the fire spreads to each point is assumed to be independent of any other point.

Part of their project is to implement a numerical simulation of their forest fire.  I couldn’t let them have all the fun, so below is an example of my version of the simulation in MATLAB.  I have to hold off on posting the code until after they have handed in their project.

In the graphical representation of my simulation, green represents an unburnt tree, black is burnt and red is currently on fire.  The fire lasts for exactly one time step.  I also implemented a 3-D version, where a height of 1 is unburnt, 2 is on fire, and 0 is burnt.  I’ll confess to having way too much fun with this.

I have used a 200×200 lattice with p = 0.5.
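For anyone who wants to play along in the meantime, the model described above can be sketched in a few lines of Python. To be clear, this is not the MATLAB code from my simulation: it is a minimal sketch that assumes the fire starts at the centre tree, uses a smaller 50×50 lattice so it runs quickly, and borrows the 0/1/2 coding from the 3-D version. The function names `step` and `simulate` are made up for this example.

```python
import random

UNBURNT, BURNING, BURNT = 1, 2, 0  # same coding as the 3-D heights


def step(grid, p, rng):
    """Advance the fire by one time step and return the new grid."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            new[r][c] = BURNT  # a fire lasts exactly one time step
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                in_grid = 0 <= nr < rows and 0 <= nc < cols
                # Each burning tree gets an independent chance p of
                # igniting each unburnt neighbour.
                if in_grid and grid[nr][nc] == UNBURNT and rng.random() < p:
                    new[nr][nc] = BURNING
    return new


def simulate(size, p, seed=None):
    """Ignite the centre tree (an assumption) and run until the fire dies."""
    rng = random.Random(seed)
    grid = [[UNBURNT] * size for _ in range(size)]
    grid[size // 2][size // 2] = BURNING
    steps = 0
    while any(BURNING in row for row in grid):
        grid = step(grid, p, rng)
        steps += 1
    return grid, steps


grid, steps = simulate(size=50, p=0.5, seed=1)
burnt = sum(row.count(BURNT) for row in grid)
print(steps, "steps,", burnt, "trees burnt")
```

Building each new grid from a copy of the old one keeps the update synchronous: a tree ignited during a step cannot spread the fire until the next step.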

# Algebra – It’s Everywhere

A good read from the San Francisco Chronicle:  Algebra – it’s everywhere by Jill Tucker.

Algebra, says Devlin, is a language, a very precise language written in symbols, and it’s everywhere: in nearly all electronic devices, every statistic and each Internet search engine – and, indeed, in every train leaving Boston.

"You can store information using it. You can communicate information using it," Devlin said. "Google has made billions capitalizing on algebra."

Yet our schools don’t always do a very good job teaching it, Devlin said. Instead of showing students the possibilities and beauty algebra offers, they ultimately steer frustrated and bored students away from math and the 21st century careers that use it – the opposite of the intended result.

Algebra, by the dictionary’s definition, is essentially abstract arithmetic, letters and symbols representing relationships between groups, sets, matrices or fields. It’s a way to find a piece to a puzzle using the pieces you already have in place.

It comes in very handy for engineers, financial analysts and sociologists, not to mention World of Warcraft video game players, some of whom use algebraic formulas to decide which weapon is more effective under certain circumstances – perhaps another hook to lure unsuspecting teens into seeing the useful side of algebra.

Laptop computer. The computer is just an implementation in electrical circuits of a special form of algebra (called Boolean algebra) invented in the 19th century. Ordinary algebra is used to design and manufacture computers, and is at the heart of how to program them.

Cell phone. A cell phone is a particular kind of computer. An important feature of cell phones is that your phone receives all the signals sent to every cell phone in the region, but only responds to signals sent to your phone. This is achieved by using signal coding systems built on algebra.

Parking cop. Today’s parking enforcement officers may carry equipment connecting them directly to a central vehicle database that registers your parking fine before you get back to the car and see the ticket on the windshield. Without algebra, such a system could not exist.

Hybrid car. Modern cars often come equipped with GPS, a highly sophisticated system that is designed using enormous amounts of mathematics that builds on algebra.

Delivery truck. Large retail chains use mathematical methods to determine the routing and scheduling of their delivery trucks; algebra is fundamental to those methods.

Stoplight. These days, stoplights are centrally controlled by computers, so there is even algebra involved in turning the light from red to green.

IPod. This is a math device in your hand. The iPod stores music using sophisticated mathematics built on algebra. And the iPod shuffle mechanism uses regular school algebra to order your songs randomly.

Even though it is a very pro-algebra article, my favorite quote was by an unknown source:

"Algebra … the intensive study of the last three letters of the alphabet."

# Why would I even need to learn that?

I have a calculator.  I can answer all the math problems I’ll ever need because I own a calculator.  There are many people that worry me when they say they were never any good at math: the nurse administering the medication, the clerk counting my change, the broker managing my investments, the salesman offering me financing at the car dealership, and now, the cop giving parking tickets:

From 360 (Unofficial Blog of the Nazareth College Math Department in Rochester, New York):

The Herald reported last week that a Traffic Warden was incorrectly ticketing cars in a Devon, England parking lot because of how he was using a calculator. In this parking lot, drivers would pay for a certain amount of time and then post a slip in the windshield with the time they’d entered and how long they’d paid for. One driver, for example, entered at 2:49pm and paid for 75 minutes.

Now 75 minutes is 1 hour, 15 minutes so the driver was covered until 4:04pm. But the Traffic Warden figured out the expiration time by entering in 14.49 into his calculator (for 1449 military time, which corresponds to 2:49pm) and adding on 0.75 (for the 75 minutes). He got 15.24, which he interpreted as meaning that the driver was only covered until 3:24pm. Since it was already 3:41pm, he issued the car a ticket. The car owner saw all this and tried to explain the error — that hours have 60 minutes, not 100, so standard decimal addition doesn’t apply — but the Traffic Warden didn’t see any problem and continued to ticket cars.

In good news, after appeal the incorrect tickets were repealed and a letter of apology sent.
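The difference between the two calculations is easy to reproduce. The Python sketch below uses the standard `datetime` module, which knows that an hour has 60 minutes, alongside the warden's decimal addition; the function name `expiry_time` is made up for this example.

```python
from datetime import datetime, timedelta


def expiry_time(entered, minutes_paid):
    """Return the expiry time as HH:MM for an entry time given as HH:MM."""
    start = datetime.strptime(entered, "%H:%M")
    return (start + timedelta(minutes=minutes_paid)).strftime("%H:%M")


correct = expiry_time("14:49", 75)  # clock arithmetic
mistaken = round(14.49 + 0.75, 2)   # the warden's decimal addition
print(correct, mistaken)
```

The clock arithmetic gives 16:04, while the decimal sum gives 15.24, which the warden read as 3:24pm.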

# Gas Price Economics

I received an email earlier today trying to promote some sort of a boycott of Exxon/Mobil gasoline stations in an effort to force them to lower their gas prices. Recognizing that there are few around my neck of the woods, I didn’t pay much attention to the email. Plus, I pretty much disregard those kinds of efforts anyway.

A follow-up email attempted to make the point that we aren’t paying that much more for gasoline considering a significant increase in fuel efficiency over the last 20 – 30 years. The examples cited were anecdotal and encouraged me to do a little research on my own.

I was surprised to see that the increase in fuel economy is a lot less than one might have expected over the last 30 years. According to the National Highway Traffic Safety Administration (NHTSA), the average gas mileage for new vehicles sold in the United States has gone from 23.1 miles per gallon (mpg) in 1980 to 26.7 mpg in 2007. This represents a paltry increase of about 16% over the 27-year period. Even if you limit yourself to domestic passenger cars the increase is from 22.6 mpg in 1980 to 31.3 mpg in 2007.

Even more interesting to me is the fact that we have benefited from a relatively low cost of gasoline for an extended period of time. (see here) Adjusting for inflation, we see a steady decline in the cost of gasoline dating all the way back to the 1920s. The only exceptions are the late 70s, the early 80s and the last 5 years. Prices are at their upper limit even with inflation considered. When considering only yearly averages, the highest cost occurred during 1981 at $3.17 (adjusted to 2008 dollars). Through March of 2008, this year’s annual average has been $3.08.

Now back to the original point: on average, the cost per mile (in 2008 dollars) was 12.9 cents in 1981 (when gas averaged $3.17 per gallon in 2008 dollars and the average fuel economy was 24.6 miles per gallon). The average cost per mile is currently 13.6 cents (with a current national average of $3.63 per gallon and an average fuel economy of 26.7 mpg). In the end, while it seems that we are paying a ghastly amount at the pump, we aren’t that far above the historical high. Nevertheless, we are, in fact, paying more than ever.
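For anyone who wants to check the arithmetic or plug in other prices, here is a short Python sketch using the figures quoted in this post. The function names `cents_per_mile` and `percent_increase` are made up for this example.

```python
def cents_per_mile(price_per_gallon, mpg):
    """Fuel cost of driving one mile, in cents."""
    return 100 * price_per_gallon / mpg


def percent_increase(old, new):
    """Relative change from old to new, as a percentage."""
    return 100 * (new - old) / old


cost_1981 = cents_per_mile(3.17, 24.6)   # 1981 price in 2008 dollars
cost_2008 = cents_per_mile(3.63, 26.7)   # current national average
mpg_gain = percent_increase(23.1, 26.7)  # new-vehicle mpg, 1980 to 2007

print(round(cost_1981, 1), round(cost_2008, 1), round(mpg_gain, 1))
```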

# Extraterrestrial Life Unlikely

Professor Andrew Watson of the University of East Anglia has recently published a paper in the February issue of Astrobiology entitled Implications of an anthropic model of evolution for emergence of complex life and intelligence. In this article he argues that a number of limitations must be overcome in order for evolution to progress to the point of producing intelligent life.

Watson postulates that for intelligent observers to evolve, a small number (n) of very difficult evolutionary steps must be passed. Once passed, evolution occurs quickly until the next stage is reached. Complex and intelligent life evolved quite late on Earth and Watson suggests that this may be because of the difficulty in passing these stages. He suggests that n is less than 10 and most likely equal to 4. These stages include the emergence of single-celled bacteria, bacteria with complex cells, cells allowing complex life forms, and intelligent life.

Professor Watson uses the Earth’s fossil records to establish upper bounds on the probability for each state.

The work supports the Rare Earth hypothesis which postulates that the emergence of complex multicellular life (metazoa) on Earth required an improbable combination of astrophysical and geological events and circumstances.

Read more about his paper at Plus Magazine.  At the time I am writing this entry, the article is freely available at the Astrobiology Journal site.