{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Loading required package: ggplot2\n", "0x00007f452c00a250" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ ":ext QuasiQuotes\n", "import qualified H.Prelude as H\n", "H.initialize H.defaultConfig\n", "[r| require(\"ggplot2\") |]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bayes Explorations for a Newbie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm trying to learn the fundamentals of Bayes. It seems like a very wide subject. My main interest would be applying\n", "it to AI but I wanted to start from the basics.\n", "\n", "When I was learning probability, to check my answers, I would write a Haskell program\n", "to simulate the problem and compare it's anwer to my math answer. It was a great help\n", "in building my confidence.\n", "\n", "So, I thought I would do the same thing while learning Bayes.\n", "\n", "I took what I thought would be an easy problem and wrote a simulator, the results\n", "of which are shown below.\n", "\n", "My confusion about the results led me to some fundamental concepts that I didn't understand\n", "(and probably don't yet), independence, and causality.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sample Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have a cargo trailer with an alarm system. It sometimes has false alarms. So, I wanted\n", "to know if I should panic whenever the alarm goes off.\n", "\n", "Or, what is the probability that I am being burgularized given the the alarm is going off?\n", "\n", "Ok, so off to a Haskell simulation ..\n", "\n", "I generated large boolean vectors to simulate both my alarm going off and burgularies,\n", "independent of each other. I know now that this doesn't really represent the\n", "situation but suspend disbelief for\n", "a moment.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the probability of burgularies be 10 percent (pretend I live in a high crime area),\n", "and the probability of my alarm going off is 3 percent (bad alarm)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$P_B = 0.10\\ \\ \\$ Probality of being burgled (from county statistics>
\n", "$P_A = 0.03\\ \\ \\$ From my experience" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Here's the relevant code," ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/html": [ "
Redundant if
Found:
if (x > pctOnes) then False else True
Why Not:
not (x > pctOnes)
Redundant bracket
Found:
if (x > pctOnes) then False else True
Why Not:
if x > pctOnes then False else True
Redundant bracket
Found:
(fromIntegral aT) / (fromIntegral bT)
Why Not:
fromIntegral aT / (fromIntegral bT)
Redundant bracket
Found:
(fromIntegral aT) / (fromIntegral bT)
Why Not:
(fromIntegral aT) / fromIntegral bT
Redundant bracket
Found:
if (b && a) then cumA + 1 else cumA
Why Not:
if b && a then cumA + 1 else cumA
Redundant bracket
Found:
(fromIntegral aAndb) / (fromIntegral total)
Why Not:
fromIntegral aAndb / (fromIntegral total)
Redundant bracket
Found:
(fromIntegral aAndb) / (fromIntegral total)
Why Not:
(fromIntegral aAndb) / fromIntegral total
Redundant bracket
Found:
if (a && b) then cumAandB + 1 else cumAandB
Why Not:
if a && b then cumAandB + 1 else cumAandB
Redundant bracket
Found:
(fromIntegral cumA) / (fromIntegral total)
Why Not:
fromIntegral cumA / (fromIntegral total)
Redundant bracket
Found:
(fromIntegral cumA) / (fromIntegral total)
Why Not:
(fromIntegral cumA) / fromIntegral total
" ], "text/plain": [ "Line 21: Redundant if\n", "Found:\n", "if (x > pctOnes) then False else True\n", "Why not:\n", "not (x > pctOnes)Line 21: Redundant bracket\n", "Found:\n", "if (x > pctOnes) then False else True\n", "Why not:\n", "if x > pctOnes then False else TrueLine 30: Redundant bracket\n", "Found:\n", "(fromIntegral aT) / (fromIntegral bT)\n", "Why not:\n", "fromIntegral aT / (fromIntegral bT)Line 30: Redundant bracket\n", "Found:\n", "(fromIntegral aT) / (fromIntegral bT)\n", "Why not:\n", "(fromIntegral aT) / fromIntegral bTLine 34: Redundant bracket\n", "Found:\n", "if (b && a) then cumA + 1 else cumA\n", "Why not:\n", "if b && a then cumA + 1 else cumALine 40: Redundant bracket\n", "Found:\n", "(fromIntegral aAndb) / (fromIntegral total)\n", "Why not:\n", "fromIntegral aAndb / (fromIntegral total)Line 40: Redundant bracket\n", "Found:\n", "(fromIntegral aAndb) / (fromIntegral total)\n", "Why not:\n", "(fromIntegral aAndb) / fromIntegral totalLine 43: Redundant bracket\n", "Found:\n", "if (a && b) then cumAandB + 1 else cumAandB\n", "Why not:\n", "if a && b then cumAandB + 1 else cumAandBLine 49: Redundant bracket\n", "Found:\n", "(fromIntegral cumA) / (fromIntegral total)\n", "Why not:\n", "fromIntegral cumA / (fromIntegral total)Line 49: Redundant bracket\n", "Found:\n", "(fromIntegral cumA) / (fromIntegral total)\n", "Why not:\n", "(fromIntegral cumA) / fromIntegral total" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import Text.Printf\n", "import System.Random\n", "\n", "type Fun = ((Int,Int) -> (Bool,Bool) -> (Int,Int))\n", "type Cum = (Int,Int)\n", "\n", "data FunCum = FunCum Fun Cum\n", "instance Show FunCum where\n", " show (FunCum _ cum) = show cum\n", "\n", "-----------------\n", "-- Utilities --\n", "-----------------\n", "\n", "genRands :: Double -> Double -> [Double]\n", "genRands low high =\n", " randomRs (low, high) (mkStdGen 42)\n", "\n", "asBools :: Double -> [Double] -> [Bool] \n", "asBools pctOnes = \n", " map (\\x -> if (x > pctOnes) then False else True)\n", "\n", "---------------------\n", "-- Probabilities --\n", "---------------------\n", "\n", "pAGivenB :: [Bool] -> [Bool] -> Double\n", "pAGivenB a b = \n", " let (bT, aT) = foldl f (0,0) (zip a b)\n", " in (fromIntegral aT) / (fromIntegral bT) :: Double\n", " where f :: (Int,Int) -> (Bool,Bool) -> (Int, Int)\n", " f (cumB, cumA) (a,b) =\n", " let bTrues = if b then cumB + 1 else cumB\n", " aTrues = if (b && a) then cumA + 1 else cumA \n", " in (bTrues, aTrues)\n", "\n", "pAandB :: [Bool] -> [Bool] -> Double\n", "pAandB a b = \n", " let (total, aAndb) = foldl f (0,0) (zip a b)\n", " in (fromIntegral aAndb) / (fromIntegral total) :: Double\n", " where f :: (Int,Int) -> (Bool,Bool) -> (Int, Int)\n", " f (cumT, cumAandB) (a,b) =\n", " let aAndbs = if (a && b) then cumAandB + 1 else cumAandB \n", " in (cumT+1, aAndbs)\n", "\n", "pA :: [Bool] -> Double\n", "pA as =\n", " let (total, cumA) = foldl f (0,0) as\n", " in (fromIntegral cumA) / (fromIntegral total) :: Double\n", "\n", " where f :: (Int,Int) -> Bool -> (Int, Int)\n", " f (cumT, cumA) a =\n", " let newCumA = if a then cumA + 1 else cumA \n", " in (cumT+1, newCumA)\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ " let cnt = 1000000\n", "\n", " -- Generate the Distributions\n", " rs <- return $genRands 0.0 1.0\n", " burgulary <- return$ take cnt (asBools 0.10 rs)\n", " alarm <- return $take cnt (asBools 0.03 (drop cnt rs))\n", "\n", " -- Do the computation\n", " let aAndb = pAandB alarm burgulary\n", " let a = pA alarm\n", " let b = pA burgulary\n", " let aGb = pAGivenB alarm burgulary\n", " let bGa = pAGivenB burgulary alarm\n", "\n", " -- A bit of formatting\n", " let aStr = (printf \"%7.3e\" (100.0*a::Double))\n", " let aAndbStr = (printf \"%7.3e\" (100.0*aAndb::Double))\n", " let bStr = (printf \"%7.3e\" (100.0*b::Double))\n", " let agbStr = (printf \"%7.3e\" (100.0*aGb::Double))\n", " let bgaStr = (printf \"%7.3e\" (100.0*bGa::Double))\n", "\n", " --putStrLn$ \"alarm => \" ++ aStr ++ \" percent\"\n", " --putStrLn $\"burg => \" ++ bStr ++ \" percent\"\n", " --putStrLn$ \"burg and alarm => \" ++ aAndbStr ++ \" percent\"\n", " --putStrLn $\"alarm given burg => \" ++ agbStr ++ \" percent\"\n", " --putStrLn$ \"burg given alarm => \" ++ bgaStr ++ \" percent\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After $\\ 10^6\\$ samples, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "| Percent Probabilities | Value |\n", "| --- | --- |\n", "| Alarm | 2.965 |\n", "| Burgulary | 9.991 |\n", "| Alarm and Burgulary | 0.2976 |\n", "| Alarm given Burgulary | 2.979 |\n", "| Burgulary given Alarm | 10.04 |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, first (for fun) let's see if the independence check is passed," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$P(A\\ and\\ B) = P(A)\\cdot P(B)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here,

\n", "$P(A\\ and B) = 0.02976$
\n", "\n", "$P(A)\\cdot P(B) = 0.02965 \\cdot 0.0991 = 0.2938$,
\n", "\n", " ..... Pretty close." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dependence / Causality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above burgulary and alarm distrubutions are independent but\n", "for my specific situation there are some dependencies.\n", "\n", "Clearly if there is a burgular my alarm could be activated.\n", "And although remote, it's possible that the burgularly could\n", "be effected by my alarm if my alarm was going off (accidentally)\n", "as the burgular approached my cargo trailer. He would probably\n", "just leave (quickly) and his action would not be recorded in the\n", "county statistics for burgularies.\n", "\n", "There are other reasons my alarm might go off,\n", "\n", "- My cat\n", "- The wind (if it was an ultrasonic alarm)\n", "- Part of the ceiling falling in front of the sensor\n", "- A power glitch\n", "- etc....\n", "\n", "So, we need to capture all these factors to compute our desired,\n", "\n", "$$P\\ (\\ \\ cargo\\_crailer\\ being\\_burgled\\ \\ \\ |\\ \\ \\ alarm\\_going\\_off\\ \\ )$$\n", "\n", "or, from now on, in shorthand,\n", "\n", "$$P(B\\ |\\ A)$$\n", "\n", "as I want to know if I should panic when I receive a text message\n", "(yes I have an IOT alarm :-($\\ \\$) or discount it as another false\n", "alarm.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, let's start with the basics,\n", "\n", "$$P(B\\ |\\ A) = \\frac{P(A\\ |\\ B)\\cdot P(B)}{P(A)}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do we know?\n", "\n", "We will use different numbers than the simulation above.\n", "\n", "We can come pretty close with the probability of being burgularized, $P(B)$.\n", "\n", "We will ignore the edge case (the burgular hears my alarm and leaves) and \n", "just look at the county records for burgularies.\n", "\n", "So,\n", "\n", "$$P(B) = 0.10$$\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The $P(A)$ is harder to compute as it's dependent on several factors besides a burgular.\n", "Here we will only conder the cat and the wind.\n", "\n", "So here we can turn to total probability,\n", "\n", "$$P(A) = P(A|B)\\cdot P(B) \\ +\\ P(A|Cat)\\cdot P(Cat) \\ +\\ P(A|Wind)\\cdot P(Wind)$$\n", "\n", "Let's fill in some 'belief' numbers,\n", "\n", "$$P(A|B) = 0.95$$\n", "$$P(B) = 0.10$$\n", "$$P(A|Cat) = 0.03$$\n", "$$P(Cat) = 0.50$$\n", "$$P(A|Wind) = 0.01$$\n", "$$P(Wind) = 0.05$$\n", "\n", "Yielding,\n", "\n", "$$P(A) = 0.95 \\cdot 0.10 + 0.03 \\cdot 0.50 + 0.01 \\cdot 0.05$$\n", "\n", "$$P(A) = 0.1105$$\n", "\n", "Now all we have left is, $P(A|B)$\n", "\n", "Here we take a guess as to the reliability of our alarm. The manufacturer spec\n", "says the it will recognize being burgled 98% of the time.\n", "\n", "So,\n", "\n", "$$P(A|B) = 0.95$$\n", "\n", "Putting it all together we get,\n", "\n", "$$P(B|A) = \\frac{0.95\\cdot 0.10}{0.1105}$$\n", "\n", "or,\n", "\n", "$$P(B|A) = 0.85$$\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now how do we tackle the edge cases?\n", "\n", "In this case the edge cases will likely not change my mind as to calling the police\n", "when my SMS message is received (83% instead of 85% ?).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about time frames .. in a year?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "External link below," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[VPS](http://vps.faroth.com)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Really Cool Figure\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n" ], "text/plain": [ " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "[rgraph|\n", " ggplot(mtcars, aes(x=wt, y=mpg)) + \n", " ggtitle(\"Fuel Efficiency of 32 Cars\") +\n", " xlab(\"Weight (x1000 lb)\") + ylab(\"Miles per Gallon\") +\n", " geom_line(linetype=\"solid\", color=\"red\", size=0.5) +\n", " theme(text=element_text(family=\"Ariel\", size=12)) |]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n" ], "text/plain": [ " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "[rgraph|\n", " d <- data.frame(age_hs,shoeSize_hs,constant_hs)\n", " ggplot(d, aes(age_hs)) + \n", " theme(text=element_text(family=\"Ariel\", size=12)) +\n", " geom_line(aes(y = shoeSize_hs, color = \"showSize_hs\")) + \n", " geom_line(aes(y = constant_hs, color = \"constant_hs\"), linetype=\"dashed\") +\n", " coord_fixed(ratio=2) +\n", " labs(color = \"Plot Type\") |]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put this below labs to change legend size?
\n", "guides(color = guide_legend(override.aes = list(size = 5)))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "let patients = 1000 -- Patient count (limited to 10^6)\n", "let infectionRates = [0.001,0.005..0.4]\n", "let bkevens = take (length infectionRates) (repeat 1000.0)\n", "let (tests,iRates) = unzip \$ hermanPlotData patients infectionRates" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n" ], "text/plain": [ " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "[rgraph|\n", " d <- data.frame(iRates_hs,tests_hs,bkevens_hs)\n", " ggplot(d, aes(iRates_hs)) + \n", " ggtitle(\"Tests Required for Infection Rates\") + theme_grey() +\n", " theme(text=element_text(family=\"Ariel\", size=12),\n", " legend.title=element_blank(),\n", " plot.title = element_text(hjust = 0.5),\n", " legend.position=\"bottom\", aspect.ratio=0.7) +\n", " geom_line(aes(y = tests_hs, color = \"tests_hs\")) + \n", " geom_line(aes(y = bkevens_hs, color = \"bkeven_hs\"),\n", " linetype=\"dashed\") +\n", " labs(color = \"Plot Type\") +\n", " annotate( geom = \"curve\", x = 0.1, y = 1400,\n", " xend = 0.15, yend = 1100, \n", " curvature = -.3, arrow = arrow(length = unit(2, \"mm\"))) +\n", " annotate(geom = \"text\", x = 0.09, y = 1400,\n", " family = \"ariel\", label = \"Break Even\", hjust = \"right\") |]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> \n", " Background color is great\n", " and even more awesomeand even more awesomeand even awesome and even more awesomeand even more awesomeand even more awesome and even more awesomeand even more awesomeand even more awesomeand even more awesomeand even more awesome\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">>
\n", " Background color is great\n", " and even more awesomeand even more awesomeand even awesome and even more awesomeand even more awesomeand even more awesome and even more awesomeand even more awesomeand even more awesomeand even more awesomeand even more awesome\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">
alarm            =>  2.965  percent\n",
"burg             =>  9.991  percent  \n",
"burg and alarm   =>  0.2976 percent\n",
"alarm given burg =>  2.979  percent\n",
"burg given alarm =>  10.04  percent\n",
"