• see here for canvas stuff
  • see here for Calvin's notes (this should be transcribed w/ OCR later..)

Left off on: stats_quiz 4 Tables and Probability & dpqr2

Some basic variable definitions (maths)

In say

Expected value and variance

  • E(X) = expected value. Always = to mu
  • V(X) = Variance. Always = to sigma^2
  • Directly plug in these values

E(X) is always mu and variance is always sigma^2: mu and sigma^2 are just the symbols for the mean and variance of X.
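
A minimal sketch of what "expected value" and "variance" mean for a discrete variable. The values in `x` and `p` are made up for illustration and do not come from the course material.

```r
# Hypothetical discrete random variable: values x with probabilities p
x <- c(0, 1, 2)
p <- c(0.25, 0.50, 0.25)

mu     <- sum(x * p)            # E(X) = sum of x * P(X = x)
sigma2 <- sum((x - mu)^2 * p)   # V(X) = E[(X - mu)^2]

mu      # 1
sigma2  # 0.5
```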

read/filter/table

Read in csv, filter some rows, generate table, find probability

library(dplyr) # needed for %>% and filter()

# read in csv
ddt = read.csv("filename")
 
# short reminder filtering data (row condition before the comma, all columns after):
ddt[ddt$LENGTH > 50 & ddt$SPECIES == "CCATFISH",]
 
# we can also use dplyr's filter for the same thing:
ddt %>% filter(LENGTH > 50 & SPECIES == "CCATFISH") # TODO: this should have a better example
 
# if we want to make a table:
tab <- with(ddt, table(SPECIES, RIVER)) # with() just means we don't need to write ddt$ twice
addmargins(tab) # adds the row/column sums
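
A small sketch of the last step ("find probability") from a table. The data frame `toy` is a made-up stand-in for the real csv; only the column names SPECIES and RIVER are taken from the notes.

```r
# Hypothetical toy data standing in for the real ddt csv
toy <- data.frame(
  SPECIES = c("CCATFISH", "CCATFISH", "LMBASS", "SMBUFFALO", "CCATFISH", "LMBASS"),
  RIVER   = c("TRM", "FCM", "TRM", "TRM", "TRM", "FCM")
)

tab <- with(toy, table(SPECIES, RIVER))
addmargins(tab)            # counts with row/column sums

# P(SPECIES = CCATFISH and RIVER = TRM) = cell count / grand total
p_joint <- tab["CCATFISH", "TRM"] / sum(tab)
p_joint                    # 2 out of 6

# Same idea for every cell at once
prop.table(tab)
```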

Tables w/ OR, AND, GIVEN

stats_quiz 4 Tables and Probability

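# given MTBE dataset load into table:
tb = table(MTBE$col1, MTBE$col2)
tb = addmargins(tb) # keep the version with the Sum row/column
tb
# outputs below..
          Below Limit Detect Sum
  Private          81     22 103
  Public           72     48 120
  Sum             153     70 223

22/70 # GIVEN: P(Private | Detect), note the denominator is just the sum of Detect

72/223 # AND: P(Public and Below Limit), intersection over total

70/223 # P(Detect), column sum over total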
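
The heading mentions OR, but only GIVEN and AND are worked. A sketch of the OR case using the counts from the MTBE table (the event "Private or Detect" is picked here only as an illustration):

```r
# Counts copied from the MTBE table in the notes
tb <- matrix(c(81, 22,
               72, 48),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Private", "Public"), c("Below Limit", "Detect")))
total <- sum(tb)                              # 223

p_private <- sum(tb["Private", ]) / total     # 103/223
p_detect  <- sum(tb[, "Detect"]) / total      # 70/223
p_both    <- tb["Private", "Detect"] / total  # 22/223

# OR: add the two events, subtract the overlap so it is not counted twice
p_private_or_detect <- p_private + p_detect - p_both
p_private_or_detect                           # 151/223
```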

Filtering qns

Sources:

# How many fish have a WEIGHT strictly between 1000 and 1600 and are of the LMBASS SPECIES? Use dplyr!
ddt %>% filter(WEIGHT > 1000 & WEIGHT < 1600 & SPECIES == "LMBASS") %>% nrow() # nrow() gives the count
 
# How many fish have a WEIGHT larger than 1600? Use "[]"!
d = ddt[ddt$WEIGHT > 1600,]
nrow(d) # number of rows = number of fish

Outliers: 3 types (extreme,mild,all) IMPT using 3xIQR

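# Boxplot method: range = how many IQRs past the box a point must be to count as an outlier
b1 = boxplot(ddt$DDT, range = 3) # Extreme outliers (beyond 3x IQR)
length(b1$out)
[1] 12 
 
b2 = boxplot(ddt$DDT, range = 1.5) # All outliers (beyond 1.5x IQR)
setdiff(b2$out, b1$out) # Only mild outliers (between 1.5x and 3x IQR)
[1] 28 31 33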
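
A sketch of the same extreme/mild split without drawing a plot, using `boxplot.stats()` (its `coef` argument plays the role of `range`). The vector `x` is made up for illustration and is not the DDT data.

```r
# Hypothetical measurements (not the real DDT column)
x <- c(12, 10, 11, 40, 13, 11, 12, 17, 10, 13, 11, 12)

extreme <- boxplot.stats(x, coef = 3)$out    # beyond 3 x IQR
all_out <- boxplot.stats(x, coef = 1.5)$out  # beyond 1.5 x IQR
mild    <- setdiff(all_out, extreme)         # between the two fences

extreme  # 40
mild     # 17
```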

Stats_Quiz 3 boxplot

Outliers using z-score method

We use the function scale() on a vector to generate the z-score of each value. We then count how many values have |z| > 3.

# z-score method..
df <- ddt %>% 
  filter(SPECIES == "CCATFISH") %>% # create subset w/ only catfish
  mutate(z = as.numeric(scale(DDT))) # z-scores within the subset (scale() returns a matrix)
 
sum(abs(df$z) > 3) # why 3? ~99.7% of normal data lies within 3 sd (empirical rule)

z-score (standard score) aka z-transformation

2) Find all possible DDT outliers. Submit the number of fish classified as possible outliers using boxplot()!
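
A quick sketch showing that `scale()` is just (value - mean) / sd. The vector `x` is a made-up example.

```r
# Hypothetical values, just to show what scale() computes
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

z_manual <- (x - mean(x)) / sd(x)  # z = (value - mean) / sd
z_scale  <- as.numeric(scale(x))   # scale() does the same centering and scaling

all.equal(z_manual, z_scale)       # TRUE
sum(abs(z_scale) > 3)              # count of values more than 3 sd from the mean
```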

dpqr

Formal definition (not too impt)

  • density (d):
    • binom/poisson = prob of getting a certain value, P(X = x)
    • norm = height of the density curve at x, not a probability by itself (the probability of getting a value between 0.9 and 1.1 is an area, so that needs p)
  • probability (p): total probability of getting a value up to a certain point, P(X <= x) (aka AUC up till x-value)
  • quantile (q): inverse of p, xth % gives y value. What x-value has 97.5% of the data below it?
  • random (r): generates a random sample
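
A short sketch of the four prefixes on the normal distribution. The Normal(mean = 1, sd = 0.1) used for the 0.9 to 1.1 range is an assumption made up for the example.

```r
dnorm(0)        # d: height of the standard normal curve at 0 (about 0.399)
pnorm(1.96)     # p: P(Z <= 1.96), about 0.975
qnorm(0.975)    # q: value with 97.5% of the data below it, about 1.96

# Range probabilities for a continuous variable come from p, not d.
# Assumed example: X ~ Normal(mean = 1, sd = 0.1)
pnorm(1.1, mean = 1, sd = 0.1) - pnorm(0.9, mean = 1, sd = 0.1)  # about 0.683

set.seed(1)     # r: random sample (the seed only makes the draw repeatable)
rnorm(3)
```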

Condensed table of discrete and cont. probabilities:

Some basic defs: lower tail = P(X <= x), upper tail = the opposite, P(X > x).

  • Discrete (bar chart, countable)
  • Cont. (AuC)

Condensed table:

  • <= : p function
  • > : 1-p..
  • < or >= : k-1 (first arg -1)
  • = : d function

| Probability you want | R expression |
| --- | --- |
| P(X <= k) (lower tail) | pbinom(k, n, p) |
| P(X < k) | pbinom(k-1, n, p) |
| P(X >= k) | 1 - pbinom(k-1, n, p) |
| P(X > k) | 1 - pbinom(k, n, p) |
| P(X = k) | dbinom(k, n, p) |
| P(k < X <= m) | pbinom(m, n, p) - pbinom(k, n, p) |
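
A sketch that checks each R expression in the condensed table against a direct sum of point probabilities. The values of `n`, `p`, `k` and `m` are arbitrary illustration values.

```r
n <- 10; p <- 0.4; k <- 3; m <- 6   # arbitrary illustration values

# Each expression checked against a direct sum of P(X = x); every line gives TRUE
all.equal(pbinom(k, n, p),         sum(dbinom(0:k, n, p)))        # P(X <= k)
all.equal(pbinom(k - 1, n, p),     sum(dbinom(0:(k - 1), n, p)))  # P(X < k)
all.equal(1 - pbinom(k - 1, n, p), sum(dbinom(k:n, n, p)))        # P(X >= k)
all.equal(1 - pbinom(k, n, p),     sum(dbinom((k + 1):n, n, p)))  # P(X > k)
all.equal(pbinom(m, n, p) - pbinom(k, n, p),
          sum(dbinom((k + 1):m, n, p)))                           # P(k < X <= m)
```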

Distribution types

  • Normal distribution (bell curve)
  • Binom: # successes in fixed # trials
  • Geo: # trials till first success (R's *geom functions count the failures before the first success)
  • Poisson: # events happening in a fixed interval
  • hyper: successes w/o replacement (drawn cards do not go back into the deck)
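| Distribution | PMF/PDF (d*) | CDF (p*) | Quantile (q*) | Random (r*) |
| --- | --- | --- | --- | --- |
| Normal | dnorm | pnorm | qnorm | rnorm |
| Binomial | dbinom | pbinom | qbinom | rbinom |
| Geometric | dgeom | pgeom | qgeom | rgeom |
| Hypergeometric | dhyper | phyper | qhyper | rhyper |
| Poisson | dpois | ppois | qpois | rpois |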

Train type problem -m dpqr

The problem:

Generating own dpqr

dtrain <- function(x){
  # (-5, 5) is from the integral bounds; the density is a triangle
  # with peak height 0.2 at x = 0 and slope 0.04, so the total area is 1
  ifelse(x > -5 & x < 5, 0.2 - 0.04 * abs(x), 0)
}
 
ptrain <- function(q){
  ifelse(q <= -5,
         0,  # Case 1: left of the triangle
         ifelse(q <= 0,
                0.02 * (q + 5)^2,  # Case 2: rising (left) slope
                ifelse(q < 5,
                       0.5 + 0.2 * q - 0.02 * q^2, # Case 3: falling (right) slope
                       1))) # Case 4: right of the triangle
}
 
qtrain <- function(p){
  # numerically invert ptrain: find the x with ptrain(x) = p
  myroot <- function(p1) {
    k <- function(x){
      ptrain(x) - p1
    }
    l <- stats::uniroot(f = k, interval = c(-5, 5))
    l$root
  }
  sapply(p, myroot) # one root per probability, so vectors of p work too
}
#Doubt this will be tested
rtrain <- function(n){
  r <- runif(n, min = 0, max = 1)  
  qtrain(r)
}
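
A sanity-check sketch for a hand-made density: the area must be 1, and integrating the density should match the closed-form CDF. `dtri` is a stand-in copy of the triangular density so the block runs on its own.

```r
# Stand-in copy of the triangular density so this runs on its own
dtri <- function(x) ifelse(x > -5 & x < 5, 0.2 - 0.04 * abs(x), 0)

# Total area under a density must be 1 (integral split at the kink at 0)
area <- integrate(dtri, -5, 0)$value + integrate(dtri, 0, 5)$value
area

# CDF at q = 2 by integration vs. the closed form 0.5 + 0.2*q - 0.02*q^2
q <- 2
cdf_numeric <- integrate(dtri, -5, 0)$value + integrate(dtri, 0, q)$value
cdf_formula <- 0.5 + 0.2 * q - 0.02 * q^2
c(cdf_numeric, cdf_formula)   # both 0.82
```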

dpqr code examples

1) Y ~ Bin(n = 10, p = 0.4).

1 - pbinom(8-1,10,.4) # P(Y >= 8)
#just following the table we saw

2) X ~ Pois(lambda = 5).

ppois(8,5) - ppois(3,5) # P(3 < X <= 8); no need for 1-x since the subtraction already removes the lower part

Bayes Testing problem (drug testing)

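$P(\text{U} \mid +) = \dfrac{P(+ \mid \text{U})\,P(\text{U})}{P(+ \mid \text{U})\,P(\text{U}) + P(+ \mid \text{N})\,P(\text{N})}$, where $P(+ \mid \text{N})$ = 1 - true negative rate

Just plug in:

  • P(non-user) aka 1-P(user)

| Event | Notation | Value | Description from Problem |
| --- | --- | --- | --- |
| Prior Prob. (User) | $P(\text{U})$ | 0.05 | 5% of people actually use cannabis. |
| Prior Prob. (Non-user) | $P(\text{N})$ | 0.95 | $1 - P(\text{U})$ |
| Sensitivity (True Positive Rate) | $P(+ \mid \text{U})$ | 0.95 | 95% true positive results for users. |
| Specificity (True Negative Rate) | $P(- \mid \text{N})$ | 0.87 | 87% true negative results for non-users. |
| False Positive Rate | $P(+ \mid \text{N})$ | 0.13 | 1 - specificity |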

Drug test problem

A particular test for whether someone has been using cannabis is 95% sensitive and 87% specific, meaning it leads to 95% true “positive” results (meaning, “Yes he used cannabis”) for cannabis users and 87% true negative results for non-users. Assuming 5% of people actually do use cannabis, what is the probability that a random person who tests positive is really a cannabis user?

  • A: The person is a cannabis user.
  • $A^c$ (or $\bar{A}$): The person is a non-user (i.e., not a cannabis user).
  • B: The test result is positive.
  • $B^c$ (or $\bar{B}$): The test result is negative.
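
A sketch of the plug-in calculation for the drug test problem, using only the three numbers given in the problem statement. The variable names are made up for readability.

```r
p_user      <- 0.05   # prior: 5% of people use cannabis
sensitivity <- 0.95   # P(positive | user)
specificity <- 0.87   # P(negative | non-user)

p_nonuser <- 1 - p_user        # 0.95
false_pos <- 1 - specificity   # P(positive | non-user) = 0.13

# Total probability of a positive test, then Bayes' rule
p_positive <- sensitivity * p_user + false_pos * p_nonuser
p_user_given_positive <- sensitivity * p_user / p_positive
p_user_given_positive          # about 0.278
```

So even with a positive test, the chance the person is really a user is only about 28%, because non-users are so much more common than users.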

Birthday problem

Central formula: k = # of people; P(at least 2 of them share a bday) = 1 - choose(365, k) * k! / 365^k

birthday <- function(k){
  1 - exp(lchoose(365,k) + lfactorial(k) - k*log(365))
}
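
A usage sketch for the birthday function, cross-checked against the direct product form. The function is repeated so the block runs on its own.

```r
# Same function as in the notes, repeated so this runs on its own
birthday <- function(k){
  1 - exp(lchoose(365, k) + lfactorial(k) - k * log(365))
}

birthday(23)                         # about 0.507, the first k above 50%

# Cross-check against the direct product (365/365) * (364/365) * ...
k <- 23
1 - prod((365:(365 - k + 1)) / 365)  # same value

birthday(c(10, 23, 50))              # works on vectors too
```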

w-F theory: Wright-Fisher model

MGF & MOM

Moment Generating Functions

Estimating unknown parameters of a probability distribution using sample data. Given the MGF $M_X(t) = E(e^{tX})$, we use the general formula $E(X^k) = M_X^{(k)}(0)$ to find the k'th moment: taking the $k$-th derivative of the MGF with respect to $t$ and then plugging in $t = 0$.

For example, given an MGF, we would take the first derivative and evaluate it at t = 0; the result is the mean, written in terms of the parameter (here the probability of success).
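
A worked sketch of the steps, assuming a Bernoulli variable with success probability $p$ purely for illustration (this may not be the distribution used in the quiz example):

$$
M_X(t) = (1 - p) + p\,e^{t}, \qquad M_X'(t) = p\,e^{t}, \qquad E(X) = M_X'(0) = p
$$

Under the same assumption, the method of moments sets the first moment equal to the sample mean, which gives the estimate $\hat{p} = \bar{x}$.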

Z-score + empirical

z-score (standard score) aka z-transformation

T.test Samples

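t.test(x,y, 
       var.equal = TRUE, # assume equal variances? (default FALSE)
       conf.level = 0.80 # confidence level of the interval (default 0.95)
       )
...
  • Take the line below "95 percent confidence interval:" in the output (the percent shown matches conf.level); L = leftmost value, and the opposite for the upper bound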

t_test_one_and_two_sample_Stats_Quiz

without t.test we need to do:

y <- c(3,4,5) # sample dataset
a = 0.2 # define alpha (so 80% confidence)
n = length(y) # sample size
tcrit <- qt(1-a/2,n-1) # critical t-value (named tcrit so it doesn't mask t())
 
mp = c(-1,1) # just to get both ends..
mean(y)+mp*tcrit*sd(y)/sqrt(n) # final conf interval
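
A sketch checking that the by-hand interval matches what `t.test()` reports for the same sample and the same alpha:

```r
y <- c(3, 4, 5)   # same sample dataset as in the notes
a <- 0.2          # alpha, so the confidence level is 0.80
n <- length(y)

manual  <- mean(y) + c(-1, 1) * qt(1 - a/2, n - 1) * sd(y) / sqrt(n)
builtin <- t.test(y, conf.level = 1 - a)$conf.int

manual
as.numeric(builtin)                     # drop the conf.level attribute
all.equal(manual, as.numeric(builtin))  # TRUE
```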

Linear Combinations in Expected and Variance

  • Finding E(aX + b): plug mu in, giving a*mu + b
  • V(aX + b): drop the +b and square the coefficient, giving a^2 * sigma^2 (we always have to square all coefficients in V(#))!
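
A sketch that checks both rules by computing E and V of a*X + b directly. The distribution and the constants `a` and `b` are made up for illustration.

```r
# Hypothetical discrete X, and the linear combination Y = a*X + b
x <- c(0, 1, 2); p <- c(0.25, 0.50, 0.25)
a <- 3; b <- 2

mu     <- sum(x * p)             # E(X) = 1
sigma2 <- sum((x - mu)^2 * p)    # V(X) = 0.5

y   <- a * x + b                 # values Y can take, same probabilities
e_y <- sum(y * p)                # direct E(Y)
v_y <- sum((y - e_y)^2 * p)      # direct V(Y)

c(e_y, a * mu + b)               # both 5
c(v_y, a^2 * sigma2)             # both 4.5
```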