5/5 - (1 vote)
Assignment 3
EMATM0061: StatisticalComputing andEmpiricalMethods, TB1,2024
Introduction
Create an R Markdown for the assignment
First, it is recommended that youcreate asingleRMarkdowndocumenttoincludeyour solutions, with headingscreated by headingcodessuch as“##1.1(Q1)”,“##3(Q1)”,etc.
It is a good practice touseRMarkdown toorganiseyourcodeandresults.For example,youcanstartwiththetemplatecalledAssignment02_TEMPLATE.Rmdwhich can be downloaded viaBlackboard.
You canoptionally hand in this assignmentby13:00Monday7October.This will help us understand your work but will notcount towards yourfinalgrade.Ifyou want to hand in the assignment, pleaseuploadaPDFfilecontainingyouranswerstoBlackboards(click on the “Assignment 03” under theassignment tab).Thereisno requirement on how thePDF file isgenerated.Oneexampleis tochoosetheoutputof R-markdown asPDF(which may requireLaTex to beinstalledin yourcomputer).Another example is to choose ahtmloutputatR-markdownandconvertthehtml file into aPDF file.Ifyou have multiplePDFfiles,pleasecombinethemintoasinglePDF file before thesubmission.
Load packages
We need to load two packages, namely“Stat2Data”and“tidyverse”,before answering the questions.Ifthey haven’t been installedon yourcomputer,pleaseuse“install.packages()” toinstallthemfirst.
1. Load the tidyverse package:
library(tidyverse)
2. Load theStat2Data package and then thedatasetHawks:
library(Stat2Data)
data(“Hawks”)
1. Visualisation
This part is mainly about visualisation usingggplot2.ItcoversmainlyLecture7.We are going to use the “Hawks”datasetthatwasusedlastweek.
(Q1) After the dataset Stat2Data was loadedsucesfully,createasmallerdatasetwith the codebelow.
hawksSmall<-
drop_na(select(Hawks,Age,Day,Month,Year,CaptureTime,Species,Wing,Weight,Tail))
Use the dim() function to checkhowmanyrowsandhowmanycolumnsinhawksSmall. Then use the head() function todisplay thefirst5rowsofthehawksSmal. The result shouldlook likethis:
## | Age | Day | Month | Year | CaptureTime | Species | Wing | Weight | Tail |
## 1 | I | 19 | 9 | 1992 | 13:30 | RT | 385 | 920 | 219 |
## 2 | I | 22 | 9 | 1992 | 10:30 | RT | 376 | 930 | 221 |
## 3 | I | 23 | 9 | 1992 | 12:45 | RT | 381 | 990 | 235 |
## 4 | I | 23 | 9 | 1992 | 10:50 | CH | 265 | 470 | 220 |
## 5 | I | 27 | 9 | 1992 | 11:15 | SS | 205 | 170 | 157 |
(Q2) Use acombination ofthe functions“ggplot()” and“geom_histogram”tocreate a histogram plot ofthe weights of Hawks withinthehawksSmalldataframewithbinwidths of 10. Your result should looksomethinglikethefollowing.
(Q3) Similar to(Q2),use the “geom_density()” function to createtwodensityplotsofthe tail lengths of Hawks within the hawksSmalldata frame,withparameters “adjust=0.5” and “adjust=2”, respectively. Your resultsshouldlook likethis:
Can you explain the difference between the twoplots?Howmanymodesdoesthedistributionof Hawk tail lengths have ineach ofthe plots.
(Q4) Basedon the answer from(Q3),can youcreate the followingplot?
(Q5) Violin plot:
Use the ggplot and geom_violin() functions tocreate thefollowingviolinplotforthethree species.
(Q6) Scatter plot
Generate a plot similar to the followingplotusingtheggplot()andgeom_point()functions.
1. How many aesthetics are present within thefollowingplot?
2.What are the glyphs within thisplot?
3. Whatare thevisualcuesbeingusedwithinthisplot?
(Q7) Trend lines andfacetwraps:
Generate the following plot using the ggplot(),geom_point(),geom_smooth()andfacet_wrap() functions.Note that in the facet plot, thethreepanelsusedifferentscales.
1. Whatare thevisualcuesbeingusedwithinthisplot?
2. Basedon the plot below, what can we sayabouttherelationshipbetweentheweight ofthe hawks and theirtaillengths?
(Q8) Adding annotations
First,compute the “Weight” and the “Tail”oftheheaviesthawkin thedataset.Youcan use filter() and select()function toselectproperdata.
Second, reuse the code that you createfromQ(3),addinganarrowandan annotation to indicate the heaviest hawk. Your resultshouldlooksimilartothis:
2. Finite probability spaces
Now, let’s move on to considerprobabilityonfinitesamplespaces,whichiscoveredinLecture8.
2.1 Sampling with replacement
Recall that for positive integers nand k, givesthenumberof subsetsofsize kfrom asetof nobjects.
You cancompute this number straightforwardly within“R” via thechoosefunction “choose()” .For example, ifwe want to compute thenumberof differentsubsetsofsize3 from a collection of size8 wewouldcompute:
choose(8,3)
##[1]56
Suppose we have a bag containing10spheres. Thisincludes3redspheresand7blue spheres.
Let’s suppose that we draw asphereatrandomfrom thebag(allsphereshaveequal probability ofbeing drawn). We record itscolourandthenreturnthespheretothebag. This process is repeated22 times.Thisisanexampleof sampling with replacement since the spheres are replacedaftereachdraw.
(Q1) Write down a mathematical expressionfortheprobabilitythatZoutofthe22selections were redspheres(here Z ∈{0,1, ⋯ ,22}).
You can try writing down your mathematicalexpressionusing“LaTex”codein yourR markdown file, making useoftheLaTex functions“binom{”{}}and“frac{”{}}.Forinformation about writing mathematical formulae inRmarkdowndocuments,findillustrative examples in “Assignment_R
MarkdownMathformulasandSymbolsExamples.rmd”(under the resource list tab onblackboardcourse webpage) or visit:
https://bookdown.org/yihui/bookdown/markdown-extensions-by- bookdown.html#equations
(Q2) Next write anR function called“prob_red_spheres()” which takesZasan argument andcomputes the probability that Zout of a totalofthe22 ballsselectedarered.
Test your function asfollows:
prob_red_spheres(10)
##[1] 0.05285129
(Q3) Generate a data frame. called“prob_by_num_reds” withtwocolumns “num_reds” and“prob”. The “num_reds”column shouldcontainnumbers1 through22 and the “prob”column should givetheassociatedprobabilityof selecting that many reds out of a totalnumberof 22selections.Display the first3 rowsofyourdataframe.:
prob_by_num_reds %>% head(3)
##num_redsprob
## 110.003686403
##22 0.016588812
##33 0.047396606
(Q4) Now use the “geom_line()” function within the“ggplot2”library,inconjunctionwith your data frame. todisplay a plotoftheprobabilityasafunctionofthenumberof reds.
Your plot should lookasfollows:
(Q5)
Next we shall explore the “sample()”function withinR.Let’ssuppose wewanttosimulate a random experiment in which wesamplewithreplacementfroma
collection of 10objects, and repeat thisprocess22times.
Wecando this bycalling:
sample(10, 22,replace=TRUE)
##[1]5 86104531651094576771015
5
Try this out for yourself. The outputshouldbea vectorof length22consisting entirely of numbers between1 and10.Since this issampling withreplacementsandthe number of samples exceeds the numberofelements, therewillberepetitions.
Try rerunning the function. You probably getadifferentsample.Thisis tobe expected, andeven desirable, since thefunction simulatesarandomsample.
However, the fact that we get a differentanswereverytimewerunthecodeis problematic from the perspective ofreproducibility. To avoidthisprocesswecanset a random seed via the functionset.seed().Bydoingsoweshouldgetthesameoutputeverytime. Trythefollowingoutforyourself:
## case 1: Setting the random seed just once
set.seed(0)
for(i in 1:5){
print(sample(100,5,replace=FALSE))
# The result may well differ every time
}
## case 2: Resetting the random seed every time
set.seed(1)
print(sample(100,5,replace=FALSE)) set.seed(1)
print(sample(100,5,replace=FALSE)) set.seed(1)
print(sample(100,5,replace=FALSE))
# The result should not change
## case 3: reproducing case 1 if we set a random seed at the beginning.
set.seed(0)
for(i in 1:5){
print(sample(100,5,replace=FALSE))
} # The result will be 5 samples exactly the same as in case 1 (why?).
Next, try to understand what the following codedoes
itermap <- function(.x,.f){result<- list()
for (item in .x){
result <- c(result, list(.f(item)))}
return(result)}
itermap( c(1,2,3), function(x){ return(c(x,x^2))})
##[[1]]
##[1]11##
##[[2]]
##[1]24##
##[[3]]
##[1]39
itermap_dbl <- function(.x,.f){result <- numeric(length(.x))
for (i in 1:length(.x)){
result[i] <- .f(.x[[i]])}
return(result)}
itermap_dbl( c(1,2,3), function(x){ return(x^3)})
##[1]1827
The function “itermap” and“itermap_dbl” definedabovearesimilar to the“map()”and“map_dbl” functions inR that will be introducednext week.
We shall now use the “sample()”toconstructasimulationstudytoexplorethe
probabilityof selecting z red balls from a bagof size10,with3redand7blueballs,when sampling22 balls with replacement.
First set a random seed.Thencreateadataframecalled
“sampling_with_replacement_simulation”consistingoftwocolumns. Thefirstis called“trial” andcontains numbers1 through1000. The secondiscalled
“sample_balls” andcorresponds to random samples of size22froma bagof size10,with replacement. We cando thisasfollows:
num_trials<-1000 # set the number of trials set.seed(0) # set the random seed
sampling_with_replacement_simulation<-data.frame(trial=1:num_trials) %>%
mutate(sample_balls = itermap(.x=trial, function(x){sample(10,22,replace =TRUE)}))
# generate collection of num_trials simulations
Now add a new columncalled“num_reds”such that,foreachrow,“num_reds”
contains an integer which gives the numberof items withinthesamplefor thatrow(i.e., the sample represented by theentry ofthe“sample_balls”column)whichare
less thanor equal to three(assuming that the threered ballsarelabelledby{1,2,3}).For example, suppose that for somerowofthedataframe,the“sample_balls”
column contains aentry which is thefollowinglist:
410410835555107121105657110
Then the corresponding row ofthe“num_reds”column shouldcontainthenumber5, since5 ofthese values are less thanequalto3.Youmaywanttousethefunctions “mutate()”, “itermap_dbl” and “sum()”. After performing this step,thedataframe.
“sampling_with_replacement_simulation” should contain a newcolumncalled “num_reds”.
(Q6)
Next we shall add a newcolumncalled“predicted_prob” toourexistingdataframe. “prob_by_num_reds” which gives the number of times (that weobserved the
corresponding number of reds within our simulation)divided bythetotalnumberof observations. This aims to estimate the probabilityoftheeventthatZoutof a
totalofthe22 balls selected are red(for Z = 1,2, ⋯,22). Wecando this asfollows:
num_reds_in_simulation<-sampling_with_replacement_simulation %>% pull(num_reds)
# we extract a vector corresponding to the number of reds in each trial
prob_by_num_reds<-prob_by_num_reds %>%
mutate(predicted_prob=itermap_dbl(.x=num_reds, function(.x)
sum(num_reds_in_simulation==.x))/num_trials)
# add a column which gives the number of trials with a given number of reds
Here “prob_by_num_reds” was initially defined in(Q3).
(Q7) Finally,create a plot which compares the resultsofyoursimulationwithyourprobability formula.
Your result should look something liketheplotbelow.Of course,since thisisarandom simulation, your resultmay well lookslightlydifferent.
prob_by_num_reds %>%
rename(TheoreticalProbability=prob,
EstimatedProbability=predicted_prob) %>%
ggplot() + geom_line(aes(x=num_reds, y=TheoreticalProbability)) +
geom_line(aes(x=num_reds, y=EstimatedProbability), linetype=’dashed’)
+
geom_point(aes(x=num_reds, y=EstimatedProbability)) +
theme_bw() + xlab(“Number of reds”) + ylab(“Probabilities”)
Try to make senseof each linein theabovecode.
Notice the close relationship between probabilitiesand long-runfrequencybehaviour. We shall explore this connectionover thenextfewlectures.
2.2 Sampling without replacement
This is a more challengingquestion,but theideasfromthelastquestionareusefulhere.
Let’s suppose we have a large bagcontaining100spheres.Thereare50redspheres,30 blue spheres and20 green spheres.Supposethatwesample10spheresfromthebag without replacement.
(Q1) What is the probability that oneormorecoloursaremissingfrom your
selection?First aim to answer this question viaasimulationstudy usingideasfromthe previousquestion.
You may want to use thefollowingsteps:
1. First setarandomseed;
2. Next set a numberoftrials(e.g.,10 or1000.Trythisinitially withasmallnumber of simulations.Increase your number of simulations to about a relatively large number to get amoreaccurateanswer,onceeverything seems to be working well), and asamplesize(10);
3. Now use a combination ofthefunctions“sample()”,“mutate()”and
“itermap()” to generate your samples.Here you are creatingasampleof size 10 from a collection of100 balls–thesamplingisdone without replacement;
4. Now compute the number of “reds”,“greens” and“blues”inyoursampleusing the “itermap_dbl()” and “mutate()”functions.
5. Computetheminimumofthethreecountsusingthe“pmin()”function.Whenthis minimum is zero, thenone ofthethreecoloursismissing.Itis
recommendedthatyoulookupthedifferencebetween “pmin()”and“min()”here;
6. Computetheproportionof rowsfor whichtheminimumnumberofthethreecounts iszero.
(Q2) (*Optional) Thisquestion is optionalandmaybeomittedifyouareshortontime.
Let’s derive a mathematical expressionfor theprobabilitythatweconsideredin
(Q1): You can try and use “combinations”with(k(n))tocomputetheprobability
directly.First aim tocompute the number of subsetsof size10 from100 which
eitherentirely missout oneofthe subsetsRed={1, ⋯ ,50},Blues={51, ⋯ ,80},
Greens={81, ⋯,100}. Thencompute the numberof subsetsof size10 from100
which missout twoofthe subsetsRed,Blues,Green.Be careful nottodoublecount some ofthese subsets! Once you havecomputedallsuchsubsets,combinethemwith the formula for the total numberof subsetsof size10fromasetof100, to compute the probability of missing acolour.
Once you have the mathematical expression fortheprobability,youcancheckhowclose the probabilitycomputed in (Q1) to thetheoretical values.