R programming for data science

Sovello Hildebrand Mgani

Outline

● History of R
● Installation (Windows and Linux)
● Data Types
● Reading Data:
  – Tabular
  – Large datasets
● Textual Data Formats
● Subsetting:
  – Lists, Matrices, Partial matching
  – Removing missing values

Outline

● Vectorized operations
● Control Structures
  – If-else
  – For, while, repeat, next, break
● Functions
  – Scoping
● Dates and Times
● Loop functions
  – lapply, tapply, apply, mapply, split
● Simulation and profiling
  – Generating random numbers, simulating a linear model, random sampling
● Visualizations

History of R

● Originates from the S language. S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries
  – History of S: http://www.stat.bell-labs.com/S/history.html
● R development history:
  – https://en.wikipedia.org/wiki/R_(programming_language)

R and Statistics

● R developed from S, which is a statistical analysis tool, and so is R
● Its functionality is divided into modules
  – You need to load a module to use the functionality it provides
● Has more sophisticated graphics capabilities than most other statistical packages
● Useful for interactive work: run from the terminal
● Contains a powerful programming language for developing new tools
  – Tools: for visualizations and analysis

Design of the R System

● The “base” system, downloaded from CRAN
● “All other stuff”
● Packages in R
  – The “base” system has the base package, which is required to run R and has the most fundamental functions
  – Other packages contained in the “base” system: utils, stats, datasets, graphics, grDevices, tools, etc. Some of these (utils, stats, datasets, graphics, grDevices) are attached automatically at startup; others, such as tools, need to be loaded before use
  – Recommended packages: boot, class, cluster, codetools, foreign, lattice, etc.
  – Load packages with library() or require()

R Resources

● CRAN:
  – http://cran.r-project.org
● Quick-R: a book
  – http://www.statmethods.net/
● R-bloggers (platform): not a social network
  – R-Bloggers is about empowering bloggers to empower other R users
  – R-Bloggers.com is a blog aggregator of content contributed by bloggers who write about R (in English)
  – https://www.r-bloggers.com/

Installation of R: Ubuntu

● Run from the terminal:
    sudo apt-get install r-base r-base-dev
● If this doesn’t work, then you need to:
  – Add the repository:
      echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list
  – Add the keyring:
      gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
      gpg -a --export E084DAB9 | sudo apt-key add -
  – Install R-Base:
      sudo apt-get update; sudo apt-get install r-base r-base-dev
● Alternatively, you can install from a PPA, which has the most recent versions
  – Add the PPA:
      sudo add-apt-repository ppa:marutter/rrutter
  – Install R-Base:
      sudo apt-get update; sudo apt-get install r-base r-base-dev

Installation of R: Windows

● Visit CRAN (Comprehensive R Archive Network): https://cran.r-project.org/
● Click/select Download R for Windows
● Then click/select base or install R for the first time
● Then click/select Download R X.X.X for Windows
● After the download has finished, locate the downloaded file and install it.

RStudio: www.rstudio.com

RStudio: Introduction

● RStudio is a set of integrated tools designed to help you be more productive with R.
● How? It includes:
  – a console,
  – a syntax-highlighting editor that supports direct code execution,
  – a variety of robust tools for plotting, viewing history, debugging and managing your workspace.

RStudio: Installation

● From the RStudio home page, go to Products then select RStudio
  – Then scroll down and click Download RStudio Desktop
  – Then click Download under RStudio Desktop Personal License.
  – Select RStudio for your platform. Clicking on the link will download the file directly.
  – Locate the file in your system Downloads folder and start the installation.

RStudio: Parts

● The Console is where you write and run code interactively
● The Files tab shows all the files and folders in your default workspace, much like a file browser window on a PC/Mac
● The Plots tab shows all your graphs
● The Packages tab lists the packages (add-ons) needed to run certain processes
● The Help tab provides additional information
● The Environment tab shows all the active objects
● The History tab shows a list of commands used so far

RStudio: Working Directory

● It is important to organize all files for a particular project under one main/parent directory

● A working directory in RStudio is where all the files for a particular project are stored

● All paths used in the console to load data files and scripts are relative to the working directory.

RStudio: Working Directory

● To set the working directory:
  – Start RStudio the same way you start other programs on your computer
  – From the File menu select New Project, then New Directory, then Empty Project. Type the directory name (rprogramming), then under “Create project as subdirectory of” click Browse and select Desktop

R: Getting Started

● A few basic commands to try in the console:
  – getwd(): get the current working directory
  – setwd("/path/to/directory"): set the working directory to the specified path
  – install.packages("package_name"): install a package. Requires an internet connection
  – library(package_name), require(package_name): load and attach add-on packages
  – ?object: show the documentation/help for an object, e.g. ?mtcars
  – summary(object): provide a summary of an object such as a dataset, e.g. summary(mtcars)
● Whenever you run library(package_name) and get the error “there is no package called ‘package_name’”, you need to install the package first and then call library() on it.
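
The install-then-load routine described above can be wrapped in a small helper. This is only a sketch: load_or_install() is a hypothetical name, not a function that ships with R, and it assumes a CRAN mirror is reachable whenever a package really has to be installed.

```r
# Hypothetical helper (not part of R or of these slides): attach a package,
# installing it first only if it is not already available.
load_or_install <- function(package_name) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)  # needs an internet connection
  }
  library(package_name, character.only = TRUE)
}

load_or_install("stats")  # ships with R, so nothing is downloaded
getwd()                   # print the current working directory
summary(mtcars)           # summarise a built-in data set
```

Passing character.only = TRUE is what lets library() accept the package name as a string held in a variable.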

Data Visualizations in R: Introduction

● R has different systems (packages) for making graphs (visualizations)

● In this case we are going to use ggplot2, which is more elegant and versatile than many of the alternatives (ggvis, rgl, htmlwidgets, googleVis, etc.)

● ggplot2 is built upon “The Layered Grammar of Graphics”

Data Visualizations in R: Tidyverse

● Tidyverse is a set of packages
  – The packages work in harmony because they share common data representations and API design.
● The tidyverse package makes it easy to install and load the core packages in a single command
● To install it, run: install.packages("tidyverse")
● To use it, run: library(tidyverse), which loads the tidyverse core packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr.
  – Look up each of these packages to learn what it does

Data Visualizations: First Steps

● library(tidyverse) loads all the core packages from the tidyverse
● It also reports any conflicts with base R or other packages that arise from loading the named package.
  – e.g. in this case filter() and lag() from dplyr conflict with (mask) the functions of the same name in the stats package
● In such cases you may need to call a function explicitly from its package, in the form package::function()
  – e.g. ggplot2::ggplot() calls the ggplot() function from the ggplot2 package.
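
A minimal sketch of the package::function() form, using only packages that ship with R so it runs without the tidyverse. The vector x is made-up data.

```r
# Both calls run the same function; the second names its package explicitly.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(x)
stats::sd(x)

# stats::filter() is the function that dplyr::filter() masks once
# library(tidyverse) has been run; the stats:: prefix still reaches it.
# Here it computes a 3-point moving average.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

The prefix is also a way to use a single function from a package without attaching the whole package.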

Data Visualizations: First Steps

● Which is more fuel efficient: cars with big engines or cars with small engines?
● The mpg data frame:
  – A data frame is a rectangular collection of variables (in columns) and observations (in rows)
  – The mpg data frame in ggplot2 contains observations collected by the US Environmental Protection Agency on 38 models of cars.
● Run ?mpg (from the console) to learn more about the data set.

First Steps: Creating a ggplot

● To answer the question about fuel efficiency, plot fuel efficiency (hwy, highway miles per gallon: y-axis) against engine size (displ: x-axis)
● See the magic of this command:
    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy))

First Steps: Creating a ggplot

    > ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy))

● The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with bigger engines use more fuel.

Creating a ggplot

● In ggplot2:
  – You begin with the function ggplot(). It creates a coordinate system that you can add layers onto. Its first argument is the data set to use for plotting
  – To complete the graph, add one or more layers to the coordinate system created by ggplot()
  – The geom_point() function adds a layer of points to the plot (which creates a scatter plot in this case)
  – Each geom function in ggplot2 takes a mapping argument, which defines how variables are mapped to visual properties
  – The mapping argument is always paired with aes()
  – The x and y arguments of aes() specify which variables to map to the x and y axes
  – ggplot2 looks for the mapped variables in the data argument, in this case mpg

Creating a ggplot: Template

● A graphing template for ggplot

● You can get a list of <GEOM_FUNCTION>s by following this link (http://docs.ggplot2.org/current/)
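
The template can be sketched as follows, assuming the usual ggplot2 form in which <DATA>, <GEOM_FUNCTION> and <MAPPINGS> are placeholders to be replaced. The filled-in version reuses the mpg scatter plot from the earlier slides.

```r
library(ggplot2)

# General shape (the angle-bracket placeholders are not valid R):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# The same template with <DATA> = mpg, <GEOM_FUNCTION> = geom_point
# and <MAPPINGS> = x = displ, y = hwy.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p
```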

ggplot: Aesthetic Mappings

● Look at the graph and note the circled dots

● What is special with these big engine cars?

ggplot: Aesthetics

● ggplot2 aesthetic mappings can help answer the question
● An aesthetic is a visual property of the objects in a plot.
  – These are things like the size, shape or color of points.
● You can therefore display a point in different ways by changing the values of its aesthetic properties.
● You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.
  – e.g. you can map the colors of your points to the class variable to reveal the class of each car.

ggplot: Aesthetics

● New plot with the color aesthetic mapped to class:
    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy, color = class))
● Try year and manufacturer as well and look at the trends

ggplot: Aesthetics

● Other aesthetics:
  – Size: suited to ordered variables; the size of each point reflects its value
  – Alpha: controls the transparency of the points
  – Shape: points are drawn with different shapes
● Exercise: try plotting the same geom with these different aesthetics
● ggplot2 takes care of selecting a reasonable scale to use with the aesthetic and constructs a legend

ggplot: Aesthetics

● The aesthetic properties of a geom can also be set manually.
  – For example:
      ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
  – This sets all points to blue
  – Note that color is outside aes()
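
The difference between mapping an aesthetic and setting it can be seen side by side. This is a sketch; the object names mapped, fixed and slip are only for illustration.

```r
library(ggplot2)

# Mapped: the colour varies with a variable, so it goes inside aes()
# and ggplot2 builds a legend for it.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Set: one fixed colour for every point, so it goes outside aes().
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Common slip: a constant inside aes() is treated as a one-level variable,
# so the points get a default colour and a legend entry labelled "blue".
slip <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Printing each object in turn (for example by typing slip in the console) draws the corresponding plot.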

ggplot: Facets

ggplot: Facets

● When the data has categorical variables, it is possible to split the plot into facets.
● Facets are subplots that each display a subset of the data.
● To facet a plot by a single variable, use the function facet_wrap(formula, …)
  – The formula is created with ~ variable-name
  – Here “formula” is the name of a data structure in R, not a synonym for equation.
  – The variable (variable-name) should be discrete.

ggplot: Facets

● For example:
    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy), color = "red") +
      facet_wrap(~ class, nrow = 3)
● This produces one subplot for each value of mpg$class, arranged in three rows.

ggplot: Facets

● Can we facet the plot using two discrete variables?
● Do this:
  – ?facet_grid
  – ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy)) +
      facet_grid(drv ~ cyl)
● In the plot, why do we have empty sub-plots?

ggplot: Facets

● Hack:
  – With facet_grid(), what happens when you use a . in place of one variable?
  – Is there an advantage of faceting over the color aesthetic? Any disadvantages? What if the dataset is very large?
  – In facet_wrap(), what do nrow and ncol do?
  – When using facet_grid(), put the variable with more unique levels in the columns (RHS of the formula). Why?
  – Why doesn’t facet_grid() have nrow and ncol arguments?

ggplot2::Geometric objects (geoms)

● These are the geometric objects used to represent the data.
  – e.g. bar geoms, point geoms, line geoms, smooth geoms, etc.
● To change the geom in your plot, change the geom function (geom_xxx())
● For example:
  – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
  – ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
● Not every aesthetic works with every geom
  – e.g. you can set the shape of a point, but not the shape of a line
  – Read: ?geom_point, ?geom_smooth

ggplot2: geoms

● ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
● Try:
  – ggplot(data = mpg) +
      geom_line(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot2: geoms

● Plot:
  – ggplot(data = mpg) +
      geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
  – ggplot(data = mpg) +
      geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
● What is the difference? Which is better? Why?

ggplot2: combined geoms

● Can we use more than one geom on the same plot?
● Try:
    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy)) +
      geom_smooth(mapping = aes(x = displ, y = hwy))
● When using multiple geoms on the same plot you can use global mappings:
    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point() +
      geom_smooth()
  This makes the code easier to read and modify.

ggplot2: combined geoms

● When you use global mappings and also set mappings in a geom function, the geom’s mappings are local: they extend or override the global mappings for that layer only.
● For example:
    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(color = class)) +
      geom_smooth()

ggplot2: combined geoms

● In the same way, you can specify different data for each layer.
  – Say you only want to fit a smooth line for one class of cars:
      ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
        geom_point(mapping = aes(color = class)) +
        geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
  – Hack: can we plot more than one of the same geom? Try a smooth geom for a different car class

ggplot2: combined geoms

Combined Geoms: exercise

ggplot2: geoms

● How many geoms does ggplot2 have?
  – Visit https://www.rstudio.com/resources/cheatsheets/ and look for the Data Visualization Cheat Sheet
● ggplot2 extensions provide more geoms to use. Take a look at the available extensions in this gallery: http://www.ggplot2-exts.org/gallery/

ggplot2: statistical transformations

● Read: ?diamonds
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut))
  – Where does count come from?

Statistical Transformations

● Some plots plot raw values
  – e.g. scatterplots
● Some plots use calculated values
  – bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
  – smoothers fit a model to your data and then plot predictions from the model. (Remember regression lines)
  – boxplots compute a robust summary of the distribution and then display a specially formatted box.

Statistical Transformation

● The algorithm used to calculate new values for a graph is called a stat (statistical transformation)
● You can check which stat a geom uses by looking at the default value of its stat argument.
  – geom_bar() uses count. Thus you can recreate the bar chart by running
      ggplot(data = diamonds) +
        stat_count(mapping = aes(x = cut))
● Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.

Statistical Transformation

● You can explicitly specify a stat:
  – When you want to override the default stat, e.g. run
      demo <- tribble(
        ~a,      ~b,
        "bar_1", 20,
        "bar_2", 30,
        "bar_3", 40
      )
    Then run
      ggplot(data = demo) +
        geom_bar(mapping = aes(x = a, y = b), stat = "identity")

Statistical Transformation

● Reasons to explicitly specify a stat (cntd):
  – You want to override the default mapping from transformed variables to aesthetics.
      ggplot(data = diamonds) +
        geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
  – This draws a bar chart of proportions instead of counts (newer ggplot2 versions write ..prop.. as after_stat(prop))

Position Adjustments

● A bar chart can be colored using either of two aesthetics: colour (the outline) or fill (the inside).
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, colour = cut))
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = cut))

Position Adjustments

● Check what the following plots look like:
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = clarity))
  – ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
      geom_bar(alpha = 1/5, position = "identity")
  – ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
      geom_bar(fill = NA, position = "identity")
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
  – ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Position Adjustments

● Learn more about position adjustments:
  – ?position_dodge
  – ?position_fill
  – ?position_identity
  – ?position_jitter
  – ?position_stack

Position Adjustments: overplotting

● Recall: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
  – The plot displays fewer points than the 234 observations in the data set (can you count them?)
  – The values of displ and hwy are rounded, so the points fall on a grid and many of them overlap each other. This problem is called overplotting.
● You can avoid this gridding by setting the position adjustment to "jitter"
  – position = "jitter" adds a small amount of random noise to each point
  – Since no two points are likely to receive the same amount of noise, they are spread out.
● Jittering makes the graph less accurate at small scales, but more revealing at large scales.
● In ggplot2 the shorthand for geom_point(position = "jitter") is geom_jitter()

Position Adjustments: jitter

● ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
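
One way to see the overplotting without counting dots is to compare the number of rows with the number of distinct (displ, hwy) positions. This is a sketch; n_obs and n_positions are illustrative names.

```r
library(ggplot2)

n_obs <- nrow(mpg)                                      # rows in the data set
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))   # distinct plot positions

n_obs
n_positions  # smaller than n_obs whenever points sit on top of each other

# geom_jitter() is the shorthand for geom_point(position = "jitter")
ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))
```

Because the noise is random, the jittered plot looks slightly different each time it is drawn.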

Thank You! Asanteni!

Working with Data

● In this part we are going to learn how to work with your data.
  – Getting data: importing your own data, tidying data
  – How to work with different data types: relational data, strings, factors, dates and times

Importing Data

● For importing files, we will use the readr package, which is part of the tidyverse core packages.
● Most readr functions turn flat files into data frames. A data frame is a tabular data format with rows and columns. It is a list of vectors of equal length.
  – read_csv(): reads comma-separated files
  – read_csv2(): reads semicolon-separated files
  – read_tsv(): reads tab-delimited files
  – read_delim(): reads files with any delimiter
● Activity:
  – Check what read_table(), read_fwf() and read_log() do

Importing Data: read_csv()

● The first argument is the path to the file to read
  – read_csv("data/students.csv")
● read_csv() prints out a column specification
● read_csv() by default uses the first row as the column names
  – Use skip = n to skip the first n lines if they contain data you don’t need (most likely metadata)
  – Use comment = "#" to drop all lines that start with #, for example
  – Use col_names = FALSE so that read_csv() doesn’t treat the first row as the column names
● Missing values in R are represented by NA. When loading files where missing values are marked differently, use the na argument, e.g. na = "." if missing values are marked by a period.
  – What will this line do?
      read_csv("students.csv", skip = 2, comment = "//", col_names = FALSE, na = "-")
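
A small self-contained sketch of these arguments working together. The file contents are made up and written to a temporary file, and the values differ from the exercise line above so the exercise is still left to try.

```r
library(readr)

# Hypothetical file contents, written to a temporary file for the demo.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported from the student registry",  # metadata line 1 (skipped)
  "version 1",                           # metadata line 2 (skipped)
  "# a comment inside the data",         # dropped by comment = "#"
  "Asha,21",
  "Juma,-"                               # "-" marks a missing age
), path)

students <- read_csv(
  path,
  skip = 2,           # ignore the two metadata lines
  comment = "#",      # ignore lines starting with #
  col_names = FALSE,  # no header row: columns are named X1, X2
  na = "-",           # treat "-" as a missing value
  col_types = "ci"    # X1 character, X2 integer
)
students
```

Supplying col_types is optional; it is used here so the column types do not depend on readr's guessing.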

Importing Data: Parsing

● The parse_*() functions:
  – ?parse_logical, ?parse_integer, ?parse_date
● The parse functions take a character vector and return a more specialized vector.
  – Character strings can contain anything, letters as well as digits, e.g. "dLab", "2013", "xyz3", "12.09"
  – A specialized vector contains, say, only integers, only decimal numbers, or only dates. This is what the parse functions do: return a vector of a specific type
● A vector in R is an ordered collection of values of the same type, created with c()
  – For example:
      names <- c("John", "Jean", "Giovanni", "Joni")
      dates_of_birth <- c("2012-12-31", "1988-05-02", "1990-01-06")

Importing Data: Parsing

● What happens to the following?
    parse_integer(c("1", "231", ".", "456"), na = ".")
    x <- parse_integer(c("123", "345", "abc", "123.45"))
● parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
● parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
● parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
● parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
● parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
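
Running the two parse_integer() calls shows how readr treats declared missing values versus values it cannot parse. A sketch, with a as an illustrative name for the first result:

```r
library(readr)

# "." is declared as the missing-value marker, so it parses to NA.
a <- parse_integer(c("1", "231", ".", "456"), na = ".")
a

# "abc" and "123.45" are not integers: each becomes NA, readr warns,
# and the details are recorded as parsing problems.
x <- parse_integer(c("123", "345", "abc", "123.45"))
x
problems(x)
```

problems(x) returns a table with one row per failed value, which is usually the quickest way to find out what went wrong.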

Importing Data: parsing

● One important thing to note when parsing characters is the encoding. UTF-8 is the most common; specifying it explicitly may save you hours of fixing problems:
    x <- "El Niño was particularly bad this year"
    parse_character(x, locale = locale(encoding = "utf-8"))
● ?parse_datetime, ?parse_date, ?parse_time
● Generate the correct format strings to parse each of the following dates and times
  – d1 <- "January 1, 2010"
  – d2 <- "2015-Mar-07"
  – d3 <- "06-Jun-2017"
  – d4 <- c("August 19 (2015)", "July 1 (2015)")
  – d5 <- "12/30/14" # Dec 30, 2014
  – t1 <- "1705"
  – t2 <- "11:15:10.12 PM"
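
As a starting point for the exercise, here is a sketch of how format strings work. The input strings are different from the ones listed above, and the variable names are illustrative.

```r
library(readr)

# %m = month number, %d = day of month, %y = two-digit year
us_date <- parse_date("01/02/15", format = "%m/%d/%y")
us_date

# %d = day, %b = abbreviated month name, %Y = four-digit year
abbr_date <- parse_date("17 Sep 2018", format = "%d %b %Y")
abbr_date

# %H = hour (24-hour clock), %M = minute
evening <- parse_time("20:10", format = "%H:%M")
evening
```

Literal characters in the format string, such as "/" or spaces, must match the input exactly.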

Importing Data: parsing files

● example_file <- read_csv(readr_example("challenge.csv"))
● Use the problems() function to look at any issues with the import
  – problems(example_file)
● Specify the column types explicitly when reading the file
    example_file <- read_csv(
      readr_example("challenge.csv"),
      col_types = cols(
        x = col_double(),
        y = col_date()
      )
    )
● Use tail(dataframe, n = X) and head(dataframe, n = X) to look at the last and first X rows of the data frame.

Parsing files

● One more strategy to get the column types is to use the guess_max option when reading in a file.

example_file2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)

Writing to a file

● If you want to save the data as CSV or TSV, use write_csv() or write_tsv(), where you need to specify
  – The data frame you are saving
  – The file path (location) where to save it
● Optionally:
  – you can set how missing values are written with na
  – you can also append to an existing file
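
A minimal sketch of the two optional arguments. The data frames and the temporary output path are made up for the demonstration.

```r
library(readr)

scores <- data.frame(name = c("Asha", "Juma"), score = c(81, NA))
more_scores <- data.frame(name = "Neema", score = 77)

out_path <- tempfile(fileext = ".csv")  # hypothetical output location

write_csv(scores, out_path, na = "-")            # NA is written as "-"
write_csv(more_scores, out_path, append = TRUE)  # add rows, no second header

written <- readLines(out_path)
written
```

When appending, the column order of the new rows has to match the file, because no header is written to check against.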

Parsing Files

● Group Activity
  – Download the dataset “Number of Trainees with Special Needs enrolled in Vocational Training Centres” from http://opendata.go.tz
  – Read it into a data frame and do some manipulations, including making some plots
  – Inspect read_rds() and write_rds() and see where you can use these functions
  – Explore these packages: haven, readxl, DBI

Tidy Data

● A tidy dataset has these features:
  – Each variable is in its own column
  – Each observation is in its own row
  – Each value is in its own cell
● ?gather, ?spread
● Missing values:
  – Can be explicit: stated with NA
  – Can be implicit: simply not present in the data
● gather(…, na.rm = TRUE) drops the rows whose value is NA
● You can use the complete() function to make implicit missing values explicit in tidy data.
  – ?complete
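
The difference between explicit and implicit missing values is easier to see on a tiny example. The stocks data below is made up for illustration.

```r
library(tidyr)

# The return for 2015 Q2 is explicitly missing (NA), while 2016 Q2 is
# implicitly missing: there is no row for it at all.
stocks <- data.frame(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 1),
  return = c(1.88, NA, 0.92)
)

# complete() adds a row for every year/qtr combination, filling gaps with NA.
all_quarters <- complete(stocks, year, qtr)
all_quarters
```

After complete(), both gaps show up as NA, so they can be counted, filtered or filled in the same way.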

Case Study

● Optionally download the data from http://www.who.int/tb/country/data/download/en/

● Load the data from the file or from the package: tidyr::who

● Looking at the data:
  – country, iso2 and iso3 are redundant: all three identify the country
  – year is clearly a variable
  – The other columns have unclear names; look at the data dictionary

Case Study cntd...

● Gather all the other columns, removing all missing values
    who1 <- who %>%
      gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
● Look at the structure of the values in the new key column by counting them
    who1 %>%
      count(key)
  – Use the data dictionary for the definition of the keys
● Make the key names consistent by replacing newrel with new_rel
    who2 <- who1 %>%
      mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
● Separate the key variable into different columns
    who3 <- who2 %>%
      separate(key, c("new", "type", "sexage"), sep = "_")
● Look at the new column
    who3 %>%
      count(new)
● Drop the new column because it is constant
    who4 <- who3 %>%
      select(-new)
● Separate sexage into sex and age
    who5 <- who4 %>%
      separate(sexage, c("sex", "age"), sep = 1)

Writing Code in R

● Create new objects with <-, using the format object_name <- object_value
● The <- symbol is the assignment operator
● Examples:
  – first_name <- "Sovello"
  – date.of.birth <- "12/31/1980"
  – PlaceOfBirth <- "Njombe"
  – AGE <- 37
  – x = 200 * 5
● Object names must start with a letter.
● Object names can only contain letters, numbers, underscore (_), and period (.)
  – Look at the examples above

Writing code in R

● You can look at what is in an object by typing its name
● You can also print an object explicitly
    > print(first_name)
    [1] "Sovello"
  – The [1] shown in the output indicates that first_name is a vector and "Sovello" is its first element.

Writing code in R

● All character (text) values must be enclosed in double or single quotes ("value" or 'value')
  – Look at the definition of place.of.birth in the screenshot
● Typos matter when using object names. Case matters too: surname and Surname are not the same.
● The # character indicates a comment. Anything to the right of # is ignored by R
● There are no multi-line comments: start every comment line with #

Group Exercise (5min)

● What is wrong with this code snippet?
    Surname <- "Mkulima"
    surname
● If you start typing a value for an object and press Enter before the closing quote or parenthesis, the console will look like
    college <- "College of informatics
    +
  – A + means R is waiting for you to continue typing. What would you do to fix, stop or escape from the problem?
● Fix the errors in this piece of code until it works
    library(tidyverse)
    ggplot(dota = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy))
    fliter(mpg, cyl = 8)

R Objects

● R has five basic (atomic) classes of objects:
  – Character
  – Numeric (real numbers)
  – Integer
  – Complex
  – Logical (TRUE/FALSE)
● The most basic type of R object is a vector. An empty vector can be created with vector()
● A vector can only contain objects of the same type.
● Numbers are generally treated as numeric objects
  – If you want an integer, you have to explicitly add the L suffix: 1L is an integer, 1 is a real number
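
These points can be checked directly in the console with class(). A short sketch; empty and mixed are illustrative names.

```r
class(1)       # "numeric": numbers are real by default
class(1L)      # "integer": the L suffix requests an integer
class("a")     # "character"
class(TRUE)    # "logical"
class(1 + 0i)  # "complex"

# vector() creates an empty vector of a given type and length
empty <- vector("numeric", length = 3)
empty          # three zeros

# A vector holds a single type, so mixed values are converted to a common one
mixed <- c(1.5, "a", TRUE)
class(mixed)   # "character"
```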

R Objects

● Inf is a special number which represents infinity.
  – You can use Inf in calculations like 1/Inf
● Creating vectors
  – Use the c() function to create vectors

> x <- c(0.5, 0.6) ## numeric

> x <- c(TRUE, FALSE) ## logical

> x <- c(T, F) ## logical

> x <- c("a", "b", "c") ## character

> x <- 9:29 ## integer

> x <- c(1+0i, 2+4i) ## complex

Coercion of R objects

● You can explicitly coerce objects using the as.* functions: ?as.integer, ?as.character, ?as.logical, ?as.numeric

> x <- 0:6

> class(x)

[1] "integer"

> as.numeric(x)

[1] 0 1 2 3 4 5 6

> as.logical(x)

[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE

> as.character(x)

[1] "0" "1" "2" "3" "4" "5" "6"

● If R fails to coerce an object, it produces NAs.

> x <- c("a", "b", "c")

> as.numeric(x)

Warning: NAs introduced by coercion

[1] NA NA NA

> as.logical(x)

[1] NA NA NA

> as.complex(x)

Warning: NAs introduced by coercion

[1] NA NA NA

R Objects: Matrices

● Matrices are vectors with a dimension attribute.
● The dimension is an integer vector of length 2 (number of rows, number of columns)

> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

Matrices

● Matrices are constructed column-wise, so entries start in the “upper left” corner and run down the columns

> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

● You can create matrices from vectors by adding a dimension attribute

> m <- 1:10
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

● Matrices must have every element be the same class (e.g. all integers or all numeric).

Group work

● What do cbind() and rbind() do?

● Create 3 vectors and 3 matrices.
● Create 3 matrices from vectors
● Create 2 matrices using cbind() and rbind()

● Read about R lists: how to create using list()

Page 82: R programming for data science

R Objects: Factors

● Factors represent categorical data

● Factors can be ordered or unordered

● Factor objects can be created with the factor() function

> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no  yes no
Levels: no yes
> table(x)
x
 no yes
  2   3
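By default the levels are sorted alphabetically, which is why no comes before yes above. The sketch below sets the level order explicitly and builds an ordered factor; the size values are made up for illustration.

```r
## Put "yes" first by listing the levels explicitly
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(x)         # "yes" "no"

## An ordered factor: the levels have a meaningful order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

## Comparisons are only defined for ordered factors
size < "large"    # TRUE FALSE TRUE TRUE
table(size)
```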

Page 83: R programming for data science

Factors

● Say you want to sort a vector

> x1 <- c("Dec", "Apr", "Jan", "Mar")

> sort(x1)

[1] "Apr" "Dec" "Jan" "Mar"

● The goal was to see the months sorted in calendar order: Jan, Mar, Apr, Dec

● To solve this problem we can make use of factors

– Create a vector of month levels

month_levels <- c(
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

● Then create a factor that uses those levels.

> y1 <- factor(x1, levels = month_levels)

● Applying sort() to the new variable produces the values in month order

> sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Page 84: R programming for data science

R Objects: missing values

● Missing values are denoted by NA, or by NaN for undefined mathematical operations

– is.na() is used to test objects if they are NA

– is.nan() is used to test for NaN

● NA values have a class also, so there are integer NA, character NA, etc.

● A NaN value is also NA but the converse is not true

> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE

● What is the difference between a missing value (NA) and zero?
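A small sketch that may help when thinking about this slide. It puts NA, NaN and 0 side by side; the numbers are arbitrary.

```r
## A vector containing a missing value, an undefined result, and a zero
x <- c(1, NA, 0/0, 0)

is.na(x)     # FALSE  TRUE  TRUE FALSE  (NaN counts as NA)
is.nan(x)    # FALSE FALSE  TRUE FALSE  (NA is not NaN)

## Zero is a real value and takes part in calculations;
## NA makes the result unknown unless it is removed
mean(c(2, 0, 4))                  # 2
mean(c(2, NA, 4))                 # NA
mean(c(2, NA, 4), na.rm = TRUE)   # 3
```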

Page 85: R programming for data science

R Objects: Data Frames

● Data frames store tabular data in R

● Data frames are represented as a special type of list where every element of the list has to have the same length.

● Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.

● Unlike matrices, data frames can store different classes of objects in each column.

Page 86: R programming for data science

Data Frames

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE

> nrow(x)

[1] 4

> ncol(x)

[1] 2
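The statement that a data frame is a special type of list can be checked directly. A minimal sketch using the same x as above:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

## A data frame is a list whose elements are the columns
is.list(x)          # TRUE
length(x)           # 2 (the number of columns)

## Each column keeps its own class
sapply(x, class)    # foo: "integer", bar: "logical"

## Columns can be extracted like list elements
x$foo               # 1 2 3 4
x[["bar"]]          # TRUE TRUE FALSE FALSE
```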

Page 87: R programming for data science

Writing Code in R

● Scripts:

– Turning interactive code into scripts

Page 88: R programming for data science

Data Transformation

● Filter rows with filter()

– Comparisons: >, >=, <, <=, !=, ==

sqrt(2) ^ 2 == 2 ## FALSE, because of floating-point rounding

– Logical operators:

And &

Or | (shorthand: x %in% y, e.g. 2 %in% c(1, 2, 3, 4))

Not !

– To detect missing values, use is.na(x)

● Ordering: use arrange()
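filter() and arrange() are presumably the dplyr functions. The slide does not say so, so that is an assumption. The sketch below sticks to base R so that it runs without extra packages, and uses a made-up data frame df; the dplyr calls appear only as comments.

```r
## Hypothetical data for illustration
df <- data.frame(id = 1:5, score = c(3.5, NA, 7.0, 1.2, 7.0))

## Floating-point comparison: == can fail, all.equal() allows a tolerance
sqrt(2) ^ 2 == 2                    # FALSE
isTRUE(all.equal(sqrt(2) ^ 2, 2))   # TRUE

## %in% is shorthand for a chain of == joined by |
2 %in% c(1, 2, 3, 4)                # TRUE

## Keep rows with a non-missing score above 3
## (assumed dplyr equivalent: filter(df, !is.na(score), score > 3))
kept <- df[!is.na(df$score) & df$score > 3, ]

## Order rows by score, smallest first
## (assumed dplyr equivalent: arrange(kept, score))
kept[order(kept$score), ]
```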

Page 89: R programming for data science

Reading Data: large datasets

● With much larger datasets, there are a few things you can do that will make your life easier and prevent R from choking.

– Read the help page for read.table, which contains many hints

– Stop if the dataset is larger than the RAM available on your machine

– Set comment.char = "" if there are no commented lines in your file.

– Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. You have to know the class of each column

– Set nrows. This doesn’t make R run faster but it helps with memory usage.

Page 90: R programming for data science

Reading large datasets

● A quick way to figure out the classes of each column is the following:

> initial <- read.table("datatable.txt", nrows = 100)

> classes <- sapply(initial, class)

> tabAll <- read.table("datatable.txt", colClasses = classes)
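The tips from the previous slide can be combined in one call. This self-contained sketch first writes a small made-up table to a temporary file (standing in for datatable.txt), then reads it back using the column classes found from the first rows. The header = TRUE argument is needed here only because the example file has a header line.

```r
## Create a small example file; the contents are made up
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:200, value = rnorm(200),
                       label = sample(letters, 200, replace = TRUE)),
            file = path, row.names = FALSE)

## Read a few rows to learn the column classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

## Read everything with the classes fixed, no comment scanning,
## and an upper bound on the number of rows
tabAll <- read.table(path, header = TRUE, colClasses = classes,
                     comment.char = "", nrows = 200)
str(tabAll)

## Clean up the temporary file
unlink(path)
```

One caveat: classes guessed from the first 100 rows can be wrong if later rows look different, for example a column that is whole numbers at the top and decimals further down.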

Page 91: R programming for data science

Control Structures

● Control structures allow you to control the flow of execution of a series of R expressions.

● Control structures allow you to put some “logic” into R code, rather than just always executing the same R code every time.

● Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.

Page 92: R programming for data science

Control Structures: if-else

● The if-else structure allows you to test a condition and act on it depending on whether it’s true or false

– You can use the if statement on its own

if(<condition>) {
    ## do something
}
## Continue with rest of code

● Or use the complete if-else (keep else on the same line as the closing brace, otherwise R reports a syntax error at top level)

if(<condition>) {
    ## do something
} else {
    ## do something else
}

● You can have a series of tests by following the initial if with any number of else ifs.

if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}

Page 93: R programming for data science

Example: if-else

## Generate a uniform random number

x <- runif(1, 0, 10)

if(x > 3) {

y <- 10

} else {

y <- 0

}

● This is the same as executing

y <- if(x > 3) {

10

} else {

0

}
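The else if chain from the previous slide can be made concrete in the same way. In this sketch describe() is a made-up helper, not part of R, and the cut-offs 3 and 7 are arbitrary.

```r
## Hypothetical helper that labels a single number by size
describe <- function(x) {
    if (x < 3) {
        "low"
    } else if (x < 7) {
        "medium"
    } else {
        "high"
    }
}

describe(1)      # "low"
describe(5)      # "medium"
describe(9.5)    # "high"

## if() expects a single TRUE/FALSE; for whole vectors use ifelse()
ifelse(c(1, 5, 9.5) > 3, 10, 0)   # 0 10 10
```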

Page 94: R programming for data science

Control Structures: for

● For loops are the most commonly used looping construct in R

for(x in sequence) {
    ## Execute code
}

● For one-line loops, the curly braces are not strictly necessary.

> x <- c("a", "b", "c", "d")
> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
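Hard-coding 1:4 breaks if the vector changes length. A common alternative is seq_along(), or looping over the elements directly, as in this sketch:

```r
x <- c("a", "b", "c", "d")

## Loop over the indices of x, whatever its length
for (i in seq_along(x)) {
    print(x[i])
}

## Loop over the elements themselves
for (letter in x) {
    print(letter)
}

## seq_along() is safe for empty vectors; 1:length(x) is not
empty <- character(0)
length(seq_along(empty))   # 0, so a loop body would never run
length(1:length(empty))    # 2, because 1:0 is c(1, 0)
```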

Page 95: R programming for data science

Control Structures: while

● While loops begin by testing a condition

● If it is true, the loop body is executed and the condition is tested again; this repeats until the condition is false

> count <- 0
> while(count < 10) {
      print(count)
      count <- count + 1
  }

Page 96: R programming for data science

Control Structures: next

● Next is used to skip an iteration of a loop

for(i in 1:100) {
    if(i <= 20) {
        ## Skip the first 20 iterations
        next
    }
    ## Do something here
}

Page 97: R programming for data science

Control Structures: break

● Break is used to exit a loop immediately, regardless of which iteration the loop is on.

for(i in 1:100) {
    print(i)
    if(i > 20) {
        ## Stop the loop once i exceeds 20
        break
    }
}
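next and break are often used together. A small runnable sketch, with an arbitrary task chosen for illustration: collect odd numbers and stop as soon as five have been found.

```r
## Collect the odd numbers from 1 to 100, stopping once five are found
odds <- integer(0)
for (i in 1:100) {
    if (i %% 2 == 0) {
        next              # skip even numbers
    }
    odds <- c(odds, i)
    if (length(odds) == 5) {
        break             # leave the loop early
    }
}

odds   # 1 3 5 7 9
i      # 9: the loop variable keeps its last value
```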

Page 98: R programming for data science

Functions

Page 99: R programming for data science

Functions: scoping

Page 100: R programming for data science

Dates and Times

Page 101: R programming for data science

Loop functions

Page 102: R programming for data science

Simulating and Profiling

Page 103: R programming for data science

Vectorized Operations