Data transformation is a crucial part in exploratory data analysis (EDA). EDA includes the processes of importing, cleaning, and visualizing data to uncover the patterns, relationships and hidden insights within the data.
In this lab, we will learn data processing and subsetting to see more insights from the data.
If you didn’t install tidyverse package
install.packages("tidyverse")
Load package
library(tidyverse)
## ── Attaching core tidyverse packages ───────────────
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Daily air quality measurements in New York, May to September 1973 stored within a data frame with 153 observations on 6 variables. It is obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). It is cited by Chambers (2018).
You can find the details fo the data description here.
You can use airquality object from base R data. You can
also airquality.csv file where the data is stored as a
comma separated value (CSV) file. Each row holds information for a
single observation.
Import the data from airquality.csv and show first 6
rows to see how this data is organized.
df = read.csv("Lab2/data/airquality.csv")
glimpse() to take a quick peek at your dataglimpse(df)
## Rows: 153
## Columns: 7
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
names(df)
## [1] "X" "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
head(df)
## X Ozone Solar.R Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
## 4 4 18 313 11.5 62 5 4
## 5 5 NA NA 14.3 56 5 5
## 6 6 28 NA 14.9 66 5 6
dplyr (one of data packages in tidyverse)
provides the %>% operator from magrittr.
x %>% f(y) turns into f(x, y) so the result
from one step is then “piped” into the next step.
df %>% filter(Month==5) means “Select all rows with
May”
df %>% filter(Month==5)
## X Ozone Solar.R Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
## 4 4 18 313 11.5 62 5 4
## 5 5 NA NA 14.3 56 5 5
## 6 6 28 NA 14.9 66 5 6
## 7 7 23 299 8.6 65 5 7
## 8 8 19 99 13.8 59 5 8
## 9 9 8 19 20.1 61 5 9
## 10 10 NA 194 8.6 69 5 10
## 11 11 7 NA 6.9 74 5 11
## 12 12 16 256 9.7 69 5 12
## 13 13 11 290 9.2 66 5 13
## 14 14 14 274 10.9 68 5 14
## 15 15 18 65 13.2 58 5 15
## 16 16 14 334 11.5 64 5 16
## 17 17 34 307 12.0 66 5 17
## 18 18 6 78 18.4 57 5 18
## 19 19 30 322 11.5 68 5 19
## 20 20 11 44 9.7 62 5 20
## 21 21 1 8 9.7 59 5 21
## 22 22 11 320 16.6 73 5 22
## 23 23 4 25 9.7 61 5 23
## 24 24 32 92 12.0 61 5 24
## 25 25 NA 66 16.6 57 5 25
## 26 26 NA 266 14.9 58 5 26
## 27 27 NA NA 8.0 57 5 27
## 28 28 23 13 12.0 67 5 28
## 29 29 45 252 14.9 81 5 29
## 30 30 115 223 5.7 79 5 30
## 31 31 37 279 7.4 76 5 31
You can use the pipe to rewrite multiple operations that you can read
left-to-right, top-to-bottom (reading the pipe operator as “then”)
(%>% and |> are equivalent)
df |>
select(Ozone, Wind, Month) |>
filter(Month==5)
## Ozone Wind Month
## 1 41 7.4 5
## 2 36 8.0 5
## 3 12 12.6 5
## 4 18 11.5 5
## 5 NA 14.3 5
## 6 28 14.9 5
## 7 23 8.6 5
## 8 19 13.8 5
## 9 8 20.1 5
## 10 NA 8.6 5
## 11 7 6.9 5
## 12 16 9.7 5
## 13 11 9.2 5
## 14 14 10.9 5
## 15 18 13.2 5
## 16 14 11.5 5
## 17 34 12.0 5
## 18 6 18.4 5
## 19 30 11.5 5
## 20 11 9.7 5
## 21 1 9.7 5
## 22 11 16.6 5
## 23 4 9.7 5
## 24 32 12.0 5
## 25 NA 16.6 5
## 26 NA 14.9 5
## 27 NA 8.0 5
## 28 23 12.0 5
## 29 45 14.9 5
## 30 115 5.7 5
## 31 37 7.4 5
#
Dplyr data manipulation functionsRows:
Columns:
select()
changes whether or not a column is included.
rename()
changes the name of columns.
mutate()
changes the values of columns and creates new columns.
relocate()
changes the order of the columns.
Groups of rows:
summarise()
collapses a group into a single row.https://openintrostat.github.io/oilabs-tidy/02_intro_to_data/intro_to_data.html
https://dplyr.tidyverse.org/articles/dplyr.html
https://rstudio.github.io/cheatsheets/html/data-transformation.html
DplyrYou can create new columns or modify existing ones using the
mutate() function.
For example, to create a new column that categorizes temperature:
df <- df %>%
mutate(Temp_Category = ifelse(Temp > 80, "High", "Moderate")) # if Temp>80 -> "High" otherwise "Moderate"
head(df)
## X Ozone Solar.R Wind Temp Month Day Temp_Category
## 1 1 41 190 7.4 67 5 1 Moderate
## 2 2 36 118 8.0 72 5 2 Moderate
## 3 3 12 149 12.6 74 5 3 Moderate
## 4 4 18 313 11.5 62 5 4 Moderate
## 5 5 NA NA 14.3 56 5 5 Moderate
## 6 6 28 NA 14.9 66 5 6 Moderate
dplyrA contingency table displays the frequency distribution of variables.
You can create contingency tables using dplyr in
combination with tidyr.
For example, to create a contingency table of Month vs. Ozone_Level:
df <- df %>%
mutate(Ozone_Level = ifelse(Ozone > median(Ozone, na.rm=TRUE), "High", "Low"))
contingency_table <- df %>%
group_by(Month, Ozone_Level) %>%
summarise(Count = n()) %>%
pivot_wider(names_from = Ozone_Level, values_from = Count)
## `summarise()` has grouped output by 'Month'. You
## can override using the `.groups` argument.
print(contingency_table)
## # A tibble: 5 × 4
## # Groups: Month [5]
## Month High Low `NA`
## <int> <int> <int> <int>
## 1 5 7 19 5
## 2 6 3 6 21
## 3 7 21 5 5
## 4 8 18 8 5
## 5 9 9 20 1
You can categorize solar.R by
Solar.R <= 115 : Low
115< Solar.R <= 185 : Medium
258< Solar.R : High
Create a contingency table for categorized solar.R vs
Ozone_level
contingency_table2 <- df %>%
mutate(solar_cat = case_when(
Solar.R <= 115 ~ "Low",
Solar.R > 115 & Solar.R <= 185 ~ "Medium",
Solar.R > 258 ~ "High"
)) %>%
group_by(solar_cat, Ozone_Level) %>%
summarise(Count = n()) %>%
pivot_wider(
names_from = Ozone_Level,
values_from = Count
)
## `summarise()` has grouped output by 'solar_cat'.
## You can override using the `.groups` argument.
print(contingency_table2)
## # A tibble: 4 × 4
## # Groups: solar_cat [4]
## solar_cat High Low `NA`
## <chr> <int> <int> <int>
## 1 High 15 12 10
## 2 Low 6 23 8
## 3 Medium 7 7 8
## 4 <NA> 30 16 11
We have NA values in Ozone,
Solar.R and so on.
print(head(df))
## X Ozone Solar.R Wind Temp Month Day Temp_Category Ozone_Level
## 1 1 41 190 7.4 67 5 1 Moderate High
## 2 2 36 118 8.0 72 5 2 Moderate High
## 3 3 12 149 12.6 74 5 3 Moderate Low
## 4 4 18 313 11.5 62 5 4 Moderate Low
## 5 5 NA NA 14.3 56 5 5 Moderate <NA>
## 6 6 28 NA 14.9 66 5 6 Moderate Low
Let’s filter out those NA values from df
and make it to filtered_df .
NA values based on Ozone column.filtered_df <- df %>%
filter(!is.na(Ozone))
print(filtered_df[1:10,])
## X Ozone Solar.R Wind Temp Month Day Temp_Category Ozone_Level
## 1 1 41 190 7.4 67 5 1 Moderate High
## 2 2 36 118 8.0 72 5 2 Moderate High
## 3 3 12 149 12.6 74 5 3 Moderate Low
## 4 4 18 313 11.5 62 5 4 Moderate Low
## 5 6 28 NA 14.9 66 5 6 Moderate Low
## 6 7 23 299 8.6 65 5 7 Moderate Low
## 7 8 19 99 13.8 59 5 8 Moderate Low
## 8 9 8 19 20.1 61 5 9 Moderate Low
## 9 11 7 NA 6.9 74 5 11 Moderate Low
## 10 12 16 256 9.7 69 5 12 Moderate Low
NA values based on Ozone and
Solar.Rcolumns.filtered_df <- df %>%
filter(!is.na(Ozone)) %>%
filter(!is.na(Solar.R))
print(filtered_df[1:10,])
## X Ozone Solar.R Wind Temp Month Day Temp_Category Ozone_Level
## 1 1 41 190 7.4 67 5 1 Moderate High
## 2 2 36 118 8.0 72 5 2 Moderate High
## 3 3 12 149 12.6 74 5 3 Moderate Low
## 4 4 18 313 11.5 62 5 4 Moderate Low
## 5 7 23 299 8.6 65 5 7 Moderate Low
## 6 8 19 99 13.8 59 5 8 Moderate Low
## 7 9 8 19 20.1 61 5 9 Moderate Low
## 8 12 16 256 9.7 69 5 12 Moderate Low
## 9 13 11 290 9.2 66 5 13 Moderate Low
## 10 14 14 274 10.9 68 5 14 Moderate Low
# filtered_df <- df %>%
# filter(!is.na(Ozone) & !is.na(Solar.R)). ## this is equivalent code
filtered_df <- df %>%
filter(complete.cases(.))
print(filtered_df[1:10,])
## X Ozone Solar.R Wind Temp Month Day Temp_Category Ozone_Level
## 1 1 41 190 7.4 67 5 1 Moderate High
## 2 2 36 118 8.0 72 5 2 Moderate High
## 3 3 12 149 12.6 74 5 3 Moderate Low
## 4 4 18 313 11.5 62 5 4 Moderate Low
## 5 7 23 299 8.6 65 5 7 Moderate Low
## 6 8 19 99 13.8 59 5 8 Moderate Low
## 7 9 8 19 20.1 61 5 9 Moderate Low
## 8 12 16 256 9.7 69 5 12 Moderate Low
## 9 13 11 290 9.2 66 5 13 Moderate Low
## 10 14 14 274 10.9 68 5 14 Moderate Low