HOXO-M - anonymous data analyst group in Japan

2015/08/04

New package "dplyrr" - Utilities for comfortable use of dplyr with databases

1. Overview

dplyr package is the most powerful package for data handling in R, and it has also the ability of working with databases(See Vignette).
But the functionalities of dealing with databases in dplyr is begin developped even now.
Now, I'm trying to make dplyr with databases more comfortable by some functions.
For that purpose, I've created dplyrr package.
dplyrr has below functions:

load_tbls() : Easy to load table objects for all tables in a database.
cut() in mutate() : Easy to create a case statement by using the grammar like the base::cut().
count_if() and n_if() in summarise() : Shortcut to count rows that a condition is satisfied.
filter() : Improved filter() for tbl_sql which adds parentheses appropriately.
moving_mean() in mutate() : Compute moving average for PostgreSQL.
moving_max() in mutate() : Compute moving max for PostgreSQL.
moving_min() in mutate() : Compute moving min for PostgreSQL.
moving_sum() in mutate() : Compute moving sum for PostgreSQL.
first_value() in mutate() : Compute first value for PostgreSQL.

2. How to install

The source code for dplyrr package is available on GitHub at

https://github.com/hoxo-m/dplyrr

You can install the pakage from there.

install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("hoxo-m/dplyrr")

3. Common functions for all databases

For illustration, we use a database file: "my_db.sqlite3".
If you want to trace the codes below, you should create the databese file at first.

library(dplyrr)
library(nycflights13)

db <- src_sqlite("my_db.sqlite3", create = TRUE)
copy_nycflights13(db)

3-1. load_tbls()

Usually, when we use a database with dplyr, we first create database object, and we can see the tables in the databese by show().

library(dplyrr)
# Create database object
db <- src_sqlite("my_db.sqlite3")
show(db)

## src:  sqlite 3.8.6 [my_db.sqlite3]
## tbls: airlines, airports, flights, planes, sqlite_stat1, weather

Next, we create table objects for pulling data in some tables in the database.

airlines_tbl <- tbl(db, "airlines")
airports_tbl <- tbl(db, "airports")
flights_tbl <- tbl(db, "flights")
planes_tbl <- tbl(db, "planes")
weather_tbl <- tbl(db, "weather")

Typing this code is really a bore!
If you want to create table objects for all tables in the database, you can use load_tbls().

load_tbls(db)

## Loading: airlines_tbl
## Loading: airports_tbl
## Loading: flights_tbl
## Loading: planes_tbl
## Loading: sqlite_stat1_tbl
## Loading: weather_tbl

Check the created table objects.

ls(pattern = "_tbl$")

## [1] "airlines_tbl"     "airports_tbl"     "flights_tbl"     
## [4] "planes_tbl"       "sqlite_stat1_tbl" "weather_tbl"

glimpse(airlines_tbl)

## Observations: 16
## Variables:
## $ carrier (chr) "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", ...
## $ name    (chr) "Endeavor Air Inc.", "American Airlines Inc.", "Alaska...

3-2. cut() in mutate()

If you want to write case statement with like base::cut(), you can use cut() function in mutate().
For example, there is air_time column in the database.

db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>% select(air_time)
air_time <- q %>% collect
head(air_time, 3)

## Source: local data frame [3 x 1]
## 
##   air_time
## 1      227
## 2      227
## 3      160

If you want to group the air_time by break points c(0, 80, 120, 190, 900), you think you must write the next code.

q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = if(air_time > 0 && air_time <= 80) "(0,80]"
         else if(air_time > 80 && air_time <= 120) "(80,120]"
         else if(air_time > 120 && air_time <= 190) "(120,190]"
         else if(air_time > 190 && air_time <= 900) "(190,900]")
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)

## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227    (190,900]
## 2      227    (190,900]
## 3      160    (120,190]

When the break points increase, you are going to be tired to write more lines.
By using cut() function in mutate(), it can become easy.

q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900)))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)

## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227    (190,900]
## 2      227    (190,900]
## 3      160    (120,190]

The cut() in mutate() has more arguments such as labels coming from base::cut().

cut(variable, breaks, labels, include.lowest, right, dig.lab)

For integer break points, specially you can indicate labels="-".

q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900), labels="-"))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)

## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227      191-900
## 2      227      191-900
## 3      160      121-190

3-3. count_if() and n_if() in summarise()

When we want to count rows that condition is satisfied, we might write like this.

q <- flights_tbl %>% 
  select(air_time) %>%
  summarise(odd_airtime_rows = sum(if(air_time %% 2 == 1) 1L else 0L), 
            even_airtime_rows = sum(if(air_time %% 2 == 0) 1L else 0L), 
            total_rows=n())
q %>% collect

## Source: local data frame [1 x 3]
## 
##   odd_airtime_rows even_airtime_rows total_rows
## 1           164150            163196     336776

The count_if() and n_if() functions are a shortcut for it merely.

count_if(condition)
n_if(condition)

q <- flights_tbl %>% 
  select(air_time) %>%
  summarise(odd_airtime_rows = count_if(air_time %% 2 == 1), 
            even_airtime_rows = n_if(air_time %% 2 == 0), 
            total_rows=n())
q %>% collect

## Source: local data frame [1 x 3]
## 
##   odd_airtime_rows even_airtime_rows total_rows
## 1           164150            163196     336776

Both functions do exactly the same thing.

3-4. Improved filter()

If you use dplyr with databases in pure mind, you can encounter the unintended action like below.

library(dplyr)

db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
  select(month, air_time) %>%
  filter(month == 1) %>%
  filter(air_time > 200 || air_time < 100)
q$query

## <Query> SELECT "month" AS "month", "air_time" AS "air_time"
## FROM "flights"
## WHERE "month" = 1.0 AND "air_time" > 200.0 OR "air_time" < 100.0
## <SQLiteConnection>

Did you expect the WHERE clause to be that?
If you use dplyrr, it becomes natural by adding parentheses.

library(dplyrr)

db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
  select(month, air_time) %>%
  filter(month == 1) %>%
  filter(air_time > 200 || air_time < 100)
q$query

## <Query> SELECT "month" AS "month", "air_time" AS "air_time"
## FROM "flights"
## WHERE ("month" = 1.0) AND ("air_time" > 200.0 OR "air_time" < 100.0)
## <SQLiteConnection>

4. Functions for PostgreSQL

4-1. moving_**() in mutate()

dplyrr has four moving_**() functions that you can use in mutate().

moving_mean(variable, preceding, following)
moving_max(variable, preceding, following)
moving_min(variable, preceding, following)
moving_sum(variable, preceding, following)

When you want to set the same preceding and following, you can omit following.
For illustration, we use the test database that is PostgreSQL.

srcs <- temp_srcs("postgres")
df <- data.frame(x = 1:5)
tbls <- dplyr:::temp_load(srcs, list(df = df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)

##   x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5

Compute moving average with 1 preceding and 1 following.

q <- temp_tbl %>%
  mutate(y = moving_mean(x, 1))
q %>% collect

## Source: local data frame [5 x 2]
## 
##   x   y
## 1 1 1.5
## 2 2 2.0
## 3 3 3.0
## 4 4 4.0
## 5 5 4.5

Comfirm query.

q$query

## <Query> SELECT "x", "y"
## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS "y"
## FROM "tlsqbjsuou") AS "_W1"
## <PostgreSQLConnection:(10316,0)>

Compute moving mean with 1 preceding and 2 following.

q <- temp_tbl %>%
  mutate(y = moving_mean(x, 1, 2))
q %>% collect

## Source: local data frame [5 x 2]
## 
##   x   y
## 1 1 2.0
## 2 2 2.5
## 3 3 3.5
## 4 4 4.0
## 5 5 4.5

Comfirm query.

q$query

## <Query> SELECT "x", "y"
## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS "y"
## FROM "tlsqbjsuou") AS "_W2"
## PostgreSQLConnection:(10316,0)>

Similary, you can use the other moving_**() functions.

4-2. first_value() in mutate()

dplyrr has first_value() function that you can use in mutate().

first_value(value, order_by)

When you want to set the same value and order_by, you can omit order_by.
For illustration, we use the test database that is PostgreSQL.

srcs <- temp_srcs("postgres")
df <- data.frame(class = c("A", "A", "B", "B", "C", "C"), x = 1:6, y = 6:1)
tbls <- dplyr:::temp_load(srcs, list(df=df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)

##   class x y
## 1     A 1 6
## 2     A 2 5
## 3     B 3 4
## 4     B 4 3
## 5     C 5 2
## 6     C 6 1

Get the first values of x partitioned by class and ordered by x.

q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x))
q %>% collect

## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 1 6 1
## 2     A 2 5 1
## 3     B 3 4 3
## 4     B 4 3 3
## 5     C 5 2 5
## 6     C 6 1 5

See query.

q$query

## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "x") AS "z"
## FROM "slrhxfdvrt") AS "_W3"
## <PostgreSQLConnection:(10316,0)>

Get the first values of x partitioned by class and ordered by y.

q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x, y))
q %>% collect

## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 2 5 2
## 2     A 1 6 2
## 3     B 4 3 4
## 4     B 3 4 4
## 5     C 6 1 6
## 6     C 5 2 6

See query.

q$query

## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y") AS "z"
## FROM "slrhxfdvrt") AS "_W4"
## <PostgreSQLConnection:(10316,0)>

Get the first values of x partitioned by class and ordered by descent of y.

q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x, desc(y)))
q %>% collect

## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 1 6 1
## 2     A 2 5 1
## 3     B 3 4 3
## 4     B 4 3 3
## 5     C 5 2 5
## 6     C 6 1 5

See query.

q$query

## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y" DESC) AS "z"
## FROM "slrhxfdvrt") AS "_W5"
## <PostgreSQLConnection:(10316,0)>

5. Miscellaneous

update_dplyrr()

update_dplyrr() is a shortcut/syntax sugar of

devtools::install_github("hoxo-m/dplyrr")

unload_dplyrr()

unload_dplyrr() is a shortcut/syntax sugar of

detach("package:dplyrr", unload = TRUE)
detach("package:dplyr", unload = TRUE)

6. Bug reports

https://github.com/hoxo-m/dplyrr/issues

2015/07/28

The complete catalog of argument variations of select() in dplyr

When I read the dplyr vignette, I found a convenient way to select sequential columns such as select(data, year:day).
Because I had inputted only column names to select() function, I was deeply affected by the convenient way.

On closer inspection, I found that the select() function accepts many types of input.
Here, I will enumerate the variety of acceptable inputs for select() function.

By the way, these column selection methods also can use in the summarise_each(), mutate_each() and some functions in tidyr package(e.g. gather()).

1. Whole codes.

At first, whole codes were shown for perspicuity.
In the sections below, the details of each command were shown.

# Data preparation ------------------------------------------------------------------
library(dplyr)
library(nycflights13)
set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)

# Basic method of use select() ----------------------------------------------------------------------
select(data, year)
select(data, year, month, day)
select(data, year:day)

select(data, -year, -month, -day)
select(data, -(year:day))

select(data, 1, 2, 3)
select(data, 1:3)

select(data, -1, -2, -3)
select(data, -(1:3))

select(data, year:day, -month)
select(data, -(year:day), month)
select(data, 1:3, -2)
select(data, -(1:3), 2)

# Utility functions of select() --------------------------------------------------------------
select(data, starts_with("arr"))
select(data, ends_with("time"))
select(data, contains("_"))
select(data, matches("^(dep|arr)_"))

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))
select(data2, num_range("x", c(9, 11)))

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))

col_vector <- c("year", "month", "day")
select(data, col_vector)
select(data, one_of(col_vector))

select(data, everything())

select(data, -starts_with("arr"))

# Standard evaluation --------------------------------------------------------------------
select_(data, "year", "month", "day")

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

2. Data preparation.

To follow the dplyr vignette, flights data set in nycflights13 package were used as an example.

library(dplyr)
library(nycflights13)

set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)

Variables:
$ year      (int) 2013, 2013, 2013
$ month     (int) 12, 7, 3
$ day       (int) 15, 17, 2
$ dep_time  (int) 2124, 651, 1636
$ dep_delay (dbl) -4, -9, 1
$ arr_time  (int) 2322, 936, 1800
$ arr_delay (dbl) 1, -28, 0
$ carrier   (chr) "UA", "DL", "WN"
$ tailnum   (chr) "N801UA", "N194DN", "N475WN"
$ flight    (int) 289, 763, 1501
$ origin    (chr) "EWR", "JFK", "LGA"
$ dest      (chr) "DTW", "LAX", "MKE"
$ air_time  (dbl) 88, 306, 103
$ distance  (dbl) 488, 2475, 738
$ hour      (dbl) 21, 6, 16
$ minute    (dbl) 24, 51, 36

This data set includes the 16 columns shown above.

3. Basic method of use select().

At first, the ways of using select() were shown.

select(data, year)

This process shows the way to take the year column out of data. To pick multiple columns, you can write the following.

select(data, year, month, day)

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

If columns were sequential in the dataset, you could write the following to pick sequential columns.

select(data, year:day)

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

If you want to remove a specific column, add - in the head of the column name as follows.

select(data, -year, -month, -day)

  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36

To remove sequential columns, put sequential columns in brackets () connected with a colon.

select(data, -(year:day))

  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36

It is also possible to pick columns by choosing the column number.

select(data, 1, 2, 3)
select(data, 1:3)

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

Next topics are slightly advanced.

It is possible to pick sequential columns temporarily and remove some of these.

select(data, year:day, -month)

It is also possible to remove sequential columns and keep a part of these.

select(data, -(year:day), month)

  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute month
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24    12
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51     7
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36     3

Even using a column number can give the same result with the column name. (The results are omitted.)

select(data, 1:3, -2)
select(data, -(1:3), 2)

4. Utility functions of select().

Utility functions existing in select(), summarise_each() and mutate_each() in dplyr as well as some functions in the tidyr package.

Seven functions existed in the utility functions of select().

starts_with(match, ignore.case = TRUE)
ends_with(match, ignore.case = TRUE)
contains(match, ignore.case = TRUE)
matches(match, ignore.case = TRUE)
num_range(prefix, range, width = NULL)
one_of(...)
everything()

We now check the respective commands and how to use them.

First, starts_with() picks columns whose name starts with the specified string.

select(data, starts_with("arr"))

  arr_time arr_delay
1     2322         1
2      936       -28
3     1800         0

The argument ignore.case specifies whether the lowercase is classified as a capital letter(default is TRUE).

The ends_with() picks columns whose name ends with the specified string .

select(data, ends_with("time"))

  dep_time arr_time air_time
1     2124     2322       88
2      651      936      306
3     1636     1800      103

The contains() picks columns whose name contains the specified string.

select(data, contains("_"))

  dep_time dep_delay arr_time arr_delay air_time
1     2124        -4     2322         1       88
2      651        -9      936       -28      306
3     1636         1     1800         0      103

The matches() picks columns based on a regular expression matching string.

select(data, contains("_"))

  dep_time dep_delay arr_time arr_delay air_time
1     2124        -4     2322         1       88
2      651        -9      936       -28      306
3     1636         1     1800         0      103

When the numbers were included in column names, num_range() might be useful.
In this example, we change the column names to be x1-x16 and execute num_range() command for the data set.

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))

  x8     x9  x10 x11
1 UA N801UA  289 EWR
2 DL N194DN  763 JFK
3 WN N475WN 1501 LGA

By specifying as num_range("x", 8:11), columns x8 to x11 can be identified.
Numbers in the column name are not necessarily sequential.

select(data2, num_range("x", c(9, 11)))

      x9 x11
1 N801UA EWR
2 N194DN JFK
3 N475WN LGA

When column names were padded, the column name was shown as x01.
Here, the argument width in num_range() might be useful.
We now try this process for a data that changes the column names as x01-x16.

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))

  x08    x09  x10 x11
1  UA N801UA  289 EWR
2  DL N194DN  763 JFK
3  WN N475WN 1501 LGA

By specifying as width=2, the zero filled columns can be picked out.

When a column is named as a vector or character string, one_of() might be useful.
An error occurs in the following case with select().

col_vector <- c("year", "month", "day")
select(data, col_vector)

Error: All select() inputs must resolve to integer column positions.
The following do not:
*  col_vector

However, an intended process occurs in the case of one_of().

select(data, one_of(col_vector))

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

The everything() selects all columns. (Result was omitted.)

select(data, everything())

If add - is in the head of a utility function name, we can pick out all except for the area specified in the utility function.

select(data, -starts_with("arr"))

  year month day dep_time dep_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013    12  15     2124        -4      UA  N801UA    289    EWR  DTW       88      488   21     24
2 2013     7  17      651        -9      DL  N194DN    763    JFK  LAX      306     2475    6     51
3 2013     3   2     1636         1      WN  N475WN   1501    LGA  MKE      103      738   16     36

5. Standard evaluation.

Thus far, we explained the normal select() function; however, the normal select() function cannot handle character strings as arguments.
This might become a problem when column names are given as a string vector for example.
To solve this problem, the select_() function was equipped in dplyr. (Caution: An underscore was added in the function name.)
The use of the select_() function is the same as the select() except specifying columns by string; however, attention is needed when specifying a column name by a vector.

select_(data, "year", "month", "day")

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

When specifying column names by a vector, the vector should be given the .dot argument.

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)

  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

All arguments that can use the select() function are also possible candidates for the select_() function.

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

Furthermore, also in this case, the argument vector should be given the .dot argument in the select_() function.

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

6. References.

Introduction to dplyr
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Help page for select() function
> help("select", package = "dplyr")

Original entry in Japanese by hoxo_m
https://qiita.com/hoxo_m/items/f2f1793c6f086d381340

Translated by siero

2015/07/12

New package "SparkRext" - SparkR extension for closer to dplyr

Apache Spark is one of the hottest products in data science.
Spark 1.4.0 has formally adopted SparkR package which enables to handle Spark DataFrames on R.

SparkR is very useful and powerful.
One of the reasons is that SparkR DataFrames present an API similar to dplyr.
We launced our new package "SparkRext" to redefine the functions of SparkR to enable NSE(Non stanard Evaluation) inputs.
As a result, the functions will be able to be used in the same way as dplyr.

If you want to know about SparkRext package in detail, please check our blog post here.

2015/01/30

The time series of JGB Interest Rate(10Y) over the last one year

World wide quantitative easing does not seem to end. I live and work in Japan which has the lowest interest rate.

Thanks to Quandl and RStudio, I can easily get the data of the interest rate and visualize it with R and dygraphs package!

You can understand how you download the data of the interest rate from Quandl and vizualize it as javascript graph(dygraphs) with R when you see the following document published in RPubs.

http://rpubs.com/teramonagi/JGBInterestRate10Y

And also, You can download the entire content of the above document including R code from the following link:

https://github.com/teramonagi/RMarkDownsForRPubs/blob/master/rmd/JGBInterestRate10Y.Rmd

Enjoy!

登録: 投稿 (Atom)