Using dplyr and pool to query a database
Note (6/22/17): pool
is now compatible with dplyr
0.7.0 and the brand new dbplyr
. If you want to continue using pool with an older version of dplyr, please install pool@dplyr-pre-0.7.0-compat
(devtools::install_github("rstudio/pool@dplyr-pre-0.7.0-compat
).
Installation
There are four packages that you need throughout this series of articles. Here are the installation instructions, so your code runs smoothly:
# get shiny, DBI, dplyr and dbplyr from CRAN
install.packages("shiny")
install.packages("DBI")
install.packages("dplyr")
install.packages("dbplyr")
# get pool from GitHub, since it's not yet on CRAN
::install_github("rstudio/pool") devtools
Motivation
Many of you might be using dplyr
to connect to your database, especially because it allows you to query it using R code, rather than the more cumbersome SQL equivalent. For this reason and for performance, we’d actually recommend users to use dplyr
by default, unless they actually need more functionality and must turn to DBI
instead (when it comes to databases, dplyr
only allows you to do queries that can be transformed into a SELECT
SQL statement). If you’re not familiar with dplyr
applied to databases, make sure to read the section about this on the first article of this series.
Here’s why dplyr
tends to perform better than DBI
(from dplyr
’s vignette about databases):
When working with databases, dplyr tries to be as lazy as possible:
- It never pulls data into R unless you explicitly ask for it.
- It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
But on top of this, it’s possible to get even better performance by using a pool. Most importantly, however, you should use a pool for its connection management benefits. To do so, you only need to do minor changes to your existing code.
Combining dplyr
and pool
Comparison to dplyr
code
First, let’s consider how you’d connect to and query a MySQL database using only dplyr
:
library(dplyr)
<- dbPool(
my_db ::MySQL(),
RMySQLdbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
# get the first 5 rows:
%>% tbl("City") %>% head(5)
my_db ## # Source: lazy query [?? x 5]
## # Database: mysql 10.0.17-MariaDB [guest@shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com:/shinydemo]
## ID Name CountryCode District Population
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Kabul AFG Kabol 1780000
## 2 2 Qandahar AFG Qandahar 237500
## 3 3 Herat AFG Herat 186800
## 4 4 Mazar-e-Sharif AFG Balkh 127800
## 5 5 Amsterdam NLD Noord-Holland 731200
Now, let’s do the same thing using a Pool object:
library(pool)
library(dplyr)
<- dbPool(
pool drv = RMySQL::MySQL(),
dbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
# get the first 5 rows:
%>% tbl("City") %>% head(5)
pool ## # Source: lazy query [?? x 5]
## # Database: mysql 10.0.17-MariaDB [guest@shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com:/shinydemo]
## ID Name CountryCode District Population
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Kabul AFG Kabol 1780000
## 2 2 Qandahar AFG Qandahar 237500
## 3 3 Herat AFG Herat 186800
## 4 4 Mazar-e-Sharif AFG Balkh 127800
## 5 5 Amsterdam NLD Noord-Holland 731200
Shiny apps
What’s the advantage of using pool
with dplyr
, rather than just using dplyr
to query a database? As usual with pool
, the answer is performance and connection management. In order to reap these benefits within a Shiny app, however, you need to be careful about where you create your pool and where you use tbl
(or equivalent). The same idea that we used for combining DBI
and pool
, also applies here. First, you should create your pool at the top of server.R
(or in global.R
), but outside the actual server function. Then, for each query, you should use tbl
(or equivalent). In a Shiny app, this means that each reactive or function that queries the database has its own call to tbl
. For example, here’s the same app used in the pool
basics article, but written with dplyr
instead of DBI
:
library(shiny)
library(DBI)
library(pool)
<- dbPool(
pool drv = RMySQL::MySQL(),
dbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
<- fluidPage(
ui textInput("ID", "Enter your ID:", "5"),
tableOutput("tbl"),
numericInput("nrows", "How many cities to show?", 10),
plotOutput("popPlot")
)
<- function(input, output, session) {
server $tbl <- renderTable({
output%>% tbl("City") %>%
pool filter(ID == input$ID)
})$popPlot <- renderPlot({
output<- pool %>% tbl("City") %>%
df head(as.integer(input$nrows)[1]) %>% collect()
<- df$Population
pop names(pop) <- df$Name
barplot(pop)
})
}
shinyApp(ui, server)
You’ll note that there is no need to do our own input sanitizing for SQL injection prevention (i.e. no need to call a function like DBI’s sqlInterpolate
). This is because dplyr
acts a middleman between you and the actual SQL query sent to the database; as that middleman, it also sanitizes your inputs for you.