Commit

Merge branch 'master' of github.com:tidyverse/dbplyr
hadley committed Jun 13, 2017
2 parents eb4056b + 6b5848f commit 848c58a
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions vignettes/dbplyr.Rmd
@@ -19,7 +19,7 @@ As well as working with local in-memory data stored in data frames, dplyr also w
* You have so much data that it does not all fit into memory simultaneously
and you need to use some external storage engine.

-(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating).
+(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)

This vignette focusses on the first scenario because it's the most common. If you're using R to do data analysis inside a company, most of the data you need probably already lives in a database (it's just a matter of figuring out which one!). However, you will learn how to load data into a local database in order to demonstrate dplyr's database tools. At the end, I'll also give you a few pointers if you do need to set up your own database.

@@ -62,7 +62,7 @@ library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
```

-The arguments to `DBI::dbConnect()` vary from database to database, but the first argument is always the database backend. It's `RSQLite::SQLite()` for RSQLite, `RMySQL::MySQL()` for RMySQL, `RPostgreSQL::PostgreSQL()` for RPostgreSQL, `odbc::odbc()` for odbc, and `bigrquery::bigquery()` for BigQuery. SQLite only needs one other argument: the path to the database. Here we use the special string ":memory:" which causes SQLite to make a temporary in-memory database.
+The arguments to `DBI::dbConnect()` vary from database to database, but the first argument is always the database backend. It's `RSQLite::SQLite()` for RSQLite, `RMySQL::MySQL()` for RMySQL, `RPostgreSQL::PostgreSQL()` for RPostgreSQL, `odbc::odbc()` for odbc, and `bigrquery::bigquery()` for BigQuery. SQLite only needs one other argument: the path to the database. Here we use the special string `":memory:"` which causes SQLite to make a temporary in-memory database.

Most existing databases don't live in a file, but instead live on another server. That means that, in real life, your code will look more like this:
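
The chunk that follows this sentence is elided from the diff; a representative sketch (the host, user, and password shown are placeholders, not values from the vignette) might be:

```r
con <- DBI::dbConnect(RMySQL::MySQL(),
  host = "database.rstudio.com",  # placeholder hostname
  user = "hadley",                # placeholder username
  password = rstudioapi::askForPassword("Database password")
)
```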

@@ -90,7 +90,7 @@ copy_to(con, nycflights13::flights, "flights",
)
```

-As you can see, the `copy_to()` operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, by carrier, by plane, and destination. Creating the write indices is key to good database performance, but is unfortunately beyond the scope of this article.
+As you can see, the `copy_to()` operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, carrier, plane, and destination. Creating the right indices is key to good database performance, but is unfortunately beyond the scope of this article.
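
The full `copy_to()` chunk is truncated in the hunk above; a reconstruction of what such a call plausibly looks like (the exact index list is an assumption, not copied from the commit):

```r
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),  # composite index for by-day queries
    "carrier",                  # by-carrier queries
    "tailnum",                  # by-plane queries
    "dest"                      # by-destination queries
  )
)
```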

Now that we've copied the data, we can use `tbl()` to take a reference to it:
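
The chunk itself is elided here; presumably something along these lines:

```r
flights_db <- tbl(con, "flights")
flights_db
```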

@@ -179,11 +179,11 @@ nrow(tailnum_delay_db)
tail(tailnum_delay_db)
```

-You can also ask the database how it plans to execute the query with `explain()`. The output is database dependent, and can be esoteric, but learning a bit about it can be very useful because it helps you understand if the database can execute query efficiently, or if you need to create new indices.
+You can also ask the database how it plans to execute the query with `explain()`. The output is database dependent, and can be esoteric, but learning a bit about it can be very useful because it helps you understand if the database can execute the query efficiently, or if you need to create new indices.
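
A minimal sketch, assuming the `tailnum_delay_db` query defined earlier in the vignette:

```r
tailnum_delay_db %>% explain()
# Prints the generated SQL, then the backend's query plan; on SQLite the
# plan shows whether a full table SCAN or an index SEARCH will be used.
```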

## Creating your own database

-If you don't already have a database, here's some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. In my opinion, you shouldn't bother with MySQL/MariaDB: it's a pain to set up, the documentation is subpar, and its less featureful than Postgres. Google BigQuery might be a good fit if you have very large data, or if you're willing to pay (a small amount of) money to someone who'll look after your database.
+If you don't already have a database, here's some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. In my opinion, you shouldn't bother with MySQL/MariaDB: it's a pain to set up, the documentation is subpar, and it's less featureful than Postgres. Google BigQuery might be a good fit if you have very large data, or if you're willing to pay (a small amount of) money to someone who'll look after your database.

All of these databases follow a client-server model - a computer that connects to the database and a computer that runs the database (the two may be one and the same, but usually aren't). Getting one of these databases up and running is beyond the scope of this article, but there are plenty of tutorials available on the web.

@@ -195,12 +195,12 @@ In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. I

PostgreSQL is a considerably more powerful database than SQLite. It has:

-* a much wider range of [built-in functions](http://www.postgresql.org/docs/9.3/static/functions.html)
+* a much wider range of [built-in functions](http://www.postgresql.org/docs/9.3/static/functions.html), and

* support for [window functions](http://www.postgresql.org/docs/9.3/static/tutorial-window.html), which allow grouped subsets and mutates to work.
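
A hypothetical example of a grouped mutate that needs this support (assuming the `flights_db` reference sketched earlier; dbplyr translates `min_rank()` to a `RANK() OVER (PARTITION BY ...)` window function, so it fails on backends without one):

```r
flights_db %>%
  group_by(dest) %>%
  mutate(delay_rank = min_rank(desc(dep_delay)))
```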

### BigQuery

BigQuery is a hosted database server provided by Google. To connect, you need to provide your `project`, `dataset` and optionally a project for `billing` (if billing for `project` isn't enabled).
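
A sketch of such a connection (all three values are placeholders, not from the vignette):

```r
con <- DBI::dbConnect(bigrquery::bigquery(),
  project = "publicdata",        # placeholder project
  dataset = "samples",           # placeholder dataset
  billing = "my-billing-project" # project to bill, if needed
)
```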

-It provides a similar set of functions to Postgres and is designed specifically for analytic workflows. Because it's a hosted solution, there's no setup involved, but if you have a lot of data, getting it to google can be an ordeal (especially because upload support from R is not great currently). (If you have lots of data, you can [ship hard drives](<https://cloud.google.com/storage/docs/offline-media-import-export>)!)
+It provides a similar set of functions to Postgres and is designed specifically for analytic workflows. Because it's a hosted solution, there's no setup involved, but if you have a lot of data, getting it to Google can be an ordeal (especially because upload support from R is not great currently). (If you have lots of data, you can [ship hard drives](<https://cloud.google.com/storage/docs/offline-media-import-export>)!)
