dplyr/0000755000176200001440000000000013163257361011413 5ustar liggesusersdplyr/inst/0000755000176200001440000000000013157241200012354 5ustar liggesusersdplyr/inst/doc/0000755000176200001440000000000013157241200013121 5ustar liggesusersdplyr/inst/doc/dplyr.Rmd0000644000176200001440000004667713135665123014755 0ustar liggesusers--- title: "Introduction to dplyr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to dplyr} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) library(ggplot2) set.seed(1014) ``` When working with data you must: * Figure out what you want to do. * Describe those tasks in the form of a computer program. * Execute the program. The dplyr package makes these steps fast and easy: * By constraining your options, it helps you think about your data manipulation challenges. * It provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code. * It uses efficient backends, so you spend less time waiting for the computer. This document introduces you to dplyr's basic set of tools, and shows you how to apply them to data frames. dplyr also supports databases via the dbplyr package, once you've installed, read `vignette("dbplyr")` to learn more. ## Data: nycflights13 To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::flights`. This dataset contains all `r nrow(nycflights13::flights)` flights that departed from New York City in 2013. The data comes from the US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?nycflights13` ```{r} library(nycflights13) dim(flights) flights ``` Note that `nycflights13::flights` is a tibble, a modern reimagining of the data frame. It's particular useful for large datasets because it only prints the first few rows. You can learn more about tibbles at ; in particular you can convert data frames to tibbles with `as_tibble()`. ## Single table verbs Dplyr aims to provide a function for each basic verb of data manipulation: * `filter()` to select cases based on their values. * `arrange()` to reorder the cases. * `select()` and `rename()` to select variables based on their names. * `mutate()` and `transmute()` to add new variables that are functions of existing variables. * `summarise()` to condense multiple values to a single value. * `sample_n()` and `sample_frac()` to take random samples. ### Filter rows with `filter()` `filter()` allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is `TRUE`. For example, we can select all flights on January 1st with: ```{r} filter(flights, month == 1, day == 1) ``` This is rougly equivalent to this base R code: ```{r, eval = FALSE} flights[flights$month == 1 & flights$day == 1, ] ``` ### Arrange rows with `arrange()` `arrange()` works similarly to `filter()` except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns: ```{r} arrange(flights, year, month, day) ``` Use `desc()` to order a column in descending order: ```{r} arrange(flights, desc(arr_delay)) ``` ### Select columns with `select()` Often you work with large datasets with many columns but only a few are actually of interest to you. `select()` allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions: ```{r} # Select columns by name select(flights, year, month, day) # Select all columns between year and day (inclusive) select(flights, year:day) # Select all columns except those from year to day (inclusive) select(flights, -(year:day)) ``` There are a number of helper functions you can use within `select()`, like `starts_with()`, `ends_with()`, `matches()` and `contains()`. These let you quickly match larger blocks of variables that meet some criterion. See `?select` for more details. You can rename variables with `select()` by using named arguments: ```{r} select(flights, tail_num = tailnum) ``` But because `select()` drops all the variables not explicitly mentioned, it's not that useful. Instead, use `rename()`: ```{r} rename(flights, tail_num = tailnum) ``` ### Add new columns with `mutate()` Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`: ```{r} mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` `dplyr::mutate()` is similar to the base `transform()`, but allows you to refer to columns that you've just created: ```{r} mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) ) ``` If you only want to keep the new variables, use `transmute()`: ```{r} transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) ) ``` ### Summarise values with `summarise()` The last verb is `summarise()`. It collapses a data frame to a single row. ```{r} summarise(flights, delay = mean(dep_delay, na.rm = TRUE) ) ``` It's not that useful until we learn the `group_by()` verb below. ### Randomly sample rows with `sample_n()` and `sample_frac()` You can use `sample_n()` and `sample_frac()` to take a random sample of rows: use `sample_n()` for a fixed number and `sample_frac()` for a fixed fraction. ```{r} sample_n(flights, 10) sample_frac(flights, 0.01) ``` Use `replace = TRUE` to perform a bootstrap sample. If needed, you can weight the sample with the `weight` argument. ### Commonalities You may have noticed that the syntax and function of all these verbs are very similar: * The first argument is a data frame. * The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using `$`. * The result is a new data frame Together these properties make it easy to chain together multiple simple steps to achieve a complex result. These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (`arrange()`), pick observations and variables of interest (`filter()` and `select()`), add new variables that are functions of existing variables (`mutate()`), or collapse many values to a summary (`summarise()`). The remainder of the language comes from applying the five functions to different types of data. For example, I'll discuss how these functions work with grouped data. ## Patterns of operations The dplyr verbs can be classified by the type of operations they accomplish (we sometimes speak of their **semantics**, i.e., their meaning). The most important and useful distinction is between grouped and ungrouped operations. In addition, it is helpful to have a good grasp of the difference between select and mutate operations. ### Grouped operations The dplyr verbs are useful on their own, but they become even more powerful when you apply them to groups of observations within a dataset. In dplyr, you do this with the `group_by()` function. It breaks down a dataset into specified groups of rows. When you then apply the verbs above on the resulting object they'll be automatically applied "by group". Grouping affects the verbs as follows: * grouped `select()` is the same as ungrouped `select()`, except that grouping variables are always retained. * grouped `arrange()` is the same as ungrouped; unless you set `.by_group = TRUE`, in which case it orders first by the grouping variables * `mutate()` and `filter()` are most useful in conjunction with window functions (like `rank()`, or `min(x) == x`). They are described in detail in `vignette("window-functions")`. * `sample_n()` and `sample_frac()` sample the specified number/fraction of rows in each group. * `summarise()` computes the summary for each group. In the following example, we split the complete dataset into individual planes and then summarise each plane by counting the number of flights (`count = n()`) and computing the average distance (`dist = mean(distance, na.rm = TRUE)`) and arrival delay (`delay = mean(arr_delay, na.rm = TRUE)`). We then use ggplot2 to display the output. ```{r, warning = FALSE, message = FALSE, fig.width = 6} by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000) # Interestingly, the average delay is only slightly related to the # average distance flown by a plane. ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area() ``` You use `summarise()` with __aggregate functions__, which take a vector of values and return a single number. There are many useful examples of such functions in base R like `min()`, `max()`, `mean()`, `sum()`, `sd()`, `median()`, and `IQR()`. dplyr provides a handful of others: * `n()`: the number of observations in the current group * `n_distinct(x)`:the number of unique values in `x`. * `first(x)`, `last(x)` and `nth(x, n)` - these work similarly to `x[1]`, `x[length(x)]`, and `x[n]` but give you more control over the result if the value is missing. For example, we could use these to find the number of planes and the number of flights that go to each possible destination: ```{r} destinations <- group_by(flights, dest) summarise(destinations, planes = n_distinct(tailnum), flights = n() ) ``` When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset: ```{r} daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n())) (per_month <- summarise(per_day, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights))) ``` However you need to be careful when progressively rolling up summaries like this: it's ok for sums and counts, but you need to think about weighting for means and variances (it's not possible to do this exactly for medians). ### Selecting operations One of the appealing features of dplyr is that you can refer to columns from the tibble as if they were regular variables. However, the syntactic uniformity of referring to bare column names hide semantical differences across the verbs. A column symbol supplied to `select()` does not have the same meaning as the same symbol supplied to `mutate()`. Selecting operations expect column names and positions. Hence, when you call `select()` with bare variable names, they actually represent their own positions in the tibble. The following calls are completely equivalent from dplyr's point of view: ```{r} # `year` represents the integer 1 select(flights, year) select(flights, 1) ``` By the same token, this means that you cannot refer to variables from the surrounding context if they have the same name as one of the columns. In the following example, `year` still represents 1, not 5: ```r year <- 5 select(flights, year) ``` One useful subtlety is that this only applies to bare names and to selecting calls like `c(year, month, day)` or `year:day`. In all other cases, the columns of the data frame are not put in scope. This allows you to refer to contextual variables in selection helpers: ```{r} year <- "dep" select(flights, starts_with(year)) ``` These semantics are usually intuitive. But note the subtle difference: ```{r} year <- 5 select(flights, year, identity(year)) ``` In the first argument, `year` represents its own position `1`. In the second argument, `year` is evaluated in the surrounding context and represents the fifth column. For a long time, `select()` used to only understand column positions. Counting from dplyr 0.6, it now understands column names as well. This makes it a bit easier to program with `select()`: ```{r} vars <- c("year", "month") select(flights, vars, "day") ``` Note that the code above is somewhat unsafe because you might have added a column named `vars` to the tibble, or you might apply the code to another data frame containing such a column. To avoid this issue, you can wrap the variable in an `identity()` call as we mentioned above, as this will bypass column names. However, a more explicit and general method that works in all dplyr verbs is to unquote the variable with the `!!` operator. This tells dplyr to bypass the data frame and to directly look in the context: ```{r} # Let's create a new `vars` column: flights$vars <- flights$year # The new column won't be an issue if you evaluate `vars` in the # context with the `!!` operator: vars <- c("year", "month", "day") select(flights, !! vars) ``` This operator is very useful when you need to use dplyr within custom functions. You can learn more about it in `vignette("programming")`. However it is important to understand the semantics of the verbs you are unquoting into, that is, the values they understand. As we have just seen, `select()` supports names and positions of columns. But that won't be the case in other verbs like `mutate()` because they have different semantics. ### Mutating operations Mutate semantics are quite different from selection semantics. Whereas `select()` expects column names or positions, `mutate()` expects *column vectors*. Let's create a smaller tibble for clarity: ```{r} df <- select(flights, year:dep_time) ``` When we use `select()`, the bare column names stand for ther own positions in the tibble. For `mutate()` on the other hand, column symbols represent the actual column vectors stored in the tibble. Consider what happens if we give a string or a number to `mutate()`: ```{r} mutate(df, "year", 2) ``` `mutate()` gets length-1 vectors that it interprets as new columns in the data frame. These vectors are recycled so they match the number of rows. That's why it doesn't make sense to supply expressions like `"year" + 10` to `mutate()`. This amounts to adding 10 to a string! The correct expression is: ```{r} mutate(df, year + 10) ``` In the same way, you can unquote values from the context if these values represent a valid column. They must be either length 1 (they then get recycled) or have the same length as the number of rows. In the following example we create a new vector that we add to the data frame: ```{r} var <- seq(1, nrow(df)) mutate(df, new = var) ``` A case in point is `group_by()`. While you might think it has select semantics, it actually has mutate semantics. This is quite handy as it allows to group by a modified column: ```{r} group_by(df, month) group_by(df, month = as.factor(month)) group_by(df, day_binned = cut(day, 3)) ``` This is why you can't supply a column name to `group_by()`. This amounts to creating a new column containing the string recycled to the number of rows: ```{r} group_by(df, "month") ``` Since grouping with select semantics can be sometimes useful as well, we have added the `group_by_at()` variant. In dplyr, variants suffixed with `_at()` support selection semantics in their second argument. You just need to wrap the selection with `vars()`: ```{r} group_by_at(df, vars(year:day)) ``` You can read more about the `_at()` and `_if()` variants in the `?scoped` help page. ## Piping The dplyr API is functional in the sense that function calls don't have side-effects. You must always save their results. This doesn't lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step: ```{r, eval = FALSE} a1 <- group_by(flights, year, month, day) a2 <- select(a1, arr_delay, dep_delay) a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE)) a4 <- filter(a3, arr > 30 | dep > 30) ``` Or if you don't want to name the intermediate results, you need to wrap the function calls inside each other: ```{r} filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 ) ``` This is difficult to read because the order of the operations is from inside to out. Thus, the arguments are a long way away from the function. To get around this problem, dplyr provides the `%>%` operator from magrittr. `x %>% f(y)` turns into `f(x, y)` so you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom: ```{r, eval = FALSE} flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) ``` ## Other data sources As well as data frames, dplyr works with data that is stored in other ways, like data tables, databases and multidimensional arrays. ### Data table dplyr also provides [data table](http://datatable.r-forge.r-project.org/) methods for all verbs through [dtplyr](http://github.com/hadley/dtplyr). If you're using data.tables already this lets you to use dplyr syntax for data manipulation, and data.table for everything else. For multiple operations, data.table can be faster because you usually use it with multiple verbs simultaneously. For example, with data table you can do a mutate and a select in a single step. It's smart enough to know that there's no point in computing the new variable for rows you're about to throw away. The advantages of using dplyr with data tables are: * For common data manipulation tasks, it insulates you from the reference semantics of data.tables, and protects you from accidentally modifying your data. * Instead of one complex method built on the subscripting operator (`[`), it provides many simple methods. ### Databases dplyr also allows you to use the same verbs with a remote database. It takes care of generating the SQL for you so that you can avoid the cognitive challenge of constantly switching between languages. To use these capabilities, you'll need to install the dbplyr package and then read `vignette("dbplyr")` for the details. ### Multidimensional arrays / cubes `tbl_cube()` provides an experimental interface to multidimensional arrays or data cubes. If you're using this form of data in R, please get in touch so I can better understand your needs. ## Comparisons Compared to all existing options, dplyr: * abstracts away how your data is stored, so that you can work with data frames, data tables and remote databases using the same set of functions. This lets you focus on what you want to achieve, not on the logistics of data storage. * provides a thoughtful default `print()` method that doesn't automatically print pages of data to the screen (this was inspired by data table's output). Compared to base functions: * dplyr is much more consistent; functions have the same interface. So once you've mastered one, you can easily pick up the others * base functions tend to be based around vectors; dplyr is based around data frames Compared to plyr, dplyr: * is much much faster * provides a better thought out set of joins * only provides tools for working with data frames (e.g. most of dplyr is equivalent to `ddply()` + various functions, `do()` is equivalent to `dlply()`) Compared to virtual data frame approaches: * it doesn't pretend that you have a data frame: if you want to run lm etc, you'll still need to manually pull down the data * it doesn't provide methods for R summary functions (e.g. `mean()`, or `sum()`) dplyr/inst/doc/programming.R0000644000176200001440000001744213157241173015607 0ustar liggesusers## ----setup, echo = FALSE, message = FALSE-------------------------------- knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) set.seed(1014) ## ------------------------------------------------------------------------ df <- tibble(x = 1:3, y = 3:1) filter(df, x == 1) ## ---- error = TRUE------------------------------------------------------- my_var <- x filter(df, my_var == 1) ## ---- error = TRUE------------------------------------------------------- my_var <- "x" filter(df, my_var == 1) ## ---- eval = FALSE------------------------------------------------------- # df[df$x == df$y, ] # df[df$x == y, ] # df[x == df$y, ] # df[x == y, ] ## ------------------------------------------------------------------------ greet <- function(name) { "How do you do, name?" } greet("Hadley") ## ------------------------------------------------------------------------ greet <- function(name) { paste0("How do you do, ", name, "?") } greet("Hadley") ## ------------------------------------------------------------------------ greet <- function(name) { glue::glue("How do you do, {name}?") } greet("Hadley") ## ---- eval = FALSE------------------------------------------------------- # mutate(df1, y = a + x) # mutate(df2, y = a + x) # mutate(df3, y = a + x) # mutate(df4, y = a + x) ## ------------------------------------------------------------------------ mutate_y <- function(df) { mutate(df, y = a + x) } ## ------------------------------------------------------------------------ df1 <- tibble(x = 1:3) a <- 10 mutate_y(df1) ## ---- error = TRUE------------------------------------------------------- mutate_y <- function(df) { mutate(df, y = .data$a + .data$x) } mutate_y(df1) ## ------------------------------------------------------------------------ df <- tibble( g1 = c(1, 1, 2, 2, 2), g2 = c(1, 2, 1, 2, 1), a = sample(5), b = sample(5) ) df %>% group_by(g1) %>% summarise(a = mean(a)) df %>% group_by(g2) %>% summarise(a = mean(a)) ## ---- error = TRUE------------------------------------------------------- my_summarise <- function(df, group_var) { df %>% group_by(group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ## ---- error = TRUE------------------------------------------------------- my_summarise(df, "g2") ## ------------------------------------------------------------------------ quo(g1) quo(a + b + c) quo("a") ## ---- error = TRUE------------------------------------------------------- my_summarise(df, quo(g1)) ## ------------------------------------------------------------------------ my_summarise <- function(df, group_var) { df %>% group_by(!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, quo(g1)) ## ---- eval = FALSE------------------------------------------------------- # my_summarise(df, g1) ## ---- error = TRUE------------------------------------------------------- my_summarise <- function(df, group_var) { quo_group_var <- quo(group_var) print(quo_group_var) df %>% group_by(!!quo_group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ## ------------------------------------------------------------------------ my_summarise <- function(df, group_var) { group_var <- enquo(group_var) print(group_var) df %>% group_by(!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ## ------------------------------------------------------------------------ summarise(df, mean = mean(a), sum = sum(a), n = n()) summarise(df, mean = mean(a * b), sum = sum(a * b), n = n()) ## ------------------------------------------------------------------------ my_var <- quo(a) summarise(df, mean = mean(!!my_var), sum = sum(!!my_var), n = n()) ## ------------------------------------------------------------------------ quo(summarise(df, mean = mean(!!my_var), sum = sum(!!my_var), n = n() )) ## ------------------------------------------------------------------------ my_summarise2 <- function(df, expr) { expr <- enquo(expr) summarise(df, mean = mean(!!expr), sum = sum(!!expr), n = n() ) } my_summarise2(df, a) my_summarise2(df, a * b) ## ------------------------------------------------------------------------ mutate(df, mean_a = mean(a), sum_a = sum(a)) mutate(df, mean_b = mean(b), sum_b = sum(b)) ## ------------------------------------------------------------------------ my_mutate <- function(df, expr) { expr <- enquo(expr) mean_name <- paste0("mean_", quo_name(expr)) sum_name <- paste0("sum_", quo_name(expr)) mutate(df, !!mean_name := mean(!!expr), !!sum_name := sum(!!expr) ) } my_mutate(df, a) ## ------------------------------------------------------------------------ my_summarise <- function(df, ...) { group_var <- quos(...) df %>% group_by(!!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1, g2) ## ------------------------------------------------------------------------ args <- list(na.rm = TRUE, trim = 0.25) quo(mean(x, !!! args)) args <- list(quo(x), na.rm = TRUE, trim = 0.25) quo(mean(!!! args)) ## ------------------------------------------------------------------------ disp ~ cyl + drat ## ------------------------------------------------------------------------ # Computing the value of the expression: toupper(letters[1:5]) # Capturing the expression: quote(toupper(letters[1:5])) ## ------------------------------------------------------------------------ f <- function(x) { quo(x) } x1 <- f(10) x2 <- f(100) ## ------------------------------------------------------------------------ x1 x2 ## ---- message = FALSE---------------------------------------------------- library(rlang) get_env(x1) get_env(x2) ## ------------------------------------------------------------------------ eval_tidy(x1) eval_tidy(x2) ## ------------------------------------------------------------------------ user_var <- 1000 mtcars %>% summarise(cyl = mean(cyl) * user_var) ## ------------------------------------------------------------------------ typeof(mean) ## ------------------------------------------------------------------------ var <- ~toupper(letters[1:5]) var # You can extract its expression: get_expr(var) # Or inspect its enclosure: get_env(var) ## ------------------------------------------------------------------------ # Here we capture `letters[1:5]` as an expression: quo(toupper(letters[1:5])) # Here we capture the value of `letters[1:5]` quo(toupper(!!letters[1:5])) quo(toupper(UQ(letters[1:5]))) ## ------------------------------------------------------------------------ var1 <- quo(letters[1:5]) quo(toupper(!!var1)) ## ------------------------------------------------------------------------ my_mutate <- function(x) { mtcars %>% select(cyl) %>% slice(1:4) %>% mutate(cyl2 = cyl + (!! x)) } f <- function(x) quo(x) expr1 <- f(100) expr2 <- f(10) my_mutate(expr1) my_mutate(expr2) ## ---- error = TRUE------------------------------------------------------- my_fun <- quo(fun) quo(!!my_fun(x, y, z)) quo(UQ(my_fun)(x, y, z)) my_var <- quo(x) quo(filter(df, !!my_var == 1)) quo(filter(df, UQ(my_var) == 1)) ## ------------------------------------------------------------------------ quo(UQE(my_fun)(x, y, z)) quo(filter(df, UQE(my_var) == 1)) ## ------------------------------------------------------------------------ quo(list(!!! letters[1:5])) ## ------------------------------------------------------------------------ x <- list(foo = 1L, bar = quo(baz)) quo(list(!!! x)) ## ------------------------------------------------------------------------ args <- list(mean = quo(mean(cyl)), count = quo(n())) mtcars %>% group_by(am) %>% summarise(!!! args) ## ------------------------------------------------------------------------ mean_nm <- "mean" count_nm <- "count" mtcars %>% group_by(am) %>% summarise( !!mean_nm := mean(cyl), !!count_nm := n() ) dplyr/inst/doc/compatibility.html0000644000176200001440000006271613157241156016706 0ustar liggesusers dplyr compatibility

dplyr compatibility

This vignette is aimed at package authors who need to update their code because of a backward incompatible change to dplyr. We do try and minimise backward incompatible changes as much as possible, but sometimes they are necessary in order to radically simplify existing code, or unlock a lot of potential value in the future.

This vignette starts with some general advice on writing package code that works with multiple version of dplyr, then continues to discuss specific changes in dplyr versions.

Working with multiple dplyr versions

Ideally, you want to make sure that your package works with both the released version and the development version of dplyr. This is typically a little bit more work, but has two big advantages:

  1. It’s more convenient for your users, since they’re not forced to update dplyr if they don’t want to)

  2. It’s easier on CRAN since it doesn’t require a massive coordinated release of multiple packages.

To make code work with multiple versions of a package, your first tool is the simple if statement:

if (utils::packageVersion("dplyr") > "0.5.0") {
  # code for new version
} else {
  # code for old version
}

Always condition on > current-version, not >= next-version because this will ensure that this branch is also used for the development version of the package. For example, if the current release is version “0.5.0”, the development version will be “0.5.0.9000”.

Occasionally, you’ll run into a situation where the NAMESPACE has changed and you need to conditionally import different functions. This typically occurs when functions are moved from one package to another. We try out best to provide automatic fallbacks, but this is not always possible. Often you can work around the problem by avoiding importFrom and using :: instead. Do this where possible:

if (utils::packageVersion("dplyr") > "0.5.0") {
  dbplyr::build_sql(...)
} else {
  dplyr::build_sql(...)
}

This will generate an R CMD check NOTE (because the one of the functions will always be missing), but this is ok. Simply explain that you get the note because you have written a wrapper to make sure your code is backward compatible.

Sometimes it’s not possible to avoid importFrom(). For example you might be importing a generic so that you can define a method for it. In this case, you can take advantage of a little-known feature in the NAMESPACE file: you can include if statements.

#' @rawNamespace
#' if (utils::packageVersion("dplyr") > "0.5.0") {
#'   importFrom("dbplyr", "build_sql")
#' } else {
#'   importFrom("dplyr", "build_sql")
#' }

dplyr 0.6.0

Database code moves to dbplyr

Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases. If you’ve implemented a database backend for dplyr, please read the backend news on the backend.

Depending on what generics you use, and what generics you provide methods for you, you may need to write some conditional code. To help make this easier we’ve written wrap_dbplyr_obj() which will write the helper code for you:

wrap_dbplyr_obj("build_sql")

wrap_dbplyr_obj("base_agg")

Simply copy the results of this function in your package.

These will generate R CMD check NOTES, so make sure to tell CRAN that this is to ensure backward compatibility.

Deprecation of underscored verbs_()

Because the tidyeval framework allows us to combine SE and NSE semantics within the same functions, the underscored verbs have been softly deprecated.

For users of SE_ verbs

The legacy underscored versions take objects for which a lazyeval::as.lazy() method is defined. This includes symbols and calls, strings, and formulas. All of these objects have been replaced with quosures and you can call tidyeval verbs with unquoted quosures:

quo <- quo(cyl)
select(mtcars, !! quo)

Symbolic expressions are also supported, but note that bare symbols and calls do not carry scope information. If you’re referring to objects in the data frame, it’s safe to omit specifying an enclosure:

sym <- quote(cyl)
select(mtcars, !! sym)

call <- quote(mean(cyl))
summarise(mtcars, !! call)

Transforming objects into quosures is generally straightforward. To enclose with the current environment, you can unquote directly in quo() or you can use as_quosure():

quo(!! sym)
#> <quosure: global>
#> ~cyl
quo(!! call)
#> <quosure: global>
#> ~mean(cyl)

rlang::as_quosure(sym)
#> <quosure: global>
#> ~cyl
rlang::as_quosure(call)
#> <quosure: global>
#> ~mean(cyl)

Note that while formulas and quosures are very similar objects (and in the most general sense, formulas are quosures), they can’t be used interchangeably in tidyeval functions. Early implementations did treat bare formulas as quosures, but this created compatibility issues with modelling functions of the stats package. Fortunately, it’s easy to transform formulas to quosures that will self-evaluate in tidyeval functions:

f <- ~cyl
f
#> ~cyl
rlang::as_quosure(f)
#> <quosure: global>
#> ~cyl

Finally, and perhaps most importantly, strings are not and should not be parsed. As developers, it is tempting to try and solve problems using strings because we have been trained to work with strings rather than quoted expressions. However it’s almost always the wrong way to approach the problem. The exception is for creating symbols. In that case it is perfectly legitimate to use strings:

rlang::sym("cyl")
#> cyl
rlang::syms(letters[1:3])
#> [[1]]
#> a
#> 
#> [[2]]
#> b
#> 
#> [[3]]
#> c

But you should never use strings to create calls. Instead you can use quasiquotation:

syms <- rlang::syms(c("foo", "bar", "baz"))
quo(my_call(!!! syms))
#> <quosure: global>
#> ~my_call(foo, bar, baz)

fun <- rlang::sym("my_call")
quo(UQ(fun)(!!! syms))
#> <quosure: global>
#> ~my_call(foo, bar, baz)

Or create the call with lang():

call <- rlang::lang("my_call", !!! syms)
call
#> my_call(foo, bar, baz)

rlang::as_quosure(call)
#> <quosure: global>
#> ~my_call(foo, bar, baz)

# Or equivalently:
quo(!! rlang::lang("my_call", !!! syms))
#> <quosure: global>
#> ~my_call(foo, bar, baz)

Note that idioms based on interp() should now generally be avoided and replaced with quasiquotation. Where you used to interpolate:

lazyeval::interp(~ mean(var), var = rlang::sym("mpg"))

You would now unquote:

var <- "mpg"
quo(mean(!! rlang::sym(var)))

See also vignette("programming") for more about quasiquotation and quosures.

For package authors

For package authors, rlang provides a compatibility file that you can copy to your package. compat_lazy() and compat_lazy_dots() turn lazy-able objects into proper quosures. This helps providing an underscored version to your users for backward compatibility. For instance, here is how we defined the underscored version of filter() in dplyr 0.6:

filter_.tbl_df <- function(.data, ..., .dots = list()) {
  dots <- compat_lazy_dots(.dots, caller_env(), ...)
  filter(.data, !!! dots)
}

With tidyeval, S3 dispatch to the correct method might be an issue. In the past, the genericity of dplyr verbs was accomplished by dispatching in the underscored versions. Now that those are deprecated, we’ve turned the non-underscored verbs into S3 generics.

We maintain backward compatibility by redispatching to old underscored verbs in the default methods of the new S3 generics. For example, here is how we redispatch filter():

filter.default <- function(.data, ...) {
  filter_(.data, .dots = compat_as_lazy_dots(...))
}

This gets the job done in most cases. However, the default method will not be called for objects inheriting from one of the classes for which we provide non-underscored methods: data.frame, tbl_df, tbl_cube and grouped_df. An example of this is the sf package whose objects have classes c("sf", "data.frame"). Authors of such packages should provide a method for the non-underscored generic in order to be compatible with dplyr:

filter.sf <- function(.data, ...) {
  st_as_sf(NextMethod())
}

If you need help with this, please let us know!

Deprecation of mutate_each() and summarise_each()

These functions have been replaced by a more complete family of functions. This family has suffixes _if, _at and _all and includes more verbs than just mutate summarise.

If you need to update your code to the new family, there are two relevant functions depending on which variables you apply funs() to. If you called mutate_each() without supplying a selection of variables, funs is applied to all variables. In this case, you should update your code to use mutate_all() instead:

mutate_each(starwars, funs(as.character))
mutate_all(starwars, funs(as.character))

Note that the new verbs support bare functions as well, so you don’t necessarily need to wrap with funs():

mutate_all(starwars, as.character)

On the other hand, if you supplied a variable selection, you should use mutate_at(). The variable selection should be wrapped with vars().

mutate_each(starwars, funs(as.character), height, mass)
mutate_at(starwars, vars(height, mass), as.character)

vars() supports all the selection helpers that you usually use with select():

summarise_at(mtcars, vars(starts_with("d")), mean)

Note that intead of a vars() selection, you can also supply character vectors of column names:

mutate_at(starwars, c("height", "mass"), as.character)
dplyr/inst/doc/two-table.html0000644000176200001440000011452213157241177015727 0ustar liggesusers Two-table verbs

Two-table verbs

It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time:

(This discussion assumes that you have tidy data, where the rows are observations and the columns are variables. If you’re not familiar with that framework, I’d recommend reading up on it first.)

All two-table verbs work similarly. The first two arguments are x and y, and provide the tables to combine. The output is always a new table with the same type as x.

Mutating joins

Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13 data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data:

library("nycflights13")
# Drop unimportant variables so it's easier to understand the join results.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>% 
  left_join(airlines)
#> Joining, by = "carrier"
#> # A tibble: 336,776 x 9
#>    year month   day  hour origin dest  tailnum carrier name               
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>              
#> 1  2013     1     1  5.00 EWR    IAH   N14228  UA      "United Air Lines …
#> 2  2013     1     1  5.00 LGA    IAH   N24211  UA      "United Air Lines …
#> 3  2013     1     1  5.00 JFK    MIA   N619AA  AA      "American Airlines…
#> 4  2013     1     1  5.00 JFK    BQN   N804JB  B6      "JetBlue Airways"  
#> 5  2013     1     1  6.00 LGA    ATL   N668DN  DL      "Delta Air Lines I…
#> # ... with 3.368e+05 more rows

Controlling how the tables are matched

As well as x and y, each mutating join takes an argument by that controls which variables are used to match observations in the two tables. There are a few ways to specify it, as I illustrate below with various tables from nycflights13:

  • NULL, the default. dplyr will will use all variables that appear in both tables, a natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.

    flights2 %>% left_join(weather)
    #> Joining, by = c("year", "month", "day", "hour", "origin")
    #> # A tibble: 336,776 x 18
    #>    year month   day  hour origin dest  tail… carr…  temp  dewp humid wind…
    #>   <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
    #> 1  2013  1.00     1  5.00 EWR    IAH   N142… UA     NA    NA    NA      NA
    #> 2  2013  1.00     1  5.00 LGA    IAH   N242… UA     NA    NA    NA      NA
    #> 3  2013  1.00     1  5.00 JFK    MIA   N619… AA     NA    NA    NA      NA
    #> 4  2013  1.00     1  5.00 JFK    BQN   N804… B6     NA    NA    NA      NA
    #> 5  2013  1.00     1  6.00 LGA    ATL   N668… DL     39.9  26.1  57.3   260
    #> # ... with 3.368e+05 more rows, and 6 more variables: wind_speed <dbl>,
    #> #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
    #> #   time_hour <dttm>
  • A character vector, by = "x". Like a natural join, but uses only some of the common variables. For example, flights and planes have year columns, but they mean different things so we only want to join by tailnum.

    flights2 %>% left_join(planes, by = "tailnum")
    #> # A tibble: 336,776 x 16
    #>   year.x month   day  hour orig… dest  tail… carr… year… type  manu… model
    #>    <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
    #> 1   2013     1     1  5.00 EWR   IAH   N142… UA     1999 "Fix… BOEI… 737-…
    #> 2   2013     1     1  5.00 LGA   IAH   N242… UA     1998 "Fix… BOEI… 737-…
    #> 3   2013     1     1  5.00 JFK   MIA   N619… AA     1990 "Fix… BOEI… 757-…
    #> 4   2013     1     1  5.00 JFK   BQN   N804… B6     2012 "Fix… AIRB… A320…
    #> 5   2013     1     1  6.00 LGA   ATL   N668… DL     1991 "Fix… BOEI… 757-…
    #> # ... with 3.368e+05 more rows, and 4 more variables: engines <int>,
    #> #   seats <int>, speed <int>, engine <chr>

    Note that the year columns in the output are disambiguated with a suffix.

  • A named character vector: by = c("x" = "a"). This will match variable x in table x to variable a in table b. The variables from use will be used in the output.

    Each flight has an origin and destination airport, so we need to specify which one we want to join to:

    flights2 %>% left_join(airports, c("dest" = "faa"))
    #> # A tibble: 336,776 x 15
    #>    year month   day  hour origin dest  tail… carr… name    lat   lon   alt
    #>   <int> <int> <int> <dbl> <chr>  <chr> <chr> <chr> <chr> <dbl> <dbl> <int>
    #> 1  2013     1     1  5.00 EWR    IAH   N142… UA    "Geo…  30.0 -95.3    97
    #> 2  2013     1     1  5.00 LGA    IAH   N242… UA    "Geo…  30.0 -95.3    97
    #> 3  2013     1     1  5.00 JFK    MIA   N619… AA    "Mia…  25.8 -80.3     8
    #> 4  2013     1     1  5.00 JFK    BQN   N804… B6    <NA>   NA    NA      NA
    #> 5  2013     1     1  6.00 LGA    ATL   N668… DL    "Har…  33.6 -84.4  1026
    #> # ... with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
    #> #   tzone <chr>
    flights2 %>% left_join(airports, c("origin" = "faa"))
    #> # A tibble: 336,776 x 15
    #>    year month   day  hour origin dest  tail… carr… name    lat   lon   alt
    #>   <int> <int> <int> <dbl> <chr>  <chr> <chr> <chr> <chr> <dbl> <dbl> <int>
    #> 1  2013     1     1  5.00 EWR    IAH   N142… UA    "New…  40.7 -74.2    18
    #> 2  2013     1     1  5.00 LGA    IAH   N242… UA    "La …  40.8 -73.9    22
    #> 3  2013     1     1  5.00 JFK    MIA   N619… AA    "Joh…  40.6 -73.8    13
    #> 4  2013     1     1  5.00 JFK    BQN   N804… B6    "Joh…  40.6 -73.8    13
    #> 5  2013     1     1  6.00 LGA    ATL   N668… DL    "La …  40.8 -73.9    22
    #> # ... with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
    #> #   tzone <chr>

Types of join

There are four types of mutating join, which differ in their behaviour when a match is not found. We’ll illustrate each with a simple example:

(df1 <- data_frame(x = c(1, 2), y = 2:1))
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <int>
#> 1  1.00     2
#> 2  2.00     1
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
#> # A tibble: 2 x 3
#>       x     a b    
#>   <dbl> <dbl> <chr>
#> 1  1.00  10.0 a    
#> 2  3.00  10.0 a
  • inner_join(x, y) only includes observations that match in both x and y.

    df1 %>% inner_join(df2) %>% knitr::kable()
    #> Joining, by = "x"
    x y a b
    1 2 10 a
  • left_join(x, y) includes all observations in x, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.

    df1 %>% left_join(df2)
    #> Joining, by = "x"
    #> # A tibble: 2 x 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1  1.00     2  10.0 a    
    #> 2  2.00     1  NA   <NA>
  • right_join(x, y) includes all observations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.

    df1 %>% right_join(df2)
    #> Joining, by = "x"
    #> # A tibble: 2 x 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1  1.00     2  10.0 a    
    #> 2  3.00    NA  10.0 a
    df2 %>% left_join(df1)
    #> Joining, by = "x"
    #> # A tibble: 2 x 4
    #>       x     a b         y
    #>   <dbl> <dbl> <chr> <int>
    #> 1  1.00  10.0 a         2
    #> 2  3.00  10.0 a        NA
  • full_join() includes all observations from x and y.

    df1 %>% full_join(df2)
    #> Joining, by = "x"
    #> # A tibble: 3 x 4
    #>       x     y     a b    
    #>   <dbl> <int> <dbl> <chr>
    #> 1  1.00     2  10.0 a    
    #> 2  2.00     1  NA   <NA> 
    #> 3  3.00    NA  10.0 a

The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, the new variables are filled in with missing values.

Observations

While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:

df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

df1 %>% left_join(df2)
#> Joining, by = "x"
#> # A tibble: 5 x 3
#>       x     y z    
#>   <dbl> <int> <chr>
#> 1  1.00     1 a    
#> 2  1.00     1 b    
#> 3  1.00     2 a    
#> 4  1.00     2 b    
#> 5  2.00     3 a

Filtering joins

Filtering joins match obserations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:

library("nycflights13")
flights %>% 
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE)
#> # A tibble: 722 x 2
#>   tailnum     n
#>   <chr>   <int>
#> 1 <NA>     2512
#> 2 N725MQ    575
#> 3 N722MQ    513
#> 4 N723MQ    507
#> 5 N713MQ    483
#> # ... with 717 more rows

If you’re worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only ever remove observations.

df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 %>% nrow()
#> [1] 4
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
#> [1] 4
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
#> [1] 2

Set operations

The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:

Given this simple data:

(df1 <- data_frame(x = 1:2, y = c(1L, 1L)))
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     1
(df2 <- data_frame(x = 1:2, y = 1:2))
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     2

The four possibilities are:

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <int> <int>
#> 1     2     2
#> 2     2     1
#> 3     1     1
setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <int> <int>
#> 1     2     1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <int> <int>
#> 1     2     2

Coercion rules

When joining tables, dplyr is a little more conservative than base R about the types of variable that it considers equivalent. This is mostly likely to surprise if you’re working factors:

Otherwise logicals will be silently upcast to integer, and integer to numeric, but coercing to character will raise an error:

df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = 1.5)
full_join(df1, df2) %>% str()
#> Joining, by = c("x", "y")
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: num  1 2
#>  $ y: num  1 1.5

df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = "a")
full_join(df1, df2) %>% str()
#> Joining, by = c("x", "y")
#> Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'y' x 'y' because of incompatible types (character / integer)

Multiple-table verbs

dplyr does not provide any functions for working with three or more tables. Instead use purrr::reduce() or Reduce(), as described in Advanced R, to iteratively combine the two-table verbs to handle as many tables as you need.

dplyr/inst/doc/dplyr.html0000644000176200001440000047553613157241172015175 0ustar liggesusers Introduction to dplyr

Introduction to dplyr

When working with data you must:

The dplyr package makes these steps fast and easy:

This document introduces you to dplyr’s basic set of tools, and shows you how to apply them to data frames. dplyr also supports databases via the dbplyr package, once you’ve installed, read vignette("dbplyr") to learn more.

Data: nycflights13

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This dataset contains all 336776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13

library(nycflights13)
dim(flights)
#> [1] 336776     19
flights
#> # A tibble: 336,776 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 336,772 more rows, and 7 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Note that nycflights13::flights is a tibble, a modern reimagining of the data frame. It’s particular useful for large datasets because it only prints the first few rows. You can learn more about tibbles at http://tibble.tidyverse.org; in particular you can convert data frames to tibbles with as_tibble().

Single table verbs

Dplyr aims to provide a function for each basic verb of data manipulation:

Filter rows with filter()

filter() allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE.

For example, we can select all flights on January 1st with:

filter(flights, month == 1, day == 1)
#> # A tibble: 842 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 838 more rows, and 7 more variables: origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

This is rougly equivalent to this base R code:

flights[flights$month == 1 & flights$day == 1, ]

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)
#> # A tibble: 336,776 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 336,772 more rows, and 7 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Use desc() to order a column in descending order:

arrange(flights, desc(arr_delay))
#> # A tibble: 336,776 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     9    641   900  1301  1242  1530  1272 HA       51 N384…
#> 2  2013     6    15   1432  1935  1137  1607  2120  1127 MQ     3535 N504…
#> 3  2013     1    10   1121  1635  1126  1239  1810  1109 MQ     3695 N517…
#> 4  2013     9    20   1139  1845  1014  1457  2210  1007 AA      177 N338…
#> # ... with 336,772 more rows, and 7 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Select columns with select()

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> # ... with 336,772 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> # ... with 336,772 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#>   dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail… orig… dest  air_…
#>    <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
#> 1    517   515  2.00   830   819  11.0 UA     1545 N142… EWR   IAH     227
#> 2    533   529  4.00   850   830  20.0 UA     1714 N242… LGA   IAH     227
#> 3    542   540  2.00   923   850  33.0 AA     1141 N619… JFK   MIA     160
#> 4    544   545 -1.00  1004  1022 -18.0 B6      725 N804… JFK   BQN     183
#> # ... with 336,772 more rows, and 4 more variables: distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.

You can rename variables with select() by using named arguments:

select(flights, tail_num = tailnum)
#> # A tibble: 336,776 x 1
#>   tail_num
#>   <chr>   
#> 1 N14228  
#> 2 N24211  
#> 3 N619AA  
#> 4 N804JB  
#> # ... with 336,772 more rows

But because select() drops all the variables not explicitly mentioned, it’s not that useful. Instead, use rename():

rename(flights, tail_num = tailnum)
#> # A tibble: 336,776 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 336,772 more rows, and 7 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Add new columns with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
#> # A tibble: 336,776 x 21
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 336,772 more rows, and 9 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, gain <dbl>, speed <dbl>

dplyr::mutate() is similar to the base transform(), but allows you to refer to columns that you’ve just created:

mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
#> # A tibble: 336,776 x 21
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
#> 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
#> 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
#> 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
#> # ... with 336,772 more rows, and 9 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
#> # A tibble: 336,776 x 2
#>     gain gain_per_hour
#>    <dbl>         <dbl>
#> 1   9.00          2.38
#> 2  16.0           4.23
#> 3  31.0          11.6 
#> 4 -17.0         - 5.57
#> # ... with 336,772 more rows

Summarise values with summarise()

The last verb is summarise(). It collapses a data frame to a single row.

summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6

It’s not that useful until we learn the group_by() verb below.

Randomly sample rows with sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

sample_n(flights, 10)
#> # A tibble: 10 x 19
#>    year month   day dep_t… sched_… dep_de… arr_… sched… arr_d… carr… flig…
#>   <int> <int> <int>  <int>   <int>   <dbl> <int>  <int>  <dbl> <chr> <int>
#> 1  2013    10     1    822     825  - 3.00   932    935 - 3.00 AA       84
#> 2  2013     8     2    712     715  - 3.00  1015   1010   5.00 VX      399
#> 3  2013     5    10   1309    1315  - 6.00  1502   1501   1.00 US     1895
#> 4  2013    10    28   2002    1930   32.0   2318   2250  28.0  DL      795
#> # ... with 6 more rows, and 8 more variables: tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
sample_frac(flights, 0.01)
#> # A tibble: 3,368 x 19
#>    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
#>   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1  2013     8    16    827   830 -3.00   928   950 -22.0 AA     1838 N3CA…
#> 2  2013    11     4   1306  1300  6.00  1639  1610  29.0 VX      411 N641…
#> 3  2013     1    14    929   935 -6.00  1213  1238 -25.0 B6      361 N639…
#> 4  2013    12    28    625   630 -5.00   916  1014 -58.0 US      690 N656…
#> # ... with 3,364 more rows, and 7 more variables: origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Use replace = TRUE to perform a bootstrap sample. If needed, you can weight the sample with the weight argument.

Commonalities

You may have noticed that the syntax and function of all these verbs are very similar:

  • The first argument is a data frame.

  • The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.

  • The result is a new data frame

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()), or collapse many values to a summary (summarise()). The remainder of the language comes from applying the five functions to different types of data. For example, I’ll discuss how these functions work with grouped data.

Patterns of operations

The dplyr verbs can be classified by the type of operations they accomplish (we sometimes speak of their semantics, i.e., their meaning). The most important and useful distinction is between grouped and ungrouped operations. In addition, it is helpful to have a good grasp of the difference between select and mutate operations.

Grouped operations

The dplyr verbs are useful on their own, but they become even more powerful when you apply them to groups of observations within a dataset. In dplyr, you do this with the group_by() function. It breaks down a dataset into specified groups of rows. When you then apply the verbs above on the resulting object they’ll be automatically applied “by group”.

Grouping affects the verbs as follows:

  • grouped select() is the same as ungrouped select(), except that grouping variables are always retained.

  • grouped arrange() is the same as ungrouped; unless you set .by_group = TRUE, in which case it orders first by the grouping variables

  • mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette("window-functions").

  • sample_n() and sample_frac() sample the specified number/fraction of rows in each group.

  • summarise() computes the summary for each group.

In the following example, we split the complete dataset into individual planes and then summarise each plane by counting the number of flights (count = n()) and computing the average distance (dist = mean(distance, na.rm = TRUE)) and arrival delay (delay = mean(arr_delay, na.rm = TRUE)). We then use ggplot2 to display the output.

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

You use summarise() with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others:

  • n(): the number of observations in the current group

  • n_distinct(x):the number of unique values in x.

  • first(x), last(x) and nth(x, n) - these work similarly to x[1], x[length(x)], and x[n] but give you more control over the result if the value is missing.

For example, we could use these to find the number of planes and the number of flights that go to each possible destination:

destinations <- group_by(flights, dest)
summarise(destinations,
  planes = n_distinct(tailnum),
  flights = n()
)
#> # A tibble: 105 x 3
#>   dest  planes flights
#>   <chr>  <int>   <int>
#> 1 ABQ      108     254
#> 2 ACK       58     265
#> 3 ALB      172     439
#> 4 ANC        6       8
#> # ... with 101 more rows

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset:

daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day flights
#>   <int> <int> <int>   <int>
#> 1  2013     1     1     842
#> 2  2013     1     2     943
#> 3  2013     1     3     914
#> 4  2013     1     4     915
#> # ... with 361 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> # A tibble: 12 x 3
#> # Groups:   year [?]
#>    year month flights
#>   <int> <int>   <int>
#> 1  2013     1   27004
#> 2  2013     2   24951
#> 3  2013     3   28834
#> 4  2013     4   28330
#> # ... with 8 more rows
(per_year  <- summarise(per_month, flights = sum(flights)))
#> # A tibble: 1 x 2
#>    year flights
#>   <int>   <int>
#> 1  2013  336776

However you need to be careful when progressively rolling up summaries like this: it’s ok for sums and counts, but you need to think about weighting for means and variances (it’s not possible to do this exactly for medians).

Selecting operations

One of the appealing features of dplyr is that you can refer to columns from the tibble as if they were regular variables. However, the syntactic uniformity of referring to bare column names hide semantical differences across the verbs. A column symbol supplied to select() does not have the same meaning as the same symbol supplied to mutate().

Selecting operations expect column names and positions. Hence, when you call select() with bare variable names, they actually represent their own positions in the tibble. The following calls are completely equivalent from dplyr’s point of view:

# `year` represents the integer 1
select(flights, year)
#> # A tibble: 336,776 x 1
#>    year
#>   <int>
#> 1  2013
#> 2  2013
#> 3  2013
#> 4  2013
#> # ... with 336,772 more rows
select(flights, 1)
#> # A tibble: 336,776 x 1
#>    year
#>   <int>
#> 1  2013
#> 2  2013
#> 3  2013
#> 4  2013
#> # ... with 336,772 more rows

By the same token, this means that you cannot refer to variables from the surrounding context if they have the same name as one of the columns. In the following example, year still represents 1, not 5:

year <- 5
select(flights, year)

One useful subtlety is that this only applies to bare names and to selecting calls like c(year, month, day) or year:day. In all other cases, the columns of the data frame are not put in scope. This allows you to refer to contextual variables in selection helpers:

year <- "dep"
select(flights, starts_with(year))
#> # A tibble: 336,776 x 2
#>   dep_time dep_delay
#>      <int>     <dbl>
#> 1      517      2.00
#> 2      533      4.00
#> 3      542      2.00
#> 4      544     -1.00
#> # ... with 336,772 more rows

These semantics are usually intuitive. But note the subtle difference:

year <- 5
select(flights, year, identity(year))
#> # A tibble: 336,776 x 2
#>    year sched_dep_time
#>   <int>          <int>
#> 1  2013            515
#> 2  2013            529
#> 3  2013            540
#> 4  2013            545
#> # ... with 336,772 more rows

In the first argument, year represents its own position 1. In the second argument, year is evaluated in the surrounding context and represents the fifth column.

For a long time, select() used to only understand column positions. Counting from dplyr 0.6, it now understands column names as well. This makes it a bit easier to program with select():

vars <- c("year", "month")
select(flights, vars, "day")
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> # ... with 336,772 more rows

Note that the code above is somewhat unsafe because you might have added a column named vars to the tibble, or you might apply the code to another data frame containing such a column. To avoid this issue, you can wrap the variable in an identity() call as we mentioned above, as this will bypass column names. However, a more explicit and general method that works in all dplyr verbs is to unquote the variable with the !! operator. This tells dplyr to bypass the data frame and to directly look in the context:

# Let's create a new `vars` column:
flights$vars <- flights$year

# The new column won't be an issue if you evaluate `vars` in the
# context with the `!!` operator:
vars <- c("year", "month", "day")
select(flights, !! vars)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> # ... with 336,772 more rows

This operator is very useful when you need to use dplyr within custom functions. You can learn more about it in vignette("programming"). However it is important to understand the semantics of the verbs you are unquoting into, that is, the values they understand. As we have just seen, select() supports names and positions of columns. But that won’t be the case in other verbs like mutate() because they have different semantics.

Mutating operations

Mutate semantics are quite different from selection semantics. Whereas select() expects column names or positions, mutate() expects column vectors. Let’s create a smaller tibble for clarity:

df <- select(flights, year:dep_time)

When we use select(), the bare column names stand for ther own positions in the tibble. For mutate() on the other hand, column symbols represent the actual column vectors stored in the tibble. Consider what happens if we give a string or a number to mutate():

mutate(df, "year", 2)
#> # A tibble: 336,776 x 6
#>    year month   day dep_time `"year"`   `2`
#>   <int> <int> <int>    <int> <chr>    <dbl>
#> 1  2013     1     1      517 year      2.00
#> 2  2013     1     1      533 year      2.00
#> 3  2013     1     1      542 year      2.00
#> 4  2013     1     1      544 year      2.00
#> # ... with 336,772 more rows

mutate() gets length-1 vectors that it interprets as new columns in the data frame. These vectors are recycled so they match the number of rows. That’s why it doesn’t make sense to supply expressions like "year" + 10 to mutate(). This amounts to adding 10 to a string! The correct expression is:

mutate(df, year + 10)
#> # A tibble: 336,776 x 5
#>    year month   day dep_time `year + 10`
#>   <int> <int> <int>    <int>       <dbl>
#> 1  2013     1     1      517        2023
#> 2  2013     1     1      533        2023
#> 3  2013     1     1      542        2023
#> 4  2013     1     1      544        2023
#> # ... with 336,772 more rows

In the same way, you can unquote values from the context if these values represent a valid column. They must be either length 1 (they then get recycled) or have the same length as the number of rows. In the following example we create a new vector that we add to the data frame:

var <- seq(1, nrow(df))
mutate(df, new = var)
#> # A tibble: 336,776 x 5
#>    year month   day dep_time   new
#>   <int> <int> <int>    <int> <int>
#> 1  2013     1     1      517     1
#> 2  2013     1     1      533     2
#> 3  2013     1     1      542     3
#> 4  2013     1     1      544     4
#> # ... with 336,772 more rows

A case in point is group_by(). While you might think it has select semantics, it actually has mutate semantics. This is quite handy as it allows to group by a modified column:

group_by(df, month)
#> # A tibble: 336,776 x 4
#> # Groups:   month [12]
#>    year month   day dep_time
#>   <int> <int> <int>    <int>
#> 1  2013     1     1      517
#> 2  2013     1     1      533
#> 3  2013     1     1      542
#> 4  2013     1     1      544
#> # ... with 336,772 more rows
group_by(df, month = as.factor(month))
#> # A tibble: 336,776 x 4
#> # Groups:   month [12]
#>    year month    day dep_time
#>   <int> <fctr> <int>    <int>
#> 1  2013 1          1      517
#> 2  2013 1          1      533
#> 3  2013 1          1      542
#> 4  2013 1          1      544
#> # ... with 336,772 more rows
group_by(df, day_binned = cut(day, 3))
#> # A tibble: 336,776 x 5
#> # Groups:   day_binned [3]
#>    year month   day dep_time day_binned
#>   <int> <int> <int>    <int> <fctr>    
#> 1  2013     1     1      517 (0.97,11] 
#> 2  2013     1     1      533 (0.97,11] 
#> 3  2013     1     1      542 (0.97,11] 
#> 4  2013     1     1      544 (0.97,11] 
#> # ... with 336,772 more rows

This is why you can’t supply a column name to group_by(). This amounts to creating a new column containing the string recycled to the number of rows:

group_by(df, "month")
#> # A tibble: 336,776 x 5
#> # Groups:   "month" [1]
#>    year month   day dep_time `"month"`
#>   <int> <int> <int>    <int> <chr>    
#> 1  2013     1     1      517 month    
#> 2  2013     1     1      533 month    
#> 3  2013     1     1      542 month    
#> 4  2013     1     1      544 month    
#> # ... with 336,772 more rows

Since grouping with select semantics can be sometimes useful as well, we have added the group_by_at() variant. In dplyr, variants suffixed with _at() support selection semantics in their second argument. You just need to wrap the selection with vars():

group_by_at(df, vars(year:day))
#> # A tibble: 336,776 x 4
#> # Groups:   year, month, day [365]
#>    year month   day dep_time
#>   <int> <int> <int>    <int>
#> 1  2013     1     1      517
#> 2  2013     1     1      533
#> 3  2013     1     1      542
#> 4  2013     1     1      544
#> # ... with 336,772 more rows

You can read more about the _at() and _if() variants in the ?scoped help page.

Piping

The dplyr API is functional in the sense that function calls don’t have side-effects. You must always save their results. This doesn’t lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step:

a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)

Or if you don’t want to name the intermediate results, you need to wrap the function calls inside each other:

filter(
  summarise(
    select(
      group_by(flights, year, month, day),
      arr_delay, dep_delay
    ),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
#> Adding missing grouping variables: `year`, `month`, `day`
#> # A tibble: 49 x 5
#> # Groups:   year, month [11]
#>    year month   day   arr   dep
#>   <int> <int> <int> <dbl> <dbl>
#> 1  2013     1    16  34.2  24.6
#> 2  2013     1    31  32.6  28.7
#> 3  2013     2    11  36.3  39.1
#> 4  2013     2    27  31.3  37.8
#> # ... with 45 more rows

This is difficult to read because the order of the operations is from inside to out. Thus, the arguments are a long way away from the function. To get around this problem, dplyr provides the %>% operator from magrittr. x %>% f(y) turns into f(x, y) so you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom:

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

Other data sources

As well as data frames, dplyr works with data that is stored in other ways, like data tables, databases and multidimensional arrays.

Data table

dplyr also provides data table methods for all verbs through dtplyr. If you’re using data.tables already this lets you to use dplyr syntax for data manipulation, and data.table for everything else.

For multiple operations, data.table can be faster because you usually use it with multiple verbs simultaneously. For example, with data table you can do a mutate and a select in a single step. It’s smart enough to know that there’s no point in computing the new variable for rows you’re about to throw away.

The advantages of using dplyr with data tables are:

  • For common data manipulation tasks, it insulates you from the reference semantics of data.tables, and protects you from accidentally modifying your data.

  • Instead of one complex method built on the subscripting operator ([), it provides many simple methods.

Databases

dplyr also allows you to use the same verbs with a remote database. It takes care of generating the SQL for you so that you can avoid the cognitive challenge of constantly switching between languages. To use these capabilities, you’ll need to install the dbplyr package and then read vignette("dbplyr") for the details.

Multidimensional arrays / cubes

tbl_cube() provides an experimental interface to multidimensional arrays or data cubes. If you’re using this form of data in R, please get in touch so I can better understand your needs.

Comparisons

Compared to all existing options, dplyr:

Compared to base functions:

Compared to plyr, dplyr:

Compared to virtual data frame approaches:

dplyr/inst/doc/compatibility.Rmd0000644000176200001440000002334313135665123016455 0ustar liggesusers--- title: "dplyr compatibility" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{dplyr compatibility} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(dplyr) knitr::opts_chunk$set(collapse = T, comment = "#>") ``` This vignette is aimed at package authors who need to update their code because of a backward incompatible change to dplyr. We do try and minimise backward incompatible changes as much as possible, but sometimes they are necessary in order to radically simplify existing code, or unlock a lot of potential value in the future. This vignette starts with some general advice on writing package code that works with multiple version of dplyr, then continues to discuss specific changes in dplyr versions. ## Working with multiple dplyr versions Ideally, you want to make sure that your package works with both the released version and the development version of dplyr. This is typically a little bit more work, but has two big advantages: 1. It's more convenient for your users, since they're not forced to update dplyr if they don't want to) 1. It's easier on CRAN since it doesn't require a massive coordinated release of multiple packages. To make code work with multiple versions of a package, your first tool is the simple if statement: ```{r, results = "hide"} if (utils::packageVersion("dplyr") > "0.5.0") { # code for new version } else { # code for old version } ``` Always condition on `> current-version`, not `>= next-version` because this will ensure that this branch is also used for the development version of the package. For example, if the current release is version "0.5.0", the development version will be "0.5.0.9000". Occasionally, you'll run into a situation where the `NAMESPACE` has changed and you need to conditionally import different functions. This typically occurs when functions are moved from one package to another. We try out best to provide automatic fallbacks, but this is not always possible. Often you can work around the problem by avoiding `importFrom` and using `::` instead. Do this where possible: ```{r, eval = FALSE} if (utils::packageVersion("dplyr") > "0.5.0") { dbplyr::build_sql(...) } else { dplyr::build_sql(...) } ``` This will generate an `R CMD check` NOTE (because the one of the functions will always be missing), but this is ok. Simply explain that you get the note because you have written a wrapper to make sure your code is backward compatible. Sometimes it's not possible to avoid `importFrom()`. For example you might be importing a generic so that you can define a method for it. In this case, you can take advantage of a little-known feature in the `NAMESPACE` file: you can include `if` statements. ```{r} #' @rawNamespace #' if (utils::packageVersion("dplyr") > "0.5.0") { #' importFrom("dbplyr", "build_sql") #' } else { #' importFrom("dplyr", "build_sql") #' } ``` ## dplyr 0.6.0 ### Database code moves to dbplyr Almost all database related code has been moved out of dplyr and into a new package, [dbplyr](http://github.com/hadley/dbplyr/). This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases. If you've implemented a database backend for dplyr, please read the [backend news](https://github.com/hadley/dbplyr/blob/master/NEWS.md#backends) on the backend. Depending on what generics you use, and what generics you provide methods for you, you may need to write some conditional code. To help make this easier we've written `wrap_dbplyr_obj()` which will write the helper code for you: ```{r, eval = FALSE} wrap_dbplyr_obj("build_sql") wrap_dbplyr_obj("base_agg") ``` Simply copy the results of this function in your package. These will generate `R CMD check` NOTES, so make sure to tell CRAN that this is to ensure backward compatibility. ### Deprecation of underscored `verbs_()` Because the tidyeval framework allows us to combine SE and NSE semantics within the same functions, the underscored verbs have been softly deprecated. #### For users of SE_ verbs The legacy underscored versions take objects for which a `lazyeval::as.lazy()` method is defined. This includes symbols and calls, strings, and formulas. All of these objects have been replaced with quosures and you can call tidyeval verbs with unquoted quosures: ```{r, eval = FALSE} quo <- quo(cyl) select(mtcars, !! quo) ``` Symbolic expressions are also supported, but note that bare symbols and calls do not carry scope information. If you're referring to objects in the data frame, it's safe to omit specifying an enclosure: ```{r, results = "hide"} sym <- quote(cyl) select(mtcars, !! sym) call <- quote(mean(cyl)) summarise(mtcars, !! call) ``` Transforming objects into quosures is generally straightforward. To enclose with the current environment, you can unquote directly in `quo()` or you can use `as_quosure()`: ```{r} quo(!! sym) quo(!! call) rlang::as_quosure(sym) rlang::as_quosure(call) ``` Note that while formulas and quosures are very similar objects (and in the most general sense, formulas are quosures), they can't be used interchangeably in tidyeval functions. Early implementations did treat bare formulas as quosures, but this created compatibility issues with modelling functions of the stats package. Fortunately, it's easy to transform formulas to quosures that will self-evaluate in tidyeval functions: ```{r} f <- ~cyl f rlang::as_quosure(f) ``` Finally, and perhaps most importantly, **strings are not and should not be parsed**. As developers, it is tempting to try and solve problems using strings because we have been trained to work with strings rather than quoted expressions. However it's almost always the wrong way to approach the problem. The exception is for creating symbols. In that case it is perfectly legitimate to use strings: ```{r} rlang::sym("cyl") rlang::syms(letters[1:3]) ``` But you should never use strings to create calls. Instead you can use quasiquotation: ```{r} syms <- rlang::syms(c("foo", "bar", "baz")) quo(my_call(!!! syms)) fun <- rlang::sym("my_call") quo(UQ(fun)(!!! syms)) ``` Or create the call with `lang()`: ```{r} call <- rlang::lang("my_call", !!! syms) call rlang::as_quosure(call) # Or equivalently: quo(!! rlang::lang("my_call", !!! syms)) ``` Note that idioms based on `interp()` should now generally be avoided and replaced with quasiquotation. Where you used to interpolate: ```{r, eval=FALSE} lazyeval::interp(~ mean(var), var = rlang::sym("mpg")) ``` You would now unquote: ```{r, eval=FALSE} var <- "mpg" quo(mean(!! rlang::sym(var))) ``` See also `vignette("programming")` for more about quasiquotation and quosures. #### For package authors For package authors, rlang provides a [compatibility file](https://github.com/hadley/rlang/blob/master/R/compat-lazyeval.R) that you can copy to your package. `compat_lazy()` and `compat_lazy_dots()` turn lazy-able objects into proper quosures. This helps providing an underscored version to your users for backward compatibility. For instance, here is how we defined the underscored version of `filter()` in dplyr 0.6: ```{r, eval = FALSE} filter_.tbl_df <- function(.data, ..., .dots = list()) { dots <- compat_lazy_dots(.dots, caller_env(), ...) filter(.data, !!! dots) } ``` With tidyeval, S3 dispatch to the correct method might be an issue. In the past, the genericity of dplyr verbs was accomplished by dispatching in the underscored versions. Now that those are deprecated, we've turned the non-underscored verbs into S3 generics. We maintain backward compatibility by redispatching to old underscored verbs in the default methods of the new S3 generics. For example, here is how we redispatch `filter()`: ```{r, eval = FALSE} filter.default <- function(.data, ...) { filter_(.data, .dots = compat_as_lazy_dots(...)) } ``` This gets the job done in most cases. However, the default method will not be called for objects inheriting from one of the classes for which we provide non-underscored methods: `data.frame`, `tbl_df`, `tbl_cube` and `grouped_df`. An example of this is the `sf` package whose objects have classes `c("sf", "data.frame")`. Authors of such packages should provide a method for the non-underscored generic in order to be compatible with dplyr: ```{r, eval = FALSE} filter.sf <- function(.data, ...) { st_as_sf(NextMethod()) } ``` If you need help with this, please let us know! ### Deprecation of `mutate_each()` and `summarise_each()` These functions have been replaced by a more complete family of functions. This family has suffixes `_if`, `_at` and `_all` and includes more verbs than just `mutate` `summarise`. If you need to update your code to the new family, there are two relevant functions depending on which variables you apply `funs()` to. If you called `mutate_each()` without supplying a selection of variables, `funs` is applied to all variables. In this case, you should update your code to use `mutate_all()` instead: ```{r, eval = FALSE} mutate_each(starwars, funs(as.character)) mutate_all(starwars, funs(as.character)) ``` Note that the new verbs support bare functions as well, so you don't necessarily need to wrap with `funs()`: ```{r, eval = FALSE} mutate_all(starwars, as.character) ``` On the other hand, if you supplied a variable selection, you should use `mutate_at()`. The variable selection should be wrapped with `vars()`. ```{r, eval = FALSE} mutate_each(starwars, funs(as.character), height, mass) mutate_at(starwars, vars(height, mass), as.character) ``` `vars()` supports all the selection helpers that you usually use with `select()`: ```{r, eval = FALSE} summarise_at(mtcars, vars(starts_with("d")), mean) ``` Note that intead of a `vars()` selection, you can also supply character vectors of column names: ```{r, eval = FALSE} mutate_at(starwars, c("height", "mass"), as.character) ``` dplyr/inst/doc/window-functions.R0000644000176200001440000000656313157241200016573 0ustar liggesusers## ---- include = FALSE---------------------------------------------------- knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) set.seed(1014) ## ------------------------------------------------------------------------ library(Lahman) batting <- Lahman::Batting %>% as_tibble() %>% select(playerID, yearID, teamID, G, AB:H) %>% arrange(playerID, yearID, teamID) %>% semi_join(Lahman::AwardsPlayers, by = "playerID") players <- batting %>% group_by(playerID) ## ---- eval = FALSE------------------------------------------------------- # # For each player, find the two years with most hits # filter(players, min_rank(desc(H)) <= 2 & H > 0) # # Within each player, rank each year by the number of games played # mutate(players, G_rank = min_rank(G)) # # # For each player, find every year that was better than the previous year # filter(players, G > lag(G)) # # For each player, compute avg change in games played per year # mutate(players, G_change = (G - lag(G)) / (yearID - lag(yearID))) # # # For each player, find all where they played more games than average # filter(players, G > mean(G)) # # For each, player compute a z score based on number of games played # mutate(players, G_z = (G - mean(G)) / sd(G)) ## ------------------------------------------------------------------------ x <- c(1, 1, 2, 2, 2) row_number(x) min_rank(x) dense_rank(x) ## ------------------------------------------------------------------------ cume_dist(x) percent_rank(x) ## ------------------------------------------------------------------------ filter(players, cume_dist(desc(G)) < 0.1) ## ------------------------------------------------------------------------ by_team_player <- group_by(batting, teamID, playerID) by_team <- summarise(by_team_player, G = sum(G)) by_team_quartile <- group_by(by_team, quartile = ntile(G, 4)) summarise(by_team_quartile, mean(G)) ## ------------------------------------------------------------------------ x <- 1:5 lead(x) lag(x) ## ---- results = "hide"--------------------------------------------------- # Compute the relative change in games played mutate(players, G_delta = G - lag(G)) ## ---- results = "hide"--------------------------------------------------- # Find when a player changed teams filter(players, teamID != lag(teamID)) ## ------------------------------------------------------------------------ df <- data.frame(year = 2000:2005, value = (0:5) ^ 2) scrambled <- df[sample(nrow(df)), ] wrong <- mutate(scrambled, running = cumsum(value)) arrange(wrong, year) right <- mutate(scrambled, running = order_by(year, cumsum(value))) arrange(right, year) ## ---- eval = FALSE------------------------------------------------------- # filter(players, cumany(G > 150)) ## ------------------------------------------------------------------------ x <- 1:10 y <- 10:1 order_by(y, cumsum(x)) ## ---- eval = FALSE------------------------------------------------------- # filter(players, G > mean(G)) # filter(players, G < median(G)) ## ---- eval = FALSE------------------------------------------------------- # filter(players, ntile(G, 2) == 2) ## ------------------------------------------------------------------------ mutate(players, career_year = yearID - min(yearID) + 1) ## ------------------------------------------------------------------------ mutate(players, G_z = (G - mean(G)) / sd(G)) dplyr/inst/doc/programming.Rmd0000644000176200001440000004565413150337633016136 0ustar liggesusers--- title: "Programming with dplyr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Programming with dplyr} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) set.seed(1014) ``` Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don't follow the usual R rules of evaluation. Instead, they capture the expression that you typed and evaluate it in a custom way. This has two main benefits for dplyr code: * Operations on data frames can be expressed succinctly because you don't need to repeat the name of the data frame. For example, you can write `filter(df, x == 1, y == 2, z == 3)` instead of `df[df$x == 1 & df$y ==2 & df$z == 3, ]`. * dplyr can choose to compute results in a different way to base R. This is important for database backends because dplyr itself doesn't do any work, but instead generates the SQL that tells the database what to do. Unfortunately these benefits do not come for free. There are two main drawbacks: * Most dplyr arguments are not __referentially transparent__. That means you can't replace a value with a seemingly equivalent object that you've defined elsewhere. In other words, this code: ```{r} df <- tibble(x = 1:3, y = 3:1) filter(df, x == 1) ``` Is not equivalent to this code: ```{r, error = TRUE} my_var <- x filter(df, my_var == 1) ``` nor to this code: ```{r, error = TRUE} my_var <- "x" filter(df, my_var == 1) ``` This makes it hard to create functions with arguments that change how dplyr verbs are computed. * dplyr code is ambiguous. Depending on what variables are defined where, `filter(df, x == y)` could be equivalent to any of: ```{r, eval = FALSE} df[df$x == df$y, ] df[df$x == y, ] df[x == df$y, ] df[x == y, ] ``` This is useful when working interactively (because it saves typing and you quickly spot problems) but makes functions more unpredictable than you might desire. Fortunately, dplyr provides tools to overcome these challenges. They require a little more typing, but a small amount of upfront work is worth it because they help you save time in the long run. This vignette has two goals: * Show you how to use dplyr's __pronouns__ and __quasiquotation__ to write reliable functions that reduce duplication in your data analysis code. * To teach you the underlying theory including __quosures__, the data structure that stores both an expression and an environment, and __tidyeval__, the underlying toolkit. We'll start with a warmup, tying this problem to something you're more familiar with, then move on to some practical tools, then dive into the deeper theory. ## Warm up You might not have realised it, but you're already accomplished at solving this type of problem in another domain: strings. It's obvious that this function doesn't do what you want: ```{r} greet <- function(name) { "How do you do, name?" } greet("Hadley") ``` That's because `"` "quotes" its input: it doesn't interpret what you've typed, it just stores it in a string. One way to make the function do what you want is to use `paste()` to build up the string piece by piece: ```{r} greet <- function(name) { paste0("How do you do, ", name, "?") } greet("Hadley") ``` Another approach is exemplified by the __glue__ package: it allows you to "unquote" components of a string, replacing the string with the value of the R expression. This allows an elegant implementation of our function because `{name}` is replaced with the value of the `name` argument. ```{r} greet <- function(name) { glue::glue("How do you do, {name}?") } greet("Hadley") ``` ## Programming recipes The following recipes walk you through the basics of tidyeval, with the nominal goal of reducing duplication in dplyr code. The examples here are somewhat inauthentic because we've reduced them down to very simple components to make them easier to understand. They're so simple that you might wonder why we bother writing a function at all. But it's a good idea to learn the ideas on simple examples, so that you're better prepared to apply them to the more complex situations you'll see in your own code. ### Different data sets You already know how to write functions that work with the first argument of dplyr verbs: the data. That's because dplyr doesn't do anything special with that argument, so it's referentially transparent. For example, if you saw repeated code like this: ```{r, eval = FALSE} mutate(df1, y = a + x) mutate(df2, y = a + x) mutate(df3, y = a + x) mutate(df4, y = a + x) ``` You could already write a function to capture that duplication: ```{r} mutate_y <- function(df) { mutate(df, y = a + x) } ``` Unfortunately, there's a drawback to this simple approach: it can fail silently if one of the variables isn't present in the data frame, but is present in the global environment. ```{r} df1 <- tibble(x = 1:3) a <- 10 mutate_y(df1) ``` We can fix that ambiguity by being more explicit and using the `.data` pronoun. This will throw an informative error if the variable doesn't exist: ```{r, error = TRUE} mutate_y <- function(df) { mutate(df, y = .data$a + .data$x) } mutate_y(df1) ``` If this function is in a package, using `.data` also prevents `R CMD check` from giving a NOTE about undefined global variables (provided that you've also imported `rlang::.data` with `@importFrom rlang .data`). ### Different expressions Writing a function is hard if you want one of the arguments to be a variable name (like `x`) or an expression (like `x + y`). That's because dplyr automatically "quotes" those inputs, so they are not referentially transparent. Let's start with a simple case: you want to vary the grouping variable for a data summarization. ```{r} df <- tibble( g1 = c(1, 1, 2, 2, 2), g2 = c(1, 2, 1, 2, 1), a = sample(5), b = sample(5) ) df %>% group_by(g1) %>% summarise(a = mean(a)) df %>% group_by(g2) %>% summarise(a = mean(a)) ``` You might hope that this will work: ```{r, error = TRUE} my_summarise <- function(df, group_var) { df %>% group_by(group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ``` But it doesn't. Maybe providing the variable name as a string will fix things? ```{r, error = TRUE} my_summarise(df, "g2") ``` Nope. If you look carefully at the error message, you'll see that it's the same in both cases. `group_by()` works like `"`: it doesn't evaluate its input; it quotes it. To make this function work, we need to do two things. We need to quote the input ourselves (so `my_summarise()` can take a bare variable name like `group_by()`), and then we need to tell `group_by()` not to quote its input (because we've done the quoting). How do we quote the input? We can't use `""` to quote the input, because that gives us a string. Instead we need a function that captures the expression and its environment (we'll come back to why this is important later on). There are two possible options we could use in base R, the function `quote()` and the operator `~`. Neither of these work quite the way we want, so we need a new function: `quo()`. `quo()` works like `"`: it quotes its input rather than evaluating it. ```{r} quo(g1) quo(a + b + c) quo("a") ``` `quo()` returns a __quosure__, which is a special type of formula. You'll learn more about quosures later on. Now that we've captured this expression, how do we use it with `group_by()`? It doesn't work if we just shove it into our naive approach: ```{r, error = TRUE} my_summarise(df, quo(g1)) ``` We get the same error as before, because we haven't yet told `group_by()` that we're taking care of the quoting. In other words, we need to tell `group_by()` not to quote its input, because it has been pre-quoted by `my_summarise()`. Yet another way of saying the same thing is that we want to __unquote__ `group_var`. In dplyr (and in tidyeval in general) you use `!!` to say that you want to unquote an input so that it's evaluated, not quoted. This gives us a function that actually does what we want. ```{r} my_summarise <- function(df, group_var) { df %>% group_by(!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, quo(g1)) ``` Huzzah! There's just one step left: we want to call this function like we call `group_by()`: ```{r, eval = FALSE} my_summarise(df, g1) ``` This doesn't work because there's no object called `g1`. We need to capture what the user of the function typed and quote it for them. You might try using `quo()` to do that: ```{r, error = TRUE} my_summarise <- function(df, group_var) { quo_group_var <- quo(group_var) print(quo_group_var) df %>% group_by(!!quo_group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ``` I've added a `print()` call to make it obvious what's going wrong here: `quo(group_var)` always returns `~group_var`. It is being too literal! We want it to substitute the value that the user supplied, i.e. to return `~g1`. By analogy to strings, we don't want `""`, instead we want some function that turns an argument into a string. That's the job of `enquo()`. `enquo()` uses some dark magic to look at the argument, see what the user typed, and return that value as a quosure. (Technically, this works because function arguments are evaluated lazily, using a special data structure called a __promise__.) ```{r} my_summarise <- function(df, group_var) { group_var <- enquo(group_var) print(group_var) df %>% group_by(!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1) ``` (If you're familiar with `quote()` and `substitute()` in base R, `quo()` is equivalent to `quote()` and `enquo()` is equivalent to `substitute()`.) You might wonder how to extend this to handle multiple grouping variables: we'll come back to that a little later. ### Different input variable Now let's tackle something a bit more complicated. The code below shows a duplicate `summarise()` statement where we compute three summaries, varying the input variable. ```{r} summarise(df, mean = mean(a), sum = sum(a), n = n()) summarise(df, mean = mean(a * b), sum = sum(a * b), n = n()) ``` To turn this into a function, we start by testing the basic approach interactively: we quote the variable with `quo()`, then unquoting it in the dplyr call with `!!`. Notice that we can unquote anywhere inside a complicated expression. ```{r} my_var <- quo(a) summarise(df, mean = mean(!!my_var), sum = sum(!!my_var), n = n()) ``` You can also wrap `quo()` around the dplyr call to see what will happen from dplyr's perspective. This is a very useful tool for debugging. ```{r} quo(summarise(df, mean = mean(!!my_var), sum = sum(!!my_var), n = n() )) ``` Now we can turn our code into a function (remembering to replace `quo()` with `enquo()`), and check that it works: ```{r} my_summarise2 <- function(df, expr) { expr <- enquo(expr) summarise(df, mean = mean(!!expr), sum = sum(!!expr), n = n() ) } my_summarise2(df, a) my_summarise2(df, a * b) ``` ### Different input and output variable The next challenge is to vary the name of the output variables: ```{r} mutate(df, mean_a = mean(a), sum_a = sum(a)) mutate(df, mean_b = mean(b), sum_b = sum(b)) ``` This code is similar to the previous example, but there are two new wrinkles: * We create the new names by pasting together strings, so we need `quo_name()` to convert the input expression to a string. * `!!mean_name = mean(!!expr)` isn't valid R code, so we need to use the `:=` helper provided by rlang. ```{r} my_mutate <- function(df, expr) { expr <- enquo(expr) mean_name <- paste0("mean_", quo_name(expr)) sum_name <- paste0("sum_", quo_name(expr)) mutate(df, !!mean_name := mean(!!expr), !!sum_name := sum(!!expr) ) } my_mutate(df, a) ``` ### Capturing multiple variables It would be nice to extend `my_summarise()` to accept any number of grouping variables. We need to make three changes: * Use `...` in the function definition so our function can accept any number of arguments. * Use `quos()` to capture all the `...` as a list of formulas. * Use `!!!` instead of `!!` to __splice__ the arguments into `group_by()`. ```{r} my_summarise <- function(df, ...) { group_var <- quos(...) df %>% group_by(!!!group_var) %>% summarise(a = mean(a)) } my_summarise(df, g1, g2) ``` `!!!` takes a list of elements and splices them into to the current call. Look at the bottom of the `!!!` and think `...`. ```{r} args <- list(na.rm = TRUE, trim = 0.25) quo(mean(x, !!! args)) args <- list(quo(x), na.rm = TRUE, trim = 0.25) quo(mean(!!! args)) ``` Now that you've learned the basics of tidyeval through some practical examples, we'll dive into the theory. This will help you generalise what you've learned here to new situations. ## Quoting Quoting is the action of capturing an expression instead of evaluating it. All expression-based functions quote their arguments and get the R code as an expression rather than the result of evaluating that code. If you are an R user, you probably quote expressions on a regular basis. One of the most important quoting operators in R is the _formula_. It is famously used for the specification of statistical models: ```{r} disp ~ cyl + drat ``` The other quoting operator in base R is `quote()`. It returns a raw expression rather than a formula: ```{r} # Computing the value of the expression: toupper(letters[1:5]) # Capturing the expression: quote(toupper(letters[1:5])) ``` (Note that despite being called the double quote, `"` is not a quoting operator in this context, because it generates a string, not an expression.) In practice, the formula is the better of the two options because it captures the code and its execution __environment__. This is important because even simple expression can yield different values in different environments. For example, the `x` in the following two expressions refers to different values: ```{r} f <- function(x) { quo(x) } x1 <- f(10) x2 <- f(100) ``` It might look like the expressions are the same if you print them out. ```{r} x1 x2 ``` But if you inspect the environments using `rlang::get_env()` --- they're different. ```{r, message = FALSE} library(rlang) get_env(x1) get_env(x2) ``` Further, when we evaluate those formulas using `rlang::eval_tidy()`, we see that they yield different values: ```{r} eval_tidy(x1) eval_tidy(x2) ``` This is a key property of R: one name can refer to different values in different environments. This is also important for dplyr, because it allows you to combine variables and objects in a call: ```{r} user_var <- 1000 mtcars %>% summarise(cyl = mean(cyl) * user_var) ``` When an object keeps track of an environment, it is said to have an enclosure. This is the reason that functions in R are sometimes referred to as closures: ```{r} typeof(mean) ``` For this reason we use a special name to refer to one-sided formulas: __quosures__. One-sided formulas are quotes (they carry an expression) with an environment. Quosures are regular R objects. They can be stored in a variable and inspected: ```{r} var <- ~toupper(letters[1:5]) var # You can extract its expression: get_expr(var) # Or inspect its enclosure: get_env(var) ``` ## Quasiquotation > Put simply, quasi-quotation enables one to introduce symbols that stand for > a linguistic expression in a given instance and are used as that linguistic > expression in a different instance. --- [Willard van Orman Quine](https://en.wikipedia.org/wiki/Quasi-quotation) Automatic quoting makes dplyr very convenient for interactive use. But if you want to program with dplyr, you need some way to refer to variables indirectly. The solution to this problem is __quasiquotation__, which allows you to evaluate directly inside an expression that is otherwise quoted. Quasiquotation was coined by Willard van Orman Quine in the 1940s, and was adopted for programming by the LISP community in the 1970s. All expression-based functions in the tidyeval framework support quasiquotation. Unquoting cancels quotation of parts of an expression. There are three types of unquoting: * basic * unquote splicing * unquoting names ### Unquoting The first important operation is the basic unquote, which comes in a functional form, `UQ()`, and as syntactic-sugar, `!!`. ```{r} # Here we capture `letters[1:5]` as an expression: quo(toupper(letters[1:5])) # Here we capture the value of `letters[1:5]` quo(toupper(!!letters[1:5])) quo(toupper(UQ(letters[1:5]))) ``` It is also possible to unquote other quoted expressions. Unquoting such symbolic objects provides a powerful way of manipulating expressions. ```{r} var1 <- quo(letters[1:5]) quo(toupper(!!var1)) ``` You can safely unquote quosures because they track their environments, and tidyeval functions know how to evaluate them. This allows any depth of quoting and unquoting. ```{r} my_mutate <- function(x) { mtcars %>% select(cyl) %>% slice(1:4) %>% mutate(cyl2 = cyl + (!! x)) } f <- function(x) quo(x) expr1 <- f(100) expr2 <- f(10) my_mutate(expr1) my_mutate(expr2) ``` The functional form is useful in cases where the precedence of `!` causes problems: ```{r, error = TRUE} my_fun <- quo(fun) quo(!!my_fun(x, y, z)) quo(UQ(my_fun)(x, y, z)) my_var <- quo(x) quo(filter(df, !!my_var == 1)) quo(filter(df, UQ(my_var) == 1)) ``` You'll note above that `UQ()` yields a quosure containing a formula. That ensures that when the quosure is evaluated, it'll be looked up in the right environment. In certain code-generation scenarios you just want to use expression and ignore the environment. That's the job of `UQE()`: ```{r} quo(UQE(my_fun)(x, y, z)) quo(filter(df, UQE(my_var) == 1)) ``` `UQE()` is for expert use only as you'll have to carefully analyse the environments to ensure that the generated code is correct. ### Unquote-splicing The second unquote operation is unquote-splicing. Its functional form is `UQS()` and the syntactic shortcut is `!!!`. It takes a vector and inserts each element of the vector in the surrounding function call: ```{r} quo(list(!!! letters[1:5])) ``` A very useful feature of unquote-splicing is that the vector names become argument names: ```{r} x <- list(foo = 1L, bar = quo(baz)) quo(list(!!! x)) ``` This makes it easy to program with dplyr verbs that take named dots: ```{r} args <- list(mean = quo(mean(cyl)), count = quo(n())) mtcars %>% group_by(am) %>% summarise(!!! args) ``` ### Setting variable names The final unquote operation is setting argument names. You've seen one way to do that above, but you can also use the definition operator `:=` instead of `=`. `:=` supports unquoting on both the LHS and the RHS. The rules on the LHS are slightly different: the unquoted operand should evaluate to a string or a symbol. ```{r} mean_nm <- "mean" count_nm <- "count" mtcars %>% group_by(am) %>% summarise( !!mean_nm := mean(cyl), !!count_nm := n() ) ``` dplyr/inst/doc/window-functions.html0000644000176200001440000006643513157241200017342 0ustar liggesusers Window functions

Window functions

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().

In this vignette, we’ll use a small sample of the Lahman batting dataset, including the players that have won an award.

library(Lahman)

batting <- Lahman::Batting %>%
  as_tibble() %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  semi_join(Lahman::AwardsPlayers, by = "playerID")

players <- batting %>% group_by(playerID)

Window functions are used in conjunction with mutate() and filter() to solve a wide range of problems. Here’s a selection:

# For each player, find the two years with most hits
filter(players, min_rank(desc(H)) <= 2 & H > 0)
# Within each player, rank each year by the number of games played
mutate(players, G_rank = min_rank(G))

# For each player, find every year that was better than the previous year
filter(players, G > lag(G))
# For each player, compute avg change in games played per year
mutate(players, G_change = (G - lag(G)) / (yearID - lag(yearID)))

# For each player, find all where they played more games than average
filter(players, G > mean(G))
# For each, player compute a z score based on number of games played
mutate(players, G_z = (G - mean(G)) / sd(G))

Before reading this vignette, you should be familiar with mutate() and filter().

Types of window functions

There are five main families of window functions. Two families are unrelated to aggregation functions:

The other three families are variations on familiar aggregate functions:

Each family is described in more detail below, focussing on the general goals and how to use them with dplyr. For more details, refer to the individual function documentation.

Ranking functions

The ranking functions are variations on a theme, differing in how they handle ties:

x <- c(1, 1, 2, 2, 2)

row_number(x)
#> [1] 1 2 3 4 5
min_rank(x)
#> [1] 1 1 3 3 3
dense_rank(x)
#> [1] 1 1 2 2 2

If you’re familiar with R, you may recognise that row_number() and min_rank() can be computed with the base rank() function and various values of the ties.method argument. These functions are provided to save a little typing, and to make it easier to convert between R and SQL.

Two other ranking functions return numbers between 0 and 1. percent_rank() gives the percentage of the rank; cume_dist() gives the proportion of values less than or equal to the current value.

cume_dist(x)
#> [1] 0.4 0.4 1.0 1.0 1.0
percent_rank(x)
#> [1] 0.0 0.0 0.5 0.5 0.5

These are useful if you want to select (for example) the top 10% of records within each group. For example:

filter(players, cume_dist(desc(G)) < 0.1)
#> # A tibble: 995 x 7
#> # Groups:   playerID [906]
#>   playerID  yearID teamID     G    AB     R     H
#>   <chr>      <int> <fctr> <int> <int> <int> <int>
#> 1 aaronha01   1963 ML1      161   631   121   201
#> 2 aaronha01   1968 ATL      160   606    84   174
#> 3 abbotji01   1991 CAL       34    NA    NA    NA
#> 4 abernte02   1965 CHN       84    18     1     3
#> # ... with 991 more rows

Finally, ntile() divides the data up into n evenly sized buckets. It’s a coarse ranking, and it can be used in with mutate() to divide the data into buckets for further summary. For example, we could use ntile() to divide the players within a team into four ranked groups, and calculate the average number of games within each group.

by_team_player <- group_by(batting, teamID, playerID)
by_team <- summarise(by_team_player, G = sum(G))
by_team_quartile <- group_by(by_team, quartile = ntile(G, 4))
summarise(by_team_quartile, mean(G))
#> # A tibble: 4 x 2
#>   quartile `mean(G)`
#>      <int>     <dbl>
#> 1        1      27.2
#> 2        2      97.6
#> 3        3     272  
#> 4        4     976

All ranking functions rank from lowest to highest so that small input values get small ranks. Use desc() to rank from highest to lowest.

Lead and lag

lead() and lag() produce offset versions of a input vector that is either ahead of or behind the original vector.

x <- 1:5
lead(x)
#> [1]  2  3  4  5 NA
lag(x)
#> [1] NA  1  2  3  4

You can use them to:

lead() and lag() have an optional argument order_by. If set, instead of using the row order to determine which value comes before another, they will use another variable. This important if you have not already sorted the data, or you want to sort one way and lag another.

Here’s a simple example of what happens if you don’t specify order_by when you need it:

df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
scrambled <- df[sample(nrow(df)), ]

wrong <- mutate(scrambled, running = cumsum(value))
arrange(wrong, year)
#>   year value running
#> 1 2000     0       0
#> 2 2001     1      55
#> 3 2002     4      20
#> 4 2003     9      54
#> 5 2004    16      16
#> 6 2005    25      45

right <- mutate(scrambled, running = order_by(year, cumsum(value)))
arrange(right, year)
#>   year value running
#> 1 2000     0       0
#> 2 2001     1       1
#> 3 2002     4       5
#> 4 2003     9      14
#> 5 2004    16      30
#> 6 2005    25      55

Cumulative aggregates

Base R provides cumulative sum (cumsum()), cumulative min (cummin()) and cumulative max (cummax()). (It also provides cumprod() but that is rarely useful). Other common accumulating functions are cumany() and cumall(), cumulative versions of || and &&, and cummean(), a cumulative mean. These are not included in base R, but efficient versions are provided by dplyr.

cumany() and cumall() are useful for selecting all rows up to, or all rows after, a condition is true for the first (or last) time. For example, we can use cumany() to find all records for a player after they played a year with 150 games:

filter(players, cumany(G > 150))

Like lead and lag, you may want to control the order in which the accumulation occurs. None of the built in functions have an order_by argument so dplyr provides a helper: order_by(). You give it the variable you want to order by, and then the call to the window function:

x <- 1:10
y <- 10:1
order_by(y, cumsum(x))
#>  [1] 55 54 52 49 45 40 34 27 19 10

This function uses a bit of non-standard evaluation, so I wouldn’t recommend using it inside another function; use the simpler but less concise with_order() instead.

Recycled aggregates

R’s vector recycling make it easy to select values that are higher or lower than a summary. I call this a recycled aggregate because the value of the aggregate is recycled to be the same length as the original vector. Recycled aggregates are useful if you want to find all records greater than the mean or less than the median:

filter(players, G > mean(G))
filter(players, G < median(G))

While most SQL databases don’t have an equivalent of median() or quantile(), when filtering you can achieve the same effect with ntile(). For example, x > median(x) is equivalent to ntile(x, 2) == 2; x > quantile(x, 75) is equivalent to ntile(x, 100) > 75 or ntile(x, 4) > 3.

filter(players, ntile(G, 2) == 2)

You can also use this idea to select the records with the highest (x == max(x)) or lowest value (x == min(x)) for a field, but the ranking functions give you more control over ties, and allow you to select any number of records.

Recycled aggregates are also useful in conjunction with mutate(). For example, with the batting data, we could compute the “career year”, the number of years a player has played since they entered the league:

mutate(players, career_year = yearID - min(yearID) + 1)
#> # A tibble: 19,113 x 8
#> # Groups:   playerID [1,322]
#>   playerID  yearID teamID     G    AB     R     H career_year
#>   <chr>      <int> <fctr> <int> <int> <int> <int>       <dbl>
#> 1 aaronha01   1954 ML1      122   468    58   131        1.00
#> 2 aaronha01   1955 ML1      153   602   105   189        2.00
#> 3 aaronha01   1956 ML1      153   609   106   200        3.00
#> 4 aaronha01   1957 ML1      151   615   118   198        4.00
#> # ... with 19,109 more rows

Or, as in the introductory example, we could compute a z-score:

mutate(players, G_z = (G - mean(G)) / sd(G))
#> # A tibble: 19,113 x 8
#> # Groups:   playerID [1,322]
#>   playerID  yearID teamID     G    AB     R     H    G_z
#>   <chr>      <int> <fctr> <int> <int> <int> <int>  <dbl>
#> 1 aaronha01   1954 ML1      122   468    58   131 -1.16 
#> 2 aaronha01   1955 ML1      153   602   105   189  0.519
#> 3 aaronha01   1956 ML1      153   609   106   200  0.519
#> 4 aaronha01   1957 ML1      151   615   118   198  0.411
#> # ... with 19,109 more rows
dplyr/inst/doc/programming.html0000644000176200001440000015523313157241173016353 0ustar liggesusers Programming with dplyr

Programming with dplyr

Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don’t follow the usual R rules of evaluation. Instead, they capture the expression that you typed and evaluate it in a custom way. This has two main benefits for dplyr code:

Unfortunately these benefits do not come for free. There are two main drawbacks:

Fortunately, dplyr provides tools to overcome these challenges. They require a little more typing, but a small amount of upfront work is worth it because they help you save time in the long run.

This vignette has two goals:

We’ll start with a warmup, tying this problem to something you’re more familiar with, then move on to some practical tools, then dive into the deeper theory.

Warm up

You might not have realised it, but you’re already accomplished at solving this type of problem in another domain: strings. It’s obvious that this function doesn’t do what you want:

greet <- function(name) {
  "How do you do, name?"
}
greet("Hadley")
#> [1] "How do you do, name?"

That’s because " “quotes” its input: it doesn’t interpret what you’ve typed, it just stores it in a string. One way to make the function do what you want is to use paste() to build up the string piece by piece:

greet <- function(name) {
  paste0("How do you do, ", name, "?")
}
greet("Hadley")
#> [1] "How do you do, Hadley?"

Another approach is exemplified by the glue package: it allows you to “unquote” components of a string, replacing the string with the value of the R expression. This allows an elegant implementation of our function because {name} is replaced with the value of the name argument.

greet <- function(name) {
  glue::glue("How do you do, {name}?")
}
greet("Hadley")
#> How do you do, Hadley?

Programming recipes

The following recipes walk you through the basics of tidyeval, with the nominal goal of reducing duplication in dplyr code. The examples here are somewhat inauthentic because we’ve reduced them down to very simple components to make them easier to understand. They’re so simple that you might wonder why we bother writing a function at all. But it’s a good idea to learn the ideas on simple examples, so that you’re better prepared to apply them to the more complex situations you’ll see in your own code.

Different data sets

You already know how to write functions that work with the first argument of dplyr verbs: the data. That’s because dplyr doesn’t do anything special with that argument, so it’s referentially transparent. For example, if you saw repeated code like this:

mutate(df1, y = a + x)
mutate(df2, y = a + x)
mutate(df3, y = a + x)
mutate(df4, y = a + x)

You could already write a function to capture that duplication:

mutate_y <- function(df) {
  mutate(df, y = a + x)
}

Unfortunately, there’s a drawback to this simple approach: it can fail silently if one of the variables isn’t present in the data frame, but is present in the global environment.

df1 <- tibble(x = 1:3)
a <- 10
mutate_y(df1)
#> # A tibble: 3 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1  11.0
#> 2     2  12.0
#> 3     3  13.0

We can fix that ambiguity by being more explicit and using the .data pronoun. This will throw an informative error if the variable doesn’t exist:

mutate_y <- function(df) {
  mutate(df, y = .data$a + .data$x)
}

mutate_y(df1)
#> Error in mutate_impl(.data, dots): Evaluation error: Column `a`: not found in data.

If this function is in a package, using .data also prevents R CMD check from giving a NOTE about undefined global variables (provided that you’ve also imported rlang::.data with @importFrom rlang .data).

Different expressions

Writing a function is hard if you want one of the arguments to be a variable name (like x) or an expression (like x + y). That’s because dplyr automatically “quotes” those inputs, so they are not referentially transparent. Let’s start with a simple case: you want to vary the grouping variable for a data summarization.

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = sample(5), 
  b = sample(5)
)

df %>%
  group_by(g1) %>%
  summarise(a = mean(a))
#> # A tibble: 2 x 2
#>      g1     a
#>   <dbl> <dbl>
#> 1  1.00  2.50
#> 2  2.00  3.33

df %>%
  group_by(g2) %>%
  summarise(a = mean(a))
#> # A tibble: 2 x 2
#>      g2     a
#>   <dbl> <dbl>
#> 1  1.00  2.00
#> 2  2.00  4.50

You might hope that this will work:

my_summarise <- function(df, group_var) {
  df %>%
    group_by(group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, g1)
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group_var` is unknown

But it doesn’t.

Maybe providing the variable name as a string will fix things?

my_summarise(df, "g2")
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group_var` is unknown

Nope.

If you look carefully at the error message, you’ll see that it’s the same in both cases. group_by() works like ": it doesn’t evaluate its input; it quotes it.

To make this function work, we need to do two things. We need to quote the input ourselves (so my_summarise() can take a bare variable name like group_by()), and then we need to tell group_by() not to quote its input (because we’ve done the quoting).

How do we quote the input? We can’t use "" to quote the input, because that gives us a string. Instead we need a function that captures the expression and its environment (we’ll come back to why this is important later on). There are two possible options we could use in base R, the function quote() and the operator ~. Neither of these work quite the way we want, so we need a new function: quo().

quo() works like ": it quotes its input rather than evaluating it.

quo(g1)
#> <quosure: global>
#> ~g1
quo(a + b + c)
#> <quosure: global>
#> ~a + b + c
quo("a")
#> <quosure: empty>
#> ~"a"

quo() returns a quosure, which is a special type of formula. You’ll learn more about quosures later on.

Now that we’ve captured this expression, how do we use it with group_by()? It doesn’t work if we just shove it into our naive approach:

my_summarise(df, quo(g1))
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group_var` is unknown

We get the same error as before, because we haven’t yet told group_by() that we’re taking care of the quoting. In other words, we need to tell group_by() not to quote its input, because it has been pre-quoted by my_summarise(). Yet another way of saying the same thing is that we want to unquote group_var.

In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it’s evaluated, not quoted. This gives us a function that actually does what we want.

my_summarise <- function(df, group_var) {
  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, quo(g1))
#> # A tibble: 2 x 2
#>      g1     a
#>   <dbl> <dbl>
#> 1  1.00  2.50
#> 2  2.00  3.33

Huzzah!

There’s just one step left: we want to call this function like we call group_by():

my_summarise(df, g1)

This doesn’t work because there’s no object called g1. We need to capture what the user of the function typed and quote it for them. You might try using quo() to do that:

my_summarise <- function(df, group_var) {
  quo_group_var <- quo(group_var)
  print(quo_group_var)

  df %>%
    group_by(!!quo_group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, g1)
#> <quosure: frame>
#> ~group_var
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group_var` is unknown

I’ve added a print() call to make it obvious what’s going wrong here: quo(group_var) always returns ~group_var. It is being too literal! We want it to substitute the value that the user supplied, i.e. to return ~g1.

By analogy to strings, we don’t want "", instead we want some function that turns an argument into a string. That’s the job of enquo(). enquo() uses some dark magic to look at the argument, see what the user typed, and return that value as a quosure. (Technically, this works because function arguments are evaluated lazily, using a special data structure called a promise.)

my_summarise <- function(df, group_var) {
  group_var <- enquo(group_var)
  print(group_var)

  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, g1)
#> <quosure: global>
#> ~g1
#> # A tibble: 2 x 2
#>      g1     a
#>   <dbl> <dbl>
#> 1  1.00  2.50
#> 2  2.00  3.33

(If you’re familiar with quote() and substitute() in base R, quo() is equivalent to quote() and enquo() is equivalent to substitute().)

You might wonder how to extend this to handle multiple grouping variables: we’ll come back to that a little later.

Different input variable

Now let’s tackle something a bit more complicated. The code below shows a duplicate summarise() statement where we compute three summaries, varying the input variable.

summarise(df, mean = mean(a), sum = sum(a), n = n())
#> # A tibble: 1 x 3
#>    mean   sum     n
#>   <dbl> <int> <int>
#> 1  3.00    15     5
summarise(df, mean = mean(a * b), sum = sum(a * b), n = n())
#> # A tibble: 1 x 3
#>    mean   sum     n
#>   <dbl> <int> <int>
#> 1  9.60    48     5

To turn this into a function, we start by testing the basic approach interactively: we quote the variable with quo(), then unquoting it in the dplyr call with !!. Notice that we can unquote anywhere inside a complicated expression.

my_var <- quo(a)
summarise(df, mean = mean(!!my_var), sum = sum(!!my_var), n = n())
#> # A tibble: 1 x 3
#>    mean   sum     n
#>   <dbl> <int> <int>
#> 1  3.00    15     5

You can also wrap quo() around the dplyr call to see what will happen from dplyr’s perspective. This is a very useful tool for debugging.

quo(summarise(df, 
  mean = mean(!!my_var),
  sum = sum(!!my_var),
  n = n()
))
#> <quosure: global>
#> ~summarise(df, mean = mean(~a), sum = sum(~a), n = n())

Now we can turn our code into a function (remembering to replace quo() with enquo()), and check that it works:

my_summarise2 <- function(df, expr) {
  expr <- enquo(expr)
  
  summarise(df, 
    mean = mean(!!expr),
    sum = sum(!!expr),
    n = n()
  )
}
my_summarise2(df, a)
#> # A tibble: 1 x 3
#>    mean   sum     n
#>   <dbl> <int> <int>
#> 1  3.00    15     5
my_summarise2(df, a * b)
#> # A tibble: 1 x 3
#>    mean   sum     n
#>   <dbl> <int> <int>
#> 1  9.60    48     5

Different input and output variable

The next challenge is to vary the name of the output variables:

mutate(df, mean_a = mean(a), sum_a = sum(a))
#> # A tibble: 5 x 6
#>      g1    g2     a     b mean_a sum_a
#>   <dbl> <dbl> <int> <int>  <dbl> <int>
#> 1  1.00  1.00     1     3   3.00    15
#> 2  1.00  2.00     4     2   3.00    15
#> 3  2.00  1.00     2     1   3.00    15
#> 4  2.00  2.00     5     4   3.00    15
#> # ... with 1 more row
mutate(df, mean_b = mean(b), sum_b = sum(b))
#> # A tibble: 5 x 6
#>      g1    g2     a     b mean_b sum_b
#>   <dbl> <dbl> <int> <int>  <dbl> <int>
#> 1  1.00  1.00     1     3   3.00    15
#> 2  1.00  2.00     4     2   3.00    15
#> 3  2.00  1.00     2     1   3.00    15
#> 4  2.00  2.00     5     4   3.00    15
#> # ... with 1 more row

This code is similar to the previous example, but there are two new wrinkles:

  • We create the new names by pasting together strings, so we need quo_name() to convert the input expression to a string.

  • !!mean_name = mean(!!expr) isn’t valid R code, so we need to use the := helper provided by rlang.

my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))
  
  mutate(df, 
    !!mean_name := mean(!!expr), 
    !!sum_name := sum(!!expr)
  )
}

my_mutate(df, a)
#> # A tibble: 5 x 6
#>      g1    g2     a     b mean_a sum_a
#>   <dbl> <dbl> <int> <int>  <dbl> <int>
#> 1  1.00  1.00     1     3   3.00    15
#> 2  1.00  2.00     4     2   3.00    15
#> 3  2.00  1.00     2     1   3.00    15
#> 4  2.00  2.00     5     4   3.00    15
#> # ... with 1 more row

Capturing multiple variables

It would be nice to extend my_summarise() to accept any number of grouping variables. We need to make three changes:

  • Use ... in the function definition so our function can accept any number of arguments.

  • Use quos() to capture all the ... as a list of formulas.

  • Use !!! instead of !! to splice the arguments into group_by().

my_summarise <- function(df, ...) {
  group_var <- quos(...)

  df %>%
    group_by(!!!group_var) %>%
    summarise(a = mean(a))
}

my_summarise(df, g1, g2)
#> # A tibble: 4 x 3
#> # Groups:   g1 [?]
#>      g1    g2     a
#>   <dbl> <dbl> <dbl>
#> 1  1.00  1.00  1.00
#> 2  1.00  2.00  4.00
#> 3  2.00  1.00  2.50
#> 4  2.00  2.00  5.00

!!! takes a list of elements and splices them into to the current call. Look at the bottom of the !!! and think ....

args <- list(na.rm = TRUE, trim = 0.25)
quo(mean(x, !!! args))
#> <quosure: global>
#> ~mean(x, na.rm = TRUE, trim = 0.25)

args <- list(quo(x), na.rm = TRUE, trim = 0.25)
quo(mean(!!! args))
#> <quosure: global>
#> ~mean(~x, na.rm = TRUE, trim = 0.25)

Now that you’ve learned the basics of tidyeval through some practical examples, we’ll dive into the theory. This will help you generalise what you’ve learned here to new situations.

Quoting

Quoting is the action of capturing an expression instead of evaluating it. All expression-based functions quote their arguments and get the R code as an expression rather than the result of evaluating that code. If you are an R user, you probably quote expressions on a regular basis. One of the most important quoting operators in R is the formula. It is famously used for the specification of statistical models:

disp ~ cyl + drat
#> disp ~ cyl + drat

The other quoting operator in base R is quote(). It returns a raw expression rather than a formula:

# Computing the value of the expression:
toupper(letters[1:5])
#> [1] "A" "B" "C" "D" "E"

# Capturing the expression:
quote(toupper(letters[1:5]))
#> toupper(letters[1:5])

(Note that despite being called the double quote, " is not a quoting operator in this context, because it generates a string, not an expression.)

In practice, the formula is the better of the two options because it captures the code and its execution environment. This is important because even simple expression can yield different values in different environments. For example, the x in the following two expressions refers to different values:

f <- function(x) {
  quo(x)
}

x1 <- f(10)
x2 <- f(100)

It might look like the expressions are the same if you print them out.

x1
#> <quosure: local>
#> ~x
x2
#> <quosure: local>
#> ~x

But if you inspect the environments using rlang::get_env() — they’re different.

library(rlang)

get_env(x1)
#> <environment: 0x557ddf5b1ab0>
get_env(x2)
#> <environment: 0x557ddfa6e6c8>

Further, when we evaluate those formulas using rlang::eval_tidy(), we see that they yield different values:

eval_tidy(x1)
#> [1] 10
eval_tidy(x2)
#> [1] 100

This is a key property of R: one name can refer to different values in different environments. This is also important for dplyr, because it allows you to combine variables and objects in a call:

user_var <- 1000
mtcars %>% summarise(cyl = mean(cyl) * user_var)
#>      cyl
#> 1 6187.5

When an object keeps track of an environment, it is said to have an enclosure. This is the reason that functions in R are sometimes referred to as closures:

typeof(mean)
#> [1] "closure"

For this reason we use a special name to refer to one-sided formulas: quosures. One-sided formulas are quotes (they carry an expression) with an environment.

Quosures are regular R objects. They can be stored in a variable and inspected:

var <- ~toupper(letters[1:5])
var
#> ~toupper(letters[1:5])

# You can extract its expression:
get_expr(var)
#> toupper(letters[1:5])

# Or inspect its enclosure:
get_env(var)
#> <environment: R_GlobalEnv>

Quasiquotation

Put simply, quasi-quotation enables one to introduce symbols that stand for a linguistic expression in a given instance and are used as that linguistic expression in a different instance. — Willard van Orman Quine

Automatic quoting makes dplyr very convenient for interactive use. But if you want to program with dplyr, you need some way to refer to variables indirectly. The solution to this problem is quasiquotation, which allows you to evaluate directly inside an expression that is otherwise quoted.

Quasiquotation was coined by Willard van Orman Quine in the 1940s, and was adopted for programming by the LISP community in the 1970s. All expression-based functions in the tidyeval framework support quasiquotation. Unquoting cancels quotation of parts of an expression. There are three types of unquoting:

Unquoting

The first important operation is the basic unquote, which comes in a functional form, UQ(), and as syntactic-sugar, !!.

# Here we capture `letters[1:5]` as an expression:
quo(toupper(letters[1:5]))
#> <quosure: global>
#> ~toupper(letters[1:5])

# Here we capture the value of `letters[1:5]`
quo(toupper(!!letters[1:5]))
#> <quosure: global>
#> ~toupper(c("a", "b", "c", "d", "e"))
quo(toupper(UQ(letters[1:5])))
#> <quosure: global>
#> ~toupper(c("a", "b", "c", "d", "e"))

It is also possible to unquote other quoted expressions. Unquoting such symbolic objects provides a powerful way of manipulating expressions.

var1 <- quo(letters[1:5])
quo(toupper(!!var1))
#> <quosure: global>
#> ~toupper(~letters[1:5])

You can safely unquote quosures because they track their environments, and tidyeval functions know how to evaluate them. This allows any depth of quoting and unquoting.

my_mutate <- function(x) {
  mtcars %>%
    select(cyl) %>%
    slice(1:4) %>%
    mutate(cyl2 = cyl + (!! x))
}

f <- function(x) quo(x)
expr1 <- f(100)
expr2 <- f(10)

my_mutate(expr1)
#> # A tibble: 4 x 2
#>     cyl  cyl2
#>   <dbl> <dbl>
#> 1  6.00   106
#> 2  6.00   106
#> 3  4.00   104
#> 4  6.00   106
my_mutate(expr2)
#> # A tibble: 4 x 2
#>     cyl  cyl2
#>   <dbl> <dbl>
#> 1  6.00  16.0
#> 2  6.00  16.0
#> 3  4.00  14.0
#> 4  6.00  16.0

The functional form is useful in cases where the precedence of ! causes problems:

my_fun <- quo(fun)
quo(!!my_fun(x, y, z))
#> Error in my_fun(x, y, z): could not find function "my_fun"
quo(UQ(my_fun)(x, y, z))
#> <quosure: global>
#> ~(~fun)(x, y, z)

my_var <- quo(x)
quo(filter(df, !!my_var == 1))
#> <quosure: global>
#> ~filter(df, FALSE)
quo(filter(df, UQ(my_var) == 1))
#> <quosure: global>
#> ~filter(df, (~x) == 1)

You’ll note above that UQ() yields a quosure containing a formula. That ensures that when the quosure is evaluated, it’ll be looked up in the right environment. In certain code-generation scenarios you just want to use expression and ignore the environment. That’s the job of UQE():

quo(UQE(my_fun)(x, y, z))
#> <quosure: global>
#> ~fun(x, y, z)
quo(filter(df, UQE(my_var) == 1))
#> <quosure: global>
#> ~filter(df, x == 1)

UQE() is for expert use only as you’ll have to carefully analyse the environments to ensure that the generated code is correct.

Unquote-splicing

The second unquote operation is unquote-splicing. Its functional form is UQS() and the syntactic shortcut is !!!. It takes a vector and inserts each element of the vector in the surrounding function call:

quo(list(!!! letters[1:5]))
#> <quosure: global>
#> ~list("a", "b", "c", "d", "e")

A very useful feature of unquote-splicing is that the vector names become argument names:

x <- list(foo = 1L, bar = quo(baz))
quo(list(!!! x))
#> <quosure: global>
#> ~list(foo = 1L, bar = ~baz)

This makes it easy to program with dplyr verbs that take named dots:

args <- list(mean = quo(mean(cyl)), count = quo(n())) 
mtcars %>%
  group_by(am) %>%
  summarise(!!! args)
#> # A tibble: 2 x 3
#>      am  mean count
#>   <dbl> <dbl> <int>
#> 1  0     6.95    19
#> 2  1.00  5.08    13

Setting variable names

The final unquote operation is setting argument names. You’ve seen one way to do that above, but you can also use the definition operator := instead of =. := supports unquoting on both the LHS and the RHS.

The rules on the LHS are slightly different: the unquoted operand should evaluate to a string or a symbol.

mean_nm <- "mean"
count_nm <- "count"

mtcars %>%
  group_by(am) %>%
  summarise(
    !!mean_nm := mean(cyl),
    !!count_nm := n()
  )
#> # A tibble: 2 x 3
#>      am  mean count
#>   <dbl> <dbl> <int>
#> 1  0     6.95    19
#> 2  1.00  5.08    13
dplyr/inst/doc/dplyr.R0000644000176200001440000001306713157241171014414 0ustar liggesusers## ---- echo = FALSE, message = FALSE-------------------------------------- knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) library(ggplot2) set.seed(1014) ## ------------------------------------------------------------------------ library(nycflights13) dim(flights) flights ## ------------------------------------------------------------------------ filter(flights, month == 1, day == 1) ## ---- eval = FALSE------------------------------------------------------- # flights[flights$month == 1 & flights$day == 1, ] ## ------------------------------------------------------------------------ arrange(flights, year, month, day) ## ------------------------------------------------------------------------ arrange(flights, desc(arr_delay)) ## ------------------------------------------------------------------------ # Select columns by name select(flights, year, month, day) # Select all columns between year and day (inclusive) select(flights, year:day) # Select all columns except those from year to day (inclusive) select(flights, -(year:day)) ## ------------------------------------------------------------------------ select(flights, tail_num = tailnum) ## ------------------------------------------------------------------------ rename(flights, tail_num = tailnum) ## ------------------------------------------------------------------------ mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ## ------------------------------------------------------------------------ mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) ) ## ------------------------------------------------------------------------ transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) ) ## ------------------------------------------------------------------------ summarise(flights, delay = mean(dep_delay, na.rm = TRUE) ) ## ------------------------------------------------------------------------ sample_n(flights, 10) sample_frac(flights, 0.01) ## ---- warning = FALSE, message = FALSE, fig.width = 6-------------------- by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000) # Interestingly, the average delay is only slightly related to the # average distance flown by a plane. ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area() ## ------------------------------------------------------------------------ destinations <- group_by(flights, dest) summarise(destinations, planes = n_distinct(tailnum), flights = n() ) ## ------------------------------------------------------------------------ daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n())) (per_month <- summarise(per_day, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights))) ## ------------------------------------------------------------------------ # `year` represents the integer 1 select(flights, year) select(flights, 1) ## ------------------------------------------------------------------------ year <- "dep" select(flights, starts_with(year)) ## ------------------------------------------------------------------------ year <- 5 select(flights, year, identity(year)) ## ------------------------------------------------------------------------ vars <- c("year", "month") select(flights, vars, "day") ## ------------------------------------------------------------------------ # Let's create a new `vars` column: flights$vars <- flights$year # The new column won't be an issue if you evaluate `vars` in the # context with the `!!` operator: vars <- c("year", "month", "day") select(flights, !! vars) ## ------------------------------------------------------------------------ df <- select(flights, year:dep_time) ## ------------------------------------------------------------------------ mutate(df, "year", 2) ## ------------------------------------------------------------------------ mutate(df, year + 10) ## ------------------------------------------------------------------------ var <- seq(1, nrow(df)) mutate(df, new = var) ## ------------------------------------------------------------------------ group_by(df, month) group_by(df, month = as.factor(month)) group_by(df, day_binned = cut(day, 3)) ## ------------------------------------------------------------------------ group_by(df, "month") ## ------------------------------------------------------------------------ group_by_at(df, vars(year:day)) ## ---- eval = FALSE------------------------------------------------------- # a1 <- group_by(flights, year, month, day) # a2 <- select(a1, arr_delay, dep_delay) # a3 <- summarise(a2, # arr = mean(arr_delay, na.rm = TRUE), # dep = mean(dep_delay, na.rm = TRUE)) # a4 <- filter(a3, arr > 30 | dep > 30) ## ------------------------------------------------------------------------ filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 ) ## ---- eval = FALSE------------------------------------------------------- # flights %>% # group_by(year, month, day) %>% # select(arr_delay, dep_delay) %>% # summarise( # arr = mean(arr_delay, na.rm = TRUE), # dep = mean(dep_delay, na.rm = TRUE) # ) %>% # filter(arr > 30 | dep > 30) dplyr/inst/doc/two-table.R0000644000176200001440000000672313157241177015167 0ustar liggesusers## ---- echo = FALSE, message = FALSE-------------------------------------- knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 5) library(dplyr) knit_print.tbl_df <- function(x, options) { knitr::knit_print(trunc_mat(x), options) } ## ---- warning = FALSE---------------------------------------------------- library("nycflights13") # Drop unimportant variables so it's easier to understand the join results. flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier) flights2 %>% left_join(airlines) ## ------------------------------------------------------------------------ flights2 %>% left_join(weather) ## ------------------------------------------------------------------------ flights2 %>% left_join(planes, by = "tailnum") ## ------------------------------------------------------------------------ flights2 %>% left_join(airports, c("dest" = "faa")) flights2 %>% left_join(airports, c("origin" = "faa")) ## ------------------------------------------------------------------------ (df1 <- data_frame(x = c(1, 2), y = 2:1)) (df2 <- data_frame(x = c(1, 3), a = 10, b = "a")) ## ------------------------------------------------------------------------ df1 %>% inner_join(df2) %>% knitr::kable() ## ------------------------------------------------------------------------ df1 %>% left_join(df2) ## ------------------------------------------------------------------------ df1 %>% right_join(df2) df2 %>% left_join(df1) ## ------------------------------------------------------------------------ df1 %>% full_join(df2) ## ------------------------------------------------------------------------ df1 <- data_frame(x = c(1, 1, 2), y = 1:3) df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a")) df1 %>% left_join(df2) ## ------------------------------------------------------------------------ library("nycflights13") flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE) ## ------------------------------------------------------------------------ df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4) df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a")) # Four rows to start with: df1 %>% nrow() # And we get four rows after the join df1 %>% inner_join(df2, by = "x") %>% nrow() # But only two rows actually match df1 %>% semi_join(df2, by = "x") %>% nrow() ## ------------------------------------------------------------------------ (df1 <- data_frame(x = 1:2, y = c(1L, 1L))) (df2 <- data_frame(x = 1:2, y = 1:2)) ## ------------------------------------------------------------------------ intersect(df1, df2) # Note that we get 3 rows, not 4 union(df1, df2) setdiff(df1, df2) setdiff(df2, df1) ## ------------------------------------------------------------------------ df1 <- data_frame(x = 1, y = factor("a")) df2 <- data_frame(x = 2, y = factor("b")) full_join(df1, df2) %>% str() ## ------------------------------------------------------------------------ df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b"))) df2 <- data_frame(x = 2, y = factor("b", levels = c("b", "a"))) full_join(df1, df2) %>% str() ## ------------------------------------------------------------------------ df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b"))) df2 <- data_frame(x = 2, y = factor("b", levels = c("a", "b"))) full_join(df1, df2) %>% str() ## ------------------------------------------------------------------------ df1 <- data_frame(x = 1, y = "a") df2 <- data_frame(x = 2, y = factor("a")) full_join(df1, df2) %>% str() dplyr/inst/doc/window-functions.Rmd0000644000176200001440000002166113135665123017122 0ustar liggesusers--- title: "Window functions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Window functions} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) library(dplyr) set.seed(1014) ``` A __window function__ is a variation on an aggregation function. Where an aggregation function, like `sum()` and `mean()`, takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don't include functions that work element-wise, like `+` or `round()`. Window functions include variations on aggregate functions, like `cumsum()` and `cummean()`, functions for ranking and ordering, like `rank()`, and functions for taking offsets, like `lead()` and `lag()`. In this vignette, we'll use a small sample of the Lahman batting dataset, including the players that have won an award. ```{r} library(Lahman) batting <- Lahman::Batting %>% as_tibble() %>% select(playerID, yearID, teamID, G, AB:H) %>% arrange(playerID, yearID, teamID) %>% semi_join(Lahman::AwardsPlayers, by = "playerID") players <- batting %>% group_by(playerID) ``` Window functions are used in conjunction with `mutate()` and `filter()` to solve a wide range of problems. Here's a selection: ```{r, eval = FALSE} # For each player, find the two years with most hits filter(players, min_rank(desc(H)) <= 2 & H > 0) # Within each player, rank each year by the number of games played mutate(players, G_rank = min_rank(G)) # For each player, find every year that was better than the previous year filter(players, G > lag(G)) # For each player, compute avg change in games played per year mutate(players, G_change = (G - lag(G)) / (yearID - lag(yearID))) # For each player, find all where they played more games than average filter(players, G > mean(G)) # For each, player compute a z score based on number of games played mutate(players, G_z = (G - mean(G)) / sd(G)) ``` Before reading this vignette, you should be familiar with `mutate()` and `filter()`. ## Types of window functions There are five main families of window functions. Two families are unrelated to aggregation functions: * Ranking and ordering functions: `row_number()`, `min_rank()`, `dense_rank()`, `cume_dist()`, `percent_rank()`, and `ntile()`. These functions all take a vector to order by, and return various types of ranks. * Offsets `lead()` and `lag()` allow you to access the previous and next values in a vector, making it easy to compute differences and trends. The other three families are variations on familiar aggregate functions: * Cumulative aggregates: `cumsum()`, `cummin()`, `cummax()` (from base R), and `cumall()`, `cumany()`, and `cummean()` (from dplyr). * Rolling aggregates operate in a fixed width window. You won't find them in base R or in dplyr, but there are many implementations in other packages, such as [RcppRoll](https://cran.r-project.org/package=RcppRoll). * Recycled aggregates, where an aggregate is repeated to match the length of the input. These are not needed in R because vector recycling automatically recycles aggregates where needed. They are important in SQL, because the presence of an aggregation function usually tells the database to return only one row per group. Each family is described in more detail below, focussing on the general goals and how to use them with dplyr. For more details, refer to the individual function documentation. ## Ranking functions The ranking functions are variations on a theme, differing in how they handle ties: ```{r} x <- c(1, 1, 2, 2, 2) row_number(x) min_rank(x) dense_rank(x) ``` If you're familiar with R, you may recognise that `row_number()` and `min_rank()` can be computed with the base `rank()` function and various values of the `ties.method` argument. These functions are provided to save a little typing, and to make it easier to convert between R and SQL. Two other ranking functions return numbers between 0 and 1. `percent_rank()` gives the percentage of the rank; `cume_dist()` gives the proportion of values less than or equal to the current value. ```{r} cume_dist(x) percent_rank(x) ``` These are useful if you want to select (for example) the top 10% of records within each group. For example: ```{r} filter(players, cume_dist(desc(G)) < 0.1) ``` Finally, `ntile()` divides the data up into `n` evenly sized buckets. It's a coarse ranking, and it can be used in with `mutate()` to divide the data into buckets for further summary. For example, we could use `ntile()` to divide the players within a team into four ranked groups, and calculate the average number of games within each group. ```{r} by_team_player <- group_by(batting, teamID, playerID) by_team <- summarise(by_team_player, G = sum(G)) by_team_quartile <- group_by(by_team, quartile = ntile(G, 4)) summarise(by_team_quartile, mean(G)) ``` All ranking functions rank from lowest to highest so that small input values get small ranks. Use `desc()` to rank from highest to lowest. ## Lead and lag `lead()` and `lag()` produce offset versions of a input vector that is either ahead of or behind the original vector. ```{r} x <- 1:5 lead(x) lag(x) ``` You can use them to: * Compute differences or percent changes. ```{r, results = "hide"} # Compute the relative change in games played mutate(players, G_delta = G - lag(G)) ``` Using `lag()` is more convenient than `diff()` because for `n` inputs `diff()` returns `n - 1` outputs. * Find out when a value changes. ```{r, results = "hide"} # Find when a player changed teams filter(players, teamID != lag(teamID)) ``` `lead()` and `lag()` have an optional argument `order_by`. If set, instead of using the row order to determine which value comes before another, they will use another variable. This important if you have not already sorted the data, or you want to sort one way and lag another. Here's a simple example of what happens if you don't specify `order_by` when you need it: ```{r} df <- data.frame(year = 2000:2005, value = (0:5) ^ 2) scrambled <- df[sample(nrow(df)), ] wrong <- mutate(scrambled, running = cumsum(value)) arrange(wrong, year) right <- mutate(scrambled, running = order_by(year, cumsum(value))) arrange(right, year) ``` ## Cumulative aggregates Base R provides cumulative sum (`cumsum()`), cumulative min (`cummin()`) and cumulative max (`cummax()`). (It also provides `cumprod()` but that is rarely useful). Other common accumulating functions are `cumany()` and `cumall()`, cumulative versions of `||` and `&&`, and `cummean()`, a cumulative mean. These are not included in base R, but efficient versions are provided by `dplyr`. `cumany()` and `cumall()` are useful for selecting all rows up to, or all rows after, a condition is true for the first (or last) time. For example, we can use `cumany()` to find all records for a player after they played a year with 150 games: ```{r, eval = FALSE} filter(players, cumany(G > 150)) ``` Like lead and lag, you may want to control the order in which the accumulation occurs. None of the built in functions have an `order_by` argument so `dplyr` provides a helper: `order_by()`. You give it the variable you want to order by, and then the call to the window function: ```{r} x <- 1:10 y <- 10:1 order_by(y, cumsum(x)) ``` This function uses a bit of non-standard evaluation, so I wouldn't recommend using it inside another function; use the simpler but less concise `with_order()` instead. ## Recycled aggregates R's vector recycling make it easy to select values that are higher or lower than a summary. I call this a recycled aggregate because the value of the aggregate is recycled to be the same length as the original vector. Recycled aggregates are useful if you want to find all records greater than the mean or less than the median: ```{r, eval = FALSE} filter(players, G > mean(G)) filter(players, G < median(G)) ``` While most SQL databases don't have an equivalent of `median()` or `quantile()`, when filtering you can achieve the same effect with `ntile()`. For example, `x > median(x)` is equivalent to `ntile(x, 2) == 2`; `x > quantile(x, 75)` is equivalent to `ntile(x, 100) > 75` or `ntile(x, 4) > 3`. ```{r, eval = FALSE} filter(players, ntile(G, 2) == 2) ``` You can also use this idea to select the records with the highest (`x == max(x)`) or lowest value (`x == min(x)`) for a field, but the ranking functions give you more control over ties, and allow you to select any number of records. Recycled aggregates are also useful in conjunction with `mutate()`. For example, with the batting data, we could compute the "career year", the number of years a player has played since they entered the league: ```{r} mutate(players, career_year = yearID - min(yearID) + 1) ``` Or, as in the introductory example, we could compute a z-score: ```{r} mutate(players, G_z = (G - mean(G)) / sd(G)) ``` dplyr/inst/doc/two-table.Rmd0000644000176200001440000002070513135665123015501 0ustar liggesusers--- title: "Two-table verbs" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Two-table verbs} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 5) library(dplyr) knit_print.tbl_df <- function(x, options) { knitr::knit_print(trunc_mat(x), options) } ``` It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time: * Mutating joins, which add new variables to one table from matching rows in another. * Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table. * Set operations, which combine the observations in the data sets as if they were set elements. (This discussion assumes that you have [tidy data](http://www.jstatsoft.org/v59/i10/), where the rows are observations and the columns are variables. If you're not familiar with that framework, I'd recommend reading up on it first.) All two-table verbs work similarly. The first two arguments are `x` and `y`, and provide the tables to combine. The output is always a new table with the same type as `x`. ## Mutating joins Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13 data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data: ```{r, warning = FALSE} library("nycflights13") # Drop unimportant variables so it's easier to understand the join results. flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier) flights2 %>% left_join(airlines) ``` ### Controlling how the tables are matched As well as `x` and `y`, each mutating join takes an argument `by` that controls which variables are used to match observations in the two tables. There are a few ways to specify it, as I illustrate below with various tables from nycflights13: * `NULL`, the default. dplyr will will use all variables that appear in both tables, a __natural__ join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin. ```{r} flights2 %>% left_join(weather) ``` * A character vector, `by = "x"`. Like a natural join, but uses only some of the common variables. For example, `flights` and `planes` have `year` columns, but they mean different things so we only want to join by `tailnum`. ```{r} flights2 %>% left_join(planes, by = "tailnum") ``` Note that the year columns in the output are disambiguated with a suffix. * A named character vector: `by = c("x" = "a")`. This will match variable `x` in table `x` to variable `a` in table `b`. The variables from use will be used in the output. Each flight has an origin and destination `airport`, so we need to specify which one we want to join to: ```{r} flights2 %>% left_join(airports, c("dest" = "faa")) flights2 %>% left_join(airports, c("origin" = "faa")) ``` ### Types of join There are four types of mutating join, which differ in their behaviour when a match is not found. We'll illustrate each with a simple example: ```{r} (df1 <- data_frame(x = c(1, 2), y = 2:1)) (df2 <- data_frame(x = c(1, 3), a = 10, b = "a")) ``` * `inner_join(x, y)` only includes observations that match in both `x` and `y`. ```{r} df1 %>% inner_join(df2) %>% knitr::kable() ``` * `left_join(x, y)` includes all observations in `x`, regardless of whether they match or not. This is the most commonly used join because it ensures that you don't lose observations from your primary table. ```{r} df1 %>% left_join(df2) ``` * `right_join(x, y)` includes all observations in `y`. It's equivalent to `left_join(y, x)`, but the columns will be ordered differently. ```{r} df1 %>% right_join(df2) df2 %>% left_join(df1) ``` * `full_join()` includes all observations from `x` and `y`. ```{r} df1 %>% full_join(df2) ``` The left, right and full joins are collectively know as __outer joins__. When a row doesn't match in an outer join, the new variables are filled in with missing values. ### Observations While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations: ```{r} df1 <- data_frame(x = c(1, 1, 2), y = 1:3) df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a")) df1 %>% left_join(df2) ``` ## Filtering joins Filtering joins match obserations in the same way as mutating joins, but affect the observations, not the variables. There are two types: * `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`. * `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`. These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don't have a matching tail number in the planes table: ```{r} library("nycflights13") flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE) ``` If you're worried about what observations your joins will match, start with a `semi_join()` or `anti_join()`. `semi_join()` and `anti_join()` never duplicate; they only ever remove observations. ```{r} df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4) df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a")) # Four rows to start with: df1 %>% nrow() # And we get four rows after the join df1 %>% inner_join(df2, by = "x") %>% nrow() # But only two rows actually match df1 %>% semi_join(df2, by = "x") %>% nrow() ``` ## Set operations The final type of two-table verb is set operations. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets: * `intersect(x, y)`: return only observations in both `x` and `y` * `union(x, y)`: return unique observations in `x` and `y` * `setdiff(x, y)`: return observations in `x`, but not in `y`. Given this simple data: ```{r} (df1 <- data_frame(x = 1:2, y = c(1L, 1L))) (df2 <- data_frame(x = 1:2, y = 1:2)) ``` The four possibilities are: ```{r} intersect(df1, df2) # Note that we get 3 rows, not 4 union(df1, df2) setdiff(df1, df2) setdiff(df2, df1) ``` ## Coercion rules When joining tables, dplyr is a little more conservative than base R about the types of variable that it considers equivalent. This is mostly likely to surprise if you're working factors: * Factors with different levels are coerced to character with a warning: ```{r} df1 <- data_frame(x = 1, y = factor("a")) df2 <- data_frame(x = 2, y = factor("b")) full_join(df1, df2) %>% str() ``` * Factors with the same levels in a different order are coerced to character with a warning: ```{r} df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b"))) df2 <- data_frame(x = 2, y = factor("b", levels = c("b", "a"))) full_join(df1, df2) %>% str() ``` * Factors are preserved only if the levels match exactly: ```{r} df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b"))) df2 <- data_frame(x = 2, y = factor("b", levels = c("a", "b"))) full_join(df1, df2) %>% str() ``` * A factor and a character are coerced to character with a warning: ```{r} df1 <- data_frame(x = 1, y = "a") df2 <- data_frame(x = 2, y = factor("a")) full_join(df1, df2) %>% str() ``` Otherwise logicals will be silently upcast to integer, and integer to numeric, but coercing to character will raise an error: ```{r, error = TRUE, purl = FALSE} df1 <- data_frame(x = 1, y = 1L) df2 <- data_frame(x = 2, y = 1.5) full_join(df1, df2) %>% str() df1 <- data_frame(x = 1, y = 1L) df2 <- data_frame(x = 2, y = "a") full_join(df1, df2) %>% str() ``` ## Multiple-table verbs dplyr does not provide any functions for working with three or more tables. Instead use `purrr::reduce()` or `Reduce()`, as described in [Advanced R](http://adv-r.had.co.nz/Functionals.html#functionals-fp), to iteratively combine the two-table verbs to handle as many tables as you need. dplyr/inst/doc/compatibility.R0000644000176200001440000000673613157241155016142 0ustar liggesusers## ----setup, include = FALSE---------------------------------------------- library(dplyr) knitr::opts_chunk$set(collapse = T, comment = "#>") ## ---- results = "hide"--------------------------------------------------- if (utils::packageVersion("dplyr") > "0.5.0") { # code for new version } else { # code for old version } ## ---- eval = FALSE------------------------------------------------------- # if (utils::packageVersion("dplyr") > "0.5.0") { # dbplyr::build_sql(...) # } else { # dplyr::build_sql(...) # } ## ------------------------------------------------------------------------ #' @rawNamespace #' if (utils::packageVersion("dplyr") > "0.5.0") { #' importFrom("dbplyr", "build_sql") #' } else { #' importFrom("dplyr", "build_sql") #' } ## ---- eval = FALSE------------------------------------------------------- # wrap_dbplyr_obj("build_sql") # # wrap_dbplyr_obj("base_agg") ## ---- eval = FALSE------------------------------------------------------- # quo <- quo(cyl) # select(mtcars, !! quo) ## ---- results = "hide"--------------------------------------------------- sym <- quote(cyl) select(mtcars, !! sym) call <- quote(mean(cyl)) summarise(mtcars, !! call) ## ------------------------------------------------------------------------ quo(!! sym) quo(!! call) rlang::as_quosure(sym) rlang::as_quosure(call) ## ------------------------------------------------------------------------ f <- ~cyl f rlang::as_quosure(f) ## ------------------------------------------------------------------------ rlang::sym("cyl") rlang::syms(letters[1:3]) ## ------------------------------------------------------------------------ syms <- rlang::syms(c("foo", "bar", "baz")) quo(my_call(!!! syms)) fun <- rlang::sym("my_call") quo(UQ(fun)(!!! syms)) ## ------------------------------------------------------------------------ call <- rlang::lang("my_call", !!! syms) call rlang::as_quosure(call) # Or equivalently: quo(!! rlang::lang("my_call", !!! syms)) ## ---- eval=FALSE--------------------------------------------------------- # lazyeval::interp(~ mean(var), var = rlang::sym("mpg")) ## ---- eval=FALSE--------------------------------------------------------- # var <- "mpg" # quo(mean(!! rlang::sym(var))) ## ---- eval = FALSE------------------------------------------------------- # filter_.tbl_df <- function(.data, ..., .dots = list()) { # dots <- compat_lazy_dots(.dots, caller_env(), ...) # filter(.data, !!! dots) # } ## ---- eval = FALSE------------------------------------------------------- # filter.default <- function(.data, ...) { # filter_(.data, .dots = compat_as_lazy_dots(...)) # } ## ---- eval = FALSE------------------------------------------------------- # filter.sf <- function(.data, ...) { # st_as_sf(NextMethod()) # } ## ---- eval = FALSE------------------------------------------------------- # mutate_each(starwars, funs(as.character)) # mutate_all(starwars, funs(as.character)) ## ---- eval = FALSE------------------------------------------------------- # mutate_all(starwars, as.character) ## ---- eval = FALSE------------------------------------------------------- # mutate_each(starwars, funs(as.character), height, mass) # mutate_at(starwars, vars(height, mass), as.character) ## ---- eval = FALSE------------------------------------------------------- # summarise_at(mtcars, vars(starts_with("d")), mean) ## ---- eval = FALSE------------------------------------------------------- # mutate_at(starwars, c("height", "mass"), as.character) dplyr/inst/include/0000755000176200001440000000000013150340402013774 5ustar liggesusersdplyr/inst/include/dplyr_types.h0000644000176200001440000000032613135665123016541 0ustar liggesusers#include #include #include #include #include // avoid inclusion of package header file #define dplyr_dplyr_H dplyr/inst/include/dplyr.h0000644000176200001440000000020113135665123015305 0ustar liggesusers#ifndef dplyr_dplyr_H #define dplyr_dplyr_H #include #include #include #endif dplyr/inst/include/dplyr_RcppExports.h0000644000176200001440000000656313150340402017662 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #ifndef RCPP_dplyr_RCPPEXPORTS_H_GEN_ #define RCPP_dplyr_RCPPEXPORTS_H_GEN_ #include "dplyr_types.h" #include namespace dplyr { using namespace Rcpp; namespace { void validateSignature(const char* sig) { Rcpp::Function require = Rcpp::Environment::base_env()["require"]; require("dplyr", Rcpp::Named("quietly") = true); typedef int(*Ptr_validate)(const char*); static Ptr_validate p_validate = (Ptr_validate) R_GetCCallable("dplyr", "_dplyr_RcppExport_validate"); if (!p_validate(sig)) { throw Rcpp::function_not_exported( "C++ function with signature '" + std::string(sig) + "' not found in dplyr"); } } } inline SEXP get_date_classes() { typedef SEXP(*Ptr_get_date_classes)(); static Ptr_get_date_classes p_get_date_classes = NULL; if (p_get_date_classes == NULL) { validateSignature("SEXP(*get_date_classes)()"); p_get_date_classes = (Ptr_get_date_classes)R_GetCCallable("dplyr", "_dplyr_get_date_classes"); } RObject rcpp_result_gen; { RNGScope RCPP_rngScope_gen; rcpp_result_gen = p_get_date_classes(); } if (rcpp_result_gen.inherits("interrupted-error")) throw Rcpp::internal::InterruptedException(); if (rcpp_result_gen.inherits("try-error")) throw Rcpp::exception(as(rcpp_result_gen).c_str()); return Rcpp::as(rcpp_result_gen); } inline SEXP get_time_classes() { typedef SEXP(*Ptr_get_time_classes)(); static Ptr_get_time_classes p_get_time_classes = NULL; if (p_get_time_classes == NULL) { validateSignature("SEXP(*get_time_classes)()"); p_get_time_classes = (Ptr_get_time_classes)R_GetCCallable("dplyr", "_dplyr_get_time_classes"); } RObject rcpp_result_gen; { RNGScope RCPP_rngScope_gen; rcpp_result_gen = p_get_time_classes(); } if (rcpp_result_gen.inherits("interrupted-error")) throw Rcpp::internal::InterruptedException(); if (rcpp_result_gen.inherits("try-error")) throw Rcpp::exception(as(rcpp_result_gen).c_str()); return Rcpp::as(rcpp_result_gen); } inline DataFrame build_index_cpp(DataFrame data) { typedef SEXP(*Ptr_build_index_cpp)(SEXP); static Ptr_build_index_cpp p_build_index_cpp = NULL; if (p_build_index_cpp == NULL) { validateSignature("DataFrame(*build_index_cpp)(DataFrame)"); p_build_index_cpp = (Ptr_build_index_cpp)R_GetCCallable("dplyr", "_dplyr_build_index_cpp"); } RObject rcpp_result_gen; { RNGScope RCPP_rngScope_gen; rcpp_result_gen = p_build_index_cpp(Shield(Rcpp::wrap(data))); } if (rcpp_result_gen.inherits("interrupted-error")) throw Rcpp::internal::InterruptedException(); if (rcpp_result_gen.inherits("try-error")) throw Rcpp::exception(as(rcpp_result_gen).c_str()); return Rcpp::as(rcpp_result_gen); } } #endif // RCPP_dplyr_RCPPEXPORTS_H_GEN_ dplyr/inst/include/solaris/0000755000176200001440000000000013135665123015465 5ustar liggesusersdplyr/inst/include/solaris/solaris.h0000644000176200001440000000115313135665123017312 0ustar liggesusers#ifndef DPLYR_SOLARIS_H #define DPLYR_SOLARIS_H #if defined(__SUNPRO_CC) && !defined(Rcpp__platform__solaris_h) namespace Rcpp { namespace traits { template struct is_convertible< std::vector, SEXP> : public false_type {}; template <> struct is_convertible : public false_type {}; template struct is_convertible< sugar::Minus_Vector_Primitive< RTYPE, NA, Vector >, SEXP> : public false_type {}; template struct is_convertible< sugar::Plus_Vector_Primitive< RTYPE, NA, Vector >, SEXP> : public false_type {}; } } #endif #endif dplyr/inst/include/tools/0000755000176200001440000000000013153522117015144 5ustar liggesusersdplyr/inst/include/tools/scalar_type.h0000644000176200001440000000046613135665123017636 0ustar liggesusers#ifndef DPLYR_SCALAR_TYPE_H #define DPLYR_SCALAR_TYPE_H namespace dplyr { namespace traits { template struct scalar_type { typedef typename Rcpp::traits::storage_type::type type; }; template <> struct scalar_type { typedef String type; }; } } #endif //DPLYR_SCALAR_TYPE_H dplyr/inst/include/tools/SymbolVector.h0000644000176200001440000000261413152715725017760 0ustar liggesusers#ifndef dplyr_tools_SymbolVector_h #define dplyr_tools_SymbolVector_h #include #include namespace dplyr { class SymbolVector { public: SymbolVector() {} template explicit SymbolVector(T v_) : v(v_) {} explicit SymbolVector(SEXP x) : v(init(x)) {} explicit SymbolVector(RObject x) : v(init(x)) {} public: void push_back(const SymbolString& s) { v.push_back(s.get_string()); } void remove(const R_xlen_t idx) { v.erase(v.begin() + idx); } const SymbolString operator[](const R_xlen_t i) const { return SymbolString(v[i]); } void set(int i, const SymbolString& x) { v[i] = x.get_string(); } R_xlen_t size() const { return v.size(); } int match(const SymbolString& s) const { CharacterVector vs = CharacterVector::create(s.get_string()); return as(match(vs)); } const IntegerVector match(const CharacterVector& m) const { return r_match(m, v); } const IntegerVector match_in_table(const CharacterVector& t) const { return r_match(v, t); } const CharacterVector get_vector() const { return v; } private: CharacterVector v; SEXP init(SEXP x) { if (Rf_isNull(x)) return CharacterVector(); else return x; } }; } namespace Rcpp { using namespace dplyr; template <> inline SEXP wrap(const SymbolVector& x) { return x.get_vector(); } } #endif dplyr/inst/include/tools/SlicingIndex.h0000644000176200001440000000574613135665123017716 0ustar liggesusers#ifndef dplyr_tools_SlicingIndex_H #define dplyr_tools_SlicingIndex_H // A SlicingIndex allows specifying which rows of a data frame are selected in which order, basically a 0:n -> 0:m map. // It also can be used to split a data frame in groups. // Important special cases can be implemented without materializing the map. class SlicingIndex { public: virtual int size() const = 0; virtual int operator[](int i) const = 0; virtual int group() const = 0; virtual bool is_identity(SEXP) const { return FALSE; }; }; // A GroupedSlicingIndex is the most general slicing index, // the 0:n -> 0:m map is specified and stored as an IntegerVector. // A group identifier can be assigned on construction. // It is used in grouped operations (group_by()). class GroupedSlicingIndex : public SlicingIndex { public: GroupedSlicingIndex(IntegerVector data_) : data(data_), group_index(-1) {} GroupedSlicingIndex(IntegerVector data_, int group_) : data(data_), group_index(group_) {} virtual int size() const { return data.size(); } virtual int operator[](int i) const { return data[i]; } virtual int group() const { return group_index; } private: IntegerVector data; int group_index; }; // A RowwiseSlicingIndex selects a single row, which is also the group ID by definition. // It is used in rowwise operations (rowwise()). class RowwiseSlicingIndex : public SlicingIndex { public: RowwiseSlicingIndex(const int start_) : start(start_) {} inline int size() const { return 1; } inline int operator[](int i) const { if (i != 0) stop("Can only use 0 for RowwiseSlicingIndex, queried %d", i); return start; } inline int group() const { return start; } private: int start; }; // A NaturalSlicingIndex selects an entire data frame as a single group. // It is used when the entire data frame needs to be processed by a processor that expects a SlicingIndex // to address the rows. class NaturalSlicingIndex : public SlicingIndex { public: NaturalSlicingIndex(const int n_) : n(n_) {} virtual int size() const { return n; } virtual int operator[](int i) const { if (i < 0 || i >= n) stop("Out of bounds index %d queried for NaturalSlicingIndex", i); return i; } virtual int group() const { return -1; } virtual bool is_identity(SEXP x) const { const R_len_t length = Rf_length(x); return length == n; } private: int n; }; // An OffsetSlicingIndex selects a consecutive part of a data frame, starting at a specific row. // It is used for binding data frames vertically (bind_rows()). class OffsetSlicingIndex : public SlicingIndex { public: OffsetSlicingIndex(const int start_, const int n_) : start(start_), n(n_) {} inline int size() const { return n; } inline int operator[](int i) const { if (i < 0 || i >= n) stop("Out of bounds index %d queried for OffsetSlicingIndex", i); return i + start; } inline int group() const { return -1; } private: int start, n; }; #endif dplyr/inst/include/tools/Quosure.h0000644000176200001440000000440513135665123016770 0ustar liggesusers#ifndef dplyr__Quosure_h #define dplyr__Quosure_h #include #include "SymbolVector.h" namespace dplyr { inline SEXP quosure(SEXP expr, SEXP env) { Language quo("~", expr); quo.attr(".Environment") = env; quo.attr("class") = CharacterVector("formula"); return quo; } class NamedQuosure { public: NamedQuosure(SEXP data_, SymbolString name__ = "") : data(data_), name_(name__) {} NamedQuosure(const Formula& data_, SymbolString name__ = "") : data(data_), name_(name__) {} NamedQuosure(const NamedQuosure& other) : data(other.data), name_(other.name_) {} SEXP expr() const { return Rf_duplicate(CADR(data)); } SEXP env() const { static SEXP sym_dotenv = Rf_install(".Environment"); return Rf_getAttrib(data, sym_dotenv); } SymbolString name() const { return name_; } private: Formula data; SymbolString name_; }; } // namespace dplyr namespace Rcpp { using namespace dplyr; template <> inline bool is(SEXP x) { bool is_tilde = TYPEOF(x) == LANGSXP && Rf_length(x) == 2 && CAR(x) == Rf_install("~"); SEXP env = Rf_getAttrib(x, Rf_install(".Environment")); bool has_env = TYPEOF(env) == ENVSXP; return is_tilde && has_env; } } // namespace Rcpp namespace dplyr { class QuosureList { public: QuosureList(const List& data_) : data() { int n = data_.size(); if (n == 0) return; CharacterVector names = data_.names(); for (int i = 0; i < n; i++) { SEXP x = data_[i]; if (!is(x)) { stop("corrupt tidy quote"); } data.push_back(NamedQuosure(x, SymbolString(names[i]))); } } const NamedQuosure& operator[](int i) const { return data[i]; } int size() const { return data.size(); } bool single_env() const { if (data.size() <= 1) return true; SEXP env = data[0].env(); for (size_t i = 1; i < data.size(); i++) { if (data[i].env() != env) return false; } return true; } SymbolVector names() const { CharacterVector out(data.size()); for (size_t i = 0; i < data.size(); ++i) { out[i] = data[i].name().get_string(); } return SymbolVector(out); } private: std::vector data; }; } // namespace dplyr #endif dplyr/inst/include/tools/SymbolMap.h0000644000176200001440000000505513135665123017232 0ustar liggesusers#ifndef dplyr_tools_SymbolMap_h #define dplyr_tools_SymbolMap_h #include #include namespace dplyr { enum Origin { HASH, RMATCH, NEW }; struct SymbolMapIndex { int pos; Origin origin; SymbolMapIndex(int pos_, Origin origin_) : pos(pos_), origin(origin_) {} }; class SymbolMap { private: dplyr_hash_map lookup; SymbolVector names; public: SymbolMap(): lookup(), names() {} SymbolMap(const SymbolVector& names_): lookup(), names(names_) {} SymbolMapIndex insert(const SymbolString& name) { SymbolMapIndex index = get_index(name); int idx = index.pos; switch (index.origin) { case HASH: break; case RMATCH: lookup.insert(std::make_pair(name.get_sexp(), idx)); break; case NEW: names.push_back(name.get_string()); lookup.insert(std::make_pair(name.get_sexp(), idx)); break; }; return index; } SymbolVector get_names() const { return names; } SymbolString get_name(const int i) const { return names[i]; } int size() const { return names.size(); } bool has(const SymbolString& name) const { SymbolMapIndex index = get_index(name); return index.origin != NEW; } SymbolMapIndex get_index(const SymbolString& name) const { // first, lookup the map dplyr_hash_map::const_iterator it = lookup.find(name.get_sexp()); if (it != lookup.end()) { return SymbolMapIndex(it->second, HASH); } int idx = names.match(name); if (idx != NA_INTEGER) { // we have a match return SymbolMapIndex(idx - 1, RMATCH); } // no match return SymbolMapIndex(names.size(), NEW); } int get(const SymbolString& name) const { SymbolMapIndex index = get_index(name); if (index.origin == NEW) { stop("variable '%s' not found", name.get_utf8_cstring()); } return index.pos; } SymbolMapIndex rm(const SymbolString& name) { SymbolMapIndex index = get_index(name); if (index.origin != NEW) { int idx = index.pos; names.remove(idx); for (dplyr_hash_map::iterator it = lookup.begin(); it != lookup.end();) { int k = it->second; if (k < idx) { // nothing to do in that case ++it; continue; } else if (k == idx) { // need to remove the data from the hash table it = lookup.erase(it); continue; } else { // decrement the index it->second--; ++it; } } } return index; } }; } #endif dplyr/inst/include/tools/SymbolString.h0000644000176200001440000000237113150340402017744 0ustar liggesusers#ifndef dplyr_tools_SymbolString_h #define dplyr_tools_SymbolString_h #include namespace dplyr { class SymbolString { public: SymbolString() {} SymbolString(const char* str) : s(str) {} SymbolString(const String& other) : s(other) {} SymbolString(const String::StringProxy& other) : s(other) {} SymbolString(const String::const_StringProxy& other) : s(other) {} // Symbols are always encoded in the native encoding (#1950) explicit SymbolString(const Symbol& symbol) : s(CHAR(PRINTNAME(symbol)), CE_NATIVE) {} public: const String& get_string() const { return s; } const Symbol get_symbol() const { return Symbol(Rf_translateChar(s.get_sexp())); } const std::string get_utf8_cstring() const { static Environment rlang = Environment::namespace_env("rlang"); static Function as_string = Function("as_string", rlang); SEXP utf8_string = as_string(Rf_lang2(R_QuoteSymbol, get_symbol())); return CHAR(STRING_ELT(utf8_string, 0)); } bool is_empty() const { return s == ""; } SEXP get_sexp() const { return s.get_sexp(); } bool operator==(const SymbolString& other) const { return Rf_NonNullStringMatch(get_sexp(), other.get_sexp()); } private: String s; }; } #endif dplyr/inst/include/tools/rlang-export.h0000644000176200001440000000067213135665123017751 0ustar liggesusers#ifndef RLANG_EXPORT_H #define RLANG_EXPORT_H #define R_NO_REMAP #include #include #include #if (defined(R_VERSION) && R_VERSION < R_Version(3, 4, 0)) typedef union { void* p; DL_FUNC fn; } fn_ptr; SEXP R_MakeExternalPtrFn(DL_FUNC p, SEXP tag, SEXP prot); DL_FUNC R_ExternalPtrAddrFn(SEXP s); #endif void rlang_register_pointer(const char* ns, const char* ptr_name, DL_FUNC fn); #endif dplyr/inst/include/tools/wrap_subset.h0000644000176200001440000000072313135665123017662 0ustar liggesusers#ifndef dplyr_wrap_subset_H #define dplyr_wrap_subset_H namespace dplyr { template SEXP wrap_subset(SEXP input, const Container& indices) { int n = indices.size(); Rcpp::Vector res = Rcpp::no_init(n); typedef typename Rcpp::traits::storage_type::type STORAGE; STORAGE* ptr = Rcpp::internal::r_vector_start(input); for (int i = 0; i < n; i++) res[i] = ptr[ indices[i] ]; return res; } } #endif dplyr/inst/include/tools/hash.h0000644000176200001440000000125313135665123016246 0ustar liggesusers#ifndef dplyr_HASH_H #define dplyr_HASH_H #include #ifndef dplyr_hash_map #if defined(_WIN32) #define dplyr_hash_map RCPP_UNORDERED_MAP #else #include #define dplyr_hash_map boost::unordered_map #endif #endif // #ifndef dplyr_hash_map #ifndef dplyr_hash_set #if defined(_WIN32) #define dplyr_hash_set RCPP_UNORDERED_SET #else #include #define dplyr_hash_set boost::unordered_set #endif #endif // #ifndef dplyr_hash_set inline std::size_t hash_value(const Rcomplex& cx) { boost::hash hasher; size_t seed = hasher(cx.r); boost::hash_combine(seed, hasher(cx.i)); return seed; } #endif dplyr/inst/include/tools/tools.h0000644000176200001440000000037513135665123016467 0ustar liggesusers#ifndef dplyr_tools_tools_H #define dplyr_tools_tools_H #include #include #include #include #include #include #include #endif dplyr/inst/include/tools/match.h0000644000176200001440000000113713156604776016432 0ustar liggesusers#ifndef dplyr_tools_match_h #define dplyr_tools_match_h namespace dplyr { inline IntegerVector r_match(SEXP x, SEXP y, SEXP incomparables = R_NilValue) { static Function match("match", R_BaseEnv); if (R_VERSION == R_Version(3, 3, 0)) { // Work around matching bug in R 3.3.0: #1806 // https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16885 if (Rf_isNull(incomparables)) { return match(x, y, NA_INTEGER, LogicalVector()); } else { return match(x, y, NA_INTEGER, incomparables); } } else { return match(x, y, NA_INTEGER, incomparables); } } } #endif dplyr/inst/include/tools/encoding.h0000644000176200001440000000411213150340402017071 0ustar liggesusers#ifndef DPLYR_ENCODING_H #define DPLYR_ENCODING_H #define TYPE_BITS 5 #define BYTES_MASK (1<<1) #define LATIN1_MASK (1<<2) #define UTF8_MASK (1<<3) #define ASCII_MASK (1<<6) struct sxpinfo_struct { // *INDENT-OFF* SEXPTYPE type : TYPE_BITS;/* ==> (FUNSXP == 99) %% 2^5 == 3 == CLOSXP * -> warning: `type' is narrower than values * of its type * when SEXPTYPE was an enum */ // *INDENT-ON* unsigned int obj : 1; unsigned int named : 2; unsigned int gp : 16; unsigned int mark : 1; unsigned int debug : 1; unsigned int trace : 1; /* functions and memory tracing */ unsigned int spare : 1; /* currently unused */ unsigned int gcgen : 1; /* old generation number */ unsigned int gccls : 3; /* node class */ }; /* Tot: 32 */ #ifndef IS_BYTES #define IS_BYTES(x) (reinterpret_cast(x)->gp & BYTES_MASK) #endif #ifndef IS_LATIN1 #define IS_LATIN1(x) (reinterpret_cast(x)->gp & LATIN1_MASK) #endif #ifndef IS_ASCII #define IS_ASCII(x) (reinterpret_cast(x)->gp & ASCII_MASK) #endif #ifndef IS_UTF8 #define IS_UTF8(x) (reinterpret_cast(x)->gp & UTF8_MASK) #endif // that bit seems unused by R. Just using it to mark // objects as Shrinkable Vectors // that is useful for things like summarise(list(x)) where x is a // variable from the data, because the SEXP that goes into the list // is the shrinkable vector, we use this information to duplicate // it if needed. See the maybe_copy method in DelayedProcessor #define DPLYR_SHRINKABLE_MASK (1<<8) #define IS_DPLYR_SHRINKABLE_VECTOR(x) (reinterpret_cast(x)->gp & DPLYR_SHRINKABLE_MASK) #define SET_DPLYR_SHRINKABLE_VECTOR(x) (reinterpret_cast(x)->gp |= DPLYR_SHRINKABLE_MASK) #define UNSET_DPLYR_SHRINKABLE_VECTOR(x) (reinterpret_cast(x)->gp &= (~DPLYR_SHRINKABLE_MASK) ) namespace dplyr { CharacterVector reencode_factor(IntegerVector x); CharacterVector reencode_char(SEXP x); } #endif dplyr/inst/include/tools/utils.h0000644000176200001440000000250113153522117016453 0ustar liggesusers#ifndef dplyr_tools_utils_H #define dplyr_tools_utils_H #include void assert_all_white_list(const DataFrame&); SEXP shared_SEXP(SEXP x); SEXP shallow_copy(const List& data); SEXP pairlist_shallow_copy(SEXP p); void copy_attributes(SEXP out, SEXP data); void strip_index(DataFrame x); SEXP null_if_empty(SEXP x); bool is_vector(SEXP x); bool is_atomic(SEXP x); SEXP vec_names(SEXP x); bool is_str_empty(SEXP str); bool has_name_at(SEXP x, R_len_t i); SEXP name_at(SEXP x, size_t i); SEXP f_env(SEXP x); bool is_quosure(SEXP x); SEXP maybe_rhs(SEXP x); namespace dplyr { std::string get_single_class(SEXP x); CharacterVector default_chars(SEXP x, R_xlen_t len); CharacterVector get_class(SEXP x); SEXP set_class(SEXP x, const CharacterVector& class_); CharacterVector get_levels(SEXP x); SEXP set_levels(SEXP x, const CharacterVector& levels); bool same_levels(SEXP left, SEXP right); bool character_vector_equal(const CharacterVector& x, const CharacterVector& y); SymbolVector get_vars(SEXP x, bool duplicate = false); void set_vars(SEXP x, const SymbolVector& vars); void copy_vars(SEXP target, SEXP source); // effectively the same as copy_attributes but without names and dims inline void copy_most_attributes(SEXP out, SEXP data) { Rf_copyMostAttrib(data, out); } } #endif // #ifndef dplyr_tools_utils_H dplyr/inst/include/tools/ShrinkableVector.h0000644000176200001440000000223013150340402020547 0ustar liggesusers#ifndef dplyr_ShrinkableVector_H #define dplyr_ShrinkableVector_H #include #include namespace Rcpp { template class ShrinkableVector { public: typedef typename Rcpp::traits::storage_type::type STORAGE; ShrinkableVector(int n, SEXP origin) : data(no_init(n)), max_size(n), start(internal::r_vector_start(data)), gp(LEVELS(data)) { copy_most_attributes(data, origin); SET_DPLYR_SHRINKABLE_VECTOR((SEXP)data); } inline void resize(int n) { SETLENGTH(data, n); } inline operator SEXP() const { return data; } inline void borrow(const SlicingIndex& indices, STORAGE* begin) { int n = indices.size(); for (int i = 0; i < n; i++) { start[i] = begin[indices[i]]; } SETLENGTH(data, n); } ~ShrinkableVector() { // restore the initial length so that R can reclaim the memory SETLENGTH(data, max_size); UNSET_DPLYR_SHRINKABLE_VECTOR((SEXP)data); } private: Rcpp::Vector data; int max_size; STORAGE* start; unsigned short gp; }; inline bool is_ShrinkableVector(SEXP x) { return IS_DPLYR_SHRINKABLE_VECTOR(x); } } #endif dplyr/inst/include/tools/debug.h0000644000176200001440000000070413135665123016411 0ustar liggesusers#ifndef dplyr_tools_debug_H #define dplyr_tools_debug_H // borrowed from Rcpp11 #ifndef RCPP_DEBUG_OBJECT #define RCPP_DEBUG_OBJECT(OBJ) Rf_PrintValue( Rf_eval( Rf_lang2( Rf_install( "str"), OBJ ), R_GlobalEnv ) ); #endif #ifndef RCPP_INSPECT_OBJECT #define RCPP_INSPECT_OBJECT(OBJ) Rf_PrintValue( Rf_eval( Rf_lang2( Rf_install( ".Internal"), Rf_lang2( Rf_install( "inspect" ), OBJ ) ), R_GlobalEnv ) ); #endif #endif // #ifndef dplyr_tools_debug_H dplyr/inst/include/tools/all_na.h0000644000176200001440000000055513120706341016545 0ustar liggesusers#ifndef dplyr_tools_all_na_H #define dplyr_tools_all_na_H template inline bool all_na_impl(const Vector& x) { return all(is_na(x)).is_true(); } template <> inline bool all_na_impl(const NumericVector& x) { return all(is_na(x) & !is_nan(x)).is_true(); } inline bool all_na(SEXP x) { RCPP_RETURN_VECTOR(all_na_impl, x); } #endif dplyr/inst/include/tools/pointer_vector.h0000644000176200001440000000211613135665123020364 0ustar liggesusers#ifndef dplyr_tools_pointer_vector_H #define dplyr_tools_pointer_vector_H namespace dplyr { template class pointer_vector { public: typedef typename std::vector Vector; typedef typename Vector::reference reference; typedef typename Vector::const_reference const_reference; typedef typename Vector::size_type size_type; typedef typename Vector::value_type value_type; typedef typename Vector::iterator iterator; pointer_vector() : data() {} pointer_vector(size_type n) : data(n) {} inline ~pointer_vector() { typedef typename Vector::size_type size_type; size_type n = data.size(); iterator it = data.end(); --it; for (size_type i = 0; i < n; --it, i++) delete *it; } inline reference operator[](size_type i) { return data[i]; } inline const_reference operator[](size_type i) const { return data[i]; } inline void push_back(const value_type& value) { data.push_back(value); } inline size_type size() const { return data.size(); } private: Vector data; pointer_vector(const pointer_vector&); }; } #endif dplyr/inst/include/tools/collapse.h0000644000176200001440000000124413135665123017125 0ustar liggesusers#ifndef dplyr_collapse_H #define dplyr_collapse_H namespace dplyr { template const char* to_string_utf8(typename Rcpp::traits::storage_type::type from) { SEXP s = Rcpp::internal::r_coerce(from); return Rf_translateCharUTF8(s); } template std::string collapse_utf8(const Vector& x, const char* sep = ", ", const char* quote = "") { std::stringstream ss; int n = x.size(); if (n > 0) { ss << quote << to_string_utf8(x[0]) << quote; for (int i = 1; i < n; i++) { const char* st = to_string_utf8(x[i]); ss << sep << quote << st << quote; } } return ss.str(); } } #endif dplyr/inst/include/tools/Call.h0000644000176200001440000000152713135665123016202 0ustar liggesusers#ifndef dplyr__Call_h #define dplyr__Call_h namespace Rcpp { class Call { public: Call() : data(R_NilValue) {} Call(SEXP x) : data(x) { if (data != R_NilValue) R_PreserveObject(data); } ~Call() { if (data != R_NilValue) R_ReleaseObject(data); } Call(const Call& other) : data(other.data) { if (data != R_NilValue) R_PreserveObject(data); } Call& operator=(SEXP other) { if (other != data) { if (data != R_NilValue) R_ReleaseObject(data); data = other; if (data != R_NilValue) R_PreserveObject(data); } return *this; } inline SEXP eval(SEXP env) const { return Rcpp_eval(data, env); } inline operator SEXP() const { return data; } private: SEXP data; Call& operator=(const Call& other); // { // *this = other.data; // return *this; // } }; } #endif dplyr/inst/include/dplyr/0000755000176200001440000000000013163257361015145 5ustar liggesusersdplyr/inst/include/dplyr/CharacterVectorOrderer.h0000644000176200001440000000051013150340402021675 0ustar liggesusers#ifndef dplyr_CharacterVectorOrderer_H #define dplyr_CharacterVectorOrderer_H #include namespace dplyr { class CharacterVectorOrderer { public: CharacterVectorOrderer(const CharacterVector& data_); inline IntegerVector get() const { return orders; } private: IntegerVector orders; }; } #endif dplyr/inst/include/dplyr/tbl_cpp.h0000644000176200001440000000135413135665123016742 0ustar liggesusers#ifndef dplyr_tools_tbl_cpp_H #define dplyr_tools_tbl_cpp_H #include namespace dplyr { template inline void set_rownames(Df& data, int n) { data.attr("row.names") = Rcpp::IntegerVector::create(Rcpp::IntegerVector::get_na(), -n); } template inline Rcpp::CharacterVector classes_grouped() { return Rcpp::CharacterVector::create("grouped_df", "tbl_df", "tbl", "data.frame"); } template <> inline Rcpp::CharacterVector classes_grouped() { return Rcpp::CharacterVector::create("rowwise_df", "tbl_df", "tbl", "data.frame"); } inline Rcpp::CharacterVector classes_not_grouped() { return Rcpp::CharacterVector::create("tbl_df", "tbl", "data.frame"); } } #endif dplyr/inst/include/dplyr/subset_visitor.h0000644000176200001440000000032513135665123020400 0ustar liggesusers#ifndef dplyr_subset_visitor_H #define dplyr_subset_visitor_H #include namespace dplyr { inline SubsetVectorVisitor* subset_visitor(SEXP vec, const SymbolString& name); } #endif dplyr/inst/include/dplyr/DataFrameColumnVisitor.h0000644000176200001440000000167113135665123021703 0ustar liggesusers#ifndef dplyr_DataFrameColumnVisitors_H #define dplyr_DataFrameColumnVisitors_H #include namespace dplyr { class DataFrameColumnVisitor : public VectorVisitor { public: DataFrameColumnVisitor(const DataFrame& data_) : data(data_), visitors(data) {} inline size_t hash(int i) const { return visitors.hash(i); } inline bool equal(int i, int j) const { return visitors.equal(i, j); } inline bool equal_or_both_na(int i, int j) const { return visitors.equal_or_both_na(i, j); } inline bool less(int i, int j) const { return visitors.less(i, j); } inline bool greater(int i, int j) const { return visitors.greater(i, j); } virtual int size() const { return visitors.nrows(); } virtual std::string get_r_type() const { return "data.frame"; } bool is_na(int) const { return false; } private: DataFrame data; DataFrameVisitors visitors; }; } #endif dplyr/inst/include/dplyr/visitor_set/0000755000176200001440000000000013135665123017515 5ustar liggesusersdplyr/inst/include/dplyr/visitor_set/VisitorSetLess.h0000644000176200001440000000112613135665123022630 0ustar liggesusers#ifndef dplyr_VisitorSetLess_H #define dplyr_VisitorSetLess_H namespace dplyr { template class VisitorSetLess { public: bool less(int i, int j) const { if (i == j) return false; const Class& obj = static_cast(*this); int n = obj.size(); for (int k = 0; k < n; k++) { typename Class::visitor_type* visitor = obj.get(k); if (! visitor->equal(i, j)) { return visitor->less(i, j); } } // if we end up here, it means rows i and j are equal // we break the tie using the indices return i < j; } }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorEqualPredicate.h0000644000176200001440000000053413135665123024140 0ustar liggesusers#ifndef dplyr_VisitorEqualPredicate_H #define dplyr_VisitorEqualPredicate_H namespace dplyr { template class VisitorEqualPredicate { public: VisitorEqualPredicate(const Visitor& v_) : v(v_) {} inline bool operator()(int i, int j) const { return v.equal_or_both_na(i, j); } private: const Visitor& v; }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetHash.h0000644000176200001440000000100713135665123022603 0ustar liggesusers#ifndef dplyr_VisitorSetHash_H #define dplyr_VisitorSetHash_H #include namespace dplyr { template class VisitorSetHash { public: size_t hash(int j) const { const Class& obj = static_cast(*this); int n = obj.size(); if (n == 0) { stop("Need at least one column for `hash()`"); } size_t seed = obj.get(0)->hash(j); for (int k = 1; k < n; k++) { boost::hash_combine(seed, obj.get(k)->hash(j)); } return seed; } }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetEqualPredicate.h0000644000176200001440000000065613135665123024621 0ustar liggesusers#ifndef dplyr_VisitorSetEqualPredicate_H #define dplyr_VisitorSetEqualPredicate_H namespace dplyr { template class VisitorSetEqualPredicate { public: VisitorSetEqualPredicate() : visitors(0) {} VisitorSetEqualPredicate(VisitorSet* visitors_) : visitors(visitors_) {}; inline bool operator()(int i, int j) const { return visitors->equal(i, j); } private: VisitorSet* visitors; }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetEqual.h0000644000176200001440000000124513135665123022773 0ustar liggesusers#ifndef dplyr_VisitorSetEqual_H #define dplyr_VisitorSetEqual_H namespace dplyr { template class VisitorSetEqual { public: bool equal(int i, int j) const { const Class& obj = static_cast(*this); if (i == j) return true; int n = obj.size(); for (int k = 0; k < n; k++) if (! obj.get(k)->equal(i, j)) return false; return true; } bool equal_or_both_na(int i, int j) const { const Class& obj = static_cast(*this); if (i == j) return true; int n = obj.size(); for (int k = 0; k < n; k++) if (! obj.get(k)->equal_or_both_na(i, j)) return false; return true; } }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetIndexMap.h0000644000176200001440000000171413135665123023432 0ustar liggesusers#ifndef dplyr_VisitorSetIndexMap_H #define dplyr_VisitorSetIndexMap_H #include #include #include namespace dplyr { template class VisitorSetIndexMap : public dplyr_hash_map, VisitorSetEqualPredicate > { private: typedef VisitorSetHasher Hasher; typedef VisitorSetEqualPredicate EqualPredicate; typedef typename dplyr_hash_map Base; public: VisitorSetIndexMap() : Base(), visitors(0) {} VisitorSetIndexMap(VisitorSet& visitors_) : Base(1024, Hasher(&visitors_), EqualPredicate(&visitors_)), visitors(&visitors_) {} VisitorSetIndexMap(VisitorSet* visitors_) : Base(1024, Hasher(visitors_), EqualPredicate(visitors_)), visitors(visitors_) {} VisitorSet* visitors; }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorHash.h0000644000176200001440000000044013135665123022127 0ustar liggesusers#ifndef dplyr_VisitorHash_H #define dplyr_VisitorHash_H namespace dplyr { template class VisitorHash { public: VisitorHash(const Visitor& v_) : v(v_) {} inline size_t operator()(int i) const { return v.hash(i); } private: const Visitor& v; }; } #endif dplyr/inst/include/dplyr/visitor_set/visitor_set.h0000644000176200001440000000024513135665123022241 0ustar liggesusers#ifndef dplyr_visitor_set_H #define dplyr_visitor_set_H #include #include #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetGreater.h0000644000176200001440000000114513135665123023314 0ustar liggesusers#ifndef dplyr_VisitorSetGreater_H #define dplyr_VisitorSetGreater_H namespace dplyr { template class VisitorSetGreater { public: bool greater(int i, int j) const { if (i == j) return false; const Class& obj = static_cast(*this); int n = obj.size(); for (int k = 0; k < n; k++) { typename Class::visitor_type* visitor = obj.get(k); if (! visitor->equal(i, j)) { return visitor->greater(i, j); } } // if we end up here, it means rows i and j are equal // we break the tie using the indices return i < j; } }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetHasher.h0000644000176200001440000000057513135665123023143 0ustar liggesusers#ifndef dplyr_VisitorSetHasher_H #define dplyr_VisitorSetHasher_H namespace dplyr { template class VisitorSetHasher { public: VisitorSetHasher() : visitors(0) {} VisitorSetHasher(VisitorSet* visitors_) : visitors(visitors_) {}; inline size_t operator()(int i) const { return visitors->hash(i); } private: VisitorSet* visitors; }; } #endif dplyr/inst/include/dplyr/visitor_set/VisitorSetMixin.h0000644000176200001440000000053613120706341023002 0ustar liggesusers#ifndef dplyr_dplyr_visitor_set_VisitorSetMixin_H #define dplyr_dplyr_visitor_set_VisitorSetMixin_H #include #include #include #include #endif // #ifndef dplyr_dplyr_visitor_set_VisitorSetMixin_H dplyr/inst/include/dplyr/visitor_set/VisitorSetIndexSet.h0000644000176200001440000000150713135665123023450 0ustar liggesusers#ifndef dplyr_VisitorSetIndexSet_H #define dplyr_VisitorSetIndexSet_H #include #include #include namespace dplyr { template class VisitorSetIndexSet : public dplyr_hash_set, VisitorSetEqualPredicate > { private: typedef VisitorSetHasher Hasher; typedef VisitorSetEqualPredicate EqualPredicate; typedef dplyr_hash_set Base; public: VisitorSetIndexSet() : Base() {} VisitorSetIndexSet(VisitorSet& visitors_) : Base(1024, Hasher(&visitors_), EqualPredicate(&visitors_)) {} VisitorSetIndexSet(VisitorSet* visitors_) : Base(1024, Hasher(visitors_), EqualPredicate(visitors_)) {} }; } #endif dplyr/inst/include/dplyr/NamedListAccumulator.h0000644000176200001440000000212313135665123021372 0ustar liggesusers#ifndef dplyr_NamedListAccumulator_H #define dplyr_NamedListAccumulator_H #include #include namespace dplyr { template class NamedListAccumulator { private: SymbolMap symbol_map; std::vector data; // owns the results public: NamedListAccumulator() {} inline void set(const SymbolString& name, RObject x) { if (! Rcpp::traits::same_type::value) check_supported_type(x, name); SymbolMapIndex index = symbol_map.insert(name); if (index.origin == NEW) { data.push_back(x); } else { data[ index.pos ] = x; } } inline void rm(const SymbolString& name) { SymbolMapIndex index = symbol_map.rm(name); if (index.origin != NEW) { data.erase(data.begin() + index.pos); } } inline operator List() const { List out = wrap(data); out.names() = symbol_map.get_names(); return out; } inline size_t size() const { return data.size(); } inline const SymbolVector names() const { return symbol_map.get_names(); } }; } #endif dplyr/inst/include/dplyr/Column.h0000644000176200001440000000070613135665123016554 0ustar liggesusers#ifndef DPLYR_DPLYR_COLUMN_H #define DPLYR_DPLYR_COLUMN_H class Column { public: Column(SEXP data_, const SymbolString& name_) : data(data_), name(name_) {} public: const RObject& get_data() const { return data; } const SymbolString& get_name() const { return name; } Column update_data(SEXP new_data) const { return Column(new_data, name); } private: RObject data; SymbolString name; }; #endif //DPLYR_DPLYR_COLUMN_H dplyr/inst/include/dplyr/EmptySubset.h0000644000176200001440000000023213135665123017575 0ustar liggesusers#ifndef dplyr_EmptySubset_H #define dplyr_EmptySubset_H namespace dplyr { class EmptySubset { public: int size() const { return 0; } }; } #endif dplyr/inst/include/dplyr/Result/0000755000176200001440000000000013156774310016423 5ustar liggesusersdplyr/inst/include/dplyr/Result/ILazySubsets.h0000644000176200001440000000157313135665123021201 0ustar liggesusers#ifndef dplyr_ILazySubsets_H #define dplyr_ILazySubsets_H #include #include #include namespace dplyr { class ILazySubsets { protected: ILazySubsets() {} public: virtual ~ILazySubsets() {} virtual const SymbolVector get_variable_names() const = 0; virtual SEXP get_variable(const SymbolString& symbol) const = 0; virtual SEXP get(const SymbolString& symbol, const SlicingIndex& indices) const = 0; virtual bool is_summary(const SymbolString& symbol) const = 0; virtual bool has_variable(const SymbolString& symbol) const = 0; virtual void input(const SymbolString& symbol, SEXP x) = 0; virtual int size() const = 0; virtual int nrows() const = 0; public: bool has_non_summary_variable(const SymbolString& symbol) const { return has_variable(symbol) && !is_summary(symbol); } }; } #endif dplyr/inst/include/dplyr/Result/GroupedHybridCall.h0000644000176200001440000001315713135665123022144 0ustar liggesusers#ifndef dplyr_GroupedHybridCall_H #define dplyr_GroupedHybridCall_H #include #include #include #include namespace dplyr { inline static SEXP rlang_object(const char* name) { static Environment rlang = Rcpp::Environment::namespace_env("rlang"); return rlang[name]; } class IHybridCallback { protected: virtual ~IHybridCallback() {} public: virtual SEXP get_subset(const SymbolString& name) const = 0; }; class GroupedHybridEnv { public: GroupedHybridEnv(const CharacterVector& names_, const Environment& env_, const IHybridCallback* callback_) : names(names_), env(env_), callback(callback_), has_overscope(false) { LOG_VERBOSE; } ~GroupedHybridEnv() { if (has_overscope) { static Function overscope_clean = rlang_object("overscope_clean"); overscope_clean(overscope); } } public: const Environment& get_overscope() const { provide_overscope(); return overscope; } private: void provide_overscope() const { if (has_overscope) return; // Environment::new_child() performs an R callback, creating the environment // in R should be slightly faster Environment active_env = create_env_string( names, &GroupedHybridEnv::hybrid_get_callback, PAYLOAD(const_cast(reinterpret_cast(callback))), env); // If bindr (via bindrcpp) supported the creation of a child environment, we could save the // call to Rcpp_eval() triggered by active_env.new_child() Environment bottom = active_env.new_child(true); bottom[".data"] = rlang_new_data_source(active_env); // Install definitions for formula self-evaluation and unguarding Function new_overscope = rlang_object("new_overscope"); overscope = new_overscope(bottom, active_env, env); has_overscope = true; } static List rlang_new_data_source(Environment env) { static Function as_dictionary = rlang_object("as_dictionary"); return as_dictionary( env, _["lookup_msg"] = "Column `%s`: not found in data", _["read_only"] = true ); } static SEXP hybrid_get_callback(const String& name, bindrcpp::PAYLOAD payload) { LOG_VERBOSE; IHybridCallback* callback_ = reinterpret_cast(payload.p); return callback_->get_subset(SymbolString(name)); } private: const CharacterVector names; const Environment env; const IHybridCallback* callback; mutable Environment overscope; mutable bool has_overscope; }; class GroupedHybridCall { public: GroupedHybridCall(const Call& call_, const ILazySubsets& subsets_, const Environment& env_) : original_call(call_), subsets(subsets_), env(env_) { LOG_VERBOSE; } public: // FIXME: replace the search & replace logic with overscoping Call simplify(const SlicingIndex& indices) const { set_indices(indices); Call call = clone(original_call); while (simplified(call)) {} clear_indices(); return call; } private: bool simplified(Call& call) const { LOG_VERBOSE; // initial if (TYPEOF(call) == LANGSXP || TYPEOF(call) == SYMSXP) { boost::scoped_ptr res(get_handler(call, subsets, env)); if (res) { // replace the call by the result of process call = res->process(get_indices()); // no need to go any further, we simplified the top level return true; } if (TYPEOF(call) == LANGSXP) return replace(CDR(call)); } return false; } bool replace(SEXP p) const { LOG_VERBOSE; SEXP obj = CAR(p); if (TYPEOF(obj) == LANGSXP) { boost::scoped_ptr res(get_handler(obj, subsets, env)); if (res) { SETCAR(p, res->process(get_indices())); return true; } if (replace(CDR(obj))) return true; } if (TYPEOF(p) == LISTSXP) { return replace(CDR(p)); } return false; } const SlicingIndex& get_indices() const { return *indices; } void set_indices(const SlicingIndex& indices_) const { indices = &indices_; } void clear_indices() const { indices = NULL; } private: // Initialization const Call original_call; const ILazySubsets& subsets; const Environment env; private: // State mutable const SlicingIndex* indices; }; class GroupedHybridEval : public IHybridCallback { public: GroupedHybridEval(const Call& call_, const ILazySubsets& subsets_, const Environment& env_) : indices(NULL), subsets(subsets_), env(env_), hybrid_env(subsets_.get_variable_names().get_vector(), env_, this), hybrid_call(call_, subsets_, env_) { LOG_VERBOSE; } const SlicingIndex& get_indices() const { return *indices; } public: // IHybridCallback SEXP get_subset(const SymbolString& name) const { LOG_VERBOSE; return subsets.get(name, get_indices()); } public: SEXP eval(const SlicingIndex& indices_) { set_indices(indices_); SEXP ret = eval_with_indices(); clear_indices(); return ret; } private: void set_indices(const SlicingIndex& indices_) { indices = &indices_; } void clear_indices() { indices = NULL; } SEXP eval_with_indices() { Call call = hybrid_call.simplify(get_indices()); LOG_INFO << type2name(call); if (TYPEOF(call) == LANGSXP || TYPEOF(call) == SYMSXP) { LOG_VERBOSE << "performing evaluation in overscope"; return Rcpp_eval(call, hybrid_env.get_overscope()); } return call; } private: const SlicingIndex* indices; const ILazySubsets& subsets; Environment env; const GroupedHybridEnv hybrid_env; const GroupedHybridCall hybrid_call; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/Result/Count.h0000644000176200001440000000045613135665123017667 0ustar liggesusers#ifndef dplyr_Result_Count_H #define dplyr_Result_Count_H #include namespace dplyr { class Count : public Processor { public: Count() {} ~Count() {} inline int process_chunk(const SlicingIndex& indices) { return indices.size(); } }; } #endif dplyr/inst/include/dplyr/Result/LazyRowwiseSubsets.h0000644000176200001440000000036513135665123022446 0ustar liggesusers#ifndef dplyr_LazyRowwiseSubsets_H #define dplyr_LazyRowwiseSubsets_H #include #include namespace dplyr { typedef LazySplitSubsets LazyRowwiseSubsets; } #endif dplyr/inst/include/dplyr/Result/all.h0000644000176200001440000000221213135665123017337 0ustar liggesusers#ifndef dplyr_Result_all_H #define dplyr_Result_all_H #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #endif dplyr/inst/include/dplyr/Result/Count_Distinct.h0000644000176200001440000000275713135665123021536 0ustar liggesusers#ifndef dplyr_Result_Count_Distinct_H #define dplyr_Result_Count_Distinct_H #include #include #include #include namespace dplyr { template class Count_Distinct : public Processor > { public: typedef VisitorHash Hash; typedef VisitorEqualPredicate Pred; typedef dplyr_hash_set Set; Count_Distinct(Visitor v_): v(v_), set(0, Hash(v), Pred(v)) {} inline int process_chunk(const SlicingIndex& indices) { set.clear(); set.rehash(indices.size()); int n = indices.size(); for (int i = 0; i < n; i++) { set.insert(indices[i]); } return set.size(); } private: Visitor v; Set set; }; template class Count_Distinct_Narm : public Processor > { public: typedef VisitorHash Hash; typedef VisitorEqualPredicate Pred; typedef dplyr_hash_set Set; Count_Distinct_Narm(Visitor v_): v(v_), set(0, Hash(v), Pred(v)) {} inline int process_chunk(const SlicingIndex& indices) { set.clear(); set.rehash(indices.size()); int n = indices.size(); for (int i = 0; i < n; i++) { int index = indices[i]; if (! v.is_na(index)) { set.insert(index); } } return set.size(); } private: Visitor v; Set set; }; } #endif dplyr/inst/include/dplyr/Result/DelayedProcessor.h0000644000176200001440000001643313135665123022050 0ustar liggesusers#ifndef dplyr_Result_DelayedProcessor_H #define dplyr_Result_DelayedProcessor_H #include #include #include #include #include #include namespace dplyr { class IDelayedProcessor { public: IDelayedProcessor() {} virtual ~IDelayedProcessor() {} virtual bool try_handle(const RObject& chunk) = 0; virtual IDelayedProcessor* promote(const RObject& chunk) = 0; virtual SEXP get() = 0; virtual std::string describe() = 0; }; template bool valid_conversion(int rtype) { return rtype == RTYPE; } template <> inline bool valid_conversion(int rtype) { switch (rtype) { case REALSXP: case INTSXP: case LGLSXP: return true; default: break; } return false; } template <> inline bool valid_conversion(int rtype) { switch (rtype) { case INTSXP: case LGLSXP: return true; default: break; } return false; } template inline bool valid_promotion(int) { return false; } template <> inline bool valid_promotion(int rtype) { return rtype == REALSXP; } template <> inline bool valid_promotion(int rtype) { return rtype == REALSXP || rtype == INTSXP; } template class DelayedProcessor : public IDelayedProcessor { public: typedef typename traits::scalar_type::type STORAGE; typedef Vector Vec; DelayedProcessor(const RObject& first_result, int ngroups_, const SymbolString& name_) : res(no_init(ngroups_)), pos(0), seen_na_only(true), name(name_) { if (!try_handle(first_result)) stop("cannot handle result of type %i for column '%s'", first_result.sexp_type(), name.get_utf8_cstring()); copy_most_attributes(res, first_result); } DelayedProcessor(int pos_, const RObject& chunk, SEXP res_, const SymbolString& name_) : res(as(res_)), pos(pos_), seen_na_only(false), name(name_) { copy_most_attributes(res, chunk); if (!try_handle(chunk)) { stop("cannot handle result of type %i in promotion for column '%s'", chunk.sexp_type(), name.get_utf8_cstring() ); } } virtual bool try_handle(const RObject& chunk) { check_supported_type(chunk, name); check_length(Rf_length(chunk), 1, "a summary value", name); int rtype = TYPEOF(chunk); if (valid_conversion(rtype)) { // copy, and memoize the copied value const typename Vec::stored_type& converted_chunk = (res[pos++] = as(chunk)); if (!Vec::is_na(converted_chunk)) seen_na_only = false; return true; } else { return false; } } virtual IDelayedProcessor* promote(const RObject& chunk) { if (!can_promote(chunk)) { LOG_VERBOSE << "can't promote"; return 0; } int rtype = TYPEOF(chunk); switch (rtype) { case LGLSXP: return new DelayedProcessor(pos, chunk, res, name); case INTSXP: return new DelayedProcessor(pos, chunk, res, name); case REALSXP: return new DelayedProcessor(pos, chunk, res, name); case CPLXSXP: return new DelayedProcessor(pos, chunk, res, name); case STRSXP: return new DelayedProcessor(pos, chunk, res, name); default: break; } return 0; } virtual SEXP get() { return res; } virtual std::string describe() { return vector_class(); } private: bool can_promote(const RObject& chunk) { return seen_na_only || valid_promotion(TYPEOF(chunk)); } private: Vec res; int pos; bool seen_na_only; const SymbolString name; }; template class FactorDelayedProcessor : public IDelayedProcessor { private: typedef dplyr_hash_map LevelsMap; public: FactorDelayedProcessor(SEXP first_result, int ngroups, const SymbolString& name_) : res(no_init(ngroups)), pos(0), name(name_) { copy_most_attributes(res, first_result); CharacterVector levels = get_levels(first_result); int n = levels.size(); for (int i = 0; i < n; i++) levels_map[ levels[i] ] = i + 1; if (!try_handle(first_result)) stop("cannot handle factor result for column '%s'", name.get_utf8_cstring()); } virtual bool try_handle(const RObject& chunk) { CharacterVector lev = get_levels(chunk); update_levels(lev); int val = as(chunk); if (val != NA_INTEGER) val = levels_map[lev[val - 1]]; res[pos++] = val; return true; } virtual IDelayedProcessor* promote(const RObject&) { return 0; } virtual SEXP get() { int n = levels_map.size(); CharacterVector levels(n); LevelsMap::iterator it = levels_map.begin(); for (int i = 0; i < n; i++, ++it) { levels[it->second - 1] = it->first; } set_levels(res, levels); return res; } virtual std::string describe() { return "factor"; } private: void update_levels(const CharacterVector& lev) { int nlevels = levels_map.size(); int n = lev.size(); for (int i = 0; i < n; i++) { SEXP s = lev[i]; if (! levels_map.count(s)) { levels_map.insert(std::make_pair(s, ++nlevels)); } } } IntegerVector res; int pos; LevelsMap levels_map; const SymbolString name; }; template class DelayedProcessor : public IDelayedProcessor { public: DelayedProcessor(SEXP first_result, int ngroups, const SymbolString& name_) : res(ngroups), pos(0), name(name_) { copy_most_attributes(res, first_result); if (!try_handle(first_result)) stop("cannot handle list result for column '%s'", name.get_utf8_cstring()); } virtual bool try_handle(const RObject& chunk) { if (is(chunk) && Rf_length(chunk) == 1) { res[pos++] = Rf_duplicate(VECTOR_ELT(chunk, 0)); return true; } return false; } virtual IDelayedProcessor* promote(const RObject&) { return 0; } virtual SEXP get() { return res; } virtual std::string describe() { return "list"; } private: List res; int pos; const SymbolString name; }; template IDelayedProcessor* get_delayed_processor(SEXP first_result, int ngroups, const SymbolString& name) { check_supported_type(first_result, name); check_length(Rf_length(first_result), 1, "a summary value", name); if (Rf_inherits(first_result, "factor")) { return new FactorDelayedProcessor(first_result, ngroups, name); } else if (Rcpp::is(first_result)) { return new DelayedProcessor(first_result, ngroups, name); } else if (Rcpp::is(first_result)) { return new DelayedProcessor(first_result, ngroups, name); } else if (Rcpp::is(first_result)) { return new DelayedProcessor(first_result, ngroups, name); } else if (Rcpp::is(first_result)) { return new DelayedProcessor(first_result, ngroups, name); } else if (Rcpp::is(first_result)) { return new DelayedProcessor(first_result, ngroups, name); } else if (TYPEOF(first_result) == CPLXSXP) { return new DelayedProcessor(first_result, ngroups, name); } stop("unknown result of type %d for column '%s'", TYPEOF(first_result), name.get_utf8_cstring()); } } #endif dplyr/inst/include/dplyr/Result/Rank.h0000644000176200001440000002525713156774512017506 0ustar liggesusers#ifndef dplyr_Result_Rank_H #define dplyr_Result_Rank_H #include #include #include #include #include #include #include namespace dplyr { namespace internal { struct min_rank_increment { typedef IntegerVector OutputVector; typedef int scalar_type; template inline int post_increment(const Container& x, int) const { return x.size(); } template inline int pre_increment(const Container&, int) const { return 0; } inline int start() const { return 1; } }; struct dense_rank_increment { typedef IntegerVector OutputVector; typedef int scalar_type; template inline int post_increment(const Container&, int) const { return 1; } template inline int pre_increment(const Container&, int) const { return 0; } inline int start() const { return 1; } }; struct percent_rank_increment { typedef NumericVector OutputVector; typedef double scalar_type; template inline double post_increment(const Container& x, int m) const { return (double)x.size() / (m - 1); } template inline double pre_increment(const Container&, int) const { return 0.0; } inline double start() const { return 0.0; } }; struct cume_dist_increment { typedef NumericVector OutputVector; typedef double scalar_type; template inline double post_increment(const Container&, int) const { return 0.0; } template inline double pre_increment(const Container& x, int m) const { return (double)x.size() / m; } inline double start() const { return 0.0; } }; } template class RankComparer { typedef comparisons compare; public: typedef typename Rcpp::traits::storage_type::type STORAGE; inline bool operator()(STORAGE lhs, STORAGE rhs) const { return compare::is_less(lhs, rhs); } }; template class RankComparer { typedef comparisons compare; public: typedef typename Rcpp::traits::storage_type::type STORAGE; inline bool operator()(STORAGE lhs, STORAGE rhs) const { return compare::is_greater(lhs, rhs); } }; template class RankEqual { typedef comparisons compare; public: typedef typename Rcpp::traits::storage_type::type STORAGE; inline bool operator()(STORAGE lhs, STORAGE rhs) const { return compare::equal_or_both_na(lhs, rhs); } }; // powers both dense_rank and min_rank, see dplyr.cpp for how it is used template class Rank_Impl : public Result, public Increment { public: typedef typename Increment::OutputVector OutputVector; typedef typename Rcpp::traits::storage_type::type STORAGE; typedef VectorSliceVisitor Slice; typedef RankComparer Comparer; typedef RankEqual Equal; typedef dplyr_hash_map, boost::hash, Equal > Map; typedef std::map*, Comparer> oMap; Rank_Impl(SEXP data_) : data(data_), map() {} virtual SEXP process(const GroupedDataFrame& gdf) { int ng = gdf.ngroups(); int n = gdf.nrows(); if (n == 0) return IntegerVector(0); GroupedDataFrame::group_iterator git = gdf.group_begin(); OutputVector out = no_init(n); for (int i = 0; i < ng; i++, ++git) { process_slice(out, *git); } return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { return IntegerVector(gdf.nrows(), 1); } virtual SEXP process(const FullDataFrame& df) { int n = df.nrows(); if (n == 0) return IntegerVector(0); OutputVector out = no_init(n); process_slice(out, df.get_index()); return out; } virtual SEXP process(const SlicingIndex& index) { int n = index.size(); if (n == 0) return IntegerVector(0); OutputVector out = no_init(n); process_slice(out, index); return out; } private: void process_slice(OutputVector& out, const SlicingIndex& index) { map.clear(); Slice slice(&data, index); int m = index.size(); for (int j = 0; j < m; j++) { map[ slice[j] ].push_back(j); } STORAGE na = Rcpp::traits::get_na(); typename Map::const_iterator it = map.find(na); if (it != map.end()) { m -= it->second.size(); } oMap ordered; it = map.begin(); for (; it != map.end(); ++it) { ordered[it->first] = &it->second; } typename oMap::const_iterator oit = ordered.begin(); typename Increment::scalar_type j = Increment::start(); for (; oit != ordered.end(); ++oit) { STORAGE key = oit->first; const std::vector& chunk = *oit->second; int n = chunk.size(); j += Increment::pre_increment(chunk, m); if (Rcpp::traits::is_na(key)) { typename Increment::scalar_type inc_na = Rcpp::traits::get_na< Rcpp::traits::r_sexptype_traits::rtype >(); for (int k = 0; k < n; k++) { out[ chunk[k] ] = inc_na; } } else { for (int k = 0; k < n; k++) { out[ chunk[k] ] = j; } } j += Increment::post_increment(chunk, m); } } Vector data; Map map; }; template class RowNumber : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; typedef VectorSliceVisitor Slice; typedef OrderVectorVisitorImpl Visitor; typedef Compare_Single_OrderVisitor Comparer; RowNumber(SEXP data_) : data(data_) {} virtual SEXP process(const GroupedDataFrame& gdf) { std::vector tmp(gdf.max_group_size()); int ng = gdf.ngroups(); int n = gdf.nrows(); if (n == 0) return IntegerVector(0); GroupedDataFrame::group_iterator git = gdf.group_begin(); IntegerVector out(n); for (int i = 0; i < ng; i++, ++git) { const SlicingIndex& index = *git; // tmp <- 0:(m-1) int m = index.size(); for (int j = 0; j < m; j++) tmp[j] = j; Slice slice(&data, index); // order( gdf.group(i) ) Visitor visitor(slice); Comparer comparer(visitor); std::sort(tmp.begin(), tmp.begin() + m, comparer); int j = m - 1; for (; j >= 0; j--) { if (Rcpp::traits::is_na(slice[ tmp[j] ])) { m--; out[ index[j] ] = NA_INTEGER; } else { break; } } for (; j >= 0; j--) { out[ index[j] ] = tmp[j] + 1; } } return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { return IntegerVector(gdf.nrows(), 1); } virtual SEXP process(const FullDataFrame& df) { return process(df.get_index()); } virtual SEXP process(const SlicingIndex& index) { int nrows = index.size(); if (nrows == 0) return IntegerVector(0); IntegerVector x = seq(0, nrows - 1); Slice slice(&data, index); Visitor visitor(slice); std::sort(x.begin(), x.end(), Comparer(visitor)); IntegerVector out = no_init(nrows); int j = nrows - 1; for (; j >= 0; j--) { if (Rcpp::traits::is_na(slice[ x[j] ])) { out[ x[j] ] = NA_INTEGER; } else { break; } } for (; j >= 0; j--) { out[ x[j] ] = j + 1; } return out; } private: Vector data; }; template class Ntile : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; typedef VectorSliceVisitor Slice; typedef OrderVectorVisitorImpl Visitor; typedef Compare_Single_OrderVisitor Comparer; Ntile(SEXP data_, double ntiles_) : data(data_), ntiles(ntiles_) {} virtual SEXP process(const GroupedDataFrame& gdf) { std::vector tmp(gdf.max_group_size()); int ng = gdf.ngroups(); int n = gdf.nrows(); if (n == 0) return IntegerVector(0); GroupedDataFrame::group_iterator git = gdf.group_begin(); IntegerVector out(n); for (int i = 0; i < ng; i++, ++git) { const SlicingIndex& index = *git; // tmp <- 0:(m-1) int m = index.size(); for (int j = 0; j < m; j++) tmp[j] = j; Slice slice(&data, index); // order( gdf.group(i) ) Visitor visitor(slice); Comparer comparer(visitor); std::sort(tmp.begin(), tmp.begin() + m, comparer); int j = m - 1; for (; j >= 0; j--) { if (Rcpp::traits::is_na(slice[tmp[j]])) { out[index[j]] = NA_INTEGER; m--; } else { break; } } for (; j >= 0; j--) { out[ index[j] ] = (int)floor((ntiles * tmp[j]) / m) + 1; } } return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { return IntegerVector(gdf.nrows(), 1); } virtual SEXP process(const FullDataFrame& df) { return process(df.get_index()); } virtual SEXP process(const SlicingIndex& index) { int nrows = index.size(); if (nrows == 0) return IntegerVector(0); IntegerVector x = seq(0, nrows - 1); Slice slice(&data, index); Visitor visitor(slice); std::sort(x.begin(), x.end(), Comparer(visitor)); IntegerVector out = no_init(nrows); int i = nrows - 1; for (; i >= 0; i--) { if (Rcpp::traits::is_na(slice[x[i]])) { nrows--; out[x[i]] = NA_INTEGER; } else { break; } } for (; i >= 0; i--) { out[ x[i] ] = (int)floor(ntiles * i / nrows) + 1; } return out; } private: Vector data; double ntiles; }; class RowNumber_0 : public Result { public: virtual SEXP process(const GroupedDataFrame& gdf) { int n = gdf.nrows(), ng = gdf.ngroups(); if (n == 0) return IntegerVector(0); IntegerVector res = no_init(n); GroupedDataFrame::group_iterator git = gdf.group_begin(); for (int i = 0; i < ng; i++, ++git) { const SlicingIndex& index = *git; int m = index.size(); for (int j = 0; j < m; j++) res[index[j]] = j + 1; } return res; } virtual SEXP process(const RowwiseDataFrame& gdf) { return IntegerVector(gdf.nrows(), 1); } virtual SEXP process(const FullDataFrame& df) { if (df.nrows() == 0) return IntegerVector(0); IntegerVector res = seq(1, df.nrows()); return res; } virtual SEXP process(const SlicingIndex& index) { if (index.size() == 0) return IntegerVector(0); IntegerVector res = seq(1, index.size()); return res; } }; } #include #endif dplyr/inst/include/dplyr/Result/Var.h0000644000176200001440000000416413135665123017327 0ustar liggesusers#ifndef dplyr_Result_Var_H #define dplyr_Result_Var_H #include namespace dplyr { namespace internal { inline double square(double x) { return x * x; } } // version for NA_RM = false template class Var : public Processor > { public: typedef Processor > Base; typedef typename Rcpp::traits::storage_type::type STORAGE; Var(SEXP x, bool is_summary_ = false) : Base(x), data_ptr(Rcpp::internal::r_vector_start(x)), is_summary(is_summary_) {} ~Var() {} inline double process_chunk(const SlicingIndex& indices) { if (is_summary) return NA_REAL; int n = indices.size(); if (n == 1) return NA_REAL; double m = internal::Mean_internal::process(data_ptr, indices); if (!R_FINITE(m)) return m; double sum = 0.0; for (int i = 0; i < n; i++) { sum += internal::square(data_ptr[indices[i]] - m); } return sum / (n - 1); } private: STORAGE* data_ptr; bool is_summary; }; // version for NA_RM = true template class Var : public Processor > { public: typedef Processor > Base; typedef typename Rcpp::traits::storage_type::type STORAGE; Var(SEXP x, bool is_summary_ = false) : Base(x), data_ptr(Rcpp::internal::r_vector_start(x)), is_summary(is_summary_) {} ~Var() {} inline double process_chunk(const SlicingIndex& indices) { if (is_summary) return NA_REAL; int n = indices.size(); if (n == 1) return NA_REAL; double m = internal::Mean_internal::process(data_ptr, indices); if (!R_FINITE(m)) return m; double sum = 0.0; int count = 0; for (int i = 0; i < n; i++) { STORAGE current = data_ptr[indices[i]]; if (Rcpp::Vector::is_na(current)) continue; sum += internal::square(current - m); count++; } if (count == 1) return NA_REAL; return sum / (count - 1); } private: STORAGE* data_ptr; bool is_summary; }; } #endif dplyr/inst/include/dplyr/Result/Result.h0000644000176200001440000000104013135665123020043 0ustar liggesusers#ifndef dplyr_Result_H #define dplyr_Result_H #include #include #include #include namespace dplyr { class Result { public: Result() {} virtual ~Result() {}; virtual SEXP process(const RowwiseDataFrame& gdf) = 0; virtual SEXP process(const GroupedDataFrame& gdf) = 0; virtual SEXP process(const FullDataFrame& df) = 0; virtual SEXP process(const SlicingIndex&) { return R_NilValue; } }; } // namespace dplyr #endif dplyr/inst/include/dplyr/Result/GroupedCallProxy.h0000644000176200001440000000420513135665123022036 0ustar liggesusers#ifndef dplyr_GroupedCallProxy_H #define dplyr_GroupedCallProxy_H #include #include #include #include #include namespace dplyr { template class GroupedCallProxy { public: GroupedCallProxy(const Rcpp::Call& call_, const Subsets& subsets_, const Environment& env_) : subsets(subsets_), proxies() { set_call(call_); set_env(env_); } GroupedCallProxy(const Rcpp::Call& call_, const Data& data_, const Environment& env_) : subsets(data_), proxies() { set_call(call_); set_env(env_); } GroupedCallProxy(const Data& data_, const Environment& env_) : subsets(data_), proxies() { set_env(env_); } GroupedCallProxy(const Data& data_) : subsets(data_), proxies() {} ~GroupedCallProxy() {} public: SEXP eval() { return get(NaturalSlicingIndex(subsets.nrows())); } SEXP get(const SlicingIndex& indices) { subsets.clear(); return get_hybrid_eval()->eval(indices); } GroupedHybridEval* get_hybrid_eval() { if (!hybrid_eval) { hybrid_eval.reset(new GroupedHybridEval(call, subsets, env)); } return hybrid_eval.get(); } void set_call(SEXP call_) { proxies.clear(); hybrid_eval.reset(); call = call_; } inline void set_env(SEXP env_) { env = env_; hybrid_eval.reset(); } void input(const SymbolString& name, SEXP x) { subsets.input(name, x); hybrid_eval.reset(); } inline int nsubsets() const { return subsets.size(); } inline bool has_variable(const SymbolString& name) const { return subsets.has_variable(name); } inline SEXP get_variable(const SymbolString& name) const { return subsets.get_variable(name); } inline bool is_constant() const { return TYPEOF(call) != LANGSXP && Rf_length(call) == 1; } private: Rcpp::Call call; Subsets subsets; std::vector proxies; Environment env; boost::scoped_ptr hybrid_eval; }; } #endif dplyr/inst/include/dplyr/Result/GroupedSubsetBase.h0000644000176200001440000000064113135665123022161 0ustar liggesusers#ifndef dplyr_GroupedSubsetBase_H #define dplyr_GroupedSubsetBase_H #include namespace dplyr { class GroupedSubset { public: GroupedSubset() {}; virtual ~GroupedSubset() {}; virtual SEXP get(const SlicingIndex& indices) = 0; virtual SEXP get_variable() const = 0; virtual bool is_summary() const = 0; }; typedef GroupedSubset RowwiseSubset; } #endif //dplyr_GroupedSubsetBase_H dplyr/inst/include/dplyr/Result/Processor.h0000644000176200001440000000554013135665123020555 0ustar liggesusers#ifndef dplyr_Result_Processor_H #define dplyr_Result_Processor_H #include #include #include #include namespace dplyr { // if we derive from this instead of deriving from Result, all we have to // do is implement a process_chunk method that takes a SlicingIndex& as // input and returns the suitable type (i.e. storage_type) // all the builtin result implementation (Mean, ...) use this. template class Processor : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; Processor() : data(R_NilValue) {} Processor(SEXP data_) : data(data_) {} virtual SEXP process(const Rcpp::GroupedDataFrame& gdf) { return process_grouped(gdf); } virtual SEXP process(const Rcpp::RowwiseDataFrame& gdf) { return process_grouped(gdf); } virtual SEXP process(const Rcpp::FullDataFrame& df) { return promote(process(df.get_index())); } virtual SEXP process(const SlicingIndex& index) { CLASS* obj = static_cast(this); Rcpp::Vector res = Rcpp::Vector::create(obj->process_chunk(index)); copy_attributes(res, data); return res; } private: template SEXP process_grouped(const Data& gdf) { int n = gdf.ngroups(); Rcpp::Shield res(Rf_allocVector(OUTPUT, n)); STORAGE* ptr = Rcpp::internal::r_vector_start(res); CLASS* obj = static_cast(this); typename Data::group_iterator git = gdf.group_begin(); for (int i = 0; i < n; i++, ++git) ptr[i] = obj->process_chunk(*git); copy_attributes(res, data); return res; } inline SEXP promote(SEXP obj) { RObject res(obj); copy_attributes(res, data); return res; } SEXP data; }; template class Processor : public Result { public: Processor(SEXP data_): data(data_) {} virtual SEXP process(const Rcpp::GroupedDataFrame& gdf) { return process_grouped(gdf); } virtual SEXP process(const Rcpp::RowwiseDataFrame& gdf) { return process_grouped(gdf); } virtual SEXP process(const Rcpp::FullDataFrame& df) { return process(df.get_index()); } virtual SEXP process(const SlicingIndex& index) { CLASS* obj = static_cast(this); return CharacterVector::create(obj->process_chunk(index)); } private: template SEXP process_grouped(const Data& gdf) { int n = gdf.ngroups(); Rcpp::Shield res(Rf_allocVector(STRSXP, n)); CLASS* obj = static_cast(this); typename Data::group_iterator git = gdf.group_begin(); for (int i = 0; i < n; i++, ++git) SET_STRING_ELT(res, i, obj->process_chunk(*git)); return res; } SEXP data; }; } #endif dplyr/inst/include/dplyr/Result/Sum.h0000644000176200001440000000550213135665123017340 0ustar liggesusers#ifndef dplyr_Result_Sum_H #define dplyr_Result_Sum_H #include namespace dplyr { namespace internal { // this one is actually only used for RTYPE = REALSXP and NA_RM = true template struct Sum { typedef typename Rcpp::traits::storage_type::type STORAGE; static STORAGE process(typename Rcpp::traits::storage_type::type* ptr, const Index& indices) { long double res = 0; int n = indices.size(); for (int i = 0; i < n; i++) { double value = ptr[indices[i]]; if (! Rcpp::traits::is_na(value)) res += value; } return (double)res; } }; template struct Sum { static int process(int* ptr, const Index& indices) { long double res = 0; int n = indices.size(); for (int i = 0; i < n; i++) { int value = ptr[indices[i]]; if (! Rcpp::traits::is_na(value)) res += value; } if (res > INT_MAX || res <= INT_MIN) { warning("integer overflow - use sum(as.numeric(.))"); return IntegerVector::get_na(); } return (int)res; } }; template struct Sum { static int process(int* ptr, const Index& indices) { long double res = 0; int n = indices.size(); for (int i = 0; i < n; i++) { int value = ptr[indices[i]]; if (Rcpp::traits::is_na(value)) { return NA_INTEGER; } res += value; } if (res > INT_MAX || res <= INT_MIN) { warning("integer overflow - use sum(as.numeric(.))"); return IntegerVector::get_na(); } return (int)res; } }; template struct Sum { static double process(double* ptr, const Index& indices) { long double res = 0.0; int n = indices.size(); for (int i = 0; i < n; i++) { // we don't test for NA here because += NA will give NA // this is faster in the most common case where there are no NA // if there are NA, we could return quicker as in the version for // INTSXP above, but we would penalize the most common case res += ptr[ indices[i] ]; } return (double)res; } }; } // namespace internal template class Sum : public Processor< RTYPE, Sum > { public: typedef Processor< RTYPE, Sum > Base; typedef typename Rcpp::traits::storage_type::type STORAGE; Sum(SEXP x, bool is_summary_ = false) : Base(x), data_ptr(Rcpp::internal::r_vector_start(x)), is_summary(is_summary_) {} ~Sum() {} inline STORAGE process_chunk(const SlicingIndex& indices) { if (is_summary) return data_ptr[indices.group()]; return internal::Sum::process(data_ptr, indices); } STORAGE* data_ptr; bool is_summary; }; } #endif dplyr/inst/include/dplyr/Result/CumMin.h0000644000176200001440000000177513135665123017774 0ustar liggesusers#ifndef dplyr_Result_CumMin_H #define dplyr_Result_CumMin_H #include namespace dplyr { // version for REALSXP template class CumMin : public Mutater > { public: typedef typename Rcpp::traits::storage_type::type STORAGE; CumMin(SEXP data_) : data(data_) {} void process_slice(Vector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int n = index.size(); STORAGE value = data[index[0]]; out[out_index[0]] = value; if (NumericVector::is_na(value)) { for (int i = 1; i < n; i++) { out[out_index[i]] = value; } return; } for (int i = 1; i < n; i++) { STORAGE current = data[index[i]]; if (Rcpp::traits::is_na(current)) { for (int j = i; j < n; j++) { out[out_index[j]] = current; } return; } if (current < value) value = current; out[out_index[i]] = value; } } Vector data; }; } #endif dplyr/inst/include/dplyr/Result/In.h0000644000176200001440000000150413135665123017140 0ustar liggesusers#ifndef dplyr_Result_In_H #define dplyr_Result_In_H #include #include namespace dplyr { template class In : public Mutater > { public: typedef typename Rcpp::Vector Vec; typedef typename Rcpp::traits::storage_type::type STORAGE; In(Vec data_, const Vec& table_) : data(data_), set(table_.begin(), table_.end()) {} void process_slice(LogicalVector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int n = index.size(); for (int i = 0; i < n; i++) { STORAGE value = data[index[i]]; if (Vec::is_na(value)) { out[ out_index[i] ] = false; } else { out[ out_index[i] ] = set.count(value); } } } private: Vec data; dplyr_hash_set set; }; } #endif dplyr/inst/include/dplyr/Result/Mutater.h0000644000176200001440000000247513135665123020223 0ustar liggesusers#ifndef dplyr_Result_Mutater_H #define dplyr_Result_Mutater_H #include namespace dplyr { template class Mutater : public Result { public: virtual SEXP process(const GroupedDataFrame& gdf) { int ng = gdf.ngroups(); Vector out = no_init(gdf.nrows()); GroupedDataFrame::group_iterator git = gdf.group_begin(); for (int i = 0; i < ng; i++, ++git) { static_cast(*this).process_slice(out, *git, *git); } return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { int ng = gdf.ngroups(); Vector out = no_init(gdf.nrows()); RowwiseDataFrame::group_iterator git = gdf.group_begin(); for (int i = 0; i < ng; i++, ++git) { static_cast(*this).process_slice(out, *git, *git); } return out; } virtual SEXP process(const FullDataFrame& df) { Vector out = no_init(df.nrows()); const SlicingIndex& index = df.get_index(); static_cast(*this).process_slice(out, index, index); return out; } virtual SEXP process(const SlicingIndex& index) { int nrows = index.size(); Vector out = no_init(nrows); NaturalSlicingIndex fake(nrows); static_cast(*this).process_slice(out, index, fake); return out; } }; } #endif dplyr/inst/include/dplyr/Result/CumMax.h0000644000176200001440000000177513135665123017776 0ustar liggesusers#ifndef dplyr_Result_CumMax_H #define dplyr_Result_CumMax_H #include namespace dplyr { // version for REALSXP template class CumMax : public Mutater > { public: typedef typename Rcpp::traits::storage_type::type STORAGE; CumMax(SEXP data_) : data(data_) {} void process_slice(Vector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int n = index.size(); STORAGE value = data[index[0]]; out[out_index[0]] = value; if (NumericVector::is_na(value)) { for (int i = 1; i < n; i++) { out[out_index[i]] = value; } return; } for (int i = 1; i < n; i++) { STORAGE current = data[index[i]]; if (Rcpp::traits::is_na(current)) { for (int j = i; j < n; j++) { out[out_index[j]] = current; } return; } if (current > value) value = current; out[out_index[i]] = value; } } Vector data; }; } #endif dplyr/inst/include/dplyr/Result/MinMax.h0000644000176200001440000000306013147347221017761 0ustar liggesusers#ifndef dplyr_Result_MinMax_H #define dplyr_Result_MinMax_H #include #include namespace dplyr { template class MinMax : public Processor > { public: typedef Processor > Base; typedef typename Rcpp::traits::storage_type::type STORAGE; private: static const double Inf; public: MinMax(SEXP x, bool is_summary_ = false) : Base(x), data_ptr(Rcpp::internal::r_vector_start(x)), is_summary(is_summary_) {} ~MinMax() {} double process_chunk(const SlicingIndex& indices) { if (is_summary) return data_ptr[ indices.group() ]; const int n = indices.size(); double res = Inf; for (int i = 0; i < n; ++i) { STORAGE current = data_ptr[indices[i]]; if (Rcpp::Vector::is_na(current)) { if (NA_RM) continue; else return NA_REAL; } else { double current_res = current; if (is_better(current_res, res)) res = current_res; } } return res; } inline static bool is_better(const double current, const double res) { if (MINIMUM) return internal::is_smaller(current, res); else return internal::is_smaller(res, current); } private: STORAGE* data_ptr; bool is_summary; }; template const double MinMax::Inf = (MINIMUM ? R_PosInf : R_NegInf); } #endif dplyr/inst/include/dplyr/Result/Lead.h0000644000176200001440000000403513135665123017441 0ustar liggesusers#ifndef dplyr_Result_Lead_H #define dplyr_Result_Lead_H #include #include #include namespace dplyr { template class Lead : public Result { public: typedef typename traits::scalar_type::type STORAGE; Lead(SEXP data_, int n_, const RObject& def_, bool is_summary_) : data(data_), n(n_), def(Vector::get_na()), is_summary(is_summary_) { if (!Rf_isNull(def_)) { def = as(def_); } } virtual SEXP process(const GroupedDataFrame& gdf) { int nrows = gdf.nrows(); int ng = gdf.ngroups(); Vector out = no_init(nrows); if (is_summary) { for (int i = 0; i < nrows; i++) out[i] = def; } else { GroupedDataFrame::group_iterator git = gdf.group_begin(); for (int i = 0; i < ng; i++, ++git) { process_slice(out, *git, *git); } } copy_most_attributes(out, data); return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { int nrows = gdf.nrows(); Vector out(nrows, def); copy_most_attributes(out, data); return out; } virtual SEXP process(const FullDataFrame& df) { int nrows = df.nrows(); Vector out = no_init(nrows); const SlicingIndex& index = df.get_index(); process_slice(out, index, index); copy_most_attributes(out, data); return out; } virtual SEXP process(const SlicingIndex& index) { int nrows = index.size(); Vector out = no_init(nrows); NaturalSlicingIndex fake(nrows); process_slice(out, index, fake); copy_most_attributes(out, data); return out; } private: void process_slice(Vector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int chunk_size = index.size(); int i = 0; for (; i < chunk_size - n; i++) { out[out_index[i]] = data[index[i + n]]; } for (; i < chunk_size; i++) { out[out_index[i]] = def; } } Vector data; int n; STORAGE def; bool is_summary; }; } #endif dplyr/inst/include/dplyr/Result/is_smaller.h0000644000176200001440000000070113136641136020721 0ustar liggesusers#ifndef dplyr_Result_is_smaller_H #define dplyr_Result_is_smaller_H namespace dplyr { namespace internal { template inline bool is_smaller(typename Rcpp::traits::storage_type::type lhs, typename Rcpp::traits::storage_type::type rhs) { return lhs < rhs; } template <> inline bool is_smaller(SEXP lhs, SEXP rhs) { return strcmp(CHAR(lhs), CHAR(rhs)) < 0; } } // namespace internal } // namespace dplyr #endif dplyr/inst/include/dplyr/Result/RowwiseSubset.h0000644000176200001440000000365413135665123021427 0ustar liggesusers#ifndef dplyr_RowwiseSubset_H #define dplyr_RowwiseSubset_H #include #include #include #include namespace dplyr { template class RowwiseSubsetTemplate : public RowwiseSubset { public: typedef typename Rcpp::traits::storage_type::type STORAGE; RowwiseSubsetTemplate(SEXP x) : object(x), output(1), start(Rcpp::internal::r_vector_start(object)) { copy_most_attributes(output, x); SET_DPLYR_SHRINKABLE_VECTOR((SEXP)output); } ~RowwiseSubsetTemplate() { UNSET_DPLYR_SHRINKABLE_VECTOR((SEXP)output); } virtual SEXP get(const SlicingIndex& indices) { output[0] = start[ indices.group() ]; return output; } virtual SEXP get_variable() const { return object; } virtual bool is_summary() const { return false; } private: SEXP object; Vector output; STORAGE* start; }; template <> class RowwiseSubsetTemplate : public RowwiseSubset { public: RowwiseSubsetTemplate(SEXP x) : object(x), start(Rcpp::internal::r_vector_start(object)) {} virtual SEXP get(const SlicingIndex& indices) { return start[ indices.group() ]; } virtual SEXP get_variable() const { return object; } virtual bool is_summary() const { return false; } private: SEXP object; SEXP* start; }; inline RowwiseSubset* rowwise_subset(SEXP x) { switch (check_supported_type(x)) { case DPLYR_INTSXP: return new RowwiseSubsetTemplate(x); case DPLYR_REALSXP: return new RowwiseSubsetTemplate(x); case DPLYR_LGLSXP: return new RowwiseSubsetTemplate(x); case DPLYR_STRSXP: return new RowwiseSubsetTemplate(x); case DPLYR_CPLXSXP: return new RowwiseSubsetTemplate(x); case DPLYR_VECSXP: return new RowwiseSubsetTemplate(x); } stop("Unreachable"); return 0; } } #endif dplyr/inst/include/dplyr/Result/VectorSliceVisitor.h0000644000176200001440000000127213156774325022406 0ustar liggesusers#ifndef dplyr_Result_VectorSliceVisitor_H #define dplyr_Result_VectorSliceVisitor_H #include namespace dplyr { template class VectorSliceVisitor { public: typedef typename Rcpp::traits::storage_type::type STORAGE; VectorSliceVisitor(const Vector* data_, const SlicingIndex& index_) : data(*data_), n(index_.size()), index(index_) {} inline STORAGE operator[](int i) const { return data[index[i]]; } inline int size() const { return n; } inline operator SEXP() const { return wrap_subset(data, index); } private: const Vector& data; int n; const SlicingIndex& index; }; } #endif dplyr/inst/include/dplyr/Result/Mean.h0000644000176200001440000000534613135665123017462 0ustar liggesusers#ifndef dplyr_Result_Mean_H #define dplyr_Result_Mean_H #include namespace dplyr { namespace internal { // version for NA_RM == true template struct Mean_internal { static double process(typename Rcpp::traits::storage_type::type* ptr, const Index& indices) { typedef typename Rcpp::traits::storage_type::type STORAGE; long double res = 0.0; int n = indices.size(); int m = 0; for (int i = 0; i < n; i++) { STORAGE value = ptr[ indices[i] ]; if (! Rcpp::traits::is_na(value)) { res += value; m++; } } if (m == 0) return R_NaN; res /= m; if (R_FINITE(res)) { long double t = 0.0; for (int i = 0; i < n; i++) { STORAGE value = ptr[indices[i]]; if (! Rcpp::traits::is_na(value)) { t += value - res; } } res += t / m; } return (double)res; } }; // special cases for NA_RM == false template struct Mean_internal { static double process(int* ptr, const Index& indices) { long double res = 0.0; int n = indices.size(); for (int i = 0; i < n; i++) { int value = ptr[ indices[i] ]; // need to handle missing value specifically if (value == NA_INTEGER) { return NA_REAL; } res += value; } res /= n; if (R_FINITE((double)res)) { long double t = 0.0; for (int i = 0; i < n; i++) { t += ptr[indices[i]] - res; } res += t / n; } return (double)res; } }; template struct Mean_internal { static double process(double* ptr, const Index& indices) { long double res = 0.0; int n = indices.size(); for (int i = 0; i < n; i++) { res += ptr[ indices[i] ]; } res /= n; if (R_FINITE((double)res)) { long double t = 0.0; for (int i = 0; i < n; i++) { t += ptr[indices[i]] - res; } res += t / n; } return (double)res; } }; } // namespace internal template class Mean : public Processor< REALSXP, Mean > { public: typedef Processor< REALSXP, Mean > Base; typedef typename Rcpp::traits::storage_type::type STORAGE; Mean(SEXP x, bool is_summary_ = false) : Base(x), data_ptr(Rcpp::internal::r_vector_start(x)), is_summary(is_summary_) {} ~Mean() {} inline double process_chunk(const SlicingIndex& indices) { if (is_summary) return data_ptr[indices.group()]; return internal::Mean_internal::process(data_ptr, indices); } private: STORAGE* data_ptr; bool is_summary; }; } #endif dplyr/inst/include/dplyr/Result/CumSum.h0000644000176200001440000000221413135665123020002 0ustar liggesusers#ifndef dplyr_Result_CumSum_H #define dplyr_Result_CumSum_H #include namespace dplyr { // REALSXP version template class CumSum : public Mutater > { public: CumSum(SEXP data_) : data(data_) {} void process_slice(Vector& out, const SlicingIndex& index, const SlicingIndex& out_index) { double value = 0.0; int n = index.size(); for (int i = 0; i < n; i++) { value += data[index[i]]; out[out_index[i]] = value; } } Vector data; }; // INTSXP version template <> class CumSum : public Mutater > { public: CumSum(SEXP data_) : data(data_) {} void process_slice(IntegerVector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int value = 0; int n = index.size(); for (int i = 0; i < n; i++) { int current = data[index[i]]; if (IntegerVector::is_na(current)) { for (int j = i; j < n; j++) { out[ out_index[j] ] = NA_INTEGER; } return; } value += current; out[out_index[i]] = value; } } IntegerVector data; }; } #endif dplyr/inst/include/dplyr/Result/CallProxy.h0000644000176200001440000000034713135665123020513 0ustar liggesusers#ifndef dplyr_CallProxy_H #define dplyr_CallProxy_H #include #include namespace dplyr { typedef GroupedCallProxy CallProxy; } #endif dplyr/inst/include/dplyr/Result/Sd.h0000644000176200001440000000100313135665123017132 0ustar liggesusers#ifndef dplyr_Result_Sd_H #define dplyr_Result_Sd_H #include namespace dplyr { template class Sd : public Processor > { public: typedef Processor > Base; Sd(SEXP x, bool is_summary = false) : Base(x), var(x, is_summary) {} ~Sd() {} inline double process_chunk(const SlicingIndex& indices) { return sqrt(var.process_chunk(indices)); } private: Var var; }; } #endif dplyr/inst/include/dplyr/Result/Lag.h0000644000176200001440000000404213135665123017275 0ustar liggesusers#ifndef dplyr_Result_Lag_H #define dplyr_Result_Lag_H #include #include #include namespace dplyr { template class Lag : public Result { public: typedef typename traits::scalar_type::type STORAGE; Lag(SEXP data_, int n_, const RObject& def_, bool is_summary_) : data(data_), n(n_), def(Vector::get_na()), is_summary(is_summary_) { if (!Rf_isNull(def_)) { def = as(def_); } } virtual SEXP process(const GroupedDataFrame& gdf) { int nrows = gdf.nrows(); int ng = gdf.ngroups(); Vector out = no_init(nrows); if (is_summary) { for (int i = 0; i < nrows; i++) out[i] = def; } else { GroupedDataFrame::group_iterator git = gdf.group_begin(); for (int i = 0; i < ng; i++, ++git) { process_slice(out, *git, *git); } } copy_most_attributes(out, data); return out; } virtual SEXP process(const RowwiseDataFrame& gdf) { Vector out(gdf.nrows(), def); copy_most_attributes(out, data); return out; } virtual SEXP process(const FullDataFrame& df) { int nrows = df.nrows(); Vector out = no_init(nrows); const SlicingIndex& index = df.get_index(); process_slice(out, index, index); copy_most_attributes(out, data); return out; } virtual SEXP process(const SlicingIndex& index) { int nrows = index.size(); Vector out = no_init(nrows); NaturalSlicingIndex fake(nrows); process_slice(out, index, fake); copy_most_attributes(out, data); return out; } private: void process_slice(Vector& out, const SlicingIndex& index, const SlicingIndex& out_index) { int chunk_size = index.size(); int n_def = std::min(chunk_size, n); int i = 0; for (; i < n_def; ++i) { out[out_index[i]] = def; } for (; i < chunk_size; ++i) { out[out_index[i]] = data[index[i - n]]; } } Vector data; int n; STORAGE def; bool is_summary; }; } #endif dplyr/inst/include/dplyr/Result/GroupedCallReducer.h0000644000176200001440000000146513135665123022313 0ustar liggesusers#ifndef dplyr_GroupedCallReducer_H #define dplyr_GroupedCallReducer_H #include #include #include namespace dplyr { template class GroupedCallReducer : public CallbackProcessor< GroupedCallReducer > { public: GroupedCallReducer(Rcpp::Call call, const Subsets& subsets, const Environment& env, const SymbolString& name_) : proxy(call, subsets, env), name(name_) { } virtual ~GroupedCallReducer() {}; inline SEXP process_chunk(const SlicingIndex& indices) { return proxy.get(indices); } const SymbolString& get_name() const { return name; } private: GroupedCallProxy proxy; const SymbolString name; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/Result/LazyGroupedSubsets.h0000644000176200001440000000547113135665123022417 0ustar liggesusers#ifndef dplyr_LazyGroupedSubsets_H #define dplyr_LazyGroupedSubsets_H #include #include #include #include #include namespace dplyr { template class LazySplitSubsets : public ILazySubsets { typedef typename Data::subset subset; public: LazySplitSubsets(const Data& gdf_) : gdf(gdf_), subsets(), symbol_map(), resolved(), owner(true) { const DataFrame& data = gdf.data(); CharacterVector names = data.names(); int n = data.size(); LOG_INFO << "processing " << n << " vars: " << names; for (int i = 0; i < n; i++) { input(names[i], data[i]); } } LazySplitSubsets(const LazySplitSubsets& other) : gdf(other.gdf), subsets(other.subsets), symbol_map(other.symbol_map), resolved(other.resolved), owner(false) {} virtual ~LazySplitSubsets() { if (owner) { for (size_t i = 0; i < subsets.size(); i++) { delete subsets[i]; } } } public: virtual const SymbolVector get_variable_names() const { return symbol_map.get_names(); } virtual SEXP get_variable(const SymbolString& symbol) const { return subsets[symbol_map.get(symbol)]->get_variable(); } virtual SEXP get(const SymbolString& symbol, const SlicingIndex& indices) const { int idx = symbol_map.get(symbol); SEXP value = resolved[idx]; if (value == R_NilValue) { resolved[idx] = value = subsets[idx]->get(indices); } return value; } virtual bool is_summary(const SymbolString& symbol) const { return subsets[symbol_map.get(symbol)]->is_summary(); } virtual bool has_variable(const SymbolString& head) const { return symbol_map.has(head); } virtual void input(const SymbolString& symbol, SEXP x) { input_subset(symbol, gdf.create_subset(x)); } virtual int size() const { return subsets.size(); } virtual int nrows() const { return gdf.nrows(); } public: void clear() { for (size_t i = 0; i < resolved.size(); i++) { resolved[i] = R_NilValue; } } void input_summarised(const SymbolString& symbol, SummarisedVariable x) { input_subset(symbol, summarised_subset(x)); } private: const Data& gdf; std::vector subsets; SymbolMap symbol_map; mutable std::vector resolved; bool owner; void input_subset(const SymbolString& symbol, subset* sub) { SymbolMapIndex index = symbol_map.insert(symbol); if (index.origin == NEW) { subsets.push_back(sub); resolved.push_back(R_NilValue); } else { int idx = index.pos; delete subsets[idx]; subsets[idx] = sub; resolved[idx] = R_NilValue; } } }; typedef LazySplitSubsets LazyGroupedSubsets; } #endif dplyr/inst/include/dplyr/Result/CallElementProxy.h0000644000176200001440000000047713135665123022031 0ustar liggesusers#ifndef dplyr_CallElementProxy_H #define dplyr_CallElementProxy_H namespace dplyr { class CallElementProxy { public: CallElementProxy(SEXP symbol_, SEXP object_) : symbol(symbol_), object(object_) {} inline void set(SEXP value) { SETCAR(object, value); } SEXP symbol; SEXP object; }; } #endif dplyr/inst/include/dplyr/Result/ConstantResult.h0000644000176200001440000000435513135665123021571 0ustar liggesusers#ifndef dplyr_Result_ConstantResult_H #define dplyr_Result_ConstantResult_H #include namespace dplyr { template class ConstantResult : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; ConstantResult(SEXP x) : value(Rcpp::internal::r_vector_start(x)[0]) {} SEXP process(const GroupedDataFrame& gdf) { return Vector(gdf.ngroups(), value); } SEXP process(const RowwiseDataFrame& gdf) { return Vector(gdf.ngroups(), value); } virtual SEXP process(const FullDataFrame&) { return Vector::create(value); } virtual SEXP process(const SlicingIndex&) { return Vector::create(value); } STORAGE value; }; template class TypedConstantResult : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; TypedConstantResult(SEXP x, SEXP classes_) : value(Rcpp::internal::r_vector_start(x)[0]), classes(classes_) {} SEXP process(const GroupedDataFrame& gdf) { return get(gdf.ngroups()); } SEXP process(const RowwiseDataFrame& gdf) { return get(gdf.ngroups()); } virtual SEXP process(const FullDataFrame&) { return get(1); } virtual SEXP process(const SlicingIndex&) { return get(1); } private: SEXP get(int n) const { Vector res(n, value); set_class(res, classes); return res; } STORAGE value; SEXP classes; }; template class DifftimeConstantResult : public Result { public: typedef typename Rcpp::traits::storage_type::type STORAGE; DifftimeConstantResult(SEXP x) : value(Rcpp::internal::r_vector_start(x)[0]), units(Rf_getAttrib(x, Rf_install("units"))) {} SEXP process(const GroupedDataFrame& gdf) { return get(gdf.ngroups()); } SEXP process(const RowwiseDataFrame& gdf) { return get(gdf.ngroups()); } virtual SEXP process(const FullDataFrame&) { return get(1); } virtual SEXP process(const SlicingIndex&) { return get(1); } private: SEXP get(int n) const { Vector res(n, value); set_class(res, "difftime"); res.attr("units") = units; return res; } STORAGE value; CharacterVector units; }; } #endif dplyr/inst/include/dplyr/Result/CallbackProcessor.h0000644000176200001440000000712713135665123022175 0ustar liggesusers#ifndef dplyr_Result_CallbackProcessor_H #define dplyr_Result_CallbackProcessor_H #include #include #include #include #include namespace dplyr { // classes inherit from this template when they have a method with this signature // SEXP process_chunk( const SlicingIndex& indices) // // the first time process_chunk is called, CallbackProcessor finds the right type // for storing the results, and it creates a suitable DelayedProcessor // object which is then used to fill the vector // // DelayedReducer is an example on how CallbackReducer is used // // it is assumed that the SEXP comes from evaluating some R expression, so // it should be one of a integer vector of length one, a numeric vector of // length one or a character vector of length one template class CallbackProcessor : public Result { public: CallbackProcessor() {} CLASS* obj() { return static_cast(this); } virtual SEXP process(const GroupedDataFrame& gdf) { return process_data(gdf, obj()).run(); } virtual SEXP process(const RowwiseDataFrame& gdf) { return process_data(gdf, obj()).run(); } virtual SEXP process(const Rcpp::FullDataFrame& df) { return obj()->process_chunk(df.get_index()); } virtual SEXP process(const SlicingIndex&) { return R_NilValue; } private: template class process_data { public: process_data(const Data& gdf, CLASS* chunk_source_) : git(gdf.group_begin()), ngroups(gdf.ngroups()), chunk_source(chunk_source_) {} SEXP run() { if (ngroups == 0) { LOG_INFO << "no groups to process"; return get_processed_empty(); } LOG_INFO << "processing groups"; process_first(); process_rest(); return get_processed(); } private: void process_first() { const RObject& first_result = fetch_chunk(); LOG_INFO << "instantiating delayed processor for type " << type2name(first_result) << " for column `" << chunk_source->get_name().get_utf8_cstring() << "`"; processor.reset(get_delayed_processor(first_result, ngroups, chunk_source->get_name())); LOG_VERBOSE << "processing " << ngroups << " groups with " << processor->describe() << " processor"; } void process_rest() { for (int i = 1; i < ngroups; ++i) { const RObject& chunk = fetch_chunk(); if (!try_handle_chunk(chunk)) { LOG_VERBOSE << "not handled group " << i; handle_chunk_with_promotion(chunk, i); } } } bool try_handle_chunk(const RObject& chunk) const { return processor->try_handle(chunk); } void handle_chunk_with_promotion(const RObject& chunk, const int i) { IDelayedProcessor* new_processor = processor->promote(chunk); if (!new_processor) { bad_col(chunk_source->get_name(), "can't promote group {group} to {type}", _["group"] = i, _["type"] = processor->describe()); } LOG_VERBOSE << "promoted group " << i; processor.reset(new_processor); } RObject fetch_chunk() { const RObject& chunk = chunk_source->process_chunk(*git); ++git; return chunk; } SEXP get_processed() const { return processor->get(); } static SEXP get_processed_empty() { return LogicalVector(0, NA_LOGICAL); } private: typename Data::group_iterator git; const int ngroups; boost::scoped_ptr processor; CLASS* chunk_source; }; }; } #endif dplyr/inst/include/dplyr/Result/LazySubsets.h0000644000176200001440000000404113136317426021062 0ustar liggesusers#ifndef dplyr_LazySubsets_H #define dplyr_LazySubsets_H #include #include #include namespace dplyr { class LazySubsets : public ILazySubsets { public: LazySubsets(const DataFrame& df) : nr(df.nrows()) { int nvars = df.size(); if (nvars) { CharacterVector names = df.names(); for (int i = 0; i < nvars; i++) { SEXP column = df[i]; if (Rf_inherits(column, "matrix")) { stop("matrix as column is not supported"); } symbol_map.insert(names[i]); data.push_back(df[i]); } } } virtual ~LazySubsets() {} public: virtual const SymbolVector get_variable_names() const { return symbol_map.get_names(); } virtual SEXP get_variable(const SymbolString& symbol) const { return data[ symbol_map.get(symbol) ]; } virtual SEXP get(const SymbolString& symbol, const SlicingIndex& indices) const { const int pos = symbol_map.get(symbol); SEXP col = data[pos]; if (!indices.is_identity(col) && Rf_length(col) != 1) stop("Attempt to query lazy column with non-natural slicing index"); return col; } virtual bool is_summary(const SymbolString& symbol) const { return summary_map.has(symbol); } virtual bool has_variable(const SymbolString& symbol) const { return symbol_map.has(symbol); } virtual void input(const SymbolString& symbol, SEXP x) { SymbolMapIndex index = symbol_map.insert(symbol); if (index.origin == NEW) { data.push_back(x); } else { data[index.pos] = x; } } virtual int size() const { return data.size(); } virtual int nrows() const { return nr; } void input_summarised(const SymbolString& symbol, SummarisedVariable x) { input(symbol, x); summary_map.insert(symbol); } public: void clear() {} inline SEXP& operator[](const SymbolString& symbol) { return data[symbol_map.get(symbol)]; } private: SymbolMap symbol_map, summary_map; std::vector data; int nr; }; } #endif dplyr/inst/include/dplyr/Result/GroupedSubset.h0000644000176200001440000000646113135665123021374 0ustar liggesusers#ifndef dplyr_GroupedSubset_H #define dplyr_GroupedSubset_H #include #include #include #include namespace dplyr { template class GroupedSubsetTemplate : public GroupedSubset { public: typedef typename Rcpp::traits::storage_type::type STORAGE; GroupedSubsetTemplate(SEXP x, int max_size) : object(x), output(max_size, object), start(Rcpp::internal::r_vector_start(object)) {} virtual SEXP get(const SlicingIndex& indices) { output.borrow(indices, start); return output; } virtual SEXP get_variable() const { return object; } virtual bool is_summary() const { return false; } private: SEXP object; ShrinkableVector output; STORAGE* start; }; class DataFrameGroupedSubset : public GroupedSubset { public: DataFrameGroupedSubset(SEXP x) : data(x), visitors(data) {} virtual SEXP get(const SlicingIndex& indices) { return visitors.subset(indices, get_class(data)); } virtual SEXP get_variable() const { return data; } virtual bool is_summary() const { return false; } private: DataFrame data; DataFrameSubsetVisitors visitors; }; inline GroupedSubset* grouped_subset(SEXP x, int max_size) { switch (TYPEOF(x)) { case INTSXP: return new GroupedSubsetTemplate(x, max_size); case REALSXP: return new GroupedSubsetTemplate(x, max_size); case LGLSXP: return new GroupedSubsetTemplate(x, max_size); case STRSXP: return new GroupedSubsetTemplate(x, max_size); case VECSXP: if (Rf_inherits(x, "data.frame")) return new DataFrameGroupedSubset(x); if (Rf_inherits(x, "POSIXlt")) { stop("POSIXlt not supported"); } return new GroupedSubsetTemplate(x, max_size); case CPLXSXP: return new GroupedSubsetTemplate(x, max_size); default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(x))); } template class SummarisedSubsetTemplate : public GroupedSubset { public: typedef typename Rcpp::traits::storage_type::type STORAGE; SummarisedSubsetTemplate(SummarisedVariable x) : object(x), output(1) { copy_most_attributes(output, object); } virtual SEXP get(const SlicingIndex& indices) { output[0] = object[indices.group()]; return output; } virtual SEXP get_variable() const { return object; } virtual bool is_summary() const { return true; } private: Rcpp::Vector object; Rcpp::Vector output; }; template <> inline SEXP SummarisedSubsetTemplate::get(const SlicingIndex& indices) { return List::create(object[indices.group()]); } inline GroupedSubset* summarised_subset(SummarisedVariable x) { switch (TYPEOF(x)) { case LGLSXP: return new SummarisedSubsetTemplate(x); case INTSXP: return new SummarisedSubsetTemplate(x); case REALSXP: return new SummarisedSubsetTemplate(x); case STRSXP: return new SummarisedSubsetTemplate(x); case VECSXP: return new SummarisedSubsetTemplate(x); case CPLXSXP: return new SummarisedSubsetTemplate(x); default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(x))); } } #endif dplyr/inst/include/dplyr/SubsetVectorVisitor.h0000644000176200001440000000266613135665123021336 0ustar liggesusers#ifndef dplyr_SubsetVectorVisitor_H #define dplyr_SubsetVectorVisitor_H #include #include #include namespace dplyr { template inline int output_size(const Container& container) { return container.size(); } /** * Subset Vector visitor base class, defines the interface */ class SubsetVectorVisitor { public: virtual ~SubsetVectorVisitor() {} /** creates a new vector, of the same type as the visited vector, by * copying elements at the given indices */ virtual SEXP subset(const Rcpp::IntegerVector& index) const = 0; virtual SEXP subset(const std::vector&) const = 0; virtual SEXP subset(const SlicingIndex&) const = 0; /** creates a new vector, of the same type as the visited vector, by * copying elements at the given indices */ virtual SEXP subset(const ChunkIndexMap& index) const = 0; virtual SEXP subset(EmptySubset) const = 0; virtual int size() const = 0; virtual std::string get_r_type() const = 0; bool is_same_typeid(SubsetVectorVisitor* other) const { return typeid(*other) == typeid(*this); } virtual bool is_same_type(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other); } virtual bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const = 0; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/vector_class.h0000644000176200001440000000115013135665123020000 0ustar liggesusers#ifndef dplyr_vector_class_H #define dplyr_vector_class_H namespace dplyr { template inline std::string vector_class(); template <> inline std::string vector_class() { return "integer"; } template <> inline std::string vector_class() { return "numeric"; } template <> inline std::string vector_class() { return "character"; } template <> inline std::string vector_class() { return "logical"; } template <> inline std::string vector_class() { return "list"; } template <> inline std::string vector_class() { return "complex"; } } #endif dplyr/inst/include/dplyr/main.h0000644000176200001440000000050113135665123016234 0ustar liggesusers#ifndef dplyr_dplyr_main_H #define dplyr_dplyr_main_H #include #include #include #include #include #include #include using namespace Rcpp; #endif // #ifndef dplyr_dplyr_main_H dplyr/inst/include/dplyr/DataFrameVisitors.h0000644000176200001440000000213413135665123020703 0ustar liggesusers#ifndef dplyr_DataFrameVisitors_H #define dplyr_DataFrameVisitors_H #include #include #include #include namespace dplyr { class DataFrameVisitors : public VisitorSetEqual, public VisitorSetHash, public VisitorSetLess, public VisitorSetGreater { private: const Rcpp::DataFrame& data; pointer_vector visitors; SymbolVector visitor_names; int nvisitors; public: typedef VectorVisitor visitor_type; DataFrameVisitors(const DataFrame& data_); DataFrameVisitors(const DataFrame& data_, const SymbolVector& names); inline int size() const { return nvisitors; } inline VectorVisitor* get(int k) const { return visitors[k]; } const SymbolString name(int k) const { return visitor_names[k]; } inline int nrows() const { return data.nrows(); } private: void structure(List& x, int nrows, CharacterVector classes) const; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/GroupedDataFrame.h0000644000176200001440000000703513147342615020474 0ustar liggesusers#ifndef dplyr_tools_GroupedDataFrame_H #define dplyr_tools_GroupedDataFrame_H #include #include #include #include #include #include namespace dplyr { inline void check_valid_colnames(const DataFrame& df) { if (df.size()) { CharacterVector names(df.names()); LogicalVector dup = duplicated(names); if (any(dup).is_true()) { std::stringstream s; s << "found duplicated column name: "; bool first = true; for (int i = 0; i < df.size(); i++) { if (dup[i] == TRUE) { if (first) { first = false; } else { s << ", "; } s << names[i]; } } stop(s.str()); } } } class GroupedDataFrame; class GroupedDataFrameIndexIterator { public: GroupedDataFrameIndexIterator(const GroupedDataFrame& gdf_); GroupedDataFrameIndexIterator& operator++(); GroupedSlicingIndex operator*() const; int i; const GroupedDataFrame& gdf; List indices; }; class GroupedDataFrame { public: typedef GroupedDataFrameIndexIterator group_iterator; typedef GroupedSlicingIndex slicing_index; typedef GroupedSubset subset; GroupedDataFrame(SEXP x): data_(x), group_sizes(), biggest_group_size(0), symbols(get_vars(data_)), labels() { // handle lazyness bool is_lazy = Rf_isNull(data_.attr("group_sizes")) || Rf_isNull(data_.attr("labels")); if (is_lazy) { data_ = build_index_cpp(data_); } group_sizes = data_.attr("group_sizes"); biggest_group_size = data_.attr("biggest_group_size"); labels = data_.attr("labels"); if (!is_lazy) { // check consistency of the groups int rows_in_groups = sum(group_sizes); if (data_.nrows() != rows_in_groups) { bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups", _["rows"] = data_.nrows(), _["group_rows"] = rows_in_groups); } } } group_iterator group_begin() const { return GroupedDataFrameIndexIterator(*this); } SymbolString symbol(int i) const { return symbols.get_name(i); } DataFrame& data() { return data_; } const DataFrame& data() const { return data_; } inline int ngroups() const { return group_sizes.size(); } inline int nvars() const { return labels.size(); } inline int nrows() const { return data_.nrows(); } inline SEXP label(int i) const { return labels[i]; } inline int max_group_size() const { return biggest_group_size; } inline bool has_group(const SymbolString& g) const { return symbols.has(g); } inline subset* create_subset(SEXP x) const { return grouped_subset(x, max_group_size()); } private: DataFrame data_; IntegerVector group_sizes; int biggest_group_size; SymbolMap symbols; DataFrame labels; }; inline GroupedDataFrameIndexIterator::GroupedDataFrameIndexIterator(const GroupedDataFrame& gdf_) : i(0), gdf(gdf_), indices(gdf.data().attr("indices")) {} inline GroupedDataFrameIndexIterator& GroupedDataFrameIndexIterator::operator++() { i++; return *this; } inline GroupedSlicingIndex GroupedDataFrameIndexIterator::operator*() const { return GroupedSlicingIndex(IntegerVector(indices[i]), i); } } namespace Rcpp { using namespace dplyr; template <> inline bool is(SEXP x) { return Rf_inherits(x, "grouped_df") && Rf_getAttrib(x, Rf_install("vars")) != R_NilValue; } } #endif dplyr/inst/include/dplyr/dplyr.h0000644000176200001440000000206713135665123016453 0ustar liggesusers#ifndef dplyr_dplyr_dplyr_H #define dplyr_dplyr_dplyr_H #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #endif // #ifndef dplyr_dplyr_dplyr_H dplyr/inst/include/dplyr/JoinVisitor.h0000644000176200001440000000110413135665123017567 0ustar liggesusers#ifndef dplyr_JoinVisitor_H #define dplyr_JoinVisitor_H #include #include namespace dplyr { class DataFrameJoinVisitors; class JoinVisitor { public: virtual ~JoinVisitor() {} virtual size_t hash(int i) = 0; virtual bool equal(int i, int j) = 0; virtual SEXP subset(const std::vector& indices) = 0; virtual SEXP subset(const VisitorSetIndexSet& set) = 0; }; JoinVisitor* join_visitor(const Column& left, const Column& right, bool warn, bool accept_na_match = true); } #endif dplyr/inst/include/dplyr/comparisons.h0000644000176200001440000000540613136641136017655 0ustar liggesusers#ifndef dplyr_comparison_H #define dplyr_comparison_H namespace dplyr { template struct comparisons { typedef typename Rcpp::traits::storage_type::type STORAGE; static inline bool is_less(STORAGE lhs, STORAGE rhs) { if (is_na(lhs)) return false; if (is_na(rhs)) return true; return lhs < rhs; } static inline bool is_greater(STORAGE lhs, STORAGE rhs) { return lhs > rhs; } static inline bool equal_or_both_na(STORAGE lhs, STORAGE rhs) { return lhs == rhs; } static inline bool is_na(STORAGE x) { return Rcpp::traits::is_na(x); } }; template <> struct comparisons { static inline bool is_less(SEXP lhs, SEXP rhs) { // we need this because CHAR(NA_STRING) gives "NA" if (is_na(lhs)) return false; if (is_na(rhs)) return true; return strcmp(CHAR(lhs), CHAR(rhs)) < 0; } static inline bool is_greater(SEXP lhs, SEXP rhs) { if (is_na(lhs)) return false; if (is_na(rhs)) return true; return strcmp(CHAR(lhs), CHAR(rhs)) > 0; } static inline bool equal_or_both_na(SEXP lhs, SEXP rhs) { return lhs == rhs; } static inline bool is_na(SEXP x) { return Rcpp::traits::is_na(x); } }; // taking advantage of the particularity of NA_REAL template <> struct comparisons { static inline bool is_less(double lhs, double rhs) { if (is_nan(lhs)) { return false; } else if (is_na(lhs)) { return is_nan(rhs); } else { // lhs >= rhs is false if rhs is NA or NaN return !(lhs >= rhs); } } static inline bool is_greater(double lhs, double rhs) { if (is_nan(lhs)) { return false; } else if (is_na(lhs)) { return is_nan(rhs); } else { // lhs <= rhs is false if rhs is NA or NaN return !(lhs <= rhs); } } static inline bool equal_or_both_na(double lhs, double rhs) { return lhs == rhs || (is_nan(lhs) && is_nan(rhs)) || (is_na(lhs) && is_na(rhs)); } static inline bool is_na(double x) { return ISNA(x); } static inline bool is_nan(double x) { return Rcpp::traits::is_nan(x); } }; template <> struct comparisons { static inline bool is_less(Rcomplex lhs, Rcomplex rhs) { if (is_na(lhs)) return false; if (is_na(rhs)) return true; return lhs.r < rhs.r || (lhs.r == rhs.r && lhs.i < rhs.i); } static inline bool is_greater(Rcomplex lhs, Rcomplex rhs) { if (is_na(lhs)) return false; if (is_na(rhs)) return true; return !(lhs.r < rhs.r || (lhs.r == rhs.r && lhs.i <= rhs.i)); } static inline bool equal_or_both_na(Rcomplex lhs, Rcomplex rhs) { return lhs.r == rhs.r && lhs.i == rhs.i; } static inline bool is_na(Rcomplex x) { return Rcpp::traits::is_na(x); } }; } #endif dplyr/inst/include/dplyr/config.h0000644000176200001440000000031313135665123016556 0ustar liggesusers#ifndef DPLYR_CONFIG_H #define DPLYR_CONFIG_H #ifndef DPLYR_MIN_INTERUPT_SIZE #define DPLYR_MIN_INTERUPT_SIZE 10000 #endif #ifndef DPLYR_INTERUPT_TIMES #define DPLYR_INTERUPT_TIMES 10 #endif #endif dplyr/inst/include/dplyr/workarounds.h0000644000176200001440000000032113135665123017666 0ustar liggesusers#ifndef DPLYR_WORKAROUND_H #define DPLYR_WORKAROUND_H // installChar was introduced in R 3.2.0 #ifndef installChar #define installChar(x) Rf_install(CHAR(x)) #define Rf_installChar installChar #endif #endif dplyr/inst/include/dplyr/SubsetVectorVisitorImpl.h0000644000176200001440000001402213135665123022145 0ustar liggesusers#ifndef dplyr_SubsetVectorVisitor_Impl_H #define dplyr_SubsetVectorVisitor_Impl_H #include #include #include #include namespace dplyr { /** * Implementations */ template class SubsetVectorVisitorImpl : public SubsetVectorVisitor { public: typedef Rcpp::Vector VECTOR; /** * The type of data : int, double, SEXP, Rcomplex */ typedef typename Rcpp::traits::storage_type::type STORAGE; SubsetVectorVisitorImpl(const VECTOR& vec_) : vec(vec_) {} inline SEXP subset(const Rcpp::IntegerVector& index) const { return subset_int_index(index); } inline SEXP subset(const std::vector& index) const { return subset_int_index(index); } inline SEXP subset(const SlicingIndex& index) const { return subset_int_index(index); } inline SEXP subset(const ChunkIndexMap& map) const { int n = output_size(map); VECTOR out = Rcpp::no_init(n); ChunkIndexMap::const_iterator it = map.begin(); for (int i = 0; i < n; i++, ++it) out[i] = vec[ it->first ]; copy_most_attributes(out, vec); return out; } inline SEXP subset(EmptySubset) const { VECTOR out(0); copy_most_attributes(out, vec); return out; } inline std::string get_r_type() const { return VectorVisitorType(); } inline int size() const { return vec.size(); } inline bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other); } protected: VECTOR vec; template inline SEXP subset_int_index(const Container& index) const { int n = output_size(index); VECTOR out = Rcpp::no_init(n); for (int i = 0; i < n; i++) { if (index[i] < 0) { out[i] = VECTOR::get_na(); } else { out[i] = vec[ index[i] ]; } } copy_most_attributes(out, vec); return out; } }; template <> template SEXP SubsetVectorVisitorImpl::subset_int_index(const Container& index) const { int n = output_size(index); List out(n); for (int i = 0; i < n; i++) out[i] = (index[i] < 0) ? R_NilValue : vec[ index[i] ]; copy_most_attributes(out, vec); return out; } class SubsetFactorVisitor : public SubsetVectorVisitorImpl { public: typedef SubsetVectorVisitorImpl Parent; SubsetFactorVisitor(const IntegerVector& vec_) : Parent(vec_) { levels = get_levels(vec); levels_ptr = Rcpp::internal::r_vector_start(levels); } inline SEXP subset(const Rcpp::IntegerVector& index) const { return promote(Parent::subset(index)); } inline SEXP subset(const SlicingIndex& index) const { return promote(Parent::subset(index)); } inline SEXP subset(const std::vector& index) const { return promote(Parent::subset(index)); } inline SEXP subset(const ChunkIndexMap& map) const { return promote(Parent::subset(map)); } inline SEXP subset(EmptySubset empty) const { return promote(Parent::subset(empty)); } inline std::string get_r_type() const { return get_single_class(Parent::vec); } inline bool is_same_type(SubsetVectorVisitor* other, std::stringstream& ss, const SymbolString& name) const { return is_same_typeid(other) && same_levels(dynamic_cast(other), ss, name); } inline bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other) || (typeid(*other) == typeid(SubsetVectorVisitorImpl)); } private: inline bool same_levels(SubsetFactorVisitor* other, std::stringstream& ss, const SymbolString& name) const { CharacterVector levels_other = other->levels; if (!character_vector_equal(levels, levels_other)) { ss << "Factor levels not equal for column `" << name.get_utf8_cstring() << "`"; return false; } return true; } inline SEXP promote(IntegerVector x) const { copy_most_attributes(x, vec); return x; } CharacterVector levels; SEXP* levels_ptr; }; class DateSubsetVectorVisitor : public SubsetVectorVisitor { public: DateSubsetVectorVisitor(SEXP data) : impl(0) { if (TYPEOF(data) == INTSXP) { impl = new SubsetVectorVisitorImpl(data); } else if (TYPEOF(data) == REALSXP) { impl = new SubsetVectorVisitorImpl(data); } else { stop("Unreachable"); } } ~DateSubsetVectorVisitor() { delete impl; } virtual SEXP subset(const Rcpp::IntegerVector& index) const { return impl->subset(index); } virtual SEXP subset(const SlicingIndex& index) const { return impl->subset(index); } virtual SEXP subset(const std::vector& index) const { return impl->subset(index); } virtual SEXP subset(const ChunkIndexMap& index) const { return impl->subset(index); } virtual SEXP subset(EmptySubset index) const { return impl->subset(index); } virtual int size() const { return impl->size(); } virtual std::string get_r_type() const { return impl->get_r_type(); } bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other); } private: SubsetVectorVisitor* impl; DateSubsetVectorVisitor(const DateSubsetVectorVisitor&); }; template <> inline bool SubsetVectorVisitorImpl::is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other) || typeid(*other) == typeid(SubsetVectorVisitorImpl); } template <> inline bool SubsetVectorVisitorImpl::is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other) || typeid(*other) == typeid(SubsetVectorVisitorImpl); } template <> inline bool SubsetVectorVisitorImpl::is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other) || typeid(*other) == typeid(SubsetFactorVisitor); } } #endif dplyr/inst/include/dplyr/DataFrameJoinVisitors.h0000644000176200001440000000333613135665123021530 0ustar liggesusers#ifndef dplyr_DataFrameJoinVisitors_H #define dplyr_DataFrameJoinVisitors_H #include #include #include #include namespace dplyr { class DataFrameJoinVisitors : public VisitorSetEqual, public VisitorSetHash { public: typedef JoinVisitor visitor_type; DataFrameJoinVisitors( const DataFrame& left_, const DataFrame& right_, const SymbolVector& names_left, const SymbolVector& names_right, bool warn_, bool na_match ); inline JoinVisitor* get(int k) const { return visitors[k]; } inline JoinVisitor* get(const SymbolString& name) const { for (int i = 0; i < nvisitors; i++) { if (name == visitor_names_left[i]) return get(i); } stop("visitor not found for name '%s' ", name.get_utf8_cstring()); } inline int size() const { return nvisitors; } template inline DataFrame subset(const Container& index, const CharacterVector& classes) { int nrows = index.size(); Rcpp::List out(nvisitors); for (int k = 0; k < nvisitors; k++) { out[k] = get(k)->subset(index); } set_class(out, classes); set_rownames(out, nrows); out.names() = visitor_names_left; copy_vars(out, left); return (SEXP)out; } const SymbolVector& left_names() const { return visitor_names_left; } const SymbolVector& right_names() const { return visitor_names_right; } private: const DataFrame& left; const DataFrame& right; SymbolVector visitor_names_left; SymbolVector visitor_names_right; int nvisitors; pointer_vector visitors; bool warn; }; } #endif dplyr/inst/include/dplyr/VectorVisitor.h0000644000176200001440000000154313135665123020141 0ustar liggesusers#ifndef dplyr_VectorVisitor_H #define dplyr_VectorVisitor_H namespace dplyr { /** * Vector visitor base class, defines the interface */ class VectorVisitor { public: virtual ~VectorVisitor() {} /** hash the element of the visited vector at index i */ virtual size_t hash(int i) const = 0; /** are the elements at indices i and j equal */ virtual bool equal(int i, int j) const = 0; /** are the elements at indices i and j equal or both NA */ virtual bool equal_or_both_na(int i, int j) const = 0; /** is the i element less than the j element */ virtual bool less(int i, int j) const = 0; /** is the i element less than the j element */ virtual bool greater(int i, int j) const = 0; virtual int size() const = 0; virtual std::string get_r_type() const = 0; virtual bool is_na(int i) const = 0; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/RowwiseDataFrame.h0000644000176200001440000000316013135665123020520 0ustar liggesusers#ifndef dplyr_tools_RowwiseDataFrame_H #define dplyr_tools_RowwiseDataFrame_H #include #include #include namespace dplyr { class RowwiseDataFrame; class RowwiseDataFrameIndexIterator { public: RowwiseDataFrameIndexIterator() : i(0) {} RowwiseDataFrameIndexIterator& operator++() { ++i; return *this; } RowwiseSlicingIndex operator*() const { return RowwiseSlicingIndex(i); } int i; }; class RowwiseDataFrame { public: typedef RowwiseDataFrameIndexIterator group_iterator; typedef RowwiseSlicingIndex slicing_index; typedef RowwiseSubset subset; RowwiseDataFrame(SEXP x): data_(x), group_sizes() { group_sizes = rep(1, data_.nrows()); } group_iterator group_begin() const { return RowwiseDataFrameIndexIterator(); } DataFrame& data() { return data_; } const DataFrame& data() const { return data_; } inline int ngroups() const { return group_sizes.size(); } inline int nvars() const { return 0; } inline SymbolString symbol(int) { stop("Rowwise data frames don't have grouping variables"); } inline SEXP label(int) { return R_NilValue; } inline int nrows() const { return data_.nrows(); } inline int max_group_size() const { return 1; } inline subset* create_subset(SEXP x) const { return rowwise_subset(x); } private: DataFrame data_; IntegerVector group_sizes; }; } namespace Rcpp { using namespace dplyr; template <> inline bool is(SEXP x) { return Rf_inherits(x, "rowwise_df"); } } #endif dplyr/inst/include/dplyr/DataFrameColumnSubsetVisitor.h0000644000176200001440000000232713135665123023070 0ustar liggesusers#ifndef dplyr_DataFrameColumnSubsetVisitors_H #define dplyr_DataFrameColumnSubsetVisitors_H #include namespace dplyr { class DataFrameColumnSubsetVisitor : public SubsetVectorVisitor { public: DataFrameColumnSubsetVisitor(const DataFrame& data_) : data(data_), visitors(data) {} inline SEXP subset(const Rcpp::IntegerVector& index) const { return visitors.subset(index, get_class(data)); } inline SEXP subset(const std::vector& index) const { return visitors.subset(index, get_class(data)); } inline SEXP subset(const SlicingIndex& index) const { return visitors.subset(index, get_class(data)); } inline SEXP subset(const ChunkIndexMap& index) const { return visitors.subset(index, get_class(data)); } inline SEXP subset(EmptySubset index) const { return visitors.subset(index, get_class(data)); } inline int size() const { return visitors.nrows(); } inline std::string get_r_type() const { return "data.frame"; } inline bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other); } private: DataFrame data; DataFrameSubsetVisitors visitors; }; } #endif dplyr/inst/include/dplyr/DataFrameSubsetVisitors.h0000644000176200001440000000570713146776033022107 0ustar liggesusers#ifndef dplyr_DataFrameSubsetVisitors_H #define dplyr_DataFrameSubsetVisitors_H #include #include #include #include #include #include namespace dplyr { class DataFrameSubsetVisitors { private: const Rcpp::DataFrame& data; pointer_vector visitors; SymbolVector visitor_names; int nvisitors; public: typedef SubsetVectorVisitor visitor_type; DataFrameSubsetVisitors(const Rcpp::DataFrame& data_) : data(data_), visitors(), visitor_names(data.names()), nvisitors(visitor_names.size()) { CharacterVector names = data.names(); for (int i = 0; i < nvisitors; i++) { SubsetVectorVisitor* v = subset_visitor(data[i], names[i]); visitors.push_back(v); } } DataFrameSubsetVisitors(const DataFrame& data_, const SymbolVector& names) : data(data_), visitors(), visitor_names(names), nvisitors(visitor_names.size()) { CharacterVector data_names = data.names(); IntegerVector indx = names.match_in_table(data_names); int n = indx.size(); for (int i = 0; i < n; i++) { int pos = indx[i]; if (pos == NA_INTEGER) { bad_col(names[i], "is unknown"); } SubsetVectorVisitor* v = subset_visitor(data[pos - 1], data_names[pos - 1]); visitors.push_back(v); } } template DataFrame subset(const Container& index, const CharacterVector& classes) const { List out(nvisitors); for (int k = 0; k < nvisitors; k++) { out[k] = get(k)->subset(index); } copy_most_attributes(out, data); structure(out, output_size(index), classes); return out; } inline int size() const { return nvisitors; } inline SubsetVectorVisitor* get(int k) const { return visitors[k]; } const SymbolString name(int k) const { return visitor_names[k]; } inline int nrows() const { return data.nrows(); } private: inline void structure(List& x, int nrows, CharacterVector classes) const { set_class(x, classes); set_rownames(x, nrows); x.names() = visitor_names; copy_vars(x, data); } }; template <> inline DataFrame DataFrameSubsetVisitors::subset(const LogicalVector& index, const CharacterVector& classes) const { const int n = index.size(); std::vector idx; idx.reserve(n); for (int i = 0; i < n; i++) { if (index[i] == TRUE) { idx.push_back(i); } } return subset(idx, classes); } template DataFrame subset(DataFrame df, const Index& indices, const SymbolVector& columns, const CharacterVector& classes) { return DataFrameSubsetVisitors(df, columns).subset(indices, classes); } template DataFrame subset(DataFrame df, const Index& indices, CharacterVector classes) { return DataFrameSubsetVisitors(df).subset(indices, classes); } } // namespace dplyr #include #endif dplyr/inst/include/dplyr/FullDataFrame.h0000644000176200001440000000063213135665123017764 0ustar liggesusers#ifndef dplyr_tools_FullDataFrame_H #define dplyr_tools_FullDataFrame_H namespace dplyr { class FullDataFrame { public: typedef NaturalSlicingIndex slicing_index; FullDataFrame(const DataFrame& data_) : index(data_.nrows()) {} const SlicingIndex& get_index() const { return index; } inline int nrows() const { return index.size(); } private: NaturalSlicingIndex index; }; } #endif dplyr/inst/include/dplyr/Gatherer.h0000644000176200001440000001633213135665123017062 0ustar liggesusers#ifndef dplyr_Gatherer_H #define dplyr_Gatherer_H #include #include #include #include #include #include #include #include #include namespace dplyr { class Gatherer { public: virtual ~Gatherer() {} virtual SEXP collect() = 0; }; template class GathererImpl : public Gatherer { public: typedef GroupedCallProxy Proxy; GathererImpl(RObject& first, SlicingIndex& indices, Proxy& proxy_, const Data& gdf_, int first_non_na_, const SymbolString& name_) : gdf(gdf_), proxy(proxy_), first_non_na(first_non_na_), name(name_) { coll = collecter(first, gdf.nrows()); if (first_non_na < gdf.ngroups()) grab(first, indices); } ~GathererImpl() { if (coll != 0) { delete coll; } } SEXP collect() { int ngroups = gdf.ngroups(); if (first_non_na == ngroups) return coll->get(); typename Data::group_iterator git = gdf.group_begin(); int i = 0; for (; i < first_non_na; i++) ++git; ++git; i++; for (; i < ngroups; i++, ++git) { const SlicingIndex& indices = *git; Shield subset(proxy.get(indices)); grab(subset, indices); } return coll->get(); } private: inline void grab(SEXP subset, const SlicingIndex& indices) { int n = Rf_length(subset); if (n == indices.size()) { grab_along(subset, indices); } else if (n == 1) { grab_rep(subset, indices); } else if (Rf_isNull(subset)) { stop("incompatible types (NULL), expecting %s", coll->describe()); } else { check_length(n, indices.size(), "the group size", name); } } void grab_along(SEXP subset, const SlicingIndex& indices) { if (coll->compatible(subset)) { // if the current source is compatible, collect coll->collect(indices, subset); } else if (coll->can_promote(subset)) { // setup a new Collecter Collecter* new_collecter = promote_collecter(subset, gdf.nrows(), coll); // import data from previous collecter. new_collecter->collect(NaturalSlicingIndex(gdf.nrows()), coll->get()); // import data from this chunk new_collecter->collect(indices, subset); // dispose the previous collecter and keep the new one. delete coll; coll = new_collecter; } else if (coll->is_logical_all_na()) { Collecter* new_collecter = collecter(subset, gdf.nrows()); new_collecter->collect(indices, subset); delete coll; coll = new_collecter; } else { bad_col(name, "can't be converted from {source_type} to {target_type}", _["source_type"] = coll->describe(), _["target_type"] = get_single_class(subset)); } } void grab_rep(SEXP value, const SlicingIndex& indices) { int n = indices.size(); // FIXME: This can be made faster if `source` in `Collecter->collect(source, indices)` // could be of length 1 recycling the value. // TODO: create Collecter->collect_one(source, indices)? for (int j = 0; j < n; j++) { grab_along(value, RowwiseSlicingIndex(indices[j])); } } const Data& gdf; Proxy& proxy; Collecter* coll; int first_non_na; const SymbolString& name; }; template class ListGatherer : public Gatherer { public: typedef GroupedCallProxy Proxy; ListGatherer(List first, SlicingIndex& indices, Proxy& proxy_, const Data& gdf_, int first_non_na_, const SymbolString& name_) : gdf(gdf_), proxy(proxy_), data(gdf.nrows()), first_non_na(first_non_na_), name(name_) { if (first_non_na < gdf.ngroups()) { perhaps_duplicate(first); grab(first, indices); } copy_most_attributes(data, first); } SEXP collect() { int ngroups = gdf.ngroups(); if (first_non_na == ngroups) return data; typename Data::group_iterator git = gdf.group_begin(); int i = 0; for (; i < first_non_na; i++) ++git; ++git; i++; for (; i < ngroups; i++, ++git) { const SlicingIndex& indices = *git; List subset(proxy.get(indices)); perhaps_duplicate(subset); grab(subset, indices); } return data; } private: inline void perhaps_duplicate(List& x) { int n = x.size(); for (int i = 0; i < n; i++) { SEXP xi = x[i]; if (IS_DPLYR_SHRINKABLE_VECTOR(xi)) { x[i] = Rf_duplicate(xi); } else if (TYPEOF(xi) == VECSXP) { List lxi(xi); perhaps_duplicate(lxi); } } } inline void grab(const List& subset, const SlicingIndex& indices) { int n = subset.size(); if (n == indices.size()) { grab_along(subset, indices); } else if (n == 1) { grab_rep(subset[0], indices); } else { check_length(n, indices.size(), "the group size", name); } } void grab_along(const List& subset, const SlicingIndex& indices) { int n = indices.size(); for (int j = 0; j < n; j++) { data[ indices[j] ] = subset[j]; } } void grab_rep(SEXP value, const SlicingIndex& indices) { int n = indices.size(); for (int j = 0; j < n; j++) { data[ indices[j] ] = value; } } const Data& gdf; Proxy& proxy; List data; int first_non_na; const SymbolString name; }; template class ConstantGathererImpl : public Gatherer { public: ConstantGathererImpl(Vector constant, int n) : value(n, Rcpp::internal::r_vector_start(constant)[0]) { copy_most_attributes(value, constant); } inline SEXP collect() { return value; } private: Vector value; }; inline Gatherer* constant_gatherer(SEXP x, int n, const SymbolString& name) { if (Rf_inherits(x, "POSIXlt")) { bad_col(name, "is of unsupported class POSIXlt"); } switch (TYPEOF(x)) { case INTSXP: return new ConstantGathererImpl(x, n); case REALSXP: return new ConstantGathererImpl(x, n); case LGLSXP: return new ConstantGathererImpl(x, n); case STRSXP: return new ConstantGathererImpl(x, n); case CPLXSXP: return new ConstantGathererImpl(x, n); case VECSXP: return new ConstantGathererImpl(x, n); default: break; } bad_col(name, "is of unsupported type {type}", _["type"] = Rf_type2char(TYPEOF(x))); } template inline Gatherer* gatherer(GroupedCallProxy& proxy, const Data& gdf, const SymbolString& name) { typename Data::group_iterator git = gdf.group_begin(); typename Data::slicing_index indices = *git; RObject first(proxy.get(indices)); if (Rf_inherits(first, "POSIXlt")) { bad_col(name, "is of unsupported class POSIXlt"); } check_supported_type(first, name); check_length(Rf_length(first), indices.size(), "the group size", name); const int ng = gdf.ngroups(); int i = 0; while (all_na(first)) { i++; if (i == ng) break; ++git; indices = *git; first = proxy.get(indices); } if (TYPEOF(first) == VECSXP) { return new ListGatherer (List(first), indices, proxy, gdf, i, name); } else { return new GathererImpl (first, indices, proxy, gdf, i, name); } } } // namespace dplyr #endif dplyr/inst/include/dplyr/get_column.h0000644000176200001440000000031513135665123017447 0ustar liggesusers#ifndef dplyr_dplyr_get_column_H #define dplyr_dplyr_get_column_H namespace dplyr { SymbolString get_column(SEXP, const Environment&, const ILazySubsets&); } #endif // #ifndef dplyr_dplyr_get_column_H dplyr/inst/include/dplyr/Groups.h0000644000176200001440000000062313135665123016574 0ustar liggesusers#ifndef dplyr_dplyr_Groups_H #define dplyr_dplyr_Groups_H #include #include #include void check_not_groups(const QuosureList& quosures, const GroupedDataFrame& gdf); void check_not_groups(const QuosureList& quosures, const RowwiseDataFrame& gdf); SEXP strip_group_attributes(SEXP df); #endif // #ifndef dplyr_dplyr_Groups_H dplyr/inst/include/dplyr/bad.h0000644000176200001440000000742313136635173016053 0ustar liggesusers#ifndef DPLYR_DPLYR_BAD_H #define DPLYR_DPLYR_BAD_H namespace dplyr { template void NORET bad_arg(const SymbolString& arg, C1 arg1) { static Function bad_fun = Function("bad_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(arg.get_string()), arg1, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_arg(const SymbolString& arg, C1 arg1, C2 arg2) { static Function bad_fun = Function("bad_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(arg.get_string()), arg1, arg2, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_arg(const SymbolString& arg, C1 arg1, C2 arg2, C3 arg3) { static Function bad_fun = Function("bad_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(arg.get_string()), arg1, arg2, arg3, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_pos_arg(int pos_arg, C1 arg1) { static Function bad_fun = Function("bad_pos_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(pos_arg, arg1, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_pos_arg(int pos_arg, C1 arg1, C2 arg2) { static Function bad_fun = Function("bad_pos_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(pos_arg, arg1, arg2, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_pos_arg(int pos_arg, C1 arg1, C2 arg2, C3 arg3) { static Function bad_fun = Function("bad_pos_args", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(pos_arg, arg1, arg2, arg3, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_col(const SymbolString& col, C1 arg1) { static Function bad_fun = Function("bad_cols", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(col.get_string()), arg1, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_col(const SymbolString& col, C1 arg1, C2 arg2) { static Function bad_fun = Function("bad_cols", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(col.get_string()), arg1, arg2, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } template void NORET bad_col(const SymbolString& col, C1 arg1, C2 arg2, C3 arg3) { static Function bad_fun = Function("bad_cols", Environment::namespace_env("dplyr")); static Function identity = Function("identity", Environment::base_env()); String message = bad_fun(CharacterVector::create(col.get_string()), arg1, arg2, arg3, _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } } #endif // DPLYR_DPLYR_BAD_H dplyr/inst/include/dplyr/SummarisedVariable.h0000644000176200001440000000040213135665123021067 0ustar liggesusers#ifndef dplyr_SummarisedVariable_H #define dplyr_SummarisedVariable_H namespace dplyr { class SummarisedVariable { public: SummarisedVariable(SEXP x) : data(x) {} inline operator SEXP() const { return data; } private: SEXP data; }; } #endif dplyr/inst/include/dplyr/white_list.h0000644000176200001440000000100213135665123017460 0ustar liggesusers#ifndef dplyr_white_list_H #define dplyr_white_list_H namespace dplyr { inline bool white_list(SEXP x) { if (Rf_isMatrix(x)) { // might have to refine later return true; } switch (TYPEOF(x)) { case INTSXP: return true; case REALSXP: return true; case LGLSXP: return true; case STRSXP: return true; case CPLXSXP: return true; case VECSXP: { if (Rf_inherits(x, "POSIXlt")) return false; return true; } default: break; } return false; } } #endif dplyr/inst/include/dplyr/subset_visitor_impl.h0000644000176200001440000000444713135665123021432 0ustar liggesusers#ifndef dplyr_subset_visitor_impl_H #define dplyr_subset_visitor_impl_H #include #include #include #include #include namespace dplyr { inline SubsetVectorVisitor* subset_visitor_matrix(SEXP vec); inline SubsetVectorVisitor* subset_visitor_vector(SEXP vec); inline SubsetVectorVisitor* subset_visitor(SEXP vec, const SymbolString& name) { try { if (Rf_isMatrix(vec)) { return subset_visitor_matrix(vec); } else { return subset_visitor_vector(vec); } } catch (const Rcpp::exception& e) { bad_col(name, e.what()); } } inline SubsetVectorVisitor* subset_visitor_matrix(SEXP vec) { switch (TYPEOF(vec)) { case CPLXSXP: return new MatrixColumnSubsetVisitor(vec); case INTSXP: return new MatrixColumnSubsetVisitor(vec); case REALSXP: return new MatrixColumnSubsetVisitor(vec); case LGLSXP: return new MatrixColumnSubsetVisitor(vec); case STRSXP: return new MatrixColumnSubsetVisitor(vec); case VECSXP: return new MatrixColumnSubsetVisitor(vec); default: break; } stop("unsupported matrix type %s", Rf_type2char(TYPEOF(vec))); } inline SubsetVectorVisitor* subset_visitor_vector(SEXP vec) { if (Rf_inherits(vec, "Date")) { return new DateSubsetVectorVisitor(vec); } switch (TYPEOF(vec)) { case CPLXSXP: return new SubsetVectorVisitorImpl(vec); case INTSXP: if (Rf_inherits(vec, "factor")) return new SubsetFactorVisitor(vec); return new SubsetVectorVisitorImpl(vec); case REALSXP: return new SubsetVectorVisitorImpl(vec); case LGLSXP: return new SubsetVectorVisitorImpl(vec); case STRSXP: return new SubsetVectorVisitorImpl(vec); case VECSXP: { if (Rf_inherits(vec, "data.frame")) { return new DataFrameColumnSubsetVisitor(vec); } if (Rf_inherits(vec, "POSIXlt")) { stop("POSIXlt not supported"); } return new SubsetVectorVisitorImpl(vec); } default: break; } // should not happen, safeguard against segfaults anyway stop("is of unsupported type %s", Rf_type2char(TYPEOF(vec))); } } #endif dplyr/inst/include/dplyr/workarounds/0000755000176200001440000000000013163257361017523 5ustar liggesusersdplyr/inst/include/dplyr/workarounds/xlen.h0000644000176200001440000000054013135665123020637 0ustar liggesusers#ifndef DPLYR_WORKAROUND_XLEN_H #define DPLYR_WORKAROUND_XLEN_H #ifdef LONG_VECTOR_SUPPORT namespace Rcpp { template <> inline SEXP wrap(const R_xlen_t& x) { if (x < -R_SHORT_LEN_MAX || x > R_SHORT_LEN_MAX) { return Rf_ScalarReal(static_cast(x)); } else { return Rf_ScalarInteger(static_cast(x)); } } } #endif #endif dplyr/inst/include/dplyr/workarounds/static_assert.h0000644000176200001440000001743513120706341022544 0ustar liggesusers// (C) Copyright John Maddock 2000. // Use, modification and distribution are subject to the // Boost Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) // See http://www.boost.org/libs/static_assert for documentation. /* Revision history: 02 August 2000 Initial version. */ #ifndef BOOST_STATIC_ASSERT_HPP #define BOOST_STATIC_ASSERT_HPP #include #include #if defined(__GNUC__) && !defined(__GXX_EXPERIMENTAL_CXX0X__) // // This is horrible, but it seems to be the only we can shut up the // "anonymous variadic macros were introduced in C99 [-Wvariadic-macros]" // warning that get spewed out otherwise in non-C++11 mode. // #pragma GCC system_header #endif #ifndef BOOST_NO_CXX11_STATIC_ASSERT # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT_MSG( ... ) static_assert(__VA_ARGS__) # else # define BOOST_STATIC_ASSERT_MSG( B, Msg ) BOOST_STATIC_ASSERT( B ) # endif #else # define BOOST_STATIC_ASSERT_MSG( B, Msg ) BOOST_STATIC_ASSERT( B ) #endif #ifdef __BORLANDC__ // // workaround for buggy integral-constant expression support: #define BOOST_BUGGY_INTEGRAL_CONSTANT_EXPRESSIONS #endif #if defined(__GNUC__) && (__GNUC__ == 3) && ((__GNUC_MINOR__ == 3) || (__GNUC_MINOR__ == 4)) // gcc 3.3 and 3.4 don't produce good error messages with the default version: # define BOOST_SA_GCC_WORKAROUND #endif // // If the compiler issues warnings about old C style casts, // then enable this: // #if defined(__GNUC__) && ((__GNUC__ > 3) || ((__GNUC__ == 3) && (__GNUC_MINOR__ >= 4))) # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT_BOOL_CAST( ... ) ((__VA_ARGS__) == 0 ? false : true) # else # define BOOST_STATIC_ASSERT_BOOL_CAST( x ) ((x) == 0 ? false : true) # endif #else # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT_BOOL_CAST( ... ) (bool)(__VA_ARGS__) # else # define BOOST_STATIC_ASSERT_BOOL_CAST(x) (bool)(x) # endif #endif // // If the compiler warns about unused typedefs then enable this: // #if defined(__GNUC__) && ((__GNUC__ > 4) || ((__GNUC__ == 4) && (__GNUC_MINOR__ >= 7))) || (defined(__apple_build_version__) && (__apple_build_version__ >= 7000000)) # define BOOST_STATIC_ASSERT_UNUSED_ATTRIBUTE __attribute__((unused)) #else # define BOOST_STATIC_ASSERT_UNUSED_ATTRIBUTE #endif #ifndef BOOST_NO_CXX11_STATIC_ASSERT # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT( ... ) static_assert(__VA_ARGS__, #__VA_ARGS__) # else # define BOOST_STATIC_ASSERT( B ) static_assert(B, #B) # endif #else namespace boost { // HP aCC cannot deal with missing names for template value parameters template struct STATIC_ASSERTION_FAILURE; template <> struct STATIC_ASSERTION_FAILURE { enum { value = 1 }; }; // HP aCC cannot deal with missing names for template value parameters template struct static_assert_test {}; } // // Implicit instantiation requires that all member declarations be // instantiated, but that the definitions are *not* instantiated. // // It's not particularly clear how this applies to enum's or typedefs; // both are described as declarations [7.1.3] and [7.2] in the standard, // however some compilers use "delayed evaluation" of one or more of // these when implicitly instantiating templates. We use typedef declarations // by default, but try defining BOOST_USE_ENUM_STATIC_ASSERT if the enum // version gets better results from your compiler... // // Implementation: // Both of these versions rely on sizeof(incomplete_type) generating an error // message containing the name of the incomplete type. We use // "STATIC_ASSERTION_FAILURE" as the type name here to generate // an eye catching error message. The result of the sizeof expression is either // used as an enum initialiser, or as a template argument depending which version // is in use... // Note that the argument to the assert is explicitly cast to bool using old- // style casts: too many compilers currently have problems with static_cast // when used inside integral constant expressions. // #if !defined(BOOST_BUGGY_INTEGRAL_CONSTANT_EXPRESSIONS) #if defined(BOOST_MSVC) && (BOOST_MSVC < 1300) // __LINE__ macro broken when -ZI is used see Q199057 // fortunately MSVC ignores duplicate typedef's. #define BOOST_STATIC_ASSERT( B ) \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< (bool)( B ) >)\ > boost_static_assert_typedef_ #elif defined(BOOST_MSVC) && defined(BOOST_NO_CXX11_VARIADIC_MACROS) #define BOOST_STATIC_ASSERT( B ) \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST ( B ) >)>\ BOOST_JOIN(boost_static_assert_typedef_, __COUNTER__) #elif defined(BOOST_MSVC) #define BOOST_STATIC_ASSERT(...) \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST (__VA_ARGS__) >)>\ BOOST_JOIN(boost_static_assert_typedef_, __COUNTER__) #elif (defined(BOOST_INTEL_CXX_VERSION) || defined(BOOST_SA_GCC_WORKAROUND)) && defined(BOOST_NO_CXX11_VARIADIC_MACROS) // agurt 15/sep/02: a special care is needed to force Intel C++ issue an error // instead of warning in case of failure # define BOOST_STATIC_ASSERT( B ) \ typedef char BOOST_JOIN(boost_static_assert_typedef_, __LINE__) \ [ ::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST( B ) >::value ] #elif (defined(BOOST_INTEL_CXX_VERSION) || defined(BOOST_SA_GCC_WORKAROUND)) && !defined(BOOST_NO_CXX11_VARIADIC_MACROS) // agurt 15/sep/02: a special care is needed to force Intel C++ issue an error // instead of warning in case of failure # define BOOST_STATIC_ASSERT(...) \ typedef char BOOST_JOIN(boost_static_assert_typedef_, __LINE__) \ [ ::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST( __VA_ARGS__ ) >::value ] #elif defined(__sgi) // special version for SGI MIPSpro compiler #define BOOST_STATIC_ASSERT( B ) \ BOOST_STATIC_CONSTANT(bool, \ BOOST_JOIN(boost_static_assert_test_, __LINE__) = ( B )); \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< \ BOOST_JOIN(boost_static_assert_test_, __LINE__) >)>\ BOOST_JOIN(boost_static_assert_typedef_, __LINE__) #elif BOOST_WORKAROUND(__MWERKS__, <= 0x3003) // special version for CodeWarrior <= 8.x #define BOOST_STATIC_ASSERT( B ) \ BOOST_STATIC_CONSTANT(int, \ BOOST_JOIN(boost_static_assert_test_, __LINE__) = \ sizeof(::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST( B ) >) ) #else // generic version # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT( ... ) \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST( __VA_ARGS__ ) >)>\ BOOST_JOIN(boost_static_assert_typedef_, __LINE__) BOOST_STATIC_ASSERT_UNUSED_ATTRIBUTE # else # define BOOST_STATIC_ASSERT( B ) \ typedef ::boost::static_assert_test<\ sizeof(::boost::STATIC_ASSERTION_FAILURE< BOOST_STATIC_ASSERT_BOOL_CAST( B ) >)>\ BOOST_JOIN(boost_static_assert_typedef_, __LINE__) BOOST_STATIC_ASSERT_UNUSED_ATTRIBUTE # endif #endif #else // alternative enum based implementation: # ifndef BOOST_NO_CXX11_VARIADIC_MACROS # define BOOST_STATIC_ASSERT( ... ) \ enum { BOOST_JOIN(boost_static_assert_enum_, __LINE__) \ = sizeof(::boost::STATIC_ASSERTION_FAILURE< (bool)( __VA_ARGS__ ) >) } # else # define BOOST_STATIC_ASSERT(B) \ enum { BOOST_JOIN(boost_static_assert_enum_, __LINE__) \ = sizeof(::boost::STATIC_ASSERTION_FAILURE< (bool)( B ) >) } # endif #endif #endif // defined(BOOST_NO_CXX11_STATIC_ASSERT) #endif // BOOST_STATIC_ASSERT_HPP dplyr/inst/include/dplyr/visitor_impl.h0000644000176200001440000000352013135665123020034 0ustar liggesusers#ifndef dplyr_visitor_impl_H #define dplyr_visitor_impl_H #include #include #include namespace dplyr { inline VectorVisitor* visitor_matrix(SEXP vec); inline VectorVisitor* visitor_vector(SEXP vec); inline VectorVisitor* visitor(SEXP vec) { if (Rf_isMatrix(vec)) { return visitor_matrix(vec); } else { return visitor_vector(vec); } } inline VectorVisitor* visitor_matrix(SEXP vec) { switch (TYPEOF(vec)) { case CPLXSXP: return new MatrixColumnVisitor(vec); case INTSXP: return new MatrixColumnVisitor(vec); case REALSXP: return new MatrixColumnVisitor(vec); case LGLSXP: return new MatrixColumnVisitor(vec); case STRSXP: return new MatrixColumnVisitor(vec); case VECSXP: return new MatrixColumnVisitor(vec); default: break; } stop("unsupported matrix type %s", Rf_type2char(TYPEOF(vec))); } inline VectorVisitor* visitor_vector(SEXP vec) { switch (TYPEOF(vec)) { case CPLXSXP: return new VectorVisitorImpl(vec); case INTSXP: if (Rf_inherits(vec, "factor")) return new FactorVisitor(vec); return new VectorVisitorImpl(vec); case REALSXP: return new VectorVisitorImpl(vec); case LGLSXP: return new VectorVisitorImpl(vec); case STRSXP: return new VectorVisitorImpl(vec); case VECSXP: { if (Rf_inherits(vec, "data.frame")) { return new DataFrameColumnVisitor(vec); } if (Rf_inherits(vec, "POSIXlt")) { stop("POSIXlt not supported"); } return new VectorVisitorImpl(vec); } default: break; } // should not happen, safeguard against segfaults anyway stop("is of unsupported type %s", Rf_type2char(TYPEOF(vec))); } } #endif dplyr/inst/include/dplyr/BoolResult.h0000644000176200001440000000202213135665123017402 0ustar liggesusers#ifndef dplyr_tools_BoolResult_H #define dplyr_tools_BoolResult_H #include namespace dplyr { class BoolResult { public: BoolResult(bool result_) : result(result_) {} BoolResult(bool result_, const CharacterVector& msg) : result(result_), message(msg) {} inline operator SEXP() const { LogicalVector res = LogicalVector::create(result); res.attr("comment") = message; set_class(res, "BoolResult"); return res; } inline operator bool() const { return result; } inline std::string why_not() const { R_xlen_t n = message.length(); if (n == 0) return ""; if (n == 1) return std::string(message[0]); std::stringstream ss; ss << "\n"; for (int i = 0; i < n; ++i) { ss << "- " << std::string(message[i]) << "\n"; } return ss.str(); } private: bool result; CharacterVector message; }; inline BoolResult no_because(const CharacterVector& msg) { return BoolResult(false, msg); } inline BoolResult yes() { return true; } } #endif dplyr/inst/include/dplyr/Order.h0000644000176200001440000000327113156770126016375 0ustar liggesusers#ifndef dplyr_Order_H #define dplyr_Order_H #include #include namespace dplyr { class OrderVisitors_Compare; class OrderVisitors { public: OrderVisitors(List args, LogicalVector ascending, int n_) : visitors(n_), n(n_), nrows(0) { nrows = Rf_length(args[0]); for (int i = 0; i < n; i++) { visitors[i] = order_visitor(args[i], ascending[i], i); } } OrderVisitors(DataFrame data) : visitors(data.size()), n(data.size()), nrows(data.nrows()) { for (int i = 0; i < n; i++) visitors[i] = order_visitor(data[i], true, i); } Rcpp::IntegerVector apply() const; pointer_vector visitors; int n; int nrows; }; class OrderVisitors_Compare { public: OrderVisitors_Compare(const OrderVisitors& obj_) : obj(obj_), n(obj.n) {} inline bool operator()(int i, int j) const { if (i == j) return false; for (int k = 0; k < n; k++) if (! obj.visitors[k]->equal(i, j)) return obj.visitors[k]->before(i, j); return i < j; } private: const OrderVisitors& obj; int n; }; template class Compare_Single_OrderVisitor { public: Compare_Single_OrderVisitor(const OrderVisitorClass& obj_) : obj(obj_) {} inline bool operator()(int i, int j) const { if (i == j) return false; if (obj.equal(i, j)) return i < j; return obj.before(i, j); } private: const OrderVisitorClass& obj; }; inline Rcpp::IntegerVector OrderVisitors::apply() const { if (nrows == 0) return IntegerVector(0); IntegerVector x = seq(0, nrows - 1); std::sort(x.begin(), x.end(), OrderVisitors_Compare(*this)); return x; } } // namespace dplyr #endif dplyr/inst/include/dplyr/Replicator.h0000644000176200001440000000262313135665123017423 0ustar liggesusers#ifndef dplyr_Replicator_H #define dplyr_Replicator_H #include namespace dplyr { class Replicator { public: virtual ~Replicator() {} virtual SEXP collect() = 0; }; template class ReplicatorImpl : public Replicator { public: typedef typename Rcpp::traits::storage_type::type STORAGE; ReplicatorImpl(SEXP v, int n_, int ngroups_) : data(no_init(n_ * ngroups_)), source(v), n(n_), ngroups(ngroups_) {} SEXP collect() { for (int i = 0, k = 0; i < ngroups; i++) { for (int j = 0; j < n; j++, k++) { data[k] = source[j]; } } copy_most_attributes(data, source); return data; } private: Vector data; Vector source; int n; int ngroups; }; template inline Replicator* replicator(SEXP v, const Data& gdf) { int n = Rf_length(v); switch (TYPEOF(v)) { case INTSXP: return new ReplicatorImpl (v, n, gdf.ngroups()); case REALSXP: return new ReplicatorImpl (v, n, gdf.ngroups()); case STRSXP: return new ReplicatorImpl (v, n, gdf.ngroups()); case LGLSXP: return new ReplicatorImpl (v, n, gdf.ngroups()); case CPLXSXP: return new ReplicatorImpl (v, n, gdf.ngroups()); default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(v))); } } // namespace dplyr #endif dplyr/inst/include/dplyr/DataFrameVisitorsIndexMap.h0000644000176200001440000000042413135665123022331 0ustar liggesusers#ifndef dplyr_DataFrameVisitors_map_H #define dplyr_DataFrameVisitors_map_H #include #include namespace dplyr { typedef VisitorSetIndexMap< DataFrameVisitors, std::vector > ChunkIndexMap; } #endif dplyr/inst/include/dplyr/OrderVisitorImpl.h0000644000176200001440000001427713156572106020605 0ustar liggesusers#ifndef dplyr_OrderVectorVisitor_Impl_H #define dplyr_OrderVectorVisitor_Impl_H #include #include #include #include #include #include #include namespace dplyr { // version used for ascending = true template class OrderVectorVisitorImpl : public OrderVisitor { typedef comparisons compare; public: /** * The type of data : int, double, SEXP, Rcomplex */ typedef typename Rcpp::traits::storage_type::type STORAGE; OrderVectorVisitorImpl(const VECTOR& vec_) : vec(vec_) {} inline bool equal(int i, int j) const { return compare::equal_or_both_na(vec[i], vec[j]); } inline bool before(int i, int j) const { return compare::is_less(vec[i], vec[j]); } SEXP get() { return vec; } private: VECTOR vec; }; // version used for ascending = false template class OrderVectorVisitorImpl : public OrderVisitor { typedef comparisons compare; public: /** * The type of data : int, double, SEXP, Rcomplex */ typedef typename Rcpp::traits::storage_type::type STORAGE; OrderVectorVisitorImpl(const VECTOR& vec_) : vec(vec_) {} inline bool equal(int i, int j) const { return compare::equal_or_both_na(vec[i], vec[j]); } inline bool before(int i, int j) const { return compare::is_greater(vec[i], vec[j]); } SEXP get() { return vec; } private: VECTOR vec; }; template class OrderCharacterVectorVisitorImpl : public OrderVisitor { public: OrderCharacterVectorVisitorImpl(const CharacterVector& vec_) : vec(vec_), orders(CharacterVectorOrderer(vec).get()) {} inline bool equal(int i, int j) const { return orders.equal(i, j); } inline bool before(int i, int j) const { return orders.before(i, j); } SEXP get() { return vec; } private: CharacterVector vec; OrderVectorVisitorImpl orders; }; // ---------- data frame columns // ascending = true template class OrderVisitorDataFrame : public OrderVisitor { public: OrderVisitorDataFrame(const DataFrame& data_) : data(data_), visitors(data) {} inline bool equal(int i, int j) const { return visitors.equal(i, j); } inline bool before(int i, int j) const { return visitors.less(i, j); } inline SEXP get() { return data; } private: DataFrame data; DataFrameVisitors visitors; }; template <> class OrderVisitorDataFrame : public OrderVisitor { public: OrderVisitorDataFrame(const DataFrame& data_) : data(data_), visitors(data) {} inline bool equal(int i, int j) const { return visitors.equal(i, j); } inline bool before(int i, int j) const { return visitors.greater(i, j); } inline SEXP get() { return data; } private: DataFrame data; DataFrameVisitors visitors; }; // ---------- matrix columns // ascending = true template class OrderVisitorMatrix : public OrderVisitor { public: OrderVisitorMatrix(const Matrix& data_) : data(data_), visitors(data) {} inline bool equal(int i, int j) const { return visitors.equal(i, j); } inline bool before(int i, int j) const { return visitors.less(i, j); } inline SEXP get() { return data; } private: Matrix data; MatrixColumnVisitor visitors; }; // ascending = false template class OrderVisitorMatrix : public OrderVisitor { public: OrderVisitorMatrix(const Matrix& data_) : data(data_), visitors(data) {} inline bool equal(int i, int j) const { return visitors.equal(i, j); } inline bool before(int i, int j) const { return visitors.greater(i, j); } inline SEXP get() { return data; } private: Matrix data; MatrixColumnVisitor visitors; }; inline OrderVisitor* order_visitor(SEXP vec, const bool ascending, const int i); template OrderVisitor* order_visitor_asc(SEXP vec); template OrderVisitor* order_visitor_asc_matrix(SEXP vec); template OrderVisitor* order_visitor_asc_vector(SEXP vec); inline OrderVisitor* order_visitor(SEXP vec, const bool ascending, const int i) { try { if (ascending) { return order_visitor_asc(vec); } else { return order_visitor_asc(vec); } } catch (const Rcpp::exception& e) { bad_pos_arg(i + 1, e.what()); } } template inline OrderVisitor* order_visitor_asc(SEXP vec) { if (Rf_isMatrix(vec)) { return order_visitor_asc_matrix(vec); } else { return order_visitor_asc_vector(vec); } } template inline OrderVisitor* order_visitor_asc_matrix(SEXP vec) { switch (check_supported_type(vec)) { case DPLYR_INTSXP: return new OrderVisitorMatrix(vec); case DPLYR_REALSXP: return new OrderVisitorMatrix(vec); case DPLYR_LGLSXP: return new OrderVisitorMatrix(vec); case DPLYR_STRSXP: return new OrderVisitorMatrix(vec); case DPLYR_CPLXSXP: return new OrderVisitorMatrix(vec); case DPLYR_VECSXP: stop("Matrix can't be a list"); } stop("Unreachable"); return 0; } template inline OrderVisitor* order_visitor_asc_vector(SEXP vec) { switch (TYPEOF(vec)) { case INTSXP: return new OrderVectorVisitorImpl >(vec); case REALSXP: return new OrderVectorVisitorImpl >(vec); case LGLSXP: return new OrderVectorVisitorImpl >(vec); case STRSXP: return new OrderCharacterVectorVisitorImpl(vec); case CPLXSXP: return new OrderVectorVisitorImpl >(vec); case VECSXP: { if (Rf_inherits(vec, "data.frame")) { return new OrderVisitorDataFrame(vec); } break; } default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(vec))); } } #endif dplyr/inst/include/dplyr/HybridHandler.h0000644000176200001440000000037213135665123020035 0ustar liggesusers#ifndef dplyr_dplyr_HybridHandler_H #define dplyr_dplyr_HybridHandler_H namespace dplyr { class ILazySubsets; class Result; } typedef dplyr::Result* (*HybridHandler)(SEXP, const dplyr::ILazySubsets&, int); #endif // dplyr_dplyr_HybridHandlerMap_H dplyr/inst/include/dplyr/visitor.h0000644000176200001440000000023213135665123017010 0ustar liggesusers#ifndef dplyr_visitor_H #define dplyr_visitor_H #include namespace dplyr { inline VectorVisitor* visitor(SEXP vec); } #endif dplyr/inst/include/dplyr/registration.h0000644000176200001440000000051013135665123020022 0ustar liggesusers#ifndef dplyr_registration_H #define dplyr_registration_H #include #if defined(COMPILING_DPLYR) DataFrame build_index_cpp(DataFrame data); void registerHybridHandler(const char*, HybridHandler); SEXP get_time_classes(); SEXP get_date_classes(); #else #include "dplyr_RcppExports.h" #endif #endif dplyr/inst/include/dplyr/train.h0000644000176200001440000000237013135665123016433 0ustar liggesusers#ifndef dplyr_train_h #define dplyr_train_h namespace dplyr { template inline void iterate_with_interupts(Op op, int n) { int i = 0; if (n > DPLYR_MIN_INTERUPT_SIZE) { int m = n / DPLYR_INTERUPT_TIMES; for (int k = 0; k < DPLYR_INTERUPT_TIMES; k++) { for (int j = 0; j < m; j++, i++) op(i); Rcpp::checkUserInterrupt(); } } for (; i < n; i++) op(i); } template struct push_back_op { push_back_op(Map& map_) : map(map_) {} inline void operator()(int i) { map[i].push_back(i); } Map& map; }; template struct push_back_right_op { push_back_right_op(Map& map_) : map(map_) {} inline void operator()(int i) { map[-i - 1].push_back(-i - 1); } Map& map; }; template inline void train_push_back(Map& map, int n) { iterate_with_interupts(push_back_op(map), n); } template inline void train_push_back_right(Map& map, int n) { iterate_with_interupts(push_back_right_op(map), n); } template inline void train_insert(Set& set, int n) { for (int i = 0; i < n; i++) set.insert(i); } template inline void train_insert_right(Set& set, int n) { for (int i = 0; i < n; i++) set.insert(-i - 1); } } #endif dplyr/inst/include/dplyr/MatrixColumnVisitor.h0000644000176200001440000000564013136641136021322 0ustar liggesusers#ifndef dplyr_MatrixColumnVisitor_H #define dplyr_MatrixColumnVisitor_H #include namespace dplyr { template class MatrixColumnVisitor : public VectorVisitor { public: typedef typename Rcpp::traits::storage_type::type STORAGE; typedef typename Matrix::Column Column; class ColumnVisitor { public: typedef typename Rcpp::traits::storage_type::type STORAGE; typedef comparisons compare; typedef boost::hash hasher; ColumnVisitor(Matrix& data, int column) : column(data.column(column)) {} inline size_t hash(int i) const { return hash_fun(const_cast(column)[i]); } inline bool equal(int i, int j) const { return compare::equal_or_both_na(const_cast(column)[i], const_cast(column)[j]); } inline bool less(int i, int j) const { return compare::is_less(const_cast(column)[i], const_cast(column)[j]); } inline bool equal_or_both_na(int i, int j) const { return compare::equal_or_both_na(const_cast(column)[i], const_cast(column)[j]); } inline bool greater(int i, int j) const { return compare::is_greater(const_cast(column)[i], const_cast(column)[j]); } private: Column column; hasher hash_fun; }; MatrixColumnVisitor(const Matrix& data_) : data(data_), visitors() { for (int h = 0; h < data.ncol(); h++) { visitors.push_back(ColumnVisitor(data, h)); } } inline size_t hash(int i) const { size_t seed = visitors[0].hash(i); for (size_t h = 1; h < visitors.size(); h++) { boost::hash_combine(seed, visitors[h].hash(i)); } return seed; } inline bool equal(int i, int j) const { if (i == j) return true; for (size_t h = 0; h < visitors.size(); h++) { if (!visitors[h].equal(i, j)) return false; } return true; } inline bool equal_or_both_na(int i, int j) const { if (i == j) return true; for (size_t h = 0; h < visitors.size(); h++) { if (!visitors[h].equal_or_both_na(i, j)) return false; } return true; } inline bool less(int i, int j) const { if (i == j) return false; for (size_t h = 0; h < visitors.size(); h++) { const ColumnVisitor& v = visitors[h]; if (!v.equal(i, j)) { return v.less(i, j); } } return i < j; } inline bool greater(int i, int j) const { if (i == j) return false; for (size_t h = 0; h < visitors.size(); h++) { const ColumnVisitor& v = visitors[h]; if (!v.equal(i, j)) { return v.greater(i, j); } } return i < j; } inline int size() const { return data.nrow(); } inline std::string get_r_type() const { return "matrix"; } bool is_na(int) const { return false; } private: Matrix data; std::vector visitors; }; } #endif dplyr/inst/include/dplyr/MatrixColumnSubsetVectorVisitor.h0000644000176200001440000000407113135665123023671 0ustar liggesusers#ifndef dplyr_MatrixColumnSubsetVisitor_H #define dplyr_MatrixColumnSubsetVisitor_H namespace dplyr { template class MatrixColumnSubsetVisitor : public SubsetVectorVisitor { public: typedef typename Rcpp::traits::storage_type::type STORAGE; typedef typename Matrix::Column Column; MatrixColumnSubsetVisitor(const Matrix& data_) : data(data_) {} inline SEXP subset(const Rcpp::IntegerVector& index) const { return subset_int(index); } inline SEXP subset(const std::vector& index) const { return subset_int(index); } inline SEXP subset(const SlicingIndex& index) const { return subset_int(index); } inline SEXP subset(const ChunkIndexMap& index) const { int n = index.size(); int nc = data.ncol(); Matrix res(n, data.ncol()); for (int h = 0; h < nc; h++) { ChunkIndexMap::const_iterator it = index.begin(); Column column = res.column(h); Column source_column = const_cast&>(data).column(h); for (int i = 0; i < n; i++, ++it) { column[i] = source_column[ it->first ]; } } return res; } inline SEXP subset(EmptySubset) const { return Matrix(0, data.ncol()); } inline int size() const { return data.nrow(); } inline std::string get_r_type() const { return "matrix"; } inline bool is_compatible(SubsetVectorVisitor* other, std::stringstream&, const SymbolString&) const { return is_same_typeid(other); } private: template inline SEXP subset_int(const Container& index) const { int n = index.size(), nc = data.ncol(); Matrix res(n, nc); for (int h = 0; h < nc; h++) { Column column = res.column(h); Column source_column = const_cast&>(data).column(h); for (int k = 0; k < n; k++) { int idx = index[k]; if (idx < 0) { column[k] = Vector::get_na(); } else { column[k] = source_column[ index[k] ]; } } } return res; } Matrix data; }; } #endif dplyr/inst/include/dplyr/JoinVisitorImpl.h0000644000176200001440000001641613135665123020425 0ustar liggesusers#ifndef dplyr_JoinVisitorImpl_H #define dplyr_JoinVisitorImpl_H #include #include #include #include #include namespace dplyr { CharacterVector get_uniques(const CharacterVector& left, const CharacterVector& right); void check_attribute_compatibility(const Column& left, const Column& right); template class DualVector { public: enum { RTYPE = (LHS_RTYPE > RHS_RTYPE ? LHS_RTYPE : RHS_RTYPE) }; typedef Vector LHS_Vec; typedef Vector RHS_Vec; typedef Vector Vec; typedef typename Rcpp::traits::storage_type::type LHS_STORAGE; typedef typename Rcpp::traits::storage_type::type RHS_STORAGE; typedef typename Rcpp::traits::storage_type::type STORAGE; public: DualVector(LHS_Vec left_, RHS_Vec right_) : left(left_), right(right_) {} LHS_STORAGE get_left_value(const int i) const { if (i < 0) stop("get_left_value() called with negative argument"); return left[i]; } RHS_STORAGE get_right_value(const int i) const { if (i >= 0) stop("get_right_value() called with nonnegative argument"); return right[-i - 1]; } bool is_left_na(const int i) const { return left.is_na(get_left_value(i)); } bool is_right_na(const int i) const { return right.is_na(get_right_value(i)); } bool is_na(const int i) const { if (i >= 0) return is_left_na(i); else return is_right_na(i); } LHS_STORAGE get_value_as_left(const int i) const { if (i >= 0) return get_left_value(i); else { RHS_STORAGE x = get_right_value(i); if (LHS_RTYPE == RHS_RTYPE) return x; else return Rcpp::internal::r_coerce(x); } } RHS_STORAGE get_value_as_right(const int i) const { if (i >= 0) { LHS_STORAGE x = get_left_value(i); if (LHS_RTYPE == RHS_RTYPE) return x; else return Rcpp::internal::r_coerce(x); } else return get_right_value(i); } STORAGE get_value(const int i) const { if (RTYPE == LHS_RTYPE) return get_value_as_left(i); else return get_value_as_right(i); } template SEXP subset(iterator it, const int n) { // We use the fact that LGLSXP < INTSXP < REALSXP, this defines our coercion precedence RObject ret; if (LHS_RTYPE == RHS_RTYPE) ret = subset_same(it, n); else if (LHS_RTYPE > RHS_RTYPE) ret = subset_left(it, n); else ret = subset_right(it, n); copy_most_attributes(ret, left); return ret; } template SEXP subset_same(iterator it, const int n) { Vec res = no_init(n); for (int i = 0; i < n; i++, ++it) { res[i] = get_value(*it); } return res; } template SEXP subset_left(iterator it, const int n) { LHS_Vec res = no_init(n); for (int i = 0; i < n; i++, ++it) { res[i] = get_value_as_left(*it); } return res; } template SEXP subset_right(iterator it, const int n) { RHS_Vec res = no_init(n); for (int i = 0; i < n; i++, ++it) { res[i] = get_value_as_right(*it); } return res; } private: LHS_Vec left; RHS_Vec right; }; template class JoinVisitorImpl : public JoinVisitor { protected: typedef DualVector Storage; typedef boost::hash hasher; typedef typename Storage::LHS_Vec LHS_Vec; typedef typename Storage::RHS_Vec RHS_Vec; typedef typename Storage::Vec Vec; public: JoinVisitorImpl(const Column& left, const Column& right, const bool warn) : dual((SEXP)left.get_data(), (SEXP)right.get_data()) { if (warn) check_attribute_compatibility(left, right); } inline size_t hash(int i) { // If NAs don't match, we want to distribute their hashes as evenly as possible if (!ACCEPT_NA_MATCH && dual.is_na(i)) return static_cast(i); typename Storage::STORAGE x = dual.get_value(i); return hash_fun(x); } inline bool equal(int i, int j) { if (LHS_RTYPE == RHS_RTYPE) { // Shortcut for same data type return join_match::is_match(dual.get_value(i), dual.get_value(j)); } else if (i >= 0 && j >= 0) { return join_match::is_match(dual.get_left_value(i), dual.get_left_value(j)); } else if (i < 0 && j < 0) { return join_match::is_match(dual.get_right_value(i), dual.get_right_value(j)); } else if (i >= 0 && j < 0) { return join_match::is_match(dual.get_left_value(i), dual.get_right_value(j)); } else { return join_match::is_match(dual.get_right_value(i), dual.get_left_value(j)); } } SEXP subset(const std::vector& indices) { return dual.subset(indices.begin(), indices.size()); } SEXP subset(const VisitorSetIndexSet& set) { return dual.subset(set.begin(), set.size()); } public: hasher hash_fun; private: Storage dual; }; template class POSIXctJoinVisitor : public JoinVisitorImpl { typedef JoinVisitorImpl Parent; public: POSIXctJoinVisitor(const Column& left, const Column& right) : Parent(left, right, false), tzone(R_NilValue) { RObject tzone_left = left.get_data().attr("tzone"); RObject tzone_right = right.get_data().attr("tzone"); if (tzone_left.isNULL() && tzone_right.isNULL()) return; if (tzone_left.isNULL()) { tzone = tzone_right; } else if (tzone_right.isNULL()) { tzone = tzone_left; } else { std::string s_left = as(tzone_left); std::string s_right = as(tzone_right); if (s_left == s_right) { tzone = wrap(s_left); } else { tzone = wrap("UTC"); } } } inline SEXP subset(const std::vector& indices) { return promote(Parent::subset(indices)); } inline SEXP subset(const VisitorSetIndexSet& set) { return promote(Parent::subset(set)); } private: inline SEXP promote(NumericVector x) { set_class(x, Rcpp::CharacterVector::create("POSIXct", "POSIXt")); if (!tzone.isNULL()) { x.attr("tzone") = tzone; } return x; } private: RObject tzone; }; class DateJoinVisitorGetter { public: virtual ~DateJoinVisitorGetter() {}; virtual double get(int i) = 0; virtual bool is_na(int i) const = 0; }; template class DateJoinVisitor : public JoinVisitorImpl { typedef JoinVisitorImpl Parent; public: DateJoinVisitor(const Column& left, const Column& right) : Parent(left, right, false) {} inline SEXP subset(const std::vector& indices) { return promote(Parent::subset(indices)); } inline SEXP subset(const VisitorSetIndexSet& set) { return promote(Parent::subset(set)); } private: static SEXP promote(SEXP x) { set_class(x, "Date"); return x; } private: typename Parent::hasher hash_fun; }; } #endif dplyr/inst/include/dplyr/OrderVisitor.h0000644000176200001440000000061313135665123017747 0ustar liggesusers#ifndef dplyr_OrderVisitor_H #define dplyr_OrderVisitor_H namespace dplyr { class OrderVisitor { public: virtual ~OrderVisitor() {} /** are the elements at indices i and j equal */ virtual bool equal(int i, int j) const = 0; /** is the i element less than the j element */ virtual bool before(int i, int j) const = 0; virtual SEXP get() = 0; }; } // namespace dplyr #endif dplyr/inst/include/dplyr/HybridHandlerMap.h0000644000176200001440000000133013135665123020466 0ustar liggesusers#ifndef dplyr_dplyr_HybridHandlerMap_H #define dplyr_dplyr_HybridHandlerMap_H #include #include typedef dplyr_hash_map HybridHandlerMap; void install_simple_handlers(HybridHandlerMap& handlers); void install_minmax_handlers(HybridHandlerMap& handlers); void install_count_handlers(HybridHandlerMap& handlers); void install_nth_handlers(HybridHandlerMap& handlers); void install_window_handlers(HybridHandlerMap& handlers); void install_offset_handlers(HybridHandlerMap& handlers); void install_in_handlers(HybridHandlerMap& handlers); void install_debug_handlers(HybridHandlerMap& handlers); bool hybridable(RObject arg); #endif // dplyr_dplyr_HybridHandlerMap_H dplyr/inst/include/dplyr/VectorVisitorImpl.h0000644000176200001440000001014013150340402020737 0ustar liggesusers#ifndef dplyr_VectorVisitor_Impl_H #define dplyr_VectorVisitor_Impl_H #include #include #include #include #include #include namespace dplyr { template std::string VectorVisitorType(); template <> inline std::string VectorVisitorType() { return "integer"; } template <> inline std::string VectorVisitorType() { return "numeric"; } template <> inline std::string VectorVisitorType() { return "logical"; } template <> inline std::string VectorVisitorType() { return "character"; } template <> inline std::string VectorVisitorType() { return "complex"; } template <> inline std::string VectorVisitorType() { return "list"; } /** * Implementations */ template class VectorVisitorImpl : public VectorVisitor { typedef comparisons compare; public: typedef Rcpp::Vector VECTOR; /** * The type of data : int, double, SEXP, Rcomplex */ typedef typename Rcpp::traits::storage_type::type STORAGE; /** * Hasher for that type of data */ typedef boost::hash hasher; VectorVisitorImpl(const VECTOR& vec_) : vec(vec_) {} /** * implementations */ size_t hash(int i) const { return hash_fun(vec[i]); } inline bool equal(int i, int j) const { return compare::equal_or_both_na(vec[i], vec[j]); } inline bool less(int i, int j) const { return compare::is_less(vec[i], vec[j]); } inline bool equal_or_both_na(int i, int j) const { return compare::equal_or_both_na(vec[i], vec[j]); } inline bool greater(int i, int j) const { return compare::is_greater(vec[i], vec[j]); } inline std::string get_r_type() const { return VectorVisitorType(); } int size() const { return vec.size(); } bool is_na(int i) const { return VECTOR::is_na(vec[i]); } protected: VECTOR vec; hasher hash_fun; }; class FactorVisitor : public VectorVisitorImpl { typedef comparisons string_compare; public: typedef VectorVisitorImpl Parent; FactorVisitor(const IntegerVector& vec_) : Parent(vec_) { levels = get_levels(vec); levels_ptr = Rcpp::internal::r_vector_start(levels); } inline bool equal(int i, int j) const { return vec[i] == vec[j]; } inline bool less(int i, int j) const { return string_compare::is_less( vec[i] < 0 ? NA_STRING : levels_ptr[vec[i]], vec[j] < 0 ? NA_STRING : levels_ptr[vec[j]] ); } inline bool greater(int i, int j) const { return string_compare::is_greater( vec[i] < 0 ? NA_STRING : levels_ptr[vec[i]], vec[j] < 0 ? NA_STRING : levels_ptr[vec[j]] ); } inline std::string get_r_type() const { return get_single_class(Parent::vec); } private: CharacterVector levels; SEXP* levels_ptr; }; template <> class VectorVisitorImpl : public VectorVisitor { public: VectorVisitorImpl(const CharacterVector& vec_) : vec(reencode_char(vec_)), has_orders(false) {} size_t hash(int i) const { return reinterpret_cast(get_item(i)); } inline bool equal(int i, int j) const { return equal_or_both_na(i, j); } inline bool less(int i, int j) const { provide_orders(); return orders[i] < orders[j]; } inline bool equal_or_both_na(int i, int j) const { return get_item(i) == get_item(j); } inline bool greater(int i, int j) const { provide_orders(); return orders[i] > orders[j]; } inline std::string get_r_type() const { return VectorVisitorType(); } int size() const { return vec.size(); } bool is_na(int i) const { return CharacterVector::is_na(vec[i]); } private: SEXP get_item(const int i) const { return static_cast(vec[i]); } void provide_orders() const { if (has_orders) return; orders = CharacterVectorOrderer(vec).get(); has_orders = true; } private: CharacterVector vec; mutable IntegerVector orders; mutable bool has_orders; }; } #endif dplyr/inst/include/dplyr/checks.h0000644000176200001440000000422113135665123016553 0ustar liggesusers#ifndef dplyr_checks_H #define dplyr_checks_H #include #include namespace dplyr { enum SupportedType { DPLYR_LGLSXP = LGLSXP, DPLYR_INTSXP = INTSXP, DPLYR_REALSXP = REALSXP, DPLYR_CPLXSXP = CPLXSXP, DPLYR_STRSXP = STRSXP, DPLYR_VECSXP = VECSXP }; inline std::string type_name(SEXP x) { switch (TYPEOF(x)) { case NILSXP: return "NULL"; case SYMSXP: return "symbol"; case S4SXP: return "S4"; case LGLSXP: return "logical vector"; case INTSXP: return "integer vector"; case REALSXP: return "double vector"; case STRSXP: return "character vector"; case CPLXSXP: return "complex vector"; case RAWSXP: return "raw vector"; case VECSXP: return "list"; case LANGSXP: return "quoted call"; case EXPRSXP: return "expression"; case ENVSXP: return "environment"; case SPECIALSXP: case BUILTINSXP: case CLOSXP: return "function"; // Everything else can fall back to R's default default: return std::string(Rf_type2char(TYPEOF(x))); } } inline SupportedType check_supported_type(SEXP x, const SymbolString& name = String()) { switch (TYPEOF(x)) { case LGLSXP: return DPLYR_LGLSXP; case INTSXP: return DPLYR_INTSXP; case REALSXP: return DPLYR_REALSXP; case CPLXSXP: return DPLYR_CPLXSXP; case STRSXP: return DPLYR_STRSXP; case VECSXP: return DPLYR_VECSXP; default: if (name.is_empty()) { Rcpp::stop("is of unsupported type %s", type_name(x)); } else { bad_col(name, "is of unsupported type {type}", _["type"] = type_name(x)); } } } inline void check_length(const int actual, const int expected, const char* comment, const SymbolString& name) { if (actual == expected || actual == 1) return; static Function check_length_col("check_length_col", Environment::namespace_env("dplyr")); static Function identity("identity", Environment::base_env()); String message = check_length_col(actual, expected, CharacterVector::create(name.get_sexp()), std::string(comment), _[".abort"] = identity); message.set_encoding(CE_UTF8); stop(message.get_cstring()); } } #endif dplyr/inst/include/dplyr/join_match.h0000644000176200001440000000475313135665123017440 0ustar liggesusers#ifndef dplyr_join_match_H #define dplyr_join_match_H #include namespace dplyr { // not defined on purpose template struct join_match; // specialization when LHS_TYPE == RHS_TYPE template struct join_match { typedef comparisons compare; typedef typename Rcpp::traits::storage_type::type STORAGE; static inline bool is_match(STORAGE lhs, STORAGE rhs) { return compare::equal_or_both_na(lhs, rhs) && (ACCEPT_NA_MATCH || !compare::is_na(lhs)); } }; // NaN also don't match for reals template struct join_match { typedef comparisons compare; static inline bool is_match(double lhs, double rhs) { if (ACCEPT_NA_MATCH) return compare::equal_or_both_na(lhs, rhs); else return lhs == rhs && (ACCEPT_NA_MATCH || (!compare::is_na(lhs) && !compare::is_nan(lhs))); } }; // works for both LHS_RTYPE = INTSXP and LHS_RTYPE = LGLSXP template struct join_match_int_double { static inline bool is_match(int lhs, double rhs) { LOG_VERBOSE << lhs << " " << rhs; if (double(lhs) == rhs) { return (lhs != NA_INTEGER); } else { if (ACCEPT_NA_MATCH) return (lhs == NA_INTEGER && ISNA(rhs)); else return false; } } }; template struct join_match : join_match_int_double {}; template struct join_match : join_match_int_double {}; template struct join_match_double_int { static inline bool is_match(double lhs, int rhs) { return join_match_int_double::is_match(rhs, lhs); } }; template struct join_match : join_match_double_int {}; template struct join_match : join_match_double_int {}; template struct join_match : join_match {}; template struct join_match : join_match {}; } #endif // #ifndef dplyr_join_match_H dplyr/inst/include/dplyr/Hybrid.h0000644000176200001440000000032713135665123016537 0ustar liggesusers#ifndef dplyr_dplyr_Hybrid_H #define dplyr_dplyr_Hybrid_H namespace dplyr { class ILazySubsets; class Result; Result* get_handler(SEXP, const ILazySubsets&, const Environment&); } #endif // dplyr_dplyr_Hybrid_H dplyr/inst/include/dplyr/Collecter.h0000644000176200001440000004367213150340402017227 0ustar liggesusers#ifndef dplyr_Collecter_H #define dplyr_Collecter_H #include #include #include #include namespace dplyr { static inline bool inherits_from(SEXP x, const std::set& classes) { std::vector x_classes, inherited_classes; if (!OBJECT(x)) { return false; } x_classes = Rcpp::as< std::vector >(Rf_getAttrib(x, R_ClassSymbol)); std::sort(x_classes.begin(), x_classes.end()); std::set_intersection(x_classes.begin(), x_classes.end(), classes.begin(), classes.end(), std::back_inserter(inherited_classes)); return !inherited_classes.empty(); } static bool is_class_known(SEXP x) { static std::set known_classes; if (known_classes.empty()) { known_classes.insert("hms"); known_classes.insert("difftime"); known_classes.insert("POSIXct"); known_classes.insert("factor"); known_classes.insert("Date"); known_classes.insert("AsIs"); known_classes.insert("integer64"); known_classes.insert("table"); } if (OBJECT(x)) { return inherits_from(x, known_classes); } else { return true; } } static inline void warn_loss_attr(SEXP x) { /* Attributes are lost with unknown classes */ if (!is_class_known(x)) { SEXP classes = Rf_getAttrib(x, R_ClassSymbol); Rf_warning("Vectorizing '%s' elements may not preserve their attributes", CHAR(STRING_ELT(classes, 0))); } } static inline bool all_logical_na(SEXP x, SEXPTYPE xtype) { return LGLSXP == xtype && all_na(x); } class Collecter { public: virtual ~Collecter() {}; virtual void collect(const SlicingIndex& index, SEXP v, int offset = 0) = 0; virtual SEXP get() = 0; virtual bool compatible(SEXP) = 0; virtual bool can_promote(SEXP) const = 0; virtual bool is_factor_collecter() const { return false; } virtual bool is_logical_all_na() const { return false; } virtual std::string describe() const = 0; }; template class Collecter_Impl : public Collecter { public: typedef typename Rcpp::traits::storage_type::type STORAGE; Collecter_Impl(int n_): data(n_, Rcpp::traits::get_na()) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { if (all_logical_na(v, TYPEOF(v))) { collect_logicalNA(index); } else { collect_sexp(index, v, offset); } } inline SEXP get() { return data; } inline bool compatible(SEXP x) { return RTYPE == TYPEOF(x) || all_logical_na(x, TYPEOF(x)); } bool can_promote(SEXP) const { return false; } std::string describe() const { return vector_class(); } bool is_logical_all_na() const { return all_logical_na(data, RTYPE); } protected: Vector data; private: void collect_logicalNA(const SlicingIndex& index) { for (int i = 0; i < index.size(); i++) { data[index[i]] = Rcpp::traits::get_na(); } } void collect_sexp(const SlicingIndex& index, SEXP v, int offset = 0) { warn_loss_attr(v); Vector source(v); STORAGE* source_ptr = Rcpp::internal::r_vector_start(source); source_ptr = source_ptr + offset; for (int i = 0; i < index.size(); i++) { data[index[i]] = source_ptr[i]; } } }; template <> class Collecter_Impl : public Collecter { public: Collecter_Impl(int n_): data(n_, NA_REAL) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { warn_loss_attr(v); NumericVector source(v); double* source_ptr = source.begin() + offset; for (int i = 0; i < index.size(); i++) { data[index[i]] = source_ptr[i]; } } inline SEXP get() { return data; } inline bool compatible(SEXP x) { int RTYPE = TYPEOF(x); return (RTYPE == REALSXP && !Rf_inherits(x, "POSIXct") && !Rf_inherits(x, "Date")) || (RTYPE == INTSXP && !Rf_inherits(x, "factor")) || all_logical_na(x, RTYPE); } bool can_promote(SEXP) const { return false; } std::string describe() const { return "numeric"; } protected: NumericVector data; }; template <> class Collecter_Impl : public Collecter { public: Collecter_Impl(int n_): data(n_, NA_STRING) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { warn_loss_attr(v); if (TYPEOF(v) == STRSXP) { collect_strings(index, v, offset); } else if (Rf_inherits(v, "factor")) { collect_factor(index, v, offset); } else if (all_logical_na(v, TYPEOF(v))) { collect_logicalNA(index, v); } else { CharacterVector vec(v); collect_strings(index, vec, offset); } } inline SEXP get() { return data; } inline bool compatible(SEXP x) { return (STRSXP == TYPEOF(x)) || Rf_inherits(x, "factor") || all_logical_na(x, TYPEOF(x)); } bool can_promote(SEXP) const { return false; } std::string describe() const { return "character"; } protected: CharacterVector data; private: void collect_logicalNA(const SlicingIndex& index, LogicalVector) { int n = index.size(); for (int i = 0; i < n; i++) { SET_STRING_ELT(data, index[i], NA_STRING); } } void collect_strings(const SlicingIndex& index, CharacterVector source, int offset = 0) { SEXP* p_source = Rcpp::internal::r_vector_start(source) + offset; int n = index.size(); for (int i = 0; i < n; i++) { SET_STRING_ELT(data, index[i], p_source[i]); } } void collect_factor(const SlicingIndex& index, IntegerVector source, int offset = 0) { CharacterVector levels = get_levels(source); Rf_warning("binding character and factor vector, coercing into character vector"); for (int i = 0; i < index.size(); i++) { if (source[i] == NA_INTEGER) { data[index[i]] = NA_STRING; } else { data[index[i]] = levels[source[i + offset] - 1]; } } } }; template <> class Collecter_Impl : public Collecter { public: Collecter_Impl(int n_): data(n_, NA_INTEGER) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { warn_loss_attr(v); IntegerVector source(v); int* source_ptr = source.begin() + offset; for (int i = 0; i < index.size(); i++) { data[index[i]] = source_ptr[i]; } } inline SEXP get() { return data; } inline bool compatible(SEXP x) { int RTYPE = TYPEOF(x); return ((INTSXP == RTYPE) && !Rf_inherits(x, "factor")) || all_logical_na(x, RTYPE); } bool can_promote(SEXP x) const { return TYPEOF(x) == REALSXP && !Rf_inherits(x, "POSIXct") && !Rf_inherits(x, "Date"); } std::string describe() const { return "integer"; } protected: IntegerVector data; }; template class TypedCollecter : public Collecter_Impl { public: TypedCollecter(int n, SEXP types_) : Collecter_Impl(n), types(types_) {} inline SEXP get() { Vector data = Collecter_Impl::data; set_class(data, types); return data; } inline bool compatible(SEXP x) { String type = STRING_ELT(types, 0); return Rf_inherits(x, type.get_cstring()) || all_logical_na(x, TYPEOF(x)); } inline bool can_promote(SEXP) const { return false; } std::string describe() const { return collapse_utf8(types); } private: SEXP types; }; class POSIXctCollecter : public Collecter_Impl { public: typedef Collecter_Impl Parent; POSIXctCollecter(int n, SEXP tz_) : Parent(n), tz(tz_) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { if (Rf_inherits(v, "POSIXct")) { Parent::collect(index, v, offset); update_tz(v); } else if (all_logical_na(v, TYPEOF(v))) { Parent::collect(index, v, offset); } } inline SEXP get() { set_class(data, get_time_classes()); if (!tz.isNULL()) { Parent::data.attr("tzone") = tz; } return Parent::data; } inline bool compatible(SEXP x) { return Rf_inherits(x, "POSIXct") || all_logical_na(x, TYPEOF(x)); } inline bool can_promote(SEXP) const { return false; } std::string describe() const { return collapse_utf8(get_time_classes()); } private: void update_tz(SEXP v) { RObject v_tz(Rf_getAttrib(v, Rf_install("tzone"))); // if the new tz is NULL, keep previous value if (v_tz.isNULL()) return; if (tz.isNULL()) { // if current tz is NULL, grab the new one tz = v_tz; } else { // none are NULL, so compare them // if they are equal, fine if (STRING_ELT(tz, 0) == STRING_ELT(v_tz, 0)) return; // otherwise, settle to UTC tz = wrap("UTC"); } } RObject tz; }; class DifftimeCollecter : public Collecter_Impl { public: typedef Collecter_Impl Parent; DifftimeCollecter(int n, std::string units_, SEXP types_) : Parent(n), units(units_), types(types_) {} void collect(const SlicingIndex& index, SEXP v, int offset = 0) { if (Rf_inherits(v, "difftime")) { collect_difftime(index, v, offset); } else if (all_logical_na(v, TYPEOF(v))) { Parent::collect(index, v, offset); } } inline SEXP get() { set_class(Parent::data, types); Parent::data.attr("units") = wrap(units); return Parent::data; } inline bool compatible(SEXP x) { return Rf_inherits(x, "difftime") || all_logical_na(x, TYPEOF(x)); } inline bool can_promote(SEXP) const { return false; } std::string describe() const { return collapse_utf8(types); } private: bool is_valid_difftime(RObject x) { return x.inherits("difftime") && x.sexp_type() == REALSXP && get_units_map().is_valid_difftime_unit(Rcpp::as(x.attr("units"))); } void collect_difftime(const SlicingIndex& index, RObject v, int offset = 0) { if (!is_valid_difftime(v)) { stop("Invalid difftime object"); } std::string v_units = Rcpp::as(v.attr("units")); if (!get_units_map().is_valid_difftime_unit(units)) { // if current unit is NULL, grab the new one units = v_units; // then collect the data: Parent::collect(index, v, offset); } else { // We had already defined the units. // Does the new vector have the same units? if (units == v_units) { Parent::collect(index, v, offset); } else { // If units are different convert the existing data and the new vector // to seconds (following the convention on // r-source/src/library/base/R/datetime.R) double factor_data = get_units_map().time_conversion_factor(units); if (factor_data != 1.0) { for (int i = 0; i < Parent::data.size(); i++) { Parent::data[i] = factor_data * Parent::data[i]; } } units = "secs"; double factor_v = get_units_map().time_conversion_factor(v_units); if (Rf_length(v) < index.size()) { stop("Wrong size of vector to collect"); } for (int i = 0; i < index.size(); i++) { Parent::data[index[i]] = factor_v * (REAL(v)[i + offset]); } } } } class UnitsMap { typedef std::map units_map; const units_map valid_units; static units_map create_valid_units() { units_map valid_units; double factor = 1; // Acceptable units based on r-source/src/library/base/R/datetime.R valid_units.insert(std::make_pair("secs", factor)); factor *= 60; valid_units.insert(std::make_pair("mins", factor)); factor *= 60; valid_units.insert(std::make_pair("hours", factor)); factor *= 24; valid_units.insert(std::make_pair("days", factor)); factor *= 7; valid_units.insert(std::make_pair("weeks", factor)); return valid_units; } public: UnitsMap() : valid_units(create_valid_units()) {} bool is_valid_difftime_unit(const std::string& x_units) const { return (valid_units.find(x_units) != valid_units.end()); } double time_conversion_factor(const std::string& v_units) const { units_map::const_iterator it = valid_units.find(v_units); if (it == valid_units.end()) { stop("Invalid difftime units (%s).", v_units.c_str()); } return it->second; } }; static const UnitsMap& get_units_map() { static UnitsMap map; return map; } private: std::string units; SEXP types; }; class FactorCollecter : public Collecter { public: typedef dplyr_hash_map LevelsMap; FactorCollecter(int n, SEXP model_): data(n, IntegerVector::get_na()), model(model_), levels(get_levels(model_)), levels_map() { int nlevels = levels.size(); for (int i = 0; i < nlevels; i++) levels_map[ levels[i] ] = i + 1; } bool is_factor_collecter() const { return true; } void collect(const SlicingIndex& index, SEXP v, int offset = 0) { if (offset != 0) stop("Nonzero offset ot supported by FactorCollecter"); if (Rf_inherits(v, "factor") && has_same_levels_as(v)) { collect_factor(index, v); } else if (all_logical_na(v, TYPEOF(v))) { collect_logicalNA(index); } } inline SEXP get() { set_levels(data, levels); set_class(data, get_class(model)); return data; } inline bool compatible(SEXP x) { return ((Rf_inherits(x, "factor") && has_same_levels_as(x)) || all_logical_na(x, TYPEOF(x))); } inline bool can_promote(SEXP x) const { return TYPEOF(x) == STRSXP || Rf_inherits(x, "factor"); } inline bool has_same_levels_as(SEXP x) const { CharacterVector levels_other = get_levels(x); int nlevels = levels_other.size(); if (nlevels != (int)levels_map.size()) return false; for (int i = 0; i < nlevels; i++) if (! levels_map.count(levels_other[i])) return false; return true; } inline std::string describe() const { return "factor"; } private: IntegerVector data; RObject model; CharacterVector levels; LevelsMap levels_map; void collect_factor(const SlicingIndex& index, SEXP v) { // here we can assume that v is a factor with the right levels // we however do not assume that they are in the same order IntegerVector source(v); CharacterVector levels = get_levels(source); SEXP* levels_ptr = Rcpp::internal::r_vector_start(levels); int* source_ptr = Rcpp::internal::r_vector_start(source); for (int i = 0; i < index.size(); i++) { if (source_ptr[i] == NA_INTEGER) { data[ index[i] ] = NA_INTEGER; } else { SEXP x = levels_ptr[ source_ptr[i] - 1 ]; data[ index[i] ] = levels_map.find(x)->second; } } } void collect_logicalNA(const SlicingIndex& index) { for (int i = 0; i < index.size(); i++) { data[ index[i] ] = NA_INTEGER; } } }; template <> inline bool Collecter_Impl::can_promote(SEXP) const { return is_logical_all_na(); } inline Collecter* collecter(SEXP model, int n) { switch (TYPEOF(model)) { case INTSXP: if (Rf_inherits(model, "POSIXct")) return new POSIXctCollecter(n, Rf_getAttrib(model, Rf_install("tzone"))); if (Rf_inherits(model, "factor")) return new FactorCollecter(n, model); if (Rf_inherits(model, "Date")) return new TypedCollecter(n, get_date_classes()); return new Collecter_Impl(n); case REALSXP: if (Rf_inherits(model, "POSIXct")) return new POSIXctCollecter(n, Rf_getAttrib(model, Rf_install("tzone"))); if (Rf_inherits(model, "difftime")) return new DifftimeCollecter( n, Rcpp::as(Rf_getAttrib(model, Rf_install("units"))), Rf_getAttrib(model, R_ClassSymbol)); if (Rf_inherits(model, "Date")) return new TypedCollecter(n, get_date_classes()); if (Rf_inherits(model, "integer64")) return new TypedCollecter(n, CharacterVector::create("integer64")); return new Collecter_Impl(n); case CPLXSXP: return new Collecter_Impl(n); case LGLSXP: return new Collecter_Impl(n); case STRSXP: return new Collecter_Impl(n); case VECSXP: if (Rf_inherits(model, "POSIXlt")) { stop("POSIXlt not supported"); } if (Rf_inherits(model, "data.frame")) { stop("Columns of class data.frame not supported"); } return new Collecter_Impl(n); default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(model))); } inline Collecter* promote_collecter(SEXP model, int n, Collecter* previous) { // handle the case where the previous collecter was a // Factor collecter and model is a factor. when this occurs, we need to // return a Collecter_Impl because the factors don't have the // same levels if (Rf_inherits(model, "factor") && previous->is_factor_collecter()) { Rf_warning("Unequal factor levels: coercing to character"); return new Collecter_Impl(n); } // logical NA can be promoted to whatever type comes next if (previous->is_logical_all_na()) { return collecter(model, n); } switch (TYPEOF(model)) { case INTSXP: if (Rf_inherits(model, "Date")) return new TypedCollecter(n, get_date_classes()); if (Rf_inherits(model, "factor")) return new Collecter_Impl(n); return new Collecter_Impl(n); case REALSXP: if (Rf_inherits(model, "POSIXct")) return new POSIXctCollecter(n, Rf_getAttrib(model, Rf_install("tzone"))); if (Rf_inherits(model, "Date")) return new TypedCollecter(n, get_date_classes()); if (Rf_inherits(model, "integer64")) return new TypedCollecter(n, CharacterVector::create("integer64")); return new Collecter_Impl(n); case LGLSXP: return new Collecter_Impl(n); case STRSXP: if (previous->is_factor_collecter()) Rf_warning("binding factor and character vector, coercing into character vector"); return new Collecter_Impl(n); default: break; } stop("is of unsupported type %s", Rf_type2char(TYPEOF(model))); } } #endif dplyr/inst/include/dplyr/MultipleVectorVisitors.h0000644000176200001440000000254513135665123022043 0ustar liggesusers#ifndef dplyr_MultipleVectorVisitors_H #define dplyr_MultipleVectorVisitors_H #include #include #include namespace dplyr { class MultipleVectorVisitors : public VisitorSetEqual, public VisitorSetHash, public VisitorSetLess, public VisitorSetGreater { private: std::vector< boost::shared_ptr > visitors; public: typedef VectorVisitor visitor_type; MultipleVectorVisitors() : visitors() {} MultipleVectorVisitors(List data) : visitors() { int n = data.size(); for (int i = 0; i < n; i++) { push_back(data[i]); } } inline int size() const { return visitors.size(); } inline VectorVisitor* get(int k) const { return visitors[k].get(); } inline int nrows() const { if (visitors.size() == 0) { stop("Need at least one column for `nrows()`"); } return visitors[0]->size(); } inline void push_back(SEXP x) { visitors.push_back(boost::shared_ptr(visitor(x))); } inline bool is_na(int index) const { int n = size(); for (int i = 0; i < n; i++) if (visitors[i]->is_na(index)) return true; return false; } }; } // namespace dplyr #include #endif dplyr/tests/0000755000176200001440000000000013120572701012544 5ustar liggesusersdplyr/tests/testthat.R0000644000176200001440000000006613156557741014551 0ustar liggesuserslibrary(testthat) library(dplyr) test_check("dplyr") dplyr/tests/testthat/0000755000176200001440000000000013163257361014415 5ustar liggesusersdplyr/tests/testthat/test-equality.r0000644000176200001440000001426013153520575017414 0ustar liggesuserscontext("Equality") # A data frame with all major types df_all <- data.frame( a = c(1, 2.5), b = 1:2, c = c(T, F), d = c("a", "b"), e = factor(c("a", "b")), f = Sys.Date() + 1:2, g = Sys.time() + 1:2, stringsAsFactors = FALSE ) test_that("data frames equal to themselves", { expect_true(all.equal(tbl_df(mtcars), tbl_df(mtcars))) expect_true(all.equal(tbl_df(iris), tbl_df(iris))) expect_true(all.equal(tbl_df(df_all), tbl_df(df_all))) }) test_that("data frames equal to random permutations of themselves", { scramble <- function(x) { x[sample(nrow(x)), sample(ncol(x)), drop = FALSE] } expect_equal(tbl_df(mtcars), tbl_df(scramble(mtcars))) expect_equal(tbl_df(iris), tbl_df(scramble(iris))) expect_equal(tbl_df(df_all), tbl_df(scramble(df_all))) }) test_that("data frames not equal if missing row", { expect_match(all.equal(tbl_df(mtcars), mtcars[-1, ]), "Different number of rows") expect_match(all.equal(tbl_df(iris), iris[-1, ]), "Different number of rows") expect_match(all.equal(tbl_df(df_all), df_all[-1, ]), "Different number of rows") }) test_that("data frames not equal if missing col", { expect_match( all.equal(tbl_df(mtcars), mtcars[, -1]), "Cols in x but not y: `mpg`" ) expect_match( all.equal(tbl_df(iris), iris[, -1]), "Cols in x but not y: `Sepal.Length`" ) expect_match( all.equal(tbl_df(df_all), df_all[, -1]), "Cols in x but not y: `a`" ) }) test_that("factors equal only if levels equal", { df1 <- data_frame(x = factor(c("a", "b"))) df2 <- data_frame(x = factor(c("a", "d"))) expect_equal( all.equal(df1, df2), "Factor levels not equal for column `x`" ) expect_equal( all.equal(df2, df1), "Factor levels not equal for column `x`" ) }) test_that("factor comparison requires strict equality of levels (#2440)", { df1 <- data_frame(x = factor("a")) df2 <- data_frame(x = factor("a", levels = c("a", "b"))) expect_equal( all.equal(df1, df2), "Factor levels not equal for column `x`" ) expect_equal( all.equal(df2, df1), "Factor levels not equal for column `x`" ) expect_warning(expect_true(all.equal(df1, df2, convert = TRUE)), "joining factors") expect_warning(expect_true(all.equal(df2, df1, convert = TRUE)), "joining factors") }) test_that("BoolResult does not overwrite singleton R_TrueValue", { dplyr:::equal_data_frame(mtcars, mtcars) expect_equal(class(2 == 2), "logical") }) test_that("all.equal.data.frame handles data.frames with NULL names", { x <- data.frame(LETTERS[1:3], rnorm(3)) names(x) <- NULL expect_true(all.equal(x, x)) }) test_that("data frame equality test with ignore_row_order=TRUE detects difference in number of rows. #1065", { DF1 <- data_frame(a = 1:4, b = letters[1:4]) DF2 <- data_frame(a = c(1:4, 4L), b = letters[c(1:4, 4L)]) expect_false(isTRUE(all.equal(DF1, DF2, ignore_row_order = TRUE))) DF1 <- data_frame(a = c(1:4, 2L), b = letters[c(1:4, 2L)]) DF2 <- data_frame(a = c(1:4, 4L), b = letters[c(1:4, 4L)]) expect_false(isTRUE(all.equal(DF1, DF2, ignore_row_order = TRUE))) }) test_that("all.equal handles NA_character_ correctly. #1095", { d1 <- data_frame(x = c(NA_character_)) expect_true(all.equal(d1, d1)) d2 <- data_frame(x = c(NA_character_, "foo", "bar")) expect_true(all.equal(d2, d2)) }) test_that("handle Date columns of different types, integer and numeric (#1204)", { a <- data.frame(date = as.Date("2015-06-07")) b <- data.frame(date = structure(as.integer(a$date), class = "Date")) expect_true(all.equal(a, b)) }) test_that("equality test fails when convert is FALSE and types don't match (#1484)", { df1 <- data_frame(x = "a") df2 <- data_frame(x = factor("a")) expect_equal( all_equal(df1, df2, convert = FALSE), "Incompatible type for column `x`: x character, y factor" ) expect_warning(all_equal(df1, df2, convert = TRUE)) }) test_that("equality handles data frames with 0 rows (#1506)", { df0 <- data_frame(x = numeric(0), y = character(0)) expect_equal(df0, df0) }) test_that("equality handles data frames with 0 columns (#1506)", { df0 <- data_frame(a = 1:10)[-1] expect_equal(df0, df0) }) test_that("equality cannot be checked in presence of raw columns", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( all.equal(df, df), "Column `b` is of unsupported type raw", fixed = TRUE ) }) test_that("equality returns a message for convert = TRUE", { df1 <- data_frame(x = 1:3) df2 <- data_frame(x = as.character(1:3)) expect_match(all.equal(df1, df2), "Incompatible") expect_match(all.equal(df1, df2, convert = TRUE), "Incompatible") }) test_that("numeric and integer can be compared if convert = TRUE", { df1 <- data_frame(x = 1:3) df2 <- data_frame(x = as.numeric(1:3)) expect_match(all.equal(df1, df2), "Incompatible") expect_true(all.equal(df1, df2, convert = TRUE)) }) test_that("returns vector for more than one difference (#1819)", { expect_equal( all.equal(data_frame(a = 1, b = 2), data_frame(a = 1L, b = 2L)), c( "Incompatible type for column `a`: x numeric, y integer", "Incompatible type for column `b`: x numeric, y integer" ) ) }) test_that("returns UTF-8 column names (#2441)", { df1 <- data_frame("\u5e78" := 1) df2 <- data_frame("\u798f" := 1) expect_equal( all.equal(df1, df2), c( "Cols in y but not x: `\u798f`. ", "Cols in x but not y: `\u5e78`. " ), fixed = TRUE ) }) test_that("proper message formatting for set operations", { expect_error( union(data_frame(a = 1), data_frame(a = "1")), "not compatible: Incompatible type for column `a`: x numeric, y character", fixed = TRUE ) expect_error( union(data_frame(a = 1, b = 2), data_frame(a = "1", b = "2")), "not compatible: \n- Incompatible type for column `a`: x numeric, y character\n- Incompatible type for column `b`: x numeric, y character", fixed = TRUE ) }) test_that("ignore column order", { expect_equal( all.equal(data_frame(a = 1, b = 2), data_frame(b = 2, a = 1), ignore_col_order = FALSE), "Same column names, but different order" ) expect_equal( all.equal(data_frame(a = 1, b = 2), data_frame(a = 1), ignore_col_order = FALSE), "Cols in x but not y: `b`. " ) }) dplyr/tests/testthat/test-pull.R0000644000176200001440000000246113135665123016472 0ustar liggesuserscontext("pull") test_that("default extracts last var from data frame", { df <- data_frame(x = 1:10, y = 1:10) expect_equal(pull(df), 1:10) }) test_that("can extract by name, or positive/negative position", { x <- 1:10 df <- data_frame(x = x, y = runif(10)) expect_equal(pull(df, x), x) expect_equal(pull(df, 1L), x) expect_equal(pull(df, 1), x) expect_equal(pull(df, -2), x) expect_equal(pull(df, -2L), x) }) # select_var -------------------------------------------------------------- test_that("errors for bad inputs", { expect_error( select_var(letters, letters), "`var` must evaluate to a single number", fixed = TRUE ) expect_error( select_var(letters, aa), "object 'aa' not found", fixed = TRUE ) expect_error( select_var(letters, 0), "`var` must be a value between -26 and 26 (excluding zero), not 0", fixed = TRUE ) expect_error( select_var(letters, 100), "`var` must be a value between -26 and 26 (excluding zero), not 100", fixed = TRUE ) expect_error( select_var(letters, -Inf), "`var` must be a value between -26 and 26 (excluding zero), not NA", fixed = TRUE ) expect_error( select_var(letters, NA_integer_), "`var` must be a value between -26 and 26 (excluding zero), not NA", fixed = TRUE ) }) dplyr/tests/testthat/test-utils.R0000644000176200001440000000064013153520575016654 0ustar liggesuserscontext("utils") test_that("check_pkg() gives correct error message", { expect_error( dplyr:::check_pkg("`__foobarbaz__`", "foobar a baz"), "The `__foobarbaz__` package is required to foobar a baz") }) test_that("get_vars() handles list of symbols as vars attribute", { gdf <- group_by(tibble(g = 1:2), g) gdf <- set_attrs(gdf, vars = list(sym("g"))) expect_identical(test_grouped_df(gdf), gdf) }) dplyr/tests/testthat/test-coalesce.R0000644000176200001440000000074613135665123017300 0ustar liggesuserscontext("coalesce") test_that("non-missing scalar replaces all missing values", { x <- c(NA, 1) expect_equal(coalesce(x, 1), c(1, 1)) }) test_that("finds non-missing values in multiple positions", { x1 <- c(1L, NA, NA) x2 <- c(NA, 2L, NA) x3 <- c(NA, NA, 3L) expect_equal(coalesce(x1, x2, x3), 1:3) }) test_that("error if invalid length", { expect_error( coalesce(1:2, 1:3), "Argument 2 must be length 2 (length of `x`) or one, not 3", fixed = TRUE ) }) dplyr/tests/testthat/test-lazyeval-compat.R0000644000176200001440000000163713135665123020632 0ustar liggesuserscontext("lazyeval compatibility") test_that("can select negatively (#2519)", { expect_identical(select_(mtcars, ~-cyl), mtcars[-2]) }) test_that("select yields proper names", { expect_identical(names(select_(mtcars, ~cyl:hp)), c("cyl", "disp", "hp")) }) test_that("lazydots are named and arrange() doesn't fail (it assumes empty names)", { dots <- compat_lazy_dots(list(), env(), "cyl") expect_identical(names(dots), "") expect_identical(arrange_(mtcars, "cyl"), arrange(mtcars, cyl)) }) test_that("mutate_each_() and summarise_each_() handle lazydots", { cyl_chr <- mutate_each_(mtcars, funs(as.character), "cyl")$cyl expect_identical(cyl_chr, as.character(mtcars$cyl)) cyl_mean <- summarise_each_(mtcars, funs(mean), "cyl")$cyl expect_equal(cyl_mean, mean(mtcars$cyl)) }) test_that("select_vars_() handles lazydots", { expect_identical(select_vars_(letters, c("a", "b")), set_names(c("a", "b"))) }) dplyr/tests/testthat/test-colwise-mutate.R0000644000176200001440000001200213153520575020451 0ustar liggesuserscontext("colwise mutate/summarise") test_that("funs found in current environment", { f <- function(x) 1 df <- data.frame(x = c(2:10, 1000)) out <- summarise_all(df, funs(f, mean, median)) expect_equal(out, data.frame(f = 1, mean = 105.4, median = 6.5)) }) test_that("can use character vectors", { df <- data.frame(x = 1:3) expect_equal(summarise_all(df, "mean"), summarise_all(df, funs(mean))) expect_equal(mutate_all(df, list(mean = "mean")), mutate_all(df, funs(mean = mean))) }) test_that("can use bare functions", { df <- data.frame(x = 1:3) expect_equal(summarise_all(df, mean), summarise_all(df, funs(mean))) expect_equal(mutate_all(df, mean), mutate_all(df, funs(mean))) }) test_that("default names are smallest unique set", { df <- data.frame(x = 1:3, y = 1:3) expect_named(summarise_at(df, vars(x:y), funs(mean)), c("x", "y")) expect_named(summarise_at(df, vars(x), funs(mean, sd)), c("mean", "sd")) expect_named(summarise_at(df, vars(x:y), funs(mean, sd)), c("x_mean", "y_mean", "x_sd", "y_sd")) expect_named(summarise_at(df, vars(x:y), funs(base::mean, stats::sd)), c("x_base::mean", "y_base::mean", "x_stats::sd", "y_stats::sd")) }) test_that("named arguments force complete named", { df <- data.frame(x = 1:3, y = 1:3) expect_named(summarise_at(df, vars(x:y), funs(mean = mean)), c("x_mean", "y_mean")) expect_named(summarise_at(df, vars(x = x), funs(mean, sd)), c("x_mean", "x_sd")) }) expect_classes <- function(tbl, expected) { classes <- unname(map_chr(tbl, class)) classes <- paste0(substring(classes, 0, 1), collapse = "") expect_equal(classes, expected) } test_that("can select colwise", { columns <- iris %>% mutate_at(vars(starts_with("Petal")), as.character) expect_classes(columns, "nnccf") numeric <- iris %>% mutate_at(c(1, 3), as.character) expect_classes(numeric, "cncnf") character <- iris %>% mutate_at("Species", as.character) expect_classes(character, "nnnnc") }) test_that("can probe colwise", { predicate <- iris %>% mutate_if(is.factor, as.character) expect_classes(predicate, "nnnnc") logical <- iris %>% mutate_if(c(TRUE, FALSE, TRUE, TRUE, FALSE), as.character) expect_classes(logical, "cnccf") }) test_that("non syntactic colnames work", { df <- data_frame(`x 1` = 1:3) expect_identical(summarise_at(df, "x 1", sum)[[1]], 6L) expect_identical(summarise_if(df, is.numeric, sum)[[1]], 6L) expect_identical(summarise_all(df, sum)[[1]], 6L) expect_identical(mutate_all(df, `*`, 2)[[1]], (1:3) * 2) }) test_that("empty selection does not select everything (#2009, #1989)", { expect_equal(mtcars, mutate_if(mtcars, is.factor, as.character)) }) test_that("error is thrown with improper additional arguments", { # error messages by base R, not checked expect_error(mutate_all(mtcars, round, 0, 0)) expect_error(mutate_all(mtcars, mean, na.rm = TRUE, na.rm = TRUE)) }) test_that("predicate can be quoted", { expected <- mutate_if(mtcars, is_integerish, mean) expect_identical(mutate_if(mtcars, "is_integerish", mean), expected) expect_identical(mutate_if(mtcars, ~is_integerish(.x), mean), expected) }) test_that("transmute verbs do not retain original variables", { expect_named(transmute_all(data_frame(x = 1:3, y = 1:3), funs(mean, sd)), c("x_mean", "y_mean", "x_sd", "y_sd")) expect_named(transmute_if(data_frame(x = 1:3, y = 1:3), is_integer, funs(mean, sd)), c("x_mean", "y_mean", "x_sd", "y_sd")) expect_named(transmute_at(data_frame(x = 1:3, y = 1:3), vars(x:y), funs(mean, sd)), c("x_mean", "y_mean", "x_sd", "y_sd")) }) test_that("can rename with vars() (#2594)", { expect_equal(mutate_at(tibble(x = 1:3), vars(y = x), mean), tibble(x = 1:3, y = c(2, 2, 2))) }) test_that("selection works with grouped data frames (#2624)", { gdf <- group_by(iris, Species) expect_identical(mutate_if(gdf, is.factor, as.character), gdf) }) test_that("at selection works even if not all ops are named (#2634)", { df <- tibble(x = 1, y = 2) expect_identical(mutate_at(df, vars(z = x, y), funs(. + 1)), tibble(x = 1, y = 3, z = 2)) }) test_that("can use a purrr-style lambda", { expect_identical(summarise_at(mtcars, vars(1:2), ~mean(.x)), summarise(mtcars, mpg = mean(mpg), cyl = mean(cyl))) }) # Deprecated --------------------------------------------------------- test_that("_each() and _all() families agree", { df <- data.frame(x = 1:3, y = 1:3) expect_equal(summarise_each(df, funs(mean)), summarise_all(df, mean)) expect_equal(summarise_each(df, funs(mean), x:y), summarise_at(df, vars(x:y), mean)) expect_equal(summarise_each(df, funs(mean), z = y), summarise_at(df, vars(z = y), mean)) expect_equal(mutate_each(df, funs(mean)), mutate_all(df, mean)) expect_equal(mutate_each(df, funs(mean), x:y), mutate_at(df, vars(x:y), mean)) expect_equal(mutate_each(df, funs(mean), z = y), mutate_at(df, vars(z = y), mean)) }) test_that("specific directions are given for _all() and _at() versions", { summarise_each(mtcars, funs(mean)) summarise_each(mtcars, funs(mean), cyl) mutate_each(mtcars, funs(mean)) mutate_each(mtcars, funs(mean), cyl) }) dplyr/tests/testthat/test-group-by.r0000644000176200001440000002071213156573655017334 0ustar liggesuserscontext("Group by") df <- data.frame(x = rep(1:3, each = 10), y = rep(1:6, each = 5)) test_that("group_by with add = TRUE adds groups", { add_groups1 <- function(tbl) group_by(tbl, x, y, add = TRUE) add_groups2 <- function(tbl) group_by(group_by(tbl, x, add = TRUE), y, add = TRUE) expect_groups(add_groups1(df), c("x", "y")) expect_groups(add_groups2(df), c("x", "y")) }) test_that("joins preserve grouping", { g <- group_by(df, x) expect_groups(inner_join(g, g, by = c("x", "y")), "x") expect_groups(left_join (g, g, by = c("x", "y")), "x") expect_groups(semi_join (g, g, by = c("x", "y")), "x") expect_groups(anti_join (g, g, by = c("x", "y")), "x") }) test_that("constructors drops groups", { df <- data.frame(x = 1:3) %>% group_by(x) expect_no_groups(tbl_df(df)) }) test_that("grouping by constant adds column (#410)", { grouped <- group_by(mtcars, "cyl") %>% summarise(foo = n()) expect_equal(names(grouped), c('"cyl"', "foo")) expect_equal(nrow(grouped), 1L) }) # Test full range of variable types -------------------------------------------- test_that("local group_by preserves variable types", { df_var <- data_frame( l = c(T, F), i = 1:2, d = Sys.Date() + 1:2, f = factor(letters[1:2]), num = 1:2 + 0.5, t = Sys.time() + 1:2, c = letters[1:2] ) for (var in names(df_var)) { expected <- data_frame(unique(df_var[[var]]), n = 1L) names(expected)[1] <- var summarised <- df_var %>% group_by(!! sym(var)) %>% summarise(n = n()) expect_equal(summarised, expected, info = var) } }) test_that("mutate does not loose variables (#144)", { df <- tbl_df(data.frame(a = rep(1:4, 2), b = rep(1:4, each = 2), x = runif(8))) by_ab <- group_by(df, a, b) by_a <- summarise(by_ab, x = sum(x)) by_a_quartile <- group_by(by_a, quartile = ntile(x, 4)) expect_equal(names(by_a_quartile), c("a", "b", "x", "quartile")) }) test_that("group_by uses shallow copy", { m1 <- group_by(mtcars, cyl) expect_no_groups(mtcars) expect_equal(dfloc(mtcars), dfloc(m1)) }) test_that("FactorVisitor handles NA. #183", { g <- group_by(MASS::survey, M.I) expect_equal(g$M.I, MASS::survey$M.I) }) test_that("group_by orders by groups. #242", { df <- data.frame(a = sample(1:10, 3000, replace = TRUE)) %>% group_by(a) expect_equal(attr(df, "labels")$a, 1:10) df <- data.frame(a = sample(letters[1:10], 3000, replace = TRUE), stringsAsFactors = FALSE) %>% group_by(a) expect_equal(attr(df, "labels")$a, letters[1:10]) df <- data.frame(a = sample(sqrt(1:10), 3000, replace = TRUE)) %>% group_by(a) expect_equal(attr(df, "labels")$a, sqrt(1:10)) }) test_that("group_by uses the white list", { df <- data.frame(times = 1:5) df$times <- as.POSIXlt(seq.Date(Sys.Date(), length.out = 5, by = "day")) expect_error( group_by(df, times), "Column `times` is of unsupported class POSIXlt/POSIXt", fixed = TRUE ) }) test_that("group_by fails when lists are used as grouping variables (#276)", { df <- data.frame(x = 1:3) df$y <- list(1:2, 1:3, 1:4) expect_error( group_by(df, y), "Column `y` can't be used as a grouping variable because it's a list", fixed = TRUE ) }) test_that("select(group_by(.)) implicitely adds grouping variables (#170)", { res <- mtcars %>% group_by(vs) %>% select(mpg) expect_equal(names(res), c("vs", "mpg")) }) test_that("grouped_df errors on empty vars (#398)", { m <- mtcars %>% group_by(cyl) attr(m, "vars") <- NULL attr(m, "indices") <- NULL expect_error( m %>% do(mpg = mean(.$mpg)), "no variables to group by", fixed = TRUE ) }) test_that("grouped_df errors on non-existent var (#2330)", { df <- data.frame(x = 1:5) expect_error( grouped_df(df, list(quote(y))), "Column `y` is unknown" ) }) test_that("group_by only creates one group for NA (#401)", { x <- as.numeric(c(NA, NA, NA, 10:1, 10:1)) w <- c(20, 30, 40, 1:10, 1:10) * 10 n_distinct(x) # 11 OK res <- data.frame(x = x, w = w) %>% group_by(x) %>% summarise(n = n()) expect_equal(nrow(res), 11L) }) test_that("there can be 0 groups (#486)", { data <- data.frame(a = numeric(0), g = character(0)) %>% group_by(g) expect_equal(length(data$a), 0L) expect_equal(length(data$g), 0L) expect_equal(attr(data, "group_sizes"), integer(0)) }) test_that("group_by works with zero-row data frames (#486)", { dfg <- group_by(data.frame(a = numeric(0), b = numeric(0), g = character(0)), g) expect_equal(dim(dfg), c(0, 3)) expect_groups(dfg, "g") expect_equal(group_size(dfg), integer(0)) x <- summarise(dfg, n = n()) expect_equal(dim(x), c(0, 2)) expect_no_groups(x) x <- mutate(dfg, c = b + 1) expect_equal(dim(x), c(0, 4)) expect_groups(x, "g") expect_equal(group_size(x), integer(0)) x <- filter(dfg, a == 100) expect_equal(dim(x), c(0, 3)) expect_groups(x, "g") expect_equal(group_size(x), integer(0)) x <- arrange(dfg, a, g) expect_equal(dim(x), c(0, 3)) expect_groups(x, "g") expect_equal(group_size(x), integer(0)) x <- select(dfg, a) # Only select 'a' column; should result in 'g' and 'a' expect_equal(dim(x), c(0, 2)) expect_groups(x, "g") expect_equal(group_size(x), integer(0)) }) test_that("grouped_df requires a list of symbols (#665)", { features <- list("feat1", "feat2", "feat3") # error message by assertthat expect_error(grouped_df(data.frame(feat1 = 1, feat2 = 2, feat3 = 3), features)) }) test_that("group_by gives meaningful message with unknow column (#716)", { expect_error( group_by(iris, wrong_name_of_variable), "Column `wrong_name_of_variable` is unknown", fixed = TRUE ) }) test_that("[ on grouped_df preserves grouping if subset includes grouping vars", { df <- data_frame(x = 1:5, ` ` = 6:10) by_x <- df %>% group_by(x) expect_equal(by_x %>% groups(), by_x %>% `[`(1:2) %>% groups) # non-syntactic name by_ns <- df %>% group_by(` `) expect_equal(by_ns %>% groups(), by_ns %>% `[`(1:2) %>% groups) }) test_that("[ on grouped_df drops grouping if subset doesn't include grouping vars", { by_cyl <- mtcars %>% group_by(cyl) no_cyl <- by_cyl %>% `[`(c(1, 3)) expect_no_groups(no_cyl) expect_is(no_cyl, "tbl_df") }) test_that("group_by works after arrange (#959)", { df <- data_frame(Log = c(1, 2, 1, 2, 1, 2), Time = c(10, 1, 3, 0, 15, 11)) res <- df %>% arrange(Time) %>% group_by(Log) %>% mutate(Diff = Time - lag(Time)) expect_true(all(is.na(res$Diff[c(1, 3)]))) expect_equal(res$Diff[c(2, 4, 5, 6)], c(1, 7, 10, 5)) }) test_that("group_by keeps attributes", { d <- data.frame(x = structure(1:10, foo = "bar")) gd <- group_by(d) expect_equal(attr(gd$x, "foo"), "bar") }) test_that("ungroup.rowwise_df gives a tbl_df (#936)", { res <- tbl_df(mtcars) %>% rowwise %>% ungroup %>% class expect_equal(res, c("tbl_df", "tbl", "data.frame")) }) test_that(paste0("group_by handles encodings for native strings (#1507)"), { with_non_utf8_encoding({ special <- get_native_lang_string() df <- data.frame(x = 1:3, Eng = 2:4) for (names_converter in c(enc2native, enc2utf8)) { for (dots_converter in c(enc2native, enc2utf8)) { names(df) <- names_converter(c(special, "Eng")) res <- group_by(df, !!! syms(dots_converter(special))) expect_equal(names(res), names(df)) expect_groups(res, special) } } for (names_converter in c(enc2native, enc2utf8)) { names(df) <- names_converter(c(special, "Eng")) res <- group_by(df, !!! special) expect_equal(names(res), c(names(df), deparse(special))) expect_equal(groups(res), list(as.name(enc2native(deparse(special))))) } }) }) test_that("group_by fails gracefully on raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( group_by(df, a), "Column `b` is of unsupported type raw", fixed = TRUE ) expect_error( group_by(df, b), "Column `b` is of unsupported type raw", fixed = TRUE ) }) test_that("rowwise fails gracefully on raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( rowwise(df), "Column `b` is of unsupported type raw", fixed = TRUE ) }) test_that("group_by() names pronouns correctly (#2686)", { expect_named(group_by(tibble(x = 1), .data$x), "x") expect_named(group_by(tibble(x = 1), .data[["x"]]), "x") }) test_that("group_by() does not affect input data (#3028)", { x <- data.frame(old1 = c(1, 2, 3), old2 = c(4, 5, 6)) %>% group_by(old1) y <- x %>% select(new1 = old1, new2 = old2) expect_identical(groups(x), syms(quote(old1))) }) dplyr/tests/testthat/test-lead-lag.R0000644000176200001440000000374313153520575017171 0ustar liggesuserscontext("Lead and lag") test_that("lead and lag preserve factors", { x <- factor(c("a", "b", "c")) expect_equal(levels(lead(x)), c("a", "b", "c")) expect_equal(levels(lag(x)), c("a", "b", "c")) }) test_that("lead and lag preserves dates and times", { x <- as.Date("2013-01-01") + 1:3 y <- as.POSIXct(x) expect_is(lead(x), "Date") expect_is(lag(x), "Date") expect_is(lead(y), "POSIXct") expect_is(lag(y), "POSIXct") }) test_that("#925 is fixed", { data <- data_frame( name = c("Rob", "Pete", "Rob", "John", "Rob", "Pete", "John", "Pete", "John", "Pete", "Rob", "Rob"), time = c(3, 2, 5, 3, 2, 3, 2, 4, 1, 1, 4, 1) ) res <- data %>% group_by(name) %>% mutate(lag_time = lag(time)) expect_equal( res$lag_time[res$name == "Rob"], c(NA, head(data$time[data$name == "Rob"], -1)) ) expect_equal( res$lag_time[res$name == "Pete"], c(NA, head(data$time[data$name == "Pete"], -1)) ) expect_equal( res$lag_time[res$name == "John"], c(NA, head(data$time[data$name == "John"], -1)) ) }) test_that("#937 is fixed", { df <- data_frame( name = rep(c("Al", "Jen"), 3), score = rep(c(100, 80, 60), 2) ) res <- df %>% group_by(name) %>% mutate(next.score = lead(score)) expect_equal( res$next.score[res$name == "Al"], c(tail(df$score[df$name == "Al"], -1), NA) ) expect_equal( res$next.score[res$name == "Jen"], c(tail(df$score[df$name == "Jen"], -1), NA) ) }) test_that("input checks", { expect_error( lead(letters, -1), "`n` must be a nonnegative integer scalar, not double of length 1", fixed = TRUE ) expect_error( lead(letters, "1"), "`n` must be a nonnegative integer scalar, not string of length 1", fixed = TRUE ) expect_error( lag(letters, -1), "`n` must be a nonnegative integer scalar, not double of length 1", fixed = TRUE ) expect_error( lag(letters, "1"), "`n` must be a nonnegative integer scalar, not string of length 1", fixed = TRUE ) }) dplyr/tests/testthat/test-DBI.R0000644000176200001440000000054713135665123016117 0ustar liggesuserscontext("DBI") test_that("can work directly with DBI connection", { skip_if_not_installed("RSQLite") skip_if_not_installed("dbplyr") con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") df <- tibble(x = 1:10, y = letters[1:10]) df1 <- copy_to(con, df) df2 <- tbl(con, "df") expect_equal(collect(df1), df) expect_equal(collect(df2), df) }) dplyr/tests/testthat/test-arrange.r0000644000176200001440000001125713153520575017201 0ustar liggesuserscontext("Arrange") df2 <- data.frame( a = rep(c(NA, 1, 2, 3), each = 4), b = rep(c(0L, NA, 1L, 2L), 4), c = c(NA, NA, NA, NA, letters[10:21]), d = rep(c(T, NA, F, T), each = 4), id = 1:16, stringsAsFactors = FALSE ) equal_df <- function(x, y) { rownames(x) <- NULL rownames(y) <- NULL isTRUE(all.equal(x, y)) } test_that("local arrange sorts missing values to end", { na_last <- function(x) { n <- length(x) all(is.na(x[(n - 3):n])) } # Numeric expect_true(na_last(arrange(df2, a)$a)) expect_true(na_last(arrange(df2, desc(a))$a)) # Integer expect_true(na_last(arrange(df2, b)$b)) expect_true(na_last(arrange(df2, desc(b))$b)) # Character expect_true(na_last(arrange(df2, c)$c)) expect_true(na_last(arrange(df2, desc(c))$c)) # Logical expect_true(na_last(arrange(df2, d)$d)) expect_true(na_last(arrange(df2, desc(d))$d)) }) test_that("two arranges equivalent to one", { df <- tribble( ~x, ~y, 2, 1, 2, -1, 1, 1 ) df1 <- df %>% arrange(x, y) df2 <- df %>% arrange(y) %>% arrange(x) expect_equal(df1, df2) }) test_that("arrange handles list columns (#282)", { df <- data.frame(a = 2:1) df$b <- list("foo", "bar") res <- arrange(df, a) expect_equal(res$b, list("bar", "foo")) }) test_that("arrange handles the case where ... is missing (#338)", { expect_equivalent(arrange(mtcars), mtcars) }) test_that("arrange handles 0-rows data frames", { d <- data.frame(a = numeric(0)) expect_equal(d, arrange(d)) }) test_that("grouped arrange ignores group (#491 -> #1206)", { df <- data.frame(g = c(2, 1, 2, 1), x = c(4:1)) out <- df %>% group_by(g) %>% arrange(x) expect_equal(out$x, 1:4) }) test_that("arrange keeps the grouping structure (#605)", { dat <- data_frame(g = c(2, 2, 1, 1), x = c(1, 3, 2, 4)) res <- dat %>% group_by(g) %>% arrange() expect_is(res, "grouped_df") expect_equal(res$x, dat$x) res <- dat %>% group_by(g) %>% arrange(x) expect_is(res, "grouped_df") expect_equal(res$x, 1:4) expect_equal(attr(res, "indices"), list(c(1, 3), c(0, 2))) }) test_that("arrange handles complex vectors", { d <- data.frame(x = 1:10, y = 10:1 + 2i) res <- arrange(d, y) expect_equal(res$y, rev(d$y)) expect_equal(res$x, rev(d$x)) res <- arrange(res, desc(y)) expect_equal(res$y, d$y) expect_equal(res$x, d$x) d$y[c(3, 6)] <- NA res <- arrange(d, y) expect_true(all(is.na(res$y[9:10]))) res <- arrange(d, desc(y)) expect_true(all(is.na(res$y[9:10]))) }) test_that("arrange respects attributes #1105", { env <- environment() Period <- suppressWarnings(setClass("Period", contains = "numeric", where = env)) on.exit(removeClass("Period", where = env)) df <- data.frame(p = Period(c(1, 2, 3)), x = 1:3) res <- arrange(df, p) expect_is(res$p, "Period") }) test_that("arrange works with empty data frame (#1142)", { df <- data.frame() res <- df %>% arrange expect_equal(nrow(res), 0L) expect_equal(length(res), 0L) }) test_that("arrange respects locale (#1280)", { df2 <- data_frame(words = c("casa", "\u00e1rbol", "zona", "\u00f3rgano")) res <- df2 %>% arrange(words) expect_equal(res$words, sort(df2$words)) res <- df2 %>% arrange(desc(words)) expect_equal(res$words, sort(df2$words, decreasing = TRUE)) }) test_that("duplicated column name is explicit about which column (#996)", { df <- data.frame(x = 1:10, x = 1:10) names(df) <- c("x", "x") # Error message created by tibble expect_error(df %>% arrange) df <- data.frame(x = 1:10, x = 1:10, y = 1:10, y = 1:10) names(df) <- c("x", "x", "y", "y") # Error message created by tibble expect_error(df %>% arrange) }) test_that("arrange fails gracefully on list columns (#1489)", { df <- expand.grid(group = 1:2, y = 1, x = 1) %>% group_by(group) %>% do(fit = lm(data = ., y ~ x)) expect_error( arrange(df, fit), "Argument 1 is of unsupported type list", fixed = TRUE ) }) test_that("arrange fails gracefully on raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( arrange(df, a), "Column `b` is of unsupported type raw", fixed = TRUE ) expect_error( arrange(df, b), "Column `b` is of unsupported type raw", fixed = TRUE ) }) test_that("arrange fails gracefully on matrix input (#1870)", { df <- data_frame(a = 1:3, b = 4:6) expect_error( arrange(df, is.na(df)), "Argument 1 is of unsupported type matrix", fixed = TRUE ) }) # grouped_df -------------------------------------------------------------- test_that("can choose to inclue grouping vars", { df <- tibble(g = c(1, 2), x = c(2, 1)) %>% group_by(g) df1 <- df %>% arrange(x, .by_group = TRUE) df2 <- df %>% arrange(g, x) expect_equal(df1, df2) }) dplyr/tests/testthat/test-if-else.R0000644000176200001440000000552713135665123017050 0ustar liggesuserscontext("if_else") test_that("first argument must be logical", { expect_error( if_else(1:10, 1, 2), "`condition` must be a logical, not integer", fixed = TRUE ) }) test_that("true and false must be same length as condition (or length 1)", { expect_error( if_else(1:3 < 2, 1:2, 1:3), "`true` must be length 3 (length of `condition`) or one, not 2", fixed = TRUE ) expect_error( if_else(1:3 < 2, 1:3, 1:2), "`false` must be length 3 (length of `condition`) or one, not 2", fixed = TRUE ) }) test_that("true and false must be same type and same class", { expect_error( if_else(1:3 < 2, 1, 1L), "`false` must be type double, not integer", fixed = TRUE ) x <- factor("x") y <- ordered("x") expect_error( if_else(1:3 < 2, x, y), "`false` must be factor, not ordered/factor", fixed = TRUE ) }) test_that("scalar true and false are vectorised", { x <- c(TRUE, TRUE, FALSE, FALSE) expect_equal(if_else(x, 1, 2), c(1, 1, 2, 2)) }) test_that("vector true and false are ok", { x <- c(-1, 0, 1) expect_equal(if_else(x < 0, x, 0), c(-1, 0, 0)) expect_equal(if_else(x > 0, x, 0), c(0, 0, 1)) }) test_that("missing values are missing", { expect_equal(if_else(c(TRUE, NA, FALSE), -1, 1), c(-1, NA, 1)) }) test_that("works with lists", { x <- list(1, 2, 3) expect_equal( if_else(c(TRUE, TRUE, FALSE), x, list(NULL)), list(1, 2, NULL) ) }) test_that("better factor support (#2197)", { skip("Currently failing") test_that("gives proper error messages for factor class (#2197)", { x <- factor(1:3, labels = letters[1:3]) expect_error( if_else(x == "a", "b", x), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", 1L, x), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", 1., x), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", TRUE, x), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", Sys.Date(), x), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", x, "b"), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", x, 1L), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", x, 1.), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", x, TRUE), "asdf", fixed = TRUE ) expect_error( if_else(x == "a", x, Sys.Date()), "asdf", fixed = TRUE ) }) test_that("works with factors as both `true` and `false` (#2197)", { x <- factor(1:3, labels = letters[1:3]) y <- factor(1:3, labels = letters[c(1, 2, 4)]) expect_equal(if_else(x == "a", x[[2]], x), x[c(2, 2, 3)]) expect_error( if_else(x == "a", x, y), "asdf levels in `false` don't match levels in `true`" ) }) }) dplyr/tests/testthat/test-data_frame.R0000644000176200001440000000046513153520575017604 0ustar liggesuserscontext("data_frame") # add_rownames ----------------------------------------------------------- test_that("add_rownames keeps the tbl classes (#882)", { expect_warning( res <- mtcars %>% add_rownames("Make&Model"), "Deprecated" ) expect_equal(class(res), c("tbl_df", "tbl", "data.frame")) }) dplyr/tests/testthat/helper-torture.R0000644000176200001440000000005413135665123017516 0ustar liggesuserswith_gctorture2 <- withr::with_(gctorture2) dplyr/tests/testthat/test-nth-value.R0000644000176200001440000000177413135665123017427 0ustar liggesuserscontext("Nth value") test_that("nth works with lists", { x <- list(1, 2, 3) expect_equal(nth(x, 1), 1) expect_equal(nth(x, 4), NULL) expect_equal(nth(x, 4, default = 1), 1) }) test_that("negative values index from end", { x <- 1:5 expect_equal(nth(x, -1), 5) expect_equal(nth(x, -3), 3) }) test_that("indexing past ends returns default value", { expect_equal(nth(1:4, 5), NA_integer_) expect_equal(nth(1:4, -5), NA_integer_) expect_equal(nth(1:4, -10), NA_integer_) }) test_that("first uses default value for 0 length vectors", { expect_equal(first(logical()), NA) expect_equal(first(integer()), NA_integer_) expect_equal(first(numeric()), NA_real_) expect_equal(first(character()), NA_character_) expect_equal(first(list()), NULL) }) test_that("firsts uses default value for 0 length augmented vectors", { fc <- factor("a")[0] dt <- Sys.Date() tm <- Sys.time() expect_equal(first(fc[0]), fc[NA]) expect_equal(first(dt[0]), dt[NA]) expect_equal(first(tm[0]), tm[NA]) }) dplyr/tests/testthat/test-select.r0000644000176200001440000001561213153520575017040 0ustar liggesuserscontext("Select") test_that("select does not lose grouping (#147)", { df <- tibble(a = rep(1:4, 2), b = rep(1:4, each = 2), x = runif(8)) grouped <- df %>% group_by(a) %>% select(a, b, x) expect_groups(grouped, "a") }) test_that("grouping variables preserved with a message (#1511)", { df <- data_frame(g = 1:3, x = 3:1) %>% group_by(g) expect_message(res <- select(df, x), "Adding missing grouping variables") expect_named(res, c("g", "x")) }) test_that("non-syntactic grouping variable is preserved (#1138)", { df <- data_frame(`a b` = 1L) %>% group_by(`a b`) %>% select() expect_named(df, "a b") }) test_that("select doesn't fail if some names missing", { df1 <- data.frame(x = 1:10, y = 1:10, z = 1:10) df2 <- setNames(df1, c("x", "y", "")) # df3 <- setNames(df1, c("x", "", "")) expect_equal(select(df1, x), data.frame(x = 1:10)) expect_equal(select(df2, x), data.frame(x = 1:10)) # expect_equal(select(df3, x), data.frame(x = 1:10)) }) # Empty selects ------------------------------------------------- test_that("select with no args returns nothing", { empty <- select(mtcars) expect_equal(ncol(empty), 0) expect_equal(nrow(empty), 32) }) test_that("select excluding all vars returns nothing", { expect_equal(dim(select(mtcars, -(mpg:carb))), c(32, 0)) expect_equal(dim(select(mtcars, starts_with("x"))), c(32, 0)) expect_equal(dim(select(mtcars, -matches("."))), c(32, 0)) }) test_that("negating empty match returns everything", { df <- data.frame(x = 1:3, y = 3:1) expect_equal(select(df, -starts_with("xyz")), df) }) # Select variables ----------------------------------------------- test_that("select_vars can rename variables", { vars <- c("a", "b") expect_equal(select_vars(vars, b = a, a = b), c("b" = "a", "a" = "b")) }) test_that("last rename wins", { vars <- c("a", "b") expect_equal(select_vars(vars, b = a, c = a), c("c" = "a")) }) test_that("negative index removes values", { vars <- letters[1:3] expect_equal(select_vars(vars, -c), c(a = "a", b = "b")) expect_equal(select_vars(vars, a:c, -c), c(a = "a", b = "b")) expect_equal(select_vars(vars, a, b, c, -c), c(a = "a", b = "b")) expect_equal(select_vars(vars, -c, a, b), c(a = "a", b = "b")) }) test_that("select can be before group_by (#309)", { df <- data.frame( id = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5), year = c(2013, 2013, 2012, 2013, 2013, 2013, 2012, 2012, 2013, 2013), var1 = rnorm(10) ) dfagg <- df %>% group_by(id, year) %>% select(id, year, var1) %>% summarise(var1 = mean(var1)) expect_equal(names(dfagg), c("id", "year", "var1")) expect_equal(attr(dfagg, "vars"), "id") }) test_that("rename does not crash with invalid grouped data frame (#640)", { df <- data_frame(a = 1:3, b = 2:4, d = 3:5) %>% group_by(a, b) df$a <- NULL expect_equal( df %>% rename(e = d) %>% ungroup, data_frame(b = 2:4, e = 3:5) ) expect_equal( df %>% rename(e = b) %>% ungroup, data_frame(e = 2:4, d = 3:5) ) }) test_that("can select with character vectors", { expect_identical(select_vars(letters, "b", !! "z", c("b", "c")), set_names(c("b", "z", "c"))) }) test_that("abort on unknown columns", { expect_error(select_vars(letters, "foo"), "must match column names") expect_error(select_vars(letters, c("a", "bar", "foo", "d")), "bar, foo") }) test_that("rename() handles data pronoun", { expect_identical(rename(tibble(x = 1), y = .data$x), tibble(y = 1)) }) # combine_vars ------------------------------------------------------------ # This is the low C++ function which works on integer indices test_that("empty index gives empty output", { vars <- combine_vars(letters, list()) expect_equal(length(vars), 0) vars <- combine_vars(letters, list(numeric())) expect_equal(length(vars), 0) }) test_that("positive indexes kept", { expect_equal(combine_vars(letters, list(1)), c(a = 1)) expect_equal(combine_vars(letters, list(1, 26)), c(a = 1, z = 26)) expect_equal(combine_vars(letters, list(c(1, 26))), c(a = 1, z = 26)) }) test_that("indexes returned in order they appear", { expect_equal(combine_vars(letters, list(26, 1)), c(z = 26, a = 1)) }) test_that("negative index in first position includes all others", { vars <- combine_vars(letters[1:3], list(-1)) expect_equal(vars, c(b = 2, c = 3)) }) test_that("named inputs rename outputs", { expect_equal(combine_vars(letters[1:3], list(d = 1)), c(d = 1)) expect_equal(combine_vars(letters[1:3], list(c(d = 1))), c(d = 1)) }) test_that("if multiple names, last kept", { expect_equal(combine_vars(letters[1:3], list(d = 1, e = 1)), c(e = 1)) expect_equal(combine_vars(letters[1:3], list(c(d = 1, e = 1))), c(e = 1)) }) test_that("if one name for multiple vars, use integer index", { expect_equal(combine_vars(letters[1:3], list(x = 1:3)), c(x1 = 1, x2 = 2, x3 = 3)) }) test_that("invalid inputs raise error", { expect_error( combine_vars(names(mtcars), list(0)), "Each argument must yield either positive or negative integers", fixed = TRUE ) expect_error( combine_vars(names(mtcars), list(c(-1, 1))), "Each argument must yield either positive or negative integers", fixed = TRUE ) expect_error( combine_vars(names(mtcars), list(12)), "Position must be between 0 and n", fixed = TRUE ) }) test_that("select succeeds in presence of raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_identical(select(df, a), df["a"]) expect_identical(select(df, b), df["b"]) expect_identical(select(df, -b), df["a"]) }) test_that("arguments to select() don't match select_vars() arguments", { df <- tibble(a = 1) expect_identical(select(df, var = a), tibble(var = 1)) expect_identical(select(group_by(df, a), var = a), group_by(tibble(var = 1), var)) expect_identical(select(df, exclude = a), tibble(exclude = 1)) expect_identical(select(df, include = a), tibble(include = 1)) expect_identical(select(group_by(df, a), exclude = a), group_by(tibble(exclude = 1), exclude)) expect_identical(select(group_by(df, a), include = a), group_by(tibble(include = 1), include)) }) test_that("arguments to rename() don't match rename_vars() arguments (#2861)", { df <- tibble(a = 1) expect_identical(rename(df, var = a), tibble(var = 1)) expect_identical(rename(group_by(df, a), var = a), group_by(tibble(var = 1), var)) expect_identical(rename(df, strict = a), tibble(strict = 1)) expect_identical(rename(group_by(df, a), strict = a), group_by(tibble(strict = 1), strict)) }) test_that("can select() with .data pronoun (#2715)", { expect_identical(select(mtcars, .data$cyl), select(mtcars, cyl)) }) test_that("can select() with character vectors", { expect_identical(select(mtcars, "cyl", !! "disp", c("cyl", "am", "drat")), mtcars[c("cyl", "disp", "am", "drat")]) }) test_that("rename() to UTF-8 column names", { skip_on_os("windows") # needs an rlang update? #3049 df <- data_frame(a = 1) %>% rename("\u5e78" := a) expect_equal(colnames(df), "\u5e78") }) dplyr/tests/testthat/test-count-tally.r0000644000176200001440000000550413153520575020033 0ustar liggesuserscontext("count-tally") # count ------------------------------------------------------------------- test_that("can count variable called n", { df <- data.frame(n = c(1, 1, 2, 2, 2)) out <- df %>% count(n) expect_equal(names(out), c("n", "nn")) expect_equal(out$nn, c(2, 3)) out <- df %>% count(n, sort = TRUE) expect_equal(out$nn, c(3, 2)) }) test_that("count preserves grouping of input", { df <- data.frame(g = c(1, 2, 2, 2)) out1 <- count(df, g) expect_equal(group_vars(out1), character()) df2 <- df %>% group_by(g) out2 <- count(df2) expect_equal(group_vars(out2), "g") }) test_that("grouped count includes group", { df <- data.frame(g = c(1, 2, 2, 2)) res <- df %>% group_by(g) %>% count() expect_equal(names(res), c("g", "n")) expect_equal(res$n, c(1, 3)) expect_equal(group_vars(res), "g") }) # add_count --------------------------------------------------------------- test_that("can add counts of a variable called n", { df <- data.frame(n = c(1, 1, 2, 2, 2)) out <- df %>% add_count(n) expect_equal(names(out), c("n", "nn")) expect_equal(out$n, df$n) expect_equal(out$nn, c(2, 2, 3, 3, 3)) out <- df %>% add_count(n, sort = TRUE) expect_equal(out$nn, c(3, 3, 3, 2, 2)) }) test_that("add_count respects and preserves existing groups", { df <- data.frame(g = c(1, 2, 2, 2), val = c("b", "b", "b", "c")) res <- df %>% add_count(val) expect_equal(res$n, c(3, 3, 3, 1)) expect_no_groups(res) res <- df %>% group_by(g) %>% add_count(val) expect_equal(res$n, c(1, 2, 2, 1)) expect_groups(res, "g") }) # tally ------------------------------------------------------------------- test_that("weighted tally drops NAs (#1145)", { df <- data_frame(x = c(1, 1, NA)) expect_equal(tally(df, x)$n, 2) }) # add_tally --------------------------------------------------------------- test_that("can add tallies of a variable", { df <- data.frame(a = c(1, 1, 2, 2, 2)) out <- df %>% group_by(a) %>% add_tally() expect_equal(names(out), c("a", "n")) expect_equal(out$a, df$a) expect_equal(out$n, c(2, 2, 3, 3, 3)) out <- df %>% group_by(a) %>% add_tally(sort = TRUE) expect_equal(out$n, c(3, 3, 3, 2, 2)) }) test_that("add_tally respects and preserves existing groups", { df <- data.frame(g = c(1, 2, 2, 2), val = c("b", "b", "b", "c")) res <- df %>% group_by(val) %>% add_tally() expect_equal(res$n, c(3, 3, 3, 1)) expect_groups(res, "val") res <- df %>% group_by(g, val) %>% add_tally() expect_equal(res$n, c(1, 2, 2, 1)) expect_groups(res, c("g", "val")) }) test_that("add_tally can be given a weighting variable", { df <- data.frame(a = c(1, 1, 2, 2, 2), w = c(1, 1, 2, 3, 4)) out <- df %>% group_by(a) %>% add_tally(wt = w) expect_equal(out$n, c(2, 2, 9, 9, 9)) out <- df %>% group_by(a) %>% add_tally(wt = w + 1) expect_equal(out$n, c(4, 4, 12, 12, 12)) }) dplyr/tests/testthat/helper-encoding.R0000644000176200001440000000245613135665123017610 0ustar liggesusersget_lang_strings <- function() { lang_strings <- c( de = "Gl\u00fcck", cn = "\u5e78\u798f", ru = "\u0441\u0447\u0430\u0441\u0442\u044c\u0435", ko = "\ud589\ubcf5" ) native_lang_strings <- enc2native(lang_strings) same <- (lang_strings == native_lang_strings) list( same = lang_strings[same], different = lang_strings[!same] ) } get_native_lang_string <- function() { lang_strings <- get_lang_strings() if (length(lang_strings$same) == 0) testthat::skip("No native language string available") lang_strings$same[[1L]] } get_alien_lang_string <- function() { lang_strings <- get_lang_strings() if (length(lang_strings$different) == 0) testthat::skip("No alien language string available") lang_strings$different[[1L]] } with_non_utf8_encoding <- function(code) { old_encoding <- set_non_utf8_encoding() on.exit(set_encoding(old_encoding), add = TRUE) code } set_non_utf8_encoding <- function() { if (.Platform$OS.type == "windows") return(NULL) tryCatch( locale <- set_encoding("en_US.ISO88591"), warning = function(e) { testthat::skip("Cannot set latin-1 encoding") } ) locale } set_encoding <- function(encoding) { if (is.null(encoding)) return(NULL) locale <- Sys.getlocale("LC_CTYPE") Sys.setlocale("LC_CTYPE", encoding) locale } dplyr/tests/testthat/test-binds.R0000644000176200001440000004131713153626677016633 0ustar liggesuserscontext("binds") # error ------------------------------------------------------------------- test_that("bind_rows() and bind_cols() err for non-data frames (#2373)", { df1 <- data_frame(x = 1) df2 <- structure(list(x = 1), class = "blah_frame") expect_error( bind_cols(df1, df2), "Argument 2 must be a data frame or a named atomic vector, not a blah_frame", fixed = TRUE ) expect_error( bind_rows(df1, df2), "Argument 2 must be a data frame or a named atomic vector, not a blah_frame", fixed = TRUE ) }) test_that("bind_rows() err for invalid ID", { df1 <- data_frame(x = 1:3) df2 <- data_frame(x = 4:6) expect_error( bind_rows(df1, df2, .id = 5), "`.id` must be a scalar string, not double of length 1", fixed = TRUE ) }) # columns ----------------------------------------------------------------- test_that("cbind uses shallow copies", { df1 <- data.frame( int = 1:10, num = rnorm(10), cha = letters[1:10], stringsAsFactors = FALSE) df2 <- data.frame( log = sample(c(T, F), 10, replace = TRUE), dat = seq.Date(Sys.Date(), length.out = 10, by = "day"), tim = seq(Sys.time(), length.out = 10, by = "1 hour") ) df <- bind_cols(df1, df2) expect_equal(dfloc(df1), dfloc(df)[names(df1)]) expect_equal(dfloc(df2), dfloc(df)[names(df2)]) }) test_that("bind_cols handles lists (#1104)", { exp <- data_frame(x = 1, y = "a", z = 2) l1 <- list(x = 1, y = "a") l2 <- list(z = 2) expect_equal(bind_cols(l1, l2), exp) expect_equal(bind_cols(list(l1, l2)), exp) }) test_that("bind_cols handles empty argument list (#1963)", { expect_equal(bind_cols(), data.frame()) }) test_that("bind_cols handles all-NULL values (#2303)", { expect_identical(bind_cols(list(a = NULL, b = NULL)), data.frame()) expect_identical(bind_cols(NULL), data.frame()) }) test_that("bind_cols repairs names", { df <- tibble(a = 1, b = 2) bound <- bind_cols(df, df) repaired <- as_tibble(tibble::repair_names( data.frame(a = 1, b = 2, a = 1, b = 2, check.names = FALSE) )) expect_equal(bound, repaired) }) # rows -------------------------------------------------------------------- df_var <- data_frame( l = c(T, F, F), i = c(1, 1, 2), d = Sys.Date() + c(1, 1, 2), f = factor(letters[c(1, 1, 2)]), n = c(1, 1, 2) + 0.5, t = Sys.time() + c(1, 1, 2), c = letters[c(1, 1, 2)] ) test_that("bind_rows() equivalent to rbind()", { exp <- tbl_df(rbind(df_var, df_var, df_var)) expect_equal(bind_rows(df_var, df_var, df_var), exp) expect_equal(bind_rows(list(df_var, df_var, df_var)), exp) }) test_that("bind_rows reorders columns", { df_var_scramble <- df_var[sample(ncol(df_var))] expect_equal( names(bind_rows(df_var, df_var_scramble)), names(df_var) ) }) test_that("bind_rows ignores NULL", { df <- data_frame(a = 1) expect_equal(bind_rows(df, NULL), df) expect_equal(bind_rows(list(df, NULL)), df) }) test_that("bind_rows only accepts data frames or named vectors", { ll <- list(1:5, rlang::get_env()) expect_error( bind_rows(ll), "Argument 1 must have names", fixed = TRUE ) ll <- list(tibble(a = 1:5), rlang::get_env()) expect_error( bind_rows(ll), "Argument 2 must be a data frame or a named atomic vector, not a environment", fixed = TRUE ) }) test_that("bind_rows handles list columns (#463)", { dfl <- data_frame(x = I(list(1:2, 1:3, 1:4))) res <- bind_rows(list(dfl, dfl)) expect_equal(rep(dfl$x, 2L), res$x) }) test_that("can bind lists of data frames #1389", { df <- data_frame(x = 1) res <- bind_rows(list(df, df), list(df, df)) expect_equal(nrow(res), 4) }) test_that("bind_rows handles data frames with no rows (#597)", { df1 <- data_frame(x = 1, y = factor("a")) df0 <- df1[0, ] expect_equal(bind_rows(df0), df0) expect_equal(bind_rows(df0, df0), df0) expect_equal(bind_rows(df0, df1), df1) }) test_that("bind_rows handles data frames with no columns (#1346)", { df1 <- data_frame(x = 1, y = factor("a")) df0 <- df1[, 0] expect_equal(bind_rows(df0), df0) expect_equal(dim(bind_rows(df0, df0)), c(2, 0)) res <- bind_rows(df0, df1) expect_equal(res$x, c(NA, 1)) }) test_that("bind_rows handles lists with NULL values (#2056)", { df1 <- data_frame(x = 1, y = 1) df2 <- data_frame(x = 2, y = 2) lst1 <- list(a = df1, NULL, b = df2) df3 <- data_frame( names = c("a", "b"), x = c(1, 2), y = c(1, 2) ) expect_equal(bind_rows(lst1, .id = "names"), df3) }) test_that("bind_rows handles lists with list() values (#2826)", { expect_equal(bind_rows(list(iris, list())), iris) }) test_that("bind_rows puts data frames in order received even if no columns (#2175)", { df2 <- data_frame(x = 2, y = "b") df1 <- df2[, 0] res <- bind_rows(df1, df2) expect_equal(res$x, c(NA, 2)) expect_equal(res$y, c(NA, "b")) }) # Column coercion -------------------------------------------------------------- test_that("bind_rows promotes integer to numeric", { df1 <- data_frame(a = 1L, b = 1L) df2 <- data_frame(a = 1, b = 1L) res <- bind_rows(df1, df2) expect_equal(typeof(res$a), "double") expect_equal(typeof(res$b), "integer") }) test_that("bind_rows does not coerce logical to integer", { df1 <- data_frame(a = FALSE) df2 <- data_frame(a = 1L) expect_error( bind_rows(df1, df2), "Column `a` can't be converted from logical to integer", fixed = TRUE ) }) test_that("bind_rows promotes factor to character with warning", { df1 <- data_frame(a = factor("a")) df2 <- data_frame(a = "b") expect_warning( res <- bind_rows(df1, df2), "binding factor and character vector, coercing into character vector" ) expect_equal(typeof(res$a), "character") }) test_that("bind_rows coerces factor to character when levels don't match", { df1 <- data.frame(a = factor("a")) df2 <- data.frame(a = factor("b")) expect_warning( res <- bind_rows(df1, df2), "Unequal factor levels: coercing to character" ) expect_equal(res$a, c("a", "b")) }) test_that("bind_rows handles NA in factors #279", { df1 <- data_frame(a = factor("a")) df2 <- data_frame(a = factor(NA)) expect_warning(res <- bind_rows(df1, df2), "Unequal factor levels") expect_equal(res$a, c("a", NA)) }) test_that("bind_rows doesn't promote integer/numeric to factor", { df1 <- data_frame(a = factor("a")) df2 <- data_frame(a = 1L) df3 <- data_frame(a = 1) expect_error( bind_rows(df1, df2), "Column `a` can't be converted from factor to integer", fixed = TRUE ) expect_error( bind_rows(df1, df3), "Column `a` can't be converted from factor to numeric", fixed = TRUE ) }) test_that("bind_rows preserves timezones #298", { dates1 <- data.frame(ID = c("a", "b", "c"), dates = structure(c(-247320000, -246196800, -245073600), tzone = "GMT", class = c("POSIXct", "POSIXt")), stringsAsFactors = FALSE) dates2 <- data.frame(ID = c("d", "e", "f"), dates = structure(c(-243864000, -242654400, -241444800), tzone = "GMT", class = c("POSIXct", "POSIXt")), stringsAsFactors = FALSE) alldates <- bind_rows(dates1, dates2) expect_equal(attr(alldates$dates, "tzone"), "GMT") }) test_that("bind_rows handles all NA columns (#493)", { mydata <- list( data.frame(x = c("foo", "bar")), data.frame(x = NA) ) res <- bind_rows(mydata) expect_true(is.na(res$x[3])) expect_is(res$x, "factor") mydata <- list( data.frame(x = NA), data.frame(x = c("foo", "bar")) ) res <- bind_rows(mydata) expect_true(is.na(res$x[1])) expect_is(res$x, "factor") }) test_that("bind_rows handles complex. #933", { df1 <- data.frame(r = c(1 + 1i, 2 - 1i)) df2 <- data.frame(r = c(1 - 1i, 2 + 1i)) df3 <- bind_rows(df1, df2) expect_equal(nrow(df3), 4L) expect_equal(df3$r, c(df1$r, df2$r)) }) test_that("bind_rows is careful about column names encoding #1265", { one <- data.frame(foo = 1:3, bar = 1:3) names(one) <- c("f\u00fc", "bar") two <- data.frame(foo = 1:3, bar = 1:3) names(two) <- c("f\u00fc", "bar") Encoding(names(one)[1]) <- "UTF-8" expect_equal(names(one), names(two)) res <- bind_rows(one, two) expect_equal(ncol(res), 2L) }) test_that("bind_rows handles POSIXct (#1125)", { df1 <- data.frame(date = as.POSIXct(NA)) df2 <- data.frame(date = as.POSIXct("2015-05-05")) res <- bind_rows(df1, df2) expect_equal(nrow(res), 2L) expect_true(is.na(res$date[1])) }) test_that("bind_rows respects ordered factors (#1112)", { l <- c("a", "b", "c", "d") id <- factor(c("a", "c", "d"), levels = l, ordered = TRUE) df <- data.frame(id = rep(id, 2), val = rnorm(6)) res <- bind_rows(df, df) expect_is(res$id, "ordered") expect_equal(levels(df$id), levels(res$id)) res <- group_by(df, id) %>% do(na.omit(.)) expect_is(res$id, "ordered") expect_equal(levels(df$id), levels(res$id)) }) test_that("bind_rows can handle lists (#1104)", { my_list <- list(tibble(x = 1, y = "a"), tibble(x = 2, y = "b")) res <- bind_rows(my_list) expect_equal(nrow(res), 2L) expect_is(res$x, "numeric") expect_is(res$y, "character") res <- bind_rows(list(x = 1, y = "a"), list(x = 2, y = "b")) expect_equal(nrow(res), 2L) expect_is(res$x, "numeric") expect_is(res$y, "character") }) test_that("bind_rows keeps ordered factors (#948)", { y <- bind_rows( data.frame(x = factor(c(1, 2, 3), ordered = TRUE)), data.frame(x = factor(c(1, 2, 3), ordered = TRUE)) ) expect_is(y$x, "ordered") expect_equal(levels(y$x), as.character(1:3)) }) test_that("bind handles POSIXct of different tz ", { date1 <- structure(-1735660800, tzone = "America/Chicago", class = c("POSIXct", "POSIXt")) date2 <- structure(-1735660800, tzone = "UTC", class = c("POSIXct", "POSIXt")) date3 <- structure(-1735660800, class = c("POSIXct", "POSIXt")) df1 <- data.frame(date = date1) df2 <- data.frame(date = date2) df3 <- data.frame(date = date3) res <- bind_rows(df1, df2) expect_equal(attr(res$date, "tzone"), "UTC") res <- bind_rows(df1, df3) expect_equal(attr(res$date, "tzone"), "America/Chicago") res <- bind_rows(df2, df3) expect_equal(attr(res$date, "tzone"), "UTC") res <- bind_rows(df3, df3) expect_equal(attr(res$date, "tzone"), NULL) res <- bind_rows(df1, df2, df3) expect_equal(attr(res$date, "tzone"), "UTC") }) test_that("bind_rows() creates a column of identifiers (#1337)", { data1 <- mtcars[c(2, 3), ] data2 <- mtcars[1, ] out <- bind_rows(data1, data2, .id = "col") out_list <- bind_rows(list(data1, data2), .id = "col") expect_equal(names(out)[1], "col") expect_equal(out$col, c("1", "1", "2")) expect_equal(out_list$col, c("1", "1", "2")) out_labelled <- bind_rows(one = data1, two = data2, .id = "col") out_list_labelled <- bind_rows(list(one = data1, two = data2), .id = "col") expect_equal(out_labelled$col, c("one", "one", "two")) expect_equal(out_list_labelled$col, c("one", "one", "two")) }) test_that("string vectors are filled with NA not blanks before collection (#595)", { one <- mtcars[1:10, -10] two <- mtcars[11:32, ] two$char_col <- letters[1:22] res <- bind_rows(one, two) expect_true(all(is.na(res$char_col[1:10]))) }) test_that("bind_rows handles POSIXct stored as integer (#1402)", { now <- Sys.time() df1 <- data.frame(time = now) expect_equal(class(bind_rows(df1)$time), c("POSIXct", "POSIXt")) df2 <- data.frame(time = seq(now, length.out = 1, by = 1)) expect_equal(class(bind_rows(df2)$time), c("POSIXct", "POSIXt")) res <- bind_rows(df1, df2) expect_equal(class(res$time), c("POSIXct", "POSIXt")) expect_true(all(res$time == c(df1$time, df2$time))) }) test_that("bind_cols accepts NULL (#1148)", { df1 <- data_frame(a = 1:10, b = 1:10) df2 <- data_frame(c = 1:10, d = 1:10) res1 <- bind_cols(df1, df2) res2 <- bind_cols(NULL, df1, df2) res3 <- bind_cols(df1, NULL, df2) res4 <- bind_cols(df1, df2, NULL) expect_equal(res1, res2) expect_equal(res1, res3) expect_equal(res1, res4) }) test_that("bind_rows handles 0-length named list (#1515)", { res <- bind_rows(list(a = 1)[-1]) expect_equal(nrow(res), 0L) expect_is(res, "data.frame") expect_equal(ncol(res), 0L) }) test_that("bind_rows handles promotion to strings (#1538)", { df1 <- data_frame(b = c(1, 2)) df2 <- data_frame(b = c(1L, 2L)) df3 <- data_frame(b = factor(c("A", "B"))) df4 <- data_frame(b = c("C", "D")) expect_error( bind_rows(df1, df3), "Column `b` can't be converted from numeric to factor", fixed = TRUE ) expect_error( bind_rows(df1, df4), "Column `b` can't be converted from numeric to character", fixed = TRUE ) expect_error( bind_rows(df2, df3), "Column `b` can't be converted from integer to factor", fixed = TRUE ) expect_error( bind_rows(df2, df4), "Column `b` can't be converted from integer to character", fixed = TRUE ) }) test_that("bind_rows infers classes from first result (#1692)", { d1 <- data.frame(a = 1:10, b = rep(1:2, each = 5)) d2 <- tbl_df(d1) d3 <- group_by(d1, b) d4 <- rowwise(d1) d5 <- list(a = 1:10, b = rep(1:2, each = 5)) expect_equal(class(bind_rows(d1, d1)), "data.frame") expect_equal(class(bind_rows(d2, d1)), c("tbl_df", "tbl", "data.frame")) res3 <- bind_rows(d3, d1) expect_equal(class(res3), c("grouped_df", "tbl_df", "tbl", "data.frame")) expect_equal(attr(res3, "group_sizes"), c(10, 10)) expect_equal(class(bind_rows(d4, d1)), c("rowwise_df", "tbl_df", "tbl", "data.frame")) expect_equal(class(bind_rows(d5, d1)), c("tbl_df", "tbl", "data.frame")) }) test_that("bind_cols infers classes from first result (#1692)", { d1 <- data.frame(a = 1:10, b = rep(1:2, each = 5)) d2 <- data_frame(c = 1:10, d = rep(1:2, each = 5)) d3 <- group_by(d2, d) d4 <- rowwise(d2) d5 <- list(c = 1:10, d = rep(1:2, each = 5)) expect_equal(class(bind_cols(d1, d1)), "data.frame") expect_equal(class(bind_cols(d2, d1)), c("tbl_df", "tbl", "data.frame")) res3 <- bind_cols(d3, d1) expect_equal(class(res3), c("grouped_df", "tbl_df", "tbl", "data.frame")) expect_equal(attr(res3, "group_sizes"), c(5, 5)) expect_equal(class(bind_cols(d4, d1)), c("rowwise_df", "tbl_df", "tbl", "data.frame")) expect_equal(class(bind_cols(d5, d1)), c("tbl_df", "tbl", "data.frame")) }) test_that("bind_rows rejects POSIXlt columns (#1789)", { df <- data_frame(x = Sys.time() + 1:12) df$y <- as.POSIXlt(df$x) expect_error( bind_rows(df, df), "Argument 2 can't be a list containing POSIXlt values", fixed = TRUE ) }) test_that("bind_rows rejects data frame columns (#2015)", { df <- list( x = 1:10, y = data.frame(a = 1:10, y = 1:10) ) class(df) <- "data.frame" attr(df, "row.names") <- .set_row_names(10) expect_error( dplyr::bind_rows(df, df), "Argument 2 can't be a list containing data frames", fixed = TRUE ) }) test_that("bind_rows accepts difftime objects", { df1 <- data.frame(x = as.difftime(1, units = "hours")) df2 <- data.frame(x = as.difftime(1, units = "mins")) res <- bind_rows(df1, df2) expect_equal(res$x, as.difftime(c(3600, 60), units = "secs")) }) test_that("bind_rows accepts hms objects", { df1 <- data.frame(x = hms::hms(hours = 1)) df2 <- data.frame(x = as.difftime(1, units = "mins")) res <- bind_rows(df1, df2) expect_equal(res$x, hms::hms(hours = c(1, 0), minutes = c(0, 1))) }) test_that("bind_rows() fails with unnamed vectors", { expect_error( bind_rows(1:2), "Argument 1 must have names", fixed = TRUE ) }) test_that("bind_rows() handles rowwise vectors", { expect_warning(regex = "character and factor", tbl <- bind_rows( tibble(a = "foo", b = "bar"), c(a = "A", b = "B"), set_names(factor(c("B", "B")), c("a", "b")) )) expect_identical(tbl, tibble(a = c("foo", "A", "B"), b = c("bar", "B", "B"))) id_tbl <- bind_rows(a = c(a = 1, b = 2), b = c(a = 3, b = 4), .id = "id") expect_identical(id_tbl, tibble(id = c("a", "b"), a = c(1, 3), b = c(2, 4))) }) test_that("bind_rows() accepts lists of dataframe-like lists as first argument", { expect_identical(bind_rows(list(list(a = 1, b = 2))), tibble(a = 1, b = 2)) }) # Vectors ------------------------------------------------------------ test_that("accepts named columns", { expect_identical(bind_cols(a = 1:2, b = 3:4), tibble(a = 1:2, b = 3:4)) expect_equal(bind_cols(!!! mtcars), as_tibble(mtcars)) }) test_that("uncompatible sizes fail", { expect_error( bind_cols(a = 1, mtcars), "Argument 2 must be length 32, not 1", fixed = TRUE ) expect_error( bind_cols(mtcars, a = 1), "Argument 2 must be length 1, not 32", fixed = TRUE ) }) test_that("unnamed vectors fail", { expect_error( bind_cols(1:2), "Argument 1 must have names", fixed = TRUE ) expect_error( bind_cols(!!! list(1:2)), "Argument 1 must have names", fixed = TRUE ) }) test_that("supports NULL values", { expect_identical(bind_cols(a = 1, NULL, b = 2, NULL), tibble(a = 1, b = 2)) }) dplyr/tests/testthat/test-near.R0000644000176200001440000000014513102155231016424 0ustar liggesuserscontext("near") test_that("near accepts nearby fp values", { expect_true(near(sqrt(2) ^ 2, 2)) }) dplyr/tests/testthat/test-mutate-windowed.R0000644000176200001440000001626413156557264020652 0ustar liggesuserscontext("Mutate - windowed") test_that("desc is correctly handled by window functions", { df <- data.frame(x = 1:10, y = seq(1, 10, by = 1), g = rep(c(1, 2), each = 5)) expect_equal(mutate(df, rank = min_rank(desc(x)))$rank, 10:1) expect_equal(mutate(group_by(df, g), rank = min_rank(desc(x)))$rank, rep(5:1, 2)) expect_equal(mutate(df, rank = row_number(desc(x)))$rank, 10:1) expect_equal(mutate(group_by(df, g), rank = row_number(desc(x)))$rank, rep(5:1, 2)) }) test_that("row_number gives correct results", { tmp <- data.frame(id = rep(c(1, 2), each = 5), value = c(1, 1, 2, 5, 0, 6, 4, 0, 0, 2)) res <- group_by(tmp, id) %>% mutate(var = row_number(value)) expect_equal(res$var, c(2, 3, 4, 5, 1, 5, 4, 1, 2, 3)) }) test_that("row_number works with 0 arguments", { g <- group_by(mtcars, cyl) expect_equal(mutate(g, rn = row_number()), mutate(g, rn = 1:n())) }) test_that("cum(sum,min,max) works", { df <- data.frame(x = 1:10, y = seq(1, 10, by = 1), g = rep(c(1, 2), each = 5)) res <- mutate(df, csumx = cumsum(x), csumy = cumsum(y), cminx = cummin(x), cminy = cummin(y), cmaxx = cummax(x), cmaxy = cummax(y) ) expect_equal(res$csumx, cumsum(df$x)) expect_equal(res$csumy, cumsum(df$y)) expect_equal(res$cminx, cummin(df$x)) expect_equal(res$cminy, cummin(df$y)) expect_equal(res$cmaxx, cummax(df$x)) expect_equal(res$cmaxy, cummax(df$y)) res <- mutate(group_by(df, g), csumx = cumsum(x), csumy = cumsum(y), cminx = cummin(x), cminy = cummin(y), cmaxx = cummax(x), cmaxy = cummax(y) ) expect_equal(res$csumx, c(cumsum(df$x[1:5]), cumsum(df$x[6:10]))) expect_equal(res$csumy, c(cumsum(df$y[1:5]), cumsum(df$y[6:10]))) expect_equal(res$cminx, c(cummin(df$x[1:5]), cummin(df$x[6:10]))) expect_equal(res$cminy, c(cummin(df$y[1:5]), cummin(df$y[6:10]))) expect_equal(res$cmaxx, c(cummax(df$x[1:5]), cummax(df$x[6:10]))) expect_equal(res$cmaxy, c(cummax(df$y[1:5]), cummax(df$y[6:10]))) df$x[3] <- NA df$y[4] <- NA res <- mutate(df, csumx = cumsum(x), csumy = cumsum(y), cminx = cummin(x), cminy = cummin(y), cmaxx = cummax(x), cmaxy = cummax(y) ) expect_true(all(is.na(res$csumx[3:10]))) expect_true(all(is.na(res$csumy[4:10]))) expect_true(all(is.na(res$cminx[3:10]))) expect_true(all(is.na(res$cminy[4:10]))) expect_true(all(is.na(res$cmaxx[3:10]))) expect_true(all(is.na(res$cmaxy[4:10]))) }) test_that("lead and lag simple hybrid version gives correct results (#133)", { res <- group_by(mtcars, cyl) %>% mutate(disp_lag_2 = lag(disp, 2), disp_lead_2 = lead(disp, 2)) %>% summarise( lag1 = all(is.na(head(disp_lag_2, 2))), lag2 = all(!is.na(tail(disp_lag_2, -2))), lead1 = all(is.na(tail(disp_lead_2, 2))), lead2 = all(!is.na(head(disp_lead_2, -2))) ) expect_true(all(res$lag1)) expect_true(all(res$lag2)) expect_true(all(res$lead1)) expect_true(all(res$lead2)) }) test_that("min_rank handles columns full of NaN (#726)", { test <- data.frame( Name = c("a", "b", "c", "d", "e"), ID = c(1, 1, 1, 1, 1), expression = c(NaN, NaN, NaN, NaN, NaN) ) data <- group_by(test, ID) %>% mutate(rank = min_rank(expression)) expect_true(all(is.na(data$rank))) }) test_that("rank functions deal correctly with NA (#774)", { data <- data_frame(x = c(1, 2, NA, 1, 0, NA)) res <- data %>% mutate( min_rank = min_rank(x), percent_rank = percent_rank(x), dense_rank = dense_rank(x), cume_dist = cume_dist(x), ntile = ntile(x, 2), row_number = row_number(x) ) expect_true(all(is.na(res$min_rank[c(3, 6)]))) expect_true(all(is.na(res$dense_rank[c(3, 6)]))) expect_true(all(is.na(res$percent_rank[c(3, 6)]))) expect_true(all(is.na(res$cume_dist[c(3, 6)]))) expect_true(all(is.na(res$ntile[c(3, 6)]))) expect_true(all(is.na(res$row_number[c(3, 6)]))) expect_equal(res$percent_rank[ c(1, 2, 4, 5) ], c(1 / 3, 1, 1 / 3, 0)) expect_equal(res$min_rank[ c(1, 2, 4, 5) ], c(2L, 4L, 2L, 1L)) expect_equal(res$dense_rank[ c(1, 2, 4, 5) ], c(2L, 3L, 2L, 1L)) expect_equal(res$cume_dist[ c(1, 2, 4, 5) ], c(.75, 1, .75, .25)) expect_equal(res$ntile[ c(1, 2, 4, 5) ], c(1L, 2L, 2L, 1L)) expect_equal(res$row_number[ c(1, 2, 4, 5) ], c(2L, 4L, 3L, 1L)) data <- data_frame( x = rep(c(1, 2, NA, 1, 0, NA), 2), g = rep(c(1, 2), each = 6) ) res <- data %>% group_by(g) %>% mutate( min_rank = min_rank(x), percent_rank = percent_rank(x), dense_rank = dense_rank(x), cume_dist = cume_dist(x), ntile = ntile(x, 2), row_number = row_number(x) ) expect_true(all(is.na(res$min_rank[c(3, 6, 9, 12)]))) expect_true(all(is.na(res$dense_rank[c(3, 6, 9, 12)]))) expect_true(all(is.na(res$percent_rank[c(3, 6, 9, 12)]))) expect_true(all(is.na(res$cume_dist[c(3, 6, 9, 12)]))) expect_true(all(is.na(res$ntile[c(3, 6, 9, 12)]))) expect_true(all(is.na(res$row_number[c(3, 6, 9, 12)]))) expect_equal(res$percent_rank[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(1 / 3, 1, 1 / 3, 0), 2)) expect_equal(res$min_rank[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(2L, 4L, 2L, 1L), 2)) expect_equal(res$dense_rank[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(2L, 3L, 2L, 1L), 2)) expect_equal(res$cume_dist[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(.75, 1, .75, .25), 2)) expect_equal(res$ntile[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(1L, 2L, 2L, 1L), 2)) expect_equal(res$row_number[ c(1, 2, 4, 5, 7, 8, 10, 11) ], rep(c(2L, 4L, 3L, 1L), 2)) }) test_that("lag and lead work on factors inside mutate (#955)", { test_factor <- factor(rep(c("A", "B", "C"), each = 3)) exp_lag <- test_factor != lag(test_factor) exp_lead <- test_factor != lead(test_factor) test_df <- tbl_df(data.frame(test = test_factor)) res <- test_df %>% mutate( is_diff_lag = (test != lag(test)), is_diff_lead = (test != lead(test)) ) expect_equal(exp_lag , res$is_diff_lag) expect_equal(exp_lead, res$is_diff_lead) }) test_that("lag handles default argument in mutate (#915)", { blah <- data.frame(x1 = c(5, 10, 20, 27, 35, 58, 5, 6), y = 8:1) blah <- mutate(blah, x2 = x1 - lag(x1, n = 1, default = 0), x3 = x1 - lead(x1, n = 1, default = 0), x4 = lag(x1, n = 1L, order_by = y), x5 = lead(x1, n = 1L, order_by = y) ) expect_equal(blah$x2, blah$x1 - lag(blah$x1, n = 1, default = 0)) expect_equal(blah$x3, blah$x1 - lead(blah$x1, n = 1, default = 0)) expect_equal(blah$x4, lag(blah$x1, n = 1L, order_by = blah$y)) expect_equal(blah$x5, lead(blah$x1, n = 1L, order_by = blah$y)) }) # FIXME: this should only fail if strict checking is on. # test_that("window functions fail if db doesn't support windowing", { # df_sqlite <- temp_load(temp_srcs("sqlite"), df)$sql %>% group_by(g) # ok <- collect(df_sqlite %>% mutate(x > 4)) # expect_equal(nrow(ok), 10) # # expect_error(df_sqlite %>% mutate(x > mean(x)), "does not support") # expect_error(df_sqlite %>% mutate(r = row_number()), "does not support") # }) test_that("dim attribute is stripped from grouped mutate (#1918)", { df <- data.frame(a = 1:3, b = 1:3) df_regular <- mutate(df, b = scale(b)) df_grouped <- mutate(group_by(df, a), b = scale(b)) df_rowwise <- mutate(rowwise(df), b = scale(b)) expect_null(dim(df$b)) expect_null(dim(df_grouped$b)) expect_null(dim(df_rowwise$b)) }) dplyr/tests/testthat/test-underscore.R0000644000176200001440000001572513153520575017677 0ustar liggesuserscontext("underscore") df <- data_frame( a = c(1:3, 2:3), b = letters[c(1:4, 4L)] ) test_that("arrange_ works", { expect_equal( arrange_(df, ~-a), arrange(df, -a) ) expect_equal( arrange_(df, .dots = list(quote(-a))), arrange(df, -a) ) expect_equal( arrange_(df, .dots = list(~-a)), arrange(df, -a) ) }) test_that("count_ works", { expect_equal( count_(df, ~b), count(df, b) ) expect_equal( count_(df, ~b, wt = quote(a)), count(df, b, wt = a) ) wt <- 1:4 expect_identical( count_(df, "b", "wt"), count(df, b, wt = wt) ) expect_identical( add_count(df, b), add_count_(df, ~b) ) }) test_that("distinct_ works", { expect_equal( distinct_(df, ~a), distinct(df, a) ) expect_equal( distinct_(df, .dots = list(quote(a))), distinct(df, a) ) expect_equal( distinct_(df, .dots = list(~a)), distinct(df, a) ) expect_equal( distinct_(df %>% group_by(b), ~a, .dots = NULL), distinct(df %>% group_by(b), a) ) expect_equal( distinct_(df %>% group_by(b), .dots = list(quote(a))), distinct(df %>% group_by(b), a) ) expect_equal( distinct_(df %>% group_by(b), .dots = list(~a)), distinct(df %>% group_by(b), a) ) }) test_that("do_ works", { expect_equal( do_(df, ~data_frame(-.$a)), do(df, data_frame(-.$a)) ) expect_equal( do_(df, .dots = list(quote(dplyr::data_frame(-.$a)))), do(df, data_frame(-.$a)) ) expect_equal( do_(df, .dots = list(~dplyr::data_frame(-.$a))), do(df, data_frame(-.$a)) ) foo <- "foobar" expect_identical( do_(df, .dots = "tibble(foo)"), do(df, tibble(foo)) ) expect_equal( do_(df %>% group_by(b), ~data_frame(-.$a)), do(df %>% group_by(b), data_frame(-.$a)) ) expect_equal( do_(df %>% group_by(b), .dots = list(quote(dplyr::data_frame(-.$a)))), do(df %>% group_by(b), data_frame(-.$a)) ) expect_equal( do_(df %>% group_by(b), .dots = list(~dplyr::data_frame(-.$a))), do(df %>% group_by(b), data_frame(-.$a)) ) }) test_that("filter_ works", { expect_equal( filter_(df, ~a > 1), filter(df, a > 1) ) expect_equal( filter_(df, .dots = list(quote(a > 1))), filter(df, a > 1) ) cnd <- rep(TRUE, 5) expect_identical( filter_(df, .dots = "cnd"), filter(df, cnd) ) }) test_that("funs_ works", { expect_equal( funs(mean), funs_(list(~mean)) ) expect_equal( funs_(list("mean")), funs_(list(`environment<-`(~mean, baseenv()))) ) expect_equal( funs(mean(.)), funs_(list(~mean(.))) ) }) test_that("group_by_ works", { expect_equal( group_by_(df, ~a), group_by(df, a) ) expect_equal( group_by_(df, ~-a), group_by(df, -a) ) expect_equal( group_by_(df, .dots = "a"), group_by(df, a) ) expect_equal( group_by_(df, .dots = list(quote(-a))), group_by(df, -a) ) expect_equal( group_by_(df, .dots = list(~-a)), group_by(df, -a) ) expect_warning( expect_equal( group_by_(df %>% rowwise, ~a), group_by(df %>% rowwise, a) ), "rowwise" ) expect_warning( expect_equal( group_by_(df %>% rowwise, ~-a), group_by(df %>% rowwise, -a) ), "rowwise" ) expect_warning( expect_equal( group_by_(df %>% rowwise, .dots = "a"), group_by(df %>% rowwise, a) ), "rowwise" ) expect_warning( expect_equal( group_by_(df %>% rowwise, .dots = list(quote(-a))), group_by(df %>% rowwise, -a) ), "rowwise" ) expect_warning( expect_equal( group_by_(df %>% rowwise, .dots = list(~-a)), group_by(df %>% rowwise, -a) ), "rowwise" ) }) test_that("mutate_ works", { expect_equal( mutate_(df, c = ~-a), mutate(df, c = -a) ) expect_equal( mutate_(df, .dots = list(c = quote(-a))), mutate(df, c = -a) ) expect_equal( mutate_(df, .dots = list(c = ~-a)), mutate(df, c = -a) ) expect_identical( mutate_(df, ~-a), mutate(df, -a) ) foo <- "foobar" expect_identical( mutate_(df, .dots = "foo"), mutate(df, foo) ) }) test_that("rename_ works", { expect_equal( rename_(df, c = ~a), rename(df, c = a) ) expect_equal( rename_(df, .dots = list(c = quote(a))), rename(df, c = a) ) expect_equal( rename_(df, .dots = list(c = ~a)), rename(df, c = a) ) }) test_that("select_ works", { expect_equal( select_(df, ~a), select(df, a) ) expect_equal( select_(df, ~-a), select(df, -a) ) expect_equal( select_(df, .dots = "a"), select(df, a) ) expect_equal( select_(df, .dots = list(quote(-a))), select(df, -a) ) expect_equal( select_(df, .dots = list(~-a)), select(df, -a) ) pos <- 1 expect_identical( select_(df, c = "pos"), select(df, c = pos) ) }) test_that("slice_ works", { expect_equal( slice_(df, ~2:n()), slice(df, 2:n()) ) expect_equal( slice_(df, .dots = list(quote(2:n()))), slice(df, 2:n()) ) expect_equal( slice_(df, .dots = list(~2:n())), slice(df, 2:n()) ) pos <- 3 expect_identical( slice_(df, .dots = "pos:n()"), slice(df, pos:n()) ) }) test_that("summarise_ works", { expect_equal( summarise_(df, ~mean(a)), summarise(df, mean(a)) ) expect_equal( summarise_(df, .dots = list(quote(mean(a)))), summarise(df, mean(a)) ) expect_equal( summarise_(df, .dots = list(~mean(a))), summarise(df, mean(a)) ) my_mean <- mean expect_identical( summarise_(df, .dots = "my_mean(a)"), summarise(df, my_mean(a)) ) expect_equal( summarise_(df %>% group_by(b), ~mean(a)), summarise(df %>% group_by(b), mean(a)) ) expect_equal( summarise_(df %>% group_by(b), .dots = list(quote(mean(a)))), summarise(df %>% group_by(b), mean(a)) ) expect_equal( summarise_(df %>% group_by(b), .dots = list(~mean(a))), summarise(df %>% group_by(b), mean(a)) ) }) test_that("summarize_ works", { expect_equal( summarize_(df, ~mean(a)), summarize(df, mean(a)) ) expect_equal( summarize_(df, .dots = list(quote(mean(a)))), summarize(df, mean(a)) ) expect_equal( summarize_(df, .dots = list(~mean(a))), summarize(df, mean(a)) ) expect_equal( summarize_(df %>% group_by(b), ~mean(a)), summarize(df %>% group_by(b), mean(a)) ) expect_equal( summarize_(df %>% group_by(b), .dots = list(quote(mean(a)))), summarize(df %>% group_by(b), mean(a)) ) expect_equal( summarize_(df %>% group_by(b), .dots = list(~mean(a))), summarize(df %>% group_by(b), mean(a)) ) }) test_that("transmute_ works", { expect_equal( transmute_(df, c = ~-a), transmute(df, c = -a) ) expect_equal( transmute_(df, .dots = list(c = quote(-a))), transmute(df, c = -a) ) expect_equal( transmute_(df, .dots = list(c = ~-a)), transmute(df, c = -a) ) foo <- "foobar" expect_identical( transmute_(df, .dots = "foo"), transmute(df, foo) ) }) dplyr/tests/testthat/test-hybrid-traverse.R0000644000176200001440000002235013153520575020630 0ustar liggesuserscontext("hybrid-traverse") test_df <- data_frame( id = c(1L, 2L, 2L), a = 1:3, b = as.numeric(1:3), c = letters[1:3], d = c(TRUE, FALSE, NA), e = list(list(a = 1, x = 2), list(a = 2, x = 3), list(a = 3, x = 4)) ) test_that("$ is parsed correctly (#1400)", { grouping <- rowwise expect_equal( test_df %>% grouping %>% mutate(f = e$x) %>% select(-e), test_df %>% mutate(f = as.numeric(2:4)) %>% grouping %>% select(-e) ) }) test_that("$ is parsed correctly if column by the same name exists (#1400)", { grouping <- rowwise expect_equal( test_df %>% grouping %>% mutate(f = e$a) %>% select(-e), test_df %>% mutate(f = as.numeric(1:3)) %>% grouping %>% select(-e) ) }) test_that("[[ works for ungrouped access (#912)", { grouping <- identity expect_equal( test_df %>% grouping %>% mutate(f = mean(test_df[["a"]])) %>% select(-e), test_df %>% mutate(f = mean(a)) %>% grouping %>% select(-e) ) }) test_that("[[ works for rowwise access of list columns (#912)", { grouping <- rowwise df <- tibble( x = c("a", "b"), y = list(list(a = 1, b = 2), list(a = 3, b = 4)) ) expect_equal( df %>% rowwise() %>% transmute(z = y[[x]]), data_frame(z = c(1, 4)) ) }) test_that("$ works for rle result (#2125)", { grouping <- identity expect_equal( test_df %>% grouping %>% mutate(f = rle(b)$lengths) %>% select(-e), test_df %>% mutate(f = rep(1L, 3L)) %>% grouping %>% select(-e) ) }) test_hybrid <- function(grouping) { test_that("case_when() works for LHS (#1719, #2244)", { expect_equal( test_df %>% grouping %>% mutate(f = case_when(a == 1 ~ 1, a == 2 ~ 2, TRUE ~ 3)) %>% select(-e), test_df %>% mutate(f = b) %>% grouping %>% select(-e) ) }) test_that("case_when() works for RHS (#1719, #2244)", { expect_equal( test_df %>% grouping %>% mutate(f = case_when(a == 1 ~ as.numeric(a), a == 2 ~ b, TRUE ~ 3)) %>% select(-e), test_df %>% mutate(f = b) %>% grouping %>% select(-e) ) }) test_that("assignments work (#1452)", { expect_false(env_has(nms = "xx")) expect_equal( test_df %>% grouping %>% mutate(f = { xx <- 5 xx }) %>% select(-e), test_df %>% mutate(f = 5) %>% grouping %>% select(-e) ) expect_false(env_has(nms = "xx")) }) test_that("assignments don't change variable (#315, #1452)", { expect_false(env_has(nms = "a")) expect_equal( test_df %>% grouping %>% mutate(f = { a <- 5 a }) %>% select(-e), test_df %>% mutate(f = 5) %>% grouping %>% select(-e) ) expect_false(env_has(nms = "a")) }) test_that("assignments don't carry over (#1452)", { # error messages by bindr/rlang expect_error( test_df %>% grouping %>% mutate(f = { xx <- 5; xx }, g = xx), "xx" ) }) test_that("assignments don't leak (#1452)", { expect_false(env_has(nms = "a")) test <- test_df %>% grouping %>% mutate(f = { xx <- 5 xx }) expect_false(env_has(nms = "a")) }) test_that("[ works (#912)", { grouped_df <- test_df %>% grouping expect_equal( grouped_df %>% mutate(f = mean(grouped_df["a"][[1]])) %>% select(-e), test_df %>% mutate(f = mean(a)) %>% grouping %>% select(-e) ) }) test_that("interpolation works (#1012)", { var <- quo(b) expect_equal( test_df %>% grouping %>% mutate(., f = mean(UQ(var))) %>% select(-e), test_df %>% grouping %>% mutate(f = mean(b)) %>% select(-e) ) }) test_that("can compute 1 - ecdf(y)(y) (#2018)", { surv <- function(x) 1 - ecdf(x)(x) expect_equal( test_df %>% grouping %>% mutate(., f = 1 - ecdf(b)(b)) %>% select(-e), test_df %>% grouping %>% mutate(., f = surv(b)) %>% select(-e) ) }) test_that("filter understands .data (#1012)", { expect_equal( test_df %>% grouping %>% filter({ b <- 5 .data$b < 2 }) %>% select(-e), test_df %>% grouping %>% filter(b < 2) %>% select(-e) ) }) test_that("filter understands .data (#1012)", { expect_equal( test_df %>% grouping %>% filter(.data[["b"]] < 2) %>% select(-e), test_df %>% grouping %>% filter(b < 2) %>% select(-e) ) }) test_that("filter understands .data (#1012)", { idx <- 2L expect_equal( test_df %>% grouping %>% filter(.data[[letters[[idx]]]] < 2) %>% select(-e), test_df %>% grouping %>% filter(b < 2) %>% select(-e) ) }) test_that("filter understands .env (#1469)", { b <- 2L expect_equal( filter( test_df %>% grouping, b < .env$b) %>% select(-e), test_df %>% grouping %>% filter(b < 2) %>% select(-e) ) }) test_that("filter understands get(..., .env) in a pipe (#1469)", { b <- 2L expect_equal( test_df %>% grouping %>% filter(b < get("b", envir = .env)) %>% select(-e), test_df %>% grouping %>% filter(b < 2) %>% select(-e) ) }) test_that("mutate understands .data (#1012)", { expect_equal( test_df %>% grouping %>% mutate(f = { b <- 5 .data$b }) %>% select(-e), test_df %>% grouping %>% mutate(f = b) %>% select(-e) ) }) test_that("mutate understands .data (#1012)", { expect_equal( test_df %>% grouping %>% mutate(f = .data[["b"]]) %>% select(-e), test_df %>% grouping %>% mutate(f = b) %>% select(-e) ) }) test_that("mutate understands .data (#1012)", { idx <- 2L expect_equal( test_df %>% grouping %>% mutate(f = .data[[letters[[idx]]]]) %>% select(-e), test_df %>% grouping %>% mutate(f = b) %>% select(-e) ) }) test_that("mutate understands .env (#1469)", { b <- 2L expect_equal( mutate( test_df %>% grouping, f = .env$b) %>% select(-e), test_df %>% grouping %>% mutate(f = 2L) %>% select(-e) ) }) test_that("mutate understands get(..., .env) in a pipe (#1469)", { b <- 2L expect_equal( test_df %>% grouping %>% mutate(f = get("b", .env)) %>% select(-e), test_df %>% grouping %>% mutate(f = 2L) %>% select(-e) ) }) test_that("summarise understands .data (#1012)", { expect_equal( test_df %>% grouping %>% summarise(f = { b <- 5; sum(.data$b) }), test_df %>% grouping %>% summarise(f = sum(b))) }) test_that("summarise understands .data (#1012)", { expect_equal( test_df %>% grouping %>% summarise(f = sum(.data[["b"]])), test_df %>% grouping %>% summarise(f = sum(b)) ) }) test_that("summarise understands .data (#1012)", { idx <- 2L expect_equal( test_df %>% grouping %>% summarise(f = sum(.data[[letters[[idx]]]])), test_df %>% grouping %>% summarise(f = sum(b)) ) }) test_that("summarise understands .env (#1469)", { b <- 2L expect_equal( summarise( test_df %>% grouping, f = .env$b), test_df %>% grouping %>% summarise(f = 2L) ) }) test_that("summarise understands get(..., .env) in a pipe (#1469)", { b <- 2L expect_equal( test_df %>% grouping %>% summarise(f = get("b", .env)), test_df %>% grouping %>% summarise(f = 2L) ) }) test_that("columns named .data and .env are overridden", { conflict_data <- data_frame(id = test_df$id, .data = 1:3, .env = 3:1) expect_equal( conflict_data %>% grouping %>% summarise(env = list(.env), data = list(.data)) %>% ungroup() %>% summarise( is_env_env = all(vapply(env, is.environment, logical(1))), is_data_env = all(vapply(env, is.environment, logical(1))) ), data_frame(is_env_env = TRUE, is_data_env = TRUE) ) }) test_that("contents of columns named .data and .env can be accessed", { conflict_data <- data_frame(id = test_df$id, .data = 1:3, .env = 3:1) expect_equal( conflict_data %>% grouping %>% summarise( env = mean(.data$.env), data = mean(.data$.data) ), conflict_data %>% rename(env = .env, data = .data) %>% grouping %>% summarise_at(vars(env, data), funs(mean)) ) }) } test_hybrid(identity) test_hybrid(rowwise) test_hybrid(. %>% group_by(!! quo(id))) dplyr/tests/testthat/test-between.R0000644000176200001440000000110313135665123017137 0ustar liggesuserscontext("between") test_that("returns NA if any argument is NA", { expect_equal(between(1, 1, NA), NA) expect_equal(between(1, NA, 1), NA) expect_equal(between(NA, 1, 1), NA) }) test_that("compatible with base R", { x <- runif(1e3) expect_equal(between(x, 0.25, 0.5), x >= 0.25 & x <= 0.5) }) test_that("warns when called on S3 object", { expect_warning(between(factor(1:5), 1, 3), "numeric vector with S3 class") }) test_that("unless it's a date or date time", { expect_warning(between(Sys.Date(), 1, 3), NA) expect_warning(between(Sys.time(), 1, 3), NA) }) dplyr/tests/testthat/helper-astyle.R0000644000176200001440000000177413156775616017341 0ustar liggesusersvcapply <- function(X, FUN, ..., USE.NAMES = TRUE) { vapply(X = X, FUN = FUN, FUN.VALUE = character(1L), ..., USE.NAMES = USE.NAMES) } astyle <- function(extra_args = character()) { astyle_cmd <- "astyle" if (Sys.which(astyle_cmd) == "") { skip("astyle not found") } astyle_args <- c( "-n", "--indent=spaces=2", "--unpad-paren", "--pad-header", "--pad-oper", "--min-conditional-indent=0", "--align-pointer=type", "--align-reference=type" ) src_path <- normalizePath(map_chr(c("../../src", "../../inst/include"), testthat::test_path)) src_files <- dir(src_path, "[.](?:cpp|h)$", recursive = TRUE, full.names = TRUE) astyle_files <- grep("(?:RcppExports[.](?:cpp|h)|static_assert[.]h)", src_files, value = TRUE, invert = TRUE) output <- system2(astyle_cmd, c(astyle_args, astyle_files, extra_args), stdout = TRUE, stderr = TRUE) unchanged <- grepl("^Unchanged", output) if (any(!unchanged)) { rlang::warn(paste(output[!unchanged], collapse = "\n")) } } dplyr/tests/testthat/test-recode.R0000644000176200001440000001240313153520575016755 0ustar liggesuserscontext("recode") test_that("error if no arguments", { expect_error( recode(1:5), "No replacements provided") expect_error( recode("a"), "No replacements provided") expect_error( recode(factor("a")), "No replacements provided") }) test_that("error if unnamed", { expect_error( recode("a", b = 5, "c"), "Argument 3 must be named, not unnamed" ) expect_error( recode(factor("a"), b = 5, "c"), "Argument 3 must be named, not unnamed", fixed = TRUE ) }) test_that("error if missing given for factors", { expect_error( recode(factor("a"), a = 5, .missing = 10), "`.missing` is not supported for factors", fixed = TRUE ) }) test_that("positional substitution works", { expect_equal(recode(1:2, "a", "b"), c("a", "b")) }) test_that("names override positions", { expect_equal(recode(1:2, `2` = "b", `1` = "a"), c("a", "b")) }) test_that("numeric vals must be all named or not named at all", { expect_error( recode(1:2, "b", `1` = "a"), "Either all values must be named, or none must be named" ) }) test_that("named substitution works", { x1 <- letters[1:3] x2 <- factor(x1) expect_equal(recode(x1, a = "apple", .default = NA_character_), c("apple", NA, NA)) expect_equal(recode(x2, a = "apple", .default = NA_character_), factor(c("apple", NA, NA))) }) test_that("missing values replaced by missing argument", { expect_equal(recode(c(1, NA), "a"), c("a", NA)) expect_equal(recode(c(1, NA), "a", .missing = "b"), c("a", "b")) expect_equal(recode(c(letters[1:3], NA), .missing = "A"), c("a", "b", "c", "A")) }) test_that("unmatched value replaced by default argument", { expect_warning(expect_equal(recode(c(1, 2), "a"), c("a", NA))) expect_equal(recode(c(1, 2), "a", .default = "b"), c("a", "b")) expect_equal(recode(letters[1:3], .default = "A"), c("A", "A", "A")) }) test_that("missing and default place nicely together", { expect_equal( recode(c(1, 2, NA), "a", .default = "b", .missing = "c"), c("a", "b", "c") ) }) test_that("can give name x", { expect_equal(recode("x", x = "a"), "a") }) test_that(".default works when not all values are named", { x <- rep(1:3, 3) expect_equal(recode(x, `3` = 10L, .default = x), rep(c(1L, 2L, 10L), 3)) }) test_that(".default is aliased to .x when missing and compatible", { x <- letters[1:3] expect_equal(recode(x, a = "A"), c("A", "b", "c")) n <- 1:3 expect_equal(recode(n, `1` = 10L), c(10L, 2L, 3L)) }) test_that(".default is not aliased to .x when missing and not compatible", { x <- letters[1:3] expect_warning(expect_equal(recode(x, a = 1), c(1L, NA, NA))) n <- 1:3 expect_warning(expect_equal(recode(n, `1` = "a"), c("a", NA, NA))) }) test_that("conversion of unreplaced values to NA gives warning", { expect_warning(recode(1:3, `1` = "a"), "treated as NA") expect_warning(recode_factor(letters[1:3], b = 1, c = 2)) }) test_that(".dot argument works correctly (PR #2110)", { x1 <- letters[1:3] x2 <- 1:3 expect_equal( recode(x1, a = "apple", b = "banana", .default = NA_character_), recode(x1, .default = NA_character_, !!! list(a = "apple", b = "banana")) ) expect_equal( recode(x1, a = "apple", b = "banana", .default = NA_character_), recode(x1, a = "apple", .default = NA_character_, !!! list(b = "banana")) ) expect_equal( recode(x2, "1" = 4, "2" = 5, .default = NA_real_), recode(x2, .default = NA_real_, !!! list("1" = 4, "2" = 5)) ) expect_equal( recode(x2, "1" = 4, "2" = 5, .default = NA_real_), recode(x2, "1" = 4, .default = NA_real_, !!! list("2" = 5)) ) }) # factor ------------------------------------------------------------------ test_that("default .default works with factors", { expect_equal(recode(factor(letters[1:3]), a = "A"), factor(c("A", "b", "c"))) }) test_that("can recode factor to double", { f <- factor(letters[1:3]) expect_equal(recode(f, a = 1, b = 2, c = 3), c(1, 2, 3)) expect_equal(recode(f, a = 1, b = 2), c(1, 2, NA)) expect_equal(recode(f, a = 1, b = 2, .default = 99), c(1, 2, 99)) }) test_that("recode_factor() handles .missing and .default levels", { x <- c(1:3, NA) expect_warning( expect_equal( recode_factor(x, `1` = "z", `2` = "y"), factor(c("z", "y", NA, NA), levels = c("z", "y")) ) ) expect_equal( recode_factor(x, `1` = "z", `2` = "y", .default = "D"), factor(c("z", "y", "D", NA), levels = c("z", "y", "D")) ) expect_equal( recode_factor(x, `1` = "z", `2` = "y", .default = "D", .missing = "M"), factor(c("z", "y", "D", "M"), c("z", "y", "D", "M")) ) }) test_that("recode_factor() handles vector .default", { expected <- factor(c("a", "z", "y"), levels = c("z", "y", "a")) x1 <- letters[1:3] x2 <- factor(x1) expect_equal(recode_factor(x1, b = "z", c = "y"), expected) expect_equal(recode_factor(x2, b = "z", c = "y"), expected) expect_equal(recode_factor(x1, b = "z", c = "y", .default = x1), expected) expect_equal(recode_factor(x2, b = "z", c = "y", .default = x1), expected) }) test_that("can recode factor with redundant levels", { expect_equal( recode(factor(letters[1:4]), d = "c", b = "a"), factor(c("a", "a", "c", "c"), levels = c("a", "c")) ) expect_equal( recode_factor(letters[1:4], d = "c", b = "a"), factor(c("a", "a", "c", "c"), levels = c("c", "a")) ) }) dplyr/tests/testthat/test-do.R0000644000176200001440000001222413153520575016117 0ustar liggesuserscontext("Do") # Grouped data frames ---------------------------------------------------------- df <- data.frame( g = c(1, 2, 2, 3, 3, 3), x = 1:6, y = 6:1 ) %>% group_by(g) test_that("can't use both named and unnamed args", { expect_error( df %>% do(x = 1, 2), "Arguments must either be all named or all unnamed", fixed = TRUE ) }) test_that("unnamed elements must return data frames", { expect_error( df %>% ungroup %>% do(1), "Result must be a data frame, not numeric") expect_error( df %>% do(1), "Results 1, 2, 3 must be data frames, not numeric") expect_error( df %>% do("a"), "Results 1, 2, 3 must be data frames, not character") }) test_that("unnamed results bound together by row", { first <- df %>% do(head(., 1)) expect_equal(nrow(first), 3) expect_equal(first$g, 1:3) expect_equal(first$x, c(1, 2, 4)) }) test_that("can only use single unnamed argument", { expect_error( df %>% do(head, tail), "Can only supply one unnamed argument, not 2" ) }) test_that("named argument become list columns", { out <- df %>% do(nrow = nrow(.), ncol = ncol(.)) expect_equal(out$nrow, list(1, 2, 3)) # includes grouping columns expect_equal(out$ncol, list(3, 3, 3)) }) test_that("colums in output override columns in input", { out <- df %>% do(data.frame(g = 1)) expect_equal(names(out), "g") expect_equal(out$g, c(1, 1, 1)) }) test_that("empty results preserved (#597)", { blankdf <- function(x) data.frame(blank = numeric(0)) dat <- data.frame(a = 1:2, b = factor(1:2)) expect_equal( dat %>% group_by(b) %>% do(blankdf(.)), data.frame(b = factor(integer(), levels = 1:2), blank = numeric()) ) }) test_that("empty inputs give empty outputs (#597)", { out <- data.frame(a = numeric(), b = factor()) %>% group_by(b) %>% do(data.frame()) expect_equal(out, data.frame(b = factor()) %>% group_by(b)) out <- data.frame(a = numeric(), b = character()) %>% group_by(b) %>% do(data.frame()) expect_equal(out, data.frame(b = character()) %>% group_by(b)) }) test_that("grouped do evaluates args in correct environment", { a <- 10 f <- function(a) { mtcars %>% group_by(cyl) %>% do(a = a) } expect_equal(f(100)$a, list(100, 100, 100)) }) # Ungrouped data frames -------------------------------------------------------- test_that("ungrouped data frame with unnamed argument returns data frame", { out <- mtcars %>% do(head(.)) expect_is(out, "data.frame") expect_equal(dim(out), c(6, 11)) }) test_that("ungrouped data frame with named argument returns list data frame", { out <- mtcars %>% do(x = 1, y = 2:10) expect_is(out, "tbl_df") expect_equal(out$x, list(1)) expect_equal(out$y, list(2:10)) }) test_that("ungrouped do evaluates args in correct environment", { a <- 10 f <- function(a) { mtcars %>% do(a = a) } expect_equal(f(100)$a, list(100)) }) # Zero row inputs -------------------------------------------------------------- test_that("empty data frames give consistent outputs", { dat <- data_frame(x = numeric(0), g = character(0)) grp <- dat %>% group_by(g) emt <- grp %>% filter(FALSE) dat %>% do(data.frame()) %>% vapply(type_sum, character(1)) %>% length %>% expect_equal(0) dat %>% do(data.frame(y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(y = "int")) dat %>% do(data.frame(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr")) dat %>% do(data.frame(., y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr", y = "int")) dat %>% do(y = ncol(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(y = "list")) # Grouped data frame should have same col types as ungrouped, with addition # of grouping variable grp %>% do(data.frame()) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr")) grp %>% do(data.frame(y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr", y = "int")) grp %>% do(data.frame(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr")) grp %>% do(data.frame(., y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr", y = "int")) grp %>% do(y = ncol(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr", y = "list")) # A empty grouped dataset should have same types as grp emt %>% do(data.frame()) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr")) emt %>% do(data.frame(y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr", y = "int")) emt %>% do(data.frame(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr")) emt %>% do(data.frame(., y = integer(0))) %>% vapply(type_sum, character(1)) %>% expect_equal(c(x = "dbl", g = "chr", y = "int")) emt %>% do(y = ncol(.)) %>% vapply(type_sum, character(1)) %>% expect_equal(c(g = "chr", y = "list")) }) test_that("handling of empty data frames in do", { blankdf <- function(x) data.frame(blank = numeric(0)) dat <- data.frame(a = 1:2, b = factor(1:2)) res <- dat %>% group_by(b) %>% do(blankdf(.)) expect_equal(names(res), c("b", "blank")) }) dplyr/tests/testthat/test-rank.R0000644000176200001440000000111013135665123016437 0ustar liggesuserscontext("rank") ntile_h <- function(x, n) { tibble(x = x) %>% mutate(y = ntile(x, n)) %>% pull(y) } test_that("ntile ignores number of NAs", { x <- c(1:3, NA, NA, NA) expect_equal(ntile(x, 3), x) expect_equal(ntile_h(x, 3), x) x1 <- c(1L, 1L, 1L, NA, NA, NA) expect_equal(ntile(x, 1), x1) expect_equal(ntile_h(x, 1), x1) }) test_that("ntile always returns an integer", { expect_equal(ntile(numeric(), 3), integer()) expect_equal(ntile_h(numeric(), 3), integer()) expect_equal(ntile(NA, 3), NA_integer_) expect_equal(ntile_h(NA, 3), NA_integer_) }) dplyr/tests/testthat/test-top-n.R0000644000176200001440000000074013153520575016552 0ustar liggesuserscontext("top_n") test_that("top_n returns n rows", { test_df <- data.frame(x = 1:10, y = 11:20) top_four <- test_df %>% top_n(4, y) expect_equal(dim(top_four), c(4, 2)) }) test_that("top_n() handles missing `wt`", { df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1)) expect_message(regexp = "Selecting by x", expect_identical(top_n(df, 2)$x, c(10, 6)) ) }) test_that("top_n() handles calls", { expect_identical(top_n(mtcars, 2, -disp), top_n(mtcars, -2, disp)) }) dplyr/tests/testthat/test-colwise-arrange.R0000644000176200001440000000070113135665123020573 0ustar liggesuserscontext("colwise arrange") df <- mtcars[1:3] test_that("scoped arrange is identical to manual arrange", { expect_identical(arrange_all(df), arrange(df, mpg, cyl, disp)) expect_identical(arrange_at(df, vars(mpg)), arrange(df, mpg)) expect_identical(arrange_if(iris, is.factor), arrange(iris, Species)) }) test_that(".funs is applied to variables before sorting", { expect_identical(arrange_all(df, `-`), arrange(df, -mpg, -cyl, -disp)) }) dplyr/tests/testthat/test-joins.r0000644000176200001440000007044213153520575016705 0ustar liggesuserscontext("Joins") # Univariate keys -------------------------------------------------------------- a <- data.frame(x = c(1, 1, 2, 3), y = 1:4) b <- data.frame(x = c(1, 2, 2, 4), z = 1:4) test_that("univariate inner join has all columns, repeated matching rows", { j <- inner_join(a, b, "x") expect_equal(names(j), c("x", "y", "z")) expect_equal(j$y, c(1, 2, 3, 3)) expect_equal(j$z, c(1, 1, 2, 3)) }) test_that("univariate left join has all columns, all rows", { j1 <- left_join(a, b, "x") j2 <- left_join(b, a, "x") expect_equal(names(j1), c("x", "y", "z")) expect_equal(names(j2), c("x", "z", "y")) expect_equal(j1$z, c(1, 1, 2, 3, NA)) expect_equal(j2$y, c(1, 2, 3, 3, NA)) }) test_that("univariate semi join has x columns, matching rows", { j1 <- semi_join(a, b, "x") j2 <- semi_join(b, a, "x") expect_equal(names(j1), c("x", "y")) expect_equal(names(j2), c("x", "z")) expect_equal(j1$y, 1:3) expect_equal(j2$z, 1:3) }) test_that("univariate anti join has x columns, missing rows", { j1 <- anti_join(a, b, "x") j2 <- anti_join(b, a, "x") expect_equal(names(j1), c("x", "y")) expect_equal(names(j2), c("x", "z")) expect_equal(j1$y, 4) expect_equal(j2$z, 4) }) test_that("univariate right join has all columns, all rows", { j1 <- right_join(a, b, "x") j2 <- right_join(b, a, "x") expect_equal(names(j1), c("x", "y", "z")) expect_equal(names(j2), c("x", "z", "y")) expect_equal(j1$x, c(1, 1, 2, 2, 4)) expect_equal(j1$y, c(1, 2, 3, 3, NA)) expect_equal(j1$z, c(1, 1, 2, 3, 4)) expect_equal(j2$x, c(1, 1, 2, 2, 3)) expect_equal(j2$y, c(1, 2, 3, 3, 4)) expect_equal(j2$z, c(1, 1, 2, 3, NA)) }) # Bivariate keys --------------------------------------------------------------- c <- data.frame( x = c(1, 1, 2, 3), y = c(1, 1, 2, 3), a = 1:4) d <- data.frame( x = c(1, 2, 2, 4), y = c(1, 2, 2, 4), b = 1:4) test_that("bivariate inner join has all columns, repeated matching rows", { j <- inner_join(c, d, c("x", "y")) expect_equal(names(j), c("x", "y", "a", "b")) expect_equal(j$a, c(1, 2, 3, 3)) expect_equal(j$b, c(1, 1, 2, 3)) }) test_that("bivariate left join has all columns, all rows", { j1 <- left_join(c, d, c("x", "y")) j2 <- left_join(d, c, c("x", "y")) expect_equal(names(j1), c("x", "y", "a", "b")) expect_equal(names(j2), c("x", "y", "b", "a")) expect_equal(j1$b, c(1, 1, 2, 3, NA)) expect_equal(j2$a, c(1, 2, 3, 3, NA)) }) test_that("bivariate semi join has x columns, matching rows", { j1 <- semi_join(c, d, c("x", "y")) j2 <- semi_join(d, c, c("x", "y")) expect_equal(names(j1), c("x", "y", "a")) expect_equal(names(j2), c("x", "y", "b")) expect_equal(j1$a, 1:3) expect_equal(j2$b, 1:3) }) test_that("bivariate anti join has x columns, missing rows", { j1 <- anti_join(c, d, c("x", "y")) j2 <- anti_join(d, c, c("x", "y")) expect_equal(names(j1), c("x", "y", "a")) expect_equal(names(j2), c("x", "y", "b")) expect_equal(j1$a, 4) expect_equal(j2$b, 4) }) # Duplicate column names -------------------------------------------------- e <- data.frame(x = c(1, 1, 2, 3), z = 1:4) f <- data.frame(x = c(1, 2, 2, 4), z = 1:4) test_that("univariate inner join has all columns, repeated matching rows", { j <- inner_join(e, f, "x") expect_equal(names(j), c("x", "z.x", "z.y")) expect_equal(j$z.x, c(1, 2, 3, 3)) expect_equal(j$z.y, c(1, 1, 2, 3)) }) test_that("univariate left join has all columns, all rows", { j1 <- left_join(e, f, "x") j2 <- left_join(f, e, "x") expect_equal(names(j1), c("x", "z.x", "z.y")) expect_equal(names(j2), c("x", "z.x", "z.y")) expect_equal(j1$z.y, c(1, 1, 2, 3, NA)) expect_equal(j2$z.y, c(1, 2, 3, 3, NA)) }) test_that("can control suffixes with suffix argument", { j1 <- inner_join(e, f, "x", suffix = c("1", "2")) j2 <- left_join(e, f, "x", suffix = c("1", "2")) j3 <- right_join(e, f, "x", suffix = c("1", "2")) j4 <- full_join(e, f, "x", suffix = c("1", "2")) expect_named(j1, c("x", "z1", "z2")) expect_named(j2, c("x", "z1", "z2")) expect_named(j3, c("x", "z1", "z2")) expect_named(j4, c("x", "z1", "z2")) }) test_that("can handle empty string in suffix argument, left side (#2228, #2182, #2007)", { j1 <- inner_join(e, f, "x", suffix = c("", "2")) j2 <- left_join(e, f, "x", suffix = c("", "2")) j3 <- right_join(e, f, "x", suffix = c("", "2")) j4 <- full_join(e, f, "x", suffix = c("", "2")) expect_named(j1, c("x", "z", "z2")) expect_named(j2, c("x", "z", "z2")) expect_named(j3, c("x", "z", "z2")) expect_named(j4, c("x", "z", "z2")) }) test_that("can handle empty string in suffix argument, right side (#2228, #2182, #2007)", { j1 <- inner_join(e, f, "x", suffix = c("1", "")) j2 <- left_join(e, f, "x", suffix = c("1", "")) j3 <- right_join(e, f, "x", suffix = c("1", "")) j4 <- full_join(e, f, "x", suffix = c("1", "")) expect_named(j1, c("x", "z1", "z")) expect_named(j2, c("x", "z1", "z")) expect_named(j3, c("x", "z1", "z")) expect_named(j4, c("x", "z1", "z")) }) test_that("disallow empty string in both sides of suffix argument (#2228)", { expect_error( inner_join(e, f, "x", suffix = c("", "")), "`suffix` can't be empty string for both `x` and `y` suffixes", fixed = TRUE ) expect_error( left_join(e, f, "x", suffix = c("", "")), "`suffix` can't be empty string for both `x` and `y` suffixes", fixed = TRUE ) expect_error( right_join(e, f, "x", suffix = c("", "")), "`suffix` can't be empty string for both `x` and `y` suffixes", fixed = TRUE ) expect_error( full_join(e, f, "x", suffix = c("", "")), "`suffix` can't be empty string for both `x` and `y` suffixes", fixed = TRUE ) }) test_that("check suffix input", { expect_error( inner_join(e, f, "x", suffix = letters[1:3]), "`suffix` must be a character vector of length 2, not character of length 3", fixed = TRUE ) expect_error( inner_join(e, f, "x", suffix = letters[1]), "`suffix` must be a character vector of length 2, not string of length 1", fixed = TRUE ) expect_error( inner_join(e, f, "x", suffix = 1:2), "`suffix` must be a character vector of length 2, not integer of length 2", fixed = TRUE ) }) test_that("inner_join does not segfault on NA in factors (#306)", { a <- data.frame(x = c("p", "q", NA), y = c(1, 2, 3), stringsAsFactors = TRUE) b <- data.frame(x = c("p", "q", "r"), z = c(4, 5, 6), stringsAsFactors = TRUE) expect_warning(res <- inner_join(a, b, "x"), "joining factors with different levels") expect_equal(nrow(res), 2L) }) test_that("joins don't reorder columns #328", { a <- data.frame(a = 1:3) b <- data.frame(a = 1:3, b = 1, c = 2, d = 3, e = 4, f = 5) res <- left_join(a, b, "a") expect_equal(names(res), names(b)) }) test_that("join handles type promotions #123", { df <- data.frame( V1 = c(rep("a", 5), rep("b", 5)), V2 = rep(c(1:5), 2), V3 = c(101:110), stringsAsFactors = FALSE ) match <- data.frame( V1 = c("a", "b"), V2 = c(3.0, 4.0), stringsAsFactors = FALSE ) res <- semi_join(df, match, c("V1", "V2")) expect_equal(res$V2, 3:4) expect_equal(res$V3, c(103L, 109L)) }) test_that("indices don't get mixed up when nrow(x) > nrow(y). #365", { a <- data.frame(V1 = c(0, 1, 2), V2 = c("a", "b", "c"), stringsAsFactors = FALSE) b <- data.frame(V1 = c(0, 1), V3 = c("n", "m"), stringsAsFactors = FALSE) res <- inner_join(a, b, by = "V1") expect_equal(res$V1, c(0, 1)) expect_equal(res$V2, c("a", "b")) expect_equal(res$V3, c("n", "m")) }) test_that("join functions error on column not found #371", { expect_error( left_join(data.frame(x = 1:5), data.frame(y = 1:5), by = "x"), "`by` can't contain join column `x` which is missing from RHS", fixed = TRUE ) expect_error( left_join(data.frame(x = 1:5), data.frame(y = 1:5), by = "y"), "`by` can't contain join column `y` which is missing from LHS", fixed = TRUE ) expect_error( left_join(data.frame(x = 1:5), data.frame(y = 1:5)), "`by` required, because the data sources have no common variables", fixed = TRUE ) expect_error( left_join(data.frame(x = 1:5), data.frame(y = 1:5), by = 1:3), "`by` must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not integer", fixed = TRUE ) }) test_that("inner_join is symmetric (even when joining on character & factor)", { foo <- data_frame(id = factor(c("a", "b")), var1 = "foo") bar <- data_frame(id = c("a", "b"), var2 = "bar") expect_warning(tmp1 <- inner_join(foo, bar, by = "id"), "joining factor and character") expect_warning(tmp2 <- inner_join(bar, foo, by = "id"), "joining character vector and factor") expect_is(tmp1$id, "character") expect_is(tmp2$id, "character") expect_equal(names(tmp1), c("id", "var1", "var2")) expect_equal(names(tmp2), c("id", "var2", "var1")) expect_equal(tmp1, tmp2) }) test_that("inner_join is symmetric, even when type of join var is different (#450)", { foo <- tbl_df(data.frame(id = 1:10, var1 = "foo")) bar <- tbl_df(data.frame(id = as.numeric(rep(1:10, 5)), var2 = "bar")) tmp1 <- inner_join(foo, bar, by = "id") tmp2 <- inner_join(bar, foo, by = "id") expect_equal(names(tmp1), c("id", "var1", "var2")) expect_equal(names(tmp2), c("id", "var2", "var1")) expect_equal(tmp1, tmp2) }) test_that("left_join by different variable names (#617)", { x <- data_frame(x1 = c(1, 3, 2)) y <- data_frame(y1 = c(1, 2, 3), y2 = c("foo", "foo", "bar")) res <- left_join(x, y, by = c("x1" = "y1")) expect_equal(names(res), c("x1", "y2")) expect_equal(res$x1, c(1, 3, 2)) expect_equal(res$y2, c("foo", "bar", "foo")) }) test_that("joins support comple vectors", { a <- data.frame(x = c(1, 1, 2, 3) * 1i, y = 1:4) b <- data.frame(x = c(1, 2, 2, 4) * 1i, z = 1:4) j <- inner_join(a, b, "x") expect_equal(names(j), c("x", "y", "z")) expect_equal(j$y, c(1, 2, 3, 3)) expect_equal(j$z, c(1, 1, 2, 3)) }) test_that("joins suffix variable names (#655)", { a <- data.frame(x = 1:10, y = 2:11) b <- data.frame(z = 5:14, x = 3:12) # x from this gets suffixed by .y res <- left_join(a, b, by = c("x" = "z")) expect_equal(names(res), c("x", "y", "x.y")) a <- data.frame(x = 1:10, z = 2:11) b <- data.frame(z = 5:14, x = 3:12) # x from this gets suffixed by .y res <- left_join(a, b, by = c("x" = "z")) }) test_that("right_join gets the column in the right order #96", { a <- data.frame(x = 1:10, y = 2:11) b <- data.frame(x = 5:14, z = 3:12) res <- right_join(a, b) expect_equal(names(res), c("x", "y", "z")) a <- data.frame(x = 1:10, y = 2:11) b <- data.frame(z = 5:14, a = 3:12) res <- right_join(a, b, by = c("x" = "z")) expect_equal(names(res), c("x", "y", "a")) }) test_that("full_join #96", { a <- data.frame(x = 1:3, y = 2:4) b <- data.frame(x = 3:5, z = 3:5) res <- full_join(a, b, "x") expect_equal(res$x, 1:5) expect_equal(res$y[1:3], 2:4) expect_true(all(is.na(res$y[4:5]))) expect_true(all(is.na(res$z[1:2]))) expect_equal(res$z[3:5], 3:5) }) test_that("JoinStringFactorVisitor and JoinFactorStringVisitor handle NA #688", { x <- data.frame(Greek = c("Alpha", "Beta", NA), numbers = 1:3) y <- data.frame( Greek = c("Alpha", "Beta", "Gamma"), Letters = c("C", "B", "C"), stringsAsFactors = F ) expect_warning( res <- left_join(x, y, by = "Greek"), "Column `Greek` joining factor and character vector, coercing into character vector", fixed = TRUE ) expect_true(is.na(res$Greek[3])) expect_true(is.na(res$Letters[3])) expect_equal(res$numbers, 1:3) expect_warning( res <- left_join(y, x, by = "Greek"), "Column `Greek` joining character vector and factor, coercing into character vector", fixed = TRUE ) expect_equal(res$Greek, y$Greek) expect_equal(res$Letters, y$Letters) expect_equal(res$numbers[1:2], 1:2) expect_true(is.na(res$numbers[3])) }) test_that("JoinFactorFactorVisitor_SameLevels preserve levels order (#675)", { input <- data.frame(g1 = factor(c("A", "B", "C"), levels = c("B", "A", "C"))) output <- data.frame( g1 = factor(c("A", "B", "C"), levels = c("B", "A", "C")), g2 = factor(c("A", "B", "C"), levels = c("B", "A", "C")) ) res <- inner_join(group_by(input, g1), group_by(output, g1)) expect_equal(levels(res$g1), levels(input$g1)) expect_equal(levels(res$g2), levels(output$g2)) }) test_that("inner_join does not reorder (#684)", { test <- data_frame(Greek = c("Alpha", "Beta", "Gamma"), Letters = LETTERS[1:3]) lookup <- data_frame(Letters = c("C", "B", "C")) res <- inner_join(lookup, test) expect_equal(res$Letters, c("C", "B", "C")) }) test_that("joins coerce factors with different levels to character (#684)", { d1 <- data_frame(a = factor(c("a", "b", "c"))) d2 <- data_frame(a = factor(c("a", "e"))) expect_warning(res <- inner_join(d1, d2)) expect_is(res$a, "character") # different orders d2 <- d1 attr(d2$a, "levels") <- c("c", "b", "a") expect_warning(res <- inner_join(d1, d2)) expect_is(res$a, "character") }) test_that("joins between factor and character coerces to character with a warning (#684)", { d1 <- data_frame(a = factor(c("a", "b", "c"))) d2 <- data_frame(a = c("a", "e")) expect_warning(res <- inner_join(d1, d2)) expect_is(res$a, "character") expect_warning(res <- inner_join(d2, d1)) expect_is(res$a, "character") }) test_that("group column names reflect renamed duplicate columns (#2330)", { d1 <- data_frame(x = 1:5, y = 1:5) %>% group_by(x, y) d2 <- data_frame(x = 1:5, y = 1:5) res <- inner_join(d1, d2, by = "x") expect_groups(d1, c("x", "y")) expect_groups(res, c("x", "y.x")) }) test_that("group column names are null when joined data frames are not grouped (#2330)", { d1 <- data_frame(x = 1:5, y = 1:5) d2 <- data_frame(x = 1:5, y = 1:5) res <- inner_join(d1, d2, by = "x") expect_no_groups(res) }) # Guessing variables in x and y ------------------------------------------------ test_that("unnamed vars are the same in both tables", { by1 <- common_by_from_vector(c("x", "y", "z")) expect_equal(by1$x, c("x", "y", "z")) expect_equal(by1$y, c("x", "y", "z")) by2 <- common_by_from_vector(c("x" = "a", "y", "z")) expect_equal(by2$x, c("x", "y", "z")) expect_equal(by2$y, c("a", "y", "z")) }) test_that("join columns are not moved to the left (#802)", { df1 <- data.frame(x = 1, y = 1:5) df2 <- data.frame(y = 1:5, z = 2) out <- left_join(df1, df2) expect_equal(names(out), c("x", "y", "z")) }) test_that("join can handle multiple encodings (#769)", { text <- c("\xC9lise", "Pierre", "Fran\xE7ois") Encoding(text) <- "latin1" x <- data_frame(name = text, score = c(5, 7, 6)) y <- data_frame(name = text, attendance = c(8, 10, 9)) res <- left_join(x, y, by = "name") expect_equal(nrow(res), 3L) expect_equal(res$name, x$name) x <- data_frame(name = factor(text), score = c(5, 7, 6)) y <- data_frame(name = text, attendance = c(8, 10, 9)) res <- suppressWarnings(left_join(x, y, by = "name")) expect_equal(nrow(res), 3L) expect_equal(res$name, y$name) x <- data_frame(name = text, score = c(5, 7, 6)) y <- data_frame(name = factor(text), attendance = c(8, 10, 9)) res <- suppressWarnings(left_join(x, y, by = "name")) expect_equal(nrow(res), 3L) expect_equal(res$name, x$name) x <- data_frame(name = factor(text), score = c(5, 7, 6)) y <- data_frame(name = factor(text), attendance = c(8, 10, 9)) res <- suppressWarnings(left_join(x, y, by = "name")) expect_equal(nrow(res), 3L) expect_equal(res$name, x$name) }) test_that("join creates correctly named results (#855)", { x <- data.frame(q = c("a", "b", "c"), r = c("d", "e", "f"), s = c("1", "2", "3")) y <- data.frame(q = c("a", "b", "c"), r = c("d", "e", "f"), t = c("xxx", "xxx", "xxx")) res <- left_join(x, y, by = c("r", "q")) expect_equal(names(res), c("q", "r", "s", "t")) expect_equal(res$q, x$q) expect_equal(res$r, x$r) }) test_that("inner join gives same result as merge by default (#1281)", { set.seed(75) x <- data.frame(cat1 = sample(c("A", "B", NA), 5, 1), cat2 = sample(c(1, 2, NA), 5, 1), v = rpois(5, 3), stringsAsFactors = FALSE) y <- data.frame(cat1 = sample(c("A", "B", NA), 5, 1), cat2 = sample(c(1, 2, NA), 5, 1), v = rpois(5, 3), stringsAsFactors = FALSE) ij <- inner_join(x, y, by = c("cat1", "cat2")) me <- merge(x, y, by = c("cat1", "cat2")) expect_true(equal_data_frame(ij, me)) }) test_that("join handles matrices #1230", { df1 <- data_frame(x = 1:10, text = letters[1:10]) df2 <- data_frame(x = 1:5, text = "") df2$text <- matrix(LETTERS[1:10], nrow = 5) res <- left_join(df1, df2, by = c("x" = "x")) %>% filter(x > 5) text.y <- res$text.y expect_true(is.matrix(text.y)) expect_equal(dim(text.y), c(5, 2)) expect_true(all(is.na(text.y))) }) test_that("ordering of strings is not confused by R's collate order (#1315)", { a = data.frame(character = c("\u0663"), set = c("arabic_the_language"), stringsAsFactors = F) b = data.frame(character = c("3"), set = c("arabic_the_numeral_set"), stringsAsFactors = F) res <- b %>% inner_join(a, by = c("character")) expect_equal(nrow(res), 0L) res <- a %>% inner_join(b, by = c("character")) expect_equal(nrow(res), 0L) }) test_that("joins handle tzone differences (#819)", { date1 <- structure(-1735660800, tzone = "America/Chicago", class = c("POSIXct", "POSIXt")) date2 <- structure(-1735660800, tzone = "UTC", class = c("POSIXct", "POSIXt")) df1 <- data.frame(date = date1) df2 <- data.frame(date = date2) expect_equal(attr(left_join(df1, df1)$date, "tzone"), "America/Chicago") }) test_that("joins matches NA in character vector by default (#892, #2033)", { x <- data.frame( id = c(NA_character_, NA_character_), stringsAsFactors = F ) y <- expand.grid( id = c(NA_character_, NA_character_), LETTER = LETTERS[1:2], stringsAsFactors = F ) res <- left_join(x, y, by = "id") expect_true(all(is.na(res$id))) expect_equal(res$LETTER, rep(rep(c("A", "B"), each = 2), 2)) }) test_that("joins avoid name repetition (#1460)", { d1 <- data.frame(id = 1:5, foo = rnorm(5)) d2 <- data.frame(id = 1:5, foo = rnorm(5)) d3 <- data.frame(id = 1:5, foo = rnorm(5)) d <- d1 %>% left_join(d1, by = "id") %>% left_join(d2, by = "id") %>% left_join(d3, by = "id") expect_equal(names(d), c("id", "foo.x", "foo.y", "foo.x.x", "foo.y.y")) }) test_that("join functions are protected against empty by (#1496)", { x <- data.frame() y <- data.frame(a = 1) expect_error( left_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) expect_error( right_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) expect_error( semi_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) expect_error( full_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) expect_error( anti_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) expect_error( inner_join(x, y, by = names(x)), "`by` must specify variables to join by", fixed = TRUE ) }) test_that("joins takes care of duplicates in by (#1192)", { data2 <- data_frame(a = 1:3) data1 <- data_frame(a = 1:3, c = 3:5) res1 <- left_join(data1, data2, by = c("a", "a")) res2 <- left_join(data1, data2, by = c("a" = "a")) expect_equal(res1, res2) }) # Joined columns result in correct type ---------------------------------------- test_that("result of joining POSIXct is POSIXct (#1578)", { data1 <- data_frame( t = seq(as.POSIXct("2015-12-01", tz = "UTC"), length.out = 2, by = "days"), x = 1:2 ) data2 <- inner_join(data1, data1, by = "t") res1 <- class(data2$t) expected <- c("POSIXct", "POSIXt") expect_identical(res1, expected) }) test_that("joins allows extra attributes if they are identical (#1636)", { tbl_left <- data_frame( i = rep(c(1, 2, 3), each = 2), x1 = letters[1:6] ) tbl_right <- data_frame( i = c(1, 2, 3), x2 = letters[1:3] ) attr(tbl_left$i, "label") <- "iterator" attr(tbl_right$i, "label") <- "iterator" res <- left_join(tbl_left, tbl_right, by = "i") expect_equal(attr(res$i, "label"), "iterator") attr(tbl_left$i, "foo") <- "bar" attributes(tbl_right$i) <- NULL attr(tbl_right$i, "foo") <- "bar" attr(tbl_right$i, "label") <- "iterator" res <- left_join(tbl_left, tbl_right, by = "i") expect_equal(attr(res$i, "label"), "iterator") expect_equal(attr(res$i, "foo"), "bar") }) test_that("joins work with factors of different levels (#1712)", { d1 <- iris[, c("Species", "Sepal.Length")] d2 <- iris[, c("Species", "Sepal.Width")] d2$Species <- factor(as.character(d2$Species), levels = rev(levels(d1$Species))) expect_warning(res1 <- left_join(d1, d2, by = "Species")) d1$Species <- as.character(d1$Species) d2$Species <- as.character(d2$Species) res2 <- left_join(d1, d2, by = "Species") expect_equal(res1, res2) }) test_that("anti and semi joins give correct result when by variable is a factor (#1571)", { big <- data.frame(letter = rep(c("a", "b"), each = 2), number = 1:2) small <- data.frame(letter = "b") expect_warning( aj_result <- anti_join(big, small, by = "letter"), "Column `letter` joining factors with different levels, coercing to character vector", fixed = TRUE ) expect_equal(aj_result$number, 1:2) expect_equal(aj_result$letter, factor(c("a", "a"), levels = c("a", "b"))) expect_warning( sj_result <- semi_join(big, small, by = "letter"), "Column `letter` joining factors with different levels, coercing to character vector", fixed = TRUE ) expect_equal(sj_result$number, 1:2) expect_equal(sj_result$letter, factor(c("b", "b"), levels = c("a", "b"))) }) test_that("inner join not crashing (#1559)", { df3 <- data_frame( id = c(102, 102, 102, 121), name = c("qwer", "qwer", "qwer", "asdf"), k = factor(c("one", "two", "total", "one"), levels = c("one", "two", "total")), total = factor(c("tot", "tot", "tot", "tot"), levels = c("tot", "plan", "fact")), v = c(NA_real_, NA_real_, NA_real_, NA_real_), btm = c(25654.957609, 29375.7547216667, 55030.7123306667, 10469.3523273333), top = c(22238.368946, 30341.516924, 52579.88587, 9541.893144) ) df4 <- data_frame( id = c(102, 102, 102, 121), name = c("qwer", "qwer", "qwer", "asdf"), k = factor(c("one", "two", "total", "one"), levels = c("one", "two", "total")), type = factor(c("fact", "fact", "fact", "fact"), levels = c("tot", "plan", "fact")), perc = c(0.15363485835208, -0.0318297270618471, 0.0466114830816894, 0.0971986553754823) ) # all we want here is to test that this does not crash expect_message(res <- replicate(100, df3 %>% inner_join(df4))) for (i in 2:100) expect_equal(res[, 1], res[, i]) }) test_that("join handles mix of encodings in data (#1885, #2118, #2271)", { with_non_utf8_encoding({ special <- get_native_lang_string() for (factor1 in c(FALSE, TRUE)) { for (factor2 in c(FALSE, TRUE)) { for (encoder1 in c(enc2native, enc2utf8)) { for (encoder2 in c(enc2native, enc2utf8)) { df1 <- data.frame(x = encoder1(special), y = 1, stringsAsFactors = factor1) df1 <- tbl_df(df1) df2 <- data.frame(x = encoder2(special), z = 2, stringsAsFactors = factor2) df2 <- tbl_df(df2) df <- data.frame(x = special, y = 1, z = 2, stringsAsFactors = factor1 && factor2) df <- tbl_df(df) info <- paste( factor1, factor2, Encoding(as.character(df1$x)), Encoding(as.character(df2$x)) ) if (factor1 != factor2) warning_msg <- "coercing" else warning_msg <- NA expect_warning_msg <- function(code, msg = warning_msg) { expect_warning( code, msg, info = paste(deparse(substitute(code)[[2]][[1]]), info)) } expect_equal_df <- function(code, df_ = df) { code <- substitute(code) eval(bquote( expect_equal( .(code), df_, info = paste(deparse(code[[1]]), info) ) )) } expect_warning_msg(expect_equal_df(inner_join(df1, df2, by = "x"))) expect_warning_msg(expect_equal_df(left_join(df1, df2, by = "x"))) expect_warning_msg(expect_equal_df(right_join(df1, df2, by = "x"))) expect_warning_msg(expect_equal_df(full_join(df1, df2, by = "x"))) expect_warning_msg( expect_equal_df( semi_join(df1, df2, by = "x"), data.frame(x = special, y = 1, stringsAsFactors = factor1) ) ) expect_warning_msg( expect_equal_df( anti_join(df1, df2, by = "x"), data.frame(x = special, y = 1, stringsAsFactors = factor1)[0,] ) ) } } } } }) }) test_that("left_join handles mix of encodings in column names (#1571)", { with_non_utf8_encoding({ special <- get_native_lang_string() df1 <- data_frame(x = 1:6, foo = 1:6) names(df1)[1] <- special df2 <- data_frame(x = 1:6, baz = 1:6) names(df2)[1] <- enc2native(special) expect_message(res <- left_join(df1, df2), special, fixed = TRUE) expect_equal(names(res), c(special, "foo", "baz")) expect_equal(res$foo, 1:6) expect_equal(res$baz, 1:6) expect_equal(res[[special]], 1:6) }) }) test_that("NAs match in joins only with na_matches = 'na' (#2033)", { df1 <- data_frame(a = NA) df2 <- data_frame(a = NA, b = 1:3) for (na_matches in c("na", "never")) { accept_na_match <- (na_matches == "na") expect_equal(inner_join(df1, df2, na_matches = na_matches) %>% nrow, 0 + 3 * accept_na_match) expect_equal(left_join(df1, df2, na_matches = na_matches) %>% nrow, 1 + 2 * accept_na_match) expect_equal(right_join(df2, df1, na_matches = na_matches) %>% nrow, 1 + 2 * accept_na_match) expect_equal(full_join(df1, df2, na_matches = na_matches) %>% nrow, 4 - accept_na_match) expect_equal(anti_join(df1, df2, na_matches = na_matches) %>% nrow, 1 - accept_na_match) expect_equal(semi_join(df1, df2, na_matches = na_matches) %>% nrow, 0 + accept_na_match) } }) test_that("joins strip group indexes (#1597)", { df1 <- data_frame(a = 1:3) %>% group_by(a) df2 <- data_frame(a = rep(1:4, 2)) %>% group_by(a) expect_stripped <- function(df) { expect_null(attr(df, "indices")) expect_null(attr(df, "group_sizes")) expect_null(attr(df, "biggest_group_size")) expect_null(attr(df, "labels")) } expect_stripped(inner_join(df1, df2)) expect_stripped(left_join(df1, df2)) expect_stripped(right_join(df2, df1)) expect_stripped(full_join(df1, df2)) expect_stripped(anti_join(df1, df2)) expect_stripped(semi_join(df1, df2)) }) test_that("join accepts tz attributes (#2643)", { # It's the same time: df1 <- data_frame(a = as.POSIXct("2009-01-01 10:00:00", tz = "Europe/London")) df2 <- data_frame(a = as.POSIXct("2009-01-01 11:00:00", tz = "Europe/Paris")) result <- inner_join(df1, df2, by = "a") expect_equal(nrow(result), 1) }) test_that("join takes LHS with warning if attributes inconsistent", { df1 <- tibble(a = 1:2, b = 2:1) df2 <- tibble( a = structure(1:2, foo = "bar"), c = 2:1 ) expect_warning( out1 <- left_join(df1, df2, by = "a"), "Column `a` has different attributes on LHS and RHS of join" ) expect_warning(out2 <- left_join(df2, df1, by = "a")) expect_warning( out3 <- left_join(df1, df2, by = c("b" = "a")), "Column `b`/`a` has different attributes on LHS and RHS of join" ) expect_equal(attr(out1$a, "foo"), NULL) expect_equal(attr(out2$a, "foo"), "bar") }) test_that("common_by() message", { df <- tibble(!!! set_names(letters, letters)) expect_message( left_join(df, df %>% select(1)), 'Joining, by = "a"', fixed = TRUE ) expect_message( left_join(df, df %>% select(1:3)), 'Joining, by = c("a", "b", "c")', fixed = TRUE ) expect_message( left_join(df, df), paste0("Joining, by = c(", paste0('"', letters, '"', collapse = ", "), ")"), fixed = TRUE ) }) test_that("semi- and anti-joins preserve order (#2964)", { expect_identical( data_frame(a = 3:1) %>% semi_join(data_frame(a = 1:3)), data_frame(a = 3:1) ) expect_identical( data_frame(a = 3:1) %>% anti_join(data_frame(a = 4:6)), data_frame(a = 3:1) ) }) dplyr/tests/testthat/test-slice.r0000644000176200001440000000625413153524224016655 0ustar liggesuserscontext("slice") test_that("slice handles numeric input (#226)", { g <- mtcars %>% group_by(cyl) res <- g %>% slice(1) expect_equal(nrow(res), 3) expect_equal(res, g %>% filter(row_number() == 1L)) expect_equal( mtcars %>% slice(1), mtcars %>% filter(row_number() == 1L) ) }) test_that("slice silently ignores out of range values (#226)", { expect_equal(slice(mtcars, c(2, 100)), slice(mtcars, 2)) g <- group_by(mtcars, cyl) expect_equal(slice(g, c(2, 100)), slice(g, 2)) }) test_that("slice works with 0 args", { expect_equivalent(slice(mtcars), mtcars) }) test_that("slice works with negative indices", { res <- slice(mtcars, -(1:2)) exp <- tail(mtcars, -2) expect_equal(names(res), names(exp)) for (col in names(res)) { expect_equal(res[[col]], exp[[col]]) } }) test_that("slice forbids positive and negative together", { expect_error( mtcars %>% slice(c(-1, 2)), "Found 1 positive indices and 1 negative indices", fixed = TRUE ) }) test_that("slice works with grouped data", { g <- group_by(mtcars, cyl) res <- slice(g, 1:2) exp <- filter(g, row_number() < 3) expect_equal(res, exp) res <- slice(g, -(1:2)) exp <- filter(g, row_number() >= 3) expect_equal(res, exp) }) test_that("slice gives correct rows (#649)", { a <- data_frame(value = paste0("row", 1:10)) expect_equal(slice(a, 1:3)$value, paste0("row", 1:3)) expect_equal(slice(a, c(4, 6, 9))$value, paste0("row", c(4, 6, 9))) a <- data_frame( value = paste0("row", 1:10), group = rep(1:2, each = 5) ) %>% group_by(group) expect_equal(slice(a, 1:3)$value, paste0("row", c(1:3, 6:8))) expect_equal(slice(a, c(2, 4))$value, paste0("row", c(2, 4, 7, 9))) }) test_that("slice handles NA (#1235)", { df <- data_frame(x = 1:3) expect_equal(nrow(slice(df, NA_integer_)), 0L) expect_equal(nrow(slice(df, c(1L, NA_integer_))), 1L) expect_equal(nrow(slice(df, c(-1L, NA_integer_))), 2L) df <- data_frame(x = 1:4, g = rep(1:2, 2)) %>% group_by(g) expect_equal(nrow(slice(df, NA)), 0L) expect_equal(nrow(slice(df, c(1, NA))), 2) expect_equal(nrow(slice(df, c(-1, NA))), 2) }) test_that("slice handles empty data frames (#1219)", { df <- data.frame(x = numeric()) res <- df %>% slice(1:3) expect_equal(nrow(res), 0L) expect_equal(names(res), "x") }) test_that("slice works fine if n > nrow(df) (#1269)", { slice_res <- mtcars %>% group_by(cyl) %>% slice(8) filter_res <- mtcars %>% group_by(cyl) %>% filter(row_number() == 8) expect_equal(slice_res, filter_res) }) test_that("slice strips grouped indices (#1405)", { res <- mtcars %>% group_by(cyl) %>% slice(1) %>% mutate(mpgplus = mpg + 1) expect_equal(nrow(res), 3L) expect_equal(attr(res, "indices"), as.list(0:2)) }) test_that("slice works with zero-column data frames (#2490)", { expect_equal( data_frame(a = 1:3) %>% select(-a) %>% slice(1) %>% nrow, 1L ) }) test_that("slice works under gctorture2", { x <- tibble(y = 1:10) with_gctorture2(999, x2 <- slice(x, 1:10)) expect_identical(x, x2) }) test_that("slice correctly computes positive indices from negative indices (#3073)", { x <- tibble(y = 1:10) expect_identical(slice(x, -10:-30), tibble(y = 1:9)) }) dplyr/tests/testthat/test-funs-predicates.R0000644000176200001440000000121613135665123020607 0ustar liggesuserscontext("funs-predicates") test_that("all_exprs() creates intersection", { expect_identical(all_exprs(am == 1), quo(am == 1)) quo <- set_env(quo((!! quo(cyl == 2)) & (!! quo(am == 1))), base_env()) expect_identical(all_exprs(cyl == 2, am == 1), quo) }) test_that("any_exprs() creates union", { expect_identical(any_exprs(am == 1), quo(am == 1)) quo <- set_env(quo((!! quo(cyl == 2)) | (!! quo(am == 1))), base_env()) expect_identical(any_exprs(cyl == 2, am == 1), quo) }) test_that("all_exprs() without expression returns an error", { expect_error( all_exprs(), "At least one expression must be given", fixed = TRUE ) }) dplyr/tests/testthat/test-overscope.R0000644000176200001440000000046513153520575017526 0ustar liggesuserscontext("overscope") test_that(".data has strict matching semantics (#2591)", { expect_error( data_frame(a = 1) %>% mutate(c = .data$b), "Column `b`: not found in data" ) expect_error( data_frame(a = 1:3) %>% group_by(a) %>% mutate(c = .data$b), "Column `b`: not found in data" ) }) dplyr/tests/testthat/test-colwise-select.R0000644000176200001440000000453713153520575020447 0ustar liggesuserscontext("colwise select") df <- data_frame(x = 0L, y = 0.5, z = 1) test_that("can select/rename all variables", { expect_identical(select_all(df), df) expect_error( rename_all(df), "`.funs` must specify a renaming function", fixed = TRUE ) expect_identical(select_all(df, toupper), set_names(df, c("X", "Y", "Z"))) expect_identical(select_all(df, toupper), rename_all(df, toupper)) }) test_that("can select/rename with predicate", { expect_identical(select_if(df, is_integerish), select(df, x, z)) expect_error( rename_if(df, is_integerish), "`.funs` must specify a renaming function", fixed = TRUE ) expect_identical(select_if(df, is_integerish, toupper), set_names(df[c("x", "z")], c("X", "Z"))) expect_identical(rename_if(df, is_integerish, toupper), set_names(df, c("X", "y", "Z"))) }) test_that("can supply funs()", { expect_identical(select_if(df, funs(is_integerish(.)), funs(toupper(.))), set_names(df[c("x", "z")], c("X", "Z"))) expect_identical(rename_if(df, funs(is_integerish(.)), funs(toupper(.))), set_names(df, c("X", "y", "Z"))) }) test_that("fails when more than one renaming function is supplied", { expect_error( select_all(df, funs(tolower, toupper)), "`.funs` must contain one renaming function, not 2", fixed = TRUE ) expect_error( rename_all(df, funs(tolower, toupper)), "`.funs` must contain one renaming function, not 2", fixed = TRUE ) }) test_that("can select/rename with vars()", { expect_identical(select_at(df, vars(x:y)), df[-3]) expect_error( rename_at(df, vars(x:y)), "`.funs` must specify a renaming function", fixed = TRUE ) expect_identical(select_at(df, vars(x:y), toupper), set_names(df[-3], c("X", "Y"))) expect_identical(rename_at(df, vars(x:y), toupper), set_names(df, c("X", "Y", "z"))) }) test_that("select_if keeps grouping cols", { expect_silent(df <- iris %>% group_by(Species) %>% select_if(is.numeric)) expect_equal(df, tbl_df(iris[c(5, 1:4)])) }) test_that("select_if() handles non-syntactic colnames", { df <- data_frame(`x 1` = 1:3) expect_identical(select_if(df, is_integer)[[1]], 1:3) }) test_that("select_if() handles quoted predicates", { expected <- select_if(mtcars, is_integerish) expect_identical(select_if(mtcars, "is_integerish"), expected) expect_identical(select_if(mtcars, ~is_integerish(.x)), expected) }) dplyr/tests/testthat/test-hybrid.R0000644000176200001440000004741613153520575017011 0ustar liggesuserscontext("hybrid") test_that("hybrid evaluation environment is cleaned up (#2358)", { # Can't use pipe here, f and g should have top-level parent.env() df <- data_frame(x = 1) df <- mutate(df, f = list(function(){})) df <- mutate(df, g = list(quo(.))) df <- mutate(df, h = list(~.)) expect_environments_clean(df$f[[1]]) expect_environments_clean(df$g[[1]]) expect_environments_clean(df$h[[1]]) # Avoids "Empty test" message expect_true(TRUE) }) test_that("n() and n_distinct() work", { check_hybrid_result( n(), a = 1:5, expected = 5L, test_eval = FALSE ) check_not_hybrid_result( list(1:n()), a = 1:5, expected = list(1:5), test_eval = FALSE ) check_hybrid_result( n_distinct(a), a = 1:5, expected = 5L ) check_hybrid_result( n_distinct(a), a = rep(1L, 3), expected = 1L ) check_hybrid_result( n_distinct(a, b), a = rep(1L, 3), b = 1:3, expected = 3L ) check_hybrid_result( n_distinct(a, b), a = rep(1L, 3), b = c(1, 1, 2), expected = 2L ) check_hybrid_result( n_distinct(a, b), a = rep(1L, 3), b = c(1, 1, NA), expected = 2L ) check_hybrid_result( n_distinct(a, b, na.rm = TRUE), a = rep(1L, 3), b = c(1, 1, NA), expected = 1L ) check_hybrid_result( n_distinct(a = a, b = b, na.rm = TRUE), a = rep(1L, 3), b = c(1, 1, NA), expected = 1L ) expect_not_hybrid_error( n_distinct(), a = 1:5, error = "Need at least one column for `n_distinct[(][)]`" ) }) test_that("%in% works (#192)", { # compilation errors on Windows # https://ci.appveyor.com/project/hadley/dplyr/build/1.0.230 check_not_hybrid_result( list(a %in% (1:3 * 1i)), a = 2:4 * 1i, expected = list(c(TRUE, TRUE, FALSE)) ) check_not_hybrid_result( list(a %in% 1:3), a = as.numeric(2:4), expected = list(c(TRUE, TRUE, FALSE)) ) check_not_hybrid_result( list(a %in% as.numeric(1:3)), a = 2:4, expected = list(c(TRUE, TRUE, FALSE)) ) c <- 2:4 check_not_hybrid_result( list(c %in% 1:3), a = as.numeric(2:4), expected = list(c(TRUE, TRUE, FALSE)) ) }) test_that("min() and max() work", { check_hybrid_result( min(a), a = 1:5, expected = 1, test_eval = FALSE ) check_hybrid_result( max(a), a = 1:5, expected = 5, test_eval = FALSE ) check_hybrid_result( min(a), a = as.numeric(1:5), expected = 1 ) check_hybrid_result( max(a), a = as.numeric(1:5), expected = 5 ) check_hybrid_result( min(a), a = c(1:5, NA), expected = NA_real_, test_eval = FALSE ) check_hybrid_result( max(a), a = c(1:5, NA), expected = NA_real_, test_eval = FALSE ) check_hybrid_result( min(a), a = c(NA, 1:5), expected = NA_real_, test_eval = FALSE ) check_hybrid_result( max(a), a = c(NA, 1:5), expected = NA_real_, test_eval = FALSE ) c <- 1:3 check_not_hybrid_result( min(c), a = 1:5, expected = 1L ) check_not_hybrid_result( max(c), a = 1:5, expected = 3L ) check_not_hybrid_result( min(a), a = letters, expected = "a" ) check_not_hybrid_result( max(a), a = letters, expected = "z" ) check_not_hybrid_result( min(a), a = c(letters, NA), expected = NA_character_ ) check_not_hybrid_result( max(a), a = c(letters, NA), expected = NA_character_ ) check_not_hybrid_result( min(a, na.rm = TRUE), a = c(letters, NA), expected = "a" ) check_not_hybrid_result( max(a, na.rm = TRUE), a = c(letters, NA), expected = "z" ) check_hybrid_result( min(a, na.rm = TRUE), a = NA_real_, expected = Inf, test_eval = FALSE ) check_hybrid_result( max(a, na.rm = TRUE), a = NA_real_, expected = -Inf, test_eval = FALSE ) check_hybrid_result( min(a), a = numeric(), expected = Inf, test_eval = FALSE ) check_hybrid_result( max(a), a = numeric(), expected = -Inf, test_eval = FALSE ) check_hybrid_result( max(a, na.rm = TRUE), a = NA_integer_, expected = -Inf, test_eval = FALSE ) check_hybrid_result( min(a, na.rm = TRUE), a = NA_integer_, expected = Inf, test_eval = FALSE ) check_hybrid_result( max(a), a = integer(), expected = -Inf, test_eval = FALSE ) check_hybrid_result( min(a), a = integer(), expected = Inf, test_eval = FALSE ) }) test_that("first(), last(), and nth() work", { check_hybrid_result( first(a), a = 1:5, expected = 1L ) check_hybrid_result( last(a), a = as.numeric(1:5), expected = 5 ) check_hybrid_result( nth(a, 6, default = 3), a = as.numeric(1:5), expected = 3 ) check_hybrid_result( nth(a, 6, def = 3), a = as.numeric(1:5), expected = 3 ) check_hybrid_result( nth(a, 6.5), a = 1:5, expected = NA_integer_ ) check_not_hybrid_result( nth(a, b[[2]]), a = letters[1:5], b = 5:1, expected = "d" ) check_hybrid_result( nth(a, 3), a = as.numeric(1:5) * 1i, expected = 3i ) check_not_hybrid_result( nth(a, 2), a = as.list(1:5), expected = 2L ) check_not_hybrid_result( nth(a, order_by = 5:1, 2), a = 1:5, expected = 4L ) expect_not_hybrid_error( first(a, bogus = 3), a = 1:5, error = "unused argument" ) expect_not_hybrid_error( last(a, bogus = 3), a = 1:5, error = "unused argument" ) expect_not_hybrid_error( nth(a, 3, bogus = 3), a = 1:5, error = "unused argument" ) c <- 1:3 check_not_hybrid_result( first(c), a = 2:4, expected = 1L ) check_not_hybrid_result( last(c), a = 2:4, expected = 3L ) check_not_hybrid_result( nth(c, 2), a = 2:4, expected = 2L ) check_hybrid_result( first(a, order_by = b), a = 1:5, b = 5:1, expected = 5L ) default_value <- 6L check_not_hybrid_result( nth(a, 6, default = default_value), a = 1:5, expected = 6L ) expect_equal( tibble(a = c(1, 1, 2), b = letters[1:3]) %>% group_by(a) %>% summarize(b = b[1], b = first(b)) %>% ungroup, tibble(a = c(1, 2), b = c("a", "c")) ) }) test_that("lead() and lag() work", { check_hybrid_result( list(lead(a)), a = 1:5, expected = list(c(2:5, NA)) ) check_hybrid_result( list(lag(a)), a = 1:5, expected = list(c(NA, 1:4)) ) check_hybrid_result( list(lead(a)), a = as.numeric(1:5), expected = list(c(as.numeric(2:5), NA)) ) check_hybrid_result( list(lag(a)), a = as.numeric(1:5), expected = list(c(NA, as.numeric(1:4))) ) check_hybrid_result( list(lead(a)), a = letters[1:5], expected = list(c(letters[2:5], NA)) ) check_hybrid_result( list(lag(a)), a = letters[1:5], expected = list(c(NA, letters[1:4])) ) check_hybrid_result( list(lead(a)), a = c(TRUE, FALSE), expected = list(c(FALSE, NA)) ) check_hybrid_result( list(lag(a)), a = c(TRUE, FALSE), expected = list(c(NA, TRUE)) ) check_not_hybrid_result( list(lead(a, order_by = b)), a = 1:5, b = 5:1, expected = list(c(NA, 1:4)) ) check_not_hybrid_result( list(lag(a, order_by = b)), a = 1:5, b = 5:1, expected = list(c(2:5, NA)) ) check_hybrid_result( list(lead(a)), a = 1:5 * 1i, expected = list(c(2:5, NA) * 1i) ) check_hybrid_result( list(lag(a)), a = 1:5 * 1i, expected = list(c(NA, 1:4) * 1i) ) v <- 1:2 check_not_hybrid_result( list(lead(a, v[1])), a = 1:5, expected = list(c(2:5, NA)) ) check_not_hybrid_result( list(lag(a, v[1])), a = 1:5, expected = list(c(NA, 1:4)) ) }) test_that("mean(), var(), sd() and sum() work", { check_hybrid_result( mean(a), a = 1:5, expected = 3 ) check_hybrid_result( var(a), a = 1:3, expected = 1 ) check_hybrid_result( sd(a), a = 1:3, expected = 1 ) check_hybrid_result( sum(a), a = 1:5, expected = 15L ) check_hybrid_result( sum(a), a = as.numeric(1:5), expected = 15 ) check_hybrid_result( mean(a), a = c(1:5, NA), expected = NA_real_ ) check_hybrid_result( var(a), a = c(1:3, NA), expected = NA_real_ ) check_hybrid_result( sum(a), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( sum(a), a = c(as.numeric(1:5), NA), expected = NA_real_ ) check_not_hybrid_result( sd(a, TRUE), a = c(1:3, NA), expected = 1 ) check_not_hybrid_result( sd(a, na.rm = b[[1]]), a = c(1:3, NA), b = TRUE, expected = 1 ) check_hybrid_result( sd(a), a = c(1:3, NA), expected = is.na ) }) test_that("row_number(), ntile(), min_rank(), percent_rank(), dense_rank(), and cume_dist() work", { check_hybrid_result( list(row_number()), a = 1:5, expected = list(1:5), test_eval = FALSE ) check_hybrid_result( list(row_number(a)), a = 5:1, expected = list(5:1) ) check_hybrid_result( list(min_rank(a)), a = c(1, 3, 2, 3, 1), expected = list(c(1L, 4L, 3L, 4L, 1L)) ) check_hybrid_result( list(percent_rank(a)), a = c(1, 3, 2, 3, 1), expected = list((c(1L, 4L, 3L, 4L, 1L) - 1) / 4) ) check_hybrid_result( list(cume_dist(a)), a = c(1, 3, 2, 3), expected = list(c(0.25, 1.0, 0.5, 1.0)) ) check_hybrid_result( list(dense_rank(a)), a = c(1, 3, 2, 3, 1), expected = list(c(1L, 3L, 2L, 3L, 1L)) ) expect_not_hybrid_error( row_number(a, 1), a = 5:1, error = "unused argument" ) expect_not_hybrid_error( min_rank(a, 1), a = 5:1, error = "unused argument" ) expect_not_hybrid_error( percent_rank(a, 1), a = 5:1, error = "unused argument" ) expect_not_hybrid_error( cume_dist(a, 1), a = 5:1, error = "unused argument" ) expect_not_hybrid_error( dense_rank(a, 1), a = 5:1, error = "unused argument" ) expect_not_hybrid_error( ntile(a, 2, 1), a = 5:1, error = "unused argument" ) check_not_hybrid_result( row_number("a"), a = 5:1, expected = 1L ) check_not_hybrid_result( min_rank("a"), a = 5:1, expected = 1L ) check_not_hybrid_result( percent_rank("a"), a = 5:1, expected = is.nan ) check_not_hybrid_result( cume_dist("a"), a = 5:1, expected = 1 ) check_not_hybrid_result( dense_rank("a"), a = 5:1, expected = 1L ) check_not_hybrid_result( ntile("a", 2), a = 5:1, expected = 1L ) expect_equal( tibble(a = c(1, 1, 2), b = letters[1:3]) %>% group_by(a) %>% summarize(b = b[1], b = min_rank(desc(b))) %>% ungroup, tibble(a = c(1, 2), b = c(1L, 1L)) ) }) test_that("hybrid handlers don't nest", { check_not_hybrid_result( mean(lag(a)), a = 1:5, expected = NA_real_ ) check_not_hybrid_result( mean(row_number()), a = 1:5, expected = 3, test_eval = FALSE ) check_not_hybrid_result( list(lag(cume_dist(a))), a = 1:4, expected = list(c(NA, 0.25, 0.5, 0.75)) ) }) test_that("constant folding and argument matching in hybrid evaluator (#2299)", { skip("Currently failing") skip("Currently failing (external var)") c <- 1:3 check_not_hybrid_result( n_distinct(c), a = 1:5, expected = 3L, test_eval = FALSE ) check_not_hybrid_result( n_distinct(a, c), a = 1:3, expected = 3L, test_eval = FALSE ) check_not_hybrid_result( n_distinct(a, b, na.rm = 1), a = rep(1L, 3), b = c(1, 1, NA), expected = 1L ) skip("Currently failing (constfold)") check_hybrid_result( list(a %in% 1:3), a = 2:4, expected = list(c(TRUE, TRUE, FALSE)) ) check_hybrid_result( list(a %in% as.numeric(1:3)), a = as.numeric(2:4), expected = list(c(TRUE, TRUE, FALSE)) ) check_hybrid_result( list(a %in% letters[1:3]), a = letters[2:4], expected = list(c(TRUE, TRUE, FALSE)) ) check_hybrid_result( list(a %in% c(TRUE, FALSE)), a = c(TRUE, FALSE, NA), expected = list(c(TRUE, TRUE, FALSE)) ) skip("Currently failing") check_hybrid_result( list(a %in% NA_integer_), a = c(2:4, NA), expected = list(c(FALSE, FALSE, FALSE, TRUE)) ) check_hybrid_result( list(a %in% NA_real_), a = as.numeric(c(2:4, NA)), expected = list(c(FALSE, FALSE, FALSE, TRUE)) ) check_hybrid_result( list(a %in% NA_character_), a = c(letters[2:4], NA), expected = list(c(FALSE, FALSE, FALSE, TRUE)) ) check_hybrid_result( list(a %in% NA), a = c(TRUE, FALSE, NA), expected = list(c(FALSE, FALSE, TRUE)) ) skip("Currently failing (constfold)") check_hybrid_result( min(a, na.rm = (1 == 0)), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( max(a, na.rm = (1 == 0)), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( min(a, na.rm = (1 == 1)), a = c(1:5, NA), expected = 1L ) check_hybrid_result( max(a, na.rm = (1 == 1)), a = c(1:5, NA), expected = 5L ) check_hybrid_result( min(a, na.rm = pi != pi), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( max(a, na.rm = pi != pi), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( min(a, na.rm = pi == pi), a = c(1:5, NA), expected = 1L ) check_hybrid_result( max(a, na.rm = pi == pi), a = c(1:5, NA), expected = 5L ) skip("Currently failing") check_hybrid_result( min(a, na.rm = F), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( max(a, na.rm = F), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( min(a, na.rm = T), a = c(1:5, NA), expected = 1L ) check_hybrid_result( max(a, na.rm = T), a = c(1:5, NA), expected = 5L ) skip("Currently failing (constfold)") check_hybrid_result( nth(a, 1 + 2), a = letters[1:5], expected = "c" ) check_hybrid_result( nth(a, -4), a = 1:5, expected = 2L ) skip("Currently failing (constfold)") check_hybrid_result( list(lead(a, 1L + 2L)), a = 1:5, expected = list(c(4:5, NA, NA, NA)) ) check_hybrid_result( list(lag(a, 4L - 2L)), a = as.numeric(1:5), expected = list(c(NA, NA, as.numeric(1:3))) ) check_not_hybrid_result( list(lead(a, default = 2 + 4)), a = 1:5, expected = list(as.numeric(2:6)) ) check_not_hybrid_result( list(lag(a, default = 3L - 3L)), a = as.numeric(1:5), expected = list(as.numeric(0:4)) ) check_hybrid_result( list(lead(a, 1 + 2)), a = 1:5, expected = list(c(4:5, NA, NA, NA)) ) check_hybrid_result( list(lag(a, 4 - 2)), a = as.numeric(1:5), expected = list(c(NA, NA, as.numeric(1:3))) ) check_hybrid_result( list(lead(a, default = 2L + 4L)), a = 1:5, expected = list(2:6) ) check_hybrid_result( list(lag(a, default = 3L - 3L)), a = 1:5, expected = list(0:4) ) check_hybrid_result( list(lead(a, def = 2L + 4L)), a = 1:5, expected = list(2:6) ) check_hybrid_result( list(lag(a, def = 3L - 3L)), a = 1:5, expected = list(0:4) ) check_hybrid_result( list(lead(a, 2, 2L + 4L)), a = 1:5, expected = list(c(3:6, 6L)) ) check_hybrid_result( list(lag(a, 3, 3L - 3L)), a = 1:5, expected = list(c(0L, 0L, 0:2)) ) skip("Currently failing") check_hybrid_result( mean(a, na.rm = (1 == 0)), a = c(1:5, NA), expected = NA_real_ ) check_hybrid_result( var(a, na.rm = (1 == 0)), a = c(1:3, NA), expected = NA_real_ ) check_hybrid_result( sd(a, na.rm = (1 == 0)), a = c(1:3, NA), expected = NA_real_ ) check_hybrid_result( sum(a, na.rm = (1 == 0)), a = c(1:5, NA), expected = NA_integer_ ) check_hybrid_result( sum(a, na.rm = (1 == 0)), a = c(as.numeric(1:5), NA), expected = NA_real_ ) check_hybrid_result( mean(a, na.rm = (1 == 1)), a = c(1:5, NA), expected = 3 ) check_hybrid_result( var(a, na.rm = (1 == 1)), a = c(1:3, NA), expected = 1 ) check_hybrid_result( sd(a, na.rm = (1 == 1)), a = c(1:3, NA), expected = 1 ) check_hybrid_result( sum(a, na.rm = (1 == 1)), a = c(1:5, NA), expected = 15L ) check_hybrid_result( sum(a, na.rm = (1 == 1)), a = c(as.numeric(1:5), NA), expected = 15 ) check_hybrid_result( mean(na.rm = (1 == 1), a), a = c(1:5, NA), expected = 3 ) check_hybrid_result( var(na.rm = (1 == 1), a), a = c(1:3, NA), expected = 1 ) check_hybrid_result( sd(na.rm = (1 == 1), a), a = c(1:3, NA), expected = 1 ) check_hybrid_result( sum(na.rm = (1 == 1), a), a = c(1:5, NA), expected = 15L ) check_hybrid_result( sum(na.rm = (1 == 1), a), a = c(as.numeric(1:5), NA), expected = 15 ) skip("Currently failing (constfold)") check_hybrid_result( list(ntile(a, 1 + 2)), a = c(1, 3, 2, 3, 1), expected = list(c(1L, 2L, 2L, 3L, 1L)) ) check_hybrid_result( list(ntile(a, 1L + 2L)), a = c(1, 3, 2, 3, 1), expected = list(c(1L, 2L, 2L, 3L, 1L)) ) check_hybrid_result( list(ntile(n = 1 + 2, a)), a = c(1, 3, 2, 3, 1), expected = list(c(1L, 2L, 2L, 3L, 1L)) ) skip("Currently failing") df <- data_frame(x = c(NA, 1L, 2L, NA, 3L, 4L, NA)) expected <- rep(4L, nrow(df)) expect_equal(mutate(df, y = last(na.omit(x)))$y, expected) expect_equal(mutate(df, y = last(x[!is.na(x)]))$y, expected) expect_equal(mutate(df, y = x %>% na.omit() %>% last())$y, expected) expect_equal(mutate(df, y = x %>% na.omit %>% last)$y, expected) data_frame(x = c(1, 1)) %>% mutate(y = 1) %>% summarise(z = first(x, order_by = y)) }) test_that("simple handlers supports quosured symbols", { mean <- sum <- var <- sd <- bad_hybrid_handler expect_identical( pull(summarise(mtcars, mean(!! quo(cyl)))), base::mean(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, sum(!! quo(cyl)))), base::sum(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, sd(!! quo(cyl)))), stats::sd(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, var(!! quo(cyl)))), stats::var(mtcars$cyl) ) }) test_that("%in% handler supports quosured symbols", { `%in%` <- bad_hybrid_handler expect_identical( pull(mutate(mtcars, UQ(quo(cyl)) %in% 4)), base::`%in%`(mtcars$cyl, 4) ) }) test_that("min() and max() handlers supports quosured symbols", { min <- max <- bad_hybrid_handler expect_identical( pull(summarise(mtcars, min(!! quo(cyl)))), base::min(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, max(!! quo(cyl)))), base::max(mtcars$cyl) ) }) test_that("lead/lag handlers support quosured symbols", { lead <- lag <- bad_hybrid_handler expect_identical( pull(mutate(mtcars, lead(!! quo(cyl)))), dplyr::lead(mtcars$cyl) ) expect_identical( pull(mutate(mtcars, lag(!! quo(cyl)))), dplyr::lag(mtcars$cyl) ) }) test_that("window handlers supports quosured symbols", { ntile <- min_rank <- percent_rank <- dense_rank <- cume_dist <- bad_hybrid_handler expect_identical( pull(mutate(mtcars, ntile(!! quo(disp), 2))), dplyr::ntile(mtcars$disp, 2) ) expect_identical( pull(mutate(mtcars, min_rank(!! quo(cyl)))), dplyr::min_rank(mtcars$cyl) ) expect_identical( pull(mutate(mtcars, percent_rank(!! quo(cyl)))), dplyr::percent_rank(mtcars$cyl) ) expect_identical( pull(mutate(mtcars, dense_rank(!! quo(cyl)))), dplyr::dense_rank(mtcars$cyl) ) expect_identical( pull(mutate(mtcars, cume_dist(!! quo(cyl)))), dplyr::cume_dist(mtcars$cyl) ) }) test_that("n_distinct() handler supports quosured symbols", { n_distinct <- bad_hybrid_handler expect_identical( pull(summarise(mtcars, n_distinct(!! quo(cyl)))), dplyr::n_distinct(mtcars$cyl) ) }) test_that("nth handlers support quosured symbols", { first <- last <- nth <- bad_hybrid_handler expect_identical( pull(summarise(mtcars, first(!! quo(cyl)))), dplyr::first(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, last(!! quo(cyl)))), dplyr::last(mtcars$cyl) ) expect_identical( pull(summarise(mtcars, nth(!! quo(cyl), 2))), dplyr::nth(mtcars$cyl, 2) ) }) test_that("top_n() is hybridised (#2822)", { min_rank <- bad_hybrid_handler expect_error(top_n(mtcars, 1, cyl), NA) }) dplyr/tests/testthat/test-union-all.R0000644000176200001440000000044313135665123017412 0ustar liggesuserscontext("union_all") test_that("union all on vectors concatenates", { expect_equal(union_all(1:3, 4:6), 1:6) }) test_that("union all on data frames calls bind rows", { df1 <- data_frame(x = 1:2) df2 <- data_frame(y = 1:2) expect_equal(union_all(df1, df2), bind_rows(df1, df2)) }) dplyr/tests/testthat/helper-hybrid.R0000644000176200001440000000347713153520575017310 0ustar liggesusersexpect_predicate <- function(actual, expected) { if (is.function(expected)) { expect_true(expected(actual)) } else { expect_identical(actual, expected) } } check_hybrid_result <- function(expr, ..., expected, test_eval = TRUE) { check_hybrid_result_(rlang::enquo(expr), ..., expected = expected, test_eval = test_eval) } check_hybrid_result_ <- function(expr, ..., expected, test_eval) { expect_error( expect_predicate(with_hybrid_(expr, ...), expected), NA) if (test_eval) { expect_predicate(eval_dots_(expr, ...), expected) } } check_not_hybrid_result <- function(expr, ..., expected, test_eval = TRUE) { check_not_hybrid_result_(rlang::enquo(expr), ..., expected = expected, test_eval = test_eval) } check_not_hybrid_result_ <- function(expr, ..., expected, test_eval) { expect_error( expect_predicate(without_hybrid_(expr, ...), expected), NA) if (test_eval) { expect_predicate(eval_dots_(expr, ...), expected) } } expect_hybrid_error <- function(expr, ..., error) { expect_hybrid_error_(rlang::enquo(expr), ..., error = error) } expect_hybrid_error_ <- function(expr, ..., error) { expect_error( with_hybrid_(expr, ...), error ) } expect_not_hybrid_error <- function(expr, ..., error) { expect_not_hybrid_error_(rlang::enquo(expr), ..., error = error) } expect_not_hybrid_error_ <- function(expr, ..., error) { expect_error( without_hybrid_(expr, ...), error ) } expect_environments_clean <- function(x, stop_env = parent.frame()) { if (!is.environment(x)) x <- environment(x) if (identical(x, stop_env)) return() obj_names <- ls(x, all.names = TRUE) objs <- mget(obj_names, x) lapply(objs, expect_is, "environment") expect_environments_clean(parent.env(x), stop_env = stop_env) } bad_hybrid_handler <- function(...) stop("Expected hybrid evaluation") dplyr/tests/testthat/test-rbind.R0000644000176200001440000002233613153520575016620 0ustar liggesuserscontext("rbind") rbind_list_warn <- function(...) { expect_warning(ret <- rbind_list(...), "bind_rows") ret } rbind_all_warn <- function(...) { expect_warning(ret <- rbind_list(...), "bind_rows") ret } df_var <- data.frame( l = c(T, F, F), i = c(1, 1, 2), d = Sys.Date() + c(1, 1, 2), f = factor(letters[c(1, 1, 2)]), n = c(1, 1, 2) + 0.5, t = Sys.time() + c(1, 1, 2), c = letters[c(1, 1, 2)], stringsAsFactors = FALSE ) test_that("rbind_list works on key types", { exp <- tbl_df( rbind( df_var, df_var, df_var ) ) expect_equal( rbind_list_warn(df_var, df_var, df_var), exp ) }) test_that("rbind_list reorders columns", { columns <- seq_len(ncol(df_var)) exp <- tbl_df( rbind( df_var, df_var, df_var ) ) expect_equal( rbind_list_warn( df_var, df_var[, sample(columns)], df_var[, sample(columns)] ), exp ) }) test_that("rbind_list promotes integer to numeric", { df <- data.frame( a = 1:5, b = 1:5 ) df2 <- df df2$a <- as.numeric(df$a) res <- rbind_list_warn(df, df2) expect_equal( typeof(res$a), "double" ) expect_equal( typeof(res$b), "integer" ) }) test_that("rbind_list promotes factor to character", { df <- data.frame( a = letters[1:5], b = 1:5, stringsAsFactors=TRUE ) df2 <- df df2$a <- as.character(df$a) res <- rbind_list_warn(df, df2) expect_equal( typeof(res$a), "character" ) }) test_that("rbind_list doesn't promote factor to numeric", { df1 <- data.frame( a = 1:5, b = 1:5 ) df2 <- data.frame( a = 1:5, b = factor(letters[1:5]) ) expect_error(rbind_list_warn(df1, df2)) }) test_that("rbind_list doesn't coerce integer to factor", { df1 <- data.frame( a = 1:10, b = 1:10 ) df2 <- data.frame( a = 1:5, b = factor(letters[1:5]) ) expect_error(rbind_list_warn(df1, df2)) }) test_that( "rbind_list coerces factor to character when levels don't match", { df1 <- data.frame( a = 1:3, b = factor(c("a", "b", "c"))) df2 <- data.frame( a = 1:3, b = factor(c("a", "b", "c"), levels = c("b", "c", "a", "d"))) expect_warning(res <- rbind_list( df1, df2 ), "Unequal factor levels: coercing to character") expect_equal( res$b, c("a","b","c", "a","b","c" ) ) }) test_that( "rbind handles NULL",{ x <- cbind(a=1:10,b=1:10) y <- data.frame(x) res <- rbind_all_warn(list(y, y, NULL, y)) expect_equal(nrow(res), 30L) }) test_that( "rbind handles NA in factors #279", { xx <- as.data.frame(list(a=as.numeric(NA), b="c", c="d")) zz <- as.data.frame(list(a=1, b=as.character(NA), c="b")) expect_warning( res <- rbind_list( xx, zz ) ) expect_equal(res$a, c(NA,1.0)) expect_equal(res$b, c("c", NA)) expect_equal(res$c, c("d","b")) }) test_that( "rbind_all only accepts data frames #288",{ ll <- list(c(1,2,3,4, 5), c(6, 7, 8, 9, 10)) expect_error(rbind_all_warn(ll)) }) test_that( "rbind propagates timezone for POSIXct #298", { dates1 <- data.frame(ID=c("a", "b", "c"), dates=structure(c(-247320000, -246196800, -245073600), tzone = "GMT", class = c("POSIXct", "POSIXt")), stringsAsFactors=FALSE) dates2 <- data.frame(ID=c("d", "e", "f"), dates=structure(c(-243864000, -242654400, -241444800), tzone = "GMT", class = c("POSIXct", "POSIXt")), stringsAsFactors=FALSE) alldates <- rbind_list_warn(dates1, dates2) expect_equal( attr( alldates$dates, "tzone" ), "GMT" ) }) test_that( "Collecter_Impl can collect INTSXP. #321", { res <- rbind_list_warn(data.frame(x = 0.5), data.frame(x = 1:3)) expect_equal( res$x, c(0.5, 1:3) ) }) test_that( "Collecter_Impl can collect LGLSXP. #321", { res <- rbind_list_warn(data.frame(x = 1:3), data.frame(x = NA)) expect_equal( res$x, c(1:3, NA) ) }) test_that("rbind_all handles list columns (#463)", { dfl <- data.frame(x = I(list(1:2, 1:3, 1:4))) res <- rbind_all_warn(list(dfl, dfl)) expect_equal(rep(dfl$x,2L), res$x) }) test_that("rbind_all creates tbl_df object", { res <- rbind_list_warn(tbl_df(mtcars)) expect_is( res, "tbl_df" ) }) test_that("string vectors are filled with NA not blanks before collection (#595)", { one <- mtcars[1:10, -10] two <- mtcars[11:32, ] two$char_col <- letters[1:22] res <- rbind_list_warn(one, two) expect_true( all(is.na(res$char_col[1:10])) ) }) test_that("rbind handles data frames with no rows (#597)",{ empty <- data.frame(result = numeric()) expect_equal(rbind_list_warn(empty), tbl_df(empty)) expect_equal(rbind_list_warn(empty, empty), tbl_df(empty)) expect_equal(rbind_list_warn(empty, empty, empty), tbl_df(empty)) }) test_that("rbind handles all NA columns (#493)", { mydata <- list( data.frame(x=c("foo", "bar")), data.frame(x=NA) ) res <- rbind_all_warn(mydata) expect_true( is.na(res$x[3]) ) expect_is( res$x, "factor" ) mydata <- list( data.frame(x=NA), data.frame(x=c("foo", "bar")) ) res <- rbind_all_warn(mydata) expect_true( is.na(res$x[1]) ) expect_is( res$x, "factor" ) }) test_that( "bind_rows handles complex. #933", { df1 <- data.frame(r = c(1+1i, 2-1i)) df2 <- data.frame(r = c(1-1i, 2+1i)) df3 <- bind_rows(df1,df2) expect_equal( nrow(df3), 4L) expect_equal( df3$r, c(df1$r, df2$r) ) }) test_that("bind_rows is careful about column names encoding #1265", { one <- data.frame(foo=1:3, bar=1:3); names(one) <- c("f\u00fc", "bar") two <- data.frame(foo=1:3, bar=1:3); names(two) <- c("f\u00fc", "bar") Encoding(names(one)[1]) <- "UTF-8" expect_equal( names(one), names(two)) res <- bind_rows(one,two) expect_equal(ncol(res), 2L) }) test_that("bind_rows handles POSIXct (#1125)", { df1 <- data.frame(date = as.POSIXct(NA)) df2 <- data.frame(date = as.POSIXct("2015-05-05")) res <- bind_rows(df1,df2) expect_equal(nrow(res),2L) expect_true(is.na(res$date[1])) }) test_that("bind_rows respects ordered factors (#1112)", { l <- c("a", "b", "c", "d") id <- factor(c("a", "c", "d"), levels = l, ordered = TRUE) df <- data.frame(id = rep(id, 2), val = rnorm(6)) res <- bind_rows(df, df) expect_is( res$id, "ordered") expect_equal( levels(df$id), levels(res$id) ) res <- group_by(df, id) %>% do(na.omit(.)) expect_is( res$id, "ordered") expect_equal( levels(df$id), levels(res$id) ) }) test_that("bind_rows can handle lists (#1104)", { my_list <- list(list(x = 1, y = 'a'), list(x = 2, y = 'b')) res <- bind_rows(my_list) expect_equal(nrow(res), 2L) expect_is(res$x, "numeric") expect_is(res$y, "character") res <- bind_rows(list(x = 1, y = 'a'), list(x = 2, y = 'b')) expect_equal(nrow(res), 2L) expect_is(res$x, "numeric") expect_is(res$y, "character") }) test_that("rbind_list keeps ordered factors (#948)", { y <- rbind_list_warn( data.frame(x = factor(c(1,2,3), ordered = TRUE)), data.frame(x = factor(c(1,2,3), ordered = TRUE)) ) expect_is( y$x, "ordered" ) expect_equal( levels(y$x), as.character(1:3) ) }) test_that("bind handles POSIXct of different tz ", { date1 <- structure(-1735660800, tzone = "America/Chicago", class = c("POSIXct", "POSIXt")) date2 <- structure(-1735660800, tzone = "UTC", class = c("POSIXct", "POSIXt")) date3 <- structure(-1735660800, class = c("POSIXct", "POSIXt")) df1 <- data.frame( date = date1 ) df2 <- data.frame( date = date2 ) df3 <- data.frame( date = date3 ) res <- bind_rows(df1, df2) expect_equal( attr(res$date, "tzone"), "UTC" ) res <- bind_rows(df1, df3) expect_equal( attr(res$date, "tzone"), "America/Chicago" ) res <- bind_rows(df2, df3) expect_equal( attr(res$date, "tzone"), "UTC" ) res <- bind_rows(df3, df3) expect_equal( attr(res$date, "tzone"), NULL ) res <- bind_rows(df1, df2, df3) expect_equal( attr(res$date, "tzone"), "UTC" ) }) test_that("bind_rows() creates a column of identifiers (#1337)", { data1 <- mtcars[c(2, 3), ] data2 <- mtcars[1, ] out <- bind_rows(data1, data2, .id = "col") out_list <- bind_rows(list(data1, data2), .id = "col") expect_equal(names(out)[1], "col") expect_equal(out$col, c("1", "1", "2")) expect_equal(out_list$col, c("1", "1", "2")) out_labelled <- bind_rows(one = data1, two = data2, .id = "col") out_list_labelled <- bind_rows(list(one = data1, two = data2), .id = "col") expect_equal(out_labelled$col, c("one", "one", "two")) expect_equal(out_list_labelled$col, c("one", "one", "two")) }) test_that("empty data frame are handled (#1346)", { res <- data_frame() %>% bind_rows(data_frame(x = "a")) expect_equal( nrow(res), 1L) }) test_that("bind_rows handles POSIXct stored as integer (#1402)", { now <- Sys.time() df1 <- data.frame(time = now) expect_equal( class(bind_rows(df1)$time), c("POSIXct", "POSIXt") ) df2 <- data.frame(time = seq(now, length.out = 1, by = 1)) expect_equal( class(bind_rows(df2)$time), c("POSIXct", "POSIXt") ) res <- bind_rows( df1, df2 ) expect_equal( class(res$time), c("POSIXct", "POSIXt") ) expect_true( all(res$time == c(df1$time, df2$time) ) ) }) test_that("bind_rows warns on binding factor and character (#1485)", { df1 <- head(iris, 1) df2 <- tail(iris, 1) %>% mutate(Species = as.character(Species)) expect_warning( bind_rows(df1, df2), "binding factor and character vector, coercing into character vector" ) }) dplyr/tests/testthat/test-mutate.r0000644000176200001440000005443713153520575017070 0ustar liggesuserscontext("Mutate") test_that("repeated outputs applied progressively (data frame)", { df <- data.frame(x = 1) out <- mutate(df, z = x + 1, z = z + 1) expect_equal(nrow(out), 1) expect_equal(ncol(out), 2) expect_equal(out$z, 3) }) test_that("repeated outputs applied progressively (grouped_df)", { df <- data.frame(x = c(1, 1), y = 1:2) ds <- group_by(df, y) out <- mutate(ds, z = x + 1, z = z + 1) expect_equal(nrow(out), 2) expect_equal(ncol(out), 3) expect_equal(out$z, c(3L, 3L)) }) test_that("two mutates equivalent to one", { df <- tibble(x = 1:10, y = 6:15) df1 <- df %>% mutate(x2 = x * 2, y4 = y * 4) df2 <- df %>% mutate(x2 = x * 2) %>% mutate(y4 = y * 4) expect_equal(df1, df2) }) test_that("mutate can refer to variables that were just created (#140)", { res <- mutate(tbl_df(mtcars), cyl1 = cyl + 1, cyl2 = cyl1 + 1) expect_equal(res$cyl2, mtcars$cyl + 2) gmtcars <- group_by(tbl_df(mtcars), am) res <- mutate(gmtcars, cyl1 = cyl + 1, cyl2 = cyl1 + 1) res_direct <- mutate(gmtcars, cyl2 = cyl + 2) expect_equal(res$cyl2, res_direct$cyl2) }) test_that("mutate handles logical result (#141)", { x <- data.frame(x = 1:10, g = rep(c(1, 2), each = 5)) res <- tbl_df(x) %>% group_by(g) %>% mutate(r = x > mean(x)) expect_equal(res$r, rep(c(FALSE, FALSE, FALSE, TRUE, TRUE), 2)) }) test_that("mutate can rename variables (#137)", { res <- mutate(tbl_df(mtcars), cyl2 = cyl) expect_equal(res$cyl2, mtcars$cyl) res <- mutate(group_by(tbl_df(mtcars), am), cyl2 = cyl) expect_equal(res$cyl2, res$cyl) }) test_that("mutate refuses to modify grouping vars (#143)", { expect_error( mutate(group_by(tbl_df(mtcars), am), am = am + 2), "Column `am` can't be modified because it's a grouping variable", fixed = TRUE ) }) test_that("mutate handles constants (#152)", { res <- mutate(tbl_df(mtcars), zz = 1) expect_equal(res$zz, rep(1, nrow(mtcars))) }) test_that("mutate fails with wrong result size (#152)", { df <- group_by(data.frame(x = c(2, 2, 3, 3)), x) expect_equal(mutate(df, y = 1:2)$y, rep(1:2, 2)) expect_error( mutate(mtcars, zz = 1:2), "Column `zz` must be length 32 (the number of rows) or one, not 2", fixed = TRUE ) df <- group_by(data.frame(x = c(2, 2, 3, 3, 3)), x) expect_error( mutate(df, y = 1:2), "Column `y` must be length 3 (the group size) or one, not 2", fixed = TRUE ) }) test_that("mutate refuses to use symbols not from the data", { y <- 1:6 df <- group_by(data.frame(x = c(1, 2, 2, 3, 3, 3)), x) expect_error( mutate(df, z = y), "Column `z` must be length 1 (the group size), not 6", fixed = TRUE ) }) test_that("mutate recycles results of length 1", { df <- data.frame(x = c(2, 2, 3, 3)) expect_equal(mutate(tbl_df(df), z = length(x))$z, rep(4, 4)) expect_equal(mutate(group_by(df, x), z = length(x))$z, rep(2, 4)) int <- 1L str <- "foo" num <- 1 bool <- TRUE list <- list(NULL) res <- mutate(group_by(df, x), int = int, str = str, num = num, bool = bool, list = list) expect_equal(res$int , rep(int , 4)) expect_equal(res$str , rep(str , 4)) expect_equal(res$num , rep(num , 4)) expect_equal(res$bool, rep(bool, 4)) expect_equal(res$list, rep(list, 4)) }) test_that("mutate handles out of data variables", { today <- Sys.Date() now <- Sys.time() df <- data.frame(x = c(2, 2, 3, 3)) gdf <- group_by(df, x) int <- c(1L, 2L) str <- c("foo", "bar") num <- c(1, 2) bool <- c(TRUE, FALSE) dat <- rep(today, 2) tim <- rep(now, 2) res <- mutate( gdf, int = int, str = str, num = num, bool = bool, dat = dat, tim = tim ) expect_equal(res$int , rep(int , 2)) expect_equal(res$str , rep(str , 2)) expect_equal(res$num , rep(num , 2)) expect_equal(res$bool, rep(bool, 2)) expect_equal(res$dat , rep(dat , 2)) expect_equal(res$tim , rep(tim , 2)) int <- 1:6 expect_error( mutate(gdf, int = int), "Column `int` must be length 2 (the group size) or one, not 6", fixed = TRUE ) expect_error( mutate(tbl_df(df), int = int), "Column `int` must be length 4 (the number of rows) or one, not 6", fixed = TRUE ) int <- 1:4 str <- rep(c("foo", "bar"), 2) num <- c(1, 2, 3, 4) bool <- c(TRUE, FALSE, FALSE, TRUE) dat <- rep(today, 4) tim <- rep(now, 4) res <- mutate( tbl_df(df), int = int, str = str, num = num, bool = bool, tim = tim, dat = dat ) expect_equal(res$int , int) expect_equal(res$str , str) expect_equal(res$num , num) expect_equal(res$bool, bool) expect_equal(res$dat , dat) expect_equal(res$tim , tim) }) test_that("mutate handles passing ...", { df <- data.frame(x = 1:4) f <- function(...) { x1 <- 1 f1 <- function(x) x mutate(df, ..., x1 = f1(x1)) } g <- function(...) { x2 <- 2 f(x2 = x2, ...) } h <- function(before = "before", ..., after = "after") { g(before = before, ..., after = after) } res <- h(x3 = 3) expect_equal(res$x1, rep(1, 4)) expect_equal(res$x2, rep(2, 4)) expect_equal(res$before, rep("before", 4)) expect_equal(res$after, rep("after", 4)) df <- tbl_df(df) res <- h(x3 = 3) expect_equal(res$x1, rep(1, 4)) expect_equal(res$x2, rep(2, 4)) expect_equal(res$before, rep("before", 4)) expect_equal(res$after, rep("after", 4)) df <- group_by(df, x) res <- h(x3 = 3) expect_equal(res$x1, rep(1, 4)) expect_equal(res$x2, rep(2, 4)) expect_equal(res$before, rep("before", 4)) expect_equal(res$after, rep("after", 4)) }) test_that("mutate fails on unsupported column type", { df <- data.frame(created = c("2014/1/1", "2014/1/2", "2014/1/2")) expect_error( mutate(df, date = strptime(created, "%Y/%m/%d")), "Column `date` is of unsupported class POSIXlt", fixed = TRUE ) df <- data.frame( created = c("2014/1/1", "2014/1/2", "2014/1/2"), g = c(1, 1, 2) ) expect_error( mutate(group_by(df, g), date = strptime(created, "%Y/%m/%d")), "Column `date` is of unsupported class POSIXlt", fixed = TRUE ) }) test_that("mutate modifies same column repeatedly (#243)", { df <- data.frame(x = 1) expect_equal(mutate(df, x = x + 1, x = x + 1)$x, 3) }) test_that("mutate errors when results are not compatible accross groups (#299)", { d <- data.frame(x = rep(1:5, each = 3)) expect_error( mutate(group_by(d, x), val = ifelse(x < 3, "foo", 2)), "Column `val` can't be converted from character to numeric", fixed = TRUE ) }) test_that("assignments don't overwrite variables (#315)", { expect_equal( mutate(mtcars, cyl2 = { mpg <- cyl ^ 2; -mpg }), mutate(mtcars, cyl2 = -cyl ^ 2) ) }) test_that("hybrid evaluator uses correct environment (#403)", { func1 <- function() { func2 <- function(x) floor(x) mutate(mtcars, xx = func2(mpg / sum(mpg))) } res <- func1() expect_equal(res$xx, rep(0, nrow(res))) }) test_that("mutate remove variables with = NULL syntax (#462)", { data <- mtcars %>% mutate(cyl = NULL) expect_false("cyl" %in% names(data)) data <- mtcars %>% group_by(disp) %>% mutate(cyl = NULL) expect_false("cyl" %in% names(data)) }) test_that("mutate strips names, but only if grouped (#1689, #2675)", { data <- data_frame(a = 1:3) %>% mutate(b = setNames(nm = a)) expect_equal(names(data$b), as.character(1:3)) data <- data_frame(a = 1:3) %>% rowwise %>% mutate(b = setNames(nm = a)) expect_null(names(data$b)) data <- data_frame(a = c(1, 1, 2)) %>% group_by(a) %>% mutate(b = setNames(nm = a)) expect_null(names(data$b)) }) test_that("mutate does not strip names of list-columns (#2675)", { vec <- list(a = 1, b = 2) data <- data_frame(x = vec) data <- mutate(data, x) expect_identical(names(vec), c("a", "b")) expect_identical(names(data$x), c("a", "b")) }) test_that("mutate gives a nice error message if an expression evaluates to NULL (#2187)", { df <- data_frame(a = 1:3) gf <- group_by(df, a) rf <- rowwise(df) expect_error( mutate(df, b = identity(NULL)), "Column `b` is of unsupported type NULL", fixed = TRUE ) expect_error( mutate(gf, b = identity(NULL)), "Column `b` is of unsupported type NULL", fixed = TRUE ) expect_error( mutate(rf, b = identity(NULL)), "Column `b` is of unsupported type NULL", fixed = TRUE ) }) test_that("mutate(rowwise_df) makes a rowwise_df (#463)", { one_mod <- data.frame(grp = "a", x = runif(5, 0, 1)) %>% tbl_df %>% mutate(y = rnorm(x, x * 2, 1)) %>% group_by(grp) %>% do(mod = lm(y~x, data = .)) out <- one_mod %>% mutate(rsq = summary(mod)$r.squared) %>% mutate(aic = AIC(mod)) expect_is(out, "rowwise_df") expect_equal(nrow(out), 1L) expect_is(out$mod, "list") expect_is(out$mod[[1L]], "lm") }) test_that("mutate allows list columns (#555)", { df <- data.frame(x = c("a;b", "c;d;e"), stringsAsFactors = FALSE) res <- mutate(df, pieces = strsplit(x, ";")) expect_equal(res$pieces, list(c("a", "b"), c("c", "d", "e"))) }) test_that("hybrid evaluation goes deep enough (#554)", { res1 <- iris %>% mutate(test = 1 == 2 | row_number() < 10) res2 <- iris %>% mutate(test = row_number() < 10 | 1 == 2) expect_equal(res1, res2) }) test_that("hybrid does not segfault when given non existing variable (#569)", { # error message from rlang expect_error(mtcars %>% summarise(first(mp))) }) test_that("namespace extraction works in hybrid (#412)", { df <- data.frame(x = 1:2) expect_equal( mutate(df, y = base::mean(x)), mutate(df, y = mean(x)) ) expect_equal( mutate(df, y = stats::IQR(x)), mutate(df, y = IQR(x)) ) }) test_that("hybrid not get in the way of order_by (#169)", { df <- data_frame(x = 10:1, y = 1:10) res <- mutate(df, z = order_by(x, cumsum(y))) expect_equal(res$z, rev(cumsum(10:1))) }) test_that("mutate supports difftime objects (#390)", { df <- data_frame( grp = c(1, 1, 2, 2), val = c(1, 3, 4, 6), date1 = c(rep(Sys.Date() - 10, 2), rep(Sys.Date() - 20, 2)), date2 = Sys.Date() + c(1, 2, 1, 2), diffdate = difftime(date2, date1, unit = "days") ) res <- df %>% group_by(grp) %>% mutate(mean_val = mean(val), mean_diffdate = mean(diffdate)) expect_is(res$mean_diffdate, "difftime") expect_equal(as.numeric(res$mean_diffdate), c(11.5, 11.5, 21.5, 21.5)) res <- df %>% group_by(grp) %>% summarise(dt = mean(diffdate)) expect_is(res$dt, "difftime") expect_equal(as.numeric(res$dt), c(11.5, 21.5)) }) test_that("mutate works on zero-row grouped data frame (#596)", { dat <- data.frame(a = numeric(0), b = character(0)) res <- dat %>% group_by(b) %>% mutate(a2 = a * 2) expect_is(res$a2, "numeric") expect_is(res, "grouped_df") expect_equal(res$a2, numeric(0)) expect_equal(attr(res, "indices"), list()) expect_equal(attr(res, "vars"), "b") expect_equal(attr(res, "group_sizes"), integer(0)) expect_equal(attr(res, "biggest_group_size"), 0L) }) test_that("Non-ascii column names in version 0.3 are not duplicated (#636)", { # Currently failing (#2967) skip_on_os("windows") df <- data_frame(a = "1", b = "2") names(df) <- c("a", enc2native("\u4e2d")) res <- df %>% mutate_all(funs(as.numeric)) %>% names expect_equal(res, names(df)) }) test_that("nested hybrid functions do the right thing (#637)", { res <- mtcars %>% mutate(mean(1)) expect_true(all(res[["mean(1)"]] == 1L)) }) test_that("mutate handles using and gathering complex data (#436)", { d <- data_frame(x = 1:10, y = 1:10 + 2i) res <- mutate(d, real = Re(y), imag = Im(y), z = 2 * y, constant = 2 + 2i) expect_equal(names(res), c("x", "y", "real", "imag", "z", "constant")) expect_equal(res$real, Re(d$y)) expect_equal(res$imag, Im(d$y)) expect_equal(res$z, d$y * 2) expect_true(all(res$constant == 2 + 2i)) }) test_that("mutate forbids POSIXlt results (#670)", { expect_error( data.frame(time = "2014/01/01 10:10:10") %>% mutate(time = as.POSIXlt(time)), "Column `time` is of unsupported class POSIXlt", fixed = TRUE ) expect_error( data.frame(time = "2014/01/01 10:10:10", a = 2) %>% group_by(a) %>% mutate(time = as.POSIXlt(time)), "Column `time` is of unsupported class POSIXlt", fixed = TRUE ) }) test_that("constant factor can be handled by mutate (#715)", { d <- data_frame(x = 1:2) %>% mutate(y = factor("A")) expect_true(is.factor(d$y)) expect_equal(d$y, factor(c("A", "A"))) }) test_that("row_number handles empty data frames (#762)", { df <- data.frame(a = numeric(0)) res <- df %>% mutate( row_number_0 = row_number(), row_number_a = row_number(a), ntile = ntile(a, 2), min_rank = min_rank(a), percent_rank = percent_rank(a), dense_rank = dense_rank(a), cume_dist = cume_dist(a) ) expect_equal( names(res), c("a", "row_number_0", "row_number_a", "ntile", "min_rank", "percent_rank", "dense_rank", "cume_dist") ) expect_equal(nrow(res), 0L) }) test_that("no utf8 invasion (#722)", { skip_on_os("windows") source("utf-8.R", local = TRUE, encoding = "UTF-8") }) test_that("mutate works on empty data frames (#1142)", { df <- data.frame() res <- df %>% mutate expect_equal(nrow(res), 0L) expect_equal(length(res), 0L) res <- df %>% mutate(x = numeric()) expect_equal(names(res), "x") expect_equal(nrow(res), 0L) expect_equal(length(res), 1L) }) test_that("mutate handles 0 rows rowwise (#1300)", { a <- data_frame(x = 1) b <- data_frame(y = character()) g <- function(y) { 1 } f <- function() { b %>% rowwise() %>% mutate(z = g(y)) } res <- f() expect_equal(nrow(res), 0L) expect_error( a %>% mutate(b = f()), "Column `b` must be length 1 (the number of rows), not 2", fixed = TRUE ) expect_error( a %>% rowwise() %>% mutate(b = f()), "Column `b` must be length 1 (the group size), not 2", fixed = TRUE ) }) test_that("regression test for #637", { res <- mtcars %>% mutate(xx = mean(1)) expect_true(all(res$xx == 1)) res <- mtcars %>% mutate(xx = sum(mean(mpg))) expect_true(all(res$xx == sum(mean(mtcars$mpg)))) }) test_that("mutate.rowwise handles factors (#886)", { res <- data.frame(processed = c("foo", "bar")) %>% rowwise() %>% mutate(processed_trafo = paste("test", processed)) expect_equal(res$processed_trafo, c("test foo", "test bar")) }) test_that("setting first column to NULL with mutate works (#1329)", { df <- data.frame(x = 1:10, y = 1:10) expect_equal(mutate(df, x = NULL), select(df, -x)) expect_equal(mutate(df, y = NULL), select(df, -y)) gdf <- group_by(df, y) expect_equal(select(gdf, -x), mutate(gdf, x = NULL)) }) test_that("mutate handles the all NA case (#958)", { x <- rep(c("Bob", "Jane"), each = 36) y <- rep(rep(c("A", "B", "C"), each = 12), 2) day <- rep(rep(1:12, 3), 2) values <- rep(rep(c(10, 11, 30, 12, 13, 14, 15, 16, 17, 18, 19, 20), 3), 2) df <- data.frame(x = x, y = y, day = day, values = values) df$values[1:12] <- NA res <- df %>% group_by(x, y) %>% mutate(max.sum = day[which.max(values)[1]]) %>% mutate(adjusted_values = ifelse(day < max.sum, 30, values)) expect_true(all(is.na(res$adjusted_values[1:12]))) }) test_that("rowwise mutate gives expected results (#1381)", { f <- function(x) ifelse(x < 2, NA_real_, x) res <- data_frame(x = 1:3) %>% rowwise() %>% mutate(y = f(x)) expect_equal(res$y, c(NA, 2, 3)) }) test_that("mutate handles factors (#1414)", { d <- data_frame( g = c(1, 1, 1, 2, 2, 3, 3), f = c("a", "b", "a", "a", "a", "b", "b") ) res <- d %>% group_by(g) %>% mutate(f2 = factor(f, levels = c("a", "b"))) expect_equal(as.character(res$f2), res$f) }) test_that("mutate handles results from one group with all NA values (#1463) ", { df <- data_frame(x = c(1, 2), y = c(1, NA)) res <- df %>% group_by(x) %>% mutate(z = ifelse(y > 1, 1, 2)) expect_true(is.na(res$z[2])) expect_is(res$z, "numeric") }) test_that("rowwise mutate handles the NA special case (#1448)", { res <- data.frame(k = c(-1, 1, 1)) %>% rowwise() %>% mutate(l = ifelse(k > 0, 1, NA)) expect_is(res$l, "numeric") expect_true(is.na(res$l[1])) expect_true(!anyNA(res$l[-1])) res <- data.frame(k = rnorm(10)) %>% rowwise() %>% mutate(l = ifelse(k > 0, 1L, NA_integer_)) expect_true(all(is.na(res$l[res$k <= 0]))) expect_true(!any(is.na(res$l[res$k > 0]))) }) test_that("mutate disambiguates NA and NaN (#1448)", { Pass <- data.frame(P2 = c(0, 3, 2), F2 = c(0, 2, 0), id = 1:3) res <- Pass %>% group_by(id) %>% mutate(pass2 = P2 / (P2 + F2)) expect_true(is.nan(res$pass2[1])) res <- Pass %>% rowwise %>% mutate(pass2 = P2 / (P2 + F2)) expect_true(is.nan(res$pass2[1])) Pass <- data_frame( P1 = c(2L, 0L, 10L, 8L, 9L), F1 = c(0L, 2L, 0L, 4L, 3L), P2 = c(0L, 3L, 2L, 2L, 2L), F2 = c(0L, 2L, 0L, 1L, 1L), id = c(1, 2, 4, 4, 5) ) res <- Pass %>% group_by(id) %>% mutate( pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100, pass_rate1 = P1 / (P1 + F1) * 100, pass_rate2 = P2 / (P2 + F2) * 100 ) expect_true(is.nan(res$pass_rate2[1])) }) test_that("hybrid evaluator leaves formulas untouched (#1447)", { d <- data_frame(g = 1:2, training = list(mtcars, mtcars * 2)) mpg <- data.frame(x = 1:10, y = 1:10) res <- d %>% group_by(g) %>% mutate(lm_result = list(lm(mpg ~ wt, data = training[[1]]))) expect_is(res$lm_result, "list") expect_is(res$lm_result[[1]], "lm") expect_is(res$lm_result[[2]], "lm") }) test_that("lead/lag inside mutate handles expressions as value for default (#1411) ", { df <- data_frame(x = 1:3) res <- mutate(df, leadn = lead(x, default = x[1]), lagn = lag(x, default = x[1])) expect_equal(res$leadn, lead(df$x, default = df$x[1])) expect_equal(res$lagn, lag(df$x, default = df$x[1])) res <- mutate(df, leadn = lead(x, default = c(1)), lagn = lag(x, default = c(1))) expect_equal(res$leadn, lead(df$x, default = 1)) expect_equal(res$lagn, lag(df$x, default = 1)) }) test_that("grouped mutate does not drop grouping attributes (#1020)", { d <- data.frame(subject = c("Jack", "Jill"), id = c(2, 1)) %>% group_by(subject) a1 <- names(attributes(d)) a2 <- names(attributes(d %>% mutate(foo = 1))) expect_equal(setdiff(a1, a2), character(0)) }) test_that("grouped mutate coerces integer + double -> double (#1892)", { df <- data_frame( id = c(1, 4), value = c(1L, NA), group = c("A", "B") ) %>% group_by(group) %>% mutate(value = ifelse(is.na(value), 0, value)) expect_type(df$value, "double") expect_identical(df$value, c(1, 0)) }) test_that("grouped mutate coerces factor + character -> character (WARN) (#1892)", { factor_or_character <- function(x) { if (x > 3) { return(factor("hello")) } else { return("world") } } df <- data_frame( id = c(1, 4), group = c("A", "B") ) %>% group_by(group) expect_warning( df <- df %>% mutate(value = factor_or_character(id)) ) expect_type(df$value, "character") expect_identical(df$value, c("world", "hello")) }) test_that("lead/lag works on more complex expressions (#1588)", { df <- data_frame(x = rep(1:5, 2), g = rep(1:2, each = 5)) %>% group_by(g) res <- df %>% mutate(y = lead(x > 3)) expect_equal(res$y, rep(lead(1:5 > 3), 2)) }) test_that("Adding a Column of NA to a Grouped Table gives expected results (#1645)", { dataset <- data_frame(A = 1:10, B = 10:1, group = factor(sample(LETTERS[25:26], 10, TRUE))) res <- dataset %>% group_by(group) %>% mutate(prediction = factor(NA)) expect_true(all(is.na(res$prediction))) expect_is(res$prediction, "factor") expect_equal(levels(res$prediction), character()) }) test_that("Deep copies are performed when needed (#1463)", { res <- data.frame(prob = c(F, T)) %>% rowwise %>% mutate(model = list(x = prob)) expect_equal(unlist(res$model), c(FALSE, TRUE)) res <- data.frame(x = 1:4, g = c(1, 1, 1, 2)) %>% group_by(g) %>% mutate(model = list(y = x)) expect_equal(res$model[[1]], 1:3) expect_equal(res$model[[4]], 4) }) test_that("ntile falls back to R (#1750)", { res <- mutate(iris, a = ntile("Sepal.Length", 3)) expect_equal(res$a, rep(1, 150)) }) test_that("mutate() names pronouns correctly (#2686)", { expect_named(mutate(tibble(x = 1), .data$x), "x") expect_named(mutate(tibble(x = 1), .data[["x"]]), "x") }) test_that("mutate() supports unquoted values", { df <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5) expect_identical(mutate(df, out = !! 1), mutate(df, out = 1)) expect_identical(mutate(df, out = !! 1:5), mutate(df, out = 1:5)) expect_identical(mutate(df, out = !! quote(1:5)), mutate(df, out = 1:5)) expect_error(mutate(df, out = !! 1:2), "must be length 5 (the number of rows)", fixed = TRUE) expect_error(mutate(df, out = !! get_env()), "unsupported type") gdf <- group_by(df, g) expect_identical(mutate(gdf, out = !! 1), mutate(gdf, out = 1)) expect_identical(mutate(gdf, out = !! 1:5), group_by(mutate(df, out = 1:5), g)) expect_error(mutate(gdf, out = !! quote(1:5)), "must be length 2 (the group size)", fixed = TRUE) expect_error(mutate(gdf, out = !! 1:2), "must be length 5 (the number of rows)", fixed = TRUE) expect_error(mutate(gdf, out = !! get_env()), "unsupported type") }) # Error messages ---------------------------------------------------------- test_that("mutate fails gracefully on non-vector columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( mutate(df, a = 1), "Column `b` is of unsupported type raw vector", fixed = TRUE ) expect_error( mutate(df, b = 1), "Column `b` is of unsupported type raw vector", fixed = TRUE ) expect_error( mutate(df, c = 1), "Column `b` is of unsupported type raw vector", fixed = TRUE ) }) test_that("grouped mutate errors on incompatible column type (#1641)", { expect_error( tibble(x = 1) %>% mutate(y = mean), "Column `y` is of unsupported type function", fixed = TRUE ) expect_error( tibble(x = 1) %>% mutate(y = quote(a)), "Column `y` is of unsupported type symbol", fixed = TRUE ) }) test_that("can reuse new variables", { expect_equal( data.frame(c = 1) %>% mutate(c, gc = mean(c)), data.frame(c = 1, gc = 1) ) }) test_that("can use character vectors in grouped mutate (#2971)", { df <- data_frame(x = 1:10000) %>% group_by(x) %>% mutate(y = as.character(runif(1L)), z = as.character(runif(1L))) expect_error(df %>% distinct(x, .keep_all = TRUE), NA) }) test_that("mutate() to UTF-8 column names", { df <- data_frame(a = 1) %>% mutate("\u5e78" := a) expect_equal(colnames(df), c("a", "\u5e78")) }) dplyr/tests/testthat/test-colwise-filter.R0000644000176200001440000000157713135665123020455 0ustar liggesuserscontext("colwise filter") test_that("filter_if()", { expect_identical(nrow(filter_if(mtcars, is_integerish, all_vars(. > 1))), 0L) expect_identical(nrow(filter_if(mtcars, is_integerish, all_vars(. > 0))), 7L) }) test_that("filter_at()", { sepal_large <- filter_at(iris, vars(starts_with("Sepal")), all_vars(. > 4)) expect_equal(sepal_large$Sepal.Length, c(5.7, 5.2, 5.5)) }) test_that("filter_all()", { expect_identical(filter_all(mtcars, any_vars(. > 200))$disp, mtcars$disp[mtcars$disp > 200]) }) test_that("aborts on empty selection", { expect_error( filter_if(mtcars, is_character, all_vars(. > 0)), "`.predicate` has no matching columns", fixed = TRUE ) }) test_that("aborts when supplied funs()", { expect_error( filter_all(mtcars, funs(. > 0)), "`.vars_predicate` must be a call to `all_vars()` or `any_vars()`, not list", fixed = TRUE ) }) dplyr/tests/testthat/test-n_distinct.R0000644000176200001440000000131613135665123017652 0ustar liggesuserscontext("n_distinct") test_that("count_distinct gives the correct results on iris", { expect_equal( sapply(iris, n_distinct), sapply(iris, function(.) length(unique(.))) ) }) df_var <- data.frame( l = c(T, F, F), i = c(1, 1, 2), d = Sys.Date() + c(1, 1, 2), f = factor(letters[c(1, 1, 2)]), n = c(1, 1, 2) + 0.5, t = Sys.time() + c(1, 1, 2), c = letters[c(1, 1, 2)], stringsAsFactors = FALSE ) test_that("count_distinct gives correct results for key types", { expect_equal( sapply(df_var, n_distinct), sapply(df_var, function(.) length(unique(.))) ) }) test_that("n_distinct treats NA correctly in the REALSXP case (#384)", { expect_equal(n_distinct(c(1.0, NA, NA)), 2) }) dplyr/tests/testthat/test-case-when.R0000644000176200001440000000403713150340402017354 0ustar liggesuserscontext("case_when") test_that("zero inputs throws an error", { expect_error( case_when(), "No cases provided", fixed = TRUE ) }) test_that("error messages", { expect_error( case_when( paste(50) ), "Case 1 (`paste(50)`) must be a two-sided formula, not a string", fixed = TRUE ) expect_error( case_when( 50 ~ 1:3 ), "LHS of case 1 (`50`) must be a logical, not double", fixed = TRUE ) }) test_that("cases must yield compatible lengths", { expect_error( case_when( c(TRUE, FALSE) ~ 1, c(FALSE, TRUE, FALSE) ~ 2, c(FALSE, TRUE, FALSE, NA) ~ 3 ), "`c(FALSE, TRUE, FALSE) ~ 2`, `c(FALSE, TRUE, FALSE, NA) ~ 3` must be length 2 or one, not 3, 4", fixed = TRUE ) expect_error( case_when( c(TRUE, FALSE) ~ 1:3, c(FALSE, TRUE) ~ 1:2 ), "`c(TRUE, FALSE) ~ 1:3` must be length 2 or one, not 3", fixed = TRUE ) }) test_that("matches values in order", { x <- 1:3 expect_equal( case_when( x <= 1 ~ 1, x <= 2 ~ 2, x <= 3 ~ 3 ), c(1, 2, 3) ) }) test_that("unmatched gets missing value", { x <- 1:3 expect_equal( case_when( x <= 1 ~ 1, x <= 2 ~ 2 ), c(1, 2, NA) ) }) test_that("missing values can be replaced (#1999)", { x <- c(1:3, NA) expect_equal( case_when( x <= 1 ~ 1, x <= 2 ~ 2, is.na(x) ~ 0 ), c(1, 2, NA, 0) ) }) test_that("NA conditions (#2927)", { expect_equal( case_when( c(TRUE, FALSE, NA) ~ 1:3, TRUE ~ 4L ), c(1L, 4L, 4L) ) }) test_that("atomic conditions (#2909)", { expect_equal( case_when( TRUE ~ 1:3, FALSE ~ 4:6 ), 1:3 ) expect_equal( case_when( NA ~ 1:3, TRUE ~ 4:6 ), 4:6 ) }) test_that("zero-length conditions and values (#3041)", { expect_equal( case_when( TRUE ~ integer(), FALSE ~ integer() ), integer() ) expect_equal( case_when( logical() ~ 1, logical() ~ 2 ), numeric() ) }) dplyr/tests/testthat/helper-combine.R0000644000176200001440000001666113153520575017442 0ustar liggesuserscombine_pair_test <- function(item_pair, var1, var2, result, can_combine = TRUE, warning = FALSE) { label_if_fail <- paste0( "combine(items[c(\"", var1, "\", \"", var2, "\")])" ) if (warning) { warning_regexp <- ".*" } else { warning_regexp <- NA } if (can_combine) { expect_warning( res <- combine(item_pair), regexp = warning_regexp, label = label_if_fail ) expect_equal( object = res, expected = result, label = label_if_fail, expected.label = deparse(result) ) } else { expect_warning( expect_error( combine(item_pair), "^Argument 2 can't be converted from [^ ]* to [^ ]*$", label = label_if_fail ), regexp = warning_regexp, label = label_if_fail ) } } can_be_combined <- function(item1, item2, class1, class2, all_na1, all_na2, known_to_dplyr1, known_to_dplyr2) { # Unknown classes will be stripped and ignored (#2406) if (!known_to_dplyr1) { class1 <- class(as.vector(item1)) } if (!known_to_dplyr2) { class2 <- class(as.vector(item2)) } # Two elements of the same class can be combined # NA values are also combinable if (identical(class1, class2) || all_na1 || all_na2) { return(TRUE) } # doubles and integers: if (all(c(class1, class2) %in% c("numeric", "integer"))) { return(TRUE) } # coerce factor with character if ((class1 == "factor" && class2 == "character") || (class2 == "factor" && class1 == "character")) { return(TRUE) } # All the other cases can't be combined return(FALSE) } give_a_warning <- function(item1, item2, class1, class2, known_to_dplyr1, known_to_dplyr2, can_be_combined) { # Unknown classes give a warning, because attributes may be wrong if (!known_to_dplyr1) { return(TRUE) } # If only the second element is of an unknown type to dplyr # Then the warning is only emmitted in case we can combine (otherwise the # error appears before) if (!known_to_dplyr2 && can_be_combined) { return(TRUE) } # factor and character give a warning when combined (coercion to character) if ((class1 == "factor" && class2 == "character") || (class1 == "character" && class2 == "factor")) { return(TRUE) } # Two factors give a warning if they don't have identical levels (coercion to character) if (class1 == "factor" && class2 == "factor") { if (!identical(levels(item1), levels(item2))) { return(TRUE) } } # All other cases do not raise a warning return(FALSE) } combine_result <- function(item1, item2, class1, class2, all_na1, all_na2, known_to_dplyr1, known_to_dplyr2, can_combine, give_warning) { result <- NULL # Unknown classes will be stripped and ignored (#2406) if (!known_to_dplyr1) { class1 <- class(as.vector(item1)) } if (!known_to_dplyr2) { class2 <- class(as.vector(item2)) } if (can_combine) { # Custom coercions: # - Factor with character coerced to character # - Factor with Factor without same levels -> character # - Factor with NA is Factor # Otherwise use the default approach with unlist and add classes # if needed. if ((class1 == "factor" && class2 == "character") || (class2 == "factor" && class1 == "character")) { result <- c(as.character(item1), as.character(item2)) } else if ((class1 == "factor" && class2 == "factor") && !identical(levels(item1), levels(item2))) { result <- c(as.character(item1), as.character(item2)) } else if ((is.factor(item1) && all(is.na(item2))) || (is.factor(item2) && all(is.na(item1)))) { result <- factor(c(as.character(item1), as.character(item2))) } else { # Default combination result result <- unlist( list(item1, item2), recursive = FALSE, use.names = FALSE ) # Add classes and attributes in some cases to the default if ((all(is.na(item1)) && "POSIXct" %in% class2) || (all(is.na(item2)) && "POSIXct" %in% class1) || ("POSIXct" %in% class1 && "POSIXct" %in% class2)) { class(result) <- c("POSIXct", "POSIXt") attr(result, "tzone") <- "" } else if (all_na1 && known_to_dplyr2) { class(result) <- class2 } else if (all_na2 && known_to_dplyr1) { class(result) <- class1 } else if (identical(class1, class2) && known_to_dplyr1) { class(result) <- class1 } } } list(result) } prepare_table_with_coercion_rules <- function() { items <- list( logicalvalue = TRUE, logicalNA = NA, anotherNA = c(NA, NA), integer = 4L, factor = factor("a"), another_factor = factor("b"), double = 4.5, character = "c", POSIXct = as.POSIXct("2010-01-01"), Date = as.Date("2016-01-01"), complex = 1 + 2i, int_with_class = structure(4L, class = "int_with_class"), num_with_class = structure(4.5, class = "num_with_class") ) special_non_vector_classes <- c( "factor", "POSIXct", "Date", "table", "AsIs", "integer64" ) pairs <- expand.grid(names(items), names(items)) pairs$can_combine <- FALSE pairs$warning <- FALSE pairs$item_pair <- vector("list", nrow(pairs)) pairs$result <- vector("list", nrow(pairs)) for (i in seq_len(nrow(pairs))) { item1 <- items[[pairs$Var1[i]]] item2 <- items[[pairs$Var2[i]]] class1 <- class(item1) class2 <- class(item2) all_na1 <- all(is.na(item1)) all_na2 <- all(is.na(item2)) known_to_dplyr1 <- is.vector(item1) || any(class1 %in% special_non_vector_classes) known_to_dplyr2 <- is.vector(item2) || any(class2 %in% special_non_vector_classes) pairs$can_combine[i] <- can_be_combined( item1, item2, class1, class2, all_na1, all_na2, known_to_dplyr1, known_to_dplyr2 ) pairs$warning[i] <- give_a_warning( item1, item2, class1, class2, known_to_dplyr1, known_to_dplyr2, can_be_combined = pairs$can_combine[i] ) pairs$item_pair[[i]] <- list(item1, item2) pairs$result[i] <- combine_result( item1, item2, class1, class2, all_na1, all_na2, known_to_dplyr1, known_to_dplyr2, pairs$can_combine[i], pairs$warning[i] ) } return(pairs) } print_pairs <- function(pairs) { pairs_printable <- pairs pairs_printable$result <- sapply( pairs$result, function(x) { if (is.null(x)) { "" } else { as.character(x) } } ) pairs_printable$result_class <- lapply( pairs$result, function(x) { if (is.null(x)) { "" } else { class(x) } } ) pairs_printable <- arrange( pairs_printable, desc(can_combine), warning, Var1, Var2 ) pairs_printable } combine_coercion_types <- function() { pairs <- prepare_table_with_coercion_rules() # knitr::kable(print_pairs(pairs)) for (i in seq_len(nrow(pairs))) { test_that(paste0("Coercion from ", pairs$Var1[i], " to ", pairs$Var2[i]), { combine_pair_test( item_pair = pairs$item_pair[[i]], var1 = pairs$Var1[i], var2 = pairs$Var2[i], result = pairs$result[[i]], can_combine = pairs$can_combine[i], warning = pairs$warning[i] ) }) } } dplyr/tests/testthat/test-astyle.R0000644000176200001440000000024313120706341017003 0ustar liggesuserscontext("astyle") test_that("source code formatting", { skip_on_cran() skip_on_os("windows") skip_on_travis() expect_warning(astyle("--dry-run"), NA) }) dplyr/tests/testthat/test-sample.R0000644000176200001440000000551613153520575017004 0ustar liggesuserscontext("Sample") # Basic behaviour ------------------------------------------------------------- test_that("sample preserves class", { expect_is(sample_n(mtcars, 1), "data.frame") expect_is(sample_n(tbl_df(mtcars), 1), "tbl_df") expect_is(sample_frac(mtcars, 1), "data.frame") expect_is(sample_frac(tbl_df(mtcars), 1), "tbl_df") }) # Ungrouped -------------------------------------------------------------------- df <- data.frame( x = 1:2, y = c(0, 1) ) test_that("sample respects weight", { # error message from base R expect_error(sample_n(df, 2, weight = y)) expect_equal(sample_n(df, 1, weight = y)$x, 2) expect_error( sample_frac(df, 2), "`size` of sampled fraction must be less or equal to one, set `replace` = TRUE to use sampling with replacement", fixed = TRUE ) expect_error( sample_frac(df %>% group_by(y), 2), "`size` of sampled fraction must be less or equal to one, set `replace` = TRUE to use sampling with replacement", fixed = TRUE ) # error message from base R expect_error(sample_frac(df, 1, weight = y)) expect_equal(sample_frac(df, 0.5, weight = y)$x, 2) }) test_that("sample_* error message", { expect_error( check_weight(letters[1:2], 2), "`weight` must be a numeric, not character", fixed = TRUE ) expect_error( check_weight(-1:-2, 2), "`weight` must be a vector with all values nonnegative, not -1", fixed = TRUE ) expect_error( check_weight(letters, 2), "`weight` must be a numeric, not character" ) }) test_that("sample gives informative error for unknown type", { expect_error( sample_n(list()), "`tbl` must be a data frame, not list", fixed = TRUE ) expect_error( sample_frac(list()), "`tbl` must be a data frame, not list", fixed = TRUE ) }) # Grouped ---------------------------------------------------------------------- test_that("sampling grouped tbl samples each group", { sampled <- mtcars %>% group_by(cyl) %>% sample_n(2) expect_is(sampled, "grouped_df") expect_groups(sampled, "cyl") expect_equal(nrow(sampled), 6) expect_equal(sampled$cyl, rep(c(4, 6, 8), each = 2)) }) test_that("can't sample more values than obs (without replacement)", { by_cyl <- mtcars %>% group_by(cyl) expect_error( sample_n(by_cyl, 10), "`size` must be less or equal than 7 (size of data), set `replace` = TRUE to use sampling with replacement", fixed = TRUE ) }) df2 <- data.frame( x = rep(1:2, 2), y = rep(c(0, 1), 2), g = rep(1:2, each = 2) ) test_that("grouped sample respects weight", { grp <- df2 %>% group_by(g) # error message from base R expect_error(sample_n(grp, 2, weight = y)) expect_equal(sample_n(grp, 1, weight = y)$x, c(2, 2)) # error message from base R expect_error(sample_frac(grp, 1, weight = y)) expect_equal(sample_frac(grp, 0.5, weight = y)$x, c(2, 2)) }) dplyr/tests/testthat/test-distinct.R0000644000176200001440000000401613153520575017336 0ustar liggesuserscontext("Distinct") test_that("distinct equivalent to local unique when keep_all is TRUE", { df <- tibble( x = c(1, 1, 1, 1), y = c(1, 1, 2, 2), z = c(1, 2, 1, 2) ) expect_equal(distinct(df), unique(df)) }) test_that("distinct for single column works as expected (#1937)", { df <- tibble( x = c(1, 1, 1, 1), y = c(1, 1, 2, 2), z = c(1, 2, 1, 2) ) expect_equal(distinct(df, x, .keep_all = FALSE), unique(df["x"])) expect_equal(distinct(df, y, .keep_all = FALSE), unique(df["y"])) }) test_that("distinct works for 0-sized columns (#1437)", { df <- data_frame(x = 1:10) %>% select(-x) ddf <- distinct(df) expect_equal(ncol(ddf), 0L) }) test_that("if no variables specified, uses all", { df <- data_frame(x = c(1, 1), y = c(2, 2)) expect_equal(distinct(df), data_frame(x = 1, y = 2)) }) test_that("distinct keeps only specified cols", { df <- data_frame(x = c(1, 1, 1), y = c(1, 1, 1)) expect_equal(df %>% distinct(x), data_frame(x = 1)) }) test_that("unless .keep_all = TRUE", { df <- data_frame(x = c(1, 1, 1), y = 3:1) expect_equal(df %>% distinct(x), data_frame(x = 1)) expect_equal(df %>% distinct(x, .keep_all = TRUE), data_frame(x = 1, y = 3L)) }) test_that("distinct doesn't duplicate columns", { df <- tibble(a = 1:3, b = 4:6) expect_named(df %>% distinct(a, a), "a") expect_named(df %>% group_by(a) %>% distinct(a), "a") }) test_that("grouped distinct always includes group cols", { df <- tibble(g = c(1, 2), x = c(1, 2)) out <- df %>% group_by(g) %>% distinct(x) expect_equal(df, out) }) test_that("empty grouped distinct equivalent to empty ungrouped", { df <- tibble(g = c(1, 2), x = c(1, 2)) df1 <- df %>% distinct() %>% group_by(g) df2 <- df %>% group_by(g) %>% distinct() expect_equal(df1, df2) }) test_that("distinct on a new, mutated variable is equivalent to mutate followed by distinct", { df <- tibble(g = c(1, 2), x = c(1, 2)) df1 <- df %>% distinct(aa = g * 2) df2 <- df %>% mutate(aa = g * 2) %>% distinct(aa) expect_equal(df1, df2) }) dplyr/tests/testthat/test-combine.R0000644000176200001440000001564113153520575017137 0ustar liggesuserscontext("combine") test_that("combine handles NULL (1596)", { expect_equal(combine(list(NULL, 1, 2)), c(1, 2)) expect_equal(combine(list(1, NULL, 2)), c(1, 2)) expect_equal(combine(list(1, 2, NULL)), c(1, 2)) expect_error( combine(list(NULL, NULL)), "no data to combine, all elements are NULL", fixed = TRUE ) }) test_that("combine complains about incompatibilites", { expect_error( combine("a", 1), "Argument 2 can't be converted from numeric to character" ) expect_error( combine(factor("a"), 1L), "Argument 2 can't be converted from integer to factor" ) }) test_that("combine works with input that used to fail (#1780)", { no <- list(alpha = letters[1:3], omega = letters[24:26]) expect_equal(combine(no), unlist(no, use.names = FALSE)) }) test_that("combine works with NA and logical (#2203)", { # NA first expected_result <- c(NA, TRUE, FALSE, NA, TRUE) works1 <- combine(list(NA, TRUE, FALSE, NA, TRUE)) expect_equal(works1, expected_result) # NA length == 1 expected_result <- c(TRUE, FALSE, NA, TRUE) works1 <- combine(list(TRUE, FALSE, NA, TRUE)) expect_equal(works1, expected_result) # NA length > 1 expected_result <- c(TRUE, FALSE, NA, NA, TRUE) works3 <- combine(list(TRUE, FALSE, c(NA, NA), TRUE)) expect_equal(works3, expected_result) }) test_that("combine works with NA and integers (#2203)", { works <- combine(list(1L, 2L, NA, 4L)) expect_equal(works, c(1L, 2L, NA, 4L)) works <- combine(list(1L, 2L, c(NA, NA), 4L)) expect_equal(works, c(1L, 2L, NA, NA, 4L)) }) test_that("combine works with NA and factors (#2203)", { # NA first fac <- factor(c("a", "c", NA, "b"), levels = letters[1:3]) expected_result <- fac[c(3, 1, 3, 2)] works1 <- combine(list(NA, fac[1], NA, fac[2])) expect_equal(works1, expected_result) # NA length == 1 expected_result <- fac works1 <- combine(list(fac[1], fac[2], fac[3], fac[4])) expect_equal(works1, expected_result) works2 <- combine(list(fac[1], fac[2], NA, fac[4])) expect_equal(works2, expected_result) # NA length > 1 expected_result <- fac[c(1, 2, 3, 3, 4)] works3 <- combine(list(fac[1], fac[2], fac[c(3, 3)], fac[4])) expect_equal(works3, expected_result) works4 <- combine(list(fac[1], fac[2], c(NA, NA), fac[4])) expect_equal(works4, expected_result) }) test_that("combine works with NA and double (#2203)", { # NA first works <- combine(list(NA, 1.5, 2.5, NA, 4.5)) expect_equal(works, c(NA, 1.5, 2.5, NA, 4.5)) # NA length 1 works <- combine(list(1.5, 2.5, NA, 4.5)) expect_equal(works, c(1.5, 2.5, NA, 4.5)) # NA length > 1 works <- combine(list(1.5, 2.5, c(NA, NA), 4.5)) expect_equal(works, c(1.5, 2.5, NA, NA, 4.5)) }) test_that("combine works with NA and characters (#2203)", { # NA first works <- combine(list(NA, "a", "b", "c", NA, "e")) expect_equal(works, c(NA, "a", "b", "c", NA, "e")) # NA length 1 works <- combine(list("a", "b", "c", NA, "e")) expect_equal(works, c("a", "b", "c", NA, "e")) # NA length > 1 works <- combine(list("a", "b", "c", c(NA, NA), "e")) expect_equal(works, c("a", "b", "c", NA, NA, "e")) }) test_that("combine works with NA and POSIXct (#2203)", { # NA first works <- combine(list(NA, as.POSIXct("2010-01-01"), as.POSIXct("2010-01-02"), NA, as.POSIXct("2010-01-04"))) expect_equal(works, c(as.POSIXct(c(NA, "2010-01-01", "2010-01-02", NA, "2010-01-04")))) # NA length 1 works <- combine(list(as.POSIXct("2010-01-01"), as.POSIXct("2010-01-02"), NA, as.POSIXct("2010-01-04"))) expect_equal(works, c(as.POSIXct(c("2010-01-01", "2010-01-02", NA, "2010-01-04")))) # NA length > 1 works <- combine(list(as.POSIXct("2010-01-01"), as.POSIXct("2010-01-02"), c(NA, NA), as.POSIXct("2010-01-04"))) expect_equal(works, c(as.POSIXct(c("2010-01-01", "2010-01-02", NA, NA, "2010-01-04")))) }) test_that("combine works with NA and Date (#2203)", { # NA first expected_result <- as.Date("2010-01-01") + c(NA, 1, 2, NA, 4) expect_equal(combine(as.list(expected_result)), expected_result) # NA length == 1 expected_result <- c(as.Date(c("2010-01-01", "2010-01-02", NA, "2010-01-04"))) works1 <- combine(list(as.Date("2010-01-01"), as.Date("2010-01-02"), as.Date(NA), as.Date("2010-01-04"))) expect_equal(works1, expected_result) works2 <- combine(list(as.Date("2010-01-01"), as.Date("2010-01-02"), NA, as.Date("2010-01-04"))) expect_equal(works2, expected_result) # NA length > 1 expected_result <- as.Date("2010-01-01") + c(0, 1, NA, NA, 3) works1 <- combine(split(expected_result, c(1, 2, 3, 3, 4))) expect_equal(works1, expected_result) works2 <- combine(list(as.Date("2010-01-01"), as.Date("2010-01-02"), c(NA, NA), as.Date("2010-01-04"))) expect_equal(works2, expected_result) }) test_that("combine works with NA and complex (#2203)", { # NA first expected_result <- c(NA, 1 + 2i) works1 <- combine(list(NA, 1 + 2i)) expect_equal(works1, expected_result) # NA length == 1 expected_result <- c(1, 2, NA, 4) + 1i expect_equal(combine(as.list(expected_result)), expected_result) works2 <- combine(list(1 + 1i, 2 + 1i, NA, 4 + 1i)) expect_equal(works2, expected_result) # NA length > 1 expected_result <- c(1, 2, NA, NA, 4) + 1i expect_equal(combine(split(expected_result, c(1, 2, 3, 3, 4))), expected_result) works3 <- combine(list(1 + 1i, 2 + 1i, c(NA, NA), 4 + 1i)) expect_equal(works3, expected_result) }) test_that("combine works with integer64 (#1092)", { expect_equal( combine(bit64::as.integer64(2^34), bit64::as.integer64(2^35)), bit64::as.integer64(c(2^34, 2^35)) ) }) test_that("combine works with difftime", { expect_equal( combine(as.difftime(1, units = "mins"), as.difftime(1, units = "hours")), as.difftime(c(60, 3600), units = "secs") ) expect_equal( combine(as.difftime(1, units = "secs"), as.difftime(1, units = "secs")), as.difftime(c(1, 1), units = "secs") ) expect_equal( combine(as.difftime(1, units = "days"), as.difftime(1, units = "secs")), as.difftime(c(24*60*60, 1), units = "secs") ) expect_equal( combine(as.difftime(2, units = "weeks"), as.difftime(1, units = "secs")), as.difftime(c(2*7*24*60*60, 1), units = "secs") ) expect_equal( combine(as.difftime(2, units = "weeks"), as.difftime(3, units = "weeks")), as.difftime(c(2,3), units = "weeks") ) }) test_that("combine works with hms and difftime", { expect_equal( combine(as.difftime(2, units = "weeks"), hms::hms(hours = 1)), as.difftime(c(2*7*24*60*60, 3600), units = "secs") ) expect_equal( combine(hms::hms(hours = 1), as.difftime(2, units = "weeks")), hms::hms(seconds = c(3600, 2*7*24*60*60)) ) }) # Uses helper-combine.R combine_coercion_types() dplyr/tests/testthat/test-sets.R0000644000176200001440000000412213153520575016471 0ustar liggesuserscontext("Set ops") test_that("set operation give useful error message. #903", { alfa <- data_frame( land = c("Sverige", "Norway", "Danmark", "Island", "GB"), data = rnorm(length(land)) ) beta <- data_frame( land = c("Norge", "Danmark", "Island", "Storbritannien"), data2 = rnorm(length(land)) ) expect_error( intersect(alfa, beta), "not compatible: \n- Cols in y but not x: `data2`. \n- Cols in x but not y: `data`. \n", fixed = TRUE ) expect_error( union(alfa, beta), "not compatible: \n- Cols in y but not x: `data2`. \n- Cols in x but not y: `data`. \n", fixed = TRUE ) expect_error( setdiff(alfa, beta), "not compatible: \n- Cols in y but not x: `data2`. \n- Cols in x but not y: `data`. \n", fixed = TRUE ) }) test_that("set operations use coercion rules (#799)", { df1 <- data_frame(x = 1:2, y = c(1, 1)) df2 <- data_frame(x = 1:2, y = 1:2) expect_equal(nrow(union(df1, df2)), 3L) expect_equal(nrow(intersect(df1, df2)), 1L) expect_equal(nrow(setdiff(df1, df2)), 1L) df1 <- data_frame(x = factor(letters[1:10])) df2 <- data_frame(x = letters[6:15]) expect_warning(res <- intersect(df1, df2)) expect_equal(res, data_frame(x = letters[6:10])) expect_warning(res <- intersect(df2, df1)) expect_equal(res, data_frame(x = letters[6:10])) expect_warning(res <- union(df1, df2)) expect_equal(res, data_frame(x = letters[1:15])) expect_warning(res <- union(df2, df1)) expect_equal(res, data_frame(x = letters[1:15])) expect_warning(res <- setdiff(df1, df2)) expect_equal(res, data_frame(x = letters[1:5])) expect_warning(res <- setdiff(df2, df1)) expect_equal(res, data_frame(x = letters[11:15])) }) test_that("setdiff handles factors with NA (#1526)", { df1 <- data_frame(x = factor(c(NA, "a"))) df2 <- data_frame(x = factor("a")) res <- setdiff(df1, df2) expect_is(res$x, "factor") expect_equal(levels(res$x), "a") expect_true(is.na(res$x[1])) }) test_that("intersect does not unnecessarily coerce (#1722)", { df <- data_frame(a = 1L) res <- intersect(df, df) expect_is(res$a, "integer") }) dplyr/tests/testthat/test-summarise.r0000644000176200001440000006666013153520575017577 0ustar liggesuserscontext("Summarise") test_that("repeated outputs applied progressively", { df <- data.frame(x = 5) out <- summarise(df, x = mean(x), x = x + 1) expect_equal(nrow(out), 1) expect_equal(ncol(out), 1) expect_equal(out$x, 6) }) test_that("repeated outputs applied progressively (grouped_df)", { df <- data.frame(x = c(1, 1), y = 1:2) ds <- group_by(df, y) out <- summarise(ds, z = mean(x), z = z + 1) expect_equal(nrow(out), 2) expect_equal(ncol(out), 2) expect_equal(out$z, c(2L, 2L)) }) test_that("summarise peels off a single layer of grouping", { df <- tibble(x = rep(1:4, each = 4), y = rep(1:2, each = 8), z = runif(16)) grouped <- df %>% group_by(x, y, z) expect_equal(group_vars(grouped), c("x", "y", "z")) expect_equal(group_vars(grouped %>% summarise(n = n())), c("x", "y")) }) test_that("summarise can refer to variables that were just created (#138)", { res <- summarise(tbl_df(mtcars), cyl1 = mean(cyl), cyl2 = cyl1 + 1) expect_equal(res$cyl2, mean(mtcars$cyl) + 1) gmtcars <- group_by(tbl_df(mtcars), am) res <- summarise(gmtcars, cyl1 = mean(cyl), cyl2 = cyl1 + 1) res_direct <- summarise(gmtcars, cyl2 = mean(cyl) + 1) expect_equal(res$cyl2, res_direct$cyl2) }) test_that("summarise can refer to factor variables that were just created (#2217)", { df <- data_frame(a = 1:3) %>% group_by(a) res <- df %>% summarise(f = factor(if_else(a <= 1, "a", "b")), g = (f == "a")) expect_equal( res, data_frame(a = 1:3, f = factor(c("a", "b", "b")), g = c(TRUE, FALSE, FALSE)) ) }) test_that("summarise refuses to modify grouping variable (#143)", { df <- data.frame(a = c(1, 2, 1, 2), b = c(1, 1, 2, 2), x = 1:4) ds <- group_by(tbl_df(df), a, b) expect_error( summarise(ds, a = mean(x), a = b + 1), "Column `a` can't be modified because it's a grouping variable" ) }) test_that("summarise gives proper errors (#153)", { df <- data_frame( x = 1, y = c(1, 2, 2), z = runif(3) ) expect_error( summarise(df, identity(NULL)), "Column `identity(NULL)` is of unsupported type NULL", fixed = TRUE ) expect_error( summarise(df, log(z)), "Column `log(z)` must be length 1 (a summary value), not 3", fixed = TRUE ) expect_error( summarise(df, y[1:2]), "Column `y[1:2]` must be length 1 (a summary value), not 2", fixed = TRUE ) expect_error( summarise(df, env(a = 1)), "Column `env(a = 1)` is of unsupported type environment", fixed = TRUE ) gdf <- group_by(df, x, y) expect_error( summarise(gdf, identity(NULL)), "Column `identity(NULL)` is of unsupported type NULL", fixed = TRUE ) expect_error( summarise(gdf, z), "Column `z` must be length 1 (a summary value), not 2", fixed = TRUE ) expect_error( summarise(gdf, log(z)), "Column `log(z)` must be length 1 (a summary value), not 2", fixed = TRUE ) expect_error( summarise(gdf, y[1:2]), "Column `y[1:2]` must be length 1 (a summary value), not 2", fixed = TRUE ) expect_error( summarise(gdf, env(a = 1)), "Column `env(a = 1)` is of unsupported type environment", fixed = TRUE ) }) test_that("summarise handles constants (#153)", { df <- data.frame(a = 1:4) today <- Sys.Date() now <- Sys.time() res <- summarise( tbl_df(df), int = 1L, num = 1, str = "foo", bool = TRUE, date = today, time = now ) expect_equal(res$int, 1L) expect_equal(res$num, 1.0) expect_equal(res$str, "foo") expect_equal(res$bool, TRUE) expect_equal(res$date, today) expect_equal(res$time, now) res <- summarise( group_by(df, a), int = 1L, num = 1, str = "foo", bool = TRUE, date = today, time = now ) expect_equal(res$int, rep(1L, 4)) expect_equal(res$num, rep(1.0, 4)) expect_equal(res$str, rep("foo", 4)) expect_equal(res$bool, rep(TRUE, 4)) expect_equal(res$date, rep(today, 4)) expect_equal(res$time, rep(now, 4)) }) test_that("summarise handles passing ...", { df <- data.frame(x = 1:4) f <- function(...) { x1 <- 1 f1 <- function(x) x summarise(df, ..., x1 = f1(x1)) } g <- function(...) { x2 <- 2 f(x2 = x2, ...) } h <- function(before = "before", ..., after = "after") { g(before = before, ..., after = after) } res <- h(x3 = 3) expect_equal(res$x1, 1) expect_equal(res$x2, 2) expect_equal(res$before, "before") expect_equal(res$after, "after") df <- tbl_df(df) res <- h(x3 = 3) expect_equal(res$x1, 1) expect_equal(res$x2, 2) expect_equal(res$before, "before") expect_equal(res$after, "after") df <- group_by(df, x) res <- h(x3 = 3) expect_equal(res$x1, rep(1, 4)) expect_equal(res$x2, rep(2, 4)) expect_equal(res$before, rep("before", 4)) expect_equal(res$after, rep("after", 4)) }) test_that("summarise propagate attributes (#194)", { df <- data.frame( b = rep(1:2, 2), f = Sys.Date() + 1:4, g = Sys.time() + 1:4, stringsAsFactors = FALSE ) %>% group_by(b) min_ <- min res <- summarise(df, min_f = min(f), max_f = max(f), min_g = min(g), max_g = max(g), min__f = min_(f), min__g = min_(g) ) expect_equal(class(res$min_f) , "Date") expect_equal(class(res$max_f) , "Date") expect_equal(class(res$min__f), "Date") expect_equal(class(res$min_g) , c("POSIXct", "POSIXt")) expect_equal(class(res$max_g) , c("POSIXct", "POSIXt")) expect_equal(class(res$min__g), c("POSIXct", "POSIXt")) }) test_that("summarise strips names, but only if grouped (#2231, #2675)", { data <- data_frame(a = 1:3) %>% summarise(b = setNames(nm = a[[1]])) expect_equal(names(data$b), "1") data <- data_frame(a = 1:3) %>% rowwise %>% summarise(b = setNames(nm = a)) expect_null(names(data$b)) data <- data_frame(a = c(1, 1, 2)) %>% group_by(a) %>% summarise(b = setNames(nm = a[[1]])) expect_null(names(data$b)) }) test_that("summarise fails on missing variables", { # error messages from rlang expect_error(summarise(mtcars, a = mean(notthear))) }) test_that("summarise fails on missing variables when grouping (#2223)", { # error messages from rlang expect_error(summarise(group_by(mtcars, cyl), a = mean(notthear))) }) test_that("n() does not accept arguments", { expect_error( summarise(group_by(mtcars, cyl), n(hp)), "`n()` does not take arguments", fixed = TRUE ) }) test_that("hybrid nests correctly", { res <- group_by(mtcars, cyl) %>% summarise(a = if (n() > 10) 1 else 2) expect_equal(res$a, c(1, 2, 1)) res <- mtcars %>% summarise(a = if (n() > 10) 1 else 2) expect_equal(res$a, 1) }) test_that("hybrid min and max propagate attributes (#246)", { x <- data.frame( id = c(rep("a", 2), rep("b", 2)), date = as.POSIXct(c("2014-01-13", "2014-01-14", "2014-01-15", "2014-01-16"), tz = "GMT") ) y <- x %>% group_by(id) %>% summarise(max_date = max(date), min_date = min(date)) expect_true("tzone" %in% names(attributes(y$min_date))) expect_true("tzone" %in% names(attributes(y$max_date))) }) test_that("summarise can use newly created variable more than once", { df <- data.frame(id = c(1, 1, 2, 2, 3, 3), a = 1:6) %>% group_by(id) for (i in 1:10) { res <- summarise( df, biggest = max(a), smallest = min(a), diff1 = biggest - smallest, diff2 = smallest - biggest ) expect_equal(res$diff1, -res$diff2) } }) test_that("summarise creates an empty data frame when no parameters are used", { res <- summarise(mtcars) expect_equal(res, data.frame()) }) test_that("integer overflow (#304)", { groups <- rep(c("A", "B"), each = 3) values <- rep(1e9, 6) dat <- data.frame(groups, X1 = as.integer(values), X2 = values) # now group and summarise expect_warning( res <- group_by(dat, groups) %>% summarise(sum_integer = sum(X1), sum_numeric = sum(X2)), "integer overflow" ) expect_true(all(is.na(res$sum_integer))) expect_equal(res$sum_numeric, rep(3e9, 2L)) }) test_that("summarise checks outputs (#300)", { expect_error( summarise(mtcars, mpg, cyl), "Column `mpg` must be length 1 (a summary value), not 32", fixed = TRUE ) expect_error( summarise(mtcars, mpg + cyl), "Column `mpg + cyl` must be length 1 (a summary value), not 32", fixed = TRUE ) }) test_that("comment attribute is white listed (#346)", { test <- data.frame(A = c(1, 1, 0, 0), B = c(2, 2, 3, 3)) comment(test$B) <- "2nd Var" res <- group_by(test, A) expect_equal(comment(res$B), "2nd Var") }) test_that("AsIs class is white listed (#453)", { test <- data.frame(A = c(1, 1, 0, 0), B = I(c(2, 2, 3, 3))) res <- group_by(test, B) expect_equal(res$B, test$B) }) test_that("names attribute is not retained (#357)", { df <- data.frame(x = c(1:3), y = letters[1:3]) df <- group_by(df, y) m <- df %>% summarise( a = length(x), b = quantile(x, 0.5) ) expect_equal(m$b, c(1, 2, 3)) expect_null(names(m$b)) }) test_that("na.rm is supported (#168)", { df <- data.frame( x = c(1:5, NA, 7:10), y = rep(1:2, each = 5), z = c(rnorm(5), NA, rnorm(4)) ) res <- df %>% group_by(y) %>% summarise( mean_x = mean(x, na.rm = FALSE), mean_z = mean(z, na.rm = FALSE), min_x = min(x, na.rm = FALSE), min_z = min(z, na.rm = FALSE) ) expect_equal(res$mean_x[1], 3) expect_true(is.na(res$mean_x[2])) expect_equal(res$mean_z[1], mean(df$z[1:5])) expect_true(is.na(res$mean_z[2])) expect_equal(res$min_x[1], 1) expect_true(is.na(res$min_x[2])) expect_equal(res$min_z[1], min(df$z[1:5])) expect_true(is.na(res$min_z[2])) res <- df %>% group_by(y) %>% summarise( mean_x = mean(x, na.rm = TRUE), mean_z = mean(z, na.rm = TRUE), min_x = min(x, na.rm = TRUE), min_z = min(z, na.rm = TRUE) ) expect_equal(res$mean_x[1], 3) expect_equal(res$mean_x[2], 8.5) expect_equal(res$mean_z[1], mean(df$z[1:5])) expect_equal(res$mean_z[2], mean(df$z[7:10])) expect_equal(res$min_x[1], 1) expect_equal(res$min_x[2], 7) expect_equal(res$min_z[1], min(df$z[1:5])) expect_equal(res$min_z[2], min(df$z[7:10])) }) test_that("summarise hybrid functions can use summarized variables", { df <- data.frame(x = c(1:5, NA, 7:10), y = rep(1:2, each = 5)) %>% group_by(y) res <- summarise( df, x = mean(x), min = min(x), max = max(x), mean = mean(x), var = var(x) ) expect_identical(res$x, res$min) expect_identical(res$x, res$max) expect_identical(res$x, res$mean) expect_identical(res$var, rep(NA_real_, 2)) }) test_that("LazySubset is not confused about input data size (#452)", { res <- data.frame(a = c(10, 100)) %>% summarise(b = sum(a), c = sum(a) * 2) expect_equal(res$b, 110) expect_equal(res$c, 220) }) test_that("nth, first, last promote dates and times (#509)", { data <- data_frame( ID = rep(letters[1:4], each = 5), date = Sys.Date() + 1:20, time = Sys.time() + 1:20, number = rnorm(20) ) res <- data %>% group_by(ID) %>% summarise( date2 = nth(date, 2), time2 = nth(time, 2), first_date = first(date), last_date = last(date), first_time = first(time), last_time = last(time) ) expect_is(res$date2, "Date") expect_is(res$first_date, "Date") expect_is(res$last_date, "Date") expect_is(res$time2, "POSIXct") expect_is(res$first_time, "POSIXct") expect_is(res$last_time, "POSIXct") # error messages from rlang expect_error(data %>% group_by(ID) %>% summarise(time2 = nth(times, 2))) }) test_that("nth, first, last preserves factor data (#509)", { dat <- data_frame(a = rep(seq(1, 20, 2), 3), b = as.ordered(a)) dat1 <- dat %>% group_by(a) %>% summarise( der = nth(b, 2), first = first(b), last = last(b) ) expect_is(dat1$der, "ordered") expect_is(dat1$first, "ordered") expect_is(dat1$last, "ordered") expect_equal(levels(dat1$der), levels(dat$b)) }) test_that("nth handle negative value (#1584) ", { df <- data.frame( a = 1:10, b = 10:1, g = rep(c(1, 2), c(4, 6)) ) %>% group_by(g) res <- summarise( df, x1 = nth(a, -1L), x2 = nth(a, -1L, order_by = b), x3 = nth(a, -5L), x4 = nth(a, -5L, order_by = b), x5 = nth(a, -5L, default = 99), x6 = nth(a, -5L, order_by = b, default = 99) ) expect_equal(res$x1, c(4, 10)) expect_equal(res$x2, c(1, 5)) expect_true(is.na(res$x3[1])) expect_equal(res$x3[2], 6) expect_true(is.na(res$x4[1])) expect_equal(res$x4[2], 9) expect_equal(res$x5, c(99, 6)) expect_equal(res$x6, c(99, 9)) }) test_that("LazyGroupSubsets is robust about columns not from the data (#600)", { foo <- data_frame(x = 1:10, y = 1:10) # error messages from rlang expect_error(foo %>% group_by(x) %>% summarise(first_y = first(z))) }) test_that("can summarise first(x[-1]) (#1980)", { expect_equal( tibble(x = 1:3) %>% summarise(f = first(x[-1])), tibble(f = 2L) ) }) test_that("hybrid eval handles $ and @ (#645)", { tmp <- expand.grid(a = 1:3, b = 0:1, i = 1:10) g <- tmp %>% group_by(a) f <- function(a, b) { list(x = 1:10) } res <- g %>% summarise( r = sum(b), n = length(b), p = f(r, n)$x[1] ) expect_equal(names(res), c("a", "r", "n", "p")) res <- tmp %>% summarise( r = sum(b), n = length(b), p = f(r, n)$x[1] ) expect_equal(names(res), c("r", "n", "p")) }) test_that("argument order_by in last is flexible enough to handle more than just a symbol (#626)", { res1 <- group_by(mtcars, cyl) %>% summarise( big = last(mpg[drat > 3], order_by = wt[drat > 3]), small = first(mpg[drat > 3], order_by = wt[drat > 3]), second = nth(mpg[drat > 3], 2, order_by = wt[drat > 3]) ) # turning off lazy eval last. <- last first. <- first nth. <- nth res2 <- group_by(mtcars, cyl) %>% summarise( big = last.(mpg[drat > 3], order_by = wt[drat > 3]), small = first.(mpg[drat > 3], order_by = wt[drat > 3]), second = nth.(mpg[drat > 3], 2, order_by = wt[drat > 3]) ) expect_equal(res1, res2) }) test_that("min(., na.rm=TRUE) correctly handles Dates that are coded as REALSXP (#755)", { dates <- as.Date(c("2014-01-01", "2013-01-01")) dd <- data.frame(Dates = dates) res <- summarise(dd, Dates = min(Dates, na.rm = TRUE)) expect_is(res$Dates, "Date") expect_equal(res$Dates, as.Date("2013-01-01")) }) test_that("nth handles expressions for n argument (#734)", { df <- data.frame(x = c(1:4, 7:9, 13:19), y = sample(100:999, 14)) idx <- which(df$x == 16) res <- df %>% summarize(abc = nth(y, n = which(x == 16))) expect_equal(res$abc, df$y[idx]) }) test_that("summarise is not polluted by logical NA (#599)", { dat <- data.frame(grp = rep(1:4, each = 2), val = c(NA, 2, 3:8)) Mean <- function(x, thresh = 2) { res <- mean(x, na.rm = TRUE) if (res > thresh) res else NA } res <- dat %>% group_by(grp) %>% summarise(val = Mean(val, thresh = 2)) expect_is(res$val, "numeric") expect_true(is.na(res$val[1])) }) test_that("summarise handles list output columns (#832)", { df <- data_frame(x = 1:10, g = rep(1:2, each = 5)) res <- df %>% group_by(g) %>% summarise(y = list(x)) expect_equal(res$y[[1]], 1:5) expect_equal(res$y[[2]], 6:10) # just checking objects are not messed up internally expect_equal(gp(res$y[[1]]), 0L) expect_equal(gp(res$y[[2]]), 0L) res <- df %>% group_by(g) %>% summarise(y = list(x + 1)) expect_equal(res$y[[1]], 1:5 + 1) expect_equal(res$y[[2]], 6:10 + 1) # just checking objects are not messed up internally expect_equal(gp(res$y[[1]]), 0L) expect_equal(gp(res$y[[2]]), 0L) df <- data_frame(x = 1:10, g = rep(1:2, each = 5)) res <- df %>% summarise(y = list(x)) expect_equal(res$y[[1]], 1:10) res <- df %>% summarise(y = list(x + 1)) expect_equal(res$y[[1]], 1:10 + 1) }) test_that("summarise works with empty data frame (#1142)", { df <- data.frame() res <- df %>% summarise expect_equal(nrow(res), 0L) expect_equal(length(res), 0L) }) test_that("n_distint uses na.rm argument", { df <- data.frame(x = c(1:3, NA), g = rep(1:2, 2)) res <- summarise(df, n = n_distinct(x, na.rm = TRUE)) expect_equal(res$n, 3L) res <- group_by(df, g) %>% summarise(n = n_distinct(x, na.rm = TRUE)) expect_equal(res$n, c(2L, 1L)) }) test_that("n_distinct front end supports na.rm argument (#1052)", { x <- c(1:3, NA) expect_equal(n_distinct(x, na.rm = TRUE), 3L) }) test_that("n_distinct without arguments stops (#1957)", { expect_error( n_distinct(), "Need at least one column for `n_distinct()`", fixed = TRUE ) }) test_that("hybrid evaluation does not take place for objects with a class (#1237)", { mean.foo <- function(x) 42 df <- data_frame(x = structure(1:10, class = "foo")) expect_equal(summarise(df, m = mean(x))$m[1], 42) env <- environment() Foo <- suppressWarnings(setClass("Foo", contains = "numeric", where = env)) suppressMessages(setMethod("mean", "Foo", function(x, ...) 42, where = env)) on.exit(removeClass("Foo", where = env)) df <- data.frame(x = Foo(c(1, 2, 3))) expect_equal(summarise(df, m = mean(x))$m[1], 42) }) test_that("summarise handles promotion of results (#893)", { df <- structure(list( price = c(580L, 650L, 630L, 706L, 1080L, 3082L, 3328L, 4229L, 1895L, 3546L, 752L, 13003L, 814L, 6115L, 645L, 3749L, 2926L, 765L, 1140L, 1158L), cut = structure(c(2L, 4L, 4L, 2L, 3L, 2L, 2L, 3L, 4L, 1L, 1L, 3L, 2L, 4L, 3L, 3L, 1L, 2L, 2L, 2L), .Label = c("Good", "Ideal", "Premium", "Very Good"), class = "factor")), row.names = c(NA, -20L), .Names = c("price", "cut"), class = "data.frame" ) res <- df %>% group_by(cut) %>% select(price) %>% summarise(price = median(price)) expect_is(res$price, "numeric") }) test_that("summarise correctly handles logical (#1291)", { test <- expand.grid(id = 1:2, type = letters[1:2], sample = 1:2) %>% mutate(var = c(1, 0, 1, 1, 0, 0, 0, 1)) %>% mutate(var_l = as.logical(var)) %>% mutate(var_ch = as.character(var_l)) %>% arrange(id, type, sample) %>% group_by(id, type) test_sum <- test %>% ungroup() %>% group_by(id, type) %>% summarise( anyvar = any(var == 1), anyvar_l = any(var_l), anyvar_ch = any(var_ch == "TRUE") ) expect_equal(test_sum$anyvar, c(TRUE, TRUE, FALSE, TRUE)) }) test_that("summarise correctly handles NA groups (#1261)", { tmp <- data_frame( a = c(1, 1, 1, 2, 2), b1 = NA_integer_, b2 = NA_character_ ) res <- tmp %>% group_by(a, b1) %>% summarise(n()) expect_equal(nrow(res), 2L) res <- tmp %>% group_by(a, b2) %>% summarise(n()) expect_equal(nrow(res), 2L) }) test_that("n_distinct handles multiple columns (#1084)", { df <- data.frame( x = rep(1:4, each = 2), y = rep(1:2, each = 4), g = rep(1:2, 4) ) res <- summarise(df, n = n_distinct(x, y)) expect_equal(res$n, 4L) res <- group_by(df, g) %>% summarise(n = n_distinct(x, y)) expect_equal(res$n, c(4L, 4L)) df$x[3] <- df$y[7] <- NA res <- summarise(df, n = n_distinct(x, y)) expect_equal(res$n, 6L) res <- summarise(df, n = n_distinct(x, y, na.rm = TRUE)) expect_equal(res$n, 4L) res <- group_by(df, g) %>% summarise(n = n_distinct(x, y)) expect_equal(res$n, c(4L, 4L)) res <- group_by(df, g) %>% summarise(n = n_distinct(x, y, na.rm = TRUE)) expect_equal(res$n, c(2L, 4L)) }) test_that("hybrid max works when not used on columns (#1369)", { df <- data_frame(x = 1:1000) y <- 1:10 expect_equal(summarise(df, z = max(y))$z, 10) expect_equal(summarise(df, z = max(10))$z, 10) }) test_that("min and max handle empty sets in summarise (#1481)", { df <- data_frame(A = numeric()) res <- df %>% summarise(Min = min(A, na.rm = TRUE), Max = max(A, na.rm = TRUE)) expect_equal(res$Min, Inf) expect_equal(res$Max, -Inf) }) test_that("lead and lag behave correctly in summarise (#1434)", { res <- mtcars %>% group_by(cyl) %>% summarise( n = n(), leadn = lead(n), lagn = lag(n), leadn10 = lead(n, default = 10), lagn10 = lag(n, default = 10) ) expect_true(all(is.na(res$lagn))) expect_true(all(is.na(res$leadn))) expect_true(all(res$lagn10 == 10)) expect_true(all(res$leadn10 == 10)) res <- mtcars %>% rowwise() %>% summarise( n = n(), leadn = lead(n), lagn = lag(n), leadn10 = lead(n, default = 10), lagn10 = lag(n, default = 10) ) expect_true(all(is.na(res$lagn))) expect_true(all(is.na(res$leadn))) expect_true(all(res$lagn10 == 10)) expect_true(all(res$leadn10 == 10)) }) # .data and .env tests now in test-hybrid-traverse.R test_that("data.frame columns are supported in summarise (#1425)", { df <- data.frame(x1 = rep(1:3, times = 3), x2 = 1:9) df$x3 <- df %>% mutate(x3 = x2) res <- df %>% group_by(x1) %>% summarise(nr = nrow(x3)) expect_true(all(res$nr == 3)) }) test_that("summarise handles min/max of already summarised variable (#1622)", { df <- data.frame( FIRST_DAY = rep(seq(as.POSIXct("2015-12-01", tz = "UTC"), length.out = 2, by = "days"), 2), event = c("a", "a", "b", "b") ) df_summary <- df %>% group_by(event) %>% summarise(FIRST_DAY = min(FIRST_DAY), LAST_DAY = max(FIRST_DAY)) expect_equal(df_summary$FIRST_DAY, df_summary$LAST_DAY) }) test_that("group_by keeps classes (#1631)", { df <- data.frame(a = 1, b = as.Date(NA)) %>% group_by(a) %>% summarize(c = min(b)) expect_equal(class(df$c), "Date") df <- data.frame(a = 1, b = as.POSIXct(NA)) %>% group_by(a) %>% summarize(c = min(b)) expect_equal(class(df$c), c("POSIXct", "POSIXt")) }) test_that("hybrid n_distinct falls back to R evaluation when needed (#1657)", { dat3 <- data.frame(id = c(2, 6, 7, 10, 10)) res <- dat3 %>% summarise(n_unique = n_distinct(id[id > 6])) expect_equal(res$n_unique, 2) }) test_that("summarise() correctly coerces factors with different levels (#1678)", { res <- data_frame(x = 1:3) %>% group_by(x) %>% summarise( y = if (x == 1) "a" else "b", z = factor(y) ) expect_is(res$z, "factor") expect_equal(levels(res$z), c("a", "b")) expect_equal(as.character(res$z), c("a", "b", "b")) }) test_that("summarise works if raw columns exist but are not involved (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_equal(summarise(df, c = sum(a)), data_frame(c = 6L)) }) test_that("summarise fails gracefully on raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( summarise(df, c = b[[1]]), "Column `c` is of unsupported type raw vector", fixed = TRUE ) }) test_that("dim attribute is stripped from grouped summarise (#1918)", { df <- data.frame(a = 1:3, b = 1:3) df_regular <- summarise(df, b = scale(b)[1, 1]) df_grouped <- summarise(group_by(df, a), b = scale(b)) df_rowwise <- summarise(rowwise(df), b = scale(b)) expect_null(dim(df$b)) expect_null(dim(df_grouped$b)) expect_null(dim(df_rowwise$b)) }) test_that("typing and NAs for grouped summarise (#1839)", { expect_identical( data_frame(id = 1L, a = NA_character_) %>% group_by(id) %>% summarise(a = a[[1]]) %>% .$a, NA_character_) expect_identical( data_frame(id = 1:2, a = c(NA, "a")) %>% group_by(id) %>% summarise(a = a[[1]]) %>% .$a, c(NA, "a")) # Properly upgrade NA (logical) to character expect_identical( data_frame(id = 1:2, a = 1:2) %>% group_by(id) %>% summarise(a = ifelse(all(a < 2), NA, "yes")) %>% .$a, c(NA, "yes")) expect_error( data_frame(id = 1:2, a = list(1, "2")) %>% group_by(id) %>% summarise(a = a[[1]]) %>% .$a, "Column `a` can't promote group 1 to numeric", fixed = TRUE ) expect_identical( data_frame(id = 1:2, a = list(1, "2")) %>% group_by(id) %>% summarise(a = a[1]) %>% .$a, list(1, "2")) }) test_that("typing and NAs for rowwise summarise (#1839)", { expect_identical( data_frame(id = 1L, a = NA_character_) %>% rowwise %>% summarise(a = a[[1]]) %>% .$a, NA_character_) expect_identical( data_frame(id = 1:2, a = c(NA, "a")) %>% rowwise %>% summarise(a = a[[1]]) %>% .$a, c(NA, "a")) # Properly promote NA (logical) to character expect_identical( data_frame(id = 1:2, a = 1:2) %>% group_by(id) %>% summarise(a = ifelse(all(a < 2), NA, "yes")) %>% .$a, c(NA, "yes")) expect_error( data_frame(id = 1:2, a = list(1, "2")) %>% rowwise %>% summarise(a = a[[1]]) %>% .$a, "Column `a` can't promote group 1 to numeric", fixed = TRUE ) expect_error( data_frame(id = 1:2, a = list(1, "2")) %>% rowwise %>% summarise(a = a[1]) %>% .$a, "Column `a` can't promote group 1 to numeric", fixed = TRUE ) }) test_that("calculating an ordered factor preserves order (#2200)", { test_df <- tibble( id = c("a", "b"), val = 1:2 ) ret <- group_by(test_df, id) %>% summarize(level = ordered(val)) expect_s3_class(ret$level, "ordered") expect_equal(levels(ret$level), c("1", "2")) }) test_that("min, max preserves ordered factor data (#2200)", { test_df <- tibble( id = rep(c("a", "b"), 2), ord = ordered(c("A", "B", "B", "A"), levels = c("A", "B")) ) ret <- group_by(test_df, id) %>% summarize( min_ord = min(ord), max_ord = max(ord) ) expect_s3_class(ret$min_ord, "ordered") expect_s3_class(ret$max_ord, "ordered") expect_equal(levels(ret$min_ord), levels(test_df$ord)) expect_equal(levels(ret$max_ord), levels(test_df$ord)) }) test_that("ungrouped summarise() uses summary variables correctly (#2404)", { df <- tibble::as_tibble(seq(1:10)) out <- df %>% summarise(value = mean(value), sd = sd(value)) expect_equal(out$value, 5.5) expect_equal(out$sd, NA_real_) }) test_that("proper handling of names in summarised list columns (#2231)", { d <- data_frame(x = rep(1:3, 1:3), y = 1:6, names = letters[1:6]) res <- d %>% group_by(x) %>% summarise(y = list(setNames(y, names))) expect_equal(names(res$y[[1]]), letters[[1]]) expect_equal(names(res$y[[2]]), letters[2:3]) expect_equal(names(res$y[[3]]), letters[4:6]) }) test_that("proper handling of NA factors (#2588)", { df <- tibble( x = c(1, 1, 2, 2, 3, 3), y = factor(c(NA, NA, NA, "2", "3", "3")) ) ret <- df %>% group_by(x) %>% summarise(y = y[1]) expect_identical(as.character(ret$y), c(NA, NA, "3")) }) test_that("can refer to previously summarised symbols", { expect_identical(summarise(group_by(mtcars, cyl), x = 1, z = x)[2:3], tibble(x = c(1, 1, 1), z = x)) expect_identical(summarise(group_by(mtcars, cyl), x = n(), z = x)[2:3], tibble(x = c(11L, 7L, 14L), z = x)) }) test_that("can refer to symbols if group size is one overall", { df <- tibble(x = LETTERS[3:1], y = 1:3) expect_identical( df %>% group_by(x) %>% summarise(z = y), tibble(x = LETTERS[1:3], z = 3:1) ) }) test_that("summarise() supports unquoted values", { df <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5) expect_identical(summarise(df, out = !! 1), tibble(out = 1)) expect_identical(summarise(df, out = !! quote(identity(1))), tibble(out = 1)) expect_error(summarise(df, out = !! 1:2), "must be length 1 (the number of groups)", fixed = TRUE) expect_error(summarise(df, out = !! env(a = 1)), "unsupported type") gdf <- group_by(df, g) expect_identical(summarise(gdf, out = !! 1), summarise(gdf, out = 1)) expect_identical(summarise(gdf, out = !! 1:2), tibble(g = c(1, 2), out = 1:2)) expect_identical(summarise(gdf, out = !! quote(identity(1))), summarise(gdf, out = 1)) expect_error(summarise(gdf, out = !! 1:5), "must be length 2 (the number of groups)", fixed = TRUE) expect_error(summarise(gdf, out = !! env(a = 1)), "unsupported type") }) dplyr/tests/testthat/test-funs.R0000644000176200001440000000331613135665123016471 0ustar liggesuserscontext("funs") test_that("fun_list is merged with new args", { funs <- funs(fn = bar) funs <- as_fun_list(funs, quo(bar), env(), baz = "baz") expect_identical(funs$fn, quo(bar(., baz = "baz"))) }) test_that("funs() works with namespaced calls", { expect_identical(summarise_all(mtcars, funs(base::mean(.))), summarise_all(mtcars, funs(mean(.)))) expect_identical(summarise_all(mtcars, funs(base::mean)), summarise_all(mtcars, funs(mean(.)))) }) test_that("funs() accepts quoted functions", { expect_identical(funs(mean), funs("mean")) }) test_that("funs() accepts unquoted functions", { funs <- funs(fn = !! mean) expect_identical(funs$fn, new_quosure(lang(base::mean, quote(.)))) }) test_that("funs() accepts quoted calls", { expect_identical(funs(mean), funs(mean(.))) }) test_that("funs() can be merged with new arguments", { fns <- funs(foo(.)) expect_identical(as_fun_list(fns, ~NULL, get_env(), foo = 1L), funs(foo(., foo = 1L))) }) enfun <- function(.funs, ...) { as_fun_list(.funs, enquo(.funs), caller_env(), ...) } test_that("can enfun() literal functions", { expect_identical(enfun(identity(mean)), funs(!! mean)) }) test_that("can enfun() named functions by expression", { expect_identical(enfun(mean), funs(mean(.))) }) test_that("local objects are not treated as symbols", { mean <- funs(my_mean(.)) expect_identical(enfun(mean), mean) }) test_that("can enfun() character vectors", { expect_identical(enfun(c("min", "max")), funs(min, max)) }) test_that("can enfun() quosures", { expect_identical(enfun(quo(mean(.))), funs(mean(.))) }) test_that("can enfun() purrr-style lambdas", { my_mean <- as_function(~mean(.x)) expect_identical(enfun(~mean(.x)), funs(!! my_mean)) }) dplyr/tests/testthat/test-group-indices.R0000644000176200001440000000212113153520575020260 0ustar liggesuserscontext("Group indices") test_that("group_indices from ungrouped or grouped gives same result", { res1 <- group_indices(mtcars, cyl, vs, am) res2 <- mtcars %>% group_by(cyl, vs, am) %>% group_indices() expect_equal(res1, res2) }) test_that("group_indices handles the case where no variable is given (#867)", { res <- group_indices(mtcars) expect_true(all(res == 1L)) }) test_that("group_indices handles grouped data and no arguments", { res1 <- mtcars %>% group_by(cyl) %>% group_indices() res2 <- mtcars %>% group_indices(cyl) expect_equal(res1, res2) }) test_that("group_indices can be used in mutate (#2160)", { res1 <- mtcars %>% mutate(., group_idx = group_indices(., cyl)) res2 <- mtcars %>% mutate(group_idx = as.integer(factor(cyl))) expect_equal(res1, res2) }) test_that("group indices are updated correctly for joined grouped data frames (#2330)", { d1 <- data.frame(x = 1:2, y = 1:2) %>% group_by(x, y) expect_equal(group_indices(d1), d1$x) d2 <- expand.grid(x = 1:2, y = 1:2) res <- inner_join(d1, d2, by = "x") expect_equal(group_indices(res), res$x) }) dplyr/tests/testthat/test-select-helpers.R0000644000176200001440000002226713153520575020444 0ustar liggesuserscontext("select-helpers") test_that("no set variables throws error", { expect_error( starts_with("z"), "Variable context not set", fixed = TRUE ) }) test_that("failed match removes all columns", { old <- set_current_vars(c("x", "y")) on.exit(set_current_vars(old)) expect_equal(starts_with("z"), integer(0)) expect_equal(ends_with("z"), integer(0)) expect_equal(contains("z"), integer(0)) expect_equal(matches("z"), integer(0)) expect_equal(num_range("z", 1:3), integer(0)) }) test_that("matches return integer positions", { old <- set_current_vars(c("abc", "acd", "bbc", "bbd", "eee")) on.exit(set_current_vars(old)) expect_equal(starts_with("a"), c(1L, 2L)) expect_equal(ends_with("d"), c(2L, 4L)) expect_equal(contains("eee"), 5L) expect_equal(matches(".b."), c(1L, 3L, 4L)) }) test_that("throws with empty pattern is provided", { # error messages from rlang expect_error(starts_with("")) expect_error(ends_with("")) expect_error(contains("")) expect_error(matches("")) }) test_that("can use a variable", { vars <- "x" names(vars) <- vars expect_equal(select_vars(vars, starts_with(vars)), c(x = "x")) expect_equal(select_vars(vars, ends_with(vars)), c(x = "x")) expect_equal(select_vars(vars, contains(vars)), c(x = "x")) expect_equal(select_vars(vars, matches(vars)), c(x = "x")) }) test_that("can use a variable even if it exists in the data (#2266)", { vars <- c("x", "y") names(vars) <- vars y <- "x" expected_result <- c(x = "x") expect_equal(select_vars(vars, starts_with(y)), expected_result) expect_equal(select_vars(vars, ends_with(y)), expected_result) expect_equal(select_vars(vars, contains(y)), expected_result) expect_equal(select_vars(vars, matches(y)), expected_result) }) test_that("num_range selects numeric ranges", { vars <- c("x1", "x2", "x01", "x02", "x10", "x11") names(vars) <- vars expect_equal(select_vars(vars, num_range("x", 1:2)), vars[1:2]) expect_equal(select_vars(vars, num_range("x", 1:2, width = 2)), vars[3:4]) expect_equal(select_vars(vars, num_range("x", 10:11)), vars[5:6]) expect_equal(select_vars(vars, num_range("x", 10:11, width = 2)), vars[5:6]) }) test_that("position must resolve to numeric variables throws error", { expect_error( select_vars(letters, !! list()), 'must resolve to integer column positions', fixed = TRUE ) }) # one_of ------------------------------------------------------------------ test_that("one_of gives useful errors", { expect_error( one_of(1L, vars = c("x", "y")), "All arguments must be character vectors, not integer", fixed = TRUE ) }) test_that("one_of tolerates but warns for unknown variables", { vars <- c("x", "y") expect_warning(res <- one_of("z", vars = vars), "Unknown variables: `z`") expect_equal(res, integer(0)) expect_warning(res <- one_of(c("x", "z"), vars = vars), "Unknown variables: `z`") expect_equal(res, 1L) }) test_that("one_of converts names to positions", { expect_equal(one_of("a", "z", vars = letters), c(1L, 26L)) }) test_that("one_of works with variables", { vars <- c("x", "y") expected_result <- c(x = "x") var <- "x" expect_equal(select_vars(vars, one_of(var)), expected_result) # error messages from rlang expect_error(select_vars(vars, one_of(`_x`)), "not found") expect_error(select_vars(vars, one_of(`_y`)), "not found") }) test_that("one_of works when passed variable name matches the column name (#2266)", { vars <- c("x", "y") expected_result <- c(x = "x") x <- "x" y <- "x" expect_equal(select_vars(vars, one_of(!! x)), expected_result) expect_equal(select_vars(vars, one_of(!! y)), expected_result) expect_equal(select_vars(vars, one_of(y)), expected_result) }) # first-selector ---------------------------------------------------------- test_that("initial (single) selector defaults correctly (issue #2275)", { cn <- setNames(nm = c("x", "y", "z")) ### Single Column Selected # single columns (present), explicit expect_equal(select_vars(cn, x), cn["x"]) expect_equal(select_vars(cn, -x), cn[c("y", "z")]) # single columns (present), matched expect_equal(select_vars(cn, contains("x")), cn["x"]) expect_equal(select_vars(cn, -contains("x")), cn[c("y", "z")]) # single columns (not present), explicit expect_error(select_vars(cn, foo), "not found") expect_error(select_vars(cn, -foo), "not found") # single columns (not present), matched expect_equal(select_vars(cn, contains("foo")), cn[integer()]) expect_equal(select_vars(cn, -contains("foo")), cn) }) test_that("initial (of multiple) selectors default correctly (issue #2275)", { cn <- setNames(nm = c("x", "y", "z")) ### Multiple Columns Selected # explicit(present) + matched(present) expect_equal(select_vars(cn, x, contains("y")), cn[c("x", "y")]) expect_equal(select_vars(cn, x, -contains("y")), cn["x"]) expect_equal(select_vars(cn, -x, contains("y")), cn[c("y", "z")]) expect_equal(select_vars(cn, -x, -contains("y")), cn["z"]) # explicit(present) + matched(not present) expect_equal(select_vars(cn, x, contains("foo")), cn["x"]) expect_equal(select_vars(cn, x, -contains("foo")), cn["x"]) expect_equal(select_vars(cn, -x, contains("foo")), cn[c("y", "z")]) expect_equal(select_vars(cn, -x, -contains("foo")), cn[c("y", "z")]) # matched(present) + explicit(present) expect_equal(select_vars(cn, contains("x"), y), cn[c("x", "y")]) expect_equal(select_vars(cn, contains("x"), -y), cn["x"]) expect_equal(select_vars(cn, -contains("x"), y), cn[c("y", "z")]) expect_equal(select_vars(cn, -contains("x"), -y), cn["z"]) # matched(not present) + explicit(not present) expect_error(select_vars(cn, contains("foo"), bar), "object 'bar' not found") expect_error(select_vars(cn, contains("foo"), -bar), "object 'bar' not found") expect_error(select_vars(cn, -contains("foo"), bar), "object 'bar' not found") expect_error(select_vars(cn, -contains("foo"), -bar), "object 'bar' not found") # matched(present) + matched(present) expect_equal(select_vars(cn, contains("x"), contains("y")), cn[c("x", "y")]) expect_equal(select_vars(cn, contains("x"), -contains("y")), cn["x"]) expect_equal(select_vars(cn, -contains("x"), contains("y")), cn[c("y", "z")]) expect_equal(select_vars(cn, -contains("x"), -contains("y")), cn["z"]) # matched(present) + matched(not present) expect_equal(select_vars(cn, contains("x"), contains("foo")), cn["x"]) expect_equal(select_vars(cn, contains("x"), -contains("foo")), cn["x"]) expect_equal(select_vars(cn, -contains("x"), contains("foo")), cn[c("y", "z")]) expect_equal(select_vars(cn, -contains("x"), -contains("foo")), cn[c("y", "z")]) # matched(not present) + matched(present) expect_equal(select_vars(cn, contains("foo"), contains("x")), cn["x"]) expect_equal(select_vars(cn, contains("foo"), -contains("x")), cn[integer()]) expect_equal(select_vars(cn, -contains("foo"), contains("x")), cn) expect_equal(select_vars(cn, -contains("foo"), -contains("x")), cn[c("y", "z")]) # matched(not present) + matched(not present) expect_equal(select_vars(cn, contains("foo"), contains("bar")), cn[integer()]) expect_equal(select_vars(cn, contains("foo"), -contains("bar")), cn[integer()]) expect_equal(select_vars(cn, -contains("foo"), contains("bar")), cn) expect_equal(select_vars(cn, -contains("foo"), -contains("bar")), cn) }) test_that("middle (no-match) selector should not clear previous selectors (issue #2275)", { cn <- setNames(nm = c("x", "y", "z")) expect_equal( select_vars(cn, contains("x"), contains("foo"), contains("z")), cn[c("x", "z")] ) expect_equal( select_vars(cn, contains("x"), -contains("foo"), contains("z")), cn[c("x", "z")] ) }) test_that("can select with c() (#2685)", { expect_identical(select_vars(letters, c(a, z)), c(a = "a", z = "z")) }) test_that("can select with .data pronoun (#2715)", { expect_identical(select_vars("foo", .data$foo), c(foo = "foo")) expect_identical(select_vars("foo", .data[["foo"]]), c(foo = "foo")) expect_identical(select_vars(c("a", "b", "c"), .data$a : .data$b), c(a = "a", b = "b")) expect_identical(select_vars(c("a", "b", "c"), .data[["a"]] : .data[["b"]]), c(a = "a", b = "b")) }) # rename_vars ------------------------------------------------------------- test_that("when strict = FALSE, rename_vars always succeeds", { expect_error( rename_vars(c("a", "b"), d = e, strict = TRUE), "`e` contains unknown variables", fixed = TRUE ) expect_equal( rename_vars(c("a", "b"), d = e, strict = FALSE), c("a" = "a", "b" = "b") ) }) test_that("rename_vars() expects symbol or string", { expect_error( rename_vars(letters, d = 1), '`d` = 1 must be a symbol or a string', fixed = TRUE ) }) # tbl_at_vars ------------------------------------------------------------- test_that("tbl_at_vars() errs on bad input", { expect_error( tbl_at_vars(iris, raw(3)), "`.vars` must be a character/numeric vector or a `vars()` object, not raw", fixed = TRUE ) }) # tbl_if_vars ------------------------------------------------------------- test_that("tbl_if_vars() errs on bad input", { expect_error( tbl_if_vars(iris, funs(identity, force), environment()), "`.predicate` must have length 1, not 2", fixed = TRUE ) }) dplyr/tests/testthat/test-na-if.R0000644000176200001440000000066013135665123016507 0ustar liggesuserscontext("na_if") test_that("error for bad y length", { expect_error( na_if(1:3, 1:2), "`y` must be length 3 (same as `x`) or one, not 2", fixed = TRUE ) expect_error( na_if(1, 1:2), "`y` must be length 1 (same as `x`), not 2", fixed = TRUE ) }) test_that("scalar y replaces all matching x", { x <- c(0, 1, 0) expect_equal(na_if(x, 0), c(NA, 1, NA)) expect_equal(na_if(x, 1), c(0, NA, 0)) }) dplyr/tests/testthat/test-group-size.R0000644000176200001440000000177413153520575017631 0ustar liggesuserscontext("Group sizes") test_that("ungrouped data has 1 group, with group size = nrow()", { df <- tibble(x = rep(1:3, each = 10), y = rep(1:6, each = 5)) expect_equal(n_groups(df), 1L) expect_equal(group_size(df), 30) }) test_that("rowwise data has one group for each group", { rw <- rowwise(mtcars) expect_equal(n_groups(rw), 32) expect_equal(group_size(rw), rep(1, 32)) }) test_that("group_size correct for grouped data", { df <- tibble(x = rep(1:3, each = 10), y = rep(1:6, each = 5)) %>% group_by(x) expect_equal(n_groups(df), 3L) expect_equal(group_size(df), rep(10, 3)) }) # For following tests, add an extra level that's not present in data test_that("n_groups drops zero-length groups", { df <- tibble(x = factor(1:3, levels = 1:4)) %>% group_by(x) expect_equal(n_groups(df), 3) }) test_that("summarise drops zero-length groups", { df <- tibble(x = factor(rep(1:3, each = 10))) out <- df %>% group_by(x) %>% summarise(n = n()) expect_equal(out$n, c(10L, 10L, 10L)) }) dplyr/tests/testthat/test-copy_to.R0000644000176200001440000000306013153520575017167 0ustar liggesuserscontext("copy_to") test_that("src_local only overwrites if overwrite = TRUE", { env <- new.env(parent = emptyenv()) env$x <- 1 src_env <- src_df(env = env) expect_error( copy_to(src_env, tibble(x = 1), name = "x"), "object with `name` = `x` must not already exist, unless `overwrite` = TRUE", fixed = TRUE ) df <- tibble(x = 1) copy_to(src_env, df, name = "x", overwrite = TRUE) expect_equal(env$x, df) }) test_that("src_local errs with pkg/env", { expect_error( src_df("base", new.env()), "Exactly one of `pkg` and `env` must be non-NULL, not 2", fixed = TRUE ) expect_error( src_df(), "Exactly one of `pkg` and `env` must be non-NULL, not 0", fixed = TRUE ) }) test_that("auto_copy() requires same source", { skip_if_not_installed("dbplyr") env <- new.env(parent = emptyenv()) env$iris <- iris src_iris <- src_df(env = env) src_mtcars <- src_sqlite(":memory:", create = TRUE) copy_to(src_mtcars, mtcars, "mtcars") expect_error( auto_copy(tbl(src_iris, "iris"), src_mtcars, name = "iris"), "`x` and `y` must share the same src, set `copy` = TRUE (may be slow)", fixed = TRUE ) expect_error( auto_copy(tbl(src_mtcars, "mtcars"), src_iris, name = "mtcars"), "`x` and `y` must share the same src, set `copy` = TRUE (may be slow)", fixed = TRUE ) }) test_that("src_sqlite() errs if path does not exist", { skip_if_not_installed("dbplyr") expect_error( src_sqlite(":memory:"), "`path` must not already exist, unless `create` = TRUE", fixed = TRUE ) }) dplyr/tests/testthat/test-colwise-group-by.R0000644000176200001440000000121013135665123020714 0ustar liggesuserscontext("colwise group_by") test_that("group_by_ verbs take scoped inputs", { expect_identical(group_vars(group_by_all(mtcars)), names(mtcars)) expect_identical(group_vars(group_by_at(mtcars, vars(starts_with("d")))), c("disp", "drat")) expect_identical(group_vars(group_by_if(iris, is.factor)), "Species") }) test_that("group_by_ verbs accept optional operations", { df <- data_frame(x = 1:2, y = 2:3) gdf <- group_by(mutate_all(df, as.factor), x, y) expect_identical(group_by_all(df, as.factor), gdf) expect_identical(group_by_if(df, is_integer, as.factor), gdf) expect_identical(group_by_at(df, vars(x:y), as.factor), gdf) }) dplyr/tests/testthat/test-copying.R0000644000176200001440000000155113102155231017151 0ustar liggesuserscontext("Copying") test_that("coercion doesn't copy vars", { mtcars2 <- tbl_df(mtcars) mtcars3 <- as.data.frame(mtcars2) expect_equal(location(mtcars2)$vars, location(mtcars)$vars) expect_equal(location(mtcars3)$vars, location(mtcars)$vars) }) test_that("grouping and ungrouping doesn't copy vars", { mtcars2 <- group_by(mtcars, cyl) mtcars3 <- ungroup(mtcars2) expect_equal(location(mtcars2)$vars, location(mtcars)$vars) expect_equal(location(mtcars3)$vars, location(mtcars)$vars) }) test_that("mutate doesn't copy vars", { mtcars2 <- tbl_df(mtcars) mtcars3 <- mutate(mtcars2, cyl2 = cyl * 2) expect_equal(location(mtcars3)$vars[1:11], location(mtcars2)$vars) }) test_that("select doesn't copy vars", { mtcars2 <- tbl_df(mtcars) mtcars3 <- select(mtcars2, carb:mpg) expect_equal(location(mtcars3)$vars[11:1], location(mtcars2)$vars) }) dplyr/tests/testthat/test-window.R0000644000176200001440000000302713153626671017031 0ustar liggesuserscontext("Window functions") test_that("If n = 0, lead and lag return x", { expect_equal(lead(1:2, 0), 1:2) expect_equal(lag(1:2, 0), 1:2) }) test_that("If n = length(x), returns all missing", { miss <- rep(NA_integer_, 2) expect_equal(lead(1:2, 2), miss) expect_equal(lag(1:2, 2), miss) }) test_that("cumany handles NA (#408)", { batman <- c(NA, NA, NA, NA, NA) expect_true(all(is.na(cumany(batman)))) expect_true(all(is.na(cumall(batman)))) x <- c(FALSE, NA) expect_true(all(!cumall(x))) x <- c(TRUE, NA) expect_true(all(cumany(x))) }) test_that("percent_rank ignores NAs (#1132)", { expect_equal(percent_rank(c(1:3, NA)), c(0, 0.5, 1, NA)) }) test_that("cume_dist ignores NAs (#1132)", { expect_equal(cume_dist(c(1:3, NA)), c(1 / 3, 2 / 3, 1, NA)) }) test_that("cummean is not confused by FP error (#1387)", { a <- rep(99, 9) expect_true(all(cummean(a) == a)) }) test_that("order_by() returns correct value", { expected <- int(15, 14, 12, 9, 5) expect_identical(order_by(5:1, cumsum(1:5)), expected) x <- 5:1; y <- 1:5 expect_identical(order_by(x, cumsum(y)), expected) }) test_that("order_by() works in arbitrary envs (#2297)", { env <- child_env("base") expect_equal( with_env(env, dplyr::order_by(5:1, cumsum(1:5))), rev(cumsum(rev(1:5))) ) expect_equal( order_by(5:1, cumsum(1:5)), rev(cumsum(rev(1:5))) ) }) test_that("order_by() fails when not supplied a call (#3065)", { expect_error(order_by(NULL, !! 1L), "`call` must be a function call, not an integer vector") }) dplyr/tests/testthat/test-filter.r0000644000176200001440000002202313153520575017040 0ustar liggesuserscontext("Filter") test_that("filter fails if inputs incorrect length (#156)", { expect_error( filter(tbl_df(mtcars), c(F, T)), "Result must have length 32, not 2", fixed = TRUE ) expect_error( filter(group_by(mtcars, am), c(F, T)), "Result must have length 19, not 2", fixed = TRUE ) }) test_that("filter gives useful error message when given incorrect input", { # error message by rlang expect_error(filter(tbl_df(mtcars), `_x`), "_x", fixed = TRUE ) }) test_that("filter complains in inputs are named", { expect_error( filter(mtcars, x = 1), "`x` (`x = 1`) must not be named, do you need `==`?", fixed = TRUE ) expect_error( filter(mtcars, x = 1 & y > 2), "`x` (`x = 1 & y > 2`) must not be named, do you need `==`?", fixed = TRUE ) }) test_that("filter handles passing ...", { df <- data.frame(x = 1:4) f <- function(...) { x1 <- 4 f1 <- function(y) y filter(df, ..., f1(x1) > x) } g <- function(...) { x2 <- 2 f(x > x2, ...) } res <- g() expect_equal(res$x, 3L) df <- group_by(df, x) res <- g() expect_equal(res$x, 3L) }) test_that("filter handles simple symbols", { df <- data.frame(x = 1:4, test = rep(c(T, F), each = 2)) res <- filter(df, test) gdf <- group_by(df, x) res <- filter(gdf, test) h <- function(data) { test2 <- c(T, T, F, F) filter(data, test2) } expect_equal(h(df), df[1:2, ]) f <- function(data, ...) { one <- 1 filter(data, test, x > one, ...) } g <- function(data, ...) { four <- 4 f(data, x < four, ...) } res <- g(df) expect_equal(res$x, 2L) expect_equal(res$test, TRUE) res <- g(gdf) expect_equal(res$x, 2L) expect_equal(res$test, TRUE) }) test_that("filter handlers scalar results", { expect_equivalent(filter(mtcars, min(mpg) > 0), mtcars) expect_equal(filter(group_by(mtcars, cyl), min(mpg) > 0), group_by(mtcars, cyl)) }) test_that("filter propagates attributes", { date.start <- ISOdate(2010, 01, 01, 0) test <- data.frame(Date = ISOdate(2010, 01, 01, 1:10)) test2 <- test %>% filter(Date < ISOdate(2010, 01, 01, 5)) expect_equal(test$Date[1:4], test2$Date) }) test_that("filter fails on integer indices", { expect_error( filter(mtcars, 1:2), "Argument 2 filter condition does not evaluate to a logical vector", fixed = TRUE ) expect_error( filter(group_by(mtcars, cyl), 1:2), "Argument 2 filter condition does not evaluate to a logical vector", fixed = TRUE ) }) test_that("filter discards NA", { temp <- data.frame( i = 1:5, x = c(NA, 1L, 1L, 0L, 0L) ) res <- filter(temp, x == 1) expect_equal(nrow(res), 2L) }) test_that("date class remains on filter (#273)", { x1 <- x2 <- data.frame( date = seq.Date(as.Date("2013-01-01"), by = "1 days", length.out = 2), var = c(5, 8) ) x1.filter <- x1 %>% filter(as.Date(date) > as.Date("2013-01-01")) x2$date <- x2$date + 1 x2.filter <- x2 %>% filter(as.Date(date) > as.Date("2013-01-01")) expect_equal(class(x1.filter$date), "Date") expect_equal(class(x2.filter$date), "Date") }) test_that("filter handles $ correctly (#278)", { d1 <- tbl_df(data.frame( num1 = as.character(sample(1:10, 1000, T)), var1 = runif(1000), stringsAsFactors = FALSE)) d2 <- data.frame(num1 = as.character(1:3), stringsAsFactors = FALSE) res1 <- d1 %>% filter(num1 %in% c("1", "2", "3")) res2 <- d1 %>% filter(num1 %in% d2$num1) expect_equal(res1, res2) }) test_that("filter returns the input data if no parameters are given", { expect_equivalent(filter(mtcars), mtcars) }) test_that("$ does not end call traversing. #502", { # Suppose some analysis options are set much earlier in the script analysis_opts <- list(min_outcome = 0.25) # Generate some dummy data d <- expand.grid(Subject = 1:3, TrialNo = 1:2, Time = 1:3) %>% tbl_df %>% arrange(Subject, TrialNo, Time) %>% mutate(Outcome = (1:18 %% c(5, 7, 11)) / 10) # Do some aggregation trial_outcomes <- d %>% group_by(Subject, TrialNo) %>% summarise(MeanOutcome = mean(Outcome)) left <- filter(trial_outcomes, MeanOutcome < analysis_opts$min_outcome) right <- filter(trial_outcomes, analysis_opts$min_outcome > MeanOutcome) expect_equal(left, right) }) test_that("GroupedDataFrame checks consistency of data (#606)", { df1 <- data_frame( g = rep(1:2, each = 5), x = 1:10 ) %>% group_by(g) attr(df1, "group_sizes") <- c(2, 2) expect_error( df1 %>% filter(x == 1), "`.data` is a corrupt grouped_df, contains 10 rows, and 4 rows in groups", fixed = TRUE ) }) test_that("filter uses the white list (#566)", { datesDF <- read.csv(stringsAsFactors = FALSE, text = " X 2014-03-13 16:08:19 2014-03-13 16:16:23 2014-03-13 16:28:28 2014-03-13 16:28:54 ") datesDF$X <- as.POSIXlt(datesDF$X) # error message from tibble expect_error(filter(datesDF, X > as.POSIXlt("2014-03-13"))) }) test_that("filter handles complex vectors (#436)", { d <- data.frame(x = 1:10, y = 1:10 + 2i) expect_equal(filter(d, x < 4)$y, 1:3 + 2i) expect_equal(filter(d, Re(y) < 4)$y, 1:3 + 2i) }) test_that("%in% works as expected (#126)", { df <- data_frame(a = c("a", "b", "ab"), g = c(1, 1, 2)) res <- df %>% filter(a %in% letters) expect_equal(nrow(res), 2L) res <- df %>% group_by(g) %>% filter(a %in% letters) expect_equal(nrow(res), 2L) }) test_that("row_number does not segfault with example from #781", { z <- data.frame(a = c(1, 2, 3)) b <- "a" res <- z %>% filter(row_number(b) == 2) expect_equal(nrow(res), 0L) }) test_that("filter does not alter expression (#971)", { my_filter <- ~ am == 1 expect_equal(my_filter[[2]][[2]], as.name("am")) }) test_that("hybrid evaluation handles $ correctly (#1134)", { df <- data_frame(x = 1:10, g = rep(1:5, 2)) res <- df %>% group_by(g) %>% filter(x > min(df$x)) expect_equal(nrow(res), 9L) }) test_that("filter correctly handles empty data frames (#782)", { res <- data_frame() %>% filter(F) expect_equal(nrow(res), 0L) expect_equal(length(names(res)), 0L) }) test_that("filter(.,TRUE,TRUE) works (#1210)", { df <- data.frame(x = 1:5) res <- filter(df, TRUE, TRUE) expect_equal(res, df) }) test_that("filter, slice and arrange preserves attributes (#1064)", { df <- structure( data.frame(x = 1:10, g1 = rep(1:2, each = 5), g2 = rep(1:5, 2)), meta = "this is important" ) res <- filter(df, x < 5) %>% attr("meta") expect_equal(res, "this is important") res <- filter(df, x < 5, x > 4) %>% attr("meta") expect_equal(res, "this is important") res <- df %>% slice(1:50) %>% attr("meta") expect_equal(res, "this is important") res <- df %>% arrange(x) %>% attr("meta") expect_equal(res, "this is important") res <- df %>% summarise(n()) %>% attr("meta") expect_equal(res, "this is important") res <- df %>% group_by(g1) %>% summarise(n()) %>% attr("meta") expect_equal(res, "this is important") res <- df %>% group_by(g1, g2) %>% summarise(n()) %>% attr("meta") expect_equal(res, "this is important") }) test_that("filter works with rowwise data (#1099)", { df <- data_frame(First = c("string1", "string2"), Second = c("Sentence with string1", "something")) res <- df %>% rowwise() %>% filter(grepl(First, Second, fixed = TRUE)) expect_equal(nrow(res), 1L) expect_equal(df[1, ], res) }) test_that("grouped filter handles indices (#880)", { res <- iris %>% group_by(Species) %>% filter(Sepal.Length > 5) res2 <- mutate(res, Petal = Petal.Width * Petal.Length) expect_equal(nrow(res), nrow(res2)) expect_equal(attr(res, "indices"), attr(res2, "indices")) }) test_that("filter(FALSE) drops indices", { out <- mtcars %>% group_by(cyl) %>% filter(FALSE) %>% attr("indices") expect_identical(out, list()) }) test_that("filter handles S4 objects (#1366)", { env <- environment() Numbers <- suppressWarnings(setClass( "Numbers", slots = c(foo = "numeric"), contains = "integer", where = env )) on.exit(removeClass("Numbers", where = env)) df <- data.frame(x = Numbers(1:10, foo = 10)) res <- filter(df, x > 3) expect_true(isS4(res$x)) expect_is(res$x, "Numbers") expect_equal(res$x@foo, 10) }) test_that("hybrid lag and default value for string columns work (#1403)", { res <- mtcars %>% mutate(xx = LETTERS[gear]) %>% filter(xx == lag(xx, default = "foo")) xx <- LETTERS[mtcars$gear] ok <- xx == lag(xx, default = "foo") expect_equal(xx[ok], res$xx) res <- mtcars %>% mutate(xx = LETTERS[gear]) %>% filter(xx == lead(xx, default = "foo")) xx <- LETTERS[mtcars$gear] ok <- xx == lead(xx, default = "foo") expect_equal(xx[ok], res$xx) }) # .data and .env tests now in test-hybrid-traverse.R test_that("filter fails gracefully on raw columns (#1803)", { df <- data_frame(a = 1:3, b = as.raw(1:3)) expect_error( filter(df, a == 1), "Column `b` is of unsupported type raw", fixed = TRUE ) expect_error( filter(df, b == 1), "Column `b` is of unsupported type raw", fixed = TRUE ) }) test_that("`vars` attribute is not added if empty (#2772)", { expect_identical(tibble(x = 1:2) %>% filter(x == 1), tibble(x = 1L)) }) dplyr/tests/testthat/test-internals.r0000644000176200001440000000054713135665123017560 0ustar liggesuserscontext("internals") test_that("comparisons works as expected (#275)", { res <- test_comparisons() expect_true(all(res)) }) test_that("join_match() works as expected", { res <- test_matches() expect_true(all(unlist(res))) }) test_that("wrapping of length values works as expected", { res <- test_length_wrap() expect_true(all(res)) }) dplyr/tests/testthat/test-ts.R0000644000176200001440000000052313135665123016141 0ustar liggesuserscontext("ts") test_that("filter and lag throw errors", { x <- ts(1:10) expect_error( filter(x), "`.data` must be a data source, not a ts object, do you want `stats::filter()`?", fixed = TRUE ) expect_error( lag(x), "`x` must be a vector, not a ts object, do you want `stats::lag()`?", fixed = TRUE ) }) dplyr/tests/testthat/test-tbl-cube.R0000644000176200001440000000672313153520575017221 0ustar liggesuserscontext("tbl_cube") test_that("construction errors", { expect_error( tbl_cube(1:3, 1:3), "`dimensions` must be a named list of vectors, not integer", fixed = TRUE ) expect_error( tbl_cube(list(a = 1:3), 1:3), "`measures` must be a named list of arrays, not integer", fixed = TRUE ) expect_error( tbl_cube(list(a = 1:3), list(b = 1:3)), "`measures` must be a named list of arrays, not list", fixed = TRUE ) expect_error( tbl_cube(list(a = 1:3), list(b = array(1:3), c = array(1:2))), "Measure `c` needs dimensions [3], not [2]", fixed = TRUE ) }) test_that("coercion", { grid <- expand.grid(x = letters[1:3], y = letters[1:5], stringsAsFactors = FALSE) tbl <- table(x = grid$x, y = grid$y) tbl_as_df <- as.data.frame(tbl, stringsAsFactors = FALSE) expect_message(cube <- as.tbl_cube(tbl_as_df), "Using Freq as") expect_identical(cube$dims, list(x = letters[1:3], y = letters[1:5])) expect_identical(names(cube$mets), "Freq") expect_message(cube_met <- as.tbl_cube(tbl_as_df, met_name = "Freq"), NA) expect_identical(cube, cube_met) expect_message(cube_dim <- as.tbl_cube(tbl_as_df, dim_names = c("x", "y")), NA) expect_identical(cube, cube_dim) expect_message(cube_tbl <- as.tbl_cube(tbl), NA) expect_identical(cube, cube_tbl) }) test_that("incomplete", { d <- rbind( cbind(data.frame(s = 1), expand.grid(j = 1)), cbind(data.frame(s = 2), expand.grid(j = 1:2)) ) d$value <- 1:3 cube <- as.tbl_cube(d, met_name = "value") expect_true(is.na(as.data.frame(filter(cube, s == 1, j == 2))[["value"]])) expect_equal(filter(as_data_frame(as.data.frame(cube)), s != 1 | j != 2), d) }) test_that("duplicate", { d <- rbind( cbind(data.frame(s = 1), expand.grid(j = c(1, 1))), cbind(data.frame(s = 2), expand.grid(j = 1:2)) ) d$value <- 1:4 expect_error( as.tbl_cube(d, met_name = "value"), "`x` must be unique in all combinations of dimension variables, duplicates: `s` = 1, `j` = 1", fixed = TRUE ) }) test_that("filter", { expect_equal( nasa %>% filter(month == 1) %>% filter(year == 2000), nasa %>% filter(year == 2000) %>% filter(month == 1) ) expect_equal( nasa %>% filter(month == 1) %>% filter(year == 2000), filter(nasa, month == 1, year == 2000) ) expect_equal( filter(nasa, month == 1, year == 2000), filter(nasa, year == 2000, month == 1) ) expect_error( filter(nasa, month == 1 & year == 2000), "`month == 1 & year == 2000` must refer to exactly one dimension, not `month`, `year`" ) }) test_that("summarise works with single group", { by_month <- group_by(nasa, month) out <- summarise(by_month, temp = mean(temperature)) expect_equal(names(out$dims), "month") expect_equal(names(out$mets), "temp") expect_equal(dim(out), c(12, 1)) }) test_that("can coerce to data_frame", { slice <- filter(nasa, year == 1995L, month == 1L) expect_identical( tbl_df(as.data.frame(slice, stringsAsFactors = FALSE)), as_data_frame(slice) ) }) test_that("can coerce to table", { expect_is(as.table(nasa), "table") expect_equal(length(dim(as.table(nasa))), 4L) expect_equal(dimnames(as.table(nasa)), lapply(nasa$dims, as.character)) expect_equal(as.vector(as.table(nasa)), as.vector(nasa$mets[[1]])) expect_identical(as.table(nasa, measure = "ozone"), as.table(select(nasa, ozone))) }) test_that("group_vars() returns variables", { gcube <- group_by(nasa, month) expect_identical(group_vars(gcube), "month") }) dplyr/tests/testthat/utf-8.R0000644000176200001440000000142613153520575015505 0ustar liggesusers# UTF-8 tests that can't be run on Windows CRAN df <- data.frame(中文1 = 1:10, 中文2 = 1:10, eng = 1:10) df2 <- df %>% mutate(中文1 = 中文1 + 1) gdf2 <- df %>% group_by(eng) %>% mutate(中文1 = 中文1 + 1) expect_equal(strings_addresses(names(df)), strings_addresses(names(df2))) expect_equal(strings_addresses(names(df)), strings_addresses(names(gdf2))) df3 <- filter(df2, eng > 5) gdf3 <- filter(gdf2, eng > 5) expect_equal(strings_addresses(names(df)), strings_addresses(names(df3))) expect_equal(strings_addresses(names(df)), strings_addresses(names(gdf3))) df4 <- filter(df2, 中文1 > 5) gdf4 <- filter(gdf2, 中文1 > 5) expect_equal(strings_addresses(names(df)), strings_addresses(names(df4))) expect_equal(strings_addresses(names(df)), strings_addresses(names(gdf4))) dplyr/tests/testthat/helper-groups.R0000644000176200001440000000063113135665123017332 0ustar liggesusersexpect_groups <- function(df, groups, info = NULL) { if (length(groups) == 0L) { expect_null(groups(df), info = info) expect_identical(group_vars(df), character(), info = info) } else { expect_identical(groups(df), lapply(enc2native(groups), as.name), info = info) expect_identical(group_vars(df), groups, info = info) } } expect_no_groups <- function(df) { expect_groups(df, NULL) } dplyr/tests/testthat/test-tbl.R0000644000176200001440000000044713135665123016301 0ustar liggesuserscontext("tbl") test_that("tbl_nongroup_vars() excludes group variables", { cube <- group_by(nasa, month) expect_identical(tbl_nongroup_vars(cube), setdiff(tbl_vars(cube), "month")) gdf <- group_by(mtcars, cyl) expect_identical(tbl_nongroup_vars(gdf), setdiff(tbl_vars(gdf), "cyl")) }) dplyr/src/0000755000176200001440000000000013157241200012166 5ustar liggesusersdplyr/src/set.cpp0000644000176200001440000002041113157241200013463 0ustar liggesusers#include "pch.h" #include #include #include #include #include #include #include #include #include #include #include using namespace Rcpp; using namespace dplyr; class RowTrack { public: RowTrack(const std::string& msg, int max_count_ = 10) : ss(), count(0), max_count(max_count_) { ss << msg; } void record(int i) { if (count > max_count) return; if (count) ss << ", "; int idx = i >= 0 ? (i + 1) : -i; ss << idx; if (count == max_count) ss << "[...]"; count++; } bool empty() const { return count == 0; } std::string str() const { return ss.str(); } private: std::stringstream ss; int count; int max_count; }; // [[Rcpp::export]] dplyr::BoolResult compatible_data_frame_nonames(DataFrame x, DataFrame y, bool convert) { int n = x.size(); if (n != y.size()) return no_because(tfm::format("different number of columns : %d x %d", n, y.size())); if (convert) { for (int i = 0; i < n; i++) { try { boost::scoped_ptr v( join_visitor( Column(x[i], SymbolString("x")), Column(y[i], SymbolString("y")), true, true ) ); } catch (...) { return no_because("incompatible"); } } } else { for (int i = 0; i < n; i++) { SEXP xi = x[i], yi = y[i]; if (TYPEOF(xi) != TYPEOF(yi)) return no_because("incompatible types"); if (TYPEOF(xi) == INTSXP) { if (Rf_inherits(xi, "factor") && Rf_inherits(yi, "factor")) { if (same_levels(xi, yi)) continue; return no_because("factors with different levels"); } if (Rf_inherits(xi, "factor")) return no_because("cannot compare factor and integer"); if (Rf_inherits(yi, "factor")) return no_because("cannot compare factor and integer"); } } } return yes(); } // [[Rcpp::export]] dplyr::BoolResult compatible_data_frame(DataFrame x, DataFrame y, bool ignore_col_order = true, bool convert = false) { int n = x.size(); bool null_x = Rf_isNull(x.names()), null_y = Rf_isNull(y.names()); if (null_x && !null_y) { return no_because("x does not have names, but y does"); } else if (null_y && !null_x) { return no_because("y does not have names, but x does"); } else if (null_x && null_y) { return compatible_data_frame_nonames(x, y, convert); } CharacterVector names_x = x.names(); CharacterVector names_y = y.names(); CharacterVector names_y_not_in_x = setdiff(names_y, names_x); CharacterVector names_x_not_in_y = setdiff(names_x, names_y); if (!ignore_col_order) { if (names_y_not_in_x.size() == 0 && names_x_not_in_y.size() == 0) { // so the names are the same, check if they are in the same order for (int i = 0; i < n; i++) { if (names_x[i] != names_y[i]) { return no_because("Same column names, but different order"); } } } } CharacterVector why; if (names_y_not_in_x.size()) { std::stringstream ss; ss << "Cols in y but not x: " << collapse_utf8(names_y_not_in_x, ", ", "`") << ". "; why.push_back(String(ss.str(), CE_UTF8)); } if (names_x_not_in_y.size()) { std::stringstream ss; ss << "Cols in x but not y: " << collapse_utf8(names_x_not_in_y, ", ", "`") << ". "; why.push_back(String(ss.str(), CE_UTF8)); } if (why.length() > 0) return no_because(why); IntegerVector orders = r_match(names_x, names_y); for (int i = 0; i < n; i++) { SymbolString name = names_x[i]; SEXP xi = x[i], yi = y[orders[i] - 1]; boost::scoped_ptr vx(subset_visitor(xi, name)); boost::scoped_ptr vy(subset_visitor(yi, name)); std::stringstream ss; bool compatible = convert ? vx->is_compatible(vy.get(), ss, name) : vx->is_same_type(vy.get(), ss, name); if (!compatible) { if (ss.str() == "") { ss << "Incompatible type for column `" << name.get_utf8_cstring() << "`: x " << vx->get_r_type() << ", y " << vy->get_r_type(); } why.push_back(String(ss.str(), CE_UTF8)); } } if (why.length() > 0) return no_because(why); return yes(); } // [[Rcpp::export]] dplyr::BoolResult equal_data_frame(DataFrame x, DataFrame y, bool ignore_col_order = true, bool ignore_row_order = true, bool convert = false) { BoolResult compat = compatible_data_frame(x, y, ignore_col_order, convert); if (!compat) return compat; typedef VisitorSetIndexMap > Map; DataFrameJoinVisitors visitors(x, y, x.names(), x.names(), true, true); Map map(visitors); // train the map in both x and y int nrows_x = x.nrows(); int nrows_y = y.nrows(); if (nrows_x != nrows_y) return no_because("Different number of rows"); if (x.size() == 0) return yes(); for (int i = 0; i < nrows_x; i++) map[i].push_back(i); for (int i = 0; i < nrows_y; i++) map[-i - 1].push_back(-i - 1); RowTrack track_x("Rows in x but not y: "); RowTrack track_y("Rows in y but not x: "); RowTrack track_mismatch("Rows with difference occurences in x and y: "); bool ok = true; Map::const_iterator it = map.begin(); for (; it != map.end(); ++it) { // retrieve the indices ( -ves for y, +ves for x ) const std::vector& chunk = it->second; int n = chunk.size(); int count_left = 0, count_right = 0; for (int i = 0; i < n; i++) { if (chunk[i] < 0) count_right++; else count_left++; } if (count_right == 0) { track_x.record(chunk[0]); ok = false; } else if (count_left == 0) { track_y.record(chunk[0]); ok = false; } else if (count_left != count_right) { track_mismatch.record(chunk[0]); ok = false; } } if (!ok) { std::stringstream ss; if (! track_x.empty()) ss << track_x.str() << ". "; if (! track_y.empty()) ss << track_y.str() << ". "; if (! track_mismatch.empty()) ss << track_mismatch.str(); return no_because(CharacterVector::create(String(ss.str(), CE_UTF8))); } if (ok && ignore_row_order) return yes(); if (!ignore_row_order) { for (int i = 0; i < nrows_x; i++) { if (!visitors.equal(i, -i - 1)) { return no_because("Same row values, but different order"); } } } return yes(); } // [[Rcpp::export]] DataFrame union_data_frame(DataFrame x, DataFrame y) { BoolResult compat = compatible_data_frame(x, y, true, true); if (!compat) { stop("not compatible: %s", compat.why_not()); } typedef VisitorSetIndexSet Set; DataFrameJoinVisitors visitors(x, y, x.names(), x.names(), true, true); Set set(visitors); train_insert(set, x.nrows()); train_insert_right(set, y.nrows()); return visitors.subset(set, get_class(x)); } // [[Rcpp::export]] DataFrame intersect_data_frame(DataFrame x, DataFrame y) { BoolResult compat = compatible_data_frame(x, y, true, true); if (!compat) { stop("not compatible: %s", compat.why_not()); } typedef VisitorSetIndexSet Set; DataFrameJoinVisitors visitors(x, y, x.names(), x.names(), true, true); Set set(visitors); train_insert(set, x.nrows()); std::vector indices; int n_y = y.nrows(); for (int i = 0; i < n_y; i++) { Set::iterator it = set.find(-i - 1); if (it != set.end()) { indices.push_back(*it); set.erase(it); } } return visitors.subset(indices, get_class(x)); } // [[Rcpp::export]] DataFrame setdiff_data_frame(DataFrame x, DataFrame y) { BoolResult compat = compatible_data_frame(x, y, true, true); if (!compat) { stop("not compatible: %s", compat.why_not()); } typedef VisitorSetIndexSet Set; DataFrameJoinVisitors visitors(y, x, y.names(), y.names(), true, true); Set set(visitors); train_insert(set, y.nrows()); std::vector indices; int n_x = x.nrows(); for (int i = 0; i < n_x; i++) { if (!set.count(-i - 1)) { set.insert(-i - 1); indices.push_back(-i - 1); } } return visitors.subset(indices, get_class(x)); } dplyr/src/rlang-export.c0000644000176200001440000000175513157241200014764 0ustar liggesusers#define R_NO_REMAP #include #include #include #if (defined(R_VERSION) && R_VERSION < R_Version(3, 4, 0)) SEXP R_MakeExternalPtrFn(DL_FUNC p, SEXP tag, SEXP prot) { fn_ptr ptr; ptr.fn = p; return R_MakeExternalPtr(ptr.p, tag, prot); } DL_FUNC R_ExternalPtrAddrFn(SEXP s) { fn_ptr ptr; ptr.p = EXTPTR_PTR(s); return ptr.fn; } #endif SEXP rlang_namespace(const char* ns) { SEXP call = PROTECT(Rf_lang2(Rf_install("getNamespace"), PROTECT(Rf_mkString(ns)))); SEXP ns_env = Rf_eval(call, R_BaseEnv); UNPROTECT(2); return ns_env; } void rlang_register_pointer(const char* ns, const char* ptr_name, DL_FUNC fn) { SEXP ptr = PROTECT(R_MakeExternalPtrFn(fn, R_NilValue, R_NilValue)); SEXP ptr_obj = PROTECT(Rf_allocVector(VECSXP, 1)); SET_VECTOR_ELT(ptr_obj, 0, ptr); Rf_setAttrib(ptr_obj, R_ClassSymbol, Rf_mkString("fn_pointer")); Rf_defineVar(Rf_install(ptr_name), ptr_obj, PROTECT(rlang_namespace(ns))); UNPROTECT(3); } dplyr/src/Makevars0000644000176200001440000000031013157241200013654 0ustar liggesusers# Disable long types from C99 or CPP11 extensions PKG_CPPFLAGS = -I../inst/include -DCOMPILING_DPLYR -DBOOST_NO_INT64_T -DBOOST_NO_INTEGRAL_INT64_T -DBOOST_NO_LONG_LONG -DRCPP_USING_UTF8_ERROR_STRING dplyr/src/hybrid.cpp0000644000176200001440000001256613157241200014165 0ustar liggesusers#include "pch.h" #include #include #include #include #include #include #include using namespace Rcpp; using namespace dplyr; bool has_no_class(const RObject& arg) { return RCPP_GET_CLASS(arg) == R_NilValue; } bool hybridable(RObject arg) { if (Rf_inherits(arg, "Date") || Rf_inherits(arg, "POSIXct") || Rf_inherits(arg, "difftime")) return true; if (arg.isObject() || arg.isS4()) return false; int type = arg.sexp_type(); switch (type) { case INTSXP: case REALSXP: case LGLSXP: case STRSXP: case CPLXSXP: case RAWSXP: return has_no_class(arg); default: break; } return false; } template