Title: | Quick Indexation |
---|---|
Description: | Quick indexation of any type of vector or of any combination of those. Indexation turns a vector into an integer vector going from 1 to the number of unique elements. Indexes are important building blocks for many algorithms. The method is described at <https://github.com/lrberge/indexthis/>. |
Authors: | Laurent Berge [aut, cre], Sebastian Krantz [ctb], Morgan Jacob [ctb] |
Maintainer: | Laurent Berge <[email protected]> |
License: | GPL-3 |
Version: | 2.0.0 |
Built: | 2024-11-05 05:27:24 UTC |
Source: | https://github.com/lrberge/indexthis |
Utility to integrate the `to_index` function within a package without a dependency.
indexthis_vendor(pkg = ".")
pkg | Character scalar, default is `"."`. |
This is a utility to populate a package with the code necessary to run the `to_index` function. This avoids creating a dependency on the `indexthis` package.
The underlying code of `to_index` is in C++. Hence, if the routines are to be included in a package, they need to be registered appropriately. There are four cases: three are automatic, one requires a bit of copy-pasting from the user. Let's review them.
If the target package already has C++ code and uses `Rcpp` or `cpp11` to make the linking, the function `indexthis_vendor` registers the main function as an `Rcpp` or `cpp11` routine, and all should be well.
If the target package has no C/C++ code at all, `indexthis_vendor` updates the NAMESPACE and registers the routine, and all should be well.
If the target package already has C/C++ code, this is more complicated because there should be only one `R_init_pkgname` symbol and it should already exist (see Writing R Extensions, section "dyn.load and dyn.unload"). In that case, the code needed to register the routine is placed at the end of the file `to_index.cpp`, within comments. The (knowledgeable) user has to copy-paste it into the appropriate location, where she registers the existing routines.
This function does not return anything. Instead it writes two files: one in R (by default in the folder `./R`) and one in cpp (by default in the folder `src/`). Those files contain the source code necessary to run the function `to_index`.
## DO NOT RUN: otherwise it will write in your package workspace
# indexthis_vendor()
Turns one or multiple vectors of the same length into an index, that is an integer vector of the same length ranging from 1 to the number of unique elements in the vectors. This is equivalent to creating a key.
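To make the notion of an index concrete, here is a minimal base R sketch of the same idea for a single vector. It relies only on `match()` and `unique()` and is an illustration of the concept, not the package's implementation.

```r
# Base R illustration of what an index is (not the indexthis code).
x = c("u", "a", "a", "s", "u", "u")

# Each element is replaced by the position of its first occurrence
# among the unique values, so the result ranges from 1 to
# length(unique(x)).
index = match(x, unique(x))
index
# 1 2 2 3 1 1
```

The result has the same length as `x`, and two elements share an index value exactly when they are equal.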
to_index( ..., list = NULL, sorted = FALSE, items = FALSE, items.simplify = TRUE )
... | The vectors to be turned into an index. Only works for atomic vectors. If multiple vectors are provided, they should all be of the same length. Note that you can alternatively provide a list of vectors with the argument `list`. |
list | An alternative to using `...`. |
sorted | Logical, default is `FALSE`. |
items | Logical, default is `FALSE`. |
items.simplify | Logical scalar, default is `TRUE`. |
The algorithm to create the indexes is based on a semi-hashing of the input vectors. The hash table is of size `2 * n`, with `n` the number of observations. Hence the hash of each value is partial, in order to fit that range. That is to say, a 32-bit hash is turned into a `log2(2 * n)`-bit hash simply by shifting the bits. This necessarily leads to multiple collisions (i.e. different values leading to the same hash). This is why collisions are checked systematically, guaranteeing the validity of the resulting index.
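The paragraph above can be sketched in plain R as follows. This is a toy version under stated assumptions, not the package's C++ code: the function name `toy_index`, the multiplicative hash constant, the rounding of the number of bits up to an integer, and the linear probing strategy are all choices made for the illustration. It only handles non-negative whole numbers below 2^20 without missing values.

```r
# Toy partial-hash indexing (hypothetical helper, not the indexthis code).
toy_index = function(x) {
  n = length(x)
  k = max(1, ceiling(log2(2 * n)))  # number of hash bits kept
  size = 2^k                         # table size, at least 2 * n

  # Assumed 32-bit multiplicative hash; exact in doubles for 0 <= x < 2^20
  h32 = (x * 2654435761) %% 2^32
  # Keep the k highest bits: same as shifting right by (32 - k) bits
  slot = floor(h32 / 2^(32 - k))     # values in 0 .. size - 1

  table = integer(size)              # 0 = empty, else position in x
  index = integer(n)
  n_unique = 0L
  for (i in seq_len(n)) {
    s = slot[i]
    repeat {
      j = table[s + 1]
      if (j == 0L) {
        # Empty slot: x[i] is a new value
        n_unique = n_unique + 1L
        table[s + 1] = i
        index[i] = n_unique
        break
      }
      if (x[j] == x[i]) {
        # Same slot and same value: reuse the existing index
        index[i] = index[j]
        break
      }
      # Collision (same slot, different value): try the next slot
      s = (s + 1) %% size
    }
  }
  index
}

toy_index(c(5, 5, 5, 3, 3, 5))
# 1 1 1 2 2 1
```

Because the stored value is compared before an index is reused, two different values can never end up with the same index, whatever the quality of the shortened hash.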
Note that `NA` values are considered as valid and will not be returned as `NA` in the index. When indexing numeric vectors, there is no distinction between `NA` and `NaN`.
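The semantics described above can be emulated in base R, as a sketch. Base R's `match()` treats `NA` and `NaN` as different values, so the emulation first maps both to `NA`. This only mirrors the behavior stated in this section and is not the package's code.

```r
# Base R emulation of the stated NA handling (illustration only).
x_num = c(1, NA, NaN, 1)

# is.na() is TRUE for both NA and NaN, so this merges the two
x_same = x_num
x_same[is.na(x_same)] = NA

index = match(x_same, unique(x_same))
index
# 1 2 2 1

# The missing values received a regular index value, not NA
anyNA(index)
# FALSE
```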
The algorithm is optimized for input vectors of type: i) numeric or integer (and equivalent data structures, such as dates), ii) logical, iii) factor, and iv) character. It will be slow for any other type, since a conversion to character is first applied before indexing.
By default, an integer vector is returned, of the same length as the inputs.

If you are interested in the values the indexes (i.e. the integer values) refer to, you can use the argument `items = TRUE`. In that case, a list of two elements, named `index` and `items`, is returned. The `index` is the integer vector representing the index, and the `items` is a data.frame containing the input values the index refers to.

Note that if `items = TRUE` and `items.simplify = TRUE` and there is only one input vector, the `items` slot of the returned object will be a vector instead of a data.frame.
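The relation between `index` and `items` can be illustrated with base R alone. The sketch below builds a list with the same two names by hand; the helper `row_key` is hypothetical and exists only for this illustration, and none of this is the package's code.

```r
# Base R illustration of the index/items pair (not the indexthis code).
x = c("u", "a", "a", "s", "u", "u")
y = c(5, 5, 5, 3, 3, 5)

# One vector: the items can be kept as a plain vector
items_x = unique(x)
index_x = match(x, items_x)
stopifnot(identical(items_x[index_x], x))

# Several vectors: the items are the unique rows, kept in a data.frame
df = data.frame(x = x, y = y)
items_xy = unique(df)

# Hypothetical helper: one character key per row
row_key = function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

index_xy = match(row_key(df), row_key(items_xy))
index_xy
# 1 2 2 3 4 1

# Subsetting the items with the index rebuilds the original data
result = list(index = index_xy, items = items_xy)
rebuilt = result$items[result$index, ]
stopifnot(all(rebuilt$x == x), all(rebuilt$y == y))
```

The items hold one entry per unique value (or unique combination of values), and the index says which entry each observation points to.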
Laurent Berge for this original implementation, Morgan Jacob (author of `kit`) and Sebastian Krantz (author of `collapse`) for the hashing idea.
x = c("u", "a", "a", "s", "u", "u")
y = c( 5,   5,   5,   3,   3,   5)

# By default, the index value is based on order of occurrence
to_index(x)
to_index(y)
to_index(x, y)

# Use the order of the input values with sorted=TRUE
to_index(x, sorted = TRUE)
to_index(y, sorted = TRUE)
to_index(x, y, sorted = TRUE)

# To get the values to which the index refer, use items=TRUE
to_index(x, items = TRUE)

# play around with the format of the output
to_index(x, items = TRUE, items.simplify = TRUE) # => default
to_index(x, items = TRUE, items.simplify = FALSE)

# multiple items are always in a data.frame
to_index(x, y, items = TRUE)

# NAs are considered as valid
x_NA = c("u", NA, "a", "a", "s", "u", "u")
to_index(x_NA, items = TRUE)
to_index(x_NA, items = TRUE, sorted = TRUE)

#
# Getting the data back from the index
#

info = to_index(x, y, items = TRUE)
info$items[info$index, ]