---
title: "hdd walkthrough"
author: "Laurent Bergé"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
    toc_float:
      collapsed: false
      smooth_scroll: false
vignette: >
  %\VignetteIndexEntry{hdd introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- output:
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
    toc_float:
      collapsed: false
      smooth_scroll: false -->
      
      
<!-- fontsize: 12pt
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
	echo = FALSE,
	eval = TRUE,
	collapse = TRUE,
	comment = "#>"
)
```

```{r}
# We conditionnaly run the code pieces
if(!file.exists("C:/Users/laurent.berge/DATA/MAG/Authors.txt")){
	message("This vignette takes the example of a data set of over 13GB on disk, and thus cannot be run non-locally.")
	knitr::opts_chunk$set(eval = FALSE)
}
```

```{r}
suppressPackageStartupMessages(library(data.table))
library(hdd)
```


`hdd` provides a class of data, *hard drive data*, allowing the easy 
importation/manipulation of out of memory data sets. The data sets are located 
on disk but look like in-memory, the syntax for manipulation is similar to 
[data.table](<https://github.com/Rdatatable/data.table/wiki>). Operations 
are performed "chunk-wise" behind the scene. Here is a brief presentation of the main features.

# Example with publication data

Throughout this document, we will use the example of the Microsoft Academic Graph 
data (<https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/>). 
This large and freely available data set contains all scientific publications meta 
information (like authors, titles, institutions, etc...) as collected by Microsoft 
and used in (the now defunct) microsoft academic.

The data is in the form of a relational data base of usually large text files (well over 10GB). 
We'll see how to deal with them in `R` with `hdd`.


## Importation

First we have to import the data into `R` through a `hdd` data set. We're interested 
in importing the information on authors. We'll use the `txt2hdd` function. But first
 let's have a look at the data.

```{r, echo = TRUE, eval = FALSE}
library(hdd)
peek("_path/authors.txt")
```


```{r, results='asis'}
tab = head(peek("C:/Users/laurent.berge/DATA/MAG/Authors.txt", view = FALSE))
if("pdf_document" %in% rmarkdown::all_output_formats(knitr::current_input())){
  # tab = as.data.table(lapply(tab, function(x) fplot:::truncate_string(iconv(x, to="ASCII", sub = "£"), method = "trimMid", trunc = 16)))
  tab = as.data.table(lapply(tab, iconv, to="ASCII", sub = "£"))
}
knitr::kable(tab)

```

We can see that the data set contains 8 variables, some are text, other are numeric.
 By default, the function `peek` also displays the delimiter of the data, here it is a 
 tab delimited text file. 

Now let's import the data. From the documentation, we can retrieve the variable names, 
and we will use them:

```{r, echo = TRUE, eval = FALSE}
col_names = c("AuthorId", "Rank", "NormalizedName", "DisplayName", 
			  "LastKnownAffiliationId", 
			  "PaperCount", "CitationCount", "CreatedDate")
txt2hdd("_path/authors.txt", # The text file
		# dirDest: The destination of the HDD data => must be a directory
		dirDest = "_path/hdd_authors", 
		chunkMB = 500, col_names = col_names)
```

By default, the types of the variables are automatically set, based on the guess 
of the function `fread` from package `data.table`. Note that 64bits integers 
variables are imported as doubles. You can set the column types yourself using the 
function `cols` or `cols_only` from package `readr`. In any case, if there are 
importing problems, a specific repository, which is itself a hdd file, reporting all 
importing problems is created (in the example, it there were problems, it would be 
located at `"_path/hdd_authors/problems"`).

The function `txt2hdd` creates a folder on disk containing the full data set divided 
into several files named `slice_XX.fst` with XX its number. The `.fst` files are 
in a format for fast reading/writing on disk (see `fst` [homepage](<http://www.fstpackage.org/>)). 
Every time a `hdd` data set is created, the associated folder on disk includes a 
file `_hdd.txt` containing: i) a summary of the data (number of rows/columns, 
first five observations) and ii) the log of the commands which created it.

Now that the data is imported into a `hdd` file, let's have a look at it:

```{r, echo = TRUE, eval = FALSE}
authors = hdd("_path/hdd_authors")
summary(authors)
```

```{r}
authors = hdd("C:/Users/laurent.berge/DATA/MAG/HDD/authors")
summary(authors)
```

The summary provides some general information: the location of the data in 
the hard drive (note that this is the location on my hard drive!), the size of the 
data set on disk (which is lower that what it would be in-memory due to compression), 
here 13GB, the number of chunks, here 69, the number of lines, here 217 millions, and 
the numer of variables.

Now let's have a quick look at the first lines of the data:

```{r, echo = TRUE}
head(authors)
```

It's indeed the same data as in the text file. You can see that you can access it as a regular data frame.


### Importing with preprocessing

Now assume that you do not want to import the full text file because some information might be unecessary to you -- or you may want to generate new information straight away. You can apply a preprocessing function while importing. Assume we want to import only the first three columns for which the names, in variable `NormalizedName`, contain only ASCII characters. We could do as follows:

```{r, echo = TRUE, eval = FALSE}

fun_ascii = function(x){
	# selection of the first 3 columns
	res = x[, 1:3]
	# selection of only ascii names
	res[!is.na(iconv(NormalizedName, to = "ASCII"))]
}

col_names = c("AuthorId", "Rank", "NormalizedName", "DisplayName", 
			  "LastKnownAffiliationId", 
			  "PaperCount", "CitationCount", "CreatedDate")
txt2hdd("_path/authors.txt", dirDest = "_path/hdd_authors_ascii", 
		chunkMB = 500, col_names = col_names,
		preprocessfun = fun_ascii)
```

Let's look at the new data set:

```{r, echo = TRUE, eval = FALSE}
authors_ascii = hdd("_path/hdd_authors_ascii")
head(authors_ascii)
```

```{r}
authors_ascii = authors[1:50, 1:3]
authors_ascii = authors_ascii[!is.na(iconv(NormalizedName, to = "ASCII"))]
head(authors_ascii)
```


## Manipulation

You can manipulate the data as with any other `data.table`, but the extraction method 
for `hdd` objects (`[.hdd`) includes a few extra arguments.

By default, the results are put into memory. Using the previous `author` data, 
let's find out all the author names containing the word "Einstein":

```{r, echo = TRUE, eval = FALSE}
names_einstein = authors[grepl("\\beinstein\\b", NormalizedName), 
						 NormalizedName]
length(names_einstein)
head(names_einstein)
```

```{r}
load("C:/Users/laurent.berge/Google Drive/R_packages/hdd/_DATA/names_einstein.RData")
length(names_einstein)
head(names_einstein)
```

That's it, the algorithm has gone through the 217 million rows and found 1700 author names 
containing "Einstein". You can see that the command is the same as for a regular `data.table`.

But what if the result of the query does not fit into memory? You can still perform 
the query by adding the argument `newfile`. Now the result will be a `hdd` data set 
located in the path provided by the argument `newfile`.  As in the 
`"Importing with preprocessing"` section, let's create the data set containing the 
first three columns and dropping all names with non-ASCII characters. We can do as follows:

```{r, echo = TRUE, eval = FALSE}
authors[!is.na(iconv(NormalizedName, to = "ASCII")), 1:3, 
		newfile = "_path/hdd_authors_ascii"]
```

The result is a new `hdd` data set located in `"_path/hdd_authors_ascii"`, 
which can be of any size.

## Exploring a hdd data set

A `hdd` data set is made of several chunks, or files. You can explore each of 
them individually using the argument `file`. Further, you can use the special 
variable `.N` to refer to the total number of files making the data set. For example, 
let's select the first name of each chunk (or file):

```{r, echo = TRUE}
names_first = authors[1, NormalizedName, file = 1:.N]
head(names_first)
```

When you use the argument `file`, you can also use the special variable `.N` in the index. 
Here by selecting the last lines of each file:

```{r, echo = TRUE}
names_last = authors[.N, NormalizedName, file = 1:.N]
head(names_last)
```

## Extracting a full variable

Of course you can extract a full variable with `$`, but the algorithm will 
proceed only if the expected size is not too large. For example the following 
code will raise an error because the expected size of the variable, 7GB, is deemed too large:

```{r, echo = TRUE, error = TRUE}
author_id = authors$AuthorId
```

By default the cap at which this error is raised is 1GB. To drop the cap, 
just set `setHdd_extract.cap(Inf)`, but then beware of memory issues!

## Reading full hdd data sets from disk to memory

Use the function `readfst` to read `hdd` files located on disk to memory. Of 
course the `hdd` file should be *small* enough to fit in memory. An error will 
be raised if the expected size of the data exceeds the value of 
`getHdd_extract.cap(new_cap)` (default is 1GB), which you can 
set with `setHdd_extract.cap(new_cap)`. For example:

```{r, echo = TRUE, eval = FALSE}
# to read the full data set into memory:
base_authors = readfst("_path/hdd_authors")
# Alternative way
authors_hdd = hdd("_path/hdd_authors")
base_authors = authors_hdd[]
```

# Slicing 

Imagine you dispose of an in-memory data set to which you want to apply some 
function -- say for instance that you will have to apply a cartesian merge. 
However the result of this function does not fit in memomy. The function `hdd_slice` 
deals with it: it applies the function to slices of the original data, and save the
 results in a `hdd` data set. You'll be then able to deal with the result with `hdd`. 

Let's have an example with a cartesian merge:

```{r, echo = TRUE, eval = FALSE}
# x: the original data set
# y: the data set you want to merge to x
cartesian_merge = function(x){
	merge(x, y, allow.cartesian = TRUE)
}

hdd_slice(x, fun = cartesian_merge, 
		  dir = "_path/result_merge", chunkMB = 100)

```

Here the data `x` will be split in 100MB chunks and the function `cartesian_merge` 
will be applied to each of these chunks. The results will be saved in a `hdd` data 
set located in `_path/result_merge`. You'll then be able to manipulate the data 
in `_path/result_merge` as a regular `hdd` data set.

This example involved a merging operation only, but you can apply any kind of 
function (for example `x` can be a vector of text and the function can be the 
creation of ngrams, etc...).

# Speed considerations and limitations

## In-memory operations will always be faster

Manipulating in-memory data will always be orders of mmagnitude faster than 
manipulating on-disk data. This comes from the simple fact that read/write operations
 on disk are about 100 times slower than read/write in RAM -- further the read and 
 write on disk also involves compression/decompression incurring increased CPU use. 
 This is however the only way to deal with very large data sets (except of course if 
 you have very deep pockets allowing you to have big RAM computers!).

This means that at the moment your final data set reaches a memory-workable size, 
stop using `hdd` and start using regular `R`. Package `hdd` exists to make the 
transition from too-large-a-data-set to a memory-workable-data-set and is not intended 
to be a tool for regular data manipulation.

## On aggregation

Since `hdd` data sets are split into multiple files, the user cannot perform 
aggregate operations on some variable (i.e. using the `by` clause in `data.table` 
language) and obtain "valid" results. Indeed, the aggregate operations will be 
performed chunk per chunk and **not** on the entirety of the data set (which is not 
possible because of the size).

To circumvent this issue, the data set must be sorted by the variable(s) on which 
aggregation is done -- in which case the chunk by chunk operations will be valid. 
To sort `hdd` data sets, the function `hdd_setkey` has been created -- in particular 
it ensures that the keys do not spill across multiple files (to ensure consistency of 
the chunk by chunk aggregation). But beware, it is extremely slow (it involves 
multiple on-disk copying of the *full* data set).


# References

On the Microsoft Academic Graph data:

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, 
and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and 
Applications. In Proceedings of the 24th International Conference on World Wide 
Web (WWW 15 Companion). ACM, New York, NY, USA, 243-246.
 DOI=<http://dx.doi.org/10.1145/2740908.2742839>