Under the mathematical hood at Custora we often have to work with *sparse* vectors and matrices, which are very large but whose elements are mostly zero.

A simple example of this type of data structure that occurs in many statistical applications is a *model matrix*, a matrix whose rows correspond to observations, whose columns correspond to categorical membership, and whose elements indicate each observation's membership in each category.

Customer |
From US |
From Europe |
From Asia |
... |

Chester McNebbins | 1 | 0 | 0 | ... |

Vesper Lynd | 0 | 1 | 0 | ... |

Walter Walterson | 0 | 0 | 0 | ... |

Storing these data structures with a naive matrix architecture, one allocated variable per element, can be very wasteful. In R, which we use to do our statistical work, there is a package called `Matrix`

that is designed to handle these structures more efficiently. Let's do some quick comparisons (all numbers that follow have been computed on a 3.4GHz iMac with 32 GB of RAM):

`> standard <- matrix(0, nrow=1e6, ncol=100)`

> object.size(standard)

800000200 bytes

> csparse <- Matrix(0, nrow=1e6, ncol=100)

> object.size(csparse)

1824 bytes

R in fact cannot handle ordinary matrices with more than 2^{31} - 1 elements; try increasing `nrow`

to 1e8 to see this. (The 2^{31} - 1 upper limit is the max value of a signed 32-bit integer, and is accessible in R via `.Machine$integer.max`

.) Aside from concerns about memory efficiency, this limit is restrictive for our purposes: if you're dealing with 3 million users and you want to store 1000 variables per user in your matrix, you're already out of luck.

The default sparse matrix architecture for `Matrix`

is the *column-oriented* format (class `dgCMatrix`

in the `Matrix`

package). There is also a similar *row-oriented* format that is not as well supported by `Matrix`

(class `dgRMatrix`

). Finally, `Matrix`

also offers a *triplet* format (class `dgTMatrix`

). Wikipedia's article on sparse matrix representation explains these formats nicely, but to summarize in brief:

- The column- and row-oriented formats store data as three arrays (A,B,C). In the column-oriented format, A contains all of the nonzero entries reading top to bottom one column after the other, B contains indices of/pointers to A indicating where each new column begins, and C contains the row index of each element in A. The row-oriented format is similar except that A reads left to right one row after the other, B indicates where each new row begins, and C contains column indices. (I will disregard the row-oriented format from here on, without loss of generality, as it is architecturally the same as the column-oriented format.)
- The triplet format consists of a list of triplets (row, column, value) for each element. As
`Matrix`

's documentation mentions, this implies that internal representation is*not*unique; a different reordering of the triplets would correspond to the same matrix.

These two architectures lead to various pros and cons relative to each other and to the standard matrix format. One disadvantage of both of the sparse formats is that assignment is much slower than for standard matrices, although this is a considerably worse problem for the column-oriented format than for the triplet format:

`> standard <- matrix(0, nrow=1e5, ncol=100)`

> csparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgCMatrix")

> tsparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgTMatrix")

> system.time(standard[,25] <- 1)

user system elapsed

0.001 0.000 0.001

> system.time(csparse[,25] <- 1)

user system elapsed

3.949 0.002 3.952

> system.time(tsparse[,25] <- 1)

user system elapsed

0.066 0.007 0.075

Standard matrices can just assign new values directly to space allocated in memory and so assignment takes very little time. The triplet format has a little more to do; it has to allocate memory for each new triplet and then assign values. The column-oriented format has the most work to do: it has to allocate for its array of nonzero values and assign values, allocate for its array of row indices and assign those values, and then adjust the pointer array so that the columns start in the right place.

A disadvantage of the triplet format is that it often consumes more memory than the column-oriented format:

`> standard <- matrix(c(0,0,0,0,1), nrow=1e5, ncol=100)`

> csparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgCMatrix")

> tsparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgTMatrix")

> object.size(standard)

80000200 bytes

> object.size(csparse)

24001824 bytes

> object.size(tsparse)

32001416 bytes

These matrices are still fairly sparse, and both sparse formats offer substantial memory improvements over the standard format, but the column-oriented format does better in this respect. Both formats do worse as the matrix becomes less and less sparse and will eventually consume more memory than the standard matrix format (try it out with a matrix consisting entirely of 1s).

One way to deal with these issues in `Matrix`

is to convert between sparse matrix formats as needed, or to deal with smaller matrices in the standard format and use R's `cbind2`

and `rbind2`

functions (which can combine two matrices along columns or rows into a single larger matrix) to attach the data into a sparse matrix. For example, if you are storing sparse matrices on disk you may prefer to keep them in column-oriented format. But when later working with them, you may want to convert them to triplet format, or, if possible, do all your operations with standard-format matrices and take care of conversion to a sparse format separately.

`> system.time({`

csparse <- Matrix(0, nrow=1e5, ncol=100)

csparse[,c(25,50,75,100)] <- 1

})

user system elapsed

16.141 0.016 16.157

`> system.time({`

csparse <- Matrix(0, nrow=1e5, ncol=0)

for (i in 1:4) {

a <- matrix(0, nrow=1e5, ncol=25);

a[,25] <- 1

csparse <- cbind2(csparse, a)

}})

user system elapsed

0.076 0.055 0.130

Using `cbind`

/`rbind`

(or in this case `cbind2`

/`rbind2`

) to incrementally increase the size of a matrix, as is done in the above example, is exactly the opposite of traditional good coding style in R, which emphasizes pre-allocation and vectorization in place of loops. Beginners to R or to other languages that emphasize an array programming approach often have to learn specifically *not* to do this. But in this case, because of the way our underlying data structures are defined, it's *more* efficient to use a loop. To add more columns to a column-oriented sparse matrix, all you need to do is append to the three arrays, with no readjustment of the old values needed.

Knowing the sparsity of your data, knowing what kinds of operations you will be performing on your data, and thinking about your preferences when trading off memory consumption and speed can help you select appropriate formats and algorithms for statistical work on sparse matrices.