{"id":6370,"date":"2017-05-23T10:42:40","date_gmt":"2017-05-23T09:42:40","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=6370"},"modified":"2017-05-23T10:42:40","modified_gmt":"2017-05-23T09:42:40","slug":"faster-transpose-matrix-multiplication-in-r","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=6370","title":{"rendered":"Faster transpose matrix multiplication in R"},"content":{"rendered":"<p>I&#8217;m working on optimising some R code written by a researcher at the University of Sheffield and it&#8217;s very much a war of attrition! There&#8217;s no easily optimisable hotspot and there&#8217;s no obvious way to leverage parallelism. Progress is being made by steadily identifying places here and there where we can do a little better. 10% here and 20% there can eventually add up to something worth shouting about.<\/p>\n<p>One such micro-optimisation we discovered involved multiplying two matrices together where one of them needed to be transposed. Here&#8217;s a minimal example.<\/p>\n<pre># Set random seed for reproducibility\r\nset.seed(3)\r\n\r\n# Generate two random n by n matrices\r\nn = 10\r\na = matrix(runif(n*n,0,1),n,n)\r\nb = matrix(runif(n*n,0,1),n,n)\r\n\r\n# Multiply the matrix a by the transpose of b\r\nc = a %*% t(b)\r\n<\/pre>\n<p>When the speed of linear algebra computations is an issue in R, it makes sense to use a version that is linked to a fast implementation of <a href=\"http:\/\/www.netlib.org\/blas\/\">BLAS<\/a> and <a href=\"http:\/\/www.netlib.org\/lapack\/\">LAPACK<\/a> and we <a href=\"http:\/\/rse.shef.ac.uk\/blog\/intel-R-iceberg\/\">are already doing that on our HPC system<\/a>.<\/p>\n<p>Here, I am using version 3.3.3 of <a href=\"https:\/\/mran.microsoft.com\/open\/\">Microsoft R Open<\/a>, which links to <a href=\"https:\/\/software.intel.com\/en-us\/intel-mkl\">Intel&#8217;s MKL<\/a> (an implementation of BLAS and LAPACK), on a Windows laptop.<\/p>\n<p>In R, there is another way to do the computation 
<strong>c = a %*% t(b)<\/strong>\u00a0 &#8212; we can make use of the <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/base\/html\/crossprod.html\">tcrossprod<\/a> function. (There is also a crossprod function for when you want to compute <strong>t(a) %*% b<\/strong>.)<\/p>\n<pre>c_new = tcrossprod(a,b)<\/pre>\n<p>Let&#8217;s check for equality<\/p>\n<pre>c_new == c\r\n      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]\r\n [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE\r\n<\/pre>\n<p>Sometimes, when comparing the two methods, you may find that some of those entries are FALSE, which may worry you!<br \/>\nIf that happens, computing the difference between the two results should convince you that all is OK and that the differences are just because of numerical noise. 
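<\/p>\n<p>For instance, a quick check along these lines (using base R&#8217;s <strong>all.equal<\/strong>, which compares values up to a small numerical tolerance) should show that any differences are down at the level of rounding error:<\/p>\n<pre># Largest absolute difference between the two results\r\nmax(abs(c_new - c))\r\n\r\n# Equality up to a numerical tolerance\r\nall.equal(c_new, c)\r\n<\/pre>\n<p>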
This happens sometimes when dealing with floating point arithmetic (for example, see <a href=\"https:\/\/www.walkingrandomly.com\/?p=5380\">https:\/\/www.walkingrandomly.com\/?p=5380<\/a>).<\/p>\n<p>Let&#8217;s time the two methods using the microbenchmark package.<\/p>\n<pre>install.packages('microbenchmark')\r\nlibrary(microbenchmark)\r\n<\/pre>\n<p>We time just the matrix multiplication part of the code above:<\/p>\n<pre>microbenchmark(\r\n  original = a %*% t(b),\r\n  tcrossprod = tcrossprod(a,b)\r\n)\r\n\r\nUnit: nanoseconds\r\n       expr  min   lq     mean median   uq   max neval\r\n   original 2918 3283 3491.312   3283 3647 18599  1000\r\n tcrossprod  365  730  756.278    730  730 10576  1000\r\n<\/pre>\n<p>We are only saving microseconds here but that&#8217;s more than a factor of 4 speed-up in this small matrix case. If that computation is being performed a lot in a tight loop (and for our real application, it was), it can add up to quite a difference.<\/p>\n<p>As the matrices get bigger, the speed benefit in percentage terms gets smaller, but tcrossprod always seems to be the faster method. For example, here are the results for 1000 x 1000 matrices<\/p>\n<pre># Set random seed for reproducibility\r\nset.seed(3)\r\n\r\n# Generate two random n by n matrices\r\nn = 1000\r\na = matrix(runif(n*n,0,1),n,n)\r\nb = matrix(runif(n*n,0,1),n,n)\r\n\r\nmicrobenchmark(\r\n  original = a %*% t(b),\r\n  tcrossprod = tcrossprod(a,b)\r\n)\r\n\r\nUnit: milliseconds\r\n       expr      min       lq     mean   median       uq      max neval\r\n   original 18.93015 26.65027 31.55521 29.17599 31.90593 71.95318   100\r\n tcrossprod 13.27372 18.76386 24.12531 21.68015 23.71739 61.65373   100\r\n<\/pre>\n<p><strong>The cost of not using an optimised version of BLAS and LAPACK<\/strong><\/p>\n<p>While writing this blog post, I accidentally used the <a href=\"https:\/\/cran.r-project.org\/\">CRAN version of R<\/a> (the recently released version 3.4). 
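<\/p>\n<p>As an aside, if you are unsure which build of R you are running, recent versions of R report the BLAS and LAPACK libraries in use as part of the output of <strong>sessionInfo()<\/strong>:<\/p>\n<pre># Print the R version, the platform and (in recent versions of R)\r\n# the BLAS and LAPACK libraries that R is linked against\r\nsessionInfo()\r\n<\/pre>\n<p>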
Unlike Microsoft R Open, the CRAN version is not linked to the Intel MKL and so matrix multiplication is rather slower.<\/p>\n<p>For our original 10 x 10 matrix example we have:<\/p>\n<pre>library(microbenchmark)\r\n# Set random seed for reproducibility\r\nset.seed(3)\r\n\r\n# Generate two random n by n matrices\r\nn = 10\r\na = matrix(runif(n*n,0,1),n,n)\r\nb = matrix(runif(n*n,0,1),n,n)\r\n\r\nmicrobenchmark(\r\n  original = a %*% t(b),\r\n  tcrossprod = tcrossprod(a,b)\r\n)\r\n\r\nUnit: microseconds\r\n       expr   min    lq    mean median     uq    max neval\r\n   original 3.647 3.648 4.22727  4.012 4.1945 22.611   100\r\n tcrossprod 1.094 1.459 1.52494  1.459 1.4600  3.282   100\r\n<\/pre>\n<p>Everything is a little slower, as you might expect, and the conclusion of this article &#8212; <strong>tcrossprod(a,b)<\/strong> <strong>is faster than a %*% t(b)<\/strong> &#8212; still seems to be valid.<\/p>\n<p>However, when we move to 1000 x 1000 matrices, this changes<\/p>\n<pre>library(microbenchmark)\r\n# Set random seed for reproducibility\r\nset.seed(3)\r\n\r\n# Generate two random n by n matrices\r\nn = 1000\r\na = matrix(runif(n*n,0,1),n,n)\r\nb = matrix(runif(n*n,0,1),n,n)\r\n\r\nmicrobenchmark(\r\n  original = a %*% t(b),\r\n  tcrossprod = tcrossprod(a,b)\r\n)\r\n\r\nUnit: milliseconds\r\n       expr      min       lq     mean   median       uq       max neval\r\n   original 546.6008 587.1680 634.7154 602.6745 658.2387  957.5995   100\r\n tcrossprod 560.4784 614.9787 658.3069 634.7664 685.8005 1013.2289   100\r\n<\/pre>\n<p>As expected, both results are much slower than with the Intel MKL-linked version of R (~600 milliseconds vs ~31 milliseconds) &#8212; nothing new there.\u00a0 More disappointing, however, is that tcrossprod is now slightly <strong>slower<\/strong> than explicitly taking the transpose.<\/p>\n<p>As such, this particular micro-optimisation might not be as effective as we might like for all versions of 
R.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m working on optimising some R code written by a researcher at the University of Sheffield and it&#8217;s very much a war of attrition! There&#8217;s no easily optimisable hotspot and there&#8217;s no obvious way to leverage parallelism. Progress is being made by steadily identifying places here and there where we can do a little better. 10% [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[40,73,7,36,67,42],"tags":[],"class_list":["post-6370","post","type-post","status-publish","format-standard","hentry","category-free-software","category-linear-algebra","category-programming","category-r","category-scientific-software","category-tutorials"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-1EK","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6370","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6370"}],"version-history":[{"count":7,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6370\/revisions"}],"predecessor-version"
:[{"id":6378,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6370\/revisions\/6378"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6370"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6370"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6370"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}