{"id":6411,"date":"2018-02-10T10:47:48","date_gmt":"2018-02-10T09:47:48","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=6411"},"modified":"2018-02-19T13:27:18","modified_gmt":"2018-02-19T12:27:18","slug":"meltdown-and-high-performance-computing","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=6411","title":{"rendered":"Meltdown, Spectre and High Performance Computing"},"content":{"rendered":"<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Meltdown_(security_vulnerability)\">Meltdown bug<\/a>, which affects most modern CPUs, has been called by some <a href=\"https:\/\/www.theguardian.com\/technology\/2018\/jan\/04\/meltdown-spectre-worst-cpu-bugs-ever-found-affect-computers-intel-processors-security-flaw\">&#8216;The worst ever CPU bug&#8217;<\/a>. Accessible explanations of what the Meltdown bug actually is are available <a href=\"https:\/\/www.facebook.com\/Ozzard\/posts\/10157151940878975\">here<\/a> and <a href=\"https:\/\/www.raspberrypi.org\/blog\/why-raspberry-pi-isnt-vulnerable-to-spectre-or-meltdown\/\">here<\/a>.<\/p>\n<p>Software patches have been made available, but some people have estimated a performance hit of up to 30% in some cases. Some of us in the High Performance Computing (HPC) community (see <a href=\"https:\/\/twitter.com\/walkingrandomly\/status\/949230133243768835\">here for the initial Twitter conversation<\/a>) started to wonder what this might mean for the type of workloads that run on our systems. 
After all, if the worst-case scenario of 30% is the norm, it will drastically affect the power of our systems and hence reduce the amount of science we are able to support.<\/p>\n<p>In the video below, Professor Mark Handley from University College London gives a detailed explanation of both Meltdown and Spectre at an event held at the <a href=\"https:\/\/www.turing.ac.uk\/\">Alan Turing Institute<\/a> in London.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/m66EAgRMmi8\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<p>Another great video introduction to this topic is the talk given by Jon Masters at <a href=\"https:\/\/fosdem.org\/2018\/schedule\/event\/closing_keynote\/\">https:\/\/fosdem.org\/2018\/schedule\/event\/closing_keynote\/<\/a><\/p>\n<p><strong>To patch or not to patch<\/strong><\/p>\n<p>To a first approximation, a patch causing a 30% performance hit on a system costing \u00a31 million costs the equivalent of \u00a3300,000, which is not exactly small change! This has led to some people wondering if we should patch HPC systems at all:<\/p>\n<blockquote class=\"twitter-tweet\" data-lang=\"en\">\n<p dir=\"ltr\" lang=\"en\">Given the size of the performance hit should we even *be* patching for this? 
Unless you need trusted computing, does it really matter for the average HPC?<\/p>\n<p>\u2014 Phil Tooley (@acceleratedsci) <a href=\"https:\/\/twitter.com\/acceleratedsci\/status\/949233180451713024?ref_src=twsrc%5Etfw\">January 5, 2018<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p>All of the UK Tier-3 HPC centres I&#8217;m aware of (Sheffield, Leeds and Manchester) have applied the patches, but I&#8217;d be interested to learn of centres that decided not to. Feel free to comment here or <a href=\"https:\/\/twitter.com\/walkingrandomly\">message me on Twitter<\/a> if you have something to add to this discussion and I&#8217;ll update this post where appropriate.<\/p>\n<p><strong>Research paper discussing the performance penalties of these patches on HPC workloads<\/strong><\/p>\n<p>A paper on arXiv looks at the HPC performance penalties in more detail. From the paper&#8217;s abstract:<\/p>\n<blockquote class=\"\">\n<div class=\"\">\n<div class=\"\">The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. 
The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.<\/div>\n<\/div>\n<\/blockquote>\n<div class=\"\">The full paper is available at\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1801.04329\">https:\/\/arxiv.org\/abs\/1801.04329<\/a><\/div>\n<p>&nbsp;<\/p>\n<p><strong>Other relevant results and benchmarks<\/strong><\/p>\n<p>Here are a few other links that discuss the performance penalty of applying the Meltdown patch.<\/p>\n<ul>\n<li><a href=\"https:\/\/access.redhat.com\/articles\/3307751\">Red Hat&#8217;s experiments<\/a>\u00a0showing worst-case performance drops of 19% in certain benchmarks.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_page-table_isolation#Implementation\">Wikipedia article<\/a> showing a summary of various benchmarks.<\/li>\n<\/ul>\n<p><strong>Acknowledgements<\/strong><\/p>\n<p>Thanks to Adrian Jackson, Phil Tooley, Filippo Spiga and members of the UK HPC-SIG for useful discussions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Meltdown bug, which affects most modern CPUs, has been called by some &#8216;The worst ever CPU bug&#8217;. Accessible explanations of what the Meltdown bug actually is are available here and here. Software patches have been made available, but some people have estimated a performance hit of up to 30% in some cases. 
Some of [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[86,68,41],"tags":[],"class_list":["post-6411","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","category-hpc","category-parallel-programming"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-1Fp","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6411","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6411"}],"version-history":[{"count":6,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6411\/revisions"}],"predecessor-version":[{"id":6423,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6411\/revisions\/6423"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6411"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6411"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.p
hp?rest_route=%2Fwp%2Fv2%2Ftags&post=6411"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}