{"id":5343,"date":"2014-02-19T16:38:03","date_gmt":"2014-02-19T15:38:03","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=5343"},"modified":"2014-02-19T18:12:48","modified_gmt":"2014-02-19T17:12:48","slug":"checkpointing-matlab-programs","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=5343","title":{"rendered":"Checkpointing MATLAB Programs"},"content":{"rendered":"<p>I occasionally get emails from researchers saying something like this<\/p>\n<p><em>&#8216;My MATLAB code takes a week to run and the cleaner\/cat\/my husband keeps switching off my machine \u00a0before it&#8217;s completed &#8212; could you help me make the code go faster please so that I can get my results in between these events&#8217;<\/em><\/p>\n<p>While I am more than happy to try to optimise the code in question, what these users really need is some sort of checkpointing scheme. Checkpointing is also important for users of high performance computing systems that limit the length of each individual job.<\/p>\n<p><strong>The solution &#8211; Checkpointing (or &#8216;Assume that your job will frequently be killed&#8217;)<\/strong><\/p>\n<p>The basic idea behind checkpointing is to periodically save your program&#8217;s state so that, if it is interrupted, it can start again where it left off rather than from the beginning. In order to demonstrate some of the principles involved, I&#8217;m going to need some code that&#8217;s sufficiently simple that it doesn&#8217;t cloud what I want to discuss. 
Let&#8217;s add up some numbers using a for-loop.<\/p>\n<pre>%addup.m\r\n%This is not the recommended way to sum integers in MATLAB -- we only use it here to keep things simple\r\n%This version does NOT use checkpointing\r\n\r\nmysum=0;\r\nfor count=1:100\r\n    mysum = mysum + count;\r\n    pause(1);           %Let's pretend that this is a complicated calculation\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\n\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p><em>Using a for-loop to perform an addition like this is\u00a0something that I&#8217;d never usually suggest in MATLAB<\/em> but I&#8217;m using it here because it is so simple that it won&#8217;t get in the way of understanding the checkpointing code.<\/p>\n<p>If you run this program in MATLAB, it will take about 100 seconds thanks to that pause statement which is acting as a proxy for some real work. Try interrupting it by pressing CTRL-C and then restart it. As you might expect, it will always start from the beginning:<\/p>\n<pre>&gt;&gt; addup\r\nCompleted iteration 1\r\nCompleted iteration 2\r\nCompleted iteration 3\r\nOperation terminated by user during addup (line 6)\r\n\r\n&gt;&gt; addup\r\nCompleted iteration 1\r\nCompleted iteration 2\r\nCompleted iteration 3\r\nOperation terminated by user during addup (line 6)<\/pre>\n<p>This is no big deal when your calculation only takes 100 seconds but is going to be a major problem when the calculation represented by that pause statement becomes something like an hour rather than a second.<\/p>\n<p>Let&#8217;s now look at a version of the above that makes use of checkpointing.<\/p>\n<pre>%addup_checkpoint.m\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\nend\r\n\r\nfor count = 
countmin:100\r\n    mysum = mysum + count;\r\n    pause(1);           %Let's pretend that this is a complicated calculation\r\n\r\n    %save checkpoint\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    fprintf('Saving checkpoint\\n');\r\n    save('checkpoint.mat');\r\n\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p>Before you run the above code, the checkpoint file <strong>checkpoint.mat<\/strong> does not exist and so the calculation starts from the beginning. After every iteration, a checkpoint file is created which contains every variable in the MATLAB workspace. If the program is restarted, it will find the checkpoint file and continue where it left off. Our code now deals with interruptions a lot more gracefully.<\/p>\n<pre>&gt;&gt; addup_checkpoint\r\nNo checkpoint file found - starting from beginning\r\nSaving checkpoint\r\nCompleted iteration 1 \r\nSaving checkpoint\r\nCompleted iteration 2 \r\nSaving checkpoint\r\nCompleted iteration 3 \r\nOperation terminated by user during addup_checkpoint (line 16)\r\n\r\n&gt;&gt; addup_checkpoint\r\nCheckpoint file found - Loading\r\nSaving checkpoint\r\nCompleted iteration 4 \r\nSaving checkpoint\r\nCompleted iteration 5 \r\nSaving checkpoint\r\nCompleted iteration 6 \r\nOperation terminated by user during addup_checkpoint (line 16)<\/pre>\n<p>Note that we&#8217;ve had to change the program logic slightly. Our original loop counter was<\/p>\n<pre>for count = 1:100<\/pre>\n<p>In the check-pointed example, however, we&#8217;ve had to introduce the variable <strong>countmin<\/strong><\/p>\n<pre>for count = countmin:100<\/pre>\n<p>This allows us to start the loop from whatever value of countmin was in our last checkpoint file. 
Such minor modifications are often necessary when converting code to use checkpointing and you should carefully check that the introduction of checkpointing does not introduce bugs in your code.<\/p>\n<p><strong>Don&#8217;t checkpoint too often<\/strong><\/p>\n<p>The creation of even a small checkpoint file is a time consuming process. Consider our original addup code but without the pause command.<\/p>\n<pre>%addup_nopause.m\r\n%This version does NOT use checkpointing\r\nmysum=0;\r\nfor count=1:100\r\n    mysum = mysum + count;\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p>On my machine, this code takes 0.0046 seconds to execute. Compare this to the checkpointed version, again with the pause statement removed.<\/p>\n<pre>%addup_checkpoint_nopause.m\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\nend\r\n\r\nfor count = countmin:100\r\n    mysum = mysum + count;\r\n\r\n    %save checkpoint\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    fprintf('Saving checkpoint\\n');\r\n    save('checkpoint.mat');\r\n\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p>This checkpointed version takes 0.85 seconds to execute on the same machine &#8212; Over 180 times slower than the original! 
The problem is that the time it takes to checkpoint is long compared to the calculation time.<\/p>\n<p>If we make a modification so that we only checkpoint every 25 iterations, code execution time comes down to 0.05 seconds:<\/p>\n<pre>%Checkpoint every 25 iterations\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\nend\r\n\r\nfor count = countmin:100\r\n    mysum = mysum + count;\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n\r\n    if mod(count,25)==0\r\n        %save checkpoint   \r\n        fprintf('Saving checkpoint\\n');\r\n        save('checkpoint.mat');\r\n    end\r\n\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p>Of course, the issue now is that we might lose more work if our program is interrupted between checkpoints. Additionally, in this particular case, the mod command used to decide whether or not to checkpoint is more expensive than simply performing the calculation but hopefully that isn&#8217;t going to be the case when working with real world calculations.<\/p>\n<p>In practice, we have to work out a balance such that we checkpoint often enough so that we don&#8217;t stand to lose too much work but not so often that our program runs too slowly.<\/p>\n<p><strong>Checkpointing code that involves random numbers<\/strong><\/p>\n<p>Extra care needs to be taken when running code that involves random numbers. 
Consider a modification of our checkpointed adding program that creates a sum of random numbers.<\/p>\n<pre>%addup_checkpoint_rand.m\r\n%Adding random numbers the slow way, in order to demo checkpointing\r\n%This version has a bug\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\n    rng(0);     %Seed the random number generator for reproducible results\r\nend\r\n\r\nfor count = countmin:100\r\n    mysum = mysum + rand();\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    pause(1); %pretend this is a complicated calculation\r\n\r\n    %save checkpoint\r\n    fprintf('Saving checkpoint\\n');\r\n    save('checkpoint.mat');\r\n\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p>In the above, we set the seed of the random number generator to 0 at the beginning of the calculation. This ensures that we always get the same set of random numbers and allows us to get reproducible results. As such, the sum should always come out to be 52.799447 to the number of decimal places used in the program.<\/p>\n<p>The above code has a subtle bug that you won&#8217;t find if your testing is confined to interrupting using CTRL-C and then restarting in an interactive session of MATLAB. Proceed that way, and you&#8217;ll get exactly the sum you&#8217;d expect: 52.799447. 
\u00a0If, on the other hand, you test your code by doing the following<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">Run for a few iterations<\/span><\/li>\n<li>Interrupt with CTRL-C<\/li>\n<li>Restart MATLAB<\/li>\n<li>Run the code again, ensuring that it starts from the checkpoint<\/li>\n<\/ul>\n<p>You&#8217;ll get a different result. This is not what we want!<\/p>\n<p>The root cause of this problem is that we are not saving the state of the random number generator in our checkpoint file. Thus, when we restart MATLAB, all information concerning this state is lost. If we don&#8217;t restart MATLAB between interruptions, the state of the random number generator is safely tucked away behind the scenes.<\/p>\n<p>Assume, for example, that you stop the calculation running after the third iteration. The random numbers you&#8217;d have consumed would be (to 4 d.p.)<\/p>\n<p>0.8147<br \/>\n0.9058<br \/>\n0.1270<\/p>\n<p>Your checkpoint file will contain the variables <strong>mysum<\/strong>, <strong>count<\/strong> and <strong>countmin<\/strong> but will contain nothing about the state of the random number generator. In English, this state is something like <em>&#8216;The next random number will be the 4th one in the sequence defined by a starting seed of 0.&#8217;<\/em><\/p>\n<p>When we restart MATLAB, the default seed is 0 so we&#8217;ll be using the right sequence (since we explicitly set it to be 0 in our code) but we&#8217;ll be starting right from the beginning again. That is, the 4th,5th and 6th iterations of the summation will contain the first 3 numbers in the stream, thus double counting them, and so our checkpointing procedure will alter the results of the calculation.<\/p>\n<p>In order to fix this, we need to additionally save the state of the random number generator when we save a checkpoint and also make correct use of this on restarting. 
Here&#8217;s the code<\/p>\n<pre>%addup_checkpoint_rand_correct.m\r\n%Adding random numbers the slow way, in order to demo checkpointing\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n\r\n    %use the saved RNG state\r\n    stream = RandStream.getGlobalStream;\r\n    stream.State = savedState;\r\n\r\nelse % otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\n    rng(0);     %Seed the random number generator for reproducible results\r\nend\r\n\r\nfor count = countmin:100\r\n    mysum = mysum + rand();\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    pause(1); %pretend this is a complicated calculation\r\n\r\n    %save the state of the random number generator\r\n    stream = RandStream.getGlobalStream;\r\n    savedState = stream.State;\r\n    %save checkpoint\r\n    fprintf('Saving checkpoint\\n');\r\n    save('checkpoint.mat');\r\n\r\n    fprintf('Completed iteration %d \\n',count);\r\nend\r\nfprintf('The sum is %f \\n',mysum);<\/pre>\n<p><strong>Ensuring that the checkpoint save completes<\/strong><\/p>\n<p>Events that terminate our code can occur extremely quickly &#8212; a power cut for example. There is a risk that the machine was switched off while our check-point file was being written. How can we ensure that the file is complete?<\/p>\n<p>The solution, which I found on the <a href=\"http:\/\/www.liv.ac.uk\/csd\/escience\/condor\/checkpoint.htm\">MATLAB checkpointing page of the Liverpool University Condor Pool site<\/a> is to first write a temporary file and then rename it. 
\u00a0That is, instead of<\/p>\n<pre>save('checkpoint.mat')<\/pre>\n<p>we do<\/p>\n<pre>\r\nsave('checkpoint_tmp.mat');\r\nif strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')\r\n            %We are running on a windows machine\r\n            system( 'move \/y checkpoint_tmp.mat checkpoint.mat' );\r\nelse\r\n            %We are running on Linux or Mac\r\n            system( 'mv checkpoint_tmp.mat checkpoint.mat' );\r\nend\r\n<\/pre>\n<p>As the author of that page explains <em>&#8216;The operating system should guarantee that the move command is &#8220;atomic&#8221; (in the sense that it is indivisible i.e. it succeeds completely or not at all) so that there is no danger of receiving a corrupt &#8220;half-written&#8221; checkpoint file from the job.&#8217;<\/em><\/p>\n<p><strong>Only checkpoint what is necessary<\/strong><\/p>\n<p>So far, we&#8217;ve been saving the entire MATLAB workspace in our checkpoint files and this hasn&#8217;t been a problem since our workspace hasn&#8217;t contained much. In general, however, the workspace might contain all manner of intermediate variables that we simply don&#8217;t need in order to restart where we left off. Saving the stuff that we might not need can be expensive.<\/p>\n<p>For the sake of illustration, let&#8217;s skip 100 million random numbers before adding one to our sum. For reasons only known to ourselves, we store these numbers in an intermediate variable which we never do anything with. This array isn&#8217;t particularly large at 763 Megabytes but its existence slows down our checkpointing somewhat. 
The correct result of this variation of the calculation is 41.251376 if we set the starting seed to 0; something we can use to test our new checkpoint strategy.<\/p>\n<p>Here&#8217;s the code<\/p>\n<pre>\r\n% A demo of how slow checkpointing can be if you include large intermediate variables\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n    %use the saved RNG state\r\n    stream = RandStream.getGlobalStream;\r\n    stream.State = savedState;\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\n    rng(0);     %Seed the random number generator for reproducible results\r\nend\r\n\r\nfor count = countmin:100\r\n    %Create and store 100 million random numbers for no particular reason\r\n    randoms = rand(10000);\r\n    mysum = mysum + rand();\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    fprintf('Completed iteration %d \\n',count);\r\n    \r\n    if mod(count,25)==0\r\n        %save the state of the random number generator\r\n        stream = RandStream.getGlobalStream;\r\n        savedState = stream.State;\r\n        %save and time checkpoint\r\n        tic\r\n        save('checkpoint_tmp.mat');\r\n        if strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')\r\n            %We are running on a windows machine\r\n            system( 'move \/y checkpoint_tmp.mat checkpoint.mat' );\r\n        else\r\n            %We are running on Linux or Mac\r\n            system( 'mv checkpoint_tmp.mat checkpoint.mat' );\r\n        end\r\n        timing = toc;\r\n        fprintf('Checkpoint save took %f seconds\\n',timing);\r\n    end\r\n    \r\nend\r\nfprintf('The sum is %f \\n',mysum);\r\n<\/pre>\n<p>On my Windows 7 Desktop, each checkpoint save takes around 17 seconds:<\/p>\n<pre>Completed 
iteration 25 \r\n        1 file(s) moved. \r\nCheckpoint save took 17.269897 seconds<\/pre>\n<p>It is not necessary to include that huge random matrix in a checkpoint file. If we are explicit in what we require, we can reduce the time taken to checkpoint significantly. Here, we change<\/p>\n<pre>save('checkpoint_tmp.mat');<\/pre>\n<p>to<\/p>\n<pre>save('checkpoint_tmp.mat','mysum','countmin','savedState');<\/pre>\n<p>This has a dramatic effect on check-pointing time:<\/p>\n<pre>Completed iteration 25 \r\n        1 file(s) moved. \r\nCheckpoint save took 0.033576 seconds<\/pre>\n<p>Here&#8217;s the final piece of code that uses everything discussed in this article<\/p>\n<pre>\r\n%Final checkpointing demo\r\n\r\nif exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it\r\n    fprintf('Checkpoint file found - Loading\\n');\r\n    load('checkpoint.mat')\r\n    %use the saved RNG state\r\n    stream = RandStream.getGlobalStream;\r\n    stream.State = savedState;\r\nelse %otherwise, start from the beginning\r\n    fprintf('No checkpoint file found - starting from beginning\\n');\r\n    mysum=0;\r\n    countmin=1;\r\n    rng(0);     %Seed the random number generator for reproducible results\r\nend\r\n\r\nfor count = countmin:100\r\n    %Create and store 100 million random numbers for no particular reason\r\n    randoms = rand(10000);\r\n    mysum = mysum + rand();\r\n    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration\r\n    fprintf('Completed iteration %d \\n',count);\r\n    \r\n    if mod(count,25)==0 %checkpoint every 25th iteration\r\n        %save the state of the random number generator\r\n        stream = RandStream.getGlobalStream;\r\n        savedState = stream.State;\r\n        %save and time checkpoint\r\n        tic\r\n        %only save the variables that are strictly necessary\r\n        save('checkpoint_tmp.mat','mysum','countmin','savedState');\r\n        %Ensure that the save completed\r\n        
if strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')\r\n            %We are running on a windows machine\r\n            system( 'move \/y checkpoint_tmp.mat checkpoint.mat' );\r\n        else\r\n            %We are running on Linux or Mac\r\n            system( 'mv checkpoint_tmp.mat checkpoint.mat' );\r\n        end\r\n        timing = toc;\r\n        fprintf('Checkpoint save took %f seconds\\n',timing);\r\n    end\r\n    \r\nend\r\nfprintf('The sum is %f \\n',mysum);\r\n<\/pre>\n<p><strong>Parallel checkpointing<\/strong><\/p>\n<p>If your code includes parallel regions using constructs such as parfor or spmd, you might have to do more work to checkpoint correctly. I haven&#8217;t considered any of the potential issues that may arise in such code in this article.<\/p>\n<p><strong>Checkpointing checklist<\/strong><\/p>\n<p>Here&#8217;s a reminder of everything you need to consider<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">Test to ensure that the introduction of checkpointing doesn&#8217;t alter results<\/span><\/li>\n<li>Don&#8217;t checkpoint too often<\/li>\n<li>Take care when checkpointing code that involves random numbers &#8211; you need to explicitly save the state of the random number generator.<\/li>\n<li>Take measures to ensure that the checkpoint save is completed<\/li>\n<li>Only checkpoint what is necessary<\/li>\n<li>Code that includes parallel regions might require extra care<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I occasionally get emails from researchers saying something like this &#8216;My MATLAB code takes a week to run and the cleaner\/cat\/my husband keeps switching off my machine \u00a0before it&#8217;s completed &#8212; could you help me make the code go faster please so that I can get my results in between these events&#8217; While I am 
[&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[48,11,7,63,42],"tags":[],"class_list":["post-5343","post","type-post","status-publish","format-standard","hentry","category-condor","category-matlab","category-programming","category-random-numbers","category-tutorials"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-1ob","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/5343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5343"}],"version-history":[{"count":32,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/5343\/revisions"}],"predecessor-version":[{"id":5376,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/5343\/revisions\/5376"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/
\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}