{"id":6431,"date":"2018-02-21T13:22:15","date_gmt":"2018-02-21T12:22:15","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=6431"},"modified":"2018-02-21T13:37:56","modified_gmt":"2018-02-21T12:37:56","slug":"creating-a-temporary-customised-multi-user-hpc-cluster-for-teaching-using-amazon-aws-and-alces-flight","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=6431","title":{"rendered":"Creating a temporary, customised, multi-user HPC cluster for teaching using Amazon AWS and Alces Flight"},"content":{"rendered":"<p>In a <a href=\"https:\/\/www.walkingrandomly.com\/?p=6392\">previous blog post<\/a>, I told the story of how I used Amazon AWS and <a href=\"https:\/\/alces-flight.com\/\">AlcesFlight<\/a> to create a temporary multi-user HPC cluster for use in a training course.\u00a0 Here are the details of how I actually did it.<\/p>\n<p>Note that I have only ever used this configuration as a training cluster.\u00a0 I am not suggesting that the customisations are suitable for real work.<\/p>\n<p><strong>Before you start<\/strong><\/p>\n<p>Before attempting to use AlcesFlight on AWS, I suggest that you ensure that you have the following things working<\/p>\n<ul>\n<li>A working AWS account.\u00a0 Ensure that you can create an <a href=\"https:\/\/aws.amazon.com\/getting-started\/tutorials\/launch-a-virtual-machine\/\">EC2 virtual machine<\/a> and connect to it via ssh.<\/li>\n<li>Get the <a href=\"https:\/\/aws.amazon.com\/cli\/\">AWS Command Line Interface<\/a> working.<\/li>\n<\/ul>\n<p><strong>Customizing the HPC cluster on AWS<\/strong><\/p>\n<p>AlcesFlight provides a CloudFormation template for launching cluster instances on Amazon AWS.\u00a0 The practical upshot of this is that you answer a bunch of questions on a web form to customise your cluster and then you launch it.<\/p>\n<p>We are going to use this CloudFormation template along with some bash scripts that provide additional customisation.<\/p>\n<p><strong>Get the customisation 
scripts<\/strong><\/p>\n<p>The first step is to get some customisation scripts into an S3 bucket. You could use your own or you could use the ones I created.<\/p>\n<p>If you use mine, take a good look at them first to make sure you are happy with what I&#8217;ve done!\u00a0 It&#8217;s probably worth using your own fork of my repo so you can customise your cluster further.<\/p>\n<p>It&#8217;s the bash scripts that create a bunch of user accounts for trainees with randomised passwords.\u00a0 My scripts do some other things too and I&#8217;ve listed everything in the <a href=\"https:\/\/github.com\/mikecroucher\/alces_flight_customisation\/blob\/master\/README.md\">GitHub README.md<\/a>.<br \/>\n<code><br \/>\ngit clone https:\/\/github.com\/mikecroucher\/alces_flight_customisation<br \/>\ncd alces_flight_customisation<br \/>\n<\/code><\/p>\n<p>Now you need to upload these to an S3 bucket. I called mine\u00a0<strong>walkingrandomly-aws-cluster<\/strong>:<br \/>\n<code><br \/>\naws s3api create-bucket --bucket walkingrandomly-aws-cluster --region eu-west-2 --create-bucket-configuration LocationConstraint=eu-west-2<br \/>\naws s3 sync . 
s3:\/\/walkingrandomly-aws-cluster --delete<br \/>\n<\/code><\/p>\n<p><strong>Set up the CloudFormation template<\/strong><\/p>\n<ul>\n<li>Head over to <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/B01GC9E3OG?qid=1518562867216&amp;amp;sr=0-1&amp;amp;ref_=srh_res_product_title\">Alces Flight Solo (Community Edition)<\/a>\u00a0and click on <strong>continue to subscribe<\/strong><\/li>\n<li>Choose the region you want to create the cluster in, select\u00a0<strong>Personal HPC compute cluster<\/strong> and click on\u00a0<strong>Launch with CloudFormation Console<\/strong><\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-6428\" src=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster-300x265.png\" alt=\"personal_hpc_compute_cluster\" width=\"300\" height=\"265\" srcset=\"https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster-300x265.png 300w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster-768x678.png 768w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster-1024x904.png 1024w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/personal_hpc_compute_cluster.png 1307w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<ul>\n<li>Go through the CloudFormation template screens, configuring the cluster as you want it, until you reach the\u00a0<strong>S3 bucket for customization profiles\u00a0<\/strong>box, where you fill in the name of the S3 bucket you created earlier.<\/li>\n<li>Enable the default profile<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-6429\" 
src=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation-300x90.png\" alt=\"flight_customisation\" width=\"300\" height=\"90\" srcset=\"https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation-300x90.png 300w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation-768x230.png 768w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation-1024x307.png 1024w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/flight_customisation.png 1129w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<ul>\n<li>Continue answering the questions asked by the web form.\u00a0 For this simple training cluster, I just accepted all of the defaults and it worked fine.<\/li>\n<\/ul>\n<p>When the CloudFormation stack has been fully created, you can log in to your new cluster as an administrator.\u00a0 To get the connection details of the head node, go to the <strong>EC2 management console<\/strong>\u00a0in your web browser, select\u00a0the head node and click on <strong>Connect<\/strong>.<\/p>\n<p>When you log in to the cluster as administrator, the usernames and passwords for your training cohort will be in the file specified by the password_file <a href=\"https:\/\/github.com\/mikecroucher\/alces_flight_customisation\/blob\/master\/customizer\/default\/configure.d\/run_me.sh\">variable in the configure.d\/run_me.sh script<\/a>. 
I set my administrator account to be called walkingrandomly and so put the password file in \/home\/walkingrandomly\/users.txt.\u00a0 I could then print this out and distribute the usernames and passwords to each training delegate.<\/p>\n<p>This is probably not great sysadmin practice but worked on the day.\u00a0 If anyone can come up with a better way, <a href=\"https:\/\/github.com\/mikecroucher\/alces_flight_customisation\">Pull Requests are welcome<\/a>!<\/p>\n<p><a href=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/alces_flight_login.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-6435\" src=\"https:\/\/www.walkingrandomly.com\/wp-content\/uploads\/2018\/02\/alces_flight_login-1024x696.png\" alt=\"alces_flight_login\" width=\"1024\" height=\"696\" srcset=\"https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/alces_flight_login-1024x696.png 1024w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/alces_flight_login-300x204.png 300w, https:\/\/walkingrandomly.com\/wp-content\/uploads\/2018\/02\/alces_flight_login-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<p><strong>Try a training account<\/strong><\/p>\n<p>At this point, I suggest that you try logging in as one of the training user accounts and make sure you can successfully submit a job.\u00a0 When I first tried all of this, the default scheduler on the created cluster was Sun Grid Engine and my first attempt at customisation left me with user accounts that couldn&#8217;t submit jobs.<\/p>\n<p>The current scripts have been battle-tested with Sun Grid Engine, including MPI job submission, and I&#8217;ve also done a very basic test with Slurm. 
However, you really should check that a user account can submit all of the types of job you expect to use in class.<\/p>\n<p><strong>Troubleshooting<\/strong><\/p>\n<p>When I first tried to do this, things didn&#8217;t go completely smoothly.\u00a0 Here are some things I learned to help diagnose the problems.<\/p>\n<p>Full documentation is available at\u00a0<a href=\"http:\/\/docs.alces-flight.com\/en\/stable\/customisation\/customisation.html\">http:\/\/docs.alces-flight.com\/en\/stable\/customisation\/customisation.html<\/a><\/p>\n<p>On the cluster, we can see where it&#8217;s looking for its customisation scripts with the\u00a0<strong>alces about<\/strong>\u00a0command:<\/p>\n<p><code><br \/>\nalces about customizer<br \/>\nCustomizer bucket prefix: s3:\/\/walkingrandomly-aws-cluster\/customizer<br \/>\n<\/code><\/p>\n<p>The log file at <strong>\/var\/log\/clusterware\/instance.log<\/strong> on both the head node and worker nodes is very useful.<\/p>\n<p>Once, I did all of this using a Windows CMD bash prompt and the customisation scripts failed to run.\u00a0 The logs showed this error:<br \/>\n<code><br \/>\n\/bin\/bash^M: bad interpreter: No such file or directory<\/code><\/p>\n<p>This is a classic <a href=\"https:\/\/www.walkingrandomly.com\/?p=9\">dos2unix error<\/a> (the ^M is the carriage return from Windows-style CRLF line endings in the script) and could be avoided, for example, by using the <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/wsl\/install-win10\">Windows Subsystem for Linux<\/a> instead of CMD.exe.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a previous blog post, I told the story of how I used Amazon AWS and AlcesFlight to create a temporary multi-user HPC cluster for use in a training course.\u00a0 Here are the details of how I actually did it. 
Note that I have only ever used this configuration as a training cluster.\u00a0 I am [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[86,68],"tags":[],"class_list":["post-6431","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","category-hpc"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-1FJ","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6431","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6431"}],"version-history":[{"count":5,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6431\/revisions"}],"predecessor-version":[{"id":6439,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/6431\/revisions\/6439"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6431"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6431"},{"taxonomy":"post_tag","embeddable":true,"hre
f":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6431"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}