Ben's Blog

Miscellaneous AWS Notes

Miscellaneous notes about various AWS-related things.

I finally got AWS ParallelCluster working (after a long, painful battle). I keep referencing a grab bag of AWS details I've written down in various places, so I figured I should collect them all here to make them easier to find later.

GPU Notes

The CUDA architecture corresponding to each gencode:

| Architecture | Gencode | Example GPUs | CUDA Versions |
| --- | --- | --- | --- |
| Fermi | 20 | | 3.2 - 8 |
| Kepler | 30 | | 5 - 10 |
| Kepler | 37 | K80 | |
| Maxwell | 50 | M60 | 6 - 11 |
| Pascal | 60 | | 8 - |
| Pascal | 62 | Jetson TX2 | |
| Volta | 70 | V100 | 9 - |
| Volta | 72 | Jetson Xavier | |
| Turing | 75 | T4, RTX 20xx | 10 - |
| Ampere | 80 | A100 | 11.1 - |
| Ampere | 86 | A10G, RTX 30xx | |
| Ampere | 87 | Jetson Orin | |
| Ada | 89 | RTX 4090 | 11.8 - |
| Hopper | 90 | H100 | 12 - |

Alternatively, for some GPUs which are available through AWS:

| GPU | Memory (GB) | Gencode |
| --- | --- | --- |
| V100 | 32 | SM70 |
| A100 | 40/80 | SM80 |
| H100 | 80 | SM90 |
| M60 | 8 | SM50 |
| T4 | 16 | SM75 |
| T4G | 16 | SM75 |
| A10G | 24 | SM86 |
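The table above can be folded into a small lookup helper, e.g. for building `nvcc` flags when compiling for a specific AWS GPU (the dict and function names here are my own):

```python
# Gencode for each AWS-available GPU, taken from the table above.
GENCODES = {
    "V100": 70,
    "A100": 80,
    "H100": 90,
    "M60": 50,
    "T4": 75,
    "T4G": 75,
    "A10G": 86,
}


def nvcc_gencode_flag(gpu: str) -> str:
    """Builds the -gencode flag targeting the given AWS GPU type."""
    sm = GENCODES[gpu]
    return f"-gencode=arch=compute_{sm},code=sm_{sm}"


print(nvcc_gencode_flag("A100"))  # -gencode=arch=compute_80,code=sm_80
```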

Instance Types

Use this link to get the most up-to-date prices for each instance type. The prices below are a snapshot from a single reference day, so treat them as approximate.

For reference, there are 720 hours in a month and 8760 hours in a year, so:

  • **$0.10** / hour translates to $2.40 / day, **$72** / month, or $876 / year.
  • **$0.50** / hour translates to $12.00 / day, **$360** / month, or $4,380 / year.
  • **$1.00** / hour translates to $24.00 / day, **$720** / month, or $8,760 / year.
  • **$2.00** / hour translates to $48.00 / day, **$1,440** / month, or $17,520 / year.
  • **$4.00** / hour translates to $96.00 / day, **$2,880** / month, or $35,040 / year.
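The conversions above are just multiplications by 24, 720, and 8760; a tiny helper (my own sketch, not anything AWS provides) makes it easy to run for any rate:

```python
def hourly_to_costs(rate: float) -> dict:
    """Expands an hourly on-demand rate into approximate daily, monthly,
    and yearly costs, using 720 hours/month and 8760 hours/year as above."""
    return {
        "day": round(rate * 24, 2),
        "month": round(rate * 720, 2),
        "year": round(rate * 8760, 2),
    }


print(hourly_to_costs(1.00))  # {'day': 24.0, 'month': 720.0, 'year': 8760.0}
```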
| GPU Type | Instance Type | GPUs | vCPUs | RAM | Price / GPU-hour |
| --- | --- | --- | --- | --- | --- |
| V100 | p3.2xlarge | 1 | 8 | 61 GB | $3.06 |
| | p3.8xlarge | 4 | 32 | 244 GB | $3.06 |
| | p3.16xlarge | 8 | 64 | 488 GB | $3.06 |
| | p3dn.24xlarge | 8 | 96 | 768 GB | $3.90 |
| A100 | p4d.24xlarge | 8 | 96 | 1152 GB | $4.10 |
| H100 | p5.48xlarge | 8 | 192 | 2048 GB | $12.29 |
| M60 | g3s.xlarge | 1 | 4 | 30 GB | $0.75 |
| | g3.4xlarge | 1 | 16 | 122 GB | $1.14 |
| | g3.8xlarge | 2 | 32 | 244 GB | $1.14 |
| | g3.16xlarge | 4 | 64 | 488 GB | $1.14 |
| T4 | g4dn.xlarge | 1 | 4 | 16 GB | $0.526 |
| | g4dn.2xlarge | 1 | 8 | 32 GB | $0.752 |
| | g4dn.4xlarge | 1 | 16 | 64 GB | $1.204 |
| | g4dn.8xlarge | 1 | 32 | 128 GB | $2.176 |
| | g4dn.12xlarge | 4 | 48 | 192 GB | $0.978 |
| | g4dn.16xlarge | 1 | 64 | 256 GB | $4.352 |
| | g4dn.metal | 8 | 96 | 384 GB | $0.978 |
| T4G | g5g.xlarge | 1 | 4 | 8 GB | $0.42 |
| | g5g.2xlarge | 1 | 8 | 16 GB | $0.556 |
| | g5g.4xlarge | 1 | 16 | 32 GB | $0.828 |
| | g5g.8xlarge | 1 | 32 | 64 GB | $1.372 |
| | g5g.16xlarge | 2 | 64 | 128 GB | $1.372 |
| | g5g.metal | 2 | 64 | 128 GB | $1.372 |
| A10G | g5.xlarge | 1 | 4 | 16 GB | $1.006 |
| | g5.2xlarge | 1 | 8 | 32 GB | $1.212 |
| | g5.4xlarge | 1 | 16 | 64 GB | $1.624 |
| | g5.8xlarge | 1 | 32 | 128 GB | $2.448 |
| | g5.12xlarge | 4 | 48 | 192 GB | $1.418 |
| | g5.16xlarge | 1 | 64 | 256 GB | $4.096 |
| | g5.24xlarge | 4 | 96 | 384 GB | $2.036 |
| | g5.48xlarge | 8 | 192 | 768 GB | $2.036 |
| Trainium | trn1.2xlarge | 1 | 8 | 32 GB | $1.34 |
| | trn1.32xlarge | 16 | 128 | 512 GB | $1.34 |
| | trn1n.32xlarge | 16 | 128 | 512 GB | $1.34 |
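One pattern that falls out of the table is that, within a GPU type, the per-GPU-hour price varies a lot by instance size. A quick sketch over a few rows of the table (the dict and function are mine, not an AWS API) finds the cheapest option per GPU:

```python
# Price per GPU-hour for a few rows of the table above.
PRICES = {
    "T4": {"g4dn.xlarge": 0.526, "g4dn.12xlarge": 0.978, "g4dn.metal": 0.978},
    "A10G": {"g5.xlarge": 1.006, "g5.24xlarge": 2.036, "g5.48xlarge": 2.036},
    "A100": {"p4d.24xlarge": 4.10},
}


def cheapest(gpu: str) -> tuple:
    """Returns the instance type with the lowest price per GPU-hour."""
    instances = PRICES[gpu]
    name = min(instances, key=instances.get)
    return name, instances[name]


print(cheapest("T4"))  # ('g4dn.xlarge', 0.526)
```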

Other non-GPU instance types:

| Instance Type | Price / hour | vCPUs | Memory | Notes |
| --- | --- | --- | --- | --- |
| c5.large | $0.145 | 2 | 4 GB | Xeon Platinum 8000, 10 Gbps networking |
| c5.xlarge | $0.354 | 4 | 8 GB | Xeon Platinum 8000, 25 Gbps networking |
| t2.xlarge | $0.186 | 4 | 16 GB | Xeon Skylake |
| t2.2xlarge | $0.371 | 8 | 32 GB | Xeon Skylake |
| t3.2xlarge | $0.333 | 8 | 32 GB | Xeon Haswell |
| t3a.2xlarge | $0.301 | 8 | 32 GB | AMD EPYC 7000 |
| t4g.2xlarge | $0.269 | 8 | 32 GB | Graviton2 (arm64) |

Availability Zones

Use this link to get availability zones for each region. The zones below are for us-east-1.

| GPU Type | Instance Type | 1a | 1b | 1c | 1d | 1e | 1f |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | p3.2xlarge | | | | | | |
| | p3.8xlarge | | | | | | |
| | p3.16xlarge | | | | | | |
| | p3dn.24xlarge | | | | | | |
| A100 | p4d.24xlarge | | | | | | |
| H100 | p5.48xlarge | | | | | | |
| A10G | g5.xlarge | | | | | | |
| | g5.2xlarge | | | | | | |
| | g5.4xlarge | | | | | | |
| | g5.8xlarge | | | | | | |
| | g5.12xlarge | | | | | | |
| | g5.16xlarge | | | | | | |
| | g5.24xlarge | | | | | | |
| | g5.48xlarge | | | | | | |
| Trainium | trn1.2xlarge | | | | | | |
| | trn1.32xlarge | | | | | | |
| | trn1n.32xlarge | | | | | | |
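Since zone availability shifts over time, it's easy to check directly with the AWS CLI rather than trusting a snapshot, e.g.:

```shell
# List which us-east-1 availability zones currently offer p4d.24xlarge.
aws ec2 describe-instance-type-offerings \
  --region us-east-1 \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=p4d.24xlarge" \
  --query "InstanceTypeOfferings[].Location" \
  --output text
```

Note that zone names like `us-east-1a` are mapped differently per account; use `--location-type availability-zone-id` if you need account-independent zone IDs.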

ParallelCluster

  • There are some odd issues with certain instance types and Elastic IPs that can cause the cluster to hang while being created. This can potentially be fixed by adding ElasticIp: true under HeadNode.Networking and AssignPublicIp: true under Scheduling.SlurmQueues.Networking in the config.
  • To use A100s (i.e., p4d.24xlarge), you need to stop the subnet from automatically assigning a public IP. To do so:
    • Go to the VPC console here
    • Click on "Subnets" in the left sidebar.
    • Select the subnet you want to use.
    • Click on "Modify auto-assign IP settings" in the "Actions" dropdown.
    • Uncheck "Auto-assign IPv4" and click "Save".
  • See the full cluster configuration here, or just the required options here.
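For reference, a minimal sketch of where those two networking keys sit in a ParallelCluster 3 config; the instance types, subnet IDs, and queue/resource names below are placeholders, not values from my actual cluster:

```yaml
HeadNode:
  InstanceType: c5.xlarge                  # placeholder
  Networking:
    SubnetId: subnet-0123456789abcdef0     # placeholder
    ElasticIp: true                        # helps avoid the creation hang noted above
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu                            # placeholder queue name
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0       # placeholder
        AssignPublicIp: true               # the other half of the workaround
      ComputeResources:
        - Name: gpu-nodes                  # placeholder
          InstanceType: g5.xlarge
          MinCount: 0
          MaxCount: 4
```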

Important Logs

  • /var/log/parallelcluster/slurm_resume.log - Logs for the Slurm resume script
  • /var/log/parallelcluster/clustermgtd - Logs for the cluster management daemon (requires root access)
  • /var/log/slurmctld.log - Logs for the Slurm controller (requires root access)

FSx Lustre

  • Console
  • FSx mounts are restricted to a single availability zone

Aurora RDS

The dev and prod instance types for Aurora that AWS recommends are:

  • Prod
    • db.r6g.2xlarge
      • 8 vCPUs
      • 64 GiB RAM
      • $1.038 per hour, or $174.38 per week
    • This is under the "memory optimized" category
  • Dev
    • db.t4g.large
      • 2 vCPUs
      • 8 GiB RAM
      • $0.146 per hour, or $24.53 per week
    • This is under the "burstable" category

Here are some additional links which I found helpful: