Static Website with AWS S3 and Route 53

This article covers hosting a static website using S3 and Route 53. I'll assume you already have an AWS account, but no further experience with AWS services is required as each step will be explained, with links to more detailed documentation where appropriate. We'll stray off the "happy path", make a few "mistakes" along the way and use some cool tools to understand and fix them. Some knowledge of basic HTTP concepts and shell usage is assumed.

There are many similar tutorials on the internet, mostly using the web interface (the AWS Management Console - MC for short). Personally I prefer CLIs and would like to avoid the indignity of clicking around in GUIs - so we'll mostly stick with the AWS CLI. Occasionally it is useful to check the MC as well, so some relevant links are provided but I didn't include any screenshots as they tend to become outdated very quickly.

I'll be using Git Bash on Windows, but the commands should be portable to other shells and operating systems.

Setup AWS CLI

The first step is to install the AWS CLI; it's a Python package and can be installed with pip:

$ pip install awscli

Initial User Setup

When you sign up for an AWS account, you'll be able to login to the MC using your email and a password. These are the MC credentials for the root user, created implicitly with your account. However, Amazon recommends that you avoid using the root user as much as possible - instead, we're going to use the IAM Management Console to create an IAM user. Any number of such users can be created; in addition, we can also create groups and policies, which can be used to organize users and limit their access to the account's various AWS services and resources.

So log in with your root user, head over to the IAM Management Console, create a group named "Admins" and assign the AdministratorAccess policy to it. IAM policies are defined using a special JSON syntax; AdministratorAccess is one of the built-in, AWS-managed policies, and it has the following definition:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

This might seem pointless: even without knowing the details of the JSON syntax, you can tell it's quite permissive and seems to allow everything - thereby defeating the purpose of creating a non-root user. However, not everything can be expressed using policies, and because anything that is not explicitly allowed is disallowed, things that cannot be expressed in policies remain accessible only to the root user.

Once the "Admins" group is setup, create a user (let's say, john.smith) and add it to the group - it will automatically have all the groups policies applied to it. When creating the user, you can choose what type of access it requires - CLI, MC or both; make sure the CLI checkbox is ticked as we'll use this user going forward. On the final user creation screen, AWS will let you know the user was successfully created and that it generated credentials for it, which can be downloaded in the form of a CSV file that would look like this:

User name,Password,Access key ID,Secret access key,Console login link
john.smith,a+D2xDr-cbLT,AKIAIOSFODNN7EXAMPLE,wJalrXUtnFEMIxK7MDENGXbPxRfiCYzEXAMPLEKEY,https://123456789012.signin.aws.amazon.com/console

It contains two sets of credentials - assuming both MC and CLI checkboxes were ticked; let's go over the fields and the access type each one is associated with.

The MC credentials consist of three fields; the "Console login link" is a URL where you can go and use the user name and the password to login to the MC as the newly created admin user - rather than as the root user. The "Access key ID" and "Secret access key" fields are specific to the CLI access type.

The credentials are only displayed on the final user creation screen, where you also have the "Download .csv" button. Once you close that screen, the secret access key and the password won't be displayed again. However, not all is lost if you forget them, because you can easily set a new password (as the root or admin user) or create a new access key pair - and remove/disable the old one. By the way, don't forget to remove/disable the root user's key pair - this will disable programmatic access as the root user, but not logging in with the username and password.
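Once the CLI is configured (see the next section), key rotation can also be done from the command line; here's a sketch using the aws iam commands - the user name and key ID below are the example values from the CSV above:

$ aws iam create-access-key --user-name john.smith # generate a new key pair
$ aws iam delete-access-key --user-name john.smith --access-key-id AKIAIOSFODNN7EXAMPLE # remove the old one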

Configuring the CLI

Once we have a user and CLI credentials, we need to configure the CLI to use them. Like many other CLI tools, the AWS CLI reads a couple of configuration files ("dotfiles") on each run. These are ~/.aws/config and ~/.aws/credentials; they resemble INI files and have "sections"; each section is a "profile", which is generally either a user name or "default". The CLI will read only one section from each file - the one corresponding to the profile being used. As you would imagine, the contents of the "default" profile are used when you don't specify another one, which you can do using the --profile CLI option or the AWS_PROFILE environment variable.

These files can be edited manually, or by running the aws configure command; they should look as below.

$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMIxK7MDENGXbPxRfiCYzEXAMPLEKEY

$ cat ~/.aws/config
[default]
region = eu-west-1
output = json
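As an aside, once more profiles are added - say, a hypothetical [john.smith] section in both files - either of the following invocations will run a command under that profile; sts get-caller-identity is a handy way to check which identity the CLI is actually using:

$ aws sts get-caller-identity --profile john.smith
$ AWS_PROFILE=john.smith aws sts get-caller-identity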

Create a Website

Let's create a very simple website, with an index (home) page and another page available on the /blog/first-post route. The home page will reference a CSS stylesheet, and the nested one (the blog post) will reference the same stylesheet, and an image. We'll be taking advantage of the fact that we can express this type of simple route layout using the filesystem. The target file tree is outlined below; a simple Bash script which creates it follows.

├─ blog/
│  └─ first-post/
│      ├─ index.html
│      └─ image.svg
├─ index.html
└─ style.css
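Here's a minimal sketch of such a script; the home page matches the markup we'll see served later in the article, while the blog post, stylesheet and image contents are placeholders - any valid content will do:

#!/usr/bin/env bash
# Sketch: generate the website structure outlined above.
set -euo pipefail

mkdir -p blog/first-post

# Home page - this matches the markup served later in the article.
cat > index.html <<'EOF'
<html>
    <head>
        <link rel="stylesheet" type="text/css" href="/style.css"/>
        <title>Glorious Website</title>
    </head>
    <body>
      <h1>Version 1</h1>
      <p>Go to <a href="/blog/first-post">blog/first-post</a></p>
    </body>
</html>
EOF

# Blog post - placeholder markup; references the shared stylesheet and the image.
cat > blog/first-post/index.html <<'EOF'
<html>
    <head>
        <link rel="stylesheet" type="text/css" href="/style.css"/>
        <title>First Post</title>
    </head>
    <body>
      <h1>First Post</h1>
      <img src="/blog/first-post/image.svg" alt="an image"/>
    </body>
</html>
EOF

# Placeholder stylesheet.
cat > style.css <<'EOF'
body { font-family: sans-serif; }
EOF

# Placeholder SVG image.
cat > blog/first-post/image.svg <<'EOF'
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="teal"/>
</svg>
EOF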

Setup S3 Bucket

The website is sorted; now we need to upload it somewhere - so let's create an S3 bucket and set it up as a website host. The AWS CLI actually comes with two sub-commands for dealing with S3 - s3 and s3api; s3 offers a higher level of abstraction, whereas s3api is lower level and more powerful - similar to how git divides its commands into "porcelain" and "plumbing". We'll mostly be using "porcelain" s3 commands.

The name of the bucket doesn't really matter, but it has to be globally unique - that is, unique across all AWS accounts, not just yours; and since I'm going to name my bucket "glorious-website", you'll have to name yours something else.

$ aws s3 mb s3://glorious-website # create the bucket
make_bucket: glorious-website

$ aws s3 ls # verify the bucket was created
2020-02-03 08:57:12 glorious-website

$ cd ~/code/glorious-website # change to the folder with the website

$ aws s3 sync . s3://glorious-website --dryrun # do a "dry run" first ...
(dryrun) upload: .git\COMMIT_EDITMSG to s3://glorious-website/.git/COMMIT_EDITMSG
(dryrun) upload: .git\HEAD to s3://glorious-website/.git/HEAD
(dryrun) upload: .git\config to s3://glorious-website/.git/config
(dryrun) upload: blog\first-post\image.svg to s3://glorious-website/blog/first-post/image.svg
(dryrun) upload: blog\first-post\index.html to s3://glorious-website/blog/first-post/index.html
(dryrun) upload: .\index.html to s3://glorious-website/index.html
(dryrun) upload: .\style.css to s3://glorious-website/style.css

$ aws s3 sync . s3://glorious-website --exclude ".git/*" --dryrun # ... make adjustments, if necessary
(dryrun) upload: blog\first-post\image.svg to s3://glorious-website/blog/first-post/image.svg
(dryrun) upload: blog\first-post\index.html to s3://glorious-website/blog/first-post/index.html
(dryrun) upload: .\index.html to s3://glorious-website/index.html
(dryrun) upload: .\style.css to s3://glorious-website/style.css

$ aws s3 sync . s3://glorious-website # ... and if dry run looks OK, do it for real
upload: blog\first-post\image.svg to s3://glorious-website/blog/first-post/image.svg
upload: blog\first-post\index.html to s3://glorious-website/blog/first-post/index.html
upload: .\style.css to s3://glorious-website/style.css
upload: .\index.html to s3://glorious-website/index.html

Above we used the mb command to create a bucket, then ls to verify that it was created. Then we uploaded the files with sync - but before doing that, we ran the sync command with the --dryrun flag (supported by many other commands), which tells you what would happen if you ran the command without it; useful as a sanity check. In this case it helped us notice that the .git folder would have been uploaded as well, which was not intended.

Make Bucket into Static Website

So the website is now in an S3 bucket - but it's still not accessible as a website; to AWS it's just another bucket with files in it - so we need to tell it that we actually have a website in there. For this we have the aptly named website command:

$ aws s3 website s3://glorious-website --index-document index.html

If everything goes OK and the operation is successful, we won't get any output; many other aws commands have the same behaviour, adhering to the UNIX tenet of avoiding unnecessary output. Unfortunately this means we don't know the URL where the website is hosted, and apparently we have to use the S3 MC for that as I couldn't find a way of revealing the URL using the command line.
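Luckily, website endpoint URLs follow a predictable format - http://<bucket>.s3-website-<region>.amazonaws.com (most regions use a dash after "s3-website"; a few use a dot) - so we can reconstruct ours in the shell. A quick sketch, reading the region from the CLI configuration with aws configure get, and assuming the bucket lives in that region:

$ echo "http://glorious-website.s3-website-$(aws configure get region).amazonaws.com/"
http://glorious-website.s3-website-eu-west-1.amazonaws.com/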

However, if you open that URL in your browser, you'll get a "403 Forbidden" page. You can check this in any browser, but below we're using the httpie tool with the --headers option; it makes a GET request to the URL given as a parameter and displays only the response headers:

$ http --headers http://glorious-website.s3-website-eu-west-1.amazonaws.com/
HTTP/1.1 403 Forbidden
Content-Length: 303
Content-Type: text/html; charset=utf-8
Date: Tue, 04 Feb 2020 13:58:38 GMT
Server: AmazonS3

The reason for the 403 is that the request didn't contain any authentication information, and by default anonymous access to S3 buckets is disallowed - but for public websites, that's exactly what we need. The way to accomplish this is via policies - we need to assign a policy to the bucket which lets AWS know that we're fine with anonymous read access to the contents of this particular bucket.

Let's create a website policy and write it to a JSON file:

$ tee ./policy.json <<EOF
{
  "Version":"2012-10-17",
  "Statement":[{
    "Sid":"PublicReadGetObject",
    "Effect":"Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::glorious-website/*"
  }]
}
EOF

To apply the policy, we'll have to use an s3api "plumbing" command, put-bucket-policy:

$ aws s3api put-bucket-policy --bucket glorious-website --policy file://policy.json

Once again, if successful, there will be no output but we'll be able to actually visit the website; let's verify that with httpie:

$ http --headers http://glorious-website.s3-website-eu-west-1.amazonaws.com/
HTTP/1.1 200 OK
Content-Length: 259
Content-Type: text/html
Date: Tue, 04 Feb 2020 14:10:16 GMT
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Mon, 03 Feb 2020 21:48:21 GMT
Server: AmazonS3

Setup Domain Name

We have a static website hosted on AWS S3 that we can actually visit - but the URL isn't very user-friendly - so we need to register a domain. Even if you're doing this as a learning exercise, I would suggest registering an actual domain - thanks to the plethora of new TLDs (there are over 1.5k now), a domain can be registered very cheaply; I registered the glorious.website domain with Namecheap for $1.46.

Once a domain is registered, let's use httpie and dig to understand how things are set up just after registering. The default output of the dig command is quite verbose, so the +noall option is used to turn everything off, and then +answer selectively enables only the "answer" section. I modified the response slightly by adding column names; more on dig here.

$ http --headers http://glorious.website
HTTP/1.1 302 Found
Connection: keep-alive
Content-Length: 51
Content-Type: text/html; charset=utf-8
Date: Wed, 05 Feb 2020 07:11:17 GMT
Location: http://www.glorious.website/
Server: nginx
X-Served-By: Namecheap URL Forward

$ http --headers http://www.glorious.website
HTTP/1.1 200 OK
Allow: GET, HEAD
Cache-Control: no-cache
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Wed, 05 Feb 2020 07:11:24 GMT
Expires: -1
Pragma: no-cache
Server: namecheap-nginx
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-CST: MISS
X-CST: MISS

$ dig +noall +answer glorious.website
---------------------------------------------------------------------------
NAME                       TTL   CLASS   TYPE    DATA
---------------------------------------------------------------------------
glorious.website.          776   IN      A       192.64.119.240
---------------------------------------------------------------------------

$ dig +noall +answer www.glorious.website
---------------------------------------------------------------------------
NAME                       TTL   CLASS   TYPE    DATA
---------------------------------------------------------------------------
www.glorious.website.      782   IN      CNAME   parkingpage.namecheap.com.
parkingpage.namecheap.com.  30   IN      A       198.54.117.218
parkingpage.namecheap.com.  30   IN      A       198.54.117.212
parkingpage.namecheap.com.  30   IN      A       198.54.117.211
---------------------------------------------------------------------------

From the responses, we can see that these requests are indeed handled by Namecheap: the "www.glorious.website" domain points to a parking page which we can visit, while the server at the non-"www" IP is set up by Namecheap to respond with a 302. When browsers receive a 302, they read the URL in the Location response header and navigate there - so visitors to the root domain are redirected to the "www" one.

To www or not to www?

One of the first things to consider when it comes to domains is the thorny issue of "www" vs non-www ("naked") - we need to pick one as the "canonical" website URL, and redirect the other so that both resolve to the same canonical URL. Traditionally, the recommendation is to opt for the prefixed version - we won't go into the reasons, as that rabbit hole goes quite deep, but there are good technical reasons for high-traffic websites to at least consider going this route.

Using the root domain (also known as the "apex" or "naked" domain) as the canonical URL is the hip and trendy thing to do; lots of people appreciate the shorter URLs and find them more aesthetically pleasing. The "www"-prefixed domain would redirect to the root domain, so both would work. Normally this kind of aliasing would be done via a CNAME DNS record. Technically, this is not explicitly disallowed for the root domain - but according to RFC 1912, "a CNAME record is not allowed to coexist with any other data". The main reason this "hurts" is that you generally want email delivery for the domain, sooner or later - and for that you'd need an MX record, which would "coexist" with the CNAME record - which is forbidden (strictly speaking, the apex always carries SOA and NS records, so a CNAME there already conflicts even without MX).

Registrars and DNS providers (these are different services, but they are often offered by the same company) work around this issue in several ways - by handling CNAME records in a custom way, or by introducing new record types, like ALIAS or ANAME. While there is no standard way of doing this, as of early 2020 most DNS providers have fairly mature solutions for this popular request and, as we'll see, Amazon is no exception. So, we'll go the hipster route and use the shorter, non-www domain for the canonical URL.

Setup Route 53

With the domain registered, the glorious.website domain is "resolved" by Namecheap's DNS servers to the IP address of the server hosting the parking page; this is done via an A record. All host names ultimately need to resolve to an IP; so, somehow, we need to make the glorious.website domain (and its "www" sub-domain) resolve to the IP of a server which hosts the files in our bucket. Who knows the exact IP? Amazon does. So essentially we need to tell Namecheap to delegate resolving this particular host name to Amazon.

This is what DNS name servers are for; they allow this kind of delegation. For Amazon, this type of functionality can be accessed through its Route 53 service, which allows us to create hosted zones. A hosted zone is essentially Amazon's way of allowing us to create a DNS zone and its associated zone files.

For this we have the aws route53 create-hosted-zone command; the --name parameter is the domain, and the --caller-reference can be any string that is unique for every invocation; normally a timestamp is used, as produced by the date command; the quotes around the timestamp are important as it contains spaces:

$ aws route53 create-hosted-zone --name glorious.website --caller-reference "$(date)"
{
    "Location": "https://route53.amazonaws.com/2013-04-01/hostedzone/ZDDQ0TEAVANOW",
    "HostedZone": {
        "Id": "/hostedzone/ZDDQ0TEAVANOW",
        "Name": "glorious.website.",
        "CallerReference": "10 Feb 2020 21:40:02",
        "Config": {
            "PrivateZone": false
        },
        "ResourceRecordSetCount": 2
    },
    "ChangeInfo": {
        "Id": "/change/C083657918IENJDLM30TJ",
        "Status": "PENDING",
        "SubmittedAt": "2020-02-10T21:40:05.108Z"
    },
    "DelegationSet": {
        "NameServers": [
            "ns-1566.awsdns-03.co.uk",
            "ns-1180.awsdns-19.org",
            "ns-859.awsdns-43.net",
            "ns-82.awsdns-10.com"
        ]
    }
}

We get a JSON response, and the most important piece of information there is the DelegationSet.NameServers array, which represents AWS name servers, to be used in the last step of this section. But first, we need to create a record in the hosted zone. According to the docs, for an S3 bucket website we need an "Alias" record set.

At this point it would be instructive to access the Route 53 MC; if you go there, you should see the newly created hosted zone listed as a link. If you access the link, you'll be taken to a screen for managing the hosted zone, where there should be a button for creating record sets. When clicked, if you select the "Alias" option, the "Alias Target" text box should include the website bucket as an auto-complete option - but in this case it won't.

If the bucket is not listed, it could be for a couple of reasons - most commonly, the bucket isn't configured for static website hosting, or its name doesn't exactly match the record set's domain name.

It is actually fairly common to trip over at least one of these. In our case, the bucket name is "glorious-website", whereas the domain we want to alias is "glorious.website" - and Amazon isn't going to let us do that. So we must rename the bucket. Unfortunately there is no "rename" CLI command, so the process is a bit more involved - we need to create a new bucket with the correct name, sync it with the old one, and then also make it a website and apply the JSON policy. Note that we can't reuse the policy file, because we need to update the bucket ARN.

$ aws s3 mb s3://glorious.website
make_bucket: glorious.website

$ aws s3 sync s3://glorious-website s3://glorious.website
copy: s3://glorious-website/blog/first-post/index.html to s3://glorious.website/blog/first-post/index.html
copy: s3://glorious-website/style.css to s3://glorious.website/style.css
copy: s3://glorious-website/blog/first-post/image.svg to s3://glorious.website/blog/first-post/image.svg
copy: s3://glorious-website/index.html to s3://glorious.website/index.html

$ aws s3 rb --force s3://glorious-website
delete: s3://glorious-website/index.html
delete: s3://glorious-website/style.css
delete: s3://glorious-website/blog/first-post/image.svg
delete: s3://glorious-website/blog/first-post/index.html
remove_bucket: glorious-website

$ aws s3 website s3://glorious.website --index-document index.html

$ tee ./policy.json <<EOF
{
  "Version":"2012-10-17",
  "Statement":[{
    "Sid":"PublicReadGetObject",
    "Effect":"Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::glorious.website/*"
  }]
}
EOF

$ aws s3api put-bucket-policy --bucket glorious.website --policy file://policy.json

With the bucket sorted, we can create the record set with the change-resource-record-sets command. It takes JSON as input; the structure isn't too complicated, but there is some legwork that we need to perform. The value for the ResourceRecordSet.AliasTarget.HostedZoneId field depends on the AWS region the bucket was created in. The list of hosted zone IDs corresponding to each region is here; and since our bucket is in eu-west-1, we're going to use "Z1BKCTXD74EZPE".

Also note that Route 53 requires that we use A as the record type. Normally an A record points directly to a concrete IPv4 address - but here it's an indirect reference, similar to a CNAME. Arguably, Amazon should have called this something else, like most other DNS providers do; not CNAME, as that would preclude usage on a root domain, but ALIAS or something like that. That being said, DNS queries for "glorious.website" will result in an A record with a concrete IP as its payload - so make of this what you will.

$ tee ./alias-record-set.json <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "glorious.website",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z1BKCTXD74EZPE",
          "DNSName": "s3-website-eu-west-1.amazonaws.com.",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
EOF

$ aws route53 change-resource-record-sets --hosted-zone-id ZDDQ0TEAVANOW --change-batch file://alias-record-set.json
{
    "ChangeInfo": {
        "Id": "/change/C00881342NTAFY5GN1BMZ",
        "Status": "PENDING",
        "SubmittedAt": "2020-02-11T10:06:06.059Z"
    }
}

With the hosted zone set up, all we need to do now is let the registrar know that we want Amazon to handle resolving the domain. We do this by setting NS records referencing Amazon name servers. We need to use the name servers corresponding to the hosted zone created for the domain; these are revealed when the zone is created, as we saw, or with the aws route53 list-resource-record-sets command:

$ aws route53 list-resource-record-sets --hosted-zone-id ZDDQ0TEAVANOW
{
    "ResourceRecordSets": [
        {
            "Name": "glorious.website.",
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z1BKCTXD74EZPE",
                "DNSName": "s3-website-eu-west-1.amazonaws.com.",
                "EvaluateTargetHealth": false
            }
        },
        {
            "Name": "glorious.website.",
            "Type": "NS",
            "TTL": 172800,
            "ResourceRecords": [
                {
                    "Value": "ns-1566.awsdns-03.co.uk."
                },
                {
                    "Value": "ns-1180.awsdns-19.org."
                },
                {
                    "Value": "ns-859.awsdns-43.net."
                },
                {
                    "Value": "ns-82.awsdns-10.com."
                }
            ]
        },
        {
            "Name": "glorious.website.",
            "Type": "SOA",
            "TTL": 900,
            "ResourceRecords": [
                {
                    "Value": "ns-1566.awsdns-03.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400"
                }
            ]
        }
    ]
}

The exact steps for this depend on the registrar; here's a link to a Namecheap guide. I actually kept getting the generic "Oops, something went wrong. Please try again." error message while trying to set the NS records. By inspecting the response payload, I saw that the actual message was "A host object with that hostname already exists." - still not clear, so I had to contact support. As it turns out, they require entering the name servers without the terminating dot - which technically is incorrect, as the FQDN includes the dot. The lesson here: be prepared to deal with this kind of shenanigans.
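Once the registrar accepts the change, the delegation can be verified from the shell; asking dig for the domain's NS records should (eventually) return the four AWS name servers from the delegation set:

$ dig +noall +answer NS glorious.website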

Once the name servers are set, the hard part begins... waiting. By making a dig request, you can find out the TTL for a domain:

$ dig +noall +answer glorious.website
glorious.website.       5       IN      A       52.218.40.164

$ dig +noall +answer www.glorious.website
www.glorious.website.   552     IN      CNAME   parkingpage.namecheap.com.
parkingpage.namecheap.com. 30   IN      A       198.54.117.217
parkingpage.namecheap.com. 30   IN      A       198.54.117.211
parkingpage.namecheap.com. 30   IN      A       198.54.117.215

As you might remember, the second column represents the TTL, and if you run a dig command multiple times in succession, you should see it going down, as it is measured in seconds. You can think of the TTL as TTW - time to wait, because only when it reaches 0 will a DNS server refresh its copies of the expired records.
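For instance, a little loop like this sketch should show the TTL column ticking down between successive queries (assuming no intermediate resolver resets it along the way):

$ for i in 1 2 3; do dig +noall +answer glorious.website; sleep 2; done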

You can try clearing the local DNS cache, but when your computer makes a request for a new set of records, the DNS server it reaches might itself serve a cached response. You can "evict" the local cache, but not the caches of remote servers you don't control - a DNS server will fetch the record set once and cache it until the TTL expires. And since DNS is decentralized, even if you control the authoritative DNS server for a domain, you can't reliably know which intermediate servers will be contacted; the only thing you can rely on is that changes will propagate eventually, because TTLs will expire. If you plan on making DNS changes on a domain, you can preemptively set a low TTL so that changes propagate more quickly, and then raise it back afterwards.
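For the record, since I'm on Windows, flushing the local resolver cache looks like this (in Git Bash the leading slash may get mangled by path conversion, so double it or run the command from a regular Command Prompt); macOS and Linux have their own equivalents, depending on the resolver in use:

$ ipconfig //flushdns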

In the dig example above, for the root domain the TTL is 5 seconds so you should be able to access the website using the "glorious.website" domain almost instantaneously.

But the CNAME record for the "www"-prefixed domain, which points to the parking page, will live for another 552 seconds, or about 9 minutes. Once it expires, we won't have any records for the "www"-prefixed domain, and the "answer" section of dig's output should be empty:

$ dig +noall +answer www.glorious.website
$

If you try to access either URL in a web browser, what you get depends on the browser - sometimes browsers automatically go to the "www" version, even if you explicitly omit it - so I prefer using httpie for this kind of test:

$ http http://glorious.website
HTTP/1.1 200 OK
Content-Length: 259
Content-Type: text/html
Date: Tue, 11 Feb 2020 13:49:15 GMT
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Tue, 11 Feb 2020 09:04:32 GMT
Server: AmazonS3
x-amz-id-2: DFvJ1ddOrUigFskGAJeG3nxVBwp6ZMCw3+d9sPAXAvfcRTz52BWlsjnkxP341zxJQMywppM2vrU=
x-amz-request-id: 78783033E4E3CE9E

<html>
    <head>
        <link rel="stylesheet" type="text/css" href="/style.css"/>
        <title>Glorious Website</title>
    </head>
    <body>
      <h1>Version 1</h1>
      <p>Go to <a href="/blog/first-post">blog/first-post</a></p>
    </body>
</html>

$ http http://www.glorious.website

http: error: ConnectionError: HTTPConnectionPool(host='www.glorious.website', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001EB8C9E8130>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')) while doing a GET request to URL: http://www.glorious.website/

The root domain works and resolves to the "glorious" website. But www.glorious.website fails to resolve - that's because the TTL for the CNAME record (which aliased the "www"-prefixed domain to the parking page) reached 0, and there are no other records for that domain.

To address this, first we need to create another alias (an indirect A record) for the "www" sub-domain, like we did with the root domain:

$ tee ./alias-www-record-set.json <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.glorious.website",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z1BKCTXD74EZPE",
          "DNSName": "s3-website-eu-west-1.amazonaws.com.",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
EOF

$ aws route53 change-resource-record-sets --hosted-zone-id ZDDQ0TEAVANOW --change-batch file://alias-www-record-set.json
{
    "ChangeInfo": {
        "Id": "/change/C00881342NTAFY5GN1BMZ",
        "Status": "PENDING",
        "SubmittedAt": "2020-02-11T10:06:06.059Z"
    }
}

Once the record is created, dig will manage to resolve the www domain - but HTTP requests to it will result in a 404 page, complaining that there is no such bucket, www.glorious.website:

$ dig +noall +answer www.glorious.website
www.glorious.website.   5       IN      A       52.218.96.188

$ http --headers http://www.glorious.website
HTTP/1.1 404 Not Found
Content-Length: 367
Content-Type: text/html; charset=utf-8
Date: Tue, 11 Feb 2020 21:47:02 GMT
Server: AmazonS3

<html>
<head><title>404 Not Found</title></head>
<body>
<h1>404 Not Found</h1>
<ul>
<li>Code: NoSuchBucket</li>
<li>Message: The specified bucket does not exist</li>
<li>BucketName: www.glorious.website</li>
</ul>
<hr/>
</body>
</html>

That is because S3 website endpoints always expect to find a bucket with a name matching the requested host name - unless we use the long, bucket-specific URL. But we don't have to replicate the existing bucket; instead, S3 lets us configure a bucket to serve no objects at all, and do nothing but redirect every request to another host. So, we need to create such a bucket and configure it with the aws s3api put-bucket-website command so that it redirects to the other bucket/domain:

$ aws s3 mb s3://www.glorious.website

$ tee ./redirect-www-bucket.json <<EOF
{
    "RedirectAllRequestsTo": {
      "HostName": "glorious.website",
      "Protocol": "http"
    }
}
EOF

$ aws s3api put-bucket-website --bucket www.glorious.website --website-configuration file://redirect-www-bucket.json

Everything should work now - if you open the "www" domain in a browser, you'll be redirected to the non-www domain. This is done via a 301 HTTP redirect, which browsers follow automatically (using the value of the Location header as the destination) - but httpie doesn't follow redirects unless we use the --follow parameter:

$ http http://www.glorious.website
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Date: Tue, 11 Feb 2020 22:05:28 GMT
Location: http://glorious.website/
Server: AmazonS3

$ http --follow http://www.glorious.website
HTTP/1.1 200 OK
Content-Length: 259
Content-Type: text/html
Date: Tue, 11 Feb 2020 22:10:21 GMT
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Tue, 11 Feb 2020 09:04:32 GMT
Server: AmazonS3

<html>
    <head>
        <link rel="stylesheet" type="text/css" href="/style.css"/>
        <title>Glorious Website</title>
    </head>
    <body>
      <h1>Version 1</h1>
      <p>Go to <a href="/blog/first-post">blog/first-post</a></p>
    </body>
</html>

Problems

The website is up and running at the correct domain, but we have a few issues with it.

Compression

When talking about websites, there are two broad categories of compression - build time and run time. Build time compression normally takes the form of "minification", but since we don't have a build process, we're not going to do anything about this.

The second type of compression is performed automatically by web servers, "on-the-fly". For example, nginx has the gzip on directive, but it's not enabled by default. So when a web browser (or a tool like httpie or curl) makes a request for a file, nginx will serve it as-is. However, if the gzip on directive is used, nginx will check the Accept-Encoding request header, and if it includes gzip, it will compress the file in the "gzip" format before sending the response, and also include the Content-Encoding response header with the value "gzip" - so the client knows that the response is encoded and should use "gzip" to decode it.

But we're just hosting a static website on S3, so there is no web server that we can configure. And you can see from the response headers above that there is no Content-Encoding header in S3's responses - even though httpie sends Accept-Encoding: gzip, deflate by default - so no "on-the-fly" compression is performed; all files are served uncompressed.
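To double-check, we can ask for compression explicitly and look for a Content-Encoding header in the response; on the S3 website endpoint, there shouldn't be one:

$ http --headers http://glorious.website Accept-Encoding:gzip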

HTTPS

The second issue is that we're serving the website unencrypted, over HTTP, instead of HTTPS. Browsers will penalize us for that, by marking the connection to the website as "insecure" - which it is. Aside from privacy and security concerns, there are other technical reasons for using HTTPS - many newer features will only work if the connection is over HTTPS, and even some older functionality is "retrofitted" to also require HTTPS.

Caching

When using a static website hosting service, we don't have much control over caching behaviour, but at the least we must understand how it is set up. Generally this means knowing what caching headers are sent, and in what circumstances they change.

For this I prefer tools like httpie and curl, at least initially, because there are many factors that influence browser caching behaviour. In the output below, I'll omit headers that are not relevant to caching.

$ http --headers http://glorious.website
HTTP/1.1 200 OK
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Tue, 11 Feb 2020 09:04:32 GMT

The ETag value is, theoretically, supposed to be a fingerprint - normally a hash of the requested resource, though the exact hashing algorithm is considered an implementation detail. Browsers will cache it (assuming Cache-Control allows it), and the next time the same URL is requested, they will check whether the cached version has expired. "Expiredness" is generally determined based on the values of the Expires or Cache-Control headers, which would have been stored along with the cached resource. If the cache is expired, a new request will be made.

However, in this case we don't have Expires or Cache-Control - so browsers can't know if a resource is expired. But they do have a fingerprint and a modification time stamp. The fingerprint is generally more accurate and will be used in preference to the Last-Modified value, but either of these headers can be used for this purpose. To ensure they don't serve outdated content, browsers still need to make an HTTP request - but they can potentially avoid re-downloading the resource.

Let's use the touch command to test what happens when the "last modified" attribute of the file changes, but the file contents remain the same. Using touch is equivalent to opening the file in an editor and saving it, without making any changes.

$ cd ~/code/glorious-website # change to the folder with the website

$ md5sum index.html # calculate the MD5 hash of the file contents
4d5aea333733346209b576265ee4f46f *index.html

$ touch index.html # change the "last modified" time stamp on the file

$ md5sum index.html # check that file contents are the same
4d5aea333733346209b576265ee4f46f *index.html

$ aws s3 sync . s3://glorious.website --exclude ".git/*"
upload: .\index.html to s3://glorious.website/index.html

We can see that s3 decided to re-upload the file, even though only the time stamp changed. Requesting the root of a website is normally the same as requesting the index.html file, so let's see how the headers behave.

$ http --headers http://glorious.website
Date: Wed, 12 Feb 2020 13:20:35 GMT
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Wed, 12 Feb 2020 13:18:41 GMT

Here we can see why ETags are preferred to Last-Modified - because the former is based on the actual contents of the file (and didn't change), whereas the latter is commonly based on file system meta-information (and did change). Also note that the value of the ETag is actually the MD5 hash - although, as mentioned before, this is an implementation detail and servers are free to choose whatever hashing algorithm they find suitable.

Assuming a browser has a cached resource along with its ETag, on subsequent requests for the same resource it will include the If-None-Match header, with the value of the cached resource's ETag ("4d5aea333733346209b576265ee4f46f", in our case). Actually this is true for any compliant HTTP client, not just browsers, so we can try it with httpie. Note that request headers are set after the URL; the --headers option doesn't set any headers - it just means we're only interested in the response headers and don't care about the response body.

$ http --headers http://glorious.website If-None-Match:"4d5aea333733346209b576265ee4f46f" 
HTTP/1.1 304 Not Modified
Date: Wed, 12 Feb 2020 13:34:59 GMT
ETag: "4d5aea333733346209b576265ee4f46f"
Last-Modified: Wed, 12 Feb 2020 13:18:41 GMT

The server (Amazon S3) matched the ETag in the request's If-None-Match header against the ETag for the current version of the object, and since they're the same, it sent a 304 Not Modified response - which means the client can use the cached version. This isn't such a big win for small files; while a 304 response doesn't carry a payload (the "body" is empty), a small payload would not have mattered much, as the main cost here is the time it takes to receive the response. If we don't include If-None-Match, we get a 200 response with the body containing the page's HTML.

If an HTTP response includes a Last-Modified header, clients can cache the response along with the value of this header, to be sent with subsequent requests as the value of the If-Modified-Since header. Since S3 responses include both ETag and Last-Modified, re-validation requests will include both If-None-Match and If-Modified-Since headers - but, as mentioned before, servers will generally prefer If-None-Match as it is based on actual file contents.
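We can simulate that as well - sending If-Modified-Since with the Last-Modified value from the earlier response should likewise yield a 304 (note the quotes, since the date contains spaces):

$ http --headers http://glorious.website If-Modified-Since:"Wed, 12 Feb 2020 13:18:41 GMT"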

Something else to keep in mind is that caching behaviour often differs per file or MIME type - HTML files are generally not cached, whereas scripts and images are. So it is worth making requests for other file types as well - which I did, and it looks like on S3 all file types are treated the same.
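If you want to check for yourself, the same kind of request works for the other file types; the caching-related headers in the responses should look just like the ones for the HTML:

$ http --headers http://glorious.website/style.css
$ http --headers http://glorious.website/blog/first-post/image.svg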

So the caching situation isn't terrible, but there is certainly room for improvement, because we could potentially avoid the HTTP roundtrip entirely if we had Expires or Cache-Control headers. This is possible, but it's not straightforward; we'll look at a compromise solution using CloudFront in the upcoming second part of this article.

Conclusion

We covered setting up the AWS CLI and using it to upload a static website to S3; then we used Route 53 to configure a custom domain for it, and redirect the "www"-prefixed domain to the root domain. As we saw in the last section, there is room for improvement - primarily, we need to set up HTTPS; we'll do that in the second part of this tutorial using AWS CloudFront, and we'll also add compression and improve caching.