Capturing and Storing Apache Access Log Data
published 10/20/25
My static sites are served by a Digital Ocean server that uses good old-fashioned Apache to serve the content. I have configured that server to capture access log data and then store it on S3.
Apache Access Logs
In the world of Apache webservers the access log file is the standard location to store the requests that the server processes. In my case this configuration starts here:
# from /etc/apache2/apache2.conf
# ...
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
# ...
You can read this line like this:
- first is LogFormat which indicates this line will be defining a format
- next is a quoted string with the format
- last is the name of the format - vhost_combined in this case
I didn't write this - this comes standard with a brand new setup. Because I was
lazy I ended up copy/pasting this line with the format name of combined to
save myself a bit of work so remember that for later.
Next stop is to the individual site configurations. Here's one:
# from /etc/apache2/sites-available/jonallured.com.conf
<VirtualHost *:80>
  # ...
  CustomLog ${APACHE_LOG_DIR}/access.log combined
  # ...
</VirtualHost>
This can be read like so:
- first is CustomLog which indicates this line will be defining how to handle log data
- next is a path to the file where we should log
- last is the name of the format to use
So this is where I was lazy. All the sites that this machine serves have this
same configuration and so rather than updating each one to reference
vhost_combined I just set up the combined format to include the website name.
There's probably a better way to do this but that's for another time.
The punchline is that with these 2 pieces of configuration in place the server is now set up to log incoming requests in a standardized way.
Some Sample Data
I will be diving much more deeply into the data in a future post but here's an example of a logged request for the curious:
www.jonallured.com:443 64.71.157.102 - - [26/Jun/2021:00:12:48 +0000] "GET /atom.xml HTTP/1.1" 200 18560 "-" "Feedbin feed-id:2032942 - 1 subscribers"
This line decodes like this:
- website: www.jonallured.com
- port: 443
- request ip: 64.71.157.102
- identity: -
- user: -
- request timestamp: [26/Jun/2021:00:12:48 +0000]
- first line of request: "GET /atom.xml HTTP/1.1"
- response status: 200
- response size: 18560 (in bytes)
- request referrer header: "-"
- request user agent header: "Feedbin feed-id:2032942 - 1 subscribers"
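For the curious, here is a minimal Python sketch of that same decoding. It is not part of my server setup; the names LOG_PATTERN and parse_line are made up for illustration, and the regex assumes the quoted fields never contain an escaped double quote.

```python
import re

# Pattern mirroring the vhost_combined LogFormat shown earlier:
# %v:%p %h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"
# Simplification: quoted fields are assumed to contain no escaped quotes.
LOG_PATTERN = re.compile(
    r'(?P<vhost>\S+):(?P<port>\d+) '
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    'www.jonallured.com:443 64.71.157.102 - - [26/Jun/2021:00:12:48 +0000] '
    '"GET /atom.xml HTTP/1.1" 200 18560 "-" '
    '"Feedbin feed-id:2032942 - 1 subscribers"'
)
fields = parse_line(sample)
print(fields["vhost"], fields["status"], fields["size"])
```

Every value comes back as a string, so anything numeric (port, status, size) would still need converting before doing math on it.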
Kinda cool! This is a request from Feedbin for the RSS feed of my blog on behalf of my 1 and only subscriber - me. <3
The logrotate Tool
The next piece of the puzzle is to take the stream of request logging and rotate
it such that each day has an individual file. Turns out there is a pretty
standard CLI tool for this called logrotate. In order to get these Apache
Access Logs rotating I configured it like so:
# from /etc/logrotate.d/apache2
/var/log/apache2/*.log {
  daily
  missingok
  rotate 14
  compress
  delaycompress
  notifempty
  create 640 root adm
  sharedscripts
  dateext
  postrotate
    if invoke-rc.d apache2 status > /dev/null 2>&1; then \
      invoke-rc.d apache2 reload > /dev/null 2>&1; \
    fi;
  endscript
  prerotate
    if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
      run-parts /etc/logrotate.d/httpd-prerotate; \
    fi; \
  endscript
  lastaction
    AWS_SHARED_CREDENTIALS_FILE=/home/dev/.aws/credentials aws s3 sync /var/log/apache2/ s3://mli-data/domino/logs --exclude "*" --include "*.gz"
  endscript
}
There's a lot here but the important parts are that this configuration will pick up the access log file that Apache is writing to and rotate it daily but only keep the 14 most recent files. It compresses the rotated files, although delaycompress holds off on compressing the newest one until the following rotation, and it adds a date to the resulting filename so it is easy to know which one is from which day. It also reloads Apache after rotating, if Apache is running, so that it starts writing to the fresh log file.
With this in place we can see the logging and rotation in action by listing the log directory:
dev@domino:~% ls -1 /var/log/apache2
access.log
access.log-20251007.gz
access.log-20251008.gz
access.log-20251009.gz
access.log-20251010.gz
access.log-20251011.gz
access.log-20251012.gz
access.log-20251013.gz
access.log-20251014.gz
access.log-20251015.gz
access.log-20251016.gz
access.log-20251017.gz
access.log-20251018.gz
access.log-20251019.gz
access.log-20251020
The first file is the one currently being written to by Apache. The next set of
files that end with a .gz extension are the ones that have been archived and
compressed. The last one is the most recently rotated file; because of
delaycompress it stays uncompressed until the next daily rotation, at which
point it gets compressed like the others.
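To make those three kinds of files concrete, here is a small Python sketch that sorts a file name into one of them. The classify function and the state labels are hypothetical names of mine, and the pattern assumes logrotate's default dateext suffix of -YYYYMMDD plus the .gz that compress adds.

```python
import re
from datetime import date, datetime

# dateext appends -YYYYMMDD by default; compress adds the trailing .gz.
ROTATED = re.compile(r'^access\.log-(?P<day>\d{8})(?P<gz>\.gz)?$')


def classify(filename):
    """Return (state, day) for a file name from the Apache log directory."""
    if filename == "access.log":
        return ("live", None)
    match = ROTATED.match(filename)
    if match is None:
        return ("unknown", None)
    day = datetime.strptime(match.group("day"), "%Y%m%d").date()
    state = "compressed" if match.group("gz") else "rotated, not yet compressed"
    return (state, day)


for name in ["access.log", "access.log-20251019.gz", "access.log-20251020"]:
    print(name, classify(name))
```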
Uploading to S3
Let's look at a section from the logrotate config that I skipped: the part
that sends the log data to S3. Here it is in isolation:
lastaction
  AWS_SHARED_CREDENTIALS_FILE=/home/dev/.aws/credentials aws s3 sync /var/log/apache2/ s3://mli-data/domino/logs --exclude "*" --include "*.gz"
endscript
This section instructs logrotate to run this command as the last part of the
rotation process. The purpose of the command is to take what's been rotated and
sync it to a bucket on S3 for processing elsewhere. I start by setting an ENV
var that points to a set of AWS credentials that have been generated for this
purpose. Then I run the aws s3 sync command and pass the local path to the
Apache access logs and the bucket path where I want them to go. The final part
is 2 flags to exclude all files and then only include those files that end with
.gz so that I ignore all but the compressed and fully rotated files.
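The exclude-then-include pairing can look backwards at first, so here is a rough Python sketch of how I understand the filtering to work. It assumes the AWS CLI's documented behavior that every file is included by default and that filters later in the command take precedence over earlier ones. The is_synced function is a made-up name, and fnmatch only approximates the CLI's own pattern matching.

```python
from fnmatch import fnmatch


def is_synced(filename, filters):
    """Apply AWS-CLI-style filters in order; the last matching filter wins.

    Files start out included, mirroring the default for `aws s3 sync`.
    """
    included = True
    for kind, pattern in filters:
        if fnmatch(filename, pattern):
            included = (kind == "include")
    return included


# Same order as the flags in the command: --exclude "*" --include "*.gz"
FILTERS = [("exclude", "*"), ("include", "*.gz")]
names = ["access.log", "access.log-20251019.gz", "access.log-20251020"]
print([n for n in names if is_synced(n, FILTERS)])
# prints ['access.log-20251019.gz']
```

Swapping the order of the two filters would exclude everything, since the exclude of "*" would then be the last match for every file.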
At this point we have a solid configuration for the webserver and a lot of data on S3 with which to construct some analytics for these static sites.