Keeping a Live Website in Sync With a Local Version, Part Two
published 08/08/11
This part covers the technical aspects of a project I worked on to keep a live website in sync with a local version; the first part covers it from a business-level perspective.
I created a system for keeping a live version of a website in sync with a local one. A Ruby script, run on a schedule by Cron, compares and syncs the two versions, and the results get out to everyone using the site through a shared Dropbox folder. An XML file, dubbed a Castlist, is used to compare last-updated dates.
The Castlist File
In order for the Ruby script to be able to compare assets in the live and local versions of the site, I used an XML file - here's an example:
<?xml version="1.0" encoding="UTF-8"?>
<castlist>
  <templates>
    <template id="223349" updated_at="2011-04-27 16:05:03.0" name="dmk_bottom" />
    <template id="224294" updated_at="2011-06-09 11:04:20.0" name="dmk_castlist" />
    ... [ more templates ] ...
  </templates>
  <assets>
    <asset content_type="html" id="100466409" updated_at="2011-04-28 12:29:30.0" path="http://mediakit.skininc.com/index.html" />
    <asset content_type="html" id="102168174" updated_at="2010-11-04 08:45:03.0" path="http://mediakit.skininc.com/about/index.html" />
    <asset content_type="html" id="100467204" updated_at="2011-06-14 10:44:22.0" path="http://mediakit.skininc.com/advertising/index.html" />
    <asset content_type="css" id="020" updated_at="2007-07-07 02:00:00.0" path="http://s3.amazonaws.com/abm-assets/css/base.css" />
    <asset content_type="css" id="021" updated_at="2007-07-07 02:00:00.0" path="http://s3.amazonaws.com/abm-assets/css/si-brand-mini.css" />
    <asset content_type="js" id="026" updated_at="2007-07-07 02:00:00.0" path="http://s3.amazonaws.com/abm-assets/js/dmk/behavior.js" />
    ... [ more assets ] ...
  </assets>
</castlist>
There are two types of things I need to compare: templates and assets. Assets are the html, css and js files that make up the site; the templates are a little more complicated. If I only had to compare the assets, then I wouldn't need the XML file at all - I could probably use the Last-Modified HTTP header and be good to go.
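As a rough illustration of that simpler approach (not what CAST actually does), a HEAD request is enough to read the Last-Modified header:

require 'net/http'
require 'uri'

# Minimal sketch of the simpler approach: read the Last-Modified header
# for a single asset with a HEAD request. The URL is one of the example
# assets from the Castlist above.
uri = URI.parse('http://mediakit.skininc.com/index.html')
response = Net::HTTP.start(uri.host, uri.port) { |http| http.head(uri.path) }
puts response['Last-Modified']  # e.g. "Thu, 28 Apr 2011 12:29:30 GMT"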
The problem is that I don't want to rely on my CMS for anything; I want my solution to be independent of anything I haven't written. I can't write Java (it's a Java-based platform) or anything fancy like that, but I can use their templating engine to report on the state of the templates themselves, so that's what I did. The Castlist file serves as the API of the CMS so that the script knows where things stand.
Ruby Gems Used
To write the Ruby script I used just two gems: Nokogiri and Dropbox. Nokogiri makes parsing the Castlist XML files easy, and Dropbox handles talking to the Dropbox API.
Script Overview
The main classes are Comparer, Syncer and Fixer. The overall flow of the script goes something like this: initialize a Comparer and a Syncer and see if there's a major change. A major change is something at the template level (like maybe something in navigation or a header image) that would require all pages to be refreshed. If not, then just look for stale or new pages. Then the Syncer we initialized earlier makes a Fixer object and uploads fixed files through the Dropbox API. Finally, we use the Comparer to update the local Castlist as a record of what's been changed (so we don't do the same thing over and over).
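Here's a rough sketch of what that top-level flow might look like. It's illustrative rather than the actual script - in particular, update_castlist! is a hypothetical name for the final record-keeping step:

# Hypothetical top-level flow; Comparer and Syncer are the real classes,
# but update_castlist! is an illustrative name for the last step.
comparer = Comparer.new   # loads the live and local Castlists
syncer   = Syncer.new     # reconnects to the Dropbox API

urls = if comparer.major_change?
  comparer.live_urls                       # refresh every page
else
  comparer.stale_urls + comparer.new_urls  # just changed and new pages
end

syncer.upload_files(urls) unless urls.empty?
comparer.update_castlist!  # record what's been synced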
The Comparer Class
When the Comparer class is initialized, it sets two instance variables called live and local to Nokogiri documents representing the live and local versions of the Castlist files. These attributes are used by the other instance methods to detect major changes, stale files and new files.
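That initialization might look something like this - a sketch that assumes the live Castlist is fetched over HTTP and the local one read from disk (both paths are illustrative, not the real ones):

require 'nokogiri'
require 'open-uri'

class Comparer
  # Sketch only: load the live Castlist over HTTP and the local copy
  # from disk. Both locations here are assumptions for illustration.
  def initialize
    @live  = Nokogiri::XML(open('http://mediakit.skininc.com/castlist.xml'))
    @local = Nokogiri::XML(File.read('castlist.xml'))
  end
end

The major_change? method is the first that's called and it looks like this: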
def major_change?
  # Report a change if any live template's updated_at differs from
  # (or is missing in) the local copy.
  @live.root.css("template").detect do |live_node|
    local_node = @local.root.css("template##{live_node.attributes['id']}").first
    local_node.nil? ||
      live_node.attributes['updated_at'].to_s != local_node.attributes['updated_at'].to_s
  end
end
So, we just iterate over the live Nokogiri document looking for templates that don't have matching updated_at dates; detect returns the first mismatched node (or nil if everything matches), so major_change? is truthy exactly when something changed. If a mismatch is found, then we have a major change and need to drop all the content and crawl the live site all over again. We use the live_urls method for that:
def live_urls
  @live.root.css("asset").map { |node| node.attributes['path'].to_s }
end
Not too much here - we're just finding all the assets in the live Castlist and then using map to convert them to a list of URLs, which is what we'll be passing off to Syncer later.
But normally we won't have a major change, so the next things to look for are stale files (files whose live versions are newer than their local versions) and new files that aren't in the local version yet. The stale_urls method looks like this:
def stale_urls
  stale_nodes = @live.root.css('asset').select do |live_node|
    local_node = @local.root.css("asset##{live_node.attributes['id']}").first
    if local_node
      # Stale when the live updated_at no longer matches the local one.
      live_node.attributes['updated_at'].to_s != local_node.attributes['updated_at'].to_s
    else
      false # brand-new assets are handled by new_urls instead
    end
  end
  stale_nodes.map { |node| node.attributes['path'].to_s }
end
We're iterating over the live Nokogiri document once again, and as in major_change? we're interested in the difference between the live and local updated_at dates. The files where these don't match are thrown into stale_nodes, which we then map down to just the paths.
The last thing to check for is new URLs; here's that method:
def new_urls
  new_nodes = @live.root.css('asset').select do |live_node|
    # New when no local asset exists with this id.
    @local.root.css("asset##{live_node.attributes['id']}").empty?
  end
  new_nodes.map { |node| node.attributes['path'].to_s }
end
Similar to the methods above: we're iterating over the same set of asset nodes in the live document and, once we find the elements we want, we use map to get just the paths of the files we're interested in. The only difference is that here we use the result of Nokogiri's css method to decide whether the file already exists locally. If not, then it's picked.
So, these four methods are used to come up with an array of URLs that need to be updated. In the case of a major change, that's really all of them; in the other cases, it's just a subset.
The Syncer Class
The Syncer class is what connects to Dropbox through their API and actually does the uploading. Upon initialization, it creates an instance variable called session that uses Dropbox::Session.deserialize to reestablish the connection.
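That initialization looks roughly like this - a sketch that assumes the session was serialized to a cast_session.txt file by a one-time authorization script (the filename and @local_path value here are illustrative):

require 'dropbox'

class Syncer
  def initialize
    # Reconnect using a session serialized during a one-time OAuth setup.
    # The filename and local path are assumptions for illustration.
    @session = Dropbox::Session.deserialize(File.read('cast_session.txt'))
    @local_path = '/cast/mediakit.skininc.com'
  end
end

Once connected, we wait for the Comparer to feed us a list of URLs, and then the upload_files method gets to work: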
def upload_files(urls)
  fixer = Fixer.new
  # First pass: the urls handed to us by the Comparer. Cleaning each one
  # also populates fixer.found_files with extra assets discovered en route.
  urls.each do |url|
    file_path, filename, extension = /com(.*\/)([\w\-]+\.(\w+))/.match(url).captures
    clean_data = fixer.clean(url, extension)
    upload_file(clean_data, @local_path + file_path, filename)
  end
  # Second pass: everything the Fixer found while crawling the first batch.
  fixer.found_files.each do |url|
    file_path, filename, extension = /com(.*\/)([\w\-]+\.(\w+))/.match(url).captures
    clean_data = fixer.clean(url, extension)
    upload_file(clean_data, @local_path + file_path, filename)
  end
end
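The upload_file method itself isn't shown here; a minimal sketch, assuming the dropbox gem's Session#upload takes a local file path plus a destination folder and names the remote file after the local one, might write the cleaned data to a temporary file and push that up:

require 'tmpdir'

# Sketch only: upload_file under the assumptions stated above.
def upload_file(data, remote_dir, filename)
  Dir.mktmpdir do |dir|
    local = File.join(dir, filename)
    File.write(local, data)             # write the cleaned data to disk
    @session.upload(local, remote_dir)  # push it up via the Dropbox API
  end
end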
I'll get into Fixer in the next section, but for now just know that it takes our data and spits it back out cleaned up. The other thing Fixer does is look for more files that will need to be uploaded. This is important, so I'll go into it more now.
The Castlist is a manifest of sorts. It's a list of the assets that make up a website, but it's not exhaustive, and it shouldn't be. If we were limited to only those assets that appear in the Castlist, then every time a new image was needed, we'd have to add it to the Castlist. On a brochure website meant to sell ads, that would make running the site too slow. Further, the people working on the content of the site are not programmers and know nothing about how CAST operates. All they know is how to edit pages in the CMS to make changes on the live websites.
And that's all they should care about. The Castlist is kept to just the elements of the site that are fairly stable - things like the webpages and the css and js files that make them up. The exact elements on those pages can change without any effect on how CAST works. To support this, CAST must crawl the pages that make up the site, and that's what fixer.found_files is all about.
The upload_files method is merely a controller here - it instantiates a Fixer, iterates over the URLs it gets from the Comparer and the URLs the Fixer finds, and then gets that content uploaded. The actual work happens in the Fixer class.
The Fixer Class
Since we can't rely on the Castlist to be a complete list of all assets on the site, we have to crawl pages and look for new things to download. But while we're crawling, we should also change all absolute paths to relative ones. These are the two things the Fixer class does - we'll start with the method used to fix paths.
The path problem is pretty straightforward: on the live website, I make all paths absolute so that CAST can take the URL and retrieve the asset. That's not going to work on the local copy, where you don't have an internet connection, so the asset needs to be downloaded and the path altered to be relative. To do this I wrote the find_relative_path method:
def find_relative_path(url)
  # Strip the host, then count how many directories deep the file lives.
  steps = url.gsub(/http:\/\/mediakit\.\w+\.com\//, '').split('/').count - 1
  '../' * steps
end
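A couple of hypothetical calls show the idea:

fixer = Fixer.new
fixer.find_relative_path('http://mediakit.skininc.com/advertising/index.html')
# => "../"  ("advertising/index.html" is one directory deep)
fixer.find_relative_path('http://mediakit.skininc.com/index.html')
# => ""    (the root page needs no steps up)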
We take the URL of the file we're working with, remove the host and split on slashes. Subtracting one (for the filename itself) gives the directory depth, and the juicy string multiplication we get in Ruby turns that into a string that's just a bunch of steps up the tree. This is then used while we crawl.
When Fixer is instantiated, it creates an instance variable called found_files that's going to be where we put assets we find while crawling. The crawling starts with the clean method:
# Net::HTTP and URI come from the standard library's net/http and uri.
def clean(url, ext)
  data = Net::HTTP.get URI.parse(url)
  case ext
  when 'css'
    scan_for_images(data)
  when 'html'
    relative_path = self.find_relative_path(url)
    # Rewrite absolute CMS and S3 prefixes into relative ones.
    data.gsub!(/(http:\/\/mediakit\.\w+\.com\/|https:\/\/s3\.amazonaws\.com\/)/, relative_path)
    scan_for_files(data, relative_path)
  end
  return data
end
The clean method is acting as a controller here - its job is to fetch the data at url and then call the right crawling method based on what type of file it is. Along the way, we fix up paths using the relative path we found earlier.
Since our crawling is going to be slightly different depending on which type of file we're working with, we fork the code here on the file extension. For an html file, we use the scan_for_files method to crawl:
def scan_for_files(data, relative_path)
  # Repeatedly find href/src attributes that still point at absolute URLs,
  # remember them for download, and rewrite the path to be relative.
  while data =~ /(href|src)\=\"(http:\/\/media.\w+.com\/([\w\-\/\.]+))\"/ do
    @found_files << $2 unless @found_files.include?($2)  # $2 = the full URL
    data.sub!($2, relative_path + $3)                    # $3 = path after the host
  end
end
The idea here is to find every node with an href or src attribute that still points at an absolute URL, add that attribute's value to our found_files array and then update the path using our relative path. For example, href="http://media.skininc.com/css/print.css" would become href="../css/print.css" on a page one directory deep.
If we're working with a css file, then we use the scan_for_images method:
def scan_for_images(data)
  # Walk the stylesheet line by line looking for url("..") references.
  data.each_line do |line|
    if line =~ /url\(\"\.\.([\w\-\/\.]+)\"\)/
      # The css paths are relative, so rebuild the absolute S3 URL.
      url = 'http://s3.amazonaws.com/abm-assets' + $1
      @found_files << url unless @found_files.include?(url)
    end
  end
end
As with the other scan method, we're looking for assets we don't already know about, but since the css uses relative paths already, there's nothing to rewrite - we just add the rebuilt URLs to the found_files array.
Deploying the Script
While writing this script, I would just jump into Terminal and run it whenever I needed to check something, but once I was ready to actually start depending on this thing, I needed a way to run it on a schedule.
We happened to have a spare MacMini in the office that wasn't being used for anything important, so I commandeered it to serve as our CAST server. After a quick install of Git and RVM, I just cloned the repo, got Ruby 1.9.2 installed and added the couple of gem dependencies.
For running the script on a schedule, I turned to my old frenemy Cron. I had never attempted something like this, so after some research (Googling), I found that I could run Bash commands as easily as anything else I've done with Cron, and that you can even run a series of commands by separating them with semicolons. Here's how I've got Cron running this script:
0 * * * * /bin/bash -l -c 'rvm 1.9.2-p180@cast; cd /Users/jon/Projects/abm/cast/; ruby cast.rb > logs/result.log'
Every hour, on the hour, with results logged to a file that just keeps overwriting itself, all within the comfy confines of RVM with a suitable gemset for this project. Easy in retrospect, but quite a pain to figure out at the time!
Fire and Forget
This project went through one rewrite about a year in, but it has mostly been a fire-and-forget type of project. I remote into the MacMini every once in a while to run Software Update, but other than that it's very hands-off.
If you're interested in finding out more about the business-level strategy behind this project, please read Part One.