A Minimal Proxy with PHP and SQLite

Many web services allow access to their APIs via JSONP nowadays. This allows client-side JavaScript to access ressources via domain boundaries.

However, for several reasons caching ressources locally (on the same server as the original web page) has still many application areas. Take the Twitter API as example. A client is allowed to have 150 API calls per hour. If the user has a tab open to a web page, that refreshes his timeline every 20 seconds, he has no allowed API calls left at the end of an hour. When he then visits your page, where you display your Twitter messages via a JSONP call of the customer, this action will utterly fail, your “Twitter Box” will stay empty.

How do we remedy this situation? Let’s simply cache our Twitter API call and serve the user the readily cached content. One additional advantage is, that we can fetch simple JSON without the “P” part, so we don’t have to run JavaScript on our pages, that comes from uncontrolled sources. That’s all taught in Proxies 101, so let’s dive directly into the implementation.

A Basic Proxy ¶

We use PHP and SQLite for a simple proxy and enhance this with security and HTTP caching features. Let’s start with basic features. For simplicity we assume, that the database lives in the same place as the PHP script, but you could simply change the path, if you want it elsewhere.

$cache_duration = 60*60;
$then = mktime() - $cache_duration;

$url = $_GET['url'];
$db = new PDO("sqlite:".dirname(__FILE__)."/cache.sqlite");

# Fetch entry from DB
$stm = $db->prepare("SELECT content, type, age FROM cache WHERE url = ? AND age > ?");
$stm->execute(array($url, $then));
$entry = $stm->fetch(PDO::FETCH_ASSOC);

# no entry found in cache
if ($entry === False) {

    # prune cache
    $stm = $db->prepare('DELETE FROM cache WHERE url = ? OR age < ?');
    $stm->execute(array($url, $then));

    # fetch a current version
    $entry = fetch_entry($url);

    if ($entry['type'] !== NULL) {
        $stm = $db->prepare("INSERT INTO cache ( url, type, content, age )
                                VALUES (?, ?, ?, strftime('%s', 'now'))");
        $stm->execute(array($url, $entry['type'], $entry['content']));
    } else {
        # something went wrong
        $entry = array(
            'type' => 'application/json',
            'content' => 'null',
        );
    }
}

header('Content-Type: '.$entry['type']);
echo $entry['content'];

The code fetches an entry from a database. If there is no match, it fetches the ressource with the to-be-defined function fetch_entry(). There is one constraint for matches: They must be newer than the value of $then, which is the timestamp of one hour back in the past.

Which entry to fetch is controlled by the GET parameter url. It must be a complete URL, as we will see in a moment.

The fetch_entry() function uses cURL for fetching a ressource via the InterWebs. It has the advantage over other methods like fopen(), that we can also access the HTTP headers.

function fetch_entry($url) {
    $t = $c = NULL;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "My little cacher");
    curl_setopt($ch, CURLOPT_HEADER, 1); // we want the HTTP headers, too
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // we need the stuff as 
                                                    // string, not printed out
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // set a timeout for safety

    // the output is normalized to UNIX line endings (ASCII x10)
    $output = str_replace("\r\n", "\n", curl_exec($ch));
    curl_close($ch);

    // if there is output, try to find out the MIME type and determine the content
    if ($output) {
        list($h, $c) = explode("\n\n", $output, 2);
        $h = explode("\n", $h);
        $t = "text/plain";
        foreach ($h as $line) {
            if (substr(strtolower($line), 0, 13) === "content-type:") {
                $t = trim(preg_replace('/^Content-Type:\s*(.*)$/i', '$1', $line));
                break;
            }
        }
    }
    return array(
        'type' => $t,
        'content' => $c
    );
}

The function takes the URL, feeds it to a cURL instance, catches the output together with the HTTP headers and tries to make something useful out of it. What bloats the code a bit is, that we want to determine automatically the MIME type of the ressource, that is sent by the original server, so that we can send it along to the user.

A First Error Checking ¶

If the GET parameter is missing, we will run in trouble at the moment. Therefore we put in front of the above script a simple check, that returns a small JSON snippet, if the test fails.

if (! isset($_GET['url'])) {
  header('HTTP/1.0 400 Bad Request');
  header('Content-Type: application/json');
  echo 'null';
  exit(1);
}

The HTTP status tells the client, that it did something wrong. If we had a public API, we should put an error description in the response body instead of a simple null. But this test is sufficient for a client JavaScript to detect a problem.

Utilizing the “Cache” Part ¶

We have a nice cache/proxy now, but still, with every page request, the proxy script is called at least once. We can do better, if we use the knowledge about how long the cached content will be fresh at least.

It boils down to setting the two HTTP 1.1 headers Cache-Control and Expires with the appropriate time values, so that the client knows, for how long it is allowed to store the ressource locally.

$cache_duration = 60*60;
$then = mktime() - $cache_duration;
$expires = $then + 2*$cache_duration;
$max_age = $cache_duration;

$url = $_GET['url'];
$db = new PDO("sqlite:".dirname(__FILE__)."/cache.sqlite");

# Fetch entry from DB
$stm = $db->prepare("SELECT content, type, age FROM cache WHERE url = ? AND age > ?");
$stm->execute(array($url, $then));
$entry = $stm->fetch(PDO::FETCH_ASSOC);

# no entry found in cache
if ($entry === False) {

    # prune cache
    $stm = $db->prepare('DELETE FROM cache WHERE url = ? OR age < ?');
    $stm->execute(array($url, $then));

    # fetch a current version
    $entry = fetch_entry($url);

    if ($entry['type'] !== NULL) {
        $stm = $db->prepare("INSERT INTO cache ( url, type, content, age )
                                VALUES (?, ?, ?, strftime('%s', 'now'))");
        $stm->execute(array($url, $entry['type'], $entry['content']));
    } else {
        # something went wrong
        $entry = array(
            'type' => 'application/json',
            'content' => 'null',
        );
    }
} else {
    $expires = $entry['age'] + $cache_duration;
    $max_age = $entry['age'] - $then;
}

header('Content-Type: '.$entry['type']);
header('Expires: '.date('r', $expires));
header('Cache-Control: max-age='.$max_age);
echo $entry['content'];

We calculate first, when a newly fetched ressource would expire. Since we will cache it for one hour regardless, this will be exactly one hour.

If the ressource is fetched from the database, we have to recalculate those values on base of the actual time, when we fetched it.

Finally, we send the appropriate HTTP headers to the client.

A Simple Autoinstaller ¶

The code relies so far on the database being already in place with the correct table layout. We don’t want to let the script die, just because there is no DB. That would be sooo 20^th century. Fortunately, SQLite is very forgiving. The only thing we really need is write access to the database file and its directory (in case we have to create it).

To see if we need an installation, we set up a test query to the DB, and if it doesn’t succeed, we set up the database:

$stm = $db->prepare('select 1 from cache'); // this is the test
if ($stm === False) {
    // set up the table
    $db->exec('CREATE TABLE cache (url TEXT,
                                   type TEXT,
                                   content BLOB,
                                   age INTEGER,
                                   UNIQUE(url))');
}

Place these lines directly below the initialization of $db, and you will never have to worry again about installation. (Actually, if there is no DB file at all, the PHP SQLite driver will create a completely new one.)

Security Considerations ¶

So far, so good. We have a working proxy with caching functionality and an auto-installer. But leaving the ivory tower and the flower meadow around it, we will soon make acquaintance with several blackhats out there.

Why is this so? Well, we don’t check the ressource the cache should fetch. This means, that with a simple URL manipulation our server will become a malware distributor:

http://our-domain.example.com/proxy.php?url=http://malware-domain.com/virus.exe

What will this do? Our proxy script faithfully accepts the url parameter, goes to http://malware-domain.com/virus.exe and starts distributing it on its own. Not a very pleasant thought.

We can remedy this by cropping the set of allowed URLs from everything out there to a small set of trusted sources. Therefore we introduce a variable $whitelist = array('foo.example.com', 'bar.example.com'); of domains we allow retrieving data from. We refer to this variable in an extended version of the fetch_entry() function:

function fetch_entry($url) {
    global $whitelist;
    $host = parse_url($url, PHP_URL_HOST);
    $accepted = False;
    foreach ($whitelist as $test) {
        // check the host against each whitelist entry
        if ($test === $host) {
            $accepted = True;
            break;
        }
    }
    $t = $c = NULL;

    if ($accepted) {
        // fetch stuff. This remains the same
        // ...
    } else {
        $t = 'application/json';
        $c = '{"error":"forbidden"}';
    }
    return array(
        'type' => $t,
        'content' => $c
    );
}

Now we have closed this security hole, provided the domains in the whitelist don’t actually distribute malware themselves.

Conclusion ¶

We arrive at a simple web proxy with cache, that allows us to intercept API calls to external servers and cache the result. Since we use SQLite as data storage engine, we have no hassle with installing anything, and by using the knowledge of how long we cache things we can set appropriate HTTP headers to minimize traffic further.

For the interested, here comes the complete, final PHP script. If you want to use it, go ahead. It’s published under an MIT-style license.

<?php
/*
Copyright (c) 2011 Manuel Strehl

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/

// check preconditions
if (! isset($_GET['url'])) {
  header('HTTP/1.0 400 Bad Request');
  header('Content-Type: application/json');
  echo 'null';
  exit(1);
}

// allow requests only to these domains
$whitelist = array('foo.example.com', 'bar.example.com');

// cache ressources for one hour
$cache_duration = 60*60;
$then = mktime() - $cache_duration;
$expires = $then + 2*$cache_duration;
$max_age = $cache_duration;

$url = $_GET['url'];
$db = new PDO("sqlite:".dirname(__FILE__)."/cache.sqlite");
$stm = $db->prepare('select 1 from cache'); // this is the test
if ($stm === False) {
    // set up the table
    $db->exec('CREATE TABLE cache (url TEXT,
                                   type TEXT,
                                   content BLOB,
                                   age INTEGER,
                                   UNIQUE(url))');
}

# Fetch entry from DB
$stm = $db->prepare("SELECT content, type, age FROM cache WHERE url = ? AND age > ?");
$stm->execute(array($url, $then));
$entry = $stm->fetch(PDO::FETCH_ASSOC);

# no entry found in cache
if ($entry === False) {

    # prune cache
    $stm = $db->prepare('DELETE FROM cache WHERE url = ? OR age < ?');
    $stm->execute(array($url, $then));

    # fetch a current version
    $entry = fetch_entry($url);

    if ($entry['type'] !== NULL) {
        $stm = $db->prepare("INSERT INTO cache ( url, type, content, age )
                                VALUES (?, ?, ?, strftime('%s', 'now'))");
        $stm->execute(array($url, $entry['type'], $entry['content']));
    } else {
        # something went wrong
        $entry = array(
            'type' => 'application/json',
            'content' => 'null',
        );
    }
} else {
    $expires = $entry['age'] + $cache_duration;
    $max_age = $entry['age'] - $then;
}

header('Content-Type: '.$entry['type']);
header('Expires: '.date('r', $expires));
header('Cache-Control: max-age='.$max_age);
echo $entry['content'];


/**
 * Fetch a ressource
 * @param string $url The URL of the ressource
 * @return array MIME type and actual content
 */
function fetch_entry($url) {
    global $whitelist;
    $host = parse_url($url, PHP_URL_HOST);
    $accepted = False;
    foreach ($whitelist as $test) {
        // check the host against each whitelist entry
        if ($test === $host) {
            $accepted = True;
            break;
        }
    }
    $t = $c = NULL;

    if ($accepted) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, "My little cacher");
        curl_setopt($ch, CURLOPT_HEADER, 1); // we want the HTTP headers, too
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // we need the stuff as 
                                                        // string, not printed out
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // set a timeout for safety

        // the output is normalized to UNIX line endings (ASCII x10)
        $output = str_replace("\r\n", "\n", curl_exec($ch));
        curl_close($ch);

        // if there is output, try to find out the MIME type and determine the
        // content
        if ($output) {
            list($h, $c) = explode("\n\n", $output, 2);
            $h = explode("\n", $h);
            $t = "text/plain";
            foreach ($h as $line) {
                if (substr(strtolower($line), 0, 13) === "content-type:") {
                    $t = trim(preg_replace('/^Content-Type:\s*(.*)$/i', '$1', $line));
                    break;
                }
            }
        }
    } else {
        $t = 'application/json';
        $c = '{"error":"forbidden"}';
    }
    return array(
        'type' => $t,
        'content' => $c
    );
}