Keeping Web Users Safe By Sanitizing Input Data

15 min read
Coding, PHP, Security
Share on Twitter, LinkedIn

About The Author

Philip Tellis is a geek, speedfreak and Chief RUM Distiller at SOASTA where he works on the mPulse product. mPulse helps site owners measure the real user … More about Philip ↬

Email Newsletter

Weekly tips on front-end & UX.
Trusted by 200,000+ folks.

In my last article, I spoke about several common mistakes that show up in web applications. Of these, the one that causes the most trouble is insufficient input validation/sanitization. In this article, I’m joined by my colleague Peter (evilops) Ellehauge in looking at input filtering in more depth while picking on a few real examples that we’ve seen around the web. As you’ll see from the examples below, insufficient input validation can result in various kinds of code injection including XSS, and in some cases can be used to phish user credentials or spread malware.

To start with, we’ll take an example^[1] from one of the most discussed websites today. This example is from a site that hosts WikiLeaks material. Note that the back end code presented is not the actual code, but what we think it might be based on how the exploit works. The HTML was taken from their website. We think it’s fair to assume that it’s written in PHP as the form’s action is index.php.

<form method='get' action='index.php'>
<input name="search" value="<?php echo $_GET['search'];?>" />
<input type=submit name='getdata' value='Search' /></form>

In this code, the query string parameter search is echoed back to the user without sanitization. An attacker could email or IM unsuspecting users a crafted URL that escapes out of the <input> and does nasty things with JavaScript. A simple way to test for this exploit without doing anything malicious is to use a URL like this:

https://servername/index.php?search="><script>alert(0)</script>

This exploit works because PHP has no default input filtering, and the developers haven’t done any of their own filtering. This exploit would work just as well in most other programming languages as most of them also lack default input filtering. A safer way to write the above code is as follows:

<?php
$search = filter_input(INPUT_POST | INPUT_GET, 'search', FILTER_SANITIZE_SPECIAL_CHARS);
?>
<form method='get' action='index.php'>
<input name="search" value="<?php echo $search;?>” />
<input type=submit name='getdata' value='Search' /></form>

This is less convenient though and requires code for every input parameter used, so it is often a good choice to set special_chars as PHP’s default filter, and then override when required. We do this in PHP’s ini file with the following directive:

filter.default="special_chars"

We’re not aware of similar default filters in other languages, but if you know of any, let us know in the comments.

It’s important to note that simply adding this parameter to PHP’s ini file does not automatically make your application secure. This only takes care of the default case where an input parameter is echoed back in an HTML context. However, a web page contains many different contexts and each of these contexts requires input to be validated in a different way.

Is input validation enough?

Recently we’ve stumbled upon the following code:

<?php
$name = "";
if ($_GET['name']) {
    $name = filter_input(INPUT_POST | INPUT_GET, 'name', FILTER_SANITIZE_SPECIAL_CHARS);
}
echo "<a href=login?name=$name>login</a>";
?>

The developer correctly applies input filtering, and this code was reviewed and made live. However, something small seems to have slipped through. The developer hasn’t used quotes around the value of the href attribute, so the browser assumes that its value extends up to the first white-space character. A crafted URL demonstrates the problem:

https://servername/login.php?name=foo+onmouseover=alert(/bar/)

All of the characters in name are safe and pass through the filter untouched, but the resulting HTML looks like this:

<a href=login?name=foo onmouseover=alert(/bar/)>login</a>

The lack of quotes turns the attribute value into an onmouseover event handler. When the unsuspecting user mouses over the link to click on login, the onmouseover handler triggers. Quoting the value of the href attribute fixes the problem here. This is a good enough reason to quote all attribute values even though they are optional according to the HTML spec.

<?php
$name = "";
if ($_GET['name']) {
    $name = filter_input(INPUT_POST | INPUT_GET, 'name', FILTER_SANITIZE_SPECIAL_CHARS);
}
echo "<a href="login?name=$name">login</a>";
?>

For this particular situation though, we also need to look at context. The href attribute accepts a URL as its value, so the value passed to it needs to be urlencoded as well as quoted.

Full image (from xkcd)

Commonly overlooked sections

While many web developers correctly quote and validate input in page content, we find that some sections of the page are still overlooked, possibly because they aren’t perceived to be a problem, or perhaps they’ve just been missed. Here is an example from a dictionary web site:

<title><?php echo $word; ?> - Definitions and more ...</title>

Now by default, no browser executes code within the title tags, so the developer probably thought that it was safe to display data untreated in the title. Carefully crafted input data can escape the title tags and inject script with something like this

https://servername/dictionary?word=</title><script>alert(/xss/)</script>

Other commonly overlooked pages are error pages and error messages. Does your 404 page echo on screen the incorrect URL that was typed in? If it does, then it needs to treat that input first. A banking website recently had code similar to the following^[2] (they used ASP in this case):

<%
if Request.Querystring("errmsg") then
    Response.Write("<em>" & Request.QueryString("errmsg") & "</em>")
end if
%>

The errmsg parameter didn’t come in from a form, but from a server-side redirect. The developers assumed that since this URL came from the server it would be safe.

Ads/analytics sections at the bottom of a page are also frequently not handled correctly. Perhaps because boilerplate code is provided and it just works. As the following example from a travel site shows, you should not trust anyone and that includes boilerplate code:

<script type="text/javascript">
google_afs_query = "<?php echo $_GET['query'];?>";
.
.
</script>

This is vulnerable to the following attack string:

https://servername/?query=";alert(0)//

In this case input data needs to be validated for use in a JavaScript context since that’s where the data is echoed out to. What meta-characters would you scan for in this case? Would you quote them or strip them? The answer depends on context, and only you the developer or owner of the page know what the right context is.

Different contexts

Now like we mentioned earlier, input may be used in different contexts, and it needs to be treated differently depending on the context that it will be used in. Sometimes data will be used in multiple contexts and may require to be treated differently for each case. Let’s look at a few cases.

HTML Context

In an HTML context, data is written into an HTML page as part of the content, for example inside a <p> tag. Examples include a search results page, a blog commenting system, dictionary.com’s word of the day, etc. In this context, all HTML meta characters need to be encoded or stripped. That’s primarily < and >, but using PHP’s FILTER_SANITIZE_SPECIAL_CHARS is probably safer, and FILTER_SANITIZE_STRIPPED is probably the safest. Make sure you know what character set your data is in before you try to encode it.

There may be cases when you want to allow some HTML tags, for example in a CMS tool or a commenting system. This is generally a bad idea because there are more ways to get it wrong than to get it right. For example, let’s say that your blogging system allows commenters to markup their comments with some simple tags like <q> and <em>. Now a happy commenter comes along and adds the following code to his comment:

<q onmouseover="alert('xss')">...</q>

You’ve just been XSSed. If you are going to allow a subset of tags, then strip all attributes from those tags. A better idea is to use a CMS specific syntax like BBCode that your back end can translate into safe tags.

Attribute Context

In attribute context, user data is included as the attribute value of an HTML tag. Depending on the attribute in question, the context might be different. For non-event handlers, all HTML meta characters need to be encoded. FILTER_SANITIZE_SPECIAL_CHARS works here as well. In addition, all attribute values should be quoted using single or double quotes or you’ll be hit like the examples above.

For event handling attributes like onmouseover, onclick, onfocus, onblur or similar, you need to be more careful. The best advice is to never ever put input data directly into an event handler. Let’s look at an example.

<?php
  $n = filter_input(INPUT_GET, 'n', FILTER_SANITIZE_SPECIAL_CHARS);
?>
<input type="text" value="" name="n" onfocus="do_something('<?php echo $n; ?>');">

Looks safe, doesn’t it? What happens if an attacker tries to get out of the quoted region using a single quote, i.e., they use a URL like

https://servername/?n=foo');alert('xss

The input is sanitized and all single quotes are converted to '. Unfortunately, this isn’t enough. An event handler executes in two contexts one after the other. The data in the page is first HTML decoded and the result is passed into a JavaScript context. So, as far as the JavaScript handler is concerned, ‘ and ' are exactly the same and this introduces an XSS hole.

The best thing to do is to never pass input data directly into an event handler — even if it has been treated. It’s better to store it as a the value of a hidden field and then let your handler pull the value out of that field. Something like this would be safer:

<?php
  $n = filter_input(INPUT_GET, 'n', FILTER_SANITIZE_SPECIAL_CHARS);
?>
<input type="hidden" id="old_n" value="<?php echo $n ?>">
<input type="text" value="" name="n" onfocus="do_something(document.getElementById('old_n').value);">

URL Context

A special case of the attribute context is URL context. The value of the href and src attributes of various elements are URLs and need to be treated as such. Special characters included in a URL need to be urlencoded to be safe in this context. Using an HTML specific filter is insufficient here as we’ve seen in the missing quotes example above.

Also take note of URLs in meta tags and in HTTP headers. For example, code similar to the following was also recently seen online:

<?php
  if(preg_match('!^https?://(w+).mysite.com/!', $_GET['done']) {
      header("Location: " . $_GET['done']);
  }
?>

On the face of it, it looks safe enough since we’re checking that the done parameter matches our domain before we redirect, however we aren’t validating the entire URL. An attacker could easily slip in a newline character and then add more headers, for example, a second Location header, or an entire HTML document for that matter. All it takes is a little %0a in the done parameter.

Notice that the match uses a / after .com. This is necessary to protect against user@host style URLs or third party subdomains. For example, a malicious user could create a subdomain called www.mysite.com.evil.com and trick your regex. Alternately, they could use a URL like https://www.mysite.com@www.evil.com/ and trick your regex.

If your URL contains only ASCII characters, then PHP’s FILTER_VALIDATE_URL filter can be used instead of funky regular expressions.

Remember: when writing out URLs, the & character is special in HTML, so it needs to be written out as & (although most browsers will accept it if you don’t), while the ; character is special in an HTTP header, meaning that & will break the header.

When dealing with URLs, figure out which context the URL will be used in, encode it correctly and possibly check the domain. When checking the domain, make sure you use a starts-with match, and include the trailing / to protect against user@host style URLs.

JavaScript Context

If input data needs to be written out in a JavaScript context, i.e., within <script> tags or in a file served as the src attribute of a <script> tag, the data should be JSON encoded. In PHP, the json_encode function can be used. The JSON homepage has a list of JSON libraries for many other languages, all of which have a similar function.

Simply escaping quotes using addslashes or something similar is insufficient, because within script tags quotes can also be represented by their HTML entity values.

One special case to think about in the JavaScript context is the use of web services that return JSON-P data. You do this on your web page by including a script tag that points to a web service and specify a callback function to be called when the data is loaded. For example, to load public photos from Flickr, you’d use something like this:

<script src="https://api.flickr.com/services/feeds/photos_public.gne?format=json&jsoncallback=myfunc"></script>

Before you do that, you’d define myfunc in JavaScript. However, what you’re doing is giving the script from Flickr full access to your page’s DOM. As long as the script respects its contract with you (i.e., the API), you should be safe, but if whoever controlled that script were to suddenly turn evil, you’ve just opened your users up to attack.

In general, only point your scripts tags to URLs that you fully trust, both to not be evil and also to never be compromised themselves. If you must include untrusted scripts, consider sandboxing them in an iframe or use Caja if you can. If you do use an iframe, then consider that there may be certain conditions under which you need to use a double-iframe. This is primarily done to prevent referrer leaking if your page’s URL itself is secret, like a search results page or a capability URL.

CSS Context

Internet Explorer is the only major browser around that allows script execution within CSS using the expression syntax (deprecated and no longer supported in IE8 and later). However, that’s still reason enough to worry about it. As an example, consider a website that allows users to customize the background of their profile pages, similar to MySpace or Twitter (note that neither website is vulnerable to this flaw). Let’s say that you accept a background color and/or image and assign that to the CSS background property. If you don’t correctly validate and sanitize the values passed in by the user, they could pass in a JavaScript expression instead of a real color. This might result in CSS code like this:

background: #28d expression("alert('xss')");

Making sure the background color the user specifies is a valid CSS color and nothing else will protect you from this kind of an attack.

With URLs, a different issue may come in to play. Let’s say that you allow the user to specify their own background image URL. You validate this URL when the user specifies it — to make sure it doesn’t return a 404 error. After this is done, the user could replace the URL with a script that returns a 401 HTTP status code. This makes the browser throw up an authentication dialog, which might confuse the user into entering their username and password of your site. An interesting attack that we haven’t seen outside of the lab.

The fix is to download the specified image to your own server and run some kind of transformation on it, most commonly for size. Even if your transformation does nothing, it can still remove malware that may be embedded in a JPEG.

Other Contexts

There are other contexts that we don’t look at in this article. These commonly deal with the back end and include things like an SQL context or a Shell context or a back end web service context. Another interesting attack that results from improper input validation is HTTP Parameter Pollution or HPP for short.

Should you filter on input or output?

The comments of my last article brought up an interesting point regarding whether data should be filtered on input or output. Since we have so many different contexts, it seems obvious that data should be filtered just before output depending on the context. Filtering for the wrong context could still introduce vulnerabilities. This is the ideal case where every programmer on your team knows what they are doing at all times and always programs with security in mind. In practice, this doesn’t always happen. Even experienced programmers have been known to slip up once or twice, and it’s those occasions that come back to bite you.

A simple guideline is to strip out all punctuation by default and let the web developer override this based on context. This means that using untreated input will either be safe, or not work at all, which serves as a reminder to the developer that they need to think about context. We encourage developers to validate data on input. This involves checking data types, ranges, lengths and possibly the character set/encoding in use. The purpose of validation is to make sure that we receive what we expect to receive. Data should be further sanitized on output depending on context. Sanitization involves transforming (possibly destructively) the data to be safe in the output context. Remember that sometimes a single piece of data may be used in multiple contexts on the same page.

Both validation and sanitization are types of filters to be run on input data, and often both might be required.

In closing

No data that comes in from an untrusted source should be trusted. This would include anything that you did not create yourself. The data may come in as command line parameters, through a query string, through POST data, cookies, HTTP headers, a web service call, an uploaded file, or anything else. If you did not create it, then it can’t be trusted. Validate all data to make sure it’s what you expect, and then treat it to make sure it’s safe in the context where it will be used. Be aware of the different contexts within a web page and keep your users safe.

References

You might be interested in the following related posts:

(vf)

Explore more on