Protecting against XSS

The problem as I see it

Where to start? Let me start by telling you that most of the books you read are wrong. The code samples you copy off the internet to do a specific task are wrong (the wrong way to handle a GET request), and the function you copied from that work colleague, who in turn copied it from a forum, is wrong (the wrong way to handle redirects). Start to question everything. Maybe this blog post is wrong :) This is the kind of mindset you require in order to protect your sites from XSS. You as a developer need to start thinking more about your code. If an article you are reading contains stuff like echo $_GET or Response.Write without filtering then it’s time to close that article.

Are frameworks the answer? In my honest opinion, no. Yes, a framework might prevent XSS in the short term, but in the long term the framework code will be proven to contain mistakes as it evolves, and when it is exploited the result will be more severe than if you wrote the code yourself. Why more severe? A framework hole can be easily automated since many sites share the same codebase. If you wrote your own filtering code then an attacker would be able to exploit the individual site but would find it hard to automate an attack across a range of sites using different filtering methods. This is one of the main reasons the internet works today: not because everything is secure, just because everything is different.

One of the arguments I hear is that a developer can’t be trusted to create a perfect filtering system for a site and using a framework ensures the developer follows best guidelines. I disagree. Developers are intelligent; they write code and understand code. If you can build a system you can protect it, because you’re in the best position to.

How to handle input

When you handle user input just think to yourself “a number is a vector”. Imagine a site that renders an image server side and allows you to choose the width and height of the graphic. If you don’t think a number is a vector then you might not put any restrictions on the width and height of the generated graphic, but what happens when an attacker requests a 100000×100000 graphic? If your code doesn’t handle the maximum and minimum inputs then an attacker can DoS your server with multiple requests. The lesson is not to be lazy about each input you handle; you need to make sure each value is validated correctly.
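To make the image size example concrete, here is a minimal sketch in JavaScript. The function name parseDimension and the limits MIN_SIZE and MAX_SIZE are my own assumptions for illustration; choose limits that suit what your server can afford to render.

```javascript
// Hypothetical limits for an image generator, not values from any real site.
const MIN_SIZE = 1;
const MAX_SIZE = 2000;

// Returns a safe integer dimension, or null when the input should be rejected.
function parseDimension(raw) {
	// Validate type: query parameters can arrive as arrays or objects.
	if (typeof raw !== 'string') {
		return null;
	}
	// Whitelist and length: digits only, and at most 4 of them.
	if (!/^[0-9]{1,4}$/.test(raw)) {
		return null;
	}
	// Restrict: enforce the minimum and maximum.
	const value = parseInt(raw, 10);
	if (value < MIN_SIZE || value > MAX_SIZE) {
		return null;
	}
	return value;
}

console.log(parseDimension('640'));    // 640
console.log(parseDimension('100000')); // null (too many digits)
console.log(parseDimension('-5'));     // null (not digits only)
```

Rejecting with null rather than silently clamping is a design choice here; either works as long as the generator never sees an unbounded number.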

The process should be as follows.
1. Validate type – Ensure the value you are getting is what you were expecting.
2. Whitelist – Remove any characters that should not be in the value by providing the only characters that should.
3. Validate Length – Always validate the length of the input even when the value isn’t being placed in the database. The less that an attacker has to work with the better.
4. Restrict – Refine what’s allowed within the range of characters you allow. For example is the minimum value 5?
5. Escape – Depending on context (where your variable is on the page) escape correctly.

You can make things easier for yourself by placing these methods into a function or a class, but don’t overcomplicate things: keep each method as simple as possible, and be very careful and descriptive with your function names to avoid confusion.

HTML context

Let’s look at an example of the method above with a code sample in PHP.

<?php
$x = (string) $_GET['x']; //ensure we get a string not array
$x = preg_replace("/[^\w]/","", $x); //remove any characters that are not a-z, A-Z, 0-9 or _
$x = substr($x, 0, 10);//restrict to a maximum of 10 characters
if(!preg_match("/^a/i", $x)) {//this value must only begin with a or A
	$x = '';
}
echo '<b>' . htmlentities($x, ENT_QUOTES) . '</b>'; //escape everything according to context of $x
?>

You might be wondering why I used (string) in the code above. Let’s try it without it.

Using the following: test.php?x[]=123
Results in: “Warning: substr() expects parameter 1 to be string, array given”

Because PHP allows you to pass arrays over a GET request, an attacker can trigger a PHP warning about an unexpected type when you try to whitelist the value. Casting the value with (string) ensures you get the expected type.

Great, so we now understand how to restrict and escape a value. Let’s look at another context.

Script context

When not in XHTML/XML mode a script tag does not decode HTML entities. If you have a value within a variable inside a script tag, the question is: what do you escape?

example:

<script>x='value here';</script>

Inside a JavaScript variable like this you have to watch out for the single quote (') and </script>. Using these vectors it’s possible to XSS the value. The two examples are listed below.

vector 1: ',alert(1),//
vector 2: </script><img src=1 onerror=alert(1)>

The second example requires no quotes, and a lot of developers assume it won’t be executed because it’s still inside a JavaScript variable. This is clearly wrong: it executes because the HTML parser ends the script block at the first </script> it sees, before the JavaScript is ever parsed, so the rest of the value is treated as HTML.

To escape a value inside a script context you should JavaScript escape the value. The best way of doing this is using unicode escapes. A unicode escape in JavaScript looks like the following:


<script>
alert('\u0061');//"a" in a unicode escape
</script>

You can experiment with unicode escapes using my Hackvertor tool. Please understand how they work as they will be very important to you when understanding how to protect many contexts.

It’s very important you follow the same procedure as before (Validate type, Whitelist, Validate Length, Restrict, Escape) for the specific variable you’re working on but this time we will convert our value into unicode escapes. A simple function to do that is as follows:

<?php
function jsEscape($input) {
	if(strlen($input) == 0) {
		return '';
	}
	$output = '';
	$input = preg_replace("/[^\\x01-\\x7F]/", "", $input);//remove any characters outside the range 0x01-0x7f
	$chars = str_split($input);
	for($i=0;$i<count($chars);$i++) {
		$char = $chars[$i];
		$output .= sprintf("\\u%04x", ord($char));//get the character code and convert to hex and prefix with \u00
	}
	return $output;	
}
?>

I’ve purposely designed this function with a few little optimisations missing. For example, instead of using unicode escapes you could use hex escapes since we restrict the range of allowed characters; alphanumeric characters are converted even though they could be output as their literal characters; and new lines/tabs are encoded too when you could use the shorter equivalent. Let’s add a line to output the shorter \t escape for a tab character instead of \u0009. Why would you want to do this? To reduce the characters sent down the wire.

Code to handle tab:

<?php
if(preg_match("/^\t$/", $char)) {
   $output .= '\\t';
   continue;
}
?>

This converts a tab specifically to “\t”. Notice how we separate input and output, and how by using continue we can skip the input character and override it with something more specific. The full code is now below for clarity.

<?php
function jsEscape($input) {
	if(strlen($input) == 0) {
		return '';
	}
	$output = '';
	$input = preg_replace("/[^\\x01-\\x7F]/", "", $input);
	$chars = str_split($input);
	for($i=0;$i<count($chars);$i++) {
		$char = $chars[$i];
		if(preg_match("/^\t$/", $char)) {
			$output .= '\\t';//don't unicode escape, use the shorter \t instead. Double escape remember!
			continue;//skip the rest of this iteration and move on to the next char
		}
		$output .= sprintf("\\u%04x", ord($char));
	}
	return $output;
}
?>	

Exercises for this code:
1. Can you handle characters outside the ascii range?
2. Convert any non dangerous character to their escaped or literal representation.
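As a starting point for the first exercise, here is one possible sketch ported to JavaScript. It is not the PHP function above: instead of stripping anything outside 0x01-0x7F it escapes every UTF-16 code unit, so characters outside the ASCII range survive (characters beyond the basic plane come out as two \u escapes, which JavaScript joins back together). Treat it as an illustration under those assumptions rather than a finished escaper.

```javascript
// Escape every UTF-16 code unit as \uXXXX so nothing in the value can
// terminate the string literal or the surrounding script block.
function jsEscape(input) {
	const str = String(input); // ensure we have a string
	let output = '';
	for (let i = 0; i < str.length; i++) {
		const hex = str.charCodeAt(i).toString(16);
		output += '\\u' + hex.padStart(4, '0');
	}
	return output;
}

console.log(jsEscape('a')); // \u0061
console.log(jsEscape("'")); // \u0027
```

The second exercise (leaving safe characters as literals) is still yours to do; whatever you let through, keep it a positive list.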

Script context in XHTML

In the previous section you might have wondered about XHTML when I stated “when not in XHTML/XML mode a script tag does not decode HTML entities”. In XHTML entities can be decoded even inside script blocks! Fortunately the code I provided for that section will handle that since unicode escapes are used. If you followed the exercises in that section did you make the “&” safe? That is something to think about when you are working on an XHTML page. In order for XHTML to be used in the browser you have to serve the pages with the correct XHTML header (the application/xhtml+xml content type). I recommend you don’t use the XHTML header.

Even though the previous examples still protect you against attack, I will show you a couple of vectors for XHTML sites.


<script>x='&#39;,alert(/This works in XHTML/)//';</script>


<script>x='&apos;,alert(/This also works in XHTML/)//';</script>

This would work in any XML-based format: entities can be used to break out of strings, and just a simple &lt/ will also do the trick. Don’t use XHTML, or if you do, unicode escape and don’t allow a literal “&”.

JavaScript events

Now you know what happens in XHTML, you might be interested to know it also happens in HTML attributes. Any HTML attribute, including events such as onclick, will automatically decode entities and use them as if they were literal characters. This is best demonstrated with a code example.


<div title="&gt;" id="x">test</div>
<script>
alert(document.getElementById('x').title);
</script>

As you can see, instead of the title attribute of the div element returning “&gt;” it returned “>” because it was automatically decoded. This whole process is one of the root causes of XSS when the developer doesn’t understand it. Let’s look at what happens with an onclick event and a variable “x”.


<a href="#" onclick="x='&#39;,alert(1),&#39;';">test</a>

Clicking on the link fires the alert because, like XHTML, the entities are decoded. When you are in the attribute context you need to do exactly the same as if you were in the XHTML context. Reusing your jsEscape function will fully protect you from XSS in attributes and variables like this.
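To see what the JavaScript engine actually receives from an attribute, here is a rough simulation of the decoding step. The decodeEntities function and its tiny NAMED table are my own simplification for illustration; a real browser knows far more named entities and has more lenient parsing rules.

```javascript
// A handful of named entities; browsers support many more.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode numeric and a few named entities in a single pass,
// roughly as a browser does before handing an attribute value to a script.
function decodeEntities(attr) {
	return attr.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (whole, body) {
		if (body.charAt(0) === '#') {
			const isHex = body.charAt(1).toLowerCase() === 'x';
			const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
			return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : whole;
		}
		return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : whole;
	});
}

// The onclick value from the example above, as the script engine sees it:
console.log(decodeEntities("x='&#39;,alert(1),&#39;';")); // x='',alert(1),'';
```

The encoded quotes come back as real quotes, which is why HTML encoding alone is not enough inside an event handler.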

innerHTML context

I hope you’ve grasped the previous concepts because now it’s going to get slightly confusing. If you’re in the script context and you are assigning a value which writes to the DOM in some way then the previous rules of escaping break down. Although you are escaping the value correctly for the context, the context shifts once it’s applied to innerHTML. As always here is an example:


<div id="x"></div>
<script>
//this is bad don't do this with innerHTML
document.getElementById('x').innerHTML='<?php echo jsEscape($_GET['x']);?>';
</script>

Even though the string is “\u003c\u0069\u006d\u0067\u0020\u0073\u0072…” and so on, it will still cause XSS because the innerHTML write will actually see the decoded characters from the JavaScript string. You need to escape for the HTML context as well as the script context, and if you add XHTML to that too then it gets really complicated. My advice is not to allow HTML when using the innerHTML context: whitelist and restrict your values and use innerText or textContent instead. If you really need HTML inside innerHTML, follow the tutorial at the end on how to write a basic HTML filter for innerHTML.
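The context shift is easy to check for yourself. The sketch below runs outside a browser, so it only shows the string handling; htmlEscape is a hypothetical helper I’ve added for illustration, and in a real page assigning to textContent is the simpler option.

```javascript
// Escape the five characters that matter in HTML text and attribute values.
function htmlEscape(value) {
	return String(value).replace(/[&<>"']/g, function (ch) {
		return '&#' + ch.charCodeAt(0) + ';';
	});
}

// What the script block contains after server side unicode escaping:
const fromServer = '\u003cimg src=1 onerror=alert(1)\u003e';

// The JavaScript parser has already turned the escapes back into characters.
console.log(fromServer === '<img src=1 onerror=alert(1)>'); // true

// Escaping for the HTML context at the point of the DOM write neutralises the markup.
console.log(htmlEscape(fromServer)); // &#60;img src=1 onerror=alert(1)&#62;

// In a browser, prefer: element.textContent = fromServer;
```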

CSS context

The same rules I’ve stated previously apply to CSS: a style block will not decode entities except when in XHTML/XML mode, and style attributes will decode HTML entities automatically. This makes protecting against injections in the CSS context hard if you don’t know what you’re doing. In addition to the regular entities, CSS also supports its own format of hex escapes. The format is a backslash followed by the hex number of the required character, optionally padded with zeros up to 6 digits in length (some vendors also supported zero padding beyond the 6 digit limit). To see how it looks let’s use Hackvertor again to build our string.

As you can see there are quite a few combinations you can use, and there are more. The CSS specification states that C style /* */ comments can be used, and any hex escape can include a space after the escape to avoid the next character continuing the hex escape. E.g. to CSS \61 \62 \63 is still “abc” regardless of the spaces. Hopefully you’ve read my blog for a while and know about using entities as well as hex escapes, or maybe you’ve just realised? Well yeah, it’s correct: you can use hex escapes, comments and HTML entities to construct a valid, executable CSS value.
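If you want to experiment with these rules, a rough normaliser like the following shows how different spellings collapse to the same value. The name decodeCssEscapes is hypothetical, and this is a deliberate simplification (comments are just deleted, then escapes decoded); it is not a CSS parser and should not be used as a filter.

```javascript
// Strip /* */ comments, then decode backslash hex escapes:
// 1 to 6 hex digits, optionally followed by one whitespace character.
function decodeCssEscapes(value) {
	return String(value)
		.replace(/\/\*[\s\S]*?\*\//g, '')
		.replace(/\\([0-9a-f]{1,6})[ \t\r\n\f]?/gi, function (whole, hex) {
			const code = parseInt(hex, 16);
			// Out of range or zero code points become the replacement character.
			return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : '\ufffd';
		});
}

console.log(decodeCssEscapes('\\61 \\62 \\63'));                // abc
console.log(decodeCssEscapes('\\000065 /*comment*/xpression')); // expression
```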

This leaves you with a nightmare scenario with regard to protecting CSS property values. IE7 and IE7 compat mode (on newer builds of IE) support expressions in CSS, which basically allow you to execute JavaScript code inside CSS values. A simplistic example is here:


<div style="xss:expression(open(alert(1)))"></div>

I use the open() function call to avoid the annoying client side DoS of continual alert popups. Anything inside “(” and “)” of the expression is a one line JavaScript call. In the example I use an invalid property called “xss” but it’s more likely to be “color” or “font-family”. Let’s take it up a notch and start to encode the CSS value and see what executes. I’ll just encode the “e” of expression to make it easier to follow.


Hex escape:
<div style="xss:\65xpression(open(alert(1)))"></div>
Hex escape with trailing space:
<div style="xss:\65 xpression(open(alert(1)))"></div>
Hex escape with trailing space and zero padded:
<div style="xss:\000065 xpression(open(alert(1)))"></div>
Hex escape with trailing space and zero padded and comment:
<div style="xss:\000065 /*comment*/xpression(open(alert(1)))"></div>
Hex escape with trailing space and zero padded and HTML encoded comment:
<div style="xss:\000065 &#x2f;&#x2a;comment*/xpression(open(alert(1)))"></div>
and finally hex escape with encoded backslash with trailing space and zero padded and HTML encoded comment:
<div style="xss:&#x5c;000065 &#x2f;&#x2a;comment*/xpression(open(alert(1)))"></div>

I’m sure you’ll agree that’s hard to follow, and there are literally millions of combinations. Unfortunately you can’t simply hex escape the value and expect it to be safe from injection since, as you’ve seen, even encoded CSS escapes can be used as vectors. The option you’re left with from a defensive point of view is to whitelist every CSS property value. Luckily I’ve already done that with CSS Reg, and Norman Hippert kindly converted it to PHP.
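To give a feel for what whitelisting a single property value can look like, here is a small sketch for a colour value. It is not how CSS Reg works; the function safeColor, the NAMED_COLORS list and the length limit are assumptions made up for this example.

```javascript
// A deliberately short list of named colours; extend it to suit your site.
const NAMED_COLORS = ['black', 'white', 'red', 'green', 'blue'];

// Returns a normalised colour value, or null when the input is not on the whitelist.
function safeColor(value) {
	if (typeof value !== 'string' || value.length > 20) {
		return null;
	}
	const v = value.toLowerCase();
	// #rgb or #rrggbb only: no escapes, comments, functions or urls can match.
	if (/^#(?:[0-9a-f]{3}|[0-9a-f]{6})$/.test(v)) {
		return v;
	}
	if (NAMED_COLORS.indexOf(v) !== -1) {
		return v;
	}
	return null;
}

console.log(safeColor('#FF0000'));                   // #ff0000
console.log(safeColor('\\65xpression(alert(1))'));   // null
```

Because only complete, known-good values match, none of the encoded variations above ever needs to be decoded or understood.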

Serving your pages

Every single page that’s available on the web for your site should include a doctype and a UTF-8 charset in a meta tag. Now that we have a shortened HTML5 header we can use the following:


<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
... your content ....

This is to prevent charset attacks and E4X vectors and force your document into standards mode on IE which is also important. I also recommend you enforce standards mode by following this blog post from Dave Ross.

Positive matching and filtering HTML

The last section of this long blog post will be how to write your own filter. I don’t think I’m the world’s greatest programmer but I think I’ve worked out a cool technique for filtering content using little code: by only matching the content you want, you won’t get anything bad. I hope you take the basis of this code and improve it and learn from it. This code is intentionally incomplete; I wrote a more complete HTML filter called HTMLReg which you can examine if you want to improve this basic filter. But I recommend you try to improve the filter yourself and learn to break it too.

<script>
function yourFilter(input) {	
	var output = '', pos = 0, chr; //declare chr so it isn't an implicit global
	input = input + ''; //ensure we have a string
	function isNewline(chr) {
		return /^[\f\n\r\u000b\u2028\u2029]$/.test(chr);
	}
	function outputSpace(chr) {
		if(!/^\s$/.test(output.slice(-1)) && !isNewline(chr)) { //skip new lines and multiple spaces
			output += chr;
		}
	}	
	function outputChars(chrs) {
		output += chrs;
	}
	function error(m) {
		throw {
			description: m
		};
	}
	function parseHTML() {
		var allowedTags = /^<\/?(?:b|i|strong|s)>/,
			match;
			if(allowedTags.test(input.substr(pos))) {
				match = allowedTags.exec(input.substr(pos));
				if(match === null) {
					error("Invalid tag");
				} else {
					pos += match[0].length;
					outputChars(match[0]);
				}
				
			} else {
				outputChars('&lt;');
				pos++;
			}
	}
	function parseEntities() {
		var allowedEntities = /^&(?:amp|gt|lt);/,
			match;
			if(allowedEntities.test(input.substr(pos))) {
				match = allowedEntities.exec(input.substr(pos));
				if(match === null) {
					error("Invalid entity");
				} else {
					pos += match[0].length;
					outputChars(match[0]);
				}
				
			} else {
				outputChars('&amp;');
				pos++;
			}
	}
	
	while(pos < input.length) {
		chr = input.charAt(pos);
		if(chr === '<') {
			parseHTML();
		} else if(chr === '&') {
			parseEntities();
		} else if(/^\s$/.test(chr)) {
			outputSpace(chr);
			pos++;
		} else if(chr === '>') {
			outputChars('&gt;');
			pos++;
		} else if(chr === '"') {
			outputChars('&quot;');
			pos++;
		} else if(chr === "'") {
			outputChars('&#39;');
			pos++;
		} else if(/^[\w]$/.test(chr)) {
			outputChars(chr);
			pos++;
		} else {
			pos++;//move to the next character but don't output it
		}
	}	
	return output;
}
</script>

The code above separates input and output and shows how to move along the input and produce a different output without losing track of the position. New lines and runs of more than one space are dropped from the HTML; this is to demonstrate how to use the output to prevent repeated characters, and you can and should change the behaviour to suit your needs. The code is written in JavaScript but can be easily ported to your language.
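When you start trying to break a filter like this, it helps to have a check you can run over its output. The sketch below is one hypothetical way to do that: findDisallowedMarkup is my own name, and ALLOWED_TAG simply repeats the tag whitelist used in the filter above. It reports every “<” in a string that does not begin one of the allowed tags.

```javascript
// Tags the example filter is meant to let through.
const ALLOWED_TAG = /^<\/?(?:b|i|strong|s)>/;

// Returns the positions of every "<" that does not start an allowed tag.
function findDisallowedMarkup(output) {
	const bad = [];
	for (let pos = 0; pos < output.length; pos++) {
		if (output.charAt(pos) === '<' && !ALLOWED_TAG.test(output.slice(pos))) {
			bad.push(pos);
		}
	}
	return bad;
}

// Clean output: nothing is flagged.
console.log(findDisallowedMarkup('<b>ok</b> &lt;img&gt;').length); // 0
// A stray tag: position 9 is flagged.
console.log(findDisallowedMarkup('<b>ok</b><img src=1>'));
```

Feed it the result of your filter for each vector you try; an empty array means no unexpected tag made it through.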

Exercises
1. Can you handle attributes safely?
2. Can you convert new lines into <br> where appropriate?

20 Responses to “Protecting against XSS”

  1. Wladimir Palant writes:

    Yes, I do think that frameworks are the answer. You have to choose a template engine that will automatically encode content properly (I use Jinja2 for that). And you should *never* insert data into a non-HTML context (so no script tags and especially no onfoo attributes). If you need to provide data to a script, that data is simple enough to read from a regular HTML attribute. I think that this concept is easy enough that most developers can get it right (unlike the “manual” escaping procedure that pretty much nobody gets right).

  2. kuza55 writes:

    Like Wladimir, I also think frameworks are the answer.

    I don’t have any theoretical reasons for this, but in the 2 years I spent doing webapp pentests, I found that ASP.NET apps had consistently fewer XSS issues than the rest of the field (Mostly Java, with some PHP and Lotus Domino sprinkled in). Most ASP.NET apps did have problems when they wanted to insert user content into javascript, however that was simply because the framework was written at a time when this was common, and so there was no good easy way to do so safely.

    I disagree with Wladimir that you must adhere to some set of arbitrary rules for your templates though, I do not think creating a templating engine that can automatically handle the cases you mention above is beyond us.

  3. a writes:

    You should never try to clean up data – if the data passes after you’ve cleaned it, and the attacker finds a hole in your cleanup routine, you still have exploitable code.

    Validate and reject if the values don’t match.

    Also, the idea behind a framework is that other people have already gone through these issues and solved the same problem you’re solving. Also, most frameworks already have CSRF built in which will cut down on many of the exploit attempts.

    str_replace takes an array and can remove a ton of your code, but, again, trying to clean up input is going to create problems down the line. Write a decent regexp, if you see any bad signs, reject it.

  4. Gareth Heyes writes:

    @a

    The basis of my filter design should avoid holes in the cleanup process. Since the separation of input from output and only “positive matching” the input results in any bad data remaining in the input. The most likely mistake is losing the chr position but with plenty of tests it can be avoided. I know some people think frameworks are a good idea and that’s fine I don’t really want to debate that just wanted to put my views across.

  5. Jeremy Long writes:

    You discussed many of the interesting contexts – but didn’t discuss the wonderful <a href=”javascript:…”. This will actually HTML Decode the data in the href prior to determining the protocol, once the browser determines it is the javascript protocol it will URL decode the data prior to handing it to the JavaScript engine.

    Injection occurs when data is passed to an interpreter, understanding where these points are within an HTML document are critical to understanding how to defend against XSS. But then again, I’m preaching to the choir here.

    Thanks for all your work.

  6. Gareth Heyes writes:

    @Jeremy

    There are actually a few things missing but I plan to update this blog post over time, I actually don’t think the javascript protocol is hard to protect against myself so I decided to concentrate on the other main areas. Thanks for reading.

  7. kingthorin writes:

    I’m sure I’ve just missed some nuance as I’ve only skimmed the article however wouldn’t it be kinda bass ackwards to do that last part in JavaScript? If I’m a user and you try to filter me using JS code I’d be glad to simply remove such filters from your pages using Firebug etc.

  8. Gareth Heyes writes:

    @kingthorin

    It’s my preferred language, I think it’s written in such a way that it’s easily portable. It was also intended to be used as an example to handle the innerHTML context.

  9. kingthorin writes:

    Fair enough.

  10. Stefano writes:

    About the framework advice, considering you’re basically right, it depends on how the framework has been developed. I mean we face new frameworks each days, and we face new bug in frameworks each days as well. What Gareth is trying to give is about basic approaches to data output encoding. No more, no less. Anyone trying to create a new framework OR a new, custom encoding from scratch should think about identifying contexts and escape tainted outputs according to them.

    just my 2.c

  11. db writes:

    I agree with kuza & wald — frameworks are the answer. Ror3, django, tornado – just to name a few frameworks that escape in templates by default. Also these frameworks come with other protections (csrf, …)

  12. Meketrefe writes:

    This is a perfect example of a totally misleading post.

    Sorry Gareth, but you don’t have a f**n’ clue of what you’re talking about.

    Go and buy yourself some security books on Amazon please!

  13. Gareth Heyes writes:

    @Meketrefe

    Which do you recommend? :) and why haha

  14. Rory writes:

    Interesting post, and as ever lots of good technical content to think about. On the frameworks side of things, I guess I’ll chuck in my 0.02 of local currency.

    I’d say that you’re both right and wrong to say that developers should handle this stuff themselves.

    If a developer wants a really high level of security (relative to other sites/applications) then relying on framework code to handle something as important as input validation/encoding is a bad idea, as there will be bugs in that code and without a comprehensive audit they’re not likely to be found.

    That said, for most development projects I’d definitely recommend using framework level protections for this kind of thing. With larger teams of developers it becomes pretty hard for everyone to reach the same high level of understanding of what is a pretty complex topic, even if the devs have the time to do it, so realistically relying on framework protections is a fast way to get a decent percentage of the way.

  15. James, Securatek writes:

    Informative post, and as with most of the comments above – we believe frameworks are the answer, when properly implemented and adhered to.

    One thing that is perhaps worth a mention is boundary validation. One thing we see time and time again is Javascript validation and sanitization being used to protect against cross-site scripting, whilst server-side code is neglected thus leaving the application vulnerable to XSS.

    Boundary validation is the process by which user-supplied input is validated (preferably against a whitelist filter) as it passes trust boundaries (i.e. When it arrives at the server, when it is extracted from the database). Developers frequently overlook the fact that malicious users can trivially bypass Javascript filtering to supply the application with malicious input.

    James @Securatek

  16. test writes:

    The article is bullshit. Period !

  17. Gareth Heyes writes:

    @test

    Thanks for your well reasoned arguments.

  18. Raphael Geissert writes:

    Gareth, just for correctness, most uses of preg_* and $, should use the D modifier.
    I’ve seen some systems break because of its lack, but none where it was a vulnerability.

    E.g.
    preg_match("/^foo$/", $in);
    will match "foo" and "foo\n", while
    preg_match("/^foo$/D", $in);
    will only match the former

  19. Gareth Heyes writes:

    @Raphael

    Great tip thanks!

  20. Pedro Fortuna writes:

    Nice read. Waiting for follow ups :-)
    Regarding the framework discussion, I kind of agree with @rory.
    Cheers