18 February 2010

URL Encoding

I've been reviewing the URI encoding code from the Code Snippets Database and realised that it doesn't comply with RFC 3986.

So here's my first attempt at some compliant code.

According to the RFC:

"the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded."

So we define the URIEncode function to operate on the UTF8String type. It's easy to encode UnicodeString and AnsiString values as UTF-8 using the System unit's overloaded UTF8Encode functions. You could overload URIEncode to do the UTF-8 conversion, but I haven't done so here.

There's a nice shortcut we can take when URL encoding. Remember that only characters outside the unreserved set are percent-encoded. The set of unreserved characters is:

const
  cURLUnreservedChars = [
    'A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'
  ];

All other characters are percent-encoded. But what about the lead and continuation bytes of multi-byte UTF-8 sequences? Well, by definition these all have values >= $80, while all the unreserved characters have ordinal values < $80. This means that no legal lead or continuation byte can be mistaken for an unreserved character.

Therefore any byte in the UTF-8 string can be treated the same regardless of whether it's a lead or continuation byte: i.e. we percent-encode it if it's not an unreserved character.
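
To make the byte-range argument concrete, here's a small sketch that encodes two non-ASCII characters as UTF-8 and checks that every resulting octet is >= $80. The function name AllOctetsHighBitSet is made up for this illustration, and the program assumes a Unicode-enabled Delphi (2009 or later).

```pascal
program UTF8ByteRanges;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True if every octet in S has a value >= $80,
// i.e. none of them could be confused with an unreserved (ASCII) character
function AllOctetsHighBitSet(const S: UTF8String): Boolean;
var
  Ch: AnsiChar;
begin
  Result := True;
  for Ch in S do
    if Ord(Ch) < $80 then
      Exit(False);
end;

var
  W: UnicodeString;
  EAcute, Euro: UTF8String;
begin
  W := WideChar($00E9);       // U+00E9, e with acute accent
  EAcute := UTF8Encode(W);    // two octets: $C3 $A9
  W := WideChar($20AC);       // U+20AC, euro sign
  Euro := UTF8Encode(W);      // three octets: $E2 $82 $AC
  Writeln(AllOctetsHighBitSet(EAcute));
  Writeln(AllOctetsHighBitSet(Euro));
  Writeln(AllOctetsHighBitSet('A'));  // plain ASCII, so the check fails
end.
```

Both multi-byte sequences pass the check, while the plain ASCII 'A' does not, which is why a single byte-by-byte test against the unreserved set is enough.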

Here's the function:

// Assumes Defined(UNICODE)
function URIEncode(const S: UTF8String): string; overload;
var
  Ch: AnsiChar;
begin
  // Just scan the string an octet at a time looking for chars to encode
  Result := '';
  for Ch in S do
    if CharInSet(Ch, cURLUnreservedChars) then
      Result := Result + WideChar(Ch)
    else
      Result := Result + '%' + IntToHex(Ord(Ch), 2);
end;
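
As a rough sketch of the UTF-8 conversion step mentioned earlier, the following console program repeats the function above and wraps it in a helper that accepts a UnicodeString. The name URIEncodeUnicode is invented for this example (I've used a separate name rather than an overload to sidestep any overload-resolution questions), and it assumes a Unicode-enabled Delphi (2009 or later).

```pascal
program URIEncodeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Unreserved characters per RFC 3986
  cURLUnreservedChars = [
    'A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'
  ];

// Same routine as above: percent-encodes every octet of a UTF-8 string
// that is not an unreserved character
function URIEncode(const S: UTF8String): string;
var
  Ch: AnsiChar;
begin
  Result := '';
  for Ch in S do
    if CharInSet(Ch, cURLUnreservedChars) then
      Result := Result + WideChar(Ch)
    else
      Result := Result + '%' + IntToHex(Ord(Ch), 2);
end;

// Hypothetical helper (not in the original code): converts a
// UnicodeString to UTF-8 first, then encodes the resulting octets
function URIEncodeUnicode(const S: UnicodeString): string;
var
  UTF8: UTF8String;
begin
  UTF8 := UTF8Encode(S);
  Result := URIEncode(UTF8);
end;

begin
  Writeln(URIEncodeUnicode('a b~c'));                 // a%20b~c
  Writeln(URIEncodeUnicode('caf' + WideChar($00E9))); // caf%C3%A9
end.
```

The space is not in the unreserved set, so it becomes %20, while the two UTF-8 octets of U+00E9 ($C3 and $A9) are each percent-encoded in turn.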

This and other similar routines are available (and may even be evolving) in my Delphi Doodlings repo: see UURIEncode.pas.
