10 March 2010

URL Decoding

To complement the code of my URL Encoding post, I've now developed a URIDecode routine.

It attempts to decode URIs that were percent-encoded according to RFC 3986. It also allows for some malformed percent-encoded URIs, i.e. those that contain characters outside the RFC's "unreserved" character set.

Here's the code. An explanation follows.

// Requires SysUtils for EConvertError and TryStrToInt
function URIDecode(const Str: string): string;
const
  // Assumed definition: the original declares cPercent elsewhere in the unit
  cPercent = AnsiChar('%');
resourcestring
  // Assumed wording: the original message text is not shown in this post
  rsEscapeError = 'String to be decoded contains invalid % escape sequence';

  // Counts number of '%' characters in a UTF8 string
  function CountPercent(const S: UTF8String): Integer;
  var
    Idx: Integer; // loops thru all octets of S
  begin
    Result := 0;
    for Idx := 1 to Length(S) do
      if S[Idx] = cPercent then
        Inc(Result);
  end;

var
  SrcUTF8: UTF8String;  // input string as UTF-8
  SrcIdx: Integer;      // index into source UTF-8 string
  ResUTF8: UTF8String;  // output string as UTF-8
  ResIdx: Integer;      // index into result UTF-8 string
  Hex: string;          // hex component of % encoding
  ChValue: Integer;     // character ordinal value from a % encoding
begin
  // Convert input string to UTF-8
  SrcUTF8 := UTF8Encode(Str);
  // Size the decoded UTF-8 string: each three-octet %XX sequence decodes to
  // a single octet, all other octets are copied unchanged
  SetLength(ResUTF8, Length(SrcUTF8) - 2 * CountPercent(SrcUTF8));
  SrcIdx := 1;
  ResIdx := 1;
  // Process each octet of the source string
  while SrcIdx <= Length(SrcUTF8) do
  begin
    if SrcUTF8[SrcIdx] = cPercent then
    begin
      // % encoding: decode following two hex chars into required code point
      if Length(SrcUTF8) < SrcIdx + 2 then
        raise EConvertError.Create(rsEscapeError);  // malformed: too short
      Hex := '$' + string(SrcUTF8[SrcIdx + 1] + SrcUTF8[SrcIdx + 2]);
      if not TryStrToInt(Hex, ChValue) then
        raise EConvertError.Create(rsEscapeError);  // malformed: not valid hex
      ResUTF8[ResIdx] := AnsiChar(ChValue);
      Inc(ResIdx);
      Inc(SrcIdx, 3);
    end
    else
    begin
      // plain char or UTF-8 continuation character: copy unchanged
      ResUTF8[ResIdx] := SrcUTF8[SrcIdx];
      Inc(ResIdx);
      Inc(SrcIdx);
    end;
  end;
  // Convert back to native string type for result
  Result := UTF8ToString(ResUTF8);
end;

Internally, URIDecode operates on UTF-8 strings for both input and output.

This lets us deal easily with any multi-byte characters in the input. As already noted, there shouldn't be any such characters: everything should map onto the unreserved characters, which form a subset of the ASCII character set. However, we need to allow for badly encoded URIs that may contain characters outside this expected set.
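For reference, RFC 3986 defines the unreserved set as the ASCII letters, the digits and the four characters '-', '.', '_' and '~'. Here is a minimal sketch of a membership test. The IsUnreserved name is made up for this illustration and is not part of UURIEncode.pas.

```pascal
program UnreservedCheck;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

// Hypothetical helper: True if Ch belongs to RFC 3986's "unreserved" set
function IsUnreserved(const Ch: AnsiChar): Boolean;
begin
  Result := Ch in ['A'..'Z', 'a'..'z', '0'..'9', '-', '.', '_', '~'];
end;

begin
  Assert(IsUnreserved('a'));
  Assert(IsUnreserved('~'));
  Assert(not IsUnreserved('%'));            // escape introducer, not unreserved
  Assert(not IsUnreserved(' '));            // must be percent-encoded
  Assert(not IsUnreserved(AnsiChar($C3)));  // an octet from a multi-byte UTF-8 sequence
  WriteLn('Unreserved checks passed');
end.
```

Any octet that fails this test, other than '%' itself, is one that URIDecode simply copies through unchanged.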

UTF-8 also lets us perform an easy test for '%' characters in the input. Every octet of a multi-byte UTF-8 sequence has its high bit set, so '%' (octet $25) can never occur as a lead or continuation byte. We can therefore simply test for the actual character without worrying about whether it is part of a multi-byte character.
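To illustrate the point, here is a small sketch that encodes two non-ASCII characters as UTF-8 and checks that none of the resulting octets could be mistaken for '%'. The program and variable names are invented for the example.

```pascal
program PercentNeverInMultiByte;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

var
  Src: string;      // sample text containing only non-ASCII characters
  U: UTF8String;    // the same text as UTF-8 octets
  Idx: Integer;
begin
  // U+00E9 and U+20AC, written as code points so that the encoding of this
  // source file does not matter
  Src := #$00E9#$20AC;
  U := UTF8Encode(Src);
  Assert(Length(U) = 5);  // 2 octets for U+00E9 plus 3 octets for U+20AC
  // Every octet of a multi-byte sequence is $80 or above ...
  for Idx := 1 to Length(U) do
    Assert(Ord(U[Idx]) >= $80);
  // ... whereas '%' is the single ASCII octet $25
  Assert(Ord('%') = $25);
  WriteLn('No octet of a multi-byte sequence can equal %');
end.
```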

We use UTF-8 for the output string because UTF-8 should have been used to encode the URI in the first place, so percent-encoded octets may map onto the lead or continuation bytes of multi-byte UTF-8 sequences. Using any other string type would give erroneous results.
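The following sketch shows what goes wrong otherwise. It takes the two octets that '%C3%A9' decodes to and compares interpreting them as UTF-8 with naively turning each octet into its own character. The variable names are invented for the example.

```pascal
program DecodeOctetsAsUTF8;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

var
  Raw: UTF8String;   // octets produced by decoding '%C3%A9'
  AsUTF8: string;    // octets interpreted as one UTF-8 sequence
  PerOctet: string;  // each octet mapped to a separate character
begin
  SetLength(Raw, 2);
  Raw[1] := AnsiChar($C3);
  Raw[2] := AnsiChar($A9);

  // Correct: the pair of octets is a single code point, U+00E9
  AsUTF8 := UTF8ToString(Raw);
  Assert(Length(AsUTF8) = 1);
  Assert(Ord(AsUTF8[1]) = $E9);

  // Wrong: one character per octet gives U+00C3 followed by U+00A9
  PerOctet := Char(Ord(Raw[1]));
  PerOctet := PerOctet + Char(Ord(Raw[2]));
  Assert(Length(PerOctet) = 2);

  WriteLn('UTF-8 interpretation yields a single character');
end.
```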

The interim UTF-8 result is converted into the native string type before returning.
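As a rough usage sketch, the program below exercises a few inputs. It assumes the UURIEncode unit is on the search path and exports URIDecode; if not, paste the function into the program instead. The sample strings are my own.

```pascal
program URIDecodeDemo;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  SysUtils,
  UURIEncode;  // assumed to export URIDecode

begin
  // A simple escape: %20 is a space
  Assert(URIDecode('Hello%20World') = 'Hello World');
  // Two escaped octets that together form one UTF-8 character, U+00E9
  Assert(URIDecode('%C3%A9') = #$00E9);
  // An escape that is not valid hex should be rejected
  try
    URIDecode('%ZZ');
    Assert(False, 'Expected EConvertError');
  except
    on E: EConvertError do
      WriteLn('Rejected malformed input: ', E.Message);
  end;
end.
```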

URIDecode has been added to UURIEncode.pas in my Delphi Doodlings repo. View the code.


Rafael Rossi said...

Hi, thank you so much :)

Klieber frederic said...

Hi, I've tried it for decoding :
Works perfectly !
Thank you !