URL Decoding

To complement the code of my URL Encoding post, I've now developed a URIDecode routine.

It attempts to decode URIs that were percent-encoded according to RFC 3986. It also allows for some malformed percent-encoded URIs, i.e. those that contain characters outside the RFC's "unreserved" character set.

Here's the code. An explanation follows.

function URIDecode(const Str: string): string;
// NOTE: cPercent (the '%' character) and rsEscapeError (the error message) are
// declared elsewhere in the unit and are not shown here.
var
  SrcUTF8: UTF8String;  // input string as UTF-8
  SrcIdx: Integer;      // index into source UTF-8 string
  ResUTF8: UTF8String;  // output string as UTF-8
  ResIdx: Integer;      // index into result UTF-8 string
  Hex: string;          // hex component of % encoding
  ChValue: Integer;     // octet value from a % encoding
begin
  // Convert input string to UTF-8
  SrcUTF8 := UTF8Encode(Str);
  // Size the decoded UTF-8 string to its upper bound: decoding never yields
  // more octets than the source contains, so writes stay in bounds even when
  // the input is malformed
  SetLength(ResUTF8, Length(SrcUTF8));
  SrcIdx := 1;
  ResIdx := 1;
  // Process each octet of the source string
  while SrcIdx <= Length(SrcUTF8) do
  begin
    if SrcUTF8[SrcIdx] = cPercent then
    begin
      // % encoding: decode following two hex chars into the octet they encode
      if Length(SrcUTF8) < SrcIdx + 2 then
        raise EConvertError.Create(rsEscapeError);  // malformed: too short
      Hex := '$' + string(SrcUTF8[SrcIdx + 1] + SrcUTF8[SrcIdx + 2]);
      if not TryStrToInt(Hex, ChValue) then
        raise EConvertError.Create(rsEscapeError);  // malformed: not valid hex
      ResUTF8[ResIdx] := AnsiChar(ChValue);
      Inc(ResIdx);
      Inc(SrcIdx, 3);
    end
    else
    begin
      // plain char or octet of a multi-byte UTF-8 sequence: copy unchanged
      ResUTF8[ResIdx] := SrcUTF8[SrcIdx];
      Inc(ResIdx);
      Inc(SrcIdx);
    end;
  end;
  // Trim the result to the number of octets actually written
  SetLength(ResUTF8, ResIdx - 1);
  // Convert back to native string type for result
  Result := UTF8ToString(ResUTF8);
end;

Internally, URIDecode operates on UTF-8 strings for both input and output.

This lets us deal easily with any multi-byte characters in the input. As already noted, there shouldn't be any such characters: everything in a correctly encoded URI should map onto the unreserved characters, which are a subset of the ASCII character set. However, we need to allow for badly encoded URIs that may contain characters outside this expected set.

UTF-8 also lets us perform an easy test for '%' characters in the input. Since '%' is an ASCII character, it can never occur as any octet of a multi-byte UTF-8 sequence, so we can simply test for the character itself without worrying about whether it is part of a multi-byte character.
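To illustrate that point, here's a minimal sketch of a console program that dumps the octets of a string containing a raw multi-byte character followed by a percent escape. IsContinuationByte is a hypothetical helper written for this example only; it isn't part of URIDecode.

```pascal
program PercentScanDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True if the octet has the form 10xxxxxx, i.e. it is a
// UTF-8 continuation byte in the range $80..$BF
function IsContinuationByte(const B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;

var
  S: UTF8String;  // test string as UTF-8
  Idx: Integer;   // loops through all octets of S
begin
  // U+20AC (euro sign) encodes as $E2 $82 $AC; '%' is always the octet $25
  S := UTF8Encode(#$20AC'%41');
  for Idx := 1 to Length(S) do
    Writeln(
      Format(
        'octet %d = $%.2x  continuation: %s',
        [Idx, Ord(S[Idx]), BoolToStr(IsContinuationByte(Ord(S[Idx])), True)]
      )
    );
end.
```

Every octet of the euro sign has its top bit set, while '%' and the hex digits that follow it are all below $80, so a plain octet-by-octet comparison can't mistake part of a multi-byte character for a '%'.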

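We use UTF-8 for the output string since UTF-8 should have been used to encode the URI in the first place. That means decoded percent-encoded octets may be lead or continuation bytes of multi-byte UTF-8 sequences, and they only make sense when interpreted together as UTF-8. Decoding each octet directly into any other string type would give erroneous results.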
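Here's a small sketch of why that matters, using the three octets you'd get from decoding '%E2%82%AC'. OctetsToString is a hypothetical helper for this example, not something from my unit, and the program assumes a Unicode version of Delphi where string is UTF-16.

```pascal
program OctetsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: gathers octets into a UTF-8 string, then converts the
// whole sequence to the native string type
function OctetsToString(const Octets: array of Byte): string;
var
  Buf: UTF8String;  // octets gathered as UTF-8
  Idx: Integer;     // loops through Octets
begin
  SetLength(Buf, Length(Octets));
  for Idx := 0 to High(Octets) do
    Buf[Idx + 1] := AnsiChar(Octets[Idx]);
  Result := UTF8ToString(Buf);
end;

var
  Good, Bad: string;
begin
  // Octets interpreted together as UTF-8: a single character, U+20AC
  Good := OctetsToString([$E2, $82, $AC]);
  Writeln(Length(Good), ' ', IntToHex(Ord(Good[1]), 4));  // prints: 1 20AC
  // Each octet mapped straight to a UTF-16 Char: three unrelated characters
  Bad := string(Char($E2)) + Char($82) + Char($AC);
  Writeln(Length(Bad));  // prints: 3
end.
```

The first approach is the one URIDecode takes: octets go into a UTF8String and are only converted once the whole string has been decoded.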

The interim UTF-8 result is converted into the native string type before returning.

URIDecode has been added to UURIEncode.pas in my Delphi Doodlings repo.
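A minimal usage sketch follows. It assumes UURIEncode.pas is on the search path and exposes URIDecode in its interface section, and that the function raises EConvertError on malformed input as in the listing above.

```pascal
program URIDecodeUsage;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  UURIEncode;  // assumption: this unit exposes URIDecode

begin
  // '%C3%A9' is the UTF-8 encoding of U+00E9 and '%20' is a space
  Writeln(URIDecode('caf%C3%A9%20au%20lait') = 'caf'#$00E9' au lait');
  try
    URIDecode('%zz');  // malformed: 'zz' is not valid hex
  except
    on E: EConvertError do
      Writeln('Malformed URI: ', E.Message);
  end;
end.
```

If the assumptions hold, the first Writeln should report TRUE and the second call should end up in the exception handler.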
