To complement the code of my URL Encoding post, I've now developed a URIDecode routine.
It attempts to decode URIs that were percent-encoded according to RFC 3986. It also allows for some malformed percent-encoded URIs, i.e. those that contain characters outside the RFC's "unreserved" character set.
Here's the code. An explanation follows.
function URIDecode(const Str: string): string;

  // Counts number of '%' characters in a UTF8 string
  function CountPercent(const S: UTF8String): Integer;
  var
    Idx: Integer; // loops thru all octets of S
  begin
    Result := 0;
    for Idx := 1 to Length(S) do
      if S[Idx] = cPercent then
        Inc(Result);
  end;

var
  SrcUTF8: UTF8String;  // input string as UTF-8
  SrcIdx: Integer;      // index into source UTF-8 string
  ResUTF8: UTF8String;  // output string as UTF-8
  ResIdx: Integer;      // index into result UTF-8 string
  Hex: string;          // hex component of % encoding
  ChValue: Integer;     // character ordinal value from a % encoding
begin
  // Convert input string to UTF-8
  SrcUTF8 := UTF8Encode(Str);
  // Size the decoded UTF-8 string
  SetLength(ResUTF8, Length(SrcUTF8) - 2 * CountPercent(SrcUTF8));
  SrcIdx := 1;
  ResIdx := 1;
  // Process each octet of the source string
  while SrcIdx <= Length(SrcUTF8) do
  begin
    if SrcUTF8[SrcIdx] = cPercent then
    begin
      // % encoding: decode following two hex chars into required code point
      if Length(SrcUTF8) < SrcIdx + 2 then
        raise EConvertError.Create(rsEscapeError);  // malformed: too short
      Hex := '$' + string(SrcUTF8[SrcIdx + 1] + SrcUTF8[SrcIdx + 2]);
      if not TryStrToInt(Hex, ChValue) then
        raise EConvertError.Create(rsEscapeError);  // malformed: not valid hex
      ResUTF8[ResIdx] := AnsiChar(ChValue);
      Inc(ResIdx);
      Inc(SrcIdx, 3);
    end
    else
    begin
      // plain char or UTF-8 continuation character: copy unchanged
      ResUTF8[ResIdx] := SrcUTF8[SrcIdx];
      Inc(ResIdx);
      Inc(SrcIdx);
    end;
  end;
  // Convert back to native string type for result
  Result := UTF8ToString(ResUTF8);
end;
Internally, URIDecode operates on UTF-8 strings for both input and output.
This lets us deal easily with any multi-byte characters in the input. As already noted, there shouldn't be any such characters: everything in a correctly encoded URI should map onto the unreserved characters, which form a subset of the ASCII character set. However, we need to allow for badly encoded URIs that may contain characters outside this expected set.
UTF-8 also lets us perform an easy test for '%' characters in the input. Since '%' can never occur as a lead or continuation byte of a multi-byte UTF-8 character, we can simply test for the actual character without worrying about whether it is part of a multi-byte character.
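As a quick check of that claim, here is a minimal sketch that counts '%' in a string both before and after conversion to UTF-8. The program and the two helper functions (CountPercentChars and CountPercentOctets) are hypothetical names made up for this illustration and are not part of UURIEncode.pas. The sample string mixes ASCII with two non-ASCII characters, U+00E9 and U+20AC, written as character codes so the source file encoding doesn't matter.

```pascal
program PercentOctetDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed if compiling with Free Pascal
{$APPTYPE CONSOLE}

// Counts '%' characters in the native (UTF-16) string
function CountPercentChars(const S: UnicodeString): Integer;
var
  Idx: Integer;
begin
  Result := 0;
  for Idx := 1 to Length(S) do
    if S[Idx] = '%' then
      Inc(Result);
end;

// Counts '%' octets ($25) in the UTF-8 encoding of S
function CountPercentOctets(const S: UnicodeString): Integer;
var
  U: UTF8String;
  Idx: Integer;
begin
  Result := 0;
  U := UTF8Encode(S);
  for Idx := 1 to Length(U) do
    if U[Idx] = '%' then
      Inc(Result);
end;

var
  Sample: UnicodeString;
begin
  // U+00E9 encodes as $C3 $A9 and U+20AC as $E2 $82 $AC: no octet is $25
  Sample := 'caf'#$00E9'%20'#$20AC'%';
  WriteLn(CountPercentChars(Sample));   // prints 2
  WriteLn(CountPercentOctets(Sample));  // prints 2
end.
```

Both counts agree because every octet of a multi-byte UTF-8 sequence has its high bit set, so none can equal $25.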
We use UTF-8 for the output string since UTF-8 should have been used to encode the URI in the first place. Percent-encoded octets may therefore map onto the bytes of multi-byte UTF-8 sequences, including continuation bytes, and using any other string type would give erroneous results.
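To illustrate why the output type matters, the following sketch takes the two octets that '%C3%A9' decodes to and interprets them in two ways. OctetsAsUTF8 and OctetsAsCodePoints are hypothetical helpers written just for this comparison; only the first mirrors what URIDecode does when it calls UTF8ToString on its interim result.

```pascal
program DecodedOctetsDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed if compiling with Free Pascal
{$APPTYPE CONSOLE}

// Interprets the octets as one UTF-8 encoded string
function OctetsAsUTF8(const Octets: UTF8String): UnicodeString;
begin
  Result := UTF8ToString(Octets);
end;

// Wrongly treats every octet as a character code in its own right
function OctetsAsCodePoints(const Octets: UTF8String): UnicodeString;
var
  Idx: Integer;
begin
  SetLength(Result, Length(Octets));
  for Idx := 1 to Length(Octets) do
    Result[Idx] := WideChar(Ord(Octets[Idx]));
end;

var
  Octets: UTF8String;
  Good, Bad: UnicodeString;
begin
  // the octets produced by decoding '%C3%A9'
  SetLength(Octets, 2);
  Octets[1] := AnsiChar($C3);
  Octets[2] := AnsiChar($A9);

  Good := OctetsAsUTF8(Octets);
  Bad := OctetsAsCodePoints(Octets);

  WriteLn(Length(Good), ' ', Ord(Good[1]));  // prints 1 233 (U+00E9)
  WriteLn(Length(Bad));                      // prints 2 (U+00C3, U+00A9)
end.
```

The UTF-8 route yields the single intended character, while the per-octet route yields two unrelated characters.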
The interim UTF-8 result is converted into the native string type before returning.
URIDecode has been added to UURIEncode.pas in my Delphi Doodlings repo. View the code.