25 March 2010

CodeSnip problems, problems, problems

Arrgh!

I've been having a lot of problems with the CodeSnip program's database update code - it's been working for some and not for others. All this has happened since I converted the program to Unicode and compiled with Delphi 2010.

I'm posting this for two reasons:

  1. To try to explain to long-suffering users what has been going on with the program lately - it's not usually this flaky.
  2. To forewarn anyone about to fall down the same hole as I have.

The problem never raised its head on my system, but did on some systems that don't use the Windows-1252 code page. The problem was caused by a checksum failure, which in turn was caused by the downloaded data being converted to the wrong code page before the checksum was calculated: result - bang - bad checksum.

My code relies on Indy components to translate downloaded content into text - and Indy and I make different assumptions about how this should be done. And I'm not sure either of us was correct, but they are a much more experienced bunch than I am so I'm prepared to take the blame!
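
To illustrate the kind of thing that can go wrong, here's a minimal sketch - not the actual CodeSnip or Indy code, and DemoChecksumMismatch is just a made-up name - showing how re-encoding downloaded text with the local ANSI code page yields different bytes, and therefore a different checksum, depending on the system's code page:

uses
  SysUtils;

procedure DemoChecksumMismatch;
var
  Original: TBytes;   // bytes as downloaded (the server sends UTF-8)
  AsText: string;     // bytes translated into a native Unicode string
  RoundTrip: TBytes;  // text re-encoded using the system ANSI code page
begin
  Original := TEncoding.UTF8.GetBytes('café');      // 5 bytes of UTF-8
  AsText := TEncoding.UTF8.GetString(Original);     // correct translation
  RoundTrip := TEncoding.Default.GetBytes(AsText);  // depends on local code page
  // On a Windows-1252 system RoundTrip is 4 bytes; on other code pages the
  // bytes differ again, so any checksum computed over RoundTrip can't be
  // relied on to match one computed over Original.
  WriteLn(Format('Original: %d bytes, round trip: %d bytes',
    [Length(Original), Length(RoundTrip)]));
end;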

The upshot of all this is that a proper fix isn't going to be trivial. I produced a temporary fix in v3.5.4 by the simple expedient of not checking the checksum. Then I went and commented out a crucial piece of code, which meant that downloading failed for everyone. At least that's fair I suppose, because it screwed up UK and USA users just as much as those in the rest of the world!

Next up is a total rewrite of the web service management code - I'm going to handle the raw data myself and perform my own translations rather than let Indy do it. And I'm going to do the checksum properly using HTTP headers rather than my own custom approach.

I may write some of this up here once I've worked it all out.

And the moral of this story - be damn careful about character encoding when interacting with web services, and be sure you understand what your components are up to.

Some news for CodeSnip users: I've hacked the update web service to recognise v3.5.4 and feed it data that doesn't make it fall over. And I've released v3.5.5 that does the temporary fix that v3.5.4 was supposed to do. Please update. Again!

16 March 2010

Delphi Tips News

There's now another new RSS feed that provides news of changes to the Delphi Tips section of DelphiDabbler.com.

Subscribe to the feed.

14 March 2010

URL Decoding revisited

Time to complete the set. So far in this series I have presented URIEncode, URIDecode and URIEncodeQueryString. So here's the missing piece of the jigsaw: URIDecodeQueryString. This routine decodes a query string that has been "query-string-encoded".

If you look at URL Encoding revisited you'll see that a query string is encoded with normal URI encoding, except that space characters are encoded using '+' characters instead of '%20'.

So, to decode a query string we first need to convert literal '+' characters back into spaces. Because the string is still URI encoded we should replace occurrences of '+' with '%20', not with an actual space character. Once this is done we are left with a standard URI encoded string which we decode as normal. Here's the code:

function URIDecodeQueryString(const Str: string): string;
begin
  // ReplaceStr comes from the StrUtils unit
  Result := URIDecode(ReplaceStr(Str, '+', '%20'));
end;

The URIDecode routine was presented in my URL Decoding post.
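
As a quick check - a made-up example rather than anything from the unit's own tests - decoding a typical query string value looks like this:

// '+' stands for a space and '%2B' for a literal plus sign
WriteLn(URIDecodeQueryString('2+%2B+2+%3D+4'));  // writes: 2 + 2 = 4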

URIDecodeQueryString has been added to UURIEncode.pas in my Delphi Doodlings repo. View the code.

10 March 2010

URL Decoding

To complement the code of my URL Encoding post, I've now developed a URIDecode routine.

It attempts to decode URIs that were percent-encoded according to RFC 3986. It also allows for some malformed percent-encoded URIs, i.e. those that contain characters outside the RFC's "unreserved" character set.

Here's the code. An explanation follows.

// Note: TryStrToInt and EConvertError come from the SysUtils unit.
// cPercent and rsEscapeError are declared elsewhere in UURIEncode.pas; minimal
// stand-in declarations (the exact error message wording is assumed) are:
const
  cPercent = '%';
resourcestring
  rsEscapeError = 'Invalid percent escape sequence in URI';

function URIDecode(const Str: string): string;

  // Counts number of '%' characters in a UTF8 string
  function CountPercent(const S: UTF8String): Integer;
  var
    Idx: Integer; // loops thru all octets of S
  begin
    Result := 0;
    for Idx := 1 to Length(S) do
      if S[Idx] = cPercent then
        Inc(Result);
  end;

var
  SrcUTF8: UTF8String;  // input string as UTF-8
  SrcIdx: Integer;      // index into source UTF-8 string
  ResUTF8: UTF8String;  // output string as UTF-8
  ResIdx: Integer;      // index into result UTF-8 string
  Hex: string;          // hex component of % encoding
  ChValue: Integer;     // character ordinal value from a % encoding
begin
  // Convert input string to UTF-8
  SrcUTF8 := UTF8Encode(Str);
  // Size the decoded UTF-8 string
  SetLength(ResUTF8, Length(SrcUTF8) - 2 * CountPercent(SrcUTF8));
  SrcIdx := 1;
  ResIdx := 1;
  // Process each octet of the source string
  while SrcIdx <= Length(SrcUTF8) do
  begin
    if SrcUTF8[SrcIdx] = cPercent then
    begin
      // % encoding: decode following two hex chars into required code point
      if Length(SrcUTF8) < SrcIdx + 2 then
        raise EConvertError.Create(rsEscapeError);  // malformed: too short
      Hex := '$' + string(SrcUTF8[SrcIdx + 1] + SrcUTF8[SrcIdx + 2]);
      if not TryStrToInt(Hex, ChValue) then
        raise EConvertError.Create(rsEscapeError);  // malformed: not valid hex
      ResUTF8[ResIdx] := AnsiChar(ChValue);
      Inc(ResIdx);
      Inc(SrcIdx, 3);
    end
    else
    begin
      // plain char or UTF-8 continuation character: copy unchanged
      ResUTF8[ResIdx] := SrcUTF8[SrcIdx];
      Inc(ResIdx);
      Inc(SrcIdx);
    end;
  end;
  // Convert back to native string type for result
  Result := UTF8ToString(ResUTF8);
end;

Internally, URIDecode operates on UTF-8 strings for both input and output.

This lets us deal easily with any multi-byte characters in the input. As already noted, there shouldn't be any such characters - all should map onto the unreserved characters that form a subset of the ASCII character set. However, we need to allow for badly encoded URIs that may contain characters outside this expected set.

UTF-8 also lets us perform an easy test for '%' characters in the input. Since '%' can never occur as a UTF-8 continuation byte we can simply test for the actual character without worrying about whether it is part of a multi-byte character.

We use UTF-8 for the output string since UTF-8 should have been used to encode the URI in the first place, therefore percent-encoded octets may map onto UTF-8 continuation bytes. Using any other string type would give erroneous results.

The interim UTF-8 result is converted into the native string type before returning.
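
Here's a quick hypothetical example of the routine in use. Note how the percent-encoded multi-byte character decodes correctly because its two escaped octets become UTF-8 continuation bytes in the interim string:

WriteLn(URIDecode('a%20b'));       // writes: a b
WriteLn(URIDecode('caf%C3%A9'));   // writes: café ('é' is $C3 $A9 in UTF-8)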

URIDecode has been added to UURIEncode.pas in my Delphi Doodlings repo. View the code.

URL Encoding revisited

In my previous post I covered URI encoding of parts of a URI or URL. What I didn't cover was the almost trivial case of encoding a query string. The only difference is that spaces in a query string are converted to reserved '+' characters, which are not then percent encoded.

The only complication arises when the query string to be encoded already contains '+' symbols. They must be percent-encoded so they are not confused with the symbols that are used to replace spaces.

In the original version of this post I completely misunderstood this and presented code that percent-encoded both literal and space-replacement '+' symbols.

Unfortunately percent-encoding at this stage also encodes spaces as %20. We fix this by simply replacing each occurrence of %20 with a '+' symbol.

Here's the code: it's a very simple function that encodes a query string value passed as a UTF-8 string parameter:

function URIEncodeQueryString(const S: UTF8String): string;
begin
  Result := ReplaceStr(URIEncode(S), '%20', '+');
end;

The URIEncode routine was presented in my earlier post.
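
As a quick hypothetical example, assuming URIEncode percent-encodes every character outside RFC 3986's unreserved set as described in that post:

WriteLn(URIEncodeQueryString('2 + 2 = 4'));  // writes: 2+%2B+2+%3D+4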

A version of URIEncodeQueryString is available in my Delphi Doodlings repo. View the code (see UURIEncode.pas).

To test this routine you can generate test results at The URLEncode and URLDecode Page.