We have had a couple of interesting discussions in our code review sessions about character sets, strings and file formats, so I wanted to share what I believe to be the archetypal post on the subject, by Joel Spolsky. If you have the opportunity, please read the article in its entirety… then read it again.

For those with too little time (TL;DR), here are the three most practical tips.

  • Be careful what you use to edit files, especially .resx files. Notepad, for example, can default to ANSI when saving, whereas we expect files to be saved as UTF-8.
  • Take extra care during code review to ensure the files are compared using the right encoding, otherwise the diff can show spurious or misleading changes.
  • If we are retrieving string data from some external source (files, services, etc.), we should always know the underlying character set. Otherwise we could end up presenting pure gibberish, because the bytes get decoded using the assumptions of our own character set.
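
To make the first and last tips concrete, here is a minimal Python sketch (not from the article, and not our actual tooling). It assumes that "ANSI" means Windows-1252, which is the usual case on Western-locale Windows machines but not a given. The helper name `is_valid_utf8` and the sample string are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "café"

# The same text saved two ways. "cp1252" stands in for Notepad's "ANSI"
# (an assumption: the real code page depends on the machine's locale).
utf8_bytes = text.encode("utf-8")    # é becomes two bytes: C3 A9
ansi_bytes = text.encode("cp1252")   # é becomes one byte:  E9

# Tip 1: a file saved as "ANSI" is usually not valid UTF-8 once it
# contains non-ASCII characters, so a simple check can catch it.
utf8_ok = is_valid_utf8(utf8_bytes)
ansi_ok = is_valid_utf8(ansi_bytes)

# Tip 3: decoding bytes with the wrong character set does not fail,
# it silently produces gibberish.
garbled = utf8_bytes.decode("cp1252")

print(utf8_ok, ansi_ok, garbled)
```

Note that the check only fires when the file actually contains non-ASCII characters: plain ASCII text is byte-for-byte identical in both encodings, which is exactly why these problems tend to go unnoticed until someone adds an accented character.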