Friday, January 21, 2011

Ruby on Rails and encoding

Sometimes people just don't use UTF-8. I could understand the Chinese and Japanese folks, but why do Polish webmasters still use ISO-8859-2? I wonder...

Anyway, a friend recently found a bug in my app, and it's quite a tricky one. The app takes some strings from an HTML document, builds a form out of them and submits that form to the server. Everything works flawlessly until we use IE on a page that isn't encoded in UTF-8.

Because IE doesn't support the accept-charset form attribute, the data is sent in the page's original encoding, which is ISO-8859-2 in my case. My application's default encoding is UTF-8, so when the data arrives the app "thinks" it's encoded in UTF-8, which is obviously wrong. So I got the byte \xF3, which is "ó" in ISO-8859-2 (in Polish this letter sounds the same as "u", by the way, but we use both to complicate our language and make it l33t onRy! ;-)). Read as UTF-8 it is an "invalid byte sequence in UTF-8" and my app crashes :D Soooooo great :D To crash a Rails app, just send it an invalid UTF-8 byte!
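Here's a quick sketch of what the app ends up holding, runnable in plain Ruby without Rails. The word "król" is just my example; the bytes are what an ISO-8859-2 page would send for it.

```ruby
# "król" in ISO-8859-2: k, r, ó (the single byte 0xF3), l.
# The app tags the incoming bytes as UTF-8 because that is its default.
raw = [0x6B, 0x72, 0xF3, 0x6C].pack('C*').force_encoding('UTF-8')

# In UTF-8, 0xF3 announces a 4-byte character, but an ASCII "l" follows.
puts raw.valid_encoding?

begin
  raw =~ /l/ # matching a regexp against the broken string raises
rescue ArgumentError => e
  puts e.message
end
```

Nothing complains when the string is created. The error only shows up once something actually tries to interpret the characters.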

The solution for this isn't simple. People talk about using iconv to replace/ignore/blank out the troublemaking bytes, but I want my "ó"! Another way is to detect the encoding and then do string.force_encoding(detected_encoding).encode!("UTF-8"), which works, but the problem is that in Ruby 1.9 we don't have any good, working encoding detection method.
I've crawled the internets for hours and found only the chardet gem (UniversalDetector), which fails to work under Ruby 1.9 with Rails 3.0.3, and rchardet, which doesn't work either. Some awesome guy ported it to 1.9, but that port still doesn't solve my problem: it crashes when trying to detect the encoding of my string!
So finally I wrote a little function. It's not perfect and it will eat letters from a string if we guess the wrong encoding, but it should work for most Latin sites.

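    require 'iconv' # bundled with Ruby 1.9; a separate gem since 2.0

    # Tries to turn s into valid UTF-8 in place, guessing ISO-8859-2.
    def vs(s)
      if s.to_s.valid_encoding?
        return true
      else
        begin
          # Retag the raw bytes as ISO-8859-2, then transcode them to UTF-8.
          s.force_encoding('ISO-8859-2').encode!("UTF-8")
        rescue EncodingError
          # encode! raises EncodingError subclasses, not ArgumentError.
        ensure
          # Runs either way: drop any bytes that are still not valid UTF-8.
          c = Iconv.new 'UTF-8//IGNORE', 'UTF-8'
          # Two padding spaces go in, so two must come off again.
          s.replace c.iconv(s.dup + '  ')[0..-3]
        end
      end
    end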
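To make the trade-off concrete, here's a rough sketch of the two possible outcomes for the same bytes. It skips Iconv and uses `String#scrub` (Ruby 2.1+) as a stand-in for the `//IGNORE` step, so treat it as an illustration of the behaviour rather than the function above. The sample word is made up.

```ruby
# "król" as ISO-8859-2 bytes, the way they arrive from the browser.
bytes = [0x6B, 0x72, 0xF3, 0x6C].pack('C*')

# Good guess: retag as ISO-8859-2, then transcode. The ó survives.
kept = bytes.dup.force_encoding('ISO-8859-2').encode('UTF-8')
puts kept

# Lossy fallback: throw away whatever is not valid UTF-8. The ó is eaten.
# scrub('') is my substitute for Iconv's //IGNORE, not what vs actually calls.
eaten = bytes.dup.force_encoding('UTF-8').scrub('')
puts eaten
```

So the first step is the one that keeps the "ó", and the second one is only there so the app gets something it can work with instead of crashing.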


Just remember: it's not a solution, just an ugly workaround.
