Saturday, October 31, 2009

Spammers are exploiting UTF-8 homonyms to hide spam keywords

I don't know if that is the correct name (the usual term seems to be "homoglyphs"), but Twitter spammers are currently using the UTF-8 encodings of characters that look like normal Latin characters to hide their spam keywords, e.g. EFBD89 -> i, EFBD8F -> o, etc.

The byte sequence EFBD89 decodes to U+FF49 in Unicode, which is a second "i" in the Unicode table (FULLWIDTH LATIN SMALL LETTER I; I wonder what the point of this is) that displays as a widely spaced character.
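
To check the decoding yourself, here is a small Python sketch. The helper name describe is just made up for illustration; it only uses the standard library to turn the hex bytes into a code point and look up its official Unicode name.

```python
import unicodedata


def describe(hex_bytes: str) -> str:
    """Decode a hex string of UTF-8 bytes holding one character and describe it."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The two byte sequences mentioned in this post.
print(describe("EFBD89"))  # U+FF49 FULLWIDTH LATIN SMALL LETTER I
print(describe("EFBD8F"))  # U+FF4F FULLWIDTH LATIN SMALL LETTER O
```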

This can be fixed easily, but right now I think any keyword filter that matches on the raw text will fail on it.
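
One possible fix, as a rough sketch and not what Twitter or any particular filter actually does: run the text through Unicode compatibility normalization (NFKC) before matching, which folds the fullwidth letters back to plain ASCII. The function names and the keyword list below are made up for the example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility characters (e.g. fullwidth letters) and case."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains_keyword(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in the normalized text."""
    folded = normalize(text)
    return any(normalize(keyword) in folded for keyword in keywords)


# Hypothetical keyword list and a tweet disguised with U+FF49 and U+FF4F.
keywords = ["free iphone"]
disguised = "Get a free \uff49ph\uff4fne now"

print("free iphone" in disguised)             # naive match on the raw text
print(contains_keyword(disguised, keywords))  # match after normalization
```

Note that NFKC only covers characters with a compatibility mapping, like the fullwidth forms here. Lookalikes from other scripts (for example a Cyrillic "о") would slip through and need a separate confusables table.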

