I suspect a fix targeting only the misparsed characters that exist in ISO-8859-1 may be all we need to fix the encoding issue for good. However, I haven't checked whether this reasoning applies to special dashes like the em dash, which takes three bytes in UTF-8.
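One way to check that reasoning is to ask Python which of the affected characters each encoding can represent and how many bytes each takes in UTF-8. This is only a quick sketch; the character list and the helper names `encodable` and `utf8_len` are made up for illustration. As far as I know, the curly quotes and dashes are not in ISO-8859-1 proper but are in Windows-1252, while accented letters such as é are in both.

```python
# Characters to test, written as escapes so the file's own encoding cannot interfere.
CHARS = {
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
    "e acute": "\u00e9",
}


def encodable(ch, codec):
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def utf8_len(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Columns: name, UTF-8 byte count, in ISO-8859-1?, in Windows-1252?
for name, ch in CHARS.items():
    print(name, utf8_len(ch), encodable(ch, "iso-8859-1"), encodable(ch, "cp1252"))
```

If the dashes and quotes come back as not encodable in ISO-8859-1, the fix probably has to treat the data as Windows-1252 rather than strict ISO-8859-1.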
If there is a known mapping from each garbled sequence to the character it should be, a series of queries on the database could clean up some of this.
I linked it further up the thread; accents are a bit more complicated (there are more of them, and they use two-character codes), but this covers the punctuation:
â€œ = left quote = “
â€ = right quote = ”
â€˜ = left single quote = ‘
â€™ = right single quote = ’
â€“ = en dash = –
â€” = em dash = —
- = hyphen = -
â€¦ = ellipsis = …
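For anyone who wants to reproduce the garbled forms rather than trust a hand-typed list, here is a minimal sketch. It assumes the damage came from UTF-8 bytes being read as Windows-1252, which is the usual cause but is not confirmed for this database. The function names `garble` and `repair` are made up for illustration.

```python
def garble(ch):
    """Return what a UTF-8 character looks like when its bytes are read as Windows-1252."""
    # Byte 0x9D (the last byte of the right double quote) has no Windows-1252
    # mapping, so errors="replace" turns it into U+FFFD instead of raising.
    return ch.encode("utf-8").decode("cp1252", errors="replace")


def repair(text):
    """Undo the misreading if the whole string round-trips; otherwise return it unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that is already correct, or only partly garbled, is left alone.
        return text
```

For example, `garble("\u201c")` gives the three characters a-circumflex, euro sign, and oe ligature, and `repair` turns them back into a left double quote. The right double quote is the awkward one: its third byte has no Windows-1252 character, so it usually shows up as just the first two characters, and `repair` cannot round-trip it. It needs a plain search-and-replace instead.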
I overlooked your original post and went back to it, thanks.
I ran some queries on the message bodies and corrected all of the instances listed in the code block above. I then went to do the same on the subject field, but the first query truncated the field instead, causing the subject to display the first part of the body text. A weird error that messed up over 1,100 posts...
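A more defensive version of that kind of cleanup could look like the sketch below. It is not the queries that were actually run, and it does not explain the truncation. It uses an in-memory-friendly SQLite connection as a stand-in for the forum's real database, and the table name `posts`, the columns `subject` and `body`, and the function names are all hypothetical. The idea is to apply the replacements inside a transaction, compare every row with what the same replacements give in Python, and roll back if anything differs.

```python
import sqlite3

# Garbled sequence -> intended character. Order matters:
# the dash patterns end in a curly quote, so they run before the fixes that
# produce curly quotes, and the bare two-character right-quote pattern runs
# last because it is a prefix of all the others.
FIXES = [
    ("\u00e2\u20ac\u201c", "\u2013"),  # en dash
    ("\u00e2\u20ac\u201d", "\u2014"),  # em dash
    ("\u00e2\u20ac\u0153", "\u201c"),  # left double quote
    ("\u00e2\u20ac\u02dc", "\u2018"),  # left single quote
    ("\u00e2\u20ac\u2122", "\u2019"),  # right single quote
    ("\u00e2\u20ac\u00a6", "\u2026"),  # ellipsis
    ("\u00e2\u20ac", "\u201d"),        # right double quote (third byte lost)
]

ALLOWED_COLUMNS = {"subject", "body"}  # hypothetical column names


def apply_fixes(text):
    """Apply the same replacements in Python, in the same order as the SQL."""
    for bad, good in FIXES:
        text = text.replace(bad, good)
    return text


def fix_column(conn, column):
    """Fix one column of the hypothetical `posts` table; return rows changed."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    before = dict(conn.execute(f"SELECT id, {column} FROM posts"))
    for bad, good in FIXES:
        conn.execute(
            f"UPDATE posts SET {column} = REPLACE({column}, ?, ?)", (bad, good)
        )
    after = dict(conn.execute(f"SELECT id, {column} FROM posts"))
    expected = {row_id: apply_fixes(text) for row_id, text in before.items()}
    if after != expected:
        # Something other than the intended replacement happened
        # (for example a truncated field), so undo everything.
        conn.rollback()
        raise RuntimeError(f"{column}: unexpected result, rolled back")
    conn.commit()
    return sum(1 for row_id in before if before[row_id] != after[row_id])
```

Running it against a copy of the table first, one column at a time, would keep a bad query from touching live posts.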
Looking at the old database, I copied over the original topic names and corresponding topic IDs for those posts, which fell under 44 subjects. I then ran queries to rename the posts in this database, which covered the subjects with a left quote (“) in them.
So you will notice less garbled text.
There's still more to fix, but not after that SQL gaffe, and definitely no more tonight!
