Author: JoeS    Posted: Tue Jun 03, 2008 3:04 pm    Post subject: Cleaning up HTML files saved as text files
Web browsers used: Firefox, Konqueror

There are files I save from the web that I would prefer to keep as text files.

When I save a page as a text file, or copy and paste it, some of the punctuation (such as " ' and -) is converted to ?. In some files there can be a lot of these.

Maybe there is another web browser or program I could use.
I would appreciate any advice on cleaning up an html file after it is saved as a text file.


Author: Elderan    Posted: Sun Jun 08, 2008 1:28 pm    Post subject:
The problem is not your web browser, it's the encoding of the text. Many pages use UTF-8, but when you save the text as ASCII, any special characters that ASCII cannot represent are written out as ?.

Save the data in UTF-8, or just use the Save-As function of your browser.
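
To illustrate where the ? characters come from, here is a minimal Python sketch. It is not what the browser actually runs, and the sample string is made up; it only shows that encoding to ASCII with replacement turns each unsupported character into ?, while UTF-8 keeps everything.

```python
# Text as it might appear on a web page: curly quotes, an en dash,
# and a curly apostrophe (written as escapes so this file stays ASCII).
page_text = "\u201cHello\u201d \u2013 it\u2019s here"

# errors="replace" substitutes "?" for every character ASCII lacks.
as_ascii = page_text.encode("ascii", errors="replace").decode("ascii")
print(as_ascii)

# A round trip through UTF-8 loses nothing.
as_utf8 = page_text.encode("utf-8").decode("utf-8")
print(as_utf8 == page_text)
```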

Author: capi    Location: Portugal    Posted: Sun Jun 08, 2008 5:00 pm    Post subject:
Exactly. The problem here is that the original HTML contains characters that don't exist in the reduced ASCII set. Things like curved quotes (“ ”) or the Euro sign (€), for example.

When saving to text, the browser is probably saving either to strict 7-bit ASCII (or maybe ISO-8859-1, also known as Latin-1), or to the encoding specified by your locale settings. Whichever encoding it's using, it apparently does not include some of the original characters.
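
If you want to check which characters a given encoding cannot hold, something like the following Python sketch may help. The helper name `unencodable` and the sample string are my own, purely for illustration; the last line prints the encoding your locale prefers, which may or may not be what the browser uses.

```python
import locale

def unencodable(text, encoding):
    """Return the distinct characters in text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Sample with e-acute, curly double quotes, and the Euro sign.
sample = "caf\u00e9 \u201cquoted\u201d \u20ac5"

for enc in ("ascii", "iso-8859-1", "utf-8"):
    print(enc, unencodable(sample, enc))

# The encoding implied by the current locale settings.
print("locale default:", locale.getpreferredencoding(False))
```

With this sample, Latin-1 can hold the accented letter but not the curly quotes or the Euro sign, and UTF-8 has nothing missing.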

The solution would be to normalize the characters so that fancy stuff like curved quotes and so on is transformed to more standard characters like ". This, however, may not be easy to accomplish from the browser.
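
Outside the browser, this kind of normalization can be sketched in a few lines of Python. The replacement table below is only an assumption about which characters tend to show up; extend it for your own files. It also has to run on text that still contains the original characters, since a ? already written to disk cannot be turned back.

```python
# Hypothetical mapping from "fancy" punctuation to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}
TABLE = str.maketrans(REPLACEMENTS)

def to_plain(text):
    """Replace the mapped characters with their ASCII stand-ins."""
    return text.translate(TABLE)

fancy = "\u201cIt\u2019s 5\u20137\u2026\u201d"
plain = to_plain(fancy)
print(plain)

# Raises UnicodeEncodeError if anything non-ASCII is left over.
plain.encode("ascii")
```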

As Elderan pointed out, saving as UTF-8 would be another solution, since UTF-8 can by definition encode all Unicode characters. This would mean, however, that you'd need a text editor that understands UTF-8 to read the fancy characters in the text file, but virtually any modern text editor can do that.
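
As a rough check that UTF-8 really does keep everything, here is a small Python sketch that writes a string with curly quotes and a Euro sign to a file and reads it back. The file name `page.txt` and the sample text are made up for the example.

```python
import os
import tempfile

text = "\u201cPrice: \u20ac5\u201d"

# Hypothetical file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "page.txt")

# Write and read with an explicit encoding instead of the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    restored = f.read()

print(restored == text)

# The special characters take more than one byte each on disk.
with open(path, "rb") as f:
    raw = f.read()
print(len(text), "characters,", len(raw), "bytes")
```

Passing `encoding="utf-8"` explicitly matters here: without it, Python falls back to the locale's encoding, which is the same trap described above.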

Unfortunately I don't really know if there's a way to choose the encoding used when you save text in Firefox, or which encoding it uses to save in the first place. Perhaps asking in the Mozilla forums might help.

