April 22, 2009

Detecting encodings

I'd like to write about chardet. This library guesses the character encoding of a piece of text, for example a web page (HTML) or any other file. This is very useful when you are connecting several tools that exchange information.

However, in Python versions before Python 3 (Python3k), working with encodings is painful, so you often run into trouble when trying to guess the encoding of a source.

Chardet reports the encoding that most likely matches a given source, together with a confidence value between 0 and 1. Below you can see an example of how to use chardet. It's very easy!


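>>> # Python 2: urllib.urlopen() returns a file-like object, read() gives the raw bytes
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()

>>> # chardet.detect() takes the raw bytes and returns its best guess
>>> import chardet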
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}

>>> chardet.detect(urlread("http://yahoo.co.jp/"))
{'encoding': 'EUC-JP', 'confidence': 0.99}
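
Once you have a guess, the usual next step is to decode the raw bytes with it. Here is a minimal sketch of that step, written for Python 3. The helper name decode_with_guess, the 0.5 confidence threshold and the UTF-8 fallback are my own assumptions, not part of chardet. The detector is passed in as an argument, so anything that returns a dict with 'encoding' and 'confidence' keys (like chardet.detect does) will work.

```python
def decode_with_guess(raw, detect, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a detector's guess (hypothetical helper).

    `detect` is any callable that returns a dict with 'encoding' and
    'confidence' keys, such as chardet.detect. The threshold and the
    fallback encoding are illustrative defaults, not chardet settings.
    """
    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # Do not trust a missing or low-confidence guess.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        # errors="replace" swaps undecodable bytes for U+FFFD
        # instead of raising UnicodeDecodeError.
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The guessed name is not a codec Python knows about.
        return raw.decode(fallback, errors="replace")


# Typical call, assuming chardet is installed and `data` holds bytes:
#     import chardet
#     text = decode_with_guess(data, chardet.detect)
```

Replacing bad bytes instead of raising an error is a choice that suits quick inspection of a page. If you need to know about decoding failures, use errors="strict" instead.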


