Determining the best-fitting encoding for any text without AI, in native Python, with Charset Normalizer

Or how I used brute-force where I least expected it

There is a very old issue regarding “encoding detection” in text files, one that has only been partially resolved by programs like Chardet. I did not like its idea of a single prober per encoding table, which can lead to hard-coded, per-encoding specifications.

I wanted to challenge the existing methods of discovering the originating encoding.

You could consider this issue obsolete because of current norms:

You should indicate the charset encoding used, as described in the standards.

But the reality is different: a huge part of the internet still serves content with an unknown encoding. (One could point to SubRip subtitles (SRT), for instance.)

This is why a popular package like psf/Requests embeds Chardet to guess the apparent encoding of remote resources.
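For illustration, here is roughly how that surfaces when using Requests; `https://example.com` is just a placeholder URL, and `apparent_encoding` is the property backed by the embedded detector:

```python
import requests

# Fetch a remote resource that may not declare its charset.
response = requests.get("https://example.com")

print(response.encoding)           # encoding taken from the HTTP headers, if declared
print(response.apparent_encoding)  # encoding guessed from the body by the embedded detector
print(response.text[:200])         # text decoded with the encoding Requests settled on
```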

You should know that:

  • You should not care about the originating charset encoding, because two different tables can produce two identical files (see the short demonstration below).
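A tiny demonstration of that ambiguity in plain Python:

```python
# Two different encoding tables producing byte-for-byte identical output.
text = "café"

assert text.encode("latin-1") == text.encode("cp1252")  # both yield b'caf\xe9'

# Given only those bytes, no detector can tell which table "originally" wrote them.
```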

I’m brute-forcing on three premises (a minimal sketch follows this list):

  • The binary fits the encoding table
  • Chaos
  • Coherence
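To make the idea concrete, here is a minimal sketch of the brute-force skeleton, not the project’s actual implementation; the candidate list is deliberately short, and the ranking step is only hinted at:

```python
# A deliberately short candidate list; a real detector enumerates far more tables.
CANDIDATE_ENCODINGS = ["utf-8", "cp1252", "latin-1", "utf-16", "cp1251"]

def candidates(payload: bytes) -> list[tuple[str, str]]:
    """Brute-force every candidate table and keep the decodings that fit at all."""
    fits = []
    for encoding in CANDIDATE_ENCODINGS:
        try:
            decoded = payload.decode(encoding)  # premise 1: the binary fits the table
        except (UnicodeDecodeError, LookupError):
            continue
        fits.append((encoding, decoded))
    # Premises 2 and 3 (chaos and coherence scores) would then rank the survivors.
    return fits

print(candidates("Bonjour à tous".encode("cp1252")))
```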

Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed the results, then established some ground rules about what is obviously wrong when the output looks like a mess. I know that my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.
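As an illustration of the kind of rule involved (a toy heuristic, not the project’s actual chaos detector), one could flag characters that almost never appear in correctly decoded human text:

```python
import unicodedata

def mess_ratio(decoded: str) -> float:
    """Toy chaos score: the share of characters that look suspicious in human text."""
    if not decoded:
        return 0.0
    suspicious = sum(
        1
        for character in decoded
        # Control characters (besides common whitespace) and unassigned code
        # points rarely survive a correct decoding of human-written text.
        if unicodedata.category(character) in {"Cc", "Cn"}
        and character not in "\n\r\t"
    )
    return suspicious / len(decoded)

# Decoding UTF-16 bytes with the wrong table litters the text with NUL characters.
garbled = "Hello".encode("utf-16-le").decode("latin-1")
print(mess_ratio("Hello"), mess_ratio(garbled))  # 0.0 vs 0.5
```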

Coherence: For each language on Earth (or as many as we could manage), we have computed and ranked letter-frequency occurrences. I figured that intel was worth something here, so I check those records against the decoded text to see whether I can detect intelligent design.
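Again as a toy illustration (the ranking below is a rough sample for English, not the project’s data), the check amounts to comparing the decoded text against the most frequent letters of a language:

```python
from collections import Counter

# Rough ranking of the most frequent English letters, for illustration only.
ENGLISH_TOP_LETTERS = set("etaoinshrdlu")

def coherence_ratio(decoded: str, frequent_letters: set = ENGLISH_TOP_LETTERS) -> float:
    """Toy coherence score: the share of letters drawn from the language's common ones."""
    letters = [character.lower() for character in decoded if character.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    return sum(counts[letter] for letter in frequent_letters) / len(letters)

print(coherence_ratio("The quick brown fox jumps over the lazy dog"))  # high: looks like English
print(coherence_ratio("Ã©Ã¨Ã§Ã¤"))                                    # low: looks like mojibake
```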

So I present to you Charset Normalizer, the Real First Universal Charset Detector.
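With recent releases of the package, typical usage looks roughly like this (the `from_bytes` / `best()` names follow the project’s documented API; check the repository README for the exact, current interface):

```python
from charset_normalizer import from_bytes

# Bytes with no declared encoding (here, Cyrillic text stored as cp1251).
payload = "Всеки човек има право на образование.".encode("cp1251")

results = from_bytes(payload)  # every table that plausibly fits, ranked by chaos/coherence
best_guess = results.best()    # the most coherent, least chaotic candidate, or None

if best_guess is not None:
    print(best_guess.encoding)  # a fitting encoding (not necessarily the "original" one)
    print(str(best_guess))      # the decoded, human-readable text
```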

Feel free to help us through testing or contributing.

Thank you