Wrong identification for windows-1252 #42

Open
msdobrescu opened this issue Feb 4, 2019 · 5 comments

@msdobrescu
Contributor

Hello, I am trying to identify the encoding of a file that should be windows-1252, but the detector reports windows-1255 as the better match.
my.txt
It contains, for instance, the byte 0xC5, which is Å in windows-1252, yet the file is identified as windows-1255, which does not contain that character at all.
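
To make the mismatch concrete, here is a minimal Python sketch that decodes the byte 0xC5 under both code pages. It uses Python's built-in `cp1252` and `cp1255` codecs, not this library, so it only illustrates the byte-to-character mapping and says nothing about how the detector scores the file.

```python
import unicodedata

raw = bytes([0xC5])  # the byte mentioned above

# Decode the same byte with Python's built-in codecs for each code page.
as_1252 = raw.decode("cp1252")
as_1255 = raw.decode("cp1255")

print(unicodedata.name(as_1252))  # LATIN CAPITAL LETTER A WITH RING ABOVE
print(unicodedata.name(as_1255))  # whatever windows-1255 assigns to 0xC5

# Check whether windows-1255 can represent the character at all.
try:
    "Å".encode("cp1255")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

The byte itself is valid in both code pages, so a detector cannot reject windows-1255 on validity alone. It has to rely on how plausible the decoded text looks.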

@msdobrescu
Contributor Author

It seems uchardet has improved detection, so its changes would have to be ported into this project.

@304NotModified
Member

Do you mean the Mozilla Universal Charset Detector? Unfortunately, this project is a refactor of a port, so re-porting could be a lot of work.

@msdobrescu
Contributor Author

Possibly, but the library is almost unusable in my case, and there is no port of uchardet, which is now hosted by freedesktop.
Look here: https://gitlab.freedesktop.org/uchardet/uchardet.
It would be worth adding more to it; unfortunately, it is a bit too hardcoded.

@msdobrescu
Contributor Author

Would you accept some new languages ported from the uchardet project?

304NotModified added this to the 1.1 milestone Feb 6, 2019
304NotModified modified the milestones: 2.0, 2.1 Mar 27, 2019
304NotModified modified the milestones: 2.1, Backlog Aug 13, 2019
@rstm-sf
Collaborator

rstm-sf commented Jan 12, 2020

Now, with v2.3.0, the result is:

Detected encoding iso-8859-1 with confidence 0.47388184.

From Status Log:

SBCS 0.47388184: [iso-8859-1]
SBCS: 0.4738818 [iso-8859-1]

SBCS 0.47388184: [iso-8859-4]
SBCS: 0.4738818 [iso-8859-4]

SBCS 0.47388184: [iso-8859-9]
SBCS: 0.4738818 [iso-8859-9]

SBCS 0.47388184: [iso-8859-13]
SBCS: 0.4738818 [iso-8859-13]

SBCS 0.47388184: [iso-8859-15]
SBCS: 0.4738818 [iso-8859-15]

SBCS 0.47388184: [windows-1252]
SBCS: 0.4738818 [windows-1252]

Status Log

Get confidence:
-- new match found: confidence 0.020249203, index 0, charset windows-1251.
-- new match found: confidence 0.026152553, index 6, charset iso-8859-7.
-- new match found: confidence 0.04902641, index 11, charset windows-1255.
-- new match found: confidence 0.050912045, index 12, charset windows-1255.
-- new match found: confidence 0.093243085, index 15, charset iso-8859-1.
-- new match found: confidence 0.09324489, index 18, charset iso-8859-1.
-- new match found: confidence 0.14311144, index 21, charset iso-8859-2.
-- new match found: confidence 0.1763441, index 32, charset iso-8859-15.
-- new match found: confidence 0.24882, index 45, charset iso-8859-3.
-- new match found: confidence 0.3023013, index 59, charset ibm852.
-- new match found: confidence 0.47388184, index 60, charset iso-8859-1.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.020249203: [windows-1251]
SBCS: 0.0202492 [windows-1251]

SBCS 0.014343185: [koi8-r]
SBCS: 0.01434319 [koi8-r]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.020249203: [x-mac-cyrillic]
SBCS: 0.0202492 [x-mac-cyrillic]

SBCS 0: [ibm866]
SBCS: 0.00 [ibm866]

SBCS 0.00659974: [ibm855]
SBCS: 0.00659974 [ibm855]

SBCS 0.026152553: [iso-8859-7]
SBCS: 0.02615255 [iso-8859-7]

SBCS 0.026152553: [windows-1253]
SBCS: 0.02615255 [windows-1253]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.0031166344: [windows-1251]
SBCS: 0.003116634 [windows-1251]

SBCS 0: [windows-1255]
HEB: 0 - 0 [Logical-Visual score]

SBCS 0.04902641: [windows-1255]
SBCS: 0.04902641 [windows-1255]

SBCS 0.050912045: [windows-1255]
SBCS: 0.05091204 [windows-1255]

SBCS 0.013214358: [tis-620]
SBCS: 0.01321436 [tis-620]

SBCS 0.013214358: [iso-8859-11]
SBCS: 0.01321436 [iso-8859-11]

SBCS 0.093243085: [iso-8859-1]
SBCS: 0.09324308 [iso-8859-1]

SBCS 0.093243085: [iso-8859-15]
SBCS: 0.09324308 [iso-8859-15]

SBCS 0.093243085: [windows-1252]
SBCS: 0.09324308 [windows-1252]

SBCS 0.09324489: [iso-8859-1]
SBCS: 0.09324489 [iso-8859-1]

SBCS 0.09324489: [iso-8859-15]
SBCS: 0.09324489 [iso-8859-15]

SBCS 0.09324489: [windows-1252]
SBCS: 0.09324489 [windows-1252]

SBCS 0.14311144: [iso-8859-2]
SBCS: 0.1431114 [iso-8859-2]

SBCS 0.14311144: [windows-1250]
SBCS: 0.1431114 [windows-1250]

SBCS 0.12198714: [iso-8859-1]
SBCS: 0.1219871 [iso-8859-1]

SBCS 0.12198714: [windows-1252]
SBCS: 0.1219871 [windows-1252]

SBCS 0.09350189: [iso-8859-3]
SBCS: 0.09350189 [iso-8859-3]

SBCS 0.14065312: [iso-8859-3]
SBCS: 0.1406531 [iso-8859-3]

SBCS 0.14065312: [iso-8859-9]
SBCS: 0.1406531 [iso-8859-9]

SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256]
SBCS: 0.00 [windows-1256]

SBCS 0.084189065: [viscii]
SBCS: 0.08418906 [viscii]

SBCS 0.057199046: [windows-1258]
SBCS: 0.05719905 [windows-1258]

SBCS 0.1763441: [iso-8859-15]
SBCS: 0.1763441 [iso-8859-15]

SBCS 0.1763441: [iso-8859-1]
SBCS: 0.1763441 [iso-8859-1]

SBCS 0.1763441: [windows-1252]
SBCS: 0.1763441 [windows-1252]

SBCS 0.09554723: [iso-8859-13]
SBCS: 0.09554723 [iso-8859-13]

SBCS 0.09554723: [iso-8859-10]
SBCS: 0.09554723 [iso-8859-10]

SBCS 0.09554723: [iso-8859-4]
SBCS: 0.09554723 [iso-8859-4]

SBCS 0.09578463: [iso-8859-13]
SBCS: 0.09578463 [iso-8859-13]

SBCS 0.09578463: [iso-8859-10]
SBCS: 0.09578463 [iso-8859-10]

SBCS 0.09578463: [iso-8859-4]
SBCS: 0.09578463 [iso-8859-4]

SBCS 0.09340608: [iso-8859-1]
SBCS: 0.09340608 [iso-8859-1]

SBCS 0.09340608: [iso-8859-9]
SBCS: 0.09340608 [iso-8859-9]

SBCS 0.09340608: [iso-8859-15]
SBCS: 0.09340608 [iso-8859-15]

SBCS 0.09340608: [windows-1252]
SBCS: 0.09340608 [windows-1252]

SBCS 0.24882: [iso-8859-3]
SBCS: 0.24882 [iso-8859-3]

SBCS 0.095001444: [windows-1250]
SBCS: 0.09500144 [windows-1250]

SBCS 0.095001444: [iso-8859-2]
SBCS: 0.09500144 [iso-8859-2]

SBCS 0.13669409: [x-mac-ce]
SBCS: 0.1366941 [x-mac-ce]

SBCS 0.1854423: [ibm852]
SBCS: 0.1854423 [ibm852]

SBCS 0.081335865: [windows-1250]
SBCS: 0.08133586 [windows-1250]

SBCS 0.081335865: [iso-8859-2]
SBCS: 0.08133586 [iso-8859-2]

SBCS 0.13743466: [x-mac-ce]
SBCS: 0.1374347 [x-mac-ce]

SBCS 0.1760888: [ibm852]
SBCS: 0.1760888 [ibm852]

SBCS 0.13817607: [windows-1250]
SBCS: 0.1381761 [windows-1250]

SBCS 0.13817607: [iso-8859-2]
SBCS: 0.1381761 [iso-8859-2]

SBCS 0.13817607: [iso-8859-13]
SBCS: 0.1381761 [iso-8859-13]

SBCS 0.12337148: [iso-8859-16]
SBCS: 0.1233715 [iso-8859-16]

SBCS 0.21631232: [x-mac-ce]
SBCS: 0.2163123 [x-mac-ce]

SBCS 0.3023013: [ibm852]
SBCS: 0.3023013 [ibm852]

SBCS 0.47388184: [iso-8859-1]
SBCS: 0.4738818 [iso-8859-1]

SBCS 0.47388184: [iso-8859-4]
SBCS: 0.4738818 [iso-8859-4]

SBCS 0.47388184: [iso-8859-9]
SBCS: 0.4738818 [iso-8859-9]

SBCS 0.47388184: [iso-8859-13]
SBCS: 0.4738818 [iso-8859-13]

SBCS 0.47388184: [iso-8859-15]
SBCS: 0.4738818 [iso-8859-15]

SBCS 0.47388184: [windows-1252]
SBCS: 0.4738818 [windows-1252]

SBCS 0.13686267: [iso-8859-1]
SBCS: 0.1368627 [iso-8859-1]

SBCS 0.13686267: [iso-8859-3]
SBCS: 0.1368627 [iso-8859-3]

SBCS 0.13686267: [iso-8859-9]
SBCS: 0.1368627 [iso-8859-9]

SBCS 0.13686267: [iso-8859-15]
SBCS: 0.1368627 [iso-8859-15]

SBCS 0.13686267: [windows-1252]
SBCS: 0.1368627 [windows-1252]

SBCS 0.08758995: [windows-1250]
SBCS: 0.08758995 [windows-1250]

SBCS 0.08758995: [iso-8859-2]
SBCS: 0.08758995 [iso-8859-2]

SBCS 0.08758995: [iso-8859-13]
SBCS: 0.08758995 [iso-8859-13]

SBCS 0.08798097: [iso-8859-16]
SBCS: 0.08798097 [iso-8859-16]

SBCS 0.12607843: [x-mac-ce]
SBCS: 0.1260784 [x-mac-ce]

SBCS 0.16955028: [ibm852]
SBCS: 0.1695503 [ibm852]

SBCS 0.37495747: [windows-1252]
SBCS: 0.3749575 [windows-1252]

SBCS 0.37495747: [windows-1257]
SBCS: 0.3749575 [windows-1257]

SBCS 0.37495747: [iso-8859-4]
SBCS: 0.3749575 [iso-8859-4]

SBCS 0.37495747: [iso-8859-13]
SBCS: 0.3749575 [iso-8859-13]

SBCS 0.37495747: [iso-8859-15]
SBCS: 0.3749575 [iso-8859-15]

SBCS 0.093210384: [iso-8859-1]
SBCS: 0.09321038 [iso-8859-1]

SBCS 0.093210384: [iso-8859-9]
SBCS: 0.09321038 [iso-8859-9]

SBCS 0.093210384: [iso-8859-15]
SBCS: 0.09321038 [iso-8859-15]

SBCS 0.093210384: [windows-1252]
SBCS: 0.09321038 [windows-1252]

SBCS 0.09317723: [windows-1250]
SBCS: 0.09317723 [windows-1250]

SBCS 0.09317723: [iso-8859-2]
SBCS: 0.09317723 [iso-8859-2]

SBCS 0.09317723: [iso-8859-16]
SBCS: 0.09317723 [iso-8859-16]

SBCS 0.18036576: [ibm852]
SBCS: 0.1803658 [ibm852]

SBCS 0.09312218: [windows-1250]
SBCS: 0.09312218 [windows-1250]

SBCS 0.09312218: [iso-8859-2]
SBCS: 0.09312218 [iso-8859-2]

SBCS 0.09312218: [iso-8859-16]
SBCS: 0.09312218 [iso-8859-16]

SBCS 0.13316554: [x-mac-ce]
SBCS: 0.1331655 [x-mac-ce]

SBCS 0.18025918: [ibm852]
SBCS: 0.1802592 [ibm852]

SBCS 0.23395953: [iso-8859-1]
SBCS: 0.2339595 [iso-8859-1]

SBCS 0.23395953: [iso-8859-4]
SBCS: 0.2339595 [iso-8859-4]

SBCS 0.23395953: [iso-8859-9]
SBCS: 0.2339595 [iso-8859-9]

SBCS 0.23395953: [iso-8859-15]
SBCS: 0.2339595 [iso-8859-15]

SBCS 0.23395953: [windows-1252]
SBCS: 0.2339595 [windows-1252]

SBCS Group found best match [iso-8859-1] confidence 0.47388184.

This is consistent with the Finnish model:

// Finnish
probers[60] = new SingleByteCharSetProber(new Iso_8859_1_FinnishModel());
probers[61] = new SingleByteCharSetProber(new Iso_8859_4_FinnishModel());
probers[62] = new SingleByteCharSetProber(new Iso_8859_9_FinnishModel());
probers[63] = new SingleByteCharSetProber(new Iso_8859_13_FinnishModel());
probers[64] = new SingleByteCharSetProber(new Iso_8859_15_FinnishModel());
probers[65] = new SingleByteCharSetProber(new Windows_1252_FinnishModel());
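
The following Python sketch illustrates why iso-8859-1 (index 60) is reported rather than windows-1252 (index 65), even though both score 0.47388184. It assumes the group prober keeps a candidate only when its confidence is strictly greater than the current best, which is inferred from the "new match found" lines in the log and not confirmed against the source. The confidences are copied from the log, indices 61-65 follow the array order in the snippet above, and `pick_best` is a hypothetical name.

```python
# (index, charset, confidence) as reported in the status log.
candidates = [
    (59, "ibm852", 0.3023013),
    (60, "iso-8859-1", 0.47388184),
    (61, "iso-8859-4", 0.47388184),
    (62, "iso-8859-9", 0.47388184),
    (63, "iso-8859-13", 0.47388184),
    (64, "iso-8859-15", 0.47388184),
    (65, "windows-1252", 0.47388184),
]


def pick_best(cands):
    """Scan in index order and keep the first candidate with the top score."""
    best = None
    for index, charset, confidence in cands:
        # Assumption: strict '>' means a later tie never replaces the best.
        if best is None or confidence > best[2]:
            best = (index, charset, confidence)
    return best


best = pick_best(candidates)
print(best)  # (60, 'iso-8859-1', 0.47388184)
```

Under that assumption, the order of the Finnish probers decides the winner among tied models, so moving the windows-1252 model ahead of the ISO ones would change the reported charset without changing any score.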

The problem is now the same as in #77.
