This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.
| Summary: | Warn when using BOM in UTF-8 in XMLs | | |
|---|---|---|---|
| Product: | xml | Reporter: | lokad <lokad> |
| Component: | Validation | Assignee: | Svata Dedic <sdedic> |
| Status: | NEW --- | | |
| Severity: | normal | CC: | reinouts, rkraneis |
| Priority: | P2 | Keywords: | REGRESSION |
| Version: | 7.4 | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Issue Type: | ENHANCEMENT | Exception Reporter: | |
Description
lokad
2013-01-31 12:54:11 UTC
Another observation: if no encoding is specified (e.g. only `<?xml version="1.0"?>`), then validation also works for UTF-8 with BOM. My previous comment is actually not true. So here are some results:

```
              | encoding specified | no encoding specified
--------------+--------------------+-----------------------
UTF-8         | OK                 | OK
UTF-8 BOM     | 1)                 | 1)
UTF-16 LE BOM | OK                 | 2) #)
UTF-16 BE BOM | OK                 | 2) #)
[UTF-16 LE    | 1) *)              | 3) +) ]
[UTF-16 BE    | OK                 | 1) +) ]
```

1) "Content is not allowed in prolog."
2) "Premature end of file."
3) "The markup in the document preceding the root element must be well-formed."

*) Garbage when opened in NB (wrong encoding detected)
#) Nothing when opened in NB
+) Probably detected as UTF-8 (spaces between characters)

File contents:

```
<?xml version="1.0" encoding="$encoding"?>
<root/>
```

```
<?xml version="1.0"?>
<root/>
```

I am of course unsure which of these combinations should have worked. If I understand the W3C requirements correctly, UTF-8 with or without a BOM and UTF-16 with a BOM have to be understood; UTF-16 without a BOM is illegal. Only documents encoded in neither UTF-8 nor UTF-16 seem to be required to provide correct encoding information. [http://www.w3.org/TR/REC-xml/#charencoding]

Hm, at least in the case where the encoding is not specified, the file even opens badly (the BOM is displayed). The defect is present since NetBeans 7.1.2; I cannot pinpoint a changeset which changed the behaviour.

Anyway, EncodingUtil.doDetectEncoding attempts to autodetect the encoding and then reads the document's declared encoding. If the document does NOT declare anything, the autodetected encoding (e.g. UTF-8 detected using BOM presence) is thrown away and null is returned. That causes the next encoding in the queue (the project default, ISO-8859-1 in my case) to step in and interpret the BOM as regular text. Although I was able to fix the charset detection, the UTF-8 encoded file is still not read correctly.
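The BOM-based detection fix described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual NetBeans `EncodingUtil` code; the class and method names here are made up:

```java
/**
 * Minimal sketch of BOM-based charset detection, illustrating the fix
 * described above: when the document declares no encoding, the charset
 * autodetected from the BOM is kept instead of being thrown away.
 * NOT the actual NetBeans EncodingUtil code; names are illustrative.
 */
public class BomDetect {

    /** Returns the charset name indicated by a leading BOM, or null. */
    static String detectFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return null;
    }

    /**
     * Combines the encoding declared in the XML prolog with the
     * BOM-detected one: the declaration wins, but a missing declaration
     * no longer discards the autodetected charset.
     */
    static String resolveEncoding(String declared, byte[] head) {
        if (declared != null) return declared;
        return detectFromBom(head); // may still be null -> project default steps in
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(resolveEncoding(null, utf8Bom));         // UTF-8
        System.out.println(resolveEncoding("ISO-8859-1", utf8Bom)); // ISO-8859-1
    }
}
```

With this resolution order, an undeclared UTF-8 document with a BOM is no longer re-read using the project default (ISO-8859-1 above), which is what made the BOM show up as garbage text.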
Java I/O libraries do not support the UTF-8 BOM mark correctly - see

http://bugs.sun.com/view_bug.do?bug_id=4508058
http://bugs.sun.com/view_bug.do?bug_id=6378911
http://en.wikipedia.org/wiki/Byte-order_mark#UTF-8

Sadly, the net result of the evaluation is that NetBeans XML support should warn if a document contains a BOM sequence at the start; even if NB worked around this JDK defect, JAXP would not parse the XML correctly at application runtime.

I'll commit the encoding detection fix; it won't harm and it improves the code's correctness. However, I have to mark the issue as an enhancement reporting a JDK-unsupported feature rather than a fix for the use case, sorry.

Encoding detection improved by http://hg.netbeans.org/jet-main/rev/6bf6bd1eac3f

Hi Svata,

I just want to confirm the current status in NB 8.0.1 (and thanks for the encoding-detection fix):

```
              | encoding specified | no encoding specified
--------------+--------------------+-----------------------
UTF-8         | OK                 | OK
UTF-8 BOM     | 1)                 | 1)
UTF-16 LE BOM | OK                 | OK
UTF-16 BE BOM | OK                 | OK
```

1) "Content is not allowed in prolog."

It might be a good idea to warn the user if an XML file with a BOM is detected? Maybe as a configurable hint?

Regards,
René
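For completeness, the usual application-side workaround for the JDK defect cited above is to strip the leading UTF-8 BOM before handing the stream to JAXP, which otherwise fails with "Content is not allowed in prolog." A minimal hand-rolled sketch (Apache Commons IO ships a ready-made `BOMInputStream` for the same purpose; the class name below is invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * Sketch of a workaround for the JDK's missing UTF-8 BOM handling:
 * consume a leading EF BB BF sequence, if present, before the stream
 * reaches an XML parser. Class name is illustrative.
 */
public final class SkipUtf8Bom {

    /** Returns a stream positioned after a UTF-8 BOM, if one is present. */
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // no BOM: push the bytes back unchanged
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'r', '/', '>'};
        InputStream in = skipBom(new ByteArrayInputStream(doc));
        System.out.println((char) in.read()); // prints '<' -- the BOM was consumed
    }
}
```

Wrapping the stream this way before `DocumentBuilder.parse(...)` would let such documents parse at runtime, but it does not change the situation in the IDE, which is why the warning proposed in this report remains useful.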