We have the following use case in the JSP editor: <!-- <h1> <%="hello"%> </h1> --> JspLexer now returns JspTokenId.TEXT for the first part (up to <%=), then JspTokenId.SCRIPTING, and then JspTokenId.TEXT again for the rest. The TEXT tokens are then used for the HTMLLanguage embedding. We need to preserve the end lexer state from the first block of HTML and carry it over to the second one (so that the text after <%="hello"%> is marked as a comment). Currently the lexer simply starts lexing each embedding block from the INIT state. It is very important for the JSP editor (and probably not only for this one) to have such functionality.
I understand this requirement; in some sense it is similar to handling the embedding of javadoc comments in Java. The way that I have explored already: as the HTML comment is treated as a single token, this requirement leads to the possibility of a token consisting of individual parts that may be located at different places in the document and that need to be merged together prior to actual lexing. I spent several months in the past trying to implement this model and it is a nightmare. Not only is the implementation complicated, but the usage is difficult too: the fact that a single token's content is spread across the document complicates things for the clients and also complicates the notification about changes in the token list. So I would not like to go this way again. Instead I would like to retain the token hierarchy as a regular tree of tokens, but of course we will need to add additional information to the token so that the user is able to figure out that the token is a part of something bigger. I propose to have an enum TokenPartInfo { COMPLETE, START, INNER, END } and Token.getPartInfo() returning it. The TokenFactory will contain additional methods to give the part info at token creation time. The thing that I especially like about this solution is that we can conveniently use this mechanism to express incomplete tokens - they are just the START part of regular tokens. This will allow us to remove the *_INCOMPLETE token ids completely! That is good because the incomplete token ids clutter the language definition (it's necessary to check for them etc.). Now the lexer will have to be able to restart with the information that it continues an incomplete token.
Although the lexer could express that by going into a special state recorded after the last incomplete token in the lexed embedded section, I don't like this solution much, because many lexers only have a null state and this extra state after the last token would be the only non-null state that they produce. For lexers with a null state there is an optimization: the state is simply not stored at all. Although I could exploit the fact that the extra state occurs only after the last token and treat that last state specially, it's still complicated. Instead I propose to pass the last incomplete token instance from the previous section to the lexer at the time of its construction. Now the question is how to express that the sections should be connected and how to extend the notification model to cover this requirement.
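A minimal sketch of the proposal above, not the actual lexer module code: only the enum TokenPartInfo and its four constants come from the comment, while PartToken, ToyHtmlLexer and its lex() method are hypothetical names invented for illustration. The toy lexer knows only TEXT and <!-- ... --> COMMENT tokens and receives the last token of the previous section instead of a stored state.

```java
import java.util.ArrayList;
import java.util.List;

/** Part marker as proposed in the comment above. */
enum TokenPartInfo { COMPLETE, START, INNER, END }

/** Hypothetical, simplified token: an id, its text and its part info. */
final class PartToken {
    final String id;
    final String text;
    final TokenPartInfo partInfo;

    PartToken(String id, String text, TokenPartInfo partInfo) {
        this.id = id;
        this.text = text;
        this.partInfo = partInfo;
    }
}

/** Hypothetical toy lexer that only knows TEXT and HTML comments. */
final class ToyHtmlLexer {
    /**
     * Lexes one embedded section. 'previous' is the last token of the
     * preceding section, or null when there is none (assumed convention).
     */
    static List<PartToken> lex(String section, PartToken previous) {
        List<PartToken> out = new ArrayList<>();
        // Continue a comment only if the previous section ended inside one.
        boolean continuing = previous != null
                && "COMMENT".equals(previous.id)
                && (previous.partInfo == TokenPartInfo.START
                    || previous.partInfo == TokenPartInfo.INNER);
        int pos = 0;
        while (pos < section.length()) {
            if (continuing || section.startsWith("<!--", pos)) {
                // Skip the opening "<!--" unless we resume inside a comment.
                int from = continuing ? pos : pos + 4;
                int close = section.indexOf("-->", from);
                int end = close < 0 ? section.length() : close + 3;
                TokenPartInfo part;
                if (continuing) {
                    part = close < 0 ? TokenPartInfo.INNER : TokenPartInfo.END;
                } else {
                    part = close < 0 ? TokenPartInfo.START : TokenPartInfo.COMPLETE;
                }
                out.add(new PartToken("COMMENT", section.substring(pos, end), part));
                pos = end;
                continuing = false;
            } else {
                int next = section.indexOf("<!--", pos);
                int end = next < 0 ? section.length() : next;
                out.add(new PartToken("TEXT", section.substring(pos, end),
                        TokenPartInfo.COMPLETE));
                pos = end;
            }
        }
        return out;
    }
}

final class PartInfoDemo {
    public static void main(String[] args) {
        List<PartToken> first = ToyHtmlLexer.lex("<!-- <h1> ", null);
        PartToken last = first.get(first.size() - 1);
        for (PartToken t : ToyHtmlLexer.lex(" </h1> --> tail", last)) {
            System.out.println(t.id + " " + t.partInfo + " [" + t.text + "]");
        }
    }
}
```

Note that no *_INCOMPLETE id is needed in this sketch: an unterminated comment is simply a COMMENT token whose part info is START or INNER.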
> The way that I have explored already: As the html comment is treated as a single
> token this requirement leads to possibility to have token consisting of
> individual parts that may possibly be located at different places in the
> document and that need to be merged together prior actual lexing. I've spent
> several months in the past trying to implement this model and it is a nightmare.
Actually, I do not require exactly what you described. I have no problem with having the entire HTML comment split into several HTML comment tokens. So the first HTML comment token would end just before the '<%...' characters and another one would start just after '...%>'. The problem is that I need to start lexing the second part of the comment in the same lexer state in which the lexer stopped in the previous HTML comment part. Maybe implementing this would be easier than what you described. What would happen, in case you implement the 'continuous token', when the '<%' symbol appears? The embedding for it will be null (or at least not HTML), so the lexer should somehow stop lexing. Would it receive EOL? Or what? If it gets EOL, it will create the HTML token anyway. Maybe I am too use-case-oriented and so do not have the entire context; you are the god of lexing, so please do what is right ;-).
> The problem is that I need to start lexing
> of the second part of the comment in the same lexer state when the lexer stopped
> in the previous html comment part. Maybe implementing this would be easier than
> when you described.
Yes, I understand that, and as I described in the third paragraph the lexer would be given the state (which would of course be the state after the last token in the previous section) and the last token from the previous section (to cover the stateless lexers efficiently). So this may be a bit of extra work for the lexer, which has to check the token from the previous section, but it's negligible.
> What would happen in the case you implement the 'continuous token' when the
> '<%' symbol appears? The embedding for it will be null (or at least not -html)
> so the lexer should somehow stop lexing. Would it receive EOL? or what? If it
> gets EOL, it will create the HTML token anyway.
The lexer will get EOF at the end of each embedded HTML section, so it will be forced to create an incomplete HTML comment token. This is the difference from "The way that I have explored already:" (read this as "the way I don't want to implement" :)), where the lexer would not notice the borders between the embedded sections at all - it would just see one long character sequence consisting of all the embedded sections concatenated. That would be nice but too hard to implement. In practice the lexer must be able to return incomplete tokens anyway (to be able to tokenize characters up to the end of the input), so this should add no extra burden to the lexer's complexity.
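To illustrate the behavior described above (EOF at the end of each embedded section, with the end state handed to the next section), here is a rough sketch. HtmlState, SectionResult and StatefulSectionLexer are hypothetical names for illustration only and are not classes of the lexer module; tokens are plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lexer states; INIT is the state a section starts in today. */
enum HtmlState { INIT, IN_COMMENT }

/** Tokens of one section plus the lexer state reached at the section's EOF. */
final class SectionResult {
    final List<String> tokens = new ArrayList<>();
    HtmlState endState = HtmlState.INIT;
}

final class StatefulSectionLexer {
    /** Lexes a single section, starting in the given state. */
    static SectionResult lex(String text, HtmlState startState) {
        SectionResult r = new SectionResult();
        HtmlState state = startState;
        int pos = 0;
        while (pos < text.length()) {
            if (state == HtmlState.INIT && !text.startsWith("<!--", pos)) {
                // Plain text up to the next comment start or the section's EOF.
                int open = text.indexOf("<!--", pos);
                int end = open < 0 ? text.length() : open;
                r.tokens.add("TEXT[" + text.substring(pos, end) + "]");
                pos = end;
                continue;
            }
            // Either resuming inside a comment or standing at a fresh "<!--".
            int from = state == HtmlState.IN_COMMENT ? pos : pos + 4;
            int close = text.indexOf("-->", from);
            int end = close < 0 ? text.length() : close + 3;
            // At EOF without "-->" this is an incomplete comment token.
            r.tokens.add("COMMENT[" + text.substring(pos, end) + "]");
            state = close < 0 ? HtmlState.IN_COMMENT : HtmlState.INIT;
            pos = end;
        }
        r.endState = state;
        return r;
    }

    /** Lexes sections in order, passing each end state to the next section. */
    static List<String> lexAll(List<String> sections) {
        List<String> all = new ArrayList<>();
        HtmlState state = HtmlState.INIT;
        for (String section : sections) {
            SectionResult r = lex(section, state);
            all.addAll(r.tokens);
            state = r.endState;
        }
        return all;
    }
}

final class StateTransferDemo {
    public static void main(String[] args) {
        List<String> sections = new ArrayList<>();
        sections.add("<!-- <h1> ");   // HTML before <%="hello"%>
        sections.add(" </h1> -->");   // HTML after it
        System.out.println(StatefulSectionLexer.lexAll(sections));
    }
}
```

Lexing the second section from INIT instead would yield a TEXT token for " </h1> -->", which is the misbehavior reported in this issue.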
Marek, we should first clarify whether it's enough to attempt to join ALL the sections with a particular language path, i.e. "text/x-jsp/text/html", or whether there are any external criteria that define additional conditions for joining. I only assume simple cases like this: <!-- Comment start <% System.out.println("Nazdar"); %> still in comment --> but are there any more complicated cases where the HTML sections would NOT be joined? If there is only simple joining, then I would propose to add "boolean joinSections()" to LanguageEmbedding (attaching patch) and modify the infrastructure to comply. I'm not sure whether any changes are needed at the API level, but I would first like to clarify the SPI.
Created attachment 35602 [details] Diff adding LanguageEmbedding.joinSections()
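As a rough illustration of how the infrastructure might honor such a flag, the sketch below groups sections by language path when joining is requested and leaves the others on their own. It does not reproduce the attached diff: Section and JoinPlanner are hypothetical stand-ins, and only the joinSections() name and the example language path come from the comment above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one embedded section and its embedding settings. */
final class Section {
    final String languagePath;
    final String text;
    private final boolean joinSections;

    Section(String languagePath, String text, boolean joinSections) {
        this.languagePath = languagePath;
        this.text = text;
        this.joinSections = joinSections;
    }

    /** Mirrors the proposed "boolean joinSections()" flag. */
    boolean joinSections() {
        return joinSections;
    }
}

final class JoinPlanner {
    /**
     * Groups sections that should be lexed as one logical input.
     * Sections requesting joining are grouped with ALL other joining sections
     * of the same language path; the rest stay in singleton groups.
     */
    static List<List<Section>> group(List<Section> sections) {
        Map<String, List<Section>> joinedByPath = new LinkedHashMap<>();
        List<List<Section>> groups = new ArrayList<>();
        for (Section s : sections) {
            if (s.joinSections()) {
                List<Section> g = joinedByPath.get(s.languagePath);
                if (g == null) {
                    g = new ArrayList<>();
                    joinedByPath.put(s.languagePath, g);
                    groups.add(g);
                }
                g.add(s);
            } else {
                List<Section> single = new ArrayList<>();
                single.add(s);
                groups.add(single);
            }
        }
        return groups;
    }
}

final class JoinPlannerDemo {
    public static void main(String[] args) {
        List<Section> sections = new ArrayList<>();
        sections.add(new Section("text/x-jsp/text/html", "<!-- Comment start ", true));
        sections.add(new Section("text/x-jsp/text/x-java", " System.out.println(\"Nazdar\"); ", false));
        sections.add(new Section("text/x-jsp/text/html", " still in comment -->", true));
        for (List<Section> g : JoinPlanner.group(sections)) {
            System.out.println(g.get(0).languagePath + ": " + g.size() + " section(s)");
        }
    }
}
```

The "text/x-jsp/text/x-java" path in the demo is an assumed example; the thread itself only names "text/x-jsp/text/html".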
IMHO all pieces of an embedded language with the same language path should be joined, in terms of passing the lexer state between each two adjacent sections. I do not see any use case which would require the joinSections() method, since you can always return a complete token and set the lexer state to INIT at the end of the section if you want. I am not sure about the stateless lexers; I do not know how they work, I am just talking about my use case.
The issue is neither a P1 nor a J1 stopper, since I partially fixed/worked around Issue #99526, which is affected by this issue - the issue is now just P2. We will need this issue fixed in M10 though.
Marking this issue as P1 - it really breaks many things - see the list of blocked issues.
What has been committed is the basic implementation that lexes each section individually and only transfers the state between the sections. As we've already discussed with Marek and Hanz, I will now work on the solution that will virtually join the sections so that the lexer will not see the individual sections' EOFs. I will also add some more tests and coordinate with Marek regarding possible problems with JSPs etc.
Checking in src/org/netbeans/lib/lexer/TokenHierarchyOperation.java; /cvs/lexer/src/org/netbeans/lib/lexer/TokenHierarchyOperation.java,v <-- TokenHierarchyOperation.java new revision: 1.18; previous revision: 1.17 done
Checking in src/org/netbeans/lib/lexer/LexerUtilsConstants.java; /cvs/lexer/src/org/netbeans/lib/lexer/LexerUtilsConstants.java,v <-- LexerUtilsConstants.java new revision: 1.16; previous revision: 1.15 done
RCS file: /cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java,v done
Checking in src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java; /cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java,v <-- EmbeddedLexerInputOperation.java initial revision: 1.1 done
Checking in src/org/netbeans/lib/lexer/TextLexerInputOperation.java; /cvs/lexer/src/org/netbeans/lib/lexer/TextLexerInputOperation.java,v <-- TextLexerInputOperation.java new revision: 1.5; previous revision: 1.4 done
Checking in src/org/netbeans/lib/lexer/SubSequenceTokenList.java; /cvs/lexer/src/org/netbeans/lib/lexer/SubSequenceTokenList.java,v <-- SubSequenceTokenList.java new revision: 1.11; previous revision: 1.10 done
Checking in src/org/netbeans/lib/lexer/LexerInputOperation.java; /cvs/lexer/src/org/netbeans/lib/lexer/LexerInputOperation.java,v <-- LexerInputOperation.java new revision: 1.11; previous revision: 1.10 done
Checking in src/org/netbeans/lib/lexer/LAState.java; /cvs/lexer/src/org/netbeans/lib/lexer/LAState.java,v <-- LAState.java new revision: 1.5; previous revision: 1.4 done
Checking in src/org/netbeans/lib/lexer/TokenListList.java; /cvs/lexer/src/org/netbeans/lib/lexer/TokenListList.java,v <-- TokenListList.java new revision: 1.4; previous revision: 1.3 done
Checking in src/org/netbeans/lib/lexer/EmbeddedTokenList.java; /cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedTokenList.java,v <-- EmbeddedTokenList.java new revision: 1.10; previous revision: 1.9 done
Checking in src/org/netbeans/lib/lexer/EmbeddingContainer.java; /cvs/lexer/src/org/netbeans/lib/lexer/EmbeddingContainer.java,v <-- EmbeddingContainer.java new revision: 1.10; previous revision: 1.9 done
RCS file: /cvs/lexer/src/org/netbeans/lib/lexer/TokenSequenceList.java,v done
Checking in src/org/netbeans/lib/lexer/TokenSequenceList.java; /cvs/lexer/src/org/netbeans/lib/lexer/TokenSequenceList.java,v <-- TokenSequenceList.java initial revision: 1.1 done
Checking in src/org/netbeans/lib/lexer/inc/TokenListUpdater.java; /cvs/lexer/src/org/netbeans/lib/lexer/inc/TokenListUpdater.java,v <-- TokenListUpdater.java new revision: 1.17; previous revision: 1.16 done
Checking in src/org/netbeans/lib/lexer/inc/IncTokenList.java; /cvs/lexer/src/org/netbeans/lib/lexer/inc/IncTokenList.java,v <-- IncTokenList.java new revision: 1.11; previous revision: 1.10 done
Checking in src/org/netbeans/lib/lexer/inc/MutableTokenList.java; /cvs/lexer/src/org/netbeans/lib/lexer/inc/MutableTokenList.java,v <-- MutableTokenList.java new revision: 1.5; previous revision: 1.4 done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java,v done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java,v <-- TestJoinSectionsTopTokenId.java initial revision: 1.1 done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java,v done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java,v <-- TestJoinSectionsTextTokenId.java initial revision: 1.1 done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java,v done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java,v <-- TestJoinSectionsTopLexer.java initial revision: 1.1 done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java,v done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java,v <-- TestJoinSectionsTextLexer.java initial revision: 1.1 done
Checking in src/org/netbeans/api/lexer/TokenChange.java; /cvs/lexer/src/org/netbeans/api/lexer/TokenChange.java,v <-- TokenChange.java new revision: 1.9; previous revision: 1.8 done
Checking in src/org/netbeans/api/lexer/LanguagePath.java; /cvs/lexer/src/org/netbeans/api/lexer/LanguagePath.java,v <-- LanguagePath.java new revision: 1.9; previous revision: 1.8 done
Checking in src/org/netbeans/api/lexer/TokenSequence.java; /cvs/lexer/src/org/netbeans/api/lexer/TokenSequence.java,v <-- TokenSequence.java new revision: 1.13; previous revision: 1.12 done
Checking in src/org/netbeans/api/lexer/TokenHierarchy.java; /cvs/lexer/src/org/netbeans/api/lexer/TokenHierarchy.java,v <-- TokenHierarchy.java new revision: 1.10; previous revision: 1.9 done
Checking in src/org/netbeans/spi/lexer/LexerRestartInfo.java; /cvs/lexer/src/org/netbeans/spi/lexer/LexerRestartInfo.java,v <-- LexerRestartInfo.java new revision: 1.3; previous revision: 1.2 done
Checking in src/org/netbeans/spi/lexer/LanguageEmbedding.java; /cvs/lexer/src/org/netbeans/spi/lexer/LanguageEmbedding.java,v <-- LanguageEmbedding.java new revision: 1.9; previous revision: 1.8 done
Checking in src/org/netbeans/spi/lexer/LanguageHierarchy.java; /cvs/lexer/src/org/netbeans/spi/lexer/LanguageHierarchy.java,v <-- LanguageHierarchy.java new revision: 1.14; previous revision: 1.13 done
Checking in test/unit/src/org/netbeans/lib/lexer/test/inc/TokenHierarchySnapshotTest.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/test/inc/TokenHierarchySnapshotTest.java,v <-- TokenHierarchySnapshotTest.java new revision: 1.9; previous revision: 1.8 done
Checking in nbproject/project.properties; /cvs/lexer/nbproject/project.properties,v <-- project.properties new revision: 1.15; previous revision: 1.14 done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java,v done
Checking in test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java,v <-- JoinSectionsTest.java initial revision: 1.1 done
Checking in test/unit/src/org/netbeans/lib/lexer/TokenSequenceListTest.java; /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/TokenSequenceListTest.java,v <-- TokenSequenceListTest.java new revision: 1.4; previous revision: 1.3 done
Checking in api/apichanges.xml; /cvs/lexer/api/apichanges.xml,v <-- apichanges.xml new revision: 1.21; previous revision: 1.20 done