This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.
See url for description. This is kind request for inception review for Parsing API.
List of Parsing API reviewers: Vladimir Kvashin (C++) Frank Kieviet (JCAPS) Torbjorn Norbye (GSF) Petr Pisl (PHP, Web) Marek Fukala (HTML,CSS, Web) David Strupl (JavaFX)
For C++ it will be Vladimir, but probably Voskresenskyj :)
I've read through the javadoc (for the current contents of the jet-parsing-api repository, and in particular the parsing.api module). I also looked at some usages of the API in the languages/ and java.source/ modules. I still have a lot of questions... I think it would help if the Javadocs were beefed up a bit, with actual examples here and there, and in general, more details. In any case, here are my questions and comments. Some of these are probably stupid questions since I don't fully understand everything. TN01: Task/UserTask/MultiUserTask: There's a run() method. Shouldn't there be a cancel() method? I assume some of the tasks can be slow, so when you for example switch away from this task it ought to be cancelled. A task which isn't capable of cancelling itself can simply ignore the cancel() method, but there should be a way for the infrastructure to take advantage of cooperative tasks willing to cancel when the result is known to be obsolete. The cancel method should perhaps pass a context object passing some information about why the task was cancelled. File edited, Document Closed, Document Hidden are some possible reasons, and I can see tasks choosing to do different things based on these. Most tasks will probably just abort no matter what when they see a cancel, but for example a task which updates the persistent store (something like a Lucene index) will probably want to update itself for closed or hidden documents, but maybe not edited. TN02: Parser.parse(...), Parser.getResult(...) There should be a way to register errors and warnings encountered during compilation. I see that there is a ParserException class - but I don't think that's the right way to do it. First, a compiler can often generate hundreds of warnings for a file, and certainly more than one. Throwing an exception during parsing (to indicate an error) means that you cannot register multiple errors without breaking control flow, and besides, exceptions should be used for unexpected conditions - and inside an IDE (where parsers are called while the user is typing source) will very often have to deal with errors. Thus, I think there should be a way to register errors and warnings during parsing. It should probably be stored on the Parser.Result object such that a client can look at a parser result and see if there were any fatal or non fatal errors or warnings associated with the compilation. TN03: Parser.Result What does the invalidate() method do? What does it mean that a parser result is invalid, that its source has changed so the result is no longer accurate? One area I'm a bit confused about (which may be why I have questions above) is really what the lifetime of a Parser is. It looks from the ParserFactory like a Parser is created for a specific set of sources. The "getResult" and "invalidate" methods make it look like it can keep calling getResult and invalidate and after each invalidate call, the sources should be reparsed. This seems a bit cumbersome to me. If the results are really to be invalidated, why don't we just have a single parse method and call it with the list of sources each time they need to be parsed? Or if we need to keep track of the list of sources as a unit (such that a parser implementation can share state between parse jobs) perhaps we can have a ParserContext object that we pass around? TN04: ResultIterator I thought this would be an actual Iterator (e.g. hasNext, next, hasPrevious, previous). It should probably have a different name. I'm slightly confused about what it is. It seems like some sort of result collection, but with an unusual way to iterate it. It seems to me that what you really want is to obtain the parser result for a particular mime type. I looked at the code in java.source and it was doing a breadth first search of results, looking for the Java mime type. Wouldn't it be better if you just had a MultiParserResult object with a Parser.Result getResult(String mimeType) method on it which let you directly access the parser result(s) you are interested in? TN05: EmbeddingProvider, and language embedding in general In general, I'm a bit confused about the Language Embedding story - and I couldn't really find any examples of this being used in the jet-parsing-api sources. Several of the javadocs pointed to an example for how to create an embedding or list of sources with some different mimetypes, but it would be great to see a more complete example. Specifically: I assume EmbeddingProviders are the translators capable of letting say the Java parser get a "Java view" of a JSP such that it can parse it using normal Javac. Let's say I have many EmbeddingProviders in the system, some handling JavaScript, some handling Java, some handling Ruby, etc. How are the Embedding Providers chosen by the system? I see there is a getPriority() method - what is that used for - task priority or choosing between EmbeddingProviders or something else? I realize that EmbeddingProviders are registered in a mime folder under Editors, so there is some sort of mime type associated with them, but I expected there to be both "input" and "output" mimetypes for these. Here's what the similar class to EmbeddingProvider looks like in GSF: public interface EmbeddingModel { /** This model creates target source code of the given mime type */ @NonNull String getTargetMimeType(); /** This model is capable of translating to the {@link #getTargetMimeType} * from the following source mime types. */ @NonNull Set<String> getSourceMimeTypes(); /** Produce the source code (of mime type {@link #getTargetMimeType} * from the given document (which must have outer mime type of one * of the given {@link #getSourceMimeTypes}. */ @NonNull Collection<? extends TranslatedSource> translate(Document doc); } The point here is that an embedding model is capable of -creating- a model of a certain mime type. But it can only do that for certain -types- of documents. For example, the JavaScript embedding provider in GSF has a target mime type of text/javascript, but it can handle input mime types for HTML, JSP, RHTML and PHP. When GSF needs to parse a certain language section, it looks at the target mime type, and looks at the source it has, and then does a table lookup among the source providers it has to produce an appropriate embedding. In theory this could involve multiple steps. Let's say there was no direct GSP (Grails) to JavaScript translator, but there is one which takes GSP to HTML. GSF could then first invoke the GSP to HTML translator, and then HTML to JavaScript. I'm assuming the Parsing API will need to do this as well, but with the current APIs I don't see how the selection can be made. TN06: Incremental Parsing The other thing I'm thinking about is how to handle incremental parsing. This is something I've been thinking I need to add to GSF now. I've added some type analysis for Ruby and JavaScript, but I'm really running into problems where it's just too expensive to do full-file analysis of types. In order to go to the next level (e.g. looking up return types for all the function calls I see in a file) I really need to do incremental parsing - so that's one of the things I'd really like to address in GSF - and obviously in the Parsing API as well. So, do you think we can add this to the parsing API? I think Tomas must have a lot of experience with this after adding incremental parsing support in Retouche for 6.1, right? I haven't started this work for GSF, but my thinking was that on the implementation side, I would keep track of the "damaged areas" since the most recent parse, and pass that to the parser. I would ALSO pass it to all the user tasks, who can use the same information (and extra information from their incremental parser) to limit the work they have to do to update the corresponding data structures (e.g. semantic highlighting only needs to be redone for the local function - unless there are reasons it needs to go looking outside the current function, etc.)
JavaDoc is updated. I am still working on documentation. TN01: UserTask and MultiUserTask both represents so called user tasks. User tasks are not cancelable. All other tasks extends SchedulerTask class, so they are cancelable. Storage updater task should extends SchedulerTask too. There will be some special friend contract for this task, probably. Tomas already has some implementation for Java module.
TN02: ParseException should be used when there is some unrecoverable error only (IOError, OutOfMemory...). It should not be used for regular errors and warnings in user code. Such errors should be collected in in parser result implementation. I do not want to add such support to Parsing API, because Parsing API has no support for visualization of errors / warnings.
TN03: Parser is created for one, or more Snapshots. When some Source (snapshots.getSource ()) is changed, parser.parse method is called once and parser.getResult () method is called for each task registered for current mimeType. parser.getResult should always return new instance of result. result.invalidate () is called when task is executed. So it looks like: parser.parse (); for (Task task : tasks) { Result result = parser.getResult (task); task.run (result); result.invalidate (); }
TN04: ResultIterator You are right, the name of this class may be confusing. I will look at it. Result iterator allow you to "iterate" over the tree of embeddings. You can start parser for current level, or you can iterate over all embedded blocks of code. It should not be a collection of results, because you sometimes do not need to parse everything. Parser.Result MultiParserResult.getResult(String mimeType) is not sufficient, I think. You may have more Embeddings with the same mime type in one file, for example (they can be merged to one Embedding, but separate blocks should be supported too). And MultiParserResult.getResult(String mimeType) does not give you full info. about all embeddings. Thats why I prefer current solution. But we can add MultiParserResult.getFirstResult(String mimeType) as "utility" method, of course.
VK01: Need for multiple parsing threads For C/C++, in the case machine has multiple CPUs or cores, parallel parsing in two or several threads is used now. This is essential for C/C++ support, because projects often are quite large (tens thousands of files) and C/C++ parsing is complex. As I understand, the proposed model is single threaded. Is this correct?
TN05: EmbeddingProvider, and language embedding in general I think that our two models are equivalent. In parsing API you just implement HTMLToJavaScriptEmbeddingProvider, and register it to text/html. Then you should have PHPToHTMLEmbeddingProvider registerred in text/php, and RubyToHTMLEmbeddingProvider registered in text/ruby. HTMLToJavaScriptEmbeddingProvider is able to recognize blocks of JavaScript code embedded in HTML, and PHPToHTMLEmbeddingProvider returns blocks of HTML code in RHTML. But you can implement PHPToJavaScriptEmbeddingProvider and register it to text/php too, if you need to do some special computations there (PHP>HTML>JavaScript). Or you can register PHPToJavaScriptEmbeddingProvider to text/html/text/php. ------------------------------- EmbeddingProvider.getPriority () is used to sort embedding provider together with other tasks. Than Parsing API client can exactly define, what should be done sooner: * parsing of embedded language or top level one? * execution of tasks for inner language, or outer one? * etc..
VK01: The generic Parsing API cannot be multithreaded since most of the parsers are not thread safe,same of them are not even reentrant. So the Parsing API protects the SPI (Parser) to be called from multiple thread at given time. But you as an implementor of the Parser SPI, if you have a thread save parser which parses files in parallel, can still use thread pool to do the parsing using simple fork join pattern. The parsing API ensures that the your Parser.parse will be called in a thread save way. In the parse method you can fork the work into thread pool and await the result from the workers by the thread calling the Parser.parse.
TN01: Task/UserTask/MultiUserTask - are user Tasks which have highest priority, they are not the factories task. Like in retouche they are not cancellable, the reason is that it is impossible to say which user task can interrupt other user task. TN02: The ParseException is not designed for warning or error in source code, it's rather a parser abort, parser fatal error from which the parser cannot heal. The warning and errors should be collected in the returned Parser.Result TN03: Parser.Result.invalidate - comes from retouche NB 6.1, should be protected - not sure if it's not public now. It's used to verify confinement of the Parser.Result passed to the client (caller of the parsing API). The goal is to prevent clients to hijack the Result outside the critical section and prevent race conditions. The invalidate has nothing to do with editor changes. TN04: Renaming ResultIterator to something else is good idea. +1 TN06: Yes, currently JavaParser handles partial reparse itself, but this is just a workaround to get it work somehow. The Parser.parse method was extended by SchedulerEvent (not sure about correctness of the name) by which the infrastructure sends hints to the parser about changes, currently there is an TokenHierarchySchedulerEvent which is passed to the parser in case of changes in editor and provides information about added, removed, changed tokens. JavacParser does not use it yet, but I will rewrite it soon.
VK01: We don't use several threads to parse a single file. But parsing in parallel is useful when a lot of files should be parsed (which occurs mostly after user opens the project or changes some parsing-relevant properties). I don't see how your proposal work in this case.
VK02: Who is responsible for identifying the list of documents to be parsed? One more C/C++ specific is #including headers. A header file, which is included via #include directive, may not be in the project file list. Nevertheless it should be treated as source; for example, it should be reparsed when changed, etc. But nobody except the parser itself nows about this. How can this situation be solved in the model you propose?
Y01 Provide reasonable usecases - I have a feeling that the first part of the wiki page describes the constraints, and the second part solution (at least it talks about Parser Request). I do not want to talk about the code, neither read it at this stage. I'd like some description of what is the usecase and how it will be handled on a general level. Like "people will get an instance of blablabla, and send a callback interface to it. It will be executed synchronously/asynchronously, blablabla". I miss this kind of information in the provided documents. Y02 Specify dependencies on other projects. Give me list of modules that you want to introduce, their dependencies among themselves, and the stability of each API that will export. Which cluster these modules belong? Highlevel answer is enough. Y03 Compatibility. Do you have some statement with respect to compatibility? Like, clients of Java infrastructure will not be affected at all, including binary and functional compatibility. Can you say the same about clients of 6.1's GSF?
Y01: Usecases are described in arch.xml. I can add more usecases there... Y02: Dependencies are specified in the same document. Y03: I am not able to specify compatibility of Java module and GSF. Its question for Tomas and Tor. As far as I know Tomas is trying to do fully compatible bridge for Java module APIs. GSF API is not public/official/stable, so changes are possible, I think. But it depends on Tor, if he wont to create bridge or just remove some parts of GSF API.
VK02: Who is responsible for identifying the list of documents to be parsed? Parser. Parser should compute dependencies and parse them. Parser should listen on these files and fire changes (Parser.addChangeListener).
VK01: What to do when user opens project... Current version of Parsing API does not have direct support for this usecase (so you should probably use some proprietary hack, like Java impl. and GSF). Tomas would like to design some new API for this purpose in the future (not for next release). There is some possibility how to do it even in the current Parsing API. You can implement special TaskScheduler listening on Repository changes, and register some special SchedulerTask that will store result of parsing to some persistent storage. But this solution is not tested / supported, so I can not recommend it.
During the review should also discuss some high level things JB01: Which modules will use this API in 6.5 time frame? (Java, GSF, who else?) JB02: Which modules will use this API in post 6.5 (7.0)? JB03: Will be the Parsing API exposed by GSF or just used
Jan wrote: "TN02: ParseException should be used when there is some unrecoverable error only (IOError, OutOfMemory...). It should not be used for regular errors and warnings in user code. Such errors should be collected in in parser result implementation. I do not want to add such support to Parsing API, because Parsing API has no support for visualization of errors / warnings." How would somebody write a visualization of errors/warnings without it? For example, GSF wants to automatically show errors in the editor, and in the tasklist. If the information isn't in the parsing result, it would need to define its own ParserErrorResult interface and require all clients to implement that in their parser result classes.
Some more comments: TN07: This is just a follow-up to your answer to TN03: So the point of "invalidate" is to say that the Parser Result is no longer used - e.g. free any resources you've stored in the parser result? (A C++ destructor or sorts, or something like a "close" method on various resources). TN08: The documentation (and the design of the result iterator) mentions the possibility of nested embeddings. Can you give some examples of where this need arises? TN09: As a meta comment, I think it's really unfortunate that the parsing API doesn't address indexing. I think most language supports needing parsing will also need indexing - after all, it's vital for support cross-file operations like go to declaration, code completion, etc. The big problem I have trying to retrofit GSF to use the Parsing API is that the Parsing API only addresses a -part- of what GSF (the Retouche fork part) does. I can't just replace it, I'll have to do a lot of surgery in there, and that makes me worried and I still end up with a lot of code to maintain. It's also unfortunate since a big point of the parsing API is to avoid code duplication, and if there's no indexing in the parsing api, then the clients (java, C/C++, GSF, others?) will all duplicate the indexing code. TN10: I still don't understand embedding. I took a look at your latest javadoc, but I still don't see a concrete example of how this will work. Would you mind writing up a much more detailed example? I will write one for how GSF works here to illustrate the kind of detail I'm looking for. I'd like to see the same thing expressed using the Parsing API. Say you have this source RHTML file (could be JSP, but I don't know JSPs very well - but I think the <% %> syntax to insert server side Ruby/Java calls are the same) <p> <script type="text/javascript" src="javascripts/effects.js"></script> <style>body { background: red }</style> <script>var x = <% foo() %>;</script> <span>Hello</span> <script> function f() { alert(x.toString()); } </script> <input type="submit" onclick="f()"/> What does the embedding scenario look like for this in the Parsing API? What are the Sources, Snapshots, Embeddings? Offsets? In GSF, you will involve several source models here. It will create a separate "virtual source" for each of these languages: CSS, JavaScript, Ruby The JavaScript virtual source (which is passed to the JavaScript parser) looks something like this: __netbeans_import__('javascripts/effects.js'); var x = __UNDEFINED__; function f() { alert(x.toString()); } ;function(){ f() } The virtual source translator from RHTML to JavaScript replaced the server call <% foo %> with __UNDEFINED, and the onclick handlers have been translated into anonymous functions (such that writing "return this" etc. in an event handler makes sense from an ast perspective, you can only return inside functions). Similarly, the Ruby embedding model (RHTML to Ruby) will create this: class ActionView::Base _buf='';;_buf << '<p> <script type="text/javascript" src="javascripts/effects.js"></script> <style>body { background: red }</style> <script>var x = '; foo() ;_buf << ';</script> <span>Hello</span> <script> function f() { alert(x.toString()); } </script> <input type="submit" onclick="f()"/> '; end In both cases, some of the code is coming from the original source, some is generated (e.g. the class ActionView::Base part above, and the netbeans_import to record JavaScript file inclusion and functions to contain event handlers). Some source is removed - typically embedded source that isn't relevant for this language -- such as all the markup. These blocks of code are maintained by each embedding model, which has a set of code blocks with corresponding offsets in the generated source and in the original source. You get a map like this: codeBlocks= [CodeBlockData[ RHTML(0,0)="", RUBY(0,31)="class ActionView::Base _buf='';"], CodeBlockData[ RHTML(0,130)="<p> <script type="text/javascript" src="javascripts/effects.js"></script> <style>body { background: red }</style> <script>var x = ", RUBY(31,174)=";_buf << '<p> <script type="text/javascript" src="javascripts/effects.js"></script> <style>body { background: red }</style> <script>var x = '; "], CodeBlockData[ RHTML(132,139)=" foo() ", RUBY(174,181)=" foo() "], CodeBlockData[ RHTML(141,266)=";</script> <span>Hello</span> <script> function f() { alert(x.toString()); } </script> <input type="submit" onclick="f()"/> ", RUBY(181,319)=";_buf << ';</script> <span>Hello</span> <script> function f() { alert(x.toString()); } </script> <input type="submit" onclick="f()"/> '; "]] So, can you show us what this will look like with the Parsing API? The only example I could find was one which constructs an Embedding consisting of three fragments of Java, but I'd like to understand how the virtual source for each language is constructed, how nesting of embeddings etc. come into play, and so on. (In GSF, a virtual source translator can return a collection of models which is used when you don't create a single "synthesized" model like we do here - e.g. the multiple <script> blocks in JavaScript get merged into one. For CSS, the theory is that each style="foo:bar" block would generate its own model - these are all completely independent and don't have to (and shouldn't) be parsed together. Thus, for CSS you get a series of models - at least I believe that's the case; Marek wrote the CSS embedding model so he'd know the details.)
Just reminder, that review starts today at 5pm CEST.
I created wiki page for this review http://wiki.netbeans.org/ParsingAPIReview
MF01: EmbeddingProvider has following method for getting a list of Embeddings in the snapshot: public abstract List<Embedding> getEmbeddings (Snapshot snapshot); This is IMO OK, but there are some usecases where creation of *all* Embeddings in the snapshot is not necessary. In another words if there are more independent virtual sources in the code (for example the contents of HTML style attributes) there is no need to create Embedding for all of them. Just the one specified by offset is enought. Maybe I do not understand the whole logic, please explain how this will be done if I am wrong, but something like public abstract Embedding getEmbedding (Snapshot snapshot, int offset); would help. MF02: Would it be possible to add a small step-by-step description how to implement and use the PAPI for some simple language? I belive this would help a lot.
TN07: This means that any successive call to Result will lead to Exception signaling that user violates confinement. TN09: There are two problems with indexing: a) Resources - we don't have resources to migrate java (retouche) RepositoryUpdater and stabilize it in NB 6.5 b) It's not clear how to do the common indexing - the RepositoryUpdater, even the older version which is copied into the GSF, is still very java specific and may be problematic in general (topological sort of root). c) Adding support of indexing will be compatible change of parsing API, so the huge change will be split into two smaller but still huge changes, currently there is no cost to use the current code (RepositoryUpdater) with new parsing API as java (retouche) does. d) Performance - the Java RepositoryUpdater needs to operate on java.io classes to perform in reasonable way, the GSF works on Document level, so there is need to have several kind of interfaces. Also dependency rebuild (RebuildOraculum) may become performance problem. The embedding questions (8,10) should be answered by Hanz, they are not java related, I have no much experience with embedding, sorry.
ParserErrorResult: The java clients (hints, editor) needs to have access to the java specific error type (Diagnostic) which provides also AST node causing the error. Diagnostic will provide structured error (warning) description + possible fix in future (JDK 7.0) - provides much more value than generic ParserErrorResult.
API was accepted with TCRs. See http://wiki.netbeans.org/ParsingAPIReview for details.
TN7: Yes exactly. It was Java team requirement. TN8: I am not sure I understand you. For example JavaScpript in HTML in JSP? Or you question is about representation of nested embeddings? You can create one plain structure of embeddings of course, if you want... TN9: I agree with you, we will look at it. TN10: There are many ways how to implement this usecase in Parsing API. One of them is: 1) There is one Source instance representing your foo.rhtml file. 2) Snapshot for whole file is created when the file is opened (or modified, ...). 3) RHTMLToJavaScriptEmbeddingProvider is registered for "text/rhtml". It creates one Embedding: class RHTMLToJavaScriptEmbeddingProvider extends EmbeddingProvider { List<Embedding> getEmbeddings (Snapshot snapshot) { return Arrays.asoList (new Embedding[] { Embedding.create (Arrays.asList (new Embedding[] { snapshot.create ("__netbeans_import__(\'", "text/javascript"), // generates virtual part of embedding snapshot.create (33, 11, "text/javascript"), // points to original text "javascripts/effects.js" snapshot.create ("\');\n", "text/javascript"), // virtual (generated) code again snapshot.create (66, 22, "text/javascript"), // points to original text "var x = " ..... })); }); } } 4) RHTMLToRubyEmbeddingProvider is registered for "text/rhtml". Implementation is similair to RHTMLToJavaScript provider. Each EmbeddingProvider can generate one or more blocks of embedded code. You can have more EmbeddingProviders registered for one language. Is it OK for you?
MF01: I do not understand your question. You can create one embedding for more separated blocks of code.