#541 reFlux Incremental Compilation design notes

KevinKelley Fri 17 Apr 2009

I'm tired and my brane is burning, so I'll stop and talk about what I'm doing, for a while.

I've mentioned here before more or less what I'm doing: extending the Flux editor to bring it closer to being a full-featured IDE, and experimenting with some usability ideas I've got. Partly it's an excuse for a learning experience: it gives me practice with Fan and with features like closures that I haven't used much before. I'm also hoping to turn it into something that's a primary tool for my own development.

Three linked features are the folding editor, code completion, and incremental compilation. All three need semantic knowledge of the code, as opposed to just lexically breaking it down by token. So I'm dealing with all three more or less together; I've got folding working to an extent (types and slots), but to make it work best I need more semantic information, thus full parsing.

I've written enough parsers in the past that I don't feel any need to write another one. Also, with Fan changing as fast as it is, I really don't want to be chasing the standard, writing a custom parser and then rewriting it every time the language changes or grows. So I'm leveraging the existing compiler.

So. Fan's compiler is a pod-at-once process. Summarizing the steps, and leaving some out:

  • initialize CompilerInput with necessary state
  • tokenize the input
  • scan tokens once, just looking for usings and types
  • resolve dependencies and types, into an accurate types table
  • full parse, generating the AST
  • various checks, leading to generating code and writing the pod.

The steps are neatly broken out, and they can be called from my code, assuming I properly initialize the state. I'm doing that as far as the full parse, and it's working: I'm getting the full AST of the compilation unit I'm editing, which in turn means I can have little clicky-buttons beside the editor to fold away classes, methods, blocks... as far down as I'd care to take it.
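Roughly, driving the front half of the pipeline looks like the hedged sketch below. The step and field names (InitInput, Tokenize, ResolveDepends, ScanForUsingsAndTypes, ResolveImports, Parse, CompilerInput.srcStr, Loc) are my reading of the compiler pod's layout and may not match a given build exactly:

  using compiler

  // Hedged sketch: run the front half of the pipeline by hand and
  // stop after the parse, leaving the AST on the Compiler instance.
  class FrontEndOnly
  {
    static Compiler parseUnit(Str source)
    {
      input := CompilerInput()
      input.podName   = "scratch"
      input.version   = Version.fromStr("1.0")
      input.mode      = CompilerInputMode.str
      input.srcStr    = source
      input.srcStrLoc = Loc("editor")   // assumed Loc ctor
      c := Compiler(input)
      // same ordering Compiler.compile uses, minus the back end
      InitInput(c).run
      Tokenize(c).run
      ResolveDepends(c).run
      ScanForUsingsAndTypes(c).run
      ResolveImports(c).run
      Parse(c).run
      return c   // c.pod now holds the TypeDefs with full ASTs
    }
  }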

It also means I'll be able to use semantic info from the file being processed: for error highlighting, jumping from variables to definitions, and code completion that knows about methods, variables, and other types even before they're successfully compiled into a pod where reflection can see them.

Doing it just like that (using the existing CompilerSteps) is very fragile, though. Any edit that breaks the parse invalidates the AST, and until you've got a successful parse you've got basically no low-level info.

So. Tokenizing is easy; on an edit you can more or less just retokenize the segment of source text that changed.
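A minimal sketch of that splice, with a toy whitespace scanner standing in for the real tokenizer (Tok and all names here are invented for illustration; for brevity it rescans from the edit to end-of-file rather than stopping at the first realigned token):

  // toy token: just text plus a start offset
  class Tok
  {
    new make(Str text, Int start) { this.text = text; this.start = start }
    Str text
    Int start
    Int end() { return start + text.size }
  }

  class IncLex
  {
    // stand-in scanner: whitespace-separated words from pos to end
    static Tok[] scan(Str src, Int pos := 0)
    {
      toks := Tok[,]
      i := pos
      while (i < src.size)
      {
        while (i < src.size && src[i].isSpace) ++i
        s := i
        while (i < src.size && !src[i].isSpace) ++i
        if (i > s) toks.add(Tok(src[s..<i], s))
      }
      return toks
    }

    // keep every token that ends before the edit; rescan the rest
    static Tok[] retokenize(Tok[] old, Str newSrc, Int editStart)
    {
      keep := old.findAll |Tok t->Bool| { return t.end <= editStart }
      resumeAt := keep.isEmpty ? 0 : keep.last.end
      return keep.addAll(scan(newSrc, resumeAt))
    }
  }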

The high-level scan (breaking a compilation unit, i.e. a source file, into its unit declarations and the sequences of tokens that correspond to each type definition) is easy enough too; you could even do it by hand, to try to intelligently handle broken code. In fact that's pretty much what I'm doing. What I'm pretty sure I want to do next is go down to the method level: with a little setup, you can use the parser to parse just the tokens corresponding to an individual method. That should be very fast, possibly fast enough to do on every text change; at the least, run in a separate thread, it would give nearly instant error/info feedback. So that all seems cool.

The next big thing is to take that a little further. Say, when you open a source file for editing, initialize the compiler steps (in a background thread), compiling the rest of the sources and holding the process in abeyance until the file being edited has a successful (incremental) parse; then the background thread snapshots it and generates an up-to-date pod; rinse and repeat. With the method-level incremental parser in place, an edit that breaks a method just means the background compiler can drop that method out of the typedef and continue, thus keeping the system (and reflection, and so on) completely up-to-date with the state of the source code.
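The shape of that policy, as a sketch where everything is hypothetical glue (names invented; the real version would sit on the compiler steps above):

  // hypothetical background pass: method sources keyed by slot name
  class BackgroundBuild
  {
    [Str:Str] methods := [:]

    // one cycle: broken methods just fall out of the snapshot, so
    // the pod stays buildable and reflection stays current
    Str[] snapshot()
    {
      good := Str[,]
      methods.each |src, name| { if (parses(src)) good.add(name) }
      return good.sort
    }

    // stand-in for the real method-level incremental parse
    private Bool parses(Str src) { return !src.contains("@@") }
  }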

So anyhow, that's about a big enough post for today, and that's more or less what I'm doing these days.

JohnDG Fri 17 Apr 2009

In my opinion, the odds of building a robust incremental parser with proper error recovery on top of Fan's parser are 0. Either the Fan parser needs to substantially change, or an incremental error-correcting parser needs to be written.

But good luck in any case. :-)

KevinKelley Fri 17 Apr 2009

Well, it might depend too on what you mean by robust. I'm a single developer after all; it's not going to be what you'd expect from IBM. But it's looking very possible to do something that's at least pretty useful; and if nothing else I'm having a great time dinking around at it.

cheers...

tompalmer Sat 18 Apr 2009

Sounds awesome. If you really have the time, I say be brave. But just do what it takes to make recompiling/analyzing fast enough (while leaving full compiles fast, too -- since you probably don't want a long-term fork). No need to make things more complicated than necessary. What that level is, I don't know. I haven't worked through details.

tactics Tue 21 Apr 2009

I tried researching incremental compiler techniques the other day and came up short. It seems like the sort of thing there should be a well-established theory for...

I'd love to help work with you on developing an incremental parser library for Fan.

JohnDG Tue 21 Apr 2009

There is well-established theory. I'd suggest working on error recovery first, since that's the more pressing issue.

For that, some keywords to get you going: "error recovery parsers", "substring parsing". You could use Grammatica or ANTLR to generate the parser from a grammar (Grammatica has a cleaner design and is supposedly better at error recovery, but the tools and support are much better for ANTLR).

tactics Tue 21 Apr 2009

JohnDG, are you saying that error recovery is the first step in incremental parsing? Or are you saying that between writing this library and improving the existing compiler's error messages, the latter takes precedence?

I have a copy of the dragon book (both editions, actually), which does explain basic error recovery techniques. I will read through that chapter again.

JohnDG Tue 21 Apr 2009

Incremental parsing allows you to efficiently re-parse a document as it's being edited by the user. Many IDEs have incremental parsing because they are doing so many things behind the scenes that they can't afford the luxury of reparsing an entire file (which may be thousands of lines of code).

Error recovery/substring parsing allows you to "fill in the gaps", so to speak, and create a fully-formed AST, even in the presence of errors. This is even more important than incremental parsing, because you need this capability for "intellisense", refactoring with less-than-valid code, advanced syntax highlighting, brace/bracket completion, and many other features. (Keep in mind that 99% of the time, when you are editing code, it's not in a well-formed state, but you still need to provide the above features, and others like them, that help the developer get to the next well-formed state.)
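To make "fill in the gaps" concrete, here's a toy sketch in Fan of one classic flavor of this, panic-mode recovery (the grammar and every name below are invented for illustration): on an error, wrap the bad tokens in an error node and resync at the next statement boundary, so you still get a full AST.

  // toy recovering parser: statements are "id = id ;"; anything else
  // becomes an <error> node and we resync at the next ";"
  class Recovering
  {
    Str[] toks := [,]
    Int pos := 0

    Str[] parseStmts()
    {
      nodes := Str[,]
      while (pos < toks.size)
      {
        start := pos
        if (stmt) nodes.add("assign(" + toks[start..<pos].join(" ") + ")")
        else
        {
          while (pos < toks.size && toks[pos] != ";") ++pos  // panic: skip
          if (pos < toks.size) ++pos                         // eat the ";"
          nodes.add("<error: " + toks[start..<pos].join(" ") + ">")
        }
      }
      return nodes
    }

    private Bool stmt()
    {
      mark := pos
      if (id && eat("=") && id && eat(";")) return true
      pos = mark; return false
    }

    private Bool id() { ok := pos < toks.size && toks[pos][0].isAlpha; if (ok) ++pos; return ok }
    private Bool eat(Str s) { ok := pos < toks.size && toks[pos] == s; if (ok) ++pos; return ok }
  }

  // Recovering { toks = "a = b ; x x x ; c = d ;".split }.parseStmts
  // -> [assign(a = b ;), <error: x x x ;>, assign(c = d ;)]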

Error recovery doesn't have anything to do with improved error messages (although a parser with error recovery will give you all errors, not just the first one); it's more about being able to produce an AST (which will be used by many parts of the IDE) even when the source code is not perfectly formed.

If you use an existing package like Grammatica, you may be able to avoid knowing anything about error recovery. More important is the ability to write grammars without left-recursion, etc.

tactics Tue 21 Apr 2009

I see what you mean now. So, in a rough sense, what we need is similar to a web browser's rendering engine. Something that can take garbage input and return (in most cases) a well-formed AST.

Fan's grammar, as far as I can tell, is already free of left-recursion, so that shouldn't cause issues. It might matter that the grammar is ambiguous, as per <addExpr> := <multExpr> (("+" | "-") <multExpr>)*

I will take a look at Grammatica and ANTLR. It looks like Grammatica uses the GPL license. Would that affect our ability to incorporate it into the Fan base?

JohnDG Tue 21 Apr 2009

So, in a rough sense, what we need is similar to a web browser's rendering engine. Something that can take garbage input and return (in most cases) a well-formed AST.

Yep, that's it.

It might matter that the grammar is ambiguous

Different parsers have different means to deal with these, or you can just parse with ambiguity and resolve the ambiguity in a 2nd step.

I will take a look at Grammatica and ANTLR. It looks like Grammatica uses the GPL license. Would that affect our ability to incorporate into the Fan base?

The licenses for the parser generators themselves don't matter much. What's important is the license for the parser that they generate from the grammar you specify. Generally, the generated parsers have quite permissive licenses.

KevinKelley Tue 21 Apr 2009

"Parsing techniques: a practical guide", Grune and Jacobs, ch 12 & 13; talks about substring parsing.

99% of the time, when you are editing code, it's not in a well-formed state

Here's where the paradigm shift comes in, I think. In the traditional dev cycle that's true enough. With incremental compiling, though, the focus shifts: now 99% of the code is well-formed, and it's just the one (or few) methods you're working on that are broken.

So maybe the approach I'm taking at the moment would be better called partial compiling than incremental. Not to say that it's the best; it's just what seemed to me like it might be good enough to be useful.

What you're describing, John, is the real, formally provable solution in this problem space.

The first step toward getting there, I think, is to get the language grammar implemented in a tool like ANTLR. At the moment we're still working from a hand-written, recursive-descent parser.

KevinKelley Wed 22 Apr 2009

It's been quite a while since I've used ANTLR (it was PCCTS then), so I thought I'd check it out. It's very nice now, and there's a cool visual editor (ANTLRWorks) that goes along with it. It emits Java by default, and can emit CSharp and others as well, so it shouldn't be hard to integrate with Fan code.

I've taken the Fan grammar off the documentation page, and with some tweaking, got ANTLR to accept it and emit a FanLexer/FanParser combo.

There's some stuff in ANTLR about AST generation too. Might be a useful tool, at least for experimenting.

JohnDG Wed 22 Apr 2009

Yes, ANTLRWorks is amazing. Unfortunately, the process for implementing error recovery with ANTLR is more involved than with some other parser generators, and I've always found the extensive mixing of grammar + source code distasteful.

In general, the guys behind ANTLR (mainly TP) are good at writing parsers, but not as good at software engineering. Their emphasis is on the parsing itself, rather than the interface they expose to consumers of their work.

tompalmer Wed 22 Apr 2009

It's possible to avoid much mixing of code with ANTLR if you are careful.

However, I still think any IDE support needs to be based on the real compiler itself. You might have a goal of replacing aspects of the main compiler, but I'd be afraid of getting too many things wrong by trying to go around it. (I guess I'm more of an implementation person than a specification person. Pros and cons.)

Also, ANTLR doesn't work directly with Fan today, and I would think keeping Fan self-hosting is important. So it might not be easy to use ANTLR to replace current Fan parsing.

JohnDG Wed 22 Apr 2009

However, I still think any IDE support needs to be based on the real compiler itself.

You're wrong. Every IDE with refactoring, real-time syntax checking, code analysis, etc., has a fully-functional parser and analyzer built in, one suited to the needs of an IDE. They do not reuse the compilation infrastructure of the supported language.

There's no reason Fan would be any different.

It's possible to use an IDE-style parser inside Fan, but it's not possible to use a Fan-style parser inside an IDE (at least, not the way it's currently written).

Non-incremental parsers without error recovery can be made much faster than incremental parsers with error recovery. Perhaps that's why there's so much re-implementation.

KevinKelley Wed 22 Apr 2009

You're wrong. Every IDE with...

I don't know whether Tom's wrong; I don't think so but I could be wrong. :-) In any case, you're wrong; not every IDE does that. Smalltalk is the classic case of an IDE that uses its own compiler to do all kinds of cool things with its tools.

And in any any case, the fact that it's been done one way forever wouldn't mean that it's the only right way. Circumstances change. Fan, for instance, has a lot more reflection available than a lot of environments get access to. And there's a definite benefit to using a single piece of code for a given function, instead of re-implementing it over and over.

Anyway we'll see how it works. I'm as pragmatic about this as I can be, and still get done what I want done.

JohnDG Wed 22 Apr 2009

Smalltalk is the classic case of an IDE that uses its own compiler to do all kinds of cool things with its tools.

It's the exception that proves the rule. :-)

Like I said, good luck, but in my opinion, you're not going to be successful unless you contribute some serious patches to the existing Fan compiler.

brian Wed 22 Apr 2009

Just to chime in - I see a lot of value in trying to maintain a single code base for as much of the compiler as possible. Maybe it won't be 100% reuse, but the more the better. So I'm definitely willing to try to enhance the compiler if/when/where it makes sense.

tompalmer Thu 23 Apr 2009

By "need", I meant "I really think it works best if you do it this way". It wasn't a survey of existing IDEs. However, I thought Eclipse used its own compiler for feedback and semantics as you go. I didn't dig into it, though. And I thought NetBeans had recently moved this way, too. (As in, related to all that compiler API and so on that they put into Java.)

KevinKelley Thu 23 Apr 2009

I'm going to drop a few more notes here about parsing. I've been stuck for the last day or two on a bug in my folding editor; rather than bang my head, I took a detour to explore some parser ideas, hoping that my bugfix would magically come to me in a dream. Hasn't happened yet.

So, to summarize: in general I'm with the majority opinion, that we're best off reusing the existing Fan parser system as much as possible. There may come a time or a situation where that isn't easy, though, so I took some time to get familiar with alternatives, in the context of Fan.

First experiment was with the ANTLR parser generator. This sucker has been around for many years, and you can tell that a lot of people have been using the heck out of it. It's got support for code generation in many languages, there's a visual grammar editor, and it's got lots of usability features (it's forgiving in the grammar, can generate its own token tables and lexer, can generate ASTs, has hooks for error handling...). I fed it the Fan grammar from the doc page, and with a little search-and-replace for syntax it was able to generate a parser for the full Fan grammar with very little work. It's an LL(k) parser generator, so it can handle disambiguation by using extra lookahead, and it's got backtracking; in terms of getting a parser from a grammar, it's the best thing I've ever seen.

Setting ANTLR aside for the moment, next I spent some hours with Grammatica. The docs make it sound as if it's got better error-handling than ANTLR, and it generates a neater system (better separation of grammar and code), so that sounded useful. It's got support for both Java and CSharp, and it's not supposed to be hard to add new language output, so I thought that with all that it could be useful here.

I like it a lot less now, though, after banging my head against it all day. My biggest gripe is that it's demanding about its input: it seems to require grammars that are exactly like what's in your textbook, with no automatic disambiguation of anything. This means the programmer is going to be doing a whole lot of refactoring of the grammar to get it through Grammatica. That in turn means Grammatica's nice separation of grammar and generated code isn't so useful after all; by the time you get a parser, the grammar has been refactored into so many artificial constructs that the parse tree doesn't match the nodes you'd actually want.

-

That's all fine; I don't actually need these things right now anyway. At some point it might be a fun project to get ANTLR to emit Fan code, but not yet. The only thing I can see needing is some good way of parsing at the sub-method level or so, for purposes of keeping the editor current with the edits.

This gets me thinking about different styles of parser. Standard LL or LR parsers tend to give you good info about the first error encountered; after that they (in theory) may not have much to say. Reading the Grune and Jacobs book, I found the Unger parser, which is non-directional and, instead of generating a single parse tree, generates a forest of parse trees that match against a string of input. So the applicability here is: if I did an Unger parser for the grammar of Fan statements, then when editing within a method definition, it could be applied against the method text (everything between the braces, tokenized) to give as many matching statement AST trees as are valid.
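For flavor, here's a tiny Unger-style recognizer in Fan over a toy grammar (not the book's Pascal example; everything here is invented for illustration). Unger's method just tries every way of splitting the input across a production's symbols, top-down and non-directionally; a real version would memoize and return trees rather than a Bool, which is where the forest comes from.

  class Unger
  {
    // toy grammar: nonterminal -> productions, each a list of symbols;
    // anything without a rule is a terminal matched literally
    [Str:Str[][]] rules := [
      "Expr": [["Expr", "+", "Term"], ["Term"]],
      "Term": [["i"]]
    ]

    // does sym derive exactly the token list toks?
    Bool parse(Str sym, Str[] toks)
    {
      prods := rules[sym]
      if (prods == null) return toks.size == 1 && toks[0] == sym
      return prods.any |Str[] p->Bool| { return match(p, toks) }
    }

    // try every way to split toks into prod.size consecutive pieces
    private Bool match(Str[] prod, Str[] toks)
    {
      if (prod.size == 1) return parse(prod[0], toks)
      max := toks.size - (prod.size - 1)
      for (i := 1; i <= max; ++i)
      {
        if (parse(prod[0], toks[0..<i]) && match(prod[1..-1], toks[i..-1]))
          return true
      }
      return false
    }
  }

  // usage: echo(Unger().parse("Expr", "i + i + i".split))  // -> true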

Neat.

In trying to understand the thing, I implemented the book's Pascal example in Fan, and it's working now, at least as far as the book goes. (That is, it's got a tiny hardcoded sample input and grammar, which it parses, spitting out the list of matching parses.) Not sure whether I'll ever use it, but at least I know now how it works.

What I really need to do now though is quit futzing around with side-issues and get back on my main track. Just wanted to make some notes here while they're fresh, is all.

KevinKelley Thu 23 Apr 2009

PS. If anyone's interested in playing with this stuff, I've got a Fan grammar that's tweaked for ANTLR, and a Fan implementation of the Unger parser that I'd be happy to share. At some point I'll put them where they can be linked; for now mailto:kevin at kelleysoft.com if you like.

brian Thu 23 Apr 2009

I definitely want to check them out, but no rush for me - I will wait until you get them posted somewhere.

Lomono Fri 22 May 2009

My first post is a mistake. :( I guess I might as well make the best of this post.

I saw a paper on the incremental parsing techniques used in Yi, the Haskell editor, and thought it might be of interest to KevinKelley.

KevinKelley Tue 26 May 2009

Thanks; haven't finished the article yet; just got home. Looks interesting tho -- same problems I'm dealing with.

This Earley parser I'm using now seems to have some good features for incremental compiling. Parsing bubbles up from the lowest-level syntactic elements, so even with a broken source you still get parse trees for all the sub-sections that do parse.

It's also easy to call for a parse of just the node you're interested in (expression instead of compilation unit); so the general approach is: a full parse on document load to establish structure; then on edit, locate the lowest-level node in the tree that matches the edit location, climb the tree until the node encapsulates the change, and reparse that node only, with the sub-range of input that it applies to.
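A sketch of that rule on a minimal hypothetical node type (just spans plus parent links; the reparse itself stays abstract, since that's the Earley machinery's job):

  class Node
  {
    new make(Int start, Int end, Node[] kids := Node[,])
    {
      this.start = start; this.end = end; this.kids = kids
      kids.each |k| { k.parent = this }
    }
    Int start
    Int end
    Node[] kids
    Node? parent

    Bool covers(Int s, Int e) { return start <= s && e <= end }

    // climb from the node found at the edit point until one
    // encapsulates the whole change; reparse that node only
    Node toReparse(Int editStart, Int editEnd)
    {
      Node? n := this
      while (n.parent != null && !n.covers(editStart, editEnd))
        n = n.parent
      return n
    }
  }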

Making good progress, anyway, and having a lot of fun too. :-) As an additional benefit, I think I'll have a pretty good CFG parser system worked up as well. The chart-based (searching) parsers are pretty good in some ways, at some expense in efficiency; there are some possibilities, though, in using the Earley system during development, then using its data to generate LL or LR parsers for the subsets of your grammar that aren't ambiguous, maybe. Neat stuff.
