
Lexer Module - Architecture Review Opinion

Issue: 51437
Submitter: Miloslav Metelka
Inception Date: 2004/11/19
Final Date: ?
Reviewers: Jaroslav Tulach, Tomas Pavek, Tomas Hurka

Summary

The Lexer module is intended to replace the current lexing support in the editor, which suffers from various problems. The main requirements are: avoid frequent lexing, recover efficiently after modifications, allow easy language embedding, support backward iteration through the tokens, handle large tokens efficiently, and keep memory consumption reasonable. The current lexing support largely fails to meet these requirements. The main feature of the new lexer module is that it creates permanent tokens and allows incremental lexing. Both an API and an SPI are exposed: the API for lexer clients, the SPI for language support implementors.

Decision

Accepted with change requests.

Opinion

API 

The initial documentation describes issues, requirements and features, which is fine, but it does not describe any API. The submitter only stated that the API should be stable and use the official api namespace. The reviewers were fine with having TokenSequence in the API (nested for embedded languages) and the Token itself as part of the SPI, but this information is too brief and lacks the necessary context.

TCR: For the final review it is required to provide a description of the API (API and SPI parts) and the architecture, including the relation between implementation modules, integration points in the editor, entry points for API users, registration for SPI implementors, etc. Specifically, it would be nice to describe how support for embedded languages will be managed, e.g. in the case of JSP, where virtually any language can be plugged in.

Reviewers understand the motivation for having the API stable; however, it is not clear that there will be real usages requiring it (all of them seem to be within the editor itself for the near future). E.g. the review showed that the potential use by the java parser is practically unrealistic. So a lower stability level seems sufficient, and the DevRev requirements on the API would then be lower as well. In the current situation, however, the reviewers require a deeper description of the use cases (syntax coloring, brace matching, code completion, formatting and parsing).

TCR: Provide documentation of the use cases for the new APIs. We should understand the API clients and SPI providers, and how they would use the lexer module.

Parser Coexistence

We also discussed possible ways of cooperation between java parsing and the new lexer. Currently, parsing in the IDE is done via javac, and there seems to be no chance of having an own parser implementation that would eventually be built on the lexer. The only way seems to be for the javac scanner to feed both the parser and the lexer. The general requirement is to have just one infrastructure for parsing java sources. In the final review we will revisit whether this is doable (i.e. mainly having the javac scanner used also by the lexer).

TCR: Scanning java sources should not run twice -- separately for the lexer and java parser. This means that the lexer should be able to use the output from the javac scanner (invoked for parsing). 

Incremental Recovery

One of the issues of the current lexing support is expensive recovery: the whole line must be rescanned whenever a modification occurs. According to the discussion, this seems fixable in the current implementation as well, without the need for a new API. The lexer could likely keep its state in the middle of the line too. This option was discussed and no major barrier to it was found. As it is not clear when the new lexer will get into the product, fixing the current state would be beneficial.

TCA: Consider improving the recovery mechanism after modifications within the current implementation. It seems doable without a new API.
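
To illustrate what keeping the lexer state in the middle of a line could look like, here is a minimal Java sketch. It is not the editor's implementation or the proposed API: the toy language (words, whitespace, and double-quoted strings), the class name IncrementalRelexSketch, and all method names are assumptions made for this example. Every token records the lexer state reached at its end, so after a modification lexing restarts at the nearest token boundary before the change and stops as soon as a new token boundary lines up with an old one in the same state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the lexer module.
final class IncrementalRelexSketch {
    enum State { DEFAULT, IN_STRING }
    enum Kind { WORD, SPACE, QUOTE, STRING_TEXT, OTHER }

    /** A token plus the lexer state reached at its end (a restart checkpoint). */
    static final class Tok {
        final Kind kind; final int start; final int end; final State stateAfter;
        Tok(Kind kind, int start, int end, State stateAfter) {
            this.kind = kind; this.start = start; this.end = end; this.stateAfter = stateAfter;
        }
        @Override public String toString() {
            return kind + "[" + start + "," + end + ")" + stateAfter;
        }
    }

    /** Lexes exactly one token starting at pos, given the lexer state at pos. */
    static Tok next(CharSequence s, int pos, State st) {
        char c = s.charAt(pos);
        if (c == '"') { // a quote toggles between the two states
            return new Tok(Kind.QUOTE, pos, pos + 1,
                    st == State.DEFAULT ? State.IN_STRING : State.DEFAULT);
        }
        int end = pos + 1;
        if (st == State.IN_STRING) {
            while (end < s.length() && s.charAt(end) != '"') end++;
            return new Tok(Kind.STRING_TEXT, pos, end, st);
        }
        if (Character.isLetter(c)) {
            while (end < s.length() && Character.isLetter(s.charAt(end))) end++;
            return new Tok(Kind.WORD, pos, end, st);
        }
        if (Character.isWhitespace(c)) {
            while (end < s.length() && Character.isWhitespace(s.charAt(end))) end++;
            return new Tok(Kind.SPACE, pos, end, st);
        }
        return new Tok(Kind.OTHER, pos, end, st);
    }

    /** Batch lexing of the whole text. */
    static List<Tok> lexAll(CharSequence s) {
        List<Tok> out = new ArrayList<>();
        State st = State.DEFAULT;
        for (int pos = 0; pos < s.length(); ) {
            Tok t = next(s, pos, st);
            out.add(t); pos = t.end; st = t.stateAfter;
        }
        return out;
    }

    static State stateBefore(List<Tok> tokens, int i) {
        return i == 0 ? State.DEFAULT : tokens.get(i - 1).stateAfter;
    }

    /**
     * Updates tokens in place after 'removed' chars at 'offset' were replaced by
     * 'inserted' chars (newText is the text after the edit).
     * Returns the number of tokens that had to be lexed again.
     */
    static int relex(List<Tok> tokens, CharSequence newText, int offset, int removed, int inserted) {
        int delta = inserted - removed;
        // First affected token: a token ending exactly at the offset may grow, so include it.
        int first = 0;
        while (first < tokens.size() && tokens.get(first).end < offset) first++;
        int pos = first == 0 ? 0 : tokens.get(first - 1).end;
        State st = stateBefore(tokens, first);
        // Old tokens from 'tail' on lie entirely behind the removed text.
        int tail = first;
        while (tail < tokens.size() && tokens.get(tail).start < offset + removed) tail++;
        List<Tok> fresh = new ArrayList<>();
        boolean synced = false;
        while (pos < newText.length()) {
            while (tail < tokens.size() && tokens.get(tail).start + delta < pos) tail++;
            if (tail < tokens.size() && tokens.get(tail).start + delta == pos
                    && stateBefore(tokens, tail) == st) { synced = true; break; }
            Tok t = next(newText, pos, st);
            fresh.add(t); pos = t.end; st = t.stateAfter;
        }
        if (!synced) tail = tokens.size(); // lexed to the end, nothing left to reuse
        List<Tok> result = new ArrayList<>(tokens.subList(0, first));
        result.addAll(fresh);
        for (Tok o : tokens.subList(tail, tokens.size())) { // reuse the rest, shifted
            result.add(new Tok(o.kind, o.start + delta, o.end + delta, o.stateAfter));
        }
        tokens.clear();
        tokens.addAll(result);
        return fresh.size();
    }
}
```

In this sketch, typing a letter inside a word re-lexes only the neighbouring tokens, whereas typing a quote flips the state of everything behind it, so the state comparison keeps lexing going until the end of the text.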

Final Review Sep 26, 2006

MM: Explaining what this does: Super framework. Excellent testing capabilities,
    randomized tests.
MM: Language embedding is under investigation; loose coupling is necessary,
    the outer language does not need to know
    about inner languages. For example, a string in Java could mean SQL.
    Open question: maybe there should be more than one embedding at once, for
    example a string could have both the \n one and the SQL one. Should be easy
    to maintain both. The problem is which syntax coloring will be used,
    but this is on top of the lexer, somewhere else.
MF(Marek Fukala): This is important for us.
Created issue #86473 to cover this use case.
JJ(Jan Jancura): Can you add these extensions compatibly?
MM: I already have this in, but it does not work yet. But that is easy, just
    find the prefix and postfix. It is under control.
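
The loose coupling discussed above can be sketched roughly as follows. This is only an illustration under assumptions, not the lexer module's API: EmbeddingSketch, Token, Embedding, register, embeddedTokens and the toy SQL-like lexer are all hypothetical names. The outer lexer never mentions the inner language; an embedding is registered separately for an outer token id and says how many prefix and postfix characters (here the two quotes of a string literal) to skip before handing the rest to the inner lexer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative only.
final class EmbeddingSketch {
    static final class Token {
        final String id; final String text;
        Token(String id, String text) { this.id = id; this.text = text; }
        @Override public String toString() { return id + ":" + text; }
    }

    /** Which inner lexer to run and how many chars to skip at either end of the outer token. */
    static final class Embedding {
        final Function<String, List<Token>> innerLexer;
        final int prefix; final int postfix;
        Embedding(Function<String, List<Token>> innerLexer, int prefix, int postfix) {
            this.innerLexer = innerLexer; this.prefix = prefix; this.postfix = postfix;
        }
    }

    // Registry kept outside of the outer lexer, keyed by outer token id.
    private final Map<String, Embedding> byOuterTokenId = new HashMap<>();

    void register(String outerTokenId, Embedding e) {
        byOuterTokenId.put(outerTokenId, e);
    }

    /** Returns the nested tokens of an outer token, or an empty list if nothing is embedded. */
    List<Token> embeddedTokens(Token outer) {
        Embedding e = byOuterTokenId.get(outer.id);
        if (e == null || outer.text.length() < e.prefix + e.postfix) {
            return Collections.emptyList();
        }
        String inner = outer.text.substring(e.prefix, outer.text.length() - e.postfix);
        return e.innerLexer.apply(inner);
    }

    /** Toy inner lexer: whitespace-separated words, two of them treated as keywords. */
    static List<Token> lexSqlLike(String s) {
        List<Token> out = new ArrayList<>();
        for (String w : s.trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            boolean keyword = w.equals("SELECT") || w.equals("FROM");
            out.add(new Token(keyword ? "KEYWORD" : "IDENT", w));
        }
        return out;
    }
}
```

Registering `new Embedding(EmbeddingSketch::lexSqlLike, 1, 1)` for a hypothetical "STRING_LITERAL" id would make the content of a Java string literal appear as a nested token list. Supporting more than one embedding per token, the open question above, would mean mapping each id to a list of embeddings instead.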
    
JT: Object Lexer.state()? Ok, P4.
MM: Java lexer and most other lexers will return null so this would possibly be an over-generification for the Lexer class.

JT: TCR: release55, lexer + lexer/editorbridge
JJ: JDK1.4 vs. JDK1.5
MF: We also need some support into release55. 
JT: release55 branch 
JT: get it on beta update center
Created issue #86510 for this TCR.
MM: Will it work with Schliemann?
JJ: Looks ok, but we have to try it. 
PF(Pavel Flaska): Hanz will do dynamic token ids
MM: I have already written a test for the dynamic tokens use.

JJ: Is not the API too big? Old API had only 3 classes. For example InputAttributes. 
MM: Better API/SPI separation; also, language embedding was very difficult to do with the original API compared to the new lexer.

JJ: Is not the lexer too slow?
MM: batch 19ms, 33ms for incremental changes, but then any incremental change is much,
    much faster
MM: Moreover we have clever memory management.
JL(Jan Lahoda): Lexer is going to speed up the incremental changes
PF: Moreover it scans lazily

MF: How do I write the incremental lexer syntax?
MM: As easy as writing non-incremental. Just use lexer SPI and you are done.

JJ: Do I have to create an API when I am about to write my own lexer?
JJ: Because all the lexer elements are available to anyone.
MM: LanguageDescription does not need to be public
VS(Vita Stejskal): TokenId becomes de facto API, Lookup.getDefault->LanguageProvider->mimetype->LangDescription


JT: LanguageDescription shall return T not TokenId

JT: is there a documentation of how to do syntax coloring?
MM: you need to associate colors somewhere. 
JT: TCR tutorial on syntax colors
MM: Will add to use-cases of arch.xml
Created issue #86511 for this TCR.
JJ: Mistakes in example in LanguageHierarchy
VS: How to register lang description to lookup

PP(Petr Pisl): What is the progress of integration?
MM: We will update java editor to use it.
PP: Who will rewrite next languages?
MF: Old editor will be there, so old things shall work.
VS: 2 phase: 1. lexer and highlighting, 2. rewrite internals of editor
JJ: So there will stay old settings in editor -> EditorKits
VS: This is really not necessary
JJ: But Options dialog is slow due to this
JJ: I just want the new lexer API to be usable without old crap
MM: Should be feasible - we plan to rewrite all the settings to MimeLookup and throw away o.n.e.Settings and BaseOptions completely.

MF: Can there be bridge between lexer and old tokens?
MM: Up to you - the ordinal ids of the Enums can be used for old token ids.
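
A bridge of the kind MM mentions could look roughly like this. It is a sketch under assumptions: the enum JavaTokenId and its constants, and the helper names toOldId and fromOldId, are made up for illustration and do not come from the lexer module or the old editor API. It relies only on the standard Enum.ordinal() and values() methods.

```java
// Hypothetical sketch; the enum and helper names are illustrative only.
final class OrdinalBridgeSketch {
    enum JavaTokenId { IDENTIFIER, KEYWORD, STRING_LITERAL, WHITESPACE }

    /** Maps a new enum-based token id to an int usable as an old-style token id. */
    static int toOldId(JavaTokenId id) {
        return id.ordinal();
    }

    /** Maps an old-style int token id back to the enum constant. */
    static JavaTokenId fromOldId(int oldId) {
        JavaTokenId[] all = JavaTokenId.values();
        if (oldId < 0 || oldId >= all.length) {
            throw new IllegalArgumentException("Unknown token id: " + oldId);
        }
        return all[oldId];
    }
}
```

Ordinals follow the declaration order of the constants, so such a bridge stays valid only as long as the enum constants are not reordered or removed.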

Final review decision: Accepted with TCRs

Minority Opinion

Advisory Information

One of the open issues is the lifetime management of the tokens. Yarda suggested using weak references and logging the lifecycle events. Mila thinks that weak references alone cannot be used, as continuous sequences of tokens are needed, but we all agreed that some optimization is desirable (though the size of the tokens is small and guaranteed), e.g. trying to hold only the tokens for the visible part of the document. In any case, the reviewers commended the attention the submitters devoted to memory issues.

The Lexer module is designed to support lexers from popular lexer generators (provided they fulfill certain requirements). That is a valuable goal, but it would also be quite useful to support syntax coloring definitions written for other editors (allowing them to be imported). This could greatly demonstrate the usefulness of this project.

Appendixes

Appendix A: Technical Changes Required

TBD

Appendix B: Technical Changes Advised

TBD

Appendix C: Reference Material

HTML version of the arch-lexer.xml document

Final Review

#86510 - Backport lexer to be usable with release55 branch
#86511 - Create syntax coloring guide for Lexer