Lexer Module - Architecture Review Opinion
- Issue: 51437
- Submitter: Miloslav Metelka
- Inception Date: 2004/11/19
- Final Date: ?
- Reviewers: Jaroslav Tulach, Tomas Pavek, Tomas Hurka
- Content
The Lexer module is intended to replace the current lexing support in the editor,
which suffers from various problems. The main requirements are: avoid frequent
lexing, recover efficiently after modifications, allow for easy language
embedding, support backward iteration through the tokens, handle large tokens
efficiently, and keep memory consumption reasonable. The current lexing
support more or less fails to meet these requirements. The main feature of the new
lexer module is that it creates permanent tokens and allows for incremental
lexing. Both an API and an SPI are exposed: the API for lexer clients, the SPI for
language support implementors.
Accepted with change requests.
API
The initial documentation describes issues, requirements and features, which
is fine, but no API is mentioned. The submitter only stated
that the API should be stable and use the official API namespace. The reviewers were
fine with having TokenSequence in the API (nested for embedded
languages) and Token itself as part of the SPI, but this information is too brief
and lacks the necessary context.
TCR: For the final review, a description of
the API (API and SPI parts) and the architecture must be provided, including the relations between implementation modules,
integration points in the editor, entry points for API users, registration for
SPI implementors, etc. Specifically, it would be nice to describe how support for
embedded languages will be managed, e.g. in the case of JSP, where virtually any
language can be plugged in.
Reviewers understand the motivation for having the API stable; however,
it is not clear that there will be real usages requiring it (all of them seem to be
within the editor itself for the near future). E.g. during the review it
turned out that the potential use by the Java parser is practically unrealistic. So a lower
stability level would seem sufficient, and the requirements on the API from the DevRev side
would then be lower as well. But in the current situation, the reviewers
require a deeper description of the use cases (syntax coloring, brace matching, code completion, formatting and parsing).
TCR: Provide documentation of the use cases for the new APIs. We should
understand the API clients and SPI providers, and how they would use the lexer
module.
Parser Coexistence
We also discussed possible ways of cooperation between Java parsing and
the new lexer. Currently the parsing in the IDE is done via javac, and there seems
to be no chance of having our own parser implementation that would eventually be built on
the lexer. The only way seems to be that the javac scanner feeds both the
parser and the lexer. The general requirement is to have just one infrastructure
for parsing Java sources. In the final review we will revisit whether this is doable
(i.e. mainly having the javac scanner used also by the lexer).
TCR: Scanning java sources should not run twice -- separately for the
lexer and java parser. This means that the lexer should be able to use the
output from the javac scanner (invoked for parsing).
Incremental Recovery
One of the issues of the current lexing support is expensive recovery -- the
need to rescan the whole line whenever a modification occurs. According to the
discussion, this seems fixable within the current implementation as well, without
the need for a new API. The lexer could likely keep its state in the middle of a line too.
This option was discussed and no major barrier to doing it was found. As it is not
clear when the new lexer will get into the product, fixing the current state would
be beneficial.
TCA: Consider improving the recovery mechanism after modifications
within the current implementation. It seems doable without a new API.
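To illustrate the kind of recovery discussed here, the following is a minimal sketch of relexing only the tokens touched by a modification instead of a whole line. It is not the editor's or the lexer module's implementation: the class IncrementalRelexSketch, the toy token kinds and the method names are all hypothetical, and the toy lexer is stateless, so resynchronization only has to compare token boundaries. A lexer that keeps state would also have to compare the stored states at the resynchronization point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the lexer module.
final class IncrementalRelexSketch {
    enum Kind { WORD, NUMBER, WHITESPACE, OTHER }

    /** A token is described by its kind and its position in the text. */
    record Tok(Kind kind, int start, int length) {
        int end() { return start + length; }
    }

    static Kind kindOf(char c) {
        if (Character.isLetter(c)) return Kind.WORD;
        if (Character.isDigit(c)) return Kind.NUMBER;
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        return Kind.OTHER;
    }

    /** Lexes one token at 'start': a run of same-kind chars, or a single OTHER char. */
    static Tok lexOne(CharSequence text, int start) {
        Kind kind = kindOf(text.charAt(start));
        int end = start + 1;
        if (kind != Kind.OTHER) {
            while (end < text.length() && kindOf(text.charAt(end)) == kind) end++;
        }
        return new Tok(kind, start, end - start);
    }

    /** Batch lexing of the whole text. */
    static List<Tok> lexAll(CharSequence text) {
        List<Tok> out = new ArrayList<>();
        for (int pos = 0; pos < text.length(); ) {
            Tok t = lexOne(text, pos);
            out.add(t);
            pos = t.end();
        }
        return out;
    }

    /**
     * Updates 'old' after 'removed' chars at 'offset' were replaced by 'inserted' chars.
     * 'newText' is the text after the modification.
     */
    static List<Tok> relex(List<Tok> old, CharSequence newText,
                           int offset, int removed, int inserted) {
        int delta = inserted - removed;
        // Tokens ending before the edit are untouched (their one-char lookahead is unchanged).
        int i = 0;
        while (i < old.size() && old.get(i).end() < offset) i++;
        List<Tok> result = new ArrayList<>(old.subList(0, i));
        int pos = i < old.size() ? old.get(i).start() : 0;
        // Old tokens starting at or after the removed region are candidates for reuse.
        int j = i;
        while (j < old.size() && old.get(j).start() < offset + removed) j++;
        while (true) {
            while (j < old.size() && old.get(j).start() + delta < pos) j++;
            boolean resynced = j < old.size() && old.get(j).start() + delta == pos;
            if (resynced || pos >= newText.length()) break;
            Tok t = lexOne(newText, pos);
            result.add(t);
            pos = t.end();
        }
        // Reuse the remaining old tokens, shifted by the length change.
        for (; j < old.size(); j++) {
            Tok t = old.get(j);
            result.add(new Tok(t.kind(), t.start() + delta, t.length()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> old = lexAll("int x = 10;");
        // Insert "y" at offset 5; only the token around the edit is lexed again.
        System.out.println(relex(old, "int xy = 10;", 5, 0, 1));
    }
}
```

In this sketch the relexing stops as soon as a newly produced token ends at the shifted start of an old token that lies entirely behind the modification, so the cost depends on the size of the edit rather than on the length of the line.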
Final Review Sep 26, 2006
MM: Explaining what this does: Super framework. Excellent testing capabilities,
randomized tests.
MM: Language embedding is under investigation. Loose coupling is necessary:
the outer language does not need to know
about inner languages. For example, a string in Java could contain SQL.
Open question: maybe there should be more than one embedding at once; for
example, a string could have both the \n escapes and the SQL one. It should be easy
to maintain both. The problem is which syntax coloring will be used,
but this is on top of the lexer, somewhere else.
MF(Marek Fukala): This is important for us.
Created issue #86473 to cover this use case.
JJ(Jan Jancura): Can you add these extensions compatibly?
MM: I already have this in, but it does not work yet. That is easy, though: just
find the prefix and postfix. It is under control.
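The loose coupling and the prefix/postfix idea from this exchange can be sketched as follows. This is only an illustration under assumptions: EmbeddingSketch, Embedding, EmbeddingRegistry and the string keys such as "java:STRING_LITERAL" are hypothetical names, not part of the reviewed API. The point is that the outer lexer produces plain tokens, while a separate registry says which inner language applies and how many characters to skip at both ends.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the lexer module.
final class EmbeddingSketch {
    /** Describes an embedded section inside an outer token. */
    record Embedding(String innerLanguage, int prefixLength, int postfixLength) {
        /** Returns the part of the outer token text that the inner lexer should see. */
        String innerText(String outerTokenText) {
            int end = outerTokenText.length() - postfixLength;
            // An incomplete token (e.g. an unterminated string) has no inner text.
            if (end < prefixLength) return "";
            return outerTokenText.substring(prefixLength, end);
        }
    }

    /** Kept outside of the outer lexer, so the outer language knows nothing about inner ones. */
    static final class EmbeddingRegistry {
        private final Map<String, Embedding> byOuterTokenId = new HashMap<>();

        void register(String outerTokenId, Embedding embedding) {
            byOuterTokenId.put(outerTokenId, embedding);
        }

        Optional<Embedding> find(String outerTokenId) {
            return Optional.ofNullable(byOuterTokenId.get(outerTokenId));
        }
    }

    public static void main(String[] args) {
        EmbeddingRegistry registry = new EmbeddingRegistry();
        // Skip the opening and closing quote of a Java string literal.
        registry.register("java:STRING_LITERAL", new Embedding("sql", 1, 1));
        String token = "\"SELECT 1\"";
        registry.find("java:STRING_LITERAL")
                .ifPresent(e -> System.out.println(e.innerLanguage() + ": " + e.innerText(token)));
    }
}
```

Supporting more than one embedding per token, as raised in the open question, would in this sketch amount to mapping a token id to a list of embeddings instead of a single one.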
JT: Object Lexer.state()? Ok, P4.
MM: Java lexer and most other lexers will return null so this would possibly be an over-generification for the Lexer class.
JT: TCR: release55, lexer + lexer/editorbridge
JJ: JDK1.4 vs. JDK1.5
MF: We also need some support into release55.
JT: release55 branch
JT: get it on beta update center
Created issue #86510 for this TCR.
MM: Will it work with Schliemann?
JJ: Looks ok, but we have to try it.
PF(Pavel Flaska): Hanz will do dynamic token ids
MM: I have already written a test for the use of dynamic tokens.
JJ: Is not the API too big? Old API had only 3 classes. For example InputAttributes.
MM: Better API/SPI separation and the language embedding was very difficult to do with the original API compared to the lexer.
JJ: Is not the lexer too slow?
MM: batch 19ms, 33ms for incremental changes, but then any incremental change is much, much faster
MM: Moreover we have clever memory management.
JL(Jan Lahoda): Lexer is going to speed up the incremental changes
PF: Moreover it scans lazily
MF: How do I write the incremental lexer syntax?
MM: As easy as writing non-incremental. Just use lexer SPI and you are done.
JJ: Do I have to create an API when I am about to write my own lexer?
JJ: Because all the lexer elements are available to anyone.
MM: LanguageDescription does not need to be public
VS(Vita Stejskal): TokenId becomes de facto API, Lookup.getDefault->LanguageProvider->mimetype->LangDescription
JT: LanguageDescription shall return T not TokenId
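JT's remark about returning T rather than TokenId can be illustrated with a small generics sketch. It is an assumption about what was meant, not the reviewed API: the types TokenId, CalcTokenId and LanguageDescription below are simplified, hypothetical stand-ins. Parameterizing the description by the concrete id type lets a client work with its own enum without casts.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; simplified stand-ins, not the reviewed classes.
final class TypedLanguageSketch {
    /** Minimal token id contract; an enum satisfies it with its built-in methods. */
    interface TokenId {
        String name();
        int ordinal();
    }

    enum CalcTokenId implements TokenId { NUMBER, PLUS, WHITESPACE }

    /** Language description parameterized by the concrete token id type T. */
    static final class LanguageDescription<T extends Enum<T> & TokenId> {
        private final String mimeType;
        private final Set<T> tokenIds;

        LanguageDescription(String mimeType, Class<T> idClass) {
            this.mimeType = mimeType;
            this.tokenIds = Collections.unmodifiableSet(EnumSet.allOf(idClass));
        }

        String mimeType() { return mimeType; }

        /** Returns ids typed as T, so callers do not need to cast from TokenId. */
        Set<T> tokenIds() { return tokenIds; }
    }

    public static void main(String[] args) {
        LanguageDescription<CalcTokenId> calc =
                new LanguageDescription<>("text/x-calc", CalcTokenId.class);
        for (CalcTokenId id : calc.tokenIds()) { // no cast needed
            System.out.println(id.name() + " -> " + id.ordinal());
        }
    }
}
```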
JT: is there a documentation of how to do syntax coloring?
MM: you need to associate colors somewhere.
JT: TCR tutorial on syntax colors
MM: Will add to use-cases of arch.xml
Created issue #86511 for this TCR.
JJ: Mistakes in example in LanguageHierarchy
VS: How to register a language description in Lookup?
PP(Petr Pisl): What is the progress of integration?
MM: We will update java editor to use it.
PP: Who will rewrite next languages?
MF: Old editor will be there, so old things shall work.
VS: 2 phase: 1. lexer and highlighting, 2. rewrite internals of editor
JJ: So there will stay old settings in editor -> EditorKits
VS: This is really not necessary
JJ: But Options dialog is slow due to this
JJ: I just want the new lexer API to be usable without old crap
MM: Should be feasible - we plan to rewrite all the settings to MimeLookup and throw away o.n.e.Settings and BaseOptions completely.
MF: Can there be bridge between lexer and old tokens?
MM: Up to you - the ordinal ids of the Enums can be used for old token ids.
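A minimal sketch of the bridge MM hints at, assuming the new token ids are declared as a Java enum and the old tokens are identified by plain integers. The names OrdinalBridgeSketch, NewTokenId, toOldId and fromOldId are hypothetical.

```java
// Hypothetical sketch; none of these names come from the lexer module.
final class OrdinalBridgeSketch {
    /** Token ids of a new-style lexer, declared as an enum. */
    enum NewTokenId { KEYWORD, IDENTIFIER, WHITESPACE }

    /** Numeric id as an integer-based token API might expect it. */
    static int toOldId(NewTokenId id) {
        return id.ordinal();
    }

    /** Reverse lookup from the numeric id back to the enum constant. */
    static NewTokenId fromOldId(int oldId) {
        return NewTokenId.values()[oldId];
    }

    public static void main(String[] args) {
        System.out.println(toOldId(NewTokenId.IDENTIFIER)); // prints 1
        System.out.println(fromOldId(2));                   // prints WHITESPACE
    }
}
```

Ordinals follow the declaration order of the constants, so such a bridge stays valid only as long as constants are appended at the end and never reordered or removed.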
Final review decision:
Accepted with TCRs
One of the open issues is the lifetime management of the tokens. Yarda suggested
using weak references and logging the lifecycle events. Mila thinks that weak references
alone cannot be used, as continuous sequences of tokens are needed, but we all
agreed that some optimization is desirable (though the size of the tokens is
small and guaranteed), e.g. trying to hold only tokens for the visible part of
the document. Anyway, the reviewers praised the attention the submitters
dedicated to the memory issues.
The Lexer module is designed to support lexers from popular lexer generators
(fulfilling certain requirements). That is a valuable goal, but it would also be
quite useful to support syntax coloring definitions written for other
editors (allowing them to be imported). This could greatly demonstrate the usefulness
of this project.
TBD
TBD
HTML version of the arch-lexer.xml document
Final Review
#86510 - Backport lexer to be usable with release55 branch
#86511 - Create syntax coloring guide for Lexer