
2.8.3. org.docma.plugin.examples.RegExpHighlight

cfg=FILE_ALIAS: The highlight configuration. FILE_ALIAS has to be the alias name of a text file containing the highlight configuration. The highlight configuration format is described below. This argument is mandatory.
decode: Character entity decoding behavior. If the decode argument is set to true, then character entities are decoded before regular-expression based pattern matching is applied. Otherwise character entities are not decoded. The decode argument is optional. If not specified, the default setting is true.
skipargs: This argument is used in combination with the Auto-Format class org.docma.plugin.examples.FormatLines. If the skipargs argument is set to true, then the first line is not highlighted if it starts with the string "[args:", i.e. if it is an argument line that needs to be processed by the class org.docma.plugin.examples.FormatLines. If not specified, the default setting is true.
The highlight configuration has to be provided as text-file. The alias name of the text-file has to be passed in the argument cfg. The highlight configuration defines one or more regular-expressions and assigns them to inline-styles. This configuration is used during transformation to highlight parts of the content. If some part of the input matches one of the regular-expressions, then this part is formatted with the style assigned to the matching regular-expression.
Please note that the transformation does not remove or change styles which already exist in the input. Therefore, if the highlight transformation does not give a 100% correct result, the user can fix this by manually applying the correct formatting to the incorrectly formatted content.
Highlight rules
A highlight rule is a line that starts with a comma-separated list of one or more style IDs, followed by a colon, followed by a regular expression. Content that matches the regular expression is formatted with the first style given in the comma-separated list of styles. If the regular expression contains capturing groups, then the content that matches the first group is formatted with the second style in the comma-separated list. Content that matches the second group is formatted with the third style in the list, and so on. Note that the style IDs may be placed on separate lines, but the regular expression has to be placed on the same line as the colon.
The regular-expression implementation of the Standard Java Platform is used for pattern matching. For details on the regular-expression syntax, see the documentation of the class java.util.regex.Pattern in the Java Platform API specification. Please note that the regular expressions are compiled with the flags DOTALL and MULTILINE. Therefore the dot character (.) matches any character, including a line terminator, and the boundary matchers ^ and $ match just after/before a line terminator or the beginning/end of the input sequence.
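The effect of the two flags can be checked directly with java.util.regex. The following is a minimal sketch; the class name FlagsDemo and its helper methods are hypothetical and not part of the plugin.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlagsDemo {

    // Compiles a pattern with the two flags named in the text above.
    static Pattern compile(String regex) {
        return Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE);
    }

    // Counts the non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "first line\nsecond line";
        // MULTILINE: ^ matches at the start of each line, so this prints 2.
        System.out.println(countMatches("^\\w+", input));
        // DOTALL: the dot also matches the line terminator, so this prints true.
        System.out.println(compile("first.*second").matcher(input).find());
    }
}
```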
Example 1:
If the following highlight rule is given,
my_style : ABB?A
then the input sequence AABABABBA is formatted as follows:
A<span class="my_style">ABA</span>B<span class="my_style">ABBA</span>
Example 2:
If the following highlight rule is given,
my_style, b_style : A(B*)C
then the input sequence ABBC is transformed to
<span class="my_style">A<span class="b_style">BB</span>C</span>
Note that the style with ID b_style is only applied if the corresponding group (B*) actually matches some part of the input. For example, given the highlight configuration above, the input sequence AACA is transformed to
A<span class="my_style">AC</span>A
i.e. the style b_style is not applied because the group (B*) does not match anything.
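The two examples can be reproduced with a small sketch of how a single rule might be applied. This is not the plugin's implementation; the class SingleRuleSketch and its methods are hypothetical, and the sketch assumes that empty groups are skipped and that the input needs no HTML escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SingleRuleSketch {

    // styles[0] applies to the whole match, styles[g] to capturing group g.
    // An empty entry means "no style for this group".
    static String apply(String[] styles, String regex, String input) {
        Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE);
        Matcher m = p.matcher(input);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // empty matches are ignored in this sketch
            }
            out.append(input, pos, m.start());
            appendMatch(out, input, m, styles);
            pos = m.end();
        }
        return out.append(input.substring(pos)).toString();
    }

    private static void appendMatch(StringBuilder out, String input, Matcher m, String[] styles) {
        int groups = Math.min(m.groupCount(), styles.length - 1);
        for (int i = m.start(); i <= m.end(); i++) {
            // Close inner groups first (nested groups have higher numbers).
            for (int g = groups; g >= 0; g--) {
                if (styled(m, styles, g) && m.end(g) == i) {
                    out.append("</span>");
                }
            }
            // Open outer groups first.
            for (int g = 0; g <= groups; g++) {
                if (styled(m, styles, g) && m.start(g) == i) {
                    out.append("<span class=\"").append(styles[g]).append("\">");
                }
            }
            if (i < m.end()) {
                out.append(input.charAt(i));
            }
        }
    }

    // A group is styled only if it has a style and matched a non-empty part.
    private static boolean styled(Matcher m, String[] styles, int g) {
        return styles[g] != null && !styles[g].isEmpty()
                && m.start(g) >= 0 && m.end(g) > m.start(g);
    }

    public static void main(String[] args) {
        System.out.println(apply(new String[] {"my_style"}, "ABB?A", "AABABABBA"));
        System.out.println(apply(new String[] {"my_style", "b_style"}, "A(B*)C", "ABBC"));
        System.out.println(apply(new String[] {"my_style", "b_style"}, "A(B*)C", "AACA"));
    }
}
```

The three calls in main correspond to Example 1 and the two inputs of Example 2.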
Multiple highlight rules
A highlight configuration can consist of an arbitrary number of highlight rules. If more than one regular expression matches some part of the content, then the highlight rule whose match starts nearest to the current position is applied. If multiple highlight rules match parts that start at the same position, then the highlight rule that is defined first is applied. For example, given the following highlight rules:
b_style : BC
a_style : AB
c_style : ..C
If the input sequence ABC is transformed with this configuration, the result is:
<span class="a_style">AB</span>C
Though all three regular-expressions match some part of the input, the style a_style is actually applied. The style b_style is not applied, because the part matched by the regular-expression BC is located after the parts matched by the other two regular expressions. The style c_style is not applied, because the corresponding highlight definition is defined after the highlight definition of style a_style.
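The selection described above (nearest match first, definition order as tie-breaker) can be sketched as follows. The class NearestRuleSketch and the method selectRule are hypothetical names used for illustration only; the plugin may implement the selection differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NearestRuleSketch {

    // Returns the index of the rule to apply at or after 'from',
    // or -1 if none of the rules matches the remaining input.
    static int selectRule(String[] regexes, String input, int from) {
        int best = -1;
        int bestStart = Integer.MAX_VALUE;
        for (int i = 0; i < regexes.length; i++) {
            Matcher m = Pattern.compile(regexes[i], Pattern.DOTALL | Pattern.MULTILINE)
                    .matcher(input);
            // Strict '<' keeps the earlier rule when two matches start at the same position.
            if (m.find(from) && m.start() < bestStart) {
                best = i;
                bestStart = m.start();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] rules = {"BC", "AB", "..C"}; // b_style, a_style, c_style
        // Prints 1: AB and ..C both start at position 0, and AB is defined first.
        System.out.println(selectRule(rules, "ABC", 0));
    }
}
```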
Non-highlighted groups
If a capturing group shall not be highlighted, then the style ID can be omitted in the comma-separated list. For example, given the following configuration:
my_style,,c_style : A(B(C))
If the input ABC is transformed with this configuration, the result is:
<span class="my_style">AB<span class="c_style">C</span></span>
Alternatively, the non-capturing group construct (?: ... ) can be used, i.e. the following highlight configuration produces the same result as the previous configuration:
my_style, c_style : A(?:B(C))
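The reason why the second configuration needs one style entry less can be seen from the number of capturing groups that java.util.regex reports for each expression. A minimal sketch (the class name GroupCountDemo is hypothetical):

```java
import java.util.regex.Pattern;

final class GroupCountDemo {

    // Number of capturing groups in the given regular expression.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Two capturing groups: (B(C)) and (C). Prints 2.
        System.out.println(groupCount("A(B(C))"));
        // (?:...) does not capture, so only (C) is counted. Prints 1.
        System.out.println(groupCount("A(?:B(C))"));
    }
}
```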
Constant definitions
Regular expressions can become very complex and lengthy. Furthermore, the same expression often has to be repeated several times. To avoid such redundancy and to improve readability, regular-expression constants can be defined. Once a regular-expression constant is defined, it can be used in subsequent highlight definitions. A constant definition consists of the constant name enclosed in % characters, followed by a colon and the regular expression, as in the line %NAME% : [A-Za-z0-9_:\-\.]+ taken from the complete example.
See the complete example below on how to use regular-expression constants.
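One way to picture constants is as plain text substitution that happens before the regular expression is compiled. The following sketch assumes exactly that; it is not taken from the plugin, and the class ConstantExpansionSketch with its methods define and expand is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConstantExpansionSketch {

    // Constants in definition order; the stored values are already expanded.
    private final Map<String, String> constants = new LinkedHashMap<>();

    // Defines a constant. References to earlier constants are expanded first,
    // which is why a constant has to be defined before it is referenced.
    void define(String name, String regex) {
        constants.put(name, expand(regex));
    }

    // Replaces every known %NAME% reference with the stored expression.
    String expand(String regex) {
        String result = regex;
        for (Map.Entry<String, String> e : constants.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        ConstantExpansionSketch c = new ConstantExpansionSketch();
        c.define("%NAME%", "[A-Za-z0-9_:\\-\\.]+");
        c.define("%ATT_VALUE%", "%NAME%|(\"[^\"]*\")|('[^']*')");
        System.out.println(c.expand("</%NAME%>"));
        System.out.println(c.expand("(%ATT_VALUE%)"));
    }
}
```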
Comment lines
A line that starts with a # character is interpreted as a comment (i.e. the line is ignored). Comments should be used to document the highlight configuration, e.g. describe what a regular expression is intended for. See the complete example below on how to use comment lines.
Since Docmenta version 1.8, the regular expression based syntax highlighting supports the concept of a state machine: The current state defines which set of rules is applied to the currently processed part of the input. During parsing of the input, the state machine can change its state, depending on the regular expression rule that matches the current part of the input.
A state is declared by a line starting with the :: operator, followed by the name of the state. All highlight rules that follow this line (up to the end of the file or up to the next line starting with ::) are rules that belong to this state. Example:
:: A_OR_B
a_style : A+
b_style : B+
:: STATE_C
c_style : CD*
In the example above, the states with name A_OR_B and STATE_C exist. The first two rules (a_style and b_style) belong to state A_OR_B and the third rule belongs to STATE_C.
Note that state names are case-sensitive. By convention, state names should be written in upper-case letters.
State transitions
Defining multiple states only makes sense if the state machine can transition from one state to another. A state transition can be defined for a regular expression rule by adding a line that starts with the -> operator directly after the regular expression rule. Given the example above, if the state shall change from A_OR_B to STATE_C when the regular expression A+ is matched, then the configuration has to be adapted as follows:
:: A_OR_B
a_style : A+
  -> STATE_C
b_style : B+
:: STATE_C
c_style : CD*
If no state transition is defined, as is the case for the rules b_style:B+ and c_style:CD*, then the state machine remains in the current state.
Return states
A state can transition from different source states to the same target state. In this case it is sometimes required to return to the source state. This can be achieved by using the notation "-> enter(target_state)" for the transition to the target-state, and the notation "-> return()" for the returning transition. Example:
:: A_OR_B
a_style : A+
  -> enter(STATE_X)
b_style : B+
  -> STATE_C
:: STATE_C
c_style : C
  -> enter(STATE_X)
:: STATE_X
x_style : X+
  -> return()
In the example above, after STATE_X has been entered, the next state may be either A_OR_B or STATE_C, depending on where STATE_X has been entered from.
Sometimes it may be required to return to a state other than the source state. This can be achieved by using the notation "-> enter(target_state, returnState:return_state)". Given the example above, if the STATE_C rule is changed to
c_style : C
  -> enter(STATE_X, returnState: A_OR_B)
then the state machine returns to A_OR_B instead of STATE_C.
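The transition notations can be pictured with a small state machine that keeps the return states on a stack. This is a sketch under assumptions, not the plugin's implementation: the class StateMachineSketch, its Rule type and the use of a stack are all hypothetical, and only the style of the whole match is applied (capturing groups are ignored for brevity).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StateMachineSketch {

    // One highlight rule plus its optional transition.
    static final class Rule {
        final String style;
        final Pattern pattern;
        final String target;    // state to move to, or null to stay
        final String returnTo;  // state remembered for a later return(), or null
        final boolean isReturn; // true for "-> return()"

        Rule(String style, String regex, String target, String returnTo, boolean isReturn) {
            this.style = style;
            this.pattern = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE);
            this.target = target;
            this.returnTo = returnTo;
            this.isReturn = isReturn;
        }
    }

    // Insertion order matters: the first declared state is the initial state.
    private final Map<String, List<Rule>> states = new LinkedHashMap<>();

    void add(String state, Rule rule) {
        states.computeIfAbsent(state, k -> new ArrayList<>()).add(rule);
    }

    // Assumes at least one state has been added.
    String highlight(String input) {
        String state = states.keySet().iterator().next();
        Deque<String> returnStack = new ArrayDeque<>();
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            Matcher bestMatch = null;
            for (Rule r : states.getOrDefault(state, List.of())) {
                Matcher m = r.pattern.matcher(input);
                // Nearest non-empty match wins; on a tie the earlier rule is kept.
                if (m.find(pos) && m.end() > m.start()
                        && (bestMatch == null || m.start() < bestMatch.start())) {
                    best = r;
                    bestMatch = m;
                }
            }
            if (best == null) {
                break; // no rule of the current state matches the rest of the input
            }
            out.append(input, pos, bestMatch.start())
               .append("<span class=\"").append(best.style).append("\">")
               .append(bestMatch.group()).append("</span>");
            pos = bestMatch.end();
            if (best.isReturn) {
                if (!returnStack.isEmpty()) {
                    state = returnStack.pop();
                }
            } else if (best.target != null) {
                if (best.returnTo != null) {
                    returnStack.push(best.returnTo);
                }
                state = best.target;
            }
        }
        return out.append(input.substring(pos)).toString();
    }

    // Builds the example configuration; with explicitReturnState the STATE_C rule
    // uses "-> enter(STATE_X, returnState: A_OR_B)" instead of "-> enter(STATE_X)".
    static StateMachineSketch example(boolean explicitReturnState) {
        StateMachineSketch sm = new StateMachineSketch();
        sm.add("A_OR_B", new Rule("a_style", "A+", "STATE_X", "A_OR_B", false));
        sm.add("A_OR_B", new Rule("b_style", "B+", "STATE_C", null, false));
        String returnTo = explicitReturnState ? "A_OR_B" : "STATE_C";
        sm.add("STATE_C", new Rule("c_style", "C", "STATE_X", returnTo, false));
        sm.add("STATE_X", new Rule("x_style", "X+", null, null, true));
        return sm;
    }

    public static void main(String[] args) {
        System.out.println(example(false).highlight("BCXA"));
        System.out.println(example(true).highlight("BCXA"));
    }
}
```

For the input BCXA, the first variant returns from STATE_X to STATE_C, which has no rule for A, so the trailing A stays unformatted. The second variant returns to A_OR_B, so the trailing A is formatted with a_style.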
The initial state
For each highlight configuration, one state is the so-called initial state. This is the state of the state machine when parsing of the input starts. The initial state is always the first state that is declared in the configuration file. Given the example above, A_OR_B is the initial state because it is declared before the state STATE_C.
Be aware that states imported via the @import command (see below) are not considered to be initial states, even if the import command is inserted before the first state declaration. If you want the initial state to be a state declared in an imported configuration, you need to declare a forwarding state as the first state in the configuration. A forwarding state just forwards to another state, e.g. to an imported state or to another state declared in the same file. For details see the section "Forwarding States" below.
Note that for backwards compatibility, a state does not necessarily have to be declared. If a regular expression rule is defined, but no state has been explicitly declared (e.g. in highlight configuration files created before the state machine concept was introduced), then the rules belong to the implicit state named START. Nevertheless, since the state machine concept was introduced, it is recommended to assign all highlight rules to an explicitly declared state.
Forwarding States
A forwarding state is a state that has no regular expression rules assigned, but immediately forwards to another state. A forwarding state is declared as follows:
where SOURCE is the name of the forwarding state to be declared, and TARGET is the name of another state to which this state shall be forwarded. Example:
a_style : A+
b_style : B+
c_style : CD*
In the example above, the state START is the initial state, because it's the first declared state. However, effectively the rules of state B_OR_C are the first rules applied to the input, because START just forwards to B_OR_C. Forwarding states can improve readability and maintainability of the configuration files.
Import command
Sometimes it may be useful to reuse existing highlight configurations. For example, if you have an existing highlight configuration for HTML code and you have an existing highlight configuration for JAVA code, then you could reuse these configurations when you create a highlight configuration for Java Server Pages (JSP) code.
You can import an existing highlight configuration by inserting the following line in your configuration:
where alias is the alias name of the configuration to be imported. Importing a configuration makes all states and constants defined in the imported configuration visible to the file that contains the import command.
State Priority, Extending state definitions
When states are imported via the import command (see above), it can become necessary to extend or overwrite imported state definitions. This can be achieved by declaring a state which has the same name as an imported state. If a state is declared that has the same name as a previously declared state, then the following cases need to be considered: the previously declared state has
  1. the same priority,
  2. higher or lower priority
than the newly declared state. If equally named state definitions have the same priority, then all regular expression rules defined for these definitions are merged (i.e. all rules defined for all these state definitions are applied).
If a state definition has lower priority than another equally named state definition, then the rules defined for the state with lower priority are ignored. Or the other way round: the state with higher priority replaces any equally named state with lower priority.
If no priority is explicitly assigned, then all states have the default priority 1. To declare a state with a priority other than the default priority, the following declaration has to be used:
where STATE_NAME is the name of the state to be declared, and x is the priority given as integer number. Note that the higher the number the higher the priority. Example:
a_style : A+
b_style : B+
aa_style : AA+
c_style : CD*
In the example above, two states are defined. The state STATE_A has effectively one rule assigned, which is aa_style:AA+. Note that the rule a_style:A+ is ignored because it has the default priority 1, which is lower than priority 2 of the second STATE_A definition. The state B_OR_C has effectively two rules assigned, namely b_style:B+ and c_style:CD*, because the rules of both state definitions have the same priority (the default priority 1).
Regular expression hints
Greedy vs. reluctant quantifiers
If, for example, the quantifiers ?, *, + or {n,m} are used, then the matching behavior is greedy, i.e. the matching algorithm consumes as many characters as possible. For example, given the following highlight configuration:
my_style : A.*C
If the input sequence ABCC is transformed with this configuration, then the result is:
<span class="my_style">ABCC</span>
This behaviour can be changed by using a reluctant quantifier, e.g. *? instead of *. For example if the regular expression is changed to A.*?C, then the result is:
<span class="my_style">ABC</span>C
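The difference between the two quantifiers can be verified directly with java.util.regex. A minimal sketch (the class QuantifierDemo and the method firstMatch are hypothetical names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* consumes as much as possible. Prints ABCC.
        System.out.println(firstMatch("A.*C", "ABCC"));
        // Reluctant: .*? stops at the first possible C. Prints ABC.
        System.out.println(firstMatch("A.*?C", "ABCC"));
    }
}
```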
Quantified groups
If a capturing group matches several times due to a quantifier applied to the group, then only the last match is formatted with the assigned style. For example, given the following highlight configuration:
,b_style : A(B)+
If the input sequence ABB is transformed with this configuration, then the result is:
AB<span class="b_style">B</span>
Unfortunately this behaviour is due to a limitation of the Java regular expression API. If you want all B characters to be formatted with the style b_style, then the regular-expression has to be changed to A(B+). However, this solution is not always appropriate. For example, if you have the regular-expression ((A)(B))+ and you want to assign different styles to A and B, then the configuration
,,a_style,b_style : ((A)(B))+
does not work. For example, if you have the input ABAB, then only the last A and B are highlighted. If you know the maximum number of repetitions, then a workaround is to explicitly repeat the group and use the quantifier ? instead of +. For example, if you know that the sequence AB is repeated at most once, then the configuration can be changed to:
,a_style,b_style,,a_style,b_style : (A)(B)((A)(B))?
Note: The usage of states can help to avoid quantified groups.
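The limitation can be observed in the Java API itself: after a match, a quantified group only reports the position of its last iteration. A minimal sketch (the class QuantifiedGroupDemo and the method groupOneSpan are hypothetical names):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifiedGroupDemo {

    // Returns {start, end} of group 1 for the first match, or null if there is no match.
    static int[] groupOneSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE).matcher(input);
        return m.find() ? new int[] {m.start(1), m.end(1)} : null;
    }

    public static void main(String[] args) {
        // (B)+ : only the last iteration is remembered. Prints [2, 3].
        System.out.println(Arrays.toString(groupOneSpan("A(B)+", "ABB")));
        // (B+) : the group itself covers both B characters. Prints [1, 3].
        System.out.println(Arrays.toString(groupOneSpan("A(B+)", "ABB")));
    }
}
```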
Complete Example
Highlighting XML/SGML content
The following is a simple highlight configuration for highlighting XML/SGML content:
### Regular expression constants ###
%NAME% : [A-Za-z0-9_:\-\.]+
%ATT_VALUE% : %NAME%|("[^"]*")|('[^']*')
### Highlight configuration ###
:: XML_START
# Comments
comment_style : <!--.*?-->
# Start of element tag
tag_style : <%NAME%\s*
  -> XML_ATTRIBUTES
# Closing element tag
tag_style : </%NAME%>
:: XML_ATTRIBUTES
# Element attribute
attrib_style,, attrib_value_style : %NAME%(\s*=\s*(%ATT_VALUE%))?\s*
# End of element tag
tag_style : /?>
  -> XML_START
The highlight configuration starts with the definition of the regular-expression constant %NAME%, which holds the pattern of an element/attribute name. Here, an element/attribute name is defined as a sequence of letters, digits, the underscore, the colon, the dash and the dot character. Note that the XML specification allows more characters to be used within element and attribute names, however this configuration should be sufficient in most cases.
The second definition is the constant %ATT_VALUE% which holds the pattern of an attribute value. Here, an attribute value is defined as either a string that follows the rules of an element/attribute name or as a string enclosed in double or single quotes. As you can see, the regular-expression constant %ATT_VALUE% references the previously defined constant %NAME%. This way it is possible to avoid repetition of expressions. Be aware that a constant has to be defined before it is referenced.
The constant definitions are followed by two state definitions: XML_START and XML_ATTRIBUTES. The initial state XML_START defines three rules. The first assigns the style comment_style to a regular expression that matches XML/SGML comments (note that the reluctant quantifier *? is used here). The second assigns the style tag_style to a regular expression that matches the start of an XML element. If the start of an element is matched, then the state changes from XML_START to XML_ATTRIBUTES. The third assigns the style tag_style to a regular expression that matches closing tags.
The second state XML_ATTRIBUTES defines two rules. The first rule matches an attribute. An attribute is a name which is optionally followed by a = character and an attribute value. An attribute is formatted with style attrib_style. The attribute value is formatted with style attrib_value_style. Note that the expression \s* allows an arbitrary number of whitespace characters between the name- and value-part of attributes. As long as attributes are matched, the state remains in state XML_ATTRIBUTES.
The second rule within XML_ATTRIBUTES matches the end of the element tag. In case of an opening tag, this is a single > character. In case of an element without content, this is a slash followed by a > (i.e. />). If the end of an element tag is matched, then the state changes back from XML_ATTRIBUTES to XML_START.
The highlight configuration above may also work for content that does not completely follow the XML/SGML rules. It is also assumed that the characters < and > are encoded as character entities &lt; and &gt; if not part of an XML/SGML tag. Otherwise the automatic highlighting may produce unexpected results.
Applying the highlight transformation
Given a file with alias name highlight_xml which contains the highlight configuration shown above, and a style with ID listing_xml which has the following Auto-Format call assigned:
org.docma.plugin.examples.RegExpHighlight cfg=highlight_xml
Then the following content
<div class="listing_xml">
  &lt;!-- An example --&gt;
  &lt;elem
    att="A value"&gt;
    The content of the element
  &lt;/elem&gt;
</div>
is rendered as
<div class="listing_xml">
  <span class="comment_style">&lt;!-- An example --&gt;</span>
  <span class="tag_style">&lt;elem
    </span><span class="attrib_style">att=<span
          class="attrib_value_style">"A value"</span></span><span class="tag_style">&gt;</span>
    The content of the element
  <span class="tag_style">&lt;/elem&gt;</span>
</div>
Note that within the content to be highlighted, the characters < and > are encoded as character entities &lt; and &gt;. However, the regular-expressions match the XML tags correctly, because the decode argument has the default value true, i.e. character entities are decoded before the pattern matching is applied.