Template parser in RED FLAGS engine

Template parser is directly used by TemplateBasedDocumentParser gear (appears in the gear list), which is called from the following gears too:

So if you are not using these gears, this chapter is irrelevant for you.

What is Template parser?

When you need to parse a lot of data from a text document, you usually end up writing long code that repeats very similar steps with different parameters, e.g. search patterns. Template parser aims to simplify this job by doing the dirty work: the developer only has to create a template for the source document. It works on plain text documents and, using the template, generates a Map containing the parsed data.

The magic is located in TemplateParser class, which has only one public method:

Map<String, String> parse(String input, String template)

Let's see a short example:

Template:

Hey, my name is (?<name>\w+), and I'm (?<age>\d{1,3}) years old.

Input:

Hey, my name is Peter, and I'm 42 years old.

Output:

{name=Peter, age=42}

As you can see, template lines are basically regular expressions, and variables are represented by named capturing groups.

Of course, it's pointless to use Template parser for simple tasks like this instead of Pattern and Matcher. The power of Template parser is that it can handle multiline blocks, repeating and optional lines too.
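
For comparison, this is roughly what the single-line case looks like with Pattern and Matcher alone. It is only a minimal sketch, not TemplateParser's code: the class and method names (NamedGroupSketch, extract) are made up for illustration, it assumes the template line must match the whole input line, and it finds the variable names by scanning the template text for "(?<name>".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupSketch {

    // Finds group names by scanning the template text for "(?<name>".
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** Matches one input line against one template line and collects the named groups. */
    static Map<String, String> extract(String inputLine, String templateLine) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = Pattern.compile(templateLine).matcher(inputLine);
        if (!m.matches()) {
            return result; // no match: nothing is extracted
        }
        Matcher names = GROUP_NAME.matcher(templateLine);
        while (names.find()) {
            String name = names.group(1);
            String value = m.group(name);
            if (value != null) {
                result.put(name, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "Hey, my name is (?<name>\\w+), and I'm (?<age>\\d{1,3}) years old.";
        String input = "Hey, my name is Peter, and I'm 42 years old.";
        System.out.println(extract(input, template)); // prints {name=Peter, age=42}
    }
}
```

This covers exactly one line; multiline blocks, repeating lines and optional lines are what Template parser adds on top.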

The algorithm

  1. Start at the first line of both the input and the template
  2. While neither of them has run out:
    1. If the input matches the template line:
      1. Parse the variables
      2. Check forward (see procedure below)
      3. Move to the next line in the input
    2. Else if there's no match and the template line is optional, move to the next line in the template
    3. Else if there's no match and the template line is not optional, move to the next line in the input

Check forward:

  1. If the next input line does not match the current template line, we move forward in the template.
  2. Otherwise we keep stepping through the template until we reach the first template line that is not optional OR matches the input line. If we find one, we move to that template line.

Variables are parsed into a Map<String, String>, where the key is the variable name (capturing group's name). If a variable already exists in the Map, a new line (\n) and the new value will be appended to the existing one. Multiline data is handled this way.
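
The append rule can be illustrated with Map.merge. This is a minimal sketch of the described behaviour only; the class and method names (VariableMergeSketch, put) are hypothetical and do not come from the Red Flags code base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class VariableMergeSketch {

    /** Stores a value; if the key already exists, "\n" and the new value are appended. */
    static void put(Map<String, String> variables, String name, String value) {
        variables.merge(name, value, (oldValue, newValue) -> oldValue + "\n" + newValue);
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        put(variables, "a", "first");
        put(variables, "a", "second");         // same variable again: appended on a new line
        put(variables, "b", "something else");
        System.out.println(variables);
    }
}
```

After the three calls, the value stored under a is "first" and "second" separated by a line break, and b holds "something else".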

Note that TemplateParser is not thread-safe: do not call the same instance simultaneously from multiple threads.

Examples

Ideal situation (perfect match)

Template:

(?<name>.*)
(?<code>CODE-\d+)
(?<addr>.*\d.*)

Input:

Some Organization Ltd.
CODE-12345
Some street 42.

Output:

{name=Some Organization Ltd., code=CODE-12345, addr=Some street 42.}

Explanation:

The input matches the template perfectly, so the parser extracts all values.

Missing line

Template:

Line 1: (?<v1>.*)
Line 2: (?<v2>.*)
Line 3: (?<v3>.*)

Input:

Line 1: something
Line 3: another thing

Output:

{v1=something}

Explanation:

The algorithm iterated through the whole input to find the second line, but it couldn't. That's why it stopped at the end of the input and didn't try to search for line 3.

Optional line

Template:

Line 1: (?<v1>.*)
???Line 2: (?<v2>.*)
Line 3: (?<v3>.*)

Input 1:

Line 1: first
Line 2: second
Line 3: third

Output 1:

{v1=first, v2=second, v3=third}

Input 2:

Line 1: first
Line 3: second

Output 2:

{v1=first, v3=second}

Explanation:

The second line is optional (??? prefix), so its absence from Input 2 caused no trouble: all values were still extracted.
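
One possible way to represent such template lines in code is a small value type that remembers whether the line had the ??? prefix. The following is a hedged sketch: TemplateLineSketch and its members are hypothetical names, and how TemplateParser stores template lines internally is not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TemplateLineSketch {

    static final String OPTIONAL_PREFIX = "???";

    final boolean optional;
    final Pattern pattern;

    private TemplateLineSketch(boolean optional, Pattern pattern) {
        this.optional = optional;
        this.pattern = pattern;
    }

    /** Strips a leading "???" (if present) and compiles the rest as a regular expression. */
    static TemplateLineSketch of(String rawLine) {
        boolean optional = rawLine.startsWith(OPTIONAL_PREFIX);
        String regex = optional ? rawLine.substring(OPTIONAL_PREFIX.length()) : rawLine;
        return new TemplateLineSketch(optional, Pattern.compile(regex));
    }

    /** Turns a whole template into a list of lines, one entry per template line. */
    static List<TemplateLineSketch> parseTemplate(String template) {
        List<TemplateLineSketch> lines = new ArrayList<>();
        for (String raw : template.split("\\R")) {
            lines.add(of(raw));
        }
        return lines;
    }

    public static void main(String[] args) {
        String template = "Line 1: (?<v1>.*)\n???Line 2: (?<v2>.*)\nLine 3: (?<v3>.*)";
        for (TemplateLineSketch line : parseTemplate(template)) {
            System.out.println(line.optional + " " + line.pattern.pattern());
        }
    }
}
```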

Repeating line

Template:

Line A: (?<a>.*)
Line B: (?<b>.*)

Input:

Line A: first
Line A: second
Line B: something else

Output:

{a=first
second, b=something else}

So when a template line matches multiple input lines, the extracted values will be concatenated (separated by new lines). This way text blocks with multiple paragraphs can be extracted into a single variable.

Parsing a document using template

Larger documents, especially ones that contain structured information with repeating blocks, need to be split to increase parsing accuracy and efficiency. The template should of course be split too, so that it matches the document. This ensures that the parser will not jump into the next, irrelevant section of the document while it is searching for a line.

For example when you have an HTML file, HTML tags should be stripped out, and headings should be indicated in the same way both in the input and the template. Then you can write a splitter mechanism which splits the input as well as the template at the same points.

Red Flags contains implementations of a splitter (Splitter), a block matcher (DocumentParser) and a document normalizer (DocumentNormalizer, which removes HTML tags and marks headings).

Splitter class

Splitter's only public method, List<Block> split(String), splits the given normalized text and returns a list of blocks. Every block has 4 fields: the content and title Strings (the latter is the heading line), a variables Map for storing extracted values, and a children List for child blocks.

Splitting is done recursively and cuts by heading lines: every heading opens a new block. A lower level heading will create a child block in the current block. Headings should be indicated with # prefix. One # for the first level heading, two (##) for the second level heading and so on.

This splitter mechanism can be applied to input texts as well as templates, if both of them were normalized in the same way.
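
The heading-based splitting can be sketched as follows. This is not the Splitter class itself: SplitterSketch, its Block type and the extra level field are hypothetical, the sketch uses an explicit stack instead of recursion, and it assumes that the title keeps its # prefix and that text before the first heading is dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitterSketch {

    /** Hypothetical block type with the four fields described above. */
    static final class Block {
        final String title;                                   // the heading line
        String content = "";                                  // text lines below the heading
        final Map<String, String> variables = new LinkedHashMap<>();
        final List<Block> children = new ArrayList<>();
        final int level;                                      // sketch-only helper: number of leading '#'

        Block(String title, int level) {
            this.title = title;
            this.level = level;
        }
    }

    /** Counts the leading '#' characters; 0 means the line is not a heading. */
    static int headingLevel(String line) {
        int level = 0;
        while (level < line.length() && line.charAt(level) == '#') {
            level++;
        }
        return level;
    }

    static List<Block> split(String normalizedText) {
        List<Block> roots = new ArrayList<>();
        Deque<Block> open = new ArrayDeque<>(); // currently open blocks, innermost on top
        for (String line : normalizedText.split("\\R")) {
            int level = headingLevel(line);
            if (level == 0) {
                // Plain text belongs to the innermost open block.
                Block current = open.peek();
                if (current != null) {
                    current.content = current.content.isEmpty() ? line : current.content + "\n" + line;
                }
                continue;
            }
            // A heading closes every open block of the same or a deeper level.
            while (!open.isEmpty() && open.peek().level >= level) {
                open.pop();
            }
            Block block = new Block(line, level);
            if (open.isEmpty()) {
                roots.add(block);
            } else {
                open.peek().children.add(block);
            }
            open.push(block);
        }
        return roots;
    }

    public static void main(String[] args) {
        List<Block> blocks = split("#SECTION\n##1.1) TITLE\nTEXT\nTEXT\n#OTHER\nmore");
        System.out.println(blocks.size() + " top-level blocks");
    }
}
```

Because the same function works on any text that uses the # convention, it can be run on the normalized input and on the template alike.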

DocumentParser class

After splitting the template and the input, the next step is to match the input blocks to the template blocks and then parse each block. The DocumentParser class does this with its parse(List<Block> input, List<Block> template) method. It doesn't return anything; the extracted data is stored in the input blocks.

DocumentParser's block-matcher mechanism can handle situations when certain blocks are missing from the input or appear repeatedly. To ensure this, the algorithm works based on the template:

This mechanism is recursive, of course: it handles child blocks as well.

Parsing of block pairs means the algorithm calls the template parser for the block title and then for the block content. Extracted values will be stored in the current input block.

Loading templates

There is a class called TemplateLoader, which provides methods to read templates from the classpath. You only have to call getTemplate(String) with the template name; it appends the .tpl extension and looks for the file in the templates/ directory (src/main/resources/templates/ in your project).

The loading mechanism has a "redirect" feature. For example, when you load the templates automatically based on some category like data, you can tell TemplateLoader to look for first.tpl instead of second.tpl by creating second.tpl file with this content:

SAME AS first

In this case getTemplate("second") will return the contents of first.tpl.
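
The redirect lookup could work along these lines. This is a sketch under assumptions: RedirectSketch and resolve are hypothetical names, the template files are replaced by an in-memory Map from template name to file content, a redirect file is assumed to consist of the single SAME AS line, and following chained redirects is an assumption rather than documented TemplateLoader behaviour.

```java
import java.util.HashMap;
import java.util.Map;

final class RedirectSketch {

    static final String REDIRECT_PREFIX = "SAME AS ";

    /** Resolves "SAME AS x" redirects against a map of template name -> file content. */
    static String resolve(Map<String, String> files, String name) {
        String current = name;
        // Bound the number of hops so that a redirect cycle cannot loop forever.
        for (int hops = 0; hops <= files.size(); hops++) {
            String content = files.get(current);
            if (content == null) {
                return null; // unknown template
            }
            String trimmed = content.trim();
            boolean redirect = trimmed.startsWith(REDIRECT_PREFIX) && !trimmed.contains("\n");
            if (!redirect) {
                return content;
            }
            current = trimmed.substring(REDIRECT_PREFIX.length()).trim();
        }
        throw new IllegalStateException("Redirect cycle starting at " + name);
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("first", "Line 1: (?<v1>.*)");
        files.put("second", "SAME AS first");
        System.out.println(resolve(files, "second")); // prints the content stored under "first"
    }
}
```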

There's an additional loader method in this class: getTemplate(Notice n, String lang). It selects a template version automatically based on the given language, the document type and the directive used by the given notice:

Document parsing in RED FLAGS

In the Red Flags engine the following gears are responsible for the parsing:

DocumentNormalizer class

DocumentNormalizer has only one method, normalizeDocument, which accepts the input HTML code as a String and returns the normalized text.

On TED, HTML code of documents contains blocks of the following structure:

<div class="grseq">
  <p class="tigrseq"><span id="...">SECTION</span></p>
  <div class="mlioccur">
    <span class="nomark">CHAPTER NUMBER, e.g. 1.2)</span>
    <span class="timark">CHAPTER TITLE</span>
    <div class="txtmark">
      TEXT<br>
      TEXT<br>
    </div>
  </div>
  ...
</div>

The normalized document for the above code will look like this:

#SECTION
##CHAPTER NUMBER, e.g. 1.2) CHAPTER TITLE
TEXT
TEXT

This format allows the text to be split in the same way as the template.
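
To make the transformation concrete, here is a rough regex-based sketch that produces the normalized form for the structure shown above. It is not DocumentNormalizer's implementation (which may well use a real HTML parser): NormalizerSketch and normalize are hypothetical names, only the class names from the sample are handled, and HTML entities are not decoded.

```java
import java.util.ArrayList;
import java.util.List;

final class NormalizerSketch {

    static String normalize(String html) {
        String text = html
                // Section title paragraph -> first-level heading
                .replaceAll("(?s)<p class=\"tigrseq\">(.*?)</p>", "\n#$1\n")
                // Chapter number + chapter title -> second-level heading
                .replaceAll("(?s)<span class=\"nomark\">(.*?)</span>\\s*"
                        + "<span class=\"timark\">(.*?)</span>", "\n##$1 $2\n")
                // Line breaks become real newlines
                .replaceAll("(?i)<br\\s*/?>", "\n")
                // Drop every remaining tag
                .replaceAll("<[^>]+>", "");

        // Trim each line and drop the empty ones.
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String html = "<div class=\"grseq\">\n"
                + "  <p class=\"tigrseq\"><span id=\"x\">SECTION</span></p>\n"
                + "  <div class=\"mlioccur\">\n"
                + "    <span class=\"nomark\">1.2)</span>\n"
                + "    <span class=\"timark\">CHAPTER TITLE</span>\n"
                + "    <div class=\"txtmark\">\n"
                + "      TEXT<br>\n"
                + "      TEXT<br>\n"
                + "    </div>\n"
                + "  </div>\n"
                + "</div>";
        System.out.println(normalize(html));
    }
}
```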

And here is a hint for creating templates: if we run the normalizer on an input document, we only have to delete some lines and modify others so that they contain patterns and variables, and we have a template for that document.

Tab012Parser class

Tab012Parser normalizes the HTML code, splits the input and the template, then calls DocumentParser for each block and stores the result values in the Notice object. Before calling the document parser it resolves anomalies around repeating blocks. Sometimes repeating blocks in notices look like this:

Section X
x.1) A
x.2) B
x.3) C
x.1) A
x.2) B
x.3) C

Tab012Parser rebuilds the block structure to have an understandable format for the document parser:

Section X
x.1) A
x.2) B
x.3) C
Section X
x.1) A
x.2) B
x.3) C

After it performs this restructuring and calls the parser, it stores the parsed values in the Notice. This is done block by block, while checking which blocks are repeating blocks using the patterns that come from configuration parameters (see below).
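
The restructuring shown above can be illustrated on plain lines. This is only a sketch of the idea: RepeatingBlockSketch and its methods are hypothetical, it works on strings instead of Block objects, and detecting a repetition by a chapter number that shows up again is an assumption (the real parser relies on the configured patterns).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RepeatingBlockSketch {

    /** Returns the numbering prefix of a chapter line, e.g. "x.1)" for "x.1) A". */
    static String numbering(String chapterLine) {
        int end = chapterLine.indexOf(')');
        return end < 0 ? chapterLine : chapterLine.substring(0, end + 1);
    }

    /** Re-emits the section header whenever a chapter number appears a second time. */
    static List<String> restructure(String sectionHeader, List<String> chapterLines) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        result.add(sectionHeader);
        for (String line : chapterLines) {
            if (!seen.add(numbering(line))) {
                // The numbering restarted: open a new copy of the section.
                result.add(sectionHeader);
                seen.clear();
                seen.add(numbering(line));
            }
            result.add(line);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chapters = Arrays.asList(
                "x.1) A", "x.2) B", "x.3) C", "x.1) A", "x.2) B", "x.3) C");
        for (String line : restructure("Section X", chapters)) {
            System.out.println(line);
        }
    }
}
```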

Important: the parse method contains a code part with Hungarian-specific parameters for the above restructuring mechanism. These will be moved out into configuration properties in the future.

Storing the Map values in POJOs is done by MappingUtils, which uses Spring's BeanWrapper utility. MappingUtils adds the functionality of instantiating deeper structures using default constructors.

Tab012Config class

This class represents the configuration parameters of Tab012Parser and is filled automatically by Spring.

You can specify the following parameters for each language:

Property
    Description
redflags.engine.parser.tab012.langspec.{LANG}.repeatingBlocks
    List of patterns that match normalized section header lines of repeating blocks
redflags.engine.parser.tab012.langspec.{LANG}.objBlock
    Pattern that matches normalized header line of the "Section II. Object of the contract" section
redflags.engine.parser.tab012.langspec.{LANG}.lotBlock
    Pattern that matches normalized header line of lot sections
redflags.engine.parser.tab012.langspec.{LANG}.awardBlock
    Pattern that matches normalized header line of the "Section V. Contract award" section

{LANG} must be the language code in upper case, same as in the parse language property.

As an example here's the configuration for Hungarian notices:

redflags.engine.parser.tab012:
    langspec:
        HU:
            repeatingBlocks:
                - "#II\\.?[AB]?\\.? szakasz: .* tárgya.*"
                - "#II\\. szakasz: Tárgy"
                - "#((A )?Részekre vonatkozó információk#)?Rész száma:?.*"
                - "#(V\\. szakasz: Az eljárás eredménye|(V\\. szakasz.*#)?((A )?szerződés|Rész) száma.*|V\\. SZAKASZ(?!: Eljárás).*)"
            objBlock: "#II\\.?[AB]?\\.? szakasz:.* tárgy.*"
            lotBlock: "#((A )?Részekre vonatkozó információk#)?Rész száma.*"
            awardBlock: "#(V\\. szakasz: Az eljárás eredménye|(V\\. szakasz.*#)?((A )?szerződés|Rész) száma.*|V\\. SZAKASZ(?!: Eljárás).*)"

Tab012Parser will use the appropriate language specific configuration automatically.

Example template

#II\. szakasz: A szerződés tárgya
##II\.1\.?\) Meghatározás
##II\.1\.1\.?\) Az ajánlatkérő által a szerződéshez rendelt elnevezés:?
(?<contractTitle>.*)
##II\.1\.2\.?\) A szerződés típusa.*teljesítés helye
(?<contractTypeInfo>(^(?!NUTS|HU\d+|A telj).*))
A teljesítés helye:? (?<placeOfPerformance>.*)
???#A szerződés típusa:? (?<contractTypeInfo>.*)
(?<placeOfPerformance>(^(?!NUTS|HU\d+).*))
##II\.1\.3\.?\) A hirdetmény tárgya
(?<shortDescription>.*)
##II\.1\.3\.?\) Közbeszerzésre, keretmegállapodásra és dinamikus beszerzési rendszerre \(DBR\) vonatkozó információk
(?<pcFaDps>.*)
##II\.1\.4\.?\) (Keretmegállapodásra vonatkozó információk|Információ a keretmegállapodásról)
(?<frameworkAgreement>.*)
##II\.1\.5\.?\) A szerződés.*rövid (meghatározása|leírása):?
(?<shortDescription>.*)
##II\.1\.6\.?\) Közös közbeszerzési szójegyzék \(CPV\)
##II\.1\.7\.?\) .*\(GPA\).*
(?<gpa>.*)

You can see the full version of this along with all templates in src/main/resources/templates directory.