Maintaining the RED FLAGS engine

Overview

The engine does the following when launched:

It can process more than one notice (also configurable) at the same time, to reduce the time needed to do all the work.

Running the engine

The Red Flags engine is a Java application. It is a simple JAR file that can be launched with this command:

$ java -jar redflags-engine-VERSION.jar

If you need to process a lot of notices, you need to increase the Java heap size, for example:

$ java -Xmx16G -jar redflags-engine-VERSION.jar

Configuring the engine

The engine is configured through properties. There are two ways to modify them externally:

YAML configuration overrides the internal one, and command line configuration overrides the YAML one.
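
The override order can be pictured as three layers merged in sequence. The following is only a minimal sketch of that idea, not the engine's actual configuration code: the class name `ConfigLayers`, the method `resolve` and the sample values in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of layered configuration; not part of the engine.
final class ConfigLayers {

    /**
     * Merges the three layers so that later layers win:
     * internal defaults < YAML file < command line.
     */
    static Map<String, String> resolve(Map<String, String> internal,
                                       Map<String, String> yaml,
                                       Map<String, String> commandLine) {
        Map<String, String> effective = new LinkedHashMap<>(internal);
        effective.putAll(yaml);        // YAML overrides the internal values
        effective.putAll(commandLine); // command line overrides YAML
        return effective;
    }

    public static void main(String[] args) {
        // Sample values only; the real internal defaults live inside the JAR.
        Map<String, String> internal = Map.of("threads", "1", "cache", "tenders");
        Map<String, String> yaml = Map.of("threads", "4", "cache", "/home/redflags/cache");
        Map<String, String> commandLine = Map.of("threads", "8");

        Map<String, String> effective = resolve(internal, yaml, commandLine);
        System.out.println("threads = " + effective.get("threads"));
        System.out.println("cache   = " + effective.get("cache"));
    }
}
```

In this sketch `threads` ends up with the command line value, while `cache` keeps the YAML value because the command line does not mention it.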

The most important configuration properties are:

Property  Type     Description                                       Default value
scope     String   The scope of notice numbers to process            (no default value, you must specify)
cache     String   The name of the cache directory                   tenders
threads   Integer  Number of notice processor threads                1
db        Integer  1 or 0, whether you want database export or not   0
dbhost    String   Location of your MySQL server (with port number)  (no default value)
dbname    String   Name of the database schema                       (no default value)
dbuser    String   Database user                                     (no default value)
dbpass    String   Password for database user                        (no default value)

A typical configuration file will look like this:

scope:   auto
cache:   /home/redflags/cache
threads: 4
db:      1
dbhost:  localhost:3306
dbname:  redflags
dbuser:  redflags
dbpass:  secret

# ... optional crawl configuration - detailed in developer docs

Cache directory

Make sure that the cache directory is writable by the Red Flags engine.

The structure of the directory will look like this:

cache/
    2016/
        1/
            files for notice 1-2016
        2/
            files for notice 2-2016
        ..
    ..

While downloading, the engine automatically triggers an optimization that rotates the individual files into ZIP archives. Each ZIP file contains hundreds or thousands of notices, which reduces the file and subdirectory count:

cache/
    2016/
        000001-001000.zip/
            1/
                files for notice 1-2016
            2/
                files for notice 2-2016
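
If you need to locate a notice inside the rotated cache by hand, the archive name can be derived from the notice number. The sketch below assumes that every archive covers a fixed block of 1000 notices and that names are zero-padded to six digits, which is only inferred from the example name `000001-001000.zip`; the engine may group notices differently. The class `CacheArchiveName` is hypothetical.

```java
// Hypothetical helper; the block size and padding are assumptions
// inferred from the example archive name 000001-001000.zip.
final class CacheArchiveName {

    static final int NOTICES_PER_ZIP = 1000;

    /** Returns the assumed ZIP file name that would contain the given notice number. */
    static String archiveNameFor(int noticeNumber) {
        if (noticeNumber < 1) {
            throw new IllegalArgumentException("notice number must be positive: " + noticeNumber);
        }
        // First notice number of the block this notice falls into (1, 1001, 2001, ...)
        int first = ((noticeNumber - 1) / NOTICES_PER_ZIP) * NOTICES_PER_ZIP + 1;
        int last = first + NOTICES_PER_ZIP - 1;
        return String.format("%06d-%06d.zip", first, last);
    }

    public static void main(String[] args) {
        System.out.println(archiveNameFor(2));    // 000001-001000.zip
        System.out.println(archiveNameFor(1001)); // 001001-002000.zip
    }
}
```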

About the sizes:

Threads

We suggest trying several values for the threads property. A higher number does not always mean a faster overall processing time. Slow resources (network, hard drives) are shared, so too many threads can slow things down because they end up waiting for each other.

We use 4 threads to process notices from ~4 years; the running time is around 3-3.5 hours.
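
To get a feeling for how the thread count interacts with slow shared resources on your machine, you can time a representative workload with different pool sizes before settling on a value. This is a generic sketch using the standard `ExecutorService`; it does not show how the engine schedules notices internally, and `ThreadCountTrial` with its simulated 50 ms task is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical timing helper; not part of the engine.
final class ThreadCountTrial {

    /** Runs all tasks on a fixed-size pool and returns the elapsed time in milliseconds. */
    static long timeWithThreads(int threads, List<Callable<Void>> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            // invokeAll blocks until every task has finished
            for (Future<Void> result : pool.invokeAll(tasks)) {
                result.get(); // surfaces any exception thrown by a task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in workload: each "notice" just sleeps to imitate slow I/O.
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            tasks.add(() -> {
                Thread.sleep(50);
                return null;
            });
        }
        for (int threads : new int[] {1, 2, 4, 8, 16}) {
            System.out.println(threads + " threads: " + timeWithThreads(threads, tasks) + " ms");
        }
    }
}
```

A sleeping task scales almost perfectly with more threads, which real downloads and disk access will not; replace it with something that touches your actual network and storage to see where the gains level off.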

Scopes

The scope argument tells the engine which notices should be processed. Scopes can be the following:

The typical usage would be:

  1. Run the engine manually with the YYYY.. scope first, to build the database from earlier years,
  2. then run it in auto mode each day from a cron script to keep the database up to date.
  3. If you need to re-parse, we suggest using YYYY.. again.

Be aware that notices are only available on *TED* for 5 years. This means that in April 2016, you cannot reach notices from March 2011.
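
Before choosing a start year for a scope, it can help to check whether a given month is still inside the 5-year window. The sketch below works at month granularity and treats the month exactly 5 years back as still available; both are assumptions, since the exact cut-off rule on TED is not described here. `TedAvailability` is a hypothetical helper.

```java
import java.time.YearMonth;

// Hypothetical helper; the month granularity and the inclusive boundary are assumptions.
final class TedAvailability {

    static final int YEARS_AVAILABLE = 5;

    /** Returns true if a notice published in the given month should still be reachable. */
    static boolean isLikelyAvailable(YearMonth published, YearMonth now) {
        YearMonth oldest = now.minusYears(YEARS_AVAILABLE);
        return !published.isBefore(oldest) && !published.isAfter(now);
    }

    public static void main(String[] args) {
        YearMonth now = YearMonth.of(2016, 4);
        System.out.println(isLikelyAvailable(YearMonth.of(2011, 3), now)); // false
        System.out.println(isLikelyAvailable(YearMonth.of(2012, 1), now)); // true
    }
}
```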

The Red Flags engine automatically skips notice IDs whose year is below 2000 or above the current year.
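
If you generate lists of notice IDs yourself, the same year rule can be applied up front. This is a minimal sketch that assumes IDs have the NUMBER-YEAR form used in the cache listing (for example 1-2016); it mirrors the rule described above but is not the engine's own code, and `NoticeIdFilter` is a hypothetical name.

```java
// Hypothetical pre-filter; assumes notice IDs look like NUMBER-YEAR, e.g. 1-2016.
final class NoticeIdFilter {

    static final int MIN_YEAR = 2000;

    /** Returns true if the ID is well-formed and its year lies in [2000, currentYear]. */
    static boolean isProcessable(String noticeId, int currentYear) {
        int dash = noticeId.lastIndexOf('-');
        if (dash <= 0 || dash == noticeId.length() - 1) {
            return false; // missing number part or missing year part
        }
        try {
            int number = Integer.parseInt(noticeId.substring(0, dash));
            int year = Integer.parseInt(noticeId.substring(dash + 1));
            return number >= 1 && year >= MIN_YEAR && year <= currentYear;
        } catch (NumberFormatException e) {
            return false; // non-numeric parts
        }
    }

    public static void main(String[] args) {
        int currentYear = java.time.Year.now().getValue();
        System.out.println(isProcessable("1-2016", currentYear));
        System.out.println(isProcessable("1-1999", currentYear));
    }
}
```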