TED Interface in RED FLAGS engine
TED Interface overview
TED Interface is an easy-to-use, high-level downloader (HTTP client) library for TED with many configuration options. It provides synchronized methods that download one tab of a notice in the specified display language. It can retry requests and sleep before every download.
The most important methods are:
TedResponse requestNoticeTabQuietly(NoticeID id, DisplayLanguage lang, Tab tab)
TedResponse requestNoticeTab(NoticeID id, DisplayLanguage lang, Tab tab) throws TedError
TedResponse requestNoticeTabWithoutRetrying(NoticeID id, DisplayLanguage lang, Tab tab) throws TedError
TedInterfaceConf getConf()
"Quietly" means the method does not throw an exception; it returns null instead if any error occurs.
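The difference between the throwing and the "quiet" variant can be sketched as follows. This is only an illustration: `SketchDownloader` and `SketchError` are hypothetical stand-ins, the notice is reduced to a plain string, and the real `TedInterface` may be implemented differently.

```java
// Hypothetical stand-in for a checked download error (not the real TedError).
class SketchError extends Exception {
    SketchError(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a downloader with a throwing and a quiet method.
class SketchDownloader {

    // Throwing variant: fails for blank IDs, otherwise returns fake HTML.
    String requestNoticeTab(String noticeId) throws SketchError {
        if (noticeId == null || noticeId.trim().isEmpty()) {
            throw new SketchError("invalid notice id");
        }
        return "<html>" + noticeId + "</html>";
    }

    // Quiet variant: swallows the checked exception and signals failure with null.
    String requestNoticeTabQuietly(String noticeId) {
        try {
            return requestNoticeTab(noticeId);
        } catch (SketchError e) {
            return null;
        }
    }
}
```

Callers of the quiet variant have to check the result for null, since the reason for the failure is no longer available.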
The TedInterface class lives in the hu.petabyte.redflags.tedintf package. It is designed to be used as a singleton to avoid parallel downloads, which lead to IP banning on TED. You can get the instance using the getInstance() static method.
There is also a
setInstance(TedInterface) method so you can replace the singleton instance with another implementation.
Red Flags does this: it uses
GzippedNoticeCache to save HTML files into a cache directory and fetch them from it.
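The replaceable-singleton idea can be sketched like this. The class names `SketchInterface` and `SketchCachingInterface` are hypothetical, the lazy creation of the default instance is an assumption, and the download is faked with a string; only the `getInstance()` / `setInstance(...)` pattern mirrors the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a singleton downloader whose instance can be swapped.
class SketchInterface {
    private static SketchInterface instance;

    // Assumption: the default instance is created lazily on first access.
    static synchronized SketchInterface getInstance() {
        if (instance == null) {
            instance = new SketchInterface();
        }
        return instance;
    }

    // Replaces the singleton with another implementation.
    static synchronized void setInstance(SketchInterface replacement) {
        instance = replacement;
    }

    // synchronized: at most one download runs at a time on this instance.
    synchronized String download(String noticeId) {
        return "html-of-" + noticeId;
    }
}

// Hypothetical replacement that remembers earlier results in memory.
class SketchCachingInterface extends SketchInterface {
    private final Map<String, String> cache = new HashMap<>();

    @Override
    synchronized String download(String noticeId) {
        String cached = cache.get(noticeId);
        if (cached == null) {
            cached = super.download(noticeId);
            cache.put(noticeId, cached);
        }
        return cached;
    }
}
```

After `SketchInterface.setInstance(new SketchCachingInterface())`, every caller that goes through `getInstance()` transparently gets the caching behaviour.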
There is a class called
TedInterfaceHolder which is a Spring component providing getter/setter methods for the TedInterface instance.
TedInterfaceHolder is initialized by
TedInterface and adds a caching layer on top of the HTTP client mechanism. When you call
requestNoticeTab, it checks the cache first and returns the requested HTML if it finds it. Otherwise the download is performed and the HTML is written to a file.
Files are stored inside the cache directory, which is one of the constructor arguments. The other arguments are two
NoticeCache objects: the first one is the "old cache", the second one is the "cache".
CachedTedInterface is designed to be able to move from an old cache to a new one on the fly. When a notice tab is requested,
CachedTedInterface checks the old cache first (if specified), and if it finds the HTML, it moves it to the new cache.
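The lookup order described above (old cache, then current cache, then download) might look roughly like the sketch below. `MemoryCache` and `CachingDownloader` are hypothetical names, the cache is an in-memory map keyed by one string instead of notice ID, language and tab, and the HTTP layer is passed in as a function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory cache with fetch/remove/store operations.
class MemoryCache {
    private final Map<String, String> entries = new HashMap<>();

    String fetch(String key) {
        return entries.get(key);
    }

    void remove(String key) {
        entries.remove(key);
    }

    boolean store(String key, String raw) {
        entries.put(key, raw);
        return true;
    }
}

// Hypothetical caching layer on top of a download function.
class CachingDownloader {
    private final MemoryCache oldCache; // may be null if no old cache is specified
    private final MemoryCache cache;
    private final Function<String, String> downloader; // stands in for the HTTP client

    CachingDownloader(MemoryCache oldCache, MemoryCache cache, Function<String, String> downloader) {
        this.oldCache = oldCache;
        this.cache = cache;
        this.downloader = downloader;
    }

    String request(String key) {
        // 1. Old cache first: on a hit, move the entry to the new cache.
        if (oldCache != null) {
            String old = oldCache.fetch(key);
            if (old != null) {
                cache.store(key, old);
                oldCache.remove(key);
                return old;
            }
        }
        // 2. Then the current cache.
        String cached = cache.fetch(key);
        if (cached != null) {
            return cached;
        }
        // 3. Otherwise download and remember the result.
        String html = downloader.apply(key);
        if (html != null) {
            cache.store(key, html);
        }
        return html;
    }
}
```

With this order, entries migrate to the new cache one by one as they are requested, so no separate bulk migration step is needed.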
String fetch(NoticeID id, DisplayLanguage lang, Tab tab);
void remove(NoticeID id, DisplayLanguage lang, Tab tab);
boolean store(NoticeID id, DisplayLanguage lang, Tab tab, String raw);
SimpleNoticeCache - Stores the HTML files as they are.
GzippedNoticeCache - Stores the HTML files gzipped.
OptimizedNoticeCache - Stores the HTML files gzipped, but groups every 1000 of them into a ZIP file to reduce the file/subdirectory count.
All of them extend the
FilesystemNoticeCache abstract class, and the simple and gzipped caches use the cache directory structure:
cache/year/number/tab-file. The optimized cache has the structure of
OptimizedNoticeCache is used.
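To illustrate the gzipped, directory-based variant, here is a minimal sketch built only on the JDK. `GzipFileCache` is a hypothetical class, the notice is identified by plain `year`, `number` and `tab` values, and the `.html.gz` file name is an assumption; only the `year/number/tab-file` layout comes from the text above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical gzip-based filesystem cache.
class GzipFileCache {
    private final Path root;

    GzipFileCache(Path root) {
        this.root = root;
    }

    // Layout: <root>/<year>/<number>/<tab-file>; the ".html.gz" suffix is an assumption.
    private Path fileFor(int year, int number, String tab) {
        return root.resolve(String.valueOf(year))
                .resolve(String.valueOf(number))
                .resolve(tab + ".html.gz");
    }

    // Compresses the raw HTML and writes it to disk; returns false on I/O failure.
    boolean store(int year, int number, String tab, String raw) {
        Path file = fileFor(year, number, tab);
        try {
            Files.createDirectories(file.getParent());
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
                out.write(raw.getBytes(StandardCharsets.UTF_8));
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Returns the decompressed HTML, or null if the file is missing or unreadable.
    String fetch(int year, int number, String tab) {
        Path file = fileFor(year, number, tab);
        if (!Files.isRegularFile(file)) {
            return null;
        }
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file));
             ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null;
        }
    }

    // Deletes the cached file if it exists; I/O errors are ignored in this sketch.
    void remove(int year, int number, String tab) {
        try {
            Files.deleteIfExists(fileFor(year, number, tab));
        } catch (IOException ignored) {
            // nothing to do
        }
    }
}
```

One directory per notice keeps lookups trivial but produces many small files and subdirectories, which is the problem the optimized cache is meant to reduce.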
TED Interface stores its configuration in a
TedInterfaceConf POJO. It has the following options:
||Sleep before a retry is calculated this way:
||Max body size option of the Jsoup connection|
||Sleep before a retry is calculated this way:
||A String to be searched in the response. If it's not in there, we have a
||Maximum number of retries in those error cases where retrying can help|
||Amount of sleep in milliseconds to be performed before any request to TED.|
||Timeout option of the Jsoup connection|
In Red Flags, these properties can be set via application properties, e.g.:
redflags.engine.tedintf:
  timeout: 240000
  sleepBeforeRequest: 5000
When TedInterface encounters an error, it throws a
TedError exception. Each of these contains a
TedErrorType object, which provides some information about the error and whether it is recoverable or not.
|Enum value||Can retry help?||Can crawling be continued?||Description|
||yes||yes||Possibly a network error - you can retry the request.|
||yes||yes||TED responded with an HTTP error, possibly a server error - you can retry the request, perhaps a bit later.|
||yes||yes||The response body is bigger than we expected - you need to increase
||no||yes||The requested NoticeID does not exist on TED, maybe it has been deleted.|
||no||yes||The requested NoticeID is no longer available on TED, it is too old.|
||yes||yes||The requested NoticeID is not yet available, try again later.|
||no||no||You've just got an IP-based temporary ban for 24 hours. Wait 24 hours, increase the time between requests and try again.|
||yes||yes||The received page is missing some expected content, maybe we have been redirected or there is a server bug on TED.|
||yes||yes||Failed to parse the HTML code.|
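The "Can retry help?" column in the table above is what a retry loop would consult. The sketch below shows one way to do that. All names (`SketchErrorType`, `SketchTedError`, `Download`, `Retrier`) and the three enum constants are hypothetical, and a constant sleep between attempts is an assumption, not the engine's actual formula.

```java
// Hypothetical error types carrying the two flags from the table.
enum SketchErrorType {
    NETWORK(true, true),
    NOT_FOUND(false, true),
    BANNED(false, false);

    final boolean retryCanHelp;
    final boolean crawlingCanContinue;

    SketchErrorType(boolean retryCanHelp, boolean crawlingCanContinue) {
        this.retryCanHelp = retryCanHelp;
        this.crawlingCanContinue = crawlingCanContinue;
    }
}

// Hypothetical checked exception that carries its error type.
class SketchTedError extends Exception {
    final SketchErrorType type;

    SketchTedError(SketchErrorType type) {
        super(type.name());
        this.type = type;
    }
}

// A download attempt that may fail with a typed error.
interface Download {
    String run() throws SketchTedError;
}

class Retrier {
    // Retries only while the error type says a retry can help and the limit is not reached.
    static String withRetries(Download download, int maxRetries, long sleepMillis) throws SketchTedError {
        int retries = 0;
        while (true) {
            try {
                return download.run();
            } catch (SketchTedError e) {
                if (!e.type.retryCanHelp || retries >= maxRetries) {
                    throw e;
                }
                retries++;
                try {
                    Thread.sleep(sleepMillis); // assumption: constant sleep before each retry
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

A crawler built on this would additionally inspect `crawlingCanContinue` on the error it finally receives, and stop entirely when it is false, as in the ban case.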
MaxNumberDeterminer is a tool which can find the highest available notice number in a given year, using a
TedInterface instance. It's defined as a service, and the
TedInterfaceHolder is autowired.
Its only public method is
int maxNumberForYear(int year), which returns 0 if the algorithm fails or the year is invalid. The algorithm does something like a binary search and checks whether the current number has a valid TED response or not.
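A plain binary search for the highest existing number could look like the sketch below. It is not the engine's actual algorithm: `MaxNumberSketch` is a hypothetical class, the availability check is passed in as a predicate instead of a TED request, the `upperLimit` parameter is an assumption, and the sketch assumes that numbers exist contiguously from 1 up to the maximum (no gaps from deleted notices).

```java
import java.util.function.IntPredicate;

// Hypothetical helper: finds the largest n in [1, upperLimit] for which exists.test(n) is true.
class MaxNumberSketch {

    static int maxNumber(IntPredicate exists, int upperLimit) {
        if (upperLimit < 1) {
            return 0;
        }
        int lo = 0;          // highest number known to exist (0 = none found yet)
        int hi = upperLimit; // nothing above hi can exist
        while (lo < hi) {
            // Upper middle, so that lo = mid always makes progress.
            int mid = lo + (hi - lo + 1) / 2;
            if (exists.test(mid)) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo; // 0 if no number exists, mirroring the "returns 0 on failure" behaviour
    }
}
```

Each probe corresponds to one request, so for an upper limit of one million the search needs about 20 probes. A real implementation would also have to tolerate missing numbers in the middle of the range, which this sketch ignores.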
To increase its efficiency, it has its own special cache. When you need the max number for a year earlier than the current one, it reads the number from the cache. The cache file is
This tool is used by some
Scope implementations to detect the end of each year.