Structure
A Config File has the following sections:
Header
Ignore
Footer
Replace
Statement Level Data
Header
The Header is the main part of the config file. It helps the Parser locate the relevant table in the PDF document. It also tells the Parser what kind of data sits in each cell and how to identify row breaks.
The headers are arranged in the same order as they appear in the table (and across multiple rows if the table has multi-layer headers).
Each Header has a 'Header Start' and a 'Header End' in the config file. Between this start and end there are two main parts:
Elements
Border
Elements
Each element in the Header corresponds to a cell in the header of the table we are trying to extract. Each element has:
Exact Word in the table header (can be in any language)
Metadata giving the parser important information about the data being extracted (e.g. this is a date, it is optional etc). Metadata is written just below the Exact Word for each cell. While technically Metadata is optional, you need to get it right for your table to extract correctly. More details are on the Config File Metadata page.
To clarify, the word 'Element' does not actually appear in the config file; the elements start immediately below 'Header Start'.
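To make the idea of an element more concrete, here is a minimal Python sketch of how an Exact Word plus optional Metadata could be modelled and used to find a header row. It is not the config file syntax and not the Parser's actual code: `HeaderElement`, `find_header_row` and the metadata keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeaderElement:
    # Text exactly as it is printed in the table header cell.
    exact_word: str
    # Hypothetical metadata keys; the real ones are defined by the Parser.
    metadata: dict = field(default_factory=dict)


def find_header_row(lines, elements):
    """Return the index of the first line that contains every
    exact word in the given order, or -1 if no line does."""
    for i, line in enumerate(lines):
        pos = 0
        for el in elements:
            pos = line.find(el.exact_word, pos)
            if pos == -1:
                break  # this word is missing, try the next line
            pos += len(el.exact_word)
        else:
            return i  # every word was found, in order
    return -1


elements = [
    HeaderElement("Date", {"type": "date"}),
    HeaderElement("Description"),
    HeaderElement("Amount", {"type": "number"}),
]
page_lines = [
    "Account Statement",
    "Date Description Amount",
    "01/02/2024 Coffee 3.50",
]
header_index = find_header_row(page_lines, elements)
```

The sketch shows why the order of the elements matters: a line only counts as the header row if the exact words appear in the same left-to-right order as the elements.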
Example of a header
Border
This describes the width of each column in pixels: for each column it gives the starting and ending pixel position, counted from left to right. The Parser has a utility to help you calculate the border pixels.
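As an illustration of what the border values are for, the following Python sketch assigns words to columns by their horizontal position. It assumes each column is a (start, end) pixel pair and each word carries the x position of its left edge; `column_for`, the sample borders and the sample words are hypothetical and do not reflect the Parser's internal implementation.

```python
def column_for(x, borders):
    """Return the index of the column whose [start, end) pixel
    range contains x, or None if x falls outside every column."""
    for index, (start, end) in enumerate(borders):
        if start <= x < end:
            return index
    return None


# Assumed borders for three columns, left to right, in pixels.
borders = [(0, 80), (80, 300), (300, 400)]

# Assumed words from one table row: (x position, text).
words = [(12, "01/02/2024"), (95, "Coffee"), (150, "shop"), (340, "3.50")]

row = [""] * len(borders)
for x, text in words:
    col = column_for(x, borders)
    if col is not None:
        # Words landing in the same column are joined with a space.
        row[col] = (row[col] + " " + text).strip()
```

If a border is too narrow or too wide, words land in the wrong column, which is why it is worth calculating the pixel values carefully.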
Ignore
This is a combination of text and regex and tells the Parser about the bits it needs to ignore. The Ignore section has a start and an end (see sample below).
The Parser ignores the entire line where it finds a match. To ignore multiple lines, use [\s\S]* in your regex: it matches any character, whitespace or non-whitespace, including new lines.
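The difference between a single-line match and a multi-line match can be tried out in Python's `re` module. This is only a sketch of the regex behaviour, not of the Parser itself: the sample text and patterns are made up, and the lazy form `[\s\S]*?` is used here so the match stops at the first closing marker.

```python
import re

text = (
    "Opening balance 100.00\n"
    "Page 1 of 3\n"
    "DISCLAIMER START\n"
    "terms line one\n"
    "terms line two\n"
    "DISCLAIMER END\n"
    "02/02/2024 Rent 500.00"
)

# Single-line ignore: drop every line in which the pattern matches.
single = re.compile(r"Page \d+ of \d+")
kept = [line for line in text.split("\n") if not single.search(line)]

# By default '.' does not match a new line, so this finds nothing.
dot_match = re.search(r"DISCLAIMER START.*DISCLAIMER END", text)

# [\s\S] matches any character, including new lines, so the
# pattern can span several lines.
multi = re.compile(r"DISCLAIMER START[\s\S]*?DISCLAIMER END\n?")
cleaned = multi.sub("", text)
```

Here `kept` no longer contains the page-number line, and `cleaned` has the whole disclaimer block removed while the surrounding lines stay intact.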
Footer
Footers are lines of text that typically appear at the bottom of the page and do not form part of the table we are trying to extract. Footers also have a Start and an End. Sample below
Replace
This replaces any text in the parsed output with other text. It is usually used to clean up formatting bugs and make the output clearer. Just like the other sections, it has a Start and an End.
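The effect of a replace step can be sketched in a few lines of Python. This assumes plain text pairs applied one after another; whether the Parser's Replace section also accepts regex is not stated here, and `apply_replacements` and the sample pairs are hypothetical.

```python
def apply_replacements(text, replacements):
    """Apply each (old, new) pair in order and return the result."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# Assumed pairs: a stray extraction artifact and a misspelt word.
replacements = [
    ("(cid:9)", " "),
    ("Descripton", "Description"),
]

fixed = apply_replacements("Descripton: Coffee(cid:9)shop", replacements)
```

Because the pairs are applied in order, an earlier replacement can change what a later one sees, so the order of the entries can matter.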
Statement Level Data
This is used to capture things like Account Numbers. Usually this data is not in the table, but it is elsewhere on the page.
Therefore you need to specify where this data is on the page: locate a starting point on the page, then give one more attribute stating where the data sits relative to that starting point (top, bottom, left or right).
The Parser will include this Statement Level Data in new columns in the output.
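The idea of a starting point plus a direction can be sketched as follows. This is an assumption-laden illustration rather than the Parser's algorithm: words are assumed to be (x, y, text) tuples with y growing downwards, and `find_relative` and its `tolerance` parameter are hypothetical.

```python
def find_relative(words, anchor_text, direction, tolerance=5):
    """Return the text nearest to the anchor in the given direction
    ('top', 'bottom', 'left' or 'right'), or None if nothing is found."""
    anchor = next((w for w in words if w[2] == anchor_text), None)
    if anchor is None:
        return None
    ax, ay, _ = anchor
    candidates = []
    for x, y, text in words:
        if (x, y, text) == anchor:
            continue
        same_row = abs(y - ay) <= tolerance
        same_col = abs(x - ax) <= tolerance
        if direction == "right" and same_row and x > ax:
            candidates.append((x - ax, text))
        elif direction == "left" and same_row and x < ax:
            candidates.append((ax - x, text))
        elif direction == "bottom" and same_col and y > ay:
            candidates.append((y - ay, text))
        elif direction == "top" and same_col and y < ay:
            candidates.append((ay - y, text))
    # The closest candidate wins.
    return min(candidates)[1] if candidates else None


# Assumed page content: labels on the left, values to their right.
words = [
    (50, 40, "Account Number"),
    (200, 41, "12345678"),
    (50, 60, "Sort Code"),
    (200, 60, "00-11-22"),
]
account_number = find_relative(words, "Account Number", "right")
```

With the sample words, the starting point is the label "Account Number" and the direction "right" picks up the value printed next to it on the same line.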