Config Setup


Canopy
Last Updated: 1 year ago

Introduction

Why is a configuration file (config file) required?

In order for Canopy Parser to work and parse the data in the given custodian PDF, the user provides the given PDF file and a config file as a mandatory input to the product Parser, a config, written in in an Excel sheet is an “instruction” for the Parser which enables it to,

  • Identify and segregate the various components of the PDF file i.e. header, subheaders, footers the various data types present the file such as date, numerical and ISIN values using the metadata, i.e. data or information about the data.

  • Additionally,  we can instruct Parser to capture additional data or the statement level data which is sometimes present outside the tables such as Portfolio numbers, Relationship Manager, Portfolio Valuation Date etc.

  • Parser can also be instructed to ignore certain data pockets in the pdf to include in the Excel output

Key Readings

For better understanding on the subject matter, the reader is encouraged to go through the below articles before proceeding to the next sections of the document.

1. Creation of a config file


A config file may contain up to as many tables as there are in the PDF file,  however depending upon the requirement we may not need to extract all the tables from the PDF file hence the number of tables in a config file for a particular PDF may be less than or equal to the number of tables present in the original PDF document. For better understanding of the matter, a PDF statement, its configuration file and its output have been added in the document. The config creation steps for this file serve as a guideline/example.

In the first demo, we shall be referring to this PDF document.

The following steps shall be followed in the creation of a config file

  1. Start with “header start” to instruct Parser to start the table

  2. Add the name of the table in the immediate next line.

  3. Add the names of the columns with the exact case match with the PDF document, be careful of the colons (:) or dashes (-) if any the either the column names or the Table Names

  4. Look for the subheaders in the tables, they may be present in some PDF documents.

  5. Determine the metadata of the columns by referring to the data of the table, for instance in case of dates it could be “date”, for numerical values it may be “number” or in some cases the values are intermittently present we can add “optional”, for a better understanding of the metadata, the reader is advised to go through the second document in the Key Readings section. Please note that the metadata should be below each of the column headers and that it is case sensitive 

  6. Add “border start” and leave two rows vacant after the border start field since this space would be later utilized to define column coordinates (pixels) in the later stage and finally add ”header end” to indicate the end of the table being currently parsed.

  7. Follow the steps 1-6 for the other  tables.

  8. Note: Name the tab in the format: bank name followed by an underscore followed by the format identifier, say  bank_subtype.

The config file for the PDF should be found here. Kindly refer to the tab “bank_config”.

The following images depict the relationship between the PDF input file and the config file with corresponding elements following the same color scheme.

                                                                                 Document 1

                                                                                    Config

The color codings signify the following:

Blue: Columns

Gray: Statement Level Data

Purple: Table Name

Green: header start, border start and header end

Yellow: data and metadata

Border Creation

Once the basic structure or barebones of the  config file is created i.e. config without the border values or coordinates, you may now proceed to creation of borders

The following steps shall be followed in the creation of borders in the config file

  1. Head to https://pdf.canopy.cloud/login

                                                                          Parser dashboard

2. Click on Create Borders

3. Enter the file details here and click upload once the details are entered.

4. Draw the borders

The line (or border) drawn “looks to the right”, it means the data is captured on the right side of the border, while drawing the borders, the first border shall lie to the left of the first column, second border to the left of the second column and so on. However, for the last column the corresponding border shall lie to its immediate left as well and there are no borders beyond (i.e. at the right) the last column.  Parser automatically detects the columns and creates the borders, however the borders may be incorrect in some cases, the user is advised to leverage the below features available in order to redraw the border as per the preference.

Clear: This removes all the borders created on the document, created both manually and automatically. Once this is clicked you would have to redraw the borders from scratch.

Undo: Removes the last created border.

Delete by line: This deletes the border in between two existing borders, the line to be deleted can be specified by its sequence number when counted from left to right, for instance if you want to delete line number 3, enter the line number in the dialog box and click Ok.

Notice that the third line has been deleted.

5. Once you click the submit button, you get an output zip file downloaded to your computer, extract it and find the config file with filled values of border start pixel coordinates.

6. Extract the file and save it in your preferred location

7. The final extracted config file should look like this. Notice that the rows between border start and header end, left blank initially, are now filled with coordinates numbers that indicate the column positions, the lower row has the pixel values selected by the user (manually created line by the user) and the upper row has the values detected by Parser (automatic detection).

The config creation part is now complete.

Summary

2. Extraction of the data from the PDF by Parser

Parser allows you to parse one or more than one file at a time, PDFs of the same type may use the same config file to extract the data but the PDF documents have their  own unique configuration files attributed to their unique formats since the PDF documents belong to various banks and financial institutions, so you may need to create a new config file whenever you get a new “type” of PDF document to parse. One important thing to note is that the documents having the same format can be parsed by the same config file. We also maintain an excel config file directory for the supported banks. You may create your custom config files as per your requirements.

  1. Click Single Upload and input the file details, Parser parses the whole file by default by selecting the default “All” option or you may parse a range of pages by selecting “Range”.

2. Check for Parser parameters depending on your PDF file, Call Parser Parameter enables you to change the “default settings” for parsing by tweaking some of the parameters depending upon your use case, in this case we shall be selecting “EUROPEAN_NUMBER_FORMAT” since the number values are in 123.456,789 format instead of the more prevalent 123,456.789, please note that selecting these parameters is optional in itself, however the parameters should be selected depending on the PDF document requirements.

3. The screen would show the status of the parsing and when the status reaches 100% the color changes from yellow to green signifying the completion of parsing 

4. Download the extracted file by clicking ‘Download Excel’ button

The extracted file can be found here.

Output excel extract and Input document relationship

                                                                       Original PDF Document

                                                                                   Config

                                                                           Extracted Data File

The color codings signify the following:

Blue: Columns

Gray: Statement Level Data

Yellow: Parsed Data, metadata

Green: header start, border start, header end

3. Config file with Ignore section, subheaders, regex

Typically the files we get for parsing have many additional components such as subheaders, ignore sections and regular expressions (regexes). Do note that a vast majority of the PDF documents we get have more than one tables which require parsing for instance, in this PDF document we shall be extracting two tables, Asset Summary and Transaction List. The procedure shall remain the same to extract the tables as previously stated for the previous PDF document but a few more things are required to be taken care of for tables with multiple subheaders.

1. Asset Summary

                                                                      Document 2- Table 1

                                                                              Config

2. Transaction List 

                                                                     Document 2- Table 2

                                                                       Config

                                                                          Output File

The output tables have been arranged in the form of screenshots due to space constraints. The additional components in this PDF document are the subheaders, ignore statements and regular expressions.

The color codings signify the following:

Red: Subheaders

Purple: Table Name

Blue: Columns

Green: header start, border start values and header end

Gray: Statement Level Data

Yellow: Parsed Data/Metadata

4. Config with renamed_headers

We shall be considering the table Limited Partner’s Investment Record

                                                                       Document 3- Table 1

                                                                                Config

                                                                              Output file

   

The color codings signify the following:

Blue: Columns

Green: header start, border start values and header end

Yellow: Parsed Data/Metadata

Notice in the config file we are using additional metadata in the third cell from the table name (B4 in excel sheet)called renamed_headers, the position of renamed_headers is B4 irrespective of the config files you are creating, in case of the files where renaming of the columns is required, we use this metadata to instruct parser to read the columns from the PDF document and rename them to header name we require. 

One key usage of this metadata is to use it as ‘false header’, i.e. in some documents especially scanned documents or the documents where the header is not provided for a column we can invoke this metadata and replace (read rename) the blank value with the value we desire, false headers shall be covered in detail in the next example.

Consider the table ‘Management Fees & Fee Offsets’ table, it consists of two tables with different column values, we have split this table into two individual configs, one for YEAR TO DATE and the other for INCEPTION TO DATE values, however these tables’ first column has a blank header, say we want to add a name of our choice here, we may use renamed_headers and rename blank header value to say ’Description’, do refer to the following snippets for a better understanding

                                                                   Document 3- Table 2

                                                                             Config

                                                                        YEAR TO DATE

                                                                 INCEPTION TO DATE

The color codings signify the following:

Blue: Columns

Green: header start, border start values and header end

Yellow: Parsed Data/Metadata

Red: Strict condition

Purple: Name of the table

Strict and Lax metadata

These metadata are optional to use, they are used in the cases when two or more tables data get merged and these metadata enable Parser to differentiate one tables data from the other

1.      “Strict” matching cell

Place the string that the parser is supposed to match in the 2nd row and 2nd column of the table. We usually use the subheader of the table as the input string for this cell.

Forces the parser to look for subheaders/subtitles that match the string in this cell. Ensures that only data that is supposed to appear in this worksheet are accepted.

Tends to completely exclude data from other tables from the parsed output entirely if no appropriate worksheet/table was created to accommodate them. Try to avoid using this cell as much as possible as it is incredibly hard to debug issues that come from this cell, it is much easier to simply reconfigure the metadata to fit what is required of the parser.

 

2.      “Lax” matching cell

Place the string that the parser is supposed to match in the 2nd row and 3rd column of the table. We usually use the subheader OR subtitle of the table as the input string for this cell.

Performs the same function as the strict matching cell, except it no longer enforces the appearance of specific strings as strictly.

Tends to allow for data from other tables to enter rather than completely exclude them from the parsed output. I would prefer to always have a string in this cell so that the data appears somewhere in the parsed output rather than not at all. In the previous example, strict metadata was used as well

5. Parsing of Scanned Documents


The output of the scanned files is vastly dependent on the quality of the PDF document supplied, the information parsed may be different from what we see in the document and we need to enlist replace statements to replace the erroneous information with the correct one, we need to use additional Parser parameters while working with the scanned documents. For instance, when we are creating the borders in the config use USE_TEXTRACT_FOR_OCR (highlighted in green) and once the borders are created and we are uploading the document for parsing we use USE_ABBYY_FOR_OCR and USE_OCR (highlighted in yellow and blue) together to extract the data from the scanned document.

                                                                     Scanned Document

                                                                          Document

  

                                                                               Config

                                                                          Output File

The color codings signify the following:

Blue: Columns

Green: header start, border start values and header end

Yellow: Parsed Data/Metadata

Do note that the output in this case looks precise but this is not the case in the majority of the scanned documents supplied to us by the clients. However, in many such cases the output can be improved by using replace, ignore and renamed headers. This happens due to the fact that the scanned documents are nothing but images of the documents and there is no machine readable content in the scanned files, in order for Parser to read the files, we use multiple OCR libraries which enable Parser to detect and read text in the scanned images of the document but the accuracy of their results is generally not at par with the normal PDF documents since they have readable text. The best tip to extract such information is by referring to the Parser logs to look for the exact values parser is detecting and extracting into the document

Summary

We can parse PDF documents including the scanned PDFs in Parser. The configuration files provide the basic instructions to Parser about the document which needs to be parsed out. The config files combined with the PDF document,  form the basic structure of the output file which has the data from the PDF document converted to an excel file. Another important point to note is that different PDF documents i.e. documents varying in terms of custodians, number of columns, column names, type of the data may require different config files for them to be extracted by Parser. However, similar documents i.e. documents with the same custodians and having the exact same parameters may be extracted by the same config file itself. 

Additional Resources

Please click on the link below to view the comprehensive document: 

https://docs.google.com/document/d/1acQUY2P-nb9BMbtPz8R8GsCuCXkb0jpO2J2XyqGnd5M/edit#



Was this article helpful?