Background
A lot of investment data is available only as PDFs (i.e. the custodian banks are either unable or unwilling to provide it in an electronic format such as an API or a data feed). This problem is much larger in Europe and Asia.
86% of investment data in Europe and Asia is available as PDF only. The situation is much better in North America.
Extraction of Complex Tables
Almost invariably, the data we need to extract sits in a table in the PDF. Banks design these tables for human consumption (and they look very good aesthetically). However, their design makes them quite complex for a computer to understand.
For example, in the table below, the bank wanted to include 20 columns. Since there isn't enough space on the page to fit 20 columns, it created 'multi-layer headers', where the table header spans 4 rows.
Computers struggle to read data from a typical bank statement
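To make the multi-layer header problem concrete, here is a minimal sketch of one way such a header could be collapsed into a single label per column. It assumes the header rows have already been extracted as equal-length lists of strings, and that a blank cell in an upper row means "the group label to the left continues". The function name flatten_headers and the sample labels are hypothetical and are not taken from any real bank statement.

```python
def flatten_headers(header_rows):
    """Collapse several header rows into one label per column.

    Assumption: in every row except the last, a blank cell continues the
    nearest non-blank cell to its left (a header spanning several columns).
    """
    filled = []
    for i, row in enumerate(header_rows):
        is_last = i == len(header_rows) - 1
        last_seen = ""
        out = []
        for cell in row:
            text = cell.strip()
            if text:
                last_seen = text
            elif not is_last:
                # Carry the spanning group label into the blank cell.
                text = last_seen
            out.append(text)
        filled.append(out)

    names = []
    for parts in zip(*filled):  # walk the table column by column
        label = []
        for part in parts:
            # Skip blanks and labels repeated from the layer above.
            if part and (not label or label[-1] != part):
                label.append(part)
        names.append(" / ".join(label))
    return names


# Hypothetical two-layer header with four columns.
header_rows = [
    ["Position", "", "Valuation", ""],
    ["Quantity", "Currency", "Price", "Market value"],
]
columns = flatten_headers(header_rows)
print(columns)
```

The same idea extends to a 4-row header by passing four lists. The hard part in practice is usually the earlier step this sketch takes for granted: deciding from the PDF layout which cells belong to which column.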