How does Canopy Extract work?


Canopy
Last Updated: 1 year ago

All the data we need is in table format

Our data is invariably in table format. Typically, we need to extract the following three tables from each PDF document:

  • Holdings

  • Transactions

  • Current Account Credits and Debits

Canopy Extract is designed to extract any table (not just the three tables above) from any PDF document. If you need to extract charts or images from a PDF document, Canopy Extract is not the right tool.

Extract needs the PDF document and an Excel Configuration file

To work, Canopy Extract needs two files:

  • PDF document to be extracted (e-PDF is preferred, but paper scans will also work)

  • Excel Configuration File (which describes the table to be extracted)


What does a typical PDF document look like?

Multi-level headers and nesting are the key issues when extracting data from a PDF table.

Typical table in a Bank Statement
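To make the multi-level header problem concrete, here is a minimal Python sketch of one common way to collapse stacked header rows into a single label per column. It is an illustration only, not how Canopy Extract works internally. The function name `flatten_headers`, the sample column names, and the rule that an empty cell in an upper row continues the spanning cell to its left are all assumptions made for this example.

```python
def flatten_headers(header_rows, sep=" / "):
    """Combine stacked header rows into one label per column.

    Assumption: an empty cell in any row except the last belongs to the
    spanning (merged) header cell to its left.
    """
    width = max(len(row) for row in header_rows)
    filled_rows = []
    for i, row in enumerate(header_rows):
        # Pad short rows so every row has the same number of columns.
        cells = [c.strip() for c in row] + [""] * (width - len(row))
        if i < len(header_rows) - 1:
            # Forward-fill spanning cells in every row except the last.
            last = ""
            for j, cell in enumerate(cells):
                if cell:
                    last = cell
                cells[j] = last
        filled_rows.append(cells)

    # Join the non-empty parts of each column from top to bottom.
    return [sep.join(p for p in col if p) for col in zip(*filled_rows)]


# Hypothetical two-row header: "Market Value" spans two sub-columns.
header_rows = [
    ["Security", "Quantity", "Market Value", ""],
    ["", "", "Local", "USD"],
]
print(flatten_headers(header_rows))
```

The forward-fill rule is a simplification: a column that genuinely has no upper-level header would wrongly inherit its left neighbour's label, so real statements usually need the table layout described explicitly rather than guessed.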

What does an Excel Configuration file look like?

The Excel Configuration file for the above table is given below. Further details are on the page "Parts of a Config File".

Sample PDFs and their Excel Configuration files

Some sample PDF documents and their corresponding Excel Configuration files are given below:

UBS

BNP


