Pandell - Using AI Computer Vision to Digitize Ancient Documents

Pandell is turning to AI Computer Vision to automate land document digitalization to help their clients' speed up project delivery time and save on valuable capital budget dollars.

Land Document OCR Automation Process

Imagine you are a large utility or pipeline company overwhelmed with the myriad of land documents required to operate and maintain powerlines or pipelines over landowners’ properties.  These documents include:

  • leases - the rights to use a piece of land for a specific time
  • easements - agreements to cross one’s land
  • deeds - agreements to transfer land ownership
  • titles - ownerships of land
  • contracts - agreements for the land and real estate on the land purchase with the seller financing the buyer
  • permits - agreements for the specific uses of a land dictated by governments  

The land-owning parties can be a single person, a company, a government organization, a non-profit organization, a crown corporation, an Indigenous group, Etc. You have tens of thousands of such documents to track, so unless you are playing monopoly and only need to deal with a few people, you will need help.

This is where Calgary’s Pandell comes in. Pandell's software suite helps their clients in oil & gas, renewables, utilities, and pipelines by enabling organizational functions like accounting, land management, operations, and project management to work together seamlessly in a single, centralized digital hub, similar to a maestro orchestrating a symphony. The success of their solutions has resulted in a client list that includes Chevron, Shell, and BP Wind Energy.

When a client first chooses Pandell's land management software, their historical land records must be pre-loaded to ensure they are productive from day one. Customers must make scheduled lease payments, manage joint interests, view assets on maps, manage chains of titles, and track obligations and provisions, such as the right to cut down trees or if an owner needs notification when accessing their land. While helping clients manage their land problems, Pandell has a problem when trying to load the land data.

In data science, digital data is categorized into structured, semi-structured, and unstructured data. The data is often on paper….......lots and lots and lots of paper. About 300,000 pieces of unscanned paper pages per year, representing approximately 85,000 documents, are given to Pandell annually. They date back as far as the 1800s, with many pages barely legible by humans. Spills, ink blots, rips, imperfect copies (often copies of copies), and general fading abound. Could it be that Woodrow Wilson spilled his tea while signing a New Jersey land agreement in 1912? Quite possibly.

Sample Land Documents Pandell Works With

We sat down with Pandell's VP of R&D, David Beresford, to discuss how they historically tackled importing this data and how they are now making creative use of AI to streamline this process. In the past, David told us, there was no magic (sorry, Indy). Pandell employs a team of twenty dedicated Land Specialists who read, interpret and digitize documents manually, working year-round on the data. As skilled as the team is, the process is labor-intensive, prone to human error, and generally costly. "Large customers can provide tens of thousands of documents. The old manual process could take months to process and easily cost the client over a million dollars." David said.

The legacy process started when the Pandell team visited client offices and manually scanned the paper found in boxes and cabinets. PDF files were generated, one per box or drawer, and documents were individually separated to make it easier for scanning. When complete, the arduous task of examining every page in the PDFs began, with the team rotating pages to correct orientation, splitting them out into individual land agreements, and then manually inputting relevant data from the documents into Pandell's land system.

Earlier this year, Pandell set out to use modern technology to streamline this process. The first step was obvious - they needed a high-quality text extraction from the scanned pages. Traditional Optical Character Recognition (OCR) technology did not perform well due to the low quality of the documents. The latest AI Computer Vision-based services from Amazon (AWS), Google (GCP), and Microsoft (Azure) were each evaluated. Each vendor performed well in some areas and less so in others. AWS was proficient at reading handwritten text yet could not read rotated text; about 7% of the text was missing. Azure read the rotated text but struggled at text region detection (keeping related text blocks together). Google excelled at region detection but struggled to recognize handwriting. Pandell's answer was to combine data from each Cloud's AI service vendor to produce a consummate, "best-of-breed" result.

Duplicate documents were an immediate problem for the land team, causing the same work to be completed multiple times. Duplicates have been created over the years as clients transferred copies between field offices. With 45,000 documents in the backlog, the team had no way to identify these copies. By applying a document similarity algorithm against the newly extracted OCR text, 3,500 documents out of 46,000 were quickly flagged and removed. It's estimated that over 45 days worth of work was saved on an initial project that they experimented with this process on.

This "deduplication" provided the first proof of the business value of automation. Pandell then expanded this initial proof-of-concept into a comprehensive land document digitization pipeline. This included detecting PDF split points, orientating pages, classifying documents, linking documents to sub-documents, auto extracting data, cleaning data, and deduplication. David said it was also critical to build a business process around the technical models, so Pandell incorporated a workflow for quality assurance, reporting, forecasting, and billing.  

Using OCR to Detect Duplicate Documents

To speed up the production rollout of the new framework, the first iteration implemented only "low hanging fruits". For example, only 20 of 40 document types were trained for classification, and about 20% of common entities were trained for value extraction (e.g., lessee, lessor, document date, document name). Despite being short of full automation, the system was still able to save much time and was rolled out for the land team to use in the real world. As the team validated classification results, the new engine achieved a success rate between 84% and 93%, depending on the document type, with about 80% of the documents successfully extracted data values (for currently trained entities). These were encouraging results, given the incomplete models. The news got even better when the document data-entry rates were computed - the entry speed was double historical norms.

Not resting on their laurels, Pandell began tackling an even more challenging problem - the automatic generation of GIS polygons from text descriptions. Land agreements contain "legal descriptions" that describe land boundaries that clients often want to visualize. In this case, Pandell Mappers read these descriptions, which can be several paragraphs long, and plot each point described using ESRI ArcGIS software. The resulting polygons are imported into Pandell's software for display on a map. Pandell has a prototype that extracts this land description text, then parses and rewrites them into a normalized format. Next, Pandell will attempt to auto-plot points and auto-build polygons. Since mapping polygons can take twice as long as text entry, the potential time savings are enormous.

Text to Map Polygon Automation

As with many companies attempting to apply AI and ML to their operations, Pandell is still exploring the endless possibilities. One future target is invoice recognition for their financial products. Customers find themselves reviewing and entering bills by hand as they arrive in hardcopy via snail mail or as PDF attachments in email, and automation can certainly deliver meaningful time savings.

In a relatively short order, Pandell has proven the real business value that innovative AI solutions bring clients, by reducing delivery times, saving costs, and increasing quality. With this AI solution already in use by international clients, it is a testament that Calgary tech is a real player on the world stage.

More from the YYC Data Blog