What I learned while implementing Gen AI for handwritten data extraction

The Challenge

The project required extracting handwritten information from scanned documents.

This was difficult because the handwriting varied widely from person to person. Some of it was readable, some was messy, and some was almost impossible to read accurately with conventional OCR methods.

The documents were also not always clean or short. Some scanned files had more than 40 pages, even though only a few pages contained the information we actually needed. This meant we could not simply pass the whole document into an AI model without thinking about cost, relevance, and processing time.

The problem was not just “extract text from document.” It became a problem of accuracy, cost, preprocessing, validation, and scalability.

Methods Explored

1. Traditional OCR

We first explored traditional OCR.

It did not work well as the main extraction method because the handwritten text could not be detected consistently. Different handwriting styles caused too much variation, and the results were not reliable enough for actual business use.

However, OCR was still useful later in the pipeline.

Instead of using it to extract the final handwritten data, we used OCR to help preprocess documents and identify relevant or irrelevant pages using keywords.

So the lesson was not that OCR is useless. The lesson was that OCR was not strong enough to solve the full problem, but it was still useful as a supporting tool.
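
To make the supporting role concrete, here is a minimal sketch of keyword-based page filtering. It assumes the OCR text for each page is already available as a string. The function name, the keyword lists, and the sample pages are all hypothetical, not our production code.

```python
from dataclasses import dataclass


@dataclass
class PageDecision:
    page_number: int
    keep: bool
    reason: str


def filter_pages(page_texts, positive_keywords, negative_keywords):
    """Decide which pages to forward to the LLM, based on OCR text.

    page_texts holds one OCR string per page (index 0 is page 1).
    A page is kept only if it contains at least one positive keyword
    and no negative keyword. Matching is case-insensitive substring.
    """
    decisions = []
    for number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        negatives = [k for k in negative_keywords if k.lower() in lowered]
        positives = [k for k in positive_keywords if k.lower() in lowered]
        if negatives:
            decisions.append(PageDecision(number, False, f"negative keyword: {negatives[0]}"))
        elif positives:
            decisions.append(PageDecision(number, True, f"positive keyword: {positives[0]}"))
        else:
            decisions.append(PageDecision(number, False, "no positive keyword"))
    return decisions


# Hypothetical OCR output for a four-page scan.
pages = [
    "TERMS AND CONDITIONS of service",
    "Applicant Name: ____ Date of Birth: ____",
    "",
    "Applicant declaration - SAMPLE ONLY",
]
decisions = filter_pages(
    pages,
    positive_keywords=["applicant name", "date of birth", "applicant declaration"],
    negative_keywords=["terms and conditions", "sample only"],
)
kept_pages = [d.page_number for d in decisions if d.keep]
print(kept_pages)
```

Keeping a reason for every decision makes it easier to audit why a page was dropped, which matters when a relevant page is filtered out by mistake.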

2. Small Language Models with Ollama

We also explored running smaller language models locally using Ollama.

The idea was attractive because it avoided cloud dependency and token cost. However, the organization had hardware limitations. The models were slow, and the smaller models did not have enough capability to extract the handwritten data accurately.

We tested different models, but the results were not good enough. Some outputs looked convincing, but the extracted data was still wrong.

This taught me that running AI locally is not automatically cheaper or better. Without the right hardware, you are limited to smaller models, and the setup becomes slow, inaccurate, and difficult to scale.

3. Cloud LLMs

We then explored cloud LLMs from providers such as Azure, Google, OpenAI, and Anthropic (Claude).

We started with cheaper models, but the results were poor. They could produce structured output, but the accuracy was bad. In some cases, the output looked professional but was practically rubbish.

The performance improved significantly when we tested stronger reasoning or thinking models. These models seemed better at identifying relevant information, handling messy handwritten input, and making sense of uncertain document layouts.

Among the models tested, Gemini 2.0 Pro gave us the best balance of performance and cost.

This approach also saved us from spending around USD 20k or more on hardware needed to run larger models locally. Processing time improved as well. Locally, some documents took almost 5 minutes. With the cloud model, the average was around 90 seconds, with much better results.

From a cost perspective, building the solution internally was also much cheaper than hiring an external vendor. The vendor estimate was around RM85k, while our initial token usage was below RM4k. On those figures, we saved the organization more than 95% compared to the estimated vendor cost.
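
The exact savings percentage depends on how far below RM4k the token spend actually was. A quick check of the arithmetic, using only the two figures above (the function and variable names are just for illustration):

```python
def savings_percent(vendor_cost, internal_cost):
    """Savings relative to the vendor estimate, as a percentage."""
    return (vendor_cost - internal_cost) / vendor_cost * 100


vendor_estimate_rm = 85_000
token_cost_upper_bound_rm = 4_000  # "below RM4k", so this is the worst case

# Savings if the full RM4k had been spent.
floor_savings = savings_percent(vendor_estimate_rm, token_cost_upper_bound_rm)

# Token spend at which savings reach exactly 96%.
spend_for_96_percent = vendor_estimate_rm * (1 - 0.96)

print(f"Savings at RM4k spend: {floor_savings:.1f}%")
print(f"Spend needed for 96% savings: RM{spend_for_96_percent:,.0f}")
```

At the RM4k upper bound the savings come to about 95.3%, and they only reach 96% once the spend is RM3,400 or less.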

Cost Reality Check

One mistake we made was underestimating how quickly cloud AI cost could scale.

The initial cost estimate looked low. But after processing more than 5,000 documents, the actual cost had exceeded the initial budget, and we had to go back to management for approval to continue the project.

This was a big lesson.

Cloud AI can feel cheap during testing, but once you scale it across thousands of documents, small estimation mistakes become expensive.

Eventually, we switched to batch processing on Vertex AI, now under Google’s Gemini Enterprise Agent Platform direction. This reduced cost significantly because we did not need immediate real-time results.

That was an important realization:

If the business does not need instant output, do not pay for instant processing.
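
In hindsight, even a rough estimator would have exposed the scaling problem earlier. The sketch below is an assumption-heavy illustration: only the 5,000-document volume comes from this project. The token counts, prices, retry rate, and batch discount are hypothetical placeholders and should be replaced with measured values and the provider's current pricing.

```python
def estimate_cost(num_documents, pages_per_document, input_tokens_per_page,
                  output_tokens_per_document, input_price_per_million,
                  output_price_per_million, retry_rate=0.0, batch_discount=0.0):
    """Rough total cost for a run, in the currency of the prices given.

    retry_rate is the fraction of extra calls caused by retries and re-runs.
    batch_discount is the fractional price reduction for non-real-time jobs.
    """
    input_tokens = num_documents * pages_per_document * input_tokens_per_page
    output_tokens = num_documents * output_tokens_per_document
    base = (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)
    return base * (1 + retry_rate) * (1 - batch_discount)


# Hypothetical assumptions, apart from the document count.
assumptions = dict(
    num_documents=5_000,
    input_tokens_per_page=1_500,       # assumed, measure on real pages
    output_tokens_per_document=800,    # assumed, may include reasoning tokens
    input_price_per_million=1.25,      # assumed price, not a quoted rate
    output_price_per_million=10.0,     # assumed price, not a quoted rate
    retry_rate=0.1,                    # assumed 10% of calls repeated
)

unfiltered = estimate_cost(pages_per_document=40, **assumptions)
filtered = estimate_cost(pages_per_document=5, **assumptions)
filtered_batch = estimate_cost(pages_per_document=5, batch_discount=0.5, **assumptions)

for label, cost in [("40 pages, real-time", unfiltered),
                    ("5 pages, real-time", filtered),
                    ("5 pages, batch", filtered_batch)]:
    print(f"{label}: {cost:.2f}")
```

The point is not the absolute numbers but the ratios: page count, retries, and processing mode each multiply the bill, so a small error in any one of them compounds across thousands of documents.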

Challenges Faced

Some scanned documents had more than 40 pages, so we could not simply send the whole file into the LLM.
We had to use OCR preprocessing to filter out unnecessary pages using positive and negative keywords.
Traditional OCR could help with page filtering, but not final handwritten extraction.
Generative AI output did not provide a trustworthy confidence score.
Some AI outputs looked legitimate even when the extracted values were wrong.
Smaller local models were too slow and inaccurate due to hardware limitations.
Cloud model cost was hard to estimate accurately at scale.
Human validation was still necessary, especially for uncertain or low-quality documents.
Prompt changes affected output quality more than expected.
The system needed proper guardrails, not just a model API call.
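
Since the model gave us no confidence score we could trust, one guardrail is to run plain rule checks on the extracted fields and use the result to prioritize human review. This is a minimal sketch under assumed field names (name, date, amount) and assumed formats, not the actual schema from the project.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = ("name", "date", "amount")  # hypothetical schema


def validate_extraction(record):
    """Return a list of rule violations; an empty list means all checks passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")

    date_value = record.get("date")
    if date_value:
        try:
            datetime.strptime(str(date_value), "%Y-%m-%d")
        except ValueError:
            problems.append("date is not YYYY-MM-DD")

    amount = record.get("amount")
    if amount and not re.fullmatch(r"\d+(\.\d{1,2})?", str(amount)):
        problems.append("amount is not a plain number")
    return problems


def review_queue(record):
    """Passing the rules does not prove correctness, so nothing skips review."""
    problems = validate_extraction(record)
    return ("priority_review" if problems else "standard_review", problems)


# Hypothetical model outputs: one clean-looking, one with subtle errors.
clean = {"name": "Aminah", "date": "2024-03-15", "amount": "120.50"}
suspicious = {"name": "Aminah", "date": "2024-13-01", "amount": "12O.50"}

clean_result = review_queue(clean)
suspicious_result = review_queue(suspicious)
print(clean_result)
print(suspicious_result)
```

Rules like these only catch values that are impossible, such as a thirteenth month or a letter O inside a number. A plausible but wrong value still passes, which is why the clean record goes to standard review rather than being accepted automatically.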

Lessons Learned

Generative AI is powerful, but it is not magic.
There is no escape from human validation when dealing with messy handwritten documents.
AI confidence scores are not always trustworthy because models can be overconfident.
A clean-looking AI output does not mean the data is correct.
Cost control must be designed from the start, not added later.
Stop-loss limits and spending SOPs are mandatory when using cloud AI.
Preprocessing matters a lot. Sending less irrelevant content to the model improves both accuracy and cost.
The cheapest model is not always the cheapest solution if it produces bad results.
Local AI is not automatically more cost-effective if the hardware is not strong enough.
Batch processing can significantly reduce cost when real-time results are not required.
The model is only one part of the solution. The surrounding pipeline determines whether the system is actually useful.
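
A stop-loss limit can be as simple as a counter that refuses further calls once a hard budget would be exceeded. The sketch below is a hypothetical illustration: the class name, the 80% warning threshold, and the amounts are assumptions, and a real version would persist the running total and alert someone.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push spend past the hard limit."""


class SpendGuard:
    def __init__(self, hard_limit, warn_fraction=0.8):
        self.hard_limit = hard_limit
        self.warn_threshold = hard_limit * warn_fraction
        self.spent = 0.0

    def authorize(self, estimated_cost):
        """Call before each model request with its estimated cost.

        Raises BudgetExceeded instead of letting the run continue silently.
        Returns "warn" once spend reaches the warning threshold, else "ok".
        """
        if self.spent + estimated_cost > self.hard_limit:
            raise BudgetExceeded(
                f"{self.spent + estimated_cost:.2f} would exceed limit {self.hard_limit:.2f}"
            )
        self.spent += estimated_cost
        return "warn" if self.spent >= self.warn_threshold else "ok"


guard = SpendGuard(hard_limit=100.0)
statuses = [guard.authorize(50.0), guard.authorize(35.0)]

try:
    guard.authorize(20.0)
    stopped = False
except BudgetExceeded:
    stopped = True

print(statuses, stopped, guard.spent)
```

Checking before the call rather than after means the job halts at the limit instead of discovering the overrun on the next invoice.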

Final Reflection

This project changed how I see AI implementation.

Before this, it was easy to think the main challenge was choosing the right model. But in practice, the harder part was designing the full process around the model.

The real work was in preprocessing the documents, managing cost, testing different models, validating output, handling edge cases, and making sure the system could scale without silently burning budget.

The biggest thing I learned is that AI projects need engineering discipline.

A good AI demo can impress people. But a useful AI system needs accuracy, cost control, validation, and operational guardrails.

For this project, generative AI helped us solve a problem that traditional OCR could not handle well. It also allowed us to avoid expensive hardware investment and reduce cost significantly compared to an external vendor.

But the most valuable lesson was this:

The model is impressive, but the system is what creates value.