AML Is Only as Good as the Data Feeding It
I led the AML program at Citi end to end, from requirements through implementation. A shorter version of this post reached 2,000 impressions on LinkedIn; I have expanded it here with the full case, the decision framework, and what the data showed after remediation.
There is a belief that circulates widely in financial crime compliance circles: that better models produce better outcomes. More sophisticated typologies. More precise calibration. Lower false positive rates.
After leading the AML program at Citi from requirements through implementation, I want to offer a more precise diagnosis of where financial crime detection programs actually succeed or fail.
An AML system is only as good as the data feeding it. And the data that matters most is customer master data — the identity foundation every detection decision rests on.
What the customer master data problem actually looks like
The detection models we built at Citi were sophisticated. The tuning methodology was rigorous. The false positive thresholds were carefully calibrated against transaction volumes and analyst capacity.
When we conducted a data quality assessment ahead of go-live, what we found in the customer master data was not unusual for a bank of that scale and complexity. It was, however, disqualifying for the program we were trying to run.
Four categories of issues dominated:
- Duplicate customer records across acquisition channels. The same customer, acquired through different channels at different points in time, existed as multiple distinct records in the system. Transaction monitoring against fragmented customer identities produces fragmented alerts — patterns that would be significant across a consolidated view become invisible across separate records.
- Inconsistent name formatting across core banking, CRM, and digital. Name variants — abbreviated, transliterated, hyphenated differently — across systems that had never been reconciled meant that entity matching across data sources was unreliable. For a detection system that depends on linking transactions to identities and identities to risk profiles, this was a structural problem.
- Missing beneficial ownership linkages for corporate accounts. Corporate accounts without complete beneficial ownership chains are a financial crime detection blind spot. Transactions that look innocuous at the account level become significant when the beneficial owner is matched against watchlists or adverse media. Without the linkage, the match cannot happen.
- Stale address and identity data that had never been remediated. Identity verification data that was accurate at onboarding and never subsequently refreshed. Addresses that had not been updated through customer lifecycle events. Risk classifications based on profile data that no longer reflected the customer's actual situation.
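The first failure mode, fragmented identities, is worth making concrete. The sketch below is a hypothetical illustration, not the Citi detection logic: customer IDs, amounts, and the structuring threshold are all invented. It shows how the same deposits that stay invisible when aggregated per record cross an alerting threshold once an entity-resolution map consolidates the duplicates.

```python
from collections import defaultdict

# Hypothetical example: one customer exists as three records,
# one per acquisition channel, each holding a share of their deposits.
transactions = [
    ("CUST-001", 4000), ("CUST-001", 4500),   # branch record
    ("CUST-117", 4200), ("CUST-117", 3900),   # digital record
    ("CUST-552", 4800),                        # CRM record
]

# Illustrative rule: alert when aggregate deposits exceed 10,000.
THRESHOLD = 10_000

def aggregate(txns, resolve=lambda record_id: record_id):
    """Sum amounts per identity, where `resolve` maps a record ID
    to whatever identity key the monitoring runs against."""
    totals = defaultdict(int)
    for record_id, amount in txns:
        totals[resolve(record_id)] += amount
    return dict(totals)

# Per-record view: every fragment stays below the threshold, so no alert fires.
fragmented = aggregate(transactions)
assert all(total < THRESHOLD for total in fragmented.values())

# Consolidated view: a (hypothetical) entity-resolution map links the three
# records to one identity, and the pattern becomes visible.
entity_map = {"CUST-001": "ENT-A", "CUST-117": "ENT-A", "CUST-552": "ENT-A"}
consolidated = aggregate(transactions, resolve=entity_map.get)
assert consolidated["ENT-A"] > THRESHOLD  # 21,400 across the real customer
```

The point of the sketch is the `resolve` function: the detection logic is identical in both runs, and only the identity key changes. That is exactly why duplicate records are a detection problem rather than a cosmetic one.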
An AML system running on this data does not catch financial crime more effectively than a simpler system running on clean data. It generates noise — high volumes of low-quality alerts that bury the genuine signals that the sophisticated models were designed to surface.
The decision — defer go-live by 11 weeks
As program director, I made a decision that the business resisted and that I am still confident was correct.
We deferred the model go-live by 11 weeks to complete a structured customer data remediation program.
The pushback was significant. Regulatory deadline pressure was real. The business case for the detection capability had been built on a specific timeline. The 11-week deferral was not a comfortable conversation.
My position was firm, and I want to state it precisely because I think it applies beyond this specific program:
Deploying a financial crime detection system on top of compromised identity data is not compliance. It is the appearance of compliance. Those are not the same thing.
A regulator reviewing your AML program is not only assessing whether you have a detection system. They are assessing whether the detection system can actually detect. An alert queue full of noise from data quality failures is not a defensible compliance posture — it is evidence that the foundation was not built correctly before the model was deployed.
The 11 weeks were spent on four remediation workstreams running in parallel:
- Deduplication of customer records across acquisition channels with a defined survivorship ruleset
- Name standardisation and cross-system entity reconciliation
- Beneficial ownership completion for the corporate account portfolio
- Identity and address data refresh for the highest-risk customer segments
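A survivorship ruleset, as used in the first workstream, is simply a defined policy for which field values win when duplicate records are merged into one golden record. The sketch below assumes a recency-wins rule and invented field names; it is one common policy, not a description of the actual ruleset used on the program.

```python
from datetime import date

# Hypothetical duplicate records for one customer, one per channel.
# Field names and values are illustrative only.
records = [
    {"name": "J. Smith",    "address": "12 Old Rd",  "verified": date(2015, 3, 1)},
    {"name": "John Smith",  "address": "7 New Lane", "verified": date(2021, 9, 14)},
    {"name": "SMITH, JOHN", "address": None,         "verified": date(2018, 6, 2)},
]

def survive(duplicates, fields=("name", "address")):
    """Survivorship rule (recency wins): for each field, keep the first
    non-null value from the most recently verified record."""
    ordered = sorted(duplicates, key=lambda r: r["verified"], reverse=True)
    golden = {
        field: next((r[field] for r in ordered if r[field] is not None), None)
        for field in fields
    }
    golden["verified"] = ordered[0]["verified"]
    return golden

golden = survive(records)
# -> {'name': 'John Smith', 'address': '7 New Lane',
#     'verified': datetime.date(2021, 9, 14)}
```

Real rulesets layer more policies per field (source-system precedence, completeness, manual override), but the shape is the same: an explicit, auditable rule per attribute, which is what makes the merged record defensible under review.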
None of this was glamorous work. All of it was prerequisite work.
What the data showed after remediation
The results after go-live on clean data were measurable and material.
Alert quality improved by 34%, measured as the ratio of alerts that progressed to investigation versus alerts closed at the first review stage. The model had not changed. The threshold calibration had not changed. The improvement was entirely attributable to the quality of the customer data the model was running on.
Analyst productivity on genuine cases increased materially. When the alert queue is not dominated by noise from data quality failures, analysts spend their time on real financial crime signals rather than on administrative closure of false positives. That reallocation of investigative capacity is not a secondary benefit — it is the primary purpose of a financial crime detection program.
The 11-week deferral recovered its cost in analyst capacity within the first two reporting cycles after go-live.
The principle — KYC hygiene is not a compliance checkbox
The framing I want to challenge is the one that treats KYC data quality as a compliance obligation to be discharged rather than as an operational foundation to be maintained.
Know Your Customer is not a form-filling exercise. It is the mechanism by which a bank builds and maintains the identity intelligence that every downstream compliance function depends on. AML depends on it. Sanctions screening depends on it. PEP identification depends on it. Fraud detection depends on it.
When KYC data quality degrades — through onboarding shortcuts, system migrations that don't carry data forward cleanly, channel proliferation that creates duplicate records, or simply the passage of time without refresh — every downstream compliance function degrades with it.
The investment in KYC data quality is not a compliance cost. It is the infrastructure cost of running financial crime detection that actually works.
Fix the data first. Build the model second. Deploy when the foundation is sound.
Everything else is architecture on an unstable foundation — and in financial crime compliance, that instability has consequences that extend beyond your program budget.
The broader data quality pattern
The AML case is a specific instance of a pattern I have observed consistently across banking technology programs over 24 years:
The programs that succeed under regulatory scrutiny are the ones that invested in data quality before they invested in model sophistication. Not instead of model sophistication — before it. The sequence matters as much as the investment.
I made the same argument in the context of Basel RWA capital reporting — build the defensibility before you build the model. The principle is identical. Regulators do not only assess the output. They assess the data infrastructure that produced it.
In AML, the stakes of getting that sequence wrong are particularly high. Financial crime that goes undetected because an alert was buried in data quality noise is not a program performance metric. It is a regulatory finding, a potential enforcement action, and — at sufficient scale — a contribution to the financial crime ecosystem the program was designed to disrupt.
KYC hygiene is not a compliance checkbox. It is the foundation of every financial crime detection program worth running.
Frequently asked questions
Why do AML systems generate too many false positives?
Most commonly because of poor customer master data quality — duplicate records, inconsistent name formatting, missing beneficial ownership linkages, and stale identity data. A sophisticated model running on corrupted customer data generates noise that buries real signals. The model is not the problem. The data foundation is.
What is customer master data remediation in AML?
The process of resolving data quality issues in customer records before deploying AML detection — deduplication across channels, name standardisation, beneficial ownership completion, and identity data refresh. Remediation before go-live ensures the detection system operates on clean data rather than generating alert noise from data quality failures.
How does KYC data quality affect AML alert quality?
Directly and materially. Duplicated, inconsistently formatted, or incomplete customer records prevent accurate transaction-to-customer matching, account linkage, and beneficial owner identification. The result is high false positive rates that overwhelm analyst capacity and bury genuine signals. At Citi, structured remediation produced a 34% improvement in alert quality with no change to the detection model.
Is it better to delay AML go-live to fix data quality?
Yes. Deploying on compromised data is not compliance — it is the appearance of compliance. A structured remediation program before go-live produces materially better alert quality, higher analyst productivity, and a defensible posture under regulatory review. The cost of deferral is recoverable. The cost of a regulatory finding on a program built on dirty data is not.
Found this useful? I write weekly on banking compliance, data quality, and the lessons 24 years at Citi and Standard Chartered taught me about building programs that survive regulatory scrutiny. Subscribe to the newsletter — no spam, unsubscribe anytime.
Working on an AML or KYC data quality challenge? Explore my consulting services or get in touch directly.
Raj Thilak is Head of Technology for Data & Analytics with 24 years at Citi and Standard Chartered. He has led AML program implementation, Basel capital reporting, and FX settlement technology across multiple global jurisdictions. Based in Pune, India. rajthilak.dev