Designing Fraud-Resistant Fintech Systems

How Machine Learning and Backend Engineering Work Together

I’m interested in system design, reliability, and the engineering decisions behind production systems. This blog is where I share things I’m learning as I build.

When the Rules Stopped Being Enough

For the first year, the rule engine held. Flag applications above a certain loan amount from unverified profiles. Reject submissions where device fingerprints matched known bad actors. Block accounts showing velocity patterns outside normal ranges. The rules were simple, fast, and for the size we were operating at, effective enough.

Then the fraud patterns shifted. Applications started arriving just below the thresholds. Synthetic identities, assembled from real data points sourced from breaches, started passing every rule cleanly. The fraud rate began climbing in the segment of applications our rules couldn't see, not because the rules were wrong per se, but because the fraudsters had learned exactly where the edges were.

We were maintaining a static system against a dynamic problem. That's a fight you eventually lose.

What Rule-Based Systems Get Right And Where They Fail

Rules aren't wrong. They're fast, interpretable, and easy to audit, which matters a lot in a regulated lending environment where every declined application needs to be explainable. When a compliance team asks why an application was rejected, a rule gives you a clean answer. A model gives you a probability.

The problem is that rules can only catch what someone has already thought to look for. As fraud sophistication grows, the rule set grows with it, until the system becomes fragile and the false positive rate climbs because legitimate users start tripping overlapping conditions.

The decision to bring in ML wasn't about replacing rules. It was about adding a layer that could reason about patterns for which no rule had been written, while keeping rules in place for decisions that needed a clear paper trail.

Designing the Detection Pipeline

The pipeline we built followed a straightforward sequence: an incoming application triggers a feature-extraction step that assembles signals from multiple sources, including application data, device metadata, behavioral history, and bureau signals, where available. That feature vector is passed to a model inference service, which returns a risk score between 0 and 1. A decision engine then combines that score with a small set of hardened rules to produce an outcome: approve, decline, or flag for manual review.

The decision engine sitting between the model and the outcome was a deliberate design choice. The model informed the decision; it didn't make it unilaterally. Certain categories of application, particularly those touching regulatory requirements, still went through explicit rule checks regardless of what the model scored. Auditability demanded it, and we weren't willing to trade that for throughput.
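Here is a minimal sketch of a decision engine along these lines, in which hard rules run first and the model score only decides what the rules leave open. The thresholds, field names, and reason strings are all hypothetical; the real schema and cut-offs aren't given here.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    APPROVE = "approve"
    DECLINE = "decline"
    REVIEW = "manual_review"


@dataclass
class Application:
    # Illustrative fields only, not the real application schema.
    device_on_blocklist: bool
    regulatory_check_failed: bool


@dataclass
class Decision:
    outcome: Outcome
    reason: str  # short, auditable explanation of what drove the outcome


# Hypothetical thresholds; real values would be tuned on historical data.
DECLINE_ABOVE = 0.85
REVIEW_ABOVE = 0.50


def decide(app: Application, risk_score: float) -> Decision:
    # Hardened rules win regardless of what the model scored.
    if app.device_on_blocklist:
        return Decision(Outcome.DECLINE, "rule:device_blocklist")
    if app.regulatory_check_failed:
        return Decision(Outcome.DECLINE, "rule:regulatory_check")

    # Only when no rule fires does the model score pick the outcome.
    if risk_score >= DECLINE_ABOVE:
        return Decision(Outcome.DECLINE, "model:high_risk")
    if risk_score >= REVIEW_ABOVE:
        return Decision(Outcome.REVIEW, "model:medium_risk")
    return Decision(Outcome.APPROVE, "model:low_risk")
```

Recording a reason next to every outcome keeps a rule-driven decline distinguishable from a model-driven one when someone later asks why an application was rejected.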

Feature extraction ran as a synchronous step but was designed to degrade gracefully. If a bureau call timed out, the application proceeded with the signals we had. A partial feature vector with a slightly lower confidence threshold was better than a blocked flow waiting on a dependency that might never respond.
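One way to get this kind of graceful degradation is to call every source in parallel under a shared deadline and skip whatever hasn't answered in time. The sketch below assumes each source is a plain callable returning a dict of features; `extract_features` and the source names are hypothetical, not the actual service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


def extract_features(
    sources: Dict[str, Callable[[], Features]],
    timeout_s: float,
) -> Tuple[Features, List[str]]:
    """Call all sources in parallel; skip any that fail or miss the deadline."""
    features: Features = {}
    missing: List[str] = []
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    deadline = time.monotonic() + timeout_s
    futures = {name: pool.submit(fn) for name, fn in sources.items()}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            features.update(future.result(timeout=remaining))
        except Exception:
            # Timeout or source error: degrade instead of blocking the flow.
            missing.append(name)
    # Return immediately; slow calls finish in the background and are ignored.
    pool.shutdown(wait=False)
    return features, missing
```

Returning the list of missing sources alongside the partial vector lets the caller know the vector is incomplete and adjust its decision threshold accordingly.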

The Latency Problem, And How We Solved It

The first integration was naive. Every application was routed synchronously through the model inference service, which meant model inference latency sat directly in the user-facing response path. Under normal conditions, that added around 180ms. Under load, it climbed to 350–400ms, on a flow where the product expectation was near-instant feedback.

We broke the problem into two parts. First, we identified which features could be pre-computed. Application history, device reputation scores, and bureau summaries didn't change between session start and submission. We moved those to a feature store with a short TTL, populated asynchronously when the session started. By submission, most of the feature vector was already assembled.
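The pre-computation idea can be sketched with an in-memory stand-in for the feature store. A real deployment would more likely sit on a shared cache or database; `TTLFeatureStore` and `build_feature_vector` are hypothetical names, and the clock is injectable only to make expiry easy to test.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, float]


class TTLFeatureStore:
    """In-memory stand-in for a feature store with per-entry expiry."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Features]] = {}

    def put(self, session_id: str, features: Features) -> None:
        # Called asynchronously at session start, off the request path.
        self._entries[session_id] = (self._clock() + self._ttl_s, dict(features))

    def get(self, session_id: str) -> Optional[Features]:
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        expires_at, features = entry
        if self._clock() >= expires_at:
            del self._entries[session_id]  # expired: treat as a miss
            return None
        return features


def build_feature_vector(
    store: TTLFeatureStore, session_id: str, submitted: Features
) -> Features:
    # Pre-computed values first, then fields only known at submission time.
    precomputed = store.get(session_id) or {}
    return {**precomputed, **submitted}
```

On a cache miss or an expired entry, the vector simply contains the submission-time fields, so the request path never waits on the slow sources.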

Second, we set a hard 200ms timeout on model inference, with a fallback to a lightweight rule-based decision if it was hit. After both changes, p95 inference latency dropped to under 120ms. The fallback triggered on fewer than 2% of requests, which was an acceptable trade-off.
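A hard timeout with a rule-based fallback can be sketched as follows. This is an illustration under assumptions: the model call and the fallback are passed in as plain callables that return an outcome string, and `decide_with_timeout` is a hypothetical helper, not the production code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Features = Dict[str, float]

INFERENCE_TIMEOUT_S = 0.2  # the 200ms budget described above

# Shared worker pool; the size is an arbitrary choice for the sketch.
_pool = ThreadPoolExecutor(max_workers=8)


def decide_with_timeout(
    features: Features,
    model_decide: Callable[[Features], str],
    rule_decide: Callable[[Features], str],
    timeout_s: float = INFERENCE_TIMEOUT_S,
) -> Tuple[str, str]:
    """Return (outcome, source), where source records which path answered."""
    future = _pool.submit(model_decide, features)
    try:
        return future.result(timeout=timeout_s), "model"
    except Exception:
        # Timeout or inference error. cancel() only helps if the call
        # has not started yet; a running call is left to finish unused.
        future.cancel()
        return rule_decide(features), "fallback_rules"
```

Returning the source alongside the outcome makes the fallback rate straightforward to measure, which is what a figure like "fewer than 2% of requests" depends on.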

When the Model Got It Wrong

About six weeks after deploying a retrained model version, we noticed approval rates dropping in a specific user segment: self-employed applicants with non-standard income documentation. Fraud rates in that segment weren't rising. Legitimate applicants were being declined at a higher rate than before.

The model had been retrained on a recent data window that happened to underrepresent that segment. It had learned a spurious correlation between non-standard income documentation and risk, not because the signal was real, but because the sample was thin. The model was confidently wrong about a population it hadn't seen enough of.

We caught it because a product manager flagged an increase in support complaints from declined users, not because our monitoring caught it. That was the uncomfortable part. Our dashboards were tracking the overall fraud catch rate and the overall approval rate. We weren't monitoring approval rate by segment, and we weren't watching model confidence score distributions for drift.

We rolled back to the previous model version within the hour. The fixes that followed were as much about observability as about model quality: per-segment approval rate dashboards, confidence distribution monitoring, and a shadow scoring process. In shadow scoring, a new model version runs in parallel for two weeks before promotion, scoring live applications without acting on the scores, purely to validate its behavior before it goes live.
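A per-segment approval check of the kind that would have surfaced this earlier might look like the sketch below. The segment labels, the 20% relative-drop threshold, and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def approval_rate_by_segment(
    decisions: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """decisions is a stream of (segment, was_approved) pairs."""
    approved: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for segment, was_approved in decisions:
        total[segment] += 1
        approved[segment] += int(was_approved)
    return {segment: approved[segment] / total[segment] for segment in total}


def segments_with_drop(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_relative_drop: float = 0.2,  # hypothetical alerting threshold
) -> List[str]:
    """Segments whose approval rate fell by more than max_relative_drop."""
    flagged = []
    for segment, base_rate in baseline.items():
        rate = current.get(segment)
        if rate is None or base_rate == 0:
            continue  # nothing to compare against
        if (base_rate - rate) / base_rate > max_relative_drop:
            flagged.append(segment)
    return sorted(flagged)
```

An overall approval rate can stay flat while one segment drops sharply, which is why the comparison runs per segment rather than on the aggregate.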

Observability for Fraud Systems Is Different

Knowing your system is running and knowing your system is working are two different things. In most backend services, the gap between those two statements is small. In fraud systems, it can be wide enough to cause real damage before anything obvious breaks.

Standard metrics won't catch it. Latency looks fine. Error rates are clean. Throughput is normal. Meanwhile, a specific applicant segment is being declined at twice the rate it should be, and the model has been quietly drifting for three weeks.

The monitoring that actually matters here lives one layer deeper: false positive and false negative rates, approval rate broken down by segment, model confidence score distributions, and feature drift tracked over time. That last one is easy to deprioritize until you've been burned by it. When the distributions of incoming features shift away from what the model was trained on, accuracy degrades silently. You won't see it in your error logs. You'll see it in a product manager's Slack message asking why approvals dropped.
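One common way to quantify feature drift is the Population Stability Index (PSI), which compares how a feature is distributed across buckets in the training sample versus live traffic. PSI is swapped in here as an example; the metric actually used isn't stated. The equal-width bucketing and the floor for empty buckets are simplifying assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any alerting threshold should be calibrated per feature rather than taken on faith.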

After the incident described earlier, shadow scoring became non-negotiable. Every new model version runs in parallel with the live system for at least two weeks, scoring real applications and acting on none of them, before it touches a single production decision. It adds time to the deployment process, but it's worth it.
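The core of shadow scoring fits in a few lines: score with both models, log both results, act only on the live one. In the sketch below the models are plain callables and the log is an in-memory list, both stand-ins; a production version would likely run the candidate off the request path and write to durable storage.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Model = Callable[[Features], float]


class ShadowScorer:
    """Scores with a live and a candidate model; only the live score is used."""

    def __init__(self, live: Model, candidate: Model):
        self._live = live
        self._candidate = candidate
        self.log: List[Tuple[float, float]] = []  # (live_score, shadow_score)

    def score(self, features: Features) -> float:
        live_score = self._live(features)
        try:
            self.log.append((live_score, self._candidate(features)))
        except Exception:
            pass  # a broken candidate must never affect the live decision
        return live_score

    def disagreement_rate(self, threshold: float) -> float:
        """Share of logged applications where the two models would
        land on opposite sides of a decision threshold."""
        if not self.log:
            return 0.0
        flips = sum(
            (live >= threshold) != (shadow >= threshold)
            for live, shadow in self.log
        )
        return flips / len(self.log)
```

Breaking the disagreement rate down by segment before promotion would show whether a candidate treats any one population very differently from the live model.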

What This Kind of Work Actually Requires

Integrating ML into a backend system is a collaboration that requires more explicit coordination than most teams plan for. The model is only as good as the features feeding it, and feature pipelines are backend infrastructure; they need to be built, maintained, versioned, and monitored by engineers who may not own the model at all.

The clearest tension we encountered: the ML team wanted to add a real-time feature requiring a join between application data and transaction history at inference time. The join was expensive, and the feature store hadn't been designed to support it at that latency. The conversation about whether to build the infrastructure or to find a pre-computable proxy took two weeks and required both teams to be fully involved.

That kind of coordination doesn't happen naturally. It requires treating model versioning with the same discipline as API versioning, documenting the contract between the decision engine and the model explicitly, and having backend engineers in the room when feature design decisions are made.
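An explicit contract between the decision engine and a model version could be as small as a declared feature list that is checked before inference. The sketch below is one hypothetical shape for it; `ModelContract` and its fields are illustrative and not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ModelContract:
    """Declares which features a given model version expects."""

    model_version: str
    required_features: FrozenSet[str]
    optional_features: FrozenSet[str] = frozenset()

    def validate(self, features: Dict[str, float]) -> Dict[str, float]:
        missing = self.required_features - set(features)
        if missing:
            raise ValueError(
                f"{self.model_version}: missing features {sorted(missing)}"
            )
        # Drop anything the model was not trained on, so an upstream
        # pipeline change cannot silently alter the model's input.
        known = self.required_features | self.optional_features
        return {k: v for k, v in features.items() if k in known}
```

Versioning the contract together with the model means a feature pipeline change that breaks the model's assumptions fails loudly at the boundary instead of degrading scores quietly.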

Closing

Fraud detection at scale is an engineering problem as much as it is a data science problem. A well-trained model running on a poorly designed pipeline will underperform. A well-designed pipeline with inadequate observability will fail silently. And a system that trusts the model unconditionally will eventually make a confident mistake with real consequences.

The backend engineer's role here is specific: make sure the model always has the features it needs, always has a safe path to fail gracefully, and is never the sole decision-maker without something checking its work. The model handles patterns no rule could anticipate. The rules handle cases where you need to explain exactly what happened and why. Both have a place. The system works when neither is asked to do more than it's designed for.
