At Blend, we’re building tools to power a frictionless, more accessible consumer lending ecosystem. We want to ensure everyone who wants a loan (and qualifies for one) can apply easily.
Applying for a mortgage is a complicated and lengthy process that requires close collaboration between borrowers and loan officers. It is critical for lenders to provide timely help to borrowers, but it’s difficult to know when to step in. One project we worked on over the past few sprints aimed to tackle that ambiguity. We built a prototype to predict, in real time, whether a borrower would drop off from the application before it actually happened – paving the way for lenders to provide timely help to troubled borrowers.
This is very different from how it has been done traditionally: typically a loan officer would call or email the potential borrower a few days after noticing the application had stalled, but by that time the borrower is no longer in front of the application and has moved on. Our prototype used data collected by our real-time data pipeline, which records the activities a user takes as they walk through the application.
In this post, we’ll walk through how we created this prototype starting with data collection and exploration, then model building and parameter tuning, before discussing future improvements to our prototype.
Part I: Understanding Our Data
What data did we have to work with?
Predictive models are fueled by data. The most relevant data we had was activity tracking. For compliance reasons, we log every activity a borrower takes from the moment they start the application. Each activity has a verb that corresponds to a specific action in our application. For example, ‘logging in’ is a verb, as well as ‘going back in the workflow’ or ‘connecting a bank account’.
The initial dataset was prepared by collecting all activities taken by each user who started an application within the last 30 days. We assigned each chain a label based on the user’s status: submitted or unsubmitted. We restricted the dataset to users who came from a single lender, ensuring that each user had to complete the same actions in the same order. This helped standardize our dataset.
After gathering and cleaning the data, we had something that looked like this:
Above you can see that each column corresponds to a specific user, while each row accounts for the nth action taken by the user. The first row is the first action taken by each user when they start the application.
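A small pandas sketch of that layout, using hypothetical verb names and user IDs (the real vocabulary and identifiers differ):

```python
import pandas as pd

# Hypothetical activity chains for three users -- the real verb names
# and user identifiers differ.
chains = {
    "user_a": ["logged_in", "started_application", "connected_bank_account"],
    "user_b": ["logged_in", "went_back_in_workflow"],
    "user_c": ["logged_in", "started_application", "went_back_in_workflow",
               "connected_bank_account"],
}

# One column per user; row n holds the nth action taken by that user.
# Shorter chains are padded with NaN.
frame = pd.DataFrame({user: pd.Series(verbs) for user, verbs in chains.items()})
print(frame)
```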
How are borrowers who didn’t complete the application different than those who did?
After sourcing the dataset, we began exploring the data to identify distinctions between users who submitted applications and those who didn’t. We first looked at two parameters: the number of actions taken and time spent in the application. Indeed, we saw distinctions between the two groups:
On average, users who submitted completed more actions (a median of 124 vs. 44) and spent more time in the application (a median of 22 hrs vs. 2 hrs 30 min).
Next, we wanted to examine how similar the paths that users took through the application were to each other. To do that, we first separated the data on application submission. Then we truncated each chain to the first 30 activities/verbs. After that, we calculated the edit distance for 300 randomly selected pairs of sequences, creating an edit distance matrix. Finally, we graphed the distribution of the matrix.
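We used the `editdistance` package for this step; as a self-contained illustration, a pure-Python sketch of the same computation (truncate each chain, sample random pairs, compute Levenshtein distances) might look like this:

```python
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def sampled_distances(sequences, n_pairs=300, max_len=30, seed=0):
    """Edit distances for randomly selected pairs of truncated verb chains."""
    rng = random.Random(seed)
    truncated = [seq[:max_len] for seq in sequences]
    return [levenshtein(*rng.sample(truncated, 2)) for _ in range(n_pairs)]
```

Plotting a histogram of `sampled_distances(...)` for each group (submitted vs. unsubmitted) reproduces the comparison described below.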
Interestingly, we saw that the distributions of edit distances for completed and uncompleted applications were quite different. This suggested that users who submitted took sets of actions more similar to one another, while users who did not submit took more varied paths. This makes sense — there’s just one way to submit the application, but there are many ways to get lost.
It looked like users who submitted and didn’t submit took a different number of actions, different amounts of time to complete the application and different paths through the application. This gave us a signal that it might be possible to predict application submission given the activity chain of the user.
Part II: Predicting Who Won’t Submit
How we first approached separating the two groups
After better understanding the distribution of our data, we felt more comfortable trying to predict borrower submissions.
Our training data consisted of sequences of actions. The fundamental problem we were trying to solve was sequence labeling: given a sequence, could we label it as submitted or not based on only a partial chain? Knowing whether a user will submit given the full sequence of actions is trivially easy – there’s a verb called ‘submitted application’. For our model to have value, we had to predict application submission earlier in the chain – more specifically, within the first 30 actions. We chose the first 30 actions because 80% of users reach 30 actions within 20 minutes. This means we know very quickly whether to intervene or not.
Because of the sequential nature of our data, we thought a recurrent neural network could be a good fit for our problem. A recurrent neural network (RNN) is a neural network whose output is a function of its previous outputs. (An introduction to RNNs can be found here. If you wish to dig deeper, check out Chapter 10 of Ian Goodfellow’s Deep Learning.) Recurrent neural networks have proven able to solve many sequence labeling problems, such as sentiment analysis and part-of-speech tagging.
Dealing with data shortcomings
To ensure our RNN worked well, a data preprocessing step was critical. The first change we made to our training set was truncating each sequence to 30 verbs. When passed full user activity chains, our model appeared to weigh the features at the end of an application too heavily. This is because several borrower activities occur only at the very end of the application. These actions, when logged, correspond to an extremely high probability that the user will submit. (Intuitively that makes sense: if a borrower has already filled out all of their information and linked their bank accounts, there is a very high probability that they will submit.) That proved problematic for our RNN because it was unable to accurately predict submission when a sequence didn’t contain those highly weighted features. By truncating every sequence to 30 verbs, we improved the model’s generalization on shorter chains.
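A minimal sketch of this truncation step, with a hypothetical verb vocabulary (reserving index 0 for padding is an assumption of this sketch; the production model one-hot encodes the verbs):

```python
def encode_chain(chain, verb_to_index, max_len=30, pad_value=0):
    """Truncate a chain to its first max_len verbs and pad short chains.

    Index 0 is reserved for padding; real verbs start at 1. The verb
    vocabulary below is hypothetical, not the real one.
    """
    ids = [verb_to_index[verb] for verb in chain[:max_len]]
    return [pad_value] * (max_len - len(ids)) + ids

vocab = {"logged_in": 1, "connected_bank_account": 2}  # toy vocabulary
print(encode_chain(["logged_in", "connected_bank_account"], vocab, max_len=5))
# [0, 0, 0, 1, 2]
```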
Our dataset was also imbalanced, as the number of uncompleted applications was vastly smaller than the number of completed applications (which speaks to the quality of our product). If we trained a model to predict if a borrower submitted or not, it could simply predict that every user will submit to reduce its loss function – arriving at a nominally impressive 86% accuracy. However, such a model would not be helpful at all – it would never correctly label borrowers who were struggling with the lending experience.
There were many ways to solve this problem involving oversampling (such as SMOTE), but they were hard to apply to our data. That’s because our dataset was a set of sequences, which meant that to synthetically generate more training data, we’d have to create new sequences – adding extra complexity to the model. Furthermore, every feature of our dataset was categorical in nature, so creating extra training examples via linear interpolation or injecting random noise would not produce representative data. Finally, our dataset wasn’t very large, so we wanted to avoid downsampling.
Instead of augmenting our dataset, we altered our model to weigh each misclassification differently. This corresponded to multiplying the loss of each example by the weight of each class.
We started with the ratio found in our dataset, which meant that each misclassification on a negative example contributed to the loss function 5.96x more than a misclassification on a positive example.
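One common way to arrive at such a ratio is to weight each class inversely to its frequency; a sketch, with toy label counts chosen to reproduce the 5.96x figure (the exact derivation used is an assumption):

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency, normalized so that
    the majority (submitted) class has weight 1.0."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items()}

# Toy labels: 1 = submitted, 0 = did not submit.
labels = [1] * 596 + [0] * 100
print(class_weights(labels))  # {1: 1.0, 0: 5.96}
```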
Our Initial Results
Using Keras, we were able to quickly define our model. Our RNN consisted of 30 GRU cells and a final densely connected output layer for our class label. We trained using early stopping on validation loss, as well as dropout between the recurrent layers for regularization. We optimized the weighted version of binary cross entropy loss described above. Our model summary is below.
Here we can see the shape of our training set, (None, 30, 273). “None” corresponds to the batch size of each gradient update. “30” corresponds to the maximum sequence length, and “273” is from the one-hot vector that represents the action taken, as there are 273 actions in total.
This is fed into a sequence of 30 GRU cells, which outputs a tensor of shape (None, 30). The 30-dimensional GRU output is then fed into a densely connected final layer to produce a binary prediction – hence (None, 1). In both cases, “None” once again corresponds to the batch size. We used a batch size of 64 examples.
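A sketch of how such a model might be defined in Keras. The layer shapes follow the description above, but the dropout rate, optimizer, and other hyperparameters are assumptions, not the original code:

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, N_VERBS = 30, 273   # truncated sequence length, one-hot verb vocabulary

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN, N_VERBS)),  # one-hot encoded activity chain
    layers.GRU(30, dropout=0.2),            # final hidden state: (None, 30)
    layers.Dense(1, activation="sigmoid"),  # probability of submission: (None, 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Class weights rescale the per-example loss, and early stopping monitors
# validation loss, e.g.:
# model.fit(x, y, batch_size=64, class_weight={0: 5.96, 1: 1.0},
#           validation_split=0.1,
#           callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss")])
model.summary()
```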
Since our dataset was imbalanced, accuracy was a poor measure of model performance. Instead, we evaluated our model based on the ROC graph and the confusion matrix. An ROC curve plots the true positive rate (recall) of a classifier against the false positive rate of a classifier. We also take note of the area under the curve (AUC), which is a better measure of overall performance than accuracy.
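For example, with scikit-learn (toy labels and predicted submission probabilities, not our data):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels (1 = submitted) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, y_score)  # points on the ROC graph
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```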
In the end, we trained our model on 10,000 examples, 8,500 of which completed the application. Our test set consisted of ~2,000 examples that we set aside earlier. Here’s how our model performed on the test set:
As you can see, this got us to a reasonably well-fit model. However, when we look at the confusion matrix, we can see that the number of false negatives was higher than we were comfortable with: when the model predicts a user won’t submit, it is wrong nearly half of the time. While our model performed well from a machine learning perspective, we had to take into account that the cost of a false positive differs from that of a false negative from a product perspective.
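The "wrong nearly half the time" figure is the error rate among users the model flags as non-submitters. With scikit-learn's confusion matrix on toy predictions (illustrative numbers only, not our results):

```python
from sklearn.metrics import confusion_matrix

# Toy labels and hard predictions (1 = submitted); illustrative only.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Of the users the model flagged as "won't submit", how often was it wrong?
wrong_when_flagging = fn / (tn + fn)
```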
Part III: Refining Our Model
Factoring in business intelligence
Let’s assume we wanted to send automated follow-up emails when we could tell that a user would not submit. In this case, a false positive occurs when we predict a user will submit when in fact they don’t. There, our cost is low, as that user would not have submitted anyway and the outcome would be the same as if our model did not exist.
On the other hand, a false negative, which corresponds to sending a user an automated email when in fact they did finish the application, has a much higher cost — if we email someone to remind them of something they already completed, they’re not just annoyed; they may start ignoring our emails.
We passed this information to our model by manually tuning the class weights. By reducing the weighting of misclassifications on negatives, the model optimized for fewer false negatives at the expense of more false positives.
We can see that there were two competing factors here: we wanted false positives to be weighted a lot more than false negatives because there were many positive examples to learn from. On the other hand, we wanted false negatives to be weighted more than false positives because the product cost of sending an email outweighed that of doing nothing. One of the challenges that we faced was balancing these two competing factors. We tried different weightings and compared the performance of the different models:
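As a runnable stand-in for such a sweep, the snippet below varies the negative-class weight of a scikit-learn logistic regression on synthetic imbalanced data. The logistic regression merely stands in for the RNN so the sweep runs in seconds, and the intermediate weight values are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data stands in for the activity features
# (roughly 15% negatives, as in our dataset).
X, y = make_classification(n_samples=2000, weights=[0.15, 0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for neg_weight in [5.96, 4.0, 3.06, 2.0]:  # only 5.96x and 3.06x are from the post
    clf = LogisticRegression(class_weight={0: neg_weight, 1: 1.0}, max_iter=1000)
    clf.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    results[neg_weight] = {"false_negatives": int(fn), "false_positives": int(fp)}
```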
Comparing different class weights
A particular class weighting specified how much more a misclassification on a negative example contributed to the loss function than a misclassification on a positive example. Changing the class weighting noticeably affected our model. As we decreased the relative weight of misclassifying negative examples, the model flagged fewer users overall, so both the true positive rate and the false positive rate for detecting non-submitters dropped. At first, a decrease in the true positive rate dramatically reduced our false positive rate. However, we saw that decreasing the class weighting further had diminishing returns.
Although the AUC for the 3.06x class weighting was slightly lower than that of the 5.96x class weighting, we saw only 55 false negatives – nearly one third fewer than before. This substantial reduction in false negatives made the 3.06x class weighting the best fit.
To recap, we wanted to see if we could predict whether a user would submit an application given only the first 30 actions. To do so, we first explored the data and found some interesting differences in activity count, time spent in the application, and sequence similarity between users who submitted and those who didn’t. We then applied an RNN to a processed but imbalanced dataset. By altering the class weights, we were able to learn from this imbalanced dataset and also pass some product knowledge to our model. In the end, we were not only able to create a model that accurately predicted user submission, but we were also able to bias that model against costly false negatives, ensuring that our model worked well from both a statistical and product perspective.
It would be interesting to see if we can predict user submission based on our real-time activity stream. To do so, we could try creating a larger, more representative dataset by truncating our activity chains at different lengths. Training on this dataset should yield better performance, especially when we are given fewer than 30 actions.
Another opportunity to improve the model could include adding more features to our dataset. Our model relied upon relatively simple features — just the action taken by the user. Hence, our features are quite sparse, and all categorical in nature. While sparsity is nice, we’re curious about what our model could do if it had more features — the time taken for each action, demographic data, and loan information to name a few. Finally, more examples will always help improve the performance of our model.
The notion of “noisy” labels could also improve our model’s performance. Our labels are not always correct because there is always a small chance that the user may come back to an uncompleted application and complete it at some later date. To account for this, we could inject a small amount of noise into our labels and have our RNN learn to fit the noise as well. Doing so might improve test set performance and generalization.
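One simple way to inject such noise is label smoothing; a sketch in which the smoothing amount `eps` is an assumed illustration value, not a tuned parameter:

```python
import numpy as np

def smooth_labels(y, eps=0.05):
    """Soften hard 0/1 labels toward [eps, 1 - eps], reflecting the chance
    that an 'unsubmitted' user later returns and finishes the application.
    The eps value here is an assumption for illustration."""
    y = np.asarray(y, dtype=float)
    return y * (1.0 - 2.0 * eps) + eps

smoothed = smooth_labels([0, 1, 1])  # [0.05, 0.95, 0.95]
```

Training the RNN against these softened targets (rather than hard 0/1 labels) is one way to keep it from fitting uncertain labels too confidently.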
In addition, 1-D convolutional neural networks have proven to be quite effective at other sequence classification tasks, like sentiment analysis. It would be interesting to see whether different model architectures could yield higher performance.
To do this analysis, we used a combination of seaborn/matplotlib for visualization, Keras/TensorFlow for modeling, and scikit-learn for metrics, model evaluation, and hyperparameter tuning. We also used a Python library called editdistance to compute the edit distances of the sequences. We created and pruned our dataset using pandas.
Interested in joining the engineering team at Blend? Let us know.