Bank Loan Prediction Project Background

Using Logistic Regression to Predict Bank Loan Outcomes

Building an interpretable machine learning model to predict loan approval outcomes and identify the key factors driving lending decisions.

TL;DR

I built a loan approval prediction model with Logistic Regression, using a mix of original and engineered features, and achieved an accuracy of 75.7%. My feature analysis revealed that Credit History is the most influential feature in loan approval decisions, while income-related features had surprisingly minimal impact.

The Objective

Banks process large numbers of loan applications daily, requiring quick yet accurate decisions that balance risk management with customer service. Understanding which factors truly drive approval decisions helps optimize business processes and helps applicants submit stronger applications.

Goal: Can I build a logistic regression model that accurately predicts loan approval outcomes while revealing which applicant characteristics matter most to lenders?

Data & Context

Source: Bank Loan data from Kaggle. 614 training records and 367 test records.
Columns: 12 original columns, including the features Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History and Property_Area, plus Loan_Status (the binary target).
Engineered Features: Total Income, Lead Applicant Income Percentage, and Loan Amount to Income Ratio.
Limitations: Missing (NULL) values were filled with 0 (a simplified approach), and the dataset is relatively small for robust analysis.
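
For context, that 0-fill is the kind of one-liner sketched below; a minimal sketch assuming pandas and an assumed file name of train.csv (a more careful approach would impute per column):

```python
import pandas as pd

# Minimal sketch of the simplified missing-value handling (file name assumed).
df = pd.read_csv("train.csv")
print(df.isnull().sum())   # see which columns actually contain NULLs
df = df.fillna(0)          # simplified approach: every NULL becomes 0
```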

My Approach

I selected Logistic Regression because interpretability was key to solving the business problem. Applicants not only want to know whether their loan was rejected, but why. This is something a 'black-box' model like XGBoost or a random forest could not easily provide.

1. Converted categorical variables to numerical, mapping features like Gender, Education, and Property_Area to numeric values suitable for model training.
2. Engineered three new features to capture financial health more holistically: Total Income (combining applicant and co-applicant income), Lead Applicant Income Percentage, and Loan Amount to Income Ratio.
3. Built a correlation heatmap to understand feature relationships and identify potential multicollinearity issues before model training.
4. Trained the Logistic Regression model with standardized features (using StandardScaler) and a 70/30 train-test split. Steps 1-4 are sketched in code after this list.
5. Applied SHAP (Shapley values) to analyze each feature's individual contribution; a SHAP sketch follows Figure 2 below.
Key decision: I planned from the start to include Shapley values for model explainability, which is why I chose Logistic Regression over more complex models. This allowed me to deliver predictions with explanations that stakeholders could trust and act upon.
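
A rough sketch of steps 1-4 is below. Only the column names come from the dataset description above; the file name train.csv, the identifier column Loan_ID, and the exact category-to-number mappings are illustrative assumptions rather than the precise code I ran.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the training data (file name assumed) and apply the simple 0-fill.
df = pd.read_csv("train.csv")
df = df.fillna(0)

# Step 1: map categorical columns to numeric codes (mappings are illustrative).
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0}).fillna(0)
df["Married"] = df["Married"].map({"Yes": 1, "No": 0}).fillna(0)
df["Education"] = df["Education"].map({"Graduate": 1, "Not Graduate": 0}).fillna(0)
df["Self_Employed"] = df["Self_Employed"].map({"Yes": 1, "No": 0}).fillna(0)
df["Property_Area"] = df["Property_Area"].map({"Rural": 0, "Semiurban": 1, "Urban": 2}).fillna(0)
df["Dependents"] = pd.to_numeric(df["Dependents"].replace("3+", 3), errors="coerce").fillna(0)
df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# Step 2: engineered features capturing overall financial health.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["Lead_Applicant_Income_Pct"] = df["ApplicantIncome"] / df["Total_Income"].replace(0, 1)  # guard against /0
df["Loan_Amount_to_Income"] = df["LoanAmount"] / df["Total_Income"].replace(0, 1)

# Step 3: quick correlation check against the target (the heatmap uses the same matrix).
print(df.corr(numeric_only=True)["Loan_Status"].sort_values(ascending=False))

# Step 4: standardize features and train on a 70/30 split.
X = df.drop(columns=["Loan_ID", "Loan_Status"], errors="ignore")  # Loan_ID assumed; ignored if absent
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")
```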
Figure 1: Feature Correlation Matrix (heatmap of relationships between all numerical features in the loan dataset). Credit History shows the strongest correlation with the loan outcome (0.56). Engineered features like Total Income and Loan Amount to Income Ratio show expected mathematical relationships but limited predictive effect.
Figure 2: SHAP Feature Importance Summary (summary plot of feature importance and impact direction on loan approval predictions). Credit History dominates all other features in predicting approval, with SHAP values ranging from -2 to +3. Property Area and Married status show moderate importance, while income-related features cluster near zero impact.
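
For reference, a summary plot like Figure 2 can be produced roughly as below. This sketch assumes the fitted model, the scaled matrices X_train_scaled / X_test_scaled, and the feature frame X from the pipeline sketch above; LinearExplainer is used because it suits a linear model such as logistic regression.

```python
import shap

# SHAP analysis (step 5): per-feature contributions to each prediction.
explainer = shap.LinearExplainer(model, X_train_scaled)
shap_values = explainer.shap_values(X_test_scaled)

# Beeswarm-style summary of feature importance and impact direction,
# comparable to Figure 2.
shap.summary_plot(shap_values, X_test_scaled, feature_names=X.columns)
```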

Findings

Credit History is Crucial

With SHAP values 3-4x larger than any other feature, credit history overwhelmingly determines whether a loan is approved. Applicants with a positive credit history see dramatically higher approval rates than those without.

Income Matters Less Than Expected

Going into this project, I expected income to be one of the most important factors in loan decisions. However, income-related features showed minimal SHAP importance. The model suggests lenders weight repayment track record far above current earnings.

75.7% Accuracy with Interpretability

The model correctly predicted 140 of 185 test cases (28 true negatives, 112 true positives), with particularly strong recall (85%) for approved loans. The 25 false positives indicate the model leans towards approval in borderline cases.
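
As a quick sanity check, those figures are internally consistent: with 28 true negatives, 112 true positives and 25 false positives, the remaining 20 errors must be false negatives, which reproduces both the accuracy and the recall quoted above.

```python
# Re-deriving the reported metrics from the confusion matrix above.
tn, tp, fp = 28, 112, 25
fn = 185 - (tn + tp + fp)            # 20 false negatives implied by the 185 test cases

accuracy = (tp + tn) / 185           # 140 / 185 ≈ 0.757
recall_approved = tp / (tp + fn)     # 112 / 132 ≈ 0.848

print(f"accuracy = {accuracy:.3f}, recall (approved) = {recall_approved:.3f}")
```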

Recommendation: Financial institutions should prioritise credit history above all else, as it's the primary driver of loan decisions. Applicants should be clearly informed that building a strong credit history matters more than income when applying for loans. For model deployment, the false positive rate suggests adding a manual review step for predicted approvals where other predictive features are weaker.

Reflection

This project reinforced the value of explainable AI in high-stakes domains. While an XGBoost or random forest model might have achieved 80%+ accuracy, the SHAP-based insights into lending behavior are more valuable to the business than those few extra percentage points.

If I revisited this analysis, I would trial different models before settling on Logistic Regression, and explore more feature engineering, as the features I created didn't meaningfully improve the model's accuracy. I'd also review the predictions to ensure any biases embedded in credit history data don't disproportionately affect certain demographics, such as gender or age.

Tools & Technologies

Python · Pandas · NumPy · Scikit-learn · SHAP · Logistic Regression · Feature Engineering