- Built several neural network and machine learning models (ANN, RandomForest, and XGBoost) to predict whether the bank should deny a loan application.
- Processed and transformed the data, and applied several feature engineering methods to fill in columns that contain NA values.
- Python 3.8
- Packages: pandas, numpy, seaborn, matplotlib, sklearn, xgboost, keras
- [XGBoost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html)
- Drop several columns that are either already captured by another feature or not important (Name, LoanNr, State, City, ApprovalFY, DisbursementDate, BalanceGross, SBA_Appv, daysterm, xx).
- Normalize or standardize the numeric columns (NoEmp, CreateJob, RetainedJob, DisbursementGross, GrAppv).
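
The column drop and scaling steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the rows are made up, and the use of scikit-learn's `StandardScaler` and `MinMaxScaler` is an assumption, since the notes do not say which scaler was used.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up rows for illustration only; column names come from the lists above.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "City": ["X", "Y", "Z", "W"],
    "NoEmp": [4.0, 2.0, 7.0, 19.0],
    "CreateJob": [0.0, 1.0, 3.0, 6.0],
    "RetainedJob": [0.0, 2.0, 7.0, 19.0],
    "DisbursementGross": [60000.0, 40000.0, 287000.0, 35000.0],
    "GrAppv": [60000.0, 40000.0, 287000.0, 35000.0],
})

drop_cols = ["Name", "LoanNr", "State", "City", "ApprovalFY", "DisbursementDate",
             "BalanceGross", "SBA_Appv", "daysterm", "xx"]
numeric_cols = ["NoEmp", "CreateJob", "RetainedJob", "DisbursementGross", "GrAppv"]

# errors="ignore" skips listed columns that are absent from this toy frame.
df = df.drop(columns=drop_cols, errors="ignore")

# Standardize: zero mean and unit variance per column.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols, index=df.index,
)

# Normalize: rescale each column to the [0, 1] range.
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols, index=df.index,
)
```

In a real train/test workflow the scaler would typically be fit on the training split only and then applied to the test split.
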
- Feature engineering on selected columns (Zip, ApprovalDate, Term, FranchiseCode, RevLineCr, LowDoc):
  - Zip: keep only the second and third digits. The first digit carries state information, which is the same for every row; the second and third digits identify a smaller region.
  - ApprovalDate: stored as a 5-digit number, so first convert it to Y-m-d format, then extract the year and month.
  - Term: group 60, 84, 120, 240, and 300 days together, since significantly more people apply at these durations and their MIS_Status behavior is the same. Treat 36 days as its own category: it also has significantly more applicants than the other durations, but its MIS_Status differs from the group above.
  - FranchiseCode: treat FranchiseCode = 1 and FranchiseCode = 2 as two separate categories, since far more applicants fall into these two groups than into the others. Group all other FranchiseCode values into one category.
  - RevLineCr: the dataset contains some invalid values (0, T), so treat them as null.
  - LowDoc: the dataset contains some invalid values (0, S, A), so treat them as null.
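
The per-column rules above could look roughly like this in pandas. It is a sketch under stated assumptions: the sample rows are made up, the derived column names (`ZipRegion`, `ApprovalYear`, `ApprovalMonth`, `TermGroup`, `FranchiseGroup`) are hypothetical, Zip codes are assumed to be 5 digits, and the 5-digit ApprovalDate is assumed to be an Excel-style serial day count.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Zip": [47711, 46526, 47401, 47130],
    "ApprovalDate": [35489, 36526, 38000, 39000],
    "Term": [84, 36, 45, 240],
    "FranchiseCode": [1, 2, 0, 15100],
    "RevLineCr": ["N", "0", "Y", "T"],
    "LowDoc": ["Y", "S", "N", "A"],
})

# Zip: keep the second and third digits (assumes 5-digit codes, zero-padded).
df["ZipRegion"] = df["Zip"].astype(str).str.zfill(5).str[1:3]

# ApprovalDate: ASSUMPTION - the number counts days from 1899-12-30 (Excel serial).
approval = pd.to_datetime(df["ApprovalDate"], unit="D", origin="1899-12-30")
df["ApprovalYear"] = approval.dt.year
df["ApprovalMonth"] = approval.dt.month

# Term: one group for the frequent durations, 36 on its own, everything else "other".
common_terms = [60, 84, 120, 240, 300]
df["TermGroup"] = np.select(
    [df["Term"].isin(common_terms), df["Term"] == 36],
    ["common", "36"],
    default="other",
)

# FranchiseCode: 1 and 2 keep their own category, all other codes share one.
df["FranchiseGroup"] = df["FranchiseCode"].map({1: "1", 2: "2"}).fillna("other")

# RevLineCr / LowDoc: replace the invalid values listed above with nulls.
# ASSUMPTION: the invalid values are stored as strings.
df["RevLineCr"] = df["RevLineCr"].mask(df["RevLineCr"].isin(["0", "T"]))
df["LowDoc"] = df["LowDoc"].mask(df["LowDoc"].isin(["0", "S", "A"]))
```

If ApprovalDate turns out to use a different encoding, only the `origin` and `unit` arguments need to change.
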
- One-hot encode the categorical parameters.
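
One common way to do the one-hot encoding step is `pandas.get_dummies`. The sketch below assumes that approach; the sample rows and the `TermGroup` column name are hypothetical.

```python
import pandas as pd

# Made-up rows; TermGroup is a hypothetical engineered column.
df = pd.DataFrame({
    "NoEmp": [4, 2, 7],
    "TermGroup": ["common", "36", "other"],
    "LowDoc": ["Y", "N", "Y"],
})

# Each listed column is replaced by one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["TermGroup", "LowDoc"], dtype=int)
```

Rows with a null in an encoded column get all-zero indicators by default, so this step fits best after the null values have been filled.
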
- Built an XGBoost model to predict the null values in BankState, LowDoc, and RevLineCr.
  - The columns are predicted in this order because BankState and LowDoc have far fewer null values (3 and 8) than RevLineCr (786).
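
Model-based filling of null values, as described above, might be sketched like this. The helper `fill_nulls_with_model`, the synthetic data, the feature columns, and the hyperparameters are all assumptions for illustration. The sketch uses `XGBClassifier` when xgboost is installed and otherwise falls back to scikit-learn's `GradientBoostingClassifier`, which is a stand-in and not what the notes describe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

try:
    from xgboost import XGBClassifier as Classifier
except ImportError:  # stand-in so the sketch still runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as Classifier


def fill_nulls_with_model(df, target, feature_cols):
    """Hypothetical helper: train on rows where `target` is known, predict the rest."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df
    encoder = LabelEncoder()  # classifiers here expect integer class labels
    y = encoder.fit_transform(known[target])
    model = Classifier(n_estimators=50, max_depth=3, random_state=0)
    model.fit(known[feature_cols], y)
    codes = np.asarray(model.predict(missing[feature_cols])).astype(int)
    filled = df.copy()
    filled.loc[missing.index, target] = encoder.inverse_transform(codes)
    return filled


# Synthetic, deterministic data for illustration only.
no_emp = np.arange(1, 61)
df = pd.DataFrame({
    "NoEmp": no_emp.astype(float),
    "GrAppv": 10000.0 * (no_emp % 10 + 1),
})
df["BankState"] = np.where(df["NoEmp"] > 30, "CA", "NY")
df["LowDoc"] = np.where(df["GrAppv"] < 50000, "Y", "N")
df["RevLineCr"] = np.where(df["NoEmp"] % 2 == 0, "Y", "N")
df.loc[[0, 1, 2], "BankState"] = np.nan
df.loc[[3, 4], "LowDoc"] = np.nan
df.loc[5:14, "RevLineCr"] = np.nan

# Fill the columns with the fewest nulls first, as in the notes.
features = ["NoEmp", "GrAppv"]
filled = df
for column in ["BankState", "LowDoc", "RevLineCr"]:
    filled = fill_nulls_with_model(filled, column, features)
```

The helper returns a copy, so the original frame keeps its nulls and the filled values can be compared against it.
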
- ANN
- RandomForest model
- XGBoost model

