back
loading skill details...
Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment, missing…
Data Cleaning and Variable Screening Quick Start # Run the complete data cleaning pipeline python ".github/skills/datanalysis-credit-risk/scripts/example.py" Complete Process Description The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data: Get Data - Load and format raw data Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations Drop High PSI Features - Remove features with unstable PSI Null Importance Denoising - Remove noise features using label permutation method Drop High Correlation Features - Remove high correlation features based on original gain Export Report - Generate Excel report containing details and statistics of all steps Core Functions Function Purpose Module get_dataset() Load and format data references.func org_analysis() Organization sample analysis references.func missing_check() Calculate missing rate references.func drop_abnormal_ym() Filter abnormal months references.analysis drop_highmiss_features() Drop high missing rate features references.analysis drop_lowiv_features() Drop low IV features references.analysis drop_highpsi_features() Drop high PSI features references.analysis drop_highnoise_features() Null Importance denoising references.analysis drop_highcorr_features() Drop high correlation features references.analysis iv_distribution_by_org() IV distribution statistics references.analysis psi_distribution_by_org() PSI distribution statistics references.analysis value_ratio_distribution_by_org() Value ratio distribution statistics references.analysis export_cleaning_report() Export cleaning report references.analysis Parameter Description Data Loading Parameters DATA_PATH: Data file path (best are parquet format) DATE_COL: Date column name Y_COL: Label column name ORG_COL: Organization column name KEY_COLS: Primary key column name list OOS Organization Configuration OOS_ORGS: Out-of-sample organization list Abnormal Month Filtering Parameters min_ym_bad_sample: Minimum bad sample count per month (default 10) min_ym_sample: Minimum total sample count per month (default 500) Missing Rate Parameters missing_ratio: Overall missing rate threshold (default 0.6) IV Parameters overall_iv_threshold: Overall IV threshold (default 0.1) org_iv_threshold: Single organization IV threshold (default 0.1) max_org_threshold: Maximum tolerated low IV organization count (default 2) PSI Parameters psi_threshold: PSI threshold (default 0.1) max_months_ratio: Maximum unstable month ratio (default 1/3) max_orgs: Maximum unstable organization count (default 6) Null Importance Parameters n_estimators: Number of trees (default 100) max_depth: Maximum tree depth (default 5) gain_threshold: Gain difference threshold (default 50) High Correlation Parameters max_corr: Correlation threshold (default 0.9) top_n_keep: Keep top N features by original gain ranking (default 20) Output Report The generated Excel report contains the following sheets: 汇总 - Summary information of all steps, including operation results and conditions 机构样本统计 - Sample count and bad sample rate for each organization 分离OOS数据 - OOS sample and modeling sample counts Step4-异常月份处理 - Abnormal months that were removed 缺失率明细 - Overall and organization-level missing rates for each feature Step5-有值率分布统计 - Distribution of features in different value ratio ranges Step6-高缺失率处理 - High missing rate features that were removed Step7-IV明细 - IV values of each feature in each organization and overall Step7-IV处理 - Features that do not meet IV conditions and low IV organizations Step7-IV分布统计 - Distribution of features in different IV ranges Step8-PSI明细 - PSI values of each feature in each organization each month Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations Step8-PSI分布统计 - Distribution of features in different PSI ranges Step9-null importance处理 - Noise features that were removed Step10-高相关性剔除 - High correlation features that were removed Features Interactive Input: Parameters can be input before each step execution, with default values supported Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis Complete Report: Generate complete Excel report containing details, statistics, and distributions Multi-process Support: IV and PSI calculations support multi-process acceleration Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction
don't have the plugin yet? install it then click "run inline in claude" again.