Classic Machine Learning | How to Predict Pre-Churn and Win Back Churned Users - Tencent Cloud Zone

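Introduction: Pre-churn users are users who show a tendency to churn but have not actually started churning. Compared with churned users, they are still on the fence: they may have reservations about the current product, reservations about the alternative they might move to (a competitor), or they may simply be waiting for something. Churned users are users who have already left, perhaps because they quit the game or chose another product; they are certainly still playing something, just no longer with you. This article describes how to use classic machine learning (ML) methods to find the users most likely to churn and the users most willing to return. Operations staff can then focus their interventions on these users, lowering the share of pre-churn users and raising the return rate.
Background

In day-to-day game operations we often need to raise the retention rate of target users, raise the return rate of churned users, target operations precisely, and save operating resources. To meet these needs, we applied classic machine learning to two scenarios: pre-churn prediction and churn win-back.

The overall model design flow is shown in the figure below: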

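Overview of Pre-Churn and Churn Win-Back

1. Pre-churn

Pre-churn users are users who show a tendency to churn but have not actually started churning. Compared with churned users, they are still on the fence: they may have reservations about the current product, reservations about the alternative they might move to (a competitor), or they may simply be waiting for something.

2. Churn win-back

Churned users are users who have already left, perhaps because they quit the game or chose another product; they are certainly still playing something, just no longer with you.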

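Obtaining Training Data

1. Base data

Preparing the base data is the first step, the most basic and also the most important. We need to locate and clean the raw data, which covers several modules: user login data, recharge data, and user profile data.

For model training, more data is generally better.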

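2. Labeling users

In the pre-churn scenario we judge whether a user churns: a user who was active the week before last but inactive last week is a churned user, label=1; conversely, a user who was also active last week gets label=0. We can work in units of one week and read the raw data of the past four weeks, eight weeks, or more.

In the churn win-back scenario the label logic is exactly the opposite, as shown in the figure below.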
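The labeling rule can be sketched in plain Python. This is only an illustration: the function names are hypothetical, and the win-back rule is our reading of "exactly the opposite" (a user who was inactive the week before last counts as returned, label=1, if active last week).

```python
def prelost_label(active_week_before_last, active_last_week):
    """Pre-churn label: 1 = churned, 0 = retained, None = not a training sample."""
    if not active_week_before_last:
        return None  # only users active in the reference week are labeled
    return 0 if active_last_week else 1


def winback_label(active_week_before_last, active_last_week):
    """Win-back label (assumed mirror rule): 1 = returned, 0 = still churned."""
    if active_week_before_last:
        return None  # only users already inactive in the reference week are labeled
    return 1 if active_last_week else 0


# (active the week before last, active last week) for three hypothetical users
users = {"u1": (True, False), "u2": (True, True), "u3": (False, True)}
for uid, (before, last) in users.items():
    print(uid, prelost_label(before, last), winback_label(before, last))
```

Sliding this two-week window back over four or eight weeks of raw data yields more labeled samples from the same users.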

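Preparing Training and Test Data

1. Splitting training and test data

Split the data into three sets (training, validation, and test) in proportions suited to the size of your dataset. The validation set is used for tuning during training; the test set is used only at the very end, once all model parameters are fixed, to verify the model's performance.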
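As a minimal sketch of a three-way split, the plain-Python helper below shuffles the rows with a fixed seed and cuts them by proportion. The 60/20/20 weights and the helper name are assumptions for illustration; on a Spark DataFrame, `randomSplit` with three weights plays the same role.

```python
import random


def three_way_split(rows, weights=(0.6, 0.2, 0.2), seed=43):
    """Shuffle rows reproducibly and cut them into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * weights[0])
    n_val = round(n * weights[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split stable between runs, so tuning results on the validation set stay comparable.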

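2. Balancing positive and negative samples

If the ratio of positive to negative samples in the real data is severely imbalanced, it needs to be handled. The approach here is random sampling with replacement; example code follows: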

# Balance positive and negative samples
import random
pos_vs_neg = 1.0  # target ratio of positive to negative samples
pos_data_count = train_df.filter(train_df['ilabel'] == 1).count()
neg_data_count = train_df.filter(train_df['ilabel'] == 0).count()
gap = pos_data_count - neg_data_count * pos_vs_neg
print('from ', pos_data_count, neg_data_count, 'to', pos_data_count, neg_data_count*pos_vs_neg, 'gap', gap)


if gap > 0:  # more positives: oversample the negatives (with replacement)
    data_add = train_df.filter(train_df['ilabel'] == 0).sample(True, gap/neg_data_count, random.randint(1, 1000))
    train_df = train_df.union(data_add)
elif gap < 0:  # more negatives: oversample the positives (with replacement)
    data_add = train_df.filter(train_df['ilabel'] == 1).sample(True, -gap/pos_data_count, random.randint(1, 1000))
    train_df = train_df.union(data_add)
# Counts after balancing
print(train_df.filter(train_df['ilabel'] == 1).count(), train_df.filter(train_df['ilabel'] == 0).count())
print('balancing data finish')


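Feature Engineering

1. Feature selection

Only a few commonly used features are excerpted here; you can add simple and combined features to suit your own scenario. Date features need care: different games launch at different times, and date-formatted values are awkward to compute with. For example, 20181231, 20190101, and 20190102 are each only one day apart, yet numerically they differ greatly. Here we convert each date directly into the number of days before today, which makes the date data numeric and convenient for later computation.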
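The date conversion can be sketched with the standard library. The helper name and the choice of 2019-01-03 as "today" are assumptions made only for this example.

```python
from datetime import date, datetime


def days_since(yyyymmdd, today):
    """Convert a YYYYMMDD date string into the number of days before `today`."""
    d = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    return (today - d).days


today = date(2019, 1, 3)  # hypothetical reference date
for raw in ["20181231", "20190101", "20190102"]:
    print(raw, days_since(raw, today))
```

As integers, 20181231 and 20190101 differ by 8870, while the converted values differ by exactly one, which is what the model should see.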

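2. Feature processing

2.1 Missing-value filling

In the pre-churn scenario we fill missing login and recharge data with 0, and fill missing date/time data with the maximum value.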
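A minimal plain-Python sketch of the two fill rules follows. The column names are hypothetical, and 9999 is taken from the fill value used in the demo at the end of this article: once dates are expressed as "days ago", a missing date is treated as very long ago.

```python
def fill_missing(row, zero_cols, max_cols, max_value=9999):
    """Return a copy of row with None replaced by 0 or by max_value, per column."""
    filled = dict(row)
    for col in zero_cols:
        if filled.get(col) is None:
            filled[col] = 0  # no login / no recharge recorded
    for col in max_cols:
        if filled.get(col) is None:
            filled[col] = max_value  # unknown date: treat as very long ago
    return filled


row = {"login_cnt": None, "recharge_amt": 30, "days_since_last_login": None}
print(fill_missing(row, ["login_cnt", "recharge_amt"], ["days_since_last_login"]))
```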

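2.2 z-score standardization

Features with very different value ranges distort model training, so many features, such as login count and recharge amount, should be standardized.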
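For reference, z-score standardization maps each value x to (x - mean) / std. The sketch below uses the population standard deviation; libraries may use the sample standard deviation instead, so exact values can differ slightly.

```python
import math


def zscore(values):
    """Standardize a list of numbers to zero mean and unit (population) std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # a constant feature carries no information
    return [(v - mean) / std for v in values]


login_counts = [1, 2, 3]  # hypothetical login counts
print(zscore(login_counts))
```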

2.3 One-hot encoding

For enumerated (categorical) features, such as gender, the most common encoding is one-hot.
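As a small illustration, one-hot encoding turns a categorical value into a vector with a single 1 at the position of its category. The category list here is an assumption for the example.

```python
def one_hot(value, categories):
    """Encode value as a 0/1 list with one slot per known category."""
    return [1 if value == c else 0 for c in categories]


genders = ["male", "female", "unknown"]  # hypothetical category list
print(one_hot("female", genders))
```

A value outside the category list encodes to all zeros in this sketch; whether that is acceptable depends on the feature.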

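Training the Model

1. Model selection

There are many models to choose from for predicting churn scores and return scores. This article uses logistic regression (LR) as the example to show how to put a classic machine learning algorithm into practice in production. For details on LR, see the following two links:

Logistic regression Docs

pyspark.ml.classification.LogisticRegression APIs

2. Model tuning

Tune the model on the validation set. The configurable parameters of LR are listed below: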

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: ilabel)
maxIter: max number of iterations (>= 0). (default: 100, current: 1000)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability, current: score)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.0, current: 0.03)
standardization: whether to standardize the training features before fitting the model. (default: True)
threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5, current: 0.6)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)

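Commonly adjusted parameters include the maximum number of iterations maxIter=1000, the regularization parameter regParam=0.03, and the threshold threshold=0.6.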
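To make regParam, elasticNetParam, and threshold more concrete, here is a plain-Python sketch of the elastic-net penalty as described in the Spark documentation, regParam * (alpha * L1 + (1 - alpha) / 2 * L2), and of how a threshold turns a probability into a 0/1 prediction. The function names are hypothetical, and treating a score exactly equal to the threshold as negative is an assumption.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: pure L2 when alpha = 0, pure L1 when alpha = 1."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1 - alpha) * 0.5 * l2)


def predict_label(probability, threshold=0.6):
    """Predict 1 only when the positive-class probability exceeds the threshold."""
    return 1 if probability > threshold else 0


weights = [1.0, -2.0]  # hypothetical model coefficients
print(elastic_net_penalty(weights, reg_param=0.03, elastic_net_param=0.0))
print(predict_label(0.55), predict_label(0.75))
```

Raising the threshold from the default 0.5 to 0.6 makes the model more conservative about predicting the positive class, which usually trades recall for precision.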

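Offline Model Evaluation

1. Evaluation metrics

The offline evaluation metrics are AUC, precision, recall, and F1.

For an introduction to AUC, see the blog post "AUC, ROC: the most thorough explanation I have seen" (AUC，ROC我看到的最透彻的讲解); AUC measures a model's overall performance. Precision is the fraction of samples predicted positive that really are positive; recall is the fraction of actual positives that the model finds. F1 balances precision against recall. Precision, recall, and F1 all change with the threshold, so choose the threshold to suit the actual product scenario.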
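How these three metrics move with the threshold can be seen in a small plain-Python sketch. The labels and scores are made-up toy data, and the helper name is hypothetical.

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute precision, recall and F1 when predicting 1 for score > threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s > threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s > threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s <= threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = [1, 1, 0, 0, 1]              # toy ground truth
scores = [0.9, 0.7, 0.65, 0.2, 0.4]   # toy predicted probabilities
for threshold in (0.3, 0.6, 0.8):
    print(threshold, precision_recall_f1(labels, scores, threshold))
```

On this toy data, raising the threshold from 0.6 to 0.8 lifts precision to 1.0 while recall drops to one third, which is the trade-off to weigh against the cost of each intervention.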

The demo implementation below shows three ways to compute AUC:

# Model evaluation
# result_df is assumed to be the prediction output on the test data, i.e. lr_model.transform(test_df)
## AUC on the training data
print("train auc is %.6f" % lr_model.summary.areaUnderROC)


## Method 1: compute AUC with pyspark.mllib.evaluation.BinaryClassificationMetrics
# BinaryClassificationMetrics takes an RDD of (score, label) pairs, where score is the predicted probability of class 1
from pyspark.mllib.evaluation import BinaryClassificationMetrics
binary_metrics = BinaryClassificationMetrics(
    result_df.select('ilabel', 'score')
    .rdd.map(lambda x: (float(x['score'][1]), float(x['ilabel']))))
print("test auc is %.6f" % binary_metrics.areaUnderROC)


## Method 2: compute AUC with pyspark.ml.evaluation.BinaryClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='ilabel', metricName='areaUnderROC')
print("test auc is %.6f" % binary_evaluator.evaluate(result_df))


## Method 3: compute AUC with BinaryLogisticRegressionSummary; note that the test data is passed in directly, with no need to predict first
print("test auc is %.6f" % lr_model.evaluate(test_df).areaUnderROC)


# Confusion-matrix counts at the model's configured threshold
tn = result_df.filter(result_df['ilabel'] == 0).filter(result_df['prediction'] == 0).count()
fp = result_df.filter(result_df['ilabel'] == 0).filter(result_df['prediction'] == 1).count()
fn = result_df.filter(result_df['ilabel'] == 1).filter(result_df['prediction'] == 0).count()
tp = result_df.filter(result_df['ilabel'] == 1).filter(result_df['prediction'] == 1).count()
print(tn, fp, fn, tp)
# Precision, recall and F1 (these divisions fail if no sample is predicted or labeled positive)
precision = tp*1.0 / (fp+tp)
recall = tp*1.0 / (fn+tp)
f1 = 2*precision*recall / (precision+recall)
print("test precision is %.6f" % precision)
print("test recall is %.6f" % recall)
print("test f1 is %.6f" % f1)

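2. Learning curves

Analyzing learning curves helps diagnose high-bias and high-variance problems.

High bias (underfitting): as the number of training samples grows, the training-set and validation-set errors both level off at a high value and end up very close to each other.

Things to try: obtain more features, add polynomial features, decrease the regularization strength λ.

High variance (overfitting): the training-set and validation-set errors gradually approach each other but a clear gap remains, and the errors are still settling as the number of samples grows.

Things to try: more training samples, fewer features, increase the regularization strength λ.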

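Predicting

1. Obtaining prediction data

In the pre-churn scenario, the prediction data is the users active this week, and we predict whether each of them will churn next week. In the win-back scenario, the prediction data is the users who churned this week, and we predict whether each of them will return next week.

2. Grouping the prediction data

First, split the prediction data into two groups, model and random. The model group gets its Score from the model; the random group gets its Score from a rand function. In both groups the Score is then compared with the threshold to decide whether a sample is positive or negative.

Then split the scored data into 2*2 groups: within each of the two groups, one part receives the online intervention and the other part is a control that receives none, so that the effect of the online intervention can be compared.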
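The 2*2 grouping can be sketched as follows. The 50/50 split ratios and the helper name are assumptions; the group letters are inferred from the comparisons in the effect analysis (A = model + intervention, B = model + control, C = random + intervention, D = random + control).

```python
import random

# Inferred mapping: (scoring method, receives intervention) -> group letter
GROUP_NAMES = {
    ("model", True): "A",
    ("model", False): "B",
    ("random", True): "C",
    ("random", False): "D",
}


def assign_groups(user_ids, seed=7):
    """Randomly assign each user a scoring method and an intervention arm."""
    rng = random.Random(seed)
    assignment = {}
    for uid in user_ids:
        scorer = "model" if rng.random() < 0.5 else "random"
        intervene = rng.random() < 0.5
        assignment[uid] = GROUP_NAMES[(scorer, intervene)]
    return assignment


groups = assign_groups(range(1000))
for name in "ABCD":
    print(name, sum(1 for g in groups.values() if g == name))
```

Assigning both dimensions at random, independently of any user attribute, is what later lets a difference between groups be attributed to the model or to the intervention rather than to who happened to be in each group.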

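3. Analyzing online results

As the figure above shows, the analysis has two dimensions: model effect and intervention effect.

3.1 Model effect

When analyzing the model effect we need to control variables and rule out the influence of intervention versus no intervention. We expect model predictions to be generally more accurate than random predictions.

With intervention in both cases, compare the accuracy of group A and group C; with no intervention in both cases, compare the accuracy of group B and group D.

3.2 Intervention effect

Likewise, we need to rule out the influence of the prediction strategy. We expect the retention rate or return rate of the intervention groups to be generally better than that of the control groups.

With model prediction in both cases, compare the retention rates of group A and group B; with random prediction in both cases, compare the retention rates of group C and group D.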
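These four comparisons can be written down as a short sketch. The record layout ("group", "hit" for a correct prediction, "retained" for a user who stayed) and the toy data are assumptions for illustration only.

```python
def group_rate(records, group, key):
    """Mean of a 0/1 field over all records that belong to one group."""
    rows = [r for r in records if r["group"] == group]
    if not rows:
        raise ValueError("no records for group " + group)
    return sum(r[key] for r in rows) / len(rows)


def effect_report(records):
    """Differences described above: A vs C and B vs D, then A vs B and C vs D."""
    hit = {g: group_rate(records, g, "hit") for g in "ABCD"}
    kept = {g: group_rate(records, g, "retained") for g in "ABCD"}
    return {
        "model_effect_with_intervention": hit["A"] - hit["C"],
        "model_effect_without_intervention": hit["B"] - hit["D"],
        "intervention_effect_model": kept["A"] - kept["B"],
        "intervention_effect_random": kept["C"] - kept["D"],
    }


records = [  # toy data, two users per group
    {"group": "A", "hit": 1, "retained": 1}, {"group": "A", "hit": 1, "retained": 0},
    {"group": "B", "hit": 1, "retained": 0}, {"group": "B", "hit": 0, "retained": 0},
    {"group": "C", "hit": 0, "retained": 1}, {"group": "C", "hit": 1, "retained": 0},
    {"group": "D", "hit": 0, "retained": 0}, {"group": "D", "hit": 0, "retained": 0},
]
print(effect_report(records))
```

Positive values in the first two entries support the model; positive values in the last two support the intervention. With real traffic, the group sizes should be large enough for these differences to be statistically meaningful.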

Summary

Putting the whole workflow together gives the following demo:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator


# Training data
# Prediction data
# Prediction result table


# Features that need missing-value handling (lists elided in the original)
missing_value_zero_features = ['lilastactdate' ...]
missing_value_max_features = ['dilastactdate' ...]
# Features to z-score standardize
z_score_features = ['lilastactdate' ...]
# One-hot features
one_hot_features = ['idayact14_loss' ...]
print('args init finish')
   
def get_data():
    # Load the training data
    session = SparkSession \
            .builder \
            .appName("prelost-lr") \
            .getOrCreate()
    train_df = ...


    # Drop rows without a user ID, then fill the remaining missing values
    train_df = train_df.dropna(subset=['sopenid', 'iuin']) \
        .na.fill(0, missing_value_zero_features) \
        .na.fill(9999, missing_value_max_features)

    # Split into training set and test set
    (raw_train_data, raw_test_data) = train_df.randomSplit([0.7, 0.3], seed=43)
    print('train data count is', raw_train_data.count())
    print('test data count is', raw_test_data.count())
    return raw_train_data, raw_test_data


def features_clear():
    # Configure an ML pipeline, which consists of some stages
    # Assemble all z_score_features into one vector and standardize them together
    # Note: StandardScaler defaults to withStd=True, withMean=False (scales to unit std without centering)
    z_score_vector_assember = VectorAssembler(inputCols=z_score_features, outputCol="z_score_features")
    z_score_standard_scaler = StandardScaler(inputCol=z_score_vector_assember.getOutputCol(), outputCol="z_score_features_scaled")


    # Combine all features into the "features" column
    # (no OneHotEncoder stage is applied here, so one_hot_features are used as they are)
    features = [z_score_standard_scaler.getOutputCol()] + one_hot_features
    features_vector_assember = VectorAssembler(inputCols=features, outputCol="features")

    lr = LogisticRegression(maxIter=1000, regParam=0.03, threshold=0.6, labelCol="ilabel", probabilityCol="score")


    model_pipeline = Pipeline(stages=[z_score_vector_assember, z_score_standard_scaler,
                                      features_vector_assember, lr])
    print('features clear pipeline config finish')
    return model_pipeline


def model_train(model_pipeline, raw_train_data):
    # Fit every pipeline stage (assemblers, scaler, LR) on the training data
    model = model_pipeline.fit(raw_train_data)
    print('model train finish')
    return model


def model_evaluator(model, data):
    # Score the data, then report AUC, precision, recall and F1
    result_data = model.transform(data)


    binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='ilabel', metricName='areaUnderROC')
    print("auc is %.6f" % binary_evaluator.evaluate(result_data))


    # Confusion-matrix counts at the configured threshold
    tn = result_data.filter(result_data['ilabel'] == 0).filter(result_data['prediction'] == 0).count()
    fp = result_data.filter(result_data['ilabel'] == 0).filter(result_data['prediction'] == 1).count()
    fn = result_data.filter(result_data['ilabel'] == 1).filter(result_data['prediction'] == 0).count()
    tp = result_data.filter(result_data['ilabel'] == 1).filter(result_data['prediction'] == 1).count()
    print(tn, fp, fn, tp)
    precision = tp*1.0 / (fp+tp)
    recall = tp*1.0 / (fn+tp)
    f1 = 2*precision*recall / (precision+recall)
    print("precision is %.6f" % precision)
    print("recall is %.6f" % recall)
    print("f1 is %.6f" % f1)


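def model_predict():
    # Not called in this demo: lr_model, test_data, TDWUtil, tdw and the tdw_* settings
    # are not defined in this listing and are assumed to come from the elided configuration
    # Score the prediction data and save the result to the TDW result table
    result_data = lr_model.transform(test_data)
    tdw_util = TDWUtil(tdw_user_name, tdw_user_pwd, tdw_db)
    tdw_util.dropPartition(tdw_predict_result_tbl, tdw_predict_result_tbl_pri_partition, level=0)
    tdw_util.createListPartition(tdw_predict_result_tbl, tdw_predict_result_tbl_pri_partition, tdw_predict_result_tbl_pri_date, 0)
    tdw.saveToTable(result_data, tdw_predict_result_tbl, tdw_predict_result_tbl_pri_partition)
    print('save data into tdw finish')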
   


raw_train_data, raw_test_data = get_data()
model_pipeline = features_clear()
model = model_train(model_pipeline, raw_train_data)
print('------------------ train model ------------------')
model_evaluator(model, raw_train_data)
print('------------------ predict data ------------------')
model_evaluator(model, raw_test_data)