Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.4.7
-
None
-
None
Description
- Background:
I have two copies of the same dataset, one on the local filesystem and one on HDFS (read through Spark).
I transformed each copy with the same logic, one with pandas and one with Spark SQL:
- df: read from HDFS, transformed with Spark SQL, then converted from spark.DataFrame to pandas.DataFrame
- df1: read from the local filesystem and transformed with pandas
I fit a BetaGeoFitter model (https://lifetimes.readthedocs.io/en/latest/) on each one. df1 fits fine, but df raises ConvergenceError.
- First: the summary statistics of df and df1 are the same (only the std of T differs, in the last printed digit)
```
In [17]: df.describe()
Out[17]:
frequency recency T monetary_value
count 68878.000000 68878.000000 68878.000000 68878.000000
mean 0.210198 1.364253 69.407097 66.740974
std 1.094161 7.460129 44.604855 351.516145
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 31.000000 0.000000
50% 0.000000 0.000000 64.000000 0.000000
75% 0.000000 0.000000 108.000000 0.000000
max 59.000000 155.000000 157.000000 18975.360000
In [18]: df1.describe()
Out[18]:
frequency recency T monetary_value
count 68878.000000 68878.000000 68878.000000 68878.000000
mean 0.210198 1.364253 69.407097 66.740974
std 1.094161 7.460129 44.604856 351.516145
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 31.000000 0.000000
50% 0.000000 0.000000 64.000000 0.000000
75% 0.000000 0.000000 108.000000 0.000000
max 59.000000 155.000000 157.000000 18975.360000
In [19]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
...: bgf.fit(df1['frequency'], df1['recency'], df1['T'])
Out[19]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
In [20]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
...: bgf.fit(df['frequency'], df['recency'], df['T'])
fun: -0.03513675395757231
hess_inv: array([[ 13.30839758, 17.8546921 , -0.17820442, 0.31872313],
[ 17.8546921 , 73.49152334, -1.06609042, 0.96429223],
[ -0.17820442, -1.06609042, 65.85101032, 67.62388159],
[ 0.31872313, 0.96429223, 67.62388159, 109.01577057]])
jac: array([ 1.17874160e-06, -6.62967570e-07, 1.06154732e-06, 1.56458773e-06])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 130
nit: 29
njev: 117
status: 2
success: False
x: array([-3.59592079, -5.36183489, 0.07652525, -0.4253566 ])
---------------------------------------------------------------------------
ConvergenceError Traceback (most recent call last)
/data/modou/python/clv.py in <module>
1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
----> 2 bgf.fit(df['frequency'], df['recency'], df['T'])
```
- Second, I found that some float values differ slightly between df and df1.
They are close before rounding but differ after rounding:
```python
idx = ~np.isclose(df.round(1)['monetary_value'], df1.round(1)['monetary_value'])
In [71]: np.isclose(df[idx]['monetary_value'], df1[idx]['monetary_value'])
Out[71]:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True])
In [72]: np.isclose(df[idx].round(1)['monetary_value'], df1[idx].round(1)['monetary_value'])
Out[72]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False])
```
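A minimal sketch of how this can happen, using illustrative values rather than rows from the dataset: two floats that sit one representable step on either side of a rounding tie are equal within the default tolerances of `np.isclose`, yet rounding sends them to different decimals. The values `below` and `above` are assumptions chosen to make the effect visible.

```python
import numpy as np

# Two floats straddling the tie at 0.25: one step below and one step above.
# These are illustrative values, not taken from df or df1.
below = np.nextafter(0.25, 0.0)
above = np.nextafter(0.25, 1.0)
values_a = np.array([below])
values_b = np.array([above])

# Before rounding, the values agree within np.isclose's default tolerances.
close_before = np.isclose(values_a, values_b)

# Rounding to one decimal pushes them to opposite sides of the tie.
rounded_a = np.round(values_a, 1)
rounded_b = np.round(values_b, 1)
close_after = np.isclose(rounded_a, rounded_b)

print(close_before, rounded_a, rounded_b, close_after)
```

If the 23 rows behave like this, the difference between df and df1 in `monetary_value` would be last-bit noise from the two computation paths rather than a real data discrepancy.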
The values that differ after rounding:
```
In [67]: df[idx].round(1)['monetary_value']
Out[67]:
11498 426.4
17791 1464.1
18037 1309.1
19800 426.4
22464 134.3
24717 29.7
26202 881.6
26729 426.4
29519 1464.1
35798 1464.1
36034 388.7
39156 1464.1
39566 194.1
39687 426.4
39737 388.7
44185 1464.1
45628 1574.9
48241 4325.3
49841 1464.1
54789 129.5
57159 3289.6
66517 426.4
67991 388.7
Name: monetary_value, dtype: float64
In [68]: df1[idx].round(1)['monetary_value']
Out[68]:
11498 426.5
17791 1464.2
18037 1309.2
19800 426.5
22464 134.2
24717 29.8
26202 881.7
26729 426.5
29519 1464.2
35798 1464.2
36034 388.6
39156 1464.2
39566 194.2
39687 426.5
39737 388.6
44185 1464.2
45628 1574.8
48241 4325.2
49841 1464.2
54789 129.6
57159 3289.7
66517 426.5
67991 388.6
Name: monetary_value, dtype: float64
```
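To see how far apart the unrounded values really are, it may help to print them at full precision and count the representable float64 steps between them. The sketch below uses hypothetical stand-ins for `df[idx]['monetary_value']` and `df1[idx]['monetary_value']`; `ulp_distance` is a helper introduced here for illustration and assumes finite values of the same sign.

```python
import numpy as np
import pandas as pd


def ulp_distance(a, b):
    """Count representable float64 steps between a and b (finite, same sign)."""
    a_bits = np.atleast_1d(np.asarray(a, dtype=np.float64)).view(np.int64)
    b_bits = np.atleast_1d(np.asarray(b, dtype=np.float64)).view(np.int64)
    return np.abs(a_bits - b_bits)


# Hypothetical stand-ins for the Spark-derived and pandas-derived columns.
spark_side = pd.Series([np.nextafter(426.45, 0.0), 134.25])
pandas_side = pd.Series([426.45, 134.25])

report = pd.DataFrame({
    "spark": spark_side.map("{:.17g}".format),    # 17 significant digits
    "pandas": pandas_side.map("{:.17g}".format),  # round-trips a float64
    "ulps_apart": ulp_distance(spark_side, pandas_side),
})
print(report)
```

A distance of one or two steps would point at different floating-point evaluation order in the two pipelines; a large distance would point at a real difference in the transformation logic.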
- Third, I set monetary_value to zero for the idx rows in both df and df1 and fit again.
Fitting df1 still converges:
```
In [88]: df2 = df1.copy()
...: df2.loc[idx, "monetary_value"] = 0
In [89]: df2[idx]
Out[89]:
frequency recency T monetary_value
11498 6.0 16.0 124.0 0.0
17791 1.0 1.0 109.0 0.0
18037 1.0 1.0 109.0 0.0
19800 2.0 3.0 104.0 0.0
22464 6.0 36.0 69.0 0.0
24717 11.0 11.0 93.0 0.0
26202 1.0 12.0 88.0 0.0
26729 2.0 14.0 34.0 0.0
29519 1.0 5.0 79.0 0.0
35798 1.0 1.0 63.0 0.0
36034 1.0 1.0 63.0 0.0
39156 1.0 1.0 54.0 0.0
39566 1.0 2.0 53.0 0.0
39687 2.0 3.0 53.0 0.0
39737 1.0 1.0 53.0 0.0
44185 1.0 6.0 45.0 0.0
45628 1.0 1.0 43.0 0.0
48241 3.0 17.0 39.0 0.0
49841 1.0 2.0 36.0 0.0
54789 3.0 3.0 27.0 0.0
57159 9.0 9.0 22.0 0.0
66517 2.0 2.0 4.0 0.0
67991 1.0 1.0 1.0 0.0
In [90]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
Out[90]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
```
Fitting df still throws ConvergenceError:
```
In [92]: df2 = df.copy()
...: df2.loc[idx, "monetary_value"] = 0
In [93]: df2[idx]
Out[93]:
user_id frequency recency T monetary_value
11498 1515915625531317256 6.0 16.0 124.0 0.0
17791 1515915625538189543 1.0 1.0 109.0 0.0
18037 1515915625538353966 1.0 1.0 109.0 0.0
19800 1515915625539864468 2.0 3.0 104.0 0.0
22464 1515915625542102075 6.0 36.0 69.0 0.0
24717 1515915625545486890 11.0 11.0 93.0 0.0
26202 1515915625547164014 1.0 12.0 88.0 0.0
26729 1515915625547973880 2.0 14.0 34.0 0.0
29519 1515915625561317292 1.0 5.0 79.0 0.0
35798 1515915625569444951 1.0 1.0 63.0 0.0
36034 1515915625569751989 1.0 1.0 63.0 0.0
39156 1515915625573167676 1.0 1.0 54.0 0.0
39566 1515915625573482744 1.0 2.0 53.0 0.0
39687 1515915625573575950 2.0 3.0 53.0 0.0
39737 1515915625573629519 1.0 1.0 53.0 0.0
44185 1515915625592904652 1.0 6.0 45.0 0.0
45628 1515915625593770495 1.0 1.0 43.0 0.0
48241 1515915625595271558 3.0 17.0 39.0 0.0
49841 1515915625596215381 1.0 2.0 36.0 0.0
54789 1515915625599473044 3.0 3.0 27.0 0.0
57159 1515915625601113987 9.0 9.0 22.0 0.0
66517 1515915625609072139 2.0 2.0 4.0 0.0
67991 1515915625610224305 1.0 1.0 1.0 0.0
In [94]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
fun: -0.03513675395757231
hess_inv: array([[ 13.30839758, 17.8546921 , -0.17820442, 0.31872313],
[ 17.8546921 , 73.49152334, -1.06609042, 0.96429223],
[ -0.17820442, -1.06609042, 65.85101032, 67.62388159],
[ 0.31872313, 0.96429223, 67.62388159, 109.01577057]])
jac: array([ 1.17874160e-06, -6.62967570e-07, 1.06154732e-06, 1.56458773e-06])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 130
nit: 29
njev: 117
status: 2
success: False
x: array([-3.59592079, -5.36183489, 0.07652525, -0.4253566 ])
---------------------------------------------------------------------------
ConvergenceError Traceback (most recent call last)
/data/modou/python/clv.py in <module>
1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
----> 2 bgf.fit(df2['frequency'], df2['recency'], df2['T'])
/data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/beta_geo_fitter.py in fit(self, frequency, recency, T, weights, initial_params, verbose, tol, index, **kwargs)
141 verbose,
142 tol,
--> 143 **kwargs
144 )
145
/data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/__init__.py in _fit(self, minimizing_function_args, initial_params, params_size, disp, tol, bounds, **kwargs)
117 """
118 The model did not converge. Try adding a larger penalizer to see if that helps convergence.
--> 119 """
120 )
121 )
ConvergenceError:
The model did not converge. Try adding a larger penalizer to see if that helps convergence.
```
- As a result, df still fails with the error.
There must be something unusual about df (the one transformed with Spark). Why does it still fail even after
setting monetary_value to zero for the idx rows?
I just want to understand what is going on.
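One thing the sessions above do show: every `fit` call passes only `frequency`, `recency`, and `T`, so zeroing `monetary_value` cannot change the outcome. A difference that matters to the model would have to be in those three columns, their dtypes, or the index. The sketch below is one way to check that; `compare_fit_inputs` and the toy frames are hypothetical and only stand in for df and df1.

```python
import numpy as np
import pandas as pd

FIT_COLUMNS = ["frequency", "recency", "T"]  # the only columns passed to fit()


def compare_fit_inputs(left, right, columns=FIT_COLUMNS):
    """Summarise how two frames differ in the columns that reach the model."""
    rows = []
    for name in columns:
        a = left[name].to_numpy()
        b = right[name].to_numpy()
        diff = np.abs(a.astype("float64") - b.astype("float64"))
        rows.append({
            "column": name,
            "left_dtype": str(a.dtype),
            "right_dtype": str(b.dtype),
            "values_equal": bool(np.array_equal(a, b)),
            "max_abs_diff": float(diff.max()),
        })
    summary = pd.DataFrame(rows).set_index("column")
    summary["same_index"] = left.index.equals(right.index)
    return summary


# Toy frames: identical values, but one column has a narrower dtype (assumed).
left = pd.DataFrame({"frequency": [0.0, 1.0], "recency": [0.0, 3.0], "T": [31.0, 64.0]})
right = left.astype({"T": "float32"})
summary = compare_fit_inputs(left, right)
print(summary)
```

Running `compare_fit_inputs(df, df1)` on the real frames would show whether the inputs to the optimizer are actually identical, which `describe()` alone cannot establish.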
- Update
Writing df out to CSV and reading it back makes the fit succeed, which is very strange:
```
In [108]: df["monetary_value"].sum()
Out[108]: 4596984.839164658
In [109]: df1["monetary_value"].sum()
Out[109]: 4596984.8391646575
In [111]: df.to_csv('e.csv', index=False, header=True)
In [112]: x = pd.read_csv('e.csv')
In [113]: x["monetary_value"].sum()
Out[113]: 4596984.8391646575
In [114]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
...: bgf.fit(x['frequency'], x['recency'], x['T'])
...:
Out[114]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
```
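A CSV round trip does more than copy values: `read_csv` re-parses every number as text, assigns default dtypes, and (with `index=False`) replaces the index with a fresh `RangeIndex`. The sketch below lists what changes across the round trip. The frame `original`, with its float32 column and non-default index, is an assumption made for illustration; whether the Spark-derived df actually differs in any of these ways is not known from the report.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Spark-derived df.
original = pd.DataFrame(
    {"frequency": [0.0, 6.0], "recency": [0.0, 16.0], "T": [31.0, 124.0]},
    index=[11498, 17791],
).astype({"T": "float32"})

# Same calls as in the session above, but through an in-memory buffer.
buffer = io.StringIO()
original.to_csv(buffer, index=False, header=True)
buffer.seek(0)
restored = pd.read_csv(buffer)

changes = pd.DataFrame({
    "dtype_before": original.dtypes.astype(str),
    "dtype_after": restored.dtypes.astype(str),
})
print(changes)
print(type(original.index).__name__, "->", type(restored.index).__name__)
```

If the real df shows a dtype or index change here, calling `df.astype("float64").reset_index(drop=True)` before fitting would be a cheaper way to test the same hypothesis than going through a file.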