Choosing a method to search for similar operations

We were faced with the task of identifying groups of clients who have the same investment behavior when performing transactions on organized securities markets.





To solve the problem effectively, we first need to formulate it correctly.





, . «» . – , . , , , .  , !





, :





  • ,





  • ( )





  • (/).





2 , , , , . – !





:





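from matplotlib import pyplot as plt
from mpl_toolkits import mplot3d  # registers the '3d' projection on older Matplotlib versions

# X is expected to be a numeric array: one row per order, three feature columns
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(X[:, 0], X[:, 1], X[:, 2])
plt.show()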
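The plotting snippet above expects a numeric array with three columns. A minimal sketch of how such a matrix might be built is given here, assuming the column names ORDERDATETIME, SECURCODE and OPERATION that appear in the clustering snippets. The toy orders, the helper name encode_features and the encoding itself (seconds since the first order, integer category codes) are assumptions for illustration, not the encoding used in the article.

```python
import pandas as pd

# Hypothetical toy orders; the real data set is not shown in the article.
toy_orders = pd.DataFrame({
    'CLIENTCODE': ['A', 'B', 'C', 'A'],
    'ORDERDATETIME': ['2021-03-01 10:00:00', '2021-03-01 10:00:05',
                      '2021-03-01 15:30:00', '2021-03-02 11:00:00'],
    'SECURCODE': ['SBER', 'SBER', 'GAZP', 'GAZP'],
    'OPERATION': ['B', 'B', 'S', 'B'],
})


def encode_features(orders):
    """Return a float matrix with one row per order and three numeric columns."""
    out = pd.DataFrame(index=orders.index)
    ts = pd.to_datetime(orders['ORDERDATETIME'])
    # Time as seconds elapsed since the earliest order.
    out['ORDERDATETIME'] = (ts - ts.min()).dt.total_seconds()
    # Securities and operations as arbitrary integer codes (an assumption).
    out['SECURCODE'] = orders['SECURCODE'].astype('category').cat.codes
    out['OPERATION'] = orders['OPERATION'].astype('category').cat.codes
    return out.to_numpy(dtype=float)


X_toy = encode_features(toy_orders)
print(X_toy)
```

Because distance-based methods compare all columns on the same scale, the choice of units and codes here directly affects which orders count as close.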
      
      



2 SKlearn ML – KMeans, DBSCAN () BallTree.





DBSCAN

DBSCAN (Density-based spatial clustering of applications with noise), ), , . –  ϵ- . , , , DBSCAN ? .





:





, , , . , , :





import pandas as pd
from sklearn.cluster import DBSCAN

# The three feature columns must already be numeric at this point
X = df[['ORDERDATETIME', 'SECURCODE', 'OPERATION']].values
model = DBSCAN(min_samples=2, eps=0.5).fit(X)
df['LabelsDBS'] = model.labels_
df['ORDERDATETIME'] = pd.to_datetime(df['ORDERDATETIME'])

# One row per distinct (cluster label, client) pair
gr = df.groupby(['LabelsDBS', 'CLIENTCODE']).count()
l = gr.index.to_frame(index=False)
l.columns = ['Ind', 'Code']
# DBSCAN marks noise as -1 and numbers clusters from 0, so keep labels >= 0
l = l.query('Ind >= 0')
# Keep only clusters that contain 2 or 3 distinct clients
a = l.groupby('Ind').count()
a = list(a.query('Code > 1 & Code < 4').index.values)
l = l[l['Ind'].isin(a)]
# Final table: cluster label -> list of client codes
l = pd.DataFrame(l.groupby('Ind')['Code'].apply(list))
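To make the role of eps and min_samples concrete, here is a small self-contained sketch on made-up points (not the article's data). It relies on the documented scikit-learn behaviour that labels_ holds -1 for noise points and 0, 1, ... for clusters; the names X_toy, toy_labels and toy_clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 3-D points: two tight pairs and one isolated point.
X_toy = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 1.0],
    [5.0, 5.2, 1.0],
    [20.0, 20.0, 0.0],
])

# Same parameters as in the snippet above: a point with at least one
# other point within distance 0.5 starts or joins a cluster.
toy_labels = DBSCAN(eps=0.5, min_samples=2).fit(X_toy).labels_
print(toy_labels)

# Group point indices by cluster label, skipping noise (-1).
toy_clusters = {}
for idx, label in enumerate(toy_labels):
    if label != -1:
        toy_clusters.setdefault(int(label), []).append(idx)
print(toy_clusters)
```

The isolated point receives the label -1, which is why filtering on the label is needed before grouping clients by cluster.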
      
      



, DBSCAN , .





KMeans

 k- .  k  – , . , .





  – ,  X. , .





  – , , . , – « k-».





  , , k- n- 1. . .. , , . .





, DBSCAN ( ) :





from sklearn.cluster import KMeans

X = df[['ORDERDATETIME', 'SECURCODE', 'OPERATION']].values
model = KMeans(n_clusters=9900).fit(X)
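Unlike DBSCAN, KMeans needs the number of clusters up front and assigns every point to some cluster, so there is no noise label. A minimal sketch on made-up points follows; n_init and random_state are set only to make the run repeatable and are not taken from the article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 3-D points: two tight pairs and one isolated point.
X_toy = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 1.0],
    [5.0, 5.2, 1.0],
    [20.0, 20.0, 0.0],
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
print(km.labels_)

# Every point gets a label, including the isolated one;
# the cluster sizes show which clusters hold more than one point.
sizes = np.bincount(km.labels_)
print(sizes)
```

The numbering of the clusters is arbitrary, so only the grouping and the cluster sizes are meaningful when looking for similar operations.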
      
      



, , – .





BallTree

ML – . . , , , .





, :





import pandas as pd
from sklearn.neighbors import BallTree

tree = BallTree(X, leaf_size=2)
# k=2: column 0 is normally the point itself, column 1 its closest other point
dist, ind = tree.query(X, k=2)
for i, (nbrs, d) in enumerate(zip(ind, dist)):
    print(f'Y index {i}, closest index X is {nbrs[1]}, dist {d[1]}')

# Map row positions back to client codes
clients = df['CLIENTCODE'].to_numpy()
l = pd.DataFrame({
    'Ind1': clients[ind[:, 0]],
    'Ind2': clients[ind[:, 1]],
    'DIST': dist[:, 1],
})
# Drop exact duplicates (zero distance) and pairs belonging to the same client
l = l.query('DIST>0 & Ind1 != Ind2')
l
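The query with k=2 returns only the single closest neighbour of each order. If all neighbours within a given distance are wanted, BallTree.query_radius can be used instead. The sketch below runs on made-up points; the radius 0.5 and the names toy_tree and close_pairs are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Made-up 3-D points: two tight pairs and one isolated point.
X_toy = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 1.0],
    [5.0, 5.2, 1.0],
    [20.0, 20.0, 0.0],
])

toy_tree = BallTree(X_toy, leaf_size=2)
# For each point: indices of all points within distance 0.5,
# the point itself included.
neighbours = toy_tree.query_radius(X_toy, r=0.5)

# Collect unordered pairs of distinct points that lie within the radius.
close_pairs = sorted({(min(i, int(j)), max(i, int(j)))
                      for i, nbrs in enumerate(neighbours)
                      for j in nbrs if i != int(j)})
print(close_pairs)
```

The resulting index pairs can be mapped back to client codes in the same way as in the snippet above.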
      
      



, , , – , , , .





Based on the work carried out, we can conclude that although the methods for finding similar client operations differ and rely on different mathematics, they lead to approximately the same result. This means that any of them can be used to narrow the search for similar operations. What remains is the choice of the more convenient tool for a specific task: sacrifice speed for the sake of convenience, or work out for yourself how to apply the distances found. You decide!







