Looking for "Trolls": Shingle Algorithm & Cosine Similarity



I suspect many participants in tense Internet discussions have been accused of being bots, trolls, or people paid by the Kremlin, Kiev, or Washington. But how do you actually identify such accounts, as opposed to people who are simply pushing their opinion very actively on everyone else?



Take, for example, four texts (comments):



text_1 = '…'
text_2 = '…'
text_3 = '…'
text_4 = '…'

[the four Russian example comments did not survive extraction]


From a human point of view, the first and second comments are clearly similar: they are written about the same person and describe his positive qualities. And they have nothing in common with texts three and four.



How do we arrive at the same conclusion mathematically?

The naive approach is to count identical words in the two texts. But a problem arises immediately: how do we match different grammatical forms of the same word (say, "smart" and "smartest")? Lemmatizing the text is not always an option, since ready-made libraries do not support many languages.



In such cases it is justified to use the shingle algorithm, which splits the text into overlapping fragments of n characters, where n is chosen empirically for the specific task.



Example:



Text 1 for n = 3
[all overlapping 3-character substrings of text_1; the Russian shingles did not survive extraction]

Text 1 for n = 5
[all overlapping 5-character substrings of text_1; the Russian shingles did not survive extraction]
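Since the Russian shingles above did not survive, here is a minimal sketch of the splitting itself, on a made-up English string:

def shingles(text: str, n: int) -> list[str]:
    # All overlapping character n-grams of the text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(shingles("a kind man", 3))
# ['a k', ' ki', 'kin', 'ind', 'nd ', 'd m', ' ma', 'man']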




Let's calculate the similarity of the first and second texts, and of the first and fourth, for n from 1 to 9, using the formula:



sim = len(set(Shingles_1) & set(Shingles_2)) / len(set(Shingles_1) | set(Shingles_2))
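This is the Jaccard similarity coefficient of the two shingle sets. As a quick sanity check, using the toy shingles function sketched above (the strings are again made up, not the article's texts):

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard(shingles("a kind man", 3), shingles("a kind woman", 3)))  # 0.5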


ShingleSize   sim(text 1, text 2)   sim(text 1, text 4)
1             0.9167                0.7857
2             0.5196                0.4035
3             0.3404                0.2375
4             0.2821                0.1850
5             0.2289                0.1381
6             0.1897                0.1075
7             0.1573                0.0789
8             0.1333                0.0573
9             0.1099                0.0415


This is the expected picture: as shingle length grows, the total number of distinct shingles grows, and the share of shared shingles in that total falls. So for large texts it is better to use shingle lengths of 5 to 8, while for short ones, such as tweets or comments, 2 to 4 works better.



But back to practice. Let's take pre-collected data from the popular Russian entertainment portal Pikabu. The dataset is available on Kaggle.



To capture the most heated discussions, the Politics tag (section) was chosen. In total, the collection contains:



  • 944 posts tagged "politics"
  • 267,000 comments on them
  • of which roughly 140,000 are longer than 100 characters


Moving on to the code:



Cleaning the text and splitting it into shingles:



import re

def clean_text(text):
    # Drop empty lines and re-join the comment into a single line
    text = ' '.join(filter(None, text.split('\n')))
    # Strip URLs and punctuation
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r'[^\w\s]', '', text)
    # Split into overlapping character shingles of length ShingleSize
    shingle = [text[i:i + ShingleSize] for i in range(len(text) - ShingleSize + 1)]
    return ','.join(shingle)
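For instance, with ShingleSize = 3 and a made-up input:

ShingleSize = 3
print(clean_text("Check this out!\nhttp://example.com"))
# -> 'Che,hec,eck,ck ,k t, th,thi,his,is ,s o, ou,out,ut '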


If we take only the comments longer than 100 characters, the number of pairwise comparisons is C(140000, 2) = 140000! / ((140000 − 2)! · 2!) ≈ 9.8 billion, which is a lot: even multi-threaded processing would take quite a long time.
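A quick check of that count:

from math import comb

print(comb(140_000, 2))  # 9799930000, about 9.8 billion pairs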



Therefore, we will compare not pair by pair, but whole m × n matrices using cosine similarity, where m is the number of rows in a chunk, chosen to fit the available RAM, and n is the number of columns, i.e. the total number of distinct shingles.
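The construction of sparse_matrix is not shown in the article; here is a minimal sketch of one way to build it with scikit-learn, splitting on the commas produced by clean_text (CountVectorizer is my assumption; the author may have used a different vectorizer):

from sklearn.feature_extraction.text import CountVectorizer

# shingled_corpus: list of comma-joined shingle strings returned by clean_text()
vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split(','), lowercase=False)
sparse_matrix = vectorizer.fit_transform(shingled_corpus)
# rows m = comments, columns n = distinct shingles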



Algorithm implementation:



import itertools

csrMatrix = []
idArray = []
textArray = []
# Slice the sparse matrix (and the parallel id/text lists) into row chunks of ChunkSize
for i in range(ChunkSize, sparse_matrix.shape[0] + ChunkSize, ChunkSize):
    csrMatrix.append(sparse_matrix[i - ChunkSize:i])
    idArray.append(corpusId[i - ChunkSize:i])
    textArray.append(OriginalCorpus[i - ChunkSize:i])

# All unordered pairs of chunks, including each chunk with itself
matrixCombinations = itertools.combinations_with_replacement(range(len(csrMatrix)), 2)


With m = 20,000 and shingle length 8, we get 7 matrices of size 20,000 × ≈8,800,000, and hence 28 comparison iterations (21 pairs of distinct chunks plus 7 chunks compared with themselves). Much better already.
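A quick check of the chunk arithmetic:

import math

chunks = math.ceil(140_000 / 20_000)  # 7 chunks of up to 20,000 rows
print(math.comb(chunks + 1, 2))       # 28 pairs from combinations_with_replacement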



import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def Sim(A, B, C, D):
    # A: a pair of chunk indices; B: the csr chunks; C: the text chunks; D: the id chunks
    similarities = cosine_similarity(B[A[0]].astype(np.float32), B[A[1]].astype(np.float32))
    # Indices of all pairs above the similarity threshold
    x, y = np.where(similarities > similarityPercent)
    res = []
    for k, j in zip(x, y):
        if D[A[0]][k] != D[A[1]][j]:  # skip a comment matched with itself
            res.append((D[A[0]][k], C[A[0]][k], similarities[k][j].item(), D[A[1]][j], C[A[1]][j]))
    return res

s = pool.starmap(Sim, zip(matrixCombinations, itertools.repeat(csrMatrix), itertools.repeat(textArray), itertools.repeat(idArray)))
s = [item for sublist in s for item in sublist]  # flatten into one list of matches
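pool is not defined in the snippet above; presumably it is a multiprocessing.Pool, e.g.:

from multiprocessing import Pool

# One worker per CPU core; starmap distributes the chunk pairs across workers
pool = Pool()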


To reduce memory usage, we also convert to the float32 data type; its precision is quite sufficient for the algorithm to work correctly.



The results are written to a MySQL database for further processing.
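The article does not show the insert step itself; here is a minimal sketch with pymysql, where the connection parameters and the similarity table's column names are assumptions inferred from the query below:

import pymysql

# Hypothetical connection parameters; each tuple in s comes from Sim():
# (FirstId, FirstText, sim, SecondId, SecondText)
conn = pymysql.connect(host='localhost', user='user', password='***', database='pikabu')
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO similarity (FirstId, FirstText, sim, SecondId, SecondText) "
        "VALUES (%s, %s, %s, %s, %s)",
        s,
    )
conn.commit()

Now, for example, let's group the comments, summing each one's similarity over all of its pairs: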



SELECT FirstText, SUM(sim) AS c FROM pikabu.similarity GROUP BY FirstId, FirstText ORDER BY c DESC


Top matches:



  1. Comment deleted. Reason: this account has been deleted.
  2. Comment deleted. Reason: insulting users.
  3. Comment deleted. Reason: insults, rude communication, and provocation.
  4. Comment deleted. Reason: comments whose sole purpose is to stir up hostility, as well as comments calling for violence or harassment, are forbidden.


A typical situation for political discussions on the Internet.



Discarding links, messages from the administration, and other noise, we move on to analyzing the actual content of the comments:



Eight matches
DaimosShip, 2020-08-14 23:03:48
DaimosShip, 2020-08-14 23:05:22
DaimosShip, 2020-08-14 23:05:26
DaimosShip, 2020-08-14 23:05:41
DaimosShip, 2020-08-14 23:05:52
DaimosShip, 2020-08-14 23:06:18
DaimosShip, 2020-08-14 23:06:53
DaimosShip, 2020-08-14 23:07:02

[eight identical Russian messages; the comment text did not survive extraction]



Within five minutes, one person posted eight identical messages in the same thread about the events in Belarus. In total, 260 messages by this author ended up in the database; a quick look through them shows an obviously negative attitude to the situation in Belarus, and the user himself gets in first, calling his opponents bots by way of argument.



And his brother was an observer at the elections:



DaimosShip, 2020-08-18 22:52:41

My brother was an observer - they were not allowed anywhere




6 more matches
NoisePanzer, 2017-11-20 14:58:26
NoisePanzer, 2017-11-20 15:26:55
NoisePanzer, 2017-11-20 15:33:26
NoisePanzer, 2017-11-21 03:51:46
NoisePanzer, 2017-11-21 03:52:14
NoisePanzer, 2017-11-21 03:53:22

[six near-identical Russian comments; the text did not survive extraction]



The author is clearly a fan of German historical literature.



And 4 more
Kumuj, 2018-03-25 01:42:29
Kumuj, 2018-03-25 01:46:10
Kumuj, 2018-03-25 01:46:26
Kumuj, 2018-03-25 01:53:56

[four identical Russian comments; the text did not survive extraction]



This user is himself looking for traces of a troll factory. Greetings, colleague.



Moving on: six people quoting the same book in different threads. Warning: obscene language!
Strannik196, 2018-03-21 23:53:00
Fynjif18, 2020-09-09 13:44:56
wakeonlan, 2020-06-23 01:38:29
KKirill1992, 2017-06-18 00:06:30
nezabuddha, 2018-11-01 15:29:56
ihateyou, 2016-09-19 02:52:14

[six Russian comments quoting the same passage from a book whose Russian title ends in "-22"; one links to ru.m.wikipedia.org/wiki/-22, another also mentions «Empire V»; the quoted text did not survive extraction]



This author keeps trying to bring everyone else data on the growing level of support for the USSR:
EtovamneTo, 2020-08-18 00:50:15
EtovamneTo, 2020-08-18 01:50:22
EtovamneTo, 2020-08-27 23:22:35
EtovamneTo, 2020-09-09 19:00:42
EtovamneTo, 2020-09-10 03:32:13

[five near-identical Russian comments citing a poll of 18-24-year-olds comparing 2008 and 2019, each ending with the link https://www.levada.ru/2019/06/24/chernovik/]



The examples above were lying right on the surface. To uncover deeper connections, the dataset would need to grow by an order of magnitude, not only in volume but also in the time spans it covers. That would make it possible to identify more precisely the key topics around which discussions form and their initiators, and to analyze the timestamps of events and how they correlate.

But even on a set like this one, careful manual analysis, for example of comments sharing a single link, can turn up something interesting and provide food for thought and further work.



GitHub



P.S. Processing the 140,000 × ≈8,800,000 matrix took about 7 minutes on a Ryzen 5 1600 CPU.



I plan to continue this topic in the future and will be glad to hear criticism and suggestions.


