Decryption of text using frequency analysis

Hello, Habr! In this article I will show how to perform a frequency analysis of modern Russian Internet language and use it to decrypt a text. If you're interested, read on!






Frequency analysis of the Russian Internet language



As a source of large amounts of text in modern Internet language I used the social network Vkontakte, or more precisely the comments on posts in its communities. I chose the Real Football community. To collect the comments, I used the Vkontakte API:



from time import sleep

import requests

# group_id, token, version and write_txt() are defined elsewhere in the full script.

def get_all_post_id():
    offset = 0
    arr_posts_id = []
    while True:
        sleep(1)  # pause between requests
        r = requests.get('https://api.vk.com/method/wall.get',
                         params={'owner_id': group_id, 'count': 100,
                                 'offset': offset, 'access_token': token,
                                 'v': version})
        # Iterate over the items actually returned: a page may hold fewer than 100
        for item in r.json()['response']['items']:
            arr_posts_id.append(item['id'])

        if offset > 20000:
            break
        offset += 100
    return arr_posts_id

def get_all_comments(arr_posts_id):
    offset = 0  # the offset never changes, so only the first 100 comments per post are fetched
    for post_id in arr_posts_id:
        r = requests.get('https://api.vk.com/method/wall.getComments',
                         params={'owner_id': group_id, 'post_id': post_id,
                                 'count': 100, 'offset': offset,
                                 'access_token': token, 'v': version})
        for item in r.json()['response']['items']:
            write_txt('comments.txt', item['text'])
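
The `offset` parameter can also be used to page through every comment of a post rather than only the first 100. Below is a minimal sketch of that paging logic, kept independent of the network so it can be run as is. The names `iter_all_items`, `fetch_page` and `fake_fetch` are hypothetical and not part of the original script; `fetch_page` stands in for a single API request.

```python
from typing import Callable, Iterator, List


def iter_all_items(fetch_page: Callable[[int, int], List[dict]],
                   page_size: int = 100) -> Iterator[dict]:
    """Yield every item by advancing the offset until a page comes back short.

    fetch_page(offset, count) is a hypothetical callable that performs one
    request (for example to wall.getComments) and returns its list of items.
    """
    offset = 0
    while True:
        items = fetch_page(offset, page_size)
        yield from items
        if len(items) < page_size:  # a short page means the last page was reached
            break
        offset += page_size


# Stand-in for the real API: 250 fake comments served page by page.
fake_comments = [{'text': f'comment {n}'} for n in range(250)]


def fake_fetch(offset: int, count: int) -> List[dict]:
    return fake_comments[offset:offset + count]


texts = [item['text'] for item in iter_all_items(fake_fetch)]
print(len(texts))
```

In a real run, `fetch_page` would wrap the `requests.get` call and return `r.json()['response']['items']`, with a pause between requests.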


The result was about 200 MB of text. Now we count how many times each character occurs:



from collections import Counter

# arr_letters (the list of alphabet letters) is defined elsewhere in the full script.
with open('comments.txt') as f:
    counter = Counter(f.read().lower())

def count_letters():
    # Total number of alphabet letters in the corpus
    return sum(counter[letter] for letter in arr_letters)

def frequency(count):
    # Share of each letter in percent, in the order of arr_letters
    return [counter[letter] / count * 100 for letter in arr_letters]
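
To try the same calculation on a short string, here is a small self-contained sketch. The function `letter_frequencies`, the sample phrase and the alphabet string are assumptions made for illustration; they do not come from the original script.

```python
from collections import Counter


def letter_frequencies(text: str, alphabet: str) -> dict:
    """Return {letter: percentage} for letters of `alphabet`, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: count / total * 100 for letter, count in counts.most_common()}


sample = 'мама мыла раму'  # illustrative sample, not real corpus data
russian = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
freqs = letter_frequencies(sample, russian)
for letter, percent in freqs.items():
    print(f'{letter}: {percent:.1f}%')
```

Spaces and punctuation are excluded from the total here, so the percentages add up to 100 over the letters only.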


The results obtained can be compared with the results from Wikipedia and displayed as:



1) comparison chart



frequency_0



2) tables (left: Wikipedia data, right: my data)



frequency_1











Screenshot from 2020-07-27 20-07-06






To test the approach, a text is first encrypted with a Caesar cipher:



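def caesar_cipher():
    with open("text.txt") as file:
        text_for_encrypt = file.read().lower().replace(',', '')
    # 33-letter Russian alphabet (standard order assumed)
    letters = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
    arr = []
    step = 3
    for i in text_for_encrypt:
        if i == ' ':
            arr.append(' ')
        elif i in letters:
            # Shift the letter by `step` positions, wrapping around the alphabet
            arr.append(letters[(letters.find(i) + step) % len(letters)])
        # any other character (punctuation, digits, line breaks) is dropped
    text_for_decrypt = ''.join(arr)
    return text_for_decrypt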
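
A Caesar cipher is reversed by shifting in the opposite direction. The sketch below works on strings rather than files to make that round trip easy to check. The function `caesar_shift` and the alphabet constant are assumptions introduced for illustration; unlike a version that drops punctuation, this one passes characters outside the alphabet through unchanged.

```python
# Standard 33-letter Russian alphabet (assumed ordering)
RUSSIAN = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'


def caesar_shift(text: str, step: int, alphabet: str = RUSSIAN) -> str:
    """Shift every alphabet letter by `step`; leave other characters unchanged."""
    result = []
    for ch in text:
        pos = alphabet.find(ch)
        if pos == -1:
            result.append(ch)  # not a letter of the alphabet
        else:
            result.append(alphabet[(pos + step) % len(alphabet)])
    return ''.join(result)


encrypted = caesar_shift('привет хабр', 3)
decrypted = caesar_shift(encrypted, -3)
print(encrypted)
print(decrypted)
```

Python's `%` returns a non-negative result for a positive modulus, so the same function handles negative steps without extra code.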




The encrypted text is then decrypted by substituting characters:



def decrypt_text(text_for_decrypt, arr_decrypt_letters):
    arr_encrypt_text = []
    arr_encrypt_letters = [' ', '', '', '', '', '', '', '',
                           '', '', '', '', '', '', '', '',
                           '', '', '', '', '', '', '', '',
                           '', '', '', '', '', '', '',
                           '', '', '']
    # Pair each character of arr_decrypt_letters with the character at the same position
    dictionary = dict(zip(arr_decrypt_letters, arr_encrypt_letters))
    for i in text_for_decrypt:
        # Characters missing from the mapping are kept as-is instead of yielding None
        arr_encrypt_text.append(dictionary.get(i, i))
    text_for_decrypt = ''.join(arr_encrypt_text)
    print(text_for_decrypt)






If you look at the decrypted text, you can guess where the algorithm went wrong: fights → does, vadio → radio, thenho → addition, overwhelm → people. In this way the entire text can be deciphered, or at least its meaning can be grasped. Note that this method is effective only for long texts encrypted with simple symmetric substitution ciphers, such as the Caesar cipher. The complete code is available on GitHub.



