Sankey diagram in Python

I do data analytics at Aliradar. We are not represented on Habr, but I have accumulated material that I would like to share. I was motivated to write this article by the lack of suitable Russian-language guides on building a Sankey diagram in Python.

In my work I often deal with tasks related to checking the consistency and completeness of data, as well as visualization. One such task, which I solved fairly recently, was to visualize the actions of users of our mobile application. We needed to understand which scenarios of working with the application exist and to take a closer look at user actions at each step, in order to further improve the stability of the application.

Since we have a lot of users, analyzing the actions of each one is difficult and expensive. So we decided to visualize user events with a Sankey diagram.

Looking ahead, I will show what we get in the end. I used Python, pandas and plotly to prepare the data and build the diagram. I hope this article will be useful for data analysts; the code can be run in Colab or taken from the repository on GitHub.

And now let's take it step by step.

What is it?

The first publication of this diagram appeared in 1898. Its creator, Matthew H. Sankey, used it to compare a steam engine with an engine without energy losses.

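A Sankey diagram shows flows between nodes, and the width of each link is proportional to the size of the flow. For user events, a simplified scheme looks like this: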

Simplified scheme of a Sankey diagram

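there is an event, event_1, from which the user's path starts. This is the source event (source);
the "flow" of users from event_1 (source) splits between event_1, event_2 and event_3, which happen at the next step (step_1) and are the target events (target). Each link therefore connects a source to a target, forming source-target pairs;
at step_2 the events event_1, event_2 and event_3 act as sources in turn; for example, part of the flow from event_3 goes to event_4;
every link is described by a source and a target. The start of the link is the source, its end is the target, and the target of one step becomes the source of the next.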
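The scheme above can be sketched in plain Python. This is only an illustration, not the article's code: the event names and the helper name to_pairs are made up for the example.

```python
def to_pairs(events):
    """Turn one user's ordered events into (step, source, target) triples."""
    # Pair each event with the one that follows it; the last event has no target.
    return [(step, source, target)
            for step, (source, target) in enumerate(zip(events, events[1:]), start=1)]


# Hypothetical path of a single user
path = ['event_1', 'event_3', 'event_4']
pairs = to_pairs(path)
print(pairs)
```

A path of three events gives two links, because the last event has nothing to flow into.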

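So to build the diagram we need a table in which every row has a source and a target. The raw data has no source and target columns yet, so we load it first and then build them.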

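import numpy as np
import pandas as pd
import requests
import plotly.graph_objects as go
from tqdm import tqdm

PATH_TO_CSV = 'https://raw.githubusercontent.com/rusantsovsv/senkey_tutorial/main/csv/senkey_data_tutorial.csv'
# load the data and show the first 5 rows
table = pd.read_csv(PATH_TO_CSV)
table.head()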

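The first 5 rows show the columns of the table:

user_id - the user's id;
event_timestamp - the time of the event;
event_name - the name of the event.

To build source-target pairs, we use a helper function.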

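def add_features(df):
    
    """Builds source-target pairs and step numbers for each user.

    Args:
        df (pd.DataFrame): source table with user events.
    Returns:
        pd.DataFrame: table with the step, source and target columns added.
    """
    
    # sort by user id and event time
    sorted_df = df.sort_values(by=['user_id', 'event_timestamp']).copy()
    # number the steps within each user
    sorted_df['step'] = sorted_df.groupby('user_id').cumcount() + 1
    
    # the source is the event itself;
    # the target is the same user's next event
    sorted_df['source'] = sorted_df['event_name']
    # shift within each user to get the next event (NaN for the last one)
    sorted_df['target'] = sorted_df.groupby('user_id')['source'].shift(-1)
    
    # event_name is now duplicated by source, so drop it
    return sorted_df.drop(['event_name'], axis=1)
  
# transform the table
table = add_features(table)
table.head()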

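The first 5 rows after the transformation show that:

the rows are sorted by user id and event time;
source - target pairs have been added;
a step number has been added;
the event_name column has been dropped, since source now duplicates it.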
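To see what the transformation does, here is a minimal sketch on a tiny hand-made table. The users, timestamps and event names are invented for illustration and are not taken from the article's dataset.

```python
import pandas as pd

# Hypothetical events of two users (not from the article's dataset)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'event_timestamp': [10, 20, 30, 5, 15],
    'event_name': ['app_opened', 'item_opened', 'chart_click',
                   'app_opened', 'search'],
})

toy = toy.sort_values(['user_id', 'event_timestamp'])
# Step number: position of the event within the user's own sequence
toy['step'] = toy.groupby('user_id').cumcount() + 1
# Target: the same user's next event; each user's last event gets NaN
toy['target'] = toy.groupby('user_id')['event_name'].shift(-1)
print(toy)
```

Grouping by user_id before the shift matters: without it, the last event of one user would get the first event of the next user as its target.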

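A path can contain many steps, and a diagram with all of them would be hard to read. To keep it readable we limit the number of steps; here we keep the first 7.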

# keep only the source-target pairs whose step is at most 7
# and reset the index of the resulting table
df_comp = table[table['step'] <= 7].copy().reset_index(drop=True)

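Indexes for source

Plotly refers to nodes by numeric index, so we need indexes for every source. The target of one step is the source of the next step, so it is enough to index the sources.

The same event can occur at different steps, and on the diagram it must be a separate node at each step, so every source gets its own index per step. Indexes start at 0 and grow continuously from step to step.

The result is a dictionary keyed by step number. From it we can look up indexes both for a source and for a target.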
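The indexing idea can be sketched in a few lines. This is a simplified illustration, not the function used in the article: index_sources and the event names are hypothetical, and the input is assumed to be an already prepared mapping from step number to the unique events of that step.

```python
from itertools import count


def index_sources(events_by_step):
    """Give every (step, event) combination its own consecutive node index."""
    counter = count()  # yields 0, 1, 2, ...
    return {step: {name: next(counter) for name in names}
            for step, names in events_by_step.items()}


# Hypothetical unique events seen at each step
steps = {1: ['app_opened', 'search'], 2: ['search', 'item_opened']}
indexes = index_sources(steps)
print(indexes)
```

Note that 'search' gets one index at step 1 and another at step 2, so it is drawn as two separate nodes.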

Generating indexes for source

def get_source_index(df):
    
    """Builds indexes for source at every step.

    Args:
        df (pd.DataFrame): table with the step, source, target columns.
    Returns:
        dict: dictionary keyed by step number with the names and indexes of the sources.
    """
    
    res_dict = {}
    
    count = 0
    # iterate over the steps
    for no, step in enumerate(df['step'].unique().tolist()):
        # collect the unique sources of the step and give each a running index
        res_dict[no+1] = {}
        res_dict[no+1]['sources'] = df[df['step'] == step]['source'].unique().tolist()
        res_dict[no+1]['sources_index'] = []
        for i in range(len(res_dict[no+1]['sources'])):
            res_dict[no+1]['sources_index'].append(count)
            count += 1
            
    # map source names to their indexes
    for key in res_dict:
        res_dict[key]['sources_dict'] = {}
        for name, no in zip(res_dict[key]['sources'], res_dict[key]['sources_index']):
            res_dict[key]['sources_dict'][name] = no
    return res_dict
  

# build the indexes
source_indexes = get_source_index(df_comp)

sources
 ['history_opened', 'app_opened_from_market', 'sales_category_selected', 'favorites_opened', 'item_opened', 'app_opened_via_icon', 'market_opened_without_referral', 'price_history_opened', 'search_tab_opened', 'seller_info_opened', 'item_loaded_from_store', 'marketApp_opened', 'chart_click', 'item_opened_from_history', 'similar_tab_opened', 'reviews_tab_opened', 'app_remove', 'similar_item_opened', 'marketApp_opened_from_item', 'sales_item_opened_from_main', 'auth_opened', 'search_request_entered', 'item_info_click', 'sales_opened', 'settings_opened', 'similars_not_fetched_from_server', 'auth_user_succeeded', 'search_results_loaded'] 

sources_index
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47] 

sources_dict
 {'history_opened': 20, 'app_opened_from_market': 21, 'sales_category_selected': 22, 'favorites_opened': 23, 'item_opened': 24, 'app_opened_via_icon': 25, 'market_opened_without_referral': 26, 'price_history_opened': 27, 'search_tab_opened': 28, 'seller_info_opened': 29, 'item_loaded_from_store': 30, 'marketApp_opened': 31, 'chart_click': 32, 'item_opened_from_history': 33, 'similar_tab_opened': 34, 'reviews_tab_opened': 35, 'app_remove': 36, 'similar_item_opened': 37, 'marketApp_opened_from_item': 38, 'sales_item_opened_from_main': 39, 'auth_opened': 40, 'search_request_entered': 41, 'item_info_click': 42, 'sales_opened': 43, 'settings_opened': 44, 'similars_not_fetched_from_server': 45, 'auth_user_succeeded': 46, 'search_results_loaded': 47}

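Colors for source

To make the diagram easier to read, source-target pairs need colors. There are 2 options: random colors or a predefined palette.

Colors are given in RGBA. The node itself is opaque, while the link from source to target uses the color of the source with lower opacity.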
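A minimal sketch of deriving a translucent link color from an opaque node color. The helper name with_alpha is hypothetical; lists_for_plot in this article uses a plain str.replace instead, which works as long as the color string ends with ', 1)'.

```python
import re


def with_alpha(rgba, alpha):
    """Return the same 'rgba(r, g, b, a)' color with a different alpha."""
    match = re.fullmatch(r'rgba\((\d+),\s*(\d+),\s*(\d+),\s*[\d.]+\)', rgba.strip())
    if match is None:
        raise ValueError(f'not an rgba string: {rgba!r}')
    r, g, b = match.groups()
    return f'rgba({r}, {g}, {b}, {alpha})'


node_color = 'rgba(31, 119, 180, 1)'
link_color = with_alpha(node_color, 0.2)
print(link_color)
```

Parsing the string costs a few more lines than a replace, but it does not depend on the exact spacing or on the original alpha value.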

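We store the colors in a dictionary of the form source:color. Random colors do not always look good together, so there is also a predefined palette stored in a JSON file. To use the palette, call colors_for_sources with mode='custom' ('random' generates random colors).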

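def generate_random_color():
    
    """Generates a random color in rgba format.

    Args:
        
    Returns:
        str: string with the generated color.
    """
    
    # the upper bound is exclusive, so 256 covers the full 0..255 range
    r, g, b = np.random.randint(0, 256, size=3)
    return f'rgba({r}, {g}, {b}, 1)'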

Building the source: color dictionary

def colors_for_sources(mode):
    
    """Builds a dictionary of rgba colors for the sources.

    Args:
        mode (str): 'random' for randomly generated colors, 'custom' for
                    the predefined palette.
    Returns:
        dict: dictionary that maps each source to its color.
    """
    # dictionary that will map every source to a color
    colors_dict = {}
    
    if mode == 'random':
        # generate a random color for every source
        for label in df_comp['source'].unique():
            colors_dict[label] = generate_random_color()
            
    elif mode == 'custom':
        # load the predefined palette
        colors = requests.get('https://raw.githubusercontent.com/rusantsovsv/senkey_tutorial/main/json/colors_senkey.json').json()
        for no, label in enumerate(df_comp['source'].unique()):
            colors_dict[label] = colors['custom_colors'][no]
            
    return colors_dict
  
  
# build the color dictionary from the predefined palette
colors_dict = colors_for_sources(mode='custom')

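Now we prepare the data for Plotly. To draw the diagram we need the following lists:

sources - list of source indexes;
targets - list of target indexes;
values - number of users for each source-target pair (the "volume" of the flow);
labels - names of the nodes;
colors_labels - colors of the nodes;
link_color - colors of the links;
link_text - hover text of the links.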
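Because these lists are matched to each other by position, a quick consistency check can catch mistakes before plotting. The sketch below is an assumption about what is worth checking, not part of the article's code; check_sankey_lists and the toy data are hypothetical.

```python
def check_sankey_lists(data):
    """Raise ValueError if the lists cannot describe one consistent diagram."""
    n_links = len(data['sources'])
    # every per-link list must have exactly one item per link
    for key in ('targets', 'values', 'link_color', 'link_text'):
        if len(data[key]) != n_links:
            raise ValueError(f'{key} must have {n_links} items')
    # every node needs both a label and a color
    n_nodes = len(data['labels'])
    if len(data['colors_labels']) != n_nodes:
        raise ValueError('colors_labels must match labels in length')
    # sources and targets are positions in the labels list
    for idx in data['sources'] + data['targets']:
        if not 0 <= idx < n_nodes:
            raise ValueError(f'node index {idx} has no label')
    return True


# Hypothetical diagram with three nodes and two links
toy = {
    'sources': [0, 0],
    'targets': [1, 2],
    'values': [30, 12],
    'labels': ['app_opened', 'search', 'item_opened'],
    'colors_labels': ['rgba(31, 119, 180, 1)', 'rgba(255, 127, 14, 1)',
                      'rgba(44, 160, 44, 1)'],
    'link_color': ['rgba(31, 119, 180, 0.2)', 'rgba(31, 119, 180, 0.2)'],
    'link_text': ['71.4%', '28.6%'],
}
print(check_sankey_lists(toy))
```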

This takes 2 functions:

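def percent_users(sources, targets, values):
    
    """
    Computes the share of users (unique ids) on each link relative to its source (used for the hover text of the links)
    
    Args:
        sources (list): list of source indexes.
        targets (list): list of target indexes.
        values (list): list with the "volume" of each flow.
        
    Returns:
        list: list with the "volume" of each flow as a percentage
    """
    
    # combine the lists so that each link is one tuple
    zip_lists = list(zip(sources, targets, values))
    
    new_list = []
    
    # dictionary with the total "volume" leaving each source
    unique_dict = {}
    
    # sum the values of all links that start at the same source
    for source, target, value in zip_lists:
        if source not in unique_dict:
            # first time we see this source: sum all of its links
            unique_dict[source] = 0
            for sr, tg, vl in zip_lists:
                if sr == source:
                    unique_dict[source] += vl
                    
    # compute the percentages
    for source, target, value in zip_lists:
        new_list.append(round(100 * value / unique_dict[source], 1))
    
    return new_list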
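The per-source totals in percent_users can also be accumulated in a single pass. The following is a minimal sketch of that idea; link_shares is a hypothetical name and is not part of the article's code.

```python
from collections import defaultdict


def link_shares(sources, values):
    """Share of each link in the total flow leaving its source, in percent."""
    totals = defaultdict(int)
    # first pass: total volume leaving every source
    for source, value in zip(sources, values):
        totals[source] += value
    # second pass: each link relative to the total of its own source
    return [round(100 * value / totals[source], 1)
            for source, value in zip(sources, values)]


shares = link_shares([0, 0, 1], [30, 10, 5])
print(shares)
```

The targets are not needed for this calculation, since the share depends only on where the link starts.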

def lists_for_plot(source_indexes=source_indexes, colors=colors_dict, frac=10):
    
    """
    Builds the lists needed to draw the diagram
    and returns them as a dictionary
    
    Args:
        source_indexes (dict): dictionary with the names and indexes of source.
        colors (dict): dictionary with the color of each source.
        frac (int): minimum "volume" of a flow; smaller links are dropped.
        
    Returns:
        dict: dictionary with the lists needed for the diagram.
    """
    
    sources = []
    targets = []
    values = []
    labels = []
    link_color = []
    link_text = []

    # iterate over the steps
    for step in tqdm(sorted(df_comp['step'].unique()), desc='Steps'):
        if step + 1 not in source_indexes:
            continue

        # sources of the current step
        temp_dict_source = source_indexes[step]['sources_dict']

        # targets: the sources of the next step
        temp_dict_target = source_indexes[step+1]['sources_dict']

        # go through every possible pair and count how often it occurs
        for source, index_source in tqdm(temp_dict_source.items()):
            for target, index_target in temp_dict_target.items():
                # count the rows (one per user id) with this pair at this step
                temp_df = df_comp[(df_comp['step'] == step)&(df_comp['source'] == source)&(df_comp['target'] == target)]
                value = len(temp_df)
                # keep the link only if its flow is large enough
                if value > frac:
                    sources.append(index_source)
                    targets.append(index_target)
                    values.append(value)
                    # the link gets the source color with lower opacity
                    link_color.append(colors[source].replace(', 1)', ', 0.2)'))
                    
    labels = []
    colors_labels = []
    for key in source_indexes:
        for name in source_indexes[key]['sources']:
            labels.append(name)
            colors_labels.append(colors[name])
            
    # compute the percentage of each link
    perc_values = percent_users(sources, targets, values)
    
    # add the percentages to the hover text
    link_text = []
    for perc in perc_values:
        link_text.append(f"{perc}%")
    
    # return the dictionary with the lists
    return {'sources': sources, 
            'targets': targets, 
            'values': values, 
            'labels': labels, 
            'colors_labels': colors_labels, 
            'link_color': link_color, 
            'link_text': link_text}
  

# build the lists
data_for_plot = lists_for_plot()

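The key lists in the result are sources, targets and values.

Note the frac parameter of lists_for_plot. It sets the minimum size of a flow: links with too few users are dropped so that they do not clutter the diagram (by default, links with 10 or fewer ids are dropped). You can change this value to suit your data.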
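The effect of frac can be illustrated without pandas. This is a simplified sketch under the assumption that every row is a (step, source, target) triple for one user; count_links and the sample rows are hypothetical and only mirror the value > frac condition used in lists_for_plot.

```python
from collections import Counter


def count_links(rows, frac=10):
    """Count (step, source, target) triples and keep those above frac."""
    counts = Counter((step, source, target) for step, source, target in rows)
    # same condition as in lists_for_plot: strictly greater than frac
    return {link: n for link, n in counts.items() if n > frac}


# Hypothetical rows: at step 1, 12 users go a -> b and 3 users go a -> c
rows = [(1, 'a', 'b')] * 12 + [(1, 'a', 'c')] * 3
print(count_links(rows))
print(count_links(rows, frac=2))
```

With the default threshold only the a -> b link survives; lowering frac to 2 keeps both.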

Everything is ready to build the diagram. The figure object is stored in the senkey_diagram variable:

def plot_senkey_diagram(data_dict=data_for_plot):    
    
    """
    Builds the Sankey diagram
    
    Args:
        data_dict (dict): dictionary with the lists needed to build the diagram.
        
    Returns:
        plotly.graph_objs._figure.Figure: the figure object.
    """
    
    fig = go.Figure(data=[go.Sankey(
        domain = dict(
          x =  [0,1],
          y =  [0,1]
        ),
        orientation = "h",
        valueformat = ".0f",
        node = dict(
          pad = 50,
          thickness = 15,
          line = dict(color = "black", width = 0.1),
          label = data_dict['labels'],
          color = data_dict['colors_labels']
        ),
        link = dict(
          source = data_dict['sources'],
          target = data_dict['targets'],
          value = data_dict['values'],
          label = data_dict['link_text'],
          color = data_dict['link_color']
      ))])
    fig.update_layout(title_text="Sankey Diagram", font_size=10, width=3000, height=1200)
    
    # return the figure
    return fig
  

# store the figure in a variable
senkey_diagram = plot_senkey_diagram()

senkey_diagram.show()

What next?

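Saving to html

The diagram is interactive, so a static image is not the best way to share it. It can be saved to an html file, which opens in any browser and is easy to send to colleagues.

Saving the diagram to html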

senkey_diagram.write_html('demo_senkey.html', auto_open=True)

The html file is saved to the working directory. With auto_open=True it opens in the browser right after saving.

Plotly Chart Studio

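Plotly Chart Studio lets you publish the diagram online. To upload it you need an account and an API key (substitute your own values below):

Setting the chart_studio credentials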

import chart_studio
chart_studio.tools.set_credentials_file(username='YOU_LOGIN', api_key='YOU_API_KEY')

Uploading the diagram to chart_studio

import chart_studio.plotly as py
py.plot(senkey_diagram, filename = 'NAME_FIG', auto_open=True)

py.plot uploads the figure and returns its URL, so the diagram can be viewed online and shared by link.

We looked at how to create a Sankey diagram step by step, from loading and preparing the data to saving the resulting diagram. I hope this guide will be useful and will help expand your understanding of what data visualization with Python and the Plotly library can do.

Thanks for your attention!
