How to Analyze the Photo Studio Market with Python (2/3). Database

In a previous article, as part of a commercial project for analyzing the photo studio market, I considered the creation of parsing: unloading a list of photo studios, a list of halls, and booking data from the moment the hall was opened until the last reservation.



It is impractical to store the information obtained in the cache; it is necessary to use a database.



In the article I will consider:



  • creating a simple SQLite database;
  • writing information using Python;
  • reading data and converting to DataFrame format;
  • parsing update based on database data.






Database requirements



The main requirement for a project database is to store data and be able to retrieve it quickly.



Our database is not required:



  • delimit access to schemes, since only the user will have access by parsing;
  • keep access 24/7, because data extraction is acceptable as needed for analysis;
  • creation of procedures, since all calculations will be done in python.


Therefore, it is possible for a project to use a simple database in SQLite. You can store it as a file either on your hard drive, or on a USB flash drive, or on a cloud drive for access from other devices.



Features of working with SQLite through python



To work with SQLite through python, we use the sqlite3 library .



We connect to the database with a simple command:



sqlite3.connect(  )


If the file is missing, a new database will be created.



Database queries are performed as follows:



conn = sqlite3.connect(  )
cur = conn.cursor()
cur.execute()
df = cur.fetchall()


cur.fetchall () is executed when, as a result of a request, we want to get data from the database.



At the end of writing data to the database, do not forget to end the transaction:



conn.commit()


and at the end of working with the database, do not forget to close it:




conn.close()


otherwise, the base will be locked for writing or opening.



Creation of tables is standard:



CREATE TABLE t1 (1 , 2 ...)


or a more versatile option that creates a table if it is missing:



CREATE TABLE IF NOT EXISTS t1 (1 , 2 ...)


We write data to the table, avoiding repetitions:



INSERT OR IGNORE INTO t1 (1, 2, ...) VALUES(1, 2, ...)


Updating the data:



UPDATE t1 SET 1 = 1 WHERE 2 = 2


For more convenient work with SQLite, you can use SQLite Manager or DB Browser for SQLite .



The first program is a browser extension and looks like an alternation of a request line and a response block:







The second program is a full-fledged desktop application:











Database structure



The database will consist of 4 tables: studios, halls, 2 booking tables.



The uploaded booking data contains information about future periods, which may change with new parsing. It is undesirable to overwrite the data (they can be used, for example, to calculate the day / hour when the reservation was made). Therefore, one booking table is needed for the raw parsing data, the second for the latest, relevant ones.



We create tables:

def create_tables(conn, table = 'all'):
    cur = conn.cursor()
    
    if (table == 'all') or (table == 'uStudios'):
        cur.execute('''
            CREATE TABLE IF NOT EXISTS uStudios
            (studio_id INT PRIMARY KEY UNIQUE,
            name TEXT UNIQUE,
            metro TEXT,
            address TEXT,
            phone TEXT,
            email TEXT,
            established_date DATE)
            ''')
        print('Table uStudios is created.')

    if (table == 'all') or (table == 'uHalls'):
        cur.execute('''
            CREATE TABLE IF NOT EXISTS uHalls
            (hall_id INT PRIMARY KEY UNIQUE,
            studio_id INT,
            name TEXT,
            is_hall INT,
            square FLOAT,
            ceiling FLOAT,
            open_date DATE)
            ''')
        print('Table uHalls is created.')

    if (table == 'all') or (table == 'uBooking_parsing'):
        cur.execute('''
            CREATE TABLE IF NOT EXISTS uBooking_parsing
            (hall_id INT,
            date DATE,
            hour INT,
            is_working_hour INT,
            min_hours INT,
            price INTEGER,
            is_booked INT,
            duration INT,
            parsing_date DATE)
            ''')
        print ('Table uBooking_parsing is created.')

    if (table == 'all') or (table == 'uBooking'):
        cur.execute('''
            CREATE TABLE IF NOT EXISTS uBooking
            (hall_id INT,
            date DATE,
            hour INT,
            is_working_hour INT,
            min_hours INT,
            price INTEGER,
            is_booked INT,
            duration INT,
            parsing_date DATE)
            ''')
        print ('Table uBooking is created.')




The table parameter sets the name of the table to be created. Creates everything by default.



In the fields of the tables, you can see data that was not parsed (date of studio opening, date of hall opening). I will describe the calculation of these fields later.



Interaction with the database



Let's create 6 procedures for interacting with the database:



  1. Writing a list of photo studios to the database;
  2. Uploading a list of photo studios from the database;
  3. Recording a list of halls;
  4. Unloading the list of halls;
  5. Uploading booking data;
  6. Recording of booking data.


1. Writing a list of photo studios to the database



At the entrance to the procedure, we pass the parameters for connecting to the database and the table in the form of a DataFrame. We write the data line by line, iterating over all the lines in a loop. A useful property of string data in python for this operation is the "?" the elements of the tuple specified after.



The procedure for recording a list of photo studios is as follows:
def studios_to_db(conn, studio_list): 
    cur = conn.cursor()
    for i in studio_list.index:
        cur.execute('INSERT OR IGNORE INTO uStudios (studio_id, name, metro, address, phone, email) VALUES(?, ?, ?, ?, ?, ?)',
                   (i,
                   studio_list.loc[i, 'name'],
                   studio_list.loc[i, 'metro'],
                   studio_list.loc[i, 'address'],
                   studio_list.loc[i, 'phone'],
                   studio_list.loc[i, 'email']))




2. Uploading the list of photo studios from the database



We pass the database connection parameters to the entry into the procedure. We execute the select query, intercept the unloaded data and write it to the DataFrame. We translate the date of foundation of the photo studio in the date format.



The entire procedure is as follows:
def db_to_studios(conn):
    cur = conn.cursor()
    cur.execute('SELECT * FROM uStudios')
    studios = pd.DataFrame(cur.fetchall()
                           , columns=['studio_id', 'name', 'metro', 'address', 'phone', 'email', 'established_date']
                          ).set_index('studio_id')
    studios['established_date'] = pd.to_datetime(studios['established_date'])
    return studios




3. Writing the list of halls to the database



The procedure is similar to recording a list of photographic studios: we transfer connection parameters and a table of halls, write the data line by line to the database.



The procedure for recording the list of halls in the database
def halls_to_db(conn, halls): 
    cur = conn.cursor()
    for i in halls.index:
        cur.execute('INSERT OR IGNORE INTO uHalls (hall_id, studio_id, name, is_hall, square, ceiling) VALUES(?, ?, ?, ?, ?, ?)',
                   (i,
                   halls.loc[i, 'studio_id'],
                   halls.loc[i, 'name'],
                   halls.loc[i, 'is_hall'],
                   halls.loc[i, 'square'],
                   halls.loc[i, 'ceiling']))




4. Unloading the list of halls from the database



The procedure is similar to unloading a list of photo studios: transferring connection parameters, select-request, interception, writing to a DataFrame, converting the opening date of the hall into date format.

The only difference: studio id and hall sign were recorded in byte form. We return the value by the function:



int.from_bytes(, 'little')


The procedure for unloading the list of halls is as follows:
def db_to_halls(conn):
    cur = conn.cursor()
    cur.execute('SELECT * FROM uHalls')
    halls = pd.DataFrame(cur.fetchall(), columns=['hall_id', 'studio_id', 'name', 'is_hall', 'square', 'ceiling', 'open_date']).set_index('hall_id')
    for i in halls.index:
        halls.loc[i, 'studio_id'] = int.from_bytes(halls.loc[i, 'studio_id'], 'little')
        halls.loc[i, 'is_hall'] = int.from_bytes(halls.loc[i, 'is_hall'], 'little')
    halls['open_date'] = pd.to_datetime(halls['open_date'])
    return halls




5. Uploading booking information from the database



We pass the database connection parameters and the parsing parameter to the procedure, showing from which booking table we are requesting information: 0 - from the actual (by default), 1 - from the parsing table. Next, we execute a select request, intercept it, and translate it into a DataFrame. Dates are converted to date format, numbers from byte format to number format.



Procedure for uploading booking information:
def db_to_booking(conn, parsing = 0):
    cur = conn.cursor()
    if parsing == 1:
        cur.execute('SELECT * FROM uBooking_parsing')
    else:
        cur.execute('SELECT * FROM uBooking')
    booking = pd.DataFrame(cur.fetchall(), columns=['hall_id', 
                                                     'date', 'hour', 
                                                     'is_working_hour', 
                                                     'min_hours', 
                                                     'price', 
                                                     'is_booked', 
                                                     'duration', 
                                                     'parsing_date'])
    booking['hall_id'] = [int.from_bytes(x, 'little') if not isinstance(x, int) else x for x in booking['hall_id']]
    booking['is_booked'] = [int.from_bytes(x, 'little') if not isinstance(x, int) else x for x in booking['is_booked']]
    booking['date'] = pd.DataFrame(booking['date'])
    booking['parsing_date'] = pd.DataFrame(booking['parsing_date'])
    
    return booking




6. Writing booking information to the database



The most complex function of interaction with the database, since it initiates the parsing of the booking data. At the entrance, we pass to the procedure the parameters for connecting to the database and the list of hall ids that must be updated.



To determine the latest date of up-to-date data,



request from the database the latest parsing date for each hall id:
parsing_date = db_to_booking(conn, parsing = 1).groupby('hall_id').agg(np.max)['parsing_date']




We iterate over each hall id using a loop.



In each hall id, the first thing we do is define



number of weeks to parse in the past:
        try:
            last_day_str = parsing_date[id]
            last_day = datetime.datetime.strptime(last_day_str, '%Y-%m-%d')
            delta_days = (datetime.datetime.now() - last_day).days
            weeks_ago = delta_days // 7
        except:
            last_day_str = '2010-01-01'
            last_day = datetime.datetime.strptime(last_day_str, '%Y-%m-%d')
            weeks_ago = 500




If the hall id is in the database, then we calculate. If not, then we parse 500 weeks in the past or stop when there was no reservation for 2 months (the limitation is described in the previous article ).



Then we perform the parsing procedures:
        d = get_past_booking(id, weeks_ago = weeks_ago)        
        d.update(get_future_booking(id))
        book = hall_booking(d)




First, we parse booking information from the past to the actual data, then from the future (up to 2 months, when there were no records) and at the end we transfer the data from json format to DataFrame.



At the final stage, we write the data on booking the hall into the database and close the transaction.



The procedure for recording booking information into the database is as follows:
def booking_to_db(conn, halls_id):
    cur = conn.cursor()
    cur_date = pd.Timestamp(datetime.date.today())
    parsing_date = db_to_booking(conn, parsing = 1).groupby('hall_id').agg(np.max)['parsing_date']
    
    for id in halls_id:
        
        #download last parsing_date from DataBase
        try:
            last_day_str = parsing_date[id]
            last_day = datetime.datetime.strptime(last_day_str, '%Y-%m-%d')
            delta_days = (datetime.datetime.now() - last_day).days
            weeks_ago = delta_days // 7
        except:
            last_day_str = '2010-01-01'
            last_day = datetime.datetime.strptime(last_day_str, '%Y-%m-%d')
            weeks_ago = 500
        
        d = get_past_booking(id, weeks_ago = weeks_ago)        
        d.update(get_future_booking(id))
        book = hall_booking(d)
        for i in list(range(len(book))):#book.index:
            cur.execute('INSERT OR IGNORE INTO uBooking_parsing (hall_id, date, hour, is_working_hour, min_hours, price, is_booked, duration, parsing_date) VALUES(?,?,?,?,?,?,?,?,?)',
                       (book.iloc[i]['hall_id'],
                       book.iloc[i]['date'].date().isoformat(),
                       book.iloc[i]['hour'],
                       book.iloc[i]['is_working_hour'],
                       book.iloc[i]['min_hours'],
                       book.iloc[i]['price'],
                       book.iloc[i]['is_booked'],
                       book.iloc[i]['duration'],
                       cur_date.date().isoformat()))
        conn.commit()
        print('hall_id ' + str(id) + ' added. ' + str(list(halls_id).index(id) + 1) + ' from ' + str(len(halls_id)))




Updating the opening days of the studio and halls



The Lounge Opening Date is the earliest booking date for the Lounge.



The opening date of the photo studio is the earliest date for the opening of the studio hall.



Based on this logic,



we unload the earliest booking dates for each room from the database
halls = db_to_booking(conn).groupby('hall_id').agg(min)['date']




Then we update the data for the opening of the halls line by line:
    for i in list(range(len(halls))):
        cur.execute('''UPDATE uHalls SET open_date = '{1}' WHERE hall_id = {0}'''
                    .format(halls.index[i], str(halls.iloc[i])))




We update the photo studio opening data in the same way: we download the data on the opening dates of the halls from the database, calculate the smallest date for each studio, rewrite the photo studio opening date.



Procedure for updating opening dates:
def update_open_dates(conn):
    
    cur = conn.cursor()
    
    #update open date in uHalls
    halls = db_to_booking(conn).groupby('hall_id').agg(min)['date']
    
    for i in list(range(len(halls))):
        cur.execute('''UPDATE uHalls SET open_date = '{1}' WHERE hall_id = {0}'''
                    .format(halls.index[i], str(halls.iloc[i])))

    #update open date in uStudios
    studios = db_to_halls(conn)
    studios['open_date'] = pd.to_datetime(studios['open_date'])
    studios = studios.groupby('studio_id').agg(min)['open_date']
    for i in list(range(len(studios))):
        cur.execute('''UPDATE uStudios SET established_date = '{1}' WHERE  studio_id = {0}'''
                    .format(studios.index[i], str(studios.iloc[i])))
    
    conn.commit()




Parsing update



We will combine all procedures in this and the previous article in this procedure. It can be launched both during the first parsing and when updating data.



The procedure looks like this:
def update_parsing(directory = './/', is_manual = 0):
    
    start_time = time.time()
    
    #is DataBase exists?
    if not os.path.exists(directory + 'photostudios_moscow1.sqlite'):
        if is_manual == 1:
            print('Data base is not exists. Do you want to create DataBase (y/n)? ')
            answer = input().lower()
        else: 
            answer == 'y'
        
        if answer == 'y':
            conn = sqlite3.connect(directory + 'photostudios_moscow1.sqlite')
            conn.close()
            print('DataBase is created')
        elif answer != 'n':
            print('Error in input!')
            return list()
    
    print('DataBase is exists')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
        
    #connect to DataBase
    conn = sqlite3.connect(directory + 'photostudios_moscow1.sqlite')
    cur = conn.cursor()       

    #has DataBase 4 tables?
    tables = [x[0] for x in list(cur.execute('SELECT name FROM sqlite_master WHERE type="table"'))]
    if not ('uStudios' in tables) & ('uHalls' in tables) & ('uBooking_parsing' in tables) & ('uBooking' in tables):
        if is_manual == 1:
            print('Do you want to create missing tables (y/n)? ')
            answer = input().lower()
        else:
            answer = 'y'
        
        if anwer == 'y':
            if not ('uStudios' in tables):
                create_tables(conn, table = 'uStudios')

            if not ('uHalls' in tables):
                create_tables(conn, table = 'uHalls')

            if not ('uBooking_parsing' in tables):
                create_tables(conn, table = 'uBooking_parsing')

            if not ('uBooking' in tables):
                create_tables(conn, table = 'uBooking')

        elif answer != 'n':
            print('Error in input!')
            return list()

    conn.commit()
    print(str(tables) + ' are exist in DataBase')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
    
    #update uStudios
    studios = studio_list()
    new_studios = studios[[x not in list(db_to_studios(conn).index) for x in list(studios.index)]]
    if len(new_studios) > 0:
        print(str(len(new_studios)) + ' new studios detected: \n' + str(list(new_studios['name'])))
        studios_to_db(conn, new_studios)
    
    conn.commit()
    print('Studio list update was successful')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
    
    #update uHalls
    halls = hall_list(list(studios.index)).sort_index()
    new_halls = halls[[x not in list(db_to_halls(conn).index) for x in list(halls.index)]]
    if len(new_halls) > 0:
        halls_to_db(conn, new_halls)
    
    conn.commit()
    print('Halls list update was successful')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
        
    #update uBooking_parsing
    booking_to_db(conn, halls.index)   
    
    conn.commit()
    print('Booking_parsing update was successful')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
    
    #update uBooking from uBooking_parsing
    cur.execute('DELETE FROM uBooking')
    cur.execute('''
        insert into uBooking (hall_id, date, hour, is_working_hour, min_hours, price, is_booked, duration, parsing_date) 
        select hall_id, date, hour, is_working_hour, min_hours, price, is_booked, duration, parsing_date
        from
        (
            select *, row_number() over(partition by hall_id, date, hour order by parsing_date desc) rn 
            from uBooking_parsing
        ) t
        where rn = 1
    ''')
    conn.commit()
    print('Booking update was successful')
    print("--- %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
    
    update_open_dates(conn)
    conn.commit()
    print('Open date update was successful')
    print("--- %s seconds ---" % (time.time() - start_time))
    
    conn.close()




Let's analyze her work in order.



At the entrance to the procedure, we pass 2 parameters: the address of the folder where to get the database from or where to install it (by default, we take the folder with python documents), and the optional is_manual parameter, which, if set to "1", will request the need to create a database or tables in in their absence.



. , :
    if not os.path.exists(directory + 'photostudios_moscow1.sqlite'):
        if is_manual == 1:
            print('Data base is not exists. Do you want to create DataBase (y/n)? ')
            answer = input().lower()
        else: 
            answer == 'y'
        
        if answer == 'y':
            conn = sqlite3.connect(directory + 'photostudios_moscow1.sqlite')
            conn.close()
            print('DataBase is created')
        elif answer != 'n':
            print('Error in input!')
            return list()




:
    conn = sqlite3.connect(directory + 'photostudios_moscow1.sqlite')
    cur = conn.cursor() 




, . , . :
    tables = [x[0] for x in list(cur.execute('SELECT name FROM sqlite_master WHERE type="table"'))]
    if not ('uStudios' in tables) & ('uHalls' in tables) & ('uBooking_parsing' in tables) & ('uBooking' in tables):
        if is_manual == 1:
            print('Do you want to create missing tables (y/n)? ')
            answer = input().lower()
        else:
            answer = 'y'
        
        if anwer == 'y':
            if not ('uStudios' in tables):
                create_tables(conn, table = 'uStudios')

            if not ('uHalls' in tables):
                create_tables(conn, table = 'uHalls')

            if not ('uBooking_parsing' in tables):
                create_tables(conn, table = 'uBooking_parsing')

            if not ('uBooking' in tables):
                create_tables(conn, table = 'uBooking')

        elif answer != 'n':
            print('Error in input!')
            return list()

    conn.commit()




. :
    studios = studio_list()
    new_studios = studios[[x not in list(db_to_studios(conn).index) for x in list(studios.index)]]
    if len(new_studios) > 0:
        print(str(len(new_studios)) + ' new studios detected: \n' + str(list(new_studios['name'])))
        studios_to_db(conn, new_studios)




conn.commit ()



:
    halls = hall_list(list(studios.index)).sort_index()
    new_halls = halls[[x not in list(db_to_halls(conn).index) for x in list(halls.index)]]
    if len(new_halls) > 0:
        halls_to_db(conn, new_halls)
    
    conn.commit()




uBooking_parsing. , .. booking_to_db
    booking_to_db(conn, halls.index)   
    
    conn.commit()




uBooking. uBooking uBooking_parsing ( , ) :
    cur.execute('DELETE FROM uBooking')
    cur.execute('''
        insert into uBooking (hall_id, date, hour, is_working_hour, min_hours, price, is_booked, duration, parsing_date) 
        select hall_id, date, hour, is_working_hour, min_hours, price, is_booked, duration, parsing_date
        from
        (
            select *, row_number() over(partition by hall_id, date, hour order by parsing_date desc) rn 
            from uBooking_parsing
        ) t
        where rn = 1
    ''')
    conn.commit()




:
    update_open_dates(conn)
    conn.commit()




    conn.close()




Parsing with saving data to the database is configured successfully!



We initiate parsing / updating with the following procedure:
update_parsing()




Outcome



In this and the previous article, we examined the algorithm for parsing open information for photo studios. The data obtained was collected in a database.



In the next article, we will consider examples of analyzing the obtained data.



You can find the finished project on my github page .



All Articles