How to rewrite SQL queries in Python using Pandas

In this article, June Tao Ching explained how Pandas can achieve the same result in Python as in SQL queries. Before you - the translation, and the original can be found in the blog towardsdatascience.com.



image

Photo from the Unsplash website . Author: Hitesh Choudhary



Getting the same result in Python as with a SQL query



Often, when working on the same project, we have to switch between SQL and Python. That being said, some of us are familiar with data manipulation in SQL queries, but not in Python, which hinders our efficiency and productivity. In fact, using Pandas, you can achieve the same result in Python as in SQL queries.



Beginning of work



The Pandas package needs to be installed if it is not there.



conda install pandas


We will be using the famous Titanic Dataset from Kaggle .



After installing the package and downloading the data, we need to import it into our Python environment.



image



We will use a DataFrame to store data. Various Pandas functions will help us manage this data structure.



SELECT, DISTINCT, COUNT, LIMIT



Let's start with simple SQL queries that we use a lot.



image



titanic_df["age"].unique()will return an array of unique values, so we'll have to use len()to count their number.



SELECT, WHERE, OR, AND, IN (SELECT with conditions)



After the first part, you learned how to explore a DataFrame in simple ways. Now let's try to do it with some conditions (this is a statement WHEREin SQL).



image



If we only want to select specific columns from the DataFrame, we can do so with an extra pair of square brackets.



Note: if you select multiple columns, you need to put the array ["name","age"]inside square brackets.



isin()works exactly the same as INin SQL queries. To use NOT IN, in Python we need to use negation (~).



GROUP BY, ORDER BY, COUNT



GROUP BYand ORDER BYare also popular SQL statements for data mining. Now let's try using them in Python.



image



If we want to sort only one COUNT column, we can simply pass a boolean value to the method sort_values. If we are going to sort multiple columns, then we must pass an array of booleans to the method sort_values.



The method sum()will return the sums for each of the columns in the DataFrame, which can be aggregated numerically. If we only want a specific column, then we need to specify the column name using square brackets.



MIN, MAX, MEAN, MEDIAN



Finally, let's try some of the standard statistical functions that are important when exploring data.



image



SQL does not contain operators that return the median value, so we use BigQuery to get the median value of the age column.In APPROX_QUANTILES



Pandas, the aggregation method .agg()also supports other functions, for example sum.



You have now learned how to rewrite SQL queries in Python using Pandas . Hope you find this article helpful.



All the code can be found in my Github repository .



Thank you for attention!



All Articles