SQL (Structured Query Language) and Pandas (Python library built on top Numpy package) are widely used by data scientists, because these programming languages enable them to read, manipulate, write and retrieve data (most of the time stored in a database or data warehouses such as BigQuery or Amazon RedShift). Every data scientist is aware that her/his job will include data extraction and cleaning, therefore knowing SQL or Pandas is often required.
In this article, we will give you a simple comparison of those important tools for data scientists with a focus on the JOIN function (used to combine from two or more tables or dataframes).
Let's go over the different JOINs available in SQL:
Let's dive in and now learn how to join two tables or data frames using SQL and Pandas
1. INNER JOIN
SQL
SELECT *
FROM table1
INNER JOIN table2
ON table1.key = table2.key;
Pandas
import pandas as pd
#df1 and df2 are two pandas dataframes
pd.merge(df1, df2, on='key')
2. LEFT JOIN
SQL
SELECT *
FROM table1
LEFT JOIN table2
ON table1.key = table2.key;
Pandas
import pandas as pd
#df1 and df2 are two pandas dataframes
pd.merge(df1, df2, on='key',how='left')
3. RIGHT JOIN
SQL
SELECT *
FROM table1
RIGHT JOIN table2
ON table1.key = table2.key;
Pandas
import pandas as pd
#df1 and df2 are two pandas dataframes
pd.merge(df1, df2, on='key',how='right')
4. FULL JOIN
SQL
SELECT *
FROM table1
FULL JOIN table2
ON table1.key = table2.key;
Pandas
import pandas as pd
#df1 and df2 are two pandas dataframes
pd.merge(df1, df2, on='key',how='outer')
We hope that you will use those snippets to join some tables or data frames in your daily job!