Updating DataFrames in Pandas
Pandas is a powerful data manipulation library in Python. It provides various functions to update and modify DataFrames efficiently. In this blog post, we will cover some common methods to update DataFrames, including dropping rows and columns, filling missing values, creating new columns, changing values, and more.
Each method below comes with a short example and its printed output. Most examples rebuild the DataFrame from the same data dictionary, so they can be run independently. Let's start by creating a sample DataFrame to work with:
import pandas as pd
import numpy as np
# Creating a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
'Age': [24, np.nan, 22, 35, np.nan],
'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
'Score': [85, 90, 88, np.nan, 95]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
1. Dropping Rows with Null Values
df.dropna(inplace=True)
print("After dropping rows with null values:")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
2 Charlie 22.0 USA 88.0
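Dropping every row that contains any null can discard more data than you need to. Here is a minimal sketch, assuming only a missing Age should disqualify a row. It uses the subset parameter and returns a new DataFrame rather than modifying the original in place (the name with_age is just illustrative):

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# Drop only the rows where 'Age' is missing; nulls in other columns are kept.
with_age = df.dropna(subset=['Age'])
print(with_age)
```

David's row survives here even though his Score is missing, because only Age is checked.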
2. Dropping Columns with Null Values
df = pd.DataFrame(data)
df.dropna(axis=1, how='any', inplace=True)
print("After dropping columns with null values:")
print(df)
Output:
Name Country
0 Alice USA
1 Bob Canada
2 Charlie USA
3 David Canada
4 Edward USA
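Dropping a column because of a single null can be too strict. As a rough sketch, assuming you want to keep any column with at least four non-null values, the thresh parameter sets that minimum (the cutoff of 4 and the name mostly_full are illustrative choices):

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# Keep columns that have at least 4 non-null values.
# 'Age' has only 3, so it is dropped; 'Score' has 4, so it stays.
mostly_full = df.dropna(axis=1, thresh=4)
print(mostly_full)
```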
3. Filling Missing Values
df = pd.DataFrame(data)
df.fillna(0, inplace=True)
print("After filling missing values with 0:")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
1 Bob 0.0 Canada 90.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada 0.0
4 Edward 0.0 USA 95.0
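Filling everything with 0 is simple, but an age or score of 0 can distort later statistics. One possible alternative, sketched here under the assumption that the column mean is a reasonable stand-in for Age and the median for Score, is to pass fillna a dictionary with one fill value per column:

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# One fill value per column: mean for 'Age', median for 'Score'.
fill_values = {'Age': df['Age'].mean(), 'Score': df['Score'].median()}
filled = df.fillna(fill_values)
print(filled)
```

Which statistic is appropriate depends on your data; the mean and median here are only examples.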
4. Dropping a Column
df = pd.DataFrame(data)
df.drop('Country', axis=1, inplace=True)
print("After dropping the 'Country' column:")
print(df)
Output:
Name Age Score
0 Alice 24.0 85.0
1 Bob NaN 90.0
2 Charlie 22.0 88.0
3 David 35.0 NaN
4 Edward NaN 95.0
5. Dropping a Row by Index
df = pd.DataFrame(data)
df.drop(1, inplace=True)
print("After dropping the row with index 1:")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada NaN
4 Edward NaN USA 95.0
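The two drop examples above each remove a single label. As a small sketch, the columns and index keyword arguments of drop accept lists and make the intent explicit without an axis argument (the particular labels chosen here are arbitrary):

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# Drop several columns at once.
fewer_columns = df.drop(columns=['Country', 'Score'])

# Drop several rows at once by index label.
fewer_rows = df.drop(index=[1, 3])

print(fewer_columns)
print(fewer_rows)
```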
6. Creating a New Column with Modified Values
df = pd.DataFrame(data)
df['Country_Upper'] = df['Country'].apply(str.upper)
print("After creating a new column with uppercase country names:")
print(df)
Output:
Name Age Country Score Country_Upper
0 Alice 24.0 USA 85.0 USA
1 Bob NaN Canada 90.0 CANADA
2 Charlie 22.0 USA 88.0 USA
3 David 35.0 Canada NaN CANADA
4 Edward NaN USA 95.0 USA
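For string columns, the vectorized .str accessor is a common alternative to apply(str.upper). Unlike str.upper called on each element, it passes missing values through instead of raising an error. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
})

# Vectorized string method; missing values would stay missing.
df['Country_Upper'] = df['Country'].str.upper()
print(df)
```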
7. Creating a Column with the Sum of Two Columns
df = pd.DataFrame(data)
df['Age_Score_Sum'] = df['Age'].fillna(0) + df['Score'].fillna(0)
print("After creating a new column with the sum of Age and Score:")
print(df)
Output:
Name Age Country Score Age_Score_Sum
0 Alice 24.0 USA 85.0 109.0
1 Bob NaN Canada 90.0 90.0
2 Charlie 22.0 USA 88.0 110.0
3 David 35.0 Canada NaN 35.0
4 Edward NaN USA 95.0 95.0
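The same sum can be written with Series.add and its fill_value parameter, which treats a missing value on one side as 0. This is only a sketch of an equivalent spelling. One difference worth knowing: if both values in a row were missing, add would leave the result as NaN, whereas the fillna(0) version above would give 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Score': [85, 90, 88, np.nan, 95],
})

# Treat a missing value on either side as 0 while adding.
df['Age_Score_Sum'] = df['Age'].add(df['Score'], fill_value=0)
print(df)
```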
8. Changing the Value of a Single Cell
df = pd.DataFrame(data)
df.loc[1, 'Age'] = 30
print("After changing the age of Bob to 30:")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
1 Bob 30.0 Canada 90.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada NaN
4 Edward NaN USA 95.0
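Updating by index label works when you know the row's position. When you only know a value in the row, a boolean mask inside loc does the same job. A minimal sketch that finds Bob by name instead of by index 1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
})

# Select rows by condition, then assign to one column of those rows.
df.loc[df['Name'] == 'Bob', 'Age'] = 30
print(df)
```

If several rows matched the condition, all of them would be updated.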
9. Creating a Column with Conditional Values
df = pd.DataFrame(data)
df['High_Score'] = df['Score'].apply(lambda x: 1 if x > 90 else 0)
print("After creating a new column indicating high scores:")
print(df)
Output:
Name Age Country Score High_Score
0 Alice 24.0 USA 85.0 0
1 Bob NaN Canada 90.0 0
2 Charlie 22.0 USA 88.0 0
3 David 35.0 Canada NaN 0
4 Edward NaN USA 95.0 1
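The lambda above runs once per row. A vectorized sketch of the same rule uses np.where, which evaluates the condition on the whole column at once. As in the original, a missing Score compares as False and ends up as 0, which may or may not be what you want:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Score': [85, 90, 88, np.nan, 95],
})

# 1 where Score is above 90, otherwise 0 (NaN > 90 is False, so it maps to 0).
df['High_Score'] = np.where(df['Score'] > 90, 1, 0)
print(df)
```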
10. Setting and Resetting the Index
df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print("After setting 'Name' as the index:")
print(df)
df.reset_index(drop=False, inplace=True)
print("After resetting the index:")
print(df)
Output:
After setting 'Name' as the index:
Age Country Score
Name
Alice 24.0 USA 85.0
Bob NaN Canada 90.0
Charlie 22.0 USA 88.0
David 35.0 Canada NaN
Edward NaN USA 95.0
After resetting the index:
Name Age Country Score
0 Alice 24.0 USA 85.0
1 Bob NaN Canada 90.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada NaN
4 Edward NaN USA 95.0
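One reason to set a column as the index is label-based lookup. As a brief sketch, once Name is the index, loc can fetch a row or a single value by name (the variable by_name is illustrative):

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# set_index without inplace=True returns a new DataFrame.
by_name = df.set_index('Name')

# Look up one value by row label and column label.
charlie_score = by_name.loc['Charlie', 'Score']
print(charlie_score)
```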
11. Removing Duplicates
df = pd.DataFrame(data)
df.drop_duplicates(subset='Country', inplace=True)
print("After removing duplicate rows based on 'Country':")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
1 Bob NaN Canada 90.0
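By default drop_duplicates keeps the first row of each group. The keep parameter changes that. A minimal sketch that keeps the last row per country instead:

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# Keep the last occurrence of each 'Country' value.
last_per_country = df.drop_duplicates(subset='Country', keep='last')
print(last_per_country)
```

With this data, the rows for David (Canada) and Edward (USA) remain.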
12. Applying a Function to a Column or Row
df = pd.DataFrame(data)
df['Score'] = df['Score'].apply(lambda x: x * 2)
print("After doubling the 'Score' values:")
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 170.0
1 Bob NaN Canada 180.0
2 Charlie 22.0 USA 176.0
3 David 35.0 Canada NaN
4 Edward NaN USA 190.0
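The example above applies a function to one column. To apply a function to each row, pass axis=1 to DataFrame.apply; the function then receives the row as a Series. A small sketch, where the Label column and its format are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
})

# axis=1 passes each row to the function as a Series.
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['Country']})", axis=1)
print(df)
```

For simple arithmetic such as doubling, plain column arithmetic (df['Score'] * 2) gives the same result as apply and is usually faster.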
13. Applying a Function to Each Cell
df = pd.DataFrame(data)
# Note: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
print("After converting all string values to uppercase:")
print(df)
Output:
Name Age Country Score
0 ALICE 24.0 USA 85.0
1 BOB NaN CANADA 90.0
2 CHARLIE 22.0 USA 88.0
3 DAVID 35.0 CANADA NaN
4 EDWARD NaN USA 95.0
14. Changing the Data Type of a Column
df = pd.DataFrame(data)
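# A plain 'int' column cannot hold NaN, so astype('int', errors='ignore') would
# silently leave 'Age' as float. The nullable 'Int64' dtype stores missing values as <NA>.
df['Age'] = df['Age'].astype('Int64')
print("After changing the 'Age' column to nullable integers:")
print(df)
Output:
Name Age Country Score
0 Alice 24 USA 85.0
1 Bob <NA> Canada 90.0
2 Charlie 22 USA 88.0
3 David 35 Canada NaN
4 Edward <NA> USA 95.0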
15. Converting to Numeric with Errors Set to 'Coerce'
df = pd.DataFrame(data)
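# 'Age' is already numeric here, so to_numeric leaves it unchanged.
# errors='coerce' only matters when a column holds values that cannot be parsed as numbers.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
print("After converting 'Age' to numeric (non-convertible values would become NaN):")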
print(df)
Output:
Name Age Country Score
0 Alice 24.0 USA 85.0
1 Bob NaN Canada 90.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada NaN
4 Edward NaN USA 95.0
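To see coercion actually take effect, the column needs to contain something that is not a number. A minimal sketch with a hypothetical column of ages read in as text (the raw_ages values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw input: ages stored as strings, one of them unparseable.
raw_ages = pd.Series(['24', 'unknown', '22'])

# Values that cannot be parsed become NaN instead of raising an error.
ages = pd.to_numeric(raw_ages, errors='coerce')
print(ages)
```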
16. Renaming Columns
df = pd.DataFrame(data)
df.rename(columns={'Name': 'FullName'}, inplace=True)
print("After renaming 'Name' column to 'FullName':")
print(df)
Output:
FullName Age Country Score
0 Alice 24.0 USA 85.0
1 Bob NaN Canada 90.0
2 Charlie 22.0 USA 88.0
3 David 35.0 Canada NaN
4 Edward NaN USA 95.0
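rename also accepts a function instead of a dictionary, which is handy when every column needs the same transformation. A short sketch, assuming you want all column names in lowercase:

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
}
df = pd.DataFrame(data)

# The function is called once per column name.
lowercased = df.rename(columns=str.lower)
print(lowercased.columns.tolist())
```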
17. Binning Data
df = pd.DataFrame(data)
df['Score_Binned'] = pd.cut(df['Score'], bins=3)
print("After binning the 'Score' column into 3 bins:")
print(df)
Output:
Name Age Country Score Score_Binned
0 Alice 24.0 USA 85.0 (84.99, 88.333]
1 Bob NaN Canada 90.0 (88.333, 91.667]
2 Charlie 22.0 USA 88.0 (84.99, 88.333]
3 David 35.0 Canada NaN NaN
4 Edward NaN USA 95.0 (91.667, 95.0]
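With bins=3, pandas picks equal-width edges from the data, which gives the fractional boundaries shown above. If you already know the cutoffs you care about, you can pass explicit edges and labels instead. A sketch, where the edges 0, 87, 92, 100 and the labels are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Score': [85, 90, 88, np.nan, 95],
})

# Intervals are right-inclusive by default: (0, 87], (87, 92], (92, 100].
df['Score_Band'] = pd.cut(
    df['Score'],
    bins=[0, 87, 92, 100],
    labels=['low', 'mid', 'high'],
)
print(df)
```

Missing scores stay missing in the binned column.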
These examples show the results of common DataFrame update operations in Pandas. Running them against your own datasets is a good way to get comfortable with data manipulation in Pandas.