Updating DataFrames in Pandas

Pandas is a widely used Python library for data manipulation, and it offers many ways to update and modify a DataFrame. This post walks through the common ones: dropping rows and columns, filling missing values, creating new columns, changing individual values, setting the index, removing duplicates, applying functions, converting data types, renaming columns, and binning.

Each example below shows the code followed by the output it produces. Examples 2 onward rebuild df from the data dictionary, so each one starts from the original values. Let's start by creating a sample DataFrame to work with:

import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

1. Dropping Rows with Null Values

df.dropna(inplace=True)
print("After dropping rows with null values:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA   85.0
2  Charlie  22.0     USA   88.0
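Calling dropna() with no arguments removes a row if any column is missing, which is often stricter than needed. As a minimal sketch, assuming only a missing Score should disqualify a row, the subset parameter limits the check to chosen columns. The variable name scored is illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
})

# Only look at 'Score' when deciding which rows to drop.
# Returning a new DataFrame (no inplace) leaves df untouched.
scored = df.dropna(subset=['Score'])
print(scored)
```

Here only David's row is removed; Bob and Edward stay even though their Age is missing.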

2. Dropping Columns with Null Values

df = pd.DataFrame(data)
df.dropna(axis=1, how='any', inplace=True)
print("After dropping columns with null values:")
print(df)

Output:

      Name Country
0    Alice     USA
1      Bob  Canada
2  Charlie     USA
3    David  Canada
4   Edward     USA

3. Filling Missing Values

df = pd.DataFrame(data)
df.fillna(0, inplace=True)
print("After filling missing values with 0:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA   85.0
1      Bob   0.0  Canada   90.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    0.0
4   Edward   0.0     USA   95.0
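Filling every gap with 0 can distort numeric columns, since an age or score of 0 is rarely meaningful. One possible alternative, sketched here under the assumption that a per-column statistic is an acceptable stand-in, is to pass fillna a dictionary that maps column names to fill values. The choice of mean for Age and median for Score is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
    'Score': [85, 90, 88, np.nan, 95],
})

# Map each column to its own fill value; columns not listed are left alone.
fill_values = {
    'Age': df['Age'].mean(),        # mean of 24, 22, 35
    'Score': df['Score'].median(),  # median of 85, 88, 90, 95
}
filled = df.fillna(fill_values)
print(filled)
```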

4. Dropping a Column

df = pd.DataFrame(data)
df.drop('Country', axis=1, inplace=True)
print("After dropping the 'Country' column:")
print(df)

Output:

      Name   Age  Score
0    Alice  24.0   85.0
1      Bob   NaN   90.0
2  Charlie  22.0   88.0
3    David  35.0    NaN
4   Edward   NaN   95.0

5. Dropping a Row by Index

df = pd.DataFrame(data)
df.drop(1, inplace=True)
print("After dropping the row with index 1:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA   85.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA   95.0

6. Creating a New Column with Modified Values

df = pd.DataFrame(data)
df['Country_Upper'] = df['Country'].apply(str.upper)
print("After creating a new column with uppercase country names:")
print(df)

Output:

      Name   Age Country  Score Country_Upper
0    Alice  24.0     USA   85.0           USA
1      Bob   NaN  Canada   90.0        CANADA
2  Charlie  22.0     USA   88.0           USA
3    David  35.0  Canada    NaN        CANADA
4   Edward   NaN     USA   95.0           USA
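apply(str.upper) works here because every Country value is a string; it would raise a TypeError if the column contained a missing value. A hedged alternative is the vectorized .str accessor, which passes missing values through unchanged. The small Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# A text column with one missing entry (hypothetical sample data).
countries = pd.Series(['usa', np.nan, 'canada'])

# .str.upper() uppercases the strings and leaves the missing entry missing.
upper = countries.str.upper()
print(upper)
```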

7. Creating a Column with the Sum of Two Columns

df = pd.DataFrame(data)
df['Age_Score_Sum'] = df['Age'].fillna(0) + df['Score'].fillna(0)
print("After creating a new column with the sum of Age and Score:")
print(df)

Output:

      Name   Age Country  Score  Age_Score_Sum
0    Alice  24.0     USA   85.0          109.0
1      Bob   NaN  Canada   90.0           90.0
2  Charlie  22.0     USA   88.0          110.0
3    David  35.0  Canada    NaN           35.0
4   Edward   NaN     USA   95.0           95.0
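The fillna(0) calls matter because the plain + operator propagates NaN. The sketch below contrasts the two behaviours and shows Series.add with fill_value=0 as a more compact way to get the same totals as the example above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [24, np.nan, 22, 35, np.nan],
    'Score': [85, 90, 88, np.nan, 95],
})

# Plain addition: the result is NaN wherever either operand is NaN.
plain_sum = df['Age'] + df['Score']

# add(..., fill_value=0): a missing operand counts as 0, unless both are missing.
filled_sum = df['Age'].add(df['Score'], fill_value=0)

print(plain_sum.tolist())
print(filled_sum.tolist())
```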

8. Changing the Value of a Single Cell

df = pd.DataFrame(data)
df.loc[1, 'Age'] = 30
print("After changing the age of Bob to 30:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA   85.0
1      Bob  30.0  Canada   90.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA   95.0
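The same .loc syntax also accepts a boolean mask in place of a single row label, which updates every matching row at once. As a minimal sketch, assuming 30 is an acceptable placeholder age, the following sets Age wherever it is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
})

# The mask selects the rows; 'Age' selects the column to overwrite.
missing_age = df['Age'].isna()
df.loc[missing_age, 'Age'] = 30
print(df)
```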

9. Creating a Column with Conditional Values

df = pd.DataFrame(data)
# Comparisons with NaN are False, so a missing Score (David) is flagged as 0.
df['High_Score'] = df['Score'].apply(lambda x: 1 if x > 90 else 0)
print("After creating a new column indicating high scores:")
print(df)

Output:

      Name   Age Country  Score  High_Score
0    Alice  24.0     USA   85.0           0
1      Bob   NaN  Canada   90.0           0
2  Charlie  22.0     USA   88.0           0
3    David  35.0  Canada    NaN           0
4   Edward   NaN     USA   95.0           1
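The same flag can be computed without apply by comparing the whole column at once. The sketch below also shows one possible way to keep a missing Score distinct from a low one, on the assumption that "unknown" should not be reported as 0. Both variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Score': [85, 90, 88, np.nan, 95],
})

# Vectorized comparison: True/False per row, converted to 1/0.
high_score = (df['Score'] > 90).astype(int)

# Variant that leaves the flag missing when the Score itself is missing.
high_or_missing = (df['Score'] > 90).astype(float).where(df['Score'].notna())

print(high_score.tolist())
print(high_or_missing.tolist())
```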

10. Setting and Resetting the Index

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print("After setting 'Name' as the index:")
print(df)

df.reset_index(drop=False, inplace=True)
print("After resetting the index:")
print(df)

Output:

After setting 'Name' as the index:
         Age Country  Score
Name                        
Alice    24.0     USA   85.0
Bob       NaN  Canada   90.0
Charlie  22.0     USA   88.0
David    35.0  Canada    NaN
Edward    NaN     USA   95.0

After resetting the index:
      Name   Age Country  Score
0    Alice  24.0     USA   85.0
1      Bob   NaN  Canada   90.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA   95.0
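A common use of reset_index is renumbering rows after some have been removed. In that case drop=True discards the old index instead of turning it into a column. A minimal sketch, using the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, np.nan, 22, 35, np.nan],
    'Score': [85, 90, 88, np.nan, 95],
})

# dropna() keeps the original labels (0 and 2); reset_index(drop=True)
# replaces them with a fresh 0..n-1 range and does not add an 'index' column.
clean = df.dropna().reset_index(drop=True)
print(clean)
```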

11. Removing Duplicates

df = pd.DataFrame(data)
df.drop_duplicates(subset='Country', inplace=True)
print("After removing duplicate rows based on 'Country':")
print(df)

Output:

    Name   Age Country  Score
0  Alice  24.0     USA   85.0
1    Bob   NaN  Canada   90.0
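By default drop_duplicates keeps the first row of each duplicate group, which is why Alice and Bob survive above. The keep parameter changes that choice; the sketch below shows keep='last' and keep=False on the same data, with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
})

# Keep the last row seen for each country instead of the first.
last_per_country = df.drop_duplicates(subset='Country', keep='last')

# keep=False drops every row whose country appears more than once.
unique_only = df.drop_duplicates(subset='Country', keep=False)

print(last_per_country)
print(unique_only)
```

Because both countries appear more than once in this dataset, unique_only comes back empty.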

12. Applying a Function to a Column or Row

df = pd.DataFrame(data)
df['Score'] = df['Score'].apply(lambda x: x * 2)
print("After doubling the 'Score' values:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA  170.0
1      Bob   NaN  Canada  180.0
2  Charlie  22.0     USA  176.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA  190.0
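The example above applies a function to a single column. To apply a function to each row instead, call apply on the DataFrame with axis=1; the function then receives one row at a time as a Series. The helper name make_label and the Label column are hypothetical, chosen only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA'],
})

def make_label(row):
    # row is a Series indexed by column name.
    return f"{row['Name']} ({row['Country']})"

# axis=1 means "call the function once per row".
df['Label'] = df.apply(make_label, axis=1)
print(df)
```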

13. Applying a Function to Each Cell

df = pd.DataFrame(data)
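# Note: applymap was deprecated in pandas 2.1 in favor of DataFrame.map;
# on newer versions, use df.map(...) with the same lambda.
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)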
print("After converting all string values to uppercase:")
print(df)

Output:

      Name   Age Country  Score
0    ALICE  24.0     USA   85.0
1      BOB   NaN  CANADA   90.0
2  CHARLIE  22.0     USA   88.0
3    DAVID  35.0  CANADA    NaN
4   EDWARD   NaN     USA   95.0

14. Changing the Data Type of a Column

df = pd.DataFrame(data)
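# A plain 'int' column cannot hold NaN, so astype('int', errors='ignore')
# would silently leave 'Age' as float. The nullable 'Int64' dtype stores
# integers and marks missing values as <NA>.
df['Age'] = df['Age'].astype('Int64')
print("After changing the 'Age' column to nullable integers:")
print(df)

Output:

      Name   Age Country  Score
0    Alice    24     USA   85.0
1      Bob  <NA>  Canada   90.0
2  Charlie    22     USA   88.0
3    David    35  Canada    NaN
4   Edward  <NA>     USA   95.0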

15. Converting to Numeric with Errors Set to 'Coerce'

df = pd.DataFrame(data)
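# 'Age' is already numeric here, so this call leaves it unchanged;
# errors='coerce' only has an effect when a column holds non-numeric values.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')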
print("After converting 'Age' column to numeric, setting non-convertible values to NaN:")
print(df)

Output:

      Name   Age Country  Score
0    Alice  24.0     USA   85.0
1      Bob   NaN  Canada   90.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA   95.0
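Because the sample Age column is already numeric, the output above is identical to the original DataFrame. The effect of errors='coerce' is easier to see on a column of strings. The Series below is hypothetical input, of the kind you might get from a CSV file with inconsistent entries.

```python
import pandas as pd

# Ages read in as text, with two entries that are not numbers (made-up data).
raw_age = pd.Series(['24', 'unknown', '22', '35', 'n/a'])

# Parseable strings become numbers; everything else becomes NaN.
age = pd.to_numeric(raw_age, errors='coerce')
print(age)
```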

16. Renaming Columns

df = pd.DataFrame(data)
df.rename(columns={'Name': 'FullName'}, inplace=True)
print("After renaming 'Name' column to 'FullName':")
print(df)

Output:

  FullName   Age Country  Score
0    Alice  24.0     USA   85.0
1      Bob   NaN  Canada   90.0
2  Charlie  22.0     USA   88.0
3    David  35.0  Canada    NaN
4   Edward   NaN     USA   95.0
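rename also accepts a function in place of a dictionary, which is convenient when every column name needs the same transformation. A minimal sketch that lowercases all column names (the variable name lowered is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, np.nan],
    'Country': ['USA', 'Canada'],
    'Score': [85, 90],
})

# The callable is applied to each column name; without inplace=True,
# rename returns a new DataFrame and leaves df unchanged.
lowered = df.rename(columns=str.lower)
print(lowered.columns.tolist())
```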

17. Binning Data

df = pd.DataFrame(data)
df['Score_Binned'] = pd.cut(df['Score'], bins=3)
print("After binning the 'Score' column into 3 bins:")
print(df)

Output:

      Name   Age Country  Score      Score_Binned
0    Alice  24.0     USA   85.0   (84.99, 88.333]
1      Bob   NaN  Canada   90.0  (88.333, 91.667]
2  Charlie  22.0     USA   88.0   (84.99, 88.333]
3    David  35.0  Canada    NaN               NaN
4   Edward   NaN     USA   95.0    (91.667, 95.0]
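With bins=3, pandas picks three equal-width intervals from the observed minimum and maximum, so the edges change with the data. When the boundaries should be fixed, pd.cut also accepts explicit edges and labels. The edges and letter grades below are arbitrary choices for this sketch, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Score': [85, 90, 88, np.nan, 95],
})

# Intervals are right-inclusive by default: (0, 85], (85, 90], (90, 100].
df['Grade'] = pd.cut(df['Score'], bins=[0, 85, 90, 100], labels=['C', 'B', 'A'])
print(df)
```

A missing Score stays missing in the Grade column, just as in the equal-width example.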

These examples cover the most common ways to update a DataFrame in Pandas. Running them against your own datasets, and checking the output after each step, is a good way to get comfortable with data manipulation in Pandas.