{
"cells": [
{
"cell_type": "markdown",
"id": "fb3416a0",
"metadata": {},
"source": [
"# minilab: Ασκήσεις εξάσκησης - Αριθμητική Python και Προεπεξεργασία δεδομένων"
]
},
{
"cell_type": "markdown",
"id": "7a77089e",
"metadata": {},
"source": [
"## Η βιβλιοθήκη ``Pandas``"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "488b134b",
"metadata": {},
"outputs": [],
"source": [
"# Import the pandas module here once\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "cee5953a",
"metadata": {},
"source": [
"### Exercise 1.1: DataFrame Basics\n",
"\n",
"Do the following:\n",
"\n",
"Create an 3x4 (3 rows x 4 columns) Pandas DataFrame in which the columns are named ``Savvas``, ``Nikos``, ``Roman``, and ``Avinash``. Populate each of the 12 cells in the DataFrame with a **random integer** between 0 and 100 (inclusive).\n",
"\n",
"1. Output the entire DataFrame\n",
"\n",
"\n",
"2. Output the value in the cell of row \\#1 of the Roman column\n",
"\n",
"\n",
"3. Create a fifth column named ``Sums``, which is populated with the row-by-row sums of Roman and Avinash.\n",
"\n",
"> **Hint**: Use the [np.random.randint()](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) method to get random integers in an interval."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cf68f7bd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Savvas | \n",
" Nikos | \n",
" Roman | \n",
" Avinash | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 63 | \n",
" 19 | \n",
" 95 | \n",
" 55 | \n",
"
\n",
" \n",
" 1 | \n",
" 52 | \n",
" 77 | \n",
" 28 | \n",
" 26 | \n",
"
\n",
" \n",
" 2 | \n",
" 10 | \n",
" 13 | \n",
" 78 | \n",
" 58 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Savvas Nikos Roman Avinash\n",
"0 63 19 95 55\n",
"1 52 77 28 26\n",
"2 10 13 78 58"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One way to do this task is the following\n",
"import numpy as np\n",
"\n",
"data = {\n",
" \"Savvas\" : [np.random.randint(0, 101) for _ in range(3)],\n",
" \"Nikos\" : [np.random.randint(0, 101) for _ in range(3)],\n",
" \"Roman\" : [np.random.randint(0, 101) for _ in range(3)],\n",
" \"Avinash\" : [np.random.randint(0, 101) for _ in range(3)]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Output the entire DataFrame\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "285cdd35",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28\n",
"28\n",
"28\n"
]
}
],
"source": [
"# Output the value in the cell of row #1 of the Roman column\n",
"print(df.iloc[1, 2])\n",
"\n",
"# or..\n",
"print(df.loc[1, \"Roman\"])\n",
"\n",
"# or...\n",
"print(df['Roman'][1])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8d6cb48f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Savvas | \n",
" Nikos | \n",
" Roman | \n",
" Avinash | \n",
" Sums | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 63 | \n",
" 19 | \n",
" 95 | \n",
" 55 | \n",
" 150 | \n",
"
\n",
" \n",
" 1 | \n",
" 52 | \n",
" 77 | \n",
" 28 | \n",
" 26 | \n",
" 54 | \n",
"
\n",
" \n",
" 2 | \n",
" 10 | \n",
" 13 | \n",
" 78 | \n",
" 58 | \n",
" 136 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Savvas Nikos Roman Avinash Sums\n",
"0 63 19 95 55 150\n",
"1 52 77 28 26 54\n",
"2 10 13 78 58 136"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a fifth column named Sums, which is populated with the row-by-row sums of Roman and Avinash.\n",
"df[\"Sums\"] = df[\"Roman\"] + df[\"Avinash\"]\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a94f1733",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Savvas Nikos Roman Avinash\n",
"0 91 77 13 66\n",
"1 89 0 88 93\n",
"2 78 54 16 90\n",
"\n",
"Second row of the Roman column: 88\n",
"\n",
" Savvas Nikos Roman Avinash Sums\n",
"0 91 77 13 66 79\n",
"1 89 0 88 93 181\n",
"2 78 54 16 90 106\n"
]
}
],
"source": [
"# Another solution to this task is the following\n",
"\n",
"# Create a Python list that holds the names of the four columns.\n",
"my_column_names = ['Savvas', 'Nikos', 'Roman', 'Avinash']\n",
"\n",
"# Create a 3x4 numpy array, each cell populated with a random integer.\n",
"my_data = np.random.randint(low=0, high=101, size=(3, 4))\n",
"\n",
"# Create a DataFrame.\n",
"df = pd.DataFrame(data=my_data, columns=my_column_names)\n",
"\n",
"# Print the entire DataFrame\n",
"print(df)\n",
"\n",
"# Print the value in row #1 of the Roman column.\n",
"print(\"\\nSecond row of the Roman column: %d\\n\" % df['Roman'][1])\n",
"\n",
"# Create a column named Janet whose contents are the sum\n",
"# of two other columns.\n",
"df['Sums'] = df['Roman'] + df['Avinash']\n",
"\n",
"# Print the enhanced DataFrame\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"id": "0f17cd61",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "86f51098",
"metadata": {},
"source": [
"### Exercise 1.2: DataFrame Basics (cont.)\n",
"\n",
"Create a Pandas DataFrame with the following data:\n",
"\n",
"```python\n",
"data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],\n",
" 'Age': [25, 30, 22, 35],\n",
" 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}\n",
"\n",
"```\n",
"\n",
"1. Print the first 3 rows of the DataFrame.\n",
"\n",
"2. Calculate the mean age of the individuals."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9f63adfa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age City\n",
"0 Alice 25 New York\n",
"1 Bob 30 Los Angeles\n",
"2 Charlie 22 Chicago\n",
"Mean Age: 28.0\n"
]
}
],
"source": [
"data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],\n",
" 'Age': [25, 30, 22, 35],\n",
" 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}\n",
"\n",
"# Create the dataframe\n",
"df = pd.DataFrame(data)\n",
"\n",
"# 1. Print the first 3 rows\n",
"print(df.head(3))\n",
"\n",
"# 2. Calculate the mean age\n",
"mean_age = df['Age'].mean()\n",
"print(\"Mean Age:\", mean_age)"
]
},
{
"cell_type": "markdown",
"id": "27dc5d8c",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "c1489273",
"metadata": {},
"source": [
"### Exercise 2: Data Filtering\n",
"\n",
"Given the DataFrame from Exercise 1.2, write code to:\n",
"\n",
"1. Create a new DataFrame containing only individuals older than 25.\n",
"\n",
"2. Find and print the names of individuals from Chicago."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3dc9f5d0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Residents of Chicago: ['Charlie']\n"
]
}
],
"source": [
"# 1. Filter individuals older than 25\n",
"filtered_df = df[df['Age'] > 25]\n",
"\n",
"# 2. Find names of individuals from Chicago\n",
"chicago_residents = df[df['City'] == 'Chicago']['Name']\n",
"print(\"Residents of Chicago:\", chicago_residents.values)"
]
},
{
"cell_type": "markdown",
"id": "22364e06",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "13616083",
"metadata": {},
"source": [
"### Exercise 3: Data Sorting\n",
"\n",
"Given the DataFrame from Exercise 1.2, write code to:\n",
"\n",
"1. Sort the DataFrame by age in descending order.\n",
"\n",
"2. Sort the DataFrame by name in alphabetical order."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "13720e39",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sorted by Age:\n",
" Name Age City\n",
"3 David 35 Houston\n",
"1 Bob 30 Los Angeles\n",
"0 Alice 25 New York\n",
"2 Charlie 22 Chicago\n",
"Sorted by Name:\n",
" Name Age City\n",
"0 Alice 25 New York\n",
"1 Bob 30 Los Angeles\n",
"2 Charlie 22 Chicago\n",
"3 David 35 Houston\n"
]
}
],
"source": [
"# 1. Sort by age in descending order\n",
"sorted_by_age = df.sort_values(by='Age', ascending=False)\n",
"\n",
"# 2. Sort by name in alphabetical order\n",
"sorted_by_name = df.sort_values(by='Name')\n",
"\n",
"print(\"Sorted by Age:\\n\", sorted_by_age)\n",
"print(\"Sorted by Name:\\n\", sorted_by_name)"
]
},
{
"cell_type": "markdown",
"id": "80b19c53",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "78fd5068",
"metadata": {},
"source": [
"### Exercise 4: Using ``loc`` for Label-Based Indexing\n",
"\n",
"Given the following DataFrame df:\n",
"\n",
"```python\n",
"data = {'A': [1, 2, 3, 4, 5],\n",
" 'B': [10, 20, 30, 40, 50],\n",
" 'C': ['apple', 'banana', 'cherry', 'date', 'elderberry']}\n",
"\n",
"df = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])\n",
"```\n",
"\n",
"1. Use the ``loc`` indexer to select and print the row with the label 'three'.\n",
"\n",
"2. Use the ``loc`` indexer to select and print the values from columns 'A' and 'C' for the rows with labels 'two' and 'four'."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "263e3ca8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 1 | \n",
" 10 | \n",
" apple | \n",
"
\n",
" \n",
" two | \n",
" 2 | \n",
" 20 | \n",
" banana | \n",
"
\n",
" \n",
" three | \n",
" 3 | \n",
" 30 | \n",
" cherry | \n",
"
\n",
" \n",
" four | \n",
" 4 | \n",
" 40 | \n",
" date | \n",
"
\n",
" \n",
" five | \n",
" 5 | \n",
" 50 | \n",
" elderberry | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"one 1 10 apple\n",
"two 2 20 banana\n",
"three 3 30 cherry\n",
"four 4 40 date\n",
"five 5 50 elderberry"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = {'A': [1, 2, 3, 4, 5],\n",
" 'B': [10, 20, 30, 40, 50],\n",
" 'C': ['apple', 'banana', 'cherry', 'date', 'elderberry']}\n",
"\n",
"df = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c9bb14b1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Row with label 'three':\n",
" A 3\n",
"B 30\n",
"C cherry\n",
"Name: three, dtype: object\n"
]
}
],
"source": [
"# 1. Using loc to select and print the row with label 'three':\n",
"selected_row = df.loc['three']\n",
"\n",
"print(\"Row with label 'three':\\n\", selected_row)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "4baf4adf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Values from columns 'A' and 'C' for rows 'two' and 'four':\n",
" A C\n",
"two 2 banana\n",
"four 4 date\n"
]
}
],
"source": [
"# 2. Using loc to select and print values from columns 'A' and 'C' \n",
"# for the rows with labels 'two' and 'four':\n",
"\n",
"selected_values = df.loc[['two', 'four'], ['A', 'C']]\n",
"\n",
"print(\"Values from columns 'A' and 'C' for rows 'two' and 'four':\\n\", selected_values)"
]
},
{
"cell_type": "markdown",
"id": "be26d5f0",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "e86e3d27",
"metadata": {},
"source": [
"### Exercise 5: Using ``iloc`` for Integer-Based Indexing\n",
"\n",
"Given the following DataFrame df:\n",
"\n",
"```python\n",
"data = {'X': [11, 22, 33, 44, 55],\n",
" 'Y': [110, 220, 330, 440, 550],\n",
" 'Z': ['apple', 'banana', 'cherry', 'date', 'elderberry']}\n",
"\n",
"df = pd.DataFrame(data)\n",
"```\n",
"\n",
"1. Use the ``iloc`` indexer to select and print the first row.\n",
"\n",
"2. Use the ``iloc`` indexer to select and print the values from columns 'X' and 'Z' for the second and fourth rows."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c20e88d2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" X | \n",
" Y | \n",
" Z | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 11 | \n",
" 110 | \n",
" apple | \n",
"
\n",
" \n",
" 1 | \n",
" 22 | \n",
" 220 | \n",
" banana | \n",
"
\n",
" \n",
" 2 | \n",
" 33 | \n",
" 330 | \n",
" cherry | \n",
"
\n",
" \n",
" 3 | \n",
" 44 | \n",
" 440 | \n",
" date | \n",
"
\n",
" \n",
" 4 | \n",
" 55 | \n",
" 550 | \n",
" elderberry | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" X Y Z\n",
"0 11 110 apple\n",
"1 22 220 banana\n",
"2 33 330 cherry\n",
"3 44 440 date\n",
"4 55 550 elderberry"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = {'X': [11, 22, 33, 44, 55],\n",
" 'Y': [110, 220, 330, 440, 550],\n",
" 'Z': ['apple', 'banana', 'cherry', 'date', 'elderberry']}\n",
"\n",
"df = pd.DataFrame(data)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "4e07f685",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First row:\n",
" X 11\n",
"Y 110\n",
"Z apple\n",
"Name: 0, dtype: object\n"
]
}
],
"source": [
"# 1. Using iloc to select and print the first row:\n",
"first_row = df.iloc[0]\n",
"\n",
"print(\"First row:\\n\", first_row)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4ee4cba4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Values from columns 'X' and 'Z' for rows 2 and 4:\n",
" X Z\n",
"1 22 banana\n",
"3 44 date\n"
]
}
],
"source": [
"# 2. Using iloc to select and print values from columns 'X' and 'Z' for the second and fourth rows:\n",
"selected_values = df.iloc[[1, 3], [0, 2]]\n",
"\n",
"print(\"Values from columns 'X' and 'Z' for rows 2 and 4:\\n\", selected_values)"
]
},
{
"cell_type": "markdown",
"id": "89e4eeff",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "132dba49",
"metadata": {},
"source": [
"### Exercise 6: Data Import and Export\n",
"\n",
"1. Read a CSV file named \"employees.csv\", in the data directory, into a Pandas DataFrame.\n",
"\n",
"2. Export the DataFrame from part 1 to a new CSV file named \"output.csv\"."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c71c05d4",
"metadata": {},
"outputs": [],
"source": [
"# 1. Read CSV into DataFrame\n",
"df = pd.read_csv(\"../data/employees.csv\")\n",
"\n",
"# 2. Export DataFrame to CSV\n",
"df.to_csv(\"output.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "dc198406",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "6def4158",
"metadata": {},
"source": [
"### Exercise 7: Missing Data Handling\n",
"\n",
"Below you are given a DataFrame with missing values. \n",
"\n",
"```python\n",
"data_na = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n",
" 'Age': [25, 30, 22, np.nan, 29],\n",
" 'Score1': [85, 92, 78, 65, 88],\n",
" 'Score2': [90, 87, 75, 60, np.nan]}\n",
"```\n",
"\n",
"Write code to:\n",
"\n",
"1. Fill missing values in the \"Age\" column with the mean age.\n",
"\n",
"2. Remove rows with missing values."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2b05cd5e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Age | \n",
" Score1 | \n",
" Score2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Alice | \n",
" 25.0 | \n",
" 85 | \n",
" 90.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" 30.0 | \n",
" 92 | \n",
" 87.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Charlie | \n",
" 22.0 | \n",
" 78 | \n",
" 75.0 | \n",
"
\n",
" \n",
" 3 | \n",
" David | \n",
" NaN | \n",
" 65 | \n",
" 60.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Eve | \n",
" 29.0 | \n",
" 88 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Age Score1 Score2\n",
"0 Alice 25.0 85 90.0\n",
"1 Bob 30.0 92 87.0\n",
"2 Charlie 22.0 78 75.0\n",
"3 David NaN 65 60.0\n",
"4 Eve 29.0 88 NaN"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"data_na = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n",
" 'Age': [25, 30, 22, np.nan, 29],\n",
" 'Score1': [85, 92, 78, 65, 88],\n",
" 'Score2': [90, 87, 75, 60, np.nan]}\n",
"\n",
"\n",
"# Create the dataframe\n",
"df = pd.DataFrame(data_na)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "5bc93a75",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Age | \n",
" Score1 | \n",
" Score2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Alice | \n",
" 25.0 | \n",
" 85 | \n",
" 90.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" 30.0 | \n",
" 92 | \n",
" 87.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Charlie | \n",
" 22.0 | \n",
" 78 | \n",
" 75.0 | \n",
"
\n",
" \n",
" 3 | \n",
" David | \n",
" 26.5 | \n",
" 65 | \n",
" 60.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Eve | \n",
" 29.0 | \n",
" 88 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Age Score1 Score2\n",
"0 Alice 25.0 85 90.0\n",
"1 Bob 30.0 92 87.0\n",
"2 Charlie 22.0 78 75.0\n",
"3 David 26.5 65 60.0\n",
"4 Eve 29.0 88 NaN"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. Fill missing values in the \"Age\" column with the mean age\n",
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "7e6f78ca",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Age | \n",
" Score1 | \n",
" Score2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Alice | \n",
" 25.0 | \n",
" 85 | \n",
" 90.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" 30.0 | \n",
" 92 | \n",
" 87.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Charlie | \n",
" 22.0 | \n",
" 78 | \n",
" 75.0 | \n",
"
\n",
" \n",
" 3 | \n",
" David | \n",
" 26.5 | \n",
" 65 | \n",
" 60.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Age Score1 Score2\n",
"0 Alice 25.0 85 90.0\n",
"1 Bob 30.0 92 87.0\n",
"2 Charlie 22.0 78 75.0\n",
"3 David 26.5 65 60.0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2. Remove rows with missing values\n",
"df.dropna(inplace=True)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "2c3739a7",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "eccbfd40",
"metadata": {},
"source": [
"### Exercise 8: Adding New Columns\n",
"\n",
"Create a dictionary that contains the data in the following table. Then, convert this dictionary into a DataFrame. Use Hubble's law ($v = H_0 D$) to calculate the distance (in Mpc) of the galaxies and add it as a new column to the DataFrame.\n",
"\n",
"| Galaxy | Velocity (km/s) |\n",
"|-----------|-----------------|\n",
"| NGC 123 | 1320 |\n",
"| NGC 2342 | 5690 |\n",
"| NGC 4442 | 8200 |\n",
"\n",
"Remember that $H_0 \\simeq 70$ km/s/Mpc."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "6fa02b63",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Galaxy | \n",
" Velocity (km/s) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NGC 123 | \n",
" 1320 | \n",
"
\n",
" \n",
" 1 | \n",
" NGC 2342 | \n",
" 5690 | \n",
"
\n",
" \n",
" 2 | \n",
" NGC 4442 | \n",
" 8200 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Galaxy Velocity (km/s)\n",
"0 NGC 123 1320\n",
"1 NGC 2342 5690\n",
"2 NGC 4442 8200"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_data = {\n",
" \"Galaxy\" : [\"NGC 123\", \"NGC 2342\", \"NGC 4442\"],\n",
" \"Velocity (km/s)\" : [1320, 5690, 8200]\n",
"}\n",
"\n",
"df = pd.DataFrame(table_data)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "1de14302",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Galaxy | \n",
" Velocity (km/s) | \n",
" Distance (Mpc) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NGC 123 | \n",
" 1320 | \n",
" 18.857143 | \n",
"
\n",
" \n",
" 1 | \n",
" NGC 2342 | \n",
" 5690 | \n",
" 81.285714 | \n",
"
\n",
" \n",
" 2 | \n",
" NGC 4442 | \n",
" 8200 | \n",
" 117.142857 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Galaxy Velocity (km/s) Distance (Mpc)\n",
"0 NGC 123 1320 18.857143\n",
"1 NGC 2342 5690 81.285714\n",
"2 NGC 4442 8200 117.142857"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"H0 = 70 # Hubble's constant in km/s/Mpc\n",
"\n",
"df[\"Distance (Mpc)\"] = df[\"Velocity (km/s)\"] / H0\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "81404c2c",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "efee6df0",
"metadata": {},
"source": [
"### Exercise 9: Data Pre-processing with Pandas\n",
"\n",
"Suppose you have a dataset containing information about student performance, but it requires some basic pre-processing before analysis. You are provided with a DataFrame named ``students_df`` with the following columns: 'StudentID', 'Name', 'MathScore', 'ScienceScore', 'EnglishScore', and 'TotalScore'. \n",
"\n",
"The exercise involves data pre-processing tasks to clean and prepare the data for analysis.\n",
"\n",
"```python\n",
"data = {\n",
" 'StudentID': [1, 2, 3, 4, 5],\n",
" 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n",
" 'MathScore': [85, 92, 78, 65, 88],\n",
" 'ScienceScore': [90, 87, 75, 88, 92],\n",
" 'EnglishScore': [80, 75, 88, 79, 95],\n",
" 'TotalScore': [265, 152, 541, 202, 215]\n",
"}\n",
"\n",
"students_df = pd.DataFrame(data)\n",
"```\n",
"\n",
"**Your task**:\n",
"\n",
"1. The data in the 'TotalScore' column seems to be incorrect and needs to be recalculated as the sum of Math, Science, and English scores. Recalculate the 'TotalScore' column as the sum of MathScore, ScienceScore, and EnglishScore.\n",
"\n",
"2. Identify and handle any missing values (if present) in the dataset. Print the modified DataFrame.\n",
"\n",
"3. Calculate and add a new column 'AverageScore' that represents the average score of each student across subjects.\n",
"\n",
"4. Find and print the student with the highest 'AverageScore'.\n",
"\n",
"5. Identify and filter students who have scored 90 or above in at least one subject (Math, Science, or English).\n",
"\n",
"6. Print the DataFrame containing only the students who meet the criteria in part 5."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "b55f9f28",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentID | \n",
" Name | \n",
" MathScore | \n",
" ScienceScore | \n",
" EnglishScore | \n",
" TotalScore | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Alice | \n",
" 85 | \n",
" 90 | \n",
" 80 | \n",
" 265 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Bob | \n",
" 92 | \n",
" 87 | \n",
" 75 | \n",
" 152 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Charlie | \n",
" 78 | \n",
" 75 | \n",
" 88 | \n",
" 541 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" David | \n",
" 65 | \n",
" 88 | \n",
" 79 | \n",
" 202 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Eve | \n",
" 88 | \n",
" 92 | \n",
" 95 | \n",
" 215 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentID Name MathScore ScienceScore EnglishScore TotalScore\n",
"0 1 Alice 85 90 80 265\n",
"1 2 Bob 92 87 75 152\n",
"2 3 Charlie 78 75 88 541\n",
"3 4 David 65 88 79 202\n",
"4 5 Eve 88 92 95 215"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create the dataframe\n",
"data = {\n",
" 'StudentID': [1, 2, 3, 4, 5],\n",
" 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n",
" 'MathScore': [85, 92, 78, 65, 88],\n",
" 'ScienceScore': [90, 87, 75, 88, 92],\n",
" 'EnglishScore': [80, 75, 88, 79, 95],\n",
" 'TotalScore': [265, 152, 541, 202, 215]\n",
"}\n",
"\n",
"students_df = pd.DataFrame(data)\n",
"students_df"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "87e0d96f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentID | \n",
" Name | \n",
" MathScore | \n",
" ScienceScore | \n",
" EnglishScore | \n",
" TotalScore | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Alice | \n",
" 85 | \n",
" 90 | \n",
" 80 | \n",
" 255 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Bob | \n",
" 92 | \n",
" 87 | \n",
" 75 | \n",
" 254 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Charlie | \n",
" 78 | \n",
" 75 | \n",
" 88 | \n",
" 241 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" David | \n",
" 65 | \n",
" 88 | \n",
" 79 | \n",
" 232 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Eve | \n",
" 88 | \n",
" 92 | \n",
" 95 | \n",
" 275 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentID Name MathScore ScienceScore EnglishScore TotalScore\n",
"0 1 Alice 85 90 80 255\n",
"1 2 Bob 92 87 75 254\n",
"2 3 Charlie 78 75 88 241\n",
"3 4 David 65 88 79 232\n",
"4 5 Eve 88 92 95 275"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. Recalculating the 'TotalScore' column\n",
"students_df['TotalScore'] = students_df['MathScore'] + students_df['ScienceScore'] + students_df['EnglishScore']\n",
"students_df"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "f3028b0b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"StudentID 0\n",
"Name 0\n",
"MathScore 0\n",
"ScienceScore 0\n",
"EnglishScore 0\n",
"TotalScore 0\n",
"dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2. Identifying and handling missing values\n",
"students_df.isna().sum()"
]
},
{
"cell_type": "markdown",
"id": "953538e9",
"metadata": {},
"source": [
"The DataFrame does not contain any missing values, so we do not have to handle this."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "293f7b9e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentID | \n",
" Name | \n",
" MathScore | \n",
" ScienceScore | \n",
" EnglishScore | \n",
" TotalScore | \n",
" AverageScore | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Alice | \n",
" 85 | \n",
" 90 | \n",
" 80 | \n",
" 255 | \n",
" 85.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Bob | \n",
" 92 | \n",
" 87 | \n",
" 75 | \n",
" 254 | \n",
" 84.666667 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Charlie | \n",
" 78 | \n",
" 75 | \n",
" 88 | \n",
" 241 | \n",
" 80.333333 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" David | \n",
" 65 | \n",
" 88 | \n",
" 79 | \n",
" 232 | \n",
" 77.333333 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Eve | \n",
" 88 | \n",
" 92 | \n",
" 95 | \n",
" 275 | \n",
" 91.666667 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentID Name MathScore ScienceScore EnglishScore TotalScore \\\n",
"0 1 Alice 85 90 80 255 \n",
"1 2 Bob 92 87 75 254 \n",
"2 3 Charlie 78 75 88 241 \n",
"3 4 David 65 88 79 232 \n",
"4 5 Eve 88 92 95 275 \n",
"\n",
" AverageScore \n",
"0 85.000000 \n",
"1 84.666667 \n",
"2 80.333333 \n",
"3 77.333333 \n",
"4 91.666667 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3. Calculating and adding a new 'AverageScore' column\n",
"students_df['AverageScore'] = (\n",
" students_df['MathScore'] + \n",
" students_df['ScienceScore'] + \n",
" students_df['EnglishScore']\n",
") / 3\n",
"\n",
"students_df"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "2a3ae56a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Student with the highest AverageScore: Eve\n"
]
}
],
"source": [
"# 4. Finding the student with the highest 'AverageScore'\n",
"highest_avg_score_mask = students_df['AverageScore'] == students_df['AverageScore'].max()\n",
"\n",
"highest_avg_score_student = students_df[highest_avg_score_mask]['Name'].values[0]\n",
"\n",
"print(\"Student with the highest AverageScore:\", highest_avg_score_student)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "bb66fe2e",
"metadata": {},
"outputs": [],
"source": [
"# 5. Identifying and filtering students who have scored 90 or above in at least one subject\n",
"above_90_score_mask = (\n",
" (students_df['MathScore'] >= 90) | \n",
" (students_df['ScienceScore'] >= 90) | \n",
" (students_df['EnglishScore'] >= 90)\n",
")\n",
"\n",
"filtered_students = students_df[above_90_score_mask]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "5e992885",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Students who have scored 90 or above in at least one subject:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentID | \n",
" Name | \n",
" MathScore | \n",
" ScienceScore | \n",
" EnglishScore | \n",
" TotalScore | \n",
" AverageScore | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Alice | \n",
" 85 | \n",
" 90 | \n",
" 80 | \n",
" 255 | \n",
" 85.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Bob | \n",
" 92 | \n",
" 87 | \n",
" 75 | \n",
" 254 | \n",
" 84.666667 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Eve | \n",
" 88 | \n",
" 92 | \n",
" 95 | \n",
" 275 | \n",
" 91.666667 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentID Name MathScore ScienceScore EnglishScore TotalScore \\\n",
"0 1 Alice 85 90 80 255 \n",
"1 2 Bob 92 87 75 254 \n",
"4 5 Eve 88 92 95 275 \n",
"\n",
" AverageScore \n",
"0 85.000000 \n",
"1 84.666667 \n",
"4 91.666667 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 6. Printing the filtered DataFrame\n",
"print(\"Students who have scored 90 or above in at least one subject:\")\n",
"filtered_students"
]
},
{
"cell_type": "markdown",
"id": "aa426443",
"metadata": {},
"source": [
"---"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}