pandas ¶
DRAFT
This page is a work in progress and is subject to change at any moment.
Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.
For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame—a table, basically—then let you do things like:
- Calculate statistics and answer questions about the data, like
- What's the average, median, max, or min of each column?
- Does column A correlate with column B?
- What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, other file or database
Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best avenue through which to do that.
First things first, let's import NumPy and pandas.
import numpy as np
import pandas as pd
The two main components that pandas adds are the Series and the DataFrame.
Series ¶
A Series is a one-dimensional labeled array. In other words, it is like a single column in an Excel file.
First, let us create some synthetic data to use.
The provided data encapsulates a snapshot of gene expression levels and corresponding treatments in a biological study. Each gene, identified by a unique identifier such as "G1" or "G2", is associated with a specific expression level, denoted as a numerical value. Simultaneously, the treatment administered to each gene is recorded, categorizing the genes into groups like "Control", "TreatmentA", or "TreatmentB".
gene_id = np.array(["G1", "G2", "G3", "G4", "G5"])
expression_level = np.array([2.5, 1.8, 3.2, 2.0, 3.5])
treatment = np.array(["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"])
s_gene = pd.Series(gene_id)
print(s_gene)
0    G1
1    G2
2    G3
3    G4
4    G5
dtype: object
We can also relabel the index of each row with our own values.
s_gene_other = pd.Series(gene_id, index=["A", "B", "C", "D", "E"])
print(s_gene_other)
A    G1
B    G2
C    G3
D    G4
E    G5
dtype: object
However, we generally leave the index at its default integer values.
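Even so, labels are handy because they let you look up values by name instead of position. Here is a minimal sketch using the s_gene_other Series defined above; the position-based .iloc accessor shown for comparison is covered in more detail later.

print(s_gene_other["C"])     # label-based lookup -> G3
print(s_gene_other.iloc[2])  # position-based lookup -> also G3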
s_express = pd.Series(expression_level)
print(s_express)
0    2.5
1    1.8
2    3.2
3    2.0
4    3.5
dtype: float64
s_treatment = pd.Series(treatment)
print(s_treatment)
0       Control
1    TreatmentA
2    TreatmentB
3       Control
4    TreatmentA
dtype: object
DataFrame ¶
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
Creation ¶
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
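Here is a small sketch of that index-matching behavior, reusing the gene Series created in the previous section; the index values [0, 2, 4] are chosen purely for illustration.

# s_gene and s_express carry the default integer index 0-4.
# Passing an explicit index keeps only the matching rows; labels 1 and 3 are discarded.
df_subset = pd.DataFrame(
    data={"gene_id": s_gene, "expression_level": s_express},
    index=[0, 2, 4],
)
print(df_subset)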
First, let's use the series we created.
df = pd.DataFrame(
data={"gene_id": s_gene, "expression_level": s_express, "treatment": s_treatment}
)
print(df)
  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA
Each (key, value) item in data corresponds to a column in the resulting DataFrame. The index of this DataFrame was given to us on creation as the numbers 0 to 4.
It is annoying to have to create each Series first and then the DataFrame; instead, we can build the DataFrame directly from a dictionary of lists.
df = pd.DataFrame(
data={
"gene_id": ["G1", "G2", "G3", "G4", "G5"],
"expression_level": [2.5, 1.8, 3.2, 2.0, 3.5],
"treatment": ["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"],
}
)
print(df)
  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA
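The other input types listed above work much the same way. As a small sketch (not needed for the rest of this page), here is the same table built from a 2-D NumPy array plus explicit column labels.

# A 2-D array holding mixed types; columns are labeled explicitly.
raw = np.array(
    [
        ["G1", 2.5, "Control"],
        ["G2", 1.8, "TreatmentA"],
        ["G3", 3.2, "TreatmentB"],
        ["G4", 2.0, "Control"],
        ["G5", 3.5, "TreatmentA"],
    ],
    dtype=object,
)
df_from_array = pd.DataFrame(raw, columns=["gene_id", "expression_level", "treatment"])
print(df_from_array)

Note that a mixed-type array stores everything as Python objects, which is one reason the dictionary-of-lists form above is usually preferred.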
Loading ¶
It’s quite simple to load data from various file formats into a DataFrame.
CSV ¶
A CSV (Comma-Separated Values) file is a plain-text file format commonly used to store tabular data (numbers and text). Each line of the file corresponds to a row (record) of the table, and the values or fields within that row are separated by commas or, occasionally, another delimiter.
For example:
Name, Age, City
John, 25, New York
Alice, 30, Los Angeles
Bob, 22, Chicago
In this example, each line represents a person's information with columns for name, age, and city. The values in each column are separated by commas. CSV files are widely used for data exchange between different software applications and can be easily imported/exported in spreadsheet programs like Microsoft Excel or Google Sheets.
We use pd.read_csv to create a DataFrame from a CSV file.
file_url = "https://github.com/oasci-bc/python/raw/refs/heads/main/docs/files/csv/coimbra_breast_cancer_dataset.csv"
df_breast_cancer = pd.read_csv(file_url) # Note that we can directly load from a URL
print(df_breast_cancer)
     Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
0     48  23.500000       70    2.707  0.467409   8.8071     9.702400
1     83  20.690495       92    3.115  0.706897   8.8438     5.429285
2     82  23.124670       91    4.498  1.009651  17.9393    22.432040
3     68  21.367521       77    3.226  0.612725   9.8827     7.169560
4     86  21.111111       92    3.549  0.805386   6.6994     4.819240
..   ...        ...      ...      ...       ...      ...          ...
111   45  26.850000       92    3.330  0.755688  54.6800    12.100000
112   62  26.840000      100    4.530  1.117400  12.4500    21.420000
113   65  32.050000       97    5.730  1.370998  61.4800    22.540000
114   72  25.590000       82    2.820  0.570392  24.9600    33.750000
115   86  27.180000      138   19.910  6.777364  90.2800    14.110000

     Resistin    MCP.1  Classification
0     7.99585  417.114               1
1     4.06405  468.786               1
2     9.27715  554.697               1
3    12.76600  928.220               1
4    10.57635  773.920               1
..        ...      ...             ...
111  10.96000  268.230               2
112   7.32000  330.160               2
113  10.33000  314.050               2
114   3.27000  392.460               2
115   4.35000   90.090               2

[116 rows x 10 columns]
This CSV file presents a comprehensive exploration of clinical features observed or measured for 64 patients with breast cancer and 52 healthy controls. The dataset encompasses both quantitative attributes and corresponding labels for effective analysis and modeling.
Quantitative Attributes:
- Age (years): Age of the individuals.
- BMI (kg/m²): Body Mass Index, a measure of body fat based on weight and height.
- Glucose (mg/dL): Blood glucose levels, an important metabolic indicator.
- Insulin (µU/mL): Insulin levels, a hormone related to glucose regulation.
- HOMA: Homeostatic Model Assessment, a method for assessing insulin resistance and beta-cell function.
- Leptin (ng/mL): Leptin levels, a hormone involved in regulating appetite and energy balance.
- Adiponectin (µg/mL): Adiponectin levels, a protein associated with metabolic regulation.
- Resistin (ng/mL): Resistin levels, a protein implicated in insulin resistance.
- MCP-1 (pg/dL): Monocyte Chemoattractant Protein-1, a cytokine involved in inflammation.
Classification (i.e., labels):
- 1: Healthy controls
- 2: Patients with breast cancer
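Writing data back out is just as easy: DataFrame.to_csv saves a DataFrame to disk. A minimal sketch follows; the output file name here is arbitrary.

# Write the DataFrame back out; index=False skips writing the row labels.
df_breast_cancer.to_csv("breast_cancer_copy.csv", index=False)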
Other ¶
You can also read from Excel, JSON, XML, and more. However, we will normally use CSV files.
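The calls look almost identical across formats. The sketch below is only illustrative: the file names are placeholders, and reading Excel files additionally requires an engine such as openpyxl to be installed.

# Each reader returns a DataFrame (file names below are placeholders).
# df_xlsx = pd.read_excel("measurements.xlsx")  # requires openpyxl
# df_json = pd.read_json("measurements.json")
# df_xml = pd.read_xml("measurements.xml")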
To get a quick look at the loaded data, DataFrame.head() shows the first few rows and DataFrame.tail() shows the last few.
print(df_breast_cancer.head(n=2))
   Age        BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
0   48  23.500000       70    2.707  0.467409  8.8071     9.702400   7.99585
1   83  20.690495       92    3.115  0.706897  8.8438     5.429285   4.06405

     MCP.1  Classification
0  417.114               1
1  468.786               1
print(df_breast_cancer.tail(n=3))
     Age    BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
113   65  32.05       97     5.73  1.370998   61.48        22.54     10.33
114   72  25.59       82     2.82  0.570392   24.96        33.75      3.27
115   86  27.18      138    19.91  6.777364   90.28        14.11      4.35

      MCP.1  Classification
113  314.05               2
114  392.46               2
115   90.09               2
info ¶
The DataFrame.info() method is used to display a concise summary of a DataFrame, including information about the data types, memory usage, and the number of non-null (i.e., not missing) values in each column. It provides a quick overview of the structure and properties of the DataFrame.
print(df_breast_cancer.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             116 non-null    int64
 1   BMI             116 non-null    float64
 2   Glucose         116 non-null    int64
 3   Insulin         116 non-null    float64
 4   HOMA            116 non-null    float64
 5   Leptin          116 non-null    float64
 6   Adiponectin     116 non-null    float64
 7   Resistin        116 non-null    float64
 8   MCP.1           116 non-null    float64
 9   Classification  116 non-null    int64
dtypes: float64(7), int64(3)
memory usage: 9.2 KB
None
Here's what the DataFrame.info() method typically shows:
- The total number of entries (rows) in the DataFrame.
- The number of non-null values for each column.
- The data type of each column.
- The memory usage of the DataFrame.
This information is helpful for understanding the composition of your DataFrame, identifying missing values, and assessing memory usage, which can be useful in optimizing memory usage for large datasets.
If there were missing values, you would see something like this.
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1000 non-null int64
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
Notice how Revenue (Millions) and Metascore have lower non-null counts than the other columns, indicating missing values in those columns.
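A quick way to quantify missing data directly, rather than eyeballing the non-null counts, is to count the nulls per column. A minimal sketch using our breast cancer DataFrame, which happens to have no missing values:

# Count missing values in each column; every count is 0 for this dataset.
print(df_breast_cancer.isna().sum())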
shape ¶
The DataFrame.shape attribute is used to get the dimensions of a DataFrame.
It returns a tuple representing the number of rows and columns in the DataFrame.
The tuple has two elements: the first element is the number of rows, and the second element is the number of columns.
print(df_breast_cancer.shape)
(116, 10)
This is a quick and convenient way to determine the size of your DataFrame. It's useful for understanding the structure of your data and can be particularly handy when working with large datasets.
Indexing and selecting ¶
The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.
.loc ¶
.loc allows you to access a group of rows and columns by label(s) or a boolean array. This works very similarly to NumPy slicing, with the extra step that you specify both the rows and the columns you want. For example, to get all rows and the "Glucose" column, we would do the following.
print(df_breast_cancer.loc[:, "Glucose"])
0       70
1       92
2       91
3       77
4       92
      ...
111     92
112    100
113     97
114     82
115    138
Name: Glucose, Length: 116, dtype: int64
If we wanted only rows 7 to 10:
print(df_breast_cancer.loc[7:10, "Glucose"])
7     118
8      97
9      83
10     78
Name: Glucose, dtype: int64
Warning
Pandas includes the stop index when slicing with .loc, which is different from most of Python. For example, we usually would not include the row with index 10 when slicing a NumPy array, but pandas did. Why? Because .loc slices by label rather than by position, and with arbitrary labels there is no obvious "one past the end" label to stop at, so pandas makes label slices inclusive of both endpoints.
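To make the difference concrete, here is a small comparison sketch between position-based NumPy slicing and label-based .loc slicing on the same column.

arr = df_breast_cancer["Glucose"].to_numpy()
print(arr[7:10])                              # NumPy: positions 7, 8, 9 -> 3 values
print(df_breast_cancer.loc[7:10, "Glucose"])  # pandas .loc: labels 7 through 10 -> 4 values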
We can also select multiple columns in any order we request.
print(df_breast_cancer.loc[:, ["Glucose", "Age"]])
     Glucose  Age
0         70   48
1         92   83
2         91   82
3         77   68
4         92   86
..       ...  ...
111       92   45
112      100   62
113       97   65
114       82   72
115      138   86

[116 rows x 2 columns]
.iloc ¶
.iloc is the integer-position-based counterpart to .loc: you select rows and columns by their position rather than their label. For example, we can pull out the rows at positions 0, 3, and 4.
print(df_breast_cancer.iloc[[0, 3, 4]])
   Age        BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
0   48  23.500000       70    2.707  0.467409  8.8071      9.70240   7.99585
3   68  21.367521       77    3.226  0.612725  9.8827      7.16956  12.76600
4   86  21.111111       92    3.549  0.805386  6.6994      4.81924  10.57635

     MCP.1  Classification
0  417.114               1
3  928.220               1
4  773.920               1
boolean ¶
Another common operation is the use of boolean vectors to filter the data.
The operators are | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
Using a boolean vector to index a Series works exactly as in a NumPy ndarray. For example, suppose I only want data for patients that are younger than 35.
mask = df_breast_cancer["Age"] < 35
print(mask)
0      False
1      False
2      False
3      False
4      False
       ...
111    False
112    False
113    False
114    False
115    False
Name: Age, Length: 116, dtype: bool
Now I can use this mask to get only the rows where the mask is True.
print(df_breast_cancer[mask])
    Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
10   34  21.470000       78    3.469  0.667436  14.5700    13.110000
11   29  23.010000       82    5.663  1.145436  35.5900    26.720000
12   25  22.860000       82    4.090  0.827271  20.4500    23.670000
13   24  18.670000       88    6.107  1.330000   8.8800    36.060000
19   32  36.512637       87   14.026  3.009980  49.3727     5.100000
21   34  31.975015       87    4.530  0.972138  28.7502     7.642760
22   29  32.270788       84    5.810  1.203832  45.6196     6.209635
32   28  35.855815       87    8.576  1.840410  68.5102     4.794200
55   34  24.242424       92   21.699  4.924226  16.7353    21.823745

    Resistin    MCP.1  Classification
10   6.92000  354.600               1
11   4.58000  174.800               1
12   5.14000  313.730               1
13   6.85000  632.220               1
19  17.10223  588.460               1
21   5.62592  572.783               1
22  24.60330  904.981               1
32  21.44366  358.624               1
55  12.06534  481.949               2
Now, suppose I want patients who are younger than 35 that had breast cancer.
I do not need to define a mask variable to slice the DataFrame; I can do it all in one go.
print(
df_breast_cancer[
(df_breast_cancer["Age"] < 35) & (df_breast_cancer["Classification"] == 2)
]
)
    Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
55   34  24.242424       92   21.699  4.924226  16.7353    21.823745

    Resistin    MCP.1  Classification
55  12.06534  481.949               2
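The | and ~ operators work the same way. Here is a minimal sketch: the first filter keeps patients who are either younger than 25 or older than 85, and the second rewrites the previous healthy/cancer filter using ~ ("not") instead of == 2.

# Either condition can be true; note the parentheses around each comparison.
print(df_breast_cancer[(df_breast_cancer["Age"] < 25) | (df_breast_cancer["Age"] > 85)])

# ~ negates the boolean mask, so this is equivalent to Classification == 2.
print(
    df_breast_cancer[
        (df_breast_cancer["Age"] < 35) & ~(df_breast_cancer["Classification"] == 1)
    ]
)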
Concatenating objects ¶
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
Note that I say “if any” because there is only a single possible axis of concatenation for Series.
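As a minimal sketch, here is concat() applied to the small gene-expression DataFrame from earlier, first stacking rows (along axis 0, the default) and then rebuilding a fresh index for the result.

# Stack df on top of itself along the row axis.
df_doubled = pd.concat([df, df])
print(df_doubled.shape)  # (10, 3): the original row labels 0-4 now each appear twice

# ignore_index=True rebuilds a clean 0..n-1 index for the result.
df_doubled = pd.concat([df, df], ignore_index=True)
print(df_doubled)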
Duplicates ¶
This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows. To demonstrate, let's double up our small gene-expression DataFrame by appending it to itself, as shown in the sketch below.
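A minimal sketch of that check, using concat() from the previous section together with duplicated() and drop_duplicates():

# Double up the DataFrame so every row appears twice.
df_doubled = pd.concat([df, df], ignore_index=True)

# duplicated() flags rows that are exact repeats of an earlier row.
print(df_doubled.duplicated().sum())       # 5 rows are repeats

# drop_duplicates() returns a DataFrame with the repeated rows removed.
print(df_doubled.drop_duplicates().shape)  # (5, 3): back to the original five rows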
Acknowledgements ¶
This content was adapted from the following sources: