pandas ¶
DRAFT
This page is a work in progress and is subject to change at any moment.
Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.
For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame—a table, basically—then let you do things like:
- Calculate statistics and answer questions about the data, like
- What's the average, median, max, or min of each column?
- Does column A correlate with column B?
- What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, other file or database
Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best avenue through which to do that.
First things first, let's import NumPy and pandas.
import numpy as np
import pandas as pd
The two main components that pandas adds are the Series and the DataFrame.
Series ¶
A Series is a one-dimensional labeled array. In other words, it is like a single column in an Excel file.
First, let us create some synthetic data to use.
The provided data encapsulates a snapshot of gene expression levels and corresponding treatments in a biological study. Each gene, identified by a unique identifier such as "G1" or "G2", is associated with a specific expression level, denoted as a numerical value. Simultaneously, the treatment administered to each gene is recorded, categorizing the genes into groups like "Control", "TreatmentA", or "TreatmentB".
gene_id = np.array(["G1", "G2", "G3", "G4", "G5"])
expression_level = np.array([2.5, 1.8, 3.2, 2.0, 3.5])
treatment = np.array(["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"])
s_gene = pd.Series(gene_id)
print(s_gene)
0    G1
1    G2
2    G3
3    G4
4    G5
dtype: object
We can also relabel the index of each row with our own values.
s_gene_other = pd.Series(gene_id, index=["A", "B", "C", "D", "E"])
print(s_gene_other)
A    G1
B    G2
C    G3
D    G4
E    G5
dtype: object
However, we generally leave the index at its default integer values.
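Even so, labels are handy because they let you look up values by name instead of position. Here is a minimal sketch using the s_gene_other Series defined above; the position-based .iloc accessor shown for comparison is covered in more detail later.

print(s_gene_other["C"])     # label-based lookup -> G3
print(s_gene_other.iloc[2])  # position-based lookup -> also G3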
s_express = pd.Series(expression_level)
print(s_express)
0    2.5
1    1.8
2    3.2
3    2.0
4    3.5
dtype: float64
s_treatment = pd.Series(treatment)
print(s_treatment)
0       Control
1    TreatmentA
2    TreatmentB
3       Control
4    TreatmentA
dtype: object
DataFrame ¶
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
Creation ¶
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
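Here is a small sketch of that index-matching behavior, reusing the gene Series created in the previous section; the index values [0, 2, 4] are chosen purely for illustration.

# s_gene and s_express carry the default integer index 0-4.
# Passing an explicit index keeps only the matching rows; labels 1 and 3 are discarded.
df_subset = pd.DataFrame(
    data={"gene_id": s_gene, "expression_level": s_express},
    index=[0, 2, 4],
)
print(df_subset)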
First, let's use the series we created.
df = pd.DataFrame(
data={"gene_id": s_gene, "expression_level": s_express, "treatment": s_treatment}
)
print(df)
  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA
Each (key, value) item in data corresponds to a column in the resulting DataFrame. The index of this DataFrame was given to us on creation as the numbers 0 to 4.
It is annoying to have to create each Series first and then the DataFrame; instead, we can build the DataFrame directly from a dictionary of lists.
df = pd.DataFrame(
data={
"gene_id": ["G1", "G2", "G3", "G4", "G5"],
"expression_level": [2.5, 1.8, 3.2, 2.0, 3.5],
"treatment": ["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"],
}
)
print(df)
  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA
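The other input types listed above work much the same way. As a small sketch (not needed for the rest of this page), here is the same table built from a 2-D NumPy array plus explicit column labels.

# A 2-D array holding mixed types; columns are labeled explicitly.
raw = np.array(
    [
        ["G1", 2.5, "Control"],
        ["G2", 1.8, "TreatmentA"],
        ["G3", 3.2, "TreatmentB"],
        ["G4", 2.0, "Control"],
        ["G5", 3.5, "TreatmentA"],
    ],
    dtype=object,
)
df_from_array = pd.DataFrame(raw, columns=["gene_id", "expression_level", "treatment"])
print(df_from_array)

Note that a mixed-type array stores everything as Python objects, which is one reason the dictionary-of-lists form above is usually preferred.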
Loading ¶
It’s quite simple to load data from various file formats into a DataFrame.
CSV ¶
A CSV (Comma-Separated Values) file is a plain-text file format commonly used to store tabular data (numbers and text). Each line of the file corresponds to a row (record) of the table, and the values or fields within that row are separated by commas or, occasionally, another delimiter.
For example:
Name, Age, City
John, 25, New York
Alice, 30, Los Angeles
Bob, 22, Chicago
In this example, each line represents a person's information with columns for name, age, and city. The values in each column are separated by commas. CSV files are widely used for data exchange between different software applications and can be easily imported/exported in spreadsheet programs like Microsoft Excel or Google Sheets.
We use pd.read_csv to create a DataFrame from a CSV file.
file_url = "https://github.com/oasci-bc/python/raw/refs/heads/main/docs/files/csv/coimbra_breast_cancer_dataset.csv"
df_breast_cancer = pd.read_csv(file_url) # Note that we can directly load from a URL
print(df_breast_cancer)
     Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
0     48  23.500000       70    2.707  0.467409   8.8071     9.702400
1     83  20.690495       92    3.115  0.706897   8.8438     5.429285
2     82  23.124670       91    4.498  1.009651  17.9393    22.432040
3     68  21.367521       77    3.226  0.612725   9.8827     7.169560
4     86  21.111111       92    3.549  0.805386   6.6994     4.819240
..   ...        ...      ...      ...       ...      ...          ...
111   45  26.850000       92    3.330  0.755688  54.6800    12.100000
112   62  26.840000      100    4.530  1.117400  12.4500    21.420000
113   65  32.050000       97    5.730  1.370998  61.4800    22.540000
114   72  25.590000       82    2.820  0.570392  24.9600    33.750000
115   86  27.180000      138   19.910  6.777364  90.2800    14.110000

     Resistin    MCP.1  Classification
0     7.99585  417.114               1
1     4.06405  468.786               1
2     9.27715  554.697               1
3    12.76600  928.220               1
4    10.57635  773.920               1
..        ...      ...             ...
111  10.96000  268.230               2
112   7.32000  330.160               2
113  10.33000  314.050               2
114   3.27000  392.460               2
115   4.35000   90.090               2

[116 rows x 10 columns]
This CSV file presents a comprehensive exploration of clinical features observed or measured for 64 patients with breast cancer and 52 healthy controls. The dataset encompasses both quantitative attributes and corresponding labels for effective analysis and modeling.
Quantitative Attributes:
- Age (years): Age of the individuals.
- BMI (kg/m²): Body Mass Index, a measure of body fat based on weight and height.
- Glucose (mg/dL): Blood glucose levels, an important metabolic indicator.
- Insulin (µU/mL): Insulin levels, a hormone related to glucose regulation.
- HOMA: Homeostatic Model Assessment, a method for assessing insulin resistance and beta-cell function.
- Leptin (ng/mL): Leptin levels, a hormone involved in regulating appetite and energy balance.
- Adiponectin (µg/mL): Adiponectin levels, a protein associated with metabolic regulation.
- Resistin (ng/mL): Resistin levels, a protein implicated in insulin resistance.
- MCP-1 (pg/dL): Monocyte Chemoattractant Protein-1, a cytokine involved in inflammation.
Classification (i.e., labels):
- 1: Healthy controls
- 2: Patients with breast cancer
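Writing data back out is just as easy: DataFrame.to_csv saves a DataFrame to disk. A minimal sketch follows; the output file name here is arbitrary.

# Write the DataFrame back out; index=False skips writing the row labels.
df_breast_cancer.to_csv("breast_cancer_copy.csv", index=False)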
Other ¶
You can also read from Excel, JSON, XML, and more. However, we will normally use CSV files.
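The calls look almost identical across formats. The sketch below is only illustrative: the file names are placeholders, and reading Excel files additionally requires an engine such as openpyxl to be installed.

# Each reader returns a DataFrame (file names below are placeholders).
# df_xlsx = pd.read_excel("measurements.xlsx")  # requires openpyxl
# df_json = pd.read_json("measurements.json")
# df_xml = pd.read_xml("measurements.xml")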
To get a quick look at the loaded data, DataFrame.head() shows the first few rows and DataFrame.tail() shows the last few.
print(df_breast_cancer.head(n=2))
   Age        BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
0   48  23.500000       70    2.707  0.467409  8.8071     9.702400   7.99585
1   83  20.690495       92    3.115  0.706897  8.8438     5.429285   4.06405

     MCP.1  Classification
0  417.114               1
1  468.786               1
print(df_breast_cancer.tail(n=3))
     Age    BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
113   65  32.05       97     5.73  1.370998   61.48        22.54     10.33
114   72  25.59       82     2.82  0.570392   24.96        33.75      3.27
115   86  27.18      138    19.91  6.777364   90.28        14.11      4.35

      MCP.1  Classification
113  314.05               2
114  392.46               2
115   90.09               2
info ¶
The DataFrame.info() method is used to display a concise summary of a DataFrame, including information about the data types, memory usage, and the number of non-null (i.e., not missing) values in each column. It provides a quick overview of the structure and properties of the DataFrame.
print(df_breast_cancer.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             116 non-null    int64
 1   BMI             116 non-null    float64
 2   Glucose         116 non-null    int64
 3   Insulin         116 non-null    float64
 4   HOMA            116 non-null    float64
 5   Leptin          116 non-null    float64
 6   Adiponectin     116 non-null    float64
 7   Resistin        116 non-null    float64
 8   MCP.1           116 non-null    float64
 9   Classification  116 non-null    int64
dtypes: float64(7), int64(3)
memory usage: 9.2 KB
None
Here's what the DataFrame.info() method typically shows:
- The total number of entries (rows) in the DataFrame.
- The number of non-null values for each column.
- The data type of each column.
- The memory usage of the DataFrame.
This information is helpful for understanding the composition of your DataFrame, identifying missing values, and assessing memory usage, which can be useful in optimizing memory usage for large datasets.
If there were missing values, you would see something like this.
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1000 non-null int64
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
Notice how Revenue (Millions) and Metascore have lower non-null counts than the other columns, indicating missing values in those columns.
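A quick way to quantify missing data directly, rather than eyeballing the non-null counts, is to count the nulls per column. A minimal sketch using our breast cancer DataFrame, which happens to have no missing values:

# Count missing values in each column; every count is 0 for this dataset.
print(df_breast_cancer.isna().sum())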
shape ¶
The DataFrame.shape attribute is used to get the dimensions of a DataFrame.
It returns a tuple representing the number of rows and columns in the DataFrame.
The tuple has two elements: the first element is the number of rows, and the second element is the number of columns.
print(df_breast_cancer.shape)
(116, 10)
This is a quick and convenient way to determine the size of your DataFrame. It's useful for understanding the structure of your data and can be particularly handy when working with large datasets.
Indexing and selecting ¶
The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.
.loc ¶
.loc allows you to access a group of rows and columns by label(s) or a boolean array. This works very similarly to NumPy slicing, with the extra step that you specify both the rows and the columns you want. For example, to get all rows and the "Glucose" column, we would do the following.
print(df_breast_cancer.loc[:, "Glucose"])
0       70
1       92
2       91
3       77
4       92
      ...
111     92
112    100
113     97
114     82
115    138
Name: Glucose, Length: 116, dtype: int64
If we wanted only rows 7 to 10:
print(df_breast_cancer.loc[7:10, "Glucose"])
7     118
8      97
9      83
10     78
Name: Glucose, dtype: int64
Warning
Pandas includes the stop index when slicing with .loc, which is different from most of Python. For example, we usually would not include the row with index 10 when slicing a NumPy array, but pandas did. Why? Because .loc slices by label rather than by position, and with arbitrary labels there is no obvious "one past the end" label to stop at, so pandas makes label slices inclusive of both endpoints.
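To make the difference concrete, here is a small comparison sketch between position-based NumPy slicing and label-based .loc slicing on the same column.

arr = df_breast_cancer["Glucose"].to_numpy()
print(arr[7:10])                              # NumPy: positions 7, 8, 9 -> 3 values
print(df_breast_cancer.loc[7:10, "Glucose"])  # pandas .loc: labels 7 through 10 -> 4 values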
We can also select multiple columns in any order we request.
print(df_breast_cancer.loc[:, ["Glucose", "Age"]])
     Glucose  Age
0         70   48
1         92   83
2         91   82
3         77   68
4         92   86
..       ...  ...
111       92   45
112      100   62
113       97   65
114       82   72
115      138   86

[116 rows x 2 columns]
.iloc ¶
.iloc is the integer-position-based counterpart to .loc: you select rows and columns by their position rather than their label. For example, we can pull out the rows at positions 0, 3, and 4.
print(df_breast_cancer.iloc[[0, 3, 4]])
   Age        BMI  Glucose  Insulin      HOMA  Leptin  Adiponectin  Resistin  \
0   48  23.500000       70    2.707  0.467409  8.8071      9.70240   7.99585
3   68  21.367521       77    3.226  0.612725  9.8827      7.16956  12.76600
4   86  21.111111       92    3.549  0.805386  6.6994      4.81924  10.57635

     MCP.1  Classification
0  417.114               1
3  928.220               1
4  773.920               1
boolean ¶
Another common operation is the use of boolean vectors to filter the data.
The operators are | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
Using a boolean vector to index a Series works exactly as in a NumPy ndarray. For example, suppose I only want data for patients that are younger than 35.
mask = df_breast_cancer["Age"] < 35
print(mask)
0      False
1      False
2      False
3      False
4      False
       ...
111    False
112    False
113    False
114    False
115    False
Name: Age, Length: 116, dtype: bool
Now I can use this mask to get only the rows where the mask is True.
print(df_breast_cancer[mask])
    Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
10   34  21.470000       78    3.469  0.667436  14.5700    13.110000
11   29  23.010000       82    5.663  1.145436  35.5900    26.720000
12   25  22.860000       82    4.090  0.827271  20.4500    23.670000
13   24  18.670000       88    6.107  1.330000   8.8800    36.060000
19   32  36.512637       87   14.026  3.009980  49.3727     5.100000
21   34  31.975015       87    4.530  0.972138  28.7502     7.642760
22   29  32.270788       84    5.810  1.203832  45.6196     6.209635
32   28  35.855815       87    8.576  1.840410  68.5102     4.794200
55   34  24.242424       92   21.699  4.924226  16.7353    21.823745

    Resistin    MCP.1  Classification
10   6.92000  354.600               1
11   4.58000  174.800               1
12   5.14000  313.730               1
13   6.85000  632.220               1
19  17.10223  588.460               1
21   5.62592  572.783               1
22  24.60330  904.981               1
32  21.44366  358.624               1
55  12.06534  481.949               2
Now, suppose I want patients who are younger than 35 that had breast cancer.
I do not need to define a mask variable to slice the DataFrame; I can do it all in one go.
print(
df_breast_cancer[
(df_breast_cancer["Age"] < 35) & (df_breast_cancer["Classification"] == 2)
]
)
    Age        BMI  Glucose  Insulin      HOMA   Leptin  Adiponectin  \
55   34  24.242424       92   21.699  4.924226  16.7353    21.823745

    Resistin    MCP.1  Classification
55  12.06534  481.949               2
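The | and ~ operators work the same way. Here is a minimal sketch: the first filter keeps patients who are either younger than 25 or older than 85, and the second rewrites the previous healthy/cancer filter using ~ ("not") instead of == 2.

# Either condition can be true; note the parentheses around each comparison.
print(df_breast_cancer[(df_breast_cancer["Age"] < 25) | (df_breast_cancer["Age"] > 85)])

# ~ negates the boolean mask, so this is equivalent to Classification == 2.
print(
    df_breast_cancer[
        (df_breast_cancer["Age"] < 35) & ~(df_breast_cancer["Classification"] == 1)
    ]
)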
Concatenating objects ¶
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
Note that I say “if any” because there is only a single possible axis of concatenation for Series.
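As a minimal sketch, here is concat() applied to the small gene-expression DataFrame from earlier, first stacking rows (along axis 0, the default) and then rebuilding a fresh index for the result.

# Stack df on top of itself along the row axis.
df_doubled = pd.concat([df, df])
print(df_doubled.shape)  # (10, 3): the original row labels 0-4 now each appear twice

# ignore_index=True rebuilds a clean 0..n-1 index for the result.
df_doubled = pd.concat([df, df], ignore_index=True)
print(df_doubled)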
Duplicates ¶
This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows. To demonstrate, let's double up our small gene-expression DataFrame by appending it to itself, as shown in the sketch below.
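A minimal sketch of that check, using concat() from the previous section together with duplicated() and drop_duplicates():

# Double up the DataFrame so every row appears twice.
df_doubled = pd.concat([df, df], ignore_index=True)

# duplicated() flags rows that are exact repeats of an earlier row.
print(df_doubled.duplicated().sum())       # 5 rows are repeats

# drop_duplicates() returns a DataFrame with the repeated rows removed.
print(df_doubled.drop_duplicates().shape)  # (5, 3): back to the original five rows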
Acknowledgements ¶
This content was adapted from the following sources: