Working with data

Computational biology is a field that involves the application of computers and computer science to the understanding and modeling of the structures and processes of life. It entails the use of computational methods (e.g., algorithms) for the representation and simulation of biological systems, as well as for the interpretation of experimental data, often on a very large scale.

The importance of data in computational biology cannot be overstated. With the emergence of large data sets across biomedicine, computational biologists can contribute not only to testing hypotheses, but also to exploring the data in such a way as to generate novel, unexpected, groundbreaking hypotheses that can subsequently be tested and validated. In other words, data is the foundation upon which computational biology is built. It is the raw material that computational biologists use to develop algorithms or models for understanding biological systems and relationships.

Relevant packages

NumPy is an open-source Python library that provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. It is a fundamental package for scientific computing with Python and is widely used in data science. NumPy is also the foundation upon which other libraries, such as pandas, are built.
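As a minimal sketch of what "multi-dimensional arrays" and "mathematical functions" look like in practice, the snippet below builds a small two-dimensional array and applies a few NumPy functions to it. The expression values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical expression values: rows are genes, columns are samples.
expression = np.array([
    [2.0, 4.0, 6.0],
    [1.0, 3.0, 5.0],
])

print(expression.shape)  # (2, 3): two rows, three columns
print(expression.ndim)   # 2: the array has two dimensions

# Mathematical functions operate on whole arrays at once.
gene_means = expression.mean(axis=1)   # mean across samples, one value per gene
log_expression = np.log2(expression)   # element-wise log2 transform
```

Here `axis=1` means "collapse the columns", so the result has one entry per row.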

The pandas library is a popular open-source data manipulation and analysis library built on top of NumPy. It provides a powerful set of tools for working with structured data, centered on two data structures: the Series (a one-dimensional labeled array) and the DataFrame (a two-dimensional labeled table). It is particularly well-suited for working with tabular data, such as spreadsheets or SQL tables. Its versatility and ease of use make it an essential tool for data analysts, scientists, and engineers working with structured data in Python.
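To make the tabular vocabulary concrete, here is a small sketch of a DataFrame and a few common operations on it. The table contents and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical table: one row per sample, one column per measurement.
samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "tissue": ["liver", "liver", "brain"],
        "read_count": [1200, 1500, 900],
    }
)

# Selecting a single column returns a Series.
read_counts = samples["read_count"]

# Boolean indexing keeps only the rows where the condition is True,
# much like a WHERE clause in SQL.
liver = samples[samples["tissue"] == "liver"]

# Group rows by a column and summarize each group.
mean_by_tissue = samples.groupby("tissue")["read_count"].mean()
```

Each of these operations returns a new object and leaves `samples` unchanged.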

NumPy and pandas are most often used together for data analysis and manipulation. NumPy arrays back much of the data stored in pandas objects, and pandas adds row and column labels along with tools for working with structured data. Together, these libraries cover most everyday data loading, cleaning, and analysis tasks; pandas also offers basic plotting, which by default relies on Matplotlib.
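The relationship between the two libraries can be sketched as a round trip: wrap a NumPy array in a DataFrame, apply a NumPy function to it, and convert back. The labels below (`gene_a`, `control`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd

# Start from a plain NumPy array.
values = np.array([[1.0, 4.0], [9.0, 16.0]])

# Wrap it in a DataFrame to attach row and column labels.
frame = pd.DataFrame(
    values,
    index=["gene_a", "gene_b"],
    columns=["control", "treated"],
)

# NumPy functions can be applied to a DataFrame; the labels are preserved.
rooted = np.sqrt(frame)

# Convert back to a plain NumPy array when labels are no longer needed.
back = frame.to_numpy()
```

Which form to use is mostly a question of convenience: labels help when selecting and joining data, while plain arrays are the common input for numerical routines.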

🎯 Expectations

We will primarily be using NumPy and pandas to load and manipulate data. Here, I just want you to be familiar with the basic notation, vocabulary, and procedures so you can navigate the package documentation.